The Outage has begun

Message boards : Number crunching : The Outage has begun


Robert Ribbeck
Joined: 7 Jun 02
Posts: 644
Credit: 5,283,174
RAC: 0
United States
Message 1024568 - Posted: 13 Aug 2010, 15:08:44 UTC - in response to Message 1024561.  

Uploads have restarted
ID: 1024568
Profile Odan

Joined: 8 May 03
Posts: 91
Credit: 15,331,177
RAC: 0
United Kingdom
Message 1024570 - Posted: 13 Aug 2010, 15:15:31 UTC - in response to Message 1024568.  

Do you sit there all day waiting for it to start or stop? :)
ID: 1024570
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1024573 - Posted: 13 Aug 2010, 15:23:27 UTC - in response to Message 1024561.  

It's part of the newer "intelligent" BOINC clients V6.10.5x and maybe a little bit earlier. If a unit is past its return date it automatically deletes it, even if it's complete and ready to upload. It's a bleeding PIA, especially when you only have 3 days or so when returns are accepted by the project servers.

As I have said before, the BOINC client is getting too "smart" for its own good.

T.A.

Really? Must check that out - if so, that's really going too far.

I know the client now aborts work which hasn't even started before deadline, and at least warns, and suggests aborting, work that hasn't completed - but aborting work that's finished and held up at the report stage? What's the point in that?
ID: 1024573
Walt Bennett

Joined: 20 Aug 99
Posts: 1
Credit: 1,009,047
RAC: 0
United States
Message 1024578 - Posted: 13 Aug 2010, 15:40:59 UTC - in response to Message 1024204.  

According to my notebook, they've been down for two days. It used to download enough tasks to keep crunching through the outages, but lately it'll only get ten or so and just sits there using up electricity until they become available again. Not cool.
ID: 1024578
Profile soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1024580 - Posted: 13 Aug 2010, 15:50:09 UTC - in response to Message 1024578.  

According to my notebook, they've been down for two days. It used to download enough tasks to keep crunching through the outages, but lately it'll only get ten or so and just sits there using up electricity until they become available again. Not cool.


Your notebook missed a day of the 3-day planned outage.


Janice
ID: 1024580
Mike.Gibson

Joined: 13 Oct 07
Posts: 34
Credit: 198,038
RAC: 0
United Kingdom
Message 1024584 - Posted: 13 Aug 2010, 16:57:31 UTC

For the information of anyone interested, some of the units got in quickly and were accepted immediately. The rest were timed out and replacements were generated. However, when I eventually re-submitted them manually, they were accepted and the extra generated WUs had not been sent out.

In view of the huge amount of computer power needed for all the returns after an outage, it seems rather a waste of scarce resources for them to have generated the extra WUs. I would have thought that it would be better for all concerned if there was to be a moratorium until, say, 6 hours after an outage.

That would give us time to submit the completed units and not waste the computing power generating unnecessary units.

Mike
ID: 1024584
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1024593 - Posted: 13 Aug 2010, 17:13:30 UTC - in response to Message 1024584.  

For the information of anyone interested, some of the units got in quickly and were accepted immediately. The rest were timed out and replacements were generated. However, when I eventually re-submitted them manually, they were accepted and the extra generated WUs had not been sent out.

In view of the huge amount of computer power needed for all the returns after an outage, it seems rather a waste of scarce resources for them to have generated the extra WUs. I would have thought that it would be better for all concerned if there was to be a moratorium until, say, 6 hours after an outage.

That would give us time to submit the completed units and not waste the computing power generating unnecessary units.

Mike

The trouble is, it takes more time and effort to not generate them than it does to generate them.

Making the replacement is something that's been built into the server code for years. It happens automatically, and nobody even has to think about it. The tasks generated go to the end of the queue, so there's a reasonable chance that the missing reply can report in and be validated before the replacement is sent out - in that case the replacement is cancelled before it wastes any bandwidth.

Trying to do it the other way is attractive, but far more complicated - and in the computing world that makes it more error-prone, too. You'd have to have mechanisms for turning replacement generation off, and back on again: and decisions to make about what constitutes an outage - if you turn things off for just five minutes while the pipes clear, does that trigger the full six-hour back-off? That sort of thing.

Far better to keep the simpler, tried-and-tested rule, and put up with the slight inefficiency, I reckon.
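The rule described above can be sketched in a few lines of Python (a minimal illustration with invented names like `on_deadline_miss` and `send_queue`; the real BOINC transitioner is C++ and considerably more involved):

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    id: int
    state: str = "in_progress"   # in_progress | timed_out | valid | not_needed

@dataclass
class WorkUnit:
    results: list = field(default_factory=list)
    send_queue: list = field(default_factory=list)  # replicas waiting to be sent

def on_deadline_miss(wu, result):
    # Deadline passed: mark the tardy result timed out and append a
    # replacement at the END of the send queue, so a late report still
    # has a chance to validate before the replacement goes out.
    result.state = "timed_out"
    replacement = Result(id=result.id + 1)
    wu.results.append(replacement)
    wu.send_queue.append(replacement)

def on_late_report_validated(wu, result):
    # The tardy result arrives and validates before its replacement was
    # sent: cancel every unsent replacement so no bandwidth is wasted.
    result.state = "valid"
    for r in wu.send_queue:
        r.state = "not_needed"
    wu.send_queue.clear()
```

Because the replacement joins the back of the queue, the common case (a late report beating the replacement out the door) costs nothing but a row in the database.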
ID: 1024593
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1024595 - Posted: 13 Aug 2010, 17:26:48 UTC - in response to Message 1024573.  

It's part of the newer "intelligent" BOINC clients V6.10.5x and maybe a little bit earlier. If a unit is past its return date it automatically deletes it, even if it's complete and ready to upload. It's a bleeding PIA, especially when you only have 3 days or so when returns are accepted by the project servers.

As I have said before, the BOINC client is getting too "smart" for its own good.

T.A.

Really? Must check that out - if so, that's really going too far.

I know the client now aborts work which hasn't even started before deadline, and at least warns, and suggests aborting, work that hasn't completed - but aborting work that's finished and held up at the report stage? What's the point in that?

That was my reaction too, so I spent some time walking through BOINC source code. All I can say for sure is that aborting completed work is not intentional, and I haven't spotted how it can happen accidentally. Soft^spirit's observations combined with T.A.'s comment are enough to make me believe there's a problem, but Dr. Anderson would require solid evidence from a message log.
                                                                 Joe
ID: 1024595
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1024596 - Posted: 13 Aug 2010, 17:32:40 UTC - in response to Message 1024573.  

I know the client now aborts work which hasn't even started before deadline, and at least warns, and suggests aborting, work that hasn't completed - but aborting work that's finished and held up at the report stage? What's the point in that?

It is only work that hasn't started yet. Work that's in progress or has been finished will continue to run and give you the warning that it's so many days overdue, consider aborting it. But BOINC won't abort it automatically.

Good thing too, as else those CPDN models would never reach home base again on the slower computers. ;)
ID: 1024596
Profile soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1024598 - Posted: 13 Aug 2010, 17:37:43 UTC - in response to Message 1024595.  

I honestly did not see anything in my message log. It seemed to keep the results ready to return, and in fact returned them, but of course by then it was beyond the deadline, and the servers were already preparing to send out the follow-up.

All 21 are fairly obvious in the error log on one of my machines (following the earlier unit and the deadline-past computer from it).

The painful part was seeing them short-fused, crunching them in time, and then, with the servers unavailable, timed out. They still got returned, just too late.
Janice
ID: 1024598
Profile soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1024601 - Posted: 13 Aug 2010, 17:41:19 UTC

Task ID   Work unit   Sent   Time reported or deadline   Status   Run time (sec)   CPU time (sec)   Credit   Application
1664682305 635760429 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682299 635760411 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682291 635760387 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682285 635760369 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682281 635760357 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682279 635760530 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682275 635760518 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682271 635760506 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682269 635760500 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682267 635760494 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682265 635760488 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682263 635760482 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682259 635760470 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682255 635760458 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682250 635760446 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03
1664682165 635760415 24 Jul 2010 0:56:27 UTC 6 Aug 2010 19:13:07 UTC Timed out - no response 0.00 0.00 --- SETI@home Enhanced v6.03

Janice
ID: 1024601
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1024603 - Posted: 13 Aug 2010, 17:48:45 UTC - in response to Message 1024598.  

They still got returned. Just too late.

Which isn't BOINC aborting them since they were past the deadline.

But it's also possible that due to the problems around here, the database isn't showing the correct state of tasks. I wouldn't at all be surprised if they showed up 'correctly' later on.
ID: 1024603
Profile soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1024604 - Posted: 13 Aug 2010, 17:56:33 UTC - in response to Message 1024603.  

They still got returned. Just too late.

Which isn't BOINC aborting them since they were past the deadline.

But it's also possible that due to the problems around here, the database isn't showing the correct state of tasks. I wouldn't at all be surprised if they showed up 'correctly' later on.

What is "correctly"? Assuming that all 3 achieve the same results.. which 2?
It is technically true they were returned late. Server issue, not cruncher issue.. but late is late.

I just hope to see no more short-fused tasks come in.. And I am still trying to understand why they did not complete sooner; I do not keep a huge cache. Priority issues perhaps? If I did indeed receive them by the 24th, why wasn't my machine crunching the soonest-due first? (That would probably have completed them well before the Aug 3-6 outage.)

I honestly did not have a huge cache.. in fact the computers were running dry..

There are some more missing pieces to the puzzle.


Janice
ID: 1024604
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1024610 - Posted: 13 Aug 2010, 18:14:38 UTC - in response to Message 1024593.  

...
In view of the huge amount of computer power needed for all the returns after an outage, it seems rather a waste of scarce resources for them to have generated the extra WUs. I would have thought that it would be better for all concerned if there was to be a moratorium until, say, 6 hours after an outage.
...
Mike

The trouble is, it takes more time and effort to not generate them than it does to generate them.

Making the replacement is something that's been built into the server code for years. It happens automatically, and nobody even has to think about it. The tasks generated go to the end of the queue, so there's a reasonable chance that the missing reply can report in and be validated before the replacement is sent out - in that case the replacement is cancelled before it wastes any bandwidth.

Trying to do it the other way is attractive, but far more complicated - and in the computing world that makes it more error-prone, too. You'd have to have mechanisms for turning replacement generation off, and back on again: and decisions to make about what constitutes an outage - if you turn things off for just five minutes while the pipes clear, does that trigger the full six-hour back-off? That sort of thing.

Far better to keep the simpler, tried-and-tested rule, and put up with the slight inefficiency, I reckon.

Circumstances change, and I think there's possibly good reason to try to extend the deadlines. Something like a script run at the beginning of the outage which increases the deadline for all tasks due to time out within or shortly after the outage might work.

For this week's outage, the "Results ready to send" queue was essentially empty Tuesday morning, and grew slowly for the next couple of days. I think those result creations were probably mostly caused by deadline misses. Those tasks would have been sent almost immediately as the uptime began today, before completions during the outage had a chance to make them "Not needed". That increases the amount of inefficiency, particularly since this project cannot afford to send the "abort if not started" server abort.
                                                                Joe
ID: 1024610
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1024620 - Posted: 13 Aug 2010, 18:50:47 UTC - in response to Message 1024610.  
Last modified: 13 Aug 2010, 19:02:09 UTC

Circumstances change, and I think there's possibly good reason to try to extend the deadlines. Something like a script run at the beginning of the outage which increases the deadline for all tasks due to time out within or shortly after the outage might work.

For this week's outage, the "Results ready to send" queue was essentially empty Tuesday morning, and grew slowly for the next couple of days. I think those result creations were probably mostly caused by deadline misses. Those tasks would have been sent almost immediately as the uptime began today, before completions during the outage had a chance to make them "Not needed". That increases the amount of inefficiency, particularly since this project cannot afford to send the "abort if not started" server abort.
                                                                Joe

Yes, you're probably right - I've got a bunch trying to download now, and the vast majority are replication _2 or later - both for shorties (deadline passed?), and mid-range (ghosts?). I've even got an _8 - I'll have a look through the web page for the host in question, and see what I can find.

Edit: the _8 is WU 632857301. Anyone like to pick the bones out of that?

Edit2: Oh, a -12 - rebranded it, that should put it out of its misery.
ID: 1024620
Mike.Gibson

Joined: 13 Oct 07
Posts: 34
Credit: 198,038
RAC: 0
United Kingdom
Message 1024931 - Posted: 14 Aug 2010, 8:07:35 UTC

The alternative would be to set deadlines in the first place that don't expire during the planned outage periods.

Mike
ID: 1024931
TheFreshPrince a.k.a. BlueTooth76
Joined: 4 Jun 99
Posts: 210
Credit: 10,315,944
RAC: 0
Netherlands
Message 1024951 - Posted: 14 Aug 2010, 10:24:37 UTC
Last modified: 14 Aug 2010, 10:25:44 UTC

Jeff said:


We're on line with these limits:

40/CPU
320/GPU

planned Monday limits:

320/CPU
2560/GPU

With the MySQL replica down, read-only queries that would normally go to the replica will hit the primary instead. We'll see what the impact of this is.


But I get work from the servers as if there isn't any limit...
I have more than 2600 GPU units and it still asks for more work... and gets more work...

I'm not complaining; I go on vacation on Monday, so I like that my caches are filled, but I was just wondering...
Rig name: "x6Crunchy"
OS: Win 7 x64
MB: Asus M4N98TD EVO
CPU: AMD X6 1055T 2.8(1,2v)
GPU: 2x Asus GTX560ti
Member of: Dutch Power Cows
ID: 1024951
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1024955 - Posted: 14 Aug 2010, 11:04:07 UTC - in response to Message 1024951.  

I have more than 2600GPU units and it still asks for more work... And gets more work...

Stop hogging all that bandwidth. ;-)

I see I have downloads of 0.62 and 0.32 KB/sec. Ouch. :)
ID: 1024955
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1025002 - Posted: 14 Aug 2010, 13:51:08 UTC - in response to Message 1024931.  

The alternative would be to set deadlines in the first place that don't expire during the planned outage periods.

Mike

The splitters just specify how long (delay_bound) the task is allowed, the Scheduler adds that to "now" to set the actual deadline. So the idea is possible if the splitters were modified to round all delay_bound values up to a multiple of one week. It might help, but a task "sent" right at the beginning of an uptime period would have a deadline in that difficult time period too.
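The rounding idea takes only a few lines (illustrative only; `round_delay_bound` is an invented name, not actual splitter code):

```python
WEEK = 7 * 24 * 3600  # seconds in a week

def round_delay_bound(delay_bound_secs):
    # Round the allowed turnaround up to a whole number of weeks, so that
    # "now + delay_bound" lands at the same weekday and time of day as the
    # send time. A task sent outside the weekly outage window would then
    # also have its deadline fall outside that window.
    return -(-delay_bound_secs // WEEK) * WEEK  # ceiling division
```

As noted, this still fails for tasks sent right at the start of an uptime period, whose deadline would land at the start of a later uptime period, i.e. just after the preceding outage.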

There is something users can do to help avoid the problem. The core client sets up a computation deadline which is somewhat before the report deadline. One factor is the "Connect about every" preference, the core client will go into High Priority if needed to get work completed at least that much before the report deadline. So those doing only this project should boost their cache with that setting rather than the "Additional work" setting. When you think about it, "Connect about every" of more than 3 days is right for this project now. It may be inappropriate for other projects, though, and that's a global setting.

I looked at the Transitioner code which changes a task status from "in progress" to timed out, it may be fairly simple to add an option to defer the timeout there. I'll work up some proposal for Dr. Anderson to consider.
                                                                Joe
ID: 1025002
Terror Australis
Volunteer tester

Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 1025245 - Posted: 15 Aug 2010, 8:16:26 UTC - in response to Message 1024596.  

It is only work that hasn't started yet. Work that's in progress or has been finished will continue to run and give you the warning that it's so many days overdue, consider aborting it. But BOINC won't abort it automatically.

Good thing too, as else those CPDN models would never reach home base again on the slower computers. ;)


I have seen this happen (deletion of completed units waiting to upload). It was about 3 months ago, on a machine that was killed by a power failure when I wasn't around to restart it. I did not save the messages, but they were something like "unit expired, deleting". I lost about 100 WUs that way.

I can't tell you the BOINC version number as I'm away from home atm.

The Terror
ID: 1025245



 