Message boards :
Number crunching :
The Outage has begun
Robert Ribbeck · Joined: 7 Jun 02 · Posts: 644 · Credit: 5,283,174 · RAC: 0

Uploads have restarted.
Odan · Joined: 8 May 03 · Posts: 91 · Credit: 15,331,177 · RAC: 0

Do you sit there all day waiting for it to start or stop? :)
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874

> It's part of the newer "intelligent" BOINC clients, V6.10.5x and maybe a little earlier. If a unit is past its return date it automatically deletes it, even if it's complete and ready to upload. It's a bleeding PIA, especially when you only have 3 days or so when returns are accepted by the project servers.

Really? Must check that out - if so, that's really going too far. I know the client now aborts work which hasn't even started before its deadline, and at least warns, and suggests aborting, work that hasn't completed - but aborting work that's finished and held up at the report stage? What's the point of that?
Walt Bennett · Joined: 20 Aug 99 · Posts: 1 · Credit: 1,009,047 · RAC: 0

According to my notebook, they've been down for two days. It used to download enough tasks to keep crunching through the outages, but lately it'll only get ten or so and just sits there using up electricity until they become available again. Not cool.
soft^spirit · Joined: 18 May 99 · Posts: 6497 · Credit: 34,134,168 · RAC: 0

> According to my notebook, they've been down for two days. It used to download enough tasks to keep crunching through the outages, but lately it'll only get ten or so and just sits there using up electricity until they become available again. Not cool.

Your notebook missed a day of the 3-day planned outage.

Janice
Mike.Gibson · Joined: 13 Oct 07 · Posts: 34 · Credit: 198,038 · RAC: 0

For the information of anyone interested: some of my units got in quickly and were accepted immediately. The rest were timed out and replacements were generated. However, when I manually re-submitted the remainder they were eventually accepted, and the extra generated WUs had not been sent out.

In view of the huge amount of computer power needed for all the returns after an outage, it seems rather a waste of scarce resources to have generated the extra WUs. I would have thought it better for all concerned if there were a moratorium until, say, 6 hours after an outage. That would give us time to submit the completed units and not waste the computing power generating unnecessary units.

Mike
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874

> For the information of anyone interested: some of my units got in quickly and were accepted immediately. The rest were timed out and replacements were generated. However, when I manually re-submitted the remainder they were eventually accepted, and the extra generated WUs had not been sent out.

The trouble is, it takes more time and effort to not generate them than it does to generate them. Making the replacement is something that's been built into the server code for years. It happens automatically, and nobody even has to think about it. The tasks generated go to the end of the queue, so there's a reasonable chance that the missing reply can report in and be validated before the replacement is sent out - in that case the replacement is cancelled before it wastes any bandwidth.

Trying to do it the other way is attractive, but far more complicated - and in the computing world that makes it more error-prone, too. You'd have to have mechanisms for turning replacement generation off, and back on again, and decisions to make about what constitutes an outage - if you turn things off for just five minutes while the pipes clear, does that trigger the full six-hour back-off? That sort of thing. Far better to keep the simpler, tried-and-tested rule and put up with the slight inefficiency, I reckon.
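The timeout-and-replace rule Richard describes can be sketched roughly as follows. This is illustrative Python only, not actual BOINC server code; the names `Task`, `send_queue`, `transition` and `late_report_validated` are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Task:
    wu_id: int
    state: str = "IN_PROGRESS"
    deadline: float = 0.0

send_queue = []  # stand-in for the server's "Results ready to send" queue

def transition(tasks, now):
    # When an in-progress task misses its deadline, mark it timed out and
    # append a replacement to the END of the queue - so a late report can
    # still arrive and validate before the replacement is sent out.
    for t in tasks:
        if t.state == "IN_PROGRESS" and now > t.deadline:
            t.state = "TIMED_OUT"
            send_queue.append(Task(wu_id=t.wu_id, state="UNSENT"))

def late_report_validated(task):
    # A late-but-valid report cancels any still-unsent replacement for
    # the same workunit before it wastes any bandwidth.
    task.state = "VALID"
    send_queue[:] = [t for t in send_queue if t.wu_id != task.wu_id]
```

The point of appending to the back of the queue is exactly the window Richard mentions: while the replacement waits its turn, a late original can still arrive and cancel it for free.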
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0

> It's part of the newer "intelligent" BOINC clients, V6.10.5x and maybe a little earlier. If a unit is past its return date it automatically deletes it, even if it's complete and ready to upload. It's a bleeding PIA, especially when you only have 3 days or so when returns are accepted by the project servers.

That was my reaction too, so I spent some time walking through BOINC source code. All I can say for sure is that aborting completed work is not intentional, and I haven't spotted how it could happen accidentally. Soft^spirit's observations combined with T.A.'s comment are enough to make me believe there's a problem, but Dr. Anderson would require solid evidence from a message log.

Joe
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3

> I know the client now aborts work which hasn't even started before its deadline, and at least warns, and suggests aborting, work that hasn't completed - but aborting work that's finished and held up at the report stage? What's the point of that?

It is only work that hasn't started yet. Work that's in progress or has been finished will continue to run and give you the warning that it's so many days overdue, consider aborting it. But BOINC won't abort it automatically. Good thing too, as otherwise those CPDN models would never reach home base again on the slower computers. ;)
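The client-side rule as Jord states it reduces to a simple decision. This is a hypothetical helper, not actual BOINC client code; the dictionary keys are invented for the sketch.

```python
def client_action(task, now):
    # Per Jord's description: only work that has never started is aborted
    # once its deadline has passed. Started or finished work is kept, and
    # merely triggers an "overdue, consider aborting" warning.
    if now > task["deadline"] and task["cpu_time"] == 0.0:
        return "abort"   # never started, deadline passed
    if now > task["deadline"]:
        return "warn"    # overdue but started/finished: never auto-aborted
    return "keep"
```

Terror Australis's report further down the thread, if accurate, would mean finished-but-unreported work sometimes falls into the "abort" branch - which is exactly what Joe could not find in the source.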
soft^spirit · Joined: 18 May 99 · Posts: 6497 · Credit: 34,134,168 · RAC: 0

I honestly did not see anything in my message log. It seemed to keep the results ready to return - in fact it returned them - but of course by then it was beyond the deadline, and the servers were already preparing to send out the follow-ups. All 21 are fairly obvious in the error log on one of my machines (following the earlier unit and the past-deadline computer from it). The painful part was seeing them short-fused, crunching them in time, and then - with the servers not available - timed out. They still got returned. Just too late.

Janice
soft^spirit · Joined: 18 May 99 · Posts: 6497 · Credit: 34,134,168 · RAC: 0

All sixteen tasks below were sent 24 Jul 2010 0:56:27 UTC with a deadline of 6 Aug 2010 19:13:07 UTC, and all show "Timed out - no response" (run time 0.00, CPU time 0.00, no credit) for SETI@home Enhanced v6.03:

Task ID | Work unit
1664682305 | 635760429
1664682299 | 635760411
1664682291 | 635760387
1664682285 | 635760369
1664682281 | 635760357
1664682279 | 635760530
1664682275 | 635760518
1664682271 | 635760506
1664682269 | 635760500
1664682267 | 635760494
1664682265 | 635760488
1664682263 | 635760482
1664682259 | 635760470
1664682255 | 635760458
1664682250 | 635760446
1664682165 | 635760415

Janice
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3

> They still got returned. Just too late.

Which isn't BOINC aborting them, since they were past the deadline. But it's also possible that, due to the problems around here, the database isn't showing the correct state of tasks. I wouldn't at all be surprised if they showed up 'correctly' later on.
soft^spirit · Joined: 18 May 99 · Posts: 6497 · Credit: 34,134,168 · RAC: 0

> They still got returned. Just too late.

What is "correctly"? Assuming that all 3 achieve the same results... which 2? It is technically true they were returned late. Server issue, not cruncher issue... but late is late. I just hope to see no more short-fused tasks come in.

And I am still trying to understand why they did not complete sooner; I do not keep a huge cache. Priority issues, perhaps? If I did indeed receive them by the 24th, why wasn't my machine crunching the soonest-due first? That would probably have completed them well before the Aug 3-6 outage. I honestly did not have a huge cache - in fact the computers were running dry. There are some more missing pieces to the puzzle.

Janice
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0

... Circumstances change, and I think there's possibly good reason to try to extend the deadlines. Something like a script, run at the beginning of the outage, which increases the deadline of all tasks due to time out within or shortly after the outage might work.

For this week's outage, the "Results ready to send" queue was essentially empty Tuesday morning, and grew slowly for the next couple of days. I think those result creations were probably mostly caused by deadline misses. Those tasks would have been sent almost immediately as the uptime began today, before completions during the outage had a chance to make them "Not needed". That increases the amount of inefficiency, particularly since this project cannot afford to send the "abort if not started" server abort.

Joe
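Joe's proposed pre-outage script might look something like this. This is a sketch only - the field names and the in-memory task list are hypothetical, not BOINC's real database schema, and a production version would be an SQL update against the result table.

```python
from datetime import datetime, timedelta

def extend_deadlines(tasks, outage_start, outage_end, grace_hours=6):
    # Run at the beginning of a planned outage: push back the deadline of
    # every in-progress task that would otherwise time out during the
    # outage, or shortly after it, by the outage length plus a grace period.
    cutoff = outage_end + timedelta(hours=grace_hours)
    extension = (outage_end - outage_start) + timedelta(hours=grace_hours)
    for t in tasks:
        if t["state"] == "IN_PROGRESS" and outage_start <= t["deadline"] <= cutoff:
            t["deadline"] += extension
    return tasks
```

Because no task can report while the servers are down anyway, extending those deadlines costs nothing and prevents the wave of "Timed out - no response" replacements that floods the send queue when uptime resumes.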
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874

> Circumstances change, and I think there's possibly good reason to try to extend the deadlines. Something like a script, run at the beginning of the outage, which increases the deadline of all tasks due to time out within or shortly after the outage might work.

Yes, you're probably right - I've got a bunch trying to download now, and the vast majority are replication _2 or later - both for shorties (deadline passed?) and mid-range (ghosts?). I've even got an _8 - I'll have a look through the web page for the host in question and see what I can find.

Edit: the _8 is WU 632857301. Anyone like to pick the bones out of that?

Edit2: Oh, a -12 - rebranded it, that should put it out of its misery.
Mike.Gibson · Joined: 13 Oct 07 · Posts: 34 · Credit: 198,038 · RAC: 0

The alternative would be to set deadlines in the first place that didn't fall within the planned outage periods.

Mike
TheFreshPrince a.k.a. BlueTooth76 · Joined: 4 Jun 99 · Posts: 210 · Credit: 10,315,944 · RAC: 0

Jeff said:

But I get work from the servers like there isn't any limit... I have more than 2600 GPU units and it still asks for more work... and gets more work... I don't complain - I go on vacation on Monday, so I like that my caches are filled - but I was just wondering...

Rig name: "x6Crunchy" · OS: Win 7 x64 · MB: Asus M4N98TD EVO · CPU: AMD X6 1055T 2.8 (1.2v) · GPU: 2x Asus GTX560ti · Member of: Dutch Power Cows
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3

> I have more than 2600 GPU units and it still asks for more work... and gets more work...

Stop hogging all that bandwidth. ;-) I see I have downloads of 0.62 and 0.32 KB/sec. Ouch. :)
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0

> The alternative would be to set deadlines in the first place that didn't fall within the planned outage periods.

The splitters just specify how long the task is allowed (delay_bound); the Scheduler adds that to "now" to set the actual deadline. So the idea is possible if the splitters were modified to round all delay_bound values up to a multiple of one week. It might help, but a task sent right at the beginning of an uptime period would have a deadline in that difficult time period too.

There is something users can do to help avoid the problem. The core client sets up a computation deadline which is somewhat before the report deadline. One factor is the "Connect about every" preference: the core client will go into High Priority if needed to get work completed at least that much before the report deadline. So those doing only this project should boost their cache with that setting rather than the "Additional work" setting. When you think about it, a "Connect about every" of more than 3 days is right for this project now. It may be inappropriate for other projects, though, and it's a global setting.

I looked at the Transitioner code which changes a task's status from "in progress" to timed out; it may be fairly simple to add an option to defer the timeout there. I'll work up a proposal for Dr. Anderson to consider.

Joe
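The arithmetic Joe describes is simple enough to write down. The helper names below are illustrative, not BOINC's actual implementation; only the relationships (deadline = now + delay_bound, computation deadline = report deadline minus the "Connect about every" interval) come from his post.

```python
WEEK = 7 * 24 * 3600  # seconds

def report_deadline(now, delay_bound):
    # The scheduler sets deadline = now + delay_bound. Rounding delay_bound
    # up to a whole number of weeks (the suggested splitter change) keeps
    # deadlines from drifting into a fixed weekly outage window.
    rounded = -(-delay_bound // WEEK) * WEEK  # ceiling to a multiple of a week
    return now + rounded

def computation_deadline(report_dl, connect_every_days):
    # The core client aims to finish at least "Connect about every" ahead of
    # the report deadline, entering High Priority mode if needed - which is
    # why raising that preference, not "Additional work", buys safety margin.
    return report_dl - connect_every_days * 24 * 3600
```

So with a 3-day "Connect about every" setting, a task would be pushed to completion a full 3 days before its report deadline - enough to ride out the weekly outage.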
Terror Australis · Joined: 14 Feb 04 · Posts: 1817 · Credit: 262,693,308 · RAC: 44

> It is only work that hasn't started yet. Work that's in progress or has been finished will continue to run and give you the warning that it's so many days overdue, consider aborting it. But BOINC won't abort it automatically.

I have seen this happen (deletion of completed units waiting to upload). It was about 3 months ago, on a machine that was killed by a power failure when I wasn't around to restart it. I did not save the messages, but they were something like "unit expired, deleting"; I lost about 100 WUs that way. I can't tell you the BOINC version number as I'm away from home atm.

The Terror
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.