Panic Mode On (114) Server Problems?
Author | Message |
---|---|
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I'm still dealing with unrecoverable downloads and getting large 4 hour backoffs. Bah humbug. [Edit] Around ~1700 download errors accumulated so far. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
I'm still dealing with unrecoverable downloads and getting large 4 hour backoffs. Bah humbug. . . I had one rig generate 124 ghosted tasks during the server spasms so I started ghost recovery. I managed to get 60 resends but the next attempt resulted in the remaining 64 giving "failed to resend : expired", which is not what I was aiming for but at least gets them out of limbo. Downside: I am now unable to get new work; instead I'm getting "you have completed daily quota of 'x' WUs". The worst part is the value of 'x' only goes up by 1 even when I report 4 or more completed tasks. I just hope 'x' gets to the desired value before I completely run out of work. :( Stephen :( |
Bernie Vine Joined: 26 May 99 Posts: 9958 Credit: 103,452,613 RAC: 328 |
Everything normal on my 2 machines. 37 errors across the 2 machines, both now with full caches and tasks validating. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13844 Credit: 208,696,464 RAC: 304 |
The worst part is the value of 'x' only goes up by 1 even when I report 4 or more completed tasks. As they Validate, then X will increase significantly for each one. Grant Darwin NT |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
The worst part is the value of 'x' only goes up by 1 even when I report 4 or more completed tasks. . . Thanks, Stephen . . |
Ghia Joined: 7 Feb 17 Posts: 238 Credit: 28,911,438 RAC: 50 |
Haven't had an error in months....the DL problems produced 38 on my one cruncher. Most are flagged as "Error while downloading", while some are just flagged as "Error". Both types have the same Stderr output : `<core_client_version>7.6.33</core_client_version> <![CDATA[ <message> WU download error: couldn't get input files: <file_xfer_error> <file_name>02no11ab.3519.9065.7.34.28</file_name> <error_code>-224 (permanent HTTP error)</error_code> <error_message>permanent HTTP error</error_message> </file_xfer_error> </message> ]]>` Nothing to do about these, right ? Humans may rule the world...but bacteria run it... |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Nothing to do about these, right ? Correct - except possibly read the front-page news. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13844 Credit: 208,696,464 RAC: 304 |
Nothing to do about these, right ? Nope. As the WU gets re-allocated & Errors out it will eventually be declared a dud (Too many errors (may have bug)) and after being Assimilated, Deleted & Purged will eventually disappear from your Task list. However you can also look forward to getting a few Invalids as WUs that are Pending Validation or Inconclusive that were on the failed storage get re-issued and then error out due to not being there to download. My WAG (Wild Arse Guess) is that most of the outright Errors should be mostly done in around 48 hours. However we can look forward to getting at least a few Invalids for several months to come as those that are Pending get re-issued to another host when the original wingmate doesn't return the result for the WU they were able to download before this issue occurred (which will of course result in another batch of Errors due to the WU not being available for download for each of the WUs this occurs on). Something to keep in mind for a while. Grant Darwin NT |
Brent Norman Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
Only 38 errors, that's good :) I just saw one computer sitting with 130 errors, out of 226 downloaded. That will be waiting a while now for the quota to build up ... |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
uploads going through for me now. but i do have to kick start the work fetch cycle. I'm having similar problems. Reduced the number of reported tasks to 200 from 400 and will see if that works. A manual kick (punch) to the project-update button helped. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Stopped tending the backoffs and went to bed finally last night. Woke up now to one host at 21 hours in backoff from the original 24 hour backoff. Restarted BOINC to get back to reporting the first 100 tasks and the normal 5 minute timer. Up to 1750 errored while downloading tasks now. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
betreger Joined: 29 Jun 99 Posts: 11414 Credit: 29,581,041 RAC: 66 |
Up to 1750 errored while downloading tasks now. Delightful |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Stopped tending the backoffs and went to bed finally last night. Woke up now to one host at 21 hours in backoff from the original 24 hour backoff. Restarted BOINC to get back to reporting the first 100 tasks and the normal 5 minute timer. You can always overrule the timer by simply clicking 'update'. That will report and reset the backoffs. The only thing you can't do (successfully) is request new work during a 5-minute backoff. The 5 minute pause is requested and tracked by the server - if you click too early, you'll have to wait another 5 minutes before requesting work. But you can report completed work even during that closed period. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I found that just doing "update" wasn't clearing the errored tasks out of the hosts. If I did just that, all it would do is report finished work and reset the backoff timer to somewhere between 24 hours and 4 hours depending on how many error tasks were in the system. The only certain surefire way to get to a proper 5 minute timer is stop and restart BOINC. Then depending on whether any new work received had lost tasks, it reset the backoff timer again proportionally to however many unrecoverable tasks were received. There are always around 3 to 4 now in every 200 task download. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Up to 1750 errored while downloading tasks now. . . Only 383 here ... But considering the number of tasks you do that is probably about pro rata. Stephen <shrug> |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13844 Credit: 208,696,464 RAC: 304 |
I'm wondering if Eric did some other tweaking after taking out the failed storage, or if it's been causing problems for a while now. It still took ages for the splitters to crank up again once work had run down, but once they got going the output from the splitters is like never before: minimum output is around 60/s, generally no less than 80/s. Most of the time around 110/s. So now it's a matter of minutes to refill the Ready-to-send buffer, with extended periods of no output required. Makes a change from the need for continuous splitter operation when the output was low. However we appear to be having issues with the AP & MB deleters again as their backlogs continue to grow (in the case of AP, after the Validators cleared their backlog). Grant Darwin NT |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13844 Credit: 208,696,464 RAC: 304 |
I found that just doing "update" wasn't clearing the errored tasks out of the hosts. If I did just that, all it would do is report finished work and reset the backoff timer to somewhere between 24 hours and 4 hours depending on how many error tasks were in the system. The only certain surefire way to get to a proper 5 minute timer is stop and restart BOINC. Then depending on whether any new work received had lost tasks, it reset the backoff timer again proportionally to however many unrecoverable tasks were received. There are always around 3 to 4 now in every 200 task download. Must be something to do with the Linux client (or the large number of failed download tasks you're getting allocated to your systems), as I'm not seeing that behaviour on my Windows systems. Even at its worst, I only had to hit Update & that would report those Errored WUs, and the timer would be reset to 5min. About the longest backoff I noticed was around 3hrs. Generally they were between 15min & 2-2.5hrs. My numbers of Errored & Invalid work continue to grow, although the Errored WUs' rate of growth has slowed significantly. Invalids are just going to pop up here & there each day as those missing WUs get re-allocated over the next few months. I hope Eric can run a query to catch all these WUs affected by the missing download file and re-issue them. Grant Darwin NT |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Must be something to do with the Linux client (or the large number of failed download tasks you're getting allocated your systems), as i'm not seeing that behaviour on my Windows systems. . . It may be the version of Linux BOINC manager/client he is running. On my systems running an older Linux client, like you, I have found that hitting the update button reports and clears the failed downloads and restores the default back-off timer. . . If the tasks that fail have in fact been totally lost when the storage device failed then it seems to me they would need to be re-split unless split tasks are archived before they go to dispersal. Stephen . . |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Stopped tending the backoffs and went to bed finally last night. Woke up now to one host at 21 hours in backoff from the original 24 hour backoff. Restarted BOINC to get back to reporting the first 100 tasks and the normal 5 minute timer. You can always overrule the timer by simply clicking 'update'. That will report and reset the backoffs. I cannot request new work when I am asleep or during work hours. Something is 'wrong'. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
However we appear to be having issues with the AP & MB deleters again as their backlogs continue to grow (in the case of AP, after the Validators cleared their backlog). The file deleters on Georgem have been turned off since the hard drive failure on that machine. That is why the backlog continues to grow. Might have something to do with the high splitter output and the fact the replica zeroed out and has not started climbing back up as usual. Less I/O contention maybe? My large quantity of failed tasks is just because of my high turnaround rate. I ask for more work than most so am likely to get more of the missing tasks. Need to go service another couple of hosts I see, with one in a 50 minute backoff now and another one that is not showing a 5 minute timer at all and hasn't contacted the server for an hour. See lots of pink Download Errors in the Tasks list for that machine with BoincTasks. [Edit] I must have a magnet. On the host that was in backoff for 35 minutes, I reported 100 tasks and got 100 in return; 30 of them were unrecoverable download errors. This is getting tiresome. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
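
For anyone who wants to tally these download errors from saved Stderr text rather than counting by eye, here is a minimal sketch in Python. It assumes the Stderr output looks like the sample Ghia posted above. The function name `parse_xfer_errors` and the sample's line layout are made up for illustration, and nothing here is part of BOINC itself. It uses a regular expression instead of an XML parser because the posted output has no single root element.

```python
import re

# Sample copied from the Stderr output quoted in the thread (layout assumed).
STDERR = """<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>02no11ab.3519.9065.7.34.28</file_name>
  <error_code>-224 (permanent HTTP error)</error_code>
  <error_message>permanent HTTP error</error_message>
</file_xfer_error>
</message>
]]>"""


def parse_xfer_errors(text):
    """Return one dict per <file_xfer_error> block found in text.

    Hypothetical helper: the field names come from the sample above only.
    """
    errors = []
    for block in re.findall(r"<file_xfer_error>(.*?)</file_xfer_error>", text, re.S):
        # Collect every <tag>value</tag> pair inside the block.
        fields = dict(re.findall(r"<(\w+)>(.*?)</\1>", block, re.S))
        # The code field reads like "-224 (permanent HTTP error)"; keep the integer.
        code = re.match(r"\s*(-?\d+)", fields.get("error_code", ""))
        errors.append({
            "file_name": fields.get("file_name", "").strip(),
            "error_code": int(code.group(1)) if code else None,
            "error_message": fields.get("error_message", "").strip(),
        })
    return errors


if __name__ == "__main__":
    for err in parse_xfer_errors(STDERR):
        print(err["file_name"], err["error_code"], err["error_message"])
```

Running it over the concatenated Stderr of many tasks and counting the entries with `error_code == -224` would give a per-host total of the permanent download failures discussed in this thread.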