Panic Mode On (114) Server Problems?

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1979543 - Posted: 9 Feb 2019, 7:17:51 UTC Last modified: 9 Feb 2019, 7:19:51 UTC I'm still dealing with unrecoverable downloads and getting large 4 hour backoffs. Bah humbug. [Edit] Around ~1700 download errors accumulated so far. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1979543 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1979545 - Posted: 9 Feb 2019, 7:58:41 UTC - in response to Message 1979543. I'm still dealing with unrecoverable downloads and getting large 4 hour backoffs. Bah humbug. [Edit] Around ~1700 download errors accumulated so far. . . I had one rig generate 124 ghosted tasks during the server spasms so I started ghost recovery. I managed to get 60 resends but the next attempt resulted in the remaining 64 giving "failed to resend : expired", which is not what I was aiming for but at least gets them out of limbo. Downside: I am now unable to get new work; instead I'm getting "you have completed daily quota of 'x' WUs". The worst part is the value of 'x' only goes up by 1 even when I report 4 or more completed tasks. I just hope 'x' gets to the desired value before I completely run out of work. :( Stephen :( ID: 1979545 ·

Bernie Vine Volunteer moderator Volunteer tester Send message Joined: 26 May 99 Posts: 9954 Credit: 103,452,613 RAC: 328	Message 1979547 - Posted: 9 Feb 2019, 8:42:44 UTC Everything normal on my 2 machines. 37 errors across the 2 machines , both now with full caches and tasks validating. ID: 1979547 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1979548 - Posted: 9 Feb 2019, 8:46:50 UTC - in response to Message 1979545. The worst part is the value of 'x' only goes up by 1 even when I report 4 or more completed tasks. As they Validate, then X will increase significantly for each one. Grant Darwin NT ID: 1979548 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1979552 - Posted: 9 Feb 2019, 8:53:49 UTC - in response to Message 1979548. The worst part is the value of 'x' only goes up by 1 even when I report 4 or more completed tasks. As they Validate, then X will increase significantly for each one. . . Thanks, Stephen . . ID: 1979552 ·

Ghia Send message Joined: 7 Feb 17 Posts: 238 Credit: 28,911,438 RAC: 50	Message 1979557 - Posted: 9 Feb 2019, 10:13:00 UTC Haven't had an error in months... the DL problems produced 38 on my one cruncher. Most are flagged as "Error while downloading", while some are just flagged as "Error". Both types have the same Stderr output:
<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>02no11ab.3519.9065.7.34.28</file_name>
  <error_code>-224 (permanent HTTP error)</error_code>
  <error_message>permanent HTTP error</error_message>
</file_xfer_error>
</message>
]]>
Nothing to do about these, right? Humans may rule the world...but bacteria run it... ID: 1979557 ·
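For anyone wanting to tally these errors across many tasks, here is a minimal Python sketch that pulls the fields out of a Stderr dump like the one above. It assumes the text has been saved or copied as-is; the function name `parse_xfer_errors` and the `STDERR` sample variable are made up for illustration and are not part of BOINC. The dump is not a single well-formed XML document, so the sketch uses regular expressions rather than an XML parser.

```python
import re

# Sample Stderr text, copied from the post above.
STDERR = """<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>02no11ab.3519.9065.7.34.28</file_name>
  <error_code>-224 (permanent HTTP error)</error_code>
  <error_message>permanent HTTP error</error_message>
</file_xfer_error>
</message>
]]>"""


def parse_xfer_errors(stderr_text):
    """Return one dict per <file_xfer_error> block found in the text.

    Hypothetical helper for illustration only, not a BOINC API.
    """
    errors = []
    blocks = re.findall(
        r"<file_xfer_error>(.*?)</file_xfer_error>", stderr_text, re.DOTALL
    )
    for block in blocks:
        # Collect simple <tag>value</tag> pairs inside the block.
        fields = dict(re.findall(r"<(\w+)>(.*?)</\1>", block, re.DOTALL))
        # The error_code field holds a number followed by a description.
        code_match = re.match(r"\s*(-?\d+)", fields.get("error_code", ""))
        errors.append({
            "file_name": fields.get("file_name", "").strip(),
            "error_code": int(code_match.group(1)) if code_match else None,
            "error_message": fields.get("error_message", "").strip(),
        })
    return errors


if __name__ == "__main__":
    for err in parse_xfer_errors(STDERR):
        print(err["file_name"], err["error_code"], err["error_message"])
```

Counting how many tasks share error code -224 would then just be a matter of running this over each task's Stderr text and summing the matches.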

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1979558 - Posted: 9 Feb 2019, 11:02:47 UTC - in response to Message 1979557. Nothing to do about these, right ? Correct - except possibly read the front-page news. ID: 1979558 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1979560 - Posted: 9 Feb 2019, 11:20:00 UTC - in response to Message 1979557. Last modified: 9 Feb 2019, 11:26:42 UTC Nothing to do about these, right ? Nope. As the WU gets re-allocated & Errors out it will eventually be declared a dud (Too many errors (may have bug)) and after being Assimilated, Deleted & Purged will eventually disappear from your Task list. However you can also look forward to getting a few Invalids as WUs that are Pending Validation or Inconclusive that were on the failed storage get re-issued and then error out due to not being there to download. My WAG (Wild Arse Guess) is that most of the outright Errors should be mostly done in around 48 hours. However we can look forward to getting at least a few Invalids for several months to come as those that are Pending get re-issued to another host when the original wingmate doesn't return the result for the WU they were able to download before this issue occurred (which will of course result in another batch of Errors due to the WU not being available for download for each of the WUs this occurs on). Something to keep in mind for a while. Grant Darwin NT ID: 1979560 ·

Brent Norman Volunteer tester Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835	Message 1979566 - Posted: 9 Feb 2019, 13:19:50 UTC - in response to Message 1979557. Only 38 errors, that's good :) I just saw one computer sitting with 130 errors, out of 226 downloaded. That one will be waiting a while now for the quota to build up ... ID: 1979566 ·

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1979574 - Posted: 9 Feb 2019, 16:26:18 UTC - in response to Message 1979159. uploads going through for me now. but i do have to kick start the work fetch cycle. every time this issue with the uploads happens, my linux systems stop trying to communicate with the project. the normal 5min communication deferred timer just goes away completely (as opposed to just getting pushed back to longer intervals like what happens with normal project maintenance). the windows systems seem to pick back up eventually I'm having similar problems. Reduced the number of reported tasks to 200 from 400 and will see if that works. A manual kick (punch) to the project-update button helped. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1979574 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1979575 - Posted: 9 Feb 2019, 16:30:14 UTC - in response to Message 1979574. Stopped tending the backoffs and went to bed finally last night. Woke up now to one host at 21 hours in backoff from the original 24 hour backoff. Restarted BOINC to get back to reporting the first 100 tasks and the normal 5 minute timer. Up to 1750 errored while downloading tasks now. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1979575 ·

betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66	Message 1979577 - Posted: 9 Feb 2019, 16:43:12 UTC - in response to Message 1979575. Up to 1750 errored while downloading tasks now. Delightful ID: 1979577 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1979581 - Posted: 9 Feb 2019, 17:19:50 UTC - in response to Message 1979575. Stopped tending the backoffs and went to bed finally last night. Woke up now to one host at 21 hours in backoff from the original 24 hour backoff. Restarted BOINC to get back to reporting the first 100 tasks and the normal 5 minute timer. You can always overrule the timer by simply clicking 'update'. That will report and reset the backoffs. The only thing you can't do (successfully) is request new work during a 5-minute backoff. The 5 minutes pause is requested and tracked by the server - if you click too early, you'll have to wait another 5 minutes before requesting work. But you can report completed work even during that closed period. ID: 1979581 ·
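The rule described above can be sketched as a toy model in Python. This is not BOINC's scheduler code; the class `ToyScheduler`, its method, and the constant name are invented for illustration. The only things taken from the post are the 5-minute interval, the fact that reports are accepted at any time, and the fact that contacting the server too early restarts the wait.

```python
from dataclasses import dataclass

# Assumed from the post: the server asks for 5 minutes between work requests.
MIN_REQUEST_INTERVAL = 300  # seconds


@dataclass
class ToyScheduler:
    """Toy model of the server-side timer; not the real BOINC scheduler."""

    last_contact: float = float("-inf")

    def handle_request(self, now, reporting, requesting_work):
        # The server, not the client, decides whether the contact came too soon.
        too_soon = (now - self.last_contact) < MIN_REQUEST_INTERVAL
        # Every contact restarts the wait, so an early click costs another 5 minutes.
        self.last_contact = now
        return {
            # Completed work is accepted even inside the closed period.
            "reports_accepted": reporting,
            # New work is only sent once the full interval has passed.
            "work_sent": requesting_work and not too_soon,
        }


if __name__ == "__main__":
    server = ToyScheduler()
    print(server.handle_request(0, reporting=True, requesting_work=True))
    print(server.handle_request(120, reporting=True, requesting_work=True))
    print(server.handle_request(360, reporting=False, requesting_work=True))
    print(server.handle_request(660, reporting=False, requesting_work=True))
```

In this model the click at 120 seconds reports work but gets none, and it also pushes the next successful work request out to 420 seconds or later, which is why the request at 360 seconds fails as well.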

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1979589 - Posted: 9 Feb 2019, 18:16:00 UTC I found that just doing "update" wasn't clearing the errored tasks out of the hosts. If I did just that, all it would do is report finished work and reset the backoff timer to somewhere between 4 and 24 hours depending on how many error tasks were in the system. The only surefire way to get to a proper 5 minute timer is to stop and restart BOINC. Then, depending on whether any new work received had lost tasks, it resets the backoff timer again in proportion to however many unrecoverable tasks were received. There are always around 3 to 4 now in every 200 task download. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1979589 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1979605 - Posted: 9 Feb 2019, 20:16:13 UTC - in response to Message 1979575. Up to 1750 errored while downloading tasks now. . . Only 383 here ... But considering the number of tasks you do that is probably about pro rata. Stephen <shrug> ID: 1979605 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1979626 - Posted: 9 Feb 2019, 22:58:30 UTC I'm wondering if Eric did some other tweaking after taking out the failed storage, or if it's been causing problems for a while now. It still took ages for the splitters to crank up again once work had run down, but once they got going the output from the splitters is like never before: minimum output is around 60/s, generally no less than 80/s. Most of the time around 110/s. So now it's a matter of minutes to refill the Ready-to-send buffer, with extended periods of no output required. Makes a change from the need for continuous splitter operation when the output was low. However we appear to be having issues with the AP & MB deleters again as their backlogs continue to grow (in the case of AP after the Validators cleared their backlog). Grant Darwin NT ID: 1979626 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1979627 - Posted: 9 Feb 2019, 23:04:05 UTC - in response to Message 1979589. Last modified: 9 Feb 2019, 23:08:25 UTC I found that just doing "update" wasn't clearing the errored tasks out of the hosts. If I did just that, all it would do is report finished work and reset the backoff timer to somewhere between 4 and 24 hours depending on how many error tasks were in the system. The only surefire way to get to a proper 5 minute timer is to stop and restart BOINC. Then, depending on whether any new work received had lost tasks, it resets the backoff timer again in proportion to however many unrecoverable tasks were received. There are always around 3 to 4 now in every 200 task download. Must be something to do with the Linux client (or the large number of failed download tasks you're getting allocated to your systems), as I'm not seeing that behaviour on my Windows systems. Even at its worst, I only had to hit Update & that would report those Errored WUs, and the timer would be reset to 5min. About the longest backoff I noticed was around 3hrs. Generally they were between 15min & 2-2.5hrs. My numbers of Errored & Invalid work continue to grow, although the Errored WUs' rate of growth has slowed significantly. Invalids are just going to pop up here & there each day as those missing WUs get re-allocated over the next few months. I hope Eric can run a query to catch all these WUs affected by the missing download file and re-issue them. Grant Darwin NT ID: 1979627 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1979629 - Posted: 9 Feb 2019, 23:38:07 UTC - in response to Message 1979627. Must be something to do with the Linux client (or the large number of failed download tasks you're getting allocated to your systems), as I'm not seeing that behaviour on my Windows systems. I hope Eric can run a query to catch all these WUs affected by the missing download file and re-issue them. . . It may be the version of Linux BOINC manager/client he is running. On my systems running an older Linux client, like you, I have found that hitting the update button reports and clears the failed downloads and restores the default back-off timer. . . If the tasks that fail have in fact been totally lost when the storage device failed then it seems to me they would need to be re-split, unless split tasks are archived before they go to dispersal. Stephen . . ID: 1979629 ·

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1979632 - Posted: 10 Feb 2019, 0:10:13 UTC - in response to Message 1979581. Stopped tending the backoffs and went to bed finally last night. Woke up now to one host at 21 hours in backoff from the original 24 hour backoff. Restarted BOINC to get back to reporting the first 100 tasks and the normal 5 minute timer. You can always overrule the timer by simply clicking 'update'. That will report and reset the backoffs. The only thing you can't do (successfully) is request new work during a 5-minute backoff. The 5 minutes pause is requested and tracked by the server - if you click too early, you'll have to wait another 5 minutes before requesting work. But you can report completed work even during that closed period. I cannot click 'update' to request new work when I am asleep or during work hours. Something is 'wrong'. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1979632 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1979633 - Posted: 10 Feb 2019, 0:17:15 UTC - in response to Message 1979626. Last modified: 10 Feb 2019, 0:36:26 UTC However we appear to be having issues with the AP & MB deleters again as their backlogs continue to grow (in the case of AP after the Validators cleared their backlog). The file deleters on Georgem have been turned off since the hard drive failure on that machine. That is why the backlog continues to grow. Might have something to do with the high splitter output and the fact the replica zeroed out and has not started climbing back up as usual. Less I/O contention maybe? My large quantity of failed tasks is just because of my high turnaround rate. I ask for more work than most so am likely to get more of the missing tasks. Need to go service another couple of hosts: I see one in a 50 minute backoff now and another one that is not showing a 5 minute timer at all and hasn't contacted the server for an hour. See lots of pink Download Errors in the Tasks list for that machine with BoincTasks. [Edit] I must have a magnet. On the host that was in backoff for 35 minutes, I reported 100 tasks and got 100 in return; 30 of them were unrecoverable download errors. This is getting tiresome. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1979633 ·
