Panic Mode On (114) Server Problems?

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979543 - Posted: 9 Feb 2019, 7:17:51 UTC
Last modified: 9 Feb 2019, 7:19:51 UTC

I'm still dealing with unrecoverable downloads and getting large 4 hour backoffs. Bah humbug.

[Edit] Around ~1700 download errors accumulated so far.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979543 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1979545 - Posted: 9 Feb 2019, 7:58:41 UTC - in response to Message 1979543.  

I'm still dealing with unrecoverable downloads and getting large 4 hour backoffs. Bah humbug.

[Edit] Around ~1700 download errors accumulated so far.


. . I had one rig generate 124 ghosted tasks during the server spasms so I started ghost recovery. I managed to get 60 resends but the next attempt resulted in the remaining 64 giving "failed to resend : expired", which is not what I was aiming for but at least gets them out of limbo. Downside: I am now unable to get new work; instead I'm getting "you have completed daily quota of 'x' WUs". The worst part is the value of 'x' only goes up by 1 even when I report 4 or more completed tasks. I just hope 'x' gets to the desired value before I completely run out of work. :(

Stephen

:(
ID: 1979545 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1979547 - Posted: 9 Feb 2019, 8:42:44 UTC

Everything normal on my 2 machines.

37 errors across the 2 machines, both now with full caches and tasks validating.
ID: 1979547 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1979548 - Posted: 9 Feb 2019, 8:46:50 UTC - in response to Message 1979545.  

The worst part is the value of 'x' only goes up by 1 even when I report 4 or more completed tasks.

As they Validate, then X will increase significantly for each one.
Grant
Darwin NT
ID: 1979548 · Report as offensive
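The quota behaviour discussed in the two posts above can be pictured with a toy model. This is only a sketch under stated assumptions: the thread says the quota rises by 1 per report and "significantly" per validated task, so the doubling on validation, the cap of 100, and every name below are hypothetical choices for illustration, not the actual server logic.

```python
# Toy model of a per-host daily quota. NOT the real BOINC scheduler code.
# Assumptions (hypothetical): +1 per reporting request, doubling per
# validated task, capped at QUOTA_CAP.
QUOTA_CAP = 100  # hypothetical project-wide maximum


def on_report(quota: int) -> int:
    """A reporting request nudges the quota up by one, however many tasks it carries."""
    return min(quota + 1, QUOTA_CAP)


def on_validate(quota: int) -> int:
    """A validated task raises the quota sharply (assumed here: doubling)."""
    return min(quota * 2, QUOTA_CAP)


def simulate(start: int, events: list[str]) -> int:
    """Apply a sequence of 'report' / 'validate' events to a starting quota."""
    quota = start
    for event in events:
        if event == "report":
            quota = on_report(quota)
        elif event == "validate":
            quota = on_validate(quota)
        else:
            raise ValueError(f"unknown event: {event}")
    return quota


if __name__ == "__main__":
    print(simulate(1, ["report"] * 4))    # four reporting requests
    print(simulate(1, ["validate"] * 4))  # four validated tasks
```

Under these assumptions, a handful of validations recovers the quota far faster than the same number of reports, which matches the advice to wait for tasks to validate.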
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1979552 - Posted: 9 Feb 2019, 8:53:49 UTC - in response to Message 1979548.  

The worst part is the value of 'x' only goes up by 1 even when I report 4 or more completed tasks.

As they Validate, then X will increase significantly for each one.


. . Thanks,

Stephen

.
.
ID: 1979552 · Report as offensive
Ghia
Joined: 7 Feb 17
Posts: 238
Credit: 28,911,438
RAC: 50
Norway
Message 1979557 - Posted: 9 Feb 2019, 10:13:00 UTC

Haven't had an error in months... the DL problems produced 38 on my one cruncher.
Most are flagged as "Error while downloading", while some are just flagged as "Error".
Both types have the same Stderr output :

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>02no11ab.3519.9065.7.34.28</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>

</message>
]]>
Nothing to do about these, right ?
Humans may rule the world...but bacteria run it...
ID: 1979557 · Report as offensive
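For anyone wanting to tally these failures across many tasks, the Stderr text quoted in the post above can be picked apart with a short script. This is a minimal sketch: the fragment is not well-formed XML on its own (no single root element), so it uses a regular expression instead of an XML parser. The function name `parse_xfer_errors` and the result layout are made up for illustration.

```python
import re

# Stderr text copied from the post above.
STDERR = """<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>02no11ab.3519.9065.7.34.28</file_name>
<error_code>-224 (permanent HTTP error)</error_code>
<error_message>permanent HTTP error</error_message>
</file_xfer_error>

</message>
]]>"""

# One match per <file_xfer_error> block; DOTALL lets '.' span newlines.
XFER_ERROR = re.compile(
    r"<file_xfer_error>.*?"
    r"<file_name>(?P<name>.*?)</file_name>.*?"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<text>.*?)\)</error_code>",
    re.DOTALL,
)


def parse_xfer_errors(stderr: str) -> list[dict]:
    """Return file name, numeric code and description for each transfer error."""
    return [
        {"file": m["name"], "code": int(m["code"]), "text": m["text"]}
        for m in XFER_ERROR.finditer(stderr)
    ]


if __name__ == "__main__":
    for err in parse_xfer_errors(STDERR):
        print(err)
```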
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1979558 - Posted: 9 Feb 2019, 11:02:47 UTC - in response to Message 1979557.  

Nothing to do about these, right ?
Correct - except possibly read the front-page news.
ID: 1979558 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1979560 - Posted: 9 Feb 2019, 11:20:00 UTC - in response to Message 1979557.  
Last modified: 9 Feb 2019, 11:26:42 UTC

Nothing to do about these, right ?

Nope.
As the WU gets re-allocated & Errors out it will eventually be declared a dud (Too many errors (may have bug)) and after being Assimilated, Deleted & Purged will eventually disappear from your Task list. However you can also look forward to getting a few Invalids as WUs that are Pending Validation or Inconclusive that were on the failed storage get re-issued and then error out due to not being there to download.

My WAG (Wild Arse Guess) is that most of the outright Errors should be mostly done in around 48 hours.
However we can look forward to getting at least a few Invalids for several months to come as those that are Pending get re-issued to another host when the original wingmate doesn't return the result for the WU they were able to download before this issue occurred (which will of course result in another batch of Errors due to the WU not being available for download for each of the WUs this occurs on).
Something to keep in mind for a while.
Grant
Darwin NT
ID: 1979560 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1979566 - Posted: 9 Feb 2019, 13:19:50 UTC - in response to Message 1979557.  

Only 38 errors, that's good :)
I just saw one computer sitting with 130 errors, out of 226 downloaded.
That will be waiting awhile now for the quota to build up ...
ID: 1979566 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1979574 - Posted: 9 Feb 2019, 16:26:18 UTC - in response to Message 1979159.  

uploads going through for me now. but i do have to kick start the work fetch cycle.

every time this issue with the uploads happens, my linux systems stop trying to communicate with the project. the normal 5min communication deferred timer just goes away completely (as opposed to just getting pushed back to longer intervals like what happens with normal project maintenance).

the windows systems seem to pick back up eventually


I'm having similar problems.
Reduced the number of reported tasks to 200 from 400 and will see if that works.
A manual kick (punch) to project-update button helped.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1979574 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979575 - Posted: 9 Feb 2019, 16:30:14 UTC - in response to Message 1979574.  

Stopped tending the backoffs and went to bed finally last night. Woke up now to one host at 21 hours in backoff from the original 24 hour backoff. Restarted BOINC to get back to reporting the first 100 tasks and the normal 5 minute timer.

Up to 1750 errored while downloading tasks now.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979575 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1979577 - Posted: 9 Feb 2019, 16:43:12 UTC - in response to Message 1979575.  

Up to 1750 errored while downloading tasks now.

Delightful
ID: 1979577 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1979581 - Posted: 9 Feb 2019, 17:19:50 UTC - in response to Message 1979575.  

Stopped tending the backoffs and went to bed finally last night. Woke up now to one host at 21 hours in backoff from the original 24 hour backoff. Restarted BOINC to get back to reporting the first 100 tasks and the normal 5 minute timer.
You can always overrule the timer by simply clicking 'update'. That will report and reset the backoffs.

The only thing you can't do (successfully) is request new work during a 5-minute backoff. The 5 minutes pause is requested and tracked by the server - if you click too early, you'll have to wait another 5 minutes before requesting work. But you can report completed work even during that closed period.
ID: 1979581 · Report as offensive
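The server-tracked 5-minute pause described in the post above can be sketched as a small simulation. This is a toy stand-in, not the real scheduler: the class name `ToyScheduler`, the `contact` method and the reply layout are hypothetical, and it assumes that a report-only contact leaves the window untouched, which the thread does not state.

```python
from dataclasses import dataclass

MIN_REQUEST_INTERVAL = 300.0  # seconds; the 5-minute pause mentioned above


@dataclass
class ToyScheduler:
    """Toy stand-in for the server side; not the real BOINC scheduler."""

    next_work_allowed: float = 0.0  # earliest time a work request succeeds

    def contact(self, now: float, reporting: int, want_work: bool) -> dict:
        """Handle one scheduler request made at time `now` (in seconds)."""
        # Reporting completed work is always accepted.
        reply = {"reported": reporting, "work_sent": False}
        if want_work:
            if now >= self.next_work_allowed:
                reply["work_sent"] = True
            # Whether or not work was sent, the window restarts from this
            # request, so asking too early costs another full wait.
            self.next_work_allowed = now + MIN_REQUEST_INTERVAL
        return reply


if __name__ == "__main__":
    server = ToyScheduler()
    print(server.contact(0, reporting=0, want_work=True))    # work granted
    print(server.contact(100, reporting=5, want_work=True))  # too early
    print(server.contact(350, reporting=0, want_work=True))  # still too early
    print(server.contact(650, reporting=0, want_work=True))  # work granted
```

In this model the request at 350 s fails even though 5 minutes have passed since the first one, because the early click at 100 s restarted the window.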
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979589 - Posted: 9 Feb 2019, 18:16:00 UTC

I found that just doing "update" wasn't clearing the errored tasks out of the hosts. If I did just that, all it would do is report finished work and reset the backoff timer to somewhere between 24 hours and 4 hours depending on how many error tasks were in the system. The only certain surefire way to get to a proper 5 minute timer is stop and restart BOINC. Then depending on whether any new work received had lost tasks, it reset the backoff timer again proportionally to however many unrecoverable tasks were received. There are always around 3 to 4 now in every 200 task download.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979589 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1979605 - Posted: 9 Feb 2019, 20:16:13 UTC - in response to Message 1979575.  

Up to 1750 errored while downloading tasks now.


. . Only 383 here ... But considering the number of tasks you do that is probably about pro rata.

Stephen

<shrug>
ID: 1979605 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1979626 - Posted: 9 Feb 2019, 22:58:30 UTC

I'm wondering if Eric did some other tweaking after taking out the failed storage, or if it's been causing problems for a while now.
It still took ages for the splitters to crank up again once work had run down, but once they got going the output from the splitters is like never before- minimum output is around 60/s, generally no less than 80/s. Most of the time around 110/s.
So now it's a matter of minutes to refill the Ready-to-send buffer, with extended periods of no output required. Makes a change from the need for continuous splitter operation when the output was low.

However we appear to be having issues with the AP & MB deleters again as their backlogs continue to grow (in the case of AP, after the Validators cleared their backlog).
Grant
Darwin NT
ID: 1979626 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1979627 - Posted: 9 Feb 2019, 23:04:05 UTC - in response to Message 1979589.  
Last modified: 9 Feb 2019, 23:08:25 UTC

I found that just doing "update" wasn't clearing the errored tasks out of the hosts. If I did just that, all it would do is report finished work and reset the backoff timer to somewhere between 24 hours and 4 hours depending on how many error tasks were in the system. The only certain surefire way to get to a proper 5 minute timer is stop and restart BOINC. Then depending on whether any new work received had lost tasks, it reset the backoff timer again proportionally to however many unrecoverable tasks were received. There are always around 3 to 4 now in every 200 task download.

Must be something to do with the Linux client (or the large number of failed download tasks getting allocated to your systems), as I'm not seeing that behaviour on my Windows systems.
Even at its worst, I only had to hit Update & that would report those Errored WUs, and the timer would be reset to 5min. About the longest backoff I noticed was around 3hrs. Generally they were between 15min & 2-2.5hrs.


My numbers of Errored & Invalid work continue to grow, although Errored WUs rate of growth has slowed significantly. Invalids are just going to pop up here & there each day as those missing WUs get re-allocated over the next few months.

I hope Eric can run a query to catch all these WUs affected by the missing download file and re-issue them.
Grant
Darwin NT
ID: 1979627 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1979629 - Posted: 9 Feb 2019, 23:38:07 UTC - in response to Message 1979627.  

Must be something to do with the Linux client (or the large number of failed download tasks getting allocated to your systems), as I'm not seeing that behaviour on my Windows systems.

I hope Eric can run a query to catch all these WUs affected by the missing download file and re-issue them.


. . It may be the version of Linux BOINC manager/client he is running. On my systems running an older Linux client, like you, I have found that hitting the update button reports and clears the failed downloads and restores the default back-off timer.

. . If the tasks that fail have in fact been totally lost when the storage device failed then it seems to me they would need to be re-split unless split tasks are archived before they go to dispersal.

Stephen

. .
ID: 1979629 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1979632 - Posted: 10 Feb 2019, 0:10:13 UTC - in response to Message 1979581.  

Stopped tending the backoffs and went to bed finally last night. Woke up now to one host at 21 hours in backoff from the original 24 hour backoff. Restarted BOINC to get back to reporting the first 100 tasks and the normal 5 minute timer.
You can always overrule the timer by simply clicking 'update'. That will report and reset the backoffs.

The only thing you can't do (successfully) is request new work during a 5-minute backoff. The 5 minutes pause is requested and tracked by the server - if you click too early, you'll have to wait another 5 minutes before requesting work. But you can report completed work even during that closed period.


I can not request new work when I am asleep or during work hours.

Something is 'wrong'.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1979632 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979633 - Posted: 10 Feb 2019, 0:17:15 UTC - in response to Message 1979626.  
Last modified: 10 Feb 2019, 0:36:26 UTC

However we appear to be having issues with the AP & MB deleters again as their backlogs continue to grow (in the case of AP, after the Validators cleared their backlog).


The file deleters on Georgem have been turned off since the hard drive failure on that machine. That is why the backlog continues to grow. Might have something to do with the high splitter output and the fact the replica zeroed out and has not started climbing back up as usual. Less I/O contention maybe?

My large quantity of failed tasks is just because of my high turnaround rate. I ask for more work than most so am likely to get more of the missing tasks. Need to go service another couple of hosts I see with one in a 50 minute backoff now and another one that is not showing a 5 minute timer at all and hasn't contacted the server for an hour. See lots of pink Download Errors in the Tasks list for that machine with BoincTasks.

[Edit] I must have a magnet. On the host that was in backoff for 35 minutes, I reported 100 tasks and got 100 in return; 30 of them were unrecoverable download errors. This is getting tiresome.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979633 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.