Panic Mode On (114) Server Problems?

Message boards : Number crunching : Panic Mode On (114) Server Problems?
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1979491 - Posted: 9 Feb 2019, 1:03:03 UTC - in response to Message 1979485.  
Last modified: 9 Feb 2019, 1:05:22 UTC

There aren't any failed download tasks to abort. All you see is the message in the Event Log of giving up on the task.

Fri 08 Feb 2019 04:42:46 PM PST | SETI@home | Giving up on download of 12dc06af.32725.1295.12.39.192.vlar: permanent HTTP error
Fri 08 Feb 2019 04:42:46 PM PST | SETI@home | Giving up on download of 12dc06af.6741.172935.15.42.62.vlar: permanent HTTP error
Fri 08 Feb 2019 04:42:46 PM PST | SETI@home | Giving up on download of 12dc06af.11999.1704.14.41.109: permanent HTTP error
Fri 08 Feb 2019 04:42:46 PM PST | SETI@home | Giving up on download of 02no11ab.8915.11110.7.34.154: permanent HTTP error

But you then get forced into 24 hour backoff. You have to wait out another five minutes to again contact the schedulers and hope you don't get more lost tasks.

But they never clear; they remain in the list of Tasks. That's why I suggested aborting them.

When you update in this case, the backoff goes back to the normal 5 min.

<edit> Yes, the DL is back to normal now.
ID: 1979491 · Report as offensive
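
The Event Log lines quoted above have a regular shape, so they can be picked out mechanically. What follows is only a minimal Python sketch: it assumes the columns are always separated by " | " as in the excerpt, and the names `GIVE_UP`, `failed_downloads` and `sample_log` are made up for illustration. It is not part of BOINC.

```python
import re

# Matches the "Giving up on download" Event Log lines quoted above.
# Assumption: columns are separated by " | ", as in the excerpt.
GIVE_UP = re.compile(
    r"^(?P<when>.+?) \| (?P<project>.+?) \| "
    r"Giving up on download of (?P<file>\S+): (?P<reason>.+)$"
)


def failed_downloads(lines):
    """Return (project, file, reason) for each abandoned download."""
    found = []
    for line in lines:
        match = GIVE_UP.match(line.strip())
        if match:
            found.append(
                (match.group("project"), match.group("file"), match.group("reason"))
            )
    return found


# Two lines copied from the excerpt, plus one unrelated line invented for contrast.
sample_log = [
    "Fri 08 Feb 2019 04:42:46 PM PST | SETI@home | Giving up on download of 12dc06af.32725.1295.12.39.192.vlar: permanent HTTP error",
    "Fri 08 Feb 2019 04:42:46 PM PST | SETI@home | Giving up on download of 02no11ab.8915.11110.7.34.154: permanent HTTP error",
    "Fri 08 Feb 2019 04:42:47 PM PST | SETI@home | Scheduler request completed",
]

for project, name, reason in failed_downloads(sample_log):
    print(f"{project}: {name} ({reason})")
```

A list like this only shows which files the client gave up on. As the replies point out, those tasks may never appear in the local Tasks list at all.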
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979493 - Posted: 9 Feb 2019, 1:08:59 UTC - in response to Message 1979491.  

Again, they are NOT in the list of Tasks. Only the server has a record of them on your machine. You never got them. The only way to get rid of them is to stop BOINC, wait five minutes, contact the scheduler, get the master list successfully downloaded, get benchmarked, and finally have them removed from the database for your host.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979493 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979494 - Posted: 9 Feb 2019, 1:11:05 UTC

I have one host that constantly has been getting the short straw and receiving no new work. Just about out. The other hosts are getting a reasonable 1 for 1 return. Still not back to full though.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979494 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1979495 - Posted: 9 Feb 2019, 1:12:18 UTC - in response to Message 1979491.  
Last modified: 9 Feb 2019, 1:14:27 UTC

There aren't any failed download tasks to abort. All you see is the message in the Event Log of giving up on the task.

Fri 08 Feb 2019 04:42:46 PM PST | SETI@home | Giving up on download of 12dc06af.32725.1295.12.39.192.vlar: permanent HTTP error
Fri 08 Feb 2019 04:42:46 PM PST | SETI@home | Giving up on download of 12dc06af.6741.172935.15.42.62.vlar: permanent HTTP error
Fri 08 Feb 2019 04:42:46 PM PST | SETI@home | Giving up on download of 12dc06af.11999.1704.14.41.109: permanent HTTP error
Fri 08 Feb 2019 04:42:46 PM PST | SETI@home | Giving up on download of 02no11ab.8915.11110.7.34.154: permanent HTTP error

But you then get forced into 24 hour backoff. You have to wait out another five minutes to again contact the schedulers and hope you don't get more lost tasks.

But they never clear; they remain in the list of Tasks. That's why I suggested aborting them.


What's happening now is that you request new work, you get new WUs, and many of them are resends. Some are normal resends; others are resends because of the download problems (all those _3, _4, _5s are a good hint). But as soon as they are allocated, they start to download. And in a matter of seconds (now that downloads are no longer moving at a crawl) they error out with a permanent HTTP error.
There's no time to abort them. And there are still plenty of _2, _3 resend WUs that aren't due to download errors, so aborting those would be aborting good work.
And aborting work counts against your Error count anyway (although at least you don't get ridiculous backoff times).

Edit: the long backoff only occurs after the download errors out. If you wait 5 min and Update, they get cleared from your system, new work gets allocated (or not), and you're back to 5 min between Scheduler requests.
Till the next batch of un-downloadable WUs, of course.
Grant
Darwin NT
ID: 1979495 · Report as offensive
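
The "_3, _4, _5" hint above refers to the numeric suffix on a task name. A minimal Python sketch of reading that suffix follows. It assumes each workunit initially goes out as two copies (_0 and _1), so an index of 2 or more marks a resend; that replication count is an assumption, not something stated in the thread. The function names are hypothetical, and the task names are built from file names in the Event Log excerpt with suffixes added for illustration.

```python
import re

# A task name is taken to be "<workunit name>_<copy index>".
_SUFFIX = re.compile(r"^(?P<wu>.+)_(?P<index>\d+)$")


def split_task_name(task_name):
    """Split a task name into (workunit name, copy index)."""
    match = _SUFFIX.match(task_name)
    if match is None:
        raise ValueError(f"no numeric suffix in {task_name!r}")
    return match.group("wu"), int(match.group("index"))


def is_resend(task_name, initial_copies=2):
    """True if the copy index is beyond the initial batch (assumed to be 2)."""
    _, index = split_task_name(task_name)
    return index >= initial_copies


# Suffixes here are invented for illustration.
for name in (
    "12dc06af.11999.1704.14.41.109_0",
    "12dc06af.11999.1704.14.41.109_2",
    "02no11ab.8915.11110.7.34.154_4",
):
    print(name, "resend" if is_resend(name) else "initial copy")
```

The suffix only says that a task is a resend, not why. A _2 or _3 can be an ordinary resend just as easily as a casualty of the download errors, which is the reason given above for not aborting them blindly.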
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1979496 - Posted: 9 Feb 2019, 1:17:16 UTC - in response to Message 1979494.  
Last modified: 9 Feb 2019, 1:18:39 UTC

I have one host that constantly has been getting the short straw and receiving no new work. Just about out. The other hosts are getting a reasonable 1 for 1 return. Still not back to full though.

Splitters have only just started putting out a more reasonable amount of work. Even so, it's been easier to get work after this unscheduled outage than it was after the weekly outage. Generally, every few requests get some work. After the weekly outage it took hours, and even when work finally started to be allocated it took something like 7-20 requests to get more.
Grant
Darwin NT
ID: 1979496 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979497 - Posted: 9 Feb 2019, 1:21:01 UTC - in response to Message 1979495.  

Edit: the long backoff only occurs after the download errors out. If you wait 5 min and Update, they get cleared from your system, new work gets allocated (or not), and you're back to 5 min between Scheduler requests.
Till the next batch of un-downloadable WUs, of course.

Exactallly . . . . LOL
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979497 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1979502 - Posted: 9 Feb 2019, 1:41:26 UTC

Wild speculation time.

Eric mentioned 15% of the storage space was affected? So with 600,000 WUs Ready-to-send, that's around 90,000 WUs MIA. Received in the last hour is around 110,000, but a system that gets one of these missing WUs allocated can end up with some very long timeouts, and a great many people don't keep a close eye on their systems.
So it could easily take 48+hrs for them to finally be cleared from the system.
Grant
Darwin NT
ID: 1979502 · Report as offensive
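
The estimate above is simple arithmetic, and it can be pushed one speculative step further. The Python sketch below reproduces the 90,000 figure. It then adds an assumption that is not in the thread: that the missing WUs are spread evenly and independently through whatever the scheduler hands out. Under that assumption it estimates how likely a batch of new work is to include at least one un-downloadable WU. The variable and function names are made up for illustration.

```python
# Figures quoted in the post above.
ready_to_send = 600_000      # WUs in the Ready-to-send queue
affected_fraction = 0.15     # share of storage reported as affected

# 15% of 600,000: the "around 90,000 WUs MIA" estimate.
missing_wus = round(ready_to_send * affected_fraction)
print(f"Estimated missing WUs: {missing_wus:,}")


def chance_of_bad_batch(batch_size, bad_fraction=affected_fraction):
    """Probability that a batch holds at least one missing WU.

    Assumption (not from the thread): each WU in the batch is
    independently missing with probability bad_fraction.
    """
    return 1.0 - (1.0 - bad_fraction) ** batch_size


for size in (1, 5, 10):
    print(f"batch of {size:2d}: {chance_of_bad_batch(size):.0%} chance of at least one bad WU")
```

If that assumption is anywhere near right, even modest batches are more likely than not to contain a bad WU, which fits the reports here of hosts repeatedly landing in the long backoff.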
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1979504 - Posted: 9 Feb 2019, 1:47:20 UTC - in response to Message 1979494.  

I have one host that constantly has been getting the short straw and receiving no new work. Just about out. The other hosts are getting a reasonable 1 for 1 return. Still not back to full though.

It really is rather strange to watch.
I have 1 system that struggles to get work, but when it does it's all good. The other system gets work pretty much each time it requests it, but without fail it gets many of those WUs destined for download failure.
Grant
Darwin NT
ID: 1979504 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1979505 - Posted: 9 Feb 2019, 1:50:35 UTC

I've been getting a lot of "Project has no tasks available"

:/
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1979505 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1979506 - Posted: 9 Feb 2019, 1:58:28 UTC

. . @ Unixchick

. . There has been a new series of tapes mounted. Blc31 tapes ... but they are from the same day :) LOL.

Stephen

:)
ID: 1979506 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979509 - Posted: 9 Feb 2019, 2:01:10 UTC - in response to Message 1979506.  

I saw the new tape loaded for the weekend. And the splitters are finally ramping up. Maybe we will start to make a dent in the unfilled caches.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979509 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34773
Credit: 261,360,520
RAC: 489
Australia
Message 1979517 - Posted: 9 Feb 2019, 3:15:54 UTC

Just after I managed to update and get my ready-to-download caches filled back up, right before daybreak this morning, the system went down. :-(

Then I decided to go into town and do a few things then go to the pub to have a few beers and a feed. On arriving home all was back to normal.

So it proves that taking a break and a few beers will fix almost anything. :-D

Cheers.
ID: 1979517 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979520 - Posted: 9 Feb 2019, 3:42:01 UTC

Not quite to that point yet. Still getting the occasional 24 hour backoff that needs tending.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979520 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1979527 - Posted: 9 Feb 2019, 4:41:52 UTC
Last modified: 9 Feb 2019, 4:42:20 UTC

Just picked up my first Invalid from this issue.
It was an Inconclusive, but after that everything else ended up as a download error and it ended up as "Completed, can't validate"
Grant
Darwin NT
ID: 1979527 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979529 - Posted: 9 Feb 2019, 4:56:59 UTC - in response to Message 1979527.  
Last modified: 9 Feb 2019, 5:03:59 UTC

Just picked up my first Invalid from this issue.
It was an Inconclusive, but after that everything else ended up as a download error and it ended up as "Completed, can't validate"

Bummer
[Edit] 11 completed, can't validate so far on my farm
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979529 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1979534 - Posted: 9 Feb 2019, 5:50:17 UTC - in response to Message 1979517.  

So it proves that taking a break and a few beers will fix almost anything. :-D
Cheers.


. . Well of course ... :)

Stephen

:)
ID: 1979534 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1979535 - Posted: 9 Feb 2019, 5:53:14 UTC - in response to Message 1979419.  

I shall fade off into the background again.
I did my bit, and was successful in contacting an admin to kick the servers.
Be thankful, and that is all I shall ask.
Till we meet again.

Meow.


You still are a +2 to all of us.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1979535 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1979536 - Posted: 9 Feb 2019, 5:54:29 UTC - in response to Message 1979534.  

So it proves that taking a break and a few beers will fix almost anything. :-D
Cheers.


. . Well of course ... :)

Stephen

:)


Just got off work, haven't had the beer yet. But when I hit a manual update it started inhaling tons of tasks. It must be the weekend (someplace).

Tom
A proud member of the OFA (Old Farts Association).
ID: 1979536 · Report as offensive
12kpp
Volunteer tester

Joined: 20 Mar 08
Posts: 1
Credit: 4,903,051
RAC: 0
Russia
Message 1979539 - Posted: 9 Feb 2019, 6:33:31 UTC

http://setiathome.berkeley.edu/workunit.php?wuid=3328702611 This old WU could not complete due to the server crash =(
ID: 1979539 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1979542 - Posted: 9 Feb 2019, 7:10:16 UTC

Minor panic when I got home just now: several WUs were in project backoff, not having been able to download. I retried the pending transfers and luckily they then cleared without further hiccups.
Grant
Darwin NT
ID: 1979542 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.