Panic Mode On (79) Server Problems?

Message boards : Number crunching : Panic Mode On (79) Server Problems?

Previous · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 . . . 22 · Next

kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1308098 - Posted: 20 Nov 2012, 14:02:45 UTC

Hopefully it's ghost task reissues, which should be returned a bit more quickly by hungry hosts, helping to clean up the database.
Results in the field and results awaiting validation have both been dropping.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1308098
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1308106 - Posted: 20 Nov 2012, 14:40:14 UTC - in response to Message 1308023.  

On a side note: since switching from 6.2.19 to 6.10.58, I have noticed that my cache is not being processed in FIFO order. It is all APs... about 17 days' worth, and APs with a deadline four days sooner than the ones that keep getting picked to run next are still sitting there, not getting started.

I know there are cache/queue changes along the way through the build history, but each WU has a 25-day deadline, so wouldn't it still make sense to run the soonest deadlines first (which also happen to be the ones that were acquired first)? I mean, I'm sure it works out in the end, but it's just weird.

Not true. FIFO does not necessarily equate to EDF (earliest deadline first). A task's deadline is determined by how long it is estimated to take to run. So it is possible to download a bunch of tasks yesterday that are estimated to take 20 hours to run and have deadlines in late January, and then a bunch today that are estimated to take 1 hour and have deadlines in mid-December. And don't forget, if you're running more than one project, BOINC has to balance all of them, and different projects do their time estimates and deadlines differently.

However, if all of the tasks you're looking at have the same time estimate, then I agree, it is weird for them not to run FIFO, which presumably is also EDF.
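David's point can be sketched in a few lines of Python. This is a purely hypothetical illustration with made-up batch names, dates, and deadlines (not BOINC's actual scheduler code): two batches downloaded a day apart sort one way by arrival and the opposite way by deadline.

```python
from datetime import datetime

# Hypothetical batches matching the example above: the earlier download has
# the longer estimated runtime, and therefore the *later* deadline.
tasks = [
    {"name": "long-batch",  "downloaded": datetime(2012, 11, 19),
     "est_hours": 20, "deadline": datetime(2013, 1, 25)},
    {"name": "short-batch", "downloaded": datetime(2012, 11, 20),
     "est_hours": 1,  "deadline": datetime(2012, 12, 15)},
]

# FIFO: oldest download first.  EDF: earliest deadline first.
fifo_order = [t["name"] for t in sorted(tasks, key=lambda t: t["downloaded"])]
edf_order  = [t["name"] for t in sorted(tasks, key=lambda t: t["deadline"])]

print(fifo_order)  # ['long-batch', 'short-batch']
print(edf_order)   # ['short-batch', 'long-batch']
```

With equal time estimates (and hence equal deadline offsets from issue), the two orders coincide, which is why a cache of identical APs would be expected to run FIFO.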

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1308106
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1308127 - Posted: 20 Nov 2012, 16:02:34 UTC - in response to Message 1308106.  

On a side note: since switching from 6.2.19 to 6.10.58, I have noticed that my cache is not being processed in FIFO order. It is all APs... about 17 days' worth, and APs with a deadline four days sooner than the ones that keep getting picked to run next are still sitting there, not getting started.

I know there are cache/queue changes along the way through the build history, but each WU has a 25-day deadline, so wouldn't it still make sense to run the soonest deadlines first (which also happen to be the ones that were acquired first)? I mean, I'm sure it works out in the end, but it's just weird.

Not true. FIFO does not necessarily equate to EDF (earliest deadline first). A task's deadline is determined by how long it is estimated to take to run. So it is possible to download a bunch of tasks yesterday that are estimated to take 20 hours to run and have deadlines in late January, and then a bunch today that are estimated to take 1 hour and have deadlines in mid-December. And don't forget, if you're running more than one project, BOINC has to balance all of them, and different projects do their time estimates and deadlines differently.

However, if all of the tasks you're looking at have the same time estimate, then I agree, it is weird for them not to run FIFO, which presumably is also EDF.

If tasks from the same download batch don't appear to run in FIFO (and be very careful to observe that you haven't applied a sort order to one of the columns in BOINC Manager, before you jump to that conclusion), then it's a long-standing bug which applies some slight randomisation to the display order when data is transferred from the server to the BOINC Client to the BOINC Manager. In short, it's cosmetic only.

BOINC v6.10.58 is still a very old version. We applied a lot of pressure to get that bug (and many others) fixed - I forget just when. The latest ones - I'm running v7.0.38 - have had display order and running order in perfect step for a long time - possibly even since sometime in the v6.12.xx range - but I wouldn't advise upgrading just for this. Like I said, it's cosmetic only.
ID: 1308127
fscheel

Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1308180 - Posted: 20 Nov 2012, 23:19:34 UTC

Welp... Looks like the cricket graph just bottomed out. :(
ID: 1308180
ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1308181 - Posted: 20 Nov 2012, 23:21:07 UTC - in response to Message 1308180.  
Last modified: 20 Nov 2012, 23:22:02 UTC

Welp... Looks like the cricket graph just bottomed out. :(

Just waiting for the next server status update...
[Edit] Which is almost totally green...
ID: 1308181
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1308237 - Posted: 21 Nov 2012, 3:25:40 UTC - in response to Message 1308127.  

On a side note: since switching from 6.2.19 to 6.10.58, I have noticed that my cache is not being processed in FIFO order. It is all APs... about 17 days' worth, and APs with a deadline four days sooner than the ones that keep getting picked to run next are still sitting there, not getting started.

I know there are cache/queue changes along the way through the build history, but each WU has a 25-day deadline, so wouldn't it still make sense to run the soonest deadlines first (which also happen to be the ones that were acquired first)? I mean, I'm sure it works out in the end, but it's just weird.

Not true. FIFO does not necessarily equate to EDF (earliest deadline first). A task's deadline is determined by how long it is estimated to take to run. So it is possible to download a bunch of tasks yesterday that are estimated to take 20 hours to run and have deadlines in late January, and then a bunch today that are estimated to take 1 hour and have deadlines in mid-December. And don't forget, if you're running more than one project, BOINC has to balance all of them, and different projects do their time estimates and deadlines differently.

However, if all of the tasks you're looking at have the same time estimate, then I agree, it is weird for them not to run FIFO, which presumably is also EDF.

If tasks from the same download batch don't appear to run in FIFO (and be very careful to observe that you haven't applied a sort order to one of the columns in BOINC Manager, before you jump to that conclusion), then it's a long-standing bug which applies some slight randomisation to the display order when data is transferred from the server to the BOINC Client to the BOINC Manager. In short, it's cosmetic only.

BOINC v6.10.58 is still a very old version. We applied a lot of pressure to get that bug (and many others) fixed - I forget just when. The latest ones - I'm running v7.0.38 - have had display order and running order in perfect step for a long time - possibly even since sometime in the v6.12.xx range - but I wouldn't advise upgrading just for this. Like I said, it's cosmetic only.

Thank you for the very informative insight into my observation. Since the tasks are all APs, they all have a 25-day deadline from when they were issued. In 6.2.19, they would crunch in FIFO order unless, for some crazy reason, high-priority mode kicked in. I switched to 6.10.58 a few days ago and, for example, I have a pile of APs that are due Dec 3, but ones due Dec 6 were running in high priority instead. High priority has since ended, and the ones due Dec 3 and 4 still haven't been touched, but the Dec 6-8 ones are being crunched pretty much in order.

I do notice that the sort order in Manager operates a little differently than in the older version, but I figure it will sort itself out eventually.
Linux laptop:
record uptime: 1511d 20h 19m (ended when the power brick gave up)
ID: 1308237
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 1308241 - Posted: 21 Nov 2012, 4:38:41 UTC

Well, I'm back to NNT if I want to get an acknowledgement from the server of the tasks being reported. Allowing tasks greets me with:

11/20/2012 11:21:44 PM | SETI@home | work fetch resumed by user
11/20/2012 11:25:39 PM | SETI@home | Sending scheduler request: To fetch work.
11/20/2012 11:25:39 PM | SETI@home | Requesting new tasks for CPU and ATI
11/20/2012 11:25:47 PM | | Project communication failed: attempting access to reference site
11/20/2012 11:25:47 PM | SETI@home | Scheduler request failed: Failure when receiving data from the peer
11/20/2012 11:25:49 PM | | Internet access OK - project servers may be temporarily down.
11/20/2012 11:27:13 PM | SETI@home | Sending scheduler request: To fetch work.
11/20/2012 11:27:13 PM | SETI@home | Requesting new tasks for CPU and ATI
11/20/2012 11:32:26 PM | | Project communication failed: attempting access to reference site
11/20/2012 11:32:26 PM | SETI@home | Scheduler request failed: Timeout was reached
11/20/2012 11:32:28 PM | | Internet access OK - project servers may be temporarily down.

I currently have 15 ghosts, all GPU units, and counting those ghosts I'm now down to 104 units "In Progress". I'm probably down to under two days' worth of units for either the CPU or GPU, and the only reason I haven't run out of GPU units, other than it being a weak GPU, is that I routinely suspend GPU crunching to play games or watch movies.

Well, Turkey Day is coming up; I guess even a computer could use the break.
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 1308241
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1308266 - Posted: 21 Nov 2012, 7:30:30 UTC - in response to Message 1308241.  
Last modified: 21 Nov 2012, 7:40:48 UTC

Well, since the outage I've picked up some work. I'm also getting new errors when trying to contact the scheduler.
Still getting the timeouts, but on top of that I'm now getting "Server returned nothing (no headers, no data)" & "Failure when receiving data from the peer".
As before, even with NNT set, whether or not you get a response from the scheduler seems to depend on the wind direction & how you hold your tongue while clicking repeatedly on the Retry button.
Grant
Darwin NT
ID: 1308266
fscheel

Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1308321 - Posted: 21 Nov 2012, 12:20:03 UTC

As expected, with the AP splitters not running this morning I am able to connect, and the tasks are flowing quite well.

Frank
ID: 1308321
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1308364 - Posted: 21 Nov 2012, 14:51:04 UTC

My i7 is finally up to its full 200 WU limit. I suppose this means it finished enough Einstein GPU work to ask Seti for some, and the Seti servers were actually able to deliver it.

I also see that the five APs I got yesterday are done already, four valid and one pending.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1308364
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1308417 - Posted: 21 Nov 2012, 18:11:11 UTC - in response to Message 1308266.  
Last modified: 21 Nov 2012, 18:24:41 UTC

Still getting the timeouts, but on top of that I'm now getting "Server returned nothing (no headers, no data)" & "Failure when receiving data from the peer".

And "Couldn't connect to server" pops up occasionally as well.

Well, it was occasional.
20 minutes of clicking on Update with NNT set & that's the only response I've got.

EDIT- just had a look at the server status page & it appears the Scheduler has been disabled.
Grant
Darwin NT
ID: 1308417
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1308437 - Posted: 21 Nov 2012, 18:32:48 UTC

Well, something's afoot in da lab...
Scheduling server is disabled and the crickets take a dive.
Hmmmmmmmmmmm.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1308437
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1308445 - Posted: 21 Nov 2012, 18:54:20 UTC

Message from Eric:

We've got some things to try. Let us know if it starts working.


I find these two helpful:

findstr /C:"[SETI@home] Scheduler request failed: " stdoutdae.txt >sched_failures-%computername%.txt

findstr /C:"[SETI@home] Scheduler request completed: " stdoutdae.txt >sched_successes-%computername%.txt


They work in the "command prompt" environment in Windows.

Save them (separately or together) in one or two files in BOINC's Data directory: give the files names with the extension ".cmd"

Then, double-clicking the file(s) will quickly give you an overview of how well the scheduler requests have been going.

Don't swamp Eric with data, but if a few of us (those who feel confident working with that minimalist instruction - don't bother if you're not comfortable doing that) keep an eye on his experiments and provide feedback, it may help. Remember your logs will be timestamped in your local timezone - please supply the UTC offset so he can match them up with the server changes.
ID: 1308445
mikeej42

Joined: 26 Oct 00
Posts: 109
Credit: 791,875,385
RAC: 9
United States
Message 1308495 - Posted: 21 Nov 2012, 20:23:15 UTC - in response to Message 1308445.  

After the change(s) this afternoon, I had several nodes with empty caches that could not get a successful scheduler update. I was able to get them downloading some tasks by decreasing the minimum work buffer to 0.25 days. Now they are slowly getting some resent tasks.
ID: 1308495
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1308497 - Posted: 21 Nov 2012, 20:33:27 UTC

Things seem to have started again, and this time we're talking to Synergy over the Campus data network (128.32.18.157) - anybody using a manually configured hosts file, please note. We're still using setiboinc.ssl.berkeley.edu, so the proxies should pick up the change automatically.

So far, the only difference that I've noticed (apart from the fact that it works...) is a re-allocation and download of some of the little graphics files used in Simple View.
ID: 1308497
mikeej42

Joined: 26 Oct 00
Posts: 109
Credit: 791,875,385
RAC: 9
United States
Message 1308504 - Posted: 21 Nov 2012, 20:53:39 UTC - in response to Message 1308497.  

After I flushed the DNS caches on all my nodes I was able to go back to multi-day work buffers and got scheduler updates to complete.
ID: 1308504
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1308505 - Posted: 21 Nov 2012, 20:59:20 UTC - in response to Message 1308504.  

After I flushed the DNS caches on all my nodes I was able to go back to multi-day work buffers and got scheduler updates to complete.

Ah. So that's why I can't download the modest few I've been allocated...
ID: 1308505
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1308506 - Posted: 21 Nov 2012, 21:01:47 UTC - in response to Message 1308445.  

Message from Eric:

We've got some things to try. Let us know if it starts working.


I find these two helpful:

findstr /C:"[SETI@home] Scheduler request failed: " stdoutdae.txt >sched_failures-%computername%.txt

findstr /C:"[SETI@home] Scheduler request completed: " stdoutdae.txt >sched_successes-%computername%.txt


They work in the "command prompt" environment in Windows.

Save them (separately or together) in one or two files in BOINC's Data directory: give the files names with the extension ".cmd"

Then, double-clicking the file(s) will quickly give you an overview of how well the scheduler requests have been going.

Don't swamp Eric with data, but if a few of us (those who feel confident working with that minimalist instruction - don't bother if you're not comfortable doing that) keep an eye on his experiments and provide feedback, it may help. Remember your logs will be timestamped in your local timezone - please supply the UTC offset so he can match them up with the server changes.

I had not thought to check the logs that way. Quite a good idea. I took it a bit further and added a third search for just "[SETI@home] Scheduler request failed: Timeout was reached", to separate out the other failures. Then I have the batch file count the lines and give me the failure percentage, total and timeout-only. So far, checking several machines with data going back to the 5th, the failure rate is between 14% and 19% for all failures.
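The same counting can be done cross-platform. Here is a hypothetical Python sketch: the match strings are the ones from the findstr commands quoted above, while the function name and the example log path are assumptions for illustration, not official BOINC tooling.

```python
# Hypothetical sketch: count scheduler successes/failures in BOINC's
# stdoutdae.txt, matching the same strings as the findstr commands above.
FAILED    = "[SETI@home] Scheduler request failed: "
COMPLETED = "[SETI@home] Scheduler request completed: "
TIMEOUT   = "[SETI@home] Scheduler request failed: Timeout was reached"

def scheduler_stats(lines):
    """Return (total failures, timeout failures, failure percentage)."""
    n_fail    = sum(1 for ln in lines if FAILED in ln)
    n_timeout = sum(1 for ln in lines if TIMEOUT in ln)
    n_ok      = sum(1 for ln in lines if COMPLETED in ln)
    total = n_fail + n_ok
    pct = 100.0 * n_fail / total if total else 0.0
    return n_fail, n_timeout, pct

# Usage (path assumes a default Windows install; adjust to your Data directory):
# with open(r"C:\ProgramData\BOINC\stdoutdae.txt", errors="replace") as f:
#     print(scheduler_stats(f))
```

Timeout lines are counted in both tallies, matching the "total and timeout" percentages described above.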
SETI@home classic workunits: 93,865 · CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1308506
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1308508 - Posted: 21 Nov 2012, 21:21:37 UTC - in response to Message 1308497.  
Last modified: 21 Nov 2012, 21:29:02 UTC

Things seem to have started again, and this time we're talking to Synergy over the Campus data network (128.32.18.157) - anybody using a manually configured hosts file, please note. We're still using setiboinc.ssl.berkeley.edu, so the proxies should pick up the change automatically.

So far, the only difference that I've noticed (apart from the fact that it works...) is a re-allocation and download of some of the little graphics files used in Simple View.


No real change on my side: downloads are very slow without a proxy (<0.5 kbps), and only a little better with one (still <5 kbps). The same host/connection gives >1 MBps downloading an Einstein WU, so the slowness is not my internet connection. The scheduler is very slow too, and uploads are fast but take a long time to clear from the screen (I don't know what you call the state a task is in after the upload reaches 100%).

Bad, but at least data is flowing with little or no error (I really haven't seen any scheduler error yet)... though it takes more time to download than to crunch, so at these rates the caches will never fill on the fastest hosts, even with 100 WU.

But the AP splitters are still offline...
ID: 1308508
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1308511 - Posted: 21 Nov 2012, 21:31:21 UTC - in response to Message 1308497.  
Last modified: 21 Nov 2012, 21:41:10 UTC

Things seem to have started again, and this time we're talking to Synergy over the Campus data network (128.32.18.157) - anybody using a manually configured hosts file, please note. We're still using setiboinc.ssl.berkeley.edu, so the proxies should pick up the change automatically.

So far, the only difference that I've noticed (apart from the fact that it works...) is a re-allocation and download of some of the little graphics files used in Simple View.

Brilliant - scheduler contacts now just work. Even if I get no work (at the Main project), at least we've got a workaround for the scheduler timeouts. For example:

21/11/2012 21:37:27 SETI@home Beta Test [sched_op_debug] Starting scheduler request
21/11/2012 21:37:27 SETI@home Beta Test Sending scheduler request: To fetch work.
21/11/2012 21:37:27 SETI@home Beta Test Requesting new tasks for CPU and GPU
21/11/2012 21:37:27 SETI@home Beta Test [sched_op_debug] CPU work request: 82381.29 seconds; 0.00 CPUs
21/11/2012 21:37:27 SETI@home Beta Test [sched_op_debug] NVIDIA GPU work request: 0.00 seconds; 0.00 GPUs
21/11/2012 21:37:27 SETI@home Beta Test [sched_op_debug] ATI GPU work request: 20239.74 seconds; 0.00 GPUs
21/11/2012 21:37:33 SETI@home Beta Test Scheduler request completed: got 20 new tasks
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] Server version 701
21/11/2012 21:37:33 SETI@home Beta Test Project requested delay of 7 seconds
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] estimated total CPU job duration: 76666 seconds
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] estimated total NVIDIA GPU job duration: 0 seconds
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] estimated total ATI GPU job duration: 19036 seconds
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] Deferring communication for 7 sec
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] Reason: requested by project


Claggy
ID: 1308511
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.