Panic Mode On (79) Server Problems?



Message boards : Number crunching : Panic Mode On (79) Server Problems?

juan BFB
Project donor
Volunteer tester
Joined: 16 Mar 07
Posts: 5421
Credit: 308,617,749
RAC: 357,866
Brazil
Message 1308076 - Posted: 20 Nov 2012, 11:52:22 UTC - in response to Message 1308061.

Yes, it could be ghosts being resent - some of my machines had developed a new crop of hauntings overnight. But it looks like it's beginning to decay now - this might be a good time to try manual updates, and help flush the remaining gremlins out of the system.

Maybe because of the daylight; ghosts don't like daylight.

____________

zoom314
Project donor
Joined: 30 Nov 03
Posts: 46542
Credit: 36,892,552
RAC: 5,284
United States
Message 1308096 - Posted: 20 Nov 2012, 13:57:07 UTC - in response to Message 1308061.

Yes, it could be ghosts being resent - some of my machines had developed a new crop of hauntings overnight. But it looks like it's beginning to decay now - this might be a good time to try manual updates, and help flush the remaining gremlins out of the system.

Well as long as it's just Pocha Hauntis...
____________
My Facebook, War Commander, 2015

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 12071
Credit: 14,707,440
RAC: 10,993
United States
Message 1308106 - Posted: 20 Nov 2012, 14:40:14 UTC - in response to Message 1308023.

On a side note, since switching from 6.2.19 to 6.10.58, I have noticed that my cache is not being processed in FIFO order. It is all APs, about 17 days' worth, and APs with a deadline four days sooner than the ones that keep getting picked to run next are still sitting there not getting started.

I know there are cache/queue changes along the way through the build history, but each WU has a 25-day deadline, so wouldn't it still make sense to run the soonest deadlines first (which also happen to be the ones that were acquired first)? I mean, it works out in the end I'm sure, but it's just weird.

Not true. FIFO does not necessarily equate to EDF (earliest deadline first). A task's deadline is determined by how long it is estimated to take to run. So it is possible to download a bunch of tasks yesterday estimated to take 20 hours to run and have deadlines in late January, and then a bunch today estimated to take 1 hour and have deadlines in mid-December. And don't forget, if you're running more than one project, BOINC has to balance all of them, and different projects do their time estimates and deadlines differently.

However, if all of the tasks you're looking at have the same time estimate, then I agree, it is weird for them not to run FIFO, which presumably is also EDF.
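For anyone who wants to see the FIFO vs EDF difference concretely, here is a minimal Python sketch. The task names and dates are made up to mirror the example above, and the two sort functions are only an illustration of the two orderings, not how the BOINC client actually schedules work.

```python
from datetime import datetime, timedelta

# Hypothetical tasks as (name, received, deadline). Dates are invented to
# mirror the "20-hour tasks yesterday, 1-hour tasks today" example.
mixed = [
    ("long_a", datetime(2012, 11, 19), datetime(2013, 1, 25)),
    ("long_b", datetime(2012, 11, 19), datetime(2013, 1, 26)),
    ("short_c", datetime(2012, 11, 20), datetime(2012, 12, 15)),
    ("short_d", datetime(2012, 11, 20), datetime(2012, 12, 16)),
]


def fifo_order(tasks):
    """Order task names by arrival time (first in, first out)."""
    return [name for name, _, _ in sorted(tasks, key=lambda t: t[1])]


def edf_order(tasks):
    """Order task names by deadline (earliest deadline first)."""
    return [name for name, _, _ in sorted(tasks, key=lambda t: t[2])]


# Same-type tasks with a fixed 25-day deadline from the day they were issued.
uniform = [
    (f"ap_{day}", datetime(2012, 11, day), datetime(2012, 11, day) + timedelta(days=25))
    for day in (8, 9, 11, 12)
]

print(fifo_order(mixed))   # arrival order
print(edf_order(mixed))    # deadline order, short tasks first
print(fifo_order(uniform) == edf_order(uniform))
```

With mixed estimates the two orders differ; with a fixed deadline offset, as with a cache that is all APs, they coincide.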

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8671
Credit: 51,873,021
RAC: 49,396
United Kingdom
Message 1308127 - Posted: 20 Nov 2012, 16:02:34 UTC - in response to Message 1308106.

On a side note, since switching from 6.2.19 to 6.10.58, I have noticed that my cache is not being processed in FIFO order. It is all APs, about 17 days' worth, and APs with a deadline four days sooner than the ones that keep getting picked to run next are still sitting there not getting started.

I know there are cache/queue changes along the way through the build history, but each WU has a 25-day deadline, so wouldn't it still make sense to run the soonest deadlines first (which also happen to be the ones that were acquired first)? I mean, it works out in the end I'm sure, but it's just weird.

Not true. FIFO does not necessarily equate to EDF (earliest deadline first). A task's deadline is determined by how long it is estimated to take to run. So it is possible to download a bunch of tasks yesterday estimated to take 20 hours to run and have deadlines in late January, and then a bunch today estimated to take 1 hour and have deadlines in mid-December. And don't forget, if you're running more than one project, BOINC has to balance all of them, and different projects do their time estimates and deadlines differently.

However, if all of the tasks you're looking at have the same time estimate, then I agree, it is weird for them not to run FIFO, which presumably is also EDF.

If tasks from the same download batch don't appear to run in FIFO (and be very careful to observe that you haven't applied a sort order to one of the columns in BOINC Manager, before you jump to that conclusion), then it's a long-standing bug which applies some slight randomisation to the display order when data is transferred from the server to the BOINC Client to the BOINC Manager. In short, it's cosmetic only.

BOINC v6.10.58 is still a very old version. We applied a lot of pressure to get that bug (and many others) fixed - I forget just when. The latest ones - I'm running v7.0.38 - have had display order and running order in perfect step for a long time - possibly even since sometime in the v6.12.xx range - but I wouldn't advise upgrading just for this. Like I said, it's cosmetic only.

fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1308180 - Posted: 20 Nov 2012, 23:19:34 UTC

Welp... Looks like the cricket graph just bottomed out. :(

ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 628
Credit: 144,365,229
RAC: 153,268
United Kingdom
Message 1308181 - Posted: 20 Nov 2012, 23:21:07 UTC - in response to Message 1308180.
Last modified: 20 Nov 2012, 23:22:02 UTC

Welp... Looks like the cricket graph just bottomed out. :(

Just waiting for the next server status update...
[Edit] Which is almost totally green...
____________

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2292
Credit: 8,832,574
RAC: 3,790
United States
Message 1308237 - Posted: 21 Nov 2012, 3:25:40 UTC - in response to Message 1308127.

On a side note, since switching from 6.2.19 to 6.10.58, I have noticed that my cache is not being processed in FIFO order. It is all APs, about 17 days' worth, and APs with a deadline four days sooner than the ones that keep getting picked to run next are still sitting there not getting started.

I know there are cache/queue changes along the way through the build history, but each WU has a 25-day deadline, so wouldn't it still make sense to run the soonest deadlines first (which also happen to be the ones that were acquired first)? I mean, it works out in the end I'm sure, but it's just weird.

Not true. FIFO does not necessarily equate to EDF (earliest deadline first). A task's deadline is determined by how long it is estimated to take to run. So it is possible to download a bunch of tasks yesterday estimated to take 20 hours to run and have deadlines in late January, and then a bunch today estimated to take 1 hour and have deadlines in mid-December. And don't forget, if you're running more than one project, BOINC has to balance all of them, and different projects do their time estimates and deadlines differently.

However, if all of the tasks you're looking at have the same time estimate, then I agree, it is weird for them not to run FIFO, which presumably is also EDF.

If tasks from the same download batch don't appear to run in FIFO (and be very careful to observe that you haven't applied a sort order to one of the columns in BOINC Manager, before you jump to that conclusion), then it's a long-standing bug which applies some slight randomisation to the display order when data is transferred from the server to the BOINC Client to the BOINC Manager. In short, it's cosmetic only.

BOINC v6.10.58 is still a very old version. We applied a lot of pressure to get that bug (and many others) fixed - I forget just when. The latest ones - I'm running v7.0.38 - have had display order and running order in perfect step for a long time - possibly even since sometime in the v6.12.xx range - but I wouldn't advise upgrading just for this. Like I said, it's cosmetic only.

Thank you for the very informative insight into my observation. Since the tasks are all APs, they all have a 25-day deadline from when they were issued. In 6.2.19, they would crunch in FIFO order, unless for some crazy reason high priority mode kicked in. I switched to 6.10.58 a few days ago and, for example, I have a pile of APs that are due Dec 3, but ones for Dec 6 were running in high priority instead. High priority has since ended, and the ones due Dec 3 and 4 still haven't been touched, but the ones due Dec 6-8 are being crunched pretty much in order.

I do notice that the sort order in Manager operates a little differently than in the older version, but I figure it will sort itself out eventually.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Keith White
Joined: 29 May 99
Posts: 370
Credit: 2,909,882
RAC: 2,469
United States
Message 1308241 - Posted: 21 Nov 2012, 4:38:41 UTC

Well, I'm back to NNT (No New Tasks) if I want to get an acknowledgement from the server of the tasks being reported. Allowing new tasks greets me with:

11/20/2012 11:21:44 PM | SETI@home | work fetch resumed by user
11/20/2012 11:25:39 PM | SETI@home | Sending scheduler request: To fetch work.
11/20/2012 11:25:39 PM | SETI@home | Requesting new tasks for CPU and ATI
11/20/2012 11:25:47 PM | | Project communication failed: attempting access to reference site
11/20/2012 11:25:47 PM | SETI@home | Scheduler request failed: Failure when receiving data from the peer
11/20/2012 11:25:49 PM | | Internet access OK - project servers may be temporarily down.
11/20/2012 11:27:13 PM | SETI@home | Sending scheduler request: To fetch work.
11/20/2012 11:27:13 PM | SETI@home | Requesting new tasks for CPU and ATI
11/20/2012 11:32:26 PM | | Project communication failed: attempting access to reference site
11/20/2012 11:32:26 PM | SETI@home | Scheduler request failed: Timeout was reached
11/20/2012 11:32:28 PM | | Internet access OK - project servers may be temporarily down.

I currently have 15 ghosts, all GPU units, and I'm now down to 104 units "In Progress" counting those ghosts. I'm probably down to under 2 days' worth of units for either the CPU or GPU, and the only reason I haven't run out of GPU units, other than that it's a weak GPU, is that I routinely suspend GPU crunching to play games or watch movies.

Well Turkey day is coming up, I guess even a computer could use the break.
____________
"Life is just nature's way of keeping meat fresh." - The Doctor

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5872
Credit: 60,854,954
RAC: 47,505
Australia
Message 1308266 - Posted: 21 Nov 2012, 7:30:30 UTC - in response to Message 1308241.
Last modified: 21 Nov 2012, 7:40:48 UTC

Well, since the outage I've picked up some work. I'm also getting new errors when trying to contact the Scheduler.
Still getting the timeouts, but to add to that I'm now getting "Server returned nothing (no headers, no data)" & "Failure when receiving data from the peer".
As before, even with NNT set, it appears to depend on the wind direction & how you hold your tongue while clicking repeatedly on the retry button as to whether or not you will get a response from the Scheduler.
____________
Grant
Darwin NT.

fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1308321 - Posted: 21 Nov 2012, 12:20:03 UTC

As expected, with the AP splitters not running this morning, I am able to connect and the tasks are flowing quite well.

Frank

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 12071
Credit: 14,707,440
RAC: 10,993
United States
Message 1308364 - Posted: 21 Nov 2012, 14:51:04 UTC

My i7 is finally up to its full 200 WU limit. I suppose this means it finished enough Einstein GPU work to ask Seti for some, and the Seti servers were actually able to deliver it.

I also see that the five APs I got yesterday are done already, four valid and one pending.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Grant (SSSF)
Joined: 19 Aug 99
Posts: 5872
Credit: 60,854,954
RAC: 47,505
Australia
Message 1308417 - Posted: 21 Nov 2012, 18:11:11 UTC - in response to Message 1308266.
Last modified: 21 Nov 2012, 18:24:41 UTC

Still getting the timeouts, but to add to that I'm now getting "Server returned nothing (no headers, no data)" & "Failure when receiving data from the peer".

And "Couldn't connect to server" pops up occasionally as well.

Well, it was occasionally.
After 20 min of clicking on update with NNT set, that's the only response I've got.

EDIT- just had a look at the server status page & it appears the Scheduler has been disabled.
____________
Grant
Darwin NT.

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8671
Credit: 51,873,021
RAC: 49,396
United Kingdom
Message 1308445 - Posted: 21 Nov 2012, 18:54:20 UTC

Message from Eric:

We've got some things to try. Let us know if it starts working.


I find these two helpful:

findstr /C:"[SETI@home] Scheduler request failed: " stdoutdae.txt >sched_failures-%computername%.txt

findstr /C:"[SETI@home] Scheduler request completed: " stdoutdae.txt >sched_successes-%computername%.txt


They work in the "command prompt" environment in Windows.

Save them (separately or together) in one or two files in BOINC's Data directory: give the files names with the extension ".cmd"

Then, double-clicking the file(s) will quickly give you an overview of how well the scheduler requests have been going.

Don't swamp Eric with data, but if a few of us (those who feel confident working with that minimalist instruction - don't bother if you're not comfortable doing that) keep an eye on his experiments and provide feedback, it may help. Remember your logs will be timestamped in your local timezone - please supply the UTC offset so he can match them up with the server changes.
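For those not on Windows, or who would rather convert the timestamps themselves, a rough Python equivalent might look like the sketch below. It assumes stdoutdae.txt lines start with a local timestamp in the form 20-Nov-2012 23:25:47 (check your own log; the format string is a parameter), and the sample lines and the -5 hour offset are invented for illustration. The function names are mine, not part of BOINC.

```python
from datetime import datetime, timedelta, timezone

# The two search strings used in the findstr commands above.
FAILED = "[SETI@home] Scheduler request failed: "
COMPLETED = "[SETI@home] Scheduler request completed: "


def matching_lines(lines, marker):
    """Return lines containing marker (a case-sensitive literal match)."""
    return [line.rstrip("\n") for line in lines if marker in line]


def to_utc(local_stamp, utc_offset_hours, fmt="%d-%b-%Y %H:%M:%S"):
    """Convert a local log timestamp to UTC, given the local UTC offset in hours.

    The default fmt is an assumption about the log format; month names are
    parsed according to the current locale.
    """
    local = datetime.strptime(local_stamp, fmt)
    tz = timezone(timedelta(hours=utc_offset_hours))
    return local.replace(tzinfo=tz).astimezone(timezone.utc)


# Made-up sample lines; in practice read them with open("stdoutdae.txt").
sample = [
    "20-Nov-2012 23:25:47 [SETI@home] Scheduler request failed: Timeout was reached\n",
    "20-Nov-2012 23:27:13 [SETI@home] Sending scheduler request: To fetch work.\n",
    "20-Nov-2012 23:40:01 [SETI@home] Scheduler request completed: got 0 new tasks\n",
]

for line in matching_lines(sample, FAILED):
    stamp = line[:20]  # "dd-Mon-yyyy hh:mm:ss" is 20 characters long
    print(to_utc(stamp, -5).isoformat(), line[21:])
```

Reporting times already converted to UTC saves the person reading the server logs from doing the offset arithmetic for each contributor.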

mikeej42
Joined: 26 Oct 00
Posts: 109
Credit: 790,750,043
RAC: 1,964
United States
Message 1308495 - Posted: 21 Nov 2012, 20:23:15 UTC - in response to Message 1308445.

After the change(s) this afternoon, I had several nodes that had empty caches but could not get a successful scheduler update. I was able to get them to start downloading some tasks by decreasing the minimum work buffer to 0.25 days. Now they are slowly getting some resent tasks.
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8671
Credit: 51,873,021
RAC: 49,396
United Kingdom
Message 1308497 - Posted: 21 Nov 2012, 20:33:27 UTC

Things seem to have started again, and this time we're talking to Synergy over the Campus data network (128.32.18.157) - anybody using a manually configured hosts file please note. We're still using setiboinc.ssl.berkeley.edu, so the proxies should pick up the change automatically.

So far, the only difference that I've noticed (apart from the fact that it works...) is a re-allocation and download of some of the little graphics files used in Simple View.
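If you want to check whether a machine has picked up the new address or is still holding a stale hosts entry or cached lookup, something like this Python sketch could help. It assumes the scheduler hostname is meant to resolve to the address quoted above; the function name and the injectable resolver argument are my own, there so the check can be tried without touching the network.

```python
import socket

# Both values are quoted from the post above; that the hostname should
# resolve to this address is an assumption made for this sketch.
EXPECTED_ADDRESS = "128.32.18.157"
SCHEDULER_HOST = "setiboinc.ssl.berkeley.edu"


def resolves_to(host, expected, resolver=socket.gethostbyname):
    """Return True if host resolves to the expected IPv4 address.

    resolver defaults to the system lookup, which honours the hosts file
    and any local DNS cache, so a stale entry shows up as False.
    """
    try:
        return resolver(host) == expected
    except OSError:
        # Lookup failed entirely (no network, unknown host, and so on).
        return False


# Live check against the system resolver (needs network access):
#   print(resolves_to(SCHEDULER_HOST, EXPECTED_ADDRESS))
```

A False result only says that this machine sees a different address or none at all; it does not say which cache or file is responsible.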

mikeej42
Joined: 26 Oct 00
Posts: 109
Credit: 790,750,043
RAC: 1,964
United States
Message 1308504 - Posted: 21 Nov 2012, 20:53:39 UTC - in response to Message 1308497.

After I flushed the DNS caches on all my nodes I was able to go back to multi-day work buffers and got scheduler updates to complete.
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8671
Credit: 51,873,021
RAC: 49,396
United Kingdom
Message 1308505 - Posted: 21 Nov 2012, 20:59:20 UTC - in response to Message 1308504.

After I flushed the DNS caches on all my nodes I was able to go back to multi-day work buffers and got scheduler updates to complete.

Ah. So that's why I can't download the modest few I've been allocated...

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4478
Credit: 119,888,575
RAC: 140,028
United States
Message 1308506 - Posted: 21 Nov 2012, 21:01:47 UTC - in response to Message 1308445.

Message from Eric:

We've got some things to try. Let us know if it starts working.


I find these two helpful:

findstr /C:"[SETI@home] Scheduler request failed: " stdoutdae.txt >sched_failures-%computername%.txt

findstr /C:"[SETI@home] Scheduler request completed: " stdoutdae.txt >sched_successes-%computername%.txt


They work in the "command prompt" environment in Windows.

Save them (separately or together) in one or two files in BOINC's Data directory: give the files names with the extension ".cmd"

Then, double-clicking the file(s) will quickly give you an overview of how well the scheduler requests have been going.

Don't swamp Eric with data, but if a few of us (those who feel confident working with that minimalist instruction - don't bother if you're not comfortable doing that) keep an eye on his experiments and provide feedback, it may help. Remember your logs will be timestamped in your local timezone - please supply the UTC offset so he can match them up with the server changes.

I had not thought to check the logs that way. Quite a good idea. I took it a bit further and added a third search just for "[SETI@home] Scheduler request failed: Timeout was reached" to separate timeouts from other failures. Then I have the batch file count the lines and give me the % failure for total and timeout. So far I have checked several machines that have data going back to the 5th, and the failure rate is between 14% & 19% for all failures.
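The counting step can be sketched in Python roughly as follows. This is not the batch file itself; it assumes both percentages are taken over all scheduler requests that either completed or failed, and the sample log lines are invented.

```python
# Search strings from the findstr commands quoted above.
COMPLETED = "[SETI@home] Scheduler request completed: "
FAILED = "[SETI@home] Scheduler request failed: "
TIMEOUT = "[SETI@home] Scheduler request failed: Timeout was reached"


def failure_rates(lines):
    """Return (all-failure %, timeout %) of completed plus failed requests."""
    completed = failed = timeouts = 0
    for line in lines:
        if COMPLETED in line:
            completed += 1
        elif FAILED in line:
            failed += 1
            if TIMEOUT in line:
                timeouts += 1
    total = completed + failed
    if total == 0:
        # No scheduler requests in the log, so there is nothing to rate.
        return 0.0, 0.0
    return 100 * failed / total, 100 * timeouts / total


# Invented sample: 8 completed requests, 1 timeout, 1 other failure.
sample = (
    ["21-Nov-2012 10:00:00 [SETI@home] Scheduler request completed: got 0 new tasks"] * 8
    + ["21-Nov-2012 10:05:00 [SETI@home] Scheduler request failed: Timeout was reached"]
    + ["21-Nov-2012 10:10:00 [SETI@home] Scheduler request failed: Couldn't connect to server"]
)

all_pct, timeout_pct = failure_rates(sample)
print(f"all failures: {all_pct:.1f}%  timeouts: {timeout_pct:.1f}%")
```

Keeping timeouts as a separate figure makes it easier to tell a slow scheduler from one that refuses connections outright.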
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!


Copyright © 2014 University of California