Panic Mode On (104) Server Problems?

Profile Wiggo
Joined: 24 Jan 00
Posts: 36755
Credit: 261,360,520
RAC: 489
Australia
Message 1844260 - Posted: 25 Jan 2017, 6:57:15 UTC

After the outrage, we had a 3hr power outage here due to storms, but everything is now fully loaded.

Cheers.
ID: 1844260 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36755
Credit: 261,360,520
RAC: 489
Australia
Message 1844379 - Posted: 25 Jan 2017, 22:39:15 UTC - in response to Message 1844186.  


Hal is correct and this has happened before, several times in fact over the years, but thankfully I have none of those here this time.

Cheers.

Oh yes you have, and plenty of them too. Just one example from one of your computers:
https://setiathome.berkeley.edu/workunit.php?wuid=2349156167
That one has been in limbo since Dec 6. You have many more....

I should've gone back further it seems, 66 of them, but their deadlines are 25, 26, 27 January so they should clear over the next few days.

Cheers.

Well 16 of mine have now validated.

Cheers.
ID: 1844379 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1844442 - Posted: 26 Jan 2017, 6:16:14 UTC - in response to Message 1844250.  
Last modified: 26 Jan 2017, 6:32:27 UTC

What I'm still trying to figure out is why I have to change the Application settings once or twice a day to be able to keep getting work. Eventually, even after the post-outage congestion, the usual response from the Scheduler these days is "Project has no tasks available." Change the Application settings, then it has work available. At least for 12 or more hours. Then I get to do it all over again.


Rough gist is that when hard limits are set, such as project backoffs and queue sizes, statistically there will be people that fit into the expected area, ones that sometimes fit into the expected area while other times not, then still more that always fall into the always breaks regime. My guess is if the backoffs &/or request intervals were somewhat randomised, then it would allow fairer work distribution. Sadly actual statistics and control systems theory doesn't seem to be on the Agenda for Boinc anytime soon. So you'll need to either continue babysitting, or change something such that you fall into a different 'chance' bucket.

[Edit:] Either doing it manually, or some script that forces an update at a random interval after the project backoff expires, might work. The client's aggressive request on backoff expiration won't.
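
A minimal Python sketch of the "force an update at a random interval after backoff expiry" idea. It assumes the stock `boinccmd --project <URL> update` command is on the path; the project URL, the 60-second jitter ceiling and both helper names are illustrative assumptions, not anything the project ships.

```python
import random

# Assumption: the project URL as it appears in your own BOINC client.
PROJECT_URL = "http://setiathome.berkeley.edu/"


def jittered_delay(backoff_seconds, max_jitter=60.0, rng=random):
    """Return the backoff plus a random extra wait of 0..max_jitter seconds."""
    return backoff_seconds + rng.uniform(0.0, max_jitter)


def update_command(project_url=PROJECT_URL):
    """Build (but do not run) the boinccmd call that forces a project update."""
    return ["boinccmd", "--project", project_url, "update"]
```

A wrapper script would sleep for `jittered_delay(...)` seconds using whatever delay the scheduler last requested, then hand `update_command()` to `subprocess.run`. Whether this actually moves a host into a luckier bucket is untested here.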
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1844442 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1844448 - Posted: 26 Jan 2017, 6:47:58 UTC - in response to Message 1844442.  

Rough gist is that when hard limits are set, such as project backoffs and queue sizes, statistically there will be people that fit into the expected area, ones that sometimes fit into the expected area while other times not, then still more that always fall into the always breaks regime. My guess is if the backoffs &/or request intervals were somewhat randomised, then it would allow fairer work distribution. Sadly actual statistics and control systems theory doesn't seem to be on the Agenda for Boinc anytime soon. So you'll need to either continue babysitting, or change something such that you fall into a different 'chance' bucket.

[Edit:] Either doing it manually, or some script that forces an update at a random interval after the project backoff expires, might work. The client's aggressive request on backoff expiration won't.

Problem is it has nothing to do with the backoffs.
Every 5 minutes the Manager will ask for work, and for some reason after a certain period of time it's necessary to change the project's application settings to keep it coming. Even though I don't have an AP application to process AP work, I have to set that option to Yes to get work (and the "If no work available for selected application, accept work from other applications" option as well). Then later on I have to set it to No, then Yes, then No and so on. It's been this way for a few weeks now.

I suspect it's related to the issue that was reported in late Dec for people that had selected to do AP work, and "If no work avail, do other work", with v8 left unselected. At the end of Dec they had to specifically enable v8 work in order to get any, whereas before it wasn't necessary.

(Of course running out of work during the weekly outages does result in ridiculously long backoff times if you're not around to hit retry).
Grant
Darwin NT
ID: 1844448 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1844451 - Posted: 26 Jan 2017, 6:58:11 UTC - in response to Message 1844448.  

Well the 5 minutes is across the board, so constant for all of us.

Likely bugs aside (a big if), there are still periods of full or empty feeder queue which you may fall into. It's a task to determine whether there is a bug in the work issue, or you are simply falling into the empty feeder bucket for some reason (could be anything such as latency you described)

Perhaps an answer is to increase the backoff such that more users can get in, maybe not. Either way, there will always be some proportion of users that can never get work. What I'm proposing is that if fixed time intervals are used, then no matter what, some proportion of hosts will end up in a state of never being able to get work.
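
To make the fixed-interval argument concrete, here is a toy simulation in Python. It is not the BOINC scheduler or feeder: the refill period, feeder size, per-request grant, host offsets and the `simulate` name are all assumptions chosen so the effect is easy to see.

```python
import heapq
import random


def simulate(num_hosts=10, interval=300.0, jitter=0.0, capacity=100,
             per_request=20, refill_period=300.0, duration=60_000.0, seed=1):
    """Toy model: a feeder refilled to `capacity` once per `refill_period`,
    and hosts that each ask for `per_request` tasks every `interval` seconds,
    plus or minus up to `jitter` seconds. Returns tasks received per host."""
    rng = random.Random(seed)
    # Hosts start 10 seconds apart, so their order inside a cycle is fixed
    # unless jitter shuffles it.
    events = [(1.0 + 10.0 * h, h) for h in range(num_hosts)]
    heapq.heapify(events)
    received = [0] * num_hosts
    available = capacity
    cycle = 0
    while events:
        t, host = heapq.heappop(events)
        if t >= duration:
            break
        now_cycle = int(t // refill_period)
        if now_cycle > cycle:  # a refill happened since the last request
            available = capacity
            cycle = now_cycle
        granted = min(per_request, available)
        available -= granted
        received[host] += granted
        delay = interval + rng.uniform(-jitter, jitter)
        heapq.heappush(events, (t + delay, host))
    return received


fixed = simulate(jitter=0.0)       # every host on a rigid 300 s cycle
jittered = simulate(jitter=150.0)  # each wait drawn from 150..450 s
```

With these made-up numbers, the rigid schedule gives the first five hosts everything and the last five nothing, cycle after cycle, while the jittered schedule spreads the same supply across all ten. Real hosts differ in request size and timing, so treat this only as an illustration of the mechanism.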
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1844451 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1844452 - Posted: 26 Jan 2017, 7:11:08 UTC - in response to Message 1844451.  

Likely bugs aside (a big if), there are still periods of full or empty feeder queue which you may fall into. It's a task to determine whether there is a bug in the work issue, or you are simply falling into the empty feeder bucket for some reason (could be anything such as latency you described)

Possible, but unlikely IMHO.
Work returned per hour is less than at times in the past, yet this issue persists.
Also, even after the extended outages with work returned right up there & the demand for new work way up there, if the first few requests for work result in none, changing the application settings results in getting work- up to the point the cache is full. If it starts getting work after the outages, it continues to fill up normally. In both cases it will generally get work with each request, just the odd one or 2 it might miss out on.
However, if the project does ever get the number of crunchers they're hoping for to crunch all the new data, I am expecting significant issues getting work with things as they stand.

A 5min +-30 seconds might help with the Scheduler and feeder loads, but as the number of active hosts increases, wouldn't the sheer number of hosts result in an (effectively) random load due to each system's randomness with the initial Scheduler request after BOINC starting?
Grant
Darwin NT
ID: 1844452 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1844453 - Posted: 26 Jan 2017, 7:18:07 UTC - in response to Message 1844452.  

Yes 'unlikely'. That may possibly account for your ability to change settings, last for 12 hours, then fall back into a hole. I'll be very interested to see if you can break out of that rut without code change on client or server. Being Australia day, I'd like to point out that falling into such holes is pretty much the Australian way. Usually the result of doing so is fairly disruptive.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1844453 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1844454 - Posted: 26 Jan 2017, 7:21:32 UTC - in response to Message 1844452.  

A 5min +-30 seconds might help with the Scheduler and feeder loads, but as the number of active hosts increases, wouldn't the sheer number of hosts result in an (effectively) random load due to each system's randomness with the initial Scheduler request after BOINC starting?


Most likely the backoff needs to be proportional to the rate of requests, since the feed rate is likely more or less constant.
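
As back-of-envelope arithmetic in Python: if the feeder supplies a roughly constant number of tasks per second, the request interval per host has to grow with the number of hosts. The function name and the example numbers are assumptions for illustration, not measured project figures.

```python
def min_request_interval(num_hosts, tasks_per_request, feed_rate):
    """Smallest per-host interval (seconds) at which the feeder can keep up.

    feed_rate is tasks per second supplied by the feeder (assumed constant).
    Demand is num_hosts * tasks_per_request / interval, so the interval
    must be at least num_hosts * tasks_per_request / feed_rate.
    """
    if feed_rate <= 0:
        raise ValueError("feed_rate must be positive")
    return num_hosts * tasks_per_request / feed_rate


# Made-up example: 1000 hosts, 20 tasks per request, 50 tasks/s from the feeder.
example_interval = min_request_interval(1000, 20, 50)
```

Double the hosts and the sustainable interval doubles too, which is the sense in which the backoff would need to track the request rate.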
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1844454 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1844487 - Posted: 26 Jan 2017, 10:22:23 UTC

Grant, maybe try adding the AP application; maybe that is the one thing that makes your requests different from others.

There are a few getting split right now, but normally you won't get any(many) anyways.
ID: 1844487 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1844488 - Posted: 26 Jan 2017, 10:37:40 UTC - in response to Message 1844487.  

Grant, maybe try adding the AP application; maybe that is the one thing that makes your requests different from others.

There are a few getting split right now, but normally you won't get any(many) anyways.

True, but really I've no desire to re-do the setup again. I'm happy to run v8 only & crunch all the guppies others don't want.
It would be nice if the issue were just fixed. I'll just continue with my manual workaround for now.
Grant
Darwin NT
ID: 1844488 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1844598 - Posted: 26 Jan 2017, 20:27:39 UTC

Just adding an idea here.. but the feeder can only re-fill so fast/frequently/often, so what if instead of letting someone get lucky and get all 200 tasks that it has in one request.. the feeder gets limited to assigning 20 or 50 tasks at a time? I know there will be more groaning and griping about that, too, but if the feeder is more likely to have tasks in it, then there should--theoretically--be less people who get absolutely nothing because it is empty.
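
A quick Python sketch of the difference a per-request cap would make. The 200-task feeder and the 20-task cap come from the numbers above; the function name and the ten simultaneous requests are assumptions, and this says nothing about how the real scheduler hands out work.

```python
def hand_out(requests, feeder_size, per_request_cap=None):
    """Serve requests in arrival order from a feeder holding feeder_size tasks.

    requests is a list of how many tasks each host asks for.
    per_request_cap, if given, limits what any single request may receive.
    Returns the number of tasks granted to each request.
    """
    remaining = feeder_size
    granted = []
    for wanted in requests:
        limit = wanted if per_request_cap is None else min(wanted, per_request_cap)
        give = min(limit, remaining)
        granted.append(give)
        remaining -= give
    return granted


hungry_hosts = [200] * 10                              # ten hosts, each asking for 200
uncapped = hand_out(hungry_hosts, feeder_size=200)     # first host takes the lot
capped = hand_out(hungry_hosts, feeder_size=200, per_request_cap=20)  # 20 each
```

Uncapped, one request empties the feeder and the other nine get nothing; capped at 20, all ten get something and simply come back for more on the next request.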
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1844598 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1856
Credit: 268,616,081
RAC: 1,349
United States
Message 1844621 - Posted: 26 Jan 2017, 22:34:50 UTC - in response to Message 1844598.  

Just adding an idea here.. but the feeder can only re-fill so fast/frequently/often, so what if instead of letting someone get lucky and get all 200 tasks that it has in one request.. the feeder gets limited to assigning 20 or 50 tasks at a time? I know there will be more groaning and griping about that, too, but if the feeder is more likely to have tasks in it, then there should--theoretically--be less people who get absolutely nothing because it is empty.

+1
Makes a lot of sense. Would also help with new folks who take a load and are never heard from again, in terms of wingmen waiting for timeouts.
ID: 1844621 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1844633 - Posted: 26 Jan 2017, 23:01:35 UTC - in response to Message 1844598.  

+1
Seems to be the old norm for me anyway, at least it WAS back last year before this new problem cropped up. I never got more than 40 or so tasks in the first downloads after the outage anyway. I've NEVER received the full 100 task buffer output for my download request. I always sort of thought that was the server's doing, or it was just my luck of the draw always pulling ~40 tasks out of the 100 task buffer. It never took more than 4 or 5 downloads to get back to my quota after the outage, once my work request got answered in the queue deluge. I gather from the thread comments that the servers DON'T actually have this mechanism in play.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1844633 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1844634 - Posted: 26 Jan 2017, 23:02:10 UTC - in response to Message 1844598.  

Just adding an idea here.. but the feeder can only re-fill so fast/frequently/often, so what if instead of letting someone get lucky and get all 200 tasks that it has in one request.. the feeder gets limited to assigning 20 or 50 tasks at a time? I know there will be more groaning and griping about that, too, but if the feeder is more likely to have tasks in it, then there should--theoretically--be less people who get absolutely nothing because it is empty.

Something similar was done with AP work requests a few years ago. At least it seemed that way once no more than ~7 tasks were assigned per request.
If hitting an empty feeder is the only issue it seems like more users would be seeing their queues dropping & that toggling settings wouldn't fix the users getting no work.
At another project there was an issue with the feeder running dry. So the admin adjusted some settings for it. I think they said they increased the number the feeder held at once, but they may have increased how often it was filled.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1844634 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1844665 - Posted: 27 Jan 2017, 0:44:22 UTC - in response to Message 1844634.  
Last modified: 27 Jan 2017, 0:45:46 UTC

Just adding an idea here.. but the feeder can only re-fill so fast/frequently/often, so what if instead of letting someone get lucky and get all 200 tasks that it has in one request.. the feeder gets limited to assigning 20 or 50 tasks at a time? I know there will be more groaning and griping about that, too, but if the feeder is more likely to have tasks in it, then there should--theoretically--be less people who get absolutely nothing because it is empty.

Something similar was done with AP work requests a few years ago. At least it seemed that way once no more than ~7 tasks were assigned per request.
If hitting an empty feeder is the only issue it seems like more users would be seeing their queues dropping & that toggling settings wouldn't fix the users getting no work.
At another project there was an issue with the feeder running dry. So the admin adjusted some settings for it. I think they said they increased the number the feeder held at once, but they may have increased how often it was filled.


. . That is exactly how it seems to me as well. But I can offer no idea of just where the issue lies. Except to say it is a recent development. Since they upgraded the OS on one or more of the servers in fact.

Stephen

.
ID: 1844665 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1844700 - Posted: 27 Jan 2017, 6:35:34 UTC - in response to Message 1844665.  

That is exactly how it seems to me as well. But I can offer no idea of just where the issue lies. Except to say it is a recent development. Since they upgraded the OS on one or more of the servers in fact.

It wouldn't be the first time major changes resulted in configuration files going missing or being ignored in one way or another.
Grant
Darwin NT
ID: 1844700 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1845428 - Posted: 30 Jan 2017, 14:10:39 UTC
Last modified: 30 Jan 2017, 14:36:43 UTC

It seems the problem with downloading Work has surfaced again after being absent for a while. Just as before the machines with only ATI GPUs are Not affected. The problems are with the machines that just had GPUs swapped. One machine went from having 3 nVidia cards to 2 NV and 1 ATI. The other machine went from 2 NV and 1 ATI to 3 nVidia cards. Both machines have been having problems since the GPUs were swapped a couple days ago, up until then they were not having any problems since the last post a couple weeks ago. Suddenly, they are back to having problems. The one machine with the 3 NV GPUs is down over a hundred tasks, changing the preferences works for a few hours then the problem returns.
Mon Jan 30 09:27:58 2017 | SETI@home | [sched_op] Starting scheduler request
Mon Jan 30 09:27:58 2017 | SETI@home | Sending scheduler request: To fetch work.
Mon Jan 30 09:27:58 2017 | SETI@home | Reporting 1 completed tasks
Mon Jan 30 09:27:58 2017 | SETI@home | Requesting new tasks for CPU and NVIDIA GPU and AMD/ATI GPU
Mon Jan 30 09:27:58 2017 | SETI@home | [sched_op] CPU work request: 41045.68 seconds; 0.00 devices
Mon Jan 30 09:27:58 2017 | SETI@home | [sched_op] NVIDIA GPU work request: 139220.66 seconds; 0.00 devices
Mon Jan 30 09:27:58 2017 | SETI@home | [sched_op] AMD/ATI GPU work request: 9927.06 seconds; 0.00 devices
Mon Jan 30 09:28:01 2017 | SETI@home | Scheduler request completed: got 0 new tasks
Mon Jan 30 09:28:01 2017 | SETI@home | [sched_op] Server version 707
Mon Jan 30 09:28:01 2017 | SETI@home | Project has no tasks available
Mon Jan 30 09:28:01 2017 | SETI@home | Project requested delay of 303 seconds

Mon 30 Jan 2017 09:33:05 AM EST | SETI@home | Requesting new tasks for NVIDIA
Mon 30 Jan 2017 09:33:05 AM EST | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Mon 30 Jan 2017 09:33:05 AM EST | SETI@home | [sched_op] NVIDIA work request: 526053.44 seconds; 0.00 devices
Mon 30 Jan 2017 09:33:13 AM EST | SETI@home | Scheduler request completed: got 0 new tasks
Mon 30 Jan 2017 09:33:13 AM EST | SETI@home | [sched_op] Server version 707
Mon 30 Jan 2017 09:33:13 AM EST | SETI@home | Project has no tasks available
Mon 30 Jan 2017 09:33:13 AM EST | SETI@home | Project requested delay of 303 seconds


:-(
ID: 1845428 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1845439 - Posted: 30 Jan 2017, 15:52:46 UTC

Change the preferences and the work returns...for a little while;
Mon 30 Jan 2017 10:26:32 AM EST | SETI@home | Sending scheduler request: To fetch work.
Mon 30 Jan 2017 10:26:32 AM EST | SETI@home | Reporting 2 completed tasks
Mon 30 Jan 2017 10:26:32 AM EST | SETI@home | Requesting new tasks for NVIDIA
Mon 30 Jan 2017 10:26:32 AM EST | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Mon 30 Jan 2017 10:26:32 AM EST | SETI@home | [sched_op] NVIDIA work request: 494789.29 seconds; 0.00 devices
Mon 30 Jan 2017 10:26:35 AM EST | SETI@home | Scheduler request completed: got 91 new tasks
Mon 30 Jan 2017 10:26:35 AM EST | SETI@home | [sched_op] Server version 707
Mon 30 Jan 2017 10:26:35 AM EST | SETI@home | Project requested delay of 303 seconds
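
For anyone wanting to track how often this happens, here is a small Python sketch that pulls the requested seconds and the "got N new tasks" result out of event-log text like the above. It assumes the `timestamp | project | message` layout and the `[sched_op]` lines shown here; the `summarize` name and the output shape are my own choices.

```python
import re

GOT_TASKS = re.compile(r"Scheduler request completed: got (\d+) new tasks")
WORK_REQUEST = re.compile(r"\[sched_op\] (.+?) work request: ([\d.]+) seconds")


def summarize(log_text):
    """Return one dict per completed scheduler request found in log_text."""
    results = []
    current = None
    for line in log_text.splitlines():
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue  # blank line or something that is not a log entry
        timestamp, _project, message = parts
        match = WORK_REQUEST.search(message)
        if match:
            if current is None:
                current = {"requested": {}}
            # Seconds of work asked for, keyed by resource name.
            current["requested"][match.group(1)] = float(match.group(2))
            continue
        match = GOT_TASKS.search(message)
        if match:
            entry = current if current is not None else {"requested": {}}
            entry["time"] = timestamp
            entry["got"] = int(match.group(1))
            results.append(entry)
            current = None
    return results
```

Feeding it a saved copy of the event log and counting entries where `got` is 0 while `requested` is non-zero gives a rough tally of the empty replies.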
ID: 1845439 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1845455 - Posted: 30 Jan 2017, 17:11:15 UTC - in response to Message 1845439.  

Remind me again, what preferences are you changing to get the effect?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1845455 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1845458 - Posted: 30 Jan 2017, 17:30:56 UTC - in response to Message 1845455.  

Run only the selected applications:
  AstroPulse v7: no
  SETI@home v8: yes
If no work for selected applications is available, accept work from other applications? no
It doesn't matter how they are set. Just change them. Right now one machine is set to the above while the other machine is;
Run only the selected applications:
  AstroPulse v7: yes
  SETI@home v8: yes
If no work for selected applications is available, accept work from other applications? yes

The next time I'll just swap the settings again.
It's the act of changing them that matters.
ID: 1845458 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.