Panic Mode On (114) Server Problems?

Message boards : Number crunching : Panic Mode On (114) Server Problems?

Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1969701 - Posted: 10 Dec 2018, 2:14:10 UTC - in response to Message 1969695.  

has anyone else noticed that it doesn't seem like credit is being awarded, or it's being awarded VERY slowly. we've been back up and running all day, but RAC numbers are still in a nosedive.

i mean, i see validation numbers going up, but credit totals aren't.

It's normal. RAC nosedives immediately after the project hiccups, but it takes a week for RAC to reverse course and climb back to its normal levels.
SETI@home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1969701 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1969713 - Posted: 10 Dec 2018, 4:56:05 UTC - in response to Message 1969695.  

has anyone else noticed that it doesn't seem like credit is being awarded, or it's being awarded VERY slowly. we've been back up and running all day, but RAC numbers are still in a nosedive.

i mean, i see validation numbers going up, but credit totals aren't.

. . Likewise. Mine are still going down as well.

Stephen

:(
ID: 1969713 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1969725 - Posted: 10 Dec 2018, 7:44:51 UTC
Last modified: 10 Dec 2018, 7:45:18 UTC

I had been thinking about what would help people get work sooner after long outages such as this one.

The most work a system can get is its cache setting, or the 100 WU server-side limit (per CPU / per GPU) - whichever is lower.

The feeder can only supply 200 WUs at a time. At present, when you ask for work & have a big debt, the Scheduler may allocate up to 53 WUs (I've read some others can get more). That works out to supplying work to only about 4 systems from that particular batch from the feeder.
What if the Scheduler allocated work based on the system's ability to process it, and on the BOINC Manager's 5-minute-and-a-bit backoff after a Scheduler request?

An extreme cruncher - 4 GPUs, 30 s to process a WU. That's a bit over 10 WUs per GPU in 5 minutes, so let it have up to 50 WUs on a request.
The complete opposite - an ARM-based cruncher that takes a day to process a WU. Is it going to run out of work in the next 6 hours? If yes, give it a WU. If not, it doesn't need any more work at this stage.
A mid-range system, a couple of GTX 1060s running Windows, 7 min to process a WU. Is it going to run out of work in the next 5 min? If yes, give it 4 WUs. If not, give it 2 WUs.

Giving the mid-range system enough work to keep it going for a couple of Scheduler requests means that even if it doesn't get any work on the very next request, it'll still have some to keep it busy.
And with the 200 WU feeder limit, that one feeder batch is enough to keep 50 mid-range systems occupied for over 10 minutes (2 Scheduler requests), or 100 systems if they've already got some work, as opposed to the current system, which would only be able to supply 4 of them.

And the enabling & disabling of this limiter could be done automatically, based on the amount of work presently in a system's cache and its application turnaround time. If most systems requesting work have no work, or their time until running dry is 25% or less of their usual turnaround time, enable the limiter to help get everyone crunching again. If it's better than 50%, disable the limiter & allocate work as is done now.
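
To make the idea concrete, here's a rough sketch in Python - purely illustrative, with names and thresholds of my own choosing, not anything from the actual scheduler code:

import math

BACKOFF_SECS = 305             # the "5 minute and a bit" gap between Scheduler requests
SLOW_HOST_HORIZON = 6 * 3600   # top slow hosts up only if they run dry within ~6 hours

def wus_to_send(n_devices, secs_per_wu, secs_of_work_on_hand):
    """Illustrative only: how many WUs to hand out on one Scheduler request."""
    # WUs the whole host burns through between two back-to-back requests
    burn_per_cycle = n_devices * BACKOFF_SECS / secs_per_wu
    if burn_per_cycle >= 1.0:
        # Fast host: cover a couple of request cycles, so one missed request
        # doesn't leave it idle.
        return math.ceil(burn_per_cycle * 2)
    # Slow host (e.g. a day per WU): one task, and only when nearly dry.
    return 1 if secs_of_work_on_hand < SLOW_HOST_HORIZON else 0

print(wus_to_send(4, 30, 0))          # extreme cruncher, 30 s/WU on 4 GPUs -> 82
print(wus_to_send(1, 86400, 43200))   # ARM host, 1 day/WU, 12 h of work left -> 0
print(wus_to_send(2, 420, 0))         # 2x GTX 1060, 7 min/WU, dry -> 3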
Grant
Darwin NT
ID: 1969725 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1969730 - Posted: 10 Dec 2018, 9:00:38 UTC - in response to Message 1969725.  

The feeder can only supply 200 WUs at a time. At present, when you ask for work & have a big debt, the Scheduler may allocate up to 53 WUs (I've read some others can get more). That works out to supplying work to only about 4 systems from that particular batch from the feeder.
What if the Scheduler allocated work based on the system's ability to process it, and on the BOINC Manager's 5-minute-and-a-bit backoff after a Scheduler request?
...
I'd be careful about making assumptions like that:

10/12/2018 08:39:10 | SETI@home | Scheduler request completed: got 161 new tasks
Work fetch, and the scheduler's response, is based on time (seconds of work requested and allocated) - that one was

10/12/2018 08:39:06 | SETI@home | [sched_op] NVIDIA GPU work request: 303982.25 seconds; 0.00 devices
10/12/2018 08:39:10 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 133545 seconds
I manage my normal requests to be

10/12/2018 08:01:39 | SETI@home | [sched_op] NVIDIA GPU work request: 3989.27 seconds; 0.00 devices
10/12/2018 08:01:42 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 4148 seconds
or thereabouts.

There will be enormous resistance to adding significant extra complexity to the server code: it's a critical component, one slip would crash the project, and it's hideously complicated already. There might be some efficiency savings by considering a maximum of, say, 50 tasks for any one request: there are a lot of tests to be made on each individual task (you can't be your own wingmate - have you had a task from this WU before? And so on.), so speeding that up might allow more requests to be processed. But I'd hate to be the systems analyst tasked with optimising that.
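
For anyone curious what those per-task checks look like in spirit, here's a toy Python sketch - my own simplification for illustration, not the real scheduler logic:

def eligible_results(candidates, host_id, results_already_sent):
    """candidates: list of (result_id, workunit_id) taken from the feeder's slots.
    results_already_sent: {workunit_id: set of host_ids that already hold a copy}."""
    sendable = []
    for result_id, wu_id in candidates:
        holders = results_already_sent.get(wu_id, set())
        if host_id in holders:
            continue            # "you can't be your own wingmate"
        sendable.append((result_id, wu_id))
    return sendable

# Example: host 42 already holds a task from workunit 7001, so only 7002 is offered.
feeder_slot = [(1, 7001), (2, 7002)]
already_sent = {7001: {42}, 7003: {17}}
print(eligible_results(feeder_slot, 42, already_sent))   # [(2, 7002)]

Multiply checks like that by every candidate task on every request, and it's easy to see why capping the per-request count might speed things up.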
ID: 1969730 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1969732 - Posted: 10 Dec 2018, 9:35:25 UTC

It's been pretty clear that the scheduler assignments/limits aren't going to change.

But I think the easiest change would be to make the 100/device limit a variable and base it on cache size:
PerDevice = 100 * FilesInCache / CacheLimit + 1

All of which the scheduler 'should' know already, without any database lookups of how fast you are compared to others, etc.
Everyone gets a little at a time, enough to keep the timers going - and the server probably prefers the backoffs anyway, lol.
Sure, the CPU and slower devices would fill first, but it would be fair.
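
In code, that could look something like the following - my reading of the formula, purely for illustration (the variable names come from the post, nothing else does):

def per_device_limit(files_in_cache, cache_limit):
    """Per-device task limit that grows as the host's cache fills, so after an
    outage every empty host gets a little work at a time."""
    return 100 * files_in_cache // cache_limit + 1

print(per_device_limit(0, 100))    # empty cache -> limit of 1
print(per_device_limit(50, 100))   # half full   -> limit of 51
print(per_device_limit(100, 100))  # full cache  -> limit of 101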
ID: 1969732 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1969733 - Posted: 10 Dec 2018, 9:35:51 UTC - in response to Message 1969730.  
Last modified: 10 Dec 2018, 9:41:11 UTC

There will be enormous resistance to adding significant extra complexity to the server code: it's a critical component, one slip would crash the project, and it's hideously complicated already.

I figured that would be the situation - the added complexity to something that's already beyond complex would be an insurmountable obstacle to its implementation.
Edit - likewise changes to the Scheduler to allow the BOINC Manager to reschedule work between computing resources according to the user's preferences (or their manual efforts based on unfathomable ideas).

But it would be nice if it were possible to give each system enough work, based on its abilities, to keep it busy for a couple of Scheduler requests - getting everyone processing again ASAP and building caches back up faster than is presently the case after even minor outages, let alone longer ones such as we just had.
Grant
Darwin NT
ID: 1969733 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1969750 - Posted: 10 Dec 2018, 12:22:26 UTC

I agree, messing with the server code could be catastrophic, but it needs to be done sooner or later.

With the advent of even faster GPUs that process a WU in the 30-second range, the 100 WU limit is simply unworkable - it holds for less than an hour. This type of GPU can produce close to 2,800 WUs/day, and even faster ones are ready to deploy their power in the next year.

Few people these days need a cache greater than 1 day, since high-speed internet access is almost universal - different from 18 years ago, when we used dial-up connections and connected only once a week. LOL

So new server code that takes that into account - looking at the host's daily production and sending enough WUs for a complete day of work, instead of a fixed 100 WU limit - would get us through most of the outages, scheduled or not. Sure, that wouldn't handle a catastrophic outage that takes days to fix, but that is outside the "normal project life".

If that change happened, slower hosts that crunch only a few WUs per day would of course receive a lot less than 100 WUs, but the question is: do they really need that large a quantity of WUs, or a 10-day cache?

By doing that, the size of the DB would be kept in the safe range, and every host would be fed as needed according to its real contribution to the project.
ID: 1969750 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1969759 - Posted: 10 Dec 2018, 13:50:17 UTC

Seems like, from a code-complexity standpoint, it would be a lot easier/safer to simply increase the CPU limit from 100 tasks to a higher limit. This would avoid tinkering with the code except for the one (I hope) location where the 100-task limit is located.

The question is, would this slow down the overall production of the SETI ecosystem, speed it up, or would it be neutral? Might want to run that change on Beta first.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1969759 · Report as offensive
JohnDK Crowdfunding Project Donor*Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 1969773 - Posted: 10 Dec 2018, 15:28:38 UTC

Getting "Project has no tasks available" for the last 25 mins or so...
ID: 1969773 · Report as offensive
Profile Unixchick Project Donor
Avatar

Send message
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1969774 - Posted: 10 Dec 2018, 15:31:16 UTC - in response to Message 1969773.  

Getting "Project has no tasks available" for the last 25 mins or so...


yup, looks like something is wrong. Early stage of PANIC !!!
ID: 1969774 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1969777 - Posted: 10 Dec 2018, 15:40:11 UTC - in response to Message 1969759.  
Last modified: 10 Dec 2018, 15:42:33 UTC

Seems like, from a code-complexity standpoint, it would be a lot easier/safer to simply increase the CPU limit from 100 tasks to a higher limit. This would avoid tinkering with the code except for the one (I hope) location where the 100-task limit is located.

It could be easier, but that works against the safety of the DB itself, which is the main reason the 100 WU limit exists. Changing it to 200, for example, could double the size of the DB, because most of the hosts that don't need the increase would get the extra WUs too, and it would make almost no difference for the powerful hosts that really need more WUs to get through the outages. That's why I suggest the limit be tied to actual host production. Simply: a host that produces 1,000 WUs/day has a cache size limited to 1,000; one that produces 5,000 WUs/day has a 5,000 WU cache; one that produces 5 WUs/day receives a 5 WU cache; and so on. Not related to the number of GPUs or CPUs the host has - related to the daily production. You want a bigger cache? Simply make your host produce more WUs/day. Nothing complicated to code.
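
In pseudo-code it could be as simple as this - just a sketch, of course, not real server code, and the hard ceiling is only my own guard for the example:

def cache_limit_for_host(wus_completed_last_day, hard_ceiling=10000):
    """Cap in-progress tasks at one day of the host's own measured production."""
    return max(1, min(hard_ceiling, wus_completed_last_day))

print(cache_limit_for_host(5000))   # fast GPU rig  -> 5000-task cap
print(cache_limit_for_host(1000))   # mid-range box -> 1000-task cap
print(cache_limit_for_host(5))      # slow host     -> 5-task cap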

To the moderator, if possible, please migrate this conversation to a new thread.
ID: 1969777 · Report as offensive
JohnDK Crowdfunding Project Donor*Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 1969778 - Posted: 10 Dec 2018, 15:44:20 UTC - in response to Message 1969774.  

Getting "Project has no tasks available" for the last 25 mins or so...


yup, looks like something is wrong. Early stage of PANIC !!!

Panic maybe over, getting tasks again.
ID: 1969778 · Report as offensive
Profile Unixchick Project Donor
Avatar

Send message
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1969781 - Posted: 10 Dec 2018, 16:06:51 UTC - in response to Message 1969778.  

Getting "Project has no tasks available" for the last 25 mins or so...


yup, looks like something is wrong. Early stage of PANIC !!!

Panic maybe over, getting tasks again.


guess it was just a hiccup.
ID: 1969781 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1969782 - Posted: 10 Dec 2018, 16:25:24 UTC

Perhaps this helps?

10/12/2018 16:23:21 | SETI@home | Scheduler request completed: got 11 new tasks
ID: 1969782 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1969796 - Posted: 10 Dec 2018, 18:39:08 UTC

Since then, we have entered a phase where all splitters show as offline, but are continuing to run. I'm not quite sure where the output is going - >/dev/null comes to mind.

The science database shows 'disabled', and the ready-to-send (RTS) buffer is falling fast.

Apologies for my earlier optimism.
ID: 1969796 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1969805 - Posted: 10 Dec 2018, 20:16:11 UTC - in response to Message 1969628.  

I've also been thinking about the preference to hand out work to those machines that already have work, as someone noted. Could it be that the machines that had work after such a long outage were asking for a smaller amount of work at a time, and thus got served earlier? Maybe if the empty machines asked for a smaller amount of work to start off, they would get some sooner?

I know this is from yesterday, but I believe this has already been noticed for years now.

I remember back when everything moved to the co-lo: if I had an absolutely empty cache, it was near impossible to get work, because I was asking for 2.8M seconds of work. But if I dropped the preference from 10.00 days down to 0.1 days, to make it about 50,000 seconds, nearly the first request after doing so would yield some tasks; THEN I could just punch it back up to 10.00 days, walk away, and come back an hour later to a full cache/task limit.
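
Roughly where those numbers come from - and this is only my simplified model of work fetch, the real client does more:

def work_request_secs(cache_days, device_instances, queued_secs=0):
    # Rough model: the client asks for about cache_days worth of seconds per
    # device instance it wants to fill, minus whatever is already queued.
    return max(0, cache_days * 86400 * device_instances - queued_secs)

# An empty multi-GPU host with a 10-day cache asks for millions of seconds...
print(work_request_secs(10.0, 3))   # 2,592,000 s - the same ballpark as that 2.8M
# ...while 0.1 days shrinks the ask to tens of thousands of seconds.
print(work_request_secs(0.1, 3))    # 25,920 s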

I don't know if it is actually coded in this way, or if it is unexpected behaviour, but if you ask for TOO MUCH work while being empty, it won't give you anything; yet once you have something on board, there doesn't seem to be a "too much" number, as long as you have more than ~5 tasks.

Just my anecdotal observations. I believe it was also Richard who pointed that one out to me several years ago, so I tried it and it worked. YMMV
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1969805 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1969812 - Posted: 10 Dec 2018, 20:51:59 UTC - in response to Message 1969805.  

One possible explanation of that is the client request behaviour.

1) If the client requests work, and gets none, it goes into (increasing) backoffs
2) If the client requests work, and gets some, it's free to ask again
3) If the client completes an allocated task, all backoffs are cleared, and it can report the completed task immediately and request more work at the same time.

Collectively, these mean that if you're completely dry, it takes ages to get started.

But if you have a few dregs to complete, or you get a few tasks at an early request, you have a much better chance of asking, and asking, and asking, until you get more.
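
In toy form - this is just a model of that behaviour, not the actual client code, and the numbers are made up:

import random

class ProjectBackoff:
    def __init__(self, max_backoff=4 * 3600):
        self.failures = 0
        self.max_backoff = max_backoff

    def next_delay(self):
        """Growing backoff after each empty reply (rule 1)."""
        if self.failures == 0:
            return 0
        delay = min(self.max_backoff, 60 * 2 ** self.failures)
        return delay * random.uniform(0.5, 1.0)   # a little randomisation

    def got_no_work(self):
        self.failures += 1        # rule 1: empty reply -> longer wait next time

    def got_work(self):
        self.failures = 0         # rule 2: any work -> free to ask again

    def task_completed(self):
        self.failures = 0         # rule 3: a finished task clears the backoffs

b = ProjectBackoff()
for _ in range(5):
    b.got_no_work()
print(round(b.next_delay()))      # after 5 empty replies: up to ~32 minutes' wait
b.task_completed()
print(b.next_delay())             # 0 - report a completed task and ask again at once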
ID: 1969812 · Report as offensive
Profile Unixchick Project Donor
Avatar

Send message
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1969825 - Posted: 10 Dec 2018, 21:39:28 UTC
Last modified: 10 Dec 2018, 21:41:26 UTC

Less than an hour until the RTS buffer is empty. I'm guessing someone is working on it, as the science db is running. I have no idea what the splitters are doing.

edit: looks like the splitters are having problems (or bad data), as there are "channels with errors".
ID: 1969825 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1969833 - Posted: 10 Dec 2018, 22:28:30 UTC - in response to Message 1969805.  

I remember back when everything moved to the co-lo: if I had an absolutely empty cache, it was near impossible to get work, because I was asking for 2.8M seconds of work. But if I dropped the preference from 10.00 days down to 0.1 days, to make it about 50,000 seconds, nearly the first request after doing so would yield some tasks; THEN I could just punch it back up to 10.00 days, walk away, and come back an hour later to a full cache/task limit.


. . Maybe an easier and safer point of attack would be to reduce the maximum work request from 10 days to, say, 5?

Stephen

? ?
ID: 1969833 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1969835 - Posted: 10 Dec 2018, 22:31:14 UTC - in response to Message 1969812.  

One possible explanation of that is the client request behaviour.

1) If the client requests work, and gets none, it goes into (increasing) backoffs
2) If the client requests work, and gets some, it's free to ask again
3) If the client completes an allocated task, all backoffs are cleared, and it can report the completed task immediately and request more work at the same time.

Collectively, these mean that if you're completely dry, it takes ages to get started.

But if you have a few dregs to complete, or you get a few tasks at an early request, you have a much better chance of asking, and asking, and asking, until you get more.


. . Except that I have found that manual requests for work during an extended backoff period still result in "No tasks available" and increase the backoff timer.

Stephen

:(
ID: 1969835 · Report as offensive