Panic Mode On (111) Server Problems?

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928648 - Posted: 7 Apr 2018, 16:40:01 UTC - in response to Message 1928583.  



The correction for multiple CPUs or GPUs is done at work-fetch time. If you set a cache level of 1 day (86,400 seconds), BOINC will ask for 259,200 seconds, a day's work for each of the three GPUs.

So, for those hitting unexpected 'reached a limit of tasks in progress' messages, my suggestion would be to calculate how long it would take your particular CPU to complete 100 tasks. Say a task takes 1 hour, and you use 4 CPU cores to process them: the hundred tasks would take 25 hours - just over a day. Set your cache level just below that figure - one day and no additional would be neat in this simplified example - and your CPU should cruise along just below the CPU limit. That should allow the scheduler to process the GPU iteration of the loop, and send those tasks too. If each GPU finishes its own task in under 15 minutes (and for the people in this discussion, I expect that's true), then each GPU should get its own 100 allocation from a 1-day cache, too.

OK, I'm game. I do a cpu task in 30 minutes. Times 4-6 cores. So I have set work request for 0.5 days. Will see if the scheduler will maintain cache levels.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1928648
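The arithmetic in the suggestion quoted above can be checked with a short sketch. It assumes the per-device scaling works exactly as the quote describes (cache seconds multiplied by the number of devices) and that the 100-task limit is shared across all CPU cores; the function names are made up for illustration and are not part of BOINC.

```python
SECONDS_PER_DAY = 86_400


def hours_to_drain(task_limit: int, hours_per_task: float, cores: int) -> float:
    """Wall-clock hours for `cores` cores to work through `task_limit` tasks."""
    return task_limit * hours_per_task / cores


def work_request_seconds(cache_days: float, devices: int) -> float:
    """Seconds of work asked for when the cache level is scaled per device."""
    return cache_days * SECONDS_PER_DAY * devices


# Figures from the quoted example: 1-hour tasks on 4 cores, 1-day cache, 3 GPUs.
print(hours_to_drain(100, 1.0, 4))      # 25.0 hours, just over a day
print(work_request_seconds(1.0, 3))     # 259200.0 seconds

# Figures from the reply: 30-minute tasks on 4 to 6 cores.
print(hours_to_drain(100, 0.5, 4), hours_to_drain(100, 0.5, 6))
```

With 30-minute tasks, 100 tasks last between roughly 8 and 12.5 hours depending on how many cores are in use, which is why a setting of about 0.5 days is in the right range for that host.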
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928650 - Posted: 7 Apr 2018, 16:48:02 UTC - in response to Message 1928647.  

// and we haven't exceeded result per RPC limit
What is the definition of result per RPC limit?
Those will all be in https://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits. That looks like

<max_wus_to_send> N </max_wus_to_send>
Maximum jobs returned per scheduler RPC is N*(NCPUS + GM*NGPUS). You can use this to limit the impact of faulty hosts. Default is 10.
ID: 1928650
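The per-RPC formula quoted above is easy to evaluate for a given host. The sketch below only restates N*(NCPUS + GM*NGPUS); the host shape (4 CPUs, 3 GPUs) and the GM value of 1 are assumptions chosen for illustration, since the thread does not say what the project actually uses.

```python
def max_jobs_per_rpc(n: int, ncpus: int, ngpus: int, gm: float = 1.0) -> float:
    """Evaluate N*(NCPUS + GM*NGPUS) from the quoted documentation.

    `gm` is the GPU multiplier; its real value is set server-side and is
    an assumption here.
    """
    return n * (ncpus + gm * ngpus)


# Quoted default N = 10, hypothetical host with 4 CPUs and 3 GPUs, GM assumed 1.
print(max_jobs_per_rpc(10, ncpus=4, ngpus=3))   # 70.0
```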
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928651 - Posted: 7 Apr 2018, 16:51:57 UTC - in response to Message 1928648.  
Last modified: 7 Apr 2018, 17:09:12 UTC

OK, I'm game. I do a cpu task in 30 minutes. Times 4-6 cores. So I have set work request for 0.5 days. Will see if the scheduler will maintain cache levels.
I'm prepping up to do the same thing on computer 6910484 - I have work from other projects to burn off first. I'll let you know when it reaches test status - probably tomorrow.
ID: 1928651
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928659 - Posted: 7 Apr 2018, 18:06:23 UTC - in response to Message 1928650.  

// and we haven't exceeded result per RPC limit
What is the definition of result per RPC limit?
Those will all be in https://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits. That looks like

<max_wus_to_send> N </max_wus_to_send>
Maximum jobs returned per scheduler RPC is N*(NCPUS + GM*NGPUS). You can use this to limit the impact of faulty hosts. Default is 10.

That refers to config.xml. We don't have that file, do we? That file must be on the server side. Correct?
ID: 1928659
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928663 - Posted: 7 Apr 2018, 18:14:16 UTC - in response to Message 1928651.  

OK, I'm game. I do a cpu task in 30 minutes. Times 4-6 cores. So I have set work request for 0.5 days. Will see if the scheduler will maintain cache levels.
I'm prepping up to do the same thing on computer 6910484 - I have work from other projects to burn off first. I'll let you know when it reaches test status - probably tomorrow.

Well that was quick. I only changed the daily work request a little over an hour ago. Already, the three slower machines are down in gpu work by 75 tasks.

4/7/2018 10:41:55 | SETI@home | Sending scheduler request: To report completed tasks.
4/7/2018 10:41:55 | SETI@home | Reporting 43 completed tasks
4/7/2018 10:41:55 | SETI@home | Not requesting tasks: don't need (CPU: job cache full; NVIDIA GPU: job cache full)
4/7/2018 10:42:03 | SETI@home | Scheduler request completed


The only machines that are staying at full cpu and gpu caches are the Linux machines.

I will be putting the daily work request back to a full one day so the slower machines keep full gpu caches.
ID: 1928663
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928664 - Posted: 7 Apr 2018, 18:19:07 UTC - in response to Message 1928659.  

That refers to config.xml. We don't have that file, do we? That file must be on the server side. Correct?
Yes, which is why it's the right one to consider when looking at server code, as I was earlier.
ID: 1928664
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928666 - Posted: 7 Apr 2018, 18:22:46 UTC - in response to Message 1928663.  

I will be putting the daily work request back to a full one day so the slower machines keep full gpu caches.
That's fine. If you want to achieve that, you'll have to tune each machine according to the relative speeds, and the relative numbers, of each resource. You may find it easier if you refer to <work_fetch_debug> every so often, to see how close you're getting.
ID: 1928666
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928670 - Posted: 7 Apr 2018, 19:03:26 UTC

If each GPU finishes its own task in under 15 minutes (and for the people in this discussion, I expect that's true), then each GPU should get its own 100 allocation from a 1-day cache, too.


So why did the gpu caches fall? Under your scenario, they should have maintained their full caches. BoincTasks has the handy feature of keeping a running daily and weekly tally of cpu and gpu production.

My Windows machines (the slowest in the farm) do around 1000 tasks a day on 3 gpus in each machine. So a half day's worth would be 500 tasks. The maximum server side limit for those machines is 300 tasks. The scheduler did not calculate the correct amount of work necessary for 1/2 day of gpu work and let the caches fall below the server limit of 300.
ID: 1928670
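The figures in the post above can be put side by side with a back-of-envelope sketch. It treats the quoted 1000 tasks a day as a steady rate and the 300-task figure as a hard cap, both taken from the post; the function names are hypothetical.

```python
def tasks_for_cache(tasks_per_day: float, cache_days: float) -> float:
    """Tasks needed on hand to cover `cache_days` at a steady daily rate."""
    return tasks_per_day * cache_days


def hours_covered(task_count: float, tasks_per_day: float) -> float:
    """Hours of work that `task_count` tasks represent at that daily rate."""
    return task_count * 24 / tasks_per_day


wanted = tasks_for_cache(1000, 0.5)     # 500.0 tasks for half a day
capped = min(wanted, 300)               # the server-side limit bites first
print(wanted, capped, hours_covered(capped, 1000))   # 500.0 300 7.2
```

On these figures the 300-task limit, not the half-day cache setting, would be the binding constraint, and 300 tasks cover only about 7.2 hours of GPU work. That is why a cache that drops below 300 looks wrong from the host's side.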
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928673 - Posted: 7 Apr 2018, 19:09:51 UTC - in response to Message 1928666.  

I will be putting the daily work request back to a full one day so the slower machines keep full gpu caches.
That's fine. If you want to achieve that, you'll have to tune each machine according to the relative speeds, and the relative numbers, of each resource. You may find it easier if you refer to <work_fetch_debug> every so often, to see how close you're getting.

Can you explain your comment about "tune each machine according to the relative speeds"?

What part of the system would be tuned? Are we talking about setting venues for each machine or something? Or are you talking about changing rsc_fpops_est in client_state?
ID: 1928673
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928676 - Posted: 7 Apr 2018, 19:38:53 UTC - in response to Message 1928673.  

What part of the system would be tuned? Are we talking about setting venues for each machine or something? Or are you talking about changing rsc_fpops_est in client_state?
No, I wouldn't touch <rsc_fpops_est> - in fact, I wouldn't change anything inside client_state.xml (so no rescheduling, either). My aim is to find out how the scheduler is really working, and hence to support or disprove some of the many theories that have been floated in this thread. Only once we all understand it can we make sensible suggestions about how to drive it. I aim to work with BOINC, rather than against it - unless I come across a coding bug which makes it work in a way which is different from what is documented or which appears to have made the servers work differently from the way the designers intended.

Once I've got some working estimates for my particular host - yours will be different - I'm going to do the maths to work out how long 100 CPU tasks will last, and how long 100 GPU tasks will last. If the CPU runs dry first, that's no good for our test: I'll use app_config.xml to reduce the number of CPU cores SETI is allowed to use (that'll prolong the cache lifetime for the same setting - the spare cores can get on with something else). Then, I'll set the cache so that the machine loads up 100 CPU tasks, at which point it'll be gasping for GPU work, too. See if I get knocked back with a task limit message - that's the one I'm looking for, nothing else. Then, back off the cache just enough to stabilise the CPU in the 90s, but the GPU still wanting more. If I don't get GPU work, or get the wrong message, then it's back to the drawing board and re-read the code to see where I've gone wrong.

It may take a while...
ID: 1928676
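The first step of the plan above (make sure the CPU does not run dry before the GPU) can be estimated ahead of time. This is a rough sketch under stated assumptions: a limit of 100 tasks shared by all CPU cores, a separate limit of 100 per GPU, and steady task times. The function name and the example task times are hypothetical; the resulting core count is the kind of value one might then apply through app_config.xml.

```python
import math


def max_cores_for_test(cpu_limit: int, cpu_minutes_per_task: float,
                       gpu_limit_per_device: int, gpu_minutes_per_task: float,
                       total_cores: int) -> int:
    """Largest core count at which the CPU cache lasts at least as long
    as one GPU's cache, clamped to the range 1..total_cores."""
    gpu_cache_minutes = gpu_limit_per_device * gpu_minutes_per_task
    cpu_core_minutes = cpu_limit * cpu_minutes_per_task
    # CPU cache lifetime is cpu_core_minutes / cores, so solve for cores.
    cores = math.floor(cpu_core_minutes / gpu_cache_minutes)
    # A result of 0 would mean even one core drains first; still return 1.
    return max(1, min(total_cores, cores))


# Hypothetical host: 30-minute CPU tasks, 10-minute GPU tasks, 6 cores.
print(max_cores_for_test(100, 30, 100, 10, total_cores=6))   # 3
```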
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928678 - Posted: 7 Apr 2018, 20:04:01 UTC - in response to Message 1928676.  

OK, Richard, I understand. No, I wasn't contemplating messing with the server derived numbers in client_state. I just want the system to work as you described, "within the server documented limits" and follow those rules to the letter. But that is not what we have been seeing lately, or we really do not have a correct grasp of what the server documented limits are in reality.

The case in point I am concerned with is my falling gpu task count while the server was stating that both cpu and gpu caches were full:

"SETI@home | Not requesting tasks: don't need (CPU: job cache full; NVIDIA GPU: job cache full)"

As I understand it, it should have let the cpu cache fall a bit below 100 but still maintained the gpu caches at 300 tasks.
ID: 1928678
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928679 - Posted: 7 Apr 2018, 20:10:17 UTC - in response to Message 1928678.  

"SETI@home | Not requesting tasks: don't need (CPU: job cache full; NVIDIA GPU: job cache full)"
That one is a local message from your client, not from the server. You can decode it with <work_fetch_debug>.

I'm still in the status of "It works for me" - so I need to provoke it into producing the server behaviour you've all been complaining about, in a situation where I can immediately examine any of the underlying figures.
ID: 1928679
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928684 - Posted: 7 Apr 2018, 20:57:13 UTC - in response to Message 1928679.  

Can you please decode what the work_fetch_debug and sched_ops output shows? I can't make the math of work_fetch_debug match the math of sched_ops.

4/7/2018 13:51:09 | SETI@home | [work_fetch] REC 2324393.522 prio -0.004 can't request work: scheduler RPC backoff (47.02 sec)
4/7/2018 13:51:09 | | [work_fetch] --- state for CPU ---
4/7/2018 13:51:09 | | [work_fetch] shortfall 322607.65 nidle 0.00 saturated 45669.58 busy 0.00
4/7/2018 13:51:09 | Einstein@Home | [work_fetch] share 0.000 blocked by project preferences
4/7/2018 13:51:09 | Milkyway@Home | [work_fetch] share 0.000 blocked by project preferences
4/7/2018 13:51:09 | SETI@home | [work_fetch] share 0.000
4/7/2018 13:51:09 | | [work_fetch] --- state for NVIDIA GPU ---
4/7/2018 13:51:09 | | [work_fetch] shortfall 118044.05 nidle 0.00 saturated 47498.36 busy 0.00


4/7/2018 13:51:59 | SETI@home | [sched_op] Starting scheduler request
4/7/2018 13:51:59 | SETI@home | Sending scheduler request: To fetch work.
4/7/2018 13:51:59 | SETI@home | Reporting 4 completed tasks
4/7/2018 13:51:59 | SETI@home | Requesting new tasks for CPU and NVIDIA GPU
4/7/2018 13:51:59 | SETI@home | [sched_op] CPU work request: 322682.44 seconds; 0.00 devices
4/7/2018 13:51:59 | SETI@home | [sched_op] NVIDIA GPU work request: 118182.39 seconds; 0.00 devices
4/7/2018 13:52:03 | SETI@home | Scheduler request completed: got 4 new tasks
4/7/2018 13:52:03 | SETI@home | [sched_op] Server version 709
4/7/2018 13:52:03 | SETI@home | Project requested delay of 303 seconds
4/7/2018 13:52:03 | SETI@home | [sched_op] estimated total CPU task duration: 4057 seconds
4/7/2018 13:52:03 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 716 seconds
ID: 1928684
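For anyone comparing these two kinds of log lines by hand, a small parser makes the figures easier to line up. This is a minimal sketch written against the exact lines quoted in the post above; it is not an official BOINC tool, and other client versions may format the lines differently.

```python
import re

# Patterns match the log lines quoted in the post (format is an assumption
# for any other client version).
STATE_RE = re.compile(r"\[work_fetch\] --- state for (?P<resource>.+?) ---")
FIGURES_RE = re.compile(
    r"\[work_fetch\] shortfall (?P<shortfall>[\d.]+) nidle (?P<nidle>[\d.]+) "
    r"saturated (?P<saturated>[\d.]+) busy (?P<busy>[\d.]+)"
)
REQUEST_RE = re.compile(
    r"\[sched_op\] (?P<resource>.+?) work request: (?P<seconds>[\d.]+) seconds"
)

LOG = """\
4/7/2018 13:51:09 | | [work_fetch] --- state for CPU ---
4/7/2018 13:51:09 | | [work_fetch] shortfall 322607.65 nidle 0.00 saturated 45669.58 busy 0.00
4/7/2018 13:51:09 | | [work_fetch] --- state for NVIDIA GPU ---
4/7/2018 13:51:09 | | [work_fetch] shortfall 118044.05 nidle 0.00 saturated 47498.36 busy 0.00
4/7/2018 13:51:59 | SETI@home | [sched_op] CPU work request: 322682.44 seconds; 0.00 devices
4/7/2018 13:51:59 | SETI@home | [sched_op] NVIDIA GPU work request: 118182.39 seconds; 0.00 devices
"""


def parse_log(lines):
    """Return ({resource: work_fetch figures}, {resource: requested seconds})."""
    states, requests = {}, {}
    current = None
    for line in lines:
        if (m := STATE_RE.search(line)):
            current = m.group("resource")
        elif (m := FIGURES_RE.search(line)) and current is not None:
            states[current] = {k: float(v) for k, v in m.groupdict().items()}
            current = None
        elif (m := REQUEST_RE.search(line)):
            requests[m.group("resource")] = float(m.group("seconds"))
    return states, requests


states, requests = parse_log(LOG.splitlines())
for resource, figures in states.items():
    # Compare the shortfall logged at 13:51:09 with the request sent at 13:51:59.
    print(resource, figures["shortfall"], requests.get(resource))
```

In this excerpt the request sent 50 seconds later is slightly larger than the logged shortfall for both resources.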
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1928686 - Posted: 7 Apr 2018, 21:10:17 UTC - in response to Message 1928583.  
Last modified: 7 Apr 2018, 21:11:49 UTC

So, for those hitting unexpected 'reached a limit of tasks in progress' messages, my suggestion would be to calculate how long it would take your particular CPU to complete 100 tasks. Say a task takes 1 hour, and you use 4 CPU cores to process them: the hundred tasks would take 25 hours - just over a day. Set your cache level just below that figure - one day and no additional would be neat in this simplified example - and your CPU should cruise along just below the CPU limit. That should allow the scheduler to process the GPU iteration of the loop, and send those tasks too. If each GPU finishes its own task in under 15 minutes (and for the people in this discussion, I expect that's true), then each GPU should get its own 100 allocation from a 1-day cache, too.

Looking at my CPU Average Turnaround Time it's 1.12 days.
The next time the Scheduler goes funny on us I'll change my cache setting down to 0.95 days or so (my additional is 0.05), save & update, and then see if the work starts flowing on the next report & request.

Thanks for your efforts.
Grant
Darwin NT
ID: 1928686
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928687 - Posted: 7 Apr 2018, 21:12:35 UTC - in response to Message 1928684.  

I wrote it in message 1900544 - getting too late here to write it up again (hic!).

You've left out the 'target work buffer:' line, which would tell us what your cache settings are.

Both CPU and GPU are in 'shortfall' - that's how much you're going to ask for - and you did. A little bit later, so a little more work had been done, you asked for a little more to make up for that.

'saturated' is how much work you currently have. 'saturated' plus 'shortfall' should add up to 'target' (possibly target times number of cores/GPUs, to make sure you fill them all - it's too late to look it up again)
ID: 1928687
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928690 - Posted: 7 Apr 2018, 21:17:14 UTC - in response to Message 1928686.  

Yes, we really will have to wait for the scheduler to go funny again before we can try what Richard suggests. I looked at my cpu turnaround times, which I think are easier to get a handle on. CPU turnaround on all hosts ranges from 0.56 days to 0.89 days. I'll try dropping to 0.5 or 0.4 days when the problem comes back and see if it makes any difference in getting the too many in progress message or not.
ID: 1928690
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1928691 - Posted: 7 Apr 2018, 21:18:05 UTC - in response to Message 1928599.  

Creation rate is ramping up at the moment.
We shall see if it keeps up.

Unfortunately, it's not.
It cranks up for a bit, dropping & rising, then falls over for an hour or more.
Looks like they can meet 114,000 returned per hour, but 134,000 per hour is beyond their abilities. Ready-to-send should be empty in another 9-12 hours at the present rate of decline.

With the new Database layout, it might (hopefully) just require some changes to some existing configurations to allow the splitters to sustain the output that they displayed shortly after the outage (peaks of 70/s, 50/s sustained).
Grant
Darwin NT
ID: 1928691
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928693 - Posted: 7 Apr 2018, 21:33:06 UTC - in response to Message 1928687.  

I wrote it in message 1900544 - getting too late here to write it up again (hic!).

You've left out the 'target work buffer:' line, which would tell us what your cache settings are.

Both CPU and GPU are in 'shortfall' - that's how much you're going to ask for - and you did. A little bit later, so a little more work had been done, you asked for a little more to make up for that.

'saturated' is how much work you currently have. 'saturated' plus 'shortfall' should add up to 'target' (possibly target times number of cores/GPUs, to make sure you fill them all - it's too late to look it up again)

Sorry, meant to include that.
4/7/2018 13:50:24 | | [work_fetch] target work buffer: 86400.00 + 864.00 sec
So target is one day or 864864 seconds.

Adding up shortfall and saturated is 4/7/2018 13:51:09 | | [work_fetch] shortfall 118044.05 nidle 0.00 saturated 47498.36 busy 0.00 or 165542 seconds of gpu work. Which is 46 hours of gpu work total. Divide by 3 gpus and you get 15.3 hours of work. Huh?

I do 1000 tasks per day at 240 seconds per task. Which is 240000 seconds of total work. Divide by 3 gpus and you get 22.2 hours of work. That is what I would need to keep all gpus fed for 24 hours.

Why the difference between what I really need and what the project says I need?
ID: 1928693
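The unit conversions in the post above can be reproduced in a few lines. The sketch only repeats the post's own arithmetic with the post's own numbers. Whether 'shortfall' and 'saturated' can simply be added and then divided by the GPU count is the open question, so treat the combined figure as the post's working assumption rather than as how the client computes it.

```python
SECONDS_PER_HOUR = 3600
GPUS = 3

# Figures copied from the [work_fetch] line for NVIDIA GPU.
shortfall_s = 118_044.05
saturated_s = 47_498.36

combined_h = (shortfall_s + saturated_s) / SECONDS_PER_HOUR
print(round(combined_h, 1), round(combined_h / GPUS, 1))   # 46.0 15.3

# Observed production: 1000 tasks a day at 240 seconds each.
daily_s = 1000 * 240
print(round(daily_s / GPUS / SECONDS_PER_HOUR, 1))          # 22.2
```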
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1928697 - Posted: 7 Apr 2018, 21:44:27 UTC - in response to Message 1928690.  

I don't want to burst any bubbles, however, I've always run my cache settings at about a day. Since I only run One or Two CPU tasks, my CPU caches are Always very low. Doesn't seem to help... does it?
It would be interesting to see someone 'provoke' the server to simply stop sending replacements, without changing the cache settings of course. As far as I know the server just decides to stop sending replacement tasks without being 'provoked'. That's the way it works for me anyway. Right now it's working normally, completed tasks are being replaced when reported. How would you provoke the server to stop sending replacements? I'm also waiting to hit that empty feeder, seems I've managed to avoid it for the last day. Just lucky I suppose.
ID: 1928697
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1928698 - Posted: 7 Apr 2018, 21:46:36 UTC - in response to Message 1928693.  

This really is going to be the last for tonight :-)

Sorry, meant to include that.
4/7/2018 13:50:24 | | [work_fetch] target work buffer: 86400.00 + 864.00 sec
So target is one day or 864864 seconds.
No it isn't - use a calculator. 87,264 seconds.

Adding up shortfall and saturated is 4/7/2018 13:51:09 | | [work_fetch] shortfall 118044.05 nidle 0.00 saturated 47498.36 busy 0.00 or 165542 seconds of gpu work. Which is 46 hours of gpu work total. Divide by 3 gpus and you get 15.3 hours of work. Huh?
OK, maybe I'm rusty.

Target (wall-clock time): 87,264
Saturated (maybe also wall-clock time): 47,498.36
Leaves shortfall: 39,765.64 - wall-clock, which means per GPU.
3 GPUs, total needed to fill 3 shortfalls: 119,296.92

Does that sound better?
ID: 1928698
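The reconciliation in the post above can be written out as a short calculation. It follows the post's tentative model (target and saturated are wall-clock times, and the shortfall is per GPU, then multiplied by the number of GPUs); that model is the poster's working guess, not confirmed client code, and the function name is hypothetical.

```python
def total_shortfall(target_s: float, saturated_s: float, instances: int) -> float:
    """Per-instance gap between target and saturated time, summed over instances."""
    per_instance = max(0.0, target_s - saturated_s)
    return per_instance * instances


target_s = 86_400.00 + 864.00           # 'target work buffer' line: 87,264 s
estimate = total_shortfall(target_s, 47_498.36, 3)
reported = 118_044.05                   # shortfall from the [work_fetch] log line

print(round(estimate, 2))               # about 119296.92
print(round(estimate - reported, 2))    # about 1252.87
```

Under this model the estimate lands within about 1% of the shortfall the client logged, which is a much closer match than adding 'shortfall' and 'saturated' together.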


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.