No CPU work, only CUDA, won't even request



Message boards : Number crunching : No CPU work, only CUDA, won't even request

Harri Liljeroos
Joined: 29 May 99
Posts: 46
Credit: 19,671,051
RAC: 7,718
Finland
Message 925573 - Posted: 12 Aug 2009, 15:32:03 UTC

Hi,
I just ran into a situation where BOINC does not request any work for my CPU, only for the GPU. The CPU has nothing to crunch at the moment.

I'm attached to 4 projects with the following shares: CPDN 30, Einstein 25, LHC 150 and SETI 100. CPDN and Einstein are set to NNT (no new tasks; Einstein is down anyway), and LHC doesn't have anything to crunch, so that leaves only SETI. But the CPU is just idling. This system has previously been doing both GPU and CPU WUs (the last 6.03 CPU WU finished yesterday). Since then the CPU was doing Einstein, and now those are finished too (5 ready to report and 5 uploading).

BOINC is 6.6.36 (WinVista 32), opti apps SSSE3 (Lunatics 0.2 Unified Installer w. VLARnokill), CPU=Q9400, GPU=9800GT (driver 190.38, CUDA 2.3). Preferences are set to accept all kinds of work, and all types are in my app_info.xml.

I switched on the log_flag for work_fetch_debug, and here is the output:

12/08/2009 17:51:11 Starting BOINC client version 6.6.36 for windows_intelx86
12/08/2009 17:51:11 log flags: task, file_xfer, sched_ops, work_fetch_debug
12/08/2009 17:51:11 Libraries: libcurl/7.19.4 OpenSSL/0.9.8j zlib/1.2.3
12/08/2009 17:51:11 Data directory: C:\ProgramData\BOINC
12/08/2009 17:51:11 Running under account Harri Liljeroos
12/08/2009 17:51:11 Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz [x86 Family 6 Model 23 Stepping 10]
12/08/2009 17:51:11 Processor features: fpu tsc pae nx sse sse2 pni mmx
12/08/2009 17:51:11 OS: Microsoft Windows Vista: Home Premium x86 Edition, Service Pack 1, (06.00.6001.00)
12/08/2009 17:51:11 Memory: 3.00 GB physical, 6.22 GB virtual
12/08/2009 17:51:11 Disk: 931.51 GB total, 234.54 GB free
12/08/2009 17:51:11 Local time is UTC +3 hours
12/08/2009 17:51:11 CUDA device: GeForce 9800 GT (driver version 19038, compute capability 1.1, 512MB, est. 65GFLOPS)
12/08/2009 17:51:11 SETI@home Found app_info.xml; using anonymous platform
12/08/2009 17:51:11 Not using a proxy
12/08/2009 17:51:11 climateprediction.net URL: http://climateprediction.net/; Computer ID: 963208; location: work; project prefs: default
12/08/2009 17:51:11 Einstein@Home URL: http://einstein.phys.uwm.edu/; Computer ID: 1653564; location: home; project prefs: default
12/08/2009 17:51:11 lhcathome URL: http://lhcathome.cern.ch/lhcathome/; Computer ID: 9755093; location: work; project prefs: default
12/08/2009 17:51:11 SETI@home URL: http://setiathome.berkeley.edu/; Computer ID: 4649867; location: work; project prefs: default
12/08/2009 17:51:11 SETI@home General prefs: from SETI@home (last modified 08-Aug-2009 13:46:23)
12/08/2009 17:51:11 SETI@home Computer location: work
12/08/2009 17:51:11 General prefs: using separate prefs for work
12/08/2009 17:51:11 Reading preferences override file
12/08/2009 17:51:11 Preferences limit memory usage when active to 3070.26MB
12/08/2009 17:51:11 Preferences limit memory usage when idle to 3070.26MB
12/08/2009 17:51:11 Preferences limit disk usage to 100.00GB
12/08/2009 17:51:11 [work_fetch_debug] Request work fetch: Prefs update
12/08/2009 17:51:11 [work_fetch_debug] Request work fetch: Startup
12/08/2009 17:51:11 [work_fetch_debug] Request work fetch: Backoff ended for climateprediction.net
12/08/2009 17:51:11 [work_fetch_debug] Request work fetch: Backoff ended for Einstein@Home
12/08/2009 17:51:11 [work_fetch_debug] Request work fetch: Backoff ended for lhcathome
12/08/2009 17:51:11 [work_fetch_debug] Request work fetch: Backoff ended for SETI@home
12/08/2009 17:51:11 SETI@home chosen: CUDA major shortfall
12/08/2009 17:51:11 [wfd] ------- start work fetch state -------
12/08/2009 17:51:11 [wfd] target work buffer: 172800.00 + 0.00 sec
12/08/2009 17:51:11 [wfd] CPU: shortfall 684437.92 nidle 3.96 est. delay 0.00 RS fetchable 0.00 runnable 0.00
12/08/2009 17:51:11 climateprediction.net [wfd] CPU: fetch share 0.00 debt -77128.59 backoff dt 0.00 int 0.00 (no new tasks)
12/08/2009 17:51:11 Einstein@Home [wfd] CPU: fetch share 0.00 debt -4802812.58 backoff dt 0.00 int 0.00 (no new tasks) (overworked)
12/08/2009 17:51:11 lhcathome [wfd] CPU: fetch share 0.00 debt 0.00 backoff dt 20023.97 int 61440.00
12/08/2009 17:51:11 SETI@home [wfd] CPU: fetch share 0.00 debt 0.00 backoff dt 10501.36 int 30720.00
12/08/2009 17:51:11 [wfd] CUDA: shortfall 3747.88 nidle 0.00 est. delay 169052.12 RS fetchable 100.00 runnable 100.00
12/08/2009 17:51:11 climateprediction.net [wfd] CUDA: fetch share 0.00 debt 0.00 backoff dt 0.00 int 0.00 (no new tasks)
12/08/2009 17:51:11 Einstein@Home [wfd] CUDA: fetch share 0.00 debt 0.00 backoff dt 0.00 int 86400.00 (no new tasks)
12/08/2009 17:51:11 lhcathome [wfd] CUDA: fetch share 0.00 debt 0.00 backoff dt 10930.40 int 86400.00
12/08/2009 17:51:11 SETI@home [wfd] CUDA: fetch share 1.00 debt 0.00 backoff dt 0.00 int 480.00
12/08/2009 17:51:11 climateprediction.net [wfd] overall_debt -77129
12/08/2009 17:51:11 Einstein@Home [wfd] overall_debt -4802813
12/08/2009 17:51:11 lhcathome [wfd] overall_debt 0
12/08/2009 17:51:11 SETI@home [wfd] overall_debt 0
12/08/2009 17:51:11 [wfd] ------- end work fetch state -------
12/08/2009 17:51:11 SETI@home [wfd] request: CPU (0.00 sec, 0) CUDA (3747.88 sec, 0)
12/08/2009 17:51:11 SETI@home Sending scheduler request: To fetch work.
12/08/2009 17:51:11 SETI@home Requesting new tasks for GPU


Can somebody interpret why BOINC is not requesting any work for the CPU?

____________

Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1151
Credit: 3,936,993
RAC: 0
United States
Message 925578 - Posted: 12 Aug 2009, 15:47:35 UTC - in response to Message 925573.
Last modified: 12 Aug 2009, 15:50:23 UTC

12/08/2009 17:51:11 climateprediction.net [wfd] overall_debt -77129
12/08/2009 17:51:11 Einstein@Home [wfd] overall_debt -4802813
12/08/2009 17:51:11 lhcathome [wfd] overall_debt 0
12/08/2009 17:51:11 SETI@home [wfd] overall_debt 0


I might be reading it wrong, but it looks to me like debt for the other projects.
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8757
Credit: 52,706,896
RAC: 28,048
United Kingdom
Message 925584 - Posted: 12 Aug 2009, 16:06:13 UTC
Last modified: 12 Aug 2009, 16:30:22 UTC

Well done for finding and using the [wfd] flag - it reveals all.

It's the line:

12/08/2009 17:51:11 SETI@home [wfd] CPU: fetch share 0.00 debt 0.00 backoff dt 10501.36 int 30720.00

The idea is, starting with the v6.6 range of BOINC clients, that if no work is available to be sent out, BOINC doesn't waste everybody's time by continually asking for it. The retry interval ('int') doubles at each failure to get any work, up to a maximum of 86,400 (seconds - 1 day). At the moment, you still have 10,501.36 seconds ('dt' - just under 3 hours) to wait until the next retry - but read on, all is not lost.
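For anyone who wants to pick those two numbers out of their own log, here is a minimal Python sketch. The regular expression is my own guess at the line layout, based only on the [wfd] lines quoted in this thread; it is not taken from the BOINC source, and the name parse_wfd is made up for illustration.

```python
import re

# Assumed layout of the per-project [wfd] lines, inferred from the log in this thread.
WFD_LINE = re.compile(
    r"(?P<project>\S+) \[wfd\] (?P<resource>\w+): fetch share (?P<share>[\d.]+) "
    r"debt (?P<debt>-?[\d.]+) backoff dt (?P<dt>[\d.]+) int (?P<interval>[\d.]+)"
)

def parse_wfd(line):
    """Return the backoff fields of one [wfd] line as a dict, or None if the line does not match."""
    m = WFD_LINE.search(line)
    if m is None:
        return None
    return {
        "project": m["project"],
        "resource": m["resource"],
        "dt": float(m["dt"]),              # seconds left until the next retry
        "interval": float(m["interval"]),  # current retry interval in seconds
    }

line = ("12/08/2009 17:51:11 SETI@home [wfd] CPU: fetch share 0.00 "
        "debt 0.00 backoff dt 10501.36 int 30720.00")
info = parse_wfd(line)
hours_left = info["dt"] / 3600
print(f"{info['project']} {info['resource']}: {hours_left:.2f} h until next retry")
```

Any line where 'dt' is greater than zero is a project/resource pair that is currently backed off.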

Unfortunately, BOINC servers don't tell the client why they're not receiving any work, and the BOINC client wouldn't behave any differently even if it knew. So you get the same one-day maximum backoff whether Einstein is down (fileserver crash - could be a couple of days), LHC has vanished into its own private black hole (for the last several months), CPDN hasn't written a CUDA app (and probably never will) - or SETI is a bit busy right now and should have some more in the feeder cache in a couple of seconds. I've said that before, but it fell on deaf ears.

The backoff interval, and the time until the next retry, are reset to zero when any of three things happen:

  • You get some of the work it isn't asking for
  • An existing task finishes running (but you haven't got any)
  • You click the 'Update project' button

So it turns out to be very simple - just click that button (once is all it needs), and work requests will resume. If you don't succeed after the first few attempts (which will come in quick succession), click it again.
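As a rough illustration of the doubling and the reset described above, here is a toy Python model. It is only a sketch under stated assumptions: the class name, the method names and the 480-second first interval are hypothetical (480 simply happens to be the smallest 'int' in the log above), and the real client may well set 'dt' differently from the plain "dt = interval" used here.

```python
MAX_INTERVAL = 86400.0    # one day, the cap mentioned above
FIRST_INTERVAL = 480.0    # assumption for illustration only

class Backoff:
    """Toy model of one backoff timer; not the real BOINC implementation."""

    def __init__(self):
        self.interval = 0.0   # 'int' in the [wfd] lines
        self.dt = 0.0         # 'dt': seconds until the next retry

    def request_failed(self):
        """A work request returned nothing: double the interval, up to the cap."""
        if self.interval == 0.0:
            self.interval = FIRST_INTERVAL
        else:
            self.interval = min(self.interval * 2, MAX_INTERVAL)
        self.dt = self.interval   # simplification, see the note above

    def reset(self):
        """Work arrived, a task finished, or the user clicked 'Update project'."""
        self.interval = 0.0
        self.dt = 0.0

timer = Backoff()
intervals = []
for _ in range(9):
    timer.request_failed()
    intervals.append(timer.interval)
print(intervals)

timer.reset()   # e.g. the 'Update project' button
print(timer.interval, timer.dt)
```

With that assumed starting value, the seventh and eighth failures give 30720 and 61440 seconds, which matches the 'int' values shown for SETI and LHC in the log, and the next failure hits the one-day cap.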

Just why BOINC management chose to hide all this information in the changelogs and debug messages, I'll leave for someone else to explain.

Edit - to show I'm not making all this up, it's in changesets [17664] and [17665]. Those came out in BOINC v6.6.18

Harri Liljeroos
Joined: 29 May 99
Posts: 46
Credit: 19,671,051
RAC: 7,718
Finland
Message 925632 - Posted: 12 Aug 2009, 19:03:09 UTC

Thank you for your answers. Things have moved along: CPU work has now been downloaded and the host is crunching.
____________

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 925748 - Posted: 13 Aug 2009, 2:24:47 UTC - in response to Message 925584.

Just why BOINC management chose to hide all this information in the changelogs and debug messages, I'll leave for someone else to explain.

Edit - to show I'm not making all this up, it's in changesets [17664] and [17665]. Those came out in BOINC v6.6.18

I'm not sure I agree with your characterization of those two changesets, but this is one of those horrid Catch-22's.

If you don't back off, you can have 100,000 hosts hitting the project servers constantly -- very effectively delivering a self-inflicted DoS.

If you do back off, you can create a corner where you need work, but you don't want to hammer the servers to get it.

CUDA is still new enough some of those corners are still being discovered. Then you gotta figure out how to fix 'em.
____________

Harri Liljeroos
Joined: 29 May 99
Posts: 46
Credit: 19,671,051
RAC: 7,718
Finland
Message 925779 - Posted: 13 Aug 2009, 7:13:22 UTC

Does the backoff work only halfway, if it was only affecting the CPU requests and not the GPU requests? If a request to the server is made anyway, shouldn't it then ask for work for both GPU and CPU at the same time if the host is lacking work for both?
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8757
Credit: 52,706,896
RAC: 28,048
United Kingdom
Message 925784 - Posted: 13 Aug 2009, 9:54:33 UTC - in response to Message 925779.

Does the backoff work only halfway, if it was only affecting the CPU requests and not the GPU requests? If a request to the server is made anyway, shouldn't it then ask for work for both GPU and CPU at the same time if the host is lacking work for both?

The backoff is calculated and applied separately for every individual project/resource combination. So SETI/CPU and SETI/CUDA have different backoffs, and one may be requesting work when the other isn't. I had a situation yesterday while I was researching that reply, where a host had a shortfall of 40,000 seconds for CUDA, and 200 seconds for CPU - yet it only asked for (and got) the CPU work, during a CUDA backoff.
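To make the "separate for every project/resource combination" point concrete, here is a small Python sketch using the 'dt' and shortfall values from the log earlier in the thread. The dictionary layout and the function name work_request are made up for illustration, and the rule is deliberately simplified: the real client also looks at fetch share, debt, "no new tasks" and more.

```python
# Seconds of backoff remaining ('dt'), keyed by (project, resource).
# Values copied from the [wfd] lines earlier in this thread.
backoff_dt = {
    ("SETI@home", "CPU"): 10501.36,
    ("SETI@home", "CUDA"): 0.0,
    ("lhcathome", "CPU"): 20023.97,
    ("lhcathome", "CUDA"): 10930.40,
}

# Shortfall in seconds per resource, from the same log.
shortfall = {"CPU": 684437.92, "CUDA": 3747.88}

def work_request(project):
    """Seconds of work to ask for per resource; 0 while that pair is still backed off."""
    request = {}
    for resource, seconds in shortfall.items():
        backed_off = backoff_dt.get((project, resource), 0.0) > 0.0
        request[resource] = 0.0 if backed_off else seconds
    return request

print(work_request("SETI@home"))
print(work_request("lhcathome"))
```

For SETI this gives 0 seconds for CPU and 3747.88 seconds for CUDA, the same split as the "request: CPU (0.00 sec, 0) CUDA (3747.88 sec, 0)" line in the log, even though the CPU shortfall is far larger.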

That sounds odd, but you have to remember that BOINC is designed to be as general as possible, and not make assumptions about how any particular project or its applications are going to work.

So, you and I know that for both SETI and Einstein, the CUDA work is exactly the same as CPU work, and a CUDA task can be 'rebranded' for processing on the CPU. So the temptation is to lump them together, and say "if you get work for anything, clear both sets of backoff". But in doing so, we're using the typical human selective memory - we're forgetting to allow for AP and S5R5, which can be processed by CPU but not (yet) by CUDA. So maybe clearing the CUDA backoff isn't always such a good idea.

And it gets worse. AQUA have just suspended further development of their CUDA application (because it's so much slower than their CPU application - yup). So although CUDA work was available in the past, it won't be for a long time to come (and CPU and CUDA work was never interchangeable there). And CPDN will probably never develop a CUDA app - too much data to handle. That's another backoff you never want to reset.

No, unless you want BOINC to implement a whole rules-based management system (and if you do, you'd better try writing it yourself), it's probably better to keep everything independent. And it's less work for projects - if you allow for complex rules, the projects would have to supply and maintain the data inputs for the rules to work on - we're having enough problems getting them to manage simple values like <rsc_fpops_est>!

@ Ned,

Yes, those two changesets were only the final coats of polish on a system that was implemented even earlier - trac is quite cumbersome for historical research, and I gave up looking when I'd found something even vaguely relevant - even that took me back five months! It's just interesting that the backoffs have been in use for all that time, and this is (so far as I can remember) the first time that anyone has asked in detail about them on this board.

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 925857 - Posted: 13 Aug 2009, 18:55:07 UTC - in response to Message 925784.
Last modified: 13 Aug 2009, 18:57:39 UTC


@ Ned,

Yes, those two changesets were only the final coats of polish on a system that was implemented even earlier - trac is quite cumbersome for historical research, and I gave up looking when I'd found something even vaguely relevant - even that took me back five months! It's just interesting that the backoffs have been in use for all that time, and this is (so far as I can remember) the first time that anyone has asked in detail about them on this board.

All of this is based on very conflicting needs. You need to keep work on the clients, and you need to keep the clients from DoSing the servers.

... and you can't use the servers to tell the clients that they need to stop DoSing the servers because the servers are undergoing a denial-of-service attack and can't answer to tell you to stop being part of the attack.

Edit: it's also something that may look fine 99.9% of the time, and only cause these odd "edges" under very unique circumstances.
____________


Copyright © 2014 University of California