CPU tasks run slow after upgrade to 7.2.42

Message boards : Number crunching : CPU tasks run slow after upgrade to 7.2.42
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1490845 - Posted: 18 Mar 2014, 20:14:57 UTC

Just upgraded all three machines here from 7.2.33 to 7.2.42. On two of them the upgrade went fine. On the third (machine ID 7119149), however, the CPU tasks now seem to get less than 2% CPU usage after the restart, and have been crunching for days to get only 1-2% complete. When I first noticed this, I assumed it was a fluke related to the jobs in progress at the time I upgraded, so I suspended them to let new jobs start. Same result. Tried shutdowns and full reboots, no help. The only other thing I can think of would be to abandon the current CPU jobs and see if new ones get their proper share of resources.
Curious how I might have created such a situation, or whether anyone else has seen a similar issue. Given the nature of the machine, I might be better off running GPU-only anyway, but it would be interesting to see what's really going on, for future reference. Thoughts, anyone?
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1490983 - Posted: 19 Mar 2014, 0:56:59 UTC - in response to Message 1490845.  

My guess is it has nothing to do with the upgrade to 7.2.42. Instead, I think you have at least one heavily blanked AP OpenCL GPU task running which is getting most of the CPU time. WU 1454057484 has been completed by the wingmate and that result shows 59.89% blanking, for instance, so if your host is working its ap_17my13aa_B2_P0_00206_20140316_16962.wu_1 task it will be taking a lot of CPU time. The wingmates on AP tasks sent at the same time haven't reported yet, so I can't tell about those.

Because CPU tasks are launched by BOINC at the lowest possible priority, but GPU tasks at just "below normal", the CPU tasks don't get much CPU time in cases like that. If multitasking were perfectly efficient the CPU tasks wouldn't affect the GPU tasks at all, but of course it isn't. I think it quite likely that setting project prefs to not get CPU tasks would improve that system's productivity, and show it in higher RAC.
Joe
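
The priority effect described above can be illustrated with a toy model. This is a minimal sketch, assuming a strict-priority scheduler on a single core; real operating system schedulers are more complicated, and the task names, priority numbers, and demand figures below are made up for illustration.

```python
def allocate_cpu(tasks, capacity=1.0):
    """Hand out CPU time in strict priority order (toy model).

    tasks: list of (name, priority, demand) tuples, where a larger priority
    number means a higher priority and demand is the fraction of one core
    the task would use if nothing else were running.
    """
    remaining = capacity
    shares = {}
    # Highest priority first; each task takes what it wants from what is left.
    for name, priority, demand in sorted(tasks, key=lambda t: t[1], reverse=True):
        granted = min(demand, remaining)
        shares[name] = granted
        remaining -= granted
    return shares


# Hypothetical numbers: the CPU-side process of a GPU task at "below normal"
# wanting 98% of the core, and a CPU task at the lowest priority wanting 100%.
shares = allocate_cpu([
    ("cpu_task", 0, 1.00),
    ("gpu_task_feeder", 1, 0.98),
])
```

In this model the low-priority CPU task only gets what the higher-priority process leaves over, which is about 2% of the core with these made-up numbers.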
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1491059 - Posted: 19 Mar 2014, 5:51:16 UTC - in response to Message 1490983.  

Hey, Joe.
Thanks for chiming in with your thoughts on this. Not sure this can be blamed on any particular AP job, as I've run 30-40 APs across the GPUs (6 tasks at a time, 12 hr avg per AP task, 50+ hours since this started happening), and the CPU job percentage hasn't significantly changed. Also, I tried disabling GPU work entirely for a bit to see if the CPU work would pick up, and it did not. Perhaps this will give a better picture of what the system looks like:

[Screenshot not preserved]

Now, it may well be that this ancient iron needs to quit doing CPU jobs in order to maximize production of the 3 equipped GPUs, but it's really curious that prior to the 7.2.42 upgrade the CPU tasks could get 40-50% CPU to do their work, with an equivalent GPU load, and now I can't get better than 1-2% even without a GPU load.

I'm not suggesting that there's something wrong with the latest load, but it seems to me that something could possibly have screwed up during the upgrade that caused this.
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1491075 - Posted: 19 Mar 2014, 6:40:07 UTC
Last modified: 19 Mar 2014, 6:42:57 UTC

All I can say after seeing that screen shot is,...... nope, I can't say that here.

Not only is that poor P4 CPU being overworked, but so are that GT620 and those 2 GT610s.

Clearly you are asking far too much from that setup IMHO.

Just because you can do it doesn't mean that it's best to do so.

Cheers.
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1491079 - Posted: 19 Mar 2014, 6:52:45 UTC - in response to Message 1491075.  


lol.
I agree, but that's what I have to work with, competing with you "big boys". However, I can get a pretty regular 4500+ RAC out of not much $ on this box, and, as I mentioned, this was _not_ an issue before I upgraded and messed something up! Just need a clue where to look.
I may fall back to 7.2.33 and see if it changes anything (screen caps to follow if it does:) and try the upgrade again. I guess that would tell the story, but I'm really unclear as to how much of the processor allocation is fixed when a particular job is started, and how much it's dynamic based on overall system load.
I think I'm missing something, or tweaked something I shouldn't have!
MarkJ
Volunteer tester
Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 1491084 - Posted: 19 Mar 2014, 7:03:11 UTC

Are you sure it's not throttling? There are settings to do with suspending work when CPU usage exceeds a percentage. Have you checked these? Given BOINC itself doesn't do any crunching it sounds like it may be suspending and unsuspending. Messages should be in your BOINC event log if it is.
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1491092 - Posted: 19 Mar 2014, 7:18:06 UTC - in response to Message 1491084.  

Thanks for the note!
I do review the event logs, and see nothing there out of the ordinary. If it's throttling, I don't know how. This box is devoted entirely to crunching (doesn't even have anti-virus or other stuff running in background, no need) and in BOINC Manager > Computing Preferences > Processor usage it's devoted to 100% CPU use on all cores with no limits, as far as I can tell. Also, when I'm watching it with BOINCTasks, I'd expect to see jobs toggling back and forth between "Running" and "Waiting to run" or "Suspended" if that were the case, and I do not.
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1491096 - Posted: 19 Mar 2014, 7:34:48 UTC
Last modified: 19 Mar 2014, 7:35:09 UTC

There is a program called "Throttlewatch" that will tell you if the CPU is throttling itself.

I also noticed that you have a Pentium D CPU rig, and if you can swap either the cards or the CPUs it would certainly be a better choice for the load, as the Pentium D has 2 proper cores, unlike the P4, which has only 1 true core and 1 virtual.

Cheers.
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1491107 - Posted: 19 Mar 2014, 7:49:33 UTC - in response to Message 1491096.  

Thanks. I'll check for Throttlewatch. I had tried putting TThrottle on a while back, but the install blew up in my face and I bagged it as a bad job, much as I love BOINCTasks.

Yeah, this Prescott CPU is pretty ancient now, not to mention being a hot-running power hog, but I was doing nothing else with it, so why not. I'd love nothing more than to move the GPUs to a more proper machine, but sadly my HP box, which has the Pentium-D, only supports (and has a 610 in) one PCIE-X1 slot, where the Foxconn MB has 1 X16 and three X1 slots, albeit V.1. So although it was pretty weird to load up the less-capable machine, it was possible.
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1491134 - Posted: 19 Mar 2014, 9:11:03 UTC
Last modified: 19 Mar 2014, 9:28:07 UTC

I agree this is probably not a BOINC issue; it's more likely down to how you are using your limited CPU resources.

I have similar issues with slow CPUs driving fast GPUs. What I found in my tests is simply that the CPU can't feed the GPUs everything they need.

To find out whether that is your problem, you need to run some tests.

First of all, the build you are using to crunch AP (v1843) uses a lot of CPU to do its work. To avoid this, go to Mike's site, http://mikesworldnet.de/home, and download the NV AP6 V2058 build.

After installing it with the -use_sleep option enabled, you will see a big change in your CPU usage. Follow the instructions in the help file that comes with the build.

Now, with the low-CPU-usage build, you need to find the optimal point for your CPU.

To do that, first start BOINC with no CPU tasks running and one GPU WU crunching at a time, and look at your GPU usage. Then try 2 WUs at a time and find the best value for the GPU. You could try 3, but 1 or 2 is probably best for your GPU models.

Now start one CPU WU (only one), look at your run times and CPU/GPU usage, and find the optimal number of CPU WUs to crunch at a time.

On my slowest CPU I found something amazing: I can't do ANY CPU work or the entire system slows down, so I only do GPU work on those hosts. As somebody said earlier in the thread, more is not always better. On other hosts (with a still-slow CPU, an i5-2310) the test shows I can run up to 2 CPU WUs at a time, even with 3 WUs at a time on the GPUs, without slowing the host. Faster CPUs (i7 and up) allow more.

Since each host is unique, I can't tell you the optimal number without testing, but my guess is that with 3 GPUs on a host with a slow CPU, your optimal point is close to 2 WUs at a time on the GPU and either no CPU work or only 1 CPU WU at a time. 2 CPU WUs + 6 GPU WUs, as you are running now, is almost certainly not the best point. But only testing can tell you for sure.
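
One way to compare the test runs described above is by throughput instead of by how busy each device looks. The following is a minimal sketch, assuming you have timed how long a WU takes under each setting; the function names are hypothetical and the timings are placeholders, not measurements from any real host.

```python
def tasks_per_day(concurrent, hours_per_task):
    """Completed tasks per day when `concurrent` tasks run side by side
    and each one takes `hours_per_task` hours of wall-clock time."""
    return concurrent * 24.0 / hours_per_task


def best_config(measurements):
    """Return the label with the highest throughput.

    measurements maps a label to (concurrent tasks, hours per task).
    """
    return max(measurements, key=lambda label: tasks_per_day(*measurements[label]))


# Placeholder timings, NOT measurements from any real host.
gpu_runs = {
    "1 WU per GPU": (1, 5.0),
    "2 WUs per GPU": (2, 8.0),
    "3 WUs per GPU": (3, 14.0),
}
winner = best_config(gpu_runs)
```

With these placeholder numbers, running 2 WUs at once comes out ahead even though each WU takes longer than it does alone. The same comparison can be repeated after adding a CPU WU to see whether total throughput goes up or down.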

Try it; you will see some very interesting things about CPU vs. GPU usage on hosts with slow CPUs and fast GPUs.

I'm sure somebody could give us a technical explanation of why that happens, and might have a better way to find the optimal point, but until then testing is our only ally.

my 2 Cents
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1491138 - Posted: 19 Mar 2014, 9:21:27 UTC - in response to Message 1490845.  

On the third, however, after restart the CPU tasks now seem to get less than 2% CPU usage, and have thus been crunching for days to get only 1-2% complete.

Besides what has been said already... what's using the other 98%? The usage as seen on the posted screenshot adds up to something around 65%, so unless the CPU is partly idle, there must be something outside of BOINC using the other ~35%.

But considering that this is a P4 with just a single physical core (2 virtual), you really should limit the CPU usage of BOINC to 50%, i.e. 1 CPU task, or maybe even disable CPU crunching altogether considering the 3 GPUs it has to feed; they are for sure not running as fast as they could. Well, the GPUs are not fast either, so maybe you can let one CPU task run, but you should start with no CPU tasks, see how fast the GPUs can be, and then see how that changes with one CPU task. But my guess is that you'll get the most out of this machine with no CPU tasks at all, because you are running AP on the GPUs.
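
The arithmetic behind "50%, i.e. 1 CPU task" and "the other ~35%" can be written out as below. This is only a sketch: the rounding rule (round down, never below one task) is an assumption about how a percentage-of-processors limit is applied, and the function names are hypothetical.

```python
def max_cpu_tasks(logical_cpus, percent_of_cpus):
    """CPU tasks allowed when only a percentage of the processors may be used.

    Assumption: the result is rounded down, with a minimum of one task.
    """
    return max(1, int(logical_cpus * percent_of_cpus / 100))


def unaccounted_percent(boinc_percent, total_percent=100):
    """CPU share that is either idle or used by something outside BOINC."""
    return total_percent - boinc_percent


one_task = max_cpu_tasks(logical_cpus=2, percent_of_cpus=50)
leftover = unaccounted_percent(boinc_percent=65)
```

For the P4 in question, with 2 logical CPUs, a 50% limit allows one CPU task, and a BOINC total of about 65% leaves about 35% unexplained.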
Cruncher-American

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1491219 - Posted: 19 Mar 2014, 15:22:32 UTC

The problem is you are (at minimum) running more APs than you have cores. Since each AP requires a core by itself, you are getting tremendous amounts of pure thrashing - the system is going crazy switching between the various tasks running.

If you open Task Manager and turn on Show Kernel Times under View, you will likely see almost all red. That's CPU that's totally wasted because of overcommitting the CPU.
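
To put a rough number on the overcommitment, here is a minimal sketch that assumes, as stated above, that each AP GPU task wants a full core. The task counts (6 GPU tasks, 2 CPU tasks) and the 2 logical cores come from earlier posts in this thread; the function name is hypothetical.

```python
def overcommit_ratio(gpu_tasks, cpu_tasks, logical_cores, cores_per_gpu_task=1.0):
    """Cores requested divided by cores available.

    Assumption: every GPU task wants `cores_per_gpu_task` cores to feed it,
    and every CPU task wants one core.
    """
    demand = gpu_tasks * cores_per_gpu_task + cpu_tasks
    return demand / logical_cores


# 6 AP tasks on the GPUs plus 2 CPU tasks on a P4 with 2 logical cores.
ratio = overcommit_ratio(gpu_tasks=6, cpu_tasks=2, logical_cores=2)
```

Under that assumption the host is asked for four times as many cores as it has, so most of its time can go to switching between tasks.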
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1492124 - Posted: 20 Mar 2014, 19:32:03 UTC
Last modified: 20 Mar 2014, 19:58:09 UTC

A lot of good info here, and I appreciate it all. I do basically understand the impacts of the hardware I'm running, and how I'm using it. It just seemed weird that there was such a dramatic change between the two BOINC loads. So I did what I should have done before opening this thread: fell back to 7.2.33 to see if I could duplicate the disparity. I can't, so clearly 7.2.42 had nothing to do with it. {shrugs} The trap I fell into was that the CPU use information is apparently a "rolling average" for that particular job across its life on the machine, rather than a current measurement of performance.
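
The difference between the two readings can be sketched as follows. This assumes the displayed figure is total CPU time divided by total elapsed time for the task; whether BOINC or BOINCTasks computes it exactly that way is an assumption, and the sample readings are invented for illustration.

```python
def lifetime_usage(samples):
    """CPU usage averaged over the whole life of the task.

    samples: list of (elapsed_seconds, cpu_seconds) readings, oldest first.
    """
    elapsed, cpu = samples[-1]
    return cpu / elapsed


def recent_usage(samples):
    """CPU usage over the most recent sampling interval only."""
    (prev_elapsed, prev_cpu), (elapsed, cpu) = samples[-2], samples[-1]
    return (cpu - prev_cpu) / (elapsed - prev_elapsed)


# Invented readings: a task starved for a long time, then given most of a core
# for the last hour.
samples = [(0, 0), (100_000, 2_000), (103_600, 5_240)]
lifetime = lifetime_usage(samples)
recent = recent_usage(samples)
```

With these invented readings the lifetime figure is still only about 5%, while the task has actually been getting 90% of a core for the last hour, so a lifetime average can lag far behind what the machine is doing right now.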
It would be good to get a better understanding of how the resources get allocated, as I mentioned before. I've noticed before that it seems as though an allocation decision is made based on the system state when a given task commences, and that though it can increase or decrease from there based on system load, that only happens to a certain extent, as though the initial allocation is some kind of limiter. That's what I was trying to drill down to here.
So, for now, I cut back to one CPU to crunch the remaining CPU work so I don't have to abandon it, and once they're gone, GPU crunching only going forward on this box, as I suspected would be the case when I embarked on this adventure :)
Good news is that despite all the drama, configuring the system this way leaves me with a stable system that has a good 20-25% improvement in RAC over where I was before, even if it causes heart attacks to a few folks who look at it:)
Thanks to everyone who responded!



 