CPU tasks run slow after upgrade to 7.2.42
Author | Message |
---|---|
Jimbocous Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Just upgraded all three machines here to 7.2.42 from 7.2.33, and on two of the machines the upgrade went fine. On the third, however, after restart the CPU tasks now seem to get less than 2% CPU usage, and have thus been crunching for days to get only 1-2% complete. (machine ID 7119149) When I first noticed this, I assumed that it was a fluke related to the jobs in process at the time I upgraded, and suspended them to let new jobs start. Same result. Tried shutdowns, and full reboots, no help. Only other thing I can think of would be to abandon the current CPU jobs and see if new ones would get their proper share of resources. Curious how I might have created such a situation, or if anyone else has seen a similar issue. Given the nature of the machine, I might be better off running GPU-only anyway, but it would be interesting to see what's really going on, for future reference. Thoughts, anyone? |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
My guess is it has nothing to do with the upgrade to 7.2.42. Instead, I think you have at least one heavily blanked AP OpenCL GPU task running which is getting most of the CPU time. WU 1454057484 has been completed by the wingmate and that result shows 59.89% blanking, for instance, so if your host is working its ap_17my13aa_B2_P0_00206_20140316_16962.wu_1 task it will be taking a lot of CPU time. The wingmates on AP tasks sent at the same time haven't reported yet, so I can't tell about those. Because CPU tasks are launched by BOINC at the lowest possible priority, but GPU tasks at just "below normal", the CPU tasks don't get much CPU time in cases like that. If multitasking were perfectly efficient the CPU tasks wouldn't affect the GPU tasks at all, but of course it isn't. I think it quite likely that setting project prefs to not get CPU tasks would improve that system's productivity, and show it in higher RAC. Joe |
Jimbocous Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Josef W. Segur wrote: "My guess is it has nothing to do with the upgrade to 7.2.42. Instead, I think you have at least one heavily blanked AP OpenCL GPU task running which is getting most of the CPU time. WU 1454057484 has been completed by the wingmate and that result shows 59.89% blanking, for instance, so if your host is working its ap_17my13aa_B2_P0_00206_20140316_16962.wu_1 task it will be taking a lot of CPU time. The wingmates on AP tasks sent at the same time haven't reported yet, so I can't tell about those." Hey, Joe. Thanks for chiming in with your thoughts on this. Not sure this can be blamed on any particular AP job, as I've run 30-40 APs across the GPUs (6 tasks at a time, 12 hr avg per AP task, 50+ hours since this started happening), and the CPU job percentage hasn't significantly changed. Also, I tried disabling GPU work entirely for a bit, to see if the CPU work would pick up, and it has not. Perhaps this will give a better picture of what the system is looking like: [screenshot] Now, it may well be that this ancient iron needs to quit doing CPU jobs in order to maximize production of the 3 equipped GPUs, but it's really curious that prior to the 7.2.42 upgrade the CPU jobs could get 40-50% CPU to do their work, with an equivalent GPU load, and now I can't get better than 1-2% even without one. I'm not suggesting that there's something wrong with the latest load, but it seems to me that something could possibly have screwed up during the upgrade that caused this. |
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
All I can say after seeing that screen shot is,...... nope, I can't say that here. Not only is that poor P4 CPU being overworked, but so are that GT620 and those 2 GT610s. Clearly you are asking far too much from that setup IMHO. Just because you can do it doesn't mean that it's best to do so. Cheers. |
Jimbocous Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Wiggo wrote: "All I can say after seeing that screen shot is,...... nope, I can't say that here." lol. I agree, but that's what I have to work with, competing with you "big boys". However, I can get a pretty regular 4500+ RAC out of not much $ on this box, and, as I mentioned, this was _not_ an issue before I upgraded and messed something up! Just need a clue where to look. I may fall back to 7.2.33 and see if it changes anything (screen caps to follow if it does:) and try the upgrade again. I guess that would tell the story, but I'm really unclear as to how much of the processor allocation is fixed when a particular job is started, and how much it's dynamic based on overall system load. I think I'm missing something, or tweaked something I shouldn't have! |
MarkJ Joined: 17 Feb 08 Posts: 1139 Credit: 80,854,192 RAC: 5 |
Are you sure it's not throttling? There are settings to do with suspending work when CPU usage exceeds a percentage. Have you checked these? Given BOINC itself doesn't do any crunching it sounds like it may be suspending and unsuspending. Messages should be in your BOINC event log if it is. BOINC blog |
Jimbocous Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
MarkJ wrote: "Are you sure it's not throttling? There are settings to do with suspending work when CPU usage exceeds a percentage. Have you checked these? Given BOINC itself doesn't do any crunching it sounds like it may be suspending and unsuspending. Messages should be in your BOINC event log if it is." Thanks for the note! I do review the event logs, and see nothing there out of the ordinary. If it's throttling, I don't know how. This box is devoted entirely to crunching (doesn't even have anti-virus or other stuff running in background, no need) and in BOINC Manager > Computing Preferences > Processor usage it's devoted to 100% CPU use on all cores with no limits, as far as I can tell. Also, when I'm watching it with BOINCTasks, I'd expect to see jobs toggling back and forth between "Running" and "Waiting to run" or "Suspended" if that were the case, and I do not. |
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
There is a program called "Throttlewatch" that will tell you if the CPU is throttling itself. I also noticed that you have a Pentium D CPU rig, and if you can swap either the cards or the CPUs it would certainly be a better choice for the load, as the Pentium D has 2 proper cores, unlike the P4, which has only 1 true core and 1 virtual. Cheers. |
Jimbocous Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Wiggo wrote: "There is a program called 'Throttlewatch' that will tell you if the CPU is throttling itself." Thanks. I'll check for Throttlewatch. I had tried putting TThrottle on a while back, but the install blew up in my face and I bagged it as a bad job, much as I love BOINCTasks. Yeah, this Prescott CPU is pretty ancient now, not to mention being a hot-running power hog, but I was doing nothing else with it, so why not. I'd love nothing more than to move the GPUs to a more proper machine, but sadly my HP box, which has the Pentium-D, only supports (and has a 610 in) one PCIE-X1 slot, where the Foxconn MB has 1 X16 and three X1 slots, albeit V.1. So although it was pretty weird to load up the less-capable machine, it was possible. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
I agree this is probably not a BOINC issue; it is more likely down to how you are using your limited CPU resources. I have similar issues with small CPUs driving fast GPUs. What I found in my tests is simply that the CPU can't feed the GPUs everything they need. To be sure whether that is your problem, you need to run some tests. First of all, the build you are using to crunch AP (v1843) uses a lot of CPU to do its work. To avoid this, go to Mike's site, http://mikesworldnet.de/home, and download the NV AP6 V2058 build. After installing it with the -use_sleep option enabled you will see a big change in your CPU usage. Follow the instructions in the help file that comes with the build. Now, with the low-CPU-usage build, you need to find the optimal point for your CPU. To do that, first start BOINC with no CPU tasks running and one GPU WU crunching at a time, and look at your GPU usage. Then try 2 WUs at a time and find the best value for the GPU; you could try 3, but 1 or 2 should be best for your GPU models. Now start CPU WUs one at a time (only one to begin with), watch your run times and CPU/GPU usage, and find the optimal number of CPU WUs to crunch at a time. On my slowest CPU I found something surprising: I can't do ANY CPU work or the entire system slows down, so I only do GPU work on those hosts. As somebody said earlier in the thread, more is not always better. On other hosts (still with a slow CPU, an i5-2310) the tests show I can run up to 2 CPU WUs at a time, even with 3 WUs at a time on the GPUs, without slowing the host; faster CPUs (i7 and up) allow more. Since each host is unique I can't tell you the optimal number without testing, but my guess is that with 3 GPUs on a host with a slow CPU your optimal point will be close to 2 WUs at a time on the GPU and no CPU work, or only 1 CPU task, at a time. The 2 CPU WUs + 6 GPU WUs you are using now is, I'm almost sure, not the best point. But only testing can confirm that. Try it and you will see some very interesting things about CPU vs. GPU usage on hosts with slow CPUs and fast GPUs. 
I'm sure somebody could give us a technical explanation of why that happens, and may have a better way to find the optimal point, but until then testing is our only ally in finding it. My 2 cents |
Link Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
Jimbocous wrote: "On the third, however, after restart the CPU tasks now seem to get less than 2% CPU usage, and have thus been crunching for days to get only 1-2% complete." Besides what has been said already... what's using the other 98%? The usage as seen on the posted screenshot adds up to something around 65%, so unless the CPU is partly idle, there must be something outside of BOINC using the other ~35%. But considering that this is a P4 with just a single physical core (2 virtual), you really should limit the CPU usage of BOINC to 50%, i.e. 1 CPU task, or maybe even disable CPU crunching altogether considering the 3 GPUs it has to feed; they are for sure not running as fast as they could. Well, the GPUs are not fast either, so maybe you can let one CPU task run, but you should start with no CPU tasks and see how fast the GPUs can be, and then see how that changes with one CPU task. But my guess is that you'll get the most out of this machine with no CPU tasks at all, because you are running AP on the GPUs. |
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340 |
The problem is you are (at minimum) running more APs than you have cores. Since each AP requires a core by itself, you are getting tremendous amounts of pure thrashing - the system is going crazy switching between the various tasks running. If you open Task Manager and turn on Show Kernel Times under View, you will likely see almost all red. That's CPU that's totally wasted because of overcommitting the CPU. |
Jimbocous Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
A lot of good info here, and I appreciate it all. I do basically understand the impacts of the hardware I'm running, and how I'm using it. It just seemed weird that there was such a dramatic change between the two BOINC loads. So I did what I should have done before opening this thread: fell back to 7.2.33 to see if I could duplicate the disparity. I can't, so clearly 7.2.42 had nothing to do with it. {shrugs} The trap I fell into was the fact that the CPU use information is apparently a "rolling average" for that particular job across its life on the machine, rather than a current measurement of performance. It would be good to get a better understanding of how the resources get allocated, as I mentioned before. I've noticed before that it seems as though an allocation decision is made based on the system state when a given task commences, and that though it can increase or decrease from there based on system load, that only happens to a certain extent, as though the initial allocation is some kind of limiter. That's what I was trying to drill down to here. So, for now, I cut back to one CPU to crunch the remaining CPU work so I don't have to abandon it, and once they're gone, GPU crunching only going forward on this box, as I suspected would be the case when I embarked on this adventure :) Good news is that despite all the drama, configuring the system this way leaves me with a stable system that has a good 20-25% improvement in RAC over where I was before, even if it causes heart attacks to a few folks who look at it:) Thanks to everyone who responded! |
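
On the "rolling average" point in the last post, here is a minimal sketch of why a lifetime-average CPU figure can lag far behind what a task is getting right now. It assumes the displayed percentage is total CPU seconds divided by total elapsed seconds, which is an interpretation of that post and not confirmed BOINC or BOINCTasks behavior. The `Sample` class, both function names, and all the numbers are hypothetical and for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation of a task (hypothetical structure, not a BOINC type)."""
    elapsed_s: float  # wall-clock seconds since the task first started
    cpu_s: float      # CPU seconds the task has consumed so far


def lifetime_cpu_share(latest: Sample) -> float:
    """CPU seconds divided by wall-clock seconds over the task's whole life."""
    if latest.elapsed_s <= 0:
        return 0.0
    return latest.cpu_s / latest.elapsed_s


def recent_cpu_share(previous: Sample, latest: Sample) -> float:
    """CPU share over just the interval between two samples."""
    dt = latest.elapsed_s - previous.elapsed_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return (latest.cpu_s - previous.cpu_s) / dt


# Illustrative numbers only; they are not taken from the host in this thread.
earlier = Sample(elapsed_s=100_000.0, cpu_s=2_000.0)
later = Sample(elapsed_s=101_000.0, cpu_s=2_500.0)

print(f"lifetime average: {lifetime_cpu_share(later):.1%}")
print(f"last interval:    {recent_cpu_share(earlier, later):.1%}")
```

With these made-up samples the task used half a core over the last interval, yet its lifetime figure stays in the low single digits because of the long starved period before it. Under the stated assumption, that is how a reconfiguration can look like it changed nothing.
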
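
The test procedure juan BFP describes amounts to comparing throughput between configurations. Below is a rough sketch, assuming every task in a configuration takes about the same wall-clock time. The function names are made up for this example. Only the "6 tasks at a time, 12 hr avg per AP task" figure comes from the thread; the second data point is a placeholder to be replaced with real measurements.

```python
def tasks_per_day(concurrent_tasks: int, hours_per_task: float) -> float:
    """Completed tasks per day when `concurrent_tasks` run side by side."""
    if concurrent_tasks < 0:
        raise ValueError("concurrent_tasks must not be negative")
    if hours_per_task <= 0:
        raise ValueError("hours_per_task must be positive")
    return concurrent_tasks * 24.0 / hours_per_task


def best_configuration(measurements: dict[str, tuple[int, float]]) -> str:
    """Return the name of the configuration with the highest throughput.

    `measurements` maps a label to (concurrent tasks, average hours per task).
    """
    return max(measurements, key=lambda name: tasks_per_day(*measurements[name]))


measurements = {
    # Reported in the thread: 6 AP tasks at a time, about 12 hours each.
    "6 GPU tasks": (6, 12.0),
    # Placeholder values, NOT measured: substitute your own test results.
    "3 GPU tasks (hypothetical)": (3, 5.0),
}

for name, (count, hours) in measurements.items():
    print(f"{name}: {tasks_per_day(count, hours):.1f} tasks/day")
print("best:", best_configuration(measurements))
```

Comparing tasks per day rather than time per task matters here, because running fewer tasks at once only pays off if each one speeds up by more than the reduction in concurrency.
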
©2024 University of California