Message boards :
Number crunching :
Just added 3rd GPU and CPU is 'Waiting for Memory'
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
The title says it all really, but here goes. I've just added another GPU (a second 750Ti to go with the Titan and the other 750Ti) and now my CPU tasks halt occasionally, saying 'Waiting for memory'. The machine has 16GB of RAM, so I doubt it's that. I'm certain it's this whole concept of 'leaving a core free to feed the GPU', which I've never understood and never encountered before. So, Oh Wise Ones, time to educate me and help me tune the app_info.xml file to work best!

Here's a brief overview of what I think are the important bits. Please ask for more info if you want it; I'm more than happy to give it, as I hope threads like this will help others further down the line.

I'm running the latest Lunatics (0.43a) and probably have the worst setup imaginable in my app_info file. At the moment it's set to run 4 CUDA tasks per GPU, using 0.04 of a CPU and 0.25 of a GPU. Here's an example snippet; all the CUDA sections are set up like this (Astropulse differs only in that I've set it to 0.33 GPU, the rest is the same):

<app_version>
    <app_name>setiathome_v7</app_name>
    <version_num>700</version_num>
    <platform>windows_intelx86</platform>
    <plan_class>cuda50</plan_class>
    <avg_ncpus>0.040000</avg_ncpus>
    <max_ncpus>0.040000</max_ncpus>
    <coproc>
        <type>CUDA</type>
        <count>0.25</count>
    </coproc>

I have a feeling it's not as simple as setting <avg_ncpus> or <max_ncpus> to 1.0, is it?

~W
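For readers following along: the snippet above is truncated after the coproc block. With the closing tags restored, the same app_version entry would look roughly like this (note that a real anonymous-platform app_info.xml entry also names the executable and its supporting files via file_ref elements, omitted here for brevity):

```xml
<app_version>
    <app_name>setiathome_v7</app_name>
    <version_num>700</version_num>
    <platform>windows_intelx86</platform>
    <plan_class>cuda50</plan_class>
    <avg_ncpus>0.040000</avg_ncpus>
    <max_ncpus>0.040000</max_ncpus>
    <coproc>
        <type>CUDA</type>
        <count>0.25</count>
    </coproc>
    <!-- file_ref elements for the app executable would follow here -->
</app_version>
```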
HAL9000 | Joined: 11 Sep 99 | Posts: 6534 | Credit: 196,805,888 | RAC: 57
So previously you were running OK with 4 instances per GPU now?

SETI@home classic workunits: 93,865 | CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
So previously you were running OK with 4 instances per GPU now?

That's correct. With 2 x GPUs (either the 2 x Titans, or 1 x Titan and 1 x 750Ti) all was good: I had 4 CUDA tasks running per GPU (8 GPU CUDA tasks total) and 1 task per CPU core (8 CPU tasks total).

Now, with 3 x GPUs (1 x Titan and 2 x 750Ti), 4 or 5 of the CPU tasks (it flips a bit) sit there saying 'Waiting for memory'. I do, however, still have 4 CUDA tasks running per GPU (12 GPU CUDA tasks total).

Make sense? As I said, I'm sure I'm not being efficient with the workload I'm assigning the GPUs/CPUs.

~W
HAL9000 | Joined: 11 Sep 99 | Posts: 6534 | Credit: 196,805,888 | RAC: 57
So previously you were running OK with 4 instances per GPU now?

In your BOINC Computing preferences, what do you have for:

When computer is not in use, use at most
Page/swap file: use at most

I think the default values are something like 40 or 50%, which, if I'm doing my maths correctly, should be fine for 12 GPU + 8 CPU tasks. However, BOINC seems to think otherwise at the moment, so I would try bumping up those values if you have not already done so.
Cruncher-American | Joined: 25 Mar 02 | Posts: 1513 | Credit: 370,893,186 | RAC: 340
As far as the "waiting for memory" bit: you could look in Task Manager, find out how much memory each CPU and GPU task is taking, and compare that to your actual RAM. That would give you a firm idea of whether the WFM is actual memory, or something else misleadingly labeled. For example, in my case CPU tasks take about 35MB each and GPU tasks about 125MB each, so 8*35 + 12*125 is < 2GB; I don't think WFM is referring to real RAM if you have 4GB or more.

Perhaps 20 threads fighting over the CPU is causing excessive system overhead(?). Turn on (in Task Manager) View -> Show Kernel Times. If the graphs are mostly red, then that is likely the problem: the system is thrashing trying to support all those compute-bound threads. Remember, you are running 8 + 12*.04 cores' worth of threads, even by BOINC/SETI's estimate. If you have only 8 cores, you are going to be switching tasks A LOT (hence more red in the graphs).

I bet it would help a lot if you went to 7 CPU tasks, leaving one CPU for the 12 GPU tasks. And if they are HT cores, it's even worse, since they already share resources pair-wise.
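Cruncher-American's back-of-the-envelope check can be scripted. The per-task sizes below are just the rough figures quoted in the post above, not measurements from any particular host:

```python
# Sanity-check the RAM budget for the task mix described above.
# Sizes are the approximate per-task working sets quoted in the post (MB).
CPU_TASK_MB = 35
GPU_TASK_MB = 125

def total_task_ram_mb(n_cpu_tasks: int, n_gpu_tasks: int) -> int:
    """Total RAM (MB) occupied by all running science apps."""
    return n_cpu_tasks * CPU_TASK_MB + n_gpu_tasks * GPU_TASK_MB

total = total_task_ram_mb(8, 12)
print(f"{total} MB total")  # 1780 MB, comfortably under 2 GB
```

With 16GB installed, real RAM exhaustion is clearly not the culprit, which is why the thread turns to BOINC's own memory accounting next.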
Zalster | Joined: 27 May 99 | Posts: 5517 | Credit: 528,817,460 | RAC: 242
4 work units per 750? Very ambitious. I'm sure the Titan has no problem with that, but I think you're stressing those 750s. I'd take it down to 3 work units per 750. That might still be too much, but Jason seems to think that under the best conditions you could get three. I only run 2 on mine, as I notice lock-ups, but that has to do with my AMD chip.

The best option is teaming the Titan with a similar GPU that doesn't hamstring it. I guess with this new BOINC you might be able to direct how many work units go to a specific GPU; I haven't tried that yet. I'd first try reducing the total number per GPU, and if that relieves the problem you'll know which direction to go.
Josef W. Segur | Joined: 30 Oct 99 | Posts: 4504 | Credit: 1,414,761 | RAC: 0
If previous suggestions don't help, I suggest setting the mem_usage_debug log flag in cc_config.xml. That will produce multiple lines in the event log each time BOINC decides what tasks should be running, so turning it off again after it captures the usage info would be sensible. The "Waiting for memory" is based on the smoothed working set size of each active task. That is, BOINC begins with the available RAM and for each task it's going to start or leave running it subtracts that smoothed value. If available goes negative, the task is not started but remains in the active task list. Joe |
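Joe's description of the check can be sketched in a few lines of Python (a simplification; the names and structure here are illustrative, not the actual BOINC client code):

```python
def pick_runnable(tasks, available_ram_mb):
    """Greedy sketch of the memory check described above: walk the
    active-task list, subtracting each task's smoothed working-set
    size from available RAM. A task that would push the remaining
    total negative is not started -- it shows 'Waiting for memory'
    but stays in the active task list."""
    running, waiting = [], []
    remaining = available_ram_mb
    for task in tasks:
        if remaining - task["smoothed_ws_mb"] >= 0:
            remaining -= task["smoothed_ws_mb"]
            running.append(task["name"])
        else:
            waiting.append(task["name"])  # displayed as 'Waiting for memory'
    return running, waiting

# Hypothetical example: eight 36 MB CPU tasks, only 100 MB allowed.
tasks = [{"name": f"cpu{i}", "smoothed_ws_mb": 36} for i in range(8)]
running, waiting = pick_runnable(tasks, available_ram_mb=100)
print(running)  # the first two fit (72 MB); the rest wait
print(waiting)
```

The key point is that "available RAM" here is what the user's memory preferences allow, not physical RAM, which is why bumping the percentage settings was a reasonable first experiment.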
Grant (SSSF) | Joined: 19 Aug 99 | Posts: 13736 | Credit: 208,696,464 | RAC: 304
As I said, I'm sure I'm not being efficient with the workload I'm assigning the GPUs/CPUs.

Nope. My GTX 750Tis produce more work per hour running only 2 at a time (I'm MB only). 3 at a time was very close, but not quite as good. 4 at a time would have resulted in significantly fewer WUs crunched per hour than running 2 at a time. Grant Darwin NT
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
As ever, I want to give my thanks to everyone involved with helping me troubleshoot things. You all chip in with the useful stuff and I just do the legwork :) This is going to be a long one, sorry. I'm also going to take things out of order, to apply them in a sensible troubleshooting sequence. Again, I'm being verbose on the off chance that this will help someone else.

- - - - -

As far as the "waiting for memory" bit: you could look in Task Manager, find out how much memory each CPU and GPU task is taking, and compare that to your actual RAM. That would give you a firm idea of whether the WFM is actual memory, or something else misleadingly labeled.

OK, they ARE HT cores (4 physical cores), so that's a consideration. Please remember I'm not a Windows person, so if I'm reading this wrongly, apologies. Firstly, I can't see where to turn on Show Kernel Times; I certainly can't see it in Task (or Resource) Manager's 'View' menu. Still, it's not essential, as I think your simple equation has shown me something very important: that I need to reduce the number of CPU tasks. Which is what I thought.

With regards to RAM, it appears that:
The CUDA tasks are taking between 105MB and 130MB each.
The CPU tasks seem to be taking 36MB each.
The amount of physical memory being used (the number reported at the bottom of the window) fluctuates between 22% and 25%. This, to me, says I'm using about 4GB of the 16GB in the system, so there's plenty of overhead there.

If I look under the 'Performance' tab I see:
Physical Memory (MB)
Total: 16322
Cached: 3267
Available: 12316
Free: 9208

So I don't think it's an actual RAM problem… probably. I'll come back to the number of tasks in a minute.

- - - - -

In your BOINC Computing preferences, what do you have for…

OK, just done a check and an experiment:
In Use was at 50%, changed to 80%.
Not In Use was at 80%, changed to 90%.
Page/Swap was at 20%, changed to 90%.

Forced an update, and it doesn't seem to have affected things. My thinking is I won't see any change, as I wasn't anywhere near using 50% of RAM with the original settings, so upping the allocation won't help. I checked all the same info as above with the new settings and I was right: there was no change. (I've changed it back for now, as I can easily change it again should I need to.)

- - - - -

If previous suggestions don't help, I suggest setting the mem_usage_debug log flag in cc_config.xml. That will produce multiple lines in the event log each time BOINC decides what tasks should be running, so turning it off again after it captures the usage info would be sensible.

For the sake of completeness, I'll mention this essentially showed me what Task Manager showed me. It's always good to remember to read the logs, people! :)

- - - - -

My GTX 750Tis produce more work per hour running only 2 at a time (I'm MB only). 3 at a time was very close, but not quite as good. 4 at a time would have resulted in significantly fewer WUs crunched per hour than running 2 at a time.

4 work units per 750? Very ambitious. I'm sure the Titan has no problem with that, but I think you're stressing those 750s. I'd take it down to 3 work units per 750. [...] The best option is teaming the Titan with a similar GPU that doesn't hamstring it. I guess with this new BOINC you might be able to direct how many work units go to a specific GPU.

Well, I HAD teamed it with another Titan, but it died :( I was wondering if it was possible to set the number of tasks per GPU; there's something in the back of my brain nudging me, saying it's come up in a thread of mine before, but I'm going to type this before researching it.

- - - - -

OK, so here's what I'm going to do. I'm going to reduce the number of GPU tasks per card. This is a bit of a trade-off: as has been pointed out, the Titan can handle 4 tasks but the 750Ti can't. So I'm going to split the difference and drop them to 3 tasks each. This should, by jravin's equation, reduce the number of threads the CPU is trying to contend with to 8+(9*.04).

And… it worked. BUT! That's still probably higher than it should be, so I should drop the number of CPU tasks. But here's the question: how?

~W
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
Addendum: I found where to set the CPU core count. In cc_config.xml, the following option:

<ncpus>N</ncpus>
Act as if there were N CPUs; e.g. to simulate 2 CPUs on a machine that has only 1. To use the number of available CPUs, set the value to -1 (it was 0, which in newer clients really means zero).

So setting ncpus to 7 should, theoretically, free up a core to feed the GPUs.

EDIT: Yep, that's the ticket. And all is quiet again. Until next time.

~W
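For anyone wanting to replicate this, a minimal cc_config.xml using that option might look like the following (a sketch assuming the standard options layout; put the file in the BOINC data directory and restart the client, or tell it to re-read the config files):

```xml
<cc_config>
    <options>
        <!-- Report 7 CPUs so one core stays free to feed the GPUs -->
        <ncpus>7</ncpus>
    </options>
</cc_config>
```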
Grant (SSSF) | Joined: 19 Aug 99 | Posts: 13736 | Credit: 208,696,464 | RAC: 304
.... the Titan can handle 4 tasks but the 750TI can't.

It's not a case of can't; it's a case of it not being efficient. And I suspect it's the same with the Titan: it may very well be able to crunch 5 or even 6 WUs at a time, but what good is that if you end up doing less work? Even with the Titan, there's a good chance that 2 WUs at a time will give the most work per hour. 3 would most likely give slightly less, and I'd suggest 4 at a time gives you much less return than just running 2 at a time. Grant Darwin NT
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
.... the Titan can handle 4 tasks but the 750TI can't.

That's really what I meant by "can" and "can't"; I see your point entirely. I'm going to give it a couple of days, wait for the weekly DB housekeeping to be done today, and then see what setting gives me the best bang for my buck. I ALWAYS learn something useful from these threads. (I still want a couple of K80s and, failing that, a couple of Titan Xs.)

EDIT: Why not? 2 it is.

~W
Grant (SSSF) | Joined: 19 Aug 99 | Posts: 13736 | Credit: 208,696,464 | RAC: 304
I'm going to give it a couple of days; wait for the weekly DB housekeeping to be done today and then see what setting gives me the best bang for my buck. Hopefully the weekly outage will sort out whatever is wrong at the moment. Another 15-20min & there will be no work left to download. Grant Darwin NT |
Cruncher-American | Joined: 25 Mar 02 | Posts: 1513 | Credit: 370,893,186 | RAC: 340
The key here is GPU utilization. Once you reach the upper-90% area, adding more WUs at a time doesn't help, and may (again because of competition for resources and internal task switching, this time on the GPU rather than the CPU) hurt the total work done. On my GTX 780s I run 3 WUs simultaneously, and I find the incremental work done (over 2) to be fairly small; a 4th wouldn't help at all. Note that the amount of RAM on these (and your) cards could easily support 5 or more, but they would all stretch out and the extra work done would be either negligible or negative.
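This diminishing-returns point can be illustrated with a toy calculation. The per-task runtimes below are invented purely for illustration, not benchmarks from any real card: as concurrent WUs stretch each task's runtime, throughput in WU/hour flattens and eventually drops.

```python
# Toy model of GPU throughput vs. concurrent work units.
# Hypothetical per-task runtimes (minutes), chosen only to show
# the shape of the curve described above.
runtime_min = {1: 10, 2: 11, 3: 15, 4: 24}

throughput = {}
for n, t in runtime_min.items():
    # n tasks complete roughly every t minutes -> n * (60 / t) WU/hour
    throughput[n] = n * 60 / t
    print(f"{n} at a time: {throughput[n]:.2f} WU/hour")
```

With these made-up numbers the gain from 1 to 2 is large, 2 to 3 is marginal, and 4 at a time is a net loss: exactly the pattern the utilization argument predicts once the GPU is already saturated.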
Zalster | Joined: 27 May 99 | Posts: 5517 | Credit: 528,817,460 | RAC: 242
Woodgie, you might want to get SIVX64 (I hope I wrote that right). It's a program that will give you a general idea of how your system is behaving while crunching, once you learn to read it.
jason_gee | Joined: 24 Nov 06 | Posts: 7489 | Credit: 91,093,184 | RAC: 0
There are also some complications, added for the sake of completeness, that make multiple GPUs with lots of VRAM much more complex than in the past. These include:

- the Windows display driver model mirrors VRAM for display driver recovery purposes (in this 3-card case, to the tune of some ~6 GiB of kernel-space 'shared' memory), and
- PCI Express lanes are limited (16 lanes on the i7-4770K, to cover the 3 video cards and any other devices in the system).

The first item, when you dig really deeply, covers the majority of why an extreme example (retired) host, Windows XP with 4 x old GTX 295s (8 GPUs total, with some 7 GiB of physical VRAM), while viable under old-style XP drivers and small amounts of host RAM, will tend to choke early under more modern 'hybrid' drivers. With respect to the current host (i7 + 3 larger GPUs), that's a lot of the 16 GiB of physical RAM gobbled up, and it will come out of the 8 GiB half that is kernel space (leaving 2 GiB for the OS and drivers, though likely plenty for application user space).

The second item has more of an impact on how many tasks can be 'fed' by the CPU in limited time, remembering that turns have to be taken on the PCI Express links, and a lot of activity there can be met with sitting and waiting. Hyperthreading would probably double that queue contention. I didn't know or think about this limitation extensively in the past, though it becomes pretty important in modern workstation operation, which is probably why the likes of Xeon processors with more PCIe lanes have been becoming popular even in high-end gaming rigs, just to feed the faster GPUs more promptly.

For x42 (the next major CUDA multibeam revision) I've been gradually engineering ways to make the application less 'chatty', which should reduce the issues there, though with GPUs getting faster all the time, it's taking some time to find the best ways to make things scale better and more automatically in the future.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
BilBg | Joined: 27 May 07 | Posts: 3720 | Credit: 9,385,827 | RAC: 0
- Windows display driver model mirrors VRAM for display driver recovery purposes

How often does this mirroring take place? (I assume you mean the whole VRAM (?) is copied (by some DMA controller?) to main computer RAM every X seconds?)

- ALF - "Find out what you don't do well ..... then don't do it!" :)
jason_gee | Joined: 24 Nov 06 | Posts: 7489 | Credit: 91,093,184 | RAC: 0
- Windows display driver model mirrors VRAM for display driver recovery purposes

It's complex, though in this post-classic-XP mechanism most operations generally occur via a kernel-memory 'staging area', which then transmits the commands/data (sometimes combined for optimisation purposes). So, in effect, you have a virtual GPU in host memory that the applications talk to through a user-mode driver helper (virtualisation of the GPU resources).

That's a more complex kind of 'double-buffering' than simple mirroring, which explains the increased latencies, and why extreme gamer benchmarks stuck with old XP for so long; it amounted to a performance penalty of 10% or so at its introduction with Vista. (Since then, newer GPUs have added DMA engines, faster and more numerous DMA engines, and more latency-hiding mechanisms.) Later XP drivers added some (enough) of the virtualisation to keep applications compatible, though as hybrid drivers they then attain all the scaling limits and latencies, without the benefit of new hardware and lots of RAM on top.

In terms of the amount of VRAM being mirrored, it's this number right here for my 4 GiB physical VRAM 980 (Win7 x64): [screenshot]

Fortunately, or unfortunately, depending on the usage, that virtualisation of the video memory is paged. If you actually start filling things up to the extent that system resources are low, then you'll see effects similar to (or worse than) host memory excessively paging to disk (i.e. usually unusable). Naturally, adding more host memory to modern standards is only an option on 64-bit systems etc., so extreme care is needed if selecting modern GPUs for a 32-bit desktop version of Windows.

[Edit:] Note that Windows 10 and DirectX 12 are supposed to be changing this model. I've not seen details, though Tek Syndicate mentioned at least SLI configurations stacking VRAM, so that's different. The picture may change completely if they want to compete with Mantle for latency.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
jason_gee. Wow, thanks for that! Proper technical insight into a proper technical subject. I find it fascinating. May I ask a few questions on behalf of the class? I'll say now that I understand it's a complex subject and generalisations are going to have to be made.

(1) When you say 'Windows display driver model', I take it you mean Microsoft have dictated: "This is how you need to write a driver to interface between your hardware and the OS, because this is how we've designed the OS"?

(2) Can you tell us how this differs from Linux and Mac OS X, and does it make a difference to how efficient the platform is as a number-crunching entity? That is, does the latency introduced by the Windows 'double-buffering' affect how fast the same work would be crunched on Windows vs. Linux/OS X, all other things being equal? (Yes, I am aware I'm asking you to explain how long a piece of quantum superstring is :) )

(3) Would adding more RAM help the issue, i.e. reduce paging, or is it "not as simple as that"? I've got 8GB of VRAM across the 3 cards (4+2+2), so I'm assuming it's trying to reserve 8GB of kernel space to call its own. (I don't know where to find the window you showed, to check.)

~W
Josef W. Segur | Joined: 30 Oct 99 | Posts: 4504 | Credit: 1,414,761 | RAC: 0
So Win7 Ultimate x64 with Woodgie's 6GB GTX Titan plus two 2GB 750ti's would like to have more than 9GB of shared kernel memory for that VRAM backup. With 16GB installed RAM implying 8GB kernel memory there must be some workaround in the driver model. {edit} Standard memory for the GTX Titan is 6GB, and the OpenCL AP task details are showing that amount, so I assume the card actually does have it even though the CUDA task details only show 4GB. Joe |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.