Message boards :
Number crunching :
Just added 3rd GPU and CPU is 'Waiting for Memory'
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
The title says it all really, but here goes. I've just added another GPU (a second 750Ti to go with the Titan and the other 750Ti) and now my CPU tasks halt occasionally, saying 'Waiting for memory'. The machine has 16GB of RAM, so I doubt it's that. I'm certain it's this whole concept of 'leaving a core free to feed the GPU', which I've never understood and never encountered before. So, Oh Wise Ones, time to educate me and help me tune the app_info.xml file to work best!

Here's a brief overview of what I think are the important bits. Please ask for more info if you want it; I'm more than happy to give it, as I hope threads like this will help others further down the line.

I'm running the latest Lunatics (0.43a) and probably have the worst setup imaginable in my app_info file. At the moment it's set to run 4 CUDA tasks per GPU, using 0.04 of a CPU and 0.25 of a GPU. Here's an example snippet; all the CUDA sections are set up like this (Astropulse differs only in that I've set it to 0.33 GPU, the rest is the same):

<app_version>
    <app_name>setiathome_v7</app_name>
    <version_num>700</version_num>
    <platform>windows_intelx86</platform>
    <plan_class>cuda50</plan_class>
    <avg_ncpus>0.040000</avg_ncpus>
    <max_ncpus>0.040000</max_ncpus>
    <coproc>
        <type>CUDA</type>
        <count>0.25</count>
    </coproc>

I have a feeling it's not as simple as setting <avg_ncpus> or <max_ncpus> to 1.0, is it?

~W
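For readers following along: the snippet above is truncated after the coproc block. With the closing tags restored, the same app_version entry would look roughly like this (note that a real anonymous-platform app_info.xml entry also names the executable and its supporting files via file_ref elements, omitted here for brevity):

```xml
<app_version>
    <app_name>setiathome_v7</app_name>
    <version_num>700</version_num>
    <platform>windows_intelx86</platform>
    <plan_class>cuda50</plan_class>
    <avg_ncpus>0.040000</avg_ncpus>
    <max_ncpus>0.040000</max_ncpus>
    <coproc>
        <type>CUDA</type>
        <count>0.25</count>
    </coproc>
    <!-- file_ref elements for the app executable would follow here -->
</app_version>
```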
HAL9000 | Joined: 11 Sep 99 | Posts: 6534 | Credit: 196,805,888 | RAC: 57
So previously you were running OK with 4 instances per GPU now?

SETI@home classic workunits: 93,865 | CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
So previously you were running OK with 4 instances per GPU now?

That's correct. With 2 x GPUs (either the 2 x Titans, or 1 x Titan and 1 x 750Ti) all was good: I had 4 CUDA tasks running per GPU (8 GPU CUDA tasks total) and 1 task per CPU core (8 CPU tasks total).

Now, with 3 x GPUs (1 x Titan and 2 x 750Ti), 4 or 5 of the CPU tasks (it flips a bit) sit there saying 'Waiting for memory'. I do, however, still have 4 CUDA tasks running per GPU (12 GPU CUDA tasks total).

Make sense? As I said, I'm sure I'm not being efficient with the workload I'm assigning the GPUs/CPUs.

~W
HAL9000 | Joined: 11 Sep 99 | Posts: 6534 | Credit: 196,805,888 | RAC: 57
So previously you were running OK with 4 instances per GPU now?

In your BOINC Computing preferences, what do you have for:

When computer is not in use, use at most
Page/swap file: use at most

I think the default values are something like 40 or 50%, which, if I'm doing my maths correctly, should be fine for 12 GPU + 8 CPU tasks. However, BOINC seems to think otherwise at the moment, so I would try bumping up those values if you have not already done so.
Cruncher-American | Joined: 25 Mar 02 | Posts: 1513 | Credit: 370,893,186 | RAC: 340
As far as the "waiting for memory" bit: you could look in Task Manager, find out how much memory each CPU and GPU task is taking, and compare that to your actual RAM. That would give you a firm idea of whether the WFM is actual memory, or something else misleadingly labeled. For example, in my case CPU tasks take about 35MB each and GPU tasks about 125MB each, so 8*35 + 12*125 is < 2GB; I don't think WFM is referring to real RAM if you have 4GB or more.

Perhaps 20 threads fighting over the CPU is causing excessive system overhead(?). Turn on (in Task Manager) View -> Show Kernel Times. If the graphs are mostly red, then that is likely the problem: the system is thrashing trying to support all those compute-bound threads. Remember, you are running 8 + 12*.04 cores' worth of threads, even by BOINC/SETI's estimate. If you have only 8 cores, you are going to be switching tasks A LOT (hence more red in the graphs).

I bet it would help a lot if you went to 7 CPU tasks, leaving one CPU for the 12 GPU tasks. And if they are HT cores, it's even worse, since they already share resources pair-wise.
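Cruncher-American's back-of-the-envelope check can be scripted. The per-task sizes below are just the rough figures quoted in the post above, not measurements from any particular host:

```python
# Sanity-check the RAM budget for the task mix described above.
# Sizes are the approximate per-task working sets quoted in the post (MB).
CPU_TASK_MB = 35
GPU_TASK_MB = 125

def total_task_ram_mb(n_cpu_tasks: int, n_gpu_tasks: int) -> int:
    """Total RAM (MB) occupied by all running science apps."""
    return n_cpu_tasks * CPU_TASK_MB + n_gpu_tasks * GPU_TASK_MB

total = total_task_ram_mb(8, 12)
print(f"{total} MB total")  # 1780 MB, comfortably under 2 GB
```

With 16GB installed, real RAM exhaustion is clearly not the culprit, which is why the thread turns to BOINC's own memory accounting next.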
Zalster | Joined: 27 May 99 | Posts: 5517 | Credit: 528,817,460 | RAC: 242
4 work units per 750? Very ambitious. I'm sure the Titan has no problem with that, but I think you're stressing those 750s. I'd take it down to 3 work units per 750. That might still be too much, but Jason seems to think that under the best conditions you could get three. I only run 2 on mine, as I notice lock-ups, but that has to do with my AMD chip.

The best option is teaming the Titan with a similar GPU that doesn't hamstring it. I guess with this new BOINC you might be able to direct how many work units go to a specific GPU; I haven't tried that yet. I'd first try reducing the total number per GPU, and if that relieves the problem you'll know which direction to go.
Josef W. Segur | Joined: 30 Oct 99 | Posts: 4504 | Credit: 1,414,761 | RAC: 0
If previous suggestions don't help, I suggest setting the mem_usage_debug log flag in cc_config.xml. That will produce multiple lines in the event log each time BOINC decides what tasks should be running, so turning it off again after it captures the usage info would be sensible. The "Waiting for memory" is based on the smoothed working set size of each active task. That is, BOINC begins with the available RAM and for each task it's going to start or leave running it subtracts that smoothed value. If available goes negative, the task is not started but remains in the active task list. Joe |
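Joe's description of the check can be sketched in a few lines of Python (a simplification; the names and structure here are illustrative, not the actual BOINC client code):

```python
def pick_runnable(tasks, available_ram_mb):
    """Greedy sketch of the memory check described above: walk the
    active-task list, subtracting each task's smoothed working-set
    size from available RAM. A task that would push the remaining
    total negative is not started -- it shows 'Waiting for memory'
    but stays in the active task list."""
    running, waiting = [], []
    remaining = available_ram_mb
    for task in tasks:
        if remaining - task["smoothed_ws_mb"] >= 0:
            remaining -= task["smoothed_ws_mb"]
            running.append(task["name"])
        else:
            waiting.append(task["name"])  # displayed as 'Waiting for memory'
    return running, waiting

# Hypothetical example: eight 36 MB CPU tasks, only 100 MB allowed.
tasks = [{"name": f"cpu{i}", "smoothed_ws_mb": 36} for i in range(8)]
running, waiting = pick_runnable(tasks, available_ram_mb=100)
print(running)  # the first two fit (72 MB); the rest wait
print(waiting)
```

The key point is that "available RAM" here is what the user's memory preferences allow, not physical RAM, which is why bumping the percentage settings was a reasonable first experiment.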
Grant (SSSF) | Joined: 19 Aug 99 | Posts: 13736 | Credit: 208,696,464 | RAC: 304
As I said, I'm sure I'm not being efficient with the workload I'm assigning the GPUs/CPUs.

Nope. My GTX 750Tis produce more work per hour running only 2 at a time (I'm MB only). 3 at a time was very close, but not quite as good. 4 at a time would have resulted in significantly fewer WUs crunched per hour than running 2 at a time. Grant Darwin NT
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
As ever, I want to give my thanks to everyone involved with helping me troubleshoot things. You all chip in with the useful stuff and I just do the legwork :) This is going to be a long one, sorry. I'm also going to take things out of order, to apply them in a sensible troubleshooting sequence. Again, I'm being verbose on the off chance that this will help someone else.

- - - - -

As far as the "waiting for memory" bit: you could look in Task Manager, find out how much memory each CPU and GPU task is taking, and compare that to your actual RAM. That would give you a firm idea of whether the WFM is actual memory, or something else misleadingly labeled.

OK, they ARE HT cores (4 physical cores), so that's a consideration. Please remember I'm not a Windows person, so if I'm reading this wrongly, apologies. Firstly, I can't see where to turn on Show Kernel Times; I certainly can't see it in Task (or Resource) Manager's 'View' menu. Still, it's not essential, as I think your simple equation has shown me something very important: that I need to reduce the number of CPU tasks. Which is what I thought.

With regards to RAM, it appears that:
The CUDA tasks are taking between 105MB and 130MB each.
The CPU tasks seem to be taking 36MB each.
The amount of physical memory being used (the number reported at the bottom of the window) fluctuates between 22% and 25%. This, to me, says I'm using about 4GB of the 16GB in the system, so there's plenty of overhead there.

If I look under the 'Performance' tab I see:
Physical Memory (MB)
Total: 16322
Cached: 3267
Available: 12316
Free: 9208

So I don't think it's an actual RAM problem… probably. I'll come back to the number of tasks in a minute.

- - - - -

In your BOINC Computing preferences, what do you have for…

OK, just done a check and an experiment:
In Use was at 50%, changed to 80%.
Not In Use was at 80%, changed to 90%.
Page/Swap was at 20%, changed to 90%.

Forced an update, and it doesn't seem to have affected things. My thinking is I won't see any change, as I wasn't anywhere near using 50% of RAM with the original settings, so upping the allocation won't help. I checked all the same info as above with the new settings and I was right: there was no change. (I've changed it back for now, as I can easily change it again should I need to.)

- - - - -

If previous suggestions don't help, I suggest setting the mem_usage_debug log flag in cc_config.xml. That will produce multiple lines in the event log each time BOINC decides what tasks should be running, so turning it off again after it captures the usage info would be sensible.

For the sake of completeness, I'll mention this essentially showed me what Task Manager showed me. It's always good to remember to read the logs, people! :)

- - - - -

My GTX 750Tis produce more work per hour running only 2 at a time (I'm MB only). 3 at a time was very close, but not quite as good. 4 at a time would have resulted in significantly fewer WUs crunched per hour than running 2 at a time.

4 work units per 750? Very ambitious. I'm sure the Titan has no problem with that, but I think you're stressing those 750s. I'd take it down to 3 work units per 750. [...] The best option is teaming the Titan with a similar GPU that doesn't hamstring it. I guess with this new BOINC you might be able to direct how many work units go to a specific GPU.

Well, I HAD teamed it with another Titan, but it died :( I was wondering if it was possible to set the number of tasks per GPU; there's something in the back of my brain nudging me, saying it's come up in a thread of mine before, but I'm going to type this before researching it.

- - - - -

OK, so here's what I'm going to do. I'm going to reduce the number of GPU tasks per card. This is a bit of a trade-off: as has been pointed out, the Titan can handle 4 tasks but the 750Ti can't. So I'm going to split the difference and drop them to 3 tasks each. This should, by jravin's equation, reduce the number of threads the CPU is trying to contend with to 8+(9*.04).

And… it worked. BUT! That's still probably higher than it should be, so I should drop the number of CPU tasks. But here's the question: how?

~W
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
Addendum: I found where to set the CPU core count. In cc_config.xml, the following option:

<ncpus>N</ncpus>
Act as if there were N CPUs; e.g. to simulate 2 CPUs on a machine that has only 1. To use the number of available CPUs, set the value to -1 (it was 0, which in newer clients really means zero).

So setting ncpus to 7 should, theoretically, free up a core to feed the GPUs.

EDIT: Yep, that's the ticket. And all is quiet again. Until next time.

~W
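For anyone wanting to replicate this, a minimal cc_config.xml using that option might look like the following (a sketch assuming the standard options layout; put the file in the BOINC data directory and restart the client, or tell it to re-read the config files):

```xml
<cc_config>
    <options>
        <!-- Report 7 CPUs so one core stays free to feed the GPUs -->
        <ncpus>7</ncpus>
    </options>
</cc_config>
```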
Grant (SSSF) | Joined: 19 Aug 99 | Posts: 13736 | Credit: 208,696,464 | RAC: 304
.... the Titan can handle 4 tasks but the 750TI can't.

It's not a case of can't; it's a case of it not being efficient. And I suspect it's the same with the Titan: it may very well be able to crunch 5 or even 6 WUs at a time, but what good is that if you end up doing less work? Even with the Titan, there's a good chance that 2 WUs at a time will give the most work per hour. 3 would most likely give slightly less, and I'd suggest 4 at a time gives you much less return than just running 2 at a time. Grant Darwin NT
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
.... the Titan can handle 4 tasks but the 750TI can't.

That's really what I meant by "can" and "can't"; I see your point entirely. I'm going to give it a couple of days, wait for the weekly DB housekeeping to be done today, and then see what setting gives me the best bang for my buck. I ALWAYS learn something useful from these threads. (I still want a couple of K80s and, failing that, a couple of Titan Xs.)

EDIT: Why not? 2 it is.

~W
Grant (SSSF) | Joined: 19 Aug 99 | Posts: 13736 | Credit: 208,696,464 | RAC: 304
I'm going to give it a couple of days; wait for the weekly DB housekeeping to be done today and then see what setting gives me the best bang for my buck. Hopefully the weekly outage will sort out whatever is wrong at the moment. Another 15-20min & there will be no work left to download. Grant Darwin NT |
Cruncher-American | Joined: 25 Mar 02 | Posts: 1513 | Credit: 370,893,186 | RAC: 340
The key here is GPU utilization. Once you reach the upper-90% area, adding more WUs at a time doesn't help, and may (again because of competition for resources and internal task switching, this time on the GPU rather than the CPU) hurt the total work done. On my GTX 780s I run 3 WUs simultaneously, and I find the incremental work done (over 2) to be fairly small; a 4th wouldn't help at all. Note that the amount of RAM on these (and your) cards could easily support 5 or more, but they would all stretch out and the extra work done would be either negligible or negative.
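This diminishing-returns point can be illustrated with a toy calculation. The per-task runtimes below are invented purely for illustration, not benchmarks from any real card: as concurrent WUs stretch each task's runtime, throughput in WU/hour flattens and eventually drops.

```python
# Toy model of GPU throughput vs. concurrent work units.
# Hypothetical per-task runtimes (minutes), chosen only to show
# the shape of the curve described above.
runtime_min = {1: 10, 2: 11, 3: 15, 4: 24}

throughput = {}
for n, t in runtime_min.items():
    # n tasks complete roughly every t minutes -> n * (60 / t) WU/hour
    throughput[n] = n * 60 / t
    print(f"{n} at a time: {throughput[n]:.2f} WU/hour")
```

With these made-up numbers the gain from 1 to 2 is large, 2 to 3 is marginal, and 4 at a time is a net loss: exactly the pattern the utilization argument predicts once the GPU is already saturated.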
Zalster | Joined: 27 May 99 | Posts: 5517 | Credit: 528,817,460 | RAC: 242
Woodgie, you might want to get SIVX64 (I hope I wrote that right). It's a program that will give you a general idea of how your system is behaving while crunching, once you learn to read it.
jason_gee | Joined: 24 Nov 06 | Posts: 7489 | Credit: 91,093,184 | RAC: 0
There are also some complications, added for the sake of completeness, that make multiple GPUs with lots of VRAM much more complex than in the past. These include:

- the Windows display driver model mirrors VRAM for display driver recovery purposes (in this 3-card case, to the tune of some ~6 GiB of kernel-space 'shared' memory), and
- PCI Express lanes are limited (16 lanes on the i7-4770K, to cover the 3 video cards and any other devices in the system).

The first item, when you dig really deeply, covers the majority of why an extreme example (retired) host, Windows XP with 4 x old GTX 295s (8 GPUs total, with some 7 GiB of physical VRAM), while viable under old-style XP drivers and small amounts of host RAM, will tend to choke early under more modern 'hybrid' drivers. With respect to the current host (i7 + 3 larger GPUs), that's a lot of the 16 GiB of physical RAM gobbled up, and it will come out of the 8 GiB half that is kernel space (leaving 2 GiB for the OS and drivers, though likely plenty for application user space).

The second item has more of an impact on how many tasks can be 'fed' by the CPU in limited time, remembering that turns have to be taken on the PCI Express links, and a lot of activity there can be met with sitting and waiting. Hyperthreading would probably double that queue contention. I didn't know or think about this limitation extensively in the past, though it becomes pretty important in modern workstation operation, which is probably why the likes of Xeon processors with more PCIe lanes have been becoming popular even in high-end gaming rigs, just to feed the faster GPUs more promptly.

For x42 (the next major CUDA multibeam revision) I've been gradually engineering ways to make the application less 'chatty', which should reduce the issues there, though with GPUs getting faster all the time, it's taking some time to find the best ways to make things scale better and more automatically in the future.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
BilBg | Joined: 27 May 07 | Posts: 3720 | Credit: 9,385,827 | RAC: 0
- Windows display driver model mirrors VRAM for display driver recovery purposes

How often does this mirroring take place? (I assume you mean the whole VRAM (?) is copied (by some DMA controller?) to main computer RAM every X seconds?)

- ALF - "Find out what you don't do well ..... then don't do it!" :)
jason_gee | Joined: 24 Nov 06 | Posts: 7489 | Credit: 91,093,184 | RAC: 0
- Windows display driver model mirrors VRAM for display driver recovery purposes

It's complex, though in this post-classic-XP mechanism most operations generally occur via a kernel-memory 'staging area', which then transmits the commands/data (sometimes combined for optimisation purposes). So, in effect, you have a virtual GPU in host memory that the applications talk to through a user-mode driver helper (virtualisation of the GPU resources).

That's a more complex kind of 'double-buffering' than simple mirroring, which explains the increased latencies, and why extreme gamer benchmarks stuck with old XP for so long; it amounted to a performance penalty of 10% or so at its introduction with Vista. (Since then, newer GPUs have added DMA engines, faster and more numerous DMA engines, and more latency-hiding mechanisms.) Later XP drivers added some (enough) of the virtualisation to keep applications compatible, though as hybrid drivers they then attain all the scaling limits and latencies, without the benefit of new hardware and lots of RAM on top.

In terms of the amount of VRAM being mirrored, it's this number right here for my 4 GiB physical VRAM 980 (Win7 x64): [screenshot]

Fortunately, or unfortunately, depending on the usage, that virtualisation of the video memory is paged. If you actually start filling things up to the extent that system resources are low, then you'll see effects similar to (or worse than) host memory excessively paging to disk (i.e. usually unusable). Naturally, adding more host memory to modern standards is only an option on 64-bit systems etc., so extreme care is needed if selecting modern GPUs for a 32-bit desktop version of Windows.

[Edit:] Note that Windows 10 and DirectX 12 are supposed to be changing this model. I've not seen details, though Tek Syndicate mentioned at least SLI configurations stacking VRAM, so that's different. The picture may change completely if they want to compete with Mantle for latency.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Woodgie | Joined: 6 Dec 99 | Posts: 134 | Credit: 89,630,417 | RAC: 55
jason_gee. Wow, thanks for that! Proper technical insight into a proper technical subject. I find it fascinating. May I ask a few questions on behalf of the class? I'll say now that I understand it's a complex subject and generalisations are going to have to be made.

(1) When you say 'Windows display driver model', I take it you mean Microsoft have dictated: "This is how you need to write a driver to interface between your hardware and the OS, because this is how we've designed the OS"?

(2) Can you tell us how this differs from Linux and Mac OS X, and does it make a difference to how efficient the platform is as a number-crunching entity? That is, does the latency introduced by the Windows 'double-buffering' affect how fast the same work would be crunched on Windows vs. Linux/OS X, all other things being equal? (Yes, I am aware I'm asking you to explain how long a piece of quantum superstring is :) )

(3) Would adding more RAM help the issue, i.e. reduce paging, or is it "not as simple as that"? I've got 8GB of VRAM across the 3 cards (4+2+2), so I'm assuming it's trying to reserve 8GB of kernel space to call its own. (I don't know where to find the window you showed, to check.)

~W
Josef W. Segur | Joined: 30 Oct 99 | Posts: 4504 | Credit: 1,414,761 | RAC: 0
So Win7 Ultimate x64 with Woodgie's 6GB GTX Titan plus two 2GB 750ti's would like to have more than 9GB of shared kernel memory for that VRAM backup. With 16GB installed RAM implying 8GB kernel memory there must be some workaround in the driver model. {edit} Standard memory for the GTX Titan is 6GB, and the OpenCL AP task details are showing that amount, so I assume the card actually does have it even though the CUDA task details only show 4GB. Joe |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.