CUDA App Memory Usage

Author	Message
Terror Australis Volunteer tester Send message Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44	Message 1229875 - Posted: 10 May 2012, 3:39:38 UTC
Due to my experiments with "The Monster" I've been paying more attention than usual to the memory usage on my boxes. I've noticed that on the boxes with plenty of RAM, each GPU task uses around 110MB of RAM. On "The Monster" and another memory-challenged machine, individual tasks are using 75MB and 85MB respectively. Despite the reduced RAM usage, crunching times do not seem to be affected, and on the Fermi machine going back to one task per GPU did not alter the memory per task.
My question is: how does CUDA use the system RAM? On the boxes with the reduced RAM per unit, is the extra 40MB of RAM kept in virtual memory, or does the app just use what it can get? In a situation like this, if virtual memory is in use, would going to an SSD improve crunching times, or is the system self-regulating in other ways?
I know adding more RAM would be the easiest fix, but you can't get high performance DDR2 RAM these days.
T.A.
ID: 1229875 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1229951 - Posted: 10 May 2012, 10:27:44 UTC - in response to Message 1229875. Last modified: 10 May 2012, 10:28:25 UTC
> ...My question is. How does CUDA use the system RAM ? On the boxes with the reduced RAM per unit, is the extra 40MB of RAM kept in virtual memory or does the app just use what it can get?
The app itself allocates & uses fixed amounts of VRAM & system RAM, though there is a complex interplay between the Cuda libraries, the driver (for which the model differs by OS), and the modern virtual memory models implemented by the OS.
> In a situation like this, if virtual memory is in use, would going to an SSD improve crunching times or is the system self regulating in other ways ?
Under the old original XP driver model the driver would be using physical resources, and under Vista+ WDDM virtual ones, with later XP drivers implementing a compatibility layer internally for consistent Cuda operation. If you see paging to disk, then anything aimed at speeding up disk access would likely improve the symptoms, but not necessarily address the root cause.
> I know adding more RAM would be the easiest fix but you can't get high performance DDR2 RAM these days.
Remember you are using 32-bit XP, which has very real limits on virtual address space, and you're pushing up against (& past) those limits with 7 x ~512MiB cards already. Going to a 64-bit OS would likely give you some room. About half of your ~1Gig of physical memory is going directly to kernel space as non-pageable memory, so indeed you'd be paging quite heavily, though I doubt much improvement could be gained from adding more physical memory alone, as there isn't enough address space left to use it unless the OS sacrifices more from the cards themselves.
Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1229951 ·
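A rough way to see the address-space squeeze described in the post above is to add up what the cards would claim out of a 32-bit space. This is only a sketch under a loud assumption: that each of the seven ~512MiB cards maps its whole VRAM into the 4GiB address space. Real drivers and BIOSes may map smaller or larger windows, and the function name is made up for illustration.

```python
MIB = 1024 * 1024
ADDRESS_SPACE_32BIT = 4 * 1024 * MIB  # 2**32 bytes addressable by a 32-bit OS


def remaining_address_space(card_apertures_mib, address_space=ADDRESS_SPACE_32BIT):
    """Return the bytes of address space left after mapping every card.

    ASSUMPTION: each entry is the size in MiB of the window a card maps,
    taken here to be the card's full VRAM. A negative result means the
    cards alone would not fit.
    """
    mapped = sum(card_apertures_mib) * MIB
    return address_space - mapped


cards = [512] * 7  # seven ~512MiB cards, as mentioned in the thread
left = remaining_address_space(cards)
print(f"mapped: {sum(cards)} MiB, left for RAM and everything else: {left // MIB} MiB")
```

Under that assumption the seven cards alone account for 3584MiB, which leaves very little for system RAM, other devices and the kernel.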

Terror Australis Volunteer tester Send message Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44	Message 1230462 - Posted: 11 May 2012, 9:36:28 UTC - in response to Message 1229951.
Thanks for the reply, Jason.
A further question: how is the actual amount of system RAM allocated to each video card calculated? I tried several methods for this but none of them "compute" for me.
I have another question regarding weird differences between the figures Task Manager and System Info give me for RAM quantity and usage when I compare the machines with 2GB of memory with those that have 4GB. There is an inconsistency in the way the data is presented, but I need the answer to the question above to see where I'm c*cking up.
T.A.
ID: 1230462 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1230464 - Posted: 11 May 2012, 10:02:21 UTC - in response to Message 1230462.
> A further question How is the actual amount of system RAM allocated to each video card calculated ? I tried several methods for this but none of them "compute" for me.
'Ordinary' amounts of system RAM are allocated during initialisation for the regular data buffers & other support data, like any other multibeam CPU task, and that's fairly small. The Cuda runtime libraries, however, being proprietary & closed source, do things internally to support the detected hardware, 'gobbling' up some resources for FFT libraries etc. Some of that would be host-side code, some would be target-card-specific internal data structures somewhat dependent on driver threads at a lower level, and some would be various OS/driver-level transfer caches & mirrored images. Some of those are part of the WDDM specifications, some would be for device-specific performance optimisation, and some would be Cuda specific, to make the same Cuda code work on the different driver models as part of an 'abstraction layer'.
> I have another question regarding weird differences between the figures Task Manager and System Info gives me regarding RAM quantity and usage when I compare the machines with 2GB of memory with those that have 4GB. There is an inconsistency in the way the data is presented but I need the answer to the question above to see where I'm c*cking up T.A.
The underlying OS / driver infrastructure scales with the hardware, as do the libraries & application above it to some extent, in particular in the way non-paged kernel memory versus paged virtual memory is managed. Add to that the fact that the Cuda drivers & programs themselves cover many different generations of GPU and driver capability, and it becomes quite complex to know entirely what is going on underneath.
The underlying infrastructure even handles things like copies of applications in (virtualised) video memory, reflected somewhere in kernel memory, such that application context switching and card reset & recovery after faults can happen transparently to the user.
Jason
ID: 1230464 ·

red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0	Message 1230466 - Posted: 11 May 2012, 10:06:24 UTC - in response to Message 1230462. Last modified: 11 May 2012, 10:06:46 UTC I suspect the memory is the file system cache transition list. As I recall, XP Task Manager considers this to be free. Have a look at what SIV reports for the file system cache. ID: 1230466 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1230467 - Posted: 11 May 2012, 10:10:47 UTC - in response to Message 1230466. That OS caching is definitely one of the main elements, that became annoying under the newer models ;) ID: 1230467 ·

Terror Australis Volunteer tester Send message Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44	Message 1230497 - Posted: 11 May 2012, 12:46:16 UTC - in response to Message 1230464. Last modified: 11 May 2012, 12:58:01 UTC
> 'ordinary' amounts of system RAM are allocated during initialisation for the regular data buffers & other support data like any other multibeam CPU task, and that's fairly small. Cuda runtime libraries, however, being proprietary & closed source, do things internally to support the detected hardware, 'gobbling' up some resources for fft libraries etc, some of which would be host side code, target card specific internal data structures somewhat dependant on Driver threads at a lower level, and various OS/driver level transfer caches & mirrored images. Some of those are part of WDDM specifications, some would be for device specific performance optimisation, and some of those would be Cuda specific to make the same Cuda code work on the different driver models, as part of an 'abstraction layer'.
Does this mean there is no hard and fast rule such as "256MB" or "half the GPU memory" per card?
> I have another question regarding weird differences between the figures Task Manager and System Info gives me regarding RAM quantity and usage when I compare the machines with 2GB of memory with those that have 4GB. There is an inconsistency in the way the data is presented but I need the answer to the question above to see where I'm c*cking up T.A.
> Because the underlying OS / driver infrastructure scales with the hardware, and the libraries & application to some extent above, in particular the way non-paged kernel versus pages virtual memory are managed, along with that the Cuda drivers & programs themselves actually cover many different generations of GPU and driver capability, make it quite complex to know entirely what is going on underneath.
> The underlying infrastructure even handles things like copies of applications in (virtualised) video memory, reflected somewhere in kernel memory, such that application context switching and card reset & recovery after faults can happen transparently to the User.
I think my second problem is more something inside Windows. With the two 2GB machines, both Sysinfo and Task Manager (in the Performance->Physical Memory tab) report the same amount of installed memory (the installed 2048MB). With the 4GB machines, Sysinfo reports the 4096MB but Task Manager reports 2814MB of installed RAM on one machine, 1943MB on another and 1040MB on "The Monster".
Is this "missing" RAM the bit allocated to the GPUs and, if so, why is the machine with the most indicated RAM the one running 2 tasks each on 3 GTX470s? The machine with 1943MB indicated runs 3 GTX285s. One would think it should be the other way around.
And why does Task Manager "compute" memory usage differently between the 2GB and 4GB machines? And if the "holes" in the indicated memory in Task Manager are due to the GPU allocations, why don't the same holes appear on the 2GB machines?
Something is not working according to my childishly simplistic view of memory usage, i.e. bigger, faster, and more GPUs equals more RAM usage, and as the SAH units are a constant (sort of), RAM usage should scale reasonably linearly according to the power, the number of GPUs, and the amount of VRAM on the GPUs. This does not appear to be happening and I'd like to understand why.
T.A.
ID: 1230497 ·

red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0	Message 1230502 - Posted: 11 May 2012, 13:16:18 UTC Last modified: 11 May 2012, 13:38:08 UTC A lot of the "missing memory" is above 4GB and 32-bit Windows XP cannot use it. With 64-bit Windows you could. On my system with 12 GB and 4 GPUs the physical memory and GPU BARs are as below. You should be able to use it with 32-bit Windows Advanced/Enterprise Server, but I suspect you may get issues with the nVidia drivers. Were it my system I would install W7 x64. ID: 1230502 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1230504 - Posted: 11 May 2012, 13:18:36 UTC - in response to Message 1230497. Last modified: 11 May 2012, 13:36:19 UTC
> Does this mean there is no hard and fast rule such as "256MB" or "half the GPU memory" per card ?
Correct. No single hard & simple rule, but an interplay of extremely intricate rules.
> I think my second problem is more something inside Windows. With the two 2GB machines both Sysinfo and Task Manager (in the Performance->Physical Memory tab) report the same amount of installed memory (the installed 2048MB). With the 4GB machines Sysinfo reports the 4096MB but Task Manager reports 2814MB of installed RAM on one machine, 1943MB on another and 1040MB on "The Monster". Is this "missing" RAM the bit allocated to the GPU's and if so, why is the machine with the most indicated RAM the one running 2 tasks each on 3 GTX470's ? The machine with 1943MB indicated runs 3 GTX285's. One would think it should be the other way around.
Correct, there is a physical address space limitation (with a 32-bit OS), within which the cards' mapped memory spaces must fit. The different generations map, represent & manage memory differently at hardware, firmware, driver & OS level. The original XP with old-generation cards was a simple physical memory map, the newer (Fermi) cards with the fudged XP driver model are partially 'virtualised', and the Fermi class on a newer OS would be completely virtualised & 'paged'.
> And why does task manager "compute" memory usage differently between the 2GB and 4GB machines ? And if the "holes" in the indicated memory in Task Manager are due to the GPU allocations, why don't the same holes appear on the 2GB machines?
Because with more virtualised paged memory to represent, you need more reserved non-paged kernel memory to hold the page tables and handle all the extra resources, so you can end up with less available.
> Something is not working according to my childishly simplistic view of memory usage i.e. Bigger, faster, and more GPU's equals more RAM usage and as the SAH units are a constant (sort of), RAM usage should scale reasonably linearly according to the power, number of GPU's, and the amount of VRAM on the GPU's. This does not appear to be happening and I'd like to understand why. T.A.
Because Vista+ introduced WDDM (Windows Display Driver Model), Fermi class cards (or newer) have dedicated hardware to support these features, and Cuda needs to operate on legacy cards that don't have this hardware as well, so the OS & drivers have to 'emulate' these features using 'what they've got'. No, not simple at all ;)
Jason
ID: 1230504 ·
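To put numbers on the "missing" RAM discussed in the posts above, the Task Manager figures quoted for the three 4GB machines can simply be subtracted from the installed 4096MB. This is only a sketch: the machine labels and the helper name are made up for illustration, and the result is everything hidden from the OS below 4GB, which is assumed to include device mappings other than the GPUs, so it should not be read as a per-card figure.

```python
INSTALLED_MB = 4096  # what Sysinfo reports on the 4GB machines

# Task Manager figures quoted in the thread (labels are illustrative)
reported_mb = {
    "3 x GTX470 box": 2814,
    "3 x GTX285 box": 1943,
    "The Monster": 1040,
}


def hidden_mb(installed, reported):
    """RAM that is installed but not shown as usable by Task Manager."""
    return installed - reported


for name, reported in reported_mb.items():
    print(f"{name}: {hidden_mb(INSTALLED_MB, reported)} MB not visible to the OS")
```

The gap grows from the GTX470 box to "The Monster", which fits the explanation that the cards' mapped regions, not the number of running tasks, decide how much RAM a 32-bit OS loses.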

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13920 Credit: 208,696,464 RAC: 304	Message 1230684 - Posted: 11 May 2012, 22:13:34 UTC - in response to Message 1230504. Thanks for starting the thread, it was an interesting discussion. Grant Darwin NT ID: 1230684 ·

Terror Australis Volunteer tester Send message Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44	Message 1230780 - Posted: 12 May 2012, 3:57:38 UTC Click. The light goes on. It had slipped my mind that part of the VRAM is added to the system RAM and thus becomes affected by the 4GB limitation on a 32-bit system. This is why I got confused when the discussion mentioned memory above 4GB (smacks forehead). Thanks to Jason and Red-ray for enlightening me. One final question: would using PAE be of any assistance, or would it just confuse the issue even further? T.A. ID: 1230780 ·

Terror Australis Volunteer tester Send message Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44	Message 1230801 - Posted: 12 May 2012, 4:45:11 UTC For those who are interested. This thread on Techsupportforum.com has some good info on the subject, particularly the post by "JH-man" down near the bottom. T.A. ID: 1230801 ·

Terror Australis Volunteer tester Send message Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44	Message 1231441 - Posted: 13 May 2012, 5:52:27 UTC - in response to Message 1230780. Last modified: 13 May 2012, 5:52:53 UTC
> One final question. Would using PAE be of any assistance or would it just confuse the issue even further ?
I'd appreciate an answer to this question. I know that PAE only provides access to the RAM above 4GB. What I need to know is whether it only operates on system RAM or whether it works with a SysRAM + VRAM combination.
T.A.
ID: 1231441 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1231447 - Posted: 13 May 2012, 6:09:43 UTC - in response to Message 1231441. Last modified: 13 May 2012, 6:10:48 UTC
> One final question. Would using PAE be of any assistance or would it just confuse the issue even further ?
> I'd appreciate an answer to this question. I know that PAE only provides access to the RAM above 4GB. What I need to know is if it only operates on system RAM or does it work with an SysRAM + VRAM combination. T.A.
My (limited & possibly incorrect) understanding of PAE in the Windows versions is that it's limited to particular server/datacentre class Windows environments/configurations, and requires special/specific drivers for all sorts of things. If that is anywhere near correct, it'd be a bucketload easier just to use a consumer or server x64 edition IMO... or Linux.
Jason
ID: 1231447 ·

Terror Australis Volunteer tester Send message Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44	Message 1231457 - Posted: 13 May 2012, 6:59:17 UTC 64-bit Linux is the long-term plan for this box. But it is also an experiment and something to learn with. The whole aim is to push the hardware to its absolute limit, just to see what it is capable of, what works and what doesn't. Such knowledge could be handy when the revolution comes ;) T.A. ID: 1231457 ·

BilBg Volunteer tester Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1231657 - Posted: 13 May 2012, 18:29:24 UTC - in response to Message 1231457. Last modified: 13 May 2012, 19:28:17 UTC
Can you post the contents of the boot.ini file?
    ;
    ;Warning: Boot.ini is used on Windows XP and earlier operating systems.
    ;Warning: Use BCDEDIT.exe to modify Windows Vista boot options.
    ;
    [boot loader]
    timeout=30
    default=multi(0)disk(0)rdisk(0)partition(2)\WINDOWS
    ;
    [operating systems]
    multi(0)disk(0)rdisk(0)partition(2)\WINDOWS="Microsoft Windows XP Professional (SP3)" /FASTDETECT /NoExecute=OptOut /usepmtimer
    c:\ ="Microsoft Windows 98 SE"
"The PAE kernel can be enabled automatically without the /PAE switch present in the boot entry if the system has DEP enabled (/NOEXECUTE switch is present) or the system processor supports hardware-enforced DEP. Presence of the /NOEXECUTE switch on a system with a processor that supports hardware-enforced DEP implies the /PAE switch":
http://msdn.microsoft.com/en-us/windows/hardware/gg487503.aspx
http://support.microsoft.com/kb/899298
http://www.techspot.com/community/topics/change-in-boot-ini.32537/
Switch options for Boot.ini:
http://support.microsoft.com/kb/833721
I don't know if the /3GB switch will permit the use of System RAM that is hidden under the VRAM addresses (is that System RAM not used for anything?)
http://support.microsoft.com/kb/171793
Is it possible on 32 bit Windows to force VRAM to be mapped to addresses over 4 GB?
- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 1231657 ·
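The rule quoted in the post above (a /NOEXECUTE switch on a CPU with hardware-enforced DEP implies /PAE) can be sketched as a small check over a boot.ini entry. This is an illustration only: the function names are hypothetical, the CPU capability is passed in as a plain flag rather than detected, and variants such as /NOPAE or particular /NOEXECUTE values are deliberately not modelled.

```python
def boot_switches(entry):
    """Return the upper-cased switch names found in one boot.ini entry.

    A switch is any whitespace-separated token starting with '/';
    a value after '=' (as in /NoExecute=OptOut) is dropped.
    """
    switches = set()
    for token in entry.split():
        if token.startswith("/"):
            name = token[1:].split("=", 1)[0]
            switches.add(name.upper())
    return switches


def pae_implied(entry, cpu_has_hardware_dep):
    """Apply only the quoted rule: /PAE itself, or /NOEXECUTE plus hardware DEP."""
    switches = boot_switches(entry)
    return "PAE" in switches or ("NOEXECUTE" in switches and cpu_has_hardware_dep)


# The XP entry from the boot.ini shown in the post above
entry = (
    r'multi(0)disk(0)rdisk(0)partition(2)\WINDOWS='
    r'"Microsoft Windows XP Professional (SP3)" '
    r"/FASTDETECT /NoExecute=OptOut /usepmtimer"
)

print(sorted(boot_switches(entry)))
print(pae_implied(entry, cpu_has_hardware_dep=True))
```

For this entry the check only says that the PAE kernel would be selected on a DEP-capable CPU. It says nothing about whether a given driver can make use of addresses above 4GB.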

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1231670 - Posted: 13 May 2012, 18:58:10 UTC - in response to Message 1231657. Last modified: 13 May 2012, 18:58:56 UTC
> Is it possible on 32 bit Windows to force VRAM to be mapped to addresses over 4 GB?
I believe not on normal workstation type Windows editions with GeForce drivers, maybe with Quadro drivers/cards. As mentioned, I don't know completely about PAE, other than that 64-bit OSes solve those issues inherently, so really PAE is superseded on Windows and probably not supported through all (or many) drivers.
Jason
ID: 1231670 ·

BilBg Volunteer tester Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1231685 - Posted: 13 May 2012, 19:40:19 UTC - in response to Message 1231670.
> ... so really PAE is superseded on Windows and probably not supported through all (or many) drivers. Jason
My CPU was made ~2006 and supports DEP. Win XP32 SP3 supports DEP and PAE. The /NOEXECUTE switch in boot.ini enables DEP and PAE, as you can see on the screenshot. I don't experience random crashes, so it seems that current (~2006+) hardware and drivers are PAE compatible.
ID: 1231685 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1231689 - Posted: 13 May 2012, 19:54:18 UTC - in response to Message 1231685. Last modified: 13 May 2012, 20:04:19 UTC
> My CPU is made ~2006 and supports DEP. Win XP32 SP3 supports DEP and PAE /NOEXECUTE switch in boot.ini enables DEP and PAE as you can see on the screenshot. I don't experience random crashes so it seems that current (~2006+) hardware and drivers are PAE compatible.
Interesting, though your listed XP machine shows a lot less than 4GB here. What are you using to verify that the mapped video memory is not pinching from system memory? It does not even list a Cuda GPU either, which is the point of the discussion: fitting the cards' mapped regions into a 32-bit address space using available drivers.
If you show a Cuda device on XP32 with PAE enabled & 4Gig of system RAM, and you have Cuda Large memory addressing (LME) capable 32-bit drivers, then I will believe that TA's request is possible, though I would still regard a 64-bit OS as preferable for his situation.
ID: 1231689 ·

BilBg Volunteer tester Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1231695 - Posted: 13 May 2012, 20:10:08 UTC - in response to Message 1231689. Yes, it has only 1 GB RAM and an NVIDIA GeForce 6150SE nForce 430 (driver 191.07). I'm just saying that probably >90% of users who run 32-bit Windows use PAE without even knowing that it is enabled. (DEP is enabled by default after Win XP SP2 and today's CPUs support hardware DEP, so PAE is ON by default on most computers.) ID: 1231695 ·

©2025 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.