Message boards : Number crunching : SAH On Linux
enzed Joined: 27 Mar 05 Posts: 347 Credit: 1,681,694 RAC: 0

Hello Folks

Actually, you can adjust the priority of Unix/Linux programs/processes, and also automate it so the system kicks each new running SETI client to a desired priority level. I will knock some notes up and put a link on here.. back later..

I do like that idea of building a kernel with the seti apps present as runnable modules.. hmmmm

Technically you don't have to have a GUI... or perhaps just have the system boot to runlevel 3 and not start the GUI unless required...

cheers
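The "adjust and automate" idea above can be sketched as a small shell function. The binary name and niceness level here are assumptions, not the actual client names; check `ps -e` on your own host, and note that raising priority (a negative niceness) normally requires root:

```shell
#!/bin/sh
# renice_app: push every process matching an exact name to a chosen
# niceness level. Does nothing, harmlessly, if no such process is running.
renice_app() {
    name="$1"    # assumed science-app binary name
    level="$2"   # target niceness: negative = higher priority (root only)
    for pid in $(pgrep -x "$name"); do
        renice -n "$level" -p "$pid"
    done
}

# Example call (the binary name is an assumption):
renice_app setiathome_x41g -5
```

Dropped into a boot script or a cron job, this is the "automate it" part: any newly started client instance gets caught on the next run.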
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874

Hello Folks

Please cover the distinction between 'thread' and 'process' priorities in your notes - that's the one which caused the most confusion for Linux developers transferring to Windows for the first time.

[Edit - as Bernd Machenschalk was honest enough to admit]
Terror Australis Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44

(From a motel somewhere in the Kimberleys)

I've been out of touch for a few days. I have a cron script running to adjust the "niceness" of the CUDA tasks to -5, which is roughly the equivalent of "Above Normal" in Windows. I tried -10 but it didn't make any difference, as most of the processor usage is the CUDA tasks; the other running tasks only account for about 1% of CPU time. I've checked the usage with both gkrellm and the Mageia task manager, and they agree on the CPU usage per task.

Unfortunately The Monster froze up about 3 days ago. It has now been rebooted and is back in action. I will be back home next week and will continue checks then.

Thanks for all the comments so far, but can someone please explain why the Linux app uses more CPU than the Windows app?

T.A.
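The per-task CPU check described above can also be done from a terminal instead of a graphical monitor. A minimal sketch; "setiathome_x41g" is an assumed binary name, so substitute whatever your own client actually runs:

```shell
#!/bin/sh
# check_nice: print pid, niceness, %CPU and command name for every
# process matching the given name, or say so if none is running.
check_nice() {
    ps -C "$1" -o pid=,ni=,pcpu=,comm= || echo "no '$1' process found"
}

# Example (assumed binary name):
check_nice setiathome_x41g
```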
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57

Perhaps the Linux CUDA code requires a bit more CPU usage than it does in Windows, or something along those lines. It could also have something to do with the older cards. I recall someone (one of the Lunatics devs) mentioning something along the lines of "the older cards have to do in software what the newer cards do in their hardware".

SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
Ex: "Socialist" Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2

What is BOINC written in? The language it's written in can have a bit to do with that; some languages run nicer in Linux than others. I've heard devs say that C makes a horrible language for processor-intensive stuff that will be compiled for Linux...

In all honesty, to really see what a Linux system can do, you would want to start with compiling your own kernel specifically for your system. Following that, you would also want to compile BOINC for your own machine. Kangol does this, and that's why his performance is good. Linux distros are compiled for the masses; that's part of the problem.

Eventually, when I have more experience, I will get into compiling my own wares, then eventually my own kernel. I really think that's the key to performance...

#resist
Wembley Joined: 16 Sep 09 Posts: 429 Credit: 1,844,293 RAC: 0

The compiler you use makes a bit of difference also. I think the Lunatics crew built their Windows versions with the Intel compiler, which does a little better job of optimizing than gcc. (At least they used to; not sure if they still do with the most recent builds.)
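One concrete piece of the "compile for your own machine" idea is letting the compiler target the host CPU rather than a generic baseline, which is what distro builds must assume. A hedged sketch (assumes gcc is installed; `-Q --help=target` and `-march=native` are standard gcc options, and the exact feature list printed differs per machine, which is exactly the point):

```shell
#!/bin/sh
# Show which CPU-specific features gcc's -march=native would enable on
# this host. A distro binary compiled "for the masses" cannot rely on
# any of them. Guarded so the sketch degrades cleanly without gcc.
if command -v gcc >/dev/null 2>&1; then
    gcc -Q --help=target -march=native 2>/dev/null \
        | grep -E -e '-m(arch|avx|sse)' | head -n 10
else
    echo "gcc not installed"
fi
```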
Ex: "Socialist" Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2

Yes Wembley, Intel's compiler is considered to be better than GCC, and that could definitely account for some of the performance difference between Windows and Linux.

#resist
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0

Programming for CUDA is done in a language called 'C for CUDA'. The compiler is provided by nVidia, and of course differs between Linux and Windows. Some of what's compiled runs on the GPU and, given the same target hardware, is most likely nearly identical. The part which runs on the CPU is likely quite different, and uses routines in the CUDA runtime library which will also obviously be different. There's also additional C++ code for the application controlling the overall flow of processing, interacting with BOINC, etc. I doubt that part is contributing much to the observed additional CPU time.

The crux of the issue is likely to be how the CUDA code recognizes and reacts when the GPU has finished one kernel and needs to be told what to do next. A way to keep the GPU almost fully occupied is to have the CPU in a spin loop waiting for the signal from the GPU, but that of course means the CPU will be shown as completely used too. Instead some kind of interrupt mechanism is used, and I don't know much more. I certainly don't know in what sense the niceness should affect the amount of wasted CPU time, for instance.

Joe
Ex: "Socialist" Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2

As far as niceness goes, I'd think it matters more on systems like mine, where (a) I have no GPU, (b) I keep BOINC throttled, and (c) I run many other things alongside it. Without changing the niceness, I assume that when other (minor) processes start to do some work, they get the priority and hence steal time from BOINC. With niceness I could get away with giving BOINC the priority, and my other processes would then suffer the lost CPU time.

Personally, I like my server load as low as possible, and am considering going back to running BOINC in a single-thread virtual machine. (This will kill my RAC, but I would rather my fans run a little slower and my processor use a little less electricity.)

#resist
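The trade-off described above is just the standard nice/renice mechanism. A minimal sketch; the PID in the comment is a placeholder, and note the asymmetry in privileges:

```shell
#!/bin/sh
# Launch a new process at *reduced* priority - any unprivileged user may
# set a positive niceness:
nice -n 10 echo "started at niceness 10"

# Adjust a process that is already running. Raising priority (a negative
# niceness) requires root; the PID here is purely illustrative:
#   renice -n -5 -p 12345
```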
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0

"The crux of the issue is likely to be how the CUDA code recognizes and reacts when the GPU has finished one kernel and needs to be told what to do next. A way to keep the GPU almost fully occupied is to have the CPU in a spin loop waiting for the signal from the GPU, but that of course means the CPU will be shown as completely used too. Instead some kind of interrupt mechanism is used, and I don't know much more. I certainly don't know in what sense the niceness should affect the amount of wasted CPU time, for instance." - Joe

As it's an interesting topic I've been studying and experimenting with for over a year, I'll elaborate on this from the technical angles.

The CUDA builds on both sides mostly use 'blocking synchronisation' at this stage (which is somewhat antiquated), and some CUDA streams (less so). On both platforms the blocking synchronisation is expensive on GPU execution, as effectively the device has to sit idle waiting for the CPU to get around to any postprocessing, which is a suboptimal situation. Thus key code parts, such as the initial chirp and some costly reductions, are streamed (asynchronous) while the CPU does other postprocessing or reduction/reporting (as opposed to either spinning at full CPU thread usage, or completely blocking the CPU thread, which is OS scheduled). Increased host process (and worker thread) priority can therefore aid the blocking portions, depending on how the underlying OS and drivers schedule the synchronisation primitives.

The Vista+ WDDM driver model also has special optimisation and scheduling features to aid in these synchronisation situations, through finer-grained kernel scheduling and slimmer synchronisation, which hides latencies, as does running multiple tasks at once on many cards (more so in Vista/Win7, as there is more latency to hide).

To overcome these synchronisation, scheduling and utilisation issues, many years ago both Microsoft and OpenGL incorporated callback interfaces into their graphics and audio APIs (OpenGL, DirectX, DirectSound, CoreAudio). This is an extremely efficient model, but complex. Last year, nVidia exposed a limited callback interface through its CUPTI libraries for CUDA development, and the OpenCL consortium introduced OpenCL 1.1 with GPU callbacks, which will largely supersede the legacy blocking, spinning or stream-async methods...

But unfortunately, because many GPGPU developers are unfamiliar with Windows DirectX, Core Audio and OpenGL programming, these complex avenues have yet to be fully and properly explored in GPGPU contexts, despite being mature and highly efficient mechanisms. In fact, misunderstanding of this technological jump has led to ill-conceived workarounds to put blocking-sync functionality back in, and bone-pointing at the engineers/vendors... largely stemming from the massive research and transitional code-change burden required to switch to the newer techniques (i.e. reading and understanding the callback interface's purpose and function, and then applying updates to code, which all costs time and skill).

CUDA 5 will support callback synchronisation natively (AFAIK on all supported platforms). The CUDA 'X-branch' (the x42 series, targeted at maximal efficiency) will be completely re-engineered to use it throughout, sometime after V7 Multibeam is more or less under control.

Use of native callback techniques where available (CUDA 5, DirectCompute and OpenCL 1.1+), CUPTI (older CUDA), and hand-crafted callback synchronisation will eventually become standard practice and solve a lot of problems... but as with any technological jump, there are teething problems and some resistance to change.

Jason

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
ML1 Joined: 25 Nov 01 Posts: 20291 Credit: 7,508,002 RAC: 20

For a 'minimal' system, you might want to take a look at the sort of things done with BusyBox on a Linux kernel.

Happy lean crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ML1 Joined: 25 Nov 01 Posts: 20291 Credit: 7,508,002 RAC: 20

Thanks Joe, Jason; as ever, an informative and very interesting read... (Must make time to dive in...)

Happy fast crunchin',
Martin
Ex: "Socialist" Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2

(OT: BusyBox is also an excellent tool for rooted Android phones, if you're into that sort of thing)
ML1 Joined: 25 Nov 01 Posts: 20291 Credit: 7,508,002 RAC: 20

"(OT: BusyBox is also an excellent tool for rooted Android phones, if you're into that sort of thing)"

Just to disentangle a little:

What most people describe as "Linux" is the GNU (Richard Stallman, FSF) operating system and toolset running on top of the Linux (Linus Torvalds) kernel. That's abbreviated as "GNU/Linux".

Android uses Google's rewritten version of some of the "GNU" parts, which they call Bionic: just enough to run their Java-based environment. Their system is abbreviated "Bionic/Linux" and is marketed as "Android". All very clever geeky naming.

You can also have the combination of BusyBox/Linux.

Happy lean crunchin',
Martin
Andy Lee Robinson Joined: 8 Dec 05 Posts: 630 Credit: 59,973,836 RAC: 0

This should renice the cpu/gpu threads as a simple command:

for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done

I guess it could be added to /etc/crontab to run every minute:

* * * * * root sleep 10; (for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done;)

Not sure how much benefit it really is, but might not be helpful on a production webserver/db :-)
Terror Australis Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44

I run a similar script via crontab on The Monster, but it doesn't do much. Changing the niceness of the GPU tasks is really only necessary if you're running CPU+GPU, as the CPU tasks get priority and slow the GPU tasks by as much as 50%. On another machine, I found a nice of -5 brought the GPU crunching time down to "normal" without strangling the CPU times.

On The Monster, changing the niceness of the GPU tasks makes no difference, as it is a GPU-only cruncher and there is very little CPU usage apart from that required to drive the GPUs.

T.A.
doug Joined: 10 Jul 09 Posts: 202 Credit: 10,828,067 RAC: 0

"The crux of the issue is likely to be how the CUDA code recognizes and reacts when the GPU has finished one kernel and needs to be told what to do next. A way to keep the GPU almost fully occupied is to have the CPU in a spin loop waiting for the signal from the GPU, but that of course means the CPU will be shown as completely used too. Instead some kind of interrupt mechanism is used, and I don't know much more. I certainly don't know in what sense the niceness should affect the amount of wasted CPU time, for instance." - Joe

So we're a victim of nVidia's unwillingness to understand parallel programming? I would be appalled, but given the rate of change in software/hardware I am not surprised. Maybe I'll email Dijkstra's papers to Huang. :-)
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0

"So we're a victim of nVidia's unwillingness to understand parallel programming? I would be appalled, but given the rate of change in software/hardware I am not surprised. Maybe I'll email Dijkstra's papers to Huang. :-)"

LoL, no, not nVidia being reluctant at all. They've been very much a driving force for GPU supercompute, with CUDA and OpenCL, from the start. It's the developers that use 'the stuff' who can be quite reluctant to keep up with the rapid pace of development. I.e. let's call it 'us developers' being slow on the uptake, rather than 'those developers' (engineers really, for the most part) being slow to provide and refine technologies. That's advancing/evolving rapidly enough for me ;)

Jason
enzed Joined: 27 Mar 05 Posts: 347 Credit: 1,681,694 RAC: 0

I like the script; just a couple of thoughts on extending it to handle the different executable names I have noticed arising from the different client processes that run, and also the multiple PIDs that emerge on multicore architecture...

# remember to set your renice number to your comfort level
for x in $(pgrep setiathome); do
    renice -n -1 $x >/dev/null
done

for z in $(pgrep astropulse); do
    renice -n -1 $z >/dev/null
done
Terror Australis Joined: 14 Feb 04 Posts: 1817 Credit: 262,693,308 RAC: 44

An update. For some reason Linux does not handle the load of seven GPUs, and/or the bus congestion associated with keeping them fed, as well as Windows does. This was borne out by the longer crunching times and much increased CPU usage. Whether this is due to a problem with the kernel, the drivers or another reason I don't know, but the evidence is there in the crunching times.

Stage 2 of the experiment involved replacing the 7 GTS250's with 3 GTX470's, each running 2 tasks, to try to keep a similar system load to the previous setup. All other hardware remained unchanged, and the same CPU and RAM speeds were kept. The results were as follows:

CPU load dropped from 30% to around 10%.
CPU use per CUDA task dropped to "normal" at 1%-2% each.
Crunching times increased slightly (around 30 seconds for a "standard" unit) from what the same cards were achieving under Windows, but this is explainable by the slightly reduced clock speeds the cards are now running.

After 12 hours of GPU-only working I have re-enabled CPU crunching as well. It will be interesting to see how overall times will be affected.

T.A.
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.