SAH On Linux

Profile enzed
Joined: 27 Mar 05
Posts: 347
Credit: 1,681,694
RAC: 0
New Zealand
Message 1238753 - Posted: 29 May 2012, 9:11:42 UTC - in response to Message 1237412.  
Last modified: 29 May 2012, 9:15:13 UTC

In your windows tasks I see:

Priority of process raised successfully
Priority of worker thread raised successfully

In your linux tasks I do not.

It could be that the Linux app doesn't record that information, or that the tasks are running at a lower priority.

From what I understand of a conversation I had at Einstein some years ago, Linux developers don't have access to any tool or setting equivalent to the 'worker thread priority' - they can only adjust priority at the process level. So much so that developers who cut their teeth on Linux code aren't even aware that Windows has threads which can be individually adjusted.

Also, BOINC should raise the process priority of a CUDA app on launch, so there shouldn't be any need for a message about the app adjusting itself later. So, I wouldn't worry about the non-equivalence of the messages, though it might well be something for T.A. to look at next time he's running each of the two OSs.

NB - you'd need to use something like Process Explorer to get the Windows thread priorities - Task Manager can only display/control the process priority.


Hello Folks
Actually, you can adjust the priority of Unix/Linux programs/processes, and you can also automate it so that the system kicks each new running SETI client to a "desired" priority level.
I will knock up some notes and put a link on here... back later.

I do like the idea of building a kernel with the SETI apps present as runnable modules... hmmm

Technically you don't have to have a GUI... or perhaps just have the system boot to runlevel 3 and not start the GUI unless required...
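Something along these lines should do it, depending on whether the distro still uses a SysV-style inittab or has moved to systemd (treat the exact commands as a sketch, not gospel):

# SysV-style init: make runlevel 3 (multi-user, no X) the default
#   in /etc/inittab set:   id:3:initdefault:
# systemd-based distros: point the default target at multi-user instead
ln -sf /lib/systemd/system/multi-user.target /etc/systemd/system/default.target
# then start the GUI by hand only when you actually want it
startx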


cheers
ID: 1238753 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1238755 - Posted: 29 May 2012, 9:16:09 UTC - in response to Message 1238753.  
Last modified: 29 May 2012, 9:50:45 UTC

Hello Folks
Actually, you can adjust the priority of Unix/Linux programs/processes, and you can also automate it so that the system kicks each new running SETI client to a "desired" priority level.
I will knock up some notes and put a link on here... back later.

cheers

Please cover the distinction between 'thread' and 'process' priorities in your notes - that's the one which caused most confusion for Linux developers transferring to Windows for the first time.

[Edit - as Bernd Machenschalk was honest enough to admit]
ID: 1238755 · Report as offensive
Terror Australis
Volunteer tester

Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 1238814 - Posted: 29 May 2012, 13:46:34 UTC
Last modified: 29 May 2012, 14:28:25 UTC

(From a motel somewhere in the Kimberleys)
I've been out of touch for a few days. I have a cron script running to adjust the "niceness" of the CUDA tasks to -5, which is roughly the equivalent of "Above Normal" in Windows. I tried -10 but it didn't make any difference, as most of the processor usage is the CUDA tasks. The other running tasks only account for about 1% of CPU time.

I've checked the usage with both GKrellM and the Mageia task manager. They both agree on the CPU usage per task.
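For cross-checking from a terminal, something like this should list the nice value and CPU share of each running task (the executable name is just an example; adjust it to whatever app is actually running):

ps -C setiathome_x41g -o pid,ni,pcpu,etime,comm
# or, to see every thread of every SETI/Astropulse process:
ps -eLo pid,lwp,ni,pcpu,comm | grep -E 'setiathome|astropulse'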

Unfortunately The Monster had frozen up about 3 days ago. It has now been rebooted and is back in action.

I will be back home next week and will continue checks then.

Thanks for all the comments so far, but can someone please explain why the Linux app uses more CPU than the Windows app?

T.A.
ID: 1238814 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1238818 - Posted: 29 May 2012, 14:10:15 UTC - in response to Message 1238814.  

(From a motel somewhere in the Kimberleys)
I've been out of touch for a few days. I have a cron script running to adjust the "niceness" of the CUDA tasks to -5, which is roughly the equivalent of "Above Normal" in Windows. I tried -10 but it didn't make any difference, as most of the processor usage is the CUDA tasks. The other running tasks only account for about 1% of CPU time.

I've checked the usage with both GKrellM and the Mageia task manager. They both agree on the CPU usage per task.

Unfortunately The Monster had frozen up about 3 days ago. It has now been rebooted and is back in action.

I will be back home next week and will continue checks then.

Thanks for all the comments so far, but can someone please explain why the Linux app uses more CPU than the Windows app?

T.A.

Perhaps the Linux CUDA code requires a bit more CPU than it does in Windows, or something along those lines. It could also have something to do with the older cards. I recall someone, one of the Lunatics devs, mentioning something along the lines of "the older cards have to do in software what the newer cards do in their hardware".
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[
ID: 1238818 · Report as offensive
Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1238830 - Posted: 29 May 2012, 14:44:55 UTC - in response to Message 1238814.  
Last modified: 29 May 2012, 14:48:34 UTC

What is BOINC written in? The language it's written in can have a bit to do with that. Some languages run more nicely on Linux than others. I've heard devs say that C makes a horrible language to use for processor-intensive stuff that will be compiled for Linux...


In all honesty, to really see what a Linux system can do, you would want to start by compiling your own kernel specifically for your system. Following that, you would also want to compile BOINC for your own machine. Kangol does this, and that's why his performance is good.

Linux distros are compiled for the masses. That's part of the problem.
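For anyone who wants to try it, a rough sketch of building the client from source tuned for the local CPU (the URL and configure switches are illustrative and may differ between BOINC releases):

# fetch the BOINC sources (location shown is illustrative)
git clone https://github.com/BOINC/boinc.git && cd boinc
./_autosetup                 # generate the configure script
# build only the client side, tuned for this machine's CPU
CFLAGS="-O3 -march=native" CXXFLAGS="-O3 -march=native" ./configure --disable-server
make -j"$(nproc)"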

Eventually, when I have more experience, I will get into compiling my own software, and then eventually my own kernel. I really think that's the key to performance...
#resist
ID: 1238830 · Report as offensive
Wembley
Volunteer tester
Joined: 16 Sep 09
Posts: 429
Credit: 1,844,293
RAC: 0
United States
Message 1238948 - Posted: 29 May 2012, 21:59:51 UTC

The compiler you use makes a bit of difference as well. I think the Lunatics crew built their Windows versions with the Intel compiler, which does a somewhat better job of optimizing than GCC (at least they used to; I'm not sure if they still do with the most recent builds).
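If anyone wants to see the effect for themselves, building the same source both ways and timing it gives a rough idea (file names and flags here are only placeholders):

gcc -O3 -march=native -funroll-loops -o bench_gcc bench.c -lm    # GCC tuned for the local CPU
icc -O3 -xHost -o bench_icc bench.c                              # Intel compiler, if installed
time ./bench_gcc
time ./bench_icc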
ID: 1238948 · Report as offensive
Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1238968 - Posted: 31 May 2012, 17:11:48 UTC

Yes Wembley, Intel's compiler is considered to be better than GCC, and that could definitely account for the performance difference between Windows and Linux.
#resist
ID: 1238968 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1238996 - Posted: 31 May 2012, 18:06:30 UTC

Programming for CUDA is done in a language called 'C for CUDA'. The compiler is provided by nVidia, and of course differs for Linux and Windows. Some of what's compiled runs on the GPU and given the same target hardware is most likely nearly identical. The part which runs on CPU is likely quite different, and uses routines in the CUDA runtime library which will also obviously be different.

There's also additional C++ code for the application controlling the overall flow of processing, interacting with BOINC, etc. I doubt that part is contributing much to the observed additional CPU time.

The crux of the issue is likely to be how the CUDA code recognizes and reacts when the GPU has finished one kernel and needs to be told what to do next. A way to keep the GPU almost fully occupied is to have the CPU in a spin loop waiting for the signal from the GPU, but that of course means the CPU will be shown as completely used too. Instead some kind of interrupt mechanism is used, and I don't know much more. I certainly don't know in what sense the niceness should affect the amount of wasted CPU time, for instance.
                                                                  Joe
ID: 1238996 · Report as offensive
Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1239005 - Posted: 31 May 2012, 18:14:27 UTC
Last modified: 31 May 2012, 18:15:40 UTC

As far as niceness goes, I'd expect it to matter more on systems like mine, where I (a) have no GPU, and (b) keep BOINC throttled and run many other things alongside it. Without changing the niceness, I assume that when other (minor) processes start to do some work, they get the priority and hence steal time from BOINC. With niceness I could get away with giving BOINC the priority, and my other processes would then suffer the lost CPU time.

Personally, I like to keep my server load as low as possible, and am considering going back to running BOINC in a single-threaded virtual machine. (This will kill my RAC, but I would rather my fans run a little slower and my processor use a little less electricity.)
#resist
ID: 1239005 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1239025 - Posted: 31 May 2012, 18:35:30 UTC - in response to Message 1238996.  
Last modified: 31 May 2012, 19:08:42 UTC

The crux of the issue is likely to be how the CUDA code recognizes and reacts when the GPU has finished one kernel and needs to be told what to do next. A way to keep the GPU almost fully occupied is to have the CPU in a spin loop waiting for the signal from the GPU, but that of course means the CPU will be shown as completely used too. Instead some kind of interrupt mechanism is used, and I don't know much more. I certainly don't know in what sense the niceness should affect the amount of wasted CPU time, for instance.
                                                                  Joe


As it's an interesting topic I've been studying & experimenting with for over a year, I'll elaborate on this from the technical angles. The Cuda builds on both sides mostly use 'blocking synchronisation' at this stage (which is somewhat antiquated), and some Cuda streams (less so).

On both platforms the blocking synchronisation is expensive on GPU execution, as effectively the device has to sit idle waiting for the CPU to get around to any postprocessing, which is a suboptimal situation. Thus key code parts, such as the initial chirp and some costly reductions, are streamed (asynchronous) while the CPU does other postprocessing or reduction/reporting (as opposed to either spinning at full CPU thread usage, or completely blocking the CPU thread, which is OS scheduled). Increased host process (and the worker thread that hangs from it) priority can therefore aid the blocking portions, depending on how the underlying OS & drivers schedule the synchronisation primitives. The Vista+ WDDM driver model also has special optimisation & scheduling features to aid in these synchronisation situations, through finer grained kernel scheduling & slimmer synchronisation, which hides latencies, as does running multiple tasks at once on many cards (more so in Vista/Win7, as there is more latency to hide).

To overcome these synchronisation, scheduling and utilisation issues, many years ago both Microsoft and OpenGL incorporated callback interfaces into their graphics and audio APIs (OpenGL, DirectX, DirectSound, Core Audio). This is an extremely efficient model, but a complex one.

Last year, nVidia exposed a limited callback interface through its CUPTI libraries for CUDA development, and the OpenCL consortium introduced OpenCL 1.1 with GPU callbacks, which will largely supersede legacy blocking, spinning or stream async methods... But unfortunately, with many GPGPU developers being unfamiliar with Windows DirectX, Core Audio & OpenGL programming, these complex avenues have yet to be fully & properly explored in GPGPU contexts, despite being mature & highly efficient mechanisms.

In fact, misunderstanding of this technological jump has led to ill-conceived workarounds to put blocking sync functionality back in, and bone pointing at the engineers/vendors... largely stemming from the massive research & transitional code change burden required to switch to the newer techniques (i.e. reading and understanding the callback interface's purpose & function, and then applying updates to code, all of which costs time & skill).

Cuda 5 will support callback synchronisation natively (AFAIK on all supported platforms). The Cuda 'X-branch' will be completely re-engineered to use it throughout (x42 series, targeted at maximal efficiency) sometime after V7 Multibeam is more or less under control. Use of native callback techniques where available (Cuda 5, DirectCompute & OpenCL 1.1+), CUPTI (older Cuda), and hand-crafted callback synchronisation will eventually become standard practice & solve a lot of problems... but as with any technological jump, there are teething problems & some resistance to change.

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1239025 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20147
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1240442 - Posted: 2 Jun 2012, 23:23:42 UTC

For a 'minimal' system, you might want to take a look at the sort of things done with BusyBox on a Linux kernel.
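(A quick taste, assuming the busybox binary is already installed:)

busybox --list | head        # applets built into this busybox binary
busybox sh -c 'uname -a'     # run a command under the busybox shell
# a really minimal box just symlinks the usual tool names to busybox, e.g.
ln -s /bin/busybox /usr/local/bin/vi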


Happy lean crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1240442 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20147
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1240443 - Posted: 2 Jun 2012, 23:25:31 UTC - in response to Message 1239025.  
Last modified: 2 Jun 2012, 23:25:57 UTC

Thanks Joe, Jason, as ever an informative and very interesting read...

(Must make time to dive in...)

Happy fast crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1240443 · Report as offensive
Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1240472 - Posted: 3 Jun 2012, 0:43:14 UTC
Last modified: 3 Jun 2012, 0:43:52 UTC

(OT:Busybox is also an excellent tool for rooted Android phones, if you're into that sort of thing)
ID: 1240472 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20147
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1240616 - Posted: 3 Jun 2012, 12:19:20 UTC - in response to Message 1240472.  

(OT:Busybox is also an excellent tool for rooted Android phones, if you're into that sort of thing)

Just to disentangle a little:


What most people describe as "Linux" is the GNU (Richard Stallman, FSF) operating system and toolset running on top of the Linux (Linus Torvalds) kernel. That's abbreviated as "GNU/Linux".

Android uses Google's rewritten version of some of the "GNU" parts, which they call Bionic and which is just enough to run Java. Their system is abbreviated "Bionic/Linux" and is marketed as "Android". All very clever geeky naming.

You can also have the combination of BusyBox/Linux.


Happy lean crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1240616 · Report as offensive
Profile Andy Lee Robinson
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 1240631 - Posted: 3 Jun 2012, 13:21:12 UTC

This should renice the cpu/gpu threads as a simple command:

for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done;

I guess it could be added to /etc/crontab to run every minute:
* * * * * root sleep 10; (for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done;)

Not sure how much benefit it really is, but might not be helpful on a production webserver/db :-)
ID: 1240631 · Report as offensive
Terror Australis
Volunteer tester

Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 1240641 - Posted: 3 Jun 2012, 14:16:04 UTC - in response to Message 1240631.  

This should renice the cpu/gpu threads as a simple command:

for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done;

I guess it could be added to /etc/crontab to run every minute:
* * * * * root sleep 10; (for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done;)

Not sure how much benefit it really is, but might not be helpful on a production webserver/db :-)

I run a similar script via crontab on The Monster but it doesn't do much. Changing the niceness of the GPU tasks is really only necessary if you're running CPU+GPU, as the CPU tasks get priority and slow the GPU tasks by as much as 50%. On another machine, I found a nice of -5 brought the GPU crunching time down to "normal" without strangling the CPU times.

On The Monster changing the Niceness of the GPU tasks makes no difference as it is a GPU only cruncher and there is very little CPU usage apart from that required to drive the GPU's.

T.A.
ID: 1240641 · Report as offensive
doug
Volunteer tester

Joined: 10 Jul 09
Posts: 202
Credit: 10,828,067
RAC: 0
United States
Message 1240910 - Posted: 3 Jun 2012, 23:25:25 UTC - in response to Message 1239025.  

The crux of the issue is likely to be how the CUDA code recognizes and reacts when the GPU has finished one kernel and needs to be told what to do next. A way to keep the GPU almost fully occupied is to have the CPU in a spin loop waiting for the signal from the GPU, but that of course means the CPU will be shown as completely used too. Instead some kind of interrupt mechanism is used, and I don't know much more. I certainly don't know in what sense the niceness should affect the amount of wasted CPU time, for instance.
                                                                  Joe


As it's an interesting topic I've been studying & experimenting with for over a year, I'll elaborate on this from the technical angles. The Cuda builds on both sides mostly use 'blocking synchronisation' at this stage (which is somewhat antiquated), and some Cuda streams (less so).

On both platforms the blocking synchronisation is expensive on GPU execution, as effectively the device has to sit idle waiting for the CPU to get around to any postprocessing, which is a suboptimal situation. Thus key code parts, such as the initial chirp and some costly reductions, are streamed (asynchronous) while the CPU does other postprocessing or reduction/reporting (as opposed to either spinning at full CPU thread usage, or completely blocking the CPU thread, which is OS scheduled). Increased host process (and the worker thread that hangs from it) priority can therefore aid the blocking portions, depending on how the underlying OS & drivers schedule the synchronisation primitives. The Vista+ WDDM driver model also has special optimisation & scheduling features to aid in these synchronisation situations, through finer grained kernel scheduling & slimmer synchronisation, which hides latencies, as does running multiple tasks at once on many cards (more so in Vista/Win7, as there is more latency to hide).

To overcome these synchronisation, scheduling and utilisation issues, many years ago both Microsoft and OpenGL incorporated callback interfaces into their graphics and audio APIs (OpenGL, DirectX, DirectSound, Core Audio). This is an extremely efficient model, but a complex one.

Last year, nVidia exposed a limited callback interface through its CUPTI libraries for CUDA development, and the OpenCL consortium introduced OpenCL 1.1 with GPU callbacks, which will largely supersede legacy blocking, spinning or stream async methods... But unfortunately, with many GPGPU developers being unfamiliar with Windows DirectX, Core Audio & OpenGL programming, these complex avenues have yet to be fully & properly explored in GPGPU contexts, despite being mature & highly efficient mechanisms.

In fact, misunderstanding of this technological jump has led to ill-conceived workarounds to put blocking sync functionality back in, and bone pointing at the engineers/vendors... largely stemming from the massive research & transitional code change burden required to switch to the newer techniques (i.e. reading and understanding the callback interface's purpose & function, and then applying updates to code, all of which costs time & skill).

Cuda 5 will support callback synchronisation natively (AFAIK on all supported platforms). The Cuda 'X-branch' will be completely re-engineered to use it throughout (x42 series, targeted at maximal efficiency) sometime after V7 Multibeam is more or less under control. Use of native callback techniques where available (Cuda 5, DirectCompute & OpenCL 1.1+), CUPTI (older Cuda), and hand-crafted callback synchronisation will eventually become standard practice & solve a lot of problems... but as with any technological jump, there are teething problems & some resistance to change.

Jason

So we're victims of Nvidia's unwillingness to understand parallel programming? I would be appalled, but given the rate of change in software/hardware I am not surprised. Maybe I'll email Dijkstra's papers to Huang. :-)
ID: 1240910 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1241038 - Posted: 4 Jun 2012, 8:16:29 UTC - in response to Message 1240910.  
Last modified: 4 Jun 2012, 8:19:09 UTC

So we're victims of Nvidia's unwillingness to understand parallel programming? I would be appalled, but given the rate of change in software/hardware I am not surprised. Maybe I'll email Dijkstra's papers to Huang. :-)


LoL, no, it's not nVidia being reluctant at all. They've been very much a driving force for GPU supercompute with Cuda and OpenCL from the start. It's the developers that use 'the stuff' who can be quite reluctant to keep up with the rapid pace of development. That is, let's call it 'us developers' being slow on the uptake, rather than 'those developers' (engineers really, for the most part) being slow to provide & refine technologies, which are advancing/evolving rapidly enough for me ;)

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1241038 · Report as offensive
Profile enzed
Joined: 27 Mar 05
Posts: 347
Credit: 1,681,694
RAC: 0
New Zealand
Message 1242611 - Posted: 7 Jun 2012, 4:28:19 UTC - in response to Message 1240631.  

This should renice the cpu/gpu threads as a simple command:

for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done;

I guess it could be added to /etc/crontab to run every minute:
* * * * * root sleep 10; (for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done;)

Not sure how much benefit it really is, but might not be helpful on a production webserver/db :-)



I like the script. Just a couple of thoughts on extending it to handle the different executable names that I have noticed arising from the different client apps that run... and also the multiple process PIDs that come from running on a multicore machine...

# remember to set your renice number to your comfort level

#
# MultiBeam apps: pgrep -l prints "PID name", so cut out the PID column
for x in `pgrep -l setiath | grep setiathome | cut -d" " -f1 | paste -s`
do
    renice -n -1 $x > /dev/null
done

# same again for any AstroPulse apps
for z in `pgrep -l astrop | grep astropulse | cut -d" " -f1 | paste -s`
do
    renice -n -1 $z > /dev/null
done
##
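For what it's worth, a shorter variant along the same lines (untested here, so treat it as a sketch) lets pgrep's own pattern matching cover both app types at once:

for p in $(pgrep 'setiathome|astropulse'); do
    renice -n -1 "$p" > /dev/null
done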
ID: 1242611 · Report as offensive
Terror Australis
Volunteer tester

Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 1242910 - Posted: 7 Jun 2012, 17:19:52 UTC

An update.
For some reason Linux does not handle the load of seven GPUs, and/or the bus congestion associated with keeping them fed, as well as Windows does. This was borne out by the longer crunching times and much increased CPU usage. Whether this is due to a problem with the kernel, the drivers or another reason I don't know, but the evidence is there in the crunching times.

Stage 2 of the experiment involved replacing the 7 GTS250's with 3 GTX470's, each running 2 tasks to try to keep a similar system load to the previous setup. All other hardware remained unchanged and the same CPU and RAM speeds were kept.

The results were as follows. CPU load dropped from 30% to around 10%. CPU use per CUDA task dropped to "normal" at 1%-2% each. Crunching times increased slightly (around 30 seconds for a "standard" unit) from what the same cards were achieving under Windows, but this is explainable by the slightly reduced clock speeds the cards are now running at.

After 12 hours of GPU-only working I have re-enabled CPU crunching as well. It will be interesting to see how overall times will be affected.

T.A.

ID: 1242910 · Report as offensive