Question about SOG

Message boards : Number crunching : Question about SOG
betreger (Project Donor)
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1838581 - Posted: 29 Dec 2016, 17:53:16 UTC

SoG is an OpenCL app, as I understand it. Is there an appreciable difference in run time between running OpenCL on Windows vs Linux? The reason I ask is that Einstein recently came out with an OpenCL app, and it runs 5 to 10 times faster on Linux than on Windows. Is that something inherent in Linux, or just the way the app was written?
ID: 1838581
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22190
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1838585 - Posted: 29 Dec 2016, 17:59:22 UTC

When I moved one of my crunchers from Windows to Linux, it briefly ran SoG applications. I found very little difference in run times, but there was a reduction in the demands on the CPU.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1838585
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1838646 - Posted: 29 Dec 2016, 22:06:56 UTC

Since the SETI apps have the same codebase for Windows and Linux, I can say there is no big difference in speed.
Of course the drivers are different, so it might vary a little bit.


With each crime and every kindness we birth our future.
ID: 1838646
baron_iv
Volunteer tester
Joined: 4 Nov 02
Posts: 109
Credit: 104,905,241
RAC: 0
United States
Message 1840588 - Posted: 7 Jan 2017, 11:38:49 UTC

I have noticed a *tremendous* difference when using Linux vs Windows. My dual Nvidia 1070 system maxes out at around 45k RAC under Windows, but under Linux it hovers around 70-75k RAC with the "special sauce" app. I can't explain why there's a discrepancy, or what causes it, but Linux gets significantly more work done in a given time period than Windows.
ID: 1840588
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22190
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1840608 - Posted: 7 Jan 2017, 12:44:57 UTC

The big improvement of the TBar/Petri special application over SoG or SAH is that the special application is very highly optimised, using some "tricks" that may not work properly under Windows. A few folks are working hard to get them to work under Windows, but it is proving somewhat challenging.
ID: 1840608
M_M
Joined: 20 May 04
Posts: 76
Credit: 45,752,966
RAC: 8
Serbia
Message 1840630 - Posted: 7 Jan 2017, 14:08:44 UTC - in response to Message 1840608.  
Last modified: 7 Jan 2017, 14:10:45 UTC

Is it in essence mostly about "sleep" and timer accuracy?

If so, can HPET be used in Windows? Sure, it has to be enabled first, as I think it is disabled by default...
ID: 1840630
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1840721 - Posted: 7 Jan 2017, 22:28:33 UTC - in response to Message 1840630.  
Last modified: 7 Jan 2017, 22:28:55 UTC

Hi,

It is not about sleep and other things. It is mostly about
a) distributing the work to be done across all streaming multiprocessors (SM/SMX),
b) doing 'some' optimisations on the code itself,
c) using shared memory where applicable (a technique not yet published to other CUDA developers),
d) doing the autocorrelation FFT in a novel way that needs far fewer memory accesses and less computation,
e) queueing tasks so that more gets done at a time,
f) optimising kernel register usage and kernel size,
g) some other minor stuff.
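Point d) rests on a standard trick worth sketching: by the Wiener-Khinchin theorem, the autocorrelation of a signal is the inverse FFT of its power spectrum, turning an O(n²) sum into O(n log n) work. The toy pure-Python version below only illustrates that general idea; every name in it is made up for this sketch, and none of it comes from Petri's (unpublished) code.

```python
# Toy illustration: autocorrelation via the FFT of the power spectrum.
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def ifft(x):
    """Inverse FFT via the conjugation identity."""
    n = len(x)
    y = fft([v.conjugate() for v in x])
    return [v.conjugate() / n for v in y]

def autocorr_direct(x):
    """O(n^2) definition: r[lag] = sum_i x[i] * x[i + lag]."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(n)]

def autocorr_fft(x):
    """O(n log n): zero-pad (to avoid circular wrap-around), FFT,
    power spectrum, inverse FFT."""
    n = len(x)
    spec = fft([complex(v) for v in x] + [0j] * n)
    corr = ifft([v * v.conjugate() for v in spec])
    return [corr[lag].real for lag in range(n)]
```

Both routines produce the same lags; the FFT route just touches memory far less often for large n, which is the point made in d).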

Item a) is the hardest part to get right.
I'm now running version l (L stands for locking/synchronising globally). That version helps a lot, but it is not bug-free and cannot be published yet, the main reason being occasional lockups.

Previous versions had problems with the order of finding/reporting pulses, and sometimes reported a bad value.

My code is pre-alpha. I test it first. Then others (superior people) test it, and after it has been field-proven 'valid' on beta by a small and then a larger group of users, everyone else can get it. Otherwise we'd ruin the science.

You may ask: why do you run it on main? I do because it is allowed and encouraged. At the same time I can show what is possible, and the caveats of doing so. And I also run it on beta, of course.

--
Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1840721
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1840728 - Posted: 7 Jan 2017, 23:14:08 UTC - in response to Message 1840721.  

e) queueing tasks so that more gets done at a time

I suspect that alone saves a good chunk of time.
I notice when running SoG that the first 14-20 secs of WU processing isn't done on the GPU (GPU load is 0%). Pre-processing the next WU so that, when it starts, it starts crunching on the GPU straight away would save that 14-20 secs on every WU.
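A back-of-the-envelope sketch of that saving: if the setup of the next WU overlaps the GPU crunching of the current one, only the very first setup is exposed. The 17 s figure below is the middle of the observed 14-20 s gap; the 240 s crunch time is an assumed number, not a measurement.

```python
# Makespan with and without overlapping per-WU CPU setup and GPU crunching.
PREP_S = 17.0     # assumed CPU setup per WU (middle of the observed 14-20 s gap)
CRUNCH_S = 240.0  # assumed GPU crunch time per WU (hypothetical)

def serial_makespan(n_wus):
    """Setup and crunch strictly alternate; the GPU idles during every setup."""
    return n_wus * (PREP_S + CRUNCH_S)

def pipelined_makespan(n_wus):
    """Setup of WU i+1 overlaps crunch of WU i; only the first setup is
    exposed (valid while PREP_S <= CRUNCH_S, i.e. the GPU is the bottleneck)."""
    return PREP_S + n_wus * CRUNCH_S

n = 14  # roughly an hour of work at these assumed times
saved = serial_makespan(n) - pipelined_makespan(n)
print(f"saved over {n} WUs: {saved:.0f} s")  # (n - 1) * PREP_S = 221 s
```

The saving grows as (n - 1) × PREP_S, i.e. one hidden setup per WU after the first.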
Grant
Darwin NT
ID: 1840728
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1840760 - Posted: 8 Jan 2017, 2:20:32 UTC - in response to Message 1840728.  
Last modified: 8 Jan 2017, 2:22:10 UTC

e) queueing tasks so that more gets done at a time

I suspect that alone saves a good chunk of time.
I notice running SoG that the first 14-20secs of WU processing isn't done on the GPU (GPU load is 0%). Pre-processing the next WU to run so that when it starts, it starts crunching on the GPU straight away would save that 14-20secs on every WU.


That is an interesting find.

I'm sure Raistmer can tell more about that. And anyone running SoG can run 2 at a time to overcome it.
The queueing in e) is done on the CPU, to fill the GPU queues to the max. That can be micromanaged too. Sometimes doing some work beforehand pays off in the end (grand total).

Please insert space(s) wherever you want to. Then press the space bar to continue!
EDIT: My RAC hit 200,000 while writing this.
ID: 1840760
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1840770 - Posted: 8 Jan 2017, 3:21:09 UTC - in response to Message 1840760.  
Last modified: 8 Jan 2017, 3:23:08 UTC

I'm sure Raistmer can tell more about that. And anyone running SoG can run 2 at a time to overcome that.

I've run 2 at a time with my particular command line settings, and the only improvement I got was about an extra 1-1.5 WUs per hour. Not really worth it, IMHO.
But it looks like that gain would mostly be the result of offsetting that initial CPU setup period. It's nothing like the benefit of running 2 (or more) at a time with CUDA50.
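That 1-1.5 WUs/hour figure is roughly consistent with hiding a 14-20 s setup per WU, if a WU takes around four minutes on the GPU. Both per-WU times below are assumed for the sake of the sanity check, not measured:

```python
# Throughput gain from hiding the per-WU setup gap by running 2 WUs at a time.
PREP_S = 17.0     # assumed setup per WU (the observed 14-20 s GPU-idle gap)
CRUNCH_S = 226.0  # assumed GPU crunch time per WU (hypothetical, ~4 minutes)

one_at_a_time = 3600.0 / (PREP_S + CRUNCH_S)  # WUs/hour, setup gap exposed
two_at_a_time = 3600.0 / CRUNCH_S             # WUs/hour, setup gap hidden
gain = two_at_a_time - one_at_a_time
print(f"extra WUs per hour: {gain:.2f}")  # about 1.1
```

So the observed gain is about what setup-hiding alone would predict, supporting the "mostly offsetting the initial CPU setup" reading.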


EDIT: My RAC hit 200 000! while writing this.

Nothing else comes close to boosting the numbers like all Arecibo work.
;-)
ID: 1840770
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1840775 - Posted: 8 Jan 2017, 3:45:34 UTC - in response to Message 1840770.  
Last modified: 8 Jan 2017, 3:47:07 UTC

I'm sure Raistmer can tell more about that. And anyone running SoG can run 2 at a time to overcome that.

I've run 2 at a time with my particular command line settings, and the only improvement I got was about an extra 1-1.5 WUs per hour. Not really worth it IMHO.
But it looks like it that gain would mostly be the result of offsetting that initial CPU setup work period. It's nothing like the benefit with CUDA50 of running 2 (or more) at a time.


EDIT: My RAC hit 200 000! while writing this.

Nothing else comes close to boosting the numbers like all Arecibo work.
;-)


Yes,

I noticed the APR bump up from 1300 to over 1700.
I wonder what it will do: one credit per second, times four, times 24 hours. (If it lasts)
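Taking that multiplication literally, and guessing that "times four" means four GPUs (an assumption, since the post doesn't say), the daily rate would be:

```python
CREDIT_PER_S = 1          # one credit per second, per GPU (stated above)
GPUS = 4                  # assumed meaning of "times four"
SECONDS_PER_DAY = 24 * 3600
daily = CREDIT_PER_S * GPUS * SECONDS_PER_DAY
print(daily)  # 345600 credits per day
```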
ID: 1840775

©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.