No more guppi's=vlars on the gpu please

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1793608 - Posted: 5 Jun 2016, 5:38:07 UTC - in response to Message 1793602.  
Last modified: 5 Jun 2016, 5:39:14 UTC

. . Sorry but you have missed the gist of what I said.

Probably like last time, I can only respond to what you type. If what you type isn't what you mean, I will certainly miss what you're trying to convey.

As it is, you missed what I was saying.
They are what they are; it's up to you to choose how you run them, be it 1, 2 or 3 at a time.
It was that way with MB WUs, and it's that way with Guppies. The Guppies have a more extreme effect when running more than one at a time than MB WUs did, but it's no different, just more pronounced.
If you optimise your crunching for MB WUs and it bogs down when it runs Guppies, that's the choice you make. Or you could optimise it to allow for the effect of the Guppies; just as previously people had to choose between fast crunch times on longer-running WUs, and reduced throughput when shorter-running WUs came into the mix.
You can go back to the CUDA application, but you will still have to make the same choices.


. . That is why I asked if there was a way to make them multithread on the GPU,

As above, you run as many as you wish.
It's not a matter of whether they will or won't multithread. It's a matter of developing the application to process them faster, and that will take time.
So as I said above - it's up to you to choose: fast MB work & extremely slow Guppie work, or reasonably fast MB & relatively fast Guppie work.
You're the one that chooses which way to go.
Grant
Darwin NT
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1793612 - Posted: 5 Jun 2016, 6:08:06 UTC

If we could treat these Guppie VLARs separately from normal Arecibo work, just like Astropulse work, then I'd be all for it.

In other words, SoG for Guppies and CUDA for Arecibo. ;-)

Cheers.
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793633 - Posted: 5 Jun 2016, 8:54:36 UTC - in response to Message 1793567.  

Is there a setting that I can tweak to persuade the Guppi WUs to truly run concurrently and behave as the non-VLAR WUs do?

Try to add -sbs 256 or -sbs 512 if you have enough memory.
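Like the other tuning switches discussed in this thread, -sbs is a command-line option of the OpenCL MultiBeam builds; assuming a Lunatics-style installation, it goes in the app's mb_cmdline*.txt file (or in the <cmdline> element of app_info.xml), e.g.:

    -sbs 256

(N here is a buffer size in MB, which is why the memory headroom matters.)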
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793634 - Posted: 5 Jun 2016, 8:56:23 UTC - in response to Message 1793612.  


In other words, SoG for Guppies and CUDA for Arecibo. ;-)

Taking into account that SoG is currently best suited to VHARs, and that GBT data hardly produces any VHARs, this doesn't sound like a good proposal.
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793635 - Posted: 5 Jun 2016, 8:59:30 UTC - in response to Message 1793583.  

But it seems strange to me that the one application behaves so contrarily in dealing with the two different types of WU. With non-VLARs it combines them and fully utilises the GPU, but with Guppis it does almost the opposite.

Can you post GPU load pictures for those cases? Link to host?
Anyone else observe such behavior?
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793638 - Posted: 5 Jun 2016, 9:13:44 UTC - in response to Message 1793602.  
Last modified: 5 Jun 2016, 9:18:27 UTC



. . These guppies are very contrary critters.

In addition to -sbs 256 or 512, if you don't experience lags or can tolerate them, try to set this option:
-period_iterations_num 1
(If the lags are too big, one can increase the value until they become tolerable. The default is 50 [500 for the low-performance path], so there is plenty of room for tuning that way.)

It seems the issue with VLARs not benefiting from simultaneous tasks is again an increased share of PulseFind (at the lowest FFT sizes). PulseFind at the lowest FFT sizes is the longest kernel. That's why it can be "slept away" with the clumsy Windows Sleep() call (consider a typical GPU kernel length of ~100 µs versus a minimal (!) sleep time of 1 ms and a quantum size of 20 ms).
But if almost all the work consists of such kernels, each task will go to sleep and the GPU will not be fed again.
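A minimal sketch of that granularity mismatch (a hypothetical host-side polling loop, not the app's actual code; the numbers are the ones quoted above):

    #include <windows.h>

    // Hypothetical wait loop on the host side. The GPU kernel takes ~100 us,
    // but Sleep(1) cannot wait for less than ~1 ms (and with a ~20 ms
    // scheduler quantum it may wait far longer), so the GPU sits idle
    // between kernels once every task has gone to sleep.
    void wait_for_kernel(volatile bool &kernel_done)
    {
        while (!kernel_done)
            Sleep(1); // frees the CPU core, but oversleeps the kernel ~10x
    }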

So, the possible issue is that even the biggest kernel is smaller than the minimal Sleep() duration. If that's true, then increasing the number of simultaneous tasks (up to the GPU memory limit) would help with both GPU load and throughput on VLARs. Unfortunately, this will increase switching overhead for all tasks, non-VLARs included. So I would expect a decrease in throughput for non-VLARs in such a config (how strong depends on the GPU architecture - prohibitively big starting from 2 tasks per GPU for pre-Fermi cards, for example).

Another way is to make the kernels "under sleep" bigger.
This can be done by increasing the -sbs N value and not splitting a kernel into several calls (that is, decreasing the -period_iterations_num N value).

Try these approaches.
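For example, both options combined on the application's command line (a hypothetical starting point, assuming enough GPU memory; tune per card and per tolerance for lag):

    -sbs 512 -period_iterations_num 1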

P.S. In view of this theory, running VLAR + non-VLAR simultaneously will give the best throughput.
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1793642 - Posted: 5 Jun 2016, 9:42:56 UTC - in response to Message 1793638.  

So, the possible issue is that even the biggest kernel is smaller than the minimal Sleep() duration. If that's true, then increasing the number of simultaneous tasks (up to the GPU memory limit) would help with both GPU load and throughput on VLARs. Unfortunately, this will increase switching overhead for all tasks, non-VLARs included. So I would expect a decrease in throughput for non-VLARs in such a config (how strong depends on the GPU architecture - prohibitively big starting from 2 tasks per GPU for pre-Fermi cards, for example).

OK.
So that explains why, on my Maxwell cards, shorties run much slower than they did on my previous cards (GTX 460 & GTX 560 Ti).
And it explains why running 3 VLARs at a time on my GTX 750 Ti gives the best throughput per hour, but when running shorties the increase in run time is so extreme that even a few shorties result in less work being done per hour.
Hence 2 at a time is optimal for my cards with the present CUDA application.
Grant
Darwin NT
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793647 - Posted: 5 Jun 2016, 9:57:01 UTC - in response to Message 1793642.  
Last modified: 5 Jun 2016, 9:58:43 UTC

The CUDA runtime (just like the OpenCL runtime for AMD, btw) uses a different default method of synching with the GPU. Also, the CUDA runtime (unlike the OpenCL runtime) has a control call that allows that default to be changed if needed. With the OpenCL runtime on nVidia we are bound to the single offered way. Why the defaults were chosen differently for the CUDA and OpenCL runtimes by nVidia's engineers is unknown to me. Deliberate sabotage by the nVidia marketing department is just one of the possibilities ;)
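For illustration, a minimal sketch of that control call (this is the standard CUDA runtime API flag, not code from the SETI@home apps):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        // Must be set before the CUDA context for the device is created.
        // cudaDeviceScheduleBlockingSync makes the host thread block instead
        // of spin-waiting, freeing the CPU core at the cost of wake-up
        // latency; cudaDeviceScheduleSpin is the opposite trade-off.
        cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
            return 1;
        }
        // ... launch kernels as usual; cudaDeviceSynchronize() and friends
        // now block rather than spin.
        return 0;
    }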
Harri Liljeroos
Joined: 29 May 99
Posts: 4070
Credit: 85,281,665
RAC: 126
Finland
Message 1793657 - Posted: 5 Jun 2016, 11:47:51 UTC
Last modified: 5 Jun 2016, 11:54:30 UTC

If we had different SETI applications for Arecibo MBs and Green Bank Telescope MBs (we could still have the same executables, just with different names), we would have the possibility of applying different settings via app_config and command-line parameters, as sketched below.
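For instance, a hypothetical app_config.xml along these lines (the GBT app name setiathome_v8_gbt is invented for illustration; only a single MB app exists today):

    <app_config>
      <!-- Arecibo MB app: run 2 tasks per GPU -->
      <app>
        <name>setiathome_v8</name>
        <gpu_versions>
          <gpu_usage>0.5</gpu_usage>
          <cpu_usage>0.1</cpu_usage>
        </gpu_versions>
      </app>
      <!-- Hypothetical GBT/guppi MB app: one task per GPU plus a full CPU core -->
      <app>
        <name>setiathome_v8_gbt</name>
        <gpu_versions>
          <gpu_usage>1.0</gpu_usage>
          <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
      </app>
    </app_config>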

Another thing I do not like is that because GBT work is slow and does not pay any extra credit for that, it results in less work being done on other projects using the same resources (GPUs); i.e. for SETI to get its share done it needs more time, and this reduces the time available for other projects.
Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 7379
Credit: 44,181,323
RAC: 238
United States
Message 1793662 - Posted: 5 Jun 2016, 13:07:40 UTC - in response to Message 1793550.  

normal wus take about 22 mins, guppis about 35-45 mins.

And the problem is?

I could say, not enough heat or not fast enough, you would probably reject both.

Of course.
The idea is to try to find signs of extraterrestrial life, not to provide people with a heat source.
As to taking longer than other WUs, so what? Seti on BOINC took longer to process than the original Seti WUs. v7 took longer to process than v6, v8 takes longer than v7. I expect v9 will take longer than v8.
Over time optimised versions will be released that can do the work in less time, but until then what we have is what we have.
Taking longer than other WUs isn't an issue.

Greetings Grant,

Pardon while I make an observation here:

I see the above statements as expressing regression rather than progression.

Keep on BOINCing... :)
CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22190
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1793666 - Posted: 5 Jun 2016, 13:35:23 UTC

Each major iteration in the MB application has seen an increase in the amount of calculation performed. With that, and the increased complexity of the calculations required for the guppi data, it is hardly surprising that these take longer to run.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Mr. Kevvy
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 1793667 - Posted: 5 Jun 2016, 13:39:36 UTC - in response to Message 1793666.  
Last modified: 5 Jun 2016, 14:02:11 UTC

Each major iteration in the MB application has seen an increase in the amount of calculation performed. With that, and the increased complexity of the calculations required for the guppi data, it is hardly surprising that these take longer to run.


That isn't the issue here... GUPPIs don't do any more calculation or take any longer to run on CPUs than Arecibo VHARs; they usually end up a little faster. They only run very slowly on GPUs due to architecture issues.
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1793685 - Posted: 5 Jun 2016, 15:33:37 UTC - in response to Message 1793667.  

Each major iteration in the MB application has seen an increase in the amount of calculation performed. With that, and the increased complexity of the calculations required for the guppi data, it is hardly surprising that these take longer to run.


That isn't the issue here... GUPPIs don't do any more calculation or take any longer to run on CPUs than Arecibo VHARs; they usually end up a little faster. They only run very slowly on GPUs due to architecture issues.


No, a 980, for example, doesn't take longer to process a guppi than my AMD R9 380 if set up correctly.
Maybe a few more lags, but those can be reduced with -period_iterations_num.


With each crime and every kindness we birth our future.
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1793688 - Posted: 5 Jun 2016, 15:46:52 UTC - in response to Message 1793685.  

Mr. Kevvy's point is accurate if you assume he is talking about Arecibo VLARs and NVidia GPUs.
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1793707 - Posted: 5 Jun 2016, 17:15:06 UTC - in response to Message 1793688.  

Mr. Kevvy's point is accurate if you assume he is talking about Arecibo VLARs and NVidia GPUs.


And CPUs, for which the pulsefinding algorithm was Eric K's seminal work in the first place (implementations since optimised by Alex Kan, Joe Segur, Ben Herndon and others, over a span of more than a decade).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793728 - Posted: 5 Jun 2016, 19:24:54 UTC

Pictures of v8 performance will be posted here: http://lunatics.kwsn.info/index.php/topic,1806.0.html. For now, one can refresh one's memory of how it was with the older apps.
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793923 - Posted: 6 Jun 2016, 11:14:54 UTC - in response to Message 1793612.  

If we could treat these Guppie VLARs separately from normal Arecibo work, just like Astropulse work, then I'd be all for it.

In other words, SoG for Guppies and CUDA for Arecibo. ;-)

Cheers.



. . That would be a good start, but I am getting good results with normal (Arecibo) MB WUs under SoG. And if you are willing to sacrifice the use of one CPU core (or more if you want to run multiples), then SoG gives better results with Guppis than CUDA. It just loses ground when running with Sleep ON to preserve the CPU core.

. . I think I am about to try some different combinations to test the limits.
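For reference, the "Sleep ON" mode referred to above is, assuming these SoG builds, the -use_sleep command-line switch, added alongside the other tuning options in the same cmdline file, e.g.:

    -use_sleep -sbs 256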
Rasputin42
Volunteer tester
Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1793924 - Posted: 6 Jun 2016, 11:20:01 UTC

Any speed advantage gained by SoG is futile if a whole core has to support it.
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793925 - Posted: 6 Jun 2016, 11:21:26 UTC - in response to Message 1793608.  

. . Mea Culpa,

. . I didn't include much detail, but it was a throw-away observation of the moment. I hope my extended reply clarified things for you.
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793928 - Posted: 6 Jun 2016, 11:28:14 UTC - in response to Message 1793633.  
Last modified: 6 Jun 2016, 11:29:23 UTC

Is there a setting that I can tweak to persuade the Guppi WUs to truly run concurrently and behave as the non-VLAR WUs do?

Try to add -sbs 256 or -sbs 512 if you have enough memory.


. . How much memory is enough?

. . I have a 2GB GTX 950, and I am thinking that with -sbs 256, running triples would require only 768MB; would that be correct? And would using -sbs 512 take 1.5GB but still be possible on this card?

. . And which is most likely to achieve a positive result?
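A rough check of that arithmetic (assuming -sbs N reserves one N-MB buffer per task, and ignoring the app's other allocations and driver overhead, which also eat GPU memory):

    3 tasks x 256 MB =  768 MB  -> comfortable on a 2048 MB card
    3 tasks x 512 MB = 1536 MB  -> leaves only ~500 MB for everything else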