Message boards :
Number crunching :
No more guppi's=vlars on the gpu please
Grant (SSSF) Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304
> . . Sorry but you have missed the gist of what I said.

Probably, like last time, I can only respond to what you type. If what you type isn't what you mean, I will certainly miss what you're trying to convey. As it is, you missed what I was saying. They are what they are; it's up to you to choose how you run them, be it 1, 2 or 3 at a time. It was that way with MB WUs, and it's that way with Guppies. The Guppies have a more extreme effect when running more than 1 at a time than MB did, but it's no different, just more pronounced. If you optimise your crunching for MB WUs and it bogs down when it runs Guppies, that's the choice you make. Or you could optimise it to allow for the effect of the Guppies, just as previously people had to choose between fast crunch times on longer-running WUs and less throughput when shorter-running WUs came into the mix. You can go back to the CUDA application, but you will still have to make the same choices.

> . . That is why I asked if there was a way to make them multithread on the GPU.

As above, you run as many as you wish. It's not a matter of whether they will or won't multithread; it's a matter of developing the application to process them faster, and that will take time. So as I said above, it's up to you to choose: fast MB work and extremely slow Guppie work, or reasonably fast MB and relatively fast Guppie work. You're the one who chooses which way to go.
Grant
Darwin NT
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489
If we could treat these Guppie VLARs separately from normal Arecibo work, just like Astropulse work, then I'd be all for it. In other words, SoG for Guppies and CUDA for Arecibo. ;-)
Cheers.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
> Is there a setting that I can tweak to persuade the Guppi WUs to truly run concurrently and behave as the nonVLAR WUs do?

Try adding -sbs 256 or -sbs 512 if you have enough memory.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Taking into account that SoG is currently best suited to VHARs, and that GBT data hardly produces any VHARs, this doesn't sound like a good proposal.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
> But it seems strange to me that the one application behaves so differently in dealing with the two different types of WU. With nonVLAR it combines them and fully utilises the GPU, but with Guppis it does almost the opposite.

Can you post GPU load pictures for those cases? A link to the host? Does anyone else observe such behaviour?
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
In addition to -sbs 256 or 512, if you don't experience lags or can tolerate them, try setting this option: -period_iterations_num 1 (if the lags are too big, one can increase the value until they are tolerable; the default is 50 [500 for the low-performance path], so there is plenty of room for tuning that way).

It seems the issue with VLARs not benefiting from simultaneous tasks is an increased share of PulseFind again (with the lowest FFT sizes). PulseFind on the lowest FFT sizes is the longest kernel. That's why it can be "slept away" with the clumsy Windows Sleep() call (consider a typical GPU kernel length of ~100 us against a minimal (!) sleep time of 1 ms and a quantum size of 20 ms). But if almost all the work consists of such kernels, each task will go to sleep and the GPU will not be fed again.

So the possible issue is that even the biggest kernel is smaller than the minimal Sleep() duration. If that's true, then increasing the number of simultaneous tasks (up to the GPU memory limit) would help both GPU load and throughput on VLARs. Unfortunately, this will also increase switching overhead for all tasks, non-VLAR included. So I would expect a decrease in throughput for non-VLARs in such a config (how strong depends on the GPU architecture; prohibitively big starting from 2 tasks per GPU for pre-Fermi, for example).

Another way is to make the kernels "under sleep" bigger. This can be done by increasing the -sbs N value and by not splitting a kernel into several calls (that is, decreasing the -period_iterations_num N value). Try these approaches.

P.S. In view of such a theory, running VLAR + non-VLAR simultaneously will give the best throughput.
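A back-of-the-envelope model makes the starvation effect above concrete. The ~100 us kernel and 1 ms minimum Sleep() figures come from the post; the saturation model itself is my simplifying assumption, not a measurement:

```python
# Back-of-envelope model of the Sleep() starvation described above.
# Figures from the post: ~100 us typical GPU kernel, >= 1 ms minimum
# effective Sleep() on Windows. The saturation model is an assumption.

KERNEL_US = 100      # typical GPU kernel length, microseconds
MIN_SLEEP_US = 1000  # minimum effective Sleep() duration, microseconds

def gpu_busy_fraction(kernel_us, sleep_us, tasks=1):
    """Fraction of wall time the GPU is fed when each of `tasks`
    interleaved tasks sleeps for `sleep_us` after every kernel."""
    # At most `tasks` kernels can overlap one sleep window; cap at saturation.
    busy = min(tasks * kernel_us, sleep_us + kernel_us)
    return busy / (sleep_us + kernel_us)

print(f"1 VLAR task : GPU fed ~{gpu_busy_fraction(KERNEL_US, MIN_SLEEP_US):.0%} of the time")
print(f"3 VLAR tasks: GPU fed ~{gpu_busy_fraction(KERNEL_US, MIN_SLEEP_US, 3):.0%} of the time")
```

Under these assumed numbers a single VLAR task keeps the GPU fed less than 10% of the time, which is consistent with the advice to run more tasks at once or to make the kernels themselves bigger (larger -sbs, smaller -period_iterations_num).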
Grant (SSSF) Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304
> So the possible issue is that even the biggest kernel is smaller than the minimal Sleep() duration. If that's true, then increasing the number of simultaneous tasks (up to the GPU memory limit) would help both GPU load and throughput on VLARs. Unfortunately, this will also increase switching overhead for all tasks, non-VLAR included. So I would expect a decrease in throughput for non-VLARs in such a config (how strong depends on the GPU architecture; prohibitively big starting from 2 tasks per GPU for pre-Fermi, for example).

OK. So that explains why shorties run much slower on my Maxwell cards than they did on my previous cards (GTX 460 & GTX 560 Ti). And it explains why running 3 VLARs at a time on my GTX 750 Ti gives the best throughput per hour, yet when running shorties the increase in run time is so extreme that even a few shorties result in less work being done per hour. Hence 2 at a time is optimal for my cards with the present CUDA application.
Grant
Darwin NT
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
The CUDA runtime (just like the OpenCL runtime for AMD, btw) uses a different default method of synching with the GPU. Also, the CUDA runtime (unlike the OpenCL runtime) has a control call that allows that default to be changed if needed. With the OpenCL runtime on nVidia we are bound to the single way offered. Why the defaults were chosen differently for the CUDA and OpenCL runtimes by nVidia's engineers is unknown to me. Deliberate sabotage by the nVidia marketing department is just one of the possibilities ;)
Harri Liljeroos Joined: 29 May 99 Posts: 4070 Credit: 85,281,665 RAC: 126
If we had different SETI applications for Arecibo MBs and Green Bank Telescope MBs (they could still be the same executables, just with different names) we would have the possibility of applying different settings via app_config and command-line parameters. Another thing I do not like is that GBT work is slow and does not pay any extra credit for it, so it results in a lower amount of work being done on other projects using the same resources (GPUs); i.e. for SETI to get its share done it needs more time to achieve it, and this reduces the time available for other projects.
Siran d'Vel'nahr Joined: 23 May 99 Posts: 7379 Credit: 44,181,323 RAC: 238
> normal wus take about 22 mins, guppis about 35-45 mins.

Greetings Grant, Pardon me while I make an observation here: I see the above statements as expressing regression rather than progression. Keep on BOINCing... :)
CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
rob smith Joined: 7 Mar 03 Posts: 22190 Credit: 416,307,556 RAC: 380
Each major iteration of the MB application has seen an increase in the amount of calculation performed. With that, and the increased complexity of the calculations required for the guppi data, it is hardly surprising that these take longer to run.
Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe?
Mr. Kevvy Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319
> Each major iteration of the MB application has seen an increase in the amount of calculation performed. With that, and the increased complexity of the calculations required for the guppi data, it is hardly surprising that these take longer to run.

That isn't the issue here... GUPPIs don't do any more calculation or take any longer to run on CPUs than Arecibo VHARs; they usually end up a little faster. They only run very slowly on GPUs due to architecture issues.
Mike Joined: 17 Feb 01 Posts: 34257 Credit: 79,922,639 RAC: 80
> Each major iteration of the MB application has seen an increase in the amount of calculation performed. With that, and the increased complexity of the calculations required for the guppi data, it is hardly surprising that these take longer to run.

No, a 980, for example, doesn't take longer to process a guppi than my AMD R9 380 if set up correctly. Maybe a few more lags, but those can be reduced with -period_iterations_num.
With each crime and every kindness we birth our future.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Mr. Kevvy's point is accurate if you assume he is talking about Arecibo VLARs and NVidia GPUs.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> Mr. Kevvy's point is accurate if you assume he is talking about Arecibo VLARs and NVidia GPUs.

And CPUs, for which the pulsefinding algorithm was Eric K's seminal work in the first place (implementations since optimised by Alex Kan, Joe Segur, Ben Herndon and others, over a span of more than a decade).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Pictures of v8 performance will be posted here: http://lunatics.kwsn.info/index.php/topic,1806.0.html. For now, one can refresh memories of how it was with the older apps.
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
> If we could treat these Guppie VLAR's separately from normal Arecibo work just like Astropulse work then I'd be all for it.

. . That would be a good start. But I am getting good results with normal (Arecibo) MB WUs under SoG. And if you are willing to sacrifice the use of one CPU core (or more if you want to run multiples), then SoG gives better results with Guppis than CUDA does. It just loses ground when running with Sleep ON to preserve the CPU core.
. . I think I am about to try some different combinations to test the limits.
Rasputin42 Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0
Any speed advantage gained by SoG is futile if a whole core has to support it.
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
. . Mea culpa.
. . I didn't include much detail, but it was a throw-away observation of the moment. I hope my extended reply clarified things for you.
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
> Is there a setting that I can tweak to persuade the Guppi WUs to truly run concurrently and behave as the nonVLAR WUs do?

. . How much memory is enough?
. . I have a 2GB GTX 950, and I am thinking that with -sbs 256 running triples would require only 768MB; would that be correct? Then would using -sbs 512 take 1.5GB but still be possible on this card?
. . And which is most likely to achieve a positive result?
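The arithmetic in the question works out under the simple assumption that -sbs sets the main per-task buffer in MB and that memory scales linearly with the number of tasks; real tasks allocate additional working buffers, so actual usage runs somewhat higher than this lower bound. A sketch:

```python
# Lower-bound estimate of GPU memory for N concurrent tasks, assuming
# -sbs N sets the main per-task buffer size in MB. Real tasks allocate
# extra working buffers on top of this.

def total_buffer_mb(sbs_mb, tasks):
    """Main-buffer memory, in MB, for `tasks` concurrent tasks at -sbs sbs_mb."""
    return sbs_mb * tasks

CARD_MB = 2048  # GTX 950 with 2 GB

for sbs in (256, 512):
    need = total_buffer_mb(sbs, 3)
    verdict = "fits" if need < CARD_MB else "does not fit"
    print(f"-sbs {sbs} x 3 tasks -> {need} MB of main buffers ({verdict} in {CARD_MB} MB)")
```

So three tasks at -sbs 256 need at least 768 MB and three at -sbs 512 at least 1536 MB; the latter leaves only ~0.5 GB headroom on a 2 GB card for the extra buffers, which may be tight.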
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.