Message boards :
Number crunching :
No more guppi's=vlars on the gpu please
Grant (SSSF) Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304
> . . Sorry but you have missed the gist of what I said.

Probably, like last time, I can only respond to what you type. If what you type isn't what you mean, I will certainly miss what you're trying to convey. As it is, you missed what I was saying. They are what they are; it's up to you to choose how you run them, be it 1, 2 or 3 at a time. It was that way with MB WUs, and it's that way with Guppies. The Guppies have a more extreme effect when running more than 1 at a time than MB did, but it's no different, just more pronounced. If you optimise your crunching for MB WUs and it bogs down when it runs Guppies, that's the choice you make. Or you could optimise it to allow for the effect of the Guppies, just as previously people had to choose between fast crunch times on longer-running WUs and less throughput when shorter-running WUs came into the mix. You can go back to the CUDA application, but you will still have to make the same choices.

> . . That is why I asked if there was a way to make them multithread on the GPU.

As above, you run as many as you wish. It's not a matter of whether they will or won't multithread; it's a matter of developing the application to process them faster, and that will take time. So as I said above, it's up to you to choose: fast MB work and extremely slow Guppie work, or reasonably fast MB and relatively fast Guppie work. You're the one who chooses which way to go.
Grant
Darwin NT
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489
If we could treat these Guppie VLARs separately from normal Arecibo work, just like Astropulse work, then I'd be all for it. In other words, SoG for Guppies and CUDA for Arecibo. ;-)
Cheers.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
> Is there a setting that I can tweak to persuade the Guppi WUs to truly run concurrently and behave as the nonVLAR WUs do?

Try adding -sbs 256 or -sbs 512 if you have enough memory.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Taking into account that SoG is currently best suited to VHARs, and that GBT data hardly produces any VHARs, this doesn't sound like a good proposal.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
> But it seems strange to me that the one application behaves so differently in dealing with the two different types of WU. With nonVLAR it combines them and fully utilises the GPU, but with Guppis it does almost the opposite.

Can you post GPU load pictures for those cases? A link to the host? Does anyone else observe such behaviour?
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
In addition to -sbs 256 or 512, if you don't experience lags or can tolerate them, try setting this option: -period_iterations_num 1 (if the lags are too big, one can increase the value until they are tolerable; the default is 50 [500 for the low-performance path], so there is plenty of room for tuning that way).

It seems the issue with VLARs not benefiting from simultaneous tasks is an increased share of PulseFind again (with the lowest FFT sizes). PulseFind on the lowest FFT sizes is the longest kernel. That's why it can be "slept away" with the clumsy Windows Sleep() call (consider a typical GPU kernel length of ~100 us against a minimal (!) sleep time of 1 ms and a quantum size of 20 ms). But if almost all the work consists of such kernels, each task will go to sleep and the GPU will not be fed again.

So the possible issue is that even the biggest kernel is smaller than the minimal Sleep() duration. If that's true, then increasing the number of simultaneous tasks (up to the GPU memory limit) would help both GPU load and throughput on VLARs. Unfortunately, this will also increase switching overhead for all tasks, non-VLAR included. So I would expect a decrease in throughput for non-VLARs in such a config (how strong depends on the GPU architecture; prohibitively big starting from 2 tasks per GPU for pre-Fermi, for example).

Another way is to make the kernels "under sleep" bigger. This can be done by increasing the -sbs N value and by not splitting a kernel into several calls (that is, decreasing the -period_iterations_num N value). Try these approaches.

P.S. In view of such a theory, running VLAR + non-VLAR simultaneously will give the best throughput.
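A back-of-the-envelope model makes the starvation effect above concrete. The ~100 us kernel and 1 ms minimum Sleep() figures come from the post; the saturation model itself is my simplifying assumption, not a measurement:

```python
# Back-of-envelope model of the Sleep() starvation described above.
# Figures from the post: ~100 us typical GPU kernel, >= 1 ms minimum
# effective Sleep() on Windows. The saturation model is an assumption.

KERNEL_US = 100      # typical GPU kernel length, microseconds
MIN_SLEEP_US = 1000  # minimum effective Sleep() duration, microseconds

def gpu_busy_fraction(kernel_us, sleep_us, tasks=1):
    """Fraction of wall time the GPU is fed when each of `tasks`
    interleaved tasks sleeps for `sleep_us` after every kernel."""
    # At most `tasks` kernels can overlap one sleep window; cap at saturation.
    busy = min(tasks * kernel_us, sleep_us + kernel_us)
    return busy / (sleep_us + kernel_us)

print(f"1 VLAR task : GPU fed ~{gpu_busy_fraction(KERNEL_US, MIN_SLEEP_US):.0%} of the time")
print(f"3 VLAR tasks: GPU fed ~{gpu_busy_fraction(KERNEL_US, MIN_SLEEP_US, 3):.0%} of the time")
```

Under these assumed numbers a single VLAR task keeps the GPU fed less than 10% of the time, which is consistent with the advice to run more tasks at once or to make the kernels themselves bigger (larger -sbs, smaller -period_iterations_num).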
Grant (SSSF) Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304
> So the possible issue is that even the biggest kernel is smaller than the minimal Sleep() duration. If that's true, then increasing the number of simultaneous tasks (up to the GPU memory limit) would help both GPU load and throughput on VLARs. Unfortunately, this will also increase switching overhead for all tasks, non-VLAR included. So I would expect a decrease in throughput for non-VLARs in such a config (how strong depends on the GPU architecture; prohibitively big starting from 2 tasks per GPU for pre-Fermi, for example).

OK. So that explains why shorties run much slower on my Maxwell cards than they did on my previous cards (GTX 460 & GTX 560 Ti). And it explains why running 3 VLARs at a time on my GTX 750 Ti gives the best throughput per hour, yet when running shorties the increase in run time is so extreme that even a few shorties result in less work being done per hour. Hence 2 at a time is optimal for my cards with the present CUDA application.
Grant
Darwin NT
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
The CUDA runtime (just like the OpenCL runtime for AMD, btw) uses a different default method of synching with the GPU. Also, the CUDA runtime (unlike the OpenCL runtime) has a control call that allows that default to be changed if needed. With the OpenCL runtime on nVidia we are bound to the single way offered. Why the defaults were chosen differently for the CUDA and OpenCL runtimes by nVidia's engineers is unknown to me. Deliberate sabotage by the nVidia marketing department is just one of the possibilities ;)
Harri Liljeroos Joined: 29 May 99 Posts: 4070 Credit: 85,281,665 RAC: 126
If we had different SETI applications for Arecibo MBs and Green Bank Telescope MBs (they could still be the same executables, just with different names) we would have the possibility of applying different settings via app_config and command-line parameters. Another thing I do not like is that GBT work is slow and does not pay any extra credit for it, so it results in a lower amount of work being done on other projects using the same resources (GPUs); i.e. for SETI to get its share done it needs more time to achieve it, and this reduces the time available for other projects.
Siran d'Vel'nahr Joined: 23 May 99 Posts: 7379 Credit: 44,181,323 RAC: 238
> normal wus take about 22 mins, guppis about 35-45 mins.

Greetings Grant, Pardon me while I make an observation here: I see the above statements as expressing regression rather than progression. Keep on BOINCing... :)
CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
rob smith Joined: 7 Mar 03 Posts: 22190 Credit: 416,307,556 RAC: 380
Each major iteration of the MB application has seen an increase in the amount of calculation performed. With that, and the increased complexity of the calculations required for the guppi data, it is hardly surprising that these take longer to run.
Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe?
Mr. Kevvy Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319
> Each major iteration of the MB application has seen an increase in the amount of calculation performed. With that, and the increased complexity of the calculations required for the guppi data, it is hardly surprising that these take longer to run.

That isn't the issue here... GUPPIs don't do any more calculation or take any longer to run on CPUs than Arecibo VHARs; they usually end up a little faster. They only run very slowly on GPUs due to architecture issues.
Mike Joined: 17 Feb 01 Posts: 34257 Credit: 79,922,639 RAC: 80
> Each major iteration of the MB application has seen an increase in the amount of calculation performed. With that, and the increased complexity of the calculations required for the guppi data, it is hardly surprising that these take longer to run.

No, a 980, for example, doesn't take longer to process a guppi than my AMD R9 380 if set up correctly. Maybe a few more lags, but those can be reduced with -period_iterations_num.
With each crime and every kindness we birth our future.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Mr. Kevvy's point is accurate if you assume he is talking about Arecibo VLARs and NVidia GPUs.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> Mr. Kevvy's point is accurate if you assume he is talking about Arecibo VLARs and NVidia GPUs.

And CPUs, for which the pulsefinding algorithm was Eric K's seminal work in the first place (implementations since optimised by Alex Kan, Joe Segur, Ben Herndon and others, over a span of more than a decade).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Pictures of v8 performance will be posted here: http://lunatics.kwsn.info/index.php/topic,1806.0.html. For now, one can refresh memories of how it was with the older apps.
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
> If we could treat these Guppie VLAR's separately from normal Arecibo work just like Astropulse work then I'd be all for it.

. . That would be a good start. But I am getting good results with normal (Arecibo) MB WUs under SoG. And if you are willing to sacrifice the use of one CPU core (or more if you want to run multiples), then SoG gives better results with Guppis than CUDA does. It just loses ground when running with Sleep ON to preserve the CPU core.
. . I think I am about to try some different combinations to test the limits.
Rasputin42 Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0
Any speed advantage gained by SoG is futile if a whole core has to support it.
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
. . Mea culpa.
. . I didn't include much detail, but it was a throw-away observation of the moment. I hope my extended reply clarified things for you.
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
> Is there a setting that I can tweak to persuade the Guppi WUs to truly run concurrently and behave as the nonVLAR WUs do?

. . How much memory is enough?
. . I have a 2GB GTX 950, and I am thinking that with -sbs 256 running triples would require only 768MB; would that be correct? Then would using -sbs 512 take 1.5GB but still be possible on this card?
. . And which is most likely to achieve a positive result?
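The arithmetic in the question works out under the simple assumption that -sbs sets the main per-task buffer in MB and that memory scales linearly with the number of tasks; real tasks allocate additional working buffers, so actual usage runs somewhat higher than this lower bound. A sketch:

```python
# Lower-bound estimate of GPU memory for N concurrent tasks, assuming
# -sbs N sets the main per-task buffer size in MB. Real tasks allocate
# extra working buffers on top of this.

def total_buffer_mb(sbs_mb, tasks):
    """Main-buffer memory, in MB, for `tasks` concurrent tasks at -sbs sbs_mb."""
    return sbs_mb * tasks

CARD_MB = 2048  # GTX 950 with 2 GB

for sbs in (256, 512):
    need = total_buffer_mb(sbs, 3)
    verdict = "fits" if need < CARD_MB else "does not fit"
    print(f"-sbs {sbs} x 3 tasks -> {need} MB of main buffers ({verdict} in {CARD_MB} MB)")
```

So three tasks at -sbs 256 need at least 768 MB and three at -sbs 512 at least 1536 MB; the latter leaves only ~0.5 GB headroom on a 2 GB card for the extra buffers, which may be tight.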
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.