No more guppi's=vlars on the gpu please

Message boards : Number crunching : No more guppi's=vlars on the gpu please

Previous · 1 · 2 · 3 · 4 · 5 · Next

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13745
Credit: 208,696,464
RAC: 304
Australia
Message 1793929 - Posted: 6 Jun 2016, 11:28:16 UTC - in response to Message 1793924.  

Any speed advantage gained by SOG is futile if a whole core has to support it.

Is it?
On my system it was worth allocating 4 CPU cores to my 2 GTX 750 Tis, each running 2 WUs at a time, as the improved GPU performance offset the loss of CPU output.

Maybe it's the same for SOG, maybe it's not. Only one way to find out: try it. With sleep, without sleep; settings tweaked, settings not tweaked; cores reserved, cores not reserved.
Grant
Darwin NT
ID: 1793929
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1793931 - Posted: 6 Jun 2016, 11:32:06 UTC - in response to Message 1793929.  

Compared to 4 CUDA tasks + 4 CPU tasks (in your case)?
ID: 1793931
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793932 - Posted: 6 Jun 2016, 11:32:38 UTC - in response to Message 1793635.  
Last modified: 6 Jun 2016, 11:33:58 UTC

But it seems strange to me that the one application behaves so contrarily in dealing with the two different types of WU. With nonVLAR it combines them and fully utilises the GPU, but with Guppis it does almost the opposite.

Can you post GPU load pictures for those cases? Link to host?
Anyone else observe such behavior?



. . I have a screenshot of present performance and I will happily rerun the triple Guppi trials to get some for that condition. But I need to be informed on how to paste them into this message board. I have tried twice before but failed both times.

. . Likewise I am ignorant of how to paste links in here.
ID: 1793932
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793934 - Posted: 6 Jun 2016, 11:34:16 UTC - in response to Message 1793928.  

Is there a setting that I can tweak to persuade the Guppi WU's to truly run concurrently and behave as the nonVLAR WUs do?

Try to add -sbs 256 or -sbs 512 if you have enough memory.


. . How much memory is enough?

. . I have a 2GB GTX 950, and I am thinking that with -sbs 256, running triples would require only 768MB; would that be correct? Then would using -sbs 512 take 1.5GB but still be possible on this card?

. . And which is most likely to achieve a positive result?

From the tests I've seen so far (mostly for AMD, actually; the much bigger NV community as a whole seems stronger in whining than in precise benchmarking and results sharing, and I have no compatible NV hardware at all (!) :/ ), -sbs 512 gives little to no additional advantage over -sbs 256. But decreasing the number of iterations from, say, 50 to 10 will give a roughly 5-times bigger kernel launch that can keep the GPU busy while the app's process is sleeping.
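Stephen's memory arithmetic above can be sanity-checked with a quick sketch. Note the assumption (mine, not from the thread): it treats the -sbs value as roughly the per-task working-buffer size in MB and ignores the app's other allocations, which is exactly why some headroom should be reserved.

```python
# Rough VRAM sizing check. Assumption (not from the thread): each task's
# main working buffer is about the -sbs value in MB; the app allocates
# other buffers too, so leave some headroom.
def fits_in_vram(sbs_mb, n_tasks, card_mb, headroom_mb=256):
    """True if n_tasks simultaneous tasks should fit on the card."""
    return n_tasks * sbs_mb + headroom_mb <= card_mb

# 2 GB GTX 950 running triples, as in the question:
print(fits_in_vram(256, 3, 2048))  # 3*256 =  768 MB -> True, fits easily
print(fits_in_vram(512, 3, 2048))  # 3*512 = 1536 MB -> True, but little room left
```

On these assumptions both of Stephen's configurations fit in 2 GB, but -sbs 512 with triples leaves very little slack for the app's remaining buffers.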
ID: 1793934
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13745
Credit: 208,696,464
RAC: 304
Australia
Message 1793936 - Posted: 6 Jun 2016, 11:37:04 UTC - in response to Message 1793931.  

Compared to 4 CUDA tasks + 4 CPU tasks (in your case)?

Running 2 CUDA tasks on 2 GPUs with the -poll option, 1 core reserved for each WU, and 4 CPU WUs produces more work per hour than just running 2 WUs on 2 GPUs without the poll option and 8 CPU WUs. Not a lot, about an extra 0.5 WUs per hour, but it adds up.
Grant
Darwin NT
ID: 1793936
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1793937 - Posted: 6 Jun 2016, 11:41:22 UTC

-poll option

What does that do?
ID: 1793937
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793939 - Posted: 6 Jun 2016, 11:43:04 UTC - in response to Message 1793936.  

Compared to 4 CUDA tasks + 4 CPU tasks (in your case)?

Running 2 CUDA tasks on 2 GPUs with the -poll option, 1 core reserved for each WU, and 4 CPU WUs produces more work per hour than just running 2 WUs on 2 GPUs without the poll option and 8 CPU WUs. Not a lot, about an extra 0.5 WUs per hour, but it adds up.

8 real cores or hyperthreaded?
ID: 1793939
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793941 - Posted: 6 Jun 2016, 11:44:28 UTC - in response to Message 1793937.  

-poll option

What does that do?

It changes the CUDA runtime's sync behaviour to spin-wait on the CPU, as the OpenCL runtime does.
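As an illustration only (a Python stand-in, not the app's actual CUDA code): the difference between a blocking wait and the spin-wait that -poll switches to looks roughly like this.

```python
# Sketch of the two sync strategies. A blocking wait yields the CPU but
# wakes with scheduler latency; a spin-wait busy-polls a flag, burning a
# full core but reacting almost immediately -- which is why -poll costs
# a CPU core per task.
import threading

done = threading.Event()

def blocking_wait():
    done.wait()                 # thread sleeps until the "GPU" signals

def spin_wait():
    while not done.is_set():    # busy-poll: ~100% of one CPU core
        pass

done.set()          # pretend the GPU work has finished
blocking_wait()     # returns immediately; the event is already set
spin_wait()
```

The trade-off in the thread follows directly: the spin-wait keeps the GPU fed with minimal latency, at the price of a whole core doing nothing but polling.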
ID: 1793941
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1793944 - Posted: 6 Jun 2016, 12:04:08 UTC - in response to Message 1793932.  
Last modified: 6 Jun 2016, 12:04:26 UTC

. . Likewise I am ignorant of how to paste links in here.

Every time you post to this message board, there's a link

http://setiathome.berkeley.edu/bbcode.php

above and to the left of your text entry. If you click that, it opens in a new window/tab, so you don't lose your place.

It involves putting tags in [square brackets] around your text - many of the common ones can be applied by using the buttons above the text entry area.
ID: 1793944
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793945 - Posted: 6 Jun 2016, 12:09:27 UTC - in response to Message 1793638.  
Last modified: 6 Jun 2016, 12:10:35 UTC



. . These guppies are very contrary critters.

In addition to -sbs 256 or 512, if you don't experience lags or can tolerate them, try setting this option:
-period_iterations_num 1
(if the lags are too big, one can increase the value until they become tolerable. The default is 50 [500 for the low-performance path], so there is plenty of room for tuning that way)

It seems the issue with VLARs not benefiting from simultaneous tasks is, again, the increased share of PulseFind (with the lowest FFT sizes). PulseFind on the lowest FFT sizes is the longest kernel. That's why it can be "slept away" by the clumsy Windows Sleep() call (consider a typical GPU kernel length of ~100 us against a minimal (!) sleep time of 1 ms and a quantum size of 20 ms).
But if almost all the work consists of such kernels, each task will go to sleep and the GPU will not be fed again.

So the possible issue is that even the biggest kernel is smaller than the minimal Sleep() duration. If that's true, then increasing the number of simultaneous tasks (up to the GPU memory limit) would help with both GPU load and throughput on VLARs. Unfortunately, this will increase switching overhead for all tasks, non-VLAR included. So I would expect a decrease in throughput for non-VLARs in such a config (how strong depends on GPU architecture; prohibitively big starting from 2 tasks per GPU for pre-Fermi, for example).

Another way is to make the kernels "under sleep" bigger.
This can be done by increasing the -sbs N value and not splitting kernels across several calls (that is, decreasing the -period_iterations_num N value).

Try these approaches.

P.S. In view of this theory, running VLAR + non-VLAR simultaneously should give the best throughput.


. . Not being a programmer, I am having trouble following some of that, so let's sneak up on this thing one step at a time. First let's see what effect setting the -sbs value has.

. . As I understand it, after trying different -sbs values I should try combining a larger -sbs N value, such as 512, with a lower -period_iterations_num N value, such as 1. Is this while still running 3 simultaneous WUs?

. . When it comes to the effect of FFT size I am lost, so let's deal with that further down the track.
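Raistmer's timing argument above reduces to simple arithmetic (the figures are taken from his post; this is a back-of-the-envelope sketch, not a measurement):

```python
# If each ~100 us kernel is followed by a Sleep() that cannot be shorter
# than ~1 ms, the GPU can be busy at most ~9% of the time.
kernel_us = 100        # typical GPU kernel length (from the post)
min_sleep_us = 1000    # minimal Windows Sleep() duration (from the post)

busy = kernel_us / (kernel_us + min_sleep_us)
print(f"busy fraction: {busy:.1%}")        # ~9.1%

# Launching a ~5x bigger kernel (iterations 50 -> 10) raises the ceiling:
busy_big = (5 * kernel_us) / (5 * kernel_us + min_sleep_us)
print(f"with 5x kernels: {busy_big:.1%}")  # ~33.3%
```

This is why the two suggested remedies (more simultaneous tasks, or bigger kernels via -sbs and fewer iterations) both attack the same bottleneck: the sleep interval dwarfs the kernel it follows.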
ID: 1793945
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793947 - Posted: 6 Jun 2016, 12:16:31 UTC - in response to Message 1793685.  

Each major iteration of the MB application has seen an increase in the amount of calculation performed. With that and the increased complexity of the calculations required for the guppi data, it is hardly surprising that these take longer to run.


That isn't the issue here... GUPPIs don't do any more calculation or take any longer to run on CPUs than Arecibo VHARs; they usually end up a little faster. They only run very slowly on GPUs due to architecture issues.


No, a 980 for example doesn't take longer to process a guppi than my AMD R9 380 if set up correctly.
Maybe a few more lags, but those can be reduced with -period_iterations_num.



. . But I suspect they both take a lot longer to process than "normal" Arecibo WUs. I think that is the point. And CUDA, well ....
ID: 1793947
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793949 - Posted: 6 Jun 2016, 12:19:50 UTC - in response to Message 1793945.  


. . As I understand it, after trying different -sbs values I should try combining a larger -sbs N value, such as 512, with a lower -period_iterations_num N value, such as 1. Is this while still running 3 simultaneous WUs?

Yes, still multitasking.


. . When it comes to the effect of FFT size I am lost, so let's deal with that further down the track.

To understand that, one should perhaps read the original processing algorithm.
ID: 1793949
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793951 - Posted: 6 Jun 2016, 12:24:57 UTC - in response to Message 1793728.  

Here http://lunatics.kwsn.info/index.php/topic,1806.0.html there will be pictures of v8 performance. For now, one can refresh one's memory of how it was with the older apps.



. . Too much information on that graph, my head is spinning :)

. . But the best I can make out from it is that the relationship between VHARs, normal WUs and VLARs has been pretty constant over the different incarnations of BOINC, and across hardware platforms.
ID: 1793951
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793954 - Posted: 6 Jun 2016, 12:29:07 UTC - in response to Message 1793929.  

Any speed advantage gained by SOG is futile if a whole core has to support it.

Is it?
On my system it was worth allocating 4 CPU cores to my 2 GTX 750 Tis, each running 2 WUs at a time, as the improved GPU performance offset the loss of CPU output.

Maybe it's the same for SOG, maybe it's not. Only one way to find out: try it. With sleep, without sleep; settings tweaked, settings not tweaked; cores reserved, cores not reserved.



. . Are you a fortune teller? :)

. . Have you installed 0.45 Beta yet?

. . It definitely needs the use of the CPU cores.
ID: 1793954
Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1793955 - Posted: 6 Jun 2016, 12:33:27 UTC - in response to Message 1793934.  

Is there a setting that I can tweak to persuade the Guppi WU's to truly run concurrently and behave as the nonVLAR WUs do?

Try to add -sbs 256 or -sbs 512 if you have enough memory.


. . How much memory is enough?

. . I have a 2GB GTX 950, and I am thinking that with -sbs 256, running triples would require only 768MB; would that be correct? Then would using -sbs 512 take 1.5GB but still be possible on this card?

. . And which is most likely to achieve a positive result?

From the tests I've seen so far (mostly for AMD, actually; the much bigger NV community as a whole seems stronger in whining than in precise benchmarking and results sharing, and I have no compatible NV hardware at all (!) :/ ), -sbs 512 gives little to no additional advantage over -sbs 256. But decreasing the number of iterations from, say, 50 to 10 will give a roughly 5-times bigger kernel launch that can keep the GPU busy while the app's process is sleeping.


-sbs 384 gives the best result on my R9 380.


With each crime and every kindness we birth our future.
ID: 1793955
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793956 - Posted: 6 Jun 2016, 12:35:40 UTC - in response to Message 1793934.  

Is there a setting that I can tweak to persuade the Guppi WU's to truly run concurrently and behave as the nonVLAR WUs do?

Try to add -sbs 256 or -sbs 512 if you have enough memory.


. . How much memory is enough?

. . I have a 2GB GTX 950, and I am thinking that with -sbs 256, running triples would require only 768MB; would that be correct? Then would using -sbs 512 take 1.5GB but still be possible on this card?

. . And which is most likely to achieve a positive result?

From the tests I've seen so far (mostly for AMD, actually; the much bigger NV community as a whole seems stronger in whining than in precise benchmarking and results sharing, and I have no compatible NV hardware at all (!) :/ ), -sbs 512 gives little to no additional advantage over -sbs 256. But decreasing the number of iterations from, say, 50 to 10 will give a roughly 5-times bigger kernel launch that can keep the GPU busy while the app's process is sleeping.



. . So the first step then is -sbs 256 -period_iterations_num 1
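For reference, that first step would look something like the line below. Hedged: the exact file the options go in (e.g. an mb_cmdline*.txt file next to the app, or a <cmdline> entry in app_info.xml) varies by installation and build, so treat this as a sketch rather than a definitive recipe.

```text
-sbs 256 -period_iterations_num 1
```

If the screen lags are too severe with 1, the earlier advice applies: raise -period_iterations_num until they become tolerable.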
ID: 1793956
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793958 - Posted: 6 Jun 2016, 12:53:20 UTC - in response to Message 1793944.  

. . Likewise I am ignorant of how to paste links in here.

Every time you post to this message board, there's a link

http://setiathome.berkeley.edu/bbcode.php

above and to the left of your text entry. If you click that, it opens in a new window/tab, so you don't lose your place.

It involves putting tags in [square brackets] around your text - many of the common ones can be applied by using the buttons above the text entry area.



. . I have mastered the use of the common functions like quoting and bold text, but when I use the [img] tag and try to paste an image in there I get nothing. And where do I find the URL for my host details?

. . The only URL I see from this page is in the browser address window.

. . Hold everything!


http://setiathome.berkeley.edu/show_host_detail.php?hostid=8012534

. . Now did it work ? :)
ID: 1793958
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793959 - Posted: 6 Jun 2016, 12:55:38 UTC - in response to Message 1793958.  



It involves putting tags in [square brackets] around your text - many of the common ones can be applied by using the buttons above the text entry area.



. . I have mastered the use of the common functions like quoting and bold text, but when I use the [img] tag and try to paste an image in there I get nothing. And where do I find the URL for my host details?

. . The only URL I see from this page is in the browser address window.

. . Hold everything!


http://setiathome.berkeley.edu/show_host_detail.php?hostid=8012534

. . Now did it work ? :)


. . Yes :)
ID: 1793959
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793960 - Posted: 6 Jun 2016, 12:58:51 UTC - in response to Message 1793932.  


Can you post GPU load pictures for those cases? Link to host?
Anyone else observe such behavior?



. . I have a screenshot of present performance and I will happily rerun the triple Guppi trials to get some for that condition. But I need to be informed on how to paste them into this message board. I have tried twice before but failed both times.

. . Likewise I am ignorant of how to paste links in here.


. . OK I am learning ....

http://setiathome.berkeley.edu/show_host_detail.php?hostid=8012534

. . Now I only have to work out how to insert a graphic/image
ID: 1793960
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793961 - Posted: 6 Jun 2016, 13:01:50 UTC - in response to Message 1793949.  


. . As I understand it, after trying different -sbs values I should try combining a larger -sbs N value, such as 512, with a lower -period_iterations_num N value, such as 1. Is this while still running 3 simultaneous WUs?

Yes, still multitasking.


. . When it comes to the effect of FFT size I am lost, so let's deal with that further down the track.

To understand that, one should perhaps read the original processing algorithm.



. . I suspect I would not understand it; I am not a mathematics professor :)
ID: 1793961

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.