No more guppi's=vlars on the gpu please

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793962 - Posted: 6 Jun 2016, 13:05:04 UTC - in response to Message 1793955.  


From the tests I have seen so far (mostly on AMD, actually; the much bigger NV community as a whole seems stronger at whining than at precise benchmarking and sharing results, and I have no compatible NV hardware at all (!) :/ ), -sbs 512 gives little to no additional advantage over -sbs 256. But decreasing the number of iterations from 50 to 10, for example, will give a roughly 5 times bigger kernel launch that could keep the GPU busy while the app's process is sleeping.


-sbs 384 gives best result on my R9 380.



. . OK I will bear that in mind.
ID: 1793962 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1793990 - Posted: 6 Jun 2016, 15:52:25 UTC - in response to Message 1793949.  
Last modified: 6 Jun 2016, 16:36:24 UTC


. . As I understand it after trying different -sbs values, I should try combining a larger -sbs N value such as 512 with a lower -period_iterations_num N value such as 1. Is this while still running 3 simultaneous WUs?

yes, still multitasking.


. . When it comes to the effect of FFT size I am lost, so let's deal with that further down the track.

To understand that, one should perhaps read up on the original processing algorithm.



. . HOORAY!

. . It is now running with -sbs 256 -period_iterations_num 1.

. . You have hit the nail on the head, the GPU utilisation is now matching that of nonVLAR tasks, though it is maxing out a little more and screen lag has become intrusive. I would be happy if we can tweak it a little more to make it smoother :).

. . Should I try Mike's suggestion and try -sbs 384?

. . I only wish I had taken screenshots of when I was running the Guppis as singles and triples without the tweaks.
ID: 1793990 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1793994 - Posted: 6 Jun 2016, 16:03:50 UTC - in response to Message 1793990.  
Last modified: 6 Jun 2016, 16:07:33 UTC

. . HOORAY!

. . It is now running with -sbs 256 -period_iterations_num 1.

. . You have hit the nail on the head, the GPU utilisation is now matching that of nonVLAR tasks, though it is maxing out a little more and screen lag has become intrusive. I would be happy if we can tweak it a little more to make it smoother :).

. . Should I try Mike's suggestion and try -sbs 384?


Yes, try Mike's suggestion, and if the lag is bothering you, you can add -use_sleep. It should only add about 1-2 minutes to total run time.

Edit..

Might want to add -hp also
ID: 1793994 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1794013 - Posted: 6 Jun 2016, 16:57:36 UTC - in response to Message 1793994.  
Last modified: 6 Jun 2016, 17:01:46 UTC

. . HOORAY!

. . It is now running with -sbs 256 -period_iterations_num 1.

. . You have hit the nail on the head, the GPU utilisation is now matching that of nonVLAR tasks, though it is maxing out a little more and screen lag has become intrusive. I would be happy if we can tweak it a little more to make it smoother :).

. . Should I try Mike's suggestion and try -sbs 384?


Yes, try Mike's suggestion, and if the lag is bothering you, you can add -use_sleep. It should only add about 1-2 minutes to total run time.

Edit..

Might want to add -hp also


. . I am already running with sleep ON :)

. . I have set it to -sbs 384 and lag has improved a little so I might try 512, though I think that will take all of the GPU memory.

. . What does -hp do? It's OK I looked it up, GPU load is now close to 100% so I do not think I need that.

. . BTW, can you tell me how to insert a graphic/image into a message?
ID: 1794013 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1794026 - Posted: 6 Jun 2016, 17:42:29 UTC - in response to Message 1793937.  

-poll option

What does that do?

I've searched the forums and the BOINC FAQs and I cannot find just WHERE you implement this -poll option. Is it in the app_config or app_info?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1794026 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794028 - Posted: 6 Jun 2016, 17:45:35 UTC - in response to Message 1794026.  

-poll option

What does that do?

I've searched the forums and the BOINC FAQs and I cannot find just WHERE you implement this -poll option. Is it in the app_config or app_info?

Because it's a command line switch, it should go in the same place any other command line switch goes.

<cmdline> tag
ID: 1794028 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1794051 - Posted: 6 Jun 2016, 18:30:03 UTC - in response to Message 1793638.  



. . These guppies are very contrary critters.

In addition to -sbs 256 or 512, if you don't experience lags or can tolerate them, try setting this option:
-period_iterations_num 1
(if the lags are too big, one can increase the value until they become tolerable. The default is 50 [500 for the low-performance path], so there is plenty of room for tuning that way)

It seems the reason VLARs do not benefit from simultaneous tasks is, again, the increased share of PulseFind (at the lowest FFT sizes). PulseFind at the lowest FFT sizes is the longest kernel. That's why it can be "slept away" with the clumsy Windows Sleep() call (consider a typical GPU kernel length of ~100 us, a minimal (!) sleep time of 1 ms and a quantum size of 20 ms).
But if almost all of the work consists of such kernels, each task will go to sleep and the GPU will not be fed again.

So, the possible issue is that even the biggest kernel is shorter than the minimal Sleep() duration. If that's true, then increasing the number of simultaneous tasks (up to the GPU memory limit) would help with both GPU load and throughput on VLARs. Unfortunately, this will increase the switching overhead for all tasks, non-VLAR included. So I would expect a decrease in throughput for non-VLARs in such a config (how strong depends on the GPU architecture; prohibitively big starting from 2 tasks per GPU for pre-Fermi, for example).

Another way is to make the kernels "under sleep" bigger.
This can be done by increasing the -sbs N value and not splitting the kernel into several calls (that is, by decreasing the -period_iterations_num N value).

Try these approaches.

P.S. In view of this theory, running VLAR + non-VLAR simultaneously will give the best throughput.



. . I tried the suggestion Mike made and used -sbs 384 but that took up most of the graphics memory and did not seem to help the lag at all so now I am back to 256. Current settings are :-

. . -use_sleep_ex 2 -sbs 256 -period_iterations_num 2

. . Will see how that runs, the card is now running a mixture of VLAR/nonVLAR.
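
. . To put rough numbers on the sleep theory quoted above, here is a minimal sketch of how kernel length and the minimum Sleep() duration could bound GPU load. It is a toy model, not how the app actually schedules work: the function name gpu_busy_fraction, the idea that each task launches one kernel and then sleeps in whole 1 ms steps, and the assumption that kernels from different tasks run back to back are all mine. Only the ~100 us kernel length and the 1 ms minimum sleep come from the quoted post.

```python
import math


def gpu_busy_fraction(kernel_us, min_sleep_us=1000.0, tasks=1):
    """Toy estimate of GPU load under sleep-based waiting.

    Assumptions (hypothetical, for illustration only):
    - each task launches one kernel of kernel_us, then sleeps in whole
      steps of min_sleep_us until the kernel has finished;
    - kernels from different tasks run one after another on the GPU.
    """
    # A task can only relaunch after a whole number of sleep steps.
    period_us = math.ceil(kernel_us / min_sleep_us) * min_sleep_us
    # Total kernel time queued per period, capped at 100 % load.
    busy_us = min(tasks * kernel_us, period_us)
    return busy_us / period_us


if __name__ == "__main__":
    for kernel_us, tasks in [(100, 1), (100, 3), (500, 3)]:
        load = gpu_busy_fraction(kernel_us, tasks=tasks)
        print(f"kernel {kernel_us} us, {tasks} task(s): {load:.0%} busy")
```

. . In this model a 100 us kernel leaves the GPU mostly idle even with 3 tasks, while a kernel 5 times bigger (the kind of change a lower -period_iterations_num is said to give) fills the gap. Real numbers will differ, so treat it only as a way to see why bigger kernels or more tasks might help.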
ID: 1794051 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1794053 - Posted: 6 Jun 2016, 18:33:09 UTC - in response to Message 1794026.  

-poll option

What does that do?

I've searched the forums and the BOINC FAQs and I cannot find just WHERE you implement this -poll option. Is it in the app_config or app_info?



. . Try mb_cmdline_win_x86_SSE3_OpenCL_NV.txt
ID: 1794053 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794055 - Posted: 6 Jun 2016, 18:47:12 UTC - in response to Message 1794053.  

-poll option

What does that do?

I've searched the forums and the BOINC FAQs and I cannot find just WHERE you implement this -poll option. Is it in the app_config or app_info?



. . Try mb_cmdline_win_x86_SSE3_OpenCL_NV.txt

Wrong. The -poll option belongs to the CUDA app, not the OpenCL one.
ID: 1794055 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1794059 - Posted: 6 Jun 2016, 19:02:57 UTC - in response to Message 1794055.  

-poll option

What does that do?

I've searched the forums and the BOINC FAQs and I cannot find just WHERE you implement this -poll option. Is it in the app_config or app_info?



. . Try mb_cmdline_win_x86_SSE3_OpenCL_NV.txt

Wrong. The -poll option belongs to the CUDA app, not the OpenCL one.

Thanks for pointing that out. I thought it applied to all GPU tasks. I only run OpenCL now. So moot datum.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1794059 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1794100 - Posted: 6 Jun 2016, 22:22:26 UTC - in response to Message 1793939.  

Compared to 4 CUDA tasks + 4 cpu tasks(in your case) ?

Running 2 CUDA tasks on 2 GPUs with the -poll option and 1 core reserved for each WU and 4 CPU WUs produces more work per hour than just running 2 WUs on 2 GPUs without the poll option & 8 CPU WUs. Not a lot, about an extra 0.5 WUs per hour. But it adds up.

8 real cores or hyperthreaded?

Hyperthreaded.
Grant
Darwin NT
ID: 1794100 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65738
Credit: 55,293,173
RAC: 49
United States
Message 1794146 - Posted: 7 Jun 2016, 1:36:27 UTC
Last modified: 7 Jun 2016, 1:38:31 UTC

Ok I'm wondering, what is the minimum OpenCL for Sleep on SoG?

My cards support OpenCL 1.1 only, I don't have any 600-1000 cards.

Pegasus

CUDA: NVIDIA GPU 0: GeForce GTX 580 (driver version 353.06, CUDA version 7.5, compute capability 2.0, 1536MB, 1195MB available, 1755 GFLOPS peak)
OpenCL: NVIDIA GPU 0: GeForce GTX 580 (driver version 353.06, device version OpenCL 1.1 CUDA, 1536MB, 1195MB available, 1755 GFLOPS peak)


And no I have not run the PNY LC GTX 580 card above 857MHz, which is stock for this card, could I hit 901MHz?
I don't know, I'd need to do some research on that.

The cpu is an i7 3820 and it is Hyperthreaded, but then it runs on an Asus Rampage IV Extreme, bios 4901.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1794146 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1794147 - Posted: 7 Jun 2016, 1:51:36 UTC - in response to Message 1794013.  
Last modified: 7 Jun 2016, 2:11:28 UTC

. . BTW, can you tell me how to insert a graphic/image into a message?

Example:

1) Go to:
http://postimage.org/

Upload your file (for programs' windows the best format is .png)

2) Copy/Paste the "Direct Link" here, in my case it was:
http://s33.postimg.org/xv9h331wv/ATI_Memory_Viewer_07_06_2016.png

3) Mark/select the whole line ([Home], Shift+[End] on the keyboard)

4) Click [Img] button here, you will get:
[img]http://s33.postimg.org/xv9h331wv/ATI_Memory_Viewer_07_06_2016.png[/img]


5) Use the [Preview] button to see if all is correct.

And you will get:




P.S.
- To see the 'code' of my post - use the [Quote] button under it and read the raw text.

- The original filename was:
ATI MemoryViewer - 07.06.2016.png
It was changed automatically by postimage.org to ATI_Memory_Viewer_07_06_2016.png

- On the first usage of postimage.org you have to select "FAMILY safe" before [Upload It!]
(it is remembered for the next visits)
 
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1794147 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1794159 - Posted: 7 Jun 2016, 3:56:21 UTC - in response to Message 1794055.  

-poll option

What does that do?

I've searched the forums and the BOINC FAQs and I cannot find just WHERE you implement this -poll option. Is it in the app_config or app_info?



. . Try mb_cmdline_win_x86_SSE3_OpenCL_NV.txt

Wrong. The -poll option belongs to the CUDA app, not the OpenCL one.



. . My mistake sorry
ID: 1794159 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1794162 - Posted: 7 Jun 2016, 4:07:38 UTC - in response to Message 1794147.  

. . BTW, can you tell me how to insert a graphic/image into a message?


Example:

1) Go to:
http://postimage.org/

5) Use the [Preview] button to see if all is correct.

- To see the 'code' of my post - use the ["Quote"] button under it and read the raw text.

- On the first usage of postimage.org you have to select "FAMILY safe" before [Upload It!]
(it is remembered for the next visits)



. . I replied to another message with an embedded image and saw the URL but I did not know a suitable site to upload the image to. Thank you for that
ID: 1794162 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65738
Credit: 55,293,173
RAC: 49
United States
Message 1794166 - Posted: 7 Jun 2016, 5:07:44 UTC

Ok I decided to push the card to 940MHz, temp crunching 65-66C.

Pegasus

CUDA: NVIDIA GPU 0: GeForce GTX 580 (driver version 353.06, CUDA version 7.5, compute capability 2.0, 1536MB, 1183MB available, 1925 GFLOPS peak)
OpenCL: NVIDIA GPU 0: GeForce GTX 580 (driver version 353.06, device version OpenCL 1.1 CUDA, 1536MB, 1183MB available, 1925 GFLOPS peak)


Maybe not as much as a newer card, but very few 580 cards can match this, and any that can run near 80C. HardOCP pushed one PNY LC 580 to 950MHz; I stopped just short of that.
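
The two "GFLOPS peak" figures in the BOINC startup lines (1755 at 857MHz, 1925 at 940MHz) are consistent with the usual peak estimate of cores x 2 FLOPs per clock x shader clock. A quick check, assuming the GTX 580's 512 CUDA cores and a Fermi shader clock of twice the core clock (neither number is stated in this thread, and the function name peak_gflops is just for illustration):

```python
def peak_gflops(core_mhz, cuda_cores=512, shader_multiplier=2, flops_per_clock=2):
    """Theoretical single-precision peak in GFLOPS.

    Assumed values (not from the thread): 512 CUDA cores, shader clock
    equal to 2 x core clock, and 2 FLOPs per core per clock (one FMA).
    """
    shader_ghz = core_mhz * shader_multiplier / 1000.0
    return cuda_cores * flops_per_clock * shader_ghz


if __name__ == "__main__":
    for mhz in (857, 940):
        print(f"{mhz} MHz -> {peak_gflops(mhz):.0f} GFLOPS peak")
```

This is only the theoretical ceiling, so it says nothing about how much faster the work units will actually finish.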
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1794166 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794218 - Posted: 7 Jun 2016, 9:52:24 UTC - in response to Message 1794100.  

Compared to 4 CUDA tasks + 4 cpu tasks(in your case) ?

Running 2 CUDA tasks on 2 GPUs with the -poll option and 1 core reserved for each WU and 4 CPU WUs produces more work per hour than just running 2 WUs on 2 GPUs without the poll option & 8 CPU WUs. Not a lot, about an extra 0.5 WUs per hour. But it adds up.

8 real cores or hyperthreaded?

Hyperthreaded.

For a hyperthreaded device it could indeed be easier to sacrifice a logical CPU.
AFAIK the usual throughput increase when going from 4 tasks per device to full device load is around ~20%. Actually, the CPU component could sometimes even gain from idling some of the virtualized CPUs (less cache contention and lower memory bus load).

-----------------------------------

And in general, whether it is worth sacrificing a CPU or not is quite a complex question that depends on the individual characteristics of the CPU, GPU, motherboard northbridge, memory controller and RAM modules.
I would say experimentation on the particular host is required, not flame wars on the boards. Some systems will function best in one config, some in another, and some can be tuned to surpass both default configs...
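
For that kind of per-host experiment, a minimal sketch of the bookkeeping could look like the following. The helper names wus_per_hour and host_throughput are hypothetical, and the run times in the example are made-up placeholders, not measurements from anyone's machine; the idea is only to plug in your own averaged run times for each config and compare the totals.

```python
def wus_per_hour(concurrent_tasks, minutes_per_task):
    """Work units finished per hour by one device class."""
    return concurrent_tasks * 60.0 / minutes_per_task


def host_throughput(device_classes):
    """Sum throughput over (concurrent_tasks, minutes_per_task) pairs,
    e.g. one pair for the GPU tasks and one for the CPU tasks."""
    return sum(wus_per_hour(n, m) for n, m in device_classes)


if __name__ == "__main__":
    # Placeholder numbers only: replace with averages from your own host.
    config_a = [(4, 20.0), (4, 120.0)]  # 4 GPU tasks + 4 CPU tasks
    config_b = [(4, 24.0), (8, 150.0)]  # 4 GPU tasks + 8 CPU tasks
    diff = host_throughput(config_a) - host_throughput(config_b)
    print(f"config A - config B = {diff:+.2f} WUs per hour")
    print(f"over a day: {diff * 24:+.1f} WUs")
```

A difference of 0.5 WUs per hour, like the one reported earlier in the thread, works out to 12 WUs per day, which is why small gains are worth measuring over many tasks rather than judging from a single run.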
ID: 1794218 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794220 - Posted: 7 Jun 2016, 9:57:07 UTC - in response to Message 1794146.  

Ok I'm wondering, what is the minimum OpenCL for Sleep on SoG?

So far all my builds are OpenCL 1.0 compatible. Due to a bug in the NV driver, not all driver versions will work with the NV build. Also, SoG will fail to build kernels on NV OpenCL 1.0 (pre-Fermi) devices. I think that is not because of 1.0 non-compliance but because NV is pretty cold towards the OpenCL part of its own driver (the root reason is money, of course, but that's a separate topic).
ID: 1794220 · Report as offensive

 