Are some gpu tasks longer now?

Message boards : Number crunching : Are some gpu tasks longer now?

Previous · 1 · 2 · 3 · Next

AuthorMessage
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794825 - Posted: 9 Jun 2016, 22:00:34 UTC - in response to Message 1794821.  

Just tell us when it's finished, please, and works out of the box.

It does. Once you learn to separate expectations from observations, constructive interaction can resume.
ID: 1794825 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1794843 - Posted: 9 Jun 2016, 22:56:58 UTC - in response to Message 1794825.  

Just tell us when it's finished, please, and works out of the box.

It does. Once you learn to separate expectations from observations, constructive interaction can resume.

It doesn't.
Something that works out of the box doesn't require the user to manually edit configuration files before it can be used without negatively impacting the rest of the system.

By all means, provide the option for even greater performance, but make sure to advise the user that doing so will greatly reduce CPU processing of work, or stop it altogether in the case of 2 & 4 core machines, and make it possible for them to finish that work before its completion is blocked.
Grant
Darwin NT
ID: 1794843 · Report as offensive
Miklos M.

Send message
Joined: 5 May 99
Posts: 955
Credit: 136,115,648
RAC: 73
Hungary
Message 1794865 - Posted: 10 Jun 2016, 0:02:19 UTC - in response to Message 1794770.  

Thank you for pointing them out. I see it here that you posted: blc4_2bit_guppi_57451_26351_HIP69732_0023.15229.0.18.27.31.vlar_1
But when I look at my pages of tasks I do not see any vlar.
ID: 1794865 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1794869 - Posted: 10 Jun 2016, 0:08:32 UTC - in response to Message 1794865.  

Thank you for pointing them out. I see it here that you posted: blc4_2bit_guppi_57451_26351_HIP69732_0023.15229.0.18.27.31.vlar_1
But when I look at my pages of tasks I do not see any vlar.

Looking at your in progress list about 1/3 of them are Guppie VLARs.
Grant
Darwin NT
ID: 1794869 · Report as offensive
Miklos M.

Send message
Joined: 5 May 99
Posts: 955
Credit: 136,115,648
RAC: 73
Hungary
Message 1794873 - Posted: 10 Jun 2016, 0:16:50 UTC - in response to Message 1794869.  

I just looked at them by NAME and found them. Thank you everyone.
ID: 1794873 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794899 - Posted: 10 Jun 2016, 1:32:09 UTC - in response to Message 1794821.  

Just tell us when it's finished, please, and works out of the box.

And to all with such an attitude:
If you want your personal expectations to be met, donate hardware for development, pay for the development of the features you want, or hire your own personal programmer. And donate your own time for testing when asked (that is, on beta and alpha).
Until then... well, misters "I know better how it should be, so do as I said or I'll not use it": your advice is not actually useful. Want to cooperate - fine. Want to waste my time reading blame and spam - I'll start a "respect the dev's time" campaign via blacklisting.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1794899 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1794905 - Posted: 10 Jun 2016, 2:00:19 UTC - in response to Message 1794899.  

I think it's a great app Raistmer.

My GPUs are chewing through the data.
ID: 1794905 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1794908 - Posted: 10 Jun 2016, 2:07:36 UTC - in response to Message 1794905.  

My GPUs are chewing through the data.

That isn't the problem. The problem is the effect it has on systems with its default stock settings.
Grant
Darwin NT
ID: 1794908 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794937 - Posted: 10 Jun 2016, 3:30:51 UTC - in response to Message 1794905.  

I think it's a great app Raistmer.

My GPUs are chewing through the data.

Thanks for the support.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1794937 · Report as offensive
Profile betreger Project Donor
Avatar

Send message
Joined: 29 Jun 99
Posts: 11360
Credit: 29,581,041
RAC: 66
United States
Message 1795105 - Posted: 10 Jun 2016, 16:23:34 UTC - in response to Message 1795033.  

Sten I thought you were only going to run Android for the summer.
ID: 1795105 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1795121 - Posted: 10 Jun 2016, 16:44:06 UTC - in response to Message 1794708.  


But I have a suspicion that the newer and larger the GPU, the greater the slowdown. I'll try and test that next time I have a gap between GPUGrid tasks on my GTX 970.


You are right. Low AR makes pulsefinding run on one SM/SMX on NVIDIA GPUs. When PoTLen == PulsePoTLen the work cannot (currently) be divided across all SM units. So the hit is 16x on a 980, 12x on a 780, 5x on a 750, etc., depending on the number of SM units on the GPU.

I have done some experimenting with my 1080 and it runs guppi VLAR units in about 200-300 seconds. But it has an issue with not finding all pulses or finding too many pulses.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1795121 · Report as offensive
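Petri's per-SM arithmetic can be sanity-checked with a toy model. This is purely illustrative: the SM counts come from the figures in his post, and the function is a hypothetical sketch of the scaling argument, not anything from the actual SETI applications.

```python
# Toy model of petri33's point: when PoTLen == PulsePoTLen the pulse
# search runs on a single SM while the rest idle, so that kernel takes
# roughly SM-count times longer than if the work could be divided.
# SM counts below match the post (GTX 980: 16, GTX 780: 12, GTX 750: 5);
# treat them as illustrative only.

SM_COUNT = {"GTX 980": 16, "GTX 780": 12, "GTX 750": 5}

def pulsefind_slowdown(gpu: str, divisible: bool) -> int:
    """Relative cost of the pulse-finding kernel.

    If the PoT can be split across SMs the cost is 1 (all SMs busy);
    if not, one SM does everything and the kernel is ~SM_COUNT slower.
    """
    return 1 if divisible else SM_COUNT[gpu]

for gpu in SM_COUNT:
    print(gpu, pulsefind_slowdown(gpu, divisible=False))
```

Note the counter-intuitive consequence he points out: the bigger the GPU, the bigger the relative hit on these work units.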
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1795136 - Posted: 10 Jun 2016, 17:02:31 UTC - in response to Message 1795121.  

But I have a suspicion that the newer and larger the GPU, the greater the slowdown. I'll try and test that next time I have a gap between GPUGrid tasks on my GTX 970.

You are right. Low AR makes pulsefinding run on one SM/SMX on NVIDIA GPUs. When PoTLen == PulsePoTLen the work cannot (currently) be divided across all SM units. So the hit is 16x on a 980, 12x on a 780, 5x on a 750, etc., depending on the number of SM units on the GPU.

I have done some experimenting with my 1080 and it runs guppi VLAR units in about 200-300 seconds. But it has an issue with not finding all pulses or finding too many pulses.

Then you might like to look at another 'suspicion' of mine. This would be much harder to demonstrate in numbers.

When two cuda50 tasks are running on the same GPU, fairly obviously, one will have started before the other - by anything between a fraction of a second and several minutes. It seems to me that the first to start consistently runs faster. This property is inheritable: when the first starter finishes, the second task becomes the 'first to start' and runs faster. A third task will start, becoming the 'second starter' for the time being, and accordingly run slowly.

I don't think that's purely the result of non-linear progress reporting (progress %age reporting moves more slowly at the start of the task), but it's easy to confuse it with that and I might have been confused. But you might consider the possibility that 'application launch order' might affect queuing, somewhere down the line.
ID: 1795136 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1795142 - Posted: 10 Jun 2016, 17:18:43 UTC - in response to Message 1795136.  
Last modified: 10 Jun 2016, 17:33:16 UTC

When two cuda50 tasks are running on the same GPU, fairly obviously, one will have started before the other - by anything between a fraction of a second and several minutes. It seems to me that the first to start consistently runs faster. This property is inheritable: when the first starter finishes, the second task becomes the 'first to start' and runs faster. A third task will start, becoming the 'second starter' for the time being, and accordingly run slowly.

I don't think that's purely the result of non-linear progress reporting (progress %age reporting moves more slowly at the start of the task), but it's easy to confuse it with that and I might have been confused. But you might consider the possibility that 'application launch order' might affect queuing, somewhere down the line.


The CUDA Handbook explains that there is only one DMA engine, so some software pipelining needs to happen if multiple threads or processes (with their own threads) want to use the device concurrently. In Petri's case he's raising efficiency and hiding latencies with CUDA streams, such that the optimum is a single instance. In my experience the latencies of the simpler model on Linux are smaller to start with. Whether or not these aspects change with Pascal & newer Linux+drivers, no idea as yet.

[Edit:] Correction: Kepler+ have two, but they have different priorities, and they are probably saturating with many small requests in baseline code + multiple instances/apps. Upping transfer sizes to over 4MiB for Fermi+, and doing some pipelining anyway, will probably improve things down the line.

Because the command buffer is shared between engines, applications must “software-pipeline” their requests in different streams...

So 'Classic' (Baseline) Cuda code is more likely to 'fight' under the demands of the new tasks.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1795142 · Report as offensive
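The benefit of the "software pipelining" in the handbook quote can be shown with back-of-the-envelope scheduling arithmetic. Everything below is a sketch under assumed, made-up timings: one sequential copy engine, one compute engine, chunked transfers. It is not CUDA code, just the latency-hiding argument in miniature.

```python
# Compare a fully serialized copy-then-compute loop against one that
# overlaps chunk i+1's copy with chunk i's compute (two engines).
# Times are arbitrary units, purely illustrative.

def serialized(n_chunks, t_copy, t_compute):
    """Copy then compute each chunk back-to-back, no overlap."""
    return n_chunks * (t_copy + t_compute)

def pipelined(n_chunks, t_copy, t_compute):
    """Overlap copies with compute; each engine is itself sequential."""
    copy_free = 0.0     # when the copy engine is next available
    compute_free = 0.0  # when the compute engine is next available
    for _ in range(n_chunks):
        copy_done = copy_free + t_copy            # copies queue up
        copy_free = copy_done
        compute_start = max(copy_done, compute_free)
        compute_free = compute_start + t_compute
    return compute_free

# 4 chunks, copy=1, compute=3: 16 units serialized vs 13 pipelined,
# i.e. all but the first chunk's copy latency is hidden.
print(serialized(4, 1.0, 3.0), pipelined(4, 1.0, 3.0))
```

With many small requests from multiple app instances, as the [Edit:] above suggests, the copy queue itself becomes the contended resource, which is why fewer, larger transfers plus pipelining should help.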
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1795150 - Posted: 10 Jun 2016, 17:32:34 UTC - in response to Message 1795121.  


But I have a suspicion that the newer and larger the GPU, the greater the slowdown. I'll try and test that next time I have a gap between GPUGrid tasks on my GTX 970.


You are right. Low AR makes pulsefinding run on one SM/SMX on NVIDIA GPUs. When PoTLen == PulsePoTLen the work cannot (currently) be divided across all SM units. So the hit is 16x on a 980, 12x on a 780, 5x on a 750, etc., depending on the number of SM units on the GPU.

I have done some experimenting with my 1080 and it runs guppi VLAR units in about 200-300 seconds. But it has an issue with not finding all pulses or finding too many pulses.

Would it be possible to make this change to the Baseline App and see if it still had problems finding the correct number of pulses? From my experience the Baseline App is very accurate and might be useful very quickly if all the SMs could be used. Right now it seems the problem with the SIGBUS Errors I was having is related to the OS. The Apps compiled in Mountain Lion don't produce any Errors when compiled with Toolkit 7.5. So, for now it appears the problem with SIGBUS Errors can be avoided.
ID: 1795150 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1795158 - Posted: 10 Jun 2016, 17:40:54 UTC - in response to Message 1795150.  


But I have a suspicion that the newer and larger the GPU, the greater the slowdown. I'll try and test that next time I have a gap between GPUGrid tasks on my GTX 970.


You are right. Low AR makes pulsefinding run on one SM/SMX on NVIDIA GPUs. When PoTLen == PulsePoTLen the work cannot (currently) be divided across all SM units. So the hit is 16x on a 980, 12x on a 780, 5x on a 750, etc., depending on the number of SM units on the GPU.

I have done some experimenting with my 1080 and it runs guppi VLAR units in about 200-300 seconds. But it has an issue with not finding all pulses or finding too many pulses.

Would it be possible to make this change to the Baseline App and see if it still had problems finding the correct number of pulses? From my experience the Baseline App is very accurate and might be useful very quickly if all the SMs could be used. Right now it seems the problem with the SIGBUS Errors I was having is related to the OS. The Apps compiled in Mountain Lion don't produce any Errors when compiled with Toolkit 7.5. So, for now it appears the problem with SIGBUS Errors can be avoided.


Possible. This weekend, for me, will involve direct comparisons between Petri's modifications and the Baseline sources, then injecting the least-risky/widest-compatibility/biggest-impact components. Whether the strange pulses are a simple precision change or a logic breakage somewhere, I won't know for a while. Either way, the logic changes Petri and I chatted about seemed headed down the right path to me, so whatever the weirdness is will likely turn up along the way.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1795158 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1795159 - Posted: 10 Jun 2016, 17:44:29 UTC - in response to Message 1795033.  
Last modified: 10 Jun 2016, 17:48:09 UTC


So far, this is the best settings for my GTX980:

-cpu_lock -sbs 512 -period_iterations_num 2 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -instances_per_device 3

Which version of the app are you using? The original:
MB8_win_x86_SSE3_OpenCL_NV_r3430_SoG.exe

or the -use_sleep accommodating one:
MB8_win_x86_SSE3_OpenCL_NV_r3430.exe
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1795159 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1795184 - Posted: 10 Jun 2016, 18:42:00 UTC - in response to Message 1795163.  


MB8_win_x86_SSE3_OpenCL_NV_r3430_SoG.exe, I thought that was clear from my post, where I mention SoG several times. MB8_win_x86_SSE3_OpenCL_NV_r3430.exe is not a SoG version.
MB8_win_x86_SSE3_OpenCL_NV_r3430_SoG.exe also has the -use_sleep option if one wants to use it.


I guess I am very confused then. So you are saying that MB8_win_x86_SSE3_OpenCL_NV_r3430_SoG.exe IS NOT a SoG app, EVEN THOUGH it ships with the <plan_class>opencl_nvidia_SoG</plan_class> in its aistub file???
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1795184 · Report as offensive
Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1795190 - Posted: 10 Jun 2016, 18:55:05 UTC - in response to Message 1795184.  


MB8_win_x86_SSE3_OpenCL_NV_r3430_SoG.exe, I thought that was clear from my post, where I mention SoG several times. MB8_win_x86_SSE3_OpenCL_NV_r3430.exe is not a SoG version.
MB8_win_x86_SSE3_OpenCL_NV_r3430_SoG.exe also has the -use_sleep option if one wants to use it.


I guess I am very confused then. So you are saying that MB8_win_x86_SSE3_OpenCL_NV_r3430_SoG.exe IS NOT a SoG app, EVEN THOUGH it ships with the <plan_class>opencl_nvidia_SoG</plan_class> in its aistub file???

I think you might have misread their post.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1795190 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1795192 - Posted: 10 Jun 2016, 19:05:34 UTC - in response to Message 1795136.  
Last modified: 10 Jun 2016, 19:05:45 UTC

...But you might consider the possibility that 'application launch order' might affect queuing, somewhere down the line.


Here is some of the detail from the Cuda handbook, that pertains specifically to Windows WDDM (Vista+ drivers):

...On WDDM, if there are applications competing for time on the same GPU, Windows can and will swap memory objects out in order to enable each application to run. The Windows operating system tries to make this as efficient as possible, but as with all paging, having it never happen is much faster than having it ever happen.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1795192 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1795193 - Posted: 10 Jun 2016, 19:10:35 UTC - in response to Message 1795136.  

But I have a suspicion that the newer and larger the GPU, the greater the slowdown. I'll try and test that next time I have a gap between GPUGrid tasks on my GTX 970.

You are right. Low AR makes pulsefinding run on one SM/SMX on NVIDIA GPUs. When PoTLen == PulsePoTLen the work cannot (currently) be divided across all SM units. So the hit is 16x on a 980, 12x on a 780, 5x on a 750, etc., depending on the number of SM units on the GPU.

I have done some experimenting with my 1080 and it runs guppi VLAR units in about 200-300 seconds. But it has an issue with not finding all pulses or finding too many pulses.

Then you might like to look at another 'suspicion' of mine. This would be much harder to demonstrate in numbers.

When two cuda50 tasks are running on the same GPU, fairly obviously, one will have started before the other - by anything between a fraction of a second and several minutes. It seems to me that the first to start consistently runs faster. This property is inheritable: when the first starter finishes, the second task becomes the 'first to start' and runs faster. A third task will start, becoming the 'second starter' for the time being, and accordingly run slowly.

I don't think that's purely the result of non-linear progress reporting (progress %age reporting moves more slowly at the start of the task), but it's easy to confuse it with that and I might have been confused. But you might consider the possibility that 'application launch order' might affect queuing, somewhere down the line.


A nice point, Richard. But I run only one at a time.
I do though have an explanation, or an educated guess.

The whole process is an alternating series of CPU and GPU work. The GPU has to finish its work and transfer the data to main memory for the CPU. Then the CPU does some post-processing. Only after finishing the post-processing does it ask for more GPU work. I have a feeling that the SoG version buffers more work and the transfers are cut to a minimum.

Explanation (guess) a): The task that started first yields GPU time to other processes at some point of processing, does its own CPU processing (or waits for a GPU-to-host [CPU] memory transfer), and is first in line to begin a new batch of GPU processing. It is (almost) always the first to submit new work to the GPU, and the later-started threads do not get the GPU time slice but have to wait instead. So the first-started process is always in the lead.

Explanation (guess) b): The other explanation is that the processing seems to go faster towards the end. My experience when running multiple instances on a GPU is that the percentage and the time-to-finish appear to move faster the nearer the task is to the end. That may be an effect of BOINC, not SETI. And if I remember correctly, there is an option to set boincmgr to a 'linear time display'.

Just My Thoughts. Now I'm going to a sauna (with beer).
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1795193 · Report as offensive
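Petri's "explanation a" can be mimicked with a small scheduling simulation. Everything here is an assumption for illustration: made-up timings, a serve-earliest-request queueing rule, and equal work per task; nothing is taken from the actual apps or driver. The point it demonstrates is only that a fixed arrival-order rule lets the first starter stay one slot ahead for the whole run.

```python
# Two tasks alternate GPU batches and CPU post-processing on one GPU.
# The GPU serves whichever pending request arrived earliest, so a task
# that starts earlier resubmits earlier in every round and keeps its
# place at the head of the line. All units are arbitrary.

def run(start_offsets, n_batches, t_gpu=2.0, t_cpu=1.0):
    ready = list(start_offsets)        # when each task next submits GPU work
    done = [0] * len(start_offsets)    # batches completed per task
    finish = [None] * len(start_offsets)
    gpu_free = 0.0
    while any(f is None for f in finish):
        # serve the unfinished task whose request arrived earliest
        i = min((j for j, f in enumerate(finish) if f is None),
                key=lambda j: ready[j])
        start = max(ready[i], gpu_free)
        gpu_free = start + t_gpu
        ready[i] = gpu_free + t_cpu    # CPU post-processing, then resubmit
        done[i] += 1
        if done[i] == n_batches:
            finish[i] = gpu_free
    return finish

# Task 0 starts at t=0, task 1 half a unit later; same total work.
first, second = run([0.0, 0.5], n_batches=5)
print(first, second)  # the first starter finishes earlier
```

Under these toy assumptions the tasks get equal alternating GPU service, yet the first starter finishes a full batch-slot earlier, matching the "inheritable first-to-start advantage" Richard described.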
©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.