CUDA Toolkit 8.0 Available for Developers

Message boards : Number crunching : CUDA Toolkit 8.0 Available for Developers
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794385 - Posted: 8 Jun 2016, 8:55:00 UTC - in response to Message 1794382.  

'I have 3 VLAR Guppis and a shorty in my queue'

Also a GPUGrid expecting to use the other GPU, four NumberFields that can only run on CPUs, and an Einstein minding its own business on the intel_gpu.

That's also the flaw with the '-use sleep and free four cores' discussion.
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794389 - Posted: 8 Jun 2016, 9:01:16 UTC - in response to Message 1794385.  
Last modified: 8 Jun 2016, 9:03:06 UTC

'I have 3 VLAR Guppis and a shorty in my queue'

Also a GPUGrid expecting to use the other GPU, four NumberFields that can only run on CPUs, and an Einstein minding its own business on the intel_gpu.

That's also the flaw with the '-use sleep and free four cores' discussion.


for the Edit:
Edit - BOINC needs to manage heterogeneous apps - in the sense of different apps from different projects - already. But adding a second definition of heterogeneity within each app - we're going to need an API for that.


Probably not at client level. An application-level framework/API, yes. As a 'simple/classic' model, boincapi is showing cracks just with OSes becoming more asynchronous. There are papers I'm using for x42's design that detail a lot of the infrastructure requirements. In most cases it'll become a need for the Boinc client to be less fussy about things it doesn't know about, if anything.

[So as not to strangle your pet hamster by squeezing it so much]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Richard Haselgrove
Message 1794392 - Posted: 8 Jun 2016, 9:10:39 UTC - in response to Message 1794389.  

Hmmm. I think we're going to have to think about the AI needs of other projects - I don't think that we can necessarily ask another project's science app to yield a resource to us, without using the BOINC layer to act as message-passer.

I'm hoping that my email suggestion of reserving v8 for Arecibo, and deploying GBT as a separate v9, gains some traction - it would be interesting to watch Arecibo under CUDA sharing a GPU with guppies under SoG, with a <max_concurrent> of one each.
jason_gee
Message 1794393 - Posted: 8 Jun 2016, 9:16:23 UTC - in response to Message 1794385.  

'I have 3 VLAR Guppis and a shorty in my queue'

Also a GPUGrid expecting to use the other GPU, four NumberFields that can only run on CPUs, and an Einstein minding its own business on the intel_gpu.

That's also the flaw with the '-use sleep and free four cores' discussion.


Rule/profile based AI in dispatcher on timed interval:

if isRunning(XYZ) then
    setThrottle(myapp, low_percent);
    excludeGPU(n);
    numthreads = xx;
else
    setThrottle(myapp, high_percent);
    includeGPU(n);
    numthreads = yy;
endif

with supporting tools designed to work-out/detect/generate rules.

Nothing games and Android apps don't do already.
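[Editor's sketch: the rule/profile dispatch above could look roughly like this in practice. Every name here - is_running, the throttle percentages, the GPU index, the thread counts - is a hypothetical stand-in, not a real BOINC API call.]

```python
def is_running(name, running_procs):
    """Stand-in for a process scan; running_procs is a set of app names."""
    return name in running_procs

def dispatch(running_procs):
    """One pass of the timed rule evaluation from the post above.

    Back the app off while a competing app 'XYZ' is running; restore
    full resources otherwise. All values are illustrative.
    """
    if is_running("XYZ", running_procs):
        return {"throttle": 25, "excluded_gpus": {0}, "num_threads": 1}
    else:
        return {"throttle": 100, "excluded_gpus": set(), "num_threads": 4}

# Called on a timed interval, e.g. every few seconds:
profile = dispatch({"XYZ", "browser"})   # competing app present -> back off
```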
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1794394 - Posted: 8 Jun 2016, 9:20:47 UTC

I think Richard is more worried about resource share, if a heterogeneous app grabs devices BOINC has earmarked for other projects' apps?
A person who won't read has no advantage over one who can't read. (Mark Twain)
jason_gee
Message 1794395 - Posted: 8 Jun 2016, 9:21:44 UTC - in response to Message 1794392.  

Hmmm. I think we're going to have to think about the AI needs of other projects - I don't think that we can necessarily ask another project's science app to yield a resource to us, without using the BOINC layer to act as message-passer.

I'm hoping that my email suggestion of reserving v8 for Arecibo, and deploying GBT as a separate v9, gains some traction - it would be interesting to watch Arecibo under CUDA sharing a GPU with guppies under SoG, with a <max_concurrent> of one each.


The strength of a dispatch based approach comes through adapting to the existing situation/needs, as opposed to asking anything.

Good example is the dynamic dispatch within the FFTW library, and the CPU seti@home enhanced stock clients, which will use planning/bench phase to detect & select functions that will work.

The net effect, as with the FFTW library, is that you may get different code depending on the runtime conditions at the time.

If another non-heterogeneous app is given resources by the client, then we don't have to use those; and should the situation change mid-run, no problem - drop or add a thread.
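[Editor's sketch: the FFTW-style plan/bench phase described above - time each candidate implementation once on sample data, then dispatch to the fastest. The two candidate functions are invented for illustration.]

```python
import time

def sum_python_loop(data):
    # Naive implementation: explicit accumulation loop.
    total = 0
    for x in data:
        total += x
    return total

def sum_builtin(data):
    # Alternative implementation: delegate to the built-in.
    return sum(data)

def plan(candidates, sample):
    """Planning phase: benchmark each candidate on a sample and return
    the fastest for use at run time, FFTW-style."""
    best_fn, best_time = None, float("inf")
    for fn in candidates:
        start = time.perf_counter()
        fn(sample)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_fn, best_time = fn, elapsed
    return best_fn

# Planning may select different code on different hosts or conditions,
# but whichever wins, the answer is the same:
chosen = plan([sum_python_loop, sum_builtin], list(range(100_000)))
result = chosen(list(range(10)))
```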
jason_gee
Message 1794396 - Posted: 8 Jun 2016, 9:22:18 UTC - in response to Message 1794394.  
Last modified: 8 Jun 2016, 9:26:56 UTC

I think Richard is more worried about resource share, if a heterogeneous app grabs devices BOINC has earmarked for other projects' apps?


Don't 'Grab' anything not issued to the stub apps [collectively] :D

Feyd-Rautha: [whispers] You see your death. My blade will finish you.
Paul Atreides: [voiceover] I will bend like a reed in the wind.
-- Dune
William
Message 1794397 - Posted: 8 Jun 2016, 9:26:41 UTC - in response to Message 1794396.  

I think Richard is more worried about resource share, if a heterogeneous app grabs devices BOINC has earmarked for other projects' apps?


Don't grab anything not issued to the stub apps :D

but then the stub app needs to know what it has available and tell boinc what it might want.

In effect, the app could say 'please give me as many GPU and CPU cores as possible' and boinc has to be able to reply 'ok you can have x CPU and y GPU cores' with an option for 'excuse me can you free up another core ' or ' have another'...

That's quite a revamp of current scheduling...
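[Editor's sketch: a toy version of that request/grant exchange. The class and message shapes are invented for illustration - BOINC has no such negotiation protocol today, which is William's point.]

```python
class ClientScheduler:
    """Toy stand-in for the client side of the negotiation."""

    def __init__(self, free_cpus, free_gpus):
        self.free_cpus = free_cpus
        self.free_gpus = free_gpus

    def request(self, want_cpus, want_gpus):
        # App: "please give me as many GPU and CPU cores as possible"
        # Client: "ok, you can have x CPU and y GPU cores"
        cpus = min(want_cpus, self.free_cpus)
        gpus = min(want_gpus, self.free_gpus)
        self.free_cpus -= cpus
        self.free_gpus -= gpus
        return {"cpus": cpus, "gpus": gpus}

    def release(self, cpus=0, gpus=0):
        # "excuse me, can you free up another core" in reverse:
        # the app hands resources back mid-run.
        self.free_cpus += cpus
        self.free_gpus += gpus

client = ClientScheduler(free_cpus=4, free_gpus=2)
grant = client.request(want_cpus=8, want_gpus=1)  # asks for more than exists
# The grant is capped at what is actually free.
```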
jason_gee
Message 1794398 - Posted: 8 Jun 2016, 9:32:32 UTC - in response to Message 1794397.  

I think Richard is more worried about resource share, if a heterogeneous app grabs devices BOINC has earmarked for other projects' apps?


Don't grab anything not issued to the stub apps :D

but then the stub app needs to know what it has available and tell boinc what it might want.

In effect, the app could say 'please give me as many GPU and CPU cores as possible' and boinc has to be able to reply 'ok you can have x CPU and y GPU cores' with an option for 'excuse me can you free up another core ' or ' have another'...

That's quite a revamp of current scheduling...


Not really, think about it.

Current clients starts n applications with xyz resources (total)

I can make those n applications stubs, send them to sleep other than to periodically update progress and/or state/checkpoint, and hand over the resources to another process.

Nothing that technically the current applications don't do with worker threads; they're just limited to one worker per process. A heterogeneous app would use the same total set of resources, and gain the advantage of being able to dynamically respond if Boinc adds or removes a resource mid-processing.
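[Editor's sketch: the "same total resources, variable workers" idea as an elastic thread pool that can grow or shrink mid-run. The class and its methods are invented for illustration, not BOINC API.]

```python
import queue
import threading

class ElasticPool:
    """Worker pool that can add or drop workers mid-run, standing in for
    a heterogeneous app reacting to resources being granted or reclaimed."""

    def __init__(self):
        self.tasks = queue.Queue()
        self._workers = []  # (thread, stop_event) pairs

    def grow(self):
        # A resource was granted: add a worker thread.
        stop = threading.Event()
        t = threading.Thread(target=self._run, args=(stop,), daemon=True)
        t.start()
        self._workers.append((t, stop))

    def shrink(self):
        # A resource was reclaimed: retire one worker; the rest carry on.
        if self._workers:
            t, stop = self._workers.pop()
            stop.set()
            t.join()

    def _run(self, stop):
        while not stop.is_set():
            try:
                fn = self.tasks.get(timeout=0.1)
            except queue.Empty:
                continue
            fn()
            self.tasks.task_done()

results = []
lock = threading.Lock()
pool = ElasticPool()
pool.grow()
pool.grow()                      # two cores granted
for i in range(8):
    def work(i=i):
        with lock:
            results.append(i)
    pool.tasks.put(work)
pool.tasks.join()                # all 8 tasks completed
pool.shrink()                    # one core reclaimed mid-run; no problem
```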
Richard Haselgrove
Message 1794400 - Posted: 8 Jun 2016, 9:36:10 UTC - in response to Message 1794394.  

Yes - not specifically Resource Share (that's more of a long term objective that can be sorted out later), but overcommitment of resources. If SETI needs extra CPU for a certain work process, and another project has been given the green light by BOINC to use a full core's worth of CPU time, what happens?

At best, both tasks use the CPU, with a bit of thrashing as they swap contexts. That's well established in the CPU world, and both should make progress, even if slower than expected.

Other approaches exist, like the precautionary one being suggested for OpenCL: "reserve a full core for the app, whether it actually is going to use all the power of a core or not". That was the flaw I questioned when the idea of supporting VM tasks under the BOINC framework was described at the 2011 BOINC Workshop in London. Since the VMs handle their own despatch (and more - all their own communications as well), the outer BOINC layer doesn't know whether the inner VM is actually using its reserved resource - and hence, can't offer it to another process when idle.

Now that VMs are actively being used by CERN projects, I see that same question has come up again in discussion, but I don't think it's been answered yet.
jason_gee
Message 1794404 - Posted: 8 Jun 2016, 9:49:28 UTC - in response to Message 1794400.  
Last modified: 8 Jun 2016, 9:50:38 UTC

Yes - not specifically Resource Share (that's more of a long term objective that can be sorted out later), but overcommitment of resources. If SETI needs extra CPU for a certain work process, and another project has been given the green light by BOINC to use a full core's worth of CPU time, what happens?

At best, both tasks use the CPU, with a bit of thrashing as they swap contexts. That's well established in the CPU world, and both should make progress, even if slower than expected.

Other approaches exist, like the precautionary one being suggested for OpenCL: "reserve a full core for the app, whether it actually is going to use all the power of a core or not". That was the flaw I questioned when the idea of supporting VM tasks under the BOINC framework was described at the 2011 BOINC Workshop in London. Since the VMs handle their own despatch (and more - all their own communications as well), the outer BOINC layer doesn't know whether the inner VM is actually using its reserved resource - and hence, can't offer it to another process when idle.

Now that VMs are actively being used by CERN projects, I see that same question has come up again in discussion, but I don't think it's been answered yet.


Yeah, that's where the reed bending bit comes in. Users and applications know more about the tasks and their hardware (provided sufficient information) than Boinc needs to do its job. That's project/application/task domain-specific knowledge. Since the first (Boinc-enabled) application of its kind will know what's going on, through the user and dispatch-support-tools, it has the responsibility to yield in the same way the individual applications normally would (or hopefully better).

The only 'real' functional difference underneath is precisely about overcommit and contention. It can spontaneously decide to shrink to a low-priority single CPU core, go full throttle, or pick some intermediate, depending on what Boinc gives, or what the user/tools ask for.
Richard Haselgrove
Message 1794407 - Posted: 8 Jun 2016, 9:59:47 UTC - in response to Message 1794404.  

Obviously, that leads to the first such application being a bit of a doormat - allowing other projects to trample all over it. I think experienced users would know how to handle that and micro-manage their own machines to suit the new behaviour, but it could be a bit of a problem with a stock deployment for the public at large.
jason_gee
Message 1794410 - Posted: 8 Jun 2016, 10:12:35 UTC - in response to Message 1794407.  
Last modified: 8 Jun 2016, 10:21:37 UTC

Obviously, that leads to the first such application being a bit of a doormat - allowing other projects to trample all over it. I think experienced users would know how to handle that and micro-manage their own machines to suit the new behaviour, but it could be a bit of a problem with a stock deployment for the public at large.


Acting like a 'normal/familiar' application would be the first priority, since they 'work'.

Mmmm, Well I'm not so sure that the simplistic model Boinc uses for resource management isn't the one thing it does well, even if not ideal. It wants me to run these tasks, with such and such resources, say no more.

I think if 'my' application was trampled on, I would rather yield than cripple a host. That's why standard default (single main device) Cuda applications will probably remain simple/familiar blocking-sync style, at least until the more sophisticated examples can at least equal the behaviour with minimal configuration (and preferably better it).

Naturally part of the adaptation process for better/simpler behaviour will have roadblocks, though I don't consider the ridiculous estimates as a huge one in the scheme of things. More challenging to me is demonstrating why a single process [and single task] per resource is increasingly a bad idea.
Richard Haselgrove
Message 1794412 - Posted: 8 Jun 2016, 10:37:28 UTC - in response to Message 1794410.  

OK, I think we've given the issues a thorough enough once-over for today. Now it moves on to the fulfillment stage - something this country is in the middle of demonstrating that it is spectacularly bad at (Sports Direct, BHS, voter registration - all in the last 24 hours).
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1794416 - Posted: 8 Jun 2016, 11:00:14 UTC - in response to Message 1794410.  

More challenging to me is demonstrating why a single process [and single task] per resource is increasingly a bad idea.

Whatever produces the most work per hour IMHO.
If that can be done running just 1 WU at a time on a GPU, excellent. If not, then multiple WUs per GPU will continue to be the norm.
Grant
Darwin NT
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794531 - Posted: 8 Jun 2016, 20:12:16 UTC - in response to Message 1794360.  
Last modified: 8 Jun 2016, 20:26:00 UTC

And for completeness OpenCL

Using Inline PTX with OpenCL
A simple test application that demonstrates a new CUDA 4.0 driver ability to embed PTX in an OpenCL kernel.

That sounds promising indeed.
This would allow embedding the cache-friendly reads Petri proposed long ago, in some places in both AP and MB, directly in the NV branch of the corresponding kernels.

NB to those who think "there should be only single OpenCL app" - ... go figure!
Grant (SSSF)
Message 1794585 - Posted: 9 Jun 2016, 0:43:28 UTC - in response to Message 1794531.  

NB to those who think "there should be only single OpenCL app" - ... go figure!

For stock, yes.
For optimised applications, obviously not.
Raistmer
Message 1794609 - Posted: 9 Jun 2016, 3:23:21 UTC - in response to Message 1794585.  

I don't see why "stock" should be made deliberately slower than "optimized".
Grant (SSSF)
Message 1794610 - Posted: 9 Jun 2016, 3:25:59 UTC - in response to Message 1794609.  

Don't see why "stock" should be made deliberately slower than "optimized".

See the 'Why need 5 different stock AMD OpenCL GPU applications?' thread.
jason_gee
Message 1794678 - Posted: 9 Jun 2016, 10:50:50 UTC - in response to Message 1794610.  

I don't see why "stock" should be made deliberately slower than "optimized".

See the 'Why need 5 different stock AMD OpenCL GPU applications?' thread.


Exactly