Message boards : Number crunching : OpenCL apps are available for download on Lunatics
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
If so, could somebody put together a recommended toolset and programming guide for study? I'll see if I can find a pull quote from there (I should have it somewhere, else my download link is going to get some more exercise) that I can pass on to Raistmer.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> If so, could somebody put together a recommended toolset and programming guide for study?

Will have to find it again myself, though I remember using an OpenGL callback in a render loop for OpenCL 1.0, since OpenCL didn't get callbacks until 1.1.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
> Will have to find it again myself, though I remember using an OpenGL callback in a render loop for OpenCL 1.0, since OpenCL didn't get callbacks until 1.1.

Pity the OpenCL code samples aren't broken out into a separate download - looks like I need the whole GPU Computing [sic] SDK download (that's the same thing, I take it?). None of the CUDA C/C++ Code Samples, DirectCompute Code Samples, or CUDA Library Samples look hopeful.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> Pity the OpenCL code samples aren't broken out into a separate download - looks like I need the whole GPU Computing SDK download (that's the same thing, I take it?)

Things have moved around. The oclSimpleGL sample implements the callbacks using OpenGL, and the ocean simulation has been moved to DirectCompute, using DirectX render callbacks. In either case (both workably demonstrating callbacks and very low CPU usage with high GPU usage), the basic computation changes from the old style:

    Setup device & kernels etc.
    Main processing loop:
        Do some computation on GPU
        Hard synchronise (implicit or explicit)
        Transfer back to CPU for postprocessing

to:

    Setup device & kernels etc.
    Context with Vsync off, where applicable
    Setup callbacks (OpenGL, DirectX, OpenCL 1.1+ or CUPTI)
    Sleep loop:
        Check for exit requests etc.
    Callback ('render') function:
        Do some computation on GPU
        Transfer back to CPU & postprocess where needed
        [Optionally] draw something

The core change is only that processing moves out of the main polled/blocking loop and into a callback, invoked either by a graphics context, or by a callback interface as supplied by OpenCL 1.1 or CUPTI without the need for a graphics context.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
I found the CUDA FFT Ocean Simulation, thanks - still using OpenGL in the public SDK I downloaded; it may be the developer NDA version which has moved to DirectX. I'll look for the oclSimpleGL sample over the weekend while the servers are dark (got to have an alternative displacement activity), and try to find OpenCL in the documentation while I'm down there.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> I'll look for the oclSimpleGL sample over the weekend while the servers are dark (got to have an alternative displacement activity), and try to find OpenCL in the documentation while I'm down there.

Yep, right direction. A couple of extra notes that may or may not be helpful:

- Internally, when/if an app using the CUDA runtime crashes, the call stack usually shows that a number of DirectX interface functions are in use. Those are the method by which the CUDA runtime sets up its 'traditional looking' synchronisation, so attempting to emulate the CUDA method closely on Windows+OpenCL would involve those rather complex interfaces.

- The OpenCL/OpenGL methods use the open-source FreeGLUT library, which, if dug into more deeply, opens further options for 'roll your own' synchronisation techniques. (I have those on the board for 'other purposes' down the road.)

Obviously both these examples are geared toward graphics interop demonstration, where low CPU usage is considered important - in games, for example, the CPU might need to be handling the user interface, AI, sound etc. Dedicated GPGPU crunching, 'our case', is a bit unusual in that our users tend to expect both maximum processing speed and low CPU usage. While these demos illustrate that's possible (especially the DirectCompute/DirectX Ocean one), neither is trivial, of course. The OpenCL+OpenGL example, though, is probably much simpler than dealing with DirectX directly, and OpenCL 1.1's mechanism would be an easier route to the same end.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Wedge009 · Joined: 3 Apr 99 · Posts: 451 · Credit: 431,396,357 · RAC: 553
Just an update on my experiment: 275.33 still exhibited high CPU usage with AP, even though the release notes state it only supports OpenCL 1.0. But then I realised that reverting to an older driver isn't a feasible solution, because that would mean JG's most recent MB applications are no longer supported. Granted, these are slow cards anyway, but with the new applications the speed improvements are measurable: on the order of several minutes for short WUs and dozens of minutes for the longer ones.

With the servers being off-line, there was an opportunity for the ATI AP WUs to be processed without heavy CPU usage for their duration. During this time, the host was able to produce at least one valid AP result. If this trend continues, it looks like I'll have to limit the number of NV AP WUs being run to one at a time, to avoid excessive CPU usage. This isn't a problem for most users, I imagine, as most people wouldn't be using such an old CPU in their hosts.

Soli Deo Gloria
William · Joined: 14 Feb 13 · Posts: 2037 · Credit: 17,689,662 · RAC: 0
> But then I realised that reverting to an older driver isn't a feasible solution because then that means JG's most recent MB applications are no longer supported.

Which card? The usual pattern is that CUDA 2.3 is best for pre-Fermi, 4.2 for Fermi and 5.0 for Kepler. Going to an older driver and an earlier CUDA version may not lose you too much speed. You might want to bench that, though.

> If this trend continues, it looks like I'll have to limit the number of NV AP WUs being run to one at a time, to avoid excessive CPU usage.

Since you are already running the latest BOINC alpha, you can make use of the 'max concurrent' feature, see this.

> This isn't a problem for most users, I imagine, as most people wouldn't be using such an old CPU in their hosts.

Or they might have an old rig and be giving it a new lease of life with a good GPU. In those cases it might be best if the CPUs didn't crunch at all.

William the Silent

A person who won't read has no advantage over one who can't read. (Mark Twain)
Wedge009 · Joined: 3 Apr 99 · Posts: 451 · Credit: 431,396,357 · RAC: 553
It's a Fermi-based card - a GT 430. I'm aware of the differences in each build designed for a different CUDA version. It may all be moot anyway - given the age of the overall system, I may retire it by the end of the year.

It's not the number of GPU tasks that's the problem, it's the number of NV AP tasks specifically. I think modifying app_info.xml will be best for limiting those tasks in my particular case.

And yes, what you describe is what I've done - taken an old computer and filled it with GPUs. Still not really cost-effective compared with buying a cheap, modern platform, which is why I doubt many others have done it.

Soli Deo Gloria
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
I also gave 275.33 a try with an old NV 8800. It still used a full core, and the CUDA 2.3 tasks were not affected. If anything, the MB tasks improved slightly.
William · Joined: 14 Feb 13 · Posts: 2037 · Credit: 17,689,662 · RAC: 0
> It's not the number of GPU tasks that's the problem, it's the number of NV AP tasks specifically. I think modifying the app_info.xml will be best for limiting those tasks in my particular case.

If you change the <count> variable to 1, you can only run that one AP task. However, if you leave both MB and AP counts at 0.5 and use app_config.xml to limit max_concurrent for AP only to 1, you can run an AP alongside an MB, or two MB at a time, but it won't run two AP at the same time. Just saying that might be more to your liking throughput-wise.

A person who won't read has no advantage over one who can't read. (Mark Twain)
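For reference, a minimal app_config.xml sketch of the max_concurrent approach described above. The <name> value is an assumption for illustration; BOINC matches app_config.xml entries by application name, so it must match the name your app_info.xml actually uses for AstroPulse:

```xml
<app_config>
  <app>
    <!-- Must match the <name> in app_info.xml;
         astropulse_v6 is an assumed example, not necessarily yours. -->
    <name>astropulse_v6</name>
    <!-- Never run more than one AP task at a time,
         regardless of how many would fit on the GPUs. -->
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>
```

Placed in the project directory, and with both MB and AP <count> values left at 0.5, this lets an AP task run alongside an MB task but never two AP at once.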
Wedge009 · Joined: 3 Apr 99 · Posts: 451 · Credit: 431,396,357 · RAC: 553
I'm aware of how app_info.xml works. It's not a case of multiple WUs on a single card here; it's a case of a single WU on each of multiple cards.

Soli Deo Gloria
Floyd · Joined: 19 May 11 · Posts: 524 · Credit: 1,870,625 · RAC: 0
> And yes, what you describe is what I've done - taken an old computer and filled it with GPUs. Still not really cost-effective compared with buying a cheap, modern platform, which is why I doubt many others have done it.

There are probably more people doing that than you realize. I just did it with an old (2005) Dell GX620 I was given - added a GT 430 GPU and I'm letting it work what it can. Total cost: $36.00. With the economy being what it is, I can't afford to do anything else. RAC 611, 10,410 total credits - every little bit helps the cause...
Wedge009 · Joined: 3 Apr 99 · Posts: 451 · Credit: 431,396,357 · RAC: 553
Glad to know I'm not the only one who likes tinkering with old hardware as well as new! (: Soli Deo Gloria |
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
> I also gave 275.33 a try with an old NV 8800. Still used a full core, and the CUDA 2.3 tasks were not affected. If anything, the MB tasks improved slightly.

Seems 266.58 is the ticket for Windows XP. Running r1764 there are just a few little red spikes every so often. We'll see how it goes.
Mike · Joined: 17 Feb 01 · Posts: 34255 · Credit: 79,922,639 · RAC: 80
Those suffering driver restarts with MB_r_1761/r_1764 can now download the HD5 version from my site. Downloads

With each crime and every kindness we birth our future.
cov_route · Joined: 13 Sep 12 · Posts: 342 · Credit: 10,270,618 · RAC: 0
> For those suffering with driver restarts with MB_r_1761/r_1764 can now download the HD5 version from my site.

Fantastic! Should this fix restarts with the 13.1 driver?
Mike · Joined: 17 Feb 01 · Posts: 34255 · Credit: 79,922,639 · RAC: 80
> Should this fix restarts with the 13.1 driver?

Yes, it should. But you will still lose time.

With each crime and every kindness we birth our future.
cov_route · Joined: 13 Sep 12 · Posts: 342 · Credit: 10,270,618 · RAC: 0
> Yes, it should.

I just did a quick bench with short test WUs and it is faster than r390 on 12.8. I'm going to make the switch, then go up to 13.1 and see what happens.
Mike · Joined: 17 Feb 01 · Posts: 34255 · Credit: 79,922,639 · RAC: 80
> I just did a quick bench with short test WUs and it is faster than r390 on 12.8.

Of course it's faster. But 13.1 doesn't work properly. On my GPU only 5 out of 12 registers were in use. That means a heavy slowdown. But it still produces valid results. Maybe running more instances reduces the slowdown a little bit.

With each crime and every kindness we birth our future.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.