Message boards : Number crunching : New AstroPulse for GPU (ATi & NV) released (r1316)
Author | Message |
---|---|
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
My 9800GTX+ was still using almost 100% CPU with the 301.24 and 302.59 drivers and the r1305 app (but I was purposely leaving a core free). I haven't tried the r1316 app on it yet; I'll try it once my present CUDA testing is complete. Claggy |
Fred J. Verster Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
I did the same, leaving 1 'thread' free, which ups the GPU load by 5-7%; also, the 2nd GPU has an even higher load than the first!? 98%-99% doing 2 MB WUs. This was with MB work, but I noticed the same with AstroPulse work. Unfortunately no AstroPulse tasks are available. In the meantime, also with no 610 work, I did some MW work; with a resource share of 650 (SETI) and 75 (MW) it's OK. AMD ATI 5870 GPUs. But it crashed (heat? I don't know); I immediately received a survey from HP asking what had happened and wanting to help?! Now doing 4 610 tasks on the GPUs. But some, better a lot of, AstroPulse work would be nice since rev. 1316, which really is faster; 2 results out of 2 were validated. |
Mike Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
Please stay on topic here. With each crime and every kindness we birth our future. |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
With the 275.33 NVIDIA driver and one APv6 6.04 (r1316) application/work unit on my GPU, one CPU thread had ~10% usage/support. When I changed to 2 work units per GPU, CPU-thread usage/support increased to ~40-50% for each application/work unit. I guess this OpenCL bug doesn't happen (or isn't as big) when there is only one work unit on my system (Intel Core2 Duo E7600 with NVIDIA GeForce GTX260-216 O/C, WinXP 32bit). * Best regards! :-) * Sutaru Tsureku, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. * |
Mike Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
Yes, Sutaru. We figured that pre-Fermi cards are not affected that much. Maybe it's because your 260 has many more compute units than a 460, for example. But that's only my imagination. |
Wedge009 Joined: 3 Apr 99 Posts: 451 Credit: 431,396,357 RAC: 553 |
Ah, so maybe that's the key? I'm only running one WU at a time on the GTX 260 Core 216 as well - it seems that multiple simultaneous WUs don't scale well unless on Fermi or later cards. For my Fermi and Kepler cards, I'm running two at a time. And yes, I also noticed that the pre-Fermi G200-era GPU designs have several more compute units than Fermi and Kepler. Different architecture, different implementation. Edit: This is interesting... with the current lack of AP WUs, one of my Fermi cards received only one AP WU to work on... even though it was processing a MB/CUDA WU simultaneously, the AP WU it was working on exhibited lower CPU usage similar to that seen on the pre-Fermi card. So perhaps the contributing factor to the high CPU usage isn't the GPU architecture, but rather how many OpenCL tasks you try to run on it simultaneously. MB/CUDA WUs don't appear to suffer high CPU usage. Soli Deo Gloria |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
it seems that multiple simultaneous WUs don't scale well unless on Fermi or later cards. Yes, that's the way they were built. Tasks don't actually run 'simultaneously' (any more than they do on a single-core CPU under a multi-tasking operating system). The hardware is switched from task to task on a scale of milliseconds, perhaps even microseconds. Fermis and later have specialised silicon to handle this task-switching at high speed: earlier GPUs don't. |
Wedge009 Joined: 3 Apr 99 Posts: 451 Credit: 431,396,357 RAC: 553 |
Ah, thanks for the clarification. I thought the workload would be split among the numerous processing cores on the GPU, so clearly I was mistaken there. Well, it would seem, then, that for OpenCL tasks at least, this 'high-speed switching' consumes excessive CPU resources with the current NV drivers. |
Al Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482 |
"I made 2 posts about the current situation with driver support for OpenCL on both vendors' forums recently:"

I thought I'd do a little light reading tonight and check out the NVIDIA thread mentioned above (since all I run are their cards), and got this notice when I went to their site:

"Posted July 12, 2012. NVIDIA suspended operations today of the NVIDIA Developer Zone (developer.nvidia.com). We did this in response to attacks on the site by unauthorized third parties who may have gained access to hashed passwords. We are investigating this matter and working around the clock to ensure that secure operations can be restored. As a precautionary measure, we strongly recommend that you change any identical passwords that you may be using elsewhere. NVIDIA does not request sensitive information by email. Do not provide personal, financial or sensitive information (including new passwords) in response to any email purporting to be sent by an NVIDIA employee or representative. We will post updates about this matter here. For any questions, email us at devzoneupdate@nvidia.com. For technical support, go to www.nvidia.com/support."

Bummer. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Yes, I received an e-mail with the same message; NV was hacked recently. Someone got tired of their bugs, perhaps ;) Well, the latest reports do imply a change in the latest NV drivers. When I reproduced this bug on my GPU I saw a CPU usage increase on the GTX 260 with a single MB task. So apparently something has changed since then. I will update from the rock-stable 263.06 drivers (no CPU usage bug there at all, even with 2 AP tasks running) to the latest and check what we really have now. EDIT: Our internal investigation is starting to show that GPU-host synching is a very complex matter, not only on NV but on ATi hardware too. The CPU time increase on low-end ATi GPUs with r1316 over r1305 I now attribute to a change in synching mode inside the APP runtime when the wait time increases. I will post these observations on the ATi forums and post a link here, so anyone interested in the topic can follow the discussion with ATi specialists (in case they bother to answer, of course). |
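[Editor's note] The spin-wait vs. blocking-wait distinction behind this sync-mode discussion can be illustrated without any OpenCL at all. Below is a minimal Python sketch; the "kernel" here is just a timer thread standing in for the GPU finishing work, and the two wait functions mimic the two driver sync modes:

```python
import threading
import time

def wait_spin(done_event):
    # Spin-wait: poll the completion flag in a tight loop.
    # Burns a CPU core for the whole wait, like a driver in spin-sync mode.
    while not done_event.is_set():
        pass

def wait_blocking(done_event):
    # Blocking wait: the OS parks the thread until the event fires.
    # Near-zero CPU time, like a driver in blocking-sync mode.
    done_event.wait()

def measure(wait_fn, kernel_time=0.2):
    # Return the CPU time consumed while waiting kernel_time seconds
    # for the simulated "kernel" to complete.
    done = threading.Event()
    threading.Timer(kernel_time, done.set).start()
    t0 = time.process_time()
    wait_fn(done)
    return time.process_time() - t0

spin_cpu = measure(wait_spin)
block_cpu = measure(wait_blocking)
print(f"spin-wait CPU time:     {spin_cpu:.3f} s")
print(f"blocking-wait CPU time: {block_cpu:.3f} s")
```

Both waits last the same wall time, but only the spin version charges it to the CPU, which is one plausible mechanism for the "one full core per OpenCL task" reports in this thread.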
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Here is the link to the synchronization issue discussion on AMD site: http://devgurus.amd.com/message/1282663#1282663 |
Wedge009 Joined: 3 Apr 99 Posts: 451 Credit: 431,396,357 RAC: 553 |
I don't fully understand all the details of that, but I think I get the gist of it. I suppose the lower-end GPUs have different implementations which are more sensitive to the differences in the two algorithms you tried. AP WUs are being (slowly) distributed again. Hopefully I can get more testing done, provided the blanking percentages don't vary too wildly (85+% blanking is really horrible, especially on slower CPUs). |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
"I don't fully understand all the details of that, but I think I get the gist of it. I suppose the lower end GPUs have different implementations which are more sensitive to the differences in the two algorithms you tried." Not quite... low-end GPUs just manifest it very noticeably. Mike tested on his quite fast GPU: he got 4 s CPU time vs 10 s CPU time... The values are low, so more testing will be needed, but the difference is very big in relative terms... On the other hand, I think it's some switch between 2 sync modes inside the driver itself, depending on the time to wait. If that's true, a high-end GPU might still not pass the threshold for such a switch and so not show such a big increase in CPU time... we will see, more testing is needed. |
Wedge009 Joined: 3 Apr 99 Posts: 451 Credit: 431,396,357 RAC: 553 |
Edit: I only just saw your private message (after making the post below). I feel pretty stupid now. x.x --- <GLaDOS>Continue testing...</GLaDOS> Ten seconds, wow. What card is that, out of curiosity? There were no instructions, so it took me a while, but I finally managed to work out how to run your dummy work unit in stand-alone mode; here are my results. I kept parameters the same across all executions (for the same GPU) and also made sure that the binary caches were built before making any timings (so run time should be just the actual GPU processing, not skewed by the time for the CPU to create those caches). I'm including CPUs in this summary for reference, though I don't expect them to have too great an influence on the times.

Intel Core 2 Q9550 + Radeon HD 6950: r555 0:46, r1305 0:50, r1316 0:43
Athlon 64 X2 6400+ + Radeon HD 5670: r555 3:24, r1305 3:10, r1316 2:47
Pentium 4 HT 3.06 + Radeon HD 4670: r555 5:54, r1305 6:01, r1316 4:03
Fusion C-50 (Radeon HD 6250): r555 10:36, r1305 9:11, r1316 8:42

Not sure what to make of these trends, but hopefully they'll be of some use to you. Maybe. I suppose only one kind of work unit doesn't reflect performance increases or decreases as much as we'd like; I'm not sure the differences are significant enough to overcome the margin of error. |
Wedge009 Joined: 3 Apr 99 Posts: 451 Credit: 431,396,357 RAC: 553 |
All right, tried it again with the APbench tool...

Intel Core 2 Q9550 + Radeon HD 6950
AP6_win_x86_SSE2_OpenCL_ATI_r1305.exe -unroll 11 : Elapsed 49.028 secs CPU 20.734 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 11 : Elapsed 42.324 secs CPU 18.938 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 11 : Elapsed 45.582 secs CPU 16.375 secs

Athlon 64 X2 6400+ + Radeon HD 5670
AP6_win_x86_SSE2_OpenCL_ATI_r1305.exe -unroll 5 : Elapsed 191.719 secs CPU 26.016 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 5 : Elapsed 169.219 secs CPU 14.766 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 5 : Elapsed 210.141 secs CPU 34.609 secs

Pentium 4 HT 3.06 + Radeon HD 4670
AP6_win_x86_SSE2_OpenCL_ATI_r1305.exe -unroll 4 : Elapsed 695.047 secs CPU 188.047 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 4 : Elapsed 513.266 secs CPU 88.875 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 4 : Elapsed 711.469 secs CPU 204.234 secs

Fusion C-50 (Radeon HD 6250)
AP6_win_x86_SSE2_OpenCL_ATI_r1305.exe -unroll 2 : Elapsed 1379.849 secs CPU 511.418 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 2 : Elapsed 969.014 secs CPU 558.187 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 2 : Elapsed 983.876 secs CPU 575.191 secs

The times seem to confirm my manual testing for the mid-range and high-end cards. Not sure what the script does differently such that the other two GPUs took twice as long compared with my manual testing, but the results should still be relative to each other here. |
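[Editor's note] Taking r555 as the baseline, the relative speedups hiding in those elapsed times can be worked out with a small helper script. The numbers are copied from the APbench post above; the dict layout is just for illustration:

```python
# Elapsed times (seconds) from the APbench runs quoted above.
elapsed = {
    "HD 6950": {"r555": 45.582,  "r1305": 49.028,   "r1316": 42.324},
    "HD 5670": {"r555": 210.141, "r1305": 191.719,  "r1316": 169.219},
    "HD 4670": {"r555": 711.469, "r1305": 695.047,  "r1316": 513.266},
    "C-50":    {"r555": 983.876, "r1305": 1379.849, "r1316": 969.014},
}

for gpu, times in elapsed.items():
    # Speedup of the new r1316 build relative to the old r555 baseline.
    speedup = times["r555"] / times["r1316"]
    print(f"{gpu}: r1316 runs at {speedup:.2f}x the speed of r555")
```

On these figures the HD 4670 gains the most from r1316 (roughly 1.4x over r555), while the C-50's r1305 result is the outlier in the other direction.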
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Thanks for testing.
1) Strange result for the very first GPU, r555 looks faster %) [C-50: the same...]
2) The results for the C-50 resemble results from my own C-60; it looks like it likes r1305 more.
3) "Twice as long": no idea, it's worth understanding why. So could you upload the full TestData directory, archived, somewhere? Or if you prefer, just e-mail me the archive. I will upload a modified version soon and then post a link here. It should have the same performance as r1305 & r1316 but is able to switch between different modes of execution via a command-line switch.
EDIT: BTW, for the C-60 I am receiving very inconsistent results; the time fluctuations are big. So it's worth making a few copies of the same task and running them all, or just repeating the run a few times, to get an estimate of the error range. BTW, did you test with the CPU free, or was the CPU busy with BOINC tasks? |
Wedge009 Joined: 3 Apr 99 Posts: 451 Credit: 431,396,357 RAC: 553 |
No worries. |
Mike Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
Wedge, offline tests should always be made under the same conditions: that means the CPU always busy, or always idle. Especially for speed comparisons, BOINC should be turned off. |
Wedge009 Joined: 3 Apr 99 Posts: 451 Credit: 431,396,357 RAC: 553 |
Raistmer, I did another series of tests and I'm satisfied with these results, so I won't be doing any more testing until the next build you want tested. These times are the average of three runs; all tests had BOINC fully suspended, and all parameters (on a given host) were kept the same throughout. The variance in times seems to be no more than about 5% for all tests, in most cases even less (this includes the C-50), so I'm confident these times are accurate for my particular hosts.

Times are elapsed / CPU seconds:
HD 6950 (Cayman): r555 46.386 / 16.318, r1305 48.799 / 20.490, r1316 42.653 / 18.979
HD 5670 (Redwood): r555 200.969 / 30.463, r1305 185.953 / 23.380, r1316 164.547 / 13.104
HD 4670 (RV730): r555 588.982 / 105.443, r1305 599.636 / 109.239, r1316 490.278 / 40.162
Fusion C-50 (HD 6250, Wrestler): r555 960.101 / 573.803, r1305 865.306 / 549.155, r1316 778.431 / 589.679

In conclusion, I still think r1316 is the overall winner of these three versions across a broad range of GPU architectures. About the only disadvantage I can see is the increase in CPU usage on the C-50, even though it produced the shortest overall run times. Now I'm curious to know how GCN compares... x.x I'm still limited to WinXP / Catalyst 12.1 at the moment, so no HD 7000-series for me just yet. |
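[Editor's note] The "no more than about 5% variance" claim can be checked mechanically, in the spirit of Raistmer's advice to repeat runs and estimate the error range. A small sketch; the per-run times below are made up, only the method is the point:

```python
from statistics import mean

def rel_spread(times):
    """Relative spread of repeated runs: (max - min) / mean."""
    return (max(times) - min(times)) / mean(times)

# Hypothetical elapsed times (seconds) for three runs of one build on one host.
runs = [42.1, 43.0, 42.6]
avg = mean(runs)
print(f"average: {avg:.3f} s, spread: {rel_spread(runs):.1%}")
```

If the spread stays under ~5%, differences between builds larger than that are probably real; smaller differences are within run-to-run noise.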
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Thanks! Some slowdown of r1305 on the fast GPU can come from a defaults change: the default FFA block size was decreased vs r555. If you want to change this, try adding the -ffa_block N and -ffa_block_fetch N params (where N should be the same for all revisions, of course). |
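[Editor's note] A hypothetical invocation following that advice might look like this. The value 8192 is purely a placeholder, not a recommended setting; the point is only that the same N must be passed to every revision being compared:

```shell
# Example only: the -ffa_block / -ffa_block_fetch values are placeholders.
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe  -unroll 11 -ffa_block 8192 -ffa_block_fetch 8192
AP6_win_x86_SSE2_OpenCL_ATI_r1305.exe -unroll 11 -ffa_block 8192 -ffa_block_fetch 8192
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 11 -ffa_block 8192 -ffa_block_fetch 8192
```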
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.