New AstroPulse for GPU ( ATi & NV) released (r1316)



Message boards : Number crunching : New AstroPulse for GPU ( ATi & NV) released (r1316)

Author Message
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 3963
Credit: 31,859,533
RAC: 11,007
United Kingdom
Message 1259097 - Posted: 11 Jul 2012, 20:02:46 UTC - in response to Message 1259081.

My 9800GTX+ was still using almost 100% CPU with the 301.24 & 302.59 drivers and the r1305 app (but I was purposely leaving a core free). I haven't tried the r1316 app on it yet; I'll try it once my present Cuda testing is complete.

Claggy

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3232
Credit: 31,585,541
RAC: 0
Netherlands
Message 1259202 - Posted: 12 Jul 2012, 2:26:32 UTC - in response to Message 1259097.
Last modified: 12 Jul 2012, 2:30:31 UTC

I did the same, leaving 1 'thread' free, which ups the GPU load by 5-7%; also,
the 2nd GPU has an even higher load than the first!? 98%-99% doing 2 MB WUs.
This was with MB work, but I noticed the same with AstroPulse work.
Unfortunately no AstroPulse tasks are available.

In the meantime, with no 610 work either, I did some MW work; with a resource share
of 650 (SETI) and 75 (MW) it's OK. AMD ATI 5870 GPUs.
But it crashed (heat? I don't know), and I immediately received a survey from HP
asking what had happened and offering to help?!

Now doing 4 610 tasks on the GPUs.
But some, better a lot of, AstroPulse work would be nice, since rev. 1316
really is faster, and 2 results out of 2 were validated.

____________


Knight Who Says Ni N!, OUT numbered.................

Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 22400
Credit: 29,333,900
RAC: 23,642
Germany
Message 1259221 - Posted: 12 Jul 2012, 3:41:39 UTC

Please stay on topic here.

____________

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 6969
Credit: 57,096,215
RAC: 22,645
Germany
Message 1259586 - Posted: 12 Jul 2012, 20:26:10 UTC
Last modified: 12 Jul 2012, 20:27:25 UTC

With the 275.33 NVIDIA driver and one APv6 6.04 (r1316) application/work unit on my GPU, one CPU thread had ~10% usage/support.

When I changed to 2 work units per GPU, CPU-thread usage/support increased to ~40-50% for each application/work unit.

I guess this OpenCL BUG doesn't happen (or isn't so big) when only one work unit runs on my system (Intel Core2 Duo E7600 with NVIDIA GeForce GTX260-216 O/C, WinXP 32bit).


* Best regards! :-) * Sutaru Tsureku, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. *
____________
BR



>Das Deutsche Cafe. The German Cafe.<

Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 22400
Credit: 29,333,900
RAC: 23,642
Germany
Message 1259593 - Posted: 12 Jul 2012, 20:39:16 UTC

Yes, Sutaru.
We figured that pre-Fermi cards are not that much affected.

Maybe it's because your 260 has many more compute units than a 460, for example.
But that's only my speculation.


____________

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 237
Credit: 107,432,237
RAC: 167,781
Australia
Message 1259606 - Posted: 12 Jul 2012, 21:50:34 UTC
Last modified: 12 Jul 2012, 22:02:33 UTC

Ah, so maybe that's the key? I'm only running one WU at a time on the GTX 260 Core 216 as well - it seems that multiple simultaneous WUs don't scale well unless on Fermi or later cards. For my Fermi and Kepler cards, I'm running two at a time.

And yes, I also noticed that the pre-Fermi G200-era GPU designs have several more compute units than Fermi and Kepler. Different architecture, different implementation.

Edit: This is interesting... with the current lack of AP WUs, one of my Fermi cards received only one AP WU to work on... even though it was processing a MB/CUDA WU simultaneously, the AP WU it was working on exhibited lower CPU usage similar to that seen on the pre-Fermi card. So perhaps the contributing factor to the high CPU usage isn't the GPU architecture, but rather how many OpenCL tasks you try to run on it simultaneously. MB/CUDA WUs don't appear to suffer high CPU usage.
____________
Soli Deo Gloria

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,951,370
RAC: 13,644
United Kingdom
Message 1259625 - Posted: 12 Jul 2012, 23:16:21 UTC - in response to Message 1259606.

it seems that multiple simultaneous WUs don't scale well unless on Fermi or later cards.

Yes, that's the way they were built.

Tasks don't actually run 'simultaneously' (any more than they do on a single-core CPU under a multi-tasking operating system). The hardware is switched from task to task on a scale of milliseconds, perhaps even microseconds.

Fermis and later have specialised silicon to handle this task-switching at high speed: earlier GPUs don't.
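A toy model makes the point concrete (this is purely an illustration of the time-slicing idea, not actual GPU scheduler code): two tasks sharing one device interleave in slices rather than running at the same instant.

```python
# Toy round-robin model of one GPU device shared by several tasks.
# Each "time slice" advances whichever task currently holds the device,
# mirroring how the hardware switches between tasks rather than running
# them truly simultaneously.
def run_round_robin(task_lengths, slice_ms=1):
    remaining = list(task_lengths)
    timeline = []  # which task held the device in each slice
    while any(r > 0 for r in remaining):
        for i, r in enumerate(remaining):
            if r > 0:
                step = min(slice_ms, r)
                remaining[i] -= step
                timeline.append(i)
    return timeline

# Two 3 ms tasks interleave 0,1,0,1,0,1 - neither runs "simultaneously".
print(run_round_robin([3, 3]))
```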

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 237
Credit: 107,432,237
RAC: 167,781
Australia
Message 1259633 - Posted: 13 Jul 2012, 0:05:12 UTC

Ah, thanks for the clarification. I thought the workload would be split among the numerous processing cores on the GPU, so clearly I was mistaken there.

Well, it would seem, then, that for OpenCL tasks at least, this 'high-speed switching' consumes excessive CPU resources with the current NV drivers.
____________
Soli Deo Gloria

Al
Joined: 3 Apr 99
Posts: 481
Credit: 48,933,252
RAC: 29,907
United States
Message 1259668 - Posted: 13 Jul 2012, 2:35:00 UTC - in response to Message 1257949.

I made 2 posts recently about the current situation with driver support for OpenCL on both vendors' forums:
http://devgurus.amd.com/thread/159432
http://developer.nvidia.com/devforum/discussion/10636/feature-request-to-add-synchronization-mode-tuning-via-nv-specific-opencl-extension

If you have something to say on the topic, or can explain why this is important for users, please do post in the corresponding threads.

I thought I'd do a little light reading tonight and check out the Nvidia thread mentioned above (since all I run are their cards), and got this notice when I went to their site:

Posted July 12, 2012

NVIDIA suspended operations today of the NVIDIA Developer Zone (developer.nvidia.com). We did this in response to attacks on the site by unauthorized third parties who may have gained access to hashed passwords.

We are investigating this matter and working around the clock to ensure that secure operations can be restored.

As a precautionary measure, we strongly recommend that you change any identical passwords that you may be using elsewhere.

NVIDIA does not request sensitive information by email. Do not provide personal, financial or sensitive information (including new passwords) in response to any email purporting to be sent by an NVIDIA employee or representative.

We will post updates about this matter here. For any questions, email us at devzoneupdate@nvidia.com.

For technical support, go to www.nvidia.com/support.


Bummer.
____________

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3291
Credit: 40,849,942
RAC: 59,443
Russia
Message 1259712 - Posted: 13 Jul 2012, 7:07:35 UTC - in response to Message 1259668.
Last modified: 13 Jul 2012, 7:19:32 UTC

Yes, I received an e-mail with the same message; NV was hacked recently. Someone got tired of their bugs, perhaps ;)

Well, the latest reports actually imply a change in recent NV drivers.
When I reproduced this bug on my GPU I saw a CPU usage increase on the GTX 260 with a single MB task. So apparently something has changed since then. I will update from the rock-stable 263.06 drivers (no CPU usage bug there at all, even with 2 AP tasks running) to the latest ones and check what we really have now.

EDIT: Our internal investigation is starting to show that GPU-host synching is a very complex matter, not only on NV but on ATi hardware too. The CPU time increase on low-end ATi GPUs with r1316 over r1305 I now attribute to a change in synching mode inside the APP runtime when the wait time increases. I will post these observations on the ATi forums and post a link here, so anyone interested in the topic can follow the discussion with ATi specialists (in case they bother to answer, of course).
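The difference between the two host-side synchronisation styles under discussion can be reproduced outside OpenCL entirely. This Python sketch is a toy stand-in (not the actual driver or AP code): a sleeping thread plays the role of a GPU kernel, and the script compares the CPU time burned by a spin-wait against a blocking wait.

```python
import threading
import time

def gpu_kernel(done, seconds=0.5):
    # Stand-in for a GPU kernel: finishes after a fixed wall-clock delay.
    time.sleep(seconds)
    done.set()

def spin_wait(done):
    # Busy-wait sync mode: poll completion in a tight loop (burns a core).
    while not done.is_set():
        pass

def blocking_wait(done):
    # Blocking sync mode: sleep until signalled (near-zero CPU use).
    done.wait()

def measure(wait_fn):
    # Return CPU time consumed while waiting for the "kernel" to finish.
    done = threading.Event()
    threading.Thread(target=gpu_kernel, args=(done,)).start()
    cpu0 = time.process_time()
    wait_fn(done)
    return time.process_time() - cpu0

print(f"spin-wait CPU time:  {measure(spin_wait):.3f}s")
print(f"blocking  CPU time:  {measure(blocking_wait):.3f}s")
```

Both waits take the same ~0.5 s of wall-clock time; only the spin-wait also consumes ~0.5 s of CPU time, which is exactly the "one CPU thread per GPU task" symptom reported in this thread.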

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3291
Credit: 40,849,942
RAC: 59,443
Russia
Message 1259728 - Posted: 13 Jul 2012, 7:52:38 UTC

Here is the link to the synchronization issue discussion on AMD site:
http://devgurus.amd.com/message/1282663#1282663

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 237
Credit: 107,432,237
RAC: 167,781
Australia
Message 1259734 - Posted: 13 Jul 2012, 8:31:01 UTC

I don't fully understand all the details of that, but I think I get the gist of it. I suppose the lower end GPUs have different implementations which are more sensitive to the differences in the two algorithms you tried.

AP WUs are being (slowly) distributed again. Hopefully I can get more testing done, provided the blanking percentages don't vary too wildly (85+% blanking is really horrible, especially on slower CPUs).
____________
Soli Deo Gloria

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3291
Credit: 40,849,942
RAC: 59,443
Russia
Message 1259768 - Posted: 13 Jul 2012, 10:29:30 UTC - in response to Message 1259734.
Last modified: 13 Jul 2012, 10:32:14 UTC

I don't fully understand all the details of that, but I think I get the gist of it. I suppose the lower end GPUs have different implementations which are more sensitive to the differences in the two algorithms you tried.

AP WUs are being (slowly) distributed again. Hopefully I can get more testing done, provided the blanking percentages don't vary too wildly (85+% blanking is really horrible, especially on slower CPUs).


Not quite... low-end GPUs just manifest it very noticeably.
Mike tested on his quite fast GPU - he got 4s CPU time vs 10s CPU time... The values are low, so more testing will be needed, but the difference is very big in relative terms...

On the other hand, I think it's some switch between 2 sync modes inside the driver itself, depending on the time to wait.
If that's true, a high-end GPU might still not pass the threshold for such a switch and so not show such a big increase in CPU time... we'll see, more testing is needed.
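One plausible shape for such a threshold-based switch (purely an assumption about what a driver might do, not taken from NV or AMD code) is to spin only up to a cut-over time and then fall back to a blocking wait, so short kernels keep low latency while long ones stop burning a CPU core:

```python
import threading
import time

SPIN_THRESHOLD = 0.1  # assumed cut-over time; a real driver picks its own

def hybrid_wait(done, spin_threshold=SPIN_THRESHOLD):
    """Spin briefly for short waits, then block for long ones."""
    deadline = time.monotonic() + spin_threshold
    while time.monotonic() < deadline:   # fast path: busy-poll
        if done.is_set():
            return "spun"
    done.wait()                          # slow path: sleep until signalled
    return "blocked"

def fake_kernel(duration):
    # Stand-in for a GPU kernel that completes after `duration` seconds.
    done = threading.Event()
    threading.Timer(duration, done.set).start()
    return done

print(hybrid_wait(fake_kernel(0.005)))  # short kernel: caught while spinning
print(hybrid_wait(fake_kernel(0.5)))    # long kernel: falls back to blocking
```

A fast GPU whose kernels always finish inside the threshold would then never hit the blocking path, which would explain why the CPU-time increase shows up mainly on slower cards.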

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 237
Credit: 107,432,237
RAC: 167,781
Australia
Message 1259776 - Posted: 13 Jul 2012, 11:14:11 UTC
Last modified: 13 Jul 2012, 11:26:21 UTC

Edit: I only just saw your private message (after making the below post). I feel pretty stupid now. x.x

---

<GLaDOS>Continue testing...</GLaDOS>

Ten seconds, wow. What card is that, out of curiosity?

There were no instructions so it took me a while, but I finally managed to work out how to run your dummy work-unit in stand-alone mode and here are my results. I kept parameters the same across all executions (for the same GPU) and also made sure that the binary caches were built before making any timings (so run-time should be just the actual GPU processing and not skewed by the time for the CPU to create those caches).

I'm including CPUs in this summary for reference though I don't expect it should have too great an influence on the times.

Intel Core 2 Q9550 + Radeon HD 6950
r555: 0:46
r1305: 0:50
r1316: 0:43

Athlon 64 X2 6400+ + Radeon HD 5670
r555: 3:24
r1305: 3:10
r1316: 2:47

Pentium 4 HT 3.06 + Radeon HD 4670
r555: 5:54
r1305: 6:01
r1316: 4:03

Fusion C-50 (Radeon HD 6250)
r555: 10:36
r1305: 9:11
r1316: 8:42

Not sure what to make of these trends, but hopefully they'll be of some use to you. Maybe. I suppose a single kind of work-unit doesn't reflect performance increases or decreases as well as we'd like - I'm not sure the differences are significant enough to overcome the margin of error.
____________
Soli Deo Gloria

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 237
Credit: 107,432,237
RAC: 167,781
Australia
Message 1259833 - Posted: 13 Jul 2012, 13:28:15 UTC

All right, tried it again with the APbench tool...

Intel Core 2 Q9550 + Radeon HD 6950
AP6_win_x86_SSE2_OpenCL_ATI_r1305.exe -unroll 11 :
Elapsed 49.028 secs
CPU 20.734 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 11 :
Elapsed 42.324 secs
CPU 18.938 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 11 :
Elapsed 45.582 secs
CPU 16.375 secs

Athlon 64 X2 6400+ + Radeon HD 5670
AP6_win_x86_SSE2_OpenCL_ATI_r1305.exe -unroll 5 :
Elapsed 191.719 secs
CPU 26.016 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 5 :
Elapsed 169.219 secs
CPU 14.766 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 5 :
Elapsed 210.141 secs
CPU 34.609 secs

Pentium 4 HT 3.06 + Radeon HD 4670
AP6_win_x86_SSE2_OpenCL_ATI_r1305.exe -unroll 4 :
Elapsed 695.047 secs
CPU 188.047 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 4 :
Elapsed 513.266 secs
CPU 88.875 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 4 :
Elapsed 711.469 secs
CPU 204.234 secs

Fusion C-50 (Radeon HD 6250)
AP6_win_x86_SSE2_OpenCL_ATI_r1305.exe -unroll 2 :
Elapsed 1379.849 secs
CPU 511.418 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 2 :
Elapsed 969.014 secs
CPU 558.187 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 2 :
Elapsed 983.876 secs
CPU 575.191 secs

The times seem to confirm my manual testing for the mid-range and high-end cards - I'm not sure what the script does differently such that the other two GPUs took twice as long compared with my manual testing. But the results should still be relative to each other here.
____________
Soli Deo Gloria

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3291
Credit: 40,849,942
RAC: 59,443
Russia
Message 1259841 - Posted: 13 Jul 2012, 13:46:08 UTC - in response to Message 1259833.
Last modified: 13 Jul 2012, 13:50:48 UTC

Thanks for testing.

1) Strange result for the very first GPU, r555 looks faster %) [C-50 - the same...]
2) The results for the C-50 resemble the results from my own C-60; it looks like it likes r1305 more.
3) "Twice as long" - no idea; it's worth understanding why.
So could you archive the full TestData directory and upload it somewhere? Or, if you prefer, just e-mail me the archive.

I will upload a modified version soon, then post the link here. It should have the same performance as r1305 & r1316 but will be able to switch between different modes of execution via a command-line switch.

EDIT: BTW, for the C-60 I am receiving very inconsistent results; the time fluctuations are big. So it's worth making a few copies of the same task and running them all, or just repeating the run a few times, to get an estimate of the error range.
BTW, did you test with the CPU free, or was the CPU busy with BOINC tasks?
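Raistmer's suggestion - repeat the run a few times and estimate the error range - takes only a few lines of Python (the timing figures here are placeholders, not real AP runs):

```python
import statistics

def error_range(timings):
    """Mean, sample std-dev, and relative spread (%) for repeated runs."""
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings)  # needs at least 2 runs
    return mean, stdev, 100.0 * stdev / mean

# Example: three elapsed times (seconds) from repeated runs of one task.
runs = [42.3, 43.1, 42.7]
mean, stdev, rel = error_range(runs)
print(f"mean {mean:.1f}s, stdev {stdev:.2f}s, spread {rel:.1f}%")
```

A revision-to-revision difference smaller than the spread this reports is then within the noise and shouldn't be read as a real speedup or slowdown.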

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 237
Credit: 107,432,237
RAC: 167,781
Australia
Message 1259858 - Posted: 13 Jul 2012, 15:06:17 UTC

No worries.


  1. When you say 'faster', are you only looking at CPU times? Because for the HD 6950, I can definitely say that the overall run-time is better with r1316. On the old r555, it would finish a zero-blanked AP task in about 50-55 minutes. With r1316, it finishes in about 40 minutes, maybe less. Maybe this test WU is not the best representation of 'typical' AP tasks, if there is such a thing.
  2. Again, when you say 'like', are you only considering CPU time?
  3. Maybe it's just down to normal variance - I will have to test some more.


I sent you a private message with a link - let me know if you can't download it. I had to re-do the test for HD 5670 because I lost the first results, though results are very similar this time around, even though I suspended CPU tasks as well.

I will try to do some more testing later (need to sleep now, very tired), especially for the C-50. Let me know what you want me to focus on. For the results I posted, I only fully suspended BOINC on the HD 6950. For the others, I only suspended ATI tasks because I wanted to test 'real-world' conditions, so I left the other work-units running. Since the CPU priority for GPU tasks is higher than for the CPU-only ones, it shouldn't have too much effect. Except for the C-50, of course, because its CPU and GPU are on the same chip, so I will suspend CPU tasks for that one as well for the next test.

...as it turns out, I think testing (suspending/resuming repeatedly) killed the AP task on the HD 4670 prematurely (30/30 repeating pulses). So I took the opportunity to make another test run on it with CPU tasks suspended - I included the results in the link I gave you.
____________
Soli Deo Gloria

Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 22400
Credit: 29,333,900
RAC: 23,642
Germany
Message 1259882 - Posted: 13 Jul 2012, 16:57:02 UTC


Wedge,

Offline tests should always be made under the same conditions.
That means the CPU always busy, or always idle.

Especially for speed comparisons, BOINC should be turned off.

____________

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 237
Credit: 107,432,237
RAC: 167,781
Australia
Message 1260095 - Posted: 14 Jul 2012, 1:36:50 UTC
Last modified: 14 Jul 2012, 1:43:02 UTC

Raistmer, I did another series of tests, and I'm satisfied with these results, so I won't be doing any more testing until the next build you want tested. These times are the average of three runs; all tests had BOINC fully suspended, and all parameters (on a given host) were kept the same throughout. The variance in times seems to be no more than about 5% for all tests - in most cases even less than that (including the C-50) - so I'm confident these times are accurate for my particular hosts.

HD 6950 (Cayman)
r555: 46.386 / 16.318
r1305: 48.799 / 20.490
r1316: 42.653 / 18.979

HD 5670 (Redwood)
r555: 200.969 / 30.463
r1305: 185.953 / 23.380
r1316: 164.547 / 13.104

HD 4670 (R730)
r555: 588.982 / 105.443
r1305: 599.636 / 109.239
r1316: 490.278 / 40.162

Fusion C-50 (HD 6250 - Wrestler)
r555: 960.101 / 573.803
r1305: 865.306 / 549.155
r1316: 778.431 / 589.679

In conclusion, I still think r1316 is the overall winner out of these three versions across a broad range of GPU architectures. About the only disadvantage I can see is an increase in CPU usage on the C-50, even though it produced the shortest overall run-times.
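That conclusion can be sanity-checked directly from the elapsed times above (the first figure of each pair) with a few lines of Python:

```python
# Elapsed times (s) from the table above: r555 vs r1316, averages of 3 runs.
elapsed = {
    "HD 6950": (46.386, 42.653),
    "HD 5670": (200.969, 164.547),
    "HD 4670": (588.982, 490.278),
    "C-50":    (960.101, 778.431),
}

for card, (r555, r1316) in elapsed.items():
    gain = 100.0 * (r555 - r1316) / r555
    print(f"{card}: r1316 is {gain:.1f}% faster than r555")
```

Every card comes out ahead with r1316 on elapsed time (roughly 8% on the HD 6950 and 17-19% on the other three), even though the C-50's CPU time went up.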

Now, I'm curious to know how GCN compares... x.x I'm still limited to WinXP / Catalyst 12.1 at the moment, so no HD 7000-series for me just yet.
____________
Soli Deo Gloria

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3291
Credit: 40,849,942
RAC: 59,443
Russia
Message 1260139 - Posted: 14 Jul 2012, 5:00:32 UTC - in response to Message 1260095.

Thanks!
Some slowdown of r1305 on the fast GPU may come from a change in defaults: the default FFA block size was decreased vs r555. If you want to change this, try adding the -ffa_block N and -ffa_block_fetch N params (where N should be the same for all revisions, of course).



Copyright © 2014 University of California