## @Pre-FERMI nVidia GPU users: Important warning

Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 54
Credit: 9,005,510
RAC: 5,219
Message 1649554 - Posted: 5 Mar 2015, 14:30:08 UTC

I wanted to relay ongoing correspondence, about the remaining OpenCL and SDK issues that I've noticed, and that NVIDIA seems willing to investigate.

I doubt that these will affect the AstroPulse app, so as far as SETI is concerned, I still fully recommend continuing to test the 341.44 drivers, to make sure they run the app correctly, to lift the version ban for 341.44. If you guys want me to help test that in any way, on my FX3800M GPU, just give me clear instructions and I will try to help.

Ongoing Correspondence:

19 February 2015 8:28 pm
JacobKlein
The last driver for my FX 3800M, was 12/5/2014, and did not include the fix. When can we expect a fixed driver to be released? Thanks, Jacob Klein

25 February 2015 10:54 pm
Kevin Kang
Hi Jacob The latest R340 driver version - 341.44 was posted on February 24, 2015, it should contain this fix, please let us know if it works for your FX 3800M GPU. Thanks! http://www.nvidia.com/download/driverResults.aspx/82388/en-us Thanks, Kevin

26 February 2015 4:47 am
JacobKlein
Hi Kevin/Team,

Yes, 341.44 does solve the problems I was having with SDK examples 19, 21, 26, 34, 36, 37 on R340 drivers. And it looks like it also solves the issue with SDK examples 2, 7, which affected R337 and R340. And it also appears to solve the issue in Bug 1554016, though that bug reporter will need to do final verification.

Questions:
1) SDK example 9 still fails with a TDR and output "Out of Memory? - Error # -5 (CL_OUT_OF_RESOURCES) at line 248 , in file .\oclMatVecMul.cpp" --- Is that expected? And can you reproduce that error?
2) SDK example 13 crashes on exit - is that expected?
3) On several examples, my FX3800M output value for "CL_DEVICE_LOCAL_MEM_SIZE" reports "15 KByte", yet in R337 it reported "16 KByte"; which value is correct?

Thank you,
Jacob Klein

4 March 2015 7:22 am
Kevin Kang
Hi Jacob,

Sorry for the late response to your questions. Please refer to my inline comments below:

1) SDK example 9 still fails with a TDR and output "Out of Memory? - Error # -5 (CL_OUT_OF_RESOURCES) at line 248 , in file .\oclMatVecMul.cpp" --- Is that expected? And can you reproduce that error?
[Kevin]: As I don't have a platform with an FX 3800M GPU to reproduce this exactly for now, I tried running the sample "oclMatVecMul" on 9600 GT, GTS 250 and GTX 260 GPUs. Unfortunately, I didn't reproduce this error with the 341.44 or 337.88 drivers on these GPUs. Could you please retry it with TdrDelay set to a sufficiently high value and check whether it persists? Thanks!

2) SDK example 13 crashes on exit - is that expected?
[Kevin]: Yes, I can reproduce the crash while exiting the sample "oclSimpleD3D10Texture" by pressing "ESC" or clicking "X". Our developer identified it as a known issue in this OpenCL sample itself, and it would be fixed in a future OpenCL samples release.

3) On several examples, my FX3800M output value for "CL_DEVICE_LOCAL_MEM_SIZE" reports "15 KByte", yet in R337 it reported "16 KByte"; which value is correct?
[Kevin]: I can reproduce this behavior in the latest R340 driver version 341.44; the "16 KByte" should be correct. We have tracked this issue in an internal bug report and our developers are working on a fix for this issue. Thanks for reporting this to us and sorry for any inconvenience!

Thanks,
Kevin
ID: 1649554 ·
jason_gee
Volunteer developer
Volunteer tester

Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Message 1649633 - Posted: 5 Mar 2015, 17:58:00 UTC - in response to Message 1649554.

Yeah, Kevin's great :) dealt with him on both main problem occasions, first time being when they broke CUFFT for Fermi+, then a driver installation issue whereby you couldn't install in Windows without a C Drive. Both things were fixed pretty promptly, though older stock Cuda apps required a workaround for the CUFFT problems.

With that TDR delay request in his latest, IIRC there is a setting in nSight if you have that installed, rather than going into the registry. It's understandable that huge long running Kernels in the samples may be intended for non-display devices.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1649633 ·
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 54
Credit: 9,005,510
RAC: 5,219
Message 1649712 - Posted: 5 Mar 2015, 21:17:11 UTC

My response to Kevin:

5 March 2015 1:15 pm
JacobKlein
Hi Kevin,

Thanks for getting back to me.

1) SDK Example 9 - TDR: When I created the TdrDelay registry key, and supplied a value of 10 seconds "0xA" (instead of the default of 2 seconds), and restarted the laptop... the "oclMatVecMul" test worked just fine and PASSED on the FX3800M. So, I guess the example does some call that happens to take 2+ seconds on an FX3800M, and TDRs. No big deal, I suppose, unless you deem otherwise.

2) SDK Example 13 - Crash on exit: I'm glad you can reproduce this. Since this is an issue with the "oclSimpleD3D10Texture" example, rather than a driver issue, I'm not too worried about it.

3) Incorrect reporting of "CL_DEVICE_LOCAL_MEM_SIZE": Since this is an actual driver issue, that surfaced in my examination of these SDK examples, I'd like to continue tracking the resolution of it. Do you think it would be ok/appropriate, if we leave this Bug 1574543 open, until it is resolved, so I can track it?

Thanks for everything you've done. Looking forward to the driver being as correct as it can be, before being retired :)

- Jacob
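
For anyone wanting to repeat the TdrDelay change described above, here is a minimal sketch that only builds the text of a .reg file; it does not touch the registry itself. The key path and the REG_DWORD type are the standard Windows TDR settings; the function name `tdr_delay_reg_text` is made up for this illustration. The value is in seconds, and as noted above a restart was needed before it took effect.

```python
# Sketch only: produces .reg file text, it does NOT modify the registry.
# The helper name is hypothetical; the key and value name are the standard
# Windows TDR settings (TdrDelay is a REG_DWORD holding seconds).
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def tdr_delay_reg_text(seconds: int) -> str:
    """Return .reg file content that sets TdrDelay to `seconds`."""
    if not 0 <= seconds <= 0xFFFFFFFF:
        raise ValueError("TdrDelay must fit in a 32-bit DWORD")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{GRAPHICS_DRIVERS_KEY}]\n"
        f'"TdrDelay"=dword:{seconds:08x}\n'
    )


if __name__ == "__main__":
    # 10 seconds, i.e. 0xA, the value used in the message above.
    print(tdr_delay_reg_text(10))
```

Saving the printed text as a .reg file and importing it (with administrator rights) would be one way to apply it; deleting the TdrDelay value returns Windows to its default.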
ID: 1649712 ·
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,402,007
RAC: 588
Message 1649756 - Posted: 5 Mar 2015, 23:27:45 UTC - in response to Message 1649554.

I wanted to relay ongoing correspondence, about the remaining OpenCL and SDK issues that I've noticed, and that NVIDIA seems willing to investigate.

There still is the NV OpenCL 100% CPU usage Bug, I don't think that has been fixed yet:

Very nasty bug (feature?) was discovered recently in attempt to reduce CPU usage of OpenCL NV AP app: asynchronous buffer reads actually done as synchronous ones.
Bug was filed via nVidia CUDA registered developer program.
I got response with request for test case and thorough explanation of what is buggy behavior in this case. Explanations and test case were provided more than week ago - no signs of progress from that time.

Claggy
ID: 1649756 ·
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 54
Credit: 9,005,510
RAC: 5,219
Message 1649757 - Posted: 5 Mar 2015, 23:35:23 UTC

If you have an issue that is not the same as the ones I've identified, consider applying additional pressure by filing bugs, or editing bugs to ask for status.

My existing bug report is about OpenCL SDK example failures on pre-Fermi GPUs, and I am not going to add more problems to it.
ID: 1649757 ·
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,402,007
RAC: 588
Message 1649763 - Posted: 6 Mar 2015, 0:05:14 UTC - in response to Message 1649757.

If you have an issue that is not the same as the ones I've identified, consider applying additional pressure by filing bugs, or editing bugs to ask for status.

My existing bug report is about OpenCL SDK example failures on pre-Fermi GPUs, and I am not going to add more problems to it.

That's Raistmer's bug report; he'll have the ticket number. I don't know it and can't find it. It's been ongoing for four years now.

Claggy
ID: 1649763 ·
Wedge009
Volunteer tester

Joined: 3 Apr 99
Posts: 450
Credit: 383,574,974
RAC: 193,116
Message 1650160 - Posted: 7 Mar 2015, 0:11:32 UTC

I would regard the synchronous reads as the more serious problem. While there is a thread-sleep work-around, I find the performance penalty greater on some hosts than others. I know the other parameters need adjusting with use of the use_sleep option, but even so there's always some penalty involved.

But back to topic, it's encouraging to know at least some community concerns are being taken seriously.
Soli Deo Gloria
ID: 1650160 ·
Richard Haselgrove
Volunteer tester

Joined: 4 Jul 99
Posts: 13111
Credit: 147,944,751
RAC: 182,212
Message 1650429 - Posted: 7 Mar 2015, 20:41:26 UTC - in response to Message 1648520.

Any confirmation that new build works OK with 341.44 driver on pre-FERMI cards?

I'd say not. I'm getting

AP7_win_x86_SSE2_OpenCL_NV_r2745.exe -verbose / LoThresh_v5.dat :
AppName: AP7_win_x86_SSE2_OpenCL_NV_r2745.exe
AppArgs: -verbose
Started at : 19:38:27.563
Ended at : 19:43:27.773
300.183 secs Elapsed
0.062 secs CPU time
Speedup : 99.94%
Ratio : 1641.27x

[ stderr ]
19:38:27 (1592): Can't open init data file - running in standalone mode
Not using ap_cmdline.txt-file, using commandline options.
19:38:27 (1592): Can't open init data file - running in standalone mode
Priority of worker thread raised successfully
Priority of process adjusted successfully, below normal priority class used
19:38:27 (1592): Can't open init data file - running in standalone mode
WARNING: init_data.xml missing
OpenCL platform detected: Intel(R) Corporation
OpenCL platform detected: NVIDIA Corporation
WARNING: BOINC supplied wrong platform!
BOINC assigns device 0
WARNING: BOINC failed to provide OpenCL device, using own enumeration abilities
ERROR: pre-FERMI device with unsupported 341 driver detected, NV broke support starting from 340.52, can't continue on such device, please downgrade driver below 340.xx. Exiting...
[ /stderr ]

So it runs/waits for 5 minutes every bench, but doesn't actually do anything. Driver is 341.44, so the major version is being detected correctly, but the minor version isn't being echoed. No OpenCL report either, using the APbench211_minimal as you advised NVidia to test with. Did you say you'd disabled the original documented (stock Berkeley code) -verbose switch in favour of some homebrew variants?
ID: 1650429 ·
Raistmer
Volunteer developer
Volunteer tester

Joined: 16 Jun 01
Posts: 6115
Credit: 98,270,211
RAC: 47,222
Message 1650455 - Posted: 7 Mar 2015, 22:25:34 UTC - in response to Message 1650429.

Did you say you'd disabled the original documented (stock Berkeley code) -verbose switch in favour of some homebrew variants?

??? more verbose on this :) what switch? what "default" ... ?
ID: 1650455 ·
Raistmer
Volunteer developer
Volunteer tester

Joined: 16 Jun 01
Posts: 6115
Credit: 98,270,211
RAC: 47,222
Message 1650466 - Posted: 7 Mar 2015, 23:03:21 UTC - in response to Message 1650429.

Any confirmation that new build works OK with 341.44 driver on pre-FERMI cards?

I'd say not.

try updated build: https://www.dropbox.com/s/v8g8a4la4j6osk5/AP7_win_x86_SSE2_OpenCL_NV_r2745.7z?dl=0
ID: 1650466 ·
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
Message 1650510 - Posted: 8 Mar 2015, 1:39:50 UTC - in response to Message 1650455.

Did you say you'd disabled the original documented (stock Berkeley code) -verbose switch in favour of some homebrew variants?

??? more verbose on this :) what switch? what "default" ... ?

LoL. Astropulse never had a -verbose switch, but seti_boinc did and APbench211.cmd happens to have it as a default for both reference and science_apps runs, that simply wasn't removed when knabench was ported for AP testing.
Joe
ID: 1650510 ·
Richard Haselgrove
Volunteer tester

Joined: 4 Jul 99
Posts: 13111
Credit: 147,944,751
RAC: 182,212
Message 1650629 - Posted: 8 Mar 2015, 13:39:19 UTC - in response to Message 1650510.

Did you say you'd disabled the original documented (stock Berkeley code) -verbose switch in favour of some homebrew variants?

??? more verbose on this :) what switch? what "default" ... ?

LoL. Astropulse never had a -verbose switch, but seti_boinc did and APbench211.cmd happens to have it as a default for both reference and science_apps runs, that simply wasn't removed when knabench was ported for AP testing.
Joe

LOL indeed. OK, sorry - I'd forgotten how far Josh's code diverged from established Berkeley standards. Forget I spoke.
ID: 1650629 ·
Richard Haselgrove
Volunteer tester

Joined: 4 Jul 99
Posts: 13111
Credit: 147,944,751
RAC: 182,212
Message 1650631 - Posted: 8 Mar 2015, 13:42:34 UTC - in response to Message 1650466.

Any confirmation that new build works OK with 341.44 driver on pre-FERMI cards?

I'd say not.

try updated build: https://www.dropbox.com/s/v8g8a4la4j6osk5/AP7_win_x86_SSE2_OpenCL_NV_r2745.7z?dl=0

Thanks. That one looks much happier:

ref-AP7_win_x64_AVX_CPU_r2692.exe-single_pulses.wu.res: <ap_signal>44,<pulses>34,<best_pulses>10
result-AP7_win_x86_SSE2_OpenCL_NV_r2745.exe-single_pulses.wu.res: <ap_signal>44,<pulses>34,<best_pulses>10
All Signals: Weakly similar or Different.
Pulses: Checked 34, 34 , Strongly Similar
Best Pulses: Weakly similar or Different.

-(.\testDatas\ref\ref-AP7_win_x64_AVX_CPU_r2692.exe-single_pulses.wu.res)-
Reportable Single Pulses: 4 [OK], 3 above threshold*THRESHOLD_FUDGE
Reportable Repeating Pulses: 30 [OK]
Single Pulses (Best): 10 [Weak], 3 above threshold*THRESHOLD_FUDGE

-(.\testDatas\result-AP7_win_x86_SSE2_OpenCL_NV_r2745.exe-single_pulses.wu.res)-

Reportable Single Pulses: 4 [OK], 3 above threshold*THRESHOLD_FUDGE
Reportable Repeating Pulses: 30 [OK]
Single Pulses (Best): 10 [Weak], 3 above threshold*THRESHOLD_FUDGE

ref-AP7_win_x86_SSE2_OpenCL_NV_r2690.exe-single_pulses.wu.res: <ap_signal>44,<pulses>34,<best_pulses>10
result-AP7_win_x86_SSE2_OpenCL_NV_r2745.exe-single_pulses.wu.res: <ap_signal>44,<pulses>34,<best_pulses>10
All Signals: Checked 44, 44 , Strongly Similar
Pulses: Checked 34, 34 , Strongly Similar
Best Pulses: Checked 10, 10 , Strongly Similar

-(.\testDatas\ref\ref-AP7_win_x86_SSE2_OpenCL_NV_r2690.exe-single_pulses.wu.res)-
Reportable Single Pulses: 4 [OK], 3 above threshold*THRESHOLD_FUDGE
Reportable Repeating Pulses: 30 [OK]
Single Pulses (Best): 10 [OK], 3 above threshold*THRESHOLD_FUDGE

-(.\testDatas\result-AP7_win_x86_SSE2_OpenCL_NV_r2745.exe-single_pulses.wu.res)-

Reportable Single Pulses: 4 [OK], 3 above threshold*THRESHOLD_FUDGE
Reportable Repeating Pulses: 30 [OK]
Single Pulses (Best): 10 [OK], 3 above threshold*THRESHOLD_FUDGE

I'll run through the rest of the standard test set (having already generated reference results using Joe's (I think) AP7_win_x64_AVX_CPU_r2692 to save time). Any other specific cases that need checking?
ID: 1650631 ·
Raistmer
Volunteer developer
Volunteer tester

Joined: 16 Jun 01
Posts: 6115
Credit: 98,270,211
RAC: 47,222
Message 1650634 - Posted: 8 Mar 2015, 13:48:38 UTC - in response to Message 1650631.

using Joe's (I think) AP7_win_x64_AVX_CPU_r2692 to save time). Any other specific cases that need checking?

Well, AP mostly from me, just as stderr says. But hope this did not make ref any worse ;)
Just live runs. I'll pass the binary to Eric for beta deployment [EDIT: done]. So the delay in deployment allowed us to skip one round (the previous beta build was never deployed at all)....
ID: 1650634 ·
Richard Haselgrove
Volunteer tester

Joined: 4 Jul 99
Posts: 13111
Credit: 147,944,751
RAC: 182,212
Message 1650635 - Posted: 8 Mar 2015, 13:54:15 UTC - in response to Message 1649756.

I wanted to relay ongoing correspondence, about the remaining OpenCL and SDK issues that I've noticed, and that NVIDIA seems willing to investigate.

There still is the NV OpenCL 100% CPU usage Bug, I don't think that has been fixed yet:

Very nasty bug (feature?) was discovered recently in attempt to reduce CPU usage of OpenCL NV AP app: asynchronous buffer reads actually done as synchronous ones.
Bug was filed via nVidia CUDA registered developer program.
I got response with request for test case and thorough explanation of what is buggy behavior in this case. Explanations and test case were provided more than week ago - no signs of progress from that time.

Claggy

Can anyone remember exactly how it was worked out that asynchronous reads were being done synchronously? The bug (feature?) is still present, judging by the refresh app still using a whole core of CPU: and the NVidia SDK sample app 'oclParticles' runs continuously, long enough to verify with Process Explorer that it too uses a whole CPU core. That might be a better test case to use to reactivate this bug report, as it applies (if it applies?) to later drivers than the legacy pre-Fermi ones we're discussing here.
ID: 1650635 ·
Richard Haselgrove
Volunteer tester

Joined: 4 Jul 99
Posts: 13111
Credit: 147,944,751
RAC: 182,212
Message 1650637 - Posted: 8 Mar 2015, 14:00:10 UTC - in response to Message 1650634.

using Joe's (I think) AP7_win_x64_AVX_CPU_r2692 to save time). Any other specific cases that need checking?

Well, AP mostly from me, just as stderr says. But hope this did not make ref any worse ;)
Just live runs. I'll pass binary to Eric for beta deployment. So, delay in deployment allowed to skip one round (prev beta build remained undeployed at all)....

Remind him to tweak the plan_class major/minor filter too, so that work gets allocated appropriately.

It might be appropriate to tweak the error message

ERROR: pre-FERMI device with unsupported 341 driver detected, NV broke support starting from 340.52, can't continue on such device, please downgrade driver below 340.xx. Exiting...

to say that an upgrade to 341.44 or above is also permitted, if they happen to run with one of the dodgy intermediate drivers.
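
To make the suggested wording concrete, here is a rough sketch of a version check that accepts both a downgrade and an upgrade. It is not the app's actual code and not the server-side plan_class filter; the function names are invented for illustration. The two boundaries are taken from this thread only: the error message says support broke starting from 340.52, and 341.44 is reported above as working again.

```python
# Hypothetical sketch, not the real app or plan_class logic.
# Boundaries come from this thread: broken from 340.52, fixed in 341.44.
FIRST_BROKEN = (340, 52)
FIRST_FIXED = (341, 44)


def parse_driver_version(text: str) -> tuple:
    """Turn a 'major.minor' driver string such as '341.44' into (341, 44)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def pre_fermi_driver_ok(version: str) -> bool:
    """True if the driver is outside the broken 340.52 .. 341.43 range."""
    v = parse_driver_version(version)
    return v < FIRST_BROKEN or v >= FIRST_FIXED


def driver_advice(version: str) -> str:
    """Build a message that mentions both ways out of the broken range."""
    if pre_fermi_driver_ok(version):
        return f"driver {version}: OK for pre-FERMI OpenCL"
    return (
        f"driver {version}: unsupported on pre-FERMI devices; "
        "downgrade below 340.52 or upgrade to 341.44 or above"
    )


if __name__ == "__main__":
    for v in ("337.88", "340.52", "341.44"):
        print(driver_advice(v))
```

Comparing (major, minor) tuples rather than only the major number is what lets 341.44 pass while the intermediate 340.xx and early 341.xx drivers are still rejected.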
ID: 1650637 ·
Raistmer
Volunteer developer
Volunteer tester

Joined: 16 Jun 01
Posts: 6115
Credit: 98,270,211
RAC: 47,222
Message 1650639 - Posted: 8 Mar 2015, 14:16:04 UTC - in response to Message 1650635.

Can anyone remember exactly how it was worked out that asynchronous reads were being done synchronously?

I'm not inclined to consider these 2 issues as tightly linked.
Full CPU core usage comes from a CPU polling loop inside the driver, or from some yield being used instead of a delay for some time or interrupts.
It was precisely demonstrated on Linux; a workaround was even developed with a nanosleep call instead of yield. Unfortunately, on Windows the best one can use (AFAIK) is a 1 ms delay, which is ages on a computer's time scale. Also, not all systems support 1 ms; usually the real quantum is bigger. So the -use_sleep option fights this issue, but in a quite suboptimal way (hence bigger kernels are suggested, fighting one type of overhead while simultaneously introducing others). But such a sync method, not being specified in the OpenCL standard, can't be considered a pure bug; it's an inefficiency (CUDA has the ability to choose the sync mode, while OpenCL is locked to this not very efficient one in the nVidia runtime).
The bug I reported, regarding sync buffer reads instead of async ones, should be considered a "real" bug IMO. The specification says that asynchronous buffer transfers should return immediately. But they don't. This makes the whole set of async* memory routines redundant, because they act just like the sync ones.
To check whether this was fixed or not (I received no reports from NV that it was, unlike the situation with the last bug, where a comment was added and I will now reply regarding the successful fix), one needs to put a sleeping loop after the "async" read.
If it still reduces CPU time, then yes, the bug is fixed. If -use_sleep becomes useless, then nothing changed. But in both cases the app w/o -use_sleep will experience the 100% CPU usage inefficiency/bug.
P.S. I'll prepare the binary needed for this kind of testing and post a link in this thread then.
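
The idea of the test can be modelled without any GPU. The sketch below is plain Python, not OpenCL and not the AP app's code: a background thread stands in for the device-to-host copy, and the `read_blocks` flag (a made-up name) imitates an "async" read call that only returns once the copy has finished. It counts how often a 1 ms sleep loop placed right after the read actually runs.

```python
import threading
import time


def sleep_loop_iterations(read_blocks: bool, transfer_seconds: float = 0.05) -> int:
    """Model of a sleep loop placed right after an 'async' buffer read.

    read_blocks=True imitates the reported bug: the supposedly asynchronous
    call returns only after the transfer is complete.
    """
    done = threading.Event()

    def transfer() -> None:
        time.sleep(transfer_seconds)  # stands in for the actual copy
        done.set()

    worker = threading.Thread(target=transfer)
    worker.start()  # stands in for enqueueing the read
    if read_blocks:
        done.wait()  # buggy behaviour: the call does not return early

    iterations = 0
    while not done.is_set():
        time.sleep(0.001)  # sleep instead of spinning on the CPU
        iterations += 1
    worker.join()
    return iterations


if __name__ == "__main__":
    print("truly async read:", sleep_loop_iterations(False), "sleep iterations")
    print("blocking 'async' read:", sleep_loop_iterations(True), "sleep iterations")
```

With a truly asynchronous read the loop runs many times and the waiting is spent asleep; with a blocking read the loop body never executes, so the sleep cannot save any CPU time. That is the difference the shifted sleep loop is meant to expose.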
ID: 1650639 ·
jason_gee
Volunteer developer
Volunteer tester

Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Message 1650655 - Posted: 8 Mar 2015, 15:34:36 UTC - in response to Message 1650635.

Can anyone remember exactly how it was worked out that asynchronous reads were being done synchronously? The bug (feature?) is still present, judging by the refresh app still using a whole core of CPU: and the NVidia SDK sample app 'oclParticles' runs continuously, long enough to verify with Process Explorer that it too uses a whole CPU core. That might be a better test case to use to reactivate this bug report, as it applies (if it applies?) to later drivers than the legacy pre-Fermi ones we're discussing here.

Been a long time. I currently only recall one nv openCL demo (an ocean one from perhaps 3.2 or so), that was built to use OpenCL and very little CPU (even less than Cuda blocking syncs!). That one used a freeGLUT library for the timer/render loop to set OS mutexes or events (instead of relying on CPU blocking old Cuda style). Common (industry, not here) practice has been to use these precision multimedia timers & non blocking behaviour, so that program main threads can remain ultra-responsive by staying mostly idle/asleep. Unfortunately the current boincAPI setup doesn't make that easy, having the work in the main worker and messaging in a timer thread... which is back-asswards.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1650655 ·
Richard Haselgrove
Volunteer tester

Joined: 4 Jul 99
Posts: 13111
Credit: 147,944,751
RAC: 182,212
Message 1650703 - Posted: 8 Mar 2015, 16:46:50 UTC - in response to Message 1650655.

Can anyone remember exactly how it was worked out that asynchronous reads were being done synchronously? The bug (feature?) is still present, judging by the refresh app still using a whole core of CPU: and the NVidia SDK sample app 'oclParticles' runs continuously, long enough to verify with Process Explorer that it too uses a whole CPU core. That might be a better test case to use to reactivate this bug report, as it applies (if it applies?) to later drivers than the legacy pre-Fermi ones we're discussing here.

Been a long time. I currently only recall one nv openCL demo (an ocean one from perhaps 3.2 or so), that was built to use OpenCL and very little CPU (even less than Cuda blocking syncs!). That one used a freeGLUT library for the timer/render loop to set OS mutexes or events (instead of relying on CPU blocking old Cuda style). Common (industry, not here) practice has been to use these precision multimedia timers & non blocking behaviour, so that program main threads can remain ultra-responsive by staying mostly idle/asleep. Unfortunately the current boincAPI setup doesn't make that easy, having the work in the main worker and messaging in a timer thread... which is back-asswards.

We're currently referring to the demos (with source code) at https://developer.nvidia.com/opencl
ID: 1650703 ·
Raistmer
Volunteer developer
Volunteer tester

Joined: 16 Jun 01
Posts: 6115
Credit: 98,270,211
RAC: 47,222
Message 1650705 - Posted: 8 Mar 2015, 16:49:07 UTC

Well, to re-test that sync/async issue, this binary https://www.dropbox.com/s/nwtnomc5m6uhxen/AP7_win_x86_SSE2_OpenCL_NV_r2745_sleep_loop_shifted.exe.7z?dl=0 can be dropped alongside the usual one into the benchmark.
Then I propose to make 3 runs on, let's say, the Clean20 task:

1) Both w/o switches: times should be comparable. If the CPU usage inefficiency/bug is present, both will take CPU ~ Elapsed time.

2) Both with the -use_sleep option added. If the bug under discussion is fixed, times should be comparable again, and CPU should be lower than in 1). If the bug is still present, the shifted sleep loop will be executed only when the actual sync is complete, hence it will not save any CPU time. Hence, a big difference in CPU times is expected versus the usual build.

3) -use_sleep -v 6 for both. The stderr will show clearly how many times the sleeping loop executed for both variants.
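
When comparing the runs, the number to look at is CPU time relative to elapsed time. A small sketch for pulling that out of the benchmark output follows; it assumes the "secs Elapsed" and "secs CPU time" lines look like the ones posted earlier in this thread, and the function name `cpu_fraction` is invented here. The sample figures are just that earlier (failed) run, reused only for its format.

```python
import re


def cpu_fraction(bench_text: str) -> float:
    """Return CPU time divided by elapsed time from benchmark output text."""
    elapsed = re.search(r"([\d.]+)\s+secs Elapsed", bench_text)
    cpu = re.search(r"([\d.]+)\s+secs CPU time", bench_text)
    if elapsed is None or cpu is None:
        raise ValueError("could not find Elapsed / CPU time lines")
    elapsed_s = float(elapsed.group(1))
    if elapsed_s == 0:
        raise ValueError("elapsed time is zero")
    return float(cpu.group(1)) / elapsed_s


# Format copied from the output posted earlier in the thread.
SAMPLE = """\
300.183 secs Elapsed
0.062 secs CPU time
"""

if __name__ == "__main__":
    print(f"CPU/Elapsed = {cpu_fraction(SAMPLE):.4f}")
```

A fraction near 1.0 in run 1 would correspond to "CPU ~ Elapsed"; in run 2 the interesting part is whether the fraction drops for both builds or only for the usual one.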
ID: 1650705 ·