Message boards :
Number crunching :
@Pre-FERMI nVidia GPU users: Important warning
Jacob Klein Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
I wanted to relay ongoing correspondence, about the remaining OpenCL and SDK issues that I've noticed, and that NVIDIA seems willing to investigate. I doubt that these will affect the AstroPulse app, so as far as SETI is concerned, I still fully recommend continuing to test the 341.44 drivers, to make sure they run the app correctly, so that the version ban for 341.44 can be lifted. If you guys want me to help test that in any way, on my FX3800M GPU, just give me clear instructions and I will try to help. Ongoing Correspondence:
|
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yeah, Kevin's great :) I dealt with him on both main problem occasions: the first being when they broke CUFFT for Fermi+, then a driver installation issue whereby you couldn't install in Windows without a C drive. Both things were fixed pretty promptly, though older stock Cuda apps required a workaround for the CUFFT problems. As for the TDR delay request in his latest, IIRC there is a setting in nSight, if you have that installed, rather than going into the registry. It's understandable that huge long-running kernels in the samples may be intended for non-display devices. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
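For reference, the TDR delay Kevin mentions lives under the documented GraphicsDrivers registry key on Windows (nSight exposes the same setting in its UI). A minimal sketch; the 10-second value here is purely illustrative:

```reg
Windows Registry Editor Version 5.00

; TDR (Timeout Detection and Recovery) settings -- illustrative value only.
; TdrDelay is the number of seconds Windows waits for a GPU kernel to
; respond before resetting the driver (the default is 2 on Vista and later).
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000000a
```

Raising it lets long-running compute kernels finish on a display device, at the cost of a less responsive desktop if a kernel really does hang.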
Jacob Klein Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
My response to Kevin: 5 March 2015 1:15 pm |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
I wanted to relay ongoing correspondence, about the remaining OpenCL and SDK issues that I've noticed, and that NVIDIA seems willing to investigate. There is still the NV OpenCL 100% CPU usage bug; I don't think that has been fixed yet: http://setiathome.berkeley.edu/forum_thread.php?id=73290&postid=1443401 A very nasty bug (feature?) was discovered recently in an attempt to reduce the CPU usage of the OpenCL NV AP app: asynchronous buffer reads are actually done as synchronous ones. http://lunatics.kwsn.net/gpu-crunching/nv-blocking-asynchronous-buffer-read-test-case.msg54111.html#msg54111 Claggy |
Jacob Klein Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
If you have an issue that is not the same as the ones I've identified, consider applying additional pressure by filing bugs, or editing bugs to ask for status. My existing bug report is about OpenCL SDK example failures on pre-Fermi GPUs, and I am not going to add more problems to it. |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
If you have an issue that is not the same as the ones I've identified, consider applying additional pressure by filing bugs, or editing bugs to ask for status. That's Raistmer's bug report; he'll have the ticket number. I don't know it and can't find it. It's been ongoing for four years now. Claggy |
Wedge009 Send message Joined: 3 Apr 99 Posts: 451 Credit: 431,396,357 RAC: 553 |
I would regard the synchronous reads as the more serious problem. While there is a thread-sleep work-around, I find the performance penalty greater on some hosts than others. I know the other parameters need adjusting with use of the use_sleep option, but even so there's always some penalty involved. But back to topic, it's encouraging to know at least some community concerns are being taken seriously. Soli Deo Gloria |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Any confirmation that new build works OK with 341.44 driver on pre-FERMI cards? I'd say not. I'm getting AP7_win_x86_SSE2_OpenCL_NV_r2745.exe -verbose / LoThresh_v5.dat : So it runs/waits for 5 minutes every bench, but doesn't actually do anything. Driver is 341.44, so the major version is being detected correctly, but the minor version isn't being echoed. No OpenCL report either, using the APbench211_minimal as you advised NVidia to test with. Did you say you'd disabled the original documented (stock Berkeley code) -verbose switch in favour of some homebrew variants? |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Did you say you'd disabled the original documented (stock Berkeley code) -verbose switch in favour of some homebrew variants? ??? Please be more verbose on this :) What switch? What "default" ... ? |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Any confirmation that new build works OK with 341.44 driver on pre-FERMI cards? try updated build: https://www.dropbox.com/s/v8g8a4la4j6osk5/AP7_win_x86_SSE2_OpenCL_NV_r2745.7z?dl=0 |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Did you say you'd disabled the original documented (stock Berkeley code) -verbose switch in favour of some homebrew variants? LoL. Astropulse never had a -verbose switch, but seti_boinc did, and APbench211.cmd happens to have it as a default for both reference and science_apps runs; it simply wasn't removed when knabench was ported for AP testing. Joe |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Did you say you'd disabled the original documented (stock Berkeley code) -verbose switch in favour of some homebrew variants? LOL indeed. OK, sorry - I'd forgotten how far Josh's code diverged from established Berkeley standards. Forget I spoke. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Any confirmation that new build works OK with 341.44 driver on pre-FERMI cards? Thanks. That one looks much happier: ref-AP7_win_x64_AVX_CPU_r2692.exe-single_pulses.wu.res: <ap_signal>44,<pulses>34 I'll run through the rest of the standard test set (having already generated reference results using Joe's (I think) AP7_win_x64_AVX_CPU_r2692 to save time). Any other specific cases that need checking? |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
using Joe's (I think) AP7_win_x64_AVX_CPU_r2692 to save time). Any other specific cases that need checking? Well, the AP code is mostly from me, just as stderr says. But I hope this did not make the reference build any worse ;) Just live runs remain. I'll pass the binary to Eric for beta deployment [EDIT: done]. So the delay in deployment allowed us to skip one round (the previous beta build was never deployed at all).... |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
I wanted to relay ongoing correspondence, about the remaining OpenCL and SDK issues that I've noticed, and that NVIDIA seems willing to investigate. Can anyone remember exactly how it was worked out that asynchronous reads were being done synchronously? The bug (feature?) is still present, judging by the refresh app still using a whole core of CPU: and the NVidia SDK sample app 'oclParticles' runs continuously, long enough to verify with Process Explorer that it too uses a whole CPU core. That might be a better test case to use to reactivate this bug report, as it applies (if it applies?) to later drivers than the legacy pre-Fermi ones we're discussing here. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
using Joe's (I think) AP7_win_x64_AVX_CPU_r2692 to save time). Any other specific cases that need checking? Remind him to tweak the plan_class major/minor filter too, so that work gets allocated appropriately. It might also be worth tweaking the error message ERROR: pre-FERMI device with unsupported 341 driver detected, NV broke support starting from 340.52, can't continue on such device, please downgrade driver below 340.xx. Exiting... to say that an upgrade to 341.44 or above is also permitted, in case they happen to be running one of the dodgy intermediate drivers. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
I'm not inclined to consider these two issues as tightly linked. The full CPU core usage comes from a CPU polling loop inside the driver, or from some yield instead of delaying for a while / using interrupts. It was precisely demonstrated on Linux, where even a workaround was developed, using a nanosleep call instead of yield. Unfortunately, on Windows the best one can use (AFAIK) is a 1 ms delay, which is ages on a computer's timescale. Also, not all systems support 1 ms; usually the real quantum is bigger. So the -use_sleep option fights this issue, but in a quite suboptimal way (hence bigger kernels were suggested, fighting one type of overhead while simultaneously introducing other ones). But such a sync method, not being specified in the OpenCL standard, can't be considered a pure bug; it's an inefficiency (CUDA has the ability to choose the sync mode; OpenCL is locked to this not-too-efficient one in the nVidia runtime). The bug I reported, regarding synchronous buffer reads instead of asynchronous ones, should be considered a "real" bug IMO. The specification says that asynchronous buffer transfers should return immediately. But they don't. This makes the whole family of async* memory routines redundant, because they act just as sync ones. To check whether this was fixed (I received no report from NV that it was, unlike the situation with the last bug, where a comment was added and I will now reply regarding the successful fix), one needs to put a sleeping loop after the "async" read. If it still reduces CPU time, then yes, the bug is fixed. If -use_sleep becomes useless, then nothing changed. But in both cases the app w/o -use_sleep will experience the 100% CPU usage inefficiency/bug. P.S. I'll prepare the binary needed for this kind of testing and post a link in this thread then. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Can anyone remember exactly how it was worked out that asynchronous reads were being done synchronously? The bug (feature?) is still present, judging by the refresh app still using a whole core of CPU: and the NVidia SDK sample app 'oclParticles' runs continuously, long enough to verify with Process Explorer that it too uses a whole CPU core. That might be a better test case to use to reactivate this bug report, as it applies (if it applies?) to later drivers than the legacy pre-Fermi ones we're discussing here. Been a long time. I currently recall only one nv OpenCL demo (an ocean one, from perhaps SDK 3.2 or so) that was built to use OpenCL and very little CPU (even less than Cuda blocking syncs!). That one used a freeGLUT library for the timer/render loop to set OS mutexes or events (instead of relying on CPU blocking, old Cuda style). Common (industry, not here) practice has been to use these precision multimedia timers and non-blocking behaviour, so that program main threads can remain ultra-responsive by staying mostly idle/asleep. Unfortunately the current boincAPI setup doesn't make that easy, having the work in the main worker thread and the messaging in a timer thread... which is back-asswards. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Can anyone remember exactly how it was worked out that asynchronous reads were being done synchronously? The bug (feature?) is still present, judging by the refresh app still using a whole core of CPU: and the NVidia SDK sample app 'oclParticles' runs continuously, long enough to verify with Process Explorer that it too uses a whole CPU core. That might be a better test case to use to reactivate this bug report, as it applies (if it applies?) to later drivers than the legacy pre-Fermi ones we're discussing here. We're currently referring to the demos (with source code) at https://developer.nvidia.com/opencl |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Well, to re-test that sync/async issue, this binary https://www.dropbox.com/s/nwtnomc5m6uhxen/AP7_win_x86_SSE2_OpenCL_NV_r2745_sleep_loop_shifted.exe.7z?dl=0 can be dropped alongside the usual one into the benchmark. Then I propose to make 3 runs on, let's say, a Clean20 task: 1) Both w/o switches: times should be comparable. If the CPU usage inefficiency/bug is present, both will take CPU time ~ elapsed time. 2) Both with the -use_sleep option added: if the bug under discussion is fixed, times should be comparable again, and CPU should be lower than in 1). If the bug is still present, the shifted sleep loop will be executed only after the actual sync is already complete, hence it will not save any CPU time, so a big difference in CPU times is expected versus the usual build. 3) -use_sleep -v 6 for both: stderr will show clearly how many times the sleeping loop executed for each variant. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.