NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units

Author	Message
Jacob Klein Volunteer tester Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9	Message 2022663 - Posted: 10 Dec 2019, 22:55:10 UTC NVIDIA released 441.66 drivers today. I tested them, and they still have the "SETI OpenCL SoG VHAR on Windows 10" problems: Maxwell: > Tasks crash with error. >ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995. Pascal/Turing: > Tasks run indefinitely with no load on the GPU. 431.60 are the last drivers that work correctly for those specific SETI tasks on Windows 10. NVIDIA is aware, and per NVIDIA, we must continue to be patient for a driver version that includes a fix. ID: 2022663 · Reply Quote
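The version boundaries reported in this thread can be turned into a quick check. This is a minimal sketch only: `driver_status` is a hypothetical helper, not part of BOINC or any NVIDIA tool. It treats 431.60 and earlier as working and 436.xx and later as affected. Versions in between are marked unverified, because the thread does not report on them.

```python
from typing import Tuple

# Boundaries taken from the thread: 431.60 is the last driver reported to
# work, and 436.xx and later are reported as affected.
LAST_KNOWN_GOOD = (431, 60)
FIRST_KNOWN_BAD = (436, 0)


def parse_version(version: str) -> Tuple[int, int]:
    """Split a driver version string such as '441.66' into (major, minor)."""
    major, minor = version.strip().split(".")
    return int(major), int(minor)


def driver_status(version: str) -> str:
    """Return 'ok', 'affected', or 'unverified' (hypothetical labels)."""
    parsed = parse_version(version)
    if parsed <= LAST_KNOWN_GOOD:
        return "ok"
    if parsed >= FIRST_KNOWN_BAD:
        return "affected"
    # The thread says nothing about versions between the two boundaries.
    return "unverified"
```

For example, `driver_status("441.66")` falls in the affected range, while `driver_status("431.60")` does not.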

Lazydude Volunteer tester Send message Joined: 17 Jan 01 Posts: 45 Credit: 96,158,001 RAC: 136	Message 2022854 - Posted: 12 Dec 2019, 9:16:38 UTC - in response to Message 2020647. It's looking more and more like nvidia isn't going to fix this. We may need to look at other options, either wider adoption of the sah app that doesn't have this issue, or some tweaking on the distribution servers to not send Arecibo VHAR tasks to Nvidia GPUs on Windows 10. I'm not sure about this. VLARs are working on Nvidia cards today, as I understand it. Why not "skip" VLAR and reuse the code and call it VHAR instead? BR Lazy ID: 2022854 · Reply Quote

Wiggo Send message Joined: 24 Jan 00 Posts: 34872 Credit: 261,360,520 RAC: 489	Message 2022856 - Posted: 12 Dec 2019, 9:24:30 UTC - in response to Message 2022854. It's looking more and more like nvidia isn't going to fix this. We may need to look at other options, either wider adoption of the sah app that doesn't have this issue, or some tweaking on the distribution servers to not send Arecibo VHAR tasks to Nvidia GPUs on Windows 10. I'm not sure about this. VLARs are working on Nvidia cards today, as I understand it. Why not "skip" VLAR and reuse the code and call it VHAR instead? BR Lazy Yes, they don't work too badly under the newer OpenCL apps on newer hardware, but they brought systems to a grinding halt under the older CUDA apps on older hardware, and both are still being used today. ;-) Cheers. ID: 2022856 · Reply Quote

Lazydude Volunteer tester Send message Joined: 17 Jan 01 Posts: 45 Credit: 96,158,001 RAC: 136	Message 2022857 - Posted: 12 Dec 2019, 9:40:15 UTC - in response to Message 2022856. It's looking more and more like nvidia isn't going to fix this. We may need to look at other options, either wider adoption of the sah app that doesn't have this issue, or some tweaking on the distribution servers to not send Arecibo VHAR tasks to Nvidia GPUs on Windows 10. I'm not sure about this. VLARs are working on Nvidia cards today, as I understand it. Why not "skip" VLAR and reuse the code and call it VHAR instead? BR Lazy Yes, they don't work too badly under the newer OpenCL apps on newer hardware, but they brought systems to a grinding halt under the older CUDA apps on older hardware, and both are still being used today. ;-) Cheers. Thanks Wiggo! Then we are in another debate: how long should the project keep "oldish" apps in use? And that is for another lengthy thread. Lazy ID: 2022857 · Reply Quote

Wiggo Send message Joined: 24 Jan 00 Posts: 34872 Credit: 261,360,520 RAC: 489	Message 2022858 - Posted: 12 Dec 2019, 9:48:52 UTC - in response to Message 2022857. It's looking more and more like nvidia isn't going to fix this. We may need to look at other options, either wider adoption of the sah app that doesn't have this issue, or some tweaking on the distribution servers to not send Arecibo VHAR tasks to Nvidia GPUs on Windows 10. I'm not sure about this. VLARs are working on Nvidia cards today, as I understand it. Why not "skip" VLAR and reuse the code and call it VHAR instead? BR Lazy Yes, they don't work too badly under the newer OpenCL apps on newer hardware, but they brought systems to a grinding halt under the older CUDA apps on older hardware, and both are still being used today. ;-) Cheers. Thanks Wiggo! Then we are in another debate: how long should the project keep "oldish" apps in use? And that is for another lengthy thread. Lazy I guess that depends on how long people keep on using pre-Fermi GPUs. ;-) Cheers. ID: 2022858 · Reply Quote

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2022870 - Posted: 12 Dec 2019, 16:17:26 UTC - in response to Message 2022854. It's looking more and more like nvidia isn't going to fix this. We may need to look at other options, either wider adoption of the sah app that doesn't have this issue, or some tweaking on the distribution servers to not send Arecibo VHAR tasks to Nvidia GPUs on Windows 10. I'm not sure about this. VLARs are working on Nvidia cards today, as I understand it. Why not "skip" VLAR and reuse the code and call it VHAR instead? BR Lazy You can't just call it VHAR. They need to classify WUs as VHAR first. Right now VHARs are not differentiated (in the naming scheme) from normal WUs, since nothing in the name marks them as such. You can only tell by looking at the value of the AR. Once they do that, then yes, they could "reuse" the code to prevent this kind of task from going to certain systems. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2022870 · Reply Quote
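The server-side idea discussed here (classify the work unit first, then keep that class away from certain systems) can be illustrated with a small filter. This is a sketch under stated assumptions: `Host` and `may_send` are hypothetical names, not the actual SETI@home scheduler code, and the rule encodes only the case named in the thread (Arecibo VHAR tasks, Nvidia GPUs, Windows 10).

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical description of a volunteer host."""
    os_name: str      # e.g. "Windows 10" or "Linux"
    gpu_vendor: str   # e.g. "NVIDIA" or "AMD"


def may_send(task_class: str, host: Host) -> bool:
    """Return False only for the combination reported as broken in the thread.

    task_class is assumed to be assigned elsewhere, for example "VLAR" or
    "VHAR". The thread notes that VHAR work is not currently tagged.
    """
    blocked = (
        task_class == "VHAR"
        and host.gpu_vendor.lower() == "nvidia"
        and host.os_name.lower().startswith("windows 10")
    )
    return not blocked
```

Every other combination, including VLAR work on the same host, is left untouched by this rule.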

Lazydude Volunteer tester Send message Joined: 17 Jan 01 Posts: 45 Credit: 96,158,001 RAC: 136	Message 2022871 - Posted: 12 Dec 2019, 17:36:16 UTC - in response to Message 2022870. It's looking more and more like nvidia isn't going to fix this. We may need to look at other options, either wider adoption of the sah app that doesn't have this issue, or some tweaking on the distribution servers to not send Arecibo VHAR tasks to Nvidia GPUs on Windows 10. I'm not sure about this. VLARs are working on Nvidia cards today, as I understand it. Why not "skip" VLAR and reuse the code and call it VHAR instead? BR Lazy You can't just call it VHAR. They need to classify WUs as VHAR first. Right now VHARs are not differentiated (in the naming scheme) from normal WUs, since nothing in the name marks them as such. You can only tell by looking at the value of the AR. Once they do that, then yes, they could "reuse" the code to prevent this kind of task from going to certain systems. Maybe it was unclear: the mechanics behind VLAR could be revamped. Yes, there are many parameters that must be reconfigured: a <AR must be changed to >AR, etc. But that is only if we can scrap VLAR. ID: 2022871 · Reply Quote

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2022872 - Posted: 12 Dec 2019, 17:48:33 UTC - in response to Message 2022871. Why scrap VLAR? I don't think any cards have issues with them anymore. Ignoring VLARs would ignore a large portion of the data that people are processing. You don't need to ignore VLAR in order to add more restrictions to the VHAR tasks. They don't seem to be mutually exclusive. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2022872 · Reply Quote

Mr. Kevvy Volunteer moderator Volunteer tester Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 2022875 - Posted: 12 Dec 2019, 18:25:39 UTC - in response to Message 2022871. Last modified: 12 Dec 2019, 18:26:33 UTC But that is only if we can scrap VLAR. Did you mean VHAR? I hope so, as almost every Breakthrough Listen work unit (over 90% of what we do now) is a VLAR, because it's dedicated recording time with the target being tracked (so little to no movement of it across the field). This is in contrast to Arecibo's normal/VHAR angle range work units, where SETI@Home is "along for the ride" and not tracking the target. ID: 2022875 · Reply Quote

Lazydude Volunteer tester Send message Joined: 17 Jan 01 Posts: 45 Credit: 96,158,001 RAC: 136	Message 2022882 - Posted: 12 Dec 2019, 18:53:34 UTC - in response to Message 2022875. My bad. By "scrap" I did not mean deleting the files, but letting them in as normal work. Instead of a VLAR class, have a VHAR class, or maybe best, use 3 classes: VLAR, NAR, VHAR (???_vlar_.wu, ???_nar_.wu, ???_vhar.wu). I never wanted to delete any work, only to reclassify the work and send it to those who can handle it. ID: 2022882 · Reply Quote
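The three-class proposal can be sketched as a function of the angle range (AR). The VLAR cut-off of 0.13 is the de facto value mentioned elsewhere in this thread. The VHAR cut-off is not stated in the thread, so the 1.127 default below is an assumed placeholder. The function names and the `_vhar.wu` style suffix are hypothetical and only loosely follow the naming suggested in the post.

```python
def classify_ar(ar: float, vlar_max: float = 0.13, vhar_min: float = 1.127) -> str:
    """Classify a work unit by angle range.

    vlar_max: de facto VLAR limit quoted in the thread (AR < 0.13).
    vhar_min: ASSUMED placeholder; the thread gives no VHAR threshold.
    """
    if ar < vlar_max:
        return "VLAR"
    if ar > vhar_min:
        return "VHAR"
    return "NAR"  # normal angle range


def tagged_name(base: str, ar: float) -> str:
    """Build a hypothetical file name carrying the class tag."""
    return f"{base}_{classify_ar(ar).lower()}.wu"
```

With a tag like this in the name, a server rule or a volunteer could recognize VHAR work without inspecting the AR value itself.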

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22222 Credit: 416,307,556 RAC: 380	Message 2022883 - Posted: 12 Dec 2019, 19:01:55 UTC The "VLAR" tag in the file name has no functional meaning. It was put there many years ago to help the project identify data taken from "spotted" observations rather than "sky-sweeps", which at the time dominated what Arecibo was doing. Even today Arecibo does a lot of survey work, as it is not well suited to spotted work, unlike the highly mobile GBT & Parkes telescopes. Sky-sweeps provide a means of locating potential hot-spots, but I'm not sure if the data is used in that way or not. But only one of the "stock" applications is affected, and then only when running on one operating system with one group of drivers. If it really were a very widespread issue, rather than a single-environment one, we would be seeing a very substantial number of resends due to time-exceeded errors, which we aren't. As for nVidia not sorting out the bug they've introduced into the driver: as this probably affects some of the big commercial users of GPUs, there will be quite a bit of pressure from that quarter to get it properly fixed, and in a reasonable time frame. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 2022883 · Reply Quote

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2022886 - Posted: 12 Dec 2019, 19:31:05 UTC - in response to Message 2022883. According to Jacob, nvidia hadn't received much complaining about the issue with their drivers until he created the specific steps to reproduce the issue. I would think that if this were causing major issues affecting any large commercial uses, they would have many more points of data to look at, and probably faster than from the relatively small collection of hobbyists looking for ET with their home computers. That's probably why this is such low priority for them: the affected pool of users is very small. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2022886 · Reply Quote

Jacob Klein Volunteer tester Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9	Message 2022894 - Posted: 12 Dec 2019, 21:08:46 UTC Last modified: 12 Dec 2019, 21:18:17 UTC I would say that is a fair assessment. Also affecting priority is the loudness of the complaint. My suggestion, for people with the issue: the more people that complain using NVIDIA's driver feedback area: To report a software issue, please fill out the NVIDIA driver feedback form. This will help us collect the specific information needed to reproduce your issue and prioritize driver fixes: https://forms.gle/kJ9Bqcaicvjb82SdA ... the better the chances that the priority will go up, to get it fixed faster. * Quote is from the 441.66 driver feedback thread here: https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/332096/geforce-44166-game-ready-driver-feedback-thread-re/ ID: 2022894 · Reply Quote

robertmiles Volunteer tester Send message Joined: 16 Jan 12 Posts: 213 Credit: 4,117,756 RAC: 6	Message 2022928 - Posted: 13 Dec 2019, 1:38:54 UTC Something that might help: Create a modified version of the application that goes just past the point where the problem occurs, writes a checkpoint (to limit what gets optimized out), then halts the application. This should make it faster to check whether the problem occurs, by shortening the run time of cases where the problem does not occur. Then tell Nvidia how to run this new test case in another report of the same problem. Also list, in that problem report to Nvidia, some of the newer driver versions which have the same problem. ID: 2022928 · Reply Quote

6BQ5 Send message Joined: 7 Dec 18 Posts: 29 Credit: 12,725,636 RAC: 357	Message 2023189 - Posted: 15 Dec 2019, 3:34:08 UTC Last modified: 15 Dec 2019, 3:34:21 UTC I have a 1080 card and am using v441.66 of the GeForce drivers. Some work units get stuck at 0.630% while others blaze through. Here are the properties of one task working properly.
-------------------
Application: SETI@home v8 8.22 (opencl_nvidia_SoG)
Name: 26mr08ac.22756.40825.16.43.117.vlar
State: Running
Received: 12/13/2019 11:39:29 PM
Report deadline: 2/5/2020 4:39:12 AM
Resources: 0.486 CPUs + 1 NVIDIA GPU
Estimated computation size: 183,887 GFLOPs
CPU time: 00:00:56
CPU time since checkpoint: 00:00:56
Elapsed time: 00:01:00
Estimated time remaining: 00:05:00
Fraction done: 13.043%
Virtual memory size: 132.04 MB
Working set size: 106.56 MB
Directory: slots/4
Process ID: 8696
Progress rate: 15.330% per minute
Executable: setiathome_8.22_windows_intelx86__opencl_nvidia_SoG.exe
-------------------
Here are the properties of one that got stuck and required me to abort it.
-------------------
Application: SETI@home v8 8.22 (opencl_nvidia_SoG)
Name: 25jl12af.6862.19290.12.39.186
State: Aborted by user
Received: 12/13/2019 11:39:29 PM
Report deadline: 1/3/2020 10:49:12 AM
Resources: 0.486 CPUs + 1 NVIDIA GPU
Estimated computation size: 70,727 GFLOPs
CPU time: 00:00:08
Elapsed time: 00:03:04
Executable: setiathome_8.22_windows_intelx86__opencl_nvidia_SoG.exe
-------------------
-=- Boris ID: 2023189 · Reply Quote

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 2023190 - Posted: 15 Dec 2019, 3:45:04 UTC Your examples don't mean anything. The VLAR task is not affected because the issues are just with VHAR tasks. The one that got stuck is a VHAR task. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 2023190 · Reply Quote

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13746 Credit: 208,696,464 RAC: 304	Message 2023191 - Posted: 15 Dec 2019, 3:52:25 UTC - in response to Message 2023190. The one that got stuck is a VHAR task. And the driver you are using is one that is known to have issues. Use v431.60 or earlier. Grant Darwin NT ID: 2023191 · Reply Quote

6BQ5 Send message Joined: 7 Dec 18 Posts: 29 Credit: 12,725,636 RAC: 357	Message 2023234 - Posted: 15 Dec 2019, 17:44:39 UTC Are VLAR tasks suffixed with a ".vlar" at the end of the name? If yes, then I have to say I had one or two without a ".vlar" suffix go through before updating. Maybe I was imagining it or maybe they started before the update. I have one running right now without the ".vlar" suffix:
---------------
Application: SETI@home v8 8.22 (opencl_nvidia_SoG)
Name: 07oc11ab.26899.1294.12.39.178
State: Running
Received: 12/14/2019 2:49:24 AM
Report deadline: 2/5/2020 6:52:01 PM
Resources: 0.486 CPUs + 1 NVIDIA GPU
Estimated computation size: 185,478 GFLOPs
CPU time: 00:00:31
CPU time since checkpoint: 00:00:31
Elapsed time: 00:00:33
Estimated time remaining: 00:05:57
Fraction done: 1.813%
Virtual memory size: 167.23 MB
Working set size: 122.77 MB
Directory: slots/15
Process ID: 6296
Progress rate: 5.430% per minute
Executable: setiathome_8.22_windows_intelx86__opencl_nvidia_SoG.exe
---------------
The Name in my BOINC Manager is "07oc11ab.26899.1294.12.39.178_1". That name adds a "_1" not shown in the text above. -=- Boris ID: 2023234 · Reply Quote

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 2023235 - Posted: 15 Dec 2019, 17:55:35 UTC Yes, all VLAR tasks have a .vlar appended to the taskname. Arecibo tasks without .vlar appended to the taskname can be any angle range exceeding the de facto <0.13 VLAR range. So unless you examine the stderr.txt after it reports to see the angle range or examine the contents of the stderr.txt file in the slot the task is crunching in, you can't tell if a task is a VHAR and affected by the bad Windows drivers just by looking at the taskname. Revert to 431 drivers and you won't get any more stuck tasks. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 2023235 · Reply Quote
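The naming rule described here can be checked mechanically. This is a minimal sketch, assuming only what the thread shows: VLAR tasks carry a `.vlar` suffix, and BOINC Manager may append a result suffix such as `_1` to the name. The helper names are hypothetical. A `False` result does not identify a VHAR task. It only means the name carries no VLAR tag, so the angle range still has to be read from stderr.txt.

```python
import re

# Trailing result suffix such as "_1", as seen in the BOINC Manager name
# quoted in the thread.
_RESULT_SUFFIX = re.compile(r"_\d+$")


def workunit_name(task_name: str) -> str:
    """Strip a trailing '_N' result suffix, if present."""
    return _RESULT_SUFFIX.sub("", task_name)


def is_vlar_name(task_name: str) -> bool:
    """True if the name carries the '.vlar' tag; says nothing about VHAR."""
    return workunit_name(task_name).endswith(".vlar")
```

Applied to the names posted earlier in the thread, `26mr08ac.22756.40825.16.43.117.vlar` is recognized as VLAR, while `07oc11ab.26899.1294.12.39.178_1` is not.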

Bill Volunteer tester Send message Joined: 30 Nov 05 Posts: 282 Credit: 6,916,194 RAC: 60	Message 2023645 - Posted: 19 Dec 2019, 2:35:03 UTC - in response to Message 2022894. I would say that is a fair assessment. Also affecting priority is the loudness of the complaint. My suggestion, for people with the issue: the more people that complain using NVIDIA's driver feedback area: To report a software issue, please fill out the NVIDIA driver feedback form. This will help us collect the specific information needed to reproduce your issue and prioritize driver fixes: https://forms.gle/kJ9Bqcaicvjb82SdA ... the better the chances that the priority will go up, to get it fixed faster. * Quote is from the 441.66 driver feedback thread here: https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/332096/geforce-44166-game-ready-driver-feedback-thread-re/ I posted about this problem a while ago, and I just checked that post: https://www.nvidia.com/en-us/geforce/forums/notifications/comment/31112/. I'm not sure where to post on the developer forums to bring this up. I'm starting to wonder if we as a community have been complaining about this to Nvidia in several different ways, and thus diluting the overall effectiveness. Granted, we may be only a small number of people complaining to such a large company, but perhaps we need to organize more here? Can someone with a little more knowledge about this problem write up a default message for others to use to fill out a form (assuming it wouldn't be treated as spam)? Seti@home classic: 1,456 results, 1.613 years CPU time ID: 2023645 · Reply Quote
