NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units
Author | Message |
---|---|
Jacob Klein Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
NVIDIA released the 441.66 drivers today. I tested them, and they still have the "SETI OpenCL SoG VHAR on Windows 10" problems. Maxwell: tasks crash with an error: "ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995." Pascal/Turing: tasks run indefinitely with no load on the GPU. The 431.60 drivers are the last that work correctly for those specific SETI tasks on Windows 10. NVIDIA is aware, and per NVIDIA, we must continue to be patient for a driver version that includes a fix. |
Lazydude Joined: 17 Jan 01 Posts: 45 Credit: 96,158,001 RAC: 136 |
Quoted: "It’s looking more and more like nvidia isn’t going to fix this." I'm not sure of this. As I understand it, VLARs work on Nvidia cards today. Why not "skip" VLAR, reuse the code, and call it VHAR instead? BR Lazy |
Wiggo Joined: 24 Jan 00 Posts: 36765 Credit: 261,360,520 RAC: 489 |
Quoted: "It’s looking more and more like nvidia isn’t going to fix this." "I'm not sure of this-" Yes, they don't work too badly under the newer OpenCL apps on newer hardware, but they brought systems to a grinding halt under the older Cuda apps on older hardware, and both are still being used today. ;-) Cheers. |
Lazydude Joined: 17 Jan 01 Posts: 45 Credit: 96,158,001 RAC: 136 |
Quoted: "Yes they don't work too badly under the newer OpenCL apps on newer hardware, but they brought systems to a grinding halt under the older Cuda apps on older hardware and both are still being used today. ;-)" Thanks Wiggo! Then we are in another debate: how long should the project keep "oldish" apps in use? That is a topic for another lengthy thread. Lazy |
Wiggo Joined: 24 Jan 00 Posts: 36765 Credit: 261,360,520 RAC: 489 |
Quoted: "Thanks Wiggo!" I guess that depends on how long people keep on using pre-Fermi GPUs. ;-) Cheers. |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Quoted: "It’s looking more and more like nvidia isn’t going to fix this." You can't just call it VHAR. They need to classify WUs as VHAR first. Right now VHARs are not differentiated (in the naming scheme) from normal WUs, since nothing in the name marks them as such. You can only tell by looking at the value for the AR. Once they do that, then yes, they could "reuse" the code to prevent this kind of task from going to certain systems. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Lazydude Joined: 17 Jan 01 Posts: 45 Credit: 96,158,001 RAC: 136 |
Quoted: "It’s looking more and more like nvidia isn’t going to fix this." Maybe it was unclear: the mechanics behind VLAR could be revamped. Yes, there are many parameters that must be reconfigured; a <AR must be changed to >AR, etc. But that is only if we can scrap VLAR. |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Why scrap VLAR? I don’t think any cards have issues with them anymore. Ignoring VLARs would ignore a large portion of the data that people are processing. You don’t need to ignore VLAR in order to add more restrictions to the VHAR tasks. They don’t seem to be mutually exclusive. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Mr. Kevvy Joined: 15 May 99 Posts: 3806 Credit: 1,114,826,392 RAC: 3,319 |
Quoted: "But that i only if we can scrap Vlar" Did you mean VHAR? I hope so, as almost every Breakthrough Listen work unit (over 90% of what we do now) is a VLAR, because it's dedicated recording time with the target being tracked (so little to no movement of it across the field). This is in contrast to Arecibo's normal/VHAR angle range work units, where SETI@Home is "along for the ride" and not tracking the target. |
Lazydude Joined: 17 Jan 01 Posts: 45 Credit: 96,158,001 RAC: 136 |
My bad. By "scrap" I did not mean deleting the files, but letting them in as normal work. Then, instead of a VLAR class, have a VHAR class, or maybe best, use 3 classes: VLAR (???_vlar_.wu), NAR (???_nar_.wu), VHAR (???_vhar.wu). I never wanted to delete any work, only to reclassify the work and send it to those who can handle it. |
rob smith Joined: 7 Mar 03 Posts: 22526 Credit: 416,307,556 RAC: 380 |
The "VLAR" tag in the file name has no functional meaning. It was put there many years ago to help the project identify data that was taken from "spotted" observations rather than "sky-sweeps", which at the time dominated what Arecibo was doing. Even today Arecibo does a lot of survey work, as it is not well suited to spotted work, unlike the highly mobile GBT & Parkes telescopes. Sky-sweeps provide a means of locating potential hot-spots, but I'm not sure if the data is used in that way or not. But only one of the "stock" applications is affected, and then only when running on one operating system with one group of drivers. If it really were a very widespread issue, rather than a single-environment one, we would be seeing a very substantial number of resends due to time exceeded errors, which we aren't. As for nVidia not sorting out the bug they've introduced into the driver: as this probably affects some of the big commercial users of GPUs, there will be quite a bit of pressure from that quarter to get it properly fixed and in a reasonable time frame. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
According to Jacob, nvidia hadn't received much complaining about the issue with their drivers until he created the specific steps to reproduce the issue. I would think if this was causing major issues affecting any large commercial uses, they would have many more points of data to look at, and probably faster than from the relatively small collection of hobbyists looking for ET with their home computers. That's probably why this is such a low priority for them: the affected pool of users is very small. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Jacob Klein Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
I would say that is a fair assessment. Also affecting priority is the loudness of the complaint. My suggestion for people with the issue: the more people that complain using NVIDIA's driver feedback area, the better the chances that the priority will go up and it gets fixed faster. NVIDIA's own wording*: "To report a software issue, please fill out the NVIDIA driver feedback form. This will help us collect the specific information needed to reproduce your issue and prioritize driver fixes:" * Quote is from the 441.66 driver feedback thread here: https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/332096/geforce-44166-game-ready-driver-feedback-thread-re/ |
robertmiles Joined: 16 Jan 12 Posts: 213 Credit: 4,117,756 RAC: 6 |
Something that might help: Create a modified version of the application that goes just past the point where the problem occurs, writes a checkpoint (to limit what gets optimized out), then halts the application. This should make it faster to check whether the problem occurs, by shortening the run time of cases where the problem does not occur. Then tell Nvidia how to run this new test case in another report of the same problem. Also list, in that problem report to Nvidia, some of the newer versions of the driver which have the same problem. |
6BQ5 Joined: 7 Dec 18 Posts: 29 Credit: 12,725,636 RAC: 357 |
I have a 1080 card and am using v441.66 of the GeForce drivers. Some work units get stuck at 0.630% while others blaze through. Here are some properties of one task working properly. ------------------- Application: SETI@home v8 8.22 (opencl_nvidia_SoG); Name: 26mr08ac.22756.40825.16.43.117.vlar; State: Running; Received: 12/13/2019 11:39:29 PM; Report deadline: 2/5/2020 4:39:12 AM; Resources: 0.486 CPUs + 1 NVIDIA GPU; Estimated computation size: 183,887 GFLOPs; CPU time: 00:00:56; CPU time since checkpoint: 00:00:56; Elapsed time: 00:01:00; Estimated time remaining: 00:05:00; Fraction done: 13.043%; Virtual memory size: 132.04 MB; Working set size: 106.56 MB; Directory: slots/4; Process ID: 8696; Progress rate: 15.330% per minute; Executable: setiathome_8.22_windows_intelx86__opencl_nvidia_SoG.exe ------------------- Here are the properties of one that got stuck and required me to abort it. ------------------- Application: SETI@home v8 8.22 (opencl_nvidia_SoG); Name: 25jl12af.6862.19290.12.39.186; State: Aborted by user; Received: 12/13/2019 11:39:29 PM; Report deadline: 1/3/2020 10:49:12 AM; Resources: 0.486 CPUs + 1 NVIDIA GPU; Estimated computation size: 70,727 GFLOPs; CPU time: 00:00:08; Elapsed time: 00:03:04; Executable: setiathome_8.22_windows_intelx86__opencl_nvidia_SoG.exe ------------------- -=- Boris |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Your examples don't mean anything. The VLAR task is not affected because the issues are just with VHAR tasks. The one that got stuck is a VHAR task. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
Quoted: "The one that got stuck is a VHAR task." And the driver you are using is one that is known to have issues. Use v431.60 or earlier. Grant Darwin NT |
6BQ5 Joined: 7 Dec 18 Posts: 29 Credit: 12,725,636 RAC: 357 |
Are VLAR tasks suffixed with ".vlar" at the end of the name? If yes, then I have to say I had one or two without a ".vlar" suffix go through before updating. Maybe I was imagining it, or maybe they started before the update. I have one running right now without the ".vlar" suffix: --------------- Application: SETI@home v8 8.22 (opencl_nvidia_SoG); Name: 07oc11ab.26899.1294.12.39.178; State: Running; Received: 12/14/2019 2:49:24 AM; Report deadline: 2/5/2020 6:52:01 PM; Resources: 0.486 CPUs + 1 NVIDIA GPU; Estimated computation size: 185,478 GFLOPs; CPU time: 00:00:31; CPU time since checkpoint: 00:00:31; Elapsed time: 00:00:33; Estimated time remaining: 00:05:57; Fraction done: 1.813%; Virtual memory size: 167.23 MB; Working set size: 122.77 MB; Directory: slots/15; Process ID: 6296; Progress rate: 5.430% per minute; Executable: setiathome_8.22_windows_intelx86__opencl_nvidia_SoG.exe --------------- The Name in my BOINC Manager is "07oc11ab.26899.1294.12.39.178_1". That name adds a "_1" not shown in the text above. -=- Boris |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Yes, all VLAR tasks have a .vlar appended to the taskname. Arecibo tasks without .vlar appended to the taskname can be any angle range exceeding the de facto <0.13 VLAR range. So unless you examine the stderr.txt after it reports to see the angle range or examine the contents of the stderr.txt file in the slot the task is crunching in, you can't tell if a task is a VHAR and affected by the bad Windows drivers just by looking at the taskname. Revert to 431 drivers and you won't get any more stuck tasks. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Bill Joined: 30 Nov 05 Posts: 282 Credit: 6,916,194 RAC: 60 |
Quoted: "I would say that is a fair assessment." I posted about this problem a while ago, and I just checked that post: https://www.nvidia.com/en-us/geforce/forums/notifications/comment/31112/. I'm not sure where to post on the developer forums to bring this up. I'm starting to wonder if we as a community have been complaining about this in several ways to Nvidia, and thus diluting the overall effectiveness. Granted, we could be a small number of people complaining to such a large company, but perhaps we need to organize more here? Can someone with a little more knowledge about this problem write up a default message for others to use to fill out a form (assuming it wouldn't be treated as spam)? Seti@home classic: 1,456 results, 1.613 years CPU time |
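
The rules of thumb scattered through this thread can be collected into a small check. The Python below is a minimal sketch, not anything taken from the SETI@home applications. It treats drivers newer than 431.60 as suspect (per the posts above), uses the de facto VLAR limit of 0.13 mentioned by Keith Myers, and labels the middle band NAR as Lazydude proposed. The VHAR threshold of 1.127 is an assumption, since the thread never states one, and all function names are hypothetical.

```python
import re

LAST_GOOD_DRIVER = (431, 60)  # from the thread: last driver reported to work
VLAR_MAX_AR = 0.13            # from the thread: de facto VLAR angle-range limit
VHAR_MIN_AR = 1.127           # ASSUMPTION: the thread gives no VHAR threshold


def parse_driver(version):
    """Turn a driver string such as '441.66' into a comparable (major, minor) tuple."""
    major, minor = version.strip().split(".")[:2]
    return int(major), int(minor)


def driver_is_suspect(version):
    """True for any driver newer than the last known-good 431.60."""
    return parse_driver(version) > LAST_GOOD_DRIVER


def classify_angle_range(ar):
    """Label an angle range as VLAR, NAR or VHAR (NAR is Lazydude's proposed name)."""
    if ar < VLAR_MAX_AR:
        return "VLAR"
    if ar > VHAR_MIN_AR:
        return "VHAR"
    return "NAR"


def name_is_vlar(task_name):
    """True if the task name carries the .vlar tag, ignoring a trailing '_N' suffix."""
    base = re.sub(r"_\d+$", "", task_name)
    return base.endswith(".vlar")


def task_at_risk(driver, ar):
    """True when a suspect driver meets a VHAR task (Windows 10, OpenCL SoG app assumed)."""
    return driver_is_suspect(driver) and classify_angle_range(ar) == "VHAR"


if __name__ == "__main__":
    print(task_at_risk("441.66", 2.7))   # suspect driver, high angle range
    print(task_at_risk("431.60", 2.7))   # known-good driver
    print(name_is_vlar("26mr08ac.22756.40825.16.43.117.vlar"))
```

The angle range itself still has to be read from the task's stderr.txt, as Keith Myers describes; the task name only identifies VLARs, so a missing ".vlar" tag says nothing about whether a task is NAR or VHAR.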