NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units

Jacob Klein
Volunteer tester
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2022663 - Posted: 10 Dec 2019, 22:55:10 UTC

NVIDIA released 441.66 drivers today.
I tested them, and they still have the "SETI OpenCL SoG VHAR on Windows 10" problems:

Maxwell:
> Tasks crash with error.
> ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995.

Pascal/Turing:
> Tasks run indefinitely with no load on the GPU.

431.60 are the last drivers that work correctly for those specific SETI tasks on Windows 10.
NVIDIA is aware, and per NVIDIA, we must continue to be patient for a driver version that includes a fix.
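
The version boundary reported above can be written down as a small check. This is only a sketch: the helper names `parse_driver_version` and `is_suspect_driver` are made up for illustration, and the cutoff simply assumes that 431.60 is the last good driver and that anything newer is affected on Windows 10. A future driver that contains NVIDIA's fix would need an upper bound added.

```python
# Hypothetical helpers: flag NVIDIA Windows driver versions newer than the
# last version reported as working in this thread (431.60).
LAST_KNOWN_GOOD = (431, 60)

def parse_driver_version(version: str) -> tuple:
    """Turn a version string such as '441.66' into the tuple (441, 66)."""
    major, minor = version.strip().split(".")
    return int(major), int(minor)

def is_suspect_driver(version: str) -> bool:
    """True if the driver is newer than the last known-good release."""
    return parse_driver_version(version) > LAST_KNOWN_GOOD

# The two versions named in the post above.
for v in ("431.60", "441.66"):
    print(v, "suspect" if is_suspect_driver(v) else "ok")
```

Comparing the versions as integer tuples rather than as strings or floats avoids ordering mistakes between minor numbers of different lengths.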
ID: 2022663
Lazydude
Volunteer tester
Joined: 17 Jan 01
Posts: 45
Credit: 96,158,001
RAC: 136
Sweden
Message 2022854 - Posted: 12 Dec 2019, 9:16:38 UTC - in response to Message 2020647.  

It’s looking more and more like nvidia isn’t going to fix this.

We may need to look at other options, either wider adoption of the sah app that doesn’t have this issue, or some tweaking on the distribution servers to not send Arecibo VHAR tasks to Nvidia GPUs on Windows 10.


I'm not sure about this:
as I understand it, VLARs work on Nvidia cards today.
Why not "skip" VLAR, reuse the code, and call it VHAR instead?

BR Lazy
ID: 2022854
Wiggo
Joined: 24 Jan 00
Posts: 36765
Credit: 261,360,520
RAC: 489
Australia
Message 2022856 - Posted: 12 Dec 2019, 9:24:30 UTC - in response to Message 2022854.  

Yes they don't work too badly under the newer OpenCL apps on newer hardware, but they brought systems to a grinding halt under the older Cuda apps on older hardware and both are still being used today. ;-)

Cheers.
ID: 2022856
Lazydude
Volunteer tester
Joined: 17 Jan 01
Posts: 45
Credit: 96,158,001
RAC: 136
Sweden
Message 2022857 - Posted: 12 Dec 2019, 9:40:15 UTC - in response to Message 2022856.  

Thanks Wiggo!
Then we are in another debate: how long should the project keep "oldish" apps in use? And that is for another lengthy thread.
Lazy
ID: 2022857
Wiggo
Joined: 24 Jan 00
Posts: 36765
Credit: 261,360,520
RAC: 489
Australia
Message 2022858 - Posted: 12 Dec 2019, 9:48:52 UTC - in response to Message 2022857.  

I guess that depends on how long people keep on using pre-Fermi GPUs. ;-)

Cheers.
ID: 2022858
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2022870 - Posted: 12 Dec 2019, 16:17:26 UTC - in response to Message 2022854.  

You can't just call it VHAR. They need to classify WUs as VHAR first. Right now VHARs are not differentiated (in the naming scheme) from normal WUs, since nothing in the name marks them as such. You can only tell by looking at the AR value.

Once they do that, then yes they could "reuse" the code to prevent this kind of task from going to certain systems.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2022870
Lazydude
Volunteer tester
Joined: 17 Jan 01
Posts: 45
Credit: 96,158,001
RAC: 136
Sweden
Message 2022871 - Posted: 12 Dec 2019, 17:36:16 UTC - in response to Message 2022870.  

Maybe I was unclear:
The mechanics behind VLAR could be revamped. Yes, there are many parameters that would have to be reconfigured:
a <AR must be changed to >AR, etc.
But that is only if we can scrap VLAR.
ID: 2022871
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2022872 - Posted: 12 Dec 2019, 17:48:33 UTC - in response to Message 2022871.  

Why scrap VLAR? I don’t think any cards have issues with them anymore.

Ignoring VLARs would ignore a large portion of the data that people are processing.

You don’t need to ignore VLAR in order to add more restrictions to the VHAR tasks. They don’t seem to be mutually exclusive.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2022872
Mr. Kevvy
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3806
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2022875 - Posted: 12 Dec 2019, 18:25:39 UTC - in response to Message 2022871.  
Last modified: 12 Dec 2019, 18:26:33 UTC

But that is only if we can scrap VLAR


Did you mean VHAR? I hope so as almost every Breakthrough Listen work unit (over 90% of what we do now) is a VLAR, because it's dedicated recording time with the target being tracked (so little to no movement of it across the field). This is in contrast to Arecibo's normal/VHAR angle range work units where SETI@Home is "along for the ride" and not tracking the target.
ID: 2022875
Lazydude
Volunteer tester
Joined: 17 Jan 01
Posts: 45
Credit: 96,158,001
RAC: 136
Sweden
Message 2022882 - Posted: 12 Dec 2019, 18:53:34 UTC - in response to Message 2022875.  

My bad: by "scrap" I did not mean letting the files be deleted, but letting them in as normal work,
and having a class of VHAR instead of the class VLAR.

Or, maybe best of all, use 3 classes:
VLAR
NAR
VHAR

???_vlar_.wu
???_nar_.wu
???_vhar.wu


I never wanted to delete any work - only to reclassify the work and send the work to those who can handle them
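
A rough sketch of what such a three-way classification could look like is given below. Everything in it is illustrative: the function names are invented, the `_vlar` / `_nar` / `_vhar` tags only loosely follow the naming pattern proposed above, the 0.13 VLAR bound is the "de facto" value mentioned in this thread, and the 1.127 VHAR bound is an assumed placeholder that the project would have to confirm.

```python
# Illustrative only: split work units into three angle-range (AR) classes.
VLAR_MAX_AR = 0.13    # de facto VLAR bound mentioned in this thread
VHAR_MIN_AR = 1.127   # ASSUMPTION: placeholder VHAR bound, not from this thread

def classify_angle_range(ar: float) -> str:
    """Return 'vlar', 'nar' (normal) or 'vhar' for a given angle range."""
    if ar < VLAR_MAX_AR:
        return "vlar"
    if ar > VHAR_MIN_AR:
        return "vhar"
    return "nar"

def tagged_name(base: str, ar: float) -> str:
    """Build a hypothetical file name carrying the class tag."""
    return f"{base}_{classify_angle_range(ar)}.wu"

# Made-up base name and AR values, purely to show the three outcomes.
for ar in (0.01, 0.42, 2.5):
    print(ar, tagged_name("example", ar))
```

With a tag like this in the name, the scheduler logic that already exists for VLAR could in principle be pointed at the VHAR class instead, without touching how VLAR work is handled.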
ID: 2022882
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22526
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2022883 - Posted: 12 Dec 2019, 19:01:55 UTC

The "VLAR" tag in the file name has no functional meaning. It was put there many years ago to help the project identify data that was taken from "spotted" observations rather than "sky-sweeps", which at the time dominated what Arecibo was doing. Even today Arecibo does a lot of survey work, as it is not well suited to spotted work, unlike the highly mobile GBT & Parkes telescopes.
Sky-sweeps provide a means of locating potential hot-spots, but I'm not sure if the data is used in that way or not.
But only one of the "stock" applications is affected, and then only when running on one operating system with one group of drivers. If it really were a very widespread issue, rather than a single-environment one, we would be seeing a very substantial number of resends due to time-exceeded errors, which we aren't.
As for nVidia not sorting out the bug they've introduced into the driver: as this probably affects some of the big commercial users of GPUs, there will be quite a bit of pressure from that quarter to get it properly fixed, and in a reasonable time frame.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2022883
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2022886 - Posted: 12 Dec 2019, 19:31:05 UTC - in response to Message 2022883.  

According to Jacob, Nvidia hadn't received many complaints about the issue with their drivers until he created the specific steps to reproduce it. I would think that if this were causing major issues for any large commercial users, Nvidia would have many more data points to look at, and would have had them sooner, than they get from the relatively small collection of hobbyists looking for ET with their home computers. That's probably why this is such a low priority for them: the affected pool of users is very small.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2022886
Jacob Klein
Volunteer tester
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2022894 - Posted: 12 Dec 2019, 21:08:46 UTC
Last modified: 12 Dec 2019, 21:18:17 UTC

I would say that is a fair assessment.

Also affecting priority is the loudness of the complaint.

My suggestion:
For people with the issue..
The more people that complain using NVIDIA's driver feedback area:

To report a software issue, please fill out the NVIDIA driver feedback form. This will help us collect the specific information needed to reproduce your issue and prioritize driver fixes:
https://forms.gle/kJ9Bqcaicvjb82SdA


... the better chances that the priority will go up, to get it fixed faster.

* Quote is from the 441.66 driver feedback thread here:
https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/332096/geforce-44166-game-ready-driver-feedback-thread-re/
ID: 2022894
robertmiles
Volunteer tester
Joined: 16 Jan 12
Posts: 213
Credit: 4,117,756
RAC: 6
United States
Message 2022928 - Posted: 13 Dec 2019, 1:38:54 UTC

Something that might help:

Create a modified version of the application that goes just past the point where the problem occurs, writes a checkpoint (to limit what gets optimized out), then halts the application.

This should make it faster to check whether the problem occurs, by shortening the run time of cases where the problem does not occur.

Then tell Nvidia how to run this new test case in another report of the same problem.

Also list some of the newer versions of the driver which have the same problem, in that problem report to Nvidia.
ID: 2022928
6BQ5
Joined: 7 Dec 18
Posts: 29
Credit: 12,725,636
RAC: 357
United States
Message 2023189 - Posted: 15 Dec 2019, 3:34:08 UTC
Last modified: 15 Dec 2019, 3:34:21 UTC

I have a 1080 card and am using v441.66 of the GeForce drivers. Some work units get stuck at 0.630% while others blaze through.

Here are some properties of one task that is working properly.

-------------------
Application: SETI@home v8 8.22 (opencl_nvidia_SoG)
Name: 26mr08ac.22756.40825.16.43.117.vlar
State: Running
Received: 12/13/2019 11:39:29 PM
Report deadline: 2/5/2020 4:39:12 AM
Resources: 0.486 CPUs + 1 NVIDIA GPU
Estimated computation size: 183,887 GFLOPs
CPU time: 00:00:56
CPU time since checkpoint: 00:00:56
Elapsed time: 00:01:00
Estimated time remaining: 00:05:00
Fraction done: 13.043%
Virtual memory size: 132.04 MB
Working set size: 106.56 MB
Directory: slots/4
Process ID: 8696
Progress rate: 15.330% per minute
Executable: setiathome_8.22_windows_intelx86__opencl_nvidia_SoG.exe
-------------------

Here are the properties of one that got stuck and required me to abort it.

-------------------
Application: SETI@home v8 8.22 (opencl_nvidia_SoG)
Name: 25jl12af.6862.19290.12.39.186
State: Aborted by user
Received: 12/13/2019 11:39:29 PM
Report deadline: 1/3/2020 10:49:12 AM
Resources: 0.486 CPUs + 1 NVIDIA GPU
Estimated computation size: 70,727 GFLOPs
CPU time: 00:00:08
Elapsed time: 00:03:04
Executable: setiathome_8.22_windows_intelx86__opencl_nvidia_SoG.exe
-------------------
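
The most visible difference between the two property lists is CPU time relative to elapsed time: 56 s of CPU in 60 s elapsed for the healthy task, against 8 s of CPU in 184 s elapsed for the stuck one. A minimal sketch of a check built on that observation follows. The function names and both thresholds (`min_elapsed_s`, `min_cpu_ratio`) are assumptions chosen for illustration, and nothing in this thread establishes that the ratio is a reliable indicator in general.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def looks_stalled(cpu_time: str, elapsed: str,
                  min_elapsed_s: int = 120,
                  min_cpu_ratio: float = 0.25) -> bool:
    """Heuristic: a task that has run for a while but used almost no CPU.

    Both thresholds are illustrative guesses, not measured values.
    """
    elapsed_s = hms_to_seconds(elapsed)
    if elapsed_s < min_elapsed_s:
        return False  # too early to judge
    return hms_to_seconds(cpu_time) / elapsed_s < min_cpu_ratio

# Values copied from the two property lists above.
print(looks_stalled("00:00:56", "00:01:00"))  # task that was running normally
print(looks_stalled("00:00:08", "00:03:04"))  # task that had to be aborted
```

The minimum elapsed time is there so that a task that has only just started is not flagged before it has had a chance to accumulate CPU time.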

-=- Boris
ID: 2023189
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2023190 - Posted: 15 Dec 2019, 3:45:04 UTC

Your examples don't mean anything. The VLAR task is not affected because the issues are just with VHAR tasks. The one that got stuck is a VHAR task.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2023190
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 2023191 - Posted: 15 Dec 2019, 3:52:25 UTC - in response to Message 2023190.  

The one that got stuck is a VHAR task.
And the driver you are using is one that is known to have issues.
Use v431.60 or earlier.
Grant
Darwin NT
ID: 2023191
6BQ5
Joined: 7 Dec 18
Posts: 29
Credit: 12,725,636
RAC: 357
United States
Message 2023234 - Posted: 15 Dec 2019, 17:44:39 UTC

Are VLAR tasks suffixed with a ".vlar" at the end of the name? If yes, then I have to say I had one or two without a ".vlar" suffix go through before updating. Maybe I was imagining it or maybe they started before the update.

I have one running right now without the ".vlar" suffix:

---------------
Application: SETI@home v8 8.22 (opencl_nvidia_SoG)
Name: 07oc11ab.26899.1294.12.39.178
State: Running
Received: 12/14/2019 2:49:24 AM
Report deadline: 2/5/2020 6:52:01 PM
Resources: 0.486 CPUs + 1 NVIDIA GPU
Estimated computation size: 185,478 GFLOPs
CPU time: 00:00:31
CPU time since checkpoint: 00:00:31
Elapsed time: 00:00:33
Estimated time remaining: 00:05:57
Fraction done: 1.813%
Virtual memory size: 167.23 MB
Working set size: 122.77 MB
Directory: slots/15
Process ID: 6296
Progress rate: 5.430% per minute
Executable: setiathome_8.22_windows_intelx86__opencl_nvidia_SoG.exe
---------------

The Name in my BOINC Manager is "07oc11ab.26899.1294.12.39.178_1". That name adds a "_1" not shown in the text above.

-=- Boris
ID: 2023234
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2023235 - Posted: 15 Dec 2019, 17:55:35 UTC

Yes, all VLAR tasks have a .vlar appended to the taskname. Arecibo tasks without .vlar appended to the taskname can be any angle range exceeding the de facto <0.13 VLAR range. So unless you examine the stderr.txt after it reports to see the angle range or examine the contents of the stderr.txt file in the slot the task is crunching in, you can't tell if a task is a VHAR and affected by the bad Windows drivers just by looking at the taskname.
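
As a sketch of the two checks described above: the first helper tests the task name for the `.vlar` suffix (ignoring a trailing result number such as `_1`), and the second pulls the angle range out of stderr text. The function names are invented, and the stderr line format matched by `AR_PATTERN` is an assumption; adjust the pattern to whatever the application actually writes.

```python
import re
from typing import Optional

VLAR_MAX_AR = 0.13  # de facto VLAR bound quoted above

# ASSUMPTION: the angle range appears in stderr.txt on a line shaped like
# "WU true angle range is :  0.417468". Change the pattern if it differs.
AR_PATTERN = re.compile(r"WU true angle range is\s*:\s*([0-9]*\.?[0-9]+)")

def has_vlar_suffix(task_name: str) -> bool:
    """True if the name ends in '.vlar', ignoring a result suffix like '_1'."""
    base = re.sub(r"_\d+$", "", task_name)
    return base.endswith(".vlar")

def angle_range_from_stderr(stderr_text: str) -> Optional[float]:
    """Return the angle range found in the stderr text, or None if absent."""
    match = AR_PATTERN.search(stderr_text)
    return float(match.group(1)) if match else None

# Task names taken from this thread; the stderr line is a made-up sample.
print(has_vlar_suffix("26mr08ac.22756.40825.16.43.117.vlar"))
print(has_vlar_suffix("07oc11ab.26899.1294.12.39.178_1"))
print(angle_range_from_stderr("WU true angle range is :  2.718000\n"))
```

A name without `.vlar` only says the angle range is at or above the VLAR bound; telling a normal task from a VHAR one still needs the value read from stderr.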

Revert to 431 drivers and you won't get any more stuck tasks.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2023235
Bill
Volunteer tester
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 2023645 - Posted: 19 Dec 2019, 2:35:03 UTC - in response to Message 2022894.  

I posted about this problem a while ago, and I just checked that post: https://www.nvidia.com/en-us/geforce/forums/notifications/comment/31112/.

I'm not sure where to post on the developer forums to bring this up. I'm starting to wonder if we as a community have been complaining about this to Nvidia in several different ways, and thus diluting the overall effectiveness. Granted, we may be only a small number of people complaining to such a large company, but perhaps we need to organize more here? Can someone with a little more knowledge about this problem write up a default message for others to use when filling out the form (assuming it wouldn't be treated as spam)?
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 2023645