NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units

Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 2019639 - Posted: 19 Nov 2019, 7:52:29 UTC - in response to Message 2019638.  

I just installed the NVidia Studio Driver version 441.28. I'll let you know if my SETI performance is restored.
v431.60 is the last version to work with VHAR Arecibo work.
Grant
Darwin NT
ID: 2019639 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2019645 - Posted: 19 Nov 2019, 12:14:13 UTC - in response to Message 2019638.  

I just installed the NVidia Studio Driver version 441.28. I'll let you know if my SETI performance is restored.


It’ll work fine until you get some Arecibo VHAR tasks, and you’ll start getting failures again.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2019645 · Report as offensive     Reply Quote
Profile Bruce N. Goren

Send message
Joined: 1 Jul 99
Posts: 15
Credit: 11,329,118
RAC: 32
United States
Message 2019668 - Posted: 19 Nov 2019, 22:03:10 UTC - in response to Message 2019639.  

I just installed the NVidia Studio Driver version 441.28. I'll let you know if my SETI performance is restored.
v431.60 is the last version to work with VHAR Arecibo work.


Can't/won't revert to 431.60; I need the latest Studio Driver to take advantage of the newest ray tracing acceleration code for my RTX 2080 Ti.
ID: 2019668 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2020096 - Posted: 22 Nov 2019, 16:52:30 UTC

looks like GRD 441.20 DCH was released today.

https://www.nvidia.com/download/driverResults.aspx/153947/en-us

probably doesn't fix the issue since nothing in the release notes mentions it, but worth a shot.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2020096 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Send message
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2020114 - Posted: 22 Nov 2019, 17:58:13 UTC - in response to Message 2020096.  

441.20 was released on 11/12/2019.
I tested them already. They do not fix the issue. I reported results here.
https://setiathome.berkeley.edu/forum_thread.php?id=84694&postid=2018729

However, they recently repacked that version, and the filename has also changed:
from: 441.20-desktop-win10-64bit-international-whql.exe
to: 441.20-desktop-win10-64bit-international-whql-rp.exe

I might retest the repack later, but we can expect to continue to wait on a fix.

It's interesting how their website sometimes shows the old date, and sometimes shows the new date. How confusing!

NVIDIA.com > Drivers > GeForce Drivers > Search > 441.20
https://www.geforce.com/drivers/results/153948
> Release Date Tue Nov 12, 2019

NVIDIA.com > Drivers > All NVIDIA Drivers > Search > 441.20
https://www.nvidia.com/Download/driverResults.aspx/153944/en-us
> Capital "D" in the URL for "Download"
> Release Date: 2019.11.12

NVIDIA.com > Drivers > All NVIDIA Drivers > Beta and Older Drivers > Search > 441.20
https://www.nvidia.com/download/driverResults.aspx/153944/en-us
> Lowercase "d" in the URL for "download"
> Release Date: 2019.11.22

Fun!
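
For anyone scripting around the two installers, here is a minimal sketch that tells the original from the repack using only the two filenames quoted above. Reading the `-rp` suffix as "repack" is an inference from those names, and `parse_installer_name` is a hypothetical helper, not anything NVIDIA publishes; other installers may not follow this pattern.

```python
import re

# Pattern inferred from the two filenames quoted in this post (assumption).
INSTALLER_RE = re.compile(
    r"^(?P<version>\d+\.\d+)-(?P<rest>.+?)(?P<repack>-rp)?\.exe$"
)


def parse_installer_name(filename):
    """Return (version, is_repack), or None if the name does not match."""
    match = INSTALLER_RE.match(filename)
    if match is None:
        return None
    return match.group("version"), match.group("repack") is not None


for name in (
    "441.20-desktop-win10-64bit-international-whql.exe",
    "441.20-desktop-win10-64bit-international-whql-rp.exe",
):
    print(name, parse_installer_name(name))
```

Both names report version 441.20; only the second is flagged as a repack.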
ID: 2020114 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2020126 - Posted: 22 Nov 2019, 18:40:08 UTC - in response to Message 2020114.  

you tested the non-DCH driver; the DCH driver is a different package.

like i said, it's unlikely to fix the issue, but the driver is different nonetheless
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2020126 · Report as offensive     Reply Quote
Baltazar Mejia

Send message
Joined: 14 Sep 05
Posts: 1
Credit: 14,817,297
RAC: 0
United States
Message 2020458 - Posted: 25 Nov 2019, 3:32:36 UTC

I have occasional GPU tasks (opencl_nvidia_SoG) that stall between 0.5% and 0.6%. I manually abort them, and then other tasks work fine. I've noticed the tasks that fail tend to have a deadline in December.

I've uninstalled and reinstalled BOINC, deleted the directory, and installed the latest NVIDIA drivers. For example, just now I had a task stall at 0.605%: task 19no19ac.26046...., due 12/14/19.
Task 19no19ac.20334.1754 stalled at 0.592%.

In contrast, task 19no19ac.4760, due 1/6/20, is running fine. I have no idea how to fix this, or whether it is a problem on my end or a problem with the tasks.

My GPU is an RTX 2070 running the latest 441.20 driver. My data directory is on a different drive than the program directory. Would that have anything to do with it?

Anyone else having this problem?
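
A stall like this can be spotted automatically rather than by watching the manager. The sketch below is only an illustration: `is_stalled`, its ten-minute window, and the sample format are all assumptions, and how the progress samples are collected from the BOINC client is left out.

```python
def is_stalled(samples, window_seconds=600.0, min_progress=1e-6):
    """Decide whether a task has stopped making progress.

    samples: list of (elapsed_seconds, fraction_done) tuples, oldest first.
    Returns True when the fraction done moved by less than min_progress
    over at least window_seconds of observation.
    """
    if len(samples) < 2:
        return False
    last_time, last_fraction = samples[-1]
    # Walk backwards to the most recent sample that is old enough to compare.
    for sample_time, fraction in reversed(samples[:-1]):
        if last_time - sample_time >= window_seconds:
            return (last_fraction - fraction) < min_progress
    return False  # not enough history yet


# A task frozen at 0.605% (fraction 0.00605) versus one that keeps moving.
frozen = [(0, 0.0), (60, 0.00605), (360, 0.00605), (720, 0.00605)]
healthy = [(0, 0.0), (300, 0.02), (700, 0.05)]
print(is_stalled(frozen), is_stalled(healthy))
```

The window should be long enough that a slow but healthy task is not aborted by mistake.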
ID: 2020458 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2020461 - Posted: 25 Nov 2019, 4:15:42 UTC - in response to Message 2020458.  

The problem is a combination of:
1. Windows 10
2. Nvidia drivers newer than 431.xx
3. The SETI SoG app
4. Arecibo VHAR tasks.

Remove any of these 4 variables and the problem goes away.

The easiest thing to do is to revert back to the 431.xx driver.
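
The four conditions above can be sketched as a single check. This is an illustration only: `HostTask`, its field names, and the string matching are hypothetical, and the driver test simply treats any major version above 431 as "newer than 431.xx".

```python
from dataclasses import dataclass


@dataclass
class HostTask:
    """Hypothetical description of one task on one host."""
    os_name: str           # e.g. "Windows 10"
    driver_version: str    # e.g. "441.20"
    app_plan: str          # e.g. "opencl_nvidia_SoG"
    is_arecibo_vhar: bool


def driver_newer_than_431(version):
    """True for driver versions whose major number is above 431."""
    major = int(version.split(".")[0])
    return major > 431


def is_affected(task):
    """All four conditions must hold; removing any one clears the problem."""
    return (
        task.os_name == "Windows 10"
        and driver_newer_than_431(task.driver_version)
        and "SoG" in task.app_plan
        and task.is_arecibo_vhar
    )


print(is_affected(HostTask("Windows 10", "441.20", "opencl_nvidia_SoG", True)))
print(is_affected(HostTask("Windows 10", "431.60", "opencl_nvidia_SoG", True)))
```

The first host matches all four conditions; the second does not, because it is on the 431.60 driver.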
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2020461 · Report as offensive     Reply Quote
Cameron
Avatar

Send message
Joined: 27 Nov 02
Posts: 110
Credit: 5,082,471
RAC: 17
Australia
Message 2020553 - Posted: 26 Nov 2019, 0:47:44 UTC - in response to Message 2020461.  
Last modified: 26 Nov 2019, 0:48:04 UTC

The problem is a combination of:
1. Windows 10
2. Nvidia drivers newer than 431.xx
3. The SETI SoG app
4. Arecibo VHAR tasks.

Remove any of these 4 variables and the problem goes away.

The easiest thing to do is to revert back to the 431.xx driver.

One other easy alternative could be to not allow tasks for NVIDIA GPU.
ID: 2020553 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2020564 - Posted: 26 Nov 2019, 2:44:36 UTC - in response to Message 2020553.  

Doesn’t seem like a good idea since you won’t get any work that way. You can ignore the problem and you’ll still do most tasks. Or revert the drivers and get back to doing all tasks.

Stopping all work to the nvidia GPU seems overly extreme. Would you really prefer to do nothing instead of something?
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2020564 · Report as offensive     Reply Quote
Speedy
Volunteer tester
Avatar

Send message
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2020565 - Posted: 26 Nov 2019, 2:45:46 UTC - in response to Message 2020553.  

The problem is a combination of:
1. Windows 10
2. Nvidia drivers newer than 431.xx
3. The SETI SoG app
4. Arecibo VHAR tasks.

Remove any of these 4 variables and the problem goes away.

The easiest thing to do is to revert back to the 431.xx driver.

One other easy alternative could be to not allow tasks for NVIDIA GPU.

Doing this would require a coding change on the server side, I would imagine. I do not know of anybody capable of such a task at present apart from Jeff, Matt or Eric.
ID: 2020565 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Send message
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2020567 - Posted: 26 Nov 2019, 2:50:21 UTC
Last modified: 26 Nov 2019, 2:50:33 UTC

The user has a web setting to not receive NVIDIA work, if desired.
Anyway, my short term solution has been to just put SETI to "no new work" until things are fixed up.
... since I don't want my GPUs to possibly go idle with the problem, and I have other GPU projects that I'm attached to.
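
"No new work" can also be set from the command line with `boinccmd`. The sketch below only builds the command. It assumes `boinccmd` is on the PATH and that the project is attached under the URL shown, which may differ from what your client lists. `no_new_work_command` is a hypothetical helper.

```python
import subprocess

# Assumed project URL; use the one your BOINC client actually shows.
SETI_URL = "http://setiathome.berkeley.edu/"


def no_new_work_command(project_url, allow=False):
    """Build the boinccmd call that stops (or re-allows) new work for a project."""
    operation = "allowmorework" if allow else "nomorework"
    return ["boinccmd", "--project", project_url, operation]


cmd = no_new_work_command(SETI_URL)
print(" ".join(cmd))
# Uncomment on a machine with a running BOINC client to apply it:
# subprocess.run(cmd, check=True)
```

Passing `allow=True` builds the matching command to start fetching work again once a fixed driver is available.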
ID: 2020567 · Report as offensive     Reply Quote
d.wenzel
Volunteer tester

Send message
Joined: 11 Mar 01
Posts: 3
Credit: 38,878,272
RAC: 130
Germany
Message 2020633 - Posted: 26 Nov 2019, 18:19:02 UTC

Good evening,

does anyone have experience with the new Nvidia driver 441.41, published today (11/26/2019)?

Kind regards
ID: 2020633 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2020637 - Posted: 26 Nov 2019, 18:28:36 UTC - in response to Message 2020633.  

Good evening,

does anyone have experience with the new Nvidia driver 441.41, published today (11/26/2019)?

Kind regards


nothing in the release notes saying it's been addressed. So likely not fixed.

someone will have to test to be sure.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2020637 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Send message
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2020640 - Posted: 26 Nov 2019, 18:38:14 UTC - in response to Message 2020633.  

NVIDIA released 441.41 drivers today.
I tested them, and they still have the "SETI OpenCL SoG VHAR on Windows 10" problems:

Maxwell:
> Tasks crash with error.
> ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995.

Pascal/Turing:
> Tasks run indefinitely with no load on the GPU.

431.60 are the last drivers that work correctly for those specific SETI tasks on Windows 10.
NVIDIA is aware, and per NVIDIA, we must continue to be patient for a driver version that includes a fix.
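
To find the Maxwell failure in a batch of task output, the error line can be parsed as sketched below. The regular expression is written against the single line quoted above, so other messages from the app may not match. `parse_opencl_error` is a hypothetical helper, and the table lists only two status codes from the OpenCL headers, where -36 is `CL_INVALID_COMMAND_QUEUE`.

```python
import re

# Small subset of OpenCL status codes from CL/cl.h.
OPENCL_ERRORS = {
    -5: "CL_OUT_OF_RESOURCES",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

# Pattern based on the one error line quoted in this post (assumption).
ERROR_RE = re.compile(
    r"ERROR: OpenCL kernel/call '(?P<call>[^']+)' call failed "
    r"\((?P<code>-?\d+)\) in file (?P<file>\S+) near line (?P<line>\d+)"
)


def parse_opencl_error(stderr_line):
    """Return the parsed fields of an OpenCL error line, or None if no match."""
    match = ERROR_RE.search(stderr_line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "call": match.group("call"),
        "code": code,
        "name": OPENCL_ERRORS.get(code, "unknown"),
        "file": match.group("file"),
        "line": int(match.group("line")),
    }


# Raw string so the backslash in the Windows path is kept literally.
sample = (
    r"ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' "
    r"call failed (-36) in file ..\analyzeFuncs.cpp near line 1995."
)
print(parse_opencl_error(sample))
```

This only catches the crash case. The Pascal/Turing behaviour produces no error line, so it has to be detected by the lack of progress instead.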
ID: 2020640 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2020647 - Posted: 26 Nov 2019, 19:28:21 UTC
Last modified: 26 Nov 2019, 19:40:15 UTC

It’s looking more and more like nvidia isn’t going to fix this.

We may need to look at other options, either wider adoption of the sah app that doesn’t have this issue, or some tweaking on the distribution servers to not send Arecibo VHAR tasks to Nvidia GPUs on Windows 10.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2020647 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Send message
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2020648 - Posted: 26 Nov 2019, 19:33:13 UTC
Last modified: 26 Nov 2019, 19:34:51 UTC

Fixes take time. I've been told they are working on it.

Plus, keep in mind that just because a driver is released today, doesn't mean that it has their latest efforts. For 441.41, for instance, Device Manager shows a driver date of 11/20/2019, so there's a week of lag between when the driver was compiled versus when it was released, meaning it doesn't have their changes for the past week.
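
The lag is simple date arithmetic. A minimal sketch using the two dates from this thread, 11/20/2019 as the driver date shown in Device Manager and 11/26/2019 as the release date; `driver_lag_days` is just an illustrative name.

```python
from datetime import date


def driver_lag_days(build_date, release_date):
    """Days between the driver date shown in Device Manager and its release."""
    return (release_date - build_date).days


lag = driver_lag_days(date(2019, 11, 20), date(2019, 11, 26))
print(lag)  # 6
```

Six days, so roughly the week of lag described above.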

Again, we need to remain patient, possibly until the first driver from a new Release branch, before we can really wonder if they're abandoning us.

You are welcome to work to create another repro for them and send them driver feedback.

I do like your idea of a server-side change to prevent sending those out if they are known to cause problems, especially the "infinite run" problem where it wastes a GPU that could be used for other projects/tasks!
ID: 2020648 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2020649 - Posted: 26 Nov 2019, 19:46:12 UTC - in response to Message 2020648.  

The project has done it in the past; IIRC, it prevented Arecibo VLARs from going to Nvidia GPUs. But I'm not sure if it had any additional constraints for the app used or the environment (OS). Since this only seems to affect Windows 10, it would be good to narrow it down at least that much, because there are lots of highly productive systems running Linux that do not have this problem and can crunch through these tasks without issue.

But the problem came about in the R435 release, and nvidia was informed about the issue fairly early on after it started happening. They pushed several updates on R435 without a fix, and have now moved to R440 with still no fix. That's why I think they might not fix it: they've already moved on to a new driver release branch.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2020649 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Send message
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2020650 - Posted: 26 Nov 2019, 19:47:38 UTC - in response to Message 2020649.  
Last modified: 26 Nov 2019, 19:49:02 UTC

They didn't have my easy repro setup+steps, though, until I put forth the effort to supply it.
So .... they now have a repro, and are working on it.
To my knowledge, they didn't have a readily available repro before.
ID: 2020650 · Report as offensive     Reply Quote
Traveller

Send message
Joined: 6 Jul 99
Posts: 1
Credit: 5,502,932
RAC: 15
United States
Message 2021649 - Posted: 4 Dec 2019, 8:14:22 UTC - in response to Message 2020564.  

I intend to just turn off GPU for SETI.

My other projects are not having a problem.

I'll revisit this when I build the next computer.
ID: 2021649 · Report as offensive     Reply Quote


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.