NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units

Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2030759 - Posted: 4 Feb 2020, 4:16:20 UTC - in response to Message 2030751.  

I think 442.19 has definitely fixed it!

In my testing of the 442.19 drivers, I had no problems processing VHAR work items, on my main rig (RTX 2080, GTX 980 Ti, GTX 980) using Windows 10.
All 3 GPUs acted correctly.

!YAY!
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2030759
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2030765 - Posted: 4 Feb 2020, 6:27:07 UTC - in response to Message 2030751.  

I think 442.19 has definitely fixed it!
If so, that is excellent news.

If it is sorted, the problem should resolve itself quickly, since many of those who don't want to use the older drivers install the most recent driver out of habit anyway (even if the new driver doesn't actually do anything for their games or systems).


Thank you for all your efforts with this issue.
Grant
Darwin NT
ID: 2030765
Jacob Klein
Volunteer tester
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2030788 - Posted: 4 Feb 2020, 12:30:53 UTC - in response to Message 2030765.  
Last modified: 4 Feb 2020, 12:51:21 UTC

You're welcome. :)

It usually takes a bit of effort to make an easy repro, and to report the problem properly, for the NVIDIA guys to recognize the issue and go after it.

I'd like to thank Richard Haselgrove for his help with the repro. Thanks Richard -- Remember when we were diagnosing while I was at an airport and you were using Discord for one of the first times? Fun times - unforgettable!

Also, regarding confirming the fix, I'm in the process of getting OpenCL and CUDA results, for us to compare and verify. Maybe Richard can help with that verification, when I have the data ready, later today.
ID: 2030788
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030790 - Posted: 4 Feb 2020, 12:58:56 UTC - in response to Message 2030788.  

Can do. I've arranged to meet someone at 14:30, and may not be back before maintenance starts. We may have to fire up Discord again...
ID: 2030790
robertmiles
Volunteer tester
Joined: 16 Jan 12
Posts: 213
Credit: 4,117,756
RAC: 6
United States
Message 2030798 - Posted: 5 Feb 2020, 2:08:08 UTC - in response to Message 2030788.  


[snip]

Also, regarding confirming the fix, I'm in the process of getting OpenCL and CUDA results, for us to compare and verify. Maybe Richard can help with that verification, when I have the data ready, later today.

Could you mention which of the applications runs VHAR workunits faster? We may need a way to ensure that we use that one, except when you need the other one used instead.
ID: 2030798
Bruce N. Goren
Joined: 1 Jul 99
Posts: 15
Credit: 11,329,118
RAC: 32
United States
Message 2030799 - Posted: 5 Feb 2020, 2:09:31 UTC - in response to Message 2030751.  

Yep, I grabbed the Studio Driver variant of 442.19 and it looks good on my RTX2080i. Thanks to all!
ID: 2030799
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2030800 - Posted: 5 Feb 2020, 2:13:36 UTC - in response to Message 2030799.  

Yep, I grabbed the Studio Driver variant of 442.19 and it looks good on my RTX2080i. Thanks to all!
The issue only occurred with some Arecibo WUs. Have you processed any shortie Arecibo WUs successfully, as there are very, very few of them around at the moment?
Grant
Darwin NT
ID: 2030800
Jacob Klein
Volunteer tester
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2030811 - Posted: 5 Feb 2020, 3:56:28 UTC
Last modified: 5 Feb 2020, 3:58:36 UTC

1) My NVIDIA contact would like to extend NVIDIA's apologies for the lengthy time needed for this fix. They are grateful for our patience.

2) Some people are looking for good examples to test the fix. I offer them my examples, located in the .zip files here. Just download the .zip file, then extract it, then run the .cmd file within the folder whose name matches whatever GPU device (dev 0, or dev 1, or dev 2) that you want to test. Results are put into the "Testdatas" folder.
Example Work Units - Zips:
https://1drv.ms/f/s!AgP0NBEuAPQRp9ky322LD1BXy6rdAg

3) Richard, here are my results. I did not inspect them thoroughly, and I'm hoping you can do that. All I know is that they completed without error, and GPU usage seemed good... so I suspect the results are probably good. Can you please have a look? In particular, please compare the results from the fixed 442.19 against those from the known-working 431.68; they should match.
My OpenCL Results:
https://1drv.ms/f/s!AgP0NBEuAPQRp7kD322LD1BXy6rdAg
My CUDA Results:
https://1drv.ms/f/s!AgP0NBEuAPQRp8RF322LD1BXy6rdAg

4) For anyone wanting access to every file I have on this bug, here's the main folder:
Main folder:
https://1drv.ms/f/s!AgP0NBEuAPQRp6Fr322LD1BXy6rdAg

Regards,
Jacob Klein
ID: 2030811
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2030984 - Posted: 6 Feb 2020, 5:40:02 UTC

File 01fe20aa appears to be putting out quite a few shorties, so that will give people something to further test the new driver on.
Grant
Darwin NT
ID: 2030984
EdwardPF
Volunteer tester
Joined: 26 Jul 99
Posts: 389
Credit: 236,772,605
RAC: 374
United States
Message 2031042 - Posted: 6 Feb 2020, 16:55:33 UTC

The NVIDIA driver 442.19-desktop-win10-64bit-international-whql has been working fine for the last 12 hours with SOG on my NVIDIA 1660 Super, with no hanging as near as I can tell.

Maybe it's a good one

Ed F
ID: 2031042
Jacob Klein
Volunteer tester
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2031044 - Posted: 6 Feb 2020, 16:58:18 UTC
Last modified: 6 Feb 2020, 16:58:51 UTC

Yep, 442.19 is a good driver that fixes the issue in this thread.
Scroll up to see the prior results.
And thanks for confirming.
ID: 2031044
VelocityRC
Joined: 27 Sep 19
Posts: 23
Credit: 1,421,582
RAC: 86
United States
Message 2031045 - Posted: 6 Feb 2020, 17:12:07 UTC

I'll give it a spin this morning and see how things run for a few days. From the recent posts, I'm glad that nVidia is listening and that there are folks here who understand what to tell them.

Have fun everyone and thanks !!!

Bill S.
ID: 2031045
KWSN - Sir Nutsalot
Joined: 4 Jun 99
Posts: 5
Credit: 22,114,565
RAC: 47
United Kingdom
Message 2031063 - Posted: 6 Feb 2020, 20:17:46 UTC
Last modified: 6 Feb 2020, 20:22:46 UTC

I can confirm that the latest Radeon 20-1-1 drivers have stopped four of my machines (RX580 and 590) from crashing work units, which they did on Radeon software above version 19-7-5.

Tried the latest Nvidia DCH drivers at version 441.19 and I have not seen work units stalling yet either. I used the gaming drivers for my test machine.

A good day all round, methinks.

Jim
ID: 2031063
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031095 - Posted: 6 Feb 2020, 22:56:21 UTC
Last modified: 6 Feb 2020, 22:57:03 UTC

Interesting, just found a wingman running Windows 10 and the 441.66 drivers who crapped out on a VHAR task from the 30ja20ab series. Shouldn't have had an issue theoretically since the problem was supposedly fixed in the 441.19 drivers. AR = 14.597567
https://setiathome.berkeley.edu/result.php?resultid=8513797826
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031095
Jacob Klein
Volunteer tester
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2031096 - Posted: 6 Feb 2020, 23:03:04 UTC
Last modified: 6 Feb 2020, 23:03:17 UTC

Keith,
I believe you are mistaken.
The fix is in 442.19.
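
For anyone wanting to check a host's driver against the version numbers in this thread, here is a minimal sketch. It assumes the affected range starts at 436.00 (inferred only from the thread title, "436.xx and later") and ends just before 442.19 (the fix reported above). The names `FIRST_AFFECTED`, `FIRST_FIXED`, `parse_driver_version` and `is_affected` are made up for illustration and are not part of any NVIDIA or BOINC tool.

```python
from typing import Tuple

# Assumed bounds, taken from this thread rather than from NVIDIA documentation.
FIRST_AFFECTED = (436, 0)   # assumption: thread title says "436.xx and later"
FIRST_FIXED = (442, 19)     # per the posts in this thread


def parse_driver_version(text: str) -> Tuple[int, int]:
    """Split a version string such as '441.66' into (441, 66)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def is_affected(text: str) -> bool:
    """True if the version falls in the assumed affected range."""
    return FIRST_AFFECTED <= parse_driver_version(text) < FIRST_FIXED


for version in ("431.68", "441.66", "442.19"):
    status = "in the affected range" if is_affected(version) else "outside the affected range"
    print(version, status)
```

Under these assumptions, 441.66 still falls inside the affected range, which is consistent with a wingman on that driver failing a VHAR task.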
ID: 2031096
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031097 - Posted: 6 Feb 2020, 23:08:12 UTC - in response to Message 2031096.  

Keith,
I believe you are mistaken.
The fix is in 442.19.

Ohh, sorry about that. Got the version number of the fix wrong I see.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031097
Jacob Klein
Volunteer tester
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2031331 - Posted: 8 Feb 2020, 1:22:06 UTC

Hi folks,

I recently went through all of my local "repro" examples ... against all 6 of my GPUs ... against both Cuda and OpenCL ... against drivers:
- 431.60 (known good NVIDIA public release)
- 431.68 (known good NVIDIA hotfix driver)
- 432.00 (known good Windows Update driver)
- 442.19 (recent NVIDIA driver with fix that looks good so far)

A .zip of the results can be found here:
https://1drv.ms/f/s!AgP0NBEuAPQRp-ZG322LD1BXy6rdAg

Richard is going to look them over, but if anyone else knows how to do that and wants to also inspect for validation, please feel free!

Regards,
Jacob Klein
ID: 2031331
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031338 - Posted: 8 Feb 2020, 2:11:37 UTC

Doesn't look like you ran any of the high AR tasks through the reference app for comparison.

That is the application the results need to be evaluated against.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031338
Jacob Klein
Volunteer tester
Joined: 15 Apr 11
Posts: 149
Credit: 9,783,406
RAC: 9
United States
Message 2031339 - Posted: 8 Feb 2020, 2:18:41 UTC
Last modified: 8 Feb 2020, 2:19:53 UTC

I'm not sure I understand. I ran the Cuda apps, but separately. Does that end up skipping an automated validation process?
ID: 2031339
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031346 - Posted: 8 Feb 2020, 2:49:32 UTC - in response to Message 2031339.  

All applications use the stock CPU application as the reference result that any other application is judged against. The stock CPU result is considered the standard to match. I looked in all your folders and only saw results from the individual CUDA and OpenCL applications. No CPU application results for the tasks that were run.

The benchmark lets you run the stock CPU application and then compare the test application's result against it. It doesn't look like you did that. At least I could not find any reference result in any of the folders.
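
To illustrate the idea of judging a GPU result against the CPU reference, here is a rough sketch. It is not the project's validator or the benchmark tool's comparison. The `Signal` fields, the signal labels and the 1% tolerance are all assumptions chosen for the example, and the real result files would first have to be parsed into such a list.

```python
import math
from typing import List, NamedTuple


class Signal(NamedTuple):
    """Hypothetical, simplified view of one reported signal."""
    kind: str          # e.g. "spike" or "pulse" (labels assumed for illustration)
    power: float
    frequency: float


def signals_match(a: Signal, b: Signal, rel_tol: float = 1e-2) -> bool:
    """Two signals match if they have the same kind and close numeric values."""
    return (a.kind == b.kind
            and math.isclose(a.power, b.power, rel_tol=rel_tol)
            and math.isclose(a.frequency, b.frequency, rel_tol=rel_tol))


def matches_reference(reference: List[Signal], candidate: List[Signal],
                      rel_tol: float = 1e-2) -> bool:
    """True if every reference signal pairs with a distinct candidate signal."""
    if len(reference) != len(candidate):
        return False
    unmatched = list(candidate)
    for ref in reference:
        for i, cand in enumerate(unmatched):
            if signals_match(ref, cand, rel_tol):
                del unmatched[i]   # each candidate signal may be used only once
                break
        else:
            return False           # no candidate signal matched this reference signal
    return True
```

The point of the design is that the CPU result is the fixed side of the comparison: the CUDA and OpenCL outputs are each checked against it, rather than only against each other.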
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031346


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.