NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units

Message boards : Number crunching : NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 11872
Credit: 184,694,823
RAC: 238,289
Australia
Message 2017598 - Posted: 2 Nov 2019, 21:40:07 UTC - in response to Message 2017596.  

I am hopeful NVIDIA will help out the Normal crunchers soon! :)
Yep. There are new models coming (and rumours are Ampere will be released in the first half of 2020), and you will need the latest drivers to use them.
Grant
Darwin NT
ID: 2017598 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 100
Credit: 9,389,107
RAC: 1,280
United States
Message 2017601 - Posted: 2 Nov 2019, 21:51:37 UTC - in response to Message 2017597.  
Last modified: 2 Nov 2019, 21:57:01 UTC

At this point it's pretty clear the problem doesn't exist when using the Non-SoG version of the App.
So I guess the questions are
1 What's different between the applications?
2 What's different about the recent drivers?

If you read through this thread, and find my other thread, you'll see I already did the research.
It's a driver regression. Only NVIDIA can diagnose further.

Here's my other thread:
https://setiathome.berkeley.edu/forum_thread.php?id=84780

Here's a summary:

431.60 is the last Release 430 driver, and it works correctly.

Release 435 was a major driver update, and it had a regression that made it not work.
All R435 drivers failed.
436.02 (8/20/2019)
436.15 (8/27/2019)
436.30 (9/10/2019)
436.48 (10/1/2019)

Release 440 was a major driver update, and it still had the regression that made it not work.
All R440 drivers have failed up to today.
440.97 (10/22/2019)
441.08 (10/29/2019)
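
For anyone wanting to check a host against the list above, here is a minimal sketch in Python. It only encodes the versions reported in this post; the function names and the "ok" / "affected" / "unknown" labels are made up for illustration, and treating everything at or below 431.60 as fine is an assumption, since only 431.60 itself is reported as working.

```python
# Sketch only: classify a Windows NVIDIA driver version string against the
# versions reported in this thread. All names here are illustrative.

LAST_KNOWN_GOOD = (431, 60)    # last Release 430 driver, reported working
FIRST_KNOWN_BAD = (436, 2)     # 436.02, first driver reported failing
LATEST_TESTED_BAD = (441, 8)   # 441.08, newest driver tested in this post


def parse_driver_version(text):
    """Turn '436.30' into (436, 30) so versions compare numerically."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def classify_driver(text):
    """Return 'ok', 'affected' or 'unknown' for a driver version string."""
    version = parse_driver_version(text)
    if version <= LAST_KNOWN_GOOD:
        # Assumption: drivers older than 431.60 behave like 431.60.
        return "ok"
    if FIRST_KNOWN_BAD <= version <= LATEST_TESTED_BAD:
        return "affected"
    # Anything between the two ranges, or newer than 441.08, is untested here.
    return "unknown"


if __name__ == "__main__":
    for candidate in ("431.60", "436.02", "436.48", "441.08", "445.00"):
        print(candidate, classify_driver(candidate))
```

Comparing (major, minor) tuples rather than floats avoids treating 436.02 and 436.2 as different numbers by accident.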

Here's my timeline of research:
10/20/2019: I hit on the issue personally, and began my research.
10/21/2019: My NVIDIA contact informed me that they are aware of an issue that they thought they had fixed.
10/22/2019: I formed a solid repro of the problem, and passed the info to NVIDIA. https://setiathome.berkeley.edu/forum_thread.php?id=84780&postid=2016218
10/25/2019: My NVIDIA contact informed me that they are now actively looking into the issue and my repro.

I believe NVIDIA intends to fix their driver regression.
That is all I know.
ID: 2017601 · Report as offensive     Reply Quote
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4928
Credit: 668,333,734
RAC: 1,431,455
United States
Message 2017602 - Posted: 2 Nov 2019, 21:54:17 UTC

When Apple updated their OpenCL driver FOUR years ago Raistmer's code stopped working with the nVidia Mac builds. It would appear Windows has finally caught up to where Apple was Four years ago. I kinda thought it would happen sooner ;-) Now Four years later, Raistmer's code still will not build a correctly working nVidia Mac App, you have to use Raistmer's Intel GPU build for nVidia Mac GPUs. Fortunately for Windows, the problem isn't quite as severe. You'll have to ask Raistmer why his SoG path has problems with the newer OpenCL.
For the Mac, Eric just switched the Apps on the SETI Server to use the Non-SoG App, I'm not sure what he plans to do with the Windows Apps.
ID: 2017602 · Report as offensive     Reply Quote
Profile Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 17224
Credit: 240,454,058
RAC: 179,391
Australia
Message 2017610 - Posted: 2 Nov 2019, 22:21:37 UTC - in response to Message 2017597.  

At this point it's pretty clear the problem doesn't exist when using the Non-SoG version of the App.
So I guess the questions are
1 What's different between the applications?
2 What's different about the recent drivers?
3 Affects Win10, but not Win7 or Linux (no one has yet tested Win8/8.1 that I know of).

Cheers.
ID: 2017610 · Report as offensive     Reply Quote
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 11872
Credit: 184,694,823
RAC: 238,289
Australia
Message 2017611 - Posted: 2 Nov 2019, 22:23:54 UTC - in response to Message 2017601.  
Last modified: 2 Nov 2019, 22:29:00 UTC

If you read through this thread, and find my other thread, you'll see I already did the research.
It's a driver regression. Only NVIDIA can diagnose further.
I understand that, I can see that something changed with v435- and has stayed that way since. I'm just curious as to what is different. And I doubt Nvidia consider it a regression, I'd just call it a change, but I expect Nvidia management would call it an improvement (no matter how many steps (or leaps) backward it might be).
And the same for the application itself- what is it that is different between them for one to have an issue with the driver change, and the other not to.

Edit- 3 Only affects Win10 (what functions does the GDI of Win10 support that the other OSes don't?)

The answer to any one of those 3 questions would most likely give the answer to the other two.
Grant
Darwin NT
ID: 2017611 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 100
Credit: 9,389,107
RAC: 1,280
United States
Message 2017613 - Posted: 2 Nov 2019, 22:25:47 UTC - in response to Message 2017611.  
Last modified: 2 Nov 2019, 22:27:13 UTC

You can try asking NVIDIA, if you'd like, regarding "what changed in the driver to break it"... But I doubt you'll get a solid answer.
Anyway, I've been instructed to wait patiently and see if upcoming driver versions fix the issue or not. And that is my plan.
Solid repro in place, fully ready to test! ;)
ID: 2017613 · Report as offensive     Reply Quote
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 11872
Credit: 184,694,823
RAC: 238,289
Australia
Message 2017614 - Posted: 2 Nov 2019, 22:32:23 UTC - in response to Message 2017613.  

You can try asking NVIDIA, if you'd like, regarding "what changed in the driver to break it"... But I doubt you'll get a solid answer.
I agree.
And if Raistmer were around we could ask him about the differences in the application; the answer there might be of use to Nvidia.
Grant
Darwin NT
ID: 2017614 · Report as offensive     Reply Quote
Profile Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 17224
Credit: 240,454,058
RAC: 179,391
Australia
Message 2017618 - Posted: 2 Nov 2019, 23:02:07 UTC - in response to Message 2017610.  

At this point it's pretty clear the problem doesn't exist when using the Non-SoG version of the App.
So I guess the questions are
1 What's different between the applications?
2 What's different about the recent drivers?
3 Affects Win10, but not Win7 or Linux (no one has yet tested Win8/8.1 that I know of).


OK, I finally found a Win8.1 rig running the 436.30 driver (after a lot of searching) that also doesn't show any signs of being affected (though the CPU is overcommitted), so the driver problem only affects Win10 rigs.

So is the problem actually Nvidia's or is it something that M$ has done to Win10 without telling anyone?

Cheers.
ID: 2017618 · Report as offensive     Reply Quote
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4928
Credit: 668,333,734
RAC: 1,431,455
United States
Message 2017619 - Posted: 2 Nov 2019, 23:04:03 UTC
Last modified: 2 Nov 2019, 23:22:46 UTC

Waiting on a new Working driver can be frustrating. If you were a Mac user waiting for a driver that worked with the old nVidia SoG App you would have been waiting Four Years. Meanwhile, this Windows App works right now, http://boinc2.ssl.berkeley.edu/beta/download/setiathome_8.16_windows_intelx86__opencl_nvidia_sah.exe
It may not be for everyone, but, for those that have stopped running SETI because of the driver, it may be the answer.

BTW, I posted a while back that Windows 7 & 8.x don't have the problem; I even tested it on my Win8.1 system. The Macs don't have the problem in Yosemite or Mavericks either, it started with the El Capitan 'Upgrade'.
ID: 2017619 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 100
Credit: 9,389,107
RAC: 1,280
United States
Message 2017622 - Posted: 2 Nov 2019, 23:21:32 UTC - in response to Message 2017618.  
Last modified: 2 Nov 2019, 23:31:02 UTC

Wiggo,

I have a couple questions.

1) I thought the task had to be a "VHAR" task, to get the failure. For that user you mentioned, were you able to find such a task, to prove that it's working for them? I don't know much about these, but we're looking for something like "WU true angle range is : 2.727445" (with that high of a number, 2.72), in the task output.

2) I see a lot of their tasks are Anonymous Platform, but is there any proof that they are using the SoG application? I don't know how to know.

3) Looking at their errors, perhaps they ARE having the problem after all!

Check this task:
https://setiathome.berkeley.edu/result.php?resultid=8186981661

Microsoft Windows 8.1
Driver version: 436.30
Name: GeForce GTX 960
WU true angle range is : 2.724311
ERROR: OpenCL kernel/call 'clEnqueueTask(cq,Autocorr_logging_kernel_cl)' call failed (-5) in file ..\analyzeFuncs.cpp near line 3795.
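
As a rough illustration of what to look for in a task's output, here is a Python sketch that pulls the driver version, the "WU true angle range" value and the failing OpenCL call out of text like the lines above. The function names are invented for this example, and `min_angle_range` is a hypothetical parameter: nobody in this thread has stated the exact cut-off for a VHAR task, so the caller has to supply one.

```python
import re

# Sketch only: extract the fields discussed above from a task's stderr text.
ANGLE_RE = re.compile(r"WU true angle range is\s*:\s*([0-9.]+)")
DRIVER_RE = re.compile(r"Driver version:\s*([0-9]+)\.([0-9]+)")
OPENCL_ERR_RE = re.compile(
    r"ERROR: OpenCL kernel/call '([^']+)' call failed \((-?\d+)\)"
)

FIRST_AFFECTED_DRIVER = (436, 2)  # 436.02, per the summary earlier in the thread


def summarize_task_output(text):
    """Return the angle range, driver version and OpenCL error, if present."""
    angle = ANGLE_RE.search(text)
    driver = DRIVER_RE.search(text)
    error = OPENCL_ERR_RE.search(text)
    return {
        "angle_range": float(angle.group(1)) if angle else None,
        "driver": (int(driver.group(1)), int(driver.group(2))) if driver else None,
        "failed_call": error.group(1) if error else None,
        "error_code": int(error.group(2)) if error else None,
    }


def looks_like_repro(summary, min_angle_range):
    """True if the driver is 436.02 or later and the angle range is 'high'.

    min_angle_range is a caller-supplied assumption, not a value from the thread.
    """
    if summary["driver"] is None or summary["angle_range"] is None:
        return False
    return (summary["driver"] >= FIRST_AFFECTED_DRIVER
            and summary["angle_range"] >= min_angle_range)


# Raw string so the backslash in the file path is kept as-is.
SAMPLE = r"""Microsoft Windows 8.1
Driver version: 436.30
Name: GeForce GTX 960
WU true angle range is : 2.724311
ERROR: OpenCL kernel/call 'clEnqueueTask(cq,Autocorr_logging_kernel_cl)' call failed (-5) in file ..\analyzeFuncs.cpp near line 3795.
"""

if __name__ == "__main__":
    summary = summarize_task_output(SAMPLE)
    print(summary)
    print(looks_like_repro(summary, min_angle_range=2.0))  # 2.0 is illustrative
```

This only shows that a task matches the conditions under discussion; it says nothing about whether the application involved is the SoG build.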


I don't think this error is OS specific, but do not know for sure.

Regards,
Jacob
ID: 2017622 · Report as offensive     Reply Quote
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 8250
Credit: 519,126,255
RAC: 407,087
Panama
Message 2017623 - Posted: 2 Nov 2019, 23:26:17 UTC - in response to Message 2017622.  

2) I see a lot of their tasks are Anonymous Platform, but is there any proof that they are using the SoG application? I don't know how to know.

Look at the stderr output:

Windows optimized setiathome_v8 application
Based on Intel, Core 2-optimized v8-nographics V5.13 by Alex Kan
SSE3xj Win32 Build 3557 , Ported by : Raistmer, JDWhale

SETI8 update by Raistmer

OpenCL version by Raistmer, r3557


ID: 2017623 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 100
Credit: 9,389,107
RAC: 1,280
United States
Message 2017625 - Posted: 2 Nov 2019, 23:29:03 UTC - in response to Message 2017623.  
Last modified: 2 Nov 2019, 23:30:03 UTC

Juan:
Where does that say SoG (which NVIDIA broke), versus sah (which TBar claims still works)?
ID: 2017625 · Report as offensive     Reply Quote
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 8250
Credit: 519,126,255
RAC: 407,087
Panama
Message 2017627 - Posted: 2 Nov 2019, 23:35:33 UTC - in response to Message 2017625.  
Last modified: 2 Nov 2019, 23:39:33 UTC

Juan:
Where does that say SoG (which NVIDIA broke), versus sah (which TBar claims still works)?

OpenCL version by Raistmer, r3557

For more explanation look at: https://setiathome.berkeley.edu/forum_thread.php?id=80299#1818882
ID: 2017627 · Report as offensive     Reply Quote
Profile Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 17224
Credit: 240,454,058
RAC: 179,391
Australia
Message 2017628 - Posted: 2 Nov 2019, 23:36:29 UTC

Those 2 error work units are different, "-226 (0xFFFFFF1E) ERR_TOO_MANY_EXITS" (short runtime, probably not getting CPU access through the over commitment) and not "197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED" (extra long runtime) associated with Win10 and the latest drivers.
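
For reference, the hex values in brackets are just the same exit codes shown as unsigned 32-bit numbers. A small Python sketch of the conversion follows; the two names in the table come from the post above, while the helper names are made up for illustration.

```python
# Sketch only: show how the decimal and hex forms of the two exit codes relate.
EXIT_CODE_NAMES = {
    -226: "ERR_TOO_MANY_EXITS",
    197: "EXIT_TIME_LIMIT_EXCEEDED",
}


def as_unsigned_32(code):
    """Reinterpret a signed exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def format_exit_code(code):
    """Format a code the way the task pages show it, e.g. '197 (0x000000C5)'."""
    return "{} (0x{:08X})".format(code, as_unsigned_32(code))


def describe_exit_code(code):
    """Add the symbolic name when it is one of the two codes discussed here."""
    name = EXIT_CODE_NAMES.get(code, "not covered in this sketch")
    return "{} {}".format(format_exit_code(code), name)


if __name__ == "__main__":
    print(describe_exit_code(-226))
    print(describe_exit_code(197))
```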

Cheers.
ID: 2017628 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 100
Credit: 9,389,107
RAC: 1,280
United States
Message 2017629 - Posted: 2 Nov 2019, 23:41:23 UTC - in response to Message 2017628.  
Last modified: 3 Nov 2019, 0:04:04 UTC

Hmm... But ... As I explained in my results, when the problem happens, the Maxwell results are different than the Pascal/Turing results.

When the problem happens:

Maxwell (GTX 980 Ti, GTX 980):
The program errors with a line similar to:
ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995.
Also with the line:
Waiting 30 sec before restart...
I believe BOINC then shows "Postponed" and tries again later... over and over and over.
... Until it finally fails it.
You can see this behavior in that Win 8.1 user's task, on their GTX 960 (also Maxwell):
... albeit their error is slightly different. So not sure what to make of that.
https://setiathome.berkeley.edu/result.php?resultid=8186981661

Pascal (GTX 1050 Ti) / Turing (RTX 2080):
The program runs indefinitely and does nothing.
I believe BOINC lets the task run doing nothing, until a time limit is exceeded and it ends it.

Those 2 different behaviors, are both this same bug --
NVIDIA 436.xx and later drivers broke SETI OpenCL SoG tasks, on Maxwell/Pascal/Turing, for VHAR (Very High Angle Range) tasks with large "WU true angle range" values like 2.72.
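
To make the two behaviours easier to tell apart when reading task pages, here is a hedged Python sketch. The labels and function name are invented for this example. The symbolic names for -5 and -36 are the standard OpenCL error constants. Mapping exit code 197 to the "runs until the time limit" behaviour follows Wiggo's description earlier in the thread and is an assumption here.

```python
import re

# Standard OpenCL error constants for the two codes seen in this thread.
OPENCL_ERROR_NAMES = {
    -5: "CL_OUT_OF_RESOURCES",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

FAILED_CALL_RE = re.compile(r"OpenCL kernel/call '([^']+)' call failed \((-?\d+)\)")
RESTART_MARKER = "Waiting 30 sec before restart..."
EXIT_TIME_LIMIT_EXCEEDED = 197  # assumption: how the 'runs forever' case ends


def classify_symptom(stderr_text, exit_code=None):
    """Return (label, opencl_error_name) for a task's stderr and exit code.

    Labels are illustrative:
      'error-and-retry'       - failed OpenCL call plus the restart message
      'opencl-error'          - failed OpenCL call without the restart message
      'hang-until-time-limit' - no OpenCL error, task ended by the time limit
      'no-match'              - neither pattern found
    """
    match = FAILED_CALL_RE.search(stderr_text)
    if match:
        code = int(match.group(2))
        name = OPENCL_ERROR_NAMES.get(code, "unknown OpenCL error")
        if RESTART_MARKER in stderr_text:
            return "error-and-retry", name
        return "opencl-error", name
    if exit_code == EXIT_TIME_LIMIT_EXCEEDED:
        return "hang-until-time-limit", None
    return "no-match", None


if __name__ == "__main__":
    maxwell_like = (
        "ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' "
        "call failed (-36) in file analyzeFuncs.cpp near line 1995.\n"
        "Waiting 30 sec before restart...\n"
    )
    print(classify_symptom(maxwell_like))
    print(classify_symptom("", exit_code=197))
```

A match here only means the output looks like one of the two patterns described above; it does not prove the driver is the cause.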

Maybe a mod could consider changing the topic's title to match that. Just a thought.
ID: 2017629 · Report as offensive     Reply Quote
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 11872
Credit: 184,694,823
RAC: 238,289
Australia
Message 2017634 - Posted: 3 Nov 2019, 0:12:08 UTC - in response to Message 2017618.  

So is the problem actually Nvidia's or is it something that M$ has done to Win10 without telling anyone?
I'd say Nvidia.
A change in driver broke an existing application on an existing OS. Unless the application was using an unsupported method, or using a known bug in the Nvidia OpenCL support in order to function, then it's a bug that's been introduced by Nvidia.


There have been plenty of issues in the past that were the result of OS changes. Some of the classics are from back in the DOS and very early Windows days, where programmers made use of a loophole in DOS to speed up their programmes. These techniques were not recognised as valid, because they made use of what was a bug in the OS, but many programmers used them. When the bug was eventually fixed (I think it took about a decade), those programmes ceased to work.

Then there was the crap that was Macrovision.
Way back in the days of video tape, the movie companies didn't like people copying their tapes, so Macrovision came up with their copy protection system. Unfortunately it changed the vertical timing signals so they were no longer compliant with the specification. Many TV & VCR combinations were capable of dealing with this non-compliant signal, however there were many that weren't, and the symptom was what we called flag waving: the top part of the screen might pull a small amount, or as much as the top 2/3 of the image would pull back and forth.
On some TVs there were modifications that could help with some VCRs with some tapes, but in many cases nothing could be done.
Eventually timebase correctors came along, cheap enough, to strip out the Macrovision and leave a nice clean compliant timing signal. No more flag waving.

Macrovision always blamed the TV and/or VCR for not being able to play their protected tapes, but the issue was all because of their non-compliance with an established specification.
I'm thinking Nvidia have made a change from an existing implementation/specification, and that's impacting software that makes use of it.
Grant
Darwin NT
ID: 2017634 · Report as offensive     Reply Quote
Profile Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 17224
Credit: 240,454,058
RAC: 179,391
Australia
Message 2017636 - Posted: 3 Nov 2019, 0:23:49 UTC

OK, I finally found another Win8.1 rig running driver 441.08 with a completely clean slate and plenty of these results: https://setiathome.berkeley.edu/result.php?resultid=8193002756

And another 1, https://setiathome.berkeley.edu/show_host_detail.php?hostid=7827382

It was hard enough finding Win8/8.1 rigs in the top 2400 hosts, but even harder finding those using the latest drivers.

Cheers.
ID: 2017636 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 100
Credit: 9,389,107
RAC: 1,280
United States
Message 2017639 - Posted: 3 Nov 2019, 0:55:13 UTC
Last modified: 3 Nov 2019, 0:55:20 UTC

Interesting findings. Thank you for finding and sharing.
ID: 2017639 · Report as offensive     Reply Quote
Profile Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 17224
Credit: 240,454,058
RAC: 179,391
Australia
Message 2017640 - Posted: 3 Nov 2019, 1:04:22 UTC
Last modified: 3 Nov 2019, 1:05:23 UTC

I'm just saying that it wouldn't be the first time that M$ itself has broken something driver-wise over the years by throwing in an undocumented update, and it would not surprise me if it's happened again.

Sorry, but ATM I can't wholly lay the blame on Nvidia as yet when it's only the 1 OS being affected. ;-)

Cheers.
ID: 2017640 · Report as offensive     Reply Quote
Jacob Klein
Volunteer tester

Joined: 15 Apr 11
Posts: 100
Credit: 9,389,107
RAC: 1,280
United States
Message 2017643 - Posted: 3 Nov 2019, 1:08:53 UTC - in response to Message 2017640.  

I guess that's fair. But again, 431.60 works fine on Windows 10 with regard to this issue, so... we're back to where we started: NVIDIA will have to diagnose :)
ID: 2017643 · Report as offensive     Reply Quote
 
©2019 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.