NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units
Author | Message |
---|---|
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13855 Credit: 208,696,464 RAC: 304 |
"I am hopeful NVIDIA will help out the Normal crunchers soon! :)" Yep. There are new models coming (and rumours are Ampere will be released in the first half of 2020), and you will need the latest drivers to use them. Grant Darwin NT |
Jacob Klein Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
"At this point it's pretty clear the problem doesn't exist when using the Non-SoG version of the App. So I guess the questions are" If you read through this thread, and find my other thread, you'll see I already did the research. It's a driver regression. Only NVIDIA can diagnose further. Here's my other thread: https://setiathome.berkeley.edu/forum_thread.php?id=84780 Here's a summary: 431.60 is the last Release 430 driver, and it works correctly. Release 435 was a major driver update, and it had a regression that made it not work. All R435 drivers failed: 436.02 (8/20/2019), 436.15 (8/27/2019), 436.30 (9/10/2019), 436.48 (10/1/2019). Release 440 was a major driver update, and it still had the regression that made it not work. All R440 drivers have failed up to today: 440.97 (10/22/2019), 441.08 (10/29/2019). Here's my timeline of research: 10/20/2019: I hit the issue personally, and began my research. 10/21/2019: My NVIDIA contact informed me that they were aware of an issue that they thought they had fixed. 10/22/2019: I formed a solid repro of the problem, and passed the info to NVIDIA. https://setiathome.berkeley.edu/forum_thread.php?id=84780&postid=2016218 10/25/2019: My NVIDIA contact informed me that they are now actively looking into the issue and my repro. I believe NVIDIA intends to fix their driver regression. That is all I know. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
When Apple updated their OpenCL driver FOUR years ago Raistmer's code stopped working with the nVidia Mac builds. It would appear Windows has finally caught up to where Apple was Four years ago. I kinda thought it would happen sooner ;-) Now Four years later, Raistmer's code still will not build a correctly working nVidia Mac App, you have to use Raistmer's Intel GPU build for nVidia Mac GPUs. Fortunately for Windows, the problem isn't quite as severe. You'll have to ask Raistmer why his SoG path has problems with the newer OpenCL. For the Mac, Eric just switched the Apps on the SETI Server to use the Non-SoG App, I'm not sure what he plans to do with the Windows Apps. |
Wiggo Send message Joined: 24 Jan 00 Posts: 36827 Credit: 261,360,520 RAC: 489 |
"At this point it's pretty clear the problem doesn't exist when using the Non-SoG version of the App. So I guess the questions are" 3 Affects Win10, but not Win7 or Linux (no one has yet tested Win8/8.1 that I know of). Cheers. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13855 Credit: 208,696,464 RAC: 304 |
"If you read through this thread, and find my other thread, you'll see I already did the research." I understand that; I can see that something changed with v435 and has stayed that way since. I'm just curious as to what is different. And I doubt Nvidia consider it a regression; I'd just call it a change, but I expect Nvidia management would call it an improvement (no matter how many steps (or leaps) backward it might be). And the same for the application itself: what is it that is different between them for one to have an issue with the driver change, and the other not to? Edit: 3 Only affects Win10 (what functions does the GDI of Win10 support that the other OSes don't?) The answer to any one of those 3 questions would most likely give the answer to the other two. Grant Darwin NT |
Jacob Klein Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
You can try asking NVIDIA, if you'd like, regarding "what changed in the driver to break it"... But I doubt you'll get a solid answer. Anyway, I've been instructed to wait patiently and see if upcoming driver versions fix the issue or not. And that is my plan. Solid repro in place, fully ready to test! ;) |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13855 Credit: 208,696,464 RAC: 304 |
"You can try asking NVIDIA, if you'd like, regarding 'what changed in the driver to break it'... But I doubt you'll get a solid answer." I agree. And if Raistmer were around we could ask him about the differences in the application; the answer there might be of use to Nvidia. Grant Darwin NT |
Wiggo Send message Joined: 24 Jan 00 Posts: 36827 Credit: 261,360,520 RAC: 489 |
"At this point it's pretty clear the problem doesn't exist when using the Non-SoG version of the App. So I guess the questions are 3 Affects Win10, but not Win7 or Linux (no one has yet tested Win8/8.1 that I know of)." OK, I finally found a Win8.1 rig running the 436.30 driver (after a lot of searching) that also doesn't show any signs of being affected (though the CPU is overcommitted), so the driver problem only affects Win10 rigs. So is the problem actually Nvidia's or is it something that M$ has done to Win10 without telling anyone? Cheers. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Waiting on a new Working driver can be frustrating. If you were a Mac user waiting for a driver that worked with the old nVidia SoG App you would have been waiting Four Years. Meanwhile, this Windows App works right now, http://boinc2.ssl.berkeley.edu/beta/download/setiathome_8.16_windows_intelx86__opencl_nvidia_sah.exe It may not be for everyone, but, for those that have stopped running SETI because of the driver, it may be the answer. BTW, I posted a while back that Windows 7 & 8.x don't have the problem; I even tested it on my Win8.1 system. The Macs don't have the problem in Yosemite or Mavericks either; it started with the El Capitan 'Upgrade'. |
Jacob Klein Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
Wiggo, I have a couple questions. 1) I thought the task had to be a "VHAR" task, to get the failure. For that user you mentioned, were you able to find such a task, to prove that it's working for them? I don't know much about these, but we're looking for something like "WU true angle range is : 2.727445" (with that high of a number, 2.72), in the task output. 2) I see a lot of their tasks are Anonymous Platform, but is there any proof that they are using the SoG application? I don't know how to know. 3) Looking at their errors, perhaps they ARE having the problem after all! Check this task: https://setiathome.berkeley.edu/result.php?resultid=8186981661 Microsoft Windows 8.1 Driver version: 436.30 Name: GeForce GTX 960 WU true angle range is : 2.724311 ERROR: OpenCL kernel/call 'clEnqueueTask(cq,Autocorr_logging_kernel_cl)' call failed (-5) in file ..\analyzeFuncs.cpp near line 3795. I don't think this error is OS specific, but do not know for sure. Regards, Jacob |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"2) I see a lot of their tasks are Anonymous Platform, but is there any proof that they are using the SoG application? I don't know how to know." Look at the stderr output: Windows optimized setiathome_v8 application Based on Intel, Core 2-optimized v8-nographics V5.13 by Alex Kan SSE3xj Win32 Build 3557 , Ported by : Raistmer, JDWhale SETI8 update by Raistmer OpenCL version by Raistmer, r3557 |
Jacob Klein Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
Juan: Where does that say SoG (which NVIDIA broke), versus sah (which TBar claims still works)? |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Juan: OpenCL version by Raistmer, r3557 For more explanation look at: https://setiathome.berkeley.edu/forum_thread.php?id=80299#1818882 |
Wiggo Send message Joined: 24 Jan 00 Posts: 36827 Credit: 261,360,520 RAC: 489 |
Those 2 error work units are different, "-226 (0xFFFFFF1E) ERR_TOO_MANY_EXITS" (short runtime, probably not getting CPU access through the over commitment) and not "197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED" (extra long runtime) associated with Win10 and the latest drivers. Cheers. |
Jacob Klein Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
Hmm... But ... As I explained in my results, when the problem happens, the Maxwell results are different than the Pascal/Turing results. When the problem happens: Maxwell (GTX 980 Ti, GTX 980): The program errors with a line similar to: ERROR: OpenCL kernel/call 'clEnqueueMapBuffer(gpu_GPUState)' call failed (-36) in file ..\analyzeFuncs.cpp near line 1995. Also with the line: Waiting 30 sec before restart... I believe BOINC then shows "Postponed" and tries again later... over and over and over ... until it finally fails the task. You can see this behavior in that Win 8.1 user's task, on their GTX 960 (Maxwell) ... albeit their error is slightly different. So not sure what to make of that. https://setiathome.berkeley.edu/result.php?resultid=8186981661 Pascal (GTX 1050 Ti) / Turing (RTX 2080): The program runs indefinitely and does nothing. I believe BOINC lets the task run doing nothing, until a time limit is exceeded and it ends it. Those 2 different behaviors are both this same bug -- NVIDIA 436.xx and later drivers broke SETI OpenCL SoG tasks, on Maxwell/Pascal/Turing, for VHAR (Very High Angle Range) tasks with large "WU true angle range" values like 2.72. Maybe a mod could consider changing the topic's title to match that. Just a thought. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13855 Credit: 208,696,464 RAC: 304 |
"So is the problem actually Nvidia's or is it something that M$ has done to Win10 without telling anyone?" I'd say Nvidia. A change in driver broke an existing application on an existing OS. Unless the application was using an unsupported method, or using a known bug in the Nvidia OpenCL support in order to function, then it's a bug that's been introduced by Nvidia. There have been plenty of issues in the past that were the result of OS changes. Some of the classics are from back in the DOS and very early Windows days, where programmers made use of a loophole in DOS to speed up their programmes. These techniques were not recognised as valid, because they made use of what was a bug with the OS, but many programmers used them. When the bug was eventually fixed (I think it took about a decade), those programmes ceased to work. Then there was the crap that was Macrovision. Way back in the days of video tape, the movie companies didn't like people copying their tapes, so Macrovision came up with their copy protection system. Unfortunately it changed the vertical timing signals so they were no longer compliant with the specification. Many TV & VCR combinations were capable of dealing with this non-compliant signal; however, there were many that weren't, and the symptoms were what we called flag waving: the top part of the screen might pull a small amount, or as much as the top 2/3 of the image would pull back and forth. On some TVs there were modifications that could help with some VCRs with some tapes, but in many cases nothing could be done. Eventually timebase correctors came along, cheap enough, to strip out the Macrovision and leave a nice clean compliant timing signal. No more flag waving. Macrovision always blamed the TV and/or VCR for not being able to play their protected tapes, but the issue was all because of their non-compliance with an established specification. 
I'm thinking Nvidia have made a change from an existing implementation/specification, and that's impacting software that makes use of it. Grant Darwin NT |
Wiggo Send message Joined: 24 Jan 00 Posts: 36827 Credit: 261,360,520 RAC: 489 |
Ok, I finally found another Win8.1 rig running driver 441.08 with a completely clean slate with plenty of these results, https://setiathome.berkeley.edu/result.php?resultid=8193002756. And another 1, https://setiathome.berkeley.edu/show_host_detail.php?hostid=7827382 It was hard enough finding Win8/8.1 rigs in the top 2400 hosts, but even harder finding those using the latest drivers. Cheers. |
Jacob Klein Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
Interesting findings. Thank you for finding and sharing. |
Wiggo Send message Joined: 24 Jan 00 Posts: 36827 Credit: 261,360,520 RAC: 489 |
I'm just saying that it wouldn't be the first time that M$ itself has broken something driver-wise over the years by throwing in an undocumented update, and it would not surprise me if that's happened again. Sorry, but ATM I just can't wholly lay the blame on Nvidia as yet when it's only the 1 OS being affected. ;-) Cheers. |
Jacob Klein Send message Joined: 15 Apr 11 Posts: 149 Credit: 9,783,406 RAC: 9 |
I guess that's fair. But, again 431.60 works fine on Windows 10 in regards to this issue, so.... We're back to where we started --- NVIDIA will have to diagnose :) |
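
For anyone wanting to check a host against Jacob Klein's driver summary earlier in the thread, here is a minimal sketch. It encodes only the versions reported in this thread (431.60 working; 436.02 through 441.08 failing); it is not an official NVIDIA list, and the function names are made up for illustration.

```python
# Driver versions as reported in this thread (not an official NVIDIA list).
LAST_GOOD = (431, 60)      # last Release 430 driver, reported working
FIRST_BAD = (436, 2)       # 436.02, first R435 driver reported failing
LATEST_TESTED = (441, 8)   # 441.08, newest driver tested when this was posted


def parse_version(text):
    """Turn a driver string such as '436.30' into a comparable tuple (436, 30)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def thread_status(version_text):
    """Classify a driver version using only the reports in this thread."""
    version = parse_version(version_text)
    if version == LAST_GOOD:
        return "reported working"
    if FIRST_BAD <= version <= LATEST_TESTED:
        return "reported failing"
    # Anything older, newer, or in between was not tested here.
    return "not covered by this thread"


if __name__ == "__main__":
    for v in ("431.60", "436.30", "441.08", "445.87"):
        print(v, "->", thread_status(v))
```

Versions outside the tested range are deliberately left unclassified, since the thread gives no evidence either way for them.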
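
The stderr lines quoted in Jacob Klein's posts ("WU true angle range is : ...", "OpenCL kernel/call '...' call failed (-5)") can be pulled out of a task's output with a couple of regular expressions. The sketch below is only an illustration: the helper name, the sample layout (one field per line) and the `VHAR_THRESHOLD` cut-off are assumptions, since the thread only says that values around 2.72 trigger the problem. The two error names are the standard OpenCL status codes from the Khronos `cl.h` header.

```python
import re

# Standard OpenCL status codes (Khronos cl.h) for the two values seen in the thread.
OPENCL_ERRORS = {
    -5: "CL_OUT_OF_RESOURCES",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

# Hypothetical cut-off for calling a task "VHAR"; not taken from the thread.
VHAR_THRESHOLD = 1.0

ANGLE_RE = re.compile(r"WU true angle range is\s*:\s*([0-9]+(?:\.[0-9]+)?)")
ERROR_RE = re.compile(r"OpenCL kernel/call '([^']+)' call failed \((-?\d+)\)")


def summarize_stderr(text):
    """Extract the angle range and any failed OpenCL calls from stderr text."""
    angle = ANGLE_RE.search(text)
    angle_range = float(angle.group(1)) if angle else None
    failures = [
        (call, int(code), OPENCL_ERRORS.get(int(code), "unknown"))
        for call, code in ERROR_RE.findall(text)
    ]
    return {
        "angle_range": angle_range,
        "is_vhar": angle_range is not None and angle_range > VHAR_THRESHOLD,
        "failures": failures,
        # The Maxwell symptom described in the thread includes this retry line.
        "restart_loop": "Waiting 30 sec before restart" in text,
    }


# Sample assembled from the lines quoted in the thread; the line layout is assumed.
SAMPLE = """\
Driver version: 436.30
Name: GeForce GTX 960
WU true angle range is : 2.724311
ERROR: OpenCL kernel/call 'clEnqueueTask(cq,Autocorr_logging_kernel_cl)' call failed (-5) in file ..\\analyzeFuncs.cpp near line 3795.
"""

if __name__ == "__main__":
    print(summarize_stderr(SAMPLE))
```

Knowing that -5 and -36 are different OpenCL errors does not by itself say whether the two tasks hit the same driver bug; it only makes the comparison between hosts easier to read.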
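
Wiggo's post distinguishes two exit statuses, "-226 (0xFFFFFF1E) ERR_TOO_MANY_EXITS" and "197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED". The hex value is just the signed code viewed as a 32-bit two's-complement number, which a short sketch can confirm. The table below holds only the two codes named in the thread, and the function names are illustrative.

```python
# Exit statuses quoted in the thread, with the names given there.
EXIT_NAMES = {
    -226: "ERR_TOO_MANY_EXITS",
    197: "EXIT_TIME_LIMIT_EXCEEDED",
}


def as_hex32(code):
    """Format a signed exit code as 32-bit two's-complement hex."""
    return f"0x{code & 0xFFFFFFFF:08X}"


def describe(code):
    """Render an exit code in the same 'decimal (hex) NAME' form used in the thread."""
    name = EXIT_NAMES.get(code, "unknown")
    return f"{code} ({as_hex32(code)}) {name}"


if __name__ == "__main__":
    print(describe(-226))
    print(describe(197))
```

This only reproduces the notation; which status a task ends with still has to be read from its result page.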