NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units

Profile Whirling Steel
Joined: 21 Sep 19
Posts: 8
Credit: 84,779
RAC: 0
United States
Message 2015199 - Posted: 12 Oct 2019, 19:33:29 UTC - in response to Message 2015113.  

One thing I know about technology companies: it's all about the numbers.
If enough people complain, I think they will act. But since Seti@home is the
only program on my system that is/was affected, I honestly don't think the Seti user
base is enough to move the great NVidia to action (just my opinion).
ID: 2015199
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2015209 - Posted: 12 Oct 2019, 21:15:13 UTC - in response to Message 2015199.  

The drivers affect more projects than just Seti. But you are correct, the number of complaints from distributed computing users running consumer hardware on Compute loads is very small compared to the number of users just running normal graphics/game loads. But if the problem is also seen over on the professional side with the users of Tesla and Quadro business class cards, then they would be forced to do something about it because of the greater numbers of users and the prices they command for that hardware.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2015209
Profile SongBird
Volunteer tester
Joined: 23 Oct 01
Posts: 104
Credit: 164,826,157
RAC: 297
Bulgaria
Message 2015327 - Posted: 13 Oct 2019, 17:58:59 UTC

I believe I may have the same problem. What is not discussed in this thread, though, is that the issue may not be that the workunits take too long to finish but rather that the GPU is not working at all.

Can someone verify that indeed the GPU is idling on these units?
ID: 2015327
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2015332 - Posted: 13 Oct 2019, 18:36:19 UTC - in response to Message 2015327.  
Last modified: 13 Oct 2019, 18:50:53 UTC

I believe I may have the same problem. What is not discussed in this thread, though, is that the issue may not be that the workunits take too long to finish but rather that the GPU is not working at all.

Can someone verify that indeed the GPU is idling on these units?

As posted in the other thread, you need to roll back the driver and stay away from 436.xx, or you will never be clear of this error.

This is what your crunched WU said:

setiathome_CUDA: Found 1 CUDA device(s):
nVidia Driver Version 436.48
  Device 1: GeForce GTX 1660 Ti, 4095 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 24 
     pciBusID = 3, pciSlotID = 0
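
A task's stderr header like the one above can be checked for the driver branch automatically. The following is only a minimal sketch in Python: the 436 cut-off is taken from this thread's title, and the function names are made up for illustration, not part of any SETI@home or BOINC tool.

```python
import re

# Assumption: 436 is the first driver branch reported as problematic in this thread.
FIRST_SUSPECT_BRANCH = 436

def parse_driver_version(stderr_text):
    """Return (major, minor) from a 'nVidia Driver Version X.Y' line, or None."""
    match = re.search(r"nVidia Driver Version\s+(\d+)\.(\d+)", stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def is_suspect_driver(stderr_text):
    """True when the reported driver branch is 436 or later."""
    version = parse_driver_version(stderr_text)
    return version is not None and version[0] >= FIRST_SUSPECT_BRANCH

sample = """setiathome_CUDA: Found 1 CUDA device(s):
nVidia Driver Version 436.48
  Device 1: GeForce GTX 1660 Ti, 4095 MiB, regsPerBlock 65536
"""

print(parse_driver_version(sample))  # (436, 48)
print(is_suspect_driver(sample))     # True
```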


BTW, after you roll back the driver, reinstall Lunatics with the SoG option selected; your host will crunch the BLC work a lot faster.
ID: 2015332
Profile SongBird
Volunteer tester
Joined: 23 Oct 01
Posts: 104
Credit: 164,826,157
RAC: 297
Bulgaria
Message 2015335 - Posted: 13 Oct 2019, 18:58:29 UTC - in response to Message 2015332.  
Last modified: 13 Oct 2019, 19:04:18 UTC

I rolled the drivers back. Crunched a couple of tasks already. Will keep in touch if it happens again.

Regarding the GPU idling I'm still curious if this is the case.

Apropos of nothing, I'm old enough to remember when crunching on GPU did not take up a full CPU core... Just looked at this shit and got mad again:
Run time 19 min 17 sec
CPU time 18 min 3 sec
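
To put a number on those two figures, here is a small sketch in Python that converts them to seconds and computes how much of the run time was spent on the CPU. The helper names are made up for illustration, and the parser only handles the "N min N sec" form shown above.

```python
import re

def to_seconds(duration):
    """Convert a string such as '19 min 17 sec' to a number of seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*min\s+(\d+)\s*sec\s*", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    return int(match.group(1)) * 60 + int(match.group(2))

def cpu_fraction(run_time, cpu_time):
    """Fraction of the wall-clock run time during which a CPU core was busy."""
    return to_seconds(cpu_time) / to_seconds(run_time)

ratio = cpu_fraction("19 min 17 sec", "18 min 3 sec")
print(f"{ratio:.1%}")  # 93.6%
```

In other words, the task kept one CPU core busy for roughly 94% of its run time.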

ID: 2015335
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2015336 - Posted: 13 Oct 2019, 19:10:30 UTC - in response to Message 2015335.  
Last modified: 13 Oct 2019, 19:13:59 UTC

I rolled the drivers back. Crunched a couple of tasks already. Will keep in touch if it happens again.

Regarding the GPU idling I'm still curious if this is the case.

Apropos of nothing, I'm old enough to remember when crunching on GPU did not take up a full CPU core... Just looked at this shit and got mad again:
Run time 19 min 17 sec
CPU time 18 min 3 sec

About the GPU idling: it probably comes from the driver incompatibility, some kind of empty loop where the GPU is left doing nothing, as you posted (0% usage).

About the CPU usage: different crunching programs need different amounts of CPU support.

Now you could look at adding some configuration parameters to the OpenCL builds, which will make your GPU crunch faster.

Look at the doc files included with the build to see how.
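
For anyone who wants to verify the 0% usage themselves, one possible approach is to poll `nvidia-smi` while an affected task is running. This is a rough sketch in Python, assuming `nvidia-smi` is on the PATH; the function names and the 5% idle threshold are arbitrary choices for illustration, not anything from the SETI apps.

```python
import subprocess
import time

# nvidia-smi query that prints one utilization percentage per GPU, one per line.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]

def parse_utilization(output):
    """Turn nvidia-smi output into a list of integer percentages."""
    values = []
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():  # skips blank lines and non-numeric fields such as "[N/A]"
            values.append(int(line))
    return values

def read_utilization():
    """Run nvidia-smi once and return the utilization of every GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)

def sample_gpu(gpu_index=0, count=10, interval=1.0):
    """Collect `count` utilization readings for one GPU, `interval` seconds apart."""
    samples = []
    for _ in range(count):
        samples.append(read_utilization()[gpu_index])
        time.sleep(interval)
    return samples

def looks_idle(samples, threshold=5):
    """True when every reading is at or below the (arbitrary) threshold."""
    return bool(samples) and all(s <= threshold for s in samples)
```

Calling `looks_idle(sample_gpu())` while a VHAR task shows as running would give a quick yes/no answer; a healthy task should keep the GPU well above the threshold.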
ID: 2015336
Darrell Wilcox Project Donor
Volunteer tester
Joined: 11 Nov 99
Posts: 303
Credit: 180,954,940
RAC: 118
Vietnam
Message 2015381 - Posted: 14 Oct 2019, 4:13:59 UTC - in response to Message 2015335.  

@ SongBird

If you want/need more CPU time, you could consider using the 'Sleep' parameter. See here for the details.
ID: 2015381
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2015418 - Posted: 14 Oct 2019, 14:36:14 UTC
Last modified: 14 Oct 2019, 14:36:46 UTC

I’ve been doing some digging the past few days, trying to pin down what’s causing this problem. What I’ve found:

So far this problem seems to only affect the Windows drivers. The Linux 435.xx code branch appears to be similar to the Windows 436.xx branch, and if you check the Windows 436 Release notes, you will notice that they are labeled as “435” drivers. But Linux systems running 435 drivers do not appear to have this issue.

It has also been opined that the “integer scaling” feature may be the culprit. But I’m skeptical. The integer scaling has nothing to do with compute loads, and instead only deals with display scaling for making text and items bigger on the screen by integer factors.

Just something to chew on. Someone should create a well-defined thread on the nvidia forums for discussion about this specific issue.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2015418
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2015419 - Posted: 14 Oct 2019, 14:47:42 UTC - in response to Message 2015418.  

The integer scaling has nothing to do with compute loads, and instead only deals with display scaling for making text and items bigger on the screen by integer factors.

But performing the integer scaling involves the use of math functions, which is also what our compute loads use. It is just my suspicion, since the problem only occurred just as this new "feature" appeared in the drivers. I don't see any mention of integer scaling in the Linux release notes for the 435 drivers.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2015419
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2015421 - Posted: 14 Oct 2019, 14:53:54 UTC - in response to Message 2015419.  

The integer scaling has nothing to do with compute loads, and instead only deals with display scaling for making text and items bigger on the screen by integer factors.

But performing the integer scaling involves the use of math functions, which is also what our compute loads use. It is just my suspicion, since the problem only occurred just as this new "feature" appeared in the drivers. I don't see any mention of integer scaling in the Linux release notes for the 435 drivers.


It seems that they do not release a full release notes package for the Linux drivers like they do for the Windows drivers. At least I couldn't find it on nvidia's site. They have "highlights", but not the multi-page PDF package like they have available for the Windows drivers.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2015421
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2015423 - Posted: 14 Oct 2019, 15:10:14 UTC - in response to Message 2015419.  

Another point is that the integer scaling requires a Turing GPU, yet you have people with Pascal GPUs experiencing the issue. That also leads me to believe integer scaling is not the issue.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2015423
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2015428 - Posted: 14 Oct 2019, 15:34:53 UTC

Ok, I thought all the reported issues were with Turing cards. Point conceded. Not the cause unless the code in the driver implementing integer scaling is causing the issue with earlier generations even if not accessed.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2015428
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2015431 - Posted: 14 Oct 2019, 15:52:37 UTC - in response to Message 2014905.  

If you are experiencing difficulties with this driver, I found that when I updated my system to this driver NVidia sleazed a program with the driver named GeForce Experience. This program was a huge problem for me. If you installed this driver and see that GeForce Experience is loaded, get it the heck out of your system. I had no more issues after I removed that program.
Obviously all users won't experience the same issues but I found this to be the issue with my lockups etc.... whether I installed the service that attempts to deal with other apps that might take ahold of the GPU or not.


Many people seem to have skipped over this. The new driver package bundles the GeForce Experience software into the install. The only way to avoid this is to select a custom install. I imagine that most people do a standard install and just click "Next" without really reading the fine print.

This user claimed to have solved his issues by simply performing a custom install without the GeForce Experience software package, but it appears you are still having issues with Arecibo VHAR tasks. You have a handful from the past couple of days that seem to exhibit the issues described in this thread. I guess GeForce Experience isn't the issue either.

https://setiathome.berkeley.edu/results.php?hostid=8825843&offset=0&show_names=0&state=6&appid=
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2015431
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 2015434 - Posted: 14 Oct 2019, 16:58:25 UTC

Just for info I have been running GeForce Experience with the 431 drivers and earlier and it did not cause problems.
ID: 2015434
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2016087 - Posted: 20 Oct 2019, 20:04:31 UTC

Has anyone with Windows 10 tried the VHARs with the Non-SoG OpenCL App? How about the Intel OpenCL App? When the nVidia OpenCL build stopped working on the MacOS I switched to the Intel OpenCL build and it has worked well ever since.
I don't have Windows 10, so, you people are on your own.
ID: 2016087
Profile Bill Special Project $75 donor
Volunteer tester
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 2016318 - Posted: 23 Oct 2019, 1:18:50 UTC

Driver 440.97 is out. Has anyone tried it?
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 2016318
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2016319 - Posted: 23 Oct 2019, 1:33:39 UTC - in response to Message 2016318.  

Yup. Same problem. Still not fixed.

The problem seems specific to OpenCL on Windows 10 drivers.

Linux doesn’t have the problem at all with OpenCL or CUDA.

And Richard tried 436 and 440 on Windows 7 and the problem didn’t show up there either.

Jacob tried Windows 10 with 440 and the SoG app and the problem was still there.
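
The reports in this post can be kept as a small lookup table, which makes it easier to add new results as people test other combinations. This is only a sketch in Python: the table holds nothing beyond what is stated above, `REPORTS` and `reported_status` are hypothetical names, and a missing entry means "nobody has reported", not "safe".

```python
# (os_name, driver_branch, affected); a driver_branch of None means "any branch".
# Entries mirror the reports in the post above and are not independently verified.
REPORTS = [
    ("Windows 10", 440, True),   # Jacob, SoG app
    ("Windows 7", 436, False),   # Richard
    ("Windows 7", 440, False),   # Richard
    ("Linux", None, False),      # no problem reported with OpenCL or CUDA
]

def reported_status(os_name, driver_branch):
    """Return True/False when the thread has a matching report, else None."""
    for reported_os, branch, affected in REPORTS:
        if reported_os == os_name and branch in (None, driver_branch):
            return affected
    return None

print(reported_status("Windows 10", 440))  # True
print(reported_status("Windows 10", 431))  # None
```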
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2016319
Profile Bill Special Project $75 donor
Volunteer tester
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 2016321 - Posted: 23 Oct 2019, 1:36:55 UTC - in response to Message 2016319.  

I asked because I didn't think I was going to have the time to test it tonight...but just when I thought I had time freed up I saw your post. Thanks for the update, Ian.
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 2016321
Lemon Wolf
Volunteer tester
Joined: 19 Jul 09
Posts: 9
Credit: 1,114,148
RAC: 0
Germany
Message 2016382 - Posted: 23 Oct 2019, 15:48:30 UTC
Last modified: 23 Oct 2019, 15:51:35 UTC

I might have sort of good news.
I just got a response from someone who has contacts at NVIDIA about the issue.
The people at NVIDIA are aware of the situation, which is at least something.
There was no mention of how long it will take to fix the issue.
However, I have been told that the priority of the issue is still low; more user feedback would be required to give this problem more attention.
ID: 2016382
billy ewell 1931 Project Donor
Volunteer tester
Joined: 1 Apr 03
Posts: 23
Credit: 24,295,322
RAC: 2
United States
Message 2016385 - Posted: 23 Oct 2019, 16:52:32 UTC - in response to Message 2016318.  
Last modified: 23 Oct 2019, 17:04:15 UTC

Checked my equipment first thing this morning (i7, Windows 10, RTX 2080). A task had just been downloaded and the timing was totally in error: time spent was about 5 minutes and time REMAINING was some 5 hours and increasing rapidly at the rate of one hour every 22 seconds. At ONE day and 15 hours and still increasing I aborted, and unfortunately the new task repeated with the same characteristics. I found this thread and subsequently (in error) updated my RTX 2080 to 440.97 and experienced the same error as before. After reading more of the thread I reversed that, downloaded driver 431.60, and all is now operating normally after processing several tasks.
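
The symptom described here (time remaining going up instead of down) is easy to test for if you note the estimate at two points in time. A minimal sketch in Python, with made-up function names; the sample values are the ones quoted above, where the estimate grew by one hour in 22 seconds.

```python
def remaining_time_slope(samples):
    """Average change in estimated remaining seconds per second of wall-clock time.

    `samples` is a list of (wall_clock_seconds, remaining_seconds) pairs.
    """
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 == t0:
        raise ValueError("samples must span a non-zero time interval")
    return (r1 - r0) / (t1 - t0)

def is_runaway(samples):
    """A healthy task counts down (negative slope); a growing estimate is a red flag."""
    return remaining_time_slope(samples) > 0

# Estimate went from 5 hours to 6 hours over 22 seconds of wall-clock time.
observed = [(0, 5 * 3600), (22, 6 * 3600)]
print(f"{remaining_time_slope(observed):.1f}")  # 163.6
print(is_runaway(observed))                     # True
```

So the estimate was growing by about 164 seconds for every second that passed, which is a clear sign the task was not making progress.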

Appreciate the great reporting and suggestions. What I do find interesting though is apparently the timing errors only made themselves evident on this particular piece of equipment about 5 hours ago.

Added comment: I have no idea if it matters but in selecting my downloads from EVGA, I selected download drivers ONLY and excluded the GEFORCE Experience option.
ID: 2016385


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.