NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units
Author | Message |
---|---|
Whirling Steel Joined: 21 Sep 19 Posts: 8 Credit: 84,779 RAC: 0 |
One thing I know about technology companies: it's all about the numbers. If enough people complain, I think they will act. But since Seti@home is the only program on my system that is/was affected, I honestly don't think the Seti user base is large enough to move the great NVidia to action (just my opinion). |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
The drivers affect more projects than just Seti. But you are correct, the number of complaints from distributed computing users running consumer hardware on Compute loads is very small compared to the number of users just running normal graphics/game loads. But if the problem is also seen over on the professional side with the users of Tesla and Quadro business class cards, then they would be forced to do something about it because of the greater numbers of users and the prices they command for that hardware. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
SongBird Joined: 23 Oct 01 Posts: 104 Credit: 164,826,157 RAC: 297 |
I believe I may have the same problem. What is not discussed in this thread, though, is that the issue may not be that the workunits take too long to finish, but rather that the GPU is not working at all. Can someone verify that the GPU is indeed idling on these units? |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"I believe I may have the same problem. What is not discussed in this thread, though, is that the issue may not be that the workunits take too long to finish, but rather that the GPU is not working at all." As posted on the other thread, you need to roll back the driver and stay away from 436.XX, or you will never be clear of this error. This is what your crunched WU said: setiathome_CUDA: Found 1 CUDA device(s): nVidia Driver Version 436.48 Device 1: GeForce GTX 1660 Ti, 4095 MiB, regsPerBlock 65536 computeCap 7.5, multiProcs 24 pciBusID = 3, pciSlotID = 0 BTW, after you roll back the driver, reinstall Lunatics with the SoG option selected; your host will crunch the BLC work a lot faster. |
SongBird Joined: 23 Oct 01 Posts: 104 Credit: 164,826,157 RAC: 297 |
I rolled the drivers back. Crunched a couple of tasks already. Will keep in touch if it happens again. Regarding the GPU idling I'm still curious if this is the case. Apropos of nothing, I'm old enough to remember when crunching on GPU did not take up a full CPU core... Just looked at this shit and got mad again: Run time 19 min 17 sec |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"I rolled the drivers back. Crunched a couple of tasks already. Will keep in touch if it happens again." About the GPU idling: it probably comes from the driver incompatibility, some kind of empty loop where the GPU was left doing nothing, as you posted (0% usage). About the CPU usage: different crunching programs need different CPU support. Now you could look at adding some configuration parameters to the OpenCL builds, which will make your GPU crunch faster. Look at the doc files included with the build to see how. |
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
@ SongBird If you want/need more CPU time, you could consider using the "Sleep" parameter. See here for the details. |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
I’ve been doing some digging the past few days, trying to pin down what’s causing this problem. What I’ve found: So far this problem seems to only affect the Windows drivers. The Linux 435.xx code branch appears to be similar to the Windows 436.xx branch, and if you check the Windows 436 release notes, you will notice that they are labeled as “435” drivers. But Linux systems running 435 drivers do not appear to have this issue. It has also been opined that the “integer scaling” feature may be the culprit, but I’m skeptical. The integer scaling has nothing to do with compute loads, and instead only deals with display scaling for making text and items bigger on the screen by integer factors. Just something to chew on. Someone should create a well-defined thread on the nvidia forums for discussion about this specific issue. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"The integer scaling has nothing to do with compute loads, and instead only deals with display scaling for making text and items bigger on the screen by integer factors." But performing the integer scaling involves the use of math functions, which is also what our compute loads use. It's just my suspicion, since the problem only occurred just as this new "feature" appeared in the drivers. I don't see any mention of integer scaling in the Linux release notes for the 435 drivers. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
"The integer scaling has nothing to do with compute loads, and instead only deals with display scaling for making text and items bigger on the screen by integer factors." It seems that they do not release a full release notes package for the Linux drivers like they do for the Windows drivers. At least I couldn't find it on nvidia's site. They have "highlights", but not the multi-page PDF package like they have available for the Windows drivers. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
another point is that the integer scaling requires a Turing GPU, yet you have people with Pascal GPUs experiencing the issue. that also leads me to believe integer scaling is not the issue. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Ok, I thought all the reported issues were with Turing cards. Point conceded. Not the cause unless the code in the driver implementing integer scaling is causing the issue with earlier generations even if not accessed. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
"If you are experiencing difficulties with this driver, I found that when I updated my system to this driver NVidia sleazed a program with the driver named GeForce Experience. This program was a huge problem for me. If you installed this driver and see that GeForce Experience is loaded, get it the heck out of your system. I had no more issues after I removed that program." Many people seem to have skipped over this. The new driver package bundles the GeForce Experience software into the install. The only way to avoid this is to select a custom install. I imagine that most people do a standard install and just click "Next" without really reading the fine print. This user has claimed to have solved his issues by simply performing a custom install without the GeForce Experience software package. But it appears you are still having issues with Arecibo VHAR tasks: you have a handful from the past couple of days that seem to exhibit the issues described in this thread. I guess GeForce Experience isn't the issue either. https://setiathome.berkeley.edu/results.php?hostid=8825843&offset=0&show_names=0&state=6&appid= Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Bernie Vine Joined: 26 May 99 Posts: 9956 Credit: 103,452,613 RAC: 328 |
Just for info I have been running GeForce Experience with the 431 drivers and earlier and it did not cause problems. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Has anyone with Windows 10 tried the VHARs with the Non-SoG OpenCL App? How about the Intel OpenCL App? When the nVidia OpenCL build stopped working on the MacOS I switched to the Intel OpenCL build and it has worked well ever since. I don't have Windows 10, so, you people are on your own. |
Bill Joined: 30 Nov 05 Posts: 282 Credit: 6,916,194 RAC: 60 |
Driver 440.97 is out. Has anyone tried it? Seti@home classic: 1,456 results, 1.613 years CPU time |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Yup. Same problem. Still not fixed. The problem seems specific to OpenCL on Windows 10 drivers. Linux doesn’t have the problem at all with OpenCL or CUDA. And Richard tried 436 and 440 on Windows 7 and the problem didn’t show up there either. Jacob tried Windows 10 with 440 and the SoG app and the problem was still there. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Bill Joined: 30 Nov 05 Posts: 282 Credit: 6,916,194 RAC: 60 |
I asked because I didn't think I was going to have the time to test it tonight...but just when I thought I had time freed up I saw your post. Thanks for the update, Ian. Seti@home classic: 1,456 results, 1.613 years CPU time |
Lemon Wolf Joined: 19 Jul 09 Posts: 9 Credit: 1,114,148 RAC: 0 |
I might have sort of good news. I just got a response about the issue from someone who has contacts at NVIDIA. The people at NVIDIA are aware of the situation, which is at least something. There was no mention of how long it will take to fix the issue. However, I have been told that the priority of the issue is still low; more user feedback would be required to give this problem more attention. |
billy ewell 1931 Joined: 1 Apr 03 Posts: 23 Credit: 24,295,322 RAC: 2 |
Checked my equipment first thing this morning (i7, Windows 10, RTX 2080): a task had just been downloaded and the timing was totally in error. Time spent was about 5 minutes, and time REMAINING was some 5 hours and increasing rapidly at the rate of one hour every 22 seconds. At ONE day and 15 hours and still increasing, I aborted, and unfortunately the new task repeated with the same characteristics. I found this thread and subsequently (in error) updated my RTX 2080 to 440.97 and experienced the same error as before. After reading more of the thread, I reversed again and downloaded driver 431.60, and all is now operating normally after processing several tasks. Appreciate the great reporting and suggestions. What I do find interesting, though, is that the timing errors apparently only made themselves evident on this particular piece of equipment about 5 hours ago. Added comment: I have no idea if it matters, but in selecting my downloads from EVGA, I selected download drivers ONLY and excluded the GeForce Experience option. |
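
For anyone who wants to check their own task output the way juan BFP did above, here is a rough sketch in Python. It pulls the driver version out of a task's stderr text and flags the branches reported as problematic in this thread. The function names, the regular expression, and the cutoff of 436 are my own assumptions based on the thread title ("436.xx and later") and on the 440.97 reports; they are not part of any SETI@home or NVIDIA tool. The check also ignores the operating system, even though the reports here point at Windows 10 only.

```python
import re

# Assumption: the stderr text contains a fragment like
# "nVidia Driver Version 436.48", as in the log quoted in this thread.
DRIVER_RE = re.compile(r"Driver Version\s+(\d+)\.(\d+)")

# Assumption: 436 is the first affected branch, per the thread title.
FIRST_SUSPECT_BRANCH = 436


def parse_driver_version(stderr_text):
    """Return (branch, minor) as integers, or None if no version is found."""
    match = DRIVER_RE.search(stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_suspect_driver(version):
    """True if the driver branch is 436 or later (hypothetical cutoff)."""
    return version is not None and version[0] >= FIRST_SUSPECT_BRANCH


# Sample text copied from the task output quoted earlier in the thread.
sample = (
    "setiathome_CUDA: Found 1 CUDA device(s): nVidia Driver Version 436.48 "
    "Device 1: GeForce GTX 1660 Ti, 4095 MiB, regsPerBlock 65536 "
    "computeCap 7.5, multiProcs 24 pciBusID = 3, pciSlotID = 0"
)

version = parse_driver_version(sample)
print(version, is_suspect_driver(version))
```

With this cutoff, 431.60 (the driver billy ewell 1931 rolled back to) would not be flagged, while 436.48 and 440.97 would. If a later driver fixes the problem, the cutoff would need an upper bound as well.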