NVidia 436.xx and later drivers can cause very long compute times especially on Arecibo VHAR work units

Jesse Viviano
Joined: 27 Feb 00
Posts: 100
Credit: 3,949,252
RAC: 300
United States
Message 2014226 - Posted: 5 Oct 2019, 2:29:28 UTC - in response to Message 2014053.  

> Thank you everyone for the feedback.
>
> Based on what I have read from the conversations, am I correct in stating that the issue will most likely not get fixed?
>
> Regards,
> BoincSpy.

Fortunately, Nvidia is quick to fix bugs that people find, as long as they report them through the GeForce driver feedback form at https://docs.google.com/forms/d/e/1FAIpQLSewHJk1xP-C5elLBRCDLTLpNQZ9eiefrdZmUGP9hMCN6gKssA/viewform, which I found at https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/320827/official-geforce-43648-game-ready-driver-feedback-/. Since you are a volunteer tester, maybe you can fill out the form with more detail than I could. I remember that Nvidia fixed a bug that caused its drivers to compute some math incorrectly, which affected PrimeGrid. It also once traced a problem to a bug in the Folding@home code rather than in the driver: a valid driver optimization triggered the bug, and the math would have been computed correctly had the bug not been present in Folding@home.
ID: 2014226

Keith Myers (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 10327
Credit: 1,011,881,505
RAC: 1,413,362
United States
Message 2014229 - Posted: 5 Oct 2019, 3:08:16 UTC - in response to Message 2014226.  

I believe the bug was reported to Nvidia feedback/support over a month ago. It still hasn't been fixed.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 2014229

Mr. Kevvy (Crowdfunding Project Donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 2846
Credit: 914,669,697
RAC: 1,795,677
Canada
Message 2014257 - Posted: 5 Oct 2019, 14:08:23 UTC

I've discovered many excessive runtime tasks with Arecibo VLAR, not VHAR, work units on this computer that resulted in 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED errors but on driver 430.50, not a 436.xx driver, and I'm wondering if they may be related. They all have runtimes of 12,000+ seconds on GPU though the usual completion time is about 35-50 seconds. The computer was recently rebooted and is processing all other work properly. As well these work units errored out on other computers they were assigned to, even on CPU, though not from excessive runtimes.

Tasks are: 8100604949, 8100604968, 8100605056, 8100605039, 8100605044, 8100605056, 8100605074, 8100605099, 8097708300, 8097708079, 8097708336, 8097708355, 8097708139, 8097708255 and 8097708262.

Hope this proves helpful...
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
ID: 2014257

juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 8253
Credit: 519,748,232
RAC: 416,458
Panama
Message 2014259 - Posted: 5 Oct 2019, 14:41:11 UTC - in response to Message 2014257.  
Last modified: 5 Oct 2019, 14:46:41 UTC

> I've discovered many excessive runtime tasks with Arecibo VLAR, not VHAR, work units on this computer that resulted in 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED errors but on driver 430.50, not a 436.xx driver, and I'm wondering if they may be related. They all have runtimes of 12,000+ seconds on GPU though the usual completion time is about 35-50 seconds. The computer was recently rebooted and is processing all other work properly. As well these work units errored out on other computers they were assigned to, even on CPU, though not from excessive runtimes.
>
> Tasks are: 8100604949, 8100604968, 8100605056, 8100605039, 8100605044, 8100605056, 8100605074, 8100605099, 8097708300, 8097708079, 8097708336, 8097708355, 8097708139, 8097708255 and 8097708262.
>
> Hope this proves helpful...

I did not check all the WUs, but the ones I looked at are GPU WUs that ran on the CPU, not the GPU, and that will cause the EXIT_TIME_LIMIT_EXCEEDED error. Now you need to find out why that is happening. Did you reschedule them?

<core_client_version>7.16.1</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 12958.64 (41454.44G/3.20G)</message>
<stderr_txt>
Not using mb_cmdline.txt-file, using commandline options.

Build features: SETI8 Non-graphics FFTW FFTOUT JSPF SSE4.1 64bit 
 System: Linux  x86_64  Kernel: 4.15.0-65-generic
 CPU   : AMD FX(tm)-8350 Eight-Core Processor
 8 core(s), Speed :  1404.693 MHz
 L1 : 64 KB, Cache : 2048 KB
 Features : FPU TSC PAE APIC MTRR MMX SSE  SSE2 HT PNI SSSE3 SSE4A SSE4_1 SSE4_2 AVX  FMA4  

ar=0.012314  NumCfft=145989  NumGauss=0  NumPulse=50164588672  NumTriplet=67958470816
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Linux optimized setiathome_v8 application
Version info: SSE4.1xjf (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSE4.1xjf Linux64 Build 3711 , Ported by : Raistmer, JDWhale, Urs Echternacht
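
A quick way to sanity-check the limit in that message is to pull the three numbers out and redo the division. The sketch below assumes that the two figures in parentheses are the work unit's operation bound and the speed estimate the client holds for the application (both in units of 1e9), and that the limit is simply their ratio. The thread does not state this, and `parse_time_limit` is a name made up for illustration.

```python
import re

# Pattern for the time-limit message as it appears in the log above.
LIMIT_RE = re.compile(
    r"exceeded elapsed time limit (?P<limit>[\d.]+) "
    r"\((?P<bound>[\d.]+)G/(?P<flops>[\d.]+)G\)"
)


def parse_time_limit(message: str) -> dict:
    """Extract the limit (seconds) and the two parenthesised figures."""
    match = LIMIT_RE.search(message)
    if match is None:
        raise ValueError("no elapsed-time-limit message found")
    limit = float(match["limit"])
    bound = float(match["bound"])
    flops = float(match["flops"])
    return {
        "limit_s": limit,
        "bound_g": bound,
        "speed_g": flops,
        # Assumption: limit = operation bound / estimated speed.
        "recomputed_limit_s": bound / flops,
    }


info = parse_time_limit("exceeded elapsed time limit 12958.64 (41454.44G/3.20G)")
print(info)
```

With the rounded figures from the log, the recomputed value lands within a fraction of a percent of the reported limit, which fits that reading.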

ID: 2014259

rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 17955
Credit: 408,976,601
RAC: 32,837
United Kingdom
Message 2014260 - Posted: 5 Oct 2019, 14:42:19 UTC

There is something very strange about that computer - from one of the error tasks:
Build features: SETI8 Non-graphics FFTW FFTOUT JSPF SSE4.1 64bit
System: Linux x86_64 Kernel: 4.15.0-58-generic
CPU : AMD FX(tm)-8350 Eight-Core Processor
8 core(s), Speed : 1404.583 MHz
L1 : 64 KB, Cache : 2048 KB

The CPU is clocked at about a third of the normal speed for an AMD FX8350

Indeed the majority of the task description suggests this task was actually run on a CPU not a GPU, despite the headline claim of running on a GPU.
There is no mention of either OpenCL or CUDA, only the various CPU options which would (should?) be there.
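
Both observations can be checked mechanically against a task's stderr text. This is only a rough sketch: the marker list is an assumption about what a GPU build would print, the 4.0 GHz figure is the nominal base clock assumed for an FX-8350, and the function names are invented for the example.

```python
import re

# Hypothetical marker list: substrings a GPU build would be expected to print.
GPU_MARKERS = ("cuda", "opencl")

# Assumed nominal base clock of an AMD FX-8350, used only for the ratio.
FX8350_BASE_MHZ = 4000.0


def mentions_gpu(stderr_text: str) -> bool:
    """Return True if the text mentions any of the assumed GPU markers."""
    lowered = stderr_text.lower()
    return any(marker in lowered for marker in GPU_MARKERS)


def reported_clock_mhz(stderr_text: str):
    """Return the 'Speed : N MHz' value from the text, or None if absent."""
    match = re.search(r"Speed\s*:\s*([\d.]+)\s*MHz", stderr_text)
    return float(match.group(1)) if match else None


sample = """Build features: SETI8 Non-graphics FFTW FFTOUT JSPF SSE4.1 64bit
System: Linux x86_64 Kernel: 4.15.0-58-generic
CPU : AMD FX(tm)-8350 Eight-Core Processor
8 core(s), Speed : 1404.583 MHz
L1 : 64 KB, Cache : 2048 KB"""

gpu_seen = mentions_gpu(sample)
clock_ratio = reported_clock_mhz(sample) / FX8350_BASE_MHZ
print(gpu_seen, round(clock_ratio, 2))
```

For the excerpt quoted here, no GPU marker is found and the reported clock is roughly 35% of the assumed nominal speed.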
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2014260

Mr. Kevvy (Crowdfunding Project Donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 2846
Credit: 914,669,697
RAC: 1,795,677
Canada
Message 2014263 - Posted: 5 Oct 2019, 15:00:18 UTC - in response to Message 2014259.  

Thank you both... that was it. The error tasks listing indicates "SETI@home v8 Anonymous platform (NVIDIA GPU)" so I failed to notice that no GPU was involved. They must have been rescheduled and I forgot about it. Apologies and please disregard.
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
ID: 2014263

Keith Myers (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 10327
Credit: 1,011,881,505
RAC: 1,413,362
United States
Message 2014267 - Posted: 5 Oct 2019, 15:42:23 UTC - in response to Message 2014263.  

> Thank you both... that was it. The error tasks listing indicates "SETI@home v8 Anonymous platform (NVIDIA GPU)" so I failed to notice that no GPU was involved. They must have been rescheduled and I forgot about it. Apologies and please disregard.

Or this condition can be caused by an unstable GPU that produces too many pulses or invalid power readings. When that happens, stderr.txt will state that the PoT (Power over Time) was exceeded and that the GPU task was moved to the CPU for completion. That always leads to a -197 time exceeded error.

I have not seen any errors on VLAR tasks, only on the VHAR tasks.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 2014267

Jesse Viviano
Joined: 27 Feb 00
Posts: 100
Credit: 3,949,252
RAC: 300
United States
Message 2014339 - Posted: 6 Oct 2019, 4:26:59 UTC

I just filed a bug report with Nvidia in case nobody else has done so.
ID: 2014339

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 11877
Credit: 185,017,764
RAC: 240,032
Australia
Message 2014340 - Posted: 6 Oct 2019, 4:46:00 UTC - in response to Message 2014339.  
Last modified: 6 Oct 2019, 4:46:16 UTC

> I just filed a bug report with Nvidia in case nobody else has done so.
Thanks.
Hopefully people from the other projects that are having similar issues will have done the same thing, and with luck the multiple reports from different sources will add some weight to the issue.
Grant
Darwin NT
ID: 2014340

Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 17239
Credit: 240,686,158
RAC: 179,690
Australia
Message 2014357 - Posted: 6 Oct 2019, 10:33:11 UTC
Last modified: 6 Oct 2019, 10:34:02 UTC

Thankfully I am a very long way from being in the "Latest is Greatest Belief Club". ;-)

Cheers.
ID: 2014357

daysteppr
Joined: 22 Mar 05
Posts: 78
Credit: 17,639,357
RAC: 25,261
United States
Message 2014414 - Posted: 6 Oct 2019, 18:09:04 UTC - in response to Message 2014267.  
Last modified: 6 Oct 2019, 18:32:07 UTC

https://setiathome.berkeley.edu/workunit.php?wuid=3678863986

Name: 19au10ac.25800.16848.10.37.129_1
Workunit: 3679081064
Created: 4 Oct 2019, 9:43:51 UTC
Sent: 4 Oct 2019, 13:18:47 UTC
Report deadline: 25 Oct 2019, 0:28:29 UTC
Received: 6 Oct 2019, 10:25:28 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED
Computer ID: 8591279
Run time: 45 min 3 sec
CPU time: 13 sec
Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 18,124.80 GFLOPS
Application version: SETI@home v8 Anonymous platform (NVIDIA GPU)
Peak working set size: 112.63 MB
Peak swap size: 137.32 MB
Peak disk usage: 0.02 MB
Peak disk usage 0.02 MB

That's what it gives me when I get a WU that goes to 0.006 and essentially stops for an hour or so.

So, is it flaky power to the card or the drivers?
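
One thing the task details do show is how little of the run time was spent computing. A small sketch that converts the two durations to seconds and compares them; the duration format is assumed from the listing above, and `to_seconds` and `cpu_fraction` are illustrative names only.

```python
import re

# Assumed unit spellings, based on the "45 min 3 sec" style in the listing.
UNIT_SECONDS = {"hour": 3600, "min": 60, "sec": 1}


def to_seconds(text: str) -> int:
    """Convert strings such as '45 min 3 sec' or '13 sec' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(hour|min|sec)", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


def cpu_fraction(run_time: str, cpu_time: str) -> float:
    """Fraction of the wall-clock run time that was spent as CPU time."""
    return to_seconds(cpu_time) / to_seconds(run_time)


fraction = cpu_fraction("45 min 3 sec", "13 sec")
print(f"{fraction:.2%} of the run time was CPU time")
```

Here the CPU time is well under 1% of the run time, so the task spent almost the whole 45 minutes waiting rather than computing. That is consistent with a stall, but on its own it does not say whether the driver or the hardware is to blame.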
ID: 2014414

Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 17239
Credit: 240,686,158
RAC: 179,690
Australia
Message 2014453 - Posted: 6 Oct 2019, 22:04:15 UTC

That workunit has a true angle of 2.715739, so it's the driver that you are using. Roll back to the 431.60 driver to stop that.

Cheers.
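
For reference, the angle range is what sorts a work unit into VLAR or VHAR. The sketch below uses boundary values that are commonly quoted in the community (below 0.12 for VLAR, above 1.127 for VHAR); they are an assumption and do not come from this thread, so treat them as placeholders.

```python
# Assumed boundaries, commonly quoted in the community; not from this thread.
VLAR_MAX = 0.12
VHAR_MIN = 1.127


def classify_angle_range(angle_range: float) -> str:
    """Label a work unit by its true angle range using the assumed boundaries."""
    if angle_range < VLAR_MAX:
        return "VLAR"
    if angle_range > VHAR_MIN:
        return "VHAR"
    return "mid-range"


# Angle ranges that appear in this thread.
print(classify_angle_range(2.715739))  # the work unit discussed here
print(classify_angle_range(0.012314))  # the "ar=" value in the earlier CPU log
```

Under those assumed boundaries, 2.715739 falls in the VHAR group that the thread title says is affected, while the 0.012314 task from the earlier log is a VLAR.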
ID: 2014453

Bill (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 30 Nov 05
Posts: 261
Credit: 4,975,325
RAC: 22,548
United States
Message 2014583 - Posted: 8 Oct 2019, 0:24:07 UTC

I'm getting the same problem on the 436.48 DCH driver. I have rolled back to the 431.60 DCH driver and everything works well on my 1660 Ti. I posted on the Nvidia message board as Jesse suggested, so we'll see what happens.
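
A simple version check can flag whether an installed driver falls in the range this thread is about. The cutoff below is taken from the thread title ("436.xx and later") and is an assumption rather than a confirmed list of affected releases. The helper names are made up for the example.

```python
# Assumed cutoff, taken from the thread title: 436.xx and later.
FIRST_AFFECTED = (436, 0)


def parse_driver_version(version: str) -> tuple:
    """Split a version string such as '436.48' into (major, minor) integers."""
    major, _, minor = version.strip().partition(".")
    return int(major), int(minor or 0)


def in_reported_range(version: str) -> bool:
    """True if the version is at or past the assumed first affected release."""
    return parse_driver_version(version) >= FIRST_AFFECTED


for version in ("436.48", "431.60", "430.50"):
    print(version, in_reported_range(version))
```

The installed version string can usually be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to `in_reported_range`.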
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 2014583

Whirling Steel
Joined: 21 Sep 19
Posts: 8
Credit: 84,295
RAC: 305
United States
Message 2014905 - Posted: 10 Oct 2019, 18:54:10 UTC - in response to Message 2013323.  

If you are experiencing difficulties with this driver: I found that when I updated my system to this driver, NVidia sleazed in a program named GeForce Experience along with it. This program was a huge problem for me. If you installed this driver and see that GeForce Experience is loaded, get it the heck out of your system. I had no more issues after I removed that program.
Obviously not all users will experience the same issues, but I found this to be the cause of my lockups etc., whether or not I installed the service that attempts to deal with other apps that might take hold of the GPU.
ID: 2014905

Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9877
Credit: 89,819,739
RAC: 96,378
United Kingdom
Message 2014926 - Posted: 10 Oct 2019, 21:24:49 UTC

> This program was a huge problem for me. If you installed this driver and see that GeForce Experience is loaded, get it the heck out of your system. I had no more issues after I removed that program.


Use the custom install option. That allows you to not install it.

Having said that, I have had it installed for the past year with no problems. GeForce Experience is really only for gaming. I use it because I game, and it lets me take in-game screenshots and video and optimise games for the best performance. Wouldn't want to be without it now.

I made the decision to stop crunching on GPUs on my Windows machines and just run one of my Linux rigs for a couple of days a week.
ID: 2014926

Whirling Steel
Joined: 21 Sep 19
Posts: 8
Credit: 84,295
RAC: 305
United States
Message 2014977 - Posted: 11 Oct 2019, 3:42:34 UTC - in response to Message 2014926.  

Agreed, Bernie. In my experience, end users never use the "custom" or "advanced" options when installing software :)
But right now, as I write this, I realize you were probably talking to the same user base as I was! I suppose that would render this post uhhhhh invalid! hahaha /sigh
ID: 2014977

Whirling Steel
Joined: 21 Sep 19
Posts: 8
Credit: 84,295
RAC: 305
United States
Message 2014979 - Posted: 11 Oct 2019, 3:54:39 UTC

I have a semi off-topic... ok, completely off-topic question, as this is the only thread I have posted to :/
You folks are really the first people I have communicated with at Seti, other than the leader of the team I am in.
Two things really; any advice would be appreciated.
#1 Does anyone know of the "Fly's Eye" project, and if so, how or where to join it?
#2 Can someone PLEASE tell me what the "credits" are for? Please god tell me! lol (I kill me!)
ID: 2014979

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 11877
Credit: 185,017,764
RAC: 240,032
Australia
Message 2014984 - Posted: 11 Oct 2019, 4:44:29 UTC - in response to Message 2014979.  

> #2 Can someone PLEASE tell me what the "credits" are for?
For keeping track of how much work you have done.
Grant
Darwin NT
ID: 2014984

Whirling Steel
Joined: 21 Sep 19
Posts: 8
Credit: 84,295
RAC: 305
United States
Message 2015035 - Posted: 11 Oct 2019, 14:33:35 UTC - in response to Message 2014984.  

Thanks man, I appreciate it.
ID: 2015035

Patrick Meyer
Joined: 18 Jun 11
Posts: 3
Credit: 19,179,738
RAC: 27,553
Canada
Message 2015113 - Posted: 11 Oct 2019, 23:22:20 UTC

Do you think that NVIDIA will ever fix the driver?
ID: 2015113
©2019 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.