Host keeps on freezing [4] NVIDIA GeForce GTX 690 (1999MB) driver: 430.40 OpenCL: 1.2

elec999 Project Donor

Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2006744 - Posted: 11 Aug 2019, 3:31:04 UTC

Host keeps on freezing [4] NVIDIA GeForce GTX 690 (1999MB) driver: 430.40 OpenCL: 1.2
This host keeps freezing every few days. I have to pull its power and plug it back in. Is there any way to see why?
ID: 2006744 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 2006747 - Posted: 11 Aug 2019, 3:47:27 UTC - in response to Message 2006744.  
Last modified: 11 Aug 2019, 3:51:37 UTC

Anyway to see why?

No. You need to track down the cause by process of elimination.
Check the GPU temperatures. Test the system's memory. Try it with 1, 2, 3, then 4 GPUs. Try a different power supply if you have one. Run something like Process Explorer and make sure there isn't some other software process causing issues.
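
One way to act on the temperature check is to poll `nvidia-smi`, which ships with the NVIDIA driver. The sketch below is only an illustration under stated assumptions: the 85 °C warning threshold and the helper names are hypothetical and do not come from this thread, and older GeForce cards may report some fields as unsupported.

```python
import csv
import io
import subprocess

# nvidia-smi query that prints one CSV row per GPU: index, name, temperature in C.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_temps(csv_text):
    """Turn nvidia-smi CSV output into a list of (index, name, temp_c) tuples."""
    readings = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        fields = [field.strip() for field in row]
        if len(fields) != 3 or not fields[0].isdigit():
            continue  # skip blank or unexpected lines
        index, name, temp = fields
        # A non-numeric temperature (e.g. an unsupported field) becomes None.
        temp_c = int(temp) if temp.isdigit() else None
        readings.append((int(index), name, temp_c))
    return readings


def hot_gpus(readings, limit_c=85):
    """Return readings at or above limit_c (85 is an assumed threshold)."""
    return [r for r in readings if r[2] is not None and r[2] >= limit_c]


def read_gpu_temps():
    """Run nvidia-smi once and parse its output; requires the NVIDIA driver."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_temps(result.stdout)
```

Calling `read_gpu_temps()` every minute or so and appending the result to a file would leave a record of the last temperatures seen before a freeze.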

Considering that all it is doing is pumping out mostly errors, I'd suggest running it with just one video card at a time for several days to make sure each card is OK. Then, when all of them check out OK, add another card and monitor the results. I'd also keep an eye on PSU voltages as you add cards.

<![CDATA[
<message>
too many boinc_temporary_exit()s</message>
<stderr_txt>
Not using mb_cmdline.txt-file, using commandline options.
Running on device number: 1
WARNING: boinc_get_opencl_ids failed with code -1
Error: Getting Platforms. (clGetPlatformsIDs)
BOINC assigns slot on device #1.
WARNING: BOINC failed to provide OpenCL device, using own enumeration abilities
ERROR: OpenCL kernel/call 'clGetDeviceIDs (second call)' call failed (-32) in file ../../src/GPU_lock.cpp near line 1315.
Waiting 30 sec before restart...
Not using mb_cmdline.txt-file, using commandline options.
Running on device number: 3
WARNING: boinc_get_opencl_ids failed with code -1
Error: Getting Platforms. (clGetPlatformsIDs)
BOINC assigns slot on device #3.
WARNING: BOINC failed to provide OpenCL device, using own enumeration abilities
ERROR: OpenCL kernel/call 'clGetDeviceIDs (second call)' call failed (-32) in file ../../src/GPU_lock.cpp near line 1315.
Waiting 30 sec before restart...
Not using mb_cmdline.txt-file, using commandline options.
Running on device number: 0
WARNING: boinc_get_opencl_ids failed with code -1
Error: Getting Platforms. (clGetPlatformsIDs)
BOINC assigns slot on device #0.
WARNING: BOINC failed to provide OpenCL device, using own enumeration abilities
ERROR: OpenCL kernel/call 'clGetDeviceIDs (second call)' call failed (-32) in file ../../src/GPU_lock.cpp near line 1315.
Waiting 30 sec before restart...
Not using mb_cmdline.txt-file, using commandline options.
Running on device number: 2
WARNING: boinc_get_opencl_ids failed with code -1
Error: Getting Platforms. (clGetPlatformsIDs)
BOINC assigns slot on device #2.
WARNING: BOINC failed to provide OpenCL device, using own enumeration abilities
ERROR: OpenCL kernel/call 'clGetDeviceIDs (second call)' call failed (-32) in file ../../src/GPU_lock.cpp near line 1315.
Waiting 30 sec before restart...
Not using mb_cmdline.txt-file, using commandline options.
Running on device number: 1
WARNING: boinc_get_opencl_ids failed with code -1
Error: Getting Platforms. (clGetPlatformsIDs)
BOINC assigns slot on device #1.
WARNING: BOINC failed to provide OpenCL device, using own enumeration abilities

etc, etc, etc, etc, etc...
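
The excerpt above repeats the same pattern for every device, so a short script can make that pattern explicit. This is a minimal sketch, assuming the stderr text has been saved as a string; the function and variable names are hypothetical. The only outside fact used is that OpenCL error code -32 is `CL_INVALID_PLATFORM` in the standard OpenCL headers.

```python
import re
from collections import Counter

# Known OpenCL status codes (from CL/cl.h); only the one seen in this log.
OPENCL_ERRORS = {-32: "CL_INVALID_PLATFORM"}

DEVICE_RE = re.compile(r"Running on device number: (\d+)")
FAILED_RE = re.compile(r"'([^']+)' call failed \((-?\d+)\)")


def summarize_stderr(text):
    """Count task starts per device and failures per (call, error code)."""
    starts = Counter()
    failures = Counter()
    for line in text.splitlines():
        match = DEVICE_RE.search(line)
        if match:
            starts[int(match.group(1))] += 1
            continue
        match = FAILED_RE.search(line)
        if match:
            failures[(match.group(1), int(match.group(2)))] += 1
    return starts, failures


def describe(failures):
    """Render each failure with its symbolic name when the code is known."""
    lines = []
    for (call, code), count in sorted(failures.items()):
        name = OPENCL_ERRORS.get(code, "unknown")
        lines.append(f"{call}: {code} ({name}) x{count}")
    return lines
```

If every device fails at the same call with the same code, as it does here, that points away from a single bad GPU and toward something shared, such as the driver or OpenCL installation, though it does not rule out hardware.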
Grant
Darwin NT
ID: 2006747 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 2006828 - Posted: 11 Aug 2019, 18:49:08 UTC - in response to Message 2006744.  

Host keeps on freezing [4] NVIDIA GeForce GTX 690 (1999MB) driver: 430.40 OpenCL: 1.2
This host keeps on freezing every few days. I need to pull its power and plug back in. Anyway to see why?


How much power is that computer consuming? Do you have a Kill A Watt meter on it? How big is your PSU?
ID: 2006828 · Report as offensive
elec999 Project Donor

Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2006958 - Posted: 12 Aug 2019, 13:46:16 UTC - in response to Message 2006828.  

PSU is XFX 1250watt
ID: 2006958 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 2006967 - Posted: 12 Aug 2019, 14:26:40 UTC - in response to Message 2006958.  
Last modified: 12 Aug 2019, 14:29:25 UTC

PSU is XFX 1250watt


I think your PSU is insufficient for those 4 GPUs. I run an EVGA T2 1600W for my four 1080 Ti FTW cards. The only way to know for sure is to get a watt meter. Amazon sells them and so do electronics stores.

https://www.amazon.com/P3-International-P4460-Electricity-Monitor/dp/B000RGF29Q/ref=sr_1_5?crid=1X0Q09PNJ5LE3&keywords=watt+a+meter&qid=1565620124&s=gateway&sprefix=watt+a+m%2Caps%2C154&sr=8-5
ID: 2006967 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2006971 - Posted: 12 Aug 2019, 14:47:08 UTC - in response to Message 2006967.  
Last modified: 12 Aug 2019, 14:50:13 UTC

1200W should be more than enough. However, it's possible the PSU could still be at fault for other reasons. I'm not sure XFX has the best track record for PSUs, and that model is quite old now.

BOINC reports 4 GPUs because the 690 is a dual-GPU card. He only has 2 physical cards, each with 2 GPUs, so BOINC counts 4. Each card is rated for ~300W.

Your 4x 1080 Ti will use more power than his 2x 690.
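
The arithmetic behind that estimate can be written out as a rough sketch. The ~300 W per card and the 1250 W PSU rating come from this thread; the 200 W allowance for the rest of the system is an assumption made up for illustration, and rated power is not the same as measured draw at the wall.

```python
CARD_RATED_W = 300       # "each card is rated for ~300W" (from the thread)
PSU_RATED_W = 1250       # XFX 1250 W unit (from the thread)
REST_OF_SYSTEM_W = 200   # ASSUMPTION: CPU, motherboard, drives, fans
GPUS_PER_CARD = 2        # the GTX 690 is a dual-GPU card


def boinc_gpu_count(cards):
    """Number of GPUs BOINC would list for this many physical 690 cards."""
    return cards * GPUS_PER_CARD


def estimated_load_w(cards, card_w=CARD_RATED_W, rest_w=REST_OF_SYSTEM_W):
    """Rated card power plus the assumed allowance for everything else."""
    return cards * card_w + rest_w


def headroom_w(cards, psu_w=PSU_RATED_W):
    """PSU rating minus estimated load; negative means over budget."""
    return psu_w - estimated_load_w(cards)


for cards in (2, 4):
    print(
        f"{cards} cards -> BOINC shows [{boinc_gpu_count(cards)}], "
        f"~{estimated_load_w(cards)} W load, {headroom_w(cards):+d} W headroom"
    )
```

Under these assumptions, 2 physical cards leave a comfortable margin, while 4 physical cards would exceed the PSU rating. A wall meter is still the only way to replace the assumed numbers with measured ones.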
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2006971 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 2006976 - Posted: 12 Aug 2019, 14:58:25 UTC - in response to Message 2006971.  

1200W should be more than enough. however it's possible the PSU could still be at fault for other reasons. I'm not sure XFX has the best track record for PSUs and that model is quite old now.

Boinc reports 4x GPUs because the 690 is a dual GPU card, he only has 2 physical cards, each with 2 GPUs = Boinc says it's 4. each card is rated for ~300W

Your 4x 1080ti will use more power than his 2x 690


I knew that about the 690 (Juan had some, I believe), but he himself posted that he has 4 690s, not 2 690s. If it really is 4, then he's running at a minimum of 1200 watts (usually it's more), and then you have the degradation of the PSU over time. I'm just surprised it hadn't acted up before now.
ID: 2006976 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2006977 - Posted: 12 Aug 2019, 15:01:53 UTC

based on the list of hosts you have, I can't tell which system used to have the 690s in it. which host ID was it?

there are several OpenCL related errors in the file Grant posted, and I see several of your systems are lacking OpenCL drivers.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2006977 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2006978 - Posted: 12 Aug 2019, 15:03:15 UTC - in response to Message 2006976.  
Last modified: 12 Aug 2019, 15:24:02 UTC

but he himself post that he has 4 - 690s not 2 -690s

he copied and pasted the BOINC reporting info. I do not believe he actually has 4x 690 cards. BOINC would show [8] in that case.



he actually confirmed that it is only 2x690s in a previous post: https://setiathome.berkeley.edu/forum_thread.php?id=84490&postid=2005512#2005512
What's me doing wrong with my 690x2 host.

Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2006978 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2006981 - Posted: 12 Aug 2019, 15:15:25 UTC - in response to Message 2006977.  
Last modified: 12 Aug 2019, 15:19:32 UTC

based on the list of hosts you have, I can't tell which system used to have the 690s in it. which host ID was it?

there are several OpenCL related errors in the file Grant posted, and I see several of your systems are lacking OpenCL drivers.


it appears to be this Host: https://setiathome.berkeley.edu/show_host_detail.php?hostid=8782986

It says OpenCL is installed, but maybe it's a good idea to just wipe out all drivers and reinstall them fresh.

You will have to do some hands-on troubleshooting.

Possibilities:
one or more GPUs are defective
one or more GPUs are having thermal issues, what are the temps when it's running?
driver problems, try a fresh install
while your PSU has enough capacity, it's an old model and could also be degraded/failing
other hardware issues, defective memory or SSD/HDD/MB

you'll need to check all of these things.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2006981 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2007006 - Posted: 12 Aug 2019, 18:29:46 UTC

From my 690 days.... long time ago... IIRC

Besides the usual problems... old PSU, bad capacitors, etc.

Look at the memory usage. The 690 has 2 GB, but each half (each GPU) could use only 1 GB, so keep that in mind in your config file or you will easily get a lot of errors.
ID: 2007006 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2007008 - Posted: 12 Aug 2019, 18:37:30 UTC - in response to Message 2007006.  

that's a good point Juan, only 1GB available to each GPU.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2007008 · Report as offensive
elec999 Project Donor

Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2007126 - Posted: 13 Aug 2019, 3:59:51 UTC - in response to Message 2007006.  

From my 690 days.... long time ago... IIRC

Besides the usual problems... old PSU, bad capacitors, etc.

Look at the memory usage, the 690 has 2 GB but each 1/2 GPU could use only 1 GB, so keep that in mind on your config file or you will easy get a lot of errors.


How can I fix the RAM issue?
ID: 2007126 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2007129 - Posted: 13 Aug 2019, 4:18:23 UTC - in response to Message 2007126.  

If you have a -sbs statement in any command line, reduce it to only 512 or 768 to stay under the 1024 limit for each GPU.
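
As a small illustration of that advice, the sketch below lowers any `-sbs` value in a command-line string that is above a chosen cap. It assumes the option is written as `-sbs` followed by a number, as described above; the helper name and the default cap of 768 are choices made for this example, and it does not locate or edit any file for you.

```python
import re

# Matches "-sbs <number>" and captures the prefix and the value separately.
SBS_RE = re.compile(r"(-sbs\s+)(\d+)")


def clamp_sbs(cmdline, max_mb=768):
    """Return cmdline with any -sbs value above max_mb lowered to max_mb."""
    def _fix(match):
        value = int(match.group(2))
        return match.group(1) + str(min(value, max_mb))

    return SBS_RE.sub(_fix, cmdline)


print(clamp_sbs("-sbs 1024"))  # value above the cap is lowered
print(clamp_sbs("-sbs 512"))   # value already under the cap is left alone
```

Passing `max_mb=512` applies the more conservative of the two values suggested above.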
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2007129 · Report as offensive
