Bug in server affecting older BOINC clients with NVIDIA GPUs.

Author	Message
Eric Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60	Message 1271871 - Posted: 17 Aug 2012, 5:37:11 UTC Last modified: 17 Aug 2012, 5:38:33 UTC We've identified a bug in the current BOINC server that is online at SETI@home. With older BOINC clients this bug results in running multiple SETI@home GPU applications simultaneously on a single GPU. While we debug and fix the problem we've suspended distribution of NVIDIA work. We hope that everything will be back to normal some time tomorrow. @SETIEric@qoto.org (Mastodon) ID: 1271871 ·

robertmiles Volunteer tester Send message Joined: 16 Jan 12 Posts: 213 Credit: 4,117,756 RAC: 6	Message 1271877 - Posted: 17 Aug 2012, 6:34:52 UTC - in response to Message 1271871. Last modified: 17 Aug 2012, 6:35:21 UTC Do the multiple workunits use some of the same GPU resources, or do they interfere in some other way? Some BOINC projects seem to let two or more OpenCL GPU workunits, but not multiple CUDA GPU workunits, share a single GPU after checking that they are not assigned any of the same resources within the GPU. ID: 1271877 ·

Francesco Forti Send message Joined: 24 May 00 Posts: 334 Credit: 204,421,005 RAC: 15	Message 1271902 - Posted: 17 Aug 2012, 8:13:59 UTC Last week I had the same problem, but with BOINC 7.0.28 and Lunatics v0.40 32-bit. As soon as I tried to run two instances of a GPU task on a new GT 640 (driver 301.42) using count 0.5, I got 80 GPU instances running and had to uninstall and reinstall everything. Other hosts, with older GTX cards, are able to run two instances. ID: 1271902 ·

Shakir Send message Joined: 14 Aug 99 Posts: 3 Credit: 90,452,595 RAC: 93	Message 1271920 - Posted: 17 Aug 2012, 9:22:23 UTC - in response to Message 1271902. Last modified: 17 Aug 2012, 9:24:46 UTC I have a notebook where this problem occurs. Starting more threads on the GPU isn't the problem; the biggest problem in my case is that the 10 SETI tasks on the GPU use around 2 GB of RAM and the notebook is 100% in swap. I can't deactivate the GPU calculation either. There are also 8 regular CPU SETI tasks running, so the system hits the 4 GB RAM wall. I have to deactivate calculation until this issue is fixed. Thanks for working on it. ID: 1271920 ·

Eric Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60	Message 1272118 - Posted: 17 Aug 2012, 15:59:37 UTC - in response to Message 1271877. Last modified: 17 Aug 2012, 16:00:04 UTC Do the multiple workunits use some of the same GPU resources, or do they interfere in some other way? The main problem is that the BOINC client, depending upon the relative speed of your GPU and CPU, could decide to run as many as 10 GPU apps per CPU core simultaneously. If you've got 4 CPU cores, that's 40 GPU apps running at once. So no, we're not talking about running 2 or even 4 apps simultaneously on the GPU. The possible results, in order of severity, could be: 1) The apps error out when the GPU runs out of memory. 2) Your GPU driver freezes, causing a reboot every time BOINC tries to run the apps. 3) Your GPU overheats and causes a reboot every time BOINC tries to run the apps. @SETIEric@qoto.org (Mastodon) ID: 1272118 ·
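
The arithmetic in the post above can be sketched in a few lines of Python. The function name is hypothetical, and the default of 10 GPU apps per CPU core is simply the upper figure quoted in the post, not a value taken from the BOINC source; treat this as an illustration of the scale of the problem, not as the client's actual scheduling logic.

```python
def worst_case_gpu_tasks(cpu_cores, gpu_tasks_per_core=10):
    """Upper bound on simultaneous GPU apps, using the figure from the post.

    gpu_tasks_per_core defaults to 10, the "as many as" value quoted above;
    the real number depends on the relative speed of the GPU and CPU.
    """
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return cpu_cores * gpu_tasks_per_core


if __name__ == "__main__":
    # Print the worst case for a few common core counts.
    for cores in (2, 4, 8):
        print(cores, "cores ->", worst_case_gpu_tasks(cores), "GPU apps")
```

For a 4-core host this gives the 40 simultaneous GPU apps mentioned in the post, far beyond the 2 to 4 tasks per GPU that volunteers normally configure.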

Michael W.F. Miles Send message Joined: 24 Mar 07 Posts: 268 Credit: 34,410,870 RAC: 0	Message 1272124 - Posted: 17 Aug 2012, 16:09:10 UTC I have BOINC client 7.0.31 installed on Windows 7 x64. I am also running the x41x multibeam GPU app. Since the outage my GTX 460 is taking 1.5 hours to complete two tasks; it usually takes 15 minutes. I was wondering what was happening, as I hope I did not blow my card. It does not heat up to its normal temps when crunching. My task manager shows the six CPU apps and the two x41x tasks running, but that is all. I am hoping it is the server doing this and not my poor GTX 460, which has been my main crunching unit. Michael Miles ID: 1272124 ·

Eric Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60	Message 1272161 - Posted: 17 Aug 2012, 16:51:49 UTC - in response to Message 1272124. We've installed a fix from David Anderson that we hope will solve the problem. If you have a BOINC version 7 client, the problem never affected you, and you can stop reading this now. If you use BOINC version 6, you are probably affected. The fix we installed will not fix workunits that have already been downloaded. For that, you've got four options. 1) Abort all your CUDA tasks. 2) Upgrade to BOINC v7 or 3) Exit BOINC, edit your client_state.xml to replace all the occurrences of "NVIDIA" with "CUDA" or 4) Just let it run and deal with a few reboots. @SETIEric@qoto.org (Mastodon) ID: 1272161 ·

Sunny129 Send message Joined: 7 Nov 00 Posts: 190 Credit: 3,163,755 RAC: 0	Message 1272179 - Posted: 17 Aug 2012, 17:11:53 UTC - in response to Message 1272124. there was a first time poster named Cathy who posted a question about her GPU problems in this thread earlier, but her post has since mysteriously vanished. her post may have been slightly out of place in this thread, as it's not a troubleshooting thread, but rather a thread dedicated to the status of the BOINC server bug. regardless, i'm hoping that her post was not deleted altogether, and at the very least towed to the appropriate sub-forum or thread so that her question can get answered... I have BOINC client 7.0.31 installed on Windows 7 x64. I am also running the x41x multibeam GPU app. Since the outage my GTX 460 is taking 1.5 hours to complete two tasks; it usually takes 15 minutes. I was wondering what was happening, as I hope I did not blow my card. It does not heat up to its normal temps when crunching. My task manager shows the six CPU apps and the two x41x tasks running, but that is all. I am hoping it is the server doing this and not my poor GTX 460, which has been my main crunching unit. Michael Miles doesn't sound like a server-side issue to me, despite all the issues the server has right now. it sounds more like your video driver crashed and reset itself, leaving the GPU in limp/safe mode. are your GPU's core and memory clocks underclocked to approx. half of what they should be? you'll need to open Catalyst Control Center or some 3rd party utility like MSI Afterburner to confirm this. if so, you'll need to suspend all BOINC work and reboot the entire system to bring the GPU out of safe mode. but even if this isn't the solution, i highly doubt physical damage is responsible for the way your GPU is currently acting. ID: 1272179 ·

Michael W.F. Miles Send message Joined: 24 Mar 07 Posts: 268 Credit: 34,410,870 RAC: 0	Message 1272188 - Posted: 17 Aug 2012, 17:24:13 UTC - in response to Message 1272179. I did think it was a driver crash, but a reboot will clear this out. All OC programs say it is running okay except for the temp on my GPU, which is lower than usual. I am going to try a driver reinstall and see what happens. It started to do this right after the last outage, which makes me suspicious. ID: 1272188 ·

Gatekeeper Send message Joined: 14 Jul 04 Posts: 887 Credit: 176,479,616 RAC: 0	Message 1272191 - Posted: 17 Aug 2012, 17:26:12 UTC - in response to Message 1272188. I did think it was a driver crash, but a reboot will clear this out. All OC programs say it is running okay except for the temp on my GPU, which is lower than usual. I am going to try a driver reinstall and see what happens. It started to do this right after the last outage, which makes me suspicious. The units that are taking 1.5 hours aren't VLARs by chance, are they? ID: 1272191 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22200 Credit: 416,307,556 RAC: 380	Message 1272192 - Posted: 17 Aug 2012, 17:27:13 UTC Or Astropulses? Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1272192 ·

Michael W.F. Miles Send message Joined: 24 Mar 07 Posts: 268 Credit: 34,410,870 RAC: 0	Message 1272229 - Posted: 17 Aug 2012, 18:07:17 UTC - in response to Message 1272192. They are vlars, I just noticed that. I am going to abort the vlars and see what happens ID: 1272229 ·

Michael W.F. Miles Send message Joined: 24 Mar 07 Posts: 268 Credit: 34,410,870 RAC: 0	Message 1272231 - Posted: 17 Aug 2012, 18:12:59 UTC - in response to Message 1272229. That was it. Thank you all. Man, what a relief. I thought I cooked my card running it in this weather. VLARs ran up the time by a huge margin. Thank you, thank you, thank you GOD almighty. One thing I don't have right now is 200 dollars for a new card. Michael Miles ID: 1272231 ·

alan soden Send message Joined: 12 Feb 12 Posts: 2 Credit: 105,617 RAC: 0	Message 1272318 - Posted: 17 Aug 2012, 22:15:30 UTC Ha ha, maybe this bug is extraterrestrial??? Good luck with the fix. ID: 1272318 ·

Eric Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60	Message 1272335 - Posted: 17 Aug 2012, 23:02:05 UTC - in response to Message 1272318. Damn, now the vlars are broke again? @SETIEric@qoto.org (Mastodon) ID: 1272335 ·

Peter Send message Joined: 4 May 12 Posts: 22 Credit: 26,746 RAC: 0	Message 1272351 - Posted: 17 Aug 2012, 23:37:07 UTC - in response to Message 1271871. Hello. Thank you for finding the problem. I ran into it, and when I reported it I was told it was impossible. So you proved that I was right and am not crazy. All I can say to the ones who told me I was wrong: I told you so! THEY SEE YOU!! LOOK UP!! ID: 1272351 ·

Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0	Message 1272360 - Posted: 18 Aug 2012, 0:02:03 UTC - in response to Message 1272335. Damn, now the vlars are broke again? I don't think so... the VLARs he aborted were sent on Aug 11 (I guess before the fixes)... ID: 1272360 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1272375 - Posted: 18 Aug 2012, 0:34:56 UTC - in response to Message 1272360. Damn, now the vlars are broke again? I don't think so... the VLARs he aborted were sent on Aug 11 (I guess before the fixes)... Agreed. No new ones seen here, either. ID: 1272375 ·

BilBg Volunteer tester Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1272377 - Posted: 18 Aug 2012, 0:53:26 UTC - in response to Message 1272161. If you use BOINC version 6, you are probably affected. The fix we installed will not fix workunits that have already been downloaded. For that, you've got four options. 1) Abort all your CUDA tasks. 2) Upgrade to BOINC v7 or 3) Exit BOINC, edit your client_state.xml to replace all the occurrences of "<type>NVIDIA</type>" with "<type>CUDA</type>" or 4) Just let it run and deal with a few reboots. I like the option '3)' most (the 'Replace All' is easy using just Notepad) I can think of another 'fix' (for those uncomfortable with any of 1) ... 4) above) but it involves 'hand' work: - Temporarily Disable getting NVIDIA/CUDA tasks ('Use NVIDIA GPU' here: http://setiathome.berkeley.edu/prefs.php?subset=project) - Suspend all your current CUDA tasks (sort by Application column, click ... Shift+click to select all CUDA tasks) - Resume them one at a time (or 2, 3 at a time if your GPU is good enough (Fermi++)) - When all 'old' CUDA tasks are done - Enable again getting NVIDIA/CUDA tasks ('Use NVIDIA GPU' yes) - ALF - "Find out what you don't do well ..... then don't do it!" :) ID: 1272377 ·
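
For anyone who would rather script option 3 than use Notepad's 'Replace All', here is a minimal Python sketch of the same substitution. The function names and the `.bak` backup are assumptions added for illustration, not part of BOINC; the only strings taken from the thread are the two `<type>` tags. The location of client_state.xml depends on your BOINC data directory, so the path is passed on the command line, and BOINC must be exited first, as the post says.

```python
import shutil
import sys
from pathlib import Path

# The two tags quoted in the post above.
OLD_TAG = b"<type>NVIDIA</type>"
NEW_TAG = b"<type>CUDA</type>"


def retag(data):
    """Return (patched_bytes, number_of_replacements) for the file contents."""
    return data.replace(OLD_TAG, NEW_TAG), data.count(OLD_TAG)


def patch_file(path):
    """Back up the file next to itself, then rewrite it with the new tag."""
    path = Path(path)
    patched, count = retag(path.read_bytes())
    if count:
        # Hypothetical safety step: keep a copy in case the edit goes wrong.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_bytes(patched)
    return count


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python retag.py /path/to/client_state.xml  (exit BOINC first)
    print("replaced", patch_file(sys.argv[1]), "occurrence(s)")
```

Working on raw bytes avoids guessing the file's text encoding, and matching the full `<type>...</type>` tag rather than the bare word keeps the replacement from touching other places where "NVIDIA" may appear.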

.clair. Send message Joined: 4 Nov 04 Posts: 1300 Credit: 55,390,408 RAC: 69	Message 1272397 - Posted: 18 Aug 2012, 2:21:34 UTC - in response to Message 1272335. Damn, now the vlars are broke again? I run ATI GPU and have not seen a VLAR in days If there are any out there i can not find them :( ID: 1272397 ·
