GPU stalling

Message boards : Number crunching : GPU stalling

Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 995657 - Posted: 11 May 2010, 23:08:22 UTC

Rom Walton wrote:
A number of people have reported that the CC starts failing to assign work to GPUs after a period of time.

Evidence of this can be found in your log file. For Nvidia GPUs it looks like:

[coproc] cuCtxCreate(0) returned 999

I'm not sure what is logged for an ATI GPU.
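To check a log for the Nvidia symptom quoted above, a small scan along these lines might help. This is only a sketch: the function name is made up for illustration, and it assumes that the number in parentheses identifies the GPU and that any non-zero return value counts as a failure (0 is the CUDA success code).

```python
import re

# Matches the failure line quoted above, e.g.
#   [coproc] cuCtxCreate(0) returned 999
CUCTX_LINE = re.compile(r"\[coproc\] cuCtxCreate\((\d+)\) returned (\d+)")


def find_cuctx_failures(lines):
    """Return (device, return_code) pairs for non-zero cuCtxCreate results.

    Assumption: the number in parentheses is the GPU device index.
    """
    failures = []
    for line in lines:
        match = CUCTX_LINE.search(line)
        if match and int(match.group(2)) != 0:
            failures.append((int(match.group(1)), int(match.group(2))))
    return failures
```

Pass it the lines of your client log (for example an open file object) and it returns one pair per failing line.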

For those experiencing the issue, could you email me:

What OS are you using?

Number of GPUs?

What GPU driver version?

What GPU model version?

Amount of RAM for the computer?

Amount of RAM for the GPU?

Thanks in advance.


----- Rom

Email Rom at this address.
ID: 995657
Profile skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 995682 - Posted: 12 May 2010, 0:39:39 UTC - in response to Message 995657.  

I know that the DNETC folks had to add a command to end a WU if it showed no progress for more than 20 minutes. This apparently only affected WUs on ATI cards. It might be a problem for SETI because of its non-standard WU sizes.
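A no-progress watchdog of the kind described could look roughly like this. It is a sketch only, not DNETC's actual code: the class name, the `fraction_done` input and the caller-supplied clock are all assumptions made for the example.

```python
class ProgressWatchdog:
    """Flags a task whose reported progress has not advanced for too long."""

    def __init__(self, timeout_seconds=20 * 60):
        self.timeout_seconds = timeout_seconds
        self.last_fraction = None      # highest progress value seen so far
        self.last_change_time = None   # time at which progress last advanced

    def update(self, fraction_done, now):
        """Record a progress sample taken at time `now` (in seconds)."""
        if self.last_fraction is None or fraction_done > self.last_fraction:
            self.last_fraction = fraction_done
            self.last_change_time = now

    def is_stalled(self, now):
        """True once progress has not advanced for more than the timeout."""
        if self.last_change_time is None:
            return False
        return now - self.last_change_time > self.timeout_seconds
```

With WUs that vary a lot in size, a fixed timeout like this can misfire on a slow but healthy task, which is the concern raised above.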


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 995682
Profile Paul D. Buck
Volunteer tester

Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 995919 - Posted: 13 May 2010, 3:00:21 UTC - in response to Message 995682.  

I know that the DNETC folks had to add a command to end a WU if it showed no progress for more than 20 minutes. This apparently only affected WUs on ATI cards. It might be a problem for SETI because of its non-standard WU sizes.

A different problem ...

In the case of BOINC and what Rom is asking for, most, if not all, of the 6.10.45+ versions would stop assigning work to one or more GPUs in a system. The system might still process work on the other installed GPUs, but the one that "failed" did so silently: there would be no crash, and the only positive indicator was that in some cases you could see that the memory-size detection code did not get a valid memory size...

There is now the question of whether this code should be pulled (I voted yes) or not ... the whole point was to try to protect tasks from cards with limited GPU memory ...
ID: 995919
MarkJ
Volunteer tester
Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 996337 - Posted: 15 May 2010, 1:09:05 UTC - in response to Message 995919.  

I know that the DNETC folks had to add a command to end a WU if it showed no progress for more than 20 minutes. This apparently only affected WUs on ATI cards. It might be a problem for SETI because of its non-standard WU sizes.

A different problem ...

In the case of BOINC and what Rom is asking for, most, if not all, of the 6.10.45+ versions would stop assigning work to one or more GPUs in a system. The system might still process work on the other installed GPUs, but the one that "failed" did so silently: there would be no crash, and the only positive indicator was that in some cases you could see that the memory-size detection code did not get a valid memory size...

There is now the question of whether this code should be pulled (I voted yes) or not ... the whole point was to try to protect tasks from cards with limited GPU memory ...


They may have found the answer. 6.10.56 appears to address the GPUs going idle, and 6.10.55 removed the memory checking.
BOINC blog
ID: 996337
Profile Paul D. Buck
Volunteer tester

Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 996486 - Posted: 15 May 2010, 17:00:06 UTC - in response to Message 996337.  

They may have found the answer. 6.10.56 appears to address the GPUs going idle, and 6.10.55 removed the memory checking.

.55 was actually the fix, and .56 was a fix to the fix ... :)

Yes, I am running .56 as we speak on all 5 systems ... one has been up for 2 days now and still going strong ... (well, one install upgrade in there as well) ...

UCB does not subscribe to chaos theory, which says that even if you do the "same thing" over and over again, you do not necessarily get the same results, even with the same inputs ... one of the reasons there is so much instability in running BOINC is that many internal functions are run far more often than they really need to be ...

The memory testing of the GPU was just another example of a good idea gone bad, because they went overboard on how often they tested whether there was enough memory ...
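One way to keep a check like that from running too often is to cache its result and only re-run the probe after a minimum interval. The following is purely illustrative and not what BOINC did (per this thread, the memory check was simply removed); `ThrottledCheck` and the `probe` callable are hypothetical names.

```python
class ThrottledCheck:
    """Runs an expensive probe at most once per `min_interval` seconds."""

    def __init__(self, probe, min_interval):
        self.probe = probe              # callable doing the real (costly) query
        self.min_interval = min_interval
        self.last_time = None           # when the probe last ran
        self.last_result = None         # cached result of the last run

    def get(self, now):
        """Return a cached result, refreshing it only when it is old enough."""
        if self.last_time is None or now - self.last_time >= self.min_interval:
            self.last_result = self.probe()
            self.last_time = now
        return self.last_result
```

Callers can then ask as often as they like, while the underlying query runs only once per interval.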

Theory says you should be able to ask as often as you like, but reality said otherwise ... so this thread can be unpinned ... and allowed to die ...
ID: 996486
Profile Paul D. Buck
Volunteer tester

Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 997575 - Posted: 21 May 2010, 16:17:28 UTC - in response to Message 997475.  

Out of 52 projects I am currently running with 2x GTX 480s enabled in non-SLI, I narrowed down, one by one, that SETI is the ONLY project out of all 52 that actually freezes my OS to the point of needing a hard power-down to recover. The only question I have is: will I have to detach from SETI and reattach in the future, or will I have to uninstall and reinstall BOINC once the GTX 480 crash/freeze is finally fixed?

Generally speaking ... no ...

You may have to run a particular set of optimized applications to get the cards to work correctly. For the moment, you should be able to just set NNT (No New Tasks), or turn off the use of CUDA, for SaH (SETI@home) ...

Usually one only needs to re-install BOINC if really bad things happen... aside from migration of versions (up or down) I cannot remember the last time that I had to re-install BOINC to clear up an issue...
ID: 997575
Profile T-Armstrong
Volunteer tester
Joined: 2 Feb 10
Posts: 9
Credit: 312,965
RAC: 0
United States
Message 999889 - Posted: 2 Jun 2010, 7:54:39 UTC

Is it a GPU problem or a server shutdown? Please help me!
My GPU has 1024 MB. Oh man, I have a problem:

all work units from other projects run on my computers, but not SETI. Is it a server problem? My PC, a Toshiba, will not run with 8 CPUs! (Look in my profile under "Show computers" to see it.) I have to give all the work back because it is not running. I have waited 48 hours, but nothing runs.

Have a nice day today
ID: 999889
Profile Gundolf Jahn

Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 999910 - Posted: 2 Jun 2010, 10:27:53 UTC - in response to Message 999889.  
Last modified: 2 Jun 2010, 10:29:26 UTC

It is not entirely clear to me what your problem is. You are probably talking about this computer, since the other two do not have eight processors and are also detached.

The i7 machine currently has eight tasks active, all for the CUDA card. Is any of them running? I don't know anything about the GTS 360M card.

If you are wondering why no SETI tasks are running on the CPU, check the long-term debt of the other projects (especially AQUA). BOINC is probably not requesting any CPU jobs from SETI at all. You can check all of this in the log file (stdoutdae.txt), but you would probably need to enable a few more options for that (cc_config.xml).
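As a rough illustration of the cc_config.xml step, the sketch below builds a file that turns on some logging options. The two flag names in the example (`work_fetch_debug`, `coproc_debug`) are assumptions here; check them against the BOINC client configuration documentation for your version before using them. `build_cc_config` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text that switches on the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each enabled flag becomes <flag_name>1</flag_name>.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


# Flag names are assumptions; verify them for your client version.
print(build_cc_config(["work_fetch_debug", "coproc_debug"]))
```

Save the output as cc_config.xml in the BOINC data directory and have the client re-read its configuration (or restart it) for the extra messages to show up in the log.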

Regards,
Gundolf
Computers aren't everything in life. (Just kidding)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours
ID: 999910
Profile T-Armstrong
Volunteer tester
Joined: 2 Feb 10
Posts: 9
Credit: 312,965
RAC: 0
United States
Message 999974 - Posted: 2 Jun 2010, 15:18:05 UTC


OK, thanks.

Armstrong
ID: 999974
_heinz
Volunteer tester

Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1000091 - Posted: 2 Jun 2010, 20:07:21 UTC
Last modified: 2 Jun 2010, 20:46:49 UTC

Hi Angeless,

I installed BOINC 6.10.56, but it can't connect. I then uninstalled it and reinstalled the BOINC 6.10.18 I had been using before.
But now neither BOINC client can connect.

Any idea what to do?

Edit:
After waiting an hour, it has now connected.
Now running BOINC 6.10.56.
D5400XS V8-Xeon
ID: 1000091
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1000136 - Posted: 2 Jun 2010, 21:38:18 UTC - in response to Message 1000091.  

_heinz, I answered you in the other thread.
ID: 1000136
_heinz
Volunteer tester

Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1000148 - Posted: 2 Jun 2010, 23:00:08 UTC

Hi Angeless,

thanks, it works now

heinz
ID: 1000148
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1000151 - Posted: 2 Jun 2010, 23:41:25 UTC - in response to Message 1000148.  

Yay! Glad to be of help.
ID: 1000151
_heinz
Volunteer tester

Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1000270 - Posted: 3 Jun 2010, 8:55:20 UTC

Hi,
last night I had to restart my machine, and now I have the same issue as yesterday.
BOINC cannot sync.
I have been waiting for 6 hours now, but it cannot connect.
My network looks OK.
I can ping Berkeley and get an answer.
Ping wird ausgeführt für boinc2.ssl.berkeley.edu [208.68.240.18] mit 32 Bytes Daten:
Antwort von 208.68.240.18: Bytes=32 Zeit=213ms TTL=50
Antwort von 208.68.240.18: Bytes=32 Zeit=214ms TTL=50
Antwort von 208.68.240.18: Bytes=32 Zeit=214ms TTL=50
Antwort von 208.68.240.18: Bytes=32 Zeit=212ms TTL=50

Ping-Statistik für 208.68.240.18:
Pakete: Gesendet = 4, Empfangen = 4, Verloren = 0 (0% Verlust),
Ca. Zeitangaben in Millisek.:
Minimum = 212ms, Maximum = 214ms, Mittelwert = 213ms

Any help appreciated
D5400XS V8-Xeon
ID: 1000270
Profile Gundolf Jahn

Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 1000283 - Posted: 3 Jun 2010, 9:46:50 UTC - in response to Message 1000270.  
Last modified: 3 Jun 2010, 9:50:16 UTC

Any help appreciated

I think the other half of boinc2.ssl.berkeley.edu is the problem. Wasn't that 208.68.240.12?

And I'm just curious: how come you have a German operating system? I had expected a French one.

Regards,
Gundolf
Edit: I know that 6.10.56 should take care of that, but who knows?
ID: 1000283
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 1000288 - Posted: 3 Jun 2010, 10:12:16 UTC - in response to Message 1000283.  


I think the other half of boinc2.ssl.berkeley.edu is the problem. Wasn't that 208.68.240.12?

.13

F.
ID: 1000288
_heinz
Volunteer tester

Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1000302 - Posted: 3 Jun 2010, 11:39:39 UTC
Last modified: 3 Jun 2010, 11:48:21 UTC

12 does not work
Ping wird ausgeführt für 208.68.240.12 mit 32 Bytes Daten:
Zeitüberschreitung der Anforderung.
Zeitüberschreitung der Anforderung.
Zeitüberschreitung der Anforderung.
Zeitüberschreitung der Anforderung.

Ping-Statistik für 208.68.240.12:
Pakete: Gesendet = 4, Empfangen = 0, Verloren = 4 (100% Verlust),
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
13 does
Ping wird ausgeführt für 208.68.240.13 mit 32 Bytes Daten:
Antwort von 208.68.240.13: Bytes=32 Zeit=215ms TTL=50
Antwort von 208.68.240.13: Bytes=32 Zeit=214ms TTL=50
Antwort von 208.68.240.13: Bytes=32 Zeit=210ms TTL=50
Antwort von 208.68.240.13: Bytes=32 Zeit=213ms TTL=50

Ping-Statistik für 208.68.240.13:
Pakete: Gesendet = 4, Empfangen = 4, Verloren = 0 (0% Verlust),
Ca. Zeitangaben in Millisek.:
Minimum = 210ms, Maximum = 215ms, Mittelwert = 213ms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
18 does
Ping wird ausgeführt für 208.68.240.18 mit 32 Bytes Daten:
Antwort von 208.68.240.18: Bytes=32 Zeit=210ms TTL=50
Antwort von 208.68.240.18: Bytes=32 Zeit=218ms TTL=50
Antwort von 208.68.240.18: Bytes=32 Zeit=211ms TTL=50
Antwort von 208.68.240.18: Bytes=32 Zeit=215ms TTL=50

Ping-Statistik für 208.68.240.18:
Pakete: Gesendet = 4, Empfangen = 4, Verloren = 0 (0% Verlust),
Ca. Zeitangaben in Millisek.:
Minimum = 210ms, Maximum = 218ms, Mittelwert = 213ms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
this is in my etc/hosts
208.68.240.18 boinc2.ssl.berkeley.edu
208.68.240.13 boinc2.ssl.berkeley.edu
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
so it should always work.
The curious thing is that it has now been connected for more than 9 hours, yet BOINC does not sync.
Screenshots: boincmgr_image, boincmgr_threads, boincmgr_nosync, boinsmgr_tcpip

I believe BOINC 6.10.56 has some issues.

heinz
D5400XS V8-Xeon
ID: 1000302
_heinz
Volunteer tester

Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1000303 - Posted: 3 Jun 2010, 11:46:04 UTC - in response to Message 1000283.  

Hi Gundolf,

my mother tongue is German :-)
I live in France.
ID: 1000303
Profile Gundolf Jahn

Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 1000324 - Posted: 3 Jun 2010, 13:56:43 UTC - in response to Message 1000302.  

this is in my etc/hosts
208.68.240.18 boinc2.ssl.berkeley.edu
208.68.240.13 boinc2.ssl.berkeley.edu

If both addresses are working (as they currently are), you should comment out these lines in your hosts file. Only one of them should be active anyway.
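To spot this kind of duplicate, something like the following could be run over the contents of a hosts file. It is a minimal sketch with hypothetical function names; it treats everything after a `#` as a comment and reports any hostname that has more than one active entry.

```python
def active_host_entries(hosts_text):
    """Map each hostname to the IP addresses of its uncommented entries."""
    entries = {}
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            entries.setdefault(name.lower(), []).append(ip)
    return entries


def duplicated_hosts(hosts_text):
    """Return only the hostnames that have more than one active entry."""
    return {name: ips
            for name, ips in active_host_entries(hosts_text).items()
            if len(ips) > 1}
```

For the two lines quoted above it reports boinc2.ssl.berkeley.edu with both addresses; once one line is commented out, the result is empty.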

To be certain: are we talking about the client connecting to the server, or the manager connecting to the client?

Regards,
Gundolf
ID: 1000324
_heinz
Volunteer tester

Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1000330 - Posted: 3 Jun 2010, 14:10:23 UTC

Hmm... I have seen:
isaac.ssl.berkeley.edu:http
207.46.209.243:http

C:\Users\heinz>ping isaac.ssl.berkeley.edu

Ping wird ausgeführt für isaac.ssl.berkeley.edu [128.32.18.189] mit 32 Bytes Daten:
Antwort von 128.32.18.189: Bytes=32 Zeit=212ms TTL=45
Antwort von 128.32.18.189: Bytes=32 Zeit=212ms TTL=45
Antwort von 128.32.18.189: Bytes=32 Zeit=213ms TTL=45
Antwort von 128.32.18.189: Bytes=32 Zeit=216ms TTL=45

Ping-Statistik für 128.32.18.189:
Pakete: Gesendet = 4, Empfangen = 4, Verloren = 0 (0% Verlust),
Ca. Zeitangaben in Millisek.:
Minimum = 212ms, Maximum = 216ms, Mittelwert = 213ms

C:\Users\heinz>ping 207.46.209.243

Ping wird ausgeführt für 207.46.209.243 mit 32 Bytes Daten:
Antwort von 207.46.39.45: Zielnetz nicht erreichbar.
Zeitüberschreitung der Anforderung.
Zeitüberschreitung der Anforderung.
Antwort von 207.46.39.45: Zielnetz nicht erreichbar.

Ping-Statistik für 207.46.209.243:
Pakete: Gesendet = 4, Empfangen = 2, Verloren = 2 (50% Verlust),
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There must be some problem.
Screenshot: boinc_exe_tcpip

I bought driver-cleaner and ran it, as Angeless recommended.
But my connection problems are not solved.

Any other ideas?
D5400XS V8-Xeon
ID: 1000330


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.