cuda23 never showed progress on BM

Message boards : Number crunching : cuda23 never showed progress on BM
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1106446 - Posted: 15 May 2011, 12:53:07 UTC

I aborted this task because it never showed any progress after 13 hours of runtime. BM showed 0.00% complete, while the percent complete in BoincTasks hovered around 1400.00%. It was the 1400 that caught my attention.

This happens very rarely, but when it does it consumes a lot of run time before I notice the problem.

I would think that any project using a GPU for co-processing should implement some type of timeout for "no progress".
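Something along these lines is what I mean, just a sketch of the idea (the progress call is a made-up stand-in, and the timings would need tuning per project):

[code]
// Sketch of a "no progress" watchdog: if the reported fraction done
// hasn't moved within some window, bail out instead of burning GPU
// time forever. report_fraction_done() is a hypothetical stand-in
// for however the science app tracks its own progress.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>

static double report_fraction_done() {
    return 0.0;   // stand-in: a stalled task never advances
}

int main() {
    using clock = std::chrono::steady_clock;
    const auto poll        = std::chrono::seconds(5);    // would be minutes in a real app
    const auto stall_limit = std::chrono::seconds(30);   // e.g. 30 minutes in a real app

    double last_done   = report_fraction_done();
    auto   last_change = clock::now();

    while (last_done < 1.0) {
        std::this_thread::sleep_for(poll);
        const double done = report_fraction_done();
        if (done > last_done) {
            last_done   = done;
            last_change = clock::now();
        } else if (clock::now() - last_change > stall_limit) {
            std::fprintf(stderr, "no progress within the stall limit, giving up\n");
            return EXIT_FAILURE;  // error out so the work can be resent, instead of hanging
        }
    }
    return EXIT_SUCCESS;
}
[/code]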

I do not know why BoincTasks reported non-zero progress of 1400% for this task while BM showed 0%. This suggests that BoincTasks spotted something that BM didn't.

my 2c.
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1106730 - Posted: 16 May 2011, 9:21:15 UTC - in response to Message 1106446.  

I would think that any project using a GPU for co-processing should implement some type of timeout for "no progress".

There is a "timeout" for any task (not only GPU): it is 10 times the initial estimated time.

E.g. if the initial estimated time to completion is 1 hour, the task will be aborted after it has run for 10 hours (even if it is at 99%).
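Roughly where the 10x comes from, as a sketch with made-up numbers (I'm assuming the workunit's hard bound, rsc_fpops_bound, is set to 10 times the estimate rsc_fpops_est, which would match the 10x rule of thumb):

[code]
// Rough illustration (made-up numbers, not actual BOINC client code):
// the server sends an estimate of the floating-point work (rsc_fpops_est)
// and a hard bound (rsc_fpops_bound); the client turns both into times
// using the speed it thinks the device has, and aborts at the bound.
#include <cstdio>

int main() {
    const double rsc_fpops_est   = 3.0e13;                // server's estimate of FP ops
    const double rsc_fpops_bound = 10.0 * rsc_fpops_est;  // assumed 10x, as discussed above
    const double device_flops    = 8.0e9;                 // what the client thinks the GPU does

    const double est_hours = rsc_fpops_est   / device_flops / 3600.0;  // shown as "estimated time"
    const double max_hours = rsc_fpops_bound / device_flops / 3600.0;  // abort point

    std::printf("estimated: %.2f h, aborted at about: %.2f h\n", est_hours, max_hours);
    return 0;
}
[/code]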


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1106816 - Posted: 16 May 2011, 17:28:05 UTC - in response to Message 1106730.  

I would think that any project using a GPU for co-processing should implement some type of timeout for "no progress".

There is a "timeout" for any task (not only GPU): it is 10 times the initial estimated time.

E.g. if the initial estimated time to completion is 1 hour, the task will be aborted after it has run for 10 hours (even if it is at 99%).


That's only reasonably accurate after a system has done enough work that BOINC is scaling the estimate and bound for each task, and the Duration Correction Factor for the project is near 1.0. Beemer Biker's Computer 5796714 has no validated tasks under the new system, so the cutoff time could have been much longer than he let it run.
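To put rough numbers on that (illustrative only, following the description above rather than the client source):

[code]
// Illustration (made-up numbers, not BOINC source): the cutoff is still
// driven by rsc_fpops_bound, but on a host with no validated work the
// client's speed estimate for the card can be far too low, so the cutoff
// in wall-clock hours stretches well past "10x the (true) estimate".
#include <cstdio>

int main() {
    const double rsc_fpops_bound = 3.0e14;  // same bound as in the earlier sketch
    const double real_flops      = 8.0e9;   // what the card actually delivers
    const double assumed_flops   = 1.0e9;   // a pessimistic estimate on a fresh host

    std::printf("cutoff with a settled estimate: %.1f h\n",
                rsc_fpops_bound / real_flops / 3600.0);
    std::printf("cutoff on a fresh host:         %.1f h\n",
                rsc_fpops_bound / assumed_flops / 3600.0);
    return 0;
}
[/code]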

The "Device 1: GeForce 9800 GTX/9800 GTX+ is okay" line is output after a quick check of card capability, but before any real attempt is made to initialize processing. It just says the card has CUDA capability and at least 128 MB of memory. After that there's a call to a function in cudart.dll which should get the basic setup started; apparently that call never returned.
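Something like this, as a sketch (not the actual CUDA app source; only the "okay" check and the hang point follow the description above, the specific calls are my guess):

[code]
// Sketch of the startup sequence described above. The cheap property
// query is enough to print "Device N: ... is okay"; the first call that
// really drives cudart/the driver comes after it, and that is where a
// wedged driver can leave the task sitting at 0.00% forever.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int dev = 0;
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return 1;

    // "quick check": CUDA-capable and at least 128 MB of memory
    if (prop.major >= 1 && prop.totalGlobalMem >= 128u * 1024 * 1024) {
        std::printf("Device %d: %s is okay\n", dev + 1, prop.name);
    }

    // first real initialization work; if the driver never responds,
    // a call like this can block with no progress ever reported
    void* buf = nullptr;
    if (cudaMalloc(&buf, 1 << 20) != cudaSuccess) return 1;
    cudaFree(buf);
    return 0;
}
[/code]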
                                                                Joe
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1106863 - Posted: 16 May 2011, 19:53:26 UTC

I haven't had any stuck tasks in a while. Sometimes they would sit at 0%, or at any percentage below 100%. When I found one it would typically have 80-100 hours of run time against an estimate of 3-4 hours. At first I was editing client_state.xml to restart the task and try again, but most of the time it would just get stuck again, so I switched to aborting them.
SETI@home classic workunits: 93,865  CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1106865 - Posted: 16 May 2011, 20:02:24 UTC - in response to Message 1106446.  

Does your next CUDA WU compute normally?

I'd be tempted to downgrade the driver from 270.61 to 266.58 and see if that works.

Claggy
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1107269 - Posted: 18 May 2011, 14:05:58 UTC - in response to Message 1106865.  
Last modified: 18 May 2011, 14:09:47 UTC

This message board was down while I had the same problem with a Milkyway job. The Milkyway jobs were OpenCL, as opposed to cuda_fermi (whatever that is) or cuda23. I posted the same problem over at Milkyway, where the suggestion was to upgrade to the beta driver 275.whatever.

Anyway, it seems that if I stop BOINC and restart it, the job either completes immediately with success or restarts at 0.0% and runs the normal 15 or so minutes before completing successfully. Since Milkyway deletes results almost as fast as they are posted, I can only guess that it was validated. I had a 3 hour 30 minute job (3:25 CPU time) that completed within 20 minutes from scratch (0.0%) after restarting BOINC.

Discussion here

I think the job with the problem I originally posted about has problems for all the wingmen as well. However, I have observed jobs completing successfully after restarting BOINC, on both SETI and Milkyway.

I have upgraded all my systems to that 275 latest-and-greatest driver and will see what happens.
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1107409 - Posted: 18 May 2011, 20:30:36 UTC

Here is another one that hung (never progressed) until I restarted BOINC:

From BoincTasks (note the 847% complete), with a runtime of 1 hour, 22 minutes.



This is the same image but from Boincmgr 6.12.26



I restarted BOINC and progress was shown immediately. The task completed in 8 minutes, as shown here



I updated the project and the result was posted here

This job ran AFTER I upgraded to that beta NVIDIA driver, 275.27.

Looking at the work unit, I see no indication of the hour-plus of run time that unaccountably occurred. Something is amiss. This system has only a single GTX 280 and a Q6700 CPU and is used exclusively for BOINC (no games).
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 1107449 - Posted: 18 May 2011, 22:08:19 UTC
Last modified: 18 May 2011, 22:12:24 UTC

Different system (dual Opteron), same problem: no progress unless the task is restarted:



This system has two cards, a GTX 460 (d0) and a 9800 GTX (d1), and is also running that 275.27 driver with BOINC 6.12.26.

The above is from BoincTasks; otherwise the progress would have shown 0.0%.

[EDIT]
I suspect a driver problem. I just logged onto that system (Vista 64) and had to dismiss two NVIDIA driver reset warnings that popped up. A driver reset during crunching may well be what is causing this.
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1107456 - Posted: 18 May 2011, 22:34:29 UTC - in response to Message 1107449.  
Last modified: 18 May 2011, 22:39:41 UTC

I never wanted to put two quite different GPUs in one system.
I would throw out the 9800 GTX+; mine gave too many -9 (overflow) results, finding triplets in a row.

You can run 2 SETI MB WUs on your Fermi (optimized, if you'd like).

It also depends which compute capability is needed (1.1, 1.2, 1.3, 2.0 or 2.1);
the last two are Fermi.
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1107548 - Posted: 19 May 2011, 4:40:09 UTC - in response to Message 1107449.  


Do you overclock your GPUs?
Do you check temperatures?
Is your PSU powerful enough?


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
