Task Postponed?

Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99 · Posts: 2786 · Credit: 685,657,289 · RAC: 835 · Canada
Message 1671777 - Posted: 30 Apr 2015, 1:58:34 UTC
Last modified: 30 Apr 2015, 2:01:34 UTC

This is a new message for me ...

2015-04-29 7:51:50 PM | SETI@home | task postponed 180.000000 sec: Cuda runtime, memory related failure, threadsafe temporary Exit

*scratching my head* Never seen that message before.

EDIT: As I was typing this, the status changed to 'Waiting to run'.
Zalster (Special Project $250 donor)
Volunteer tester
Joined: 27 May 99 · Posts: 5517 · Credit: 528,817,460 · RAC: 242 · United States
Message 1671781 - Posted: 30 Apr 2015, 2:12:39 UTC - in response to Message 1671777.  

Corrupted kernel build?

Seems like I've seen that somewhere before. Let's see what the others say first before I suggest a possible solution.
Brent Norman
Message 1671782 - Posted: 30 Apr 2015, 2:15:34 UTC - in response to Message 1671781.  

I haven't changed anything, and it's been working just fine. It seems it was just the one task (http://setiathome.berkeley.edu/result.php?resultid=4118287690), at about 48% complete. It's not done yet, so no stderr yet.

Not sure if it's complaining about CPU or GPU memory.
Zalster
Message 1671783 - Posted: 30 Apr 2015, 2:18:56 UTC - in response to Message 1671782.  

Try quitting BOINC and restarting it to see if the task picks back up. You could also try rebooting the computer to see if that resolves it.
Brent Norman
Message 1671786 - Posted: 30 Apr 2015, 2:23:11 UTC - in response to Message 1671783.  

I suspended another task and it started again; in 15 minutes I should have a stderr to look at.
Brent Norman
Message 1671789 - Posted: 30 Apr 2015, 2:32:10 UTC - in response to Message 1671786.  

Rebooting. This is strange: the tasks appear to be running normally and counting down, but I see my GPU went ice cold at the same time as that message.
Zalster
Message 1671796 - Posted: 30 Apr 2015, 2:45:31 UTC - in response to Message 1671789.  

Do you mean it went cool after the reboot, or cold when that 'waiting to run' message posted?

If it was the latter, then I would guess either the driver or the app crashed and nothing was progressing.

After a reboot it should restart at the last saved point and then progress forward. Let's see if it finishes.
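
For reference, the "last saved point" comes from BOINC's checkpoint mechanism. A minimal sketch of that pattern, assuming the standard boinc_api calls (write_state_file() is a hypothetical helper, not the actual SETI@home code):

#include "boinc_api.h"   // boinc_time_to_checkpoint(), boinc_checkpoint_completed()

void write_state_file(int progress_units);   // hypothetical: persists progress to disk

void maybe_checkpoint(int progress_units) {
    // Ask the client whether now is a good time to save state.
    if (boinc_time_to_checkpoint()) {
        write_state_file(progress_units);   // save enough state to resume from here
        boinc_checkpoint_completed();       // confirm the save to the client
    }
}

On restart the app re-reads that saved state, which is why a resumed task reports lines like "Restarted at 41.24 percent".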
Brent Norman
Message 1671799 - Posted: 30 Apr 2015, 2:56:49 UTC - in response to Message 1671796.  

It went cold at the same time as that message appeared. After reboot it warmed up again.

This is what I see in stderr:
Error on call (cudaMemcpy(&flags, dev_flag, sizeof(*dev_flag), cudaMemcpyDeviceToHost)), file c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcc_gaussfit.cu, line 587: unknown error
Exiting
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.
Cuda sync'd & freed.
Preemptively acknowledging a safe temporary exit->
Exit Status: 0
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
Cuda threadsafe ExitProcess() initiated, rval 0


Not sure if that is a GPU memory fault, or if it just couldn't copy the data back from GPU memory.
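
The failing call, cudaMemcpy(..., cudaMemcpyDeviceToHost), copies a result flag from GPU memory back to host RAM, so the complaint concerns the GPU side. A minimal sketch of that checked-call pattern (illustrative names, not the actual Xbranch source):

#include <cstdio>
#include <cuda_runtime.h>

bool copy_flags_back(int* host_flags, const int* dev_flag) {
    cudaError_t err = cudaMemcpy(host_flags, dev_flag, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        // "unknown error" here usually means the GPU context itself is in a bad
        // state (crashed kernel, driver reset, hang), not that this particular
        // copy was malformed.
        fprintf(stderr, "Error on call (cudaMemcpy): %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}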

Oh well, I guess I'll cross my fingers and hope it doesn't happen again.
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0 · Australia
Message 1671815 - Posted: 30 Apr 2015, 4:04:46 UTC - in response to Message 1671799.  
Last modified: 30 Apr 2015, 4:09:06 UTC

That's recovery from a failure during transfer of gaussfit results from the VRAM back to the host.

If it only happened the once, or extremely rarely, I wouldn't worry about it. The temporary exit did what it could to back out and try again, hopefully saving the result from some spurious problem.
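
A minimal sketch of that back-out-and-retry path (assuming the two-argument form of BOINC's boinc_temporary_exit(); the 180-second delay and cudaAcc_free() mirror the stderr above, but this is an illustration, not the actual Xbranch code):

#include "boinc_api.h"       // boinc_temporary_exit()
#include <cuda_runtime.h>

extern void cudaAcc_free();  // the app's own cleanup routine (named in the stderr log)

void bail_out_and_retry() {
    cudaAcc_free();          // release all device buffers
    cudaDeviceReset();       // drop the possibly-wedged CUDA context
    // Ask the client to postpone the task ("task postponed 180.000000 sec")
    // and relaunch it later; on restart it resumes from the last checkpoint.
    boinc_temporary_exit(180, "Cuda runtime, memory related failure");
}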

If it happens more often, then you'd be looking at a range of possible issues, from disk integrity through drivers to some hardware fault.

What actually caused that particular glitch, assuming ordinary consumer-grade equipment like the rest of us use, could be just about anything: a one-off spurious glitch, something temperature-induced, some other application messing with the GPU or system at the time, an immature system driver, even radiation in the silicon chip packaging material.

That's why enterprise gear exists, with ECC RAM, Tesla compute cards, redundant PSUs, etc. ... but for me a little bit of fault tolerance is more practical ;)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Brent Norman
Message 1671822 - Posted: 30 Apr 2015, 4:33:50 UTC - in response to Message 1671815.  

Thanks for the info, Jason. It seems that was just a one-time thing; I haven't seen it again.

The wingman should validate or reject it in 6-12 hours.
BilBg
Volunteer tester
Joined: 27 May 07 · Posts: 3720 · Credit: 9,385,827 · RAC: 0 · Bulgaria
Message 1671952 - Posted: 30 Apr 2015, 11:07:04 UTC - in response to Message 1671822.  

You can see that the GPU downclocked (and after the reboot returned to 1228 MHz):
Kepler GPU current clockRate = 1228 MHz
Kepler GPU current clockRate = 405 MHz

The task was still making progress at 405 MHz (Restarted at 41.24 percent ... Restarted at 53.24 percent), so the driver was probably OK.
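
Those clockRate lines can also be confirmed from outside the app; a rough sketch reading the current (not rated) SM clock via NVML (assumed API usage; the stock app may obtain its figure differently):

#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);               // first GPU in the system
    unsigned int mhz = 0;
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &mhz);  // current clock, e.g. 405 when throttled
    printf("GPU current clockRate = %u MHz\n", mhz);
    nvmlShutdown();
    return 0;
}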

The downclock may be caused by too much overclocking, or by overheating.


NVIDIA says:

GTX 750 Ti GPU Engine Specs:
Base Clock (MHz) 1020
Boost Clock (MHz) 1085

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-750-ti/specifications
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
Brent Norman
Message 1672457 - Posted: 1 May 2015, 5:18:37 UTC

Quick update.

The task that threw that exit message just got validated by my wingman.
BilBg
Message 1675010 - Posted: 8 May 2015, 3:00:35 UTC - in response to Message 1671815.  

Can you help a user (Dave Lampkins) understand, diagnose, and fix a similar (but not the same) problem:
http://setiathome.berkeley.edu/forum_thread.php?id=77272

NVIDIA GeForce GTX 970 (4095MB) driver: 347.88
Windows 7 / BOINC 7.4.42
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7379623

'stock' "setiathome enhanced x41zc, Cuda 5.00"
"too many boinc_temporary_exit()s" because of:

uncaptured error before launch (find_pulse_kernel2<fft_n, numthreads/fft_n, 5, true><<<grid, block>>>(best_pulse_score, PulsePoTLen, AdvanceBy, y_offset, numdivs, firstP, lastP)), file c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcc_pulsefind.cu, line 1505: unknown error
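
For what that wording means: "uncaptured error before launch" is a check for an error left behind by some earlier CUDA call, polled just before the next kernel launch, so find_pulse_kernel2 is usually the messenger rather than the culprit. A minimal sketch of the pattern (illustrative, not the actual Xbranch macro):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void find_pulse_stub() {}   // stand-in for find_pulse_kernel2

bool launch_checked() {
    cudaError_t err = cudaGetLastError();   // anything pending from earlier calls?
    if (err != cudaSuccess) {
        fprintf(stderr, "uncaptured error before launch: %s\n", cudaGetErrorString(err));
        return false;                       // back out, e.g. via temporary exit
    }
    find_pulse_stub<<<1, 1>>>();
    return cudaGetLastError() == cudaSuccess;   // also catches bad launch configs
}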
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
jason_gee
Message 1675107 - Posted: 8 May 2015, 9:33:17 UTC - in response to Message 1675010.  

Had a quick look. It's hard to help without knowing the system & GPU in person, but the artefact scanner might yield some clues as to stability, provided there aren't other major issues with the system there.
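
The artefact scanner itself is a separate tool, but the core idea can be sketched as a simple VRAM pattern test (illustrative only; an unstable or heavily overclocked card shows mismatches here long before results visibly go bad):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t n = (64u << 20) / sizeof(unsigned int);   // test 64 MB of VRAM
    std::vector<unsigned int> pattern(n, 0xA5A5A5A5u), readback(n);

    unsigned int* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(unsigned int)) != cudaSuccess) return 1;
    cudaMemcpy(dev, pattern.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(readback.data(), dev, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (readback[i] != pattern[i]) ++bad;
    printf("%zu mismatched words\n", bad);   // anything nonzero is a red flag
    return bad ? 2 : 0;
}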
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
jason_gee
Message 1675108 - Posted: 8 May 2015, 9:34:42 UTC - in response to Message 1672457.  

Quick update.

The task that threw that exit message just got validated by my wingman.


+10 points for failure recovery code :)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
