Task Postponed?
Author | Message |
---|---|
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
This is a new message for me ... 2015-04-29 7:51:50 PM | SETI@home | task postponed 180.000000 sec: Cuda runtime, memory related failure, threadsafe temporary Exit *scratching my head* I've never seen that message before. EDIT: As I was typing this, the status changed to 'Waiting to run'. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Corrupted kernel build? Seems like I've seen that somewhere before. Let's see what the others say first before I suggest a possible solution. |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
I haven't changed anything, and it's been working just fine. It seems it was just the one task, http://setiathome.berkeley.edu/result.php?resultid=4118287690, at about 48% complete. It's not done yet, so there's no stderr yet. Not sure if it's complaining about CPU or GPU memory. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Try quitting BOINC and then restarting it to see if it picks back up. You could also try rebooting the computer to see if that resolves it. |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
I suspended another task and it started again. In 15 minutes I should have a stderr to look at. |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
Rebooting. This is strange: tasks appear to be running normally and counting down, but I see my GPU went ice cool at the same time as that message. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Do you mean that it went cool after the reboot, or cold when that message about waiting to run posted? If it was the latter, then I would guess either the driver or the app crashed and nothing was progressing. After a reboot it should restart at the last saved point and then progress forward. Let's see if it finishes. |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
It went cold at the same time as that message appeared. After the reboot it warmed up again. This is what I see in stderr: Error on call (cudaMemcpy(&flags, dev_flag, sizeof(*dev_flag), cudaMemcpyDeviceToHost)), file c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcc_gaussfit.cu, line 587: unknown error Exiting cudaAcc_free() called... cudaAcc_free() running... cudaAcc_free() PulseFind freed... cudaAcc_free() Gaussfit freed... cudaAcc_free() AutoCorrelation freed... cudaAcc_free() DONE. Cuda sync'd & freed. Preemptively acknowledging a safe temporary exit-> Exit Status: 0 boinc_exit(): requesting safe worker shutdown -> boinc_exit(): received safe worker shutdown acknowledge -> Cuda threadsafe ExitProcess() initiated, rval 0 Not sure if that is a GPU memory fault or if it just couldn't copy data to memory. Oh well, I guess I will cross my fingers and hope it doesn't happen again. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
That's recovery from a failure during transfer of gaussfit results from the VRAM back to the host. If it only happened the once, or extremely rarely, I wouldn't worry about it. The temporary exit did what it could to back out and try again, hopefully saving the result from some spurious problem. If it happens more often, then you'd be looking at a range of possible issues, from disk integrity through drivers to some hardware fault. What actually caused that particular glitch, assuming ordinary consumer grade equipment like the rest of us, could be just about anything, including one-off spurious glitches: temperature-induced errors, some other application messing with the GPU or system at the time, some immature system driver, or even radiation in the silicon chip packaging material. That's why enterprise gear with ECC RAM, Tesla compute cards, redundant PSUs, etc. exists... but for me a little bit of fault tolerance is more practical ;) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
Thanks for the info, Jason. It seems that was just a one-time thing; I haven't seen it again. Wingman should validate or reject in 6-12 hours. |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
You can see that the GPU downclocked (and after the reboot returned to 1228 MHz): Kepler GPU current clockRate = 1228 MHz Kepler GPU current clockRate = 405 MHz The task was still making progress at 405 MHz (Restarted at 41.24 percent ... Restarted at 53.24 percent), so the driver was probably OK. The downclock may have been caused by too much overclocking or by overheating. NVIDIA says: GTX 750 Ti GPU Engine Specs: Base Clock (MHz) 1020, Boost Clock (MHz) 1085 http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-750-ti/specifications  - ALF - "Find out what you don't do well ..... then don't do it!" :)  |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
Quick update. The task that got the exit command was just validated by my wingman. |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
Can you help a user (Dave Lampkins) to understand/diagnose/fix a similar (but not the same) problem: http://setiathome.berkeley.edu/forum_thread.php?id=77272 NVIDIA GeForce GTX 970 (4095MB) driver: 347.88 Windows 7 / BOINC 7.4.42 http://setiathome.berkeley.edu/show_host_detail.php?hostid=7379623 'stock' "setiathome enhanced x41zc, Cuda 5.00" "too many boinc_temporary_exit()s" because of: uncaptured error before launch (find_pulse_kernel2<fft_n, numthreads/fft_n, 5, true><<<grid, block>>>(best_pulse_score, PulsePoTLen, AdvanceBy, y_offset, numdivs, firstP, lastP)), file c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcc_pulsefind.cu, line 1505: unknown error  - ALF - "Find out what you don't do well ..... then don't do it!" :)  |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Had a quick look. Hard to help without knowing the system & GPU in person, but the artefact scanner might yield some clues as to stability, providing there aren't other major issues with the system there. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Quick update. +10 points for failure recovery code :) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
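
For anyone who wants to check their own stderr for the kind of downclock BilBg spotted above, here is a minimal Python sketch. It is not part of the SETI@home app or BOINC; the function names and the 0.8 drop threshold are assumptions made for illustration, and the only thing taken from the thread is the "current clockRate = N MHz" line format.

```python
import re

# Matches stderr lines such as "Kepler GPU current clockRate = 1228 MHz".
# The pattern is based only on the two lines quoted in this thread.
CLOCK_RE = re.compile(r"current clockRate = (\d+) MHz")


def clock_rates(stderr_text):
    """Return every reported clock rate (in MHz), in order of appearance."""
    return [int(value) for value in CLOCK_RE.findall(stderr_text)]


def find_downclocks(rates, threshold=0.8):
    """Return (previous, current) pairs where the clock dropped sharply.

    A drop is flagged when current < threshold * previous. The default
    of 0.8 is an arbitrary assumption, not a value from NVIDIA or BOINC.
    """
    return [
        (prev, cur)
        for prev, cur in zip(rates, rates[1:])
        if cur < prev * threshold
    ]


# Sample text built from the two clock lines quoted in the thread.
sample = (
    "Kepler GPU current clockRate = 1228 MHz\n"
    "Restarted at 41.24 percent\n"
    "Kepler GPU current clockRate = 405 MHz\n"
    "Restarted at 53.24 percent\n"
)

rates = clock_rates(sample)
drops = find_downclocks(rates)
print(drops)  # prints [(1228, 405)]
```

Each restart of the task writes a fresh clock line, so comparing consecutive values is enough to see whether the card came back at a lower clock after a temporary exit.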