Task Postponed?

Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99 · Posts: 2786 · Credit: 685,657,289 · RAC: 835 · Canada
Message 1671777 - Posted: 30 Apr 2015, 1:58:34 UTC
Last modified: 30 Apr 2015, 2:01:34 UTC

This is a new message for me ...

2015-04-29 7:51:50 PM | SETI@home | task postponed 180.000000 sec: Cuda runtime, memory related failure, threadsafe temporary Exit

*scratching my head* Never seen that message before.

EDIT: As I was typing this, the status changed to 'Waiting to run'.
Zalster (Special Project $250 donor)
Volunteer tester
Joined: 27 May 99 · Posts: 5517 · Credit: 528,817,460 · RAC: 242 · United States
Message 1671781 - Posted: 30 Apr 2015, 2:12:39 UTC - in response to Message 1671777.  

Corrupted kernel build?

Seems like I've seen that somewhere before. Let's see what the others say first before I suggest a possible solution.
Brent Norman
Message 1671782 - Posted: 30 Apr 2015, 2:15:34 UTC - in response to Message 1671781.  

I haven't changed anything, and it's been working just fine. It seems it was just the one task (http://setiathome.berkeley.edu/result.php?resultid=4118287690), at about 48% complete. It's not done yet, so no stderr yet.

Not sure if it's complaining about CPU or GPU memory.
Zalster
Message 1671783 - Posted: 30 Apr 2015, 2:18:56 UTC - in response to Message 1671782.  

Try quitting BOINC and restarting it to see if the task picks back up. You could also try rebooting the computer to see if that resolves it.
Brent Norman
Message 1671786 - Posted: 30 Apr 2015, 2:23:11 UTC - in response to Message 1671783.  

I suspended another task and it started again; in 15 minutes I should have a stderr to look at.
Brent Norman
Message 1671789 - Posted: 30 Apr 2015, 2:32:10 UTC - in response to Message 1671786.  

Rebooting. This is strange: the tasks appear to be running normally and counting down, but I see my GPU went ice cold at the same time as that message.
Zalster
Message 1671796 - Posted: 30 Apr 2015, 2:45:31 UTC - in response to Message 1671789.  

Do you mean it went cool after the reboot, or cold when that 'waiting to run' message posted?

If it was the latter, then I would guess either the driver or the app crashed and nothing was progressing.

After a reboot it should restart at the last saved point and then progress forward. Let's see if it finishes.
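
For reference, the "last saved point" comes from BOINC's checkpoint mechanism. A minimal sketch of that pattern, assuming the standard boinc_api calls (write_state_file() is a hypothetical helper, not the actual SETI@home code):

#include "boinc_api.h"   // boinc_time_to_checkpoint(), boinc_checkpoint_completed()

void write_state_file(int progress_units);   // hypothetical: persists progress to disk

void maybe_checkpoint(int progress_units) {
    // Ask the client whether now is a good time to save state.
    if (boinc_time_to_checkpoint()) {
        write_state_file(progress_units);   // save enough state to resume from here
        boinc_checkpoint_completed();       // confirm the save to the client
    }
}

On restart the app re-reads that saved state, which is why a resumed task reports lines like "Restarted at 41.24 percent".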
Brent Norman
Message 1671799 - Posted: 30 Apr 2015, 2:56:49 UTC - in response to Message 1671796.  

It went cold at the same time as that message appeared. After reboot it warmed up again.

This is what I see in stderr:
Error on call (cudaMemcpy(&flags, dev_flag, sizeof(*dev_flag), cudaMemcpyDeviceToHost)), file c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcc_gaussfit.cu, line 587: unknown error
Exiting
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.
Cuda sync'd & freed.
Preemptively acknowledging a safe temporary exit->
Exit Status: 0
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
Cuda threadsafe ExitProcess() initiated, rval 0


Not sure if that is a GPU memory fault, or if it just couldn't copy the data back from GPU memory.
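
The failing call, cudaMemcpy(..., cudaMemcpyDeviceToHost), copies a result flag from GPU memory back to host RAM, so the complaint concerns the GPU side. A minimal sketch of that checked-call pattern (illustrative names, not the actual Xbranch source):

#include <cstdio>
#include <cuda_runtime.h>

bool copy_flags_back(int* host_flags, const int* dev_flag) {
    cudaError_t err = cudaMemcpy(host_flags, dev_flag, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        // "unknown error" here usually means the GPU context itself is in a bad
        // state (crashed kernel, driver reset, hang), not that this particular
        // copy was malformed.
        fprintf(stderr, "Error on call (cudaMemcpy): %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}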

Oh well, I guess I'll cross my fingers and hope it doesn't happen again.
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0 · Australia
Message 1671815 - Posted: 30 Apr 2015, 4:04:46 UTC - in response to Message 1671799.  
Last modified: 30 Apr 2015, 4:09:06 UTC

That's recovery from a failure during transfer of gaussfit results from the VRAM back to the host.

If it only happened the once, or extremely rarely, I wouldn't worry about it. The temporary exit did what it could to back out and try again, hopefully saving the result from some spurious problem.
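
A minimal sketch of that back-out-and-retry path (assuming the two-argument form of BOINC's boinc_temporary_exit(); the 180-second delay and cudaAcc_free() mirror the stderr above, but this is an illustration, not the actual Xbranch code):

#include "boinc_api.h"       // boinc_temporary_exit()
#include <cuda_runtime.h>

extern void cudaAcc_free();  // the app's own cleanup routine (named in the stderr log)

void bail_out_and_retry() {
    cudaAcc_free();          // release all device buffers
    cudaDeviceReset();       // drop the possibly-wedged CUDA context
    // Ask the client to postpone the task ("task postponed 180.000000 sec")
    // and relaunch it later; on restart it resumes from the last checkpoint.
    boinc_temporary_exit(180, "Cuda runtime, memory related failure");
}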

If it happens more often, then you'd be looking at a range of possible issues, from disk integrity through drivers to some hardware fault.

What actually caused that particular glitch, assuming ordinary consumer-grade equipment like the rest of us use, could be just about anything: a one-off spurious glitch, something temperature-induced, some other application messing with the GPU or system at the time, an immature system driver, even radiation in the silicon chip packaging material.

That's why enterprise gear exists, with ECC RAM, Tesla compute cards, redundant PSUs, etc. ... but for me a little bit of fault tolerance is more practical ;)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Brent Norman
Message 1671822 - Posted: 30 Apr 2015, 4:33:50 UTC - in response to Message 1671815.  

Thanks for the info, Jason. It seems that was just a one-time thing; I haven't seen it again.

The wingman should validate or reject it in 6-12 hours.
BilBg
Volunteer tester
Joined: 27 May 07 · Posts: 3720 · Credit: 9,385,827 · RAC: 0 · Bulgaria
Message 1671952 - Posted: 30 Apr 2015, 11:07:04 UTC - in response to Message 1671822.  

You can see that the GPU downclocked (and after the reboot returned to 1228 MHz):
Kepler GPU current clockRate = 1228 MHz
Kepler GPU current clockRate = 405 MHz

The task was still making progress at 405 MHz (Restarted at 41.24 percent ... Restarted at 53.24 percent), so the driver was probably OK.
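
Those clockRate lines can also be confirmed from outside the app; a rough sketch reading the current (not rated) SM clock via NVML (assumed API usage; the stock app may obtain its figure differently):

#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);               // first GPU in the system
    unsigned int mhz = 0;
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &mhz);  // current clock, e.g. 405 when throttled
    printf("GPU current clockRate = %u MHz\n", mhz);
    nvmlShutdown();
    return 0;
}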

The downclock may be caused by too much overclocking, or by overheating.


NVIDIA says:

GTX 750 Ti GPU Engine Specs:
Base Clock (MHz) 1020
Boost Clock (MHz) 1085

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-750-ti/specifications
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
Brent Norman
Message 1672457 - Posted: 1 May 2015, 5:18:37 UTC

Quick update.

The task that threw that exit message just got validated by my wingman.
BilBg
Message 1675010 - Posted: 8 May 2015, 3:00:35 UTC - in response to Message 1671815.  

Can you help a user (Dave Lampkins) understand, diagnose, and fix a similar (but not the same) problem:
http://setiathome.berkeley.edu/forum_thread.php?id=77272

NVIDIA GeForce GTX 970 (4095MB) driver: 347.88
Windows 7 / BOINC 7.4.42
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7379623

'stock' "setiathome enhanced x41zc, Cuda 5.00"
"too many boinc_temporary_exit()s" because of:

uncaptured error before launch (find_pulse_kernel2<fft_n, numthreads/fft_n, 5, true><<<grid, block>>>(best_pulse_score, PulsePoTLen, AdvanceBy, y_offset, numdivs, firstP, lastP)), file c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcc_pulsefind.cu, line 1505: unknown error
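
For what that wording means: "uncaptured error before launch" is a check for an error left behind by some earlier CUDA call, polled just before the next kernel launch, so find_pulse_kernel2 is usually the messenger rather than the culprit. A minimal sketch of the pattern (illustrative, not the actual Xbranch macro):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void find_pulse_stub() {}   // stand-in for find_pulse_kernel2

bool launch_checked() {
    cudaError_t err = cudaGetLastError();   // anything pending from earlier calls?
    if (err != cudaSuccess) {
        fprintf(stderr, "uncaptured error before launch: %s\n", cudaGetErrorString(err));
        return false;                       // back out, e.g. via temporary exit
    }
    find_pulse_stub<<<1, 1>>>();
    return cudaGetLastError() == cudaSuccess;   // also catches bad launch configs
}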
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
jason_gee
Message 1675107 - Posted: 8 May 2015, 9:33:17 UTC - in response to Message 1675010.  

Had a quick look. It's hard to help without knowing the system & GPU in person, but the artefact scanner might yield some clues as to stability, provided there aren't other major issues with the system there.
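
The artefact scanner itself is a separate tool, but the core idea can be sketched as a simple VRAM pattern test (illustrative only; an unstable or heavily overclocked card shows mismatches here long before results visibly go bad):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t n = (64u << 20) / sizeof(unsigned int);   // test 64 MB of VRAM
    std::vector<unsigned int> pattern(n, 0xA5A5A5A5u), readback(n);

    unsigned int* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(unsigned int)) != cudaSuccess) return 1;
    cudaMemcpy(dev, pattern.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(readback.data(), dev, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (readback[i] != pattern[i]) ++bad;
    printf("%zu mismatched words\n", bad);   // anything nonzero is a red flag
    return bad ? 2 : 0;
}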
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
jason_gee
Message 1675108 - Posted: 8 May 2015, 9:34:42 UTC - in response to Message 1672457.  

Quick update.

The task that threw that exit message just got validated by my wingman.


+10 points for failure recovery code :)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
