AP running for 11+ hours on GPU then erroring out

qbit
Volunteer tester
Message 1683683 - Posted: 24 May 2015, 13:17:34 UTC

http://setiathome.berkeley.edu/result.php?resultid=4163479280

Haven't had something like that before. Any idea what could have caused this?
ID: 1683683 · Report as offensive
Zalster
Volunteer tester
Message 1683686 - Posted: 24 May 2015, 13:23:04 UTC - in response to Message 1683683.  
Last modified: 24 May 2015, 13:58:55 UTC

Could it have to do with the more than 100 stops and restarts that failed to make progress? Someone else will know for sure.

Is it the only AP that has done this?

Zalster

Edit x 2...

My bad, just looked it up. It says the estimated time to complete might have been set too low.
ID: 1683686
qbit
Volunteer tester
Message 1683690 - Posted: 24 May 2015, 13:33:45 UTC

Where did you see that it stopped that often? Yes, it's the only AP that had this problem. The few others I got lately all seem ok.
ID: 1683690
Claggy
Volunteer tester

Message 1683695 - Posted: 24 May 2015, 13:53:39 UTC - in response to Message 1683683.  
Last modified: 24 May 2015, 14:07:55 UTC

qbit wrote:
http://setiathome.berkeley.edu/result.php?resultid=4163479280

Haven't had something like that before. Any idea what could have caused this?

Perhaps a driver restart. It failed with 'maximum elapsed time exceeded'; it should have finished considerably quicker, more in line with your GPU wingmen:

http://setiathome.berkeley.edu/workunit.php?wuid=1796675768

Task        Computer  Sent                   Reported               Status                   Run time (s)  CPU time (s)  Credit  Application
4163479279  7324538   22 May 2015, 19:11:36  23 May 2015, 0:51:10   Completed and validated  8,238.77      267.63        557.72  AstroPulse v7 v7.07 (opencl_intel_gpu_mac)
4163479280  7563243   22 May 2015, 19:11:36  24 May 2015, 5:02:43   Error while computing    41,678.38     45.18         ---     AstroPulse v7 Anonymous platform (NVIDIA GPU)
4166447410  6077487   24 May 2015, 5:02:49   24 May 2015, 5:44:56   Completed and validated  1,766.37      162.83        557.72  AstroPulse v7 Anonymous platform (NVIDIA GPU)

A GTX 750 shouldn't take five times longer to do an AP WU than an Intel GPU.

Claggy
ID: 1683695
Zalster
Volunteer tester
Message 1683696 - Posted: 24 May 2015, 13:54:48 UTC - in response to Message 1683690.  

My apologies; I added an edit to my post saying that the estimated time to complete was set too low.

Sometimes that happens, so that even when a task is progressing correctly you "run out of time".

The estimated time was set too low, so the task gets an error once it exceeds what BOINC thought was the appropriate amount of time to complete.
ID: 1683696
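The mechanism behind that error message: each task carries a hard cap on floating-point operations (rsc_fpops_bound), and the client aborts the task once its elapsed time exceeds that cap divided by the host's estimated speed for the app version. A minimal sketch of the check in Python, with illustrative names rather than BOINC's actual source:

    def max_elapsed_seconds(rsc_fpops_bound, flops_estimate):
        # rsc_fpops_bound: hard cap on floating-point ops, set by the project
        # flops_estimate: host's estimated speed for this app version (fpops/sec)
        return rsc_fpops_bound / flops_estimate

    def check_runtime(elapsed_seconds, rsc_fpops_bound, flops_estimate):
        # The client applies a check like this while the task runs; tripping it
        # produces the "Maximum elapsed time exceeded" error seen in the result.
        if elapsed_seconds > max_elapsed_seconds(rsc_fpops_bound, flops_estimate):
            raise RuntimeError("Maximum elapsed time exceeded")

Note that the faster the host is thought to be, the shorter the wall-clock limit it gets for the same task.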
Zalster
Volunteer tester
Message 1683699 - Posted: 24 May 2015, 14:10:17 UTC - in response to Message 1683696.  
Last modified: 24 May 2015, 14:11:35 UTC

NX, how many tasks per card are you running?

Two APs on my 750 Ti take about 1.1 hours.

So Claggy is correct; what else was going on that took it over 11 hours?
ID: 1683699
Richard Haselgrove
Volunteer tester

Message 1683703 - Posted: 24 May 2015, 14:18:32 UTC - in response to Message 1683696.  

Zalster wrote:
My apologies; I added an edit to my post saying that the estimated time to complete was set too low.

Sometimes that happens, so that even when a task is progressing correctly you "run out of time".

The estimated time was set too low, so the task gets an error once it exceeds what BOINC thought was the appropriate amount of time to complete.

No, I don't think that's it. The host's reported speeds, and the times for other tasks, look normal - but the running time for that particular task is far too long.

I have - very rarely - seen NVidia tasks fail to make progress when BOINC says they are running. They simply stop, at some random point during the run, without any error or warning message: the warning sign, if you have suitable monitoring software, is that CPU usage drops to zero.

I've only ever seen this with recent drivers: for me, on XP machines, the problem started when I installed driver v347.88, and went away again when I reverted to driver v344.75.
ID: 1683703
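If you want to catch that state automatically rather than by eye, one rough approach is to watch whether the app's accumulated CPU time is still advancing. A sketch, assuming Python with the psutil package installed and an illustrative process-name match:

    import time
    import psutil

    APP_NAME = "astropulse"   # illustrative; match your app's actual process name
    POLL_SECONDS = 300        # five minutes, roughly BoincView's averaging window

    def find_app():
        for proc in psutil.process_iter(["name"]):
            if APP_NAME in (proc.info["name"] or "").lower():
                return proc
        return None

    proc = find_app()
    if proc is not None:
        before = sum(proc.cpu_times()[:2])   # user + system CPU seconds so far
        time.sleep(POLL_SECONDS)
        after = sum(proc.cpu_times()[:2])
        if after == before:
            print("CPU time not advancing: task may be dead-stopped")

A task whose CPU time hasn't moved at all in five minutes is a good candidate for the suspend-and-resume treatment discussed later in the thread.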
Mike
Volunteer tester
Message 1683714 - Posted: 24 May 2015, 14:53:27 UTC

Checking the other tasks, I think the GPU was waiting for CPU cycles.
He is running a dual-core CPU and, judging by the timings, I guess two instances on the GPU.


ID: 1683714
qbit
Volunteer tester
Message 1683720 - Posted: 24 May 2015, 15:19:09 UTC

Sorry folks, I don't have a problem with BOINC stopping this task; that's OK. I just wonder what could have caused it to run that long. Usually APs take about 1.5 hours here.

@Mike: Yeah, I run two tasks. GPU only, no crunching on the CPU and no other work done on this computer.
ID: 1683720
Mike
Volunteer tester
Message 1683745 - Posted: 24 May 2015, 16:56:19 UTC
Last modified: 24 May 2015, 17:02:14 UTC

How many MBv7 instances are you running?
I can see you are running CUDA tasks above normal priority but APs below normal.
This task was running in conjunction with MB task(s), so it would make sense to set -hp for the APs as well.
I also suggest using the -cpu_lock switch to bind the APs to a particular CPU core.


ID: 1683745
Richard Haselgrove
Volunteer tester

Message 1683770 - Posted: 24 May 2015, 18:38:29 UTC - in response to Message 1683745.  

Mike,

While I totally agree that CPU contention issues and CPU 'loading' by OpenCL runtime components are something that the developer community ought to take up (ought to have taken up) with Khronos, NV, and AMD - I can't see that they are implicated in this case.

Look at the Astropulse task list: as NX-01 says, tasks normally run for ~1 - ~1.5 hours. That 'feels right' for a GTX 750 (I have three of them). There's just the one single task which ran eight times as long, and was killed for running 20x (the SETI limit) its runtime estimate. In my experience, that points to a local computer glitch, not a configuration issue (which would affect all tasks) or a server estimate error (which wouldn't affect the actual runtime).

I still think the closest match to my own experience with these cards is the total application stall occasionally observed with late-version drivers. In my case, that shows up in BoincView as a CPU efficiency of 0.0000: BoincView averages that figure over, I think, the last 10 RPCs (5 minutes), so it does drop to zero - unlike the BoincTasks equivalent, which is averaged over (again, I think) the lifetime of the task, so can never reach true zero if any CPU time at all has been recorded in the early part of the run.

If these sporadic over-run errors continue to appear on NX-01's list, I'd suggest a driver downgrade to v344.75, just as an experiment. If that works, of course the next step is to refine the configuration as Mike suggests.
ID: 1683770
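The difference between those two readings comes down to the averaging window; roughly:

    BoincView (windowed):   efficiency = Δcpu_time / Δelapsed_time, over the last ~10 RPCs (~5 minutes)
    BoincTasks (lifetime):  efficiency = total_cpu_time / total_elapsed_time

A dead-stopped task accumulates no CPU time inside the recent window, so the windowed figure reads 0.0000, while the lifetime ratio keeps whatever CPU time was logged earlier in the run and can only decay towards zero.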
Raistmer
Volunteer developer
Volunteer tester
Message 1683795 - Posted: 24 May 2015, 19:41:03 UTC - in response to Message 1683770.  

Actually, if NV demonstrates the same issue with a loaded CPU as ATI does (without affinity management), then exactly this kind of sporadic several-fold slowdown in execution is expected.
I would add -cpu_lock.
ID: 1683795
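For the curious: what -cpu_lock amounts to is pinning each app instance to one logical CPU, so instances stop migrating between cores and evicting each other. A rough illustration of the same idea, assuming Python with the psutil package (not the app's actual implementation):

    import os
    import psutil

    # Pin the current process to a single logical CPU (core 0 here).
    # -cpu_lock does the equivalent inside the app, giving each running
    # instance its own core so their feeder threads don't compete.
    me = psutil.Process(os.getpid())
    me.cpu_affinity([0])        # restrict scheduling to logical CPU 0
    print(me.cpu_affinity())    # -> [0]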
Richard Haselgrove
Volunteer tester

Message 1683835 - Posted: 24 May 2015, 22:55:56 UTC - in response to Message 1683795.  

Raistmer wrote:
Actually, if NV demonstrates the same issue with a loaded CPU as ATI does (without affinity management), then exactly this kind of sporadic several-fold slowdown in execution is expected.
I would add -cpu_lock.

We need to find ways of distinguishing 'slowdown' from 'complete dead stop'. I find CPU efficiency monitoring helps - 0.0000 efficiency is a very specific marker, which I don't see on any task simply running slow. And if I do see it, "suspend and resume" restores normal progress, but usually winds back elapsed time to a 'last checkpoint' several hours ago.
ID: 1683835
Mike
Volunteer tester
Message 1683841 - Posted: 24 May 2015, 23:21:47 UTC

Nothing is more important than plenty of CPU cycles in this case.
The OP is using sharp timings for a 750.
Using -unroll 11 for just 4 compute units is at the max for this GPU.
So nothing else will help with this CPU/GPU combo.
To be fair, adding the -cpu_lock switch is not that big a deal.


ID: 1683841
TBar
Volunteer tester

Message 1683844 - Posted: 24 May 2015, 23:26:21 UTC

In my experience, 'hangs' such as this usually happen at app startup, while the CPU is run up to 100% trying to launch the task. I've also determined that the same scenario is usually why I sometimes have an app crash with SIGABRT. It would appear my quad-core CPU is stressed when running 3 GPU tasks and 2 CPU tasks, especially when running ATI MBs versus ATI APs. On ATIs, it's the MBs that use more CPU.

I've found things go much more smoothly if I just run 1 CPU task when running 3 MB tasks. I've come back to the computer and found an ATI MB task in a hung state with the timer at 8 hours; usually just suspending and resuming the task clears it. I can see where there would be a problem running 2 GPUs on a dual-core machine using OpenCL, even when using the Sleep switch.
ID: 1683844
Raistmer
Volunteer developer
Volunteer tester
Message 1683997 - Posted: 25 May 2015, 10:41:30 UTC - in response to Message 1683835.  

Raistmer wrote:
Actually, if NV demonstrates the same issue with a loaded CPU as ATI does (without affinity management), then exactly this kind of sporadic several-fold slowdown in execution is expected.
I would add -cpu_lock.

Richard Haselgrove wrote:
We need to find ways of distinguishing 'slowdown' from 'complete dead stop'. I find CPU efficiency monitoring helps - 0.0000 efficiency is a very specific marker, which I don't see on any task simply running slow. And if I do see it, "suspend and resume" restores normal progress, but usually winds back elapsed time to a 'last checkpoint' several hours ago.

Zero CPU consumption means a driver restart happened and the runtime is stuck, so the app simply never returns from its last runtime call.
ID: 1683997
Richard Haselgrove
Volunteer tester

Message 1684006 - Posted: 25 May 2015, 11:14:38 UTC - in response to Message 1683997.  

Raistmer wrote:
Actually, if NV demonstrates the same issue with a loaded CPU as ATI does (without affinity management), then exactly this kind of sporadic several-fold slowdown in execution is expected.
I would add -cpu_lock.

Richard Haselgrove wrote:
We need to find ways of distinguishing 'slowdown' from 'complete dead stop'. I find CPU efficiency monitoring helps - 0.0000 efficiency is a very specific marker, which I don't see on any task simply running slow. And if I do see it, "suspend and resume" restores normal progress, but usually winds back elapsed time to a 'last checkpoint' several hours ago.

Raistmer wrote:
Zero CPU consumption means a driver restart happened and the runtime is stuck, so the app simply never returns from its last runtime call.

IIRC, driver restarts usually put CUDA tasks (which is where I was seeing this, both SETI and GPUGrid) into "BOINC temporary exit" - from which they can recover automatically.

But the driver-version-specific dead stops I'm talking about here don't seem to be the same thing as the normal driver restarts I'm sure we've all seen - unless NVidia changed their restart procedure subsequent to v344.75, and it no longer triggers a temporary exit?
ID: 1684006
qbit
Volunteer tester
Message 1684034 - Posted: 25 May 2015, 14:30:19 UTC

Thx everybody!

Looks like I forgot to raise the priority for APs. I'm not sure if this was part of the problem, but I've changed it now. I will check whether something like that happens again, and if it does I will try the -cpu_lock option.

@Mike: I think -unroll 10 is what you recommended for my card in some other thread. I'm not really sure though, since I accidentally deleted my old file before I reinstalled Windows and BOINC on this computer, so I had to improvise a bit.

-hp -use_sleep -unroll 10 -oclFFT_plan 256 16 512 -ffa_block 6144 -ffa_block_fetch 1536 -tune 1 64 8 1 -tune 2 64 8 1


This is my current command line. Would you change anything here, or does it look OK to you? (Of course, anybody else is also welcome to comment on this!)
ID: 1684034
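For readers who haven't tuned one of these apps, a flag-by-flag gloss of that line (approximate meanings as described in the OpenCL AstroPulse readme; treat these as a guide rather than a specification):

    -hp                        run the app at a higher process priority
    -use_sleep                 sleep while waiting on the GPU, cutting CPU usage
    -unroll 10                 how much FFA work is unrolled per kernel invocation
    -oclFFT_plan 256 16 512    custom parameters for the oclFFT plan
    -ffa_block 6144            FFA processing block size
    -ffa_block_fetch 1536      FFA fetch block size
    -tune 1 64 8 1             workgroup tuning for kernel 1
    -tune 2 64 8 1             workgroup tuning for kernel 2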
Mike
Volunteer tester
Message 1684041 - Posted: 25 May 2015, 15:24:07 UTC - in response to Message 1684034.  
Last modified: 25 May 2015, 15:52:51 UTC

qbit wrote:
Thx everybody!

Looks like I forgot to raise the priority for APs. I'm not sure if this was part of the problem, but I've changed it now. I will check whether something like that happens again, and if it does I will try the -cpu_lock option.

@Mike: I think -unroll 10 is what you recommended for my card in some other thread. I'm not really sure though, since I accidentally deleted my old file before I reinstalled Windows and BOINC on this computer, so I had to improvise a bit.

-hp -use_sleep -unroll 10 -oclFFT_plan 256 16 512 -ffa_block 6144 -ffa_block_fetch 1536 -tune 1 64 8 1 -tune 2 64 8 1

This is my current command line. Would you change anything here, or does it look OK to you? (Of course, anybody else is also welcome to comment on this!)


Looks OK now.

Just keep in mind that when you add -cpu_lock you also need to add -instances_per_device N.


ID: 1684041
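Putting that together with qbit's existing line, running two tasks per GPU, the adjusted command line would look something like this (a sketch; check the readme shipped with your app version before using it):

    -hp -use_sleep -unroll 10 -oclFFT_plan 256 16 512 -ffa_block 6144 -ffa_block_fetch 1536 -tune 1 64 8 1 -tune 2 64 8 1 -cpu_lock -instances_per_device 2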
Josef W. Segur
Volunteer developer
Volunteer tester

Message 1684076 - Posted: 25 May 2015, 16:44:27 UTC - in response to Message 1683770.  

Richard Haselgrove wrote:
...
Look at the Astropulse task list: as NX-01 says, tasks normally run for ~1 - ~1.5 hours. That 'feels right' for a GTX 750 (I have three of them). There's just the one single task which ran eight times as long, and was killed for running 20x (the SETI limit) its runtime estimate.
...

Although the SaHv7 splitter code was changed to 20x, the APv7 splitter code is still at 10x, so the bound is 1.821e16. The APv7 APR on the host must have been somewhat higher than its present 414.05e9, which would put the "too long" at about 12.2 hours.
Joe
ID: 1684076
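Working that through with Joe's figures:

    time limit = rsc_fpops_bound / APR
               = 1.821e16 / 414.05e9
               ≈ 43,980 seconds ≈ 12.2 hours

The errored task was killed at 41,678 seconds (about 11.6 hours), which corresponds to an APR of roughly 437e9, consistent with the APR on record having been somewhat higher when the limit was computed than the present 414.05e9.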