AP running for 11+ hours on GPU then erroring out

qbit
Volunteer tester
Message 1683683 - Posted: 24 May 2015, 13:17:34 UTC

http://setiathome.berkeley.edu/result.php?resultid=4163479280

Haven't had something like that before. Any idea what could have caused this?
ID: 1683683 · Report as offensive
Zalster
Volunteer tester
Message 1683686 - Posted: 24 May 2015, 13:23:04 UTC - in response to Message 1683683.  
Last modified: 24 May 2015, 13:58:55 UTC

Could it have to do with the more than 100 stops and restarts that failed to make progress? Someone else will know for sure.

Is it the only AP that has done this?

Zalster

Edit x 2...

My bad, just looked it up. It says the estimated time to complete might have been set too low.
ID: 1683686
qbit
Volunteer tester
Message 1683690 - Posted: 24 May 2015, 13:33:45 UTC

Where did you see that it stopped that often? Yes, it's the only AP that had this problem. The few others I got lately all seem ok.
ID: 1683690
Claggy
Volunteer tester

Message 1683695 - Posted: 24 May 2015, 13:53:39 UTC - in response to Message 1683683.  
Last modified: 24 May 2015, 14:07:55 UTC

qbit wrote:
http://setiathome.berkeley.edu/result.php?resultid=4163479280

Haven't had something like that before. Any idea what could have caused this?

Perhaps a driver restart. It failed with 'maximum elapsed time exceeded'; it should have finished considerably quicker, more in line with your GPU wingmen:

http://setiathome.berkeley.edu/workunit.php?wuid=1796675768

Task        Computer  Sent                   Reported               Status                   Run time (s)  CPU time (s)  Credit  Application
4163479279  7324538   22 May 2015, 19:11:36  23 May 2015, 0:51:10   Completed and validated  8,238.77      267.63        557.72  AstroPulse v7 v7.07 (opencl_intel_gpu_mac)
4163479280  7563243   22 May 2015, 19:11:36  24 May 2015, 5:02:43   Error while computing    41,678.38     45.18         ---     AstroPulse v7 Anonymous platform (NVIDIA GPU)
4166447410  6077487   24 May 2015, 5:02:49   24 May 2015, 5:44:56   Completed and validated  1,766.37      162.83        557.72  AstroPulse v7 Anonymous platform (NVIDIA GPU)

A GTX 750 shouldn't take five times longer to do an AP WU than an Intel GPU.

Claggy
ID: 1683695
Zalster
Volunteer tester
Message 1683696 - Posted: 24 May 2015, 13:54:48 UTC - in response to Message 1683690.  

My apologies; I added an edit to my post saying that the estimated time to complete was set too low.

Sometimes that happens, so that even when a task is progressing correctly you "run out of time".

The estimated time was set too low, so the task gets an error once it exceeds what BOINC thought was the appropriate amount of time to complete.
ID: 1683696
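The mechanism behind that error message: each task carries a hard cap on floating-point operations (rsc_fpops_bound), and the client aborts the task once its elapsed time exceeds that cap divided by the host's estimated speed for the app version. A minimal sketch of the check in Python, with illustrative names rather than BOINC's actual source:

    def max_elapsed_seconds(rsc_fpops_bound, flops_estimate):
        # rsc_fpops_bound: hard cap on floating-point ops, set by the project
        # flops_estimate: host's estimated speed for this app version (fpops/sec)
        return rsc_fpops_bound / flops_estimate

    def check_runtime(elapsed_seconds, rsc_fpops_bound, flops_estimate):
        # The client applies a check like this while the task runs; tripping it
        # produces the "Maximum elapsed time exceeded" error seen in the result.
        if elapsed_seconds > max_elapsed_seconds(rsc_fpops_bound, flops_estimate):
            raise RuntimeError("Maximum elapsed time exceeded")

Note that the faster the host is thought to be, the shorter the wall-clock limit it gets for the same task.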
Zalster
Volunteer tester
Message 1683699 - Posted: 24 May 2015, 14:10:17 UTC - in response to Message 1683696.  
Last modified: 24 May 2015, 14:11:35 UTC

NX, how many tasks per card are you running?

Two APs on my 750 Ti take about 1.1 hours.

So Claggy is correct; what else was going on that took it over 11 hours?
ID: 1683699
Richard Haselgrove
Volunteer tester

Message 1683703 - Posted: 24 May 2015, 14:18:32 UTC - in response to Message 1683696.  

Zalster wrote:
My apologies; I added an edit to my post saying that the estimated time to complete was set too low.

Sometimes that happens, so that even when a task is progressing correctly you "run out of time".

The estimated time was set too low, so the task gets an error once it exceeds what BOINC thought was the appropriate amount of time to complete.

No, I don't think that's it. The host's reported speeds, and the times for other tasks, look normal - but the running time for that particular task is far too long.

I have - very rarely - seen NVidia tasks fail to make progress when BOINC says they are running. They simply stop, at some random point during the run, without any error or warning message: the warning sign, if you have suitable monitoring software, is that CPU usage drops to zero.

I've only ever seen this with recent drivers: for me, on XP machines, the problem started when I installed driver v347.88, and went away again when I reverted to driver v344.75.
ID: 1683703
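If you want to catch that state automatically rather than by eye, one rough approach is to watch whether the app's accumulated CPU time is still advancing. A sketch, assuming Python with the psutil package installed and an illustrative process-name match:

    import time
    import psutil

    APP_NAME = "astropulse"   # illustrative; match your app's actual process name
    POLL_SECONDS = 300        # five minutes, roughly BoincView's averaging window

    def find_app():
        for proc in psutil.process_iter(["name"]):
            if APP_NAME in (proc.info["name"] or "").lower():
                return proc
        return None

    proc = find_app()
    if proc is not None:
        before = sum(proc.cpu_times()[:2])   # user + system CPU seconds so far
        time.sleep(POLL_SECONDS)
        after = sum(proc.cpu_times()[:2])
        if after == before:
            print("CPU time not advancing: task may be dead-stopped")

A task whose CPU time hasn't moved at all in five minutes is a good candidate for the suspend-and-resume treatment discussed later in the thread.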
Mike
Volunteer tester
Message 1683714 - Posted: 24 May 2015, 14:53:27 UTC

Checking the other tasks, I think the GPU was waiting for CPU cycles.
He is running a dual-core CPU and, judging by the timings, I guess two instances on the GPU.


ID: 1683714
qbit
Volunteer tester
Message 1683720 - Posted: 24 May 2015, 15:19:09 UTC

Sorry folks, I don't have a problem with BOINC stopping this task; that's OK. I just wonder what could have caused it to run that long. Usually APs take about 1.5 hours here.

@Mike: Yeah, I run two tasks. GPU only, no crunching on the CPU and no other work done on this computer.
ID: 1683720
Mike
Volunteer tester
Message 1683745 - Posted: 24 May 2015, 16:56:19 UTC
Last modified: 24 May 2015, 17:02:14 UTC

How many MBv7 instances are you running?
I can see you are running CUDA tasks above normal priority but APs below normal.
This task was running in conjunction with MB task(s), so it would make sense to set -hp for the APs as well.
I also suggest using the -cpu_lock switch to bind the APs to a particular CPU core.


ID: 1683745
Richard Haselgrove
Volunteer tester

Message 1683770 - Posted: 24 May 2015, 18:38:29 UTC - in response to Message 1683745.  

Mike,

While I totally agree that CPU contention issues and CPU 'loading' by OpenCL runtime components are something that the developer community ought to take up (ought to have taken up) with Khronos, NV, and AMD - I can't see that they are implicated in this case.

Look at the Astropulse task list: as NX-01 says, tasks normally run for ~1 - ~1.5 hours. That 'feels right' for a GTX 750 (I have three of them). There's just the one single task which ran eight times as long, and was killed for running 20x (the SETI limit) its runtime estimate. In my experience, that points to a local computer glitch, not a configuration issue (which would affect all tasks) or a server estimate error (which wouldn't affect the actual runtime).

I still think the closest match to my own experience with these cards is the total application stall occasionally observed with late-version drivers. In my case, that shows up in BoincView as a CPU efficiency of 0.0000: BoincView averages that figure over, I think, the last 10 RPCs (5 minutes), so it does drop to zero - unlike the BoincTasks equivalent, which is averaged over (again, I think) the lifetime of the task, so can never reach true zero if any CPU time at all has been recorded in the early part of the run.

If these sporadic over-run errors continue to appear on NX-01's list, I'd suggest a driver downgrade to v344.75, just as an experiment. If that works, of course the next step is to refine the configuration as Mike suggests.
ID: 1683770
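The difference between those two readings comes down to the averaging window; roughly:

    BoincView (windowed):   efficiency = Δcpu_time / Δelapsed_time, over the last ~10 RPCs (~5 minutes)
    BoincTasks (lifetime):  efficiency = total_cpu_time / total_elapsed_time

A dead-stopped task accumulates no CPU time inside the recent window, so the windowed figure reads 0.0000, while the lifetime ratio keeps whatever CPU time was logged earlier in the run and can only decay towards zero.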
Raistmer
Volunteer developer
Volunteer tester
Message 1683795 - Posted: 24 May 2015, 19:41:03 UTC - in response to Message 1683770.  

Actually, if NV demonstrates the same issue with a loaded CPU as ATI does (without affinity management), then exactly this kind of sporadic several-fold slowdown in execution is expected.
I would add -cpu_lock.
ID: 1683795
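For the curious: what -cpu_lock amounts to is pinning each app instance to one logical CPU, so instances stop migrating between cores and evicting each other. A rough illustration of the same idea, assuming Python with the psutil package (not the app's actual implementation):

    import os
    import psutil

    # Pin the current process to a single logical CPU (core 0 here).
    # -cpu_lock does the equivalent inside the app, giving each running
    # instance its own core so their feeder threads don't compete.
    me = psutil.Process(os.getpid())
    me.cpu_affinity([0])        # restrict scheduling to logical CPU 0
    print(me.cpu_affinity())    # -> [0]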
Richard Haselgrove
Volunteer tester

Message 1683835 - Posted: 24 May 2015, 22:55:56 UTC - in response to Message 1683795.  

Raistmer wrote:
Actually, if NV demonstrates the same issue with a loaded CPU as ATI does (without affinity management), then exactly this kind of sporadic several-fold slowdown in execution is expected.
I would add -cpu_lock.

We need to find ways of distinguishing 'slowdown' from 'complete dead stop'. I find CPU efficiency monitoring helps - 0.0000 efficiency is a very specific marker, which I don't see on any task simply running slow. And if I do see it, "suspend and resume" restores normal progress, but usually winds back elapsed time to a 'last checkpoint' several hours ago.
ID: 1683835
Mike
Volunteer tester
Message 1683841 - Posted: 24 May 2015, 23:21:47 UTC

Nothing is more important than plenty of CPU cycles in this case.
The OP is using sharp timings for a 750.
Using -unroll 11 for just 4 compute units is at the max for this GPU.
So nothing else will help with this CPU/GPU combo.
To be fair, adding the -cpu_lock switch is not that big a deal.


ID: 1683841
TBar
Volunteer tester

Message 1683844 - Posted: 24 May 2015, 23:26:21 UTC

In my experience, 'hangs' such as this usually happen at app startup, while the CPU is run up to 100% trying to launch the task. I've also determined that the same scenario is usually why I sometimes have an app crash with SIGABRT. It would appear my quad-core CPU is stressed when running 3 GPU tasks and 2 CPU tasks, especially when running ATI MBs versus ATI APs. On ATIs, it's the MBs that use more CPU.

I've found things go much more smoothly if I just run 1 CPU task when running 3 MB tasks. I've come back to the computer and found an ATI MB task in a hung state with the timer at 8 hours; usually just suspending and resuming the task clears it. I can see where there would be a problem running 2 GPUs on a dual-core machine using OpenCL, even when using the Sleep switch.
ID: 1683844
Raistmer
Volunteer developer
Volunteer tester
Message 1683997 - Posted: 25 May 2015, 10:41:30 UTC - in response to Message 1683835.  

Raistmer wrote:
Actually, if NV demonstrates the same issue with a loaded CPU as ATI does (without affinity management), then exactly this kind of sporadic several-fold slowdown in execution is expected.
I would add -cpu_lock.

Richard Haselgrove wrote:
We need to find ways of distinguishing 'slowdown' from 'complete dead stop'. I find CPU efficiency monitoring helps - 0.0000 efficiency is a very specific marker, which I don't see on any task simply running slow. And if I do see it, "suspend and resume" restores normal progress, but usually winds back elapsed time to a 'last checkpoint' several hours ago.

Zero CPU consumption means a driver restart happened and the runtime is stuck, so the app simply never returns from its last runtime call.
ID: 1683997
Richard Haselgrove
Volunteer tester

Message 1684006 - Posted: 25 May 2015, 11:14:38 UTC - in response to Message 1683997.  

Raistmer wrote:
Actually, if NV demonstrates the same issue with a loaded CPU as ATI does (without affinity management), then exactly this kind of sporadic several-fold slowdown in execution is expected.
I would add -cpu_lock.

Richard Haselgrove wrote:
We need to find ways of distinguishing 'slowdown' from 'complete dead stop'. I find CPU efficiency monitoring helps - 0.0000 efficiency is a very specific marker, which I don't see on any task simply running slow. And if I do see it, "suspend and resume" restores normal progress, but usually winds back elapsed time to a 'last checkpoint' several hours ago.

Raistmer wrote:
Zero CPU consumption means a driver restart happened and the runtime is stuck, so the app simply never returns from its last runtime call.

IIRC, driver restarts usually put CUDA tasks (which is where I was seeing this, both SETI and GPUGrid) into "BOINC temporary exit" - from which they can recover automatically.

But the driver-version-specific dead stops I'm talking about here don't seem to be the same thing as the normal driver restarts I'm sure we've all seen - unless NVidia changed their restart procedure subsequent to v344.75, and it no longer triggers a temporary exit?
ID: 1684006
qbit
Volunteer tester
Message 1684034 - Posted: 25 May 2015, 14:30:19 UTC

Thx everybody!

Looks like I forgot to raise the priority for APs. I'm not sure if this was part of the problem, but I've changed it now. I will check whether something like that happens again, and if it does I will try the -cpu_lock option.

@Mike: I think -unroll 10 is what you recommended for my card in some other thread. I'm not really sure though, since I accidentally deleted my old file before I reinstalled Windows and BOINC on this computer, so I had to improvise a bit.

-hp -use_sleep -unroll 10 -oclFFT_plan 256 16 512 -ffa_block 6144 -ffa_block_fetch 1536 -tune 1 64 8 1 -tune 2 64 8 1


This is my current command line. Would you change anything here, or does it look OK to you? (Of course, anybody else is also welcome to comment on this!)
ID: 1684034
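For readers who haven't tuned one of these apps, a flag-by-flag gloss of that line (approximate meanings as described in the OpenCL AstroPulse readme; treat these as a guide rather than a specification):

    -hp                        run the app at a higher process priority
    -use_sleep                 sleep while waiting on the GPU, cutting CPU usage
    -unroll 10                 how much FFA work is unrolled per kernel invocation
    -oclFFT_plan 256 16 512    custom parameters for the oclFFT plan
    -ffa_block 6144            FFA processing block size
    -ffa_block_fetch 1536      FFA fetch block size
    -tune 1 64 8 1             workgroup tuning for kernel 1
    -tune 2 64 8 1             workgroup tuning for kernel 2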
Mike
Volunteer tester
Message 1684041 - Posted: 25 May 2015, 15:24:07 UTC - in response to Message 1684034.  
Last modified: 25 May 2015, 15:52:51 UTC

qbit wrote:
Thx everybody!

Looks like I forgot to raise the priority for APs. I'm not sure if this was part of the problem, but I've changed it now. I will check whether something like that happens again, and if it does I will try the -cpu_lock option.

@Mike: I think -unroll 10 is what you recommended for my card in some other thread. I'm not really sure though, since I accidentally deleted my old file before I reinstalled Windows and BOINC on this computer, so I had to improvise a bit.

-hp -use_sleep -unroll 10 -oclFFT_plan 256 16 512 -ffa_block 6144 -ffa_block_fetch 1536 -tune 1 64 8 1 -tune 2 64 8 1

This is my current command line. Would you change anything here, or does it look OK to you? (Of course, anybody else is also welcome to comment on this!)


Looks OK now.

Just keep in mind that when you add -cpu_lock you also need to add -instances_per_device N.


ID: 1684041
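Putting that together with qbit's existing line, running two tasks per GPU, the adjusted command line would look something like this (a sketch; check the readme shipped with your app version before using it):

    -hp -use_sleep -unroll 10 -oclFFT_plan 256 16 512 -ffa_block 6144 -ffa_block_fetch 1536 -tune 1 64 8 1 -tune 2 64 8 1 -cpu_lock -instances_per_device 2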
Josef W. Segur
Volunteer developer
Volunteer tester

Message 1684076 - Posted: 25 May 2015, 16:44:27 UTC - in response to Message 1683770.  

Richard Haselgrove wrote:
...
Look at the Astropulse task list: as NX-01 says, tasks normally run for ~1 - ~1.5 hours. That 'feels right' for a GTX 750 (I have three of them). There's just the one single task which ran eight times as long, and was killed for running 20x (the SETI limit) its runtime estimate.
...

Although the SaHv7 splitter code was changed to 20x, the APv7 splitter code is still at 10x, so the bound is 1.821e16. The APv7 APR on the host must have been somewhat higher than its present 414.05e9, which would put the "too long" at about 12.2 hours.
Joe
ID: 1684076
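Working that through with Joe's figures:

    time limit = rsc_fpops_bound / APR
               = 1.821e16 / 414.05e9
               ≈ 43,980 seconds ≈ 12.2 hours

The errored task was killed at 41,678 seconds (about 11.6 hours), which corresponds to an APR of roughly 437e9, consistent with the APR on record having been somewhat higher when the limit was computed than the present 414.05e9.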