Short estimated runtimes - don't panic

Message boards : Number crunching : Short estimated runtimes - don't panic

Wiggo
Joined: 24 Jan 00 · Posts: 36829 · Credit: 261,360,520 · RAC: 489 · Australia
Message 1217263 - Posted: 12 Apr 2012, 13:32:48 UTC - in response to Message 1217260.  

Thanks all for the advice. After a lot of messing around this morning I managed to get the cruncher back to driver 270.61 and now all seems OK again. Just another example of Bill Gates trying to stop us finding ET. :)) Now, if it will just behave for a few days (I've been fighting weekend power cuts and overnight GPU crashes for the last few weeks), I should finally make it to 7 million credits, which BoincStats was expecting me to get to yesterday!

I have another question. My motherboard has a x16 and a x4 slot for the GPUs. Whilst messing around this morning I noticed that my faster GPU (GTX460) is currently in the x4 slot and the slower GPU (GT430) is in the x16 slot. Would I see a (significant) performance enhancement if I swapped the GPUs over?

The driver problem is actually nvidia's fault, not Windows.

As to the slot speeds, probably nothing really significant but it would be advisable to have the faster card on the faster bus.

Cheers.
ID: 1217263
cliff
Joined: 16 Dec 07 · Posts: 625 · Credit: 3,590,440 · RAC: 0 · United Kingdom
Message 1217290 - Posted: 12 Apr 2012, 15:17:26 UTC

Well, I dunno about underestimating runtimes; I've had 2 days of the damn thing overestimating runtimes and going bleeding ballistic on me. First it's one project, then it's the other one, then it's both projects using the GPU, and one using the CPU... Arrghhhh.

But what's particularly irritating to me is that it dumps WU that are seconds away from completion to load WU at zero %, even when both WU belong to the same project and are of the same type. It's bloody daft.

There are now a multitude of WU waiting to run, in varying stages of completion for both projects.

In order to stop this behavior one has to micro-manage WU: suspend whatever had caused the move to HP, allow partially completed WU to complete, then unsuspend the WU again. Whereupon Boinc goes blithely on, dumping more WU into the waiting queue while it loads yet more WU at zero % and cruises on its merry way.

Perhaps Boinc should do a small check of running tasks before it goes hyper, see if they belong to the project/WU type (or whatever) that has caused it to go HP, and leave them running [at HP if required] until complete before loading new WU.
When you have AP6 GPU tasks with estimated times from 240 hrs down to 64 hrs that inevitably complete on a GPU in under 3 hours, those estimates are nuts.
That Boinc still has crazy estimated times even after it has completed 6 or more such WU is beyond me. Those estimates need to be corrected on the fly and in a timely manner.

Regards,


Cliff,
Been there, Done that, Still no damm T shirt!
ID: 1217290
Khangollo
Joined: 1 Aug 00 · Posts: 245 · Credit: 36,410,524 · RAC: 0 · Slovenia
Message 1217295 - Posted: 12 Apr 2012, 15:32:57 UTC

This is what happens when you don't use <flops>...
You get screwed by unannounced messing with server software.

I'm keeping my flops forever.
ID: 1217295
W-K 666 · Project Donor · Volunteer tester
Joined: 18 May 99 · Posts: 19403 · Credit: 40,757,560 · RAC: 67 · United Kingdom
Message 1217297 - Posted: 12 Apr 2012, 15:41:04 UTC - in response to Message 1217295.  

This is what happens when you don't use <flops>...
You get screwed by unannounced messing with server software.

I'm keeping my flops forever.

As I was the person who reported the initial problem, I felt I could not use flops, so I could keep coming up with reminders that it hadn't been fixed yet.

Now that it has been fixed, all the new estimates look as though they are right on the button and my DCF is ~1.0. And now I don't have tasks being suspended a few seconds from completion, or additional others suspended just because another task has completed etc.

So not going to use flops because they are not needed now.
ID: 1217297
Richard Haselgrove · Project Donor · Volunteer tester
Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874 · United Kingdom
Message 1217307 - Posted: 12 Apr 2012, 16:08:05 UTC - in response to Message 1217297.  

This is what happens when you don't use <flops>...
You get screwed by unannounced messing with server software.

I'm keeping my flops forever.

As I was the person who reported the initial problem, I felt I could not use flops, so I could keep coming up with reminders that it hadn't been fixed yet.

Now that it has been fixed, all the new estimates look as though they are right on the button and my DCF is ~1.0. And now I don't have tasks being suspended a few seconds from completion, or additional others suspended just because another task has completed etc.

So not going to use flops because they are not needed now.

It probably depends on your hardware - both in absolute and relative terms.

My Q6600/9800GT rigs - which I regard as being quite 'well balanced' - are happy with the new settings.

But this lunchtime I watched my E5320/9800GTX+ transition from 'old' to 'new' tasks. While the 'old' rate (capped APR) tasks were controlling proceedings, I accumulated 400 queued GPU tasks, estimated at exactly one day (well, 1 day and 4 minutes) with DCF=0.1468.

The first 'new' task jumped that to DCF=0.7490 and a cache of 5 days 3 hours. So, with the slower CPUs and the faster GPU, that host hasn't yet completed rebalancing. And as for the Q9300/GTX470 - that still has DCF=0.3543, so it'll have to wait for the next raising of the cap. There will be other users out there, with an even more extreme disparity between CPU and GPU speed, who will still need another five-fold increase, and then some.
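
For anyone trying to follow those DCF numbers: as I understand it, in this generation of the client the runtime estimate is basically the task's rsc_fpops_est divided by the speed credited to the app version, multiplied by the host's single DCF. A rough sketch with invented figures (not taken from any real host):

    # Rough sketch, not the real client code: estimate = size / speed * DCF.
    # All numbers are invented purely for illustration.
    def estimated_runtime(rsc_fpops_est, flops_est, dcf):
        """Estimated runtime in seconds for one task."""
        return rsc_fpops_est / flops_est * dcf

    TASK_FPOPS = 2.5e14                 # hypothetical rsc_fpops_est for one GPU task

    # 'Old' tasks: the speed estimate is capped far below the GPU's real speed,
    # so the host's DCF has drifted low to compensate.
    print(estimated_runtime(TASK_FPOPS, 50e9, 0.15) / 60, "min")    # ~12.5 min

    # The first 'new' task carries a realistic speed estimate; when it completes,
    # the single shared DCF jumps back towards 1.0 and every estimate still in
    # the cache is rescaled with it - hence the sudden jump in cache size.
    print(estimated_runtime(TASK_FPOPS, 500e9, 0.75) / 60, "min")   # ~6.3 min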
ID: 1217307
Grant (SSSF) · Volunteer tester
Joined: 19 Aug 99 · Posts: 13855 · Credit: 208,696,464 · RAC: 304 · Australia
Message 1217361 - Posted: 12 Apr 2012, 18:54:15 UTC - in response to Message 1217307.  


My E8400/GTX 560Ti system has pretty much settled down.
Finally it is able to hit the GPU server-side limits. The DCF still moves around a bit, but at least it is only a bit. Before, it used to go from 0.2 to 1.5+ with the completion of a CPU VLAR unit.
Grant
Darwin NT
ID: 1217361
Horacio
Joined: 14 Jan 00 · Posts: 536 · Credit: 75,967,266 · RAC: 0 · Argentina
Message 1217364 - Posted: 12 Apr 2012, 19:12:40 UTC - in response to Message 1217260.  

I have another question. My motherboard has a x16 and a x4 slot for the GPUs. Whilst messing around this morning I noticed that my faster GPU (GTX460) is currently in the x4 slot and the slower GPU (GT430) is in the x16 slot. Would I see a (significant) performance enhancement if I swapped the GPUs over?


On Seti MB with the optimized apps the number of lanes (x16, x4) is not crucial, because the amount of data transferred back and forth between CPU and GPU is not huge. On other apps and/or projects it could be very different.
I know that a 430 on x16 works about 66% faster than on an x1 for Einstein BRPs (which are hybrid apps that need a lot of CPU work), while it gives me no noticeable difference on Seti Multibeam.

So, should you switch them?
On one hand, having the faster GPU on the slow PCIe slot makes the difference in crunching speeds a bit smaller, leading to a more stable DCF and APR... On the other, the faster GPU will be slower than expected, so you might get better performance by switching them...
As always, your mileage may vary.

ID: 1217364
shizaru · Volunteer tester
Joined: 14 Jun 04 · Posts: 1130 · Credit: 1,967,904 · RAC: 0 · Greece
Message 1219577 - Posted: 17 Apr 2012, 11:45:31 UTC

Run-times are perfect (on my little laptop).

Thanx everybody!
ID: 1219577
Cosmic_Ocean
Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13 · United States
Message 1219620 - Posted: 17 Apr 2012, 14:53:50 UTC

So I finally ran some of the new-estimate tasks ahead of the ~8 days of old-estimate tasks that I had. The new ones went up to the correct time and the old ones just about doubled their estimate. Based on ETA alone I've got about 22 days of cache, but it is actually more like 14 or so. BOINC has not gone into HP mode for anything though, which I was half-expecting, but it is working on the oldest tasks first anyway (FIFO), and those are the ones that would end up going into HP.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1219620
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor · Volunteer tester
Joined: 9 Jul 00 · Posts: 51478 · Credit: 1,018,363,574 · RAC: 1,004 · United States
Message 1220677 - Posted: 20 Apr 2012, 17:13:07 UTC

The kitties are doing just fine...
Waiting for the next round of adjustments.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1220677
Richard Haselgrove · Project Donor · Volunteer tester
Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874 · United Kingdom
Message 1220705 - Posted: 20 Apr 2012, 18:17:50 UTC - in response to Message 1220680.  

So Richard!

Is it time to remove flops from my app_infos now?

Answering more generally, for the benefit of other readers who may shoulder-surf this thread:

I'm seeing that runtime estimation is pretty good now on my

20/04/2012 10:30:22 | | NVIDIA GPU 0: GeForce 9800 GT (driver version 29036, CUDA version 4010, compute capability 1.1, 512MB, 336 GFLOPS peak)

That card has 'Average processing rate 138.37077926918' for the Lunatics MB app.

But on faster cards, there's still some way to go before things settle down: we'll be writing to the staff on Monday, suggesting that the time is right for the second stage of normalisation (as suggested by Mark).

So, I'd say that if your GFLOPS/APR figures are similar to, or below, mine, you should be fine to remove them now. But if you have faster cards than this one, maybe wait for another week or two after the next round of corrections.
ID: 1220705
red-ray
Joined: 24 Jun 99 · Posts: 308 · Credit: 9,029,848 · RAC: 0 · United Kingdom
Message 1220721 - Posted: 20 Apr 2012, 18:47:57 UTC - in response to Message 1220705.  
Last modified: 20 Apr 2012, 19:31:34 UTC

Hi Richard, What is expected to happen on systems with mixed speed GPUs please? I am expecting the GTX 460 times to be a few % too high and the 430/520 ones way too low.

I get Average processing rate 308.83578908794 for the following, and a DCF that has a lot in common with a yo-yo.

20/04/2012 09:10:00 |  | NVIDIA GPU 0: GeForce GTX 460 (driver version 285.62, CUDA version 4.10, compute capability 2.1, 1024MB, 766MB available, 1025 GFLOPS peak)
20/04/2012 09:10:00 |  | NVIDIA GPU 1: GeForce GT  430 (driver version 285.62, CUDA version 4.10, compute capability 2.1,  512MB, 361MB available,  269 GFLOPS peak)
20/04/2012 09:10:00 |  | NVIDIA GPU 2: GeForce GTX 460 (driver version 285.62, CUDA version 4.10, compute capability 2.1, 1024MB, 814MB available, 1025 GFLOPS peak)
20/04/2012 09:10:00 |  | NVIDIA GPU 3: GeForce GT  520 (driver version 285.62, CUDA version 4.10, compute capability 2.1,  512MB, 366MB available,  156 GFLOPS peak)
ID: 1220721
Richard Haselgrove · Project Donor · Volunteer tester
Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874 · United Kingdom
Message 1220736 - Posted: 20 Apr 2012, 19:06:13 UTC - in response to Message 1220721.  

As always, you're limited to just one value per application type. All MB CUDA apps will share a value, all AP OpenCL apps will share a value, and so on.

The best value to use - in general - is the one shown as the APR for the application type. This thread arose because - for a while, now hopefully coming to an end - the displayed APR wasn't the same as the effective APR. People used <FLOPS> to get the effective APR back up to the proper value.

Since your APR seems to be below the new threshold, it'll make no difference at all whether you supply a value yourself in <FLOPS> or rely on the one maintained by the project in APR. APR is easier, in my book.
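
For anyone who does choose to supply the value themselves: <flops> goes inside the <app_version> block of app_info.xml and is given in floating-point operations per second. A cut-down sketch for illustration only - the app name, version, plan class, file name and value below are placeholders, not a recommendation for any particular card:

    <app_info>
        <app>
            <name>setiathome_enhanced</name>
        </app>
        <file_info>
            <name>your_cuda_app.exe</name>
            <executable/>
        </file_info>
        <app_version>
            <app_name>setiathome_enhanced</app_name>
            <version_num>610</version_num>
            <plan_class>cuda_fermi</plan_class>
            <!-- placeholder: 1.40e11 flops/sec = ~140 GFLOPS, i.e. roughly the card's APR -->
            <flops>1.40e11</flops>
            <coproc>
                <type>CUDA</type>
                <count>1</count>
            </coproc>
            <file_ref>
                <file_name>your_cuda_app.exe</file_name>
                <main_program/>
            </file_ref>
        </app_version>
    </app_info>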
ID: 1220736
red-ray
Joined: 24 Jun 99 · Posts: 308 · Credit: 9,029,848 · RAC: 0 · United Kingdom
Message 1220741 - Posted: 20 Apr 2012, 19:22:37 UTC - in response to Message 1220736.  
Last modified: 20 Apr 2012, 19:36:06 UTC

Thank you Richard. I suspect that when/if dont_use_dcf is sent by the server I will get plausible estimates for all but the GT 430/520, so I am hoping that change happens next Tuesday.

What is the URL that lists the new thresholds currently being used, please?

With my GTX 680 + GTX 460 I get Average processing rate 207.34629880725 and am wondering why this is lower. I suspect it's because the GTX 680 has only been in use since Monday.
ID: 1220741
Grant (SSSF) · Volunteer tester
Joined: 19 Aug 99 · Posts: 13855 · Credit: 208,696,464 · RAC: 304 · Australia
Message 1220817 - Posted: 20 Apr 2012, 23:12:31 UTC - in response to Message 1220680.  
Last modified: 20 Apr 2012, 23:14:10 UTC

So Richard!

Is it time to remove flops from my app_infos now?

I didn't put the FLOPs back in when the bug fix borked things. On my E8400/GTX 560Ti system the DCF would move from 0.2 to over 1.5. Since the latest changes the DCF moves between 1.3 and 0.9 (or thereabouts). On my i7/GTX 460 system it moves between 1.1 and 0.8.
The estimated completion time for shorties is still out by a factor of 10, but it's a big improvement on what it was, and now both systems actually hit the server-side limits for GPU work.

It would be nice if they could double the server-side limits with the next tweaking. The previous increase allowed me to keep just under 2 days of work for the GPUs; doubling it would get me close to my usual cache of 4 days, which would be nice.


Not knowing just how the completion time estimates work, or the limits for timing tasks out for finishing too early or not early enough, and considering the huge change in values the last change gave, I'd suggest changing that setting by half of what was done last time - gradually moving things closer back to their rightful settings.
Grant
Darwin NT
ID: 1220817
red-ray
Joined: 24 Jun 99 · Posts: 308 · Credit: 9,029,848 · RAC: 0 · United Kingdom
Message 1222509 - Posted: 24 Apr 2012, 0:20:12 UTC

ID: 1222509
Richard Haselgrove · Project Donor · Volunteer tester
Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874 · United Kingdom
Message 1222731 - Posted: 24 Apr 2012, 8:06:18 UTC - in response to Message 1222509.  

I suspect these shorter estimates are sometimes too short for my slow GPUs. All of the following got Exit status -177 (0xffffffffffffff4f) ERR_RSC_LIMIT_EXCEEDED

http://setiathome.berkeley.edu/result.php?resultid=2404443134
http://setiathome.berkeley.edu/result.php?resultid=2404443124
http://setiathome.berkeley.edu/result.php?resultid=2403125100
http://setiathome.berkeley.edu/result.php?resultid=2401215374
http://setiathome.berkeley.edu/result.php?resultid=2401215372
http://setiathome.berkeley.edu/result.php?resultid=2401215324

http://setiathome.berkeley.edu/results.php?hostid=6379672&offset=0&show_names=0&state=5&appid=

I suspect BOINC would need to take different GPU speeds into account to stop these.

Most of those were run on your GT 520 - I forget where that comes in the speed range. And all of them have 'difficult' ARs, which extend the runtime a long way beyond expectations.

In your rather specialised environment (2 x GTX 460, GT 430, GT 520), you may have to help BOINC out by setting a <flops> value closer to the speed of the slowest, rather than relying on the APR which will be heavily weighted by the two fast cards.
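
The arithmetic behind those -177 exits, roughly: the client aborts a task once its elapsed time exceeds the task's rsc_fpops_bound divided by the speed it thinks the card has, so a speed figure dominated by the fast cards leaves the slow card very little headroom. A sketch with invented numbers:

    # Rough sketch of the ERR_RSC_LIMIT_EXCEEDED arithmetic; all figures invented.
    def runtime_limit(rsc_fpops_bound, flops_est):
        """Seconds the client will allow before aborting the task (-177)."""
        return rsc_fpops_bound / flops_est

    BOUND = 2.5e15                      # hypothetical rsc_fpops_bound for one task

    shared_flops = 300e9                # speed estimate dominated by the GTX 460s
    slow_flops = 40e9                   # closer to what a GT 520 actually manages

    print(runtime_limit(BOUND, shared_flops) / 3600, "h allowed")   # ~2.3 h
    print(runtime_limit(BOUND, slow_flops) / 3600, "h allowed")     # ~17.4 h

    # A 'difficult' AR task that needs, say, 4 h on the GT 520 dies under the
    # first limit but finishes comfortably under the second - which is why a
    # <flops> value nearer the slowest card can prevent these aborts.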
ID: 1222731
LadyL · Volunteer tester
Joined: 14 Sep 11 · Posts: 1679 · Credit: 5,230,097 · RAC: 0
Message 1222735 - Posted: 24 Apr 2012, 8:37:50 UTC

Just a heads up - we are expecting the next step to go in this maintenance.
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1222735
Grant (SSSF) · Volunteer tester
Joined: 19 Aug 99 · Posts: 13855 · Credit: 208,696,464 · RAC: 304 · Australia
Message 1222742 - Posted: 24 Apr 2012, 9:48:53 UTC - in response to Message 1222735.  


*fingers crossed*
Grant
Darwin NT
ID: 1222742
Richard Haselgrove · Project Donor · Volunteer tester
Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874 · United Kingdom
Message 1222749 - Posted: 24 Apr 2012, 10:16:42 UTC - in response to Message 1222742.  


*fingers crossed*

It seemed to go pretty smoothly last time, and this time, fewer people will be affected - only the faster GPU cards.
ID: 1222749