Shorties estimate up from three minutes to six hours after today's outage!

W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19048
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1153567 - Posted: 18 Sep 2011, 13:15:38 UTC

Further to the discussion on detection of AP 30/30 tasks, I have just had such a task. It had 16.5% blanking, which on my system would normally take ~13 hrs to complete, and it bailed out under the 30/30 rule at 12h 23m.
In a case like this it would have made only a small change to the correct APR.
It is the tasks that bail out in under a minute that cause the problem.
ID: 1153567 · Report as offensive
Profile Dave Barstow

Joined: 14 May 99
Posts: 76
Credit: 15,064,044
RAC: 0
Philippines
Message 1153569 - Posted: 18 Sep 2011, 13:18:08 UTC

While following this thread, a thought has been brewing in 'the back of my mind' regarding the reporting of download requests...

It would be very nice if the 'NO Work Available for...' message were not, in effect, a LIE!

Something like 'Scheduler unable to send new work at this time' would be truthful, and more polite, especially when there are tens of thousands of workunits shown as available.

With that said... I'll try to wait patiently for BOINC to regain its sanity...
ID: 1153569 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1153615 - Posted: 18 Sep 2011, 15:50:18 UTC - in response to Message 1153567.  

Further to the discussion on detection of AP 30/30 tasks, I have just had such a task. It had 16.5% blanking, which on my system would normally take ~13 hrs to complete, and it bailed out under the 30/30 rule at 12h 23m.
In a case like this it would have made only a small change to the correct APR.
It is the tasks that bail out in under a minute that cause the problem.

I'm nervous about using an absolute runtime, rather than one relative to the speed of the host, or the size of the task, to determine 'outlier'. That's exactly the mistake David made in introducing the faulty -177 'cure'.

My GTX 470 is already running tasks in around 90 seconds - that's good, full-length, validating, VHAR tasks. It won't take many more iterations of Moore's Law, or improvements in the optimisation of the application, before your 1-minute test starts catching live ones, whereas a relative test should be good for a while yet.
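Purely to illustrate the difference I mean (a rough sketch with invented thresholds and numbers - not the actual scheduler code):

```python
# Sketch only - invented thresholds and figures, not the real BOINC/SETI server code.

def is_outlier_absolute(elapsed_s: float) -> bool:
    """Fixed wall-clock cutoff: starts catching genuine tasks as hosts get faster."""
    return elapsed_s < 60.0

def is_outlier_relative(elapsed_s: float, est_flops: float, host_flops: float,
                        fraction: float = 0.1) -> bool:
    """Cutoff relative to what this host should need for a task of this size."""
    expected_s = est_flops / host_flops
    return elapsed_s < fraction * expected_s

# A valid VHAR shortie on a fast GPU: 90 s elapsed, ~200 s expected for that host.
print(is_outlier_absolute(90.0))                  # False today, True once cards get ~50% faster
print(is_outlier_relative(90.0, 2.0e13, 1.0e11))  # False: 90 s is well above 10% of the 200 s expected
```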
ID: 1153615 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19048
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1153623 - Posted: 18 Sep 2011, 16:02:34 UTC - in response to Message 1153615.  

If that is the case then it will have to be a percentage, presumably on flops count against est_flops (or whatever it's called).

Talking of which, do you reckon there would be much gained from improving the estimated flops? It would probably improve the APR calculation.

Unfortunately that thought might have to be on hold for a long time, as it is a SETI responsibility. But maybe not: David is head of SETI, so he can do it himself and check that the APR figures are accurate. LOL.
ID: 1153623 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1153625 - Posted: 18 Sep 2011, 16:13:36 UTC
Last modified: 18 Sep 2011, 16:14:28 UTC

I think it might have been helpful if a partial correction had been made here, not just on Beta....
I know that once the code was botched, backing off wholesale might have precipitated a real mess (like this isn't already).
I have rigs in all manner of Boinc hell....not asking for work, asking and getting none, getting 1 at a time, getting only GPU whilst the CPU idles, you name it.

But, most have not gone totally dry yet.

And I will admit, at least it has not been a total disaster for the Seti project itself...much work IS going out and being returned at one of the highest rates ever for an extended period of time.

But it's gonna be a tough hole to dig out of, given the bandwidth limitation.
All that pent up AP work has gotta go out sooner or later....and once it starts, things are gonna bottleneck real fast.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1153625 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1153627 - Posted: 18 Sep 2011, 16:16:54 UTC

Perhaps the new plan is to alternate projects. MB one week AP the next.
Boinc....Boinc....Boinc....Boinc....
ID: 1153627 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1153630 - Posted: 18 Sep 2011, 16:22:36 UTC - in response to Message 1153627.  

Perhaps the new plan is to alternate projects. MB one week AP the next.

LOL...
Unfortunately, I don't think anything is going according to plan right now.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1153630 · Report as offensive
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1153635 - Posted: 18 Sep 2011, 16:56:18 UTC - in response to Message 1153630.  

Plan? We have a plan? We don't need no stinkin' plan! Full speed ahead, we'll find the pesky aliens yet!! :-)


PROUD MEMBER OF Team Starfire World BOINC
ID: 1153635 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1153663 - Posted: 18 Sep 2011, 18:55:54 UTC - in response to Message 1153615.  

My GTX 470 is already running tasks in around 90 seconds - that's good, full-length, validating, VHAR tasks. It won't take many more iterations of Moore's Law, or improvements in the optimisation of the application, before your 1-minute test starts catching live ones, whereas a relative test should be good for a while yet.

Even my GTX 560Ti when running 1 WU at a time does shorties in just under 2min.
I can see the next generation of video cards & drivers knocking them out in well under 1min.
Grant
Darwin NT
ID: 1153663 · Report as offensive
Wembley
Volunteer tester
Joined: 16 Sep 09
Posts: 429
Credit: 1,844,293
RAC: 0
United States
Message 1153743 - Posted: 19 Sep 2011, 1:03:28 UTC - in response to Message 1153663.  

My GTX 470 is already running tasks in around 90 seconds - that's good, full-length, validating, VHAR tasks. It won't take many more iterations of Moore's Law, or improvements in the optimisation of the application, before your 1-minute test starts catching live ones, whereas a relative test should be good for a while yet.

Even my GTX 560Ti when running 1 WU at a time does shorties in just under 2min.
I can see the next generation of video cards & drivers knocking them out in well under 1min.


Seti will soon have to increase the size of a WU, or increase the depth of the analysis, just to keep the processing time longer than the download time.
ID: 1153743 · Report as offensive
EdwardPF
Volunteer tester

Joined: 26 Jul 99
Posts: 389
Credit: 236,772,605
RAC: 374
United States
Message 1153758 - Posted: 19 Sep 2011, 2:28:18 UTC - in response to Message 1153743.  

it's been suggested before (and been done before) ... Do more science per WU and a lot of bottlenecks go away ... (but the suggestion was ignored)

Ed F
ID: 1153758 · Report as offensive
Blake Bonkofsky
Volunteer tester
Joined: 29 Dec 99
Posts: 617
Credit: 46,383,149
RAC: 0
United States
Message 1153771 - Posted: 19 Sep 2011, 3:56:42 UTC - in response to Message 1153758.  

V7 is slated to be released soon. I haven't any data on the speeds of the Lunatics apps on V7 WUs, but the stock app runs about 5 hours per CPU WU on my i7-950 @ 4 GHz.

Perhaps that will relieve some strain soon.
ID: 1153771 · Report as offensive
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1153821 - Posted: 19 Sep 2011, 9:51:46 UTC - in response to Message 1153771.  

V7 is slated to be released soon. I haven't any data on the speeds of the Lunatics apps on V7 WUs, but the stock app runs about 5 hours per CPU WU on my i7-950 @ 4 GHz.

Perhaps that will relieve some strain soon.


Off topic, but probably related, as we see the next fiasco on the horizon.

'Soon' is a relative term. At least another month, I'd say; probably a good bit more.
CUDA apps need to be tested at beta first, and even v6.97 is still under scrutiny.

We are in extended alpha/beta when it comes to optimised apps. We aren't looking at speeds just yet.

The extra analysis doesn't take that much time, so from that point of view, not much less strain.

The rollout? Chaos and mayhem.

On topic:

Getting back to 'normal' requires three steps.
Step one: BOINC server code for outlier removal. That has been done, but it isn't much use without
step two: SETI code to mark tasks as outliers.
Step three: remove the capping code again.

I have a horrible feeling that step three isn't going to be staggered as advised...

At that point, hosts that have adapted with incredibly low DCFs will suddenly be seeing APR-estimated tasks again - estimated under the assumption that host DCF is near 1. So they come in underestimated 10, 20, 50x... Work-fetch frenzy ensues, bloating the actual cache to what? Exactly - 10, 20, 50x... Actually, taking into consideration the usual problems with work fetch, it's probably not going to get that high. Then again, maybe yes. Take a host set with a 10-day cache. Maybe it has tasks for 5. It will take 5 days for the first heavily underestimated tasks to work their way to the head of the queue - 5 days of nigh-unlimited work fetch... THEN the first tasks get processed, requiring 10, 20, 50x the estimate. But no: after an elapsed time of more than 10x the estimate, the task errors out with a -177.

Does anybody know if DCF would go up?
IIRC, tasks that end in errors don't impact DCF.
On a host processing on both CPU and GPU, the CPU is probably going to pull DCF back to 1, at which point the GPU starts breathing again.
On a GPU-only host there is nothing whatsoever that can reset DCF apart from the user. An unsupervised GPU-only host will be dead in the water.
An unsupervised mixed host will eventually recover - after trashing countless tasks.
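To put rough numbers on the above (all figures invented and the formulas simplified - this is not the actual client code):

```python
# Rough arithmetic behind the scenario above. Numbers are invented and the
# formulas simplified; this is not the real BOINC client code.

projected_flops = 5.0e11               # host speed the scheduler believes (inflated by early-exit tasks)
rsc_fpops_est   = 3.0e13               # server-side size estimate for the task
rsc_fpops_bound = 10 * rsc_fpops_est   # safety limit, typically ~10x the estimate
dcf = 0.05                             # the tiny DCF this host adapted to during the capped period

raw_estimate    = rsc_fpops_est / projected_flops       # 60 s
client_estimate = raw_estimate * dcf                    # 3 s - what work fetch budgets per task
abort_limit     = rsc_fpops_bound / projected_flops     # 600 s - DCF doesn't rescale this, AFAIK
true_runtime    = 3000.0                                # what the task really needs

cache_seconds = 10 * 86400
print(f"work fetch budgets {client_estimate:.0f} s/task, so a 10-day cache asks for "
      f"~{cache_seconds / client_estimate:.0f} tasks instead of ~{cache_seconds / true_runtime:.0f}")
print(f"the task needs {true_runtime:.0f} s but hits the -177 limit at ~{abort_limit:.0f} s")
```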

I'm with Sartre on this one - 'L'enfer, c'est les autres' (hell is other people).

Based on the assumption that we aren't going to get the gradual release of the cap, I would suggest:

Depending on cache left, set NNT and/or reduce cache settings to an absolute minimum.
As soon as underestimated tasks appear (but preferably before) set NNT.
Wait for cache to empty of 'old' tasks.
Edit DCF to 1 (see the sketch just below) - alternatively, detach/reattach (you'll lose the few underestimated tasks that crept in, but they would just error out anyway - see above) - or, as a second alternative, use the anti -177 option of Fred's rescheduler to be able to process the underestimated tasks and wait for DCF to return to 1 with those.
Resume work fetch. If everything looks all right, you can increase the cache settings again.
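For the DCF edit itself, something along these lines should do it - stop the BOINC client first and keep a backup. A sketch only, assuming the usual <project>/<duration_correction_factor> layout of client_state.xml; adjust the path for your own install:

```python
# Sketch: set the SETI@home duration_correction_factor back to 1.0 in client_state.xml.
# Stop the BOINC client before running this, and keep the .bak copy it makes.
import re, shutil

STATE = "client_state.xml"               # path inside your BOINC data directory
shutil.copyfile(STATE, STATE + ".bak")   # backup first

with open(STATE) as f:
    text = f.read()

def reset_dcf(m: re.Match) -> str:
    block = m.group(0)
    if "setiathome" not in block:        # leave other projects' DCF alone
        return block
    return re.sub(r"<duration_correction_factor>[^<]*</duration_correction_factor>",
                  "<duration_correction_factor>1.000000</duration_correction_factor>",
                  block)

text = re.sub(r"<project>.*?</project>", reset_dcf, text, flags=re.DOTALL)

with open(STATE, "w") as f:
    f.write(text)
```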

Apologies, I seem to have run out of black paint.
ID: 1153821 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1153836 - Posted: 19 Sep 2011, 11:15:56 UTC - in response to Message 1153821.  

Thanks for the input, LadyL.

Even though the kitties are not fond of black.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1153836 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19048
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1153838 - Posted: 19 Sep 2011, 11:20:02 UTC - in response to Message 1153821.  

Isn't the detach/reattach option a no-goer - effectively what caused all this in the first place?

Because the quick-to-APR MB tasks and the slow-to-APR AP tasks will still apply.
ID: 1153838 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1153839 - Posted: 19 Sep 2011, 11:32:58 UTC

For hosts which have both CPU and GPU applications installed (or which could easily have them installed by re-running the installer), one option would be to set "Use NVIDIA GPU" to 'no', and "Use CPU" to 'yes', in the website preferences until things are under control.

That allows GPU processing of any cached tasks to continue, but will inhibit the fetching of any new ones.

By allowing CPU tasks to be downloaded and crunched, DCF would be kept 'reasonable'.

You might need to keep toggling "Use NVIDIA GPU" for a little while, because of the known tendency of the server to preferentially tag mixed requests for GPU work. You would want to avoid getting too many GPU tasks at once, without CPU tasks to leaven the mix, until you see that new tasks for both resources are being issued with sane estimates.
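Roughly why the completed CPU tasks steer DCF back (a simplified picture - the update rule and constants here are invented, just to show the 'rises fast, falls slowly' behaviour, and this is not the real client code):

```python
# Simplified illustration of DCF behaviour - not the real client's update rule;
# the smoothing constants are invented.

def update_dcf(dcf: float, actual_s: float, estimate_s: float) -> float:
    ratio = actual_s / estimate_s      # ~1.0 for a sanely estimated task
    if ratio > dcf:
        return ratio                   # DCF rises quickly when tasks run longer than estimated
    return dcf + 0.1 * (ratio - dcf)   # and drifts down slowly otherwise

dcf = 0.02                             # host stuck with a tiny DCF from the capped-estimate period
dcf = update_dcf(dcf, actual_s=3600.0, estimate_s=3600.0)   # one sanely estimated CPU task completes
print(dcf)                             # back near 1.0, so GPU estimates become sane again too
```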
ID: 1153839 · Report as offensive
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1153842 - Posted: 19 Sep 2011, 11:44:24 UTC - in response to Message 1153838.  

Isn't the detach/reattach option a no-goer - effectively what caused all this in the first place?

Because the quick-to-APR MB tasks and the slow-to-APR AP tasks will still apply.



Well, a detach/reattach is always the last resort and the worst option.

However, to reset DCF on an MB-only, GPU-only host, where DCF is not going to be pulled back to 1 by CPU tasks, you have no choice but either to manually edit DCF to 1 in client_state.xml or to get it reset with a reattach - I'm not positive that simply resetting the project resets DCF as well.

[As Richard says, you could of course enable CPU processing to steer DCF.]

As I have mentioned before, this is a worst-case scenario - I've got a small hope remaining that the implications of just removing the cap again will reach the attention of the decision-making minds before they unleash hell by doing so.
ID: 1153842 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1153843 - Posted: 19 Sep 2011, 11:52:09 UTC - in response to Message 1153842.  

Isn't the detach/reattach option a no-goer - effectively what caused all this in the first place?

Because the quick-to-APR MB tasks and the slow-to-APR AP tasks will still apply.



Well, a detach/reattach is always the last resort and the worst option.

However, to reset DCF on an MB-only, GPU-only host, where DCF is not going to be pulled back to 1 by CPU tasks, you have no choice but either to manually edit DCF to 1 in client_state.xml or to get it reset with a reattach - I'm not positive that simply resetting the project resets DCF as well.

[As Richard says, you could of course enable CPU processing to steer DCF.]

As I have mentioned before, this is a worst-case scenario - I've got a small hope remaining that the implications of just removing the cap again will reach the attention of the decision-making minds before they unleash hell by doing so.

And I wonder what this is going to do to all of the 'set and forget' hosts...
Folks that attached to the project on a whim and have long since forgotten it is running in the background. There are probably many thousands of them.

Might the project have to issue some kind of DCF reset command to coax them back into action? Is that even possible?

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1153843 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1153845 - Posted: 19 Sep 2011, 12:02:18 UTC - in response to Message 1153842.  

Detaching/reattaching to the project physically deletes the setiathome project folder - so all application files and configuration settings are lost. But you usually get the same HostID, host APR, etc. back again. That's the worst possible outcome, even if you remembered to back up your applications so you can put them back again.

Resetting the project does set DCF to the default 1.0000 (and, I think, in general preserves optimised applications) - but I think the behaviour varies between different versions of BOINC - best to let someone else test it first.

The other thing that resetting the project certainly does is to throw away, without reporting, all current tasks - including those completed but not yet uploaded.

The best sequence, if you're adventurous enough to try, is:

NNT - crunch all cached tasks - upload - report - backup folder contents - reset - check folder contents are still intact - allow new work.
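For the 'backup folder contents' step, something as simple as this will do (a sketch assuming a default Linux data directory - adjust the paths for your own setup):

```python
# Sketch: copy the SETI@home project folder somewhere safe before resetting.
# The source path assumes a default Linux BOINC install; adjust as needed.
import shutil

src = "/var/lib/boinc-client/projects/setiathome.berkeley.edu"
dst = "/var/lib/boinc-client/setiathome.berkeley.edu.backup"
shutil.copytree(src, dst)
print("backed up", src, "->", dst)
```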
ID: 1153845 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1153846 - Posted: 19 Sep 2011, 12:05:35 UTC - in response to Message 1153845.  
Last modified: 19 Sep 2011, 12:05:59 UTC

Detaching/reattaching to the project physically deletes the setiathome project folder - so all application files and configuration settings are lost. But you usually get the same HostID, host APR, etc. back again. That's the worst possible outcome, even if you remembered to back up your applications so you can put them back again.

Resetting the project does set DCF to the default 1.0000 (and, I think, in general preserves optimised applications) - but I think the behaviour varies between different versions of BOINC - best to let someone else test it first.

The other thing that resetting the project certainly does is to throw away, without reporting, all current tasks - including those completed but not yet uploaded.

The best sequence, if you're adventurous enough to try, is:

NNT - crunch all cached tasks - upload - report - backup folder contents - reset - check folder contents are still intact - allow new work.

It's been such a long time.......
I don't recall if a reset will chuck out the baby with the bathwater... i.e. the optimized apps and app_info file. A detach certainly does.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1153846 · Report as offensive