Shorties estimate up from three minutes to six hours after today's outage!

Message boards : Number crunching : Shorties estimate up from three minutes to six hours after today's outage!

Previous · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 . . . 9 · Next

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1152422 - Posted: 15 Sep 2011, 11:08:37 UTC - in response to Message 1152417.  

NB if you run anon, inserting <flops> will circumvent this problem.

I thought the reason for the server side adjustments was so we didn't have to do all that stuffing around?
And it did involve a lot of stuffing around.


Actually, as Joe explained in this post, calculating flops is fairly easy. Alternatively, if you take the APR from the app details page and multiply it by 10^9 (APR is given in GFLOPS), that's also flops.

Yes, with the advent of APR to cope with the widely different speeds of applications, which a single client DCF cannot handle, we stopped recommending flops entries - mainly because people find them difficult to get right and because they shouldn't be needed. But as Joe keeps reminding us, there are corner cases where having flops entries will stop you from jumping off the tracks.
If everybody on anonymous platform had flops entries, neither the overblown APR nor the botched attempt at mitigating the effect of early-exit tasks would have much effect (I think).

We are still trying to work out if there is any way to have the installer calculate rough flops to insert into app_info.xml as part of the installation. We prefer not to have to jump through hoops though; getting David to do stuff 'right' is by far the better option.
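For illustration, the APR-to-flops conversion mentioned above can be sketched as follows (the APR value here is a made-up placeholder; read your own from the application details page):

```python
# Sketch of the APR-to-<flops> conversion. The APR figure on the
# "Application details" page is in GFLOPS, so the value for a <flops>
# entry in app_info.xml is simply APR * 10^9 (ops/sec).

def apr_to_flops(apr_gflops: float) -> int:
    """Convert an APR value (GFLOPS) to a <flops> entry (ops/sec)."""
    return round(apr_gflops * 1e9)

# Example: an APR of 95.7 GFLOPS shown on the app details page
flops = apr_to_flops(95.7)
print(f"<flops>{flops}</flops>")  # -> <flops>95700000000</flops>
```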
ID: 1152422
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19015
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1152423 - Posted: 15 Sep 2011, 11:13:25 UTC - in response to Message 1152422.  

Snipped......
We prefer not to have to jump through hoops though, getting David to do stuff 'right' is by far the better option.

And from a long-term sceptic's view, almost b***dy impossible
ID: 1152423
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1152453 - Posted: 15 Sep 2011, 13:03:50 UTC
Last modified: 15 Sep 2011, 13:18:55 UTC

Yes, I'm using the <flops> entries in my app_info.xml file again.

In the past I did it the way Joe explained.

Later it was easier (no need to calculate ;-) to use the values from the host's new application details overview.
Geek@Play later made a manual: 'How to add flops value'.


BTW, could <flops> entries disturb CreditNew?


- Best regards! - Sutaru Tsureku, team seti.international founder. - Optimize your PC for higher RAC. - SETI@home needs your help. -
ID: 1152453
Profile Floyd
Joined: 19 May 11
Posts: 524
Credit: 1,870,625
RAC: 0
United States
Message 1152474 - Posted: 15 Sep 2011, 14:54:21 UTC - in response to Message 1152364.  



I suggest people get ready to hit the NNT button.....


I have already set NNT, in hopes that someone will pay attention to the outcry and to avoid getting any more screwed up times.

Who knows, it may stay that way if nothing changes.




I hit the NNT button a couple of days ago, when all this started. On my quad-core CPU I had nothing but APs and VLARs - about a week's worth... APs about 15 hrs each. No GPU work at all, so...
Fred's rescheduler put a bunch of VLARs onto the GPU, only it is taking over 4 hrs for each one of them to run... 4 CPU tasks and 3 GPU tasks at a time... = a long time to get through all this mess...
Was hoping this would clear up by the time I finish all the cache I have now.
Still have 8 APs that haven't even started yet...

At least I have work... LOL
ID: 1152474
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1152477 - Posted: 15 Sep 2011, 15:09:03 UTC
Last modified: 15 Sep 2011, 15:15:13 UTC

Well, things have improved a bit. The shorties that were showing estimates of over 7 hours yesterday are now showing around 2 hours and 20 minutes on my GPU. They are actually taking about 7 or 8 minutes to complete. They are working their way down though. When I first looked this morning they were at 2 hours 28 minutes guesstimates. My CPU work looks to be just about right. I don't know what will happen with any AP work I might get as I finished all of it last night and haven't got any new. Same thing for any longer midrange GPU work. All I have are shorties on my GPU. According to my details page my DCF is now at .51xxx down from .57xxx when I first looked.

What should be interesting though is that I have Fred's tool set to check every 6 hours and adjust my rsc_fpops_bound limit. I'm not moving anything around, just making sure I don't get any -177 errors. I wonder what effect that is going to have?

For those that don't know, I'm running my E5400 Intel and a GTS 450 GPU with the Lunatic's x39e flavor (Two at a time) and Raistmer's AP on NVIDIA app.


Okay, I got a mid-range task for my GPU and it is guesstimating over 7 hours and my shorties are now guesstimating 2 hours and 6 minutes. Still a long way from the 7 or 8 minutes they are taking but improving.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1152477
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1152480 - Posted: 15 Sep 2011, 15:15:08 UTC
Last modified: 15 Sep 2011, 15:19:16 UTC

As it currently stands, only those using anonymous platform (i.e. optimised apps) without flops in app_info.xml are affected by the botched attempt to avoid overblown APR resulting in -177 errors. They are experiencing work-fetch issues, as tasks complete much faster than the far too large estimates. The problem is compounded by the 5-minute minimum interval between work fetches, and having a shortie storm on top isn't helping either.

At this point it would probably help to insert flops - until a higher cap goes in, or until code to deal with the problem that started all this is in and works.

A detailed analysis of the higher capping factor shows that, while it is far better than the original one, it is still too low - by up to an order of magnitude. Task duration is currently being overestimated by around 25x (actually between 4x and 50x, depending on card); after the limit increase (provided it stays at the current value of 10) that will drop to about 6x (1.5x - 10x). It will then affect all users...

That may or may not be enough to keep the big ones afloat - it's hard to tell even if we can watch for a bit on beta.
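A rough sketch of where such overestimation factors come from (all numbers below are made up for illustration, not measurements):

```python
# Toy numbers showing why a cap of 10x the CPU benchmark still
# overestimates GPU task duration for fast cards.
benchmark_gflops = 2.5        # CPU "Measured floating point speed"
actual_gpu_gflops = 150.0     # what a fast GPU really sustains
rsc_fpops_est = 30_000e9      # server's flop estimate for a task

cap_factor = 10
# the server refuses to believe rates above cap_factor * benchmark
capped_rate = min(actual_gpu_gflops, cap_factor * benchmark_gflops)

true_runtime = rsc_fpops_est / (actual_gpu_gflops * 1e9)  # 200 s
est_runtime = rsc_fpops_est / (capped_rate * 1e9)         # 1200 s
print(f"overestimate: {est_runtime / true_runtime:.0f}x")  # -> overestimate: 6x
```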
ID: 1152480
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19015
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1152483 - Posted: 15 Sep 2011, 15:35:05 UTC - in response to Message 1152480.  
Last modified: 15 Sep 2011, 15:36:46 UTC

I'm not seeing any problems here, but that might be because I am keeping one CPU core busy with an AP task. As explained here and in "Average processing rate - a little high?", my APR is still a little high, and that brings the DCF back to >1.5 at the moment; that could change if the pendings get validated soon.

I do feel that I may have been to blame for this event, but do I feel guilty?

Not one little bit.
ID: 1152483
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1152484 - Posted: 15 Sep 2011, 15:36:40 UTC - in response to Message 1152480.  
Last modified: 15 Sep 2011, 15:40:15 UTC

I have a bunch of VLARs, crunched by the ATI 5870 GPUs. I also noticed a host with very little memory:
O.S.: Linux 2.6.27.7-9-pae
BOINC version: 6.4.5
Memory: 182.09 MB
Cache: 1024 KB

This host also produces errors on VLAR WUs.

It can't be working right with so little memory? Or something else is wrong!
Should he/she be PMed, because of so many errors?
ID: 1152484
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1152485 - Posted: 15 Sep 2011, 15:39:48 UTC - in response to Message 1152483.  

I'm not seeing any problems here, but that might be because I am keeping one CPU core busy with an AP task. As explained here and in "Average processing rate - a little high?", my APR is still a little high, and that brings the DCF back to >1.5 at the moment; that could change if the pendings get validated soon.

I do feel that I may have been to blame for this event, but do I feel guilty?

Not one little bit.

You pointed out the original problem, you did not engineer the 'fix'.
Although your complaints may have precipitated it, you don't have to feel guilty that the remedy proved worse for most than the disease.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1152485
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1152487 - Posted: 15 Sep 2011, 15:42:41 UTC - in response to Message 1152480.  
Last modified: 15 Sep 2011, 15:44:58 UTC

As it currently stands, only those using anonymous platform (i.e. optimised apps) without flops in app_info.xml are affected by the botched attempt to avoid overblown APR resulting in -177 errors. They are experiencing work-fetch issues, as tasks complete much faster than the far too large estimates. The problem is compounded by the 5-minute minimum interval between work fetches, and having a shortie storm on top isn't helping either.

At this point it would probably help to insert flops - until a higher cap goes in, or until code to deal with the problem that started all this is in and works.

A detailed analysis of the higher capping factor shows that, while it is far better than the original one, it is still too low - by up to an order of magnitude. Task duration is currently being overestimated by around 25x (actually between 4x and 50x, depending on card); after the limit increase (provided it stays at the current value of 10) that will drop to about 6x (1.5x - 10x). It will then affect all users...

That may or may not be enough to keep the big ones afloat - it's hard to tell even if we can watch for a bit on beta.

The kitties would like to thank you for going to bat for the multitude of crunchers who this bit of codeplay has totally messed up.

Hope you shall continue to monitor things here whilst the mucked up work makes its way through the system.

Also hoping that you can find a 'fix' for the original problem that does not penalize those who never had it to begin with.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1152487
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1152489 - Posted: 15 Sep 2011, 15:51:25 UTC - in response to Message 1152487.  


The kitties would like to thank you for going to bat for the multitude of crunchers who this bit of codeplay has totally messed up.

Hope you shall continue to monitor things here whilst the mucked up work makes its way through the system.

Also hoping that you can find a 'fix' for the original problem that does not penalize those who never had it to begin with.

Meow.


I second that motion !!!!


Dave

ID: 1152489
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19015
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1152497 - Posted: 15 Sep 2011, 16:01:39 UTC - in response to Message 1152485.  
Last modified: 15 Sep 2011, 16:02:03 UTC

I'm not seeing any problems here, but that might be because I am keeping one CPU core busy with an AP task. As explained here and in "Average processing rate - a little high?", my APR is still a little high, and that brings the DCF back to >1.5 at the moment; that could change if the pendings get validated soon.

I do feel that I may have been to blame for this event, but do I feel guilty?

Not one little bit.

You pointed out the original problem, you did not engineer the 'fix'.
Although your complaints may have precipitated it, you don't have to feel guilty that the remedy proved worse for most than the disease.

I have no guilt problems at all about this episode; I am not the one responsible for failing to tackle the problem effectively.
And if you read all my posts on the subject, I have given a hint on solving the problem.

And that is to remove the early-finishing tasks (-9's and "too much blanking") from the APR calculation. See my latest in "Average processing rate - a little high?" and you can see quite clearly that a host running the default apps has the problem also.
ID: 1152497
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1152498 - Posted: 15 Sep 2011, 16:04:06 UTC

At this point, let's just accept that a mistake has been made, and, as an unexpected and unplanned side effect, some people will be downloading less work than they would have expected if everything was working fine - splitters working flat out, an unlimited supply of new tapes, a gigabit download pipe.

We all know that nirvana doesn't exist. It's just a normal work supply glitch, OK? Cricket is still near maxx, work is being crunched, the project is searching for ET at the normal speed. No-one needs to be blamed, no-one needs to get upset.

But: what we do need to do is to negotiate our way safely back to the path that we've stumbled off. If we ask for too rapid a return to how things were before, the recovery is going to be far, far worse than the problem.

At the moment you're asking for too little work? That's not the end of the world. On the way back, you'll be asking for too much. Think about it.
ID: 1152498
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1152504 - Posted: 15 Sep 2011, 16:17:21 UTC - in response to Message 1152498.  

At this point, let's just accept that a mistake has been made, and, as an unexpected and unplanned side effect, some people will be downloading less work than they would have expected if everything was working fine - splitters working flat out, an unlimited supply of new tapes, a gigabit download pipe.

We all know that nirvana doesn't exist. It's just a normal work supply glitch, OK? Cricket is still near maxx, work is being crunched, the project is searching for ET at the normal speed. No-one needs to be blamed, no-one needs to get upset.

But: what we do need to do is to negotiate our way safely back to the path that we've stumbled off. If we ask for too rapid a return to how things were before, the recovery is going to be far, far worse than the problem.

At the moment you're asking for too little work? That's not the end of the world. On the way back, you'll be asking for too much. Think about it.

I agree, and now that we've taken a brodie off the path and gone way out into the rough, a return to 'normalcy' should be done a bit at a time, lest things really go off into the tall timber.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1152504
Kevin Olley

Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 1152510 - Posted: 15 Sep 2011, 16:40:13 UTC - in response to Message 1152498.  


But: what we do need to do is to negotiate our way safely back to the path that we've stumbled off. If we ask for too rapid a return to how things were before, the recovery is going to be far, far worse than the problem.

At the moment you're asking for too little work? That's not the end of the world. On the way back, you'll be asking for too much. Think about it.



I got swamped with VLARs a couple of months ago; it was not pretty. I ended up having to process a fair quantity of them on GPUs to prevent them being wasted - no fun at all.

By the looks of this machine it may take a while for this to settle down, but the last thing we want is another attempted fix that makes it even worse.


Kevin


ID: 1152510
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1152512 - Posted: 15 Sep 2011, 17:10:04 UTC

Some notes on changesets [trac]changeset:24128[/trac] and [trac]changeset:24217[/trac]:

24128 was deployed at SETI Beta on Wednesday Sept. 7, or maybe the day before, and caused very little disturbance because S@H v7 is being tested there, so most participants are running the stock CPU application. The few running anonymous platform to test Raistmer's OpenCL ATI applications are advanced users willing and able to edit app_info.xml as needed.

24128 was deployed here about a week later, and that could have happened not as a deliberate test but simply because the server software needed to be rebuilt for other reasons.

24217 is supposed to be deployed at Beta today, and here tomorrow (Friday). As Richard has pointed out, the increase to a factor of 10 will not be enough for many crunching with GPUs. But since rsc_fpops_bound is set to 10 times rsc_fpops_est, that factor of 10 is about as far as it could be stretched and still make any kind of sense in a server routine which applies to all projects.

For those without <flops> in their app_info.xml files and trying to decide whether adding them will be needed, there are checks you can do. I suggest first going to the computer's info page at http://setiathome.berkeley.edu/show_host_detail.php?hostid=xxxxxxx and writing down the value for "Measured floating point speed" except shift the decimal point 3 places to the left to put it into GFLOPS. For instance, 2455.93 million ops/sec becomes 2.45593 GFLOPS. Next, click the "Show" button for Application details and look at the Average processing rate entries for those applications listed as "SETI@home Enhanced (anonymous platform, xxxxx)" or "Astropulse v505 (anonymous platform, xxxxx)" where xxxxx is CPU, NVIDIA GPU, or ATI GPU. If any of those are more than 10 times the GFLOPS from the benchmark, runtime estimates for that application will be affected. If the ratio is not too much more than 10, the core client's DCF may adequately cope.
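That check can be sketched as follows (the numbers below are placeholders; substitute the values from your own host pages):

```python
# Sketch of the check described above: compare a host's APR against
# its CPU benchmark in GFLOPS. Placeholder values, not a real host.
measured_mops = 2455.93                    # "Measured floating point speed"
benchmark_gflops = measured_mops / 1000.0  # shift decimal 3 places -> 2.45593

apr_gflops = 95.7   # APR for an "anonymous platform, NVIDIA GPU" app

ratio = apr_gflops / benchmark_gflops
if ratio > 10:
    print(f"ratio {ratio:.1f} > 10: runtime estimates will be affected")
else:
    print(f"ratio {ratio:.1f} <= 10: the client's DCF may adequately cope")
```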

If you're running more than one task per GPU, it's even worse. The core client will probably be using a guesstimated <flops> for GPU which is less than the CPU benchmark. Looking in BOINC's client_state.xml file, those <flops> can be seen within the <app_version> sections.

Dr. Anderson's intended longer-term modification will allow a project's Validator to indicate when the runtime of a validated task should not be used in the averages, what Claggy asked for. I don't know any way to guess when that might be implemented.
                                                                Joe
ID: 1152512
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1152513 - Posted: 15 Sep 2011, 17:20:50 UTC - in response to Message 1152512.  


Dr. Anderson's intended longer-term modification will allow a project's Validator to indicate when the runtime of a validated task should not be used in the averages, what Claggy asked for. I don't know any way to guess when that might be implemented.
                                                                Joe

Joe....now you have me a bit confused again.
I thought that the current problem was occurring client side, messing up the DCF and other important stats.
This happens with or without validation.
Sending out work with estimates artificially bloated makes one's cache settings pretty much impossible to reach.
Or, are you saying that client side and server side estimates will combine to sort this?
And what about the DCF tussle between CPU and GPU work?

Meow?
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1152513
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1152557 - Posted: 15 Sep 2011, 19:39:28 UTC - in response to Message 1152513.  


Dr. Anderson's intended longer-term modification will allow a project's Validator to indicate when the runtime of a validated task should not be used in the averages, what Claggy asked for. I don't know any way to guess when that might be implemented.
                                                                Joe

Joe....now you have me a bit confused again.
I thought that the current problem was occurring client side, messing up the DCF and other important stats.
This happens with or without validation.
Sending out work with estimates artificially bloated makes one's cache settings pretty much impossible to reach.
Or, are you saying that client side and server side estimates will combine to sort this?
And what about the DCF tussle between CPU and GPU work?

Meow?

What I'm hoping is that when that longer-term protection against APR going ridiculously high is in place, the current short-term junk will be removed. If so, we'd be back in a mode where the server-side scaling could work to keep estimates reasonable and DCF would be fairly stable.

Runtime estimates and their effects on work fetch can't be considered just client side or server side; what the client tells the server affects the estimates, and vice versa.
                                                                 Joe
ID: 1152557
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1152558 - Posted: 15 Sep 2011, 19:42:17 UTC - in response to Message 1152513.  
Last modified: 15 Sep 2011, 19:44:56 UTC

...I thought that the current problem was occurring client side, messing up the DCF and other important stats...


Hi Mark, in case Joe is off catching some Z's I'll try to answer & clarify a bit.

Beat me, so was awake... :D

Had posted

Those mentioned are symptoms, not causes. All estimates are based on server-side fpops estimates distributed with the task. It's those that have been bloated server side. The client-side DCF (or aDCF) can only adapt slowly to the sudden, abrupt change, and to the subsequent corrections likely to come. Project DCF in particular is likely to struggle to stabilise on mixed CPU-GPU work, especially given that GPU performance seems to throw the estimates off at the initial step by some 20-100x, which pushes DCFs against their hard lower limits attempting to compensate.
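To illustrate that slow adaptation, here is a simplified model of the commonly described DCF behaviour (rise fast on overruns, fall slowly on early finishes); this is an illustration only, not the actual BOINC client code:

```python
# Simplified duration-correction-factor model: it jumps up immediately
# when a task overruns its estimate, but drifts down slowly when tasks
# finish early. Illustrative only, not BOINC source.
def update_dcf(dcf: float, actual_secs: float, estimated_secs: float) -> float:
    ratio = actual_secs / estimated_secs
    if ratio > dcf:
        return ratio                 # overrun: correct upwards at once
    return 0.9 * dcf + 0.1 * ratio   # early finish: ease downwards

dcf = 1.0
# a shortie estimated at 2h20m (8400 s) that really takes 8 min (480 s)
for _ in range(5):
    dcf = update_dcf(dcf, actual_secs=480, estimated_secs=8400)
print(f"DCF after 5 shorties: {dcf:.3f}")  # still far above 480/8400
```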

I think the 'intended longer term modification' being referred to is the future one that Dr.A. implements to fix whatever is happening now...

Jason

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1152558
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1152681 - Posted: 16 Sep 2011, 9:12:28 UTC - in response to Message 1152558.  

I think the 'intended longer term modification' being referred to is the future one that Dr.A. implements to fix whatever is happening now...


The long-term fix is really about a statistical problem - whether or not you include outlying datapoints in your calculations.

The whole CreditNew/APR cluster is a statistical approach.
It currently uses all tasks for calculations of both credit and APR.
Early-exit conditions (-9 on MB, 30/30 on AP, some others) make the server believe that the task was completed extremely fast. Consequently APR starts to go up - sometimes, as we learned from Winterknight, absurdly high - with all the associated problems when that bloated value gets fed back into the runtime estimates. This can then lead to -177 errors.

Based on the report on boinc_dev about this problem, DA decided to implement two things: one 'short term' - capping the processing rate used in estimates, so unrealistic rates would not be used (sadly he didn't have realistic expectations as to what realistic rates might be) - and one 'long term' - removing outlying datapoints from the calculations by allowing the project to mark certain cases as outliers not to be used. The project then still needs to define what it treats as outliers.
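A minimal sketch of that long-term idea (names are illustrative, not actual server code): once the validator can flag a result as an outlier, the APR average simply skips it.

```python
# Illustrative only: exclude validator-flagged outliers (e.g. -9
# overflows, 30/30 blanked AP tasks) from the APR average, so one
# early-exit task cannot inflate the host's processing rate.
def average_processing_rate(results):
    """results: list of (gflops_rate, is_outlier) pairs."""
    kept = [rate for rate, outlier in results if not outlier]
    return sum(kept) / len(kept) if kept else 0.0

results = [
    (95.0, False),    # normal task
    (97.0, False),    # normal task
    (4000.0, True),   # -9 overflow: ran for seconds, rate looks absurd
]
print(average_processing_rate(results))  # -> 96.0
```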

Statistics can be wonderfully manipulated by careful selection of outliers...
ID: 1152681


©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.