Shorties estimate up from three minutes to six hours after today's outage!


kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1152380 - Posted: 15 Sep 2011, 7:37:21 UTC - in response to Message 1152379.  

No, this was a question of me.. ;-)


BTW, I'm confused that this new changeset wasn't tested at S@h Beta Test first.


[EDIT: Sorry, English is not my native language. I only had a few years of it in school, a long time ago.. ;-)]


- Best regards! - Sutaru Tsureku, team seti.international founder. - Optimize your PC for higher RAC. - SETI@home needs your help. -

This is a BOINC change, Seti Beta is for testing Seti applications, although there is nothing to stop Dr.A doing tests at SetiBeta. It is not wise to cloud the issues of the testing at Beta, with possible errors introduced by changes to BOINC.

When testing, the KISS principle should always be used.

KISS is the acronym for "Keep It Simple, Stupid".

Oh, but it's so much more fun to test it live....LOL.

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1152380 · Report as offensive
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1152382 - Posted: 15 Sep 2011, 7:41:26 UTC

Mark,

Are you sure this is a "live" test...seems to me it is more like DOA.
Dave

ID: 1152382 · Report as offensive
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1152383 - Posted: 15 Sep 2011, 7:42:15 UTC - in response to Message 1152382.  

Mark,

Are you sure this is a "live" test...seems to me it is more like DOA.

Not until our dear Dr. Anderson says it is............
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1152383 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1152384 - Posted: 15 Sep 2011, 7:45:34 UTC - in response to Message 1152363.  

I suggest people get ready to hit the NNT button.....

That's assuming people are able to get work.
One of my systems (the one with v6.12.33) has run out of GPU work, so each time it misses a request the backoff becomes something completely ridiculous. And since the server is issuing "No tasks sent" most of the time, i don't think there's much chance of me getting any GPU work for that system till the weekend, when i can sit there & edit the client_state file & hit "retry" 100s of times until i get a couple of hundred GPU tasks downloaded & running.
Rather annoying.
:-/

Well, i've been able to get some work. Problem is that it's all crunched before i can download more.
Grant
Darwin NT
ID: 1152384 · Report as offensive
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1152386 - Posted: 15 Sep 2011, 7:46:36 UTC - in response to Message 1152372.  


It was a very real, but rare, problem reported and discussed on these boards. It was passed on, in good faith, to the developers by one of our regular messenger pigeons.



Richard,

Why screw up 250,000 users for a rare problem??

Or at least fix it with MUCH less impact on the portion that is not a problem.
Dave

ID: 1152386 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1152387 - Posted: 15 Sep 2011, 7:47:21 UTC - in response to Message 1152372.  

Hang on while I go and find the threads.

OK, go and have a read of Average processing rate - a little high?

"Be careful what you wish for"

Now please excuse me while I go to try and compose an email for David that he will actually listen to - yesterday's efforts don't seem to have elicited a response yet.
ID: 1152387 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19062
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1152388 - Posted: 15 Sep 2011, 7:49:46 UTC - in response to Message 1152387.  

Hang on while I go and find the threads.

OK, go and have a read of Average processing rate - a little high?

"Be careful what you wish for"

Now please excuse me while I go to try and compose an email for David that he will actually listen to - yesterday's efforts don't seem to have elicited a response yet.

But I wished for a competent fix not a band aid.
ID: 1152388 · Report as offensive
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1152389 - Posted: 15 Sep 2011, 7:50:14 UTC - in response to Message 1152387.  

Hang on while I go and find the threads.

OK, go and have a read of Average processing rate - a little high?

"Be careful what you wish for"

Now please excuse me while I go to try and compose an email for David that he will actually listen to - yesterday's efforts don't seem to have elicited a response yet.

Well.....best of luck, my friend.
Even though not running smoothly, I've got work to crunch and will let the rigs get on with struggling with Boinc whilst I sleep with the kitties.

As the saying goes.......

'This too, shall pass.'

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1152389 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1152391 - Posted: 15 Sep 2011, 7:51:22 UTC - in response to Message 1152386.  

Why screw up 250,000 users for a rare problem??

It's much fewer than that. The fix only affects users of optimised applications - something of the order of 10,000 users, last time we checked the download statistics.

As to why? I'm sure it wasn't intentional. But a fix, developed - as I said, in good faith - had side-effects when deployed untested on a live production server.
ID: 1152391 · Report as offensive
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1152392 - Posted: 15 Sep 2011, 7:52:44 UTC - in response to Message 1152389.  

Hang on while I go and find the threads.

OK, go and have a read of Average processing rate - a little high?

"Be careful what you wish for"

Now please excuse me while I go to try and compose an email for David that he will actually listen to - yesterday's efforts don't seem to have elicited a response yet.

Well.....best of luck, my friend.
Even though not running smoothly, I've got work to crunch and will let the rigs get on with struggling with Boinc whilst I sleep with the kitties.

As the saying goes.......

'This too, shall pass.'

Meow.


Kidney stones pass too, but they are not what you would wish for.


Dave

ID: 1152392 · Report as offensive
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1152393 - Posted: 15 Sep 2011, 7:55:37 UTC - in response to Message 1152392.  

Hang on while I go and find the threads.

OK, go and have a read of Average processing rate - a little high?

"Be careful what you wish for"

Now please excuse me while I go to try and compose an email for David that he will actually listen to - yesterday's efforts don't seem to have elicited a response yet.

Well.....best of luck, my friend.
Even though not running smoothly, I've got work to crunch and will let the rigs get on with struggling with Boinc whilst I sleep with the kitties.

As the saying goes.......

'This too, shall pass.'

Meow.


Kidney stones pass too, but they are not what you would wish for.


I did say something a number of posts ago about it being painful...LOL.

G'night all, and good luck.

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1152393 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19062
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1152395 - Posted: 15 Sep 2011, 7:58:39 UTC - in response to Message 1152386.  


It was a very real, but rare, problem reported and discussed on these boards. It was passed on, in good faith, to the developers by one of our regular messenger pigeons.



Richard,

Why screw up 250,000 users for a rare problem??

Or at least fix it with MUCH less impact on the portion that is not a problem.

Because it is not as rare as Richard might assume: it will hit everybody attaching a new or re-attached computer who chooses to do both MB and AP. It messed things up as badly as what we are seeing now because of the wide difference in processing times, and therefore the big differences in the time taken to get 10 tasks validated, i.e. MB gets to APR timings long before AP gets there. In my case, as soon as the MB CUDA app got a working APR, the DCF bounced from 1.xx to 10.xx every time an AP task completed.
It is still so bad that, just before this ill-conceived change, the DCF rose above 2.xx whenever an AP task completed, and that is 6 weeks after the host was re-attached.
ID: 1152395 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1152401 - Posted: 15 Sep 2011, 8:12:53 UTC - in response to Message 1152387.  
Last modified: 15 Sep 2011, 8:13:36 UTC

Hang on while I go and find the threads.

OK, go and have a read of Average processing rate - a little high?

"Be careful what you wish for"

Now please excuse me while I go to try and compose an email for David that he will actually listen to - yesterday's efforts don't seem to have elicited a response yet.

There's been another changeset, 24217, which sounds like it'll work better:

- scheduler: revise [21428] to include non-anonymous-platform, and change the ratio limit from 2 to 10.


Claggy
ID: 1152401 · Report as offensive
Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1152402 - Posted: 15 Sep 2011, 8:13:23 UTC - in response to Message 1152395.  

Deep breaths guys, it'll get fixed.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
ID: 1152402 · Report as offensive
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1152403 - Posted: 15 Sep 2011, 8:13:28 UTC - in response to Message 1152395.  


It was a very real, but rare, problem reported and discussed on these boards. It was passed on, in good faith, to the developers by one of our regular messenger pigeons.



Richard,

Why screw up 250,000 users for a rare problem??

Or at least fix it with MUCH less impact on the portion that is not a problem.

Because it is not as rare as Richard might assume: it will hit everybody attaching a new or re-attached computer who chooses to do both MB and AP. It messed things up as badly as what we are seeing now because of the wide difference in processing times, and therefore the big differences in the time taken to get 10 tasks validated, i.e. MB gets to APR timings long before AP gets there. In my case, as soon as the MB CUDA app got a working APR, the DCF bounced from 1.xx to 10.xx every time an AP task completed.
It is still so bad that, just before this ill-conceived change, the DCF rose above 2.xx whenever an AP task completed, and that is 6 weeks after the host was re-attached.


I don't want to start a war but I disagree.

About 5 weeks ago I installed CUDA cards in 3 of my machines. I started them all out with stock and changed to opti apps after 3 or 4 days. All were running GPU MB and CPU MB & AP from the get-go. Initially estimates were off, because I had not yet completed 10 tasks. As soon as I had completed 10, which took a few days, estimates were accurate and remained so for a month, until this change.

Maybe something else is going on, especially since you are still having issues after 6 weeks.

I did read the thread listed above.



Dave

ID: 1152403 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1152404 - Posted: 15 Sep 2011, 8:16:29 UTC - in response to Message 1152395.  


It was a very real, but rare, problem reported and discussed on these boards. It was passed on, in good faith, to the developers by one of our regular messenger pigeons.

Richard,

Why screw up 250,000 users for a rare problem??

Or at least fix it with MUCH less impact on the portion that is not a problem.

Because it is not as rare as Richard might assume: it will hit everybody attaching a new or re-attached computer who chooses to do both MB and AP. It messed things up as badly as what we are seeing now because of the wide difference in processing times, and therefore the big differences in the time taken to get 10 tasks validated, i.e. MB gets to APR timings long before AP gets there. In my case, as soon as the MB CUDA app got a working APR, the DCF bounced from 1.xx to 10.xx every time an AP task completed.
It is still so bad that, just before this ill-conceived change, the DCF rose above 2.xx whenever an AP task completed, and that is 6 weeks after the host was re-attached.

And, when SAH_v7 is deployed here in October or November, it will become not rare at all. Common, in fact. Universal. Because EVERY user will be starting from a clean APR slate for what we now call MB.

Email sent:

David,

I hope you're still reading here, though after midnight for you.

I note your changeset 24217.

That may be a useful band-aid, and changing the capping from 2 to 10 will be a sufficiently gradual change to avoid some of the major problems that have arisen (and are likely to arise in the future) for the 10,000-odd users of anonymous platform at SAH.

But could I please urge you to enter into a dialog with those of us who have day-to-day observational, and analytical, experience of SAH? I think we need a more thorough, and targeted, solution of the related issues of APR and DCF, early exit tasks as opposed to optimised applications, and related matters.

And PLEASE, let's have that dialog and analysis BEFORE any change which affects non-anon-platform users is rushed onto a live production server.

I'll try and write something up before your morning, but in the meantime can I draw readers' attention to [this thread]
ID: 1152404 · Report as offensive
[DPC] hansR Project Donor
Volunteer tester
Joined: 14 Jul 00
Posts: 47
Credit: 235,829,569
RAC: 8
Netherlands
Message 1152405 - Posted: 15 Sep 2011, 8:21:38 UTC

Every now and then I receive 1 or 2 WUs. I'm running 3 at a time on my GTX 570. After every request there is a backoff of 5 minutes, but it takes just 2-3 minutes to finish the received WU(s).

The system is not able to connect to the internet between 22:00 and 5:00.

I think my card will become lazy ..

Just wait and see ......


ID: 1152405 · Report as offensive
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1152412 - Posted: 15 Sep 2011, 9:30:17 UTC - in response to Message 1152387.  
Last modified: 15 Sep 2011, 9:43:41 UTC

Hang on while I go and find the threads.

OK, go and have a read of Average processing rate - a little high?

"Be careful what you wish for"

Now please excuse me while I go to try and compose an email for David that he will actually listen to - yesterday's efforts don't seem to have elicited a response yet.

Yesterday's effort (which was only a whisker short of a rolling pin) yielded the desired response of having David listen to Joe's advice on what might be good numbers (see the new changeset posted).

from boinc_dev:
I changed the ratio limit from 2 to 10,
and added the limit to non-anonymous-platform as well.
Should be on beta tomorrow, main project Friday.
-- David

Perfectly in time to wreak havoc over the weekend.
Anybody attached to beta, please monitor your incoming work closely, to see if the change has brought the desired effect - i.e. task duration estimates back to something more realistic. We don't want the change to hit main if it doesn't. Especially if it is now going to affect not only the estimated 10k users on anonymous platform (some 4k of them with a v0.38 Lunatics installer) but also stock.

Winterknight wrote:
This is a BOINC change, Seti Beta is for testing Seti applications, although there is nothing to stop Dr.A doing tests at SetiBeta. It is not wise to cloud the issues of the testing at Beta, with possible errors introduced by changes to BOINC.


That point was addressed as well.
That's probably why we have a one day window between beta and main deploy - balancing 'fixing ASAP' with 'checking it actually works'

Can anybody predict whether these changes will have an impact on the 'DCF squared' problem? We expect V7 release in late autumn/early winter and then EVERYBODY will start with a fresh slate - DCF squared will need addressing before that.

DCF squared is the problem experienced when attaching a new host or swapping to anonymous platform, either of which results in a new entry on the application details page, tracking the APR of the CPU/GPU to be used in task duration estimates.
Tasks first come in with a far too high initial estimate. Over the next few tasks, DCF on the client adjusts to small values to get estimates down. After the 10th VALIDATION, APR kicks in - expecting a client DCF of 1! And BANG! DCF squared: tasks are suddenly estimated much, much too short, the host overfetches, etc. The first task of the new batch (if it doesn't get killed by a -177 'ran longer than 10x estimate' error) pushes DCF back up towards 1, but by then the damage has often been done.
NB: if you run anon, inserting <flops> will circumvent this problem.
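The sequence described above can be put into numbers with a minimal sketch. It assumes the simplest possible model, namely that the client's displayed estimate is the server-side estimate multiplied by the client's DCF; the function name, the `overestimate` factor and the example figures are all hypothetical and are not taken from BOINC's source.

```python
def dcf_squared_demo(true_runtime_s, overestimate):
    """Walk through the three phases described in the post.

    Assumed model (not BOINC's actual code):
        client estimate = server estimate * client DCF
    `overestimate` is a hypothetical factor by which the initial
    server-side estimate exceeds the true runtime.
    """
    initial_server_estimate = true_runtime_s * overestimate

    # Phase 1: new app entry, DCF still at 1, so the estimate is far too high.
    phase1 = initial_server_estimate * 1.0

    # Phase 2: DCF has drifted down until estimates match reality.
    settled_dcf = 1.0 / overestimate
    phase2 = initial_server_estimate * settled_dcf

    # Phase 3: after the 10th validation the server-side estimate is
    # APR-based and already accurate, but the client still applies the
    # small DCF, so the correction is effectively applied twice.
    phase3 = true_runtime_s * settled_dcf

    return {
        "phase1_s": phase1,
        "phase2_s": phase2,
        "phase3_s": phase3,
        # The post mentions tasks being killed for running longer
        # than 10x their estimate.
        "would_hit_10x_limit": true_runtime_s > 10 * phase3,
    }


if __name__ == "__main__":
    # Hypothetical host: 3-minute tasks, initial estimate 20x too high.
    for key, value in dcf_squared_demo(180.0, 20.0).items():
        print(key, value)
```

In this toy model the phase 3 estimate is the phase 1 estimate divided by the overestimate factor twice, which is where the "squared" in the nickname comes from.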
ID: 1152412 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1152416 - Posted: 15 Sep 2011, 10:14:41 UTC - in response to Message 1152412.  

Yes, David's interim fix was posted and announced while we were discussing matters here. Neither the changeset nor the email had been released at the time I entered this discussion - and, to be honest, given the time-zones involved, I didn't expect it: I thought we'd already missed the 'Wednesday Window' for changes.

The interim fix is, IMHO, precisely that - a modest change that will take us back part-way towards the status quo ante. Joe's suggested ratio of 10 was derived from, and is reasonable for, optimised CPU crunching. My GTX 470, for example (at 18 months old, no longer state of the art), needs a ratio of around 80.
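To see why a limit of 10 only goes part of the way, here is a rough sketch. It assumes that the "ratio" is a speed-up factor that the scheduler clamps at the limit before scaling runtime estimates; the thread does not spell out the exact formula, so `estimate_inflation` is a hypothetical helper for illustration only.

```python
def estimate_inflation(needed_ratio, ratio_limit):
    """Factor by which estimates come out too long when the ratio
    a host actually needs is clamped at `ratio_limit`.

    Assumption: estimates scale inversely with the (clamped) ratio.
    """
    clamped = min(needed_ratio, ratio_limit)
    return needed_ratio / clamped


if __name__ == "__main__":
    # Ratios quoted in the post: about 10 for optimised CPU apps,
    # about 80 for a GTX 470.
    for label, needed in (("optimised CPU", 10), ("GTX 470", 80)):
        for limit in (2, 10):
            factor = estimate_inflation(needed, limit)
            print(f"{label}: limit {limit} -> estimates {factor:g}x too long")
```

Under that assumption, raising the limit from 2 to 10 makes the optimised CPU case exact, while a GPU needing a ratio of 80 would still see estimates several times too long.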

The 'DCF squared' problem is comparatively minor for users of the stock application delivery method. It's all over within the runtime of a single task, because the revised APR estimates are applied globally to all tasks in the cache, including the currently-running one. So, on the conclusion of the current task, local DCF is reset and the 'squared' drops out of the equation.

DCF squared is more of a problem for users of the 'anonymous platform' delivery method. Then, the APR values are applied singly, one task at a time - and only to newly-downloaded work. Previously cached tasks retain their old estimates, and thus DCF squared persists for as long as it takes to work through that cached work.
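The difference between the two delivery methods can be sketched as follows. This is a toy model under the same assumption as before (client estimate = stored server estimate times a single project-wide DCF); the `Task` class and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    server_estimate_s: float  # estimate stored with the task in the cache


def client_estimates(cache, dcf):
    """Apply one project-wide DCF to every task in the cache."""
    return {t.name: t.server_estimate_s * dcf for t in cache}


TRUE_RUNTIME_S = 180.0   # hypothetical real runtime
OVERESTIMATE = 20.0      # hypothetical error in the pre-APR estimates

# Stock delivery: once APR kicks in, every cached task is re-estimated.
stock_cache = [
    Task("cached_old", TRUE_RUNTIME_S),
    Task("new_download", TRUE_RUNTIME_S),
]

# Anonymous platform: only new downloads get the APR-based estimate,
# previously cached tasks keep their old, inflated one.
anon_cache = [
    Task("cached_old", TRUE_RUNTIME_S * OVERESTIMATE),
    Task("new_download", TRUE_RUNTIME_S),
]

if __name__ == "__main__":
    print("stock, DCF 1:    ", client_estimates(stock_cache, 1.0))
    print("anon,  DCF 1:    ", client_estimates(anon_cache, 1.0))
    print("anon,  DCF 1/20: ", client_estimates(anon_cache, 1.0 / OVERESTIMATE))
```

With the stock cache a DCF of 1 is right for every task. With the mixed anonymous-platform cache no single DCF is right for both tasks, so the swings last until the old tasks have been worked through.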

All of this has major implications for the expected autumn rollout of v7.

• There'll be no pre-existing project-wide averages to 'seed' the initial estimates.
• Everyone will be starting their own APR record from scratch.
• Downloading tasks will be even harder than usual, because everyone will be downloading new applications (and, in a lot of cases, very large CUDA runtime/fft DLLs) at the same time.
• We have no idea yet how those factors will affect the time it takes to reach the crucial tenth validation.

Be prepared for a bumpy ride...
ID: 1152416 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1152417 - Posted: 15 Sep 2011, 10:16:49 UTC - in response to Message 1152412.  

NB if you run anon, inserting <flops> will circumvent this problem.

I thought the reason for the server side adjustments was so we didn't have to do all that stuffing around?
And it did involve a lot of stuffing around.
Grant
Darwin NT
ID: 1152417 · Report as offensive