Panic Mode On (58) Server problems?

Les

Send message
Joined: 20 May 99
Posts: 53
Credit: 21,062,237
RAC: 18
United States
Message 1160105 - Posted: 8 Oct 2011, 6:07:14 UTC - in response to Message 1160104.  

Joe - Thanks for the suggestion, and of course you were right: all of the B3_P1 tasks that made it to my computer exited after 21 to 26 seconds.
ID: 1160105 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1160110 - Posted: 8 Oct 2011, 7:04:51 UTC

Thanks for that information, Joe. Just cleared 8 of mine. I was playing a game for a few hours, finished up, and noticed that my 10-day cache was filled. Saw your message about B3_P1 and noticed a few in my cache. Let those run and early-exit, reported them, and got some new APs right away.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1160110 · Report as offensive
MikeN

Send message
Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1160115 - Posted: 8 Oct 2011, 8:03:48 UTC

I got 21 APs last night spread over three separate crunchers. Only one problem: they have estimated completion times of 234 hours (i.e. about 10 days) when they will actually run in 12 hours. As a result I am now not getting any other work, as BOINC thinks my 10-day cache is full. I am currently running shorties in HP mode. No doubt when I do run the APs, BOINC will recalculate the completion times on my remaining MBs to be 20 times shorter than they will actually take and fill my caches to the 50 WU per core limit.
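For anyone who wants to check the arithmetic, here is a rough Python sketch of why a few over-estimated tasks can make a 10-day cache look full. The figures (234 hours, 12 hours, 10 days) come from the post above; the function names and the one-task-per-core model are assumptions made for illustration, and BOINC's real work-fetch logic is more involved than this.

```python
import math

# Figures quoted in the post above.
ESTIMATED_HOURS = 234.0   # estimate BOINC shows for one AP task
ACTUAL_HOURS = 12.0       # what the task really takes on this host
CACHE_DAYS = 10.0         # requested work buffer


def overestimate_factor(estimated_hours, actual_hours):
    """How many times too long the estimate is."""
    return estimated_hours / actual_hours


def estimated_days(task_hours):
    """Convert a per-task estimate from hours to days."""
    return task_hours / 24.0


def tasks_to_fill_cache(cache_days, task_hours, cores):
    """Hypothetical model: tasks needed before the estimated queue
    covers `cache_days` on `cores` cores, one task per core at a time."""
    return math.ceil(cache_days * 24.0 * cores / task_hours)


print(overestimate_factor(ESTIMATED_HOURS, ACTUAL_HOURS))
print(estimated_days(ESTIMATED_HOURS))
print(tasks_to_fill_cache(CACHE_DAYS, ESTIMATED_HOURS, cores=1))
print(tasks_to_fill_cache(CACHE_DAYS, ACTUAL_HOURS, cores=1))
```

Under this simplified model a single core looks fully booked after two such tasks, where twenty would be needed if the estimate matched the real 12-hour runtime.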
ID: 1160115 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1160127 - Posted: 8 Oct 2011, 9:04:47 UTC - in response to Message 1160104.  

While not strictly a server problem, the B3_P1 Astropulse problem is here again. The ap_20ap11ad_B3_P1_00338_20111007_07731.wu I'm downloading now has the pattern of damaged data we've seen before on that channel which can be expected to give a "Blanking too much RFI" immediate exit.

Those of you who have been building a cache of AP work might want to look for tasks from that channel and consider forcing them to run by suspending others, just to get them cleared out if damaged.

Yep, the download completed while I was typing and the task ran all of 21 seconds...
                                                                   Joe

Might I suggest that people don't try to return too many of these in succession, especially if they're new to AP crunching? So far as I can tell, we still have no outlier detection in the AP validator, so you could still run into APR problems.
ID: 1160127 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1160129 - Posted: 8 Oct 2011, 9:23:49 UTC - in response to Message 1160127.  


Does anyone want to hazard a guess as to when GPU crunching estimates will finally come back to realistic values?

The DCF on both of my machines is now around 1.0 & CPU estimates are close enough. But the GPU ones are still way out- and although they slowly get closer to the actual value (although never even remotely near it) as soon as a CPU WU is done, it's back to where it was.
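The pattern described here (GPU estimates creeping towards the real value, then snapping back after a CPU task) is what you would expect if one correction factor is shared by CPU and GPU work. The Python sketch below is a toy model only, not BOINC's actual code: the update rule, the 0.9/0.1 smoothing weights, and the 0.2 and 1.0 runtime ratios are all assumptions chosen to illustrate the effect.

```python
def update_dcf(dcf, ratio):
    """Toy update rule (an assumption, not BOINC's code).

    ratio = actual runtime / raw estimate for the task just finished.
    An underestimate pushes the factor up at once; an overestimate
    only eases it down by a tenth of the gap.
    """
    if ratio > dcf:
        return ratio
    return 0.9 * dcf + 0.1 * ratio


# Hypothetical ratios: GPU raw estimates are five times too long,
# CPU raw estimates are about right.
GPU_RATIO = 0.2
CPU_RATIO = 1.0


def simulate(task_kinds, dcf=1.0):
    """Return the shared DCF after each finished task."""
    history = []
    for kind in task_kinds:
        ratio = GPU_RATIO if kind == "gpu" else CPU_RATIO
        dcf = update_dcf(dcf, ratio)
        history.append(dcf)
    return history


history = simulate(["gpu"] * 5 + ["cpu"])
for step, value in enumerate(history, start=1):
    print(step, round(value, 4))
```

In this model five GPU tasks pull the factor down only part of the way towards 0.2, and the single CPU task that follows puts it straight back at 1.0.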
Grant
Darwin NT
ID: 1160129 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1160132 - Posted: 8 Oct 2011, 10:02:16 UTC - in response to Message 1160129.  


Does anyone want to hazard a guess as to when GPU crunching estimates will finally come back to realistic values?

The DCF on both of my machines is now around 1.0 & CPU estimates are close enough. But the GPU ones are still way out- and although they slowly get closer to the actual value (although never even remotely near it) as soon as a CPU WU is done, it's back to where it was.

Remember the sequence of events.

Problem identified - first attempt at solution - collateral damage - interim precautionary measures - need second (permanent) solution - final staged return to normality - remove precautionary measures. And all sorts of other network issues got in the way, as well.

We're part way through several of those interim steps.

Precautionary measures (limit on work in progress) - has been lifted from '50 CPU tasks' to '50 tasks per CPU core'. Nobody has mentioned whether the GPU version has been raised to 'per GPU' yet - I can't test (only single GPU rigs here).

Permanent solution - we need to detect 'outliers', and exclude them from DCF/APR calculations. That's been done for MB, but not yet (so far as I can tell) for AP. With Joe's news about the stuck bit, there will be a lot of outliers around - we need that finishing off.

Staged return to normality - we've had two steps so far, x2 and x5. I'd be happy to argue for another x5 during maintenance this coming Tuesday: I think we're ready for that, provided the servers hold together over the weekend, and they can get the outlier test into the AP validator.

But all we can do from the outside is observe, advise, warn, cajole..... The decisions will be taken in Berkeley.
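To illustrate why the outlier test matters, here is a minimal Python sketch of a processing-rate average with and without early-exit tasks. It is not the validator's code: the `Result` fields, the `early_exit` flag, the flops figure, and the use of a plain mean are all assumptions for illustration. The 12-hour and 21-second runtimes are taken from posts in this thread.

```python
from dataclasses import dataclass


@dataclass
class Result:
    est_flops: float   # work the task was estimated to need (hypothetical figure)
    elapsed_s: float   # time the task actually ran
    early_exit: bool   # hypothetical flag, e.g. set on "Blanking too much RFI"


def average_rate(results, skip_outliers):
    """Plain mean of est_flops / elapsed_s, optionally ignoring early exits."""
    used = [r for r in results if not (skip_outliers and r.early_exit)]
    if not used:
        return None
    return sum(r.est_flops / r.elapsed_s for r in used) / len(used)


EST_FLOPS = 4.32e14  # assumed, chosen so a 12-hour task works out to 1e10 flops/s

results = [
    Result(EST_FLOPS, 12 * 3600, False),
    Result(EST_FLOPS, 12 * 3600, False),
    Result(EST_FLOPS, 12 * 3600, False),
    Result(EST_FLOPS, 21, True),   # damaged B3_P1 task that quit after 21 seconds
]

with_outliers = average_rate(results, skip_outliers=False)
without_outliers = average_rate(results, skip_outliers=True)
print(with_outliers, without_outliers)
```

With these made-up numbers, one 21-second exit among four results inflates the apparent rate by a factor of several hundred, which is the kind of distortion that excluding outliers is meant to prevent.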
ID: 1160132 · Report as offensive
Blake Bonkofsky
Volunteer tester
Avatar

Send message
Joined: 29 Dec 99
Posts: 617
Credit: 46,383,149
RAC: 0
United States
Message 1160135 - Posted: 8 Oct 2011, 10:06:07 UTC - in response to Message 1160132.  

I can comment on limits. My machines have been banging off of the limiters for a while. i7 Quad Core with HyperThreading on (8 cores) + 3x GTX460's has been at 400 CPU/1200 GPU. Quad core + Single GTX460 is at 200/400, and i3 dual core with HT is at 200/400 as well. Sooo, it is back to per core/GPU (50/400)
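The figures above are consistent with a simple per-device rule, which can be sketched in Python as below. The constants come from the "(50/400)" in the post; the function name is hypothetical, and the assumption that the i3 host has one GPU is inferred from its reported limit of 400 rather than stated.

```python
CPU_TASKS_PER_CORE = 50    # per logical CPU core, as reported above
GPU_TASKS_PER_GPU = 400    # per GPU, as reported above


def in_progress_limits(logical_cores, gpus):
    """Return (cpu_limit, gpu_limit) under the per-core / per-GPU rule."""
    return (CPU_TASKS_PER_CORE * logical_cores, GPU_TASKS_PER_GPU * gpus)


hosts = {
    "i7 quad + HT, 3x GTX460": (8, 3),
    "quad core, 1x GTX460": (4, 1),
    "i3 dual + HT, 1 GPU (assumed)": (4, 1),
}

for name, (cores, gpus) in hosts.items():
    print(name, in_progress_limits(cores, gpus))
```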
ID: 1160135 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1160138 - Posted: 8 Oct 2011, 10:20:56 UTC - in response to Message 1160135.  

I can comment on limits. My machines have been banging off of the limiters for a while. i7 Quad Core with HyperThreading on (8 cores) + 3x GTX460's has been at 400 CPU/1200 GPU. Quad core + Single GTX460 is at 200/400, and i3 dual core with HT is at 200/400 as well. Sooo, it is back to per core/GPU (50/400)

Looks about right, based on cache levels I am seeing here.
The kitties are a little happier with a bit more kibble in the bowls.

So far, so good with the server code adjustments.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1160138 · Report as offensive
Profile S@NL Etienne Dokkum
Volunteer tester
Avatar

Send message
Joined: 11 Jun 99
Posts: 212
Credit: 43,822,095
RAC: 0
Netherlands
Message 1160142 - Posted: 8 Oct 2011, 10:30:17 UTC - in response to Message 1160132.  


Staged return to normality - we've had two steps so far, x2 and x5. I'd be happy to argue for another x5 during maintenance this coming Tuesday: I think we're ready for that, provided the servers hold together over the weekend, and they can get the outlier test into the AP validator.

But all we can do from the outside is observe, advise, warn, cajole..... The decisions will be taken in Berkeley.


Maybe the guys in the lab could comment on how and when AP validation will resume ??? They're sending out new WU's but still no returned result gets validated...
ID: 1160142 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1160146 - Posted: 8 Oct 2011, 10:38:28 UTC - in response to Message 1160142.  


Staged return to normality - we've had two steps so far, x2 and x5. I'd be happy to argue for another x5 during maintenance this coming Tuesday: I think we're ready for that, provided the servers hold together over the weekend, and they can get the outlier test into the AP validator.

But all we can do from the outside is observe, advise, warn, cajole..... The decisions will be taken in Berkeley.


Maybe the guys in the lab could comment on how and when AP validation will resume ??? They're sending out new WU's but still no returned result gets validated...

Server status does show the validator running...
Could possibly be a return of the AP validator bug that brought things to a crawl some time ago.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1160146 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1160148 - Posted: 8 Oct 2011, 10:41:15 UTC

Could be nearly time for 'That Duck' ...
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1160148 · Report as offensive
Profile Wiggo
Avatar

Send message
Joined: 24 Jan 00
Posts: 36832
Credit: 261,360,520
RAC: 489
Australia
Message 1160149 - Posted: 8 Oct 2011, 10:42:27 UTC - in response to Message 1160148.  
Last modified: 8 Oct 2011, 10:43:30 UTC

Could be nearly time for 'That Duck' ...

Ahh.., duck hunting season is soon to begin. :D

Cheers.
ID: 1160149 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1160150 - Posted: 8 Oct 2011, 10:43:27 UTC - in response to Message 1160148.  

Could be nearly time for 'That Duck' ...

Ooooooo noooooooooooooooo.
Not the return of 'the little yellow fluffy thing whose name shall not be spoken'.............
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1160150 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Avatar

Send message
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1160152 - Posted: 8 Oct 2011, 10:56:58 UTC - in response to Message 1160138.  
Last modified: 8 Oct 2011, 11:05:02 UTC

Well, now I've got a lot of downloads stuck on every rig. Computing them
is faster than downloading them, and even pushing BOINC buttons doesn't work.

First time without any SETI work, not even Beta.
Just wait and see..........?

Some movement after some retries.
ID: 1160152 · Report as offensive
Claggy
Volunteer tester

Send message
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1160154 - Posted: 8 Oct 2011, 11:10:49 UTC - in response to Message 1160132.  


Does anyone want to hazard a guess as to when GPU crunching estimates will finally come back to realistic values?

The DCF on both of my machines is now around 1.0 & CPU estimates are close enough. But the GPU ones are still way out- and although they slowly get closer to the actual value (although never even remotely near it) as soon as a CPU WU is done, it's back to where it was.

Remember the sequence of events.

Problem identified - first attempt at solution - collateral damage - interim precautionary measures - need second (permanent) solution - final staged return to normality - remove precautionary measures. And all sorts of other network issues got in the way, as well.

We're part way through several of those interim steps.

Precautionary measures (limit on work in progress) - has been lifted from '50 CPU tasks' to '50 tasks per CPU core'. Nobody has mentioned whether the GPU version has been raised to 'per GPU' yet - I can't test (only single GPU rigs here).

Permanent solution - we need to detect 'outliers', and exclude them from DCF/APR calculations. That's been done for MB, but not yet (so far as I can tell) for AP. With Joe's news about the stuck bit, there will be a lot of outliers around - we need that finishing off.

Staged return to normality - we've had two steps so far, x2 and x5. I'd be happy to argue for another x5 during maintenance this coming Tuesday: I think we're ready for that, provided the servers hold together over the weekend, and they can get the outlier test into the AP validator.

But all we can do from the outside is observe, advise, warn, cajole..... The decisions will be taken in Berkeley.

Yep, it looks that way. I started my XP3200+/HD4650/8400 GS host on the Stock apps the other day; at this time it's got 5 validated 6.03 tasks, one being a -9,
and APR does not look hugely different from the anonymous platform entry (I think I was running Stock 6.03, but I only have two validations there).

The next question is why 'Number of tasks completed' reads one less than 'Consecutive valid tasks'. I suppose 'Number of tasks completed' included task Number 0.

Claggy
ID: 1160154 · Report as offensive
Profile KWSN Ekky Ekky Ekky
Avatar

Send message
Joined: 25 May 99
Posts: 944
Credit: 52,956,491
RAC: 67
United Kingdom
Message 1160159 - Posted: 8 Oct 2011, 11:35:34 UTC

I have plenty of work here but uploads seem to have stopped at this end.
Cricket he say good.
Cricket speak with forked tongue?

ID: 1160159 · Report as offensive
MikeN

Send message
Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1160171 - Posted: 8 Oct 2011, 12:40:34 UTC

I cannot remember the last time the daily cricket graph looked the way it does at present - a perfect, solid mass of green. Long may it continue.

http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;ranges=d%3Aw%3Am%3Ay;view=Octets
ID: 1160171 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1160177 - Posted: 8 Oct 2011, 13:10:08 UTC - in response to Message 1160166.  

Ok, whoever it is that has "the yellow one" in custody, prepare to bring it out in about 30 minutes.

/me ducks.

LOL

You mean ?
ID: 1160177 · Report as offensive
Profile KWSN Ekky Ekky Ekky
Avatar

Send message
Joined: 25 May 99
Posts: 944
Credit: 52,956,491
RAC: 67
United Kingdom
Message 1160178 - Posted: 8 Oct 2011, 13:11:42 UTC - in response to Message 1160171.  

I cannot remember the last time the daily cricket graph looked the way it does at present - a perfect, solid mass of green. Long may it continue.

http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;ranges=d%3Aw%3Am%3Ay;view=Octets


But if nothing is actually uploading?

ID: 1160178 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1160186 - Posted: 8 Oct 2011, 13:32:18 UTC - in response to Message 1160154.  

Yep, it looks that way. I started my XP3200+/HD4650/8400 GS host on the Stock apps the other day; at this time it's got 5 validated 6.03 tasks, one being a -9,
and APR does not look hugely different from the anonymous platform entry (I think I was running Stock 6.03, but I only have two validations there).

The next question is why 'Number of tasks completed' reads one less than 'Consecutive valid tasks'. I suppose 'Number of tasks completed' included task Number 0.

Claggy

As you know, I reverted my host 3751792 to stock when you reported the 'no work to stock GPUs' bug earlier this week.

I'm seeing correct runtime estimates with a DCF currently around 1.2 (it'll drop back again when I reach the next batch of shorties). Since I'm running GPU only, and the card is well fast enough to show APR/estimate anomalies, I'm assuming that the 'non-anonymous-platform' bits of [trac]changeset:24217[/trac] were never applied here, even though we know the change in the ratio limit from 2 to 10 is active. I think the verdict on the APR outlier code is the old Scottish standby of 'not proven', but we ought to test it sometime, for both the stock and anonymous platform cases.

On the task counts, I guess it's an initialisation issue like the one I got him to fix in [trac]changeset:23637[/trac]. Feel free.....
ID: 1160186 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.