Panic Mode On (56) Server problems?


Starman
Joined: 15 May 99
Posts: 204
Credit: 81,351,915
RAC: 25
Canada
Message 1156646 - Posted: 27 Sep 2011, 15:11:27 UTC

Well, my "work" machine is dead in the water. It hasn't been able to upload since yesterday afternoon, even using the 72.52.96.30 proxy!
ID: 1156646
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19402
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1156651 - Posted: 27 Sep 2011, 15:39:26 UTC - in response to Message 1156646.  

Has anybody tried using programs like Tor to enable the use of proxies?
I know it is designed to provide security/hide users etc.

I haven't any experience myself, but it may be worth a try; it might need a knowledgeable person to set it up, I just don't know.
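
If anyone does want to experiment, below is a minimal sketch of checking whether a host is reachable through Tor's local SOCKS proxy. It assumes Tor is running with its default SOCKS port 9050 and that the third-party PySocks package is installed; the target address is just the proxy mentioned above, so swap in whatever you are actually trying to reach. The BOINC client itself would still need its own proxy settings pointed at 127.0.0.1:9050 before it could use Tor for scheduler contacts or transfers.

# Sketch: test whether a host is reachable through Tor's local SOCKS5 proxy.
# Assumes Tor is listening on 127.0.0.1:9050 (its default) and that the
# third-party PySocks package is installed ("pip install pysocks"; import name "socks").
# This is purely a connectivity check, not a BOINC configuration.
import socks

PROXY_HOST, PROXY_PORT = "127.0.0.1", 9050     # Tor's default SOCKS port
TARGET_HOST, TARGET_PORT = "72.52.96.30", 80   # replace with the host you need to reach

def reachable_via_tor(host: str, port: int, timeout: float = 15.0) -> bool:
    """Return True if a TCP connection to host:port succeeds through the SOCKS proxy."""
    s = socks.socksocket()
    s.set_proxy(socks.SOCKS5, PROXY_HOST, PROXY_PORT)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return True
    except (socks.ProxyError, OSError):
        return False
    finally:
        s.close()

if __name__ == "__main__":
    print("reachable" if reachable_via_tor(TARGET_HOST, TARGET_PORT) else "unreachable")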
ID: 1156651
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1156652 - Posted: 27 Sep 2011, 15:39:48 UTC

No ups or downs right now...
One outage....coming up.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1156652
Dave

Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 1156656 - Posted: 27 Sep 2011, 21:45:16 UTC

Aaand we're through
ID: 1156656
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1156657 - Posted: 27 Sep 2011, 21:45:57 UTC

I got myself 13 or 14 APs just before things went down earlier today. Don't know what the DL speed was since I was sleeping, but I woke up to find that I had a nice list of "waiting to run."
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1156657
Lint trap

Joined: 30 May 03
Posts: 871
Credit: 28,092,319
RAC: 0
United States
Message 1156658 - Posted: 27 Sep 2011, 21:46:27 UTC - in response to Message 1156656.  

Aaand we're through



Seems like it...

Lt

ID: 1156658
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1156662 - Posted: 27 Sep 2011, 21:50:58 UTC

Remember how the server problems started when task estimates jumped up to several hours?

Preliminary observations are that estimates for work issued since the outage have started to move back towards normality. I haven't got enough yet to measure how big the change is, but the last plan we heard about was to make a five-fold step change this first time.

So, for the time being, you may get five times as much work as you expect - you have been warned ;-)

Of course, the quota limits of 50/CPU and 400/GPU should still be in place to stop things going off scale - but I haven't been able to check that yet.
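
As a rough illustration of the mechanics (not the project's actual code): the client's runtime estimate is essentially the task's flops estimate divided by the device speed, scaled by the duration correction factor (DCF), so a five-fold server-side step in the flops estimate shrinks the estimates, and inflates the amount of work a day-based cache asks for, by roughly the same factor. A minimal sketch with made-up numbers:

# Rough model of a BOINC-style runtime estimate (illustrative numbers only).
def estimated_runtime(rsc_fpops_est: float, device_flops: float, dcf: float) -> float:
    """Estimated wall-clock seconds: task flops estimate / device speed * DCF."""
    return rsc_fpops_est / device_flops * dcf

DEVICE_FLOPS = 50e9    # a hypothetical 50 GFLOPS device
DCF = 0.04             # a depressed DCF, like the values reported later in this thread

before = estimated_runtime(2.5e15, DEVICE_FLOPS, DCF)       # inflated flops estimate
after  = estimated_runtime(2.5e15 / 5, DEVICE_FLOPS, DCF)   # after a five-fold step down

print(f"estimate before: {before/60:.1f} min, after: {after/60:.1f} min")
# Estimates a fifth of the size mean a cache set in days asks for roughly five
# times as many tasks, which is where the 50/CPU and 400/GPU quotas come in.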
ID: 1156662
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1156664 - Posted: 27 Sep 2011, 22:04:17 UTC - in response to Message 1156662.  
Last modified: 27 Sep 2011, 22:06:12 UTC

Remember how the server problems started when task estimates jumped up to several hours?

Preliminary observations are that estimates for work issued since the outage have started to move back towards normality. I haven't got enough yet to measure how big the change is, but the last plan we heard about was to make a five-fold step change this first time.

So, for the time being, you may get five times as much work as you expect - you have been warned ;-)

Of course, the quota limits of 50/CPU and 400/GPU should still be in place to stop things going off scale - but I haven't been able to check that yet.

I can confirm that too. I snipped my two Astropulse tasks out of my client_state.xml and got them resent; their estimated runtimes matched what I expected.
I then got all my recent MB WUs resent, and they now match WUs I got just before the bodged changeset. Now going to do my remaining shorties to get my DCF back up to one.

Claggy
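
For anyone wondering what that involves, a rough sketch of the idea is below (not a recommended procedure: work on a backup copy, and only with the BOINC client stopped, since a damaged state file can lose the whole cache). It assumes tasks appear in client_state.xml as <result> elements with a <name> child; the task name in the sketch is purely hypothetical. On the next scheduler contact the server should notice the tasks are missing and, if lost-result resending is enabled, send them out again.

# Sketch: drop named <result> entries from a backup copy of client_state.xml.
# Run only with the BOINC client stopped, and keep the original file safe;
# a malformed state file can lose every task in the cache.
import xml.etree.ElementTree as ET

STATE_FILE = "client_state.xml"                               # in the BOINC data directory
TO_REMOVE = {"ap_27se11aa_B3_P0_00123_20110927_12345.wu_1"}   # hypothetical task name(s)

tree = ET.parse(STATE_FILE)
root = tree.getroot()

for result in list(root.iter("result")):
    if result.findtext("name", default="") in TO_REMOVE:
        # ElementTree has no parent pointers, so find the element's parent by search.
        for parent in root.iter():
            if result in list(parent):
                parent.remove(result)
                break

tree.write(STATE_FILE + ".edited")                            # write to a new file for safety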
ID: 1156664
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1156671 - Posted: 27 Sep 2011, 22:22:12 UTC - in response to Message 1156664.  
Last modified: 27 Sep 2011, 22:29:07 UTC

Remember how the server problems started when task estimates jumped up to several hours?

Preliminary observations are that estimates for work issued since the outage have started to move back towards normality. I haven't got enough yet to measure how big the change is, but the last plan we heard about was to make a five-fold step change this first time.

So, for the time being, you may get five times as much work as you expect - you have been warned ;-)

Of course, the quota limits of 50/CPU and 400/GPU should still be in place to stop things going off scale - but I haven't been able to check that yet.

I can confirm that too. I snipped my two Astropulse tasks out of my client_state.xml and got them resent; their estimated runtimes matched what I expected.
I then got all my recent MB WUs resent, and they now match WUs I got just before the bodged changeset. Now going to do my remaining shorties to get my DCF back up to one.

Claggy

ALL THE WAY in one jump?? !!

What did the DCF get down to on that box? If it was below 0.1, won't we be in -177 territory? One of my 9800GTs is showing DCF=0.0391 - I'd better go and nudge it into a fetch. Back soon.....

Edit - got some already. Showing 1 minute 15 seconds for a shorty, 4 m 15 s for mid-AR - I'd expect 4 minutes plus / around 20 minutes respectively, so we're not getting the whole DCF correction back in one go. Phew - may be a bit rocky, but not enough for errors. Panic over.
ID: 1156671
Lint trap

Joined: 30 May 03
Posts: 871
Credit: 28,092,319
RAC: 0
United States
Message 1156674 - Posted: 27 Sep 2011, 22:27:58 UTC



My DCF has sky-rocketed up to 0.19xx from where it was this morning, between 0.01xxx and 0.02xxx.

GPU APs still have longer estimates than CPU APs, though...

No new work received since the outage ended.

Lt

ID: 1156674
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1156679 - Posted: 27 Sep 2011, 22:41:12 UTC - in response to Message 1156671.  
Last modified: 27 Sep 2011, 22:45:36 UTC

Remember how the server problems started when task estimates jumped up to several hours?

Preliminary observations are that estimates for work issued since the outage have started to move back towards normality. I haven't got enough yet to measure how big the change is, but the last plan we heard about was to make a five-fold step change this first time.

So, for the time being, you may get five times as much work as you expect - you have been warned ;-)

Of course, the quota limits of 50/CPU and 400/GPU should still be in place to stop things going off scale - but I haven't been able to check that yet.

I can confirm that too. I snipped my two Astropulse tasks out of my client_state.xml and got them resent; their estimated runtimes matched what I expected.
I then got all my recent MB WUs resent, and they now match WUs I got just before the bodged changeset. Now going to do my remaining shorties to get my DCF back up to one.

Claggy

ALL THE WAY in one jump?? !!

What did the DCF get down to on that box? If it was below 0.1, won't we be in -177 territory? One of my 9800GTs is showing DCF=0.0391 - I'd better go and nudge it into a fetch. Back soon.....


I've always had flops in my app_info's (except for the HD5770), so DCF didn't go down as low as others did:

27/09/2011 23:00:02 SETI@home [dcf] DCF: 0.175808->1.307273, raw_ratio 1.307273, adj_ratio 7.435800

And I've been doing pre- and post-change WUs (suspending the old WUs): first doing old WUs with DCF ~1, then doing the post-dodgy-changeset WUs with DCF ~0.18,
then swapping back to doing old WUs again.

Now I'm rushing through my resent shorties in ~2 minutes, DCF climbing fast, APR rates look normal.

Edit: Richard, remember I'm running Jason's special Boinc.exe, and I haven't had any AP since before the bodged changeset, so DCF there was ~1.

Claggy
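
The [dcf] log line above shows the client pulling DCF straight up to the raw runtime/estimate ratio in a single step. A loose sketch of the behaviour people describe here, an approximation rather than the client's actual code: jump up quickly when a task runs much longer than its (pre-DCF) estimate, and ease back down only gradually when tasks finish early.

# Illustrative only, not BOINC's real algorithm: DCF jumps up when a task
# overruns its base estimate, and creeps back down slowly when tasks finish early.
def update_dcf(dcf: float, actual_secs: float, base_estimate_secs: float) -> float:
    raw_ratio = actual_secs / base_estimate_secs
    if raw_ratio > dcf:
        return raw_ratio                      # straight up, as in the log line above
    return dcf + 0.1 * (raw_ratio - dcf)      # ease down a tenth of the gap per task

dcf = 0.1758                                  # roughly the starting value in the log
for actual, est in [(1307.0, 1000.0), (120.0, 1000.0), (118.0, 1000.0)]:
    dcf = update_dcf(dcf, actual, est)
    print(f"task ran {actual:.0f}s against a {est:.0f}s base estimate -> DCF {dcf:.4f}")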
ID: 1156679
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1156682 - Posted: 27 Sep 2011, 22:52:34 UTC - in response to Message 1156679.  

Remember how the server problems started when task estimates jumped up to several hours?

Preliminary observations are that estimates for work issued since the outage have started to move back towards normality. I haven't got enough yet to measure how big the change is, but the last plan we heard about was to make a five-fold step change this first time.

So, for the time being, you may get five times as much work as you expect - you have been warned ;-)

Of course, the quota limits of 50/CPU and 400/GPU should still be in place to stop things going off scale - but I haven't been able to check that yet.

I can confirm that too. I snipped my two Astropulse tasks out of my client_state.xml and got them resent; their estimated runtimes matched what I expected.
I then got all my recent MB WUs resent, and they now match WUs I got just before the bodged changeset. Now going to do my remaining shorties to get my DCF back up to one.

Claggy

ALL THE WAY in one jump?? !!

What did the DCF get down to on that box? If it was below 0.1, won't we be in -177 territory? One of my 9800GTs is showing DCF=0.0391 - I'd better go and nudge it into a fetch. Back soon.....


I've always had flops in my app_info's (except for the HD5770), so DCF didn't go down as low as others did:

27/09/2011 23:00:02 SETI@home [dcf] DCF: 0.175808->1.307273, raw_ratio 1.307273, adj_ratio 7.435800

And I've been doing pre- and post-change WUs (suspending the old WUs): first doing old WUs with DCF ~1, then doing the post-dodgy-changeset WUs with DCF ~0.18,
then swapping back to doing old WUs again.

Now I'm rushing through my resent shorties in ~2 minutes, DCF climbing fast, APR rates look normal.

Claggy

I'm deliberately running without flops so I can watch what the server's doing cleanly.

That DCF of 0.0391 was set by a mid-AR which completed in 1,279.42 seconds. The new estimate of 4:15 (255 seconds) is almost exactly five times smaller.

I've completed the first task from the new estimate set - a shorty, so we wouldn't expect DCF to reset all the way, but it's come up to 0.1310 - that's fair enough.

I'd expect fast Fermis to scrape above 0.02, so once they get and complete their first single WU, I would expect them to start fetching normally again - but still to have DCF well below 0.1, so it won't be safe to remove the cap completely next week. We'll need at least one more interim step.
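
A quick arithmetic check of the ratios above, using only numbers quoted in these posts:

# Checking the ratios quoted above.
actual_mid_ar = 1279.42              # seconds the mid-AR task really took
new_estimate  = 4 * 60 + 15          # 4 m 15 s = 255 s, the post-change estimate
print(actual_mid_ar / new_estimate)  # ~5.0: the estimate is about a fifth of reality,
                                     # matching the announced five-fold interim step

dcf_before, dcf_after = 0.0391, 0.1310   # before and after completing one shorty
print(dcf_after / dcf_before)            # ~3.4: DCF recovering, but still well below 1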
ID: 1156682
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1156717 - Posted: 28 Sep 2011, 0:47:07 UTC

Definitely looks like ETAs are way down. Looks like ~33% of what they should be normally.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1156717
Gary Charpentier · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 31013
Credit: 53,134,872
RAC: 32
United States
Message 1156719 - Posted: 28 Sep 2011, 0:55:57 UTC

Here we go again ...

Tue Sep 27 17:54:00 2011 SETI@home Temporarily failed download of 23jn11ae.3946.24198.15.10.62: HTTP error
Tue Sep 27 17:54:00 2011 SETI@home Backing off 1 min 0 sec on download of 23jn11ae.3946.24198.15.10.62
Tue Sep 27 17:54:02 2011 SETI@home Temporarily failed download of 23jn11ae.3946.24198.15.10.99: HTTP error
Tue Sep 27 17:54:02 2011 SETI@home Backing off 1 min 0 sec on download of 23jn11ae.3946.24198.15.10.99
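
Those "Backing off" lines are the client's normal reaction to a failed transfer: retry later, with the delay growing on repeated failures. A minimal sketch of that pattern (illustrative, not BOINC's actual transfer code; the URL handling is generic):

# Illustrative retry-with-backoff loop, similar in spirit to the client's
# "Backing off ..." behaviour after a failed download (not BOINC's real code).
import random
import time
import urllib.request
from urllib.error import URLError

def fetch_with_backoff(url: str, max_tries: int = 5) -> bytes:
    delay = 60.0                                     # start at one minute, as in the log
    for attempt in range(1, max_tries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except URLError as err:
            if attempt == max_tries:
                raise
            wait = delay * random.uniform(0.5, 1.5)  # jitter so clients don't retry in sync
            print(f"attempt {attempt} failed ({err}); backing off {wait:.0f} s")
            time.sleep(wait)
            delay *= 2                               # roughly double the delay each failure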

ID: 1156719
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 66354
Credit: 55,293,173
RAC: 49
United States
Message 1156721 - Posted: 28 Sep 2011, 1:00:02 UTC - in response to Message 1156719.  

I had that happen to me a few hours back; about a minute ago, though, I was able to snag 14 WUs for the two GTX295 cards in this PC. I have a cache of 1.25 days.
Savoir-Faire is everywhere!
The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST

ID: 1156721
Terror Australis
Volunteer tester

Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 1156741 - Posted: 28 Sep 2011, 2:54:47 UTC

And the Network is UP!!!
I'm now able to get through without a proxy. Not getting a lot of work due to the logjam, but when I do, the downloads "hoot", with peak speeds of up to 30 KB/s. Hopefully the fix is in.

DCFs are all over the place, from 0.09 to 1.4. (Don't know why; I disabled DCF correction in my rescheduling program.) Estimated GPU crunching times on new units look pretty good, but CPU times are about 3x.

T.A.
ID: 1156741
Wandering Willie
Volunteer tester

Joined: 19 Aug 99
Posts: 136
Credit: 2,127,073
RAC: 0
United Kingdom
Message 1156775 - Posted: 28 Sep 2011, 8:15:27 UTC

Quick question: I just received 30 short WUs after the outage, all resends of tasks that timed out on 27/09/2011.

Should I leave these for an hour or two to let the replica database catch up, or will they be okay to crunch? (13,358 seconds)

Deadline for these is 11/10/2011.

Michael
ID: 1156775
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19402
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1156777 - Posted: 28 Sep 2011, 8:27:28 UTC - in response to Message 1156775.  
Last modified: 28 Sep 2011, 8:28:00 UTC

I'd crunch them, and if you're worried about the replica catching up, hit the Activity menu and suspend network activity. That will suspend everything on the network, so no requests and no d/loads either.
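
For anyone who prefers to do that from a script rather than the Manager, the same thing can be done through boinccmd; a minimal sketch, assuming boinccmd is on the PATH and the client accepts local RPC:

# Sketch: suspend/resume BOINC network activity via boinccmd
# (the command-line equivalent of Activity -> Suspend network activity).
import subprocess

def set_network_mode(mode: str) -> None:
    """mode is one of 'always', 'auto' or 'never'."""
    subprocess.run(["boinccmd", "--set_network_mode", mode], check=True)

set_network_mode("never")   # stop all scheduler requests, uploads and downloads
# ... crunch the resent tasks ...
set_network_mode("auto")    # let the client manage network activity again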
ID: 1156777
Wandering Willie
Volunteer tester

Joined: 19 Aug 99
Posts: 136
Credit: 2,127,073
RAC: 0
United Kingdom
Message 1156778 - Posted: 28 Sep 2011, 8:36:26 UTC - in response to Message 1156777.  

Thank you.

It was just in case they had already been completed.



Michael
ID: 1156778
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1156810 - Posted: 28 Sep 2011, 12:40:36 UTC
Last modified: 28 Sep 2011, 12:47:29 UTC

Current server status shows replica DB caught up...
Looks like AP work is going out again, so bandwidth and downloads are stuffed.
No MB left to split until the AP stuff goes out....my GPUs are gonna starve.

Hang in there, kitties.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1156810