Panic Mode On (78) Server Problems?




Profile Fred E.Project donor
Volunteer tester
Send message
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 0
United States
Message 1305811 - Posted: 13 Nov 2012, 20:09:03 UTC

Yay! Back from normal Tuesday time-out. (btw, people in lab are really morning people...)

Let's see what comes next... Cricket on top now, AP splitting disabled... Let's see and hope for better...


Yes, they took it down just before 6am California time. Doesn't look any better to me. Had timeouts on work requests, so I went NNT and some of those timed out, but I finally reported my completions. First work request generated some new ghosts and I haven't got them yet. Down to 1.5 hrs of gpu work.
____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

WezH
Volunteer tester
Send message
Joined: 19 Aug 99
Posts: 250
Credit: 6,082,203
RAC: 44,294
Finland
Message 1305822 - Posted: 13 Nov 2012, 20:30:01 UTC - in response to Message 1305811.

Yes, they took it down just before 6am California time. Doesn't look any better to me. Had timeouts on work requests, so I went NNT and some of those timed out, but I finally reported my completions. First work request generated some new ghosts and I haven't got them yet. Down to 1.5 hrs of gpu work.


My computer's logs say they took it down before 5:25; that's when the first "project in maintenance" message appeared. I don't like to wake up that early...

Well, timeouts are normal after a maintenance period, IMHO.

Let's wait about 12 hours and hope for better.

Profile Fred E.Project donor
Volunteer tester
Send message
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 0
United States
Message 1305836 - Posted: 13 Nov 2012, 21:02:54 UTC

Disappointed that the scheduler assigned me some more work before I got the ghosts back, so I have more ghosts now. Not a lot, and the limits will contain it. But that's what got us the limits in the first place: the scheduler should handle ghosts first, and this just showed me that it is not fixed. I did get the first 20 resends, and they weren't all shorties, so that's a help.
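
For context, the "ghosts" mentioned throughout this thread are tasks the scheduler has recorded as assigned to a host but that the host never actually received (the reply got lost, e.g. to a timeout). Conceptually, the resend-ghosts-first behaviour Fred is asking for looks something like the sketch below (illustrative only; not the actual BOINC scheduler code, and all identifiers are hypothetical):

```python
# Illustrative sketch only: how "ghost" tasks arise and why resending them
# before assigning new work matters. Not the actual BOINC scheduler code.

def find_ghosts(server_in_progress: set[str], client_has: set[str]) -> set[str]:
    """Tasks the server believes the host holds, but the host never received."""
    return server_in_progress - client_has

def build_reply(ghosts: set[str], allowance: int, new_work_queue: list[str]) -> list[str]:
    """Resend lost tasks first, then top up with new work within the allowance."""
    resend = sorted(ghosts)
    fresh = new_work_queue[:max(0, allowance - len(resend))]
    return resend + fresh

# Example: host reports 2 tasks, server thinks it has 5 -> 3 ghosts to resend.
ghosts = find_ghosts({"wu1", "wu2", "wu3", "wu4", "wu5"}, {"wu1", "wu2"})
print(build_reply(ghosts, 5, ["new1", "new2", "new3", "new4"]))
```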
____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

N9JFE David SProject donor
Volunteer tester
Avatar
Send message
Joined: 4 Oct 99
Posts: 12672
Credit: 15,018,684
RAC: 9,710
United States
Message 1305846 - Posted: 13 Nov 2012, 21:26:02 UTC - in response to Message 1305811.

Yay! Back from normal Tuesday time-out. (btw, people in lab are really morning people...)

Yes, they took it down just before 6am California time.

That's unusual. Normally, they get in at 8am and start the maintenance some time between 8:30 and 9:00. Looks to me like it didn't run as late as it usually does, but the total time was more than normal.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Profile [seti.international] Dirk Sadowski
Volunteer tester
Avatar
Send message
Joined: 6 Apr 07
Posts: 7124
Credit: 61,633,081
RAC: 15,724
Germany
Message 1305848 - Posted: 13 Nov 2012, 21:43:20 UTC
Last modified: 13 Nov 2012, 21:47:23 UTC

From Message 1302257:

Richard Haselgrove wrote:
I've just had a note back from Eric:

I've stopped the splitters and doubled the httpd timeout...

I think we're going to need to at least temporarily go back
to restricting workunits in progress on a per host basis and per RPC
basis, regardless of what complaints we get about people being unable
to keep their hosts busy.

The splitters are already showing red/orange on the server status page, and 'ready to send' is as near zero as makes no difference (there'll always be a few errors and timeouts to resend). So I'm going to turn off NNT and see what happens - let's see if we can help get this beast back under control.


Just a repeat of the message from Eric (S@h admin) .. ;-)
(...)
I think we're going to need to at least temporarily go back
to restricting workunits in progress on a per host basis and per RPC
basis, regardless of what complaints we get about people being unable
to keep their hosts busy.
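
In practical terms, "per host" and "per RPC" limits mean the scheduler caps both how many tasks a machine may hold in total and how many it hands out per request. Below is a minimal sketch of that idea with made-up numbers; it is not the real BOINC scheduler code, and the limit values are illustrative assumptions only:

```python
# Hypothetical sketch of the per-host / per-RPC limiting Eric describes.
# Names and limit values are assumptions for illustration.

MAX_IN_PROGRESS_PER_HOST = 100   # tasks one host may hold at a time
MAX_TASKS_PER_RPC = 20           # tasks handed out per scheduler request

def tasks_to_send(in_progress_on_host: int, requested: int) -> int:
    """How many new tasks this host gets from this scheduler request."""
    headroom = MAX_IN_PROGRESS_PER_HOST - in_progress_on_host
    if headroom <= 0:
        return 0                 # host is already at its in-progress limit
    return min(requested, headroom, MAX_TASKS_PER_RPC)

# A host holding 95 tasks that asks for 50 more only gets 5.
print(tasks_to_send(95, 50))     # -> 5
```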



* Best regards! :-) * Sutaru Tsureku, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. *
____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.

Lionel
Send message
Joined: 25 Mar 00
Posts: 588
Credit: 243,494,567
RAC: 146,254
Australia
Message 1305894 - Posted: 13 Nov 2012, 23:20:57 UTC - in response to Message 1305848.


Scheduler request failed: Error 403.


____________

Keith White
Avatar
Send message
Joined: 29 May 99
Posts: 372
Credit: 3,015,361
RAC: 2,215
United States
Message 1305909 - Posted: 13 Nov 2012, 23:43:03 UTC - in response to Message 1305773.

I was just talking about one of the rigs that recently got CPU units. You still had around 1500 GPU units for the 3 GPUs. At 500 seconds per GPU unit that's nearly 3 days' worth left. Even if you get down to 100 per GPU, that's still half a day's worth. What did you normally run your queue at? 10 days.

It doesn't make a difference in bandwidth usage in the long run once the whole seti@home ecosystem hits steady state; it'll just mean that when a super cruncher's nVidia card goes off the rails, they can only shaft at most 100 wingmen per GPU as opposed to thousands. (Please check your results daily, not directed at you msattler, just nVidia users in general, to catch when your system starts to produce mostly inconclusive/error/invalid GPU results.)

Each 690 crunches a WU in less than 7 minutes, running 3 WUs at a time on each GPU (it has 2), so about 48 per hour or more. For a big cruncher (3x690) a 100 WU cache is simply ridiculous; it won't last an hour. I have 2x690s sleeping on a bed waiting for them to raise the limits. With the current limits it's a waste of time/resources to put them to work; they simply won't receive the WUs they need to keep working.

That's true. But every 5 minutes it asks for more to top it back off. It's not 100 per day or per hour, but 100 per GPU, isn't it? Is it seti@home's fault that someone clever discovered that you could run multiple GPU units at the same time per GPU and shared it with others? Is it s@h's fault that the campus IT department only has a 100Mb line going out to their shack, capping new units transmitted at something around 80-100,000 per hour in theory?
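
To put rough numbers on both points, here is a back-of-the-envelope check using assumed figures (the ~7-minute runtime and 3-at-a-time setting quoted above, plus an assumed average workunit download size of about 0.37 MB; none of these are official project numbers):

```python
# Back-of-the-envelope check of the throughput claims in this exchange.
# All inputs are assumptions for illustration, not project-published figures.

minutes_per_task = 7            # assumed GPU task runtime
concurrent_per_gpu = 3          # tasks run side by side on one GPU
gpus_per_card = 2               # a GTX 690 is a dual-GPU card

per_gpu_per_hour = concurrent_per_gpu * 60 / minutes_per_task
per_card_per_hour = per_gpu_per_hour * gpus_per_card
print(f"~{per_card_per_hour:.0f} tasks/hour per GTX 690")            # ~51

cache_limit_per_gpu = 100
print(f"a 100-per-GPU cache lasts ~{cache_limit_per_gpu / per_gpu_per_hour:.1f} h")  # ~3.9

link_mbit = 100                 # assumed outbound link speed
wu_size_mb = 0.37               # assumed average workunit download size
ceiling = (link_mbit / 8) / wu_size_mb * 3600
print(f"theoretical ceiling ~{ceiling:,.0f} WUs/hour")                # ~122,000
```

With protocol and database overhead the practical rate would be lower, roughly in line with the 80-100,000 per hour quoted above; and whichever way the 100-task cap is read (per GPU or per host), a multi-690 box still drains its cache in a few hours at most.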

[rant]

A project that started out as "Mister, could you spare a few cycles for a good cause?" has turned into yet another professional-amateur "sport", where some people have gone nuts building dedicated crunching servers for thousands of dollars, but then let them churn out endless bad results because the setup isn't entirely stable and they only check in on it if their precious RAC starts to drop. Then some super crunchers turn around and blame s@h for bad unit generation, or blame people like me, the tiny guys who let their $500 home computer run 24/7, for "stealing" the units that are "rightfully" theirs to process, because they spent all this money simply to brag that they have one of the top 10 daily RACs. They blame s@h for running out of units, or complain that its server infrastructure isn't as robust as, say, Amazon's.

Well, I am sorry that you now have to fret that your vast array of super crunchers might run dry. That you can't sit on 20+ days' worth of units, because God forbid your precious array of machines should run dry for even a moment. Welcome, all you crunching gods, to the land of mere mortals.

[/rant]
____________
"Life is just nature's way of keeping meat fresh." - The Doctor

zoom314Project donor
Volunteer tester
Avatar
Send message
Joined: 30 Nov 03
Posts: 47152
Credit: 37,086,119
RAC: 4,407
United States
Message 1305929 - Posted: 14 Nov 2012, 0:26:04 UTC - in response to Message 1305894.


Scheduler request failed: Error 403.


Same here, so at least Milkyway is up, for the moment...
____________
My Facebook, War Commander, 2015

Lionel
Send message
Joined: 25 Mar 00
Posts: 588
Credit: 243,494,567
RAC: 146,254
Australia
Message 1305971 - Posted: 14 Nov 2012, 2:47:02 UTC

These are the forerunners of the GTX 7xx series ...

http://www.dvhardware.net/article56628.html

If you think there are problems with the scheduler at the moment, wait till the GTX 7xx cards become widespread ... a GTX 780 is about 40-50% faster than a GTX 680, or roughly equal to 0.75 times a GTX 690 ... personally, I think these cards are going to hum ...
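
A quick consistency check of those two ways of stating the claim, assuming (as the dual-GPU design implies) that a GTX 690 does roughly twice the work of a GTX 680; figures are relative and illustrative only:

```python
# Relative-speed sanity check of the claim above; all values are assumptions.
gtx680 = 1.0
gtx690 = 2.0 * gtx680            # assumed: dual-GPU card ~= two GTX 680s
gtx780_claimed = 0.75 * gtx690   # "roughly 0.75 times a GTX 690"
print(gtx780_claimed / gtx680)   # -> 1.5, i.e. ~50% faster than a GTX 680
```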

cheers
____________

WinterKnight
Volunteer tester
Send message
Joined: 18 May 99
Posts: 8780
Credit: 25,956,595
RAC: 16,971
United Kingdom
Message 1306048 - Posted: 14 Nov 2012, 9:16:49 UTC

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5954
Credit: 62,479,249
RAC: 40,251
Australia
Message 1306050 - Posted: 14 Nov 2012, 9:22:20 UTC - in response to Message 1306048.

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.

I'm still getting the odd one here & there, but mostly I'm getting a response within a minute or so.
____________
Grant
Darwin NT.

juan BFBProject donor
Volunteer tester
Avatar
Send message
Joined: 16 Mar 07
Posts: 5489
Credit: 316,439,239
RAC: 134,628
Brazil
Message 1306052 - Posted: 14 Nov 2012, 9:24:15 UTC - in response to Message 1306050.

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.

I'm still getting the odd one here & there, but mostly I'm getting a response within a minute or so.

Did you see the server status page? AP splitting was turned ON...
____________

Profile Mad Fritz
Avatar
Send message
Joined: 20 Jul 01
Posts: 87
Credit: 11,334,904
RAC: 0
Switzerland
Message 1306060 - Posted: 14 Nov 2012, 10:24:04 UTC

No luck since the AP splitters are online again... just timeouts, even with NNT :-(
____________

Profile Fred E.Project donor
Volunteer tester
Send message
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 0
United States
Message 1306063 - Posted: 14 Nov 2012, 11:25:42 UTC

I got up early and found I had run out of GPU tasks overnight. I was at the limits last night. Had a couple of hung downloads; when I finally got them, they were short shorties that ran two minutes each. Reported my stack of results on NNT and then generated a new batch of ghosts. Trying to get them, but it's mostly timeouts. The low GPU limit hurts the project, not me. I'll probably give up on intervention and run another project like the others.
____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

Paul BowyerProject donor
Volunteer tester
Send message
Joined: 15 Aug 99
Posts: 9
Credit: 77,933,752
RAC: 91,912
United States
Message 1306072 - Posted: 14 Nov 2012, 12:19:29 UTC - in response to Message 1306063.

Same here - back to "Not now, honey, I've got a headache" mode.
Seems pretty clear from here that the APs are holding the smoking gun.
____________

N9JFE David SProject donor
Volunteer tester
Avatar
Send message
Joined: 4 Oct 99
Posts: 12672
Credit: 15,018,684
RAC: 9,710
United States
Message 1306097 - Posted: 14 Nov 2012, 14:24:02 UTC - in response to Message 1306072.

Same here - back to "Not now, honey, I've got a headache" mode.
Seems pretty clear from here that the APs are holding the smoking gun.

APs may not be the only problem, but they're certainly making it worse.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Profile Bill GProject donor
Avatar
Send message
Joined: 1 Jun 01
Posts: 349
Credit: 44,426,007
RAC: 24,978
United States
Message 1306101 - Posted: 14 Nov 2012, 14:50:01 UTC - in response to Message 1306097.

Same here - back to "Not now, honey, I've got a headache" mode.
Seems pretty clear from here that the APs are holding the smoking gun.

APs may not be the only problem, but they're certainly making it worse.

You are right there... the server status page is not updating now.

____________

Cherokee150
Send message
Joined: 11 Nov 99
Posts: 112
Credit: 25,678,123
RAC: 7,599
United States
Message 1306107 - Posted: 14 Nov 2012, 15:11:58 UTC

I noticed something that may be very significant.

In looking back at what SETI was doing with my computers just before the current problems began, I discovered that, while everything was deteriorating, SETI was sending all four of my machines enough units for a -month- of processing each! Even after squeezing every last cycle I can into processing, I still have around ten days worth of GPU units and quite a few days of CPU units left to process on most of my hosts.

While I did have my cache set to 10 days to give me a cushion for emergencies, I have -never- been sent too many units before.

If many of us were getting the same overload from SETI, it would most likely explain many of the symptoms we were seeing. The throughput demand to crank thousands of us up to caches of that size would most certainly run the SETI servers into the ground.

Perhaps this will help shed light on the problem, and also on why the SETI staff has (temporarily, I hope) limited us to 100/CPU and 100/GPU.
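
One plausible mechanism for a 10-day cache turning into a month of work on hand (a sketch with made-up numbers, not a diagnosis of these particular hosts) is that the work request is expressed in estimated runtime, so estimates that are optimistic for a given machine inflate the number of tasks sent:

```python
# Illustrative sketch with assumed numbers: how a 10-day work request can
# become ~30 days of real work when runtime estimates are too optimistic.

cache_days_requested = 10
estimated_hours_per_task = 0.5   # what the server thinks a task takes here
actual_hours_per_task = 1.5      # what it really takes on this host

hours_requested = cache_days_requested * 24
tasks_sent = hours_requested / estimated_hours_per_task          # 480 tasks
real_days_on_hand = tasks_sent * actual_hours_per_task / 24      # ~30 days
print(f"{tasks_sent:.0f} tasks sent -> ~{real_days_on_hand:.0f} days of real work")
```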

WezH
Volunteer tester
Send message
Joined: 19 Aug 99
Posts: 250
Credit: 6,082,203
RAC: 44,294
Finland
Message 1306118 - Posted: 14 Nov 2012, 16:16:50 UTC

Yesterday, after maintenance, I noticed that one of my machines had ghosts, so I disabled NNT and decided to "set it and forget it".

This is what happened:

13/11/2012 22:09:48 | SETI@home | Requesting new tasks for CPU
14/11/2012 16:03:33 | SETI@home | Scheduler request completed: got 20 new tasks

So it took almost 18 hours and several timeouts to resend 20 of those lost tasks.

And 29 are still missing two hours later; all I've got is server timeouts.

I've just decided to "forget" another cruncher as well, and wait and see what happens...

____________
"Please keep Your signature under four lines so Internet traffic doesn't go up too much"

- In 1992 when I had my first e-mail address -

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5954
Credit: 62,479,249
RAC: 40,251
Australia
Message 1306147 - Posted: 14 Nov 2012, 18:11:22 UTC - in response to Message 1306050.
Last modified: 14 Nov 2012, 18:17:52 UTC

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.

I'm still getting the odd one here & there, but mostly I'm getting a response within a minute or so.

And overnight I ran out of GPU work on both of my systems, because I got nothing but timeout errors on every scheduler request.

I set NNT; one system managed to report, the other is still getting timeouts.
After clearing the backlog on one system, I set it to get new work. Nothing but scheduler timeouts, and the other system still hasn't been able to report its work. I expect to run out of CPU work on one system in the next couple of hours, and on the other later today, only because it is so slow.



I think they need to keep AP offline till they work out what the problem with the scheduler is; limiting the number of tasks hasn't fixed it. It's barely even had an effect.
They really do need to address the underlying problem; the workaround (limiting tasks) has done nothing except result in people running out of work.
____________
Grant
Darwin NT.


