Panic Mode On (78) Server Problems?

Profile Fred E.
Volunteer tester

Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1305836 - Posted: 13 Nov 2012, 21:02:54 UTC

Disappointed that the scheduler assigned some more work before I got the ghosts back, so I have more ghosts now. Not a lot, and the limits will contain it, but that's what got us the limits in the first place - the scheduler should handle ghosts first, and this just showed me that it is not fixed. I got the first 20 resends and they weren't all shorties, so that's a help.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1305836
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1305846 - Posted: 13 Nov 2012, 21:26:02 UTC - in response to Message 1305811.  

Yay! Back from normal Tuesday time-out. (btw, people in lab are really morning people...)

Yes, they took it down just before 6am California time.

That's unusual. Normally they get in at 8am and start the maintenance sometime between 8:30 and 9:00. Looks to me like it didn't run as late as it usually does, but the total downtime was longer than normal.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1305846
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1305848 - Posted: 13 Nov 2012, 21:43:20 UTC
Last modified: 13 Nov 2012, 21:47:23 UTC

Richard Haselgrove wrote (in message 1302257):
I've just had a note back from Eric:

I've stopped the splitters and doubled the httpd timeout...

I think we're going to need to at least temporarily go back
to restricting workunits in progress on a per host basis and per RPC
basis, regardless of what complaints we get about people being unable
to keep their hosts busy.

The splitters are already showing red/orange on the server status page, and 'ready to send' is as near zero as makes no difference (there'll always be a few errors and timeouts to resend). So I'm going to turn off NNT and see what happens - let's see if we can help get this beast back under control.


Just a repeat of Eric's (S@h admin) message .. ;-)
(...)
I think we're going to need to at least temporarily go back
to restricting workunits in progress on a per host basis and per RPC
basis, regardless of what complaints we get about people being unable
to keep their hosts busy.



* Best regards! :-) * Sutaru Tsureku, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. *
ID: 1305848
Lionel

Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1305894 - Posted: 13 Nov 2012, 23:20:57 UTC - in response to Message 1305848.  


Scheduler request failed: Error 403.


ID: 1305894
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 1305909 - Posted: 13 Nov 2012, 23:43:03 UTC - in response to Message 1305773.  

I was just talking about one of the rigs that recently got CPU units. You still had around 1500 GPU units for the 3 GPUs. At ~500 seconds per GPU unit that's nearly 3 days' worth left. Even if you get down to 100 per GPU, that's still half a day's worth. What did you normally run your queue at? 10 days.

It doesn't make a difference in bandwidth usage in the long run once the whole SETI@home ecosystem hits steady state; it'll just mean that when a super cruncher's nVidia card goes off the rails they can only shaft at most 100 wingmen per GPU as opposed to thousands. (Please check your results daily - not directed at you, msattler, just at nVidia users in general - to catch when your system starts to produce mostly inconclusive/error/invalid GPU results.)

Each 690 crunches a WU in less than 7 minutes running 3 WUs at a time on each GPU (it has 2) - about 48 per hour or more - so on a big cruncher (3x690) a 100 WU cache is simply ridiculous; it won't last an hour. I have 2x690s sitting on a bed waiting for the limits to be raised; with the current limits it's a waste of time/resources to put them to work, since they simply won't receive the WUs they need to keep working.

That's true. But every 5 minutes it asks for more to top it back off. And it's not 100 per day or per hour, it's 100 per GPU, isn't it? Is it SETI@home's fault that someone clever discovered you could run multiple GPU units at the same time per GPU and shared it with others? Is it S@H's fault that the campus IT department only has a 100Mb line going out to their shack, capping new units transmitted at something around 80-100,000 per hour in theory?
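A quick back-of-the-envelope check of those figures, as a Python sketch (the ~366 KB per task is an assumption for illustration; the 500-second runtime, the 7-minute GTX 690 figure, and the 100Mb line come from the posts above):

[code]
# Back-of-the-envelope figures for the cache and bandwidth claims above (Python).

SECONDS_PER_GPU_TASK = 500        # ~500 s per GPU task (Keith's figure)
CACHED_TASKS = 1500               # tasks on hand, spread across 3 GPUs
GPUS = 3

# How long a 1500-task cache lasts on a 3-GPU rig.
cache_days = CACHED_TASKS * SECONDS_PER_GPU_TASK / GPUS / 86400
print(f"cache lasts ~{cache_days:.1f} days")                     # ~2.9 days

# juan's GTX 690 example: 2 GPUs per card, 3 tasks at a time per GPU,
# each task finishing in roughly 7 minutes.
tasks_per_hour_per_690 = 2 * 3 * 60 / 7
print(f"~{tasks_per_hour_per_690:.0f} tasks/hour per GTX 690")   # ~51

# Theoretical ceiling of a 100 Mbit/s outbound link, assuming ~366 KB per
# task (an assumed size); protocol overhead and other traffic push the
# practical figure down toward the 80-100,000/hour mentioned above.
tasks_per_hour_link = (100e6 / 8) * 3600 / (366 * 1024)
print(f"link ceiling ~{tasks_per_hour_link:,.0f} tasks/hour")    # ~120,000
[/code]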

[rant]

A project that started out as "Mister, could you spare a few cycles for a good cause?" has turned into yet another professional-amateur "sport" where some people have gone nuts building dedicated crunching servers for thousands of dollars, but then let them churn out endless bad results because they're not entirely stable and they only check in on them if their precious RAC starts to drop. Then some super crunchers turn around and blame S@H for bad unit generation, or blame people like me - the tiny guys who let their $500 home computer run 24/7 - for "stealing" the units that are "rightfully" theirs to process, because they spent all this money simply to brag that they have one of the top 10 daily RACs. They blame S@H for running out of units, or complain that its server infrastructure isn't as robust as, say, Amazon's.

Well, I am sorry that you now have to fret that your vast array of super crunchers has a chance to run dry. That you can't sit on 20+ days' worth of units because God forbid your precious array of machines runs dry for even a moment. Welcome, all you crunching gods, to the land of mere mortals.

[/rant]
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 1305909
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65689
Credit: 55,293,173
RAC: 49
United States
Message 1305929 - Posted: 14 Nov 2012, 0:26:04 UTC - in response to Message 1305894.  


Scheduler request failed: Error 403.


Same here, so at least Milkyway is up, for the moment...
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1305929
Lionel

Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1305971 - Posted: 14 Nov 2012, 2:47:02 UTC

These are the forerunners of the GTX 7xx series ...

http://www.dvhardware.net/article56628.html

If you think there are problems with the scheduler at the moment, wait till the GTX 7xx cards become widespread ... a GTX 780 is about 40-50% faster than a GTX 680, or roughly equal to 0.75 times a GTX 690 ... personally I think these cards are going to hum ...
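A quick sanity check of that ratio, as a Python sketch (treating the dual-GPU GTX 690 as roughly two GTX 680s on one card is an assumption here, not a figure from the thread):

[code]
# Rough relative throughput, normalised to a GTX 680 = 1.0 (Python sketch).
gtx680 = 1.0
gtx690 = 2.0 * gtx680        # assumption: dual-GPU 690 ~ two 680s
gtx780 = 1.45 * gtx680       # "about 40-50% faster than a GTX680"
print(f"GTX 780 / GTX 690 = {gtx780 / gtx690:.2f}")   # ~0.73, i.e. roughly 0.75x
[/code]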

cheers
ID: 1305971
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 18996
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1306048 - Posted: 14 Nov 2012, 9:16:49 UTC

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.
ID: 1306048
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1306050 - Posted: 14 Nov 2012, 9:22:20 UTC - in response to Message 1306048.  

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.

I'm still getting the odd one here & there, but mostly i'm getting a response within a minute or so.
Grant
Darwin NT
ID: 1306050
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1306052 - Posted: 14 Nov 2012, 9:24:15 UTC - in response to Message 1306050.  

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.

I'm still getting the odd one here & there, but mostly i'm getting a response within a minute or so.

Did you see the server status page? The AP splitting was turned ON...
ID: 1306052
Profile Mad Fritz
Joined: 20 Jul 01
Posts: 87
Credit: 11,334,904
RAC: 0
Switzerland
Message 1306060 - Posted: 14 Nov 2012, 10:24:04 UTC

No luck since the AP splitters came online again... just timeouts, even with NNT :-(
ID: 1306060
Profile Fred E.
Volunteer tester

Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1306063 - Posted: 14 Nov 2012, 11:25:42 UTC

I got up early and found I had run out of GPU tasks overnight; I was at the limits last night. Had a couple of hung downloads - finally got them, and they were short shorties that ran two minutes each. Reported my stack of results with NNT set and then generated a new batch of ghosts. Trying to get them, but it's mostly timeouts. The low GPU limit hurts the project, not me. I'll probably give up on intervention and run another project like the others.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1306063
Paul Bowyer
Volunteer tester

Joined: 15 Aug 99
Posts: 11
Credit: 137,603,890
RAC: 0
United States
Message 1306072 - Posted: 14 Nov 2012, 12:19:29 UTC - in response to Message 1306063.  

Same here - back to "Not now, honey, I've got a headache" mode.
Seems pretty clear from here that the ap's are holding the smoking gun.
ID: 1306072
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1306097 - Posted: 14 Nov 2012, 14:24:02 UTC - in response to Message 1306072.  

Same here - back to "Not now, honey, I've got a headache" mode.
Seems pretty clear from here that the ap's are holding the smoking gun.

APs may not be the only problem, but they're certainly making it worse.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1306097
Profile Bill G (Special Project $75 donor)
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1306101 - Posted: 14 Nov 2012, 14:50:01 UTC - in response to Message 1306097.  

Same here - back to "Not now, honey, I've got a headache" mode.
Seems pretty clear from here that the ap's are holding the smoking gun.

APs may not be the only problem, but they're certainly making it worse.

You are right there... the server status page is not updating now.


SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1306101
Cherokee150

Joined: 11 Nov 99
Posts: 192
Credit: 58,513,758
RAC: 74
United States
Message 1306107 - Posted: 14 Nov 2012, 15:11:58 UTC

I noticed something that may be very significant.

In looking back at what SETI was doing with my computers just before the current problems began, I discovered that, while everything was deteriorating, SETI was sending all four of my machines enough units for a -month- of processing each! Even after squeezing every last cycle I can into processing, I still have around ten days' worth of GPU units and quite a few days of CPU units left to process on most of my hosts.

While I did have my cache set to 10 days to give me a cushion for emergencies, I have -never- been sent too many units before.

If many of us were getting the same overload by SETI, it would most likely explain many of the symptoms we were seeing. The throughput demand to crank thousands of us up to caches of that size would most certainly run the SETI servers into the ground.

Perhaps this will help shed light on the problem, and also on why the SETI staff has (temporarily, I hope) limited us to 100/CPU and 100/GPU.
ID: 1306107
WezH
Volunteer tester

Joined: 19 Aug 99
Posts: 576
Credit: 67,033,957
RAC: 95
Finland
Message 1306118 - Posted: 14 Nov 2012, 16:16:50 UTC

Yesterday, after maintenance, I noticed that one of my machines had ghosts, so I disabled NNT and went into "set it and forget it" mode.

This is what happened:

13/11/2012 22:09:48 | SETI@home | Requesting new tasks for CPU

14/11/2012 16:03:33 | SETI@home | Scheduler request completed: got 20 new tasks


So it took almost 18 hours and several timeouts to send 20 of those lost tasks.

And 29 are still missing after 2 hours; I've only gotten server timeouts.

I've just decided to "forget" another cruncher as well and wait and see what happens...

"Please keep Your signature under four lines so Internet traffic doesn't go up too much"

- In 1992 when I had my first e-mail address -
ID: 1306118
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1306147 - Posted: 14 Nov 2012, 18:11:22 UTC - in response to Message 1306050.  
Last modified: 14 Nov 2012, 18:17:52 UTC

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.

I'm still getting the odd one here & there, but mostly i'm getting a response within a minute or so.

And overnight I ran out of GPU work on both of my systems because I got nothing but timeout errors on every Scheduler request.

I set NNT on both systems; one managed to report, the other is still getting timeouts.
After clearing the backlog on one system I set it to get new work. Nothing but Scheduler timeouts, and the other system still hasn't been able to report its work. I expect to run out of CPU work in the next couple of hours on one system, and on the other later today - only because it is so slow.

I think they need to keep AP offline till they work out what the problem with the Scheduler is - limiting the number of tasks hasn't fixed the problem. It has barely even had an effect on it.
They really do need to address the problem - the workaround (limiting tasks) has done nothing except result in people running out of work.
Grant
Darwin NT
ID: 1306147
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1306162 - Posted: 14 Nov 2012, 18:48:42 UTC - in response to Message 1306147.  
Last modified: 14 Nov 2012, 18:56:06 UTC

After almost an hour of button clicking I was finally able to report all the tasks on my second machine.
Neither is able to get any new work - every single request results in a Scheduler timeout. I expect to run out of work completely in the next 40 minutes on one system, and on the other later today.

Please, please, please can someone let the staff know that limiting the number of tasks hasn't helped in the slightest. When it does start to help, it will only be because everyone is out of work.
Until the Scheduler is fixed they need to stop all AP production & distribution. They need to fix the Scheduler problem.

EDIT - this problem only started 3 (or was it 4?) weeks ago after the weekly outage. Whatever changes they made then to cause the problem, please undo them.
Grant
Darwin NT
ID: 1306162
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1306167 - Posted: 14 Nov 2012, 18:53:34 UTC

Task limits may be helping a little.
BUT stopping AP production did nothing over the weekend: I was still suffering server-side time-outs and very slow delivery with no AP production, and on top of that the number of ghosts in my possession went up from fewer than 5 to about 50.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1306167