Panic Mode On (78) Server Problems?

Profile Fred E.
Volunteer tester

Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1305836 - Posted: 13 Nov 2012, 21:02:54 UTC

Disappointed that the scheduler assigned some more work before I got the ghosts back, so I have more ghosts now. Not a lot, and the limits will contain it, but that's what got us the limits in the first place - the scheduler should handle ghosts first, and this just showed me that it is not fixed. I got the first 20 resends and they weren't all shorties, so that's a help.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1305836
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1305846 - Posted: 13 Nov 2012, 21:26:02 UTC - in response to Message 1305811.  

Yay! Back from normal Tuesday time-out. (btw, people in lab are really morning people...)

Yes, they took it down just before 6am California time.

That's unusual. Normally they get in at 8am and start the maintenance sometime between 8:30 and 9:00. Looks to me like it didn't run as late as it usually does, but the total downtime was longer than normal.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1305846
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1305848 - Posted: 13 Nov 2012, 21:43:20 UTC
Last modified: 13 Nov 2012, 21:47:23 UTC

Richard Haselgrove wrote (in message 1302257):
I've just had a note back from Eric:

I've stopped the splitters and doubled the httpd timeout...

I think we're going to need to at least temporarily go back
to restricting workunits in progress on a per host basis and per RPC
basis, regardless of what complaints we get about people being unable
to keep their hosts busy.

The splitters are already showing red/orange on the server status page, and 'ready to send' is as near zero as makes no difference (there'll always be a few errors and timeouts to resend). So I'm going to turn off NNT and see what happens - let's see if we can help get this beast back under control.


Just a repeat of Eric's (S@h admin) message .. ;-)
(...)
I think we're going to need to at least temporarily go back
to restricting workunits in progress on a per host basis and per RPC
basis, regardless of what complaints we get about people being unable
to keep their hosts busy.



* Best regards! :-) * Sutaru Tsureku, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. *
ID: 1305848
Lionel

Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1305894 - Posted: 13 Nov 2012, 23:20:57 UTC - in response to Message 1305848.  


Scheduler request failed: Error 403.


ID: 1305894
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 1305909 - Posted: 13 Nov 2012, 23:43:03 UTC - in response to Message 1305773.  

I was just talking about one of the rigs that recently got CPU units. You still had around 1500 GPU units for the 3 GPUs. At ~500 seconds per GPU unit that's nearly 3 days' worth left. Even if you get down to 100 per GPU, that's still half a day's worth. What did you normally run your queue at? 10 days.

It doesn't make a difference in bandwidth usage in the long run once the whole SETI@home ecosystem hits steady state; it'll just mean that when a super cruncher's nVidia card goes off the rails they can only shaft at most 100 wingmen per GPU as opposed to thousands. (Please check your results daily - not directed at you, msattler, just at nVidia users in general - to catch when your system starts to produce mostly inconclusive/error/invalid GPU results.)

Each 690 crunches a WU in less than 7 minutes running 3 WUs at a time on each GPU (it has 2) - about 48 per hour or more - so on a big cruncher (3x690) a 100 WU cache is simply ridiculous; it won't last an hour. I have 2x690s sitting on a bed waiting for the limits to be raised; with the current limits it's a waste of time/resources to put them to work, since they simply won't receive the WUs they need to keep working.

That's true. But every 5 minutes it asks for more to top it back off. And it's not 100 per day or per hour, it's 100 per GPU, isn't it? Is it SETI@home's fault that someone clever discovered you could run multiple GPU units at the same time per GPU and shared it with others? Is it S@H's fault that the campus IT department only has a 100Mb line going out to their shack, capping new units transmitted at something around 80-100,000 per hour in theory?
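A quick back-of-the-envelope check of those figures, as a Python sketch (the ~366 KB per task is an assumption for illustration; the 500-second runtime, the 7-minute GTX 690 figure, and the 100Mb line come from the posts above):

[code]
# Back-of-the-envelope figures for the cache and bandwidth claims above (Python).

SECONDS_PER_GPU_TASK = 500        # ~500 s per GPU task (Keith's figure)
CACHED_TASKS = 1500               # tasks on hand, spread across 3 GPUs
GPUS = 3

# How long a 1500-task cache lasts on a 3-GPU rig.
cache_days = CACHED_TASKS * SECONDS_PER_GPU_TASK / GPUS / 86400
print(f"cache lasts ~{cache_days:.1f} days")                     # ~2.9 days

# juan's GTX 690 example: 2 GPUs per card, 3 tasks at a time per GPU,
# each task finishing in roughly 7 minutes.
tasks_per_hour_per_690 = 2 * 3 * 60 / 7
print(f"~{tasks_per_hour_per_690:.0f} tasks/hour per GTX 690")   # ~51

# Theoretical ceiling of a 100 Mbit/s outbound link, assuming ~366 KB per
# task (an assumed size); protocol overhead and other traffic push the
# practical figure down toward the 80-100,000/hour mentioned above.
tasks_per_hour_link = (100e6 / 8) * 3600 / (366 * 1024)
print(f"link ceiling ~{tasks_per_hour_link:,.0f} tasks/hour")    # ~120,000
[/code]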

[rant]

A project that started out as "Mister, could you spare a few cycles for a good cause?" has turned into yet another professional-amateur "sport" where some people have gone nuts building dedicated crunching servers for thousands of dollars, but then let them churn out endless bad results because they're not entirely stable and they only check in on them if their precious RAC starts to drop. Then some super crunchers turn around and blame S@H for bad unit generation, or blame people like me - the tiny guys who let their $500 home computer run 24/7 - for "stealing" the units that are "rightfully" theirs to process, because they spent all this money simply to brag that they have one of the top 10 daily RACs. They blame S@H for running out of units, or complain that its server infrastructure isn't as robust as, say, Amazon's.

Well, I am sorry that you now have to fret that your vast array of super crunchers has a chance to run dry. That you can't sit on 20+ days' worth of units because God forbid your precious array of machines runs dry for even a moment. Welcome, all you crunching gods, to the land of mere mortals.

[/rant]
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 1305909
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65689
Credit: 55,293,173
RAC: 49
United States
Message 1305929 - Posted: 14 Nov 2012, 0:26:04 UTC - in response to Message 1305894.  


Scheduler request failed: Error 403.


Same here, so at least Milkyway is up, for the moment...
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1305929
Lionel

Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1305971 - Posted: 14 Nov 2012, 2:47:02 UTC

These are the forerunners of the GTX 7xx series ...

http://www.dvhardware.net/article56628.html

If you think there are problems with the scheduler at the moment, wait till the GTX 7xx cards become widespread ... a GTX 780 is about 40-50% faster than a GTX 680, or roughly equal to 0.75 times a GTX 690 ... personally I think these cards are going to hum ...
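A quick sanity check of that ratio, as a Python sketch (treating the dual-GPU GTX 690 as roughly two GTX 680s on one card is an assumption here, not a figure from the thread):

[code]
# Rough relative throughput, normalised to a GTX 680 = 1.0 (Python sketch).
gtx680 = 1.0
gtx690 = 2.0 * gtx680        # assumption: dual-GPU 690 ~ two 680s
gtx780 = 1.45 * gtx680       # "about 40-50% faster than a GTX680"
print(f"GTX 780 / GTX 690 = {gtx780 / gtx690:.2f}")   # ~0.73, i.e. roughly 0.75x
[/code]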

cheers
ID: 1305971
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 18996
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1306048 - Posted: 14 Nov 2012, 9:16:49 UTC

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.
ID: 1306048
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1306050 - Posted: 14 Nov 2012, 9:22:20 UTC - in response to Message 1306048.  

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.

I'm still getting the odd one here & there, but mostly i'm getting a response within a minute or so.
Grant
Darwin NT
ID: 1306050
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1306052 - Posted: 14 Nov 2012, 9:24:15 UTC - in response to Message 1306050.  

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.

I'm still getting the odd one here & there, but mostly i'm getting a response within a minute or so.

Did you see the server status page? The AP splitting was turned ON...
ID: 1306052
Profile Mad Fritz
Joined: 20 Jul 01
Posts: 87
Credit: 11,334,904
RAC: 0
Switzerland
Message 1306060 - Posted: 14 Nov 2012, 10:24:04 UTC

No luck since the AP splitters came online again... just timeouts, even with NNT :-(
ID: 1306060
Profile Fred E.
Volunteer tester

Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1306063 - Posted: 14 Nov 2012, 11:25:42 UTC

I got up early and found I had run out of GPU tasks overnight; I was at the limits last night. Had a couple of hung downloads - finally got them, and they were short shorties that ran two minutes each. Reported my stack of results with NNT set and then generated a new batch of ghosts. Trying to get them, but it's mostly timeouts. The low GPU limit hurts the project, not me. I'll probably give up on intervention and run another project like the others.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1306063
Paul Bowyer
Volunteer tester

Joined: 15 Aug 99
Posts: 11
Credit: 137,603,890
RAC: 0
United States
Message 1306072 - Posted: 14 Nov 2012, 12:19:29 UTC - in response to Message 1306063.  

Same here - back to "Not now, honey, I've got a headache" mode.
Seems pretty clear from here that the ap's are holding the smoking gun.
ID: 1306072
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1306097 - Posted: 14 Nov 2012, 14:24:02 UTC - in response to Message 1306072.  

Same here - back to "Not now, honey, I've got a headache" mode.
Seems pretty clear from here that the ap's are holding the smoking gun.

APs may not be the only problem, but they're certainly making it worse.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1306097
Profile Bill G (Special Project $75 donor)
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1306101 - Posted: 14 Nov 2012, 14:50:01 UTC - in response to Message 1306097.  

Same here - back to "Not now, honey, I've got a headache" mode.
Seems pretty clear from here that the ap's are holding the smoking gun.

APs may not be the only problem, but they're certainly making it worse.

You are right there... the server status page is not updating now.


SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1306101
Cherokee150

Joined: 11 Nov 99
Posts: 192
Credit: 58,513,758
RAC: 74
United States
Message 1306107 - Posted: 14 Nov 2012, 15:11:58 UTC

I noticed something that may be very significant.

In looking back at what SETI was doing with my computers just before the current problems began, I discovered that, while everything was deteriorating, SETI was sending all four of my machines enough units for a -month- of processing each! Even after squeezing every last cycle I can into processing, I still have around ten days' worth of GPU units and quite a few days of CPU units left to process on most of my hosts.

While I did have my cache set to 10 days to give me a cushion for emergencies, I have -never- been sent too many units before.

If many of us were getting the same overload by SETI, it would most likely explain many of the symptoms we were seeing. The throughput demand to crank thousands of us up to caches of that size would most certainly run the SETI servers into the ground.

Perhaps this will help shed light on the problem, and also on why the SETI staff has (temporarily, I hope) limited us to 100/CPU and 100/GPU.
ID: 1306107
WezH
Volunteer tester

Joined: 19 Aug 99
Posts: 576
Credit: 67,033,957
RAC: 95
Finland
Message 1306118 - Posted: 14 Nov 2012, 16:16:50 UTC

Yesterday, after maintenance, I noticed that one of my machines had ghosts, so I disabled NNT and went into "set it and forget it" mode.

This is what happened:

13/11/2012 22:09:48 | SETI@home | Requesting new tasks for CPU

14/11/2012 16:03:33 | SETI@home | Scheduler request completed: got 20 new tasks


So it took almost 18 hours and several timeouts to send 20 of those lost tasks.

And 29 are still missing after 2 hours; I've only gotten server timeouts.

I've just decided to "forget" another cruncher as well and wait and see what happens...

"Please keep Your signature under four lines so Internet traffic doesn't go up too much"

- In 1992 when I had my first e-mail address -
ID: 1306118
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1306147 - Posted: 14 Nov 2012, 18:11:22 UTC - in response to Message 1306050.  
Last modified: 14 Nov 2012, 18:17:52 UTC

I am getting "scheduler request: timeout was reached" again. All of the last 5 requests since 08:32 UTC.

I'm still getting the odd one here & there, but mostly i'm getting a response within a minute or so.

And overnight I ran out of GPU work on both of my systems because I got nothing but timeout errors on every Scheduler request.

I set NNT on both systems; one managed to report, the other is still getting timeouts.
After clearing the backlog on one system I set it to get new work. Nothing but Scheduler timeouts, and the other system still hasn't been able to report its work. I expect to run out of CPU work in the next couple of hours on one system, and on the other later today - only because it is so slow.

I think they need to keep AP offline till they work out what the problem with the Scheduler is - limiting the number of tasks hasn't fixed the problem. It has barely even had an effect on it.
They really do need to address the problem - the workaround (limiting tasks) has done nothing except result in people running out of work.
Grant
Darwin NT
ID: 1306147
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1306162 - Posted: 14 Nov 2012, 18:48:42 UTC - in response to Message 1306147.  
Last modified: 14 Nov 2012, 18:56:06 UTC

After almost an hour of button clicking I was finally able to report all the tasks on my second machine.
Neither is able to get any new work - every single request results in a Scheduler timeout. I expect to run out of work completely in the next 40 minutes on one system, and on the other later today.

Please, please, please can someone let the staff know that limiting the number of tasks hasn't helped in the slightest. When it does start to help, it will only be because everyone is out of work.
Until the Scheduler is fixed they need to stop all AP production & distribution. They need to fix the Scheduler problem.

EDIT - this problem only started 3 (or was it 4?) weeks ago after the weekly outage. Whatever changes they made then to cause the problem, please undo them.
Grant
Darwin NT
ID: 1306162
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1306167 - Posted: 14 Nov 2012, 18:53:34 UTC

Task limits may be helping a little.
BUT stopping AP production did nothing over the weekend: I was still suffering server-side time-outs and very slow delivery with no AP production, and on top of that the number of ghosts in my possession went up from fewer than 5 to about 50.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1306167