Panic Mode On (79) Server Problems?

Author	Message
Mark Fiske Send message Joined: 15 Aug 11 Posts: 713 Credit: 7,392,921 RAC: 0	Message 1310060 - Posted: 25 Nov 2012, 6:29:29 UTC - in response to Message 1310044. Well, I wasn't expecting this but I just got 94 CPU WU's out of the blue. Happy Camper! Mark ID: 1310060 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1310067 - Posted: 25 Nov 2012, 7:26:14 UTC - in response to Message 1310044. Timeouts, Timeouts, Timeouts!!!!!!! Now it's Couldn't connect to server, Couldn't connect to server, Couldn't connect to server!!! Back to Timeouts, Timeouts, Timeouts!!!!!!! again. Grant Darwin NT ID: 1310067 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1310075 - Posted: 25 Nov 2012, 7:57:21 UTC - in response to Message 1310067. Last modified: 25 Nov 2012, 7:57:46 UTC Timeouts, Timeouts, Timeouts!!!!!!! Now it's Couldn't connect to server, Couldn't connect to server, Couldn't connect to server!!! Back to Timeouts, Timeouts, Timeouts!!!!!!! again. I think it's now just throwing random errors. Timeouts, couldn't connect & failure when receiving data from the peer depending on the mood it's in. I've even received a No tasks sent, but that was on some GPU requests- i got a whole bunch of VLARs on one CPU request so at least that particular response makes sense. Grant Darwin NT ID: 1310075 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1310111 - Posted: 25 Nov 2012, 10:11:29 UTC - in response to Message 1310075. Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches. ID: 1310111 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1310206 - Posted: 25 Nov 2012, 18:03:01 UTC - in response to Message 1310111. Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches. Then I've got the hassle of finding a working proxy, then finding a new one every few days when the working one no longer works. Grant Darwin NT ID: 1310206 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1310213 - Posted: 25 Nov 2012, 18:08:45 UTC - in response to Message 1310206. Just noticed that the AP Science Database & Assimilators haven't been running for a few days- lots of work to be assimilated is backing up. Grant Darwin NT ID: 1310213 ·

Keith White Send message Joined: 29 May 99 Posts: 392 Credit: 13,035,233 RAC: 22	Message 1310216 - Posted: 25 Nov 2012, 18:16:03 UTC Just posting what I'm currently getting in my event log since it seems to succeed within an hour of me posting what I'm currently getting in my event log. I don't question the voodoo, I just go with it.
11/25/2012 1:01:03 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:01:03 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:01:25 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:01:25 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:01:26 PM | | Internet access OK - project servers may be temporarily down.
11/25/2012 1:03:06 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:03:06 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:03:29 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:03:29 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:03:30 PM | | Internet access OK - project servers may be temporarily down.
11/25/2012 1:06:11 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:06:11 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:06:34 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:06:34 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:06:35 PM | | Internet access OK - project servers may be temporarily down.
11/25/2012 1:12:29 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:12:29 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:12:52 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:12:52 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:12:54 PM | | Internet access OK - project servers may be temporarily down.
"Life is just nature's way of keeping meat fresh." - The Doctor ID: 1310216 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1310226 - Posted: 25 Nov 2012, 18:29:37 UTC - in response to Message 1310216. Just posting what I'm currently getting in my event log since it seems to succeed within an hour of me posting what I'm currently getting in my event log. I don't question the voodoo, I just go with it. I've been getting a lot of can't connect errors here too. Not on your end. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1310226 ·

Keith White Send message Joined: 29 May 99 Posts: 392 Credit: 13,035,233 RAC: 22	Message 1310230 - Posted: 25 Nov 2012, 18:34:01 UTC - in response to Message 1310226. Just posting what I'm currently getting in my event log since it seems to succeed within an hour of me posting what I'm currently getting in my event log. I don't question the voodoo, I just go with it. I've been getting a lot of can't connect errors here too. Not on your end. See what I mean... VOODOO!
11/25/2012 1:12:29 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:12:29 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:12:52 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:12:52 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:12:54 PM | | Internet access OK - project servers may be temporarily down.
11/25/2012 1:27:50 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:27:50 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:29:03 PM | SETI@home | Scheduler request completed: got 11 new tasks
"Life is just nature's way of keeping meat fresh." - The Doctor ID: 1310230 ·

Lionel Send message Joined: 25 Mar 00 Posts: 680 Credit: 563,640,304 RAC: 597	Message 1310389 - Posted: 26 Nov 2012, 4:43:53 UTC - in response to Message 1310111. Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches. As Grant has basically said, it doesn't always work...they are waking up to the traffic that SETI puts through and soon this avenue will be closed for many of us...what they need to do is increase the bandwidth beyond 100 Mbps... ID: 1310389 ·

Lionel Send message Joined: 25 Mar 00 Posts: 680 Credit: 563,640,304 RAC: 597	Message 1310390 - Posted: 26 Nov 2012, 4:46:42 UTC - in response to Message 1310230. Just posting what I'm currently getting in my event log since it seems to succeed within an hour of me posting what I'm currently getting in my event log. I don't question the voodoo, I just go with it. I've been getting a lot of can't connect errors here too. Not on your end. See what I mean... VOODOO!
11/25/2012 1:12:29 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:12:29 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:12:52 PM | | Project communication failed: attempting access to reference site
11/25/2012 1:12:52 PM | SETI@home | Scheduler request failed: Couldn't connect to server
11/25/2012 1:12:54 PM | | Internet access OK - project servers may be temporarily down.
11/25/2012 1:27:50 PM | SETI@home | Sending scheduler request: To fetch work.
11/25/2012 1:27:50 PM | SETI@home | Reporting 24 completed tasks, requesting new tasks for CPU and ATI
11/25/2012 1:29:03 PM | SETI@home | Scheduler request completed: got 11 new tasks
getting a lot of that here as well...suspect that many of us will be... ID: 1310390 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1310413 - Posted: 26 Nov 2012, 7:08:23 UTC - in response to Message 1310389. Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches. As Grant has basically said, it doesn't always work...they are waking up to the traffic that SETI puts through and soon this avenue will be closed for many of us...what they need to do is increase the bandwidth beyond 100 Mbps... That would help (massively- till the next bottleneck is hit), but what doesn't make sense is why using a proxy gives better connections & speeds than not using one. Grant Darwin NT ID: 1310413 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1310414 - Posted: 26 Nov 2012, 7:09:55 UTC - in response to Message 1310413. Last modified: 26 Nov 2012, 7:10:58 UTC Even with all the weirdness going on, my systems have managed to stay busy while I was at work. And while the inbound network traffic has been rather odd (little peaks here & there & gradually increasing overall) since coming back up after the multiple Scheduler breakdowns, there have been a couple of significant dips while I was away. And they also affected the download traffic. Grant Darwin NT ID: 1310414 ·

Gary Charpentier Volunteer tester Send message Joined: 25 Dec 00 Posts: 30651 Credit: 53,134,872 RAC: 32	Message 1310417 - Posted: 26 Nov 2012, 7:27:07 UTC - in response to Message 1310413. Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches. As Grant has basically said, it doesn't always work...they are waking up to the traffic that SETI puts through and soon this avenue will be closed for many of us...what they need to do is increase the bandwidth beyond 100 Mbps... That would help (massively- till the next bottleneck is hit), but what doesn't make sense is why using a proxy gives better connections & speeds than not using one. Because, as Eric stated, there is a problem upstream from SSL, possibly in the Campus tunnel. It has nothing to do with pipe size. Oh, and can you imagine how much worse the scheduler ghost woes would be if the pipe was 10X wider? Would there be 10X the number of ghosts? IIRC Eric was able to get a test in, and a 5X increase in pipe size hits a bottleneck that may not be surmountable. I also have a question: do you think the hardware can take 5X additional load 24/7, or what will break next? ID: 1310417 ·

Lionel Send message Joined: 25 Mar 00 Posts: 680 Credit: 563,640,304 RAC: 597	Message 1310420 - Posted: 26 Nov 2012, 7:52:22 UTC - in response to Message 1310417. Use a proxy; with one, the communication errors are minimized, and you could easily rebuild your caches. As Grant has basically said, it doesn't always work...they are waking up to the traffic that SETI puts through and soon this avenue will be closed for many of us...what they need to do is increase the bandwidth beyond 100 Mbps... That would help (massively- till the next bottleneck is hit), but what doesn't make sense is why using a proxy gives better connections & speeds than not using one. Because, as Eric stated, there is a problem upstream from SSL, possibly in the Campus tunnel. It has nothing to do with pipe size. Oh, and can you imagine how much worse the scheduler ghost woes would be if the pipe was 10X wider? Would there be 10X the number of ghosts? IIRC Eric was able to get a test in, and a 5X increase in pipe size hits a bottleneck that may not be surmountable. I also have a question: do you think the hardware can take 5X additional load 24/7, or what will break next? maximum throughput (down to us) is governed by the maximum rate at which work can be created and sent...that is the natural ceiling...given a maximum down rate you can approximate a maximum up rate based on average returned work unit size...the pipe should be wider than these to allow for other things such as overhead/management traffic... ID: 1310420 ·
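To put numbers on that reasoning, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the function name `required_link_mbps`, the task rate, the per-task sizes and the 25% overhead allowance are made up and are not SETI@home figures. It also assumes a steady state in which every task sent down eventually comes back as a result.

```python
def required_link_mbps(tasks_per_second, download_kb_per_task,
                       upload_kb_per_task, overhead_fraction=0.25):
    """Estimate the bandwidth needed in each direction, in megabits per second.

    All inputs are hypothetical. Sizes are in kilobytes (1 kB = 1000 bytes).
    """
    # kB/s -> kilobits/s (x8) -> megabits/s (/1000)
    down_mbps = tasks_per_second * download_kb_per_task * 8 / 1000
    # Steady-state assumption: results return at the same rate tasks go out.
    up_mbps = tasks_per_second * upload_kb_per_task * 8 / 1000
    # Leave room for scheduler requests, retries and management traffic.
    headroom = 1 + overhead_fraction
    return {
        "down_mbps": down_mbps,
        "up_mbps": up_mbps,
        "down_with_overhead_mbps": down_mbps * headroom,
        "up_with_overhead_mbps": up_mbps * headroom,
    }


if __name__ == "__main__":
    # Made-up example: 10 tasks/s, 500 kB per download, 50 kB per result.
    for name, value in required_link_mbps(10, 500, 50).items():
        print(f"{name}: {value:.1f}")
```

The point of the sketch is only that the ceiling is set by how fast work can be created and sent; the pipe then needs to be somewhat wider than that ceiling in both directions.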

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1310421 - Posted: 26 Nov 2012, 7:55:41 UTC - in response to Message 1310417. Last modified: 26 Nov 2012, 7:56:29 UTC Oh and can you imagine how much worse the scheduler ghosts woes would be if the pipe was 10X wider? Would there be 10X the number of ghosts? Maybe, maybe not. When the Scheduler was using the campus network, it was responding in less than 7 seconds, often within 2-4. So it would appear the network congestion is a factor- remove it & no more ghosts at all. IIRC Eric was able to get a test in and a 5X increase in pipe size hits a bottleneck that may not be surmountable. I also have a question, do you think the hardware can take 5X additional 24/7 or what will break next? Keep in mind if there were a 5-fold increase in available bandwidth, the load on the servers would drop 5 times faster. The load would probably be less than it is now because there wouldn't be all the retries going on, or the accumulation of ghosts. I have no doubt we'd find some new major problem sooner rather than later, but it would erase completely several existing ones. Grant Darwin NT ID: 1310421 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1310428 - Posted: 26 Nov 2012, 8:56:18 UTC - in response to Message 1310421. Inbound & outbound traffic is plummeting. Hopefully it'll recover again like it's done twice previously. fingers crossed Grant Darwin NT ID: 1310428 ·

musicplayer Send message Joined: 17 May 10 Posts: 2430 Credit: 926,046 RAC: 0	Message 1310436 - Posted: 26 Nov 2012, 9:29:36 UTC And I got a new batch of jobs coming my way. Thanks! ID: 1310436 ·

tbret Volunteer tester Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40	Message 1310463 - Posted: 26 Nov 2012, 12:33:56 UTC - in response to Message 1310421. I have no doubt we'd find some new major problem sooner rather than later, but it would erase completely several existing ones. It would be fun to play some other game for a while. Let's try it and see! ID: 1310463 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1310487 - Posted: 26 Nov 2012, 15:22:48 UTC I've been speculating for at least two years now that if we were able to increase the pipe to even just 200 Mbit, I'm not sure the scheduler/feeder would handle it. Even two years ago when GPUs were much slower and less common, the database was having trouble keeping up. Of course, we have better hardware now, but I think it's significantly more likely we'll run into a software limitation, on top of the disk I/O limitation. A bigger pipe will likely cause more issues without some sort of restraint (per-host limits are a simple way to do it, but there are better ways, like server-side cache size based on DCF). It was a good idea to run the scheduler on a different link, as long as that can be reliable. It will at least allow a high rate of successful contacts to report work and be assigned new work, and then you just have to fight for bandwidth on the download link, which in the grand scheme of things isn't that huge of an issue. You wouldn't end up with ghosts, you'd just end up with 10+ hour back-offs, but you can overcome those with some manual intervention, or some less draconian exponential back-off calculations in the client itself. Maybe once the scheduler reliability issues get sorted out, we can test the database's capability to keep up for a 24-hour period by updating some DNS records and ramping the bandwidth up? Maybe pick a Saturday, or we have the winter holidays coming up, when the campus will be empty except for a select few faculty members. Could do a 1-3 day test on 200+ Mbit then, assuming the red tape can be removed temporarily for such a thing. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 1310487 ·
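As a rough illustration of what a "less draconian" back-off could look like, the Python sketch below doubles the retry delay after each failure but stops at a ceiling. This is not the BOINC client's actual algorithm; the function `backoff_seconds` and all of its parameters (60 s base, doubling, one-hour cap, optional jitter) are hypothetical choices made for the example.

```python
import random


def backoff_seconds(failures, base=60.0, factor=2.0, cap=3600.0,
                    jitter=0.0, rng=random):
    """Return a retry delay in seconds after `failures` consecutive failures.

    Hypothetical parameters: the delay starts at `base`, is multiplied by
    `factor` for each further failure, and is clamped to `cap`.
    """
    exponent = max(0, failures - 1)
    delay = min(cap, base * factor ** exponent)
    if jitter:
        # Spread clients out so they do not all retry at the same moment.
        # The jittered delay may exceed `cap` by up to the jitter fraction.
        delay *= 1 + rng.uniform(-jitter, jitter)
    return delay


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, backoff_seconds(n))
```

With these made-up defaults the delay never grows past an hour, whereas an uncapped doubling from the same base would already be over eight hours by the tenth failure.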

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.