The Server Issues / Outages Thread - Panic Mode On! (117)

Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2014494 - Posted: 7 Oct 2019, 1:30:28 UTC - in response to Message 2014492.  

I sure hope they give everyone an explanation of what's going on...


ID: 2014494
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2014497 - Posted: 7 Oct 2019, 1:54:58 UTC

Somebody has obviously sorted the uploads out, because as I write this, results are being returned at an astonishing rate - 802,639. I cannot recall ever seeing the return rate that high.
ID: 2014497
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2014499 - Posted: 7 Oct 2019, 2:11:54 UTC - in response to Message 2014497.  

Now all the hosts are trying to report work and get new work.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2014499
FurryGuy
Volunteer tester
Joined: 1 Jun 04
Posts: 6
Credit: 9,294,513
RAC: 1
United States
Message 2014505 - Posted: 7 Oct 2019, 3:47:51 UTC - in response to Message 2014499.  

Now all the hosts are trying to report work and get new work.

I got one GPU work unit that took almost 30 minutes to download.

Average run time for GPU work units, less than 10 minutes.

This is going to be a loooooooooong catch up period.
ID: 2014505
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2014506 - Posted: 7 Oct 2019, 4:07:46 UTC - in response to Message 2014505.  

This is going to be a loooooooooong catch up period.
Looks a lot better now. Download stalls have mostly resolved. Still a ways to go to get full caches, maybe another hour.
ID: 2014506
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2014514 - Posted: 7 Oct 2019, 6:37:57 UTC - in response to Message 2014473.  

They have added an Arecibo file (01oc19ac) to be split. It is happily splitting, but of course we can't have any of the WUs.
It is really hard to see the large RTS queues and not get to have any.

edit - on the bright side it means someone is fiddling with the machine, but unfortunately not in a helpful way.


. . I'll tell you what Keith will also tell you. The Arecibo data is configured to auto-download and auto-mount on the splitters, so that was probably done entirely without human intervention. If it were archival data, I would suspect that someone had a hand in it.

Stephen

. .
ID: 2014514
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2014515 - Posted: 7 Oct 2019, 6:41:20 UTC - in response to Message 2014494.  

I sure hope they give everyone an explanation of what's going on...



. . Cynic ...

:)

. . +1

Stephen

:)
ID: 2014515
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2014516 - Posted: 7 Oct 2019, 6:42:28 UTC - in response to Message 2014497.  
Last modified: 7 Oct 2019, 6:44:18 UTC

Somebody has obviously sorted the uploads out, because as I write this, results are being returned at an astonishing rate - 802,639. I cannot recall ever seeing the return rate that high.


. . Is there an upload server listed named muarae2 ???

. . I have just restarted the 1st of my 5 rigs and immediately got new work ...

Stephen

??
ID: 2014516
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2014519 - Posted: 7 Oct 2019, 7:03:45 UTC - in response to Message 2014516.  

Muarae2 is the new upload server they have deployed at Beta. Uses SSD storage and a lot more memory. The one that Richard posted an image of when he was visiting this summer.
Looks like it has finally made its appearance at Main.

And yes, any new data from Arecibo this year is automounted, since they have a bigger pipeline from Arecibo after the repairs from the hurricane, if I remember correctly.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2014519
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2014520 - Posted: 7 Oct 2019, 7:17:04 UTC - in response to Message 2014519.  

Muarae2 is the new upload server they have deployed at Beta. Uses SSD storage and a lot more memory. The one that Richard posted an image of when he was visiting this summer.
Looks like it has finally made its appearance at Main.
That would explain the doubling of Received-last-hour numbers compared to its usual values after the systems came back up. Hopefully the File deleter & File purge duties have been moved over as well, and the rest of the system should get a bit of an improvement in overall performance.
*fingers crossed*
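
For anyone wondering what moving those duties over would actually involve on the server side: a BOINC project's back-end daemons are listed in its config.xml, and each daemon entry can, as far as I remember, name the host it should run on. A rough sketch only - the hostname placement and flags are from memory, not the project's actual configuration:

  <daemons>
    <daemon>
      <cmd>file_deleter -d 2</cmd>
      <host>muarae2</host>  <!-- whichever host is meant to own the deleter -->
    </daemon>
    <daemon>
      <cmd>db_purge -min_age_days 7 -d 2</cmd>
      <host>muarae2</host>
    </daemon>
  </daemons>

The start/stop scripts only launch a daemon on the host it is assigned to, so in principle the deleter and purger could be shifted to the new machine without touching anything else.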
Grant
Darwin NT
ID: 2014520
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2014521 - Posted: 7 Oct 2019, 7:34:30 UTC - in response to Message 2014520.  
Last modified: 7 Oct 2019, 7:35:14 UTC

Good morning all. Obviously going to bed fixed it this time...

Came back to the console to find one machine stalled on downloads, all others crunching. One click to retry and we're in business.

All my uploads yesterday went through as normal as each task finished. I think the massive number on the SSP will be the number reported each hour - reflecting the massive backlog of tasks waiting to report. I don't think it tells us anything about Muarae2. No reply from the lab about the cause yet.
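
For anyone who would rather give that nudge from a terminal than click Retry in the Manager, boinccmd can do the same thing - a rough sketch, with the file name invented for the example:

  # list the transfers that are currently stuck
  boinccmd --get_file_transfers

  # retry one of them (use the project URL and file name exactly as listed above)
  boinccmd --file_transfer http://setiathome.berkeley.edu/ some_workunit_file retry

  # or just prod the scheduler to report finished tasks and request new work
  boinccmd --project http://setiathome.berkeley.edu/ update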
ID: 2014521
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2014528 - Posted: 7 Oct 2019, 11:56:17 UTC - in response to Message 2014521.  
Last modified: 7 Oct 2019, 11:58:38 UTC

Good morning all. Obviously going to bed fixed it this time...


I'm going to lay this square on your shoulders then, Richard. Next time, don't stay up so bloody late! We need our work units. Although, confessions be told, I went to work for 7 hours, and may have had a hand in it as well.

Cheers,
Guy
ID: 2014528
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2014531 - Posted: 7 Oct 2019, 13:12:49 UTC - in response to Message 2014528.  

As I expected with the system timeout, it looks like a pretty dramatic decrease in the pending validations column. I'd been averaging in the low 700s of tasks pending, which had grown to the low-to-mid 800s over the last month. Sitting at 650ish now, but I'm very hesitant to say anything has been "fixed" at this point. The positive note is that this still appears to be decreasing. The trend appears positive.
ID: 2014531
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2014564 - Posted: 7 Oct 2019, 19:24:15 UTC

The mystery deepens. New tape 01oc19ac, having helped to get us re-started, now seems to have stalled. I don't think it's moved since I got up, although I've processed a lot of jobs from it during the day.

I sent the lab my table of timings, plus the observation "The first sign of failure is a timeout on a scheduler request, followed by a complete failure to connect". I've just had a reply back: "Thank you for this information. Very helpful in debugging". Very gnomic.

I think we can take it that they are aware of the problem, but haven't found a cause yet. So, if you have any more observed symptoms or logs you think it might be helpful to pass on, please post them here.
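
If anyone wants to capture more detail than the default Event Log shows, a few flags in cc_config.xml in the BOINC data directory should do it - this is a sketch from memory, so check the flag names against the client documentation before relying on it:

  <cc_config>
    <log_flags>
      <sched_op_debug>1</sched_op_debug>    <!-- details of each scheduler request and reply -->
      <http_debug>1</http_debug>            <!-- low-level HTTP traffic, including timeouts -->
      <file_xfer_debug>1</file_xfer_debug>  <!-- per-transfer progress for uploads and downloads -->
    </log_flags>
  </cc_config>

Re-read the config files from the Manager (or restart the client) and the extra lines appear in the Event Log with timestamps, which is exactly the sort of thing that would help pin down the "timeout, then failure to connect" pattern.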
ID: 2014564
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2014566 - Posted: 7 Oct 2019, 20:24:35 UTC - in response to Message 2014564.  

I've completely reversed course on pending validations too. Almost at 700 again, but that would seem nominal for my particular output. I don't even remember who brought that particular item to the table, but it would be interesting if they have observations of their own.
ID: 2014566
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2014571 - Posted: 7 Oct 2019, 22:02:35 UTC

Thanks, Richard, for communicating with the lab. I think we come up with good definitions of problems and stuck files and such on the panic thread, and I had hoped someone from the lab was looking at this thread now and then to see this info, but I guess they aren't.

I created the data chat thread in hopes of keeping this thread only for true panic situations, so that someone from the lab could look at this thread without being overloaded, but again it looks like they don't.

I'm happy our "pain" and sleuthing have given them a starting point.
ID: 2014571
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2014605 - Posted: 8 Oct 2019, 6:15:20 UTC - in response to Message 2014564.  

The mystery deepens. New tape 01oc19ac, having helped to get us re-started, now seems to have stalled. I don't think it's moved since I got up, although I've processed a lot of jobs from it during the day.
Just got back from work & looked at the Server Status page and was thinking that file had been there for a while, yet I haven't had any Arecibo work (other than resends) for a while now.
Grant
Darwin NT
ID: 2014605
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 2014606 - Posted: 8 Oct 2019, 6:22:18 UTC - in response to Message 2014605.  

The mystery deepens. New tape 01oc19ac, having helped to get us re-started, now seems to have stalled. I don't think it's moved since I got up, although I've processed a lot of jobs from it during the day.
Just got back from work & looked at the Server Status page and was thinking that file had been there for a while, yet I haven't had any Arecibo work (other than resends) for a while now.
I filled up on them straight after yesterday's outrage, and I still have a dozen or so of them waiting for my 2500K's cores to get around to them.

Cheers.
ID: 2014606
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2014616 - Posted: 8 Oct 2019, 10:29:12 UTC

01oc19ac still sitting there. Maybe the planned weekly outage will give it a nudge?
Grant
Darwin NT
ID: 2014616
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2014636 - Posted: 8 Oct 2019, 17:05:54 UTC
Last modified: 8 Oct 2019, 17:51:22 UTC

01oc19ac is still sitting there. It can't finish for some reason.
06oc19aa unfortunately has had an AP splitting error.

edit - 06oc19aa did split two channels with no errors, so I guess it is just a bad channel of data, and no real panic.
edit2 - maybe 01oc19ac has spit out some data after being stalled. I wonder if someone at SETI gave the process a kick, or if it was just the fact that I posted about it that fixed it :-)
ID: 2014636