The Server Issues / Outages Thread - Panic Mode On! (117)
Author | Message |
---|---|
Jimbocous Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
I sure hope they give everyone an explanation on what's going on.. |
Speedy Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
Somebody has obviously sorted the uploads out, because as I write they are returning at an astonishing rate: 802,639. I cannot recall ever seeing the return rate that high. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Now all the hosts are trying to report work and get new work. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
FurryGuy Joined: 1 Jun 04 Posts: 6 Credit: 9,294,513 RAC: 1 |
"Now all the hosts are trying to report work and get new work." I got one GPU work unit that took almost 30 minutes to download. The average run time for GPU work units is less than 10 minutes. This is going to be a loooooooooong catch-up period. |
Jimbocous Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
"This is going to be a loooooooooong catch up period." Looks a lot better now. Download stalls have mostly resolved. Still a ways to go to get full caches, maybe another hour. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"They have added an Arecibo file (01oc19ac) to be split. It is happily splitting, but of course we can't have any of the WUs." . . I'll tell you what Keith will also tell you. The Arecibo data is configured to auto-download and auto-mount on the splitters, so that was probably done entirely without human intervention. If it were archival data I would suspect that someone had a hand in it. Stephen . . |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"I sure hope they give everyone an explanation on what's going on.." . . Cynic ... :) . . +1 Stephen :) |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"Somebody has obviously sorted the uploads out, because as I write they are returning at an astonishing rate: 802,639. I cannot recall ever seeing the return rate that high." . . Is there an upload server listed named muarae2 ??? . . I have just restarted the 1st of my 5 rigs and immediately got new work ... Stephen ?? |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Muarae2 is the new upload server they have deployed at Beta. Uses SSD storage and a lot more memory. The one that Richard posted an image of when he was visiting this summer. Looks like it has finally made its appearance at Main. And yes, any new data from this year from Arecibo is automounted, since they have a bigger pipeline from Arecibo after the repair from the hurricane, if I remember correctly. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
"Muarae2 is the new upload server they have deployed at Beta. Uses SSD storage and a lot more memory. The one that Richard posted an image of when he was visiting this summer." That would explain the doubling of Received-last-hour numbers compared to its usual values after the systems came back up. Hopefully the File deleter & File purge duties have been moved over as well, and the rest of the system should get a bit of an improvement in overall performance. *fingers crossed* Grant Darwin NT |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Good morning all. Obviously going to bed fixed it this time... Came back to the console to find one machine stalled on downloads, all others crunching. One click to retry and we're in business. All my uploads yesterday went through as normal as each task finished. I think the massive number on the SSP will be the number reported each hour - reflecting the massive backlog of tasks waiting to report. I don't think it tells us anything about Muarae2. No reply from the lab about the cause yet. |
AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
"Good morning all. Obviously going to bed fixed it this time..." I'm going to lay this square on your shoulders then Richard. Next time, don't stay up so bloody late! We need our work units. Although, confessions be told, I went to work for 7 hours, and may have had a hand in it as well. Cheers, Guy |
AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
As I expected with the system time out, it looks like a pretty dramatic decrease in the pending validations column. I'd been averaging around the low 700 tasks pending, which had grown to the low to mid 800s over the last month. Sitting at 650ish now, but I'm very hesitant to say anything has been "fixed" at this point. Positive note is that this appears to be decreasing still. Trend appears positive. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
The mystery deepens. New tape 01oc19ac, having helped to get us re-started, then seems to have stalled. I don't think it's moved since I got up, although I've processed a lot of jobs from it during the day. I sent the lab my table of timings, plus the observation "The first sign of failure is a timeout on a scheduler request, followed by a complete failure to connect". I've just had a reply back: "Thank you for this information. Very helpful in debugging". Very gnomic. I think we can take it that they are aware of the problem, but haven't found a cause yet. So, if you have any more observed symptoms or logs you think it might be helpful to pass on, please post them here. |
AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
I've completely reversed course on pending validations too. Almost at 700 again, but that would seem nominal for my particular output. I don't even remember who brought that particular item to the table, but it would be interesting if they've observations of their own. |
Unixchick Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
Thanks Richard for communicating with the lab. I think we come up with good definitions of problems and stuck files and such on the panic thread, and I had hoped someone from the lab was looking now and then at this thread to see this info, but I guess they aren't. I created the data chat thread in hopes of keeping this thread only for true panic situations, so that someone from the lab could look at this thread without being overloaded, but again it looks like they don't. I'm happy our "pain" and sleuthing have given them a starting point. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
"The mystery deepens. New tape 01oc19ac, having helped to get us re-started, then seems to have stalled. I don't think it's moved since I got up, although I've processed a lot of jobs from it during the day." Just got back from work & looked at the Server Status page and was thinking that file had been there for a while, yet I haven't had any Arecibo work (other than resends) for a while now. Grant Darwin NT |
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
I filled up on them straight after yesterday's outrage and I still have a dozen or so of them waiting for my 2500K's cores to get around to. "The mystery deepens. New tape 01oc19ac, having helped to get us re-started, then seems to have stalled. I don't think it's moved since I got up, although I've processed a lot of jobs from it during the day." "Just got back from work & looked at the Server Status page and was thinking that file had been there for a while, yet I haven't had any Arecibo work (other than resends) for a while now." Cheers. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
01oc19ac still sitting there. Maybe the planned weekly outage will give it a nudge? Grant Darwin NT |
Unixchick Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
01oc19ac is still sitting there. It can't finish for some reason. 06oc19aa unfortunately has had an AP splitting error. Edit: 06oc19aa did split two channels with no errors, so I guess it is just a bad channel of data, and no real panic. Edit 2: maybe 01oc19ac has spit out some data after being stalled. I wonder if someone at SETI gave the process a kick, or if it is just the fact I posted that fixed it :-) |