The Server Issues / Outages Thread - Panic Mode On! (117)

Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2014494 - Posted: 7 Oct 2019, 1:30:28 UTC - in response to Message 2014492.  

I sure hope they give everyone an explanation of what's going on...


ID: 2014494
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2014497 - Posted: 7 Oct 2019, 1:54:58 UTC

Somebody has obviously sorted the uploads out, because as I write this, results are being returned at an astonishing rate - 802,639. I cannot recall ever seeing the return rate that high.
ID: 2014497
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2014499 - Posted: 7 Oct 2019, 2:11:54 UTC - in response to Message 2014497.  

Now all the hosts are trying to report work and get new work.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2014499
FurryGuy
Volunteer tester
Joined: 1 Jun 04
Posts: 6
Credit: 9,294,513
RAC: 1
United States
Message 2014505 - Posted: 7 Oct 2019, 3:47:51 UTC - in response to Message 2014499.  

Now all the hosts are trying to report work and get new work.

I got one GPU work unit that took almost 30 minutes to download.

Average run time for GPU work units, less than 10 minutes.

This is going to be a loooooooooong catch up period.
ID: 2014505
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2014506 - Posted: 7 Oct 2019, 4:07:46 UTC - in response to Message 2014505.  

This is going to be a loooooooooong catch up period.
Looks a lot better now. Download stalls have mostly resolved. Still a ways to go to get full caches, maybe another hour.
ID: 2014506
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2014514 - Posted: 7 Oct 2019, 6:37:57 UTC - in response to Message 2014473.  

They have added an Arecibo file (01oc19ac) to be split. It is happily splitting, but of course we can't have any of the WUs.
It is really hard to see the large RTS queues and not get to have any.

edit - on the bright side it means someone is fiddling with the machine, but unfortunately not in a helpful way.


. . I'll tell you what Keith will also tell you. The Arecibo data is configured to auto-download and auto-mount on the splitters, so that was probably done entirely without human intervention. If it were archival data, I would suspect that someone had a hand in it.

Stephen

. .
ID: 2014514
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2014515 - Posted: 7 Oct 2019, 6:41:20 UTC - in response to Message 2014494.  

I sure hope they give everyone an explanation of what's going on...



. . Cynic ...

:)

. . +1

Stephen

:)
ID: 2014515
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2014516 - Posted: 7 Oct 2019, 6:42:28 UTC - in response to Message 2014497.  
Last modified: 7 Oct 2019, 6:44:18 UTC

Somebody has obviously sorted the uploads out, because as I write this, results are being returned at an astonishing rate - 802,639. I cannot recall ever seeing the return rate that high.


. . Is there an upload server listed named muarae2 ???

. . I have just restarted the 1st of my 5 rigs and immediately got new work ...

Stephen

??
ID: 2014516
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2014519 - Posted: 7 Oct 2019, 7:03:45 UTC - in response to Message 2014516.  

Muarae2 is the new upload server they have deployed at Beta. Uses SSD storage and a lot more memory. The one that Richard posted an image of when he was visiting this summer.
Looks like it has finally made its appearance at Main.

And yes, any new data from Arecibo this year is automounted, since they have a bigger pipeline from Arecibo after the repairs from the hurricane, if I remember correctly.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2014519
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2014520 - Posted: 7 Oct 2019, 7:17:04 UTC - in response to Message 2014519.  

Muarae2 is the new upload server they have deployed at Beta. Uses SSD storage and a lot more memory. The one that Richard posted an image of when he was visiting this summer.
Looks like it has finally made its appearance at Main.
That would explain the doubling of Received-last-hour numbers compared to its usual values after the systems came back up. Hopefully the File deleter & File purge duties have been moved over as well, and the rest of the system should get a bit of an improvement in overall performance.
*fingers crossed*
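
For anyone wondering what moving those duties over would actually involve on the server side: a BOINC project's back-end daemons are listed in its config.xml, and each daemon entry can, as far as I remember, name the host it should run on. A rough sketch only - the hostname placement and flags are from memory, not the project's actual configuration:

  <daemons>
    <daemon>
      <cmd>file_deleter -d 2</cmd>
      <host>muarae2</host>  <!-- whichever host is meant to own the deleter -->
    </daemon>
    <daemon>
      <cmd>db_purge -min_age_days 7 -d 2</cmd>
      <host>muarae2</host>
    </daemon>
  </daemons>

The start/stop scripts only launch a daemon on the host it is assigned to, so in principle the deleter and purger could be shifted to the new machine without touching anything else.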
Grant
Darwin NT
ID: 2014520
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2014521 - Posted: 7 Oct 2019, 7:34:30 UTC - in response to Message 2014520.  
Last modified: 7 Oct 2019, 7:35:14 UTC

Good morning all. Obviously going to bed fixed it this time...

Came back to the console to find one machine stalled on downloads, all others crunching. One click to retry and we're in business.

All my uploads yesterday went through as normal as each task finished. I think the massive number on the SSP will be the number reported each hour - reflecting the massive backlog of tasks waiting to report. I don't think it tells us anything about Muarae2. No reply from the lab about the cause yet.
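
For anyone who would rather give that nudge from a terminal than click Retry in the Manager, boinccmd can do the same thing - a rough sketch, with the file name invented for the example:

  # list the transfers that are currently stuck
  boinccmd --get_file_transfers

  # retry one of them (use the project URL and file name exactly as listed above)
  boinccmd --file_transfer http://setiathome.berkeley.edu/ some_workunit_file retry

  # or just prod the scheduler to report finished tasks and request new work
  boinccmd --project http://setiathome.berkeley.edu/ update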
ID: 2014521
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2014528 - Posted: 7 Oct 2019, 11:56:17 UTC - in response to Message 2014521.  
Last modified: 7 Oct 2019, 11:58:38 UTC

Good morning all. Obviously going to bed fixed it this time...


I'm going to lay this square on your shoulders then, Richard. Next time, don't stay up so bloody late! We need our work units. Although, confessions be told, I went to work for 7 hours, and may have had a hand in it as well.

Cheers,
Guy
ID: 2014528
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2014531 - Posted: 7 Oct 2019, 13:12:49 UTC - in response to Message 2014528.  

As I expected with the system timeout, it looks like a pretty dramatic decrease in the pending validations column. I'd been averaging in the low 700s of tasks pending, which had grown to the low-to-mid 800s over the last month. Sitting at 650ish now, but I'm very hesitant to say anything has been "fixed" at this point. The positive note is that this still appears to be decreasing. The trend appears positive.
ID: 2014531
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2014564 - Posted: 7 Oct 2019, 19:24:15 UTC

The mystery deepens. New tape 01oc19ac, having helped to get us re-started, now seems to have stalled. I don't think it's moved since I got up, although I've processed a lot of jobs from it during the day.

I sent the lab my table of timings, plus the observation "The first sign of failure is a timeout on a scheduler request, followed by a complete failure to connect". I've just had a reply back: "Thank you for this information. Very helpful in debugging". Very gnomic.

I think we can take it that they are aware of the problem, but haven't found a cause yet. So, if you have any more observed symptoms or logs you think it might be helpful to pass on, please post them here.
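
If anyone wants to capture more detail than the default Event Log shows, a few flags in cc_config.xml in the BOINC data directory should do it - this is a sketch from memory, so check the flag names against the client documentation before relying on it:

  <cc_config>
    <log_flags>
      <sched_op_debug>1</sched_op_debug>    <!-- details of each scheduler request and reply -->
      <http_debug>1</http_debug>            <!-- low-level HTTP traffic, including timeouts -->
      <file_xfer_debug>1</file_xfer_debug>  <!-- per-transfer progress for uploads and downloads -->
    </log_flags>
  </cc_config>

Re-read the config files from the Manager (or restart the client) and the extra lines appear in the Event Log with timestamps, which is exactly the sort of thing that would help pin down the "timeout, then failure to connect" pattern.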
ID: 2014564
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2014566 - Posted: 7 Oct 2019, 20:24:35 UTC - in response to Message 2014564.  

I've completely reversed course on pending validations too. Almost at 700 again, but that would seem nominal for my particular output. I don't even remember who brought that particular item to the table, but it would be interesting if they have observations of their own.
ID: 2014566
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2014571 - Posted: 7 Oct 2019, 22:02:35 UTC

Thanks, Richard, for communicating with the lab. I think we come up with good definitions of problems and stuck files and such on the panic thread, and I had hoped someone from the lab was looking at this thread now and then to see this info, but I guess they aren't.

I created the data chat thread in hopes of keeping this thread only for true panic situations, so that someone from the lab could look at this thread without being overloaded, but again it looks like they don't.

I'm happy our "pain" and sleuthing have given them a starting point.
ID: 2014571
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2014605 - Posted: 8 Oct 2019, 6:15:20 UTC - in response to Message 2014564.  

The mystery deepens. New tape 01oc19ac, having helped to get us re-started, now seems to have stalled. I don't think it's moved since I got up, although I've processed a lot of jobs from it during the day.
Just got back from work & looked at the Server Status page and was thinking that file had been there for a while, yet I haven't had any Arecibo work (other than resends) for a while now.
Grant
Darwin NT
ID: 2014605
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 2014606 - Posted: 8 Oct 2019, 6:22:18 UTC - in response to Message 2014605.  

The mystery deepens. New tape 01oc19ac, having helped to get us re-started, now seems to have stalled. I don't think it's moved since I got up, although I've processed a lot of jobs from it during the day.
Just got back from work & looked at the Server Status page and was thinking that file had been there for a while, yet I haven't had any Arecibo work (other than resends) for a while now.
I filled up on them straight after yesterday's outrage, and I still have a dozen or so of them waiting for my 2500K's cores to get around to them.

Cheers.
ID: 2014606
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2014616 - Posted: 8 Oct 2019, 10:29:12 UTC

01oc19ac still sitting there. Maybe the planned weekly outage will give it a nudge?
Grant
Darwin NT
ID: 2014616
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2014636 - Posted: 8 Oct 2019, 17:05:54 UTC
Last modified: 8 Oct 2019, 17:51:22 UTC

01oc19ac is still sitting there. It can't finish for some reason.
06oc19aa unfortunately has had an AP splitting error.

edit - 06oc19aa did split two channels with no errors, so I guess it is just a bad channel of data, and no real panic.
edit2 - maybe 01oc19ac has spit out some data after being stalled. I wonder if someone at SETI gave the process a kick, or if it was just the fact that I posted about it that fixed it :-)
ID: 2014636