The Server Issues / Outages Thread - Panic Mode On! (118)

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030425 - Posted: 2 Feb 2020, 0:55:39 UTC - in response to Message 2030409.  

Yeah Matt is really missed in times like these and he would've had those MBv7's put to bed long ago.
Wait, I've been out of the loop for a while. Matt left?
Yeah Matt has been over at the Breakthrough Listen project for a couple of years now.
Or maybe just at the Breakthrough Listen office down on Campus - that's where I found him back in July.

https://i.imgur.com/Lninw9X.jpg
And now and again out at Parkes.

Cheers.


. . Not recently I don't think :(

Stephen

? ?
ID: 2030425
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030426 - Posted: 2 Feb 2020, 0:56:48 UTC - in response to Message 2030410.  

A word of warning- if you do manage to score some work, be prepared to have to Retry pending transfers a few hundred (it feels like a thousand) times.
The download servers are borked as well as everything else.


. . been there, done that, over and over and over and ... oh what the heck ...

Stephen

:(
ID: 2030426
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 2030433 - Posted: 2 Feb 2020, 1:32:58 UTC

Just ran across something new.

I looked at my pending page and found a workunit at random.
It said "completed, waiting validation".

But when I clicked on it to see the status, I found this


https://setiathome.berkeley.edu/workunit.php?wuid=3863183873

Reported by 3 units and validated.

Something is borked.
Dave

ID: 2030433
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11451
Credit: 29,581,041
RAC: 66
United States
Message 2030436 - Posted: 2 Feb 2020, 1:59:53 UTC - in response to Message 2030433.  

That's not unusual. The first 2 tasks did not match but were close. The 3rd was close enough to validate the first 2.
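Roughly, the validation logic works like this (a simplified sketch of the quorum idea, not the actual BOINC/SETI validator code; results_match() here stands in for whatever "close enough" comparison the project uses):

def results_match(a, b, tolerance=0.01):
    # Hypothetical fuzzy comparison standing in for the project's
    # "close enough" check between two returned results.
    return abs(a - b) <= tolerance

def find_canonical(results, min_quorum=2):
    # Pick the result that the largest number of results (including
    # itself) agree with; it becomes the canonical result once the
    # quorum is reached, otherwise another task gets sent out.
    if len(results) < min_quorum:
        return None
    best = max(results, key=lambda c: sum(results_match(c, r) for r in results))
    agreeing = sum(results_match(best, r) for r in results)
    return best if agreeing >= min_quorum else None

def validate(results, min_quorum=2):
    # Every result that agrees with the canonical one is validated.
    canonical = find_canonical(results, min_quorum)
    if canonical is None:
        return []
    return [r for r in results if results_match(canonical, r)]

# The case above: two results were close but not matching, a third
# agreed with both of them, and then all three validated together.
print(validate([1.000, 1.018]))         # [] - no agreement yet
print(validate([1.000, 1.018, 1.009]))  # all three validate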
ID: 2030436
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2030437 - Posted: 2 Feb 2020, 2:01:52 UTC - in response to Message 2030433.  

Reported by 3 units and validated.

Something is borked.

Side effect of the minimum 3 quorum for early overflows and the flakey AMD card problem.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2030437
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2030439 - Posted: 2 Feb 2020, 2:23:01 UTC

I am getting a very small flow of Seti@Home tasks.

So I have NNTed both E@H and WCG in hopes of sucking more down :)
Some of the weather tasks run more than half a day, so it will take a while to whittle about half of them down.

Tom
A proud member of the OFA (Old Farts Association).
ID: 2030439
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 2030459 - Posted: 2 Feb 2020, 3:34:33 UTC - in response to Message 2030433.  

I guess I was not clear.

Looking at my pending page, it says that the workunit is "completed, waiting validation"

YET

looking at the workunit, it says validated.

Just ran across something new.

I looked at my pending page and found a workunit at random.
It said "completed, waiting validation".

But when I clicked on it to see the status, I found this


https://setiathome.berkeley.edu/workunit.php?wuid=3863183873

Reported by 3 units and validated.

Something is borked.

Dave

ID: 2030459
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030462 - Posted: 2 Feb 2020, 3:57:11 UTC - in response to Message 2030433.  
Last modified: 2 Feb 2020, 4:02:33 UTC

Just ran across something new.

I looked at my pending page and found a workunit at random.
It said "completed, waiting validation".

But when I clicked on it to see the status, I found this


https://setiathome.berkeley.edu/workunit.php?wuid=3863183873

Reported by 3 units and validated.

Something is borked.


. . Not sure what your issue is with that WU. Normally a WU will linger in the system for approx 24 hours after validation; currently, with the problems everyone has been discussing and is very concerned about, they are hanging about for much longer. That unit has only just validated, so I would not expect to wave goodbye to it for a day or 3 yet ...

. . OH, maybe you missed the discussion of the change that was introduced because of the NAVI cards, whereby overflow results are being triple-checked for validation.

{edit}
. . The misleading listing on your stats page may be due to the lag with the replica database??

Stephen

:(
ID: 2030462
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2030463 - Posted: 2 Feb 2020, 3:58:40 UTC - in response to Message 2030459.  

Yes, things are borked. That is what this thread has been discussing for the past two weeks. Also the replica database is 8000 seconds behind. So what you see on your page is already 2 hours old.

There is nothing normal about the current situation so no reason to expect normal classifications. I would just not worry about it since there is nothing you can do on your end to change anything.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2030463
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2030464 - Posted: 2 Feb 2020, 4:33:30 UTC

There will also be weirdness from the replica db not being caught up with the main db.
ID: 2030464
Cruncher-American · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 2030465 - Posted: 2 Feb 2020, 4:42:57 UTC

Tired of this current snafu. Shut my crunchers down and will hold off until SETI has a couple of days of normal work flow. Good crunching, all.
ID: 2030465
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030468 - Posted: 2 Feb 2020, 5:09:37 UTC - in response to Message 2030373.  

I am seeing no reductions in the size of the database with all the task counts at all time highs. Nothing is going to happen until we fall below the magic 20M number.
We fell below that at 02:50 UTC, but nothing has happened yet at the splitters.

Assimilation queue seems to be slowly going down now. Two steps down, one up.
ID: 2030468
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13987
Credit: 208,696,464
RAC: 304
Australia
Message 2030471 - Posted: 2 Feb 2020, 5:20:08 UTC - in response to Message 2030468.  

Assimilation queue seems to be slowly going down now. Two steps down, one up.
Until we can get "Results returned and awaiting validation" down to around 3.5 million (given the present amount of Work in progress- so 7 million to go), and the "Workunits waiting for assimilation" back down to 0 (3.7 million to go), any new work just causes those numbers to climb.
And ideally we'd want the purge numbers to be within about 500k of the In progress numbers (I think that was the general ballpark when things were working).
Grant
Darwin NT
ID: 2030471
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030474 - Posted: 2 Feb 2020, 5:25:00 UTC - in response to Message 2030471.  
Last modified: 2 Feb 2020, 5:53:47 UTC

Until we can get "Results returned and awaiting validation" down to around 3.5 million (given the present amount of Work in progress- so 7 million to go), and the "Workunits waiting for assimilation" back down to 0 (3.7 million to go), any new work just causes those numbers to climb.
If the underlying problem is not fixed, the numbers will just start growing again no matter how low they were driven.

Apparently the splitters are occasionally running in such short bursts that the SSP can't catch them. I got a small bunch of freshly split _0s and _1s. Mostly noise bombs.
ID: 2030474
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13987
Credit: 208,696,464
RAC: 304
Australia
Message 2030478 - Posted: 2 Feb 2020, 5:59:30 UTC - in response to Message 2030474.  
Last modified: 2 Feb 2020, 6:02:57 UTC

If the underlying problem is not fixed, the numbers will just start growing again no matter how low they were driven.
Yep.
It appears we've just about finished all the BLC35 noise bombs***. And there is now a fix for the AMD RX 5000 card issues.
While the increased server-side limits didn't help things, it was those 2 issues that really brought things undone- as the way to stop dodgy results getting into the science database was to require more than 1 wingman to verify a noisy WU result. Combined with files that were producing almost nothing but noise bombs, the size of the database exploded as the hardware just couldn't keep up with the load. And there may have been other performance-related issues that contributed to the initial rapid database expansion & the corresponding excruciatingly slow recovery.


Having said that, it shows that we really do need new hardware in order to meet (not too distant) future workloads (let alone the continuing upload & download server issues).


Edit-
*** Having said that, there's still a big heap of them to come (there were that many noisy files there).
Grant
Darwin NT
ID: 2030478
Profile Peter

Joined: 12 Feb 14
Posts: 19
Credit: 1,385,738
RAC: 6
Slovakia
Message 2030488 - Posted: 2 Feb 2020, 9:43:56 UTC
Last modified: 2 Feb 2020, 9:44:28 UTC

Yeaaaaah, a lot of tasks for CPU and CPU+GPU are now waiting :)
ID: 2030488
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2030490 - Posted: 2 Feb 2020, 10:04:33 UTC - in response to Message 2030487.  

Edit: Except for the replica, which is now 5.91 hours behind, and it's getting worse with each update of the SSP. :-(


Fun time, I just config'd graphs for replica:
https://munin.kiska.pw/munin/Munin-Node/Munin-Node/replica_setiathome.html

This should make Grant happy :D
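
(For anyone wondering what's behind a graph like that: a Munin plugin is just a small script that prints graph metadata when called with "config" and a current value otherwise. A minimal sketch of the shape, assuming the lag figure gets handed in through a REPLICA_LAG_SECONDS environment variable by whatever scrapes the server status page - this is a guess, not Kiska's actual setup:)

#!/usr/bin/env python3
# Minimal sketch of a Munin plugin graphing replica database lag.
import os
import sys

if len(sys.argv) > 1 and sys.argv[1] == "config":
    # Munin calls the plugin with "config" to learn how to draw the graph.
    print("graph_title SETI@home replica database lag")
    print("graph_vlabel seconds behind master")
    print("graph_category seti")
    print("lag.label replica lag")
else:
    # Normal invocation: emit the current value ("U" means unknown to Munin).
    print(f"lag.value {os.environ.get('REPLICA_LAG_SECONDS', 'U')}")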
ID: 2030490
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13987
Credit: 208,696,464
RAC: 304
Australia
Message 2030493 - Posted: 2 Feb 2020, 10:16:44 UTC - in response to Message 2030488.  

Yeaaaaah, a lot of tasks for CPU and CPU+GPU are now waiting :)
It's nice to get work, but it would have been nicer (given how things are at present) for the backlogs to be a few more million down before that happened.
Grant
Darwin NT
ID: 2030493
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13987
Credit: 208,696,464
RAC: 304
Australia
Message 2030495 - Posted: 2 Feb 2020, 10:27:09 UTC - in response to Message 2030490.  
Last modified: 2 Feb 2020, 10:33:05 UTC

This should make Grant happy :D
Very nice.
Now, if the "Results returned and awaiting validation" were on the same graph as the "Results out in the field" for both for MB & AP it'd be perfect (they're the same order of magnitude as each other- millions for MB and hundreds of thousands for AP, whereas the Assimilation & Deletion numbers are (when things aren't broken) usually around 0 so with the values in their millions there it makes it harder to see what's been going on with the smaller values).

Oh, and the "Workunits waiting for db purging" and "Results waiting for db purging" could also go on the "Results returned and awaiting validation" and "Results out in the field" graph (or have their own).
Pretty please. Pretty please with a cherry on top.
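
(In Munin terms that's just a matter of which fields get declared together in one plugin's "config" output - a rough sketch along the same lines as above, again assuming the counts arrive via environment variables set by a status-page scraper rather than showing anyone's real setup:)

#!/usr/bin/env python3
# Rough sketch: one Munin graph carrying both millions-scale MB series,
# leaving the near-zero assimilation/deletion numbers for a separate graph.
import os
import sys

FIELDS = {
    "awaiting_validation": "Results returned and awaiting validation",
    "in_field": "Results out in the field",
}

if len(sys.argv) > 1 and sys.argv[1] == "config":
    print("graph_title SETI@home MB results (millions scale)")
    print("graph_vlabel results")
    print("graph_category seti")
    for name, label in FIELDS.items():
        print(f"{name}.label {label}")
else:
    for name in FIELDS:
        # e.g. AWAITING_VALIDATION / IN_FIELD environment variables.
        print(f"{name}.value {os.environ.get(name.upper(), 'U')}")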
Grant
Darwin NT
ID: 2030495
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2030499 - Posted: 2 Feb 2020, 11:49:30 UTC - in response to Message 2030495.  

This should make Grant happy :D
Very nice.
Now, if the "Results returned and awaiting validation" were on the same graph as the "Results out in the field" for both for MB & AP it'd be perfect (they're the same order of magnitude as each other- millions for MB and hundreds of thousands for AP, whereas the Assimilation & Deletion numbers are (when things aren't broken) usually around 0 so with the values in their millions there it makes it harder to see what's been going on with the smaller values).

Oh, and the "Workunits waiting for db purging" and "Results waiting for db purging" could also go on the "Results returned and awaiting validation" and "Results out in the field" graph (or have their own).
Pretty please. Pretty please with a cherry on top.


Once it starts populating :D
https://munin.kiska.pw/munin/Munin-Node/Munin-Node/results_setiathomev8_in_progress_validation.html

Remind me to do the other stuff later
ID: 2030499