The Server Issues / Outages Thread - Panic Mode On! (118)

Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030474 - Posted: 2 Feb 2020, 5:25:00 UTC - in response to Message 2030471.  
Last modified: 2 Feb 2020, 5:53:47 UTC

Until we can get "Results returned and awaiting validation" down to around 3.5 million (given the present amount of Work in progress- so 7 million to go), and the "Workunits waiting for assimilation" back down to 0 (3.7 million to go), any new work just causes those numbers to climb.
If the underlying problem is not fixed, the numbers will just start growing again no matter how low they were driven.

Apparently the splitters are occasionally running in such short bursts that the SSP can't catch them. I got a small bunch of freshly split _0s and _1s. Mostly noise bombs.
ID: 2030474
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030478 - Posted: 2 Feb 2020, 5:59:30 UTC - in response to Message 2030474.  
Last modified: 2 Feb 2020, 6:02:57 UTC

If the underlying problem is not fixed, the numbers will just start growing again no matter how low they were driven.
Yep.
It appears we've just about finished all the BLC35 noise bombs***. And there is now a fix for the AMD RX 5000 card issues.
While the increased server-side limits didn't help things, it was those 2 issues that really brought things undone. The way to stop dodgy results getting into the science database was to require more than 1 wingman to verify a noisy WU result; combined with files that were producing almost nothing but noise bombs, the size of the database exploded as the hardware just couldn't keep up with the load. And there may have been other performance-related issues that contributed to the initial rapid expansion of the database & the corresponding excruciatingly slow recovery.


Having said that, it shows that we really do need new hardware in order to meet (not too distant) future workloads (let alone the continuing upload & download server issues).


Edit-
*** Having said that, there's still a big heap of them to come (there were that many noisy files).
Grant
Darwin NT
ID: 2030478
Profile Peter

Joined: 12 Feb 14
Posts: 19
Credit: 1,385,738
RAC: 6
Slovakia
Message 2030488 - Posted: 2 Feb 2020, 9:43:56 UTC
Last modified: 2 Feb 2020, 9:44:28 UTC

Yeaaaaah, a lot of tasks for CPU and CPU+GPU are now waiting :)
ID: 2030488
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2030490 - Posted: 2 Feb 2020, 10:04:33 UTC - in response to Message 2030487.  

Edit: Except for the replica, which is now 5.91 hours behind, and it's getting worse with each update of the SSP. :-(


Fun time, I just config'd graphs for replica:
https://munin.kiska.pw/munin/Munin-Node/Munin-Node/replica_setiathome.html

This should make Grant happy :D
ID: 2030490
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030493 - Posted: 2 Feb 2020, 10:16:44 UTC - in response to Message 2030488.  

Yeaaaaah, a lot of tasks for CPU and CPU+GPU are now waiting :)
It's nice to get work, but it would have been nicer (given how things are at present) for the backlogs to be a few more million down before that happened.
Grant
Darwin NT
ID: 2030493
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030495 - Posted: 2 Feb 2020, 10:27:09 UTC - in response to Message 2030490.  
Last modified: 2 Feb 2020, 10:33:05 UTC

This should make Grant happy :D
Very nice.
Now, if the "Results returned and awaiting validation" were on the same graph as the "Results out in the field" for both MB & AP, it'd be perfect. They're the same order of magnitude as each other (millions for MB, hundreds of thousands for AP), whereas the Assimilation & Deletion numbers are usually around 0 when things aren't broken, so having values in the millions on that graph makes it harder to see what's been going on with the smaller values.

Oh, and the "Workunits waiting for db purging" and "Results waiting for db purging" could also go on the "Results returned and awaiting validation" and "Results out in the field" graph (or have their own).
Pretty please. Pretty please with a cherry on top.
Grant
Darwin NT
ID: 2030495
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2030499 - Posted: 2 Feb 2020, 11:49:30 UTC - in response to Message 2030495.  

This should make Grant happy :D
Very nice.
Now, if the "Results returned and awaiting validation" were on the same graph as the "Results out in the field" for both MB & AP, it'd be perfect. They're the same order of magnitude as each other (millions for MB, hundreds of thousands for AP), whereas the Assimilation & Deletion numbers are usually around 0 when things aren't broken, so having values in the millions on that graph makes it harder to see what's been going on with the smaller values.

Oh, and the "Workunits waiting for db purging" and "Results waiting for db purging" could also go on the "Results returned and awaiting validation" and "Results out in the field" graph (or have their own).
Pretty please. Pretty please with a cherry on top.


Once it starts populating :D
https://munin.kiska.pw/munin/Munin-Node/Munin-Node/results_setiathomev8_in_progress_validation.html

Remind me to do the other stuff later
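
For the combined graphs you're after, a multi-field Munin plugin roughly like this would do it (only a sketch; get_ssp_counts() is a hypothetical stand-in for however the existing graphs already scrape the server status page):

#!/usr/bin/env python3
# Sketch of a Munin plugin that puts two SSP series on one graph.
import sys

def get_ssp_counts():
    # Hypothetical helper: return (results out in the field, results awaiting validation),
    # fetched however the existing munin-node setup already gets its SSP numbers.
    raise NotImplementedError("plug in your SSP scraper here")

def config():
    print("graph_title SETI@home MB results")
    print("graph_vlabel results")
    print("graph_category seti")
    print("in_field.label Results out in the field")
    print("awaiting_validation.label Results returned and awaiting validation")

def values():
    in_field, awaiting = get_ssp_counts()
    print(f"in_field.value {in_field}")
    print(f"awaiting_validation.value {awaiting}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "config":
        config()
    else:
        values()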
ID: 2030499
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030501 - Posted: 2 Feb 2020, 12:26:41 UTC - in response to Message 2030478.  

And there is now a fix for the AMD RX 5000 card issues.
They can force only 'vanilla' hosts to upgrade their apps. So they can't really revert the triple validation kludge for overflow results before enough of the anonymous platform hosts have updated their apps to make the risk of a task getting sent to two bad hosts tiny enough to be acceptable.

Unless they can 'blacklist' AMD GPUs from receiving the _1 if the corresponding _0 was sent to one. But I don't think the system supports this, because if it did, they would have already done it instead of using this triple validation kludge - which isn't even 100% watertight because there's still the risk of all three going to bad hosts.
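
Back of the envelope (my own rough model, assuming hosts are picked independently at random - not project data): if a fraction p of active hosts produce broken overflow results, then roughly p^2 of two-host quorums and p^3 of three-host quorums end up with nothing but bad results.

def all_bad_probability(bad_fraction: float, quorum: int) -> float:
    """Chance that every host in a quorum is 'bad', assuming independent random selection."""
    return bad_fraction ** quorum

# Example inputs only: 5% bad hosts -> 0.25% of quorum-2 WUs, 0.0125% of quorum-3 WUs.
# print(all_bad_probability(0.05, 2), all_bad_probability(0.05, 3))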
ID: 2030501
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2030502 - Posted: 2 Feb 2020, 12:36:27 UTC
Last modified: 2 Feb 2020, 12:38:28 UTC

I am waiting and waiting to have the website confirm that I have a full cache.

Everything is running Seti@Home except for three weather forecast tasks from WCG.

Eyeballing it, it looks like I have a full set of CPU tasks and a less-than-full set of GPU tasks. But all the GPUs are engaged, and I think I may have 150 GPU tasks, so hopefully it will stay that way.

Apparently the Replica DB is "just a bit behind". It just reported I have 6 tasks in progress.

I know I have to take off my shoes to count past 10 but I am sure I have more than "6" :)

Here it is Sunday morning, and I/we? are finally getting a steady flow of tasks?

Tom
A proud member of the OFA (Old Farts Association).
ID: 2030502
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030505 - Posted: 2 Feb 2020, 12:41:50 UTC - in response to Message 2030495.  

Now, if the "Results returned and awaiting validation" were on the same graph as the "Results out in the field" for both MB & AP it'd be perfect
Actually, one of the more interesting graphs would be the SUM of 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging' for both MB & AP. That is, all eight fields in one sum.

This would be the number of results in the database. The value that Eric said has to be kept under 20 million to avoid the result table spilling out of RAM. It is now 18.9 million.
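
In code terms it's just this (a sketch; the key names below are made-up labels for the eight SSP rows, not anything official):

RESULT_STATES = (
    "ready_to_send",
    "out_in_the_field",
    "returned_and_awaiting_validation",
    "waiting_for_db_purging",
)

def result_table_size(ssp_counts):
    """ssp_counts maps (app, state) -> count, e.g. ('mb', 'ready_to_send') -> 1234567."""
    return sum(ssp_counts[(app, state)] for app in ("mb", "ap") for state in RESULT_STATES)

def headroom(ssp_counts, limit=20_000_000):
    """How far below the ~20 million ceiling the result table currently sits."""
    return limit - result_table_size(ssp_counts)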

Those 71 ancient zombie S@Hv7 results appear to have finally been purged!
ID: 2030505
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030507 - Posted: 2 Feb 2020, 12:56:53 UTC - in response to Message 2030502.  
Last modified: 2 Feb 2020, 12:58:10 UTC

I am waiting and waiting to have the website confirm that I have a full cache.
Do what I did: Write a program that reads the client_state.xml and reports the number of tasks for CPU and GPU. That way you can easily see how full your queues are and you don't need the website for that, so it works even during the out(r)ages.

And the data will always be fresh no matter how far behind the replica db is.
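
Something along these lines is enough (a minimal sketch in Python; the state-file path and the plan_class matching are my assumptions and may differ with your BOINC version and data directory, and it counts tasks from every attached project):

import xml.etree.ElementTree as ET

# Assumed default Linux location of the BOINC data directory - adjust as needed.
STATE_FILE = "/var/lib/boinc-client/client_state.xml"
# Plan classes containing these substrings are assumed to be GPU apps.
GPU_HINTS = ("cuda", "opencl", "ati", "nvidia")

def count_tasks(path=STATE_FILE):
    """Count the tasks listed in client_state.xml, split into CPU and GPU."""
    root = ET.parse(path).getroot()
    cpu = gpu = 0
    for result in root.iter("result"):
        plan = (result.findtext("plan_class") or "").lower()
        if any(hint in plan for hint in GPU_HINTS):
            gpu += 1
        else:
            cpu += 1
    return cpu, gpu

if __name__ == "__main__":
    cpu, gpu = count_tasks()
    print(f"CPU tasks: {cpu}, GPU tasks: {gpu}")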
ID: 2030507
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030508 - Posted: 2 Feb 2020, 13:07:21 UTC - in response to Message 2030507.  

I think BoincTasks can do that, as well.
ID: 2030508
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2030512 - Posted: 2 Feb 2020, 13:42:54 UTC - in response to Message 2030508.  

I think BoincTasks can do that, as well.

Quite well, in fact.
ID: 2030512
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2030513 - Posted: 2 Feb 2020, 13:43:39 UTC

And, at least for the moment, the floodgates appear to have opened.
ID: 2030513
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030523 - Posted: 2 Feb 2020, 14:54:04 UTC

Something has changed. The floodgates are wide open but the assimilation queue is still getting smaller.
ID: 2030523
Profile Chris904395093209d Project Donor
Volunteer tester

Joined: 1 Jan 01
Posts: 112
Credit: 29,923,129
RAC: 6
United States
Message 2030524 - Posted: 2 Feb 2020, 15:00:34 UTC

I'm not seeing the '71' under the S@H V7 column on the server status page. Did those finally get cleaned up in the dbase?
~Chris

ID: 2030524
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 716
Credit: 8,032,827
RAC: 62
France
Message 2030525 - Posted: 2 Feb 2020, 15:03:31 UTC


02-Feb-2020 15:51:01 [SETI@home] Sending scheduler request: To fetch work.
02-Feb-2020 15:51:01 [SETI@home] Requesting new tasks for CPU and AMD/ATI GPU
02-Feb-2020 15:51:06 [SETI@home] Scheduler request completed: got 124 new tasks


UTC+1 ^^
ID: 2030525
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030527 - Posted: 2 Feb 2020, 15:04:47 UTC - in response to Message 2030524.  

I'm not seeing the '71' under the S@H V7 column on the server status page. Did those finally get cleaned up in the dbase?


It appears they did... the purging queue has fallen by half, so work generation is back as the result table is well below 20M.
ID: 2030527
juan BFP Crowdfunding Project Donor*Special Project $75 donor*Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2030529 - Posted: 2 Feb 2020, 15:09:49 UTC - in response to Message 2030527.  
Last modified: 2 Feb 2020, 15:44:09 UTC

I'm not seeing the '71' under the S@H V7 column on the server status page. Did those finally get cleaned up in the dbase?


It appears they did... the purging queue has fallen by half, so work generation is back as the result table is well below 20M.

Maybe it's time to start shortening the deadlines of the WUs and making some changes to the way the work is distributed, like sending the resends to the fastest hosts to clear them ASAP. Or we will be trapped in an endless loop of no new work each time the total reaches 20 MM.
ID: 2030529
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030530 - Posted: 2 Feb 2020, 15:12:40 UTC - in response to Message 2030529.  

Or we will be trapped in an endless loop of no new work each time the total reaches 20 MM.


A possible explanation of why this has only been happening recently is here... Briefly: quorum=3 for overflows, coupled with BLC35 files which generate little except overflows.
ID: 2030530