The Server Issues / Outages Thread - Panic Mode On! (118)

Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030474 - Posted: 2 Feb 2020, 5:25:00 UTC - in response to Message 2030471.  
Last modified: 2 Feb 2020, 5:53:47 UTC

Until we can get "Results returned and awaiting validation" down to around 3.5 million (given the present amount of Work in progress- so 7 million to go), and the "Workunits waiting for assimilation" back down to 0 (3.7 million to go), any new work just causes those numbers to climb.
If the underlying problem is not fixed, the numbers will just start growing again no matter how low they were driven.

Apparently the splitters are occasionally running in such short bursts that the SSP can't catch them. I got a small bunch of freshly split _0s and _1s. Mostly noise bombs.
ID: 2030474
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030478 - Posted: 2 Feb 2020, 5:59:30 UTC - in response to Message 2030474.  
Last modified: 2 Feb 2020, 6:02:57 UTC

If the underlying problem is not fixed, the numbers will just start growing again no matter how low they were driven.
Yep.
It appears we've just about finished all the BLC35 noise bombs***. And there is now a fix for the AMD RX 5000 card issues.
While the increased server-side limits didn't help things, it was those 2 issues that really brought things undone. The way to stop dodgy results getting into the science database was to require more than 1 wingman to verify a noisy WU result; combined with files that were producing almost nothing but noise bombs, the size of the database exploded as the hardware just couldn't keep up with the load. And there may have been other performance-related issues that contributed to the initial rapid expansion of the database & the corresponding excruciatingly slow recovery.


Having said that, it shows that we really do need new hardware in order to meet (not too distant) future workloads (let alone the continuing upload & download server issues).


Edit-
*** Having said that, there's still a big heap of them to come (there were that many noisy files).
Grant
Darwin NT
ID: 2030478
Profile Peter

Joined: 12 Feb 14
Posts: 19
Credit: 1,385,738
RAC: 6
Slovakia
Message 2030488 - Posted: 2 Feb 2020, 9:43:56 UTC
Last modified: 2 Feb 2020, 9:44:28 UTC

Yeaaaaah, a lot of tasks for CPU and CPU+GPU are now waiting :)
ID: 2030488
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2030490 - Posted: 2 Feb 2020, 10:04:33 UTC - in response to Message 2030487.  

Edit: Except for the replica, which is now 5.91 hours behind, and it's getting worse with each update of the SSP. :-(


Fun time, I just config'd graphs for replica:
https://munin.kiska.pw/munin/Munin-Node/Munin-Node/replica_setiathome.html

This should make Grant happy :D
ID: 2030490
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030493 - Posted: 2 Feb 2020, 10:16:44 UTC - in response to Message 2030488.  

Yeaaaaah, a lot of tasks for CPU and CPU+GPU are now waiting :)
It's nice to get work, but it would have been nicer (given how things are at present) for the backlogs to be a few more million down before that happened.
Grant
Darwin NT
ID: 2030493
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2030495 - Posted: 2 Feb 2020, 10:27:09 UTC - in response to Message 2030490.  
Last modified: 2 Feb 2020, 10:33:05 UTC

This should make Grant happy :D
Very nice.
Now, if the "Results returned and awaiting validation" were on the same graph as the "Results out in the field" for both MB & AP, it'd be perfect. They're the same order of magnitude as each other (millions for MB, hundreds of thousands for AP), whereas the Assimilation & Deletion numbers are usually around 0 when things aren't broken, so having values in the millions on that graph makes it harder to see what's been going on with the smaller values.

Oh, and the "Workunits waiting for db purging" and "Results waiting for db purging" could also go on the "Results returned and awaiting validation" and "Results out in the field" graph (or have their own).
Pretty please. Pretty please with a cherry on top.
Grant
Darwin NT
ID: 2030495
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2030499 - Posted: 2 Feb 2020, 11:49:30 UTC - in response to Message 2030495.  

This should make Grant happy :D
Very nice.
Now, if the "Results returned and awaiting validation" were on the same graph as the "Results out in the field" for both MB & AP, it'd be perfect. They're the same order of magnitude as each other (millions for MB, hundreds of thousands for AP), whereas the Assimilation & Deletion numbers are usually around 0 when things aren't broken, so having values in the millions on that graph makes it harder to see what's been going on with the smaller values.

Oh, and the "Workunits waiting for db purging" and "Results waiting for db purging" could also go on the "Results returned and awaiting validation" and "Results out in the field" graph (or have their own).
Pretty please. Pretty please with a cherry on top.


Once it starts populating :D
https://munin.kiska.pw/munin/Munin-Node/Munin-Node/results_setiathomev8_in_progress_validation.html

Remind me to do the other stuff later
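
For the combined graphs you're after, a multi-field Munin plugin roughly like this would do it (only a sketch; get_ssp_counts() is a hypothetical stand-in for however the existing graphs already scrape the server status page):

#!/usr/bin/env python3
# Sketch of a Munin plugin that puts two SSP series on one graph.
import sys

def get_ssp_counts():
    # Hypothetical helper: return (results out in the field, results awaiting validation),
    # fetched however the existing munin-node setup already gets its SSP numbers.
    raise NotImplementedError("plug in your SSP scraper here")

def config():
    print("graph_title SETI@home MB results")
    print("graph_vlabel results")
    print("graph_category seti")
    print("in_field.label Results out in the field")
    print("awaiting_validation.label Results returned and awaiting validation")

def values():
    in_field, awaiting = get_ssp_counts()
    print(f"in_field.value {in_field}")
    print(f"awaiting_validation.value {awaiting}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "config":
        config()
    else:
        values()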
ID: 2030499
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030501 - Posted: 2 Feb 2020, 12:26:41 UTC - in response to Message 2030478.  

And there is now a fix for the AMD RX 5000 card issues.
They can force only 'vanilla' hosts to upgrade their apps. So they can't really revert the triple validation kludge for overflow results before enough of the anonymous platform hosts have updated their apps to make the risk of a task getting sent to two bad hosts tiny enough to be acceptable.

Unless they can 'blacklist' AMD GPUs from receiving the _1 if the corresponding _0 was sent to one. But I don't think the system supports this, because if it did, they would have already done it instead of using this triple validation kludge - which isn't even 100% watertight because there's still the risk of all three going to bad hosts.
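
Back of the envelope (my own rough model, assuming hosts are picked independently at random - not project data): if a fraction p of active hosts produce broken overflow results, then roughly p^2 of two-host quorums and p^3 of three-host quorums end up with nothing but bad results.

def all_bad_probability(bad_fraction: float, quorum: int) -> float:
    """Chance that every host in a quorum is 'bad', assuming independent random selection."""
    return bad_fraction ** quorum

# Example inputs only: 5% bad hosts -> 0.25% of quorum-2 WUs, 0.0125% of quorum-3 WUs.
# print(all_bad_probability(0.05, 2), all_bad_probability(0.05, 3))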
ID: 2030501
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2030502 - Posted: 2 Feb 2020, 12:36:27 UTC
Last modified: 2 Feb 2020, 12:38:28 UTC

I am waiting and waiting to have the website confirm that I have a full cache.

Everything is running Seti@Home except for three weather forecast tasks from WCG.

Eyeballing it, it looks like I have a full set of CPU tasks and a less-than-full set of GPU tasks. But all the GPUs are engaged, and I think I may have 150 GPU tasks, so hopefully it will stay that way.

Apparently the Replica DB is "just a bit behind". It just reported I have 6 tasks in progress.

I know I have to take off my shoes to count past 10 but I am sure I have more than "6" :)

Here it is Sunday morning, and I/we? are finally getting a steady flow of tasks?

Tom
A proud member of the OFA (Old Farts Association).
ID: 2030502
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030505 - Posted: 2 Feb 2020, 12:41:50 UTC - in response to Message 2030495.  

Now, if the "Results returned and awaiting validation" were on the same graph as the "Results out in the field" for both MB & AP it'd be perfect
Actually, one of the more interesting graphs would be the SUM of 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging' for both MB & AP. That is, all eight fields in one sum.

This would be the number of results in the database. The value that Eric said has to be kept under 20 million to avoid the result table spilling out of RAM. It is now 18.9 million.
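
In code terms it's just this (a sketch; the key names below are made-up labels for the eight SSP rows, not anything official):

RESULT_STATES = (
    "ready_to_send",
    "out_in_the_field",
    "returned_and_awaiting_validation",
    "waiting_for_db_purging",
)

def result_table_size(ssp_counts):
    """ssp_counts maps (app, state) -> count, e.g. ('mb', 'ready_to_send') -> 1234567."""
    return sum(ssp_counts[(app, state)] for app in ("mb", "ap") for state in RESULT_STATES)

def headroom(ssp_counts, limit=20_000_000):
    """How far below the ~20 million ceiling the result table currently sits."""
    return limit - result_table_size(ssp_counts)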

Those 71 ancient zombie S@Hv7 results appear to have finally been purged!
ID: 2030505
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030507 - Posted: 2 Feb 2020, 12:56:53 UTC - in response to Message 2030502.  
Last modified: 2 Feb 2020, 12:58:10 UTC

I am waiting and waiting to have the website confirm that I have a full cache.
Do what I did: Write a program that reads the client_state.xml and reports the number of tasks for CPU and GPU. That way you can easily see how full your queues are and you don't need the website for that, so it works even during the out(r)ages.

And the data will always be fresh no matter how far behind the replica db is.
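
Something along these lines is enough (a minimal sketch in Python; the state-file path and the plan_class matching are my assumptions and may differ with your BOINC version and data directory, and it counts tasks from every attached project):

import xml.etree.ElementTree as ET

# Assumed default Linux location of the BOINC data directory - adjust as needed.
STATE_FILE = "/var/lib/boinc-client/client_state.xml"
# Plan classes containing these substrings are assumed to be GPU apps.
GPU_HINTS = ("cuda", "opencl", "ati", "nvidia")

def count_tasks(path=STATE_FILE):
    """Count the tasks listed in client_state.xml, split into CPU and GPU."""
    root = ET.parse(path).getroot()
    cpu = gpu = 0
    for result in root.iter("result"):
        plan = (result.findtext("plan_class") or "").lower()
        if any(hint in plan for hint in GPU_HINTS):
            gpu += 1
        else:
            cpu += 1
    return cpu, gpu

if __name__ == "__main__":
    cpu, gpu = count_tasks()
    print(f"CPU tasks: {cpu}, GPU tasks: {gpu}")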
ID: 2030507
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030508 - Posted: 2 Feb 2020, 13:07:21 UTC - in response to Message 2030507.  

I think BoincTasks can do that, as well.
ID: 2030508
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2030512 - Posted: 2 Feb 2020, 13:42:54 UTC - in response to Message 2030508.  

I think BoincTasks can do that, as well.

Quite well, in fact.
ID: 2030512
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2030513 - Posted: 2 Feb 2020, 13:43:39 UTC

And, at least for the moment, the floodgates appear to have opened.
ID: 2030513
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030523 - Posted: 2 Feb 2020, 14:54:04 UTC

Something has changed. The floodgates are wide open but the assimilation queue is still getting smaller.
ID: 2030523
Profile Chris904395093209d Project Donor
Volunteer tester

Joined: 1 Jan 01
Posts: 112
Credit: 29,923,129
RAC: 6
United States
Message 2030524 - Posted: 2 Feb 2020, 15:00:34 UTC

I'm not seeing the '71' under the S@H V7 column on the server status page. Did those finally get cleaned up in the dbase?
~Chris

ID: 2030524
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 716
Credit: 8,032,827
RAC: 62
France
Message 2030525 - Posted: 2 Feb 2020, 15:03:31 UTC


02-Feb-2020 15:51:01 [SETI@home] Sending scheduler request: To fetch work.
02-Feb-2020 15:51:01 [SETI@home] Requesting new tasks for CPU and AMD/ATI GPU
02-Feb-2020 15:51:06 [SETI@home] Scheduler request completed: got 124 new tasks


UTC+1 ^^
ID: 2030525
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030527 - Posted: 2 Feb 2020, 15:04:47 UTC - in response to Message 2030524.  

I'm not seeing the '71' under the S@H V7 column on the server status page. Did those finally get cleaned up in the dbase?


It appears they did... the purging queue has fallen by half, so work generation is back as the result table is well below 20M.
ID: 2030527
juan BFP Crowdfunding Project Donor*Special Project $75 donor*Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2030529 - Posted: 2 Feb 2020, 15:09:49 UTC - in response to Message 2030527.  
Last modified: 2 Feb 2020, 15:44:09 UTC

I'm not seeing the '71' under the S@H V7 column on the server status page. Did those finally get cleaned up in the dbase?


It appears they did... the purging queue has fallen by half, so work generation is back as the result table is well below 20M.

Maybe it's time to start shortening the deadlines of the WUs and making some changes to the way the work is distributed, like sending the resends to the fastest hosts to clear them ASAP. Or we will be trapped in an endless loop of no new work each time the total reaches 20 MM.
ID: 2030529
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030530 - Posted: 2 Feb 2020, 15:12:40 UTC - in response to Message 2030529.  

Or we will be trapped in an endless loop of no new work each time the total reaches 20 MM.


A possible explanation of why this has only been happening recently is here... Briefly: quorum=3 for overflows, coupled with BLC35 files which generate little except overflows.
ID: 2030530