The Server Issues / Outages Thread - Panic Mode On! (118)

Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13751
Credit: 208,696,464
RAC: 304
Australia
Message 2030850 - Posted: 5 Feb 2020, 9:26:10 UTC

And we're back to downloads sticking again.
Grant
Darwin NT
ID: 2030850 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030851 - Posted: 5 Feb 2020, 9:46:26 UTC
Last modified: 5 Feb 2020, 10:02:34 UTC

Hey, this is nice. It seems the same setting that controls the Upload Retries also controls the Download Retries. Instead of download retries being minutes apart, they're seconds apart, and the download 'Project Backoffs' are minutes instead of hours ... this will work.

Except, as usual, we are now Out Of Work, and my machines are still sitting idle.
ID: 2030851 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030855 - Posted: 5 Feb 2020, 10:55:22 UTC - in response to Message 2030809.  
Last modified: 5 Feb 2020, 11:07:50 UTC

Setting nnt until all work is reported has been very effective for me.


. . Reducing the number of tasks reported per request to 99 and setting NNT did not help here ... :(

Stephen

:(
ID: 2030855 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030856 - Posted: 5 Feb 2020, 10:56:21 UTC - in response to Message 2030822.  

I just noticed we are back. And it wasn't a multi-day shutdown, just a basic long Tuesday.

Tom.


. . Hmmmm, 12 hours is a little more than a basic outage :(

Stephen

:(
ID: 2030856 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030857 - Posted: 5 Feb 2020, 10:59:56 UTC - in response to Message 2030838.  

It's been this way for at least 8 years that I'm aware of. It doesn't make any difference whether it runs as Stock or Anonymous. Both of those machines ran as Stock for weeks after the Christmas SNAFU; one is still Stock, and there's no difference between 8 years ago and now. Is your Windows machine full yet? I'm finally getting a few downloads now; hopefully I'll get enough to keep the machines running soon.


. . I didn't start to get more than the odd task or two until 8:30 am UTC. :(

Stephen

:(
ID: 2030857 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030858 - Posted: 5 Feb 2020, 11:02:29 UTC - in response to Message 2030839.  

I'm wondering if this issue with handing out work to some systems and not others is related to the Anonymous Platform issue with the new Scheduler version?
Whatever it is that stops Anonymous Platform hosts from getting work (because other requests have already been filled by the time the Scheduler gets around to the Anonymous Platform request) may already be at work in the present Scheduler when it processes work requests.
The order in which it determines eligibility for work results in certain platforms not getting any under certain load conditions, e.g. extremely high (250k+) return rates.
I have often had one of my hosts getting work on every request while the other host stays dry, and they are both anonymous-platform Linux boxes. My theory is that because the clients make scheduler requests on a regular five-minute cadence, if a big bunch of clients hits the server at the same time as my host does, that same bunch will be competing with my host on its next request too. And if my other host hits the server at a quiet point in time, it'll keep hitting that same 'hole' on the subsequent requests.


. . My slowest Linux host seems to find that sweet spot regularly and will get regular downloads while the other 3 Linux machines are getting nothing ... all on the same line ...

Stephen

? ?
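
As a side note, the cadence theory quoted above is easy to play with. The toy simulation below is purely illustrative - the client count, the per-slot capacity and the assumption that every client retries on exactly the same five-minute cadence are made up, not measured - but it shows why a host that lands in a crowded (or quiet) second of the cycle keeps landing in that same second on every later request.

```python
import random

CADENCE = 300     # seconds between scheduler requests, assumed identical for everyone
CLIENTS = 10_000  # hypothetical client population
CAPACITY = 35     # hypothetical number of requests the scheduler can satisfy per slot

random.seed(1)
# Each client keeps a fixed phase inside the five-minute cycle, because its
# retry interval equals the cadence - so the crowd it lands with never changes.
phase = [random.randrange(CADENCE) for _ in range(CLIENTS)]

load = [0] * CADENCE          # how many clients share each one-second slot
for p in phase:
    load[p] += 1

for cycle in range(3):        # the picture is identical cycle after cycle
    for host in (0, 1):
        competitors = load[phase[host]]
        outcome = "gets work" if competitors <= CAPACITY else "stays dry"
        print(f"cycle {cycle}: host {host} shares its slot with "
              f"{competitors - 1} other clients -> {outcome}")
```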
ID: 2030858 · Report as offensive
AllgoodGuy

Send message
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2030866 - Posted: 5 Feb 2020, 12:51:48 UTC

Game on, just got two healthy downloads back to back.
ID: 2030866 · Report as offensive
Oddbjornik Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 2030881 - Posted: 5 Feb 2020, 15:20:38 UTC

Validation and assimilation backlogs are approaching old heights. It's just a question of time before we're stuck again.
ID: 2030881 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030886 - Posted: 5 Feb 2020, 16:40:03 UTC - in response to Message 2030881.  

Validation and assimilation backlogs are approaching old heights. It's just a question of time before we're stuck again.
The total result count is over 19.5 million and rising. Somewhere beyond 20 million they don't fit in RAM any more and the database performance goes through the floor.

What's weird is that on the SSP only about 1% of all the returned results are in the 'waiting for db purging' state, but of all my returned results on the website, 75% are in the 'valid' state.

I guess the SSP counts the results associated with workunits waiting for assimilation as 'waiting for validation', while the website counts them as 'valid'. If I estimate the number of those results from the number of workunits waiting for assimilation and move that number from 'waiting for validation' to 'waiting for db purging', then 66% of all the returned results are there, which is a much better match for the 75% fraction within my own results.

The number of workunits waiting for assimilation has grown by 600,000 since the downtime. Fixing the problem that is causing that should be a very high priority. We can't blame the blc35 overflow storm any more.
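
To make the book-keeping above concrete, the adjustment can be written as a few lines of Python. The figures here are placeholders chosen only to be in the right ballpark - they are not the actual SSP numbers from that day - and results_per_wu = 2 is just the nominal initial replication, also an assumption.

```python
# Hypothetical figures, for illustration only
returned_total     = 13_000_000  # returned results counted on the SSP
waiting_db_purging = 130_000     # ~1% of returned, as described above
wus_awaiting_assim = 4_200_000   # workunits waiting for assimilation
results_per_wu     = 2           # assumed average results per workunit

# Results whose workunit is still waiting for assimilation: the SSP counts
# them as 'waiting for validation', while the website already shows 'valid'.
stuck_behind_assimilation = wus_awaiting_assim * results_per_wu

adjusted_done = waiting_db_purging + stuck_behind_assimilation
print(f"adjusted fraction effectively done: {adjusted_done / returned_total:.0%}")
# With numbers in this ballpark the fraction comes out near the ~66% estimated
# above, a much better match for the 75% 'valid' share seen on the website.
```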
ID: 2030886 · Report as offensive
Profile popandbob
Volunteer tester

Send message
Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 2030894 - Posted: 5 Feb 2020, 17:19:59 UTC

I haven't noticed anyone comment on this yet, but the reason for the growing assimilation number is quite a simple one... they have fewer spindles on the storage drive. Fewer spindles means lower read and write rates.


Do you Good Search for Seti@Home? http://www.goodsearch.com/?charityid=888957
Or Good Shop? http://www.goodshop.com/?charityid=888957
ID: 2030894 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030897 - Posted: 5 Feb 2020, 17:44:58 UTC - in response to Message 2030894.  

I haven't noticed anyone comment on this yet, but the reason for the growing assimilation number is quite a simple one... they have fewer spindles on the storage drive. Fewer spindles means lower read and write rates.
Not really, because the new spindles read or write many times more bytes per rotation. But it does affect the performance of multiple simultaneous reads or writes, since with fewer spindles there's a lower chance of the simultaneous operations landing on different spindles.
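
A toy model of the simultaneous-I/O point: if concurrent operations were spread uniformly across the spindles (an idealising assumption - real access patterns aren't uniform), the chance that they all land on different spindles drops quickly as the spindle count shrinks. The spindle counts below are invented for illustration.

```python
from math import prod

def all_distinct_prob(n_spindles: int, k_ops: int) -> float:
    """Chance that k simultaneous, uniformly distributed I/Os all hit
    different spindles in this toy model."""
    if k_ops > n_spindles:
        return 0.0
    return prod((n_spindles - i) / n_spindles for i in range(k_ops))

for n in (24, 12, 6):   # hypothetical spindle counts, old array vs new
    print(f"{n:2d} spindles: P(4 concurrent ops avoid each other) = "
          f"{all_distinct_prob(n, 4):.2f}")
```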
ID: 2030897 · Report as offensive
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2030902 - Posted: 5 Feb 2020, 18:31:36 UTC - in response to Message 2030897.  

We don't even have confirmation that the new database system has been bought/built/implemented yet.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2030902 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13751
Credit: 208,696,464
RAC: 304
Australia
Message 2030980 - Posted: 6 Feb 2020, 5:07:55 UTC - in response to Message 2030902.  

We don't even have confirmation that the new database system has been bought/built/implemented yet.
I would expect the system to be down for a day or more when it comes time to get the new NAS going. First the normal weekly outage to compact & tidy up the database, then the time it takes to transfer it all across, then getting the new hardware and the transferred database recognised by the rest of the system.
I seem to recall a full database transfer taking much longer than expected once upon a time in the distant past.
Grant
Darwin NT
ID: 2030980 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031001 - Posted: 6 Feb 2020, 8:17:27 UTC - in response to Message 2030980.  
Last modified: 6 Feb 2020, 8:17:50 UTC

I would expect the system to be down for a day or more when it comes time to get the new NAS going. First the normal weekly outage to compact & tidy up the database, then the time it takes to transfer it all across, then getting the new hardware and the transferred database recognised by the rest of the system.
They have the replica db, which they can copy to the new NAS without impacting the running system. Then they can make the copy on the new NAS the replica db and let the replication process bring it up to date. After that, the only thing they need to do during the downtime is swap the roles of the databases, so it won't necessarily have any impact on the length of the downtime.

We had a period of time a week ago or so where the replica db was offline and the web site was using the master db directly. Perhaps they were doing just this.
ID: 2031001 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22228
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2031002 - Posted: 6 Feb 2020, 8:19:46 UTC

And then of course there is getting the purchasing done (even for fully pre-funded equipment) within a university - that can be a very fraught and time-consuming activity :-(
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2031002 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031006 - Posted: 6 Feb 2020, 8:33:05 UTC

Looks like the splitter throttling is much more effective now that the overflow storm is over.

The result table has now grown to 20 million and the splitters are being throttled, but when they stop, the table drops back under 20 million almost immediately, so the splitters spend only short periods stopped, which makes this almost unnoticeable. During the overflow storm the validators kept adding lots of resends to the result table, so the table kept growing fast despite the splitters not splitting anything.
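
The behaviour described reads like a simple high-water-mark check on the result table. The sketch below is only a guess at logic of that shape, not SETI's actual splitter code; the 20-million figure is the one discussed in this thread.

```python
RESULT_TABLE_LIMIT = 20_000_000   # the ~20 million ceiling discussed above

def splitters_should_run(result_rows: int, limit: int = RESULT_TABLE_LIMIT) -> bool:
    """Toy throttle: pause splitting while the result table is at or above the
    limit, resume as soon as purging/assimilation pulls it back under."""
    return result_rows < limit

# Now the table dips back under the limit almost immediately, so pauses are
# short; during the overflow storm, resends kept it above the limit for ages.
for rows in (19_950_000, 20_012_235, 19_998_000):
    print(f"{rows:,} rows -> {'splitting' if splitters_should_run(rows) else 'paused'}")
```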
ID: 2031006 · Report as offensive
Speedy
Volunteer tester
Avatar

Send message
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2031126 - Posted: 7 Feb 2020, 2:08:55 UTC
Last modified: 7 Feb 2020, 2:10:30 UTC

I just did a quick add-up of the big numbers on the service status page. It seems the database can handle over 20 million comfortably; when I added up the numbers, this is what I got: 22,986,785. The splitter rate is over 67 a second.
ID: 2031126 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031192 - Posted: 7 Feb 2020, 11:37:27 UTC - in response to Message 2031126.  
Last modified: 7 Feb 2020, 11:37:59 UTC

I just did a quick add-up of the big numbers on the service status page. It seems the database can handle over 20 million comfortably; when I added up the numbers, this is what I got: 22,986,785
The highest number the SSP has shown within the last day or so was 20,012,235, and it spends most of its time below 20 million, with only brief excursions above it. I guess you are mixing some non-result fields into your count, getting a weird hybrid number that doesn't match the size of any table.

That 20 million is the size of the result table. You get it by summing up all the result fields: 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging'. If you add the workunit and file fields, then you will count some results up to four times. And you can't really count the size of the workunit table, because the SSP only shows a subset of them.
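
Spelled out, the sum looks like the snippet below. The field values are placeholders rather than the real SSP figures; the point is which fields belong in the sum and which would lead to double counting.

```python
# Made-up field values standing in for a server status page snapshot
ssp_result_fields = {
    "Results ready to send":                       150_000,
    "Results out in the field":                  5_600_000,
    "Results returned and awaiting validation": 13_000_000,
    "Results waiting for db purging":            1_260_000,
}
# Workunit and file fields are deliberately left out: adding them would count
# some results several times over, as explained above.
result_table_size = sum(ssp_result_fields.values())
print(f"approximate result table size: {result_table_size:,}")
```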
ID: 2031192 · Report as offensive
BetelgeuseFive Project Donor
Volunteer tester

Send message
Joined: 6 Jul 99
Posts: 158
Credit: 17,117,787
RAC: 19
Netherlands
Message 2031205 - Posted: 7 Feb 2020, 14:15:42 UTC

Hmmm, looks like good tasks are being marked as invalid and bad ones as valid ...

https://setiathome.berkeley.edu/workunit.php?wuid=3871356807

Both computers that have this task marked as valid returned an overflow (and both of these hosts return lots of invalids).
Both computers that have this task marked as invalid did NOT return an overflow (and both of these hosts have no other invalids).

Shouldn't there be some kind of mechanism to prevent this (when at least one host did not return an overflow, try more hosts)?

Tom
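
The mechanism suggested above could look something like the sketch below. This is only an illustration of the idea - if the returned results disagree on overflow status, ask for another copy instead of letting two agreeing overflows settle it - and not the actual BOINC validator code.

```python
from dataclasses import dataclass

@dataclass
class Result:
    host_id: int
    is_overflow: bool   # result hit the signal-count overflow limit

def needs_extra_result(results: list[Result]) -> bool:
    """If the quorum contains both overflow and non-overflow results, don't
    validate yet - request one more copy of the task (sketch of the idea)."""
    flags = {r.is_overflow for r in results}
    return len(flags) > 1

# The workunit linked above: two overflow results and two non-overflow ones.
wu_results = [Result(1, True), Result(2, True), Result(3, False), Result(4, False)]
print("issue another task" if needs_extra_result(wu_results) else "let the quorum decide")
```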
ID: 2031205 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2031209 - Posted: 7 Feb 2020, 14:39:36 UTC - in response to Message 2031205.  

Hmmm, looks like good tasks are being marked as invalid and bad ones as valid ...

https://setiathome.berkeley.edu/workunit.php?wuid=3871356807

Both computers that have this task marked as valid returned an overflow (and both of these hosts return lots of invalids).
Both computers that have this task marked as invalid did NOT return an overflow (and both of these hosts have no other invalids).

Shouldn't there be some kind of mechanism to prevent this (when at least one host did not return an overflow, try more hosts)?

Tom

. . The two hosts with lots of invalids have NAVI 5700 GPUs, so there are still some people out there who haven't upgraded their drivers to fix this problem.

Stephen

:(
ID: 2031209 · Report as offensive