Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
Author | Message |
---|---|
Kevin Olley Send message Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572 |
Switched to AP only until this is sorted. NNT it is then :-( Kevin |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
And even with this self-Validation, the Validation & Assimilation backlogs continue to grow. My Inconclusives look to be heading for an all-time record, and there would appear to be a return of the BLC35 noise bombs. Grant Darwin NT |
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
If we all set NNT, no work will be processed and no resends will get processed. That is just my opinion. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
We could cut our caches to 0.0 + 0.0, return every task after just 5 minutes, and get the 'first back' reward? No, I didn't think so either. My vote was for shutting down the splitters for a week (or 2, or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared. Not started to clear, but fully cleared. Pull all BLC35 files and then restart the splitters with 100 + 100 serverside limits again. Once we have the new NAS device up and running, bump up the limits & reintroduce the BLC35 files, then use them to stress-test the system. If it fails again, then it's fundraising time for new database servers that are capable of handling the load (that really needs to be done anyway in order to meet the project's goals of many more crunchers returning much more work). Grant Darwin NT |
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
I agree, Grant, in regards to the pulling of the data tapes; however, I think you will find that they don't have the manpower to sift through the data and pull selected tapes. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
"I agree Grant in regards to the pulling of the data tapes however I think you will find that they don't have the manpower to sift through the data and pull selected tapes" Hence just pull all files named BLC35 and hold them over until such time as the servers can handle the load they generate. Plenty of other files to be processed, so no need to do these ones now. Grant Darwin NT |
Gene Send message Joined: 26 Apr 99 Posts: 150 Credit: 48,393,279 RAC: 118 |
I got 3 invalids on the 30th. In all three cases the wingman, who got "valid" credit, returned a StdErr file that was empty - just one line: <core_client_version>7.14.2</core_client_version> It's not sensible that a result which returned no info was marked "valid". I'm going NNT as others have. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"My vote was for shutting down the splitters for a week (or 2 or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared." This would alienate all those users who are not following these forums, making them quit or switch to other projects permanently. Loss of users would help the server congestion but hurt the science progress. I think letting the backlogs clear at the start of a Tuesday downtime would make a big difference, especially if they also trigger the validation of all those results that have missed validation for various reasons over the last weeks and are now waiting for the deadlines. The resend cycle wouldn't clear, but resends are a small percentage of all the tasks. The huge 'Workunits waiting for assimilation' backlog that is now 3.5 million and still rising would clear. Those workunits waiting for assimilation must have corresponding result rows still in the database, at least for the canonical result but probably for all the results, because I have never seen any workunit on the website show part of the results deleted while the workunit still exists. The number of results waiting for assimilation is not shown on the SSP, so I guess those results may still be counted in the validation queue. If this is the case, then those may explain over 7 million of the current 12 million result validation queue! "Once we have the new NAS device up and running, bump up the limits &..." When the problem is the database not fitting in RAM, a disk performance increase won't fix the problem. It only reduces the magnitude of the consequences a bit. |
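The back-of-envelope arithmetic in the post above can be sketched as follows. The figures are the rounded ones quoted in the post, not live server statistics, and the two-results-per-workunit factor is an assumption about what is still in the database:

```python
# Rough check of the queue arithmetic quoted in the post above.
# All numbers are the rounded figures from the post, not live server stats.

workunits_awaiting_assimilation = 3_500_000
results_per_workunit = 2          # assumption: both result rows still in the DB
validation_queue_shown = 12_000_000

# Results possibly miscounted into the validation queue on the SSP:
hidden_results = workunits_awaiting_assimilation * results_per_workunit
print(hidden_results)                             # 7,000,000

# Results that would then be genuinely awaiting validation:
print(validation_queue_shown - hidden_results)    # 5,000,000
```

Under those assumptions the "over 7 million of the current 12 million" figure in the post checks out.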
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
"My vote was for shutting down the splitters for a week (or 2 or however long it takes), and just have people process resends until such time as the Validation & Assimilation backlogs have cleared." +1 I think this is a great idea. We will all still get work... just make it the resends, until the db reaches a good size. p.s. The idea of processing data without a wingman, or having a bad result put in over my good result, is BS and worthless. I love Seti, but I don't want feel-good theater, I want SCIENCE! So I'm NNT until it gets better. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The Splitters have fallen off again; most requests are receiving 'Project has No tasks...' again. Caches are falling... one is down by 50% already, and so it continues. Oh, the problem with failed Uploads has also returned. Probably has something to do with returning around 60 to 70 completed tasks every 5 minutes. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Now they have apparently switched to 'initial replication 1': 3861450832 So no more risk of bad results returned first making good results returned later fail, but also no chance whatsoever of catching the bad results. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
"Once we have the new NAS device up and running, bump up the limits &..." "When the problem is the database not fitting in RAM, the disk performance increase won't fix the problem. It only reduces the magnitude of the consequences a bit." Depending on how much better they perform, the need for all of it to fit in RAM may not arise (although that is rather wishful thinking - I am expecting the new storage to be significantly faster than the existing storage, however I don't expect it to be significant enough). Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM. Or a new server with more RAM. Or better yet, both. Grant Darwin NT |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Results received in last hour = 197,095 just a matter of time now, probably not long. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
"Results received in last hour = 197,095" Already getting "Project has no tasks available" messages; I think TBar posted similarly in another thread. Caches running down. Not surprising considering the return rate & the increasing Validation & Assimilation backlogs - both have reached new record highs. Grant Darwin NT |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM." The Setiathome Boinc database running from flash would burn out the flash in a short time! |
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is the 'Results returned and awaiting validation' number growing in the status? edit: Ville - I love your Pluto pic. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13766 Credit: 208,696,464 RAC: 304 |
"Or they could be replaced by an AFA (All Flash Array), negating the need for the entire thing to fit in RAM." "Setiathome Boinc database running from flash would burn out the flash in a short time!" After several decades. Yes, if you were to use consumer/client SSDs they would die rather quickly; however, SSDs designed for enterprise use will last an extremely long time, under much heavier use than Seti provides. For example, consider DWPD (Drive Writes Per Day, where the entire capacity of the drive is written in a 24hr period): consumer drives are rated at around 0.1 to 0.5 DWPD, enterprise drives are rated as high as 3 DWPD, and some specialised write-intensive drives even higher. And of course with multiple drives in an array or pool, even under the heaviest of loads, they will never come close to their rated maximum DWPD limit. Grant Darwin NT |
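As a rough illustration of the DWPD arithmetic above, here is a minimal sketch. The drive capacity, warranty period, and array write load are hypothetical figures chosen for illustration, not SETI@home measurements:

```python
# Hypothetical SSD endurance estimate based on the DWPD rating discussed above.
# Capacity, warranty, and write-load figures are illustrative assumptions.

def years_to_wear_out(capacity_tb, rated_dwpd, warranty_years,
                      actual_writes_tb_per_day):
    """Estimate drive lifetime at a given sustained write load.

    Rated endurance (TBW) = capacity * DWPD * warranty days.
    """
    tbw = capacity_tb * rated_dwpd * warranty_years * 365
    return tbw / (actual_writes_tb_per_day * 365)

# A 3.84 TB enterprise drive rated 3 DWPD over a 5-year warranty gives
# about 21,000 TB of rated writes. If a database wrote, say, 2 TB/day
# spread across an 8-drive array, each drive would see ~0.25 TB/day:
print(years_to_wear_out(3.84, 3, 5, 0.25))  # ≈ 230 years
```

Which illustrates the point in the post: an array of enterprise drives at a plausible database write load sits far below its rated DWPD limit.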
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"Ville - I love your Pluto pic." I got the idea of using a Pluto pic from you ;) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14656 Credit: 200,643,578 RAC: 874 |
"If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is the 'Results returned and awaiting validation' number growing in the status?" My theory is that the Transitioner isn't marking (hasn't marked) all those returns as 'ready to validate' - I think the bulk of them have been sitting there untouched since the December troubles. Eric replied - very late on Thursday night, his time - "I will see if I can figure out a transitioner trick tomorrow, in which case I will revert to standard replication." (I suggested that Matt might have a script for that - I think we've done it before.) |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The first machine is now out of work: https://setiathome.berkeley.edu/results.php?hostid=6796479 The next machine's cache is down by 60%; it will be out soon: https://setiathome.berkeley.edu/results.php?hostid=6813106 The load is still above 190k, and the Splitters can't keep up: https://setiathome.berkeley.edu/show_server_status.php |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.