The Server Issues / Outages Thread - Panic Mode On! (119)

Author	Message
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2036423 - Posted: 7 Mar 2020, 3:32:55 UTC This graph shows that 'waiting for validation' on SSP really means 'waiting for validation or assimilation': The purple curve is the validation queue size as shown on SSP. The green curve is the assimilation queue size from SSP multiplied by 2.2 to scale it from workunits to results. The blue curve is their difference, i.e. the true number of results waiting for validation. The blue curve looks very much like the validation queue before the assimilation problem started: stable around 5 million, with a sharp spike just after each weekly downtime when everyone reports the results they crunched during the downtime. We also see that the spike drops as fast as it climbed, so validation has worked fine, but simultaneously with this drop the assimilation curve climbs and then stays there. So the validated results get stuck in the assimilation queue. The assimilation queue descends much more slowly, so slowly that the next downtime hits before it has reached the level it had before the previous downtime. So every downtime pushes it higher and higher.
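
The arithmetic behind the blue curve can be sketched in a few lines of Python. This is only an illustration of the subtraction described in the post: the function name is hypothetical, the 2.2 results-per-workunit factor is the one stated above, and the two sample inputs are the SSP figures quoted later in this thread (Message 2036466), which were taken at a different moment than the graph.

```python
# Scaling factor from the post: average results per workunit.
RESULTS_PER_WORKUNIT = 2.2


def true_validation_queue(ssp_awaiting_validation, ssp_awaiting_assimilation,
                          results_per_wu=RESULTS_PER_WORKUNIT):
    """Estimate results really waiting for validation (the blue curve).

    ssp_awaiting_validation: 'results returned and awaiting validation' (results)
    ssp_awaiting_assimilation: 'workunits waiting for assimilation' (workunits)
    """
    return ssp_awaiting_validation - results_per_wu * ssp_awaiting_assimilation


# Sample SSP values quoted later in the thread (assumed comparable here).
estimate = true_validation_queue(15_242_139, 4_508_013)
print(f"{estimate:,.0f}")  # prints 5,324,510
```

With those inputs the estimate comes out at roughly 5.3 million, in line with the "stable around 5 million" level described in the post.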

Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2036424 - Posted: 7 Mar 2020, 3:39:05 UTC - in response to Message 2036421. "No, they haven't, go to the very end of the listing for my MB Valid tasks and you will find, 10 tasks issued 30 Jan 2020" I can't browse the listing of valid tasks alone. Only the 'all tasks' list really works. Trying to choose anything else just leaves the browser loading the page forever without ever getting anything, even when I try to open my invalid task list, which has only two tasks in it.

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2036430 - Posted: 7 Mar 2020, 4:05:29 UTC - in response to Message 2036424. I tried it with a Host with many Valids and it's still spinning. I then tried it with a smaller number and reached a large number of WUs dated 18 Feb that all fail to open with the error "Unable to handle request" ("can't find workunit"). It's just a WAG, but I would imagine it would be difficult to Assimilate something that can't be found... I dunno. See if you can open this inside of a few minutes: https://setiathome.berkeley.edu/results.php?hostid=8097309&offset=3260&state=4

kittyman Volunteer tester Joined: 9 Jul 00 Posts: 51469 Credit: 1,018,363,574 RAC: 1,004	Message 2036431 - Posted: 7 Mar 2020, 4:09:16 UTC - in response to Message 2036430. "It's just a WAG, but, I would imagine it would be difficult to Assimilate something that can't be found...." LOL, I suppose that would be true. Meow. "Freedom is just Chaos, with better lighting." Alan Dean Foster

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2036433 - Posted: 7 Mar 2020, 4:27:36 UTC - in response to Message 2036432. Every single WU on every page? It goes on for pages, starting with Feb 29, https://setiathome.berkeley.edu/results.php?hostid=8097309&offset=3160&state=4

W-K 666 Volunteer tester Joined: 18 May 99 Posts: 19118 Credit: 40,757,560 RAC: 67	Message 2036435 - Posted: 7 Mar 2020, 4:38:38 UTC - in response to Message 2036433. "Every single WU on every page? It goes on for pages, starting with Feb 29, https://setiathome.berkeley.edu/results.php?hostid=8097309&offset=3160&state=4" That is a demonstration of progress. The listing of valids comes from the Replica, where the task is still visible. But the workunit page comes from the master, and the workunit has been purged.
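
A toy model may make the replica/master explanation easier to picture. This is not BOINC code: the dictionaries, workunit IDs, and the `open_workunit` helper are all hypothetical, and the only behaviour modelled is what the post describes, namely that the task list is read from a replica that lags behind a master where the purge has already happened.

```python
def open_workunit(master, wuid):
    """Mimic the workunit page: look the workunit up in the master copy."""
    if wuid not in master:
        return "can't find workunit"
    return master[wuid]


# Hypothetical workunits; the replica is a snapshot taken before the purge.
master = {101: "assimilated", 102: "assimilated", 103: "in progress"}
replica = dict(master)

# The purge removes finished workunits from the master only (so far).
for wuid in [w for w, state in master.items() if state == "assimilated"]:
    del master[wuid]

# The task list page reads the stale replica, so it still shows all three,
# but following each link hits the master, where two are already gone.
listing = sorted(replica)
results = {wuid: open_workunit(master, wuid) for wuid in listing}
print(results)
# {101: "can't find workunit", 102: "can't find workunit", 103: 'in progress'}
```

In this sketch the errors disappear on their own once the replica catches up and drops the purged rows from the listing.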

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2036439 - Posted: 7 Mar 2020, 4:43:10 UTC - in response to Message 2036434. You do realize Eric ran that script many hours ago, right? I'll give you another 2.5 hours though. Every WU older than 29 Feb...

W-K 666 Volunteer tester Joined: 18 May 99 Posts: 19118 Credit: 40,757,560 RAC: 67	Message 2036442 - Posted: 7 Mar 2020, 4:46:54 UTC - in response to Message 2036439. Last modified: 7 Mar 2020, 4:51:47 UTC "You do realize Eric ran that script many hours ago, right? I'll give you another 2.5 hours though. Every WU older than 29 Feb..." The script is probably still running, and will be until Eric gets up and takes a look at the progress. It is going to take some time to remove the 12 million tasks in the bloat.

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2036446 - Posted: 7 Mar 2020, 5:33:26 UTC Last modified: 7 Mar 2020, 5:46:26 UTC Ah, many are now gone, leaving two that are listed as Validated with a "minimum quorum 1". I wonder how many of those are still lurking around? Is it Validated or what? https://setiathome.berkeley.edu/workunit.php?wuid=3861283408
granted credit 104.20
minimum quorum: 1
initial replication: 2
Task	Computer	Sent	Time reported	Status	Runtime	CPUtime	Credit	Application
8493614556	8097309	30 Jan 2020, 17:37:30 UTC	31 Jan 2020, 10:07:46 UTC	Completed and validated	259.93	244.61	104.20	SETI@home v8 v8.11 (cuda42_mac)x86_64-apple-darwin
8493614557	8743335	30 Jan 2020, 17:37:22 UTC	23 Mar 2020, 9:03:18 UTC	In progress	---	---	---	SETI@home v8 v8.24 (opencl_ati5_SoG_nocal)windows_intelx86
Millions?

W-K 666 Volunteer tester Joined: 18 May 99 Posts: 19118 Credit: 40,757,560 RAC: 67	Message 2036449 - Posted: 7 Mar 2020, 5:57:18 UTC - in response to Message 2036446. Last modified: 7 Mar 2020, 6:01:13 UTC An unknown variable. If you only have two with an RAC over 1.6 million, but I have ten with an RAC of 26,400, then it is probably too difficult to make an accurate guess. But I doubt if it is anywhere near a million. ~~they are probably all from the same tape that was split on the 30th Jan.~~ [edit] Not all from the same tape, just all split on the 30th Jan.

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13755 Credit: 208,696,464 RAC: 304	Message 2036451 - Posted: 7 Mar 2020, 6:02:56 UTC Last modified: 7 Mar 2020, 6:16:16 UTC Forums have almost ground to a halt. So the Scheduler should go MIA again any minute now... Edit: yep, fail, fail, fail. And even the web site is barely responding.
7/03/2020 15:35:06 | SETI@home | Scheduler request failed: Couldn't connect to server
7/03/2020 15:36:58 | SETI@home | Scheduler request failed: Couldn't connect to server
7/03/2020 15:40:28 | SETI@home | Scheduler request failed: Couldn't connect to server
Grant Darwin NT

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2036452 - Posted: 7 Mar 2020, 6:07:02 UTC - in response to Message 2036449. From what I remember, the people with the AMD 5700s were cranking one out every 20 seconds or so. Those machines would be the ones to investigate. It was really quite alarming to see so many clearly False Valids being generated.

W-K 666 Volunteer tester Joined: 18 May 99 Posts: 19118 Credit: 40,757,560 RAC: 67	Message 2036453 - Posted: 7 Mar 2020, 6:11:29 UTC - in response to Message 2036452. "From what I remember, the people with the AMD 5700s were cranking one out every 20 seconds or so. Those machines would be the ones to investigate. It was really quite alarming to see so many clearly False Valids being generated." I had a look to see if they were from any of the known problems or 'noise bombs', and that doesn't seem to be the case. Some of mine are blc tasks, the others are from Arecibo, and all look like they ran the full distance.

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13755 Credit: 208,696,464 RAC: 304	Message 2036454 - Posted: 7 Mar 2020, 6:12:34 UTC - in response to Message 2036442. "The script is probably still running and will until Eric gets up and takes a look at the progress." It's a script that is run manually by the staff; it's not a cron job that is set to run over & over again. It's run, does its thing, and ends. If you like, you can look at the graphs & see the effect it had: a few hundred thousand knocked off the almost 16 million total, between 22:00 and 23:00 graph time. It ran, it finished, the effect was minimal. "It is going to take some time to remove the 12 million tasks in the bloat" Yep, months at the present rate, due to the deadlines and more new work being added to the database. Grant Darwin NT

W-K 666 Volunteer tester Joined: 18 May 99 Posts: 19118 Credit: 40,757,560 RAC: 67	Message 2036456 - Posted: 7 Mar 2020, 6:18:15 UTC - in response to Message 2036454. What's the graph time re UTC?

Kiska Volunteer tester Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0	Message 2036463 - Posted: 7 Mar 2020, 7:01:04 UTC - in response to Message 2036456. "What's the graph time re UTC?" The time is whatever your timezone is; in this case, for me it's 2020-03-06T07:12:18+1100 to 2020-03-07T13:12:18+1100. You can view it here: https://munin.kiska.pw/munin/static/dynazoom.html?plugin_name=Munin-Node%2FMunin-Node%2Fresults_setiathomev8_in_progress_validation&start_iso8601=2020-03-06T11%3A57%3A46%2B1100&stop_iso8601=2020-03-07T17%3A57%3A46%2B1100&start_epoch=1583439138&stop_epoch=1583547138&lower_limit=&upper_limit=&size_x=800&size_y=400&cgiurl_graph=%2Fmunin-cgi%2Fmunin-cgi-graph
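
For anyone wanting the graph window in UTC, the two local timestamps in the post can be converted with the Python standard library. This is a small sketch; the `to_utc` helper name is hypothetical, and the only inputs are the timestamps and the `start_epoch`/`stop_epoch` values that appear in the post and its URL.

```python
from datetime import datetime, timezone


def to_utc(local_iso):
    """Parse a timestamp like 2020-03-06T07:12:18+1100 and convert it to UTC."""
    local = datetime.strptime(local_iso, "%Y-%m-%dT%H:%M:%S%z")
    return local.astimezone(timezone.utc)


start = to_utc("2020-03-06T07:12:18+1100")
stop = to_utc("2020-03-07T13:12:18+1100")

print(start.isoformat())  # 2020-03-05T20:12:18+00:00
print(stop.isoformat())   # 2020-03-07T02:12:18+00:00

# Cross-check against the epoch values carried in the graph URL.
print(int(start.timestamp()) == 1583439138)  # True
print(int(stop.timestamp()) == 1583547138)   # True
```

So the 30-hour window in the post runs from 20:12:18 UTC on 5 Mar to 02:12:18 UTC on 7 Mar, 11 hours behind the local times shown.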

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2036466 - Posted: 7 Mar 2020, 7:36:35 UTC - in response to Message 2036279. Well, I think I'm making some progress on this. Here's a table with the v8 SSP values when I started (a couple of hours ago), for reference. And what appear to be the SQL counts that they represent. I had to line them up by eye, but I had nine rows in each block, and this is the only way they fitted.
Results ready to send	1,131	result_server_state_2 (UNSENT)
Results out in the field	5,490,824	result_server_state_4 (IN PROGRESS)
Results returned and awaiting validation	15,242,139	result_server_state_5_and_file_delete_state_0 (OVER, INIT)
Workunits waiting for validation	42	workunit_need_validate_1 bool
Workunits waiting for assimilation	4,508,013	workunit_assimilate_state_1 (READY)
Workunit files waiting for deletion	74	workunit_file_delete_state_1 (READY)
Result files waiting for deletion	155	result_file_delete_state_1 (READY)
Workunits waiting for db purging	77,989	workunit_file_delete_state_2 (DONE)
Results waiting for db purging	170,748	result_file_delete_state_2 (DONE)
Most of that makes sense, but I think our problem is the third line: server state 5 includes all sorts of nasties:
#define RESULT_SERVER_STATE_OVER 5
// we received a reply, timed out, or decided not to send.
Why should a 'timed out' result (passed deadline) be paired with a file delete status? There's a perfectly good VALIDATE_STATE_INIT we could use, which would allow us to cut out VALIDATE_STATE_TOO_LATE. Thoughts?
Is it possible the system doesn't know what to do with those WUs listed as Validated with a minimum quorum 1, while still having an outstanding Wingman? If that were the case there wouldn't need to be a very large number of those WUs; the Error being generated from them may be enough to slow the assimilation to a crawl.
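
One way to read the count names in the table is as table, column, and value joined by underscores. The sketch below turns a name into the count query it might stand for. That reading is an assumption made for illustration: the actual queries behind the SSP are not shown in this thread, and `parse_count_name` and `to_sql` are hypothetical helpers.

```python
def parse_count_name(name):
    """Split e.g. 'result_server_state_5' into ('result', {'server_state': 5}).

    Assumes the pattern <table>_<column>_<value>, with multiple conditions
    joined by '_and_' (an interpretation of the names, not confirmed).
    """
    table, rest = name.split("_", 1)
    conditions = {}
    for part in rest.split("_and_"):
        column, value = part.rsplit("_", 1)
        conditions[column] = int(value)
    return table, conditions


def to_sql(name):
    """Render the assumed count query for one SSP row."""
    table, conditions = parse_count_name(name)
    where = " AND ".join(f"{col} = {val}" for col, val in conditions.items())
    return f"SELECT COUNT(*) FROM {table} WHERE {where}"


print(to_sql("result_server_state_5_and_file_delete_state_0"))
# SELECT COUNT(*) FROM result WHERE server_state = 5 AND file_delete_state = 0
print(to_sql("workunit_assimilate_state_1"))
# SELECT COUNT(*) FROM workunit WHERE assimilate_state = 1
```

Read this way, the third row counts every result in server state 5 whose files have not yet been queued for deletion, which is why timed-out and already-validated results would be lumped together with those genuinely awaiting validation.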

Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2036474 - Posted: 7 Mar 2020, 9:27:33 UTC All the tasks in those valid lists that give an error when trying to click the workunit link also give the error when trying to open the task link. This is a pretty clear indication that the list and the individual pages come from different databases, and that we are just observing different points in time.

Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2036477 - Posted: 7 Mar 2020, 9:56:23 UTC Finally I was able to open my own valid tasks list. At the end there are just four tasks that I received earlier than 1 Mar. Of those four, two are artifacts of the replica displaying old data: the result and workunit have been deleted in the primary database. The remaining two are quorum 1 tasks from January that have a wingman still crunching: https://setiathome.berkeley.edu/workunit.php?wuid=3860194203 and https://setiathome.berkeley.edu/workunit.php?wuid=3861282516 Just two out of 5535 on this host doesn't look too bad. And I guess those two tasks have been assimilated. The still crunching wingman just prevents them from being deleted. So they are counted in 'waiting for db purging' on SSP.

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14655 Credit: 200,643,578 RAC: 874	Message 2036482 - Posted: 7 Mar 2020, 10:27:20 UTC - in response to Message 2036477. "https://setiathome.berkeley.edu/workunit.php?wuid=3860194203 and https://setiathome.berkeley.edu/workunit.php?wuid=3861282516 Just two out of 5535 on this host doesn't look too bad. And I guess those two tasks have been assimilated. The still crunching wingman just prevents them from being deleted. So they are counted in 'waiting for db purging' on SSP." You say 'still crunching wingmen', but I doubt it. The tasks were both issued on 30 January, and have a deadline of 23 March. And both computers have contacted the servers recently - one on 5 March, and one today. I'm waiting for the task lists on each of their computers to load, but one at least is still actively crunching - the one which contacted today received new work, and has returned one of the tasks already. So, my best guess (prediction) is that these will turn out to be ghost tasks, never received and never to be crunched. They will reach deadline and time out on 23 March. What happens then, I'm less certain about. The minimum quorum of one, but initial replication of two, is an unusual combination, and we don't know exactly how the SETI daemons are programmed to cope with it. Ideally, a simple 'finished/purge', but my concern would be that the system, in its current configuration, might create and send out a replacement task. As an aside, if the deadlines had been shorter back when this workunit was created, it wouldn't still be here now. If the project was continuing, shorter deadlines would certainly reduce the damage from bumps in the road like this.


SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.