The Server Issues / Outages Thread - Panic Mode On! (119)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119)
Message board moderation

To post messages, you must log in.

Previous · 1 . . . 8 · 9 · 10 · 11 · 12 · 13 · 14 . . . 107 · Next

AuthorMessage
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2036423 - Posted: 7 Mar 2020, 3:32:55 UTC

This graph shows that 'waiting for validation' on SSP really means 'waiting for validation or assimilation':



The purple curve is the validation queue size as shown on SSP.
The green curve is assimilation queue size form SSP multiplied by 2.2 to scale if from workunits to results.
The blue curve is their difference, i.e. the true number of results waiting for validation.

The blue curve looks very much like the validation queue before the assimilation problem started. Stable around 5 million with a sharp spike just after each weekly downtime when everyone reports their results crunched during the downtime.

We also see that the spike drops down as fast as it climbed up, so validation has worked fine, but simultaneously with this drop, the assimilation curve climbs up and then stays there. So the validated results get stuck in the assimilation queue. The assimilation queue descends down much slower. So slow that the next downtime hits before it has reached the level it had before the previous downtime. So every downtime pushes it higher and higher.
ID: 2036423 · Report as offensive     Reply Quote
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2036424 - Posted: 7 Mar 2020, 3:39:05 UTC - in response to Message 2036421.  

No, they haven't, go to the very end of the listing for my MB Valid tasks and you will find, 10 tasks issued 30 Jan 2020
I can't browse the listing of valid tasks alone. Only the 'all tasks' list really works. Trying to choose anything else just leaves the browser loading the page forever without ever getting anything. Even when I try to click my invalid task list that has only two tasks in it.
ID: 2036424 · Report as offensive     Reply Quote
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036430 - Posted: 7 Mar 2020, 4:05:29 UTC - in response to Message 2036424.  

I tried it with a Host with many Valids and it's still spinning. I then tried it with a smaller number and reached a large number of WUs dated 18 Feb that all fail to open with the error,
Unable to handle request
can't find workunit

It's just a WAG, but, I would imagine it would be difficult to Assimilate something that can't be found....I Dunno

See if you can open this inside of a few minutes, https://setiathome.berkeley.edu/results.php?hostid=8097309&offset=3260&state=4
ID: 2036430 · Report as offensive     Reply Quote
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2036431 - Posted: 7 Mar 2020, 4:09:16 UTC - in response to Message 2036430.  


It's just a WAG, but, I would imagine it would be difficult to Assimilate something that can't be found....

LOL, I suppose that would be true.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 2036431 · Report as offensive     Reply Quote
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036433 - Posted: 7 Mar 2020, 4:27:36 UTC - in response to Message 2036432.  

Every single WU on every page? It goes on for pages, starting with Feb 29, https://setiathome.berkeley.edu/results.php?hostid=8097309&offset=3160&state=4
ID: 2036433 · Report as offensive     Reply Quote
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19118
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2036435 - Posted: 7 Mar 2020, 4:38:38 UTC - in response to Message 2036433.  

Every single WU on every page? It goes on for pages, starting with Feb 29, https://setiathome.berkeley.edu/results.php?hostid=8097309&offset=3160&state=4

That is a demonstration of progress.
The listing of valid's comes from the Replica, where the task is still visible.
But the workunit page comes from the master and the workunit has been purged.
ID: 2036435 · Report as offensive     Reply Quote
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036439 - Posted: 7 Mar 2020, 4:43:10 UTC - in response to Message 2036434.  

You do realize Eric ran that script many hours ago, right? I'll give you another 2.5 hours though.
Every WU older than 29 Feb...
ID: 2036439 · Report as offensive     Reply Quote
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19118
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2036442 - Posted: 7 Mar 2020, 4:46:54 UTC - in response to Message 2036439.  
Last modified: 7 Mar 2020, 4:51:47 UTC

You do realize Eric ran that script many hours ago, right? I'll give you another 2.5 hours though.
Every WU older than 29 Feb...

The script is probably still running and will until Eric gets up and takes a look at the progress.

It is going to take some time to remove the 12 million tasks in the bloat
ID: 2036442 · Report as offensive     Reply Quote
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036446 - Posted: 7 Mar 2020, 5:33:26 UTC
Last modified: 7 Mar 2020, 5:46:26 UTC

Ah, many are now gone, leaving two that are listed as Validated with a "minimum quorum 1"
I wonder how many of those are still lurking around? Is it Validated or what? https://setiathome.berkeley.edu/workunit.php?wuid=3861283408
granted credit 	104.20
minimum quorum : 1
initial replication : 2

   Task     Computer          Sent                  Time reported                  Status          Runtime  CPUtime  Credit                   Application
8493614556  8097309  30 Jan 2020, 17:37:30 UTC  31 Jan 2020, 10:07:46 UTC  Completed and validated  259.93  244.61  104.20 SETI@home v8 v8.11 (cuda42_mac)x86_64-apple-darwin
8493614557  8743335  30 Jan 2020, 17:37:22 UTC  23 Mar 2020, 9:03:18 UTC         In progress          ---     ---    ---   SETI@home v8 v8.24 (opencl_ati5_SoG_nocal)windows_intelx86
Millions?
ID: 2036446 · Report as offensive     Reply Quote
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19118
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2036449 - Posted: 7 Mar 2020, 5:57:18 UTC - in response to Message 2036446.  
Last modified: 7 Mar 2020, 6:01:13 UTC

An unknown variable.
If you only have two with an RAC over 1.6 million, but I have ten with an RAC of 26,400 then it probably too difficult to make an accurate guess.

But i doubt if it is anywhere near a million. they are probably all from the same tape that was split on the 30th Jan.

edit]
Not all from the same tape, just all split on the 30th Jan
ID: 2036449 · Report as offensive     Reply Quote
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2036451 - Posted: 7 Mar 2020, 6:02:56 UTC
Last modified: 7 Mar 2020, 6:16:16 UTC

Forums have almost ground to a halt. So the Scheduler should go MIA again any minute now...


Edit, yep- fail, fail fail. And even the web site is barely responding.
7/03/2020 15:35:06 | SETI@home | Scheduler request failed: Couldn't connect to server
7/03/2020 15:36:58 | SETI@home | Scheduler request failed: Couldn't connect to server
7/03/2020 15:40:28 | SETI@home | Scheduler request failed: Couldn't connect to server
Grant
Darwin NT
ID: 2036451 · Report as offensive     Reply Quote
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036452 - Posted: 7 Mar 2020, 6:07:02 UTC - in response to Message 2036449.  

From what I remember, the people with the AMD 5700s were cranking one out every 20 seconds or so. Those machines would be the ones to investigate. It was really quite alarming to see so many clearly False Valids being generated.
ID: 2036452 · Report as offensive     Reply Quote
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19118
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2036453 - Posted: 7 Mar 2020, 6:11:29 UTC - in response to Message 2036452.  

From what I remember, the people with the AMD 5700s were cranking one out every 20 seconds or so. Those machines would be the ones to investigate. It was really quite alarming to see so many clearly False Valids being generated.

I had a look to see if they were from any of the known problems or 'noise bombs' and it doesn't seem to be the case. and some of mine are blc's the others from Arecibo and all lokk like they ran full distance.
ID: 2036453 · Report as offensive     Reply Quote
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 2036454 - Posted: 7 Mar 2020, 6:12:34 UTC - in response to Message 2036442.  

The script is probably still running and will until Eric gets up and takes a look at the progress.
It's a script that is run manually by the staff, it's not a Cron job that is set to run over & over again. It's run, does it's thing, and ends.
If you like, you can look at the graphs & see the effect it had. A few 100 thousand knocked off the almost 16 million total, between 22:00 to 23:00 graph time.
It ran, it finished, the effect was minimal.


It is going to take some time to remove the 12 million tasks in the bloat
Yep, months at the present rate due to the deadlines and more new work being added to the database.
Grant
Darwin NT
ID: 2036454 · Report as offensive     Reply Quote
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19118
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2036456 - Posted: 7 Mar 2020, 6:18:15 UTC - in response to Message 2036454.  

What's the graph time re UTC?
ID: 2036456 · Report as offensive     Reply Quote
Kiska
Volunteer tester

Send message
Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2036463 - Posted: 7 Mar 2020, 7:01:04 UTC - in response to Message 2036456.  

ID: 2036463 · Report as offensive     Reply Quote
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036466 - Posted: 7 Mar 2020, 7:36:35 UTC - in response to Message 2036279.  

Well, I think I'm making some progress on this. Here's a table with the v8 SSP values when I started (a couple of hours ago), for reference. And what appear to be the SQL counts that they represent. I had to line them up by eye, but I had nine rows in each block, and this is the only way they fitted.

Results ready to send				1,131		result_server_state_2      				(UNSENT)
Results out in the field			5,490,824	result_server_state_4      				(IN PROGRESS)
Results returned and awaiting validation	15,242,139	result_server_state_5_and_file_delete_state_0      	(OVER, INIT)
Workunits waiting for validation		42		workunit_need_validate_1   				bool
Workunits waiting for assimilation		4,508,013	workunit_assimilate_state_1				(READY)
Workunit files waiting for deletion		74		workunit_file_delete_state_1       			(READY)
Result files waiting for deletion		155		result_file_delete_state_1 				(READY)
Workunits waiting for db purging		77,989		workunit_file_delete_state_2  				(DONE)
Results waiting for db purging			170,748		result_file_delete_state_2 				(DONE)
Most of that makes sense, but I think our problem is the third line: server state 5 includes all sorts of nasties:

#define RESULT_SERVER_STATE_OVER           5
    // we received a reply, timed out, or decided not to send.
Why should a 'timed out' result (passed deadline) be paired with a file delete status? There's a perfectly good VALIDATE_STATE_INIT we could use, which would allow us to cut out VALIDATE_STATE_TOO_LATE. Thoughts?
Is it possible the system doesn't know what to do with those WUs listed as Validated with a minimum quorum 1, while still having an outstanding Wingman? If that were the case there wouldn't need to be a very large number of those WUs, the Error being generated from them may be enough to slow the assimilation to a crawl.
ID: 2036466 · Report as offensive     Reply Quote
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2036474 - Posted: 7 Mar 2020, 9:27:33 UTC

All the tasks in those valid lists that give an error when trying to click the workunit link give the error also when trying to open the task link. This is pretty clear indication that we are just observing different points of time on the list and the individual pages that come from different databases.
ID: 2036474 · Report as offensive     Reply Quote
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2036477 - Posted: 7 Mar 2020, 9:56:23 UTC

Finally I was able to open my own valid tasks list. At the end there are just four tasks that I have received earlier than 1 Mar.

Of those four two are artifacts of replica displaying old data: the result and workunit are deleted in the primary database.
The remaining two are quorum 1 tasks from January that have a wingman still crunching:

https://setiathome.berkeley.edu/workunit.php?wuid=3860194203 and
https://setiathome.berkeley.edu/workunit.php?wuid=3861282516

Just two out of 5535 on this host doesn't look too bad. And I guess those two tasks have been assimilated. The still crunching wingman just prevents them from being deleted. So they are counted in 'waiting for db purging' on SSP.
ID: 2036477 · Report as offensive     Reply Quote
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2036482 - Posted: 7 Mar 2020, 10:27:20 UTC - in response to Message 2036477.  

https://setiathome.berkeley.edu/workunit.php?wuid=3860194203 and
https://setiathome.berkeley.edu/workunit.php?wuid=3861282516

Just two out of 5535 on this host doesn't look too bad. And I guess those two tasks have been assimilated. The still crunching wingman just prevents them from being deleted. So they are counted in 'waiting for db purging' on SSP.
You say 'still crunching wingmen', but I doubt it. The tasks were both issued on 30 January, and have a deadline of 23 March. And both computers have contacted the servers recently - one on 5 March, and one today.

I'm waiting for the task lists on each of their computers to load, but one at least is still actively crunching - the one which contacted today received new work, and has returned one of the tasks already.

So, my best guess (prediction) is that these will turn out to be ghost tasks, never received and never to be crunched. They will reach deadline and time out on 23 March. What happens then, I'm less certain about. The minimum quorum of one, but initial replication of two, is an unusual combination, and we don't know exactly how the SETI daemons are programmed to cope with it. Ideally, a simple 'finished/purge', but my concern would be that the system, in its current configuration, might create and send out a replacement task.

As an aside, if the deadlines had been shorter back when this workunit was created, it wouldn't still be here now. If the project was continuing, shorter deadlines would certainly reduce the damage from bumps in the road like this.
ID: 2036482 · Report as offensive     Reply Quote
Previous · 1 . . . 8 · 9 · 10 · 11 · 12 · 13 · 14 . . . 107 · Next

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119)


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.