The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030088 - Posted: 31 Jan 2020, 10:06:32 UTC - in response to Message 2030087.  

Well, of course it isn't working yet - he hasn't tried it yet! He wrote to me about 10 pm - four hours ago. He'll be asleep now, and when he wakes up, he said he "will see if I can figure out a transitioner trick" - possibly from scratch. Give him time - we wouldn't want it to go wrong, as his last idea seems to have done!
ID: 2030088 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030089 - Posted: 31 Jan 2020, 10:12:33 UTC - in response to Message 2030086.  

The 'results awaiting validation' and 'workunits awaiting validation' figures are also unambiguous, but they are unusual - why are they so different?
Results awaiting validation are all results that have been returned but not validated yet. In a normal situation, almost all of those are results waiting for wingmen who are still crunching. Workunits awaiting validation are WUs that are ready to be validated (that is, all of their results have been returned) but have not been validated yet; in a normal situation this is close to zero. So the result and workunit fields measure different things.
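
A minimal sketch of the distinction between the two counters, assuming a toy in-memory model rather than the real BOINC database. The `Result` and `Workunit` classes and both counting functions are hypothetical names used only for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    returned: bool = False   # the host has reported this result
    validated: bool = False  # the validator has processed this result


@dataclass
class Workunit:
    results: list = field(default_factory=list)


def results_awaiting_validation(workunits):
    """Count every returned result that has not been validated yet."""
    return sum(
        1
        for wu in workunits
        for r in wu.results
        if r.returned and not r.validated
    )


def workunits_awaiting_validation(workunits):
    """Count workunits whose results are all returned but not all validated."""
    return sum(
        1
        for wu in workunits
        if wu.results
        and all(r.returned for r in wu.results)
        and not all(r.validated for r in wu.results)
    )


# Toy data: one WU still waiting for a wingman, one ready to validate,
# and one already validated.
example = [
    Workunit([Result(returned=True), Result(returned=False)]),
    Workunit([Result(returned=True), Result(returned=True)]),
    Workunit([Result(True, True), Result(True, True)]),
]
print(results_awaiting_validation(example), workunits_awaiting_validation(example))
```

In this toy data the result counter is larger than the workunit counter, because results still waiting for a wingman count toward the first figure but not the second.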

I'm almost sure that the results corresponding to 'workunits waiting for assimilation' are also counted in 'results waiting for validation'. So the assimilation problem is the direct cause of the huge apparent validation backlog.
ID: 2030089 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030090 - Posted: 31 Jan 2020, 10:28:06 UTC - in response to Message 2030089.  

I'm almost sure that the results corresponding to 'workunits waiting for assimilation' are also counted in 'results waiting for validation'. So the assimilation problem is the direct cause of the huge apparent validation backlog.
Assimilation takes place after validation - only the canonical result is assimilated into the science database. It would be curious if the 12 million figure was 'results waiting for validation, plus results which have already been validated'. Stranger things have happened in database design, but it's not the intuitive position.

Personally, I'm still returning an excessive number of overnight 'immediate overflow' BLC35 tasks, plus the faster running but normal Arecibo tasks - I think that's the driver for the increasing numbers.
ID: 2030090 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14020
Credit: 208,696,464
RAC: 304
Australia
Message 2030091 - Posted: 31 Jan 2020, 10:33:49 UTC - in response to Message 2030089.  
Last modified: 31 Jan 2020, 10:37:36 UTC

So the assimilation problem is the direct cause of the huge apparent validation backlog.
Nope.
The WUs have to be Validated before they can be Assimilated (moved to the Science database). Once they are Assimilated, they can be deleted. However, once deleted, they aren't actually gone until they are purged from the database (hence you see Validated WUs in your Task list for 24 hrs, when things are working anyway).
We've got a problem with Validation, hence that huge (and growing) "Results returned and awaiting validation" backlog, and there is also a problem with the Assimilators: once things have been Validated, they aren't getting Assimilated, hence the huge & growing "Workunits waiting for assimilation" backlog.
It all appears to boil down to disk I/O (Input/Output), or a lack thereof.
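
The ordering described above (validate, then assimilate, then delete, then purge) can be sketched as a tiny state machine. This is only an illustration under stated assumptions: the `Stage` names and the `advance` helper are hypothetical and are not taken from the BOINC server code.

```python
from enum import IntEnum


class Stage(IntEnum):
    # Hypothetical stage names, in the order described in the post.
    RETURNED = 0
    VALIDATED = 1
    ASSIMILATED = 2
    FILES_DELETED = 3
    PURGED = 4


def advance(stage, stalled=frozenset()):
    """Move a workunit to its next stage unless that step is stalled.

    `stalled` holds the stages that cannot currently be reached, for
    example because the daemon responsible is starved of disk I/O.
    """
    if stage is Stage.PURGED:
        return stage
    next_stage = Stage(stage + 1)
    return stage if next_stage in stalled else next_stage


# With assimilation stalled, validated workunits pile up at VALIDATED.
print(advance(Stage.VALIDATED, stalled={Stage.ASSIMILATED}).name)
```

Because each stage can only follow the previous one, a stall at any step backs up everything behind it, which is consistent with both backlogs growing at once.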


Edit-
And this is all in addition to the not-so-occasional upload & occasional download issues we've been having for some time now.
Grant
Darwin NT
ID: 2030091 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030093 - Posted: 31 Jan 2020, 10:50:44 UTC - in response to Message 2030090.  
Last modified: 31 Jan 2020, 10:56:43 UTC

Assimilation takes place after validation - only the canonical result is assimilated into the science database. It would be curious if the 12 million figure was 'results waiting for validation, plus results which have already been validated'. Stranger things have happened in database design, but it's not the intuitive position.
It's not a database design thing, but a Server Status Page design thing. There is no field for results waiting for assimilation, so it must either list those results under some other state (waiting for validation or waiting for purging) or keep them hidden.

The total of all the displayed result counts has hovered very near 20 million, which is the number Eric said they are targeting through splitter throttling, so the 'hidden' alternative is unlikely. The count of results waiting for db purging is smaller than the count of workunits waiting for assimilation, so that alternative can be eliminated too. The only one left is results waiting for validation.

Those result rows must exist because they aren't deleted before the result files are deleted. And those files must exist because the data in them is needed for assimilation. Also, I believe database integrity requires that all the result rows exist as long as the workunit row exists, not just the canonical result. I have never seen a workunit on the web page that is missing results (fewer results than the 'initial replication' number).

My validated results appear on the web page pretty soon after my hosts have returned them, so there doesn't seem to be any significant validation backlog. But the count of results waiting for validation is enormous despite this. If the results waiting for assimilation were counted in it too, that would perfectly explain the discrepancy.
ID: 2030093 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 717
Credit: 8,032,827
RAC: 62
France
Message 2030094 - Posted: 31 Jan 2020, 10:54:11 UTC - in response to Message 2029963.  
Last modified: 31 Jan 2020, 11:08:57 UTC

better in PM ^^ hop edited
ID: 2030094 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030095 - Posted: 31 Jan 2020, 11:01:03 UTC - in response to Message 2030093.  

The assimilator only looks at the workunit table: does this WU have a canonical result? If yes, it's ready for assimilation, and should be counted in the (current) 3 million total of 'ready' WUs.

Having decided it's ready, the assimilator has to load the result row from that bloated result table (to find out what the filename is - depends on which result became canonical), and then retrieve the result data file from disk. I expect both of those operations are slower than normal.
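
A rough sketch of the two steps described above, assuming plain dictionaries stand in for the workunit and result tables. The field names `canonical_result_id`, `assimilated` and `output_file` are illustrative assumptions, not the actual schema:

```python
def ready_for_assimilation(workunit):
    """A workunit is ready once it has a canonical result and is not yet assimilated."""
    return workunit["canonical_result_id"] is not None and not workunit["assimilated"]


def assimilation_queue(workunit_table):
    """Step 1: scan only the workunit table to find ready workunits."""
    return [wu_id for wu_id, wu in workunit_table.items() if ready_for_assimilation(wu)]


def canonical_output_file(workunit, result_table):
    """Step 2: load the canonical result row to learn which file to read."""
    row = result_table[workunit["canonical_result_id"]]
    return row["output_file"]


# Toy tables: workunit 1 has a canonical result, workunit 2 does not yet.
workunits = {
    1: {"canonical_result_id": 11, "assimilated": False},
    2: {"canonical_result_id": None, "assimilated": False},
}
results = {
    10: {"output_file": "wu1_result_0"},
    11: {"output_file": "wu1_result_1"},
}
for wu_id in assimilation_queue(workunits):
    print(wu_id, canonical_output_file(workunits[wu_id], results))
```

The readiness check touches only the workunit table, while the second step needs a lookup in the (much larger) result table plus a file read, which is where the slowdown would be expected to show.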
ID: 2030095 · Report as offensive
Profile Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 7381
Credit: 44,181,323
RAC: 238
United States
Message 2030098 - Posted: 31 Jan 2020, 11:48:21 UTC - in response to Message 2029989.  

The Validation procedure has been changed within the past day, now you don't receive an Inconclusive result if your result is not received first. Now the First result received is Automatically Validated and any following results that don't match are ruled Invalid. Look at these results, do you see any third result, or Inconclusive result?

[ snipped ]

Here is the Host whose results were received First, and Awarded as being Valid, State: All (5990) · In progress (300) · Validation pending (189) · Validation inconclusive (407) · Valid (4590) · Invalid (504) · Error (0)
That Host doesn't appear to be producing Valid results does it? It's producing what appears to be Overflows, but the Stderr is missing. Yet, the results are being Validated.

Hi TBar,

Please read this whole post before responding. :)

This is what I see... nothing. I don't sit here every minute of every day analyzing each and every WU my hosts do like it seems you, Grant, Richard H. and others seem to do. I do not see what you guys are talking about. I went to the computer you linked to and I saw inconclusive, validated and error WUs. I did not see any invalid WUs. I did not go to the previous or next pages. Are you saying that some computers are getting nothing but valid WUs and others are getting nothing but invalid WUs?

The way I understand the inconclusive status is that if 2 hosts send in results on the same WU and they don't match, the WU is given an inconclusive status for both hosts and the WU gets sent out to another host. When that result comes in if it matches one of the other results, they get a valid status and the 3rd gets invalid. Correct? If this is the way it "used to work" then it seems you are saying that the validation process has been changed. Correct?

My question now is how can the first WU result get a valid status if it is NOT compared to another result? How is it getting validated? Or is it just automatically getting the valid status when it crosses the finish line first and all others are invalid? If the 3rd question (ok 3 questions, not one ;) ) is correct, then I understand why you guys are saying it is not real science. Perhaps I just figured out what you guys are talking about. ;) In that case, is it best to go NNT and wait for the validation process to go back to what it should be, if it ever does?

Have a great day! :)

Siran
CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 2030098 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2030099 - Posted: 31 Jan 2020, 11:52:58 UTC

Quorum of 1 Bad Example. Here's a failure of the Approach. The PC with the validated result has 461 invalids, but returned ahead of my PC by 7 seconds! It has no stderr output, so what did it return?

https://setiathome.berkeley.edu/workunit.php?wuid=3860444977

Curious Example. Saw this in my overnight results. Both my PC and the Apple had the same number of Autocorr and Pulses found, with similar peaks and times (rounding error diffs?). The Apple got credit because it was returned first. Not clear whether one or both are correct. Track records are similar in numbers; mine is better in percentage of valid tasks.

https://setiathome.berkeley.edu/workunit.php?wuid=3859724613

I don't understand how the first PC above was allowed to have a quorum of 1. In general, this seems to be introducing questionable results into the science database. More importantly, it may potentially reject an ET result.
ID: 2030099 · Report as offensive
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2030109 - Posted: 31 Jan 2020, 13:35:47 UTC - in response to Message 2030099.  
Last modified: 31 Jan 2020, 13:53:55 UTC

I got one of these too. It also has one result that was returned first with no data and marked valid, while mine, which has data but was returned after the other, was rejected as invalid.


Edit: Just noticed that both of the WUs involved an SoG WU which was returned first, with the empty WU result.

https://setiathome.berkeley.edu/workunit.php?wuid=3860988255
ID: 2030109 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3870
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030128 - Posted: 31 Jan 2020, 17:26:20 UTC

Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew!
ID: 2030128 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2030131 - Posted: 31 Jan 2020, 17:38:48 UTC - in response to Message 2030128.  

Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew!


Thanks for the very good information. I'll jump back in the pool now that it is safe. Now allowing New Tasks! (in small amounts to help the db)
ID: 2030131 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030134 - Posted: 31 Jan 2020, 17:48:11 UTC - in response to Message 2030128.  

Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew!

How ironic that the machines producing nothing but trash are still being sent work while my two machines which produce Valid results have been mostly out of work for hours.
https://setiathome.berkeley.edu/results.php?hostid=7941469&offset=100 At least they won't be receiving False valids anymore.
ID: 2030134 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030136 - Posted: 31 Jan 2020, 17:53:45 UTC - in response to Message 2030098.  

The way I understand the inconclusive status is that if 2 hosts send in results on the same WU and they don't match, the WU is given an inconclusive status for both hosts and the WU gets sent out to another host. When that result comes in if it matches one of the other results, they get a valid status and the 3rd gets invalid. Correct?
Almost but not quite correct. The first comparison that makes it inconclusive is much more strict than the final comparison that detects who is right. Most inconclusives have only small differences, and when the third result is returned and is similar to both of the initial results, all three become Valid. Only in a minority of inconclusives does someone get condemned as Invalid.

My question now is how can the first WU result get a valid status if it is NOT compared to another result?
They changed minimum quorum to 1 in a desperate attempt to relieve server load, so the first result returned is automatically assumed Valid. If the second result doesn't match it, it becomes Invalid regardless of who was right. A bit later they changed initial replication to 1 too, so that the task is never sent to a wingman and a bad wingman can't make your good results Invalid any more.
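
To make the difference between the two modes concrete, here is a hedged sketch of the logic as described in this thread. It is not the project's validator: `validate`, `strict_match` and `loose_match` are hypothetical names, results are reduced to single numbers, and using the loose comparison for late results under a quorum of 1 is an assumption.

```python
def validate(results, min_quorum, strict_match, loose_match):
    """Return one state per result, in the order the results were returned."""
    n = len(results)
    if n < min_quorum:
        return ["pending"] * n

    if min_quorum == 1:
        # Quorum of 1: the first result is accepted unchecked, and any
        # later result is judged against it, right or wrong.
        first = results[0]
        return ["valid"] + [
            "valid" if loose_match(first, r) else "invalid" for r in results[1:]
        ]

    if n == min_quorum:
        # Initial comparison is strict; any mismatch means inconclusive,
        # and another copy of the task would be sent out.
        all_agree = all(
            strict_match(a, b)
            for i, a in enumerate(results)
            for b in results[i + 1:]
        )
        return ["valid"] * n if all_agree else ["inconclusive"] * n

    # Extra results have arrived: use the looser comparison to decide.
    states = []
    for i, r in enumerate(results):
        agreeing = sum(
            1 for j, other in enumerate(results) if j != i and loose_match(r, other)
        )
        states.append("valid" if agreeing >= min_quorum - 1 else "invalid")
    return states if "valid" in states else ["inconclusive"] * n


def strict(a, b):
    return abs(a - b) < 0.01


def loose(a, b):
    return abs(a - b) < 1.0


print(validate([10.0, 10.5], 2, strict, loose))
print(validate([10.0, 10.5, 10.2], 2, strict, loose))
print(validate([0.0, 10.0], 1, strict, loose))
```

The last call shows the problem raised in this thread: with a quorum of 1, a bad first result is marked valid and the later result is marked invalid, whichever one was actually right.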
ID: 2030136 · Report as offensive
Kevin Olley

Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 2030140 - Posted: 31 Jan 2020, 18:30:26 UTC - in response to Message 2030128.  

Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew!


Allow new tasks has now been set.
Kevin


ID: 2030140 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3870
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030147 - Posted: 31 Jan 2020, 19:32:54 UTC

Work unit generation is off, and "Results returned and awaiting validation" has passed 13M, which has to be the highest ever. I am expecting a News blurb any moment about low/no work for the next little while, but if no blurb comes I would still get the backup projects ready, i.e. disable NNT and set their task share to zero (works for most of them), so that if the cache runs dry the machine will get minimal work from them. I predict a dry weekend, but it's for the integrity of the data and thus the very existence of the SETI@Home project, so it's a worthy cause.
ID: 2030147 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2030149 - Posted: 31 Jan 2020, 20:02:02 UTC - in response to Message 2030147.  
Last modified: 31 Jan 2020, 20:40:52 UTC

Work unit generation is off, and "Results returned and awaiting validation" has passed 13M, which has to be the highest ever. I am expecting a News blurb any moment about low/no work for the next little while, but if no blurb comes I would still get the backup projects ready, i.e. disable NNT and set their task share to zero (works for most of them), so that if the cache runs dry the machine will get minimal work from them. I predict a dry weekend, but it's for the integrity of the data and thus the very existence of the SETI@Home project, so it's a worthy cause.


. . And you pre-empted my message but I will write it anyway ...

{edit}
. . Uploads - stalling
. . Downloads - stalling
. . Constant "no tasks available"
{/edit}

. . SETI is Boinked ...

Stephen

:(
ID: 2030149 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3870
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030150 - Posted: 31 Jan 2020, 20:03:51 UTC - in response to Message 2030149.  

Actually by writing it, I caused the splitters to come back up, there is work, and Results returned fell below 13M just to prove me wrong. Mission accomplished. :^) Apparently Dr. K. has a script to clear the backlog and keep the project up... fingers crossed.
ID: 2030150 · Report as offensive
Profile Schatten

Joined: 12 Oct 02
Posts: 18
Credit: 14,047,388
RAC: 9
Germany
Message 2030152 - Posted: 31 Jan 2020, 20:19:29 UTC
Last modified: 31 Jan 2020, 20:21:09 UTC

I got some strange-looking WUs. Examples: https://setiathome.berkeley.edu/workunit.php?wuid=3863823181 https://setiathome.berkeley.edu/workunit.php?wuid=3863820784 There are more, I think.

Something isn't right. I thought it was fixed. :-/
ID: 2030152 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030153 - Posted: 31 Jan 2020, 20:29:57 UTC - in response to Message 2030152.  

"created 31 Jan 2020, 17:15:55 UTC"

I think Eric has swapped things back since then, but there'll still be a few in the system.
ID: 2030153 · Report as offensive


 