The Server Issues / Outages Thread - Panic Mode On! (118)

Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030077 - Posted: 31 Jan 2020, 8:43:25 UTC - in response to Message 2030068.  
Last modified: 31 Jan 2020, 8:48:04 UTC

If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is Results returned and awaiting validation number growing in the status??
If my theory about results belonging to workunits waiting for assimilation being shown as waiting for validation is correct, then we could have about 7.5 million of the 12.2 million results there being ones that have been validated but not assimilated yet. And that is growing fast.

'Workunits waiting for assimilation' is supposed to be close to zero in a normal situation, because workunits get assimilated immediately after they have been validated. But for more than a week now that number has been steadily growing, recently by about 30,000 per hour. The Astropulse number has also been growing for the last couple of hours. There is some serious performance problem in assimilation.
ID: 2030077 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030078 - Posted: 31 Jan 2020, 8:45:18 UTC - in response to Message 2030075.  
Last modified: 31 Jan 2020, 8:49:03 UTC

The first machine is now out of work
You are crunching too fast. My caches are nearly full in both machines.
ID: 2030078 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22227
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2030079 - Posted: 31 Jan 2020, 8:47:19 UTC

Because I'm nowhere near the bulk of my computers I've had to resort to using the web options page to set don't do any SETI work - the first time I've had to resort to this sort of thing due to actions by SETI :-(
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2030079 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030080 - Posted: 31 Jan 2020, 8:54:49 UTC

There is a noise bombing window in blc35 at around 58692_07 and _08. Those are probably causing the current high return rate.
ID: 2030080 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030085 - Posted: 31 Jan 2020, 9:24:58 UTC - in response to Message 2030073.  

Eric replied - very late on Thursday night, his time -
Is he aware of the assimilation problem?
ID: 2030085 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030086 - Posted: 31 Jan 2020, 9:54:27 UTC - in response to Message 2030085.  

Eric replied - very late on Thursday night, his time -
Is he aware of the assimilation problem?
He didn't mention it, but I would expect so, yes: that's an unambiguous figure on the face of the SSP (and the more complete figures which, I presume, they have access to via internal monitoring).

The 'results awaiting validation' and 'workunits awaiting validation' figures are also unambiguous, but they are unusual - why are they so different? The first usually hovers around 4 million, but recently it's been 12 million. Why?

The rise started when the 'in progress' limit was raised - an obvious direct connection, no alarm bells. But why is it still so high? That needs explanation, and I've suggested a possible way of finding out the answer. Let's hope it works, else someone is going to have to come up with another suggestion.
ID: 2030086 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22227
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2030087 - Posted: 31 Jan 2020, 9:58:34 UTC

If this is an attempt to reduce the amount of work sitting around waiting to be validated it's not working as the number has increased from about 11,500,000 last night to about 12,250,000 this morning.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2030087 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030088 - Posted: 31 Jan 2020, 10:06:32 UTC - in response to Message 2030087.  

Well, of course it isn't working yet - he hasn't tried it yet! He wrote to me about 10 pm - four hours ago. He'll be asleep now, and when he wakes up, he said he "will see if I can figure out a transitioner trick" - possibly from scratch. Give him time - we wouldn't want it to go wrong, as his last idea seems to have done!
ID: 2030088 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030089 - Posted: 31 Jan 2020, 10:12:33 UTC - in response to Message 2030086.  

The 'results awaiting validation' and 'workunits awaiting validation' figures are also unambiguous, but they are unusual - why are they so different?
Results awaiting validation are all results that have been returned but not yet validated. In a normal situation almost all of those are results waiting for wingmen that are still crunching. Workunits awaiting validation are WUs that are ready to be validated (that is, all the results have been returned) but have not been validated yet; in a normal situation this is close to zero. So the result and workunit fields measure different things.
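
To make the difference between the two counters concrete, here is a minimal Python sketch. It only illustrates the definitions above and is not the Server Status Page's actual query; the `Result` and `Workunit` classes and both function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Result:
    returned: bool           # the host has reported this result
    validated: bool = False  # the validator has processed it


@dataclass
class Workunit:
    results: list  # list of Result objects (hypothetical structure)


def results_awaiting_validation(workunits):
    """Count individual results that are returned but not yet validated."""
    return sum(
        1
        for wu in workunits
        for r in wu.results
        if r.returned and not r.validated
    )


def workunits_awaiting_validation(workunits):
    """Count workunits whose results are all returned but not all validated."""
    return sum(
        1
        for wu in workunits
        if all(r.returned for r in wu.results)
        and not all(r.validated for r in wu.results)
    )


sample = [
    Workunit([Result(True), Result(False)]),             # wingman still crunching
    Workunit([Result(True), Result(True)]),              # ready, not yet validated
    Workunit([Result(True, True), Result(True, True)]),  # already validated
]

print(results_awaiting_validation(sample))
print(workunits_awaiting_validation(sample))
```

In this toy data set one result is waiting for a wingman and two belong to a workunit that is ready for the validator, so the result counter is larger than the workunit counter even though nothing is wrong.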

I'm almost sure that the results corresponding to 'workunits waiting for assimilation' are also counted in 'results waiting for validation'. So the assimilation problem is the direct cause of the huge apparent validation backlog.
ID: 2030089 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030090 - Posted: 31 Jan 2020, 10:28:06 UTC - in response to Message 2030089.  

I'm almost sure that the results corresponding to 'workunits waiting for assimilation' are also counted in 'results waiting for validation'. So the assimilation problem is the direct cause of the huge apparent validation backlog.
Assimilation takes place after validation - only the canonical result is assimilated into the science database. It would be curious if the 12 million figure was 'results waiting for validation, plus results which have already been validated'. Stranger things have happened in database design, but it's not the intuitive position.

Personally, I'm still returning an excessive number of overnight 'immediate overflow' BLC35 tasks, plus the faster running but normal Arecibo tasks - I think that's the driver for the increasing numbers.
ID: 2030090 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13750
Credit: 208,696,464
RAC: 304
Australia
Message 2030091 - Posted: 31 Jan 2020, 10:33:49 UTC - in response to Message 2030089.  
Last modified: 31 Jan 2020, 10:37:36 UTC

So the assimilation problem is the direct cause of the huge apparent validation backlog.
Nope.
The WUs have to be Validated before they can be Assimilated (moved to the Science database). Once they are Assimilated, then they can be deleted. However, once deleted, they aren't actually gone until they are purged from the database (hence you see Validated WUs in your Task list for 24 hrs, when things are working anyway).
We've got a problem with Validation, hence that huge (and growing) "Results returned and awaiting validation" backlog. There is also a problem with the Assimilators: once things have been Validated, they aren't getting Assimilated, hence the huge & growing "Workunits waiting for assimilation" backlog.
It all appears to boil down to disk I/O (Input/Output), or a lack thereof.
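
The order of the stages described above can be sketched as a tiny Python model. This is an illustration under stated assumptions, not BOINC server code: the `Stage` names, `advance` and `tick` are invented here, and a "stalled" stage simply stands in for a daemon that is not keeping up.

```python
from enum import Enum, auto


class Stage(Enum):
    # Hypothetical stage names, listed in the order described in the post.
    AWAITING_VALIDATION = auto()
    AWAITING_ASSIMILATION = auto()
    AWAITING_DELETION = auto()
    AWAITING_PURGE = auto()
    PURGED = auto()


ORDER = list(Stage)  # Enum members iterate in definition order


def advance(stage):
    """Return the next stage; PURGED is terminal."""
    i = ORDER.index(stage)
    return ORDER[min(i + 1, len(ORDER) - 1)]


def tick(stages, stalled):
    """Advance every workunit one step unless it sits in a stalled stage."""
    return [s if s in stalled else advance(s) for s in stages]


# Three workunits enter the pipeline while the assimilator is stalled.
stages = [Stage.AWAITING_VALIDATION] * 3
for _ in range(5):
    stages = tick(stages, stalled={Stage.AWAITING_ASSIMILATION})

print([s.name for s in stages])
```

Everything piles up in front of the stalled stage, which is the shape of a growing "waiting for assimilation" backlog.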


Edit-
and this all in addition to the not so occasional upload & occasional download issues we've been having for some time now.
Grant
Darwin NT
ID: 2030091 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030093 - Posted: 31 Jan 2020, 10:50:44 UTC - in response to Message 2030090.  
Last modified: 31 Jan 2020, 10:56:43 UTC

Assimilation takes place after validation - only the canonical result is assimilated into the science database. It would be curious if the 12 million figure was 'results waiting for validation, plus results which have already been validated'. Stranger things have happened in database design, but it's not the intuitive position.
It's not a database design thing, but a Server Status Page design thing. There is no field for results waiting for assimilation, so it must either list those results under some other state (waiting for validation or waiting for purging) or keep them hidden.

The fact that the total sum of all the shown result counts has hovered very near 20 million, which was the number Eric said they are targeting by splitter throttling, makes the hidden assumption unlikely. Results waiting for db purging is smaller than workunits waiting for assimilation, so that alternative can be eliminated too. The only one left is results waiting for validation.

Those result rows must exist because they aren't deleted before the result files are deleted. And those files must exist because the data in them is needed for assimilation. Also, I believe database integrity requires that all the result rows exist as long as the workunit row exists, not just the canonical result. I have never seen on the web page a workunit that is missing results (fewer results than the 'initial replication' number).

My validated results appear on the web page pretty soon after my hosts have returned them, so there doesn't seem to be any significant validation backlog. But the count of results waiting for validation is enormous despite this. If the results waiting for assimilation were counted in it too, then this would perfectly explain the discrepancy.
ID: 2030093 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 716
Credit: 8,032,827
RAC: 62
France
Message 2030094 - Posted: 31 Jan 2020, 10:54:11 UTC - in response to Message 2029963.  
Last modified: 31 Jan 2020, 11:08:57 UTC

better in PM ^^ hop edited
ID: 2030094 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030095 - Posted: 31 Jan 2020, 11:01:03 UTC - in response to Message 2030093.  

The assimilator only looks at the workunit table: does this WU have a canonical result? If yes, it's ready for assimilation, and should be counted in the (current) 3 million total of 'ready' WUs.

Having decided it's ready, the assimilator has to load the result row from that bloated result table (to find out what the filename is - depends on which result became canonical), and then retrieve the result data file from disk. I expect both of those operations are slower than normal.
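
As a rough sketch of the two steps described above, assuming plain Python dictionaries in place of the real database tables: the field names (`canonical_resultid`, `assimilated`, `filename`), the function names and the `read_file` callback are all hypothetical, chosen only to mirror the description, and the `assimilated` flag is an added assumption so that a workunit is not picked up twice.

```python
def ready_for_assimilation(workunits):
    """Step 1: look only at the workunit table for a canonical result."""
    return [
        wu
        for wu in workunits
        if wu["canonical_resultid"] and not wu["assimilated"]
    ]


def assimilate(wu, result_table, read_file):
    """Step 2: load the canonical result row, then fetch its data file."""
    row = result_table[wu["canonical_resultid"]]  # lookup in the large result table
    data = read_file(row["filename"])             # disk read of the result file
    wu["assimilated"] = True
    return data


# Toy data: workunit 1 has a canonical result, workunit 2 does not yet.
workunits = [
    {"id": 1, "canonical_resultid": 11, "assimilated": False},
    {"id": 2, "canonical_resultid": 0, "assimilated": False},
]
result_table = {11: {"filename": "r11.out"}}
files = {"r11.out": b"signals"}

for wu in ready_for_assimilation(workunits):
    print(wu["id"], assimilate(wu, result_table, files.__getitem__))
```

Both the row lookup and the file read happen once per workunit, so if either is slow the "ready" count grows even though deciding what is ready stays cheap.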
ID: 2030095 · Report as offensive
Profile Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 7379
Credit: 44,181,323
RAC: 238
United States
Message 2030098 - Posted: 31 Jan 2020, 11:48:21 UTC - in response to Message 2029989.  

The Validation procedure has been changed within the past day, now you don't receive an Inconclusive result if your result is not received first. Now the First result received is Automatically Validated and any following results that don't match are ruled Invalid. Look at these results, do you see any third result, or Inconclusive result?

[ snipped ]

Here is the Host whose results were received First, and Awarded as being Valid, State: All (5990) · In progress (300) · Validation pending (189) · Validation inconclusive (407) · Valid (4590) · Invalid (504) · Error (0)
That Host doesn't appear to be producing Valid results does it? It's producing what appears to be Overflows, but the Stderr is missing. Yet, the results are being Validated.

Hi TBar,

Please read this whole post before responding. :)

This is what I see... nothing. I don't sit here every minute of every day analyzing each and every WU my hosts do, as you, Grant, Richard H. and others seem to do. I do not see what you guys are talking about. I went to the computer you linked to and I saw inconclusive, validated and error WUs. I did not see any invalid WUs. I did not go to the previous or next pages. Are you saying that some computers are getting nothing but valid WUs and others are getting nothing but invalid WUs?

The way I understand the inconclusive status is that if 2 hosts send in results on the same WU and they don't match, the WU is given an inconclusive status for both hosts and the WU gets sent out to another host. When that result comes in if it matches one of the other results, they get a valid status and the 3rd gets invalid. Correct? If this is the way it "used to work" then it seems you are saying that the validation process has been changed. Correct?

My question now is how can the first WU result get a valid status if it is NOT compared to another result? How is it getting validated? Or is it just automatically getting the valid status when it crosses the finish line first and all others are invalid? If the 3rd question (ok 3 questions, not one ;) ) is correct, then I understand why you guys are saying it is not real science. Perhaps I just figured out what you guys are talking about. ;) In that case, is it best to go NNT and wait for the validation process to go back to what it should be, if it ever does?
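
The two behaviours being compared here can be sketched in a few lines of Python. This is a simplified model, not the project's validator: results are compared by exact equality of a made-up signature string (a real comparison would have to allow for small numerical differences), and the function name, status strings and `quorum` parameter are assumptions for the example.

```python
def validate(results, quorum=2):
    """Assign a status to each host's result for one workunit.

    results: dict mapping host name -> result signature, in return order.
    """
    if len(results) < quorum:
        return {host: "pending" for host in results}

    # Group hosts by the result they returned.
    groups = {}
    for host, signature in results.items():
        groups.setdefault(signature, []).append(host)

    # Largest agreeing group; ties go to the earliest returned group.
    agreeing = max(groups.values(), key=len)
    if len(agreeing) < quorum:
        return {host: "inconclusive" for host in results}
    return {
        host: ("valid" if host in agreeing else "invalid")
        for host in results
    }


print(validate({"A": "x", "B": "y"}))            # quorum of 2, no match yet
print(validate({"A": "x", "B": "y", "C": "y"}))  # third result breaks the tie
print(validate({"A": "x", "B": "y"}, quorum=1))  # first result wins
```

With `quorum=2`, two disagreeing results both stay inconclusive until a third arrives and matches one of them. With `quorum=1`, the first result returned is accepted on its own and a later, different result is marked invalid, whichever of the two was actually correct.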

Have a great day! :)

Siran
CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 2030098 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2030099 - Posted: 31 Jan 2020, 11:52:58 UTC

Quorum of 1 Bad Example. Here's a failure of the Approach. The PC with the validated result has 461 invalids, but returned ahead of my PC by 7 seconds! It has no stderr output, so what did it return?

https://setiathome.berkeley.edu/workunit.php?wuid=3860444977

Curious Example. Saw this from my overnight run. Both my PC and the Apple had the same number of Autocorr and Pulses found, with similar peaks and time (rounding error diffs?). The Apple got credit because it was returned first. Not clear which is correct, or whether both are. Track record is similar in numbers; mine is better in percentage of valid tasks.

https://setiathome.berkeley.edu/workunit.php?wuid=3859724613

I don't understand how the first PC above was allowed to have a quorum of 1. In general, this seems to be introducing questionable results into the science database. More importantly, it may potentially reject an ET result.
ID: 2030099 · Report as offensive
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2030109 - Posted: 31 Jan 2020, 13:35:47 UTC - in response to Message 2030099.  
Last modified: 31 Jan 2020, 13:53:55 UTC

I got one of these too. It also has one result returned first with no data and marked valid, while mine, with data but returned after the other, was rejected as invalid.


Edit: Just noticed that both of the WUs involved an SoG WU which was returned first, with the empty WU result.

https://setiathome.berkeley.edu/workunit.php?wuid=3860988255
ID: 2030109 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030128 - Posted: 31 Jan 2020, 17:26:20 UTC

Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew!
ID: 2030128 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2030131 - Posted: 31 Jan 2020, 17:38:48 UTC - in response to Message 2030128.  

Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew!


Thanks for the very good information. I'll jump back in the pool now that it is safe. Now allowing New Tasks! (in small amounts to help the db)
ID: 2030131 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030134 - Posted: 31 Jan 2020, 17:48:11 UTC - in response to Message 2030128.  

Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew!

How ironic that the machines producing nothing but trash are still being sent work while my two machines which produce Valid results have been mostly out of work for hours.
https://setiathome.berkeley.edu/results.php?hostid=7941469&offset=100 At least they won't be receiving False valids anymore.
ID: 2030134 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.