The Server Issues / Outages Thread - Panic Mode On! (118)
Author | Message |
---|---|
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"If we are no longer validating the WUs properly... some of mine don't even have a wingman... why is Results returned and awaiting validation number growing in the status??" If my theory about results belonging to workunits waiting for assimilation being shown as waiting for validation is correct, then we could have about 7.5 million of the 12.2 million results there being ones that have been validated but not assimilated yet. And that is growing fast. The 'Workunits waiting for assimilation' figure is supposed to be close to zero in a normal situation because workunits get assimilated immediately after they have been validated. But for more than a week now that number has been steadily growing, recently by about 30000 per hour. The Astropulse number has also been growing for the last couple of hours. There is some serious performance problem in assimilation. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"The first machine is now out of work" You are crunching too fast. My caches are nearly full in both machines. |
rob smith Send message Joined: 7 Mar 03 Posts: 22227 Credit: 416,307,556 RAC: 380 |
Because I'm nowhere near the bulk of my computers, I've had to resort to using the web options page to set "don't do any SETI work" - the first time I've had to resort to this sort of thing due to actions by SETI :-( Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
There is a noise bombing window in blc35 at around 58692_07 and _08. Those are probably causing the current high return rate. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"Eric replied - very late on Thursday night, his time -" Is he aware of the assimilation problem? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
"Eric replied - very late on Thursday night, his time -" "Is he aware of the assimilation problem?" He didn't mention it, but I would expect so, yes: that's an unambiguous figure on the face of the SSP (and the more complete figures which, I presume, they have access to via internal monitoring). The 'results awaiting validation' and 'workunits awaiting validation' figures are also unambiguous, but they are unusual - why are they so different? The first usually hovers around 4 million, but recently it's been 12 million. Why? The rise started when the 'in progress' limit was raised - an obvious direct connection, no alarm bells. But why is it still so high? That needs explanation, and I've suggested a possible way of finding out the answer. Let's hope it works, else someone is going to have to come up with another suggestion. |
rob smith Send message Joined: 7 Mar 03 Posts: 22227 Credit: 416,307,556 RAC: 380 |
If this is an attempt to reduce the amount of work sitting around waiting to be validated, it's not working, as the number has increased from about 11,500,000 last night to about 12,250,000 this morning. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
Well, of course it isn't working yet - he hasn't tried it yet! He wrote to me about 10 pm - four hours ago. He'll be asleep now, and when he wakes up, he said he "will see if I can figure out a transitioner trick" - possibly from scratch. Give him time - we wouldn't want it to go wrong, as his last idea seems to have done! |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"The 'results awaiting validation' and 'workunits awaiting validation' figures are also unambiguous, but they are unusual - why are they so different?" Results awaiting validation are all results that have been returned but not validated yet. In a normal situation, almost all of those are results that are waiting for their wingmen, who are still crunching. Workunits awaiting validation are WUs that are ready to be validated (that is, all the results have been returned) but have not been validated yet, and in a normal situation this is close to zero. So the result and workunit fields measure different things. I'm almost sure that the results corresponding to 'workunits waiting for assimilation' are also counted in 'results waiting for validation'. So the assimilation problem is the direct cause of the huge apparent validation backlog. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
"I'm almost sure that the results corresponding to 'workunits waiting for assimilation' are also counted in 'results waiting for validation'. So the assimilation problem is the direct cause of the huge apparent validation backlog." Assimilation takes place after validation - only the canonical result is assimilated into the science database. It would be curious if the 12 million figure was 'results waiting for validation, plus results which have already been validated'. Stranger things have happened in database design, but it's not the intuitive position. Personally, I'm still returning an excessive number of overnight 'immediate overflow' BLC35 tasks, plus the faster running but normal Arecibo tasks - I think that's the driver for the increasing numbers. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13750 Credit: 208,696,464 RAC: 304 |
"So the assimilation problem is the direct cause of the huge apparent validation backlog." Nope. The WUs have to be Validated before they can be Assimilated (moved to the Science database). Once they are Assimilated, then they can be deleted. However, once deleted, they aren't actually removed until they are purged from the database (hence you see Validated WUs in your Task list for 24hrs, when things are working anyway). We've got a problem with Validation, hence that huge (and growing) "Results returned and awaiting validation" backlog, and there is also a problem with the Assimilators: once things have been Validated, they aren't getting Assimilated, hence the huge and growing "Workunits waiting for assimilation" backlog. It appears it all boils down to disk I/O (Input/Output), or a lack thereof. Edit- and this is all in addition to the not so occasional upload and occasional download issues we've been having for some time now. Grant Darwin NT |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"Assimilation takes place after validation - only the canonical result is assimilated into the science database. It would be curious if the 12 million figure was 'results waiting for validation, plus results which have already been validated'. Stranger things have happened in database design, but it's not the intuitive position." It's not a database design thing, but a Server Status Page design thing. There is no field for results waiting for assimilation, so it must either list those results under some other state (waiting for validation or waiting for purging) or keep them hidden. The fact that the total sum of all the shown result counts has hovered very near 20 million, which was the number Eric said they are targeting by splitter throttling, makes the hidden assumption unlikely. Results waiting for db purging is smaller than workunits waiting for assimilation, so that alternative can be eliminated too. The only one left is results waiting for validation. Those result rows must exist because they aren't deleted before the result files are deleted. And those files must exist because the data in them is needed for assimilation. Also, I believe database integrity requires that all the result rows must exist as long as the workunit row exists, not just the canonical result. I have never seen on the web page a workunit that is missing results (fewer results than the 'initial replication' number). My validated results appear on the web page pretty soon after my hosts have returned them, so there doesn't seem to be any significant validation backlog. But the count of results waiting for validation is enormous despite this. If the results waiting for assimilation were counted in it too, then this would perfectly explain the discrepancy. |
Kissagogo27 Send message Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62 |
better in PM ^^ hop edited |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
The assimilator only looks at the workunit table: does this WU have a canonical result? If yes, it's ready for assimilation, and should be counted in the (current) 3 million total of 'ready' WUs. Having decided it's ready, the assimilator has to load the result row from that bloated result table (to find out what the filename is - depends on which result became canonical), and then retrieve the result data file from disk. I expect both of those operations are slower than normal. |
Siran d'Vel'nahr Send message Joined: 23 May 99 Posts: 7379 Credit: 44,181,323 RAC: 238 |
"The Validation procedure has been changed within the past day, now you don't receive an Inconclusive result if your result is not received first. Now the First result received is Automatically Validated and any following results that don't match are ruled Invalid. Look at these results, do you see any third result, or Inconclusive result?" Hi TBar, Please read this whole post before responding. :) This is what I see... nothing. I don't sit here every minute of every day analyzing each and every WU my hosts do like it seems you, Grant, Richard H. and others seem to do. I do not see what you guys are talking about. I went to the computer you linked to and I saw inconclusive, validated and error WUs. I did not see any invalid WUs. I did not go to the previous or next pages. Are you saying that some computers are getting nothing but valid WUs and others are getting nothing but invalid WUs? The way I understand the inconclusive status is that if 2 hosts send in results on the same WU and they don't match, the WU is given an inconclusive status for both hosts and the WU gets sent out to another host. When that result comes in, if it matches one of the other results, they get a valid status and the 3rd gets invalid. Correct? If this is the way it "used to work" then it seems you are saying that the validation process has been changed. Correct? My question now is how can the first WU result get a valid status if it is NOT compared to another result? How is it getting validated? Or is it just automatically getting the valid status when it crosses the finish line first and all others are invalid? If the 3rd question (ok 3 questions, not one ;) ) is correct, then I understand why you guys are saying it is not real science. Perhaps I just figured out what you guys are talking about. ;) In that case, is it best to go NNT and wait for the validation process to go back to what it should be, if it ever does? Have a great day! 
:) Siran CAPT Siran d'Vel'nahr - L L & P _\\// Winders 11 OS? "What a piece of junk!" - L. Skywalker "Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath |
Freewill Send message Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
Quorum of 1 Bad Example. Here's a failure of the Approach. The PC with the validated result has 461 invalids, but returned ahead of my PC by 7 seconds! It has no stderr output, so what did it return? https://setiathome.berkeley.edu/workunit.php?wuid=3860444977 Curious Example. Saw this from my overnight. Both my PC and the Apple had same number of Autocorr and Pulses found, similar peaks and time (rounding error diffs?). The Apple got credit cause it was first returned. Not clear which or both are correct. Track record similar in numbers; mine is better in percentage valid tasks. https://setiathome.berkeley.edu/workunit.php?wuid=3859724613 I don't understand how the first PC above was allowed to have a quorum of 1. In general, this seems to be introducing questionable results into the science database. More importantly, it may potentially reject an ET result. |
AllgoodGuy Send message Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
I got one of these too. It also has one returned first with no data marked valid, mine rejected as invalid with data, but returned after the other. Edit: Just noticed that both of the WUs involved an SoG WU which was returned first, with the empty WU result. https://setiathome.berkeley.edu/workunit.php?wuid=3860988255 |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319 |
|
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew! Thanks for the very good information. I'll jump back in the pool now that it is safe. Now allowing New Tasks! (in small amounts to help the db) |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew! How ironic that the machines producing nothing but trash are still being sent work while my two machines which produce Valid results have been mostly out of work for hours. https://setiathome.berkeley.edu/results.php?hostid=7941469&offset=100 At least they won't be receiving False valids anymore. |
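
The disagreement above about whether validated-but-unassimilated results are being counted under 'results waiting for validation' can be made concrete with a toy model. What follows is only a sketch: the state names, the `Workunit` class, the `status_counts` function and the sample numbers are all hypothetical, invented for illustration, and are not taken from the BOINC schema or from the real Server Status Page code. It shows how the same set of workunits would produce very different 'awaiting validation' totals under the two readings discussed in the thread.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum, auto


class WuState(Enum):
    """Hypothetical workunit lifecycle, following the order described in the thread."""
    IN_PROGRESS = auto()        # at least one result is still out with a host
    READY_TO_VALIDATE = auto()  # all results returned, validator has not run yet
    VALIDATED = auto()          # canonical result chosen, waiting for assimilation
    ASSIMILATED = auto()        # copied to the science database, waiting for purge


@dataclass
class Workunit:
    state: WuState
    returned: int      # results already returned by hosts
    outstanding: int   # results still being crunched


def status_counts(workunits, count_unassimilated_as_awaiting_validation):
    """Build status-page style counters from a list of toy workunits.

    The boolean flag selects between the two hypotheses in the thread:
    True  -> results of validated-but-unassimilated WUs are shown as
             'awaiting validation' (Ville's theory)
    False -> those results are not shown in that counter
    """
    counts = Counter()
    for wu in workunits:
        if wu.state is WuState.IN_PROGRESS:
            counts["results_in_progress"] += wu.outstanding
            counts["results_awaiting_validation"] += wu.returned
        elif wu.state is WuState.READY_TO_VALIDATE:
            counts["workunits_awaiting_validation"] += 1
            counts["results_awaiting_validation"] += wu.returned
        elif wu.state is WuState.VALIDATED:
            counts["workunits_awaiting_assimilation"] += 1
            if count_unassimilated_as_awaiting_validation:
                counts["results_awaiting_validation"] += wu.returned
        elif wu.state is WuState.ASSIMILATED:
            counts["results_awaiting_purge"] += wu.returned
    return counts


# Made-up sample: a small normal backlog plus a stalled assimilator.
sample = (
    [Workunit(WuState.IN_PROGRESS, returned=1, outstanding=1) for _ in range(4)]
    + [Workunit(WuState.VALIDATED, returned=2, outstanding=0) for _ in range(3)]
    + [Workunit(WuState.ASSIMILATED, returned=2, outstanding=0)]
)

with_theory = status_counts(sample, count_unassimilated_as_awaiting_validation=True)
without_theory = status_counts(sample, count_unassimilated_as_awaiting_validation=False)

print(with_theory["results_awaiting_validation"])     # 10
print(without_theory["results_awaiting_validation"])  # 4
```

In this toy sample the validator itself has no backlog at all ('workunits awaiting validation' is zero in both cases), yet the 'results awaiting validation' counter more than doubles once the stalled assimilation queue is folded into it. That is the shape of the discrepancy being argued about; the sketch cannot say which reading matches the real status page.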
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.