The Server Issues / Outages Thread - Panic Mode On! (118)
| Author | Message |
|---|---|
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
Well, of course it isn't working yet - he hasn't tried it yet! He wrote to me about 10 pm - four hours ago. He'll be asleep now, and when he wakes up, he said he "will see if I can figure out a transitioner trick" - possibly from scratch. Give him time - we wouldn't want it to go wrong, as his last idea seems to have done! |
|
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"The 'results awaiting validation' and 'workunits awaiting validation' figures are also unambiguous, but they are unusual - why are they so different?" Results awaiting validation are all results that have been returned but not validated yet. In a normal situation, almost all of those are results waiting for wingmen who are still crunching. Workunits awaiting validation are WUs that are ready to be validated (that is, all the results have been returned) but have not been validated yet, and in a normal situation this is close to zero. So the result and workunit fields measure different things. I'm almost sure that the results corresponding to 'workunits waiting for assimilation' are also counted in 'results waiting for validation'. So the assimilation problem is the direct cause of the huge apparent validation backlog. |
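To make the difference between the two counters concrete, here is a minimal sketch of how they could be computed from the same set of results. The `Result` record and its fields are hypothetical stand-ins for illustration, not the project's actual schema or the code behind the Server Status Page.

```python
from dataclasses import dataclass


@dataclass
class Result:
    wu_id: int        # workunit this result belongs to (hypothetical field)
    returned: bool    # has the host reported the result yet?
    validated: bool   # has the validator accepted it yet?


def results_awaiting_validation(results):
    """Count every result that has been returned but not validated."""
    return sum(1 for r in results if r.returned and not r.validated)


def workunits_awaiting_validation(results):
    """Count workunits whose results are all back but not yet validated."""
    by_wu = {}
    for r in results:
        by_wu.setdefault(r.wu_id, []).append(r)
    return sum(
        1
        for rs in by_wu.values()
        if all(r.returned for r in rs) and not any(r.validated for r in rs)
    )


sample = [
    # WU 1: one result back, wingman still crunching
    Result(1, returned=True, validated=False),
    Result(1, returned=False, validated=False),
    # WU 2: both results back, validator has not run yet
    Result(2, returned=True, validated=False),
    Result(2, returned=True, validated=False),
    # WU 3: both results back and validated
    Result(3, returned=True, validated=True),
    Result(3, returned=True, validated=True),
]

n_results = results_awaiting_validation(sample)
n_workunits = workunits_awaiting_validation(sample)
```

In this toy data, three results are waiting (one from WU 1, two from WU 2) but only one workunit (WU 2) is actually ready for the validator, which is why the two figures can diverge so widely.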
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
"I'm almost sure that the results corresponding to 'workunits waiting for assimilation' are also counted in 'results waiting for validation'. So the assimilation problem is the direct cause of the huge apparent validation backlog." Assimilation takes place after validation - only the canonical result is assimilated into the science database. It would be curious if the 12 million figure was 'results waiting for validation, plus results which have already been validated'. Stranger things have happened in database design, but it's not the intuitive position. Personally, I'm still returning an excessive number of overnight 'immediate overflow' BLC35 tasks, plus the faster running but normal Arecibo tasks - I think that's the driver for the increasing numbers. |
|
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 14020 Credit: 208,696,464 RAC: 304
|
"So the assimilation problem is the direct cause of the huge apparent validation backlog." Nope. The WUs have to be Validated before they can be Assimilated (moved to the Science database). Once they are Assimilated, then they can be deleted. However, once deleted, they aren't actually gone until they are purged from the database (hence you see Validated WUs in your Task list for 24hrs, when things are working anyway). We've got a problem with Validation, hence that huge (and growing) "Results returned and awaiting validation" backlog, and there is also a problem with the Assimilators: once things have been Validated, they aren't getting Assimilated, hence the huge & growing "Workunits waiting for assimilation" backlog. It appears it all boils down to disk I/O (Input/Output), or a lack thereof. Edit: and this is all in addition to the not so occasional upload & occasional download issues we've been having for some time now. Grant Darwin NT |
|
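The order Grant describes (validate, then assimilate, then delete, then purge) can be sketched as a simple pipeline in which a stalled stage causes everything upstream of it to pile up. The stage names and the `tick` function below are illustrative assumptions, not the server's real state names or daemons.

```python
from enum import Enum


class Stage(Enum):
    # Hypothetical stage names, in the order described in the post above
    AWAITING_VALIDATION = 1
    AWAITING_ASSIMILATION = 2
    AWAITING_DELETION = 3
    AWAITING_PURGE = 4
    PURGED = 5


ORDER = list(Stage)  # Enum iteration follows definition order


def tick(counts, throughput):
    """Move up to throughput[stage] workunits out of each stage into the next.

    Stages are processed from the end of the pipeline backwards so that a
    workunit advances by at most one stage per tick.
    """
    new = dict(counts)
    for i in range(len(ORDER) - 2, -1, -1):
        stage, nxt = ORDER[i], ORDER[i + 1]
        moved = min(new[stage], throughput.get(stage, 0))
        new[stage] -= moved
        new[nxt] += moved
    return new


counts = {stage: 0 for stage in Stage}
counts[Stage.AWAITING_VALIDATION] = 10

# Assumed scenario: validation still works, the assimilator is stalled
throughput = {
    Stage.AWAITING_VALIDATION: 5,
    Stage.AWAITING_ASSIMILATION: 0,
    Stage.AWAITING_DELETION: 5,
    Stage.AWAITING_PURGE: 5,
}

for _ in range(3):
    counts = tick(counts, throughput)
```

With the assimilator throughput set to zero, all ten workunits end up parked in `AWAITING_ASSIMILATION` and nothing ever reaches deletion or purging, which mirrors the growing "Workunits waiting for assimilation" figure.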
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"Assimilation takes place after validation - only the canonical result is assimilated into the science database. It would be curious if the 12 million figure was 'results waiting for validation, plus results which have already been validated'. Stranger things have happened in database design, but it's not the intuitive position." It's not a database design thing, but a Server Status Page design thing. There is no field for results waiting for assimilation, so it must either list those results under some other state (waiting for validation or waiting for purging) or keep them hidden. The total of all the shown result counts has hovered very near 20 million, which is the number Eric said they are targeting by splitter throttling, and that makes the hidden alternative unlikely. Results waiting for db purging is smaller than workunits waiting for assimilation, so that alternative can be eliminated too. The only one left is results waiting for validation. Those result rows must exist because they aren't deleted before the result files are deleted. And those files must exist because the data in them is needed for assimilation. Also, I believe database integrity requires that all the result rows exist as long as the workunit row exists, not just the canonical result. I have never seen on the web page a workunit that is missing results (fewer results than the 'initial replication' number). My validated results appear on the web page pretty soon after my hosts have returned them, so there doesn't seem to be any significant validation backlog. But the count of results waiting for validation is enormous despite this. If the results waiting for assimilation were counted in it too, then this would perfectly explain the discrepancy. |
Kissagogo27 Send message Joined: 6 Nov 99 Posts: 717 Credit: 8,032,827 RAC: 62
|
better in PM ^^ hop edited |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
The assimilator only looks at the workunit table: does this WU have a canonical result? If yes, it's ready for assimilation, and should be counted in the (current) 3 million total of 'ready' WUs. Having decided it's ready, the assimilator has to load the result row from that bloated result table (to find out what the filename is - depends on which result became canonical), and then retrieve the result data file from disk. I expect both of those operations are slower than normal. |
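The two steps described above (pick workunits that have a canonical result, then follow that result row to its file) can be sketched roughly as follows. The dictionary keys, the in-memory "tables", and the `read_file` callback are assumptions made for illustration; they are not taken from the actual assimilator source.

```python
def ready_for_assimilation(workunits):
    """Select workunits that have a canonical result and are not yet assimilated.

    Only the workunit table is consulted at this point.
    """
    return [
        wu
        for wu in workunits
        if wu["canonical_resultid"] is not None and not wu["assimilated"]
    ]


def assimilate(wu, result_table, read_file):
    """Load the canonical result row, then fetch its data file.

    These are the two lookups the post expects to be slow: one against the
    (large) result table, one against the disk.
    """
    row = result_table[wu["canonical_resultid"]]
    return read_file(row["filename"])


# Toy stand-ins for the workunit table, result table, and upload directory
workunits = [
    {"id": 1, "canonical_resultid": 11, "assimilated": False},
    {"id": 2, "canonical_resultid": None, "assimilated": False},  # not validated yet
    {"id": 3, "canonical_resultid": 31, "assimilated": True},     # already done
]
result_table = {
    11: {"filename": "wu1_r0"},
    12: {"filename": "wu1_r1"},
    31: {"filename": "wu3_r1"},
}
files = {"wu1_r0": b"signals for wu 1", "wu3_r1": b"signals for wu 3"}

ready = ready_for_assimilation(workunits)
payloads = [assimilate(wu, result_table, files.__getitem__) for wu in ready]
```

Only workunit 1 qualifies here; the file name has to come from the result row because, as the post notes, it depends on which result became canonical.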
Siran d'Vel'nahr Send message Joined: 23 May 99 Posts: 7381 Credit: 44,181,323 RAC: 238
|
"The Validation procedure has been changed within the past day, now you don't receive an Inconclusive result if your result is not received first. Now the First result received is Automatically Validated and any following results that don't match are ruled Invalid. Look at these results, do you see any third result, or Inconclusive result?" Hi TBar, Please read this whole post before responding. :) This is what I see... nothing. I don't sit here every minute of every day analyzing each and every WU my hosts do like it seems you, Grant, Richard H. and others seem to do. I do not see what you guys are talking about. I went to the computer you linked to and I saw inconclusive, validated and error WUs. I did not see any invalid WUs. I did not go to the previous or next pages. Are you saying that some computers are getting nothing but valid WUs and others are getting nothing but invalid WUs? The way I understand the inconclusive status is that if 2 hosts send in results on the same WU and they don't match, the WU is given an inconclusive status for both hosts and the WU gets sent out to another host. When that result comes in if it matches one of the other results, they get a valid status and the 3rd gets invalid. Correct? If this is the way it "used to work" then it seems you are saying that the validation process has been changed. Correct? My question now is how can the first WU result get a valid status if it is NOT compared to another result? How is it getting validated? Or is it just automatically getting the valid status when it crosses the finish line first and all others are invalid? If the 3rd question (ok 3 questions, not one ;) ) is correct, then I understand why you guys are saying it is not real science. Perhaps I just figured out what you guys are talking about. ;) In that case, is it best to go NNT and wait for the validation process to go back to what it should be, if it ever does? Have a great day! 
:) Siran CAPT Siran d'Vel'nahr - L L & P _\\// Winders 11 OS? "What a piece of junk!" - L. Skywalker "Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath |
Freewill Send message Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693
|
Quorum of 1 Bad Example. Here's a failure of the Approach. The PC with the validated result has 461 invalids, but returned ahead of my PC by 7 seconds! It has no stderr output, so what did it return? https://setiathome.berkeley.edu/workunit.php?wuid=3860444977 Curious Example. Saw this from my overnight. Both my PC and the Apple had the same number of Autocorr and Pulses found, with similar peaks and times (rounding error diffs?). The Apple got credit because it was returned first. Not clear which one is correct, or whether both are. Track record similar in numbers; mine is better in percentage of valid tasks. https://setiathome.berkeley.edu/workunit.php?wuid=3859724613 I don't understand how the first PC above was allowed to have a quorum of 1. In general, this seems to be introducing questionable results into the science database. More importantly, it may potentially reject an ET result.
|
|
AllgoodGuy Send message Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266
|
I got one of these too. It also has one returned first with no data marked valid, mine rejected as invalid with data, but returned after the other. Edit: Just noticed that both of the WUs involved an SoG WU which was returned first, with the empty WU result. https://setiathome.berkeley.edu/workunit.php?wuid=3860988255 |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3870 Credit: 1,114,826,392 RAC: 3,319
|
|
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22
|
Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew! Thanks for the very good information. I'll jump back in the pool now that it is safe. Now allowing New Tasks! (in small amounts to help the db) |
|
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
|
Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew! How ironic that the machines producing nothing but trash are still being sent work while my two machines which produce Valid results have been mostly out of work for hours. https://setiathome.berkeley.edu/results.php?hostid=7941469&offset=100 At least they won't be receiving False valids anymore. |
|
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
|
"The way I understand the inconclusive status is that if 2 hosts send in results on the same WU and they don't match, the WU is given an inconclusive status for both hosts and the WU gets sent out to another host. When that result comes in if it matches one of the other results, they get a valid status and the 3rd gets invalid. Correct?" Almost but not quite correct. The first comparison, the one that makes it inconclusive, is much stricter than the final comparison that detects who is right. Most inconclusives have only small differences, and when the third result is returned and is similar to both of the initial results, all three become Valid. Only in a minority of inconclusives does someone get condemned Invalid. "My question now is how can the first WU result get a valid status if it is NOT compared to another result?" They changed minimum quorum to 1 in a desperate attempt to relieve server load, so the first result returned is automatically assumed Valid. If the second result doesn't match it, it becomes Invalid regardless of who was right. A bit later they changed initial replication to 1 too, so that the task is never sent to a wingman, so a bad wingman can't make your good results Invalid any more. |
|
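The behaviour described above can be illustrated with a toy model in which each result is reduced to a single number. This is only a sketch under stated assumptions: the real validator compares lists of detected signals, and the tolerances, the function name `validate`, and the choice of the looser tolerance for the quorum-of-1 case are all invented for illustration.

```python
def validate(values, min_quorum=2, strict=0.01, loose=0.1):
    """Return one status per result, in the order the results were returned.

    Toy model only: handles at most three results per workunit.
    """
    n = len(values)
    if n > 3:
        raise ValueError("this sketch handles at most three results")
    if n < min_quorum:
        return ["pending"] * n

    if min_quorum == 1:
        # First result returned is assumed valid; later ones are judged
        # against it, regardless of which host was actually right.
        first = values[0]
        return ["valid"] + [
            "valid" if abs(v - first) <= loose else "invalid" for v in values[1:]
        ]

    if n == 2:
        # Initial comparison is the strict one
        a, b = values
        agree = abs(a - b) <= strict
        return ["valid", "valid"] if agree else ["inconclusive", "inconclusive"]

    # Third result acts as tie-breaker under the looser comparison
    a, b, c = values
    ok_a = abs(a - c) <= loose
    ok_b = abs(b - c) <= loose
    if not (ok_a or ok_b):
        return ["inconclusive"] * 3
    return [
        "valid" if ok_a else "invalid",
        "valid" if ok_b else "invalid",
        "valid",
    ]


small_diff = validate([1.00, 1.05])             # fails the strict check
tie_break = validate([1.00, 1.05, 1.02])        # third result close to both
real_error = validate([1.00, 5.00, 1.02])       # second host was wrong
quorum_one = validate([5.00, 1.00], min_quorum=1)
```

The `quorum_one` case shows the problem raised in the thread: the first value is accepted and the second is marked invalid purely because of arrival order.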
Kevin Olley Send message Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572
|
Dr. Korpela has confirmed that normal two-quorum validation has been restored. Whew! Allow new tasks has now been set. Kevin
|
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3870 Credit: 1,114,826,392 RAC: 3,319
|
Work unit generation is off, and "Results returned and awaiting validation" has passed 13M, which has to be the highest ever. I am expecting any moment there will be a News blurb about low/no work for the next little while, but if no blurb comes I would still get the backup projects ready, i.e. disable NNT and set their task share to zero (works for most of them) so if the cache runs dry the machine will get minimal work from them. I predict a dry weekend, but it's for the integrity of the data and thus the very existence of the SETI@Home project, thus a worthy cause.
|
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628
|
Work unit generation is off, and "Results returned and awaiting validation" has passed 13M, which has to be the highest ever. I am expecting any moment there will be a News blurb about low/no work for the next little while, but if no blurb comes I would still get the backup projects ready, i.e. disable NNT and set their task share to zero (works for most of them) so if the cache runs dry the machine will get minimal work from them. I predict a dry weekend, but it's for the integrity of the data and thus the very existence of the SETI@Home project, thus a worthy cause. . . And you pre-empted my message but I will write it anyway ... {edit} . . Uploads - stalling . . Downloads - stalling . . Constant "no tasks available" {/edit} . . SETI is Boinked ... Stephen :( |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3870 Credit: 1,114,826,392 RAC: 3,319
|
|
Schatten Send message Joined: 12 Oct 02 Posts: 18 Credit: 14,047,388 RAC: 9
|
I got strange-looking WUs. Example: https://setiathome.berkeley.edu/workunit.php?wuid=3863823181 https://setiathome.berkeley.edu/workunit.php?wuid=3863820784 There are more, I think. Something isn't right. I thought it was fixed. :-/ |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
"created 31 Jan 2020, 17:15:55 UTC" I think Eric has swapped things back since then, but there'll still be a few in the system. |
©2026 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.