Multiple Validate Errors On WU
Author | Message |
---|---|
ohiomike Send message Joined: 14 Mar 04 Posts: 357 Credit: 650,069 RAC: 0 |
I just ran an interesting WU. http://setiathome.berkeley.edu/workunit.php?wuid=108268800 All three machines that have run it have gotten validate errors. Is there a reason why a WU could do this? Boinc Button Abuser In Training >My Shrubbers< |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
I just ran an interesting WU. http://setiathome.berkeley.edu/workunit.php?wuid=108268800 Suddenly I got two validate errors after several hours of work. The next two were ok. It isn't just me: there seems to be a hole in the algorithm that permits this, since other people using different versions of the SETI client also had validation errors after considerable crunching. Mildly dis-amused. May this Farce be with You |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14655 Credit: 200,643,578 RAC: 874 |
It tends to happen to a few WUs at either the beginning or the end of the unplanned server outages - the ones where a lost 'mount' causes file uploads to be rejected. There are always a few files being uploaded just as the drive mount fails, and I think it's those ones which cause the problem. The uploaded file doesn't get stored on the server disk (or it's truncated/damaged), but the cruncher thinks it's uploaded OK and tries to report it. That single glitch causes all three quorum members to fail validation. Not so much an algorithm problem, as a problem with old/tired/overworked servers - I know how they feel! I won't beat the drum about the obvious solution, but I expect you can work it out for yourselves. |
Kaj Christiansen Send message Joined: 13 Jul 99 Posts: 2 Credit: 31,402 RAC: 0 |
With validate errors on four of my last five results, I think I'll give the servers a break and crunch on other projects instead... |
Clyde C. Phillips, III Send message Joined: 2 Aug 00 Posts: 1851 Credit: 5,955,047 RAC: 0 |
I found another validate error on one of my results pages today and saw two others and a reissue when I looked at that workunit sheet. If something fails while it receives results, wouldn't the results have to be transferring to that receiver (presumably at Berkeley) at the same time? At any rate it's clear that these validate errors are a Berkeley error and we'll just have to tolerate them until somebody has the time to fix it. Fortunately they don't happen too often. Same thing with workunits calculated twice (doubling computation time). |
Sam Bartlett Send message Joined: 24 Nov 05 Posts: 7 Credit: 5,584,844 RAC: 35 |
I had 7 validate errors yesterday. Is there anything I can do about it? |
Brock Send message Joined: 19 Dec 06 Posts: 201 Credit: 774,488 RAC: 0 |
I've got several validate errors and now a few "compute errors" have shown up. I don't seem to have any problems when the servers all show green. |
[B^S] madmac Send message Joined: 9 Feb 04 Posts: 1175 Credit: 4,754,897 RAC: 0 |
|
W-K 666 Send message Joined: 18 May 99 Posts: 19135 Credit: 40,757,560 RAC: 67 |
I too have had a validator error, so can someone explain what is going on? AFAIK, at the start of a server problem, like there is now, some uploads fail to complete, but the host computer is under the impression that all is OK. So when the unit is reported, the server says the uploaded unit was garbage, and you get nil credits. |
Robert Nelson Send message Joined: 13 Aug 99 Posts: 43 Credit: 3,632,674 RAC: 1 |
I too have had a validator error, so can someone explain what is going on? It seems to be something that has happened with the most recent outages and is getting worse. I can't remember seeing validate errors in previous outages; now I have seen 6 in the last few days. They are appearing on multiple computers, so it's not local to here. Frustrating to lose that work, especially on some of the longer work units. |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Check my slug's results out. Nothing hurts more than seeing the poor thing crunch for 2 days and then have the results deemed invalid. Given her history, I'd say that something at CommandCentral has a loose nut and that it isn't my slug's fault. May this Farce be with You |
Pappa Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
PhonAcq, a quick look shows that it is running Chicken Good 1.3, not 2.0, which would help with time. Thank you for pointing to the results that did not validate; I will pass this on to Eric. Check my slug's results out. Nothing hurts more than seeing the poor thing crunch for 2 days and then have the results deemed invalid. Given her history, I'd say that something at CommandCentral has a loose nut and that it isn't my slug's fault. Please consider a Donation to the Seti Project. |
[B^S] madmac Send message Joined: 9 Feb 04 Posts: 1175 Credit: 4,754,897 RAC: 0 |
|
Fischer-Kerli Send message Joined: 12 Jul 03 Posts: 53 Credit: 35,690 RAC: 0 |
I see from my WU that had validate error, they have been sent out again. Now these results haven't been sent out again: the whole WU has been cancelled due to "Too many error results", six of them being validate errors. It's not only credits that get lost - if there is ever going to be a "Wow" signal, chances are rising it will vanish somewhere in the server mayhem. |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Does the system validate the WU's before sending them out? I've seen WU's having 3 or so invalid errors by 3 or so different computers, which are sent out to 3 or so new computers who return a successful result. (Hopefully that mouthful was clear.) Why would this happen? Does the validation process on incoming results have a problem? May this Farce be with You |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14655 Credit: 200,643,578 RAC: 874 |
Does the system validate the WU's before sending them out? In the sense we're talking about here, it doesn't and couldn't. 'Validation' is specifically a process of comparing the incoming results coming from three different crunchers (or whatever the quorum might be for the time being). My understanding is that the 'validation errors' we see are caused when the transitioner says 'three hosts have reported that they've uploaded a result for this WU - go and have a look at it' - and when the validator/assimilator does go to have a look, there are only two (complete) uploaded files there to be compared. It looks as if there's some sort of automatic procedure to turn the validators/assimilators off when they start detecting errors (to limit the damage), but of course that can't kick into action until the first few errors have happened. It would be better if the validators could put the WU back into the 'pending' queue, but that might need extra back-end programming - a resource which is in short supply at the moment. With regard to checking the WU's before they go out - no, they don't seem to, and just occasionally we see a batch of mal-formed ones. When that happens, the science app on your machine can't make any sense of it, and immediately errors out and moves on to the next WU. A bit of wasted downloading, but no other serious side-effect. |
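The behaviour described above can be illustrated with a minimal Python sketch. This is not the project's validator code: the function name `validate_workunit`, the status strings, the quorum of three, and the `requeue_on_missing` option are all assumptions made for illustration. It shows how one missing or empty upload can fail every result in the quorum, and how the outcome would differ if the WU were put back into the 'pending' queue instead.

```python
from collections import Counter
from typing import Dict, Optional

QUORUM = 3  # assumed quorum size, as mentioned in the thread


def validate_workunit(
    uploads: Dict[str, Optional[bytes]],
    requeue_on_missing: bool = False,
) -> Dict[str, str]:
    """Toy quorum validation (hypothetical, not the real back-end).

    `uploads` maps a host name to the file contents stored on the server,
    or None when the file never actually reached the disk.
    """
    # Not enough reported results yet: nothing to compare.
    if len(uploads) < QUORUM:
        return {host: "pending" for host in uploads}

    # A missing or empty file means the comparison cannot be carried out.
    if any(not data for data in uploads.values()):
        if requeue_on_missing:
            # The gentler alternative: wait and try again later.
            return {host: "pending" for host in uploads}
        # All-or-nothing behaviour: every quorum member is marked failed.
        return {host: "validate error" for host in uploads}

    # All files present: the most common content is taken as canonical.
    canonical, _ = Counter(uploads.values()).most_common(1)[0]
    return {
        host: "valid" if data == canonical else "invalid"
        for host, data in uploads.items()
    }
```

Calling `validate_workunit({"a": b"x", "b": b"x", "c": None})` marks all three hosts with "validate error", while passing `requeue_on_missing=True` leaves them "pending", which is the improvement suggested in the post.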
Martin P. Send message Joined: 19 May 99 Posts: 294 Credit: 27,230,961 RAC: 2 |
I just ran an interesting WU. http://setiathome.berkeley.edu/workunit.php?wuid=108268800 Same here. There seem to be 2 different sorts of problems: 1. WUs that claim the exact same amount as the others but check out as "Invalid": WU 108863837 2. WUs where ALL users receive validate errors: WU 108860384 What's up? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14655 Credit: 200,643,578 RAC: 874 |
1. The third result was received during a 'healthy' phase. The validator ran, checked all three out, and granted credit. The fourth result was received during a 'sick' phase. The validator ran a second time to check the late-comer against the first three - hit the current problem and resulted in an error. 2. The third result was received during a 'sick' phase. The validator ran, and at least one of the three result files was broken in some way. Because it was trying to validate all three against each other, the whole process failed, and all three got the 'error' marking. Interesting that your example (1) is so recent - something must have trickled through the fog to trigger the validation attempt. A reason for suspending network activity until you see things are working again, perhaps? |
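The two cases can be replayed with a small Python simulation. It is only a sketch under stated assumptions: the `replay` function, the host names, and the rule that a file is intact exactly when it was uploaded during a 'healthy' phase are all hypothetical simplifications, not the real server logic.

```python
from typing import Dict, List, Tuple

QUORUM = 3  # assumed quorum size


def replay(arrivals: List[Tuple[str, bool]]) -> Dict[str, str]:
    """Replay uploads in order; each entry is (host, server_was_healthy).

    Assumption: an upload is stored intact only during a healthy phase.
    The validator runs once when the quorum is reached, and once more
    for each late-comer.
    """
    intact: Dict[str, bool] = {}
    status: Dict[str, str] = {}
    for host, healthy in arrivals:
        intact[host] = healthy
        status[host] = "pending"
        if len(status) < QUORUM:
            continue  # quorum not reached, no validation yet
        if len(status) == QUORUM:
            batch = list(status)  # first pass: all quorum members together
        else:
            batch = [host]  # later pass: only the late-comer is judged
        ok = all(intact[h] for h in batch)
        for h in batch:
            status[h] = "valid" if ok else "validate error"
    return status


# Case 1: three healthy uploads, then a late-comer during a sick phase.
case1 = replay([("h1", True), ("h2", True), ("h3", True), ("h4", False)])

# Case 2: the third upload arrives during a sick phase.
case2 = replay([("h1", True), ("h2", True), ("h3", False)])
```

In this toy model `case1` leaves the first three hosts "valid" and only `h4` with a "validate error", while in `case2` all three hosts end up with a "validate error".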
Martin P. Send message Joined: 19 May 99 Posts: 294 Credit: 27,230,961 RAC: 2 |
1. The third result was received during a 'healthy' phase. The validator ran, checked all three out, and granted credit. The fourth result was received during a 'sick' phase. The validator ran a second time to check the late-comer against the first three - hit the current problem and resulted in an error. Richard, thanks for that explanation. The annoying part is that the staff asks very loudly for donations all over the web-site but does not spend even one word on the current problems. On Einstein@Home it takes only a few hours, or at most until 8 a.m. the next morning, until they explain what's going on and how long the problems will last. I did donate in the past but will not anymore until they realize that we are "customers" and deserve at least a minimum amount of information within a reasonable time frame. |
peristalsis Send message Joined: 23 Jul 99 Posts: 154 Credit: 28,610,163 RAC: 51 |
Dredging up an older post. Had two WUs that were "client errors". Checked on them, and one had three machines with client errors and the other had four machines. 10mr00aa.6833.2194.1015886.3.253 06no03aa.28336.11362.167344.3.207_1 Both fairly recent..p |
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.