Multiple Validate Errors On WU

Message boards : Number crunching : Multiple Validate Errors On WU
ohiomike
Joined: 14 Mar 04
Posts: 357
Credit: 650,069
RAC: 0
United States
Message 499951 - Posted: 9 Jan 2007, 10:33:37 UTC

I just ran an interesting WU. http://setiathome.berkeley.edu/workunit.php?wuid=108268800
All three machines that have run it have gotten validate errors. Is there a reason why a WU could do this?

Boinc Button Abuser In Training >My Shrubbers<
ID: 499951
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 500023 - Posted: 9 Jan 2007, 13:40:25 UTC - in response to Message 499951.  

I just ran an interesting WU. http://setiathome.berkeley.edu/workunit.php?wuid=108268800
All three machines that have run it have gotten validate errors. Is there a reason why a WU could do this?


Suddenly I got two validate errors after several hours of work. The next two were OK. It isn't just me: other people using different versions of the SETI client also had validation errors after considerable crunching, so there seems to be a hole in the algorithm that permits this.

Mildly dis-amused.
May this Farce be with You
ID: 500023
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 500029 - Posted: 9 Jan 2007, 13:53:45 UTC

It tends to happen to a few WUs at either the beginning or the end of the unplanned server outages - the ones where a lost 'mount' causes file uploads to be rejected.

There are always a few files being uploaded just as the drive mount fails, and I think it's those ones which cause the problem. The uploaded file doesn't get stored on the server disk (or it's truncated/damaged), but the cruncher thinks it's uploaded OK and tries to report it. That single glitch causes all three quorum members to fail validation.

Not so much an algorithm problem, as a problem with old/tired/overworked servers - I know how they feel! I won't beat the drum about the obvious solution, but I expect you can work it out for yourselves.
ID: 500029
Kaj Christiansen
Joined: 13 Jul 99
Posts: 2
Credit: 31,402
RAC: 0
Denmark
Message 500055 - Posted: 9 Jan 2007, 15:07:02 UTC - in response to Message 500029.  

With validate errors on four of my last five results, I think I'll give the servers a break and crunch on other projects instead...
ID: 500055
Clyde C. Phillips, III
Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 500077 - Posted: 9 Jan 2007, 16:03:17 UTC

I found another validate error on one of my results pages today and saw two others and a reissue when I looked at that workunit sheet. If something fails while it receives results, wouldn't the results have to be transferring to that receiver (presumably at Berkeley) at the same time? At any rate, it's clear that these validate errors are a Berkeley error and we'll just have to tolerate them until somebody has the time to fix it. Fortunately they don't happen too often. Same thing with workunits calculated twice (doubling computation time).
ID: 500077
Sam Bartlett
Joined: 24 Nov 05
Posts: 7
Credit: 5,584,844
RAC: 35
Tunisia
Message 501498 - Posted: 12 Jan 2007, 12:14:23 UTC

I had 7 validate errors yesterday. Is there anything I can do about it?
ID: 501498
Brock
Joined: 19 Dec 06
Posts: 201
Credit: 774,488
RAC: 0
United States
Message 501562 - Posted: 12 Jan 2007, 15:03:00 UTC

I've got several validate errors and now a few "compute errors" have shown up. I don't seem to have any problems when the servers all show green.
ID: 501562
[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 502059 - Posted: 13 Jan 2007, 10:17:52 UTC

I too have had a validator error, so can someone explain what is going on?
ID: 502059
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19091
Credit: 40,757,560
RAC: 67
United Kingdom
Message 502071 - Posted: 13 Jan 2007, 11:21:03 UTC - in response to Message 502059.  

I too have had a validator error, so can someone explain what is going on?

AFAIK, at the start of a server problem, like there is now, some uploads fail to complete, but the host computer is under the impression that all is OK. So when the unit is reported, the server says the uploaded unit was garbage, and you get nil credits.


ID: 502071
Robert Nelson
Volunteer tester
Joined: 13 Aug 99
Posts: 43
Credit: 3,632,674
RAC: 1
United States
Message 502104 - Posted: 13 Jan 2007, 13:17:57 UTC - in response to Message 502071.  

I too have had a validator error, so can someone explain what is going on?

AFAIK, at the start of a server problem, like there is now, some uploads fail to complete, but the host computer is under the impression that all is OK. So when the unit is reported, the server says the uploaded unit was garbage, and you get nil credits.


It seems to be something that has happened with the most recent outages and is getting worse. I can't remember seeing validate errors in previous outages; now I have seen 6 in the last few days. They are appearing on multiple computers, so it's not local to here. Frustrating to lose that work, especially on some of the longer work units.
ID: 502104
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 502462 - Posted: 14 Jan 2007, 0:56:18 UTC

Check my slug's results out. Nothing hurts more than seeing the poor thing crunch for 2 days and then have the results deemed invalid. Given her history, I'd say that something at CommandCentral has a loose nut and that it isn't my slug's fault.
May this Farce be with You
ID: 502462
Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 502533 - Posted: 14 Jan 2007, 3:40:14 UTC - in response to Message 502462.  

PhonAcq,

A quick look shows that it is running Chicken Good 1.3, not 2.0, which would help with time. Thank you for pointing to the results that did not validate.

I will pass this on to Eric.

Check my slug's results out. Nothing hurts more than seeing the poor thing crunch for 2 days and then have the results deemed invalid. Given her history, I'd say that something at CommandCentral has a loose nut and that it isn't my slug's fault.



Please consider a Donation to the Seti Project.

ID: 502533
[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 502619 - Posted: 14 Jan 2007, 10:44:12 UTC

I see that my WUs that had validate errors have been sent out again. Why can't it be like RCN, where we get the credit when they know it is their servers' fault?
ID: 502619
Fischer-Kerli
Volunteer tester
Joined: 12 Jul 03
Posts: 53
Credit: 35,690
RAC: 0
Germany
Message 502664 - Posted: 14 Jan 2007, 12:57:03 UTC - in response to Message 502619.  

I see that my WUs that had validate errors have been sent out again.


Now these results haven't been sent out again: the whole WU has been cancelled due to "Too many error results", six of them being validate errors. It's not only credits that get lost - if there is ever going to be a "Wow" signal, chances are rising it will vanish somewhere in the server mayhem.
ID: 502664
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 502693 - Posted: 14 Jan 2007, 14:10:00 UTC

Does the system validate the WU's before sending them out?

I've seen WU's having 3 or so invalid errors by 3 or so different computers, which are sent out to 3 or so new computers who return a successful result. (Hopefully that mouthful was clear.) Why would this happen? Does the validation process on incoming results have a problem?
May this Farce be with You
ID: 502693
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 502709 - Posted: 14 Jan 2007, 14:38:06 UTC - in response to Message 502693.  

Does the system validate the WU's before sending them out?

I've seen WU's having 3 or so invalid errors by 3 or so different computers, which are sent out to 3 or so new computers who return a successful result. (Hopefully that mouthful was clear.) Why would this happen? Does the validation process on incoming results have a problem?

In the sense we're talking about here, it doesn't and couldn't. 'Validation' is specifically a process of comparing the results coming in from three different crunchers (or whatever the quorum might be for the time being).

My understanding is that the 'validation errors' we see are caused when the transitioner says 'three hosts have reported that they've uploaded a result for this WU - go and have a look at it' - and when the validator/assimilator does go to have a look, there are only two (complete) uploaded files there to be compared. It looks as if there's some sort of automatic procedure to turn the validators/assimilators off when they start detecting errors (to limit the damage), but of course that can't kick into action until the first few errors have happened.

It would be better if the validators could put the WU back into the 'pending' queue, but that might need extra back-end programming - a resource which is in short supply at the moment.
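
A minimal sketch of the behaviour described above, offered as a toy model rather than the real back-end code. The names `Result`, `validate` and `requeue_on_missing` are hypothetical, and comparing every payload against the first one is a simplification assumed only for illustration. A payload of `None` stands for an upload that was reported but never reached the server disk intact.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    host: str
    payload: Optional[bytes]  # None models a reported upload that is missing or damaged


def validate(results, quorum=3, requeue_on_missing=False):
    """Return a dict mapping host -> outcome for one workunit (hypothetical model)."""
    if len(results) < quorum:
        # Not enough reported results yet, so nothing can be compared.
        return {r.host: "pending" for r in results}
    if any(r.payload is None for r in results):
        # Behaviour seen in this thread: one bad file fails the whole quorum.
        # Suggested alternative: put the workunit back in the pending queue.
        outcome = "pending" if requeue_on_missing else "validate error"
        return {r.host: outcome for r in results}
    # Simplified comparison: treat the first payload as the reference.
    reference = results[0].payload
    return {r.host: "valid" if r.payload == reference else "invalid" for r in results}


reported = [Result("a", b"x"), Result("b", b"x"), Result("c", None)]
print(validate(reported))
print(validate(reported, requeue_on_missing=True))
```

With the default setting, all three hosts come back as "validate error" even though two of the files are fine; with `requeue_on_missing=True` the same input leaves all three as "pending" instead.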

With regard to checking the WU's before they go out - no, they don't seem to, and just occasionally we see a batch of mal-formed ones. When that happens, the science app on your machine can't make any sense of it, and immediately errors out and moves on to the next WU. A bit of wasted downloading, but no other serious side-effect.
ID: 502709
Martin P.
Joined: 19 May 99
Posts: 294
Credit: 27,230,961
RAC: 2
Austria
Message 502730 - Posted: 14 Jan 2007, 15:18:42 UTC - in response to Message 499951.  
Last modified: 14 Jan 2007, 15:21:24 UTC

I just ran an interesting WU. http://setiathome.berkeley.edu/workunit.php?wuid=108268800
All three machines that have run it have gotten validate errors. Is there a reason why a WU could do this?


Same here. There seem to be 2 different sorts of problems:
1. WUs that claim the exact same amount as the others but check out as "Invalid": WU 108863837
2. WUs where ALL users receive validate errors: WU 108860384

What's up?

ID: 502730
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 502736 - Posted: 14 Jan 2007, 15:30:36 UTC

1. The third result was received during a 'healthy' phase. The validator ran, checked all three out, and granted credit. The fourth result was received during a 'sick' phase. The validator ran a second time to check the late-comer against the first three - hit the current problem and resulted in an error.

2. The third result was received during a 'sick' phase. The validator ran, and (at least one of the three) result files was broken in some way. Because it was trying to validate all three against each other, the whole process failed, and all three got the 'error' marking.

Interesting that your example (1) is so recent - something must have trickled through the fog to trigger the validation attempt. A reason for suspending network activity until you see things are working again, perhaps?
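
The two cases can be sketched as a toy model, assuming (as the explanation above does) that results already credited keep their status and that every result checked in a given validator pass shares the fate of that pass. The function name `run_validator`, the host labels and the `storage_healthy` flag are hypothetical.

```python
def run_validator(already_credited, newcomers, storage_healthy):
    """Toy model: return host -> status after one validator pass."""
    # Results credited in an earlier pass are left alone.
    status = {host: "valid" for host in already_credited}
    # Results checked in this pass succeed or fail together.
    for host in newcomers:
        status[host] = "valid" if storage_healthy else "validate error"
    return status


# Case 1: quorum of three validated while healthy, fourth result arrives while sick.
first_pass = run_validator([], ["h1", "h2", "h3"], storage_healthy=True)
case_1 = run_validator(list(first_pass), ["h4"], storage_healthy=False)

# Case 2: the third result arrives while sick, so the first pass itself fails.
case_2 = run_validator([], ["h1", "h2", "h3"], storage_healthy=False)

print(case_1)
print(case_2)
```

In case 1 only the late-comer `h4` ends up with a validate error; in case 2 all three hosts do.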
ID: 502736
Martin P.
Joined: 19 May 99
Posts: 294
Credit: 27,230,961
RAC: 2
Austria
Message 502764 - Posted: 14 Jan 2007, 16:24:04 UTC - in response to Message 502736.  

1. The third result was received during a 'healthy' phase. The validator ran, checked all three out, and granted credit. The fourth result was received during a 'sick' phase. The validator ran a second time to check the late-comer against the first three - hit the current problem and resulted in an error.

2. The third result was received during a 'sick' phase. The validator ran, and (at least one of the three) result files was broken in some way. Because it was trying to validate all three against each other, the whole process failed, and all three got the 'error' marking.

Interesting that your example (1) is so recent - something must have trickled through the fog to trigger the validation attempt. A reason for suspending network activity until you see things are working again, perhaps?


Richard,

Thanks for that explanation. The annoying part is that the staff ask very loudly for donations all over the web site but do not say even one word about the current problems. On Einstein@Home it takes only a few hours, or at most until 8 a.m. the next morning, until they explain what's going on and how long the problems will last.

I did donate in the past but will not anymore until they realize that we are "customers" and deserve at least a minimum amount of information within a reasonable time frame.

ID: 502764
peristalsis
Joined: 23 Jul 99
Posts: 154
Credit: 28,610,163
RAC: 51
United States
Message 508414 - Posted: 25 Jan 2007, 14:07:01 UTC - in response to Message 502764.  

Dredging up an older post.
Had two WUs that were "client errors".
Checked on them, and one had three machines with client errors and the other had four machines.
10mr00aa.6833.2194.1015886.3.253
06no03aa.28336.11362.167344.3.207_1
Both fairly recent.
ID: 508414
 