Because of the 'validate errors'

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 614765 - Posted: 4 Aug 2007, 13:54:22 UTC
Last modified: 4 Aug 2007, 14:05:05 UTC



Because of the 'validate errors'

Crunch3r posted that since the server software update at Berkeley, 'report results immediately' sometimes causes 'validate errors'.
I run BOINC V5.10.7 and connect every 0.001 days, which is about 90 seconds (so results are reported roughly 90 seconds after the upload).
And sometimes I get 'validate errors'.
I think it's because of this:


8/4/2007 3:37:41 PM|SETI@home|Computation for task 29mr00ab.25614.1169.804838.3.10_3 finished
8/4/2007 3:37:41 PM|SETI@home|Starting 19jn00aa.11827.12834.417318.3.137_2
8/4/2007 3:37:41 PM|SETI@home|Starting task 19jn00aa.11827.12834.417318.3.137_2 using setiathome_enhanced version 515
8/4/2007 3:37:41 PM|SETI@home|Sending scheduler request: To fetch work
8/4/2007 3:37:41 PM|SETI@home|Requesting 2363 seconds of new work
8/4/2007 3:37:43 PM|SETI@home|[file_xfer] Started upload of file 29mr00ab.25614.1169.804838.3.10_3_0
8/4/2007 3:37:47 PM|SETI@home|Scheduler RPC succeeded [server version 511]
8/4/2007 3:37:47 PM|SETI@home|Deferring communication for 11 sec
8/4/2007 3:37:47 PM|SETI@home|Reason: requested by project
8/4/2007 3:37:49 PM|SETI@home|[file_xfer] Started download of file 29mr00ab.25614.5729.548578.3.113
[b]8/4/2007 3:37:50 PM|SETI@home|[file_xfer] Finished upload of file 29mr00ab.25614.1169.804838.3.10_3_0[/b]
8/4/2007 3:37:50 PM|SETI@home|[file_xfer] Throughput 8315 bytes/sec
8/4/2007 3:37:54 PM|SETI@home|[file_xfer] Finished download of file 29mr00ab.25614.5729.548578.3.113
8/4/2007 3:37:54 PM|SETI@home|[file_xfer] Throughput 77539 bytes/sec
8/4/2007 3:38:02 PM|SETI@home|Sending scheduler request: To fetch work
[b]8/4/2007 3:38:02 PM|SETI@home|Requesting 874 seconds of new work, and [color=red]reporting 1 completed tasks[/color][/b]
8/4/2007 3:38:12 PM|SETI@home|Scheduler RPC succeeded [server version 511]
8/4/2007 3:38:12 PM|SETI@home|Deferring communication for 11 sec
8/4/2007 3:38:12 PM|SETI@home|Reason: requested by project
8/4/2007 3:38:14 PM|SETI@home|[file_xfer] Started download of file 13mr00aa.16419.10962.842330.3.48
8/4/2007 3:38:20 PM|SETI@home|[file_xfer] Finished download of file 13mr00aa.16419.10962.842330.3.48
8/4/2007 3:38:20 PM|SETI@home|[file_xfer] Throughput 87819 bytes/sec



Here the gap is 12 seconds, and the result was reported fine.

BUT sometimes the gap is shorter, and then: 'validate error' :-(

Or not?


ID: 614765
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 614808 - Posted: 4 Aug 2007, 15:41:16 UTC - in response to Message 614765.  



Because of the 'validate errors'

Crunch3r posted that since the server software update at Berkeley, 'report results immediately' sometimes causes 'validate errors'.
I run BOINC V5.10.7 and connect every 0.001 days, which is about 90 seconds (so results are reported roughly 90 seconds after the upload).
And sometimes I get 'validate errors'.
...
Here the gap is 12 seconds, and the result was reported fine.

BUT sometimes the gap is shorter, and then: 'validate error' :-(

Or not?

In the Beta "Validate error?" thread Keith T reported a similar case with a 22 second delay which did cause a Validate error.

Because work fetch calculations are distinct from cpu usage calculations it's always possible for a request to the Scheduler to occur very soon after "completion" of an upload. Those situations should be fairly rare, though.

Perhaps the 0.001 day setting is not always enough, 0.002 would be even safer.
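For reference, the arithmetic behind those settings looks like this (a minimal Python sketch for illustration only; the only fact it relies on is 86,400 seconds per day):

# Convert the 'connect about every X days' preference into seconds,
# to compare against the 11-second scheduler deferral seen in the logs.
SECONDS_PER_DAY = 86400

def connect_interval_seconds(days: float) -> float:
    return days * SECONDS_PER_DAY

for days in (0.001, 0.002, 0.01):
    print(f"{days} days -> {connect_interval_seconds(days):.1f} s")
# 0.001 days -> 86.4 s   (the '~ 90 seconds' quoted above)
# 0.002 days -> 172.8 s
# 0.01 days  -> 864.0 s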
                                                               Joe
ID: 614808
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 614814 - Posted: 4 Aug 2007, 15:48:24 UTC

LOL...

Agreed, why go looking for trouble? I've used 0.01 days for a CI for ages and AFAIK have never had a result get invalidated for this reason, at least while the project wasn't having problems.

When you get right down to it, practically speaking, 15 minutes is as good as 80 seconds when it comes to 'immediately', especially if the alternative is losing a good result needlessly. ;-)

Alinator
ID: 614814
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 614829 - Posted: 4 Aug 2007, 16:09:57 UTC
Last modified: 4 Aug 2007, 16:43:02 UTC

During a discussion about 'Return Results Immediately' last year, I noticed (here) that there is a strong correlation between computation finishing on a WU, and a scheduler request for more work. This effect is quite separate from the CI: I'm sure it happens because of the re-calculation of the RDCF, and hence the total estimated crunch time of the WUs held in cache. If you've recently finished a slow WU like the dreaded 58.69, and are now working on more 'normal' WUs, then your RDCF will be decreased as each WU finishes: the total work buffer on hand will decrease pro-rata. There's a fair chance that this decrease will cross the cache size boundary, and so the request goes in for more work.
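A rough sketch of that mechanism, with made-up numbers (the function, the task estimates and the way the RDCF scales them are purely illustrative, not BOINC's actual internals):

# How a falling Result Duration Correction Factor (RDCF) can drop the
# estimated work on hand below the cache target and trigger a work fetch.
def estimated_buffer_seconds(raw_estimates, rdcf):
    # each task's raw time estimate is scaled by the host's RDCF
    return sum(raw_estimates) * rdcf

cache_target = 0.001 * 86400      # ~86 s of work wanted in the cache
queue = [60.0, 45.0]              # raw estimates (seconds) of tasks on hand

print(estimated_buffer_seconds(queue, rdcf=0.9))  # 94.5 s, above target: no fetch
# a 'normal' WU finishes and the RDCF is revised downward at that moment...
print(estimated_buffer_seconds(queue, rdcf=0.8))  # 84.0 s, below target: fetch work

The point is just that the fetch decision is re-evaluated at the very moment a task finishes, which is also the moment its upload starts.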

Usually this happens before the upload has completed, and so the report of the just-finished WU has to wait until the next scheduler contact (the point of the discussion last year). However, now that the server back-off is only 11 seconds, sometimes two work requests happen in quick succession, and the second one can carry with it the report of the just-completed WU. This is exactly the situation shown in Sutaru Tsureku's opening post in this thread.

[Edit - the second 'work-fetch' after 11 seconds is much more likely with recent clients, because of the server-abort of redundant WUs. You ask for more work - you end up with less because of an abort instruction - so you ask again. Like Oliver Twist: more! more!]

There was some support last year for a client-enforced 'cooling off period' after a WU completes, to allow that upload to complete in an orderly fashion before the next scheduler contact. I think I've observed something like that happening in recent clients when WUs error out, but not when they finish normally.
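In client terms that might look something like the sketch below (purely hypothetical Python; the grace period, the field name and the helper are invented for illustration, and nothing like this is known to exist in the client):

# Hypothetical 'cooling off' rule: only report results whose upload
# finished at least GRACE seconds ago, so the file is safely on the
# server before the scheduler hears about it.
import time

GRACE = 60  # illustrative value, in seconds

def results_safe_to_report(finished_results, now=None):
    # finished_results: list of dicts with an 'upload_finished' timestamp
    now = now if now is not None else time.time()
    return [r for r in finished_results
            if now - r["upload_finished"] >= GRACE]

Anything still inside the grace period would simply ride along with the next scheduler contact.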

Is it worth re-visiting this suggestion?
ID: 614829
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 614836 - Posted: 4 Aug 2007, 16:19:04 UTC

I think the best solution would be for the project to set a longer Scheduler delay.

Rosetta uses a 4-minute delay; would it be possible to set a similar value on the SETI and SETI Beta schedulers?
Sir Arthur C Clarke 1917-2008
ID: 614836
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 614839 - Posted: 4 Aug 2007, 16:22:23 UTC - in response to Message 614836.  

I think the best solution would be for the project to set a longer Scheduler delay.

Rosetta uses a 4-minute delay; would it be possible to set a similar value on the SETI and SETI Beta schedulers?

When we were discussing this last year, the delay was 10-minutes-plus-a-bit. It then dropped without warning (or, as far as I can remember, any explanation) to the current 11 seconds.

Maybe 4 minutes would be a happy medium....
ID: 614839
KB7RZF
Volunteer tester
Joined: 15 Aug 99
Posts: 9549
Credit: 3,308,926
RAC: 2
United States
Message 615203 - Posted: 5 Aug 2007, 7:49:43 UTC

Just to throw my 2 cents worth in, as I just saw this thread:

I left the CI at 0 and have 'Maintain enough work for an additional' set at 0.1. All last month I ran nothing but SETI, and I never, ever had a result go bad. Call me lucky, I guess? Dunno, but I have yet to have a problem with it set like this.

Jeremy
ID: 615203
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 615210 - Posted: 5 Aug 2007, 8:23:08 UTC
Last modified: 5 Aug 2007, 8:55:00 UTC



I have now set it to connect every 0.002 days.. (~ 180 seconds)


BUT what about the people with dual- or quad-core CPUs?
..or the people with a 'V8'?

Here is an example..
2 results finished and uploaded.
A third result finished uploading only after the report went out..
BUT if this third result had finished its upload ~ 10 seconds earlier, it would have been reported just 3 seconds later..
AND then it would almost certainly be a 'validate error'.. Or not? :-(

SO, how could we do it better? ;-)




8/5/2007 10:04:00 AM|SETI@home|Computation for task 29mr00ab.25614.4656.665884.3.235_0 finished
8/5/2007 10:04:00 AM|SETI@home|Starting 19jn00aa.11827.19456.484658.3.204_2
8/5/2007 10:04:00 AM|SETI@home|Starting task 19jn00aa.11827.19456.484658.3.204_2 using setiathome_enhanced version 515
8/5/2007 10:04:02 AM|SETI@home|[file_xfer] Started upload of file 29mr00ab.25614.4656.665884.3.235_0_0
8/5/2007 10:04:10 AM|SETI@home|[file_xfer] Finished upload of file 29mr00ab.25614.4656.665884.3.235_0_0
8/5/2007 10:04:10 AM|SETI@home|[file_xfer] Throughput 8571 bytes/sec
8/5/2007 10:04:12 AM|SETI@home|Computation for task 19jn00aa.11827.19456.484658.3.204_2 finished
8/5/2007 10:04:12 AM|SETI@home|Starting 29mr00ab.25614.4656.665884.3.198_0
8/5/2007 10:04:12 AM|SETI@home|Starting task 29mr00ab.25614.4656.665884.3.198_0 using setiathome_enhanced version 515
8/5/2007 10:04:14 AM|SETI@home|[file_xfer] Started upload of file 19jn00aa.11827.19456.484658.3.204_2_0
8/5/2007 10:04:19 AM|SETI@home|[file_xfer] Finished upload of file 19jn00aa.11827.19456.484658.3.204_2_0
8/5/2007 10:04:19 AM|SETI@home|[file_xfer] Throughput 9858 bytes/sec
8/5/2007 10:06:57 AM|SETI@home|Computation for task 29mr00ab.25614.7121.304816.3.3_1 finished
8/5/2007 10:06:57 AM|SETI@home|Starting 29mr00ab.25614.4656.665884.3.237_1
8/5/2007 10:06:57 AM|SETI@home|Starting task 29mr00ab.25614.4656.665884.3.237_1 using setiathome_enhanced version 515
8/5/2007 10:07:00 AM|SETI@home|[file_xfer] Started upload of file 29mr00ab.25614.7121.304816.3.3_1_0
8/5/2007 10:07:05 AM|SETI@home|Sending scheduler request: To report completed tasks

8/5/2007 10:07:05 AM|SETI@home|Reporting 2 tasks
8/5/2007 10:07:12 AM|SETI@home|[file_xfer] Finished upload of file 29mr00ab.25614.7121.304816.3.3_1_0
8/5/2007 10:07:12 AM|SETI@home|[file_xfer] Throughput 2641 bytes/sec
8/5/2007 10:07:15 AM|SETI@home|Scheduler RPC succeeded [server version 511]
8/5/2007 10:07:15 AM|SETI@home|Deferring communication for 11 sec
8/5/2007 10:07:15 AM|SETI@home|Reason: requested by project
8/5/2007 10:10:07 AM|SETI@home|Sending scheduler request: To report completed tasks
8/5/2007 10:10:07 AM|SETI@home|Reporting 1 tasks
8/5/2007 10:10:17 AM|SETI@home|Scheduler RPC succeeded [server version 511]
8/5/2007 10:10:17 AM|SETI@home|Deferring communication for 11 sec
8/5/2007 10:10:17 AM|SETI@home|Reason: requested by project
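For what it's worth, the gap can be read straight off those log lines (a quick throwaway Python check, not part of BOINC; the timestamps are copied from the excerpt above):

# Gap between the second upload finishing and the report RPC going out.
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"
upload_done = datetime.strptime("8/5/2007 10:04:19 AM", FMT)
report_sent = datetime.strptime("8/5/2007 10:07:05 AM", FMT)
print((report_sent - upload_done).total_seconds())   # 166.0 s, safely reported

# The third upload finished at 10:07:12, i.e. 7 s AFTER that RPC, so it had
# to wait for the 10:10:07 contact instead; a few seconds' difference decides
# whether a result is reported before its file has really reached the server.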



ID: 615210
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 615333 - Posted: 5 Aug 2007, 16:54:41 UTC



I was a little curious why my results got 'validate errors',
so I took a little time, looked at my results still available online and in 'stdoutdae.txt', and saw this:

An example, the same for all 3 available results:

2007-08-03 04:00:43 [SETI@home] [file_xfer] Started upload of file 20jn00aa.3173.11457.542316.3.176_1_0
2007-08-03 04:00:50 [SETI@home] [error] Error on file upload: no command
2007-08-03 04:00:50 [SETI@home] [file_xfer] Permanently failed upload of 20jn00aa.3173.11457.542316.3.176_1_0
2007-08-03 04:00:50 [SETI@home] Giving up on upload of 20jn00aa.3173.11457.542316.3.176_1_0: server rejected file


So it's a server problem and not a client problem, or not?


ID: 615333
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 615370 - Posted: 5 Aug 2007, 18:26:52 UTC - in response to Message 615333.  



I was a little curious why my results got 'validate errors',
so I took a little time, looked at my results still available online and in 'stdoutdae.txt', and saw this:

An example, the same for all 3 available results:

2007-08-03 04:00:43 [SETI@home] [file_xfer] Started upload of file 20jn00aa.3173.11457.542316.3.176_1_0
2007-08-03 04:00:50 [SETI@home] [error] Error on file upload: no command
2007-08-03 04:00:50 [SETI@home] [file_xfer] Permanently failed upload of 20jn00aa.3173.11457.542316.3.176_1_0
2007-08-03 04:00:50 [SETI@home] Giving up on upload of 20jn00aa.3173.11457.542316.3.176_1_0: server rejected file


So it's a server problem and not a client problem, or not?

The other possibility is garbled communication. The upload uses two POSTs, in the first one the "command" is <get_file_size> and in the second it's <file_upload>. If neither is found, that gives the "no command" error.
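A toy illustration of that check (Python, and not the real file_upload_handler code; just the logic as described above):

# The handler looks for one of two commands in the POSTed XML and
# answers "no command" if it finds neither, e.g. when the request
# arrives garbled or truncated.
def classify_upload_post(body: str) -> str:
    if "<get_file_size>" in body:
        return "size query"
    if "<file_upload>" in body:
        return "file upload"
    return "Error on file upload: no command"

print(classify_upload_post("<get_file_size>result_file_0</get_file_size>"))
print(classify_upload_post(""))   # garbled or empty request -> 'no command'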
                                                                 Joe
ID: 615370
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 615433 - Posted: 5 Aug 2007, 21:08:34 UTC - in response to Message 615370.  
Last modified: 5 Aug 2007, 21:08:47 UTC

The other possibility is garbled communication. The upload uses two POSTs, in the first one the "command" is <get_file_size> and in the second it's <file_upload>. If neither is found, that gives the "no command" error.
                                                                 Joe


And what could we do so that this doesn't happen?


ID: 615433
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 615554 - Posted: 6 Aug 2007, 4:20:44 UTC - in response to Message 615433.  

The other possibility is garbled communication. The upload uses two POSTs, in the first one the "command" is <get_file_size> and in the second it's <file_upload>. If neither is found, that gives the "no command" error.
                                                                 Joe


And what could we do so that this doesn't happen?


The first step in solving every problem is diagnosing it.

I was going to look at your computers to see if there is anything obvious, but they're hidden.
ID: 615554
W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 19062
Credit: 40,757,560
RAC: 67
United Kingdom
Message 615596 - Posted: 6 Aug 2007, 5:54:31 UTC - in response to Message 615554.  

The other possibility is garbled communication. The upload uses two POSTs, in the first one the "command" is <get_file_size> and in the second it's <file_upload>. If neither is found, that gives the "no command" error.
                                                                 Joe


And what could we do so that this doesn't happen?


The first step in solving every problem is diagnosing it.

I was going to look at your computers to see if there is anything obvious, but they're hidden.

But from one of the validate errors you posted, resultid=582062180, I think you might try a bit less overclocking, and/or check stability with Prime95 or similar.

Andy
ID: 615596
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 615915 - Posted: 6 Aug 2007, 21:18:43 UTC - in response to Message 615596.  
Last modified: 6 Aug 2007, 21:21:04 UTC

The other possibility is garbled communication. The upload uses two POSTs, in the first one the "command" is <get_file_size> and in the second it's <file_upload>. If neither is found, that gives the "no command" error.
                                                                 Joe


And what could we do so that this doesn't happen?


The first step in solving every problem is diagnosing it.

I was going to look at your computers to see if there is anything obvious, but they're hidden.

But from one of the validate errors you posted, resultid=582062180, I think you might try a bit less overclocking, and/or check stability with Prime95 or similar.

Andy



No.. no.. the OC is O.K. .. :-)

The three available results are 'server rejected file' errors!

I posted them here.

Isn't SETI@home a good test program to check whether it's stable? ;-)
But Prime95, what is that?
Is it another BOINC project?
Or is it now called PrimeGrid?

I ran memtest86+ V1.70 and it passed.


ID: 615915
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 615920 - Posted: 6 Aug 2007, 21:22:51 UTC - in response to Message 615915.  

No.. no.. the OC is O.K. .. :-)

Isn't SETI@home a good test program to check whether it's stable? ;-)
But Prime95, what is that?
Is it another BOINC project?
Or is it now called PrimeGrid?

I ran memtest86+ V1.70 and it passed.


To verify that SETI is not the problem, another CPU stress tester is always a good idea to cross-verify results.

Prime95 is a different, stand-alone application that stresses the CPU just like SETI@Home does. I think they have a BOINC project, but that wouldn't remove BOINC as a possible point of failure so it's best to use the stand-alone program.

If you get errors with Prime95 too, then there's a good chance your overclock is too aggressive.
ID: 615920
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 615926 - Posted: 6 Aug 2007, 21:32:29 UTC - in response to Message 615920.  

To verify that SETI is not the problem, another CPU stress tester is always a good idea to cross-verify results.

Prime95 is a different, stand-alone application that stresses the CPU just like SETI@Home does. I think they have a BOINC project, but that wouldn't remove BOINC as a possible point of failure so it's best to use the stand-alone program.

If you get errors with Prime95 too, then there's a good chance your overclock is too aggressive.



I see it like this..

If I have a 'validate error', it's because of the server..
And if I have a 'client error', it's because of too much OC..

Do I see that right or wrong?


I OC'd the Intel Core 2 Extreme QX6700 from 2.66 to 3.17 GHz, so it's not that much..
You should ask msattler about his OC! ;-)


Where can I get Prime95?


ID: 615926
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 615931 - Posted: 6 Aug 2007, 21:37:29 UTC - in response to Message 615926.  

To verify that SETI is not the problem, another CPU stress tester is always a good idea to cross-verify results.

Prime95 is a different, stand-alone application that stresses the CPU just like SETI@Home does. I think they have a BOINC project, but that wouldn't remove BOINC as a possible point of failure so it's best to use the stand-alone program.

If you get errors with Prime95 too, then there's a good chance your overclock is too aggressive.



I see it like this..

If I have a 'validate error', it's because of the server..
And if I have a 'client error', it's because of too much OC..

Do I see that right or wrong?


I OC'd the Intel Core 2 Extreme QX6700 from 2.66 to 3.17 GHz, so it's not that much..
You should ask msattler about his OC! ;-)


Where can I get Prime95?


You seem overly focused on finding fault, and not focused at all on diagnosing and fixing the problem.

Overclocking is the process of getting more performance by reducing the "margins" -- getting closer to the 'edge' of the signal's rise and/or fall (moving away from solid, stable 1's and 0's toward 0.7's and 0.3's).

How much you can overclock depends on a lot of factors, not just the CPU.

We'd like to look at your computers if you'd like our help.
ID: 615931
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 615933 - Posted: 6 Aug 2007, 21:38:41 UTC
Last modified: 6 Aug 2007, 21:40:08 UTC

Well, I think it's safe to say that if it's a compute error, then in all likelihood it's due to the OC, especially if it goes away when you back off.

However, you cannot say the same thing about a validate error. It might be due to a server issue losing the output files for one reason or another. OTOH, it could just as easily be due to subtle calculation errors from the OC which don't generate a 'hard' error.

Alinator
ID: 615933
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 615934 - Posted: 6 Aug 2007, 21:40:52 UTC - in response to Message 615926.  

I see it like this..

If I have a 'validate error', it's because of the server..
And if I have a 'client error', it's because of too much OC..

Do I see that right or wrong?


Not necessarily. Always double check your work and cross reference your results.

I OC'd the Intel Core 2 Extreme QX6700 from 2.66 to 3.17 GHz, so it's not that much..
You should ask msattler about his OC! ;-)


Unless you're running the same setup MSattler is (including, most importantly, the same cooling setup he is), I don't think you can make a direct comparison.


Where can I get Prime95?


Here.
ID: 615934
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 615960 - Posted: 6 Aug 2007, 21:56:58 UTC
Last modified: 6 Aug 2007, 22:01:02 UTC



Thanks a lot for the help!

In future I'll look more into 'stdoutdae.txt', so I know whether it's a server problem or maybe an OC problem.

And maybe I'll run Prime95.

@ Ned Ludd
You're funny, your PCs are hidden too! ;-)


ID: 615960