Because of the 'validate errors'
Author | Message |
---|---|
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
About the 'validate errors': Crunch3r posted that, since the update of the server software at Berkeley, 'report results immediately' sometimes causes 'validate errors'. I run BOINC V5.10.7 and connect every 0.001 days, which is ~90 seconds (so a result is reported ~90 seconds after upload). And sometimes I get 'validate errors'. I think this is why:
8/4/2007 3:37:41 PM|SETI@home|Computation for task 29mr00ab.25614.1169.804838.3.10_3 finished
8/4/2007 3:37:41 PM|SETI@home|Starting 19jn00aa.11827.12834.417318.3.137_2
8/4/2007 3:37:41 PM|SETI@home|Starting task 19jn00aa.11827.12834.417318.3.137_2 using setiathome_enhanced version 515
8/4/2007 3:37:41 PM|SETI@home|Sending scheduler request: To fetch work
8/4/2007 3:37:41 PM|SETI@home|Requesting 2363 seconds of new work
8/4/2007 3:37:43 PM|SETI@home|[file_xfer] Started upload of file 29mr00ab.25614.1169.804838.3.10_3_0
8/4/2007 3:37:47 PM|SETI@home|Scheduler RPC succeeded [server version 511]
8/4/2007 3:37:47 PM|SETI@home|Deferring communication for 11 sec
8/4/2007 3:37:47 PM|SETI@home|Reason: requested by project
8/4/2007 3:37:49 PM|SETI@home|[file_xfer] Started download of file 29mr00ab.25614.5729.548578.3.113
8/4/2007 3:37:50 PM|SETI@home|[file_xfer] Finished upload of file 29mr00ab.25614.1169.804838.3.10_3_0
8/4/2007 3:37:50 PM|SETI@home|[file_xfer] Throughput 8315 bytes/sec
8/4/2007 3:37:54 PM|SETI@home|[file_xfer] Finished download of file 29mr00ab.25614.5729.548578.3.113
8/4/2007 3:37:54 PM|SETI@home|[file_xfer] Throughput 77539 bytes/sec
8/4/2007 3:38:02 PM|SETI@home|Sending scheduler request: To fetch work
8/4/2007 3:38:02 PM|SETI@home|Requesting 874 seconds of new work, and reporting 1 completed tasks
8/4/2007 3:38:12 PM|SETI@home|Scheduler RPC succeeded [server version 511]
8/4/2007 3:38:12 PM|SETI@home|Deferring communication for 11 sec
8/4/2007 3:38:12 PM|SETI@home|Reason: requested by project
8/4/2007 3:38:14 PM|SETI@home|[file_xfer] Started download of file 13mr00aa.16419.10962.842330.3.48
8/4/2007 3:38:20 PM|SETI@home|[file_xfer] Finished download of file 13mr00aa.16419.10962.842330.3.48
8/4/2007 3:38:20 PM|SETI@home|[file_xfer] Throughput 87819 bytes/sec
Here there are 12 seconds between the finished upload (3:37:50 PM) and the report (3:38:02 PM), and the result was reported fine. BUT sometimes the gap is shorter, and then: 'validate error' :-( Right? |
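The 12-second gap mentioned above can be measured directly from the message log. The following is only a rough sketch in Python, assuming the `timestamp|project|message` layout shown in the post and an English locale for the AM/PM marker; the function names `parse_line` and `upload_to_report_gaps` are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# Three lines copied from the log in the post above.
LOG = """\
8/4/2007 3:37:43 PM|SETI@home|[file_xfer] Started upload of file 29mr00ab.25614.1169.804838.3.10_3_0
8/4/2007 3:37:50 PM|SETI@home|[file_xfer] Finished upload of file 29mr00ab.25614.1169.804838.3.10_3_0
8/4/2007 3:38:02 PM|SETI@home|Requesting 874 seconds of new work, and reporting 1 completed tasks
"""

def parse_line(line):
    """Split 'timestamp|project|message' into (datetime, project, message)."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"), project, message

def upload_to_report_gaps(log_text):
    """Seconds between each finished upload and the next scheduler report."""
    gaps = []
    pending = []  # finish times of uploads that have not been reported yet
    for line in log_text.splitlines():
        if not line.strip():
            continue
        when, _project, message = parse_line(line)
        if "Finished upload of file" in message:
            pending.append(when)
        elif "reporting" in message.lower() and pending:
            gaps.extend((when - done).total_seconds() for done in pending)
            pending.clear()
    return gaps

gaps = upload_to_report_gaps(LOG)
print(gaps)  # [12.0]
```

Running this over a longer log would show how often the gap gets short, which is the situation suspected of producing the 'validate errors'.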
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
In the Beta "Validate error?" thread Keith T reported a similar case with a 22 second delay which did cause a Validate error. Because work fetch calculations are distinct from CPU usage calculations, it's always possible for a request to the Scheduler to occur very soon after "completion" of an upload. Those situations should be fairly rare, though. Perhaps the 0.001 day setting is not always enough; 0.002 would be even safer. Joe |
Alinator Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
LOL... Agreed, why go looking for trouble? I've used 0.01 days for a CI for ages and have never had a result get invalidated for this reason AFAIK, and the project wasn't having problems. When you get right down to it, practically speaking, 15 minutes is as good as 80 seconds when it comes to 'immediately', especially if the alternative is losing a good result needlessly. ;-) Alinator |
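The connect-interval values quoted in this thread are given in days, while the posts talk about them in seconds and minutes. A small Python conversion, offered only as a sanity check of the rounded figures (the helper name `connect_interval_seconds` is hypothetical):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def connect_interval_seconds(days):
    """Convert a 'connect every N days' setting to seconds."""
    return days * SECONDS_PER_DAY

# Settings mentioned in the thread.
for days in (0.001, 0.002, 0.01):
    seconds = connect_interval_seconds(days)
    print(f"{days} days = {seconds:.1f} s = {seconds / 60:.1f} min")
```

So 0.001 days is 86.4 seconds (the "~90 seconds" and "80 seconds" above), 0.002 days is 172.8 seconds, and 0.01 days is 864 seconds, or about 14.4 minutes.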
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
During a discussion about 'Return Results Immediately' last year, I noticed (here) that there is a strong correlation between computation finishing on a WU, and a scheduler request for more work. This effect is quite separate from the CI: I'm sure it happens because of the re-calculation of the RDCF, and hence the total estimated crunch time of the WUs held in cache. If you've recently finished a slow WU like the dreaded 58.69, and are now working on more 'normal' WUs, then your RDCF will be decreased as each WU finishes: the total work buffer on hand will decrease pro-rata. There's a fair chance that this decrease will cross the cache size boundary, and so the request goes in for more work. Usually this happens before the upload has completed, and so the report of the just-finished WU has to wait until the next scheduler contact (the point of the discussion last year). However, now that the server back-off is only 11 seconds, sometimes two work requests happen in quick succession, and the second one can carry with it the report of the just-completed WU. This is exactly the situation shown in Sutaru Tsureku's opening post in this thread. [Edit - the second 'work-fetch' after 11 seconds is much more likely with recent clients, because of the server-abort of redundant WUs. You ask for more work - you end up with less because of an abort instruction - so you ask again. Like Oliver Twist: more! more!] There was some support last year for a client-enforced 'cooling off period' after a WU completes, to allow that upload to complete in an orderly fashion before the next scheduler contact. I think I've observed something like that happening in recent clients when WUs error out, but not when they finish normally. Is it worth re-visiting this suggestion? |
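The RDCF effect described above can be illustrated with a toy model. This Python sketch is not the BOINC client's work-fetch code, which is considerably more involved; it only assumes that the estimated buffer is the sum of the uncorrected WU estimates multiplied by the RDCF, and that a work request goes out when that total drops below the cache size. All names and numbers are hypothetical.

```python
def estimated_buffer_seconds(raw_estimates, rdcf):
    """Estimated crunch time of the cached WUs, scaled by the RDCF (toy model)."""
    return sum(raw_estimates) * rdcf

def needs_work(raw_estimates, rdcf, cache_seconds):
    """True when the estimated buffer has fallen below the cache size."""
    return estimated_buffer_seconds(raw_estimates, rdcf) < cache_seconds

cached = [3000.0, 3000.0, 3000.0]  # hypothetical uncorrected estimates, in seconds
cache_seconds = 8640               # hypothetical cache size: 0.1 days

# Same WUs in cache, but the RDCF drops after a 'normal' WU finishes.
before = needs_work(cached, rdcf=1.0, cache_seconds=cache_seconds)  # 9000 s on hand
after = needs_work(cached, rdcf=0.9, cache_seconds=cache_seconds)   # about 8100 s on hand
print(before, after)  # False True
```

In this model nothing left the cache, yet the lower RDCF alone pushes the estimate under the cache boundary, so the work request coincides with the WU finishing, before its upload has completed.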
Keith T. Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9 |
|
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
I think the best solution would be for the project to set a longer Scheduler delay. When we were discussing this last year, the delay was 10-minutes-plus-a-bit. It then dropped without warning (or, as far as I can remember, any explanation) to the current 11 seconds. Maybe 4 minutes would be a happy medium.... |
KB7RZF Joined: 15 Aug 99 Posts: 9549 Credit: 3,308,926 RAC: 2 |
Just to throw my 2 cents worth in, as I just saw this thread: I left the CI at 0, and have Maintain enough work for an additional: set at .1, and all last month I ran nothing but SETI, and I never, ever had a result go bad. Call me lucky I guess? Dunno, but I have yet to have a problem with it set like this. Jeremy |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
I have now set it to connect every 0.002 days (~180 seconds). BUT what should people with dual- or quad-core CPUs do? Or the people with a 'V8'? Here is an example: 2 results finished and uploaded. A third result finished uploading after the report had been sent. BUT if this third result had finished its upload ~10 seconds earlier, it would have been reported 3 seconds later, AND then it would VERY surely be a 'validate error'. Right? :-( SO, how could we do it better? ;-)
8/5/2007 10:04:00 AM|SETI@home|Computation for task 29mr00ab.25614.4656.665884.3.235_0 finished
8/5/2007 10:04:00 AM|SETI@home|Starting 19jn00aa.11827.19456.484658.3.204_2
8/5/2007 10:04:00 AM|SETI@home|Starting task 19jn00aa.11827.19456.484658.3.204_2 using setiathome_enhanced version 515
8/5/2007 10:04:02 AM|SETI@home|[file_xfer] Started upload of file 29mr00ab.25614.4656.665884.3.235_0_0
8/5/2007 10:04:10 AM|SETI@home|[file_xfer] Finished upload of file 29mr00ab.25614.4656.665884.3.235_0_0
8/5/2007 10:04:10 AM|SETI@home|[file_xfer] Throughput 8571 bytes/sec
8/5/2007 10:04:12 AM|SETI@home|Computation for task 19jn00aa.11827.19456.484658.3.204_2 finished
8/5/2007 10:04:12 AM|SETI@home|Starting 29mr00ab.25614.4656.665884.3.198_0
8/5/2007 10:04:12 AM|SETI@home|Starting task 29mr00ab.25614.4656.665884.3.198_0 using setiathome_enhanced version 515
8/5/2007 10:04:14 AM|SETI@home|[file_xfer] Started upload of file 19jn00aa.11827.19456.484658.3.204_2_0
8/5/2007 10:04:19 AM|SETI@home|[file_xfer] Finished upload of file 19jn00aa.11827.19456.484658.3.204_2_0
8/5/2007 10:04:19 AM|SETI@home|[file_xfer] Throughput 9858 bytes/sec
8/5/2007 10:06:57 AM|SETI@home|Computation for task 29mr00ab.25614.7121.304816.3.3_1 finished
8/5/2007 10:06:57 AM|SETI@home|Starting 29mr00ab.25614.4656.665884.3.237_1
8/5/2007 10:06:57 AM|SETI@home|Starting task 29mr00ab.25614.4656.665884.3.237_1 using setiathome_enhanced version 515
8/5/2007 10:07:00 AM|SETI@home|[file_xfer] Started upload of file 29mr00ab.25614.7121.304816.3.3_1_0
8/5/2007 10:07:05 AM|SETI@home|Sending scheduler request: To report completed tasks
8/5/2007 10:07:05 AM|SETI@home|Reporting 2 tasks
8/5/2007 10:07:12 AM|SETI@home|[file_xfer] Finished upload of file 29mr00ab.25614.7121.304816.3.3_1_0
8/5/2007 10:07:12 AM|SETI@home|[file_xfer] Throughput 2641 bytes/sec
8/5/2007 10:07:15 AM|SETI@home|Scheduler RPC succeeded [server version 511]
8/5/2007 10:07:15 AM|SETI@home|Deferring communication for 11 sec
8/5/2007 10:07:15 AM|SETI@home|Reason: requested by project
8/5/2007 10:10:07 AM|SETI@home|Sending scheduler request: To report completed tasks
8/5/2007 10:10:07 AM|SETI@home|Reporting 1 tasks
8/5/2007 10:10:17 AM|SETI@home|Scheduler RPC succeeded [server version 511]
8/5/2007 10:10:17 AM|SETI@home|Deferring communication for 11 sec
8/5/2007 10:10:17 AM|SETI@home|Reason: requested by project |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
I was a little curious why my results got 'validate errors', so I took some time and looked at my results that are still available online and in 'stdoutdae.txt', and this is what I saw. One example, representative of all 3 available results:
2007-08-03 04:00:43 [SETI@home] [file_xfer] Started upload of file 20jn00aa.3173.11457.542316.3.176_1_0
2007-08-03 04:00:50 [SETI@home] [error] Error on file upload: no command
2007-08-03 04:00:50 [SETI@home] [file_xfer] Permanently failed upload of 20jn00aa.3173.11457.542316.3.176_1_0
2007-08-03 04:00:50 [SETI@home] Giving up on upload of 20jn00aa.3173.11457.542316.3.176_1_0: server rejected file
So it's a server problem and not a client problem, right? |
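Checking 'stdoutdae.txt' by hand works for three results, but the same search can be scripted. A minimal Python sketch, assuming the `date time [project] [file_xfer] Permanently failed upload of <file>` layout seen in the excerpt above; the regular expression and the function name `failed_uploads` are illustrative assumptions, not a BOINC tool.

```python
import re

# Lines copied from the excerpt in the post above.
SAMPLE = """\
2007-08-03 04:00:43 [SETI@home] [file_xfer] Started upload of file 20jn00aa.3173.11457.542316.3.176_1_0
2007-08-03 04:00:50 [SETI@home] [error] Error on file upload: no command
2007-08-03 04:00:50 [SETI@home] [file_xfer] Permanently failed upload of 20jn00aa.3173.11457.542316.3.176_1_0
2007-08-03 04:00:50 [SETI@home] Giving up on upload of 20jn00aa.3173.11457.542316.3.176_1_0: server rejected file
"""

# Assumed line layout: "<date> <time> [<project>] [file_xfer] Permanently failed upload of <file>"
FAILED = re.compile(r"^(\S+ \S+) \[(.+?)\] \[file_xfer\] Permanently failed upload of (\S+)$")

def failed_uploads(log_text):
    """Return (timestamp, file name) for every permanently failed upload."""
    found = []
    for line in log_text.splitlines():
        match = FAILED.match(line.strip())
        if match:
            found.append((match.group(1), match.group(3)))
    return found

failures = failed_uploads(SAMPLE)
print(failures)
```

To scan a real log, the text could be read with `open("stdoutdae.txt").read()` and passed to the same function; the file names returned can then be compared with the results that show 'validate errors' online.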
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
The other possibility is garbled communication. The upload uses two POSTs: in the first one the "command" is <get_file_size> and in the second it's <file_upload>. If neither is found, that gives the "no command" error. Joe |
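The check Joe describes can be pictured as follows. This Python sketch is not the actual BOINC upload handler; it only assumes that the handler looks for one of the two tags named above somewhere in the request body. The function names and the sample request bodies are hypothetical.

```python
# The two command tags named in the post above.
KNOWN_COMMANDS = ("get_file_size", "file_upload")

def find_upload_command(request_body):
    """Return the first known command tag found in the body, or None."""
    for tag in KNOWN_COMMANDS:
        if f"<{tag}>" in request_body:
            return tag
    return None

def handle(request_body):
    """Toy dispatcher: a request without a known command is rejected."""
    command = find_upload_command(request_body)
    if command is None:
        return "error: no command"
    return f"ok: {command}"

# Hypothetical request bodies, for illustration only.
print(handle("<get_file_size>example_file</get_file_size>"))  # ok: get_file_size
print(handle("<file_upload>...payload...</file_upload>"))     # ok: file_upload
print(handle("<fi\x00le_upl"))                                # error: no command
```

Under this picture, a request that arrives truncated or corrupted fails the tag search even though the client did nothing wrong, which is why garbled communication is a plausible cause alongside a server fault.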
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
And what could we do so that this doesn't happen? |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
The first step in solving every problem is diagnosing it. I was going to look at your computers to see if there is anything obvious, but they're hidden. |
W-K 666 Joined: 18 May 99 Posts: 19062 Credit: 40,757,560 RAC: 67 |
But judging from one of the validate errors you posted (resultid=582062180), I think you might try a bit less overclocking and/or check stability with Prime95 or similar. Andy |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
No.. no.. the OC is O.K. .. :-) The three available results are 'server rejected file' errors! I posted that here. Isn't SETI@home a good enough test program to check whether the system is stable? ;-) But what is Prime95? Is it another BOINC project? Or is it now called PrimeGrid? I ran memtest86+ V1.70 and it passed. |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
Sutaru Tsureku wrote: "No.. no.. the OC is O.K. .. :-)" To verify that SETI is not the problem, another CPU stress tester is always a good idea to cross-verify results. Prime95 is a different, stand-alone application that stresses the CPU just like SETI@Home does. I think they have a BOINC project, but that wouldn't remove BOINC as a possible point of failure so it's best to use the stand-alone program. If you get errors with Prime95 too, then there's a good chance your overclock is too aggressive. |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
OzzFan wrote: "To verify that SETI is not the problem, another CPU stress tester is always a good idea to cross-verify results." I saw it like this.. If I have a 'validate error', it's because of the server. And if I have a 'client error', it's because of too much OC. Did I see that right or wrong? I OC the Intel Core2 Extreme QX6700 from 2.66 to 3.17 GHz, so it's not so much.. You should ask msattler about his OC! ;-) Where I can get Prime95? |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
You seem overly focused on finding fault, and not focused at all on diagnosing and fixing the problem. Overclocking is the process of getting more performance by reducing the "margins" -- getting closer to the 'edge' of the signal's rise and/or fall (moving away from solid, stable 1's and 0's toward 0.7's and 0.3's). How much you can overclock depends on a lot of factors, not just the CPU. We'd like to look at your computers if you'd like our help. |
Alinator Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
Well, I think it's safe to say that if it's a compute error, then in all likelihood it's due to the OC, especially if it goes away when you back off. However, you cannot say the same thing about a validate error. It might be due to a server issue losing the output files for one reason or another. OTOH, it could just as easily be due to subtle calculation errors from the OC which don't generate a 'hard' error. Alinator |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
Sutaru Tsureku wrote: "I saw it like this.." Not necessarily. Always double check your work and cross reference your results. Sutaru Tsureku wrote: "I OC the Intel Core2 Extreme QX6700 from 2.66 to 3.17 GHz, so it's not so much.." Unless you're running the same setup MSattler is (including, most importantly, the same cooling setup he is), I don't think you can make a direct comparison. Sutaru Tsureku wrote: "Where I can get Prime95?" Here. |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Thanks a lot for the help! In future I'll look more often in 'stdoutdae.txt', so that I know whether it's a server problem or maybe an OC problem. And maybe I'll run Prime95. @ Ned Ludd: You are funny, your PCs are hidden too! ;-) |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.