Rash of Validate errors
Author | Message |
---|---|
Joined: 28 Nov 07 Posts: 5 Credit: 10,759,150 RAC: 0 |
I have been suffering a rash of validate errors the past few days. None of the WUs processed today succeeded, and several 'validate errors' remain on the results page from previous days, but not as many as had occurred. I have the same problem on my machine at work: about 80% of my results since 15 Feb give "validate error". At home it works like a charm... My guess would be a DNS problem causing our machines to send our results to the wrong server at SETI. Just a guess, though. Xtian. |
Eric Korpela Joined: 3 Apr 99 Posts: 1383 Credit: 54,506,847 RAC: 60 |
Hmmmm... This is a new one to me. The BOINC client is successfully connecting to the server and the transfer happens and completes without TCP errors, but what we receive of the communications is missing vital portions. I'll let Matt and Jeff know about it. It probably had something to do with the weekend server problems. Has anyone seen it since the 17th? Eric @SETIEric@qoto.org (Mastodon) |
Morris Joined: 11 Sep 01 Posts: 57 Credit: 9,077,302 RAC: 29 |
Not me, Eric. I had that problem late in the evening of the 17th, around midnight (CET). After that "no command" problem I couldn't even report the tasks; the client left them in a "ready to report" state. The next morning (18th) I could report the complete bunch of "no command" tasks, plus two additional tasks computed during the night. The two additional ones were granted credit, while the whole lot was marked as "Validate error". Since then, except for a few sporadic "HTTP Error" messages while downloading fresh WUs, I haven't experienced that issue anymore. m. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
Every case we've seen seems to have started on the 14th. For Natronomonas it continued until the 20th. That's the really unusual thing about this one: once a BOINC client gets into this state, it continues until the client is restarted - nothing to do with the servers at all, except possibly the initial glitch which triggers the whole sequence. |
Morris Joined: 11 Sep 01 Posts: 57 Credit: 9,077,302 RAC: 29 |
Sorry to contradict Richard, but at least in MY case there was no restart of the client. It happened on the evening of the 17th; the next morning, upload and reporting of WUs was fine, even if I still have some slight doubt about hibernating... The issue happened on my laptop, and I switched it off and on at least once (no power-off, no standby, simple and neat hibernate). Anyway, I went deep into the stdoutdae.txt file, and there's no sign of a client restart when I switched on again on the morning of the 18th. HTH M. [edited a bit later, to give some "meaning" to the looooooong sentence] |
connor Joined: 27 Jan 04 Posts: 1 Credit: 1,782,277 RAC: 0 |
Some logs that may help:

nohup.out from computer 4128879

22-Feb-2008 07:37:10 [SETI@home] Computation for task 28no06ad.11419.13569.9.7.189_0 finished
22-Feb-2008 07:37:10 [SETI@home] Resuming task 24ja07ac.17733.8661.13.7.34_0 using setiathome_enhanced version 527
22-Feb-2008 07:37:10 [SETI@home] Sending scheduler request: To fetch work. Requesting 4804 seconds of work, reporting 0 completed tasks
22-Feb-2008 07:37:12 [SETI@home] Started upload of 28no06ad.11419.13569.9.7.189_0_0
22-Feb-2008 07:37:14 [SETI@home] [error] Error on file upload: no command
22-Feb-2008 07:37:14 [SETI@home] Giving up on upload of 28no06ad.11419.13569.9.7.189_0_0: fatal upload error
22-Feb-2008 07:37:15 [SETI@home] Scheduler request succeeded: got 1 new tasks
22-Feb-2008 07:37:18 [SETI@home] Started download of 29no06af.21891.15614.15.7.96
22-Feb-2008 07:37:23 [SETI@home] Finished download of 29no06af.21891.15614.15.7.96
22-Feb-2008 08:15:22 [SETI@home] Computation for task 22fe07af.14819.71435.12.7.221_1 finished
22-Feb-2008 08:15:22 [SETI@home] Restarting task 25ja07aa.28174.890.16.7.115_1 using setiathome_enhanced version 527
22-Feb-2008 08:15:24 [SETI@home] Started upload of 22fe07af.14819.71435.12.7.221_1_0
22-Feb-2008 08:15:27 [SETI@home] [error] Error on file upload: no command
22-Feb-2008 08:15:27 [SETI@home] Giving up on upload of 22fe07af.14819.71435.12.7.221_1_0: fatal upload error
22-Feb-2008 10:32:30 [SETI@home] Sending scheduler request: To fetch work. Requesting 38 seconds of work, reporting 2 completed tasks
22-Feb-2008 10:33:03 [SETI@home] Scheduler request succeeded: got 0 new tasks

tcpdump from 8:15 error upload time:

08:15:24.670218 IP 10.0.1.77.42415 > setiboinc.SSL.Berkeley.EDU.http: F 14561:14561(0) ack 10610 win 6528 <nop,nop,timestamp 2427632632 479489437>
08:15:24.922164 IP 10.0.1.77.42409 > setiboinc.SSL.Berkeley.EDU.http: F 539:539(0) ack 564 win 1741 <nop,nop,timestamp 2427632884 476926604>
08:15:25.251108 IP 10.0.1.77.42416 > 208.68.240.18.http: F 251:251(0) ack 375570 win 32612 <nop,nop,timestamp 2427633213 1401792860>
08:15:25.412068 IP 10.0.1.77.42411 > setiboincdata.SSL.Berkeley.EDU.http: F 291:291(0) ack 287 win 1728 <nop,nop,timestamp 2427633374 725572490>
08:15:25.585326 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: S 2944937978:2944937978(0) win 5840 <mss 1460,sackOK,timestamp 2427633547 0,nop,wscale 2>
08:15:25.757000 arp who-has 10.0.1.1 tell 10.0.1.77
08:15:25.757282 arp reply 10.0.1.1 is-at 00:80:ad:8b:c5:80
08:15:25.798150 IP setiboinc.SSL.Berkeley.EDU.http > 10.0.1.77.42421: S 1955859346:1955859346(0) ack 2944937979 win 5696 <mss 1436,sackOK,timestamp 484072738 2427633547,nop,wscale 10>
08:15:25.798191 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: . ack 1 win 1460 <nop,nop,timestamp 2427633760 484072738>
08:15:25.798277 IP 10.0.1.2.netbios-dgm > 10.0.1.255.netbios-dgm: NBT UDP PACKET(138)
08:15:25.798396 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: P 1:254(253) ack 1 win 1460 <nop,nop,timestamp 2427633760 484072738>
08:15:26.013019 IP setiboinc.SSL.Berkeley.EDU.http > 10.0.1.77.42421: . ack 254 win 7 <nop,nop,timestamp 484072953 2427633760>
08:15:26.013053 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: P 254:541(287) ack 1 win 1460 <nop,nop,timestamp 2427633975 484072953>
08:15:26.227371 IP setiboinc.SSL.Berkeley.EDU.http > 10.0.1.77.42421: . ack 541 win 8 <nop,nop,timestamp 484073167 2427633975>
08:15:26.231217 IP setiboinc.SSL.Berkeley.EDU.http > 10.0.1.77.42421: P 1:563(562) ack 541 win 8 <nop,nop,timestamp 484073167 2427633975>
08:15:26.231253 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: . ack 563 win 1741 <nop,nop,timestamp 2427634193 484073167>
08:15:26.231265 IP setiboinc.SSL.Berkeley.EDU.http > 10.0.1.77.42421: F 563:563(0) ack 541 win 8 <nop,nop,timestamp 484073167 2427633975>
08:15:26.231687 IP 10.0.1.77.42422 > setiboincdata.SSL.Berkeley.EDU.http: S 2950039049:2950039049(0) win 5840 <mss 1460,sackOK,timestamp 2427634194 0,nop,wscale 2>
08:15:26.270918 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: . ack 564 win 1741 <nop,nop,timestamp 2427634233 484073167>
08:15:26.448472 IP setiboincdata.SSL.Berkeley.EDU.http > 10.0.1.77.42422: S 2982712386:2982712386(0) ack 2950039050 win 5696 <mss 1436,sackOK,timestamp 732718837 2427634194,nop,wscale 10>
08:15:26.448537 IP 10.0.1.77.42422 > setiboincdata.SSL.Berkeley.EDU.http: . ack 1 win 1460 <nop,nop,timestamp 2427634411 732718837>
08:15:26.448772 IP 10.0.1.77.42422 > setiboincdata.SSL.Berkeley.EDU.http: P 1:291(290) ack 1 win 1460 <nop,nop,timestamp 2427634411 732718837>
08:15:26.664038 IP setiboincdata.SSL.Berkeley.EDU.http > 10.0.1.77.42422: . ack 291 win 7 <nop,nop,timestamp 732719056 2427634411>
08:15:26.667079 IP setiboincdata.SSL.Berkeley.EDU.http > 10.0.1.77.42422: P 1:286(285) ack 291 win 7 <nop,nop,timestamp 732719058 2427634411>
08:15:26.667099 IP 10.0.1.77.42422 > setiboincdata.SSL.Berkeley.EDU.http: . ack 286 win 1728 <nop,nop,timestamp 2427634629 732719058>
08:15:26.668403 IP setiboincdata.SSL.Berkeley.EDU.http > 10.0.1.77.42422: F 286:286(0) ack 291 win 7 <nop,nop,timestamp 732719058 2427634411>
08:15:26.707835 IP 10.0.1.77.42422 > setiboincdata.SSL.Berkeley.EDU.http: . ack 287 win 1728 <nop,nop,timestamp 2427634670 732719058>

host is 10.0.1.77 DNS 10.0.1.1

Also it looks like whenever client uploading data by doing "Sending scheduler request: To fetch work. Requesting XXX seconds of work, reporting 2 completed tasks" some work gets through and when it is just reporting work, none gets validated. Anyway the pc hasn't yet restarted so feel free to request any more information. |
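For anyone wanting to check their own client log for the same pattern, here is a minimal Python sketch. It assumes the "date time [project] message" line shape seen in the log above; the function name `failed_uploads` and the regular expressions are illustrative only and are not part of BOINC. It pairs each "Giving up on upload" line with the most recent "Started upload" line for the same file.

```python
import re

# Assumed line shape, taken from the log quoted in this thread:
#   22-Feb-2008 07:37:12 [SETI@home] Started upload of <file>
LOG_LINE = re.compile(
    r"^(?P<stamp>\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] (?P<message>.*)$"
)
# Some client versions print "Started upload of file <name>", others omit "file".
STARTED = re.compile(r"Started upload of (?:file )?(?P<name>\S+)")
GAVE_UP = re.compile(r"Giving up on upload of (?P<name>[^:\s]+): (?P<reason>.+)")


def failed_uploads(lines):
    """Map each abandoned result file to (last start time, give-up time, reason)."""
    started = {}   # file name -> timestamp of the most recent upload attempt
    failures = {}  # file name -> (start time or None, give-up time, reason)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # not a client log line (blank, tcpdump output, etc.)
        message = match["message"]
        start = STARTED.search(message)
        if start:
            started[start["name"]] = match["stamp"]
            continue
        gave_up = GAVE_UP.search(message)
        if gave_up:
            name = gave_up["name"]
            failures[name] = (started.get(name), match["stamp"], gave_up["reason"])
    return failures


sample = [
    "22-Feb-2008 07:37:12 [SETI@home] Started upload of 28no06ad.11419.13569.9.7.189_0_0",
    "22-Feb-2008 07:37:14 [SETI@home] [error] Error on file upload: no command",
    "22-Feb-2008 07:37:14 [SETI@home] Giving up on upload of 28no06ad.11419.13569.9.7.189_0_0: fatal upload error",
]
for name, (began, ended, reason) in failed_uploads(sample).items():
    print(name, began, ended, reason)
```

Feeding it the whole of nohup.out or stdoutdae.txt (for example `failed_uploads(open("stdoutdae.txt"))`) should list every result that was abandoned, which can then be compared against the validate errors on the results page.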
Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
I noticed/posted my first 'no command' error at 27 May 2007 21:53:50 UTC. So what is the final result on this topic? Is it recommended to use a particular version of BOINC? To reboot the PC from time to time? That didn't help me much. I have a 'no command' - 'validate error' nearly every day. O.K., others have a bunch of these now.. BUT I have had the 'no command' problem since at least 27 May 2007 21:53:50 UTC. OR is it not on the user side, but a problem with the server at Berkeley? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
Sorry, Sutaru, but I'm out of my depth on this one! It sounded from the quote from David Anderson that there may be some changes to future versions of BOINC which make the problem less likely. But all of this started after BOINC v5.10.42 was released for testing - watch out for at least v5.10.43 (in Beta test first, then in full release). But it hasn't even been written yet - nothing to see yet. I hope Joe Segur / Eric Korpela / David Anderson can get some useful detail out of the TCPdump that Connor has posted - that's the most detailed analysis yet. Thanks, Connor. It's possible that some change to the server code could come from that, so the problems could be eliminated without everyone having to upgrade BOINC - but we'll have to wait and see. |
Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
O.K., Richard.. thanks a lot! Now we can only wait for the release of the new BOINC client.. Now we have 3 possible unhappy situations.. |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
I'm out of my depth too on the protocol stuff, but I do see some interesting things in Connor's tcpdump output: At 08:15:25.798396 the host sent a 253 character packet to setiboinc.SSL.Berkeley.EDU. The server acknowledged it but said its receive window was 7 characters. The host then sent a 287 character packet which got through anyhow and was acknowledged. The server then sent back a 562 character reply and that connection terminated. Note that setiboinc.SSL.Berkeley.EDU is the Scheduler, not the Upload handler (setiboincdata.SSL.Berkeley.EDU). When my hosts upload, they first send a 239 character packet with HTTP headers, but I'm using BOINC 5.2.12 and there have been some header changes since; I think that Connor's first 253 character packet size is probably those headers. The second packet with POST data tells the Upload handler what result file is going to be sent, it's like this except this stupid forum software will hide leading spaces: <data_server_request> <core_client_major_version>5</core_client_major_version> <core_client_minor_version>2</core_client_minor_version> <core_client_release>12</core_client_release> <get_file_size>29no06af.24239.19295.12.7.35_1_0</get_file_size> </data_server_request> That one of mine was 285 characters, but Connor is running 5.10.28 which would add one character in the <core_client_minor_version> field, and his result filename was one character longer, so 287 for his second packet makes sense. If those packets had been sent to the right server, I think the upload might have worked. The remainder of the tcpdump has the connection to the Upload handler which I assume ended with the "no command" response. I can't figure out what the packet sizes mean, they're not right for either the first or second phase of doing an upload. There's a lot of guesswork in that, I hope someone can do a capture with the actual packet contents. 
What I do with Windump is use -s2000 -w<filename> to capture the full packets and write them to a file, then stop Windump after it has captured enough. Then I invoke it again with -X -r<filename> and redirect the output to a text file. After that I can open the text file and see everything. I'm sure Wireshark would make it simpler. Joe |
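A side note on the "win 7" values in the capture, offered as one possible reading rather than a diagnosis. Both SYN packets in Connor's dump carry a `wscale` option (2 from the host, 10 from the server), and tcpdump appears to print the raw 16-bit window field. Under TCP window scaling, the window field on non-SYN segments is shifted left by the sender's scale factor, so "win 7" from the server would mean about 7 KB rather than 7 characters. The helper name `effective_window` below is hypothetical.

```python
def effective_window(raw_window: int, wscale: int) -> int:
    """Scale a raw TCP window field by the sender's window-scale shift.

    Applies to non-SYN segments only, and only when both ends offered the
    wscale option in their SYN packets (as both did in the capture above).
    """
    return raw_window << wscale


# Server side: SYN-ACK advertised wscale 10, later segments show win 7 and win 8.
print(effective_window(7, 10))     # 7168 bytes
print(effective_window(8, 10))     # 8192 bytes

# Host side: SYN advertised win 5840 with wscale 2, later segments show win 1460.
print(effective_window(1460, 2))   # 5840 bytes, matching the unscaled SYN window
```

The host-side numbers are a useful sanity check: 1460 shifted by 2 gives exactly the 5840 bytes advertised in the host's own SYN, which is consistent with the printed values being unscaled. If that reading is right, the 287 character packet fitted comfortably inside the server's window.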
david z Joined: 19 Nov 99 Posts: 2 Credit: 4,894,997 RAC: 0 |
I too have had a large number of validate errors. I normally don't check the pending page on a regular basis - work flows smoothly, and I use an 8 day cache to get me "over the humps", so I see little down time anymore. But recently on one of my Intel Mac laptops I noticed that my RAC had dropped several hundred in the past few days - validate errors! 91 of them, starting around 2/15. After reading the threads I could see the upload errors as previously discussed. I restarted BOINC and manually updated SETI to have it send in WUs to see if they would go through - they did not. Next step was to reinstall BOINC, have a Wii party and go to bed. This morning I checked the log and uploads are going through. Thanks to all of you who take the time to put information such as this in a forum where all others can get to it if needed. As pointed out before, as this appears to be MORE than a transient problem, it would be nice if Berkeley would be able to look into it. Thanks again guys! |
Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
@ Josef W. Segur: Have you never had a 'no command - validate error' with your old client? Why do you use this old client? Maybe it would now be better to use V5.2.x (V5.2.6 or later)! BTW, after looking at your PCs: you use SETI@home V5.15 Rev. 2.4, but it's reported as V6.0. So are the currently available opt. apps (Rev. 2.4V) from Crunch3r's homepage V6.0 compatible? |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
@ Josef W. Segur Never a "no command", but when Kryten was dropping mounts I had a couple of validate errors. I generally only connect once a day, and try to choose times when the servers appear to be working well. Why you use this old client? Rom Walton wouldn't build a libCurl compatible with my Win95 system, so I did. BOINC 5.2.12 was current at that time, so rather than taking time to build newer versions of libCurl for newer BOINCs I've stayed with that. Maybe now it would be better to use V5.2.x (V5.2.6 or later! I have thought about upgrading, but the few improvements which have been made to BOINC are far outweighed by excess garbage added. There were a couple of changes for 5.4.6 and 5.5.13 or so which were tempting. In any case I want to keep the same version on all my systems. BTW. Because I do SETI Beta too and am running an old BOINC, I have to add a section to the app_info.xml files here spoofing as 6.00. Any version of setiathome_enhanced will do the work correctly. Joe |
Odysseus Joined: 26 Jul 99 Posts: 1808 Credit: 6,701,347 RAC: 6 |
You use SETI@home V5.15 Rev. 2.4, but it's reported as V6.0 Not necessarily. You can't trust the reported version where the anonymous-platform mechanism is in use: in that case BOINC reports whatever the app_info.xml file tells it to. It's not uncommon for those of us who run the Beta project, and also run custom apps here, to include references to the version under testing there in our app_info here, as there's a bug in the client (fixed in more recent versions: 5.8.x? 5.10.x?), triggered by the apps from the two projects having the same name, that makes it trash S@h WUs because it suddenly decides it doesn't have a high enough version to crunch them. |
Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
... I have now tried BOINC V5.8.16 and after ~2 days got 3 'no command - validate errors'.. wuid=222497619 wuid=222497619 wuid=222497535 ..again with 'server rejected file'. Now I'm back online with BOINC V5.10.30 with Crunch3rs V6.1.0. When will a correct BOINC version be available..? ..maybe V5.10.43? |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
... That's totally unknown, the client problem has not yet been diagnosed. What should fix your problem is the February 20 changeset 14767, a server-side change in file_upload_handler.C. When that's deployed in the S@H Upload handler, your core client will no longer delete the result file. Instead it will try again to upload, with the usual backoffs for failed transfers. Joe |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
That's totally unknown, the client problem has not yet been diagnosed. For the reason you give in your first sentence, I wouldn't really call that changeset a 'fix'. Sure, it's a useful bit of sticking-plaster: it preserves the science, the credits, and the evidence for future debugging. But it does nothing to explain what went wrong in the first place, and why it tends to go on until the client is restarted (at least, that was the problem ten days ago, when this thread was started: Sutaru's report this morning sounds slightly different - but I notice that the tasks which errored today were issued on the first full day of the problem. Coincidence?). No, I'm afraid debugging this one is still a 'work in progress', and we still need volunteers to be on standby, Wireshark at the ready. It's just that the warning sign will be a "Rash of failed uploads", instead of a "Rash of Validate errors". |
Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
... Oops.. nobody noticed that I posted the same WU twice..? Here are the correct links: wuid=222497609 wuid=222497619 wuid=222497535 |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
That's totally unknown, the client problem has not yet been diagnosed. While offtopic for this thread, I was referring to Sutaru's long time problem with a few WUs per day getting a "no command" and Validate error. No restart is needed, uploads earlier and later are successful. That sort of intermittent problem is what the server side change should fix. Sure, it's a useful bit of sticking-plaster: it preserves the science, the credits, and the evidence for future debugging. But it does nothing to explain what went wrong in the first place, and why it tends to go on until the client is restarted (at least, that was the problem ten days ago, when this thread was started: Sutaru's report this morning sounds slightly different - but I notice that the tasks which errored today were issued on the first full day of the problem. Coincidence?). Well, the rash will be accompanied by "no command" messages so at least the reason uploads are being retried will be apparent. If a BOINC restart is required to cure it, those that don't pay attention may eventually error out, there's a 2 week limit for persistent file transfers. I certainly agree that if the problem occurs again, we ought to be prepared to pin it down. I fear Murphy will ensure it doesn't reappear until our guard is down. Official Name: setiboincdata.ssl.berkeley.edu IIRC those IP addresses are the two which were in round robin for the Schedulers. Could that have temporarily been revived, causing some hosts to get the wrong address when looking up setiboincdata? Exiting and restarting BOINC also flushes any DNS caching which libCurl is doing, so might correct that. I don't think that's a likely explanation, but I haven't thought of anything more likely either. Joe |
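The DNS idea above could be checked from an affected host with something like the following sketch. It is only an illustration: the helper names `resolve_ipv4` and `shared_addresses` are made up, the hostnames are the ones that appear in Connor's tcpdump, and they may no longer resolve. The comparison is kept separate from the lookup so it can be tried on addresses copied from a capture as well.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver reports for hostname."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return sorted(set(addresses))


def shared_addresses(upload_addrs, scheduler_addrs):
    """Return addresses that appear in both lists.

    A non-empty result would mean the upload handler name and the scheduler
    name currently resolve to at least one common machine.
    """
    return sorted(set(upload_addrs) & set(scheduler_addrs))


def check(upload_host, scheduler_host, resolver=resolve_ipv4):
    """Resolve both names and report any overlap between them."""
    return shared_addresses(resolver(upload_host), resolver(scheduler_host))


# Example call (performs real DNS lookups, so it is left commented out):
# print(check("setiboincdata.ssl.berkeley.edu", "setiboinc.ssl.berkeley.edu"))
```

Note that this asks the operating system's resolver, which is not necessarily what a long-running client has cached, so an empty result would not by itself rule the theory out.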
Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
I have now some more experiences.. ;-) [UTC + 01:00]

V6.1.0 (Crunch3rs):

02-Mar-2008 03:52:31 [SETI@home] Computation for task 11dc06af.32510.3753.3.7.44_0 finished
02-Mar-2008 03:52:31 [SETI@home] Starting 11dc06af.32510.3753.3.7.54_0
02-Mar-2008 03:52:31 [SETI@home] Starting task 11dc06af.32510.3753.3.7.54_0 using setiathome_enhanced version 528
02-Mar-2008 03:52:33 [SETI@home] [file_xfer] Started upload of file 11dc06af.32510.3753.3.7.44_0_0
02-Mar-2008 03:52:40 [SETI@home] [file_xfer] Temporarily failed upload of 11dc06af.32510.3753.3.7.44_0_0: HTTP error
02-Mar-2008 03:52:40 [SETI@home] Backing off 1 min 0 sec on upload of file 11dc06af.32510.3753.3.7.44_0_0
02-Mar-2008 03:53:40 [SETI@home] [file_xfer] Started upload of file 11dc06af.32510.3753.3.7.44_0_0
02-Mar-2008 03:53:42 [SETI@home] [file_xfer] Temporarily failed upload of 11dc06af.32510.3753.3.7.44_0_0: HTTP error
02-Mar-2008 03:53:42 [SETI@home] Backing off 1 min 0 sec on upload of file 11dc06af.32510.3753.3.7.44_0_0
02-Mar-2008 03:54:43 [SETI@home] [file_xfer] Started upload of file 11dc06af.32510.3753.3.7.44_0_0
02-Mar-2008 03:54:45 [SETI@home] [error] Error on file upload: no command
02-Mar-2008 03:54:45 [SETI@home] [file_xfer] Giving up on upload of 11dc06af.32510.3753.3.7.44_0_0: **permanent upload error**

V5.10.30:

05-Mar-2008 05:04:43 [SETI@home] Computation for task 24ja07aa.18577.32580.3.7.52_0 finished
05-Mar-2008 05:04:43 [SETI@home] Starting 24ja07aa.18577.32580.3.7.29_0
05-Mar-2008 05:04:43 [SETI@home] Starting task 24ja07aa.18577.32580.3.7.29_0 using setiathome_enhanced version 528
05-Mar-2008 05:04:44 [SETI@home] Started upload of 24ja07aa.18577.32580.3.7.52_0_0
05-Mar-2008 05:04:48 [SETI@home] Temporarily failed upload of 24ja07aa.18577.32580.3.7.52_0_0: http error
05-Mar-2008 05:04:48 [SETI@home] Backing off 1 min 0 sec on upload of 24ja07aa.18577.32580.3.7.52_0_0
05-Mar-2008 05:05:49 [SETI@home] Started upload of 24ja07aa.18577.32580.3.7.52_0_0
05-Mar-2008 05:05:55 [SETI@home] [error] Error on file upload: no command
05-Mar-2008 05:05:55 [SETI@home] Giving up on upload of 24ja07aa.18577.32580.3.7.52_0_0: **fatal upload error**

Maybe it (or the whole story) has something to do with libcurl? V6.1.0 (Crunch3rs) has libcurl/7.16.4; V5.10.30 has libcurl/7.17.1. |