Rash of Validate errors

Message boards : Number crunching : Rash of Validate errors
Profile Christian Robert, (Poly)

Joined: 28 Nov 07
Posts: 5
Credit: 10,759,150
RAC: 0
Canada
Message 716167 - Posted: 21 Feb 2008, 2:57:08 UTC - in response to Message 714586.  

I have been suffering a rash of validate errors the past few days. None of the Wu processed today succeeded, and several 'validate errors' remain on the results page from previous days, but not as many as had occurred.

I haven't seen a lot of discussion here, so I'm wondering whether I'm in some way unique.

I know this problem has occurred in the past, and a clean-up script has been used to fix the trouble. But, I'm wondering whether my WU are being picked up, or are simply disappearing from my results page?



I have the same problem on my machine at work: about 80% of my results since 15 Feb give "validate error". At home it works like a charm... My guess would be a DNS problem making our machines send our results to the wrong server at SETI. Just a guess, though.

Xtian.
ID: 716167 · Report as offensive
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1383
Credit: 54,506,847
RAC: 60
United States
Message 716172 - Posted: 21 Feb 2008, 3:05:25 UTC - in response to Message 714644.  




2/17/2008 9:50:20 PM|SETI@home|Starting task 05dc06ae.16424.4980.7.7.100_1 using setiathome_enhanced version 527
2/17/2008 9:50:23 PM|SETI@home|Started upload of 30dc06af.6706.15205.12.7.171_2_0
2/17/2008 9:50:24 PM|SETI@home|[error] Error on file upload: no command
2/17/2008 9:50:24 PM|SETI@home|Giving up on upload of 30dc06af.6706.15205.12.7.171_2_0: fatal upload error


Hmmmm... This is a new one to me. The BOINC client is successfully connecting to the server and the transfer happens and completes without TCP errors, but what we receive of the communications is missing vital portions. I'll let Matt and Jeff know about it. It probably had something to do with the weekend server problems. Has anyone seen it since the 17th?

Eric
@SETIEric@qoto.org (Mastodon)

ID: 716172 · Report as offensive
Morris
Volunteer tester

Joined: 11 Sep 01
Posts: 57
Credit: 9,077,302
RAC: 29
Italy
Message 716209 - Posted: 21 Feb 2008, 4:44:35 UTC - in response to Message 716172.  



Hmmmm... This is a new one to me. The BOINC client is successfully connecting to the server and the transfer happens and completes without TCP errors, but what we receive of the communications is missing vital portions. I'll let Matt and Jeff know about it. It probably had something to do with the weekend server problems. Has anyone seen it since the 17th?

Eric


Not me, Eric. I had that problem late in the evening of the 17th, roughly around midnight (CET). After that "no command" problem I couldn't even report the tasks; the client left them in a "ready to report" state. The next morning (the 18th) I could report the whole bunch of "no command" tasks, plus two additional tasks computed during the night. The two additional ones were granted credit, while the whole lot was marked as "Validate error". Since then, except for a few sporadic "HTTP Error" messages while downloading fresh WUs, I haven't experienced that issue any more.

m.
ID: 716209 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 716279 - Posted: 21 Feb 2008, 9:24:52 UTC - in response to Message 716172.  




2/17/2008 9:50:20 PM|SETI@home|Starting task 05dc06ae.16424.4980.7.7.100_1 using setiathome_enhanced version 527
2/17/2008 9:50:23 PM|SETI@home|Started upload of 30dc06af.6706.15205.12.7.171_2_0
2/17/2008 9:50:24 PM|SETI@home|[error] Error on file upload: no command
2/17/2008 9:50:24 PM|SETI@home|Giving up on upload of 30dc06af.6706.15205.12.7.171_2_0: fatal upload error


Hmmmm... This is a new one to me. The BOINC client is successfully connecting to the server and the transfer happens and completes without TCP errors, but what we receive of the communications is missing vital portions. I'll let Matt and Jeff know about it. It probably had something to do with the weekend server problems. Has anyone seen it since the 17th?

Eric

Every case we've seen seems to have started on the 14th.

For Natronomonas it continued until the 20th.

That's the really unusual thing about this one: once a BOINC client gets into this state, it continues until the client is restarted - nothing to do with the servers at all, except possibly the initial glitch which triggers the whole sequence.
ID: 716279 · Report as offensive
Morris
Volunteer tester

Joined: 11 Sep 01
Posts: 57
Credit: 9,077,302
RAC: 29
Italy
Message 716324 - Posted: 21 Feb 2008, 11:53:37 UTC - in response to Message 716279.  
Last modified: 21 Feb 2008, 12:11:08 UTC


Every case we've seen seems to have started on the 14th.

For Natronomonas it continued until the 20th.

That's the really unusual thing about this one: once a BOINC client gets into this state, it continues until the client is restarted - nothing to do with the servers at all, except possibly the initial glitch which triggers the whole sequence.


Sorry to contradict Richard, but at least in MY case there was no restart of the client. It happened on the evening of the 17th, and the next morning uploading and reporting of WUs was fine, even if I still have some slight doubt about hibernation... The issue happened on my laptop, and I switched it off and on at least once (no power-off, no standby, a simple and neat hibernate..). Anyway, I dug deep into the stdoutdae.txt file, and there's no sign of a client restart when I switched on again on the morning of the 18th.

HTH

M.
[edited a bit later, to give some "meaning" to the looooooong sentence]
ID: 716324 · Report as offensive
connor

Joined: 27 Jan 04
Posts: 1
Credit: 1,782,277
RAC: 0
Estonia
Message 716798 - Posted: 22 Feb 2008, 10:52:01 UTC - in response to Message 716324.  

Some logs that may help:
nohup.out from computer 4128879
22-Feb-2008 07:37:10 [SETI@home] Computation for task 28no06ad.11419.13569.9.7.189_0 finished
22-Feb-2008 07:37:10 [SETI@home] Resuming task 24ja07ac.17733.8661.13.7.34_0 using setiathome_enhanced version 527
22-Feb-2008 07:37:10 [SETI@home] Sending scheduler request: To fetch work.  Requesting 4804 seconds of work, reporting 0 completed tasks
22-Feb-2008 07:37:12 [SETI@home] Started upload of 28no06ad.11419.13569.9.7.189_0_0
22-Feb-2008 07:37:14 [SETI@home] [error] Error on file upload: no command
22-Feb-2008 07:37:14 [SETI@home] Giving up on upload of 28no06ad.11419.13569.9.7.189_0_0: fatal upload error
22-Feb-2008 07:37:15 [SETI@home] Scheduler request succeeded: got 1 new tasks
22-Feb-2008 07:37:18 [SETI@home] Started download of 29no06af.21891.15614.15.7.96
22-Feb-2008 07:37:23 [SETI@home] Finished download of 29no06af.21891.15614.15.7.96
22-Feb-2008 08:15:22 [SETI@home] Computation for task 22fe07af.14819.71435.12.7.221_1 finished
22-Feb-2008 08:15:22 [SETI@home] Restarting task 25ja07aa.28174.890.16.7.115_1 using setiathome_enhanced version 527
22-Feb-2008 08:15:24 [SETI@home] Started upload of 22fe07af.14819.71435.12.7.221_1_0
22-Feb-2008 08:15:27 [SETI@home] [error] Error on file upload: no command
22-Feb-2008 08:15:27 [SETI@home] Giving up on upload of 22fe07af.14819.71435.12.7.221_1_0: fatal upload error
22-Feb-2008 10:32:30 [SETI@home] Sending scheduler request: To fetch work.  Requesting 38 seconds of work, reporting 2 completed tasks
22-Feb-2008 10:33:03 [SETI@home] Scheduler request succeeded: got 0 new tasks
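
For anyone who wants to count how many results were lost this way, here is a minimal Python sketch that scans client log lines like the ones above. It is not part of BOINC; the function name `abandoned_uploads` is made up for illustration, and it assumes that each "no command" error line is eventually followed by the matching "Giving up on upload of ..." line, as in this log.

```python
import re

# Matches e.g. "Giving up on upload of <result file>: fatal upload error"
GIVE_UP_RE = re.compile(r"Giving up on upload of (?P<name>\S+): (?P<reason>.+)$")


def abandoned_uploads(lines):
    """Return result file names whose upload was abandoned after 'no command'."""
    abandoned = []
    pending_no_command = False
    for line in lines:
        line = line.strip()
        if "Error on file upload: no command" in line:
            # Remember the error until the client says which upload it gave up on.
            pending_no_command = True
            continue
        match = GIVE_UP_RE.search(line)
        if match:
            if pending_no_command:
                abandoned.append(match.group("name"))
            pending_no_command = False
    return abandoned


if __name__ == "__main__":
    sample = [
        "22-Feb-2008 07:37:12 [SETI@home] Started upload of 28no06ad.11419.13569.9.7.189_0_0",
        "22-Feb-2008 07:37:14 [SETI@home] [error] Error on file upload: no command",
        "22-Feb-2008 07:37:14 [SETI@home] Giving up on upload of 28no06ad.11419.13569.9.7.189_0_0: fatal upload error",
    ]
    for name in abandoned_uploads(sample):
        print(name)
```

Feeding it the whole stdoutdae.txt (or nohup.out) line by line should give a list that can be compared against the "Validate error" entries on the results page.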


tcpdump from 8:15 error upload time:
08:15:24.670218 IP 10.0.1.77.42415 > setiboinc.SSL.Berkeley.EDU.http: F 14561:14561(0) ack 10610 win 6528 <nop,nop,timestamp 2427632632 479489437>
08:15:24.922164 IP 10.0.1.77.42409 > setiboinc.SSL.Berkeley.EDU.http: F 539:539(0) ack 564 win 1741 <nop,nop,timestamp 2427632884 476926604>
08:15:25.251108 IP 10.0.1.77.42416 > 208.68.240.18.http: F 251:251(0) ack 375570 win 32612 <nop,nop,timestamp 2427633213 1401792860>
08:15:25.412068 IP 10.0.1.77.42411 > setiboincdata.SSL.Berkeley.EDU.http: F 291:291(0) ack 287 win 1728 <nop,nop,timestamp 2427633374 725572490>
08:15:25.585326 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: S 2944937978:2944937978(0) win 5840 <mss 1460,sackOK,timestamp 2427633547 0,nop,wscale 2>
08:15:25.757000 arp who-has 10.0.1.1 tell 10.0.1.77
08:15:25.757282 arp reply 10.0.1.1 is-at 00:80:ad:8b:c5:80
08:15:25.798150 IP setiboinc.SSL.Berkeley.EDU.http > 10.0.1.77.42421: S 1955859346:1955859346(0) ack 2944937979 win 5696 <mss 1436,sackOK,timestamp 484072738 2427633547,nop,wscale 10>
08:15:25.798191 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: . ack 1 win 1460 <nop,nop,timestamp 2427633760 484072738>
08:15:25.798277 IP 10.0.1.2.netbios-dgm > 10.0.1.255.netbios-dgm: NBT UDP PACKET(138)
08:15:25.798396 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: P 1:254(253) ack 1 win 1460 <nop,nop,timestamp 2427633760 484072738>
08:15:26.013019 IP setiboinc.SSL.Berkeley.EDU.http > 10.0.1.77.42421: . ack 254 win 7 <nop,nop,timestamp 484072953 2427633760>
08:15:26.013053 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: P 254:541(287) ack 1 win 1460 <nop,nop,timestamp 2427633975 484072953>
08:15:26.227371 IP setiboinc.SSL.Berkeley.EDU.http > 10.0.1.77.42421: . ack 541 win 8 <nop,nop,timestamp 484073167 2427633975>
08:15:26.231217 IP setiboinc.SSL.Berkeley.EDU.http > 10.0.1.77.42421: P 1:563(562) ack 541 win 8 <nop,nop,timestamp 484073167 2427633975>
08:15:26.231253 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: . ack 563 win 1741 <nop,nop,timestamp 2427634193 484073167>
08:15:26.231265 IP setiboinc.SSL.Berkeley.EDU.http > 10.0.1.77.42421: F 563:563(0) ack 541 win 8 <nop,nop,timestamp 484073167 2427633975>
08:15:26.231687 IP 10.0.1.77.42422 > setiboincdata.SSL.Berkeley.EDU.http: S 2950039049:2950039049(0) win 5840 <mss 1460,sackOK,timestamp 2427634194 0,nop,wscale 2>
08:15:26.270918 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: . ack 564 win 1741 <nop,nop,timestamp 2427634233 484073167>
08:15:26.448472 IP setiboincdata.SSL.Berkeley.EDU.http > 10.0.1.77.42422: S 2982712386:2982712386(0) ack 2950039050 win 5696 <mss 1436,sackOK,timestamp 732718837 2427634194,nop,wscale 10>
08:15:26.448537 IP 10.0.1.77.42422 > setiboincdata.SSL.Berkeley.EDU.http: . ack 1 win 1460 <nop,nop,timestamp 2427634411 732718837>
08:15:26.448772 IP 10.0.1.77.42422 > setiboincdata.SSL.Berkeley.EDU.http: P 1:291(290) ack 1 win 1460 <nop,nop,timestamp 2427634411 732718837>
08:15:26.664038 IP setiboincdata.SSL.Berkeley.EDU.http > 10.0.1.77.42422: . ack 291 win 7 <nop,nop,timestamp 732719056 2427634411>
08:15:26.667079 IP setiboincdata.SSL.Berkeley.EDU.http > 10.0.1.77.42422: P 1:286(285) ack 291 win 7 <nop,nop,timestamp 732719058 2427634411>
08:15:26.667099 IP 10.0.1.77.42422 > setiboincdata.SSL.Berkeley.EDU.http: . ack 286 win 1728 <nop,nop,timestamp 2427634629 732719058>
08:15:26.668403 IP setiboincdata.SSL.Berkeley.EDU.http > 10.0.1.77.42422: F 286:286(0) ack 291 win 7 <nop,nop,timestamp 732719058 2427634411>
08:15:26.707835 IP 10.0.1.77.42422 > setiboincdata.SSL.Berkeley.EDU.http: . ack 287 win 1728 <nop,nop,timestamp 2427634670 732719058>


host is 10.0.1.77 DNS 10.0.1.1
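
To make the payload sizes in a dump like this easier to compare, here is a rough Python sketch that pulls the source, destination, TCP flags and payload length out of each line. It is only an illustration: the regular expression assumes the older tcpdump text layout shown above (`flags first:last(length)`), and the names `parse_line` and `payload_packets` are invented for this example.

```python
import re

# Older tcpdump text format: "HH:MM:SS.us IP src > dst: FLAGS first:last(len) ..."
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+) IP "
    r"(?P<src>\S+) > (?P<dst>\S+): "
    r"(?P<flags>[SFPR.]+)"
    r"(?: (?P<first>\d+):(?P<last>\d+)\((?P<length>\d+)\))?"
)


def parse_line(line):
    """Return a dict for a TCP line, or None for ARP, NetBIOS and other lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    length = int(match.group("length")) if match.group("length") else 0
    return {
        "time": match.group("time"),
        "src": match.group("src"),
        "dst": match.group("dst"),
        "flags": match.group("flags"),
        "length": length,
    }


def payload_packets(lines):
    """Keep only the packets that actually carry data (length > 0)."""
    parsed = (parse_line(line) for line in lines)
    return [p for p in parsed if p is not None and p["length"] > 0]


if __name__ == "__main__":
    sample = [
        "08:15:25.757000 arp who-has 10.0.1.1 tell 10.0.1.77",
        "08:15:25.798396 IP 10.0.1.77.42421 > setiboinc.SSL.Berkeley.EDU.http: P 1:254(253) ack 1 win 1460 <nop,nop,timestamp 2427633760 484072738>",
        "08:15:26.013019 IP setiboinc.SSL.Berkeley.EDU.http > 10.0.1.77.42421: . ack 254 win 7 <nop,nop,timestamp 484072953 2427633760>",
    ]
    for packet in payload_packets(sample):
        print(packet["time"], packet["src"], "->", packet["dst"], packet["length"])
```

Note that this only shows sizes and direction; the dump above does not include packet contents, so it cannot show what the server actually received.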


Also, it looks like whenever the client reports results as part of a "Sending scheduler request: To fetch work. Requesting XXX seconds of work, reporting 2 completed tasks" request, some work gets through, and when it is just reporting work, none gets validated.
Anyway, the PC hasn't been restarted yet, so feel free to request any more information.
ID: 716798 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 716912 - Posted: 22 Feb 2008, 15:16:05 UTC - in response to Message 714785.  
Last modified: 22 Feb 2008, 15:17:51 UTC

I noticed and posted my first 'no command' error on 27 May 2007 21:53:50 UTC.
But maybe I had them before that too and just didn't see them..


And then I got a lot of help here on this board, thanks to all again!!

But we didn't find or eliminate the cause..


Mauro said he uses BOINC V5.10.30 and he has the 'no command' errors too..

I use BOINC V5.10.28 with Crunch3rs V6.1.0


So for now we can only wait and drink a tea or coffee, right?
So far nobody knows why we sometimes get the 'no command' errors..


But I'm happy that I'm no longer alone..
So maybe now there is more interest on Berkeley's side in finding and eliminating this problem..



So what is the final result on this topic?

Is it recommended to use a particular version of BOINC?

To reboot the PC from time to time?
That didn't help me much.

I have one 'no command' - 'validate error' nearly every day.
O.K., others have a bunch of these now.. BUT I have had the problem with the 'no command' error since at least 27 May 2007 21:53:50 UTC


OR is it not on the user's side, but a problem with the server at Berkeley?
ID: 716912 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 716930 - Posted: 22 Feb 2008, 16:08:26 UTC

Sorry, Sutaru, but I'm out of my depth on this one!

It sounded from the quote from David Anderson that there may be some changes to future versions of BOINC which make the problem less likely. But all of this started after BOINC v5.10.42 was released for testing - watch out for at least v5.10.43 (in Beta test first, then in full release). But it hasn't even been written yet - nothing to see yet.

I hope Joe Segur / Eric Korpela / David Anderson can get some useful detail out of the TCPdump that Connor has posted - that's the most detailed analysis yet. Thanks, Connor. It's possible that some change to the server code could come from that, so the problems could be eliminated without everyone having to upgrade BOINC - but we'll have to wait and see.
ID: 716930 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 716961 - Posted: 22 Feb 2008, 17:35:53 UTC
Last modified: 22 Feb 2008, 17:38:50 UTC

O.K., Richard.. thanks a lot!


Now we can only wait for the release of the new BOINC client..



Now we have 3 possible unhappy situations..


  • 1.) wingman has BOINC V3.x -> no credits
  • 2.) wingman has BOINC V4.x -> fewer credits
  • 3.) server doesn't like the BOINC client.. -> no command - validate error.. -> no credits


BUT if a validate error occurs.. a third or further copy of the WU is sent out and the science is slower..


ID: 716961 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 717263 - Posted: 23 Feb 2008, 5:20:47 UTC

I'm out of my depth too on the protocol stuff, but I do see some interesting things in Connor's tcpdump output:

At 08:15:25.798396 the host sent a 253 character packet to setiboinc.SSL.Berkeley.EDU. The server acknowledged it but said its receive window was 7 characters. The host then sent a 287 character packet which got through anyhow and was acknowledged. The server then sent back a 562 character reply and that connection terminated. Note that setiboinc.SSL.Berkeley.EDU is the Scheduler, not the Upload handler (setiboincdata.SSL.Berkeley.EDU).

When my hosts upload, they first send a 239 character packet with HTTP headers, but I'm using BOINC 5.2.12 and there have been some header changes since; I think that Connor's first 253 character packet size is probably those headers. The second packet with POST data tells the Upload handler what result file is going to be sent, it's like this except this stupid forum software will hide leading spaces:
<data_server_request>
    <core_client_major_version>5</core_client_major_version>
    <core_client_minor_version>2</core_client_minor_version>
    <core_client_release>12</core_client_release>
    <get_file_size>29no06af.24239.19295.12.7.35_1_0</get_file_size>
</data_server_request>

That one of mine was 285 characters, but Connor is running 5.10.28 which would add one character in the <core_client_minor_version> field, and his result filename was one character longer, so 287 for his second packet makes sense. If those packets had been sent to the right server, I think the upload might have worked.
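
The character counts above can be checked with a few lines of Python. This is only a sketch: it assumes four-space indentation, plain "\n" line endings and a trailing newline, which is a guess about what goes on the wire (the forum hid the leading spaces), and the helper name `data_server_request` is made up here.

```python
def data_server_request(major, minor, release, result_file):
    """Build the <data_server_request> body with assumed whitespace."""
    lines = [
        "<data_server_request>",
        f"    <core_client_major_version>{major}</core_client_major_version>",
        f"    <core_client_minor_version>{minor}</core_client_minor_version>",
        f"    <core_client_release>{release}</core_client_release>",
        f"    <get_file_size>{result_file}</get_file_size>",
        "</data_server_request>",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # BOINC 5.2.12 with the result file from the example above
    mine = data_server_request(5, 2, 12, "29no06af.24239.19295.12.7.35_1_0")
    # BOINC 5.10.28 with the result file from Connor's 08:15 upload
    connors = data_server_request(5, 10, 28, "22fe07af.14819.71435.12.7.221_1_0")
    print(len(mine), len(connors), len(connors) - len(mine))
```

Under those whitespace assumptions the two bodies come out at 285 and 287 characters, matching the second packet in Connor's capture, which supports the reading that the 287 byte packet was a get_file_size request.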

The remainder of the tcpdump has the connection to the Upload handler which I assume ended with the "no command" response. I can't figure out what the packet sizes mean, they're not right for either the first or second phase of doing an upload.

There's a lot of guesswork in that, I hope someone can do a capture with the actual packet contents. What I do with Windump is use -s2000 -w<filename> to capture the full packets and write them to a file, then stop Windump after it has captured enough. Then I invoke it again with -X -r<filename> and redirect the output to a text file. After that I can open the text file and see everything. I'm sure Wireshark would make it simpler.
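
The two Windump invocations described above could be wrapped like this. It is a sketch only: it assumes the executable is on the PATH under the name `windump`, passes the options as separate arguments, and the function names and the `capture.pcap` file name are invented. The script prints the commands and does not run them.

```python
def capture_command(capture_file, snaplen=2000, program="windump"):
    """Command that captures full packets (-s) and writes them to a file (-w)."""
    return [program, "-s", str(snaplen), "-w", capture_file]


def dump_command(capture_file, program="windump"):
    """Command that reads the capture back (-r) and prints packet contents (-X)."""
    return [program, "-X", "-r", capture_file]


if __name__ == "__main__":
    # Step 1: run this, then stop it once enough traffic has been captured.
    print(" ".join(capture_command("capture.pcap")))
    # Step 2: run this and redirect the output to a text file for reading.
    print(" ".join(dump_command("capture.pcap")))
```

To run either step for real, the list can be handed to `subprocess.run`, with `stdout` pointed at an open text file for the second step.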
                                                                Joe
ID: 717263 · Report as offensive
david z

Joined: 19 Nov 99
Posts: 2
Credit: 4,894,997
RAC: 0
United States
Message 717363 - Posted: 23 Feb 2008, 11:39:13 UTC

I too have had a large number of validate errors. I normally don't check the pending page on a regular basis - work flows smoothly, and I use an 8 day cache to get me "over the humps", so I see little down time anymore. But recently on one of my Mac Intel laptops I noticed that my RAC had dropped several hundred in the past few days - validate errors! 91 of them, starting around 2/15. After reading the threads I could see the upload errors as previously discussed. I restarted BOINC and manually updated SETI to have it send in WUs to see if they would go through - they did not. Next step was to reinstall BOINC, have a Wii party and go to bed. This morning I checked the log and uploads are going through.

Thanks to all of you who take the time to put information such as this in a forum where all others can get to it if needed.

As pointed out before, as this appears to be MORE than a transient problem, it would be nice if Berkeley were able to look into it.

Thanks again guys!
ID: 717363 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 717390 - Posted: 23 Feb 2008, 14:05:10 UTC
Last modified: 23 Feb 2008, 14:20:12 UTC

@ Josef W. Segur

Did you never have a 'no command - validate error' with your old client?
Why do you use this old client?

Maybe now it would be better to use V5.2.x (V5.2.6 or later!), V5.4.x or V5.8.x?


BTW,
after looking at your PCs..
You use SETI@home V5.15 Rev. 2.4, but it's reported as V6.0
So are the opt. apps now available (Rev. 2.4V) from Crunch3rs homepage V6.0 compatible?
ID: 717390 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 717599 - Posted: 24 Feb 2008, 1:11:28 UTC - in response to Message 717390.  

@ Josef W. Segur

Did you never have a 'no command - validate error' with your old client?

Never a "no command", but when Kryten was dropping mounts I had a couple of validate errors. I generally only connect once a day, and try to choose times when the servers appear to be working well.

Why do you use this old client?

Rom Walton wouldn't build a libCurl compatible with my Win95 system, so I did. BOINC 5.2.12 was current at that time, so rather than taking time to build newer versions of libCurl for newer BOINCs I've stayed with that.

Maybe now it would be better to use V5.2.x (V5.2.6 or later!), V5.4.x or V5.8.x?

I have thought about upgrading, but the few improvements which have been made to BOINC are far outweighed by the excess garbage added. There were a couple of changes for 5.4.6 and 5.5.13 or so which were tempting. In any case I want to keep the same version on all my systems.

BTW,
after looking at your PCs..
You use SETI@home V5.15 Rev. 2.4, but it's reported as V6.0
So are the opt. apps now available (Rev. 2.4V) from Crunch3rs homepage V6.0 compatible?

Because I do SETI Beta too and am running an old BOINC, I have to add a section to the app_info.xml files here spoofing as 6.00. Any version of setiathome_enhanced will do the work correctly.
                                                                Joe
ID: 717599 · Report as offensive
Odysseus
Volunteer tester
Joined: 26 Jul 99
Posts: 1808
Credit: 6,701,347
RAC: 6
Canada
Message 717603 - Posted: 24 Feb 2008, 1:18:32 UTC - in response to Message 717390.  

You use SETI@home V5.15 Rev. 2.4, but it's reported as V6.0
So are the opt. apps now available (Rev. 2.4V) from Crunch3rs homepage V6.0 compatible?

Not necessarily. You can’t trust the reported version where the anonymous-platform mechanism is in use: in that case BOINC reports whatever the app_info.xml file tells it to. It’s not uncommon for those of us who run the Beta project, and also run custom apps here, to include references to the version under testing there in our app_info here, as there’s a bug in the client (fixed in more recent versions—5.8.x? 5.10.x?), triggered by the apps from the two projects having the same name, that makes it trash S@h WUs because it suddenly decides it doesn’t have a high enough version to crunch them.

ID: 717603 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 718236 - Posted: 25 Feb 2008, 10:42:41 UTC - in response to Message 714740.  
Last modified: 25 Feb 2008, 10:44:17 UTC

...
In all the time I had been obtaining work from here, the only times I received a Validate Error were of the "type 1" variety. If you suspect that the "type 2" have been around the whole time, then keep in mind that I am still using 5.8.16. It may be worth pursuing the idea of if 5.8.xx has the issue or not, considering it does have a healthy share of users still...as does 5.4.11...


I have now tried BOINC V5.8.16 and after ~2 days got 3 'no command - validate errors'..

wuid=222497619
wuid=222497619
wuid=222497535
..again with 'server rejected file'


Now I'm back online with BOINC V5.10.30 with Crunch3rs V6.1.0


When will a correct BOINC version be available..?

..maybe V5.10.43 ?
ID: 718236 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 718433 - Posted: 25 Feb 2008, 20:12:48 UTC - in response to Message 718236.  

...
I have now tried BOINC V5.8.16 and after ~2 days got 3 'no command - validate errors'
...
Now I'm back online with BOINC V5.10.30 with Crunch3rs V6.1.0

When will a correct BOINC version be available..?

..maybe V5.10.43 ?

That's totally unknown, the client problem has not yet been diagnosed.

What should fix your problem is the February 20 changeset 14767, a server-side change in file_upload_handler.C. When that's deployed in the S@H Upload handler, your core client will no longer delete the result file. Instead it will try again to upload, with the usual backoffs for failed transfers.
                                                                 Joe
ID: 718433 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 718471 - Posted: 25 Feb 2008, 22:31:37 UTC - in response to Message 718433.  

That's totally unknown, the client problem has not yet been diagnosed.

What should fix your problem is the February 20 changeset 14767, a server-side change in file_upload_handler.C. When that's deployed in the S@H Upload handler, your core client will no longer delete the result file. Instead it will try again to upload, with the usual backoffs for failed transfers.
                                                                 Joe

For the reason you give in your first sentence, I wouldn't really call that changeset a 'fix'.

Sure, it's a useful bit of sticking-plaster: it preserves the science, the credits, and the evidence for future debugging. But it does nothing to explain what went wrong in the first place, and why it tends to go on until the client is restarted (at least, that was the problem ten days ago, when this thread was started: Sutaru's report this morning sounds slightly different - but I notice that the tasks which errored today were issued on the first full day of the problem. Coincidence?).

No, I'm afraid debugging this one is still a 'work in progress', and we still need volunteers to be on standby, Wireshark at the ready. It's just that the warning sign will be a "Rash of failed uploads", instead of a "Rash of Validate errors".
ID: 718471 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 718495 - Posted: 25 Feb 2008, 23:30:52 UTC - in response to Message 718236.  

...
wuid=222497619
wuid=222497619
wuid=222497535
..again with 'server rejected file'
...



Oops.. nobody noticed that I posted the same WU twice..?

Here're the correct links:

wuid=222497609
wuid=222497619
wuid=222497535


ID: 718495 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 718552 - Posted: 26 Feb 2008, 2:39:29 UTC - in response to Message 718471.  

That's totally unknown, the client problem has not yet been diagnosed.

What should fix your problem is the February 20 changeset 14767, a server-side change in file_upload_handler.C. When that's deployed in the S@H Upload handler, your core client will no longer delete the result file. Instead it will try again to upload, with the usual backoffs for failed transfers.
                                                                 Joe

For the reason you give in your first sentence, I wouldn't really call that changeset a 'fix'.

While offtopic for this thread, I was referring to Sutaru's long time problem with a few WUs per day getting a "no command" and Validate error. No restart is needed, uploads earlier and later are successful. That sort of intermittent problem is what the server side change should fix.

Sure, it's a useful bit of sticking-plaster: it preserves the science, the credits, and the evidence for future debugging. But it does nothing to explain what went wrong in the first place, and why it tends to go on until the client is restarted (at least, that was the problem ten days ago, when this thread was started: Sutaru's report this morning sounds slightly different - but I notice that the tasks which errored today were issued on the first full day of the problem. Coincidence?).

No, I'm afraid debugging this one is still a 'work in progress', and we still need volunteers to be on standby, Wireshark at the ready. It's just that the warning sign will be a "Rash of failed uploads", instead of a "Rash of Validate errors".

Well, the rash will be accompanied by "no command" messages so at least the reason uploads are being retried will be apparent. If a BOINC restart is required to cure it, those that don't pay attention may eventually error out, there's a 2 week limit for persistent file transfers.
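
To get a feel for how long a stuck upload could keep retrying before it hits that limit, here is a toy Python model. It is not BOINC's actual algorithm: the one minute starting backoff is taken from the logs in this thread, but the doubling and the four hour cap are pure assumptions, and the real client's backoff is not modelled. Only the two week limit comes from the paragraph above.

```python
from datetime import timedelta

# "2 week limit for persistent file transfers", as mentioned above
PERSISTENT_TRANSFER_LIMIT = timedelta(weeks=2)


def retry_schedule(
    first_backoff=timedelta(minutes=1),   # seen in the logs: "Backing off 1 min 0 sec"
    max_backoff=timedelta(hours=4),       # assumed cap, for illustration only
    limit=PERSISTENT_TRANSFER_LIMIT,
):
    """Yield the elapsed time at each retry until the next one would pass the limit."""
    elapsed = timedelta(0)
    backoff = first_backoff
    while elapsed + backoff <= limit:
        elapsed += backoff
        yield elapsed
        # Assumed policy: double the wait after each failure, up to the cap.
        backoff = min(backoff * 2, max_backoff)


if __name__ == "__main__":
    schedule = list(retry_schedule())
    print("retries before the transfer errors out:", len(schedule))
    print("last retry after:", schedule[-1])
```

Whatever the real numbers are, the point is the same: a host that stays in the bad state will quietly retry for days, so the uploads are only saved if someone notices and restarts BOINC within the limit.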

I certainly agree that if the problem occurs again, we ought to be prepared to pin it down. I fear Murphy will ensure it doesn't reappear until our guard is down.

Official Name: setiboincdata.ssl.berkeley.edu
IP Address: 208.68.240.16

Official Name: setiboinc.ssl.berkeley.edu
IP Address: 208.68.240.17

IIRC those IP addresses are the two which were in round robin for the Schedulers. Could that have temporarily been revived, causing some hosts to get the wrong address when looking up setiboincdata? Exiting and restarting BOINC also flushes any DNS caching which libCurl is doing, so might correct that. I don't think that's a likely explanation, but I haven't thought of anything more likely either.
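
If the problem comes back, one quick check of the DNS idea is to compare what a host name resolves to against the addresses quoted above. The Python sketch below does that with the standard `socket.gethostbyname_ex` call; the expected addresses are simply the ones from this post and may be out of date, the function name `unexpected_addresses` is made up, and it checks the operating system's resolver, not whatever libCurl has cached inside a running BOINC client.

```python
import socket

# Addresses as quoted in this thread; treat them as assumptions, not ground truth.
EXPECTED_ADDRESSES = {
    "setiboincdata.ssl.berkeley.edu": {"208.68.240.16"},
    "setiboinc.ssl.berkeley.edu": {"208.68.240.17"},
}


def unexpected_addresses(hostname, expected, resolve=socket.gethostbyname_ex):
    """Return the resolved IPv4 addresses that are not in the expected set."""
    _name, _aliases, addresses = resolve(hostname)
    return set(addresses) - set(expected)


if __name__ == "__main__":
    for host, expected in EXPECTED_ADDRESSES.items():
        try:
            extra = unexpected_addresses(host, expected)
        except OSError as error:
            print(host, "lookup failed:", error)
            continue
        print(host, "unexpected addresses:", sorted(extra) or "none")
```

An upload host name that suddenly resolves to a scheduler address would be strong evidence for the round robin theory; a clean result would not rule it out, because of the separate caching in libCurl.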
                                                                  Joe
ID: 718552 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 722380 - Posted: 5 Mar 2008, 17:56:33 UTC



I now have some more experience.. ;-)


[UTC + 01:00]

V6.1.0 (Crunch3rs)
02-Mar-2008 03:52:31 [SETI@home] Computation for task 11dc06af.32510.3753.3.7.44_0 finished
02-Mar-2008 03:52:31 [SETI@home] Starting 11dc06af.32510.3753.3.7.54_0
02-Mar-2008 03:52:31 [SETI@home] Starting task 11dc06af.32510.3753.3.7.54_0 using setiathome_enhanced version 528
02-Mar-2008 03:52:33 [SETI@home] [file_xfer] Started upload of file 11dc06af.32510.3753.3.7.44_0_0
02-Mar-2008 03:52:40 [SETI@home] [file_xfer] Temporarily failed upload of 11dc06af.32510.3753.3.7.44_0_0: HTTP error
02-Mar-2008 03:52:40 [SETI@home] Backing off 1 min 0 sec on upload of file 11dc06af.32510.3753.3.7.44_0_0
02-Mar-2008 03:53:40 [SETI@home] [file_xfer] Started upload of file 11dc06af.32510.3753.3.7.44_0_0
02-Mar-2008 03:53:42 [SETI@home] [file_xfer] Temporarily failed upload of 11dc06af.32510.3753.3.7.44_0_0: HTTP error
02-Mar-2008 03:53:42 [SETI@home] Backing off 1 min 0 sec on upload of file 11dc06af.32510.3753.3.7.44_0_0
02-Mar-2008 03:54:43 [SETI@home] [file_xfer] Started upload of file 11dc06af.32510.3753.3.7.44_0_0
02-Mar-2008 03:54:45 [SETI@home] [error] Error on file upload: no command
02-Mar-2008 03:54:45 [SETI@home] [file_xfer] Giving up on upload of 11dc06af.32510.3753.3.7.44_0_0: permanent upload error



V5.10.30
05-Mar-2008 05:04:43 [SETI@home] Computation for task 24ja07aa.18577.32580.3.7.52_0 finished
05-Mar-2008 05:04:43 [SETI@home] Starting 24ja07aa.18577.32580.3.7.29_0
05-Mar-2008 05:04:43 [SETI@home] Starting task 24ja07aa.18577.32580.3.7.29_0 using setiathome_enhanced version 528
05-Mar-2008 05:04:44 [SETI@home] Started upload of 24ja07aa.18577.32580.3.7.52_0_0
05-Mar-2008 05:04:48 [SETI@home] Temporarily failed upload of 24ja07aa.18577.32580.3.7.52_0_0: http error
05-Mar-2008 05:04:48 [SETI@home] Backing off 1 min 0 sec on upload of 24ja07aa.18577.32580.3.7.52_0_0
05-Mar-2008 05:05:49 [SETI@home] Started upload of 24ja07aa.18577.32580.3.7.52_0_0
05-Mar-2008 05:05:55 [SETI@home] [error] Error on file upload: no command
05-Mar-2008 05:05:55 [SETI@home] Giving up on upload of 24ja07aa.18577.32580.3.7.52_0_0: fatal upload error




Maybe it (or the whole story) has something to do with libcurl?

V6.1.0 (Crunch3rs) has libcurl/7.16.4

V5.10.30 has libcurl/7.17.1



ID: 722380 · Report as offensive
 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.