Upload problems.

Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 502900 - Posted: 14 Jan 2007, 20:08:22 UTC

Kryten (aka setiboincdata) is being a bad boy again. A reboot fixed the upload problem, but now I'm unable to log in. I may need to run to the lab and give it another kick.

I have a few tricks to try, but I think we're going to need to replace kryten at some point in the near future. With what, I don't know.

@SETIEric@qoto.org (Mastodon)

ID: 502900 · Report as offensive
Brock
Joined: 19 Dec 06
Posts: 201
Credit: 774,488
RAC: 0
United States
Message 502905 - Posted: 14 Jan 2007, 20:16:29 UTC

Thanks for the update Eric.
ID: 502905 · Report as offensive
bernt
Joined: 10 Dec 06
Posts: 27
Credit: 131,599
RAC: 0
Sweden
Message 503019 - Posted: 14 Jan 2007, 22:14:40 UTC - in response to Message 502900.  

Kryten (aka setiboincdata) is being a bad boy again. A reboot fixed the upload problem, but now I'm unable to log in. I may need to run to the lab and give it another kick.

I have a few tricks to try, but I think we're going to need to replace kryten at some point in the near future. With what, I don't know.

Eric,

My box runs unattended for 11 hours every day. I have seen a lot of remarks about shutting down network traffic to preserve results. Is this really necessary? I mean, when the system is back in business it will take care of the work done, or .......? I don't care so much about RAC; I am more interested in whether the work my PC is performing contributes to the project.

Kind regards from
Bernt in Sweden
ID: 503019 · Report as offensive
kinhull
Volunteer tester
Joined: 3 Oct 03
Posts: 1029
Credit: 636,475
RAC: 0
United Kingdom
Message 503052 - Posted: 14 Jan 2007, 22:33:08 UTC
Last modified: 14 Jan 2007, 22:34:02 UTC

Is this related to the Validate Errors people have been having?

I notice that sah_validate1, 2, 3 & 4 run on kryten.
Join TeamACC

Sometimes I think we are alone in the universe, and sometimes I think we are not. In either case the idea is quite staggering.
ID: 503052 · Report as offensive
John Clark
Volunteer tester
Joined: 29 Sep 99
Posts: 16515
Credit: 4,418,829
RAC: 0
United Kingdom
Message 503162 - Posted: 15 Jan 2007, 1:40:43 UTC

Eric ... thanks for the update!
It's good to be back amongst friends and colleagues



ID: 503162 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 503262 - Posted: 15 Jan 2007, 5:50:47 UTC - in response to Message 503052.  

Is this related to the Validate Errors people have been having?

I notice that sah_validate1, 2, 3 & 4 run on kryten.

It's not because the validator code is running on Kryten, it's because the drive system where the uploads are stored becomes disconnected from Kryten. Then when the 3rd "success" result is reported to the Scheduler and it tells the validator to check them, the files cannot be found. The BOINC code only has one response to that; mark those results "Validate error" and go on. And once they're marked as errors, the system will not try to check them again even after the files are once again available.

It's simply a case of Murphy's Law which the design of BOINC failed to anticipate. It should be possible to add something like a "Deferred" status to the possible validate states which would result in another attempt at validating the files after a few hours. But working out all the detail of implementing that would not be trivial. I believe it would involve changes in at least the Scheduler, Validator, and Transitioner code.
                                                               Joe
ID: 503262 · Report as offensive
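The "Deferred" state Joe proposes can be pictured with a small sketch. This is not BOINC code: the `ValidateState` enum, the `Result` record, the three-hour retry delay, and the `file_exists` / `contents_ok` callbacks are all names assumed here for illustration. The only point is that a missing file leads to a later retry instead of a permanent "Validate error".

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Iterable, List


class ValidateState(Enum):
    PENDING = auto()
    VALID = auto()
    ERROR = auto()     # terminal: the result is never checked again
    DEFERRED = auto()  # hypothetical extra state: try again later


@dataclass
class Result:
    name: str
    path: str
    state: ValidateState = ValidateState.PENDING
    retry_at: float = 0.0  # earliest time (seconds) for the next attempt


RETRY_DELAY_SECONDS = 3 * 60 * 60  # assumed value for "a few hours"


def validate(result: Result,
             file_exists: Callable[[str], bool],
             contents_ok: Callable[[str], bool],
             now: float) -> ValidateState:
    """Validate one result, deferring instead of failing when its file is missing."""
    if not file_exists(result.path):
        # The file may simply be unreachable right now, so schedule a retry.
        result.state = ValidateState.DEFERRED
        result.retry_at = now + RETRY_DELAY_SECONDS
    elif contents_ok(result.path):
        result.state = ValidateState.VALID
    else:
        # The file was readable but its contents failed the check.
        result.state = ValidateState.ERROR
    return result.state


def due_for_retry(results: Iterable[Result], now: float) -> List[Result]:
    """Return the deferred results whose retry time has arrived."""
    return [r for r in results
            if r.state is ValidateState.DEFERRED and r.retry_at <= now]
```

With the upload drive disconnected, `validate` would leave a result in `DEFERRED`; once the drive is back, a later pass over `due_for_retry(...)` could validate it normally. As Joe notes, wiring such a state into the real Scheduler, Validator, and Transitioner would be the hard part, and this sketch does not attempt it.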
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 503486 - Posted: 15 Jan 2007, 18:09:49 UTC - in response to Message 503262.  

Is this related to the Validate Errors people have been having?

I notice that sah_validate1, 2, 3 & 4 run on kryten.

It's not because the validator code is running on Kryten, it's because the drive system where the uploads are stored becomes disconnected from Kryten. Then when the 3rd "success" result is reported to the Scheduler and it tells the validator to check them, the files cannot be found. The BOINC code only has one response to that; mark those results "Validate error" and go on. And once they're marked as errors, the system will not try to check them again even after the files are once again available.

It's simply a case of Murphy's Law which the design of BOINC failed to anticipate. It should be possible to add something like a "Deferred" status to the possible validate states which would result in another attempt at validating the files after a few hours. But working out all the detail of implementing that would not be trivial. I believe it would involve changes in at least the Scheduler, Validator, and Transitioner code.
                                                               Joe


Thank You very much for the explanation, Joe! I think this is the first time anybody laid out how the validate errors actually occur. Perhaps you could link this to the various 'validate error' threads that are on the forum right now. It might help others a bit to understand how this is actually happening.

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 503486 · Report as offensive
Francesco Forti
Joined: 24 May 00
Posts: 334
Credit: 204,421,005
RAC: 15
Switzerland
Message 504102 - Posted: 16 Jan 2007, 15:34:51 UTC - in response to Message 503262.  

[....] The BOINC code only has one response to that; mark those results "Validate error" and go on. And once they're marked as errors, the system will not try to check them again even after the files are once again available.[...]


Is it possible to rerun the validation process for all the validate error workunits, in order to assign value to the job?
This would come later, of course, as part of a general recovery procedure after Kryten's fault and restart.

Thanks
Franz

ID: 504102 · Report as offensive
Wander Saito
Volunteer tester

Joined: 7 Jul 03
Posts: 555
Credit: 2,136,061
RAC: 0
Brazil
Message 504145 - Posted: 16 Jan 2007, 16:59:14 UTC - in response to Message 504102.  

[....] The BOINC code only has one response to that; mark those results "Validate error" and go on. And once they're marked as errors, the system will not try to check them again even after the files are once again available.[...]


Is it possible to rerun the validation process for all the validate error workunits, in order to assign value to the job?
This would come later, of course, as part of a general recovery procedure after Kryten's fault and restart.

Thanks
Franz


Hi Franz,

Check out this thread. Eric and Pappa are doing exactly that, but the process is manual, and you must inform them which WUs you think are eligible. Pay attention to the first post, explaining what kind of error they are looking for.

Regards,
Wander
ID: 504145 · Report as offensive
Andy Lee Robinson
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 504171 - Posted: 16 Jan 2007, 17:36:49 UTC - in response to Message 503262.  

Is this related to the Validate Errors people have been having?

I notice that sah_validate1, 2, 3 & 4 run on kryten.

It's not because the validator code is running on Kryten, it's because the drive system where the uploads are stored becomes disconnected from Kryten. Then when the 3rd "success" result is reported to the Scheduler and it tells the validator to check them, the files cannot be found. The BOINC code only has one response to that; mark those results "Validate error" and go on. And once they're marked as errors, the system will not try to check them again even after the files are once again available.

It's simply a case of Murphy's Law which the design of BOINC failed to anticipate. It should be possible to add something like a "Deferred" status to the possible validate states which would result in another attempt at validating the files after a few hours. But working out all the detail of implementing that would not be trivial. I believe it would involve changes in at least the Scheduler, Validator, and Transitioner code.
                                                               Joe


Thanks for the explanation Joe.
Seems clear that a modification to this oversight is required. Lumping any kind of error as a validate error seems to me more than a little myopic, knowing all the things that can go wrong.

One should make as few assumptions as possible!
NFS drive availability should be checked first before validations are carried out, and all validations deferred with a "Help me" message if the system itself cannot restore the connection (and generate a warning in the logs).

If the files to validate don't exist then they should be flagged and deferred as there is obviously a problem, as validation process isn't instigated until all the files theoretically exist!

I really don't think it should be a big job to find and modify the error reporting and reaction routines.
ID: 504171 · Report as offensive
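The "check the drive first, then defer" idea in this post could look roughly like the sketch below. Everything here is an assumption for illustration: `validation_pass`, the `is_available` probe (defaulting to `os.path.ismount`), and the `check_file` callback are hypothetical names, not part of BOINC. A real NFS mount can also hang rather than fail cleanly, which this sketch does not handle.

```python
import logging
import os
from typing import Callable, Dict, List, Sequence

log = logging.getLogger("validator_sketch")


def validation_pass(upload_dir: str,
                    filenames: Sequence[str],
                    check_file: Callable[[str], None],
                    is_available: Callable[[str], bool] = os.path.ismount
                    ) -> Dict[str, List[str]]:
    """Validate files under upload_dir, deferring anything that cannot be read."""
    if not is_available(upload_dir):
        # Storage is gone: defer the whole batch and leave a warning in the log.
        log.warning("Help me: upload storage %s is unavailable; deferring %d validations",
                    upload_dir, len(filenames))
        return {"validated": [], "deferred": list(filenames)}

    validated: List[str] = []
    deferred: List[str] = []
    for name in filenames:
        path = os.path.join(upload_dir, name)
        if os.path.isfile(path):
            check_file(path)
            validated.append(name)
        else:
            # The file should exist by now, so flag it and try again later.
            log.warning("Expected upload %s is missing; deferring", path)
            deferred.append(name)
    return {"validated": validated, "deferred": deferred}
```

Passing the availability probe in as a parameter keeps the sketch testable without a real mount; a project would substitute whatever check matches its own storage setup.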
Francesco Forti
Joined: 24 May 00
Posts: 334
Credit: 204,421,005
RAC: 15
Switzerland
Message 504198 - Posted: 16 Jan 2007, 20:50:49 UTC - in response to Message 504145.  



Hi Franz,

Check out this thread. Eric and Pappa are doing exactly that, but the process is manual, and you must inform them which WUs you think are eligible. Pay attention to the first post, explaining what kind of error they are looking for.

Regards,
Wander


Sorry, I have no time to look at ALL my results (as you can see, my RAC is high, in the top 100, so I produce a lot of them), and I hope that you can work out a global batch solution.

For validate errors that happened during Kryten faults, rerun all of them!
If the validate error was genuine, it will again be a "validate error".
But if it was a mistake caused by the system, it will be recovered.

Bye,
Franz
ID: 504198 · Report as offensive
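The batch rerun Franz suggests relies on validation being safe to repeat: genuine errors fail again, while errors caused by missing files are recovered. A minimal sketch of that logic follows, assuming a hypothetical mapping of result names to status strings and a hypothetical `recheck` callback; neither reflects the project's actual database or tools.

```python
from typing import Callable, Dict


def rerun_validate_errors(statuses: Dict[str, str],
                          recheck: Callable[[str], bool]) -> Dict[str, str]:
    """Re-validate every result marked 'validate_error' and return the new statuses.

    recheck(name) should return True only if the result's file is now readable
    and passes validation.
    """
    updated = dict(statuses)
    for name, status in statuses.items():
        if status != "validate_error":
            continue  # leave results that were never in error untouched
        if recheck(name):
            updated[name] = "valid"
        # Otherwise the result stays a validate error, as it was before.
    return updated
```

Because results that fail again keep the same status, running the batch more than once does no harm.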
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19013
Credit: 40,757,560
RAC: 67
United Kingdom
Message 504343 - Posted: 17 Jan 2007, 2:32:12 UTC - in response to Message 504171.  
Last modified: 17 Jan 2007, 2:33:01 UTC

Is this related to the Validate Errors people have been having?

I notice that sah_validate1, 2, 3 & 4 run on kryten.

It's not because the validator code is running on Kryten, it's because the drive system where the uploads are stored becomes disconnected from Kryten. Then when the 3rd "success" result is reported to the Scheduler and it tells the validator to check them, the files cannot be found. The BOINC code only has one response to that; mark those results "Validate error" and go on. And once they're marked as errors, the system will not try to check them again even after the files are once again available.

It's simply a case of Murphy's Law which the design of BOINC failed to anticipate. It should be possible to add something like a "Deferred" status to the possible validate states which would result in another attempt at validating the files after a few hours. But working out all the detail of implementing that would not be trivial. I believe it would involve changes in at least the Scheduler, Validator, and Transitioner code.
                                                               Joe



Thanks for the explanation Joe.
Seems clear that a modification to this oversight is required. Lumping any kind of error as a validate error seems to me more than a little myopic, knowing all the things that can go wrong.

One should make as few assumptions as possible!
NFS drive availability should be checked first before validations are carried out, and all validations deferred with a "Help me" message if the system itself cannot restore the connection (and generate a warning in the logs).

If the files to validate don't exist then they should be flagged and deferred as there is obviously a problem, as validation process isn't instigated until all the files theoretically exist!

I really don't think it should be a big job to find and modify the error reporting and reaction routines.

It may not be as simple as that. As I see it, if Kryten has lost the mount to the database, then the process needs to stop the upload ack that deletes the file on the host.
Another view:
the host uploads to Kryten;
Kryten receives, acknowledges, and passes the file to the database;
the host deletes the file, but the database mount has been lost, so the file is not transferred to the database and is lost;
the host later reports, since the upload was a success, and the validator looks at the database and cannot find three or more files, hence the validate error.

If this view is correct, then Kryten needs to check the mount to the database before issuing the ack to the host, so that the upload can be retried later.

Just my thoughts and as usual could be totally wrong.

Andy
ID: 504343 · Report as offensive
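The ordering Andy describes (store first, acknowledge second) can be sketched as below. This is an illustration only: `store_upload` is a hypothetical helper, not the BOINC upload handler, and it only catches failures that surface as I/O errors. If a lost mount leaves an ordinary local directory behind, the write would succeed locally and this check would not notice.

```python
import os
import tempfile


def store_upload(upload_dir: str, name: str, data: bytes) -> bool:
    """Write an uploaded file durably; return True only if it is safe to ack the host."""
    try:
        # Write to a temporary file in the same directory as the final name.
        fd, tmp_path = tempfile.mkstemp(dir=upload_dir)
    except OSError:
        return False  # storage unreachable: no ack, so the host keeps its file

    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to storage
        # Rename into place only after the data has been written out.
        os.replace(tmp_path, os.path.join(upload_dir, name))
    except OSError:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        return False
    return True
```

A caller would send the acknowledgement only when `store_upload` returns `True`; on `False` the host still has its copy and can retry the upload later.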
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 504642 - Posted: 17 Jan 2007, 22:52:23 UTC - in response to Message 504343.  


...if Kryten has lost the mount to the database then the process needs to stop the upload ack that deletes the file on the host.
...
Andy

Once the mount is gone, it reports "Can't open file" which has that effect. I'm convinced most validate errors are files which were uploaded and written correctly, but unavailable when validation is attempted.

Of course there could be some files which were only partly written at the time the mount was lost. I don't know if the ack is delayed until the write has been completed, though I think it probably is.
                                                             Joe
ID: 504642 · Report as offensive
AlexSilver

Joined: 7 Jan 07
Posts: 1
Credit: 879
RAC: 0
United States
Message 504897 - Posted: 18 Jan 2007, 12:29:09 UTC - in response to Message 504198.  



Hi Franz,

Check out this thread. Eric and Pappa are doing exactly that, but the process is manual, and you must inform them which WUs you think are eligible. Pay attention to the first post, explaining what kind of error they are looking for.

Regards,
Wander


Sorry, I have no time to look at ALL my results (as you can see, my RAC is high, in the top 100, so I produce a lot of them), and I hope that you can work out a global batch solution.

For validate errors that happened during Kryten faults, rerun all of them!
If the validate error was genuine, it will again be a "validate error".
But if it was a mistake caused by the system, it will be recovered.

Bye,
Franz


It seems to make sense to rerun them, if possible... Are the validate errors caused by the same thing that is causing what shows up as 'client error' on the results screen?

thank you,

AlexSilver
ID: 504897 · Report as offensive
Mikey
Volunteer tester
Joined: 1 Jan 07
Posts: 8
Credit: 44,506
RAC: 0
United States
Message 504941 - Posted: 18 Jan 2007, 14:02:10 UTC - in response to Message 502900.  

I thought I was alone out there.

michaelpr
ID: 504941 · Report as offensive
Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 505201 - Posted: 19 Jan 2007, 0:08:05 UTC - in response to Message 504897.  

Alex

Welcome to Seti BOINC. The ones that show as "Client Error" are different from the simple "Validate Errors." The client errors could be a noisy workunit or a hiccup on the computer's part. Eric is aware of and working on the validate error issue...

It seems to make sense to rerun them, if possible... Are the validate errors caused by the same thing that is causing what shows up as 'client error' on the results screen?

thank you,

AlexSilver


Pappa

Please consider a Donation to the Seti Project.

ID: 505201 · Report as offensive
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 505798 - Posted: 20 Jan 2007, 4:44:05 UTC - in response to Message 505201.  

The validator issues will recur as long as kryten keeps losing its mounts. I found a clue to the cause of the problem that I can at least check on the next time it happens.

I fixed a bunch of validate errors today (and a few hundred thousand earlier in the week). If you have some in your accounts that I haven't fixed, let me know so I can find them and try to give credit for them.

Eric
@SETIEric@qoto.org (Mastodon)

ID: 505798 · Report as offensive
TimeLord04
Volunteer tester
Joined: 9 Mar 06
Posts: 21140
Credit: 33,933,039
RAC: 23
United States
Message 506003 - Posted: 20 Jan 2007, 18:06:00 UTC

Copied from Cafe's "Donations to Help Keep Seti Alive" Thread:
(Because of the severity of the issue with Kryten failing, I want to make sure that Eric sees this; this is now one of the three locations where I have posted it...)



Matt needs a replacement for Kryten, please help him get a new server.

Thank you for your donation.





This is NOT (repeat, NOT) a promise; however, there is a SLIGHT possibility of me getting my hands on a Server System in about two weeks. I will know better after the 29th whether this will occur.

At the moment I have no hardware specifications for the Server, nor do I know what it has been used for. The only information I currently have is that one of three Server Systems may become available to me. The company that had been using these machines has gone bankrupt. The Landlord of the place where the company was housed has had to first notify creditors of the assets of the now bankrupt company (these machines being part of the assets). So, until the 29th, these machines are off limits to me.

After the 29th (if the Creditors make no claim or give no response to the Landlord), I will be able to check out these Servers and find out what they are, what OS they have, etc. With this information, can Eric "band-aid" patch Kryten until I can get to these machines and find out what's on them? If so (and if all goes as anticipated), then sometime after the 29th I can take a road trip from SoCal to NorCal to look at these machines and see if one will make a good replacement for Kryten.

Eric can contact Knightmare to get my e-mail address and phone number. I would be happy to talk to Eric about all of this. Hopefully we can make something work here. Oh, and a side note: this place also has a Rack Mounted APC UPS that, if unclaimed, could be snagged for Eric's Rack System at Berkeley for SETI@Home and SETI Beta's use...


Later,


TimeLord04
Have TARDIS, will travel...
Come along K-9!
Join Calm Chaos
ID: 506003 · Report as offensive
Patrick Terrell Weedon
Joined: 18 Jan 07
Posts: 9
Credit: 101,585
RAC: 0
Canada
Message 506191 - Posted: 20 Jan 2007, 23:55:55 UTC

How do you report when work is done? I cannot find the update button in BOINC Manager.
ID: 506191 · Report as offensive
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 506193 - Posted: 20 Jan 2007, 23:59:52 UTC

Reporting will happen automatically via a schedule. If you're in a hurry, click on the "projects" tab, click on "setiathome......" in the right hand box to highlight it (select the project), then click the "update" button to the left.

welcome aboard

tony
ID: 506193 · Report as offensive

 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.