Upload problems.
Author | Message |
---|---|
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
Kryten (aka setiboincdata) is being a bad boy again. A reboot fixed the upload problem, but now I'm unable to log in. I may need to run to the lab and give it another kick. I have a few tricks to try, but I think we're going to need to replace kryten at some point in the near future. With what, I don't know. @SETIEric@qoto.org (Mastodon) |
Brock Joined: 19 Dec 06 Posts: 201 Credit: 774,488 RAC: 0 |
Thanks for the update, Eric. |
bernt Joined: 10 Dec 06 Posts: 27 Credit: 131,599 RAC: 0 |
"Kryten (aka setiboincdata) is being a bad boy again. A reboot fixed the upload problem, but now I'm unable to log in. I may need to run to the lab and give it another kick." Eric, my box runs unattended for 11 hrs every day. I have seen a lot of remarks about closing down network traffic to preserve results. Is this really necessary? I mean, when the system is back in business, it will take care of the work done, or ...? I don't care so much about RAC; I am more interested in whether the work my PC is doing contributes to the project. Kind regards from Bernt in Sweden |
kinhull Joined: 3 Oct 03 Posts: 1029 Credit: 636,475 RAC: 0 |
Is this related to the Validate Errors people have been having? I notice that sah_validate1, 2, 3 & 4 run on kryten. Join TeamACC Sometimes I think we are alone in the universe, and sometimes I think we are not. In either case the idea is quite staggering. |
John Clark Joined: 29 Sep 99 Posts: 16515 Credit: 4,418,829 RAC: 0 |
Eric ... thanks for the update! It's good to be back amongst friends and colleagues |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
"Is this related to the Validate Errors people have been having?" It's not because the validator code is running on Kryten; it's because the drive system where the uploads are stored becomes disconnected from Kryten. Then, when the 3rd "success" result is reported to the Scheduler and it tells the validator to check them, the files cannot be found. The BOINC code only has one response to that: mark those results "Validate error" and go on. And once they're marked as errors, the system will not try to check them again, even after the files are once again available. It's simply a case of Murphy's Law which the design of BOINC failed to anticipate. It should be possible to add something like a "Deferred" status to the possible validate states, which would result in another attempt at validating the files after a few hours. But working out all the details of implementing that would not be trivial. I believe it would involve changes in at least the Scheduler, Validator, and Transitioner code. Joe |
kittyman Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
"Is this related to the Validate Errors people have been having?" Thank you very much for the explanation, Joe! I think this is the first time anybody has laid out how the validate errors actually occur. Perhaps you could link this to the various 'validate error' threads that are on the forum right now. It might help others a bit to understand how this is actually happening. "Time is simply the mechanism that keeps everything from happening all at once." |
Francesco Forti Joined: 24 May 00 Posts: 334 Credit: 204,421,005 RAC: 15 |
"[....] The BOINC code only has one response to that; mark those results "Validate error" and go on. And once they're marked as errors, the system will not try to check them again even after the files are once again available.[...]" Is it possible to rerun the validation process for all the validate error workunits, so that the work can be credited? This could be done later, of course, as part of a general recovery procedure after Kryten's fault and restart. Thanks, Franz |
Wander Saito Joined: 7 Jul 03 Posts: 555 Credit: 2,136,061 RAC: 0 |
"[....] The BOINC code only has one response to that; mark those results "Validate error" and go on. And once they're marked as errors, the system will not try to check them again even after the files are once again available.[...]" Hi Franz, Check out this thread. Eric and Pappa are doing exactly that, but the process is manual, and you must inform them which WUs you think are eligible. Pay attention to the first post, which explains what kind of error they are looking for. Regards, Wander |
Andy Lee Robinson Joined: 8 Dec 05 Posts: 630 Credit: 59,973,836 RAC: 0 |
"Is this related to the Validate Errors people have been having?" Thanks for the explanation, Joe. It seems clear that a fix for this oversight is required. Lumping any kind of error together as a validate error seems to me more than a little myopic, knowing all the things that can go wrong. One should make as few assumptions as possible! NFS drive availability should be checked first, before validations are carried out, and all validations deferred with a "Help me" message if the system itself cannot restore the connection (and generate a warning in the logs). If the files to validate don't exist, then they should be flagged and deferred, as there is obviously a problem: the validation process isn't started until all the files theoretically exist! I really don't think it should be a big job to find and modify the error reporting and reaction routines. |
Francesco Forti Joined: 24 May 00 Posts: 334 Credit: 204,421,005 RAC: 15 |
Sorry, I have no time to look at ALL my results (as you see, my RAC is high, in the best 100, so I produce a lot of them), and I hope that you can work out a global batch solution. For validate errors that happen during Kryten faults, rerun all of them! If the validate error was genuine, it will again be a "validate error". But if it was a mistake caused by the system, it will be recovered. Bye, Franz |
W-K 666 Joined: 18 May 99 Posts: 19367 Credit: 40,757,560 RAC: 67 |
"Is this related to the Validate Errors people have been having?" It may not be as simple as that. As I see it, if Kryten has lost the mount to the database, then the process needs to stop the upload ack that deletes the file on the host. Another view: the host uploads to Kryten; Kryten receives, acknowledges, and passes to the database; the host deletes the file, but the database mount has been lost, so the file is not transferred to the database and is lost. The host later reports, since the upload was a success; the validator looks at the database and cannot find three or more files, hence the validate error. If this view is correct, then Kryten needs to check the mount to the database before issuing the ack to the host, so that the upload can be retried later. Just my thoughts, and as usual I could be totally wrong. Andy |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Once the mount is gone, it reports "Can't open file" which has that effect. I'm convinced most validate errors are files which were uploaded and written correctly, but unavailable when validation is attempted. Of course there could be some files which were only partly written at the time the mount was lost. I don't know if the ack is delayed until the write has been completed, though I think it probably is. Joe |
AlexSilver Joined: 7 Jan 07 Posts: 1 Credit: 879 RAC: 0 |
Seems to make sense, if possible, to rerun them... Are the validate errors caused by the same thing that is showing up as 'client error' on the results screen? Thank you, AlexSilver |
Mikey Joined: 1 Jan 07 Posts: 8 Credit: 44,506 RAC: 0 |
I thought I was alone out there. michaelpr |
Pappa Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
Alex, welcome to Seti BOINC. The ones that show as "Client Error" are different from the simple "Validate Errors." The client errors could be a noisy workunit or a hiccup on the computer's part. Eric is aware of the validate error issue and is working on it... "Seems to make sense if possible to rerun them... Are the validate errors caused by the same which is causing what is showing up as 'client error' on results screen?" Pappa Please consider a Donation to the Seti Project. |
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
The validator issues will recur as long as Kryten keeps losing its mounts. I found a clue to the cause of the problem that I can at least check on the next time it happens. I fixed a bunch of validate errors today (and a few hundred thousand earlier in the week). If you have some in your accounts that I haven't fixed, let me know so I can find them and try to give credit for them. Eric @SETIEric@qoto.org (Mastodon) |
TimeLord04 Joined: 9 Mar 06 Posts: 21140 Credit: 33,933,039 RAC: 23 |
Copied from Cafe's "Donations to Help Keep Seti Alive" Thread: (Because of the severity of the issue with Kryten failing, I want to make sure that Eric sees this in, now, one of the three locations where I have posted it...) Matt needs a replacement for Kryten, please help him get a new server. This is NOT (repeat, NOT) a promise; however, there is a SLIGHT possibility of me getting my hands on a Server System in about two weeks. I will know better after the 29th if this will occur. At the moment I have no hardware specifications on what the Server is, nor what it has been used for. The only information I currently have is that one of three Server Systems may become available to me. The company that had been using these machines has gone bankrupt. The Landlord of the place where the company was housed has had to first notify creditors of the assets of the now bankrupt company. (These machines being part of the assets... So, until the 29th, these machines are off limits to me.) After the 29th (if the Creditors make no claim or give no response to the Landlord), I will be able to check out these Servers and find out what they are, what OS they have, etc... With this information, can Eric "band-aid"-patch Kryten until I can get to these machines and find out what's on them? If so (and if all goes as anticipated), then sometime after the 29th I can take a road trip from SoCAL to NorCAL to look at these machines and see if one will make a good replacement for Kryten. Eric can contact Knightmare to get my e-mail address and phone number. I would be happy to talk to Eric about all of this. Hopefully we can make something work here. Oh, and a side note... This place also has a Rack Mounted APC UPS that, if unclaimed, could be snagged for Eric's Rack System at Berkeley for SETI@Home and SETI Beta's use... Later, TimeLord04 Have TARDIS, will travel... Come along K-9! Join Calm Chaos |
Patrick Terrell Weedon Joined: 18 Jan 07 Posts: 9 Credit: 101,585 RAC: 0 |
How do you report when work is done? I cannot find the update button in BOINC Manager. |
Astro Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0 |
Reporting will happen automatically on a schedule. If you're in a hurry, click on the "projects" tab, click on "setiathome......" in the right hand box to highlight it (select the project), then click the "update" button to the left. Welcome aboard! tony |
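
The "Deferred" idea raised by Joe and Andy Lee Robinson in this thread could look roughly like the sketch below. It is not BOINC's validator code and does not use any BOINC API: `ValidateState`, `check_result`, `compare`, and `is_available` are all hypothetical names made up for illustration. The assumption is that the validator can ask whether the upload storage is reachable before it concludes that a file is missing.

```python
import os
from enum import Enum


class ValidateState(Enum):
    VALID = "valid"
    INVALID = "invalid"
    ERROR = "validate error"
    DEFERRED = "deferred"  # hypothetical state, as proposed in the thread


def check_result(upload_dir, filename, compare, is_available=os.path.ismount):
    """Pick a validate state for one uploaded result file.

    compare:      callable(path) -> bool, stands in for the real comparison.
    is_available: callable(dir) -> bool, stands in for a storage health check.
    """
    # Storage unreachable: defer instead of recording a permanent error.
    if not is_available(upload_dir):
        return ValidateState.DEFERRED

    path = os.path.join(upload_dir, filename)
    # Storage is up but the file is absent: treat as a genuine validate error.
    if not os.path.isfile(path):
        return ValidateState.ERROR

    return ValidateState.VALID if compare(path) else ValidateState.INVALID
```

A caller would put `DEFERRED` results back in the queue and try again after a few hours, which is the part Joe expects to touch the Scheduler, Validator, and Transitioner. The default `os.path.ismount` check assumes the upload directory is itself the mount point; a hung NFS mount may block rather than return `False`, so a real check would likely need a timeout.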