Seti_enhanced & client_state errors
Author | Message |
---|---|
W-K 666 Joined: 18 May 99 Posts: 19691 Credit: 40,757,560 RAC: 67 |
WinterKnight, Daniel, From what I can see, your concerns about this problem are probably not down to your hardware. But seeing what Bill Michael and TMR posted earlier, it would seem that it is not a SetiB problem but a BOINC problem, as it is overwriting both client_state.xml files at restart and not using an automated recovery if the current file is corrupt. I would follow TMR's advice and inform the BOINC devs (he gave the link earlier), and refer them to this thread so they can see you are doing everything possible to ensure a reliable hardware platform. |
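The "automated recovery" described in the post above could look roughly like the sketch below. This is not BOINC's actual code (the client is written in C++); it only illustrates the idea of trying the current state file first and falling back to the previous copy when the current one does not parse. The function name `load_state` is hypothetical; the two file names are the ones mentioned in this thread.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def load_state(directory, current="client_state.xml", backup="client_state_prev.xml"):
    """Return (root_element, file_name) for the first state file that parses.

    Hypothetical helper for illustration only. Tries the current file,
    then the backup, and raises ValueError if neither is usable.
    """
    for name in (current, backup):
        path = Path(directory) / name
        try:
            # A truncated or empty file raises ParseError;
            # a missing or unreadable file raises OSError.
            return ET.parse(str(path)).getroot(), name
        except (ET.ParseError, OSError):
            continue
    raise ValueError("no usable state file found")
```

With a check like this at startup, a corrupt current file would cost at most the work done since the backup was written, rather than the whole cache.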
Don Erway Joined: 18 May 99 Posts: 305 Credit: 471,946 RAC: 0 |
Is the suspect machine prime95 stable? |
Daniel Schaalma Joined: 28 May 99 Posts: 297 Credit: 16,953,703 RAC: 0 |
Is the suspect machine prime95 stable? @Don: Absolutely. This was one of the many diagnostics I ran on each machine that the error has occurred on. In fact, I run this as part of a burn-in procedure on every machine I build before it goes into service. Daniel, @WinterKnight: Very well. I shall post a message to the BOINC devs and link them to this thread. Actually, from what it looks like, every time BOINC writes to the disk (every 60 seconds), it rewrites the client_state.xml file as well as the client_state_prev.xml file. So, if for any reason client_state gets corrupted, you have only 60 seconds to exit BOINC and restore the client_state_prev file, or else the loss is permanent. The odds of catching this within even several minutes of the error occurring are remote at best. I'd have to be at the keyboard of the machine it happened on, at the exact moment it happened, to have even the slightest chance of preventing this disaster. Unfortunately, the earliest I have ever found out about this problem is 15 minutes after it happened, as I have rarely been either home or awake at the time of the problem. My first knowledge of the problem is when I see the machine has not contacted the server for quite some time, look at the machine, and see the infamous "can't parse state file" message in the BOINC message log. I'll just have to wait and see what I hear from the devs, if anything. Thanks. Regards, Daniel. |
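The 60-second window described above exists because the backup is refreshed unconditionally. One way to close it, sketched here under the assumption that the backup is simply a copy of the current file, is to refresh the backup only when the current file still parses. This is an illustration, not how BOINC behaves; `rotate_backup` is a hypothetical name, and both arguments are plain string paths.

```python
import os
import shutil
import xml.etree.ElementTree as ET


def rotate_backup(current_path, backup_path):
    """Copy current -> backup only if current is well-formed XML.

    Hypothetical helper. Returns True if the backup was refreshed,
    False if the current file was unusable and the backup was left alone.
    """
    try:
        ET.parse(current_path)
    except (ET.ParseError, OSError):
        return False  # keep the last known-good backup untouched
    tmp_path = backup_path + ".tmp"
    shutil.copyfile(current_path, tmp_path)
    # Swap the new copy into place in one step so a crash mid-copy
    # cannot leave a half-written backup.
    os.replace(tmp_path, backup_path)
    return True
```

With this guard, a corrupted client_state.xml never propagates into client_state_prev.xml, so the good copy survives no matter how long it takes someone to notice.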
Don Erway Joined: 18 May 99 Posts: 305 Credit: 471,946 RAC: 0 |
Is the suspect machine prime95 stable? On my AMD64 system, I found I actually had to slow my FSB down about 4-5 MHz BELOW prime95 stable for BOINC and SETI to run reliably. And the problem was exactly this: a bad state file, restarting the project, and abandoning the WUs. I still think it might be worth a try... |
Daniel Schaalma Joined: 28 May 99 Posts: 297 Credit: 16,953,703 RAC: 0 |
On my AMD64 system, I found I actually had to slow my FSB down about 4-5 MHz BELOW prime95 stable for BOINC and SETI to run reliably. And the problem was exactly this: a bad state file, restarting the project, and abandoning the WUs. Don, Is it possible that you had to slow it down because dynamic overclocking was enabled? All the newer AMD64 and Intel P4 motherboards have dynamic overclocking, in one form or another, ENABLED by default. They generally overclock the system by 3 to 5 percent, depending upon CPU load, and I've seen some go as high as 10 percent! I always disable dynamic overclocking on my machines. Regards, Daniel. |
W-K 666 Joined: 18 May 99 Posts: 19691 Credit: 40,757,560 RAC: 67 |
@WinterKnight: Daniel, have you had a reply from the devs yet? |
Don Erway Joined: 18 May 99 Posts: 305 Credit: 471,946 RAC: 0 |
On my AMD64 system, I found I actually had to slow my FSB down about 4-5 MHz BELOW prime95 stable for BOINC and SETI to run reliably. And the problem was exactly this: a bad state file, restarting the project, and abandoning the WUs. No, for sure not. I have yet to have any system, ever, that will run prime95 stable with dynamic OC on. I always disable it. All I'm saying is try it. Either bump your Vcore up one notch, or drop your FSB a few, just to see if the error goes away. There is always inevitable variation in every component of every system. Sometimes you get unlucky and get two outliers together, like a CPU with a floating point unit that wants right at the high end of the specified voltage range to always work correctly, and a mobo resistor that causes the board to undervolt to the very bottom of its specified range... These processors are very tolerant of voltage, and just the stock retail HSF is plenty good enough, so don't worry about a 5% boost to Vcore. Don |
Joined: 19 Jul 00 Posts: 3898 Credit: 1,158,042 RAC: 0 |
Dr. Anderson checked in some changes to the write side of the client state files. This should improve the chances that the files are correctly "flushed" and stable on the disk drive. We had a longer discussion on the startup side and I re-posed the question about the logic on using the state file and backup ... we wait again ... :) But in 5.4.x we should see an improved system. For those with unstable power, trying one of the beta (stable beta) versions *MAY* be a good option for you. Not sure exactly when this fix will appear, most likely in the next beta drop. Note I do not (normally) advocate running the betas for production ... but this is one of those cases where it *MAY* be justified. |
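The post does not say what the checked-in changes actually were. As a general illustration of what making a file "flushed and stable on the disk" usually involves, a common pattern is to write to a temporary file, flush and fsync it, and only then rename it over the real file. The sketch below assumes that pattern; `write_state_durably` is a hypothetical name, not a BOINC function.

```python
import os


def write_state_durably(path, data):
    """Write bytes to `path` so that a crash leaves either the old or the new file.

    Hypothetical helper illustrating the write-temp, fsync, rename pattern.
    `path` is a string path and `data` is a bytes object.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push its cache to the disk
    # Replace the old file in a single step; readers never see a partial file.
    os.replace(tmp_path, path)
```

On a machine with unstable power, the fsync step is the one that matters: without it, the rename can reach the disk before the data does.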
Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0 |
Dr. Anderson checked in some changes to the write side of the client state files. This should improve the chances that the files are correctly "flushed" and stable on the disk drive. Hope this is true. I lost a 4 day cache on one box today due to this bug. Looks like Boinc completely restarted and started downloading work units to fill the cache again. Log file said something about a corrupted client_state file. The box in question runs by itself and rarely has a human messing around with it. This has now happened to me 3 times in the past 2 months. Boinc....Boinc....Boinc....Boinc.... |
Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0 |
Since I have another machine with the same corrupted client_state file, I need to bump this thread. That's two machines for this week! Anyone know when the fixes checked in by Dr. Anderson will be available in a new version of Boinc? I strongly suspect this is happening very frequently and is resulting in a lot of "orphaned" files in the database. A weak cry in the dark for Help! Boinc....Boinc....Boinc....Boinc.... |
Ingleside Joined: 4 Feb 03 Posts: 1546 Credit: 15,832,022 RAC: 13 |
Since I have another machine with the same corrupted client_state file, I need to bump this thread. That's two machines for this week! Anyone know when the fixes checked in by Dr. Anderson will be available in a new version of Boinc? I strongly suspect this is happening very frequently and is resulting in a lot of "orphaned" files in the database. A weak cry in the dark for Help! Well, if you're adventurous you can try v5.3.6; since it was built yesterday, it should AFAIK contain the possible fix. Oh, and while v5.2.15 is more recently built, it should not contain this fix... Now, since v5.3.6 isn't even being alpha-tested, use at your own risk, and don't complain if your computer starts shooting out rockets at New Year... |
Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0 |
Well, if you're adventurous you can try v5.3.6; since it was built yesterday, it should AFAIK contain the possible fix. Thanks for the information. Rockets on New Year's would be cool! Boinc....Boinc....Boinc....Boinc.... |
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.