Seti_enhanced & client_state errors

W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19691
Credit: 40,757,560
RAC: 67
United Kingdom
Message 219020 - Posted: 21 Dec 2005, 3:10:23 UTC - in response to Message 219010.  

WinterKnight,

Sorry for the delayed response, but it was unavoidable. I appreciate the second opinion. There are actually only 20 machines in operation. The one that hasn't reported for a while was my first experiment with Linux, which didn't work. The current Linux box is its replacement. There are also 2 machines running XP-64Bit. The first is a Dual Opteron 246 machine, with a 550 Watt (sustained under load, tested and confirmed) power supply, MSI K8T Master2-FAR mainboard, 1 GIG Samsung PC2700 ECC Registered SDRAM (2x512), 20 gig EIDE HD, CDROM, floppy and GeForce 4400MX video. Just the bare essentials. The other XP-64Bit machine is an AMD64 3500+ S939 on an Asus A8V-MX mainboard with 512MB Samsung PC3200 SDRAM, 80GIG SATA HD, CDROM, and floppy, 450 Watt (sustained under load, tested and confirmed) power supply.

All input power for my machines is filtered through a Deltec 3.1KVA UPS with line conditioning. It will power my network for 11 minutes, long enough to do a safe shutdown, provided I am here when the brownout happens. So, if the anomalous occurrences were due to a power failure, all the machines would have gone down. Also, all vcore, memory voltages, etc. are well within spec. I _never_ allow the HDs to spin down while machines are powered. All power management is disabled on all machines. I power off the monitor when I am away.

I have also read about this error in 2 or 3 other threads (haven't had the time to find and link to them for you yet).

Regards, Daniel.


Daniel,

From that I can see that this problem is probably not down to your hardware. But going by what Bill Michael and TMR posted earlier, it would seem that it is not a SetiB problem but a BOINC problem: BOINC is overwriting both client_state.xml files at restart and is not attempting an automated recovery if the current file is corrupt.

I would follow TMR's advice and inform the BOINC devs (he gave the link earlier), and refer them to this thread so they can see you are doing everything possible to ensure a reliable hardware platform.
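
For illustration only, the kind of automated recovery described above could look like the following Python sketch. This is not BOINC's actual code: the function name `load_state` is hypothetical, and the file names are simply the ones mentioned in this thread. It tries the current state file first and falls back to the previous one if the current file is missing or does not parse as XML.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def load_state(data_dir, names=("client_state.xml", "client_state_prev.xml")):
    """Return (path, root) for the first candidate file that parses as XML.

    Hypothetical helper: candidates are tried in order, so the backup is
    only used when the current file is missing, unreadable, or malformed.
    """
    for name in names:
        path = Path(data_dir) / name
        try:
            return path, ET.parse(path).getroot()
        except (OSError, ET.ParseError):
            # Fall through to the next candidate instead of giving up.
            continue
    raise RuntimeError("no usable state file found")
```

A well-formedness check like this only catches truncated or garbled files; a file that parses but holds wrong values would still be accepted.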

ID: 219020
Don Erway
Volunteer tester

Joined: 18 May 99
Posts: 305
Credit: 471,946
RAC: 0
United States
Message 219032 - Posted: 21 Dec 2005, 3:26:19 UTC

Is the suspect machine prime95 stable?


ID: 219032
Daniel Schaalma
Volunteer tester

Joined: 28 May 99
Posts: 297
Credit: 16,953,703
RAC: 0
United States
Message 219206 - Posted: 21 Dec 2005, 12:09:20 UTC


@Don:
Absolutely. This was one of the many diagnostics I ran on each machine that the error has occurred on. In fact, I run this as part of a burn-in procedure on every machine I build before it goes into service.


@WinterKnight:

Very well. I shall post a message to the BOINC devs and link them to this thread. Actually, from what it looks like, every time BOINC writes to the disk (every 60 seconds), it rewrites the client_state.xml file as well as the client_state_prev.xml file. So, if for any reason client_state gets corrupted, you have only 60 seconds to exit BOINC and restore the client_state_prev file, or else the loss is permanent. The odds of catching this within even several minutes of the error occurring are remote at best. I'd have to be at the keyboard of the machine it happened on at the exact moment it happened in order to have even the slightest chance of preventing this disaster. Unfortunately, the earliest I have ever found out about this problem is 15 minutes after it happened, as I have rarely been either home or awake at the time of the problem. My first knowledge of the problem is when I see the machine has not contacted the server for quite some time, look at the machine, and see the infamous "can't parse state file" message in the BOINC message log.
I'll just have to wait and see what I hear from the devs, if anything. Thanks.
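
As a stopgap for the 60-second window described above, one could keep independent snapshots of the state file outside BOINC's own rotation. The following is a minimal Python sketch under stated assumptions: the helper names `is_well_formed` and `snapshot_if_valid` and the backup directory are hypothetical, and "valid" here only means "parses as XML". It copies the file first and then validates the copy, so a file caught mid-write is discarded rather than stored.

```python
import shutil
import time
import xml.etree.ElementTree as ET
from pathlib import Path


def is_well_formed(path):
    """Return True if the file exists and parses as XML."""
    try:
        ET.parse(path)
        return True
    except (OSError, ET.ParseError):
        return False


def snapshot_if_valid(state_file, backup_dir, keep=10):
    """Copy state_file into backup_dir, keeping the copy only if it parses.

    Hypothetical helper: at most `keep` snapshots are retained, oldest
    deleted first. Returns the snapshot path, or None if nothing was kept.
    """
    state_file, backup_dir = Path(state_file), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest = backup_dir / f"{state_file.stem}.{time.time_ns()}{state_file.suffix}"
    try:
        shutil.copy2(state_file, dest)
    except OSError:
        return None
    if not is_well_formed(dest):
        dest.unlink()  # never let a corrupt copy displace good snapshots
        return None
    # Nanosecond timestamps of equal length sort chronologically by name.
    backups = sorted(backup_dir.glob(f"{state_file.stem}.*{state_file.suffix}"))
    for old in backups[:max(len(backups) - keep, 0)]:
        old.unlink()
    return dest
```

Run from a scheduled task every minute or so, this would leave a recent good copy to restore from even when the corruption is only noticed hours later. The state file path depends on where BOINC is installed and is left to the caller.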

Regards, Daniel.
ID: 219206
Don Erway
Volunteer tester

Joined: 18 May 99
Posts: 305
Credit: 471,946
RAC: 0
United States
Message 219313 - Posted: 21 Dec 2005, 17:03:17 UTC - in response to Message 219206.  


On my AMD64 system, I found I actually had to slow my fsb down about 4-5 MHz BELOW prime95 stable for BOINC and SETI to run reliably. And the problem was exactly this: a bad state file, the project restarting, and the WUs being abandoned.

I still think it might be worth a try...


ID: 219313
Daniel Schaalma
Volunteer tester

Joined: 28 May 99
Posts: 297
Credit: 16,953,703
RAC: 0
United States
Message 219597 - Posted: 22 Dec 2005, 5:32:10 UTC - in response to Message 219313.  


Don,
Is it possible that you had to slow it down because dynamic overclocking was enabled? All the newer AMD64 and Intel P4 motherboards have dynamic overclocking, in one form or another, ENABLED by default. They generally overclock the system by 3 to 5 percent, depending upon CPU load, and I've seen some go as high as 10 percent! I always disable dynamic overclocking on my machines.

Regards, Daniel.
ID: 219597
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19691
Credit: 40,757,560
RAC: 67
United Kingdom
Message 219614 - Posted: 22 Dec 2005, 6:10:52 UTC - in response to Message 219206.  


Daniel,
Have you had a reply from the devs yet?

ID: 219614
Don Erway
Volunteer tester

Joined: 18 May 99
Posts: 305
Credit: 471,946
RAC: 0
United States
Message 219616 - Posted: 22 Dec 2005, 6:13:33 UTC - in response to Message 219597.  


No, for sure not. I have yet to have any system, ever, that will run prime95 stable with dynamic OC on. I always disable it.

All I'm saying is try it. Either bump your Vcore up one notch, or drop your fsb a few MHz, just to see if the error goes away. There is always inevitable variation in every component of every system. Sometimes you get unlucky and get 2 outliers together: a cpu whose floating point unit wants a voltage right at the high end of the specified range to always work correctly, and a mobo resistor that causes the board to undervolt to the very bottom of its specified range...

These processors are very tolerant of voltage, and just the stock retail hsf is plenty good enough, so don't worry about a 5% boost to Vcore.

Don


ID: 219616
Paul D. Buck
Volunteer tester

Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 219659 - Posted: 22 Dec 2005, 8:00:54 UTC

Dr. Anderson checked in some changes to the write side of the client state files. This should improve the chances that the files are correctly "flushed" and stable on the disk drive.

We had a longer discussion on the start up side and I re-posed the question about the logic on using the state file and back-up ... we wait again ... :)

But in 5.4.x we should see an improved system. For those with unstable power, trying one of the beta (stable beta) versions *MAY* be a good option. I am not sure exactly when this fix will appear; most likely in the next beta drop.

Note I do not advocate (normally) running the betas for production ... but, this is one of those cases where it *MAY* be justified.
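
For readers wondering what getting a file correctly "flushed" and stable on disk can involve, here is a minimal Python sketch of one common technique: write to a temporary file, flush and fsync it, then rename it over the target. This is a general pattern and not a description of the actual BOINC check-in; the function name `write_atomically` is hypothetical.

```python
import os
import tempfile


def write_atomically(path, data: bytes):
    """Replace `path` with `data` so readers see the old or the new file.

    Hypothetical helper illustrating a write-then-rename pattern.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to work.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to push its cache to the disk
        os.replace(tmp, path)     # swap the new file into place in one step
    except BaseException:
        # On any failure, remove the temp file and leave `path` untouched.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

With this pattern a crash or power cut during the write should leave the previous file intact, although the guarantees ultimately depend on the operating system, the filesystem, and the drive's own write cache.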
ID: 219659
Geek@Play
Volunteer tester

Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 222240 - Posted: 28 Dec 2005, 4:05:15 UTC - in response to Message 219659.  


Hope this is true. I lost a 4 day cache on one box today due to this bug. Looks like Boinc completely restarted and started downloading work units to fill the cache again. Log file said something about a corrupted client_state file. The box in question runs by itself and rarely has a human messing around with it. This has now happened to me 3 times in the past 2 months.



Boinc....Boinc....Boinc....Boinc....
ID: 222240
Geek@Play
Volunteer tester

Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 222911 - Posted: 29 Dec 2005, 19:40:55 UTC
Last modified: 29 Dec 2005, 20:11:37 UTC

Since I have another machine with the same corrupted client_state file, I need to bump this thread. That's two machines this week! Anyone know when the fixes checked in by Dr. Anderson will be available in a new version of BOINC? I strongly suspect this is happening very frequently and is resulting in a lot of "orphaned" files in the database. A weak cry in the dark for help!


Boinc....Boinc....Boinc....Boinc....
ID: 222911
Ingleside
Volunteer developer

Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 222941 - Posted: 29 Dec 2005, 20:53:47 UTC - in response to Message 222911.  
Last modified: 29 Dec 2005, 20:55:05 UTC


Well, if you're adventurous you can try v5.3.6; since it was built yesterday, it should AFAIK contain the possible fix.
Oh, and while v5.2.15 is more recently built, it should not contain this fix...

Now, since v5.3.6 isn't even being alpha-tested, use at your own risk, and don't complain if your computer starts shooting out rockets at New Year...
ID: 222941
Geek@Play
Volunteer tester

Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 222945 - Posted: 29 Dec 2005, 21:09:36 UTC


Thanks for the information. Rockets on New Year's would be cool!


Boinc....Boinc....Boinc....Boinc....
ID: 222945


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.