Astropulse continually resetting

ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1428688 - Posted: 14 Oct 2013, 21:32:36 UTC

Have we discussed this syndrome yet? An Astropulse work unit kept resetting every ten seconds until I aborted it:
http://setiathome.berkeley.edu/result.php?resultid=3195596954

Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1428708 - Posted: 14 Oct 2013, 22:27:23 UTC - in response to Message 1428688.  

That sounds like the new heartbeat API method misbehaving, but that app was built something like six months earlier, so it should still be using the old method. The change was described like this:

client and API: improve the way an app checks for the death of the client.

Old: heartbeat mechanism
Problem: if the client is blocked for > 30 secs (e.g. because it takes a long time to write the state file, or because it's stopped in a debugger) then apps exit. This is bad if the app doesn't checkpoint and has been running for a long time.

New: the client passes its PID to the app.
The app periodically (10 sec) checks that the process still exists.

Notes:
For backward compatibility (e.g. new API w/ old client, or vice versa) the client still sends heartbeats, and the API checks heartbeats if the client doesn't pass a PID.
The new mechanism works only if the client's PID isn't assigned to a new process within 10 secs of the client exiting. Windows 2000 reuses PIDs immediately, so check for Win2K and don't use this mechanism if so.

TODO: For Unix multithread apps, critical sections aren't currently being enforced. Need to fix this by masking signals.
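
To make the fallback rule in those notes concrete, here is a rough sketch of how I read it. This is not the BOINC API code: the function and constant names are made up for illustration, and only the 30 second and 10 second figures and the Win2K exception come from the note above.

```python
from typing import Callable, Optional

# Figures taken from the change note; the names are hypothetical.
HEARTBEAT_TIMEOUT_SECS = 30  # old method: app exits if no heartbeat for > 30 secs
PID_CHECK_PERIOD_SECS = 10   # new method: app checks the client PID every 10 secs


def choose_mechanism(client_pid: Optional[int], is_win2k: bool) -> str:
    """Pick the liveness check as the note describes it.

    Use the PID check only if the client passed a PID and the OS is not
    Windows 2000 (which reuses PIDs immediately); otherwise fall back to
    heartbeats.
    """
    if client_pid is not None and not is_win2k:
        return "pid"
    return "heartbeat"


def client_seems_dead(
    mechanism: str,
    *,
    now: float,
    last_heartbeat: float,
    pid_exists: Callable[[], bool],
) -> bool:
    """Decide whether the app should treat the client as gone."""
    if mechanism == "pid":
        # New method: a stalled client still has a live process, so the
        # app keeps running even if heartbeats have stopped.
        return not pid_exists()
    # Old method: a long enough silence counts as death.
    return now - last_heartbeat > HEARTBEAT_TIMEOUT_SECS
```

With the "pid" mechanism a client that is blocked for 40 seconds is still considered alive as long as its process exists, which is exactly the case the old heartbeat method got wrong.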

Einstein had a problem with their Linux Gamma-ray pulsar search #2 app, where BOINC restarted the app every 11 seconds. David produced a fix:

• API: fix Unix bug when checking if client is alive based on PID.

Can't use waitpid() here; works only for children.
Use kill(pid, 0) instead.
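
For anyone wondering what the kill(pid, 0) trick looks like, here is a minimal sketch using Python's os.kill, which wraps the same system call on Unix. The helper name process_exists is made up for illustration; the actual fix is in BOINC's C++ API code, which I'm not reproducing here. Signal 0 sends nothing and only asks the kernel whether the PID can be signalled, so unlike waitpid() it works for a process that is not our child.

```python
import os


def process_exists(pid: int) -> bool:
    """Return True if a process with this PID currently exists (Unix only).

    Hypothetical helper. On Windows os.kill behaves differently and must
    not be used this way.
    """
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is sent
    except ProcessLookupError:
        # ESRCH: no such process.
        return False
    except PermissionError:
        # EPERM: the process exists but belongs to another user.
        return True
    return True
```

Usage would be along the lines of calling process_exists(client_pid) every ten seconds and exiting once it returns False.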

See this thread:

Trouble with Gamma-ray pulsar search #2 v0.01

But the first thing I'd try is running the most recent BOINC 7.2.x version. 7.2.4 was only the second release of the BOINC 7.2.x line; try BOINC 7.2.20, which came out only four days ago, as opposed to three months ago:

BOINC 7.1/7.2 Change Log and News

Claggy
ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1428715 - Posted: 14 Oct 2013, 22:49:47 UTC - in response to Message 1428708.  
Last modified: 14 Oct 2013, 23:01:15 UTC


Claggy wrote: "But the first thing I'd try is running the most recent BOINC 7.2.x version. 7.2.4 was only the second release of the BOINC 7.2.x line; try BOINC 7.2.20, which came out only four days ago, as opposed to three months ago."

Interesting. That came to light after I'd backed off to an earlier kernel (SLC6 vs Ubuntu 13.04). It's still being reported as 7.2.4 even though I had to rebuild 7.0.65 due to library incompatibilities (and I don't recall being offered 7.2.4; I thought the download was still 7.0.65). I'll just keep an eye on it for the time being, as I'm only running s@h on this machine in the background while I build it up as a Xeon Phi server. Maybe I made a mistake with my git commands...

[Edit] Ah, looking at my command history, I forgot to do
git checkout client_release/7.0/7.0.65; git status

before compiling...
[/Edit]
