Astropulse continually resetting


ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 639
Credit: 146,911,118
RAC: 68,032
United Kingdom
Message 1428688 - Posted: 14 Oct 2013, 21:32:36 UTC

Have we discussed this syndrome yet? An Astropulse work unit kept resetting every ten seconds until I aborted it:
http://setiathome.berkeley.edu/result.php?resultid=3195596954

____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4213
Credit: 34,475,644
RAC: 15,871
United Kingdom
Message 1428708 - Posted: 14 Oct 2013, 22:27:23 UTC - in response to Message 1428688.

That sounds like the new heartbeat API method misbehaving, but that app was built something like six months earlier, so it should still be using the old method. The change in question:

client and API: improve the way an app checks for the death of the client.

Old: heartbeat mechanism
Problem: if the client is blocked for > 30 secs (e.g. because it takes a long time to write the state file, or because it's stopped in a debugger) then apps exit. This is bad if the app doesn't checkpoint and has been running for a long time.

New: the client passes its PID to the app.
The app periodically (10 sec) checks that the process still exists.

Notes:
For backward compatibility (e.g. new API w/ old client, or vice versa) the client still sends heartbeats, and the API checks heartbeats if the client doesn't pass a PID.
The new mechanism works only if the client's PID isn't assigned to a new process within 10 secs of the client exiting. Windows 2000 reuses PIDs immediately, so check for Win2K and don't use this mechanism if so.
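The selection rule in these notes can be sketched as follows. This is only an illustration in Python, not BOINC's actual API code: the class name `LivenessChecker`, its parameters, and the injected `pid_exists` and `clock` callables are all hypothetical. The 30-second heartbeat timeout and the Windows 2000 exclusion come from the note above.

```python
import time
from typing import Callable, Optional

HEARTBEAT_TIMEOUT_SECS = 30  # from the note: apps exit if the client is silent > 30 secs


class LivenessChecker:
    """Hypothetical helper showing which mechanism an app would use."""

    def __init__(self, client_pid: Optional[int], is_win2k: bool,
                 pid_exists: Callable[[int], bool],
                 clock: Callable[[], float] = time.monotonic):
        # Use the PID check only if the client passed a PID and the OS
        # does not reuse PIDs immediately (the Windows 2000 case).
        self.use_pid = client_pid is not None and not is_win2k
        self.client_pid = client_pid
        self.pid_exists = pid_exists
        self.clock = clock
        self.last_heartbeat = clock()

    def on_heartbeat(self) -> None:
        # Called whenever a heartbeat arrives from the client.
        self.last_heartbeat = self.clock()

    def client_alive(self) -> bool:
        if self.use_pid:
            # New mechanism: does the client process still exist?
            return self.pid_exists(self.client_pid)
        # Old mechanism: has a heartbeat arrived recently enough?
        return self.clock() - self.last_heartbeat <= HEARTBEAT_TIMEOUT_SECS
```

In the scheme described above, an app would call something like `client_alive()` about every 10 seconds. With the PID check, a client that is merely blocked (writing its state file, or stopped in a debugger) still counts as alive.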

TODO: For Unix multithread apps, critical sections aren't currently being enforced. Need to fix this by masking signals.

Einstein had a problem with their Linux Gamma-ray pulsar search #2 app, where BOINC restarted the app every 11 seconds; David produced a fix:

• API: fix Unix bug when checking if client is alive based on PID.

Can't use waitpid() here; works only for children.
Use kill(pid, 0) instead.
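To illustrate why that fix is needed, here is a small Unix-only sketch using Python's `os` module, which wraps the same system calls. The function names `process_exists` and `waitpid_works_for` are made up for this example; the real fix is in BOINC's API code, not shown here.

```python
import os


def process_exists(pid: int) -> bool:
    """Return True if a process with this PID exists (Unix only)."""
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False     # ESRCH: no such process
    except PermissionError:
        return True      # EPERM: the process exists but belongs to another user
    return True


def waitpid_works_for(pid: int) -> bool:
    """Return False when waitpid() rejects the PID as not being our child."""
    try:
        os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        return False     # ECHILD: pid is not a child of this process
    return True
```

The app is started by the client, so the client is the app's parent, not its child. A `waitpid()` call on the client's PID therefore fails with ECHILD even while the client is running, which would look like a dead client on every check.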

see this thread:

Trouble with Gamma-ray pulsar search #2 v0.01

But the first thing I'd try is running the most recent BOINC 7.2.x version. 7.2.4 was only the second release of the BOINC 7.2.x line; try BOINC 7.2.20, which came out only four days ago, as opposed to three months ago:

BOINC 7.1/7.2 Change Log and News

Claggy

ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 639
Credit: 146,911,118
RAC: 68,032
United Kingdom
Message 1428715 - Posted: 14 Oct 2013, 22:49:47 UTC - in response to Message 1428708.
Last modified: 14 Oct 2013, 23:01:15 UTC


Claggy wrote:
> But the first thing I'd try is running the most recent BOINC 7.2.x version. 7.2.4 was only the second release of the BOINC 7.2.x line; try BOINC 7.2.20, which came out only four days ago, as opposed to three months ago:
>
> BOINC 7.1/7.2 Change Log and News

Interesting. That came to light after I'd backed off to an earlier kernel (SLC6 vs Ubuntu 13.04); it's still being reported as 7.2.4 even though I had to rebuild 7.0.65 due to library incompatibilities (and I don't recall being offered 7.2.4; I thought the download was still 7.0.65). I'll just keep an eye on it for the time being. I'm only running s@h on this machine in the background while I build it up as a Xeon Phi server. Maybe I made a mistake with my git commands...

[Edit] Ah, looking at my command history, I forgot to do
git checkout client_release/7.0/7.0.65; git status

before compiling...
[/Edit]
____________


Copyright © 2014 University of California