Boinc 4.19 Stalls when Switching Projects

Questions and Answers : Unix/Linux : Boinc 4.19 Stalls when Switching Projects

parkut
Volunteer tester

Joined: 9 Aug 99
Posts: 69
Credit: 9,779,243
RAC: 0
United States
Message 78408 - Posted: 11 Feb 2005, 14:36:46 UTC

On several of my Linux blades I observe the BOINC client stalling when Predictor swaps out and it's time for Seti to reload. I become aware of it by noting the system load dropping to zero. I can "fix" the problem by issuing a killall to the boinc process, waiting a few seconds, and restarting the boinc client. I have clients that stall this way under Fedora Core1, Core2 and Core3. Processors include Intel Celeron, AMD XP, AMD64, and Pentium4. I have my preferences 90/10 Seti/Predictor. Any hints?

2005-02-10 19:21:51 [SETI@home] Pausing result 03ja04aa.13491.13777.179816.85_5 (removed from memory)
2005-02-10 19:21:51 [ProteinPredictorAtHome] Starting result t0201E_1_92145_3 using mfoldB125 version 4.22
2005-02-10 19:22:26 [ProteinPredictorAtHome] Result t0201E_1_92145_3 exited with zero status but no 'finished' file
2005-02-10 19:22:26 [ProteinPredictorAtHome] If this happens repeatedly you may need to reset the project.
2005-02-10 19:22:26 [ProteinPredictorAtHome] Restarting result t0201E_1_92145_3 using mfoldB125 version 4.22
2005-02-10 20:22:26 [ProteinPredictorAtHome] Pausing result t0201E_1_92145_3 (removed from memory)
2005-02-11 08:44:51 [---] Received signal 15
2005-02-11 08:44:51 [---] Exit requested by user
2005-02-11 08:44:56 [---] Starting BOINC client version 4.19 for i686-pc-linux-gnu
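The load check described in this post could be scripted. Below is a minimal Python sketch, not part of BOINC: the helper name `looks_stalled` and the 0.2 threshold are assumptions chosen for illustration, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def looks_stalled(load_1min: float, threshold: float = 0.2) -> bool:
    """Return True when the 1-minute load average is below the threshold.

    On a machine that should be crunching full time, a load near zero
    suggests that no science application is running. The threshold is
    an assumed value, not something BOINC defines.
    """
    return load_1min < threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages.
    load_1min, _, _ = os.getloadavg()
    if looks_stalled(load_1min):
        print(f"load {load_1min:.2f}: the BOINC client may have stalled")
    else:
        print(f"load {load_1min:.2f}: the machine looks busy")
```

Run periodically (for example from cron), a check like this would only flag a suspected stall; the killall-and-restart step is deliberately left manual here.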


Walt Gribben
Volunteer tester

Joined: 16 May 99
Posts: 353
Credit: 304,016
RAC: 0
United States
Message 78503 - Posted: 11 Feb 2005, 20:27:33 UTC
Last modified: 11 Feb 2005, 20:27:53 UTC

It's interesting that there are 35 seconds between the WU starting and the "exited with zero status but no 'finished' file" message. That makes it look like there's some kind of startup problem with the new WU. If BOINC has trouble setting up the new WU, it "sleeps" for 35 seconds. That's long enough for the science apps to time out ("No heartbeat from core client - exiting" in the science application's error log).
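The gap between the two log messages can be measured directly from the client log. The following is a minimal Python sketch that assumes the log line layout shown in this thread (a 19-character timestamp, then the project name in brackets); the function names `parse_time` and `startup_gaps` are hypothetical helpers, not BOINC tools.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
NO_FINISHED = "exited with zero status but no 'finished' file"


def parse_time(line: str) -> datetime:
    # The first 19 characters of each log line hold the timestamp.
    return datetime.strptime(line[:19], LOG_TIME_FORMAT)


def startup_gaps(lines):
    """Yield (result_name, seconds) for each result that was started
    and then exited without a 'finished' file."""
    started = {}
    for line in lines:
        if "] Starting result " in line:
            name = line.split("] Starting result ", 1)[1].split()[0]
            started[name] = parse_time(line)
        elif NO_FINISHED in line and "] Result " in line:
            name = line.split("] Result ", 1)[1].split()[0]
            if name in started:
                gap = parse_time(line) - started.pop(name)
                yield name, gap.total_seconds()


# Two lines copied from the log in the first post.
sample = [
    "2005-02-10 19:21:51 [ProteinPredictorAtHome] Starting result "
    "t0201E_1_92145_3 using mfoldB125 version 4.22",
    "2005-02-10 19:22:26 [ProteinPredictorAtHome] Result "
    "t0201E_1_92145_3 exited with zero status but no 'finished' file",
]

for name, seconds in startup_gaps(sample):
    print(name, seconds)  # prints: t0201E_1_92145_3 35.0
```

"Restarting result" lines are ignored on purpose, so only the first start of a result is paired with its exit message.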

Check the science app's error log. It's in slots/0, slots/1 and so on; there's a "slots" subdirectory set up for each project. See what's in the error log; for SETI that's stderr.txt.
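Checking every slot by hand gets tedious across several machines. As a rough Python sketch, the search could look like the following. It assumes each slot keeps its error log in a file named stderr.txt (as described for SETI above; other projects may use a different name), and `find_heartbeat_timeouts` and its `boinc_dir` argument are hypothetical names for illustration.

```python
from pathlib import Path

HEARTBEAT_TEXT = "No heartbeat from core client"


def find_heartbeat_timeouts(boinc_dir):
    """Return {slot_name: [matching lines]} for each slots/*/stderr.txt
    under boinc_dir that contains a heartbeat timeout message."""
    hits = {}
    for log in sorted(Path(boinc_dir).glob("slots/*/stderr.txt")):
        lines = [
            line.strip()
            for line in log.read_text(errors="replace").splitlines()
            if HEARTBEAT_TEXT in line
        ]
        if lines:
            # The parent directory name is the slot number, e.g. "1".
            hits[log.parent.name] = lines
    return hits


if __name__ == "__main__":
    # "." assumes the script is run from the BOINC working directory.
    for slot, lines in find_heartbeat_timeouts(".").items():
        for line in lines:
            print(f"slot {slot}: {line}")
```

Slots whose stderr.txt is empty or lacks the message are simply left out of the result.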

Does this system have one or two CPUs? That's physical or logical; your computers show a mixture of 1- and 2-processor systems. If two, are you running (or trying to run) two science apps, one for each processor?







parkut
Volunteer tester

Joined: 9 Aug 99
Posts: 69
Credit: 9,779,243
RAC: 0
United States
Message 78678 - Posted: 12 Feb 2005, 4:03:46 UTC - in response to Message 78503.  
Last modified: 12 Feb 2005, 4:09:54 UTC

All of the problem machines so far are single-processor systems. Slot 0's stderr.txt was a zero-byte file. Slot 1's stderr.txt contains the single line "No heartbeat from core client for 30.025436 sec - exiting" and was time-stamped 14:05:54.

The log below is from this machine, a Fedora Core 2 box with a 1.2 GHz Celeron.

2005-02-11 14:05:24 [SETI@home] Pausing result 08ap04ab.10046.1328.740894.194_5 (removed from memory)
2005-02-11 14:05:24 [ProteinPredictorAtHome] Starting result t0212E_1_25887_5 using mfoldB125 version 4.22
2005-02-11 14:05:54 [ProteinPredictorAtHome] Result t0212E_1_25887_5 exited with zero status but no 'finished' file
2005-02-11 14:05:54 [ProteinPredictorAtHome] If this happens repeatedly you may need to reset the project.
2005-02-11 14:05:54 [ProteinPredictorAtHome] Restarting result t0212E_1_25887_5 using mfoldB125 version 4.22
2005-02-11 15:08:58 [ProteinPredictorAtHome] Pausing result t0212E_1_25887_5 (removed from memory)
2005-02-11 17:01:00 [---] Received signal 15


Walt Gribben
Volunteer tester

Joined: 16 May 99
Posts: 353
Credit: 304,016
RAC: 0
United States
Message 78712 - Posted: 12 Feb 2005, 6:07:57 UTC - in response to Message 78678.  

> All of the problem machines so far are single processor systems. Slot 0
> stderr.txt was a zero byte file. Slot 1 stderr.txt contains the single line:
> No heartbeat from core client for 30.025436 sec - exiting and was time-stamped
> 14:05:54.

That certainly ties it to the new WU startup. I take it Predictor was running in slot 1, right? I should have asked: you can tell by the other files in the slot, since you'll see the executable file for the science application there.

If predictor is the application in slot 1, it shows BOINC started it OK but couldn't set up the shared memory segment for communications.

Two suggestions.

First, have you checked with the predictor people about this?

Second, see if leaving the suspended project in memory changes anything. It might have something to do with initializing the application in slot 1 each time it starts/restarts. That's set in your general preferences: "Leave applications in memory while preempted?" = yes.

parkut
Volunteer tester

Joined: 9 Aug 99
Posts: 69
Credit: 9,779,243
RAC: 0
United States
Message 78806 - Posted: 12 Feb 2005, 16:52:46 UTC - in response to Message 78503.  

> have you checked with the predictor people about this?

Not yet, since I thought it was a BOINC issue. The problem only showed up when I added Predictor as a second project.

> Second, see if leaving the suspended project in memory changes
> anything. Might have something to do with initializing the application
> in slot 1 each time it starts/restarts. Thats set in your general
> preferences - "Leave applications in memory while preempted?" = yes.

I have since seen this on additional machines, including dual-processor and hyper-threaded units, where one of the processes stalled and the second continued to run.

So, following up on your suggestion, I updated the preferences on each machine. We shall see if that halts the stalling problem.

Thanks for your suggestions.


Rudi

Joined: 9 Oct 02
Posts: 2
Credit: 93,674
RAC: 0
Germany
Message 82142 - Posted: 23 Feb 2005, 9:08:46 UTC

Same problem with Seti and Einstein here.

I have used BOINC with SETI alone since I started crunching with BOINC (over a year ago),
but recently I added Einstein with a 33% share of CPU time.

Exactly the same symptoms sometimes happen on my Linux boxes. I enabled the
"leave applications in memory" option, but this made things worse: the problem occurred
more often. This only happens on my Linux boxes. The Windows boxes have
shown none of these symptoms, despite the same settings and shares between SETI and Einstein.

Seems like a BOINC problem to me, just like duemling said.
(Btw., I added Einstein as a second project to my boxes, just like you did
with Predictor.)

Trane Francks

Joined: 18 Jun 99
Posts: 221
Credit: 122,319
RAC: 0
Japan
Message 82341 - Posted: 26 Feb 2005, 0:25:09 UTC

Keep an eye on your system load to make sure that more than one WU isn't being crunched on the same CPU at the same time. Any time I've seen the "no finished file" issue, it's been when projects have not paused correctly. When it happens, just kill/restart BOINC and things should progress smoothly again.
