Boinc 4.19 Stalls when Switching Projects
Author | Message |
---|---|
parkut Joined: 9 Aug 99 Posts: 69 Credit: 9,779,243 RAC: 0 |
On several of my Linux blades I observe the BOINC client stalling when Predictor swaps out and it's time for SETI to reload. I become aware of it when the system load drops to zero. I can "fix" the problem by issuing a killall to the boinc process, waiting a few seconds, and restarting the BOINC client. I have clients that stall this way under Fedora Core 1, Core 2, and Core 3. Processors include Intel Celeron, AMD XP, AMD64, and Pentium 4. My preferences are set to 90/10 SETI/Predictor. Any hints? 2005-02-10 19:21:51 [SETI@home] Pausing result 03ja04aa.13491.13777.179816.85_5 (removed from memory) 2005-02-10 19:21:51 [ProteinPredictorAtHome] Starting result t0201E_1_92145_3 using mfoldB125 version 4.22 2005-02-10 19:22:26 [ProteinPredictorAtHome] Result t0201E_1_92145_3 exited with zero status but no 'finished' file 2005-02-10 19:22:26 [ProteinPredictorAtHome] If this happens repeatedly you may need to reset the project. 2005-02-10 19:22:26 [ProteinPredictorAtHome] Restarting result t0201E_1_92145_3 using mfoldB125 version 4.22 2005-02-10 20:22:26 [ProteinPredictorAtHome] Pausing result t0201E_1_92145_3 (removed from memory) 2005-02-11 08:44:51 [---] Received signal 15 2005-02-11 08:44:51 [---] Exit requested by user 2005-02-11 08:44:56 [---] Starting BOINC client version 4.19 for i686-pc-linux-gnu |
Walt Gribben Joined: 16 May 99 Posts: 353 Credit: 304,016 RAC: 0 |
It's interesting that there are 35 seconds between the WU starting and the "exited with zero status but no 'finished' file" message. That makes it look like there's some kind of startup problem with the new WU. If BOINC has trouble setting up the new WU, it "sleeps" for 35 seconds. That's long enough for the science apps to time out ("No heartbeat from core client - exiting" in the science application's error log). Check the science app's error log. It's in slots/0, slots/1 and so on; there's a "slots" directory set up for each project. See what's in the error log; for SETI that's stderr.txt. Does this system have one or two CPUs? That's physical or logical; your computers show a mixture of 1- and 2-processor systems. If two, are you running (or trying to run) two science apps, one for each processor? |
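The error-log check suggested in this post can be sketched in Python. This is only a minimal sketch, assuming the layout described in the thread: a `slots` directory under the BOINC directory, with numbered subdirectories that each hold a `stderr.txt`. The function name `find_heartbeat_errors` and the path passed to it are hypothetical, not part of BOINC.

```python
from pathlib import Path

# Text of the timeout message quoted in the thread.
HEARTBEAT_TEXT = "No heartbeat from core client"


def find_heartbeat_errors(boinc_dir):
    """Return (slot name, line) pairs for heartbeat timeouts in slots/*/stderr.txt."""
    hits = []
    slots = Path(boinc_dir) / "slots"
    if not slots.is_dir():
        return hits
    for slot in sorted(slots.iterdir()):
        log = slot / "stderr.txt"
        if not log.is_file():
            continue
        # errors="replace" keeps the scan going if the log holds odd bytes.
        for line in log.read_text(errors="replace").splitlines():
            if HEARTBEAT_TEXT in line:
                hits.append((slot.name, line.strip()))
    return hits
```

Calling `find_heartbeat_errors("/path/to/boinc")` (with your own BOINC directory) returns an empty list when no slot has logged a heartbeat timeout.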
parkut Joined: 9 Aug 99 Posts: 69 Credit: 9,779,243 RAC: 0 |
All of the problem machines so far are single processor systems. Slot 0 stderr.txt was a zero byte file. Slot 1 stderr.txt contains the single line: No heartbeat from core client for 30.025436 sec - exiting and was time-stamped 14:05:54. The log below was from this machine, a 1.2 GHz Celeron running Fedora Core 2. 2005-02-11 14:05:24 [SETI@home] Pausing result 08ap04ab.10046.1328.740894.194_5 (removed from memory) 2005-02-11 14:05:24 [ProteinPredictorAtHome] Starting result t0212E_1_25887_5 using mfoldB125 version 4.22 2005-02-11 14:05:54 [ProteinPredictorAtHome] Result t0212E_1_25887_5 exited with zero status but no 'finished' file 2005-02-11 14:05:54 [ProteinPredictorAtHome] If this happens repeatedly you may need to reset the project. 2005-02-11 14:05:54 [ProteinPredictorAtHome] Restarting result t0212E_1_25887_5 using mfoldB125 version 4.22 2005-02-11 15:08:58 [ProteinPredictorAtHome] Pausing result t0212E_1_25887_5 (removed from memory) 2005-02-11 17:01:00 [---] Received signal 15 |
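The delay between a result starting and the "exited with zero status" message can be read straight off the timestamps in a log like the one above. The following is a rough Python sketch, assuming each log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp as in the posts; the function name `startup_gaps` is hypothetical.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def startup_gaps(log_lines):
    """Return (result name, seconds) from 'Starting result' to 'exited with zero status'."""
    started = {}
    gaps = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:19], LOG_TIME_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        words = line.split()
        if "Starting result" in line:
            # The result name is the word right after "result".
            started[words[words.index("result") + 1]] = stamp
        elif "exited with zero status" in line:
            name = words[words.index("Result") + 1]
            if name in started:
                gaps.append((name, (stamp - started[name]).total_seconds()))
    return gaps
```

For the log above, the start is at 14:05:24 and the exit message at 14:05:54, a gap of 30 seconds, which matches the heartbeat timeout reported in the slot 1 stderr.txt.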
Walt Gribben Joined: 16 May 99 Posts: 353 Credit: 304,016 RAC: 0 |
> All of the problem machines so far are single processor systems. Slot 0 > stderr.txt was a zero byte file. Slot 1 stderr.txt contains the single line: > No heartbeat from core client for 30.025436 sec - exiting and was time-stamped > 14:05:54. That certainly ties it to the new WU startup. I take it Predictor was running in slot 1, right? I should have asked; you can tell by the other files in the slot, where you'll see the executable file for the science application. If Predictor is the application in slot 1, it shows BOINC started it OK but couldn't set up the shared memory segment for communications. Two suggestions. First, have you checked with the Predictor people about this? Second, see if leaving the suspended project in memory changes anything. It might have something to do with initializing the application in slot 1 each time it starts/restarts. That's set in your general preferences - "Leave applications in memory while preempted?" = yes. |
parkut Joined: 9 Aug 99 Posts: 69 Credit: 9,779,243 RAC: 0 |
> have you checked with the predictor people about this? Not yet, since I thought it was a BOINC issue. The problem only showed up when I added Predictor as a second project. > Second, see if leaving the suspended project in memory changes > anything. Might have something to do with initializing the application > in slot 1 each time it starts/restarts. Thats set in your general > preferences - "Leave applications in memory while preempted?" = yes. I have since seen the problem on additional machines, including dual-processor and hyper-threaded units, where one of the processes stalled while the second continued to run. So, following up on your suggestion, I updated the preferences on each machine. We shall see if that halts the stalling problem. Thanks for your suggestions. |
Rudi Joined: 9 Oct 02 Posts: 2 Credit: 93,674 RAC: 0 |
Same problem with SETI and Einstein here. I have used BOINC with SETI alone since I started crunching with BOINC (over a year ago), but recently I added Einstein with a 33% share of CPU time. Exactly the same symptoms sometimes appear on my Linux boxes. I enabled the option "leave appl. in memory" but this made things worse: the problem occurred more often. This only happens on my Linux boxes. The Windows boxes have shown none of this, despite the same settings and shares between SETI and Einstein. Seems like a BOINC problem to me, just like duemling said. (By the way, I added Einstein as a second project to my boxes, just like you did with Predictor.) |
Trane Francks Joined: 18 Jun 99 Posts: 221 Credit: 122,319 RAC: 0 |
Keep an eye on your system load to make sure that more than one WU isn't being crunched on the same CPU at the same time. Any time I've seen the "no 'finished' file" issue, it's been when projects have not paused correctly. When it happens, just kill and restart BOINC and things should progress smoothly again. |
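Watching the system load, as suggested in this thread, can be automated with a few lines of Python. This is a minimal sketch under stated assumptions: the machine is meant to run one science application per CPU, and the thresholds `idle_below` and `overload_margin` are hypothetical values chosen for illustration, not BOINC settings.

```python
import os


def classify_load(load_1min, cpu_count, idle_below=0.2, overload_margin=0.5):
    """Classify a 1-minute load average for a box that should run one task per CPU."""
    if load_1min < idle_below:
        return "stalled"  # nothing appears to be crunching
    if load_1min > cpu_count + overload_margin:
        return "overloaded"  # possibly more than one task sharing a CPU
    return "ok"


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems such as Linux.
    load, _, _ = os.getloadavg()
    print(classify_load(load, os.cpu_count() or 1))
```

A "stalled" result corresponds to the load dropping to zero described in the first post; what to do next (for example, the kill-and-restart workaround) is left to the operator.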