Strange New Problem (6.6.20, Opt. Apps)

Message boards : Number crunching : Strange New Problem (6.6.20, Opt. Apps)

Cruncher-American

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 932472 - Posted: 11 Sep 2009, 11:18:24 UTC
Last modified: 11 Sep 2009, 11:21:23 UTC

I am running 6.6.20 and the Optimized apps (CPU and CUDA). I also use Reschedule 1.9 to balance my load. I have been doing so for quite a while.

Recently, I finally received a full load of WUs (upwards of 2500), and have been seeing the "connecting to client" message box a lot. I assume that is because my internal queues are so long. But I have also been seeing CPU WUs finish showing a few minutes of elapsed time and a couple of hours or so of CPU time, and similar distortions for GPU WUs.

1) has anyone else had this experience?
2) what causes this, if known?
3) will it affect the credit I get, or is that purely based on CPU time/flops?
ID: 932472
MarkJ
Volunteer tester

Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 932476 - Posted: 11 Sep 2009, 12:08:55 UTC - in response to Message 932472.  

I am running 6.6.20 and the Optimized apps (CPU and CUDA). I also use Reschedule 1.9 to balance my load. I have been doing so for quite a while.

Recently, I finally received a full load of WUs (upwards of 2500), and have been seeing the "connecting to client" message box a lot. I assume that is because my internal queues are so long. But I have also been seeing CPU WUs finish showing a few minutes of elapsed time and a couple of hours or so of CPU time, and similar distortions for GPU WUs.

1) has anyone else had this experience?
2) what causes this, if known?
3) will it affect the credit I get, or is that purely based on CPU time/flops?


You might want to read this message thread.

The gist of it is that the BOINC manager is telling the core client to give it an update, but when you have that many tasks it takes a while to send all the information. The 6.10.xx series of clients has some changes to try to reduce the overhead. I wouldn't recommend 6.10.4, though, as it has its own issues.
BOINC blog
ID: 932476
Charles Anspaugh
Volunteer tester

Joined: 11 Aug 00
Posts: 48
Credit: 22,715,083
RAC: 0
United States
Message 932521 - Posted: 11 Sep 2009, 16:35:06 UTC - in response to Message 932472.  

I am running 6.6.20 and the Optimized apps (CPU and CUDA). I also use Reschedule 1.9 to balance my load. I have been doing so for quite a while.

Recently, I finally received a full load of WUs (upwards of 2500), and have been seeing the "connecting to client" message box a lot. I assume that is because my internal queues are so long. But I have also been seeing CPU WUs finish showing a few minutes of elapsed time and a couple of hours or so of CPU time, and similar distortions for GPU WUs.

1) has anyone else had this experience?
2) what causes this, if known?
3) will it affect the credit I get, or is that purely based on CPU time/flops?

1) Somewhere above 1000 WUs is when I notice the same problem.
2) I think the BOINC manager is taking CPU cycles away from work units in order to keep its files updated.
3) The only way this affects your credit is that it takes longer to finish a work unit.

I found two things that help reduce this problem.

First, in BOINC Manager's Advanced tab, under 'Preferences', on the disk and memory usage tab, change 'write to disk at most every # seconds' to 600 seconds (10 minutes). This helped some.

Second, don't leave BOINC Manager open; while it is open it takes CPU cycles away from WUs.

I have also put my entire BOINC install on a ramdrive, but ramdrives can be more of a problem than I can explain here.

These tips only matter when you have 1000s of WUs.
ID: 932521
Cruncher-American

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 932534 - Posted: 11 Sep 2009, 17:34:01 UTC - in response to Message 932521.  
Last modified: 11 Sep 2009, 17:34:56 UTC

Charles - Thanks for your suggestions!...
BUT: if I change my "write to disk" time to 600 secs, won't that affect how often checkpoints are done?

BTW - is that WtD time PER thread, or PER machine? (I run 8 cores and 2 CUDAs.) If it is per thread, then I am really writing every 6 secs (with the 60 sec default)...
ID: 932534
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 932566 - Posted: 11 Sep 2009, 20:15:16 UTC - in response to Message 932534.  

Charles - Thanks for your suggestions!...
BUT: if I change my "write to disk" time to 600 secs, won't that affect how often checkpoints are done?

BTW - is that WtD time PER thread, or PER machine? (I run 8 cores and 2 CUDAs.) If it is per thread, then I am really writing every 6 secs (with the 60 sec default)...

It's meant to be per machine. When BOINC starts an application, it multiplies the WtD by the number of CPUs, or the number of GPUs, to set the <checkpoint_period> value. It doesn't take the maximum of those processor counts, though. The latest source simply checks the number of CPUs first, then CUDA GPUs, then ATI GPUs, so the last checked type which exists on the host determines how WtD is interpreted.
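A minimal sketch of that selection logic (illustrative names only, not BOINC's actual source):

```python
def checkpoint_period(wtd, n_cpus, n_cuda, n_ati):
    """Illustrative sketch of the behavior described above: each
    device type is checked in turn (CPUs, then CUDA, then ATI),
    so the last type actually present on the host wins."""
    period = wtd * n_cpus
    if n_cuda > 0:
        period = wtd * n_cuda
    if n_ati > 0:
        period = wtd * n_ati
    return period

# 8 CPUs and 2 CUDA GPUs with the 60 second default:
print(checkpoint_period(60, 8, 2, 0))  # 120, not 480
```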

In the slot for a running task there's an init_data.xml file which contains that <checkpoint_period> setting. It's probably simplest just to open that with a text editor to see what BOINC on your host is doing. I guess it will be 120 on your host for the 60 second WtD since it has 2 CUDA GPUs.
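If a text editor feels clumsy, the value can also be pulled out with a few lines of script. A hedged sketch, assuming a typical slot layout; the example path and the exact XML structure are my assumptions:

```python
import xml.etree.ElementTree as ET

def read_checkpoint_period(path):
    """Return the <checkpoint_period> value from an init_data.xml
    file, or None if the element is missing."""
    root = ET.parse(path).getroot()
    elem = root.find("checkpoint_period")
    return float(elem.text) if elem is not None else None

# e.g. read_checkpoint_period("slots/0/init_data.xml")
```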
                                                            Joe
ID: 932566
Cruncher-American

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 932640 - Posted: 12 Sep 2009, 0:29:31 UTC - in response to Message 932566.  
Last modified: 12 Sep 2009, 0:30:02 UTC


In the slot for a running task there's an init_data.xml file which contains that <checkpoint_period> setting. It's probably simplest just to open that with a text editor to see what BOINC on your host is doing. I guess it will be 120 on your host for the 60 second WtD since it has 2 CUDA GPUs.
                                                            Joe


I looked. Checkpoint period is 480 (= 8 × 60), so "correct" for my dual quad-core... except that according to you it should be 120 (2 CUDA cards × 60)... what gives????
ID: 932640
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 932688 - Posted: 12 Sep 2009, 5:11:05 UTC - in response to Message 932640.  


In the slot for a running task there's an init_data.xml file which contains that <checkpoint_period> setting. It's probably simplest just to open that with a text editor to see what BOINC on your host is doing. I guess it will be 120 on your host for the 60 second WtD since it has 2 CUDA GPUs.
                                                            Joe

I looked. Checkpoint period is 480 (= 8 × 60), so "correct" for my dual quad-core... except that according to you it should be 120 (2 CUDA cards × 60)... what gives????

Probably your version of BOINC not being built from the latest source code. I didn't chase back to when that particular code was modified, nor in what fashion, though the simple detail that it includes allowance for ATI GPUs indicates at least that part is very recent. That's why I suggested checking init_data.xml; BOINC is an ever-changing target.
                                                             Joe
ID: 932688
Charles Anspaugh
Volunteer tester

Joined: 11 Aug 00
Posts: 48
Credit: 22,715,083
RAC: 0
United States
Message 932689 - Posted: 12 Sep 2009, 5:16:28 UTC - in response to Message 932640.  


In the slot for a running task there's an init_data.xml file which contains that <checkpoint_period> setting. It's probably simplest just to open that with a text editor to see what BOINC on your host is doing. I guess it will be 120 on your host for the 60 second WtD since it has 2 CUDA GPUs.
                                                            Joe


I looked. Checkpoint period is 480 (= 8 × 60), so "correct" for my dual quad-core... except that according to you it should be 120 (2 CUDA cards × 60)... what gives????

I think there is a slot per core and GPU; in your case, slot 0 through slot 9. The slot with your GPU task should show a different checkpoint value.

I have four slots for my quad-core CPU and one for CUDA.

I don't know much about this checkpoint: what it does, or how it ties in to the write-to-disk setting. Could you explain some? There is always something new to learn.
ID: 932689
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 932692 - Posted: 12 Sep 2009, 5:46:13 UTC - in response to Message 932689.  


In the slot for a running task there's an init_data.xml file which contains that <checkpoint_period> setting. It's probably simplest just to open that with a text editor to see what BOINC on your host is doing. I guess it will be 120 on your host for the 60 second WtD since it has 2 CUDA GPUs.
                                                            Joe

I looked. Checkpoint period is 480 (= 8 × 60), so "correct" for my dual quad-core... except that according to you it should be 120 (2 CUDA cards × 60)... what gives????

I think there is a slot per core and GPU; in your case, slot 0 through slot 9. The slot with your GPU task should show a different checkpoint value.

I have four slots for my quad-core CPU and one for CUDA.

I don't know much about this checkpoint: what it does, or how it ties in to the write-to-disk setting. Could you explain some? There is always something new to learn.

When BOINC starts a task, it first sets up the init_data.xml file with parameters for the science application to use. The checkpoint period is supposed to control how often the application writes to disk, and for both S@H applications it does. The checkpoints are saved to disk so if the application is shut down and restarted it can pick up from that data rather than having to start again at the beginning of a task.

Version 5 and earlier of BOINC simply set the checkpoint period to the same value as the user's "write to disk" preference, but with many hosts now having multiple cores, that was changed sometime in the 6.x series. Every time an application checkpoints, BOINC needs to update its own state to keep track of what the application is doing, and on multi-core machines that caused too much disk activity. So the preference is now interpreted in a way intended to approximate one disk write per "write to disk" interval. It's not accurate, but what in BOINC is?
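The checkpoint/restart idea itself can be sketched generically. This is an illustration of the pattern only, not the BOINC API or the S@H applications' code; all names here are made up:

```python
import json, os, time

STATE_FILE = "checkpoint.json"  # made-up name for this sketch

def run_task(total_iters, checkpoint_period, now=time.monotonic):
    """Resume from the last checkpoint if one exists, and write a
    new checkpoint at most once per checkpoint_period seconds."""
    start = 0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            start = json.load(f)["iter"]  # pick up where we left off
    last_write = now()
    for i in range(start, total_iters):
        # ... one unit of science work would go here ...
        if now() - last_write >= checkpoint_period:
            with open(STATE_FILE, "w") as f:
                json.dump({"iter": i + 1}, f)  # durable progress marker
            last_write = now()
    return total_iters - start  # iterations actually done this run
```

Killing and restarting such a program costs at most one checkpoint period's worth of work, which is exactly the trade-off the "write to disk" preference is tuning.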
                                                             Joe
ID: 932692
Charles Anspaugh
Volunteer tester

Joined: 11 Aug 00
Posts: 48
Credit: 22,715,083
RAC: 0
United States
Message 932922 - Posted: 13 Sep 2009, 0:23:23 UTC - in response to Message 932692.  


When BOINC starts a task, it first sets up the init_data.xml file with parameters for the science application to use. The checkpoint period is supposed to control how often the application writes to disk, and for both S@H applications it does. The checkpoints are saved to disk so if the application is shut down and restarted it can pick up from that data rather than having to start again at the beginning of a task.

Version 5 and earlier of BOINC simply set the checkpoint period to the same value as the user's "write to disk" preference, but with many hosts now having multiple cores, that was changed sometime in the 6.x series. Every time an application checkpoints, BOINC needs to update its own state to keep track of what the application is doing, and on multi-core machines that caused too much disk activity. So the preference is now interpreted in a way intended to approximate one disk write per "write to disk" interval. It's not accurate, but what in BOINC is?
                                                             Joe

Got it, Thank you.
ID: 932922

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. Astropulse is funded in part by the NSF through grant AST-0307956.