Strange New Problem (6.6.20, Opt. Apps)

Message boards : Number crunching : Strange New Problem (6.6.20, Opt. Apps)

Cruncher-American

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 932472 - Posted: 11 Sep 2009, 11:18:24 UTC
Last modified: 11 Sep 2009, 11:21:23 UTC

I am running 6.6.20 and the Optimized apps (CPU and CUDA). I also use Reschedule 1.9 to balance my load. I have been doing so for quite a while.

Recently, I finally received a full load of WUs (upwards of 2500), and have been seeing the "connecting to client" message box a lot. I assume that is because my internal queues are so long. But I have also been seeing CPU WUs finish showing a few minutes of elapsed time and a couple of hours or so of CPU time, and similar distortions for GPU WUs.

1) has anyone else had this experience?
2) what causes this, if known?
3) will it affect the credit I get, or is that purely based on CPU time/flops?
ID: 932472
MarkJ
Volunteer tester

Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 932476 - Posted: 11 Sep 2009, 12:08:55 UTC - in response to Message 932472.  

I am running 6.6.20 and the Optimized apps (CPU and CUDA). I also use Reschedule 1.9 to balance my load. I have been doing so for quite a while.

Recently, I finally received a full load of WUs (upwards of 2500), and have been seeing the "connecting to client" message box a lot. I assume that is because my internal queues are so long. But I have also been seeing CPU WUs finish showing a few minutes of elapsed time and a couple of hours or so of CPU time, and similar distortions for GPU WUs.

1) has anyone else had this experience?
2) what causes this, if known?
3) will it affect the credit I get, or is that purely based on CPU time/flops?


You might want to read this message thread.

The gist of it is that the BOINC manager is telling the core client to give it an update, but when you have that many tasks it takes a while to send all the information. The 6.10.xx series of clients has some changes to try to reduce the overhead. I wouldn't recommend 6.10.4, though, as it has its own issues.
BOINC blog
ID: 932476
Charles Anspaugh
Volunteer tester

Joined: 11 Aug 00
Posts: 48
Credit: 22,715,083
RAC: 0
United States
Message 932521 - Posted: 11 Sep 2009, 16:35:06 UTC - in response to Message 932472.  

I am running 6.6.20 and the Optimized apps (CPU and CUDA). I also use Reschedule 1.9 to balance my load. I have been doing so for quite a while.

Recently, I finally received a full load of WUs (upwards of 2500), and have been seeing the "connecting to client" message box a lot. I assume that is because my internal queues are so long. But I have also been seeing CPU WUs finish showing a few minutes of elapsed time and a couple of hours or so of CPU time, and similar distortions for GPU WUs.

1) has anyone else had this experience?
2) what causes this, if known?
3) will it affect the credit I get, or is that purely based on CPU time/flops?

1) Somewhere above 1000 WUs is when I notice the same problem.
2) I think the BOINC manager is taking CPU cycles away from work units in order to keep its files updated.
3) The only way this affects your credit is that it takes longer to finish a work unit.

I found two things that help reduce this problem.

First, in BOINC Manager's Advanced tab, under 'Preferences', on the disk and memory usage tab, change 'write to disk at most every # seconds' to 600 seconds (10 minutes). This helped some.

Second, don't leave BOINC Manager open; while it is open it takes CPU cycles away from WUs.

I have also put my entire BOINC install on a ramdrive, but ramdrives can be more of a problem than I can explain here.

These tips only matter when you have 1000s of WUs.
ID: 932521
Cruncher-American

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 932534 - Posted: 11 Sep 2009, 17:34:01 UTC - in response to Message 932521.  
Last modified: 11 Sep 2009, 17:34:56 UTC

Charles - Thanks for your suggestions!...
BUT: if I change my "write to disk" time to 600 secs, won't that affect how often checkpoints are done?

BTW - is that WtD time PER thread, or PER machine? (I run 8 cores and 2 CUDAs.) If it is per thread, then I am really writing every 6 secs (with the 60 sec default)...
ID: 932534
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 932566 - Posted: 11 Sep 2009, 20:15:16 UTC - in response to Message 932534.  

Charles - Thanks for your suggestions!...
BUT: if I change my "write to disk" time to 600 secs, won't that affect how often checkpoints are done?

BTW - is that WtD time PER thread, or PER machine? (I run 8 cores and 2 CUDAs.) If it is per thread, then I am really writing every 6 secs (with the 60 sec default)...

It's meant to be per machine. When BOINC starts an application, it multiplies the WtD by the number of CPUs, or the number of GPUs, to set the <checkpoint_period> value. It doesn't take the maximum of those processor counts, though. The latest source simply checks the number of CPUs first, then CUDA GPUs, then ATI GPUs, so the last checked type which exists on the host determines how WtD is interpreted.
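A minimal sketch of that selection logic (illustrative names only, not BOINC's actual source):

```python
def checkpoint_period(wtd, n_cpus, n_cuda, n_ati):
    """Illustrative sketch of the behavior described above: each
    device type is checked in turn (CPUs, then CUDA, then ATI),
    so the last type actually present on the host wins."""
    period = wtd * n_cpus
    if n_cuda > 0:
        period = wtd * n_cuda
    if n_ati > 0:
        period = wtd * n_ati
    return period

# 8 CPUs and 2 CUDA GPUs with the 60 second default:
print(checkpoint_period(60, 8, 2, 0))  # 120, not 480
```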

In the slot for a running task there's an init_data.xml file which contains that <checkpoint_period> setting. It's probably simplest just to open that with a text editor to see what BOINC on your host is doing. I guess it will be 120 on your host for the 60 second WtD since it has 2 CUDA GPUs.
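If a text editor feels clumsy, the value can also be pulled out with a few lines of script. A hedged sketch, assuming a typical slot layout; the example path and the exact XML structure are my assumptions:

```python
import xml.etree.ElementTree as ET

def read_checkpoint_period(path):
    """Return the <checkpoint_period> value from an init_data.xml
    file, or None if the element is missing."""
    root = ET.parse(path).getroot()
    elem = root.find("checkpoint_period")
    return float(elem.text) if elem is not None else None

# e.g. read_checkpoint_period("slots/0/init_data.xml")
```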
                                                            Joe
ID: 932566
Cruncher-American

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 932640 - Posted: 12 Sep 2009, 0:29:31 UTC - in response to Message 932566.  
Last modified: 12 Sep 2009, 0:30:02 UTC


In the slot for a running task there's an init_data.xml file which contains that <checkpoint_period> setting. It's probably simplest just to open that with a text editor to see what BOINC on your host is doing. I guess it will be 120 on your host for the 60 second WtD since it has 2 CUDA GPUs.
                                                            Joe


I looked. Checkpoint period is 480 (= 8 × 60), so "correct" for my dual quad-core... except that according to you it should be 120 (2 CUDA cards × 60)... what gives????
ID: 932640
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 932688 - Posted: 12 Sep 2009, 5:11:05 UTC - in response to Message 932640.  


In the slot for a running task there's an init_data.xml file which contains that <checkpoint_period> setting. It's probably simplest just to open that with a text editor to see what BOINC on your host is doing. I guess it will be 120 on your host for the 60 second WtD since it has 2 CUDA GPUs.
                                                            Joe

I looked. Checkpoint period is 480 (= 8 × 60), so "correct" for my dual quad-core... except that according to you it should be 120 (2 CUDA cards × 60)... what gives????

Probably your version of BOINC not being built from the latest source code. I didn't chase back to when that particular code was modified, nor in what fashion, though the simple detail that it includes allowance for ATI GPUs indicates at least that part is very recent. That's why I suggested checking init_data.xml; BOINC is an ever-changing target.
                                                             Joe
ID: 932688
Charles Anspaugh
Volunteer tester

Joined: 11 Aug 00
Posts: 48
Credit: 22,715,083
RAC: 0
United States
Message 932689 - Posted: 12 Sep 2009, 5:16:28 UTC - in response to Message 932640.  


In the slot for a running task there's an init_data.xml file which contains that <checkpoint_period> setting. It's probably simplest just to open that with a text editor to see what BOINC on your host is doing. I guess it will be 120 on your host for the 60 second WtD since it has 2 CUDA GPUs.
                                                            Joe


I looked. Checkpoint period is 480 (= 8 × 60), so "correct" for my dual quad-core... except that according to you it should be 120 (2 CUDA cards × 60)... what gives????

I think there is a slot per core and GPU; in your case, slot 0 through slot 9. The slot with your GPU task should show a different checkpoint value.

I have four slots for my quad-core CPU and one for CUDA.

I don't know much about this checkpoint: what it does, or how it ties in to the write-to-disk setting. Could you explain some? There is always something new to learn.
ID: 932689
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 932692 - Posted: 12 Sep 2009, 5:46:13 UTC - in response to Message 932689.  


In the slot for a running task there's an init_data.xml file which contains that <checkpoint_period> setting. It's probably simplest just to open that with a text editor to see what BOINC on your host is doing. I guess it will be 120 on your host for the 60 second WtD since it has 2 CUDA GPUs.
                                                            Joe

I looked. Checkpoint period is 480 (= 8 × 60), so "correct" for my dual quad-core... except that according to you it should be 120 (2 CUDA cards × 60)... what gives????

I think there is a slot per core and GPU; in your case, slot 0 through slot 9. The slot with your GPU task should show a different checkpoint value.

I have four slots for my quad-core CPU and one for CUDA.

I don't know much about this checkpoint: what it does, or how it ties in to the write-to-disk setting. Could you explain some? There is always something new to learn.

When BOINC starts a task, it first sets up the init_data.xml file with parameters for the science application to use. The checkpoint period is supposed to control how often the application writes to disk, and for both S@H applications it does. The checkpoints are saved to disk so if the application is shut down and restarted it can pick up from that data rather than having to start again at the beginning of a task.

Version 5 and earlier of BOINC simply set the checkpoint period to the same value as the user's "write to disk" preference, but with many hosts now having multiple cores, that was changed sometime in the 6.x series. Every time an application checkpoints, BOINC needs to update its own state to keep track of what the application is doing, and on multi-core machines that caused too much disk activity. So the preference is now interpreted in a way intended to approximate one disk write per "write to disk" interval. It's not accurate, but what in BOINC is?
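The checkpoint/restart idea itself can be sketched generically. This is an illustration of the pattern only, not the BOINC API or the S@H applications' code; all names here are made up:

```python
import json, os, time

STATE_FILE = "checkpoint.json"  # made-up name for this sketch

def run_task(total_iters, checkpoint_period, now=time.monotonic):
    """Resume from the last checkpoint if one exists, and write a
    new checkpoint at most once per checkpoint_period seconds."""
    start = 0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            start = json.load(f)["iter"]  # pick up where we left off
    last_write = now()
    for i in range(start, total_iters):
        # ... one unit of science work would go here ...
        if now() - last_write >= checkpoint_period:
            with open(STATE_FILE, "w") as f:
                json.dump({"iter": i + 1}, f)  # durable progress marker
            last_write = now()
    return total_iters - start  # iterations actually done this run
```

Killing and restarting such a program costs at most one checkpoint period's worth of work, which is exactly the trade-off the "write to disk" preference is tuning.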
                                                             Joe
ID: 932692
Charles Anspaugh
Volunteer tester

Joined: 11 Aug 00
Posts: 48
Credit: 22,715,083
RAC: 0
United States
Message 932922 - Posted: 13 Sep 2009, 0:23:23 UTC - in response to Message 932692.  


When BOINC starts a task, it first sets up the init_data.xml file with parameters for the science application to use. The checkpoint period is supposed to control how often the application writes to disk, and for both S@H applications it does. The checkpoints are saved to disk so if the application is shut down and restarted it can pick up from that data rather than having to start again at the beginning of a task.

Version 5 and earlier of BOINC simply set the checkpoint period to the same value as the user's "write to disk" preference, but with many hosts now having multiple cores, that was changed sometime in the 6.x series. Every time an application checkpoints, BOINC needs to update its own state to keep track of what the application is doing, and on multi-core machines that caused too much disk activity. So the preference is now interpreted in a way intended to approximate one disk write per "write to disk" interval. It's not accurate, but what in BOINC is?
                                                             Joe

Got it, Thank you.
ID: 932922

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. Astropulse is funded in part by the NSF through grant AST-0307956.