SNB-E not using all threads as it should

Message boards : Number crunching : SNB-E not using all threads as it should

Profile Eric B

Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1296597 - Posted: 18 Oct 2012, 16:56:22 UTC

I have 2 OpenSuse 12.1 x64 Linux systems. One is a 4-core HT Sandy Bridge system with an Nvidia GTX460 and 8G DRAM; on that system the total is 14 boinc threads - 8 cpu and 2 gpu. OK, that's great and what I would expect.
I also have an SNB-E system which is 6-core HT (12 threads); it also has an Nvidia GTX460, but 16G DRAM. On that system I get only 11 cpu and 2 cuda tasks running at a time.

Both systems are using seti boinc version 6.10.58

The app_info.xml is virtually identical on both systems, and both are using Alex's AK_V* optimized linux fermi apps.

E.g., on the SNB-E system:
cat ~/BOINC/projects/setiathome.berkeley.edu/app_info.xml
<app_info>
<app>
<name>setiathome_enhanced</name>
</app>
<file_info>
<name>AK_V8_linux64_ssse3</name>
<executable/>
</file_info>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>603</version_num>
<file_ref>
<file_name>AK_V8_linux64_ssse3</file_name>
<main_program/>
</file_ref>
</app_version>
<app>
<name>setiathome_enhanced</name>
</app>
<file_info>
<name>setiathome-6.11.x86_64-pc-linux-gnu__cuda32</name>
<executable/>
</file_info>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>611</version_num>
<plan_class>cuda_fermi</plan_class>
<avg_ncpus>0.250</avg_ncpus>
<max_ncpus>0.50</max_ncpus>
<coproc>
<type>CUDA</type>
<count>0.50</count>
</coproc>
<file_ref>
<file_name>setiathome-6.11.x86_64-pc-linux-gnu__cuda32</file_name>
<main_program/>
</file_ref>
</app_version>
</app_info>
ldd AK_V8_linux64_ssse3
linux-vdso.so.1 => (0x00007fff563aa000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa3a5570000)
libc.so.6 => /lib64/libc.so.6 (0x00007fa3a51e0000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa3a578d000)
libm.so.6 => /lib64/libm.so.6 (0x00007fa3a4f89000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa3a4d73000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fa3a4b6f000)
locate libcuda:
BOINC/projects/setiathome.berkeley.edu/libcudart.so.3

/usr/lib/libcuda.so
/usr/lib/libcuda.so.1
/usr/lib/libcuda.so.304.43
/usr/lib64/libcuda.so
/usr/lib64/libcuda.so.1
/usr/lib64/libcuda.so.304.43
/usr/local/cuda/lib/libcudart.so
/usr/local/cuda/lib/libcudart.so.4
/usr/local/cuda/lib/libcudart.so.4.1.28
/usr/local/cuda/lib64/libcudart.so
/usr/local/cuda/lib64/libcudart.so.4
/usr/local/cuda/lib64/libcudart.so.4.1.28



And on the 8 thread SNB system it looks like this:

cat ~/BOINC/projects/setiathome.berkeley.edu/app_info.xml
<app_info>
<app>
<name>setiathome_enhanced</name>
</app>
<file_info>
<name>AK_V8_linux64_ssse3</name>
<executable/>
</file_info>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>603</version_num>
<file_ref>
<file_name>AK_V8_linux64_ssse3</file_name>
<main_program/>
</file_ref>
</app_version>
<app>
<name>setiathome_enhanced</name>
</app>
<file_info>
<name>setiathome-6.11.x86_64-pc-linux-gnu__cuda32</name>
<executable/>
</file_info>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>611</version_num>
<plan_class>cuda_fermi</plan_class>
<avg_ncpus>0.250</avg_ncpus>
<max_ncpus>0.50</max_ncpus>
<coproc>
<type>CUDA</type>
<count>0.50</count>
</coproc>
<file_ref>
<file_name>setiathome-6.11.x86_64-pc-linux-gnu__cuda32</file_name>
<main_program/>
</file_ref>
</app_version>
</app_info>
and:
ldd AK_V8_linux64_ssse3
linux-vdso.so.1 => (0x00007fff129c6000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007feb1d660000)
libc.so.6 => /lib64/libc.so.6 (0x00007feb1d2d0000)
/lib64/ld-linux-x86-64.so.2 (0x00007feb1d87d000)
libm.so.6 => /lib64/libm.so.6 (0x00007feb1d079000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007feb1ce63000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007feb1cc5f000)
libcudart seems to be 4.1.28:
BOINC/projects/setiathome.berkeley.edu/libcudart.so.3
/usr/lib/libcuda.so
/usr/lib/libcuda.so.1
/usr/lib/libcuda.so.304.43
/usr/lib64/libcuda.so
/usr/lib64/libcuda.so.1
/usr/lib64/libcuda.so.304.43
/usr/local/cuda/lib/libcudart.so
/usr/local/cuda/lib/libcudart.so.4
/usr/local/cuda/lib/libcudart.so.4.1.28
/usr/local/cuda/lib64/libcudart.so
/usr/local/cuda/lib64/libcudart.so.4
/usr/local/cuda/lib64/libcudart.so.4.1.28

Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1296680 - Posted: 18 Oct 2012, 20:09:56 UTC

If the machine has downloaded more than 11 tasks for the CPU, then you might want to check the following BOINC settings, both on your profile and locally:
On multiprocessors, use at most n processors
On multiprocessors, use at most n% of the processors

Also, you would want to verify that BOINC is reporting 12 CPUs at startup. Sort of like this:
12-Oct-2012 16:34:35 [---] Processor: 24 GenuineIntel Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
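
For anyone who would rather not eyeball the log, the same check can be sketched in a few lines of Python. This is only an illustration: the function name `reported_cpu_count` is made up here, and the regex assumes the "Processor: N vendor model" line format shown above.

```python
import re

# Matches the startup line that carries the CPU count, e.g.
#   "... Processor: 24 GenuineIntel Intel(R) Xeon(R) CPU E5645 @ 2.40GHz"
# It is written so that "Processor: 15.00 MB cache" and
# "Processor features: ..." do not match.
PROCESSOR_LINE = re.compile(r"Processor:\s+(\d+)\s+\S")


def reported_cpu_count(log_lines):
    """Return the CPU count from the first matching line, or None."""
    for line in log_lines:
        match = PROCESSOR_LINE.search(line)
        if match:
            return int(match.group(1))
    return None


sample = [
    "12-Oct-2012 16:34:35 [---] Processor: 24 GenuineIntel "
    "Intel(R) Xeon(R) CPU E5645 @ 2.40GHz",
]
print(reported_cpu_count(sample))
```

Feed it the lines of whatever file or paste holds the client's startup messages and compare the result with the number of threads you expect.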
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[
Profile Eric B

Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1296772 - Posted: 19 Oct 2012, 2:24:19 UTC

I did find that the SNB-E has this cc_config.xml file in the BOINC/projects directory (but it's named cc_config.xml.off), which I assume means it won't be read, and there is no corresponding file on the other system. Other than that, both are set to use 100% of the processors, and "use at most n processors" is set to 192, because on a few rare occasions I get to play with a very big server and 192 more than covers the number of threads that thing has.
I checked these settings on the website and also in the manager preferences menu.

Could this be causing the problem even though its name is cc_config.xml.off?
Is there any config file in ~/BOINC I can examine to help determine why it only runs 11 cpu tasks? All threads seem fully occupied if I go by the gkrellm display.
cat ~/BOINC/cc_config.xml.off
<cc_config>
    <log_flags>
        <cpu_sched>1</cpu_sched>
        <debt_debug>1</debt_debug>
        <cpu_sched_debug>1</cpu_sched_debug>
        <coproc_debug>1</coproc_debug>
        <cpu_sched>1</cpu_sched>
        <file_xfer>0</file_xfer>
        <file_xfer_debug>0</file_xfer_debug>
        <app_msg_send>1</app_msg_send>
        <app_msg_receive>1</app_msg_receive>
        <unparsed_xml>1</unparsed_xml>
        <work_fetch_debug>1</work_fetch_debug>
</log_flags>
</cc_config>

Profile Eric B

Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1296785 - Posted: 19 Oct 2012, 3:27:01 UTC

Oh, I missed answering one of your questions - yeah, I have plenty of cpu and gpu tasks on both machines, according to the boinc manager anyway (I hand counted well over 25 each of cpu and gpu before I stopped counting). I wrote a script to track some things, and here is its output. I don't claim it's 100% accurate: the estimates of "available" work are only estimates based on how I see the average work progress, so they could be off a bit, but it's darn close. My stats come from analyzing the client_state.xml file and deducing what things meant by looking at the boinc manager for clues, e.g. find WU xx_yy and see what its state was in the manager, then go find it in client_state.xml and see what I could learn. I think I have the ids of most of the states pretty well nailed down. There are actually 2 other states I haven't worked into my script yet, called "active_task_states": state 0 is "started but currently suspended" and state 1 is "actually executing". I'm always on the hunt for more info I can ferret out of that file and add to my script.
I do network uploads/downloads once a day and run this script via cron about 5 minutes before that. I'm watching for errors and so forth, because I find that if you try to run 3 cuda tasks you start to see some errors, maybe 7 out of 50 completed fermi tasks or so. That could just be the fermi SW, as it's the only linux fermi app out there that I know of anyway.

./boinc_stats.sh

                        Current BOINC info Thu Oct 18 20:06:50 PDT 2012
                               Active Tasks: sys1 13  sys2 12
                        Position by Avg Credit: 118   Position In USA: 78
                        RAC: 65,852.15   Position Based on Total Cedit: 180
                        sys1 Computation Errors  MB: 0  CUDA: 0 Freq msr: 0x2600  Act: 3.8 GHz
                        sys2 Computation Errors  MB: 0  CUDA: 0 Freq msr: 0x2900  Act: 4.1 GHz

                                                  CUDA     CUDA        MB       MB         MB
        Total  CUDA   MB  MB Ready  CUDA Rdy  Ready to  Uploads  Ready to  Uploads  Downloads  Available  Available  Average
System    WUs   WUs  WUs  to Start  to Start    Report  Pending    Report  Pending    Pending  CUDA Work    MB Work   Credit
----------------------------------------------------------------------------------------------------------------------------
  sys1   1966   677  668       507       489         1       97         1      138          0   3.7 days   5.0 days    27154
  sys2   1962  1397  574       443      1261        10       13        42       84          0   4.1 days  10.1 days    15541
----------------------------------------------------------------------------------------------------------------------------


sys1 is the 12-thread SNB-E and sys2 is the quad-core HT system.
Sorry for the formatting; there doesn't seem to be a way to get the script output to space out properly. You can try to copy and paste it into an editor with fixed spacing and it should be more readable.
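
For readers who want to try the same kind of client_state.xml digging, here is a rough Python sketch of counting tasks by active task state. It assumes each running task shows up as an `<active_task>` element containing an `<active_task_state>` value (element names are an assumption, not checked against this client version), and it uses the state meanings described above (0 = started but suspended, 1 = executing). The sample XML and the function name are made up for illustration; this is not the boinc_stats.sh script.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# State meanings as described in the post; other values stay unlabeled.
STATE_LABELS = {0: "started but suspended", 1: "executing"}


def active_task_state_counts(xml_text):
    """Count <active_task_state> values found anywhere in the document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for elem in root.iter("active_task_state"):
        text = (elem.text or "").strip()
        if text.isdigit():
            counts[int(text)] += 1
    return counts


# Hypothetical, heavily trimmed sample; a real client_state.xml is much larger.
SAMPLE = """
<client_state>
  <active_task_set>
    <active_task><result_name>wu_a</result_name><active_task_state>1</active_task_state></active_task>
    <active_task><result_name>wu_b</result_name><active_task_state>1</active_task_state></active_task>
    <active_task><result_name>wu_c</result_name><active_task_state>0</active_task_state></active_task>
  </active_task_set>
</client_state>
"""

for state, n in sorted(active_task_state_counts(SAMPLE).items()):
    print(n, STATE_LABELS.get(state, "state %d" % state))
```

On a real system you would read the file contents from the BOINC data directory and pass them in; comparing the count for state 1 against the thread count is a quick way to spot a missing task.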
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1296789 - Posted: 19 Oct 2012, 4:14:00 UTC - in response to Message 1296772.  
Last modified: 19 Oct 2012, 4:18:06 UTC

I would expect BOINC is ignoring your cc_config.xml.off file. You could modify your client config with <ncpus>N</ncpus> to force 12 CPU tasks to run, but that might just run 12 tasks on 11 cores.
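
As a sketch of what that client config would contain, the snippet below builds a minimal cc_config.xml body in Python. It assumes `<ncpus>` sits inside an `<options>` section, which is how the BOINC client configuration is normally laid out; the helper name `build_cc_config` is made up, and the value 12 is just this machine's thread count.

```python
import xml.etree.ElementTree as ET


def build_cc_config(ncpus):
    """Return a minimal cc_config document that overrides the CPU count."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "ncpus").text = str(ncpus)
    return ET.tostring(root, encoding="unicode")


print(build_cc_config(12))
```

The result would go into cc_config.xml in the BOINC data directory (merged with any existing log_flags section), followed by a client restart. As noted above, it only changes how many CPUs BOINC believes it has, so it may not fix whatever is holding back the twelfth task.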

I am guessing the machine in question is host 4520457, which does show BOINC reporting 12 cores.
If you have something limiting the processor affinity of BOINC to not use one of the processors/cores that might explain what is happening.

Are you seeing only 11 instances of AK_V8_linux64_ssse3 running then?
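
Both checks (how many instances are running, and whether any CPU is excluded by affinity) can be sketched together in Python. This is a Linux-only illustration under stated assumptions: it reads /proc directly, the helper names are made up, and `os.sched_getaffinity` is only available on platforms that support it.

```python
import os


def pids_matching(name, proc_root="/proc"):
    """Return PIDs whose command line contains `name` (reads Linux /proc)."""
    if not os.path.isdir(proc_root):
        return []
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                cmdline = fh.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        if name in cmdline:
            pids.append(int(entry))
    return sorted(pids)


def excluded_cpus(allowed, total):
    """CPUs in range(total) that are missing from the allowed set."""
    return sorted(set(range(total)) - set(allowed))


if __name__ == "__main__":
    get_affinity = getattr(os, "sched_getaffinity", None)
    pids = pids_matching("AK_V8_linux64_ssse3")
    print(len(pids), "matching processes")
    for pid in pids:
        if get_affinity is None:
            break  # affinity query not supported on this platform
        try:
            allowed = get_affinity(pid)
        except OSError:
            continue  # process went away or access was denied
        print(pid, "excluded CPUs:", excluded_cpus(allowed, os.cpu_count() or 1))
```

If every process reports an empty list of excluded CPUs, affinity is probably not what is limiting the machine to 11 CPU tasks.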

EDIT: Also the BBCode tags [ pre ] [ /pre ] are for preformatted text.
Profile Eric B

Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 1296822 - Posted: 19 Oct 2012, 6:10:07 UTC


Thanks for the tags tip! The [] part kinda threw me for a minute and made me wonder why <tag></tag> didn't
do anything; then I realized the format was not standard, and I would assume that's to prevent attacks.
Yeah, the post is better looking now. I also re-wrapped some stuff for easier reading.

> ps aux|grep AK|grep -v grep |wc -l
11
From boinc mgr msgs:
Sun 14 Oct 2012 03:15:10 AM PDT		Starting BOINC client version 6.10.58 for x86_64-pc-linux-gnu
Sun 14 Oct 2012 03:15:10 AM PDT		Config: GUI RPC allowed from:
Sun 14 Oct 2012 03:15:10 AM PDT		Config:   192.168.1.17
Sun 14 Oct 2012 03:15:10 AM PDT		Config:   192.168.1.103
Sun 14 Oct 2012 03:15:10 AM PDT		log flags: file_xfer, sched_ops, task
Sun 14 Oct 2012 03:15:10 AM PDT		Libraries: libcurl/7.18.0 OpenSSL/0.9.8g zlib/1.2.5 c-ares/1.5.1
Sun 14 Oct 2012 03:15:10 AM PDT		Data directory: /home/erbenton/BOINC
Sun 14 Oct 2012 03:15:10 AM PDT		Processor: 12 GenuineIntel Intel(R) Core(TM) i7-3960X CPU @ 3.30GHz [Family 6 Model 45 Stepping 7]
Sun 14 Oct 2012 03:15:10 AM PDT		Processor: 15.00 MB cache
Sun 14 Oct 2012 03:15:10 AM PDT		Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
                                                            dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc
                                                            arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni 
Sun 14 Oct 2012 03:15:10 AM PDT		OS: Linux: 3.1.10cstm-1.16-cstm
Sun 14 Oct 2012 03:15:10 AM PDT		Memory: 15.63 GB physical, 512.00 MB virtual
Sun 14 Oct 2012 03:15:10 AM PDT		Disk: 52.87 GB total, 19.04 GB free
Sun 14 Oct 2012 03:15:10 AM PDT		Local time is UTC -7 hours
Sun 14 Oct 2012 03:15:10 AM PDT		NVIDIA GPU 0: GeForce GTX 460 (driver version unknown, CUDA version 4020, 
                                                      compute capability 2.1, 1024MB, 641 GFLOPS peak)
Sun 14 Oct 2012 03:15:10 AM PDT	SETI@home	Found app_info.xml; using anonymous platform
Sun 14 Oct 2012 03:15:10 AM PDT	SETI@home	URL http://setiathome.berkeley.edu/; Computer ID 4520457; resource share 100
Sun 14 Oct 2012 03:15:10 AM PDT		General prefs: from http://milkyway.cs.rpi.edu/milkyway/ (last modified 29-May-2011 00:31:18)
Sun 14 Oct 2012 03:15:10 AM PDT		Host location: none
Sun 14 Oct 2012 03:15:10 AM PDT		General prefs: using your defaults
Sun 14 Oct 2012 03:15:10 AM PDT		Reading preferences override file
Sun 14 Oct 2012 03:15:10 AM PDT		Preferences:
Sun 14 Oct 2012 03:15:10 AM PDT		   max memory usage when active: 8002.02MB
Sun 14 Oct 2012 03:15:10 AM PDT		   max memory usage when idle: 12803.23MB
Sun 14 Oct 2012 03:15:10 AM PDT		   max disk usage: 4.00GB
Sun 14 Oct 2012 03:15:10 AM PDT		   (to change preferences, visit the web site of an attached project, 
                                           or select Preferences in the Manager)



Could it be that Milkyway at home entry? I have not been able to figure out how to fully get rid of it,
e.g. it never shows in the boinc mgr but it's in various files. How do I completely clean that thing out
of there? It seems a likely candidate for trouble in my case, so it would be good to clean it out and
see if that clears up the missing instance problem.

I wonder if it's safe to just go delete all these references?
>grep milkyway * |less
Highlander
Joined: 5 Oct 99
Posts: 167
Credit: 37,987,668
RAC: 16
Germany
Message 1296839 - Posted: 19 Oct 2012, 9:16:41 UTC

about the milkyway "problem":

General prefs: from http://milkyway.cs.rpi.edu/milkyway/ (last modified 29-May-2011 00:31:18)


The computing preferences are a boinc-wide setup; only the most recently modified ones are used. IMHO this is no problem. If you would like to change it, go to http://setiathome.berkeley.edu/prefs.php?subset=global here at seti, do an edit + save, and follow it with a boinc-client update; then it is done.

Judging from your boinc startup log, there isn't a milkyway directory in your projects directory at all...

For calculating on gpus, my personal opinion is that a free thread/core for feeding the gpu is very welcome. OK, that isn't helping a lot... If you had 2 gtx460 in your 12-thread machine, then the missing thread could be explained. But try changing
<avg_ncpus>0.250</avg_ncpus>
<max_ncpus>0.50</max_ncpus>

to
<avg_ncpus>0.2</avg_ncpus>
<max_ncpus>0.2</max_ncpus>

+ restart boinc
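
To make the numbers behind that suggestion concrete, here is a small Python sketch of the arithmetic. It is only a sum over the app_info.xml values (one GTX460, `<count>0.50</count>`, and the two `avg_ncpus` settings); it is not a model of how the BOINC scheduler actually rounds or reserves CPUs, and the function name is made up.

```python
import math


def gpu_cpu_budget(num_gpus, coproc_count, avg_ncpus):
    """Return (concurrent GPU tasks, total CPU fraction budgeted for them)."""
    tasks = int(round(num_gpus / coproc_count))
    return tasks, tasks * avg_ncpus


# One GPU shared by tasks that each claim half of it.
before = gpu_cpu_budget(1, 0.50, 0.25)
after = gpu_cpu_budget(1, 0.50, 0.2)
print("current settings:  ", before)
print("suggested settings:", after)
assert math.isclose(before[1] - after[1], 0.1)
```

With either setting the two cuda tasks together claim less than one full CPU, so the change is an experiment to see whether the scheduler then starts the twelfth CPU task.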
- Performance is not a simple linear function of the number of CPUs you throw at the problem. -


 