I have a new system, expected runtimes?

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1872103 - Posted: 10 Jun 2017, 0:58:47 UTC - in response to Message 1872098.  

Another update, with 1 module in either socket, the runtime has reduced from ~10 hrs to ~5 hrs using the AVX app

That sounds much better. I suspect having at least dual-channel memory operation would still give a significant boost, but at least the present runtimes aren't nearly as ridiculous as they were before.
Would be worth re-checking the CPU clock speed & temperatures. They should still be at maximum speed, but I'd expect the temperature to have picked up a bit (or compare the power usage figures). Going from the SSSE3 application (at the time) to the AVX application, I had to replace my i7's stock cooler with an aftermarket one, as it made the CPU work that much harder.
Grant
Darwin NT
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1872104 - Posted: 10 Jun 2017, 1:01:03 UTC - in response to Message 1872098.  

Another update, with 1 module in either socket, the runtime has reduced from ~10 hrs to ~5 hrs using the AVX app

With 1 DIMM per CPU socket, lower CPU times would make sense to me, given that each CPU then has direct access to its own memory instead of one CPU having to access memory via the QPI link to the other CPU.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872105 - Posted: 10 Jun 2017, 1:01:17 UTC - in response to Message 1872103.  
Last modified: 10 Jun 2017, 1:05:20 UTC

Another update, with 1 module in either socket, the runtime has reduced from ~10 hrs to ~5 hrs using the AVX app

That sounds much better. I suspect having at least dual-channel memory operation would still give a significant boost, but at least the present runtimes aren't nearly as ridiculous as they were before.
Would be worth re-checking the CPU clock speed & temperatures. They should still be at maximum speed, but I'd expect the temperature to have picked up a bit (or compare the power usage figures). Going from the SSSE3 application (at the time) to the AVX application, I had to replace my i7's stock cooler with an aftermarket one, as it made the CPU work that much harder.


The core clock is still 3005 MHz and the package temp is 53°C in a room that is about 12°C

Except I am still noticing that once in a while a few tasks will still run to 9+ hrs
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1872111 - Posted: 10 Jun 2017, 1:42:31 UTC - in response to Message 1872105.  
Last modified: 10 Jun 2017, 1:44:06 UTC

Except I am still noticing that once in a while a few tasks will still run to 9+ hrs

A result of memory contention, combined with feeding the GPU, is my guess.
One module in each bank should reduce overall runtimes and make those outliers less frequent.

Also, if you use an app_config to reserve 1 CPU core for each GPU WU, it should reduce those occasional extra-long runtimes even with the limited memory.
<app_config>
  <app>
    <name>setiathome_v8</name>
    <gpu_versions>
      <gpu_usage>1.00</gpu_usage>
      <cpu_usage>1.00</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
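(To apply it, save the above as app_config.xml in the setiathome.berkeley.edu project folder under the BOINC data directory, then use Options -> Read config files in BOINC Manager; this is standard app_config handling, so no client restart should be needed.)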


And if you set sbs to 1024 and period_iterations to 1 (for a dedicated cruncher; for a general-use system try 5, 10 or 30, whichever has the least impact on usability) you should get a bit more out of your GPU.
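For reference, those tunables normally go into the OpenCL (SoG) app's command-line text file rather than app_config.xml. A minimal sketch of the file's contents, assuming a Lunatics-style build (the exact mb_cmdline*.txt file name varies by app version, and -sbs / -period_iterations_num are the flag spellings I'd expect; check your build's ReadMe):
-sbs 1024 -period_iterations_num 1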
Grant
Darwin NT
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1872126 - Posted: 10 Jun 2017, 4:16:43 UTC - in response to Message 1872098.  

Another update, with 1 module in either socket, the runtime has reduced from ~10 hrs to ~5 hrs using the AVX app


That is GREAT news!

Tom
A proud member of the OFA (Old Farts Association).
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872128 - Posted: 10 Jun 2017, 4:34:38 UTC - in response to Message 1872126.  

Another update, with 1 module in either socket, the runtime has reduced from ~10 hrs to ~5 hrs using the AVX app


That is GREAT news!

Tom


It is good news, but because memory is still being heavily contended some units, especially those with Angle Ranges of ~0.44 (mid-AR), run up to 9 hrs. I have seen 2.x AR (VHAR) tasks running for 5 hrs, and VLAR tasks take about 6 hrs with this new memory configuration. Plus I have a GPU taking time to do work as well.
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872270 - Posted: 10 Jun 2017, 17:15:18 UTC

So another update: I have installed PCM - Processor Counter Monitor
And in there I can see memory bandwidth utilisation
And wow is my memory getting hammered!
18GB/s on single channel DDR3-1333(PC3-10600R) memory, and because my motherboard does weird things with memory, if I only populate 1 slot of a channel it halves the memory bandwidth!
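(For context: DDR3-1333 peaks at 1333 MT/s x 8 bytes, i.e. about 10.7 GB/s per channel, so with one active channel per socket the theoretical aggregate is roughly 21.3 GB/s. A measured ~18 GB/s is over 80% of that peak, which backs up the "hammered" reading.)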
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1872329 - Posted: 11 Jun 2017, 1:20:02 UTC - in response to Message 1872270.  

So another update: I have installed PCM - Processor Counter Monitor
And in there I can see memory bandwidth utilisation
And wow is my memory getting hammered!
18GB/s on single channel DDR3-1333(PC3-10600R) memory, and because my motherboard does weird things with memory, if I only populate 1 slot of a channel it halves the memory bandwidth!


That is my argument for using the preferences on the SETI/BOINC website (or locally in the BOINC Manager) to reduce the number of CPUs you are trying to use. It will reduce the memory contention until you are able to add more memory, and I think it will increase productivity. If the test goal were to reduce processing to, say, four (4) cores, you would set the percentage of CPUs to use to (4/36 actual cores) = 11%.

It is also very possible the paging file on your hard disk is being hit on a full-time basis (which slows life down). The best fix for that is more memory :) But there are two free things you can do that MIGHT help. 1) Set your paging file minimum and maximum to whatever Windows currently recommends (re-visit it every time you add memory). Reboot as required. 2) Download the free version of "Defraggler" from Piriform. Under Settings -> Boot time defrag -> do once. What this will do is defrag your paging file so that it has the fastest possible access. This might help your memory contention issue some, and it will help system "responsiveness". It does take time, so don't freak out if it takes 5+ minutes to defrag the paging file (that is why I only do it "once").

The above trick will help, for instance, a PC/laptop/netbook that is pausing but otherwise has sufficient memory. It gets less "laggy".

HTH,
Tom
A proud member of the OFA (Old Farts Association).
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1872345 - Posted: 11 Jun 2017, 1:57:02 UTC - in response to Message 1872329.  

It is also very possible the paging file on your hard disk is being hit on a full-time basis (which slows life down). The best fix for that is more memory :) But there are two free things you can do that MIGHT help. 1) Set your paging file minimum and maximum to whatever Windows currently recommends (re-visit it every time you add memory). Reboot as required. 2) Download the free version of "Defraggler" from Piriform. Under Settings -> Boot time defrag -> do once. What this will do is defrag your paging file so that it has the fastest possible access.

Or better yet, use a SSD.
Grant
Darwin NT
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872360 - Posted: 11 Jun 2017, 2:51:25 UTC - in response to Message 1872329.  
Last modified: 11 Jun 2017, 2:53:49 UTC

So another update: I have installed PCM - Processor Counter Monitor
And in there I can see memory bandwidth utilisation
And wow is my memory getting hammered!
18GB/s on single channel DDR3-1333(PC3-10600R) memory, and because my motherboard does weird things with memory, if I only populate 1 slot of a channel it halves the memory bandwidth!


That is my argument for using the preferences on the SETI/BOINC website (or locally in the BOINC Manager) to reduce the number of CPUs you are trying to use. It will reduce the memory contention until you are able to add more memory, and I think it will increase productivity. If the test goal were to reduce processing to, say, four (4) cores, you would set the percentage of CPUs to use to (4/36 actual cores) = 11%.

It is also very possible the paging file on your hard disk is being hit on a full-time basis (which slows life down). The best fix for that is more memory :) But there are two free things you can do that MIGHT help. 1) Set your paging file minimum and maximum to whatever Windows currently recommends (re-visit it every time you add memory). Reboot as required. 2) Download the free version of "Defraggler" from Piriform. Under Settings -> Boot time defrag -> do once. What this will do is defrag your paging file so that it has the fastest possible access. This might help your memory contention issue some, and it will help system "responsiveness". It does take time, so don't freak out if it takes 5+ minutes to defrag the paging file (that is why I only do it "once").

The above trick will help, for instance, a PC/laptop/netbook that is pausing but otherwise has sufficient memory. It gets less "laggy".

HTH,
Tom


It's not hitting my paging file at all. I'll grab a snippet of what is happening from PCM:
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  1: Reads (MB/s):    -1.00 --||-- Mem Ch  1: Reads (MB/s):  5795.37 --|
|--            Writes(MB/s):    -1.00 --||--            Writes(MB/s):  3507.86 --|
|-- Mem Ch  3: Reads (MB/s):  5527.16 --||-- Mem Ch  3: Reads (MB/s):    -1.00 --|
|--            Writes(MB/s):  2983.31 --||--            Writes(MB/s):    -1.00 --|
|-- NODE 0 Mem Read (MB/s) :  5527.16 --||-- NODE 1 Mem Read (MB/s) :  5795.37 --|
|-- NODE 0 Mem Write(MB/s) :  2983.31 --||-- NODE 1 Mem Write(MB/s) :  3507.86 --|
|-- NODE 0 P. Write (T/s):    1665388 --||-- NODE 1 P. Write (T/s):    2343303 --|
|-- NODE 0 Memory (MB/s):     8510.47 --||-- NODE 1 Memory (MB/s):     9303.23 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                 System Read Throughput(MB/s):      11322.53                --|
|--                System Write Throughput(MB/s):       6491.17                --|
|--               System Memory Throughput(MB/s):      17813.70                --|
|---------------------------------------||---------------------------------------|

That is it running 27 CPU tasks + 1 GPU task. So what is happening is that the app is waiting for data from memory, and it waits and waits, but it still uses 100% of the thread until it has that data.
Say I reduce it to 4 CPU tasks and each of them runs for 1.5 hrs, versus leaving it at 27 tasks with each running for 5 hrs. This PC would be more productive doing 27 tasks in a 5 hr period than the 12-14 tasks it would complete in the same 5 hrs if I reduced the count.
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1872379 - Posted: 11 Jun 2017, 5:02:19 UTC - in response to Message 1872360.  

So another update: I have installed PCM - Processor Counter Monitor
And in there I can see memory bandwidth utilisation
And wow is my memory getting hammered!
18GB/s on single channel DDR3-1333(PC3-10600R) memory, and because my motherboard does weird things with memory, if I only populate 1 slot of a channel it halves the memory bandwidth!


That is my argument for using the preferences on the SETI/BOINC website (or locally in the BOINC Manager) to reduce the number of CPUs you are trying to use. It will reduce the memory contention until you are able to add more memory, and I think it will increase productivity. If the test goal were to reduce processing to, say, four (4) cores, you would set the percentage of CPUs to use to (4/36 actual cores) = 11%.

It is also very possible the paging file on your hard disk is being hit on a full-time basis (which slows life down). The best fix for that is more memory :) But there are two free things you can do that MIGHT help. 1) Set your paging file minimum and maximum to whatever Windows currently recommends (re-visit it every time you add memory). Reboot as required. 2) Download the free version of "Defraggler" from Piriform. Under Settings -> Boot time defrag -> do once. What this will do is defrag your paging file so that it has the fastest possible access. This might help your memory contention issue some, and it will help system "responsiveness". It does take time, so don't freak out if it takes 5+ minutes to defrag the paging file (that is why I only do it "once").

The above trick will help, for instance, a PC/laptop/netbook that is pausing but otherwise has sufficient memory. It gets less "laggy".

HTH,
Tom


It's not hitting my paging file at all. I'll grab a snippet of what is happening from PCM:
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  1: Reads (MB/s):    -1.00 --||-- Mem Ch  1: Reads (MB/s):  5795.37 --|
|--            Writes(MB/s):    -1.00 --||--            Writes(MB/s):  3507.86 --|
|-- Mem Ch  3: Reads (MB/s):  5527.16 --||-- Mem Ch  3: Reads (MB/s):    -1.00 --|
|--            Writes(MB/s):  2983.31 --||--            Writes(MB/s):    -1.00 --|
|-- NODE 0 Mem Read (MB/s) :  5527.16 --||-- NODE 1 Mem Read (MB/s) :  5795.37 --|
|-- NODE 0 Mem Write(MB/s) :  2983.31 --||-- NODE 1 Mem Write(MB/s) :  3507.86 --|
|-- NODE 0 P. Write (T/s):    1665388 --||-- NODE 1 P. Write (T/s):    2343303 --|
|-- NODE 0 Memory (MB/s):     8510.47 --||-- NODE 1 Memory (MB/s):     9303.23 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                 System Read Throughput(MB/s):      11322.53                --|
|--                System Write Throughput(MB/s):       6491.17                --|
|--               System Memory Throughput(MB/s):      17813.70                --|
|---------------------------------------||---------------------------------------|

That is it running 27 CPU tasks + 1 GPU task. So what is happening is that the app is waiting for data from memory, and it waits and waits, but it still uses 100% of the thread until it has that data.
Say I reduce it to 4 CPU tasks and each of them runs for 1.5 hrs, versus leaving it at 27 tasks with each running for 5 hrs. This PC would be more productive doing 27 tasks in a 5 hr period than the 12-14 tasks it would complete in the same 5 hrs if I reduced the count.

I would probably start BOINC with an affinity mask so it only uses CPU 0, and then have the 2 DIMMs in slots for CPU 0.
Something along the lines of "start /affinity FFFF C:\BOINC\boinc.exe", where FFFF is a hex bit mask covering logical CPUs 0-15. Alternatively, if you are using NUMA you could simplify the command to "start /NODE 0 C:\BOINC\boinc.exe".
Also making use of the cc_config.xml option for <ncpus>, or telling BOINC to only use 50% of the CPUs, would be needed. Otherwise it would try to run 32 tasks across 16 CPUs.
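A minimal cc_config.xml sketch for the <ncpus> route (this is the standard BOINC client configuration file, placed in the BOINC data directory; 16 here is just an example value):
<cc_config>
  <options>
    <ncpus>16</ncpus>
  </options>
</cc_config>
After saving it, re-read config files or restart the client for it to take effect.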
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872922 - Posted: 14 Jun 2017, 15:13:12 UTC

So I just installed 2 more sticks/modules into my system. I'll report on the runtimes once they have had a chance to run, with 26 tasks active at one time; no GPU this time, as that is not being used
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872948 - Posted: 14 Jun 2017, 18:08:31 UTC

Update: with 2 channels occupied, preliminary results are around 9k seconds per task with 26 tasks running at once
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1872979 - Posted: 14 Jun 2017, 22:23:28 UTC - in response to Message 1872948.  

Update: with 2 channels occupied, preliminary results are around 9k seconds per task with 26 tasks running at once


I read that as 2.5 hours/task. That is WAY better! I remember reading as high as 9+ hours/task.

Tom
A proud member of the OFA (Old Farts Association).
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1872982 - Posted: 14 Jun 2017, 22:40:40 UTC - in response to Message 1872948.  

Update: with 2 channels occupied, preliminary results are around 9k seconds per task with 26 tasks running at once

That is looking much better. That is around the upper limit for tasks on mine, running 32 tasks at once with 8 DIMMs (4 per CPU) in quad-channel mode.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1873058 - Posted: 15 Jun 2017, 6:03:39 UTC - in response to Message 1872982.  

Update: with 2 channels occupied, preliminary results are around 9k seconds per task with 26 tasks running at once

That is looking much better. That is around the upper limit for tasks on mine, running 32 tasks at once with 8 DIMMs (4 per CPU) in quad-channel mode.


So I am going to spend about another $120 on more DIMMs, so I can occupy the last 2 channels
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1873060 - Posted: 15 Jun 2017, 6:11:54 UTC - in response to Message 1872922.  
Last modified: 15 Jun 2017, 6:12:50 UTC

So I just installed 2 more sticks/modules into my system. I'll report on the runtimes once they have had a chance to run, with 26 tasks active at one time; no GPU this time, as that is not being used

So at present you've got 1 module in each bank? (Blue slots, I'm guessing, and the ones closest to the CPUs?)
Grant
Darwin NT
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1873270 - Posted: 16 Jun 2017, 6:53:16 UTC

For those that are interested, Tom's Hardware posted an article looking at the new Mesh architecture Intel are using for their high core count/multi socket CPUs to replace their long standing Ring Bus architecture.
... on the Broadwell LCC (Low Core Count) die... for instance, moving data from one core to its closest neighbor requires one cycle. Moving data to more distant cores requires more cycles, thus increasing the latency associated with data transit. It can take up to 12 cycles to reach the most distant core...
The larger HCC (High Core Count) die exposes one of the problems with this approach. To increase the cores and cache, the HCC die employs dual ring buses. Communication between the two rings has to flow through a buffered switch (seen between the two rings at the top and bottom). Traversing the switch imposes a five-cycle penalty, and that is before the data has to continue through more hops to its destination.

So now I can see why, even though Seti work itself doesn't benefit from huge amounts of memory bandwidth, Kiska's runtimes were so high with the original setup of both DIMMs on the one CPU socket.
Intel mesh architecture.
Grant
Darwin NT
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1873271 - Posted: 16 Jun 2017, 7:04:52 UTC - in response to Message 1873270.  

For those that are interested, Tom's Hardware posted an article looking at the new Mesh architecture Intel are using for their high core count/multi socket CPUs to replace their long standing Ring Bus architecture.
... on the Broadwell LCC (Low Core Count) die... for instance, moving data from one core to its closest neighbor requires one cycle. Moving data to more distant cores requires more cycles, thus increasing the latency associated with data transit. It can take up to 12 cycles to reach the most distant core...
The larger HCC (High Core Count) die exposes one of the problems with this approach. To increase the cores and cache, the HCC die employs dual ring buses. Communication between the two rings has to flow through a buffered switch (seen between the two rings at the top and bottom). Traversing the switch imposes a five-cycle penalty, and that is before the data has to continue through more hops to its destination.

So now I can see why, even though Seti work itself doesn't benefit from huge amounts of memory bandwidth, Kiska's runtimes were so high with the original setup of both DIMMs on the one CPU socket.
Intel mesh architecture.


Even with 1 DIMM in each socket
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1873464 - Posted: 17 Jun 2017, 0:18:56 UTC - in response to Message 1873270.  
Last modified: 17 Jun 2017, 0:19:09 UTC

For those that are interested, Tom's Hardware posted an article looking at the new Mesh architecture Intel are using for their high core count/multi socket CPUs to replace their long standing Ring Bus architecture.
... on the Broadwell LCC (Low Core Count) die... for instance, moving data from one core to its closest neighbor requires one cycle. Moving data to more distant cores requires more cycles, thus increasing the latency associated with data transit. It can take up to 12 cycles to reach the most distant core...
The larger HCC (High Core Count) die exposes one of the problems with this approach. To increase the cores and cache, the HCC die employs dual ring buses. Communication between the two rings has to flow through a buffered switch (seen between the two rings at the top and bottom). Traversing the switch imposes a five-cycle penalty, and that is before the data has to continue through more hops to its destination.

So now I can see why, even though Seti work itself doesn't benefit from huge amounts of memory bandwidth, Kiska's runtimes were so high with the original setup of both DIMMs on the one CPU socket.
Intel mesh architecture.

Not populating all of the memory channels for a CPU reminds me of the saying "You can't put 10lbs of 'stuff' into a 5lb box"
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]