I have a new system, expected runtimes?

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1872103 - Posted: 10 Jun 2017, 0:58:47 UTC - in response to Message 1872098.  

Another update, with 1 module in either socket, the runtime has reduced from ~10 hrs to ~5 hrs using the AVX app

That sounds much better. I suspect having at least dual-channel memory operation would still give a significant boost, but at least the present runtimes aren't nearly as ridiculous as they were before.
Would be worth re-checking the CPU clock speed & temperatures. They should still be at maximum speed, but I'd expect the temperature to have picked up a bit (or compare the power usage figures). Going from the SSSE3 application (at the time) to the AVX application, I had to replace my i7's stock cooler with an aftermarket one, as it made the CPU work that much harder.
Grant
Darwin NT
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1872104 - Posted: 10 Jun 2017, 1:01:03 UTC - in response to Message 1872098.  

Another update, with 1 module in either socket, the runtime has reduced from ~10 hrs to ~5 hrs using the AVX app

With 1 DIMM per CPU socket, lower CPU times would make sense to me, given that each CPU then has direct access to its own memory instead of one CPU having to access memory via the QPI link to the other CPU.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872105 - Posted: 10 Jun 2017, 1:01:17 UTC - in response to Message 1872103.  
Last modified: 10 Jun 2017, 1:05:20 UTC

Another update, with 1 module in either socket, the runtime has reduced from ~10 hrs to ~5 hrs using the AVX app

That sounds much better. I suspect having at least dual-channel memory operation would still give a significant boost, but at least the present runtimes aren't nearly as ridiculous as they were before.
Would be worth re-checking the CPU clock speed & temperatures. They should still be at maximum speed, but I'd expect the temperature to have picked up a bit (or compare the power usage figures). Going from the SSSE3 application (at the time) to the AVX application, I had to replace my i7's stock cooler with an aftermarket one, as it made the CPU work that much harder.


The core clock is still 3005 MHz and the package temp is 53°C in a room that is about 12°C

Except I am still noticing that once in a while a few tasks will still run to 9+ hrs
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1872111 - Posted: 10 Jun 2017, 1:42:31 UTC - in response to Message 1872105.  
Last modified: 10 Jun 2017, 1:44:06 UTC

Except I am still noticing that once in a while a few tasks will still run to 9+ hrs

A result of memory contention, combined with feeding the GPU, is my guess.
One module in each bank should reduce overall runtimes and make those outliers less frequent.

Also, if you use an app_config to reserve 1 CPU core for each GPU WU, it should reduce those occasional extra-long runtimes even with the limited memory.
<app_config>
  <app>
    <name>setiathome_v8</name>
    <gpu_versions>
      <gpu_usage>1.00</gpu_usage>
      <cpu_usage>1.00</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
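(To apply it, save the above as app_config.xml in the setiathome.berkeley.edu project folder under the BOINC data directory, then use Options -> Read config files in BOINC Manager; this is standard app_config handling, so no client restart should be needed.)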


And if you set sbs to 1024 and period_iterations to 1 (for a dedicated cruncher; for a general-use system try 5, 10 or 30, whichever has the least impact on usability) you should get a bit more out of your GPU.
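For reference, those tunables normally go into the OpenCL (SoG) app's command-line text file rather than app_config.xml. A minimal sketch of the file's contents, assuming a Lunatics-style build (the exact mb_cmdline*.txt file name varies by app version, and -sbs / -period_iterations_num are the flag spellings I'd expect; check your build's ReadMe):
-sbs 1024 -period_iterations_num 1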
Grant
Darwin NT
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1872126 - Posted: 10 Jun 2017, 4:16:43 UTC - in response to Message 1872098.  

Another update, with 1 module in either socket, the runtime has reduced from ~10 hrs to ~5 hrs using the AVX app


That is GREAT news!

Tom
A proud member of the OFA (Old Farts Association).
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872128 - Posted: 10 Jun 2017, 4:34:38 UTC - in response to Message 1872126.  

Another update, with 1 module in either socket, the runtime has reduced from ~10 hrs to ~5 hrs using the AVX app


That is GREAT news!

Tom


It is good news, but because memory is still being heavily contended some units, especially those with Angle Ranges of ~0.44 (mid-AR), run up to 9 hrs. I have seen 2.x AR (VHAR) tasks running for 5 hrs, and VLAR tasks take about 6 hrs with this new memory configuration. Plus I have a GPU taking time to do work as well.
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872270 - Posted: 10 Jun 2017, 17:15:18 UTC

So another update: I have installed PCM - Processor Counter Monitor
And in there I can see memory bandwidth utilisation
And wow is my memory getting hammered!
18GB/s on single channel DDR3-1333(PC3-10600R) memory, and because my motherboard does weird things with memory, if I only populate 1 slot of a channel it halves the memory bandwidth!
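(For context: DDR3-1333 peaks at 1333 MT/s x 8 bytes, i.e. about 10.7 GB/s per channel, so with one active channel per socket the theoretical aggregate is roughly 21.3 GB/s. A measured ~18 GB/s is over 80% of that peak, which backs up the "hammered" reading.)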
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1872329 - Posted: 11 Jun 2017, 1:20:02 UTC - in response to Message 1872270.  

So another update: I have installed PCM - Processor Counter Monitor
And in there I can see memory bandwidth utilisation
And wow is my memory getting hammered!
18GB/s on single channel DDR3-1333(PC3-10600R) memory, and because my motherboard does weird things with memory, if I only populate 1 slot of a channel it halves the memory bandwidth!


That is my argument for using the preferences on the SETI/BOINC website (or locally in the BOINC Manager) to reduce the number of CPUs you are trying to use. It will reduce the memory contention until you are able to add more memory, and I think it will increase productivity. If the test goal were to reduce processing to, say, four (4) cores, you would set the percentage of CPUs to use to (4/36 actual cores) = 11%.

It is also very possible the paging file on your hard disk is being hit on a full-time basis (which slows life down). The best fix for that is more memory :) But there are two free things you can do that MIGHT help. 1) Set your paging file minimum and maximum to whatever Windows currently recommends (re-visit it every time you add memory). Reboot as required. 2) Download the free version of "Defraggler" from Piriform. Under Settings -> Boot time defrag -> do once. What this will do is defrag your paging file so that it has the fastest possible access. This might help your memory contention issue some, and it will help system "responsiveness". It does take time, so don't freak out if it takes 5+ minutes to defrag the paging file (that is why I only do it "once").

The above trick will help, for instance, a PC/laptop/netbook that is pausing but otherwise has sufficient memory. It gets less "laggy".

HTH,
Tom
A proud member of the OFA (Old Farts Association).
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1872345 - Posted: 11 Jun 2017, 1:57:02 UTC - in response to Message 1872329.  

It is also very possible the paging file on your hard disk is being hit on a full-time basis (which slows life down). The best fix for that is more memory :) But there are two free things you can do that MIGHT help. 1) Set your paging file minimum and maximum to whatever Windows currently recommends (re-visit it every time you add memory). Reboot as required. 2) Download the free version of "Defraggler" from Piriform. Under Settings -> Boot time defrag -> do once. What this will do is defrag your paging file so that it has the fastest possible access.

Or better yet, use a SSD.
Grant
Darwin NT
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872360 - Posted: 11 Jun 2017, 2:51:25 UTC - in response to Message 1872329.  
Last modified: 11 Jun 2017, 2:53:49 UTC

So another update: I have installed PCM - Processor Counter Monitor
And in there I can see memory bandwidth utilisation
And wow is my memory getting hammered!
18GB/s on single channel DDR3-1333(PC3-10600R) memory, and because my motherboard does weird things with memory, if I only populate 1 slot of a channel it halves the memory bandwidth!


That is my argument for using the preferences on the SETI/BOINC website (or locally in the BOINC Manager) to reduce the number of CPUs you are trying to use. It will reduce the memory contention until you are able to add more memory, and I think it will increase productivity. If the test goal were to reduce processing to, say, four (4) cores, you would set the percentage of CPUs to use to (4/36 actual cores) = 11%.

It is also very possible the paging file on your hard disk is being hit on a full-time basis (which slows life down). The best fix for that is more memory :) But there are two free things you can do that MIGHT help. 1) Set your paging file minimum and maximum to whatever Windows currently recommends (re-visit it every time you add memory). Reboot as required. 2) Download the free version of "Defraggler" from Piriform. Under Settings -> Boot time defrag -> do once. What this will do is defrag your paging file so that it has the fastest possible access. This might help your memory contention issue some, and it will help system "responsiveness". It does take time, so don't freak out if it takes 5+ minutes to defrag the paging file (that is why I only do it "once").

The above trick will help, for instance, a PC/laptop/netbook that is pausing but otherwise has sufficient memory. It gets less "laggy".

HTH,
Tom


It's not hitting my paging file at all. I'll grab a snippet of what is happening from PCM:
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  1: Reads (MB/s):    -1.00 --||-- Mem Ch  1: Reads (MB/s):  5795.37 --|
|--            Writes(MB/s):    -1.00 --||--            Writes(MB/s):  3507.86 --|
|-- Mem Ch  3: Reads (MB/s):  5527.16 --||-- Mem Ch  3: Reads (MB/s):    -1.00 --|
|--            Writes(MB/s):  2983.31 --||--            Writes(MB/s):    -1.00 --|
|-- NODE 0 Mem Read (MB/s) :  5527.16 --||-- NODE 1 Mem Read (MB/s) :  5795.37 --|
|-- NODE 0 Mem Write(MB/s) :  2983.31 --||-- NODE 1 Mem Write(MB/s) :  3507.86 --|
|-- NODE 0 P. Write (T/s):    1665388 --||-- NODE 1 P. Write (T/s):    2343303 --|
|-- NODE 0 Memory (MB/s):     8510.47 --||-- NODE 1 Memory (MB/s):     9303.23 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                 System Read Throughput(MB/s):      11322.53                --|
|--                System Write Throughput(MB/s):       6491.17                --|
|--               System Memory Throughput(MB/s):      17813.70                --|
|---------------------------------------||---------------------------------------|

That is it running 27 CPU tasks + 1 GPU task. So what is happening is that the app is waiting for data from memory, and it waits and waits, but it still uses 100% of the thread until it has that data.
Say I reduce it to 4 CPU tasks and each of them runs for 1.5 hrs, versus leaving it at 27 tasks with each running for 5 hrs. This PC would be more productive doing 27 tasks in a 5 hr period than the 12-14 tasks it would complete in the same 5 hrs if I reduced the count.
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1872379 - Posted: 11 Jun 2017, 5:02:19 UTC - in response to Message 1872360.  

So another update: I have installed PCM - Processor Counter Monitor
And in there I can see memory bandwidth utilisation
And wow is my memory getting hammered!
18GB/s on single channel DDR3-1333(PC3-10600R) memory, and because my motherboard does weird things with memory, if I only populate 1 slot of a channel it halves the memory bandwidth!


That is my argument for using the preferences on the SETI/BOINC website (or locally in the BOINC Manager) to reduce the number of CPUs you are trying to use. It will reduce the memory contention until you are able to add more memory, and I think it will increase productivity. If the test goal were to reduce processing to, say, four (4) cores, you would set the percentage of CPUs to use to (4/36 actual cores) = 11%.

It is also very possible the paging file on your hard disk is being hit on a full-time basis (which slows life down). The best fix for that is more memory :) But there are two free things you can do that MIGHT help. 1) Set your paging file minimum and maximum to whatever Windows currently recommends (re-visit it every time you add memory). Reboot as required. 2) Download the free version of "Defraggler" from Piriform. Under Settings -> Boot time defrag -> do once. What this will do is defrag your paging file so that it has the fastest possible access. This might help your memory contention issue some, and it will help system "responsiveness". It does take time, so don't freak out if it takes 5+ minutes to defrag the paging file (that is why I only do it "once").

The above trick will help, for instance, a PC/laptop/netbook that is pausing but otherwise has sufficient memory. It gets less "laggy".

HTH,
Tom


It's not hitting my paging file at all. I'll grab a snippet of what is happening from PCM:
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  1: Reads (MB/s):    -1.00 --||-- Mem Ch  1: Reads (MB/s):  5795.37 --|
|--            Writes(MB/s):    -1.00 --||--            Writes(MB/s):  3507.86 --|
|-- Mem Ch  3: Reads (MB/s):  5527.16 --||-- Mem Ch  3: Reads (MB/s):    -1.00 --|
|--            Writes(MB/s):  2983.31 --||--            Writes(MB/s):    -1.00 --|
|-- NODE 0 Mem Read (MB/s) :  5527.16 --||-- NODE 1 Mem Read (MB/s) :  5795.37 --|
|-- NODE 0 Mem Write(MB/s) :  2983.31 --||-- NODE 1 Mem Write(MB/s) :  3507.86 --|
|-- NODE 0 P. Write (T/s):    1665388 --||-- NODE 1 P. Write (T/s):    2343303 --|
|-- NODE 0 Memory (MB/s):     8510.47 --||-- NODE 1 Memory (MB/s):     9303.23 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                 System Read Throughput(MB/s):      11322.53                --|
|--                System Write Throughput(MB/s):       6491.17                --|
|--               System Memory Throughput(MB/s):      17813.70                --|
|---------------------------------------||---------------------------------------|

That is it running 27 CPU tasks + 1 GPU task. So what is happening is that the app is waiting for data from memory, and it waits and waits, but it still uses 100% of the thread until it has that data.
Say I reduce it to 4 CPU tasks and each of them runs for 1.5 hrs, versus leaving it at 27 tasks with each running for 5 hrs. This PC would be more productive doing 27 tasks in a 5 hr period than the 12-14 tasks it would complete in the same 5 hrs if I reduced the count.

I would probably start BOINC with an affinity mask so it only uses CPU 0, and then have the 2 DIMMs in slots for CPU 0.
Something along the lines of "start /affinity FFFF C:\BOINC\boinc.exe", where FFFF is a hex bit mask covering logical CPUs 0-15. Alternatively, if you are using NUMA you could simplify the command to "start /NODE 0 C:\BOINC\boinc.exe".
Also making use of the cc_config.xml option for <ncpus>, or telling BOINC to only use 50% of the CPUs, would be needed. Otherwise it would try to run 32 tasks across 16 CPUs.
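A minimal cc_config.xml sketch for the <ncpus> route (this is the standard BOINC client configuration file, placed in the BOINC data directory; 16 here is just an example value):
<cc_config>
  <options>
    <ncpus>16</ncpus>
  </options>
</cc_config>
After saving it, re-read config files or restart the client for it to take effect.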
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872922 - Posted: 14 Jun 2017, 15:13:12 UTC

So I just installed 2 more sticks/modules into my system. I'll report on the runtimes once they have had a chance to run, with 26 tasks active at one time; no GPU this time, as that is not being used
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1872948 - Posted: 14 Jun 2017, 18:08:31 UTC

Update: with 2 channels occupied, preliminary results are around 9k seconds per task with 26 tasks running at once
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1872979 - Posted: 14 Jun 2017, 22:23:28 UTC - in response to Message 1872948.  

Update: with 2 channels occupied, preliminary results are around 9k seconds per task with 26 tasks running at once


I read that as 2.5 hours/task. That is WAY better! I remember reading as high as 9+ hours/task.

Tom
A proud member of the OFA (Old Farts Association).
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1872982 - Posted: 14 Jun 2017, 22:40:40 UTC - in response to Message 1872948.  

Update: with 2 channels occupied, preliminary results are around 9k seconds per task with 26 tasks running at once

That is looking much better. That is around the upper limit for tasks on mine, running 32 tasks at once with 8 DIMMs (4 per CPU) in quad-channel mode.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1873058 - Posted: 15 Jun 2017, 6:03:39 UTC - in response to Message 1872982.  

Update: with 2 channels occupied, preliminary results are around 9k seconds per task with 26 tasks running at once

That is looking much better. That is around the upper limit for tasks on mine, running 32 tasks at once with 8 DIMMs (4 per CPU) in quad-channel mode.


So I am going to spend about another $120 on more DIMMs, so I can occupy the last 2 channels
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1873060 - Posted: 15 Jun 2017, 6:11:54 UTC - in response to Message 1872922.  
Last modified: 15 Jun 2017, 6:12:50 UTC

So I just installed 2 more sticks/modules into my system. I'll report on the runtimes once they have had a chance to run, with 26 tasks active at one time; no GPU this time, as that is not being used

So at present you've got 1 module in each bank? (Blue slots, I'm guessing, and the ones closest to the CPUs?)
Grant
Darwin NT
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1873270 - Posted: 16 Jun 2017, 6:53:16 UTC

For those that are interested, Tom's Hardware posted an article looking at the new Mesh architecture Intel are using for their high core count/multi socket CPUs to replace their long standing Ring Bus architecture.
... on the Broadwell LCC (Low Core Count) die... for instance, moving data from one core to its closest neighbor requires one cycle. Moving data to more distant cores requires more cycles, thus increasing the latency associated with data transit. It can take up to 12 cycles to reach the most distant core...
The larger HCC (High Core Count) die exposes one of the problems with this approach. To increase the cores and cache, the HCC die employs dual ring buses. Communication between the two rings has to flow through a buffered switch (seen between the two rings at the top and bottom). Traversing the switch imposes a five-cycle penalty, and that is before the data has to continue through more hops to its destination.

So now I can see why, even though Seti work itself doesn't benefit from huge amounts of memory bandwidth, Kiska's runtimes were so high with the original setup of both DIMMs on the one CPU socket.
Intel mesh architecture.
Grant
Darwin NT
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1873271 - Posted: 16 Jun 2017, 7:04:52 UTC - in response to Message 1873270.  

For those that are interested, Tom's Hardware posted an article looking at the new Mesh architecture Intel are using for their high core count/multi socket CPUs to replace their long standing Ring Bus architecture.
... on the Broadwell LCC (Low Core Count) die... for instance, moving data from one core to its closest neighbor requires one cycle. Moving data to more distant cores requires more cycles, thus increasing the latency associated with data transit. It can take up to 12 cycles to reach the most distant core...
The larger HCC (High Core Count) die exposes one of the problems with this approach. To increase the cores and cache, the HCC die employs dual ring buses. Communication between the two rings has to flow through a buffered switch (seen between the two rings at the top and bottom). Traversing the switch imposes a five-cycle penalty, and that is before the data has to continue through more hops to its destination.

So now I can see why, even though Seti work itself doesn't benefit from huge amounts of memory bandwidth, Kiska's runtimes were so high with the original setup of both DIMMs on the one CPU socket.
Intel mesh architecture.


Even with 1 DIMM in each socket
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1873464 - Posted: 17 Jun 2017, 0:18:56 UTC - in response to Message 1873270.  
Last modified: 17 Jun 2017, 0:19:09 UTC

For those that are interested, Tom's Hardware posted an article looking at the new Mesh architecture Intel are using for their high core count/multi socket CPUs to replace their long standing Ring Bus architecture.
... on the Broadwell LCC (Low Core Count) die... for instance, moving data from one core to its closest neighbor requires one cycle. Moving data to more distant cores requires more cycles, thus increasing the latency associated with data transit. It can take up to 12 cycles to reach the most distant core...
The larger HCC (High Core Count) die exposes one of the problems with this approach. To increase the cores and cache, the HCC die employs dual ring buses. Communication between the two rings has to flow through a buffered switch (seen between the two rings at the top and bottom). Traversing the switch imposes a five-cycle penalty, and that is before the data has to continue through more hops to its destination.

So now I can see why, even though Seti work itself doesn't benefit from huge amounts of memory bandwidth, Kiska's runtimes were so high with the original setup of both DIMMs on the one CPU socket.
Intel mesh architecture.

Not populating all of the memory channels for a CPU reminds me of the saying "You can't put 10lbs of 'stuff' into a 5lb box"
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]