Message boards : Number crunching : Best performing hardware
Author | Message |
---|---|
Ryan Munro Send message Joined: 5 Feb 06 Posts: 63 Credit: 18,519,866 RAC: 10 |
So just a thought: what kit out there would perform best for SETI? I'm not talking big clusters etc., just specific pieces of hardware. I would assume it would be some form of GPU, but what about the Intel Xeon Phi cards, for example? I'm not looking to purchase, just curious as to what the ideal SETI rig would be :) |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
So just a thought: what kit out there would perform best for SETI? I'm not talking big clusters etc., just specific pieces of hardware. I'm not sure that anyone has developed any SETI@home apps for the Xeon Phi hardware yet. However, at the moment the best performance per watt comes from doing work on GPUs. Which vendor's hardware is most efficient depends on the type of work. For SETI@home (MB work): NV GPUs using CUDA. For Astropulse (AP work): ATI GPUs using OpenCL. That isn't to say MB work on ATI GPUs or AP work on NV GPUs is bad. In the NV range the 750 Ti seems to be the best PPW GPU. I have not seen enough reports on the new NV 900 series GPUs to know how well they perform at the moment. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url] |
qbit Send message Joined: 19 Sep 04 Posts: 630 Credit: 6,868,528 RAC: 0 |
Correct me if I'm wrong, but AFAIK the important thing for SETI is single-precision performance. The Xeon Phi 7120P seems to have a theoretical single-precision peak of 2.4 TFLOPS: http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-theoretical-maximums.html A GTX 780 or a Titan seems to be much faster: "The GTX 780 still offers respectable single precision performance though, clocking in at 4 Teraflops compared to the Titan's 4.5 Teraflops." http://www.maximumpc.com/article/news/geforce_gtx_780_benchmarks The Titan Black is rated at 5.1 TFLOPS: http://www.bit-tech.net/news/hardware/2014/02/18/nvidia-gtx-titan-black-launched/1 The GTX 980 should be about the same with ~5 TFLOPS: http://www.pcworld.com/article/2686115/nvidia-unveils-its-all-new-geforce-gtx-980-and-gtx-970-graphics-processors.html But by far the fastest "computing unit" is the human brain, which is rated at ~1,000,000 TFLOPS (that's 1 exaFLOPS)! So maybe we should look for a way to use our brains to crunch SETI ;-) |
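The ranking above can be sanity-checked with a quick script using the theoretical peak figures quoted in the post (these are vendor peak numbers, not measured SETI@home throughput):

```python
# Theoretical single-precision peaks quoted in the post above, in TFLOPS.
peaks = {
    "Xeon Phi 7120P": 2.4,
    "GTX 780": 4.0,
    "GTX Titan": 4.5,
    "GTX Titan Black": 5.1,
    "GTX 980": 5.0,
}

# Rank fastest first and show each card relative to the Phi.
ranked = sorted(peaks.items(), key=lambda kv: kv[1], reverse=True)
for name, tflops in ranked:
    ratio = tflops / peaks["Xeon Phi 7120P"]
    print(f"{name}: {tflops} TFLOPS ({ratio:.2f}x Xeon Phi 7120P)")
```

Even the slowest consumer card quoted has well over 1.5x the Phi's theoretical peak, which matches the post's conclusion.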
ivan Send message Joined: 5 Mar 01 Posts: 783 Credit: 348,560,338 RAC: 223 |
Correct me if I'm wrong, but AFAIK the important thing for SETI is single-precision performance. I have a Xeon Phi. I gave up on trying to port BOINC and S@H to it. For the best performance you have to run native code on the Phi itself (otherwise communication bottlenecks between the host and the card slow you down). That would mean running BOINC on the Phi as well, and BOINC code is not the most portable (though not too bad if you ignore boincmgr and use boinccmd for all interactions). Then there's the memory problem: our particular model only has 8 GB of RAM, which has to contain the OS as well as applications. At the moment top reports 7.6 GB free, so for 60 cores x 4 threads that'd be only around 30 MB available per thread; currently on an Ubuntu box S@H reports 104 or 164 MB virtual memory per process, 40 or 96 MB resident per process, and 12 This might all change with the new Phis, which slot into motherboard sockets and can access main memory, but Intel hasn't offered me one to play with yet... |
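The "around 30 MB per thread" figure above follows from simple division; a sketch using the numbers from the post:

```python
# Back-of-the-envelope memory budget for a 60-core Xeon Phi,
# using the figures reported in the post above.
free_gb = 7.6                 # reported free by top
cores = 60
threads_per_core = 4

threads = cores * threads_per_core        # 240 hardware threads
mb_per_thread = free_gb * 1024 / threads  # MB available per thread
print(f"{threads} threads, about {mb_per_thread:.0f} MB each")
```

That is well under the 40-96 MB resident per process that the Ubuntu box reports, which is the crux of the memory problem described.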
Admiral Gloval Send message Joined: 31 Mar 13 Posts: 20287 Credit: 5,308,449 RAC: 0 |
If you want a hint at a good number cruncher, just go to Statistics and look at the top performers. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Would it be possible to use a massively-parallel system via OpenCL drivers, rather than running a separate instance on each node/core? |
Ryan Munro Send message Joined: 5 Feb 06 Posts: 63 Credit: 18,519,866 RAC: 10 |
SP / DP was to be my next question, but that's been answered. So it looks like, if you had the cash, a dual 18-core Xeon box with 2x Titan Zs would be the best bet? Damn, when I win the lottery I am definitely building the world's best home SETI box: something you can still run off the mains and use day to day, I think. It would be awesome getting to play with that kit and giving something back with all that power :) |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
SP / DP was to be my next question, but that's been answered. So it looks like, if you had the cash, a dual 18-core Xeon box with 2x Titan Zs would be the best bet? You don't really need that much CPU oomph. An i5 or i7 with a pair of those GPUs would likely be in the top 15-20 computers for the project. The CPUs, in comparison to the GPUs, would only account for a small fraction of the work the machine completes. |
ivan Send message Joined: 5 Mar 01 Posts: 783 Credit: 348,560,338 RAC: 223 |
Perhaps; you have more experience there than I do. I've done a few OpenMP things but haven't really had the chance to play with OpenCL otherwise. Do you have to explicitly parallelise everything in OpenCL, or does the system work that out for you? (Guess I should dig out your code if I get an idle moment.) Another possibility would be to make the shareable section much larger by using shared libraries:
[eesridr:~] > ldd BOINC/projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu
not a dynamic executable |
Juha Send message Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0 |
Then there's the memory problem -- our particular model only has 8 GB of ram, which has to contain the OS as well as applications. At the moment top reports 7.6 GB free, so for 60 cores x 4 threads that'd be only around 30 MB available per thread; currently on an Ubuntu box s@h reports 104 or 164 MB virtual memory per process, 40 or 96 MB resident per process, and 12 If you are running stock apps, those are UPX compressed. Decompress them and see what figures you get then. Although, iirc, the apps use more than 30 MB for data. |
ivan Send message Joined: 5 Mar 01 Posts: 783 Credit: 348,560,338 RAC: 223 |
Then there's the memory problem -- our particular model only has 8 GB of ram, which has to contain the OS as well as applications. At the moment top reports 7.6 GB free, so for 60 cores x 4 threads that'd be only around 30 MB available per thread; currently on an Ubuntu box s@h reports 104 or 164 MB virtual memory per process, 40 or 96 MB resident per process, and 12 Those are in-memory figures from top -- compression of the executable wouldn't affect that (only the space on disk), surely? |
Woodgie Send message Joined: 6 Dec 99 Posts: 134 Credit: 89,630,417 RAC: 55 |
SP / DP was to be my next question, but that's been answered. So it looks like, if you had the cash, a dual 18-core Xeon box with 2x Titan Zs would be the best bet? If anyone cares to have a look at outlander, it's an i7 overclocked to 4.4 GHz with 16 GB RAM and 2 original TITANs (not Z or Black). When I win the lottery I'm going to use a couple of these... ~W |
Juha Send message Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0 |
Then there's the memory problem -- our particular model only has 8 GB of ram, which has to contain the OS as well as applications. At the moment top reports 7.6 GB free, so for 60 cores x 4 threads that'd be only around 30 MB available per thread; currently on an Ubuntu box s@h reports 104 or 164 MB virtual memory per process, 40 or 96 MB resident per process, and 12 The uncompressed code started its life as compressed data. In particular, before the code became code, it was written to memory as data, and pages that have been written to aren't shared between different processes. So with UPX-packed apps, each process ends up with its own private copy of the code. |
Ryan Munro Send message Joined: 5 Feb 06 Posts: 63 Credit: 18,519,866 RAC: 10 |
Damn, that system beats mine: http://setiathome.berkeley.edu/show_host_detail.php?hostid=7407076 A couple of questions. First, on the Nvidia system posted he is doing a wide range of GPU-based units, whereas I am only doing one type; is this because he has a CUDA-capable card? Second, I was under the impression that the CPU is still important, as there are specific work types that only run on the CPU? Third, I have a second box with a 3770K installed; is it possible to crunch on the CPU's integrated GPU and the discrete card (270X) at the same time? I think this combo, rather than CPU + 270X, would use less power and produce about the same points? |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Damn that system beats mine: In my first message I mentioned which apps run best on which hardware. However, with SETI@home all 4 types of hardware (CPU, ATI GPU, Intel GPU, & NVIDIA GPU) can run both Astropulse & SETI@home applications, at least in Windows. From the Applications page you can see what apps there are for each type of hardware in each OS. OpenCL is used for the ATI, Intel, & Nvidia Astropulse apps. OpenCL is used for the ATI & Intel SETI@home apps. CUDA is used for the NVIDIA SETI@home app. It looks like none of your machines have done any Astropulse work. You may have disabled Astropulse in your preferences. You can check your Project Preferences & make sure everything is enabled, or at least SETI@home v7 & AstroPulse v7, as those two are the current active applications right now. Run only the selected applications: SETI@home Enhanced (obsolete, replaced by SETI@home v7), SETI@home v7, AstroPulse v6 (obsolete, replaced by AstroPulse v7), AstroPulse v7. CPUs are still important. However, in a system with a mid- to high-end GPU, the CPU only accounts for a fraction of the system's total output, primarily because GPUs are several times more efficient than CPUs in the manner we are using them. For example, my HD6870s will process an Astropulse task in about 30 min, & my i5-4670 will take about 4 hours running 4 at a time. So in 4 hours my HD6870 has completed 8 tasks while my CPU has completed 4. Also, the Intel HD4000 @ 107.52 GFLOPS is much less powerful than your R9 270X @ 2560 GFLOPS. |
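The GPU-vs-CPU throughput example above works out like this (a sketch using the timings from the post):

```python
# Tasks completed in a 4-hour window, per the example in the post above.
window_h = 4.0
gpu_task_h = 0.5     # HD 6870: ~30 minutes per Astropulse task
cpu_task_h = 4.0     # i5-4670: ~4 hours per task
cpu_parallel = 4     # 4 CPU tasks running at a time

gpu_tasks = window_h / gpu_task_h                   # one GPU: 8 tasks
cpu_tasks = (window_h / cpu_task_h) * cpu_parallel  # four cores: 4 tasks
print(f"GPU: {gpu_tasks:.0f} tasks, CPU: {cpu_tasks:.0f} tasks")
```

So a single mid-range GPU does twice the work of the whole quad-core CPU over the same window, which is why the CPU contributes only a fraction of such a system's output.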
Ryan Munro Send message Joined: 5 Feb 06 Posts: 63 Credit: 18,519,866 RAC: 10 |
Thanks for the info. With regard to the Intel GPU, I was thinking of running it instead of the Intel CPU alongside the Radeon. My thought was lower power for the same sort of output? |
qbit Send message Joined: 19 Sep 04 Posts: 630 Credit: 6,868,528 RAC: 0 |
Oh yeah! The K80 is rated at 8.7 TFLOPS for single precision and therefore should be considerably faster than the fastest consumer GPUs. http://www.computerworld.com/article/2848128/servers/nvidia-reaches-high-on-graphics-performance-with-tesla-k80.html The real problem is the price: it should be around $7,000 :-( |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Thanks for the info. So iGPU + Radeon & no CPU? I'm not as familiar with the Ivy Bridge iGPU as I am with Haswell, but if the CPU-to-iGPU performance scales the same, then you could expect about the same output as CPU + Radeon at lower power levels. The iGPU runs in the neighborhood of 11-15 or so watts. EDIT: Also, I just noticed I messed up on the GFLOPS for your iGPU. It should have been 294.4 instead of 107.52, as it has 16, not 6, execution units. |
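The corrected 294.4 GFLOPS figure is consistent with the usual Intel EU arithmetic; a sketch assuming 16 single-precision FLOPS per EU per clock and a 1.15 GHz max clock (both typical HD 4000 figures, assumed here rather than taken from the post):

```python
# Theoretical SP GFLOPS for an Intel HD 4000 iGPU.
# Assumed figures: 16 FLOPS/EU/clock (8 multiply-adds) and 1.15 GHz clock.
eus = 16                  # execution units (the corrected count)
flops_per_eu_clock = 16   # 8 multiply-adds per EU per cycle
clock_ghz = 1.15

gflops = eus * flops_per_eu_clock * clock_ghz
print(f"{gflops:.1f} GFLOPS")
```

With only 6 EUs the same arithmetic gives about 110 GFLOPS, which is roughly where the earlier 107.52 figure came from.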
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
I don't know about accuracy, but the price is quoted as being $5,000 on AnandTech, which is $2,000 more than a Titan Z rated at 8.1 TFLOPS. |
BassieXp Send message Joined: 5 Jun 05 Posts: 14 Credit: 1,408,518 RAC: 0 |
Oh yeah! The K80 is rated at 8.7 TFLOPS for single precision and therefore should be considerably faster than the fastest consumer GPUs. Just a thought. From the article: "Dell's PowerEdge C4130 is a 1U server that looks more like an appliance and will be able to accommodate up to four K80 cards." At 8.75 TFLOPS a card, 4 cards per 1U server, and 40 servers in a 42U rack (2U for switches): 8.75*4*40 = 1400 TFLOPS. That's a lot of crunching; with six of these racks you could double the output of BOINC entirely (the BOINC site says 8.4 PFLOPS across all projects). As a side note, the press release from Dell says it has up to 7.2 TFLOPS per server, which I find a strange number, as the K80 has 2.91 TFLOPS double precision and that makes 11.64 TFLOPS for four cards. PS: This is just some thinking from someone who has no experience with rack servers. I do know that power and cooling can be problematic with so much in a rack. |
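The rack arithmetic above, sketched out (same figures as the post; 40 servers assumes 2U of a 42U rack are reserved for switches):

```python
# Rack-level throughput estimate from the post above.
tflops_per_card = 8.75      # Tesla K80, single precision
cards_per_server = 4        # Dell PowerEdge C4130, 1U
servers_per_rack = 40       # 42U rack minus 2U for switches

rack_tflops = tflops_per_card * cards_per_server * servers_per_rack
boinc_pflops = 8.4          # BOINC total across all projects
racks_to_double = boinc_pflops * 1000 / rack_tflops
print(f"{rack_tflops:.0f} TFLOPS per rack; "
      f"{racks_to_double:.0f} racks to match all of BOINC")
```

So 1,400 TFLOPS per rack, and six racks reach 8.4 PFLOPS, matching the post's "double the output with six racks" claim (adding that much would double the combined total).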
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.