4x HD7990 + 2x E5-2630v2
Author | Message |
---|---|
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
I would like to build a new system (parts already owned or ordered): 4x HD7990 (dual-GPU cards), 2x Xeon E5-2630v2 2.6-3.1 GHz (each CPU: 6 cores/12 threads), Intel W2600CR2 mobo. This means one CPU thread per GPU app. I plan to buy reg-ECC DDR3-1600 RAM (CL9, I think): 4x 4 GB per CPU (quad-channel), so 8x 4 GB (32 GB) for the whole system. Would this be enough, or too much, for 24 OpenCL apps running simultaneously on the 8 GPUs (3 apps/GPU)? (I don't think I'll also crunch on the CPUs. But if I do, that's 24 additional CPU apps; would the 32 GB mentioned above still be enough in total?) So far Win8 hasn't been tested with this mobo, so I will need to go with Win7 64-bit; I thought Professional (dual-CPU support). Or do I need a higher edition? Thanks so far. Maybe more questions will follow. ;-) * Best regards! :-) * Philip J. Fry, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. * |
rob smith Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380 |
Having seen Win8 come and go in 48 hours at work, I'd go with Win7 Pro. I can't help with your question about the number of simultaneous tasks, but it sounds logical. You will probably have to reserve at least a couple of CPU cores to keep those GPUs from crying with hunger pains... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
skildude Joined: 4 Oct 00 Posts: 9541 Credit: 50,759,529 RAC: 60 |
I keep 2 cores idle just for my single 7970. Since each of your cards is equal to 2 GPUs, you might have to free up more CPU cores. You'll need to monitor things. I was reading more about the AMD line, and they are putting out the next series of GPUs soon. You might consider holding off to get the newest ones. In a rich man's house there is no place to spit but his face. Diogenes Of Sinope |
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
4 cards * 2 GPUs = 8 GPUs; 8 GPUs * 3 tasks = 24 GPU tasks; 24 GPU tasks + 24 CPU tasks = 48 tasks total; 48 * ~50 MB = 2400 MB. So < 3 GB RAM for 48 SETI tasks (make that 5 GB if 100 MB/task). 32 GB RAM is too much (money) just for SETI (and reg-ECC is expensive - will you run an airport traffic control server? ;) ) - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Thanks to all. rob smith wrote: "Having seen Win8 come and go in 48 hours at work I'd go with Win7, pro." So Win7 'Pro' would be 'enough'? I don't need to go with 'Ultimate' or 'Enterprise'? skildude wrote: "I keep 2 cores idle just for my single 7970. since each card is equal to 2 GPUs you might have to free up more CPU cores. You'll need to monitor things." So you mean the two CPUs can feed/support the 8 GPUs (24 apps)? Maybe it would be difficult to also crunch on the CPUs? BilBg wrote: "4 cards * 2 GPUs = 8 GPUs" The W2600CR2 mobo only supports DDR3 ECC UDIMM 1600, RDIMM 1600 and LRDIMM 1333 (no non-ECC). AFAIK, registered is faster than unbuffered. Yes, no? Is it the fastest possible RAM for this mobo? So I thought I'd go with RDIMM 1600 (reg-ECC DDR3-1600). It looks like ~300 € for 4x 4 GB. Currently I don't know whether 2 GB reg-ECC DDR3-1600 modules are available... From my experience on my current system (though I can't remember whether it was with the current app revision): if an APv6 result ends as 30/30, the system RAM usage increases to ~250 MB (or more) per OpenCL NV app. Normally it is 23 MB avg., 36 MB peak. So maybe ..? 24 x 250 MB = ~5.9 GB (at least) * Best regards! :-) * Philip J. Fry, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. * |
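The RAM estimates traded so far can be put side by side in a few lines. This is only a sketch of the arithmetic from the posts above: the function name `ram_budget_mb` is made up for illustration, the per-task figures (~50 MB, 100 MB, ~250 MB) are the posters' own estimates, and the ~250 MB case was observed with an NVIDIA OpenCL app, so applying it to the HD7990s is an assumption.

```python
def ram_budget_mb(gpus, tasks_per_gpu, cpu_tasks, mb_per_task):
    """Total host RAM in MB if every task uses mb_per_task."""
    total_tasks = gpus * tasks_per_gpu + cpu_tasks
    return total_tasks * mb_per_task


GPUS = 4 * 2        # 4 HD7990 cards, 2 GPUs per card
TASKS_PER_GPU = 3   # planned OpenCL apps per GPU
CPU_TASKS = 24      # one per CPU thread, only if the CPUs crunch too

# (label, CPU tasks, MB per task) taken from the estimates in the thread
scenarios = [
    ("GPU + CPU tasks at ~50 MB", CPU_TASKS, 50),
    ("GPU + CPU tasks at 100 MB", CPU_TASKS, 100),
    ("GPU tasks only at ~250 MB", 0, 250),
]

for label, cpu_tasks, mb in scenarios:
    total = ram_budget_mb(GPUS, TASKS_PER_GPU, cpu_tasks, mb)
    print(f"{label}: {total} MB = {total / 1024:.1f} GB")
```

Under these figures even the pessimistic case stays around 6 GB, which is the point both posters are circling: 32 GB leaves a very wide margin for SETI tasks alone.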
rob smith Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380 |
The differences between Win7 Pro and Win7 Ultimate/Enterprise are not related to performance; they have to do with multimedia and networking features, which are not needed for crunching. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
So Win7 'Pro' would be 'enough'? For your purposes, there's nothing that Ultimate or Enterprise gives you that Pro doesn't already give in the form of hardware support: Windows 7 Comparison Chart. AFAIK, registered is faster than unbuffered. Yes, no? Fastest possible RAM for this mobo, or? Registered memory is more reliable when using large capacities of RAM on a system board by adding an extra layer (or buffer) between the RAM controller and the physical RAM chips. For this reason, Registered memory is typically slower than standard unbuffered memory, but if you plan to use large capacities, or boards that require it, then it's your only choice. |
Jord Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
Not even that. The difference between Pro and Ultimate is that with Ultimate you can encrypt everything on a drive using BitLocker, plus you can change the language of your Windows to any of 40 languages. See Compare Windows 7 for all features between versions. |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Thanks to all. "AFAIK, registered is faster than unbuffered. Yes, no? Fastest possible RAM for this mobo, or?" ... which system RAM would you recommend, or which is the fastest, given that the W2600CR2 mobo supports DDR3 ECC UDIMM 1600, RDIMM 1600 and LRDIMM 1333, and that I would like to go with 4x 4 GB per CPU (32 GB for the whole system)? (Or would 4x 2 GB, i.e. 8 GB per CPU and 16 GB for the whole system, be another recommendation?) * Best regards! :-) * Philip J. Fry, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. * |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
UDIMM = Unregistered DIMM (unbuffered); RDIMM = Registered DIMM; LRDIMM = Load-Reduced DIMM. None of those are about speed; they are different types of RAM. With 32GB on a dual Xeon workstation/server class board, you should probably use RDIMM. My reasoning is that if you ever choose to add more RAM for whatever reason, you won't have to buy all new RAM because the UDIMMs couldn't support more load on the chipset. LRDIMM seems to be a "load reduced" form of registered ECC memory that allows you to use up to 8-rank memory as a single rank, thus allowing you to use cheaper, high-capacity RAM on a supported chipset. This is ideal for capacities over 64-128GB of RAM. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Disclaimer: this advice a) is written at beer o'clock and b) may not be applicable to AMD/non-NVIDIA users.

I'd try Linux. And then ...

1) export LD_PRELOAD=libsleep.so
   You would not have to reserve any physical or logical cores for AP. The 100% usage is only from yield(), an idle loop inside the NVIDIA OpenCL driver. libsleep.so replaces yield() with nanosleep. This gives lower-priority tasks (CPU tasks) an opportunity to run.
2) Then turn HT on in the BIOS. Release the power.
3) Then set boincmgr to use 50% of the available CPUs. (Meaning: calculate CPU MB/AP only on the a c t u a l processing units of the CPU (i.e. no thrashing, no competition for floating-point resources, no cache races) -- AND AT the same time -- reserve the HT power for the occasional system/OS/blanking or GPU memory transfer needs of the GPU application(s).)
4) Get the latest optimized versions of the CPU and GPU apps.
5) OR (instead of 4) compile your own executables. Tweak/test/compare/ask/tweak... But this (#5) needs to be done as a hobby/interest.

You may want to run multiple instances at a time, and then first edit the Astropulse_KernelsXXX.cl file if you are interested in programming in any language. The .cl/CUDA code is quite similar to C/C++. An example:

    // Experimental code by petri33
    // released for the first time 'here' for the public domain. All rights NON reserved.
    // modified from the original publicly available source code.
    // ** this may give an advantage on NV780 when running multiple AP tasks at a time. Other platforms need testing.
    //
    __kernel void splitter_bits_to_float_range_kernel(__global uint4* gpu_raw,
                                                      __global float4* gpu_data,
                                                      const uint raw_offset) {
        //R: I assume that first bit represents data[0].re, second - data[0].im and so on.
        //R: Each workitem takes 4 unsigned ints from bit stream, that is 4*4*8 bits or 4*4*8/2 data elements
        //R: that is, 64 data elements per work item. We need 32k*DATA_CHUNK_UNROLL elements total =>
        //R: there are plenty workitems to load (512*7 in current version) all processing elements of GPU.
        //R: raw_offset contains offset into raw array considering raw as array of uint4 entities
        //R: (that is, each element contains 4*4*8 bits)
        uint tid = get_global_id(0);
        uint dchunk = get_global_id(1);
        //R: we need 512 work items per data array
        int raw_offset_fin = raw_offset + tid + 512/2*dchunk;
        if (raw_offset_fin >= (8*1024*1024 >> (2+2))) return;
        uint4 bits = gpu_raw[raw_offset_fin]; //R: data chunks overlapped by 50% hence 512/2
        int data_offset = dchunk*(32768/2) + tid*32;
        for (uint i = 0; i < 8; i++, data_offset++) {
            float4 f1, f2, f3, f4;
            f1.x = (bits.x & 1 ? 1 : -1); bits.x >>= 1;
            f1.y = (bits.x & 1 ? 1 : -1); bits.x >>= 1;
            f1.z = (bits.x & 1 ? 1 : -1); bits.x >>= 1;
            f1.w = (bits.x & 1 ? 1 : -1); bits.x >>= 1;
            f2.x = (bits.y & 1 ? 1 : -1); bits.y >>= 1;
            f2.y = (bits.y & 1 ? 1 : -1); bits.y >>= 1;
            f2.z = (bits.y & 1 ? 1 : -1); bits.y >>= 1;
            f2.w = (bits.y & 1 ? 1 : -1); bits.y >>= 1;
            f3.x = (bits.z & 1 ? 1 : -1); bits.z >>= 1;
            f3.y = (bits.z & 1 ? 1 : -1); bits.z >>= 1;
            f3.z = (bits.z & 1 ? 1 : -1); bits.z >>= 1;
            f3.w = (bits.z & 1 ? 1 : -1); bits.z >>= 1;
            f4.x = (bits.w & 1 ? 1 : -1); bits.w >>= 1;
            f4.y = (bits.w & 1 ? 1 : -1); bits.w >>= 1;
            f4.z = (bits.w & 1 ? 1 : -1); bits.w >>= 1;
            f4.w = (bits.w & 1 ? 1 : -1); bits.w >>= 1;
            gpu_data[data_offset   ] = f1;
            gpu_data[data_offset+ 8] = f2;
            gpu_data[data_offset+16] = f3;
            gpu_data[data_offset+24] = f4;
        }
    }
    // compiles nice on NVIDIA. Similar performance (r_XXX) running one at a time. Testing what happens with multiple tasks.

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
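For readers who don't read OpenCL, the per-work-item mapping in the kernel above can be mirrored on the host. The following is a rough Python reference, not part of petri33's code: the names `unpack_uint4` and `as_complex` are invented here, and it assumes (as the kernel's own //R: comments do) that bits are consumed least-significant first and alternate between real and imaginary parts.

```python
def unpack_uint4(x, y, z, w):
    """Expand four 32-bit words into 128 samples of +1.0 / -1.0.

    Mirrors one work item of the kernel: word x fills samples 0..31,
    y fills 32..63, z fills 64..95 and w fills 96..127, each word read
    from its least-significant bit upwards.
    """
    samples = []
    for word in (x, y, z, w):
        for bit in range(32):
            samples.append(1.0 if (word >> bit) & 1 else -1.0)
    return samples


def as_complex(samples):
    """Pair consecutive samples as (re, im), per the kernel's //R: comment."""
    return [complex(re, im) for re, im in zip(samples[0::2], samples[1::2])]


samples = unpack_uint4(0b0110, 0, 0xFFFFFFFF, 0)
elements = as_complex(samples)
print(len(samples), "samples ->", len(elements), "complex data elements")
print(elements[:2])
```

A reference like this is mainly useful for checking a modified kernel: feed the same raw words to both and compare the output buffers.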
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
So you are trying the old trick of unwrapping (or unwinding? - I don't know the English term) the loops and avoiding multiplication (*). This worked great on the 8-bit 1 MHz 6502 CPU (used in the Apple ][+ (and by 'The Terminator' ;) )). But does it have any effect on today's CPUs/GPUs/compilers? - ALF - "Find out what you don't do well ..... then don't do it!" :) |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
I try all sorts of things; this is one example. Some operations do not take a long time, but their result is usable only after several clocks. Some operations can be issued in pairs if they are not dependent. I tried a 4-bit lookup table (giving 16 different float4 values), but that was slow. The example code generates compare and select instructions, but places them next to each other. It also puts 4 store operations together. Two can be done at the same time on Kepler. That will need further inspection. I'll try manually editing the assembler code to get the comparisons done first and then the selects based on the results after that... and put the store operations in pairs in between. I have not read the instruction cookbook, but I like to try things. Since I have had no time to profile and do not know the hotspots in the code, I just play a guessing game. All for fun. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
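The 4-bit lookup table mentioned above can be illustrated on the host as well. This is a hypothetical sketch (the table layout and function names are assumptions, not petri33's actual implementation) that only shows the two approaches produce identical samples; it says nothing about GPU speed, where the table was reported to be slower.

```python
# Hypothetical 16-entry table: nibble value -> four samples, bit 0 first.
NIBBLE_TABLE = [
    tuple(1.0 if (nibble >> bit) & 1 else -1.0 for bit in range(4))
    for nibble in range(16)
]


def unpack_word_bitwise(word):
    """One compare/select per bit, as in the posted kernel."""
    return [1.0 if (word >> bit) & 1 else -1.0 for bit in range(32)]


def unpack_word_table(word):
    """One table lookup per 4 bits, i.e. 8 lookups per 32-bit word."""
    samples = []
    for i in range(8):
        samples.extend(NIBBLE_TABLE[(word >> (4 * i)) & 0xF])
    return samples


for word in (0x00000000, 0xFFFFFFFF, 0x12345678, 0x80000001):
    assert unpack_word_bitwise(word) == unpack_word_table(word)
print("both variants agree")
```

On a GPU the table variant trades arithmetic for memory reads, which is one plausible reason it did not pay off, though the thread does not confirm the cause.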
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
petri33 wrote: "All for fun."

If you are interested in the running times, you should know that I currently run 0.5 GPU per MB task and 0.33 GPU per AP task, i.e. 2 MB or 3 AP tasks, or 1 MB + 1 AP, per GPU. The CPU, with its 12 virtual cores, is busy with 6 other tasks (AP or MB), and the remaining 6 HT cores are free for any app/OS/etc. needs. Oddly enough, my AP blanking takes a lot more CPU time when compiled with AVX and no SSE. SSE2/3 was faster. If any of you could give me a pointer -- what have I done wrong:

------------------

    #Pre-prepared configuration for an AstroPulse AMD/ATI-GPU build using OpenCL :
    export BOINC_DIR=/petri/boinc_repo
    export BOINCDIR=/home/petri/boinc_repo
    export INCLUDE=/root/NVIDIA_GPU_Computing_SDK/OpenCL/common/inc
    export GPU=NV
    ./configure --enable-bitness=64 --build=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu \
        --with-boinc-platform=x86_64-pc-linux-gnu --enable-static --enable-static-client \
        --enable-avx --disable-shared --disable-graphics --enable-intrinsics \
        CXXFLAGS=" -O3 -march=core2 -mtune=core2 -mavx --param inline-unit-growth=3000 -I/root/NVIDIA_GPU_Computing_SDK/OpenCL/common/inc -I/root/NVIDIA_GPU_Computing_SDK/shared/inc" \
        CPPFLAGS=" -DUSE_AVX -DUSE_FFTW -DUSE_CONVERSION_OPT -DUSE_INCREASED_PRECISION -DSMALL_CHIRP_TABLE -DUSE_OPENCL -DUSE_OPENCL_NV -DOPENCL_WRITE -DCOMBINED_DECHIRP_KERNEL -DOCL_ZERO_COPY -DAP_CLIENT" \
        LIBS=" -L/usr/lib64 -lOpenCL -L/opt/lib-4.12/lib" \
        LDFLAGS=" -static-libgcc -static-libstdc++" \
        BOINCDIR=" /home/petri/boinc_repo" \
        SETI_BOINC_DIR=" ../../AKv8" \
        LIBS="/opt/lib-4.12/lib/libm.so.6 /opt/lib-4.12/lib/libc.so /opt/lib-4.12/lib/libpthread.so /usr/lib64/libstdc++.so /usr/lib64/lib/libm.so.6"

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
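One thing that can be checked mechanically in a long configure line like the one above is whether a variable is assigned more than once. The sketch below is only a helper idea, with a made-up function name and a command string abbreviated from the post; as far as I know an autoconf-style configure reads `VAR=value` arguments in order, so a later assignment would replace an earlier one, but whether that has anything to do with the slow AVX blanking is not something this thread settles.

```python
import re
import shlex
from collections import defaultdict

# Matches shell-style NAME=value tokens; options such as --enable-avx
# start with '-' and are therefore ignored.
ASSIGNMENT = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.*)$", re.S)


def repeated_assignments(command):
    """Return {name: [values...]} for variables assigned more than once."""
    seen = defaultdict(list)
    for token in shlex.split(command):
        match = ASSIGNMENT.match(token)
        if match:
            seen[match.group(1)].append(match.group(2).strip())
    return {name: values for name, values in seen.items() if len(values) > 1}


# Abbreviated from the post above; only a few arguments are kept.
command = (
    './configure --enable-avx '
    'LIBS=" -L/usr/lib64 -lOpenCL -L/opt/lib-4.12/lib" '
    'BOINCDIR=" /home/petri/boinc_repo" '
    'LIBS="/opt/lib-4.12/lib/libm.so.6 /opt/lib-4.12/lib/libc.so"'
)

for name, values in repeated_assignments(command).items():
    print(f"{name} is set {len(values)} times:")
    for value in values:
        print("   ", value)
```

In the posted command this flags `LIBS`, which appears twice with different contents; the exported `BOINC_DIR` and `BOINCDIR` paths also differ and may be worth a second look.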
skildude Joined: 4 Oct 00 Posts: 9541 Credit: 50,759,529 RAC: 60 |
skildude wrote: It's very likely that you'd be running the GPUs only and leaving the CPUs completely free to feed the GPUs. In a rich man's house there is no place to spit but his face. Diogenes Of Sinope |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Thanks to all. Could it be that Linux (OpenCL ATI/AMD apps) is faster than Windows (OpenCL ATI/AMD apps)? I am now looking for a SATA III (rev. 3, 6 Gb/s) HDD for this new machine. (Capacity is not important; I guess at least 50 GB for the OS and so on?) If I want a very quick HDD (many app checkpoints to write), which specs should it have: cache, RPM, read/write speed and so on? Which would be the fastest HDD? I calculate with: 300 W for CPUs/mobo/fans + 4x 300 W for the 4 graphics cards = 1,500 W (OpenCL apps). (According to tests on the Internet, an HD7990 could use up to ~400 W.) I like to build PCs so that the PSUs run at ~50% of their maximum. So 1x 1200 W (600 W: mobo, 1 card) + 2x 900 W (1 1/2 cards, using 3x 8-pin PCIe plugs, per PSU)? Or maybe 5x 600 W PSUs? * Best regards! :-) * Philip J. Fry, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. * |
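The power budget above can be checked with a short calculation. This is a sketch using only the wattages quoted in the post (300 W base, 300 W nominal and ~400 W worst case per card, 1x 1200 W + 2x 900 W PSUs); the function names are invented, and it deliberately ignores PSU efficiency and how the load is split across the individual units.

```python
def total_draw_w(base_w, cards, watts_per_card):
    """System draw: CPUs/mobo/fans plus all graphics cards."""
    return base_w + cards * watts_per_card


def capacity_for_load(draw_w, target_load):
    """PSU capacity needed so that draw_w equals target_load of it."""
    return draw_w / target_load


BASE_W = 300                      # CPUs, mobo, fans (poster's estimate)
CARDS = 4
PLANNED_CAPACITY_W = 1200 + 2 * 900

for label, per_card in [("nominal", 300), ("worst case", 400)]:
    draw = total_draw_w(BASE_W, CARDS, per_card)
    needed = capacity_for_load(draw, 0.5)
    load = draw / PLANNED_CAPACITY_W
    print(f"{label}: draw {draw} W, {needed:.0f} W needed for 50% load, "
          f"planned PSUs at {load:.0%}")
```

With the nominal 300 W per card the planned 3,000 W of PSU capacity lands exactly on the 50% target; at ~400 W per card the same PSUs would sit at roughly 63%.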
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
Last I checked (about 6 months ago), the Samsung 840 Pro (not the regular 840 models) was the fastest around. I hear Samsung released the 840 EVO, which is a TLC SSD that matches the performance of the MLC-type drives. |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
But that's an SSD, isn't it? I am looking for an HDD. AFAIK, SSDs don't last as long as HDDs. The continuous writing (by the OS too) reduces the lifespan. Or not? * Best regards! :-) * Philip J. Fry, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. * |
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
The WD Raptor is the fastest HDD available, but if you can't quite come at the price then any WD Black Edition HDD will be good. I actually use SSDs now as my boot/OS/programs drive, with all data and BOINC on WD Black HDDs. Cheers. |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
Yes, it is an SSD, and I thought I saw something about SSD initially in your post... At any rate, SSDs do have a limited write cycle, but only on bits that actually change. They're reliable enough that the enterprise class drives are used in many high-end datacenters. That being said, with all of the small writes made by BOINC, I put my BOINC directory on a mechanical HDD, so if that's the route you want to go... Most HDDs are going to have very similar performance. The faster the RPMs, the more responsive the OS and applications "feel" (lower latency). More cache helps mask the speed issue, but they'll never be as fast as an SSD. No mechanical drive will reach speeds above approx 125MB/s (give or take about 10MB/s). Because the technology is limited largely by the spindle speed, nearly any drive on the market will suffice. If you want fast and you don't want SSD, you can take a look at the HDD/SSD hybrid drives. These are regular mechanical drives that utilize an SSD for cache. |
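The link between spindle speed and latency can be made concrete: on average the platter has to turn half a revolution before the wanted sector passes under the head. The sketch below computes that average rotational latency; the function name and the list of spindle speeds (common 5,400 / 7,200 / 10,000 RPM classes) are assumptions for illustration, and seek time and caching are left out.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for half a revolution, in milliseconds."""
    # One revolution takes 60,000 ms / rpm; on average half of it is needed.
    return 30000.0 / rpm


# Assumed spindle speeds, chosen as typical drive classes.
for rpm in (5400, 7200, 10000):
    print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")
```

The differences are a few milliseconds per access, which matters for many small checkpoint writes but does not change sequential throughput much.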
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.