Hmmm...something wrong in here : GPU runtimes
I3APR Send message Joined: 23 Apr 16 Posts: 99 Credit: 70,717,488 RAC: 0 |
OK, I may be a "freshman" here, still learning how to evaluate results, but I've found almost all my WUs look like this. My host ID is 8052170. Let's start with WU 2221866232, which I crunched with my brand-new GTX 1070: <core_client_version>7.6.22</core_client_version> <![CDATA[ <stderr_txt> v8 task detected setiathome_CUDA: Found 5 CUDA device(s): nVidia Driver Version 368.81 Device 1: GeForce GTX 1070, 4095 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 2, pciSlotID = 0 Device 2: GeForce GTX 660 Ti, 2048 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 7 pciBusID = 3, pciSlotID = 0 Device 3: GeForce GTX 1070, 4095 MiB, regsPerBlock 65536 computeCap 6.1, multiProcs 15 pciBusID = 1, pciSlotID = 0 Device 4: GeForce GTX 780 Ti, 3072 MiB, regsPerBlock 65536 computeCap 3.5, multiProcs 15 pciBusID = 130, pciSlotID = 0 Device 5: GeForce GTX 660 Ti, 2048 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 7 pciBusID = 129, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device 1: GeForce GTX 1070 is okay SETI@home using CUDA accelerated device GeForce GTX 1070 pulsefind: blocks per SM 4 (Fermi or newer default) pulsefind: periods per launch 100 (default) Priority of process set to BELOW_NORMAL (default) successfully Priority of worker thread set successfully setiathome enhanced x41zi (baseline v8), Cuda 5.00 setiathome_v8 task detected Detected Autocorrelations as enabled, size 128k elements. Work Unit Info: ............... WU true angle range is : 6.089407 GPU current clockRate = 2088 MHz re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes Thread call stack limit is: 1k cudaAcc_free() called... cudaAcc_free() running... cudaAcc_free() PulseFind freed... cudaAcc_free() Gaussfit freed... cudaAcc_free() AutoCorrelation freed... cudaAcc_free() DONE. 
Flopcounter: 16006632523549.781000 And here's my wingman stderr ( shortened ) : <core_client_version>7.6.22</core_client_version> <![CDATA[ <stderr_txt> Running on device number: 1 Priority of worker thread raised successfully Priority of process adjusted successfully, below normal priority class used OpenCL platform detected: Advanced Micro Devices, Inc. OpenCL platform detected: NVIDIA Corporation BOINC assigns device 1 Info: BOINC provided OpenCL device ID used Build features: SETI8 Non-graphics OpenCL USE_OPENCL_NV OCL_ZERO_COPY SIGNALS_ON_GPU OCL_CHIRP3 FFTW USE_SSE3 x86 CPUID: AMD Phenom(tm) II X4 965 Processor Cache: L1=64K L2=512K CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSE4A OpenCL-kernels filename : MultiBeam_Kernels_r3430.cl ar=6.089407 NumCfft=99281 NumGauss=0 NumPulse=13136434853 NumTriplet=13136434853 Currently allocated 201 MB for GPU buffers In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768 Windows optimized setiathome_v8 application Based on Intel, Core 2-optimized v8-nographics V5.13 by Alex Kan SSE3xj Win32 Build 3430 , Ported by : Raistmer, JDWhale SETI8 update by Raistmer OpenCL version by Raistmer, r3430 Number of OpenCL platforms: 2 OpenCL Platform Name: AMD Accelerated Parallel Processing Number of devices: 0 OpenCL Platform Name: NVIDIA CUDA Number of devices: 3 Max compute units: 7 Max work group size: 1024 Max clock frequency: 1110Mhz ......... 
Out-of-Order: Yes Name: GeForce GTX 660 Ti Vendor: NVIDIA Corporation Driver version: 359.06 Version: OpenCL 1.2 CUDA Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts Max compute units: 6 Max work group size: 1024 Max clock frequency: 1058Mhz ........... Out-of-Order: Yes Name: GeForce GTX 760 Vendor: NVIDIA Corporation Driver version: 359.06 Version: OpenCL 1.2 CUDA Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts Max compute units: 6 Max work group size: 1024 Max clock frequency: 1058Mhz ........ Out-of-Order: Yes Name: GeForce GTX 760 Vendor: NVIDIA Corporation Driver version: 359.06 Version: OpenCL 1.2 CUDA Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts Work Unit Info: ............... 
Credit multiplier is : 2.85 WU true angle range is : 6.089407 Used GPU device parameters are: Number of compute units: 6 Single buffer allocation size: 128MB Total device global memory: 4096MB max WG size: 1024 local mem type: Real FERMI path used: yes LotOfMem path: yes LowPerformanceGPU path: no period_iterations_num=50 ......... Flopcounter: 3335872831.090946 ........ .......... GPU device sync requested... ...GPU device synched 05:00:49 (29828): called boinc_finish(0) Long story short: how can a host with 366.64 GFLOPS of rated capability crunch a WU almost seven times faster than a host rated at 1,987.29 GFLOPS? I found several other examples too, like one where a 980 Ti, a fine GPU indeed, crunches the same WU in a ninth of the time of my 1070 (which I've overclocked to 2100!). OK, I'm running 3 tasks on my 1070, and I can't tell whether my wingmen are doing the same... anyway, even correcting my runtime accordingly and dividing it by 3, I still believe something is wrong in both cases. Can someone extrapolate a pattern and/or give me a clue about what's happening here? Am I getting it all wrong? Thank you A. |
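For what it's worth, the normalization the poster describes (dividing wall-clock runtime by the number of concurrent tasks per GPU) can be sketched like this. The numbers below are invented for illustration, not the actual task times:

```python
# Rough per-task normalization when running several WUs on one GPU.
# Wall-clock runtime overstates per-WU cost if N tasks share the card:
# effective time per WU ~ elapsed / N (ignoring scheduling overhead).

def effective_seconds_per_wu(elapsed_s: float, instances: int) -> float:
    """Approximate per-WU GPU time when `instances` tasks share the card."""
    if instances < 1:
        raise ValueError("instances must be >= 1")
    return elapsed_s / instances

# Hypothetical figures for illustration only (not the actual task times):
mine = effective_seconds_per_wu(2100.0, 3)    # 3 tasks at once on the GTX 1070
wingman = effective_seconds_per_wu(900.0, 1)  # wingman running a single task
print(f"mine ~{mine:.0f}s/WU, wingman ~{wingman:.0f}s/WU")
```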
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Long story short : how can a host with 366.64 GFLOPS capability, crunch a WU almost seven times faster than a host working with a 1,987.29 GFLOPS capability ? The one-word answer: software. First, it would save a lot of time if you posted your sample workunits as links: http://setiathome.berkeley.edu/workunit.php?wuid=2221866232 http://setiathome.berkeley.edu/workunit.php?wuid=2221415950 From those, we can see - with three clicks - all the data you spent all that time editing. We can also see that both your wingmates were running opencl_nvidia_SoG - the first, r3430 as stock, and the second, r3472 under Anonymous Platform. You, on the other hand, are running "x41zi (baseline v8), Cuda 5.00". My guess is that the cuda50 application (originally developed for the Kepler series of cards, but still the latest available) is diverging further and further from the latest Pascal architecture (for which Cuda v8.0 was developed). Jason Gee, our Cuda developer, will be able to amplify that suggestion further, or possibly shoot it down in flames. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
[Check the GPU's actual load...] That task looks like a VHAR (aka a 'shorty'). Because of the data layouts and (small) search sizes involved (with short kernel launches), throughput is almost completely bound by the system latencies involved. These include whatever CPU resources you may or may not be dedicating to feeding the GPU, the PCIe lanes used by each card (that's a lot of cards in one system), and settings. My 'feeling' is that your 1070 finishes its work quickly, then starves. I would suggest freeing some CPU cores to feed all those GPUs (whichever applications), then winding the process priority setting up to Above Normal, and pushing the pulsefind settings to the max (your stderr shows conservative defaults all around). [For Cuda 5.0, per-device settings are possible] Stepping back and looking at the broader status quo, you'll find nVidia now limiting gaming SLI support to 2 GPUs in the new generation. If you want to have a monster, you must be able to feed it (I know this because I have a lazy dog :P ) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
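In BOINC terms, the "free some CPU cores to feed the GPUs" advice is usually done with an `app_config.xml` in the project directory. This is only a sketch: the app `<name>` and both ratios are assumptions to adapt to the installed application, not known-good values:

```xml
<!-- app_config.xml (SETI@home project directory; illustrative sketch).
     Runs 3 tasks per GPU and reserves a full CPU core per GPU task,
     so the scheduler leaves cores free to feed the cards. -->
<app_config>
  <app>
    <name>setiathome_v8</name>
    <gpu_versions>
      <gpu_usage>0.33</gpu_usage>  <!-- 3 concurrent tasks per GPU -->
      <cpu_usage>1.0</cpu_usage>   <!-- one CPU core budgeted per GPU task -->
    </gpu_versions>
  </app>
</app_config>
```

After editing, BOINC needs a "read config files" (or a client restart) to pick the file up.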
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
My guess is that the cuda50 application (originally developed for Kepler series cards, but still the latest available) is diverging further and further from the latest Pascal architecture (for which Cuda v8.0 was developed). Jason Gee, our Cuda developer, will be able to amplify that suggestion further, or possibly shoot it down in flames. Close, but for reference, others have labelled Pascal as 'Maxwell on Speed'. That's not derogatory: with the 750 Ti onward, nV basically struck gold, and the process shrink gives the higher clocks. reference: https://youtu.be/nDaekpMBYUA There are ways to utilise Maxwell+ better than the current baseline apps do; however, 'more instances at lower utilisation' vs 'fewer instances at higher utilisation' will keep trading blows until we solve some more fundamental infrastructure issues, namely that nearly all building/debugging has become unmanageable. Petri has managed to demonstrate that the VLAR issues (not applicable in the provided example) have nothing to do with the architecture per se. With multiple instances and shorties, we're left with load-balancing and stream-based latency-hiding concerns, which are going to be exacerbated on a many-GPU rig. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
The_Matrix Send message Joined: 17 Nov 03 Posts: 414 Credit: 5,827,850 RAC: 0 |
I think the problem is that all these PCIe x1-to-x16 adapters are stuck on a single PCIe lane, or is that not the point? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I think the problem is that all these PCIex1 to PCIex16 adapters are sticking on one PCIe lane, or is it not the point ? I think it is that.... but we only have information of 48x 2.2 GHz CPUs (which is probably 24 physical cores hyperthreaded, at half the effective GHz) ---> So the hyperthreaded (1.1 GHz effective) CPU threads are already running at a lower clock rate than some of the GPUs, which could easily be bad ---> and 24 or 48 threads onto how many PCIe lanes in total? Yeah, starved GPUs IMO; they need 16 lanes each. 2.2 GHz hyperthreaded = 2 x 1.1 GHz effective (no free lunch), driving 2+ GHz GPU(s)? I call hardware mismatch. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
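The back-of-envelope arithmetic above can be written out; the platform lane count and GPU count below are illustrative assumptions, not figures from the host in question:

```python
# Rough check of the mismatch described above: effective per-thread
# clock under hyperthreading vs. the GPU core clock, and PCIe lanes
# available per GPU. All figures are illustrative assumptions.

cpu_base_ghz = 2.2
threads_per_core = 2
# Crude "no free lunch" model: two hardware threads split one core's cycles.
per_thread_ghz = cpu_base_ghz / threads_per_core

total_pcie_lanes = 40  # assumed platform total; check the actual chipset
gpus = 5
lanes_per_gpu = total_pcie_lanes // gpus

print(f"~{per_thread_ghz:.1f} GHz per HT thread feeding 2+ GHz GPUs; "
      f"~{lanes_per_gpu} lanes/GPU vs. the 16 suggested above")
```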
AMDave Send message Joined: 9 Mar 01 Posts: 234 Credit: 11,671,730 RAC: 0 |
I think the problem is that all these PCIex1 to PCIex16 adapters are sticking on one PCIe lane, or is it not the point ? So, would it be fair to say that in order to get the most out of such multi-core, multi-socket systems, you should turn off HT (to avoid bottlenecks)? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
2.2 GHz hyperthreaded = 2 x 1.1 Ghz effective ( no free lunch ), driving 2+ Ghz GPU(s) ? I call hardware mismatch. Or simply run them as nature intended - as pure CPU compute devices? If you want to run GPUs as well, put them in ones or twos in simple i5 chassis - as many boxen as you need. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I think the problem is that all these PCIex1 to PCIex16 adapters are sticking on one PCIe lane, or is it not the point ? I think that's fair while BOINC, and the bulk of the absolute-performance crowd, focus on our simple apps and our existing naive infrastructure. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
2.2 GHz hyperthreaded = 2 x 1.1 Ghz effective ( no free lunch ), driving 2+ Ghz GPU(s) ? I call hardware mismatch. Careful of the natural confusion: the OP's question related to a Cuda device, while the OpenCL devices are more or less arbitrary. If Intel GPU devices are involved here, I would be more than happy to have a discussion on camera with that tosser Francois (from Intel) who proposed that GPUs would never be a thing. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
AMDave Send message Joined: 9 Mar 01 Posts: 234 Credit: 11,671,730 RAC: 0 |
OK, a multi-core, multi-socket system with 1 or more GPUs would be more efficient/productive without HT. Now, take that same system sans GPUs, strictly running CPU apps. Would it be more efficient/productive with or without HT enabled? |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13732 Credit: 208,696,464 RAC: 304 |
Now, take that same system sans GPUs, strictly running CPU apps. Would it be more efficient/productive with or without HT enabled? HyperThreading is like running multiple WUs on a GPU: the individual WU run times are longer, but the overall throughput per hour is much higher. Grant Darwin NT |
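That trade-off (slower individual WUs, higher total throughput) can be shown with a toy model; the runtimes below are invented for illustration:

```python
# Toy model of the trade-off above: running two WUs at once lengthens
# each WU's runtime, yet total throughput per hour can still rise.

def wus_per_hour(concurrent: int, seconds_per_wu: float) -> float:
    """Completed work units per hour at a given concurrency."""
    return concurrent * 3600.0 / seconds_per_wu

single = wus_per_hour(1, 600.0)  # one at a time: 10 minutes each
double = wus_per_hour(2, 900.0)  # two at once: each takes 15 minutes
print(single, double)            # higher concurrency wins despite slower WUs
```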
The_Matrix Send message Joined: 17 Nov 03 Posts: 414 Credit: 5,827,850 RAC: 0 |
It's bad: I found NO decent PCIe x16 to PCIe x16 riser to buy on amazon.it. I think a cable must be at least 15 inches long to be usable. https://www.caseking.de/lian-li-pw-pci-e-1-riser-card-kabel-gen.3-schwarz-geli-732.html I found this one in Germany, but it's not affordable, and where is the power cable!? Edit: OK, I would install the big high-performance cards ON BOARD, and use these PCIe x1 adapters only for the "low"-performance cards. |
Shaggie76 Send message Joined: 9 Oct 09 Posts: 282 Credit: 271,858,118 RAC: 196 |
Getting off-topic, but my 2 credits on hyper-threading: it's a great way to avoid idle CPU resources while waiting for a cache miss that can cost hundreds of cycles; it's particularly helpful on LPDDR and other slower memory subsystems. The downside is that by "getting more work done" during a miss you can inadvertently pollute the cache even more and bog down the first thread (a problem we struggled with on the PS3). You can mitigate the impact with streaming loads and stores that bypass the cache, but it's tricky. Of course, the answer to "is it faster without?" is "it depends on your code!", so you might as well try it and see. If only there were a standardized BOINC benchmark suite... |
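The latency-hiding argument can be made concrete with a deliberately naive model; the miss rate and penalty below are illustrative, not measured:

```python
# Naive model of why SMT can help miss-bound code: while one thread
# stalls for hundreds of cycles on a cache miss, its sibling can
# retire work in the meantime.

def core_utilization(miss_rate: float, penalty_cycles: int, threads: int) -> float:
    """Fraction of core cycles doing useful work; sibling threads fill
    each other's stall cycles, capped at full utilization."""
    stalled = miss_rate * penalty_cycles / (1.0 + miss_rate * penalty_cycles)
    busy = 1.0 - stalled
    return min(1.0, busy * threads)

print(core_utilization(0.02, 300, 1))  # single thread: mostly stalled
print(core_utilization(0.02, 300, 2))  # SMT sibling recovers some cycles
```

The model ignores exactly the effect described above, where the second thread's footprint evicts the first thread's cache lines, which is why real results can go either way.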
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Not quite BOINC, but SETI - very standardized, though, and definitely appropriate for checking HT benefits, if any (I actually ran such tests long ago on a since-lost Atom-based netbook with 2 hyperthreaded CPUs): the Lunatics KWSN bench (I can't give an exact link to the download section because the Lunatics site is currently down for me). SETI apps news We're not gonna fight them. We're gonna transcend them. |
Mike Send message Joined: 17 Feb 01 Posts: 34257 Credit: 79,922,639 RAC: 80 |
Can't reach the Lunatics site either. With each crime and every kindness we birth our future. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
Lunatics is unreachable for me as well, so it's not just you. Regarding HT..... As you may recall, when this was first questioned, the kitties ran some rather extensive tests. I was assisted by a very kind and knowledgeable Intel guru, archae. The results at that time were that HT off was more stable and allowed better OCing than HT on. HT was NOT, as was previously posted, like running more than one task at a time on a GPU to increase utilization. It simply split CPU time between jobs, and at that time it was not as efficient at it as one might think. In my case at least, I believe there was a few percent net loss compared with just running the cores full bore, without the encumbrance of the HT overhead. In a non-crunching world, where apps are not trying to use the CPUs full time, there could be some benefit to having multiple apps launched on HT virtual cores, where the CPU can jump between little bits of this and that and attend to each in turn. But not in the full-usage crunching world. In short..... there is no free CPU lunch. A clock cycle is a clock cycle, whether used in real time or split up in HT time. The kitties believe it is best to leave HT off and let the cores work without the toys. Of course, things have advanced, and I am running on rather old hardware. And YMMV always applies. Meow. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
In short.....there is no free CPU lunch. A clock cycle is a clock cycle, whether used in real time or split up in HT time. Word up Kittyman! [Edit: if you have low latency already, latency hiding won't help you] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
In short.....there is no free CPU lunch. A clock cycle is a clock cycle, whether used in real time or split up in HT time. Talking about memory clock cycles? Yes, those help. But a CPU clock cycle is still a clock cycle, my friend; cycles can be held up if the memory is not available. And the discussion was about HT, and HT does not help with memory access cycles, if I'm correct. Meow, my friend. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
The_Matrix Send message Joined: 17 Nov 03 Posts: 414 Credit: 5,827,850 RAC: 0 |
Is it worth setting the main memory timings down from 667 MHz 9-9-9-24 to 457 MHz 6-6-6-17? Will that have a positive effect on crunching time for CPU work units? |
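The question above can be answered with quick arithmetic: absolute CAS latency in nanoseconds is latency_ns = CAS_cycles / clock_MHz * 1000, so the two settings are nearly identical in latency while the lower clock sacrifices bandwidth:

```python
# CAS latency in nanoseconds for the two memory settings in question.

def cas_latency_ns(cas_cycles: int, clock_mhz: float) -> float:
    """First-word access latency implied by CAS cycles at a given clock."""
    return cas_cycles / clock_mhz * 1000.0

fast = cas_latency_ns(9, 667)  # ~13.5 ns at 667 MHz CL9
slow = cas_latency_ns(6, 457)  # ~13.1 ns at 457 MHz CL6
# Latency is almost the same, but the lower clock cuts peak bandwidth
# by roughly 31% (457/667), which usually hurts more than CAS helps.
print(fast, slow)
```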
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.