Hmmm...something wrong in here : GPU runtimes


I3APR

Joined: 23 Apr 16
Posts: 99
Credit: 70,717,488
RAC: 0
Italy
Message 1805613 - Posted: 29 Jul 2016, 16:21:36 UTC
Last modified: 29 Jul 2016, 16:25:56 UTC

Ok, I might be a "freshman" here, still learning how to evaluate the results, but I've found that almost all my WUs look like this. My Host ID is 8052170

Let's start with WU 2221866232:



Now, I worked this with my brand-new GTX 1070:

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
v8 task detected
setiathome_CUDA: Found 5 CUDA device(s):
nVidia Driver Version 368.81
Device 1: GeForce GTX 1070, 4095 MiB, regsPerBlock 65536
computeCap 6.1, multiProcs 15
pciBusID = 2, pciSlotID = 0
Device 2: GeForce GTX 660 Ti, 2048 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 7
pciBusID = 3, pciSlotID = 0
Device 3: GeForce GTX 1070, 4095 MiB, regsPerBlock 65536
computeCap 6.1, multiProcs 15
pciBusID = 1, pciSlotID = 0
Device 4: GeForce GTX 780 Ti, 3072 MiB, regsPerBlock 65536
computeCap 3.5, multiProcs 15
pciBusID = 130, pciSlotID = 0
Device 5: GeForce GTX 660 Ti, 2048 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 7
pciBusID = 129, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 1070 is okay
SETI@home using CUDA accelerated device GeForce GTX 1070
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to BELOW_NORMAL (default) successfully
Priority of worker thread set successfully

setiathome enhanced x41zi (baseline v8), Cuda 5.00

setiathome_v8 task detected
Detected Autocorrelations as enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 6.089407

GPU current clockRate = 2088 MHz

re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.

Flopcounter: 16006632523549.781000



And here's my wingman's stderr (shortened):

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
Running on device number: 1
Priority of worker thread raised successfully
Priority of process adjusted successfully, below normal priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
OpenCL platform detected: NVIDIA Corporation
BOINC assigns device 1
Info: BOINC provided OpenCL device ID used

Build features: SETI8 Non-graphics OpenCL USE_OPENCL_NV OCL_ZERO_COPY SIGNALS_ON_GPU OCL_CHIRP3 FFTW USE_SSE3 x86
CPUID: AMD Phenom(tm) II X4 965 Processor

Cache: L1=64K L2=512K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSE4A
OpenCL-kernels filename : MultiBeam_Kernels_r3430.cl
ar=6.089407 NumCfft=99281 NumGauss=0 NumPulse=13136434853 NumTriplet=13136434853
Currently allocated 201 MB for GPU buffers
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768

Windows optimized setiathome_v8 application
Based on Intel, Core 2-optimized v8-nographics V5.13 by Alex Kan
SSE3xj Win32 Build 3430 , Ported by : Raistmer, JDWhale

SETI8 update by Raistmer

OpenCL version by Raistmer, r3430

Number of OpenCL platforms: 2


OpenCL Platform Name: AMD Accelerated Parallel Processing
Number of devices: 0


OpenCL Platform Name: NVIDIA CUDA
Number of devices: 3
Max compute units: 7
Max work group size: 1024
Max clock frequency: 1110Mhz
.........
Out-of-Order: Yes
Name: GeForce GTX 660 Ti
Vendor: NVIDIA Corporation
Driver version: 359.06
Version: OpenCL 1.2 CUDA
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts
Max compute units: 6
Max work group size: 1024
Max clock frequency: 1058Mhz
...........
Out-of-Order: Yes
Name: GeForce GTX 760
Vendor: NVIDIA Corporation
Driver version: 359.06
Version: OpenCL 1.2 CUDA
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts
Max compute units: 6
Max work group size: 1024
Max clock frequency: 1058Mhz
........
Out-of-Order: Yes
Name: GeForce GTX 760
Vendor: NVIDIA Corporation
Driver version: 359.06
Version: OpenCL 1.2 CUDA
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts


Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 6.089407
Used GPU device parameters are:
Number of compute units: 6
Single buffer allocation size: 128MB
Total device global memory: 4096MB
max WG size: 1024
local mem type: Real
FERMI path used: yes
LotOfMem path: yes
LowPerformanceGPU path: no
period_iterations_num=50

.........
Flopcounter: 3335872831.090946
........
..........

GPU device sync requested... ...GPU device synched
05:00:49 (29828): called boinc_finish(0)


Long story short: how can a host with 366.64 GFLOPS of capability crunch a WU almost seven times faster than a host with 1,987.29 GFLOPS?


But I found several other examples, like :



Where a 980 Ti, which is a fine GPU indeed, crunches the same WU in a ninth of the time of my 1070 (and I OCed mine to 2100!)

Ok, I'm running 3 tasks on my 1070, and I can't tell whether my wingmen are doing the same... anyway, even after correcting my runtime accordingly and dividing it by 3, I still believe something is wrong in both cases.
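For reference, the three-tasks-per-GPU setup comes from BOINC's app_config.xml; mine is along these lines (a sketch — the app name and the cpu_usage value are what I'd expect for stock v8, they may differ on other setups):

```xml
<app_config>
  <app>
    <name>setiathome_v8</name>
    <gpu_versions>
      <!-- 0.33 => three tasks share one GPU -->
      <gpu_usage>0.33</gpu_usage>
      <!-- fraction of a CPU core reserved per GPU task (assumed value) -->
      <cpu_usage>0.25</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```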

Can someone extrapolate a pattern and/or give me a clue about what's happening here?
Am I getting it all wrong?

Thank you
A.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1805618 - Posted: 29 Jul 2016, 16:46:48 UTC - in response to Message 1805613.  

Long story short: how can a host with 366.64 GFLOPS of capability crunch a WU almost seven times faster than a host with 1,987.29 GFLOPS?

The one-word answer: software.

First, it would save a lot of time if you posted your sample workunits as links:

http://setiathome.berkeley.edu/workunit.php?wuid=2221866232
http://setiathome.berkeley.edu/workunit.php?wuid=2221415950

From those, we can see - with three clicks - all the data you spent all that time editing.

From there, we can see that both your wingmates were running opencl_nvidia_SoG - the first, r3430 as stock, and the second, r3472 under Anonymous Platform. You, on the other hand, are running "x41zi (baseline v8), Cuda 5.00".

My guess is that the cuda50 application (originally developed for Kepler series cards, but still the latest available) is diverging further and further from the latest Pascal architecture (for which Cuda v8.0 was developed). Jason Gee, our Cuda developer, will be able to amplify that suggestion further, or possibly shoot it down in flames.
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1805628 - Posted: 29 Jul 2016, 18:00:01 UTC - in response to Message 1805613.  
Last modified: 29 Jul 2016, 18:02:56 UTC

[Check the GPU's actual load...]

That task looks like a VHAR (aka 'shorty'). Because of the data layouts and (small) search sizes involved (with short launches), throughput is nearly completely bound to system latencies involved. These include what CPU resources you may or may not be dedicating to feed the GPU, the PCIe Lanes used by the card (that's a lot of cards in one system), and settings.

My 'feeling' is that your 1070 processes its requirement quickly, then starves.

I would suggest freeing some CPU cores to feed all those GPUs (whichever applications), then winding the process priority setting up to Above Normal, and the pulsefind settings to the max (your stderr shows conservative defaults all around). [For Cuda 5.0, per-device settings are possible]
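For the Cuda builds, those knobs live in the mbcuda.cfg file beside the app; a sketch from memory (key names and values here are best-effort assumptions — check the comments shipped in the file itself):

```ini
[mbcuda]
; lift the BELOW_NORMAL default shown in the stderr
processpriority = abovenormal
; pulsefind blocks per multiprocessor (stderr showed the default of 4)
pfblockspersm = 8
; pulsefind periods per kernel launch (default 100)
pfperiodsperlaunch = 200

; optional per-device override, keyed by PCI bus/slot
; (e.g. the 1070 reported at pciBusID 2, pciSlotID 0)
[bus2slot0]
pfblockspersm = 8
pfperiodsperlaunch = 200
```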

Stepping back and looking at the broader status quo, nVidia is now limiting gaming SLI support to 2 GPUs with the new generation. If you want to have a monster, you must be able to feed it (I know this because I have a lazy dog :P )
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1805629 - Posted: 29 Jul 2016, 18:18:58 UTC - in response to Message 1805618.  
Last modified: 29 Jul 2016, 18:19:44 UTC

My guess is that the cuda50 application (originally developed for Kepler series cards, but still the latest available) is diverging further and further from the latest Pascal architecture (for which Cuda v8.0 was developed). Jason Gee, our Cuda developer, will be able to amplify that suggestion further, or possibly shoot it down in flames.


Close, but for reference, others have labelled Pascal 'Maxwell on speed'. That's not derogatory: with the 750 Ti onward nV basically struck gold, and the process shrink gives the higher clocks. Reference: https://youtu.be/nDaekpMBYUA

There are ways to better utilise Maxwell+ than the current baseline apps do; however, 'more instances at lower utilisation' vs 'fewer instances at higher utilisation' will keep trading blows until we solve some more fundamental infrastructure issues, namely that nearly all building/debugging has become unmanageable.

Petri has managed to demonstrate that the VLAR issues (not applicable to the provided example) have nothing to do with the architecture per se.

With multiple instances and shorties, we're left with load-balancing and stream-based latency-hiding concerns, which will be exacerbated on a many-GPU rig.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
The_Matrix
Volunteer tester

Send message
Joined: 17 Nov 03
Posts: 414
Credit: 5,827,850
RAC: 0
Germany
Message 1805653 - Posted: 29 Jul 2016, 20:05:23 UTC

I think the problem is that all these PCIe x1-to-x16 adapters are stuck on one PCIe lane, or is that not the point?
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1805657 - Posted: 29 Jul 2016, 20:09:35 UTC - in response to Message 1805653.  
Last modified: 29 Jul 2016, 20:15:06 UTC

I think the problem is that all these PCIe x1-to-x16 adapters are stuck on one PCIe lane, or is that not the point?


I think it is that..., but we have information only of 48x 2.2 GHz CPUs (which is probably only 24 cores hyperthreaded, at half the effective GHz). So the hyperthreaded (1.1 GHz) CPU cores are already running at a lower clock rate than some of the GPUs, which could easily be bad. And 24 or 48 threads onto how many PCIe lanes total? Yeah, starved GPUs IMO; they need 16 lanes each.

2.2 GHz hyperthreaded = 2 x 1.1 GHz effective (no free lunch), driving 2+ GHz GPU(s)? I call that a hardware mismatch.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
AMDave
Volunteer tester

Joined: 9 Mar 01
Posts: 234
Credit: 11,671,730
RAC: 0
United States
Message 1805667 - Posted: 29 Jul 2016, 20:44:39 UTC - in response to Message 1805657.  

I think the problem is that all these PCIe x1-to-x16 adapters are stuck on one PCIe lane, or is that not the point?


I think it is that..., but we have information only of 48x 2.2 GHz CPUs (which is probably only 24 cores hyperthreaded, at half the effective GHz). So the hyperthreaded (1.1 GHz) CPU cores are already running at a lower clock rate than some of the GPUs, which could easily be bad. And 24 or 48 threads onto how many PCIe lanes total? Yeah, starved GPUs IMO; they need 16 lanes each.

2.2 GHz hyperthreaded = 2 x 1.1 GHz effective (no free lunch), driving 2+ GHz GPU(s)? I call that a hardware mismatch.

So, would it be fair to say that in order to get the most out of such multi-core, multi-socket systems, you should turn off HT (to avoid bottlenecks)?
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1805673 - Posted: 29 Jul 2016, 21:12:19 UTC - in response to Message 1805667.  

2.2 GHz hyperthreaded = 2 x 1.1 GHz effective (no free lunch), driving 2+ GHz GPU(s)? I call that a hardware mismatch.

So, would it be fair to say that in order to get the most out of such multi-core, multi-socket systems, you should turn off HT (to avoid bottlenecks)?

Or simply run them as nature intended - as pure CPU compute devices?

If you want to run GPUs as well, put them in ones or twos in simple i5 chassis - as many boxen as you need.
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1805679 - Posted: 29 Jul 2016, 21:40:39 UTC - in response to Message 1805667.  

I think the problem is that all these PCIe x1-to-x16 adapters are stuck on one PCIe lane, or is that not the point?


I think it is that..., but we have information only of 48x 2.2 GHz CPUs (which is probably only 24 cores hyperthreaded, at half the effective GHz). So the hyperthreaded (1.1 GHz) CPU cores are already running at a lower clock rate than some of the GPUs, which could easily be bad. And 24 or 48 threads onto how many PCIe lanes total? Yeah, starved GPUs IMO; they need 16 lanes each.

2.2 GHz hyperthreaded = 2 x 1.1 GHz effective (no free lunch), driving 2+ GHz GPU(s)? I call that a hardware mismatch.

So, would it be fair to say that in order to get the most out of such multi-core, multi-socket systems, you should turn off HT (to avoid bottlenecks)?


I think that's fair while BOINC, and the bulk of the absolute-performance crowd, focus on our simple apps and our existing naive infrastructure.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1805680 - Posted: 29 Jul 2016, 21:43:52 UTC - in response to Message 1805673.  

2.2 GHz hyperthreaded = 2 x 1.1 GHz effective (no free lunch), driving 2+ GHz GPU(s)? I call that a hardware mismatch.

So, would it be fair to say that in order to get the most out of such multi-core, multi-socket systems, you should turn off HT (to avoid bottlenecks)?

Or simply run them as nature intended - as pure CPU compute devices?

If you want to run GPUs as well, put them in ones or twos in simple i5 chassis - as many boxen as you need.


Careful of the natural confusion. The OP's question related to a Cuda device, while the OpenCL devices are more or less arbitrary. If Intel GPU devices are involved here, I would be more than happy to have a discussion on camera with that tosser Francois (from Intel) who proposed that GPUs would never be a thing.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
AMDave
Volunteer tester

Joined: 9 Mar 01
Posts: 234
Credit: 11,671,730
RAC: 0
United States
Message 1805704 - Posted: 29 Jul 2016, 23:30:49 UTC - in response to Message 1805679.  


So, would it be fair to say that in order to get the most out of such multi-core, multi-socket systems, you should turn off HT (to avoid bottlenecks)?


I think that's fair while BOINC, and the bulk of the absolute-performance crowd, focus on our simple apps and our existing naive infrastructure.

OK, so a multi-core, multi-socket system with 1 or more GPUs would be more efficient/productive without HT.

Now, take that same system sans GPUs, strictly running CPU apps. Would it be more efficient/productive with or without HT enabled?
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1805712 - Posted: 29 Jul 2016, 23:44:21 UTC - in response to Message 1805704.  

Now, take that same system sans GPUs, strictly running CPU apps. Would it be more efficient/productive with or without HT enabled?

HyperThreading is like running multiple WUs on a GPU: the individual WU run times are longer, but the overall throughput per hour is much higher.
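The trade-off can be put in toy numbers (the per-task slowdown used here is purely an illustrative assumption, not a measurement):

```python
# Toy latency-vs-throughput model of HyperThreading:
# each task runs slower, but two run at once per core.
def throughput(tasks_in_flight: int, seconds_per_task: float) -> float:
    """Completed tasks per hour."""
    return tasks_in_flight * 3600.0 / seconds_per_task

# Assumed numbers: one task in 100 s without HT; with HT each task
# takes 60% longer, but two are in flight on the core.
no_ht = throughput(tasks_in_flight=1, seconds_per_task=100.0)  # 36 tasks/h
ht = throughput(tasks_in_flight=2, seconds_per_task=160.0)     # 45 tasks/h

print(f"no HT: {no_ht:.0f} tasks/h, HT: {ht:.0f} tasks/h")
```

Longer individual run times, higher hourly throughput — at least as long as the per-task slowdown stays under 2x.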
Grant
Darwin NT
The_Matrix
Volunteer tester

Joined: 17 Nov 03
Posts: 414
Credit: 5,827,850
RAC: 0
Germany
Message 1805756 - Posted: 30 Jul 2016, 3:05:12 UTC
Last modified: 30 Jul 2016, 3:20:41 UTC

It's very bad; I found NO adequate PCIe x16-to-x16 riser to buy on amazon.it.

I think a cable must be at least 15 inches long to use it effectively.

https://www.caseking.de/lian-li-pw-pci-e-1-riser-card-kabel-gen.3-schwarz-geli-732.html

Found this in Germany, but it's not affordable, and where is the power cable!?

Edit:

OK, I would install the big high-performance cards ON BOARD, and use these PCIe x1 adapters only for the "low"-performance cards.
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1805757 - Posted: 30 Jul 2016, 3:10:49 UTC

Getting off-topic, but my 2 credits on hyper-threading: it's a great way to avoid idle CPU resources while waiting for a cache miss that could cost hundreds of cycles; it's particularly helpful on LPDDR and other slower memory subsystems.

The downside is that by "getting more work done" during a cache miss you can inadvertently pollute the cache even more and bog down the first core (a problem we struggled with on the PS3). You can mitigate the impact with streaming loads and stores that bypass the cache, but it's tricky.

Of course, the answer to the question of "is it faster without" is "it depends on your code!" -- so you might as well try it to see.

If only there was a standardized BOINC benchmark suite...
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1805795 - Posted: 30 Jul 2016, 9:32:11 UTC - in response to Message 1805757.  


If only there was a standardized BOINC benchmark suite...

Not quite BOINC, but SETI. Very standardized though, and definitely appropriate for checking HT benefits, if any (actually I did such tests long ago on a since-lost Atom-based netbook with 2 hyperthreaded CPUs):
the Lunatics KWSN bench (can't give an exact link to the download section because the Lunatics site is currently down for me).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1805812 - Posted: 30 Jul 2016, 11:42:36 UTC - in response to Message 1805795.  


If only there was a standardized BOINC benchmark suite...

Not quite BOINC, but SETI. Very standardized though, and definitely appropriate for checking HT benefits, if any (actually I did such tests long ago on a since-lost Atom-based netbook with 2 hyperthreaded CPUs):
the Lunatics KWSN bench (can't give an exact link to the download section because the Lunatics site is currently down for me).


Can't reach the Lunatics site either.


With each crime and every kindness we birth our future.
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1805816 - Posted: 30 Jul 2016, 12:02:06 UTC
Last modified: 30 Jul 2016, 12:13:55 UTC

Lunatics is unreachable for me as well, so it's not just you.

Regarding HT.....
If you recall, when this was first questioned, the kitties ran some rather extensive tests.
I was assisted by a very kind and knowledgeable Intel guru, archae.

And the results at that time were that HT off was more stable and allowed better OCing than with HT on.
HT was NOT, as was previously posted, like running more than one task at a time on a GPU to increase utilization.

It simply split CPU time between jobs, and at that time was not as efficient at it as one might think. I believe in my case at least, there was a few percent net loss as opposed to just running the cores full bore without the encumbrance of the HT overhead.

In a non-crunching world, where apps are not trying to use the CPUs full time, there could be some benefit to having multiple apps launched on HT virtual cores, where the CPU can jump between little bits of this and that and attend to each one in turn.

But not in the full usage crunching world.
In short.....there is no free CPU lunch. A clock cycle is a clock cycle, whether used in real time or split up in HT time.

The kitties believe that it is best to leave HT off and let the cores work without the toys.

Of course, things have advanced. I am running on rather old hardware. And YMMV always applies.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1805820 - Posted: 30 Jul 2016, 12:23:29 UTC - in response to Message 1805816.  
Last modified: 30 Jul 2016, 12:24:45 UTC

In short.....there is no free CPU lunch. A clock cycle is a clock cycle, whether used in real time or split up in HT time.


Word up Kittyman! [Edit: if you have low latency already, latency hiding won't help you]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1805824 - Posted: 30 Jul 2016, 12:44:20 UTC - in response to Message 1805820.  
Last modified: 30 Jul 2016, 12:44:50 UTC

In short.....there is no free CPU lunch. A clock cycle is a clock cycle, whether used in real time or split up in HT time.


Word up Kittyman! [Edit: if you have low latency already, latency hiding won't help you]

Talking about memory clock cycles?
Yes, they help. A CPU clock cycle is still a clock cycle, my friend.
They can be held up if the memory is not available. But the discussion was about HT, and HT does not help with memory access, if I am correct.

Meow, my friend.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

The_Matrix
Volunteer tester

Joined: 17 Nov 03
Posts: 414
Credit: 5,827,850
RAC: 0
Germany
Message 1805835 - Posted: 30 Jul 2016, 12:59:40 UTC

Is it worth it to turn the main-memory timings down,

from 667 MHz 9-9-9-24
to 457 MHz 6-6-6-17?

Will that have positive effects on crunching time for CPU workunits?
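A quick back-of-envelope check (assuming those numbers are the memory clock and CAS-style cycle timings): absolute latency in nanoseconds is cycles divided by clock, so the two settings come out almost identical in latency, while the lower clock gives up bandwidth:

```python
# Absolute latency of a memory timing: cycles / clock frequency.
def timing_ns(cycles: int, clock_mhz: float) -> float:
    """Convert a timing expressed in clock cycles to nanoseconds."""
    return cycles / clock_mhz * 1000.0

# The two configurations from the post:
print(f"667 MHz CL9: {timing_ns(9, 667.0):.2f} ns")  # ~13.49 ns
print(f"457 MHz CL6: {timing_ns(6, 457.0):.2f} ns")  # ~13.13 ns
# Nearly the same latency, but the 457 MHz setting loses ~30% bandwidth.
```

So, under that assumption, it would likely not help CPU workunits, which tend to benefit more from bandwidth.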