Just some early observations on V7 CUDA work.

Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377655 - Posted: 6 Jun 2013, 21:24:58 UTC

Just some early observations on V7 CUDA work versus V6 CUDA work with my GTX 550Ti's and GTX 660's, if anyone is interested.

My 550's run 2 concurrent workunits while my 660's run 3 (the 550's are equivalent to running 2x 9800GTX's and the 660's equivalent to running 3x 9800GTX's).
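
For reference, here's a minimal sketch of how that per-card concurrency is usually set, assuming a stock BOINC 7.0.40+ client and that the multibeam app name is setiathome_v7 (anonymous-platform/Lunatics installs set the equivalent <count> in app_info.xml instead). Saved as app_config.xml in the project's data directory:

    <app_config>
      <app>
        <name>setiathome_v7</name>
        <gpu_versions>
          <gpu_usage>0.5</gpu_usage>   <!-- 0.5 = 2 tasks per GPU, 0.33 = 3 -->
          <cpu_usage>0.04</cpu_usage>  <!-- CPU fraction reserved per GPU task -->
        </gpu_versions>
      </app>
    </app_config>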

Shorties that used to take 4-5mins are now taking 18-20mins.

Workunits that took 10-12mins now take about 25mins.

Workunits that took about 20mins now take around 30mins.

And those odd heavy ones that took around 25mins are now taking around 35mins.

On my 660's, VLAR's can be done much quicker on my 2500K and they cause very strange loadings on the GPU (people running just 1 workunit per card will probably be fine with these, but running 2 or more degrades progress badly), so I just abort any that come my way.

I can't comment on my CPU work yet, as I've still got plenty of AP work to get through before I get around to them.

I will say that it does look strange now when a V6 shortie resend gets done in 4mins while the V7's being done at the same time just chug along.

Cheers.
ID: 1377655 · Report as offensive
Profile skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1377657 - Posted: 6 Jun 2013, 21:28:27 UTC

Please remember that additional processing/searches were added to the WU's for V7, so we expected slowdowns. Perhaps in the future we'll get more optimization dealing with the new searches.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1377657 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1377666 - Posted: 6 Jun 2013, 21:42:26 UTC
Last modified: 6 Jun 2013, 21:46:19 UTC

Some tips on the new search in V7 (autocorrelations, see the wiki):
- The new search is very video memory intensive (for the time being). This means speed will tend to be limited by the video memory type, clock rate, and bus width more than you're used to, as opposed to core clocks (see the rough sketch after this list). This is subject to change (with future optimisation) now that baseline operation (precision, algorithm & logic) is being proven on a wide scale (against the stock CPU reference).
- Autocorrelations being memory bound, as above, will tend to mean you 'bottleneck' at fewer concurrent tasks, e.g. 2 instead of 3, or 1 instead of 2.
- More searches means more processing. More work takes longer.
- The credit payout seems 'pretty bad' for V7 tasks so far, when IMO it should be 1.5-2x V6, or even more, at the same angle range. What's up with that?
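
To make the memory-bound point above concrete, here is a rough sketch (illustrative only, not the actual x41zc code) of the kind of element-wise pass an FFT-based autocorrelation makes between transforms. It moves many bytes per arithmetic operation, so the memory clock and bus width set the ceiling rather than the core clock:

    // Illustrative only: a power-spectrum style pass between FFTs.
    // Each thread reads 8 bytes, writes 4 and does ~3 FLOPs, i.e. roughly
    // 0.25 FLOP per byte moved, so the kernel waits on DRAM, not on compute.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void power_spectrum(const float2* __restrict__ in,
                                   float* __restrict__ out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            float2 v = in[i];                 // 8-byte read
            out[i] = v.x * v.x + v.y * v.y;   // 2 muls + 1 add, 4-byte write
        }
    }

    int main()
    {
        const int n = 1 << 20;                // ~1M points, roughly a MB WU's worth
        float2* in;  float* out;
        cudaMalloc(&in,  n * sizeof(float2));
        cudaMalloc(&out, n * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float2));

        power_spectrum<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("moved ~%zu bytes for only ~%d MFLOP\n",
               (size_t)n * 12, 3 * n / 1000000);

        cudaFree(in); cudaFree(out);
        return 0;
    }

That ratio is also why two or three of these tasks running side by side start fighting over the same memory bus well before the shader cores are saturated.
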
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1377666 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1377674 - Posted: 6 Jun 2013, 21:51:04 UTC - in response to Message 1377655.  

Just some early observations on V7 CUDA work versus V6 CUDA work with my GTX 550Ti's and GTX 660's, if anyone is interested.

My 550's run 2 concurrent workunits while my 660's run 3 (the 550's are equivalent to running 2x 9800GTX's and the 660's equivalent to running 3x 9800GTX's).

Shorties that used to take 4-5mins are now taking 18-20mins.

Workunits that took 10-12mins now take about 25mins.

Workunits that took about 20mins now take around 30mins.

And those odd heavy ones that took around 25mins are now taking around 35mins.
...

As you might infer from that, the additional Autocorr searches are not affected by angle range. In fact every v7 WU produced by the current splitter code needs 519336 Autocorr searches (except those which quit early on overflow).
                                                                  Joe
ID: 1377674 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377685 - Posted: 6 Jun 2013, 22:17:32 UTC - in response to Message 1377674.  

I'm sorry, guys, but I was not implying that anything was wrong or that I was complaining about anything.

I was just throwing out my own observations.

At least now I shouldn't run out of CUDA work during the weekly outage, so all is good in my books.

Cheers.
ID: 1377685 · Report as offensive
Profile shizaru
Volunteer tester
Joined: 14 Jun 04
Posts: 1130
Credit: 1,967,904
RAC: 0
Greece
Message 1377693 - Posted: 6 Jun 2013, 22:38:28 UTC - in response to Message 1377674.  

In fact every v7 WU produced by the current splitter code needs 519336 Autocorr searches (except those which quit early on overflow).


I won't pretend to understand that, but wouldn't such an exact number make it easy to tack on a fixed amount of credit to each WU?
ID: 1377693 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1377826 - Posted: 7 Jun 2013, 6:42:28 UTC - in response to Message 1377666.  

- The new search is very video memory intensive (for the time being). This means speed will tend to be limited by the video memory type, clock rate, and bus width more than you're used to, as opposed to core clocks. This is subject to change (with future optimisation) now that baseline operation (precision, algorithm & logic) is being proven on a wide scale (against the stock CPU reference).
- Autocorrelations being memory bound, as above, will tend to mean you 'bottleneck' at fewer concurrent tasks, e.g. 2 instead of 3, or 1 instead of 2.

I can't remember exactly what the value was, but I'm pretty sure that according to GPU-Z the memory controller load used to be around 55-60% for 2 WUs at a time. With v7 it's 75%.
Grant
Darwin NT
ID: 1377826 · Report as offensive
Profile [AF>HFR]yoda51
Joined: 16 May 99
Posts: 13
Credit: 218,099,206
RAC: 90
France
Message 1377838 - Posted: 7 Jun 2013, 7:04:18 UTC

Since I started using SETI version 7, my average credit with the same optimized binaries:

Lunatics_x41zc_win32_cuda42.exe (GPU, Fermi)
AKv8c_Bb_r1846_winx86_SSSE3x.exe (CPU)

has dropped by over 20%!! (from 55,000 to 40,000 credit!!!)

The only difference is that I can no longer use BOINC rescheduler 2.7 with V7 to assign VLAR work to the CPU.

Do you have any explanation or solution for that?


ID: 1377838 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377840 - Posted: 7 Jun 2013, 7:07:37 UTC - in response to Message 1377838.  
Last modified: 7 Jun 2013, 7:09:06 UTC

Since I started using SETI version 7, my average credit with the same optimized binaries:

Lunatics_x41zc_win32_cuda42.exe (GPU, Fermi)
AKv8c_Bb_r1846_winx86_SSSE3x.exe (CPU)

has dropped by over 20%!! (from 55,000 to 40,000 credit!!!)

The only difference is that I can no longer use BOINC rescheduler 2.7 with V7 to assign VLAR work to the CPU.

Do you have any explanation or solution for that?


Yes. Let the VLARs run as the scheduler has issued them.
I know it's painful watching a GPU tied up with a VLAR.
I have them too.
Hopefully, the issue shall be addressed in the coming weeks.
But the VLAR issue is NOT the primary reason for your RAC dropping; it is the fact that credits are not being issued commensurate with the run times at present. Hopefully, that shall recover in the coming weeks as well.

My RAC has dropped from the 560k range to 408k at present. And it's still dropping.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377840 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377845 - Posted: 7 Jun 2013, 7:15:48 UTC - in response to Message 1377826.  

- The new search is very video memory intensive (for the time being). This means speed will tend to be limited by the video memory type, clock rate, and bus width more than you're used to, as opposed to core clocks. This is subject to change (with future optimisation) now that baseline operation (precision, algorithm & logic) is being proven on a wide scale (against the stock CPU reference).
- Autocorrelations being memory bound, as above, will tend to mean you 'bottleneck' at fewer concurrent tasks, e.g. 2 instead of 3, or 1 instead of 2.

I can't remember exactly what the value was, but I'm pretty sure that according to GPU-Z the memory controller load used to be around 55-60% for 2 WUs at a time. With v7 it's 75%.

Running 3 concurrent tasks on the 660's, the GPU load is around 90-95% (up about 10% on V6), the memory controller load is 60-75% (pretty much the same as V6), memory use is no more than 860MB (with a couple of browsers and other applications running), and power consumption is 80-85% of TDP (75-80% under V6).

The only bottlenecking here happens when a VLAR is added to the mix, but as I said earlier, I abort the suckers when I find them.

Cheers.
ID: 1377845 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377847 - Posted: 7 Jun 2013, 7:19:08 UTC - in response to Message 1377838.  
Last modified: 7 Jun 2013, 7:21:09 UTC

This thread has nothing to do with your RAC, so go and complain elsewhere, and that goes for anyone else who wants to complain about their sagging RAC's.

Cheers.
ID: 1377847 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377849 - Posted: 7 Jun 2013, 7:25:10 UTC - in response to Message 1377847.  

This thread has nothing to do with your RAC, so go and complain elsewhere, and that goes for anyone else who wants to complain about their sagging RAC's.

Cheers.

Now, let's be nice........
The title about observations on v7 Cuda work does include the drop in RAC and credits awarded for doing the longer running WUs, don't you think?
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377849 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377854 - Posted: 7 Jun 2013, 7:35:57 UTC - in response to Message 1377849.  

This thread has nothing to do with your RAC, so go and complain elsewhere, and that goes for anyone else who wants to complain about their sagging RAC's.

Cheers.

Now, let's be nice........
The title about observations on v7 Cuda work does include the drop in RAC and credits awarded for doing the longer running WUs, don't you think?

Sorry, but there are enough other threads where that's happening without this one being added to the list as well; I'm only talking about work behaviour. ;-)

Cheers.
ID: 1377854 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377855 - Posted: 7 Jun 2013, 7:37:13 UTC - in response to Message 1377854.  

This thread has nothing to do with your RAC, so go and complain elsewhere, and that goes for anyone else who wants to complain about their sagging RAC's.

Cheers.

Now, let's be nice........
The title about observations on v7 Cuda work does include the drop in RAC and credits awarded for doing the longer running WUs, don't you think?

Sorry, but there are enough other threads where that's happening without this one being added to the list as well; I'm only talking about work behaviour. ;-)

Cheers.

I understand......
Now that you have clarified, let all be warned...LOL.

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377855 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377856 - Posted: 7 Jun 2013, 7:44:32 UTC

And please bear with me, this post may be a mix of the two.

I do find that v7 is crunching up just fine on my farm.
Wish the VLARs were not showing up in my Cuda caches, but I shall crunch them the way they were issued nonetheless.

I do wish the credit thingy had not been so corrupted, though, so I could see the real difference in the amount of work I am doing now vs what I was doing before the changeover and the subsequent release of the Lunatics installer that got my rigs back to crunching the way they were before both of the new releases.

It's hard to get a handle on the real impact.

Just a hunch, but I think that if all things had remained equal, my RAC would have jumped a bit.

And the other positive side of it....just a bit of a positive, is that with the longer crunch times of v7, it does stretch the 100/100 cache enough to make it through the weekly outage.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377856 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377858 - Posted: 7 Jun 2013, 7:51:46 UTC

And Jason's posts about the way v7 works mean that GPU oc'ing may need to take a slightly different tack.
In the past, it was the core clock, and even more so the shader clock, that had the most impact on throughput. I usually left the memory speed at stock.

According to Jason's input, it now might be as important to push the GPU RAM speed a bit as well.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377858 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1377860 - Posted: 7 Jun 2013, 8:00:04 UTC - in response to Message 1377858.  

According to Jason's input, it now might be as important to push the GPU RAM speed a bit as well.


It's worth a go, just be wary of pushing into artefact territory. 10% faster only to get invalids... well, you probably realise that already.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1377860 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1377866 - Posted: 7 Jun 2013, 8:04:02 UTC - in response to Message 1377845.  
Last modified: 7 Jun 2013, 8:05:25 UTC

Running 3 concurrent tasks on the 660's, the GPU load is around 90-95% (up about 10% on V6), the memory controller load is 60-75% (pretty much the same as V6), memory use is no more than 860MB (with a couple of browsers and other applications running), and power consumption is 80-85% of TDP (75-80% under V6).

The only bottlenecking here happens when a VLAR is added to the mix, but as I said earlier, I abort the suckers when I find them.

Cheers.


What you're seeing in the likes of GPU-Z are averages over the sample period (typically 1 second), while CUDA kernel launches are typically in the millisecond range. You need a profiler to analyse whether code is memory or compute bound, and it is always one or the other, though preferably a delicate balance of course. 'Saturation' is a different concept.
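
For example (assuming a CUDA 5.x toolkit is on the box; exact metric names vary a little between toolkit versions), nvprof can report per-kernel DRAM throughput rather than a once-a-second average:

    nvprof --metrics dram_read_throughput,dram_write_throughput <path to the CUDA app>

Hot kernels sitting near the card's peak memory bandwidth are a fairly strong hint that they're memory bound rather than compute bound.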

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1377866 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377867 - Posted: 7 Jun 2013, 8:05:48 UTC - in response to Message 1377860.  

According to Jason's input, it now might be as important to push the GPU RAM speed a bit as well.


It's worth a go, just be wary of pushing into artefact territory. 10% faster only to get invalids... well, you probably realise that already.

New territory for the kitties. I always left the RAM on its own and pushed the shaders until they hurt.

It's hard to do that kind of tweaking on 9 rigs with different cards on all of them.


"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377867 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377868 - Posted: 7 Jun 2013, 8:06:05 UTC - in response to Message 1377856.  

Your 680's should handle VLAR's much better than my 660's due to their larger 256-bit memory bus, which should make a bit of a difference for you Mark, but if I see one amongst my CUDA tasks it will be aborted immediately (but each to their own ways, and this is "my way" of treating something that I see as a problem that should never have been allowed to start in the 1st place).
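
(For rough numbers, assuming stock 6 GHz effective GDDR5 on both: peak bandwidth is roughly the bus width in bytes times the effective data rate, so a 192-bit GTX 660 gets about 24 B x 6 GT/s ≈ 144 GB/s, while a 256-bit GTX 680 gets about 32 B x 6 GT/s ≈ 192 GB/s.)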

Cheers.
ID: 1377868 · Report as offensive