Just some early observations on V7 CUDA work.
Author | Message |
---|---|
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
Just some early observations on V7 CUDA work versus V6 CUDA work with my GTX 550Ti's and GTX 660's, if anyone is interested. My 550's run 2 concurrent workunits while my 660's run 3 (the 550's are equivalent to running 2x 9800GTX's and the 660's equivalent to running 3x 9800GTX's). Shorties that used to take 4-5mins are now taking 18-20mins. Workunits that took 10-12mins now take about 25mins. Workunits that took about 20mins now take around 30mins. And those odd heavy 1's that took around 25mins are now taking around 35mins. On my 660's, VLAR's can be done much quicker on my 2500K and cause very strange loadings (people just running 1 workunit per card will probably be fine with these but running 2 or more degrades progress badly) so I just abort any that come my way. I can't at this time comment on my CPU work as I've still got plenty of AP work to get through before I get around to it. I will say that it does look strange now when a V6 shortie resend gets done in 4mins while the V7's being done at the same time just chug along. Cheers. |
skildude Send message Joined: 4 Oct 00 Posts: 9541 Credit: 50,759,529 RAC: 60 |
Please remember that additional processing/searches were added to the WU's for V7 so we expected slowdowns. Perhaps in the future we'll get more optimization dealing with the new searches. In a rich man's house there is no place to spit but his face. Diogenes Of Sinope |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Some tips with the new search in V7 (autocorrelations, see wiki) - The new search is very video memory intensive (for the time being). This means speed will tend to be limited by the video memory type, clockrate, and bus width more than you're used to, as opposed to core clocks. This is subject to change (with future optimisation) now that baseline operation (precision, algorithm & logic) is being proven on a wide scale (against stock cpu reference). - Autocorrelations being memory bound, as above, will tend to mean you 'bottleneck' at fewer concurrent tasks, e.g. 2 instead of 3, or 1 instead of 2. - More searches is more processing. More work takes longer. - The credit payout seems 'pretty bad' for V7 tasks so far, when IMO it should be 1.5-2x V6, or even more, at the same angle range. What's up with that? "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Just some early observations on V7 CUDA work versus V6 CUDA work with my GTX 550Ti's and GTX 660's, if anyone is interested. As you might infer from that, the additional Autocorr searches are not affected by angle range. In fact every v7 WU produced by the current splitter code needs 519336 Autocorr searches (except those which quit early on overflow). Joe |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
I'm sorry guys, but I was not inferring that anything was wrong or that I was complaining about anything. I was just throwing out my own observations. At least now I shouldn't run out of CUDA work during the weekly outage so all is good in my books. Cheers. |
shizaru Send message Joined: 14 Jun 04 Posts: 1130 Credit: 1,967,904 RAC: 0 |
In fact every v7 WU produced by the current splitter code needs 519336 Autocorr searches (except those which quit early on overflow). I won't pretend to understand that but wouldn't such an exact number make it easy to tack on an x amount of credit to each WU? |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
- The new search is very video memory intensive (for the time being). This means speed will tend to be limited by the video memory type, clockrate, and bus width more than you're used to, as opposed to core clocks. This is subject to change (with future optimisation) now that baseline operation (precision, algorithm & logic) is being proven on a wide scale (against stock cpu reference). I can't remember exactly what the value was, but I'm pretty sure that according to GPUz the memory controller load used to be around 55-60% for 2 WUs at a time. With v7 it's 75%. Grant Darwin NT |
[AF>HFR]yoda51 Send message Joined: 16 May 99 Posts: 13 Credit: 218,099,206 RAC: 90 |
Since I started using SETI version 7, my average credit with the same optimized binaries, Lunatics_x41zc_win32_cuda42.exe (GPU, Fermi) and AKv8c_Bb_r1846_winx86_SSSE3x.exe (CPU), has dropped over 20% (from 55000 to 40000 credit)! The only difference is that I can no longer use BOINC Rescheduler 2.7 with V7 to assign VLAR blocks to the CPU. Do you have any explanation or solution for that? |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
Since I started using SETI version 7, my average credit with the same optimized binaries: Yes. Let the VLARs run as the scheduler has issued them. I know it's painful watching a GPU tied up with a VLAR. I have them too. Hopefully, the issue shall be addressed in the coming weeks. But, the VLAR issue is NOT the primary reason for your RAC dropping, it is the fact that credits are not being issued commensurate with the run times at present. Hopefully, that shall recover in the coming weeks as well. My RAC has dropped from the 560k range to 408k at present. And it's still dropping. Meow. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
- The new search is very video memory intensive (for the time being). This means speed will tend to be limited by the video memory type, clockrate, and bus width more than you're used to, as opposed to core clocks. This is subject to change (with future optimisation) now that baseline operation (precision, algorithm & logic) is being proven on a wide scale (against stock cpu reference). Running 3 concurrent tasks on the 660's the GPU load is around 90-95% (up about 10% on V6), the memory controller load is 60-75% (pretty much the same as V6), using no more than 860MB of memory (with a couple of browsers and other applications running) and power consumption 80-85% TDP (75-80% under V6). The only bottlenecking here happens when a VLAR is added to the mix but as I said earlier, I abort the suckers when I find them. Cheers. |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
This thread has nothing to do with your RAC so go and complain elsewhere and that goes for anyone else who wants to complain about their sagging RAC's. Cheers. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
This thread has nothing to do with your RAC so go and complain elsewhere and that goes for anyone else who wants to complain about their sagging RAC's. Now, let's be nice........ The title about observations on v7 Cuda work does include the drop in RAC and credits awarded for doing the longer running WUs, don't you think? "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
This thread has nothing to do with your RAC so go and complain elsewhere and that goes for anyone else who wants to complain about their sagging RAC's. Sorry, but there are enough other threads where that's happening without this 1 being another added to the list. I'm only talking about work behaviour. ;-) Cheers. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
This thread has nothing to do with your RAC so go and complain elsewhere and that goes for anyone else who wants to complain about their sagging RAC's. I understand...... Now that you have clarified, let all be warned...LOL. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
And please bear with me, this post may be a mix of the two. I do find that v7 is crunching up just fine on my farm. Wish the VLARs were not showing up in my Cuda caches, but I shall crunch them the way they were issued nonetheless. I do wish the credit thingy had not been so corrupted though, so I could see the real difference in the amount of work I am doing now VS what I was doing before the changeover and subsequent release of the Lunatics installer to get my rigs back to the way they were crunching before both of the new releases. It's hard to get a handle on the real impact. Just a hunch, but I think that if all things had remained equal, my RAC would have jumped a bit. And the other positive side of it....just a bit of a positive, is that with the longer crunch times of v7, it does stretch the 100/100 cache enough to make it through the weekly outage. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
And Jason's posts about the way v7 works mean that GPU oc'ing may take a tad bit different tack. In the past, it was the core clock and moreover the shader clock that had the most impact on throughput. I usually left the memory speed at stock. According to Jason's input, it now might be as important to push the GPU RAM speed a bit as well. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
According to Jason's input, it now might be as important to push the GPU RAM speed a bit as well. It's worth a go, just be wary of pushing into artefact territory. 10% faster only to get invalids.... well, you probably realise that already. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Running 3 concurrent tasks on the 660's the GPU load is around 90-95% (up about 10% on V6), the memory controller load is 60-75% (pretty much the same as V6), using no more than 860MB of memory (with a couple of browsers and other applications running) and power consumption 80-85% TDP (75-80% under V6). What you're seeing in the likes of GPU-z are averages over the sample period (typically 1 second) while Cuda kernel launches are typically in the millisecond range. You need a profiler to analyse whether code is memory or compute bound, and it is always one or the other, though preferably a delicate balance of course. 'Saturation' is a different concept. Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
According to Jason's input, it now might be as important to push the GPU RAM speed a bit as well. New territory for the kitties. I always left the RAM on its own and pushed the shaders until they hurt. It's hard to do that kind of tweaking on 9 rigs with different cards on all of them. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
Your 680's should handle VLAR's much better than my 660's due to their larger 256-bit memory bus which should make a bit of a difference to you Mark, but if I see 1 in amongst my CUDA tasks it will be aborted immediately (but each to their own ways and this is "my way" to treat something that I see as a problem that should never have been allowed to start in the 1st place). Cheers. |
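
As a rough illustration of the bus-width point made in the thread (memory-bound work such as the autocorrelation search tends to scale with memory bandwidth rather than core clock), peak theoretical bandwidth can be estimated from bus width and effective memory data rate. This is only a sketch: the function name is made up for illustration, and the 192-bit width and 6008 MT/s data rate are assumed values, not figures quoted by anyone above. Substitute whatever GPU-Z reports for your own card.

```python
def peak_bandwidth_gbs(bus_width_bits: int, effective_rate_mtps: float) -> float:
    """Peak theoretical memory bandwidth in GB/s (1 GB = 1e9 bytes).

    bus_width_bits      -- memory bus width in bits
    effective_rate_mtps -- effective data rate in mega-transfers per second
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_rate_mtps * 1e6 / 1e9


# Assumed values for illustration only; they are not taken from the thread.
ASSUMED_RATE_MTPS = 6008.0   # hypothetical effective memory data rate
NARROW_BUS_BITS = 192        # assumed width for the narrower card
WIDE_BUS_BITS = 256          # the 256-bit bus mentioned for the 680's

narrow = peak_bandwidth_gbs(NARROW_BUS_BITS, ASSUMED_RATE_MTPS)
wide = peak_bandwidth_gbs(WIDE_BUS_BITS, ASSUMED_RATE_MTPS)

print(f"{NARROW_BUS_BITS}-bit bus: {narrow:.1f} GB/s")
print(f"{WIDE_BUS_BITS}-bit bus: {wide:.1f} GB/s")
print(f"ratio: {wide / narrow:.2f}x")
```

At the same memory clock the ratio is simply 256/192, so the wider bus gives about a third more peak bandwidth. Real throughput depends on access patterns and on how many tasks share the card, so treat this as an upper bound rather than a prediction of run times.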