Just some early observations on V7 CUDA work.

Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377655 - Posted: 6 Jun 2013, 21:24:58 UTC

Just some early observations on V7 CUDA work versus V6 CUDA work with my GTX 550Ti's and GTX 660's, if anyone is interested.

My 550's run 2 concurrent workunits while my 660's run 3 (the 550's are equivalent to running 2x 9800GTX's and the 660's equivalent to running 3x 9800GTX's).
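
For reference, here's a minimal sketch of how that per-card concurrency is usually set, assuming a stock BOINC 7.0.40+ client and that the multibeam app name is setiathome_v7 (anonymous-platform/Lunatics installs set the equivalent <count> in app_info.xml instead). Saved as app_config.xml in the project's data directory:

    <app_config>
      <app>
        <name>setiathome_v7</name>
        <gpu_versions>
          <gpu_usage>0.5</gpu_usage>   <!-- 0.5 = 2 tasks per GPU, 0.33 = 3 -->
          <cpu_usage>0.04</cpu_usage>  <!-- CPU fraction reserved per GPU task -->
        </gpu_versions>
      </app>
    </app_config>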

Shorties that used to take 4-5mins are now taking 18-20mins.

Workunits that took 10-12mins now take about 25mins.

Workunits that took about 20mins now take around 30mins.

And those odd heavy ones that took around 25mins are now taking around 35mins.

On my 660's, VLAR's can be done much quicker on my 2500K and they cause very strange loadings on the GPU (people running just 1 workunit per card will probably be fine with these, but running 2 or more degrades progress badly), so I just abort any that come my way.

I can't comment on my CPU work yet, as I've still got plenty of AP work to get through before I get around to them.

I will say that it does look strange now when a V6 shortie resend gets done in 4mins while the V7's being done at the same time just chug along.

Cheers.
ID: 1377655 · Report as offensive
Profile skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1377657 - Posted: 6 Jun 2013, 21:28:27 UTC

Please remember that additional processing/searches were added to the WU's for V7, so we expected slowdowns. Perhaps in the future we'll get more optimization dealing with the new searches.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1377657 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1377666 - Posted: 6 Jun 2013, 21:42:26 UTC
Last modified: 6 Jun 2013, 21:46:19 UTC

Some tips on the new search in V7 (autocorrelations, see the wiki):
- The new search is very video memory intensive (for the time being). This means speed will tend to be limited by the video memory type, clock rate, and bus width more than you're used to, as opposed to core clocks (see the rough sketch after this list). This is subject to change (with future optimisation) now that baseline operation (precision, algorithm & logic) is being proven on a wide scale (against the stock CPU reference).
- Autocorrelations being memory bound, as above, will tend to mean you 'bottleneck' at fewer concurrent tasks, e.g. 2 instead of 3, or 1 instead of 2.
- More searches means more processing. More work takes longer.
- The credit payout seems 'pretty bad' for V7 tasks so far, when IMO it should be 1.5-2x V6, or even more, at the same angle range. What's up with that?
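
To make the memory-bound point above concrete, here is a rough sketch (illustrative only, not the actual x41zc code) of the kind of element-wise pass an FFT-based autocorrelation makes between transforms. It moves many bytes per arithmetic operation, so the memory clock and bus width set the ceiling rather than the core clock:

    // Illustrative only: a power-spectrum style pass between FFTs.
    // Each thread reads 8 bytes, writes 4 and does ~3 FLOPs, i.e. roughly
    // 0.25 FLOP per byte moved, so the kernel waits on DRAM, not on compute.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void power_spectrum(const float2* __restrict__ in,
                                   float* __restrict__ out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            float2 v = in[i];                 // 8-byte read
            out[i] = v.x * v.x + v.y * v.y;   // 2 muls + 1 add, 4-byte write
        }
    }

    int main()
    {
        const int n = 1 << 20;                // ~1M points, roughly a MB WU's worth
        float2* in;  float* out;
        cudaMalloc(&in,  n * sizeof(float2));
        cudaMalloc(&out, n * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float2));

        power_spectrum<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("moved ~%zu bytes for only ~%d MFLOP\n",
               (size_t)n * 12, 3 * n / 1000000);

        cudaFree(in); cudaFree(out);
        return 0;
    }

That ratio is also why two or three of these tasks running side by side start fighting over the same memory bus well before the shader cores are saturated.
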
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1377666 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1377674 - Posted: 6 Jun 2013, 21:51:04 UTC - in response to Message 1377655.  

Just some early observations on V7 CUDA work versus V6 CUDA work with my GTX 550Ti's and GTX 660's, if anyone is interested.

My 550's run 2 concurrent workunits while my 660's run 3 (the 550's are equivalent to running 2x 9800GTX's and the 660's equivalent to running 3x 9800GTX's).

Shorties that used to take 4-5mins are now taking 18-20mins.

Workunits that took 10-12mins now take about 25mins.

Workunits that took about 20mins now take around 30mins.

And those odd heavy ones that took around 25mins are now taking around 35mins.
...

As you might infer from that, the additional Autocorr searches are not affected by angle range. In fact every v7 WU produced by the current splitter code needs 519336 Autocorr searches (except those which quit early on overflow).
                                                                  Joe
ID: 1377674 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377685 - Posted: 6 Jun 2013, 22:17:32 UTC - in response to Message 1377674.  

I'm sorry, guys, but I was not implying that anything was wrong or that I was complaining about anything.

I was just throwing out my own observations.

At least now I shouldn't run out of CUDA work during the weekly outage, so all is good in my books.

Cheers.
ID: 1377685 · Report as offensive
Profile shizaru
Volunteer tester
Joined: 14 Jun 04
Posts: 1130
Credit: 1,967,904
RAC: 0
Greece
Message 1377693 - Posted: 6 Jun 2013, 22:38:28 UTC - in response to Message 1377674.  

In fact every v7 WU produced by the current splitter code needs 519336 Autocorr searches (except those which quit early on overflow).


I won't pretend to understand that, but wouldn't such an exact number make it easy to tack on a fixed amount of credit to each WU?
ID: 1377693 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1377826 - Posted: 7 Jun 2013, 6:42:28 UTC - in response to Message 1377666.  

- The new search is very video memory intensive (for the time being). This means speed will tend to be limited by the video memory type, clock rate, and bus width more than you're used to, as opposed to core clocks. This is subject to change (with future optimisation) now that baseline operation (precision, algorithm & logic) is being proven on a wide scale (against the stock CPU reference).
- Autocorrelations being memory bound, as above, will tend to mean you 'bottleneck' at fewer concurrent tasks, e.g. 2 instead of 3, or 1 instead of 2.

I can't remember exactly what the value was, but I'm pretty sure that according to GPU-Z the memory controller load used to be around 55-60% for 2 WUs at a time. With v7 it's 75%.
Grant
Darwin NT
ID: 1377826 · Report as offensive
Profile [AF>HFR]yoda51
Joined: 16 May 99
Posts: 13
Credit: 218,099,206
RAC: 90
France
Message 1377838 - Posted: 7 Jun 2013, 7:04:18 UTC

Since I started using SETI version 7, my average credit with the same optimized binaries:

Lunatics_x41zc_win32_cuda42.exe (GPU, Fermi)
AKv8c_Bb_r1846_winx86_SSSE3x.exe (CPU)

has dropped by over 20%!! (from 55,000 to 40,000 credit!!!)

The only difference is that I can no longer use BOINC rescheduler 2.7 with V7 to assign VLAR work to the CPU.

Do you have any explanation or solution for that?


ID: 1377838 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377840 - Posted: 7 Jun 2013, 7:07:37 UTC - in response to Message 1377838.  
Last modified: 7 Jun 2013, 7:09:06 UTC

Since I started using SETI version 7, my average credit with the same optimized binaries:

Lunatics_x41zc_win32_cuda42.exe (GPU, Fermi)
AKv8c_Bb_r1846_winx86_SSSE3x.exe (CPU)

has dropped by over 20%!! (from 55,000 to 40,000 credit!!!)

The only difference is that I can no longer use BOINC rescheduler 2.7 with V7 to assign VLAR work to the CPU.

Do you have any explanation or solution for that?


Yes. Let the VLARs run as the scheduler has issued them.
I know it's painful watching a GPU tied up with a VLAR.
I have them too.
Hopefully, the issue shall be addressed in the coming weeks.
But the VLAR issue is NOT the primary reason for your RAC dropping; it is the fact that credits are not being issued commensurate with the run times at present. Hopefully, that shall recover in the coming weeks as well.

My RAC has dropped from the 560k range to 408k at present. And it's still dropping.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377840 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377845 - Posted: 7 Jun 2013, 7:15:48 UTC - in response to Message 1377826.  

- The new search is very video memory intensive (for the time being). This means speed will tend to be limited by the video memory type, clock rate, and bus width more than you're used to, as opposed to core clocks. This is subject to change (with future optimisation) now that baseline operation (precision, algorithm & logic) is being proven on a wide scale (against the stock CPU reference).
- Autocorrelations being memory bound, as above, will tend to mean you 'bottleneck' at fewer concurrent tasks, e.g. 2 instead of 3, or 1 instead of 2.

I can't remember exactly what the value was, but I'm pretty sure that according to GPU-Z the memory controller load used to be around 55-60% for 2 WUs at a time. With v7 it's 75%.

Running 3 concurrent tasks on the 660's, the GPU load is around 90-95% (up about 10% on V6), the memory controller load is 60-75% (pretty much the same as V6), memory use is no more than 860MB (with a couple of browsers and other applications running), and power consumption is 80-85% of TDP (75-80% under V6).

The only bottlenecking here happens when a VLAR is added to the mix, but as I said earlier, I abort the suckers when I find them.

Cheers.
ID: 1377845 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377847 - Posted: 7 Jun 2013, 7:19:08 UTC - in response to Message 1377838.  
Last modified: 7 Jun 2013, 7:21:09 UTC

This thread has nothing to do with your RAC, so go and complain elsewhere, and that goes for anyone else who wants to complain about their sagging RAC's.

Cheers.
ID: 1377847 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377849 - Posted: 7 Jun 2013, 7:25:10 UTC - in response to Message 1377847.  

This thread has nothing to do with your RAC, so go and complain elsewhere, and that goes for anyone else who wants to complain about their sagging RAC's.

Cheers.

Now, let's be nice........
The title about observations on v7 Cuda work does include the drop in RAC and credits awarded for doing the longer running WUs, don't you think?
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377849 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377854 - Posted: 7 Jun 2013, 7:35:57 UTC - in response to Message 1377849.  

This thread has nothing to do with your RAC, so go and complain elsewhere, and that goes for anyone else who wants to complain about their sagging RAC's.

Cheers.

Now, let's be nice........
The title about observations on v7 Cuda work does include the drop in RAC and credits awarded for doing the longer running WUs, don't you think?

Sorry, but there are enough other threads where that's happening without this one being added to the list as well; I'm only talking about work behaviour. ;-)

Cheers.
ID: 1377854 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377855 - Posted: 7 Jun 2013, 7:37:13 UTC - in response to Message 1377854.  

This thread has nothing to do with your RAC, so go and complain elsewhere, and that goes for anyone else who wants to complain about their sagging RAC's.

Cheers.

Now, let's be nice........
The title about observations on v7 Cuda work does include the drop in RAC and credits awarded for doing the longer running WUs, don't you think?

Sorry, but there are enough other threads where that's happening without this one being added to the list as well; I'm only talking about work behaviour. ;-)

Cheers.

I understand......
Now that you have clarified, let all be warned...LOL.

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377855 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377856 - Posted: 7 Jun 2013, 7:44:32 UTC

And please bear with me, this post may be a mix of the two.

I do find that v7 is crunching up just fine on my farm.
Wish the VLARs were not showing up in my Cuda caches, but I shall crunch them the way they were issued nonetheless.

I do wish the credit thingy had not been so corrupted, though, so I could see the real difference in the amount of work I am doing now vs what I was doing before the changeover and the subsequent release of the Lunatics installer that got my rigs back to crunching the way they were before both of the new releases.

It's hard to get a handle on the real impact.

Just a hunch, but I think that if all things had remained equal, my RAC would have jumped a bit.

And the other positive side of it....just a bit of a positive, is that with the longer crunch times of v7, it does stretch the 100/100 cache enough to make it through the weekly outage.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377856 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377858 - Posted: 7 Jun 2013, 7:51:46 UTC

And Jason's posts about the way v7 works mean that GPU oc'ing may need to take a slightly different tack.
In the past, it was the core clock, and even more so the shader clock, that had the most impact on throughput. I usually left the memory speed at stock.

According to Jason's input, it now might be as important to push the GPU RAM speed a bit as well.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377858 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1377860 - Posted: 7 Jun 2013, 8:00:04 UTC - in response to Message 1377858.  

According to Jason's input, it now might be as important to push the GPU RAM speed a bit as well.


It's worth a go, just be wary of pushing into artefact territory. 10% faster only to get invalids... well, you probably realise that already.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1377860 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1377866 - Posted: 7 Jun 2013, 8:04:02 UTC - in response to Message 1377845.  
Last modified: 7 Jun 2013, 8:05:25 UTC

Running 3 concurrent tasks on the 660's, the GPU load is around 90-95% (up about 10% on V6), the memory controller load is 60-75% (pretty much the same as V6), memory use is no more than 860MB (with a couple of browsers and other applications running), and power consumption is 80-85% of TDP (75-80% under V6).

The only bottlenecking here happens when a VLAR is added to the mix, but as I said earlier, I abort the suckers when I find them.

Cheers.


What you're seeing in the likes of GPU-Z are averages over the sample period (typically 1 second), while CUDA kernel launches are typically in the millisecond range. You need a profiler to analyse whether code is memory or compute bound, and it is always one or the other, though preferably a delicate balance of course. 'Saturation' is a different concept.
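
For example (assuming a CUDA 5.x toolkit is on the box; exact metric names vary a little between toolkit versions), nvprof can report per-kernel DRAM throughput rather than a once-a-second average:

    nvprof --metrics dram_read_throughput,dram_write_throughput <path to the CUDA app>

Hot kernels sitting near the card's peak memory bandwidth are a fairly strong hint that they're memory bound rather than compute bound.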

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1377866 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1377867 - Posted: 7 Jun 2013, 8:05:48 UTC - in response to Message 1377860.  

According to Jason's input, it now might be as important to push the GPU RAM speed a bit as well.


It's worth a go, just be wary of pushing into artefact territory. 10% faster only to get invalids... well, you probably realise that already.

New territory for the kitties. I always left the RAM on its own and pushed the shaders until they hurt.

It's hard to do that kind of tweaking on 9 rigs with different cards on all of them.


"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1377867 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1377868 - Posted: 7 Jun 2013, 8:06:05 UTC - in response to Message 1377856.  

Your 680's should handle VLAR's much better than my 660's due to their larger 256-bit memory bus, which should make a bit of a difference for you Mark, but if I see one amongst my CUDA tasks it will be aborted immediately (but each to their own ways, and this is "my way" of treating something that I see as a problem that should never have been allowed to start in the 1st place).
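
(For rough numbers, assuming stock 6 GHz effective GDDR5 on both: peak bandwidth is roughly the bus width in bytes times the effective data rate, so a 192-bit GTX 660 gets about 24 B x 6 GT/s ≈ 144 GB/s, while a 256-bit GTX 680 gets about 32 B x 6 GT/s ≈ 192 GB/s.)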

Cheers.
ID: 1377868 · Report as offensive