Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19)

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1524119 - Posted: 3 Jun 2014, 19:39:26 UTC

Thanks!
That should help as I prepare to streamline things a fair bit. Most likely I won't make any changes there in a hurry, being tied up with some testing for nv and some BOINC patching, but it'll definitely help me reduce the number of fiddly bits down the line.

Cheers,
Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1524119
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1524151 - Posted: 3 Jun 2014, 20:32:16 UTC

Thanks, Guy. I'll give it a try when I get home in around 9 hours.
Soli Deo Gloria
ID: 1524151
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1524292 - Posted: 4 Jun 2014, 6:08:10 UTC
Last modified: 4 Jun 2014, 6:09:57 UTC

I haven't tried anything yet, but I'm just wondering: was it really necessary to force a manual installation of the NV proprietary display driver? In K/Ubuntu, I just installed the distribution-packaged display driver along with the other CUDA libraries and what-not. The RPM package from nvidia.com is not suitable? I'll follow as instructed anyway.

Another question: Presumably the compiled binary is auto-labelled ..._cuda60 because that's the current version of the CUDA tools installed from nvidia.com. I remember the notes for the binary distributions of the MB CUDA application referring to things like the cuda42 build being suitable for Fermi GPUs, cuda50 for Kepler GPUs, etc. Is this note relevant when one is building manually?
ID: 1524292
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1524333 - Posted: 4 Jun 2014, 7:28:35 UTC - in response to Message 1524292.  
Last modified: 4 Jun 2014, 7:31:22 UTC

I haven't tried anything yet, but I'm just wondering: was it really necessary to force a manual installation of the NV proprietary display driver? In K/Ubuntu, I just installed the distribution-packaged display driver along with the other CUDA libraries and what-not. The RPM package from nvidia.com is not suitable? I'll follow as instructed anyway.

Another question: Presumably the compiled binary is auto-labelled ..._cuda60 because that's the current version of the CUDA tools installed from nvidia.com. I remember the notes for the binary distributions of the MB CUDA application referring to things like the cuda42 build being suitable for Fermi GPUs, cuda50 for Kepler GPUs, etc. Is this note relevant when one is building manually?


Try both ways and tell us? Frankly I had less struggle on Ubuntu, though I have a tendency to do stuff on automatic. The need to do 'weird stuff' can come either from necessity, or from habit carried over from when it was needed but no longer is.
ID: 1524333
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1525382 - Posted: 7 Jun 2014, 2:00:50 UTC
Last modified: 7 Jun 2014, 2:01:43 UTC

I had another go using my original Ubuntu set-up (basically starting after the Fedora-specific set-up finishes, at step 16). I think the main points I needed were steps 35 to 43 (there's an extra cuda directory listed in those path names, but that's okay, I worked out where everything was). I managed to get the BOINC libraries to compile this time, plus I got a lot further in the MB CUDA compilation... but I'm hitting another error now and I'm not sure where to look (it seems to be something to do with hires_timer). I can't tell if there are a few more libraries I need to download.

libtool: line 375: $'\r': command not found
../libtool: line 377: $'\r': command not found
../libtool: line 384: $'\r': command not found
/bin/sed: -e expression #1, char 10: unknown option to `s'
../libtool: line 388: $'\r': command not found
../libtool: line 392: $'\r': command not found
../libtool: line 397: $'\r': command not found
../libtool: line 408: syntax error near unexpected token `elif'
'./libtool: line 408: `elif test "X$1" = X--fallback-echo; then
make[2]: *** [hires_timer_test] Error 2
make[2]: Leaving directory `/home/.../sah_v7_opt/Xbranch/client'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/.../sah_v7_opt/Xbranch'
make: *** [all] Error 2

ID: 1525382
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1525545 - Posted: 7 Jun 2014, 10:14:00 UTC

Okay, I realised my generated libtool script got carriage-return characters stuck in it because I had copied the source code repository from my Windows-based main machine. Download quotas are still common in Australia so I didn't want to re-download the entire repository on the Linux machine. I seem to have a successful compilation now, albeit using the package-based CUDA 5.5 tools instead of the stand-alone CUDA 6.0, so here comes the testing phase...
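For anyone hitting the same `$'\r': command not found` errors: the generated scripts just need their DOS (CRLF) line endings stripped. A minimal sketch using a throwaway file (run the same sed over the copied tree's generated scripts - libtool, configure, etc. - or use dos2unix if it's installed):

```shell
# Reproduce the problem in miniature: a script with CRLF endings,
# then strip the trailing \r characters in place, as dos2unix would.
printf 'echo hello\r\necho world\r\n' > /tmp/crlf_demo.sh
sed -i 's/\r$//' /tmp/crlf_demo.sh
sh /tmp/crlf_demo.sh    # now runs without $'\r' errors
```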

No idea if I can get any of the other applications compiled...
ID: 1525545
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1525556 - Posted: 7 Jun 2014, 10:42:58 UTC - in response to Message 1525545.  

No idea if I can get any of the other applications compiled...


Stock CPU multibeam should have few roadblocks. Not sure what the Linux status of any AP is, stock or opt.
ID: 1525556
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1525567 - Posted: 7 Jun 2014, 12:42:47 UTC
Last modified: 7 Jun 2014, 12:43:24 UTC

That sounds unfortunate, though if it's getting old maybe it is just the motherboard parts dying.

If you're looking for typos, I recall step 37 should refer to util.h.

For step 31 I used: ~$ sh _autosetup

My testing on the basic WUs matched the x41g_cuda32 version to no less than 99.99% as well. Most of them proved equally fast or faster by a small single-digit percentage, but strangely, for PG0009_v7.wu, I get a 17% performance drop: 178 seconds vs 213. I get the same results running it twice. But I've started running my compilation on actual tasks, so I'll see how things go in terms of performance and validity.
ID: 1525567
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1525596 - Posted: 7 Jun 2014, 14:07:46 UTC - in response to Message 1525556.  

Stock CPU multibeam should have few roadblocks. Not sure what the Linux status of any AP is, stock or opt.

I got a complaint about splitdd for MB ATI, but MB CPU seemed to work. Current testing results show a 14-22% improvement over the r1848 SSE3 build from Lunatics. My guess is that's mostly from the Piledriver-specific AVX optimisations, but whatever the reason, this is quite a pleasant surprise. Oh, and results are matching 100% as well.

I still can't get anything to compile for AstroPulse, whether CPU or OpenCL with ATI or NV.
ID: 1525596
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1525972 - Posted: 9 Jun 2014, 2:53:51 UTC
Last modified: 9 Jun 2014, 2:57:22 UTC

I based the pre-processor definitions on what petri had. In all the other applications you can see the defines in the stderr part of the results, but the CUDA application does not do this. So without further information, I think looking through the source code is the only option here.

For the compiler options, you can read the GCC documentation. Probably the main ones to focus on are the -Ox and -march flags, as they select a pre-defined set of optimisations (I used -O3) and optimisations specific to a CPU architecture (I used -march=core2 and -march=bdver2, as I'm only running Linux on Core 2 and Piledriver CPUs). petri uses a bunch more optimisation flags beyond this, and I included those as well since they seem to help performance slightly, except for the redundant ones (already included in -Ox) and the ones flagged as 'unsafe' in terms of adhering to IEEE/ANSI standards. Those unsafe optimisations don't appear to be causing petri much grief, so I'd be interested in learning how much of a performance difference they make.
ID: 1525972
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1525974 - Posted: 9 Jun 2014, 3:20:02 UTC - in response to Message 1525966.  
Last modified: 9 Jun 2014, 3:30:33 UTC

... So now my question is, what about all those CFLAGS (that you and Petri use) when ./configuring Xbranch? Is there some place I can find a list of them specific to configuring this MB app? Do I look through the source code for answers to specifying CFLAGS before configuring? Where did you get them, Petri? I believe tuning through the use of those CFLAGS is key to compiling the fastest app.


On top of what Petri33 comes up with, my recommendations for CFLAGS are pretty general. That's because they apply to the host CPU component(s), and a lot of the CPU time is beyond our control for the time being, being a function of driver latencies (traditionally lower on Linux, but growing to match Windows and Mac); that's also why Tesla compute cards with no graphics output offer special low-latency drivers (TCC, Tesla Compute Cluster drivers). In short:
- Use SSE or better for both the BOINC libraries and Xbranch itself. Since it's x64 you should be getting SSE2 by default anyway, but it's worth checking.
- Make sure fast math is enabled.
- It's true I don't tend to use a lot of hard-coded #defines, because the code is generalised, as opposed to multiple-pathed.

For the CUDA compilation portions (IIRC NVFLAGS?), you'll want to check that O2 or O3 is enabled (the command line will be visible during compilation of the .cu files); that enables some safe fast-math options on the GPU side. You'd likely need to leave maxregisters at 32, even though some CUDA kernels will tolerate more under certain situations. That's because most of the device kernels are custom hand-coded for low latency and high bandwidth, as opposed to high occupancy, so register pressure isn't a large factor.
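As a rough illustration of the GPU-side advice above (the NVFLAGS variable name is an assumption; check the actual Xbranch Makefile for the real one), a setting along these lines caps register usage and enables nvcc's fast-math intrinsics:

```shell
# Sketch only: --maxrregcount=32 caps registers per thread, and
# -use_fast_math enables nvcc's fast-math device intrinsics.
# Verify the variable name against the Xbranch build files.
NVFLAGS="-O3 --maxrregcount=32 -use_fast_math"
echo "NVFLAGS = $NVFLAGS"
```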

After that (for me), performance optimisation becomes less about twiddling compiler options and far more about very high-level algorithmic choices (like Petri33's custom chirp he's been saving for me), and about starting down the road of hiding those awful latencies (using 'latency hiding' mechanisms, changing the way work is 'fed', and changing the way the host CPU interacts with partial results). That's where we are at the moment, with me designing new tools and techniques to find the best low-latency methods, and to allow generalising, optimising on-host, and plugging in new core functionality like Petri's for general distribution, eventually at install and/or run-time without application recompile [...similar to the way mobile phone apps do.]
ID: 1525974
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1526527 - Posted: 10 Jun 2014, 21:07:04 UTC - in response to Message 1526489.  
Last modified: 10 Jun 2014, 21:17:22 UTC

If there's something else any of you can think of that may be beneficial to try and report on, let me know.


Not much more from me wrt compiler options, and I'm glad the fast math options etc. are doing what they should. A bit of fine tuning never seems to hurt :). Yeah, the compatibility thing is a big stumbling block at the moment, and it's why I have to keep things relatively general for the time being, along with waiting for some things to settle before attempting a generic third-party Linux distribution.

You *might* find that O2 with more generic options could produce faster host side code (or not :) ), but that's debatable because we're talking CPU side for those CFLAGS. For the GPU we've got lots of latency to hide as evidenced by the low utilisation on high end cards.

Best bet if you want to coax a bit more speed out before we get further into x42, would be to grab Petri's chirp improvement and give that a try. He claims slightly lower accuracy, but some speed improvement there for specific GK110 GPUs. [Edit: note that I see no particular evidence on Petri's host of reduced chirp accuracy. The inconclusive to pending ratio is right about 4.6%, which is right where mine was on my 780 before the first one blew up...]

For my end of that, after working out some boinc/boincapi issues, I'll be back in the process of building a special cross-platform tool to unit-test individual code portions for performance and accuracy, to speed development along. So far it looks like this:

[screenshot]



A fair way to go yet, and a few hurdles to cross, but basically its intent is to be able to prove and generalise code before having to rebuild the main applications. That's come about because I see no reason why Petri's approach shouldn't generalise to earlier GPUs, and incorporate my own work on latency hiding and on using single floats for extra precision with little speed cost in some places.

Hopefully back onto that in the next week or so.
ID: 1526527
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1526581 - Posted: 10 Jun 2014, 22:13:09 UTC

I also had the warnings on -fsection-anchors, so I dropped that particular switch as well. It looks to me as though the fast/unsafe maths optimisations on their own don't do that much, as they're present in builds 2, 3 and 4 - other switches seem to have more of an impact on performance. Either way, that's a nice set of results.

Will you need testers once your tool is finalised, jason? I still have a GF114, GF110 and GK110 available.
ID: 1526581
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20283
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1526613 - Posted: 10 Jun 2014, 23:55:19 UTC - in response to Message 1526527.  

... You *might* find that O2 with more generic options could produce faster host side code (or not :) ), but that's debatable because we're talking CPU side for those CFLAGS...

Just to throw in a different vector:

On the CPU side of things, I've seen good success with the -Os (small code size) option to keep the code size small and hopefully small enough to fit within the CPU (fast) L1 cache... Or at least to leave more of the L2/L3 cache available for data...

You should see a better effect for that on the lower spec CPUs or for the "hyper threaded" Intels.


Note, as always: as GCC evolves, the useful combinations of CFLAGS evolve also...

Hope of useful interest,

Happy fast crunchin'!
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1526613
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1526705 - Posted: 11 Jun 2014, 4:22:36 UTC - in response to Message 1526613.  

... You *might* find that O2 with more generic options could produce faster host side code (or not :) ), but that's debatable because we're talking CPU side for those CFLAGS...

Just to throw in a different vector:

On the CPU side of things, I've seen good success with the -Os (small code size) option to keep the code size small and hopefully small enough to fit within the CPU (fast) L1 cache... Or at least to leave more of the L2/L3 cache available for data...

You should see a better effect for that on the lower spec CPUs or for the "hyper threaded" Intels.

Yep, another definite possibility. Something like a Pentium 4 has a ridiculously long pipeline, so it 'prefers' long, relatively branch-free code (highly unrolled, heavily hand-optimised). More modern chips will sometimes unroll tight loops themselves after decode, moving the bottleneck from branch prediction to instruction decode - so fewer instructions means fewer instructions to decode, and smaller code means fewer page faults.
ID: 1526705
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1526710 - Posted: 11 Jun 2014, 4:47:14 UTC - in response to Message 1526581.  

...Will you need testers once your tool is finalised, jason? I still have a GF114, GF110 and GK110 available.


Yep. Once it's at least semi-operational, basically I'll be fielding it as a bench/stress-test type tool intended for the following purposes:

user side:
- find key, previously fairly arcane, application settings &/or select from custom builds or versions based on user preferences of how to run ( e.g. max throughput, minimal user impact etc)
- identify stability issues.
- Submit and compare data to other hosts/hardware
- become some sort of de facto standard for comparing GPGPU compute (e.g. in reviews), offering something a bit more useful (to us) than the graphics-oriented, synthetic, or folding points-per-day type benches out there.

developer side (as mentioned):
- test code schemes/approaches on lots of different hardware quickly, submitting results to a database. There are approximately 15 key high-level optimisation points designed into x42, and most of them will want different or self-scaling code on different hardware. That's why x42 isn't out already.
- guide both generic and targeted optimisations, possibly distribute certain ones semi-automatically
- identify performance, accuracy & precision, and problem hardware sooner.
- pull ahead of the architecture & toolkit release cycles.
- try out some adaptive/genetic algorithms for better matching processing to the hardware.

So it's a big job, with CUDA versions coming out like I change my socks, but it looks like the right way to head to get x42 over the next big hurdles. I hope the time spent now leaves the tool functional enough to pre-empt Big Maxwell - we'll see.
ID: 1526710
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.