Linux CUDA 'Special' App finally available, featuring Low CPU use

Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1893672 - Posted: 6 Oct 2017, 19:42:29 UTC - in response to Message 1893328.  
Last modified: 6 Oct 2017, 19:48:19 UTC

This is a call out to anybody running Nvidia cards under Linux. It doesn't matter whether you are running the stock apps or the special app. Has anybody tried to overclock a Pascal card and move from Performance Level 2 to Performance Level 3 with nvidia-settings? I wonder why you can move Maxwell and earlier cards up to Performance Level 3 with any compatible Nvidia driver version, but supposedly not Pascal cards. Anybody tried it yet?


I've tried that with every new driver release with no success.

In addition to that, I tried raising the GPU and memory clocks while the executable is running and setting them back to low values before the executable ends -- with no success. That would have applied the P2-state overclock while computing and still stayed in a tolerable range when idle in P0. (Performance level 3 corresponds to the P0 state, and performance levels 2 and 1 to states P1-P8. In P-numbering, P0 is the fastest state, while performance levels run from 1 to 3 with 3 being the fastest. The driver sets the card to performance level 3 when it is idle and maximum performance is requested from nvidia-settings, and drops it to the P2 state when it detects a compute load.)

With the 980 and 780 that used to happen too: there were drivers that allowed P0 and some that did not.
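For concreteness, the clock juggling described above is done through nvidia-settings Coolbits attributes along these lines (a sketch, not a fix: the attribute names come from the Coolbits interface, the offset values are purely illustrative, and Coolbits must be enabled in xorg.conf):

```shell
# Requires e.g. Option "Coolbits" "8" in xorg.conf; GPU index 0 assumed.
# "[3]" targets performance level 3; offset values are illustrative only.
nvidia-settings -a '[gpu:0]/GPUPowerMizerMode=1'                 # prefer maximum performance
nvidia-settings -a '[gpu:0]/GPUGraphicsClockOffset[3]=100'       # +100 MHz core at perf level 3
nvidia-settings -a '[gpu:0]/GPUMemoryTransferRateOffset[3]=400'  # +400 MHz memory transfer rate
# On Pascal the Linux driver still drops to the P2 state under CUDA load,
# so offsets applied to performance level 3 never take effect while crunching.
```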
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Profile Keith Myers Special Project $250 donor
Volunteer tester

Joined: 29 Apr 01
Posts: 4611
Credit: 295,420,604
RAC: 614,781
United States
Message 1893673 - Posted: 6 Oct 2017, 19:51:28 UTC - in response to Message 1893672.  

Thanks for the reply, Petri. Yes, I understand the differences in hardware P-states and "Performance Levels" in Linux. I can only surmise that the inability to raise the default P2 state in the Linux environment is due to a lack of support in the Linux tools, namely nvidia-settings and/or the Nvidia Linux drivers themselves. The same hardware can be moved in Windows from the P2 state to the P0 state with common tools such as Nvidia Inspector, Precision X or MSI Afterburner.

Wonder where the defect is .... drivers or tools?
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1893815 - Posted: 7 Oct 2017, 14:07:16 UTC

Here is a new executable for most Maxwell (750, 750 Ti, 9x0) and Pascal (10x0) cards, namely sm_50, sm_52 and sm_61. It is 3-13% faster than zi3x. The FFT callbacks are now implemented in all places. The executable is statically linked and needs no external CUDA libraries. Test offline first.

source code : https://drive.google.com/open?id=0B9PYeBxtfMjaMzlDQ0t0cnpUZ2c
executable : https://drive.google.com/open?id=0B9PYeBxtfMjaTUxJT3ZURHd6M2M
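A quick offline check of the "no external CUDA libraries" claim on your own copy (the binary name below is a placeholder -- use whatever your download is called):

```shell
# Expect no cufft/cudart lines: both are linked in statically. Only system
# libraries appear; the driver's libcuda.so is loaded at run time instead.
ldd ./setiathome_cuda_zi3xs2 | grep -i -E 'cufft|cudart'
```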

Petri.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1893840 - Posted: 7 Oct 2017, 15:43:19 UTC - in response to Message 1893815.  
Last modified: 7 Oct 2017, 15:47:05 UTC

This seems to be a bit faster than zi3v, but it seems to have the same problem with the Arecibo Overflows: the CUDA App finds all Triplets where the SoG App finds more Pulses;
https://setiathome.berkeley.edu/workunit.php?wuid=2703056012
On one of the Test Overflow BLC3s the new App found 4 Bad signals against the CPU, whereas zi3v only found 1 Bad signal out of 30 against the CPU.

Which version of sah_v7_opt are you using? When I tried zi3xs1 with both CUDA 8 & 9 in Ubuntu 16.04, I received compile errors with both 3707 and 3679. I haven't tried any lower versions of sah_v7_opt or Ubuntu. I suppose it could have been the boinc-master too; this time I was using boinc-master 7.9 instead of 7.5.
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1893868 - Posted: 7 Oct 2017, 17:06:58 UTC - in response to Message 1893840.  

Hi TBar,

I've got an ancient sah_v7_opt on my computer. A couple of years old.

If you try the s2, you get rid of the compile error that mentions hires_timer or something similar.

Modify the Makefile to have LDFLAGS = -arch=sm_61, or whatever your card supports. It shaves 4 seconds off the run time, because the executable then has the right CUDA code linked in place. The s2 executable I shared contains sm_50, sm_52 and sm_61, so it is 4 seconds slower than it needs to be.

Makefile has a lot of changes to support static linking.
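The edit Petri describes amounts to something like this (a sketch; the command-line override assumes the hand-edited Makefile uses LDFLAGS at the link step):

```shell
# In Xbranch/client/Makefile, set the link architecture to match your card:
#   LDFLAGS = -arch=sm_61    # Pascal 10x0; sm_50/sm_52 for Maxwell
#   LDFLAGS =                # empty -> generic multi-arch build, ~4 s slower
# Or override once from the command line instead of editing the file:
make -C Xbranch/client LDFLAGS='-arch=sm_61'
```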

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1893880 - Posted: 7 Oct 2017, 17:32:56 UTC - in response to Message 1893868.  

It looks as though it was the boinc-master with zi3xs1. I trashed the full build of 7.9 and went back to the small 7.5 boinc-master and it works. This is the first time I've tried building the CUDA App in Ubuntu 16.04, things are a little different. Now to try zi3xs2. On my machine there is a nice speed up with the Arecibo tasks using your xs2 with the version 9 cards.

Trying to upload over 155 MB to Crunchers Anonymous may be a challenge though.
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1893922 - Posted: 7 Oct 2017, 20:30:57 UTC - in response to Message 1893880.  
Last modified: 7 Oct 2017, 20:32:28 UTC

It looks as though it was the boinc-master with zi3xs1. I trashed the full build of 7.9 and went back to the small 7.5 boinc-master and it works. This is the first time I've tried building the CUDA App in Ubuntu 16.04, things are a little different. Now to try zi3xs2. On my machine there is a nice speed up with the Arecibo tasks using your xs2 with the version 9 cards.

Trying to upload over 155 MB to Crunchers Anonymous may be a challenge though.


:)

You may post a link to my Google Drive. I have released an executable, so under the GPL I have to make the source code available for three years.

But the four-second slowdown forces us to deliver at least 3 different executables (sm_50, sm_52, sm_61). I can publish them. It is the same source.

I'd like to have a user test the version on a Tesla P100 GPU. I'd be happy to compile one for anyone who has a Tesla P100 with 56 or 60 compute units. That would mean -unroll 56 or -unroll 60!!!
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 9976
Credit: 130,696,680
RAC: 82,807
Australia
Message 1893931 - Posted: 7 Oct 2017, 21:08:53 UTC - in response to Message 1893922.  
Last modified: 7 Oct 2017, 21:10:20 UTC

I'd like to have a user test the version on a Tesla P100 GPU. I'd be happy to compile one for anyone who has a Tesla P100 with 56 or 60 compute units. That would mean -unroll 56 or -unroll 60!!!

Would there continue to be any benefit with such a large number of Compute Units?
Do you get diminishing returns with a higher number of CUs, and would running more than 1 WU at a time be necessary to keep taking advantage of the extra CUs available?

From the looks of your current results, one particular WU took 2 min on your GTX 1080, which has 20 CUs.
With linear scaling, that would mean a 36 sec run time for a card with 60 CUs!
Grant
Darwin NT
Profile Keith Myers Special Project $250 donor
Volunteer tester

Joined: 29 Apr 01
Posts: 4611
Credit: 295,420,604
RAC: 614,781
United States
Message 1893954 - Posted: 7 Oct 2017, 22:29:17 UTC - in response to Message 1893815.  

Here is a new executable for most Maxwell (750, 750 Ti, 9x0) and Pascal (10x0) cards, namely sm_50, sm_52 and sm_61. It is 3-13% faster than zi3x. The FFT callbacks are now implemented in all places. The executable is statically linked and needs no external CUDA libraries. Test offline first.

source code : https://drive.google.com/open?id=0B9PYeBxtfMjaMzlDQ0t0cnpUZ2c
executable : https://drive.google.com/open?id=0B9PYeBxtfMjaTUxJT3ZURHd6M2M

Petri.

I would like to understand how CUDA is enabled in the default Windows drivers. I assume that the current Windows drivers have the CUDA 8.0 libraries statically linked into the drivers or some such mechanism. I don't believe I have ever seen something like a CUDA library directory or anything on a Windows machine. Can someone explain how CUDA is supported in a Windows driver?
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 9976
Credit: 130,696,680
RAC: 82,807
Australia
Message 1893974 - Posted: 8 Oct 2017, 0:16:49 UTC - in response to Message 1893954.  

Can someone explain how CUDA is supported in a Windows driver?

I'd say it's just part of the video driver. PhysX & digital audio support & the like are included by default, but you can exclude them with the Advanced Installation option, as they are not part of the video driver.

Why they chose to make it separate for Linux is the question that should be asked, IMHO.
Grant
Darwin NT
Profile Keith Myers Special Project $250 donor
Volunteer tester

Joined: 29 Apr 01
Posts: 4611
Credit: 295,420,604
RAC: 614,781
United States
Message 1893978 - Posted: 8 Oct 2017, 0:36:04 UTC - in response to Message 1893974.  

Can someone explain how CUDA is supported in a Windows driver?

I'd say it's just part of the video driver. PhysX & digital audio support & the like are included by default, but you can exclude them with the Advanced Installation option, as they are not part of the video driver.

Why they chose to make it separate for Linux is the question that should be asked, IMHO.

Yes, that is what had my interest and why I asked. I didn't know just how it all worked when I saw Petri's post that the CUDA 9.0 libraries were statically linked into the new beta special app he is developing.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1893990 - Posted: 8 Oct 2017, 1:17:27 UTC - in response to Message 1893978.  
Last modified: 8 Oct 2017, 1:21:33 UTC

There is a difference between drivers and libraries. All three platforms have the CUDA libraries separate from the driver. In Windows they are .dlls with names such as cufft32_42_9.dll & cudart32_42_9.dll; in OSX they are libcudart.8.0.dylib & libcufft.8.0.dylib. The problem is that the libraries tend to get larger with each CUDA version, and the current ones are quite large.

I haven't had any luck compiling the zi3xs2 App the normal way. The App is the normal size, but every copy immediately Overflows with 30 spikes, whether compiled in 14.04 or 16.04. I'm also seeing an error I've never seen before when trying the static build;
seti_cuda-seti.o: In function `seti_init_state()':
/home/tbar/sah_v7_opt/Xbranch/client/seti.cpp:352: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `initialize_for_wu()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:100: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `read_wu_state()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:119: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
collect2: error: ld returned 1 exit status
Makefile:1497: recipe for target 'seti_cuda' failed
make[2]: *** [seti_cuda] Error 1

I usually don't have any trouble compiling the Apps.
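A classic cause of exactly this symptom -- note the std::__cxx11 in the unresolved names -- is GCC 5's dual libstdc++ ABI: if libboinc.a and the client objects were built with different settings of _GLIBCXX_USE_CXX11_ABI (or by sufficiently different compilers), every overload taking std::string fails to resolve at link time. A self-contained demonstration with a made-up function name (assumes g++ >= 5):

```shell
# The same C++ signature mangles differently under the two ABIs,
# so objects built one way will not link against a library built the other.
cat > /tmp/abi_demo.cpp <<'EOF'
#include <string>
int resolve_demo(const char*, std::string&) { return 0; }
EOF
g++ -c /tmp/abi_demo.cpp -o /tmp/abi_new.o                             # new ABI (default since GCC 5)
g++ -D_GLIBCXX_USE_CXX11_ABI=0 -c /tmp/abi_demo.cpp -o /tmp/abi_old.o  # old ABI
nm -C /tmp/abi_new.o | grep resolve_demo   # parameter demangles with std::__cxx11::
nm -C /tmp/abi_old.o | grep resolve_demo   # no __cxx11 tag in the old-ABI symbol
```

Running nm -C over the libboinc.a actually being linked shows which flavour it contains.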
Profile Keith Myers Special Project $250 donor
Volunteer tester

Joined: 29 Apr 01
Posts: 4611
Credit: 295,420,604
RAC: 614,781
United States
Message 1893994 - Posted: 8 Oct 2017, 1:25:05 UTC - in response to Message 1893990.  

There is a difference between drivers and libraries. All three platforms have the CUDA libraries separate from the driver. In Windows they are .dlls with names such as cufft32_42_9.dll & cudart32_42_9.dll; in OSX they are libcudart.8.0.dylib & libcufft.8.0.dylib. The problem is that the libraries tend to get larger with each CUDA version, and the current ones are quite large.

I haven't had any luck compiling the zi3xs2 App the normal way. The App is the normal size, but every copy immediately Overflows with 30 spikes, whether compiled in 14.04 or 16.04. I'm also seeing an error I've never seen before when trying the static build;
seti_cuda-seti.o: In function `seti_init_state()':
/home/tbar/sah_v7_opt/Xbranch/client/seti.cpp:352: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `initialize_for_wu()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:100: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `read_wu_state()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:119: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
collect2: error: ld returned 1 exit status
Makefile:1497: recipe for target 'seti_cuda' failed
make[2]: *** [seti_cuda] Error 1

I usually don't have any trouble compiling the Apps.

Of course, duh, I don't know what I was thinking. I totally forgot about the .dlls in Windows.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1894049 - Posted: 8 Oct 2017, 7:31:52 UTC - in response to Message 1893931.  
Last modified: 8 Oct 2017, 8:02:14 UTC

I'd like to have a user test the version on a Tesla P100 GPU. I'd be happy to compile one for anyone who has a Tesla P100 with 56 or 60 compute units. That would mean -unroll 56 or -unroll 60!!!

Would there continue to be any benefit with such a large number of Compute Units?
Do you get diminishing returns with a higher number of CUs, and would running more than 1 WU at a time be necessary to keep taking advantage of the extra CUs available?

From the looks of your current results, one particular WU took 2 min on your GTX 1080, which has 20 CUs.
With linear scaling, that would mean a 36 sec run time for a card with 60 CUs!


The scaling would be almost linear. Pulse finding uses a parallel scan, and when something is found it reverts to serial code. That happens at most 30 times per WU, plus a few more times for the best pulse found. The other signal types are fully parallel.

You can look at the scaling from my results, since I have three GTX 1080s and one Ti (20 vs 28 CUs).


EDIT:
the GTX 1080 has the following:
"The GeForce GTX 1080 and its GP104 GPU consist of four GPCs, twenty Pascal Streaming Multiprocessors, and eight memory controllers. In the GeForce GTX 1080, each GPC ships with a dedicated raster engine and five SMs. Each SM contains 128 CUDA cores, 256 KB of register file capacity, a 96 KB shared memory unit, 48 KB of total L1 cache storage, and eight texture units." That makes 2560 cores (3584 for the Ti),

and the P100 has 56 SMs with 64 FP32 CUDA cores per SM (3584 cores).

The memory system is totally different and that might yield some interesting results.

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1894050 - Posted: 8 Oct 2017, 7:38:53 UTC - in response to Message 1893990.  

There is a difference between drivers and libraries. All three platforms have the CUDA libraries separate from the driver. In Windows they are .dlls with names such as cufft32_42_9.dll & cudart32_42_9.dll; in OSX they are libcudart.8.0.dylib & libcufft.8.0.dylib. The problem is that the libraries tend to get larger with each CUDA version, and the current ones are quite large.

I haven't had any luck compiling the zi3xs2 App the normal way. The App is the normal size, but every copy immediately Overflows with 30 spikes, whether compiled in 14.04 or 16.04. I'm also seeing an error I've never seen before when trying the static build;
seti_cuda-seti.o: In function `seti_init_state()':
/home/tbar/sah_v7_opt/Xbranch/client/seti.cpp:352: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `initialize_for_wu()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:100: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `read_wu_state()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:119: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
collect2: error: ld returned 1 exit status
Makefile:1497: recipe for target 'seti_cuda' failed
make[2]: *** [seti_cuda] Error 1

I usually don't have any trouble compiling the Apps.


grep -r boinc_resolve_file ~petri/boinc/*

/home/petri/boinc/lib/app_ipc.cpp:int boinc_resolve_filename(
/home/petri/boinc/lib/app_ipc.cpp:int boinc_resolve_filename_s(const char *virtual_name, string& physical_name) {
Binary file /home/petri/boinc/lib/libboinc.a matches
/home/petri/boinc/lib/app_ipc.h:extern int boinc_resolve_filename_s(const char*, std::string&);
/home/petri/boinc/lib/app_ipc.h:extern int boinc_resolve_filename(const char*, char*, int len);


my boinc is 7.5.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1894095 - Posted: 8 Oct 2017, 14:00:33 UTC - in response to Message 1894050.  

Seems strange that I'm suddenly getting boinc errors when it appears the boinc files haven't changed. What commands are you using for the static build? The only commands I could find end up with the line:
NVCCFLAGS = -O3 --use_fast_math --cudart static --ptxas-options="-v" --compiler-options "$(AM_CXXFLAGS) $(CXXFLAGS) -lcufft_static -fno-strict-aliasing" -m64 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61

It would be nice if the App could be compiled on a version of Ubuntu lower than 14.04. Right now my systems installed from the 14.04.1 image with kernel 3.13 won't work, as their libstdc++ is too old. I can run the current zi3v in Ubuntu 12.04.5 without any trouble using the driver from the CUDA 8 Toolkit, but not zi3xs2.
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1894108 - Posted: 8 Oct 2017, 16:50:48 UTC - in response to Message 1894095.  

Seems strange that I'm suddenly getting boinc errors when it appears the boinc files haven't changed. What commands are you using for the static build? The only commands I could find end up with the line:
NVCCFLAGS = -O3 --use_fast_math --cudart static --ptxas-options="-v" --compiler-options "$(AM_CXXFLAGS) $(CXXFLAGS) -lcufft_static -fno-strict-aliasing" -m64 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61

It would be nice if the App could be compiled with a lower version of Ubuntu 14.04. Right now my systems installed with the 14.04.1 image and kernel 3.13 won't work as the version of libstdc++ is too old. I can run the current version of zi3v in Ubuntu 12.04.5 without any trouble using the driver from the CUDA 8 Toolkit, not so with zi3xs2.


The Makefile is in the source zip.
I modify it by hand, not via ./configure or autoconf.
I have changed a lot of things to enable static CUDA.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1894119 - Posted: 8 Oct 2017, 17:48:43 UTC - in response to Message 1894108.  

OK, I think I'm beginning to understand. After running autosetup and configure, I need to replace the Xbranch/client Makefile with the one from the source folder, then go through it and replace every 'petri' with 'tbar', and fix all the places where your boinc-master folder is just named boinc whereas mine is boinc-master...

That seems to be working in 14.04.1 now. If you add a couple more code=sm flags the thing comes to 179 MB. Kinda large for a CUDA App.
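The hand search-and-replace above can be done with sed; a toy demonstration of the path rewrite on a one-line sample (on the real tree you would run the same expression over Xbranch/client/Makefile with -i.bak):

```shell
# Swap petri's BOINC checkout path for a boinc-master layout.
printf 'BOINC_DIR = /home/petri/boinc\n' |
  sed -e 's|/home/petri/boinc|/home/tbar/boinc-master|g'
# prints: BOINC_DIR = /home/tbar/boinc-master
```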
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1894127 - Posted: 8 Oct 2017, 18:10:35 UTC - in response to Message 1894119.  

OK, I think I'm beginning to understand. After running autosetup and configure, I need to replace the Xbranch/client Makefile with the one from the source folder, then go through it and replace every 'petri' with 'tbar', and fix all the places where your boinc-master folder is just named boinc whereas mine is boinc-master...

That seems to be working in 14.04.1 now. If you add a couple more code=sm flags the thing comes to 179 MB. Kinda large for a CUDA App.


Yes, it is large, but the small app used to load libraries during startup, so it is not such a big deal. I recommend making four apps: one each for sm_50, sm_52 and sm_61, plus a generic one. The generic one has a 4-second penalty at startup. Users can choose what they like. Set LDFLAGS = -arch=sm_61 to build one for sm_61, and leave LDFLAGS empty to build a generic one.

The FFT callbacks need the static link, but they give a nice speed-up.

Petri.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1894268 - Posted: 9 Oct 2017, 14:00:05 UTC - in response to Message 1894127.  

Having a separate App for the 7xx, 9xx, and 10 series would be a problem for people with mixed GPUs in one machine. It would probably be best to just leave the main three in one App. I doubt anything lower than the 750, cc 5.0, will work very well since they already have problems with CUDA 8. I suppose you could make one for just cc 3.5 and put the -gencode arch=compute_35,code=compute_35 at the end. Wouldn't that be the same as a Generic App?
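For reference, the distinction behind that question is in the -gencode pairs: code=sm_XX embeds a binary (cubin) usable only on that one architecture, while code=compute_XX embeds PTX that the driver can JIT-compile for any newer GPU -- which is what makes a build "generic". A hypothetical invocation (file names made up; flags follow TBar's NVCCFLAGS above):

```shell
# Cubins for the three common architectures, plus cc 3.5 PTX as a JIT
# fallback for anything newer; every extra -gencode pair grows the fat
# binary, hence the 179 MB executables mentioned earlier.
nvcc -O3 --cudart static \
     -gencode arch=compute_50,code=sm_50 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_35,code=compute_35 \
     -o seti_cuda seti_cuda.cu
```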
©2018 University of California