Message boards :
Number crunching :
Linux CUDA 'Special' App finally available, featuring Low CPU use
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks for the reply, Petri. Yes, I understand the differences between hardware P-states and "Performance Levels" in Linux. I can only surmise that the inability to raise the default P2 state under Linux is due to a lack of support in the Linux tools, namely nvidia-settings, and/or in the Nvidia Linux drivers themselves. The same hardware can be moved from the P2 state to the P0 state under Windows with common tools such as Nvidia Inspector, Precision X or MSI Afterburner. I wonder where the defect lies... drivers or tools? Seti@Home classic workunits: 20,676 CPU time: 74,226 hours A proud member of the OFA (Old Farts Association) |
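For anyone wanting to check where their card sits, the current performance state can at least be inspected from the Linux command line; a minimal sketch, assuming the NVIDIA driver's nvidia-smi utility is installed (the clock values in the commented line are card-specific examples, not recommendations):

```shell
# Inspect the current P-state and clocks under Linux (requires the NVIDIA driver).
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,pstate,clocks.sm,clocks.mem --format=csv
  # Pin application clocks toward P0 on cards that allow it (example values):
  # sudo nvidia-smi -ac 5005,1911
else
  echo "nvidia-smi not found"
fi
```

Whether `nvidia-smi -ac` is honored at all depends on the card; on many consumer GPUs it is locked out, which is consistent with the P2 behavior described above.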
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Here is a new executable for most Maxwell (750, 750 Ti, 9x0) and Pascal (10x0) cards, namely sm_50, sm_52 and sm_61. It is 3-13% faster than zi3x. The FFT callbacks are now implemented everywhere. The executable is statically linked and needs no external CUDA libraries. Test offline first. source code: https://drive.google.com/open?id=0B9PYeBxtfMjaMzlDQ0t0cnpUZ2c executable: https://drive.google.com/open?id=0B9PYeBxtfMjaTUxJT3ZURHd6M2M Petri. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
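One quick way to verify the "no external CUDA libraries" claim before testing offline is to check the binary's dynamic dependencies; a minimal sketch (the `/bin/ls` fallback is only so the script runs anywhere; pass the downloaded executable's path as the first argument):

```shell
# Check a binary for dynamic CUDA library dependencies.
# Pass the path of the executable to check; /bin/ls is just a harmless default.
BIN=${1:-/bin/ls}
if ldd "$BIN" 2>/dev/null | grep -qi 'libcudart\|libcufft'; then
  echo "dynamic CUDA libraries required"
else
  echo "no dynamic CUDA dependencies"
fi
```

A statically linked build should report no dynamic CUDA dependencies even on a machine with no CUDA toolkit installed.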
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
This seems to be a bit faster than zi3v, but it seems to have the same problem with the Arecibo overflows: the CUDA App finds all Triplets where the SoG App finds more Pulses; https://setiathome.berkeley.edu/workunit.php?wuid=2703056012 On one of the test overflow BLC3s the new App found 4 bad signals against the CPU, whereas zi3v only finds 1 bad signal out of 30 against the CPU. Which version of sah_v7_opt are you using? When I tried zi3xs1 with both CUDA 8 & 9 in Ubuntu 16.04 I received compile errors with both 3707 and 3679. I haven't tried any lower versions of sah_v7_opt or Ubuntu. I suppose it could have been the boinc-master too; this time I was using boinc-master 7.9 instead of 7.5. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Hi TBar, I've got an ancient sah_v7_opt on my computer, a couple of years old. If you try the s2 you get rid of the compile error that mentions hires_timer or something similar. Modify the Makefile to have LDFLAGS = -arch=sm_61, or whatever your card supports. It shaves 4 seconds off the run time because the executable then has the right CUDA code linked in place. The s2 executable I shared has sm_50, sm_52 and sm_61 and is 4 seconds slower than it should be. The Makefile has a lot of changes to support static linking. Petri |
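The arch choice described above boils down to one hand-edited line; a sketch of the relevant Makefile settings (the values other than sm_61 are inferred from the cards listed earlier in the thread):

```shell
# In Xbranch/client/Makefile (edited by hand, not via configure):
#   LDFLAGS = -arch=sm_61   # Pascal 10x0 cards
#   LDFLAGS = -arch=sm_52   # Maxwell 9x0 cards
#   LDFLAGS = -arch=sm_50   # GTX 750 / 750 Ti
#   LDFLAGS =               # empty: generic build, ~4 s startup penalty
```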
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
It looks as though it was the boinc-master with zi3xs1. I trashed the full build of 7.9 and went back to the small 7.5 boinc-master and it works. This is the first time I've tried building the CUDA App in Ubuntu 16.04; things are a little different. Now to try zi3xs2. On my machine there is a nice speed up with the Arecibo tasks using your xs2 with the version 9 cards. Trying to upload over 155 MB to Crunchers Anonymous may be a challenge though. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
It looks as though it was the boinc-master with zi3xs1. I trashed the full build of 7.9 and went back to the small 7.5 boinc-master and it works. This is the first time I've tried building the CUDA App in Ubuntu 16.04, things are a little different. Now to try zi3xs2. On my machine there is a nice speed up with the Arecibo tasks using your xs2 with the version 9 cards. :) You may post a link to my Google Drive. Since I have released an executable, the GPL obliges me to make the source code available for three years. But the four-second slowdown forces us to deliver at least 3 different executables (sm_50, sm_52, sm_61). I can publish them; it is the same source. I'd like a user to test the version with a Tesla P100 GPU. I'd be happy to compile one for anyone having a Tesla P100 with 56 or 60 compute units. That would mean an -unroll 56 or -unroll 60! |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
I'd like to have a user to test the version with a Tesla P100 GPU. I'd be happy to compile one for anyone having a Tesla P100 with 56 or 60 compute units. That would mean an -unroll 56 or -unroll 60 !!! Would there continue to be any benefit with such a large number of Compute Units? Do you get diminishing returns with the higher number of CUs, so that running more than 1 WU at a time would be necessary to take advantage of the extra CUs? From the looks of your current results, one particular WU took 2 min on your GTX 1080, which has 20 CUs. With linear scaling, that would mean a 40 sec run time for a card with 60 CUs! Grant Darwin NT |
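The linear-scaling estimate above is simple proportionality; sketched numerically (120 s on 20 CUs, scaled to 60 CUs):

```shell
# Linear scaling estimate: run time scales inversely with compute units.
awk 'BEGIN { t = 120; cu_now = 20; cu_new = 60; printf "%.0f seconds\n", t * cu_now / cu_new }'
```

In practice per-WU scaling is rarely perfectly linear, which is exactly the diminishing-returns question being asked.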
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Here is a new executable for most Maxwell (750, 750Ti, 9x0) and Pascal(10x0) cards namely sm 50, 52 and sm61. It is 3-13% faster than zi3x. The fft callbacks are now implemented at all places. The executable is statically linked and needs no external CUDA libraries. Test offline first. I would like to understand how CUDA is enabled in the default Windows drivers. I assume the current Windows drivers have the CUDA 8.0 libraries statically linked in, or some such mechanism. I don't believe I have ever seen anything like a CUDA library directory on a Windows machine. Can someone explain how CUDA is supported in a Windows driver? |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
Can someone explain how CUDA is supported in a Windows driver? I'd say it's just part of the video driver. PhysX, digital audio support and the like are included by default, but you can exclude them if you go for the Advanced Installation option, as they are not part of the video driver. Why they chose to make it separate for Linux is the question that should be asked, IMHO. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Can someone explain how CUDA is supported in a Windows driver? Yes, that is what had my interest and why I asked. I didn't know how it all worked when I saw Petri's post that the CUDA 9.0 libraries were statically linked into the new beta special app he is developing. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
There is a difference between drivers and libraries. All three platforms have the CUDA libraries separate from the driver. In Windows they are called .dll and have names such as cufft32_42_9.dll & cudart32_42_9.dll. In OSX they are called libcudart.8.0.dylib & libcufft.8.0.dylib. The problem is the libraries tend to get larger with each CUDA version; the current libraries are quite large. I haven't had any luck compiling the zi3xs2 App the normal way. The App is the normal size but every copy immediately overflows with 30 spikes, whether compiled in 14.04 or 16.04. I'm also seeing a new error, never seen before, when trying the static build;
seti_cuda-seti.o: In function `seti_init_state()':
/home/tbar/sah_v7_opt/Xbranch/client/seti.cpp:352: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `initialize_for_wu()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:100: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `read_wu_state()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:119: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
collect2: error: ld returned 1 exit status
Makefile:1497: recipe for target 'seti_cuda' failed
make[2]: *** [seti_cuda] Error 1
I usually don't have any trouble compiling the Apps. |
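The std::__cxx11 in those mangled names suggests (my reading, not confirmed in the thread) the GCC 5 dual-ABI mismatch: the app objects were built against the new std::string ABI while libboinc.a was built against the old one. The mangling difference is easy to reproduce with a toy translation unit; the function name below is a made-up stand-in, not the real BOINC symbol:

```shell
# Reproduce the GCC 5 dual-ABI mangling difference with a toy function
# that takes std::string&, like boinc_resolve_filename_s does.
cat > abi_demo.cpp <<'EOF'
#include <string>
int boinc_resolve_filename_s_demo(const char*, std::string&) { return 0; }
EOF
g++ -c -D_GLIBCXX_USE_CXX11_ABI=1 abi_demo.cpp -o new_abi.o
g++ -c -D_GLIBCXX_USE_CXX11_ABI=0 abi_demo.cpp -o old_abi.o
new_count=$(nm new_abi.o | grep -c __cxx11)          # new ABI: std::__cxx11::basic_string
old_count=$(nm old_abi.o | grep -c __cxx11 || true)  # old ABI: plain std::string
if [ "$new_count" -gt 0 ] && [ "$old_count" -eq 0 ]; then
  echo "ABI mangling differs as expected"
fi
```

If `nm -C libboinc.a | grep boinc_resolve_filename_s` shows the old mangling, rebuilding BOINC with the same compiler, or adding -D_GLIBCXX_USE_CXX11_ABI=0 to the app's CXXFLAGS, are the usual fixes.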
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
There is a difference between Drivers and Libraries. All three platforms have the CUDA Libraries separate from the Driver. In Windows they are called .dll and have names such as cufft32_42_9.dll & cudart32_42_9.dll. In OSX they are called libcudart.8.0.dylib & libcufft.8.0.dylib. The problem is the Libraries tend to get larger with each CUDA version, the current Libraries are quite large. Of course, duh, I don't know what I was thinking. I totally forgot about the .dlls in Windows. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
I'd like to have a user to test the version with a Tesla P100 GPU. I'd be happy to compile one for anyone having a Tesla P100 with 56 or 60 compute units. That would mean an -unroll 56 or -unroll 60 !!! You can look at the scaling from my results, since I have three GTX 1080s and one 1080 Ti (20 vs 28 CUs). EDIT: the 1080 has the following: "The GeForce GTX 1080 and its GP104 GPU consist of four GPCs, twenty Pascal Streaming Multiprocessors, and eight memory controllers. In the GeForce GTX 1080, each GPC ships with a dedicated raster engine and five SMs. Each SM contains 128 CUDA cores, 256 KB of register file capacity, a 96 KB shared memory unit, 48 KB of total L1 cache storage, and eight texture units." (2560 cores, and 3584 for the Ti). The P100 has 56 SMs and 64 FP32 CUDA cores per SM (3584 cores). The memory system is totally different, and that might yield some interesting results. Petri |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
There is a difference between Drivers and Libraries. All three platforms have the CUDA Libraries separate from the Driver. In Windows they are called .dll and have names such as cufft32_42_9.dll & cudart32_42_9.dll. In OSX they are called libcudart.8.0.dylib & libcufft.8.0.dylib. The problem is the Libraries tend to get larger with each CUDA version, the current Libraries are quite large.
grep -r boinc_resolve_file ~petri/boinc/*
/home/petri/boinc/lib/app_ipc.cpp:int boinc_resolve_filename(
/home/petri/boinc/lib/app_ipc.cpp:int boinc_resolve_filename_s(const char *virtual_name, string& physical_name) {
Binary file /home/petri/boinc/lib/libboinc.a matches
/home/petri/boinc/lib/app_ipc.h:extern int boinc_resolve_filename_s(const char*, std::string&);
/home/petri/boinc/lib/app_ipc.h:extern int boinc_resolve_filename(const char*, char*, int len);
My boinc is 7.5. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Seems strange that I'm suddenly getting boinc errors when it appears the boinc files haven't changed. What commands are you using for the static build? The only commands I could find end up with the line;
NVCCFLAGS = -O3 --use_fast_math --cudart static --ptxas-options="-v" --compiler-options "$(AM_CXXFLAGS) $(CXXFLAGS) -lcufft_static -fno-strict-aliasing" -m64 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61
It would be nice if the App could be compiled to run on the lower Ubuntu 14.04 releases. Right now my systems installed from the 14.04.1 image with kernel 3.13 won't work, as their version of libstdc++ is too old. I can run the current zi3v in Ubuntu 12.04.5 without any trouble using the driver from the CUDA 8 Toolkit; not so with zi3xs2. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Seems strange I'm suddenly getting boinc errors when it appears the boinc files haven't changed. What commands are you using for the static build? The only commands I could find end up with the line; The Makefile is in the source zip. I modify it by hand, not via ./configure or autoconf. I have changed a lot of things to enable static CUDA. |
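For context on what "enable static CUDA" involves: cuFFT callbacks require the static cuFFT library, relocatable device code, and a device-link step. A rough sketch of the shape of such a build (commands commented out because they need the CUDA toolkit; file names are hypothetical, and the real flags are in Petri's Makefile):

```shell
# Sketch only -- not Petri's actual Makefile. cuFFT callbacks need:
#  1) relocatable device code:
#     nvcc -O3 -dc -gencode arch=compute_61,code=sm_61 cuda_kernels.cu -o cuda_kernels.o
#  2) a device-link step against the static cuFFT:
#     nvcc -dlink cuda_kernels.o -lcufft_static -o device_link.o
#  3) a final host link with the static runtime and its support libs:
#     g++ main.o cuda_kernels.o device_link.o \
#         -lcufft_static -lculibos -lcudart_static -ldl -lrt -lpthread -o seti_cuda
```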
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
OK, I think I'm beginning to understand. After running autosetup and configure, I need to replace the Xbranch/client Makefile with the one from the source folder, then go through it and replace every 'petri' with 'tbar', and all the places where your boinc-master folder is just named boinc whereas mine is boinc-master... That seems to be working in 14.04.1 now. If you add a couple more code=sm flags, the thing comes to 179 MB. Kinda large for a CUDA App. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
OK, I think I'm beginning to understand. After running autosetup and configure, I need to replace the Xbranch/client Makefile with the one from the source folder, then go through it and replace all the petri with tbar and all the places where your boinc-master folder is just named boinc whereas mine is boinc-master... Yes, it is large, but the small app used to load the libraries during startup, so it is not such a big deal. I recommend making 4 apps: one each for sm_50, sm_52 and sm_61, plus a generic one. The generic one has a 4 second penalty at startup. Users can choose what they like. Set LDFLAGS = -arch=sm_61 to make one for sm_61, and leave LDFLAGS empty to make a generic one. The FFT callbacks need the static link, but it gives a nice speed-up. Petri. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Having a separate App for the 7xx, 9xx, and 10 series would be a problem for people with mixed GPUs in one machine. It would probably be best to just leave the main three in one App. I doubt anything lower than the 750, cc 5.0, will work very well since they already have problems with CUDA 8. I suppose you could make one for just cc 3.5 and put the -gencode arch=compute_35,code=compute_35 at the end. Wouldn't that be the same as a Generic App? |
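On the compute_35 question: with -gencode, the code= half decides what gets embedded in the fat binary, which is what makes a PTX-only build "generic" (this is standard nvcc behavior, not something specific to this app):

```shell
# -gencode arch=compute_35,code=sm_35      embeds SASS (machine code) for cc 3.5:
#                                          runs only on matching cards, no JIT delay
# -gencode arch=compute_35,code=compute_35 embeds PTX only: the driver JIT-compiles
#                                          it for whatever card is present -- the
#                                          likely source of the ~4 s startup penalty
```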
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Having a separate App for the 7xx, 9xx, and 10 series would be a problem for people with mixed GPUs in one machine. It would probably be best to just leave the main three in one App. I doubt anything lower than the 750, cc 5.0, will work very well since they already have problems with CUDA 8. I suppose you could make one for just cc 3.5 and put the -gencode arch=compute_35,code=compute_35 at the end. Wouldn't that be the same as a Generic App? The shorties run 27 seconds on my 1080 Ti; a four second startup delay would be unacceptable for many users. The 780s and other sm_35 cards do not work with CUDA 8.0? I'll have to check the source code for __launch_bounds__ in the Gaussian find and other places where I forced the generated code to use at most 32 registers to allow running 2048 threads. There is still a lot to do. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.