Linux CUDA 'Special' App finally available, featuring Low CPU use

Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1893672 - Posted: 6 Oct 2017, 19:42:29 UTC - in response to Message 1893328.  
Last modified: 6 Oct 2017, 19:48:19 UTC

This is a call out to anybody running Nvidia cards under Linux. It doesn't matter whether you are running the stock apps or the special app. Has anybody tried to overclock a Pascal card and move from Performance Level 2 to Performance Level 3 with nvidia-settings? I wonder why you can move Maxwell and earlier cards up to Performance Level 3 with any compatible Nvidia driver version, but supposedly not Pascal cards. Anybody tried it yet?


I've tried that with every new driver release with no success.

In addition to that, I tried raising the GPU and memory clocks while the executable is running and setting them back to low values before the executable ends -- with no success. That would have applied the P2-state overclock while computing and still stayed in a tolerable range when idle in P0. (Performance level 3 corresponds to the P0 state, and performance levels 2 and 1 to states P1-P8. In P-numbering, P0 is the fastest state, while performance levels run from 1 to 3 with 3 being the fastest. The driver sets the card to performance level 3 when it is idle and maximum performance is requested from nvidia-settings, and drops it to the P2 state when it detects a compute load.)

With the 980 and 780 that used to happen too: there were drivers that allowed P0 and some that did not.
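For concreteness, the clock juggling described above is done through nvidia-settings Coolbits attributes along these lines (a sketch, not a fix: the attribute names come from the Coolbits interface, the offset values are purely illustrative, and Coolbits must be enabled in xorg.conf):

```shell
# Requires e.g. Option "Coolbits" "8" in xorg.conf; GPU index 0 assumed.
# "[3]" targets performance level 3; offset values are illustrative only.
nvidia-settings -a '[gpu:0]/GPUPowerMizerMode=1'                 # prefer maximum performance
nvidia-settings -a '[gpu:0]/GPUGraphicsClockOffset[3]=100'       # +100 MHz core at perf level 3
nvidia-settings -a '[gpu:0]/GPUMemoryTransferRateOffset[3]=400'  # +400 MHz memory transfer rate
# On Pascal the Linux driver still drops to the P2 state under CUDA load,
# so offsets applied to performance level 3 never take effect while crunching.
```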
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Profile Keith Myers Special Project $250 donor
Volunteer tester

Joined: 29 Apr 01
Posts: 4611
Credit: 295,420,604
RAC: 614,781
United States
Message 1893673 - Posted: 6 Oct 2017, 19:51:28 UTC - in response to Message 1893672.  

Thanks for the reply, Petri. Yes, I understand the differences in hardware P-states and "Performance Levels" in Linux. I can only surmise that the inability to raise the default P2 state in the Linux environment is due to a lack of support in the Linux tools, namely nvidia-settings and/or the Nvidia Linux drivers themselves. The same hardware can be moved in Windows from the P2 state to the P0 state with common tools such as Nvidia Inspector, Precision X or MSI Afterburner.

Wonder where the defect is .... drivers or tools?
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1893815 - Posted: 7 Oct 2017, 14:07:16 UTC

Here is a new executable for most Maxwell (750, 750 Ti, 9x0) and Pascal (10x0) cards, namely sm_50, sm_52 and sm_61. It is 3-13% faster than zi3x. The FFT callbacks are now implemented in all places. The executable is statically linked and needs no external CUDA libraries. Test offline first.

source code : https://drive.google.com/open?id=0B9PYeBxtfMjaMzlDQ0t0cnpUZ2c
executable : https://drive.google.com/open?id=0B9PYeBxtfMjaTUxJT3ZURHd6M2M
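A quick offline check of the "no external CUDA libraries" claim on your own copy (the binary name below is a placeholder -- use whatever your download is called):

```shell
# Expect no cufft/cudart lines: both are linked in statically. Only system
# libraries appear; the driver's libcuda.so is loaded at run time instead.
ldd ./setiathome_cuda_zi3xs2 | grep -i -E 'cufft|cudart'
```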

Petri.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1893840 - Posted: 7 Oct 2017, 15:43:19 UTC - in response to Message 1893815.  
Last modified: 7 Oct 2017, 15:47:05 UTC

This seems to be a bit faster than zi3v, but it seems to have the same problem with the Arecibo Overflows: the CUDA App finds all Triplets where the SoG App finds more Pulses;
https://setiathome.berkeley.edu/workunit.php?wuid=2703056012
On one of the Test Overflow BLC3s the new App found 4 Bad signals against the CPU, whereas zi3v only found 1 Bad signal out of 30 against the CPU.

Which version of sah_v7_opt are you using? When I tried zi3xs1 with both CUDA 8 & 9 in Ubuntu 16.04, I received compile errors with both 3707 and 3679. I haven't tried any lower versions of sah_v7_opt or Ubuntu. I suppose it could have been the boinc-master too; this time I was using boinc-master 7.9 instead of 7.5.
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1893868 - Posted: 7 Oct 2017, 17:06:58 UTC - in response to Message 1893840.  

Hi TBar,

I've got an ancient sah_v7_opt on my computer. A couple of years old.

If you try the s2, you get rid of the compile error that mentions hires_timer or something similar.

Modify the Makefile to have LDFLAGS = -arch=sm_61, or whatever your card supports. It shaves 4 seconds off the run time, because the executable then has the right CUDA code linked in place. The s2 executable I shared contains sm_50, sm_52 and sm_61, so it is 4 seconds slower than it needs to be.

Makefile has a lot of changes to support static linking.
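The edit Petri describes amounts to something like this (a sketch; the command-line override assumes the hand-edited Makefile uses LDFLAGS at the link step):

```shell
# In Xbranch/client/Makefile, set the link architecture to match your card:
#   LDFLAGS = -arch=sm_61    # Pascal 10x0; sm_50/sm_52 for Maxwell
#   LDFLAGS =                # empty -> generic multi-arch build, ~4 s slower
# Or override once from the command line instead of editing the file:
make -C Xbranch/client LDFLAGS='-arch=sm_61'
```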

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1893880 - Posted: 7 Oct 2017, 17:32:56 UTC - in response to Message 1893868.  

It looks as though it was the boinc-master with zi3xs1. I trashed the full build of 7.9 and went back to the small 7.5 boinc-master and it works. This is the first time I've tried building the CUDA App in Ubuntu 16.04, things are a little different. Now to try zi3xs2. On my machine there is a nice speed up with the Arecibo tasks using your xs2 with the version 9 cards.

Trying to upload over 155 MB to Crunchers Anonymous may be a challenge though.
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1893922 - Posted: 7 Oct 2017, 20:30:57 UTC - in response to Message 1893880.  
Last modified: 7 Oct 2017, 20:32:28 UTC

It looks as though it was the boinc-master with zi3xs1. I trashed the full build of 7.9 and went back to the small 7.5 boinc-master and it works. This is the first time I've tried building the CUDA App in Ubuntu 16.04, things are a little different. Now to try zi3xs2. On my machine there is a nice speed up with the Arecibo tasks using your xs2 with the version 9 cards.

Trying to upload over 155 MB to Crunchers Anonymous may be a challenge though.


:)

You may post a link to my Google Drive. I have released an executable, so under the GPL I have to make the source code available for three years.

But the four-second slowdown forces us to deliver at least 3 different executables (sm_50, sm_52, sm_61). I can publish them. It is the same source.

I'd like to have a user test the version on a Tesla P100 GPU. I'd be happy to compile one for anyone who has a Tesla P100 with 56 or 60 compute units. That would mean -unroll 56 or -unroll 60!!!
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 9976
Credit: 130,696,680
RAC: 82,807
Australia
Message 1893931 - Posted: 7 Oct 2017, 21:08:53 UTC - in response to Message 1893922.  
Last modified: 7 Oct 2017, 21:10:20 UTC

I'd like to have a user test the version on a Tesla P100 GPU. I'd be happy to compile one for anyone who has a Tesla P100 with 56 or 60 compute units. That would mean -unroll 56 or -unroll 60!!!

Would there continue to be any benefit with such a large number of Compute Units?
Do you get diminishing returns with a higher number of CUs, and would running more than 1 WU at a time be necessary to keep taking advantage of the extra CUs available?

From the looks of your current results, one particular WU took 2 min on your GTX 1080, which has 20 CUs.
With linear scaling, that would mean a 36 sec run time for a card with 60 CUs!
Grant
Darwin NT
Profile Keith Myers Special Project $250 donor
Volunteer tester

Joined: 29 Apr 01
Posts: 4611
Credit: 295,420,604
RAC: 614,781
United States
Message 1893954 - Posted: 7 Oct 2017, 22:29:17 UTC - in response to Message 1893815.  

Here is a new executable for most Maxwell (750, 750 Ti, 9x0) and Pascal (10x0) cards, namely sm_50, sm_52 and sm_61. It is 3-13% faster than zi3x. The FFT callbacks are now implemented in all places. The executable is statically linked and needs no external CUDA libraries. Test offline first.

source code : https://drive.google.com/open?id=0B9PYeBxtfMjaMzlDQ0t0cnpUZ2c
executable : https://drive.google.com/open?id=0B9PYeBxtfMjaTUxJT3ZURHd6M2M

Petri.

I would like to understand how CUDA is enabled in the default Windows drivers. I assume that the current Windows drivers have the CUDA 8.0 libraries statically linked into the drivers or some such mechanism. I don't believe I have ever seen something like a CUDA library directory or anything on a Windows machine. Can someone explain how CUDA is supported in a Windows driver?
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 9976
Credit: 130,696,680
RAC: 82,807
Australia
Message 1893974 - Posted: 8 Oct 2017, 0:16:49 UTC - in response to Message 1893954.  

Can someone explain how CUDA is supported in a Windows driver?

I'd say it's just part of the video driver. PhysX & digital audio support & the like are included by default, but you can exclude them with the Advanced Installation option, as they are not part of the video driver.

Why they chose to make it separate for Linux is the question that should be asked, IMHO.
Grant
Darwin NT
Profile Keith Myers Special Project $250 donor
Volunteer tester

Joined: 29 Apr 01
Posts: 4611
Credit: 295,420,604
RAC: 614,781
United States
Message 1893978 - Posted: 8 Oct 2017, 0:36:04 UTC - in response to Message 1893974.  

Can someone explain how CUDA is supported in a Windows driver?

I'd say it's just part of the video driver. PhysX & digital audio support & the like are included by default, but you can exclude them with the Advanced Installation option, as they are not part of the video driver.

Why they chose to make it separate for Linux is the question that should be asked, IMHO.

Yes, that is what had my interest and why I asked. I didn't know just how it all worked when I saw Petri's post that the CUDA 9.0 libraries were statically linked into the new beta special app he is developing.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1893990 - Posted: 8 Oct 2017, 1:17:27 UTC - in response to Message 1893978.  
Last modified: 8 Oct 2017, 1:21:33 UTC

There is a difference between drivers and libraries. All three platforms have the CUDA libraries separate from the driver. In Windows they are .dlls with names such as cufft32_42_9.dll & cudart32_42_9.dll; in OSX they are libcudart.8.0.dylib & libcufft.8.0.dylib. The problem is that the libraries tend to get larger with each CUDA version, and the current ones are quite large.

I haven't had any luck compiling the zi3xs2 App the normal way. The App is the normal size, but every copy immediately Overflows with 30 spikes, whether compiled in 14.04 or 16.04. I'm also seeing an error I've never seen before when trying the static build;
seti_cuda-seti.o: In function `seti_init_state()':
/home/tbar/sah_v7_opt/Xbranch/client/seti.cpp:352: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `initialize_for_wu()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:100: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `read_wu_state()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:119: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
collect2: error: ld returned 1 exit status
Makefile:1497: recipe for target 'seti_cuda' failed
make[2]: *** [seti_cuda] Error 1

I usually don't have any trouble compiling the Apps.
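A classic cause of exactly this symptom -- note the std::__cxx11 in the unresolved names -- is GCC 5's dual libstdc++ ABI: if libboinc.a and the client objects were built with different settings of _GLIBCXX_USE_CXX11_ABI (or by sufficiently different compilers), every overload taking std::string fails to resolve at link time. A self-contained demonstration with a made-up function name (assumes g++ >= 5):

```shell
# The same C++ signature mangles differently under the two ABIs,
# so objects built one way will not link against a library built the other.
cat > /tmp/abi_demo.cpp <<'EOF'
#include <string>
int resolve_demo(const char*, std::string&) { return 0; }
EOF
g++ -c /tmp/abi_demo.cpp -o /tmp/abi_new.o                             # new ABI (default since GCC 5)
g++ -D_GLIBCXX_USE_CXX11_ABI=0 -c /tmp/abi_demo.cpp -o /tmp/abi_old.o  # old ABI
nm -C /tmp/abi_new.o | grep resolve_demo   # parameter demangles with std::__cxx11::
nm -C /tmp/abi_old.o | grep resolve_demo   # no __cxx11 tag in the old-ABI symbol
```

Running nm -C over the libboinc.a actually being linked shows which flavour it contains.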
Profile Keith Myers Special Project $250 donor
Volunteer tester

Joined: 29 Apr 01
Posts: 4611
Credit: 295,420,604
RAC: 614,781
United States
Message 1893994 - Posted: 8 Oct 2017, 1:25:05 UTC - in response to Message 1893990.  

There is a difference between drivers and libraries. All three platforms have the CUDA libraries separate from the driver. In Windows they are .dlls with names such as cufft32_42_9.dll & cudart32_42_9.dll; in OSX they are libcudart.8.0.dylib & libcufft.8.0.dylib. The problem is that the libraries tend to get larger with each CUDA version, and the current ones are quite large.

I haven't had any luck compiling the zi3xs2 App the normal way. The App is the normal size, but every copy immediately Overflows with 30 spikes, whether compiled in 14.04 or 16.04. I'm also seeing an error I've never seen before when trying the static build;
seti_cuda-seti.o: In function `seti_init_state()':
/home/tbar/sah_v7_opt/Xbranch/client/seti.cpp:352: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `initialize_for_wu()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:100: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `read_wu_state()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:119: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
collect2: error: ld returned 1 exit status
Makefile:1497: recipe for target 'seti_cuda' failed
make[2]: *** [seti_cuda] Error 1

I usually don't have any trouble compiling the Apps.

Of course, duh, I don't know what I was thinking. I totally forgot about the .dlls in Windows.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1894049 - Posted: 8 Oct 2017, 7:31:52 UTC - in response to Message 1893931.  
Last modified: 8 Oct 2017, 8:02:14 UTC

I'd like to have a user test the version on a Tesla P100 GPU. I'd be happy to compile one for anyone who has a Tesla P100 with 56 or 60 compute units. That would mean -unroll 56 or -unroll 60!!!

Would there continue to be any benefit with such a large number of Compute Units?
Do you get diminishing returns with a higher number of CUs, and would running more than 1 WU at a time be necessary to keep taking advantage of the extra CUs available?

From the looks of your current results, one particular WU took 2 min on your GTX 1080, which has 20 CUs.
With linear scaling, that would mean a 36 sec run time for a card with 60 CUs!


The scaling would be almost linear. Pulse finding uses a parallel scan, and when something is found it reverts to serial code. That happens at most 30 times per WU, plus a few more times for the best pulse found. The other signal types are fully parallel.

You can look at the scaling from my results, since I have three GTX 1080s and one Ti (20 vs 28 CUs).


EDIT:
the GTX 1080 has the following:
"The GeForce GTX 1080 and its GP104 GPU consist of four GPCs, twenty Pascal Streaming Multiprocessors, and eight memory controllers. In the GeForce GTX 1080, each GPC ships with a dedicated raster engine and five SMs. Each SM contains 128 CUDA cores, 256 KB of register file capacity, a 96 KB shared memory unit, 48 KB of total L1 cache storage, and eight texture units." That makes 2560 cores (3584 for the Ti),

and the P100 has 56 SMs with 64 FP32 CUDA cores per SM (3584 cores).

The memory system is totally different and that might yield some interesting results.

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1894050 - Posted: 8 Oct 2017, 7:38:53 UTC - in response to Message 1893990.  

There is a difference between drivers and libraries. All three platforms have the CUDA libraries separate from the driver. In Windows they are .dlls with names such as cufft32_42_9.dll & cudart32_42_9.dll; in OSX they are libcudart.8.0.dylib & libcufft.8.0.dylib. The problem is that the libraries tend to get larger with each CUDA version, and the current ones are quite large.

I haven't had any luck compiling the zi3xs2 App the normal way. The App is the normal size, but every copy immediately Overflows with 30 spikes, whether compiled in 14.04 or 16.04. I'm also seeing an error I've never seen before when trying the static build;
seti_cuda-seti.o: In function `seti_init_state()':
/home/tbar/sah_v7_opt/Xbranch/client/seti.cpp:352: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `initialize_for_wu()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:100: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `read_wu_state()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:119: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
collect2: error: ld returned 1 exit status
Makefile:1497: recipe for target 'seti_cuda' failed
make[2]: *** [seti_cuda] Error 1

I usually don't have any trouble compiling the Apps.


grep -r boinc_resolve_file ~petri/boinc/*

/home/petri/boinc/lib/app_ipc.cpp:int boinc_resolve_filename(
/home/petri/boinc/lib/app_ipc.cpp:int boinc_resolve_filename_s(const char *virtual_name, string& physical_name) {
Binary file /home/petri/boinc/lib/libboinc.a matches
/home/petri/boinc/lib/app_ipc.h:extern int boinc_resolve_filename_s(const char*, std::string&);
/home/petri/boinc/lib/app_ipc.h:extern int boinc_resolve_filename(const char*, char*, int len);


my boinc is 7.5.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1894095 - Posted: 8 Oct 2017, 14:00:33 UTC - in response to Message 1894050.  

Seems strange that I'm suddenly getting boinc errors when it appears the boinc files haven't changed. What commands are you using for the static build? The only commands I could find end up with the line:
NVCCFLAGS = -O3 --use_fast_math --cudart static --ptxas-options="-v" --compiler-options "$(AM_CXXFLAGS) $(CXXFLAGS) -lcufft_static -fno-strict-aliasing" -m64 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61

It would be nice if the App could be compiled on a version of Ubuntu lower than 14.04. Right now my systems installed from the 14.04.1 image with kernel 3.13 won't work, as their libstdc++ is too old. I can run the current zi3v in Ubuntu 12.04.5 without any trouble using the driver from the CUDA 8 Toolkit, but not zi3xs2.
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1894108 - Posted: 8 Oct 2017, 16:50:48 UTC - in response to Message 1894095.  

Seems strange that I'm suddenly getting boinc errors when it appears the boinc files haven't changed. What commands are you using for the static build? The only commands I could find end up with the line:
NVCCFLAGS = -O3 --use_fast_math --cudart static --ptxas-options="-v" --compiler-options "$(AM_CXXFLAGS) $(CXXFLAGS) -lcufft_static -fno-strict-aliasing" -m64 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61

It would be nice if the App could be compiled with a lower version of Ubuntu 14.04. Right now my systems installed with the 14.04.1 image and kernel 3.13 won't work as the version of libstdc++ is too old. I can run the current version of zi3v in Ubuntu 12.04.5 without any trouble using the driver from the CUDA 8 Toolkit, not so with zi3xs2.


The Makefile is in the source zip.
I modify it by hand, not via ./configure or autoconf.
I have changed a lot of things to enable static CUDA.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1894119 - Posted: 8 Oct 2017, 17:48:43 UTC - in response to Message 1894108.  

OK, I think I'm beginning to understand. After running autosetup and configure, I need to replace the Xbranch/client Makefile with the one from the source folder, then go through it and replace every 'petri' with 'tbar', and fix all the places where your boinc-master folder is just named boinc whereas mine is boinc-master...

That seems to be working in 14.04.1 now. If you add a couple more code=sm flags the thing comes to 179 MB. Kinda large for a CUDA App.
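The hand search-and-replace above can be done with sed; a toy demonstration of the path rewrite on a one-line sample (on the real tree you would run the same expression over Xbranch/client/Makefile with -i.bak):

```shell
# Swap petri's BOINC checkout path for a boinc-master layout.
printf 'BOINC_DIR = /home/petri/boinc\n' |
  sed -e 's|/home/petri/boinc|/home/tbar/boinc-master|g'
# prints: BOINC_DIR = /home/tbar/boinc-master
```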
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1529
Credit: 340,207,233
RAC: 327,472
Finland
Message 1894127 - Posted: 8 Oct 2017, 18:10:35 UTC - in response to Message 1894119.  

OK, I think I'm beginning to understand. After running autosetup and configure, I need to replace the Xbranch/client Makefile with the one from the source folder, then go through it and replace every 'petri' with 'tbar', and fix all the places where your boinc-master folder is just named boinc whereas mine is boinc-master...

That seems to be working in 14.04.1 now. If you add a couple more code=sm flags the thing comes to 179 MB. Kinda large for a CUDA App.


Yes, it is large, but the small app used to load libraries during startup, so it is not such a big deal. I recommend making four apps: one each for sm_50, sm_52 and sm_61, plus a generic one. The generic one has a 4-second penalty at startup. Users can choose what they like. Set LDFLAGS = -arch=sm_61 to build one for sm_61, and leave LDFLAGS empty to build a generic one.

The FFT callbacks need the static link, but they give a nice speed-up.

Petri.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4072
Credit: 239,055,114
RAC: 215,971
United States
Message 1894268 - Posted: 9 Oct 2017, 14:00:05 UTC - in response to Message 1894127.  

Having a separate App for the 7xx, 9xx, and 10 series would be a problem for people with mixed GPUs in one machine. It would probably be best to just leave the main three in one App. I doubt anything lower than the 750, cc 5.0, will work very well since they already have problems with CUDA 8. I suppose you could make one for just cc 3.5 and put the -gencode arch=compute_35,code=compute_35 at the end. Wouldn't that be the same as a Generic App?
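For reference, the distinction behind that question is in the -gencode pairs: code=sm_XX embeds a binary (cubin) usable only on that one architecture, while code=compute_XX embeds PTX that the driver can JIT-compile for any newer GPU -- which is what makes a build "generic". A hypothetical invocation (file names made up; flags follow TBar's NVCCFLAGS above):

```shell
# Cubins for the three common architectures, plus cc 3.5 PTX as a JIT
# fallback for anything newer; every extra -gencode pair grows the fat
# binary, hence the 179 MB executables mentioned earlier.
nvcc -O3 --cudart static \
     -gencode arch=compute_50,code=sm_50 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_35,code=compute_35 \
     -o seti_cuda seti_cuda.cu
```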
©2018 University of California