Linux CUDA 'Special' App finally available, featuring Low CPU use
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
Hi all,

First impressions of a Titan V:
1) Nice colour
2) Expensive
3) Easy to install
4) Slim (more space between cards compared to the 1080 and 1080 Ti)
5) Does not work with old (bad) CUDA code. I have to make sure the NVIDIA thread model and recommendations are followed thoroughly
6) FAST!
7) Hard to get it to use more than 130 W of juice (this may change when I get to the optimisation stage)
8) Makes me smile
9) Addictive
10) I have had the card for 30 hours now, and with sleep and a day job in between I have tested and coded for only about 8 hours. So far I have found just one packet it does not handle as well as my 1080 and 1080 Ti. Crash, boom, bang! It halts completely on just one packet in my extended test suite. My standard test suite, which takes 1701 seconds on a 1080 @ 1974 MHz, is now done in 1003 seconds on the TITAN V, non-overclocked.
10+) I'm not ready to use it under BOINC yet. I have to iron out that one bug (and the next one?) in my code first.

Petri - running on 2x1080 + 1x1080Ti until the TITAN V special sauce is ready.

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
Thanks for the update, Petri. I was curious how that project was going. I looked earlier today for some tasks done by it and didn't find any. Now I know why. Was it pretty straightforward to use and program for? Or did you have to start from scratch because the architecture is so different from earlier generations?

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13871 · Credit: 208,696,464 · RAC: 304
"7) Hard to get it to use more than 130 W of juice (this may change when I get to the optimisation stage)"
I figure you'd have to make use of the Tensor cores to come even remotely close to its maximum power load. I suspect highly optimised CUDA code will still be able to hit the thermal limits of the card (similar to CPUs hitting thermal & clock throttling with highly optimised AVX code). Rumour has it that the next series of desktop NVidia GPUs has been delayed to late this year (maybe even early next) because the AMD Vegas weren't as good as anticipated. There's plenty of money being made selling current GPUs to miners, and even more selling current Tesla cards for scientific work, so there's no rush to release the next generation.

Grant
Darwin NT
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
"Thanks for the update, Petri. I was curious how that project was going. I looked earlier today for some tasks done by it and didn't find any. Now I know why. Was it pretty straightforward to use and program for? Or did you have to start from scratch because the architecture is so different from earlier generations?"
Pretty straightforward. The main obstacle was making sure the code allowed all threads to arrive at a barrier ("__syncthreads()") when needed. The 1080 and the compiler are not that picky about that. The Volta GPU can run threads out of sync (for performance), and the compiler does not enforce strict rules to make sure that all threads run in sync. My code relied on the old model, which did not allow for out-of-sync execution, and I had to make sure that exact synchronisation is done properly at certain steps of the processing. My guess is that there is a similar reason for the one bug that still remains. I'll leave that till tomorrow and the weekend. I'll be back. -- Petri
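A minimal sketch of the pitfall described above, using a generic shared-memory reduction rather than the actual SETI code: pre-Volta code often dropped the barrier once work fit inside a single warp ("implicit" warp-synchronous programming), but Volta's independent thread scheduling makes that a race, so the explicit barrier is needed on every pass.

```cuda
// Hypothetical block-sum kernel, launched with 256 threads per block.
// Illustrates why Volta (sm_70) needs the explicit barriers that older
// GPUs appeared to tolerate omitting.
__global__ void block_sum(const float *in, float *out)
{
    __shared__ float buf[256];
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                 // every thread must reach this barrier

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();             // pre-Volta code often skipped this once
                                     // stride < 32; on Volta threads in a warp
                                     // may diverge, so it is required
                                     // (or use __syncwarp() within a warp)
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];
}
```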
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
"7) Hard to get it to use more than 130 W of juice (this may change when I get to the optimisation stage)"
Hi Grant,
The tensor cores can be used for the Gaussian analysis stage in Seti. I'll try that in the future. It will give some speedup, but the Gaussian part of the code is minimal compared to the pulse finding. Note that I have not yet overclocked the new GPU and its RAM, and the software is not optimised for the Volta architecture yet. Also, the GPU seems to run at P2 instead of P0 under a compute workload, just like the 1080; that is something I'll look at next, after I get the code running properly. But I find it interesting that I get such good performance at half the power at this early stage. -- Petri
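For reference, Volta's Tensor cores are exposed through the CUDA 9 WMMA API. A minimal 16x16x16 half-precision multiply-accumulate looks like the sketch below; this is illustrative only, and how a Gaussian-analysis stage would actually map onto it is an open design question.

```cuda
// Illustrative Tensor-core use via the WMMA API (requires sm_70).
// One warp computes one 16x16 tile: C = A * B + C.
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile_16x16(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);       // start the accumulator at zero
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);      // the Tensor-core operation
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```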
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13871 · Credit: 208,696,464 · RAC: 304
"But I find it interesting that I get such good performance at half the power at this early stage."
Particularly so with code that wasn't written for it. Whereas going from Kepler & earlier to Maxwell, the run times for the existing CUDA code were about the same as on the previous generation. Looks like Nvidia have put a lot of work into this next generation, and it appears it's going to pay off nicely.

Grant
Darwin NT
Al · Joined: 3 Apr 99 · Posts: 1682 · Credit: 477,343,364 · RAC: 482
It sure does, and it will be cool when the technology filters down to the more 'pedestrian' versions of their cards... ;-)
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
"And the GPU seems to run at P2 instead of P0 under a compute workload, just like the 1080; that is something I'll look at next, after I get the code running properly. But I find it interesting that I get such good performance at half the power at this early stage."
I discovered a simple way to keep the Pascal cards in P0 mode for compute loads... at least in Windows. It doesn't help in Linux, however. In case you didn't know, you can use Nvidia Profile Inspector to turn off the P2 compute-load restriction very easily. I think just about everyone in the GPUUG group running Windows is using it. Maybe you can figure out what Nvidia Profile Inspector is doing in Windows and do the same for Linux. We special app users would really appreciate that!
MarkJ · Joined: 17 Feb 08 · Posts: 1139 · Credit: 80,854,192 · RAC: 5
"After your question re: boinc manager benchmark score, I looked again at my 'reported' integer speed of 82+ GIPS. Something is odd there. It is reported as 'speed per core' but it seems unrealistically high to me."
@Gene, I get this on my Ryzen 1700s. All four of them are pretty much the same:

Starting BOINC client version 7.8.4 for x86_64-pc-linux-gnu
Processor: 16 AuthenticAMD AMD Ryzen 7 1700 Eight-Core Processor [Family 23 Model 1 Stepping 1]
Benchmark results:
   Number of CPUs: 16
   4139 floating point MIPS (Whetstone) per CPU
   50370 integer MIPS (Dhrystone) per CPU

Although when the first version of 7.9.2 came out it was getting 100,000 integer. The second version had it back to 50,000, so they probably tweaked a compiler flag.

BOINC blog
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
Hi Mark, all I can say is WOW! So I guess I can chalk my low integer scores up to my kernel. Are those 4.14.0-0.bpo.3-amd64 kernels special just for Ryzen? Or just the normal kernels for Debian Stretch?
MarkJ · Joined: 17 Feb 08 · Posts: 1139 · Credit: 80,854,192 · RAC: 5
"Are those 4.14.0-0.bpo.3-amd64 kernels special just for Ryzen? Or just the normal kernels for Debian Stretch?"
It's the standard kernel from Debian, currently in stretch-backports. I was using the 4.9 kernel before and getting similar numbers. The machines are standard Ryzen 1700s, nothing overclocked. Memory is DDR4-2400. The BIOS is one version behind the current offering (they're ASUS Prime X370-Pro motherboards). I just take the optimised defaults and adjust fan settings whenever I update the BIOS.
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
I have the same motherboard. Current BIOS on the Windows machine and a couple of versions behind on the Linux machine. Both are overclocked to 3950 MHz, with memory at 3200 or 3300 MHz. I wonder if overclocking throws off the benchmark, or is it just the BOINC version or kernel level?
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
Hello all,
My host has just begun to run with a Volta GPU, version x41p_V0.9. This is the first compilation of the software to run correctly with my test suite. The performance of the 1080 and 1080 Ti has dropped 5-10% due to changes in the code needed to make the Volta run. Things will change in due time...
This is what I call a shorty: http://setiathome.berkeley.edu/workunit.php?wuid=2883079084
Current configuration: 1 x Volta, 1 x 1080 Ti, 2 x 1080, on this host: https://setiathome.berkeley.edu/show_host_detail.php?hostid=7475713
TITAN V GPU @ 1335 MHz and RAM @ 1700 MHz

+-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:05:00.0  On |                  N/A |
| 100%   69C    P2  156W / 215W |  3005MiB /  8118MiB  |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             On   | 00000000:06:00.0 Off |                  N/A |
| 100%   51C    P2  118W / 285W |  7368MiB / 12066MiB  |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 00000000:09:00.0 Off |                  N/A |
| 100%   69C    P2  145W / 215W |  2726MiB /  8119MiB  |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  On   | 00000000:0A:00.0 Off |                  N/A |
| 100%   72C    P2  223W / 285W |  3340MiB / 11178MiB  |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       979      G   /usr/bin/X                                   216MiB |
|    0      1353      G   compiz                                        59MiB |
|    0     10102      G   nvidia-settings                                0MiB |
|    0     10460      C   ...ome_x41zc_x86_64-pc-linux-gnu_cuda65_v8  2693MiB |
|    1     10685      C   ...ome_x41zc_x86_64-pc-linux-gnu_cuda65_v8  7332MiB |
|    2     10195      C   ...ome_x41zc_x86_64-pc-linux-gnu_cuda65_v8  2691MiB |
|    3     10751      C   ...ome_x41zc_x86_64-pc-linux-gnu_cuda65_v8  3305MiB |
+-----------------------------------------------------------------------------+
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
Thanks for the update, Petri. Chuckle (LOL), yes, I would call that a 'shorty' too. 19 seconds, hah! Glad to see you are making progress on the alpha app to enable SETI processing on the Titan V. Hope you can recover some of what the 1080 and 1080 Ti lost in the compromise.
Gene · Joined: 26 Apr 99 · Posts: 150 · Credit: 48,393,279 · RAC: 118
"Maybe you can figure out what Nvidia Profile Inspector is doing in Windows and do the same for Linux. We special app users would really appreciate that!"
@Keith

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 00000000:0A:00.0  On |                  N/A |
|  65%   62C    P0    26W / 38W |  1489MiB /  1994MiB  |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       987      G   /usr/lib/xorg/Xorg                            15MiB |
|    0     20369      C   ...me_x41p_zi3v_x86_64-pc-linux-gnu_cuda90  1459MiB |
+-----------------------------------------------------------------------------+

For me, nvidia-settings -> GPU 0 -> PowerMizer -> Preferred Mode = Prefer Maximum Performance seems to lock in the P0 state. It's a 750 Ti, so YMMV on a 1070 card, although I assume that with the same driver (390.25) you would have the same control/setting options.
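If it helps, the same PowerMizer preference can also be set from a shell via the standard nvidia-settings attribute (1 = Prefer Maximum Performance). This is just a config one-liner, not a guaranteed workaround for the P2 compute cap:

```shell
# Equivalent of the PowerMizer GUI setting for GPU 0; needs a running X session.
nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"
```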
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13871 · Credit: 208,696,464 · RAC: 304
"It's a 750 Ti, so YMMV on a 1070 card, although I assume that with the same driver (390.25) you would have the same control/setting options."
From memory they've made some changes to how the GPU responds to that setting between Maxwell & Pascal.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
"Maybe you can figure out what Nvidia Profile Inspector is doing in Windows and do the same for Linux. We special app users would really appreciate that!"
The '50' cards don't have any problems with P0. My 750 Ti, 950, 1050, and 1050 Ti all run at P0. It's the higher-end cards that run at P2 while crunching. I have a 2 GB 960 that refuses all attempts to run it in anything other than P2, which is read-only. I can't even get it to list the available clock rates; it just says Not Supported. So the 960 actually runs slower than my 950s, which run at P0... such a deal.

Sat Mar  3 01:37:57 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960     On   | 00000000:01:00.0 Off |                  N/A |
|  67%   67C    P2   71W / 160W |  1709MiB /  2002MiB  |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 950     On   | 00000000:08:00.0  On |                  N/A |
|  40%   65C    P0    65W / 75W |  1832MiB /  1994MiB  |     98%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23880      C   ...me_x41p_zi3v_x86_64-pc-linux-gnu_cuda90  1687MiB |
|    1       950      G   /usr/lib/xorg/Xorg                           162MiB |
|    1      1533      G   compiz                                       110MiB |
|    1     24219      C   ...me_x41p_zi3v_x86_64-pc-linux-gnu_cuda90  1535MiB |
+-----------------------------------------------------------------------------+
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
"For me, nvidia-settings -> GPU 0 -> PowerMizer -> Preferred Mode = Prefer Maximum Performance"
As TBar explained earlier, only the low-end, low-power Pascal cards will run at P0. The limitation has to be in the Windows and Linux drivers. Only in Windows, though, do you have Profile Inspector, which can override how the drivers limit a card's power state. I've surmised that Nvidia doesn't want to cannibalize sales of their compute cards, like the Quadro and Tesla, by allowing consumer gaming cards to run compute loads, so they restrict the higher-performance cards.
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
Hi again,
I did some overclocking (in P2) on my Titan V: GPU at 1582 MHz and RAM at 1960 MHz. Here is a random sample of a WU done by a 1080 Ti and a TITAN V: http://setiathome.berkeley.edu/workunit.php?wuid=2884551077
I know this is not a fair comparison, since the software is totally different and I do not know how many WUs at a time the other host is running. This is just an update on the performance of an overclocked Titan V with the first acceptable code. Software tuning is still to come!
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
Thanks for the update, Petri. A very good showing by the card on a standard .44 AR Arecibo task, which gives us a benchmark standard.
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.