Linux CUDA 'Special' App finally available, featuring Low CPU use

Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1865075 - Posted: 1 May 2017, 19:02:25 UTC - in response to Message 1865071.  

It shouldn't be too difficult to figure out...
;-)
It's f*%*&%*)ing LINUX! It's designed to be difficult to figure out anything, at least for the average non-geek user. ;^)
Brent Norman
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1865079 - Posted: 1 May 2017, 19:56:40 UTC - in response to Message 1865075.  

Let Google be your friend in Linux. Actually, I find Linux much easier to crunch with; there are only about three different settings for the command line, compared to the never-ending list with SoG.

It's just setting up your initial app_info that's a little tedious if you don't read the info provided.
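For example, all of those settings go on the single <cmdline> line in the app_version section of app_info.xml. Just as an illustrative sketch, using only the flags mentioned in this thread (-unroll, -poll, and the "pfb" parameter that comes up later); check the README in your package for the exact flags your build supports:

  <app_version>
    <app_name>setiathome_v8</app_name>
    ...
    <!-- all of the Special App tuning goes on this one line:
         -unroll autotune : match the unroll factor to the GPU's Compute Units
         -poll            : burn a full CPU core polling, for lower GPU latency
         -pfb N           : PulseFind blocks per SM -->
    <cmdline>-unroll autotune</cmdline>
  </app_version>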
ID: 1865079 · Report as offensive
Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1865132 - Posted: 2 May 2017, 1:46:36 UTC - in response to Message 1865069.  

When I can get to it, perhaps this evening, I'll try doing a similar comparison for the run times on the 780, although I don't have one in another box to use as a control. I'm certainly curious to see if the Special App significantly improves the throughput for the 780.
Okay, I got to it. Using the same format as my post for the 960, here are the numbers for the 780:
                 Host 8253697    |    Host 7057115
                Linux "Special"  | Win8.1 "SoG" (2/GPU)
               Avg RT (Tasks/Hr) |  Avg RT (Tasks/Hr)
              -------------------|---------------------                   
High AR -----     1:44 (34.62)   |      5:29 (21.88)
Normal AR ---     4:54 (12.24)   |     11:42 (10.26)
VLAR --------     7:39 (7.84)    |     20:32 (5.84)

Clearly, the 780 does get better overall throughput with the Special App: about 58% better on High AR tasks, 19% on Normal ARs, and about 34% on the VLARs. Certainly impressive, but likely not enough to compensate for the loss of use of the 670 and the reduced output of the 960. And, without a smoking gun pointing to whatever issue the 960 may be having, I'll likely switch back to SoG in Windows in a day or so, since I don't really want to devote much more time to this experiment.
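For anyone checking the math: tasks/hour is just 3600 divided by the average runtime in seconds, and the percentages are the ratio of the two columns:

  High AR:   3600 / 104 s = 34.62/hr;  34.62 / 21.88 = 1.58  ->  ~58% better
  Normal AR: 3600 / 294 s = 12.24/hr;  12.24 / 10.26 = 1.19  ->  ~19% better
  VLAR:      3600 / 459 s =  7.84/hr;   7.84 /  5.84 = 1.34  ->  ~34% better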
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1865136 - Posted: 2 May 2017, 2:44:38 UTC - in response to Message 1865132.  

Meanwhile, other people's 960s running the current App are producing times not much different from the ones you have posted for your 780, which means their 960s are wiping the floor with yours. It's common practice to compare similar hardware with similar software before drawing conclusions, so you might want to look at a few other 960s. My 960 would be an easy find; it seems to work pretty much the same in a couple of different systems, and currently it's running in the previous system, http://setiathome.berkeley.edu/results.php?hostid=6796479&offset=320. The slower times there are being produced by a 950 running in an x4 PCIe slot; the other two cards are pretty close. Then there is Gianfranco's 960, https://setiathome.berkeley.edu/results.php?hostid=8215300&offset=220. You could probably find a few more running version zi3t2b if you look around. There are also a few 750Ti cards running the zi3t2b App, mine, http://setiathome.berkeley.edu/results.php?hostid=6906726&offset=220, and another, http://setiathome.berkeley.edu/results.php?hostid=7942417&offset=320. Even the 750s are putting your 960 to shame. sad... ;)
Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1865144 - Posted: 2 May 2017, 3:39:43 UTC - in response to Message 1865136.  

Meanwhile, other people's 960s running the current App are producing times not much different from the ones you have posted for your 780, which means their 960s are wiping the floor with yours. [...] Even the 750s are putting your 960 to shame. sad... ;)
Actually, what I'm in the process of doing is installing Ubuntu on my box with the four GTX 960s (2 onboard, 2 on risers), which I've already demonstrated perform about the same as this one in Windows. Then, when I do comparisons, I'll have what should be a solid baseline to start from. The OS is already installed, but it will still take a while to bring it up to date (happening now) and then get all the other bits and pieces in place before I install BOINC and S@h. I doubt I'll actually start running anything this evening, what with the outage tomorrow, but perhaps by tomorrow evening or Wednesday.

And again, I will ask two questions that I posed in earlier posts for which no answers were forthcoming:
1) Does it seem likely that Linux would have a problem with a GPU on a riser, when Windows does not?
2) I see in the output for that last example you provided that he's apparently using a "pfb" parameter, which I assume is the same as the "pfblockspersm" that I used to set in the mbcuda.cfg file when I was running Cuda in the pre-SoG days. Could something like that be causing such an improvement?
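For reference, the section of mbcuda.cfg I mean looked roughly like this (from memory, so treat the names as approximate and the values as illustrative rather than recommendations; the shipped file had them commented out, so the app used its own defaults):

  [mbcuda]
  ; PulseFind blocks per multiprocessor; higher can speed up PulseFind
  ; at the cost of display lag on a GPU that's driving a monitor
  ;pfblockspersm = 8
  ; PulseFind periods per kernel launch
  ;pfperiodsperlaunch = 200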
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1865153 - Posted: 2 May 2017, 4:08:00 UTC - in response to Message 1865144.  

And again, I will ask two questions that I posed in earlier posts for which no answers were forthcoming:
1) Does it seem likely that Linux would have a problem with a GPU on a riser, when Windows does not?
2) I see in the output for that last example you provided that he's apparently using a "pfb" parameter, which I assume is the same as the "pfblockspersm" that I used to set in the mbcuda.cfg file when I was running Cuda in the pre-SoG days. Could something like that be causing such an improvement?

I've given my suggestions, numerous times. Go back and read the posts; I'm not going to keep repeating myself.
Did you see any improvement when running pfblockspersm in Windows? I don't remember any, and the other people in Linux who are Not running that setting don't seem to be having any problems... do they?
You might see a second or two of difference when trying that setting in the benchmark; my tests were all inconclusive when I tried those settings. So, I don't bother with something that produces negligible results. Clearly it won't produce a 200% difference, which is your problem.
Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1865155 - Posted: 2 May 2017, 4:11:35 UTC - in response to Message 1865153.  

Don't know why you're being so snippy when somebody's trying to help test your app. But if that's the way you choose to be, STUFF IT.
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1865159 - Posted: 2 May 2017, 4:29:04 UTC - in response to Message 1865155.  
Last modified: 2 May 2017, 4:32:29 UTC

Don't know why you're being so snippy when somebody's trying to help test your app. But if that's the way you choose to be, STUFF IT.

Hey, I built an App just for your 780. It works well, doesn't it? I also identified the problem with your 960 within minutes and gave you suggestions multiple times.
Yet you keep insisting there isn't a problem with your setup, and suggest Petri's App is the problem instead, while ignoring the results from other people's machines.
If that's the way you're going to test things, then perhaps you should stop. It's annoying when people keep ignoring your suggestions and insisting the problem is elsewhere.
I at least try reasonable suggestions... most of the time.
*nods head*
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1865221 - Posted: 2 May 2017, 9:57:05 UTC - in response to Message 1865045.  


If you wish to try another Kepler GPU you might try your 630 GT
My guess is it will have problems with the CUDA 8 App, but might work with the CUDA 6.5 App.
There aren't many people around with supported Kepler cards, so, it would appear it's your move ;-)


. . I have a GT 730 which I would love to try with CUDA80; it would not be the fastest thing, but I am confident it would take to it quite well. When I swing Bertie over to Linux I might stick it in the spare slot and cripple the 970s for a while, just to see how it manages. Since it, like the 630, only has two CUs, I would have to limit the unroll to 2 instead of 13 :(

Stephen

:)
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1865222 - Posted: 2 May 2017, 10:03:36 UTC - in response to Message 1865075.  

It's f*%*&%*)ing LINUX! It's designed to be difficult to figure out anything, at least for the average non-geek user. ;^)


. . I know that feeling ...

Stephen

:)
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1865224 - Posted: 2 May 2017, 10:11:00 UTC - in response to Message 1865136.  

Even the 750s are putting your 960 to shame. sad... ;)

. . I would have expected 960s to do close to the times my GTX 1050 Ti is doing.

. . Halflings (VHAR) . . . 2 mins (28 / hour)
. . Normals . . . . . . . . . . . . .4.5 mins (13 / hour)
. . Guppis . . . . . . . . . . . .. 7.5 mins (8 / hour)

. . But I am using an earlier version.

Stephen
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1865239 - Posted: 3 May 2017, 1:28:01 UTC - in response to Message 1865224.  

You should upgrade to the current zi3t2b and you might break 7 minutes on the blc2s. I think the difference between zi3k+ and the current build is about 30 seconds on the blc tasks and Fewer Inconclusive results.
I can't get better than 7:45 on my GTX 960 with the blc2s, but Gianfranco is running about 6.5 minutes on his 960 in Linux. I suppose that's the difference between OSX and Linux. My GTX 1050 is running just above 7 minutes on the blc2 tasks in Linux (Run time: 7 min 17 sec). There is a much greater improvement between zi+ and zi3t2b; anyone with zi+, zi3k+, or anything else should upgrade to the current version.
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1865244 - Posted: 3 May 2017, 1:48:06 UTC - in response to Message 1865239.  

You should upgrade to the current zi3t2b and you might break 7 minutes on the blc2s. I think the difference between zi3k+ and the current build is about 30 seconds on the blc tasks and Fewer Inconclusive results.


. . For me the $64,000 question is, does the upgrade run automagically or do I have to edit config files?

Stephen

??
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1865251 - Posted: 3 May 2017, 2:07:12 UTC - in response to Message 1865244.  
Last modified: 3 May 2017, 2:13:35 UTC

Try being a little more descriptive. Read this post: https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1863856
The latest version, zi3t2b with the lowered vRAM patch, is now available at Crunchers Anonymous: http://www.arkayn.us/forum/index.php?topic=197.msg4499#msg4499
This version has the -unroll autotune setting, which automatically sets the unroll to match your number of Compute Units. If you have a GPU with only one GB of vRAM, you must change the -unroll setting in the app_info.xml to 1 or 2 before running BOINC. GPUs with 2 GB of vRAM might be OK using 7 or 8; it would be safe to use -unroll 6 until further tested. See the README_x41p_zi3t2b.txt file in the docs folder for other hints. This version has Blocking Sync set as default, similar to the older CUDA Apps. You can increase the CPU use with the -poll command, similar to the old CUDA Apps.

As usual, if you have existing work units you must change the app_info.xml version number and plan class to match the tasks assigned in the client_state.xml file, to avoid creating ghost tasks; see the sketch below. If in doubt, finish the current tasks before making the change. This package has the AP and CPU App included with an appropriate app_info.xml. On my machines, the BLC tasks are a little faster and there are fewer Inconclusive results with this version.
If the new app_info.xml has the same version number and plan class entries, then all you have to do is add the new files. It's up to you whether you want to remove the old files.
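To make the version-number and plan-class matching concrete, the entries in question look something like the sketch below. Every value here is a placeholder, not the real package contents; copy whatever your client_state.xml actually shows for the tasks in progress, and use the exact file names from your download.

  <!-- client_state.xml lists, for the app version your tasks were
       assigned to, a pair like <version_num>800</version_num> and
       <plan_class>cuda60</plan_class>. The app_version block in
       app_info.xml must carry the same pair, or the client will
       orphan those tasks as ghosts. -->
  <app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>800</version_num>    <!-- placeholder: match client_state.xml -->
    <plan_class>cuda60</plan_class>   <!-- placeholder: match client_state.xml -->
    <cmdline>-unroll autotune</cmdline>
    <coproc>
      <type>CUDA</type>
      <count>1</count>
    </coproc>
    <file_ref>
      <file_name>setiathome_x41p_zi3t2b_binary</file_name>  <!-- placeholder: the App's exact file name -->
      <main_program/>
    </file_ref>
  </app_version>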
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1865256 - Posted: 3 May 2017, 2:36:52 UTC - in response to Message 1865251.  

Try being a little more descriptive. Read this post: https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1863856
If the new app_info.xml has the same version number and plan class entries, then all you have to do is add the new files. It's up to you whether you want to remove the old files.


. . I have d/l'd the file and extracted it. I will have a go after lunch.

Stephen

:)
Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1865257 - Posted: 3 May 2017, 2:37:08 UTC

For Petri or Jason, if you're still collecting examples of WUs for offline testing where the Pulse reporting by the Special App is a bit odd, here's a new Inconclusive I spotted.

Workunit 2525653719 (blc02_2bit_guppi_57835_05245_HIP38647_0023.28046.409.23.46.204.vlar)
Task 5705555225 (S=1, A=0, P=4, T=0, G=0) x41p_zi3t2b, Cuda 6.50 special
Task 5705555226 (S=1, A=0, P=4, T=0, G=0) v8.22 (opencl_ati5_nocal) windows_intelx86

The Pulse counts for both hosts are the same, as are the reported Best Pulses. However, for the Special App, the Best Pulse doesn't match any of the previously reported Pulses, as follows:
Pulse: peak=4.645121, time=45.82, period=9.947, d_freq=2674792128.46, score=1.001, chirp=-14.771, fft_len=256 
Spike: peak=24.31458, time=40.09, d_freq=2674791485.17, chirp=-20.315, fft_len=128k
Pulse: peak=0.2709029, time=45.82, period=0.09916, d_freq=2674792451.61, score=1.082, chirp=26.43, fft_len=64 
Pulse: peak=3.874821, time=45.84, period=9.009, d_freq=2674796407.05, score=1.011, chirp=61.024, fft_len=512 
Pulse: peak=3.920059, time=45.82, period=8.14, d_freq=2674798612.19, score=1.004, chirp=75.017, fft_len=256 
...
Best spike: peak=24.31458, time=40.09, d_freq=2674791485.17, chirp=-20.315, fft_len=128k
Best autocorr: peak=16.87247, time=85.9, delay=2.4451, d_freq=2674793425.33, chirp=-24.528, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.124e+11, d_freq=0,
	score=-12, null_hyp=0, chirp=0, fft_len=0 
Best pulse: peak=4.221204, time=45.84, period=8.998, d_freq=2674796407.05, score=1.101, chirp=61.024, fft_len=512 
Best triplet: peak=0, time=-2.124e+11, period=0, d_freq=0, chirp=0, fft_len=0 

In the corresponding task from an OpenCL app, those Pulses match:
Pulse: peak=4.645125, time=45.82, period=9.947, d_freq=2674792128.46, score=1.001, chirp=-14.771, fft_len=256 
D:	threshold 0.08758591; unscaled peak power: 0.08762821 exceeds threshold for 0.04829%
Spike: peak=24.31457, time=40.09, d_freq=2674791485.17, chirp=-20.315, fft_len=128k
Pulse: peak=0.270903, time=45.82, period=0.09916, d_freq=2674792451.61, score=1.082, chirp=26.43, fft_len=64 
D:	threshold 0.004868537; unscaled peak power: 0.004948388 exceeds threshold for 1.64%
Pulse: peak=4.221204, time=45.84, period=8.998, d_freq=2674796407.05, score=1.101, chirp=61.024, fft_len=512 
D:	threshold 0.1511975; unscaled peak power: 0.1633066 exceeds threshold for 8.009%
Pulse: peak=3.920061, time=45.82, period=8.14, d_freq=2674798612.19, score=1.004, chirp=75.017, fft_len=256 
D:	threshold 0.07585774; unscaled peak power: 0.07607242 exceeds threshold for 0.283%

Best spike: peak=24.31457, time=40.09, d_freq=2674791485.17, chirp=-20.315, fft_len=128k
Best autocorr: peak=16.87248, time=85.9, delay=2.4451, d_freq=2674793425.33, chirp=-24.528, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.124e+011, d_freq=0,
	score=-12, null_hyp=0, chirp=0, fft_len=0 
Best pulse: peak=4.221204, time=45.84, period=8.998, d_freq=2674796407.05, score=1.101, chirp=61.024, fft_len=512 
Best triplet: peak=0, time=-2.124e+011, period=0, d_freq=0, chirp=0, fft_len=0 
Gianfranco Lizzio
Volunteer tester

Joined: 5 May 99
Posts: 39
Credit: 28,049,113
RAC: 87
Italy
Message 1865294 - Posted: 3 May 2017, 5:20:48 UTC
Last modified: 3 May 2017, 5:36:58 UTC

@TBar

Performance between OSX and Linux is almost the same. In Linux I get better times just because I overclock the card: the memory runs at 7700 MHz against the factory 7010 MHz, and the graphics core runs at 1480 MHz. With this overclock the card sits at 50 degrees with the fans at 50%.
I don't want to believe, I want to know!
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1865301 - Posted: 3 May 2017, 5:57:26 UTC - in response to Message 1865294.  
Last modified: 3 May 2017, 6:05:36 UTC

OK, thanks. That helps explain it. I had decided it had to be something about the memory. I have the 2 GB card, and it's running out of memory in Sierra if used as the main display. I changed the wires around to use a 950 for the display, and now I can run the benchmark test at unroll 8 on the 960 without the App crashing. I'm still afraid to try running it in BOINC at unroll 8. Maybe tomorrow. It doesn't seem to make much difference using 6 or 8, though.

One thing I noticed about this version is that it gives better results with the known PulseFind bug. In my benchmark tests some of the tasks known to produce the bad Best Pulse actually pass, and the one task that always produced 2 bad Pulses now only finds one bad Pulse. So, this version is definitely better than past versions with the known issue that has existed since the unroll feature was added. That is one reason it produces fewer Inconclusives, and why everyone should upgrade to this version. One step closer...
Gianfranco Lizzio
Volunteer tester

Joined: 5 May 99
Posts: 39
Credit: 28,049,113
RAC: 87
Italy
Message 1865309 - Posted: 3 May 2017, 6:51:36 UTC - in response to Message 1865301.  
Last modified: 3 May 2017, 6:52:24 UTC

At base clock with blc02 the runtime is 7 min 17 sec.

I hope this can help you.
I don't want to believe, I want to know!
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1865310 - Posted: 3 May 2017, 7:01:56 UTC - in response to Message 1865251.  


If the new app_info.xml has the same version number and plan class entries, then all you have to do is add the new files. It's up to you whether you want to remove the old files.


. . So there is nothing that overwrites apart from the app_info.xml? I will rename the old app_info.xml as a reference before copying the new files across, but I want to be sure I am not going to scrap anything else that might come back to bite me. Will I need to restart BOINC, or will it just pick up on the change when the next new task starts? Or should I execute a "read config files"? I know you said to "just add the new files", but I have ghosted full caches before ... :(

Stephen

?