Linux CUDA 'Special' App finally available, featuring Low CPU use

Author	Message
Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1894398 - Posted: 9 Oct 2017, 21:17:18 UTC - in response to Message 1894385. The shorties run 27 seconds on my 1080Ti. A four second start up delay would be unacceptable for many users. The 780s and other sm_35 do not work with the cuda 80? I'll have to check the source code for __launch_bounds__ in gauss find and other places where I forced the code to be generated to use max 32 registers to allow running with 2048 kernels. There is still a lot to do. . . Hi Petri, . . On the subject of delays, on my Linux rigs tasks have an approx. 12 to 15 sec delay at the app's completion before the next task starts. Is this normal? The app shows 100% complete and the time to run clock shows zero (blank), but the task takes about 8 to 12 secs before changing status to uploading and then another 4 secs or so before showing ready to report. The last part I understand, as it is preparing the result files and uploading them, or do I have that wrong? Is it possible that it takes 8 to 12 secs to prepare the upload files? Stephen ?? ID: 1894398 ·

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1894406 - Posted: 9 Oct 2017, 21:30:45 UTC - in response to Message 1894398. The shorties run 27 seconds on my 1080Ti. A four second start up delay would be unacceptable for many users. The 780s and other sm_35 do not work with the cuda 80? I'll have to check the source code for __launch_bounds__ in gauss find and other places where I forced the code to be generated to use max 32 registers to allow running with 2048 kernels. There is still a lot to do. . . Hi Petri, . . On the subject of delays, on my Linux rigs tasks have an approx. 12 to 15 sec delay at the apps completion before starting the next task. Is this normal? The app shows 100% complete and the time to run clock shows zero (blank) but the tasks takes about 8 to 12 secs before changing status to uploading and then another 4 secs or so before showing ready to report. The last part I understand as it is preparing the result files and uploading them, or do I have that wrong? Is it possible that it takes 8 to 12 secs to prepare the upload files? Stephen ?? Hi Stephen, I have that same problem but with 4-7 seconds delay at the end and only with some tasks. Some tasks finish immediately when reaching 100% and some have this wait. I have a feeling that when a task has gaussian search in it then it shows the delay at the end. But I'm not sure. I'd like to get rid of the end delay. It can not take 4-7 seconds to write files. Petri To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1894406 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1894412 - Posted: 9 Oct 2017, 21:47:37 UTC - in response to Message 1894385. The 780s and other sm_35 do not work with the cuda 80? I'll have to check the source code for __launch_bounds__ in gauss find and other places where I forced the code to be generated to use max 32 registers to allow running with 2048 kernels. There is still a lot to do. Yes, the 3.5 GPUs had problems with the CUDA 8 App, so I compiled a CUDA 6.5 App. This works on the 780s pretty decent, but the TITAN Z still gives many Invalids even with the 6.5 App. This Host could probably do much better if the GPU had a better App, he is currently #8, https://setiathome.berkeley.edu/results.php?hostid=8323950 That 6.5 App works fine on my 750s, 950s, and 1050s, it has to be something with the cc 3.5 GPUs. OK, so, a Pascal App and a separate 5.0 & 5.2 App. Then whatever you decide for the 3.5 GPUs. Right now there isn't much of a speedup on the BLC tasks, my 750s actually look a little slower on the BLCs. Anyway you could set the callback for the PulseFind before posting any new Apps? ID: 1894412 ·

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1894421 - Posted: 9 Oct 2017, 22:33:00 UTC - in response to Message 1894412. The 780s and other sm_35 do not work with the cuda 80? I'll have to check the source code for __launch_bounds__ in gauss find and other places where I forced the code to be generated to use max 32 registers to allow running with 2048 kernels. There is still a lot to do. Yes, the 3.5 GPUs had problems with the CUDA 8 App, so I compiled a CUDA 6.5 App. This works on the 780s pretty decent, but the TITAN Z still gives many Invalids even with the 6.5 App. This Host could probably do much better if the GPU had a better App, he is currently #8, https://setiathome.berkeley.edu/results.php?hostid=8323950 That 6.5 App works fine on my 750s, 950s, and 1050s, it has to be something with the cc 3.5 GPUs. OK, so, a Pascal App and a separate 5.0 & 5.2 App. Then whatever you decide for the 3.5 GPUs. Right now there isn't much of a speedup on the BLC tasks, my 750s actually look a little slower on the BLCs. Anyway you could set the callback for the PulseFind before posting any new Apps? Yep, sounds right. And I'll look (1) at the possible explanations for the 3.5 problems. The callback is already implemented in the main FFT and in the autocorr FFT in the s2 version. PulseFind needs neither its own FFT nor a callback. In the future I'm going to test dynamic parallelism on long pulse finds. I'm going to have to make some test builds for the 1050 (or Ti) to address the fact that the latest exe is slower on some WUs on the GTX 1050. I'll look (2) at the kernel startup code and the pulse find fold 5, 4, 3 and 2 times code for any changes that may have caused a slowdown. Petri To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1894421 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1894429 - Posted: 9 Oct 2017, 23:22:19 UTC - in response to Message 1894406. . . Hi Petri, . . On the subject of delays, on my Linux rigs tasks have an approx. 12 to 15 sec delay at the apps completion before starting the next task. Is this normal? The app shows 100% complete and the time to run clock shows zero (blank) but the tasks takes about 8 to 12 secs before changing status to uploading and then another 4 secs or so before showing ready to report. The last part I understand as it is preparing the result files and uploading them, or do I have that wrong? Is it possible that it takes 8 to 12 secs to prepare the upload files? Stephen ?? Hi Stephen, I have that same problem but with 4-7 seconds delay at the end and only with some tasks. Some tasks finish immediately when reaching 100% and some have this wait. I have a feeling that when a task has gaussian search in it then it shows the delay at the end. But I'm not sure. I'd like to get rid of the end delay. It can not take 4-7 seconds to write files. Petri . . That was my thought as well. Such a long delay to write 20K to 30K files seems absurd. Stephen .. ID: 1894429 ·

Bruce Volunteer tester Send message Joined: 15 Mar 02 Posts: 123 Credit: 124,955,234 RAC: 11	Message 1894431 - Posted: 9 Oct 2017, 23:58:52 UTC - in response to Message 1894412. The 780s and other sm_35 do not work with the cuda 80? I'll have to check the source code for __launch_bounds__ in gauss find and other places where I forced the code to be generated to use max 32 registers to allow running with 2048 kernels. There is still a lot to do. Yes, the 3.5 GPUs had problems with the CUDA 8 App, so I compiled a CUDA 6.5 App. This works on the 780s pretty decent, but the TITAN Z still gives many Invalids even with the 6.5 App. This Host could probably do much better if the GPU had a better App, he is currently #8, https://setiathome.berkeley.edu/results.php?hostid=8323950 That 6.5 App works fine on my 750s, 950s, and 1050s, it has to be something with the cc 3.5 GPUs. OK, so, a Pascal App and a separate 5.0 & 5.2 App. Then whatever you decide for the 3.5 GPUs. Right now there isn't much of a speedup on the BLC tasks, my 750s actually look a little slower on the BLCs. Anyway you could set the callback for the PulseFind before posting any new Apps? Petri & TBar, if you guys can find the problem with the x41p_zi3t2b cuda65 app, I would be more than happy to act as guinea pig. Just let me know when and where I can download it when the time comes. Thanks. *Bruce* ID: 1894431 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1894469 - Posted: 10 Oct 2017, 3:40:19 UTC Last modified: 10 Oct 2017, 3:51:58 UTC Quite a significant difference in the Best Pulse on this WU. Workunit 2705262578 (07ap07aa.16319.13160.7.34.221) Task 6080947466 (S=1, A=0, P=0, T=9, G=0, BG=0) v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin Task 6080947467 (S=1, A=0, P=0, T=9, G=0, BG=0) x41p_zi3xs2, Cuda 9.00 special One of my machines holds the tiebreaker. EDIT: I have to assume that Best Pulse is the issue on this one, too, though the differences in peak, period and score are pretty small. Workunit 2705643325 (08ap07aa.6131.361879.11.38.45) Task 6081742892 (S=6, A=1, P=0, T=10, G=0, BG=0) x41p_zi3xs2, Cuda 9.00 special Task 6081742893 (S=6, A=1, P=0, T=10, G=0, BG=0) x41p_zi3t2b, Cuda 8.00 special The tiebreaker belongs to a v8.08 (alt) windows_x86_64 host. ID: 1894469 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1894526 - Posted: 11 Oct 2017, 0:10:12 UTC - in response to Message 1894469. Quite a significant difference in the Best Pulse on this WU. Workunit 2705262578 (07ap07aa.16319.13160.7.34.221) Task 6080947466 (S=1, A=0, P=0, T=9, G=0, BG=0) v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin Task 6080947467 (S=1, A=0, P=0, T=9, G=0, BG=0) x41p_zi3xs2, Cuda 9.00 special One of my machines holds the tiebreaker. So much for tiebreaking. My host showed yet another significantly different Best Pulse. The three apps and their reported Best Pulses are: v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin: peak=7.699861, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256 x41p_zi3xs2, Cuda 9.00 special: peak=0.751317, time=13.42, period=0.02444, d_freq=1419661865.23, score=0.7804, chirp=0, fft_len=8 x41p_zi3v, Cuda 8.00 special: peak=0.6058947, time=41.94, period=0.01732, d_freq=1419654541.02, score=0.8102, chirp=0, fft_len=8 The WU is now in the hands of a fourth host. Not good. ID: 1894526 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1894531 - Posted: 11 Oct 2017, 0:37:23 UTC - in response to Message 1894526. On this one the CUDA 9 App was given canonical, Workunit 2705643325 canonical result 6081742892 : Task 6081742892 = Computer 6906726 ID: 1894531 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1894537 - Posted: 11 Oct 2017, 1:27:21 UTC - in response to Message 1894531. On this one the CUDA 9 App was given canonical, Workunit 2705643325 canonical result 6081742892 : Task 6081742892 = Computer 6906726 Yeah, the differences seemed pretty minor to begin with. The x41p_zi3xs2 and the v8.08 (alt) ended up the closest, but the x41p_zi3t2b got credit, too. If I have time tomorrow, perhaps I'll make offline Windows CPU runs with both this WU and the other one. ID: 1894537 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1894547 - Posted: 11 Oct 2017, 2:25:13 UTC - in response to Message 1894526. Quite a significant difference in the Best Pulse on this WU. Workunit 2705262578 (07ap07aa.16319.13160.7.34.221) Task 6080947466 (S=1, A=0, P=0, T=9, G=0, BG=0) v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin Task 6080947467 (S=1, A=0, P=0, T=9, G=0, BG=0) x41p_zi3xs2, Cuda 9.00 special One of my machines holds the tiebreaker. So much for tiebreaking. My host showed yet another significantly different Best Pulse. The three apps and their reported Best Pulses are: v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin: peak=7.699861, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256 x41p_zi3xs2, Cuda 9.00 special: peak=0.751317, time=13.42, period=0.02444, d_freq=1419661865.23, score=0.7804, chirp=0, fft_len=8 x41p_zi3v, Cuda 8.00 special: peak=0.6058947, time=41.94, period=0.01732, d_freq=1419654541.02, score=0.8102, chirp=0, fft_len=8 The WU is now in the hands of a fourth host. Not good. To finish this one off, the 4th host has reported, matched the 1st one, and everybody got validated in the end, even though both versions of the Special App appear to have missed the mark by quite a bit. v8.22 (opencl_nvidia_SoG): peak=7.699859, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256 Keep in mind, this was not an overflow WU. This was a high AR Arecibo WU that ran to full term. ID: 1894547 ·

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1894558 - Posted: 11 Oct 2017, 3:39:29 UTC - in response to Message 1894547. Quite a significant difference in the Best Pulse on this WU. Workunit 2705262578 (07ap07aa.16319.13160.7.34.221) Task 6080947466 (S=1, A=0, P=0, T=9, G=0, BG=0) v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin Task 6080947467 (S=1, A=0, P=0, T=9, G=0, BG=0) x41p_zi3xs2, Cuda 9.00 special One of my machines holds the tiebreaker. So much for tiebreaking. My host showed yet another significantly different Best Pulse. The three apps and their reported Best Pulses are: v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin: peak=7.699861, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256 x41p_zi3xs2, Cuda 9.00 special: peak=0.751317, time=13.42, period=0.02444, d_freq=1419661865.23, score=0.7804, chirp=0, fft_len=8 x41p_zi3v, Cuda 8.00 special: peak=0.6058947, time=41.94, period=0.01732, d_freq=1419654541.02, score=0.8102, chirp=0, fft_len=8 The WU is now in the hands of a fourth host. Not good. To finish this one off, the 4th host has reported, matched the 1st one, and everybody got validated in the end, even though both versions of the Special App appear to have missed the mark by quite a bit. v8.22 (opencl_nvidia_SoG): peak=7.699859, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256 Keep in mind, this was not an overflow WU. This was a high AR Arecibo WU that ran to full term. Keep in mind this packet has no reportable pulses. The best non-reportable pulse is eye candy. These "signals" are so faint that they are most probably noise, or so near the limit of computational precision that a different summation order of floating point values can give a different result. There is a reason they are not reported as found pulses. 
Petri To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1894558 ·
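
A minimal sketch of the summation-order effect mentioned in the post above, assuming ordinary IEEE 754 double precision. It is not taken from any of the apps discussed in this thread; the three values are chosen purely for illustration, and `sum_in_order` is a hypothetical helper.

```python
import math

def sum_in_order(values):
    """Accumulate left to right, like a simple sequential loop."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers in two different orders (illustrative values only).
a = [1e16, 1.0, -1e16]
b = [1e16, -1e16, 1.0]

left_to_right = sum_in_order(a)  # the 1.0 is lost to rounding in 1e16 + 1.0
reordered = sum_in_order(b)      # the large terms cancel first, so the 1.0 survives
exact = math.fsum(a)             # correctly rounded sum, independent of order

print(left_to_right, reordered, exact)
```

Reordering changes the result only through rounding of intermediate sums, so the effect is normally confined to the last few significant digits unless large terms cancel, as they are made to do here.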

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1894571 - Posted: 11 Oct 2017, 5:00:51 UTC - in response to Message 1894558. Keep in mind this packet has no reportable pulses. The best non reportable is eye candy. They are so faint "signals" that they are most probably noise or so near the computational precision that any different summation order of floating point values gives always a different result. There is a reason they are not reported as found pulses. Petri But there is apparently also a reason why they are reported as Best Pulses and need to be validated accordingly. Once again, that's a decision that was made by the project administrators/scientists, and it's certainly not up to application developers to arbitrarily ignore whatever is the existing standard, simply to squeeze out a little more speed. Accuracy and consistency come first, speed comes second. ID: 1894571 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1894605 - Posted: 11 Oct 2017, 13:05:15 UTC - in response to Message 1894571. Last modified: 11 Oct 2017, 13:07:19 UTC It would seem it's the Same old Race condition that's been present since the Unroll was added. Since the two CUDA Apps didn't Validate, the chance of Cross Validation is Low. It's been like this for a while and was labeled harmless by other developers some time ago. As far as I know, the Best Gaussian problem previously discovered with the Windows SoG App DOES cross validate, and STILL EXISTS. You don't seem very concerned about that problem, and it's actually more troublesome than an occasional race condition with the Best Pulse. ID: 1894605 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1894674 - Posted: 11 Oct 2017, 18:38:34 UTC - in response to Message 1894605. Last modified: 11 Oct 2017, 18:39:41 UTC It would seem it's the Same old Race condition that's been present since the Unroll was added. Since the two CUDA Apps didn't Validate, the chance of Cross Validation is Low. It's been like this for a while and was labeled harmless by other developers some time ago. As far as I know, the previously discovered Best Gaussian problem discovered with the Windows SoG App DOES cross validate, and STILL EXISTS. You don't seem very concerned about that problem, and it's actually more troublesome than an occasional race condition with the Best Pulse. It's not harmless. If it's the same PulseFind issue as before, it's a true bug. Peaks of about 7.7 and 0.75 are too different to be explained by limited computational precision. And there is no need to point to one bug to hide or diminish another. A bug is a bug anyway. Do we have this task offline? [quote]Quite a significant difference in the Best Pulse on this WU. Workunit 2705262578 (07ap07aa.16319.13160.7.34.221) Task 6080947466 (S=1, A=0, P=0, T=9, G=0, BG=0) v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin Task 6080947467 (S=1, A=0, P=0, T=9, G=0, BG=0) x41p_zi3xs2, Cuda 9.00 special One of my machines holds the tiebreaker. So much for tiebreaking. My host showed yet another significantly different Best Pulse. The three apps and their reported Best Pulses are: v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin: peak=7.699861, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256 x41p_zi3xs2, Cuda 9.00 special: peak=0.751317, time=13.42, period=0.02444, d_freq=1419661865.23, score=0.7804, chirp=0, fft_len=8 x41p_zi3v, Cuda 8.00 special: peak=0.6058947, time=41.94, period=0.01732, d_freq=1419654541.02, score=0.8102, chirp=0, fft_len=8 The WU is now in the hands of a fourth host. Not good. 
To finish this one off, the 4th host has reported, matched the 1st one, and everybody got validated in the end, even though both versions of the Special App appear to have missed the mark by quite a bit. v8.22 (opencl_nvidia_SoG): peak=7.699859, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256 Keep in mind, this was not an overflow WU. This was a high AR Arecibo WU that ran to full term. SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1894674 ·
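
To put numbers on the "too different to be explained by precision" argument, here is a small sketch that compares the Best Pulse peaks quoted in this thread for WU 2705262578. The 1% tolerance and the helper names `relative_difference` and `classify` are assumptions made for illustration; they are not the project validator's actual criteria.

```python
# Best Pulse peaks as reported in this thread for WU 2705262578.
peaks = {
    "v8.20 opencl_ati5_SoG_mac": 7.699861,
    "v8.22 opencl_nvidia_SoG": 7.699859,
    "x41p_zi3xs2 Cuda 9.00": 0.751317,
    "x41p_zi3v Cuda 8.00": 0.6058947,
}

REFERENCE = "v8.20 opencl_ati5_SoG_mac"
TOLERANCE = 0.01  # hypothetical 1% tolerance, NOT the real validator setting

def relative_difference(value, reference):
    """Absolute difference expressed as a fraction of the reference value."""
    return abs(value - reference) / abs(reference)

def classify(peaks, reference_name, tolerance):
    """Map each app to True if its peak is within tolerance of the reference."""
    ref = peaks[reference_name]
    return {name: relative_difference(p, ref) <= tolerance
            for name, p in peaks.items()}

agreement = classify(peaks, REFERENCE, TOLERANCE)
for name, ok in agreement.items():
    rel = relative_difference(peaks[name], peaks[REFERENCE])
    print(f"{name}: rel. diff {rel:.2e} -> {'agrees' if ok else 'differs'}")
```

The two OpenCL results differ only in the seventh significant digit, which is the scale at which single-precision rounding would plausibly show up, while the two Special App peaks are off by roughly 90% of the reference value.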

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1894682 - Posted: 11 Oct 2017, 19:02:15 UTC - in response to Message 1894674. Do we have this task offline? Here you go, Raistmer: WU2705262578 ID: 1894682 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1894685 - Posted: 11 Oct 2017, 19:08:50 UTC - in response to Message 1894682. Do we have this task offline? Here you go, Raistmer: WU2705262578 Got it, thanks. I still have to restore my build environment to hunt any bugs in OpenCL, and I'm very limited on free time to set up a Linux host to help with bug hunting in Petri's app, but the test case is reserved for the future... SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1894685 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1894687 - Posted: 11 Oct 2017, 19:23:25 UTC - in response to Message 1894685. I've already run it on a couple machines, the CPU App gives; Best pulse: peak=7.699858, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256 The ATI App 3505 pretty much agrees with the CPU App. I'd say this is the occasional Best Pulse problem left over after the last attempt to fix the race as it doesn't matter if you run it at unroll 1 or 8. Obviously it happens infrequently or the Inconclusive rate would be much higher. ID: 1894687 ·

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1894701 - Posted: 11 Oct 2017, 19:56:41 UTC Hi, zi3t2 may report wrong pulses from time to time. It should not be used. The s2 can also sometimes report things wrong, but it and the latest cuda8 should be used instead. Period. On the pulse issue that is not a pulse (not a reported one): do not look at the peak. Look at the score. The score is used to determine whether a pulse should be reported. The s2 sometimes misses one, but that is rare. If the administration says that a pulse should be reported then it will be, and they allow half of them to be wrong. If the score is below a given threshold, the pulse is reported as best so far just to keep the screen saver happy and to allow educated guesses about a sequential app's inner workings. There is no scientific meaning in those best but not reported pulses. They are there to prevent faking: otherwise one could claim that no pulses were found without scanning through all the possibilities. The best but not reported pulse is a sanity check. If my app sometimes fails that, it is not such a big deal. And I'm working on it. The bigger problem is that there are people running zi3t2, which is faster but sometimes does not report all true pulses. The t2 has a parallel-only pulse search (it is fast) but it is not valid. The s2 is much better. When it finds a suspect best or a true pulse it reverts to sequential search. The t2 does not. So: stop using t2, even though it is faster than the s2 on 1050 or lesser cards. And eye candy is still eye candy. It can detect a fraudulent attempt to gain score by not doing any work at all. It is good for that. My SW does all the work needed. No faking. Everything is computed. The problem is in the reporting (storing intermediate results on the same PoT), in my lack of time during the weeks when I have to go to work, and in the day having only 24 hours during the weekends. 
I still like to keep this as a hobby. Petri To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1894701 ·
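
The distinction drawn above between reported pulses and the best but not reported pulse can be sketched as follows. This is only an illustration of the idea as described in the post: the threshold of 1.0, the `Pulse` type, and the `triage` function are all hypothetical, and the sample candidates simply reuse peak/score pairs quoted earlier in the thread as numbers to work with.

```python
from dataclasses import dataclass

REPORT_THRESHOLD = 1.0  # hypothetical value, assumed for illustration only

@dataclass
class Pulse:
    peak: float
    score: float

def triage(candidates, threshold=REPORT_THRESHOLD):
    """Split candidates into reported pulses and one best-so-far pulse.

    Selection uses the score, not the peak, following the post above.
    """
    reported = [p for p in candidates if p.score >= threshold]
    best = max(candidates, key=lambda p: p.score, default=None)
    return reported, best

# Sample numbers only: peak/score pairs quoted earlier in this thread.
candidates = [
    Pulse(peak=0.751317, score=0.7804),
    Pulse(peak=0.6058947, score=0.8102),
    Pulse(peak=7.699861, score=0.9625),
]

reported, best = triage(candidates)
print(len(reported), best)
```

Under these assumptions nothing crosses the threshold, so nothing is reported, yet the best-so-far entry still depends on every candidate having been examined, which is why it can serve as a sanity check.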

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1894702 - Posted: 11 Oct 2017, 20:00:02 UTC - in response to Message 1894685. Last modified: 11 Oct 2017, 20:02:48 UTC Do we have this task offline? Here you go, Raistmer: WU2705262578 Got it, thanks. Still had to restore building environment to hunt any bugs in OpenCL and very limited on free time to setup Linux host to help with Petri's app bughunting but TestCase reserved for the future... Thanks for any help you can provide. All help needed. Insights, ideas, ... Thank you. EDIT: Saved the wu too to include it into my development test cycle. Petri To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1894702 ·
