Linux CUDA 'Special' App finally available, featuring Low CPU use

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 1875076 - Posted: 25 Jun 2017, 23:11:11 UTC - in response to Message 1875060.
...All the reported signals and Best signals seem to match between the two.
If implemented as I picture: for the pulse mechanism shunt/workaround, the stderr.txt 'realtime' log might see the racing pulse detections, then shunt to unroll 1 to record the correct ones. If that's the case, it does reflect reality in the new 'racey-fixey' kind of way, but may need to be presented more clearly.
Ah, so perhaps the actual Result file would contain a different Best Pulse value than the Stderr shows?
Possible, though I'm making a lot of assumptions there. Naturally the result file is the important one. Prior assumptions about processing order versus printing order probably become somewhat muddy once parallelism and reprocessing are involved, while stderr is sequential. Something that will no doubt have to be de-confused as we go along.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Message 1875103 - Posted: 26 Jun 2017, 1:45:51 UTC - in response to Message 1875014.
The only way to tell for sure is to run the task with your CPU and compare the results. You should give that a try, you can run a CPU task in the benchmark App while running BOINC. Just reduce the CPU usage by One in BOINC and remove any Apps from the APPS folder in the Benchmark package. The CPU App in the REF_APPS folder will search the WU folder and run any task it doesn't have results for. The Benchmark tool is here, KWSN Linux MB Bench v2.01.08. Extract the KWSN-Bench-Linux-MBv7_v2.01.08.7z to your Home folder and run it from there.
Okay, I tried running it with the Windows CPU app that I use here on my daily driver. It almost perfectly matches the v8.22 (opencl_ati_cat132) result.
Workunit 2567983999 (20oc08aa.4777.254820.12.39.5)
Task 5794100079 (S=10, A=3, P=0, T=0, G=0, BG=0) v8.22 (opencl_ati_cat132) windows_intelx86
Task 5829376759 (S=10, A=3, P=0, T=0, G=0, BG=0) x41p_zi3v, Cuda 8.00 special
v8.22 (opencl_ati_cat132) windows_intelx86 - Best pulse: peak=0.4685673, time=98.45, period=0.01441, d_freq=1420048834.69, score=0.9218, chirp=-61.928, fft_len=8
x41p_zi3v, Cuda 8.00 special - Best pulse: peak=0.3951461, time=68.92, period=0.0147, d_freq=1420052490.23, score=0.7774, chirp=0, fft_len=8
MB8_win_x86_SSE3_VS2008_r3330 - Best pulse: peak=0.4685681, time=98.45, period=0.01441, d_freq=1420048834.69, score=0.9218, chirp=-61.928, fft_len=8
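For anyone wanting to check the three Best pulse lines above mechanically rather than by eye, here is a minimal sketch of a field-by-field comparison. The BestPulse struct, the function names, and the 1% relative tolerance are assumptions made for illustration; they are not taken from the validator or from the benchmark's comparison tool.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the fields printed in a "Best pulse" line.
struct BestPulse {
    double peak;
    double time;
    double period;
    double d_freq;
    double score;
    double chirp;
    int fft_len;
};

// Relative closeness test. The tolerance is an illustrative assumption,
// not the project's actual validation threshold.
inline bool close_rel(double a, double b, double tol) {
    const double scale = std::fmax(std::fabs(a), std::fabs(b));
    if (scale == 0.0) return true;  // both values are exactly zero
    return std::fabs(a - b) <= tol * scale;
}

// Two best pulses agree only if every reported field agrees.
inline bool same_best_pulse(const BestPulse& a, const BestPulse& b,
                            double tol = 0.01) {
    return a.fft_len == b.fft_len
        && close_rel(a.peak, b.peak, tol)
        && close_rel(a.time, b.time, tol)
        && close_rel(a.period, b.period, tol)
        && close_rel(a.d_freq, b.d_freq, tol)
        && close_rel(a.score, b.score, tol)
        && close_rel(a.chirp, b.chirp, tol);
}
```

With the values posted above, the OpenCL and CPU results compare as the same pulse (peak differs only in the sixth decimal place), while the Cuda special result differs in peak, time, score, and chirp, so it is a different pulse rather than a rounding variation.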

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 1875107 - Posted: 26 Jun 2017, 1:56:32 UTC - in response to Message 1875103. Last modified: 26 Jun 2017, 2:04:22 UTC
The only way to tell for sure is to run the task with your CPU and compare the results. You should give that a try, you can run a CPU task in the benchmark App while running BOINC. Just reduce the CPU usage by One in BOINC and remove any Apps from the APPS folder in the Benchmark package. The CPU App in the REF_APPS folder will search the WU folder and run any task it doesn't have results for. The Benchmark tool is here, KWSN Linux MB Bench v2.01.08. Extract the KWSN-Bench-Linux-MBv7_v2.01.08.7z to your Home folder and run it from there.
Okay, I tried running it with the Windows CPU app that I use here on my daily driver. It almost perfectly matches the v8.22 (opencl_ati_cat132) result.
Workunit 2567983999 (20oc08aa.4777.254820.12.39.5)
Task 5794100079 (S=10, A=3, P=0, T=0, G=0, BG=0) v8.22 (opencl_ati_cat132) windows_intelx86
Task 5829376759 (S=10, A=3, P=0, T=0, G=0, BG=0) x41p_zi3v, Cuda 8.00 special
v8.22 (opencl_ati_cat132) windows_intelx86 - Best pulse: peak=0.4685673, time=98.45, period=0.01441, d_freq=1420048834.69, score=0.9218, chirp=-61.928, fft_len=8
x41p_zi3v, Cuda 8.00 special - Best pulse: peak=0.3951461, time=68.92, period=0.0147, d_freq=1420052490.23, score=0.7774, chirp=0, fft_len=8
MB8_win_x86_SSE3_VS2008_r3330 - Best pulse: peak=0.4685681, time=98.45, period=0.01441, d_freq=1420048834.69, score=0.9218, chirp=-61.928, fft_len=8
Are you able to cross compare that with Cuda Baseline? It'll narrow down where to look once I get to the special code.
[Edit:] Which branch is that MB8 derived from? Stock seti_boinc master? Or AKv8? The difference may be important here. [Which one(s) differ from the reference Windows/x86 8.00 may point in the right direction.]

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1875108 - Posted: 26 Jun 2017, 2:21:25 UTC - in response to Message 1875107.
Which branch is that MB8 derived from? Stock seti_boinc master? Or AKv8? The difference may be important here. [Which one(s) differ from the reference Windows/x86 8.00 may point in the right direction.]
My OSX CPU App r3344 is from AKv8 and it has the same results as r3330.

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Message 1875109 - Posted: 26 Jun 2017, 2:22:29 UTC - in response to Message 1875107.
Are you able to cross compare that with Cuda Baseline? It'll narrow down where to look once I get to the special code.
Can that also be run with the MBBench 2.10?
[Edit:] Which branch is that MB8 derived from? Stock seti_boinc master? Or AKv8? The difference may be important here. [Which one(s) differ from the reference Windows/x86 8.00 may point in the right direction.]
No clue. That answer will have to come from elsewhere. ;^)

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 1875111 - Posted: 26 Jun 2017, 2:29:06 UTC Last modified: 26 Jun 2017, 2:35:53 UTC
Hmmm, AK8 branch might be missing Joe's fix from ~2011? (svn posted earlier in thread):
gaussfit.cpp (stock seti_boinc branch):
    report = chisqOK   // chisqOK is (ChiSq <= swi.analysis_cfg.gauss_chi_sq_thresh)
        && (gi.g.peak_power >= gi.g.mean_power * swi.analysis_cfg.gauss_peak_power_thresh)
        && (gi.g.null_chisqr >= swi.analysis_cfg.gauss_null_chi_sq_thresh);
    if (gaussian_count==0 || report) {
        gi.score = score_offset
            +lcgf(0.5*gauss_dof,std::max(gi.g.chisqr*0.5*gauss_bins,0.5*gauss_dof+1))
            -lcgf(0.5*null_dof,std::max(gi.g.null_chisqr*0.5*gauss_bins,0.5*null_dof+1));
    }
    // Only include "real" Gaussians (those meeting the chisqr threshold)
    // in the best Gaussian display.
    if (gi.score > best_gauss->score && chisqOK) {
        *best_gauss = gi;
        ....
The special appears to have it, as does Cuda baseline.
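To make the effect of that chisqOK condition concrete, here is a small sketch contrasting best-Gaussian selection with and without the extra check. GaussCandidate and both function names are hypothetical simplifications for illustration; this is not the seti_boinc code, and it does not claim to show what any particular build actually does.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified candidate holding only what the rule needs.
struct GaussCandidate {
    double score;
    bool chisq_ok;  // stands in for ChiSq <= gauss_chi_sq_thresh
};

// With the check: a candidate becomes the best only if it scores higher
// AND meets the chi-square threshold.
inline double best_score_with_check(const std::vector<GaussCandidate>& cands,
                                    double initial_best) {
    double best = initial_best;
    for (const auto& g : cands) {
        if (g.score > best && g.chisq_ok) best = g.score;
    }
    return best;
}

// Without the check: the highest score wins, even if the fit is poor.
inline double best_score_without_check(const std::vector<GaussCandidate>& cands,
                                       double initial_best) {
    double best = initial_best;
    for (const auto& g : cands) {
        if (g.score > best) best = g.score;
    }
    return best;
}
```

If a high-scoring candidate fails the chi-square threshold, the two rules pick different best Gaussians, which is the kind of disagreement that would surface as a Best Gaussian mismatch while all reported signals still agree.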

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 1875112 - Posted: 26 Jun 2017, 2:32:21 UTC - in response to Message 1875108. Last modified: 26 Jun 2017, 2:34:49 UTC
Which branch is that MB8 derived from? Stock seti_boinc master? Or AKv8? The difference may be important here. [Which one(s) differ from the reference Windows/x86 8.00 may point in the right direction.]
My OSX CPU App r3344 is from AKv8 and it has the same results as r3330.
Probably that fix is missing from the AK-derived builds then. [That seems to be the case, and the same file (sah_v7_opt\AKv8\client\gaussfit.cpp) also includes a SIGNALS_ON_GPU path.]

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1875114 - Posted: 26 Jun 2017, 2:37:22 UTC - in response to Message 1875111. Last modified: 26 Jun 2017, 2:41:38 UTC
All the OpenCL MB Apps come from AKv8 as far as I know. That includes the Apps that don't use the SoG path and work, such as my r3567 and the Non-SoG r3584 Linux App.

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 1875115 - Posted: 26 Jun 2017, 2:39:08 UTC - in response to Message 1875114. Last modified: 26 Jun 2017, 2:52:50 UTC
All the OpenCL Apps come from AKv8 as far as I know. That includes the Apps that don't use the SoG path and work, such as my r3567 and the Non-SoG r3584 Linux App.
Ugh, that's a lot of builds if [some] really are missing the fix, as it appears. [Raistmer will probably have to identify which ones use codebases with the fix, as there are a lot of alternate codepaths there.]
[Edit:] Some paths appear to have their own implementation of something similar, some not.

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1875137 - Posted: 26 Jun 2017, 5:53:22 UTC - in response to Message 1875115. Last modified: 26 Jun 2017, 5:55:22 UTC
It seems to be a real cluster to me. Apparently the Pulsefind has nothing to do with Best Pulse, as running with unroll 1 doesn't help. The task 23se08ac.6875.22968.6.33.135 is particularly nasty, as not much seems to work with it, not even the Linux Baseline CUDA 4.2 App. The zi3v works with it on the Mac, and the OpenCL App r3567 works with it in Linux. Thankfully these tasks are rare, so most people will be oblivious to them. I'll post the results and let others decipher them.
MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu is from AKv8 and has been working well on 3 of my machines for well over a year.
MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu uses the NV path but not the SoG path.
setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG is the current stock Linux App and uses the SoG path.
setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 is the zi3k source with the Gaussian fix from zi3s; I wanted to see if the new changes since zi3k were at fault.
setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 is the Baseline App from Xbranch.
KWSN-Linux-MBbench v2.1.08
Running on TBarxxxx at Mon 26 Jun 2017 03:37:23 AM UTC
----------------------------------------------------------------
Starting benchmark run...
---------------------------------------------------------------- Listing wu-file(s) in /testWUs : 09no16aa.18442.2116.6.33.31.wu 20oc08aa.4777.254820.12.39.5.wu 23se08ac.6875.22968.6.33.135.wu Listing executable(s) in /APPS : MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80 Listing executable in /REF_APPS : MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu ---------------------------------------------------------------- Current WU: 09no16aa.18442.2116.6.33.31.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 6283 seconds ---------------------------------------------------------------- Running app with command : .......... MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 466 seconds Speed compared to default : ......... 1348 % ----------------- Comparing results Result : Strongly similar, Q= 99.91% ---------------------------------------------------------------- Running app with command : .......... setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 412 seconds Speed compared to default : ......... 
1525 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 3 3 3 0 0 3 3 3 0 Autocorr 0 2 2 2 0 0 2 2 2 0 Gaussian 0 3 3 3 0 0 3 3 3 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 0 0 0 0 0 0 0 0 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 0 0 0 0 0 0 0 0 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 0 12 12 12 1 0 12 12 12 1 Unmatched signal(s) in R1 at line(s) 563 Unmatched signal(s) in R2 at line(s) 563 For R1:R2 matched signals only, Q= 99.91% Result : Weakly similar. ---------------------------------------------------------------- Running app with command : .......... setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80 -unroll 1 -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=18874368 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=4194304 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocating tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=3145728 CUDA stream priority range: low 0 and high: -1 GPSF 1.618911 2 3.229842 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes ............................................................................................................................................................................................................................................................................................. Best scores written Out file closed Cuda free done Cuda device reset done Elapsed Time : ...................... 200 seconds Speed compared to default : ......... 
3141 % ----------------- Comparing results Result : Strongly similar, Q= 99.81% ---------------------------------------------------------------- Done with 09no16aa.18442.2116.6.33.31.wu ==================================================================== Current WU: 20oc08aa.4777.254820.12.39.5.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 3396 seconds ---------------------------------------------------------------- Running app with command : .......... MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 404 seconds Speed compared to default : ......... 840 % ----------------- Comparing results Result : Strongly similar, Q= 99.97% ---------------------------------------------------------------- Running app with command : .......... setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 360 seconds Speed compared to default : ......... 943 % ----------------- Comparing results Result : Strongly similar, Q= 99.97% ---------------------------------------------------------------- Running app with command : .......... 
setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80 -unroll 1 -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=18874368 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=4194304 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocating tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=3145728 CUDA stream priority range: low 0 and high: -1 GPSF 0.830732 1 1.732689 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes ................................................................................................................................................................................................... Best scores written Out file closed Cuda free done Cuda device reset done Elapsed Time : ...................... 131 seconds Speed compared to default : ......... 2592 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 10 10 10 0 0 10 10 10 0 Autocorr 0 3 3 3 0 0 3 3 3 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 0 0 0 0 0 0 0 0 0 Triplet 0 0 0 0 0 0 0 0 0 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 1 1 1 1 0 1 1 1 1 0 Best Pulse 0 0 0 0 1 0 0 0 0 1 Best Triplet 0 0 0 0 0 0 0 0 0 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 1 16 16 16 1 1 16 16 16 1 Unmatched signal(s) in R1 at line(s) 607 Unmatched signal(s) in R2 at line(s) 607 For R1:R2 matched signals only, Q= 99.97% Result : Weakly similar. 
---------------------------------------------------------------- Done with 20oc08aa.4777.254820.12.39.5.wu ==================================================================== Current WU: 23se08ac.6875.22968.6.33.135.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 8171 seconds ---------------------------------------------------------------- Running app with command : .......... MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 608 seconds Speed compared to default : ......... 1343 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 3 3 3 0 0 3 3 3 0 Autocorr 0 0 0 0 0 0 0 0 0 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 3 3 3 0 0 3 3 3 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 1 1 1 0 0 1 1 1 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 0 11 11 11 1 0 11 11 11 1 Unmatched signal(s) in R1 at line(s) 500 Unmatched signal(s) in R2 at line(s) 500 For R1:R2 matched signals only, Q= 99.97% Result : Weakly similar. ---------------------------------------------------------------- Running app with command : .......... setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 545 seconds Speed compared to default : ......... 
1499 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 3 3 3 0 0 3 3 3 0 Autocorr 0 0 0 0 0 0 0 0 0 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 3 3 3 0 0 3 3 3 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 1 1 1 0 0 1 1 1 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 0 11 11 11 1 0 11 11 11 1 Unmatched signal(s) in R1 at line(s) 500 Unmatched signal(s) in R2 at line(s) 500 For R1:R2 matched signals only, Q= 99.97% Result : Weakly similar. ---------------------------------------------------------------- Running app with command : .......... setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80 -unroll 1 -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=18874368 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=4194304 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocating tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=3145728 CUDA stream priority range: low 0 and high: -1 GPSF 3.034932 3 5.351412 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes ..................................................................................................................................................................................................................................................................................................................................................................................... Best scores written Out file closed Cuda free done Cuda device reset done Elapsed Time : ...................... 289 seconds Speed compared to default : ......... 
2827 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 1 3 3 3 0 1 3 3 3 0 Autocorr 0 0 0 0 0 0 0 0 0 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 3 3 3 0 0 3 3 3 0 Best Spike 1 1 1 1 0 1 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 1 1 1 0 0 1 1 1 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 2 11 11 11 1 2 11 11 11 1 Unmatched signal(s) in R1 at line(s) 500 Unmatched signal(s) in R2 at line(s) 500 For R1:R2 matched signals only, Q= 99.99% Result : Weakly similar. ---------------------------------------------------------------- Done with 23se08ac.6875.22968.6.33.135.wu ==================================================================== Done with Benchmark run! Removing temporary files! tbar@TBar-iSETI:~/KWSN-Bench-Linux-MBv7$ ./benchmark KWSN-Linux-MBbench v2.1.08 Running on TBar-iSETI at Mon 26 Jun 2017 04:46:26 AM UTC ---------------------------------------------------------------- Starting benchmark run... ---------------------------------------------------------------- Listing wu-file(s) in /testWUs : 09no16aa.18442.2116.6.33.31.wu 20oc08aa.4777.254820.12.39.5.wu 23se08ac.6875.22968.6.33.135.wu Listing executable(s) in /APPS : setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 Listing executable in /REF_APPS : MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu ---------------------------------------------------------------- Current WU: 09no16aa.18442.2116.6.33.31.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 6283 seconds ---------------------------------------------------------------- Running app with command : .......... 
setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 -unroll 1 -bs -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=16777216 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=65536 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocing tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=6291456 GPSF 1.618911 2 3.229842 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes Elapsed Time : ...................... 205 seconds Speed compared to default : ......... 3064 % ----------------- Comparing results Result : Strongly similar, Q= 99.81% ---------------------------------------------------------------- Running app with command : .......... setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 -device 1 Elapsed Time : ...................... 490 seconds Speed compared to default : ......... 1282 % ----------------- Comparing results Result : Strongly similar, Q= 99.98% ---------------------------------------------------------------- Done with 09no16aa.18442.2116.6.33.31.wu ==================================================================== Current WU: 20oc08aa.4777.254820.12.39.5.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 3396 seconds ---------------------------------------------------------------- Running app with command : .......... 
setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 -unroll 1 -bs -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=16777216 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=65536 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocing tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=6291456 GPSF 0.830732 1 1.732689 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes Elapsed Time : ...................... 133 seconds Speed compared to default : ......... 2553 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 10 10 10 0 0 10 10 10 0 Autocorr 0 3 3 3 0 0 3 3 3 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 0 0 0 0 0 0 0 0 0 Triplet 0 0 0 0 0 0 0 0 0 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 1 1 1 1 0 1 1 1 1 0 Best Pulse 0 0 0 0 1 0 0 0 0 1 Best Triplet 0 0 0 0 0 0 0 0 0 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 1 16 16 16 1 1 16 16 16 1 Unmatched signal(s) in R1 at line(s) 607 Unmatched signal(s) in R2 at line(s) 607 For R1:R2 matched signals only, Q= 99.97% Result : Weakly similar. ---------------------------------------------------------------- Running app with command : .......... setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 -device 1 Elapsed Time : ...................... 364 seconds Speed compared to default : ......... 
932 % ----------------- Comparing results Result : Strongly similar, Q= 99.97% ---------------------------------------------------------------- Done with 20oc08aa.4777.254820.12.39.5.wu ==================================================================== Current WU: 23se08ac.6875.22968.6.33.135.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 8171 seconds ---------------------------------------------------------------- Running app with command : .......... setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 -unroll 1 -bs -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=16777216 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=65536 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocing tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=6291456 GPSF 3.034932 3 5.351412 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes Elapsed Time : ...................... 310 seconds Speed compared to default : ......... 2635 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 1 3 3 3 0 1 3 3 3 0 Autocorr 0 0 0 0 0 0 0 0 0 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 3 3 3 0 0 3 3 3 0 Best Spike 1 1 1 1 0 1 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 1 1 1 0 0 1 1 1 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 2 11 11 11 1 2 11 11 11 1 Unmatched signal(s) in R1 at line(s) 500 Unmatched signal(s) in R2 at line(s) 500 For R1:R2 matched signals only, Q= 99.99% Result : Weakly similar. 
---------------------------------------------------------------- Running app with command : .......... setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 -device 1 Elapsed Time : ...................... 607 seconds Speed compared to default : ......... 1346 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 3 3 3 0 0 3 3 3 0 Autocorr 0 0 0 0 0 0 0 0 0 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 3 3 3 0 0 3 3 3 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 1 1 1 0 0 1 1 1 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 0 11 11 11 1 0 11 11 11 1 Unmatched signal(s) in R1 at line(s) 500 Unmatched signal(s) in R2 at line(s) 500 For R1:R2 matched signals only, Q= 99.99% Result : Weakly similar. ---------------------------------------------------------------- Done with 23se08ac.6875.22968.6.33.135.wu ==================================================================== ID: 1875137 ·
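A side note on reading the "Speed compared to default" figures in the log above: they look consistent with the reference app's elapsed time divided by the tested app's elapsed time, expressed as a percentage and truncated to a whole number. That is inferred from the posted numbers only, not from the benchmark script itself, and speed_percent is a hypothetical name. A minimal sketch:

```cpp
#include <cassert>

// Reference elapsed time over app elapsed time, as a truncated integer
// percentage. Inferred from the log figures; not taken from the benchmark
// script. Returns 0 for a non-positive app time to avoid dividing by zero.
constexpr long speed_percent(long ref_seconds, long app_seconds) {
    return app_seconds > 0 ? (ref_seconds * 100) / app_seconds : 0;
}
```

For example, the reference time of 6283 seconds against 466 seconds gives 1348, and 3396 against 364 gives 932, matching the corresponding lines in the log.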

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 1875138 - Posted: 26 Jun 2017, 6:03:21 UTC Last modified: 26 Jun 2017, 6:05:33 UTC
Correct, Best Pulse has nothing to do with Best Gaussian. (Those can certainly be influenced separately by unroll and, if not yet corrected [by Petri], will show up as a random, rare problem event.)
The fact that there is indeed a 'cluster' of different things going on is precisely why stock Win32 CPU is considered the reference (not OpenCL, not Cuda, not Linux, not AK, or anything else).

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1875239 - Posted: 26 Jun 2017, 16:48:43 UTC - in response to Message 1875111. Last modified: 26 Jun 2017, 16:51:35 UTC
Hmmm, AK8 branch might be missing Joe's fix from ~2011 ? (svn posted earlier in thread):
gaussfit.cpp (stock seti_boinc branch):
    report = chisqOK // chisqOK is (ChiSq <= swi.analysis_cfg.gauss_chi_sq_thresh)
        && (gi.g.peak_power >= gi.g.mean_power * swi.analysis_cfg.gauss_peak_power_thresh)
        && (gi.g.null_chisqr >= swi.analysis_cfg.gauss_null_chi_sq_thresh);
    if (gaussian_count==0||report) {
        gi.score = score_offset
            +lcgf(0.5*gauss_dof,std::max(gi.g.chisqr*0.5*gauss_bins,0.5*gauss_dof+1))
            -lcgf(0.5*null_dof,std::max(gi.g.null_chisqr*0.5*gauss_bins,0.5*null_dof+1));
    }
    // Only include "real" Gaussians (those meeting the chisqr threshold)
    // in the best Gaussian display.
    if (gi.score > best_gauss->score && chisqOK) {
        best_gauss = gi;
    ....
The special appears to have it, as does Cuda baseline.
opt build more complex in this area:
    BOOLEAN chisq = (ChiSq <= swi.analysis_cfg.gauss_chi_sq_thresh);
    ...
    if (chisq) {
#endif
        BOOLEAN newbest=false, report;
        //R: same optimization as for GPU build: if there is reportable Gaussian already -
        //R: skip score calculation for all except new reportable Gaussians
        //R: TODO: carefully check if it's valid assumption!
        report = chisq
            && (PeakPower >= TrueMean * PoTInfo.GaussPeakPowerThresh)
            && (null_ChiSq >= swi.analysis_cfg.gauss_null_chi_sq_thresh);
        if(gaussian_count==0 || report){
            score = calc_GaussFit_score(ChiSq,null_ChiSq);
            newbest = chisq && (score > best_gauss->score);
        }
#if USE_COUNTERS
        Counter<Gaussian_skip6_low_power>::update(!(PeakPower >= TrueMean * PoTInfo.GaussPeakPowerThresh));
        //fprintf(stderr,"best_score=%.7g, score=%.7g\n",best_gauss->score,score);
#endif
#ifdef BOINC_APP_GRAPHICS
        if (newbest || report || graphics) {
#else
#if USE_COUNTERS
        if(! (newbest||report) ){ Counter<Gaussian_miss>::update(1); }
#endif
        if (newbest || report) {
#endif
    ....
But chi-square check seems to be present.
SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1875239 ·
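The point at issue in the snippets above is whether a candidate may replace the best Gaussian without passing the chi-square threshold. A minimal C++ sketch of that gate follows. It is a simplification for illustration, not the actual seti_boinc code: `Candidate`, `pick_best`, and the `gate_on_chisq` switch are hypothetical names, and the score is taken as given rather than computed with `lcgf`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for one fitted Gaussian candidate.
struct Candidate {
    double chisq;  // fit chi-square (lower means a better fit)
    double score;  // ranking score (higher is better)
};

// Returns the index of the best-scoring candidate, or -1 if none qualifies.
// With gate_on_chisq == true, only candidates with chisq <= chisq_thresh may
// become "best", which mirrors the `&& chisqOK` test in the stock snippet.
// With gate_on_chisq == false, a high-scoring but poorly fitting candidate
// can win, which is the suspected behaviour of a branch lacking the fix.
int pick_best(const std::vector<Candidate>& cands,
              double chisq_thresh,
              bool gate_on_chisq) {
    assert(chisq_thresh >= 0.0);
    int best = -1;
    for (int i = 0; i < static_cast<int>(cands.size()); ++i) {
        const bool chisq_ok = cands[i].chisq <= chisq_thresh;
        if (gate_on_chisq && !chisq_ok) continue;  // not a "real" Gaussian
        if (best < 0 || cands[i].score > cands[best].score) best = i;
    }
    return best;
}
```

Running both variants over the same candidate list and comparing the returned index is one way to see whether the gate alone could explain a Best Gaussian mismatch between two builds.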

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1875249 - Posted: 26 Jun 2017, 17:11:39 UTC - in response to Message 1875239. Last modified: 26 Jun 2017, 17:14:11 UTC "... But chi-square check seems to be present." Saw that part, but not sure in which codepaths that's active (e.g. SoG). If active in all paths then will have to arrange a bench with 8.00 Win32 reference, then call for suspects. Will have to be after Wednesday for me. [I'm not clear on if this particular Gaussian rabbit-hole has enough impact to be concerned about, but understanding it would be good for me] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1875249 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1875253 - Posted: 26 Jun 2017, 17:22:10 UTC - in response to Message 1875249. SoG has its own parallelized reduction for Gaussians (should implement the same logic though). And what worries me is the difference between SoG and non-SoG OpenCL results - that's definitely worth checking when I have easy access to hardware for that. SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1875253 ·

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1875255 - Posted: 26 Jun 2017, 17:25:41 UTC - in response to Message 1875253. "SoG has its own parallelized reduction for Gaussians (should implement the same logic though). And what worries me is the difference between SoG and non-SoG OpenCL results - that's definitely worth checking when I have easy access to hardware for that." Yes, I've lost track of which apps match and which don't now, and have yet to examine in detail Petri's fix for the pulse race condition also. So plenty to examine as the dust falls out. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1875255 ·

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1875263 - Posted: 26 Jun 2017, 19:14:35 UTC - in response to Message 1875255. Last modified: 26 Jun 2017, 19:16:52 UTC "SoG has its own parallelized reduction for Gaussians (should implement the same logic though). And what worries me is the difference between SoG and non-SoG OpenCL results - that's definitely worth checking when I have easy access to hardware for that." "Yes, I've lost track of which apps match and which don't now, and have yet to examine in detail Petri's fix for the pulse race condition also. So plenty to examine as the dust falls out." Petri's fix for the race condition fixed the Bad Pulse Count, which was being caused by the Unroll function. However, there has Always been a Bad Best Pulse that is even more rare than the Bad Pulse. I assumed the two were related; apparently they aren't, as the Bad Pulse is fixed but the Bad Best Pulse remains and is Not solved by using Unroll 1 the way the Bad Pulse was. It seems both CUDA and OpenCL have a Best Gaussian Problem with some rare tasks. Only CUDA has the Bad Best Pulse, and it is even more rare. That's the way I see it. ID: 1875263 ·

petri33 Volunteer tester Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1875266 - Posted: 26 Jun 2017, 19:25:24 UTC Hi, I'm here just for a quick peek. The special pulse find does a scan with unroll N (set by autotune or the user-set limit) for each CPU-code-icfft round, and if the scan finds a suspected pulse, that round is run again with unroll 1 for that CPU-code-icfft. Finding a pulse, or an even better not-yet-reported best, is a rare event for real pulses. For best pulses it happens in the first round and then more and more infrequently, since the bar for a best-but-not-yet-reported pulse rises after each one found. The gauss-find has, to my mind, not been touched for a long time in the special code. My mind-memory may be short. I guess it (the gauss-find code) is working as it should. It is a separate problem if it does not. I know there is a problem with my code reporting over 20 pulses at identical time with a small difference in frequency. That is an extremely rare event. And it always happens at 46.something. Petri. I'm following and will provide help when needed later on this summer. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1875266 ·
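The scan-then-recheck scheme Petri describes can be sketched in plain C++ roughly as below. This is only an illustration under stated assumptions: `ScanResult`, `ScanFn`, and `find_pulses` are hypothetical names, the callback stands in for the real CUDA pulse-find kernel, and the rising "best" bar is left out.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical outcome of scanning one icfft round.
struct ScanResult {
    bool suspect;               // the scan saw something that might be a pulse
    std::vector<double> peaks;  // pulse peaks recorded by this scan
};

// scan(round, unroll): hypothetical stand-in for the GPU pulse-find kernel.
using ScanFn = std::function<ScanResult(int round, int unroll)>;

// Scan every round with the fast unroll factor. When a round looks
// suspicious, throw away the fast (possibly racy) result and repeat that
// round with unroll 1, recording only what the slow pass reports.
std::vector<double> find_pulses(int rounds, int unroll, const ScanFn& scan,
                                int* reruns = nullptr) {
    assert(unroll >= 1);
    std::vector<double> reported;
    for (int r = 0; r < rounds; ++r) {
        ScanResult res = scan(r, unroll);
        if (!res.suspect) continue;      // common case: nothing found
        if (unroll > 1) {
            res = scan(r, 1);            // trusted pass for this round only
            if (reruns) ++(*reruns);
        }
        reported.insert(reported.end(), res.peaks.begin(), res.peaks.end());
    }
    return reported;
}
```

Because suspected pulses are rare, the cost of the unroll 1 pass is paid on only a few rounds, so most of the speed of the larger unroll is kept.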

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1875274 - Posted: 26 Jun 2017, 20:28:20 UTC - in response to Message 1875266. "I know there is a problem with my code reporting over 20 pulses at identical time with a small difference in frequency. That is an extremely rare event. And it always happens at 46.something." That sounds like the problem I was running into with my GTX 780 (now replaced by a GTX 980), which I detailed in Message 1864874. In fact, with the Cuda8.0 Special App, it was happening quite frequently. Dialing back to the Cuda6.5 version, it became rare, but didn't go away entirely. It has never (yet) shown up on any of my other cards (GTX 750Ti, GTX 960, GTX 980). You'd need to find somebody else running a 780 to see if the problem is common to that model or unique to my card. ID: 1875274 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1875275 - Posted: 26 Jun 2017, 20:33:03 UTC - in response to Message 1875266. "I know there is a problem with my code reporting over 20 pulses at identical time with a small difference in frequency. That is an extremely rare event. And it always happens at 46.something." Could it be solved in the same fashion, by re-processing after discovery? SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1875275 ·
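Re-processing after discovery presupposes a cheap way to spot the symptom first. One possible check, sketched in C++ below, counts pulses that share exactly the same time and flags the result once the count exceeds a limit. Everything here is an assumption made for illustration: the `Pulse` record, the `has_same_time_cluster` helper, and the choice of exact equality on time are not taken from the special app; only the figure of "over 20" comes from Petri's description.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical minimal pulse record; the real result carries more fields.
struct Pulse {
    double time;  // detection time within the task
    double freq;  // detection frequency
};

// Returns true if more than max_same_time pulses share exactly the same time,
// which is the signature of the rare multi-report problem described above.
bool has_same_time_cluster(const std::vector<Pulse>& pulses,
                           std::size_t max_same_time) {
    assert(max_same_time > 0);
    std::map<double, std::size_t> count_by_time;
    for (const Pulse& p : pulses) {
        if (++count_by_time[p.time] > max_same_time) return true;
    }
    return false;
}
```

A caller could use a positive result as the trigger to repeat the affected round with unroll 1, in the same spirit as the existing pulse workaround; whether that actually removes the duplicates is the open question in the thread.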

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1875279 - Posted: 26 Jun 2017, 20:45:27 UTC - in response to Message 1875274. "I know there is a problem with my code reporting over 20 pulses at identical time with a small difference in frequency. That is an extremely rare event. And it always happens at 46.something." "That sounds like the problem I was running into with my GTX 780 (now replaced by a GTX 980), which I detailed in Message 1864874. In fact, with the Cuda8.0 Special App, it was happening quite frequently. Dialing back to the Cuda6.5 version, it became rare, but didn't go away entirely. It has never (yet) shown up on any of my other cards (GTX 750Ti, GTX 960, GTX 980). You'd need to find somebody else running a 780 to see if the problem is common to that model or unique to my card." I'll put my 780 back in the Mac Pro on the weekend. Its unique Hyper-Q feature might be in play, which differs by OS and Cuda version in subtle implementation ways. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1875279 ·
