Linux CUDA 'Special' App finally available, featuring Low CPU use
Author | Message |
---|---|
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Ah, so perhaps the actual Result file would contain a different Best Pulse value than the Stderr shows? ... All the reported signals and Best signals seem to match between the two." *Possible*, making a lot of assumptions there. Naturally the result file is the important one. Probably prior assumptions about processing vs printing order become somewhat muddy as parallelism and reprocessing are involved, while stderr is sequential. That will no doubt have to be de-confused as we go along. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"The only way to tell for sure is to run the task with your CPU and compare the results. You should give that a try, you can run a CPU task in the benchmark App while running BOINC. Just reduce the CPU usage by One in BOINC and remove any Apps from the APPS folder in the Benchmark package. The CPU App in the REF_APPS folder will search the WU folder and run any task it doesn't have results for. The Benchmark tool is here, KWSN Linux MB Bench v2.01.08. Extract the KWSN-Bench-Linux-MBv7_v2.01.08.7z to your Home folder and run it from there." Okay, I tried running it with the Windows CPU app that I use here on my daily driver. It almost perfectly matches the v8.22 (opencl_ati_cat132) result. Workunit 2567983999 (20oc08aa.4777.254820.12.39.5), MB8_win_x86_SSE3_VS2008_r3330 - Best pulse: peak=0.4685681, time=98.45, period=0.01441, d_freq=1420048834.69, score=0.9218, chirp=-61.928, fft_len=8 |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"'The only way to tell for sure is to run the task with your CPU and compare the results. You should give that a try, you can run a CPU task in the benchmark App while running BOINC. Just reduce the CPU usage by One in BOINC and remove any Apps from the APPS folder in the Benchmark package. The CPU App in the REF_APPS folder will search the WU folder and run any task it doesn't have results for. The Benchmark tool is here, KWSN Linux MB Bench v2.01.08. Extract the KWSN-Bench-Linux-MBv7_v2.01.08.7z to your Home folder and run it from there.' Okay, I tried running it with the Windows CPU app that I use here on my daily driver. It almost perfectly matches the v8.22 (opencl_ati_cat132) result." Are you able to cross compare that with Cuda Baseline? It'll narrow down where to look once I get to the special code. [Edit:] Which branch is that MB8 derived from? Stock seti_boinc master? Or AKv8? The difference may be important here. [Whichever one(s) differ from the reference Windows/x86 8.00 may point in the right direction.] |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"Which branch is that MB8 derived from? Stock seti_boinc master? Or AKv8? The difference may be important here." My OSX CPU App r3344 is from AKv8 and it has the same results as r3330. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Are you able to cross compare that with Cuda Baseline? It'll narrow down where to look once I get to the special code." Can that also be run with the MBBench 2.10? "[Edit:] Which branch is that MB8 derived from? Stock seti_boinc master? Or AKv8? The difference may be important here." No clue. That answer will have to come from elsewhere. ;^) |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Hmmm, the AK8 branch *might* be missing Joe's fix from ~2011? (svn posted earlier in thread): gaussfit.cpp (stock seti_boinc branch): report = chisqOK // chisqOK is (ChiSq <= swi.analysis_cfg.gauss_chi_sq_thresh) The special appears to have it, as does Cuda baseline. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"'Which branch is that MB8 derived from? Stock seti_boinc master? Or AKv8? The difference may be important here.' My OSX CPU App r3344 is from AKv8 and it has the same results as r3330." Probably that fix is missing from the AK-derived builds then [seems to be the case, and includes a SIGNALS_ON_GPU path in the same file (sah_v7_opt\AKv8\client\gaussfit.cpp)]. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
All the OpenCL MB Apps come from AKv8 as far as I know. That includes the Apps that don't use the SoG path and work, such as my r3567 and the Non-SoG r3584 Linux App. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"All the OpenCL Apps come from AKv8 as far as I know. That includes the Apps that don't use the SoG path and work, such as my r3567 and the Non-SoG r3584 Linux App." Ugh, that's a lot of builds if [some] are really missing the fix, as it appears. [Probably Raistmer will have to identify which ones use codebases with the fix, as there are a lot of alternate codepaths there.] [Edit:] Some paths appear to have their own implementation of something similar, some not. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
It seems to be a real cluster to me. Apparently the Pulsefind has nothing to do with Best Pulse, as running with unroll 1 doesn't help. The task 23se08ac.6875.22968.6.33.135 is particularly nasty as not much seems to work with it, not even the Linux Baseline CUDA 4.2 App. The zi3v works with it on the Mac, and the OpenCL App r3567 works with it in Linux. Thankfully these tasks are rare, so most people will be oblivious to them. I'll post the results and let others decipher them. MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu is from AKv8 and has been working well on 3 of my machines for well over a year. MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu uses the NV path but Not the SoG path. setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG is the Current Stock Linux App and uses the SoG path. setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 is the zi3k source with the gaussian fix from zi3s; I wanted to see if the new changes since zi3k were at fault. setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 is the Baseline App from Xbranch. KWSN-Linux-MBbench v2.1.08 Running on TBarxxxx at Mon 26 Jun 2017 03:37:23 AM UTC ---------------------------------------------------------------- Starting benchmark run... ---------------------------------------------------------------- Listing wu-file(s) in /testWUs : 09no16aa.18442.2116.6.33.31.wu 20oc08aa.4777.254820.12.39.5.wu 23se08ac.6875.22968.6.33.135.wu Listing executable(s) in /APPS : MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80 Listing executable in /REF_APPS : MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu ---------------------------------------------------------------- Current WU: 09no16aa.18442.2116.6.33.31.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 
6283 seconds ---------------------------------------------------------------- Running app with command : .......... MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 466 seconds Speed compared to default : ......... 1348 % ----------------- Comparing results Result : Strongly similar, Q= 99.91% ---------------------------------------------------------------- Running app with command : .......... setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 412 seconds Speed compared to default : ......... 1525 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 3 3 3 0 0 3 3 3 0 Autocorr 0 2 2 2 0 0 2 2 2 0 Gaussian 0 3 3 3 0 0 3 3 3 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 0 0 0 0 0 0 0 0 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 0 0 0 0 0 0 0 0 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 0 12 12 12 1 0 12 12 12 1 Unmatched signal(s) in R1 at line(s) 563 Unmatched signal(s) in R2 at line(s) 563 For R1:R2 matched signals only, Q= 99.91% Result : Weakly similar. ---------------------------------------------------------------- Running app with command : .......... 
setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80 -unroll 1 -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=18874368 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=4194304 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocating tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=3145728 CUDA stream priority range: low 0 and high: -1 GPSF 1.618911 2 3.229842 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes ............................................................................................................................................................................................................................................................................................. Best scores written Out file closed Cuda free done Cuda device reset done Elapsed Time : ...................... 200 seconds Speed compared to default : ......... 3141 % ----------------- Comparing results Result : Strongly similar, Q= 99.81% ---------------------------------------------------------------- Done with 09no16aa.18442.2116.6.33.31.wu ==================================================================== Current WU: 20oc08aa.4777.254820.12.39.5.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 3396 seconds ---------------------------------------------------------------- Running app with command : .......... MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 404 seconds Speed compared to default : ......... 
840 % ----------------- Comparing results Result : Strongly similar, Q= 99.97% ---------------------------------------------------------------- Running app with command : .......... setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 360 seconds Speed compared to default : ......... 943 % ----------------- Comparing results Result : Strongly similar, Q= 99.97% ---------------------------------------------------------------- Running app with command : .......... setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80 -unroll 1 -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=18874368 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=4194304 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocating tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=3145728 CUDA stream priority range: low 0 and high: -1 GPSF 0.830732 1 1.732689 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes ................................................................................................................................................................................................... Best scores written Out file closed Cuda free done Cuda device reset done Elapsed Time : ...................... 131 seconds Speed compared to default : ......... 
2592 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 10 10 10 0 0 10 10 10 0 Autocorr 0 3 3 3 0 0 3 3 3 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 0 0 0 0 0 0 0 0 0 Triplet 0 0 0 0 0 0 0 0 0 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 1 1 1 1 0 1 1 1 1 0 Best Pulse 0 0 0 0 1 0 0 0 0 1 Best Triplet 0 0 0 0 0 0 0 0 0 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 1 16 16 16 1 1 16 16 16 1 Unmatched signal(s) in R1 at line(s) 607 Unmatched signal(s) in R2 at line(s) 607 For R1:R2 matched signals only, Q= 99.97% Result : Weakly similar. ---------------------------------------------------------------- Done with 20oc08aa.4777.254820.12.39.5.wu ==================================================================== Current WU: 23se08ac.6875.22968.6.33.135.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 8171 seconds ---------------------------------------------------------------- Running app with command : .......... MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 608 seconds Speed compared to default : ......... 
1343 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 3 3 3 0 0 3 3 3 0 Autocorr 0 0 0 0 0 0 0 0 0 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 3 3 3 0 0 3 3 3 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 1 1 1 0 0 1 1 1 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 0 11 11 11 1 0 11 11 11 1 Unmatched signal(s) in R1 at line(s) 500 Unmatched signal(s) in R2 at line(s) 500 For R1:R2 matched signals only, Q= 99.97% Result : Weakly similar. ---------------------------------------------------------------- Running app with command : .......... setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1 Elapsed Time : ...................... 545 seconds Speed compared to default : ......... 1499 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 3 3 3 0 0 3 3 3 0 Autocorr 0 0 0 0 0 0 0 0 0 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 3 3 3 0 0 3 3 3 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 1 1 1 0 0 1 1 1 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 0 11 11 11 1 0 11 11 11 1 Unmatched signal(s) in R1 at line(s) 500 Unmatched signal(s) in R2 at line(s) 500 For R1:R2 matched signals only, Q= 99.97% Result : Weakly similar. ---------------------------------------------------------------- Running app with command : .......... 
setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80 -unroll 1 -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=18874368 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=4194304 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocating tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=3145728 CUDA stream priority range: low 0 and high: -1 GPSF 3.034932 3 5.351412 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes ..................................................................................................................................................................................................................................................................................................................................................................................... Best scores written Out file closed Cuda free done Cuda device reset done Elapsed Time : ...................... 289 seconds Speed compared to default : ......... 2827 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 1 3 3 3 0 1 3 3 3 0 Autocorr 0 0 0 0 0 0 0 0 0 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 3 3 3 0 0 3 3 3 0 Best Spike 1 1 1 1 0 1 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 1 1 1 0 0 1 1 1 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 2 11 11 11 1 2 11 11 11 1 Unmatched signal(s) in R1 at line(s) 500 Unmatched signal(s) in R2 at line(s) 500 For R1:R2 matched signals only, Q= 99.99% Result : Weakly similar. 
---------------------------------------------------------------- Done with 23se08ac.6875.22968.6.33.135.wu ==================================================================== Done with Benchmark run! Removing temporary files! tbar@TBar-iSETI:~/KWSN-Bench-Linux-MBv7$ ./benchmark KWSN-Linux-MBbench v2.1.08 Running on TBar-iSETI at Mon 26 Jun 2017 04:46:26 AM UTC ---------------------------------------------------------------- Starting benchmark run... ---------------------------------------------------------------- Listing wu-file(s) in /testWUs : 09no16aa.18442.2116.6.33.31.wu 20oc08aa.4777.254820.12.39.5.wu 23se08ac.6875.22968.6.33.135.wu Listing executable(s) in /APPS : setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 Listing executable in /REF_APPS : MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu ---------------------------------------------------------------- Current WU: 09no16aa.18442.2116.6.33.31.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 6283 seconds ---------------------------------------------------------------- Running app with command : .......... setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 -unroll 1 -bs -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=16777216 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=65536 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocing tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=6291456 GPSF 1.618911 2 3.229842 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes Elapsed Time : ...................... 205 seconds Speed compared to default : ......... 
3064 % ----------------- Comparing results Result : Strongly similar, Q= 99.81% ---------------------------------------------------------------- Running app with command : .......... setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 -device 1 Elapsed Time : ...................... 490 seconds Speed compared to default : ......... 1282 % ----------------- Comparing results Result : Strongly similar, Q= 99.98% ---------------------------------------------------------------- Done with 09no16aa.18442.2116.6.33.31.wu ==================================================================== Current WU: 20oc08aa.4777.254820.12.39.5.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 3396 seconds ---------------------------------------------------------------- Running app with command : .......... setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 -unroll 1 -bs -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=16777216 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=65536 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocing tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=6291456 GPSF 0.830732 1 1.732689 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes Elapsed Time : ...................... 133 seconds Speed compared to default : ......... 
2553 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 10 10 10 0 0 10 10 10 0 Autocorr 0 3 3 3 0 0 3 3 3 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 0 0 0 0 0 0 0 0 0 Triplet 0 0 0 0 0 0 0 0 0 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 1 1 1 1 0 1 1 1 1 0 Best Pulse 0 0 0 0 1 0 0 0 0 1 Best Triplet 0 0 0 0 0 0 0 0 0 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 1 16 16 16 1 1 16 16 16 1 Unmatched signal(s) in R1 at line(s) 607 Unmatched signal(s) in R2 at line(s) 607 For R1:R2 matched signals only, Q= 99.97% Result : Weakly similar. ---------------------------------------------------------------- Running app with command : .......... setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 -device 1 Elapsed Time : ...................... 364 seconds Speed compared to default : ......... 932 % ----------------- Comparing results Result : Strongly similar, Q= 99.97% ---------------------------------------------------------------- Done with 20oc08aa.4777.254820.12.39.5.wu ==================================================================== Current WU: 23se08ac.6875.22968.6.33.135.wu ---------------------------------------------------------------- Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s) Elapsed Time: ....................... 8171 seconds ---------------------------------------------------------------- Running app with command : .......... 
setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 -unroll 1 -bs -device 1 gCudaDevProps.multiProcessorCount = 5 Work data buffer for fft results size = 320864256 MallocHost G=67108864 T=33554432 P=16777216 (16) MallocHost tmp_PoTP=16777216 MallocHost tmp_PoTP2=16777216 MallocHost tmp_PoTT=16777216 MallocHost tmp_PoTG=65536 MallocHost best_PoTP=16777216 MallocHost bestPoTG=4194304 Allocing tmp data buf for unroll 1 MallocHost tmp_smallPoT=524288 MallocHost PowerSpectrumSumMax=6291456 GPSF 3.034932 3 5.351412 AcIn 16779264 AcOut 33558528 Mallocing blockSums 24576 bytes Elapsed Time : ...................... 310 seconds Speed compared to default : ......... 2635 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 1 3 3 3 0 1 3 3 3 0 Autocorr 0 0 0 0 0 0 0 0 0 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 3 3 3 0 0 3 3 3 0 Best Spike 1 1 1 1 0 1 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 1 1 1 0 0 1 1 1 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 2 11 11 11 1 2 11 11 11 1 Unmatched signal(s) in R1 at line(s) 500 Unmatched signal(s) in R2 at line(s) 500 For R1:R2 matched signals only, Q= 99.99% Result : Weakly similar. ---------------------------------------------------------------- Running app with command : .......... setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 -device 1 Elapsed Time : ...................... 607 seconds Speed compared to default : ......... 
1346 % ----------------- Comparing results ------------- R1:R2 ------------ ------------- R2:R1 ------------ Exact Super Tight Good Bad Exact Super Tight Good Bad Spike 0 3 3 3 0 0 3 3 3 0 Autocorr 0 0 0 0 0 0 0 0 0 0 Gaussian 0 0 0 0 0 0 0 0 0 0 Pulse 0 1 1 1 0 0 1 1 1 0 Triplet 0 3 3 3 0 0 3 3 3 0 Best Spike 0 1 1 1 0 0 1 1 1 0 Best Autocorr 0 1 1 1 0 0 1 1 1 0 Best Gaussian 0 0 0 0 1 0 0 0 0 1 Best Pulse 0 1 1 1 0 0 1 1 1 0 Best Triplet 0 1 1 1 0 0 1 1 1 0 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 0 11 11 11 1 0 11 11 11 1 Unmatched signal(s) in R1 at line(s) 500 Unmatched signal(s) in R2 at line(s) 500 For R1:R2 matched signals only, Q= 99.99% Result : Weakly similar. ---------------------------------------------------------------- Done with 23se08ac.6875.22968.6.33.135.wu ==================================================================== |
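A note on the "Speed compared to default" figures in the log above: they are consistent with the reference app's elapsed time divided by the tested app's elapsed time, expressed as a percentage and truncated to an integer (for example, 6283 s against 466 s gives 1348 %). The benchmark script itself is not shown in this thread, so the following is only a sketch under that assumption; the function name `speed_vs_default_percent` is made up for illustration.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper, NOT taken from the KWSN benchmark script.
// Assumption: speed % = reference elapsed / app elapsed * 100, truncated.
std::int64_t speed_vs_default_percent(std::int64_t ref_seconds,
                                      std::int64_t app_seconds) {
    if (app_seconds <= 0) {
        throw std::invalid_argument("app_seconds must be positive");
    }
    // Integer division discards the fractional part (truncation, not rounding).
    return ref_seconds * 100 / app_seconds;
}
```

With the figures from the log, `speed_vs_default_percent(3396, 364)` gives 932, which matches the truncated (not rounded) value printed for the CUDA 4.2 baseline on 20oc08aa.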
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Correct, Best Pulse has nothing to do with Best Gaussian (those certainly can be separately influenced by Unroll, and if not yet corrected [by Petri] will be a random and rare problem event). The fact that there is indeed a 'cluster' of different things going on is precisely why stock Win32 CPU is considered the reference (not OpenCL, not Cuda, not Linux, not AK, or anything else). |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
"Hmmm, the AK8 branch *might* be missing Joe's fix from ~2011? (svn posted earlier in thread):" The opt build is more complex in this area: BOOLEAN chisq = (ChiSq <= swi.analysis_cfg.gauss_chi_sq_thresh); ... if (chisq) { #endif BOOLEAN newbest=false, report; //R: same optimization as for GPU build: if there is reportable Gaussian already - //R: skip score calculation for all except new reportable Gaussians //R: TODO: carefully check if it's valid assumption! report = chisq && (PeakPower >= TrueMean * PoTInfo.GaussPeakPowerThresh) && (null_ChiSq >= swi.analysis_cfg.gauss_null_chi_sq_thresh); if(gaussian_count==0 || report){ score = calc_GaussFit_score(ChiSq,null_ChiSq); newbest = chisq && (score > best_gauss->score); } #if USE_COUNTERS Counter<Gaussian_skip6_low_power>::update(!(PeakPower >= TrueMean * PoTInfo.GaussPeakPowerThresh)); //fprintf(stderr,"best_score=%.7g, score=%.7g\n",best_gauss->score,score); #endif #ifdef BOINC_APP_GRAPHICS if (newbest || report || graphics) { #else #if USE_COUNTERS if(! (newbest||report) ){ Counter<Gaussian_miss>::update(1); } #endif if (newbest || report) { #endif .... But the chi-square check seems to be present. SETI apps news We're not gonna fight them. We're gonna transcend them. |
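For readability, the decision logic in the fragment quoted above can be restated as a small stand-alone function. This is only a sketch: the structs, the function name `classify_gaussian` and the placeholder `toy_score` are hypothetical, the preprocessor branches and counters are dropped, and the real `calc_GaussFit_score` is not reproduced here.

```cpp
// Hypothetical containers for the values the quoted fragment reads.
struct GaussCandidate {
    double chi_sq;       // ChiSq
    double null_chi_sq;  // null_ChiSq
    double peak_power;   // PeakPower
    double true_mean;    // TrueMean
};

struct GaussThresholds {
    double chi_sq_thresh;       // stands in for swi.analysis_cfg.gauss_chi_sq_thresh
    double null_chi_sq_thresh;  // stands in for swi.analysis_cfg.gauss_null_chi_sq_thresh
    double peak_power_thresh;   // stands in for PoTInfo.GaussPeakPowerThresh
};

struct GaussDecision {
    bool report;   // candidate is a reportable Gaussian
    bool newbest;  // candidate replaces the current best Gaussian
};

// Placeholder score, only so the sketch can run. NOT the real scoring formula.
double toy_score(double chi_sq, double null_chi_sq) {
    return null_chi_sq - chi_sq;
}

GaussDecision classify_gaussian(const GaussCandidate& c,
                                const GaussThresholds& t,
                                int gaussian_count,
                                double best_score,
                                double (*score_fn)(double, double)) {
    const bool chisq = c.chi_sq <= t.chi_sq_thresh;
    const bool report = chisq &&
                        c.peak_power >= c.true_mean * t.peak_power_thresh &&
                        c.null_chi_sq >= t.null_chi_sq_thresh;
    bool newbest = false;
    // Same shortcut as in the fragment: once a reportable Gaussian exists,
    // only new reportable candidates are scored at all.
    if (gaussian_count == 0 || report) {
        const double score = score_fn(c.chi_sq, c.null_chi_sq);
        newbest = chisq && score > best_score;
    }
    return {report, newbest};
}
```

Note the consequence of the `gaussian_count == 0 || report` shortcut: once a reportable Gaussian exists, a candidate that passes the chi-square test but is not itself reportable is never scored, so it cannot become the new best. That is the assumption the TODO comment in the fragment asks to have checked.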
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
... Saw that part, but I'm not sure in which codepaths that's active (e.g. SoG). If it's active in all paths then I will have to arrange a bench with the 8.00 Win32 reference, then call for suspects. Will have to be after Wednesday for me. [I'm not clear on whether this particular Gaussian rabbit-hole has enough impact to be concerned about, but understanding it would be good for me.] |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
SoG has own parallelized reduction for Gaussians (should implement same logic though). And what worries me is the difference between SoG and non-SoG OpenCL results; that's definitely worth checking when I have easy access to hardware for that. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"SoG has own parallelized reduction for Gaussians (should implement same logic though)." Yes, I've lost track of which apps match and which don't now, and have yet to examine Petri's fix for the pulse race condition in detail as well. So there's plenty to examine as the dust settles. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Petri's fix for the race condition fixed the Bad Pulse Count, which was being caused by the Unroll function. However, there has Always been a Bad Best Pulse that is even more rare than the Bad Pulse. I assumed the two were related; apparently they aren't, as the Bad Pulse is fixed but the Bad Best Pulse remains and is Not solved by using Unroll 1 the way the Bad Pulse was. It seems both CUDA and OpenCL have a Best Gaussian Problem with some rare tasks. Only CUDA has the Bad Best Pulse, and it is even more rare. "SoG has own parallelized reduction for Gaussians (should implement same logic though)." That's the way I see it. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Hi, I'm here just for a quick peek. The special pulse find does a scan with unroll N (depending on autotune or the user-set limit) for each CPU-code-icfft round, and if the scan finds a suspected pulse the round is run again with unroll 1 for that CPU-code-icfft. That (finding a pulse or an even better unreported best) is a rare event for real pulses; for best pulses it happens at the first round and then more and more infrequently, since the bar for best-and-not-yet-reported rises after each one found. The gauss-find has, to my mind, not been touched for a long time in the special code. My mind-memory may be short. I guess it (the gauss-find code) is working as it should. It is a separate problem if it does not. I know there is a problem with my code reporting over 20 pulses at identical time with a small difference in frequency. That is an extremely rare event. And it always happens at 46.something. Petri. I'm following and will provide help when needed later on this summer. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
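The scan-then-recheck scheme described above might be sketched roughly as follows. Everything here is a stand-in: the callbacks `fast_scan_suspects` (the unroll N pass) and `exact_unroll1` (the unroll 1 rerun), the structs and the loop are hypothetical and are not taken from the special app's source.

```cpp
#include <functional>

// Hypothetical result of reprocessing one icfft round with unroll 1.
struct PulseResult {
    bool found;
    double score;
};

// Hypothetical bookkeeping for the whole run.
struct RunStats {
    int exact_reruns;   // how many rounds had to be reprocessed with unroll 1
    double best_score;  // current "bar" a new best pulse has to clear
};

RunStats process_rounds(
    int num_rounds,
    // Fast pass with unroll N: only answers "does this round look suspicious
    // given the current best score?"
    const std::function<bool(int, double)>& fast_scan_suspects,
    // Slow pass with unroll 1 for a single round.
    const std::function<PulseResult(int)>& exact_unroll1) {
    RunStats stats{0, 0.0};
    for (int icfft = 0; icfft < num_rounds; ++icfft) {
        if (!fast_scan_suspects(icfft, stats.best_score)) {
            continue;  // common case: nothing suspected, no rerun needed
        }
        ++stats.exact_reruns;
        const PulseResult r = exact_unroll1(icfft);
        if (r.found && r.score > stats.best_score) {
            stats.best_score = r.score;  // the bar rises, so reruns get rarer
        }
    }
    return stats;
}
```

The point of the design, as described in the post, is that the expensive unroll 1 pass only runs for the few rounds the fast scan flags, and those become less frequent as the best score climbs.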
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"I know there is a problem with my code reporting over 20 pulses at identical time with a small difference in frequency. That is an extremely rare event. And it always happens at 46.something." That sounds like the problem I was running into with my GTX 780 (now replaced by a GTX 980), which I detailed in Message 1864874. In fact, with the Cuda8.0 Special App, it was happening quite frequently. Dialing back to the Cuda6.5 version, it became rare, but didn't go away entirely. It has never (yet) shown up on any of my other cards (GTX 750Ti, GTX 960, GTX 980). You'd need to find somebody else running a 780 to see if the problem is common to that model or unique to my card. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Could it be solved in the same fashion, by re-processing after discovery? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"'I know there is a problem with my code reporting over 20 pulses at identical time with a small difference in frequency. That is an extremely rare event. And it always happens at 46.something.' That sounds like the problem I was running into with my GTX 780 (now replaced by a GTX 980), which I detailed in Message 1864874. In fact, with the Cuda8.0 Special App, it was happening quite frequently. Dialing back to the Cuda6.5 version, it became rare, but didn't go away entirely. It has never (yet) shown up on any of my other cards (GTX 750Ti, GTX 960, GTX 980). You'd need to find somebody else running a 780 to see if the problem is common to that model or unique to my card." I'll put my 780 back in the Mac Pro on the weekend. Its unique Hyper-Q feature might be in play; its implementation differs in subtle ways by OS and Cuda version. |