Linux CUDA 'Special' App finally available, featuring Low CPU use

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1875076 - Posted: 25 Jun 2017, 23:11:11 UTC - in response to Message 1875060.  

...All the reported signals and Best signals seem to match between the two.


If implemented as I picture: for the pulse mechanism shunt/workaround, the stderr.txt 'realtime' log might show the racing pulse detections, and then the app shunts to unroll 1 to record the correct ones. If that's the case, it does reflect reality in the new 'racey-fixey' kind of way, but may need to be presented more clearly.
Ah, so perhaps the actual Result file would contain a different Best Pulse value than the Stderr shows?


*possible*, making a lot of assumptions there. Naturally the result file is the important one. Prior assumptions about processing vs printing order probably become somewhat muddy when parallelism and reprocessing are involved, while stderr is sequential. That will no doubt have to be de-confused as we go along.
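To make the stderr-versus-result-file point concrete, here is a minimal sketch of the scenario described above. It is only an illustration of the hypothesis, not the app's actual code: `Pulse`, `RunOutput`, `process` and the log text are all made-up names, and the decision to reprocess is reduced to a single flag.

```cpp
#include <string>
#include <vector>

// Hypothetical types and names, for illustration only; not taken from the app source.
struct Pulse {
    double peak;
    double time;
};

struct RunOutput {
    std::vector<std::string> stderr_log;  // sequential "realtime" log
    Pulse result_best;                    // what would end up in the result file
};

// fast_pass:      best pulse as reported by the parallel (possibly racy) pass.
// careful_pass:   best pulse from reprocessing with unroll 1.
// race_suspected: whether the workaround decided to reprocess.
RunOutput process(const Pulse& fast_pass, const Pulse& careful_pass, bool race_suspected) {
    RunOutput out;
    // The log line is written as soon as the fast pass reports something.
    out.stderr_log.push_back("Best pulse: peak=" + std::to_string(fast_pass.peak));
    // Only the value that survives reprocessing is recorded in the result.
    out.result_best = race_suspected ? careful_pass : fast_pass;
    return out;
}
```

Under these assumptions, a run that reprocesses would leave the fast-pass value in the log while the result holds the reprocessed one, which is why the result file is the one to compare.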
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1875103 - Posted: 26 Jun 2017, 1:45:51 UTC - in response to Message 1875014.  

The only way to tell for sure is to run the task with your CPU and compare the results. You should give that a try; you can run a CPU task in the benchmark App while running BOINC. Just reduce the CPU usage by one in BOINC and remove any Apps from the APPS folder in the Benchmark package. The CPU App in the REF_APPS folder will search the WU folder and run any task it doesn't have results for. The Benchmark tool is here: KWSN Linux MB Bench v2.01.08. Extract KWSN-Bench-Linux-MBv7_v2.01.08.7z to your Home folder and run it from there.
Okay, I tried running it with the Windows CPU app that I use here on my daily driver. It almost perfectly matches the v8.22 (opencl_ati_cat132) result.

Workunit 2567983999 (20oc08aa.4777.254820.12.39.5)
Task 5794100079 (S=10, A=3, P=0, T=0, G=0, BG=0) v8.22 (opencl_ati_cat132) windows_intelx86
Task 5829376759 (S=10, A=3, P=0, T=0, G=0, BG=0) x41p_zi3v, Cuda 8.00 special

v8.22 (opencl_ati_cat132) windows_intelx86 - Best pulse: peak=0.4685673, time=98.45, period=0.01441, d_freq=1420048834.69, score=0.9218, chirp=-61.928, fft_len=8
x41p_zi3v, Cuda 8.00 special - Best pulse: peak=0.3951461, time=68.92, period=0.0147, d_freq=1420052490.23, score=0.7774, chirp=0, fft_len=8.
MB8_win_x86_SSE3_VS2008_r3330 - Best pulse: peak=0.4685681, time=98.45, period=0.01441, d_freq=1420048834.69, score=0.9218, chirp=-61.928, fft_len=8
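
For anyone who wants to compare Best pulse lines like these by program rather than by eye, a minimal sketch follows. The struct fields mirror the printed line; `nearly_equal`, `same_pulse` and the tolerance argument are illustrative choices and are not the project's validation rules.

```cpp
#include <cassert>
#include <cmath>

// Fields as printed in the "Best pulse" lines above.
struct BestPulse {
    double peak;
    double time;
    double period;
    double d_freq;
    double score;
    double chirp;
    int fft_len;
};

// True when a and b agree to within the relative tolerance rel_tol.
bool nearly_equal(double a, double b, double rel_tol) {
    assert(rel_tol >= 0.0);
    double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * scale;
}

// rel_tol is an illustrative choice, not the project's validation threshold.
bool same_pulse(const BestPulse& a, const BestPulse& b, double rel_tol) {
    return a.fft_len == b.fft_len
        && nearly_equal(a.peak, b.peak, rel_tol)
        && nearly_equal(a.time, b.time, rel_tol)
        && nearly_equal(a.period, b.period, rel_tol)
        && nearly_equal(a.d_freq, b.d_freq, rel_tol)
        && nearly_equal(a.score, b.score, rel_tol)
        && nearly_equal(a.chirp, b.chirp, rel_tol);
}
```

With the values above and a tolerance of 1e-5, the v8.22 OpenCL and r3330 CPU pulses compare as the same signal (the peaks differ only in the seventh decimal place), while the x41p_zi3v pulse does not match either one on peak, time, period, d_freq, score or chirp.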

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1875107 - Posted: 26 Jun 2017, 1:56:32 UTC - in response to Message 1875103.  
Last modified: 26 Jun 2017, 2:04:22 UTC

Are you able to cross compare that with Cuda Baseline? It'll narrow down where to look once I get to the special code.

[Edit:] Which branch is that MB8 derived from? Stock seti_boinc master, or AKv8? The difference may be important here.
[Whichever one(s) differ from the reference Windows/x86 8.00 may point in the right direction.]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1875108 - Posted: 26 Jun 2017, 2:21:25 UTC - in response to Message 1875107.  

Which branch is that MB8 derived from? Stock seti_boinc master, or AKv8? The difference may be important here.
[Whichever one(s) differ from the reference Windows/x86 8.00 may point in the right direction.]
My OSX CPU App r3344 is from AKv8 and it has the same results as r3330.

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1875109 - Posted: 26 Jun 2017, 2:22:29 UTC - in response to Message 1875107.  

Are you able to cross compare that with Cuda Baseline? It'll narrow down where to look once I get to the special code.
Can that also be run with the MBBench 2.10?

[Edit:] Which branch is that MB8 derived from? Stock seti_boinc master, or AKv8? The difference may be important here.
[Whichever one(s) differ from the reference Windows/x86 8.00 may point in the right direction.]
No clue. That answer will have to come from elsewhere. ;^)

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1875111 - Posted: 26 Jun 2017, 2:29:06 UTC
Last modified: 26 Jun 2017, 2:35:53 UTC

Hmmm, the AK8 branch *might* be missing Joe's fix from ~2011? (svn posted earlier in thread):

gaussfit.cpp (stock seti_boinc branch):
report = chisqOK // chisqOK is (ChiSq <= swi.analysis_cfg.gauss_chi_sq_thresh)
         && (gi.g.peak_power >= gi.g.mean_power * swi.analysis_cfg.gauss_peak_power_thresh)
         && (gi.g.null_chisqr >= swi.analysis_cfg.gauss_null_chi_sq_thresh);
if (gaussian_count==0||report) {
    gi.score = score_offset
        +lcgf(0.5*gauss_dof,std::max(gi.g.chisqr*0.5*gauss_bins,0.5*gauss_dof+1))
        -lcgf(0.5*null_dof,std::max(gi.g.null_chisqr*0.5*gauss_bins,0.5*null_dof+1));
}
// Only include "real" Gaussians (those meeting the chisqr threshold)
// in the best Gaussian display.
if (gi.score > best_gauss->score && chisqOK) {
    *best_gauss = gi;

....


The special appears to have it, as does Cuda baseline.
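
A stripped-down sketch of the guard at the end of that excerpt, to show what the `chisqOK` term changes. This is a simplified stand-in, not the app's code: the `Gauss` struct keeps only two fields, `update_best` is a made-up helper, and the threshold is passed in rather than read from `swi.analysis_cfg`.

```cpp
#include <cassert>

// Simplified stand-in for the app's Gaussian record; only the fields needed here.
struct Gauss {
    double chisqr;
    double score;
};

// Returns true if `best` was replaced. chi_sq_thresh plays the role of
// swi.analysis_cfg.gauss_chi_sq_thresh in the excerpt above.
bool update_best(Gauss& best, const Gauss& candidate, double chi_sq_thresh) {
    assert(chi_sq_thresh > 0.0);
    bool chisqOK = candidate.chisqr <= chi_sq_thresh;
    // Only "real" Gaussians (those meeting the chisqr threshold)
    // may displace the current best.
    if (candidate.score > best.score && chisqOK) {
        best = candidate;
        return true;
    }
    return false;
}
```

With this guard, a candidate that scores higher but fails the chi-square threshold leaves the stored best untouched. A build without the `&& chisqOK` term would accept such a candidate, so two otherwise equivalent apps could report different Best Gaussians, which may be what is showing up in the comparisons.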
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1875112 - Posted: 26 Jun 2017, 2:32:21 UTC - in response to Message 1875108.  
Last modified: 26 Jun 2017, 2:34:49 UTC

Which branch is that MB8 derived from? Stock seti_boinc master, or AKv8? The difference may be important here.
[Whichever one(s) differ from the reference Windows/x86 8.00 may point in the right direction.]
My OSX CPU App r3344 is from AKv8 and it has the same results as r3330.


That fix is probably missing from the AK-derived builds, then [seems to be the case; the same file (sah_v7_opt\AKv8\client\gaussfit.cpp) also includes a SIGNALS_ON_GPU path].
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1875114 - Posted: 26 Jun 2017, 2:37:22 UTC - in response to Message 1875111.  
Last modified: 26 Jun 2017, 2:41:38 UTC

All the OpenCL MB Apps come from AKv8 as far as I know. That includes the Apps that don't use the SoG path and work, such as my r3567 and the Non-SoG r3584 Linux App.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1875115 - Posted: 26 Jun 2017, 2:39:08 UTC - in response to Message 1875114.  
Last modified: 26 Jun 2017, 2:52:50 UTC

All the OpenCL Apps come from AKv8 as far as I know. That includes the Apps that don't use the SoG path and work, such as my r3567 and the Non-SoG r3584 Linux App.


Ugh, that's a lot of builds if [some] really are missing the fix, as appears to be the case. [Raistmer will probably have to identify which ones use codebases with the fix, as there are a lot of alternate code paths there.]

[Edit:] Some paths appear to have their own implementation of something similar, some not.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1875137 - Posted: 26 Jun 2017, 5:53:22 UTC - in response to Message 1875115.  
Last modified: 26 Jun 2017, 5:55:22 UTC

It seems to be a real cluster to me. Apparently the Pulsefind has nothing to do with Best Pulse, as running with unroll 1 doesn't help. The task 23se08ac.6875.22968.6.33.135 is particularly nasty, as not much seems to work with it, not even the Linux Baseline CUDA 4.2 App. The zi3v works with it on the Mac, and the OpenCL App r3567 works with it in Linux. Thankfully these tasks are rare, so most people will be oblivious to them. I'll post the results and let others decipher them. The Apps involved are:
MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu is from AKv8 and has been working well on 3 of my machines for well over a year
MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu uses the NV path but Not the SoG path
setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG is the Current Stock Linux App and uses the SoG path
setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 is the zi3k source with the gaussian fix from zi3s; I wanted to see whether the changes made since zi3k were at fault
setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 is the Baseline App from Xbranch

KWSN-Linux-MBbench v2.1.08
Running on TBarxxxx at Mon 26 Jun 2017 03:37:23 AM UTC
----------------------------------------------------------------
Starting benchmark run...
----------------------------------------------------------------
Listing wu-file(s) in /testWUs :
09no16aa.18442.2116.6.33.31.wu
20oc08aa.4777.254820.12.39.5.wu
23se08ac.6875.22968.6.33.135.wu

Listing executable(s) in /APPS :
MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu
setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG
setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80

Listing executable in /REF_APPS :
MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu
----------------------------------------------------------------
Current WU: 09no16aa.18442.2116.6.33.31.wu
----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 6283 seconds
----------------------------------------------------------------
Running app with command : .......... MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1
Elapsed Time : ...................... 466 seconds
Speed compared to default : ......... 1348 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.91%
----------------------------------------------------------------
Running app with command : .......... setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1
Elapsed Time : ...................... 412 seconds
Speed compared to default : ......... 1525 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0      3      3      3      0        0      3      3      3      0
     Autocorr      0      2      2      2      0        0      2      2      2      0
     Gaussian      0      3      3      3      0        0      3      3      3      0
        Pulse      0      1      1      1      0        0      1      1      1      0
      Triplet      0      0      0      0      0        0      0      0      0      0
   Best Spike      0      1      1      1      0        0      1      1      1      0
Best Autocorr      0      1      1      1      0        0      1      1      1      0
Best Gaussian      0      0      0      0      1        0      0      0      0      1
   Best Pulse      0      1      1      1      0        0      1      1      1      0
 Best Triplet      0      0      0      0      0        0      0      0      0      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   0     12     12     12      1        0     12     12     12      1

Unmatched signal(s) in R1 at line(s) 563
Unmatched signal(s) in R2 at line(s) 563
For R1:R2 matched signals only, Q= 99.91%
Result      : Weakly similar.
----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80 -unroll 1 -device 1
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=67108864 T=33554432 P=18874368 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=4194304
MallocHost best_PoTP=16777216
MallocHost bestPoTG=4194304
Allocating tmp data buf for unroll 1
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
CUDA stream priority range: low 0 and high: -1
GPSF 1.618911 2 3.229842
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
.............................................................................................................................................................................................................................................................................................
Best scores written
Out file closed
Cuda free done
Cuda device reset done
Elapsed Time : ...................... 200 seconds
Speed compared to default : ......... 3141 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.81%
----------------------------------------------------------------
Done with 09no16aa.18442.2116.6.33.31.wu
====================================================================
Current WU: 20oc08aa.4777.254820.12.39.5.wu
----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 3396 seconds
----------------------------------------------------------------
Running app with command : .......... MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1
Elapsed Time : ...................... 404 seconds
Speed compared to default : ......... 840 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.97%
----------------------------------------------------------------
Running app with command : .......... setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1
Elapsed Time : ...................... 360 seconds
Speed compared to default : ......... 943 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.97%
----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80 -unroll 1 -device 1
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=67108864 T=33554432 P=18874368 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=4194304
MallocHost best_PoTP=16777216
MallocHost bestPoTG=4194304
Allocating tmp data buf for unroll 1
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
CUDA stream priority range: low 0 and high: -1
GPSF 0.830732 1 1.732689
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
...................................................................................................................................................................................................
Best scores written
Out file closed
Cuda free done
Cuda device reset done
Elapsed Time : ...................... 131 seconds
Speed compared to default : ......... 2592 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0     10     10     10      0        0     10     10     10      0
     Autocorr      0      3      3      3      0        0      3      3      3      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0      0      0      0      0        0      0      0      0      0
      Triplet      0      0      0      0      0        0      0      0      0      0
   Best Spike      0      1      1      1      0        0      1      1      1      0
Best Autocorr      0      1      1      1      0        0      1      1      1      0
Best Gaussian      1      1      1      1      0        1      1      1      1      0
   Best Pulse      0      0      0      0      1        0      0      0      0      1
 Best Triplet      0      0      0      0      0        0      0      0      0      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   1     16     16     16      1        1     16     16     16      1

Unmatched signal(s) in R1 at line(s) 607
Unmatched signal(s) in R2 at line(s) 607
For R1:R2 matched signals only, Q= 99.97%
Result      : Weakly similar.
----------------------------------------------------------------
Done with 20oc08aa.4777.254820.12.39.5.wu
====================================================================
Current WU: 23se08ac.6875.22968.6.33.135.wu
----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 8171 seconds
----------------------------------------------------------------
Running app with command : .......... MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1
Elapsed Time : ...................... 608 seconds
Speed compared to default : ......... 1343 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0      3      3      3      0        0      3      3      3      0
     Autocorr      0      0      0      0      0        0      0      0      0      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0      1      1      1      0        0      1      1      1      0
      Triplet      0      3      3      3      0        0      3      3      3      0
   Best Spike      0      1      1      1      0        0      1      1      1      0
Best Autocorr      0      1      1      1      0        0      1      1      1      0
Best Gaussian      0      0      0      0      1        0      0      0      0      1
   Best Pulse      0      1      1      1      0        0      1      1      1      0
 Best Triplet      0      1      1      1      0        0      1      1      1      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   0     11     11     11      1        0     11     11     11      1

Unmatched signal(s) in R1 at line(s) 500
Unmatched signal(s) in R2 at line(s) 500
For R1:R2 matched signals only, Q= 99.97%
Result      : Weakly similar.
----------------------------------------------------------------
Running app with command : .......... setiathome_8.22_x86_64-pc-linux-gnu__opencl_nvidia_SoG -sbs 256 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512 -period_iterations_num 10 -device 1
Elapsed Time : ...................... 545 seconds
Speed compared to default : ......... 1499 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0      3      3      3      0        0      3      3      3      0
     Autocorr      0      0      0      0      0        0      0      0      0      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0      1      1      1      0        0      1      1      1      0
      Triplet      0      3      3      3      0        0      3      3      3      0
   Best Spike      0      1      1      1      0        0      1      1      1      0
Best Autocorr      0      1      1      1      0        0      1      1      1      0
Best Gaussian      0      0      0      0      1        0      0      0      0      1
   Best Pulse      0      1      1      1      0        0      1      1      1      0
 Best Triplet      0      1      1      1      0        0      1      1      1      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   0     11     11     11      1        0     11     11     11      1

Unmatched signal(s) in R1 at line(s) 500
Unmatched signal(s) in R2 at line(s) 500
For R1:R2 matched signals only, Q= 99.97%
Result      : Weakly similar.
----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda80 -unroll 1 -device 1
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=67108864 T=33554432 P=18874368 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=4194304
MallocHost best_PoTP=16777216
MallocHost bestPoTG=4194304
Allocating tmp data buf for unroll 1
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
CUDA stream priority range: low 0 and high: -1
GPSF 3.034932 3 5.351412
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
.....................................................................................................................................................................................................................................................................................................................................................................................
Best scores written
Out file closed
Cuda free done
Cuda device reset done
Elapsed Time : ...................... 289 seconds
Speed compared to default : ......... 2827 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      1      3      3      3      0        1      3      3      3      0
     Autocorr      0      0      0      0      0        0      0      0      0      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0      1      1      1      0        0      1      1      1      0
      Triplet      0      3      3      3      0        0      3      3      3      0
   Best Spike      1      1      1      1      0        1      1      1      1      0
Best Autocorr      0      1      1      1      0        0      1      1      1      0
Best Gaussian      0      0      0      0      1        0      0      0      0      1
   Best Pulse      0      1      1      1      0        0      1      1      1      0
 Best Triplet      0      1      1      1      0        0      1      1      1      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   2     11     11     11      1        2     11     11     11      1

Unmatched signal(s) in R1 at line(s) 500
Unmatched signal(s) in R2 at line(s) 500
For R1:R2 matched signals only, Q= 99.99%
Result      : Weakly similar.
----------------------------------------------------------------
Done with 23se08ac.6875.22968.6.33.135.wu
====================================================================
Done with Benchmark run! Removing temporary files!
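
A note on reading the "Speed compared to default" figures in the run above: they appear to be the reference app's elapsed time divided by the tested app's elapsed time, expressed as a percentage and truncated to a whole number. For example, 6283 s / 466 s gives 1348 %, and 6283 s / 200 s gives 3141 % rather than 3142 %. The sketch below reproduces that arithmetic; the formula is inferred from the printed numbers, not taken from the benchmark script, and `speed_percent` is a made-up name.

```cpp
#include <cassert>

// Percentage speed of a tested app relative to the reference app,
// truncated to a whole number by integer division. Inferred from the
// printed figures, not from the benchmark script itself.
long speed_percent(long ref_seconds, long app_seconds) {
    assert(ref_seconds >= 0 && app_seconds > 0);
    return ref_seconds * 100 / app_seconds;
}
```

Since the printed times are whole seconds, short runs are the most sensitive to a one-second difference in either figure.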
tbar@TBar-iSETI:~/KWSN-Bench-Linux-MBv7$ ./benchmark
KWSN-Linux-MBbench v2.1.08
Running on TBar-iSETI at Mon 26 Jun 2017 04:46:26 AM UTC
----------------------------------------------------------------
Starting benchmark run...
----------------------------------------------------------------
Listing wu-file(s) in /testWUs :
09no16aa.18442.2116.6.33.31.wu
20oc08aa.4777.254820.12.39.5.wu
23se08ac.6875.22968.6.33.135.wu

Listing executable(s) in /APPS :
setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80
setiathome_x41zi_x86_64-pc-linux-gnu_cuda42

Listing executable in /REF_APPS :
MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu
----------------------------------------------------------------
Current WU: 09no16aa.18442.2116.6.33.31.wu
----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 6283 seconds
----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 -unroll 1 -bs -device 1
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=67108864 T=33554432 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=65536
MallocHost best_PoTP=16777216
MallocHost bestPoTG=4194304
Allocing tmp data buf for unroll 1
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=6291456
GPSF 1.618911 2 3.229842
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 205 seconds
Speed compared to default : ......... 3064 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.81%
----------------------------------------------------------------
Running app with command : .......... setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 -device 1
Elapsed Time : ...................... 490 seconds
Speed compared to default : ......... 1282 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.98%
----------------------------------------------------------------
Done with 09no16aa.18442.2116.6.33.31.wu
====================================================================
Current WU: 20oc08aa.4777.254820.12.39.5.wu
----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 3396 seconds
----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 -unroll 1 -bs -device 1
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=67108864 T=33554432 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=65536
MallocHost best_PoTP=16777216
MallocHost bestPoTG=4194304
Allocing tmp data buf for unroll 1
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=6291456
GPSF 0.830732 1 1.732689
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 133 seconds
Speed compared to default : ......... 2553 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0     10     10     10      0        0     10     10     10      0
     Autocorr      0      3      3      3      0        0      3      3      3      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0      0      0      0      0        0      0      0      0      0
      Triplet      0      0      0      0      0        0      0      0      0      0
   Best Spike      0      1      1      1      0        0      1      1      1      0
Best Autocorr      0      1      1      1      0        0      1      1      1      0
Best Gaussian      1      1      1      1      0        1      1      1      1      0
   Best Pulse      0      0      0      0      1        0      0      0      0      1
 Best Triplet      0      0      0      0      0        0      0      0      0      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   1     16     16     16      1        1     16     16     16      1

Unmatched signal(s) in R1 at line(s) 607
Unmatched signal(s) in R2 at line(s) 607
For R1:R2 matched signals only, Q= 99.97%
Result      : Weakly similar.
----------------------------------------------------------------
Running app with command : .......... setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 -device 1
Elapsed Time : ...................... 364 seconds
Speed compared to default : ......... 932 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.97%
----------------------------------------------------------------
Done with 20oc08aa.4777.254820.12.39.5.wu
====================================================================
Current WU: 23se08ac.6875.22968.6.33.135.wu
----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 8171 seconds
----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi3k+_x86_64-pc-linux-gnu_cuda80 -unroll 1 -bs -device 1
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=67108864 T=33554432 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=65536
MallocHost best_PoTP=16777216
MallocHost bestPoTG=4194304
Allocing tmp data buf for unroll 1
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=6291456
GPSF 3.034932 3 5.351412
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 310 seconds
Speed compared to default : ......... 2635 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      1      3      3      3      0        1      3      3      3      0
     Autocorr      0      0      0      0      0        0      0      0      0      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0      1      1      1      0        0      1      1      1      0
      Triplet      0      3      3      3      0        0      3      3      3      0
   Best Spike      1      1      1      1      0        1      1      1      1      0
Best Autocorr      0      1      1      1      0        0      1      1      1      0
Best Gaussian      0      0      0      0      1        0      0      0      0      1
   Best Pulse      0      1      1      1      0        0      1      1      1      0
 Best Triplet      0      1      1      1      0        0      1      1      1      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   2     11     11     11      1        2     11     11     11      1

Unmatched signal(s) in R1 at line(s) 500
Unmatched signal(s) in R2 at line(s) 500
For R1:R2 matched signals only, Q= 99.99%
Result      : Weakly similar.
----------------------------------------------------------------
Running app with command : .......... setiathome_x41zi_x86_64-pc-linux-gnu_cuda42 -device 1
Elapsed Time : ...................... 607 seconds
Speed compared to default : ......... 1346 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0      3      3      3      0        0      3      3      3      0
     Autocorr      0      0      0      0      0        0      0      0      0      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0      1      1      1      0        0      1      1      1      0
      Triplet      0      3      3      3      0        0      3      3      3      0
   Best Spike      0      1      1      1      0        0      1      1      1      0
Best Autocorr      0      1      1      1      0        0      1      1      1      0
Best Gaussian      0      0      0      0      1        0      0      0      0      1
   Best Pulse      0      1      1      1      0        0      1      1      1      0
 Best Triplet      0      1      1      1      0        0      1      1      1      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   0     11     11     11      1        0     11     11     11      1

Unmatched signal(s) in R1 at line(s) 500
Unmatched signal(s) in R2 at line(s) 500
For R1:R2 matched signals only, Q= 99.99%
Result      : Weakly similar.
----------------------------------------------------------------
Done with 23se08ac.6875.22968.6.33.135.wu
====================================================================
ID: 1875137 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1875138 - Posted: 26 Jun 2017, 6:03:21 UTC
Last modified: 26 Jun 2017, 6:05:33 UTC

Correct, Best Pulse has nothing to do with Best Gaussian. (Both can certainly be influenced separately by Unroll and, if not yet corrected by Petri, will show up as a random, rare problem event.)

The fact that there is indeed a 'cluster' of different things going on is precisely why stock Win32 CPU is considered the reference (not OpenCL, not Cuda, not Linux, not AK, or anything else).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1875138 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1875239 - Posted: 26 Jun 2017, 16:48:43 UTC - in response to Message 1875111.  
Last modified: 26 Jun 2017, 16:51:35 UTC

Hmmm, AK8 branch *might* be missing Joe's fix from ~2011? (svn posted earlier in thread):

gaussfit.cpp (stock seti_boinc branch):
    report = chisqOK // chisqOK is (ChiSq <= swi.analysis_cfg.gauss_chi_sq_thresh)
        && (gi.g.peak_power >= gi.g.mean_power * swi.analysis_cfg.gauss_peak_power_thresh)
        && (gi.g.null_chisqr >= swi.analysis_cfg.gauss_null_chi_sq_thresh);
    if (gaussian_count==0||report) {
        gi.score = score_offset
            +lcgf(0.5*gauss_dof,std::max(gi.g.chisqr*0.5*gauss_bins,0.5*gauss_dof+1))
            -lcgf(0.5*null_dof,std::max(gi.g.null_chisqr*0.5*gauss_bins,0.5*null_dof+1));
    }
    // Only include "real" Gaussians (those meeting the chisqr threshold)
    // in the best Gaussian display.
    if (gi.score > best_gauss->score && chisqOK) {
        *best_gauss = gi;

....


The special appears to have it, as does Cuda baseline.


opt build more complex in this area:
    BOOLEAN chisq = (ChiSq <= swi.analysis_cfg.gauss_chi_sq_thresh);
    ...
    if (chisq) {
#endif
        BOOLEAN newbest=false, report;
        //R: same optimization as for GPU build: if there is reportable Gaussian already -
        //R: skip score calculation for all except new reportable Gaussians
        //R: TODO: carefully check if it's valid assumption!
        report = chisq && (PeakPower >= TrueMean * PoTInfo.GaussPeakPowerThresh) &&
                 (null_ChiSq >= swi.analysis_cfg.gauss_null_chi_sq_thresh);
        if(gaussian_count==0 || report){
            score = calc_GaussFit_score(ChiSq,null_ChiSq);
            newbest = chisq && (score > best_gauss->score);
        }
#if USE_COUNTERS
        Counter<Gaussian_skip6_low_power>::update(!(PeakPower >= TrueMean * PoTInfo.GaussPeakPowerThresh));
        //fprintf(stderr,"best_score=%.7g, score=%.7g\n",best_gauss->score,score);
#endif
#ifdef BOINC_APP_GRAPHICS
        if (newbest || report || graphics) {
#else
#if USE_COUNTERS
        if(! (newbest||report) ){
            Counter<Gaussian_miss>::update(1);
        }
#endif
        if (newbest || report) {
#endif
....

But chi-square check seems to be present.
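
To make the effect of that chi-square gate concrete, here is a minimal standalone C++ sketch. It is not code from any of the apps: GaussCandidate, pick_best_gaussian and the threshold value are hypothetical names and numbers chosen for illustration, and the score is taken as given rather than computed via lcgf. It only shows how requiring the chi-square test before updating the best Gaussian can change which candidate ends up as "best".

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified candidate record (illustration only).
struct GaussCandidate {
    double chisqr;  // fit chi-square, lower means a better fit
    double score;   // ranking score, higher is better
};

// Returns the index of the best candidate, or -1 if none qualifies.
// With require_chisq_ok set, only candidates with chisqr <= chisq_thresh
// may become "best", mirroring the `&& chisqOK` test in the stock excerpt.
int pick_best_gaussian(const std::vector<GaussCandidate>& cands,
                       double chisq_thresh, bool require_chisq_ok) {
    int best = -1;
    double best_score = 0.0;
    for (std::size_t i = 0; i < cands.size(); ++i) {
        const bool chisq_ok = cands[i].chisqr <= chisq_thresh;
        if (require_chisq_ok && !chisq_ok) continue;  // not a "real" Gaussian
        if (best < 0 || cands[i].score > best_score) {
            best = static_cast<int>(i);
            best_score = cands[i].score;
        }
    }
    return best;
}
```

With candidates {chisqr, score} = {1.0, 2.0}, {9.0, 5.0}, {1.2, 3.0} and an assumed threshold of 1.42, the gated version picks the third candidate, while the ungated version picks the second one, whose score is higher but whose fit fails the chi-square test. A build without the gate could therefore disagree on Best Gaussian while all reported signals still match.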
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1875239 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1875249 - Posted: 26 Jun 2017, 17:11:39 UTC - in response to Message 1875239.  
Last modified: 26 Jun 2017, 17:14:11 UTC

...
But chi-square check seems to be present.

Saw that part, but I'm not sure in which codepaths it's active (e.g. SoG). If it is active in all paths, then I will have to arrange a bench with the 8.00 Win32 reference, then call for suspects. That will have to be after Wednesday for me. [I'm not clear on whether this particular Gaussian rabbit-hole has enough impact to be concerned about, but understanding it would be good for me]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1875249 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1875253 - Posted: 26 Jun 2017, 17:22:10 UTC - in response to Message 1875249.  

SoG has its own parallelized reduction for Gaussians (it should implement the same logic though).
And what worries me is the difference between SoG and non-SoG OpenCL results. That's definitely worth checking once I have easy access to hardware for it.
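
As a rough illustration of what "same logic" means for a parallel reduction, here is a small C++ sketch. It is not the SoG kernel: Best, combine, sequential_best and chunked_best are hypothetical names, and the chunks are processed on the CPU in a loop rather than on a GPU. The point is only that a best-score reduction agrees with a sequential scan when the combine step breaks ties the same way (lowest index wins, as a strict '>' comparison would), regardless of the order in which partial results are merged.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical partial result of a best-score search.
struct Best {
    double score;
    std::size_t index;
    bool valid;
};

// Merge two partial results: higher score wins; on a tie the lower index
// wins, which is what a sequential scan using strict '>' would keep.
Best combine(const Best& a, const Best& b) {
    if (!a.valid) return b;
    if (!b.valid) return a;
    if (a.score != b.score) return a.score > b.score ? a : b;
    return a.index < b.index ? a : b;
}

// Reference: plain sequential scan.
Best sequential_best(const std::vector<double>& scores) {
    Best best{0.0, 0, false};
    for (std::size_t i = 0; i < scores.size(); ++i)
        if (!best.valid || scores[i] > best.score) best = {scores[i], i, true};
    return best;
}

// Chunked reduction: one partial per chunk, merged in reverse order to
// mimic partial results arriving in a different order than the data.
Best chunked_best(const std::vector<double>& scores, std::size_t chunk) {
    if (chunk == 0) chunk = 1;
    std::vector<Best> partials;
    for (std::size_t start = 0; start < scores.size(); start += chunk) {
        const std::size_t end = std::min(start + chunk, scores.size());
        Best local{0.0, 0, false};
        for (std::size_t i = start; i < end; ++i)
            local = combine(local, Best{scores[i], i, true});
        partials.push_back(local);
    }
    Best result{0.0, 0, false};
    for (auto it = partials.rbegin(); it != partials.rend(); ++it)
        result = combine(result, *it);
    return result;
}
```

If the combine step instead kept whichever partial arrived last on a tie, the two functions could return different indices for the same data, which is one way two otherwise equivalent builds could disagree on a "best" signal.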
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1875253 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1875255 - Posted: 26 Jun 2017, 17:25:41 UTC - in response to Message 1875253.  

SoG has its own parallelized reduction for Gaussians (it should implement the same logic though).
And what worries me is the difference between SoG and non-SoG OpenCL results. That's definitely worth checking once I have easy access to hardware for it.


Yes, I've lost track of which apps match and which don't now, and have yet to examine in detail Petri's fix for the pulse race condition also. So plenty to examine as the dust falls out.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1875255 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1875263 - Posted: 26 Jun 2017, 19:14:35 UTC - in response to Message 1875255.  
Last modified: 26 Jun 2017, 19:16:52 UTC

SoG has its own parallelized reduction for Gaussians (it should implement the same logic though).
And what worries me is the difference between SoG and non-SoG OpenCL results. That's definitely worth checking once I have easy access to hardware for it.


Yes, I've lost track of which apps match and which don't now, and have yet to examine in detail Petri's fix for the pulse race condition also. So plenty to examine as the dust falls out.
Petri's fix for the race condition fixed the Bad Pulse Count that was being caused by the Unroll function. However, there has Always been a Bad Best Pulse that is even rarer than the Bad Pulse. I assumed the two were related, but apparently they aren't: the Bad Pulse is fixed, yet the Bad Best Pulse remains and is Not solved by using Unroll 1 the way the Bad Pulse was. It seems both CUDA and OpenCL have a Best Gaussian Problem with some rare tasks. Only CUDA has the Bad Best Pulse, and it is even rarer.
That's the way I see it.
ID: 1875263 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1875266 - Posted: 26 Jun 2017, 19:25:24 UTC

Hi,
I'm here just for a quick peek.
The special pulse find does a scan with unroll N (N depending on autotune or the user-set limit) for each CPU-code icfft round. If the scan finds a suspected pulse, the round is run again with unroll 1 for that icfft. Finding a pulse, or an even better not-yet-reported best, is a rare event. For real pulses it is simply rare; for best pulses it happens in the first round and then more and more infrequently, since the bar for a best that has not yet been reported rises after each one found.
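
The scan-then-recheck flow described above can be sketched in plain C++ as follows. This is not the special app's code: run_rounds, Summary, and the two scan callbacks are hypothetical stand-ins (fast_scan for the unroll N pass, exact_scan for the unroll 1 pass), each returning a single peak value per round. The sketch also assumes the fast pass may over-report but never misses a real pulse.

```cpp
#include <functional>
#include <vector>

// Hypothetical summary of a run (illustration only).
struct Summary {
    int reruns = 0;                // rounds repeated with the exact pass
    std::vector<double> reported;  // peaks at or above the report threshold
    double best = 0.0;             // best peak seen so far (the rising bar)
};

// For each round: run the fast pass; only if it looks like a reportable
// pulse or a new best, repeat the round with the exact pass and record
// the exact pass's value instead of the fast pass's value.
Summary run_rounds(int n_rounds, double report_thresh,
                   const std::function<double(int)>& fast_scan,
                   const std::function<double(int)>& exact_scan) {
    Summary s;
    for (int r = 0; r < n_rounds; ++r) {
        const double suspect = fast_scan(r);
        if (suspect < report_thresh && suspect <= s.best) continue;  // nothing new
        ++s.reruns;
        const double peak = exact_scan(r);  // authoritative value
        if (peak >= report_thresh) s.reported.push_back(peak);
        if (peak > s.best) s.best = peak;   // the bar for "best" rises
    }
    return s;
}
```

Because the bar for a new best only rises, reruns triggered by best candidates cluster in the early rounds and become less frequent later, which matches the behaviour described in the paragraph above. Note that only the exact pass's values are recorded, so a sequential log written during the fast pass could still show detections that never reach the result.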

As far as I remember, the gauss-find has not been touched for a long time in the special code. My memory may be short. I guess the gauss-find code is working as it should. If it is not, that is a separate problem.

I know there is a problem with my code reporting over 20 pulses at identical time with a small difference in frequency. That is an extremely rare event. And it always happens at 46.something.

Petri.
I'm following and will provide help when needed later this summer.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1875266 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1875274 - Posted: 26 Jun 2017, 20:28:20 UTC - in response to Message 1875266.  

I know there is a problem with my code reporting over 20 pulses at identical time with a small difference in frequency. That is an extremely rare event. And it always happens at 46.something.
That sounds like the problem I was running into with my GTX 780 (now replaced by a GTX 980), which I detailed in Message 1864874. In fact, with the Cuda8.0 Special App, it was happening quite frequently. Dialing back to the Cuda6.5 version, it became rare, but didn't go away entirely. It has never (yet) shown up on any of my other cards (GTX 750Ti, GTX 960, GTX 980). You'd need to find somebody else running a 780 to see if the problem is common to that model or unique to my card.
ID: 1875274 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1875275 - Posted: 26 Jun 2017, 20:33:03 UTC - in response to Message 1875266.  


I know there is a problem with my code reporting over 20 pulses at identical time with a small difference in frequency. That is an extremely rare event. And it always happens at 46.something.

Could it be solved in the same fashion, by re-processing after discovery?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1875275 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1875279 - Posted: 26 Jun 2017, 20:45:27 UTC - in response to Message 1875274.  

I know there is a problem with my code reporting over 20 pulses at identical time with a small difference in frequency. That is an extremely rare event. And it always happens at 46.something.
That sounds like the problem I was running into with my GTX 780 (now replaced by a GTX 980), which I detailed in Message 1864874. In fact, with the Cuda8.0 Special App, it was happening quite frequently. Dialing back to the Cuda6.5 version, it became rare, but didn't go away entirely. It has never (yet) shown up on any of my other cards (GTX 750Ti, GTX 960, GTX 980). You'd need to find somebody else running a 780 to see if the problem is common to that model or unique to my card.


I'll put my 780 back in the Mac Pro on the weekend. Its unique Hyper-Q feature might be in play, differs by OS and Cuda version in subtle implementation ways.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1875279 · Report as offensive


 