Linux CUDA 'Special' App finally available, featuring Low CPU use

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1834548 - Posted: 7 Dec 2016, 5:43:36 UTC
Last modified: 7 Dec 2016, 5:57:56 UTC

Now available for GPUs with a Compute Capability of 3.2 and above, https://en.wikipedia.org/wiki/CUDA#GPUs_supported
Some examples running on a 750Ti:
Shorty, AR = 3.133362; Run time: 2 min 52 sec CPU time: 20 sec
Mid, AR = 0.447852; Run time: 6 min 54 sec CPU time: 2 min 28 sec
BLC3, AR = 0.006417; Run time: 14 min 12 sec CPU time: 35 sec
http://setiathome.berkeley.edu/results.php?hostid=7769537&offset=160
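
The "low CPU use" in the title can be read off the three examples above by dividing CPU time by run time. A minimal sketch of that arithmetic follows; the helper names are made up for illustration, and the timings are the ones quoted above.

```python
import re

def to_seconds(text):
    """Convert a duration such as '2 min 52 sec' or '20 sec' to seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"(\d+)\s*sec", text)
    total = int(minutes.group(1)) * 60 if minutes else 0
    total += int(seconds.group(1)) if seconds else 0
    return total

def cpu_share(run_time, cpu_time):
    """CPU time as a percentage of wall-clock run time, rounded to 0.1."""
    return round(100.0 * to_seconds(cpu_time) / to_seconds(run_time), 1)

# Timings copied from the three 750Ti examples in this post.
examples = {
    "Shorty": ("2 min 52 sec", "20 sec"),
    "Mid": ("6 min 54 sec", "2 min 28 sec"),
    "BLC3": ("14 min 12 sec", "35 sec"),
}

for name, (run, cpu) in examples.items():
    print(f"{name}: CPU busy for {cpu_share(run, cpu)}% of the run")
```

On these figures the CPU is busy for roughly 12%, 36% and 4% of the run respectively.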

Download and Install instructions are here, Linux CUDA 6 Special App
ID: 1834548
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1834554 - Posted: 7 Dec 2016, 6:07:01 UTC - in response to Message 1834548.  

Is the Pulse detection issue solved in this version?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1834554
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1834556 - Posted: 7 Dec 2016, 6:18:12 UTC - in response to Message 1834554.  
Last modified: 7 Dec 2016, 6:21:35 UTC

It's passed all the tests I've run with known 'problem' WUs. If you have a certain Test WU you'd like to see tested, post it and I'll try it with the benchmark App.
It's been running a week without a single Error or Invalid, and has a Lower Inconclusive count than any other 'Special' version. Something has been solved.
SETI@home v8 (anonymous platform, NVIDIA GPU)
Number of tasks completed: 1018
Consecutive valid tasks: 1185
https://setiathome.berkeley.edu/host_app_versions.php?hostid=7769537
ID: 1834556
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13866
Credit: 208,696,464
RAC: 304
Australia
Message 1834559 - Posted: 7 Dec 2016, 6:42:30 UTC - in response to Message 1834556.  
Last modified: 7 Dec 2016, 6:49:19 UTC

It's passed all the tests I've run with known 'problem' WUs. If you have a certain Test WU you'd like to see tested, post it and I'll try it with the benchmark App.
It's been running a week without a single Error or Invalid, and has a Lower Inconclusive count than any other 'Special' version. Something has been solved.
SETI@home v8 (anonymous platform, NVIDIA GPU)
Number of tasks completed: 1018
Consecutive valid tasks: 1185
https://setiathome.berkeley.edu/host_app_versions.php?hostid=7769537


Average processing rate: 306.75 GFLOPS
For a GTX 750Ti, very nice.

And inconclusives are around 7.66%; not quite the 5% target, but damn close.
Very, very nice.
Grant
Darwin NT
ID: 1834559
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1834564 - Posted: 7 Dec 2016, 7:19:40 UTC - in response to Message 1834554.  

With the zi+a sources, can confirm I was never able to reproduce the pulse finding issue spotted on Petri's machine/build, on my Windows build. It has a number of issues to solve of its own (seemingly Windows specific). There are differences in the two codebases I have not examined in detail.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1834564
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13866
Credit: 208,696,464
RAC: 304
Australia
Message 1834565 - Posted: 7 Dec 2016, 7:23:18 UTC - in response to Message 1834564.  

With the zi+a sources, can confirm I was never able to reproduce the pulse finding issue spotted on Petri's machine/build, on my Windows build. It has a number of issues to solve of its own (seemingly Windows specific). There are differences in the two codebases I have not examined in detail.

Does it basically come down to precision/rounding issues due to differences in the different libraries used on the different Operating Systems?
Grant
Darwin NT
ID: 1834565
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1834567 - Posted: 7 Dec 2016, 7:36:30 UTC - in response to Message 1834564.  

With the zi+a sources, can confirm I was never able to reproduce the pulse finding issue spotted on Petri's machine/build, on my Windows build. It has a number of issues to solve of its own (seemingly Windows specific). There are differences in the two codebases I have not examined in detail.

My guess is there is some difference in the Paths, at least with the OSX and Linux builds. The OSX builds are still running about twice as many Inconclusives as the Linux build even though they are almost exactly the same builds. Chris was running the original p_zi and running around 45 Inconclusives, now the machine is climbing. I expect it to level out around where my Mac is running. They should be running about the same with p_zi+ as they were with p_zi.
ID: 1834567
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1834575 - Posted: 7 Dec 2016, 8:43:50 UTC - in response to Message 1834565.  
Last modified: 7 Dec 2016, 8:55:30 UTC

With the zi+a sources, can confirm I was never able to reproduce the pulse finding issue spotted on Petri's machine/build, on my Windows build. It has a number of issues to solve of its own (seemingly Windows specific). There are differences in the two codebases I have not examined in detail.

Does it basically come down to precision/rounding issues due to differences in the different libraries used on the different Operating Systems?


My initial attempts using my normal compiler/precision approaches yielded the right ballpark accuracy-wise, though that has since been broken during various attempts to fix other issues. My problems with the codebase on Windows have been related more to the different driver demands on this OS, with respect to execution times of kernel launches.

In a nutshell, Windows drivers have DirectX based gaming oriented optimisations the other OSes don't. These hidden 'features' fuse kernels in their streams into single large launches, removing synchronisation that really needs to be there. I've experimented with some methods to limit/mitigate this, which have worked to some extent, though introduced other instability along the way (to be isolated).

Switching to the new 1050ti verified the same behaviour as I was seeing on my GTX980 on Win7, and running the 1050ti on generic baseline since last night it looks like the instability is gone (so far). [...so specific to the alpha code & my breakages]. It just means, as I lay down some of the new infrastructure for x42, I'll need to include some comprehensive timing and debug code aimed at improved automatic scaling and giving some control. That's what was intended for x42 anyway, just Petri's contributions change the direction a bit. Ideally I'd like the compatibility broadened along the way, since stock integration at any level requires as broad a support as possible (a Boinc server/scheduler limitation). The breaking changes by Cuda version & deprecated devices are complicating what the next generation will look like.

My next dev run in the generic stock direction will end up polymorphic, so as to support multiple Cuda versions/devices. The current alpha code embedded in a clean framework adapted from stock CPU, and 'pluginised', is likely to supplant x41 baseline, and serve as a platform for my Vulkan compute kernels.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1834575
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1834576 - Posted: 7 Dec 2016, 8:53:28 UTC - in response to Message 1834567.  
Last modified: 7 Dec 2016, 8:56:17 UTC

With the zi+a sources, can confirm I was never able to reproduce the pulse finding issue spotted on Petri's machine/build, on my Windows build. It has a number of issues to solve of its own (seemingly Windows specific). There are differences in the two codebases I have not examined in detail.

My guess is there is some difference in the Paths, at least with the OSX and Linux builds. The OSX builds are still running about twice as many Inconclusives as the Linux build even though they are almost exactly the same builds. Chris was running the original p_zi and running around 45 Inconclusives, now the machine is climbing. I expect it to level out around where my Mac is running. They should be running about the same with p_zi+ as they were with p_zi.


Yeah, whatever that difference is will likely turn up as I assemble the various pieces (the alpha sources, new buildsystem, and cleaned out codebase). Now having the parts here to turn my Mac Pro into a triple OS dev machine, should aid tracking down any compiler/library/build differences. Just a matter of a lot of rejigging of development environment to do first, which will be easier now that work will let up for Christmas.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1834576
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1834615 - Posted: 7 Dec 2016, 13:25:34 UTC

It would also be good if those who install the "special" app listed the corresponding hosts here. It would also be good to have those hosts join beta (with the "special" app too).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1834615
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1834678 - Posted: 7 Dec 2016, 20:23:43 UTC - in response to Message 1834575.  


In a nutshell, Windows drivers have DirectX based gaming oriented optimisations the other OSes don't. These hidden 'features' fuse kernels in their streams into single large launches, removing synchronisation that really needs to be there. I've experimented with some methods to limit/mitigate this, which have worked to some extent, though introduced other instability along the way (to be isolated).

Jason, this "fused kernels" issue in Windows .... is that with the latest optimized for gaming Windows drivers?? Was/Is the problem present with much older Windows drivers? I'm thinking that BOINC users not interested in gaming but in stable and productive systems avoid to a great extent getting on the "latest Windows drivers" carousel that are released seemingly every other day to coincide with the latest popular game. I'm sure a lot of us just use the earliest stable driver that supports the board architecture of the cards we use.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1834678
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1834732 - Posted: 8 Dec 2016, 1:13:51 UTC - in response to Message 1834678.  
Last modified: 8 Dec 2016, 1:57:16 UTC


In a nutshell, Windows drivers have DirectX based gaming oriented optimisations the other OSes don't. These hidden 'features' fuse kernels in their streams into single large launches, removing synchronisation that really needs to be there. I've experimented with some methods to limit/mitigate this, which have worked to some extent, though introduced other instability along the way (to be isolated).

Jason, this "fused kernels" issue in Windows .... is that with the latest optimized for gaming Windows drivers?? Was/Is the problem present with much older Windows drivers? I'm thinking that BOINC users not interested in gaming but in stable and productive systems avoid to a great extent getting on the "latest Windows drivers" carousel that are released seemingly every other day to coincide with the latest popular game. I'm sure a lot of us just use the earliest stable driver that supports the board architecture of the cards we use.


The Cuda specific portion optimisations, whereby implicit synchronisations can be optimised out by the driver, trace back a fair way. Slightly earlier than whichever driver it was where 'trusty old' Cuda 3.2 magically became incompatible with later gen cards. At that point I was prompted to alert Eric to block stock distribution of that build if there was any Maxwell [Even Kepler now I think back...] or Newer GPU in the system (Which he promptly did).

Most relevant seems to be that the Cuda version switched to an LLVM based compiler after that, which resides in the driver, replacing the old Cuda one. ~Cuda 4/4.1 were too buggy to use here, though 4.2 and 5 vastly improved the picture. The mechanism is that the drivers ignore the embedded PTX binaries in favour of a JIT recompile that gets cached in %APPDATA%\NVIDIA\ComputeCache

Since the 'old school' Cuda code relies on synchronisation points that get 'optimised out', the Cuda 3.2, 4.2 and 5.0 Cuda sources are identical, and debugging reveals underlying DirectX calls that won't be in play on Linux or Mac, it just points to nVidia's deprecation of Pre-Fermi architecture as a vector for introducing new bugs, with deprecation of Fermi class and x86 platform starting with Cuda [~6.5-8.0].

It's that complex round of breaking changes and deprecations with Cuda, whereby Petri wisely chooses to go the path of least resistance in supporting newest generations only, taking advantage of the improved streaming optimisation capabilities. At the same time Boinc limitations in identifying mixed GPU systems block stock integration of the newest forms. (i.e. What app to send if someone puts a Fermi class and Pascal in the same system).

The 'obvious' solution then is to engineer dispatch into the next generation of Cuda-enabled applications, such that internal regression tests can choose the code based on what works. With Raistmer having chosen the OpenCL squillion build route, which is handling things nicely for GBT at the moment, it does give some breathing room for the daunting amount of software engineering to take place. We have the Stock CPU example of such a mechanism working for a different context. For the Generic stock distribution route that means I personally become tied up in creating new supporting infrastructure. Fortunately that doesn't mean non-stock third party development becomes impeded, though it does mean the alpha code here will be very situation specific, and have quirks across the platforms & devices that prevent widespread adoption for the time being.
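
As a rough illustration of the dispatch idea described above (not the actual x42 design): each device gets the first code path that supports its compute capability and passes a self-test. All names below are hypothetical, the self-tests are placeholders for real internal regression tests, and the only figure taken from this thread is the 3.2 compute capability floor of the 'special' app.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class CodePath:
    name: str                       # hypothetical label for a kernel set
    min_cc: Tuple[int, int]         # lowest compute capability supported
    self_test: Callable[[], bool]   # stand-in for an internal regression test

def pick_path(device_cc: Tuple[int, int],
              paths: List[CodePath]) -> Optional[CodePath]:
    """Return the first path (listed fastest first) that supports the
    device and passes its self-test, or None if nothing is usable."""
    for path in paths:
        if device_cc >= path.min_cc and path.self_test():
            return path
    return None

# Two made-up paths: a newer streamed one and an older generic one.
paths = [
    CodePath("streamed", (3, 2), lambda: True),
    CodePath("baseline", (2, 0), lambda: True),
]

# A mixed system: one Fermi-class device (2.1) and one Pascal device (6.1).
for cc in [(2, 1), (6, 1)]:
    chosen = pick_path(cc, paths)
    print(cc, "->", chosen.name if chosen else "no usable path")
```

Choosing per device inside the application, rather than per host at the scheduler, is what would sidestep the mixed-GPU question of which app to send.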
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1834732
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1834756 - Posted: 8 Dec 2016, 6:34:45 UTC - in response to Message 1834732.  

Thanks for the technical explanation, Jason. I was hoping for development along two paths: one where XXX.app for Windows drivers <<37X.XX or whatever had a disclaimer to run XXX.app only on the XXX family of cards, and another fork where ZZZ.app for Windows drivers >> 38Z.ZZ or whatever had a disclaimer not to run on anything less than the ZZZ family of cards. I was thinking that might simplify your development, since otherwise you have to develop an app for all possible card architectures and all the older, current and future drivers. I see that is not easy, or in fact desirable. At some point in time I would think that the developers have to just deprecate support for old hardware. The manufacturers do it for their latest drivers. Why can't the BOINC developers? Again, thanks for the explanation of your development methodology.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1834756
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1834767 - Posted: 8 Dec 2016, 7:36:05 UTC - in response to Message 1834756.  
Last modified: 8 Dec 2016, 7:36:20 UTC

At some point in time I would think that the developers have to just deprecate support for old hardware. The manufacturers do it for their latest drivers. Why can't the BOINC developers?

Because of opposite goals.
The vendor's goal is to take as much of your money as it can.
The goal of BOINC-based project developers is to let you use what you have without spending additional money.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1834767
-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1834772 - Posted: 8 Dec 2016, 8:33:32 UTC
Last modified: 8 Dec 2016, 8:58:28 UTC

TBAR: Have you compared the speed of your compile to Petri's different builds? Good that the invalid rates are down, but as we all know by now, we can't eliminate the way the validator works either.
The faster you produce results, the higher the inconclusive ratio that host seems to have, until it tapers off.

What I write below is my theory:

By that I mean, if you have a slow host that doesn't process that many WUs per day, you tend to end up crunching units that your wingman has already crunched. If the validator compares the result from a (I call it Petri Cuda) WU with the other result that has already been returned, you get a validation pass, both get credit, and the WU is soon cleared from the system (Cannot find the WU), as we can see once they have been processed, and thus the invalid ratio is low!

When you have the opposite, an ultra-speedy system that crunches thousands of WUs per day, you will get more inconclusives, because that machine is so fast that it returns its work first and waits for the other computers to catch up. When they start to return WUs and the overflowed results are pouring in, that speedy machine's inconclusive ratio will rise faster than the others' as well.

/End of Theory

What actually matters is of course that the code does the work properly! Q ratio as high as possible in various tasks, GBT, High/Low AR etc. etc., you all know that part, but the value TBar refers to as "Consecutive valid tasks",
that one is the main thing to keep track of in my mind, not the inconclusive part, because the more parallel the code, the more inconclusives we will get, whether it's a CPU, GPU, FPGA, PS4, yada.

Thanks for your work TBar, and thank you Petri for going the brute force route of taking advantage of newer hardware that made this leap. Latest SoG is also speedy as hell! The 1080 of mine is utilized better than running multiple parallel Cudas now! Thank you Raistmer, Jason, Urs and all you Alpha/Beta testers and others who have contributed and brought us to where we are at the moment! The list of ppl would get long.

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1834772
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1834774 - Posted: 8 Dec 2016, 8:43:14 UTC - in response to Message 1834767.  

At some point in time I would think that the developers have to just deprecate support for old hardware. The manufacturers do it for their latest drivers. Why can't the BOINC developers?

Because of opposite goals.
The vendor's goal is to take as much of your money as it can.
The goal of BOINC-based project developers is to let you use what you have without spending additional money.


Agreed. The $ amount of nVidia cards already retired, given away to the needy, or on my shelf collecting dust, is already worth way more than my car. IMO AMD's got a pretty golden opportunity right now, with a vacuum created by NV and Intel gouging.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1834774
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1834778 - Posted: 8 Dec 2016, 9:07:38 UTC - in response to Message 1834772.  

TBAR: Have you compared the speed of your compile compared to Petris different builds?.....

I was one of the First testers, been at it for over a Year now. I've tested hundreds of builds during that Year, right to p_zi3i. I haven't been sent any newer version than zi3i.
Your other theory doesn't take into account the use of Offline Benchmarking. The Benchmark App will identify the source of the problem. I just ran another series of tests which show the Pulsefind Error that was addressed in zi3f is still present, it's just a little better in the zi+ build than the zi3i build.
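
For reading the benchmark output further down in this post: the "Speed compared to default" figure appears to be the saved reference app's elapsed time divided by the tested app's elapsed time, as a percentage truncated to a whole number. That is inferred from the numbers in the log, not taken from the benchmark script itself, and the function name below is made up. A minimal sketch that reproduces the logged values:

```python
def speed_vs_default(ref_seconds, app_seconds):
    """Reference elapsed time over app elapsed time, as a whole percent.
    Integer division truncates, which matches the figures in the log."""
    return ref_seconds * 100 // app_seconds

# (reference seconds, app seconds) pairs copied from the log in this post.
runs = [
    (8738, 387), (8738, 411),   # 18au09aa: zi3i, zi+
    (3495, 168), (3495, 173),   # 18dc09ah: zi3i, zi+
    (6957, 793), (6957, 828),   # blc3 ..._0005: zi3i, zi+
]

for ref, app in runs:
    print(f"{ref} s / {app} s -> {speed_vs_default(ref, app)} %")
```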

The interesting part is the Linux builds are different than the Mac builds...they shouldn't be. The same Work Units give different results when run with the same build on a different platform.
Compare the results below with the results from the Mac, http://setiathome.berkeley.edu/forum_thread.php?id=78569&postid=1834748#1834748
tbar@TBar-iSETI:~$ cd '/home/tbar/KWSN-Bench-Linux-MBv7_v2.01.08' 
tbar@TBar-iSETI:~/KWSN-Bench-Linux-MBv7_v2.01.08$ ./benchmark
KWSN-Linux-MBbench v2.1.08
Running on TBar-iSETI at Thu 08 Dec 2016 07:55:11 AM UTC
----------------------------------------------------------------
Starting benchmark run...
----------------------------------------------------------------
Listing wu-file(s) in /testWUs :
18au09aa.4654.85539.7.34.226.wu
18dc09ah.26284.16432.6.33.125.wu
blc3_2bit_guppi_57424_80774_HIP9480_0005.24846.0.17.26.134.vlar.wu
blc3_2bit_guppi_57424_81430_HIP9480_0007.5224.831.17.26.71.vlar.wu

Listing executable(s) in /APPS :
setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75
setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda60

Listing executable in /REF_APPS :
MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu
----------------------------------------------------------------
Current WU: 18au09aa.4654.85539.7.34.226.wu

----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 8738 seconds
----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 3.109209 3 5.412199
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 387 seconds
Speed compared to default : ......... 2257 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0      5      5      5      0        0      5      5      5      0
     Autocorr      0      0      0      0      0        0      0      0      0      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0      4      4      4      1        0      4      4      4      1
      Triplet      0      0      0      0      0        0      0      0      0      0
   Best Spike      0      1      1      1      0        0      1      1      1      0
Best Autocorr      0      1      1      1      0        0      1      1      1      0
Best Gaussian      0      1      1      1      0        0      1      1      1      0
   Best Pulse      0      0      0      0      1        0      0      0      0      1
 Best Triplet      0      0      0      0      0        0      0      0      0      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   0     12     12     12      2        0     12     12     12      2

Unmatched signal(s) in R1 at line(s) 422 611
Unmatched signal(s) in R2 at line(s) 422 611
For R1:R2 matched signals only, Q= 99.96%
Result      : Weakly similar.

----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda60 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 3.109209 3 5.412199
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 411 seconds
Speed compared to default : ......... 2126 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0      5      5      5      0        0      5      5      5      0
     Autocorr      0      0      0      0      0        0      0      0      0      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0      4      4      4      1        0      4      4      4      1
      Triplet      0      0      0      0      0        0      0      0      0      0
   Best Spike      0      1      1      1      0        0      1      1      1      0
Best Autocorr      0      1      1      1      0        0      1      1      1      0
Best Gaussian      0      1      1      1      0        0      1      1      1      0
   Best Pulse      0      0      0      0      1        0      0      0      0      1
 Best Triplet      0      0      0      0      0        0      0      0      0      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   0     12     12     12      2        0     12     12     12      2

Unmatched signal(s) in R1 at line(s) 422 611
Unmatched signal(s) in R2 at line(s) 422 611
For R1:R2 matched signals only, Q= 99.96%
Result      : Weakly similar.

----------------------------------------------------------------
Done with 18au09aa.4654.85539.7.34.226.wu

====================================================================
Current WU: 18dc09ah.26284.16432.6.33.125.wu

----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 3495 seconds
----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 0.498642 0 1.000000
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 168 seconds
Speed compared to default : ......... 2080 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0      0      0      0      0        0      0      0      0      0
     Autocorr      0      0      0      0      0        0      0      0      0      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0      0      0      0      1        0      0      0      0      1
      Triplet      0      3      3      3      0        0      3      3      3      0
   Best Spike      0      1      1      1      0        0      1      1      1      0
Best Autocorr      1      1      1      1      0        1      1      1      1      0
Best Gaussian      1      1      1      1      0        1      1      1      1      0
   Best Pulse      0      0      0      0      1        0      0      0      0      1
 Best Triplet      0      1      1      1      0        0      1      1      1      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   2      7      7      7      2        2      7      7      7      2

Unmatched signal(s) in R1 at line(s) 393 473
Unmatched signal(s) in R2 at line(s) 393 473
For R1:R2 matched signals only, Q= 100.0%
Result      : Weakly similar.

----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda60 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 0.498642 0 1.000000
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 173 seconds
Speed compared to default : ......... 2020 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.70%

----------------------------------------------------------------
Done with 18dc09ah.26284.16432.6.33.125.wu

====================================================================
Current WU: blc3_2bit_guppi_57424_80774_HIP9480_0005.24846.0.17.26.134.vlar.wu

----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 6957 seconds
----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 603.228455 603 977.571899
Sigma > GaussTOffsetStop: 603 > -539
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 793 seconds
Speed compared to default : ......... 877 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.25%

----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda60 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 603.228455 603 977.571899
Sigma > GaussTOffsetStop: 603 > -539
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 828 seconds
Speed compared to default : ......... 840 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.25%

----------------------------------------------------------------
Done with blc3_2bit_guppi_57424_80774_HIP9480_0005.24846.0.17.26.134.vlar.wu

====================================================================
Current WU: blc3_2bit_guppi_57424_81430_HIP9480_0007.5224.831.17.26.71.vlar.wu

----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 496 seconds
----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 645.210266 645 1045.605347
Sigma > GaussTOffsetStop: 645 > -581
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 47 seconds
Speed compared to default : ......... 1055 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      2     15     15     15      0        2     15     15     15      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0     11     11     12      1        0     11     11     12      1
      Triplet      0      2      2      2      0        0      2      2      2      0
   Best Spike      0      0      0      0      0        0      0      0      0      0
Best Gaussian      0      0      0      0      0        0      0      0      0      0
   Best Pulse      0      0      0      0      0        0      0      0      0      0
 Best Triplet      0      0      0      0      0        0      0      0      0      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   2     28     28     29      1        2     28     28     29      1

Unmatched signal(s) in R1 at line(s) 524
Unmatched signal(s) in R2 at line(s) 524
For R1:R2 matched signals only, Q= 38.66%
Result      : Weakly similar.

----------------------------------------------------------------
Running app with command : .......... setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda60 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 645.210266 645 1045.605347
Sigma > GaussTOffsetStop: 645 > -581
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 47 seconds
Speed compared to default : ......... 1055 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      6     15     15     15      0        6     15     15     15      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0     11     11     12      1        0     11     11     12      1
      Triplet      0      2      2      2      0        0      2      2      2      0
   Best Spike      0      0      0      0      0        0      0      0      0      0
Best Gaussian      0      0      0      0      0        0      0      0      0      0
   Best Pulse      0      0      0      0      0        0      0      0      0      0
 Best Triplet      0      0      0      0      0        0      0      0      0      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   6     28     28     29      1        6     28     28     29      1

Unmatched signal(s) in R1 at line(s) 524
Unmatched signal(s) in R2 at line(s) 524
For R1:R2 matched signals only, Q= 38.66%
Result      : Weakly similar.

----------------------------------------------------------------
Done with blc3_2bit_guppi_57424_81430_HIP9480_0007.5224.831.17.26.71.vlar.wu

====================================================================

Done with Benchmark run! Removing temporary files!
tbar@TBar-iSETI:~/KWSN-Bench-Linux-MBv7_v2.01.08$
ID: 1834778
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1834779 - Posted: 8 Dec 2016, 9:21:16 UTC - in response to Message 1834778.  
Last modified: 8 Dec 2016, 9:22:16 UTC

...The interesting part is the Linux builds are different than the Mac builds...they shouldn't be. ...


The differences underneath between Linux (OpenGL/Vulkan), OSX (Metal), and Windows (DirectX) are directly in the way synchronisation is done, which is key in the new optimisations. IMO out of the 3, the Linux one looks the most solid/stable (despite some pretty radical changes to cope with 4k block NVME devices in recent kernels). Gremlins can certainly turn up in the app code; however, all 3 of those systems are in a state of flux.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1834779
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1834781 - Posted: 8 Dec 2016, 9:42:38 UTC - in response to Message 1834779.  
Last modified: 8 Dec 2016, 9:45:23 UTC

...The interesting part is the Linux builds are different than the Mac builds...they shouldn't be. ...


The differences underneath between Linux (OpenGL/Vulkan), OSX (Metal), and Windows (DirectX) are directly in the way synchronisation is done, which is key in the new optimisations. IMO out of the 3, the Linux one looks the most solid/stable (despite some pretty radical changes to cope with 4k block NVME devices in recent kernels). Gremlins can certainly turn up in the app code; however, all 3 of those systems are in a state of flux.

Except the problem doesn't happen with other Apps. It doesn't even happen with the Old version of the same App. I'm more inclined to think it's similar to the Pulsefind problem prior to zi3f. Some overlooked character that induces a random error when accessed just right. That would explain why the same WU on one platform can end up with a Bad Pulse while working fine on a different platform with the same build number. That happened twice BTW. The first WU worked on the Mac but gave a Bad Pulse in Linux. The normal BLC3 worked in Linux but gave a Bad Pulse on the Mac. Seriously strange in my book.
Have you noticed it's always just One Bad Pulse? Never 2 or more, always one, no matter how many Pulses are found.
ID: 1834781
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1834783 - Posted: 8 Dec 2016, 10:15:58 UTC - in response to Message 1834781.  
Last modified: 8 Dec 2016, 10:20:24 UTC

...The interesting part is the Linux builds are different than the Mac builds...they shouldn't be. ...


The differences underneath between Linux (OpenGL/Vulkan), OSX (Metal), and Windows (DirectX) are directly in the way synchronisation is done, which is key in the new optimisations. IMO out of the 3, the Linux one looks the most solid/stable (despite some pretty radical changes to cope with 4k block NVME devices in recent kernels). Gremlins can certainly turn up in the app code; however, all 3 of those systems are in a state of flux.

Except the problem doesn't happen with other Apps. It doesn't even happen with the Old version of the same App. I'm more inclined to think it's similar to the Pulsefind problem prior to zi3f. Some overlooked character that induces a random error when accessed just right. That would explain why the same WU on one platform can end up with a Bad Pulse while working fine on a different platform with the same build number. That happened twice BTW. The first WU worked on the Mac but gave a Bad Pulse in Linux. The normal BLC3 worked in Linux but gave a Bad Pulse on the Mac. Seriously strange in my book.
Have you noticed it's always just One Bad Pulse? Never 2 or more, always one, no matter how many Pulses are found.


That's right. That's how race conditions (due to omissions or typos) tend to manifest. The architecture is virtualised, so ordering and correctness (or otherwise) depend almost completely on the underlying implementation. The same situation arose with the introduction of Fermi, whereby NV had to return to produce 6.10. The 6.09 code worked as-is on pre-Fermi GPUs but produced garbage pulses on Fermi, simply due to cache/thread behaviour.

Quite possibly there are one or more reduction pointers that Petri hadn't realised need to be marked 'volatile'. That different systems, drivers and GPUs manage virtualised memory and caching differently is not surprising, but either way the omission or other problem is in the app code rather than the drivers. It's just complicated by the fact that the implementations are changing underneath.
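
For anyone following along, here is a minimal sketch of the kind of 'volatile' omission meant here. It assumes the textbook shared-memory sum reduction, not Petri's actual pulse-finding code, and the names block_sum, warp_reduce and gpu_sum are made up for illustration. The last warp runs without __syncthreads(), so the shared-memory pointer has to be 'volatile'; drop that qualifier and the compiler is free to keep partial sums in registers, which can give exactly the sort of occasional wrong value that depends on GPU, driver and toolkit.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per block; power of two, at least 64

// Final warp of the reduction in the classic warp-synchronous style: no
// __syncthreads() between steps. 'volatile' forces every read and write to
// go to shared memory, so a lane sees what another lane just stored instead
// of a stale copy held in a register.
__device__ void warp_reduce(volatile float* sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// Each block sums up to kThreads inputs and writes one partial sum.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float sdata[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction with full barriers while more than one warp is active.
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Last 64 values are folded by a single warp, relying on 'volatile'.
    if (tid < 32) warp_reduce(sdata, tid);
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: sums a vector on the GPU (error checks omitted).
float gpu_sum(const std::vector<float>& h_in) {
    if (h_in.empty()) return 0.0f;
    int n = static_cast<int>(h_in.size());
    int blocks = (n + kThreads - 1) / kThreads;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kThreads>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    // Finish the sum of per-block partials on the host.
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}

int main() {
    std::vector<float> data(1000, 1.0f);
    printf("sum = %.1f\n", gpu_sum(data));
    return 0;
}
```

This is the pre-Volta form, which matches the Maxwell-class cards in this thread. On Volta and later the lanes of a warp are no longer guaranteed to run in lockstep, so 'volatile' alone is not enough there and the steps in warp_reduce would also need __syncwarp() between them.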
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1834783
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.