Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Now available for GPUs with a Compute Capability of 3.2 and above, https://en.wikipedia.org/wiki/CUDA#GPUs_supported

Some examples running on a 750Ti:
Shorty, AR = 3.133362; Run time: 2 min 52 sec, CPU time: 20 sec
Mid, AR = 0.447852; Run time: 6 min 54 sec, CPU time: 2 min 28 sec
BLC3, AR = 0.006417; Run time: 14 min 12 sec, CPU time: 35 sec
http://setiathome.berkeley.edu/results.php?hostid=7769537&offset=160

Download and install instructions are here: Linux CUDA 6 Special App
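For anyone unsure whether their card clears the Compute Capability 3.2 minimum mentioned above, a minimal host-side sketch (my own illustration, not part of the app or its install script) that queries each device through the CUDA runtime API might look like this:

```cpp
// check_cc.cu - build with: nvcc check_cc.cu -o check_cc
// Lists every CUDA device and whether it meets the CC 3.2 minimum.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA-capable device found.\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        bool ok = (prop.major > 3) || (prop.major == 3 && prop.minor >= 2);
        std::printf("Device %d: %s, CC %d.%d -> %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    ok ? "meets the 3.2 minimum" : "below the 3.2 minimum");
    }
    return 0;
}
```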
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Has the Pulse detection issue been solved in this version?

SETI apps news
We're not gonna fight them. We're gonna transcend them.
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
It's passed all the tests I've run with known 'problem' WUs. If you have a certain Test WU you'd like to see tested, post it and I'll try it with the benchmark App. It's been running a week without a single Error or Invalid, and has a lower Inconclusive count than any other 'Special' version. Something has been solved.

SETI@home v8 (anonymous platform, NVIDIA GPU)
Number of tasks completed: 1018
Consecutive valid tasks: 1185
https://setiathome.berkeley.edu/host_app_versions.php?hostid=7769537
Grant (SSSF) Joined: 19 Aug 99 Posts: 13847 Credit: 208,696,464 RAC: 304
> It's passed all the tests I've run with known 'problem' WUs. If you have a certain Test WU you'd like to see tested, post it and I'll try it with the benchmark App.

Average processing rate: 306.75 GFLOPS

For a GTX 750Ti, very nice. And inconclusives are around 7.66%; not quite the 5% target, but damn close. Very, very nice.

Grant
Darwin NT
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
With the zi+a sources, can confirm I was never able to reproduce the pulse finding issue spotted on Petri's machine/build, on my Windows build. It has a number of issues to solve of its own (seemingly Windows specific). There are differences in the two codebases I have not examined in detail.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Grant (SSSF) Joined: 19 Aug 99 Posts: 13847 Credit: 208,696,464 RAC: 304
> With the zi+a sources, can confirm I was never able to reproduce the pulse finding issue spotted on Petri's machine/build, on my Windows build. It has a number of issues to solve of its own (seemingly Windows specific). There are differences in the two codebases I have not examined in detail.

Does it basically come down to precision/rounding issues caused by the different libraries used on the different Operating Systems?

Grant
Darwin NT
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> With the zi+a sources, can confirm I was never able to reproduce the pulse finding issue spotted on Petri's machine/build, on my Windows build. It has a number of issues to solve of its own (seemingly Windows specific). There are differences in the two codebases I have not examined in detail.

My guess is there is some difference in the Paths, at least with the OSX and Linux builds. The OSX builds are still running about twice as many Inconclusives as the Linux build, even though they are almost exactly the same builds. Chris was running the original p_zi with around 45 Inconclusives; now the machine is climbing. I expect it to level out around where my Mac is running. They should be running about the same with p_zi+ as they were with p_zi.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> With the zi+a sources, can confirm I was never able to reproduce the pulse finding issue spotted on Petri's machine/build, on my Windows build. It has a number of issues to solve of its own (seemingly Windows specific). There are differences in the two codebases I have not examined in detail.

My initial attempts using my normal compiler/precision approaches yielded the right ballpark accuracy-wise, but have since been broken during various attempts to fix other issues. My problems with the codebase on Windows have been related more to the different driver demands on this OS with respect to execution times of kernel launches. In a nutshell, Windows drivers have DirectX-based, gaming-oriented optimisations the other OSes don't. These hidden 'features' fuse kernels in their streams into single large launches, removing synchronisation that really needs to be there. I've experimented with some methods to limit/mitigate this, which have worked to some extent, though they introduced other instability along the way (to be isolated). Switching to the new 1050ti verified the same behaviour as I was seeing on my GTX 980 on Win7, and running the 1050ti on generic baseline since last night it looks like the instability is gone (so far) [...so specific to the alpha code & my breakages].

It just means that, as I lay down some of the new infrastructure for x42, I'll need to include some comprehensive timing and debug code aimed at improved automatic scaling and giving some control. That's what was intended for x42 anyway; Petri's contributions just change the direction a bit. Ideally I'd like the compatibility broadened along the way, since stock integration at any level requires as broad a support as possible (a Boinc server/scheduler limitation). The breaking changes by Cuda version & deprecated devices are complicating what the next generation will look like. My next dev run in the generic stock direction will end up polymorphic, so as to support multiple Cuda versions/devices. The current alpha code embedded in a clean framework adapted from stock CPU, and 'pluginised', is likely to supplant the x41 baseline, and serve as a platform for my Vulkan compute kernels.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
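As an aside on the kernel-fusing behaviour described above, one commonly used mitigation on Windows WDDM drivers is to force the batched launch queue to be submitted explicitly. The sketch below is my own illustration of that idea, with a hypothetical `stage` kernel standing in for one step of a pipeline; it is not the project's code, and whether it helps in any given build would need testing.

```cpp
// wddm_flush.cu - build with: nvcc wddm_flush.cu -o wddm_flush
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one stage of a GPU pipeline.
__global__ void stage(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    float *h = new float[n]();                      // host buffer, zero-initialised
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t s;
    cudaStreamCreate(&s);

    for (int pass = 0; pass < 8; ++pass) {
        stage<<<(n + 255) / 256, 256, 0, s>>>(d, n);
        // Non-blocking nudge: cudaStreamQuery returns cudaErrorNotReady while
        // work is still in flight, but it prompts the driver to submit its
        // batched command buffer instead of leaving launches queued up.
        cudaStreamQuery(s);
    }

    // Hard ordering point before the host touches the results.
    cudaStreamSynchronize(s);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("h[0] = %f\n", h[0]);

    cudaStreamDestroy(s);
    cudaFree(d);
    delete[] h;
    return 0;
}
```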
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> With the zi+a sources, can confirm I was never able to reproduce the pulse finding issue spotted on Petri's machine/build, on my Windows build. It has a number of issues to solve of its own (seemingly Windows specific). There are differences in the two codebases I have not examined in detail.

Yeah, whatever that difference is will likely turn up as I assemble the various pieces (the alpha sources, new buildsystem, and cleaned-out codebase). Now having the parts here to turn my Mac Pro into a triple-OS dev machine should aid tracking down any compiler/library/build differences. Just a matter of a lot of rejigging of the development environment to do first, which will be easier now that work will let up for Christmas.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
It would also be good if those who install the "special" app listed the corresponding hosts here. It would also be good to have those hosts join Beta (with the "special" app too).

SETI apps news
We're not gonna fight them. We're gonna transcend them.
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
Jason, this "fused kernels" issue in Windows... is that with the latest optimized-for-gaming Windows drivers? Was/is the problem present with much older Windows drivers? I'm thinking that BOINC users interested not in gaming but in stable and productive systems largely avoid getting on the "latest Windows drivers" carousel, with new releases seemingly every other day to coincide with the latest popular game. I'm sure a lot of us just use the earliest stable driver that supports the board architecture of the cards we use.

Seti@Home classic workunits: 20,676 CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
The Cuda-specific side of these optimisations, whereby implicit synchronisations can be optimised out by the driver, traces back a fair way. Slightly earlier than whichever driver it was where 'trusty old' Cuda 3.2 magically became incompatible with later-gen cards. At that point I was prompted to alert Eric to block stock distribution of that build if there was any Maxwell [even Kepler, now I think back...] or newer GPU in the system (which he promptly did). Most relevant seems to be that Cuda switched to an LLVM-based compiler after that, which resides in the driver, replacing the old Cuda one. ~Cuda 4/4.1 were too buggy to use here, though 4.2 and 5 vastly improved the picture. The mechanism is that the drivers ignore the embedded PTX binaries in favour of a JIT recompile that gets cached in %APPDATA%\NVIDIA\ComputeCache.

The 'old school' Cuda code relies on synchronisation points that get 'optimised out'; the Cuda 3.2, 4.2 and 5.0 sources are identical, and debugging reveals underlying DirectX calls that won't be in play on Linux or Mac. It just points to nVidia's deprecation of the pre-Fermi architecture as a vector for introducing new bugs, with deprecation of the Fermi class and the x86 platform starting with Cuda [~6.5-8.0]. It's that complex round of breaking changes and deprecations with Cuda whereby Petri wisely chooses to go the path of least resistance in supporting the newest generations only, taking advantage of the improved streaming optimisation capabilities. At the same time, Boinc limitations in identifying mixed-GPU systems block stock integration of the newest forms (i.e. what app to send if someone puts a Fermi-class and a Pascal in the same system).

The 'obvious' solution then is to engineer dispatch into the next generation of Cuda-enabled applications, such that internal regression tests can choose the code based on what works. With Raistmer having chosen the OpenCL squillion-build route, which is handling things nicely for GBT at the moment, it does give some breathing room for the daunting amount of software engineering to take place. We have the stock CPU example of such a mechanism working for a different context. For the generic stock distribution route that means I personally become tied up in creating new supporting infrastructure. Fortunately that doesn't mean non-stock third-party development becomes impeded, though it does mean the alpha code here will be very situation-specific, and have quirks across the platforms & devices that prevent widespread adoption for the time being.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
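A minimal sketch of the runtime-dispatch idea described above (my own illustration, not x42 code): probe each device once and route work to whichever implementation suits its compute capability, so a single binary could cope with a mixed-GPU host. The `pulsefind_*` functions here are hypothetical stand-ins for generation-specific code paths.

```cpp
// dispatch.cu - build with: nvcc dispatch.cu -o dispatch
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for generation-specific implementations.
void pulsefind_legacy(int dev)   { std::printf("device %d: conservative path\n", dev); }
void pulsefind_streamed(int dev) { std::printf("device %d: streamed path\n", dev); }

using PulseFindFn = void (*)(int);

// Pick an implementation based on the device's compute capability.
PulseFindFn select_path(int dev) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    if (prop.major > 3 || (prop.major == 3 && prop.minor >= 2))
        return pulsefind_streamed;   // newer generations get the streamed code
    return pulsefind_legacy;         // anything older falls back
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        select_path(dev)(dev);       // per-device dispatch in a mixed-GPU host
    }
    return 0;
}
```

In a real application the selection would presumably be driven by the internal regression tests mentioned above rather than by compute capability alone.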
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
Thanks for the technical explanation, Jason. I was hoping for development along two paths: an XXX.app for Windows drivers << 37X.XX or whatever, with a disclaimer to run it only on the XXX family of cards, and another fork, a ZZZ.app for Windows drivers >> 38Z.ZZ or whatever, with a disclaimer not to run it on anything less than the ZZZ family of cards. I was thinking that might simplify your development, where you otherwise have to develop an app for all possible card architectures and all the older, current and future drivers. I see now that is not easy, or in fact desirable. At some point in time I would think that the developers have to just deprecate support for old hardware. The manufacturers do it for their latest drivers. Why can't the BOINC developers? Again, thanks for the explanation of your development methodology.

Seti@Home classic workunits: 20,676 CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
> At some point in time I would think that the developers have to just deprecate support for old hardware. The manufacturers do it for their latest drivers. Why can't the BOINC developers?

Because the goals are opposite. The vendor's goal is to take as much of your money as it can. The goal of BOINC-based project developers is to let you use what you already have, without spending more money.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
-= Vyper =- Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537
TBar: Have you compared the speed of your compile against Petri's different builds?

Good that the invalid rates are down, but as we all know by now we can't eliminate the way the validator works either. The faster a host produces results, the higher its inconclusive ratio seems to be until it eventually drops off.

What I write below is my theory: if you have a slow host that doesn't process that many WUs per day, you tend to end up crunching units that your wingman has already crunched. If the validator compares a (I call it Petri CUDA) result against one that has already been crunched, you get a validation pass, both get rewarded credits, and the WU is soon cleared from the system ("Cannot find the WU", as we can see once they have been processed), and thus the invalid ratio is low. In the opposite case, an ultra-speedy system that crunches thousands of WUs per day gets more inconclusives, because that machine is so fast it returns the work first of all and then waits for other computers to catch up; as they start to return WUs and the overflowed results come pouring in, that speedy machine's inconclusive ratio rises faster than others' as well. /End of theory

What actually matters, of course, is that the code does the work properly: Q ratio as high as possible across various tasks, GBT, high/low AR, etc. You all know that part, but the value TBar refers to as "Consecutive valid tasks" is the main thing to keep track of in my mind, not the inconclusive part, because the more parallel the code, the more inconclusives we will get, whether it's a CPU, GPU, FPGA, PS4, yada.

Thanks for your work TBar, and thank you Petri for going the brute-force route of taking advantage of newer hardware that made this leap possible. The latest SoG is also speedy as hell! That 1080 of mine is utilized better than running multiple parallel CUDA instances now! Thank you Raistmer, Jason, Urs and all you Alpha/Beta testers and others who have contributed to getting us where we are at the moment. The list of people would get long.

_________________________________________________________________________
Addicted to SETI crunching! Founder of GPU Users Group
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> At some point in time I would think that the developers have to just deprecate support for old hardware. The manufacturers do it for their latest drivers. Why can't the BOINC developers?

Agreed. The $ amount of nVidia cards already retired, given away to the needy, or on my shelf collecting dust, is already worth way more than my car. IMO AMD's got a pretty golden opportunity right now, with a vacuum created by NV and Intel gouging.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> TBar: Have you compared the speed of your compile against Petri's different builds?.....

I was one of the first testers, been at it for over a year now. I've tested hundreds of builds during that year, right up to p_zi3i. I haven't been sent any version newer than zi3i.

Your other theory doesn't take into account the use of offline benchmarking. The benchmark App will identify the source of the problem. I just ran another series of tests which show the Pulsefind Error that was addressed in zi3f is still present; it's just a little better in the zi+ build than in the zi3i build. The interesting part is the Linux builds are different than the Mac builds...they shouldn't be. The same Work Units give different results when run with the same build on a different platform. Compare the results below with the results from the Mac, http://setiathome.berkeley.edu/forum_thread.php?id=78569&postid=1834748#1834748

```
tbar@TBar-iSETI:~$ cd '/home/tbar/KWSN-Bench-Linux-MBv7_v2.01.08'
tbar@TBar-iSETI:~/KWSN-Bench-Linux-MBv7_v2.01.08$ ./benchmark
KWSN-Linux-MBbench v2.1.08
Running on TBar-iSETI at Thu 08 Dec 2016 07:55:11 AM UTC
----------------------------------------------------------------
Starting benchmark run...
----------------------------------------------------------------
Listing wu-file(s) in /testWUs :
18au09aa.4654.85539.7.34.226.wu
18dc09ah.26284.16432.6.33.125.wu
blc3_2bit_guppi_57424_80774_HIP9480_0005.24846.0.17.26.134.vlar.wu
blc3_2bit_guppi_57424_81430_HIP9480_0007.5224.831.17.26.71.vlar.wu
Listing executable(s) in /APPS :
setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75
setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda60
Listing executable in /REF_APPS :
MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu
----------------------------------------------------------------
Current WU: 18au09aa.4654.85539.7.34.226.wu
----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 8738 seconds
----------------------------------------------------------------
Running app with command : ..........
setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 3.109209 3 5.412199
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 387 seconds
Speed compared to default : ......... 2257 %
----------------- Comparing results
------------- R1:R2 ------------ ------------- R2:R1 ------------
Exact Super Tight Good Bad Exact Super Tight Good Bad
Spike 0 5 5 5 0 0 5 5 5 0
Autocorr 0 0 0 0 0 0 0 0 0 0
Gaussian 0 0 0 0 0 0 0 0 0 0
Pulse 0 4 4 4 1 0 4 4 4 1
Triplet 0 0 0 0 0 0 0 0 0 0
Best Spike 0 1 1 1 0 0 1 1 1 0
Best Autocorr 0 1 1 1 0 0 1 1 1 0
Best Gaussian 0 1 1 1 0 0 1 1 1 0
Best Pulse 0 0 0 0 1 0 0 0 0 1
Best Triplet 0 0 0 0 0 0 0 0 0 0
---- ---- ---- ---- ---- ---- ---- ---- ---- ----
0 12 12 12 2 0 12 12 12 2
Unmatched signal(s) in R1 at line(s) 422 611
Unmatched signal(s) in R2 at line(s) 422 611
For R1:R2 matched signals only, Q= 99.96%
Result : Weakly similar.
----------------------------------------------------------------
Running app with command : ..........
setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda60 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 3.109209 3 5.412199
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 411 seconds
Speed compared to default : ......... 2126 %
----------------- Comparing results
------------- R1:R2 ------------ ------------- R2:R1 ------------
Exact Super Tight Good Bad Exact Super Tight Good Bad
Spike 0 5 5 5 0 0 5 5 5 0
Autocorr 0 0 0 0 0 0 0 0 0 0
Gaussian 0 0 0 0 0 0 0 0 0 0
Pulse 0 4 4 4 1 0 4 4 4 1
Triplet 0 0 0 0 0 0 0 0 0 0
Best Spike 0 1 1 1 0 0 1 1 1 0
Best Autocorr 0 1 1 1 0 0 1 1 1 0
Best Gaussian 0 1 1 1 0 0 1 1 1 0
Best Pulse 0 0 0 0 1 0 0 0 0 1
Best Triplet 0 0 0 0 0 0 0 0 0 0
---- ---- ---- ---- ---- ---- ---- ---- ---- ----
0 12 12 12 2 0 12 12 12 2
Unmatched signal(s) in R1 at line(s) 422 611
Unmatched signal(s) in R2 at line(s) 422 611
For R1:R2 matched signals only, Q= 99.96%
Result : Weakly similar.
----------------------------------------------------------------
Done with 18au09aa.4654.85539.7.34.226.wu
====================================================================
Current WU: 18dc09ah.26284.16432.6.33.125.wu
----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 3495 seconds
----------------------------------------------------------------
Running app with command : ..........
setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 0.498642 0 1.000000
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 168 seconds
Speed compared to default : ......... 2080 %
----------------- Comparing results
------------- R1:R2 ------------ ------------- R2:R1 ------------
Exact Super Tight Good Bad Exact Super Tight Good Bad
Spike 0 0 0 0 0 0 0 0 0 0
Autocorr 0 0 0 0 0 0 0 0 0 0
Gaussian 0 0 0 0 0 0 0 0 0 0
Pulse 0 0 0 0 1 0 0 0 0 1
Triplet 0 3 3 3 0 0 3 3 3 0
Best Spike 0 1 1 1 0 0 1 1 1 0
Best Autocorr 1 1 1 1 0 1 1 1 1 0
Best Gaussian 1 1 1 1 0 1 1 1 1 0
Best Pulse 0 0 0 0 1 0 0 0 0 1
Best Triplet 0 1 1 1 0 0 1 1 1 0
---- ---- ---- ---- ---- ---- ---- ---- ---- ----
2 7 7 7 2 2 7 7 7 2
Unmatched signal(s) in R1 at line(s) 393 473
Unmatched signal(s) in R2 at line(s) 393 473
For R1:R2 matched signals only, Q= 100.0%
Result : Weakly similar.
----------------------------------------------------------------
Running app with command : ..........
setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda60 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 0.498642 0 1.000000
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 173 seconds
Speed compared to default : ......... 2020 %
----------------- Comparing results
Result : Strongly similar, Q= 99.70%
----------------------------------------------------------------
Done with 18dc09ah.26284.16432.6.33.125.wu
====================================================================
Current WU: blc3_2bit_guppi_57424_80774_HIP9480_0005.24846.0.17.26.134.vlar.wu
----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 6957 seconds
----------------------------------------------------------------
Running app with command : ..........
setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 603.228455 603 977.571899
Sigma > GaussTOffsetStop: 603 > -539
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 793 seconds
Speed compared to default : ......... 877 %
----------------- Comparing results
Result : Strongly similar, Q= 99.25%
----------------------------------------------------------------
Running app with command : ..........
setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda60 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 603.228455 603 977.571899
Sigma > GaussTOffsetStop: 603 > -539
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 828 seconds
Speed compared to default : ......... 840 %
----------------- Comparing results
Result : Strongly similar, Q= 99.25%
----------------------------------------------------------------
Done with blc3_2bit_guppi_57424_80774_HIP9480_0005.24846.0.17.26.134.vlar.wu
====================================================================
Current WU: blc3_2bit_guppi_57424_81430_HIP9480_0007.5224.831.17.26.71.vlar.wu
----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 496 seconds
----------------------------------------------------------------
Running app with command : ..........
setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 645.210266 645 1045.605347
Sigma > GaussTOffsetStop: 645 > -581
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 47 seconds
Speed compared to default : ......... 1055 %
----------------- Comparing results
------------- R1:R2 ------------ ------------- R2:R1 ------------
Exact Super Tight Good Bad Exact Super Tight Good Bad
Spike 2 15 15 15 0 2 15 15 15 0
Gaussian 0 0 0 0 0 0 0 0 0 0
Pulse 0 11 11 12 1 0 11 11 12 1
Triplet 0 2 2 2 0 0 2 2 2 0
Best Spike 0 0 0 0 0 0 0 0 0 0
Best Gaussian 0 0 0 0 0 0 0 0 0 0
Best Pulse 0 0 0 0 0 0 0 0 0 0
Best Triplet 0 0 0 0 0 0 0 0 0 0
---- ---- ---- ---- ---- ---- ---- ---- ---- ----
2 28 28 29 1 2 28 28 29 1
Unmatched signal(s) in R1 at line(s) 524
Unmatched signal(s) in R2 at line(s) 524
For R1:R2 matched signals only, Q= 38.66%
Result : Weakly similar.
----------------------------------------------------------------
Running app with command : ..........
setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda60 -bs -unroll 5 -device 0
gCudaDevProps.multiProcessorCount = 5
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=16777216 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 5
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 645.210266 645 1045.605347
Sigma > GaussTOffsetStop: 645 > -581
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 47 seconds
Speed compared to default : ......... 1055 %
----------------- Comparing results
------------- R1:R2 ------------ ------------- R2:R1 ------------
Exact Super Tight Good Bad Exact Super Tight Good Bad
Spike 6 15 15 15 0 6 15 15 15 0
Gaussian 0 0 0 0 0 0 0 0 0 0
Pulse 0 11 11 12 1 0 11 11 12 1
Triplet 0 2 2 2 0 0 2 2 2 0
Best Spike 0 0 0 0 0 0 0 0 0 0
Best Gaussian 0 0 0 0 0 0 0 0 0 0
Best Pulse 0 0 0 0 0 0 0 0 0 0
Best Triplet 0 0 0 0 0 0 0 0 0 0
---- ---- ---- ---- ---- ---- ---- ---- ---- ----
6 28 28 29 1 6 28 28 29 1
Unmatched signal(s) in R1 at line(s) 524
Unmatched signal(s) in R2 at line(s) 524
For R1:R2 matched signals only, Q= 38.66%
Result : Weakly similar.
----------------------------------------------------------------
Done with blc3_2bit_guppi_57424_81430_HIP9480_0007.5224.831.17.26.71.vlar.wu
====================================================================
Done with Benchmark run!
Removing temporary files!
tbar@TBar-iSETI:~/KWSN-Bench-Linux-MBv7_v2.01.08$
```
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> ...The interesting part is the Linux builds are different than the Mac builds... ...

The differences underneath between Linux (OpenGL/Vulkan), OSX (Metal), and Windows (DirectX) are directly in the way synchronisation is done, which is key in the new optimisations. IMO out of the 3, the Linux one looks the most solid/stable (despite some pretty radical changes to cope with 4k-block NVMe devices in recent kernels). Probably gremlins can turn up in the app code for sure; however, all 3 of those systems are in a state of flux.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> ...The interesting part is the Linux builds are different than the Mac builds... ...

Except the problem doesn't happen with other Apps. It doesn't even happen with the Old version of the same App. I'm more inclined to think it's similar to the Pulsefind problem prior to zi3f: some overlooked character that induces a random error when accessed just right. That would explain why the same WU on one platform can end up with a Bad Pulse while working fine on a different platform with the same build number. That happened twice, BTW. The first WU worked on the Mac but gave a Bad Pulse in Linux. The normal BLC3 worked in Linux but gave a Bad Pulse on the Mac. Seriously strange in my book. Have you noticed it's always just One Bad Pulse? Never 2 or more, always one, no matter how many Pulses are found.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> ...The interesting part is the Linux builds are different than the Mac builds... ...

That's right. That's how race conditions (due to omissions or typos) tend to manifest. The architecture is virtualised, so ordering and correctness (or otherwise) is dependent almost completely on the underlying implementation. The same situation arose with the introduction of Fermi, whereby NV had to return to produce 6.10: the [6.09] code worked as-is on pre-Fermi, but produced garbage pulses on Fermi, simply due to cache/thread behaviour. Quite possibly there are one or more reduction pointers that Petri hadn't realised need to be marked 'volatile'. That different systems, drivers and GPUs manage virtualised memory and caching differently is not surprising, but either way the omission or other problem is in the app code rather than the drivers. It's just complicated by the fact that the implementations are changing underneath.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
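To illustrate the kind of omission being discussed (a sketch of my own, not Petri's code): in an old-style warp-synchronous reduction the shared-memory pointer has to be declared volatile, otherwise the compiler is free to keep partial sums in registers, and different GPUs and drivers will expose the resulting race differently, which matches the "fine on one platform, one Bad Pulse on another" pattern. All names below are hypothetical.

```cpp
// volatile_reduce.cu - build with: nvcc volatile_reduce.cu -o volatile_reduce
// Note: this warp-synchronous style is the pre-Volta idiom; newer
// architectures would want __syncwarp() or warp shuffles instead.
#include <cstdio>
#include <cuda_runtime.h>

// Without 'volatile' here, these reads may not observe other lanes' writes.
__device__ void warp_reduce(volatile float *s, int tid) {
    s[tid] += s[tid + 32];
    s[tid] += s[tid + 16];
    s[tid] += s[tid + 8];
    s[tid] += s[tid + 4];
    s[tid] += s[tid + 2];
    s[tid] += s[tid + 1];
}

// Sums each 256-element block of 'in' into one value of 'out'.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float s[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 32; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid < 32) warp_reduce(s, tid);   // final 64 -> 1 without __syncthreads()
    if (tid == 0) out[blockIdx.x] = s[0];
}

int main() {
    const int n = 1024;
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float *din = nullptr, *dout = nullptr;
    cudaMalloc(&din, n * sizeof(float));
    cudaMalloc(&dout, (n / 256) * sizeof(float));
    cudaMemcpy(din, h, n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<n / 256, 256>>>(din, dout, n);
    float partial[4];
    cudaMemcpy(partial, dout, sizeof(partial), cudaMemcpyDeviceToHost);
    std::printf("block sums: %g %g %g %g (expect 256 each)\n",
                partial[0], partial[1], partial[2], partial[3]);
    cudaFree(din); cudaFree(dout); delete[] h;
    return 0;
}
```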