I've Built a Couple OSX CUDA Apps...

Message boards : Number crunching : I've Built a Couple OSX CUDA Apps...

Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1817839 - Posted: 17 Sep 2016, 17:43:48 UTC - in response to Message 1817787.  
Last modified: 17 Sep 2016, 17:44:57 UTC

Building here. Visual Studio has become pretty whiny about the includes/device code, which I've mitigated by separating the CUDA device codelets from cudaAcceleration.h into a separate cudaAcceleration_inlines.h. There are a couple of other minor Windows build breakages that I'm wading through now.

There'll be quite a bit of restructuring needed to properly separate the device specific code from the common core, though that can come gradually while various builds are in test.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1817839 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1817849 - Posted: 17 Sep 2016, 18:24:32 UTC - in response to Message 1817839.  
Last modified: 17 Sep 2016, 18:25:04 UTC

Building here. Visual Studio has become pretty whiny about the includes/device code, which I've mitigated by separating the CUDA device codelets from cudaAcceleration.h into a separate cudaAcceleration_inlines.h. There are a couple of other minor Windows build breakages that I'm wading through now.

There'll be quite a bit of restructuring needed to properly separate the device specific code from the common core, though that can come gradually while various builds are in test.


@jason_gee: My bad. I placed the cache handling internals into the very first place that was common to all CUDA files. I should have put them in a separate header and #included that where needed.

My goal was to get it working and make the code run fast in my private environment. Your goal (a good one for the rest of us) is to make it portable.

Thank you Jason!
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1817849 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1817857 - Posted: 17 Sep 2016, 18:54:32 UTC

I've been running the first build x41p_zi+ for a few hours now. So far the 750Ti seems to be working correctly with driver 7.5.27 in Yosemite. No stalls and no suspicious overflows. I tried the latest build zi+a a little while ago in the benchmark, but couldn't get it to find any signals. It just sat there at 100% CPU use until I killed it. Back to running the first zi+ build now.

It certainly didn't help when the files changed to guppi_57403_xxx right when I switched to the new App. Those 57403s are much faster than the normal BLC4s for some reason. Made me look...
ID: 1817857 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1818017 - Posted: 18 Sep 2016, 15:53:29 UTC - in response to Message 1817857.  
Last modified: 18 Sep 2016, 16:05:10 UTC

It's looking here like sync behaviour, and by extension possible stalls etc. (which I'm not getting on Windows with zi+a plus minor Windows-centric tweaks), are connected to driver optimisations. These are particularly aggressive on Windows ::S, and tend to fuse kernel chains such as the pulsefind ones that the pf and unroll settings affect.

It's possible later OSX OS+drivers begin implementing something similar, so these long chains may need to be optionally hard synced at a finer grain than the -bs option in the last zi+a base does. If that's in the right direction for OSX, especially on the likes of the 750Ti, reducing the settings much lower than you'd expect helps, as the driver will fuse kernels again anyway (upping the effect of the settings against you).

Another possibility lurking is that I had one restart failure that reverted (probably incorrectly) to processing triplets on CPU because of the limited GPU implementation (and annoyingly familiar client shutdown issues I know how to address). I have since aborted that task to hand it off to other users to process properly. That fallback would normally be fine when the app was slower, but should probably be set to error out instead, since data layouts probably don't match what that CPU code is expecting anymore. That part could also possibly be broken (if not just slow). Most likely the best choice for implementation would come after checking and fixing the exit/restart function. Since genuine 'Too Many triplets' are rare, and the newer code very fast, a slower, bloated GPU kernel that kicks in before resorting to (fixed) CPU code as an absolute last resort will probably be better. That's looking further ahead to the x42 implementation though. Might have to put up with some quirks short-to-medium term there.
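
As a rough sketch of that fallback order (fast GPU path first, then a slower, roomier GPU kernel, then CPU code only if it is known to be safe, otherwise error out), something like the following could work. Everything here is hypothetical: `TripletPath`, `TripletLimits`, `chooseTripletPath` and the capacity fields are illustration only and are not names from the x41p source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical processing paths, in order of preference.
enum class TripletPath { FastGpu, SlowGpu, Cpu, Error };

// Hypothetical limits; real values would have to come from the app itself.
struct TripletLimits {
    std::size_t fastGpuMax;   // triplets the fast kernel can hold
    std::size_t slowGpuMax;   // triplets the larger, slower kernel can hold
    bool cpuLayoutMatches;    // true only once the CPU code understands the GPU data layout
};

// Pick the first path that can safely handle `count` triplets.
TripletPath chooseTripletPath(std::size_t count, const TripletLimits& lim) {
    assert(lim.fastGpuMax <= lim.slowGpuMax);
    if (count <= lim.fastGpuMax) return TripletPath::FastGpu;
    if (count <= lim.slowGpuMax) return TripletPath::SlowGpu;
    // Only fall back to CPU when its expected data layout is known to match;
    // otherwise fail loudly rather than risk reporting a wrong result.
    return lim.cpuLayoutMatches ? TripletPath::Cpu : TripletPath::Error;
}
```

The point of the `Error` case is that a task which errors out gets reissued to another host, whereas a silently wrong CPU fallback would end up as an inconclusive or invalid result.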
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1818017 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1818024 - Posted: 18 Sep 2016, 16:43:43 UTC - in response to Message 1818017.  

The latest problem appears to have been caused by the entries at the end of the zi+ confsetting.cpp file. As noted before, the settings -pfb and -pfp don't help my Mac at all. They usually cause the task to run Slower and create More Inconclusive results. I don't use those settings for that reason. Trying to just comment out those lines in confsetting.cpp didn't work. I was finally able to piece together the end of confsetting.cpp from the baseline file enough to get a working build, a couple hours ago, that doesn't have those settings hard-coded into the App. Of course, now there are a few Invalid overflows waiting to be declared. Since removing those settings the App is running faster, and the overflows will probably stop as well.

In a related incident, the cheap SSD in the Linux machine died after a power failure last night. While reinstalling everything on a new drive I was down to the older p_zi3f, everything newer was on the dead SSD. So, I decided to try driver 361.45 with the zi3f. Just as with my Mac, the 750Ti cards began stalling. I updated back to 367.44 and so far the stalling has stopped. I had already witnessed stalling on the Linux machine with driver 352.79, the stalling was much worse with 361.45. Hopefully I'll soon be able to build a zi+ on the Linux machine.
ID: 1818024 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1818094 - Posted: 19 Sep 2016, 0:48:47 UTC - in response to Message 1818024.  
Last modified: 19 Sep 2016, 1:06:55 UTC

Yeah, with the 750Ti you're definitely going to need functional settings, whether by <cmdline> or preferably mbcuda.cfg. The first case is non-ideal because it makes configuration of a stock app impossible, while the configuration file was patched in for Windows, and not regarded as necessary for Linux (with Baseline code).

[It does also sound like the driver or runtime may be fusing kernels there, which won't happen with Baseline]

With the streaming/faster code though, it becomes a more obvious example of where the codebase needs to shift from a piecemeal, bandaided design to a unified, simpler, clean one. I do have a working C++ ini/cfg file reader put aside, meant to replace the Windows-centric one; I just haven't installed it yet because it wasn't needed. Guess I'll need to look for it after work.
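
For illustration only, a portable reader of that kind can be quite small. The sketch below is not the reader mentioned above; `readCfg`, `cfgInt` and any key names used with them are hypothetical, and it simply assumes `key = value` lines with `;` or `#` comments and `[section]` headers that are skipped.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Strip leading and trailing whitespace (also drops a trailing '\r' from Windows files).
static std::string trim(const std::string& s) {
    std::size_t b = 0, e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    return s.substr(b, e - b);
}

// Parse "key = value" lines; blank lines, ';' or '#' comments and [section] headers are ignored.
std::map<std::string, std::string> readCfg(std::istream& in) {
    std::map<std::string, std::string> out;
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == ';' || line[0] == '#' || line[0] == '[') continue;
        std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;  // not a key/value line
        out[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return out;
}

// Look up an integer setting, returning `fallback` when it is missing or malformed.
int cfgInt(const std::map<std::string, std::string>& cfg, const std::string& key, int fallback) {
    auto it = cfg.find(key);
    if (it == cfg.end()) return fallback;
    try { return std::stoi(it->second); } catch (...) { return fallback; }
}
```

It takes any `std::istream`, so the same code can read a file through `std::ifstream` on all three platforms or an in-memory `std::istringstream` in a test.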

I probably need to compare the -bs implementation to my prior attempts, as I'm getting more lag on Guppies with lower settings, which also implies the driver and settings are doing something similar on Windows.

[Edit:] On a positive note, my Windows PC w/GTX980 seems to be ripping through guppies some 2-4x baseline rate, so Petri's efforts are starting to pay off even though considerable polish is needed. Will eyeball validation characteristic after work, and see if we're a bit closer to a triple platform alpha.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1818094 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1818179 - Posted: 19 Sep 2016, 9:22:54 UTC - in response to Message 1818094.  

... functional settings ... <cmdline> ... The first case is non-ideal because it makes configuration of a stock app impossible.

Actually, you can pass a cmdline via the <app_version> extension to app_config.xml

But that doesn't help with automatically setting up the right parameters for a stock-distributed application, for the 99% of users who just want it to work automagically - or just work, period.
ID: 1818179 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1818182 - Posted: 19 Sep 2016, 10:07:51 UTC - in response to Message 1818179.  

... functional settings ... <cmdline> ... The first case is non-ideal because it makes configuration of a stock app impossible.

Actually, you can pass a cmdline via the <app_version> extension to app_config.xml

But that doesn't help with automatically setting up the right parameters for a stock-distributed application, for the 99% of users who just want it to work automagically - or just work, period.


Yes, it's default/stock-distributed operation I'm looking toward. Since the next round looks pretty promising performance wise, getting the usability and 'out of the boxness' right will be priority (even if appropriate for some subset of hosts & third party).

I was lucky enough to be here, and quite green, when Lunatics/KWSN 2.0 and 2.2 were a thing that introduced a bench lag during startup. For the <1% that were dedicated crunchers, that wasn't such a big deal, though for general stock issue the less obtrusive and more automated the better.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1818182 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1818188 - Posted: 19 Sep 2016, 10:59:33 UTC - in response to Message 1818094.  

Good news.
I was looking over the bottom of the zi+a confsettings.cpp and noticed it was the same as the zi+ file after commenting out those two lines. So, I decided to try the same edit that got zi+ working. It seems to have worked; at least it's not just sitting there at 100% CPU usage anyway. It's running the Benchmark now. I changed this:
	fprintf(stderr,"pulsefind: periods per launch %d %s\n", pfPeriodsPerLaunch, (pfPeriodsPerLaunch == def) ? "(default)":"" );
#else // not win, is other
	confSetPriority = pt_BELOWNORMAL;
	/*	if(pfBlocksPerSM == 0)
		pfBlocksPerSM = (gCudaDevProps.major < 2) ? 1 : 64;
		else
		fprintf(stderr, "Using pfb = %d from command line args\n", pfBlocksPerSM);*/
	//pfBlocksPerSM = 64;
	
	//	if(pfPeriodsPerLaunch == 0)
	//pfPeriodsPerLaunch = 3;
	//	else
	//	  fprintf(stderr, "Using pfp = %d from command line args\n", pfPeriodsPerLaunch);
#endif //_WIN32

	if(unroll != 1)
	  fprintf(stderr, "Using unroll = %d from command line args\n", unroll);

	return;
}

To this:

	fprintf(stderr,"pulsefind: periods per launch %d %s\n", pfPeriodsPerLaunch, (pfPeriodsPerLaunch == def) ? "(default)":"" );
#else // not win, is other
    confSetPriority = pt_BELOWNORMAL;
    pfBlocksPerSM = (gCudaDevProps.major < 2) ? DEFAULT_PFBLOCKSPERSM:DEFAULT_PFBLOCKSPERSM_FERMI;
    pfPeriodsPerLaunch = DEFAULT_PFPERIODSPERLAUNCH;
#endif //_WIN32

	if(unroll != 1)
	  fprintf(stderr, "Using unroll = %d from command line args\n", unroll);

	return;
}

I don't know if that's the best choice, but it does get the App working.
I'll post the 'marks in a little while.
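
For comparison, the commented-out block in the first listing appears to have been aiming at "use the command-line value if one was given, otherwise pick a default by compute capability". A minimal standalone sketch of that pattern follows. The default values are made up, and `PulsefindSettings` and `applyDefaults` are hypothetical names standing in for the real globals and `DEFAULT_*` macros, whose values are not shown in this thread.

```cpp
#include <cassert>

// Hypothetical stand-ins for the app's DEFAULT_* macros (values are placeholders).
constexpr int kDefaultPfBlocksPerSm      = 1;
constexpr int kDefaultPfBlocksPerSmFermi = 16;
constexpr int kDefaultPfPeriodsPerLaunch = 100;

struct PulsefindSettings {
    int blocksPerSm;       // 0 means "not set on the command line"
    int periodsPerLaunch;  // 0 means "not set on the command line"
};

// Fill in defaults only for the values the user left unset.
// computeMajor is the device's major compute capability (2 and up is Fermi or newer).
PulsefindSettings applyDefaults(PulsefindSettings s, int computeMajor) {
    assert(s.blocksPerSm >= 0 && s.periodsPerLaunch >= 0);
    if (s.blocksPerSm == 0)
        s.blocksPerSm = (computeMajor < 2) ? kDefaultPfBlocksPerSm : kDefaultPfBlocksPerSmFermi;
    if (s.periodsPerLaunch == 0)
        s.periodsPerLaunch = kDefaultPfPeriodsPerLaunch;
    return s;
}
```

Unlike the edit above, which always assigns the defaults, this keeps any -pfb or -pfp value passed on the command line.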
ID: 1818188 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1818191 - Posted: 19 Sep 2016, 11:25:30 UTC

Here are the benchmarks using the last version of zi3i I have, versus the last version of zi+a I was sent.

KWSN-Darwin-MBbench v2.1.07
Running on TomsMacPro.local at Mon Sep 19 10:46:35 2016
---------------------------------------------------
Starting benchmark run...
---------------------------------------------------
Listing wu-file(s) in /testWUs :
02ap09ab.355.104188.7.34.13.wu blc2_2bit_guppi_57403_HIP11048_0006.17091.831.22.45.71.wu reference_work_unit_r3215.wu

Listing executable(s) in /APPS :
setiathome_x41p_zi+_x86_64-apple-darwin_cuda75 setiathome_x41p_zi3i_x86_64-apple-darwin_cuda75

Listing executable in /REF_APPs :
MBv8_8.05r3344_sse41_x86_64-apple-darwin
---------------------------------------------------
Current WU: 02ap09ab.355.104188.7.34.13.wu
---------------------------------------------------
Skipping default app MBv8_8.05r3344_sse41_x86_64-apple-darwin, displaying saved result(s)
Elapsed Time: ………………………………… 8822 seconds
---------------------------------------------------
Running app with command : setiathome_x41p_zi+_x86_64-apple-darwin_cuda75 -bs -unroll 4 -device 0
      357.64 real       110.56 user        30.54 sys
Elapsed Time : ……………………………… 358 seconds
Speed compared to default : 2464 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.95%
---------------------------------------------------
Running app with command : setiathome_x41p_zi3i_x86_64-apple-darwin_cuda75 -bs -unroll 4 -device 0
      332.75 real       167.03 user        36.44 sys
Elapsed Time : ……………………………… 333 seconds
Speed compared to default : 2649 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.96%
---------------------------------------------------
Done with 02ap09ab.355.104188.7.34.13.wu.
Current WU: blc2_2bit_guppi_57403_HIP11048_0006.17091.831.22.45.71.wu
---------------------------------------------------
Skipping default app MBv8_8.05r3344_sse41_x86_64-apple-darwin, displaying saved result(s)
Elapsed Time: ………………………………… 4955 seconds
---------------------------------------------------
Running app with command : setiathome_x41p_zi+_x86_64-apple-darwin_cuda75 -bs -unroll 4 -device 0
      552.90 real        22.58 user        12.36 sys
Elapsed Time : ……………………………… 553 seconds
Speed compared to default : 896 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.37%
---------------------------------------------------
Running app with command : setiathome_x41p_zi3i_x86_64-apple-darwin_cuda75 -bs -unroll 4 -device 0
      522.05 real        77.05 user        19.24 sys
Elapsed Time : ……………………………… 522 seconds
Speed compared to default : 949 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.37%
---------------------------------------------------
Done with blc2_2bit_guppi_57403_HIP11048_0006.17091.831.22.45.71.wu.
Current WU: reference_work_unit_r3215.wu
---------------------------------------------------
Skipping default app MBv8_8.05r3344_sse41_x86_64-apple-darwin, displaying saved result(s)
Elapsed Time: ………………………………… 2198 seconds
---------------------------------------------------
Running app with command : setiathome_x41p_zi+_x86_64-apple-darwin_cuda75 -bs -unroll 4 -device 0
      107.30 real        31.28 user         9.15 sys
Elapsed Time : ……………………………… 108 seconds
Speed compared to default : 2035 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.81%
---------------------------------------------------
Running app with command : setiathome_x41p_zi3i_x86_64-apple-darwin_cuda75 -bs -unroll 4 -device 0
      101.40 real        51.99 user        11.42 sys
Elapsed Time : ……………………………… 101 seconds
Speed compared to default : 2176 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.81%
---------------------------------------------------

The last zi+ App has been working fine with driver 7.5.27 since removing those settings, we'll see how this one works.
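
For anyone checking the "Speed compared to default" lines: the figures in the log above are consistent with the reference app's elapsed time divided by the whole-second elapsed time of the tested app, with the percentage truncated. That formula is inferred from the numbers, not taken from the MBbench script, and `speedPercent` is a hypothetical name.

```cpp
#include <cassert>

// Whole-number "speed compared to default" percentage.
// Both times are whole seconds as printed on the "Elapsed Time" lines;
// integer division truncates, which matches the figures in the log.
long speedPercent(long referenceSeconds, long elapsedSeconds) {
    assert(elapsedSeconds > 0);
    return referenceSeconds * 100 / elapsedSeconds;
}
```

With the first work unit, `speedPercent(8822, 358)` gives 2464 for zi+ and `speedPercent(8822, 333)` gives 2649 for zi3i, as reported.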
ID: 1818191 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1818648 - Posted: 21 Sep 2016, 16:00:42 UTC

The good news is the latest version of zi+ also works in Darwin 15.6 with the current 7.5.30 driver without any Stalling/Hanging on the 750Ti. The bad news is most of the recent Arecibo Overflows are listed as Inconclusive, meaning I now have Pages of Overflow Inconclusives. So far none have been listed as Invalid on the Mac, but, the Linux machine running zi3i has the same problem and some of those Overflows have been Invalid.
ID: 1818648 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1818656 - Posted: 21 Sep 2016, 16:43:42 UTC - in response to Message 1818648.  
Last modified: 21 Sep 2016, 17:04:21 UTC

Hmmm, interesting. In the scheme of things, inconclusive overflows are a fairly minor thing compared to woes of the past, though I'll stick it on the list of quirks to look for here (as its impact is likely to grow with throughput). So far the list contains that, the familiar Windows drivers not liking Boinc's shutdown methods, and some finer-grained load control needed for my crummy CPU (probably also mostly a Windows-only issue, though it could come in handy elsewhere). I *think* the weekend should sort at least the last two.

[Edit:] I've yet to see inconclusive overflows here, so if you can save any such tasks, that'd be great for later cross-platform comparison and refinement purposes.

[Edit2:] naturally as soon as I posted that, my host reached a run of them in cache, so grabbing at least one for inspection on the weekend.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1818656 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1819582 - Posted: 25 Sep 2016, 6:03:09 UTC

A few steps closer here with Windows, having isolated some issues that will have to be designed for down the line (while test builds circulate). For alpha here will just need to turn some workarounds for the blocking sync into parameters, as Windows is pretty tetchy about the duration of those improved long pulsefinds.

zi+a source definitely brought the validation characteristic into line, with dominant inconclusives being the familiar dicey wingmen/apps. If you see more than the expected <5% inconclusive/pending ratio on OSX or Linux, then it'll likely be something build/platform specific, as opposed to any further lurking accuracy issues in the source.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1819582 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1819603 - Posted: 25 Sep 2016, 11:04:08 UTC - in response to Message 1819582.  

zi+a source definitely brought the validation characteristic into line, with dominant inconclusives being the familiar dicey wingmen/apps. If you see more than the expected <5% inconclusive/pending ratio on OSX or Linux, then it'll likely be something build/platform specific, as opposed to any further lurking accuracy issues in the source.

Could you check a workunit we've been worrying over at Beta, please? Beta WU 8902774

The final replicant, task 24852195, is Petri's own machine, reporting as 'x41p_zi3j, Cuda 8.00 special'. As posted in Beta message 59697, the final reported pulse seems to be out of tolerance for validation, although the 'best pulse' matches the values which should have been reported.
ID: 1819603 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1819649 - Posted: 25 Sep 2016, 13:47:08 UTC - in response to Message 1819603.  

zi+a source definitely brought the validation characteristic into line, with dominant inconclusives being the familiar dicey wingmen/apps. If you see more than the expected <5% inconclusive/pending ratio on OSX or Linux, then it'll likely be something build/platform specific, as opposed to any further lurking accuracy issues in the source.

Could you check a workunit we've been worrying over at Beta, please? Beta WU 8902774

The final replicant, task 24852195, is Petri's own machine, reporting as 'x41p_zi3j, Cuda 8.00 special'. As posted in Beta message 59697, the final reported pulse seems to be out of tolerance for validation, although the 'best pulse' matches the values which should have been reported.


Can look at that one after a short work day in the morning. Petri's own Linux build (p_zi3j) is likely significantly different from my Windows one (zipa2) and may not have the pulsefind fix (not sure on that).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1819649 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1819807 - Posted: 26 Sep 2016, 1:15:51 UTC - in response to Message 1819603.  

@Richard, I'm not getting anything on the spreadsheet generated address of http://boinc2.ssl.berkeley.edu/beta/download/13b/27jl16aa.19977.160094.8.42.65

Something changed?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1819807 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1819862 - Posted: 26 Sep 2016, 6:39:24 UTC - in response to Message 1819807.  

@Richard, I'm not getting anything on the spreadsheet generated address of http://boinc2.ssl.berkeley.edu/beta/download/13b/27jl16aa.19977.160094.8.42.65

Something changed?

Looks like Eric must have turned on file deletion. You have mail.
ID: 1819862 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1820172 - Posted: 27 Sep 2016, 12:48:36 UTC
Last modified: 27 Sep 2016, 13:03:01 UTC

Just posted an Updated App at Crunchers Anonymous for the ATI5/AMD Mac Pros. This is the same App that has been running for the past month on this Host, Computer 6105482. This particular Host is running 3 tasks at once and using aggressive cmdline settings; your mileage will vary. The App may miss a single running in El Capitan, it is recommended for Darwin 16.x.

In other news, I ran a test with the Linux CUDA Apps p_zi3i and p_zi+. It appears that p_zi3i didn't match one of the signals that just happened to be on the particular GUPPI task I randomly chose. Interesting.
Starting benchmark run...
----------------------------------------------------------------
Listing wu-file(s) in /testWUs :
blc4_2bit_guppi_57449_48424_HIP83043_OFF_0026.9575.831.17.26.142.vlar.wu
reference_work_unit_r3215.wu

Listing executable(s) in /APPS :
setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75
setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda75

Listing executable in /REF_APPS :
MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu
----------------------------------------------------------------
Current WU: blc4_2bit_guppi_57449_48424_HIP83043_OFF_0026.9575.831.17.26.142.vlar.wu

----------------------------------------------------------------
Skipping default app MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu, displaying saved result(s)
Elapsed Time: ....................... 4756 seconds
----------------------------------------------------------------
Running app with command : ./setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 4 -device 1
./setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 4 -device 1
599.72 sec 66.18 sec 36.78 sec
Elapsed Time : ...................... 600 seconds
Speed compared to default : ......... 792 %
-----------------
Comparing results
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0      1      1      1      0        0      1      1      1      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0     27     27     27      1        0     27     27     27      1
      Triplet      0      1      1      1      0        0      1      1      1      0
   Best Spike      0      0      0      0      0        0      0      0      0      0
Best Gaussian      0      0      0      0      0        0      0      0      0      0
   Best Pulse      0      0      0      0      0        0      0      0      0      0
 Best Triplet      0      0      0      0      0        0      0      0      0      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   0     29     29     29      1        0     29     29     29      1

Unmatched signal(s) in R1 at line(s) 472
Unmatched signal(s) in R2 at line(s) 472
For R1:R2 matched signals only, Q= 99.28%
Result      : Weakly similar.

----------------------------------------------------------------
Running app with command : ./setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda75 -bs -unroll 4 -device 1
./setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda75 -bs -unroll 4 -device 1
639.63 sec 22.38 sec 4.45 sec
Elapsed Time : ...................... 640 seconds
Speed compared to default : ......... 743 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.28%

----------------------------------------------------------------
Done with blc4_2bit_guppi_57449_48424_HIP83043_OFF_0026.9575.831.17.26.142.vlar.wu

====================================================================
Current WU: reference_work_unit_r3215.wu

----------------------------------------------------------------
Running default app with command :... MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu
./MBv8_8.0r3305_ssse3_x86_64-pc-linux-gnu
2000.36 sec 1992.30 sec 1.09 sec
Elapsed Time: ....................... 2001 seconds

----------------------------------------------------------------
Running app with command : ./setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 4 -device 1
./setiathome_x41p_zi3i_x86_64-pc-linux-gnu_cuda75 -bs -unroll 4 -device 1
116.86 sec 44.68 sec 28.91 sec
Elapsed Time : ...................... 117 seconds
Speed compared to default : ......... 1710 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.82%

----------------------------------------------------------------
Running app with command : ./setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda75 -bs -unroll 4 -device 1
./setiathome_x41p_zi+_x86_64-pc-linux-gnu_cuda75 -bs -unroll 4 -device 1
119.58 sec 26.13 sec 13.99 sec
Elapsed Time : ...................... 119 seconds
Speed compared to default : ......... 1681 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.82%

----------------------------------------------------------------

Well, the GUPPI was a late overflow...
I haven't seen any p_zi3j ;)

So far p_zi+ seems to run very similarly to p_zi3i on my machine with driver 367.44; I've yet to test it with 361.
ID: 1820172 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1820203 - Posted: 27 Sep 2016, 15:05:33 UTC

Something to think about on the CUDA side too:
if the device is capable of pairing kernels, how will it ensure that such pairing does not trigger a TDR under Windows?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1820203 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1820218 - Posted: 27 Sep 2016, 15:50:27 UTC - in response to Message 1820203.  
Last modified: 27 Sep 2016, 15:54:15 UTC

Something to think about on the CUDA side too:
if the device is capable of pairing kernels, how will it ensure that such pairing does not trigger a TDR under Windows?


Yeah, it's getting tricky/fiddly on that front. I'm having to inject extra sync points to overcome the driver optimisations that fuse kernels in the streams, making them too long. It is working on my host though, with conservative settings. I'd imagine my system might be the worst case for this, with an old Core2Duo, only DDR2, and PCIe 1.1 trying to feed a GTX 980. Naturally the 980 is easily choking the rest of the system unless I dial back a lot. Probably others will have less trouble once I add some settings/options, and with a better match of hardware.

I will probably end up having to add an option to split the 3-4-5 folds that are fused, at some unknown compromise. Managed to get the 3-4 second lags down to a usable range, but it'll take some rework to eventually scale for the full range of hardware.
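
One way to picture the extra sync points: given rough per-kernel time estimates, issue a hard sync whenever the queued work would exceed some budget, so that no fused chain runs long enough to upset the display driver. The sketch below is host-side planning logic only, with hypothetical names (`planSyncPoints`, `budgetMs`) and caller-supplied timings; it is not how zi+a does it. In a real CUDA app the sync itself would be a call such as `cudaStreamSynchronize`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the indices of kernels after which a hard sync should be issued,
// so that the estimated time between consecutive syncs stays within budgetMs.
// A single kernel longer than the budget is isolated: sync before and after it.
std::vector<std::size_t> planSyncPoints(const std::vector<double>& kernelMs, double budgetMs) {
    assert(budgetMs > 0.0);
    std::vector<std::size_t> syncAfter;
    double queued = 0.0;
    for (std::size_t i = 0; i < kernelMs.size(); ++i) {
        // If adding this kernel would overrun the budget, sync before it runs.
        if (queued > 0.0 && queued + kernelMs[i] > budgetMs) {
            syncAfter.push_back(i - 1);
            queued = 0.0;
        }
        queued += kernelMs[i];
    }
    // Flush whatever is still queued at the end of the chain.
    if (queued > 0.0) syncAfter.push_back(kernelMs.size() - 1);
    return syncAfter;
}
```

A smaller budget means more syncs and less lag at the cost of throughput, which is the compromise a setting would have to expose.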
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1820218 · Report as offensive