Posts by petri33


1) Message boards : Number crunching : Lack of SoG WUs (Message 1819470)
Posted 2 days ago by petri33
Hi,

The ar (angle range) doesn't affect the CPU tasks much. I've seen two major categories.

Just before I headed off for a short family trip to a holiday resort, I had a major vision of how to improve triplet finding at low or very low ar.

I'll ponder it over and over in my mind, try to still be present for my family, and then finally implement it late Sunday evening or Monday afternoon. I'm not sure it will reduce the processing time of GBT packets, but this is something I have to test and try.

The idea is similar to the -unroll in the special version that cut the time in half.


Now some ()___)________))))~~~
and sleep.
2) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1818492)
Posted 6 days ago by petri33

There may be over 30 pulses, over 30 triplets, over 30 autocorrelations, or over 30 spikes in the same packet. Any of them can cause an overflow, and some of them may not have been processed yet. Parallel execution is different from sequential.

Also, there can be many more than 30 spikes in an overflow, for example. And because many arrays are processed at once per kernel call, some ordering is required not only between kernel calls but inside a single kernel call too.

I made an attempt to emulate serial order as much as possible without real sacrifices in performance. For example, the first 50 icffts are done with a sync on each iteration with SoG. This allows catching most of the early overflows, but late overflows remain and give inconclusives from time to time.


Well said.

On noisy packets (low average, low peak) any peak will give a 'signal'. On parallel systems it is almost impossible to report back all found signals without using an excessive amount of work and memory just for bookkeeping. If a packet is noisy and has over 30 of anything, it is a bad packet. It may or may not validate against any current implementation or the future quantum-sah-open-room-temperature-CLUDA-supra.exe.

The old zi is not very parallel. Only zi3 and zi+ have the -unroll in pulsefinding. Thus pulses are found in a totally new order (almost randomly, depending on the GPU scheduler, over which I have no control).

And just as Raistmer said: the order does not matter when the number of signals is reasonable and the findings otherwise match.

Petri
3) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1818399)
Posted 6 days ago by petri33
1) The order of processing is different. The check for triplets, pulses, spikes, gaussians and autocorrelations is not done in the same order as in the main version. Pulses tend to take the longest on the GPU, so I check them last. I see no problem in sending a 4-second task for rechecking on another host. The data is invalid anyway. I could store the findings and report them in the same order as main, but that is not my priority right now.

There may be over 30 pulses, over 30 triplets, over 30 autocorrelations, or over 30 spikes in the same packet. Any of them can cause an overflow, and some of them may not have been processed yet. Parallel execution is different from sequential.
4) Message boards : Number crunching : GPU FLOPS: Theory vs Reality (Message 1818292)
Posted 7 days ago by petri33


That'll be tricky while driving 2 x 27" displays watching video and underfeeding with a core2duo, haha


My 130" video screen ...

The Infinity speakers are nowadays Genelec.
5) Message boards : Number crunching : GPU FLOPS: Theory vs Reality (Message 1818283)
Posted 7 days ago by petri33
Cheers. I had thought the stock OpenCL CPU usage issue had been solved with a default -use_sleep option to more closely resemble Cuda's ... guess not.

In any case, my Win+GTX 980 host is now running pre-alpha Petri's optimisations, single instance. Looks like a fair bit to generalise, though looking at those figures some care would bring Cuda back on the charts. It would be interesting to have some idea of where mine+new code might fall on the charts, even if the heavy guppi bias subsides.

On the tricky moving target issue of Power for the Credit/Whr comparisons, I'm noticing there is still decent headroom available in terms of Power%, temperature, and any other metric I look at. My guess is that Cuda+OpenCL will just end up trading blows until it becomes splitting hairs. At that point we probably switch to other techniques anyway.


Can you put -unroll 16 to your options?


Did that for a bit, though I needed to lighten the load for unrelated reasons. I'll be able to wind out the settings for today, though bear in mind I'm going for comfort, lol.


Yeah,

I was hoping to see some guppi tasks go near 300-500 seconds instead of the 1000 now. :)
6) Message boards : Number crunching : GPU FLOPS: Theory vs Reality (Message 1818258)
Posted 7 days ago by petri33
Cheers. I had thought the stock OpenCL CPU usage issue had been solved with a default -use_sleep option to more closely resemble Cuda's ... guess not.

In any case, my Win+GTX 980 host is now running pre-alpha Petri's optimisations, single instance. Looks like a fair bit to generalise, though looking at those figures some care would bring Cuda back on the charts. It would be interesting to have some idea of where mine+new code might fall on the charts, even if the heavy guppi bias subsides.

On the tricky moving target issue of Power for the Credit/Whr comparisons, I'm noticing there is still decent headroom available in terms of Power%, temperature, and any other metric I look at. My guess is that Cuda+OpenCL will just end up trading blows until it becomes splitting hairs. At that point we probably switch to other techniques anyway.


Can you put -unroll 16 to your options?
7) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1817849)
Posted 9 days ago by petri33
Building here. Visual Studio has become pretty whiny about the includes/device code, which I've mitigated by separating the cuda device codelets from cudaAcceleration.h into a separate cudaAcceleration_inlines.h. There are a couple of other minor Windows build breakages that I'm wading through now.

There'll be quite a bit of restructuring needed to properly separate the device specific code from the common core, though that can come gradually while various builds are in test.


@jason_gee: My bad. I placed the cache handling internals into the very first place that was common to all CUDA files. I should have made that a separate file and #included it where needed.

My goal was to get it working and make the code run fast in my private environment. Your goal (a good one for the rest of us) is to make it portable.

Thank you Jason!
8) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1817787)
Posted 9 days ago by petri33
@TBar
you've got email.
9) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1817784)
Posted 9 days ago by petri33
@TBar

I've now got a running zi+ with -unroll and -bs.
Preliminary tests show that it is working, at least on my 1080 Linux system.

The newer zi3 is a bit faster, but I'll let you run the Mac 750Ti tests. Then we will know if the -unroll and -bs work as intended.

Now I'll put the device selection into it.

Petri
10) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1817624)
Posted 10 days ago by petri33
Ah, yeah, two things that may help: First @petri33 a reminder to update your main.cpp from svn because of a boincapi change (this fixes the device selection). I applied Juha's patch to both baseline and alpha main.cpp.

Second, my Linux machine is a GTX 680 (Kepler class, but only compute capability 3.0). I'll be pretty determined to refactor the 1 or 2 kernels in alpha that demand compute capability 3.2+ early on, since dividing support in the middle of a major compute capability is bound to cause issues in the same way BOINC does by changing behaviour in the middle of a major version.

Probably at least the second issue will remain an issue until I can take the time to inject the proper preprocessor controls in the .cu files, however that's probably a comparatively minor issue compared to the major ones likely to crop up soon.


Thanks, will do.

Also, the older x41p_zi needs the Blocking Sync. I built a version last week with the Older BS changes, but it was producing infrequent Overflows with 30 Gaussians. I built another without the BS on the Gaussian line, but haven't had a chance to test it. I suppose I could try it now even though about all I've got are GUPPIs. It would be nice to have a few Arecibo tasks to breakup all these GUPPIs.


I'll add the blocking sync -bs flag and the device selection thingy too.
11) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1817539)
Posted 10 days ago by petri33
Ah, yeah, two things that may help: First @petri33 a reminder to update your main.cpp from svn because of a boincapi change (this fixes the device selection). I applied Juha's patch to both baseline and alpha main.cpp.

Second, my Linux machine is a GTX 680 (Kepler class, but only compute capability 3.0). I'll be pretty determined to refactor the 1 or 2 kernels in alpha that demand compute capability 3.2+ early on, since dividing support in the middle of a major compute capability is bound to cause issues in the same way BOINC does by changing behaviour in the middle of a major version.

Probably at least the second issue will remain an issue until I can take the time to inject the proper preprocessor controls in the .cu files, however that's probably a comparatively minor issue compared to the major ones likely to crop up soon.


Thanks, will do.
12) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1817538)
Posted 10 days ago by petri33
@TBar

Hi, I'm working on the 'zi plus unroll' right now. Some adjustments are needed to get the unroll going.

Petri
13) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1817421)
Posted 11 days ago by petri33
Yep, the last time I looked at the Mac hosts at Beta, all the hosts idled by the block on Darwin 15.x are just sitting there doing nothing. They are either not aware they could be testing the new apps, or they don't want to be like everyone else and install an nVidia driver to run CUDA on their Mac. It won't hurt, people; everyone running Windows and Linux also has to install a driver to run SETI work. Just install the latest driver and update it when you update the OS: http://www.nvidia.com/object/mac-driver-archive.html. If you're running an older card in Mountain Lion or Lion, install this one: http://www.nvidia.com/object/macosx-cuda-5.5.47-driver.html

I built another OpenCL App earlier, it's not any better than the last one in Darwin 15.6;
Running on TomsMacPro.local at Thu Sep 15 13:57:29 2016
---------------------------------------------------
Starting benchmark run...
---------------------------------------------------
Listing wu-file(s) in /testWUs : reference_work_unit_r3215.wu sniff.wu
Listing executable(s) in /APPS : MBv8_8.18r3528_NV_ssse3_x86_64-apple-darwin
Listing executable in /REF_APPs : MBv8_8.05r3344_sse41_x86_64-apple-darwin
---------------------------------------------------
Current WU: reference_work_unit_r3215.wu
---------------------------------------------------
Skipping default app MBv8_8.05r3344_sse41_x86_64-apple-darwin, displaying saved result(s)
Elapsed Time: ………………………………… 2110 seconds
---------------------------------------------------
Running app with command :
MBv8_8.18r3528_NV_ssse3_x86_64-apple-darwin -sbs 192 -oclfft_tune_gr 256 -oclfft_tune_wg 128 -device 2
Elapsed Time : ……………………………… 418 seconds
Speed compared to default : 504 %
----------------- Comparing results -----------------
                ------------- R1:R2 ------------    ------------- R2:R1 ------------
                Exact Super Tight Good  Bad         Exact Super Tight Good  Bad
Spike               0     9    11   13    0             0     9    11   13    0
Autocorr            0     1     1    1    0             0     1     1    1    0
Gaussian            0     0     0    1    5             0     0     0    1    5
Pulse               0     0     0    0    0             0     0     0    0    2
Triplet             0     1     1    2    0             0     1     1    2    1
Best Spike          0     1     1    1    0             0     1     1    1    0
Best Autocorr       0     1     1    1    0             0     1     1    1    0
Best Gaussian       0     0     0    0    1             0     0     0    0    1
Best Pulse          0     0     0    0    1             0     0     0    0    1
Best Triplet        0     1     1    1    0             0     1     1    1    0
                 ---- ----  ---- ---- ----          ---- ----  ---- ---- ----
                    0    14    16   20    7             0    14    16   20   10
Unmatched signal(s) in R1 at line(s) 499 526 580 607 634 694 720
Unmatched signal(s) in R2 at line(s) 482 509 526 569 595 649 676 703 763 789
For R1:R2 matched signals only, Q= 7.885%
Result : Weakly similar.
---------------------------------------------------
Done with reference_work_unit_r3215.wu.
Current WU: sniff.wu
---------------------------------------------------
Skipping default app MBv8_8.05r3344_sse41_x86_64-apple-darwin, displaying saved result(s)
Elapsed Time: ………………………………… 199 seconds
---------------------------------------------------
Running app with command :
MBv8_8.18r3528_NV_ssse3_x86_64-apple-darwin -sbs 192 -oclfft_tune_gr 256 -oclfft_tune_wg 128 -device 2
Elapsed Time : ……………………………… 25 seconds
Speed compared to default : 796 %
----------------- Comparing results -----------------
                ------------- R1:R2 ------------    ------------- R2:R1 ------------
                Exact Super Tight Good  Bad         Exact Super Tight Good  Bad
Spike               0     2     5   10    1             0     2     5   10    0
Autocorr            0     1     1    2    0             0     1     1    2    0
Gaussian            0     0     0    7    4             0     0     0    7    4
Pulse               0     1     1    1    2             0     1     1    1    2
Triplet             2     2     2    2    0             2     2     2    2    0
Best Spike          0     0     1    1    0             0     0     1    1    0
Best Autocorr       0     0     0    1    0             0     0     0    1    0
Best Gaussian       0     0     0    0    1             0     0     0    0    1
Best Pulse          0     0     0    0    1             0     0     0    0    1
Best Triplet        1     1     1    1    0             1     1     1    1    0
                 ---- ----  ---- ---- ----          ---- ----  ---- ---- ----
                    3     7    11   25    9             3     7    11   25    8
Unmatched signal(s) in R1 at line(s) 554 613 738 765 792 808 834 894 920
Unmatched signal(s) in R2 at line(s) 586 695 738 765 792 818 878 904
For R1:R2 matched signals only, Q= ????
Result : Weakly similar.
---------------------------------------------------

Bad juju going on there with MBv8_8.18r3528_NV_ssse3_x86_64-apple-darwin
The CUDA App is looking much better, http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=63959


Hi,

Should anything show up with zi3i as a no-go / stop working ... then ...

How about, for the Mac, rigging the old and reliable zi with unroll?


I need your cudaAcceleration.cu and cudaAcceleration.h, plus cudaAcc_pulsefind.cu and confsettings.cpp, in addition to main.cpp.


With those files (in a zip to my email) I'll return a zi+ version to test on the Mac. (I hope you're not too tired yet for testing, testing, testing, ... "Is this thing even on?", testing, ..., "now it works. So ...")

That would give a nice guppi speed boost and hopefully maintain usability, I guess.

Should this give you any kind of a headache, I will not mention it again :)
14) Message boards : Number crunching : Boinc without screen saver? (Message 1817218)
Posted 12 days ago by petri33
...
As TBar has pointed out, the apps for macs aren't as up to date as they could be and many return invalid results but he's working on new apps for them and hopefully will have some out for all of us in the future.
...

Zalster



I'd like to thank TBar for porting the latest Linux advancements, and behind the curtain the beta versions, to the Mac. And for the time and effort he's spent on testing all the different versions and configurations. Without his effort there would be a lot of bugs around, and the Windows version to come would be much further in the future.

EDIT: And thank you to the other fellows doing their job too.

Petri
15) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1817088)
Posted 12 days ago by petri33
I do hope you can find a solution to the cache problem as well. Pascal has 24-48KB (depending on GP100 or GP104) of L1 cache per SM, whilst Maxwell has 48KB per SM...


The cache management is taking place already. The CUDA API allows different kinds of reads and writes to be specified. We have three major platforms that can use the current special version: Kepler, Maxwell and Pascal. They all may benefit a few percent from different kinds of memory optimizations, but they all receive a 50% (or more) reduction in run time from what is implemented now.

Cache modes include: use L1 and L2; use L2 only; use none; and use streaming (mark data as ready to be discarded after use). They are well documented in the CUDA specification manuals that come with the devkit.
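To make one of those choices concrete, here is a hedged CUDA sketch (not the actual special-app kernel; the kernel name and shape are invented). `__ldg()` routes a load through the read-only data cache on compute capability 3.5+ hardware, while compiling with `-Xptxas -dlcm=cg` makes plain global loads cache in L2 only:

```cuda
// Invented example kernel; launch as
//   block_max<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
__global__ void block_max(const float* __restrict__ in, float* out, int n)
{
    extern __shared__ float s[];               // one float per thread
    int tid = threadIdx.x;

    float best = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        best = fmaxf(best, __ldg(&in[i]));     // read-only cache path (cc 3.5+)

    s[tid] = best;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] = fmaxf(s[tid], s[tid + stride]);
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];      // plain store: default L1/L2 policy
}
```

The input is streamed once and never re-read by the same thread, so sending it through the read-only path keeps it from evicting data that the default path would rather keep.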
16) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1817075)
Posted 12 days ago by petri33
Hi,

I found and fixed a bug concerning autocorrelations giving 'big' results.

One of the GPU memory buffers (dev_ac_partials) was used as both an input and an output, and depending on the parallel order of calculations, sometimes part of one thread's input had already been overwritten by another thread that completed a bit earlier.

It was hard to find, since I know that parallel implementations are prone to this kind of error and I try to avoid using the same memory area for multiple purposes.

I'm testing zi3i now in beta and in main.
17) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1816139)
Posted 16 days ago by petri33
Hi,

Yes, I noticed it did not slow things down.
The changes work with cufft 6.5 but give a lot of autocorr errors when run with cufft 8.0.

To get the error number in autocorrelation you can change the code to:

if (fft_num == 0)
{
    // Host (CPU) code waits for the specific GPU task to complete.
    err = cudaEventSynchronize(autocorrelationDoneEvent);
    if (cudaSuccess != err)
    {
        fprintf(stderr, "GetAutocorr - sync fft_num = %d error = %d\r\n", fft_num, err);
        exit(0);
    }
}
18) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1816127)
Posted 16 days ago by petri33
Task blc5_2bit_guppi_57449_46749_HIP83043_0021.14243.831.18.27.242.vlar_2 exited with zero status but no 'finished' file

Yes, that is the task running on the 750Ti. It usually happens sometime before the 750Ti Stalls.
Oh well, I was able to apply the Pulsefind fix, and the Blocking Sync to zi3c, and it did pass the benchmark.
I suppose I could boot back to Darwin 15.4 and try it with Driver 8.0.29, 'cause it looks like it's going to Stall in 14.5 with driver 7.5.27.



Hi TBar,

I had to make a small change to cudaAcceleration.cu to make the autocorr work with 8.0:

//cu_errf = cufftPlan1d(&cudaAutoCorr_plan, ac_fftlen*2, CUFFT_R2C, 8); // RFFT method, batch of 8
int size = ac_fftlen*2;
cu_errf = cufftPlanMany(&cudaAutoCorr_plan, 1, &size, NULL, 0, 0, 0, 0, 0, CUFFT_R2C, 8);


The plan1d with a batch is deprecated and does not work correctly in 8.0. PlanMany works OK.
EDIT: no it does not. I have found one wu that still gives an ac error. That is good - now I can debug.
19) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1815929)
Posted 17 days ago by petri33
Thanks Petri.
I was about to try it with Toolkit 8 and ran across this post while pondering the connection to cudaAutoCorr_plan.
cufft: use CUFFT_COMPATIBILITY_FFTW_PADDING instead of CUFFT_COMPATIBILITY_NATIVE
It still works in Toolkit 7.5, so, not sure if it will help. I'm going to replace the 3 CUFFT_COMPATIBILITY_NATIVE entries with CUFFT_COMPATIBILITY_FFTW_PADDING.
Unless someone comes up with a better plan...soon.


Native with 7.5.
That is one reason I use 7.5 include files and the 6.5 or 7.5 library for fft.
I haven't tried the 8.0 cufft since it doesn't support NATIVE.
20) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1815909)
Posted 17 days ago by petri33
Well, swapping cards in the same host rules out faulty hardware well enough.
That leaves a faulty driver (driver/OS combo), just as you already managed to prove in the case of the OS X OpenCL builds: they work under one OS X/driver and produce invalids under another.
Can this case be ruled out for the CUDA app? Do you see wrong autocorr signals under the whole range of OS/driver versions?
If so, then a faulty OS/driver combo will be ruled out too, and then closer attention should be paid to the app itself.

The interesting thing is, the two 950s Never had the stalling problem, it Only happens with the 750Ti cards. The stalls were present with the first version of zi3, very noticeable, the AC problems were also present but very infrequent. Both problems became worse with newer versions. The stalls would only happen with driver 7.5, both problems happened in Darwin 14.5 and 15.x, and the AC problem happens with all combinations. With driver 8.0.29, which only works with 15.5 to 15.0, the only problems so far, are the AC problems. Again, None of these problems exist with the other cuda Apps. I did just compile a new cuda80 zi3g App a few hours ago, I'm still looking at it. I also just compiled a new zi3g App for the Linux machine, still looking at that as well.

There is now a zi3h version in your mailbox. Hope it cures the ac.



Copyright © 2016 University of California