Posts by jason_gee

1) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1875057)
Posted 56 minutes ago by Profile jason_gee
Post:
...All the reported signals and Best signals seem to match between the two.


If implemented as I picture it: for the pulse-mechanism shunt/workaround, the stderr.txt 'realtime' log might show the racing pulse detections, then the shunt to unroll 1 recording the correct ones. If that's the case, it does reflect reality in the new 'racey-fixey' kind of way, but may need to be presented more clearly.
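To illustrate the shape of that workaround (invented names and structure, not Petri's actual code): an unrolled parallel reduction can report either of two equal-power pulses depending on timing, while a serial unroll-1 pass always records the same one.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch only; not the real pulsefinding code.
struct Pulse { double power; std::size_t bin; };

// Serial (unroll-1) selection: strict '>' means the first of any equal-power
// pulses always wins, so the recorded detection is deterministic. An unrolled
// parallel reduction over the same data may land on either candidate,
// depending on which lane finishes first: the race being shunted around.
Pulse serial_best_pulse(const std::vector<Pulse>& pulses) {
    Pulse best{-1.0, 0};
    for (const Pulse& p : pulses)
        if (p.power > best.power) best = p;
    return best;
}
```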
2) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1874806)
Posted 1 day ago by Profile jason_gee
Post:
Have verified the Cuda baseline matches Joe Segur's bugfix/changes to stock best scoring:
Revision: 1146
Author: korpela
Date: Wednesday, 17 August 2011 7:41:35 AM
Message:
- Fix to bug introduced in last change of gaussfit.cpp. Much of the new code
is from the AK8 branch.
- Version number to 6.97

----
Modified : /branches/sah_v7/seti_boinc/client/gaussfit.cpp
...


Coming from baseline, Petri's special should match that logic (to be checked). The original modification this fixes, by Joe Segur, contains comments by Raistmer and appears to derive from an earlier AK commit by Raistmer (also committed via Eric). I'm unfamiliar with the intent, as mentioned, though Joe's comments on the code seem reasonable.

// Gauss score used for "best of" and graphics.
// This score is now set to be based upon the probability that a signal
// would occur due to noise and the probability that it is shaped like
// a Gaussian (normalized to 0 at thresholds). Thanks to Tetsuji for
// making me think about this. The Gaussian has 62 degrees of freedom and
// the null hypothesis has 63 degrees of freedom when gauss_pot_length=64;
//JWS: Calculate invariant terms once, ala Alex Kan and Przemyslaw Zych
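A toy rendition of what that comment describes (my invented numbers and names, not the real gaussfit.cpp formula, and it ignores the 62/63 degrees-of-freedom detail): fold the goodness-of-fit term and the null-hypothesis term into one score, normalised to 0 at the thresholds.

```cpp
// Illustrative sketch only; the thresholds are invented and the real score
// is probability-based, per the comment above.
double gauss_score_sketch(double chisq_fit, double chisq_null,
                          double fit_thresh = 1.42, double null_thresh = 2.0) {
    // Lower chisq_fit   => more gaussian-shaped   => higher score.
    // Higher chisq_null => less noise-like signal => higher score.
    // A signal sitting exactly at both thresholds scores 0.
    return (fit_thresh - chisq_fit) + (chisq_null - null_thresh);
}
```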

Probably each branch will need checking, in case a lack of this change propagated from AK sources into other builds.
3) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1874797)
Posted 1 day ago by Profile jason_gee
Post:

...
One thing that I think I'm noticing is that when there is a reported Gaussian, that peak will match the Best Gaussian peak in SoG. However, in the other apps, the Best Gaussian will have a higher peak than the reported Gaussian. Perhaps there's some significance there. Or perhaps not. :^)
Hmmm, I am seeing svn commits after January on stock CPU multibeam, though nothing immediately stands out as affecting best/reportable policy. Will do some trawling of the codebases.

[Edit:] Superficially it looks as though certain incompatible changes to Gaussian best reporting were integrated circa 2011, by Eric from AK code, then 'bugfixed' a revision later. So more digging is needed, but there could be as many as 3 different best-Gaussian reporting variants floating about. Will check the baseline and special CUDA variants against the 'bugfixed' variant logic. Ideally we'd want to match stock CPU logic, so if any number of applications require updates, that will probably have to happen. The actual intent of the change will probably also need to be looked at, as it may have holes in it. At the moment it appears the reportable Gaussian should not be used to update best if it isn't very gaussian-ey, though the intent is unclear to me on first pass (a possible red flag).
4) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1874681)
Posted 2 days ago by Profile jason_gee
Post:
Does x41p_zi3v contain the bugfix for the <wrong Pulse selecting as reportable> issue?


I could be corrected, but I believe it contains the shunt/workaround I recommended to serialise the race condition, though Petri implemented it and I haven't had a chance to examine it. Word is that it worked, though, so the validation characteristics should be more or less identical to the older Cuda variants.
5) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1874436)
Posted 3 days ago by Profile jason_gee
Post:
Ugh, well that's a starker demonstration than I intended, while engineering in the truth. We've lost people recently.
6) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1874382)
Posted 3 days ago by Profile jason_gee
Post:
It could be as Jason theorized here; https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1874104#1874104
Those seem like pretty low peaks to start with for best Gaussian. 1 in 300, with contention deep in the noise floor, 'feels' as though we're pushing technology limits (once again), but it will warrant more definite understanding either way. Plotting the PoT data from the results and visually comparing whether they look anything alike might say something. My suspicion is they won't look very 'Gaussiany' at all. If so, pushing further into the noise floor, while possible, may be fruitless. Eric's ruled out that we need double-precision or bit-identical results below reportable thresholds (in the case of Gaussians, iirc the score is derived from the chi-square fit and null hypothesis).

Or it could be something else. I'm currently running the same WUs with the older OpenCL App MBv8r3567, which doesn't use the nVidia SoG path, and the newer MBv8r3602 from Lunatics. So far r3602 is batting .333 while r3567 is batting 1.000. r3567 seems a little slower, but it is producing the correct Best Gaussians.
Interesting.
oops, r3602 just failed another one...

In any event, seeing as how it takes Hundreds of tasks to find one bad Best Gaussian it's well within the Project's Goal of less than 5% Inconclusive. 5% would allow 5 Inconclusives per 100 tasks.


Some history, just to clarify the origins of that 5% target: back in ~v5 days, inconclusives were upwards of 20%; around v6 they were ~10% as GPU apps came in; now <5% with v7.

That was due to a combination of stock CPU apps (then 32-bit only) using the x87 FPU exclusively, which is 80-bit internally, while other KWSN and Alex Kan (Mac) builds used SIMD (MMX, SSE through SSSE3). With v6, Joe Segur injected KWSN and AK code via Lunatics into stock CPU. For v7 I performed several numerical analyses of the algorithms in Matlab, mostly while attempting to devise a GPU form of the v7 autocorrelation (which didn't previously exist).

The Cuda numbers actually came out more accurate, due to the way certain sums were calculated, but they differed enough that something needed to be done to make stock 64-bit and cross-platform builds (e.g. the Android science app, which didn't exist yet either) more viable, in terms of cross-platform match and less error growth as the workunit analysis parameters were widened and GBT later added. So with Eric's permission I changed some stock sums to block sums (similar to the Cuda blocked and AKv8 SSEx striped summations), pulling the results about 3-6 decimal places closer together.

Bearing in mind that the platforms/devices all have different compilers, use different algorithms, and are subject to the vagaries inherent in floating-point computation, I chose 5% as the target because that's where Eric tends to set thresholds for the analysis, such that ~5% of results return overflow. That's where the analysis + (telescope) recording noise floor would be, so at any better than 5% cross-platform match we're pretty much digging into technological limitations outside our control (with the existing Fourier method, anyway).

At some point, I forget when, the improved cross platform matches and application reliability across the board allowed the workunit initial replication to be reduced from 3 to 2, reducing server load by a third.

On the basis of all that, if various other classes of builds/devices see much worse than 5% inconclusives, then they're not up to par, while at the same time Eric's assertion that additional precision shouldn't be needed suggests <5% is 'good enough' statistically. (Which I'm fine with, because much more tightening toward bit-exact cross-platform match would be extremely expensive (in development time, money, and computation) and lead to de-optimisation.)

I'm mostly writing this out because at some point I'll need to add explanations to the stock documentation: the temptation to optimise out the stock summing refinements may be a trap for future developers. [Saved for when I revisit the stock codebase; at some point Astropulse should undergo a similar analysis, though it's beyond my resources at present.]
7) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1874209)
Posted 4 days ago by Profile jason_gee
Post:
Oh yeah, no illusions there. Have prepared for probable extended downtime by stocking up on the mobile data. Switching over to a completely new infrastructure will have teething problems.
8) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1874189)
Posted 4 days ago by Profile jason_gee
Post:
OK. Will be some (yet more, *sigh*) juggling here, as better broadband arrives tomorrow, a month sooner than expected. Teething problems are likely, though will factor in setting aside some hosting space for various and sundry as the dust settles. Hopefully things get less difficult as time goes on.
9) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1874138)
Posted 5 days ago by Profile jason_gee
Post:
Oh well. It appears that since moving to the new server I'm not allowed to upload. All I get is:
403
Forbidden
Access to this resource on the server is denied!
So....you'll have to wait.


Contact Arkayn, as he did message about the ftp server shifting a week or so ago. If there are still problems by the weekend, let me know and I'll put it on jgopt.org (if you email the binaries). Once proper broadband is installed here in a month or so, I'll be rehosting my own domain in my living room, so there will be juggling no matter which of those works, but if you're stuck you could share it on Google Drive or similar in .7z form.
10) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1874114)
Posted 5 days ago by Profile jason_gee
Post:
Just a comment: I think this may deserve its own thread, since it appears this is leaning towards an SoG issue more than a CUDA one.

Say what? It's about proving the CUDA App should be accepted to Beta, as it agrees with the CPU Apps better than the current 'standard'.
I just checked... it is my name on the thread.


It's going to be pretty important to discuss these things here, because the special is the new kid on the block, with the most changes in quite a while. If it turns out the SoG app needs some attention, then that's a good thing, because it solidifies knowledge all around. My unconfirmed suspicion is that the SoG app may still be using an OpenCL derivation of the single-precision chirp I made for pre-Fermi CUDA. That was tailored for unique pre-Fermi characteristics (pre-Fermi Cuda devices don't have IEEE-754 floating-point compliance, and pre-GTX2xx ones have no double precision at all), so it won't necessarily compile to the most accurate GPU code on Fermi or later devices. It was made specifically for G80-type devices; I switched to a double-precision chirp for newer devices many moons ago. So under the hood there is valuable history to take into account, which should probably be properly documented one day.
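For a feel of the precision issue (illustrative code with invented names; the real chirp computation is more involved): the phase advance goes like chirp_rate × t², and only the fractional cycle of that huge number matters, which is exactly where single precision runs out of digits.

```cpp
// Hypothetical sketch only. The number of chirp cycles grows as
// chirp_rate * t^2; at high chirp rates and late times the integer part
// dwarfs the fractional cycle that actually rotates the sample, so a
// single-precision value has too few significand bits left for it.
double chirp_cycles_d(double chirp_rate, double t) { return chirp_rate * t * t; }
float  chirp_cycles_f(float  chirp_rate, float  t) { return chirp_rate * t * t; }
```

Evaluating both at a large t shows the float result no longer carries the same fractional information as the double one, which is consistent with why the double-precision chirp path exists on newer devices.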

As for 'allowing on Beta', I'd concur the special needs to be run extensively under anonymous platform, so will be aiming for some trial builds ASAP. As previously mentioned, stock distribution is problematic because of Boinc server limitations, which more or less demand a quite generally compatible app. There may be a driver version cutoff where the pre-Fermi or Fermi classes drop support, though generalising to even Kepler class onwards will still require significant work.
11) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1874104)
Posted 5 days ago by Profile jason_gee
Post:
Those seem like pretty low peaks to start with for best Gaussian. 1 in 300, with contention deep in the noise floor, 'feels' as though we're pushing technology limits (once again), but it will warrant more definite understanding either way. Plotting the PoT data from the results and visually comparing whether they look anything alike might say something. My suspicion is they won't look very 'Gaussiany' at all. If so, pushing further into the noise floor, while possible, may be fruitless. Eric's ruled out that we need double-precision or bit-identical results below reportable thresholds (in the case of Gaussians, iirc the score is derived from the chi-square fit and null hypothesis).

[Edit:] That's also at a very high chirp rate near chirp limits, so one or another application struggling with that, especially 64-bit builds, wouldn't be surprising or unacceptable. Cumulative error will be at its greatest there.
12) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1874029)
Posted 6 days ago by Profile jason_gee
Post:
Saved for analysis when I can. Will probably see what win32 CPU stock and ye-olde Cuda turn up. There are some differences between the way stock CPU and 3rd-party CPU apps sum, the former having been refined for greater portability and less cumulative error. I wouldn't have thought that alone could explain that great a difference in numbers, though, so other explanations could lie in the compilation used, etc. The lack of 80-bit FPU availability in the later apps can be problematic there if not handled carefully. Those are just some possibilities, assuming no lingering bugs in either case, which could easily have been buried in the noise of other prior issues among the different applications.

Overall I'd suspect the SSE3xj Win32 gaussian values are 'more correct', as, since the powerspectrum values are positive, any increase in cumulative error tends to push peaks higher rather than lower. [Edit: hmmm, a different time value too]
13) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1873783)
Posted 7 days ago by Profile jason_gee
Post:
It's possible. We always need to reference Win32 stock CPU, just because Eric made it. It's not that 1-5% is problematic, just that every goal needs a reference.
14) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1873761)
Posted 7 days ago by Profile jason_gee
Post:
Yep, looking at a few tasks it looks as though the idea worked. +1 for 3am lightbulb moments. I guess how costly it is will come out over time. Will see if I can spot a Gaussian issue. Usually those issues come from minor precision issues in the normalisation sums (easily checked and remedied). @Petri33 please email latest and I'll update the alpha folder ASAP.


. . Does any of this portend a possible Windows version anytime in the foreseeable future ??

Stephen

??


Sure. I've built Windows ones on multiple occasions, which is how I stumbled on the unroll validation problem and a possible fix (lots of help from TBar and Jeff Buck in spotting the patterns that solved that). I only sat on them because the scope of damage from saturating results across 1000 Windows hosts would potentially compromise the project too far.

Probably there will still be niggles to address, and the high-end device requirements will need tailoring to solve, though advanced-user 'use at your own risk' Windows builds become viable once again.
15) Message boards : Number crunching : GPU FLOPS: Theory vs Reality (Message 1873759)
Posted 7 days ago by Profile jason_gee
Post:
Even if it doesn't make it as a stock application (does it run on pre Maxwell or pre Kepler hardware, and minimum VRAM requirements?), it would be good if it were available for general use under Anonymous Platform for all OSs. But it does need to keep the Inconclusives below 5% to be able to make it available for general use.
If the current version is good for less than 5% Inconclusives, it would be nice to see a Windows version made available for some testing to see if under the many versions of Windows and the many versions of video drivers it's able to keep the Inconclusives below that 5% threshold.


Those are the current rubs for stock, mainly Boinc server limitations on the distribution side. That's where I step in when I can. I'm confident most of the refinements can be propagated back through the generations, with varying levels of benefit. With the majority of validation concerns apparently addressed, that helps a lot. In the meantime it'll be suitable for 'advanced-user' anonymous-platform distribution, until appropriate dispatch code can be embedded to support all Cuda devices at some level. Once it does, though, options open up for, in no particular order: stock distribution (via Beta test first), retooling for Cuda 9 inclusion, and then incorporating some more modern feature-recognition methods.
16) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1873754)
Posted 7 days ago by Profile jason_gee
Post:
Yep, looking at a few tasks it looks as though the idea worked. +1 for 3am lightbulb moments. I guess how costly it is will come out over time. Will see if I can spot a Gaussian issue. Usually those issues come from minor precision issues in the normalisation sums (easily checked and remedied). @Petri33 please email latest and I'll update the alpha folder ASAP.

Earlier than planned, due to Petri pushing the technology so far, it opens multiple doors for 'next level' pre-scans: namely wavelet and/or convolutional-neural-network-based feature recognition, followed up with sparser traditional Fourier processing so as to get the same numbers.

[Sidenote:]
At present I'm still wrestling with home and work issues myself, though we get semi-decent internet here in a month or so. Sick of working for the benefit of other bosses, I'll probably end up streaming during development at some point, and work in the open-source stuff somehow. Priority would probably be on survival and having a laugh; we'll see.
17) Message boards : Number crunching : Theory and Practice of "dynamic optimization" for both cpu and gpu processing (Message 1872372)
Posted 14 days ago by Profile jason_gee
Post:
There are 3 'kinds' of optimisation currently used, to varying degrees and combinations, in the seti@home CPU and GPU applications:

1 - build-time, which are hardwired codepaths relevant to the build's target platform/device
2 - install-time or first-run, which amounts to selecting specific builds manually, or JIT compilation in the case of OpenCL/CUDA
3 - run-time (dynamic), which resembles a short initial benchmark, with or without FFTW-like 'wisdom'.

Few current builds are purely one or another kind, because of the use of libraries and frameworks provided by other vendors. Examples include:
1) Most builds have at least some build-time optimisation, depending on platform and on whether they're meant as stock/generically applicable builds or for 3rd-party anonymous-platform installation (which requires more power-user knowledge). There are different stock CPU builds per platform, but they are pretty generic within that. 'Too specific' optimisations of this type are problematic for stock distribution, due to Boinc infrastructure limitations.
2) Even stock CPU builds store some FFTW wisdom; GPU builds tend to cache JIT-compiled kernels as binaries on the system, and these usually change if hardware changes. [Note: this includes user-imparted 'wisdom', such as command-line settings.]
3) Stock CPU has a lot of this run-time determination, in addition to the FFTW wisdom, as displayed in stderr.txt (it's possible to add a -verbose switch to display the many codepaths included); 3rd-party and GPU builds, not so much. [Auto-tune storing partial info overlaps with #2.]

Being this 'organic' an ecosystem, you'll probably see more mature builds head toward #2 and #3. As the device+system landscape grows ever more complex, the optimisation process becomes more of an AI-like process than a deterministic one, and the demands on our systems (other than dedicated crunchers) change from moment to moment. So eventually a #4 is likely to materialise, such that the applications will be able to work around whatever you're doing on the system. At present, process priorities appear 'not great' for this, and unable to cope if a device leaves or is added during a run. In a sense the Boinc client/manager should be doing this heuristic process management, and the landscape is changing. As the Boinc infrastructure is under-funded and under-staffed, alternative 3rd-party solutions will probably need to be devised for this level of management at some point.
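As a sketch of the 'kind 3' mechanism (invented code, not the stock app's actual dispatcher): benchmark the compiled-in candidate codepaths once on a representative buffer at startup, then use the winner.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Illustrative run-time dispatch: two interchangeable codepaths for the same
// operation, timed once at startup, the way the stock CPU app selects among
// its many compiled-in codepaths. Names and structure are invented.
using SumFn = float (*)(const float*, std::size_t);

float sum_forward(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

float sum_unrolled(const float* x, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) { s0 += x[i]; s1 += x[i + 1]; }
    if (i < n) s0 += x[i];
    return s0 + s1;
}

// Time each candidate once on a benchmark buffer and return the quickest.
SumFn pick_fastest(const std::vector<SumFn>& candidates,
                   const float* bench, std::size_t n) {
    SumFn best = candidates.front();
    auto best_t = std::chrono::steady_clock::duration::max();
    for (SumFn f : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        volatile float sink = f(bench, n);  // volatile: keep the call from being elided
        (void)sink;
        auto dt = std::chrono::steady_clock::now() - t0;
        if (dt < best_t) { best_t = dt; best = f; }
    }
    return best;
}
```

A real selector would also persist the choice (the wisdom-style overlap with #2) rather than re-benchmark every run.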
18) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1871952)
Posted 16 days ago by Profile jason_gee
Post:
Just to break the cone of silence a little, I've been asked to participate in Cuda 9 testing (as I did with prior versions). So I'll look distant for a bit longer, since carefully running through the mill caught some problems before.
19) Message boards : Number crunching : CES 2017 -- AMD RYZEN CPU (Message 1869701)
Posted 29 days ago by Profile jason_gee
Post:
Following up with some reading of Agner Fog's latest optimisation guides, it looks as though IPC is higher than on any Intel processor, so the memory latency and frequency issues will potentially dominate for some time. Having watched portions of an AMD livestream while looking for info on the IOMMU updates due in the AGESA 1006 update to fix groupings, I noted they mentioned they've covered 'standard' JEDEC compatibility and are moving on to custom XMP2-style support, with Samsung B-die memory having been the easy one.

Most likely FFTW tweaks will end up being incorporated at some point; then, as things settle, MT apps will need to be produced to make better use of these, in addition to figuring out some additional optimisations. Apparently the AVX2 implementation, despite being effectively half-clocked, is faster than separate faster-clocked SSEx, because it preserves entries/fetches/decodes in the instruction pipeline.
20) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1868922)
Posted 23 May 2017 by Profile jason_gee
Post:
...
So, at least for a GTX 960, it would appear that Blocking Sync might not be a good choice for a card tied to a PCIe slot that's less than x8 electrical. Why that would be the case is not something I have the expertise to explain.
...


The present blocking-sync implementation is relatively simple/naive, and there are likely too many syncs. Once the pulsefinding (and any other serious) wrinkles are ironed out, we can look at scaling the synchronisation on a timed basis, with something akin to a frames-per-second target (perhaps launches per second, scaling the launches). Addressing that before the other issues would be putting the cart before the horse, though.
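The timed scaling could look something like this (invented sketch, not the app's sync code): nudge the number of launches batched between blocking syncs until the measured sync rate sits near a target, analogous to holding a frames-per-second target.

```cpp
// Hypothetical controller, names invented. Grows or shrinks the number of
// kernel launches batched between blocking syncs, steering the measured
// syncs-per-second toward a target with a ~10% dead band.
int scale_launch_batch(int launches_per_sync,
                       double measured_syncs_per_sec,
                       double target_syncs_per_sec) {
    if (measured_syncs_per_sec > target_syncs_per_sec * 1.1)
        return launches_per_sync + 1;  // syncing too often: batch more work per sync
    if (measured_syncs_per_sec < target_syncs_per_sec * 0.9 && launches_per_sync > 1)
        return launches_per_sync - 1;  // syncing too rarely: batch less
    return launches_per_sync;          // close enough to target: hold steady
}
```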


 