Posts by jason_gee


1) Message boards : Number crunching : Phantom Triplets (Message 1589048)
Posted 6 days ago by Profile jason_gee
...leaves me wondering what I might actually be facing here...


Basically the same as the white-dot graphical artefacts you'd get when overclocking a GPU for gaming, where backing off at least two 'notches' is the usual corrective method. A white dot in a graphical glitch implies ~24-32 bits of saturation (bits flipped on), so many more than a single bit flip, though typically tied to a single memory fetch.
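
As a tiny illustration of why that many stuck-on bits swamps a detection threshold: reinterpret a 32-bit word with most of its bits set as an IEEE-754 float and you get an astronomically large 'power'. The bit patterns and threshold below are made-up examples for illustration, not values from the app:

/* Illustration only: how a corrupted 32-bit memory fetch with most bits
   stuck on reads back as a huge IEEE-754 float "power" value.
   Bit patterns and threshold are arbitrary examples, not app values. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    uint32_t good_bits = 0x3DCCCCCD;   /* ~0.1f, a plausible bin power        */
    uint32_t bad_bits  = 0x7F7FFFFF;   /* ~30 bits on -> ~3.4e38 (FLT_MAX)    */
    float good, bad, threshold = 1.0f; /* example detection threshold         */

    memcpy(&good, &good_bits, sizeof good);
    memcpy(&bad,  &bad_bits,  sizeof bad);

    printf("normal fetch:   %g (above threshold: %s)\n",
           good, good > threshold ? "yes" : "no");
    printf("glitched fetch: %g (above threshold: %s)\n",
           bad,  bad  > threshold ? "yes" : "no");
    return 0;
}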

As with overclocking, it sometimes happens with 'gaming grade' GPUs straight from the factory, due to price/performance market pressures and the non-critical nature of a rare white dot when parts are binned and clocks set for mid-range GPUs, where the competition is steepest.
2) Message boards : Number crunching : Phantom Triplets (Message 1588603)
Posted 8 days ago by Profile jason_gee
OK, a quick look through and those all manifested from the same chirp-fft pair
<fft_len>8</fft_len>
<chirp_rate>40.543618462507</chirp_rate>

and have the same minuscule mean power and calculated detection time. That makes them all part of the same 'event'.

Based on that glitch 'looking' much like a numerical version of a graphical artefact, I would suspect a few things. Through age or manufacturing variation, that silicon may well be operating near its limits in terms of frequency and/or heat. Presuming you've checked the latter, with no evidence of overheating etc., I would recommend either a ~0.05V core voltage bump or a ~20MHz core frequency drop. Either would, IMO, compensate for the combination of the GPU's age and mid-range performance/price cards being pushed to their frequency limits (with some number of acceptable artefacts) from the factory. Memory clock may also be an issue, though I'd expect memory glitches to look more random than core-loading ones; in this case there appears to be one quite specific borderline circuit.
3) Message boards : Number crunching : AP v7 Credit Impact (Message 1586178)
Posted 12 days ago by Profile jason_gee
... Not least of those is weighing the complexity of maintaining/patching the existing hodgepodge against wholesale reengineering.

As painful as constructing something from scratch is, it is often the best option to take, as trying to patch something that is badly broken often only leads to further problems.
Cosmetic problems are easily repaired. Some structural problems can be repaired, with considerable effort. However, when the core structure is significantly damaged, demolition & reconstruction is the best option.


Yep, that pretty much sums up the point we're at in many ways. There's a hiatus while I refamiliarise myself with some modern tools (such as Matlab and estimate localisation techniques), and I know Richard's been busily gathering data and symptoms for some time. That will yield an idealised model for direct comparison against the existing core, giving us at least some scope for how much needs replacement, along with pros & cons. Frustratingly, what isn't feasible at the moment is any kind of timeline, though there have been benefits to the extended observation & research time nonetheless.
4) Message boards : Number crunching : AP v7 Credit Impact (Message 1586152)
Posted 13 days ago by Profile jason_gee
The credit system appears to be structurally flawed and needs a fresh new look at.

I was under the impression that Jason and others were doing exactly that.


A group of us were recently, tongue firmly in cheek, referred to as the 'CreditNew Vigilantes', and it's that group that's variously assessing the code, the system behaviour, and refreshing on control systems engineering practices. My own contributions cover the last part of the system, headed toward full-scale modelling of the control systems, as should have been done before (but clearly wasn't, you find as you dig deeper).

As things stand, the overall approaches and extensive study of the existing mechanism have been discussed and poked at in various ways, and we're getting closer to wholesale patching and systemic recommendations, though there are still quite a few finer points to be looked at in more detail as a refined model is constructed. Not least of those is weighing the complexity of maintaining/patching the existing hodgepodge against wholesale reengineering.
5) Message boards : Number crunching : AP v7 Credit Impact (Message 1586053)
Posted 13 days ago by Profile jason_gee
The simple fact is, it is taking Much longer on the same Hardware to gain the same credits as with MBv6. That is Not the case with APv7. The same task is slightly faster than with APv6.


Correct. APv7 has less processing than v6 (simpler blanking).

[Edit:] Blanking was always unpaid overhead though. Going from rough memory, AP credit (v6 or v7) should really be ~1000 credits, but I predict it will probably settle in the region of 300-400 credits, with substantial variance.
6) Message boards : Number crunching : AP v7 Credit Impact (Message 1586052)
Posted 13 days ago by Profile jason_gee
up a large number of VM's and running the base stock app.

I went over this with Jason a while back. All you would have to do is work outside of CreditFew with an EXTERNAL Correction.


Basically yes. Bernd (from Albert) has mentioned being in favour of a full gut and do-over, while Oliver has recommended (and I agree) that manageable subproblems be tackled. Drawing dividing lines and planning to deal with the spaghettification factor is where things are at, and have ground to a crawl, so larger, highly modular mechanism transplants seem likely at this point.
7) Message boards : Number crunching : AP v7 Credit Impact (Message 1586050)
Posted 13 days ago by Profile jason_gee
Eric has explained before that anything he would do to try to equalise the credit to where it was before would get reverted by CreditNew.

In theory it seems like it would be possible for someone to externally change the base credit, by setting up a large number of VMs and running the base stock app.


Yeah, a lot of back and forth has been happening. The consensus from working with Albert on the issue, with occasional input from Eric, has been that there are a number of key issues, not least being improper use of averages and quite a lot of spaghetti code that needs a transplant (validator and scheduler).
8) Message boards : Number crunching : AP v7 Credit Impact (Message 1586047)
Posted 13 days ago by Profile jason_gee
The main reason MBv7 suffered such a RAC drop was due to the GPU App being much Slower than MBv6.


Actually the main reason MBv7 dropped relative to v6 is that the baseline stock CPU application gained AVX (SIMD) functionality, but the peak-flops figures used to scale the efficiencies come from Boinc Whetstone, which is a serial FPU measure.

For the ratios, compare Boinc Whetstone to SiSoft Sandra Lite's single-threaded FPU Whetstone, then the SSE/SSE2 and AVX single-threaded benches. Then you have two discrete credit drops, one during MB5-6 and the other with MB7.
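
To make the mismatch concrete, here's a minimal sketch of the arithmetic, using illustrative stand-in numbers rather than measured benchmarks (the real CreditNew scaling is more involved than this single ratio):

/* Rough sketch of the Whetstone/SIMD mismatch described above.
   All numbers are illustrative stand-ins, not measured benchmarks. */
#include <stdio.h>

int main(void)
{
    /* BOINC Whetstone is a serial FPU benchmark, so it is blind to SIMD. */
    double boinc_whetstone_gflops = 3.0;   /* reported "peak" per core      */
    double avx_app_gflops         = 9.9;   /* what an AVX stock app might
                                              actually sustain (example)    */

    /* If the Whetstone figure is treated as peak, the normalising CPU app
       looks roughly this many times "too efficient", and hosts normalised
       against it get scaled down accordingly. */
    double apparent_factor = avx_app_gflops / boinc_whetstone_gflops;

    printf("apparent efficiency factor: %.1fx\n", apparent_factor);
    return 0;
}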

You're going to have to be more convincing than that. I have here example #1, an old NV8800 Veteran of the MB wars, http://setiathome.berkeley.edu/results.php?hostid=6813106
That Card ran a MBv6 Shorty in just over 4 minutes and netted in the low 30s. Now it takes over 16 minutes to complete the same Shorty. Pretty strong evidence if you ask me...


A 'shorty' is no longer the same amount of work, since it now contains autocorrelations, so that's not comparing apples with apples. Shorties, as with other angle ranges of course, should have received more credit rather than less, due to the added work.

'Correct' Credit award for a MB7 VHAR, according to the Cobblestone scale, is ~90+ Credits.


There are certainly inefficiencies in the autocorrelation implementation; however, the dominant artefacts are from the application used for credit normalisation, the stock CPU app, which underclaims by a factor of ~3.3x effective. Fixing that gives you back half your effective MB7 drop, plus some bonus for the drop that preceded the GPU implementations.

Future optimisations may provide some improvement also, though the options are dwindling for that particular generation of GPU as well, which struggles with the required large Fourier transforms.
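
For reference, the Cobblestone scale mentioned above is defined as 200 credits per day of computing at 1 GFLOPS. A minimal sketch of that arithmetic follows; it simply inverts the ~90-credit figure quoted above for illustration, and doesn't assert the actual flop estimate of a VHAR task:

/* Sketch of the Cobblestone scale referred to above: 200 credits per day
   of work at 1 GFLOPS, i.e. credit = flops * 200 / (86400 * 1e9).
   Inverting the ~90-credit figure quoted above is just arithmetic on that
   quote, not a measured flop count for a VHAR task. */
#include <stdio.h>

int main(void)
{
    const double cobble_per_flop = 200.0 / (86400.0 * 1e9);

    double quoted_credit = 90.0;                      /* figure from the post */
    double implied_flops = quoted_credit / cobble_per_flop;

    printf("implied work for ~90 credits: %.2e flops\n", implied_flops);
    printf("credit for 1e13 flops: %.1f\n", 1e13 * cobble_per_flop);
    return 0;
}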
9) Message boards : Number crunching : AP v7 Credit Impact (Message 1586042)
Posted 13 days ago by Profile jason_gee
The main reason MBv7 suffered such a RAC drop was due to the GPU App being much Slower than MBv6.


Actually the main reason MBv7 dropped relative to v6 is that the baseline stock CPU application gained AVX (SIMD) functionality, but the peak-flops figures used to scale the efficiencies come from Boinc Whetstone, which is a serial FPU measure.

For the ratios, compare Boinc Whetstone to SiSoft Sandra Lite's single-threaded FPU Whetstone, then the SSE/SSE2 and AVX single-threaded benches. Then you have two discrete credit drops, one during MB5-6 and the other with MB7.
10) Message boards : Number crunching : Phantom Triplets (Message 1584231)
Posted 16 days ago by Profile jason_gee
I ran some GPU memory test programs today, first one called OCCT and then a couple from the Folding@Home Utilities page, MemtestG80 and MemtestCL. I wasn't terribly impressed by the OCCT program, but the other two seem to be reasonable facsimiles of Memtest86, adapted for the GPU. MemtestG80 is just for NVIDIA CUDA-enabled GPUs, while MemtestCL can run on both NVIDIA and ATI OpenCL cards.

None of them detected any errors on the GTX 550 Ti, even after many iterations and several hours. However, the programs seem to have some limitations in regard to the maximum amount of memory they can test. The max I could get MemtestG80 to look at was 680MB out of the 1024MB on the GPU, even though GPU-Z only reported 81MB being in use prior to running the test. MemtestCL, however, was able to test 924MB under the same conditions. The advantage of MemtestG80 is that it runs about 8-10 times faster than MemtestCL (which took about 2.5 hours to test the 924MB for 50 iterations of its 13 different test schemes).

Of course, the absence of errors doesn't really prove that there isn't a weak bit lurking somewhere in there but, for now, I think I'll just let it ride. I've got BoincLogX running to capture Result files, so if the phantom Triplets show up again, perhaps there will be some more evidence available to help pin it down.

Finally had another one of these show up this morning, about 17 days after the last one. It's WU #1608838137, which my host completed and reported at about 11:51 PM local time on October 5. My machine thinks it found 24 Triplets, whereas both wingmen found none.

Unfortunately, when I checked the result file that BoincLogX captured, I found that it only contained the workunit_header information and none of the actual result data. Checking other result files captured around the same time, I found that some did contain result data and others didn't. I have to assume the BoincLogX monitoring interval is to blame, since I hadn't thought to change it from the 15 second default. Sigh... So now I've changed the monitoring interval to 5 seconds and will just have to wait and see if that does the trick the next time the phantom triplets show up (since I doubt if the problem will just go away on its own, even though it's pretty rare).


The result file truncation 'smells' a bit like the boincapi thread safety issues might be involved (similar to the truncated/missing stderr cases). Whether that's just a symptom, or somehow acting as a cause, is a totally different and not easily answered question, I guess.

Though the time between events is long, since you are running under anonymous platform, could you switch to one of my commode.obj-enabled builds? (They're intended to test one kind of workaround for the boinc thread safety problems.) If the same extra-triplet issues still appear (a different cause), then at least the stderr and result content should be complete with this build, improving the likelihood of a firmer diagnosis (such as a driver synchronisation issue lurking).

If they stop appearing with this build, I would investigate potential host DPC (deferred procedure call, i.e. driver/software-interrupt) latency issues, which can stem from obscure driver and software quality problems; chipset/RAID/SATA drivers are one possible suspect of many.

Commode.obj enabled builds are provided for diagnostic purposes at:
http://jgopt.org/download.html
11) Message boards : Number crunching : @Pre-FERMI nVidia GPU users: Important warning (Message 1582498)
Posted 19 days ago by Profile jason_gee
Agreed, I will block all such devices from the next release (can perhaps be done via the OpenCL runtime itself). But because BOINC now assigns the execution device, just not enumerating them internally seems not enough; boinc_temporary_exit with a user notification about the reason for the exit would perhaps be safer.
Or the logic could be more complex: don't enumerate 1.x devices when a 2.0 device is present (there is something to run on, though overcommitted), and exit cleanly when only 1.x devices are available (nothing good to run on at all).


Yep, I was editing to add that, because of user switching, there is surrounding logic with temporary exits. It's been working as desired here, so I avoided fiddling with it more than I had to in order to add the device skip to its complete enumeration (didn't want to break what appeared to be working).
12) Message boards : Number crunching : @Pre-FERMI nVidia GPU users: Important warning (Message 1582490)
Posted 19 days ago by Profile jason_gee
Not sure this will help in this situation, but the way I've handled pre-Fermis being unsupported in Cuda 6.5 altogether is to skip devices with compute capability < 2.0 (Fermi) during device enumeration; the project can then restrict the device minimum at leisure. Since you use a dedicated plan class for NVOpenCL, you can link in a Cuda driver API call and compare the compute capability and driver version. Non-ideal, as is the only slightly less complicated Cuda situation with such devices, but probably better than relying on users not to update to the transitional broken driver, or waiting to figure out a more ideal solution once the full picture is clearer.

#if CUDART_VERSION >= 6050
    // Check the supported major revision to ensure it's valid and not some pre-Fermi
    if (cDevProp[i].major < 2)
    {
        fprintf(stderr, "setiathome_CUDA: device %d is Pre-Fermi CUDA 2.x compute compatibility, only has %d.%d\n",
                i+1, cDevProp[i].major, cDevProp[i].minor);
        continue; // Skips initialising this device....
    }
#else
...


Having no usable device at all would fall through multiple temporary-exit retries (further on in the surrounding logic), and eventually hard-error when Boinc decides enough is enough (fingers crossed, anyway). Cuda initialisation is done that convoluted way because devices may disappear and come back depending on user switching, so the interaction with temporary exits is complex.
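
For anyone curious, here's a much-simplified sketch of how that enumeration-plus-temporary-exit interaction can hang together. cudaGetDeviceCount/cudaGetDeviceProperties and boinc_temporary_exit are real CUDA runtime / BOINC API calls (the exact boinc_temporary_exit signature varies a little between Boinc versions), but the control flow below is an illustrative assumption, not the actual multibeam app code:

/* Much-simplified sketch of the enumeration / temporary-exit interaction
   described above. The API calls are real; the control flow is
   illustrative, not the actual multibeam app logic. */
#include <cstdio>
#include <cuda_runtime.h>
#include "boinc_api.h"

static int pick_usable_device()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        if (prop.major < 2) {
            fprintf(stderr, "device %d is pre-Fermi (%d.%d), skipping\n",
                    i + 1, prop.major, prop.minor);
            continue;   // skip initialising this device
        }
        return i;       // first usable (Fermi or later) device
    }
    return -1;          // none usable right now
}

int main(int argc, char** argv)
{
    boinc_init();
    int dev = pick_usable_device();
    if (dev < 0) {
        // Devices can disappear/reappear with user switching, so back off
        // and retry; Boinc eventually errors the task if this never clears.
        boinc_temporary_exit(300, "no usable CUDA device (compute >= 2.0)");
    }
    cudaSetDevice(dev);
    /* ... actual processing would go here ... */
    boinc_finish(0);
}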
13) Message boards : Number crunching : First invalid (Message 1581243)
Posted 23 days ago by Profile jason_gee
Any ideas? As to checking my hard drive, all the others after that one seem OK. Maybe it was me checking to see if I got any viruses etc. with Spybot, or upgrading my Windows.


After Joe's comment, I would suggest that for this specific example you can't do much, and your host is unlikely to be at fault through accident or misadventure on your part. As always, keep an eye out for when they release new apps/installers etc. There are obscure problems in the Boinc code that sometimes crop up in weird ways like this; sometimes workarounds are found. In other cases the server logic is not ideal, so strange things can happen.
14) Message boards : Number crunching : First invalid (Message 1581239)
Posted 23 days ago by Profile jason_gee
Oh yeah, the old boincapi and the missing stderr bits. Amazing that keeps popping up.
15) Message boards : Number crunching : First invalid (Message 1581036)
Posted 23 days ago by Profile jason_gee
Not exactly, so here are your spikes and the error



Spike count: 28
Autocorr count: 0
Pulse count: 0
Triplet count: 0
Gaussian count: 2

Worker preemptively acknowledging an overflow exit.->
called boinc_finish
Exit Status: 0
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
Cuda threadsafe ExitProcess() initiated, rval 0



There's no error apparent in either the Cuda (2nd) or stock CPU (3rd, decider) results. The stock CPU result agreed it was a genuine overflow result (lots of signals).
http://setiathome.berkeley.edu/workunit.php?wuid=1603747341

No idea why that opt CPU run of MadMac's would appear to have processed full length (?) and found no spikes at all. Hard drive needs some checking, perhaps?
16) Message boards : Number crunching : @Pre-FERMI nVidia GPU users: Important warning (Message 1579043)
Posted 27 days ago by Profile jason_gee
Doctor's prescription, take at least 2 beers and chill out before you give yourself a heart attack or stroke. ;-)

Cheers.


I can only agree to this prescription; possibly an Ativan would do the trick, but I'm not a physician. ;^)



FWIW, numbers sometimes help too. The current top host's Astropulse v6 inconclusive-to-pending ratio (a holistic indicator of host, app and project health) is currently ~4.9%, which is about twice as good as, or better than, it used to be (well over ~10%). I'd have to guess that this apparently low impact might be partly because a lower proportion of pre-Fermis tend to run AP anyway, among other factors like app improvements and better floating point support in the newer remaining cards. Not forgetting that pre-Fermi throughput is a lot lower to start with.
17) Message boards : Number crunching : the latest on release of AP_v7? (Message 1575994)
Posted 22 Sep 2014 by Profile jason_gee
I (or we) don't know which app Eric will choose for reference point.

For 'Mac OS X' (32bit) just the 'non CPU instruction set' app is available.

For other OSs are also 'CPU instruction set' apps available.

Maybe the mix of all available SETI hosts (whichever app/s will be used) will speed up the AP CPU reference point (I guess there are not many hosts around which will crunch only with the 'non CPU instruction set' app, i.e. too-old CPUs) ... -> less Cr./AP task.

Am I correct, or wrong?

Some time ago Eric proposed to use the SLOWEST app as the reference point. If he goes ahead and implements this, for the AP subproject at least, I think everyone will be happy.

I might be wrong, but I don't think that Eric proposed that.

I think he simply observed that the current BOINC server code operated that way, and that whatever knob Eric twiddled in an attempt to induce credit-rate parity between AP and MB, nothing changed.

Having said that, I don't think that the importance - or otherwise - of raw <rsc_fpops_est> on credit awards has been fully explored experimentally. IF the slowest app for AP v7 were the same speed as the base app here for v6 (not currently the case at Beta, so a big IF), then I'd be interested to see the outcome of halving <rsc_fpops_est> for AP here - but that may be too big and risky an experiment to carry out on the live project.


The 'observation' in question did happen, but had nothing to do with 'fastest' or 'slowest'. It was Eric's observation that the normalising baseline, i.e. the app that receives exactly COBBLESTONE_SCALE credit, should be the 'least efficient' [which does not mean slowest], and I pointed out that the 'most efficient claiming' app was the normalising baseline, by a factor of ~3.3x peak, ~2x average [too efficient, through 'incorrect measurement']. That's for multibeam, because of a design omission in Boinc Whetstone, since demonstrated by SIMAP with its Android app in a different scenario with the opposite omission [and similarly confusing results in a different context].

Numbers for AP are around 2.25x peak and ~1.5x average.

Boinc developers understand neither SIMD nor multithreading. That's consistently demonstrated through both design and implementation.
18) Message boards : Number crunching : Phantom Triplets (Message 1575810)
Posted 22 Sep 2014 by Profile jason_gee
I'd still like to try running a GPU memory test first, if I can find one similar to memtest86, just to test the memory, not stress test the whole GPU.


Yeah, it won't hurt to start eliminating the easiest possible causes. In the not-too-distant future I'm hoping to convert some of my test pieces into a user-friendly stress/bench test format. Until then, general memory and artefact scanning tools may turn up something.
19) Message boards : Number crunching : Phantom Triplets (Message 1575803)
Posted 22 Sep 2014 by Profile jason_gee
Yeah, the hardware validation process, rates of failure due to radioactivity in the device encapsulation, and external factors are statistical processes, and similar processes like leakage govern whether a given chip will end up in a budget card or an enterprise-level Tesla.

The cynical part of me would probably throw in that the cards are made to last (stay in spec) through the warranty period with the duty cycle expected of a gamer or desktop user, as opposed to a 24x7 HPC usage scenario.

What it adds up to, for me, is that when someone buys a several-thousand-dollar Tesla, they pay for extra testing, headroom, and cherry-picked parts, so they do actually get *something* for their money.

My challenge continues to be figuring out how to make it work for the rest of us. I think for the most part the Boinc backend does a good job of protecting the science using the validation approach (with notable past exceptions). Detecting and correcting for this kind of spurious event entirely at the end-user side is going to take a few more leaps, with some cost-benefit and feasibility constraints.
20) Message boards : Number crunching : Phantom Triplets (Message 1575698)
Posted 21 Sep 2014 by Profile jason_gee
...Has anybody else run into a similar problem, or can anyone think of something else I might look at? At this point, I don't think it warrants a great deal of investigative effort, but I do find it mildly annoying and if it continues, or the frequency increases, I may have to try to dig deeper. I suppose I'm old-fashioned, but I actually expect a computer to be consistent! ;^)


Yeah, it does get quite difficult when you want to isolate variation that rare, given the following list of possibilities:
- possible 'genuine' core or memory voltage issue on the card or anywhere else in the system. That includes some cards needing a small voltage bump from the factory, especially 560ti factory overclock models. Background relates to how card manufacturers bin parts & set base clocks, usually accepting some number of visual artefacts per time period, for desktop gaming cards.
- The inherent susceptibility of 'consumer grade' hardware to soft errors (http://en.wikipedia.org/wiki/Soft_error#Causes_of_soft_errors), especially:
"IBM estimated in 1996 that one error per month per 256 MiB of ram was expected for a desktop computer",
hence enterprise ECC RAM, and the ECC features on professional Tesla model cards (a rough scaling of that figure appears after this list).
- known vagaries in floating point arithmetic, which show up in cross-platform or cross-application-version validation.
- possible bugs at any software, firmware or hardware level

and probably others that don't come immediately to mind. As it happens, the example you gave 'feels' most like the first case, as opposed to the second, third or fourth possibilities listed, though without direct experimentation to isolate it, it'd be mostly guessing.
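
As a rough scaling of the IBM figure quoted in the list above, assuming purely for illustration that the rate scales linearly with memory size (and ignoring process, altitude and era differences):

/* Rough scaling of the quoted IBM soft-error figure (one error per month
   per 256 MiB), assuming linear scaling with memory size. That assumption
   is a simplification for illustration only. */
#include <stdio.h>

int main(void)
{
    const double errors_per_month_per_256mib = 1.0;  /* quoted figure      */
    double gpu_mib = 1024.0;                         /* e.g. a 1 GiB card  */

    double errors_per_month = errors_per_month_per_256mib * (gpu_mib / 256.0);
    printf("expected soft errors/month for %.0f MiB: %.1f\n",
           gpu_mib, errors_per_month);
    return 0;
}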

One way to test the scenario might be to actually drop the GPU core voltage a notch or two and see if the invalid rate climbs. If so, then going the other way, or reducing clocks, should eliminate that particular cause, leaving zero or more other sources of variation, since the chances of the other causes remain. Yeah, you're crossing the boundary between certainty and some scary doors here ;)

