Posts by jason_gee

1) Message boards : Number crunching : NVIDIA Jetson TK1 (Message 1567654)
Posted 13 days ago by Profile jason_geeProject donor
Any patches or recommendations for X-branch? It's on idle while I address some complex BOINC issues, but I haven't stopped taking issues for the to-do list :)
2) Message boards : Number crunching : @NVidia-developers: Low background GPU load RISES crunching throughput significantly (Message 1566426)
Posted 15 days ago by Profile jason_geeProject donor
Yeah, the high and increasing driver latencies in Windows are a pain. Probably neither OpenCL nor CUDA will improve single-task utilisation until we both start making more use of the streaming functionality to hide them.

DirectX12 is meant to give a significant performance boost by reducing CPU load to allow GPU load to increase (for iGPU gaming). Would part of it possibly be due to reducing driver latencies?

Absolutely. It's said that DirectX 12 will use a closer-to-hardware approach similar to AMD's Mantle (which it's also been said Intel may move to for its iGPUs). At least some portion of that would be hardware spec, but streamlining of the driver architecture etc. would probably get those latencies down.

CUDA and OpenCL with NV on Windows use low-level DirectX calls underneath for much functionality, so any improvements there should help.
3) Message boards : Number crunching : @NVidia-developers: Low background GPU load RISES crunching throughput significantly (Message 1566424)
Posted 15 days ago by Profile jason_geeProject donor
He is using two different GPUs, a 430 and a 640 IIRC.
That makes it more complicated to find the sweet spot.

I see. Yes, on MB, for mixed-GPU configurations, I needed to add the ability for advanced users to configure the application by device PCI(e) bus+slot ID. Even then, I would expect the VRAM on those GPUs to be an additional complicating bottleneck.

So not an easy setup ;)
4) Message boards : Number crunching : @NVidia-developers: Low background GPU load RISES crunching throughput significantly (Message 1566421)
Posted 15 days ago by Profile jason_geeProject donor
I already suggested bigger unroll values, but this causes stuttering while streaming video.
He doesn't like this.

Yeah, the high and increasing driver latencies in Windows are a pain. Probably neither OpenCL nor CUDA will improve single-task utilisation until we both start making more use of the streaming functionality to hide them.

Hmmm, you might be running into similar driver-latency issues to those I see on larger GPUs with CUDA MB. There *should* be a happy middle unroll setting that would maintain loading but not overcommit. Finding that (if available) could be the hard part though, and will depend on how Raistmer implemented the unroll.
5) Message boards : Number crunching : @NVidia-developers: Low background GPU load RISES crunching throughput significantly (Message 1566419)
Posted 15 days ago by Profile jason_geeProject donor
My guess is that in your particular setup (with limited knowledge of your setup and current AP apps), the single task loading is just below the threshold to trigger the higher power state, which is also one parameter among many that dictates boost functionality.
6) Message boards : Number crunching : 15th Anni T-shirt discussions (Message 1566032)
Posted 16 days ago by Profile jason_geeProject donor
Thanks Aussies,
Great they got there safely :). In the mayhem of a busy week I sent the local ones all out, some paid for completely, some not yet :) (thanks for the phone message reminder, specialjimmy). I'll get to PM exchanges and working out who paid what at the end of the week; not rushing, as I covered the Berkeley end and the delays are mine, down to work and world events.
7) Message boards : Number crunching : Lunatics Windows Installer v0.42 Release Notes (Message 1563687)
Posted 21 days ago by Profile jason_geeProject donor
I've generally observed that the NVidia cards comparably scream through Cuda MB tasks like hell while taking their time on AP (although still good) - while AMD cards eat AP tasks for breakfast but really take their time chugging through MB tasks.

That's what Raistmer has been saying for a long time:
- the most effective use of GPUs here (not 'by credit' but 'by performance'/'by work done') is to:
  - use NVIDIA for CUDA MultiBeam
  - use ATI/AMD for OpenCL AstroPulse

So is the "problem" that there's no CUDA AP app?

If there is a 'problem' as such, and I'm not certain there is because I'm not currently focussed on the AP side of things, it'll be because CUDA cores are not like whatever AMD has, so they require different programming approaches. A direct/straight translation of generic OpenCL to CUDA would likely yield poor results (and vice versa).

For MultiBeam, I estimate that a single instance currently runs, on modern CUDA cards, at about 5% of peak theoretical efficiency, edging up to 7-8% or so with 2 or 3 instances. Looking at the fpops estimates, AP would be significantly lower efficiency than that at the moment.

I'm not sure the work supply or infrastructure is 'ready' for an all-tech-thrown-in CUDA variant based on everything that has been learned so far on MultiBeam, but moving toward adding AP support within x42, along with improving the existing MultiBeam, is planned.
8) Message boards : Number crunching : contoling cuda tasks (Message 1555727)
Posted 37 days ago by Profile jason_geeProject donor
HighTech67, if you look at the detail for one of your machines you will see:
While BOINC running, % of time GPU work is allowed 0.00%
The feature is implemented in BOINC, but the project has not flipped the switch to use it as of yet.

It could be that the feature is not fully ready yet or they have just chosen not to use it yet.

HAL9000, thank you. I had never noticed that.

The only reason I look that far down on that page is to set the location for a new computer and I have not done that for a while. Even then, I admit I don't pay attention to most of what is there.

Could that be a not yet fully implemented feature in the latest release of the BOINC server software? I have no idea when that was released as I don't have anything to do with any project except crunch and I don't do that very well. I think I need to thin my projects down.

The client has been tracking the information for a while. I think the 7.0.64 version I am using is doing it as well. There might be something in the BOINC white papers as to why it is not being used yet, but I haven't felt like reading them.

Yeah, some of us have been trawling the scheduler (server) design for a while now, and the only figures really wired into work fetch at the sendWork point relate to total host usage. A very general CPU/GPU distinction is made there, converted to flops, as opposed to detailed per-device resource usage by applications. That's a design issue coming to the fore now, because new devices like FPGAs and specialised ASICs are already used in some projects (like bitcoin mining) and don't really use general-purpose floating-point operations.

Where that's relevant here is that GPUs were kind of tacked onto the pre-existing CPU mechanism, so there are holes in the logic when supporting a wide range of devices in a heterogeneous arrangement.

This self-made, incomplete UML use-case diagram documents more of the scheduler work-fetch structure and operation than the white papers do.

9) Message boards : Number crunching : Boinc Mngr losing client connection (Message 1532380)
Posted 84 days ago by Profile jason_geeProject donor
I use a TPLink PCIe card for Wireless-N, and it's an Atheros chip too.

Not saying this is likely the issue, but worth some quick checks.

The vendor drivers, whether from TP-Link, Atheros themselves, or Microsoft Update, are prone to very high DPC latency spikes (at least on Win7 x64). As BOINC comms go through the TCP/IP stack, these stalls can cause these symptoms.

To verify whether this is the case on yours, disable the adaptor and see if the local BOINC client-manager symptoms are gone. If so, re-enabling will see the symptoms return at some point. You can also see the extended DPC latency spikes using 'DPC Latency Checker'.

I solved this on my host by using custom drivers available from laptopvideo2go. These drivers reveal all the hidden WiFi options for the adaptor. The key setting that mitigated most of the latency problems was called "pcieaspm", set to "L0s On, L1 On". If it's a USB dongle I assume the setting may be called something else, and it's not visible in the device manager properties with 'normal' drivers.

Also possibly watch DPC latencies while having web browsers open. I switched to chrome and disabled hardware acceleration.

I can search for some links if you have a great deal of trouble finding the Atheros custom driver via laptopvideo2go, and/or need DPC Latency Checker.

Most likely when I eventually replace my current PCIe Wireless card, I'll be looking for something with better quality low latency drivers.
10) Message boards : Number crunching : Compiling Applications for Linux (Message 1528192)
Posted 95 days ago by Profile jason_geeProject donor
I got the following error when I tried that option:

configure: WARNING: unrecognized options: --enable-type-prefix

And the first thing I did was ./configure --help

There was no mention of --enable-type-prefix

and there is no mention of it here:


Probably removed in that version of FFTW then :). I think I did 3.3.2 (will check a bit later).

[edit:] Nope, I did 3.3.4, but it's running configure and ignoring that option anyway, with a warning [i.e. not an error]. You can leave it out, as it's obviously from older versions.

[edit2:] You'll probably want to enable the vector instructions for whatever you're compiling for as well, e.g. --enable-sse --enable-sse2 --enable-avx. Anyway, checking the generated wisdom on a MultiBeam build run will tell you which FFTW kernels it used, to confirm you got what you wanted. 64-bit x86 should imply SSE2+ already, but you never know what Matteo's done unless you look.
11) Message boards : Number crunching : Compiling Applications for Linux (Message 1528107)
Posted 96 days ago by Profile jason_geeProject donor
Okey doke, seems I got 7.28 (stock MultiBeam) building here (untested binary) on Ubuntu 12.04 LTS x64, fully patched.

Biggest issue here was that Ubuntu's repository fftw3 doesn't appear to put the FFTW libraries in the standard places, so they weren't being picked up by MB 7's ./configure script. Rather than dig for it to make links to an old lib, I just built it to the standard location as per the FFTW docs instead. I did both the double-precision and single-float ones just for completeness.

Ignoring checking out / extracting, the sequence was:

#1) make and install fftw libraries and headers: ( location I used ~/seti/fftw)
./configure
make
sudo make install
make clean
./configure --enable-float --enable-type-prefix
make
sudo make install

Trying to build fftw 3.3.4 libs, but don't understand what the --enable-type-prefix is supposed to do.


FFTW libraries come in single- and double-precision flavours. The --enable-float config option is fairly obvious (we want single-precision floats). The type prefix relates to the name clashes that would occur if you needed both libraries. Enabling that (for header and lib naming) together with floats ultimately prefixes the calls with fftwf_ internally (for these single-float versions), which is the format MultiBeam will be expecting. Without both of those I'd expect link-time issues.

[Edit:] Note to check the MultiBeam-generated makefiles too; Eric may or may not still be using the headers and libs the old way (without the explicit prefix, but specifying float; not sure what's current with the Android messing around going on).
12) Message boards : Number crunching : Massive failure rate for GPU taks on Nvidia (Message 1526952)
Posted 99 days ago by Profile jason_geeProject donor
I'd try resetting the project on that host, to see if it unsticks/redownloads some sort of damaged files. Other than that I would have thought driver reinstall, but if other CUDA and OpenCL projects are working fine, I don't think it'd be that. More likely something stuck in the project folder or slots, perhaps. [If you used the Lunatics installer, then perhaps reinstalling that might replace something broken in there too.]
13) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1526710)
Posted 99 days ago by Profile jason_geeProject donor
...Will you need testers once your tool is finalised, jason? I still have a GF114, GF110 and GK110 available.

Yep. Once it's at least semi-operational, basically I'll be fielding it as a bench/stress-test type tool intended for the following purposes:

user side:
- find key, previously fairly arcane, application settings and/or select from custom builds or versions based on user preferences for how to run (e.g. max throughput, minimal user impact, etc.)
- identify stability issues
- submit and compare data against other hosts/hardware
- become some sort of de facto standard for comparing gpGPU (e.g. in reviews), offering something a bit more useful (to us) than the graphics-oriented, synthetic, or folding points-per-day benches out there

developer side (as mentioned):
- test code schemes/approaches on lots of different hardware quickly, submitting to a database. There are approximately 15 key high-level optimisation points designed into x42, and most of them will want different or self-scaling code on different hardware. That's why x42 isn't out already.
- guide both generic and targeted optimisations, possibly distribute certain ones semi-automatically
- identify performance, accuracy & precision, and problem hardware sooner.
- pull ahead of the architecture & toolkit release cycles.
- try out some adaptive/genetic algorithms for better matching processing to the hardware.

So it's a big job, with CUDA versions coming out like I change my socks. But it looks like it's turning out to be the right way to head to get x42 over the next big hurdles. I hope the time spent now will make it functional enough to preempt Big Maxwell; we'll see.
14) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1526705)
Posted 99 days ago by Profile jason_geeProject donor
... You *might* find that O2 with more generic options could produce faster host side code (or not :) ), but that's debatable because we're talking CPU side for those CFLAGS...

Just to throw in a different vector:

On the CPU side of things, I've seen good success with the -Os (small code size) option to keep the code size small and hopefully small enough to fit within the CPU (fast) L1 cache... Or at least to leave more of the L2/L3 cache available for data...

You should see a better effect for that on the lower spec CPUs or for the "hyper threaded" Intels.

Yep, another definite possibility. Something like the Pentium 4 has ridiculously long pipelines, so it 'prefers' long, relatively branch-free code (highly unrolled, heavily hand-optimised). More modern chips will sometimes unroll tight loops themselves after decode, moving the bottleneck from branch prediction to instruction decode; so fewer instructions means fewer instructions to decode, and smaller code means fewer page faults.
15) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1526527)
Posted 100 days ago by Profile jason_geeProject donor
If there's something else any of you can think of that may be beneficial to try and report on, let me know.

Not much more from me wrt compiler options, and I'm glad the fast-math options etc. are doing what they should. A bit of fine tuning never seems to hurt :). Yeah, the compatibility thing is a big stumbling block at the moment, and why I have to keep things relatively general for the time being, along with waiting for some things to settle before attempting a generic third-party Linux distribution.

You *might* find that O2 with more generic options could produce faster host side code (or not :) ), but that's debatable because we're talking CPU side for those CFLAGS. For the GPU we've got lots of latency to hide as evidenced by the low utilisation on high end cards.

Best bet if you want to coax a bit more speed out before we get further into x42 would be to grab Petri's chirp improvement and give that a try. He claims slightly lower accuracy, but some speed improvement for specific GK110 GPUs. [Edit: note that I see no particular evidence on Petri's host of reduced chirp accuracy. The inconclusive-to-pending ratio is right about 4.6%, which is right where mine was on my 780 before the first one blew up...]

For my end of that, after working out some boinc/boincapi issues for the time being, I'll be back in the process of building a special cross-platform tool to unit-test individual code portions for performance and accuracy, to speed development along. So far it looks like this:

A fair way to go yet, and a few hurdles to cross, but basically its intent is to prove and generalise code before having to rebuild the main applications. That's come about because I see no reason why Petri's approach shouldn't generalise to earlier GPUs, and it should be able to incorporate my own work on latency hiding and on using single floats for extra precision with little speed cost in some places.

Hopefully back onto that in the next week or so.
16) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1525974)
Posted 101 days ago by Profile jason_geeProject donor
... So now my question is, what about all those CFLAGS (you used, Petri) when ./configure-ing X-branch? Is there some place I can find a list of them specific to configuring this MB app? Do I look through the source code for answers to specifying CFLAGS before configuring? Where did you get them, Petri? I believe tuning through the use of those CFLAGS is key to compiling the fastest app.

On top of what Petri33 comes up with, my recommendations for CFLAGS there are pretty general. That's because they are for the host CPU component(s), and a lot of the CPU time is beyond our control for the time being, being a function of driver latencies (traditionally lower on Linux, but growing to match Windows and Mac). That's why Tesla compute cards with no graphics output offer special low-latency drivers (TCC, Tesla Compute Cluster drivers):
- Use SSE or better for both the BOINC libraries and X-branch itself. Since it's x64 you should be getting SSE2 anyway by default, but it's worth checking.
- Make sure to enable fast math.
- It's true I don't tend to use a lot of hardcoded #defines, because the code is generalised, as opposed to multiple-pathed.

For the CUDA compilation portions (IIRC NVFLAGS?), you'll want to check that O2 or O3 is enabled (the command line will be visible during compilation of the .cu files). That enables some safe fast-math options for the GPU side. You'd likely need to leave maxregisters at 32, even though some CUDA kernels will tolerate more under certain situations. That's because most of the device kernels are custom hand-coded for low latency and high bandwidth, as opposed to high occupancy, meaning register pressure isn't a large factor.

After that (for me), performance optimisation becomes less about twiddling compiler options and far more about very high-level algorithmic choices (like Petri33's custom chirp he's been saving for me), and starting down the road of finding ways to hide those awful latencies (using latency-hiding mechanisms, changing the way work is 'fed', and the way the host CPU interacts with partial results). That's where we are at the moment, with me designing new tools and techniques to find the best low-latency methods, and to allow generalising, optimising on-host and plugging in new core functionality like Petri's for general distribution, eventually at install and/or run time without application recompile [...similar to the way mobile phone apps do].
17) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1525556)
Posted 103 days ago by Profile jason_geeProject donor
No idea if I can get any of the other applications compiled...

Stock CPU multibeam should have few roadblocks. Not sure what the Linux status of any AP is, stock or opt.
18) Message boards : Number crunching : New card having a few errors... (Message 1524938)
Posted 104 days ago by Profile jason_geeProject donor
One additional thing to be aware of when dealing with factory overclocked/superclocked models from any brand (it may or may not have been a factor here): those factory overclocks are typically made to a level of stability determined by some maximum number of acceptable graphical 'artefacts' over a time period. In games, minute/infrequent graphical glitches are often considered acceptable.

For number crunching the acceptable number of glitches is zero, and crunching can use different parts of the chip than the dedicated graphics parts. So sometimes a small core-voltage and fan-curve boost can be necessary (e.g. using EVGA Precision X or similar). Both my 560 Ti and 780 SC require small voltage bumps for rock-solid CUDA operation at the factory (superclocked) frequencies.
19) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1524333)
Posted 106 days ago by Profile jason_geeProject donor
I haven't tried anything yet, but I'm just wondering: was it really necessary to force a manual installation of the NV proprietary display driver? In K/Ubuntu, I just installed the distribution-packaged display driver along with the other CUDA libraries and what-not. The RPM package from is not suitable? I'll follow as instructed anyway.

Another question: Presumably the compiled binary is auto-labelled ..._cuda60 because that's the current version of the CUDA tools installed from I remember the notes for the binary distributions of the MB CUDA application referring to things like the cuda42 build being suitable for Fermi GPUs, cuda50 for Kepler GPUs, etc. Is this note relevant when one is building manually?

Try both ways and tell us? Frankly I had less of a struggle on Ubuntu, though I have a tendency to do stuff on automatic. The need to do 'weird stuff' can come from either necessity, or prior experience of doing it when it's no longer needed.
20) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1524119)
Posted 107 days ago by Profile jason_geeProject donor
That should help me out as I prepare to streamline things a fair bit. Most likely I won't make any changes there in a hurry, being tied up with some testing for NV and some BOINC patching, but it'll definitely help me reduce the number of fiddly bits down the line.



Copyright © 2014 University of California