Posts by jason_gee

1) Message boards : Number crunching : Lunatics Windows Installer v0.42 Release Notes (Message 1563687)
Posted 14 hours ago by Profile jason_geeProject donor
I've generally observed that NVidia cards scream through Cuda MB tasks while taking their time on AP (although still good), while AMD cards eat AP tasks for breakfast but really take their time chugging through MB tasks.

That's what Raistmer has been saying for a long time:
- the most effective use of GPUs here (not 'by credit' but 'by performance'/'by work done') is to:
use NVIDIA for CUDA MultiBeam
use ATI/AMD for OpenCL AstroPulse

So is the "problem" that there's no CUDA AP app?

If there is a 'problem' as such, and I'm not certain there is because I'm not currently focussed on the AP side of things, it'll be because Cuda Cores are not like whatever AMD has, so require different programming approaches. Direct/straight translation of generic OpenCL to Cuda would likely yield poor results (and vice versa).

For Multibeam, I estimate that single instance currently runs, on modern Cuda cards, at about 5% of peak theoretical efficiency, edging up to 7-8% or so with 2 or 3 instances. Looking at the fpops estimates AP would be significantly lower efficiency than that at the moment.

I'm not sure the work supply or infrastructure is 'ready' for a Cuda AP variant with all the tech learned so far on multibeam thrown in, but moving toward adding AP support within x42, along with improving the existing multibeam, is planned.
2) Message boards : Number crunching : contoling cuda tasks (Message 1555727)
Posted 16 days ago by Profile jason_geeProject donor
HighTech67: If you look at the detail page for one of your machines you will see
While BOINC running, % of time GPU work is allowed 0.00%
The feature is implemented in BOINC, but the project has not flipped the switch to use it as of yet.

It could be that the feature is not fully ready yet or they have just chosen not to use it yet.

HAL9000, thank you. I had never noticed that.

The only reason I look that far down on that page is to set the location for a new computer and I have not done that for a while. Even then, I admit I don't pay attention to most of what is there.

Could that be a not yet fully implemented feature in the latest release of the BOINC server software? I have no idea when that was released as I don't have anything to do with any project except crunch and I don't do that very well. I think I need to thin my projects down.

The client has been tracking the information for a while. I think the 7.0.64 version I am using is doing it as well. There might be something in the BOINC white papers as to why it is not being used yet, but I haven't felt like reading them.

Yeah, some of us have been trawling the scheduler (server) design for a while now, and the only figures really wired into work fetch at the sendWork point relate to total host-usage. A very general CPU/GPU distinction is made there, converted to Flops, as opposed to detailed per device resource usage by applications. At the moment that's a design issue coming forward as new devices like FPGAs and specialised ASICs are used in some projects (like bitcoin mining) already, and don't really use general purpose floating point operations.

Where that's relevant here is that GPUs were kind of tacked onto the pre-existing CPU mechanism, so there are holes in the logic when supporting a wide range of devices in a heterogeneous arrangement.

This self-made (incomplete) UML use-case diagram documents more of the scheduler work-fetch structure and operation than the white papers do.

3) Message boards : Number crunching : Boinc Mngr losing client connection (Message 1532380)
Posted 63 days ago by Profile jason_geeProject donor
I use a TPLink PCIe card for Wireless-N, and it's an Atheros chip too.

Not saying this is likely the issue, but worth some quick checks.

The vendor drivers from either TP-Link, Atheros themselves, or Microsoft Update are prone to very high DPC latency spikes (at least on Win7 x64). As Boinc comms go through the TCP/IP stack, these stalls can cause these symptoms.

To verify if this is the case on yours, disable the adaptor and see if the local Boinc client-manager symptoms are gone. If so, reenabling will see the symptoms return at some point. You can also see the extended DPC latency spikes using 'DPC Latency Checker'.
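If you want to script that check, something along these lines from an elevated Windows command prompt toggles the adaptor (the interface name below is a placeholder; list yours first):

```shell
:: List interface names, then disable/re-enable the wireless one.
:: "Wireless Network Connection" is a placeholder for your adaptor's name.
netsh interface show interface
netsh interface set interface name="Wireless Network Connection" admin=disabled
:: ...watch whether the Boinc manager/client disconnects stop, then:
netsh interface set interface name="Wireless Network Connection" admin=enabled
```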

I solved this for my host by using custom drivers available from laptopvideo2go. These drivers reveal all the hidden wifi options for the adaptor. The key setting that mitigated most of the latency problems was called "pcieaspm", set to "L0s On, L1 On". I assume if it's a USB dongle the setting may be called something else; it's not visible in the device manager properties with 'normal' drivers.

Also possibly watch DPC latencies while having web browsers open. I switched to chrome and disabled hardware acceleration.

I can search for some links if you have trouble finding the Atheros custom driver via laptopvideo2go, and/or need DPC Latency Checker.

Most likely when I eventually replace my current PCIe Wireless card, I'll be looking for something with better quality low latency drivers.
4) Message boards : Number crunching : Compiling Applications for Linux (Message 1528192)
Posted 74 days ago by Profile jason_geeProject donor
I got the following error when i tried that option:

configure: WARNING: unrecognized options: --enable-type-prefix

and the first thing i did was ./configure --help

There was no mention of --enable-type-prefix

and there is no mention of it here:


Probably removed in that version of fftw then :). I think I did 3.3.2 (will check a bit later)

[edit:] Nope, I did 3.3.4, but it's running configure and ignoring that option anyway, with a warning [i.e. not an error]. You can leave it out, as it's obviously from old versions.

[edit2:] you'll probably want to enable the vector instructions for whatever you're compiling for as well, e.g. --enable-sse --enable-sse2 --enable-avx. Anyway, checking the generated wisdom on a multibeam build run will tell you what fftw kernels it used, to check you get what you want. 64-bit x86 should imply SSE2+ already, but you never know what Matteo's done unless you look.
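For reference, a single-precision fftw 3.3.4 configure pass with the SIMD kernels enabled might look like this (a sketch only; pick just the instruction sets your CPU supports, and your fftw version may warn about flags it no longer recognises):

```shell
# From inside the extracted fftw-3.3.4 source directory (sketch):
# single-precision build with vector kernels enabled.
./configure --enable-float --enable-sse --enable-sse2 --enable-avx
make
sudo make install
```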
5) Message boards : Number crunching : Compiling Applications for Linux (Message 1528107)
Posted 75 days ago by Profile jason_geeProject donor
Okey doke, seemed to get 7.28 (stock multibeam) building here (untested binary) on Ubuntu 12.04 LTS x64, fully patched.

Biggest issue here was that Ubuntu's repository fftw3 doesn't appear to put the fftw libraries in the standard places, so they weren't being picked up by MB 7's ./configure script. Rather than dig for it to make links to an old lib, I just built fftw to the standard location as per the fftw docs instead. I did both the double precision and single float ones just for completeness.

Ignoring checking out / extracting Sequence was:

#1) make and install fftw libraries and headers: (location I used ~/seti/fftw)
./configure
make
sudo make install
make clean
./configure --enable-float --enable-type-prefix
make
sudo make install

Trying to build fftw 3.3.4 libs, but don't understand what the --enable-type-prefix is supposed to do.


FFTW libraries come in single and double precision flavours. The --enable-float config option is fairly obvious (we want single precision floats). The type prefix relates to the [name clashes that would occur if you need both libraries] . enabling that [for headers and lib naming], and the floats [ultimately] prefixes the calls with fftwf_ internally (for these single float versions), which will be the format multibeam will be expecting. Without [both] those I'd expect link time issues.

[Edit:] note to check the multibeam generated makefiles too; Eric may or may not still be using the headers and libs the old way (without explicit prefix, but specifying float — not sure what's current with the Android messing around going on)
6) Message boards : Number crunching : Massive failure rate for GPU taks on Nvidia (Message 1526952)
Posted 78 days ago by Profile jason_geeProject donor
I'd try resetting the project on that host, to see if it unsticks/redownloads some sort of damaged file. Other than that I would have thought driver reinstall, but if other Cuda and OpenCL projects are working fine, I don't think it'd be that. More likely something stuck in the project folder or slots perhaps. [If you used the Lunatics Installer, then reinstalling that might replace something broken in there too]
7) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1526710)
Posted 78 days ago by Profile jason_geeProject donor
...Will you need testers once your tool is finalised, jason? I still have a GF114, GF110 and GK110 available.

Yep. Once at least semi-operational, I'll basically be fielding it as a bench/stress-test type tool intended for the following purposes:

user side:
- find key, previously fairly arcane, application settings &/or select from custom builds or versions based on user preferences of how to run ( e.g. max throughput, minimal user impact etc)
- identify stability issues.
- Submit and compare data to other hosts/hardware
- become some sort of de facto standard for comparing GPGPU (e.g. in reviews), offering something a bit more useful (to us) than the graphics-oriented, synthetic, or folding points-per-day benches out there.

developer side (as mentioned):
- test code schemes/approaches on lots of different hardware quickly, submitting to a database. There are approximately 15 key high-level optimisation points designed into x42, and most of them will want different or self-scaling code on different hardware. That's why x42 isn't out already.
- guide both generic and targeted optimisations, possibly distribute certain ones semi-automatically
- identify performance, accuracy & precision, and problem hardware sooner.
- pull ahead of the architecture & toolkit release cycles.
- try out some adaptive/genetic algorithms for better matching processing to the hardware.

So it's a big job, with Cuda versions coming out like I change my socks, but it looks like the right way to head to get x42 over the next big hurdles. I hope the time spent now yields something functional enough to preempt Big Maxwell; we'll see.
8) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1526705)
Posted 78 days ago by Profile jason_geeProject donor
... You *might* find that O2 with more generic options could produce faster host side code (or not :) ), but that's debatable because we're talking CPU side for those CFLAGS...

Just to throw in a different vector:

On the CPU side of things, I've seen good success with the -Os (small code size) option to keep the code size small and hopefully small enough to fit within the CPU (fast) L1 cache... Or at least to leave more of the L2/L3 cache available for data...

You should see a better effect for that on the lower spec CPUs or for the "hyper threaded" Intels.

Yep, another definite possibility. Something like a Pentium 4 has a ridiculously long pipeline, so it 'prefers' long, relatively branch-free code (highly unrolled, heavily hand-optimised). More modern chips will sometimes unroll tight loops themselves after decode, moving the bottleneck from branch prediction to instruction decode — so smaller code means fewer instructions to decode, and fewer page faults too.
9) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1526527)
Posted 79 days ago by Profile jason_geeProject donor
If there's something else any of you can think of that may be beneficial to try and report on, let me know.

Not much more from me wrt compiler options, and glad the fast math options etc are doing what they should. A bit of fine tuning never seems to hurt :). Yeah the compatibility thing is a big stumbling block at the moment, and why I have to keep relatively general for the time being, along with waiting for some things to settle before some attempt at generic third party Linux distribution.

You *might* find that O2 with more generic options could produce faster host side code (or not :) ), but that's debatable because we're talking CPU side for those CFLAGS. For the GPU we've got lots of latency to hide as evidenced by the low utilisation on high end cards.

Best bet if you want to coax a bit more speed out before we get further into x42, would be to grab Petri's chirp improvement and give that a try. He claims slightly lower accuracy, but some speed improvement there for specific GK110 GPUs. [Edit: note that I see no particular evidence on Petri's host of reduced chirp accuracy. The inconclusive to pending ratio is right about 4.6%, which is right where mine was on my 780 before the first one blew up...]

For my end of that, after working out some boinc/boincapi issues for the time being, I'll be back in the process of building a special cross platform tool to unit test individual code portions for performance and accuracy to speed development along. So far it looks like this:

A fair way to go yet, and a few hurdles to cross, but basically its intent is to be able to prove and generalise code before having to rebuild the main applications. That's come about because I see no reason why Petri's approach shouldn't be generalisable to earlier GPUs, and able to incorporate my own work on latency hiding and on using single floats for extra precision with little/no speed cost in some places.

Hopefully back onto that in the next week or so.
10) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1525974)
Posted 80 days ago by Profile jason_geeProject donor
... So now my question is, what about all those CFLAGS (that you use, Petri) when ./configuring Xbranch? Is there some place I can find a list of them specific to configuring this MB app? Do I look through the source code for answers to specifying CFLAGS before configuring? Where did you get them, Petri? I believe tuning through the use of those CFLAGS is key to compiling the fastest app.

On top of what Petri33 comes up with, my recommendations there for CFLAGS are pretty general. That's because they are for the host CPU component(s), and a lot of the CPU time is beyond our control for the time being, being a function of driver latencies (traditionally lower on Linux, but growing to match Windows and Mac). That's why Tesla compute cards with no graphics offer the use of special low-latency drivers (TCC, Tesla Compute Cluster drivers):
- Use SSE or better for both the boinc libraries and Xbranch itself. Since it's x64 you should be getting SSE2 anyway by default, but worth checking
- make sure to enable fast math
- It's true I don't tend to use a lot of hardcoded #defines, because the code is generalised, as opposed to multiple-pathed.

For the Cuda compilation portions (IIRC NVFLAGS?), you'll want to check that O2 or O3 is enabled (the command line will be visible during compilation of the .cu files). That enables some safe fast-math optimisations on the GPU side. You'd likely need to leave maxregisters at 32, even though some Cuda kernels will tolerate more under certain situations. That's because most of the device kernels are custom hand-coded for low latency and high bandwidth, as opposed to high occupancy, meaning register pressure isn't a large factor.
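To illustrate what to look for as those lines scroll past (a sketch only — the real invocation is generated by the build's makefiles, and the source filename here is hypothetical; the flags themselves are standard nvcc options):

```shell
# Illustrative nvcc line only; not the exact Xbranch build command.
# -O3 / --use_fast_math enable the safe GPU-side fast-math optimisations,
# and --maxrregcount caps register usage per thread at 32.
nvcc -O3 --use_fast_math --maxrregcount=32 \
     -gencode arch=compute_35,code=sm_35 \
     -c cudaAcceleration.cu -o cudaAcceleration.o
```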

After that (for me), performance optimisation becomes less about twiddling compiler options, and far more about very high-level algorithmic choices (like Petri33's custom chirp he's been saving for me), and starting down the road to find ways to hide those awful latencies (using 'latency hiding mechanisms', changing the way work is 'fed', and the way the host CPU interacts with partial results). That's where we are at the moment, with me designing new tools and techniques to find the best low-latency methods, and to allow generalising, optimising on-host, and plugging in new core functionality like Petri's for general distribution — eventually at install &/or run-time without application recompile [...similar to the way mobile phone apps do.]
11) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1525556)
Posted 82 days ago by Profile jason_geeProject donor
No idea if I can get any of the other applications compiled...

Stock CPU multibeam should have few roadblocks. Not sure what the Linux status of any AP is, stock or opt.
12) Message boards : Number crunching : New card having a few errors... (Message 1524938)
Posted 84 days ago by Profile jason_geeProject donor
One additional thing to be aware of when dealing with factory overclocked/superclocked models from any brand (may or may not have been a factor here). Those factory overclocks are typically made to a level of stability, determined by some maximum number of acceptable graphical 'artefacts' over a time period. In games minute/infrequent graphical glitches are often considered acceptable.

For number crunching the acceptable number of glitches is zero, and crunching may be using different parts of the chip than the dedicated graphics parts. So sometimes a small core voltage and fan curve boost can be necessary (e.g. using eVGA Precision X or similar). Both my 560ti and 780sc require small voltage bumps for rock solid Cuda operation at the factory (superclocked) frequencies.
13) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1524333)
Posted 85 days ago by Profile jason_geeProject donor
I haven't tried anything yet, but I'm just wondering: was it really necessary to force a manual installation of the NV proprietary display driver? In K/Ubuntu, I just installed the distribution-packaged display driver along with the other CUDA libraries and what-not. The RPM package from is not suitable? I'll follow as instructed anyway.

Another question: Presumably the compiled binary is auto-labelled ..._cuda60 because that's the current version of the CUDA tools installed from I remember the notes for the binary distributions of the MB CUDA application referring to things like the cuda42 build being suitable for Fermi GPUs, cuda50 for Kepler GPUs, etc. Is this note relevant when one is building manually?

Try both ways and tell us? Frankly I had less struggle on Ubuntu, though I have a tendency to do stuff on automatic. The need to do 'weird stuff' can come from either necessity, or prior experience of doing it when it's no longer needed.
14) Message boards : Number crunching : Step by Step, compile NVIDIA/MB CUDA app under Linux (Fedora 19) (Message 1524119)
Posted 86 days ago by Profile jason_geeProject donor
That should help me out as I prepare to streamline things a fair bit. Most likely I won't make any changes there in a hurry, being tied up with some testing for nv, and some Boinc patching, but it'll definitely help me out reducing the number of fiddly bits down the line.

15) Message boards : Number crunching : slow computer reaching its limit (Message 1523570)
Posted 87 days ago by Profile jason_geeProject donor
Until now there are no high-range Maxwells available to make the comparison, but it seems like the latency problem is still happening with the mid-range Maxwells available.

That's right. The trick is to get the scaling right in advance of Big Maxwells, and have automatic scaling perfected before Pascal &/or Volta. Some testing to get out of the way behind the scenes with nv, then the latency issues are getting some hefty glove slap :)
16) Message boards : Number crunching : AMD Athlon 64 3500+ system with PCI-e (Message 1523436)
Posted 87 days ago by Profile jason_geeProject donor
Why aren't AP Units run on CUDA?

Simply because no-one wrote a Cuda application for AP yet :) (well that I know of, perhaps they exist).

I've chosen to focus on Multibeam with Cuda for the time being for several reasons:
- relatively limited or sporadic AP task availability
- a lot of questions about AP validation and blanking to be resolved.
- focus on refining tools and techniques rather than immediate credit, for later use by any application (AP, GBT, other projects etc)
- The OpenCL variants developed by others appear to be doing fine (while when I started on Cuda multibeam it was not doing 'fine')
- Gives me as many field proven tools as possible when the time is right ( If done now, Cuda hosts would basically chew through all the issued AP in hours...), leading to,
- ... Significant credit issues would need to be resolved first.

That doesn't stop anyone making a Cuda enabled AP app, and AP integration is planned in x42 phase 2 (of 3), but for me that's lower priority than pulling ahead of the Cuda architecture release cycles for MB, and perfecting higher level tools/techniques.
17) Message boards : Number crunching : Compiling Applications for Linux (Message 1522825)
Posted 90 days ago by Profile jason_geeProject donor
*should* be OK, and I doubt you'd run into any issues.

Most likely, IMO, the *glitches* are actually the Cuda32 build being a bit old & cranky now, e.g. missing its best signals in that first one, because some synchronisation has changed since way back when. If you wanted to be sure, you could run reference CPU app runs on the two problem ones under controlled conditions, though I suspect it'd come up clean.

Speed-wise, the low angle ranges are showing the effects, measured elsewhere, of increased driver latencies in later Cuda version paths (despite the same code, same driver, same card). The Big K and Maxwell architectures are too big/fast for the small kernels being thrown at them, resulting in upwards of 40% underutilisation at times, directly attributable to driver latency. That's the big part of recent research: finding that the whole pulsefinding (dominant in VLARs like those) needs to be transposed, and the newer latency hiding mechanisms need to be marshalled throughout the application.

To perfect that in a general way, with so many different architectures now online and coming, is requiring me to build some special tools. More on that when I'm a little further along, as those tools will be cross-platform.

Manual targeted (to a specific device + Cuda version) latency hiding approaches are possible, though that's proven a dead end for me, as I need to support every Cuda enabled GPU. Instead I'm building tools to make the apps more self-scaling/optimising (i.e. decide to feed larger chunks of work, and choose and configure appropriate latency hiding mechanisms automatically where needed).
18) Message boards : Number crunching : New HP Z400 - Lunatics in question (Message 1522311)
Posted 91 days ago by Profile jason_geeProject donor
...I would like to point out in the astropulse info is a reference to using the "fermi path". Does the current astropulse build not use Kepler?

I think most of the NV OpenCL code is fairly generic, though not being involved with the development of that myself I can't say for certain. Fermi+ do have special needs, which is probably why there is a special path named that.

For your mbcuda.cfg settings in the stderrs: it looks like, except for the oldest stderr you posted, they all picked up. I'll just point out, only in case you didn't realise it, that if you want both devices to use the same settings, then you only need the global ones, so (trimmed):

processpriority = abovenormal
pfblockspersm = 8
pfperiodsperlaunch = 300

processpriority = abovenormal
pfblockspersm = 8
pfperiodsperlaunch = 300

processpriority = abovenormal
pfblockspersm = 8
pfperiodsperlaunch = 300

can become simply:

processpriority = abovenormal
pfblockspersm = 8
pfperiodsperlaunch = 300
19) Message boards : Number crunching : Compiling Applications for Linux (Message 1522219)
Posted 92 days ago by Profile jason_geeProject donor
A little bit of difference. This is what I got on a GT620:

In terms of dealing with [the vagaries of] floating point precision, those Q's indicate identical results.

A 2-11% performance difference from a straight Cuda version swap is something, though you can expect more down the road, since those times would appear to indicate Linux is headed down the heavyweight high-latency driver road that MS and Mac have (inevitable, I suppose, with all those new features). From there, better performance and utilisation require new tools and techniques.
20) Message boards : Number crunching : New HP Z400 - Lunatics in question (Message 1521598)
Posted 93 days ago by Profile jason_geeProject donor
1. Why does boinc not always read "mbcuda.cfg" when starting a task?

That you need to ask Jason, who created the build, but I believe the answer is that the configuration only changes if you change the GPU or something similar.

*scratching head*. That's not a Boinc file, but an x-branch one. My Windows builds of x41zc, stock and installer, ALL read the mbcuda.cfg file at every task startup. You'll need to show me what you're looking at, if you think it is not doing so.

[Edit:] one idea, if you are using stock then there are different mbcuda.cfg txt files per version in the project folder. That's an unfortunate limitation of Boinc I can do nothing about. You should put the same settings in each variant.

Running stock could also explain why builds swap around. Boinc doesn't do time estimates all that well (part of the reason it screws up credit too). That means sometimes it can get confused and stuck on the wrong build.

The current (custom) Cuda 6 build I'm running isn't any faster for Kepler GPUs than the Cuda 5 variant, so it's not worth releasing for the sake of a number. That's mostly because I'm bogged down with Boinc problems and making some special tools to use all the new Cuda 6 features. Probably once some issues are resolved, application development will be much faster (fingers crossed).


Copyright © 2014 University of California