Posts by jason_gee


1) Message boards : Number crunching : Linux, Nvidia 750-TI and Multiple Tasks. (Message 1666447)
Posted 1 day ago by Profile jason_gee
Since you are running hyperthreaded, the cache, timeslices and execution resources are split. If you free an extra (virtual) core, you may see some improvement.
2) Message boards : Number crunching : Loading APU to the limit: performance considerations - ongoing research (Message 1666438)
Posted 1 day ago by Profile jason_gee
...Could a smaller L2, or the added layer of L3 cache, be related? Does each layer of cache add latency?
That might be down to how the logic inside the chip works; I don't know.


Yes, it does. The general rule of thumb is that if an L1 miss costs 1 unit of latency, then an L2 miss costs 10 units, and an L3 miss (out to main memory) 100 units. A hard page fault out to disk should be considered 1000+ units (out to infinity for estimation/optimisation purposes).

The usual 'tricks' to make sure data is there before you need it are architecture specific, though there are more rules of thumb with respect to keeping given dataset sizes 'hot' at a given cache level, software prefetch, and triggering available hardware prefetch mechanisms by touching data ahead of time in certain patterns.
3) Message boards : Number crunching : Linux, Nvidia 750-TI and Multiple Tasks. (Message 1666388)
Posted 1 day ago by Profile jason_gee
Hmm, comparing the differences between those hosts could be well worthwhile.
4) Message boards : Number crunching : Linux, Nvidia 750-TI and Multiple Tasks. (Message 1665735)
Posted 3 days ago by Profile jason_gee
I also tried playing with <GPU_usage> and <cpu_usage> without much effect.


Those are hints to the Boinc scheduler, rather than application controls. For the time being, utilisation and bandwidth considerations are unthrottled, and therefore 'controlled' by whichever system bottleneck is tightest. With the current generation of applications on higher-performance cards, the dominant bottleneck is driver latency.
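For reference, those hints live in an `app_config.xml` in the project directory; a minimal sketch (the app name here is an example, check your client's logs for the actual one):

```xml
<app_config>
  <app>
    <name>setiathome_v7</name>       <!-- example app name, verify locally -->
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>     <!-- 0.5 = two tasks share one GPU -->
      <cpu_usage>0.2</cpu_usage>     <!-- CPU fraction budgeted per GPU task -->
    </gpu_versions>
  </app>
</app_config>
```

Again, these only tell the Boinc scheduler how many tasks to run and how to budget CPU; they don't throttle what the application itself does.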

This is likely to change in the x42 series, as I move to completely different mechanisms for GPU synchronisation (after our Cuda streams testing as x41zd). The goals for x42 include a kind of time-based work issue, which would be configurable for different user scenarios (via a simple tool).
5) Message boards : Number crunching : Linux, Nvidia 750-TI and Multiple Tasks. (Message 1665733)
Posted 3 days ago by Profile jason_gee
Unfortunately I don't know enough about the Linux driver model to say things with authority, but can point out the reasons it works like that on Windows so you can compare.

The first main point is that the reason running multiple Cuda/OpenCL tasks on Windows is effective is that the driver architecture has considerable latency (i.e. it stalls for some period when issued a command). Because the applications have had to scale from very small to very large GPUs, the current (let's call them 'traditional') applications are quite 'chatty' and make limited use of Cuda streams, which is a finer-grained latency-hiding mechanism. So for now, on Windows at least, running multiple tasks per GPU amounts to a coarse-grained and not-super-efficient latency-hiding mechanism that is pretty hard on caches, PCIe and drivers.

The second point is that the Linux drivers are, I believe (again, limited knowledge here), built in as kernel modules, while the Windows ones involve 'user mode drivers'. Leaving out a whole swathe of DirectX-related commands, they are probably substantially leaner and able to use some GPU features more directly (manifesting as lower latency). That's important because if you try to hide latencies that aren't there, you'll pay the extra costs of extra tasks (cache abuse and OS scheduling overheads) without gaining much from latency hiding (because there isn't as much there to hide).

Lastly, the 750 Ti itself doesn't have a huge number of multiprocessors or high bandwidth (even though it's certainly much 'larger' than low-power cards of previous generations). Those considerations form a performance ceiling highly dependent on the application design, which is overdue for some major changes.

There are a lot of changes happening simultaneously, and I'll probably miss some, but here they are in bullet points:
- We (Petri33 and myself) are building and testing more use of Cuda streams and 'larger' code into some application areas, as well as reducing 'chattiness'
- Cuda 7, which came out recently, has better support for the Maxwell architecture. Combined with the Kepler 2 (GK110, GTX 780 etc.) changes, this involved big shifts in the way parallelism is done from Big K onwards, which we are still coming to grips with. It also mandates 64-bit, which carries a performance penalty on the GPUs as larger addresses chew up registers, but the hope is that improved mechanisms might bury those costs.
- Windows 10 is also moving to a lower latency driver architecture, so things will change with respect to optimal number of tasks there as well (to some extent even on earlier OS where the drivers will change even though DirectX12 won't be available on older gen)

That all amounts to a picture where, in the future, running fewer tasks will probably be better (more efficient, higher overall throughput etc.), but it will still vary by OS, drivers, GPU generation, application(s), and your particular goals. Pretty complex, but the ongoing maturation of GPGPU has meant trying to manage these changes, which hasn't been without some pretty hefty bumps in the road.
6) Message boards : Number crunching : Panic Mode On (96) Server Problems? (Message 1663475)
Posted 9 days ago by Profile jason_gee
All I know is, if I were an Oracle or IBM executive, I'd be all over this as a product showcase, and have an emergency response team choppered in to lend a hand and beef up the gear where needed. Money well spent? I think so :D
7) Message boards : Number crunching : Strange CPU Load caused by SETI (Message 1662981)
Posted 9 days ago by Profile jason_gee
Hey Edwin,
I see similar effects in a different context, namely experiments with GPU utilisation loading GPUs differently under both Cuda and OpenCL.

What I traced that behaviour to may have a similar source here: basically an interference pattern between the sampling rate of the monitoring program and some regular frequency in the load on the device.

My guess is that either underlying hardware, settings or drivers are regulating your CPU usage.

Assuming you are not explicitly throttling with some application like TThrottle, or via Intel SpeedStep or AMD Cool'n'Quiet, I would look at possible issues like chipset driver and device quality, using for example DPC Latency Checker.

You might find that simply changing the sampling rate in Task Manager eliminates the patterns (in which case they were pure mathematical sampling artefacts rather than real), or that they change, or even disappear, if you use a different tool like Process Explorer.

Patterns like that are a good indication there is some delay or frequency in the source that 'beats' with the monitoring timer.
8) Message boards : Number crunching : having issues with my new GTX980 (Message 1661485)
Posted 14 days ago by Profile jason_gee

- check system DPC latencies while at full crunch (which will say something about quality of other drivers in the system that can interfere)

Did I miss anything, anyone? AP has other, totally different settings of course, which I'm not familiar with.



Pretty succinct summarizing, Jason. The note about DPC latencies got me thinking; I do have the DPCLAT utility sitting in my utilities folder. Jason, what do you think is the maximum or typical latency you can get away with when crunching MB? Probably subjective, but I was wondering what you typically see?

Cheers, Keith


P.S. I have AP setting suggestions when the OP gets around to it.


I like using my main dev system, a crappy old Core2Duo, as a baseline for that kind of thing. That's because, while well maintained, over the years its chipset and (wifi) network drivers have been problematic/finicky. The simplest tool for a general overview / health check is DPC Latency Checker, which gives me this at full crunch while streaming a video:


That's after a lot of hair pulling some years back, which led to using customised wifi drivers and forcing Intel chipset driver updates.

If I get periodic red spikes, then I use LatencyMon to isolate the specific driver/hardware involved. I'm told it's normal for such tools to report an extra 1 millisecond offset on Windows 8/8.1, so I suspect the consistency of the latencies is more important than the exact values on a given system.

Also, I have other systems that have always indicated good-quality drivers etc. out of the box, so some systems require more finagling than others.

[Edit:] As an extra note, I find that characterising driver, hardware and configuration quality this way tends to rule out or isolate a whole swathe of possible issues at the hardware, firmware, OS, driver and application levels.
9) Message boards : Number crunching : Lunatics_x41g_linux64_cuda32.7z (Message 1661368)
Posted 15 days ago by Profile jason_gee
He could always try running it under Synecdoche instead.


Looks awesome. I might need to have a conference with those guys in due course ;)
10) Message boards : Number crunching : Lunatics_x41g_linux64_cuda32.7z (Message 1661363)
Posted 15 days ago by Profile jason_gee
The basic principles involved are that if you have a robust system (referring to Boinc) then it will cope with all sorts of odd situations, while if a flea farting in Brazil sends it into meltdown, then probably more care is warranted.
11) Message boards : Number crunching : CUDA works on ATI AMD (1st of April) (Message 1661342)
Posted 15 days ago by Profile jason_gee
I heard that Cuda has a general purpose emulator called OpenCL :P
12) Message boards : Number crunching : having issues with my new GTX980 (Message 1661336)
Posted 15 days ago by Profile jason_gee
The current status quo with MB provides a growing list of things you could address (each on its own terms; many here will help with appropriate tools/advice for each):

Broadly, each is optional and 'at your own risk'. Before OCing:
- Force p2 power state memory clock to full rate
- up the fan speeds/fan-curve to disengage turbo boost temperature limits
- use Lunatics to fix the application to the Cuda50 build
- use the mbcuda.cfg to raise process priority and pulsefind settings (abovenormal,16,400)
- check system DPC latencies while at full crunch (which will say something about quality of other drivers in the system that can interfere)
- increase instances from 1 to probably 3

Then for OCing:
- familiarise with an artefact scanner like OCCT
- practice dropping to factory settings (except for the raised fan curve), then raising one setting at a time until artefacts are produced, then backing off 2 or more 'notches' (the size of a 'notch' is determined by the granularity of the application's sliders or your own 'feel')
- possible core voltage increases, and the OC/hardware limitations of the feeding system, can be considered, rinsing and repeating the artefact scans.

Did I miss anything, anyone? AP has other, totally different settings of course, which I'm not familiar with.
13) Message boards : Number crunching : Lunatics_x41g_linux64_cuda32.7z (Message 1660995)
Posted 16 days ago by Profile jason_gee
So far the compile of the Xbranch I did is working smoothly and performance seems ok. Nothing has errored out, but BOINC is completely out to lunch on how to judge estimated computation size and times...

http://setiathome.berkeley.edu/results.php?hostid=7533120

Jason.. I could try and do a statically linked build of what I'm using if you have others who want to try it out?


Static'd be good, if it can get more static. What I'd suggest is we all consolidate (yourself, Petri etc.) and pile into the GitHub thing. So if you have specific modifications to the makefiles, do a fork and a pull request to master, and we can all discuss it there.

That's a new thing for me too, so a bit of an adventure, but getting that jangling feeling that it's the right way to go (which happens sometimes). Then we can choose to field specific tests to wider audiences pretty quickly.

There's a lot I want to change going into x42. Some of that is process, for involvement/collaboration, testing and publication; another part is a no-compromise re-jigging to prepare for the next application, which includes abstracting/wrapping some problematic BoincApi code that isn't going to be fixed.

Estimate-wise, yeah, lots (and I mean LOTS) of research and development has been spent over about 5-7 years on that (specifically the 'CreditNew' mechanism). A customised client with improved estimation/prediction isn't out of the question, though plenty is understood to offer a superior mechanism to any project that wants it for server-side improvement down the road (better estimates, and fairer RAC normalised to the COBBLESTONE_SCALE as intended).
14) Message boards : Number crunching : Lunatics_x41g_linux64_cuda32.7z (Message 1660954)
Posted 16 days ago by Profile jason_gee
Looking good. None of the new WUs have validated yet but I am not seeing any overflows which is a good sign.


Seeing some [Cuda 6] valids there now, so Cuda 3.2 no good on Maxwell confirmed for Linux. (for this application anyway, as on Windows)


I have done some research in/on Linux with a handful of GTX 780s (1 to max 4).

What would be the best way to deliver the code to the community, To Help Us All do more science, before we get Jason's x42? (Jason?)

I have done some kernel optimizations (780-specific, not tested anywhere else, just on my computer) and some stream-inclined/induced/oriented changes to Jason's (and his precursors') already well and good optimized code.


What would be the most neutral way to publish? (I cannot host any piece of code for three to four years.)


I'll drop the source to Your mail box or whatever.

I'm running two MB at a time with 3 GPUs, and in between an AP on GPU or CPU if available.

See for yourselves (on top hosts) .. (remember to divide the time by 2 for any MB).


And boom! Published XBranch on GitHub.

The last commit is me shovelling in the existing x41zc code. Please fork, modify, test and submit pull requests for discussion/collaboration on getting them back into master :D
15) Message boards : Number crunching : Lunatics_x41g_linux64_cuda32.7z (Message 1660945)
Posted 16 days ago by Profile jason_gee
Yeah, the current process is pretty lengthy/involved, and I've been thinking about that for some time (actually since your submissions, and my major redesigns for x42). I was finding the options I've used before too restrictive (systemically slow), both for myself and for getting others' code in to test quickly and easily, then melding it into the general Berkeley sources after that (where sufficiently generalised and field-proven etc.).

The rough conclusion I've come to is that, despite a painful transition (for me), getting more familiar with GitHub and its workflows would be best. Then you (or anyone) could fork there and make pull requests for additions back to the optimised master. Periodically we'd collectively decide (with Eric's and others' input) when some major revision is good enough to be called stock and rolled into Berkeley's svn.

Still nutting out the details, and working out the finer points of GitHub, but it seems like it might be more workable.
16) Message boards : Number crunching : Lunatics_x41g_linux64_cuda32.7z (Message 1660803)
Posted 16 days ago by Profile jason_gee
Looking good. None of the new WUs have validated yet but I am not seeing any overflows which is a good sign.


Seeing some [Cuda 6] valids there now, so Cuda 3.2 no good on Maxwell confirmed for Linux. (for this application anyway, as on Windows)
17) Message boards : Number crunching : Lunatics_x41g_linux64_cuda32.7z (Message 1660652)
Posted 17 days ago by Profile jason_gee
I've had success with the compile on Cuda 5.5 here. Over a dozen have validated not long ago and it's been running smoothly on the old card.

http://setiathome.berkeley.edu/results.php?hostid=7533120

If you want a list of what I did to get it to compile or a diff of the changes I made I can post it or send it over.


I'd like that list via PM please; I'll compare it to what I have set up on the Ubuntu dev machine where the Cuda 6 build is running, and choose the best simplifying set of updates, adding in Cuda 7 for the CentOS install as a fresh environment.

Ultimately, as briefly mentioned before, I'll be going with the Gradle build system, though it makes sense to keep the traditional gnutools setup operational, since it seems to play ball with a little effort.
18) Message boards : Number crunching : Lunatics_x41g_linux64_cuda32.7z (Message 1660631)
Posted 17 days ago by Profile jason_gee
Seem to be getting valids, and it seems reasonably fast, so I've added the Cuda60 build to the bottom of the Downloads page at:

http://jgopt.org/download.html

Assuming it works on Maxwell, next it would be nice to know which Linux distributions and/or Boinc versions it has problems on (and some idea of why).

That would involve looking in the slot directories for the tasks' stderr.txt on problematic systems.
19) Message boards : Number crunching : Lunatics_x41g_linux64_cuda32.7z (Message 1660589)
Posted 17 days ago by Profile jason_gee
Progress here, as I stumbled across a Cuda 6 build I made, which *should* be nearly identical to that last Cuda32 one that runs/works on anything but Maxwell.

Cuda 6 should be fine, though I'll withhold it until I prove it here, and at the moment the project is giving me 'no tasks available', so the waiting game it is, lol.

[Edit:] Lol, got 7 vlars for CPU... well not what I wanted exactly, but a start

[Edit2:] And it looks like it's off and running on the 680 at least. I'll let it accumulate some pendings and see if anything looks strange. If it looks OK I'll add it to the same download page, keeping the Cuda32 one for older GPUs.
20) Message boards : Number crunching : Lunatics_x41g_linux64_cuda32.7z (Message 1660571)
Posted 17 days ago by Profile jason_gee
At the moment the GPU is processing work (Thanks Jason) so I think I will see what happens before trying to use slackpkg or slapt-get.

My big concern at the moment is how the GPU is processing work. So far the GPU has processed eight WUs, but they are all pending so I do not know if the results are valid or not. The times for each seem to be three to six minutes, which is so much better than the CPU times that I have my doubts about their being valid. As a result I have currently stopped downloads, with about 20 WUs downloaded, until I know. Time will tell.


@dsh
Great to see it's running the Cuda 3.2 build, though it looks like excessive pulses are being detected. That looks like the weird problem that was spotted on Windows with Maxwell and Cuda 3.2 (we then stopped sending that build to those cards).

In that case, back to the plan of updating BoincApi and spitting out a later-Cuda-version build. That'll probably happen later tonight, when I'll bring the newer CentOS install's development tools, drivers and sources up to scratch.

[Edit:] And as I typed, the first invalid ticked over, so moving on.

Making an updated Linux build fits with my current development plans anyhow, so a bit of juggling to do tonight.



Copyright © 2015 University of California