Posts by jason_gee


1) Message boards : Number crunching : no finished file (Message 1612729)
Posted 9 days ago by Profile jason_gee
...Is this a symptom of a dying GPU, or a faulty WU?...exited with zero status but no 'finished' file


In short, neither. It's a symptom of the BOINC client using fixed timeouts in a multithreaded (modern OS) environment. If it's a substantial change from your host's usual behaviour, I would look at which device drivers changed recently, and at system DPC latency behaviour.
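
For illustration only (not the actual BOINC client code), here's a minimal C++ sketch of the kind of wait involved: poll for a hypothetical 'finished' marker file against a single fixed deadline. If driver/DPC latency delays the write past that deadline, the caller sees exactly this "exited with zero status but no finished file" symptom even though the app finished cleanly. File name, deadline, and polling interval are all assumptions.

#include <chrono>
#include <filesystem>
#include <thread>

// Poll for the app's 'finished' marker until a fixed deadline expires.
// If system latency delays the file appearing past the deadline, the
// caller reports "no finished file" despite a clean exit.
bool wait_for_finished_file(const std::filesystem::path& marker,
                            std::chrono::milliseconds deadline)
{
    using clock = std::chrono::steady_clock;
    const auto give_up = clock::now() + deadline;
    while (clock::now() < give_up) {
        if (std::filesystem::exists(marker))
            return true;                                    // app really did finish
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    return false;                                           // spurious failure path
}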
2) Message boards : Number crunching : Question regarding graphics cards and picking workunits (Message 1612629)
Posted 9 days ago by Profile jason_gee
...(What the heck is it about MB that my NVIDIA dislikes? :)
...


As the MB Cuda application developer, I would like to know this as well. Looking at the mentioned invalids, I see spontaneous overflow results that the wingmen disagreed with (at a higher rate than might be expected from some faulty app being in circulation, etc.).

Since you mention Einstein and AP both seem fine, I would first compare loadings (single instance; GPU usage in e.g. EVGA Precision X or similar) and temperatures. We know that different applications push the devices differently, so there can be more pressure on VRAM, bus, the GPU itself, or temperatures. You're looking for anything that stands out as particularly different in the MB application compared to the others.

Generally speaking, Cuda MB is relatively high reliability due to its longer maturity. Single-instance compute efficiency on newer apps like AP and Einstein is a bit lower, so MB will push the envelope harder with multiple instances and a marginal factory voltage/overclock. With temperature concerns eliminated/controlled, one often effective means to correct a marginal factory overclock is a slight boost to core voltage (temperatures & reasonable safety permitting).

In the future, I'll be adding a throttle to Cuda MB, since going flat chat will never suit every system/user.
3) Message boards : Number crunching : @Pre-FERMI nVidia GPU users: Important warning (Message 1610259)
Posted 14 days ago by Profile jason_gee
FWIW, totally agree with both points of view here. I would only add that for some time there have been chaotic changes going on, no doubt with more still to be resolved, so in some cases 'right' answers may not yet exist.
4) Message boards : Number crunching : @Pre-FERMI nVidia GPU users: Important warning (Message 1609798)
Posted 15 days ago by Profile jason_gee
I don't yet think that they don't care. Rather, I think that it is not yet a priority to them. We will see.


I'd tend to agree. Cuda7 early alpha only just ended, then there'd be DirectX12/Win10, Maxwell midrange products, and 'a bazillion' 64 bit Denver Tegra K1 cores for the Google Nexus 9, absorbing dev resources along with silicon, not to mention the blindside by AMD with Mantle low-latency gaming drivers/apis in there to compete with.

Yeah, spread pretty thin I'd reckon. Not to say they shouldn't address the problem, or that they don't have plenty of resources to address it, though I know I'd have trouble justifying dedicating resources for no immediate return before a shareholder or board meeting ;)
5) Message boards : Number crunching : Panic Mode On (92) Server Problems? (Message 1609698)
Posted 15 days ago by Profile jason_gee
ey! something's alive! my 980 finally woke up :)
6) Message boards : Number crunching : Panic Mode On (92) Server Problems? (Message 1605892)
Posted 24 days ago by Profile jason_gee
Update on my Beta crunching: my i7 crunched through all the cuda32s and cuda42s (and a few cuda50s) it was sent, and got sent more 32s. Looking at the processing times, though, it seems to my amateur eye like the 42s were more efficient. Does the scheduler think differently than I do? (Probably...)

The Average processing rate (APR) shown on the Application details for host 66539 is 72.81 GFLOPS for CUDA32 and 64.93 GFLOPS for CUDA42. That's the basis on which the Scheduler logic considers CUDA32 faster.

However, the Scheduler choice of which to send has a random factor applied. It's based on the normal distribution so is usually small, and it is further scaled down by the number of completed tasks in the host average. The idea is that the host averages will get more reliable as more tasks are averaged in. But the host averages are calculated using exponential smoothing such that about half the average is based on the most recent 69 values, so they can actually vary considerably for GPU processing.

The "GFLOPS" are derived from the estimate of floating point operations the splitters provide. For SaH v7 those estimates are based on angle range (AR), and are a compromise between how AR affects processing on CPU or GPU. That compromise means that very high AR "shorties" provide a lower APR than normal midrange AR tasks when processed on GPU. It looks like all the CUDA42 tasks you got were shorties but CUDA32 has gotten a mix. Perhaps the next time the random factor makes the Scheduler send CUDA42 you'll get a batch of midrange AR tasks which will increase the APR for those.
Joe


A side note perhaps interesting to some: in addition to the averages being volatile with work mix and other aspects (ignoring any unmanaged hardware or app change), the estimate mechanism already contains several 'noisy' inputs with various offsets, scales, and variances in practice. Those sources of noise easily swamp the deliberate random factors, inducing chaotic behaviour and making the random offsets more or less redundant.
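
A toy model of the kind of exponentially smoothed average Joe describes above, plus a small randomized nudge, might look like the C++ sketch below. This is not the actual scheduler code: the smoothing weight is just chosen so that roughly half the average comes from the most recent ~69 samples, and the way the random factor scales down with the number of completed tasks is my assumption.

#include <cmath>
#include <random>

struct SmoothedAPR {
    double avg = 0.0;   // running GFLOPS estimate
    long   n   = 0;     // completed tasks folded in so far

    void add_sample(double gflops) {
        // Choose alpha so (1 - alpha)^69 ~= 0.5, i.e. about half the weight
        // sits in the most recent 69 samples: alpha ~= 0.01.
        const double alpha = 1.0 - std::pow(0.5, 1.0 / 69.0);
        avg = (n == 0) ? gflops : (1.0 - alpha) * avg + alpha * gflops;
        ++n;
    }

    // Normally distributed nudge, scaled down as more tasks are averaged in
    // (the scaling law here is illustrative, not BOINC's).
    double randomized(std::mt19937& rng) const {
        std::normal_distribution<double> noise(0.0, 1.0);
        const double scale = 0.1 / std::sqrt(1.0 + n);
        return avg * (1.0 + scale * noise(rng));
    }
};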
7) Message boards : Number crunching : Einstein@Home Locks PC (Message 1604888)
Posted 26 days ago by Profile jason_gee


I'm running 3 tasks on a GTX 770 with 2gb ram, same as I run under Seti. I gave up all overclocking and all 6 cores of the 970 are running around 70C at 100% load with a water cooler. I have 4gb ram which is the max for 32bit Win7 OS.

Bob B.

Running 3 Einstein tasks on that GPU while using only a 32-bit OS with your limited system memory is likely a bad combination and a possible cause of your problem. Turn the GPU down to 2 tasks and reserve at least 1 CPU core if you haven't already (two may be needed, though).

Why didn't you go with a 64-bit OS and at least twice the memory?

Cheers.


Hmm, yes, this mix of 32-bit Windows and this hardware is problematic for a host of accumulated reasons. These point toward the described problems/symptoms, before necessarily considering specific application(s).

Some bullet points (probably not comprehensive):
- Windows Vista/7 onwards moved to WDDM (the Windows Display Driver Model), which 'virtualises' video memory.
- The described system will not be able to utilise the full 4GiB host RAM plus 2GiB VRAM, since the 32-bit address space won't address that much combined (physically) [note: no physical address extension here...].
- Something (probably both host memory and the GPU VRAM) will be substantially capped.
- The kernel portion will be reserving around half of whatever's left, including for various driver crash recovery & caching purposes.
- When whatever VRAM is actually allowed becomes close to filled, the system will rely on virtual memory and page to disk.
- This in turn will induce delays prone to exposing system resource limits/timeouts and produce various kinds of failures.

All that considered, I would venture to guess that even running 2 instances of a demanding application may become unreliable. Most likely just switching to a 64-bit version of Windows would resolve the problem. Even so, 2GiB VRAM is a lot to mirror (as WDDM does) with only ~2GiB kernel space (leaving 2GiB for user applications, minus overheads).

Most likely, IMO, these mounting limitations go a long way toward explaining the GPU vendors' gradual move toward mostly 64-bit support for the newer cards. There simply aren't enough system resources available under 32-bit to support all the extra driver+OS heavy lifting in the background.
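
As rough back-of-envelope arithmetic only (round assumed figures, not measured values), the squeeze described above can be sketched in a few lines of C++: a 32-bit address map without PAE caps everything at 4 GiB, the default user/kernel split takes half, and WDDM's mirroring of VRAM into system-backed virtual memory eats most of what remains.

#include <cstdio>

int main() {
    const double GiB           = 1024.0 * 1024.0 * 1024.0;
    const double address_space = 4.0 * GiB;  // 32-bit limit, no PAE
    const double kernel_half   = 2.0 * GiB;  // default 2/2 user/kernel split
    const double vram_mirrored = 2.0 * GiB;  // WDDM backs VRAM with system virtual memory

    const double user_space    = address_space - kernel_half;
    const double left_for_apps = user_space - vram_mirrored;

    std::printf("user-mode space:      %.1f GiB\n", user_space / GiB);
    std::printf("after mirroring VRAM: %.1f GiB (minus app/driver overheads)\n",
                left_for_apps / GiB);
    return 0;
}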
8) Message boards : Number crunching : HDD Questions For The Elite - [RESOLVED] (Message 1603568)
Posted 21 Nov 2014 by Profile jason_gee
Why would this issue not show up months ago instead of a couple days ago?

It may be bad SATA cable or bad contact of cable 'pins' to DVD or to motherboard.

Some DVD drives (or motherboard controllers?) make the LED always ON when they are opened. (and the device may 'think' it is opened even when you see it as closed)
1. -> If you open/close the DVD drive what happens to the LED on the computer case?


I don't have my CD/DVD burner app running. There should be no "pending I/O operation" where the optical drive is concerned.

No need to have any special 'app', Windows itself may poll the DVD from time to time.
In 'Safely remove' icon (on Windows XP) I see my DVD drive listed for 'removal'


If I remove the cable, I will have to wait 6 to 9 hours to see if that will "fix" the issue.

2. -> So what? Do you use the DVD every day?


The issue goes away with a re-boot.

3. -> On every re-boot or is random?


1. If I have no disk in the optical drive, nothing happens with the access indicator light when it is opened and closed. If, however, I have a disk in the drive and close it, the light will flicker as it is looking for an appropriate app to access the disk.

2. No need to get so snarky. If you would have read and quoted that WHOLE line of text you would have read that I was going to run with the cable off today. As I am doing right now. See what happens when you do not read everything in a message?

3. The issue happens with each and every re-boot I have done since the issue was discovered 3 or 4 days ago.

Keep on BOINCing...! :)


When Windows updates are performed, *sometimes* (actually more often than not) they include a 'dotnet framework' (aka common language runtime (CLR)) security update. That mandates a rebuild of the native binaries, performed usually by a dotnet or CLR optimisation service. This service can consume anything from some small percentage to 100% of system resources, for a few seconds to several hours, depending on what you have installed. It will typically light the HDD light solid for the period, disappearing only after some substantial on-time with limited other activity.
9) Message boards : Number crunching : @Pre-FERMI nVidia GPU users: Important warning (Message 1601157)
Posted 14 Nov 2014 by Profile jason_gee
14 November 2014 1:21 am -- Kevin Kang
Hi Jacob, Sorry for update on this issue late. As noted in release notes, the R340 drivers will continue to support the Tesla generation of NVIDIA GPUs until April 1, 2016, and the NVIDIA support team will continue to address driver issues for these products in driver branches up to and including Release 340. However, future driver enhancements and optimizations in driver releases after Release 340 will not support these products. Our developer team is working on this issue actively for the future R340 driver release, we'll keep you posted once it has been fixed. Sorry for any inconvenience! Thanks, Kevin


This certainly concurs with the way I interpreted the existing Cuda documentation (not in a pure OpenCL context). We're at an inevitable juncture where upcoming OSes will require hardware features only available in Fermi+, mainly 64-bit addressing, and the emulations that brought the older generations along so far have become unwieldy (making the older cards crunch slower with each driver iteration).

The line makes practical sense for nVidia. It's just really unfortunate that the timing of these moves was near Maxwell release, resulting in the usual driver maturation problems converging with unrelated major changes. Could definitely have been done cleaner IMO, though I suspect there'll be some more growing pains yet.
10) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599328)
Posted 10 Nov 2014 by Profile jason_gee
Thanks Jason for the heads up. I will now withdraw my complaint. And we folks do appreciate what you ALL do to help develop code.


No Problems. I completely understand these issues draw odd looks (especially for example when Eric and I have dissected some things in news threads, lol).

Some of the best things come from 'messy minds', and in that state protocol sometimes just doesn't fit. Some of us try though ;)
11) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599324)
Posted 10 Nov 2014 by Profile jason_gee
As I have no horse in this race, Why would you guys discuss and fight over code in this forum?
Should not this public display of angst have been more appropriate in the Boinc developers thread or Beta or even PMs?


I'm sure a similar sentiment wasn't the entire point of Richard's initial response, but at least some part of it.

To be fair all around, sometimes as a developer it's difficult to find a sympathetic ear, despite something being 'obviously wrong'. I gather your own views would rather not see this side of development (a view which I happen to agree with mostly), however *sometimes* communications on a large and complex issue like this require breaking a few molds and 'rules'. On occasion something good can come from more public exposure.

[For example, I'd wager Raistmer had little or no idea that my control-systems-oriented creditNew research would have any relationship to this 'simple' problem. There's no forum for that, but 'Number crunching' does fit ;)]

[Edit:] I'll add that, from experience, the BOINC forum would be the wrong place to discuss this, and PMs are wholly inappropriate in development matters. If it had been kept to Lunatics I probably would not have seen it and had the opportunity to respond.
12) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599152)
Posted 9 Nov 2014 by Profile jason_gee
Then came the change, make it a three term (PID), which was an easy task as I'd already thought about that...


Funnily enough, in the dated 6.10.58 build of the BOINC client I run here, I replaced a portion of the task duration estimates with a three-term PID-driven mechanism. It's been working fine and adapting to significant local hardware and application changes without issue since 6.10.58 was current.

That's one of several approaches I'll be comparing models of for some of the server-side estimates for task scheduling (in addition to the client side).

Most likely the PID variant will yield to the slightly more sophisticated Kalman filter (or the extended version), which removes the need for tuning. There are other options to be compared (including the server's current dicey use of running sample averages), and areas where it's been suggested neither the PID nor the Kalman filter would be an optimal choice, but it's fun to see steady-state runtime estimates dial in to the second at times. That's better than required, so the simplest/smartest option will probably win out.
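
For the curious, a minimal sketch of the general idea (not the modified 6.10.58 client code): feed the relative error between estimated and actual runtime through a three-term controller, and use its output to correct a duration scaling factor. The gains here are illustrative and would need tuning, which is exactly the chore a Kalman-style estimator avoids.

// Three-term (PID) correction of task duration estimates -- a sketch only.
struct PidEstimator {
    double kp = 0.5, ki = 0.05, kd = 0.1;   // assumed gains
    double integral = 0.0, prev_error = 0.0;
    double scale = 1.0;                     // multiplier applied to raw estimates

    // Call once per completed task with the raw estimate and the actual runtime.
    void update(double estimated_s, double actual_s) {
        const double error = (actual_s - estimated_s) / estimated_s;  // relative error
        integral += error;
        const double derivative = error - prev_error;
        prev_error = error;
        scale *= 1.0 + (kp * error + ki * integral + kd * derivative);
        if (scale < 0.01) scale = 0.01;     // keep the correction sane
    }

    double estimate(double raw_estimate_s) const { return raw_estimate_s * scale; }
};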
13) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599146)
Posted 9 Nov 2014 by Profile jason_gee
For my 2 cents toward the original topic, this multiple device issue has at least 3 main relevant impacts on the (mostly non credit related) creditNew work I've been doing.
- Scheduling for task allocation to hosts: in the case of multiple disparate devices, the throughput used for requests by the client should be the aggregate sum of peak theoretical flops and a filtered efficiency (aggregated from separately tracked device.appversion local efficiencies), which would be dominantly a client-side refinement.
- Increasingly heterogeneous hosts, currently unsupported (again mostly a client-side concern; to some extent the server has all of the information it needs for its tasks, though it is underused, with small fragments missing or misused in places), and
- local (client) estimate scheduling

Considering those, which stand out as dominantly client-side concerns, I'd be wary of recommending increased server-side complexity, especially since the problem domain (our subjective observations of how well the scheduling works) is of little relevance to server/project-side goals and scope.

IOW, try to keep solutions close to the original problem source, rather than migrate them back into a problem domain which is already overly complicated by special exceptions and burdened by poor management.
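
To make the first point above concrete, here's a sketch of how a client might size its work requests for a host with multiple disparate devices: sum each device's peak theoretical FLOPS weighted by a separately tracked, filtered per-device/app-version efficiency. The structure and names are illustrative only, not BOINC's.

#include <vector>

struct DeviceAppVersion {
    double peak_flops;           // theoretical peak for this device
    double filtered_efficiency;  // 0..1, locally tracked and smoothed per device/app version
};

// Aggregate throughput the client would use when sizing work requests.
double aggregate_request_flops(const std::vector<DeviceAppVersion>& devs) {
    double total = 0.0;
    for (const auto& d : devs)
        total += d.peak_flops * d.filtered_efficiency;
    return total;
}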

My own work will undoubtedly result in recommendations mostly for client refinement, but definitely some server bulletproofing & simplification too (in support of the separate credit issues). It will reach a viable point for modelling heterogeneous hosts/applications & workloads in part 1.2, 'controllers', of the plan below.

That doesn't prevent anyone researching & developing other ways to address the limitations we've dealt with for so long. I'd suggest though that David's 'hands-off' approach to the problem may be at least in some small portion due to some of the other design issues not specifically relating to multiple devices. It may instead be recognition that the problem is a larger design one, relevant across more problems than just mixing disparate devices... (which it is.)

14) Message boards : Number crunching : @Pre-FERMI nVidia GPU users: Important warning (Message 1597374)
Posted 5 Nov 2014 by Profile jason_gee
Thomas Arnold wrote:
Hello, I need your insight and help.
I am using this Video card, NVIDIA GeForce GTX 260 (896MB) driver: 311.06 OpenCL: 1.0

In the past we have never had a problem but now we are receiving
Computation error running SETI@home v7 7.00 (cuda22)

We are not too familiar with much of the program but we support the efforts to run the data sets. Can you please tell me if we need to change something with our setup or will these errors clear themselves or just continue to build up in the task tab?

The driver is old enough so it doesn't have the issue which started this thread. I don't know why all SETI@home v7 7.00 windows_intelx86 (cuda22) and (cuda23) tasks are failing on your host 6648399, but it does very well on (cuda32), (cuda42), and (cuda50). Perhaps one of the CUDA experts here can figure out why the servers aren't sending tasks for the plan classes which work well.
Joe


Will likely be digging out the scheduler code again on the weekend, if someone doesn't beat me to it. No accumulated data for the app versions, plus a logic hole with respect to systematically issuing to all app versions while ignoring the error count & quota, seems to be along the lines of what's happening. [I'll need to start by checking whether that server code has changed in the last couple of months.]

For the host side, FWIW the application (2.2 & 2.3 plan classes) appears to not even be making it to device initialisation. That suggests to me the DLLs are somehow damaged, or the driver install has gone awry. I'd imagine a clean driver install (of a suitable known-good older version for this GPU) and a project reset may be in order.
15) Message boards : Number crunching : Phantom Triplets (Message 1592789)
Posted 27 Oct 2014 by Profile jason_gee
Yep, sounds good. It's these rare obscure issues with an otherwise perfectly running system & software that test the patience. Sleep sounds like a good idea.
16) Message boards : Number crunching : Phantom Triplets (Message 1592787)
Posted 27 Oct 2014 by Profile jason_gee
Here's a quick screenshot of my ~6 year old Core2 (driving a 980SC) DPC latencies, grabbed with DPC Latency checker [while crunching], to compare:

http://prntscr.com/506a9i

It took manual, piecewise forced-update replacement of various Intel chipset drivers, and a modified Atheros wireless driver, to get that, using 'LatencyMon' to identify the culprits:

http://www.thesycon.de/eng/latency_check.shtml
http://www.resplendence.com/latencymon
17) Message boards : Number crunching : Phantom Triplets (Message 1592785)
Posted 27 Oct 2014 by Profile jason_gee
That sounds a lot like DPC latency issues resulting in driver failsafe. Is Cuda multibeam the only app that runs on there? (different but related issue)

No, it runs AP tasks, too. It will run either 2 MBs or 1 MB + 1 AP on the GPU.


OK, you'll need to consult the AP app author.
As per my edit, I can't vouch for the thread safety of any app other than the x41zc builds; lack of thread safety can easily cause driver 'sticky downclocks', alongside the possible system driver DPC issues.
18) Message boards : Number crunching : Phantom Triplets (Message 1592781)
Posted 27 Oct 2014 by Profile jason_gee
That sounds a lot like DPC latency issues resulting in driver failsafe. Is Cuda multibeam the only app that runs on there? (A different but related issue: I can't vouch for the thread safety of any app other than the x41zc builds, lack of which can easily cause driver 'sticky downclocks'.)
19) Message boards : Number crunching : Phantom Triplets (Message 1589048)
Posted 20 Oct 2014 by Profile jason_gee
...leaves me wondering what I might actually be facing here...


Basically the same as the white-dot graphical artefacts you'd get overclocking a GPU for gaming, at which point backing off at least two 'notches' is the general corrective method. White dots in graphical glitches imply ~24-32 bits of saturation (bits flipped on), so yes, many more than a single bit flip, though typically tied to a single memory fetch.

As with overclocking, it sometimes happens with 'gaming grade' GPUs straight from the factory, due to price/performance market pressures and the non-critical nature of a rare white dot when binning parts and setting clocks for mid-range GPUs, where the competition is steepest.
20) Message boards : Number crunching : Phantom Triplets (Message 1588603)
Posted 18 Oct 2014 by Profile jason_gee
OK, a quick look through and those all manifested from the same chirp-fft pair
<fft_len>8</fft_len>
<chirp_rate>40.543618462507</chirp_rate>

and have the same minuscule mean power and calculated detection time. That makes them all part of the same 'event'.

Based on the nature of that glitch 'looking' much like a numerical version of a graphical artefact, I would suspect a few things. Through age or manufacturing concerns, that silicon may well be operating near capability in terms of frequency and/or heat. Presuming you've checked the latter, with no evidence of overheating etc., then I recommend either a ~0.05V core voltage bump or a ~20MHz frequency drop. Either would IMO compensate for a combination of the age of the GPU and mid-range performance/price cards being pushed to their frequency limits (with some number of acceptable artefacts) from the factory. Memory clock may also be an issue, though I would expect memory glitches to look more random than this; in this case there appears to be a quite specific borderline circuit.


