Posts by jason_gee


1) Message boards : Number crunching : Panic Mode On (92) Server Problems? (Message 1605892)
Posted 1 day ago by Profile jason_gee
Update on my Beta crunching: my i7 crunched through all the cuda32s and cuda42s (and a few cuda50s) it was sent, and got sent more 32s. Looking at the processing times, though, it seems to my amateur eye like the 42s were more efficient. Does the scheduler think differently than I do? (Probably...)

The Average processing rate (APR) shown on the Application details for host 66539 is 72.81 GFLOPS for CUDA32 and 64.93 GFLOPS for CUDA42. That's the basis on which the Scheduler logic considers CUDA32 faster.

However, the Scheduler choice of which to send has a random factor applied. It's based on the normal distribution so is usually small, and it is further scaled down by the number of completed tasks in the host average. The idea is that the host averages will get more reliable as more tasks are averaged in. But the host averages are calculated using exponential smoothing such that about half the average is based on the most recent 69 values, so they can actually vary considerably for GPU processing.

The "GFLOPS" are derived from the estimate of floating point operations the splitters provide. For SaH v7 those estimates are based on angle range (AR), and are a compromise between how AR affects processing on CPU or GPU. That compromise means that very high AR "shorties" provide a lower APR than normal midrange AR tasks when processed on GPU. It looks like all the CUDA42 tasks you got were shorties but CUDA32 has gotten a mix. Perhaps the next time the random factor makes the Scheduler send CUDA42 you'll get a batch of midrange AR tasks which will increase the APR for those.
Joe


A sidenote perhaps interesting to some: in addition to the averages being volatile with work mix (and blind to any unmanaged hardware or application change), the estimate mechanism already contains several 'noisy' inputs with various offsets, scales, and variances in practice. Those sources of noise easily swamp the deliberate random factors, inducing chaotic behaviour and making the random offsets more or less redundant.
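
To make the smoothing side of that concrete, here's a minimal sketch (illustrative Python, not BOINC's actual code; the constant is derived from the "69 values" figure in Joe's reply):

HALF_LIFE = 69                      # recent samples carrying half the total weight
ALPHA = 1 - 0.5 ** (1 / HALF_LIFE)  # ~0.01 per-sample smoothing factor

def update_apr(current_avg, sample):
    # Fold one new GFLOPS sample into the exponentially smoothed average.
    return (1 - ALPHA) * current_avg + ALPHA * sample

# A sustained batch of low-yield "shorties" moves the average halfway
# to the new level in exactly 69 tasks:
apr = 72.81
for _ in range(69):
    apr = update_apr(apr, 50.0)
print(round(apr, 2))                # ~61.4, halfway from 72.81 to 50

That's why a run of high-AR tasks can drag a GPU's APR around so visibly: a single batch the size of the half-life rewrites half the average.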
2) Message boards : Number crunching : Einstein@Home Locks PC (Message 1604888)
Posted 3 days ago by Profile jason_gee


I'm running 3 tasks on a GTX 770 with 2gb ram, same as I run under Seti. I gave up all overclocking and all 6 cores of the 970 are running around 70C at 100% load with a water cooler. I have 4gb ram which is the max for 32bit Win7 OS.

Bob B.

Running 3 Einstein tasks on that GPU under a 32-bit OS with your limited system memory is likely a bad combination and a possible cause of your problem. Turn the GPU down to 2 tasks and reserve at least 1 CPU core if you haven't already (two may be needed).

Why didn't you go with a 64-bit OS and at least twice the memory?

Cheers.


Hmm, yes, this mix of 32-bit Windows and this hardware is problematic for a host of accumulated reasons, which point toward the described problems/symptoms before even considering the specific application(s).

Some bullet points (probably not comprehensive):
- Windows Vista/7 onwards moved to WDDM (the Windows Display Driver Model), which 'virtualises' video memory.
- The described system will not be able to utilise the full 4GiB host RAM plus 2GiB VRAM, since the 32-bit address space can't physically address that much combined [note: no Physical Address Extension here...].
- Something (probably both host memory and GPU VRAM) will be substantially capped.
- The kernel portion will reserve around half of whatever's left, including for various driver crash recovery & caching purposes.
- When whatever VRAM is actually allowed comes close to filling, the system will fall back on virtual memory and page to disk.
- That in turn induces delays prone to exposing system resource limits/timeouts, producing various kinds of failures.

All that considered, I'd venture that even running 2 instances of a demanding application may become unreliable. Most likely, just switching to a 64-bit version of Windows would resolve the problem. Even then, 2GiB of VRAM is a lot to mirror (as WDDM does) with only ~2GiB of kernel space (leaving 2GiB for user applications, minus overheads).
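
For anyone wanting the back-of-envelope numbers behind that, a rough sketch (assumes the default 2GiB/2GiB kernel/user split on 32-bit Windows, no PAE, no /3GB switch):

GiB = 2 ** 30
ADDRESS_SPACE = 2 ** 32            # all a 32-bit OS can address: 4 GiB
KERNEL_SPACE = ADDRESS_SPACE // 2  # default kernel/user split on 32-bit Windows
USER_SPACE = ADDRESS_SPACE - KERNEL_SPACE

VRAM = 2 * GiB                     # the GTX 770's on-board memory
# WDDM keeps system-memory backing for video allocations, so VRAM in use
# competes with host RAM and application data for the same 4 GiB of addresses.
print(f"user space per process: {USER_SPACE / GiB:.0f} GiB")   # 2 GiB
print(f"VRAM WDDM may mirror:   {VRAM / GiB:.0f} GiB")         # 2 GiB
# Three concurrent GPU tasks' buffers leave very little headroom before the
# paging and timeout failures described above set in.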

Most likely, IMO, these mounting limitations go a long way toward explaining the GPU vendors' gradual move toward mostly 64-bit support for the newer cards. There simply aren't enough system resources available under 32 bit to support all the extra driver+OS heavy lifting in the background.
3) Message boards : Number crunching : HDD Questions For The Elite - [RESOLVED] (Message 1603568)
Posted 7 days ago by Profile jason_gee
Why would this issue not show up months ago instead of a couple days ago?

It may be a bad SATA cable, or bad contact of the cable 'pins' to the DVD drive or the motherboard.

Some DVD drives (or motherboard controllers?) keep the LED always ON while open (and the device may 'think' it is open even when it looks closed to you).
1. -> If you open/close the DVD drive, what happens to the LED on the computer case?


I don't have my CD/DVD burner app running. There should be no "pending I/O operation" where the optical drive is concerned.

No need to have any special 'app'; Windows itself may poll the DVD from time to time.
In the 'Safely remove' icon (on Windows XP) I see my DVD drive listed for 'removal'.


If I remove the cable, I will have to wait 6 to 9 hours to see if that will "fix" the issue.

2. -> So what? Do you use the DVD every day?


The issue goes away with a re-boot.

3. -> On every re-boot, or is it random?


1. If I have no disk in the optical drive, nothing happens with the access indicator light when it is opened and closed. If, however, I have a disk in the drive and close it, the light will flicker as the system looks for an appropriate app to access the disk.

2. No need to get so snarky. If you had read and quoted that WHOLE line of text, you would have seen that I was going to run with the cable off today. As I am doing right now. See what happens when you don't read everything in a message?

3. The issue happens with each and every re-boot I have done since the issue was discovered 3 or 4 days ago.

Keep on BOINCing...! :)


When Windows updates are performed, *sometimes* (actually more often than not) they include a .NET Framework (aka Common Language Runtime, CLR) security update. That mandates a rebuild of the native binaries, usually performed by a .NET/CLR optimisation service. This service can consume anywhere from a small percentage to 100% of system resources, for a few seconds to several hours, depending on what you have installed. It will typically light the HDD LED solid for the duration, disappearing only after some substantial on-time with limited other activity.
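
If you suspect that service is the culprit, you can drain its queue in one hit rather than waiting it out. A small sketch (ngen.exe's "executeQueuedItems" verb is a real command; this wrapper itself is hypothetical, and needs an elevated prompt):

import subprocess
from pathlib import Path

def drain_ngen_queue():
    # Run "ngen.exe executeQueuedItems" for each installed Framework version,
    # compiling everything waiting in the deferred queue now so the background
    # optimisation (and the solid HDD light) finishes sooner.
    for root in (Path(r"C:\Windows\Microsoft.NET\Framework"),
                 Path(r"C:\Windows\Microsoft.NET\Framework64")):
        if not root.exists():
            continue
        for ngen in sorted(root.glob("v*/ngen.exe")):
            subprocess.run([str(ngen), "executeQueuedItems"], check=False)

drain_ngen_queue()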
4) Message boards : Number crunching : @Pre-FERMI nVidia GPU users: Important warning (Message 1601157)
Posted 13 days ago by Profile jason_gee
14 November 2014 1:21 am -- Kevin Kang
Hi Jacob, sorry for the late update on this issue. As noted in the release notes, the R340 drivers will continue to support the Tesla generation of NVIDIA GPUs until April 1, 2016, and the NVIDIA support team will continue to address driver issues for these products in driver branches up to and including Release 340. However, future driver enhancements and optimizations in driver releases after Release 340 will not support these products. Our developer team is actively working on this issue for the future R340 driver release; we'll keep you posted once it has been fixed. Sorry for any inconvenience! Thanks, Kevin


This certainly concurs with the way I interpreted the existing Cuda documentation (not in a pure OpenCL context). We're at an inevitable juncture where upcoming OSes will require hardware features only available in Fermi and later, mainly 64-bit addressing, and the emulations that have brought the older generations along so far have become unwieldy (making the older cards crunch slower with each driver iteration).

The line makes practical sense for nVidia. It's just really unfortunate that the timing of these moves landed near the Maxwell release, so the usual driver maturation problems converged with unrelated major changes. It could definitely have been done more cleanly IMO, though I suspect there'll be some more growing pains yet.
5) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599328)
Posted 18 days ago by Profile jason_gee
Thanks Jason for the heads up. I will now withdraw my complaint. And we folks do appreciate what you ALL do to help develop code.


No problem. I completely understand that these issues draw odd looks (especially, for example, when Eric and I have dissected some things in news threads, lol).

Some of the best things come from 'messy minds', and in that state protocol sometimes just doesn't fit. Some of us try though ;)
6) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599324)
Posted 18 days ago by Profile jason_gee
As I have no horse in this race, why would you guys discuss and fight over code in this forum?
Should not this public display of angst have been more appropriate in the BOINC developers thread, or Beta, or even PMs?


I'm sure a similar sentiment wasn't the entire point of Richard's initial response, but it was at least some part of it.

To be fair all around, sometimes as a developer it's difficult to find a sympathetic ear, despite something being 'obviously wrong'. I gather your own view would rather not see this side of development (a view which I happen to mostly agree with); however, *sometimes* communications on a large and complex issue like this require breaking a few moulds and 'rules'. On occasion something good can come from more public exposure.

[For example, I'd wager Raistmer had little or no idea that my control-systems-oriented creditNew research would have any relationship to this 'simple' problem. There's no forum for that, but 'Number crunching' does fit ;)]

[Edit:] I'll add, from experience, that the BOINC forum would be the wrong place to discuss this, and PMs are wholly inappropriate for development matters. If it had been kept to Lunatics I probably would not have seen it and had the opportunity to respond.
7) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599152)
Posted 18 days ago by Profile jason_gee
Then came the change, make it a three term (PID), which was an easy task as I'd already thought about that...


Funnily enough, I modified the dated 6.10.58 build of the BOINC client I run here, replacing a portion of the task duration estimates with a three-term PID-driven mechanism. It's been working fine, adapting to significant local hardware and application changes without issue, since 6.10.58 was current.

That's one of several approaches I'll be comparing models of, for some of the server-side estimates for task scheduling (in addition to the client side).

Most likely the PID variant will yield to the slightly more sophisticated Kalman filter (or its extended version), which would remove the need for tuning. There are other options to be compared (including the server's current dicey use of running sample averages), and areas where it's been suggested neither the PID nor the Kalman filter would be an optimal choice, but it's fun to see steady-state runtime estimates dial in to the second at times. That's better than required, so the simplest/smartest option will probably win out.
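
For the curious, the client-side idea is along these lines (a minimal sketch with assumed names and gains, not the actual modified 6.10.58 code):

class PidDurationCorrection:
    # Maintains a multiplier applied to raw task duration estimates,
    # steered by the relative error between predicted and actual runtimes.
    def __init__(self, kp=0.5, ki=0.05, kd=0.1):
        self.kp, self.ki, self.kd = kp, ki, kd   # illustrative gains
        self.integral = 0.0
        self.prev_error = 0.0
        self.correction = 1.0

    def observe(self, predicted, actual):
        # Fold one completed task's runtime into the correction factor.
        error = (actual - predicted) / predicted
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        self.correction += (self.kp * error
                            + self.ki * self.integral
                            + self.kd * derivative)

    def estimate(self, raw_seconds):
        return raw_seconds * self.correction

A Kalman filter would replace the hand-tuned kp/ki/kd with gains derived from the measured noise statistics, which is the "no tuning" attraction mentioned above.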
8) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599146)
Posted 18 days ago by Profile jason_gee
For my 2 cents toward the original topic, this multiple-device issue has at least 3 main relevant impacts on the (mostly non-credit-related) creditNew work I've been doing:
- Scheduling for task allocation to hosts: with multiple disparate devices, the throughput the client uses for requests should be the aggregate sum of peak theoretical flops weighted by filtered efficiencies (aggregated from separately tracked device.appversion local efficiencies), which would be dominantly a client-side refinement (see the sketch after this list);
- increasingly heterogeneous hosts, currently unsupported (again mostly a client-side concern; to some extent the server has all the information it needs for its tasks, though it's underused, with small fragments missing or misused in places); and
- local (client) estimate scheduling.
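
A sketch of that first point (names and figures assumed, purely illustrative):

def aggregate_throughput(devices):
    # devices: iterable of (peak_flops, measured_efficiency) pairs,
    # one per device.appversion combination tracked locally.
    return sum(peak * eff for peak, eff in devices)

# e.g. a fast and a slow GPU with separately tracked efficiencies:
host = [(4.0e12, 0.22), (1.3e12, 0.30)]    # (peak FLOPS, filtered efficiency)
print(f"{aggregate_throughput(host) / 1e9:.0f} GFLOPS effective")   # ~1270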

Considering those, which stand out as dominantly client-side concerns, I'd be wary of recommending increased server-side complexity, especially since the problem domain (our subjective observations of how well the scheduling works) is of little relevance to server/project-side goals and scope.

IOW, try to keep solutions close to the original problem source, rather than migrate them back into a problem domain which is already overly complicated by special exceptions and burdened by poor management.

My own work will undoubtedly result in recommendations mostly for client refinement, but definitely some server bulletproofing & simplification too (in support of the separate credit issues). It will reach a viable point for modelling heterogeneous hosts/applications & workloads in part 1.2, 'controllers', of the plan below.

That doesn't prevent anyone researching & developing other ways to address the limitations we've dealt with for so long. I'd suggest though that David's 'hands-off' approach to the problem may be at least in some small portion due to some of the other design issues not specifically relating to multiple devices. It may instead be recognition that the problem is a larger design one, relevant across more problems than just mixing disparate devices... (which it is.)

9) Message boards : Number crunching : @Pre-FERMI nVidia GPU users: Important warning (Message 1597374)
Posted 22 days ago by Profile jason_gee
Thomas Arnold wrote:
Hello, I need your insight and help.
I am using this Video card, NVIDIA GeForce GTX 260 (896MB) driver: 311.06 OpenCL: 1.0

In the past we have never had a problem but now we are receiving
Computation error running SETI@home v7 7.00 (cuda22)

We are not too familiar with much of the program but we support the efforts to run the data sets. Can you please tell me if we need to change something with our setup or will these errors clear themselves or just continue to build up in the task tab?

The driver is old enough so it doesn't have the issue which started this thread. I don't know why all SETI@home v7 7.00 windows_intelx86 (cuda22) and (cuda23) tasks are failing on your host 6648399, but it does very well on (cuda32), (cuda42), and (cuda50). Perhaps one of the CUDA experts here can figure out why the servers aren't sending tasks for the plan classes which work well.
Joe


Will likely be digging out the scheduler code again on the weekend, if someone doesn't beat me to it. No accumulated data for the app versions, plus a logic hole with respect to systematically issuing to all app versions while ignoring the error count & quota, seems to be along the lines of what's happening. [I'll need to start by checking whether that server code has changed in the last couple of months.]

For the host side, FWIW the application (2.2 & 2.3 plan classes) appears not even to be making it to device initialisation. That suggests to me that the DLLs are somehow damaged, or the driver install has gone awry. I'd imagine a clean driver install (of a suitable known-good older version for this GPU) and a project reset may be in order.
10) Message boards : Number crunching : Phantom Triplets (Message 1592789)
Posted 27 Oct 2014 by Profile jason_gee
Yep, sounds good. It's these rare obscure issues with an otherwise perfectly running system & software that test the patience. Sleep sounds like a good idea.
11) Message boards : Number crunching : Phantom Triplets (Message 1592787)
Posted 27 Oct 2014 by Profile jason_gee
Here's a quick screenshot of my ~6 year old Core2 (driving a 980SC) DPC latencies, grabbed with DPC Latency checker [while crunching], to compare:

http://prntscr.com/506a9i

It took manual piecewise forced-update replacement of various Intel chipset drivers, and a modified Atheros wireless driver, to get that, using 'LatencyMon' to identify the culprits:

http://www.thesycon.de/eng/latency_check.shtml
http://www.resplendence.com/latencymon
12) Message boards : Number crunching : Phantom Triplets (Message 1592785)
Posted 27 Oct 2014 by Profile jason_gee
That sounds a lot like DPC latency issues resulting in driver failsafe. Is Cuda multibeam the only app that runs on there? (different but related issue)

No, it runs AP tasks, too. It will run either 2 MBs or 1 MB + 1 AP on the GPU.


OK, you'll need to consult the AP app author.
As per the edit, I can't vouch for the thread safety of any app other than x41zc builds, lack of which can easily cause driver 'sticky downclocks', alongside the possible system driver DPC issues.
13) Message boards : Number crunching : Phantom Triplets (Message 1592781)
Posted 27 Oct 2014 by Profile jason_gee
That sounds a lot like DPC latency issues resulting in driver failsafe. Is Cuda multibeam the only app that runs on there? (different but related issue; I can't vouch for the thread safety of any app other than x41zc builds, lack of which can easily cause driver 'sticky downclocks'.)
14) Message boards : Number crunching : Phantom Triplets (Message 1589048)
Posted 20 Oct 2014 by Profile jason_gee
...leaves me wondering what I might actually be facing here...


Basically the same as the white-dot graphical artefacts you'd get overclocking a GPU for gaming, at which point backing off at least two 'notches' is the general corrective method. White dots in graphical glitches imply ~24-32 bits of saturation (bits flipped on), so yes, many more than a single bit flip, though typically tied to a single memory fetch.

As with overclocking, it happens with 'gaming grade' GPUs sometimes from the factory, due to price/performance market pressures and the non-critical nature of a rare white dot when binning parts and setting clocks for mid-range GPUs, where the competition is steepest.
15) Message boards : Number crunching : Phantom Triplets (Message 1588603)
Posted 18 Oct 2014 by Profile jason_gee
OK, a quick look through and those all manifested from the same chirp-fft pair
<fft_len>8</fft_len>
<chirp_rate>40.543618462507</chirp_rate>

and have the same minuscule mean power and calculated detection time. That makes them all part of the same 'event'.

Based on the nature of that glitch 'looking' much like a numerical version of a graphical artefact, I would suspect a few things. Through age or manufacturing variance, that silicon may well be operating near its capability in terms of frequency and/or heat. Presuming you've checked the latter, with no evidence of overheating etc., I recommend either a ~0.05V core voltage bump or a ~20MHz core frequency drop. Either would IMO compensate for the combination of the GPU's age and mid-range performance/price cards being pushed to their frequency limits (with some number of acceptable artefacts) from the factory. Memory clock may also be an issue, though I would expect memory glitches to look more random than core-load-related ones. In this case there appears to be a quite specific borderline circuit.
16) Message boards : Number crunching : AP v7 Credit Impact (Message 1586178)
Posted 13 Oct 2014 by Profile jason_gee
... Not least of those is comparing the complexity of maintaining/patching the existing hodgepodge over wholesale reengineering.

As painful as constructing something from scratch is, it is often the best option, as trying to patch something that is badly broken often only leads to further problems.
Cosmetic problems are easily repaired. Some structural problems can be repaired, with considerable effort. However, when the core structure is significantly damaged, demolition & reconstruction is the best option.


Yep, that pretty much sums up the point we're at in many ways. There's a hiatus while I refamiliarise myself with some modern tools (such as Matlab and estimate-localisation techniques), and I know Richard's been busily gathering data and symptoms for some time. That will yield an idealised model for direct comparison against the existing core, giving us at least some scope for how much needs replacement, along with pros & cons. Frustratingly, what isn't feasible at the moment is any kind of timeline, though there have been benefits to the extended observation & research time nonetheless.
17) Message boards : Number crunching : AP v7 Credit Impact (Message 1586152)
Posted 13 Oct 2014 by Profile jason_gee
The credit system appears to be structurally flawed and needs a fresh look.

I was under the impression that Jason and others were doing exactly that.


A group of us were recently, tongue firmly in cheek, referred to as the 'CreditNew Vigilantes', and it's that group that's variously assessing code and system behaviour, and refreshing on control-systems engineering practices. My own contributions cover the last part of the system, headed toward full-scale modelling of the control systems, as should have been done before (but clearly wasn't, you find as you get deeper).

As things stand, the overall approaches and extensive study of the existing mechanism have been discussed and poked at in various ways, and we get closer to wholesale patching and systemic recommendations, though there are still quite a few finer points to be looked at in more detail as a refined model is constructed. Not least of those is comparing the complexity of maintaining/patching the existing hodgepodge over wholesale reengineering.
18) Message boards : Number crunching : AP v7 Credit Impact (Message 1586053)
Posted 13 Oct 2014 by Profile jason_gee
The simple fact is, it is taking Much longer on the same Hardware to gain the same credits as with MBv6. That is Not the case with APv7. The same task is slightly faster than with APv6.


Correct. APv7 has less processing than v6 (simpler blanking).

[Edit:] Blanking was always unpaid overhead though. Going from rough memory, AP credit (v6 or v7) should really be ~1000 credits, but I predict it will probably settle in the region of 300-400 credits, with substantial variance.
19) Message boards : Number crunching : AP v7 Credit Impact (Message 1586052)
Posted 13 Oct 2014 by Profile jason_gee
up a large number of VM's and running the base stock app.

I went over this with Jason a while back. All you would have to do is work outside of CreditFew with an EXTERNAL Correction.


Basically yes. Bernd (from Albert) has mentioned being in favour of a full gut and do-over, while Oliver has recommended (and I agree) tackling manageable subproblems. Drawing dividing lines and planning to deal with the spaghettification factor is where things are at, and that has ground to a crawl, so larger, highly modular mechanism transplants seem likely at this point.
20) Message boards : Number crunching : AP v7 Credit Impact (Message 1586050)
Posted 13 Oct 2014 by Profile jason_gee
Eric has explained before that anything he would do to try to equalize the credit to where it was before would get reverted by CreditNew.

In theory it seems like it would be possible for someone to externally change the base credit, by setting up a large number of VMs and running the base stock app.


Yeah, a lot of back and forth has been happening. The consensus from working with Albert on the issue, with occasional input from Eric, has been that there are a number of key issues, not least improper use of averages and quite a lot of spaghetti code that needs a transplant (validator and scheduler).


