Posts by jason_gee


1) Message boards : Number crunching : HDD Questions For The Elite - [RESOLVED] (Message 1603568)
Posted 2 days ago by Profile jason_gee
Why would this issue not show up months ago instead of a couple days ago?

It may be a bad SATA cable, or bad contact of the cable 'pins' to the DVD drive or the motherboard.

Some DVD drives (or motherboard controllers?) keep the LED always ON when they are open (and the device may 'think' it is open even when you see it as closed).
1. -> If you open/close the DVD drive what happens to the LED on the computer case?


I don't have my CD/DVD burner app running. There should be no "pending I/O operation" where the optical drive is concerned.

No need to have any special 'app'; Windows itself may poll the DVD from time to time.
In the 'Safely Remove' icon (on Windows XP) I see my DVD drive listed for 'removal'.


If I remove the cable, I will have to wait 6 to 9 hours to see if that will "fix" the issue.

2. -> So what? Do you use the DVD every day?


The issue goes away with a re-boot.

3. -> On every re-boot, or is it random?


1. If I have no disk in the optical drive, nothing happens with the access indicator light when it is opened and closed. If, however, I have a disk in the drive and close it, the light will flicker as it looks for an appropriate app to access the disk.

2. No need to get so snarky. If you had read and quoted that WHOLE line of text, you would have seen that I was going to run with the cable off today. As I am doing right now. See what happens when you do not read everything in a message?

3. The issue happens with each and every re-boot I have done since the issue was discovered 3 or 4 days ago.

Keep on BOINCing...! :)


When Windows updates are performed, *sometimes* (actually more often than not) they include a '.NET Framework' (aka common language runtime, CLR) security update. This mandates a rebuild of the native binaries, usually performed by a .NET/CLR optimisation service. That service can consume anywhere from a small percentage to 100% of system resources, for a few seconds to several hours, depending on what you have installed. It will typically light the HDD LED solid for the period, disappearing only after some substantial on time with limited other activity.
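
If you'd rather not wait that out, a minimal sketch along the following lines will force the queued native-image rebuilds to run immediately (an assumption-laden example: it presumes a .NET 4.x install under the default Framework paths and needs an elevated prompt; adjust for your own system):

# Hypothetical helper: drain the queued .NET native-image (ngen) work now,
# so the post-update CLR optimisation service finishes sooner.
import os
import subprocess

def run_queued_ngen_work():
    windir = os.environ.get("WINDIR", r"C:\Windows")
    for framework in ("Framework", "Framework64"):
        ngen = os.path.join(windir, "Microsoft.NET", framework,
                            "v4.0.30319", "ngen.exe")
        if os.path.exists(ngen):
            # 'executeQueuedItems' compiles everything the optimisation
            # service had deferred for background processing.
            subprocess.run([ngen, "executeQueuedItems"], check=False)

if __name__ == "__main__":
    run_queued_ngen_work()

Once that queue is empty, the HDD light should settle, since the optimisation service has nothing left to chew through.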
2) Message boards : Number crunching : @Pre-FERMI nVidia GPU users: Important warning (Message 1601157)
Posted 9 days ago by Profile jason_gee
14 November 2014 1:21 am -- Kevin Kang
Hi Jacob, Sorry for update on this issue late. As noted in release notes, the R340 drivers will continue to support the Tesla generation of NVIDIA GPUs until April 1, 2016, and the NVIDIA support team will continue to address driver issues for these products in driver branches up to and including Release 340. However, future driver enhancements and optimizations in driver releases after Release 340 will not support these products. Our developer team is working on this issue actively for the future R340 driver release, we'll keep you posted once it has been fixed. Sorry for any inconvenience! Thanks, Kevin


This certainly concurs with the way I interpreted the existing CUDA documentation (not in a pure OpenCL context). We're at an inevitable juncture where upcoming OSes will require hardware features only available in Fermi+, mainly 64-bit addressing, and the emulations that have brought the older generations along so far have become unwieldy (making the older cards crunch slower with each driver iteration).

The line makes practical sense for nVidia. It's just really unfortunate that the timing of these moves fell near the Maxwell release, resulting in the usual driver maturation problems converging with unrelated major changes. It could definitely have been done more cleanly IMO, though I suspect there'll be some more growing pains yet.
3) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599328)
Posted 13 days ago by Profile jason_gee
Thanks Jason for the heads up. I will now withdraw my complaint. And we folks do appreciate what you ALL do to help develop code.


No problem. I completely understand these issues draw odd looks (especially, for example, when Eric and I have dissected some things in news threads, lol).

Some of the best things come from 'messy minds', and in that state protocol sometimes just doesn't fit. Some of us try though ;)
4) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599324)
Posted 13 days ago by Profile jason_gee
As I have no horse in this race, why would you guys discuss and fight over code in this forum?
Should not this public display of angst have been more appropriate in the BOINC developers thread, or Beta, or even PMs?


I'm sure a similar sentiment wasn't the entire point of Richard's initial response, but it was at least some part of it.

To be fair all around, sometimes as a developer it's difficult to find a sympathetic ear, despite something being 'obviously wrong'. I gather your own view is that you'd rather not see this side of development (a view I happen to mostly agree with); however, *sometimes* communications on a large and complex issue like this require breaking a few moulds and 'rules'. On occasion something good can come from more public exposure.

[for example, I'd wager Raistmer had little or no idea that my control-systems-oriented CreditNew research would have any relationship to this 'simple' problem. There's no forum for that, but 'Number crunching' does fit ;) ]

[Edit:] I'll add that, from experience, the BOINC forum would be the wrong forum to discuss this, and PMs are wholly inappropriate for development matters. If it had been kept to Lunatics I probably would not have seen it and had the opportunity to respond.
5) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599152)
Posted 13 days ago by Profile jason_gee
Then came the change, make it a three term (PID), which was an easy task as I'd already thought about that...


Funnily enough, in the dated 6.10.58 build of the BOINC client I run here, I replaced a portion of the task duration estimates with a three-term PID driven mechanism. It's been working fine and adapting to significant local hardware and application changes without issue since 6.10.58 was current.

That's one of several approaches I'll be comparing models of for some of the server-side estimates used in task scheduling (in addition to the client side).

Most likely the PID variant will yield to the slightly more sophisticated Kalman filter (or extended version), which would remove the need for tuning. There are other options to be compared as well (including the server's current dicey use of running sample averages), and areas where it's been suggested neither the PID nor the Kalman would be an optimal choice, but it's fun to see steady-state runtime estimates dial in to the second at times. That's better than required, so simplest/smartest will probably win out.
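
For anyone curious what that looks like in practice, here's a minimal sketch (not the actual BOINC client code; the gain values are placeholders I've picked purely for illustration) of a three-term PID correction steering a task duration estimate toward observed runtimes:

# Toy three-term (PID) duration estimator: nudges the current estimate
# toward observed runtimes while damping reaction to one-off outliers.
class PidDurationEstimator:
    def __init__(self, initial_estimate, kp=0.3, ki=0.02, kd=0.1):
        self.estimate = initial_estimate
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, actual_runtime):
        # Error between the prediction and what the task actually took.
        error = actual_runtime - self.estimate
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        self.estimate += (self.kp * error
                          + self.ki * self.integral
                          + self.kd * derivative)
        return self.estimate

# Example: estimates pulling in toward a ~3600 s task after a change
# that invalidated the old 5000 s figure.
est = PidDurationEstimator(initial_estimate=5000.0)
for runtime in [3650, 3590, 3620, 3600, 3610, 3605]:
    print(round(est.update(runtime)))

The hard part in the real client is choosing gains that adapt quickly to genuine hardware or application changes without chasing every outlier, which is exactly why a self-tuning Kalman-style estimator is attractive.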
6) Message boards : Number crunching : BOINC not always reports faster GPU device... (Message 1599146)
Posted 13 days ago by Profile jason_gee
For my 2 cents toward the original topic, this multiple-device issue has at least 3 main relevant impacts on the (mostly non-credit-related) CreditNew work I've been doing:
- Scheduling for task allocation to hosts: in the case of multiple disparate devices, the throughput used for requests by the client should be the aggregate of peak theoretical flops and a filtered efficiency (aggregated from separately tracked device/app-version local efficiencies), which would be dominantly client-side refinements (see the sketch after this list);
- Increasingly heterogeneous hosts, currently unsupported (again mostly a client-side concern; to some extent the server has all of the information it needs for its tasks, though it's underused, with small fragments missing or misused in places); and
- local (client) estimate scheduling.
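
As a rough illustration of the first point, a minimal sketch (with made-up device figures, not anything lifted from the actual client or scheduler code) of aggregating peak flops weighted by per-device tracked efficiencies into one request figure:

# devices: list of (peak_theoretical_flops, tracked_efficiency) pairs,
# with the efficiency filtered per device/app-version on the client.
def aggregate_throughput(devices):
    return sum(peak * eff for peak, eff in devices)

# Example: a disparate host with a fast GPU, an older GPU and a CPU.
host = [
    (4.5e12, 0.20),   # fast GPU
    (0.9e12, 0.15),   # older GPU
    (0.1e12, 0.05),   # CPU
]
print(f"effective throughput for requests: {aggregate_throughput(host):.3e} FLOPS")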

Considering those, which stand out as dominantly client-side concerns, I'd be wary of recommending increased server-side complexity, especially since the problem domain (our subjective observations of how well the scheduling works) is of little relevance to server/project-side goals and scope.

IOW, try to keep solutions close to the original problem source, rather than migrate them back into a problem domain which is already overly complicated by special exceptions and burdened by poor management.

My own work will undoubtedly result in recommendations mostly for client refinement, but definitely some server bulletproofing & simplification too (in support of the separate credit issues). It will reach a viable point for modelling heterogeneous hosts/applications & workloads in part 1.2, 'controllers', of the plan below.

That doesn't prevent anyone researching & developing other ways to address the limitations we've dealt with for so long. I'd suggest though that David's 'hands-off' approach to the problem may be at least in some small portion due to some of the other design issues not specifically relating to multiple devices. It may instead be recognition that the problem is a larger design one, relevant across more problems than just mixing disparate devices... (which it is.)

7) Message boards : Number crunching : @Pre-FERMI nVidia GPU users: Important warning (Message 1597374)
Posted 17 days ago by Profile jason_gee
Thomas Arnold wrote:
Hello, I need your insight and help.
I am using this Video card, NVIDIA GeForce GTX 260 (896MB) driver: 311.06 OpenCL: 1.0

In the past we have never had a problem but now we are receiving
Computation error running SETI@home v7 7.00 (cuda22)

We are not too familiar with much of the program but we support the efforts to run the data sets. Can you please tell me if we need to change something with our setup or will these errors clear themselves or just continue to build up in the task tab?

The driver is old enough so it doesn't have the issue which started this thread. I don't know why all SETI@home v7 7.00 windows_intelx86 (cuda22) and (cuda23) tasks are failing on your host 6648399, but it does very well on (cuda32), (cuda42), and (cuda50). Perhaps one of the CUDA experts here can figure out why the servers aren't sending tasks for the plan classes which work well.
Joe


Will likely be digging out the scheduler code again on the weekend, if someone doesn't beat me to it. No accumulated data for the app versions, plus a logic hole with respect to systematically issuing to all app versions while ignoring the error count & quota, seems to be along the lines of what's happening. [I'll need to start by checking whether that server code's been changed in the last couple of months.]

For the host side, FWIW the application (2.2 & 2.3 plan classes) appears to not even be making it to device initialisation. That suggests to me the DLLs are somehow damaged, or the driver install has gone awry. I'd imagine a clean driver install (of a suitable known-good older version for this GPU) and a project reset may be in order.
8) Message boards : Number crunching : Phantom Triplets (Message 1592789)
Posted 27 days ago by Profile jason_gee
Yep, sounds good. It's these rare obscure issues with an otherwise perfectly running system & software that test the patience. Sleep sounds like a good idea.
9) Message boards : Number crunching : Phantom Triplets (Message 1592787)
Posted 27 days ago by Profile jason_gee
Here's a quick screenshot of my ~6 year old Core2 (driving a 980SC) DPC latencies, grabbed with DPC Latency checker [while crunching], to compare:

http://prntscr.com/506a9i

It took manual piecewise forced update/replacement of various Intel chipset drivers, and a modified Atheros wireless driver, to get that, using 'LatencyMon' to identify the culprits:

http://www.thesycon.de/eng/latency_check.shtml
http://www.resplendence.com/latencymon
10) Message boards : Number crunching : Phantom Triplets (Message 1592785)
Posted 27 days ago by Profile jason_gee
That sounds a lot like DPC latency issues resulting in driver failsafe. Is CUDA MultiBeam the only app that runs on there? (different but related issue)

No, it runs AP tasks, too. It will run either 2 MBs or 1 MB + 1 AP on the GPU.


OK, you'll need to consult the AP app author.
As per the edit, I can't vouch for the thread safety of any app other than the x41zc builds; lack of thread safety can easily cause driver 'sticky downclocks', alongside the possible system driver DPC issues.
11) Message boards : Number crunching : Phantom Triplets (Message 1592781)
Posted 27 days ago by Profile jason_gee
That sounds a lot like DPC latency issues resulting in driver failsafe. Is CUDA MultiBeam the only app that runs on there? (A different but related issue: I can't vouch for the thread safety of any app other than the x41zc builds, lack of which can easily cause driver 'sticky downclocks'.)
12) Message boards : Number crunching : Phantom Triplets (Message 1589048)
Posted 20 Oct 2014 by Profile jason_gee
...leaves me wondering what I might actually be facing here...


Basically the same as the white-dot graphical artefacts you'd get overclocking a GPU for gaming, at which point backing off at least two 'notches' is the general corrective method. White dots in graphical glitches imply ~24-32 bits of saturation (bits flipped to on), so yeah, many more than a single bit flip, though typically tied to a single memory fetch.

As with overclocking, it sometimes happens with 'gaming grade' GPUs straight from the factory, due to price/performance market pressures and the non-critical nature of a rare white dot when binning parts and setting clocks for mid-range GPUs, where the competition is steepest.
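
To put a number on the saturation point above, a tiny illustration (assuming 8-bit-per-channel RGBA packing, purely for demonstration):

# A dark pixel whose colour bits get forced on by a bad fetch reads back
# as full intensity, which is why the glitch shows as a white dot.
original  = 0x1A2B3CFF              # R=0x1A, G=0x2B, B=0x3C, A=0xFF
corrupted = original | 0xFFFFFF00   # ~24 colour bits saturated to 1
r = (corrupted >> 24) & 0xFF
g = (corrupted >> 16) & 0xFF
b = (corrupted >> 8) & 0xFF
print(r, g, b)                      # 255 255 255 -> white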
13) Message boards : Number crunching : Phantom Triplets (Message 1588603)
Posted 18 Oct 2014 by Profile jason_gee
OK, a quick look through and those all manifested from the same chirp-fft pair
<fft_len>8</fft_len>
<chirp_rate>40.543618462507</chirp_rate>

and have the same minuscule mean power and calculated detection time. That makes them all part of the same 'event'.

Based on the nature of that glitch 'looking' much like a numerical version of a graphical artefact, I would suspect a few things. Through age or manufacturing concerns, that silicon may well be operating near capability in terms of frequency &/or heat. Presuming you've checked the latter, with no evidence of overheating etc., then I recommend either a ~0.05V core voltage bump, or a ~20MHz frequency drop. Either would IMO compensate for a combination of the age of the GPU, and mid-range performance/price cards being pushed to the frequency limits (with some number of acceptable artefacts) from the factory. Memory clock may also be an issue, though I would expect the nature to be more random with memory glitches than with core loading. In this case there appears to be a quite specific borderline circuit.
14) Message boards : Number crunching : AP v7 Credit Impact (Message 1586178)
Posted 13 Oct 2014 by Profile jason_gee
... Not least of those is comparing the complexity of maintaining/patching the existing hodgepodge over wholesale reengineering.

As painful as constructing something from scratch is, often it is the best option to take as trying to patch something that is badly broken often only leads to further problems.
Cosmetic problems are easily repaired. Some structural problems can be repaired, with considerable effort. However when the core structure is significantly damaged, demolition & reconstruction is the best option.


Yep, pretty much sums up the point we're at in many ways. There's a hiatus while I refamiliarise with some modern tools (such as Matlab and estimate localisation techniques), and I know Richard's been busily gathering data and symptoms for some time. That will yield an idealised model for direct comparison against the existing core, giving us at least some scope for how much needs replacement, along with the pros & cons. Frustratingly, what isn't feasible at the moment is any kind of timeline, though there have been benefits to the extended observation & research time nonetheless.
15) Message boards : Number crunching : AP v7 Credit Impact (Message 1586152)
Posted 13 Oct 2014 by Profile jason_gee
The credit system appears to be structurally flawed and needs a fresh new look at.

I was under the impression that Jason and others were doing exactly that.


A group of us were recently, tongue firmly in cheek, referred to as the 'CreditNew Vigilantes', and it's that group that's variously assessing code, system behaviour, and refreshing on control systems engineering practices. My own contributions cover the last part of the system, headed toward full-scale modelling of the control systems as should have been done before (but clearly wasn't, you find as you get deeper).

As things stand, the overall approaches and extensive study of the existing mechanism have been discussed and poked at in various ways, and we get closer to wholesale patching and systemic recommendations, though there are still quite a few finer points to be looked at in more detail as a refined model is constructed. Not least of those is comparing the complexity of maintaining/patching the existing hodgepodge over wholesale reengineering.
16) Message boards : Number crunching : AP v7 Credit Impact (Message 1586053)
Posted 13 Oct 2014 by Profile jason_gee
The simple fact is, it is taking Much longer on the same Hardware to gain the same credits as with MBv6. That is Not the case with APv7. The same task is slightly faster than with APv6.


Correct. APv7 has less processing than v6 (simpler blanking).

[Edit:] blanking was always unpaid overhead though. Going from rough memory, AP credit (v6 or v7) should really be ~1000 credits, but I predict it will probably settle in the region of 300-400 credits, with substantial variance.
17) Message boards : Number crunching : AP v7 Credit Impact (Message 1586052)
Posted 13 Oct 2014 by Profile jason_gee
up a large number of VM's and running the base stock app.

I went over this with Jason a while back. All you would have to do is work outside of CreditFew with an EXTERNAL Correction.


Basically yes. Bernd (from Albert) has mentioned being in favour of a full gut and do-over, while Oliver has recommended (and I agree) that manageable subproblems be tackled. Drawing the dividing lines and planning around the spaghettification factor is where things are at, and have ground to a crawl, so larger, highly modular mechanism transplants seem likely at this point.
18) Message boards : Number crunching : AP v7 Credit Impact (Message 1586050)
Posted 13 Oct 2014 by Profile jason_gee
Eric has explained before that anything he would do to try to equalise the credit to where it was before would get reverted by CreditNew.

In theory it seems like it would be possible for someone to externally change the base credit, by setting up a large number of VMs and running the base stock app.


Yeah, a lot of back and forth has been happening. The consensus from working with Albert on the issue, with occasional input from Eric, has been that there are a number of key issues, not least being improper use of averages and quite a lot of spaghetti code that needs a transplant (validator and scheduler).
19) Message boards : Number crunching : AP v7 Credit Impact (Message 1586047)
Posted 13 Oct 2014 by Profile jason_gee
The main reason MBv7 suffered such a RAC drop was due to the GPU App being much Slower than MBv6.


Actually, the main reason MBv7 dropped relative to v6 is that the baseline stock CPU application gained AVX (SIMD) functionality, but the peak flops figures used to scale the efficiencies use BOINC's Whetstone, which is a serial FPU measure.

For the ratios, compare BOINC's Whetstone to SiSoft Sandra Lite's single-threaded FPU Whetstone, then the SSE/SSE2 and AVX single-threaded benchmarks. Then you have two discrete credit drops, one during MBv5-v6, and the other with MBv7.

You're going to have to be more convincing than that. I have here example #1, an old NV8800 Veteran of the MB wars, http://setiathome.berkeley.edu/results.php?hostid=6813106
That Card ran a MBv6 Shorty in just over 4 minutes and netted in the low 30s. Now it takes over 16 minutes to complete the same Shorty. Pretty strong evidence if you ask me...


A 'shorty' is no longer the same amount of work, since it contains autocorrelations, so it's not comparing apples with apples. Shorties, as with other angle ranges of course, should have received more credit, rather than less, due to the added work.

'Correct' Credit award for a MB7 VHAR, according to the Cobblestone scale, is ~90+ Credits.


There are certainly inefficiencies in the autocorrelation implementation; however, the dominant artefacts are from the application used for credit normalisation, which is the stock CPU app, with a ~3.3x effective underclaim. Fixing that gives you back half your MBv7 effective drop, plus some bonus for the drop that preceded GPU implementations.
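
As a rough worked example of that underclaim (the numbers below are illustrative placeholders only, not measured figures from any host):

# If the stock CPU app used as the normalisation reference actually runs at
# AVX-level throughput, but efficiency is scaled against BOINC's serial-FPU
# Whetstone, the reference looks ~3.3x "too efficient" and the scaled award
# drops by roughly that factor.
serial_whetstone_gflops = 3.0    # what BOINC's benchmark reports per core
avx_effective_gflops    = 10.0   # what the AVX-enabled stock app delivers

underclaim = avx_effective_gflops / serial_whetstone_gflops
cobblestone_scale_vhar = 90.0    # the ~90+ credit figure mentioned above
print(f"underclaim ~{underclaim:.1f}x, "
      f"normalised award ~{cobblestone_scale_vhar / underclaim:.0f} credits")

Placeholder numbers, but they show how a serial-benchmark reference pulls the scaled award well below the Cobblestone-scale value.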

Future optimisations may provide some improvement also, though the options are dwindling for that particular generation of GPU as well, which struggles with the required large Fourier transforms.
20) Message boards : Number crunching : AP v7 Credit Impact (Message 1586042)
Posted 13 Oct 2014 by Profile jason_gee
The main reason MBv7 suffered such a RAC drop was due to the GPU App being much Slower than MBv6.


Actually, the main reason MBv7 dropped relative to v6 is that the baseline stock CPU application gained AVX (SIMD) functionality, but the peak flops figures used to scale the efficiencies use BOINC's Whetstone, which is a serial FPU measure.

For the ratios, compare BOINC's Whetstone to SiSoft Sandra Lite's single-threaded FPU Whetstone, then the SSE/SSE2 and AVX single-threaded benchmarks. Then you have two discrete credit drops, one during MBv5-v6, and the other with MBv7.


