BOINC not always reports faster GPU device...
Author | Message |
---|---|
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Example. My host has a GSO9600 and a GT9400. The GSO9600 is obviously much faster, but init_data.xml reported only two GT 9400 devices to the app: <coproc_cuda> <count>2</count> <name>GeForce 9400 GT</name> <available_ram>501481472.000000</available_ram> <have_cuda>1</have_cuda> <have_opencl>1</have_opencl> |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
BOINC decides which GPU is best based on these factors, in decreasing priority: - compute capability - software version - available memory - speed http://boinc.berkeley.edu/dev/forum_thread.php?id=7899&postid=45886 Since the 9400 GT has more memory available, BOINC classes it as more capable: <core_client_version>7.2.33</core_client_version> Claggy |
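The priority order listed in this post can be sketched as a lexicographic comparison. This is a minimal Python sketch, not BOINC's actual code: the `Gpu` class, its field names, the equal compute capability and driver version, and every number except the 9400 GT's reported available RAM are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    compute_capability: tuple  # (major, minor); assumed equal for both cards
    driver_version: int        # stands in for "software version" (hypothetical)
    available_ram: float       # bytes
    peak_flops: float          # hypothetical speed figure


def capability_key(gpu):
    # Tuples compare element by element, so earlier fields take priority:
    # compute capability, then software version, then memory, then speed.
    return (gpu.compute_capability, gpu.driver_version,
            gpu.available_ram, gpu.peak_flops)


def most_capable(gpus):
    return max(gpus, key=capability_key)


# Only the 9400 GT's available_ram comes from the thread; the rest is made up.
gt9400 = Gpu("GeForce 9400 GT", (1, 1), 33723, 501481472.0, 45e9)
gso9600 = Gpu("GeForce 9600 GSO", (1, 1), 33723, 400000000.0, 250e9)

best = most_capable([gt9400, gso9600])
fastest = max([gt9400, gso9600], key=lambda g: g.peak_flops)
print(best.name, "|", fastest.name)
```

With memory ranked above speed, the slower card wins whenever the first two fields tie, which is the behaviour this thread complains about.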
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
And this particular example perfectly illustrates why BOINC's approach is the wrong one. Without the <use_all_gpus> switch in cc_config, it would ignore the faster and definitely more capable device for computations. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Also, reporting two devices as the one with the biggest memory has another fundamental flaw. If any real scientific app depended on BOINC for detecting device capabilities (I avoid this as much as I can, so this does not apply to the SETI OpenCL apps), it would treat both cards as having ~512MB of memory and might adjust the task for that amount. The task would then fail on the much faster but smaller-memory GSO9600. Nice approach, nothing to say... |
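The failure mode described in this post can be sketched in a few lines. This is a hypothetical Python illustration, not code from any SETI or BOINC application: `plan_buffer_bytes`, the 90% headroom factor, and the 384 MB figure for the smaller card are assumptions; only the ~512MB reported value comes from the post.

```python
MB = 1024 * 1024


def plan_buffer_bytes(reported_ram, headroom=0.9):
    # An app that trusts the single reported value sizes its working set from it.
    return int(reported_ram * headroom)


def fits(buffer_bytes, actual_ram):
    # The allocation only succeeds if the real device has that much memory.
    return buffer_bytes <= actual_ram


reported_ram = 512 * MB                 # one value reported for both devices
actual_ram = {"GT9400": 512 * MB,       # matches the report
              "GSO9600": 384 * MB}      # assumed smaller memory (hypothetical)

# Sizing from the shared reported value: the smaller card cannot hold the buffer.
buffer_bytes = plan_buffer_bytes(reported_ram)
results = {name: fits(buffer_bytes, ram) for name, ram in actual_ram.items()}
print(results)

# Sizing from each device's own memory avoids the mismatch.
per_device = {name: fits(plan_buffer_bytes(ram), ram)
              for name, ram in actual_ram.items()}
print(per_device)
```

This is why an app that queries each device itself, rather than relying on the single reported entry, is not affected.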
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Also, reporting 2 devices as one with biggest memory has another fundamental flaw. The current BOINC design is that it detects all the NV/ATI/Intel GPUs, then uses the most capable from each vendor, and only uses multiple GPUs from the same vendor if they are the same, or close to being the same. Then it passes only the most capable GPU(s) to the project, and the project bases what work is sent on what the most capable GPU is. If you add GPUs from the same vendor that are different in some way, and use <use_all_gpus> to enable them, then you're working outside the current design. DA has already said that it's a lot of work to change this design; who knows when it'll happen. Ask him. Claggy |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
Also, reporting 2 devices as one with biggest memory has another fundamental flaw. And in his keynote 'history of BOINC' talk to the BOINC Workshop in Budapest, six weeks ago, he acknowledged that this design decision was a mistake*, and would be re-worked eventually - as I told Raistmer, the day he said it, via the Lunatics site. In addition to the 2012 BOINC dev thread which Claggy linked, I quoted (and sourced) some of the 'decide between GPUs' code here in early 2011: message 1085712. The decision is based on a broader concept of 'capability', rather than raw speed. In short, this is old - very old - news, and I see no point in bringing it up on a project message board in apocalyptic tones, as if some disaster had just struck. Raistmer and David share a surprising number of programming traits. * Neither necessarily chooses the optimal path when they start a new venture * Both tend to start coding before they've finished designing * Both write spaghetti code which it is hard for anybody else to read * Neither likes to be called back to revise something after they've moved on * Both make multiple changes at once, merging new features with bugfixes * Neither likes to be criticised in public I'd like to see this BOINC 'feature' changed, too - but (especially in view of that last comment), I don't think this thread is the best way of achieving that desired result. * reference: http://boinc.berkeley.edu/trac/attachment/wiki/WorkShop14/workshop_14.pdf, slide 52. Reflections on software: things we need to change |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Sorry, Richard, I'm afraid this is wishful thinking on your part, especially the last one. Is it a problem that I'm almost always the first to create the corresponding "issues and errors" thread for my own builds? Well, maybe you are just defending BOINC by pointing bones here, no? 1) What non-optimal path do you mean? Examples, please; I would like to correct that path if it really is not optimal. 2) Yep, the details of a design show themselves when real coding starts. The devil is in the details, as is well known, so the design gets corrected while coding. That's OK, and that's not the issue. The issue is when a bad design persists. 3) Really, do you read my code a lot? A pity I get almost no feedback, then. Try actually reading it; you'll find lots of comments inside :) 4) Yep, it happens, especially when I'm quite sure that the changes are isolated ones. More branching would be better, indeed, but it takes additional time to merge back, and I have no paid staff to do that. 5) LoL, criticise me in public, no problem, but with facts; constructive criticism, please. I should note that constructive criticism leads to bug fixing. I like bug reports, real bug reports of course, not straightening out others' hands day after day - that I like to do only rarely ;D And back to the topic. Yep, "treat different devices as the same" is surely an obviously bad approach; there is nothing even to discuss here. But my current topic is not about that (this design decision is too deep inside the BOINC architecture to solve quickly and with little blood). My thread is about another small but quite inconvenient decision, to rank memory amount above device speed, and how easily it can lead to issues, with a list of those issues. What are you defending here by pointing bones at me? The BOINC design? What for? Even if you see some mistake in my own creations, does that mean I should never spot and speak about any errors in BOINC? What do your "similarities" have to do with this thread??? P.S. 
Ah, sorry, I missed the "does not like to come back" similarity. Yep, it's hard to come back, especially if you are not the owner of an eidetic memory; you got me there :) But there is quite a big step between "don't like" and "never do". |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
And regarding the coprocessor model - I take this field too close to heart because I proposed the required changes so many years ago... If I recall correctly, I even filed a formal "bug report/feature request" in Trac... But again, this particular thread is not about the coprocessor model; it's about just another issue inside the coprocessor model. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
My purpose in posting in this thread was to provoke a response, and hence start a discussion. "What is the best - most effective - way of getting improvements made to the BOINC code"... ... bearing in mind all the constraints of time, personnel, and psychology that get in the way. The last (very small) bug that I reported was fixed by David within an hour. You complain that the bugs you report never get any attention (example - though a false one, as it turned out). I do suggest that paying more care and attention to analysing who is responsible for an area of design, and addressing yourself to the right person in a timely and constructive manner, is more likely to get positive results. On the occasions when I've lost my temper and antagonised David (yes, it has happened), it has in general been counterproductive. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
I'm afraid a response was indeed provoked, but one that would not lead to any useful discussion, at least from my side. The issue is described. Whoever is interested may continue. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
* Neither likes to be criticised in public QED. Point made. Next time you want David to change something, post where he will read it - and put yourself in his shoes, before you write it. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
* Neither likes to be criticised in public I don't want that; I'm fine with this issue as it is because I use the corresponding flag in cc_config.xml. But I think the issue is worth recording in case someone stumbles on it and gets trapped. What I want fixed, I write to the BOINC dev list. With almost the same outcome as writing on the wall of a house in Moscow :P |
rob smith Send message Joined: 7 Mar 03 Posts: 22219 Credit: 416,307,556 RAC: 380 |
Neither necessarily chooses the optimal path when they start a new venture I doubt that either (or even both in concert) would ever reach the standards of spaghetti code that I had to sort out a few years ago... The task: "Just test these few lines of ASM86", yes, 8086 assembler, only about 40 lines or so, no big deal. But these 40 lines had something like 30 immediate jumps to fairly large chunks of code, each of which had multiple jumps, some computed and others direct; all told about 12000 lines of code, tied up like a large bowl of spaghetti - the "cook" who brewed this lot up couldn't (wouldn't?) see there was any problem until I threw the mess at a code tracer..... Re-written, this PI controller was more robust (no random crashes), and much smoother and more predictable... Then came the change, make it a three term (PID), which was an easy task as I'd already thought about that... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
For my 2 cents toward the original topic, this multiple device issue has at least 3 main relevant impacts on the (mostly non credit related) creditNew work I've been doing. - Scheduling for task allocation to hosts: in the case of multiple disparate devices, the throughput used for requests by the client should be the aggregate sum of peak theoretical flops, and a filtered efficiency (aggregated from separately tracked device.appversion local efficiencies), which would be dominantly client side refinements. - Increasingly heterogeneous hosts, currently unsupported (again mostly a client side concern; to some extent the server has all of the information it needs for its tasks, though it is underused, with small fragments missing or misused in places), and - local (client) estimate scheduling. Considering those, which stand out as dominantly client side concerns, I'd be wary of recommending increased server side complexity, especially since the problem domain (our subjective observations of how well the scheduling works) is of little relevance to server/project side goals and scope. IOW, try to keep solutions close to the original problem source, rather than migrate them back into a problem domain which is already overly complicated by special exceptions and burdened by poor management. My own work will undoubtedly result in recommendations mostly for client refinement, but definitely some server bulletproofing & simplification too (in support of the separate credit issues). This will reach a viable point to model heterogeneous hosts/application & workloads, in part 1.2 - 'controllers', of the plan below. That doesn't prevent anyone researching & developing other ways to address the limitations we've dealt with for so long. I'd suggest though that David's 'hands-off' approach to the problem may be at least in some small portion due to some of the other design issues not specifically relating to multiple devices. 
It may instead be recognition that the problem is a larger design one, relevant across more problems than just mixing disparate devices... (which it is.) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
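One possible reading of the first bullet in this post is that each device's peak theoretical flops is scaled by a separately tracked, filtered efficiency and the results are summed. The Python sketch below follows that reading only as an illustration: the post does not specify the filter or the aggregation, so the exponential smoothing, the function names, and all numbers are assumptions.

```python
def update_efficiency(previous, observed, alpha=0.1):
    # Exponential smoothing as a stand-in filter; the real filter is not specified.
    return (1.0 - alpha) * previous + alpha * observed


def aggregate_throughput(devices):
    # Sum of each device's peak theoretical flops scaled by its filtered efficiency.
    return sum(d["peak_flops"] * d["efficiency"] for d in devices)


# Hypothetical host with two disparate devices, each tracked separately.
devices = [
    {"name": "fast_gpu", "peak_flops": 100e9, "efficiency": 0.5},
    {"name": "slow_gpu", "peak_flops": 50e9, "efficiency": 0.2},
]

total = aggregate_throughput(devices)
smoothed = update_efficiency(previous=0.5, observed=1.0)
print(total, smoothed)
```

Tracking efficiency per device and application version keeps a slow card from dragging down, or a fast card from inflating, the figure used for work requests.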
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Then came the change, make it a three term (PID), which was an easy task as I'd already thought about that... Funnily enough, I modified the dated 6.10.58 build of the BOINC client I run here, replacing a portion of the task duration estimates with a three term PID driven mechanism. It's been working fine and adapting to significant local hardware and application changes without issue since 6.10.58 was current. That's one of several approaches I'll be comparing models of, for some of the server side estimates for task scheduling (in addition to client). Most likely the PID variant will yield to the slightly more sophisticated Kalman filter (or extended version), which would remove the need for tuning. There are other options that are going to be compared (including the server's current dicey use of running sample averages), and areas where it's been suggested neither the PID nor Kalman would be an optimal choice, but it's fun to see steady state runtime estimates dial in to the second at times. That's better than required, so simplest/smartest will probably win out. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
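A three term (PID) correction of runtime estimates can be sketched as below. This is a generic textbook-style illustration in Python, not the modification made to the 6.10.58 client: the class name, the gains, and the sample runtimes are all assumptions.

```python
class PidEstimator:
    """Tracks a runtime estimate and corrects it from observed runtimes."""

    def __init__(self, initial_estimate, kp=0.5, ki=0.1, kd=0.05):
        self.estimate = initial_estimate
        self.kp, self.ki, self.kd = kp, ki, kd   # hypothetical gains
        self.integral = 0.0
        self.previous_error = 0.0

    def update(self, observed_runtime):
        # Error between what the task actually took and what was predicted.
        error = observed_runtime - self.estimate
        self.integral += error
        derivative = error - self.previous_error
        self.previous_error = error
        # Proportional, integral and derivative terms adjust the estimate.
        self.estimate += (self.kp * error
                          + self.ki * self.integral
                          + self.kd * derivative)
        return self.estimate


estimator = PidEstimator(initial_estimate=1000.0)
for _ in range(50):
    estimator.update(600.0)   # tasks consistently take 600 s in this example
print(round(estimator.estimate, 1))
```

The gains here were picked only so that the example settles; a real controller would need tuning, which is the cost the post hopes a Kalman filter could avoid.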
James Sotherden Send message Joined: 16 May 99 Posts: 10436 Credit: 110,373,059 RAC: 54 |
As I have no horse in this race, why would you guys discuss and fight over code in this forum? Would this public display of angst not have been more appropriate in the BOINC developers thread or Beta or even PMs? Old James |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
As I have no horse in this race, Why would you guys discuss and fight over code in this forum? I'm sure a similar sentiment wasn't the entire point of Richard's initial response, but at least some part of it. To be fair all around, sometimes as a developer it's difficult to find a sympathetic ear, despite something being 'obviously wrong'. I gather you would rather not see this side of development (a view which I happen to agree with mostly), however *sometimes* communications on a large and complex issue like this require breaking a few molds and 'rules'. On occasion something good can come from more public exposure. [for example, I'd wager Raistmer had little or no idea that my control systems oriented creditNew research would have any relationship to this 'simple' problem. There's no Forum for that, but 'numbercrunching' does fit ;) ] [Edit:] I'll add that, from experience, the BOINC forum would be the wrong forum to speak about this, and PMs wholly inappropriate in development matters. If kept to Lunatics I probably would not have seen it and had the opportunity to respond. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
James Sotherden Send message Joined: 16 May 99 Posts: 10436 Credit: 110,373,059 RAC: 54 |
Thanks Jason for the heads up. I will now withdraw my complaint. And we folks do appreciate what you ALL do to help develop code. Old James |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Thanks Jason for the heads up. I will now with draw my complaint. And we folks do appreciate what you ALL do to help develop code. No problems. I completely understand these issues draw odd looks (especially for example when Eric and I have dissected some things in news threads, lol). Some of the best things come from 'messy minds', and in that state protocol sometimes just doesn't fit. Some of us try though ;) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
I think I'll make that my "thought for today." |