BOINC not always reports faster GPU device...

Message boards : Number crunching : BOINC not always reports faster GPU device...

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1598962 - Posted: 9 Nov 2014, 12:25:45 UTC
Last modified: 9 Nov 2014, 12:26:05 UTC

Example.

My host has a 9600 GSO and a 9400 GT.
Obviously the 9600 GSO is much faster, but init_data.xml reported only two 9400 GT devices to the app.

<coproc_cuda>
   <count>2</count>
   <name>GeForce 9400 GT</name>
   <available_ram>501481472.000000</available_ram>
   <have_cuda>1</have_cuda>
   <have_opencl>1</have_opencl>

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1598964 - Posted: 9 Nov 2014, 12:42:04 UTC - in response to Message 1598962.  
Last modified: 9 Nov 2014, 12:46:08 UTC

BOINC decides which GPU is best based on these factors, in decreasing priority:
- compute capability
- software version
- available memory
- speed

http://boinc.berkeley.edu/dev/forum_thread.php?id=7899&postid=45886
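
To make the effect of that ordering concrete, here is a rough Python sketch (an illustration only, not BOINC's actual code). It treats the four factors as a lexicographic sort key. The `Gpu` record, the `driver_version` field (assumed equal for both cards) and the `multiprocs * clock_mhz` speed proxy are assumptions of the sketch; the device figures are the ones from the stderr output quoted in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    compute_cap: tuple   # (major, minor)
    driver_version: int  # assumption: both cards share one driver, so equal
    mem_mib: int
    multiprocs: int
    clock_mhz: int

    def speed(self):
        # Crude speed proxy, an assumption of this sketch.
        return self.multiprocs * self.clock_mhz

    def rank_key(self):
        # Tuples compare left to right, so earlier fields dominate later ones.
        return (self.compute_cap, self.driver_version, self.mem_mib, self.speed())


gso = Gpu("GeForce 9600 GSO", (1, 1), 0, 361, 12, 1700)
gt = Gpu("GeForce 9400 GT", (1, 1), 0, 489, 2, 1400)

best = max([gso, gt], key=Gpu.rank_key)     # priority order from the list above
fastest = max([gso, gt], key=Gpu.speed)     # what a speed-first rule would pick

print("ranked best:", best.name)
print("fastest:", fastest.name)
```

Because compute capability and (assumed) software version tie, the memory field decides, and the speed of the 9600 GSO is never consulted.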

Since the 9400 GT has more memory available, BOINC classes it as the more capable device:

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 2 CUDA device(s):
Device 1: GeForce 9600 GSO, 361 MiB, regsPerBlock 8192
computeCap 1.1, multiProcs 12
clockRate = 1700 MHz
Device 2: GeForce 9400 GT, 489 MiB, regsPerBlock 8192
computeCap 1.1, multiProcs 2
clockRate = 1400 MHz
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce 9600 GSO is okay
SETI@home using CUDA accelerated device GeForce 9600 GSO
pulsefind: blocks per SM 1 (Pre-Fermi default)
pulsefind: periods per launch 100 (default)
Priority of process set to BELOW_NORMAL (default) successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 2.30


Claggy

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1598967 - Posted: 9 Nov 2014, 13:09:15 UTC - in response to Message 1598964.  

And this particular example perfectly illustrates why BOINC's approach is the wrong one.
Without the use_all_gpus switch in cc_config.xml, it would ignore the faster device, which is definitely the more capable one for computation.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1598968 - Posted: 9 Nov 2014, 13:14:18 UTC - in response to Message 1598967.  
Last modified: 9 Nov 2014, 13:15:00 UTC

Also, reporting two devices as if both were the one with the biggest memory has another fundamental flaw.
If any real scientific app depended on BOINC for detecting device capabilities (I avoid this as much as I can, so the SETI OpenCL apps do not), it would treat both cards as having ~512 MB of memory and might adjust the task to that amount. The task would then fail on the much faster but smaller-memory 9600 GSO. Nice approach, nothing to say...
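
A minimal Python sketch of that failure mode, assuming a hypothetical app that sizes its working set as a fixed fraction of the memory BOINC reports. The reported figure is the `<available_ram>` value from the init_data.xml excerpt at the top of the thread; the per-device sizes (361 MiB and 489 MiB) are what the CUDA app itself prints for these two cards. The helpers `working_set` and `fits` and the 0.9 fraction are made up for illustration.

```python
MIB = 1024 * 1024

reported_ram = 501481472  # bytes, <available_ram> as passed in init_data.xml
actual_ram = {            # bytes, per device as detected by the app itself
    "GeForce 9600 GSO": 361 * MIB,
    "GeForce 9400 GT": 489 * MIB,
}


def working_set(available_bytes, fraction=0.9):
    """Hypothetical sizing rule: use a fixed fraction of the available memory."""
    return int(available_bytes * fraction)


def fits(task_bytes, device_bytes):
    return task_bytes <= device_bytes


# Sized from the single figure BOINC reports for "both" devices.
task = working_set(reported_ram)
for name, ram in actual_ram.items():
    print(name, "fits" if fits(task, ram) else "would run out of memory")

# Sized from what the app detects itself, using the smallest device.
safe_task = working_set(min(actual_ram.values()))
```

Sizing from the app's own per-device detection, as in `safe_task`, avoids relying on the one reported figure.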

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1598973 - Posted: 9 Nov 2014, 13:36:14 UTC - in response to Message 1598968.  
Last modified: 9 Nov 2014, 13:44:58 UTC

Also, reporting two devices as if both were the one with the biggest memory has another fundamental flaw.
If any real scientific app depended on BOINC for detecting device capabilities (I avoid this as much as I can, so the SETI OpenCL apps do not), it would treat both cards as having ~512 MB of memory and might adjust the task to that amount. The task would then fail on the much faster but smaller-memory 9600 GSO. Nice approach, nothing to say...

The current BOINC design is that it detects all the NV/ATI/Intel GPUs, then uses the most capable from each vendor, and only uses multiple GPUs from the same vendor if they are the same, or close to being the same.
Then it passes only the most capable GPU(s) to the project, and the project bases what work is sent on what the most capable GPU is.

If you add GPUs from the same vendor that are different in some way, and use <use_all_gpus> to enable them, then you're working outside the current design.
DA has already said that it's a lot of work to change this design; who knows when it'll happen.
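
As a rough Python sketch of the selection described above (my illustration, not the client's code): group devices by vendor, keep the best of each vendor plus any that count as "close to the same", unless a `use_all_gpus` flag overrides that. The `select_gpus` function, the dictionary fields, the `mem_tolerance` rule and the three example cards are all hypothetical; what BOINC actually treats as "close" is not specified in this thread.

```python
def select_gpus(gpus, use_all_gpus=False, mem_tolerance=0.7):
    """Return the devices a client following this design would use.

    gpus: list of dicts with 'vendor', 'name', 'compute_cap', 'mem_mib'.
    mem_tolerance: hypothetical definition of "close to being the same".
    """
    chosen = []
    for vendor in sorted({g["vendor"] for g in gpus}):
        group = [g for g in gpus if g["vendor"] == vendor]
        if use_all_gpus:
            chosen.extend(group)  # flag set: no filtering within a vendor
            continue
        best = max(group, key=lambda g: (g["compute_cap"], g["mem_mib"]))
        for g in group:
            similar = (g["compute_cap"] == best["compute_cap"]
                       and g["mem_mib"] >= mem_tolerance * best["mem_mib"])
            if similar:
                chosen.append(g)
    return chosen


# Hypothetical host: two dissimilar cards from one vendor, one from another.
host = [
    {"vendor": "nvidia", "name": "card A", "compute_cap": (3, 0), "mem_mib": 2048},
    {"vendor": "nvidia", "name": "card B", "compute_cap": (1, 1), "mem_mib": 512},
    {"vendor": "intel", "name": "card C", "compute_cap": (0, 0), "mem_mib": 1024},
]

print([g["name"] for g in select_gpus(host)])
print([g["name"] for g in select_gpus(host, use_all_gpus=True)])
```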

Ask Him.

Claggy

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1598987 - Posted: 9 Nov 2014, 14:38:33 UTC - in response to Message 1598973.  
Last modified: 9 Nov 2014, 14:43:15 UTC

Also, reporting two devices as if both were the one with the biggest memory has another fundamental flaw.
If any real scientific app depended on BOINC for detecting device capabilities (I avoid this as much as I can, so the SETI OpenCL apps do not), it would treat both cards as having ~512 MB of memory and might adjust the task to that amount. The task would then fail on the much faster but smaller-memory 9600 GSO. Nice approach, nothing to say...

The current BOINC design is that it detects all the NV/ATI/Intel GPUs, then uses the most capable from each vendor, and only uses multiple GPUs from the same vendor if they are the same, or close to being the same.
Then it passes only the most capable GPU(s) to the project, and the project bases what work is sent on what the most capable GPU is.

If you add GPUs from the same vendor that are different in some way, and use <use_all_gpus> to enable them, then you're working outside the current design.
DA has already said that it's a lot of work to change this design; who knows when it'll happen.

Ask Him.

Claggy

And in his keynote 'history of BOINC' talk to the BOINC Workshop in Budapest, six weeks ago, he acknowledged that this design decision was a mistake*, and would be re-worked eventually - as I told Raistmer, the day he said it, via the Lunatics site.

In addition to the 2012 BOINC dev thread which Claggy linked, I quoted (and sourced) some of the 'decide between GPUs' code here in early 2011: message 1085712. The decision is based on a broader concept of 'capability', rather than raw speed.

In short, this is old - very old - news, and I see no point in bringing it up on a project message board in apocalyptic tones, as if some disaster had just struck.

Raistmer and David share a surprising number of programming traits.
* Neither necessarily chooses the optimal path when they start a new venture
* Both tend to start coding before they've finished designing
* Both write spaghetti code which it is hard for anybody else to read
* Neither likes to be called back to revise something after they've moved on
* Both make multiple changes at once, merging new features with bugfixes
* Neither likes to be criticised in public

I'd like to see this BOINC 'feature' changed, too - but (especially in view of that last comment), I don't think this thread is the best way of achieving that desired result.

* reference: http://boinc.berkeley.edu/trac/attachment/wiki/WorkShop14/workshop_14.pdf, slide 52.

Reflections on software: things we need to change
● Coprocessor model

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1598991 - Posted: 9 Nov 2014, 15:07:01 UTC - in response to Message 1598987.  
Last modified: 9 Nov 2014, 15:16:26 UTC


Raistmer and David share a surprising number of programming traits.
* Neither necessarily chooses the optimal path when they start a new venture
* Both tend to start coding before they've finished designing
* Both write spaghetti code which it is hard for anybody else to read
* Neither likes to be called back to revise something after they've moved on
* Both make multiple changes at once, merging new features with bugfixes
* Neither likes to be criticised in public


Sorry, Richard. I'm afraid this is your wishful thinking.

Especially the last one. Doesn't it count that I'm almost always the first to open the corresponding "issues and errors" thread for my own builds? Or maybe you are just defending BOINC by pointing fingers here, no?

1) What non-optimal path do you mean? Examples, please; I would like to correct that path if it really is not optimal.
2) Yep, the details of a design show themselves when real coding starts. The devil is in the details, as is well known, so the design gets corrected while coding. That's OK, and that's not the issue.
The issue is when a bad design persists.
3) Really, do you read my code a lot? Pity I get almost no feedback, then. Try actually reading it; you'll find lots of comments inside :)
4) Yep, it happens, especially when I'm quite sure that the changes are isolated ones. More branching would indeed be better, but merging back takes additional time, and I have no paid staff to do that.
5) LoL, criticise me in public, no probs, but with facts and constructive criticism, please. Note that constructive criticism leads to bug fixing. I like bug reports, real bug reports of course; what I don't like is straightening out other people's hands day after day, which I'm happy to do only rarely ;D


And back to the topic.
Yep, "treat different devices as the same" is surely an obviously bad approach; there's nothing even to discuss here.
But my current topic is not about that (this design decision sits too deep inside the BOINC architecture to be solved quickly and with little blood). My thread is about another small but quite inconvenient decision, to rank memory amount above device speed, and how easily it can lead to issues, with a list of those issues.

What are you defending here by pointing fingers at me? The BOINC design? What for? Even if you see some mistakes in my own creations, does that mean I should never spot and speak about any errors in BOINC? What are your "similarities" doing in this thread???

P.S. Ah, sorry, I missed the "doesn't like to come back" similarity. Yep, it's hard to come back, especially if you don't have an eidetic memory; you got me here :)
But there's quite a big step between "don't like" and "never do".

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1598994 - Posted: 9 Nov 2014, 15:19:21 UTC

And regarding the coprocessor model: I take this field too close to heart because I proposed the required changes so many years ago... If I recall right, I even filed a formal "bug report/feature request" in Trac... But again, this particular thread is not about the coprocessor model as a whole; it's about just one more issue inside the coprocessor model.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1598995 - Posted: 9 Nov 2014, 15:28:34 UTC - in response to Message 1598991.  

My purpose in posting in this thread was to provoke a response, and hence start a discussion.

"What is the best - most effective - way of getting improvements made to the BOINC code"...

... bearing in mind all the constraints of time, personnel, and psychology that get in the way.

The last (very small) bug that I reported was fixed by David within an hour. You complain that the bugs you report never get any attention (example - though a false one, as it turned out). I do suggest that paying more care and attention to analysing who is responsible for an area of design, and addressing yourself to the right person in a timely and constructive manner, is more likely to get positive results.

On the occasions when I've lost my temper and antagonised David (yes, it has happened), it has in general been counterproductive.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1598997 - Posted: 9 Nov 2014, 15:35:11 UTC - in response to Message 1598995.  

I'm afraid a response was indeed provoked, but not one that will lead to any useful discussion, at least from my side. The issue is described. Whoever is interested may continue.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1598998 - Posted: 9 Nov 2014, 15:38:06 UTC - in response to Message 1598987.  

* Neither likes to be criticised in public

QED. Point made. Next time you want David to change something, post where he will read it - and put yourself in his shoes, before you write it.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1599008 - Posted: 9 Nov 2014, 16:35:22 UTC - in response to Message 1598998.  

* Neither likes to be criticised in public

QED. Point made. Next time you want David to change something, post where he will read it - and put yourself in his shoes, before you write it.

I don't want him to; I'm fine with this issue as it is, because I use the corresponding flag in cc_config.xml. But I think the issue is worth recording in case someone stumbles on it and gets trapped. What I want fixed, I write to the BOINC dev list. With almost the same outcome as writing on the wall of a house in Moscow :P

rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22202
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1599094 - Posted: 9 Nov 2014, 21:07:58 UTC

* Neither necessarily chooses the optimal path when they start a new venture
* Both tend to start coding before they've finished designing
* Both write spaghetti code which it is hard for anybody else to read
* Neither likes to be called back to revise something after they've moved on
* Both make multiple changes at once, merging new features with bugfixes


I doubt that either (or even both in concert) would ever reach the standards of spaghetti code that I had to sort out a few years ago...
The task: "Just test these few lines of ASM86" (yes, 8086 assembler), only about 40 lines or so, no big deal. But these 40 lines had something like 30 immediate jumps to fairly large chunks of code, each of which had multiple jumps, some computed and others direct; all told about 12000 lines of code, tied up like a large bowl of spaghetti - the "cook" who brewed this lot up couldn't (wouldn't?) see there was any problem until I threw the mess at a code tracer..... Rewritten, this PI controller was more robust (no random crashes), and much smoother and more predictable... Then came the change, make it a three term (PID), which was an easy task as I'd already thought about that...
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1599146 - Posted: 9 Nov 2014, 23:21:16 UTC
Last modified: 9 Nov 2014, 23:51:20 UTC

For my 2 cents toward the original topic, this multiple device issue has at least 3 main relevant impacts on the (mostly non credit related) creditNew work I've been doing.
- Scheduling for task allocation to hosts: in case of multiple disparate devices the throughput used for requests by the client should be the aggregate sum of peak theoretical flops, and a filtered efficiency ( aggregated from separately tracked device.appversion local efficiencies), which would be dominantly client side refinements.
- Increasingly heterogeneous hosts, currently unsupported (again mostly a client side concern. To some extent the server has all of the information it needs for its tasks, though underused, and small fragments missing or misused in places), and
- local (client) estimate scheduling
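
One possible reading of the first point, as a hedged Python sketch: the throughput a client reports would be the sum over devices of peak theoretical FLOPS scaled by a smoothed, locally tracked efficiency. The function names, the exponential smoothing used as the "filter", and all numbers are assumptions for illustration; nothing here comes from the BOINC source.

```python
def smooth(previous, sample, alpha=0.1):
    """Exponential smoothing, standing in for the 'filtered' efficiency."""
    return previous + alpha * (sample - previous)


def effective_flops(devices):
    """Aggregate throughput: peak FLOPS times tracked efficiency, summed."""
    return sum(d["peak_flops"] * d["efficiency"] for d in devices)


# Hypothetical host with two disparate devices.
devices = [
    {"name": "fast GPU", "peak_flops": 400e9, "efficiency": 0.25},
    {"name": "slow GPU", "peak_flops": 50e9, "efficiency": 0.5},
]

total = effective_flops(devices)
print(f"aggregate estimate: {total / 1e9:.1f} GFLOPS")

# A new efficiency sample for the fast GPU nudges the tracked value.
devices[0]["efficiency"] = smooth(devices[0]["efficiency"], 0.35)
```

The point of the aggregate is that work requests reflect every device on the host rather than only the one ranked "best".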

Considering those, which stand out as dominantly client side concerns, I'd be wary of recommending increased server side complexity, especially since the problem domain (our subjective observations of how well the scheduling works) is of little relevance to server/project side goals and scope.

IOW, try to keep solutions close to the original problem source, rather than migrate them back into a problem domain which is already overly complicated by special exceptions and burdened by poor management.

My own work will undoubtedly result in recommendations mostly for client refinement, but definitely some server bulletproofing & simplification too (in support of the separate credit issues). It will reach a viable point to model heterogeneous hosts/application & workloads in part 1.2, 'controllers', of the plan below.

That doesn't prevent anyone researching & developing other ways to address the limitations we've dealt with for so long. I'd suggest though that David's 'hands-off' approach to the problem may be at least in some small portion due to some of the other design issues not specifically relating to multiple devices. It may instead be recognition that the problem is a larger design one, relevant across more problems than just mixing disparate devices... (which it is.)


"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1599152 - Posted: 9 Nov 2014, 23:34:04 UTC - in response to Message 1599094.  
Last modified: 9 Nov 2014, 23:42:10 UTC

Then came the change, make it a three term (PID), which was an easy task as I'd already thought about that...


Funnily enough, I modified the dated 6.10.58 build of the BOINC client I run here, replacing a portion of the task duration estimates with a three-term, PID-driven mechanism. It's been working fine and adapting to significant local hardware and application changes without issue since 6.10.58 was current.
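
For readers unfamiliar with the idea, a minimal Python sketch of a three-term (PID) correction applied to a runtime estimate. This is not the modified client's code: the `PidEstimator` class, the gains and the runtimes are hypothetical, chosen only so that the loop settles.

```python
class PidEstimator:
    """Tracks a task runtime estimate with proportional, integral and
    derivative terms acting on the estimation error."""

    def __init__(self, initial, kp=0.4, ki=0.05, kd=0.1):
        self.estimate = initial
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, observed):
        error = observed - self.estimate
        self.integral += error                 # accumulated error (I term)
        derivative = error - self.prev_error   # change in error (D term)
        self.prev_error = error
        self.estimate += (self.kp * error
                          + self.ki * self.integral
                          + self.kd * derivative)
        return self.estimate


est = PidEstimator(initial=300.0)   # first guess: 300 s per task
for _ in range(100):
    est.update(100.0)               # tasks actually take 100 s
after_first_phase = est.estimate

for _ in range(100):
    est.update(60.0)                # e.g. after a hardware or app change
after_second_phase = est.estimate

print(round(after_first_phase, 2), round(after_second_phase, 2))
```

With these particular gains the estimate overshoots before settling; choosing gains is the tuning step a PID approach requires.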

That's one of several approaches I'll be comparing models of, for some of the server side estimates for task scheduling. (in addition to client).

Most likely the PID variant will yield to the slightly more sophisticated Kalman filter (or extended version), which would remove the need for tuning. There are other options that are going to be compared (including the server's current dicey use of running sample averages), and areas where it's been suggested neither the PID nor the Kalman filter would be an optimal choice, but it's fun to see steady-state runtime estimates dial in to the second at times. That's better than required, so simplest/smartest will probably win out.
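
For comparison, a minimal scalar Kalman filter in Python, assuming the true runtime follows a slow random walk and each completed task is a noisy measurement of it. The `ScalarKalman` class and the variances are hypothetical, hand-picked values for this sketch; it is not the planned implementation.

```python
class ScalarKalman:
    """One-dimensional Kalman filter for a slowly drifting runtime."""

    def __init__(self, initial, variance=1e4, process_var=1.0, measure_var=25.0):
        self.x = initial          # current runtime estimate (seconds)
        self.p = variance         # uncertainty of that estimate
        self.q = process_var      # how much the true runtime may drift per task
        self.r = measure_var      # noise in a single observed runtime

    def update(self, observed):
        self.p += self.q                        # predict: uncertainty grows
        gain = self.p / (self.p + self.r)       # Kalman gain, between 0 and 1
        self.x += gain * (observed - self.x)    # correct towards the observation
        self.p *= (1.0 - gain)                  # uncertainty shrinks after update
        return self.x


kf = ScalarKalman(initial=300.0)
for _ in range(50):
    kf.update(100.0)
print(round(kf.x, 3))
```

The gain adapts from the tracked uncertainty: it starts near 1 when the initial guess is unreliable and falls as observations accumulate.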

James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1599307 - Posted: 10 Nov 2014, 6:48:40 UTC

As I have no horse in this race, why would you guys discuss and fight over code in this forum?
Should not this public display of angst have been more appropriate in the Boinc developers thread or Beta or even PMs?

Old James

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1599324 - Posted: 10 Nov 2014, 7:30:30 UTC - in response to Message 1599307.  
Last modified: 10 Nov 2014, 7:43:05 UTC

As I have no horse in this race, why would you guys discuss and fight over code in this forum?
Should not this public display of angst have been more appropriate in the Boinc developers thread or Beta or even PMs?


I'm sure a similar sentiment wasn't the entire point of Richard's initial response, but at least some part of it.

To be fair all around, sometimes as a developer it's difficult to find a sympathetic ear, despite something being 'obviously wrong'. I gather your own views would rather not see this side of development (a view which I happen to agree with mostly), however *sometimes* communications on a large and complex issue like this require breaking a few molds and 'rules'. On occasion something good can come from more public exposure.

[for example, I'd wager Raistmer had little or no idea that my control systems oriented creditNew research would have any relationship to this 'simple' problem. There's no Forum for that, but 'numbercrunching' does fit ;) ]

[Edit:] I'll add that, from experience, the BOINC forum would be the wrong forum to speak about this, and PMs wholly inappropriate in development matters. If kept to Lunatics I probably would not have seen it and had the opportunity to respond.

James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1599326 - Posted: 10 Nov 2014, 7:42:33 UTC

Thanks Jason for the heads up. I will now withdraw my complaint. And we folks do appreciate what you ALL do to help develop code.

Old James

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1599328 - Posted: 10 Nov 2014, 7:46:20 UTC - in response to Message 1599326.  

Thanks Jason for the heads up. I will now withdraw my complaint. And we folks do appreciate what you ALL do to help develop code.


No Problems. I completely understand these issues draw odd looks (especially for example when Eric and I have dissected some things in news threads, lol).

Some of the best things come from 'messy minds', and in that state protocol sometimes just doesn't fit. Some of us try though ;)

tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1599343 - Posted: 10 Nov 2014, 8:17:11 UTC - in response to Message 1599328.  



Some of the best things come from 'messy minds'


I think I'll make that my "thought for today."