Message boards :
Number crunching :
First results of a SETI crunching PC ( 2 GPU -> to be 6 )...not what I expected
Message board moderation
Author | Message |
---|---|
I3APR Send message Joined: 23 Apr 16 Posts: 99 Credit: 70,717,488 RAC: 0 |
Hello guys, first of all a quick recap:

RIG "A"
CPU : Core i7-4790K quad-core, 4 GHz, unlocked
PSU : Corsair AX1500i Titanium, 1500 W
GPU : 1 x Nvidia 660ti + 1 x Nvidia 780ti

I'm slowly adding GPUs to the rig, but for the moment I'm running the GPU configuration above. I'm a bit disappointed, although the results might be biased by my poor knowledge of how to fine-tune the systems. In parallel I've been running another dedicated PC for a month:

RIG "B"
CPU : Core i7-3770 quad-core, 3.4 GHz
PSU : unknown 750 W PSU
GPU : 1 x XFX Radeon HD7850/7870

Now, tonight I've been running both machines with:

<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>

So on PC "A" there were 6 CPU tasks and 2 CPU+GPU tasks running, while on PC "B" there were 7 CPU tasks and 1 CPU+GPU task. This is consistent with the idea that BOINC with this config reserves one core per GPU, regardless of whether it is a physical or HT core ( hope this sentence makes sense in English too... sorry guys ! )

Now the numbers, keeping in mind that not all WUs are equal, but trying to find a pattern among the most recurrent/similar ones:

RIG "A"
CPU WU : about 1 h 30 min
GPU WU : about 33 min on the GTX780ti and 38 min on the GTX660ti ( both cuda50 )

RIG "B"
CPU WU : about 3 h 30 min
GPU WU : about 24 min on the Radeon HD7850 ( opencl_ati5_cat132 )

Now... I'm puzzled by at least a couple of facts here. The Radeon is, on paper, a rather slower card than the two Nvidias, see here : http://www.gpureview.com

Radeon HD7850 : 1761.28 GFLOPS
nVidia 660ti : 2459.52 GFLOPS
nVidia 780ti : 5045.76 GFLOPS

I find the numbers inconsistent: the theoretical GFLOPS ratio between the 780ti and the 660ti is about 2x, while the measured run-time ratio is only about 1.15x. But more than that, the theoretical GFLOPS ratio between the 780ti and the HD7850 is about 2.9x, while the measured run-time ratio between the 780ti and the HD7850 is 0.7 !!
I know it is not right to stick to theoretical numbers, but if I'm not wrong here, the 780ti, instead of being twice as fast as the 660ti, is only marginally faster, and more strikingly, the Radeon, which on paper should be the slowest of them all in GFLOPS, is actually the fastest of them all !! Can someone explain this to me ( like I'm 5 ) ? Is it something related to better app code optimization for AMD devices, or am I doing something horribly wrong here..? Ciao A. |
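The ratios described above can be double-checked with a few lines of arithmetic. A quick sketch, using only the figures quoted in the post (theoretical peaks from gpureview.com, run times as the approximate per-WU minutes reported for similar tasks):

```python
# Theoretical peak GFLOPS, as quoted from gpureview.com in the post above.
peak_gflops = {"780ti": 5045.76, "660ti": 2459.52, "hd7850": 1761.28}
# Approximate measured run times (minutes) for similar work units.
runtime_min = {"780ti": 33, "660ti": 38, "hd7850": 24}

def theoretical_ratio(a, b):
    """How much faster card `a` should be than `b`, on paper."""
    return peak_gflops[a] / peak_gflops[b]

def measured_ratio(a, b):
    """How much faster card `a` actually is (inverse of the run-time ratio)."""
    return runtime_min[b] / runtime_min[a]

print(round(theoretical_ratio("780ti", "660ti"), 2))   # ~2.05x on paper
print(round(measured_ratio("780ti", "660ti"), 2))      # ~1.15x measured
print(round(theoretical_ratio("780ti", "hd7850"), 2))  # ~2.86x on paper
print(round(measured_ratio("780ti", "hd7850"), 2))     # ~0.73x: the Radeon finishes first
```

The gap between the "on paper" and measured columns is exactly the puzzle the replies below address.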
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Can someone explain this to me ( like I'm 5 ) ? Is it something related to the better app code optimization for AMD devices or am I doing something horribly wrong here..? I think I'd try a statement like "NVidia hardware is matched less well to the mathematics that SETI currently needs". |
rob smith Send message Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
One thing you have to look at is the "AR" (angle range) of each work unit. In my experience: at mid AR (somewhere between 0.2 and 0.6) they are very good; at low AR they get worse; and at very low AR they are inconsistently poor, in that small changes of AR can produce hugely different run times. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
It's simple: Nvidia cards do not like guppis without some tweaking. I forget whether CUDA needs a dedicated core, but those cards can handle 2 tasks each. Try 0.5 GPU, 0.5 CPU to kick it up to 2 tasks, and watch your CPU usage. |
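Brent's 0.5/0.5 suggestion goes in an app_config.xml file in the project's directory under BOINC's data folder. A minimal sketch, assuming the v8 multibeam app name (check your own client_state.xml for the exact `<name>`; this is illustrative, not a definitive config):

```xml
<app_config>
  <app>
    <!-- app name is an assumption; copy the exact one from client_state.xml -->
    <name>setiathome_v8</name>
    <gpu_versions>
      <!-- 0.5 GPU per task => BOINC schedules 2 tasks per card -->
      <gpu_usage>0.5</gpu_usage>
      <!-- fraction of a CPU core budgeted per GPU task -->
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

After saving, use Options -> Read config files in the BOINC Manager (or restart the client) for it to take effect.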
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Can someone explain this to me ( like I'm 5 ) ? Is it something related to the better app code optimization for AMD devices or am I doing something horribly wrong here..? You might look into this exchange: "The slowdown at low AR comes from having only one full pot to process. That kind of makes all the work go to one SM/SMX unit... Work is still in progress. The main question is which GPUs will be able to use the modified code that uses all the compute units." The VLARs are much faster when the GPUs use all the compute units. Much faster. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Lots of answers. Compared 1 to 1, the Radeon is faster than the Nvidia. Where the Nvidia advantage comes from is being able to run more than 1 work unit on each card. The ATI cards at the moment are having issues (with certain cards) crunching more than 1 work unit at the same time. If they ever solve that problem, then they will again be the most productive cards. So while at 1:1 the Radeon is faster, if you increase your Nvidia to 2 or 3 per card, you will see more productivity out of them. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Can someone explain this to me ( like I'm 5 ) ? Is it something related to the better app code optimization for AMD devices or am I doing something horribly wrong here..? Yes I too can confirm that watching the great work Petri's doing has been compelling. Breaking things down to the simplest as requested --> the work changed, and the technologies and techniques are changing to keep up. Lots (and lots) of gotchas+work remain, such as the mentioned 'which devices will it work on?' questions, as well as laying some groundwork to make response to future changes easier while trying to simplify things for users. It gets slightly more complicated in the details of course: We'll probably reach a point sometime soon after a fair amount of heavy lifting is done, where a single Cuda binary per platform can handle the majority of Cuda enabled devices well. That particular heavy lifting is mostly about re-thinking the stale/complex build system that is currently complicating cross platform build+testing, as well as internalising and automating the hardware specific dispatch traditionally left to confuse users with too many options. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482 |
Ok, I just had a brilliant flash (in my own mind, at least.. lol) regarding the AMD/Nvidia production dilemma. Help me out with this, let me know if it makes sense and is at all possible to implement. I've been an Nvidia guy pretty much forever, so I have very limited experience with AMD cards; it's been years since I've run one. But reading this post made me think: since in a number of my systems I am running multiple GPUs, would it possibly make sense to run both in one system? Is that even technically feasible? If so, the next question would be: would it be possible to direct the very low AR units to the AMD card, and send the mid and low(ish) AR to the Nvidia? Is the program even able to be that granular in directing tasks? The saying that comes to my mind is "when all you have is a hammer, everything looks like a nail", but this way it would also be using a screwdriver to drive the screws, while allowing the hammer to bang away at all the nails. I guess this is just a thought experiment, but I thought I'd throw it out there and let those in the know dissect it. The fact that I haven't seen it mentioned before, at least in the reading that I do here regularly (without doing an exhaustive search), would indicate to me that there are good reasons it isn't happening, but I'm the kind of guy who likes to think outside the box. So, whaddya think? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Ok, I just had a brilliant flash (in my own mind, at least.. lol) regarding the AMD/Nvidia production dilemma. Help me out with this, let me know if it makes sense and is at all possible to implement. I've been an Nvidia guy pretty much forever, so have very limited experience with AMD cards, it's been years since I've ran one. But, reading this post made me think, since in a number of my systems I am running multiple GPUs, would it possibly make sense to run both in one system? Is that even technically feasible? Very very slowly [on a global scale], things are headed down that path. This is happening even slower than the transition from single CPU cores to multicores, and to multithreaded programs that take advantage of them. Truly 'heterogeneous applications' should be able to examine the available resources, determine some optimal way to run using them, and then run in a dynamically adaptive way, toward a prescribed goal. Example goals might include max throughput, energy efficiency, or a throttle setting (e.g. temperature). Then there would be the ability to respond if a given resource becomes unavailable, or is added. People do some of this now, more or less manually [as you describe], in a coarse way with existing clients and applications. It's just that devising mechanisms that can adapt to the multitude of different situations more seamlessly, and in a more user-friendly way, will become more important than any one specific scenario ---> those specific scenarios are already dating faster than code can be written for them, which makes for complex situations. [Automation should be used to simplify complex situations] Sounds complex (and it is underneath), but bear in mind that mobile devices do a lot of this already, so the concepts and example implementations are there to learn from. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." 
-- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Ok, I just had a brilliant flash (in my own mind, at least.. lol) regarding the AMD/Nvidia production dilemma. Help me out with this, let me know if it makes sense and is at all possible to implement. I've been an Nvidia guy pretty much forever, so have very limited experience with AMD cards, it's been years since I've ran one. But, reading this post made me think, since in a number of my systems I am running multiple GPUs, would it possibly make sense to run both in one system? Is that even technically feasible? First question: Yes Second question: Not yet |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Can someone explain this to me ( like I'm 5 ) ? Is it something related to the better app code optimization for AMD devices or am I doing something horribly wrong here..? If you look at the latest webinar on CUDA 8, you will see most of the time devoted to tasks with HUGE (!) memory footprints and their optimization. New CUDA and Pascal are mostly about unified memory, which allows direct use of up to 2 TB of system memory no matter how much onboard GPU RAM the device has. SETI doesn't require a huge amount of RAM. Instead, it requires FAST RAM. Quite different. And again, from that webinar, the speedups NVIDIA is proud of are 3x (!!!!!) or so (on those huge-memory tasks) versus a CPU-only system. 3 times, LoL!!! Nothing like the 50x or so they claimed before. It's apparent that the raw compute ability of modern GPUs means absolutely nothing if it isn't backed by corresponding memory speed. If your many-core device does nothing but wait for data from memory, it doesn't matter how many cores it has... SETI apps news We're not gonna fight them. We're gonna transcend them. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Ok, I just had a brilliant flash (in my own mind, at least.. lol) regarding the AMD/Nvidia production dilemma. Help me out with this, let me know if it makes sense and is at all possible to implement. I've been an Nvidia guy pretty much forever, so have very limited experience with AMD cards, it's been years since I've ran one. But, reading this post made me think, since in a number of my systems I am running multiple GPUs, would it possibly make sense to run both in one system? Is that even technically feasible? It's definitely possible. Moreover, veterans may recall my "team mod" that did some tasks on the CPU while allowing others on the GPU. But as you understand, a quite specific host config (one with both AMD and NV cards in a single host) is required in this particular case. And all this does not fit well with the BOINC model of computation. A proxy that accepts both CPU and GPU work and then distributes it, invisibly to BOINC, would be required. SETI apps news We're not gonna fight them. We're gonna transcend them. |
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482 |
Well, thanks for the replies, it's good to hear that it may someday be a possibility. I'll keep my eyes open around here as time goes on, and if the situation arises where it would make sense to run both, I'll look at grabbing whatever is a decently performing model to try it out. That is, as long as they fix their cards to keep the voltages within the allowed PC specs, so my motherboard doesn't catch fire... :-O |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
so my motherboard doesn't catch fire... :-O That's why you always have a dry chemical extinguisher nearby, lol... |
I3APR Send message Joined: 23 Apr 16 Posts: 99 Credit: 70,717,488 RAC: 0 |
Ok guys, tnx for the hints, but I feel I wasn't explained to "like I'm 5" but mostly "like I've been in this forever". Anyway, my fault for such poor knowledge. One update, which partially explains the GFLOPS inconsistency between the 780ti and the Radeon 7850: it turns out it doesn't matter what some sites say; this is what the BOINC log for PC "A" says:

7/1/2016 7:20:19 PM | | CUDA: NVIDIA GPU 0: GeForce GTX 780 Ti (driver version 368.39, CUDA version 8.0, compute capability 3.5, 3072MB, 2491MB available, 6022 GFLOPS peak)
7/1/2016 7:20:19 PM | | CUDA: NVIDIA GPU 1: GeForce GTX 660 (driver version 368.39, CUDA version 8.0, compute capability 3.0, 2048MB, 1667MB available, 1982 GFLOPS peak)
7/1/2016 7:20:19 PM | | CUDA: NVIDIA GPU 2: GeForce GTX 660 Ti (driver version 368.39, CUDA version 8.0, compute capability 3.0, 2048MB, 1659MB available, 2810 GFLOPS peak)
7/1/2016 7:20:19 PM | | OpenCL: NVIDIA GPU 0: GeForce GTX 780 Ti (driver version 368.39, device version OpenCL 1.2 CUDA, 3072MB, 2491MB available, 6022 GFLOPS peak)
7/1/2016 7:20:19 PM | | OpenCL: NVIDIA GPU 1: GeForce GTX 660 (driver version 368.39, device version OpenCL 1.2 CUDA, 2048MB, 1667MB available, 1982 GFLOPS peak)
7/1/2016 7:20:19 PM | | OpenCL: NVIDIA GPU 2: GeForce GTX 660 Ti (driver version 368.39, device version OpenCL 1.2 CUDA, 2048MB, 1659MB available, 2810 GFLOPS peak)

( yeah, I just added a 660 Ti, and it turned out the other 660 was not a "Ti" )

And this for PC "B":

7/1/2016 5:42:52 PM | | CAL: ATI GPU 0: AMD Radeon HD 7850/7870 series (Pitcairn) (CAL version 1.4.1848, 2048MB, 2008MB available, 7296 GFLOPS peak)
7/1/2016 5:42:52 PM | | OpenCL: AMD/ATI GPU 0: AMD Radeon HD 7850/7870 series (Pitcairn) (driver version 2004.6 (VM), device version OpenCL 1.2 AMD-APP (2004.6), 2048MB, 2008MB available, 7296 GFLOPS peak)

So it seems BOINC measures the Radeon at about 7300 GFLOPS and the 780ti at about 6000: maybe that's why the bitcoin miners prefer AMDs..
So I've answered one of my earlier questions myself, but a new inconsistency has emerged. After installing and uninstalling Lunatics, which "as is" with minimal setup was prolonging the WUs' crunching time by a factor of 5, I now have cuda42 as the default app, and not cuda50 anymore... I reset the project, removed it, uninstalled BOINC and reinstalled, but no joy. Anyway, the average crunching times for each GPU are now:

GTX780ti : 14 min
GTX660ti : 23 min
GTX660 : 21 min

So I still don't know why the "Ti" is performing worse than the plain 660. A. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Do you know how to edit your app_info.xml? If so, edit out all references to cuda42, leaving only your cuda50 entries. I've had problems in the past where I've specified cuda50 and it gave me cuda22. I've found that I needed to edit the app_info in order to get it to accept the changes. |
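For reference, the cuda42/cuda50 selection lives in the `<app_version>` blocks of app_info.xml. A heavily hedged sketch of the shape of one entry (the app name, version number, and file name here are illustrative, modeled on typical Lunatics installs; copy the exact names from your own file). Deleting the whole `<app_version>` block whose `<plan_class>` is cuda42, along with its matching `<file_info>` entries, is what leaves only cuda50. Stop BOINC before editing, since the client rewrites this file.

```xml
<!-- One <app_version> entry; app_info.xml contains one per plan class. -->
<app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>800</version_num>
    <plan_class>cuda50</plan_class>
    <file_ref>
        <file_name>Lunatics_x41zc_win32_cuda50.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>
```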
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482 |
so my motherboard doesn't catch fire... :-O Oh, _that_ would be great to get all over electronics... Not! lol. I prefer CO2, or ideally Halon, though Halon is getting quite hard to get hold of; it is an excellent fire suppressant. Though you certainly wouldn't want to be in an enclosed room full of it. You'd asphyxiate quite quickly. But at least the fire would be out, right? |
Stubbles Send message Joined: 29 Nov 99 Posts: 358 Credit: 5,909,255 RAC: 0 |
Ok guys, tnx for the hints, but I feel I wasn't explained "like I'm 5" but mostly "like I'm in this since forever". Anyway, my fault for such poor knowledge. GazzaVR, don't worry, it'll come quickly; just keep at it! From my personal experience: 2 months ago I had no clue what these gurus were yacking about, and now I think I get 95% of it (I hope)! I have found that trying to answer Qs in the Q&A section of the forums has helped me clarify things in my own mind. And most of the gurus are really diplomatic at "adding" info to a thread without making you feel like you don't know enough to participate. This has allowed me to go from running "stock" (I didn't even know what that meant 66 days ago) to now being able to swap tasks from the GPU to the CPU and vice versa by manually modifying client_state.xml. Keep up the valiant effort with your multi-GPU rigs. I'm looking forward to reading your future posts, Rob :-} PS: It would be helpful (for me) if you could put a link to your Rig A & Rig B. My primary 2 crunchers are identical (except for RAM): HP Z400 Xeon W3550 with GTX 750 Ti myRig A host_id #7996377 & myRig B host_id #8010413 |
I3APR Send message Joined: 23 Apr 16 Posts: 99 Credit: 70,717,488 RAC: 0 |
Ok, guys, I'm outta town but keeping an eye on the rig with TeamViewer. I did make some changes though: before leaving I added the 4th card ( a 660ti ), but I felt no real improvement and... after 20 hours, I believe one of the cards is roasted (it doesn't even show up in MSI Afterburner or the Nvidia utilities..) :-( @Zalster I supposed that "app_info.xml" would be there, but I did not find it. Should I feed BOINC a custom one from some template? @Stubbles69 Tnx for your kind words. I knew there was more to it than "buy-an-expensive-card-and-feed-it-to-BOINC", but sometimes I feel this goes beyond my knowledge... there are times when I feel I'm facing something closer to sorcery than math, especially when observing BOINC behaving oddly and I can't say why !! Anyway, I feel I'm testing a good configuration at the moment with : <gpu_usage>1</gpu_usage> <cpu_usage>1</cpu_usage> Running lots of "21ap10af...." with opencl_nvidia_SoG, between 5 and 10 mins each... of course it could change when working on guppies... As for my systems, here they are : System "A" ( older with 1x Radeon) : http://setiathome.berkeley.edu/show_host_detail.php?hostid=7985342 System "B" ( newer with multiple GPUs) : http://setiathome.berkeley.edu/show_host_detail.php?hostid=8035198 tnx !! A. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Gazza, is Lunatics still installed? I had thought you reinstalled it, but after looking at your post it seems you uninstalled it? You will only have an app_info.xml if you make one yourself or if you use Lunatics to install one. The app_config.xml is something you have to create yourself. If you are not using Lunatics, the server will send you different apps until it figures out which it thinks are best for your system. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.