2 video cards in Linux: BOINC sees them as the same device!
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
I have a weird problem. The system works fine with just my video card, or with just the Tesla when I tell BOINC to use only one card, but it falls apart when trying to use both. My setup is as follows:

1. Linux x86_64 running on an Intel Q6600 system
2. GTX 285 video card in the first PCI-E slot
3. Tesla C1060 in the second PCI-E slot
4. BOINC version 6.6.36
5. NVIDIA driver 185.18.14
6. number_of_gpus set to 2 (I had it at 1 and it made no difference to the behavior below)
7. use_all_gpus set to 1, assuming it takes a true/false value
8. This statement in my app_info.xml:

```xml
<coproc>
  <type>CUDA</type>
  <count>2</count>
</coproc>
```

When I start BOINC, it reports two Tesla cards instead of the proper ones. If this were just a naming problem I could live with it, but... With the above <count> set to 1, when I do a `ps ax` to look at my process list, this is what I see:

```
7987 ?  RNLl 0:01 setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu --device 0
7988 ?  RNLl 0:01 setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu --device 0
```

and it uses the GTX 285 for both! With <count> set to 2, it uses the Tesla only and runs only one process; that process gets both device numbers, but the GTX 285 is not used and a second workunit is left waiting to run:

```
10170 ?  RNLl 0:07 setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu --device 0 --device 1
```

How can I get this to do the right thing and give me processes like these, using both cards?

```
setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu --device 0
setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu --device 1
```

How can I fix this? I know others are using two cards successfully.
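For what it's worth, the behavior described above is consistent with <count> meaning the number of GPUs a single task instance claims, rather than the number of GPUs BOINC should use overall; that would explain why a count of 2 produced one process holding `--device 0 --device 1`. Under that reading, one task per card would be requested with the fragment the poster originally had (this is an interpretation, not a confirmed fix):

```xml
<coproc>
  <type>CUDA</type>
  <count>1</count>
</coproc>
```

The rest of the thread suggests the remaining problem, two count-1 tasks both landing on device 0, is in the client's device assignment rather than in this file.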
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
After five days of intense research, I have come to the conclusion that this behavior is broken BOINC device-recognition code in the Linux version; I understand the Windows version works fine. I have a very hard time comprehending why a function as important as multiple-device recognition has gone ignored in the Linux version, especially when there is working Windows code available for reference, and this problem has been there for many versions. This is quite disturbing.

I have removed the Tesla for now and am considering two options:

1. Dig into the code and fix it myself, since apparently no one else wants to. My available time, due to job workload, makes this option difficult at best.
2. Build an entirely new machine just to use the Tesla. An expensive alternative.

Some will say the easiest alternative is to switch to Windows. I would agree that would be easiest, but it is not a consideration due to the nature and requirements of my work, and also my personal views of the product.

According to everything I can find, my configuration was done correctly but still resulted in multiple workunits feeding a single device, so I believe my impression of broken code is valid. I have not looked yet, but I am sure others have filed bug reports on similar situations; it is simply too common a need not to have been done. I will, however, verify this and file one if none have been reported. Not very happy.
Gundolf Jahn Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0
Did you see this message (concerning <count>)? And this answer in your other thread (concerning <use_all_gpus>)?

Regards, Gundolf
(Computers aren't everything in life. Just a little joke.)
SETI@home classic workunits: 3,758, SETI@home classic CPU time: 66,520 hours
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
Yes, I have. I had to set it to 2 to be able to run a single workunit on one card while ignoring the other, since it won't feed two cards properly; it was just a workaround while I investigated. When set to 1, BOINC feeds two workunits simultaneously to device 0 rather than one to each. That, I believe, is a problem in the device code in BOINC. And yes, I have had that setting in cc_config.xml; see the listing in my first message: I have use_all_gpus set to 1, assuming it takes a true/false value.
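For reference, the setting being discussed lives in the options section of cc_config.xml; a minimal sketch, assuming the 0/1 boolean form the posters describe, would be:

```xml
<cc_config>
  <options>
    <!-- Ask the client to use every detected GPU, not just the best one -->
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>
```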
Jord Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3
There have been reports that you do not need the latest drivers, but instead need at most 180.60; anything newer will give problems under all kinds of Linux. You can also try BOINC 6.6.37 (also with the 180.60 drivers at most), available in 32-bit and 64-bit versions.
Joseph Monk Joined: 31 Mar 07 Posts: 150 Credit: 1,181,197 RAC: 0
> There have been reports that you do not need the latest drivers, but instead need at max 180.60

I had problems with 6.6.36, but 6.6.20 works fine for me... I also have my count set to 2, but that's because one of my GPUs is overheating (it hits about 103C after 6 hours, where the other card hits 60C). After the weekend (I'm a shift worker) I'm going to do some cable management and rig up a little spot cooling for it.
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
> There have been reports that you do not need the latest drivers, but instead need at max 180.60

I can't use the 180.60 driver. I had it when I got the GTX 285: all I did was replace my old video card with it and turn the machine on, and when the driver loaded, my GUI began blinking and doing strange things. In talking with NVIDIA, they said the driver was not compatible with the 285 and I had to upgrade; once I upgraded, all the video problems went away. Unfortunately, I did not try running dual devices until after the driver upgrade. I have been running 6.6.37 for a few days, and tried going back to 6.6.20, which correctly identified both cards but still gave me the dual workunits on device 0 when trying to use both cards.
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
> There have been reports that you do not need the latest drivers, but instead need at max 180.60

Sounds like a candidate for a little Shin-Etsu thermal compound; that stuff works wonders. My Tesla used to jump to 80C regularly. I cleaned the old compound off and applied Shin-Etsu G-751 (hard to apply, but worth it: the highest thermal conductivity available), and the Tesla now never exceeds 65C.

6.6.20 still gave me two workunits on a single device; or, with a count of 2, both devices went to one app, which selects the last device listed, so I got one workunit on one GPU.
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
> There have been reports that you do not need the latest drivers, but instead need at max 180.60

I am wondering about the kernel, though. I will try an upgrade tomorrow to see if that helps, and may even try backing down to 180.60 with the new kernel. Presently running 2.6.25; will install 2.6.29 tomorrow.
Joseph Monk Joined: 31 Mar 07 Posts: 150 Credit: 1,181,197 RAC: 0
> There have been reports that you do not need the latest drivers, but instead need at max 180.60

My problem is that there's only about 1/8" of space (if that) between the top card and the bottom card, so not much airflow. How are you seeing which device the WU is being processed on?
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
> There have been reports that you do not need the latest drivers, but instead need at max 180.60

Ah. A helpful trick is to get one of those dome-shaped faucet washers and slide it between the two cards at the back end; there is usually enough give there to do this safely. It will expand the air gap somewhat, and yes, a fan blowing between them would help keep the top card from sucking heated air off the bottom one. As for monitoring: with my process list I can see which device is being used, and I also simply watch the GPU temperatures; whichever one heats up from idle temperature is the one being used for CUDA.
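The process-list check described above can be scripted. This is a sketch that pulls the `--device` argument out of a `ps` line; the sample line is taken from the first post, and in practice you would feed it from `ps ax | grep setiathome` instead:

```shell
# Sample line as it appears in `ps ax` output (taken from the first post).
# Live check would be: ps ax | grep '[s]etiathome.*--device'
line='7987 ?  RNLl 0:01 setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu --device 0'

# Extract the device number the BOINC client passed to the science app.
device=$(printf '%s\n' "$line" | sed -n 's/.*--device \([0-9]*\).*/\1/p')
echo "$device"
```

A task that was handed both cards (as with a count of 2) shows two `--device` flags on one line, so a correct two-card setup should instead show two separate processes, one printing 0 and one printing 1.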
Joseph Monk Joined: 31 Mar 07 Posts: 150 Credit: 1,181,197 RAC: 0
Yeah, I figured out you must be using ps to see it, and confirmed it myself... for some reason 6.6.20 and 6.6.36 both stick everything on device 0, but 6.4.5 works correctly (one workunit on each). I ended up doing a little cable management (wish I had longer cables; not much room to move them around) and moved GPU1 down to slot 3 (still running 16x since I only have the two cards), and it seems to work. I did a test last night for about 6 hours and temps were at 70C (in a closed room, stupid dog; room temp was about 32C), so I can live with that (though I might still look into some extra cooling... I still haven't overclocked these cards). Doing a longer 12-hour test today to verify it's all good. The only possible issue I see now is that GPU1 sits about 1" off the PSU, but that's better than the maybe 1/8" I had between GPU0 and GPU1 before.
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
> Yeah, I figured out you must be using ps to see it, confirmed it myself... for some reason 6.6.20 and 6.6.36 both stick it all on device 0, but 6.4.5 works correctly (one on each).

Sounds like you have enough cooling now; 70C is fine. 6.4.5, huh? Hmm, I will have to try that. Do you think it will bork my existing setup, since it's so old? Does it take app_info.xml? If not, I can't use it, because I'm running optimized apps for both CPU and GPU.
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
> Yeah, I figured out you must be using ps to see it, confirmed it myself... for some reason 6.6.20 and 6.6.36 both stick it all on device 0, but 6.4.5 works correctly (one on each).

Forgot to mention: if the PSU fan is at the rear or on the other side, that's fine, but if it faces the GPU fan it can cause air cavitation. In that case you will have to force-feed air between them to make sure a good flow is available for both.
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
> Yeah, I figured out you must be using ps to see it, confirmed it myself... for some reason 6.6.20 and 6.6.36 both stick it all on device 0, but 6.4.5 works correctly (one on each).

Sir, you are a genius. I would never even have considered going so far back in versions! The device report is still borked; it reports two Teslas in the message log, and I don't know what it will report to SETI since it can't contact the servers right now, but it works just fine, one workunit to each device. Both devices are running around 66C, so I know for sure it's working; they idle around 48C with this heat in the room. So even if it reports two Teslas, it works, and that's all I really care about. I only hope the video BIOSes don't wind up fighting, since they are vastly different versions; this Tesla is a pre-release engineering unit and has video ports attached (useless for anything other than text). Thank you! I will stick with this version until I hear that some newer version is fixed. The only thing I did was keep the 6.6.37 boincmgr, because the 6.4.5 one would not sort any columns; that seems to work.
Joseph Monk Joined: 31 Mar 07 Posts: 150 Credit: 1,181,197 RAC: 0
Must be something odd with the 6.6.* series... hopefully they fix it. The PSU fan blows out the bottom, so no issue there. When I get home from work today I'll get to see how well it worked.
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
> Must be something odd with the 6.6.* series... hopefully they fix it.

Yeah. I am going to look for the BOINC bug-reporting area and see if anyone has reported this for the 6.6 series. It's possible no one has, although that seems a bit far-fetched considering the sheer number of users.
arkayn Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0
Chuck Gorish Joined: 19 Jun 00 Posts: 156 Credit: 29,589,106 RAC: 0
Thanks! I decided to forge ahead instead of waiting for an answer to that, and I just backed up the entire BOINC directory before switching. It took it just fine.
Jord Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3
6.4.5 has one internal scheduler for both CPU and GPU work, so that one is kind of broken. You then have to add <ncpus>CPUs+GPUs</ncpus> (e.g. if you have 4 CPUs and 2 GPUs, the line is <ncpus>6</ncpus>) in the options section of the cc_config.xml file.

BOINC 6.6.x has separate internal CPU and GPU schedulers. The one in 6.6.20 is still broken; the one in 6.6.36 is getting there. The reason that not all GPUs are used is trouble feeding them with work. The developers know about it, but since fixing it would entail a large code change in both the client and the server software, with special app classes for all the different GPUs, they've chosen the stop-gap solution of only using the best card in your system. If you then want to use all GPUs, you can tell BOINC so with the <use_all_gpus> line in cc_config.xml. It's not pretty, but you get work done.

May I remind you all that a year ago you didn't even have CUDA around here? BOINC itself took more than a year in beta test before it was ripe enough to be released to the masses. ATI detection is still coming, so I can see why the developers want to wait with these big changes until those GPUs are also added to the fray; it won't do to do all the hard work twice. The three main developers are overworked enough as it is, having to fix bugs in both the client software and the back-end software.
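For the 6.4.5 workaround described above, the override goes in the options section of cc_config.xml; a sketch for the 4-CPU + 2-GPU example would be:

```xml
<cc_config>
  <options>
    <!-- 6.4.5's single scheduler: count GPUs as CPUs (4 CPUs + 2 GPUs) -->
    <ncpus>6</ncpus>
  </options>
</cc_config>
```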
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.