2 video cards in Linux. BOINC sees them as the same device!

Chuck Gorish

Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 917254 - Posted: 12 Jul 2009, 22:46:44 UTC

I have a weird problem. The system works fine with just my video card, or with just the Tesla if I tell BOINC to use only one card, but it falls apart when trying to use both.

My setup is as follows:

1. Linux x86_64 running on an Intel Q6600 system
2. Video card is a GTX 285 in the first PCI-E slot
3. Tesla C1060 is installed in the 2nd PCI-E slot
4. BOINC version is 6.6.36
5. NVIDIA driver is 185.18.14
6. My number_of_gpus is set to 2. I had it at 1 and it made no difference in the behavior below.
7. I have use_all_gpus set to 1, assuming it takes a true/false value.
8. I have this statement in my app_info.xml (a trimmed skeleton of the whole file follows below):

<coproc>
<type>CUDA</type>
<count>2</count>
</coproc>
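
For context, that block sits inside the <app_version> section. Trimmed down, the file is laid out roughly like this (a sketch of the standard app_info.xml structure, not my exact file; the app and file names are taken from the process list below):

<app_info>
    <app>
        <name>setiathome_enhanced</name> <!-- app name as the project registers it; assumed here -->
    </app>
    <file_info>
        <name>setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>setiathome_enhanced</app_name>
        <version_num>608</version_num>
        <coproc>
            <type>CUDA</type>
            <count>2</count>
        </coproc>
        <file_ref>
            <file_name>setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>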


When I start BOINC, it reports 2 Tesla cards instead of the proper ones. If this were just a naming problem I could live with it, but it isn't.

With the above <count> set to 1, when I do a ps ax to look at my process list, this is what I see:

7987 ? RNLl 0:01 setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu --device 0
7988 ? RNLl 0:01 setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu --device 0

and it uses the GTX 285 for both!

When I have <count> set to 2, it uses the Tesla only and runs only one process. That process is handed both device numbers, but the GTX 285 is not used, and a 2nd workunit sits waiting to run:

10170 ? RNLl 0:07 setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu --device 0 --device 1

How can I get this to do the right thing and provide me with processes like these using both cards?

setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu --device 0

setiathome-6.08.CUDA_2.2_x86_64-pc-linux-gnu --device 1


How can I fix this? I know others are using 2 cards successfully.

ID: 917254
Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 918668 - Posted: 17 Jul 2009, 10:32:17 UTC - in response to Message 917254.  

After 5 days of intense research, I have come to the conclusion that this behavior is broken BOINC device-recognition code in the Linux version. I understand the Windows version works fine. I have a very hard time comprehending why such an important function as multiple-device recognition has gone ignored in the Linux version, especially when there is working Windows code available for reference and the problem has been there for many versions. This is quite disturbing.

I have removed the Tesla for now and am considering 2 options.

1. Dig into the code and fix it, since no one else apparently wants to. My available time, given my job workload, makes this option difficult at best.

2. Build up an entirely new machine just to use the Tesla. An expensive alternative.

Some will say the easiest alternative is to switch to Windows. I would tend to agree that would be easiest, but it is not a consideration, due to the nature and requirements of my work and also my personal views of the product.

According to everything I can find, my configuration was done correctly but still resulted in multiple workunits feeding a single device, so I believe my impression of broken code above is valid. I have not looked yet, but I am positive others have filed bug reports on similar situations; it is simply too common a need not to have been done. I will, however, verify this and file one if none have been reported.

Not very happy.
ID: 918668
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 918674 - Posted: 17 Jul 2009, 11:14:08 UTC - in response to Message 918668.  

Did you see this message (concerning <count>)?

And this answer to your other thread (concerning <use_all_gpus>)?

Regards,
Gundolf
Computers are not everything in life. (Just a little joke)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours
ID: 918674
Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 918704 - Posted: 17 Jul 2009, 12:38:16 UTC

Yes, I have. I had to set it to 2 to be able to run a single workunit on one card while ignoring the other, since it won't feed 2 cards properly; it was just a workaround while I investigated. When set to 1, BOINC feeds 2 workunits simultaneously to device 0 rather than 1 to each. That, I believe, is a problem in the device code in BOINC.

Yes, I have had that setting in cc_config; see my listing in the first message:


I have use_all_gpus set to 1 assuming it is a true/false required.

ID: 918704
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 918746 - Posted: 17 Jul 2009, 16:07:46 UTC - in response to Message 918704.  
Last modified: 17 Jul 2009, 16:08:22 UTC

There have been reports that you do not need the latest drivers, but instead need at most 180.60.

Anything newer will give problems under all kinds of Linux.

You can also try BOINC 6.6.37 (also with at most the 180.60 drivers).
- 32bit version
- 64bit version
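
If you are unsure which driver is actually loaded, you can check from a shell (a quick sketch; the proc file is standard with NVIDIA's Linux drivers):

# version of the NVIDIA kernel module currently loaded
cat /proc/driver/nvidia/version

# what the BOINC client detected at startup; stdoutdae.txt is the
# client's standard log name, though its location varies by install
grep -i cuda stdoutdae.txt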
ID: 918746
Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 918855 - Posted: 17 Jul 2009, 22:09:15 UTC - in response to Message 918746.  

I had problems with 6.6.36, but 6.6.20 works fine for me... I also have my count set to 2, but that's because one of my GPUs is overheating (it hits about 103C after 6 hours, whereas the other card hits 60C). After the weekend (shift worker) I'm gonna do some cable management and build a little spot cooling for it.
ID: 918855
Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 918938 - Posted: 18 Jul 2009, 0:54:26 UTC - in response to Message 918746.  

I can't use the 180.60 driver. I had it when I got the GTX 285: all I did was replace my old video card with it and power on, and when the driver loaded, my GUI began blinking and doing strange things. In talking with NVIDIA, they said the driver was not compatible with the 285 and I had to upgrade. Once I upgraded, all the video problems went away. Unfortunately, I did not try running dual devices until after the driver upgrade. I have been running 6.6.37 for a few days, and I tried going back to 6.6.20, which correctly identified both cards but still gave me the dual workunits on device 0 when trying to use both cards.


ID: 918938
Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 918942 - Posted: 18 Jul 2009, 0:59:28 UTC - in response to Message 918855.  

Sounds like it may be a candidate for a little Shin-Etsu thermal compound; that stuff works wonders. My Tesla used to jump to 80C regularly. I cleaned the old compound off it and applied Shin-Etsu G751 (hard to apply but worth it; about the highest thermal conductivity available), and the Tesla now never exceeds 65C.

6.6.20 still gave me 2 workunits on a single device; with a count of 2, it put both devices on one app, which selects the last device listed, so I got 1 workunit on 1 GPU.


ID: 918942
Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 918944 - Posted: 18 Jul 2009, 1:01:40 UTC - in response to Message 918855.  

I am wondering about the kernel, though. I will try an upgrade tomorrow to see if that helps, and may even try backing up to 180.60 with the new kernel. Presently running 2.6.25; will install 2.6.29 tomorrow.
ID: 918944
Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 918981 - Posted: 18 Jul 2009, 3:17:55 UTC - in response to Message 918942.  

My problem is there's only about 1/8" space (if that) between the top card and the bottom card, so not much airflow.

How are you seeing which device the WU is being processed on?
ID: 918981
Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 919075 - Posted: 18 Jul 2009, 18:00:44 UTC - in response to Message 918981.  

Ahh. A helpful thing is to get one of those dome-shaped faucet washers and slide it between the 2 cards at the back end; there is usually enough 'give' there to do this safely. It will widen the air opening somewhat, and yeah, a fan blowing between them would help keep the top card from sucking heated air off the bottom one.

With my process list I can see which device is being used, and I also simply watch the temps of the GPUs: whichever one heats up from idle temp is the one being used for CUDA.
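
Concretely, something like this (a sketch; nvidia-smi ships with these driver releases, though its output format varies from version to version):

# which --device number each CUDA task was launched with
# (the [s] keeps grep from matching its own command line)
ps ax | grep '[s]etiathome.*device'

# per-GPU status including temperature; whichever card warms up
# from idle under load is the one doing the CUDA work
nvidia-smi -q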


ID: 919075
Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 919130 - Posted: 18 Jul 2009, 21:59:39 UTC - in response to Message 919075.  

Yeah, I figured out you must be using ps to see it, and confirmed it myself... for some reason 6.6.20 and 6.6.36 both stick it all on device 0, but 6.4.5 works correctly (one on each).

I ended up doing a little cable management (wish I had longer cables, not much room to move them around) and moved GPU1 down to slot 3 (still running 16x since I only have the two cards), and it seems to work. Did a test last night for about 6 hours and temps were at 70C (in a closed room, stupid dog; room temp was about 32C), so I can live with that (I still might look into some extra cooling... I still haven't OC'd these cards). Doing a longer (12 hour) test today to verify it's all good.

The only possible issue I see now is that GPU1 is about 1" off the PSU, but that's better than the maybe 1/8" I had between GPU0 and GPU1 before.
ID: 919130
Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 919139 - Posted: 18 Jul 2009, 22:21:13 UTC - in response to Message 919130.  

Sounds like you have enough now; 70 is fine. 6.4.5, huh? Hmm, I will have to try that. Think it will bork my existing setup, since it's so old? Does it take app_info.xml? If not, I can't use it, because I'm running optimized apps for CPU and GPU.



ID: 919139
Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 919140 - Posted: 18 Jul 2009, 22:23:15 UTC - in response to Message 919130.  

Forgot... if the PSU fan is at the rear or on the other side, that's fine, but if it faces the GPU fan it can cause air cavitation. You will have to force-feed air between them in that case to make sure a good flow is available for both.
ID: 919140
Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 919168 - Posted: 18 Jul 2009, 23:41:26 UTC - in response to Message 919130.  

Sir, you are a genius. I would never even have considered going so far back in versions! The device report is still borked (it reports 2 Teslas in the message log, and I don't know what it will report to SETI since it can't contact the servers right now), but it works just fine: one workunit to each device, both devices running around 66C, so I know for sure it's working. They idle around 48C with this heat in the room. So even if it reports 2 Teslas, it works, and that's all I care about, really. I only hope the video BIOSes don't wind up fighting, since they are vastly different versions; this Tesla is a pre-release engineering unit and has video ports attached (useless for anything other than text).

Thank you! I will stick with this version until I hear that some newer version is fixed. The only thing I did was keep the 6.6.37 boincmgr, because the 6.4.5 one would not sort any columns. Seems to work.


ID: 919168
Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 919174 - Posted: 18 Jul 2009, 23:56:04 UTC - in response to Message 919168.  

Must be something odd with the 6.6.* series... hopefully they fix it.

PSU fan blows out the bottom, so no issue there. When I get home from work today I'll get to see how well it worked.
ID: 919174
Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 919179 - Posted: 19 Jul 2009, 0:04:49 UTC - in response to Message 919174.  

Yeah. I am going to look for the BOINC bug-reporting area and see if anyone has reported this for the 6.6 series. It may be possible no one has, although that seems a bit far-fetched considering the sheer number of users.
ID: 919179
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 919199 - Posted: 19 Jul 2009, 1:25:05 UTC - in response to Message 919139.  

The 6.x.x series all use the same split setup.

Yes, you can use the app_info.xml file; it has been around for quite a long time, mostly to support platforms that do not have direct support from BOINC.

ID: 919199
Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 919285 - Posted: 19 Jul 2009, 10:42:09 UTC - in response to Message 919199.  

Thanks! I decided to forge ahead instead of waiting for an answer, and I just backed up the entire BOINC directory before switching. It took it just fine.

ID: 919285
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 919319 - Posted: 19 Jul 2009, 12:53:11 UTC

6.4.5 has one internal scheduler for both CPU and GPU work, so that one is kind of broken. You then have to add an <ncpus>CPUs+GPUs</ncpus> line (e.g. if you have 4 CPUs + 2 GPUs, the line is <ncpus>6</ncpus>) to the options section of the cc_config.xml file.

BOINC 6.6.x has separate internal CPU and GPU schedulers. The one in 6.6.20 is still broken; the one in 6.6.36 is getting there.

The reason that not all GPUs are used is the trouble of feeding them with work. The developers know about it, but since a fix would entail a large code change in both the client software and the server software, with special app classes for all the different GPUs, they've chosen the stop-gap solution of only using the best card in your system. If you then want to use all GPUs, you can tell BOINC so with the <use_all_gpus> line in cc_config.xml, as in the sketch below.
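
A minimal cc_config.xml covering both lines mentioned in this thread might look like this (a sketch; the <ncpus> line is only the 6.4.5 workaround described above, and the client must re-read its config file or be restarted to pick up changes):

<cc_config>
    <options>
        <use_all_gpus>1</use_all_gpus>
        <ncpus>6</ncpus> <!-- 6.4.5 only: CPUs + GPUs, e.g. 4 CPUs + 2 GPUs -->
    </options>
</cc_config>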

It's not pretty, but you get work done. May I remind you all that a year ago you didn't even have CUDA around here? BOINC itself took more than a year in beta test before it was ripe enough to be released to the masses. ATI detection is still coming, so I can see why the developers want to hold off on these big changes until those GPUs are also added to the fray; it won't do to do all that hard work twice. The 3 main developers are overworked enough already as it is, having to fix bugs in both the client software and the back-end software.
ID: 919319