2 video cards in linux. Boinc sees them as same device!


Questions and Answers : Unix/Linux : 2 video cards in linux. Boinc sees them as same device!

Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 920916 - Posted: 24 Jul 2009, 2:58:44 UTC - in response to Message 920841.

Have you noticed that sometimes 6.6.11 stops processing CUDA for no particular reason? It's done it a few times; I've just had to restart BOINC and it starts up again... very odd.

Ageless
Joined: 9 Jun 99
Posts: 12329
Credit: 2,633,822
RAC: 1,211
Netherlands
Message 920930 - Posted: 24 Jul 2009, 4:15:19 UTC - in response to Message 920916.

Did you know that 6.6.11 is, despite its numbering suggesting a release version, in reality an ALPHA version? In other words, it is one of the development versions, with tests and bug fixes, leading up to the next public release version, 6.6.20.

6.6.12, the version following 6.6.11, has a possible fix for the problem you're talking about. Among other changes:

- client: fix bug where if a GPU job is running, and a 2nd GPU job with an earlier deadline arrives, neither job is executed ever. Reorganized things so that scheduling of GPU jobs is done independently of CPU jobs.
The policy for GPU jobs:

* always EDF (earliest deadline first)
* jobs are always removed from memory, regardless of checkpoint (GPU memory is not paged, so it's bad to leave an idle app in memory)

All changes from 6.6.11 onwards can be found in the Change Log thread.

There you can see that when an entry says "This is a development version of BOINC.", it really is a development version of BOINC, and when it says "Public release", it isn't.
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 920932 - Posted: 24 Jul 2009, 4:31:37 UTC - in response to Message 920930.

Right, we understand that, but 6.4.5 has issues and we're trying to find the best version that solves the issues we've mentioned in this thread.

I've gone through the code changes from 6.4.5 to 6.6.11 to 6.6.36 and I *think* I've found where the problem (original issue in this thread) is, but every attempt to compile the code (per: http://boinc.berkeley.edu/trac/wiki/CompileClient) has failed at the make step in the sea directory:


cp ../../../stage//usr/local/bin/boinc BOINC/boinc
cp: cannot stat `../../../stage//usr/local/bin/boinc': No such file or directory
make: *** [BOINC/boinc] Error 1


If I could get the compile to work I could run a few tests and hammer out the specific problem and report the fix back to the developers.

Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 920992 - Posted: 24 Jul 2009, 10:35:35 UTC - in response to Message 920930.
Last modified: 24 Jul 2009, 10:40:22 UTC

Yes, but some tests were done, and 6.6.12 is where the broken app_info.xml code begins. 6.6.11 is the newest version that properly supports 2 devices. We understand there are irregularities in behavior since it is not a production release, but 6.6.20 and up are severely broken where multiple devices are concerned. It is something we are willing to put up with until a proper fix is made in the newest versions.
____________

Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 920993 - Posted: 24 Jul 2009, 10:39:01 UTC - in response to Message 920932.

I had that too when I tried. I had to manually search out the compiled binaries in the source dirs and then move them where I wanted them. That was a while ago, though, so unfortunately I don't remember a lot, but that cp error was a familiar one for me too.
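
Hunting for the built executables by hand, as described here, can be sketched in shell. This is only an illustration: `find_boinc_binaries` is a made-up helper name, and the file names it looks for (`boinc`, `boinc_client`, `boincmgr`) are taken from the error messages in this thread, on the assumption that the build leaves executables with those names somewhere in the source tree.

```shell
#!/bin/sh
# Hypothetical helper: list executable files named like the BOINC binaries
# under a source tree. The names are assumptions based on the thread's
# error output; adjust them to whatever the build actually produces.
find_boinc_binaries() {
    src_dir="${1:-.}"
    find "$src_dir" -type f \
        \( -name boinc -o -name boinc_client -o -name boincmgr \) \
        -perm -u+x 2>/dev/null
}
```

Run from the top of the source tree as `find_boinc_binaries .`. It prints nothing if the build never produced those files, so an empty result for boincmgr would point at the build itself rather than at the copy step.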

____________

Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 921007 - Posted: 24 Jul 2009, 12:38:17 UTC - in response to Message 920993.

Tried that, errored right away looking for:

cp: cannot stat `../../../stage//usr/local/bin/boincmgr': No such file or directory

But boincmgr was never created, so can't copy it over...

Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 921014 - Posted: 24 Jul 2009, 12:54:30 UTC - in response to Message 921007.

Ha, got it working... now to run some tests and see if I can get it fixed!

Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 921024 - Posted: 24 Jul 2009, 13:53:31 UTC - in response to Message 921014.

So... got past that, but can't recompile the boinc client. Keep getting:

boinc_client-client_state.o: In function `CLIENT_STATE::init()':
client_state.cpp:(.text+0x5f08): undefined reference to `curl_version'
boinc_client-http_curl.o: In function `HTTP_OP::set_speed_limit(bool, double)':
http_curl.cpp:(.text+0x10d): undefined reference to `curl_easy_setopt'
http_curl.cpp:(.text+0x132): undefined reference to `curl_easy_setopt'

But I have curl installed...

Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 921240 - Posted: 25 Jul 2009, 10:44:24 UTC - in response to Message 921024.

Maybe your path to curl or to the curl headers is different from what the BOINC code expects? It seems like a header is missing, or maybe it's a different version.
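
The `undefined reference` lines come from the link step, so one thing worth checking is whether the libcurl development files (not just the `curl` command) are present and what linker flags they report. Below is a minimal sketch, assuming either `curl-config` or `pkg-config` is available. `curl_link_flags` is a hypothetical helper name, and whether the BOINC build picks these flags up on its own is not something this thread confirms.

```shell
#!/bin/sh
# Sketch: print the linker flags that the installed libcurl reports.
# curl-config ships with libcurl's development files; pkg-config is a
# fallback. If neither knows about libcurl, the development package is
# probably missing even though the curl command itself is installed.
curl_link_flags() {
    if command -v curl-config >/dev/null 2>&1; then
        curl-config --libs
    elif command -v pkg-config >/dev/null 2>&1 && pkg-config --exists libcurl; then
        pkg-config --libs libcurl
    else
        echo "libcurl development files not found" >&2
        return 1
    fi
}
```

If this prints something like `-lcurl` but the link still fails, the flags are probably not reaching the link command, which is a different problem from a missing library.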
____________

Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 921664 - Posted: 27 Jul 2009, 13:52:25 UTC - in response to Message 920841.

Here's the modified script I use; it's pretty simple. Just run it (I've seen no harm in running it while BOINC is running, as it doesn't change anything) and it spits out something like:

VHAR on GPU: 12mr09ac.6289.7025.14.10.195
VHAR on GPU: 05dc08ae.32507.890.16.10.254
Number of CPU tasks:413
Number of GPU tasks:323
Number of VLAR tasks:330
Number of VHAR tasks:42
Total tasks: 736

You can't run it while downloading new WUs, as it can't open the WU files to read them if they aren't there yet.

Right now (330 VLAR tasks) I've moved the VHAR tasks to the GPU, hence it complains about them, but you can see the CPU has 413 tasks and there are only 330 VLAR, so I should run the rebrand script again soon.

Here's the script:

$path="client_state.xml";
open (IN, $path);
$NumOfCPUTasks=0;
$NumOfGPUTasks=0;
$NumVLAR=0;
$NumVHAR=0;
$NumGPUToNumCPU_high_limit=25;
$NumGPUToNumCPU_low_limit=0.5;
while (<IN>) {
    if( /<workunit>/ ){ #parsing result
        $trueAR=-1;#error condition
        $WUname="";
        while(<IN>){
            if( /<\/workunit>/ ){
                open (WU, "projects\/setiathome.berkeley.edu\/" .$WUname) || die "ERROR: cant open task file " . $WUname;
                while(<WU>){#reading task file and deciding where it should go
                    if( /<true_angle_range>(.*)<\/true_angle_range>/ ){
                        $trueAR=$1;
                        if( $trueAR == -1 ){ die "ERROR detected - cant determine AR value\n"; }
                        if($trueAR < 0.13){ $tasks{$WUname}=1; $NumVLAR++;}
                        elsif($trueAR > 1.127){ $tasks{$WUname}=2; $NumVHAR++; }
                        else{$tasks{$WUname}=3;}
                        last;
                    }
                }
                close(WU);
                last;
            }
            if( /<name>(.*)<\/name>/ ){
                $WUname=$1;
                #print "task:\\".$1."\\ \n";
            }
            elsif( /<version_num>603<\/version_num>/ ){$NumOfCPUTasks++;}
            elsif( /<version_num>608<\/version_num>/ ){$NumOfGPUTasks++;}
        }
    }
}
close(IN);
open (IN, $path);
while (<IN>) {
    if( /<name>(.*)<\/name>/ ){
        $WUname=$1;
        #print "task:\\".$1."\\ \n";
    }
    elsif( /<version_num>608<\/version_num>/ ){
        if($tasks{$WUname}){
            if($tasks{$WUname} == 1){ print "VLAR on GPU: " . $WUname ."\n";}
            elsif($tasks{$WUname} == 2){ print "VHAR on GPU: " . $WUname ."\n";}
        }
    }
}
close(IN);
print "Number of CPU tasks:".$NumOfCPUTasks."\n";
print "Number of GPU tasks:".$NumOfGPUTasks."\n";
print "Number of VLAR tasks:".$NumVLAR."\n";
print "Number of VHAR tasks:".$NumVHAR."\n";
if($NumOfCPUTasks!=0){ $GPU_to_CPU_ratio=$NumOfGPUTasks/$NumOfCPUTasks;}
else{ $GPU_to_CPU_ratio=1;}
if($GPU_to_CPU_ratio >$NumGPUToNumCPU_high_limit){
    print "Too many tasks allocated to GPU already ".$GPU_to_CPU_ratio."\n";}
if($GPU_to_CPU_ratio <$NumGPUToNumCPU_low_limit){
    print "Too many tasks allocated to CPU already " .$GPU_to_CPU_ratio ."\n";}
print "Total tasks: ".($NumOfCPUTasks+$NumOfGPUTasks)."\n";



OK, I'm still trying to comprehend things. The script above changes nothing; it is simply a more detailed reporter.

Here is what I get.

When I run the V5 script it shows:

Number of CPU tasks before rescheduling:223
Number of GPU tasks before rescheduling:194
Number of CPU tasks after rescheduling:223
Number of GPU tasks after rescheduling:194

There are no changes because I ran it a short while ago.

When I run your reporting script it tells me:

Number of CPU tasks:223
Number of GPU tasks:194
Number of VLAR tasks:197
Number of VHAR tasks:26
Total tasks: 417

So from this I can safely assume that right now, before any more downloads, the GPU has no VLAR or VHAR workunits that need moving?


____________

Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 921713 - Posted: 27 Jul 2009, 18:18:04 UTC - in response to Message 921664.

Yup, if there were any VLAR or VHAR on the GPU it would spit out a line saying "VLAR on GPU: <WU name>" (or "VHAR on GPU: <WU name>"). Since you have 223 on CPU and 197 VLAR/26 VHAR (197 + 26 = 223), that means only VHAR and VLAR are on the CPU, so there are no mid-range ones that need to move either.
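
The cut-offs the reporting script applies to `true_angle_range` (below 0.13 counts as VLAR, above 1.127 as VHAR, everything else as mid-range) can be tried in isolation. This is a small sketch of just that comparison, not of the script itself. `classify_ar` is a hypothetical name, and awk is used only because plain sh cannot compare decimals.

```shell
#!/bin/sh
# Sketch: classify one angle-range value with the same thresholds the
# reporting script uses (0.13 and 1.127). The helper name is made up.
classify_ar() {
    awk -v ar="$1" 'BEGIN {
        ar = ar + 0                      # force a numeric comparison
        if (ar < 0.13)       print "VLAR"
        else if (ar > 1.127) print "VHAR"
        else                 print "MID"
    }'
}
```

For example, `classify_ar 0.05` falls in the VLAR band and `classify_ar 2.5` in the VHAR band, while a value such as 0.42 is mid-range.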

Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 921748 - Posted: 27 Jul 2009, 19:36:38 UTC - in response to Message 921713.

Cool, so maybe then I won't piss people off with the VLAR killer any more :)

I guess the few computation errors I am seeing are either VLARs that are not caught by the script or true errors.

I suspect it may be the Tesla, since I reviewed a few of the errors and found all of them were on the Tesla. I might wind up replacing it.


____________

Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 921822 - Posted: 28 Jul 2009, 0:32:14 UTC - in response to Message 921748.

I noticed a download and ran the script, and it had a large number of both VLAR and VHAR listed for the GPU. Ran the V5 and it fixed it :) Now all I have to do is figure out a way to run the script automatically.

It would be a really nice addition to have an option for BOINC to run an external script after downloading, before starting any new tasks. (I know, dream on) :)

I hate cycling BOINC often, but since downloads are requested at random times, it seems that to minimize problems I should run a cron job once an hour that stops BOINC, runs the V5 script, replaces the old state file, and then restarts BOINC.
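
The hourly stop, reschedule, restart cycle could look roughly like the wrapper below. Everything concrete in it is an assumption for illustration: the data directory, the init script used to stop and start BOINC, and the `reschedule_v5.pl` file name are placeholders, and the V5 script is assumed to leave the updated client_state.xml in place when it finishes.

```shell
#!/bin/sh
# Placeholder settings (assumptions, not from the thread); override via
# the environment or edit to match the local install.
BOINC_DIR="${BOINC_DIR:-/var/lib/boinc-client}"
STOP_CMD="${STOP_CMD:-/etc/init.d/boinc-client stop}"
START_CMD="${START_CMD:-/etc/init.d/boinc-client start}"
RESCHEDULE_CMD="${RESCHEDULE_CMD:-perl reschedule_v5.pl}"

# Runs in a subshell so the cd does not affect the caller.
reschedule_once() (
    cd "$BOINC_DIR" || exit 1
    $STOP_CMD || exit 1
    status=0
    # Back up the state file first; only reschedule if the backup worked.
    if cp client_state.xml client_state.xml.bak; then
        $RESCHEDULE_CMD || status=$?
    else
        status=1
    fi
    $START_CMD || exit 1      # always try to bring BOINC back up
    exit "$status"
)
```

Saved as a script with a final `reschedule_once` line, it could be run from cron with an entry such as `0 * * * * /path/to/boinc-reschedule.sh` (the path is hypothetical). Keeping the backup copy means a bad rescheduling run can be undone by hand.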
____________

Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 921823 - Posted: 28 Jul 2009, 0:35:23 UTC - in response to Message 921822.

Wonder if I should change the topic name? Maybe to something like

2 devices in linux and keeping gpu free of hassles

or something :P


____________

Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 921935 - Posted: 28 Jul 2009, 12:15:06 UTC - in response to Message 921822.

What I've done is set mine to 10 days cache, wait until I have a bunch and then set the cache back to 5 days. Then I shut down, run the script, check to make sure it looks good (I have another modified one that will move back X VLAR or VHAR to the GPU if I think the CPU has too much work) and then restart it. Once I get down to a couple days work I'll put the cache back up and download more.

Right now I'm letting my cache clear so I can set my pflops right.

Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 921938 - Posted: 28 Jul 2009, 12:43:46 UTC - in response to Message 921935.

That sounds like it could work; however, it requires manual intervention. :) I'm lazy when it comes to computers. I deal with them all day, and the last thing I like is messing with my own computer, so I try to automate mine as much as possible to keep it hassle-free for me.

So far this hourly script run is working OK. I only had 2 computation errors overnight, which is better than the 10 or 15 I used to get.
____________

Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 922156 - Posted: 29 Jul 2009, 15:52:24 UTC - in response to Message 921938.

We're looking good now. Steady work coming in (broke 7k RAC so far). Just upgraded to CUDA 2.3 and overclocked both my 260s.

Thu 30 Jul 2009 12:47:25 AM KST CUDA devices: GeForce GTX 260 (driver version 0, CUDA version 1.3, 895MB, est. 125GFLOPS), GeForce GTX 260 (driver version 0, CUDA version 1.3, 896MB, est. 121GFLOPS)

Couldn't get the 2nd card stable at the same clocks :(

Going to run tests for a bit more, but so far it looks good. Just need to confirm long term stability at these clocks.

Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 922332 - Posted: 30 Jul 2009, 3:04:14 UTC - in response to Message 922156.

Wow, if you can keep them there, that's incredible. My GTX 285, which is an XFX OC Black Edition, shows 127 GFLOPS.



____________

Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 922333 - Posted: 30 Jul 2009, 3:06:43 UTC - in response to Message 922332.

Been going steady overnight, so I think I've got it stable. The primary card has a core clock of 755 MHz; the secondary couldn't handle that, so it's at 725 MHz (I was too lazy to find the exact max for it).

I'm very happy with these clocks, think I'll stick with this company for my next purchase. Heat is still barely over what it was before, which strikes me as odd but I guess those coolers work well.

Chuck Gorish
Joined: 19 Jun 00
Posts: 156
Credit: 29,589,106
RAC: 0
United States
Message 922383 - Posted: 30 Jul 2009, 10:45:13 UTC - in response to Message 922333.

Excellent! Yes, the coolers NVIDIA has standardized on are quite good. A bit close on the top-end cards, but still sufficient. You probably have the 260-216sp editions. They run considerably cooler than the standard 216.

see this chart:

http://en.wikipedia.org/wiki/Comparison_of_NVIDIA_Graphics_Processing_Units#cite_note-22
____________


Copyright © 2014 University of California