Setting up Linux to crunch CUDA90 and above for Windows users

Stephen "Heretic" · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2001110 - Posted: 4 Jul 2019, 23:56:54 UTC - in response to Message 2001099.  

Haha, I was about to buy a 1060...but then I remembered the mobo only supports PCIe 2.0. This was one of the "better" cards I could find at Micro Center. https://www.msi.com/Graphics-card/GT-710-2GD3H-LP.html Nice and cheap...and thankfully they still had some variety to choose from that worked with PCIe 2.0! Only bad thing...this one is passive. I made sure to upgrade the case fans. Hopefully that's all I need to do. Will find out when the temps rise due to GPU tasks.


. . The bad news is that the GTX 1060, like most of the nVidia cards, is backwards compatible and would have run fine on the PCIe gen 2.0 mobo, just at gen 2.0 speed, obviously :)

Stephen

:)
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2001111 - Posted: 4 Jul 2019, 23:58:16 UTC - in response to Message 2001103.  

Yes, the beta r3602 app doesn't seem much faster than the stock r3584 8.22 Linux SoG app. I have a hunch the zi3v app will be much faster even if you are forced to run with unroll 1 or 2 because of only 1GB of memory.

Really curious about that CUDA60 zi3v app in relation to the stock SoG app for Linux clients. Think that might be another great reason to persuade the entry level card hosts over from Windows.


. . Well the GT730 only has 2 CUs so unroll 2 is it ... I don't know about the 720.

Stephen

. .

This is what TBar wrote in his README_x41p_zi3v.txt file in the /DOCS directory of the archive. It seems to say you can get away with unroll 6 on a 2GB card, though he does mention that "testing is required".

So it might be best to start by removing the unroll setting and letting the app use the default autotune, then proceed from there with an unroll override of 2 to 6 and see what works (a fuller app_info sketch follows the quoted sample below).

For best use;
1) Run One Task per GPU
2) The commandline for -unroll should be used in app_info.xml if you have problems with autotune.
New with this version is autotune, this will automatically set the unroll to match your compute units.
The commandline is set to default, unroll autotune.
If you have a GPU with One GB of vRam you need to manually set the -unroll to 1 or 2.
-unroll 1
If you have a GPU with Two GBs of vRam you may not be able to use over -unroll 6, testing is required.
-unroll 6
3) If you wish to use 100% CPU per task, add the command -nobs to the app_info.xml.

A sample cmdline would be;
<version_num>802</version_num>
<plan_class>cuda60</plan_class>
<cmdline>-unroll 6 -nobs</cmdline>
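
For context, that cmdline lives inside the <app_version> block of app_info.xml. Here is a minimal sketch of the whole block, assuming the cuda60 plan class from the sample above and the executable and library file names shipped in the zi3v archive (check your archive for the exact names), with a full CPU core reserved since -nobs is in the cmdline. Treat it as a starting point, not a tested config:

<app_version>
    <app_name>setiathome_v8</app_name>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>802</version_num>
    <plan_class>cuda60</plan_class>
    <cmdline>-unroll 6 -nobs</cmdline>
    <coproc>
        <type>NVIDIA</type>
        <count>1</count>
    </coproc>
    <avg_ncpus>1.0</avg_ncpus>
    <max_ncpus>1.0</max_ncpus>
    <file_ref>
        <file_name>setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda60</file_name>
        <main_program/>
    </file_ref>
    <file_ref>
        <file_name>libcudart.so.6.0</file_name>
    </file_ref>
    <file_ref>
        <file_name>libcufft.so.6.0</file_name>
    </file_ref>
</app_version>

Drop the -unroll argument entirely to let autotune pick the value, or set -unroll 1 or 2 on a 1GB card, per the README above.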

Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2001114 - Posted: 5 Jul 2019, 0:01:47 UTC

I wouldn't worry about running a gpu in a PCIe Gen 2.0 mobo, just like I don't worry about how many lanes the gpu gets. I have cards running in x16, x8 and x4 slots and it makes nary a difference in compute times. Our Seti tasks just don't use or need much bandwidth. If you are actually going to use a card for video, like gaming, then it does make a difference.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Loren Datlof

Joined: 24 Jan 14
Posts: 73
Credit: 19,652,385
RAC: 0
United States
Message 2001184 - Posted: 5 Jul 2019, 16:59:16 UTC
Last modified: 5 Jul 2019, 17:11:31 UTC

I set up my host https://setiathome.berkeley.edu/show_host_detail.php?hostid=8702456 which has a GT 720 (1GB) and a GT 730 (2GB) to run the x41p_zi3v special app with the following results: The 720 gets computation errors while the 730 works fine. So now I am running just one GPU.

When the 720 was erroring out I set unroll to 1 and I also tried it without an unroll command. I am not sure why I was getting computational errors on the 720.

The x41p_zi3v app seems to have cut off about a half hour of computational time versus the Beta SOG app (45 minutes versus 75 minutes). This is a very small sample size and the blc63 WUs seem to take an hour and a half.

Any suggestions on how to get the 720 to work? Is it possible to have the 720 running the Beta SOG app while the 730 runs the x41p_zi3v app?
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2001186 - Posted: 5 Jul 2019, 17:20:36 UTC - in response to Message 2001184.  

I am not sure why I was getting computational errors on the 720.
The error message is

Cuda error 'cudaMalloc((void**) &dev_tmp_potP2' in file 'cuda/cudaAcceleration.cu' in line 644 : out of memory.
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2001188 - Posted: 5 Jul 2019, 17:31:18 UTC - in response to Message 2001184.  

I set up my host https://setiathome.berkeley.edu/show_host_detail.php?hostid=8702456 which has a GT 720 (1GB) and a GT 730 (2GB) to run the x41p_zi3v special app with the following results: The 720 gets computation errors while the 730 works fine. So now I am running just one GPU.

When the 720 was erroring out I set unroll to 1 and I also tried it without an unroll command. I am not sure why I was getting computational errors on the 720.

The x41p_zi3v app seems to have cut off about a half hour of computational time versus the Beta SOG app (45 minutes versus 75 minutes). This is a very small sample size and the blc63 WUs seem to take an hour and a half.

Any suggestions on how to get the 720 to work? Is it possible to have the 720 running the Beta SOG app while the 730 runs the x41p_zi3v app?

Thanks for the update Loren. Happy to see some results from that app finally. The reason why the 720 always gets errors is printed right in each stderr.txt.
Cuda error 'cudaMalloc((void**) &dev_tmp_potP2' in file 'cuda/cudaAcceleration.cu' in line 644 : out of memory.


So the app cannot work in just 1 GB of video memory but requires 2GB. Looks like you are not using the -nobs parameter. If you reduced your cpu usage to free up a cpu core, you could try the parameter and see if it makes any difference. It should . . . . . by how much . . . . who knows?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2001190 - Posted: 5 Jul 2019, 17:58:03 UTC

I see that ThePHX264's host has completed gpu tasks also. He used the cmdline I originally put into the app_info, -unroll 6 and -nobs, with no issues it seems. He too had to crunch the long running BLC63, 53 and 43 tasks. Curious how the app would handle the shorter tasks we see regularly, like the BLC34 series or some VHAR Arecibo tasks.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Loren Datlof

Joined: 24 Jan 14
Posts: 73
Credit: 19,652,385
RAC: 0
United States
Message 2001191 - Posted: 5 Jul 2019, 18:12:37 UTC - in response to Message 2001188.  
Last modified: 5 Jul 2019, 18:13:45 UTC

I set up my host https://setiathome.berkeley.edu/show_host_detail.php?hostid=8702456 which has a GT 720 (1GB) and a GT 730 (2GB) to run the x41p_zi3v special app with the following results: The 720 gets computation errors while the 730 works fine. So now I am running just one GPU.

When the 720 was erroring out I set unroll to 1 and I also tried it without an unroll command. I am not sure why I was getting computational errors on the 720.

The x41p_zi3v app seems to have cut off about a half hour of computational time versus the Beta SOG app (45 minutes versus 75 minutes). This is a very small sample size and the blc63 WUs seem to take an hour and a half.

Any suggestions on how to get the 720 to work? Is it possible to have the 720 running the Beta SOG app while the 730 runs the x41p_zi3v app?

Thanks for the update Loren. Happy to see some results from that app finally. The reason why the 720 always gets errors is printed right in each stderr.txt.
Cuda error 'cudaMalloc((void**) &dev_tmp_potP2' in file 'cuda/cudaAcceleration.cu' in line 644 : out of memory.


So the app cannot work in just 1 GB of video memory but requires 2GB. Looks like you are not using the -nobs parameter. If you reduced your cpu usage to free up a cpu core, you could try the parameter and see if it makes any difference. It should . . . . . by how much . . . . who knows?

Thanks Keith and Richard for your responses.

I am using the -nobs parameter. When I run top the CPU usage is 100%. Here is the relevant part of my app_info.xml file. I changed to -unroll 6 when just using the 730.

<app>
    <name>setiathome_v8</name>
</app>
<file_info>
    <name>setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda60</name>
    <executable/>
</file_info>
<file_info>
    <name>libcudart.so.6.0</name>
</file_info>
<file_info>
    <name>libcufft.so.6.0</name>
</file_info>
<app_version>
    <app_name>setiathome_v8</app_name>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>802</version_num>
    <plan_class>cuda60</plan_class>
    <cmdline>-nobs</cmdline>
    <coproc>
        <type>NVIDIA</type>
        <count>1</count>
    </coproc>
    <avg_ncpus>0.1</avg_ncpus>
    <max_ncpus>0.1</max_ncpus>
    <file_ref>
        <file_name>setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda60</file_name>
        <main_program/>
    </file_ref>
    <file_ref>
        <file_name>libcudart.so.6.0</file_name>
    </file_ref>
    <file_ref>
        <file_name>libcufft.so.6.0</file_name>
    </file_ref>
</app_version>


I noticed that the CPU runs the blc63 WUs faster than the GPUs, so I am probably better off running three cores of the i5.

I have been able to find the GTX 1050 Ti for around $50-$60 on eBay and Craigslist. But for this computer I need the mini (low-profile) version, as this is an SFF computer and the one I use daily. Right now those are going for twice that price on eBay and none are to be found on Craigslist. The hunt continues...
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2001192 - Posted: 5 Jul 2019, 18:41:15 UTC - in response to Message 2001191.  

Yes, you need to reserve a cpu core for the gpu task if you use the -nobs parameter. If you don't, the gpu task will get starved for time slices and the crunch time will get extended. You could either bump up the cpu resource allocation to <avg_ncpus>1.0</avg_ncpus> and <max_ncpus>1.0</max_ncpus> in the gpu section of your app_info (or in an app_config), or you could reduce your cpu usage to 75% in Local Preferences so that only 3 of your 4 cpu cores are used.

I saw that for some task types the cpu outperforms the gpu. But I don't know if that was before the addition of the -nobs or after. If you are running at 100% cpu usage, both task types are going to struggle to get enough time slices.
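
As a concrete example, the app_config.xml route would look something like the sketch below, assuming the cuda60 plan class from your app_info (a sketch of the idea, not a tested file). It goes in the projects/setiathome.berkeley.edu directory and gets picked up with Options > Read config files in the Manager, or on a client restart.

<app_config>
    <app_version>
        <app_name>setiathome_v8</app_name>
        <plan_class>cuda60</plan_class>
        <avg_ncpus>1.0</avg_ncpus>
        <ngpus>1</ngpus>
    </app_version>
</app_config>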
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Loren Datlof

Joined: 24 Jan 14
Posts: 73
Credit: 19,652,385
RAC: 0
United States
Message 2001202 - Posted: 5 Jul 2019, 19:33:56 UTC - in response to Message 2001192.  

Yes, you need to reserve a cpu core for the gpu task if you use the -nobs parameter. If you don't, the gpu task will get starved for time slices and the crunch time will get extended. You could either bump up the cpu resource allocation to <avg_ncpus>1.0</avg_ncpus> and <max_ncpus>1.0</max_ncpus> in the gpu section of your app_info (or in an app_config), or you could reduce your cpu usage to 75% in Local Preferences so that only 3 of your 4 cpu cores are used.

I saw that for some task types the cpu outperforms the gpu. But I don't know if that was before the addition of the -nobs or after. If you are running at 100% cpu usage, both task types are going to struggle to get enough time slices.


I have set the computing preferences in the boinc manager to use at most 50% of the CPUs. This results in the following CPU core usage:

Seti GPU app = 100%
Seti CPU app = 100%
Einstein GPU app = 100%

The remaining CPU is at less than 5%. I think I am OK and am not starving any of the apps.
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2001205 - Posted: 5 Jul 2019, 19:54:06 UTC - in response to Message 2001202.  

I'm still seeing the host with both the 720 and 730 cards installed in your latest reported gpu tasks. I thought you said you pulled the 720 so it wouldn't error out any more gpu tasks. There is a way to exclude the 720 from Seti in cc_config so it won't be used. That way you could use it for other projects, or maybe use the 720 to drive the monitor and have the 730 just crunch.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Loren Datlof

Joined: 24 Jan 14
Posts: 73
Credit: 19,652,385
RAC: 0
United States
Message 2001209 - Posted: 5 Jul 2019, 20:03:38 UTC

I am using the 720 card for Einstein now. Previously I set the <use_all_gpus> to 0 and it stopped using the 720 so there was no need to pull it.
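
For anyone following along, that option sits in the <options> section of cc_config.xml in the BOINC data directory. A minimal sketch (with 0 the client only uses the most capable GPU, which is presumably why the 720 went idle here):

<cc_config>
    <options>
        <use_all_gpus>0</use_all_gpus>
    </options>
</cc_config>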
Stephen "Heretic" · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2001213 - Posted: 5 Jul 2019, 20:34:18 UTC - in response to Message 2001191.  
Last modified: 5 Jul 2019, 20:39:35 UTC

Thanks Keith and Richard for your responses.
I am using the -nobs parameter. When I run top the CPU usage is 100%. Here is the relevant part of my app_info.xml file. I changed to -unroll 6 when just using the 730.
I noticed that the CPU runs the blc63 WUs faster than the GPUs, so I am probably better off running three cores of the i5.
I have been able to find the GTX 1050 Ti for around $50-$60 on eBay and Craigslist. But for this computer I need the mini (low-profile) version, as this is an SFF computer and the one I use daily. Right now those are going for twice that price on eBay and none are to be found on Craigslist. The hunt continues...


. . Hi, like you my machine in this case is an HP SFF 8000 Elite, so I had difficulty sourcing cards for it. I found only 2 manufacturers supporting this format in GTX 1050 Ti cards, MSI and Gigabyte, and I went with MSI both for the GT730 and the GTX 1050 Ti that replaced it. The times I quoted previously for tasks were for the then plentiful 'normal' Arecibo WUs. I used them as benchmarks to compare apples with apples. At the time the 'old' Blc04 tasks were taking about 10 mins longer, i.e. around 37 to 38 mins, and I would expect the typical examples of the current 'new' GBT formats to take no longer, except for the occasional super long runners.

. . BUT, and I probably should have mentioned this in the other message, my GT730 is a "smarter than the average bear" version. It is factory clocked to 1006MHz instead of the Nvidia norm of 902MHz, plus it has 2GB of GDDR5 ram, not the usual DDR3 ram. Overall this gave it about a 20 to 30% speed advantage over its more typical cousins. But at the time (and we are talking almost 2 years ago) more run-of-the-mill 730s were taking about 45 to 50 mins per task. Based on this I would have expected you to get run times of no more than that.

. . My HP is only a Core2 Duo so I was running with NO CPU crunching and -nobs set. I had -unroll set to 2 and that was about it. In your case that would translate to 'Number of CPUS' equals 50% and turn -unroll back to 2. Before you abandon GPU crunching you might try those settings to see if it does any good for you. It would be interesting to find out.
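
. . In app_info terms that suggestion is just the commandline, something like

<cmdline>-unroll 2 -nobs</cmdline>

. . plus 'Use at most 50% of the CPUs' in the manager's computing preferences. (Just a sketch of the idea; keep whatever else is already in your app_version block.)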

. . For reference that machine now with the 1050ti and still using those settings on Cuda90 is here ...

https://setiathome.berkeley.edu/show_host_detail.php?hostid=8222433

Stephen

? ?
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2001217 - Posted: 5 Jul 2019, 21:19:09 UTC - in response to Message 2001209.  

You can also just exclude the GT 720 from Seti with an exclude_gpu statement in cc_config.xml. You would use whatever device number BOINC assigns the GT 720 in the Event Log at startup to identify the device.

<exclude_gpu>
<url>http://setiathome.berkeley.edu/</url>
<device_num>1</device_num>
<type>NVIDIA</type>
</exclude_gpu>
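
For completeness, that block lives inside the <options> section, so the whole cc_config.xml would look something like this (device_num 1 is only an example; use whatever number the Event Log reports for the GT 720 on your host):

<cc_config>
    <options>
        <exclude_gpu>
            <url>http://setiathome.berkeley.edu/</url>
            <device_num>1</device_num>
            <type>NVIDIA</type>
        </exclude_gpu>
    </options>
</cc_config>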
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Loren Datlof

Joined: 24 Jan 14
Posts: 73
Credit: 19,652,385
RAC: 0
United States
Message 2001221 - Posted: 5 Jul 2019, 21:32:53 UTC - in response to Message 2001217.  

You can also just exclude the GT 720 from Seti with an exclude_gpu statement in cc_config.xml. You would use whatever device number BOINC assigns the GT 720 in the Event Log at startup to identify the device.

<exclude_gpu>
<url>http://setiathome.berkeley.edu/</url>
<device_num>1</device_num>
<type>NVIDIA</type>
</exclude_gpu>


That is what I have done.
Loren Datlof

Joined: 24 Jan 14
Posts: 73
Credit: 19,652,385
RAC: 0
United States
Message 2001223 - Posted: 5 Jul 2019, 21:38:35 UTC - in response to Message 2001213.  
Last modified: 5 Jul 2019, 21:41:10 UTC

Thanks Keith and Richard for your responses.
I am using the -nobs parameter. When I run top the CPU usage is 100%. Here is the relevant part of my app_info.xml file. I changed to -unroll 6 when just using the 730.
I noticed that the CPU runs the blc63 WUs faster than the GPUs, so I am probably better off running three cores of the i5.
I have been able to find the GTX 1050 Ti for around $50-$60 on eBay and Craigslist. But for this computer I need the mini (low-profile) version, as this is an SFF computer and the one I use daily. Right now those are going for twice that price on eBay and none are to be found on Craigslist. The hunt continues...


. . Hi, like you my machine in this case is an HP SFF 8000 Elite, so I had difficulty sourcing cards for it. I found only 2 manufacturers supporting this format in GTX 1050 Ti cards, MSI and Gigabyte, and I went with MSI both for the GT730 and the GTX 1050 Ti that replaced it. The times I quoted previously for tasks were for the then plentiful 'normal' Arecibo WUs. I used them as benchmarks to compare apples with apples. At the time the 'old' Blc04 tasks were taking about 10 mins longer, i.e. around 37 to 38 mins, and I would expect the typical examples of the current 'new' GBT formats to take no longer, except for the occasional super long runners.

. . BUT, and I probably should have mentioned this in the other message, my GT730 is a "smarter than the average bear" version. It is factory clocked to 1006MHz instead of the Nvidia norm of 902MHz, plus it has 2GB of GDDR5 ram, not the usual DDR3 ram. Overall this gave it about a 20 to 30% speed advantage over its more typical cousins. But at the time (and we are talking almost 2 years ago) more run-of-the-mill 730s were taking about 45 to 50 mins per task. Based on this I would have expected you to get run times of no more than that.

. . My HP is only a Core2 Duo so I was running with NO CPU crunching and -nobs set. I had -unroll set to 2 and that was about it. In your case that would translate to 'Number of CPUS' equals 50% and turn -unroll back to 2. Before you abandon GPU crunching you might try those settings to see if it does any good for you. It would be interesting to find out.

. . For reference that machine now with the 1050ti and still using those settings on Cuda90 is here ...

https://setiathome.berkeley.edu/show_host_detail.php?hostid=8222433

Stephen

? ?

Hi, my computer is an HP 8200 Elite SFF with an i5 CPU. Right now I have unroll set to 6.

It is hard to compare how well the GPU is running because of all the different WUs we are getting lately. I will have to give it some time and see how things shake out.
Stephen "Heretic" · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2001225 - Posted: 5 Jul 2019, 22:04:28 UTC - in response to Message 2001223.  

. . My HP is only a Core2 Duo so I was running with NO CPU crunching and -nobs set. I had -unroll set to 2 and that was about it. In your case that would translate to 'Number of CPUS' equals 50% and turn -unroll back to 2. Before you abandon GPU crunching you might try those settings to see if it does any good for you. It would be interesting to find out.
Stephen

Hi, my computer is an HP 8200 Elite SFF with an i5 CPU. Right now I have unroll set to 6.
It is hard to compare how well the GPU is running because of all the different WUs we are getting lately. I will have to give it some time and see how things shake out.

. . Yep, the number of different tape series and the variability within tape series make it hard to establish a 'norm' at the moment. But please, indulge me, try setting -unroll to 2 and see if there is any change/improvement in the card's performance. Just to answer the question. :)

Stephen

? ?
Loren Datlof

Joined: 24 Jan 14
Posts: 73
Credit: 19,652,385
RAC: 0
United States
Message 2001229 - Posted: 5 Jul 2019, 22:44:39 UTC - in response to Message 2001225.  


. . Yep, the number of different tape series and the variability within tape series make it hard to establish a 'norm' at the moment. But please, indulge me, try setting -unroll to 2 and see if there is any change/improvement in the card's performance. Just to answer the question. :)

Stephen

? ?


You have been indulged. -unroll has been set to 2.
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2001241 - Posted: 6 Jul 2019, 0:42:11 UTC - in response to Message 2001229.  

Very few examples so far with unroll = 2, but what I have observed in the limited data set is that there is no difference between unroll = 6 and unroll = 2.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
elec999 Project Donor

Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2001247 - Posted: 6 Jul 2019, 1:06:39 UTC

What kind of results are you guys getting from the 1050 Ti? I saw these cards on eBay for $28.99. Too good to be true?