Setting up Linux to crunch CUDA90 and above for Windows users

Sleepy
Volunteer tester
Joined: 21 May 99
Posts: 219
Credit: 98,947,784
RAC: 28,360
Italy
Message 1978891 - Posted: 6 Feb 2019, 12:38:55 UTC

Just in case someone else experienced the same problem.

Yesterday was a bit nightmarish: every few minutes the BOINC client on one of my two machines would stop and I had to restart it manually (eventually I set up a software watchdog to automate the process, though it did not work overnight :-( ).
It usually happened in the middle of a GPU WU. It was definitely related to GPU work, since snoozing GPU work eliminated the problem.
This morning everything seems to be working as usual.

What did I change before the problem appeared? Nothing, apart from standard system updates.
What did I do to solve it? Basically nothing, apart from a couple of shutdowns/reboots that brought no immediate improvement.

I do not know whether this was caused by a particular temporary batch of Arecibo WUs that made my system hiccup.
I am not overclocking. Also, it is winter here, so it is not particularly warm.

Good crunching to everybody.
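
For anyone wanting a similar watchdog, here is a minimal sketch. It assumes the repository client running as a systemd service named boinc-client (an assumption; a client started by hand from its own folder would need a different check), and the function names are made up for illustration.

```shell
# Hypothetical watchdog sketch: restart the BOINC client service if it is down.
# SERVICE is an assumption; adjust it to whatever your install uses.
SERVICE="${SERVICE:-boinc-client}"

# Map the state string printed by `systemctl is-active` to an action.
watchdog_action() {
    case "$1" in
        active|activating) echo "ok" ;;
        *) echo "restart" ;;
    esac
}

# One watchdog pass: query the service state and restart it if needed.
check_once() {
    state="$(systemctl is-active "$SERVICE" 2>/dev/null)"
    if [ "$(watchdog_action "$state")" = "restart" ]; then
        echo "$(date): $SERVICE is '${state:-unknown}', restarting"
        systemctl restart "$SERVICE"
    fi
}

# Call check_once from root's crontab or a systemd timer, e.g. once a minute.
```

Keeping the decision in watchdog_action separate from the systemctl calls makes it easy to check the logic without touching the running client.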

Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1979060 - Posted: 7 Feb 2019, 5:46:19 UTC - in response to Message 1978891.  

Sleepy,
I had the same problem too. Like you, I tried suspending GPU work, but it still happened after about 5 minutes on my system. When I suspended network activity, I was able to finish all my WUs without a problem. I wonder if it had something to do with the stuck uploads?
BoincTasks started acting up at that time and still is, so I stopped using it.

Real Join Date:
Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1979107 - Posted: 7 Feb 2019, 15:54:54 UTC - in response to Message 1979060.  

I am getting the error message "Boinc Manager exited 3 times in [5 minutes/15 minutes], do you want to restart?". Is that the same error message you two are getting?

Tom
A proud member of the OFA (Old Farts Association).

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1979110 - Posted: 7 Feb 2019, 16:11:20 UTC - in response to Message 1979107.  


I have just taken 2 of my 5 gpus offline. Will see if the problem continues.

Tom
A proud member of the OFA (Old Farts Association).

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979118 - Posted: 7 Feb 2019, 16:43:02 UTC

If you see that kind of message, the compute primitives in ComputeCache have been corrupted. The computer is segfaulting on the application and offers to restart the app. If you delete the contents of ComputeCache and restart BOINC, it should clear up. But then investigate why the compute primitives got corrupted. Too much overclocking on the card is likely the reason.
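
One way to confirm that an application really is segfaulting is to look in the kernel log. The sketch below is only an illustration: it assumes the usual "name[pid]: segfault at ..." line format, and segfault_procs is a made-up helper name.

```shell
# Print the unique process names that appear in kernel "segfault" lines.
# Reads kernel log text on stdin, e.g. from `dmesg` or `journalctl -k`.
segfault_procs() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i == "segfault") {
                name = $(i - 1)                 # looks like "appname[1234]:"
                sub(/\[[0-9]+\]:$/, "", name)   # strip the "[pid]:" suffix
                print name
            }
        }
    }' | sort -u
}

# Typical use (reading the kernel log may need sudo):
# dmesg | segfault_procs
```

If the science application shows up in that list, the ComputeCache cleanup described in this post is worth trying.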
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1979121 - Posted: 7 Feb 2019, 16:50:31 UTC - in response to Message 1979118.  

Thank you for the diagnosis. Since I am a bit confused about the terminology, let me ask: where exactly is the "ComputeCache"? Are you talking about the folder where all the downloaded tasks from Seti are?

If the answer is yes, which is better: "reset the project", or shut down Boinc Manager and delete all the data files?

Thank you.

Tom
A proud member of the OFA (Old Farts Association).

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1979124 - Posted: 7 Feb 2019, 17:10:20 UTC - in response to Message 1979121.  

He's talking about something more specific to how the GPU interacts with the OS.

You can clear it with some commands in the Terminal or, more simply, just reboot the computer.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours


Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1979130 - Posted: 7 Feb 2019, 17:24:33 UTC - in response to Message 1979124.  

Ah, as you know, I am usually only up to "simple" solutions.

So reboot it is. Since I have dropped two of my slower GPUs, I will see whether they might be "the issue".

I had managed to get it to boot and run with 5 GPUs after turning on the "upper memory" for PCIe option in the BIOS. It is a shame I can't find that in the AMD BIOS.

Tom

Latest URL for the system under discussion is: https://setiathome.berkeley.edu/show_host_detail.php?hostid=8661108
A proud member of the OFA (Old Farts Association).

Sleepy
Volunteer tester
Joined: 21 May 99
Posts: 219
Credit: 98,947,784
RAC: 28,360
Italy
Message 1979144 - Posted: 7 Feb 2019, 18:06:59 UTC - in response to Message 1979130.  
Last modified: 7 Feb 2019, 18:07:28 UTC

After another short hiccup (while uploads were slow, but that may be a coincidence), everything is now working fine.
I was not receiving any error message; I just saw boinc-client go down.

For the record, crisis after crisis: yesterday there was also an update to libcurl, which caused my version of Boinc (the 7.4.4 by TBar) to go down as well.
So I switched to the repository client.
Today I received an update from the special repository with the libcurl34 package, and that sorted this out too.

After months of everything running smoothly by itself and getting into trouble only when experimenting too "hard", these days were a bit shaky without me doing anything to cause it...

Good crunching!

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979168 - Posted: 7 Feb 2019, 20:16:07 UTC - in response to Message 1979121.  
Last modified: 7 Feb 2019, 20:34:58 UTC

No, the ComputeCache is the folder, in both Linux and Windows, where the compute kernels or primitives are generated for OpenCL and CUDA tasks. Ever notice the messages in stderr.txt on the first task computed with a new driver or new card? Something along the lines of "can't find so and so file, . . . .generating". That is the application generating the compute primitives for the API platform. It only has to do it once for each driver or card, unless they get buggered up, in which case any task referencing the corrupted files will fail.

The folder or directory is in different places for each OS. For Windows the folder or directory is located in C:\Users\[User_Name]\AppData\Roaming\NVIDIA\ComputeCache

For Linux the ComputeCache is located in the hidden folder in /home/[login_user_name]/.nv/ComputeCache

Just delete all the folders and the index file in the directory. The primitives get regenerated the first time a gpu task is started after restarting BOINC.
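
On Linux the cleanup can be scripted. This is a minimal sketch, assuming the default cache location given above and GNU find; clear_compute_cache is a made-up helper name, and BOINC should be stopped before running it.

```shell
# Remove everything inside the NVIDIA ComputeCache but keep the directory itself.
# The optional first argument overrides the default location.
clear_compute_cache() {
    cache_dir="${1:-$HOME/.nv/ComputeCache}"
    if [ ! -d "$cache_dir" ]; then
        echo "no cache directory at $cache_dir"
        return 0
    fi
    # -mindepth 1 spares the directory itself; -delete removes contents depth-first.
    find "$cache_dir" -mindepth 1 -delete
    echo "cleared $cache_dir"
}

# Typical use, after shutting BOINC down:
# clear_compute_cache
```

The cache is rebuilt on the first GPU task after BOINC restarts, so that first task may take a little longer than usual.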
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979169 - Posted: 7 Feb 2019, 20:17:58 UTC - in response to Message 1979144.  
Last modified: 7 Feb 2019, 20:23:23 UTC

You have to be aware of the client's dependence on libcurl. TBar's version was compiled on older distros and uses the libcurl3 library. But the latest distros past 18.10 deprecated libcurl3 and removed it from the sources. 18.04 straddles both camps: it ships with libcurl4 stock but still has the older libcurl3 library in its software sources for downloading and substituting. Any new package installation may remove libcurl3 and install the stock libcurl4, so you have to watch what a package intends to install and what it is going to remove.

One way around this issue, as you discovered, is to use the curl34 PPA package, which ships a libcurl4 library that has both libcurl3 and libcurl4 in the same library.
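
To see what a package is going to remove before committing to it, apt-get can do a dry run with -s (simulate). The helper below is a small sketch that filters that output; list_removals is a made-up name and some-package is a placeholder.

```shell
# Print the packages that a simulated apt-get run would remove.
# Reads `apt-get -s ...` output on stdin; removal lines start with "Remv".
list_removals() {
    awk '$1 == "Remv" { print $2 }'
}

# Typical use (nothing is actually installed or removed):
# apt-get -s install some-package | list_removals
```

If libcurl3 shows up in that list, the install would break a client that depends on it.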
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1979190 - Posted: 7 Feb 2019, 22:08:07 UTC - in response to Message 1979169.  

Thanks Keith,
Reloaded libcurl3 and that did the trick. YAY

Real Join Date:
Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.

J. Mileski
Volunteer tester
Joined: 9 Jun 02
Posts: 632
Credit: 172,116,532
RAC: 572
United States
Message 1979192 - Posted: 7 Feb 2019, 22:38:14 UTC - in response to Message 1979169.  

To make it easy for others:
sudo add-apt-repository ppa:xapienz/curl34
sudo apt-get update
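
After a change like this, it can help to check which libcurl packages are actually installed. A minimal sketch, assuming a Debian/Ubuntu system with dpkg-query; installed_libcurl is a made-up helper name.

```shell
# Print the names of packages whose dpkg status is "install ok installed".
# Reads "name status..." lines on stdin, as produced by the dpkg-query call below.
installed_libcurl() {
    awk '$2 == "install" && $3 == "ok" && $4 == "installed" { print $1 }'
}

# Typical use:
# dpkg-query -W -f='${Package} ${Status}\n' 'libcurl*' | installed_libcurl
```

Packages that were removed but left config files behind are reported with a different status and are skipped by the filter.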


Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1979203 - Posted: 8 Feb 2019, 0:02:02 UTC

I did an update on an Ubuntu 14 computer earlier today and noticed it loaded a new libcurl3 with the LibreOffice update.
That could be what is breaking things on Ubuntu 18.

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979220 - Posted: 8 Feb 2019, 3:00:47 UTC

No, the stock libcurl library is libcurl4 in Ubuntu 18.04 and every release since then. That is why the very first installation instruction in TBar's BOINC versions says you have to install the older libcurl3 library to satisfy the dependency of his client. His clients were statically linked when compiled on Ubuntu 16, I believe, where libcurl3 was the default library. That is also why the manager needs the libwebkitgtk-1.0 library: the manager is statically compiled with wxWidgets.

If the LibreOffice update updated the libcurl library to libcurl4 on Ubuntu 14.04, then you would have run into the same issue. At least with that older distro, you can put back the libcurl3 library with no problems. Or use the curl34 PPA library.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1979250 - Posted: 8 Feb 2019, 5:02:00 UTC

Starting over again. I have a clean install of Lubuntu without allowing any of the "updates since the image was created" to be applied.

I am going to see if that clears all the mysterious "system errors" I have been getting as well as all the crap that seems to be determined to "rain on my parade" :)

Just think, without TBar's efforts after petri's creative programming, I would probably still be running Windows 10 and the stock apps on ALL my computers instead of just one or two :)

Tom
A proud member of the OFA (Old Farts Association).

Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1981033 - Posted: 18 Feb 2019, 19:11:19 UTC - in response to Message 1979250.  

I have a question.
I switched Boinc from 7.8.3 to 7.4.44.
Everything is working fine, except that I can't get it to raise my WU limit. I tried changing the minimum work and max work buffer settings, but with no luck. My settings for those are 10.00 and 0.10.
I would appreciate any help.
Thanks

Real Join Date:
Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.

Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1981034 - Posted: 18 Feb 2019, 19:20:18 UTC - in response to Message 1981033.  

We are all still limited to 100 tasks per device.

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1981037 - Posted: 18 Feb 2019, 19:36:13 UTC - in response to Message 1981033.  

I think you might be confused. We are still limited by the servers to 100 tasks per device. What the 7.4.44 client allows is raising the maximum number of tasks per host to 3000, up from the standard 1000 that 7.8.3 allows. Getting more tasks onto the host requires rescheduling. Visit the GUPPI rescheduler thread to read how to reschedule.
https://setiathome.berkeley.edu/forum_thread.php?id=79954&sort_style=5&start=675
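
To put numbers on the two limits, here is a rough sketch of the arithmetic. It assumes the figures quoted in this post (100 tasks per device from the servers, a per-host ceiling of 1000 or 3000 depending on the client); the function names and the idea of "headroom" are illustrative only, and what counts as a device is up to the server.

```shell
PER_DEVICE_CAP=100   # server-side limit per device, as quoted in this thread

# Tasks the servers will hand out directly for a given number of devices.
server_task_cap() {
    echo $(( $1 * PER_DEVICE_CAP ))
}

# Room left under the client's per-host ceiling that rescheduling could fill.
# Arguments: number of devices, per-host ceiling (e.g. 1000 or 3000).
reschedule_headroom() {
    cap="$(server_task_cap "$1")"
    if [ "$2" -gt "$cap" ]; then
        echo $(( $2 - cap ))
    else
        echo 0
    fi
}

# Example: server_task_cap 5; reschedule_headroom 5 3000
```

Changing the work buffer settings does not move either number, which is why those settings appeared to have no effect.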
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1981060 - Posted: 18 Feb 2019, 22:33:56 UTC - in response to Message 1981037.  

Thanks, Keith
Rescheduler is up and working.

Real Join Date:
Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.