Setting up Linux to crunch CUDA90 and above for Windows users

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2028590 - Posted: 20 Jan 2020, 1:44:51 UTC
Last modified: 20 Jan 2020, 1:47:38 UTC

I just got done chasing up and down because my Linux box screen/keyboard was freezing.

It turned out I had the "when in use" memory setting high enough that the swap file was getting hit (I could hear the HD beating pretty hard).

I haven't had any trouble since I turned it down to 75%, although some BOINC CPU tasks from Einstein@Home started pausing with "waiting for memory".

I expect it is a good argument for a small SSD and/or doubling my memory.

Tom
A proud member of the OFA (Old Farts Association).

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2028592 - Posted: 20 Jan 2020, 1:50:05 UTC - in response to Message 2028590.  

What non-BOINC things were running that were causing you to exceed the system memory? How much memory was in the system?
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2028594 - Posted: 20 Jan 2020, 1:55:40 UTC - in response to Message 2028592.  
Last modified: 20 Jan 2020, 2:01:49 UTC

What non-BOINC things were running that were causing you to exceed the system memory? How much memory was in the system?


It was happening right after I started up the Boinc clients. The Boinc Manager would display them and then about 10-15 sec later the mouse would freeze.

I have two sticks of 8 GB memory. I suppose I could add two more for a while but where is the fun in that? :)

Haven't had an issue since I reduced the allowed memory when the user is active.


Tom
A proud member of the OFA (Old Farts Association).

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2028599 - Posted: 20 Jan 2020, 2:22:00 UTC - in response to Message 2028594.  

strange that you would be hitting memory limits when the system is not really doing anything but BOINC. sitting at the computer at the desktop doesn't really add much, unless you had a lot of web browser windows open.

Einstein Gamma-ray with 7 GPUs running only uses about 3GB of system memory on my "miner" type system which also has 16GB total, running 1 WU at a time. and that goes up to 10-11GB when running SETI with the mutex app, 2 at a time. I have my compute preferences set to allow up to 90% memory when in use (or not in use). you shouldnt be seeing waiting for memory messages unless you're hitting the boinc memory limit that is set. and you shouldnt be hitting swap unless your system as a whole is exceeding that 16GB. next time it happens open up htop and see what the system memory use is.
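
For anyone who would rather script the check than eyeball htop, here is a minimal sketch in Python. It assumes a Linux box that exposes /proc/meminfo; the helper names parse_meminfo and memory_summary are made up for illustration and are not part of BOINC or htop.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Return (used_gb, total_gb, swap_used_gb) from a parsed meminfo dict."""
    kb_per_gb = 1024 * 1024
    total = info["MemTotal"]
    # MemAvailable is the kernel's estimate of memory usable without swapping.
    used = total - info["MemAvailable"]
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return used / kb_per_gb, total / kb_per_gb, swap_used / kb_per_gb


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # Linux only
        used, total, swap = memory_summary(parse_meminfo(meminfo.read_text()))
        print(f"RAM used: {used:.1f} of {total:.1f} GB, swap used: {swap:.1f} GB")
```

If the swap figure climbs while BOINC is starting up, the system as a whole is over its RAM, which is a different problem from tasks pausing at the BOINC memory limit.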
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1861
Credit: 268,616,081
RAC: 1,349
United States
Message 2028623 - Posted: 20 Jan 2020, 8:19:28 UTC - in response to Message 2028599.  

strange that you would be hitting memory limits
Very strange, indeed.
For reference's sake, Tom, here you can see what memory each of my boxes is using. Worst case is ~5G used, on a system with 7 GPUs.

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2028629 - Posted: 20 Jan 2020, 10:02:35 UTC - in response to Message 2028599.  
Last modified: 20 Jan 2020, 10:10:32 UTC

strange that you would be hitting memory limits when the system is not really doing anything but BOINC. sitting at the computer at the desktop doesn't really add much, unless you had a lot of web browser windows open.

Einstein Gamma-ray with 7 GPUs running only uses about 3GB of system memory on my "miner" type system which also has 16GB total, running 1 WU at a time. and that goes up to 10-11GB when running SETI with the mutex app, 2 at a time. I have my compute preferences set to allow up to 90% memory when in use (or not in use). you shouldnt be seeing waiting for memory messages unless you're hitting the boinc memory limit that is set. and you shouldnt be hitting swap unless your system as a whole is exceeding that 16GB. next time it happens open up htop and see what the system memory use is.



https://wp.me/p5CGc5-bD1
Application: Gamma-ray pulsar search #5 1.08 (FGRPSSE)
Name: LATeah1002F_1320.0_98332_0.0
State: Waiting for memory
Received: Mon 20 Jan 2020 03:52:15 AM CST
Report deadline: Mon 03 Feb 2020 03:52:14 AM CST
Estimated computation size: 105,000 GFLOPs
CPU time: 00:00:09
CPU time since checkpoint: 00:00:09
Elapsed time: 00:00:10
Estimated time remaining: 08:11:05
Fraction done: 0.061%
Virtual memory size: 1.08 GB
Working set size: 782.45 MB
Directory: slots/9
Process ID: 22474
Executable: hsgamma_FGRP5_1.08_x86_64-pc-linux-gnu__FGRPSSE


This is a task waiting for memory.
A proud member of the OFA (Old Farts Association).

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2028639 - Posted: 20 Jan 2020, 13:35:38 UTC - in response to Message 2028629.  

That’s not really helpful. I’m curious to see how much memory the whole computer is using and what tasks are using it.

Run htop
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1861
Credit: 268,616,081
RAC: 1,349
United States
Message 2028681 - Posted: 20 Jan 2020, 23:04:45 UTC

Last night one of my clients started rebooting itself every 15 minutes or so.
Can't seem to clear the issue at this point.
Was wondering if anyone could tell me what they know about the .wisdom files. I ask because the one for MB has a date/time stamp close to when this nonsense first began. My guess is that it gets rebuilt at some point by the app (MB or AP, as the case may be), and that shutting down BOINC and deleting it might be a logical troubleshooting step.
Just wondering if anyone had thoughts?
Thanks!

Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2028686 - Posted: 21 Jan 2020, 0:40:31 UTC - in response to Message 2028681.  

Wisdom files are the OpenCL compute kernel primitives for the graphics card and driver. You can safely delete them after shutting BOINC down and they will be recreated when crunching restarts.
Just to be clear, I am not talking about the application *.CL file. That is required for the application. Don't delete the *3584.CL or 3556.CL files.
The wisdom files are named after the card type and the driver version. There are separate ones for the MB and AP apps.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1861
Credit: 268,616,081
RAC: 1,349
United States
Message 2028687 - Posted: 21 Jan 2020, 0:48:32 UTC - in response to Message 2028686.  
Last modified: 21 Jan 2020, 0:58:52 UTC

Wisdom files are the OpenCL compute kernel primitives for the graphics card and driver. You can safely delete them after shutting BOINC down and they will be recreated when crunching restarts.
That's what I thought. Thanks for the confirmation. Guess it's worth a shot. It seems to be blc61 files that cause the crash. Using grub to fall back from 5.3.0-26-generic to 5.0.0-37 didn't help either, and that was the only recent activity in the update log. That'll teach me to use the apt update command, even when instructed to ;)
Thanks, Keith.

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2028690 - Posted: 21 Jan 2020, 1:31:55 UTC - in response to Message 2028639.  

That’s not really helpful. I’m curious to see how much memory the whole computer is using and what tasks are using it.

Run htop


There are a couple of images in the URL that display Task Manager showing about 3/4 of my 16 GB in use. I believe you can see the "working" set which is rather large. As far as I can tell the Ram tracks the working set on these particular apps.

I turned off the E@H cpu tasks I was running, so I can't re-create the issue, but the memory usage reported by Task Manager was 4-5 times what Seti cpu tasks use.

Tom
A proud member of the OFA (Old Farts Association).

Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2028693 - Posted: 21 Jan 2020, 1:41:21 UTC - in response to Message 2028687.  

Wisdom files are the OpenCL compute kernel primitives for the graphics card and driver. You can safely delete them after shutting BOINC down and they will be recreated when crunching restarts.
That's what I thought. Thanks for the confirmation. Guess it's worth a shot. It seems to be blc61 files that cause the crash. Using grub to fall back from 5.3.0-26-generic to 5.0.0-37 didn't help either, and that was the only recent activity in the update log. That'll teach me to use the apt update command, even when instructed to ;)
Thanks, Keith.

If you start getting errors on tasks that have messages in the stderr.txt like ....initialization failed or ..... memory access denied, it is time to purge the Compute Cache of the primitives. It is located in /home/{username}/.nv/ComputeCache in Linux and in C:\Users\{username}\AppData\Roaming\NVIDIA\ComputeCache for Windows.
The primitives can get corrupted or, more frequently, the permissions on the folder in Windows get changed, which prevents the app from reading the files, something it has to do for each task crunched.
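
The purge described above can be scripted. This is only a sketch, assuming the Linux path given in this post and that BOINC has already been shut down; purge_compute_cache is a hypothetical helper name, not something shipped with the NVIDIA driver or BOINC.

```python
import shutil
from pathlib import Path


def purge_compute_cache(cache_dir):
    """Delete everything inside cache_dir; return the number of entries removed."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return 0  # nothing to purge
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # cached kernels live in subfolders
        else:
            entry.unlink()
        removed += 1
    return removed


if __name__ == "__main__":
    # Linux location from this post; stop BOINC before running this.
    default_cache = Path.home() / ".nv" / "ComputeCache"
    count = purge_compute_cache(default_cache)
    print(f"Removed {count} entries from {default_cache}")
```

The folder itself is left in place so its ownership and permissions stay as they were; only the contents are removed.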
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2028702 - Posted: 21 Jan 2020, 2:18:47 UTC - in response to Message 2028690.  

That’s not really helpful. I’m curious to see how much memory the whole computer is using and what tasks are using it.

Run htop


There are a couple of images in the URL that display Task Manager showing about 3/4 of my 16 GB in use. I believe you can see the "working" set which is rather large. As far as I can tell the Ram tracks the working set on these particular apps.

I turned off the E@H cpu tasks I was running. So I can't re-create the issue except for the reported memory usage per task manager was 4-5 times what Seti cpu tasks use.

Tom


the issue was all the CPU tasks you were running. dont run so many I guess since they use up so much system memory. or add more memory if you want to run CPU work.

additionally, were those 3 gravity wave WUs running 1 each on 3 different cards? or 3 on 1 GPU? the gravity wave WUs need a lot of CPU support, my system uses about 1.2-1.5 CPU threads for each GW GPU WU. can't run multiples per GPU unless you have a lot of spare threads.

just what i've noticed so far on my old Xeons. the "per GPU WU" CPU percentage might be lower on the more modern chips.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1861
Credit: 268,616,081
RAC: 1,349
United States
Message 2028708 - Posted: 21 Jan 2020, 3:37:56 UTC - in response to Message 2028693.  

If you start getting errors on tasks that have messages in the stderr.txt like ....initialization failed or ..... memory access denied, it is time to purge the Compute Cache of the primitives. It is located in /home/{username}/.nv/ComputeCache in Linux and in C:\Users\{username}\AppData\Roaming\NVIDIA\ComputeCache for Windows.
The primitives can get corrupted or, more frequently, the permissions on the folder in Windows get changed, which prevents the app from reading the files, something it has to do for each task crunched.
In this case, the error tasks get trashed and thus returned with a compute error referring to a bad header (presumably in the returning file), so no help there.
Whatever it was, I'd been crashing every 5-15 minutes, and have now been up and running for 1hr15min, so maybe it was indeed bad WUs. I have in the past seen intermittent crashes like this, but few and far between.
Strangely, since deleting the wisdom files, they have not been rebuilt after restart yet S@H is running fine.
Guess we'll see.

Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2028710 - Posted: 21 Jan 2020, 3:57:22 UTC

Yes, there are occasional bad tasks where the database is overloaded and can't access the task in the database and errors out. I get one or two a week. But strange that the wisdom files didn't get recreated on the very first attempt at crunching an OpenCL task. You can always see it in the stderr.txt output with entries like "can't find opencl file . . . . recompiling", which adds about another 5 seconds to the compute time of the task. Once created, recompiling is not necessary for following tasks with that card and driver.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1861
Credit: 268,616,081
RAC: 1,349
United States
Message 2028715 - Posted: 21 Jan 2020, 4:35:49 UTC - in response to Message 2028710.  

Yes, there are occasional bad tasks where the database is overloaded and can't access the task in the database and errors out. I get one or two a week. But strange that the wisdom files didn't get recreated on the very first attempt at crunching an OpenCL task. You can always see it in the stderr.txt output with entries like "can't find opencl file . . . . recompiling", which adds about another 5 seconds to the compute time of the task. Once created, recompiling is not necessary for following tasks with that card and driver.

lol at myself
Perhaps related to the fact that the only thing running right now is Cuda90 on GPUs and FGRPSSE on CPU ...

Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2028722 - Posted: 21 Jan 2020, 6:15:00 UTC - in response to Message 2028718.  

I would suggest an alternative.
https://github.com/jonasmalacofilho/liquidctl

I was able to control both the AIO cpu fans speeds plus the pump speeds with this repository.
Handles all the standard Asetek hardware across Corsair, NZXT, EVGA and Thermaltake AIO's.
I got it to work quite well on Corsair H-100iV2 and EVGA CLC280 AIO's.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2028739 - Posted: 21 Jan 2020, 11:18:43 UTC - in response to Message 2028702.  

That’s not really helpful. I’m curious to see how much memory the whole computer is using and what tasks are using it.

Run htop


There are a couple of images in the URL that display Task Manager showing about 3/4 of my 16 GB in use. I believe you can see the "working" set which is rather large. As far as I can tell the Ram tracks the working set on these particular apps.

I turned off the E@H cpu tasks I was running. So I can't re-create the issue except for the reported memory usage per task manager was 4-5 times what Seti cpu tasks use.

Tom


the issue was all the CPU tasks you were running. dont run so many I guess since they use up so much system memory. or add more memory if you want to run CPU work.

additionally, were those 3 gravity wave WUs running 1 each on 3 different cards? or 3 on 1 GPU? the gravity wave WUs need a lot of CPU support, my system uses about 1.2-1.5 CPU threads for each GW GPU WU. can't run multiples per GPU unless you have a lot of spare threads.

just what i've noticed so far on my old Xeons. the "per GPU WU" CPU percentage might be lower on the more modern chips.


Haven't made it to the other room for the picture but I can assure you that I am running without any app_config.xml file in the E@H directory so whatever load E@H decides will run on the gpus is what I am getting. It appears to be allocating 0.9 cpu's per gpu task. And it has "never" run more than one task per gpu. All the gpu tasks seem to be running at about the same "ram usage" as Seti.
It is the cpu tasks that were taking an outsized bite.

I will see if I can run some E@H CPU tasks during maintenance and take a picture of Htop so we can get reliable answers.

Tom
A proud member of the OFA (Old Farts Association).

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2028740 - Posted: 21 Jan 2020, 11:52:52 UTC - in response to Message 2028739.  

Haven't made it to the other room for the picture but I can assure you that I am running without any app_config.xml file in the E@H directory so whatever load E@H decides will run on the gpus is what I am getting. It appears to be allocating 0.9 cpu's per gpu task. And it has "never" run more than one task per gpu. All the gpu tasks seem to be running at about the same "ram usage" as Seti.
This is separate from the RAM discussion, but:

The figure of '0.9 cpu's per gpu task' is simply BOINC's (wildly inaccurate) estimation of - yes - what to allocate for the task. The application running the task will decide, and use, exactly what it wants. In rare cases, the developer has provided a switch - environment variable, configuration file, or command line - to toggle between 'use full CPU' or 'use less than full CPU'. If the application is running at 'less than full CPU', you again have no control over exactly how much it will use. It's usually better to make your own choices, and apply them via app_config.xml
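
As a rough illustration of making your own choices, here is a small Python sketch that writes out an app_config.xml with explicit GPU and CPU budgets. The element names follow BOINC's app_config.xml layout; the app name "example_gpu_app" is a placeholder (the real short name has to be looked up for the project in question), the values 1.0 and 1.5 are assumptions for illustration, and build_app_config is a made-up helper.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, gpu_usage, cpu_usage):
    """Return app_config.xml text budgeting gpu_usage GPUs and cpu_usage CPUs per task."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu, "cpu_usage").text = str(cpu_usage)
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder app name; one task per GPU, 1.5 CPU threads budgeted per task.
    print(build_app_config("example_gpu_app", 1.0, 1.5))
```

The output would go in the project's directory under the BOINC data folder. These numbers only change how BOINC budgets and schedules work; the application still uses whatever CPU it wants.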

On the subject of pictures - it would be helpful if you could use an image hosting service which would allow you to show screenshots at a higher resolution than 300x240 - I found those hard to read.

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2028743 - Posted: 21 Jan 2020, 12:53:57 UTC - in response to Message 2028740.  

Like Richard said, 0.9 doesnt mean it’s using that much. That value is only used for the BOINC internal book keeping so it knows how much resources are being used and how many jobs to run.

With gravity wave, I actually observed the GPU tasks using more than a full thread. About 1.2 - 1.5 CPU threads per GPU WU.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours



 
©2026 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.