Setting up Linux to crunch CUDA90 and above for Windows users

juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2031863 - Posted: 10 Feb 2020, 20:22:16 UTC - in response to Message 2031859.  
Last modified: 10 Feb 2020, 20:29:13 UTC

Well, there is one checkpoint still created - at the very end of the run, too late to be of any use except in the most unlikely of circumstances. If that one could be removed too (AND TESTED!), we could put this side of the conversation to bed - permanently.

And concentrate on the monitors.

Edit - Juan did send me a new test build this afternoon, saying he had taken out the [should we say remaining?] checkpointing. I haven't tried it yet, because I was too busy trying to find the problem I was supposed to be solving. I'll try and test it tomorrow morning.

The only checkpoint I left in is the one at the end of the analysis process. I could remove that too, but I believe it makes no difference, since at that point the crunching has ended and the process of writing the stderr file is starting.

BTW, I did not remove the checkpoint code; I just force the program to believe the checkpoint delay time has not been reached, so no checkpoint is generated. With no checkpoint, the app will always restart from zero. What is important is that I made no changes to Petri's analysis code.

@Ian Yes, it is the same app I coded and you compiled (thanks again for that). It is the one I'm using now, and from my POV it is working fine and does what it is expected to do. I asked Richard to conduct a more accurate test for us.
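
To illustrate the approach described above, here is a minimal sketch of the idea as a shell simulation. It is not the app's real code (the app is C++/CUDA); the function names `time_to_checkpoint` and `count_checkpoints` and the `suppress` flag are made up for illustration. The checkpoint code stays in place, but the timing test is forced to answer "not yet", so nothing is ever written:

```shell
# Hypothetical stand-in for the app's "has the checkpoint interval elapsed?" test.
# Arguments: seconds since last checkpoint, configured interval,
#            suppress flag (1 = never checkpoint).
time_to_checkpoint() {
    local elapsed="$1" interval="$2" suppress="$3"
    if [ "$suppress" -eq 1 ]; then
        return 1    # always report "interval not reached yet"
    fi
    [ "$elapsed" -ge "$interval" ]
}

# Simulate a 120-second run with a 60-second interval and count checkpoints.
count_checkpoints() {
    local suppress="$1" last=0 written=0 t
    for t in $(seq 1 120); do
        if time_to_checkpoint $((t - last)) 60 "$suppress"; then
            written=$((written + 1))
            last=$t
        fi
    done
    echo "$written"
}
```

With the flag off (`count_checkpoints 0`) the simulated run writes two checkpoints; with the flag on (`count_checkpoints 1`) it writes none, which is why a restart always begins from zero.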
ID: 2031863
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14659
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031866 - Posted: 10 Feb 2020, 20:29:56 UTC

Could you two please have a quick chat amongst yourselves and decide on the best step for tomorrow? I've reached beer o'clock, and I'm stepping out for a while.
ID: 2031866
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031867 - Posted: 10 Feb 2020, 20:36:04 UTC - in response to Message 2031866.  

Honestly, I'm just curious whether you notice any difference between the two apps as far as checkpointing is concerned, but it sounds like there might not be.

Then test whatever Juan wants you to test. I don't really have any interest beyond the curiosity.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031867
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031869 - Posted: 10 Feb 2020, 20:38:26 UTC - in response to Message 2031855.  
Last modified: 10 Feb 2020, 21:02:23 UTC

Do you have a checklist so we can pinpoint this behaviour?
Not really. The problem has existed since around Sept '18, basically with the release of 0.97. The older version zi3v doesn't have this problem. I think the others and myself were running Ubuntu, any version since around 16.04, with multiple GPUs and a single monitor attached. I'd say the more GPUs, the more likely you are to experience the event. It seems to be common on the 14 GPU machine. About the only sequence needed is to reboot the computer and then start BOINC without using the GPUs to do anything else in between. As in, don't run tasks in the benchmark App before running BOINC, as just running the GPUs will cause the problem to not happen. That's why I run BOINC for about 15 seconds then restart BOINC before it can complete a task. After restarting BOINC the problem doesn't exist; the 15 seconds cancels the problem... in Linux. It's a different story on the Mac.

Oh, the problem Does Exist when running the Special App in the Benchmark App, for the first task anyway. The second task doesn't have the problem. Same scenario, reboot the machine, then run the Special App.
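
The restart workaround described above could be scripted roughly as follows. This is only a sketch: it assumes BOINC runs as the systemd service `boinc-client`, which is not how every install in this thread is set up, and the `BOINC_CTL` variable is a hypothetical hook added so the sequence can be dry-run without touching the service.

```shell
# Start BOINC, let the GPUs run briefly, then restart before a task can finish.
# Argument: delay in seconds (defaults to 15).
# BOINC_CTL is a hypothetical override; set it to "echo" for a dry run.
warm_up_boinc() {
    local delay="${1:-15}"
    local ctl="${BOINC_CTL:-sudo systemctl}"
    $ctl start boinc-client      # unquoted on purpose: "sudo systemctl" is two words
    sleep "$delay"
    $ctl restart boinc-client
}

# Dry run (prints the arguments instead of executing anything):
#   BOINC_CTL=echo warm_up_boinc 0
# Real run on a systemd-managed install:
#   warm_up_boinc 15
```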
ID: 2031869
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2031875 - Posted: 10 Feb 2020, 20:57:00 UTC - in response to Message 2031866.  
Last modified: 10 Feb 2020, 21:10:02 UTC

Could you two please have a quick chat amongst yourselves and decide on the best step for tomorrow? I've reached beer o'clock, and I'm stepping out for a while.

Beer o'clock... I'm in! I know it's Monday, but it's very hot here.

Back to the topic. I only wish to test whether the change eliminates the need to adjust the checkpoint timing on the hosts. That is what I understand was requested.

About the missing pulse problem due to the monitor, a totally different topic: what I have learned over the years is that Linux has some "timing problem" when you kill the crunching process. I was unable to pin it down or replicate it. It's a random event from my POV. Something is left behind sometimes. The problem is more likely to appear when you have a lot of GPUs on the host. Just a long shot: maybe when you switch the monitor, the time the GPU needs to register that makes it miss some critical procedure, and that is what triggers the event on some hosts and not on others. Could that be the cause of the missing pulse problem? I have no idea.
ID: 2031875
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2031885 - Posted: 10 Feb 2020, 21:51:31 UTC - in response to Message 2031822.  
Last modified: 10 Feb 2020, 22:12:59 UTC

Now you're getting me completely confused.

Bear in mind that for the time being I'm concentrating on the checkpoint problem only. Missed pulses can wait for another day.

I've just interrupted four tasks (two at a time, each running by itself on one of the two GPUs in this box). I stopped them via 'sudo systemctl stop boinc-client'. The first pair were not expected to checkpoint (interval set to 300): the second pair should have checkpointed (interval 60: stopped at around 70).

On restart, Manager displayed both progress and time restarting from 0. The tasks, in order, were

8529268332 (overflow, pending)
8529268005 (not overflow, valid)
8529287927 (not overflow, valid)
8529288504 (not overflow, pending)

We'll have to wait to check the first one (average turnround 1.8 days), but the other three rather disprove your assertion. I'm running setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101. The final run times shown on the web task page match what I saw for the final 'run to finish' after the restart: the pre-restart time was lost.


. . OK. My 2cents worth. The checkpoint issue results in false overflows, but like TBar says, it is not every time. I would say it may have been as high as 1/2 or 1/3 or as low as 1/4 or 1/5 but it was quite regular. I had minimum checkpoint set to a fairly low value, it was a long time ago but it may have been 60 secs. But it did NOT require a cold boot or even terminating BOINC client/manager, those are part of the missing pulse issue TBar was talking about. It would happen even with suspending/resuming a task. I regularly suspend tasks when tweaking or such and when resuming a high percentage of checkpointed tasks would immediately overflow. I now have checkpoint set to 300secs which is 30% to 50% over expected run times and do not have the problem.

. . Back then I was running the Cuda90 version, so I do not know if it still applies to anything higher. All my GPUs had, and still have, GDDR5 RAM.

Stephen
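
For anyone wanting to apply a 300-second setting like the one above from a terminal, one possible sketch follows. It assumes the checkpoint interval corresponds to the `<disk_interval>` preference in `global_prefs_override.xml`; the function name and the data directory in the example are illustrative, so check them against your own install first.

```shell
# Write a preferences override that sets only the checkpoint interval.
# Arguments: BOINC data directory, interval in seconds.
# Caution: this overwrites any existing override file in that directory.
set_checkpoint_interval() {
    local data_dir="$1" seconds="$2"
    cat > "$data_dir/global_prefs_override.xml" <<EOF
<global_preferences>
   <disk_interval>$seconds</disk_interval>
</global_preferences>
EOF
}

# Example (the data directory is an assumption; run as a user who can write to it):
#   set_checkpoint_interval /var/lib/boinc-client 300
#   boinccmd --read_global_prefs_override   # ask the running client to reload it
```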

. .
ID: 2031885
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031892 - Posted: 10 Feb 2020, 22:19:14 UTC - in response to Message 2031869.  

Oh, the problem Does Exist when running the Special App in the Benchmark App, for the first task anyway. The second task doesn't have the problem. Same scenario, reboot the machine, then run the Special App.

TBar, I have never noticed the problem on the 1070 when testing the special app after restarting the computer with Rick's benchMT benchmark app.
So, your criteria are met. No monitor attached to the 1070 or 1070Ti, the computer is rebooted and the benchmark app is started as soon as the Desktop appears. BOINC is not running and has not even been started. Just start with the selected tasks in the Test WU folder and the app configured to test the special app against the reference CPU app or one of the reference results for the SoG app.

Always get 99.98% agreement or at least 99.2% Q rating. No missing pulses or the results would not be strongly similar and the rescmpv5 app would have printed out the missing whatever in the tables.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031892
Profile Wiggo
Joined: 24 Jan 00
Posts: 35502
Credit: 261,360,520
RAC: 489
Australia
Message 2031898 - Posted: 10 Feb 2020, 22:42:49 UTC

This is an example of the only true error I get when the power suddenly goes out on my 3GB 1060's and it only happens if a task is interrupted within about 3 seconds of starting, which thankfully doesn't happen very often even with the power problems we have here in the hills when storms are about.

https://setiathome.berkeley.edu/result.php?resultid=8528470348

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
SETI@home error -6 Bad workunit header
!swi.data_type || !found || !swi.nsamples
File: seti_header.cpp
Line: 218

12:08:57 (1698): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>blc75_2bit_guppi_58693_09832_HIP98819_0146.5758.0.21.44.28.vlar_0_r202591688_0</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>


Both rigs have checkpoints set at 300 seconds and run off SSDs, but then again it's highly likely that different setups may produce different outcomes.

Cheers.
ID: 2031898
alanb1951 Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Joined: 25 May 99
Posts: 10
Credit: 6,904,127
RAC: 34
United Kingdom
Message 2031908 - Posted: 10 Feb 2020, 23:51:56 UTC - in response to Message 2031816.  
Last modified: 10 Feb 2020, 23:53:36 UTC

Am I missing something about Linux? Does it hold data destined for files in memory, and only commit to disk when the file is closed? In which case, why is the header portion written immediately?

Yes, it does. The ext4 file system holds data to be written in memory pages for speed and then writes it to media. Whether and when data is written immediately depends on the application. It defaults to delayed allocation. This snippet of an article explains a lot.
ext4 and data loss

To avoid misconceptions regarding ext4, it should be noted that that article is from 2009, and ext4 behaviour regarding delayed allocation has long since been changed to address that. There was a recent issue with 4.19 kernels, but according to Phoronix it wasn't actually an ext4 problem.
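
Whatever the filesystem defaults are, a program that needs its data on disk at a particular moment can ask for that explicitly. The sketch below shows the common write-to-temporary, flush, then rename pattern in shell. It is a generic illustration, not what the SETI@home apps do; the function name is hypothetical, and passing a file name to `sync` assumes GNU coreutils 8.24 or later.

```shell
# Replace a file so that readers never see a half-written version.
# Arguments: destination path, new content.
write_durably() {
    local dest="$1" content="$2" tmp
    tmp=$(mktemp "$dest.XXXXXX") || return 1   # temp file in the same directory
    printf '%s\n' "$content" > "$tmp"
    sync "$tmp"            # flush this file's data instead of waiting for writeback
    mv -f "$tmp" "$dest"   # rename is atomic within one filesystem
}

# Example:
#   write_durably ./state.txt "progress=42"
```

Note that `mktemp` creates the file with owner-only permissions, so the replaced file ends up with those permissions too.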

Cheers - Al.
ID: 2031908
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031912 - Posted: 11 Feb 2020, 0:36:59 UTC - in response to Message 2031908.  

I was simply searching for a simple beginner's guide to how ext4 works. Not that we were actually discussing any data loss.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031912
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031950 - Posted: 11 Feb 2020, 7:39:40 UTC - in response to Message 2031892.  
Last modified: 11 Feb 2020, 7:44:05 UTC

Oh, the problem Does Exist when running the Special App in the Benchmark App, for the first task anyway. The second task doesn't have the problem. Same scenario, reboot the machine, then run the Special App.

TBar, I have never noticed the problem on the 1070 when testing the special app after restarting the computer with Rick's benchMT benchmark app.
So, your criteria are met. No monitor attached to the 1070 or 1070Ti, the computer is rebooted and the benchmark app is started as soon as the Desktop appears. BOINC is not running and has not even been started. Just start with the selected tasks in the Test WU folder and the app configured to test the special app against the reference CPU app or one of the reference results for the SoG app.

Always get 99.98% agreement or at least 99.2% Q rating. No missing pulses or the results would not be strongly similar and the rescmpv5 app would have printed out the missing whatever in the tables.
It looks as though you are one of the lucky ones. Even the Linux people that get them usually only get them on one card out of a few. I looked through the list of top computers to see if there were any there and didn't see any from the likely machines. Then later I noticed my 9 GPU machine had a hung task that had ended in an error and was about to report, and another hung task still trying to run, so I stopped it before it could report and looked at it. That's when I noticed an Invalid from yesterday... that was missing pulses. Well, I had tried restarting the machine yesterday to see if it would miss any on restart, but didn't see any. Apparently I had missed one; it seems the 970 had missed pulses and the task was now an Invalid. One GPU out of 9 had displayed the problem:
https://setiathome.berkeley.edu/result.php?resultid=8526164153
   Device 6: GeForce GTX 970 is okay
SETI@home using CUDA accelerated device GeForce GTX 970
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

setiathome v8 enhanced x41p_V0.98b1, Cuda 10.2 Special
Modifications done by petri33, compiled by TBar

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.009212
Sigma 148
Sigma > GaussTOffsetStop: 148 > -84
Thread call stack limit is: 1k
Autocorr: peak=18.1597, time=6.711, delay=5.6933, d_freq=1419863199.87, chirp=-12.127, fft_len=128k
Autocorr: peak=18.01943, time=6.711, delay=5.6933, d_freq=1419863199.86, chirp=-12.128, fft_len=128k
Triplet: peak=10.08111, time=55.05, period=15.47, d_freq=1419861243.92, chirp=-19.338, fft_len=1024 
Triplet: peak=16.39076, time=11.95, period=8.126, d_freq=1419866153.37, chirp=-89.224, fft_len=1024 

Best spike: peak=23.97182, time=73.82, d_freq=1419860056.03, chirp=-19.273, fft_len=128k
Best autocorr: peak=18.1597, time=6.711, delay=5.6933, d_freq=1419863199.87, chirp=-12.127, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.123e+11, d_freq=0,
	score=-12, null_hyp=0, chirp=0, fft_len=0 
Best pulse: peak=0, time=-2.123e+11, period=0, d_freq=0, score=0, chirp=0, fft_len=0 
Best triplet: peak=16.39076, time=11.95, period=8.126, d_freq=1419866153.37, chirp=-89.224, fft_len=1024 

Spike count:    0
Autocorr count: 2
Pulse count:    0
Triplet count:  2
Gaussian count: 0
18:07:36 (3125): called boinc_finish(0)
Seems it didn't have any trouble after that. The hung task from tonight was on a 1060, and the GPU showed as being OK in NVIDIA Settings; for some reason it had hung in BOINC though. A reboot fixed that, and I turned the Error into a Ghost so it wouldn't be reported. I'll retrieve the ghost when I get around to it. The 970 is the only Maxwell GPU out of the 9 GPUs, the others are all Pascals; I don't have a clue if that means anything. I didn't notice any missed pulses from tonight's reboot either.
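
Assuming, as the output above suggests, that an all-zero `Best pulse` line is the signature of the missing-pulse problem, saved stderr text can be scanned for it. The function name is hypothetical and the check is a rough sketch, not a validated detector:

```shell
# Report whether a saved stderr text file shows an all-zero "Best pulse" line.
# Exit status 0 = symptom present, non-zero = not present.
# The trailing comma in the pattern keeps values such as peak=0.5 from matching.
has_zero_best_pulse() {
    grep -q '^Best pulse: peak=0,' "$1"
}

# Example: check every saved result in the current directory.
#   for f in ./*.txt; do
#       has_zero_best_pulse "$f" && echo "$f: best pulse is all zero"
#   done
```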
ID: 2031950
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14659
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031964 - Posted: 11 Feb 2020, 12:22:38 UTC

Hit a snag testing the latest version. I was running a 430 driver (pre-CUDA 10.2). On my first Linux machine, I used the NVIDIA installer, but that was horrible - it needed to be installed from a text mode terminal, with no desktop or X server, logged in as root. Don't want to go there again.

This one I installed with Synaptic Package Manager. Tried to upgrade today, but the highest available is 435. Even that was foul (it kept rebooting at 640x480 resolution, graphical desktop. Ugh.)

I've got it back under control and crunching, but wasted the planned testing time. Can anyone recommend a clean installer for a CUDA 10.2 driver, Linux Mint 19.1 based on Ubuntu 18.04?
ID: 2031964
Profile Buckeye4LF Project Donor
Joined: 19 Jun 00
Posts: 173
Credit: 54,916,209
RAC: 833
United States
Message 2031965 - Posted: 11 Feb 2020, 12:48:02 UTC - in response to Message 2031964.  

For the most recent Nvidia drivers I installed from https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa/+index using the commands:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update

I am running Linux Mint 19.2 with dual NVIDIA RTX 2070 Super cards and have had no issues for the past two days. I installed driver 440.59 and it works great. Be warned though, these are not official drivers.
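
The two commands above only add the PPA; the driver still has to be installed and checked. A hedged sketch of that last step follows. NVIDIA's CUDA release notes list 440.33 as the minimum Linux driver for CUDA 10.2, and the helper below compares a version string against that number. The function name is hypothetical, the package name `nvidia-driver-440` is an assumption about what the PPA offers for your release, and `sort -V` requires GNU sort.

```shell
# Succeeds if the given driver version is at least 440.33, the minimum
# Linux driver NVIDIA lists for CUDA 10.2.
driver_supports_cuda102() {
    local have="$1" need="440.33"
    # Version-sort the two strings; the requirement must come out first (or equal).
    [ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n 1)" = "$need" ]
}

# After adding the PPA, installing and checking might look like:
#   sudo apt-get install nvidia-driver-440
#   v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_supports_cuda102 "$v" && echo "driver $v is new enough for CUDA 10.2"
```

By this check, the 430 and 435 drivers mentioned earlier fall short and 440.59 passes.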

ID: 2031965
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14659
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031966 - Posted: 11 Feb 2020, 13:01:41 UTC - in response to Message 2031965.  

Thanks. My criteria are:

CUDA
OpenCL
Full HD desktop

I'm not fussed about the provider, so long as they're competent enough to deliver those three properly! Definitely one for the shortlist.
ID: 2031966
Profile Buckeye4LF Project Donor
Joined: 19 Jun 00
Posts: 173
Credit: 54,916,209
RAC: 833
United States
Message 2031968 - Posted: 11 Feb 2020, 13:06:13 UTC - in response to Message 2031966.  

I installed OpenCL before I updated the driver, so you may have to do that as a separate command, as I am unsure if it is included.

ID: 2031968
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14659
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031969 - Posted: 11 Feb 2020, 13:16:45 UTC - in response to Message 2031968.  

I installed OpenCL before I updated the driver, so you may have to do that as a separate command, as I am unsure if it is included.
Can you remember where you got OpenCL from? Less important, I can come back to that later - it's CUDA I was planning to test today.
ID: 2031969
J. Mileski
Volunteer tester
Joined: 9 Jun 02
Posts: 632
Credit: 172,116,532
RAC: 572
United States
Message 2031973 - Posted: 11 Feb 2020, 22:11:50 UTC - in response to Message 2031969.  

I installed OpenCL before I updated the driver, so you may have to do that as a separate command, as I am unsure if it is included.
Can you remember where you got OpenCL from? Less important, I can come back to that later - it's CUDA I was planning to test today.

sudo apt-get install ocl-icd-libopencl1
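
That package provides the OpenCL loader, which finds vendor implementations through `.icd` files. As a quick sanity check after installing it and the driver, the sketch below lists whatever vendor files are present. It assumes the conventional `/etc/OpenCL/vendors` location, and the function name is hypothetical.

```shell
# List the vendor ICD files the OpenCL loader would find.
# Argument: vendors directory (defaults to the conventional location).
list_opencl_icds() {
    local dir="${1:-/etc/OpenCL/vendors}" f
    for f in "$dir"/*.icd; do
        # If nothing matches, the pattern stays literal and is skipped here.
        [ -e "$f" ] && basename "$f"
    done
    return 0
}

# Example:
#   list_opencl_icds
```

An empty result would suggest that no OpenCL vendor driver is registered yet. If the separate `clinfo` package is installed, running `clinfo` gives a fuller report.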

ID: 2031973
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14659
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031976 - Posted: 11 Feb 2020, 22:39:08 UTC - in response to Message 2031973.  

Ta. I'll go through that later.

Turned out I had used the repository version of the same drivers you pointed me to - but 440 was missing, because they haven't been declared 'stable' yet, or whatever the technical term is. Added the PPA, and got access to the latest. Same adventure with the 640x480 screen, and a few broken packages later, but I got it running again.

The object of the test - the 10.2 'no checkpoint' build - turned out to be the same as before: one single lonely checkpoint at 99.99999% progress. Otherwise, crunching looked normal, but we were into the outage by then, so no instant validations. I'm out tomorrow, but I'll check when I get time.
ID: 2031976
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2031986 - Posted: 11 Feb 2020, 23:14:26 UTC - in response to Message 2031976.  

The object of the test - the 10.2 'no checkpoint' build - turned out to be the same as before: one single lonely checkpoint at 99.99999% progress. Otherwise, crunching looked normal, but we were into the outage by then, so no instant validations. I'm out tomorrow, but I'll check when I get time.

That lonely checkpoint comes after the end of the analysis process, just before the app starts to build the stderr file. I left it because I imagine no harm can be done at that point. It could be removed too if you think that helps.
ID: 2031986
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031988 - Posted: 11 Feb 2020, 23:30:26 UTC - in response to Message 2031986.  

I think he's saying that he doesn't see any difference in behavior. Effectively, it wasn't checkpointing even before the change, so the checkpoint issue is probably no longer an issue.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031988


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.