Setting up Linux to crunch CUDA90 and above for Windows users

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031749 - Posted: 10 Feb 2020, 4:11:17 UTC - in response to Message 2031746.  
Last modified: 10 Feb 2020, 4:13:26 UTC

Yes, and as the experiments wore on I finally came to the conclusion that the problem doesn't exist on a 1080 Ti using GDDR5X RAM, while it happens on every GPU using GDDR5.
I've been working on this for nearly two years, then spent hundreds of dollars on the final test. How long have you been working on it?
ID: 2031749
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031750 - Posted: 10 Feb 2020, 4:23:51 UTC - in response to Message 2031749.  
Last modified: 10 Feb 2020, 4:56:13 UTC

none, as I never had the issue.

my test bench system has a card with GDDR5 mem, GTX 1650. never noticed this problem there either.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031750
-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 2031762 - Posted: 10 Feb 2020, 7:40:04 UTC - in response to Message 2031749.  

Does this host https://setiathome.berkeley.edu/results.php?hostid=8570185 fit the criteria for the problem?
I do not have any monitors attached to it and can only access it through SSH.

Does a simple reboot from the console make the problem appear? If so, I can test this too.

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 2031762
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031767 - Posted: 10 Feb 2020, 9:53:58 UTC - in response to Message 2031678.  

Richard, do you know what effect a value of '0' has on the checkpoint setting?
Looks like it is set to a safety value 'DEFAULT_CHECKPOINT_PERIOD':

https://github.com/BOINC/boinc/blob/master/api/boinc_api.cpp#L608
static int min_checkpoint_period() {
    int x = (int)aid.checkpoint_period;
    if (app_min_checkpoint_period > x) {
        x = app_min_checkpoint_period;
    }
    if (x == 0) x = DEFAULT_CHECKPOINT_PERIOD;
    return x;
}
This safety value is 300, not the normal 60:

https://github.com/BOINC/boinc/blob/master/lib/app_ipc.h#L117
#define DEFAULT_CHECKPOINT_PERIOD               300
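
To make the effect of a '0' setting concrete, here is a minimal Python sketch that mirrors the logic of the C++ function quoted above. It is an illustration only, not BOINC code: the Python function and its two arguments are hypothetical stand-ins for `aid.checkpoint_period` and `app_min_checkpoint_period`, and the constant is simply copied from the `#define`.

```python
DEFAULT_CHECKPOINT_PERIOD = 300  # value copied from the #define quoted above


def min_checkpoint_period(user_period: float, app_min_period: int = 0) -> int:
    """Hypothetical Python mirror of the quoted C++ min_checkpoint_period()."""
    x = int(user_period)              # the C++ code truncates with an (int) cast
    if app_min_period > x:
        x = app_min_period            # an app-requested minimum wins if it is larger
    if x == 0:
        x = DEFAULT_CHECKPOINT_PERIOD  # a zero setting falls back to the default
    return x


def main() -> None:
    # Show the effective period for a few user settings (in seconds).
    for setting in (0, 15, 60):
        print(setting, "->", min_checkpoint_period(setting))


if __name__ == "__main__":
    main()
```

Under these assumptions a setting of 0 does not disable checkpointing; it is treated as "unset" and replaced by the 300-second default. Because of the integer truncation, any setting below one second behaves the same way.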
ID: 2031767
juan BFP Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2031768 - Posted: 10 Feb 2020, 11:48:03 UTC
Last modified: 10 Feb 2020, 11:49:46 UTC

@ Richard

We have the checkpoint-removal builds compiled; can you guide us on how to run a real test?
ID: 2031768
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031769 - Posted: 10 Feb 2020, 11:53:57 UTC - in response to Message 2031726.  

About the removal of the checkpoint.

FYI, I made some changes in the code and, thanks to Ian's help with the compile process, we have an experimental version of the 10.2 mutex builds running with the checkpoint removed. We will wait for Richard to wake up to guide us on how to test whether all is working.
I'm awake! ;-)

First of all, I want to see the checkpoint problem for myself. I know it causes validation problems, but why? How?

Validation depends solely on the uploaded result file - so that must be corrupted somehow. Too many signals? Too few signals? Duplicated signals? Or some other damage? I'm currently running down the cache on my most recent build, and I'll interrupt/restart some single tasks to see what happens. It also needs a newer driver to test the 10.2 version, so I'll look after that as well.
ID: 2031769
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031770 - Posted: 10 Feb 2020, 12:00:12 UTC - in response to Message 2031768.  

I want to wait a few hours while I flush down this cache - ~500 tasks to go (I want to get rid of the shorties and anticipated overflows at the very least - and a few other daily chores to complete). Maybe I can cut it down even further? I'll let you know when I'm getting close.
ID: 2031770
juan BFP Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2031772 - Posted: 10 Feb 2020, 12:18:36 UTC
Last modified: 10 Feb 2020, 12:44:11 UTC

Take your time, I will be out most of the day anyway. I will PM you the link to the compiled version for testing. I will not release it to the rest of the SETIverse until we are sure all is working fine. I do not wish to add any new variable now that we see the servers are back to running fine.

First of all, I want to see the checkpoint problem for myself. I know it causes validation problems, but why? How?

Maybe only Petri could answer this. When I asked a long time ago, the answer was well beyond my pay grade.
ID: 2031772
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031778 - Posted: 10 Feb 2020, 13:26:04 UTC - in response to Message 2031762.  

Does this host https://setiathome.berkeley.edu/results.php?hostid=8570185 fit the criteria for the problem?
I do not have any monitors attached to it and can only access it through SSH.

Does a simple reboot from the console make the problem appear? If so, I can test this too.


According to Tbar, yes it fits. Do a system reboot and see what happens.

ID: 2031778
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031780 - Posted: 10 Feb 2020, 13:26:42 UTC - in response to Message 2031767.  

Richard, do you know what effect a value of '0' has on the checkpoint setting?
Looks like it is set to a safety value 'DEFAULT_CHECKPOINT_PERIOD':

https://github.com/BOINC/boinc/blob/master/api/boinc_api.cpp#L608
static int min_checkpoint_period() {
    int x = (int)aid.checkpoint_period;
    if (app_min_checkpoint_period > x) {
        x = app_min_checkpoint_period;
    }
    if (x == 0) x = DEFAULT_CHECKPOINT_PERIOD;
    return x;
}
This safety value is 300, not the normal 60:

https://github.com/BOINC/boinc/blob/master/lib/app_ipc.h#L117
#define DEFAULT_CHECKPOINT_PERIOD               300


Thank you.

ID: 2031780
-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 2031791 - Posted: 10 Feb 2020, 14:46:36 UTC - in response to Message 2031778.  
Last modified: 10 Feb 2020, 14:52:52 UTC



According to Tbar, yes it fits. Do a system reboot and see what happens.


Ok, I just rebooted it. Let's see what happens. I didn't shut down BOINC or anything.
systemctl reboot was the command!

EDIT: I didn't notice anything weird, only one WU from yesterday?! Am I missing anything? /EDIT

ID: 2031791
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031805 - Posted: 10 Feb 2020, 15:38:51 UTC

OK, I've started testing. Set checkpoint interval to 15 seconds, and working with BLC66/75.vlar tasks - around 90-100 seconds on this machine.

First surprise - no checkpoint file. Nada. Even showing hidden files. Eventually caught one flashing into view right at the end of the run, and immediately getting cleaned up as the task finishes.

So next time, I caught it.

<ncfft>116900</ncfft>
<cr>-9.999887e+01</cr>
<fl>32768</fl>
<prog>0.99999990</prog>
<potfreq>-1</potfreq>
<potactivity>0</potactivity>
<signal_count>15</signal_count>
<flops>0.000000</flops>
<spike_count>0</spike_count>
<autocorr_count>5</autocorr_count>
<pulse_count>6</pulse_count>
<gaussian_count>0</gaussian_count>
<triplet_count>4</triplet_count>
Note the fourth line - <prog>0.99999990. Or 99.99999% progress, to you and me. Any interruption before that point, and I assume there's no progress listed, no starting ncfft - nothing. So the whole thing will run from the beginning again (testing that theory next).
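
For anyone who wants to inspect a captured checkpoint without reading it by eye, here is a minimal Python sketch. It assumes only what the listing above shows, namely flat `<tag>value</tag>` lines with no enclosing root element; the function name and the choice of a regular expression are illustrative assumptions, not part of the SETI@home application.

```python
import re

# Checkpoint contents exactly as captured in the post above.
CHECKPOINT = """\
<ncfft>116900</ncfft>
<cr>-9.999887e+01</cr>
<fl>32768</fl>
<prog>0.99999990</prog>
<potfreq>-1</potfreq>
<potactivity>0</potactivity>
<signal_count>15</signal_count>
<flops>0.000000</flops>
<spike_count>0</spike_count>
<autocorr_count>5</autocorr_count>
<pulse_count>6</pulse_count>
<gaussian_count>0</gaussian_count>
<triplet_count>4</triplet_count>
"""

PER_TYPE = ("spike_count", "autocorr_count", "pulse_count",
            "gaussian_count", "triplet_count")


def parse_checkpoint(text: str) -> dict:
    """Return {tag: value string} for every simple <tag>value</tag> pair."""
    return dict(re.findall(r"<(\w+)>([^<]*)</\1>", text))


def per_type_total(state: dict) -> int:
    """Sum the per-signal-type counters found in the checkpoint."""
    return sum(int(state[name]) for name in PER_TYPE)


def main() -> None:
    state = parse_checkpoint(CHECKPOINT)
    print("progress fraction:", float(state["prog"]))
    print("signal_count:", state["signal_count"],
          "per-type total:", per_type_total(state))


if __name__ == "__main__":
    main()
```

In this one sample the per-type counters add up to the `signal_count` value. That is an observation about this file only, not a documented guarantee of the format.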
ID: 2031805
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031806 - Posted: 10 Feb 2020, 15:44:44 UTC - in response to Message 2031791.  
Last modified: 10 Feb 2020, 15:45:22 UTC



According to Tbar, yes it fits. Do a system reboot and see what happens.


Ok, I just rebooted it. Let's see what happens. I didn't shut down BOINC or anything.
systemctl reboot was the command!

EDIT: I didn't notice anything weird, only one WU from yesterday?! Am I missing anything? /EDIT


I don't see anything in your returned tasks either.

these two show evidence that the task restarted, but no signs of this missing best pulse problem. i checked all your pending and inconclusive tasks around this same time and still nothing *shrug*

https://setiathome.berkeley.edu/result.php?resultid=8529605384
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 750 Ti, 2001 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 2, pciSlotID = 0
  Device 3: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 4, pciSlotID = 0
  Device 4: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 5, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 750 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 750 Ti
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

setiathome v8 enhanced x41p_V0.98b1, Cuda 10.1 special
Modifications done by petri33, compiled by TBar

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.423487
Sigma 3
Thread call stack limit is: 1k
Spike: peak=24.10717, time=6.711, d_freq=1420081846.28, chirp=0.45658, fft_len=128k
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 750 Ti, 2001 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 2, pciSlotID = 0
  Device 3: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 4, pciSlotID = 0
  Device 4: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 5, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 3
setiathome_CUDA: CUDA Device 3 specified, checking...
   Device 3: GeForce GTX 750 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 750 Ti
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

setiathome v8 enhanced x41p_V0.98b1, Cuda 10.1 special
Modifications done by petri33, compiled by TBar

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.423487
Sigma 3
Thread call stack limit is: 1k
Spike: peak=24.10717, time=6.711, d_freq=1420081846.28, chirp=0.45658, fft_len=128k
Pulse: peak=4.235778, time=101.1, period=1.503, d_freq=1420073813.48, score=1.023, chirp=29.812, fft_len=512 

Best spike: peak=24.10717, time=6.711, d_freq=1420081846.28, chirp=0.45658, fft_len=128k
Best autocorr: peak=17.29911, time=73.82, delay=4.002, d_freq=1420079781.87, chirp=22.445, fft_len=128k
Best gaussian: peak=3.317156, mean=0.5343372, ChiSq=1.309793, time=47.82, d_freq=1420076284.18,
	score=-1.667274, null_hyp=2.100377, chirp=88.202, fft_len=16k
Best pulse: peak=4.235778, time=101.1, period=1.503, d_freq=1420073813.48, score=1.023, chirp=29.812, fft_len=512 
Best triplet: peak=0, time=-2.122e+11, period=0, d_freq=0, chirp=0, fft_len=0 

Spike count:    1
Autocorr count: 0
Pulse count:    1
Triplet count:  0
Gaussian count: 0

15:50:47 (486): called boinc_finish(0)

</stderr_txt>
]]>


https://setiathome.berkeley.edu/result.php?resultid=8529605382
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 750 Ti, 2001 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 2, pciSlotID = 0
  Device 3: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 4, pciSlotID = 0
  Device 4: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 5, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 2
setiathome_CUDA: CUDA Device 2 specified, checking...
   Device 2: GeForce GTX 750 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 750 Ti
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

setiathome v8 enhanced x41p_V0.98b1, Cuda 10.1 special
Modifications done by petri33, compiled by TBar

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.009608
Sigma 74
Sigma > GaussTOffsetStop: 74 > -10
Thread call stack limit is: 1k
Autocorr: peak=18.22544, time=40.09, delay=3.1556, d_freq=7870342325.01, chirp=1.7554, fft_len=128k
Pulse: peak=3.761351, time=45.99, period=9.127, d_freq=7870341380.36, score=1.051, chirp=-3.9436, fft_len=4k
Pulse: peak=1.430733, time=45.84, period=2.215, d_freq=7870338074.97, score=1.007, chirp=-4.3866, fft_len=512 
Pulse: peak=1.738717, time=45.84, period=2.761, d_freq=7870342162.68, score=1.052, chirp=-12.735, fft_len=512 
Autocorr: peak=18.09989, time=85.9, delay=4.5231, d_freq=7870340739.66, chirp=-17.637, fft_len=128k
Autocorr: peak=18.30738, time=85.9, delay=4.5231, d_freq=7870340739.23, chirp=-17.642, fft_len=128k
Pulse: peak=1.105718, time=45.82, period=1.41, d_freq=7870347727.49, score=1.018, chirp=27.734, fft_len=128 
Pulse: peak=6.477774, time=45.9, period=17.45, d_freq=7870345273.87, score=1.023, chirp=33.393, fft_len=2k
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 750 Ti, 2001 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 2, pciSlotID = 0
  Device 3: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 4, pciSlotID = 0
  Device 4: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 5, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 2
setiathome_CUDA: CUDA Device 2 specified, checking...
   Device 2: GeForce GTX 750 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 750 Ti
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

setiathome v8 enhanced x41p_V0.98b1, Cuda 10.1 special
Modifications done by petri33, compiled by TBar

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.009608
Sigma 74
Sigma > GaussTOffsetStop: 74 > -10
Thread call stack limit is: 1k
Autocorr: peak=18.22544, time=40.09, delay=3.1556, d_freq=7870342325.01, chirp=1.7554, fft_len=128k
Pulse: peak=3.761351, time=45.99, period=9.127, d_freq=7870341380.36, score=1.051, chirp=-3.9436, fft_len=4k
Pulse: peak=1.430733, time=45.84, period=2.215, d_freq=7870338074.97, score=1.007, chirp=-4.3866, fft_len=512 
Pulse: peak=1.738717, time=45.84, period=2.761, d_freq=7870342162.68, score=1.052, chirp=-12.735, fft_len=512 
Autocorr: peak=18.09989, time=85.9, delay=4.5231, d_freq=7870340739.66, chirp=-17.637, fft_len=128k
Autocorr: peak=18.30738, time=85.9, delay=4.5231, d_freq=7870340739.23, chirp=-17.642, fft_len=128k
Pulse: peak=1.105718, time=45.82, period=1.41, d_freq=7870347727.49, score=1.018, chirp=27.734, fft_len=128 
Pulse: peak=6.477774, time=45.9, period=17.45, d_freq=7870345273.87, score=1.023, chirp=33.393, fft_len=2k
Triplet: peak=10.62004, time=59.19, period=28.9, d_freq=7870338017.16, chirp=-80.088, fft_len=1024 

Best spike: peak=23.02097, time=73.01, d_freq=7870344594.63, chirp=-72.298, fft_len=32k
Best autocorr: peak=18.30738, time=85.9, delay=4.5231, d_freq=7870340739.23, chirp=-17.642, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.124e+11, d_freq=0,
	score=-12, null_hyp=0, chirp=0, fft_len=0 
Best pulse: peak=1.738717, time=45.84, period=2.761, d_freq=7870342162.68, score=1.052, chirp=-12.735, fft_len=512 
Best triplet: peak=10.62004, time=59.19, period=28.9, d_freq=7870338017.16, chirp=-80.088, fft_len=1024 

Spike count:    0
Autocorr count: 3
Pulse count:    5
Triplet count:  1
Gaussian count: 0

15:53:30 (485): called boinc_finish(0)

</stderr_txt>
]]>


ID: 2031806
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031810 - Posted: 10 Feb 2020, 15:55:20 UTC - in response to Message 2031805.  
Last modified: 10 Feb 2020, 16:04:08 UTC

Next possible problem. I ran one to around 75%, then suspended it. Should have found something by now, I thought. But the result output file only contained the workunit header - no signals. So I resumed the task. After a few seconds, progress rewound to 0% (as expected from the last test). This time, I let it finish, and held the file by disabling networking. I watched the file size on disk more than double in the last few % of progress, and stay at that size.

Am I missing something about Linux? Does it hold data destined for files in memory, and only commit to disk when the file is closed? * In which case, why is the header portion written immediately?

I need to think about this, and maybe find that offline testing tool - I may have led you down a blind alley.

* Edit - no. I started one of my CPU projects, and its checkpoint files appeared after 15 seconds, and updated the timestamp every 15 seconds thereafter. Changed the checkpoint interval back to 60 seconds, started another, and watched it respect the new interval, too.
ID: 2031810
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031813 - Posted: 10 Feb 2020, 16:11:10 UTC - in response to Message 2031791.  

According to Tbar, yes it fits. Do a system reboot and see what happens.
Ok, I just rebooted it. Let's see what happens. I didn't shut down BOINC or anything.
systemctl reboot was the command!
EDIT: I didn't notice anything weird, only one WU from yesterday?! Am I missing anything? /EDIT
Actually, if you've read up on the ways to avoid the problem, turning off the monitor is the first choice. So I really don't know how that would work if you don't have a monitor running. I know that when I turn off the monitor, I don't have the problem. If you didn't write down the task number, and have the problem, it will usually show up in the Invalid list. It depends on how many pulses you missed compared to other signals.

That's the difference between the Checkpoint problem and the Missed Pulse problem. The Checkpoint problem will give an immediate overflow after resuming; the other problem will miss all pulses... on occasion. Neither happens every time: sometimes the checkpoint works, sometimes it doesn't. Sometimes you miss pulses, sometimes you don't. The only common thing is that when it fails, people want to know why.
ID: 2031813
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031816 - Posted: 10 Feb 2020, 16:20:31 UTC

Am I missing something about Linux? Does it hold data destined for files in memory, and only commit to disk when the file is closed? In which case, why is the header portion written immediately?

Yes, it does. The ext4 file system holds data to be written in memory pages for speed and then writes it to media. Whether and when data is written out immediately depends on the application. ext4 defaults to delayed allocation. This snippet of an article explains a lot:
ext4 and data loss
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031816
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031821 - Posted: 10 Feb 2020, 16:53:56 UTC - in response to Message 2031813.  
Last modified: 10 Feb 2020, 17:03:02 UTC

what's the best, most foolproof way to reproduce this scenario then? the goalposts keep moving, it seems.

I have a card with GDDR5 memory, and want to try to trigger this to happen. if it's a real issue with the app+GDDR5 and not some other system specific issue, it should be easily reproducible with a well defined procedure.

i've left the monitor running, rebooted several times but have been unable to see any missed best pulses so far.

ID: 2031821
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031822 - Posted: 10 Feb 2020, 16:56:34 UTC - in response to Message 2031813.  
Last modified: 10 Feb 2020, 16:57:35 UTC

That's the difference between the Checkpoint problem and the Missed Pulse problem. The Checkpoint problem will give an immediate overflow after resuming; the other problem will miss all pulses... on occasion. Neither happens every time: sometimes the checkpoint works, sometimes it doesn't. Sometimes you miss pulses, sometimes you don't. The only common thing is that when it fails, people want to know why.
Now you're getting me completely confused.

Bear in mind that for the time being I'm concentrating on the checkpoint problem only. Missed pulses can wait for another day.

I've just interrupted four tasks (two at a time, each running by itself on one of the two GPUs in this box). I stopped them via 'sudo systemctl stop boinc-client'. The first pair were not expected to checkpoint (interval set to 300): the second pair should have checkpointed (interval 60: stopped at around 70).

On restart, Manager displayed both progress and time restarting from 0. The tasks, in order, were

8529268332 (overflow, pending)
8529268005 (not overflow, valid)
8529287927 (not overflow, valid)
8529288504 (not overflow, pending)

We'll have to wait to check the first one (average turnround 1.8 days), but the other three rather disprove your assertion. I'm running setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101. The final run times shown on the web task page match what I saw for the final 'run to finish' after the restart: the pre-restart time was lost.
ID: 2031822
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031823 - Posted: 10 Feb 2020, 17:00:26 UTC - in response to Message 2031816.  

Am I missing something about Linux? Does it hold data destined for files in memory, and only commit to disk when the file is closed? In which case, why is the header portion written immediately?
Yes, it does. The ext4 file system holds data to be written in memory pages for speed and then writes it to media. Whether and when data is written out immediately depends on the application. ext4 defaults to delayed allocation. This snippet of an article explains a lot:
ext4 and data loss
Thanks. But did you also see my comment about checkpoint files from my CPU project tasks appearing, and updating, precisely on cue?
ID: 2031823
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031825 - Posted: 10 Feb 2020, 17:10:33 UTC - in response to Message 2031823.  

Thanks. But did you also see my comment about checkpoint files from my CPU project tasks appearing, and updating, precisely on cue?

Yes, I did. All that has to happen for the CPU tasks to write out their state is for the application to make the fsync() call. That must be what is happening in that case.
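
For readers unfamiliar with the call, here is a minimal Python sketch of a checkpoint write that flushes and syncs before returning. It is a generic pattern under stated assumptions, not what any SETI@home or BOINC application is known to do: the function name, the `.tmp` suffix and the write-then-rename step are all illustrative choices. Note that two buffers are involved. Data still held in the program's own buffer is invisible to other processes, while data already handed to the kernel shows up in the file size even before it reaches the disk.

```python
import os
import tempfile


def write_checkpoint(path: str, text: str) -> None:
    """Write text to path and ask the OS to commit it before returning."""
    tmp_path = path + ".tmp"           # hypothetical temporary name
    with open(tmp_path, "w") as f:
        f.write(text)
        f.flush()                      # push Python's own buffer to the kernel
        os.fsync(f.fileno())           # ask the kernel to write cached pages to the device
    os.replace(tmp_path, path)         # swap the finished file into place


def main() -> None:
    # Demonstrate on a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "state.txt")
        write_checkpoint(target, "<prog>0.50000000</prog>\n")
        with open(target) as f:
            print(f.read(), end="")


if __name__ == "__main__":
    main()
```

Writing to a temporary name and renaming means a reader never sees a half-written checkpoint. Skipping the `flush()` and `fsync()` calls leaves the timing of the write to the runtime and the kernel.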
ID: 2031825