Setting up Linux to crunch CUDA90 and above for Windows users
Author | Message |
---|---|
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Yes, and as the experiments wore on I finally came to the conclusion that the Problem doesn't exist with a 1080Ti using GDDR5X RAM, while it happens on Every GPU using GDDR5. I've been working on this for nearly Two Years, then spent Hundreds of dollars on the final test. How long have you been working on it? |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
none, as I never had the issue. my test bench system has a card with GDDR5 mem, GTX 1650. never noticed this problem there either. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
-= Vyper =- Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537 |
Does this host https://setiathome.berkeley.edu/results.php?hostid=8570185 fit the criteria of the problem? I do not have any monitors attached to it and can only access it through SSH. Does a simple reboot from the console make the problem appear? If so, I can test this too. _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Richard, do you know what effect a value of '0' has on the checkpoint setting? Looks like it is set to a safety value 'DEFAULT_CHECKPOINT_PERIOD': https://github.com/BOINC/boinc/blob/master/api/boinc_api.cpp#L608 static int min_checkpoint_period() { int x = (int)aid.checkpoint_period; if (app_min_checkpoint_period > x) { x = app_min_checkpoint_period; } if (x == 0) x = DEFAULT_CHECKPOINT_PERIOD; return x; } This safety value is 300, not the normal 60: https://github.com/BOINC/boinc/blob/master/lib/app_ipc.h#L117 #define DEFAULT_CHECKPOINT_PERIOD 300 |
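The quoted logic can be tried in isolation. What follows is a minimal sketch, not the BOINC source: the two globals are turned into function parameters (`client_period` and `app_min_period` are names made up here for illustration), and only the 300-second constant is taken from the quoted header.

```cpp
#include <cassert>

// Value quoted from lib/app_ipc.h in the post above.
constexpr int DEFAULT_CHECKPOINT_PERIOD = 300;

// Standalone restatement of the quoted min_checkpoint_period().
// client_period  : stands in for aid.checkpoint_period (the user's setting)
// app_min_period : stands in for app_min_checkpoint_period
// Both parameter names are illustrative and are not part of the BOINC API.
int min_checkpoint_period(double client_period, int app_min_period) {
    assert(client_period >= 0 && app_min_period >= 0);
    int x = static_cast<int>(client_period);    // truncates, like the (int) cast
    if (app_min_period > x) x = app_min_period; // the app's minimum wins if larger
    if (x == 0) x = DEFAULT_CHECKPOINT_PERIOD;  // 0 falls back to the safety value
    return x;
}
```

Under these assumptions, a setting of 0 with no app minimum comes out as 300, while 15 or 60 pass through unchanged.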
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
@ Richard We have the checkpoint remover builds compiled, can you guide us on how to make a real test? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
About the removal of the checkpoint. I'm awake! ;-) First of all, I want to see the checkpoint problem for myself. I know it causes validation problems, but why? How? Validation depends solely on the uploaded result file - so that must be corrupted somehow. Too many signals? Too few signals? Duplicated signals? Or some other damage? I'm currently running down the cache on my most recent build, and I'll interrupt/restart some single tasks to see what happens. It also needs a newer driver to test the 10.2 version, so I'll look after that as well. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
I want to wait a few hours while I flush down this cache - ~500 tasks to go (I want to get rid of the shorties and anticipated overflows at the very least, and I have a few other daily chores to complete). Maybe I can cut it down even further? I'll let you know when I'm getting close. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Take your time, I will be out most of the day anyway. I will PM you the link to the compiled version for testing. We will not release it to the rest of the SETIverse until we are sure all is working fine. I don't wish to add any new variable now that we see the servers are back to running fine. First of all, I want to see the checkpoint problem for myself. I know it causes validation problems, but why? How? Maybe only Petri could answer this. When I asked a long time ago, the answer was well beyond my pay grade. |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Does this host https://setiathome.berkeley.edu/results.php?hostid=8570185 fit in the criteria of the problem? According to TBar, yes it fits. Do a system reboot and see what happens. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Richard, do you know what effect a value of '0' has on the checkpoint setting? Looks like it is set to a safety value 'DEFAULT_CHECKPOINT_PERIOD': Thank you. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
-= Vyper =- Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537 |
Ok, I just rebooted it. Let's see what happens. I didn't shut down boinc or anything. systemctl reboot was the command! EDIT: I didn't notice anything weird, only one WU from yesterday?! Am I missing anything? /EDIT _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
OK, I've started testing. Set checkpoint interval to 15 seconds, and working with BLC66/75.vlar tasks - around 90-100 seconds on this machine. First surprise - no checkpoint file. Nada. Even showing hidden files. Eventually caught one flashing into view right at the end of the run, and immediately getting cleaned up as the task finishes. So next time, I caught it. <ncfft>116900</ncfft> <cr>-9.999887e+01</cr> <fl>32768</fl> <prog>0.99999990</prog> <potfreq>-1</potfreq> <potactivity>0</potactivity> <signal_count>15</signal_count> <flops>0.000000</flops> <spike_count>0</spike_count> <autocorr_count>5</autocorr_count> <pulse_count>6</pulse_count> <gaussian_count>0</gaussian_count> <triplet_count>4</triplet_count> Note the fourth line - <prog>0.99999990</prog>. Or 99.99999% progress, to you and me. Any interruption before that point, and I assume there's no progress listed, no starting ncfft - nothing. So the whole thing will run from the beginning again (testing that theory next). |
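For anyone who wants to inspect a caught checkpoint file programmatically, here is a minimal sketch (C++17) that pulls one numeric field out of text shaped like the dump above. The helper names `read_tag` and `resume_fraction` are invented for this example, and treating a missing file as "restart from 0" is the assumption stated in the post, not confirmed app behaviour.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Returns the numeric value between <tag> and </tag>, if the pair is present
// and the body starts with something strtod can parse.
std::optional<double> read_tag(const std::string& text, const std::string& tag) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";
    const std::size_t start = text.find(open);
    if (start == std::string::npos) return std::nullopt;
    const std::size_t value_begin = start + open.size();
    const std::size_t end = text.find(close, value_begin);
    if (end == std::string::npos) return std::nullopt;
    const std::string body = text.substr(value_begin, end - value_begin);
    char* parsed_to = nullptr;
    const double value = std::strtod(body.c_str(), &parsed_to);
    if (parsed_to == body.c_str()) return std::nullopt;  // nothing numeric parsed
    return value;
}

// Fraction done recorded in the checkpoint text, or 0.0 when no <prog> field
// exists (assumed here to mean the task starts over from the beginning).
double resume_fraction(const std::string& checkpoint_text) {
    return read_tag(checkpoint_text, "prog").value_or(0.0);
}
```

Fed the text quoted above, `resume_fraction` would report a value just under 1.0; fed an empty string, it reports 0.0.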
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
I don't see anything in your returned tasks either. these two show evidence that the task restarted, but no signs of this missing best pulse problem. i checked all your pending and inconclusive tasks around this same time and still nothing *shrug* https://setiathome.berkeley.edu/result.php?resultid=8529605384 <core_client_version>7.14.2</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 4 CUDA device(s): Device 1: GeForce GTX 750 Ti, 2001 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 2, pciSlotID = 0 Device 3: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 4, pciSlotID = 0 Device 4: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 5, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device 1: GeForce GTX 750 Ti is okay SETI@home using CUDA accelerated device GeForce GTX 750 Ti Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1 setiathome v8 enhanced x41p_V0.98b1, Cuda 10.1 special Modifications done by petri33, compiled by TBar Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements. Work Unit Info: ............... 
WU true angle range is : 0.423487 Sigma 3 Thread call stack limit is: 1k Spike: peak=24.10717, time=6.711, d_freq=1420081846.28, chirp=0.45658, fft_len=128k setiathome_CUDA: Found 4 CUDA device(s): Device 1: GeForce GTX 750 Ti, 2001 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 2, pciSlotID = 0 Device 3: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 4, pciSlotID = 0 Device 4: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 5, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 3 setiathome_CUDA: CUDA Device 3 specified, checking... Device 3: GeForce GTX 750 Ti is okay SETI@home using CUDA accelerated device GeForce GTX 750 Ti Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1 setiathome v8 enhanced x41p_V0.98b1, Cuda 10.1 special Modifications done by petri33, compiled by TBar Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements. Work Unit Info: ............... 
WU true angle range is : 0.423487 Sigma 3 Thread call stack limit is: 1k Spike: peak=24.10717, time=6.711, d_freq=1420081846.28, chirp=0.45658, fft_len=128k Pulse: peak=4.235778, time=101.1, period=1.503, d_freq=1420073813.48, score=1.023, chirp=29.812, fft_len=512 Best spike: peak=24.10717, time=6.711, d_freq=1420081846.28, chirp=0.45658, fft_len=128k Best autocorr: peak=17.29911, time=73.82, delay=4.002, d_freq=1420079781.87, chirp=22.445, fft_len=128k Best gaussian: peak=3.317156, mean=0.5343372, ChiSq=1.309793, time=47.82, d_freq=1420076284.18, score=-1.667274, null_hyp=2.100377, chirp=88.202, fft_len=16k Best pulse: peak=4.235778, time=101.1, period=1.503, d_freq=1420073813.48, score=1.023, chirp=29.812, fft_len=512 Best triplet: peak=0, time=-2.122e+11, period=0, d_freq=0, chirp=0, fft_len=0 Spike count: 1 Autocorr count: 0 Pulse count: 1 Triplet count: 0 Gaussian count: 0 15:50:47 (486): called boinc_finish(0) </stderr_txt> ]]> https://setiathome.berkeley.edu/result.php?resultid=8529605382 <core_client_version>7.14.2</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 4 CUDA device(s): Device 1: GeForce GTX 750 Ti, 2001 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 2, pciSlotID = 0 Device 3: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 4, pciSlotID = 0 Device 4: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 5, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 2 setiathome_CUDA: CUDA Device 2 specified, checking... Device 2: GeForce GTX 750 Ti is okay SETI@home using CUDA accelerated device GeForce GTX 750 Ti Unroll autotune 1. Overriding Pulse find periods per launch. 
Parameter -pfp set to 1 setiathome v8 enhanced x41p_V0.98b1, Cuda 10.1 special Modifications done by petri33, compiled by TBar Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements. Work Unit Info: ............... WU true angle range is : 0.009608 Sigma 74 Sigma > GaussTOffsetStop: 74 > -10 Thread call stack limit is: 1k Autocorr: peak=18.22544, time=40.09, delay=3.1556, d_freq=7870342325.01, chirp=1.7554, fft_len=128k Pulse: peak=3.761351, time=45.99, period=9.127, d_freq=7870341380.36, score=1.051, chirp=-3.9436, fft_len=4k Pulse: peak=1.430733, time=45.84, period=2.215, d_freq=7870338074.97, score=1.007, chirp=-4.3866, fft_len=512 Pulse: peak=1.738717, time=45.84, period=2.761, d_freq=7870342162.68, score=1.052, chirp=-12.735, fft_len=512 Autocorr: peak=18.09989, time=85.9, delay=4.5231, d_freq=7870340739.66, chirp=-17.637, fft_len=128k Autocorr: peak=18.30738, time=85.9, delay=4.5231, d_freq=7870340739.23, chirp=-17.642, fft_len=128k Pulse: peak=1.105718, time=45.82, period=1.41, d_freq=7870347727.49, score=1.018, chirp=27.734, fft_len=128 Pulse: peak=6.477774, time=45.9, period=17.45, d_freq=7870345273.87, score=1.023, chirp=33.393, fft_len=2k setiathome_CUDA: Found 4 CUDA device(s): Device 1: GeForce GTX 750 Ti, 2001 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 1, pciSlotID = 0 Device 2: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 2, pciSlotID = 0 Device 3: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 4, pciSlotID = 0 Device 4: GeForce GTX 750 Ti, 2002 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 5, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 2 setiathome_CUDA: CUDA Device 2 specified, checking... Device 2: GeForce GTX 750 Ti is okay SETI@home using CUDA accelerated device GeForce GTX 750 Ti Unroll autotune 1. Overriding Pulse find periods per launch. 
Parameter -pfp set to 1 setiathome v8 enhanced x41p_V0.98b1, Cuda 10.1 special Modifications done by petri33, compiled by TBar Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements. Work Unit Info: ............... WU true angle range is : 0.009608 Sigma 74 Sigma > GaussTOffsetStop: 74 > -10 Thread call stack limit is: 1k Autocorr: peak=18.22544, time=40.09, delay=3.1556, d_freq=7870342325.01, chirp=1.7554, fft_len=128k Pulse: peak=3.761351, time=45.99, period=9.127, d_freq=7870341380.36, score=1.051, chirp=-3.9436, fft_len=4k Pulse: peak=1.430733, time=45.84, period=2.215, d_freq=7870338074.97, score=1.007, chirp=-4.3866, fft_len=512 Pulse: peak=1.738717, time=45.84, period=2.761, d_freq=7870342162.68, score=1.052, chirp=-12.735, fft_len=512 Autocorr: peak=18.09989, time=85.9, delay=4.5231, d_freq=7870340739.66, chirp=-17.637, fft_len=128k Autocorr: peak=18.30738, time=85.9, delay=4.5231, d_freq=7870340739.23, chirp=-17.642, fft_len=128k Pulse: peak=1.105718, time=45.82, period=1.41, d_freq=7870347727.49, score=1.018, chirp=27.734, fft_len=128 Pulse: peak=6.477774, time=45.9, period=17.45, d_freq=7870345273.87, score=1.023, chirp=33.393, fft_len=2k Triplet: peak=10.62004, time=59.19, period=28.9, d_freq=7870338017.16, chirp=-80.088, fft_len=1024 Best spike: peak=23.02097, time=73.01, d_freq=7870344594.63, chirp=-72.298, fft_len=32k Best autocorr: peak=18.30738, time=85.9, delay=4.5231, d_freq=7870340739.23, chirp=-17.642, fft_len=128k Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.124e+11, d_freq=0, score=-12, null_hyp=0, chirp=0, fft_len=0 Best pulse: peak=1.738717, time=45.84, period=2.761, d_freq=7870342162.68, score=1.052, chirp=-12.735, fft_len=512 Best triplet: peak=10.62004, time=59.19, period=28.9, d_freq=7870338017.16, chirp=-80.088, fft_len=1024 Spike count: 0 Autocorr count: 3 Pulse count: 5 Triplet count: 1 Gaussian count: 0 15:53:30 (485): called boinc_finish(0) </stderr_txt> ]]> Seti@Home classic workunits: 29,492 CPU 
time: 134,419 hours |
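When checking many results this way, the two things looked for above can be extracted mechanically: a repeated start banner (evidence of a restart) and the final signal counts. This is only a sketch (C++17) over stderr text shaped like the dumps above; `count_starts` and `last_count` are hypothetical helpers, not part of any SETI@home or BOINC tool.

```cpp
#include <optional>
#include <string>

// Counts how often the app's start banner appears in a stderr dump.
// More than one occurrence means the task was started more than once.
int count_starts(const std::string& stderr_text) {
    const std::string banner = "setiathome_CUDA: Found";
    int n = 0;
    for (std::size_t pos = stderr_text.find(banner);
         pos != std::string::npos;
         pos = stderr_text.find(banner, pos + banner.size())) {
        ++n;
    }
    return n;
}

// Returns the integer after the LAST "<label>: " in the dump,
// e.g. label "Pulse count" for the summary line "Pulse count: 5".
std::optional<long> last_count(const std::string& stderr_text,
                               const std::string& label) {
    const std::string key = label + ": ";
    const std::size_t pos = stderr_text.rfind(key);
    if (pos == std::string::npos) return std::nullopt;
    std::size_t i = pos + key.size();
    if (i >= stderr_text.size() || stderr_text[i] < '0' || stderr_text[i] > '9') {
        return std::nullopt;  // label found, but no digits follow it
    }
    long value = 0;
    while (i < stderr_text.size() && stderr_text[i] >= '0' && stderr_text[i] <= '9') {
        value = value * 10 + (stderr_text[i] - '0');
        ++i;
    }
    return value;
}
```

For the second result above, these would report two starts and a pulse count of 5.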
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Next possible problem. I ran one to around 75%, then suspended it. Should have found something by now, I thought. But the result output file only contained the workunit header - no signals. So I resumed the task. After a few seconds, progress rewound to 0% (as expected from the last test). This time, I let it finish, and held the file by disabling networking. I watched the file size on disk more than double in the last few % of progress, and stay at that size. Am I missing something about Linux? Does it hold data destined for files in memory, and only commit to disk when the file is closed? * In which case, why is the header portion written immediately? I need to think about this, and maybe find that offline testing tool - I may have led you down a blind alley. * Edit - no. I started one of my CPU projects, and its checkpoint files appeared after 15 seconds, and updated the timestamp every 15 seconds thereafter. Changed the checkpoint interval back to 60 seconds, started another, and watched it respect the new interval, too. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Actually, if you've read up on the ways to avoid the problem, turning Off the monitor is the first choice. So, I really don't know how that would work if you don't have a monitor running. I know when I turn off the monitor, I don't have the problem. If you didn't write down the task number, and have the problem, it will usually show up in the Invalid list. It depends on how many pulses you missed compared to other signals. According to Tbar, yes it fits. Do a system reboot and see what happens. Ok, i just rebooted it. Lets see what happens. I didn't shutdown boinc or anything. That's the difference between the Checkpoint problem and the Missed Pulse problem. The Checkpoint problem will give an Immediate Overflow after resuming, the Other problem will Miss All Pulses...on occasion. I've seen both not happen all of the time, some times the Checkpoint works, some times it doesn't. Some times you Miss Pulses, some times you don't. The only common thing is when it fails people want to know why. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Am I missing something about Linux? Does it hold data destined for files in memory, and only commit to disk when the file is closed? In which case, why is the header portion written immediately? Yes, it does. The ext4 file system holds data to be written in memory pages for speed and then writes it to media. Whether and when data is written immediately depends on the application. It defaults to delayed allocation. This snippet of an article explains a lot. ext4 and data loss Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
what's the best most fool-proof way to reproduce this scenario then? the goal posts keep moving it seems. I have a card with GDDR5 memory, and want to try to trigger this to happen. if it's a real issue with the app+GDDR5 and not some other system specific issue, it should be easily reproducible with a well defined procedure. i've left the monitor running, rebooted several times but have been unable to see any missed best pulses so far. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
That's the difference between the Checkpoint problem and the Missed Pulse problem. The Checkpoint problem will give an Immediate Overflow after resuming, the Other problem will Miss All Pulses...on occasion. I've seen both not happen all of the time, some times the Checkpoint works, some times it doesn't. Some times you Miss Pulses, some times you don't. The only common thing is when it fails people want to know why. Now you're getting me completely confused. Bear in mind that for the time being I'm concentrating on the checkpoint problem only. Missed pulses can wait for another day. I've just interrupted four tasks (two at a time, each running by itself on one of the two GPUs in this box). I stopped them via 'sudo systemctl stop boinc-client'. The first pair were not expected to checkpoint (interval set to 300): the second pair should have checkpointed (interval 60: stopped at around 70). On restart, Manager displayed both progress and time restarting from 0. The tasks, in order, were: 8529268332 (overflow, pending); 8529268005 (not overflow, valid); 8529287927 (not overflow, valid); 8529288504 (not overflow, pending). We'll have to wait to check the first one (average turnround 1.8 days), but the other three rather disprove your assertion. I'm running setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101. The final run times shown on the web task page match what I saw for the final 'run to finish' after the restart: the pre-restart time was lost. |
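The expectation behind the two pairs in this test can be written down as simple arithmetic. This is a deliberately simplified sketch: it assumes a checkpoint becomes due once per full interval of run time, and it ignores the fact that a real app only writes when it reaches a point where checkpointing is safe. The function name `checkpoints_due` is made up for the example.

```cpp
// Number of checkpoints that would have come due before a task was stopped,
// under the simplifying assumption of one checkpoint per full interval.
int checkpoints_due(int elapsed_seconds, int interval_seconds) {
    if (interval_seconds <= 0 || elapsed_seconds < 0) return 0;  // nothing sensible to report
    return elapsed_seconds / interval_seconds;                   // integer division: full intervals only
}
```

With the figures from the post, stopping at around 70 seconds gives no checkpoint due at an interval of 300 and one due at an interval of 60, which is why only the second pair "should have checkpointed".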
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Thanks. But did you also see my comment about checkpoint files from my CPU project tasks appearing, and updating, precisely on cue? Am I missing something about Linux? Does it hold data destined for files in memory, and only commit to disk when the file is closed? In which case, why is the header portion written immediately? Yes, it does. The ext4 file system holds data to be written in memory pages for speed and then writes it to media. Whether and when data is written immediately depends on the application. It defaults to delayed allocation. This snippet of an article explains a lot. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks. But did you also see my comment about checkpoint files from my CPU project tasks appearing, and updating, precisely on cue? Yes, I did. All that has to happen for the CPU tasks to write out their state is to have the application make the fsync() call. That must be what is happening in that case. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
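For reference, this is roughly what "write, then fsync()" looks like at the POSIX level. It is a sketch only: nothing in this thread confirms that either the CPU apps or the special CUDA app write their files this way, and `write_durably` is a name invented here.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Writes data to path and asks the kernel to push it to the storage device
// before returning. Returns false if any step fails.
bool write_durably(const std::string& path, const std::string& data) {
    const int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;

    std::size_t written = 0;
    while (written < data.size()) {
        // write() may accept fewer bytes than requested, so loop until done.
        const ssize_t n = ::write(fd, data.data() + written, data.size() - written);
        if (n <= 0) {  // treat errors (and an unexpected 0) as failure
            ::close(fd);
            return false;
        }
        written += static_cast<std::size_t>(n);
    }

    const bool synced = (::fsync(fd) == 0);  // flush this file's dirty pages to media
    const bool closed = (::close(fd) == 0);
    return synced && closed;
}
```

One caveat on the design: fsync() is about data reaching the disk. An application that buffers output in its own memory (for example through stdio) also has to flush that buffer before other processes can see the data at all, so a file that stays small until the end of a run may point to application-side buffering as much as to the file system.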