Setting up Linux to crunch CUDA90 and above for Windows users

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031826 - Posted: 10 Feb 2020, 17:12:25 UTC - in response to Message 2031822.  

That's the difference between the Checkpoint problem and the Missed Pulse problem. The Checkpoint problem will give an Immediate Overflow after resuming, the Other problem will Miss All Pulses...on occasion. I've seen both not happen all of the time; sometimes the Checkpoint works, sometimes it doesn't. Sometimes you Miss Pulses, sometimes you don't. The only common thing is when it fails people want to know why.
Now you're getting me completely confused.

Bear in mind that for the time being I'm concentrating on the checkpoint problem only. Missed pulses can wait for another day.

I've just interrupted four tasks (two at a time, each running by itself on one of the two GPUs in this box). I stopped them via 'sudo systemctl stop boinc-client'. The first pair were not expected to have checkpointed (interval set to 300); the second pair should have checkpointed (interval 60, stopped at around 70 seconds).

On restart, Manager displayed both progress and elapsed time restarting from 0. The tasks, in order, were

8529268332 (overflow, pending)
8529268005 (not overflow, valid)
8529287927 (not overflow, valid)
8529288504 (not overflow, pending)

We'll have to wait to check the first one (average turnround 1.8 days), but the other three rather disprove your assertion. I'm running setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101. The final run times shown on the web task page match what I saw for the final 'run to finish' after the restart: the pre-restart time was lost.


I had a similar experience just now. I set checkpointing to 10 seconds on the nominal app and allowed it to run to ~50% (several minutes on a GTX 1650). I killed BOINC and restarted it, and the task restarted from the beginning.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031826
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031827 - Posted: 10 Feb 2020, 17:22:32 UTC - in response to Message 2031822.  

...Now you're getting me completely confused...
Here you go, read up on the comments from one of your old Buds. He seemed annoyed about the Checkpoint problem back then, and I think you were in a few of his threads: https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1906253#1906253
Do a search on his user ID and 'checkpoint'; he'll refresh your memory for you. As far as I know, not much has changed since then.
ID: 2031827
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031829 - Posted: 10 Feb 2020, 17:32:51 UTC - in response to Message 2031825.  

Thanks. But did you also see my comment about checkpoint files from my CPU project tasks appearing, and updating, precisely on cue?
Yes, I did. All that has to happen for the CPU tasks to write out their state is to have the application do the fsync() call. Must be what is happening for that case.
OK, I've read the article (well, skimmed it), and confirmed that my system has the default values for lazy writes:

    /proc/sys/vm/dirty_expire_centisecs          3000  // 30 seconds
    /proc/sys/vm/dirty_writeback_centisecs        500  //  5 seconds
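
For anyone following along, the write-then-fsync pattern described above looks roughly like the sketch below. This is purely an illustration of the technique, not the actual app's code - the function and file names are made up:

    #include <stdio.h>
    #include <unistd.h>

    /* Illustrative sketch only: write the task state, flush the stdio
       buffers to the kernel, then fsync() so the file hits the disk at
       once instead of waiting out dirty_writeback_centisecs. */
    static int write_checkpoint(const char *path, const void *state, size_t len)
    {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        if (fwrite(state, 1, len, f) != len) { fclose(f); return -1; }
        fflush(f);              /* stdio buffer -> kernel page cache */
        fsync(fileno(f));       /* kernel page cache -> disk, now    */
        return fclose(f);
    }

    int main(void)
    {
        double progress = 0.5;  /* stand-in for real task state */
        return write_checkpoint("checkpoint.tmp", &progress, sizeof progress);
    }

Without the fsync() the file still reaches the disk eventually, just on the kernel's writeback schedule above - which is seconds, not minutes.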
I suspended a task from BOINC Manager 10 seconds after the checkpoint file should have been written: that was at least five minutes ago. Nothing has appeared in the slot folder yet.

This machine is currently running off a single 512GB M.2 PCIe SSD: I don't think it's got that long a disk write queue!
ID: 2031829
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031830 - Posted: 10 Feb 2020, 17:38:11 UTC - in response to Message 2031821.  
Last modified: 10 Feb 2020, 17:51:21 UTC

What's the best, most fool-proof way to reproduce this scenario, then? The goal posts keep moving, it seems. I have a card with GDDR5 memory and want to try to trigger this to happen. If it's a real issue with the app+GDDR5 and not some other system-specific issue, it should be easily reproducible with a well-defined procedure. I've left the monitor running and rebooted several times, but have been unable to see any missed best pulses so far.
Here's a Post from Yesterday, I'm fairly certain you read it: "I later found my other machines had the same problem once I turned the monitors on and started actually using the machines."
That's fairly Clear that the problem doesn't happen when the monitor is OFF. How you went from that to saying having the monitor Off is fine for testing is anyone's guess. I'd suggest trying the same configurations the others had when they experienced the problem. Juan was running 4 GDDR5 1070s, the other poster here was running 4 or 5 GDDR5 980s. I've seen the problem when running multiple GDDR5 GPUs, anywhere from 2 to 14. If you are using a Single card, you are out of the known sample. I'd say three would be a good start, and most people have one connected to the monitor, the others disconnected.
ID: 2031830
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031831 - Posted: 10 Feb 2020, 17:39:31 UTC - in response to Message 2031827.  

As far as I know, not much has changed since then.
And you probably know more than me, because all the development discussion takes place in a forum I don't have access to.

@ GPUUG: Any news on whether the checkpointing problem has been addressed since 10 Dec 2017 (except this week, of course)?

(I'd better go and read that ReadMe file!)
ID: 2031831
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031834 - Posted: 10 Feb 2020, 17:59:45 UTC - in response to Message 2031830.  

What's the best, most fool-proof way to reproduce this scenario, then? The goal posts keep moving, it seems. I have a card with GDDR5 memory and want to try to trigger this to happen. If it's a real issue with the app+GDDR5 and not some other system-specific issue, it should be easily reproducible with a well-defined procedure. I've left the monitor running and rebooted several times, but have been unable to see any missed best pulses so far.
Here's a Post from Yesterday, I'm fairly certain you read it: "I later found my other machines had the same problem once I turned the monitors on and started actually using the machines."
That's fairly Clear that the problem doesn't happen when the monitor is OFF. How you went from that to saying having the monitor Off is fine for testing is anyone's guess. I'd suggest trying the same configurations the others had when they experienced the problem. Juan was running 4 GDDR5 1070s, the other poster here was running 4 or 5 GDDR5 980s. I've seen the problem when running multiple GDDR5 GPUs, anywhere from 2 to 14. If you are using a Single card, you are out of the known sample. I'd say three would be a good start, and most people have one connected to the monitor, the others disconnected.


I don’t know why you think I had the monitor Off. I stated I left the monitor running = ON.

I’m unclear as to the monitor being a contributing factor. In one sense you say that the monitor being off prevents it from happening, yet at the same time you say it only affects the cards NOT hooked to the monitor? Can you explain that?

I’ll look into shuffling some cards around, or picking up a cheap GDDR5 card, to try to replicate this with Multi-GPU. Not looking hopeful, though; both Juan and Keith seem to have been unsuccessful as well.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031834
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031835 - Posted: 10 Feb 2020, 18:06:38 UTC - in response to Message 2031831.  

As far as I know, not much has changed since then.
And you probably know more than me, because all the development discussion takes place in a forum I don't have access to.

@ GPUUG: Any news on whether the checkpointing problem has been addressed since 10 Dec 2017 (except this week, of course)?

(I'd better go and read that ReadMe file!)


I recompiled an app with a modification to the code in an attempt to remove checkpointing. Juan instructed me which line to change (he doesn’t yet know how to compile the special app, so I did that for him).

But I haven’t really seen different behavior between the default and modified apps. I’m not seeing it checkpoint. I would assume that if I set the checkpoint to 10 seconds it would checkpoint, but even on the default unchanged app it still starts over from the beginning.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031835
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031836 - Posted: 10 Feb 2020, 18:17:23 UTC - in response to Message 2031834.  

This:
Does this host https://setiathome.berkeley.edu/results.php?hostid=8570185 fit the criteria of the problem?
I do not have any monitors attached to it and can only access it through SSH.
Does a simple reboot in the console make the problem appear? If so I can test this too.
According to TBar, yes it fits. Do a system reboot and see what happens.
I believe that's YOU telling Viper testing without a monitor is fine? It certainly looks that way.

What's the best, most fool-proof way to reproduce this scenario, then? The goal posts keep moving, it seems. I have a card with GDDR5 memory and want to try to trigger this to happen. If it's a real issue with the app+GDDR5 and not some other system-specific issue, it should be easily reproducible with a well-defined procedure. I've left the monitor running and rebooted several times, but have been unable to see any missed best pulses so far.
Here's a Post from Yesterday, I'm fairly certain you read it: "I later found my other machines had the same problem once I turned the monitors on and started actually using the machines."
That's fairly Clear that the problem doesn't happen when the monitor is OFF. How you went from that to saying having the monitor Off is fine for testing is anyone's guess. I'd suggest trying the same configurations the others had when they experienced the problem. Juan was running 4 GDDR5 1070s, the other poster here was running 4 or 5 GDDR5 980s. I've seen the problem when running multiple GDDR5 GPUs, anywhere from 2 to 14. If you are using a Single card, you are out of the known sample. I'd say three would be a good start, and most people have one connected to the monitor, the others disconnected.


I don’t know why you think I had the monitor Off. I stated I left the monitor running = ON.

I’m unclear as to the monitor being a contributing factor. In one sense you say that the monitor being off prevents it from happening, yet at the same time you say it only affects the cards NOT hooked to the monitor? Can you explain that?

I’ll look into shuffling some cards around, or picking up a cheap GDDR5 card, to try to replicate this with Multi-GPU. Not looking hopeful, though; both Juan and Keith seem to have been unsuccessful as well.
ID: 2031836
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031837 - Posted: 10 Feb 2020, 18:24:45 UTC - in response to Message 2031831.  

(I'd better go and read that ReadMe file!)
Which says, in its entirety,

6) The App may give Incorrect results on a restarted task. One way to avoid restarted tasks is to set the checkpoint higher than the longest task's estimated run-time, and also avoid suspending/resuming a task.
The ReadMe itself has a datestamp of 07 December 2019, and a reference to the CUDA 10.2 app, so I think it's current (again, before this week's changes).
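
For reference, the interval that advice refers to is the client's 'write to disk at most every N seconds' computing preference. If I remember the client correctly, it can be pinned locally with a global_prefs_override.xml in the BOINC data directory - the 3600 below is only an example value; pick something longer than your longest task:

    <global_preferences>
       <disk_interval>3600</disk_interval>
    </global_preferences>

then have the Manager do a 'Read local prefs file' (or restart the client) so it takes effect.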
ID: 2031837
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031838 - Posted: 10 Feb 2020, 18:31:36 UTC - in response to Message 2031836.  
Last modified: 10 Feb 2020, 18:34:00 UTC

Well, it’s simple. I was unaware that the monitor’s presence was so important, since you claimed it only affected cards not attached to the monitor. Oh well, *shrug*.

Any comment on why that is? Since it doesn’t affect the cards attached to the monitor, why would it matter if one wasn’t plugged in at all?

This issue is getting more and more fringe with each additional nugget you add to the exact reproduction requirements.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031838
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031839 - Posted: 10 Feb 2020, 18:36:43 UTC - in response to Message 2031837.  
Last modified: 10 Feb 2020, 18:50:30 UTC

(I'd better go and read that ReadMe file!)
Which says, in its entirety,

6) The App may give Incorrect results on a restarted task. One way to avoid restarted tasks is to set the checkpoint higher than the longest task's estimated run-time, and also avoid suspending/resuming a task.
The ReadMe itself has a datestamp of 07 December 2019, and a reference to the CUDA 10.2 app, so I think it's current (again, before this week's changes).


Oh, that's another case.

Suspend/resume on a task that was running for >5 mins caused it to restart from the beginning. Checkpoint set to 10 seconds, running unmodified code (as far as checkpointing is concerned).

See the results: https://setiathome.berkeley.edu/result.php?resultid=8530898546
No immediate overflow. Hmm.
<core_client_version>7.16.1</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1650, 3908 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 14 
     pciBusID = 2, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1650 is okay
SETI@home using CUDA accelerated device GeForce GTX 1650
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

---------------------------------------------------------
SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.2 special
-------------------------------------------------------------------------
Modifications done by petri33, Mutex by Oddbjornik. Compiled by Ian (^_^)
-------------------------------------------------------------------------

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.013941
Sigma 97
Sigma > GaussTOffsetStop: 97 > -33
Thread call stack limit is: 1k
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1650, 3908 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 14 
     pciBusID = 2, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1650 is okay
SETI@home using CUDA accelerated device GeForce GTX 1650
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

---------------------------------------------------------
SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.2 special
-------------------------------------------------------------------------
Modifications done by petri33, Mutex by Oddbjornik. Compiled by Ian (^_^)
-------------------------------------------------------------------------

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.013941
Sigma 97
Sigma > GaussTOffsetStop: 97 > -33
Thread call stack limit is: 1k
namedMutex: Previous mutex lock holder died in a bad way.
namedMutex: mutex is now consistent and the lock has been acquired.
Acquired CUDA mutex at 13:35:45,303
Spike: peak=25.86527, time=6.711, d_freq=1420131089.24, chirp=0, fft_len=128k
Spike: peak=26.89025, time=20.13, d_freq=1420122652.43, chirp=0, fft_len=128k
Spike: peak=24.61807, time=6.711, d_freq=1420131089.25, chirp=0.00092426, fft_len=128k
Spike: peak=27.19312, time=20.13, d_freq=1420122652.44, chirp=0.00092426, fft_len=128k
Spike: peak=25.89173, time=6.711, d_freq=1420131089.23, chirp=-0.00092426, fft_len=128k
Spike: peak=24.62306, time=6.711, d_freq=1420131089.23, chirp=-0.0018485, fft_len=128k
Spike: peak=27.94842, time=20.13, d_freq=1420122652.45, chirp=-0.0027728, fft_len=128k
Spike: peak=26.68917, time=20.13, d_freq=1420122652.43, chirp=0.003697, fft_len=128k
Spike: peak=25.5317, time=20.13, d_freq=1420122652.43, chirp=-0.003697, fft_len=128k
Spike: peak=25.04365, time=20.13, d_freq=1420122652.44, chirp=0.0046213, fft_len=128k
Spike: peak=25.17622, time=6.711, d_freq=1420122297.96, chirp=-0.0064698, fft_len=128k
Spike: peak=27.33359, time=20.13, d_freq=1420122652.45, chirp=-0.0064698, fft_len=128k
Spike: peak=24.86638, time=20.13, d_freq=1420122652.43, chirp=0.0073941, fft_len=128k
Spike: peak=26.38151, time=6.711, d_freq=1420122297.95, chirp=-0.0073941, fft_len=128k
Spike: peak=26.73647, time=6.711, d_freq=1420122297.95, chirp=-0.0083183, fft_len=128k
Spike: peak=26.15466, time=6.711, d_freq=1420122297.94, chirp=-0.0092426, fft_len=128k
Spike: peak=24.64791, time=6.711, d_freq=1420122297.93, chirp=-0.010167, fft_len=128k
Spike: peak=25.53647, time=20.13, d_freq=1420122652.45, chirp=-0.010167, fft_len=128k
Spike: peak=24.43285, time=6.711, d_freq=1420131089.24, chirp=-0.011091, fft_len=128k
Pulse: peak=4.565244, time=53.74, period=13.04, d_freq=1420128421.21, score=1.025, chirp=-14.74, fft_len=1024 
Pulse: peak=0.9206985, time=53.71, period=0.8667, d_freq=1420127920.98, score=1.147, chirp=24.411, fft_len=512 
Pulse: peak=6.264369, time=53.7, period=16.99, d_freq=1420123459.99, score=1.07, chirp=25.878, fft_len=256 
Pulse: peak=4.276228, time=53.7, period=10.35, d_freq=1420126806.94, score=1.092, chirp=-38.951, fft_len=256 
Pulse: peak=11.43044, time=53.7, period=34.73, d_freq=1420125784.12, score=1.093, chirp=-69.364, fft_len=256 
Pulse: peak=6.072789, time=53.7, period=13.55, d_freq=1420130468.28, score=1.045, chirp=70.431, fft_len=256 
Pulse: peak=1.214941, time=53.7, period=1.475, d_freq=1420125425.59, score=1.143, chirp=-80.303, fft_len=256 
Pulse: peak=7.049133, time=53.7, period=20.71, d_freq=1420130855.44, score=1.037, chirp=81.903, fft_len=256 
Pulse: peak=7.853018, time=53.74, period=24.61, d_freq=1420130212.63, score=1.008, chirp=-97.11, fft_len=1024 
Normal release of CUDA mutex after 275.928 seconds at 13:40:21,231

Best spike: peak=27.94842, time=20.13, d_freq=1420122652.45, chirp=-0.0027728, fft_len=128k
Best autocorr: peak=17.38795, time=87.24, delay=6.4449, d_freq=1420126627.44, chirp=-3.7331, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.122e+11, d_freq=0,
	score=-12, null_hyp=0, chirp=0, fft_len=0 
Best pulse: peak=0.9206985, time=53.71, period=0.8667, d_freq=1420127920.98, score=1.147, chirp=24.411, fft_len=512 
Best triplet: peak=0, time=-2.122e+11, period=0, d_freq=0, chirp=0, fft_len=0 

Spike count:    19
Autocorr count: 0
Pulse count:    9
Triplet count:  0
Gaussian count: 0

13:40:21 (15874): called boinc_finish(0)

</stderr_txt>
]]>

Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031839
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031842 - Posted: 10 Feb 2020, 18:53:08 UTC - in response to Message 2031839.  

oh that's another case.

suspend/resume on a task that was running for >5 mins, caused it to restart from the beginning.
Yes, that's to be expected. GPU apps are never kept in memory when suspended (whatever the setting of LAIM, 'Leave Applications In Memory'). So the application always starts from cold, but whether the task resumes from checkpoint - well, that's what we're discussing here.
ID: 2031842
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031844 - Posted: 10 Feb 2020, 18:54:35 UTC - in response to Message 2031839.  

And this one: paused it at ~85%, and it restarted from the beginning. Same app and checkpoint settings as before.

Again, no overflow or signs of an obvious problem. I'll have to keep an eye on them to see if they validate.

https://setiathome.berkeley.edu/result.php?resultid=8530898585

<core_client_version>7.16.1</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1650, 3908 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 14 
     pciBusID = 2, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1650 is okay
SETI@home using CUDA accelerated device GeForce GTX 1650
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

---------------------------------------------------------
SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.2 special
-------------------------------------------------------------------------
Modifications done by petri33, Mutex by Oddbjornik. Compiled by Ian (^_^)
-------------------------------------------------------------------------

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.010006
Sigma 72
Sigma > GaussTOffsetStop: 72 > -8
Thread call stack limit is: 1k
Acquired CUDA mutex at 13:34:15,703
Spike: peak=24.24003, time=62.99, d_freq=7768989285.06, chirp=13.136, fft_len=128k
Spike: peak=25.17315, time=62.99, d_freq=7768989285.06, chirp=13.15, fft_len=128k
Spike: peak=25.2738, time=62.99, d_freq=7768989285.05, chirp=13.151, fft_len=128k
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1650, 3908 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 14 
     pciBusID = 2, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1650 is okay
SETI@home using CUDA accelerated device GeForce GTX 1650
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

---------------------------------------------------------
SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.2 special
-------------------------------------------------------------------------
Modifications done by petri33, Mutex by Oddbjornik. Compiled by Ian (^_^)
-------------------------------------------------------------------------

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.010006
Sigma 72
Sigma > GaussTOffsetStop: 72 > -8
Thread call stack limit is: 1k
Acquired CUDA mutex at 13:40:21,231
Spike: peak=24.24003, time=62.99, d_freq=7768989285.06, chirp=13.136, fft_len=128k
Spike: peak=25.17315, time=62.99, d_freq=7768989285.06, chirp=13.15, fft_len=128k
Spike: peak=25.2738, time=62.99, d_freq=7768989285.05, chirp=13.151, fft_len=128k
Pulse: peak=1.465625, time=45.84, period=2.009, d_freq=7768995643.74, score=1.034, chirp=18.717, fft_len=512 
Pulse: peak=5.816671, time=45.84, period=15.09, d_freq=7768992597.84, score=1.015, chirp=21.51, fft_len=512 
Autocorr: peak=17.91966, time=28.63, delay=1.8598, d_freq=7768994010.02, chirp=22.084, fft_len=128k
Pulse: peak=4.197519, time=45.81, period=8.652, d_freq=7768993221.95, score=1.008, chirp=-26.817, fft_len=32 
Pulse: peak=7.609076, time=45.99, period=19.45, d_freq=7768990617.88, score=1.03, chirp=-42.025, fft_len=4k
Pulse: peak=9.224659, time=46.17, period=26.49, d_freq=7768989424.6, score=1.004, chirp=-46.739, fft_len=8k
Pulse: peak=3.718736, time=45.9, period=9.06, d_freq=7768993476.43, score=1.015, chirp=73.853, fft_len=2k
Pulse: peak=6.40499, time=45.99, period=17.45, d_freq=7768991108.61, score=1.038, chirp=-76.734, fft_len=4k
Pulse: peak=6.402134, time=45.99, period=17.45, d_freq=7768991109.83, score=1.037, chirp=-76.769, fft_len=4k
Pulse: peak=1.796793, time=45.82, period=2.734, d_freq=7768991086.68, score=1.045, chirp=82.688, fft_len=128 
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1650, 3908 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 14 
     pciBusID = 2, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1650 is okay
SETI@home using CUDA accelerated device GeForce GTX 1650
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

---------------------------------------------------------
SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.2 special
-------------------------------------------------------------------------
Modifications done by petri33, Mutex by Oddbjornik. Compiled by Ian (^_^)
-------------------------------------------------------------------------

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.010006
Sigma 72
Sigma > GaussTOffsetStop: 72 > -8
Thread call stack limit is: 1k
namedMutex: Previous mutex lock holder died in a bad way.
namedMutex: mutex is now consistent and the lock has been acquired.
Acquired CUDA mutex at 13:44:28,979
Spike: peak=24.24003, time=62.99, d_freq=7768989285.06, chirp=13.136, fft_len=128k
Spike: peak=25.17315, time=62.99, d_freq=7768989285.06, chirp=13.15, fft_len=128k
Spike: peak=25.2738, time=62.99, d_freq=7768989285.05, chirp=13.151, fft_len=128k
Pulse: peak=1.465625, time=45.84, period=2.009, d_freq=7768995643.74, score=1.034, chirp=18.717, fft_len=512 
Pulse: peak=5.816671, time=45.84, period=15.09, d_freq=7768992597.84, score=1.015, chirp=21.51, fft_len=512 
Autocorr: peak=17.91966, time=28.63, delay=1.8598, d_freq=7768994010.02, chirp=22.084, fft_len=128k
Pulse: peak=4.197521, time=45.81, period=8.652, d_freq=7768993221.95, score=1.008, chirp=-26.817, fft_len=32 
Pulse: peak=7.609076, time=45.99, period=19.45, d_freq=7768990617.88, score=1.03, chirp=-42.025, fft_len=4k
Pulse: peak=9.224659, time=46.17, period=26.49, d_freq=7768989424.6, score=1.004, chirp=-46.739, fft_len=8k
Pulse: peak=3.718736, time=45.9, period=9.06, d_freq=7768993476.43, score=1.015, chirp=73.853, fft_len=2k
Pulse: peak=6.40499, time=45.99, period=17.45, d_freq=7768991108.61, score=1.038, chirp=-76.734, fft_len=4k
Pulse: peak=6.402134, time=45.99, period=17.45, d_freq=7768991109.83, score=1.037, chirp=-76.769, fft_len=4k
Pulse: peak=1.796793, time=45.82, period=2.734, d_freq=7768991086.68, score=1.045, chirp=82.688, fft_len=128 
Pulse: peak=3.353529, time=45.84, period=6.621, d_freq=7768990246.3, score=1.008, chirp=93.583, fft_len=512 
Normal release of CUDA mutex after 248.982 seconds at 13:48:37,961

Best spike: peak=25.2738, time=62.99, d_freq=7768989285.05, chirp=13.151, fft_len=128k
Best autocorr: peak=17.91966, time=28.63, delay=1.8598, d_freq=7768994010.02, chirp=22.084, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.124e+11, d_freq=0,
	score=-12, null_hyp=0, chirp=0, fft_len=0 
Best pulse: peak=1.796793, time=45.82, period=2.734, d_freq=7768991086.68, score=1.045, chirp=82.688, fft_len=128 
Best triplet: peak=0, time=-2.124e+11, period=0, d_freq=0, chirp=0, fft_len=0 

Spike count:    3
Autocorr count: 1
Pulse count:    10
Triplet count:  0
Gaussian count: 0

13:48:38 (16567): called boinc_finish(0)

</stderr_txt>
]]>

Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031844
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031845 - Posted: 10 Feb 2020, 18:56:22 UTC - in response to Message 2031842.  
Last modified: 10 Feb 2020, 19:00:45 UTC

Oh, that's another case.

Suspend/resume on a task that was running for >5 mins caused it to restart from the beginning.
Yes, that's to be expected. GPU apps are never kept in memory when suspended (whatever the setting of LAIM, 'Leave Applications In Memory'). So the application always starts from cold, but whether the task resumes from checkpoint - well, that's what we're discussing here.


Well, that's what I'm trying to even get to happen. So far I've been unsuccessful in getting anything to resume from a checkpoint; it always just starts over from the beginning (which, if I'm not mistaken, is the behavior we want anyway, right?). Can you give instructions on how you're getting a task to restart from a checkpoint? Do I need to just kill BOINC unexpectedly?
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031845
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031851 - Posted: 10 Feb 2020, 19:30:50 UTC - in response to Message 2031845.  

Well, that's what I'm trying to even get to happen. So far I've been unsuccessful in getting anything to resume from a checkpoint; it always just starts over from the beginning (which, if I'm not mistaken, is the behavior we want anyway, right?). Can you give instructions on how you're getting a task to restart from a checkpoint? Do I need to just kill BOINC unexpectedly?
No, I can't - and I can't see any sign of a checkpoint being created in the file system. I haven't tried the <checkpoint_debug> Event Log flag yet, but then I always prefer to rely on direct observation rather than fallible instrumentation.
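
(For anyone who does want the instrumentation: that flag goes in the <log_flags> section of cc_config.xml in the BOINC data directory, along these lines,

    <cc_config>
       <log_flags>
          <checkpoint_debug>1</checkpoint_debug>
       </log_flags>
    </cc_config>

and is picked up with 'Read config files' from the Manager, or a client restart. It should log a line in the Event Log each time a task tells the client it has checkpointed.)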

So, I'm beginning to think that some programmer anticipated this discussion by anything up to two years, and simply forgot to update the ReadMe (or tell anyone else what they'd done).
ID: 2031851
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031852 - Posted: 10 Feb 2020, 19:33:15 UTC - in response to Message 2031851.  

I’m starting to think the same thing based on my observations so far. Maybe petri already fixed this and didn’t mention it.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031852
-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 2031855 - Posted: 10 Feb 2020, 19:53:06 UTC - in response to Message 2031836.  
Last modified: 10 Feb 2020, 19:53:36 UTC

No need to finger-point, etc.

So a monitor attached seems like one common attribute.
TBar, is this behaviour consistent across different drivers? Linux flavors?
I'm running Debian 10.

Do you have a checklist so we can pinpoint this behaviour?
Perhaps running GNOME vs KDE, or anything else?
I don't have all the variables needed, and in the Linux world there are a lot - kernels, etc.
The list is huge. :-/

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 2031855
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031856 - Posted: 10 Feb 2020, 19:55:07 UTC

The last I heard anything about the Checkpoint from Petri was on 25 Apr 2019. The exchange went something like:
Me: BTW, how did removing the Checkpoint work out? We need to get Raistmer to post the new code to svn before it can be recommended to Eric for beta.
Petri: I haven't tested the checkpoint.

Nothing since.
ID: 2031856
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031859 - Posted: 10 Feb 2020, 19:59:54 UTC - in response to Message 2031856.  
Last modified: 10 Feb 2020, 20:10:16 UTC

Well, there is one checkpoint still created - at the very end of the run, too late to be of any use except in the most unlikely of circumstances. If that one could be removed too (AND TESTED!), we could put this side of the conversation to bed - permanently.

And concentrate on the monitors.

Edit - Juan did send me a new test build this afternoon, saying he had taken out the [should we say remaining?] checkpointing. I haven't tried it yet, because I was too busy trying to find the problem I was supposed to be solving. I'll try and test it tomorrow morning.
ID: 2031859
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031861 - Posted: 10 Feb 2020, 20:13:46 UTC - in response to Message 2031859.  


Edit - Juan did send me a new test build this afternoon, saying he had taken out the [should we say remaining?] checkpointing. I haven't tried it yet, because I was too busy trying to find the problem I was supposed to be solving. I'll try and test it tomorrow morning.



Sounds like he sent you the build that I compiled for him. Let us know if it acts any different (i.e., not even creating that very last checkpoint).
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031861