Setting up Linux to crunch CUDA90 and above for Windows users

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031826 - Posted: 10 Feb 2020, 17:12:25 UTC - in response to Message 2031822.  

That's the difference between the Checkpoint problem and the Missed Pulse problem. The Checkpoint problem will give an Immediate Overflow after resuming; the other problem will Miss All Pulses... on occasion. Neither happens all of the time: sometimes the Checkpoint works, sometimes it doesn't. Sometimes you Miss Pulses, sometimes you don't. The only common thing is that when it fails, people want to know why.
Now you're getting me completely confused.

Bear in mind that for the time being I'm concentrating on the checkpoint problem only. Missed pulses can wait for another day.

I've just interrupted four tasks (two at a time, each running by itself on one of the two GPUs in this box). I stopped them via 'sudo systemctl stop boinc-client'. The first pair were not expected to checkpoint (interval set to 300): the second pair should have checkpointed (interval 60: stopped at around 70).

On restart, Manager displayed both progress and time restarting from 0. The tasks, in order, were

8529268332 (overflow, pending)
8529268005 (not overflow, valid)
8529287927 (not overflow, valid)
8529288504 (not overflow, pending)

We'll have to wait to check the first one (average turnround 1.8 days), but the other three rather disprove your assertion. I'm running setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101. The final run times shown on the web task page match what I saw for the final 'run to finish' after the restart: the pre-restart time was lost.


I had a similar experience just now. I set checkpointing to 10 seconds on the nominal app, allowed it to run to ~50% (several minutes on a GTX 1650). Killed BOINC, and restarted it, and the task restarted from the beginning.
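
As a rough sanity check on what these tests should show, the arithmetic can be sketched in Python. This assumes the app writes a checkpoint as soon as each interval elapses, which BOINC does not guarantee (the preference only sets an upper bound on how often), and the function name is made up for illustration. The 70-second stop time and the 300 s and 60 s intervals are the ones from the test quoted above.

```python
def checkpoints_expected(elapsed_s: float, interval_s: float) -> int:
    """Number of whole checkpoint intervals that fit into the elapsed run time.

    Assumption: the app checkpoints at every interval boundary. A real app
    may checkpoint later than this, or not at all.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return int(elapsed_s // interval_s)


if __name__ == "__main__":
    for interval in (300, 60):
        n = checkpoints_expected(70, interval)
        print(f"interval={interval}s, stopped at 70s: {n} checkpoint(s) expected")
```

If a task stopped after an interval boundary still restarts from zero, either no checkpoint was written or it was not used on restart.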
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031826
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031827 - Posted: 10 Feb 2020, 17:22:32 UTC - in response to Message 2031822.  

...Now you're getting me completely confused...
Here you go, read up on the comments from one of your old Buds. He seemed annoyed about the Checkpoint problem back then, and I think you were in a few of his threads at the time: https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1906253#1906253
Do a search on his user ID and "checkpoint"; he'll refresh your memory for you. As far as I know, not much has changed since then.
ID: 2031827
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14667
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031829 - Posted: 10 Feb 2020, 17:32:51 UTC - in response to Message 2031825.  

Thanks. But did you also see my comment about checkpoint files from my CPU project tasks appearing, and updating, precisely on cue?
Yes, I did. All that has to happen for the CPU tasks to write out their state is for the application to make the fsync() call. That must be what is happening in that case.
OK, I've read the article (well, skimmed it), and confirmed that my system has the default values for lazy writes:

    /proc/sys/vm/dirty_expire_centisecs          3000  // 30 seconds
    /proc/sys/vm/dirty_writeback_centisecs        500  //  5 seconds
I suspended a task from BOINC Manager 10 seconds after the checkpoint file should have been written: that was at least five minutes ago. Nothing has appeared in the slot folder yet.

This machine is currently running off a single 512GB M.2 PCIe SSD: I don't think it's got that long a disk write queue!
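
For anyone wanting to repeat the check, here is a minimal Python sketch of the two ideas in this exchange: reading the kernel's lazy-write timers and forcing a file to disk with fsync. The function names are made up for illustration, and nothing here is taken from the special app's source; it only shows the general technique on a Linux system.

```python
import os
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")  # standard location of the Linux VM tunables


def centisecs_to_seconds(value: int) -> float:
    """Convert a kernel value in hundredths of a second to seconds."""
    return value / 100.0


def read_writeback_settings(vm_dir: Path = VM_DIR) -> dict:
    """Return the lazy-write timers in seconds, or None where unreadable."""
    settings = {}
    for name in ("dirty_expire_centisecs", "dirty_writeback_centisecs"):
        try:
            raw = (vm_dir / name).read_text().strip()
            settings[name] = centisecs_to_seconds(int(raw))
        except (OSError, ValueError):
            settings[name] = None  # not Linux, or the file is not readable
    return settings


def write_durably(path: str, data: bytes) -> None:
    """Write data and ask the kernel to push it to the device before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # empty Python's userspace buffer
        os.fsync(f.fileno())  # flush the kernel page cache for this file


if __name__ == "__main__":
    print(read_writeback_settings())
```

With the default values quoted above, the timers come out as 30 and 5 seconds, so a file that is still missing after five minutes was most likely never written in the first place.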
ID: 2031829
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031830 - Posted: 10 Feb 2020, 17:38:11 UTC - in response to Message 2031821.  
Last modified: 10 Feb 2020, 17:51:21 UTC

what's the best most fool-proof way to reproduce this scenario then? the goal posts keep moving it seems. I have a card with GDDR5 memory, and want to try to trigger this to happen. if it's a real issue with the app+GDDR5 and not some other system specific issue, it should be easily reproducible with a well defined procedure. i've left the monitor running, rebooted several times but have been unable to see any missed best pulses so far.
Here's a Post from Yesterday, I'm fairly certain you read it, " I later found my other machines had the same problem once I turned the monitors on and started actually using the machines."
That's fairly Clear that the problem doesn't happen when the monitor is OFF. How you went from that to saying having the monitor Off Is fine for testing is anyone's guess. I'd suggest trying the same configurations the others had when they experienced the problem. Juan was running 4 GDDR5 1070s, the other posted here was running 4 or 5 GDDR5 980s. I've seen the problem when running any number of GDDR5 Multiple GPUs, anywhere from 2 to 14 GPUs. If you are using a Single card, you are out of the known sample. I'd say three would be a good start, and most people have one connected to the monitor, the others disconnected.
ID: 2031830
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14667
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031831 - Posted: 10 Feb 2020, 17:39:31 UTC - in response to Message 2031827.  

As far as I know, not much has changed since then.
And you probably know more than me, because all the development discussion takes place in a forum I don't have access to.

@ GPUUG: Any news on whether the checkpointing problem has been addressed since 10 Dec 2017 (except this week, of course)?

(I'd better go and read that ReadMe file!)
ID: 2031831
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031834 - Posted: 10 Feb 2020, 17:59:45 UTC - in response to Message 2031830.  

what's the best most fool-proof way to reproduce this scenario then? the goal posts keep moving it seems. I have a card with GDDR5 memory, and want to try to trigger this to happen. if it's a real issue with the app+GDDR5 and not some other system specific issue, it should be easily reproducible with a well defined procedure. i've left the monitor running, rebooted several times but have been unable to see any missed best pulses so far.
Here's a Post from Yesterday, I'm fairly certain you read it, " I later found my other machines had the same problem once I turned the monitors on and started actually using the machines."
That's fairly Clear that the problem doesn't happen when the monitor is OFF. How you went from that to saying having the monitor Off Is fine for testing is anyone's guess. I'd suggest trying the same configurations the others had when they experienced the problem. Juan was running 4 GDDR5 1070s, the other posted here was running 4 or 5 GDDR5 980s. I've seen the problem when running any number of GDDR5 Multiple GPUs, anywhere from 2 to 14 GPUs. If you are using a Single card, you are out of the known sample. I'd say three would be a good start, and most people have one connected to the monitor, the others disconnected.


I don’t know why you think I had the monitor Off. I stated I left the monitor running = ON.

I’m unclear as to the monitor being a contributing factor. In one sense you say that the monitor being off prevents it from happening, yet at the same time you say it only affects the cards NOT hooked to the monitor? Can you explain that?

I’ll look into shuffling some cards around, or picking up a cheap GDDR5 card to try to replicate this with Multi-GPU. Not looking hopeful though, both Juan and Keith seem to have been unsuccessful as well.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031834
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031835 - Posted: 10 Feb 2020, 18:06:38 UTC - in response to Message 2031831.  

As far as I know, not much has changed since then.
And you probably know more than me, because all the development discussion takes place in a forum I don't have access to.

@ GPUUG: Any news on whether the checkpointing problem has been addressed since 10 Dec 2017 (except this week, of course)?

(I'd better go and read that ReadMe file!)


I recompiled an app with a modification to the code in an attempt to remove checkpointing. Juan instructed me which line to change (he doesn’t yet know how to compile the special app, so I did that for him).

But I haven’t really seen different behavior between the default and modified apps. I’m not seeing it checkpoint. I would assume that if I set the checkpoint interval to 10 seconds it would checkpoint, but even with the default, unchanged app a task still starts over from the beginning.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031835
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031836 - Posted: 10 Feb 2020, 18:17:23 UTC - in response to Message 2031834.  

This;
Does this host https://setiathome.berkeley.edu/results.php?hostid=8570185 fit in the criteria of the problem?
I do not have any monitors attached to it and can only access it through SSH.
Does a simple reboot in the console make the problem appear? If so i can test this too.
According to Tbar, yes it fits. Do a system reboot and see what happens.
I believe that's YOU telling Vyper that testing without a monitor is fine? It certainly looks that way.

what's the best most fool-proof way to reproduce this scenario then? the goal posts keep moving it seems. I have a card with GDDR5 memory, and want to try to trigger this to happen. if it's a real issue with the app+GDDR5 and not some other system specific issue, it should be easily reproducible with a well defined procedure. i've left the monitor running, rebooted several times but have been unable to see any missed best pulses so far.
Here's a Post from Yesterday, I'm fairly certain you read it, " I later found my other machines had the same problem once I turned the monitors on and started actually using the machines."
That's fairly Clear that the problem doesn't happen when the monitor is OFF. How you went from that to saying having the monitor Off Is fine for testing is anyone's guess. I'd suggest trying the same configurations the others had when they experienced the problem. Juan was running 4 GDDR5 1070s, the other posted here was running 4 or 5 GDDR5 980s. I've seen the problem when running any number of GDDR5 Multiple GPUs, anywhere from 2 to 14 GPUs. If you are using a Single card, you are out of the known sample. I'd say three would be a good start, and most people have one connected to the monitor, the others disconnected.


I don’t know why you think I had the monitor Off. I stated I left the monitor running = ON.

I’m unclear as to the monitor being a contributing factor. In one sense you say that the monitor being off prevents it from happening, yet at the same time you say it only affects the cards NOT hooked to the monitor? Can you explain that?

I’ll look into shuffling some cards around, or picking up a cheap GDDR5 card to try to replicate this with Multi-GPU. Not looking hopeful though, both Juan and Keith seem to have been unsuccessful as well.
ID: 2031836
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14667
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031837 - Posted: 10 Feb 2020, 18:24:45 UTC - in response to Message 2031831.  

(I'd better go and read that ReadMe file!)
Which says, in its entirety,

6) The App may give Incorrect results on a restarted task. One way to avoid restarted tasks is to set the checkpoint higher than the longest task's estimated run-time, and also avoid suspending/resuming a task.
The ReadMe itself has a datestamp of 07 December 2019, and a reference to the CUDA 10.2 app, so I think it's current (again, before this week's changes).
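
To make the ReadMe's workaround concrete, here is a hedged Python sketch that picks an interval longer than the longest run and prints a preferences override containing it. It assumes the checkpoint interval is the `disk_interval` preference inside a `global_preferences` element, as used in BOINC's global_prefs_override.xml; check that against your own client before relying on it. The function names and the safety margin of 2 are illustrative only. The two run times are the ones reported in the stderr logs posted later in this thread.

```python
import xml.etree.ElementTree as ET


def choose_interval(run_times_s, margin=2.0):
    """Pick an interval comfortably longer than the longest observed run."""
    return max(run_times_s) * margin


def build_override(checkpoint_seconds):
    """Return override XML text containing only the checkpoint interval.

    Assumption: root element 'global_preferences' and tag 'disk_interval'
    match what the BOINC client expects in global_prefs_override.xml.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "disk_interval").text = f"{checkpoint_seconds:.6f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Run times in seconds, from the "Normal release of CUDA mutex" log lines.
    interval = choose_interval([275.928, 248.982])
    print(build_override(interval))
```

The sketch only prints the XML. Where the file lives and how the client is told to re-read it depend on the installation, so those steps are left out.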
ID: 2031837
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031838 - Posted: 10 Feb 2020, 18:31:36 UTC - in response to Message 2031836.  
Last modified: 10 Feb 2020, 18:34:00 UTC

Well it’s simple. I was unaware that the monitor’s presence was so important since you claimed it only affected cards not attached to the monitor. oh wells *shrug*

Any comment on why that is? Since it doesn’t affect the cards attached to the monitor, why would it matter if one wasn’t plugged in at all?

This issue is getting more and more fringe the more you give additional nuggets to the exact reproduction.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031838
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031839 - Posted: 10 Feb 2020, 18:36:43 UTC - in response to Message 2031837.  
Last modified: 10 Feb 2020, 18:50:30 UTC

(I'd better go and read that ReadMe file!)
Which says, in its entirety,

6) The App may give Incorrect results on a restarted task. One way to avoid restarted tasks is to set the checkpoint higher than the longest task's estimated run-time, and also avoid suspending/resuming a task.
The ReadMe itself has a datestamp of 07 December 2019, and a reference to the CUDA 10.2 app, so I think it's current (again, before this week's changes).


oh that's another case.

suspend/resume on a task that was running for >5 mins, caused it to restart from the beginning. checkpoint set to 10 seconds. running unmodified code (as far as checkpointing is concerned)

see the results: https://setiathome.berkeley.edu/result.php?resultid=8530898546
No immediate overflow. hmm
<core_client_version>7.16.1</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1650, 3908 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 14 
     pciBusID = 2, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1650 is okay
SETI@home using CUDA accelerated device GeForce GTX 1650
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

---------------------------------------------------------
SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.2 special
-------------------------------------------------------------------------
Modifications done by petri33, Mutex by Oddbjornik. Compiled by Ian (^_^)
-------------------------------------------------------------------------

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.013941
Sigma 97
Sigma > GaussTOffsetStop: 97 > -33
Thread call stack limit is: 1k
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1650, 3908 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 14 
     pciBusID = 2, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1650 is okay
SETI@home using CUDA accelerated device GeForce GTX 1650
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

---------------------------------------------------------
SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.2 special
-------------------------------------------------------------------------
Modifications done by petri33, Mutex by Oddbjornik. Compiled by Ian (^_^)
-------------------------------------------------------------------------

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.013941
Sigma 97
Sigma > GaussTOffsetStop: 97 > -33
Thread call stack limit is: 1k
namedMutex: Previous mutex lock holder died in a bad way.
namedMutex: mutex is now consistent and the lock has been acquired.
Acquired CUDA mutex at 13:35:45,303
Spike: peak=25.86527, time=6.711, d_freq=1420131089.24, chirp=0, fft_len=128k
Spike: peak=26.89025, time=20.13, d_freq=1420122652.43, chirp=0, fft_len=128k
Spike: peak=24.61807, time=6.711, d_freq=1420131089.25, chirp=0.00092426, fft_len=128k
Spike: peak=27.19312, time=20.13, d_freq=1420122652.44, chirp=0.00092426, fft_len=128k
Spike: peak=25.89173, time=6.711, d_freq=1420131089.23, chirp=-0.00092426, fft_len=128k
Spike: peak=24.62306, time=6.711, d_freq=1420131089.23, chirp=-0.0018485, fft_len=128k
Spike: peak=27.94842, time=20.13, d_freq=1420122652.45, chirp=-0.0027728, fft_len=128k
Spike: peak=26.68917, time=20.13, d_freq=1420122652.43, chirp=0.003697, fft_len=128k
Spike: peak=25.5317, time=20.13, d_freq=1420122652.43, chirp=-0.003697, fft_len=128k
Spike: peak=25.04365, time=20.13, d_freq=1420122652.44, chirp=0.0046213, fft_len=128k
Spike: peak=25.17622, time=6.711, d_freq=1420122297.96, chirp=-0.0064698, fft_len=128k
Spike: peak=27.33359, time=20.13, d_freq=1420122652.45, chirp=-0.0064698, fft_len=128k
Spike: peak=24.86638, time=20.13, d_freq=1420122652.43, chirp=0.0073941, fft_len=128k
Spike: peak=26.38151, time=6.711, d_freq=1420122297.95, chirp=-0.0073941, fft_len=128k
Spike: peak=26.73647, time=6.711, d_freq=1420122297.95, chirp=-0.0083183, fft_len=128k
Spike: peak=26.15466, time=6.711, d_freq=1420122297.94, chirp=-0.0092426, fft_len=128k
Spike: peak=24.64791, time=6.711, d_freq=1420122297.93, chirp=-0.010167, fft_len=128k
Spike: peak=25.53647, time=20.13, d_freq=1420122652.45, chirp=-0.010167, fft_len=128k
Spike: peak=24.43285, time=6.711, d_freq=1420131089.24, chirp=-0.011091, fft_len=128k
Pulse: peak=4.565244, time=53.74, period=13.04, d_freq=1420128421.21, score=1.025, chirp=-14.74, fft_len=1024 
Pulse: peak=0.9206985, time=53.71, period=0.8667, d_freq=1420127920.98, score=1.147, chirp=24.411, fft_len=512 
Pulse: peak=6.264369, time=53.7, period=16.99, d_freq=1420123459.99, score=1.07, chirp=25.878, fft_len=256 
Pulse: peak=4.276228, time=53.7, period=10.35, d_freq=1420126806.94, score=1.092, chirp=-38.951, fft_len=256 
Pulse: peak=11.43044, time=53.7, period=34.73, d_freq=1420125784.12, score=1.093, chirp=-69.364, fft_len=256 
Pulse: peak=6.072789, time=53.7, period=13.55, d_freq=1420130468.28, score=1.045, chirp=70.431, fft_len=256 
Pulse: peak=1.214941, time=53.7, period=1.475, d_freq=1420125425.59, score=1.143, chirp=-80.303, fft_len=256 
Pulse: peak=7.049133, time=53.7, period=20.71, d_freq=1420130855.44, score=1.037, chirp=81.903, fft_len=256 
Pulse: peak=7.853018, time=53.74, period=24.61, d_freq=1420130212.63, score=1.008, chirp=-97.11, fft_len=1024 
Normal release of CUDA mutex after 275.928 seconds at 13:40:21,231

Best spike: peak=27.94842, time=20.13, d_freq=1420122652.45, chirp=-0.0027728, fft_len=128k
Best autocorr: peak=17.38795, time=87.24, delay=6.4449, d_freq=1420126627.44, chirp=-3.7331, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.122e+11, d_freq=0,
	score=-12, null_hyp=0, chirp=0, fft_len=0 
Best pulse: peak=0.9206985, time=53.71, period=0.8667, d_freq=1420127920.98, score=1.147, chirp=24.411, fft_len=512 
Best triplet: peak=0, time=-2.122e+11, period=0, d_freq=0, chirp=0, fft_len=0 

Spike count:    19
Autocorr count: 0
Pulse count:    9
Triplet count:  0
Gaussian count: 0

13:40:21 (15874): called boinc_finish(0)

</stderr_txt>
]]>

Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031839
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14667
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031842 - Posted: 10 Feb 2020, 18:53:08 UTC - in response to Message 2031839.  

oh that's another case.

suspend/resume on a task that was running for >5 mins, caused it to restart from the beginning.
Yes, that's to be expected. GPU apps are never kept in memory when suspended (whatever the setting of LAIM). So the application always starts from cold, but whether the task resumes from checkpoint - well, that's what we're discussing here.
ID: 2031842
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031844 - Posted: 10 Feb 2020, 18:54:35 UTC - in response to Message 2031839.  

and this one. paused it at ~85%, and it restarted from the beginning. same app and checkpoint settings as before.

again, no overflow or signs of obvious problem. i'll have to keep an eye to see if they validate.

https://setiathome.berkeley.edu/result.php?resultid=8530898585

<core_client_version>7.16.1</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1650, 3908 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 14 
     pciBusID = 2, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1650 is okay
SETI@home using CUDA accelerated device GeForce GTX 1650
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

---------------------------------------------------------
SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.2 special
-------------------------------------------------------------------------
Modifications done by petri33, Mutex by Oddbjornik. Compiled by Ian (^_^)
-------------------------------------------------------------------------

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.010006
Sigma 72
Sigma > GaussTOffsetStop: 72 > -8
Thread call stack limit is: 1k
Acquired CUDA mutex at 13:34:15,703
Spike: peak=24.24003, time=62.99, d_freq=7768989285.06, chirp=13.136, fft_len=128k
Spike: peak=25.17315, time=62.99, d_freq=7768989285.06, chirp=13.15, fft_len=128k
Spike: peak=25.2738, time=62.99, d_freq=7768989285.05, chirp=13.151, fft_len=128k
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1650, 3908 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 14 
     pciBusID = 2, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1650 is okay
SETI@home using CUDA accelerated device GeForce GTX 1650
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

---------------------------------------------------------
SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.2 special
-------------------------------------------------------------------------
Modifications done by petri33, Mutex by Oddbjornik. Compiled by Ian (^_^)
-------------------------------------------------------------------------

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.010006
Sigma 72
Sigma > GaussTOffsetStop: 72 > -8
Thread call stack limit is: 1k
Acquired CUDA mutex at 13:40:21,231
Spike: peak=24.24003, time=62.99, d_freq=7768989285.06, chirp=13.136, fft_len=128k
Spike: peak=25.17315, time=62.99, d_freq=7768989285.06, chirp=13.15, fft_len=128k
Spike: peak=25.2738, time=62.99, d_freq=7768989285.05, chirp=13.151, fft_len=128k
Pulse: peak=1.465625, time=45.84, period=2.009, d_freq=7768995643.74, score=1.034, chirp=18.717, fft_len=512 
Pulse: peak=5.816671, time=45.84, period=15.09, d_freq=7768992597.84, score=1.015, chirp=21.51, fft_len=512 
Autocorr: peak=17.91966, time=28.63, delay=1.8598, d_freq=7768994010.02, chirp=22.084, fft_len=128k
Pulse: peak=4.197519, time=45.81, period=8.652, d_freq=7768993221.95, score=1.008, chirp=-26.817, fft_len=32 
Pulse: peak=7.609076, time=45.99, period=19.45, d_freq=7768990617.88, score=1.03, chirp=-42.025, fft_len=4k
Pulse: peak=9.224659, time=46.17, period=26.49, d_freq=7768989424.6, score=1.004, chirp=-46.739, fft_len=8k
Pulse: peak=3.718736, time=45.9, period=9.06, d_freq=7768993476.43, score=1.015, chirp=73.853, fft_len=2k
Pulse: peak=6.40499, time=45.99, period=17.45, d_freq=7768991108.61, score=1.038, chirp=-76.734, fft_len=4k
Pulse: peak=6.402134, time=45.99, period=17.45, d_freq=7768991109.83, score=1.037, chirp=-76.769, fft_len=4k
Pulse: peak=1.796793, time=45.82, period=2.734, d_freq=7768991086.68, score=1.045, chirp=82.688, fft_len=128 
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1650, 3908 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 14 
     pciBusID = 2, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1650 is okay
SETI@home using CUDA accelerated device GeForce GTX 1650
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

---------------------------------------------------------
SETI@home v8 enhanced x41p_V0.99b1p3, CUDA 10.2 special
-------------------------------------------------------------------------
Modifications done by petri33, Mutex by Oddbjornik. Compiled by Ian (^_^)
-------------------------------------------------------------------------

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.010006
Sigma 72
Sigma > GaussTOffsetStop: 72 > -8
Thread call stack limit is: 1k
namedMutex: Previous mutex lock holder died in a bad way.
namedMutex: mutex is now consistent and the lock has been acquired.
Acquired CUDA mutex at 13:44:28,979
Spike: peak=24.24003, time=62.99, d_freq=7768989285.06, chirp=13.136, fft_len=128k
Spike: peak=25.17315, time=62.99, d_freq=7768989285.06, chirp=13.15, fft_len=128k
Spike: peak=25.2738, time=62.99, d_freq=7768989285.05, chirp=13.151, fft_len=128k
Pulse: peak=1.465625, time=45.84, period=2.009, d_freq=7768995643.74, score=1.034, chirp=18.717, fft_len=512 
Pulse: peak=5.816671, time=45.84, period=15.09, d_freq=7768992597.84, score=1.015, chirp=21.51, fft_len=512 
Autocorr: peak=17.91966, time=28.63, delay=1.8598, d_freq=7768994010.02, chirp=22.084, fft_len=128k
Pulse: peak=4.197521, time=45.81, period=8.652, d_freq=7768993221.95, score=1.008, chirp=-26.817, fft_len=32 
Pulse: peak=7.609076, time=45.99, period=19.45, d_freq=7768990617.88, score=1.03, chirp=-42.025, fft_len=4k
Pulse: peak=9.224659, time=46.17, period=26.49, d_freq=7768989424.6, score=1.004, chirp=-46.739, fft_len=8k
Pulse: peak=3.718736, time=45.9, period=9.06, d_freq=7768993476.43, score=1.015, chirp=73.853, fft_len=2k
Pulse: peak=6.40499, time=45.99, period=17.45, d_freq=7768991108.61, score=1.038, chirp=-76.734, fft_len=4k
Pulse: peak=6.402134, time=45.99, period=17.45, d_freq=7768991109.83, score=1.037, chirp=-76.769, fft_len=4k
Pulse: peak=1.796793, time=45.82, period=2.734, d_freq=7768991086.68, score=1.045, chirp=82.688, fft_len=128 
Pulse: peak=3.353529, time=45.84, period=6.621, d_freq=7768990246.3, score=1.008, chirp=93.583, fft_len=512 
Normal release of CUDA mutex after 248.982 seconds at 13:48:37,961

Best spike: peak=25.2738, time=62.99, d_freq=7768989285.05, chirp=13.151, fft_len=128k
Best autocorr: peak=17.91966, time=28.63, delay=1.8598, d_freq=7768994010.02, chirp=22.084, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.124e+11, d_freq=0,
	score=-12, null_hyp=0, chirp=0, fft_len=0 
Best pulse: peak=1.796793, time=45.82, period=2.734, d_freq=7768991086.68, score=1.045, chirp=82.688, fft_len=128 
Best triplet: peak=0, time=-2.124e+11, period=0, d_freq=0, chirp=0, fft_len=0 

Spike count:    3
Autocorr count: 1
Pulse count:    10
Triplet count:  0
Gaussian count: 0

13:48:38 (16567): called boinc_finish(0)

</stderr_txt>
]]>
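
A log like the one above can be checked for restarts mechanically. The Python sketch below counts how often the start-up banner appears and collects the mutex acquisition times. It assumes the "setiathome_CUDA: Found" line is printed once per app start, which is how it appears in this log but is not confirmed from the app's source. The function names are made up, and the sample text is a cut-down copy of the log above.

```python
import re

START_MARKER = "setiathome_CUDA: Found"  # appears to be printed at each app start
MUTEX_MARKER = re.compile(r"^Acquired CUDA mutex at (\S+)", re.MULTILINE)


def count_starts(stderr_text: str) -> int:
    """Count how many times the app's start-up banner appears."""
    return sum(1 for line in stderr_text.splitlines()
               if line.startswith(START_MARKER))


def mutex_times(stderr_text: str) -> list:
    """Return the timestamps at which the CUDA mutex was acquired."""
    return MUTEX_MARKER.findall(stderr_text)


# Cut-down sample of the stderr above, keeping only the marker lines.
SAMPLE = """\
setiathome_CUDA: Found 1 CUDA device(s):
Acquired CUDA mutex at 13:34:15,703
setiathome_CUDA: Found 1 CUDA device(s):
Acquired CUDA mutex at 13:40:21,231
setiathome_CUDA: Found 1 CUDA device(s):
Acquired CUDA mutex at 13:44:28,979
Normal release of CUDA mutex after 248.982 seconds at 13:48:37,961
"""

if __name__ == "__main__":
    print(count_starts(SAMPLE), mutex_times(SAMPLE))
```

Three starts with the same first Spike lines repeated after each one is what a restart from the beginning looks like; a resume from a checkpoint would not be expected to report the earliest signals again.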

Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031844
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031845 - Posted: 10 Feb 2020, 18:56:22 UTC - in response to Message 2031842.  
Last modified: 10 Feb 2020, 19:00:45 UTC

oh that's another case.

suspend/resume on a task that was running for >5 mins, caused it to restart from the beginning.
Yes, that's to be expected. GPU apps are never kept in memory when suspended (whatever the setting of LAIM). So the application always starts from cold, but whether the task resumes from checkpoint - well, that's what we're discussing here.


well that's what i'm trying to even get to happen. so far i've been unsuccessful in getting anything to resume from a checkpoint. it always just starts over from the beginning (which if i'm not mistaken, is the behavior we want anyway right?). can you give instruction on how you're getting a task to restart from a checkpoint? do i need to just kill boinc unexpectedly?
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031845
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14667
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031851 - Posted: 10 Feb 2020, 19:30:50 UTC - in response to Message 2031845.  

well that's what i'm trying to even get to happen. so far i've been unsuccessful in getting anything to resume from a checkpoint. it always just starts over from the beginning (which if i'm not mistaken, is the behavior we want anyway right?). can you give instruction on how you're getting a task to restart from a checkpoint? do i need to just kill boinc unexpectedly?
No, I can't - and I can't see any sign of a checkpoint being created in the file system. Haven't tried the <checkpoint_debug> Event Log flag yet, but then I always prefer to rely on direct observation, rather than fallible instrumentation.

So, I'm beginning to think that some programmer anticipated this discussion by anything up to two years, and simply forgot to update the ReadMe (or tell anyone else what they'd done).
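
The direct-observation approach can be scripted. This minimal Python sketch lists the files in a directory that changed after a given time, which is one way to see whether anything resembling a checkpoint file turns up in a slot folder. The slot path in the example is hypothetical and depends on how BOINC was installed, and the function name is made up for illustration.

```python
import os
import time


def files_modified_since(directory: str, since_epoch: float) -> list:
    """Return (name, mtime) for regular files changed at or after since_epoch."""
    changed = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                mtime = entry.stat().st_mtime
                if mtime >= since_epoch:
                    changed.append((entry.name, mtime))
    return sorted(changed, key=lambda item: item[1])


if __name__ == "__main__":
    slot_dir = "/var/lib/boinc-client/slots/0"  # hypothetical path; adjust to your install
    five_minutes_ago = time.time() - 300
    if os.path.isdir(slot_dir):
        for name, mtime in files_modified_since(slot_dir, five_minutes_ago):
            print(time.strftime("%H:%M:%S", time.localtime(mtime)), name)
    else:
        print("slot directory not found:", slot_dir)
```

Running it a little after each expected checkpoint time and seeing no new or updated state file would back up the observation that no checkpoint is being written.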
ID: 2031851
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031852 - Posted: 10 Feb 2020, 19:33:15 UTC - in response to Message 2031851.  

I’m starting to think the same thing based on my observations so far. Maybe petri already fixed this and didn’t mention it.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031852
-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 2031855 - Posted: 10 Feb 2020, 19:53:06 UTC - in response to Message 2031836.  
Last modified: 10 Feb 2020, 19:53:36 UTC

No need to point fingers, etc.

So an attached monitor seems to be one common attribute.
TBar, is this behaviour consistent across different drivers?
Linux flavors?
I'm running Debian 10.

Do you have a checklist so we can pinpoint this behaviour?
Perhaps running GNOME vs KDE, or anything else?
I don't have all the variables needed, and in the Linux world there are a lot.
Kernels etc. etc.
The list is huge. :-/

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 2031855
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031856 - Posted: 10 Feb 2020, 19:55:07 UTC

The last I heard anything about the Checkpoint from Petri was on 25 Apr 2019. The exchange went something like:
Me: BTW, how did removing the Checkpoint work out? We need to get Raistmer to post the new code to svn before it can be recommended to Eric for beta.
Petri: I haven't tested the checkpoint.

Nothing since.
ID: 2031856
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14667
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031859 - Posted: 10 Feb 2020, 19:59:54 UTC - in response to Message 2031856.  
Last modified: 10 Feb 2020, 20:10:16 UTC

Well, there is one checkpoint still created - at the very end of the run, too late to be of any use except in the most unlikely of circumstances. If that one could be removed too (AND TESTED!), we could put this side of the conversation to bed - permanently.

And concentrate on the monitors.

Edit - Juan did send me a new test build this afternoon, saying he had taken out the [should we say remaining?] checkpointing. I haven't tried it yet, because I was too busy trying to find the problem I was supposed to be solving. I'll try and test it tomorrow morning.
ID: 2031859
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031861 - Posted: 10 Feb 2020, 20:13:46 UTC - in response to Message 2031859.  


Edit - Juan did send me a new test build this afternoon, saying he had taken out the [should we say remaining?] checkpointing. I haven't tried it yet, because I was too busy trying to find the problem I was supposed to be solving. I'll try and test it tomorrow morning.



sounds like he sent you the build that I compiled for him. let us know if it acts any different (ie, not even creating that very end checkpoint).
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031861


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.