Setting up Linux to crunch CUDA90 and above for Windows users
Author | Message |
---|---|
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"Well, there is one checkpoint still created - at the very end of the run, too late to be of any use except in the most unlikely of circumstances. If that one could be removed too (AND TESTED!), we could put this side of the conversation to bed - permanently." The only checkpoint I left was the one at the end of the analyze process. It could be removed too, but I believe that makes no difference, since at that point the crunching has ended and the process of writing the stderr file is starting. BTW, I did not remove the checkpoint; I just force the program to believe the checkpoint delay time has not been reached, so no checkpoint will be generated. With no checkpoint, the app will always restart from zero. What is important is that I made no changes to the Petri analyze code. @Ian Yes, it is the same app I coded and you compiled (thanks again for that). It is the one I'm using now and, from my POV, it is working fine and does what it is expected to do. I asked Richard to conduct a more accurate test for us. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Could you two please have a quick chat amongst yourselves and decide on the best step for tomorrow? I've reached beer o'clock, and I'm stepping out for a while. |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Honestly, I'm just curious if you notice any difference between the two apps as far as checkpointing is concerned, but it sounds like there might not be. Then test whatever Juan wants you to test. I don't really have any interest beyond the curiosity. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"Do you have a checklist so we can pinpoint this behaviour?" Not really. The problem has existed since around Sept '18, basically with the release of 0.97. The older version, zi3v, doesn't have this problem. I think the others and I were running Ubuntu, any version since around 16.04, with multiple GPUs and a single monitor attached. I'd say the more GPUs, the more likely you are to experience the event. It seems to be common on the 14 GPU machine. About the only sequence needed is to reboot the computer and then start BOINC without using the GPUs to do anything else in between. As in, don't run tasks in the benchmark App before running BOINC, as just running the GPUs will cause the problem to not happen. That's why I run BOINC for about 15 seconds, then restart BOINC before it can complete a task. After restarting BOINC the problem doesn't exist; the 15 seconds cancels the problem... in Linux. It's a different story on the Mac. Oh, the problem Does Exist when running the Special App in the Benchmark App, for the first task anyway. The second task doesn't have the problem. Same scenario: reboot the machine, then run the Special App. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"Could you two please have a quick chat amongst yourselves and decide on the best step for tomorrow? I've reached beer o'clock, and I'm stepping out for a while." Beer o'clock... I'm in! I know it's Monday, but it's very hot here. Back to the topic. I only wish to test whether the change eliminates the need to adjust the checkpoint timing on the hosts. That is what I understand was requested. About the missing pulse problem due to the monitor, a totally different topic: what I have learned over the years is that Linux has some "timing problem" when you kill the crunching process. I was unable to discover the cause or replicate it. It's a random event from my POV. Something is sometimes left behind. That problem is more likely to appear when you have a lot of GPUs on the host. Just a long shot: maybe when you switch the monitor, the time the GPU needs to register that makes it miss some critical procedure, and that is what triggers the event on some hosts and not on others. Could that be the cause of the missing pulse problem? I have no idea. |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Now you're getting me completely confused. . . OK. My 2 cents' worth. The checkpoint issue results in false overflows, but like TBar says, it is not every time. I would say it may have been as high as 1/2 or 1/3 or as low as 1/4 or 1/5, but it was quite regular. I had the minimum checkpoint interval set to a fairly low value; it was a long time ago, but it may have been 60 secs. But it did NOT require a cold boot or even terminating the BOINC client/manager; those are part of the missing pulse issue TBar was talking about. It would happen even when suspending/resuming a task. I regularly suspend tasks when tweaking or such, and when resuming, a high percentage of checkpointed tasks would immediately overflow. I now have the checkpoint interval set to 300 secs, which is 30% to 50% over expected run times, and do not have the problem. . . Back then I was running the Cuda90 version, so I do not know if it still applies to anything higher. All my GPUs had and have GDDR5 RAM. Stephen . . |
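For anyone who would rather set the 300-second checkpoint interval from a terminal than from the BOINC Manager preferences, here is a minimal sketch. It assumes the interval maps to the `disk_interval` preference in `global_prefs_override.xml`; the function name `make_override` is hypothetical, and the data directory in the comments is an assumption that depends on how BOINC was installed. By default it only prints the XML, it does not touch any files.

```shell
#!/usr/bin/env bash
# Sketch: emit a BOINC preferences override that sets the checkpoint interval.
# make_override is a hypothetical helper name, not part of BOINC.

make_override() {
    # First argument: checkpoint interval in seconds (defaults to 300).
    local seconds="${1:-300}"
    cat <<EOF
<global_preferences>
   <disk_interval>${seconds}</disk_interval>
</global_preferences>
EOF
}

# Print the override for a 300-second interval.
make_override 300

# To apply it (assumption: the BOINC data directory is /var/lib/boinc-client,
# as with the Ubuntu package; note this overwrites any existing override file):
#   make_override 300 | sudo tee /var/lib/boinc-client/global_prefs_override.xml
#   boinccmd --read_global_prefs_override
```

Printing instead of writing keeps the sketch harmless: check the output first, then pipe it into place once the data directory has been confirmed.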
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"Oh, the problem Does Exist when running the Special App in the Benchmark App, for the first task anyway. The second task doesn't have the problem. Same scenario, reboot the machine, then run the Special App." TBar, I have never noticed the problem on the 1070 when testing the special app after restarting the computer with Rick's benchMT benchmark app. So your criteria are met: no monitor attached to the 1070 or 1070Ti, the computer is rebooted, and the benchmark app is started as soon as the Desktop appears. BOINC is not running and has not even been started. I just start with the selected tasks in the Test WU folder and the app configured to test the special app against the reference CPU app or one of the reference results for the SoG app. I always get 99.98% agreement, or at least a 99.2% Q rating. No missing pulses, or the results would not be strongly similar and the rescmpv5 app would have printed out the missing whatever in the tables. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Wiggo Send message Joined: 24 Jan 00 Posts: 36791 Credit: 261,360,520 RAC: 489 |
This is an example of the only true error I get when the power suddenly goes out on my 3GB 1060s. It only happens if a task is interrupted within about 3 seconds of starting, which thankfully doesn't happen very often, even with the power problems we have here in the hills when storms are about. https://setiathome.berkeley.edu/result.php?resultid=8528470348 `<core_client_version>7.14.2</core_client_version> <![CDATA[ <stderr_txt> SETI@home error -6 Bad workunit header !swi.data_type || !found || !swi.nsamples File: seti_header.cpp Line: 218 12:08:57 (1698): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>blc75_2bit_guppi_58693_09832_HIP98819_0146.5758.0.21.44.28.vlar_0_r202591688_0</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message> ]]>` Both rigs have checkpoints set at 300 seconds and run off SSDs, but then again it's highly likely that different setups may produce different outcomes. Cheers. |
alanb1951 Send message Joined: 25 May 99 Posts: 10 Credit: 6,904,127 RAC: 34 |
Am I missing something about Linux? Does it hold data destined for files in memory, and only commit to disk when the file is closed? In which case, why is the header portion written immediately? In order to avoid misconceptions regarding ext4, it should be noted that that article is from 2009, and ext4 behaviour regarding delayed allocation has long since been changed to address that. There was a recent issue with 4.19 kernels, but according to Phoronix it wasn't actually an ext4 problem. Cheers - Al. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I was simply searching for a simple beginner's guide to how ext4 works. Not that we were actually discussing any data loss. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
It looks as though you are one of the lucky ones. Even the Linux people that get them usually only get them on one card out of a few. I looked through the list of top computers to see if there were any there and didn't see any from the likely machines. Then later I noticed my 9 GPU machine had a Hung task that had ended in an error and was about to report, and another hung task still trying to run, so I stopped it before it could report and looked at it. That's when I noticed an Invalid from Yesterday... that was missing Pulses. Well, I had tried restarting the machine Yesterday to see if it would miss any on restart, but didn't see any. Apparently I had Missed one; it seems the 970 had missed pulses and it was now an Invalid. One GPU out of 9 had displayed the problem; "Oh, the problem Does Exist when running the Special App in the Benchmark App, for the first task anyway. The second task doesn't have the problem. Same scenario, reboot the machine, then run the Special App." https://setiathome.berkeley.edu/result.php?resultid=8526164153 Device 6: GeForce GTX 970 is okay SETI@home using CUDA accelerated device GeForce GTX 970 Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1 setiathome v8 enhanced x41p_V0.98b1, Cuda 10.2 Special Modifications done by petri33, compiled by TBar Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements. Work Unit Info: ............... 
WU true angle range is : 0.009212 Sigma 148 Sigma > GaussTOffsetStop: 148 > -84 Thread call stack limit is: 1k Autocorr: peak=18.1597, time=6.711, delay=5.6933, d_freq=1419863199.87, chirp=-12.127, fft_len=128k Autocorr: peak=18.01943, time=6.711, delay=5.6933, d_freq=1419863199.86, chirp=-12.128, fft_len=128k Triplet: peak=10.08111, time=55.05, period=15.47, d_freq=1419861243.92, chirp=-19.338, fft_len=1024 Triplet: peak=16.39076, time=11.95, period=8.126, d_freq=1419866153.37, chirp=-89.224, fft_len=1024 Best spike: peak=23.97182, time=73.82, d_freq=1419860056.03, chirp=-19.273, fft_len=128k Best autocorr: peak=18.1597, time=6.711, delay=5.6933, d_freq=1419863199.87, chirp=-12.127, fft_len=128k Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.123e+11, d_freq=0, score=-12, null_hyp=0, chirp=0, fft_len=0 Best pulse: peak=0, time=-2.123e+11, period=0, d_freq=0, score=0, chirp=0, fft_len=0 Best triplet: peak=16.39076, time=11.95, period=8.126, d_freq=1419866153.37, chirp=-89.224, fft_len=1024 Spike count: 0 Autocorr count: 2 Pulse count: 0 Triplet count: 2 Gaussian count: 0 18:07:36 (3125): called boinc_finish(0) Seems it didn't have any trouble after that. The Hung task from tonight was on a 1060, and the GPU showed as being OK in NVIDIA Settings; for some reason it had Hung in BOINC though. A reboot fixed that, and I turned the Error into a Ghost so it wouldn't be reported; I'll retrieve the ghost when I get around to it. The 970 is the Only Maxwell GPU out of the 9 GPUs, the others are all Pascals; I don't have a clue if that means anything. I didn't notice any Missed pulses from tonight's reboot either. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Hit a snag testing the latest version. I was running a 430 driver (pre-CUDA 10.2). On my first Linux machine, I used the NVIDIA installer, but that was horrible - it needed to be installed from a text mode terminal, no desktop or X-server, logged in as root. Don't want to go there again. This one I installed with Synaptic Package Manager. Tried to upgrade today, but the highest available is 435. Even that was foul (kept rebooting at 640x480 resolution, graphical desktop. Ugh.) I've got it back under control and crunching, but wasted the planned testing time. Can anyone recommend a clean installer for a CUDA 10.2 driver, for Linux Mint 19.1 based on Ubuntu 18.04? |
Buckeye4LF Send message Joined: 19 Jun 00 Posts: 173 Credit: 54,916,209 RAC: 833 |
For the most recent Nvidia drivers I installed from https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa/+index using the commands `sudo add-apt-repository ppa:graphics-drivers/ppa` and `sudo apt-get update`. I am running Linux Mint 19.2 with dual Nvidia RTX 2070 Super cards and have had no issues for the past two days. I installed driver 440.59 and it works great. Be warned though, these are not official drivers. |
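The PPA steps above can be sketched as one small script. The two repository commands come from the post; the final install step and the package name `nvidia-driver-440` are assumptions (the post only says driver 440.59 was installed), so check what the PPA actually offers for your release first. The helper names `run` and `install_ppa_driver` are hypothetical. The script defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: add the graphics-drivers PPA and install a driver from it.
# DRIVER_PKG is an assumption, not named in the thread.
DRIVER_PKG="${DRIVER_PKG:-nvidia-driver-440}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

install_ppa_driver() {
    run sudo add-apt-repository ppa:graphics-drivers/ppa
    run sudo apt-get update
    run sudo apt-get install "$DRIVER_PKG"
}

install_ppa_driver
```

Running it with `DRY_RUN=0` executes the commands for real; a reboot is normally needed afterwards before the new driver is in use.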
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Thanks. My criteria are: CUDA, OpenCL, and a Full HD desktop. I'm not fussed about the provider, so long as they're competent enough to deliver those three properly! Definitely one for the shortlist. |
Buckeye4LF Send message Joined: 19 Jun 00 Posts: 173 Credit: 54,916,209 RAC: 833 |
I installed OpenCL before I updated the driver, so you may have to do that as a separate command, as I am unsure if it is included. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
"I installed OpenCL before I updated the driver, so you may have to do that as a separate command, as I am unsure if it is included." Can you remember where you got OpenCL from? Less important, I can come back to that later - it's CUDA I was planning to test today. |
J. Mileski Send message Joined: 9 Jun 02 Posts: 632 Credit: 172,116,532 RAC: 572 |
"I installed OpenCL before I updated the driver, so you may have to do that as a separate command, as I am unsure if it is included." "Can you remember where you got OpenCL from? Less important, I can come back to that later - it's CUDA I was planning to test today." `sudo apt-get install ocl-icd-libopencl1` |
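After installing the driver and the OpenCL loader, a quick check that the inspection tools are present can save a round of guessing. This is only a sketch: `check_tool` is a hypothetical helper, `nvidia-smi` normally ships with the NVIDIA driver, and `clinfo` is assumed to be installed separately (for example with `sudo apt-get install clinfo`).

```shell
#!/usr/bin/env bash
# Sketch: report whether the tools used to inspect CUDA/OpenCL are on PATH.
# check_tool is a hypothetical helper name.

check_tool() {
    # Prints "<name>: found" or "<name>: missing"; returns non-zero if missing.
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# nvidia-smi shows the installed driver version; clinfo lists OpenCL platforms.
for tool in nvidia-smi clinfo; do
    check_tool "$tool" || true
done
```

If both are found, running `nvidia-smi` and `clinfo` by hand shows whether the driver and an OpenCL platform are actually visible.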
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Ta. I'll go through that later. Turned out I had used the repository version of the same drivers you pointed me to - but 440 was missing, because they haven't been declared 'stable' yet, or whatever the technical term is. Added the PPA, and got access to the latest. Same adventure with the 640x480 screen, and a few broken packages later, but I got it running again. The object of the test - the 10.2 'no checkpoint' build - turned out to be the same as before: one single lonely checkpoint at 99.99999% progress. Otherwise, crunching looked normal, but we were into the outage by then, so no instant validations. I'm out tomorrow, but I'll check when I get time. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"The object of the test - the 10.2 'no checkpoint' build - turned out to be the same as before: one single lonely checkpoint at 99.99999% progress. Otherwise, crunching looked normal, but we were into the outage by then, so no instant validations. I'm out tomorrow, but I'll check when I get time." That lonely checkpoint comes after the end of the analyze process, just before the stderr file starts to be built. I left it because I imagine no harm could be done at that point. It could be removed too if you think that would help. |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
I think he’s saying that he doesn’t see any difference in behavior. Effectively it’s not checkpointing even before the change. So the checkpoint issue is probably a non-issue anymore. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |