Setting up Linux to crunch CUDA90 and above for Windows users

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1897867 - Posted: 28 Oct 2017, 14:23:57 UTC - in response to Message 1897666.  
Last modified: 28 Oct 2017, 14:30:13 UTC

Thanks TBar. Making some progress. Can run the CPU tasks. The GPU tasks seem to be the factor that causes BOINC to crash. I tried to go back to the CUDA80 app but this is what I am getting in the stdout file.
Thu 26 Oct 2017 03:10:27 PM PDT | SETI@home | task blc04_2bit_guppi_57976_07262_HIP74926_0026.20486.0.21.44.241.vlar_1 resumed by user
Thu 26 Oct 2017 03:10:27 PM PDT | SETI@home | [error] error: can't open file for shmem seg name
Thu 26 Oct 2017 03:10:27 PM PDT | SETI@home | [error] error: can't open file for shmem seg name: 2

I just checked the dependencies on both the CUDA 80 and CUDA 90 static apps and didn't see any irregularities.
Hmmm, never heard of that one before. Try stopping BOINC, then in the Home folder select Show Hidden Files from the View menu. Open the .nv folder, then the ComputeCache folder inside it, and delete everything in ComputeCache. Then start BOINC and see if that helps.
Hi TBar, thanks for the help. Didn't know about that hidden directory. Will put that one in the memory bank. The problem was definitely coming from the GPU tasks, and since that hidden directory seems to be about Nvidia, I suspect that is where the problem lay. I would assume that ComputeCache has something to do with what each GPU is working on? Time for some Googling, I guess, to see what that one is about.

It appears you are back to running the Alpha CUDA app again. The problem you had isn't a BOINC problem; Petri had the same error here, Can't get shared memory segment name. You might want to familiarize yourself with what you may be seeing by running apps that are fresh off the compiler, so you don't blame the wrong software: Error tasks for computer 7475713. BOINC probably isn't capable of causing the machine to crash; the raw software running on the devices is. You are running raw software on a machine that has the GPU memory clocked much higher than nVidia has deemed safe for compute work, so you can expect further episodes. At least the zi3v code has been tested by many machines at this point; you might consider going back to the tested software.
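For anyone who would rather script that cache cleanup than click through the file manager, here is a minimal sketch in Python. It only automates the manual steps described above (stop BOINC first); the ~/.nv/ComputeCache path is the one named in this thread, so adjust it if your driver keeps the cache elsewhere.

import shutil
from pathlib import Path

# nVidia's hidden compute (JIT) cache, as described above.
cache = Path.home() / ".nv" / "ComputeCache"

if cache.is_dir():
    # Delete the contents but keep the directory itself, matching
    # the "delete all items from the folder" advice.
    for entry in cache.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    print(f"Cleared {cache}")
else:
    print(f"{cache} not found; nothing to do")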
ID: 1897867
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897905 - Posted: 28 Oct 2017, 17:58:30 UTC - in response to Message 1897867.  

OK, I am really confused. Isn't the zi3v app the CUDA80 app? I wanted to try the newer CUDA90 apps. I thought the x41p_zi3xs2 app was the latest in that series. It is definitely faster than zi3v, and it also doesn't produce as many inconclusives. I've just had that one episode where BOINC went titsup. I've only had one other episode, back when I was using the zi3v app, when for some reason it failed to load the CUDA libraries before starting a task and errored out.

Can you tell me the name of the latest stable CUDA90 app release?
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897905
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897917 - Posted: 28 Oct 2017, 18:45:54 UTC

You are running raw software on a machine that has the GPU memory clocked much higher than nVidia has deemed safe for compute work, so you can expect further episodes. At least the zi3v code has been tested by many machines at this point; you might consider going back to the tested software.

FYI, when I was troubleshooting the problem, I wasn't overclocking the cards at all, in either core clock or memory. I was in the default P2 state that the drivers always run the cards in under compute loads.
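For anyone wanting to verify this on their own rig, a quick way to see which P-state and clocks the driver is actually holding is to query nvidia-smi. The sketch below just wraps the standard query fields in Python, assuming nvidia-smi is installed and on your PATH.

import subprocess

# Ask the driver which power state (P0/P2/...) and clocks each GPU
# is currently running at; these are standard nvidia-smi query fields.
result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,pstate,clocks.sm,clocks.mem",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)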
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897917
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1897927 - Posted: 28 Oct 2017, 19:01:51 UTC - in response to Message 1897905.  
Last modified: 28 Oct 2017, 19:04:30 UTC

There aren't any stable CUDA90 releases yet. Most people don't receive a steady stream of invalids with zi3v; those that have overclocked their machines will probably have problems with any version. I've gone weeks without an invalid when running zi3v. Your last episode wasn't caused by BOINC; it was most likely caused by just the right memory error writing bad data to a cache. Aren't you running your GPU memory at a level nVidia has determined will cause memory errors? It is a known fact that running the memory higher than what nVidia has validated will cause errors, and some errors are worse than others. You shouldn't be surprised when you get the error you have to know is inevitable. Please don't accuse innocent software when your overclocked machine running Alpha GPU software crashes.

When I can run cuda 9 on my machines without a constant stream of invalids I will post a copy in the usual location.

FYI, when I was troubleshooting the problem, I wasn't overclocking the cards at all, in either core clock or memory. I was in the default P2 state that the drivers always run the cards in under compute loads.
But you were OCing when the problem first developed? Once the bad data has been written it doesn't matter; you have to remove it before the app will work at all, at any clock.
ID: 1897927
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897940 - Posted: 28 Oct 2017, 20:35:47 UTC - in response to Message 1897927.  
Last modified: 28 Oct 2017, 20:37:10 UTC

No, I am running the memory at stock speed, no overclocks at all. I have received invalids only when my wingmen are running the SoG app, and I believe that is caused by the known overflow spike ordering difference between the CUDA app and the SoG app. I got more ordering invalids when I was running the zi3v app than I do now with the xs2 app.
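To make the ordering point concrete, here is a toy illustration in Python. It is not the project validator's real logic, just a picture of why two apps that find the same signals in a noisy overflow task, but hit the overflow cutoff in a different order, can come back inconclusive under an order-sensitive comparison.

# Both apps detect the same spikes, reported in a different order.
cuda_report = [("spike", 101.4), ("spike", 99.2), ("spike", 100.7)]
sog_report  = [("spike", 99.2), ("spike", 100.7), ("spike", 101.4)]

print(cuda_report == sog_report)                  # False: order differs
print(sorted(cuda_report) == sorted(sog_report))  # True: same detections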
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897940
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1897943 - Posted: 28 Oct 2017, 20:50:35 UTC - in response to Message 1897940.  
Last modified: 28 Oct 2017, 20:53:19 UTC

No, I am running the memory at stock speed, no overclocks at all. I have received invalids only when my wingmen are running the SoG app, and I believe that is caused by the known overflow spike ordering difference between the CUDA app and the SoG app. I got more ordering invalids when I was running the zi3v app than I do now with the xs2 app.
It certainly didn't sound like that a short while ago:
https://setiathome.berkeley.edu/forum_thread.php?id=81271&postid=1894058
Looked at your systems. You seem to be running Pascal cards. Petri has explained that there is no way to get the Pascal cards to run in anything but P2 state when the drivers detect a compute workload. I am happy with my Maxwell cards and the 375.66 drivers to be able to move them and hold them at P0 state for compute workloads. I don't want that to change.
The posts above that one are loaded with scripts to OC the memory. The other alternative is that there is something in the CUDA 9 app that caused both you and Petri to receive the same memory error. That memory error came from the GPU app, not BOINC; Petri is even using a different version of BOINC.

My three machines received fewer Invalids with zi3v, and none of them are OCed. Even the Mac is getting more Invalids with zi3x. That makes 4 machines.
ID: 1897943
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897946 - Posted: 28 Oct 2017, 20:59:12 UTC - in response to Message 1897943.  

I modified my scripts to remove the overclocks a while ago, not long after that post. I was just trying to show what the scripts look like, and told the poster that they would have to modify the scripts for their own purpose and equipment.

I have not done any forensic analysis into the differences between the zi3v app and the 3xs2 app. All I've done is watch my list of inconclusives and invalids drop after switching to the 3xs2 app. I do not count any of the errored tasks as caused by the xs2 app, since those were caused by my aborting tasks while troubleshooting the initial BOINC crash problem.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897946
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1897951 - Posted: 28 Oct 2017, 21:09:01 UTC - in response to Message 1897946.  

You do understand the stderr.txt is written by the Science App and not BOINC, right?
That's how you get a stderr.txt even when running in standalone mode. Your memory error was written to stderr.txt.
ID: 1897951
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897954 - Posted: 28 Oct 2017, 21:14:13 UTC

TBar, did you read through my linked post about the shared memory segment problem? That appears to be a client problem that showed up back around 6.2.14 and was supposedly fixed in 6.2.18, according to the changelog:
Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

Not really, as it was fixed in 6.2.18. See its change log, which says:

I was able to verify the BOINCTray.exe issue and the shared-mem and handle leaks. I'm not sure how any of us could test the client crash scenario; I ran through the basic battery of tests against BOINC Alpha. I guess we'll just have to let the people who discovered it let us know if the problem is fixed.

- client: don't leak handles to shared-mem files

- client: don't leak process handles when abort jobs

- client: if an app exits or we kill it, always destroy the shmem segment.


It seems to be caused by the client starting two tasks at the same time. And this was not just SETI; it was happening with other projects too. So it was obviously not a project app problem at that time.
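As a picture of that failure mode (this is not BOINC's actual code, and the segment name is made up), here is a minimal Python sketch: if two tasks try to create a shared-memory segment under the same name at the same moment, the second attempt fails, much like the "can't open file for shmem seg name" error in the log above.

from multiprocessing import shared_memory

seg_name = "boinc_slot_0"  # hypothetical segment name, for illustration

first = shared_memory.SharedMemory(name=seg_name, create=True, size=1024)
try:
    # A second "task" racing to create the same segment name fails.
    shared_memory.SharedMemory(name=seg_name, create=True, size=1024)
except FileExistsError as err:
    print(f"second create failed: {err}")
finally:
    first.close()
    first.unlink()  # remove the segment so reruns start clean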
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897954
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897958 - Posted: 28 Oct 2017, 21:17:25 UTC - in response to Message 1897951.  

You do understand the stderr.txt is written by the Science App and not BOINC, right?
That's how you get a stderr.txt even when running in standalone mode. Your memory error was written to stderr.txt.

Yes, I didn't specify that those original error task messages were caused by a science app screwup. But I'm not convinced the underlying error message wasn't a client issue, based on the link I posted.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897958
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1897960 - Posted: 28 Oct 2017, 21:23:57 UTC - in response to Message 1897954.  
Last modified: 28 Oct 2017, 21:34:37 UTC

That was what, 2008? Have there been any recent occurrences other than you and Petri running the same CUDA software on different versions of BOINC? I think I remember a shared memory error a couple of years ago; it happened when running the new stock Linux v8 CPU app. It was definitely the app: I compiled my own CPU app and never saw that error again. That's the current SSSE3 app, and it was faster than the stock app anyway. So you think this ancient BOINC error has suddenly surfaced on two very different versions of BOINC, just as you are both running the same GPU Alpha app? Right... whatever.
ID: 1897960
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897965 - Posted: 28 Oct 2017, 21:55:34 UTC - in response to Message 1897960.  

No, I am simply pointing out that the cause of the problem back then was starting two tasks at the same time. I think it likely that was the cause of my issue, since the app is very fast and I very often see two tasks starting at exactly the same time. That seems to happen a lot because of the time the app spends finishing up after completion hits 100%, before the client starts to upload the result. Simple observation here; I have no tools to prove or disprove the case.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897965
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1897969 - Posted: 28 Oct 2017, 22:16:04 UTC - in response to Message 1897940.  

No, I am running the memory at stock speed, no overclocks at all. I have received invalids only when my wingmen are running the SoG app, and I believe that is caused by the known overflow spike ordering difference between the CUDA app and the SoG app. I got more ordering invalids when I was running the zi3v app than I do now with the xs2 app.


. . Hi Keith,

. . I don't find that big a problem with 3v. I am running it in P0 on the 970s, but inconclusives are only occurring in about 8% of tasks (about 120 a day out of the 1500 tasks processed daily by this rig), and over the last week I have only had about 5 invalids, all noise bombs. 5 resends out of 8,000 or 9,000 tasks is lower than some people are getting on their rigs running other apps. I have also had 2 compute errors, but they were both the result of the restart error problem when I stopped BOINC to change a setting. So while the resends caused by the inconclusives are higher than is desirable, a certain number are inevitable and caused by the wingmen; they cannot all be pinned on the special sauce. Petri cannot fix the problem without knowing how it is running, so keep at it :)

Stephen

:)
ID: 1897969
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897971 - Posted: 28 Oct 2017, 22:32:16 UTC - in response to Message 1897969.  
Last modified: 28 Oct 2017, 22:37:12 UTC

Hi Stephen, I am just looking at the current page running tallies a couple of times a day and have seen the inconclusives drop from around 175-180 on the 3v app to around 135-140 on the xs2 app. It looks like I have less than a 2% inconclusive rate on the xs2 app. BoincTasks says I am doing 1774 GPU tasks a day, and I have 22 inconclusives for the day so far with only an hour or so to go till day's end (22 of 1774 is about 1.2%). Not all of those are from overflows either. A lot are from Darwin wingmen, which I have lousy luck getting paired up with.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897971
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1897977 - Posted: 28 Oct 2017, 23:01:16 UTC - in response to Message 1897971.  
Last modified: 28 Oct 2017, 23:30:46 UTC

Hi Stephen, I am just looking at the current page running tallies a couple of times a day and have seen the inconclusives drop from around 175-180 on the 3v app to around 135-140 on the xs2 app. It looks like I have less than a 2% inconclusive rate on the xs2 app. BoincTasks says I am doing 1774 GPU tasks a day, and I have 22 inconclusives for the day so far with only an hour or so to go till day's end (22 of 1774 is about 1.2%). Not all of those are from overflows either. A lot are from Darwin wingmen, which I have lousy luck getting paired up with.


. . Hi Keith, we are both talking about the SETI account "Computers" page, right? But that does raise a point: that number of inconclusives (which is what I was using) is not a daily total but may have accrued over several days, remaining there until resolved and cleared. I feel even better now. I am not running BoincTasks, so I do not have anything counting the inconclusive tasks or errors on a daily basis. I did toy with it at one point, but while I liked the stats and history sections, I was too comfortable with BOINC Manager.

Stephen

[edit] D'oh! I had a closer look at the inconclusive tally and it goes back weeks; damn those slow wingmen :). OK, here's the thing. I counted them on a daily-returns basis and it's almost nothing: from maybe half a dozen up to a dozen or so per day. Not worth getting concerned about in the grand scheme of things. Something around, or even less than, 1%.

. . I have to say the 3v is working pretty well for me.

..
ID: 1897977
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897983 - Posted: 28 Oct 2017, 23:43:14 UTC - in response to Message 1897977.  

Yes, I was pretty sure you were talking about the running total of inconclusives, as I didn't see anywhere near 120 for today for you. I just count the total reported for the calendar day. It is pretty low, as you've noticed. Since I run 24/7, the BoincTasks daily total is pretty accurate, with outage Tuesdays the only outlier in the week.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897983
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1897986 - Posted: 28 Oct 2017, 23:49:30 UTC - in response to Message 1897983.  

Yes, I was pretty sure you were talking about the running total of inconclusives, as I didn't see anywhere near 120 for today for you. I just count the total reported for the calendar day. It is pretty low, as you've noticed. Since I run 24/7, the BoincTasks daily total is pretty accurate, with outage Tuesdays the only outlier in the week.


. . Maybe I will go back and have another crack at it sometime ... BoincTasks, that is.

Stephen

:)
ID: 1897986
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897994 - Posted: 29 Oct 2017, 0:10:08 UTC - in response to Message 1897986.  

Where did you run into issues with BoincTasks? It's a solid and simple program to use with lots of documentation.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897994
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1897997 - Posted: 29 Oct 2017, 0:21:41 UTC - in response to Message 1897994.  

Where did you run into issues with BoincTasks? It's a solid and simple program to use with lots of documentation.


. . It wasn't that I had any problems; I was simply more comfortable with the Manager. I couldn't quite get BT to do what I expected of it.

Stephen

..
ID: 1897997
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1898002 - Posted: 29 Oct 2017, 1:09:31 UTC - in response to Message 1897997.  

Were you trying to use BoincTasks as a substitute for the Manager? I have never run anything other than the Manager for each client. I just like having BoincTasks on my daily driver so I can see what is going on with each machine without having to get off my chair and visit each client machine in person. So I mostly use it for remote monitoring of the clients from a central location. I am lazy.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1898002