Setting up Linux to crunch CUDA90 and above for Windows users

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1897867 - Posted: 28 Oct 2017, 14:23:57 UTC - in response to Message 1897666.  
Last modified: 28 Oct 2017, 14:30:13 UTC

Thanks TBar. Making some progress. Can run the CPU tasks. The GPU tasks seem to be the factor that causes BOINC to crash. I tried to go back to the CUDA80 app but this is what I am getting in the stdout file.
Thu 26 Oct 2017 03:10:27 PM PDT | SETI@home | task blc04_2bit_guppi_57976_07262_HIP74926_0026.20486.0.21.44.241.vlar_1 resumed by user
Thu 26 Oct 2017 03:10:27 PM PDT | SETI@home | [error] error: can't open file for shmem seg name
Thu 26 Oct 2017 03:10:27 PM PDT | SETI@home | [error] error: can't open file for shmem seg name: 2

I just checked the dependencies on both the CUDA 80 and CUDA 90 static apps and didn't see any irregularities.
Hmmm, never heard of that one before. Try stopping BOINC, then in the Home folder select Show Hidden Files from the View menu. Open the .nv folder, then the ComputeCache folder inside it, and delete everything in ComputeCache. Then start BOINC and see if that helps.
Hi TBar, thanks for the help. Didn't know about that hidden directory. Will put that one in the memory bank. The problem was definitely coming from the GPU tasks, and since that hidden directory seems to be about Nvidia, I suspect that is where the problem lay. I would assume that ComputeCache has something to do with what each GPU is working on? Time for some Googling, I guess, to see what that one is about.

It appears you are back to running the Alpha CUDA app again. The problem you had isn't a BOINC problem; Petri had the same error here, Can't get shared memory segment name. You might want to familiarize yourself with what you may be seeing by running apps that are fresh off the compiler, so you don't blame the wrong software: Error tasks for computer 7475713. BOINC probably isn't capable of causing the machine to crash; the raw software running on the devices is. You are running raw software on a machine that has the GPU memory clocked much higher than nVidia has deemed safe for compute work, so you can expect further episodes. At least the zi3v code has been tested by many machines at this point; you might consider going back to the tested software.
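For anyone who would rather script that cache cleanup than click through the file manager, here is a minimal sketch in Python. It only automates the manual steps described above (stop BOINC first); the ~/.nv/ComputeCache path is the one named in this thread, so adjust it if your driver keeps the cache elsewhere.

import shutil
from pathlib import Path

# nVidia's hidden compute (JIT) cache, as described above.
cache = Path.home() / ".nv" / "ComputeCache"

if cache.is_dir():
    # Delete the contents but keep the directory itself, matching
    # the "delete all items from the folder" advice.
    for entry in cache.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    print(f"Cleared {cache}")
else:
    print(f"{cache} not found; nothing to do")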
ID: 1897867
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897905 - Posted: 28 Oct 2017, 17:58:30 UTC - in response to Message 1897867.  

OK, I am really confused. Isn't the zi3v app the CUDA80 app? I wanted to try the newer CUDA90 apps. I thought the x41p_zi3xs2 app was the latest in that series. It is definitely faster than zi3v, and it also doesn't produce as many inconclusives. I've just had that one episode where BOINC went titsup. I've only had one other episode, back when I was using the zi3v app, when for some reason it failed to load the CUDA libraries before starting a task and errored out.

Can you tell me the name of the latest stable CUDA90 app release?
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897905
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897917 - Posted: 28 Oct 2017, 18:45:54 UTC

You are running raw software on a machine that has the GPU memory clocked much higher than nVidia has deemed safe for compute work, so you can expect further episodes. At least the zi3v code has been tested by many machines at this point; you might consider going back to the tested software.

FYI, when I was troubleshooting the problem, I wasn't overclocking the cards at all, in either core clock or memory. I was in the default P2 state that the drivers always run the cards in under compute loads.
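For anyone wanting to verify this on their own rig, a quick way to see which P-state and clocks the driver is actually holding is to query nvidia-smi. The sketch below just wraps the standard query fields in Python, assuming nvidia-smi is installed and on your PATH.

import subprocess

# Ask the driver which power state (P0/P2/...) and clocks each GPU
# is currently running at; these are standard nvidia-smi query fields.
result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,pstate,clocks.sm,clocks.mem",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)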
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897917
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1897927 - Posted: 28 Oct 2017, 19:01:51 UTC - in response to Message 1897905.  
Last modified: 28 Oct 2017, 19:04:30 UTC

There aren't any stable CUDA90 releases yet. Most people don't receive a steady stream of invalids with zi3v; those that have overclocked their machines will probably have problems with any version. I've gone weeks without an invalid when running zi3v. Your last episode wasn't caused by BOINC; it was most likely caused by just the right memory error writing bad data to a cache. Aren't you running your GPU memory at a level nVidia has determined will cause memory errors? It is a known fact that running the memory higher than what nVidia has validated will cause errors, and some errors are worse than others. You shouldn't be surprised when you get the error you have to know is inevitable. Please don't accuse innocent software when your overclocked machine running Alpha GPU software crashes.

When I can run cuda 9 on my machines without a constant stream of invalids I will post a copy in the usual location.

FYI, when I was troubleshooting the problem, I wasn't overclocking the cards at all, in either core clock or memory. I was in the default P2 state that the drivers always run the cards in under compute loads.
But you were OCing when the problem first developed? Once the bad data has been written it doesn't matter; you have to remove it before the app will work at all, at any clock.
ID: 1897927
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897940 - Posted: 28 Oct 2017, 20:35:47 UTC - in response to Message 1897927.  
Last modified: 28 Oct 2017, 20:37:10 UTC

No, I am running the memory at stock speed, no overclocks at all. I have received invalids only when my wingmen are running the SoG app, and I believe that is caused by the known overflow spike ordering difference between the CUDA app and the SoG app. I got more ordering invalids when I was running the zi3v app than I do now with the xs2 app.
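To make the ordering point concrete, here is a toy illustration in Python. It is not the project validator's real logic, just a picture of why two apps that find the same signals in a noisy overflow task, but hit the overflow cutoff in a different order, can come back inconclusive under an order-sensitive comparison.

# Both apps detect the same spikes, reported in a different order.
cuda_report = [("spike", 101.4), ("spike", 99.2), ("spike", 100.7)]
sog_report  = [("spike", 99.2), ("spike", 100.7), ("spike", 101.4)]

print(cuda_report == sog_report)                  # False: order differs
print(sorted(cuda_report) == sorted(sog_report))  # True: same detections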
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897940
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1897943 - Posted: 28 Oct 2017, 20:50:35 UTC - in response to Message 1897940.  
Last modified: 28 Oct 2017, 20:53:19 UTC

No, I am running the memory at stock speed, no overclocks at all. I have received invalids only when my wingmen are running the SoG app, and I believe that is caused by the known overflow spike ordering difference between the CUDA app and the SoG app. I got more ordering invalids when I was running the zi3v app than I do now with the xs2 app.
It certainly didn't sound like that a short while ago:
https://setiathome.berkeley.edu/forum_thread.php?id=81271&postid=1894058
Looked at your systems. You seem to be running Pascal cards. Petri has explained that there is no way to get the Pascal cards to run in anything but P2 state when the drivers detect a compute workload. I am happy with my Maxwell cards and the 375.66 drivers to be able to move them and hold them at P0 state for compute workloads. I don't want that to change.
The posts above that one are loaded with scripts to OC the memory. The other alternative is that there is something in the CUDA 9 app that caused both you and Petri to receive the same memory error. That memory error came from the GPU app, not BOINC; Petri is even using a different version of BOINC.

My three machines received fewer Invalids with zi3v, and none of them are OCed. Even the Mac is getting more Invalids with zi3x. That makes 4 machines.
ID: 1897943
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897946 - Posted: 28 Oct 2017, 20:59:12 UTC - in response to Message 1897943.  

I modified my scripts to remove the overclocks a while ago, not long after that post. I was just trying to show what the scripts look like, and told the poster that they would have to modify the scripts for their own purpose and equipment.

I have not done any forensic analysis into the differences between the zi3v app and the 3xs2 app. All I've done is watch my list of inconclusives and invalids drop after switching to the 3xs2 app. I do not count any of the errored tasks as caused by the xs2 app, since those were caused by my aborting tasks while troubleshooting the initial BOINC crash problem.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897946
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1897951 - Posted: 28 Oct 2017, 21:09:01 UTC - in response to Message 1897946.  

You do understand the stderr.txt is written by the Science App and not BOINC, right?
That's how you get a stderr.txt even when running in standalone mode. Your memory error was written to stderr.txt.
ID: 1897951
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897954 - Posted: 28 Oct 2017, 21:14:13 UTC

TBar, did you read through my linked post about the shared memory segment problem? That appears to be a client problem that showed up back around 6.2.14 and was supposedly fixed in 6.2.18, according to the changelog:
Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

Not really, as it was fixed in 6.2.18. See its change log, which says:

I was able to verify the BOINCTray.exe issue and the shared-mem and handle leaks. I'm not sure how any of us could test the client crash scenario; I ran through the basic battery of tests against BOINC Alpha. I guess we'll just have to let the people who discovered it let us know if the problem is fixed.

- client: don't leak handles to shared-mem files

- client: don't leak process handles when abort jobs

- client: if an app exits or we kill it, always destroy the shmem segment.


It seems to be caused by the client starting two tasks at the same time. And this was not just SETI; it was happening with other projects too. So it was obviously not a project app problem at that time.
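As a picture of that failure mode (this is not BOINC's actual code, and the segment name is made up), here is a minimal Python sketch: if two tasks try to create a shared-memory segment under the same name at the same moment, the second attempt fails, much like the "can't open file for shmem seg name" error in the log above.

from multiprocessing import shared_memory

seg_name = "boinc_slot_0"  # hypothetical segment name, for illustration

first = shared_memory.SharedMemory(name=seg_name, create=True, size=1024)
try:
    # A second "task" racing to create the same segment name fails.
    shared_memory.SharedMemory(name=seg_name, create=True, size=1024)
except FileExistsError as err:
    print(f"second create failed: {err}")
finally:
    first.close()
    first.unlink()  # remove the segment so reruns start clean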
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897954
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897958 - Posted: 28 Oct 2017, 21:17:25 UTC - in response to Message 1897951.  

You do understand the stderr.txt is written by the Science App and not BOINC, right?
That's how you get a stderr.txt even when running in standalone mode. Your memory error was written to stderr.txt.

Yes, I didn't specify that those original error task messages were caused by a science app screwup. But I'm not convinced the underlying error message wasn't a client issue, based on the link I posted.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897958
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1897960 - Posted: 28 Oct 2017, 21:23:57 UTC - in response to Message 1897954.  
Last modified: 28 Oct 2017, 21:34:37 UTC

That was what, 2008? Have there been any recent occurrences other than you and Petri running the same CUDA software on different versions of BOINC? I think I remember a shared memory error a couple of years ago; it happened when running the new stock Linux v8 CPU app. It was definitely the app: I compiled my own CPU app and never saw that error again. That's the current SSSE3 app, and it was faster than the stock app anyway. So you think this ancient BOINC error has suddenly surfaced on two very different versions of BOINC, just as you are both running the same GPU Alpha app? Right... whatever.
ID: 1897960
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897965 - Posted: 28 Oct 2017, 21:55:34 UTC - in response to Message 1897960.  

No, I am simply pointing out that the cause of the problem back then was starting two tasks at the same time. I think it likely that was the cause of my issue, since the app is very fast and I very often see two tasks starting at exactly the same time. That seems to happen a lot because of the time the app spends finishing up after completion hits 100%, before the client starts to upload the result. Simple observation here; I have no tools to prove or disprove the case.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897965
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1897969 - Posted: 28 Oct 2017, 22:16:04 UTC - in response to Message 1897940.  

No, I am running the memory at stock speed, no overclocks at all. I have received invalids only when my wingmen are running the SoG app, and I believe that is caused by the known overflow spike ordering difference between the CUDA app and the SoG app. I got more ordering invalids when I was running the zi3v app than I do now with the xs2 app.


. . Hi Keith,

. . I don't find that big a problem with 3v. I am running it in P0 on the 970s, but inconclusives are only occurring in about 8% of tasks (about 120 a day out of the 1500 tasks processed daily by this rig), and over the last week I have only had about 5 invalids, all noise bombs. 5 resends out of 8,000 or 9,000 tasks is lower than some people are getting on their rigs running other apps. I have also had 2 compute errors, but they were both the result of the restart error problem when I stopped BOINC to change a setting. So while the resends caused by the inconclusives are higher than is desirable, a certain number are inevitable and caused by the wingmen; they cannot all be pinned on the special sauce. Petri cannot fix the problem without knowing how it is running, so keep at it :)

Stephen

:)
ID: 1897969
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897971 - Posted: 28 Oct 2017, 22:32:16 UTC - in response to Message 1897969.  
Last modified: 28 Oct 2017, 22:37:12 UTC

Hi Stephen, I am just looking at the current page running tallies a couple of times a day and have seen the inconclusives drop from around 175-180 on the 3v app to around 135-140 on the xs2 app. It looks like I have less than a 2% inconclusive rate on the xs2 app. BoincTasks says I am doing 1774 GPU tasks a day, and I have 22 inconclusives for the day so far with only an hour or so to go till day's end (22 of 1774 is about 1.2%). Not all of those are from overflows either. A lot are from Darwin wingmen, which I have lousy luck getting paired up with.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897971
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1897977 - Posted: 28 Oct 2017, 23:01:16 UTC - in response to Message 1897971.  
Last modified: 28 Oct 2017, 23:30:46 UTC

Hi Stephen, I am just looking at the current page running tallies a couple of times a day and have seen the inconclusives drop from around 175-180 on the 3v app to around 135-140 on the xs2 app. It looks like I have less than a 2% inconclusive rate on the xs2 app. BoincTasks says I am doing 1774 GPU tasks a day, and I have 22 inconclusives for the day so far with only an hour or so to go till day's end (22 of 1774 is about 1.2%). Not all of those are from overflows either. A lot are from Darwin wingmen, which I have lousy luck getting paired up with.


. . Hi Keith, we are both talking about the SETI account "Computers" page, right? But that does raise a point: that number of inconclusives (which is what I was using) is not a daily total but may have accrued over several days, remaining there until resolved and cleared. I feel even better now. I am not running BoincTasks, so I do not have anything counting the inconclusive tasks or errors on a daily basis. I did toy with it at one point, but while I liked the stats and history sections, I was too comfortable with BOINC Manager.

Stephen

[edit] D'oh! I had a closer look at the inconclusive tally and it goes back weeks; damn those slow wingmen :). OK, here's the thing. I counted them on a daily-returns basis and it's almost nothing: from maybe half a dozen up to a dozen or so per day. Not worth getting concerned about in the grand scheme of things. Something around, or even less than, 1%.

. . I have to say the 3v is working pretty well for me.

..
ID: 1897977
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897983 - Posted: 28 Oct 2017, 23:43:14 UTC - in response to Message 1897977.  

Yes, I was pretty sure you were talking about the running total of inconclusives, as I didn't see anywhere near 120 for today for you. I just count the total reported for the calendar day. It is pretty low, as you've noticed. Since I run 24/7, the BoincTasks daily total is pretty accurate, with outage Tuesdays the only outlier in the week.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897983
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1897986 - Posted: 28 Oct 2017, 23:49:30 UTC - in response to Message 1897983.  

Yes, I was pretty sure you were talking about the running total of inconclusives, as I didn't see anywhere near 120 for today for you. I just count the total reported for the calendar day. It is pretty low, as you've noticed. Since I run 24/7, the BoincTasks daily total is pretty accurate, with outage Tuesdays the only outlier in the week.


. . Maybe I will go back and have another crack at it sometime ... BoincTasks, that is.

Stephen

:)
ID: 1897986
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1897994 - Posted: 29 Oct 2017, 0:10:08 UTC - in response to Message 1897986.  

Where did you run into issues with BoincTasks? It's a solid and simple program to use with lots of documentation.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1897994
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1897997 - Posted: 29 Oct 2017, 0:21:41 UTC - in response to Message 1897994.  

Where did you run into issues with BoincTasks? It's a solid and simple program to use with lots of documentation.


. . It wasn't that I had any problems; I was simply more comfortable with the Manager. I couldn't quite get BT to do what I expected of it.

Stephen

..
ID: 1897997
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1898002 - Posted: 29 Oct 2017, 1:09:31 UTC - in response to Message 1897997.  

Were you trying to use BoincTasks as a substitute for the Manager? I have never run anything other than the Manager for each client. I just like having BoincTasks on my daily driver so I can see what is going on with each machine without having to get off my chair and visit each client machine in person. So I mostly use it for remote monitoring of the clients from a central location. I am lazy.
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1898002