Cuda memory leak and freezes and other issues // lunatics 0.44

Profile MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1802309 - Posted: 13 Jul 2016, 6:01:44 UTC
Last modified: 13 Jul 2016, 6:12:13 UTC

Hello Folks

I have a quick question: is anyone having the same issues as me?

For about nine months now I've had extreme issues with all the SETI apps: constant freezes, NVIDIA driver resets, and almost no performance on GPU tasks.

I've changed every part in my rig since then to rule out faulty hardware, but the errors remain exactly the same =( Burn-in tests show no problems at all.

So my conclusion is that the SETI apps have really big issues, and I'm really in the mood to stop contributing to this project. It's sad, but I don't have the nerves anymore. It's like a brutal kick in the nuts to see this project going down the drain... it's a shame... corporate USA, you're ruining everything... stop waging stupid wars and give science & education the funding they need.

It's a big shame =( to see how the corrupt, sociopathic US corporate government treats the world, and our computations don't even get an official analysis. No NTPCKR, and that stupid billionaire is taking funds away from SETI. From my view the SETI staff is barely able to keep the project alive, but it's not their fault; they are enthusiasts doing the best they can with a small crew and a small budget, like most of us. Okay, that's the gossip part...

Back on topic.

Running SETI since New Year '16 has been frustrating as hell. Almost every time I leave my rig unattended for a day or two, it's 99% sure to have crashed, with no log entries at all. RAM, MB, CPU, PSU and gfx card were changed several times, and I still have the same issues as before, with no change at all in the error description. But a lot of downtime: this year alone about 2 1/2 months. I got my system back up and running a few days ago and it's the same problem as in January, constant crashing.

When a guppi is running on my NV GPU, it almost locks up my system. The lags are that awful, several seconds, mostly about 5-15 secs, so I'm not able to write this text without shutting down SETI. It's really a pain in the "*&&!

Does anyone have the same experiences or issues as I have? I would be glad to hear from you.

Any recommendations? Go back to stock until Lunatics 0.45 is released?

If someone needs the specs of my rig, you will find them in my active hosts ;-)

I see Raistmer is pretty active working on something that might be the right direction, but I'm not sure if it has to do with the issues I'm describing.

But what I conclude is that there is pretty surely a bug in the iGPU SETI v7/8 app that occurs at exit when a task is finished... there it's prone to crash the whole system with a permanent freeze that needs a reboot.

Kind regards
MajorTom
ID: 1802309
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1802311 - Posted: 13 Jul 2016, 6:18:24 UTC - in response to Message 1802309.  
Last modified: 13 Jul 2016, 6:47:46 UTC

hi MajorTom,

Some digging and mysteries to solve there, as the level of issues you're describing is not something I've had reports of relating to the relatively mature Cuda Multibeam applications. [Not sure about other applications]

Many possibilities exist for diagnosis, if you are willing to go the yards.

[Edit:] first easy thing: What's your GTX 970 exact brand/model ?
Your tasks show:
GPU current clockRate = 1303 MHz

which does seem on the high end. Knowing the precise model could tell us if that's 'normal'. A fair few simple checks with other tools can possibly isolate whether there's a stability issue or not (while the inconclusives and other symptoms you describe suggest there is... somewhere).

[Note:] The stock distributed Cuda 5.0 applications are the same binaries as the Lunatics installer ones at present... if reverting to stock changes anything, then there was some damage/corruption along the way. Worth checking.
No difference would imply some kind of OS/driver or hardware issue going on.

[A bit Later:] perhaps describing the 'burn-in' tests you have already done might reveal something.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1802311
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1802316 - Posted: 13 Jul 2016, 6:53:50 UTC

How many instances are you running on CPU and GPU ?


With each crime and every kindness we birth our future.
ID: 1802316
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1802320 - Posted: 13 Jul 2016, 7:18:13 UTC - in response to Message 1802316.  

Have you tried processing work on only the GTX970 with no processing being done on the iGPU; then processing work only on the iGPU & none on the GTX970?
There are plenty of people with plenty of issues, but none of the type you're describing.
Grant
Darwin NT
ID: 1802320
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1802326 - Posted: 13 Jul 2016, 8:03:44 UTC - in response to Message 1802320.  

Any overclocking of the memory or bus speed?
ID: 1802326
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1802331 - Posted: 13 Jul 2016, 8:41:50 UTC - in response to Message 1802326.  

And what other software is running on the system?
AV programmes have caused various issues, and recently there was a hardware monitoring programme that had a memory leak & was bringing systems down.
Grant
Darwin NT
ID: 1802331
Profile MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1802335 - Posted: 13 Jul 2016, 9:08:56 UTC
Last modified: 13 Jul 2016, 9:20:48 UTC

Thx for all the kind replies =)

Hmmm... so my biggest concern might be true and it doesn't have to do with SETI at all, but then it's really difficult to say what might be causing this.

Here are the specs not shown:

It's a GTX 970 from ASUS, the Strix model, which is factory overclocked; that's the reason for the 1304 MHz default clock. The card was already sent in under warranty twice; the last time ASUS replaced it with a completely new card.

No OC at all done on my side, neither CPU nor RAM nor GPUs. The CPU is at 3.9 GHz max default turbo and the RAM at 1866 MHz, but the issue persists if I underclock the RAM to 1800 and 1600; it makes no difference. It's a Z97 chipset whose max default RAM speed is 1866, but in theory it shouldn't be a problem clocking the RAM above that.

Like on the GTX 580 that I used for crunching for a long time, I run these instances: 7 CPU, 3 nv_gpu, 1 iGPU. Because all the issues persist, I changed it 2 days ago to 7 CPU, 2 nv, 1 iGPU. No difference: cuda50 still behaves bloated and the lagging of the whole system is even worse than with 3 instances. Running only 6 CPU instances doesn't change the issue either.

From your responses I'm more clueless than before, but thx. So I know there's really something strange going on that only affects my system. This was not the answer I hoped to hear, but at least it might possibly be something completely different that I have to look for.

My last possible guess: it might be something with the in-house power connection. I tried different wall sockets with no difference, but if there is more evidence that it might have something to do with this, I'll have to try it with an APC PDU / UPS.

The odd thing is, if it really has something to do with the power grid quality, it didn't affect me 3 or 4 years ago; it started being really disturbing and nagging soon after the WOW competition '15.

No other software is running on this system. Since these issues started I've reinstalled the OS three times and kept it barebones since then; only the antivirus software is running in addition, and no monitoring software is running.

One thing that might be worth trying would be to switch off the iGPU crunching and see what happens then, like Grant mentioned. But IMHO if that were it, then why is it only affecting my system? Using the iGPU and dGPU at the same time shouldn't be a problem.

Thx for all the suggestions
ID: 1802335
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1802336 - Posted: 13 Jul 2016, 9:20:33 UTC - in response to Message 1802335.  

I notice you're running Windows 7, and many people have been having issues with Windows Update chewing up lots of CPU time.

I'd suggest Process Explorer to see just what is running & using system resources.
My system has 2 GTX 750Tis & runs 3 WUs at a time on each of them with no system sluggishness using CUDA50.
However I don't have an iGPU, and even if I did I wouldn't use it for crunching- sharing memory, power & heat limits with the CPU results in significantly reduced work output compared to just running the CPU alone.
Grant
Darwin NT
ID: 1802336
Profile MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1802337 - Posted: 13 Jul 2016, 9:30:49 UTC - in response to Message 1802336.  
Last modified: 13 Jul 2016, 9:57:09 UTC

I notice you're running Windows 7, and many people have been having issues with Windows Update chewing up lots of CPU time.

I'd suggest Process Explorer to see just what is running & using system resources.
My system has 2 GTX 750Tis & runs 3 WUs at a time on each of them with no system sluggishness using CUDA50.
However I don't have an iGPU, and even if I did I wouldn't use it for crunching- sharing memory, power & heat limits with the CPU results in significantly reduced work output compared to just running the CPU alone.


Yes, I've noticed that odd behavior with Windows Update on my fresh Win7 install: the service hangs at about 12% CPU load along with svchost. I installed the KB fix but it still occurs from time to time. If I meet someone from MS I'll punch them straight in the face for doing this. IMHO they do it on purpose, to urge people to upgrade to Win10.

But right now I've checked the task manager and there's no hanging svchost. I will check it again when it starts lagging that awfully again. But there has been no public solution that works for this issue for a year or even longer.

It's a shame on MS's part; I'm never ever going to buy an OS from them again.
ID: 1802337
Profile MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1802338 - Posted: 13 Jul 2016, 9:44:29 UTC
Last modified: 13 Jul 2016, 10:11:08 UTC

FurMark burn-in test runs through with no issues, CPU stress test no issues
Win7 memory check, 2 passes extended, no errors
Standalone Memtest86+ extended check, no errors reported after 12 hours
sfc reports no issues in the system

Cooling is not an issue; the CPU runs at 55 °C at full load.

I got the CUDA memory leak with the GTX 580 too, and it crashes unattended the same way as the 970. The nv_driver resets happen on both cards, so I think it has to be something different.

The really odd thing is, when I'm using my rig there's no problem besides the lags. When I leave the system running for a while and the screen goes blank, it doesn't wake up, or it's frozen in the screensaver.
I should mention that I'm using dual displays, but honestly that shouldn't be the source of all this trouble.

All in all it's a really odd and strange thing. I've never had that in all the time I've been dealing with PCs and SETI, and that's already a very long time.

I've been going through this with a friend of mine and he has no clue either. I've rebuilt my whole rig about 4 or 5 times in 9 months, so he asked whether it might have something to do with the SETI apps, which I didn't believe, but at least it would be a possibility, and that's the reason I'm asking what you think ;-)

I've already said as a joke that this system is totally jinxed and I need an exorcist =D It's that kind of annoying.
ID: 1802338
AMDave
Volunteer tester
Joined: 9 Mar 01
Posts: 234
Credit: 11,671,730
RAC: 0
United States
Message 1802368 - Posted: 13 Jul 2016, 14:24:32 UTC - in response to Message 1802335.  
Last modified: 13 Jul 2016, 14:39:01 UTC

Could a contributing factor be that MajorTom is starving the GPU of CPU assistance?  His rig specs include

GenuineIntel
Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz [Family 6 Model 60 Stepping 3]
(8 processors) 

He stated
... I run these instances: 7 CPU, 3 nv_gpu, 1 iGPU ...
and
No other software is running on this system ...

In this instance, would not a better allocation of resources be 4 CPU, 3 GPU, 1 iGPU?  Or 5 CPU, 3 GPU?

Additionally,
... Because all the issues persist, I changed it 2 days ago to 7 CPU, 2 nv, 1 iGPU. No difference: cuda50 still behaves bloated and the lagging of the whole system is even worse than with 3 instances. Running only 6 CPU instances doesn't change the issue either.

In this instance, wouldn’t a more efficient allocation of resources entail 5 CPU, 2 GPU, 1 iGPU?  Or 6 CPU, 2 GPU?
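
The allocations suggested above all follow one rule of thumb: leave one logical CPU free for every GPU task (dGPU or iGPU). A minimal Python sketch of that arithmetic follows; the function name and the `reserve_per_gpu_task` parameter are made up for illustration and are not part of BOINC or any SETI application.

```python
import math


def cpu_task_limit(logical_cpus: int, gpu_tasks: int,
                   reserve_per_gpu_task: float = 1.0) -> int:
    """Return how many CPU tasks to run so GPU tasks keep free CPU threads.

    Assumption: each GPU task (dGPU or iGPU) wants `reserve_per_gpu_task`
    logical CPUs to itself. This is a rule of thumb, not a measured value.
    """
    if logical_cpus < 1 or gpu_tasks < 0 or reserve_per_gpu_task < 0:
        raise ValueError("counts must be non-negative, with at least one CPU")
    reserved = math.ceil(gpu_tasks * reserve_per_gpu_task)
    return max(0, logical_cpus - reserved)


if __name__ == "__main__":
    # i7-4770K: 8 logical processors.
    print(cpu_task_limit(8, 3 + 1))  # 3 dGPU tasks + 1 iGPU task
    print(cpu_task_limit(8, 3))      # 3 dGPU tasks, iGPU idle
    print(cpu_task_limit(8, 2 + 1))  # 2 dGPU tasks + 1 iGPU task
```

With 8 logical processors this gives 4, 5 and 5 CPU tasks for the three cases, which matches the numbers proposed above.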

EDIT:
I am running Win7 Ultimate x64 and had updates set to notify only.  Within the past week, I noticed extreme sluggishness with the system.  I used Process Explorer and discovered a process called “wuauserv” was sucking upwards of 25% of CPU resources (= 2 HTT cores).  This happened to me with WinXP as well.  I would have to kill the process multiple times before it finally remained killed.  I had no other option than to disable all Win updates.
ID: 1802368
Profile MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1802397 - Posted: 13 Jul 2016, 19:03:07 UTC
Last modified: 13 Jul 2016, 20:01:49 UTC

Hi AMDave

I don't think it has something to do with the instances I'm running; the major issue is still the same, I can't leave the rig unattended.

If it were only the bloating and lags then maybe, but I'm not able to get a display signal when the screen goes blank after the screensaver, usually after about 10 hrs of this state.

And the CUDA memory leaks or nv_driver resets occur in a totally random pattern, and when I give the GPUs more CPU cycles, they are left unused and the tasks are still bloated anyway.

On the old mainboard with the GTX 580 I didn't have these issues with the same number of instances; I can't imagine that it's so different with the 970.

But yes, the whole thing started about when I swapped the gfx card & display, RAM and PSU, and since then nothing is normal, only problems.

And when I now put in the old RAM and gfx card, I have the same issues as with the new parts; that's the reason I have no clue what it might be.

When I check the task manager now I have about 10% idle CPU. I think that should suffice, because about 2 years ago it ran SETI fine with 7 CPU, 3 nv_gpu, 1 iGPU.

Of course I can try it with 6 CPU, 2 (3) nv, 1 iGPU and report if it makes any difference. I think I will even go back to stock to check again.
I'm pretty sure it's not the reason for the misbehavior, but I've already tried almost everything, so why not try that too.

The biggest problem is that when I leave the rig alone for more than about 8 hours, it locks up and I can't get display output on either of my displays. If the old display would wake up then the case would be clear, but here I really don't know what it could be.

That's the reason I'm writing here. I've been battling this for so long now and really nothing helps. Before I sent in the new MB, I ran stock apps for a while and the major issue remained the same, so again I can't imagine it has to do with the number of instances I'm running, as long as they are reasonable.
ID: 1802397
Profile JakeTheDog
Joined: 3 Nov 13
Posts: 153
Credit: 2,585,912
RAC: 0
United States
Message 1802398 - Posted: 13 Jul 2016, 19:08:47 UTC
Last modified: 13 Jul 2016, 19:12:38 UTC

I don't know if it's the same problem I have. I don't even know if I'm using lunatics. For me, any GPU task since last winter, especially VLAR tasks, lags my computer. It's not too bad but is noticeable, so I often suspend GPU tasks when I'm actively using the computer.

Maybe every 2 months, I will encounter a GPU task that crashes my Nvidia driver and sometimes crashes the computer. I found 2 solutions and they seem to have worked.

I think the one that made the most difference is from Raistmer's thread http://lunatics.kwsn.info/index.php/topic,1809.0.html. I added
-sbs 256 -period_iterations_num 100
to mb_cmdline-8.12_windows_intel__opencl_nvidia_sah.txt and mb_cmdline-8.12_windows_intel__opencl_nvidia_SoG.txt in my Boinc/Seti folder.
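
For anyone who would rather script that edit than do it by hand, here is a rough Python sketch. It assumes each mb_cmdline file holds a single line of space-separated `-flag value` pairs; the helper name `ensure_options` is made up for this post, and the real file path has to be filled in by you (the demo below only touches a temporary folder).

```python
import tempfile
from pathlib import Path


def ensure_options(cmdline_file: Path, options: dict[str, str]) -> str:
    """Append any missing '-flag value' pairs to a one-line cmdline file.

    Flags already present are left alone, including their current values.
    Returns the line that was written.
    """
    existing = cmdline_file.read_text() if cmdline_file.exists() else ""
    tokens = existing.split()
    for flag, value in options.items():
        if flag not in tokens:
            tokens += [flag, value]
    line = " ".join(tokens)
    cmdline_file.write_text(line + "\n")
    return line


if __name__ == "__main__":
    # Demo in a throwaway folder; point this at the real project folder
    # only after backing up the original files.
    folder = Path(tempfile.mkdtemp())
    names = [
        "mb_cmdline-8.12_windows_intel__opencl_nvidia_sah.txt",
        "mb_cmdline-8.12_windows_intel__opencl_nvidia_SoG.txt",
    ]
    wanted = {"-sbs": "256", "-period_iterations_num": "100"}
    for name in names:
        print(name, "->", ensure_options(folder / name, wanted))
```

Running it twice on the same file should not duplicate the options, since a flag that is already there is skipped.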

The other, I'm not sure if it helped, was changing the TDR in the Windows registry. Mine was already 8, I decided to increase to 10. Instructions are here in the link. https://support.microsoft.com/en-gb/kb/2665946

Probably should reboot computer afterwards.
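
In case it helps, the TDR change can be written as a small .reg file. The Python sketch below only builds the text of such a file; it assumes the timeout is the `TdrDelay` DWORD (in seconds) under the `GraphicsDrivers` key, which is my understanding of the Microsoft article, so check the value name against the article before importing anything. The function name `tdr_delay_reg` is made up for illustration.

```python
def tdr_delay_reg(seconds: int) -> str:
    """Build .reg file text that sets the GPU timeout (TdrDelay) in seconds.

    Assumption: TdrDelay is a REG_DWORD under
    HKLM\\SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers.
    """
    if not 0 <= seconds <= 0xFFFFFFFF:
        raise ValueError("seconds must fit in an unsigned 32-bit DWORD")
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"TdrDelay"=dword:{seconds:08x}',  # .reg DWORDs are 8 hex digits
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # Print the text for a 10-second timeout; save it as a .reg file
    # and import it with administrator rights if you want to apply it.
    print(tdr_delay_reg(10))
```

Back up the registry key first; a wrong value here can make driver hangs take much longer to recover from.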

Edit: I don't think I have lunatics installed.
ID: 1802398
Profile MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1802400 - Posted: 13 Jul 2016, 19:11:31 UTC - in response to Message 1802397.  
Last modified: 13 Jul 2016, 19:23:57 UTC

Running 6 CPU, 2 nv, 1 iGPU now; since that's so easy to change I did it right away.

This as proof that I'm really trying everything that might help get past this annoying thing.

But now task mgr tells me 21% idle ;-)
ID: 1802400
Profile MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1802402 - Posted: 13 Jul 2016, 19:20:40 UTC - in response to Message 1802398.  
Last modified: 13 Jul 2016, 19:45:51 UTC

Hey thanks JakeTheDog =)

I knew it, I can't be the only one with these odd issues =D

Thx for the info, I will check it out ASAP, I think in the next 2 days.

Never mind, I had issues with the stock apps too ;-) That was the reason I sent in almost all my HW under warranty.

I think there is something on a major scale behind it, some coding errors I assume.
ID: 1802402
AMDave
Volunteer tester
Joined: 9 Mar 01
Posts: 234
Credit: 11,671,730
RAC: 0
United States
Message 1802425 - Posted: 13 Jul 2016, 21:13:43 UTC - in response to Message 1802402.  
Last modified: 13 Jul 2016, 21:14:57 UTC

@MajorTom

From what Jake proposed, take a look at this thread, specifically with this post and those that follow.
ID: 1802425
Profile JakeTheDog
Joined: 3 Nov 13
Posts: 153
Credit: 2,585,912
RAC: 0
United States
Message 1802472 - Posted: 14 Jul 2016, 6:08:35 UTC - in response to Message 1802402.  
Last modified: 14 Jul 2016, 6:10:22 UTC

I'm using the stock Boinc with no mods or apps installed.

So what happens to me is the screen goes black for a few seconds. Windows shows an error message, something like "Nvidia driver kernel has stopped responding but recovered." BOINC will have an error message (I forgot what it was) and pause the task. After a while, it will restart the task, but the screen goes black and it all happens again. After a few cycles of this, BOINC will suspend the task and say "postponed" or "GPU not detected." Sometimes my computer crashes during one of these cycles.

The first time it happened to me, I thought maybe my GPU or PSU were broken. First I reinstalled or rolled back the GPU drivers. Then I did all kinds of monitoring and stress testing. Even tested my RAM and system stability. But I didn't find anything wrong. I deleted the problem tasks and just hoped it wouldn't happen again.

After it happened for the 3rd time, I decided to try those 2 solutions. I didn't delete the task. I resumed it and it finished, so I guess it fixed the problem. Hopefully for good.
ID: 1802472
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1802476 - Posted: 14 Jul 2016, 6:39:15 UTC - in response to Message 1802472.  

After it happened for the 3rd time, I decided to try those 2 solutions. I didn't delete the task. I resumed it and it finished, so I guess it fixed the problem. Hopefully for good.


Looking through your particular host tasks, I see multiple different (stock) applications. Which particular application(s) + task types, alone or in combination, trigger the unwanted behaviour could be important.

Could you restate the solutions that worked for you? (for the benefit of the thread, and those of us with short attention spans :) )

It would be best case for development if you could recreate the original dicey situation, and possibly isolate which individual application (or combination) induces the unwanted behaviour.

That would involve building a cache of mixed tasks/applications, then suspending all but one type at a time, giving that type a clean start from boot so as to guarantee other applications didn't pollute the GPU driver state.

I appreciate that would be an arduous process, so would understand if not practical for you (or others) to follow through. Hopefully your description of the fixes that worked (for you) would avoid the need for that, by indicating which application(s) needed the tweaking.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1802476
Profile MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1802531 - Posted: 14 Jul 2016, 15:30:27 UTC
Last modified: 14 Jul 2016, 15:31:58 UTC

A small update: following AMDave's suggestion I've now changed the instances to 6 CPUs, put back 3 nv_gpus, and left the 1 iGPU untouched.

Shortly after changing it I realized that it might be 21% idle on average, but at times it drops to only 7% idle or even less, so it might really be that.

At least there's no more tearing & lagging when guppies are in the house, but there's no VLAR at the moment, so it's hard to say what happens when guppi and VLAR are crunched together on the nv card.

I now have to keep the rig running and evaluate. At the moment the guppies on nv still take a lot of time; maybe I will set it back to 2 nv_gpu instances and check how big the average difference is.

The good thing is, no crash during screen sleep state =) but it's a bit early to say if it's solved or not.

I think I will look into Jake's hint too, just in case...

I will come back in a few days and report. THX guys =)
ID: 1802531
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1802535 - Posted: 14 Jul 2016, 15:57:02 UTC

y'all might like to cross-refer to Archae86's work at Einstein in Memory depletion--graphics driver related. Sample:

Poolmon again shows the Vi12 tag bad behavior of hundreds of 239 byte allocs per second without matching frees, so a steadily climbing paged pool. This is confirmed by a trend graph for Memory|pool paged bytes shown in Resource Monitor.

He's tracked down Vi12 to dxgmms2.sys - seemingly exclusively in the Windows 10 version of DirectX, triggered by a limited number of NVidia driver / NVidia hardware / Einstein software combinations.
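
A "steadily climbing paged pool" is easy to test for once you have a series of readings. The Python sketch below assumes you have already logged the Memory "Pool Paged Bytes" counter at regular intervals with whatever tool you prefer; the function name and both thresholds are arbitrary choices for illustration, not anything taken from Archae86's work.

```python
from typing import Sequence


def is_steadily_climbing(samples: Sequence[int],
                         min_rising_fraction: float = 0.9,
                         min_growth_bytes: int = 1_000_000) -> bool:
    """Flag a series of paged-pool readings that looks like a leak.

    A series counts as climbing when most consecutive readings rise and the
    total growth exceeds a floor. Both thresholds are guesses to be tuned.
    """
    if len(samples) < 2:
        return False
    steps = list(zip(samples, samples[1:]))
    rising = sum(1 for before, after in steps if after > before)
    return (rising / len(steps) >= min_rising_fraction
            and samples[-1] - samples[0] >= min_growth_bytes)


if __name__ == "__main__":
    # Synthetic data: 300 unmatched 239-byte allocations between readings.
    leaking = [100_000_000 + i * 239 * 300 for i in range(60)]
    # Synthetic data: pool size wobbling around a fixed level.
    healthy = [100_000_000, 100_500_000, 100_000_000, 100_400_000]
    print(is_steadily_climbing(leaking))
    print(is_steadily_climbing(healthy))
```

On the synthetic data the first series is flagged and the second is not. Real readings will be noisier, so the rising fraction may need to be lowered.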
ID: 1802535