Cuda memory leak and freezes and other issues // lunatics 0.44

JakeTheDog
Joined: 3 Nov 13
Posts: 153
Credit: 2,585,912
RAC: 0
United States
Message 1802610 - Posted: 14 Jul 2016, 23:54:56 UTC - in response to Message 1802476.  
Last modified: 14 Jul 2016, 23:58:00 UTC

Looking through your particular host tasks, I see multiple different (stock) applications. Which particular application(s) + task types, alone or in combination, trigger the unwanted behaviour could be important.


For certain, 2 tasks in the same incident were SETI@home v8 v8.12 (opencl_nvidia_SoG) windows_intelx86. I wrote a thread about it on June 1, and I think the tasks were https://setiathome.berkeley.edu/result.php?resultid=4960012801 and https://setiathome.berkeley.edu/result.php?resultid=4960012803. I forgot what the app names were for the other 2 or 3 incidents, and my event log doesn't go that far back. My impression is that maybe sah was also involved, but not cuda 50, 42, or 32, though I'm not 100% sure.

Could you restate the solutions that worked for you? (for the benefit of the thread, and those of us with short attention spans :) )


1st solution was from Raistmer's thread for lunatics http://lunatics.kwsn.info/index.php/topic,1809.0.html. I don't know if I did it correctly, I never used BOINC command lines before. I added the text -sbs 256 -period_iterations_num 100 to 2 text files, mb_cmdline-8.12_windows_intel__opencl_nvidia_sah.txt and mb_cmdline-8.12_windows_intel__opencl_nvidia_SoG.txt in the folder "SETI Data / projects / setiathome.berkeley.edu."
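
A minimal sketch of the step described above, in Python, assuming the two file names and the option string quoted in the post. The `project_dir` argument is a placeholder for your own BOINC project folder, and the helper name is made up for illustration; nothing here is part of BOINC or the Lunatics installer.

```python
from pathlib import Path

# Option string quoted in the post above.
CMDLINE = "-sbs 256 -period_iterations_num 100"

# File names quoted in the post above.
CMDLINE_FILES = (
    "mb_cmdline-8.12_windows_intel__opencl_nvidia_sah.txt",
    "mb_cmdline-8.12_windows_intel__opencl_nvidia_SoG.txt",
)


def write_cmdline_files(project_dir, cmdline=CMDLINE, names=CMDLINE_FILES):
    """Write the same option string into each command-line file.

    Returns the list of paths written. Existing contents are overwritten,
    so back the files up first if they already hold other options.
    """
    project_dir = Path(project_dir)
    written = []
    for name in names:
        path = project_dir / name
        path.write_text(cmdline, encoding="ascii")
        written.append(path)
    return written
```

You would call it with the path to your own "projects / setiathome.berkeley.edu" folder, e.g. `write_cmdline_files(r"C:\path\to\projects\setiathome.berkeley.edu")` (hypothetical path). Editing the two files by hand in a text editor does exactly the same thing.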

The other was increasing the TDR delay value in the Windows registry. I don't know what the default value is supposed to be, but mine was already 8. I changed it to 10. Instructions are at this link: https://support.microsoft.com/en-gb/kb/2665946.
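
For anyone who prefers a .reg file over clicking through regedit, here is a rough Python sketch that only generates the file text. It assumes the value in question is `TdrDelay` (a DWORD, in seconds) under the `GraphicsDrivers` key, which is where Microsoft's TDR documentation places it as far as I know; the function name is made up for illustration. Check the key against the linked instructions before importing anything.

```python
# Registry key that holds the TDR settings (assumption: taken from
# Microsoft's TDR documentation, not from the post itself).
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_delay_reg(seconds):
    """Return the text of a .reg file that sets TdrDelay to `seconds`."""
    if not isinstance(seconds, int) or not 0 <= seconds <= 0xFFFFFFFF:
        raise ValueError("TdrDelay must fit in an unsigned 32-bit integer")
    return "\n".join([
        "Windows Registry Editor Version 5.00",
        "",
        f"[{TDR_KEY}]",
        # .reg files store DWORD values as 8 hexadecimal digits.
        f'"TdrDelay"=dword:{seconds:08x}',
        "",
    ])


print(tdr_delay_reg(10))
```

The script does not touch the registry; you would save the printed text to a file and import it yourself, then reboot as described in the post.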

Also I had to reboot the computer after making these changes. Before I rebooted, they would still crash the GPU.

It would be best case for development if you could recreate the original dicey situation, and possibly isolate which individual application (or combination) induces the unwanted behaviour.


I'm kind of reluctant to do all that testing, but maybe I will if a lot of other people start reporting similar issues. Also, I had a lot of SoG tasks that had no problems. Not sure why only a few cause the GPU drivers to crash.
ID: 1802610 · Report as offensive
AMDave
Volunteer tester
Joined: 9 Mar 01
Posts: 234
Credit: 11,671,730
RAC: 0
United States
Message 1802690 - Posted: 15 Jul 2016, 13:42:29 UTC - in response to Message 1802610.  

@Jake

... my event log doesn't go that far back.

To retain more history, see this post.
ID: 1802690 · Report as offensive
MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1803440 - Posted: 19 Jul 2016, 7:13:58 UTC
Last modified: 19 Jul 2016, 7:37:59 UTC

So here's what I've tested and done so far =) but in short, the issues are almost the same as before =(

After changing the instances to 6 CPU, 3 NV and 1 iGPU, I had one freeze directly after logging in after waking up the screen, and one BSOD (code 116, gfx driver).
Some occasional freezes too, usually after screen wake-up.

Did the TDR delay tweak but I still have the freezes; no gfx driver crash or BSOD so far.

Tested some settings in the BIOS: with PCI-E DMI power management enabled, the freezes occur a lot more, so I've disabled it again as it was by default. It was just for testing.

Now I've set the DRAM frequency from 1866 to 1800, turned up the timings from 9-11-11-31 to 10-12-12-32, and set the DRAM voltage from auto (about 1.660 V) to 1.675 V; the default would be 1.65 V.

So far no freezes... uptime is now 12 hrs... still evaluating... it hurts to set CL10, but it can't be helped.
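
To put a number on how much the looser setting costs, here is a quick back-of-the-envelope calculation in Python. It assumes the 1866 and 1800 figures are DDR data rates in MT/s (two transfers per clock), which the post does not state explicitly; the function name is just for illustration.

```python
def cas_latency_ns(cl, data_rate_mts):
    """CAS latency in nanoseconds for DDR memory.

    DDR transfers twice per clock, so the clock runs at data_rate / 2 MHz
    and one clock cycle lasts 2000 / data_rate nanoseconds.
    """
    return cl * 2000.0 / data_rate_mts


before = cas_latency_ns(9, 1866)   # old setting: CL9 at 1866
after = cas_latency_ns(10, 1800)   # new setting: CL10 at 1800

print(f"CL9  @ 1866: {before:.2f} ns")
print(f"CL10 @ 1800: {after:.2f} ns")
print(f"increase:    {100 * (after / before - 1):.1f} %")
```

Under that assumption the CAS latency goes from roughly 9.6 ns to roughly 11.1 ns, about 15% slower, on top of the small bandwidth loss from the lower clock.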

Looks like the mainboard has problems under load with CL9. That really sucks because the RAM is already underclocked, and CL9 should be possible within the specs: the RAM model and vendor are listed in the QVL with these specs, and the chipset should support it.

Another thing I've found out: one of the gfx fans is making a rattling noise... so maybe I have to send in that bloody dang gfx card for the 4th time =( WTF Asus

It's a shame, Asus. Shame on you, Asus. I can't recommend you anymore: poor quality control, poor service, poor documentation, faulty drivers and software.

Mainboard and gfx card are both from Asus and both really are substandard, a bad value; both do not work as promised on the box. IMHO Asus is a fraudulent company too... so beware.... Normally you are lucky, but if not, you are totally screwed over and have only problems with them and their products.

So if I have to send it in again for warranty, it doesn't make a lot of sense to evaluate much more.

I will post an update when I know more ;-) thx folks for helping me
ID: 1803440 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1803442 - Posted: 19 Jul 2016, 7:21:27 UTC - in response to Message 1803440.  

All of my GPUs are Nvidia....

BUT, most of my mobos are Asus, and have performed admirably. Some have been crunching along for many many years now.
I would not hesitate to buy another if I am in the market for one again.

Maybe their GPUs are not up to snuff, but I have stayed with NV from day one on those, and would not try anything else to save a buck.

As far as Asus and my experience with their mobos, top notch.
Of course, I also realize that mine are rather long in the tooth and things may have changed over the years. Or maybe somebody just got a bum one....that does happen.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1803442 · Report as offensive
MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1803444 - Posted: 19 Jul 2016, 7:32:55 UTC - in response to Message 1803442.  
Last modified: 19 Jul 2016, 7:36:55 UTC

Until now I would have said the same ^ ^ never ever had such problems with Asus in the past.

But that Sabertooth Z97 is a POS ;-) freezes in the BIOS, and customer service is a bad joke.. no help from them at all.

And that Strix 970 has been sent in 3 times, and now I still have to consider sending it back a 4th time.

I think it speaks for itself =( I demand satisfaction. Can't really recommend them anymore, or at least not the two Asus products listed above.

And warranty service and customer service from Asus are like drinking Kool-Aid.

The bad thing is, the alternatives aren't better =( or are even worse than Asus.
ID: 1803444 · Report as offensive
MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1803479 - Posted: 19 Jul 2016, 11:10:02 UTC
Last modified: 19 Jul 2016, 11:10:57 UTC

@Jake About the TDR delay: I don't think a value like 10 really makes sense =) because it's a hexadecimal value. I did the same as you and set it to 10, but it's translated to the value 16 anyway, at least in my registry.
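
The 10-versus-16 confusion comes down to which base the typed digits are read in. A tiny Python illustration, assuming the registry editor's DWORD dialog was left on its hexadecimal setting when "10" was typed (the function name is made up for this example):

```python
def dword_from_entry(text, base):
    """Interpret digits typed into a DWORD edit box using the selected base."""
    if base not in (10, 16):
        raise ValueError("base must be 10 (decimal) or 16 (hexadecimal)")
    return int(text, base)


print(dword_from_entry("10", 16))  # typed "10" with hexadecimal selected
print(dword_from_entry("10", 10))  # typed "10" with decimal selected
print(dword_from_entry("a", 16))   # what to type in hex to get ten seconds
```

So to get ten seconds you would either switch the dialog to decimal before typing 10, or type "a" in hexadecimal mode.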

As for the command line, I haven't looked into it for several reasons. Maybe later on.

@Richard Hasselgrove I don't think it has to do with my issues. If I understand it correctly, it's exclusive to Win10 and I'm running Win7, and when I monitor my memory usage it's not growing over time, it's always about the same amount. But thx anyway =) ya never know.

@topic

I've now made another phone call to discuss the return/warranty conditions, and again it was annoying as hell to talk with the dealer.

But I can't live with the noise of the rattling gfx fan =/ and they can't deliver an acceptable solution so far.
ID: 1803479 · Report as offensive
Zalster
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1803529 - Posted: 19 Jul 2016, 21:48:26 UTC - in response to Message 1803444.  

Until now I would have said the same ^ ^ never ever had such problems with Asus in the past.

But that Sabertooth Z97 is a POS ;-) freezes in the BIOS, and customer service is a bad joke.. no help from them at all.

And that Strix 970 has been sent in 3 times, and now I still have to consider sending it back a 4th time.

I think it speaks for itself =( I demand satisfaction. Can't really recommend them anymore, or at least not the two Asus products listed above.

And warranty service and customer service from Asus are like drinking Kool-Aid.

The bad thing is, the alternatives aren't better =( or are even worse than Asus.


Like Mark, I run ASUS as well.

3 of 4 boards work fine, the 4th..nope....Played with it every day for a week, and finally it just decided to stop allowing me to boot.

Replaced with another ASUS board of the same making and it booted right up into Windows, without having to install anything else.

So, yeah, there are quality issues with ASUS.

EVGA boards have a great BIOS, but the position of the extra power supply for the PCIe causes issues, as does the USB 3 connection point.

Gigabyte are ok, but not up to the higher quality of the first 2.

Hope you figure it out.
ID: 1803529 · Report as offensive
MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1803704 - Posted: 20 Jul 2016, 15:47:36 UTC
Last modified: 20 Jul 2016, 16:07:10 UTC

=) yeah .... It's not my intention just to blame Asus; I don't think all the issues I have are only the fault of Asus quality.

I've rebuilt the whole system from scratch 4 times now and replaced almost every part, screw and cable. It can't be just Asus, that would be just too simple.

E.g. I had the CUDA memory leaks with an MSI GTX 580 too, and it behaved almost exactly the same way as the Asus.

It's something really more complex and sophisticated, and I just don't get it '-.-

The RAM clock and latency change from yesterday did not help at all. Had 2 crashes and one freeze in the meantime =/ and no indication why.
ID: 1803704 · Report as offensive
MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1803714 - Posted: 20 Jul 2016, 16:18:32 UTC
Last modified: 20 Jul 2016, 16:29:53 UTC

At least my freshly built watercooling and RGB LED lighting work like a breeze and charm me a lot =D

The watercooling is now about 3 months old and not perfect, but I kinda like it that way ^ ^ I will drain the loop anyway and rebuild some stuff when I get some other stuff sorted out. It's like a lot of things I do... work in progress ^ ^

Here are some pics for general enlightenment =)



http://i.imgur.com/nDs7xfr.jpg /// http://i.imgur.com/LPO2lUg.jpg
ID: 1803714 · Report as offensive
The_Matrix
Volunteer tester
Joined: 17 Nov 03
Posts: 414
Credit: 5,827,850
RAC: 0
Germany
Message 1803900 - Posted: 21 Jul 2016, 11:22:34 UTC

21.07.2016 13:19:59 | SETI@home | task postponed 30.000000 sec:

I already "found" the failure: the GPU is at about 73°C, gimme some water :D
ID: 1803900 · Report as offensive
MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1803954 - Posted: 21 Jul 2016, 19:05:27 UTC - in response to Message 1803900.  
Last modified: 21 Jul 2016, 19:22:34 UTC

21.07.2016 13:19:59 | SETI@home | task postponed 30.000000 sec:


But that's not from my log, I can't find such an entry ;-) and I can't imagine it's temp related.

When I monitor the temps, the GPU is always around 60°C.. gpu-z and hwmon are showing the same value, max peak on a hot day 67°C. And if overheating really is the issue, then why on earth does the fan speed only ramp up to 34%?

If it's really just temperature, that would indicate a flaw in the thermal design of the stock cooling. The GPU should handle the tasks thrown at it without problems as long as it's not OC'd.. so I don't see the issue there.

But yes, ofc the GPU gets water in the future too =D but I'm not gonna spend on a GPU waterblock for that cheapo GTX 970, I'd rather wait till the 1080 Ti hits the market. Anyhow, the GPUs should run fine on stock air cooling; that's the way you buy them, not with a waterblock mounted.
ID: 1803954 · Report as offensive
The_Matrix
Volunteer tester
Joined: 17 Nov 03
Posts: 414
Credit: 5,827,850
RAC: 0
Germany
Message 1803957 - Posted: 21 Jul 2016, 19:10:26 UTC
Last modified: 21 Jul 2016, 19:10:47 UTC

Wrong thread, sorry. I just came around to share MY problems, just to compare. So it's not the temperature on your end, but on mine the GPU is going crazy and jumping between workunits like crazy.

Did you see that issue? I already solved it by flashing an older GPU BIOS. I thought at first I had the right one, but I hadn't.
ID: 1803957 · Report as offensive
MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1803961 - Posted: 21 Jul 2016, 19:17:09 UTC - in response to Message 1803957.  
Last modified: 21 Jul 2016, 19:17:54 UTC

No problem =D

Yeah, temps might be an issue, but I don't think that's it in my case because I have the freezes on cold days too.

And I'm not gonna flash the BIOS on my GPU because it's almost brand new and I'm gonna have to send it back due to the rattling cooling fan anyway.

Got an answer from my dealer today and they tell me they will swap it out. Hope this time it doesn't take another 4 or 5 weeks... =/ like the last 2 times

But cool that you found a solution for your issue =) thumbs up
ID: 1803961 · Report as offensive
The_Matrix
Volunteer tester
Joined: 17 Nov 03
Posts: 414
Credit: 5,827,850
RAC: 0
Germany
Message 1803962 - Posted: 21 Jul 2016, 19:19:19 UTC

poor tailor XD , hope he doesn´t read that. U got lucky.
ID: 1803962 · Report as offensive
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1804169 - Posted: 22 Jul 2016, 14:42:11 UTC
Last modified: 22 Jul 2016, 14:44:08 UTC

Stop using the iGPU and you'll very likely find that your problems will disappear.

[Edit] Or try 2 less CPU tasks.

Cheers.
ID: 1804169 · Report as offensive
MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1808104 - Posted: 9 Aug 2016, 9:45:56 UTC - in response to Message 1804169.  

Stop using the iGPU and you'll very likely find that your problems will disappear.

[Edit] Or try 2 less CPU tasks.

Cheers.


I think in short: YES Sir ^ ^


Took a while.. now I'm responding.. did some testing while using my old GTX 580, and really, when I stop using the iGPU the random freezing has disappeared and there are no more crashes =)

I tried it with fewer CPU tasks and a lower iGPU clock speed, but that didn't help.

Very interesting BTW is that the stock iGPU application behaves the same as the optimized one ;-) random freezes and crashing

Today I finally got my replacement for the faulty Asus GTX 970... right now I'm burning it in... looks fine so far =D no rattling fan ^ ^ at least it is silent again =)
ID: 1808104 · Report as offensive
MajorTom
Joined: 25 Aug 03
Posts: 33
Credit: 78,247,091
RAC: 46
Switzerland
Message 1808621 - Posted: 12 Aug 2016, 0:44:23 UTC
Last modified: 12 Aug 2016, 0:48:07 UTC

Appendix:

I have to say that I had some issues with Asus, but it really looks like the whole freezing & crashing problem wasn't really based on quality issues by Asus ;-) dang iGPU XD

It was only the rattling gfx fan and some other minor things that led me to the point of serious doubts. So I may have exaggerated it =) never mind.

WOW event starts soon =D hope my RAC can climb a bit till then ^ ^

Cheers
ID: 1808621 · Report as offensive