Some GPU workunits cause driver reset

DayneC
Joined: 12 Dec 14
Posts: 8
Credit: 75,443
RAC: 0
Message 1650602 - Posted: 8 Mar 2015, 12:02:22 UTC

It seems like some, but not all, tasks cause my AMD graphics driver to reset. Most recently these two have been giving me the issue:

http://setiathome.berkeley.edu/result.php?resultid=4014474912
http://setiathome.berkeley.edu/result.php?resultid=4014474916

I aborted the first one but have just suspended gpu processing for now.

I have done 3 passes of memtest on both of my RAM sticks, and tested both motherboard slots with 3 passes each. Furmark ran for 30 minutes with no issue; it did crash on a few other runs, but seemed fine with all other software closed. I have run Prime95 for a few hours without error. I have tried AMD driver versions 14.12 and 14.4, and I'm pretty sure also 14.9 and 13.12.

Sometimes I come back to find that the GPU no longer has any load even though BOINC says the task is running; this usually happens overnight. Other times the driver resets while I am sitting in front of the computer. Several times, when I have tried to restart a task that stopped in this way, the display has gone black for a moment and not recovered, and the machine has stayed unresponsive until a hard reset, although music or video can still be heard playing.

I have read another thread on this forum about disabling a feature called TDR delay (iirc). While this seems to stop the display driver from resetting, the system becomes unresponsive for a few moments every now and then. Worse, it seemingly causes some tasks (Folding@home) to BSOD the computer.

Should I just try more driver versions until I find one that works, or is that unlikely to be the issue? What about the optimised applications (Lunatics, iirc)? Are they likely to crash less? If I use those, do I still get credit in the normal way?

Unrelated to my main issue: if you disable the option "Should SETI@home show your computers on its web site?", will that prevent stats websites such as BOINCStats from keeping track of your info?

My Specs (nothing overclocked):
Intel DH67GD
i5 2500
8gb RAM (2x 4gb 1333 iirc)
HD 6870
Corsair 650TX
2x WD Black HDD (1TB and 500GB)

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1650604 - Posted: 8 Mar 2015, 12:13:18 UTC
Last modified: 8 Mar 2015, 12:13:55 UTC

@Raistmer: on the visible one, is that stderr truncated or normal ? If truncated I spent a good chunk of last week attempting to raise the core issues with Boinc devs... maybe you'll have better luck there.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1650606 - Posted: 8 Mar 2015, 12:17:36 UTC - in response to Message 1650602.  
Last modified: 8 Mar 2015, 12:20:39 UTC

Try deleting the compilations the r1831 app made. That revision only compiled them once; later revisions redo them every time the driver/APP runtime changes.

Suspend GPU usage and navigate to the setiathome project directory (it should be C:\ProgramData\BOINC\projects\setiathome.berkeley.edu),
then delete the compilations, which are named as follows:

MB_clFFTplan_Capeverde_8_r1831.bin
MB_clFFTplan_Capeverde_16_r1831.bin
MB_clFFTplan_Capeverde_32_r1831.bin
MB_clFFTplan_Capeverde_64_r1831.bin
MB_clFFTplan_Capeverde_128_r1831.bin
MB_clFFTplan_Capeverde_256_r1831.bin
MB_clFFTplan_Capeverde_512_r1831.bin

etc all the way up to:

MB_clFFTplan_Capeverde_524288_r1831.bin

MultiBeam_Kernels_r1831.cl_Capeverde.bin_V7
MultiBeam_Kernels_r1831.clHD5_Capeverde.bin_V7

r1831_IntelRCoreTMi72600KCPU340GHz.wisdom

Here Capeverde is replaced by your GPU type, and IntelRCoreTMi72600KCPU340GHz by your CPU type.
Once you've deleted those files, unsuspend GPU usage and the app will regenerate them with the current APP runtime.
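
For anyone who would rather script this than pick the files out by hand, here is a minimal Python sketch of the clean-up described above. It is only an illustration: the glob patterns are inferred from the file names listed in this post, the directory is the default path quoted above, and the function names are made up for this example. It runs as a dry run by default, so it only lists what it would delete. Suspend GPU work (or close BOINC) before deleting anything.

```python
from pathlib import Path

# Assumption: the default project directory quoted in the post above.
PROJECT_DIR = Path(r"C:\ProgramData\BOINC\projects\setiathome.berkeley.edu")

# Assumption: glob patterns inferred from the r1831 file names listed above.
CACHE_PATTERNS = (
    "MB_clFFTplan_*_r1831.bin",              # cached FFT plans, one per FFT size
    "MultiBeam_Kernels_r1831.cl*_*.bin_V7",  # compiled OpenCL kernels
    "r1831_*.wisdom",                        # CPU wisdom file
)


def find_cached_files(project_dir, patterns=CACHE_PATTERNS):
    """Return a sorted list of cached compilation files found in project_dir."""
    project_dir = Path(project_dir)
    found = set()
    for pattern in patterns:
        found.update(p for p in project_dir.glob(pattern) if p.is_file())
    return sorted(found)


def delete_cached_files(project_dir, dry_run=True):
    """List the cached files (dry_run=True) or delete them (dry_run=False).

    Returns the list of paths that were listed or deleted.
    """
    files = find_cached_files(project_dir)
    if not dry_run:
        for path in files:
            path.unlink()
    return files


if __name__ == "__main__":
    # Dry run: print what would be removed without touching anything.
    for path in delete_cached_files(PROJECT_DIR, dry_run=True):
        print(path)
```

Check the printed list first, then call delete_cached_files(PROJECT_DIR, dry_run=False) once it looks right. Other app revisions would need the r1831 part of the patterns changed.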

Claggy

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1650607 - Posted: 8 Mar 2015, 12:22:44 UTC - in response to Message 1650604.  

@Raistmer: on the visible one, is that stderr truncated or normal ? If truncated I spent a good chunk of last week attempting to raise the core issues with Boinc devs... maybe you'll have better luck there.

That's truncated.

Claggy

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1650609 - Posted: 8 Mar 2015, 12:24:10 UTC - in response to Message 1650607.  
Last modified: 8 Mar 2015, 12:27:05 UTC

hmmmm, thought so, *grumble* *grumble* *grumble* LOL (failing the other possible causes in this case, causes of that symptom can crash GPU drivers too...)

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1650611 - Posted: 8 Mar 2015, 12:30:27 UTC - in response to Message 1650609.  

Have a look in the 'No output again (for just one WU)?' thread too!!

Claggy

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1650615 - Posted: 8 Mar 2015, 12:36:56 UTC - in response to Message 1650611.  

*gently bangs head on table* yeah, it's been a long week. I'll probably just keep using the modified boincapi, and try to figure out how to offer a solid variant that's generally usable.

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1650616 - Posted: 8 Mar 2015, 12:38:18 UTC - in response to Message 1650604.  

If truncated I spent a good chunk of last week attempting to raise the core issues with Boinc devs... maybe you'll have better luck there.

I find I have most luck with the BOINC devs if I watch what they're up to, and send off bug reports when I see that their head is in the right bit of code.

At the moment Rom seems to be working all hours getting all the international language translations ported to a new CMS: I can sympathise with him not wanting to break away from that for an extended conversation about arcane (as they probably seem to him) details of the M$ C runtime threading model.

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1650618 - Posted: 8 Mar 2015, 12:52:13 UTC - in response to Message 1650604.  
Last modified: 8 Mar 2015, 12:54:23 UTC

If truncated I spent a good chunk of last week attempting to raise the core issues with Boinc devs... maybe you'll have better luck there.

Part of the problem isn't the BOINC devs but the project itself and its devs: it tends to release apps, then sit on them for their whole life without producing refreshes with fixed libs/APIs and fixes like the blocksum precision maintenance fixes that made it into the source after the 7.01 release.

There is an outstanding suspending bug in all the Windows ATI/AMD MBv7 apps, where the app doesn't suspend during CPU benchmarks (but stay in memory). I reported this problem something like 18 months ago, and it was fixed shortly afterwards,
but new apps never made it to the project(s). Recently I've found out that 'Suspend when non-BOINC CPU usage is above' x should also suspend GPU usage, but it doesn't.
While that doesn't really matter to users with mid or higher range ATI/AMD GPUs, it does to users with low end GPUs, where usage of the GPU causes lag and the GPU must therefore suspend correctly.

Claggy

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1650619 - Posted: 8 Mar 2015, 12:55:01 UTC - in response to Message 1650616.  

I can live with that. I wish it were solely an MS C runtime issue, and they had the time/inclination to harden the core functionality. Until that time I just hope they don't lose customers over it.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1650621 - Posted: 8 Mar 2015, 12:59:50 UTC - in response to Message 1650618.  

Yeah it's tough project-wise I suppose mostly because they've moved onto other work (like possibly drawing up MB8 and GBT stuff, who knows). Well, will just keep toodling away...

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1650666 - Posted: 8 Mar 2015, 15:50:00 UTC - in response to Message 1650602.  

I have two machines with HD6870's. I'm using 14.4 on one and 14.9 on the other. Other than Cat Control Center crashing sometimes when opting with 14.9 I'm not having issues with either. With 14.12 I did see longer than normal CPU times for tasks when I did some testing with it.

It looks like you didn't mention your GPU temps. With the two different cases I have, one of my HD6870s runs around 65-70ºC and the other around 68-74ºC.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu

DayneC
Joined: 12 Dec 14
Posts: 8
Credit: 75,443
RAC: 0
Message 1650713 - Posted: 8 Mar 2015, 17:13:03 UTC - in response to Message 1650606.  

Once I re-enabled the GPU it immediately froze up again and I had to reset. I then disabled the GPU, closed BOINC completely, and deleted the files again. Since then it has been running OK, but I will have to see over a longer period.

My CPU temp peaks at about 75 usually and GPU maybe a bit higher than 70 when it's hot during the day. I'm experimenting with TThrottle as well to adjust temps and CPU/GPU load.

DayneC
Joined: 12 Dec 14
Posts: 8
Credit: 75,443
RAC: 0
Message 1653122 - Posted: 15 Mar 2015, 9:21:04 UTC
Last modified: 15 Mar 2015, 9:21:20 UTC

Since my last post this has happened again twice (I came back to find the WU stalled with no GPU activity). The first time it didn't cause the machine to freeze up when I started it again; the second time I had to restart my machine for other reasons.

Do I just have to delete those files in the project folder whenever this happens and live with it, or is there something else I can try?

Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1653149 - Posted: 15 Mar 2015, 11:38:01 UTC

Delete all binaries first.
Also, I would suggest trying a more recent app.
The easiest way is to run the Lunatics installer.

http://setiathome.berkeley.edu/forum_thread.php?id=71867


With each crime and every kindness we birth our future.

DayneC
Joined: 12 Dec 14
Posts: 8
Credit: 75,443
RAC: 0
Message 1653164 - Posted: 15 Mar 2015, 12:23:37 UTC

You mean I should delete the EXEs from the project directory?

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1653221 - Posted: 15 Mar 2015, 15:58:32 UTC - in response to Message 1653164.  

You mean I should delete the EXEs from the project directory?

Mike is referring to the binaries Claggy listed earlier in the thread: the MB_clFFTplan_*.bin, MultiBeam_Kernels_*.bin_V7 and *.wisdom files in the setiathome project directory (C:\ProgramData\BOINC\projects\setiathome.berkeley.edu). Suspend GPU usage before deleting them.

After deleting those files, run the first installer from here:
http://mikesworldnet.de/download.html
lunatics_win64_v0.43a_setup.exe
Make sure to choose the app MB7_win_x86_SSE_OpenCL_ATi_HD5_r2489.exe
After installing, find the file named mb_cmdline_win_x86_SSE_OpenCL_ATi_HD5.txt and open it in WordPad. Add the line
-sbs 256
to the empty file and save it.
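
If you would rather not edit the file by hand, the same change can be sketched in Python. This is only an illustration under assumptions: it supposes the command-line file sits in the default project directory mentioned earlier in the thread (the post does not say where the installer puts it), and the function name is made up for this example. It adds the option only when no -sbs flag is already there, so an existing value is left alone.

```python
from pathlib import Path

# Assumption: the installer places the file in the default project directory.
PROJECT_DIR = Path(r"C:\ProgramData\BOINC\projects\setiathome.berkeley.edu")
CMDLINE_FILE = PROJECT_DIR / "mb_cmdline_win_x86_SSE_OpenCL_ATi_HD5.txt"


def add_cmdline_option(path, flag="-sbs", value="256"):
    """Add 'flag value' to the command-line file unless the flag is already set.

    Returns True if the file was changed, False if the flag was already present.
    """
    path = Path(path)
    current = path.read_text().strip() if path.exists() else ""
    if flag in current.split():
        return False  # flag already present; keep whatever value is there
    # Keep any existing options and append the new one on the same line.
    path.write_text(f"{current} {flag} {value}".strip())
    return True


if __name__ == "__main__":
    changed = add_cmdline_option(CMDLINE_FILE)
    print("updated" if changed else "already set")
```

Do this while BOINC is closed or GPU work is suspended, then restart so the app reads the new setting.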

See how that works.

DayneC
Joined: 12 Dec 14
Posts: 8
Credit: 75,443
RAC: 0
Message 1654071 - Posted: 18 Mar 2015, 9:42:08 UTC

After installing lunatics according to your instructions, when restarting BOINC and re-enabling GPU it crashed almost instantly. After I restarted the machine I started BOINC again and it ran fine until later when it crashed again sometime during the night.

I did select MB7_win_x86_SSE_OpenCL_ATi_HD5_r2489.exe, I also selected something else though, I think it was for astropulse to run on GPU but I'm not 100% sure. Should I uninstall and reinstall lunatics with only the option you said?

BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1654079 - Posted: 18 Mar 2015, 10:22:26 UTC - in response to Message 1654071.  

... when restarting BOINC and re-enabling GPU it crashed almost instantly.

What crashed? The app, driver, BOINC, Windows ...?

Try the stability of the computer/GPU/driver/PSU with some test programs:
http://setiathome.berkeley.edu/forum_thread.php?id=76878&postid=1650200#1650200
http://setiathome.berkeley.edu/forum_thread.php?id=76878&postid=1651402#1651402
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 

DayneC
Joined: 12 Dec 14
Posts: 8
Credit: 75,443
RAC: 0
Message 1654093 - Posted: 18 Mar 2015, 11:38:37 UTC

Seti@home WU crashed as per my thread title. As per my OP I have done Memtest, Prime95 and Furmark already.


 