Blue Screen of Death occurring on unique tasks

Questions and Answers : GPU applications : Blue Screen of Death occurring on unique tasks
Profile Bill Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1998972 - Posted: 21 Jun 2019, 1:12:17 UTC

Forgive me for posting about this from the main Boinc forum, but this may be a unique Seti problem. In an attempt to help someone else with a problem, I coincidentally developed my own problem, starting here.

A few weeks ago I noticed that my computer suffered a blue screen of death. The BSOD was a Video_TDR_Failure related to atikmpag.sys. Long story short, I would get the BSOD every time I started up Boinc and allowed the GPU to run; CPU crunching was just fine. I checked over the hardware, ran DDU, and upgraded/downgraded drivers; nothing worked long term. Technically I got it working again after running DDU, but only for about a day before suffering the same thing. However, I discovered that the BSOD was occurring only when two unique tasks were attempting to crunch: task 7735441575 and 7735333648. I have suspended those tasks, and I have been crunching on the GPU just fine for a few days now.

I could just abort these tasks and move on, but before doing so I was wondering if anyone had any other thoughts. Specifically, is there something anyone can do to see if there is a problem with the task/workunit itself? I'm not sure if there is anything else to attempt to fix on my computer. Considering all other tasks appear to be crunching just fine I don't know that I want to be spinning my wheels over nothing.
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1998972 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14415
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1999000 - Posted: 21 Jun 2019, 7:09:05 UTC - in response to Message 1998972.  

I doubt there's anything really unique about the tasks - one BLC26, one Arecibo. Both tasks have a completed run from a wingmate, waiting for yours to return and validate - and both wingmates used GPUs. Of course, in one sense, every single workunit run by SETI is unique (that's the whole point of what we do), but if you worked out exactly what configuration setting or data point was the problem and changed it, it wouldn't be 'that task' any more, and it would probably fail to validate.

Better to tackle the problem at source - thank you for including the bugcheck value. Here are two Microsoft articles:

Bug Check 0x116: VIDEO_TDR_ERROR (background - for programmers only)
Timeout Detection and Recovery (TDR) Registry Keys (settings you can tweak to work round the problem)

In a discussion from 2016, the developer's response was "I would recommend just to disable that damned watchdog in Windows registry" - i.e., set TdrLevel to 0. I would do that as a temporary workaround to clear those two tasks, and hold it in reserve in case the problem returns. But the problem suggests there is possibly some marginal hardware issue in that machine you could investigate later, if you have access to any diagnostic tools.
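For reference, a .reg file for that change would look something like this (a sketch based on Microsoft's TDR registry-keys documentation; back up your registry first, and reboot after applying):

```
Windows Registry Editor Version 5.00

; TdrLevel = 0 (TdrLevelOff) disables GPU timeout detection entirely.
; When the value is absent, the driver uses the default, 3 (TdrLevelRecover).
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrLevel"=dword:00000000
```

Delete the value (or set it back to 3) once the two stuck tasks have cleared, since running without the watchdog means a genuinely hung GPU will freeze the display instead of recovering.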
ID: 1999000 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5121
Credit: 276,046,078
RAC: 462
Message 1999030 - Posted: 21 Jun 2019, 11:59:24 UTC - in response to Message 1998972.  
Last modified: 21 Jun 2019, 11:59:40 UTC

Forgive me for posting about this from the main Boinc forum, but this may be a unique Seti problem. In an attempt to help someone else with a problem, I coincidentally developed my own problem, starting here.

A few weeks ago I noticed that my computer suffered a blue screen of death. The BSOD was a Video_TDR_Failure related to atikmpag.sys.


I was having the TDR failure on my Ryzen 5 2400G until I stopped using a very aggressive -tt 1500 on the command line. The problem went away as soon as I dropped the parameter.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1999030 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Avatar

Send message
Joined: 6 Nov 99
Posts: 709
Credit: 8,032,827
RAC: 62
France
Message 1999045 - Posted: 21 Jun 2019, 14:34:12 UTC

IMHO -tt 1500 is too low with period iteration set to 1 ^^ I can set it to 1800 for BLC, but I have to set it higher for Arecibo WUs ^^ I'm now at -tt 5000 without problems, on different hardware: a separate GPU, and not a Ryzen :D
ID: 1999045 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5121
Credit: 276,046,078
RAC: 462
Message 1999097 - Posted: 21 Jun 2019, 22:53:52 UTC - in response to Message 1999045.  

IMHO -tt 1500 is too low with period iteration set to 1 ^^ I can set it to 1800 for BLC, but I have to set it higher for Arecibo WUs ^^ I'm now at -tt 5000 without problems, on different hardware: a separate GPU, and not a Ryzen :D


I am confused. I thought that the number after -tt was the length of time that a GPU task could run before there would be a task switch. Since this meant more time crunching for each "time slice" that was started, it was "supposed" to run faster. And it usually does.

So the -tt value I heard you mention is, as far as I know, out of bounds. I thought I read that -tt 1500 was the largest value you could use.

"Supervisor Call" would somebody correct me?

TY.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1999097 · Report as offensive
Profile Bill Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 30 Nov 05
Posts: 282
Credit: 6,916,194
RAC: 60
United States
Message 1999124 - Posted: 22 Jun 2019, 2:00:10 UTC - in response to Message 1999000.  

Okay, I think I have solved the problem, at least partially.
I doubt there's anything really unique about the tasks - one BLC26, one Arecibo. Both tasks have a completed run from a wingmate, waiting for yours to return and validate - and both wingmates used GPUs. Of course, in one sense, every single workunit run by SETI is unique (that's the whole point of what we do), but if you worked out exactly what configuration setting or data point was the problem and changed it, it wouldn't be 'that task' any more, and it would probably fail to validate.

Better to tackle the problem at source - thank you for including the bugcheck value. Here are two Microsoft articles:

Bug Check 0x116: VIDEO_TDR_ERROR (background - for programmers only)
Timeout Detection and Recovery (TDR) Registry Keys (settings you can tweak to work round the problem)

In a discussion from 2016, the developer's response was "I would recommend just to disable that damned watchdog in Windows registry" - i.e., set TdrLevel to 0. I would do that as a temporary workaround to clear those two tasks, and hold it in reserve in case the problem returns. But the problem suggests there is possibly some marginal hardware issue in that machine you could investigate later, if you have access to any diagnostic tools.
Thanks, Richard. I did see the first article before. I'm not sure what to do with that one at this point. I know it involves running the debugging tools for Windows, which I am sure I can do, but I don't think I have the knowledge yet to figure out what to do with the information. I'll shelve that one for the time being.

The second article, and the deep cut on the message board, did provide something interesting. I cannot find TdrLevel at HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers, nor anywhere else in my registry. Is that a sign of something else? I would assume that if the value is not present, the default setting is TdrLevelRecover? I tried adding the TdrLevel value to my registry, and it didn't fix the BSOD. Perhaps I didn't enter it correctly. No matter, because I think I found the actual problem.

I was having the TDR failure on my Ryzen 5 2400G until I stopped using a very aggressive -tt 1500 on the command line. The problem went away as soon as I dropped the parameter.
No, I don't have -tt in my command line, but that reminded me that I had entered command-line text for the first time a few weeks ago. I hadn't thought about modifying the command lines since. I just deleted all command-line options from the ati files, and as I've been typing this, one task crunched to completion and the second is chugging along nicely. So, something in that command line I was using wasn't working right. Perhaps I should go back and research what each of the options means instead of blindly adding them.

I think I have it solved for now. Thanks, Tom, Richard, and Jord for all the help!
Seti@home classic: 1,456 results, 1.613 years CPU time
ID: 1999124 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Avatar

Send message
Joined: 6 Nov 99
Posts: 709
Credit: 8,032,827
RAC: 62
France
Message 1999155 - Posted: 22 Jun 2019, 10:42:00 UTC - in response to Message 1999097.  

IMHO -tt 1500 is too low with period iteration set to 1 ^^ I can set it to 1800 for BLC, but I have to set it higher for Arecibo WUs ^^ I'm now at -tt 5000 without problems, on different hardware: a separate GPU, and not a Ryzen :D


I am confused. I thought that the number after -tt was the length of time that a GPU task could run before there would be a task switch. Since this meant more time crunching for each "time slice" that was started, it was "supposed" to run faster. And it usually does.

So the -tt value I heard you mention is, as far as I know, out of bounds. I thought I read that -tt 1500 was the largest value you could use.

"Supervisor Call" would somebody correct me?

TY.

Tom


It depends on the GPU used; the kernel time depends on the GPU's speed and compute units, and a slower GPU takes more time to complete an FFT sequence ^^ If I set only 1500, not all the FFTs are done in only 3 passes, like here in my results


Wallclock time elapsed since last restart: 2074.6 seconds
Fftlength=32,pass=3:Tune: sum=130559(ms); min=618.9(ms); max=1563(ms); mean=1536(ms); s_mean=1547; sleep=1545(ms); delta=6133; N=85; high_perf
Fftlength=64,pass=3:Tune: sum=135688(ms); min=309.1(ms); max=801.5(ms); mean=793.5(ms); s_mean=795.7; sleep=795(ms); delta=3410; N=171; high_perf
Fftlength=128,pass=3:Tune: sum=141307(ms); min=175(ms); max=419.6(ms); mean=414.4(ms); s_mean=415; sleep=405(ms); delta=2045; N=341; high_perf
Fftlength=256,pass=3:Tune: sum=147845(ms); min=94.98(ms); max=223.6(ms); mean=217.1(ms); s_mean=217.6; sleep=210(ms); delta=1362; N=681; high_perf
Fftlength=512,pass=3:Tune: sum=147914(ms); min=45.59(ms); max=111.4(ms); mean=108.7(ms); s_mean=109; sleep=105(ms); delta=1701; N=1361; high_perf
Fftlength=1024,pass=3:Tune: sum=145822(ms); min=23.09(ms); max=54.71(ms); mean=53.55(ms); s_mean=53.54; sleep=45(ms); delta=2892; N=2723; high_perf
Fftlength=2048,pass=3:Tune: sum=136794(ms); min=18.85(ms); max=25.87(ms); mean=25.12(ms); s_mean=25.15; sleep=15(ms); delta=1; N=5445; usual
Fftlength=4096,pass=3:Tune: sum=135984(ms); min=5.375(ms); max=13.4(ms); mean=12.49(ms); s_mean=12.53; sleep=15(ms); delta=1; N=10891; usual
Fftlength=8192,pass=3:Tune: sum=149674(ms); min=6.819(ms); max=6.91(ms); mean=6.871(ms); s_mean=6.877; sleep=0(ms); delta=1; N=21783; usual


The longest ones are the first lines here, FFT lengths 32, 64 and 128; with my -tt 5000, I can speed these ones up ^^

If I set a lower -tt number, FFT lengths 32, 64 and perhaps 128 take more than 3 passes to do .. and you will find other passes at 4 or 5.

After looking at your results, with the Vega 11 GPU integrated into your Ryzen CPU, we can see that


Fftlength=32,pass=3:Tune: sum=90392.8(ms); min=7.523(ms); max=269.3(ms); mean=140.6(ms); s_mean=86.47; sleep=75(ms); delta=451; N=643; usual
Fftlength=32,pass=4:Tune: sum=64054.1(ms); min=5.9(ms); max=179.9(ms); mean=94.89(ms); s_mean=49.1; sleep=45(ms); delta=396; N=675; usual
Fftlength=32,pass=5:Tune: sum=44288.8(ms); min=4.552(ms); max=118.9(ms); mean=71.55(ms); s_mean=53.96; sleep=45(ms); delta=510; N=619; usual
Fftlength=64,pass=3:Tune: sum=77912.6(ms); min=2.961(ms); max=108.7(ms); mean=62.33(ms); s_mean=55.73; sleep=45(ms); delta=347; N=1250; usual
Fftlength=64,pass=4:Tune: sum=55977.8(ms); min=2.625(ms); max=107.9(ms); mean=60.13(ms); s_mean=58.33; sleep=60(ms); delta=493; N=931; usual
Fftlength=64,pass=5:Tune: sum=42051.9(ms); min=2.218(ms); max=90.52(ms); mean=58.32(ms); s_mean=62.74; sleep=60(ms); delta=547; N=721; usual
Fftlength=128,pass=3:Tune: sum=84120.4(ms); min=1.53(ms); max=97.76(ms); mean=59.37(ms); s_mean=62.68; sleep=60(ms); delta=343; N=1417; usual
Fftlength=128,pass=4:Tune: sum=61280.6(ms); min=1.492(ms); max=89.89(ms); mean=56.64(ms); s_mean=61.06; sleep=60(ms); delta=413; N=1082; usual
Fftlength=128,pass=5:Tune: sum=45823.3(ms); min=1.077(ms); max=100.4(ms); mean=51.03(ms); s_mean= 48; sleep=45(ms); delta=544; N=898; usual
Fftlength=256,pass=3:Tune: sum=95640.1(ms); min=0.7792(ms); max=113.8(ms); mean= 57(ms); s_mean=58.39; sleep=60(ms); delta=341; N=1678; usual
Fftlength=256,pass=4:Tune: sum=75324.2(ms); min=0.6662(ms); max=174.9(ms); mean=53.5(ms); s_mean=54.37; sleep=45(ms); delta=577; N=1408; usual
Fftlength=256,pass=5:Tune: sum=56878.4(ms); min=0.5501(ms); max=119.2(ms); mean=48.82(ms); s_mean=56.46; sleep=45(ms); delta=544; N=1165; usual
Fftlength=512,pass=3:Tune: sum=104155(ms); min=0.407(ms); max=115(ms); mean=55.02(ms); s_mean=59.89; sleep=60(ms); delta=341; N=1893; usual
Fftlength=512,pass=4:Tune: sum=75705.6(ms); min=0.3888(ms); max=91.53(ms); mean=50.2(ms); s_mean=59.21; sleep=60(ms); delta=670; N=1508; usual
Fftlength=512,pass=5:Tune: sum=56499.3(ms); min=0.3099(ms); max=63.38(ms); mean=37.87(ms); s_mean=45.71; sleep=45(ms); delta=1545; N=1492; usual
Fftlength=1024,pass=3:Tune: sum=103989(ms); min=0.2803(ms); max=58.62(ms); mean=37.01(ms); s_mean=43.29; sleep=45(ms); delta=2843; N=2810; usual
Fftlength=1024,pass=4:Tune: sum=78824.6(ms); min=0.2107(ms); max=44.32(ms); mean=28.15(ms); s_mean=31.06; sleep=30(ms); delta=2830; N=2800; usual
Fftlength=1024,pass=5:Tune: sum=60700.2(ms); min=0.2437(ms); max=37.66(ms); mean=21.76(ms); s_mean=24.81; sleep=15(ms); delta=2817; N=2790; usual
Fftlength=2048,pass=3:Tune: sum=235284(ms); min=15.18(ms); max=586.9(ms); mean=43.07(ms); s_mean=50.13; sleep=45(ms); delta=1; N=5463; high_perf
Fftlength=4096,pass=3:Tune: sum=246185(ms); min=8.166(ms); max=257.8(ms); mean=22.53(ms); s_mean=27.13; sleep=30(ms); delta=1; N=10927; high_perf
Fftlength=8192,pass=3:Tune: sum=112132(ms); min=3.781(ms); max=13.58(ms); mean=5.131(ms); s_mean=6.039; sleep=0(ms); delta=1; N=21855; usual


Fftlength=32,pass=3:Tune: sum=90392.8(ms); min=7.523(ms); max=269.3(ms); mean=140.6(ms); s_mean=86.47; sleep=75(ms); delta=451; N=643; usual
Fftlength=32,pass=4:Tune: sum=64054.1(ms); min=5.9(ms); max=179.9(ms); mean=94.89(ms); s_mean=49.1; sleep=45(ms); delta=396; N=675; usual
Fftlength=32,pass=5:Tune: sum=44288.8(ms); min=4.552(ms); max=118.9(ms); mean=71.55(ms); s_mean=53.96; sleep=45(ms); delta=510; N=619; usual


Not all are done in 3 passes: some of them are done in 4 passes and others in 5 ... not optimized ^^

Fftlength=64,pass=3:Tune: sum=77912.6(ms); min=2.961(ms); max=108.7(ms); mean=62.33(ms); s_mean=55.73; sleep=45(ms); delta=347; N=1250; usual
Fftlength=64,pass=4:Tune: sum=55977.8(ms); min=2.625(ms); max=107.9(ms); mean=60.13(ms); s_mean=58.33; sleep=60(ms); delta=493; N=931; usual
Fftlength=64,pass=5:Tune: sum=42051.9(ms); min=2.218(ms); max=90.52(ms); mean=58.32(ms); s_mean=62.74; sleep=60(ms); delta=547; N=721; usual


same behavior here ^^

Fftlength=128,pass=3:Tune: sum=84120.4(ms); min=1.53(ms); max=97.76(ms); mean=59.37(ms); s_mean=62.68; sleep=60(ms); delta=343; N=1417; usual
Fftlength=128,pass=4:Tune: sum=61280.6(ms); min=1.492(ms); max=89.89(ms); mean=56.64(ms); s_mean=61.06; sleep=60(ms); delta=413; N=1082; usual
Fftlength=128,pass=5:Tune: sum=45823.3(ms); min=1.077(ms); max=100.4(ms); mean=51.03(ms); s_mean= 48; sleep=45(ms); delta=544; N=898; usual


here too ^^



Fftlength=2048,pass=3:Tune: sum=235284(ms); min=15.18(ms); max=586.9(ms); mean=43.07(ms); s_mean=50.13; sleep=45(ms); delta=1; N=5463; high_perf
Fftlength=4096,pass=3:Tune: sum=246185(ms); min=8.166(ms); max=257.8(ms); mean=22.53(ms); s_mean=27.13; sleep=30(ms); delta=1; N=10927; high_perf


But these ones are optimized with the default -tt 60, done in only 3 passes :D

that's how it works ;)
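The rule of thumb above can be checked mechanically. Here is a rough Python sketch — the regex and the `pass_counts` helper are my own, written against the stderr.txt lines quoted in this thread, not anything the SETI apps provide — that counts how many tuning passes each FFT length needed. A single pass=3 line per FFT length (ideally flagged high_perf) is the optimized case described here; extra pass=4/pass=5 lines mean -tt is too low for that kernel.

```python
import re
from collections import defaultdict

# Matches the tuning summary lines quoted above, e.g.:
# Fftlength=32,pass=3:Tune: sum=130559(ms); ... N=85; high_perf
TUNE_RE = re.compile(
    r"Fftlength=(\d+),pass=(\d+):Tune:.*;\s*(high_perf|usual)\s*$"
)

def pass_counts(stderr_text):
    """Return {fft_length: number of tuning-pass lines seen} for a stderr.txt dump.

    One line (pass=3 only) means tuning settled in 3 passes; more lines
    mean extra re-tune passes were needed for that FFT length.
    """
    counts = defaultdict(int)
    for line in stderr_text.splitlines():
        m = TUNE_RE.search(line.strip())
        if m:
            counts[int(m.group(1))] += 1
    return dict(counts)

sample = """
Fftlength=32,pass=3:Tune: sum=90392.8(ms); min=7.523(ms); usual
Fftlength=32,pass=4:Tune: sum=64054.1(ms); min=5.9(ms); usual
Fftlength=32,pass=5:Tune: sum=44288.8(ms); min=4.552(ms); usual
Fftlength=64,pass=3:Tune: sum=77912.6(ms); min=2.961(ms); high_perf
"""
print(pass_counts(sample))  # -> {32: 3, 64: 1}
```

Run it over a full stderr.txt dump: FFT lengths reporting more than one line are the ones a higher -tt (or a period-iteration change) might help.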
ID: 1999155 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Avatar

Send message
Joined: 6 Nov 99
Posts: 709
Credit: 8,032,827
RAC: 62
France
Message 1999156 - Posted: 22 Jun 2019, 10:44:38 UTC

If you want to stay at -tt 1500, you have to set the period iteration to more than 1.
ID: 1999156 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5121
Credit: 276,046,078
RAC: 462
Message 1999162 - Posted: 22 Jun 2019, 13:30:25 UTC - in response to Message 1999156.  

If you want to stay at -tt 1500, you have to set the period iteration to more than 1.


I am still a bit bewildered. However, if you would take a shot at "the best" combo of -period_iteration and -tt for my 2400G, I will try them out. As usual, I am trying to lower my wall-clock time. The CPU time listed is way below the wall-clock time, so I am interested in minimizing the wall-clock time.

If you want to be really fancy (and even more helpful) take a shot at the "best" command line for a Ryzen 5 2400G.

In any case, thanks for explaining yet something else I didn't (and probably still don't really get) about the reports we see on Seti.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1999162 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Avatar

Send message
Joined: 6 Nov 99
Posts: 709
Credit: 8,032,827
RAC: 62
France
Message 1999172 - Posted: 22 Jun 2019, 14:23:46 UTC

Some mysterious explanations were here, from the Lunatics forum ;)

And after that, you have to try different -tt parameters to get the maximum number of high_perf results at the end of stderr.txt.

The first step was to understand the explanation with the help of Google Translate, and then to try different parameters.

With period_iteration=1, you will have a lot of lag during the first minutes of a WU ;)
ID: 1999172 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 12949
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1999638 - Posted: 26 Jun 2019, 2:00:24 UTC

Good analysis and observations. The link to Raistmer's document at Lunatics explains it all. It will take a few readings to comprehend the interplay between all the parameters. But the goal is to reduce the passes: delta=1, N at its highest value, and the means at a minimum.

Easy to achieve with high-powered hardware, but a lot harder with lesser hardware like an APU or iGPU.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1999638 · Report as offensive

©2021 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.