Why this CPU task invalid so soon?

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1709801 - Posted: 7 Aug 2015, 17:54:31 UTC
Last modified: 7 Aug 2015, 18:14:33 UTC

@Keith: Can you possibly run some tests of the CPU with Prime95 (before and after the reseating/resinking/TIM replacement etc.)?

The NaNs, once in there, become problematic for the comparisons that select the best signal.

[Edit:] From scattered sources, I'm getting the general impression that the FX-8350 can sometimes need a small voltage bump over the motherboard 'auto' settings for Prime95 stability. There are some mentions of BIOS updates too. If something like that is going on here, then it would probably be something that slowly worsens with time. My Intel Core2Duos have drooped a little over some years' use, so it wouldn't necessarily be strange or unusual.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1709801 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1709811 - Posted: 7 Aug 2015, 18:27:48 UTC - in response to Message 1709801.  

Absolutely, Jason. That would be easy to do, and an easy way to run a stress test. I want to get to the bottom of this problem. I suspect that the CPU is on its way out; I am thinking that it has degraded after 3 years of a 400 MHz overclock. I run the systems at 4.4 GHz with no voltage adjustments, just a multiplier bump. I do have the motherboard's power supply set to Optimized instead of Auto to stiffen up the CPU and northbridge voltages under load. That makes the VRM temps a couple of degrees hotter, but other than that, it is all I did to get the system to run overclocked. As I stated in a previous post, this CPU is not as good silicon as the one in the other system. This system's VID is 1.38 V while the other, good system has a VID of 1.30 V. I have never tried bumping up the CPU voltage, as it didn't seem necessary since I was able to achieve my target speed with just a multiplier bump. I figured that if the system is stable with no crashes, there is no need to spend more energy on a higher voltage for no speed gain, just higher core temps. It's possible that the first couple of years of overclocking on air cooling did the damage. My temps were a lot higher before the water cooling addition.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1709811 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1709919 - Posted: 8 Aug 2015, 2:12:47 UTC

Well, I was out for most of the day and just now got back to look at the machine's CPU output for the day. No new invalids, but I did notice a new inconclusive that looks interesting to me: Task 4296134017. That is, if I understand what verbose=2 is telling me. Between 1.04% and 3.35%, the peak power changed from 17.08 to nan. It was then bested again with a peak of 20.39. Then at the next restart, at 34.86%, I lost the best autocorr results and the elapsed time was reset to time=-2.123e+011, i.e. no elapsed time. So this task looks like an impending invalid after consensus, if I am interpreting the results correctly. Does Jason want to comment?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1709919 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1709939 - Posted: 8 Aug 2015, 3:30:26 UTC - in response to Message 1709811.  
Last modified: 8 Aug 2015, 3:31:44 UTC

Thanks! Some (evolving) back story: while we largely expect that code which has been working for a long time on a lot of systems is 'correct', the fact that it doesn't necessarily cope gracefully with all possible inputs opens up some possibilities to make it better. No one really expects us to do that for cases where there is some obvious instability, but in our seti@home case we do have the luxury of a lot of people looking at numbers for a long time.

In this particular case, it probably shows that handling of denormal numbers and other special values could be used to at least shield the postprocessing code from rubbish, and perhaps to flag a warning to the user, trigger reprocessing/retries, or something of that sort. (I usually use flush-to-zero modes; I'm not sure which floating-point modes this particular codebase has enabled.)

We rely a fair bit on BOINC's redundancy mechanism to weed out a lot of result data corruption of this sort, but it doesn't really handle some cases, such as where the source of difficulty is in the source data distributed to both hosts, or where completely different hosts generate a NaN in the same number (e.g. a sum used for normalisation) via completely different avenues/circuits.

It'll be something to think about, given that we have a reasonable idea of what numbers look sane at different points, and that expecting some number of machines to fall off the rails over time is pretty reasonable.
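
As a rough illustration of the kind of shielding described above, here is a minimal Python sketch (not the actual SETI@home code). The function names and the upper bound of 1e6 are assumptions made up for the example; the real application would pick limits from what is known to be sane at each processing stage.

```python
import math


def is_sane_power(value, upper_bound=1e6):
    """Return True for a finite, non-negative power below a hypothetical bound."""
    return math.isfinite(value) and 0.0 <= value < upper_bound


def filter_candidates(powers):
    """Split candidate powers into sane values and a count of rejected ones."""
    sane = [p for p in powers if is_sane_power(p)]
    rejected = len(powers) - len(sane)
    return sane, rejected


# A NaN, an infinity and a negative power are all kept away from later stages.
sane, rejected = filter_candidates([7.027, float("nan"), 17.42, float("inf"), -1.0])
print(sane, rejected)
```

A non-zero rejected count could then be used to flag a warning or trigger a retry, rather than letting the bad value reach the comparisons that select the best signal.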
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1709939 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1710054 - Posted: 8 Aug 2015, 15:01:46 UTC

Looked into the slots this morning before startup. Something in slot 9 stuck out: the NaN popped up again. <result_name>12my15ab.27748.2282.438086664196.12.124.vlar_1</result_name> had changes to its best autocorr peak power after several restarts.

Verbose level set to:2

Build features: SETI7 Non-graphics FFTW USE_SSE42 x64
CPUID: AMD FX(tm)-8350 Eight-Core Processor

Cache: L1=64K L2=2048K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 FMA3 SSE4.1 SSE4.2 AVX SSE4A XOP FMA4
ar=0.011166 NumCfft=146007 NumGauss=0 NumPulse=50183457664 NumTriplet=67977339808
Pulse finding at FFT lengths 32 through 8192.
Spike finding at FFT lengths 32 through 131072.
Triplet finding at FFT lengths 8 through 32768.

In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768

Windows optimized S@H v7 application
Based on Intel, Core 2-optimized v8-nographics V5.13 by Alex Kan
SSE4.2xjf Win64 Build 2549 , Ported by : Raistmer, JDWhale

SETI7 update by Raistmer
Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 0.011166
R New best spike:score:-0.93708, power: 4.6236, index=9, fft_len=32, ifft=0,icfft=2
R New best spike:score:-0.90136, power: 5.0199, index=26, fft_len=32, ifft=2,icfft=2
R New best spike:score:-0.76487, power: 6.8736, index=4, fft_len=32, ifft=6,icfft=2
R New best spike:score:-0.76053, power: 6.9428, index=20, fft_len=32, ifft=7,icfft=2
R New best spike:score:-0.72728, power: 7.4952, index=12, fft_len=32, ifft=22,icfft=2
R New best spike:score:-0.68913, power: 8.1834, index=1, fft_len=32, ifft=84,icfft=2
R New best spike:score:-0.61661, power: 9.6705, index=23, fft_len=32, ifft=176,icfft=2
R New best spike:score:-0.61619, power: 9.6798, index=15, fft_len=32, ifft=599,icfft=2
R New best spike:score:-0.52178, power: 12.03, index=2, fft_len=32, ifft=1053,icfft=2
R New best spike:score:-0.50524, power: 12.497, index=24, fft_len=32, ifft=1248,icfft=2
R New best spike:score:-0.48237, power: 13.173, index=18, fft_len=32, ifft=12019,icfft=2
Best pulse updated: score=0.5681,power=0.4949,fftlen=32,freq_bin=1,time_bin=16384,icfft=2
Best pulse updated: score=0.7814,power=1.0627,fftlen=32,freq_bin=1,time_bin=16384,icfft=2
Best pulse updated: score=0.828,power=0.92944,fftlen=32,freq_bin=4,time_bin=16384,icfft=2
R New best spike:score:-0.46096, power: 13.839, index=37, fft_len=64, ifft=11253,icfft=3
Best pulse updated: score=0.9241,power=4.4743,fftlen=64,freq_bin=21,time_bin=8192,icfft=3
R New best spike:score:-0.4353, power: 14.681, index=38, fft_len=128, ifft=7301,icfft=4
R New best spike:score:-0.41118, power: 15.52, index=4557, fft_len=8192, ifft=99,icfft=10
Best autocorr updated:score=-0.7553, peak_power=7.027, bin=17395, fft_ind=0, icfft=14
Best autocorr updated:score=-0.721, peak_power=7.605, bin=37783, fft_ind=1, icfft=14
Best autocorr updated:score=-0.7015, peak_power=7.954, bin=50215, fft_ind=3, icfft=14
Best autocorr updated:score=-0.6662, peak_power=8.627, bin=42655, fft_ind=4, icfft=14
Best autocorr updated:score=-0.6611, peak_power=8.729, bin=63831, fft_ind=7, icfft=15
Best autocorr updated:score=-0.6017, peak_power=10.01, bin=62796, fft_ind=4, icfft=16
R New best spike:score:-0.38615, power: 16.44, index=10525, fft_len=131072, ifft=0,icfft=18
R New best spike:score:-0.37492, power: 16.871, index=10525, fft_len=131072, ifft=0,icfft=20
Best autocorr updated:score=-0.5956, peak_power=10.15, bin=50215, fft_ind=3, icfft=20
Best autocorr updated:score=-0.5509, peak_power=11.25, bin=62796, fft_ind=4, icfft=29
R New best spike:score:-0.34259, power: 18.175, index=10526, fft_len=131072, ifft=0,icfft=46
R New best spike:score:-0.3236, power: 18.987, index=10526, fft_len=131072, ifft=0,icfft=48
Best autocorr updated:score=-0.5364, peak_power=11.63, bin=49304, fft_ind=7, icfft=128
Best autocorr updated:score=-0.4704, peak_power=13.54, bin=49304, fft_ind=7, icfft=138
R New best spike:score:-0.31804, power: 19.232, index=39685, fft_len=131072, ifft=6,icfft=1053
R New best spike:score:-0.30996, power: 19.593, index=39684, fft_len=131072, ifft=6,icfft=1056
R New best spike:score:-0.29753, power: 20.162, index=86925, fft_len=131072, ifft=1,icfft=1351
R New best spike:score:-0.27388, power: 21.29, index=86924, fft_len=131072, ifft=1,icfft=1361
R New best spike:score:-0.26996, power: 21.483, index=86923, fft_len=131072, ifft=1,icfft=1372
R New best spike:score:-0.26033, power: 21.965, index=39289, fft_len=131072, ifft=1,icfft=2438
Best autocorr updated:score=-0.4685, peak_power=13.6, bin=22165, fft_ind=1, icfft=3098
Best autocorr updated:score=-0.4478, peak_power=14.26, bin=26463, fft_ind=7, icfft=5187
Best autocorr updated:score=-0.4243, peak_power=15.06, bin=26463, fft_ind=7, icfft=5199
Best autocorr updated:score=-0.4182, peak_power=15.27, bin=26463, fft_ind=7, icfft=5213
Best pulse updated: score=0.933,power=4.1365,fftlen=1024,freq_bin=179,time_bin=512,icfft=6481
Best pulse updated: score=0.9433,power=2.478,fftlen=1024,freq_bin=179,time_bin=512,icfft=6481
Verbose level set to:2

Build features: SETI7 Non-graphics FFTW USE_SSE42 x64
CPUID: AMD FX(tm)-8350 Eight-Core Processor

Cache: L1=64K L2=2048K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 FMA3 SSE4.1 SSE4.2 AVX SSE4A XOP FMA4
ar=0.011166 NumCfft=146007 NumGauss=0 NumPulse=50183457664 NumTriplet=67977339808
Pulse finding at FFT lengths 32 through 8192.
Spike finding at FFT lengths 32 through 131072.
Triplet finding at FFT lengths 8 through 32768.

In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Restarted at 2.75 percent.
Best pulse updated: score=0.9457,power=4.4787,fftlen=128,freq_bin=69,time_bin=4096,icfft=12517
Best autocorr updated:score=-0.3948, peak_power=16.12, bin=21291, fft_ind=2, icfft=12863
Best autocorr updated:score=-0.3591, peak_power=17.49, bin=21291, fft_ind=2, icfft=12904
Verbose level set to:2

Build features: SETI7 Non-graphics FFTW USE_SSE42 x64
CPUID: AMD FX(tm)-8350 Eight-Core Processor

Cache: L1=64K L2=2048K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 FMA3 SSE4.1 SSE4.2 AVX SSE4A XOP FMA4
ar=0.011166 NumCfft=146007 NumGauss=0 NumPulse=50183457664 NumTriplet=67977339808
Pulse finding at FFT lengths 32 through 8192.
Spike finding at FFT lengths 32 through 131072.
Triplet finding at FFT lengths 8 through 32768.

In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Restarted at 4.44 percent.
R New best spike:score:-0.25926, power: 22.019, index=109579, fft_len=131072, ifft=0,icfft=17151
Best autocorr updated:score=-0.3565, peak_power=17.6, bin=28748, fft_ind=1, icfft=19452
Best pulse updated: score=0.9508,power=0.76824,fftlen=512,freq_bin=225,time_bin=1024,icfft=29973
Best pulse updated: score=1.031,power=7.4205,fftlen=8192,freq_bin=136,time_bin=64,icfft=32685
Pulse: peak=7.420475, time=54.11, period=23.21, d_freq=1421210572.33, score=1.031, chirp=-9.7454, fft_len=8k
R New best spike:score:-0.24958, power: 22.515, index=78961, fft_len=131072, ifft=3,icfft=40362
Verbose level set to:2

Build features: SETI7 Non-graphics FFTW USE_SSE42 x64
CPUID: AMD FX(tm)-8350 Eight-Core Processor

Cache: L1=64K L2=2048K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 FMA3 SSE4.1 SSE4.2 AVX SSE4A XOP FMA4
ar=0.011166 NumCfft=146007 NumGauss=0 NumPulse=50183457664 NumTriplet=67977339808
Pulse finding at FFT lengths 32 through 8192.
Spike finding at FFT lengths 32 through 131072.
Triplet finding at FFT lengths 8 through 32768.

In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Restarted at 13.44 percent.
R New best spike:score:-0.24958, power: 22.515, index=78961, fft_len=131072, ifft=3,icfft=40362
Best pulse updated: score=1.034,power=0.8516,fftlen=256,freq_bin=63,time_bin=2048,icfft=48310
Pulse: peak=0.8515971, time=53.7, period=1.063, d_freq=1421212567.18, score=1.034, chirp=-14.406, fft_len=256
R New best spike:score:-0.24731, power: 22.634, index=12668, fft_len=131072, ifft=5,icfft=48468
R New best spike:score:-0.23456, power: 23.308, index=12669, fft_len=131072, ifft=5,icfft=48470
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=4, icfft=61008

Is this change, from a normal autocorr peak power value early in the analysis to peak_power=nan, the source of the "Impossible Autocorr power, postponing task, restarting from last checkpoint" message that I see pop up on this failing computer from time to time when I am processing CPU tasks? I didn't find an instance where the time on an autocorr count got reset to a null elapsed time, though.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1710054 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1710099 - Posted: 8 Aug 2015, 17:53:17 UTC
Last modified: 8 Aug 2015, 18:05:12 UTC

Well, I picked up another invalid. Damn, it was one I looked at this morning but didn't see anything out of the ordinary then. Task 4297854304 looks like another case of multiple restarts, but the one after 52% (in this case 55%) is when the autocorr time got nulled out, and the task was then declared invalid for no autocorr processing done. Also, I see that it had an autocorr peak_power=17.42, but that was then updated to peak_power=nan right before the last restart. Commentary?

[Edit] The task I was looking at this morning and was suspicious of, with the peak=nan value, looks like it updated the best autocorr peak power multiple times after the last restart and corrected itself. Task 4298009570

Commentary?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1710099 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1710121 - Posted: 8 Aug 2015, 18:46:50 UTC - in response to Message 1710099.  

Well, I picked up another invalid. Damn, it was one I looked at this morning but didn't see anything out of the ordinary then. Task 4297854304 looks like another case of multiple restarts, but the one after 52% (in this case 55%) is when the autocorr time got nulled out, and the task was then declared invalid for no autocorr processing done. Also, I see that it had an autocorr peak_power=17.42, but that was then updated to peak_power=nan right before the last restart. Commentary?

[Edit] The task I was looking at this morning and was suspicious of, with the peak=nan value, looks like it updated the best autocorr peak power multiple times after the last restart and corrected itself. Task 4298009570

Commentary?

Commentary, sure......but not from any expertise on the internals of this thing. ;^) Jason or Joe will have to provide that.

What I think I'm seeing in the Stderr of those two tasks is that the "Best autocorr updated" event where the nan shows up seems to be in mid-run between restarts, not immediately after a restart. Especially on that 4297854304 task, where two consecutive lines in the output read:

Best autocorr updated:score=-0.3609, peak_power=17.42, bin=16689, fft_ind=2, icfft=70703
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=3, icfft=93574

That happens in mid-run between the 11.36% restart and the 55.32% restart. (Boy, your tasks sure do have a lot of restarts!) I don't know what that actually means, but perhaps the actual restarts don't provide the trigger, after all.

You mentioned checking the slots before you started BOINC this morning, but did you actually pick up anything odd from the <autocorr> sections of the state.sah files, or were those slot folders empty? What you posted looked like Stderr output.
ID: 1710121 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1710126 - Posted: 8 Aug 2015, 18:50:59 UTC - in response to Message 1710054.  

...
Is this change, from a normal autocorr peak power value early in the analysis to peak_power=nan, the source of the "Impossible Autocorr power, postponing task, restarting from last checkpoint" message that I see pop up on this failing computer from time to time when I am processing CPU tasks? I didn't find an instance where the time on an autocorr count got reset to a null elapsed time, though.

Thanks for noting that the "Impossible Autocorr power, postponing task, restarting from last checkpoint" message is also happening. That's a sanity check based on a mathematical property of Autocorrelation. The array of 64k values in the final stage of analysis indicates the comparison at all delays from 0 to ~6.7 seconds at a resolution of 0.0001024 seconds. The 0 delay must be the strongest, but is of course not considered for reporting. If the value at some other delay is larger, the sanity check discards processing since the last checkpoint and forces a restart which gets a fresh copy of the data from the WU.
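
To make the sanity check concrete, here is a small Python sketch of the idea, assuming a direct circular autocorrelation of a short real sequence. It is only an illustration of the mathematical property, not the application's actual code (which works on 64k-point arrays), and the function names are made up for the example.

```python
def circular_autocorrelation(samples):
    """Directly compute the circular autocorrelation of a real sequence (O(n^2))."""
    n = len(samples)
    return [
        sum(samples[i] * samples[(i + lag) % n] for i in range(n))
        for lag in range(n)
    ]


def has_impossible_power(autocorr):
    """Return True if any non-zero delay is stronger than the zero-delay value."""
    zero_delay = autocorr[0]
    return any(value > zero_delay for value in autocorr[1:])


# Delay range implied by the figures above: 64k bins at 0.0001024 s each.
max_delay_seconds = 65536 * 0.0001024  # about 6.71 s

clean = circular_autocorrelation([1.0, -2.0, 3.0, 0.5])
corrupted = list(clean)
corrupted[2] = clean[0] + 1.0  # simulate a corrupted bin at a non-zero delay

print(has_impossible_power(clean), has_impossible_power(corrupted))
```

For uncorrupted data the zero-delay value is always the largest, so a stronger value at any other delay can only come from corruption, which is why discarding the work since the last checkpoint is a reasonable response.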

That check will not catch a NaN, which is neither greater than, equal to, nor less than an actual number. But the message most probably indicates data corruption of the same sort that might lead to a NaN.
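
The NaN behaviour is easy to demonstrate. The Python sketch below assumes a "greater than" style check like the one described; the reference value 42.0 and both function names are hypothetical.

```python
import math

nan = float("nan")
zero_delay = 42.0  # hypothetical zero-delay value

# Every ordered comparison against NaN is False, as is equality.
comparisons = (nan > zero_delay, nan == zero_delay, nan < zero_delay)


def exceeds_zero_delay(value, reference):
    """A plain 'greater than' check, which a NaN passes through unnoticed."""
    return value > reference


def exceeds_or_invalid(value, reference):
    """The same check with an explicit NaN test added in front."""
    return math.isnan(value) or value > reference


print(comparisons)
print(exceeds_zero_delay(nan, zero_delay), exceeds_or_invalid(nan, zero_delay))
```

The same property affects any "is this better than the current best?" comparison: once a NaN is involved, the comparison simply answers False, so it has to be tested for explicitly.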
                                                                  Joe
ID: 1710126 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1710153 - Posted: 8 Aug 2015, 19:53:53 UTC - in response to Message 1710121.  


Commentary, sure......but not from any expertise on the internals of this thing. ;^) Jason or Joe will have to provide that.

What I think I'm seeing in the Stderr of those two tasks is that the "Best autocorr updated" event where the nan shows up seems to be in mid-run between restarts, not immediately after a restart. Especially on that 4297854304 task, where two consecutive lines in the output read:

Best autocorr updated:score=-0.3609, peak_power=17.42, bin=16689, fft_ind=2, icfft=70703
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=3, icfft=93574

That happens in mid-run between the 11.36% restart and the 55.32% restart. (Boy, your tasks sure do have a lot of restarts!) I don't know what that actually means, but perhaps the actual restarts don't provide the trigger, after all.

You mentioned checking the slots before you started BOINC this morning, but did you actually pick up anything odd from the <autocorr> sections of the state.sah files, or were those slot folders empty? What you posted looked like Stderr output.


Yes, I did look at all the state.sah files in the slots before startup this morning. But since that was the first time I had seen the structure of that file, I didn't really know what I was looking for, and I couldn't see anything that looked out of the ordinary. All the parts in the <autocorr> </autocorr> section looked "reasonable" to me. Nothing jumped out at me. Obviously, I missed something with Task 4297854304. The reason the CPU tasks restart so often is that I am also running MW and Einstein GPU tasks along with SETI GPU tasks. I have the CPU_usage for both set high enough to properly feed their OpenCL apps, so the total CPU_usage adds up to a full CPU core, and BOINC then drops one of the running CPU tasks to support them.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1710153 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1710158 - Posted: 8 Aug 2015, 20:04:11 UTC - in response to Message 1710121.  


What I think I'm seeing in the Stderr of those two tasks is that the "Best autocorr updated" event where the nan shows up seems to be in mid-run between restarts, not immediately after a restart. Especially on that 4297854304 task, where two consecutive lines in the output read:

Best autocorr updated:score=-0.3609, peak_power=17.42, bin=16689, fft_ind=2, icfft=70703
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=3, icfft=93574

That happens in mid-run between the 11.36% restart and the 55.32% restart. (Boy, your tasks sure do have a lot of restarts!) I don't know what that actually means, but perhaps the actual restarts don't provide the trigger, after all.


This is looking more and more like a simple case of data corruption somewhere, especially since, as you noticed, the best autocorr peak_power=nan happened during active processing and not at the end of processing or at a restart. I plan to run a Prime95 session tonight after shutting down BOINC. I wanted to do that anyway as a before-and-after test for when I pull the system apart on Monday. I need to see whether it fails before I upset the test case by making hardware changes.

Thanks for the post, Josef, explaining the reason for the Impossible Autocorr power message.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1710158 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1710193 - Posted: 8 Aug 2015, 22:26:06 UTC
Last modified: 8 Aug 2015, 22:56:01 UTC

Just for the heck of it, I grabbed the completed CPU tasks for that host again, searching for "Best autocorr: peak=0", and noted 5 new ones since my retrieval Thursday evening. One of them, of course is 4297854304, which you noted earlier. Another one that really caught my eye is 4299129617, which has the following sequence in the Stderr:

Best autocorr updated:score=-0.349, peak_power=17.91, bin=17696, fft_ind=4, icfft=68746
Autocorr: peak=17.9095, time=60.4, delay=1.8121, d_freq=1418800066.29, chirp=20.5, fft_len=128k
Best autocorr updated:score=-0.3477, peak_power=17.96, bin=17696, fft_ind=4, icfft=68806
Autocorr: peak=17.96048, time=60.4, delay=1.8121, d_freq=1418800067.41, chirp=20.519, fft_len=128k
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=1, icfft=92819

It apparently found two reportable Autocorr peaks, and then the glitch kicked in right after that.

Another one that looks particularly interesting to me is 4296268517, which shows:

Best autocorr updated:score=-0.39, peak_power=16.3, bin=28731, fft_ind=3, icfft=33827
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=0, icfft=34972
R New best spike:score:-0.23667, power: 23.195, index=129997, fft_len=131072, ifft=6,icfft=39706
R New best spike:score:-0.20319, power: 25.053, index=129998, fft_len=131072, ifft=6,icfft=39708
Spike: peak=25.05331, time=87.24, d_freq=1421131699.56, chirp=-11.84, fft_len=128k
R New best spike:score:-0.19064, power: 25.788, index=129999, fft_len=131072, ifft=6,icfft=39710
Spike: peak=25.78832, time=87.24, d_freq=1421131699.55, chirp=-11.841, fft_len=128k
Spike: peak=25.23988, time=87.24, d_freq=1421131699.55, chirp=-11.842, fft_len=128k
Best pulse updated: score=1.011,power=3.4366,fftlen=256,freq_bin=63,time_bin=2048,icfft=58143
Pulse: peak=3.436609, time=53.7, period=7.92, d_freq=1421136146.87, score=1.011, chirp=17.339, fft_len=256
Autocorr: peak=18.01251, time=87.24, delay=6.0366, d_freq=1421134433.81, chirp=18.584, fft_len=128k

In this one, "nan" shows up but then a reportable Autocorr peak seems to be found a bit later. However, the "Best autocorr" value doesn't seem to get updated. Task 4296134017 shows similar behavior.

There's also one, 4296268496, where the Best autocorr does seem to get hosed immediately after a restart at 16.98%.

All in all, your machine certainly seems to be doing some interesting stuff! But it's only doing it to the autocorr processing and/or data. Very selective. ;^)

EDIT: FWIW, it appears that any time the "peak_power=nan" shows up before the last restart, as in those 4 examples, the reported "Best autocorr" shows "peak=0". However, if "peak_power=nan" happens after the last restart, as in Task 4297870021, you actually get "Best autocorr: peak=nan". I don't know if there's any significance to that, but it seems possible the Best autocorr may get reset on restarts. Just an observation.
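
For anyone wanting to repeat this kind of check on saved Stderr text, a minimal Python sketch follows. It assumes the line formats seen in the excerpts in this thread ("Restarted at ... percent." and "Best autocorr updated:..."); the function name is made up, and the sample text is stitched together from lines quoted in different posts purely for illustration, so it is not one task's real log.

```python
import re

# Matches the "Best autocorr updated" lines shown in the Stderr excerpts.
UPDATE_RE = re.compile(r"Best autocorr updated:score=[^,\s]+, peak_power=([^,\s]+)")


def count_nan_updates(stderr_text):
    """Count peak_power=nan updates before and after the last restart line.

    If the text contains no restart line, every NaN update counts as 'after'.
    """
    lines = stderr_text.splitlines()
    restarts = [i for i, line in enumerate(lines)
                if line.strip().startswith("Restarted at")]
    last_restart = restarts[-1] if restarts else -1
    counts = {"before_last_restart": 0, "after_last_restart": 0}
    for i, line in enumerate(lines):
        match = UPDATE_RE.search(line)
        if match and match.group(1).lower() == "nan":
            key = "after_last_restart" if i > last_restart else "before_last_restart"
            counts[key] += 1
    return counts


# Illustrative sample only: lines borrowed from different tasks in this thread.
sample = """\
Best autocorr updated:score=-0.39, peak_power=16.3, bin=28731, fft_ind=3, icfft=33827
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=0, icfft=34972
Restarted at 68.20 percent.
Best pulse updated: score=1.042,power=2.3374,fftlen=1024,freq_bin=986,time_bin=512,icfft=144691
"""
print(count_nan_updates(sample))
```

A NaN counted under "before_last_restart" would correspond to the cases that ended up reporting "peak=0", and one under "after_last_restart" to the "peak=nan" case, if the observation above holds.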
ID: 1710193 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1710211 - Posted: 9 Aug 2015, 0:33:09 UTC - in response to Message 1710193.  
Last modified: 9 Aug 2015, 0:41:23 UTC

Just for the heck of it, I grabbed the completed CPU tasks for that host again, searching for "Best autocorr: peak=0", and noted 5 new ones since my retrieval Thursday evening. One of them, of course is 4297854304, which you noted earlier. Another one that really caught my eye is 4299129617, which has the following sequence in the Stderr:

Best autocorr updated:score=-0.349, peak_power=17.91, bin=17696, fft_ind=4, icfft=68746
Autocorr: peak=17.9095, time=60.4, delay=1.8121, d_freq=1418800066.29, chirp=20.5, fft_len=128k
Best autocorr updated:score=-0.3477, peak_power=17.96, bin=17696, fft_ind=4, icfft=68806
Autocorr: peak=17.96048, time=60.4, delay=1.8121, d_freq=1418800067.41, chirp=20.519, fft_len=128k
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=1, icfft=92819

It apparently found two reportable Autocorr peaks, and then the glitch kicked in right after that.

Another one that looks particularly interesting to me is 4296268517, which shows:

Best autocorr updated:score=-0.39, peak_power=16.3, bin=28731, fft_ind=3, icfft=33827
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=0, icfft=34972
R New best spike:score:-0.23667, power: 23.195, index=129997, fft_len=131072, ifft=6,icfft=39706
R New best spike:score:-0.20319, power: 25.053, index=129998, fft_len=131072, ifft=6,icfft=39708
Spike: peak=25.05331, time=87.24, d_freq=1421131699.56, chirp=-11.84, fft_len=128k
R New best spike:score:-0.19064, power: 25.788, index=129999, fft_len=131072, ifft=6,icfft=39710
Spike: peak=25.78832, time=87.24, d_freq=1421131699.55, chirp=-11.841, fft_len=128k
Spike: peak=25.23988, time=87.24, d_freq=1421131699.55, chirp=-11.842, fft_len=128k
Best pulse updated: score=1.011,power=3.4366,fftlen=256,freq_bin=63,time_bin=2048,icfft=58143
Pulse: peak=3.436609, time=53.7, period=7.92, d_freq=1421136146.87, score=1.011, chirp=17.339, fft_len=256
Autocorr: peak=18.01251, time=87.24, delay=6.0366, d_freq=1421134433.81, chirp=18.584, fft_len=128k

In this one, "nan" shows up but then a reportable Autocorr peak seems to be found a bit later. However, the "Best autocorr" value doesn't seem to get updated. Task 4296134017 shows similar behavior.

There's also one, 4296268496, where the Best autocorr does seem to get hosed immediately after a restart at 16.98%.

All in all, your machine certainly seems to be doing some interesting stuff! But it's only doing it to the autocorr processing and/or data. Very selective. ;^)

EDIT: FWIW, it appears that any time the "peak_power=nan" shows up before the last restart, as in those 4 examples, the reported "Best autocorr" shows "peak=0". However, if "peak_power=nan" happens after the last restart, as in Task 4297870021, you actually get "Best autocorr: peak=nan". I don't know if there's any significance to that, but it seems possible the Best autocorr may get reset on restarts. Just an observation.


OK, now I'm confused. I thought Josef explained to us that Best Autocorr always had to have non-zero values to be considered valid. That is why I was looking for the
    Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0

statements in the state.sah and stderr.txt files. So how come Task 4296268496 was deemed valid? We've seen cases where I got the autocorr count correct except the Best autocorr values were zeroed out with the no elapsed time count and the task was instantly invalid. That was the case with the very first task I started this thread with.

I have to agree with your assessment, Jeff, that my computer is very selective about how it processes autocorr results. I know that GPU tasks do most of their processing on the GPU core and memory, but they do have to have the task data fed to them via the CPU. So how come I've yet to see an invalid or improper processing of autocorr on the GPUs? Does the Best Autocorr result always go into a specific register on the CPU? Are my troublesome invalids and inconclusives always getting processed in the same register of some failing core in the CPU? That would have to be determined by whoever wrote the autocorr mechanism and by how it gets implemented in machine code. Josef, can you jump in here please and explain the outcome of Task 4296268496?

Thanks in advance.

[Edit] Thought I'd better post the pertinent bits as the work unit is getting purged already.

    In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
    Restarted at 68.20 percent.
    Pulse: peak=2.186817, time=53.79, period=4.535, d_freq=1419941741.54, score=1, chirp=73.16, fft_len=2k
    Pulse: peak=0.8214412, time=53.71, period=0.9011, d_freq=1419944919.28, score=1.021, chirp=-95.101, fft_len=512
    Pulse: peak=6.180362, time=53.9, period=19.4, d_freq=1419945787.77, score=1.003, chirp=95.45, fft_len=4k
    Best pulse updated: score=1.042,power=2.3374,fftlen=1024,freq_bin=986,time_bin=512,icfft=144691
    Pulse: peak=2.337372, time=53.74, period=5.059, d_freq=1419945544.73, score=1.042, chirp=-97.968, fft_len=1024

    Best spike: peak=23.03849, time=33.56, d_freq=1419939382.31, chirp=28.378, fft_len=128k
    Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0
    Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.123e+011, d_freq=0,
    score=-12, null_hyp=0, chirp=0, fft_len=0
    Best pulse: peak=2.337372, time=53.74, period=5.059, d_freq=1419945544.73, score=1.042, chirp=-97.968, fft_len=1024
    Best triplet: peak=0, time=-2.123e+011, period=0, d_freq=0, chirp=0, fft_len=0


    Flopcounter: 44675103788321.765625

    Spike count: 0
    Autocorr count: 1
    Pulse count: 8
    Triplet count: 0
    Gaussian count: 0
    Wallclock time elapsed since last restart: 1804.8 seconds

    12:42:22 (5876): called boinc_finish
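
The pattern being hunted for above (an Autocorr count above zero alongside a zeroed Best autocorr line) can be checked mechanically. What follows is only a minimal sketch, assuming the stderr.txt text looks like the excerpt above; the function names `parse_fields` and `check_autocorr` are made up for illustration and are not part of BOINC or the SETI@home application.

```python
def parse_fields(line):
    """Turn 'Best autocorr: peak=0, time=..., ...' into a dict of strings."""
    _, _, rest = line.partition(":")
    pairs = (item.split("=", 1) for item in rest.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def check_autocorr(stderr_text):
    """Return (autocorr_count, best_peak, suspicious) for one stderr excerpt.

    'suspicious' is True when at least one autocorr was counted but the
    Best autocorr peak is still zero. Either value is None if its line is missing.
    """
    count = None
    best_peak = None
    for raw in stderr_text.splitlines():
        line = raw.strip()
        if line.startswith("Autocorr count:"):
            count = int(line.split(":", 1)[1])
        elif line.startswith("Best autocorr:"):
            best_peak = float(parse_fields(line)["peak"])
    suspicious = bool(count) and best_peak == 0.0
    return count, best_peak, suspicious


# Lines copied from the excerpt above.
sample = """\
    Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0
    Spike count: 0
    Autocorr count: 1
"""
print(check_autocorr(sample))
```

Running it over a folder of saved stderr.txt files would give a quick list of tasks worth a closer look before they get purged.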


Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1710211
Profile Jeff Buck · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1710215 - Posted: 9 Aug 2015, 0:57:31 UTC - in response to Message 1710211.  
Last modified: 9 Aug 2015, 0:58:45 UTC

OK, now I'm confused. I thought Josef explained to us that Best Autocorr always had to have non-zero values to be considered valid. That is why I was looking for the
    Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0

statements in the state.sah and stderr.txt files. So how come Task 4296268496 was deemed valid? We've seen cases where I got the Autocorr count correct, but the Best autocorr values were zeroed out with no elapsed time count, and the task was instantly invalid. That was the case with the very first task I started this thread with.


I think Joe mentioned in an earlier post that even if the Best Autocorr is hosed, you'll still likely get validated with a weakly similar result as long as the Autocorr count itself is greater than zero. In those cases, though, it seems to require a third party to arbitrate the initial Inconclusive.
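
Earlier in the thread, NaNs were mentioned as being problematic for the comparisons that select the best signal. A minimal sketch of how a NaN could, in principle, bump a count while leaving the "best" value zeroed is below. It is purely illustrative: the function `select_best`, the threshold of 10.0, and the peak of 12.5 are all hypothetical, and the real application code is not shown anywhere in this thread. The only hard fact used is that every ordered comparison with a NaN evaluates to false.

```python
def select_best(peaks, threshold):
    """Count peaks that are not below threshold and track the largest peak seen.

    Hypothetical logic for illustration only, not the SETI@home application code.
    """
    count = 0
    best = 0.0  # starts zeroed, like an unused 'Best autocorr' entry
    for peak in peaks:
        # A NaN gets counted here: (NaN <= threshold) is False, so 'not' makes it True.
        if not (peak <= threshold):
            count += 1
        # A NaN never wins here: (NaN > best) is False, so 'best' stays untouched.
        if peak > best:
            best = peak
    return count, best


print(select_best([12.5], 10.0))          # a healthy peak is counted and becomes the best
print(select_best([float("nan")], 10.0))  # a NaN is counted, but the best stays 0.0
```

Whether anything like this is what actually happens on the flaky core is a guess; it just shows that a count above zero with a zeroed best is not a contradiction once NaNs are in play.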

I have to agree with your assessment, Jeff, that my computer is very selective about how it processes autocorr results. I know that GPU tasks do most of their processing on the GPU core and memory, but they do have to have the task data fed to them via the CPU. So how come I've yet to see an invalid or improper processing of autocorr on the GPUs? Does the Best Autocorr result always go into a specific register on the CPU? Are my troublesome invalids and inconclusives always getting processed on the same register of some failing core in the CPU? That would have to be determined by whoever wrote the autocorr mechanism and how it gets implemented in machine code. Josef, can you jump in here please and explain the outcome of Task 4296268496?

Thanks in advance.

That puzzles me a bit, too, as it did with those phantom triplets I was getting on a GPU. Only the triplet processing or data was affected. My best guess (and it's only a WAG) would be that, if the problem is failing memory, then perhaps task data frequently gets loaded into memory at exactly the same base address, so the failing bit(s) always hit the same block of data. That scenario would actually make some sense to me. On the other hand, if it's a failing CPU, then perhaps you could be right about a specific register being involved, or perhaps there's a specific FPU (or equivalent) circuit that only gets exercised for autocorrelation processing. As you say, the best answer would have to come from the experts! ;^)
ID: 1710215
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1710236 - Posted: 9 Aug 2015, 2:40:07 UTC - in response to Message 1710211.  

...
I have to agree with your assessment, Jeff, that my computer is very selective about how it processes autocorr results. I know that GPU tasks do most of their processing on the GPU core and memory, but they do have to have the task data fed to them via the CPU. So how come I've yet to see an invalid or improper processing of autocorr on the GPUs? Does the Best Autocorr result always go into a specific register on the CPU? Are my troublesome invalids and inconclusives always getting processed on the same register of some failing core in the CPU? That would have to be determined by whoever wrote the autocorr mechanism and how it gets implemented in machine code. Josef, can you jump in here please and explain the outcome of Task 4296268496?

Thanks in advance.
...

Jeff's analysis of how that task got credit is correct: the reported Autocorr was enough to keep the Validator from instantly invalidating it, and there were enough signals that it could be given credit based on a "weakly similar" comparison to the canonical result.

Big picture stuff:

Each instance of the CPU application allocates memory the same way, so the buffers used for Autocorr processing occupy the same range within the virtual address range given to the application by Windows. The actual physical address range of course differs for each instance, but is mapped on page boundaries and subsequent tasks may fairly often map to the same physical memory addresses. Although the same registers are also used for the same pieces of data while processing (because the compiler has made those choices), I think the issue is unlikely to be happening there. Still, it might be interesting to try to find out if the tasks with problems are all being processed by the same CPU core.
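
The virtual-versus-physical point can be put into numbers. This is only a sketch under stated assumptions: 4 KiB pages (the common x86 default), a made-up buffer address, and made-up physical frame numbers. A plain dict stands in for the page table that Windows actually maintains.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def split_address(addr):
    """Split an address into (page number, offset within the page)."""
    return addr // PAGE_SIZE, addr % PAGE_SIZE


def translate(virtual_addr, page_table):
    """Map a virtual address to a physical one using a {virtual page: physical frame} dict."""
    vpage, offset = split_address(virtual_addr)
    return page_table[vpage] * PAGE_SIZE + offset


# Hypothetical virtual address of an Autocorr buffer, identical in every instance.
buffer_va = 0x12345678
vpage, offset = split_address(buffer_va)

# Two instances map the same virtual page to different (made-up) physical frames.
task_a = {vpage: 0x00ABC}
task_b = {vpage: 0x00DEF}

print(hex(translate(buffer_va, task_a)))
print(hex(translate(buffer_va, task_b)))
```

The offset inside the page survives the mapping unchanged, and only the frame differs. If a later task happens to be handed the same frame as an earlier one, its buffer lands on exactly the same physical cells, which is why a bad memory location could keep hitting the same kind of data.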

For GPU processing the full data is put into GPU memory as soon as practical, and processed there. The code used of course differs from CPU processing; doing parallel processing with the GPU hardware is simply different. I'd be very surprised to see the same symptoms there.
                                                                  Joe
ID: 1710236
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1710481 - Posted: 9 Aug 2015, 17:36:01 UTC - in response to Message 1710236.  

Thanks for confirming Jeff's analysis of why that task got credit. I did some testing yesterday with Prime95 and analyzed which CPU task was on which CPU core. I can definitely state that Core #5 is the culprit. So I ended up downclocking the core speeds until I could get Prime95 to pass without errors. It was obvious when I started testing: Core #5 failed instantly on Prime95 upon test start. So, either that core was always weak or the couple of years of overclocking on air weakened it. It never caused an issue with Windows or system stability, but it obviously can't do any kind of math with any sort of precision. So, plans look to have changed. While downclocked, I ran Prime95 on the blend test, which is supposed to exercise a lot of memory along with the CPU, for about 3 hours without errors. I have run MemTest86+ several times on the memory with no errors. So my plan of removing the radiator and reseating the memory is not so urgent. The plan of removing the coldplate and trying out new TIM is also not so urgent. I have figured out the problem, and the solution should be to replace the CPU. So I will wait until I have procured another CPU before breaking the coldplate free from the current chip.
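
For anyone wanting to try the "which core is it" idea without Prime95, the general approach can be sketched as: pin a deterministic floating-point workload to one core at a time and compare each result against a reference. Everything below is an assumption-laden illustration, not a substitute for Prime95. It relies on `os.sched_setaffinity`, which Python only provides on Linux (so it would not run as-is on the Windows host discussed here), the workload is tiny compared with a real torture test, and the reference is computed on whichever core the process happens to start on.

```python
import math
import os


def workload(n=50_000):
    """Deterministic floating-point sum; healthy hardware should give identical bits every run."""
    total = 0.0
    for i in range(1, n + 1):
        total += math.sqrt(i) * math.sin(i)
    return total


def check_cores(rounds=2):
    """Run the workload pinned to each allowed core; return {core: all results matched}."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("core pinning via os.sched_setaffinity is Linux-only")
    original = os.sched_getaffinity(0)  # cores this process may currently use
    results = {}
    try:
        reference = workload()  # assumption: the starting core is trustworthy
        for core in sorted(original):
            os.sched_setaffinity(0, {core})  # pin this process to a single core
            results[core] = all(workload() == reference for _ in range(rounds))
    finally:
        os.sched_setaffinity(0, original)  # always restore the original affinity
    return results


if __name__ == "__main__":
    for core, ok in check_cores().items():
        print(f"core {core}: {'ok' if ok else 'MISMATCH'}")
```

A clean pass here proves very little, since a marginal core may only fail under the heat and load of a proper stress test, but a mismatch on one core and not the others would point the same way Prime95 did.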
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1710481
Profile Jeff Buck · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1710526 - Posted: 9 Aug 2015, 18:49:05 UTC - in response to Message 1710481.  

That's certainly an interesting, and expensive, discovery. I would've bet on the memory being the culprit, rather than the CPU, but I don't have the expertise to do more than just make guesses when it comes to hardware quirks. Good luck with the CPU replacement!
ID: 1710526
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1710581 - Posted: 9 Aug 2015, 21:24:08 UTC - in response to Message 1710526.  

I still had doubts about the memory, so I ran two different test cases with Prime95. The blend test exercised both CPU and memory with a fail but couldn't pinpoint where the weakness was. Then I ran the small FFT test with Prime95, which runs mostly on the CPU with little memory exposure. Still the instant fail on Core #5. I still feel pretty comfortable with the integrity of the RAM because of the multiple successful tests with MemTest86+. I could put the old Phenom X6 1100T back in this computer or just go with a replacement FX processor. Of course you still have to deal with the silicon lottery with a new chip. Have to make that decision based on what the finances deem sensible. In hindsight, I should have run the Prime95 tests long ago. I had never looked into that program. Only knew of it from all the references to using it for stress testing overclocks. Since I achieved my original overclock with a simple multiplier bump and ended up with a stable system for the last 3 years, I thought the silicon was up to the task. Evidently not. By downclocking the chip, I am running at a core voltage about 0.08V less than where it was running. That actually matches the core voltage on my other dedicated cruncher, which has not shown any errors. My plan was to just run these systems until the AMD Zen chip shows up in 2016 or 2017 and build new systems based on that platform. I have considered myself an AMD fanboy ever since I leaped off the Intel bleeding edge cost train. I go all the way back to the Nexgen NX686. Decisions....decisions. Thanks everyone for all your help.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1710581
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1712096 - Posted: 13 Aug 2015, 1:57:18 UTC - in response to Message 1710581.  
Last modified: 13 Aug 2015, 2:03:28 UTC

Just an update. Put in a new FX-8370 chip yesterday and benched it at 4.6 GHz on Prime95 Small FFT and Blend tests for about 6 hours. No errors. The reported results so far show no more invalids. Will of course need to monitor the system for several months, but I'm feeling pretty confident that the previous errors were from the failing FX-8350 chip and not the system memory. Today I replaced the failing hard drive in the Moxi DVR. Guess I got a lot more life than expected out of the original drive. Exactly 4 years to the day. A lot of other owners only got about a year out of the original drive. Replaced it with the same brand and size as the original, just a newer version. Will have to see whether I stop losing recordings to image breakup from an eroding surface. Reporting the latest from the hardware wars .......

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1712096
Profile Zombu2
Volunteer tester

Joined: 24 Feb 01
Posts: 1615
Credit: 49,315,423
RAC: 0
United States
Message 1712100 - Posted: 13 Aug 2015, 2:10:44 UTC

Now that is interesting. I also have an FX-8350 where core 5 is faulty. I had to turn it off .... it's been running fine on 7 cores for over a year now.
I came down with a bad case of i don't give a crap
ID: 1712100
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1712390 - Posted: 13 Aug 2015, 17:33:39 UTC - in response to Message 1712100.  

When did you get your chip? The one that failed was 3 years old. My newer chip is only 2 years old but of better quality. Runs at much less voltage. I wonder if the module that contains Core #5 was always a weak link in their manufacturing. The new 8370 has the same 1.30V VID as my newer 8350. Currently I have it at a stable core voltage of 1.34V at 4.6 GHz, 600 MHz over stock clocks. I didn't feel the need to push it to the limits, just match or slightly better my old 8350 at 4.4 GHz. It runs about 100 watts more than the old 8350. I really don't need to use any more power; I use enough as it is. I hear that the newer 8370 and 8370E use more refined manufacturing processes than when the 8350 was first made. I shall see, I guess.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1712390


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.