Why this CPU task invalid so soon?

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1705279 - Posted: 26 Jul 2015, 15:56:23 UTC

Just noticed an invalid Task 1849803429 that I produced yesterday. Funny thing is that it finished successfully and I was the first to report. My wingman's result is marked inconclusive and the task has gone out to a second wingman. How can my result be marked invalid before consensus?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1705279 · Report as offensive
BetelgeuseFive Project Donor
Volunteer tester

Joined: 6 Jul 99
Posts: 158
Credit: 17,117,787
RAC: 19
Netherlands
Message 1705299 - Posted: 26 Jul 2015, 16:30:28 UTC - in response to Message 1705279.  

Sometimes the uploaded file containing the results is corrupt (truncated or garbled). In that case the task is marked as invalid immediately. It is hard to determine whether this is a client-side, server-side or network issue.
If it's just the one, I would not worry about it.

Tom
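
To illustrate the "truncated" case described above, here is a minimal sketch of how a receiver might spot a cut-off upload. The `</result>` closing marker and the sample payload are hypothetical stand-ins, not the actual SETI@home result file layout, and this is not the project's real check.

```python
def looks_truncated(payload: str, closing_marker: str = "</result>") -> bool:
    """Return True when the payload does not end with the expected closing marker.

    The marker name is an assumption made for this illustration.
    """
    return not payload.rstrip().endswith(closing_marker)


# Hypothetical, simplified payload used only for the demonstration.
complete = "<result>\n<best_autocorr>peak=1</best_autocorr>\n</result>\n"
# Simulate an upload that was cut off halfway through.
truncated = complete[: len(complete) // 2]

print(looks_truncated(complete))
print(looks_truncated(truncated))
```

A check like this only catches truncation; a garbled file that still ends correctly would need a checksum or a full parse.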
ID: 1705299 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1705314 - Posted: 26 Jul 2015, 17:22:19 UTC

It looks like you also had a CPU AP task go invalid, with your result containing 30/30 pulses vs. 2/0 pulses in the two valid results.

That system isn't starting to overheat or anything, is it?
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1705314 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1705332 - Posted: 26 Jul 2015, 18:23:46 UTC

Well, the first thing I looked at was the result, because of the recent history of truncated stderr.txt problems here, supposedly solved with the 7.6.6 client. The result looked clean. The CPU is water cooled with an H-105 AIO and never exceeds 55 degrees C. However, I have been seeing some postponed CPU tasks due to Impossible Power calculations, which Jason says is likely caused by a memory problem. The last time I took the system out for a blow cleaning, I forgot that I wanted to reseat the memory sticks, which would be my first troubleshooting fix before running a Mem64 test. If that doesn't fix things, then I might have to entertain pulling the cooler and redoing the TIM. If that still does not resolve things, then it is possible that the CPU is on its way out. I still believe the problem is related to the fact that I only see it on my daily driver, and that it occurs when I am heavily using it for browsing and other non-BOINC tasks. My other cruncher, with identical hardware and settings, pretty much never has this problem and only crunches.
ID: 1705332 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1705336 - Posted: 26 Jul 2015, 18:56:08 UTC - in response to Message 1705332.  

Sounds like it's clean, reseat and test everything time? BIOS settings can go askew sometimes, for arcane reasons similar to those affecting memory. How old's the BIOS battery, while you're in there?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1705336 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1705341 - Posted: 26 Jul 2015, 19:24:33 UTC

Yes, that would be prudent. I had to go find my order history for both systems; both are almost 3 years old. Good advice on the CMOS battery, since an expected life of 3 years is the norm. If I am going to pull the system down, I might as well go for the whole shebang, even though changing more than one variable at a time is not good troubleshooting technique. It is probably still the best use of time in the long run.
ID: 1705341 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1705346 - Posted: 26 Jul 2015, 19:38:23 UTC - in response to Message 1705341.  
Last modified: 26 Jul 2015, 19:41:01 UTC

... If I am going to pull the system down, might as well go for the whole shebang even though changing more than one variable at a time is not good troubleshooting technique.


True, though in racetrack and aviation engineering, those would count as routine maintenance rather than troubleshooting. I would argue that if another round of troubleshooting turns out to be necessary after maintenance, you have at least eliminated all the easy stuff :)

Complete dismantle, inspection and rebuild is common with things in otherwise constant service. [ Depends if you wanted to know the exact connector or setting that went askew, or if going over everything in holistic fashion is likely to solve it i.e. which is less effort and will prevent more future failures? can't answer that ]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1705346 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1705357 - Posted: 26 Jul 2015, 20:13:18 UTC - in response to Message 1705336.  

Bios settings can go askew sometimes for similar arcane reasons to memory.

Oh, good...so I'm not just losing my mind! I don't normally care about sound on my xw9400, since it's usually a crunch-only box. However, one time in about March or so, I tried to watch a video and found out I had no sound, not even through headphones. Neither reinstalling nor upgrading the driver had any effect, and I just let it go, since that momentary need for sound had long passed. Then on Friday evening, after upgrading that box to Win 7 and reinstalling the drivers (again) to no effect, I suddenly thought to interrupt one of the many restarts (necessary for installing the 200+ Win 7 updates) and check the BIOS setup. Well, whadda ya know...there was a "Use integrated audio device" setting that had mysteriously been turned off...and I sure as heck don't remember changing that, ever! Turned it back on and now the xw9400 has sound again. Perhaps someday I'll actually need it. ;^)
ID: 1705357 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1705364 - Posted: 26 Jul 2015, 20:18:39 UTC - in response to Message 1705357.  

You also may need to check with any very young ones that might be around, if they played 'setup' recently. That happens too :-O
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1705364 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1705368 - Posted: 26 Jul 2015, 20:28:14 UTC - in response to Message 1705364.  

You also may need to check with any very young ones that might be around, if they played 'setup' recently. That happens too :-O

No such critters around here!
ID: 1705368 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1705407 - Posted: 26 Jul 2015, 22:23:24 UTC - in response to Message 1705279.  

Just noticed an invalid Task 1849803429 that I produced yesterday. Funny thing is that it finished successfully and I was the first to report. My wingman's result is marked inconclusive and the task has gone out to a second wingman. How can my result be marked invalid before consensus?

Your stderr says:
Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0


That indicates to a near certainty that no Autocorr processing was done: the initial state hasn't been changed. As a result there was no best_autocorr in the uploaded result file, which is the direct cause of the instant "invalid" judgement.

I downloaded a copy of the WU and it has normal autocorr parameters, so the task ought to have done those searches. One wild guess at what might have caused the issue: the autocorr_fftlen is a pure power of 2 and is stored as an integer, so a single bit flip could change it to zero, and that would turn off Autocorr processing.
                                                                  Joe
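
The single-bit-flip guess above can be checked with a little arithmetic. This is only a sketch of the reasoning, not the application's code; 131072 (128K) is used as an example power-of-2 length, and `flip_bit` is a helper made up for the illustration.

```python
def is_power_of_two(n: int) -> bool:
    """A positive integer is a power of two when exactly one bit is set."""
    return n > 0 and (n & (n - 1)) == 0


def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit position inverted (simulated soft error)."""
    return value ^ (1 << bit)


fft_len = 1 << 17  # 131072, an example power-of-2 FFT length
assert is_power_of_two(fft_len)

# The only set bit in a power of two is its highest bit.
set_bit = fft_len.bit_length() - 1

# Flipping that one bit clears the value entirely.
corrupted = flip_bit(fft_len, set_bit)

# Flipping any other bit gives a non-zero, non-power-of-2 value instead.
other = flip_bit(fft_len, 3)

print(fft_len, set_bit, corrupted, other)
```

Because only one of the integer's bits is set, exactly one specific flip produces zero; any other flip would leave a non-zero length, which would presumably fail in a different way.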
ID: 1705407 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1705416 - Posted: 26 Jul 2015, 23:31:46 UTC - in response to Message 1705407.  
Last modified: 26 Jul 2015, 23:32:42 UTC

Which would point to the original guess of memory problems. After I shut down for the night, I think the memory sticks are going to be pulled and reseated, followed by a MemTest run.
ID: 1705416 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1705664 - Posted: 27 Jul 2015, 17:18:43 UTC - in response to Message 1705416.  

Well, that didn't happen. I had to take apart the DVR last night, blow it out and reseat its cables to stop it from glitching video. This morning, before starting the systems, I was going to pull and reseat the memory. That won't happen for a while, since I forgot that the H-105 cooler in that machine covers the memory sticks. I will have to pull the radiator to remove the memory on this system. That means I will have to do the whole shebang, as I originally thought, which means half a day of downtime. I did run one complete pass of Memtest86+ with no errors after shutting down the machine from projects last night. Still no insight into why I have the problems.
ID: 1705664 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1707450 - Posted: 1 Aug 2015, 22:38:14 UTC - in response to Message 1705664.  

Maybe someone can point out what went wrong with these invalid tasks
It looks like I came up with the same results as my wingmen. Autocorr counts are the same.
ID: 1707450 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1707494 - Posted: 2 Aug 2015, 0:36:30 UTC - in response to Message 1707450.  

Maybe someone can point out what went wrong with these invalid tasks
It looks like I came up with the same results as my wingmen. Autocorr counts are the same.

These have the same issue as the one in your original post:
Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0


Whether or not there's a reportable Autocorr in a SaHv7 task, there should always be a best_autocorr with non-zero values unless the task ended in a result_overflow. The SaHv7 validator was coded to use that to reject results from older SaHv6 applications, which some anonymous platform users might have been tempted to keep using. That rejection is the instant invalid you're seeing.
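
The rule described above can be sketched roughly as follows. This is a hedged illustration only, not the actual SaHv7 validator code: the `BestAutocorr` class, the `instant_invalid` function and the choice of fields tested are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestAutocorr:
    """Hypothetical stand-in for the best_autocorr entry of a result."""
    peak: float
    delay: float
    fft_len: int


def instant_invalid(best: Optional[BestAutocorr], result_overflow: bool) -> bool:
    """Return True when a result would be rejected without waiting for a wingman."""
    if result_overflow:
        # An overflowed task may legitimately lack a best_autocorr.
        return False
    if best is None:
        # No best_autocorr at all, as an older SaHv6 app would report.
        return True
    # A best_autocorr still in its all-zero initial state is treated the same way.
    return best.peak == 0 and best.delay == 0 and best.fft_len == 0


# The values below are invented purely to exercise the function.
print(instant_invalid(BestAutocorr(0.0, 0.0, 0), result_overflow=False))
print(instant_invalid(BestAutocorr(18.2, 1.5, 131072), result_overflow=False))
```

Under this reading, a SaHv7 result whose Autocorr search was silently skipped is indistinguishable from an SaHv6 result, which is why it is rejected before any comparison with a wingman.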

Note: Even the time=-2.123e+011 is actually a zero value; the time is shown relative to the start of recording.
                                                                   Joe
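
One way to see why a large negative number can still mean zero is to assume (this is not stated in the thread) that signal times are kept as Julian dates in days and printed in seconds relative to the start of recording. Under that assumption, an unset time of 0 prints as minus the recording start:

```python
SECONDS_PER_DAY = 86400

printed_time = -2.123e11  # value shown in stderr, assumed to be in seconds
signal_time_days = 0.0    # unset (initial) signal time, assumed to be a Julian date

# Assumed display formula:
#   printed = (signal_time_days - start_days) * SECONDS_PER_DAY
# Solving for the recording start:
start_days = signal_time_days - printed_time / SECONDS_PER_DAY

print(round(start_days))
```

The result is a Julian date of roughly 2.457 million days, which falls in the mid-2010s and so is at least consistent with a recent recording. The four printed digits only pin the date down to within a year or two.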
ID: 1707494 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1707523 - Posted: 2 Aug 2015, 1:41:23 UTC - in response to Message 1707494.  

Thanks for the reply Josef. Sure wish I could get a wingman who is using the same app as me that reports results in the same manner as me so I can compare the data. I guess I still don't know how my results are different from my wingmen. I think you are saying that even if the task should have no Autocorr count, there should always be some numbers other than time reported? Is there some debug option I can turn on in the logging to fully capture what these tasks are doing?
ID: 1707523 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1707534 - Posted: 2 Aug 2015, 2:02:02 UTC - in response to Message 1707523.  

Sure wish I could get a wingman who is using the same app as me that reports results in the same manner as me so I can compare the data.

You may get your wish, Keith. It looks like the _2 task for WU 1856385074 has also gone out as "Anonymous platform (CPU)". If that one doesn't crap out, you should be able to compare.
ID: 1707534 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1709049 - Posted: 5 Aug 2015, 23:42:24 UTC - in response to Message 1707534.  
Last modified: 5 Aug 2015, 23:47:29 UTC

Well, my wingman using an anonymous platform app reported on this task. Same results as me with regard to pulses, power peaks and times. The only difference was his non-zero Autocorr values. So I have to figure out why I am flipping the bit on fft_len=128k. I went back into the BIOS and looked around once more, and saw that TurboBoost was back on again. I could have sworn that I had turned that off before. I never saw what should have been its effect, but I don't need it since I am already mildly overclocked. I didn't see anything else out of order, but I switched the power phase delivery to the system RAM from Optimized to Extreme on the chance that it might help with the bit-flipping, if that is what is actually occurring. I guess my next step is to pull the radiator coldplate and reapply some TIM after taking off the stock-applied stuff. I have never seen signs of CPU overheating, though; it's just an experiment to see whether a different kind of TIM might reduce the CPU temps at all. Supposedly the stock pre-applied stuff is some version of Shin-Etsu TIM, which is considered pretty good and is supposed to be better than the AS5 I have in my toolkit. Thinking hard on ordering some MX-4. Actually, done and on its way. I wanted to compare it to my old standby AS5 anyway.

[Edit] I see that I picked up another two invalids, apparently due to the Autocorr corruption. Curious that they are all VLARs from the same tape.
ID: 1709049 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1709089 - Posted: 6 Aug 2015, 1:32:23 UTC
Last modified: 6 Aug 2015, 1:35:40 UTC

This kinda sounds similar to the issue I ran into with a GPU in one of my machines last year with a rare, yet very consistent, malady (see Phantom Triplets). Jason and Joe both looked pretty hard at it and their general conclusion also seemed to be "random bit(s) flipping" or a "quite specific borderline circuit" in the GPU memory. Jason provided a rather entertaining and informative link discussing "soft errors" at http://en.wikipedia.org/wiki/Soft_error#Causes_of_soft_errors. Things like alpha particles, cosmic rays, thermal neutrons and "other" causes come into the picture. ;^)

Anyway, I ran the GPU equivalent of Memtest on the GTX 550Ti, as well as trying a voltage change, etc., and came up empty. So, the only alternatives either seemed to be to just live with the occasional hiccup or get a new GPU. I actually acquired a 750Ti that I was going to stick in that machine, but then the great WU drought hit in November and that machine has been pretty much in mothballs since.

EDIT: I realize that my issue was GPU-related, while yours seems to be with CPU tasks, but, in the end, it may actually come down to memory-related causes in both cases.
ID: 1709089 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1709095 - Posted: 6 Aug 2015, 2:00:28 UTC - in response to Message 1709089.  

Your commentary is interesting with regard to your GPU experience, Jeff. I ran into problems on this machine yesterday with a flaky desktop that seemed video related, and had to revert to a known good configuration. I had no desktop icons, just the background and a functional mouse and keyboard. BOINC was obviously still working hard on both the CPU and GPU, based on the temps and fan speeds, though. I wondered if it had something to do with the total takeover of MW work on the GPUs from a MW account corruption. All CPU cores were in the red, overburdened with kernel, user, interrupt and DPC threads. I figured that until I got the system back to normal (which I did today with Richard's help), I should fall back to stock GPU core speeds. I am still overclocking the GPU memory a bit, though. I figured 6 MW tasks at a time on this machine is just too much overhead for the CPU. The system seems to be behaving itself today, now that I got the Recent Estimated Credit flag option fixed. I will have to wait and see if I have slain the CPU Autocorr corruption issue. I won't have my new TIM till Monday, since I placed the order too late in the day for 2-day delivery. I should have ordered it last week when this problem showed up.
ID: 1709095 · Report as offensive


 