Why this CPU task invalid so soon?

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1709103 - Posted: 6 Aug 2015, 2:36:20 UTC - in response to Message 1709095.  

Good luck with all that, Keith. If it is just a memory bit or bits starting to fail with some intermittent consistency, though, there may not be much you can do short of replacement. If it were me, cheapskate that I am, I'd probably just try swapping out one stick at a time to see if the problem went away. However, it sounds like the geography in your box would make an all or nothing approach less labor intensive.
ID: 1709103

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1709109 - Posted: 6 Aug 2015, 2:53:41 UTC

Say, Keith, take a look at your task 4290554489. I went to look at your two new Invalids and then decided to look at a nearby Inconclusive.

The Inconclusive also appears to have the same "Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0" as the Invalids, yet this task actually shows an Autocorr count of 1 and is thus far only Inconclusive rather than instantly Invalid. I wonder how that happens?
ID: 1709109

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1709111 - Posted: 6 Aug 2015, 3:12:06 UTC - in response to Message 1709109.  
Last modified: 6 Aug 2015, 3:24:57 UTC

It looks almost as though that best autocorrelation wasn't saved across that restart. That host could be a suitable victim for multiple-restart experiments, since all the weaknesses found in the stderr truncation debate may well apply here too [but to state or result files]. That'd be important for application developers to know about IMO, even if Boinc developers prefer to keep the onus for intact files on the client. [afaik the client has no interaction with the state/result files until completion, so any damage there lies solely within the domain of the app and the boincapi linked into it]

[Edit:] It ^could^ well mean (hypothesising) that anything less than a full commit AND a patient client is just a workaround (which we knew), but problems manifesting in a repeatable way would be surprising, interesting and a smoking gun at the same time.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1709111

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1709465 - Posted: 6 Aug 2015, 21:04:00 UTC - in response to Message 1709109.  

Say, Keith, take a look at your task 4290554489. I went to look at your two new Invalids and then decided to look at a nearby Inconclusive.

The Inconclusive also appears to have the same "Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0" as the Invalids, yet this task actually shows an Autocorr count of 1 and is thus far only Inconclusive rather than instantly Invalid. I wonder how that happens?

Thanks for catching that, Jeff.

Sensibly, the Validator recognizes that a reported Autocorr indicates Autocorr processing has been done just as a best_autocorr does, so Keith should get credit on the task with a weakly similar comparison.

As Jason says, it looks like the best_autocorr got lost in the restart at 90.62 percent. Those "best" signals are saved in the state.sah file in the slot directory, written in the order spike, autocorr, gaussian, pulse, and triplet. Having the other types properly recovered by the restart deepens the puzzle.

The state.sah file is rewritten completely at every checkpoint, so it's possible the internal state structure which it reflects is where the glitch happened, and that could have been long before the restart. The file is in xml form so can be viewed as text and ought to have a <best_autocorr> section for any task which has run for more than a minute. But with only a small fraction of tasks showing the issue it would be difficult to capture something meaningful.
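
A minimal sketch of that kind of inspection, in Python. Only <best_autocorr> is named above; the other tag names (best_spike, best_gaussian, best_pulse, best_triplet) are assumptions inferred from the write order, and the function names are hypothetical. It uses plain string search rather than an XML parser, since nothing here confirms the file has a single root element.

```python
# Tag names other than best_autocorr are ASSUMPTIONS based on the write
# order described above: spike, autocorr, gaussian, pulse, triplet.
EXPECTED_SECTIONS = ["best_spike", "best_autocorr", "best_gaussian",
                     "best_pulse", "best_triplet"]

def missing_best_sections(state_text):
    """Return the expected section names that have no opening tag in the text."""
    return [name for name in EXPECTED_SECTIONS
            if "<" + name + ">" not in state_text]

def sections_in_expected_order(state_text):
    """True if the sections that are present appear in the expected order."""
    positions = [state_text.find("<" + name + ">") for name in EXPECTED_SECTIONS]
    found = [p for p in positions if p != -1]
    return found == sorted(found)

if __name__ == "__main__":
    # Hypothetical checkpoint fragment with the autocorr section absent.
    sample = "<best_spike>...</best_spike>\n<best_gaussian>...</best_gaussian>\n"
    print(missing_best_sections(sample))
```

Run against a copy of a real state.sah, a non-empty result for a task that has been running for more than a minute would be the thing to look at more closely.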

Keith, you earlier asked about possible diagnostic command line arguments. I think using "-v 2" (without the quotes) may be in order. That increases the verbosity so updates to "best" signals are included in stderr.txt. It can be a lot of data to look through but could help determine if the app is sometimes simply failing to do Autocorr processing.
                                                                 Joe
ID: 1709465

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1709494 - Posted: 6 Aug 2015, 22:20:25 UTC

Thanks for the comments, folks. Very interesting that the autocorr results before the restart didn't survive across the restart. Josef, what about the diagnostic option? I looked through the Wiki but didn't come across that option. Am I correct that it is a command line parameter for the BOINC client and not one of the config files?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1709494

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1709509 - Posted: 6 Aug 2015, 23:12:18 UTC - in response to Message 1709494.  

Thanks for the comments, folks. Very interesting that the autocorr results before the restart didn't survive across the restart. Josef, what about the diagnostic option? I looked through the Wiki but didn't come across that option. Am I correct that it is a command line parameter for the BOINC client and not one of the config files?

No, it would be a parameter for the specific SETI@Home science application you're running.
ID: 1709509

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1709520 - Posted: 6 Aug 2015, 23:53:43 UTC - in response to Message 1709465.  
Last modified: 6 Aug 2015, 23:55:20 UTC

As Jason says, it looks like the best_autocorr got lost in the restart at 90.62 percent. Those "best" signals are saved in the state.sah file in the slot directory, written in the order spike, autocorr, gaussian, pulse, and triplet. Having the other types properly recovered by the restart deepens the puzzle.

Since my xw9400 is also an 8-core AMD machine (except, apparently, under Win 10) that runs CPU tasks and is shut down for most of each day, I was curious to see if it exhibited that same "Best autocorr: peak=0 ..." quirk following any of the restarts. After searching my archives all the way back to the beginning of the year, I have to say that it doesn't.

I did find a small number of occurrences (just 30 in 7+ months), but every one was on a -9 overflow task, most with 30 Spikes (or, at least, a high Spike count), although one did have 30 Pulses, instead. Obviously, those were all very short-running tasks (most about 15 seconds or less, I think) and none were restarted.
ID: 1709520

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1709525 - Posted: 7 Aug 2015, 0:05:23 UTC - in response to Message 1709509.  

Thanks for the comments, folks. Very interesting that the autocorr results before the restart didn't survive across the restart. Josef, what about the diagnostic option? I looked through the Wiki but didn't come across that option. Am I correct that it is a command line parameter for the BOINC client and not one of the config files?

No, it would be a parameter for the specific SETI@Home science application you're running.

Simplest way: There should be an empty cmdline_AKv8c_r2549_winx86-64_SSE42xjfs.txt file in your projects\setiathome.berkeley.edu folder. Insert the -v 2 with a plain text editor such as Notepad, and subsequent starts or restarts of CPU tasks will have the increased stderr verbosity.
                                                                   Joe
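
For anyone who would rather script the edit than use Notepad, here is a minimal Python sketch. The helper names are hypothetical, and the demo runs on a scratch copy rather than the real file so that nothing in a BOINC installation is touched; point it at the cmdline file named above only once you are happy with what it does.

```python
import tempfile
from pathlib import Path

def has_option(tokens, wanted):
    """True if the wanted tokens appear consecutively in tokens."""
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

def add_verbosity(cmdline_file, option="-v 2"):
    """Append the option to the command-line file unless it is already present."""
    path = Path(cmdline_file)
    current = path.read_text().strip() if path.exists() else ""
    if has_option(current.split(), option.split()):
        return current
    updated = (current + " " + option).strip()
    path.write_text(updated)
    return updated

if __name__ == "__main__":
    # Demonstrated on a scratch copy; the real file lives in the
    # projects\setiathome.berkeley.edu folder of the BOINC data directory.
    scratch = Path(tempfile.mkdtemp()) / "cmdline_AKv8c_r2549_winx86-64_SSE42xjfs.txt"
    scratch.write_text("")
    print(add_verbosity(scratch))
```

Calling it twice leaves a single -v 2 in the file, so it is safe to rerun.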
ID: 1709525

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1709530 - Posted: 7 Aug 2015, 0:18:17 UTC

Just a thought. It appears that a restart above at least 52% is a common theme in all of Keith's current Invalids (and that one Inconclusive I spotted). Does the autocorr checking usually occur in a specific phase of the processing (i.e., perhaps before or after the halfway mark), or is it spread across the entire duration of the task?
ID: 1709530

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1709538 - Posted: 7 Aug 2015, 0:54:52 UTC - in response to Message 1709525.  

Thanks Jeff, Josef and Richard. Done: the command line file is modified for greater stderr verbosity. I was thinking on a more global scale and not about the specifics of the app that is failing on this machine. I saw the explanation in the Lunatics docs for the app. It must be something specific to this machine, with memory or the CPU failing. My other machine, with an identical processor, isn't glitching. The motherboards are the same except that the glitching one is one generation later. The glitchy machine has twice as much memory, of a different brand. The other machine has the better-quality processor going by VID voltage; the two differ by about 0.08 V.

Another thought: since the restart of a partly processed MB CPU task is almost always because the machines are shut down for the night, is it possible that the last write of the state file doesn't happen before the machine powers off? The BOINC directories on both machines are on C:\ and both machines have the same brand and size of SSD. Could the final write flush to the SSD be getting missed on the glitchy computer? My routine is to shut the BOINC Manager client down manually before actually shutting down the machine. I am not letting Windows terminate processes as it sees fit. The glitchy computer does have a lot more processes running on it, though, as it is my daily driver. The other machine is solely a BOINC cruncher running no other processes.
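
As background to the flush question, here is a general write-then-flush-then-rename pattern, sketched in Python. This is not a description of how the SETI@home app or boincapi writes state.sah; the function name is hypothetical and the pattern is shown only to make "the final write flush" concrete.

```python
import os
import tempfile

def write_durably(path, data):
    """Write bytes to path via a temp file, flushing before the rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the OS
            os.fsync(tmp.fileno())   # ask the OS to push it to the device
        os.replace(tmp_path, path)   # rename over the old file in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Without the fsync step, a file can look complete to the program that wrote it while part of it is still sitting in an operating system cache. Even with it, a drive's own write cache may add a further delay, so this narrows the window rather than closing it.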

Well, about time to power down for the night. My daily solar output has been crap all this week because of all the smoke from the fires. The power company made some money off me this week.

Thanks again all for your help. Interesting problem I seem to have.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1709538

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1709543 - Posted: 7 Aug 2015, 1:03:52 UTC - in response to Message 1709530.  

Just a thought. It appears that a restart above at least 52% is a common theme in all of Keith's current Invalids (and that one Inconclusive I spotted). Does the autocorr checking usually occur in a specific phase of the processing (i.e., perhaps before or after the halfway mark), or is it spread across the entire duration of the task?

Autocorr processing is done at the 128K FFT length only, and that length is only used out to +/- 30 chirp, so your observation may indeed be a clue. Chirp magnitude and progress run more or less the same, with final chirp magnitude limit 100, though it's not exactly in lock step because different processors react differently to the change in processing at chirp 30. For VLAR tasks it's a reasonable match on my systems, though.

It's true that there cannot be a new Best Autocorr found after a restart at chirp magnitude 30 or later. Still, the code to parse the state.sah checkpoint file at a restart should get the existing best_autocorr as easily as it does other signal types.
                                                                   Joe
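
The arithmetic behind that can be sketched in a few lines of Python. The linear mapping from progress to chirp magnitude is an assumption (the post says the two only run "more or less the same"), and the function names are hypothetical.

```python
AUTOCORR_CHIRP_LIMIT = 30.0   # the 128K FFT length is only used out to +/-30 chirp
FINAL_CHIRP_LIMIT = 100.0     # final chirp magnitude limit

def approx_chirp_at(progress_percent):
    """Rough chirp magnitude, ASSUMING it scales linearly with progress."""
    return FINAL_CHIRP_LIMIT * progress_percent / 100.0

def autocorr_still_ahead(progress_percent):
    """True if, under the linear assumption, Autocorr processing remains."""
    return approx_chirp_at(progress_percent) < AUTOCORR_CHIRP_LIMIT

if __name__ == "__main__":
    # 20.0 is a made-up early restart; 52.0 and 90.62 come from the thread.
    for pct in (20.0, 52.0, 90.62):
        print(pct, autocorr_still_ahead(pct))
```

Under that rough mapping, a restart at 52% or later falls well past chirp 30, so no new Best Autocorr could be found afterwards and whatever the checkpoint held is all the task would ever report.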
ID: 1709543

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1709550 - Posted: 7 Aug 2015, 1:15:36 UTC - in response to Message 1709543.  
Last modified: 7 Aug 2015, 1:25:12 UTC

Just a thought. It appears that a restart above at least 52% is a common theme in all of Keith's current Invalids (and that one Inconclusive I spotted). Does the autocorr checking usually occur in a specific phase of the processing (i.e., perhaps before or after the halfway mark), or is it spread across the entire duration of the task?

Autocorr processing is done at the 128K FFT length only, and that length is only used out to +/- 30 chirp, so your observation may indeed be a clue. Chirp magnitude and progress run more or less the same, with final chirp magnitude limit 100, though it's not exactly in lock step because different processors react differently to the change in processing at chirp 30. For VLAR tasks it's a reasonable match on my systems, though.

It's true that there cannot be a new Best Autocorr found after a restart at chirp magnitude 30 or later. Still, the code to parse the state.sah checkpoint file at a restart should get the existing best_autocorr as easily as it does other signal types.
                                                                   Joe

Well, what I was wondering was what the effect would be if that bit flip, or whatever is happening on Keith's machine, occurs on initial startup, when that startup occurs after a task is done with Autocorr processing. If that Best Autocorr gets zapped at that point, can it ever get repopulated with non-zero values?

EDIT: Assuming that the bit(s) flipping occurs after the checkpoint file has been read in, that is.
ID: 1709550

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1709555 - Posted: 7 Aug 2015, 1:33:51 UTC - in response to Message 1709543.  

Thanks for the answer to Jeff's question. I thought it a good one based on observations with my inconclusives and invalids. I've reported my first CPU task with the verbosity level of 2. Task 4295961549 validated with no issues.

The reason for so many exits on this task, though, is that it kept getting pushed to waiting: with the number of MW and Einstein tasks working on the GPUs, the CPU usage added up enough to dump a core. I am seeing MW tasks run more frequently now that I've made the <rec_half_life_days>1.000000</rec_half_life_days> change to cc_config that Richard advised, to get my project priority back to normal after the MW account corruption. I will have to wait and see how further CPU tasks fare: those that run non-stop, and those that are exited by a computer shutdown after 52%, or once autocorr processing is finished, but before completion.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1709555

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1709562 - Posted: 7 Aug 2015, 1:49:20 UTC
Last modified: 7 Aug 2015, 1:51:34 UTC

Keith, I'm going to offer a suggestion that occurs to me. Before you shut down, try turning off the BOINC auto-start, so that it doesn't start up with Windows. Then, before you start BOINC after your next reboot, review the various state.sah fields for the suspended tasks to see if anything looks suspicious. I'm thinking that it could be just as likely that the bit-flipping occurs on shutdown as it does on startup. Reviewing the state.sah files before starting BOINC might be useful.

EDIT: Joe, does that sound reasonable?
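
Because state.sah is rewritten at every checkpoint, whatever is on disk after a reboot is gone soon after BOINC starts. One way to keep it for later review is to copy the files aside first. The Python sketch below assumes the usual layout of numbered folders under a slots directory in the BOINC data directory; the function name and the archive naming are hypothetical.

```python
import shutil
from pathlib import Path

def snapshot_state_files(boinc_data_dir, archive_dir):
    """Copy every slots/<n>/state.sah into archive_dir as slot_<n>_state.sah."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    copied = []
    # ASSUMPTION: each running task has its own numbered folder under slots/.
    for state_file in sorted(Path(boinc_data_dir).glob("slots/*/state.sah")):
        target = archive / ("slot_" + state_file.parent.name + "_state.sah")
        shutil.copy2(state_file, target)   # copy2 keeps the modification time
        copied.append(target)
    return copied
```

Keeping the modification time makes it possible to see afterwards whether the last checkpoint was written at shutdown or some minutes earlier.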
ID: 1709562

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1709603 - Posted: 7 Aug 2015, 5:01:20 UTC

Just for the exercise, I put together a little routine to grab all the completed task results for the wayward machine and search them for "Best autocorr: peak=0". I found 9 altogether (out of 799), the Invalids and one Inconclusive that we had already identified, plus several more, as follows:

Invalid
4294404762 - Last restart at 52.63%
4294219908 - Last restart at 54.43%
4290554474 - Last restart at 74.76%

Inconclusive
4294614470 - Last restart at 65.07%
4290554489 - Last restart at 90.62%
4274492714 - Last restart at 75.71%

Validation Pending
4294607734 - Last restart at 42.68%
4294369776 - Last restart at 56.98%

Validated (-9 overflow)
4294369884 - No restart

Perhaps those will provide a little more food for thought on the morrow!
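
A search of this kind can be sketched in Python as follows. This is not Jeff's actual routine, which isn't shown in the thread; the function names are hypothetical, and the line format is taken from the "Best autocorr: ..." line quoted earlier, assuming one such line per stderr output.

```python
import re

BEST_AUTOCORR = re.compile(r"Best autocorr:\s*(.*)")

def parse_best_autocorr(stderr_text):
    """Return the Best autocorr fields as a dict of strings, or None if absent."""
    match = BEST_AUTOCORR.search(stderr_text)
    if match is None:
        return None
    fields = {}
    for pair in match.group(1).split(","):
        key, sep, value = pair.strip().partition("=")
        if sep:
            fields[key] = value
    return fields

def has_zero_peak(stderr_text):
    """True if the text contains a Best autocorr line with peak=0."""
    fields = parse_best_autocorr(stderr_text)
    return fields is not None and fields.get("peak") == "0"

if __name__ == "__main__":
    line = ("Best autocorr: peak=0, time=-2.123e+011, delay=0, "
            "d_freq=0, chirp=0, fft_len=0")
    print(parse_best_autocorr(line))
    print(has_zero_peak(line))
```

Keeping the values as strings means an unusual entry such as nan is preserved exactly as it was printed, and the same dict can be searched for any other field value of interest.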
ID: 1709603

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1709765 - Posted: 7 Aug 2015, 16:46:00 UTC - in response to Message 1709562.  

Sounds like a very doable and intelligent plan. So, I have turned off autostart of the Boinc Manager on both my machines. I should have been doing that already, since I have had to stop BM at each start anyway so I can use Nvidia Inspector to bump the P2 state memory speed back to normal before restarting. I guess the only reason it was on before was so that BOINC would restart automatically on an unattended machine if it rebooted on its own.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1709765

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1709767 - Posted: 7 Aug 2015, 16:47:01 UTC

Looking for additional clues, I considered the other inconclusives for host 5741129.

Two interesting cases are 4276853829 and 4276900953, neither of which had a restart but both show the Best Autocorr as having peak=nan (Not a Number). IOW, even when the checkpoint file isn't used there's a serious problem with the best autocorr processing. That suggests data corruption. Both those also have apparently good reported Autocorr signals too.
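
For readers who haven't met NaN before, the Python sketch below shows the standard floating-point behaviour that makes it awkward in a "best so far" slot. It illustrates IEEE 754 comparison rules only; is_better is a hypothetical stand-in and not a description of how the app actually compares signals.

```python
import math

def is_better(candidate_peak, best_peak):
    """Naive 'keep the larger peak' test, written only to show NaN behaviour."""
    return candidate_peak > best_peak

if __name__ == "__main__":
    nan_peak = float("nan")
    print(str(nan_peak))               # a binary float NaN prints as text
    print(is_better(5.0, nan_peak))    # every ordered comparison with NaN is False
    print(is_better(nan_peak, 5.0))
    print(math.isnan(nan_peak))        # the reliable way to detect it
```

Since every ordered comparison involving NaN is false, a test of this shape would never replace a NaN with a real value, nor a real value with a NaN.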

Then there are 4290408209 and 4295778743 in which the last restart was early enough that there was some more autocorr processing. But both had reportable Autocorrs found before the restart much better than what is shown for Best Autocorr.

Other than the indication the state.sah checkpoint file is not necessarily involved, that doesn't really clarify the situation much. I hope full runs with the -v 2 verbosity will help.
                                                                  Joe
ID: 1709767

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1709780 - Posted: 7 Aug 2015, 17:11:39 UTC - in response to Message 1709767.  

Thanks for the explanation of what nan stands for, Josef. I hadn't a clue. I had seen that before and wondered what the heck it was. It will be interesting to see what happens now with autostart off. I will have at least 3 more opportunities to test that case over the weekend and Monday.

The plan is to take the system down and pull the radiator so I can access the memory sticks. I was going to just unseat them and then reseat them to refresh the socket connections. I was also entertaining shifting the sticks around. Move matched pairs to the alternate slots. Does anyone think that would be helpful in the test case? I have run about 3 passes of MemTest64 on the memory so far, all successful with no errors. The other thing I want to accomplish, since I am pulling the radiator out of the top of the case to get to the memory, is to pull the coldplate from the CPU and reattach it with the new MX-4 TIM I should receive Monday. I will be inspecting the state.sah file in the appropriate CPU slots before every startup.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1709780

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1709783 - Posted: 7 Aug 2015, 17:22:23 UTC - in response to Message 1709767.  
Last modified: 7 Aug 2015, 17:53:17 UTC

Two interesting cases are 4276853829 and 4276900953, neither of which had a restart but both show the Best Autocorr as having peak=nan (Not a Number). IOW, even when the checkpoint file isn't used there's a serious problem with the best autocorr processing. That suggests data corruption. Both those also have apparently good reported Autocorr signals too.

That peak=nan is certainly interesting. Is that peak value stored internally as a string rather than a binary value?

I just did a quick search of the host's tasks that I grabbed last night and found 6 more with peak=nan, in addition to the two you identified: 4274856996, 4276818428, 4280054979, 4282059605, 4290380829 (validated and already deleted, it appears), 4294647989.

One thing I notice that all these tasks have in common for Best autocorr is "delay=0.0003072". Could there be any significance in that?

EDIT: It just occurred to me to search for "delay=0.0003072", to see how often it might occur, and it seems that those 8 tasks are the only instances with that value, out of the 799 I grabbed.
ID: 1709783

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1709796 - Posted: 7 Aug 2015, 17:45:33 UTC - in response to Message 1709780.  
Last modified: 7 Aug 2015, 18:28:46 UTC

I was also entertaining shifting the sticks around. Move matched pairs to the alternate slots.

I would think that might at least alter the symptoms, even if it doesn't happen to actually fix it. If the glitch does then manifest itself in some other way, however, you still won't really know which stick is the culprit.

Does your BIOS allow you to change whether the memory is interleaved or ganged, tying it to a specific processor or not? My xw9400 has those settings but, frankly, I've never been able to detect any performance difference no matter how I set the options. Perhaps someone more familiar with those settings could suggest whether any changes might be beneficial.

EDIT: Ah, scratch that question about the interleaving. I was thinking you had dual quad-core processors but I just noticed that you have a single 8-core CPU.
ID: 1709796


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.