Why this CPU task invalid so soon?
Author | Message |
---|---|
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Good luck with all that, Keith. If it is just a memory bit or bits starting to fail with some intermittent consistency, though, there may not be much you can do short of replacement. If it were me, cheapskate that I am, I'd probably just try swapping out one stick at a time to see if the problem went away. However, it sounds like the geography in your box would make an all or nothing approach less labor intensive. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Say, Keith, take a look at your task 4290554489. I went to look at your two new Invalids and then decided to look at a nearby Inconclusive. The Inconclusive also appears to have the same "Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0" as the Invalids, yet this task actually shows an Autocorr count of 1 and is thus far only Inconclusive rather than instantly Invalid. I wonder how that happens? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
It looks almost as though that best autocorrelation wasn't saved across that restart. That host could be a suitable victim for multiple-restart experiments, since all the weaknesses found in the stderr truncation debate may well apply [but to state or result files]. That'd be important for application developers to know about IMO, even if BOINC developers prefer to keep the onus for intact files on the client. [afaik the client has no interaction with the state/result files until completion, so any damage there lies solely within the domain of the app and the boincapi linked into it] [Edit:] It ^could^ well mean (hypothesising) that anything less than a full commit AND a patient client is just a workaround (which we knew), but problems manifesting in a repeatable way would be surprising, interesting and a smoking gun at the same time. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
[quote]Say, Keith, take a look at your task 4290554489. I went to look at your two new Invalids and then decided to look at a nearby Inconclusive.[/quote] Thanks for catching that, Jeff. Sensibly, the Validator recognizes that a reported Autocorr indicates Autocorr processing has been done just as a best_autocorr does, so Keith should get credit on the task with a weakly similar comparison. As Jason says, it looks like the best_autocorr got lost in the restart at 90.62 percent. Those "best" signals are saved in the state.sah file in the slot directory, written in the order spike, autocorr, gaussian, pulse, and triplet. Having the other types properly recovered by the restart deepens the puzzle. The state.sah file is rewritten completely at every checkpoint, so it's possible the internal state structure which it reflects is where the glitch happened, and that could have been long before the restart. The file is in XML form, so it can be viewed as text, and it ought to have a <best_autocorr> section for any task which has run for more than a minute. But with only a small fraction of tasks showing the issue, it would be difficult to capture something meaningful. Keith, you earlier asked about possible diagnostic command line arguments. I think using "-v 2" (without the quotes) may be in order. That increases the verbosity so updates to "best" signals are included in stderr.txt. It can be a lot of data to look through but could help determine if the app is sometimes simply failing to do Autocorr processing. Joe |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks for the comments, folks. Very interesting that the autocorr results before the restart didn't survive across the restart. Josef, what about the diagnostic option? I looked through the Wiki but didn't come across that option. Am I correct that it is a command line parameter for the BOINC client and not one of the config files? Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
[quote]Thanks for the comments, folks. Very interesting that the autocorr results before the restart didn't survive across the restart. Josef, what about the diagnostic option? I looked through the Wiki but didn't come across that option. Am I correct that it is a command line parameter for the BOINC client and not one of the config files?[/quote] No, it would be a parameter for the specific SETI@Home science application you're running. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
[quote]As Jason says, it looks like the best_autocorr got lost in the restart at 90.62 percent. Those "best" signals are saved in the state.sah file in the slot directory, written in the order spike, autocorr, gaussian, pulse, and triplet. Having the other types properly recovered by the restart deepens the puzzle.[/quote] Since my xw9400 is also an 8-core AMD machine (except, apparently, under Win 10) that runs CPU tasks and is shut down for most of each day, I was curious to see if it exhibited that same "Best autocorr: peak=0 ..." quirk following any of the restarts. After searching my archives all the way back to the beginning of the year, I have to say that it doesn't. I did find a small number of occurrences (just 30 in 7+ months), but every one was on a -9 overflow task, most with 30 Spikes (or, at least, a high Spike count), although one did have 30 Pulses, instead. Obviously, those were all very short-running tasks (most about 15 seconds or less, I think) and none were restarted. |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
[quote]Thanks for the comments, folks. Very interesting that the autocorr results before the restart didn't survive across the restart. Josef, what about the diagnostic option? I looked through the Wiki but didn't come across that option. Am I correct that it is a command line parameter for the BOINC client and not one of the config files?[/quote] Simplest way: There should be an empty cmdline_AKv8c_r2549_winx86-64_SSE42xjfs.txt file in your projects\setiathome.berkeley.edu folder. Insert the -v 2 with a plain text editor such as Notepad, and subsequent starts or restarts of CPU tasks will have the increased stderr verbosity. Joe |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Just a thought. It appears that a restart above at least 52% is a common theme in all of Keith's current Invalids (and that one Inconclusive I spotted). Does the autocorr checking usually occur in a specific phase of the processing (i.e., perhaps before or after the halfway mark), or is it spread across the entire duration of the task? |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks Jeff, Josef and Richard. Done, command line file modified for greater stderr verbosity. I was thinking on a more global scale and not of the specifics of the app that is failing on this machine. Saw the explanation in the Lunatics docs for the app. It must be something specific to this machine: memory or CPU failing. My other machine with an identical processor isn't glitching. Same motherboards, except the glitching one is one generation later. Twice as much memory, and a different brand, in the glitchy one. The other machine has a better-quality processor judging by its VID voltage, which is about 0.08 V lower than the failing machine's. On another thought: since the restart of the MB CPU task being processed is almost always due to shutting the machines down for the night, is it possible the last write of the state file doesn't happen before the machine powers off? The BOINC directories on both machines are on C:\ and both machines have the same brand and size of SSD. Could the final write flush to the SSD be getting missed on the glitchy computer? My routine is to shut the client BOINC Manager down manually before actually performing the machine shutdown. I am not letting Windows terminate processes as it sees fit. The glitchy computer does have a lot more processes running on it, though, as it is my daily driver. The other machine is solely a BOINC cruncher running no other processes. Well, about time to power down for the night. My daily solar output has been crap all this week because of all the smoke from the fires. The power company made some money off me this week. Thanks again all for your help. Interesting problem I seem to have. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
[quote]Just a thought. It appears that a restart above at least 52% is a common theme in all of Keith's current Invalids (and that one Inconclusive I spotted). Does the autocorr checking usually occur in a specific phase of the processing (i.e., perhaps before or after the halfway mark), or is it spread across the entire duration of the task?[/quote] Autocorr processing is done at the 128K FFT length only, and that length is only used out to +/- 30 chirp, so your observation may indeed be a clue. Chirp magnitude and progress run more or less the same, with final chirp magnitude limit 100, though it's not exactly in lock step because different processors react differently to the change in processing at chirp 30. For VLAR tasks it's a reasonable match on my systems, though. It's true that there cannot be a new Best Autocorr found after a restart at chirp magnitude 30 or later. Still, the code to parse the state.sah checkpoint file at a restart should get the existing best_autocorr as easily as it does other signal types. Joe |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
[quote]Just a thought. It appears that a restart above at least 52% is a common theme in all of Keith's current Invalids (and that one Inconclusive I spotted). Does the autocorr checking usually occur in a specific phase of the processing (i.e., perhaps before or after the halfway mark), or is it spread across the entire duration of the task?[/quote] Well, what I was wondering was what the effect would be if that bit flip, or whatever is happening on Keith's machine, occurs on initial startup, when that startup occurs after a task is done with Autocorr processing. If that Best Autocorr gets zapped at that point, can it ever get repopulated with non-zero values? EDIT: Assuming that the bit(s) flipping occurs after the checkpoint file has been read in, that is. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks for the answer to Jeff's question. I thought it a good one based on observations with my inconclusives and invalids. I've reported my first CPU task with the verbosity level of 2. Task 4295961549 validated with no issues. The reason for so many exits on this task, though, is that it kept getting pushed to waiting: with the number of MW and Einstein tasks working on the GPUs, the CPU usage added up enough to dump a core. I am seeing MW tasks run more frequently now that I've made the <rec_half_life_days>1.000000</rec_half_life_days> change to cc_config that Richard advised to get my project priority back to normal after the MW account corruption. Will have to wait and see on further CPU tasks that either run non-stop or get exited by a computer shutdown after 52%, that is, after the autocorr processing is finished but before completion. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Keith, I'm going to offer a suggestion that occurs to me. Before you shut down, try turning off the BOINC auto-start, so that it doesn't start up with Windows. Then, before you start BOINC after your next reboot, review the various state.sah fields for the suspended tasks to see if anything looks suspicious. I'm thinking that it could be just as likely that the bit-flipping occurs on shutdown as it does on startup. Reviewing the state.sah files before starting BOINC might be useful. EDIT: Joe, does that sound reasonable? |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Just for the exercise, I put together a little routine to grab all the completed task results for the wayward machine and search them for "Best autocorr: peak=0". I found 9 altogether (out of 799): the Invalids and one Inconclusive that we had already identified, plus several more, as follows. Invalid: 4294404762 (last restart at 52.63%), 4294219908 (54.43%), 4290554474 (74.76%). Inconclusive: 4294614470 (65.07%), 4290554489 (90.62%), 4274492714 (75.71%). Validation Pending: 4294607734 (42.68%), 4294369776 (56.98%). Validated (-9 overflow): 4294369884 (no restart). Perhaps those will provide a little more food for thought on the morrow! |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Sounds like a very doable and intelligent plan. So, I have turned off autostart of the BOINC Manager on both my machines. I should have been doing that already, since I have had to stop BM at each start anyway so I can use Nvidia Inspector to bump the P2 state memory speed back to normal before restarting. The only reason it was on before, I guess, was so that BOINC would restart automatically on an unattended machine if it rebooted on its own. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Looking for additional clues, I considered the other inconclusives for host 5741129. Two interesting cases are 4276853829 and 4276900953, neither of which had a restart but both show the Best Autocorr as having peak=nan (Not a Number). IOW, even when the checkpoint file isn't used there's a serious problem with the best autocorr processing. That suggests data corruption. Both those also have apparently good reported Autocorr signals too. Then there are 4290408209 and 4295778743 in which the last restart was early enough that there was some more autocorr processing. But both had reportable Autocorrs found before the restart much better than what is shown for Best Autocorr. Other than the indication the state.sah checkpoint file is not necessarily involved, that doesn't really clarify the situation much. I hope full runs with the -v 2 verbosity will help. Joe |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks for the explanation of what nan stands for, Josef. I hadn't a clue. I had seen that before and wondered what the heck. It will be interesting to see what happens now with autostart off. I will have at least 3 more opportunities to test that case over the weekend and Monday. The plan is to take the system down and pull the radiator so I can access the memory sticks. I was going to just unseat them and then reseat them to refresh the socket connections. I was also entertaining shifting the sticks around. Move matched pairs to the alternate slots. Does anyone think that would be helpful in the test case? I have run about 3 passes of MemTest64 on the memory so far, all successful with no errors. The other thing I want to accomplish: since I am pulling the radiator out of the top of the case to get to the memory, I will also pull the coldplate from the CPU and reattach it with the new MX-4 TIM I should receive Monday. I will be inspecting the state.sah file in the appropriate CPU slots before every startup. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
[quote]Two interesting cases are 4276853829 and 4276900953, neither of which had a restart but both show the Best Autocorr as having peak=nan (Not a Number). IOW, even when the checkpoint file isn't used there's a serious problem with the best autocorr processing. That suggests data corruption. Both those also have apparently good reported Autocorr signals too.[/quote] That peak=nan is certainly interesting. Is that peak value stored internally as a string rather than a binary value? I just did a quick search of the host's tasks that I grabbed last night and found 6 more with peak=nan, in addition to the two you identified: 4274856996, 4276818428, 4280054979, 4282059605, 4290380829 (validated and already deleted, it appears), 4294647989. One thing I notice that all these tasks have in common for Best autocorr is "delay=0.0003072". Could there be any significance in that? EDIT: It just occurred to me to search for "delay=0.0003072", to see how often it might occur, and it seems that those 8 tasks are the only instances with that value, out of the 799 I grabbed. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
[quote]I was also entertaining shifting the sticks around. Move matched pairs to the alternate slots.[/quote] I would think that might at least alter the symptoms, even if it doesn't happen to actually fix it. If the glitch does then manifest itself in some other way, however, you still won't really know which stick is the culprit. EDIT: Ah, scratch that question about the interleaving. I was thinking you had dual quad-core processors but I just noticed that you have a single 8-core CPU. |
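
Jeff's search of the saved task results for "Best autocorr: peak=0" and peak=nan could look roughly like the Python sketch below. It is not his actual routine, which the thread does not show. The "Best autocorr" pattern follows the line quoted earlier in the thread; the wording of the restart line ("Restarted at N percent") is an assumption, and `classify` is a hypothetical helper name.

```python
import math
import re

# Pattern follows the stderr line quoted in the thread:
#   Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0
BEST_AUTOCORR = re.compile(
    r"Best autocorr: peak=(?P<peak>[^,\s]+), time=[^,]+, delay=(?P<delay>[^,\s]+)"
)
# ASSUMPTION: the exact wording of the restart line in stderr is not shown
# in the thread; adjust this pattern to match the real output.
RESTART = re.compile(r"Restarted at (?P<pct>\d+(?:\.\d+)?) percent")


def classify(stderr_text):
    """Return (status, last_restart_pct) for one task's stderr text.

    status is "missing", "nan", "zero" or "ok"; last_restart_pct is None
    when the task was never restarted.
    """
    restarts = [float(m.group("pct")) for m in RESTART.finditer(stderr_text)]
    last_restart = restarts[-1] if restarts else None

    match = BEST_AUTOCORR.search(stderr_text)
    if match is None:
        return "missing", last_restart

    peak = float(match.group("peak"))  # float() also accepts the text "nan"
    if math.isnan(peak):
        return "nan", last_restart
    if peak == 0.0:
        return "zero", last_restart
    return "ok", last_restart


# Sample built from the line quoted in the thread plus the assumed restart line.
sample = (
    "Restarted at 90.62 percent.\n"
    "Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0\n"
)
print(classify(sample))
```

Running `classify` over each saved result and grouping by status would reproduce the kind of tally Jeff posted, with the last restart percentage alongside each suspect task.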
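
For the plan of reviewing state.sah before starting BOINC, a minimal Python sketch of an automated check follows. Joe's posts say only that the file is XML and should contain a <best_autocorr> section; the fields inside that section are not described in the thread, so the check treats the section body as opaque text. The function names and the slots directory layout (`<slots_dir>/<n>/state.sah`) are assumptions for illustration.

```python
import re
from pathlib import Path

# Only the <best_autocorr> section name comes from the thread; its inner
# fields are unknown here, so the body is scanned as plain text.
SECTION = re.compile(r"<best_autocorr>(.*?)</best_autocorr>", re.DOTALL)
NAN_TOKEN = re.compile(r"\bnan\b", re.IGNORECASE)


def check_best_autocorr(state_text):
    """Return a list of human-readable problems found in one state.sah text."""
    match = SECTION.search(state_text)
    if match is None:
        return ["no <best_autocorr> section found"]

    body = match.group(1)
    problems = []
    if not body.strip():
        problems.append("<best_autocorr> section is empty")
    if NAN_TOKEN.search(body):
        problems.append("<best_autocorr> section contains a NaN value")
    return problems


def scan_slots(slots_dir):
    """Map each slot's state.sah path to its list of problems.

    ASSUMPTION: slot directories sit directly under slots_dir and each
    running task keeps its checkpoint in a file named state.sah.
    """
    report = {}
    for path in sorted(Path(slots_dir).glob("*/state.sah")):
        report[str(path)] = check_best_autocorr(path.read_text(errors="replace"))
    return report
```

Calling `scan_slots` on the BOINC data directory's slots folder before launching BOINC would flag any suspended task whose checkpoint already lacks the section or holds a NaN, which would help separate a shutdown-time problem from a startup-time one.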
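
On the question of whether peak=nan implies the value is stored as a string: in general it need not. NaN is an ordinary floating-point value, and number formatting prints it as the text "nan". The Python sketch below shows that general behaviour only; it says nothing about how the SETI@home application stores the peak internally.

```python
import math

# NaN is a regular binary floating-point value, not a string.
nan = float("nan")

# Formatting a NaN yields the text seen in the stderr output.
text = f"peak={nan}"
print(text)

# NaN never compares equal to anything, including itself, so
# math.isnan() is the reliable way to test for it.
print(nan == nan)
print(math.isnan(nan))

# NaN can arise from arithmetic with no string involved, for example
# subtracting infinity from infinity, and it then propagates through
# any later calculation that uses it.
from_arithmetic = float("inf") - float("inf")
print(math.isnan(from_arithmetic), math.isnan(from_arithmetic * 2.0 + 1.0))
```

A corrupted or uninitialised binary float can also decode as NaN, which is consistent with Joe's suggestion of data corruption, though the thread does not establish the cause.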
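
Joe's point about chirp 30 can be checked against the restart percentages Jeff listed. The Python sketch below uses Joe's rough rule that progress and chirp magnitude run "more or less the same" up to a final chirp limit of 100. Treating the mapping as exactly linear is an assumption made here for illustration; Joe notes it is not in lock step on every processor.

```python
AUTOCORR_CHIRP_LIMIT = 30.0   # Autocorr runs only out to +/- 30 chirp (per Joe)
FINAL_CHIRP_LIMIT = 100.0     # final chirp magnitude limit (per Joe)


def approx_chirp(progress_pct):
    """Estimate chirp magnitude from progress, ASSUMING a linear mapping."""
    return progress_pct / 100.0 * FINAL_CHIRP_LIMIT


# Last-restart percentages from Jeff's list of "peak=0" tasks.
last_restarts = {
    4294404762: 52.63,
    4294219908: 54.43,
    4290554474: 74.76,
    4294614470: 65.07,
    4290554489: 90.62,
    4274492714: 75.71,
    4294607734: 42.68,
    4294369776: 56.98,
}

# True where the last restart falls at or past the end of Autocorr processing.
past_autocorr = {
    task: approx_chirp(pct) >= AUTOCORR_CHIRP_LIMIT
    for task, pct in last_restarts.items()
}
print(all(past_autocorr.values()))
```

Under that approximation every listed restart falls after Autocorr processing has ended, so a Best Autocorr lost at or after the restart could not be replaced by a newly found one.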