Posts by Jeff Buck


1) Message boards : Number crunching : Precision X running but not showing on desktop (Message 1711824)
Posted 19 days ago by Jeff Buck
Is it one of the "hidden" icons on the taskbar? That's where mine always resides. If you have a "<" symbol down there, click on that to see what's hidden.
2) Message boards : Number crunching : Windows 10 - Yea or Nay? (Message 1710532)
Posted 22 days ago by Jeff Buck
Don't forget that when every new OS comes out the 3rd party hardware vendors will not always compile and issue new drivers for older kit to save themselves money. That forces you to buy new kit. I have a very good scanner that also does 35mm film and slides with an attachment. But no drivers for win 7 so now useless.

I had that problem about 8 years ago when my old Win98 machine started to fail and my new box came with Vista. Not only did I very quickly learn to hate Vista just for itself, I found that Vista had no drivers for my Epson printer, Epson scanner, and ScanMaker 5, all of which I really liked using. Both Epson and Microtek simply refused to come up with new drivers. Eventually I scrounged an old IBM Aptiva from the resale shop at a local landfill (for $5, I think) and just transferred the HD from the old Win98 box to the Aptiva. It still sits right here next to my desk and gets fired up whenever I want to do any serious scanning.
3) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1710526)
Posted 22 days ago by Jeff Buck
That's certainly an interesting, and expensive, discovery. I would've bet on the memory being the culprit, rather than the CPU, but I don't have the expertise to do more than just make guesses when it comes to hardware quirks. Good luck with the CPU replacement!
4) Questions and Answers : Windows : Invalid tasks for computer 7589235 (Message 1710410)
Posted 22 days ago by Jeff Buck
As to the Invalid tasks, the most recent thread on this topic is Stderr Truncations. There are several earlier threads and many scattered posts. It appears that the next BOINC version, v7.6.6, will resolve this problem.

As to the "can't find workunit", I've also noticed that over the last couple of days. Workunits and tasks normally get deleted from the database 24 hours after validation. Lately it seems that only the workunits are getting deleted but the tasks are not. I don't know why that's happening, but it probably isn't anything for us crunchers to worry about.

EDIT: Ah, I see the task deletion topic is now generating some discussion over on the Panic Mode thread, starting with Message 1710406.
5) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1710215)
Posted 22 days ago by Jeff Buck
OK, now I'm confused. I thought Josef explained to us that Best Autocorr always had to have non-zero values to be considered valid. That is why I was looking for the
    Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0

statements in the state.sah and stderr.txt files. So how come Task 4296268496 was deemed valid? We've seen cases where I got the autocorr count correct except the Best autocorr values were zeroed out with no elapsed time count, and the task was instantly invalid. That was the case with the very first task I started this thread with.


I think Joe mentioned in an earlier post that even if the Best Autocorr is hosed, you'll still likely get validated with a weakly similar result as long as the Autocorr count itself is greater than zero. In those cases, though, it seems to require a third party to arbitrate the initial Inconclusive.

I have to agree with your assessment, Jeff, that my computer is very selective about how it processes autocorr results. I know that GPU tasks do most of their processing on the GPU core and memory, but they do have to have the task data fed to them via the CPU. So how come I've yet to see an invalid or improper processing of autocorr on the GPUs? Does the Best Autocorr result always go into a specific register on the CPU? Are my troublesome invalids and inconclusives always getting processed on the same register of some failing core in the CPU? That would have to be determined by whoever wrote the autocorr mechanism and how it gets implemented in machine code. Josef, can you jump in here please and explain the outcome of Task 4296268496?

Thanks in advance.

That puzzles me a bit, too, as it did on those phantom triplets I was getting on a GPU. Only the triplet processing or data was affected. My best guess (and it's only a WAG) would be, if the problem is failing memory, then perhaps task data frequently gets loaded into memory at exactly the same base address and the failing bit(s) always hit the same block of data. That scenario would actually make some sense to me. On the other hand, if it's a failing CPU, then perhaps you could be right about a specific register being involved, or perhaps there's a specific FPU (or equivalent) circuit that only gets exercised for autocorrelation processing. As you say, the best answer would have to come from the experts! ;^)
6) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1710193)
Posted 22 days ago by Jeff Buck
Just for the heck of it, I grabbed the completed CPU tasks for that host again, searching for "Best autocorr: peak=0", and noted 5 new ones since my retrieval Thursday evening. One of them, of course is 4297854304, which you noted earlier. Another one that really caught my eye is 4299129617, which has the following sequence in the Stderr:

Best autocorr updated:score=-0.349, peak_power=17.91, bin=17696, fft_ind=4, icfft=68746
Autocorr: peak=17.9095, time=60.4, delay=1.8121, d_freq=1418800066.29, chirp=20.5, fft_len=128k
Best autocorr updated:score=-0.3477, peak_power=17.96, bin=17696, fft_ind=4, icfft=68806
Autocorr: peak=17.96048, time=60.4, delay=1.8121, d_freq=1418800067.41, chirp=20.519, fft_len=128k
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=1, icfft=92819

It apparently found two reportable Autocorr peaks, and then the glitch kicked in right after that.
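
For anyone who wants to spot that glitch mechanically, here is a minimal sketch that parses the "Best autocorr updated:" lines and reports the first one whose peak_power is NaN. It assumes the Stderr text has the same shape as the excerpt above; the function names and the SAMPLE constant are made up for illustration and are not part of any project tool.

```python
import math
import re

# Matches key=value pairs such as "peak_power=17.91" or "peak_power=nan".
PAIR = re.compile(r"(\w+)=([^,\s]+)")
PREFIX = "Best autocorr updated:"

def parse_best_autocorr(line):
    """Return the fields of a 'Best autocorr updated:' line as floats, else None.

    Assumes every field on such a line is numeric (or 'nan'), as in the
    excerpt quoted in the post.
    """
    line = line.strip()
    if not line.startswith(PREFIX):
        return None
    return {key: float(value) for key, value in PAIR.findall(line[len(PREFIX):])}

def first_nan_update(stderr_text):
    """Return the first update whose peak_power is NaN, or None if none exists."""
    for line in stderr_text.splitlines():
        fields = parse_best_autocorr(line)
        if fields and math.isnan(fields.get("peak_power", 0.0)):
            return fields
    return None

# Lines copied from the Stderr excerpt for task 4299129617.
SAMPLE = """\
Best autocorr updated:score=-0.349, peak_power=17.91, bin=17696, fft_ind=4, icfft=68746
Autocorr: peak=17.9095, time=60.4, delay=1.8121, d_freq=1418800066.29, chirp=20.5, fft_len=128k
Best autocorr updated:score=-0.3477, peak_power=17.96, bin=17696, fft_ind=4, icfft=68806
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=1, icfft=92819
"""

if __name__ == "__main__":
    print(first_nan_update(SAMPLE))
```

The "Autocorr:" lines are skipped on purpose, since values like fft_len=128k are not plain numbers.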

Another one that looks particularly interesting to me is 4296268517, which shows:

Best autocorr updated:score=-0.39, peak_power=16.3, bin=28731, fft_ind=3, icfft=33827
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=0, icfft=34972
R New best spike:score:-0.23667, power: 23.195, index=129997, fft_len=131072, ifft=6,icfft=39706
R New best spike:score:-0.20319, power: 25.053, index=129998, fft_len=131072, ifft=6,icfft=39708
Spike: peak=25.05331, time=87.24, d_freq=1421131699.56, chirp=-11.84, fft_len=128k
R New best spike:score:-0.19064, power: 25.788, index=129999, fft_len=131072, ifft=6,icfft=39710
Spike: peak=25.78832, time=87.24, d_freq=1421131699.55, chirp=-11.841, fft_len=128k
Spike: peak=25.23988, time=87.24, d_freq=1421131699.55, chirp=-11.842, fft_len=128k
Best pulse updated: score=1.011,power=3.4366,fftlen=256,freq_bin=63,time_bin=2048,icfft=58143
Pulse: peak=3.436609, time=53.7, period=7.92, d_freq=1421136146.87, score=1.011, chirp=17.339, fft_len=256
Autocorr: peak=18.01251, time=87.24, delay=6.0366, d_freq=1421134433.81, chirp=18.584, fft_len=128k

In this one, "nan" shows up but then a reportable Autocorr peak seems to be found a bit later. However, the "Best autocorr" value doesn't seem to get updated. Task 4296134017 shows similar behavior.

There's also one, 4296268496, where the Best autocorr does seem to get hosed immediately after a restart at 16.98%.

All in all, your machine certainly seems to be doing some interesting stuff! But it's only doing it to the autocorr processing and/or data. Very selective. ;^)

EDIT: FWIW, it appears that any time the "peak_power=nan" shows up before the last restart, as in those 4 examples, the reported "Best autocorr" shows "peak=0". However, if "peak_power=nan" happens after the last restart, as in Task 4297870021, you actually get "Best autocorr: peak=nan". I don't know if there's any significance to that, but it seems possible the Best autocorr may get reset on restarts. Just an observation.
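
That before/after check can be sketched as a small classifier. This is only an illustration of the observation, not a statement about how the app behaves: the restart marker text is an assumption (it is passed in as a parameter), and the function name is hypothetical.

```python
def nan_relative_to_last_restart(stderr_text, restart_marker="Restarted at"):
    """Say whether the last 'peak_power=nan' update falls before or after
    the last restart line in a task's Stderr text.

    restart_marker is an assumed substring identifying restart lines;
    adjust it to whatever the real output uses.
    """
    lines = stderr_text.splitlines()
    restart_rows = [i for i, line in enumerate(lines) if restart_marker in line]
    nan_rows = [
        i for i, line in enumerate(lines)
        if "Best autocorr updated" in line and "peak_power=nan" in line
    ]
    if not nan_rows:
        return "no nan"
    if not restart_rows:
        return "no restart"
    if nan_rows[-1] < restart_rows[-1]:
        return "nan before last restart"
    return "nan after last restart"
```

Per the observation above, the first outcome went with a reported "Best autocorr: peak=0" and the second with "Best autocorr: peak=nan", at least in the handful of tasks examined.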
7) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1710121)
Posted 23 days ago by Jeff Buck
Well, I picked up another invalid. Damn, it was one I looked at this morning but didn't see anything out of the ordinary then. Task 4297854304. Looks like another case of multiple restarts, but the one after 52% (in this case 55%) is when the autocorr time got nulled out, and the task was then declared invalid for no autocorr processing done. Also, I see that it had an autocorr peak_power=17.42, but was then updated to a peak_power=nan right before the last restart. Commentary?

[Edit] The task I was looking at this morning and was suspicious of with the peak=nan value looks like it auto updated multiple times on the best autocorr peak power after the last restart and corrected itself. Task 4298009570

Commentary?

Commentary, sure......but not from any expertise on the internals of this thing. ;^) Jason or Joe will have to provide that.

What I think I'm seeing in the Stderr of those two tasks is that the "Best autocorr updated" event where the nan shows up seems to be in mid-run between restarts, not immediately after a restart. Especially on that 4297854304 task where two consecutive lines in the output read:

Best autocorr updated:score=-0.3609, peak_power=17.42, bin=16689, fft_ind=2, icfft=70703
Best autocorr updated:score=0, peak_power=nan, bin=3, fft_ind=3, icfft=93574

That happens in mid-run between the 11.36% restart and the 55.32% restart. (Boy, your tasks sure do have a lot of restarts!) I don't know what that actually means, but perhaps the actual restarts don't provide the trigger, after all.

You mentioned checking the slots before you started BOINC this morning, but did you actually pick up anything odd from the <autocorr> sections of the state.sah files, or were those slot folders empty? What you posted looked like Stderr output.
8) Message boards : Number crunching : Corrupted data sets (Message 1709835)
Posted 24 days ago by Jeff Buck
Ok..Update:

If I'm reading my results properly, another work unit has been completed and validated. It doesn't have OpenCL next to it either. I've unchecked this in my preferences, so I'm cautiously optimistic the problem has been resolved.

The task that just successfully completed, and the other one that you likely now have in progress, are tasks that were assigned to the CPU. Judging from the Application details page for your machine, which shows 7 consecutive valid tasks with the CPU app, it doesn't appear that CPU tasks were ever a problem to begin with. It was the tasks assigned to your Intel GPU that all failed, with the one exception.

Changing your preferences to not accept tasks for the iGPU certainly does avoid whatever the problem was, but it wouldn't give any indication as to whether or not the problem was actually resolved by the AV changes.
9) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1709796)
Posted 24 days ago by Jeff Buck
I was also entertaining shifting the sticks around. Move matched pairs to the alternate slots.

I would think that might at least alter the symptoms, even if it doesn't happen to actually fix it. If the glitch does then manifest itself in some other way, however, you still won't really know which stick is the culprit.

Does your BIOS allow you to change whether the memory is interleaved or ganged, tying it to a specific processor or not? My xw9400 has those settings but, frankly, I've never been able to detect any performance difference no matter how I set the options. Perhaps someone more familiar with those settings could suggest whether any changes might be beneficial.

EDIT: Ah, scratch that question about the interleaving. I was thinking you had dual quad-core processors but I just noticed that you have a single 8-core CPU.
10) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1709783)
Posted 24 days ago by Jeff Buck
Two interesting cases are 4276853829 and 4276900953, neither of which had a restart but both show the Best Autocorr as having peak=nan (Not a Number). IOW, even when the checkpoint file isn't used there's a serious problem with the best autocorr processing. That suggests data corruption. Both those also have apparently good reported Autocorr signals too.

That peak=nan is certainly interesting. Is that peak value stored internally as a string rather than a binary value?
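
As a general point about floating-point output (not a claim about how the app stores its peak value), a printed "nan" does not by itself imply string storage. An ordinary binary double holding Not-a-Number comes out as "nan" when formatted as text, as this small illustration shows. The helper names are made up for the example.

```python
import math
import struct

def make_nan():
    """Produce NaN from an undefined arithmetic result (infinity minus infinity)."""
    return float("inf") - float("inf")

def format_peak(value):
    """Format a value the way a 'peak=...' log field might look."""
    return f"peak={value}"

if __name__ == "__main__":
    value = make_nan()
    print(format_peak(value))              # the text form of a binary NaN
    print(len(struct.pack(">d", value)))   # still an 8-byte IEEE 754 double
    print(math.isnan(value), value == value)
```

One side effect worth knowing is that NaN compares unequal to everything, including itself, so any comparison against it evaluates to false.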

I just did a quick search of the host's tasks that I grabbed last night and found 6 more with peak=nan, in addition to the two you identified: 4274856996, 4276818428, 4280054979, 4282059605, 4290380829 (validated and already deleted, it appears), 4294647989.

One thing I notice that all these tasks have in common for Best autocorr is "delay=0.0003072". Could there be any significance in that?

EDIT: It just occurred to me to search for "delay=0.0003072", to see how often it might occur, and it seems that those 8 tasks are the only instances with that value, out of the 799 I grabbed.
11) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1709603)
Posted 24 days ago by Jeff Buck
Just for the exercise, I put together a little routine to grab all the completed task results for the wayward machine and search them for "Best autocorr: peak=0". I found 9 altogether (out of 799), the Invalids and one Inconclusive that we had already identified, plus several more, as follows:

Invalid
4294404762 - Last restart at 52.63%
4294219908 - Last restart at 54.43%
4290554474 - Last restart at 74.76%

Inconclusive
4294614470 - Last restart at 65.07%
4290554489 - Last restart at 90.62%
4274492714 - Last restart at 75.71%

Validation Pending
4294607734 - Last restart at 42.68%
4294369776 - Last restart at 56.98%

Validated (-9 overflow)
4294369884 - No restart

Perhaps those will provide a little more food for thought on the morrow!
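
The search half of a routine like that can be sketched as follows. This is not the routine actually used; it assumes the task results have already been saved locally as one text file per task, named by task ID (a hypothetical layout), and the function name is invented for the example.

```python
from pathlib import Path

def find_tasks_with_marker(results_dir, marker="Best autocorr: peak=0,"):
    """Return the task IDs (file stems) whose saved Stderr contains marker.

    Assumes results_dir holds files like '4294404762.txt'. The default
    marker ends with a comma so that values such as peak=0.5 don't match.
    """
    hits = []
    for path in sorted(Path(results_dir).glob("*.txt")):
        text = path.read_text(errors="replace")
        if marker in text:
            hits.append(path.stem)
    return hits
```

Calling it with a different marker string covers other one-off searches over the same saved results.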
12) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1709562)
Posted 24 days ago by Jeff Buck
Keith, I'm going to offer a suggestion that occurs to me. Before you shut down, try turning off the BOINC auto-start, so that it doesn't start up with Windows. Then, before you start BOINC after your next reboot, review the various state.sah fields for the suspended tasks to see if anything looks suspicious. I'm thinking that it could be just as likely that the bit-flipping occurs on shutdown as it does on startup. Reviewing the state.sah files before starting BOINC might be useful.

EDIT: Joe, does that sound reasonable?
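
A rough sketch of that review step, for anyone who would rather script it than open each file by hand. Several things here are assumptions: that each slot folder sits directly under the slots directory passed in, that state.sah is XML-like text, and that the section of interest uses a tag named best_autocorr. None of those details are confirmed here, so treat the tag and function names as placeholders.

```python
import re
from pathlib import Path

def extract_sections(text, tag):
    """Return the contents of every <tag>...</tag> block found in text."""
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    return [body.strip() for body in pattern.findall(text)]

def review_slots(slots_dir, tag="best_autocorr"):
    """Map each slot folder name to the tag sections found in its state.sah.

    Slots without a state.sah file are simply absent from the result.
    """
    report = {}
    for state_file in sorted(Path(slots_dir).glob("*/state.sah")):
        text = state_file.read_text(errors="replace")
        report[state_file.parent.name] = extract_sections(text, tag)
    return report
```

Running it before starting BOINC and again after a restart would give two snapshots to compare.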
13) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1709550)
Posted 24 days ago by Jeff Buck
Just a thought. It appears that a restart above at least 52% is a common theme in all of Keith's current Invalids (and that one Inconclusive I spotted). Does the autocorr checking usually occur in a specific phase of the processing (i.e., perhaps before or after the halfway mark), or is it spread across the entire duration of the task?

Autocorr processing is done at the 128K FFT length only, and that length is only used out to +/- 30 chirp, so your observation may indeed be a clue. Chirp magnitude and progress run more or less the same, with final chirp magnitude limit 100, though it's not exactly in lock step because different processors react differently to the change in processing at chirp 30. For VLAR tasks it's a reasonable match on my systems, though.

It's true that there cannot be a new Best Autocorr found after a restart at chirp magnitude 30 or later. Still, the code to parse the state.sah checkpoint file at a restart should get the existing best_autocorr as easily as it does other signal types.
Joe

Well, what I was wondering was what the effect would be if that bit flip, or whatever is happening on Keith's machine, occurs on initial startup, when that startup occurs after a task is done with Autocorr processing. If that Best Autocorr gets zapped at that point, can it ever get repopulated with non-zero values?

EDIT: Assuming that the bit(s) flipping occurs after the checkpoint file has been read in, that is.
14) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1709530)
Posted 24 days ago by Jeff Buck
Just a thought. It appears that a restart above at least 52% is a common theme in all of Keith's current Invalids (and that one Inconclusive I spotted). Does the autocorr checking usually occur in a specific phase of the processing (i.e., perhaps before or after the halfway mark), or is it spread across the entire duration of the task?
15) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1709520)
Posted 24 days ago by Jeff Buck
As Jason says, it looks like the best_autocorr got lost in the restart at 90.62 percent. Those "best" signals are saved in the state.sah file in the slot directory, written in the order spike, autocorr, gaussian, pulse, and triplet. Having the other types properly recovered by the restart deepens the puzzle.

Since my xw9400 is also an 8-core AMD machine (except, apparently, under Win 10) that runs CPU tasks and is shut down for most of each day, I was curious to see if it exhibited that same "Best autocorr: peak=0 ..." quirk following any of the restarts. After searching my archives all the way back to the beginning of the year, I have to say that it doesn't.

I did find a small number of occurrences (just 30 in 7+ months), but every one was on a -9 overflow task, most with 30 Spikes (or, at least, a high Spike count), although one did have 30 Pulses, instead. Obviously, those were all very short-running tasks (most about 15 seconds or less, I think) and none were restarted.
16) Message boards : Number crunching : Panic Mode On (99) Server Problems? (Message 1709499)
Posted 24 days ago by Jeff Buck
Looks like someone can reopen the "BOINC down" thread, because the domain just dropped off the face of the Earth. Again.

Perhaps you should start a "BOINC is Down Cafe" here on S@h. ;^)
17) Message boards : Number crunching : Panic Mode On (99) Server Problems? (Message 1709476)
Posted 24 days ago by Jeff Buck
Well, that's it. The GPU people ran them out of data.

You mean these GPU people? ;^)

18) Message boards : Number crunching : Windows 10 - Yea or Nay? (Message 1709464)
Posted 24 days ago by Jeff Buck
... but the one I bolded, 2952664, is known as 'nagware' and has a long history of issues, primarily failing to install properly and, after two restarts, nagging you again to install it, at which point it fails again, etc.

Personally, I have hidden that particular update six times now, but it keeps coming back as an available update, because MS keeps "fixing" it, which changes the unique hash identifier for it, even though the KB number is still the same, so it comes back as a "new" available update. Here's the explanation from InfoWorld about that one.

Thanks very much for that tidbit. I just knew that update sounded familiar, yet because it showed an August 4th release date, it had me puzzled. That's why I just made a note of it, until I could get around to checking it out further. I don't think I ever understood why some vaguely familiar updates occasionally seemed to reappear. Your explanation of the "hash identifier" component has me going, "Ah, ha...so that's it!"
19) Message boards : Number crunching : Windows 10 - Yea or Nay? (Message 1709421)
Posted 25 days ago by Jeff Buck
Having decided to forgo the many features of Win 10 due to the high cost of privacy, I canceled the download but it keeps trying to download even though I chose "Hide" in the update list.

Any suggestions?

Have you Uninstalled the KB3035583 update?


I just did that, so far the Win 10 available icon has not re-appeared. The system immediately started looking for 'available updates' upon reboot which I stopped.

Thanks again.

The next time you manually "Check for Updates", it may still try to offer you the "Get Win 10" app and/or the Win 10 update itself. At that point, if you just hide both, you should be okay going forward. One thing I also did was to completely delete the two C:\$[blahblahblah] folders that had the downloaded Win 10 files in them.

One annoying thing I've noticed with Windows Update, though, since reverting back to Win 7, is that each time I manually download an update (such as MSE definition files), the box that shows the download progress is now headed "Downloading Windows 10", even though it clearly isn't. (No harm done, but quite startling the first time I saw it!) I don't know if that's just because I reverted or if it will happen to anybody who actually had the Win 10 files downloaded, even if they didn't install it.

By the way, there seems to be a new, optional "Compatibility update for upgrading Windows 7" (KB2952664), released 2 days ago, that looks rather suspicious to me. I haven't hidden it yet, but I sure haven't downloaded it, either.
20) Message boards : Number crunching : Why this CPU task invalid so soon? (Message 1709109)
Posted 25 days ago by Jeff Buck
Say, Keith, take a look at your task 4290554489. I went to look at your two new Invalids and then decided to look at a nearby Inconclusive.

The Inconclusive also appears to have the same "Best autocorr: peak=0, time=-2.123e+011, delay=0, d_freq=0, chirp=0, fft_len=0" as the Invalids, yet this task actually shows an Autocorr count of 1 and is thus far only Inconclusive rather than instantly Invalid. I wonder how that happens?



Copyright © 2015 University of California