Posts by Jeff Buck


1) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819477)
Posted 14 hours ago by Profile Jeff Buck
... assuming that boinc_task_state.xml is where the checkpoint is written ...

Why not have a look inside it?

boinc_task_state.xml is just 10 lines of text, with just time, memory, and disk usage values.

state.sah is the one which contains ~180 lines of science, including the best signals found so far.

Well, I actually did look inside a boinc_task_state.xml and, once I saw the two checkpoint lines, made the rash assumption that they recorded the latest checkpoint. I also found, in the Process Monitor log that I was looking at, that the boinc_task_state files tended to be written in groups, spaced approximately by the checkpoint interval.

So now I'm a bit confused. In looking through the same Process Monitor log, the state.sah files seem to be referenced only sporadically, not at any specific interval, and apparently only for Read access (which, in itself, makes no sense), to wit:
6:53:13.1178805 PM boinc.exe 1484 QueryDirectory C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah SUCCESS Filter: state.sah, 1: state.sah
6:53:13.1178988 PM boinc.exe 1484 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2 SUCCESS
6:53:13.1180423 PM boinc.exe 1484 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah SUCCESS Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
6:53:13.1180629 PM boinc.exe 1484 QueryAttributeTagFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah SUCCESS Attributes: A, ReparseTag: 0x0
6:53:13.1180710 PM boinc.exe 1484 SetDispositionInformationFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah SUCCESS Delete: True
6:53:13.1180820 PM boinc.exe 1484 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\state.sah SUCCESS

If I remember correctly from a while back, during one of our bug hunting episodes, you had me turn on a config flag which resulted, as a side effect, in lots of checkpoint messages being written to the Event Log. I'm pretty sure they always appeared in groups, one for each running task, and at approximately the specified checkpoint interval. So, I guess I'll have to do some more digging to try to understand just what "checkpointing" means in a BOINC context, 'cause something's missing from the puzzle for me. :^)

EDIT: Ah, I think I see the issue with only Read access for state.sah files showing up in the Process Monitor log. This particular log was limited to boinc.exe actions. I'll have to dig around and see if I've got one tucked away that includes science app activity, too.

EDIT2: Okay, then. The apps do, indeed, appear to write the state.sah files at approximately the checkpoint-specified intervals. A couple examples:
6:43:36.6860645 PM AKv8c_r2549_winx86_SSE3xjfs.exe 3280 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\12\state.sah SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:43:36.6870820 PM AKv8c_r2549_winx86_SSE3xjfs.exe 3280 WriteFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\12\state.sah SUCCESS Offset: 0, Length: 3,437
6:43:36.6872363 PM AKv8c_r2549_winx86_SSE3xjfs.exe 3280 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\12\state.sah SUCCESS
6:43:36.7224494 PM Lunatics_x41zc_win32_cuda50.exe 4020 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\state.sah SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:43:36.7235162 PM Lunatics_x41zc_win32_cuda50.exe 4020 WriteFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\state.sah SUCCESS Offset: 0, Length: 3,632
6:43:36.7236086 PM Lunatics_x41zc_win32_cuda50.exe 4020 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\state.sah SUCCESS

I'll leave the math for the elapsed time for each to others. They look close enough to my earlier figures for the boinc_task_state files that the conclusions would be pretty much the same. The overhead is insignificant.
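
For anybody who does want to do that math, here's a rough Python sketch (purely illustrative, just feeding in the CreateFile and CloseFile timestamps from the Process Monitor excerpt above) that prints the elapsed seconds for each state.sah write:
from datetime import datetime

def parse(ts):
    # Process Monitor shows 7 fractional digits; Python's %f only takes 6, so drop one.
    time_part, ampm = ts.split()
    return datetime.strptime(time_part[:-1] + " " + ampm, "%I:%M:%S.%f %p")

def elapsed(start, end):
    return (parse(end) - parse(start)).total_seconds()

# state.sah write by the AKv8c app in slot 12 (CreateFile through CloseFile):
print(elapsed("6:43:36.6860645 PM", "6:43:36.6872363 PM"))   # about 0.0012 seconds
# state.sah write by the cuda50 app in slot 3:
print(elapsed("6:43:36.7224494 PM", "6:43:36.7236086 PM"))   # about 0.0012 seconds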

EDIT3: And just to be clear, both the state.sah and boinc_task_state.xml files are written at the checkpoints for each task.
2) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819447)
Posted 16 hours ago by Profile Jeff Buck
Okaaaay! Moving on, then. ;^)

I just dug into one of my old Process Monitor logs and, assuming that boinc_task_state.xml is where the checkpoint is written, here are the log entries for a couple in different slots:
6:40:39.6115090 PM boinc.exe 1484 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_task_state.xml SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:40:39.6117095 PM boinc.exe 1484 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8 SUCCESS Desired Access: Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open For Backup, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
6:40:39.6117489 PM boinc.exe 1484 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8 SUCCESS
6:40:39.6309677 PM boinc.exe 1484 WriteFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_task_state.xml SUCCESS Offset: 0, Length: 508
6:40:39.6310223 PM boinc.exe 1484 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_task_state.xml SUCCESS
6:40:41.1149725 PM boinc.exe 1484 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\6\boinc_task_state.xml SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
6:40:41.1151886 PM boinc.exe 1484 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\6 SUCCESS Desired Access: Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Open For Backup, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
6:40:41.1152160 PM boinc.exe 1484 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\6 SUCCESS
6:40:41.1156076 PM boinc.exe 1484 WriteFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\6\boinc_task_state.xml SUCCESS Offset: 0, Length: 516
6:40:41.1156731 PM boinc.exe 1484 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\6\boinc_task_state.xml SUCCESS

It looks like the first one took 0.0195133 seconds, while the second took 0.0007006 seconds. Of course, that's elapsed time, not necessarily processing time, but still, multiplying that out by 60 checkpoints per hour would only represent a maximum of 1.170798 seconds per task hour in the first case and 0.042036 seconds per task hour in the second. (I hope I've figured that correctly.) In any event, that doesn't seem like a significant cost to processing.
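
(Just to double-check that arithmetic, here's a trivial sketch, assuming exactly 60 checkpoints per task hour:)
# Per-checkpoint elapsed write times from the log above, extrapolated to one task hour.
for write_time in (0.0195133, 0.0007006):
    print(write_time * 60)   # about 1.170798 and 0.042036 seconds per task hour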
3) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819427)
Posted 17 hours ago by Profile Jeff Buck
The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.


The bigger (somewhat hidden) problem there is the number of points of failure, since the Boinc codebase assumes sequential activity in a timely fashion, then kills things. IOW the tradeoff is partly the overhead described, and another part any level of respect previously offered and since sacrificed for the sake of completely ignoring computer science principles.

I'm thinking somewhat less abstractly ;^), and more in terms of how many milliseconds are used writing each checkpoint. With Al's rescheduling runs now happening once per hour, and checkpoints at 60 seconds, that means that about 30 seconds per task out of every "task hour" is lost to reprocessing (about 0.8%). So, if 60 checkpoints are taken in that hour, what is the additional processing cost? Decreasing the checkpoint interval to 30 seconds would cut the lost processing in half, but how much would the additional checkpointing cost offset the savings in processing (disk thrashing notwithstanding)? (Actually, with 56 cores and 6 GPUs, disk thrashing is probably already a given on that host.)
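
To put some rough numbers on that tradeoff, here's an illustrative Python sketch. The per-checkpoint write cost of 0.02 seconds is just an assumption pulled from the worst case in the Process Monitor log above, not a measured figure for Al's host:
# Rough model: processing lost to restart rollbacks vs. time spent writing checkpoints,
# per task, per hour.
def cost_per_task_hour(checkpoint_interval_s, restarts_per_hour=1, write_cost_s=0.02):
    lost_to_restart = restarts_per_hour * checkpoint_interval_s / 2   # average rollback per restart
    checkpoint_overhead = (3600 / checkpoint_interval_s) * write_cost_s
    return lost_to_restart + checkpoint_overhead

for interval in (300, 60, 30):
    print(interval, round(cost_per_task_hour(interval), 1), "seconds lost per task hour")
# 300 -> ~150.2, 60 -> ~31.2, 30 -> ~17.4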
4) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819419)
Posted 18 hours ago by Profile Jeff Buck
Just pulled it up, mine is set to 60 seconds. Too often?

Considering how frequently you're running that rescheduler, 60 seconds is probably reasonable. Depending on the timing of each rescheduling run relative to the last checkpoint, each of your running tasks would lose between 0 and 59.999+ seconds of processing each time BOINC restarts so, on average, probably about 30 seconds per task. I suspect that task housekeeping, during both the BOINC shutdown and restart, probably adds some to that, though I don't know how much.

The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.
5) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819383)
Posted 21 hours ago by Profile Jeff Buck
Where does one go to shorten the checkpoint interval?

In BOINC Manager > Options, on the Computing tab, all the way at the bottom there's an option to "Request tasks to checkpoint at most every ___ seconds."
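
For anybody who'd rather set it outside the Manager, I believe the same preference can also go in a global_prefs_override.xml in the BOINC data directory. Just a sketch, and hedged (check the BOINC docs if in doubt), but I think the tag is disk_interval:
<global_preferences>
   <disk_interval>60</disk_interval>
</global_preferences>
After editing, have the Manager re-read the local prefs file (or just restart BOINC) for it to take effect.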
6) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819360)
Posted 22 hours ago by Profile Jeff Buck
First of all, it's not really SoG you're trying to manage, but guppi VLARs and Arecibo non-VLARs. It doesn't matter if you're running Cuda or SoG on the GPUs.

Secondly, yes, the frequency of your rescheduling could have an impact, both on your RAC and on those "finish file present too long" errors you're getting. With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, after the finish file has already been written by the app but before BOINC has checked for its existence. When BOINC comes back up, more than 10 seconds have passed and the error is generated.

Rescheduling every 30 minutes could also hurt your overall throughput if you still have checkpoints set to the default, which I think is 300 seconds. Any time tasks are restarted, they have to go back to the last checkpoint so, on average, you could be losing close to 2.5 minutes of processing time for each task that's restarted after a rescheduling run. That could be significant.

Personally, I would recommend a far longer interval between rescheduling runs and, if you haven't already made such a change, a shorter checkpoint interval. (But not too short, since there is some additional overhead incurred with each checkpoint. Mine is usually set in the 120 to 150 second range.)
7) Message boards : Number crunching : Open Beta test: SoG for NVidia, Lunatics v0.45 - Beta4 update (Message 1819174)
Posted 1 day ago by Profile Jeff Buck
. . But that does not explain why it stopped working when he first uninstalled the AMD drivers. That would tend to suggest it had been using them. Computers can be such "interesting" things. :)

Stephen

.

It could be that the uninstall of the AMD driver took out a .dll that was common to both platforms. I'm thinking that might happen if the AMD driver was installed first and the NVIDIA driver added later. If there's an uninstall log, that might provide a clue, but probably not worth trying to track it down.
8) Message boards : Number crunching : Open Beta test: SoG for NVidia, Lunatics v0.45 - Beta4 update (Message 1819055)
Posted 2 days ago by Profile Jeff Buck
Does raise some interesting thoughts, does it not?


Perhaps about my sanity ...

I restored the old drivers Just 'cause

Ed F

What I find interesting is that both with your original setup, and now with your restored drivers, the Stderr is showing both AMD and NVIDIA OpenCL being detected, with AMD first, followed by NVIDIA:
Priority of process adjusted successfully, below normal priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
OpenCL platform detected: NVIDIA Corporation
BOINC assigns device 0

For those hours in between, it was only showing NVIDIA:
Priority of process adjusted successfully, below normal priority class used
OpenCL platform detected: NVIDIA Corporation
BOINC assigns device 0

So, while the AMD OpenCL was recognized, I'm not sure it was actually being used, as long as the NVIDIA was found afterwards.

EDIT: Also, if you look further down in the Stderr currently, you'll see:
Number of OpenCL platforms: 2

whereas it only showed 1 platform when you just had the NVIDIA driver installed.

EDIT2: And continuing on down, it shows:
OpenCL Platform Name: ATI Stream
Number of devices: 0
OpenCL Platform Name: NVIDIA CUDA
Number of devices: 1

It knows there's just 1 device, and it's NVIDIA. I think Raistmer's got it covered. :^)
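
If anybody wants to double-check what OpenCL actually enumerates on a box like that, here's a quick sketch with pyopencl (hypothetical; it assumes Python and the pyopencl package are installed) that lists the platforms and devices much like the Stderr lines above:
import pyopencl as cl   # pip install pyopencl

# Walk every OpenCL platform and list the devices it exposes.
for platform in cl.get_platforms():
    print("OpenCL Platform Name:", platform.name)
    try:
        devices = platform.get_devices()
    except Exception:
        devices = []   # some drivers throw instead of returning an empty list
    print("Number of devices:", len(devices))
    for device in devices:
        print("    ", device.name)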
9) Message boards : Number crunching : Panic Mode On (103) Server Problems? (Message 1818988)
Posted 2 days ago by Profile Jeff Buck
Wheee! At least some parts of the 20ja16aa and 22ja16aa tapes are extremely noisy. Those are a couple of the tapes that produced APs that were 100% blanked. Now the MB tasks are pulsing like a '70s disco. They all appear to be overflowing with 30 Pulses. One of my machines just blew through 168 tasks in about an hour, when its normal output is about 30. The only thing stopping it from zipping through 2 or 3 times that many was that some guppi VLARs are salted into the mix, giving the GPUs something more satisfying to chew on periodically.
10) Message boards : Number crunching : Anything relating to AstroPulse tasks (Message 1818925)
Posted 2 days ago by Profile Jeff Buck
The sole non-blank is from 03no15.

But still very noisy, hitting the limit of 30 repetitive pulses. That's pretty much what I've seen with the few of mine that weren't 100% blanked in the most recent batch. Some of them hitting 30 for both, but all hitting 30 for repetitive pulses.
11) Message boards : Number crunching : Anything relating to AstroPulse tasks (Message 1818726)
Posted 3 days ago by Profile Jeff Buck
Those "no ALFA scheduled" look interesting, since we've obviously got data from those 3 dates. Hmmmm....

You're probably right about those "aa" tapes eventually being followed by "ab", etc., at a later date. Some of those "aa" tapes are still on the SSP to be split for MB and they all look like full ones (50.20 GB). I doubt if all those dates just happen to have exactly the right amount of data to fill just one tape.

I was also just looking at some of the MB tasks that my hosts have completed from the January and February data that was 100% blanked for AP, curious to see if there was any correlation with the MB results. I can't say that I see any. Some are overflows, some aren't.

02fe16aa
Task 5167397389 returned 1 Spike, 1 Autocorr

03fe16aa
Task 5167570635 overflowed w/ 30 Pulses

15fe16ab
Task 5165731675 (VLAR) returned 4 Spikes, 1 Autocorr, 9 Pulses and 1 Triplet
Task 5164704402 (VLAR) overflowed with 28 Autocorrs and 2 Pulses

15fe16ac
Task 5165873621 returned 2 Autocorrs and 2 Triplets

15ja16aa
Task 5168998647 overflowed w/ 30 Pulses
Task 5169973788 returned 5 Spikes, 1 Autocorr and 1 Triplet

Maybe with a bigger sample, a pattern would emerge, but so far I'm not seeing any correlation between the RFI in the AP WUs and the noise levels in the MB WUs.
12) Message boards : Number crunching : Anything relating to AstroPulse tasks (Message 1818663)
Posted 3 days ago by Profile Jeff Buck
If so, it still doesn't allay my fears of a possible antenna or recorder problem rendering all tapes from 2016 as useless for Astropulse.

@Richard

I'm a little late to the party on this because I haven't paid attention to this thread in a long time. However, after seeing your comment, I just went back and reviewed all the 2016 APs that I got last month, 158 of them, and didn't really see any cause for concern. The only APs that were 100% blanked were the 27 from the 15fe16a* tapes (where you found that Arecibo was likely targeting the moon), along with 7 from that pesky B3_P1 receiver (or whatever that designation means). I also had 5 with 30/30 overflows, but no pattern to the sources. The rest looked fine, from 06au16a*, 07au16a*, 10au16a*, 11au16a*, 12au16a*, 12au16a*, 13au16a*, 13au16a*, 27jl16a*, 28jl16a*, 28jl16a*, and 28jl16a*. That works out to 13 different tapes from 8 different days.

EDIT: A quick look at a sampling of the January and February, 2016, APs that I got early this month, however, seems to indicate that almost all of those were 100% blanked. I'll try to take a more in-depth look at those tomorrow. It looks like the dates on those run from mid-January to early February.

Well, that was a thoroughly unrewarding exercise. Out of 143 AP tasks from 2016 tapes that my hosts processed in early September, 100% of the channels for 100% of the dates were 100% blanked. Tapes represented were all from the mid-January to early February time period: 02fe16aa, 03fe16aa, 12ja16aa, 15ja16aa, 17ja16aa, 18ja16aa, 19ja16aa, 20ja16aa, 21ja16aa, 22ja16aa, 23ja16aa, 24ja16aa, 26ja16aa, 27ja16aa, 28ja16aa, 29ja16aa, and 30ja16aa. Perhaps worth noting is that all were "aa" tapes. Could it be that the only data collected on those days were brief observations of a single target? That lunar meteoroid-strike EMP proposal only requested 4 observing periods, though.
13) Message boards : Number crunching : Anything relating to AstroPulse tasks (Message 1818571)
Posted 4 days ago by Profile Jeff Buck
If so, it still doesn't allay my fears of a possible antenna or recorder problem rendering all tapes from 2016 as useless for Astropulse.

@Richard

I'm a little late to the party on this because I haven't paid attention to this thread in a long time. However, after seeing your comment, I just went back and reviewed all the 2016 APs that I got last month, 158 of them, and didn't really see any cause for concern. The only APs that were 100% blanked were the 27 from the 15fe16a* tapes (where you found that Arecibo was likely targeting the moon), along with 7 from that pesky B3_P1 receiver (or whatever that designation means). I also had 5 with 30/30 overflows, but no pattern to the sources. The rest looked fine, from 06au16a*, 07au16a*, 10au16a*, 11au16a*, 12au16a*, 12au16a*, 13au16a*, 13au16a*, 27jl16a*, 28jl16a*, 28jl16a*, and 28jl16a*. That works out to 13 different tapes from 8 different days.

EDIT: A quick look at a sampling of the January and February, 2016, APs that I got early this month, however, seems to indicate that almost all of those were 100% blanked. I'll try to take a more in-depth look at those tomorrow. It looks like the dates on those run from mid-January to early February.
14) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1818479)
Posted 4 days ago by Profile Jeff Buck
Actually, all of my machines have been dropping now that I've looked at them. Is this something that is happening across the board, or am I just the lucky one?

I think it may have been due to 75% of the pfb splitters getting stuck last week. That skewed the WUs heavily in favor of guppi VLARs, with very few Arecibo non-VLAR tasks getting sent out. Hopefully, with that splitter situation corrected on Sunday, and with APs currently flowing, things will get back to normal shortly (whatever "normal" is).
15) Message boards : Number crunching : Windows 10 - Yea or Nay? (Message 1818349)
Posted 5 days ago by Profile Jeff Buck
Yep, definitely Win 10.
Here's how to stop this:
Control Panel --> System --> Advanced System settings --> Hardware --> Device Installation Settings --> Automatically download Manufacturer ... --> NO

Oh, by the way, check Win 7 as well. Found the same issue there. Not sure if this was part of the latest "Security Update" that MS just pushed? Don't remember this being on 7 before ...

No, not new. This came up when Windows 10 was first being released. I even posted a couple screen shots in this thread. Unfortunately, when I did try running Win10 on one of my boxes, that setting turned out to be little more than a placebo (see Messages 1707338, 1707349, 1707394 and 1708078). That was the final Win10 straw for me.

It seems to me that somebody's come up with a Registry hack since then to block the driver updates, or perhaps the Installation Settings has been fixed. In any event, Win7 still seems to be safe.
16) Message boards : Number crunching : Ratio of Guppies vs. Arecibo (Message 1818219)
Posted 5 days ago by Profile Jeff Buck
A lot of the tasks recorded on 15 Feb 16 - ab, ac, and ad - are VLAR, so not being distributed to NVidia GPUs. I wonder what they were looking at that day?

Edit - looks like it was the moon: http://www.naic.edu/vscience/schedule/tpfiles/KesarajutagS3039tp.pdf. Be interesting if we find something intelligent up there, but it might be a good test of Astropulse. Anyone see anything interesting when they were processed for AP, or did they just pick up the moon's radar station?

Looks to me like that was the batch that returned almost entirely:
In ap_remove_radar.cpp: get_indices_to_randomize: num_ffts_forecast < 100. Blanking too much RFI?
percent blanked: 100.00
17) Message boards : Number crunching : Ratio of Guppies vs. Arecibo (Message 1818078)
Posted 6 days ago by Profile Jeff Buck
There's currently only one pfb splitter actively working. Three have been stuck on a single tape (23ja09aa) since at least last Wednesday. Apparently nobody in Berkeley pays attention to such things. At least work is still flowing and this situation probably gives us a preview of what the crunching world will look like if Arecibo eventually shuts down.

EDIT: Corrected "gbt" to "pfb". It's Arecibo splitters that are stuck.

Well, I'll be darned. Sometime in the last few hours, on a Sunday afternoon, somebody apparently cleared the jam. Four splitters are once again turning out Arecibo tasks.
18) Message boards : Number crunching : Ratio of Guppies vs. Arecibo (Message 1818013)
Posted 6 days ago by Profile Jeff Buck
There's currently only one pfb splitter actively working. Three have been stuck on a single tape (23ja09aa) since at least last Wednesday. Apparently nobody in Berkeley pays attention to such things. At least work is still flowing and this situation probably gives us a preview of what the crunching world will look like if Arecibo eventually shuts down.

EDIT: Corrected "gbt" to "pfb". It's Arecibo splitters that are stuck.
19) Message boards : Number crunching : Open Beta test: SoG for NVidia, Lunatics v0.45 - Beta4 update (Message 1817737)
Posted 8 days ago by Profile Jeff Buck
Which log flag have to be enabled for "Restarting task ..." to appear in Event Log (without too much other clutter)?

That would be cpu_sched.

9/15/2016 9:12:34 PM | SETI@home | [cpu_sched] Restarting task 26fe09ab.11025.10706.9.36.130_0 using setiathome_v8 version 800 (cuda50) in slot 0

Yeah, that's irritated me from the time they made the change. I even rolled back to an earlier BOINC for a long time, but eventually had to start giving in. It's not just from the "Restarting" standpoint either, but from the lack of slot info if you don't turn on cpu_sched. The downside is that you end up with two "Starting task" messages in the log for every task, one without the slot number and one with it. You can't just replace one with the other. Just something we have to accept if we want that additional info, I guess.
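
For anyone else hunting for where that flag lives: it goes in the log_flags section of cc_config.xml in the BOINC data directory, something like this (just a minimal sketch; any flags you already have in there can stay):
<cc_config>
   <log_flags>
      <cpu_sched>1</cpu_sched>
   </log_flags>
</cc_config>
Then have the client re-read the config file from the Manager menu (or restart BOINC) and the [cpu_sched] lines should start showing up.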

EDIT: Just found my original complaint about it, from March, 2014: BOINC 7.2.42 - Reduced info in Event Log?
20) Message boards : Number crunching : Open Beta test: SoG for NVidia, Lunatics v0.45 - Beta4 update (Message 1817708)
Posted 8 days ago by Profile Jeff Buck
SoG does best with 1 CPU core per WU.

He's got -use_sleep in his cmdline, so I doubt if a full core is necessary. At least I didn't find it so in my recent experiments with SoG.

