Posts by jason_gee


1) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1820356)
Posted 1 hour ago by Profile jason_gee
So,
Darwin 16.0 is out... and so is CUDA Toolkit 8.0.
Do you see a problem here?
Click on the green buttons that describe your target platform. Only supported platforms will be shown. Operating System: Windows / Linux

The 8.0 driver appears to be MIA as well.
Best I can tell, there is a problem between NVCC and Clang.
Of course, only nVidia knows for sure...

Not impressed.



There have been rumours that the next Macbook Pros, and/or Mac Pros, may include Pascal GPUs, based on certain job openings advertised by nVidia. These were aimed at finding experienced Mac developers, for participation in Metal and OpenCL development, and driver development.

So to my mind, it amounts to this: Cuda support isn't there yet for OSX, and the underlying infrastructure might be temporarily missing (for example, on Windows Cuda uses DirectX underneath; on Linux, more direct means).

The logical conclusion for me is that Cuda support would need to migrate significant underlying portions, and that would probably happen piecemeal rather than all at once.
2) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1820218)
Posted 1 day ago by Profile jason_gee
Something to think about on the CUDA side too:
if the device is capable of pairing kernels, how will it ensure that such pairing will not trigger a TDR under Windows?


Yeah, it's getting tricky/fiddly on that front. I'm having to inject extra sync points to overcome the driver optimisations that fuse kernels in the streams, making them too long. It is working on my host though, with conservative settings. I'd imagine my system might be the worst case for this: an old Core2Duo, only DDR2, PCIe 1.1, trying to feed a GTX 980. Naturally the 980 easily chokes the rest of the system unless I dial things back a lot. Others will probably have less trouble, once I add some settings/options, and with a better match of hardware.

I will probably end up having to add an option to split the 3-4-5 folds that are fused, at some unknown compromise. I've managed to get the 3-4 second lags down to a usable range, but it'll take some rework to eventually scale across the full range of hardware.
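
For anyone wondering what "injecting extra sync points" means concretely, here's a minimal sketch. The kernel and launch shape are invented for illustration (nothing like the actual app code); the idea is just to break a long run of stream launches with a blocking event, so nothing the driver fuses runs long enough to trip the Windows TDR:

#include <cuda_runtime.h>

// Hypothetical kernel; the real pulsefind kernels are far more involved.
__global__ void pulsefind_chunk(const float* data, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[i] * data[i];
}

// Launch a series of chunks, inserting an explicit sync point every
// few launches so the driver cannot fuse them into one submission
// long enough to trip the Windows TDR (~2 s by default).
void run_pulsefind(const float* d_data, float* d_out, int n,
                   int num_chunks, int sync_every)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t marker;
    cudaEventCreateWithFlags(&marker, cudaEventBlockingSync);

    for (int i = 0; i < num_chunks; ++i) {
        pulsefind_chunk<<<256, 256, 0, stream>>>(d_data, d_out, n);
        if ((i + 1) % sync_every == 0) {
            // Block the host here: bounds how long the GPU runs
            // without yielding back to the OS/driver.
            cudaEventRecord(marker, stream);
            cudaEventSynchronize(marker);
        }
    }
    cudaStreamSynchronize(stream);
    cudaEventDestroy(marker);
    cudaStreamDestroy(stream);
}

sync_every then becomes exactly the kind of tunable setting/option mentioned above: conservative on a starved host like mine, relaxed on a well-matched one.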
3) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1820018)
Posted 2 days ago by Profile jason_gee
If I could nominate a WU to look at - but this may be more a Darwin (11.4.2) thing:

WU 2275390797


That's definitely the stock Darwin app issue or the host itself having issues (most likely the former).

State: All (1106) · In progress (21) · Validation pending (453) · Validation inconclusive (393) · Valid (69) · Invalid (170) · Error (0)


That gives us enough numbers to start pinning down some useful metrics: >10% invalid (0 invalid is normal), and an ~87% inconclusive-to-pending ratio (for a healthy app + system it is routinely lower than 5%, with the remainder reflecting problems elsewhere than on the local host).

Looks like the reissue is going to a reliable system with the reference application (reference by virtue of stock issue and weight of numbers). Compare:

State: All (751) · In progress (200) · Validation pending (335) · Validation inconclusive (15) · Valid (192) · Invalid (0) · Error (9)

0 invalids, ~4.5% inconclusive/pending.
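
If you want to eyeball your own host the same way, a tiny sketch of the arithmetic (the thresholds are the rules of thumb above, nothing official):

#include <cstdio>

// Quick health check from a host's task-state counts, using the
// rule-of-thumb thresholds discussed above (assumed, not official).
struct StateCounts { int pending, inconclusive, valid, invalid; };

void report(const char* label, StateCounts s)
{
    double inc = 100.0 * s.inconclusive / s.pending;
    printf("%s: %.1f%% inconclusive/pending, %d invalid -> %s\n",
           label, inc, s.invalid,
           (inc < 5.0 && s.invalid == 0) ? "healthy" : "suspect");
}

int main()
{
    report("Darwin host", {453, 393, 69, 170});  // ~86.8%, suspect
    report("reference",   {335,  15, 192,   0}); // ~4.5%, healthy
}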

Ignoring the single unrelated SoG error: 8 of the 9 errors are interesting, illustrating where some of the known issues with the 7.6.22 client and/or 7.7.0 boincapi/lib are (though fortunately they don't adversely affect other hosts, or trash the science db).
4) Message boards : SETI@home Science : Chinese FAST Officially Completed Today (Message 1819908)
Posted 2 days ago by Profile jason_gee
So, we might expect FAST to be used also for SETI@home?!


If the Chinese government were smarter than they are narcissistic, then maybe. My Chinese friends are skeptical and say the government is still corrupt, but at least clueless enough to throw some money around. [That money being debt from other countries, but I'm digressing into economics.]
5) Message boards : Technical News : Data Dump (May 17 2016) (Message 1819901)
Posted 2 days ago by Profile jason_gee
Last update as of May 2016 - I mean, really? It's not that I'd expect a weekly update, but once every half year seems a little bit weak.


More a case of legitimate logistics, in my opinion. We're seeing the volunteer community spread pretty thinly as well. Lots of announcements about dollars, but no apparent investment in what really matters (building the future).

[Edit:] Ivory backscratchers maybe ?
6) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1819807)
Posted 3 days ago by Profile jason_gee
@Richard, I'm not getting anything from the spreadsheet-generated address of http://boinc2.ssl.berkeley.edu/beta/download/13b/27jl16aa.19977.160094.8.42.65

Has something changed?
7) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1819649)
Posted 3 days ago by Profile jason_gee
The zi+a source definitely brought the validation characteristic into line, with the dominant inconclusives being the familiar dicey wingmen/apps. If you see more than the expected <5% inconclusive/pending ratio on OSX or Linux, then likely it'll be something build/platform specific, as opposed to any further lurking accuracy issues in the source.

Could you check a workunit we've been worrying over at Beta, please? Beta WU 8902774

The final replicant, task 24852195, is from Petri's own machine, reporting as 'x41p_zi3j, Cuda 8.00 special'. As posted in Beta message 59697, the final reported pulse seems to be out of tolerance for validation, although the 'best pulse' matches the values which should have been reported.


I can look at that one after a short work day, in the morning. Petri's own Linux build (p_zi3j) is likely significantly different from my Windows one (zipa2) and may not have the pulsefind fix (not sure about that).
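
For readers unfamiliar with why a pulse can be "out of tolerance": the validator compares the signals each wingman reports field by field, within some relative tolerance. A rough sketch of that kind of check (the 1% tolerance is an assumed figure for illustration, not the actual setiathome validator value):

#include <algorithm>
#include <cmath>
#include <cstdio>

// A relative-tolerance comparison of the kind a validator applies to
// each field of a reported signal; one out-of-tolerance field on the
// final reported pulse is enough to mark a result pair inconclusive.
bool within_tolerance(double a, double b, double rel_tol = 0.01)
{
    double scale = std::max(std::fabs(a), std::fabs(b));
    return scale == 0.0 || std::fabs(a - b) <= rel_tol * scale;
}

int main()
{
    // e.g. peak power of the final reported pulse from two wingmen:
    printf("%d\n", within_tolerance(1.000, 1.004)); // 1: close enough
    printf("%d\n", within_tolerance(1.000, 1.020)); // 0: inconclusive
}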
8) Message boards : Number crunching : Ratio of Guppies vs. Arecibo (Message 1819586)
Posted 3 days ago by Profile jason_gee
...
57451_19304 ----> 57451.19304

JD 57451.193040 is
BCE 4556 April 17 16:37:58.7 UT Wednesday


Hah! Pretty clever with telescopes those ancients.
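
(For the record, the joke works because 57451.19304 is a Modified Julian Date; read it as a plain Julian Date and you land thousands of years BCE, since JD 0 falls in 4713 BCE. A quick sketch of the conversion, using the standard Fliegel-Van Flandern day-number algorithm:)

#include <cmath>
#include <cstdio>

// The timestamp in the WU name is an MJD (MJD = JD - 2400000.5), not
// a JD -- hence the "ancients" joke. Fliegel & Van Flandern integer
// arithmetic converts the Julian day number to a Gregorian date.
void mjd_to_date(double mjd)
{
    long jdn = (long)(mjd + 2400000.5 + 0.5); // round JD to day number
    long l = jdn + 68569;
    long n = 4 * l / 146097;
    l = l - (146097 * n + 3) / 4;
    long i = 4000 * (l + 1) / 1461001;
    l = l - 1461 * i / 4 + 31;
    long j = 80 * l / 2447;
    long day = l - 2447 * j / 80;
    l = j / 11;
    long month = j + 2 - 12 * l;
    long year = 100 * (n - 49) + i + l;

    double frac = mjd - std::floor(mjd); // fraction of a day from 0h UT
    printf("MJD %.5f = %04ld-%02ld-%02ld, %05.2fh UT\n",
           mjd, year, month, day, frac * 24.0);
}

int main() { mjd_to_date(57451.19304); } // -> 2016-03-04, ~04.63h UT

So the tape dates from March 2016, not the Bronze Age.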
9) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1819582)
Posted 3 days ago by Profile jason_gee
A few steps closer here with Windows, having isolated some issues that will have to be designed for down the line (while test builds circulate). For alpha here, I'll just need to turn some workarounds for the blocking sync into parameters, as Windows is pretty tetchy about the duration of those improved long pulsefinds.

The zi+a source definitely brought the validation characteristic into line, with the dominant inconclusives being the familiar dicey wingmen/apps. If you see more than the expected <5% inconclusive/pending ratio on OSX or Linux, then likely it'll be something build/platform specific, as opposed to any further lurking accuracy issues in the source.
10) Message boards : Number crunching : Lack of SoG WUs (Message 1819573)
Posted 3 days ago by Profile jason_gee
The APR is calculated using averages. It's the lazy choice. Naturally we would choose the right tool for the right job, which usually would involve looking at the different kinds of work.

An APR based on Angle Range & WU source?

APR, being a BOINC measurement used for all applications and at all projects, can't take parochial constructs like AR into account. Simple example: Astropulse tasks don't have an AR.

The only project-specified value it has available to consider is <rsc_fpops_est>.


Correct, using old functional (software) design strategies and incomplete maths, generalised for use in as wide a context as possible. However, nowadays there are cleaner, easier-to-maintain ways to separate the domain-specific knowledge (e.g. AR) than rebuilding a whole splitter from source every time something new comes along.

Current method:
- either estimate, derive methodically, or guess <rsc_fpops_est> (well or badly)
-- this happens to include an AR-dependent term (hardwired in the splitter)
- scale estimates, for task issue, using averages

Proposed method:
- either estimate, derive methodically, or guess <rsc_fpops_est> (well or badly)
- formalise the domain (task) specific component (such as AR) as a plugin transfer function (currently there is a transfer function in multibeam, but it's not very plugin-able or useful for other work/project types)
- scale estimates, for task issue, using actual estimate-localisation techniques

The pros of the current method are that it's there and it works (sort of). The cons are that any changes/additions require building a new splitter to include the change (risky), that the estimates, reaching into scheduler issue time, are sensitive to the problems of using averages, and that there are no visible quality metrics to say whether it's right or wrong (other than users pointing and saying it looks borked when their estimates go wacky).

The pros of the equivalent but more refined second approach are the flexibility to make domain-specific changes (e.g. add a telescope) as determining parameters without having to rebuild/maintain core code, the ability to choose starting values from the closest existing set of knowledge, and that function + problems become isolated to their domain.

The current method is a bespoke solution to a general problem, where the general problem has been misidentified. That is, the problem as coded for in Boinc is making estimates for every kind of task on every project in the same way, when the real underlying problem is that unique tasks, hosts, and applications all require different estimates, and some adaptiveness.

I guess it might come down to the quest for one-size-fits-all solutions ignoring the information available, for the sake of a prescriptive regime that ends up guaranteeing projects have to modify core code when they'd rather be focused on their particular science.

There are established tools and techniques for making estimates for the purposes of control, and plain averages are nearly universally considered a poor choice.
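
To make the "plugin transfer function" idea concrete, a minimal sketch follows. All the names here are hypothetical; nothing like this exists in Boinc today. The point is just that the AR-dependent knowledge lives in a registered function, not in splitter/scheduler code:

#include <functional>
#include <map>
#include <string>

// Maps task-specific parameters (e.g. multibeam's angle range) to a
// multiplier on <rsc_fpops_est>, without touching core code.
using TransferFn = std::function<double(const std::map<std::string, double>&)>;

std::map<std::string, TransferFn> registry = {
    { "setiathome_v8", [](const auto& p) {
          double ar = p.at("angle_range");
          // Hypothetical AR-dependent cost curve: VLARs cost more.
          return ar < 0.12 ? 2.0 : (ar > 1.0 ? 0.8 : 1.0);
      } },
    { "astropulse_v7", [](const auto&) { return 1.0; } }, // no AR term
};

double scaled_fpops_est(const std::string& app, double rsc_fpops_est,
                        const std::map<std::string, double>& params)
{
    auto it = registry.find(app);
    return rsc_fpops_est * (it != registry.end() ? it->second(params) : 1.0);
}

Adding GBT/guppi work, or a new telescope, then means registering one more entry rather than rebuilding and redeploying a splitter.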
11) Message boards : Number crunching : Lack of SoG WUs (Message 1819454)
Posted 4 days ago by Profile jason_gee
The APR is calculated using averages. It's the lazy choice. Naturally we would choose the right tool for the right job, which usually would involve looking at the different kinds of work.
12) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819452)
Posted 4 days ago by Profile jason_gee
Yep, so assuming no-one owns a system 1% the speed of yours (storage-wise), and no-one uses their computer for anything other than posting on Seti@home NC, we should be good :D
13) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819440)
Posted 4 days ago by Profile jason_gee
what is the additional processing cost? Decreasing the checkpoint interval to 30 seconds would cut the lost processing in half, but how much would the additional checkpointing cost offset the savings in processing (disk thrashing notwithstanding)?


If a Windows update happened to be occurring, and that included a .NET framework update, on a dual-core CPU, potentially about 3-5 days could be lost. Obviously not directly a Boinc issue, though something that might well and truly warrant consideration, given that seti@home was promoted as a screensaver in the first place. Another angle might be a host with a crappy HDD... Both examples easily blow past the naive and arbitrary choice of 10 seconds hardcoded in Boinc code. [Edit: no options exposed in cc_config, AFAIK]

[Edit:] I should point out that I have no real problem with time-based constraints, only that introducing failures based on the assumption that hosts are doing nothing but crunching is borderline moronic.
14) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819438)
Posted 4 days ago by Profile jason_gee
The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.


The bigger (somewhat hidden) problem there is the number of points of failure, since the Boinc codebase assumes sequential activity in a timely fashion, then kills things. IOW the tradeoff is partly the overhead described, and partly whatever level of respect was previously offered and has since been sacrificed for the sake of completely ignoring computer science principles.

I'm thinking somewhat less abstract ;^), and more in terms of how many milliseconds are used writing each checkpoint. With Al's rescheduling runs now happening once per hour, and checkpoints at 60 seconds, that means about 30 seconds per task out of every "task hour" is lost to reprocessing (about 0.8%). So, if 60 checkpoints are taken in that hour, what is the additional processing cost? Decreasing the checkpoint interval to 30 seconds would cut the lost processing in half, but how much would the additional checkpointing cost offset the savings in processing (disk thrashing notwithstanding)? (Actually, with 56 cores and 6 GPUs, disk thrashing is probably already a given on that host.)


Yeah, I agree. For the most part everything's cool. The problems come with 'us weirdos', who either are naturally weirdos, or happen to be having some issue making us temporary weirdos. Then you run into situations where 'us weirdos' are the normal ones. That works well only if the clientele you're seeking aren't weirdos.

You see, it's the whole approach of cherry-picking types of users and computers that I personally take issue with. Their assumptions about time exclude new things. I like new things.
15) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819421)
Posted 4 days ago by Profile jason_gee
The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.


The bigger (somewhat hidden) problem there is the number of points of failure, since the Boinc codebase assumes sequential activity in a timely fashion, then kills things. IOW the tradeoff is partly the overhead described, and partly whatever level of respect was previously offered and has since been sacrificed for the sake of completely ignoring computer science principles.
16) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819406)
Posted 4 days ago by Profile jason_gee
Jason, sorry, I don't understand the mechanism, could you please explain the upsides/downsides to more/less frequent? Thanks!

*edit* I mean, I don't want to lose anything, obviously, but I also want it to run as efficiently as possible.


Sure, Al.
Let's say all tasks were 2+ hours long and you had a power failure. The last time each task saved information would be somewhere between when it started and the power failure; that might be every 60 seconds, or whatever you set. When the tasks resume, whatever information was saved lets them continue from there. Set the interval too long and, if you have failures, you potentially waste energy reprocessing work you already did. Too short, and you thrash your disks.
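
A back-of-envelope sketch of that tradeoff (the per-checkpoint I/O cost and the interruption rate are pure guesses; plug in your own):

#include <cstdio>

// Naive model of the checkpoint-interval tradeoff described above.
// All inputs are assumptions for illustration, not measured values.
int main()
{
    double task_seconds    = 2.0 * 3600; // a 2+ hour task
    double checkpoint_cost = 0.05;       // seconds of I/O per checkpoint (guess)
    double interruptions   = 1.0;        // expected interruptions per task (guess)

    const double intervals[] = {10.0, 60.0, 600.0, 3600.0};
    for (double interval : intervals) {
        double n_checkpoints = task_seconds / interval;
        double io_overhead   = n_checkpoints * checkpoint_cost;
        double lost_work     = interruptions * interval / 2.0; // average redo
        printf("interval %6.0fs: %5.1fs I/O + %6.1fs reprocessed = %7.1fs\n",
               interval, io_overhead, lost_work, io_overhead + lost_work);
    }
}

The crossover depends almost entirely on the real per-checkpoint cost and on how often your host actually gets interrupted, which is exactly why any hardcoded choice fits nobody.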
17) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819401)
Posted 4 days ago by Profile jason_gee
Just pulled it up, mine is set to 60 seconds. Too often?


Go with the rate you're willing to lose. I set mine to 1 hour (3600 seconds), though individual tasks will obviously write more often.
18) Message boards : Number crunching : The Saga Begins (LotsaCores 2.0) (Message 1819396)
Posted 4 days ago by Profile jason_gee
These 'odd things' come up from time to time, and in general they trace back to some poor design decisions. I'd be happy to detail those if requested; however, repeated attempts to submit fixes have been met with narcissistic crap, which I have no time for anymore.

[Edit:]
Example of thread safety: request shutdown, worker shuts down, acknowledges shutdown, shutdown happens.

Example of crap: Just kill things
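
As a sketch of that handshake (generic worker-thread code, not Boinc's actual API):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Cooperative shutdown: the worker only stops at a safe point, then
// acknowledges, and teardown happens after the ack -- as opposed to
// killing the process mid-write.
std::atomic<bool> shutdown_requested{false};
std::atomic<bool> shutdown_acknowledged{false};

void worker()
{
    while (!shutdown_requested.load()) {
        // ... do a bounded slice of work, checkpoint at safe points ...
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    // Safe point reached: flush state, then acknowledge.
    shutdown_acknowledged.store(true);
}

int main()
{
    std::thread t(worker);
    shutdown_requested.store(true); // 1. request shutdown
    t.join();                       // 2-3. worker winds down and acks
    std::printf("acked: %d\n", (int)shutdown_acknowledged.load());
    // 4. only now does the actual shutdown/teardown happen.
}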
19) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1818867)
Posted 6 days ago by Profile jason_gee
I looked it up earlier, but neglected to grab it (seeing no Cuda in the run, and being at work). From my brief look it seemed like a bunch of OpenCL results ganging up on a stock Linux CPU result that was probably fine.
20) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1818656)
Posted 7 days ago by Profile jason_gee
Hmmm, interesting. In the scheme of things, inconclusive overflows are a fairly minor thing compared to the woes of the past, though I'll stick them on the list of quirks to look for here (as their impact is likely to grow with throughput). So far the list contains that, the familiar Windows drivers not liking Boinc's shutdown methods, and some finer-grained load control needed for my crumby CPU (probably mostly a Windows-only issue, though it could come in handy elsewhere). I *think* the weekend should sort at least the last two.

[Edit:] I've yet to see inconclusive overflows here, so if you come across any such tasks it'd be great if you could save them, for later cross-platform comparison and refinement purposes.

[Edit2:] Naturally, as soon as I posted that, my host reached a run of them in its cache, so I'm grabbing at least one for inspection on the weekend.


