looking for accurate and simple metric for daily output

Message boards : Number crunching : looking for accurate and simple metric for daily output
Profile Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1802899 - Posted: 16 Jul 2016, 7:38:28 UTC

I am wondering if there has ever been any discussion about good metrics for daily output that do not involve creditNew.
In particular, I am wondering whether the "task count per day" (using BoincTasks) is any good for measuring differences from day to day when minor changes to the config are made at midnight.
My guess is that it would be, if the server(s) sent out task-type distributions that didn't vary much from day to day, or even from one computer to another.

But it might be possible to do so locally by keeping the cache at maximum and suspending the tasks that are known to cause significant variation, such as AP and non-guppi VLARs.
Would the AR value of the task need to be considered?
Are there other variables to consider in a simple yet accurate daily-output metric?
ID: 1802899
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22184
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1802900 - Posted: 16 Jul 2016, 7:58:08 UTC

The AR of the task needs to be considered, and given the rate of production of Arecibo MBs you might find those to be rather sporadic in nature.
Rather than suspend APs, it would be better not to have them in the first place, so set your preferences to stop them arriving. If you don't want to block them on all your hosts, configure one of the "venues" not to receive them, move the host under test to that venue for the tests, then move it back whence it came.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1802900
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1802902 - Posted: 16 Jul 2016, 8:21:35 UTC - in response to Message 1802899.  
Last modified: 16 Jul 2016, 8:42:49 UTC

Stepped on an ant nest :D Skip to end for <short_version>. (Sorry for the length)

There has been discussion/research with respect to the shortcomings of the scheduling/estimate/credit mechanisms for direct comparison/control. For multibeam specifically, the needed source data are the elapsed times, the theoretical peak flops, the unscaled fpops estimate already in the tasks (though it can be derived from AR + task type anyway), and optionally the CPU fraction and number of instances (which is less available, but could be inferred with better prediction).

The closest 'useful' metric is the APR (once settled), which connects to the majority of those parameters.
There are some problems with that APR, because averages are known to be sensitive to disturbance and slow to respond to change (specifically for estimates/control).

Using the existing crude APR, and ensuring things have 'settled', taking a median value over time will be more stable/predictable. A better implementation option is a running median of the source data, while a Kalman filter (linear or extended) is provably optimal instead, tunable to a desired response time (and yielding a useful covariance matrix for enhancing predictions or estimates for new platforms/applications/hosts/hardware).
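As a rough illustration of that filtering idea, here is a minimal scalar Kalman filter over per-task processing-rate samples; the random-walk model and the q/r noise values are my own illustrative assumptions, not anything taken from the BOINC code.

# Minimal sketch: scalar Kalman filter over processing-rate samples (GFLOP/s),
# assuming the true rate follows a random walk. The process noise q and the
# measurement noise r are illustrative assumptions, not tuned values.
def kalman_rate(samples, q=0.01, r=4.0):
    x = samples[0]     # current rate estimate
    p = 1.0            # estimate variance
    smoothed = []
    for z in samples:
        p = p + q                  # predict: the rate may have drifted
        k = p / (p + r)            # Kalman gain
        x = x + k * (z - x)        # correct with the new measurement
        p = (1.0 - k) * p
        smoothed.append(x)
    return smoothed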

Let's call the choice of APR, median-filtered APR, running median, or tuned Kalman estimate just 'PR' for 'processing rate', which gives the estimated GFlops. The original unscaled #ops for multibeam is within roughly +/- 10% of the actual compute operations, which IMO is useful enough for estimation/scheduling and comparison, and a lot more stable than Credit/RAC.

APR/theoretical_peak_flops gives 'compute efficiency', which is more useful on the development/optimisation side, and is mislabelled 'pfc_scale' in its current unstable implementation (unstable in engineering and mathematical terms).

It's on this total 'compute efficiency' that the current Cuda and OpenCL implementations trade blows between lower efficiency with more instances vs fewer or a single instance, in terms of total raw compute throughput ---> mostly complicated by CPU demand, limits on applicable hardware, and new breeds of hardware and application techniques rolling onstage.
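A trivial sketch of how those two figures interact; the peak and per-instance rates below are invented numbers for a hypothetical GPU, purely to show the arithmetic.

# Compute efficiency = APR / theoretical peak (the quantity mislabelled
# 'pfc_scale'). All figures are invented for a hypothetical GPU.
PEAK_GFLOPS = 4000.0

def compute_efficiency(apr_gflops, peak_gflops=PEAK_GFLOPS):
    return apr_gflops / peak_gflops

one_instance  = 1 * 350.0   # GFLOP/s with a single task on the device (assumed)
two_instances = 2 * 220.0   # lower per-instance rate, higher total (assumed)
print(compute_efficiency(one_instance), compute_efficiency(two_instances))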

Accumulated feedback from the mass scale can then be fed back to refine estimate quality (which GPS localisation already does, via sensor fusion, in mobile devices, so nothing new/special).

<short_version>
APR refined, i.e. [Filter PR] GFlops, would be the most useful, provided all the caveats with GFlops are considered, along with well-chosen indicators of the quality of that estimate (e.g. variance).
</short_version>
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1802902
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1802910 - Posted: 16 Jul 2016, 9:49:43 UTC - in response to Message 1802899.  

I am wondering if there has ever been any discussion about good metrics for daily output that do not involve creditNew.
In particular, I am wondering whether the "task count per day" (using BoincTasks) is any good for measuring differences from day to day when minor changes to the config are made at midnight.
My guess is that it would be, if the server(s) sent out task-type distributions that didn't vary much from day to day, or even from one computer to another.

But it might be possible to do so locally by keeping the cache at maximum and suspending the tasks that are known to cause significant variation, such as AP and non-guppi VLARs.
Would the AR value of the task need to be considered?
Are there other variables to consider in a simple yet accurate daily-output metric?

If you want to use such a metric as a quantitative indicator of host performance, it doesn't need to be an absolute one.
A relative metric can be established by plotting an Elapsed time vs AR curve (you have a script for producing such statistics now) for one app, computing "tasks of a particular AR per second", and then computing the same value for any other app, or for that app's different tuning lines. That way you will be able to compare different apps and different tunings (and different host configs as an aggregate value) with each other. Usually there is no need to establish a truly continuous time vs AR dependence. Roughly, tasks divide into 3 classes by AR (because different searches are enabled for these classes): VLAR, VHAR and mid-AR. Of those, mid-AR is the most inhomogeneous one.
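A small sketch of that relative comparison, assuming you already have (AR, elapsed seconds) pairs per app/tuning from the statistics script; the 0.05 AR bin width is an arbitrary illustrative choice.

# Sketch: "tasks of a particular AR per hour" for one app/tuning, then the
# per-bin ratio between two tunings of the same app.
from collections import defaultdict

def rate_by_ar(samples, bin_width=0.05):
    """samples: iterable of (ar, elapsed_seconds) pairs for one app/tuning."""
    totals = defaultdict(lambda: [0, 0.0])
    for ar, elapsed in samples:
        b = int(ar // bin_width)      # coarse AR bin index
        totals[b][0] += 1
        totals[b][1] += elapsed
    return {b: n / (t / 3600.0) for b, (n, t) in totals.items() if t > 0}

def compare_tunings(tuning_a, tuning_b):
    """Per-bin tasks/hour ratio of tuning_b relative to tuning_a."""
    ra, rb = rate_by_ar(tuning_a), rate_by_ar(tuning_b)
    return {b: rb[b] / ra[b] for b in ra if b in rb}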
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1802910
Profile Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1803009 - Posted: 16 Jul 2016, 20:54:33 UTC
Last modified: 16 Jul 2016, 21:29:39 UTC

Gents, thanks for your replies.
I forgot to add in my initial post that I'm trying to identify a metric for others to use who are not necessarily very advanced technically.
So here's a case scenario:
Pat (gender neutral):
- has good computer technician skills (learned mostly ad-hoc);
- is very good with Excel; but 
- only has very limited experience with modifying simple scripts written by others.
A month ago, s/he added the S@h project in Boinc and 
has spent a few hours per week getting familiar with the Boinc Advanced View interface.
During the last week, Pat started to read threads in the S@h forums 
and noticed there are numerous ways to increase daily output without installing additional hardware.
S/he's very interested in tweaking a few computer setups currently dedicated to processing S@h but before doing so, 
Pat is wondering what the daily output is for each computer in order to have a baseline to compare to when small changes are made.

Would counting the number of tasks processed per day with BoincTasks be a sufficient metric?
If not, would a tally of each task-type (guppi, nonVLAR, VLAR-nonGuppi, and AstroPulse) be sufficient? (with a baseline of 3 to X separate days in order to account for different combinations of task-types)
Should only specific task-types be processed on the GPU or CPU?
If none of the above, could a script be made to constrain what a stock setup would process, without it becoming an "anonymous platform" setup?

Notes:
- in the above case scenario, I am assuming that SoG would be the default stock app when running stock for a month.
- Obviously, Credit/day is not good enough, since it only includes tasks that have been processed by both Pat's host and the wingperson.
- [edit] I'm assuming the AR values would even out over large batches. Maybe a 24-hr period is not long enough; if so, please advise whether 2 days might be sufficient. [/edit]
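For what it's worth, here is a rough sketch of the per-day, per-type tally described above, based on a BoincTasks CSV export. The column names and the task-name matching rules are my assumptions about the file layout, not the documented BoincTasks format, so they would need adjusting.

# Sketch: tally completed tasks per day and per type from a BoincTasks export.
# The 'Completed time' and 'Task' column names, and the name-matching rules,
# are assumptions about the CSV layout; adjust them to match the actual file.
import csv
from collections import Counter

def task_type(name):
    n = name.lower()
    if n.startswith("ap"):
        return "AstroPulse"
    if "guppi" in n:
        return "guppi"
    if "vlar" in n:
        return "VLAR-nonGuppi"
    return "nonVLAR"

def daily_tally(path):
    tally = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            day = row["Completed time"].split(" ")[0]   # assumed date-first format
            tally[(day, task_type(row["Task"]))] += 1
    return tally

for (day, kind), count in sorted(daily_tally("history.csv").items()):
    print(day, kind, count)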
ID: 1803009
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1803077 - Posted: 17 Jul 2016, 4:23:28 UTC - in response to Message 1803009.  
Last modified: 17 Jul 2016, 4:50:42 UTC

You did specify 'Simple' in the title :) (as well as accurate)

For the simplicity part:
As angle range and telescope are complex factors affecting the number of operations, that rules out a single score (the different devices+apps perform differently depending on AR).

How about reducing to say 3 'scores':
'Long task' (e.g. VLAR and Guppies) performance: xxxx 'points'
'mid task' performance: yyyy 'points'
'shorty' performance: zzzz 'points'

where 'points' can come from whichever 'most accurate' metrics you find, absolute or relative (most likely GFlops, but tasks/day might work with only 3 boxes to worry about)

For the accuracy part:
Well to me that's the functional part that makes it useful for predicting some new device/app that hasn't accumulated enough data yet. Depends on what you want to use the figures for (e.g. comparing devices or applications).

In that case the 3 bins approach, taken for simplicity, might or might not be stable/accurate enough. The graphs in the other thread showing tasks/day with variance seem pretty intuitive to me. Perhaps 3 bins per device each with variance ?

If using averages, you might need several weeks to see a given device's data stabilise. Try the median instead. If the median comes out very different from the average, then you automatically know the data is skewed (e.g. the device is being used by the user sporadically, or by other projects/apps).
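Something like this quick check would flag that skew automatically; the 10% tolerance is an arbitrary assumption, and the sample counts are invented.

# Sketch: if a bin's daily counts have a mean noticeably different from the
# median, the sample is skewed and an average-based score will mislead.
from statistics import mean, median

def skew_check(daily_counts, tolerance=0.10):
    m, md = mean(daily_counts), median(daily_counts)
    return m, md, abs(m - md) > tolerance * md

print(skew_check([42, 44, 41, 43, 12]))   # one interrupted day drags the mean down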

Total performance score then could be the sum +/- the sum of the variances.

The problem here for me (over a long time) has been that the work keeps changing, so a synthetic benchmark might be needed to reflect the kinds of work and the changes in the work mix.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1803077
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1803085 - Posted: 17 Jul 2016, 5:03:56 UTC - in response to Message 1803077.  

The problem here for me (over a long time) has been that the work keeps changing, so a synthetic benchmark might be needed to reflect the kinds of work and the changes in the work mix.
____________

A moving target is a big problem.
ID: 1803085
Profile Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1803091 - Posted: 17 Jul 2016, 5:34:40 UTC - in response to Message 1803077.  

Ok...let's see if I understand:

From the 125 non-VLARs that I collected AR values for (for Raistmer), there is about a -84% correlation between AR and ElapsedTime under NV_SoG. Therefore the AR value (since it is currently difficult to get) is not strictly necessary, since the correlation is so high.

How about reducing to say 3 'scores':
'Long task' (e.g. VLAR and Guppies) performance: xxxx 'points'
'mid task' performance: yyyy 'points'
'shorty performance': zzzz 'points'

So could an equation such as this work: 100x + 50y + 1z ?
...where the numerical weights would be based on time, and would need to be figured out for each of the 3 task-types you are suggesting.
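For example, as a sketch (the 100/50/1 weights are just the illustrative figures above, and the task counts are invented for a before/after comparison):

# Sketch: weighted daily score from per-bin task counts. Weights are the
# illustrative 100/50/1 figures (roughly proportional to per-task run time).
WEIGHTS = {"long": 100, "mid": 50, "shorty": 1}

def daily_score(counts):
    """counts: tasks completed per bin in one day."""
    return sum(WEIGHTS[b] * n for b, n in counts.items())

baseline = daily_score({"long": 20, "mid": 35, "shorty": 60})   # before the tweak
tweaked  = daily_score({"long": 22, "mid": 36, "shorty": 58})   # after the tweak
print(baseline, tweaked, f"{(tweaked - baseline) / baseline:+.1%}")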

The variance could be minimized by determining a minimal count of tasks: in total and/or by some of the 3 types.

Therefore, on older PCs with a slower GPU & CPU, one day's worth of data might not be sufficient. My advanced-statistics knowledge hasn't been used in many years, but I think someone competent and not as rusty as me would be able to figure that out, right?

Keep in mind that the objective is to be able to compare if there is an improvement in throughput after a minor change (such as a new commandline, a new NV_SoG_r8xxx+1, swapping tasks).
Ideally, I would like to do that with BoincTasks' history.csv file imported into a spreadsheet that would do all the calculations for the Tweaker.

{I'm starting to feel like I'm about to reinvent/demystify CreditNew! lol}
ID: 1803091
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1803097 - Posted: 17 Jul 2016, 6:20:55 UTC - in response to Message 1803091.  
Last modified: 17 Jul 2016, 6:22:13 UTC

{I'm starting to feel like I'm about to reinvent/demystify CreditNew! lol}


Basically yes, and feels like progress to me :D

CreditNew is a bunch of numbers. A simpler and more functional (i.e. aesthetically tolerable and actually useful) metric would consider form and function, so your quest seems like a reasonable one to me, even if it's a more complex engineering challenge than you might have bargained for :)

So could an equation such as this work: 100x + 50y + 1z ?


If 3 bins proves useful/simple enough, perhaps. It'll just place a given host/device/app as a 3-dimensional point (or a blob or smear if variance is needed), and the overall and individual work mixes as volumes. If the axes were time, then best-in-class performance of hosts, applications and devices should drift toward the origin.

Starts to sound complicated again, but since you seem to be looking for a useful and easily interpreted representation, the balance is probably somewhere in the visualisation of it (even if completely different from what I would picture/describe at the moment).

Something like that, more as a developer than end-user, would tell me things like 'this application needs more attention on shorties' or 'this group of devices are lemons'
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1803097
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1803166 - Posted: 17 Jul 2016, 19:10:33 UTC - in response to Message 1803091.  


From 125 non-VLARS that I collected AR values for Raistmer, there is about -84% correlation with ElapsedTime under NV_SoG. Therefore, the AR value (since it is currently difficult to get) is not totally necessary since the correlation is so high.

There is some contradiction here.
"non-VLAR" already implies a correlation with AR: by definition, non-VLAR means AR > the VLAR threshold.
Also, most probably your 125 tasks missed any VHARs (or got them only in low numbers), so you are seeing only a single-bin distribution. Actually, the coarse-grained division is the "3 bins" (VLAR, VHAR, mid-AR), as mentioned a few times before.
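For reference, that coarse split is just a threshold test on AR; the numeric thresholds below are the commonly quoted multibeam values and should be treated as assumptions here rather than figures taken from this thread.

# Sketch: coarse 3-class split by angle range. Thresholds are the commonly
# quoted multibeam values, assumed here.
def ar_class(ar):
    if ar < 0.12:
        return "VLAR"
    if ar > 1.127:
        return "VHAR"
    return "mid-AR"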
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1803166
