Improved Benchmarking System Using Calibration Concepts

Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 179272 - Posted: 17 Oct 2005, 15:09:20 UTC

If you look at Improved Benchmarking System Using Calibration Concepts, you'll see I have a new proposal.

Read and comment here ...
ID: 179272
STE\/E
Volunteer tester
Joined: 29 Mar 03
Posts: 1137
Credit: 5,334,063
RAC: 0
United States
Message 179288 - Posted: 17 Oct 2005, 16:28:48 UTC

Participants that use optimized Science Applications may be cheated, or they may be cheating everyone that is not using optimized Science Applications. At this point no analysis conclusively proves this one way or another. But, for example, I added optimized SETI@Home Science Applications to my 8 computers and, with no other changes to my systems, my world position shifted from 1,050 to 910 over a period of a month or two. This indicates that all other things being equal at the moment, the user of the optimized Science Applications may be biasing the system.
=========

I've thought that myself about the optimized apps for a long time, Paul. I know the argument is "well, we do more WUs for science because we can process the WUs faster". But I think what it all boils down to for most people is inflating the benchmarks, and thus claiming higher credit.

Also, like you said, what about the people who don't use the optimized apps, or who don't visit the forums and don't even know about them, or who are just not inclined to use them? They are most definitely getting the short end of the credits, I would think.

ID: 179288
tekwyzrd
Volunteer tester
Joined: 21 Nov 01
Posts: 767
Credit: 30,009
RAC: 0
United States
Message 179316 - Posted: 17 Oct 2005, 17:44:39 UTC


# Those systems that obtain a Quality Persistency Count greater than 75 would be issued with a Level 3 Cross-Project Benchmark Certification.
# Those systems that obtain a Quality Persistency Count less than 25 would be issued with a Level 3 Cross-Project Benchmark Certification.


Typo?

Now we have used our system with a Level 1 Benchmark Certification as a "standard" to certify the performance of the other systems participating within the Quorum of Results. These systems should be awarded a Level 2 Benchmark Certification.

Systems that have a Level 2 Benchmark Certification can be used to award a Level 3 Benchmark Certification.

Systems that have a Level 3 Benchmark Certification cannot be used to award any certification. However, these systems are, next to the uncertified systems, the best target candidates for direct benchmark certification.


Kind of like the National Bureau of Standards traceability guidelines for the certification of TMDE (test, measurement, and diagnostic equipment).
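
For the curious, the quoted chain fits in a few lines. A minimal sketch (the function and names are mine, not the proposal's): a host may certify others only while its level is 1 or 2, and a newly certified host lands one level below its certifier.

def certify(certifier_level, benchmarks_agree):
    """Level awarded to a quorum partner, or None if none can be given."""
    if certifier_level not in (1, 2) or not benchmarks_agree:
        return None             # Level 3 and uncertified hosts certify nobody
    return certifier_level + 1  # Level 1 awards Level 2, Level 2 awards Level 3

assert certify(1, True) == 2
assert certify(2, True) == 3
assert certify(3, True) is None
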
Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.
Douglas Adams (1952 - 2001)
ID: 179316
MikeSW17
Volunteer tester
Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 179346 - Posted: 17 Oct 2005, 19:29:18 UTC

Seems to me that it's a bit late to go changing the benchmark/claim/grant process now.
Not yet having fully read and digested the proposal (sorry, Paul), I already have a (greedy) view...

After running BOINC for 16+ months I have developed certain expectations (of credit) from my 6 hosts. Any change that:

(a) Reduces MY Daily Credit - Not Acceptable
(b) Leaves MY Daily Credit unchanged - Don't give a damn.
(c) Increases MY Daily Credit - Wow! Great! Do It.

Frankly, I believe I have maximized my credit with appropriate combinations of optimized clients/BOINC, and a long work buffer. All of these 'adjustments' are available options to every participant so it is in no sense cheating.

Simply put, every BOINC participant falls into one of these categories.
There will be winners and losers in any change. There may be dissatisfied participants already, but why alienate another group of participants?
Some months ago I was in favour of a 'new' benchmark/credit system, but I fear it will not be a benefit to (selfish) ME <g>


ID: 179346
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 179347 - Posted: 17 Oct 2005, 19:32:46 UTC - in response to Message 179272.  

If you look at Improved Benchmarking System Using Calibration Concepts, you'll see I have a new proposal.

Read and comment here ...

Paul,

My only real comment is that tying the benchmark to work units is probably the best way to get an accurate result, but it is also impractical.

You'd almost have to pick one project as the "benchmark" project if credits are to be comparable across all projects.

I liked your observation that PrimeGrid is very heavy on integer divide performance, as opposed to SETI, which does lots of floating point. Good comment.

It has been widely reported that CPU cache size is important: the benchmark will run entirely out of the L1 cache on practically every machine, while a good portion of the SETI data set will fit in the L2 cache on machines where it's big enough.

There could very easily be a future BOINC project with a huge data set that is dependent on disk performance.

I think the ultimate answer is something like JM7's correction factor.
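
As a rough illustration of the correction-factor idea (my paraphrase, not JM7's actual code): keep a per-host factor and nudge it toward the granted/claimed ratio each time a result validates.

def update_factor(factor, claimed, granted, weight=0.1):
    """Move the host's factor one step toward the granted/claimed ratio."""
    if claimed <= 0:
        return factor
    return (1 - weight) * factor + weight * (granted / claimed)

factor = 1.0
for claimed, granted in [(18.0, 25.0), (18.0, 25.0), (19.0, 24.0)]:
    factor = update_factor(factor, claimed, granted)
print(round(factor, 3))  # ~1.093: creeps above 1.0, this host habitually under-claims
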
ID: 179347
Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 179359 - Posted: 17 Oct 2005, 20:17:38 UTC

Ned,

The whole point of using the full work unit would be to eliminate the problem that arises because the cache-size effect is not measured by the current benchmarks.

But I do not agree that we have to pick one project. To be honest, I think that if we REALLY did get a couple of instrumented applications, we would see very similar results, PrimeGrid being an exception.

Well, anyway, I will see what people say ...

Oh, and yes, there should be a *NOT* in there somewhere ...
ID: 179359
Dorsai
Joined: 7 Sep 04
Posts: 474
Credit: 4,504,838
RAC: 0
United Kingdom
Message 179366 - Posted: 17 Oct 2005, 20:38:40 UTC
Last modified: 17 Oct 2005, 20:46:13 UTC

I thought the theory of the system was that a WU that took "a long time" on one PC would take "a long time" on any PC that did it.
Then, a WU that took "a middle time" would take a "middle time" on all PCs, and so on.

The benchmark would then mean that all PCs that got the WU would claim the same, whether for a "long time", "middle time" or "no time" WU.

This in turn would mean that "long time" WUs would claim "lots of credit", etc.

The benchmark of the PC would mean that a slow PC would claim the same for a "long time" WU as a fast PC would. The difference being that a fast PC would do more WUs than a slow one; hence the fast PC gets more credit per unit of time.

But this is not the case. Some PCs claim 6 credits, and others claim 50, for the same WU.

The result?

Some get pissed off (not I), and there is no clear consensus regarding credit/WU.

The only fair way I can think of is to link claimed credit to a PC's "crunch-time average, calculated over X hundred WUs".

This makes the benchmark the work itself.
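
Something like this, perhaps (a sketch only; the window size and per-WU rate are hypothetical):

from collections import deque

class CrunchHistory:
    """Claim credit from a rolling average of crunch times,
    so the work itself becomes the benchmark."""

    def __init__(self, window=300):              # "X hundred WUs"
        self.times = deque(maxlen=window)

    def record(self, cpu_seconds):
        self.times.append(cpu_seconds)

    def claim(self, wu_cpu_seconds, credit_per_average_wu=25.0):
        if not self.times:
            return credit_per_average_wu         # no history yet
        average = sum(self.times) / len(self.times)
        return credit_per_average_wu * wu_cpu_seconds / average

A "long time" WU then claims proportionally more than the host's average WU, which matches the behaviour described above.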

But this option seems to be "not popular".

But I will carry on crunching, anyway.

Have fun all, and crunch on.

(edit) I am afraid the link posted by Mr Buck went so far over my head that my hair was not ruffled, i.e. I am not clever enough to understand it... (/edit)





Foamy is "Lord and Master".
(Oh, + some Classic WUs too.)
ID: 179366
spacemeat
Joined: 4 Oct 99
Posts: 239
Credit: 8,425,288
RAC: 0
United States
Message 179368 - Posted: 17 Oct 2005, 20:45:43 UTC - in response to Message 179288.  

I know the argument is "well, we do more WUs for science because we can process the WUs faster". But I think what it all boils down to for most people is inflating the benchmarks, and thus claiming higher credit.



If this were truly the case, people would be modifying the BOINC source with something simple, like a 2x factor on the claimed credit or benchmark score, not tweaking and optimizing benchmark methods to better reflect the science work being done by the project binary.
ID: 179368
STE\/E
Volunteer tester
Joined: 29 Mar 03
Posts: 1137
Credit: 5,334,063
RAC: 0
United States
Message 179371 - Posted: 17 Oct 2005, 21:12:58 UTC
Last modified: 17 Oct 2005, 21:14:58 UTC

I know this is a pretty simple way to give out credit, but what in the world would be wrong with just giving out credit based on how long it took you to process the WU? There's no need for benchmarks this way: if it takes you X minutes to process a WU, you would get X credit, no matter how fast or how slow you processed the WU.

Of course, the WU turned in would have to meet the quorum and be validated as a valid WU to start with. If the quorum was 3, then all 3 people would receive different credit, based on the amount of time it took them to process the WU ...

Example > you could award 0.6 Credits Per Minute, which would Equate to 6 Credits Per Hour, 12 Credits for 2 Hours of Processing and so on. Like I said there would be no need for Benchmarks or Optimized App's with a method like this.
ID: 179371
Sergey Broudkov
Joined: 24 May 04
Posts: 221
Credit: 561,897
RAC: 0
Russia
Message 179379 - Posted: 17 Oct 2005, 21:43:58 UTC - in response to Message 179346.  

Frankly, I believe I have maximized my credit with appropriate combinations of optimized clients/BOINC, and a long work buffer.


How can a long work buffer increase your credits? Only by not being idle during outages, or what do you mean?
Kitty@SETI team (Russia). Our cats also want to know if there is ETI out there
ID: 179379
Landroval
Joined: 7 Oct 01
Posts: 188
Credit: 2,098,881
RAC: 1
United States
Message 179380 - Posted: 17 Oct 2005, 21:50:42 UTC - in response to Message 179379.  
Last modified: 17 Oct 2005, 21:51:19 UTC

Frankly, I believe I have maximized my credit with appropriate combinations of optimized clients/BOINC, and a long work buffer.


How can a long work buffer increase your credits? Only by not being idle during outages, or what do you mean?


Say my computer, for whatever reason, claims a lower amount of credit on most workunits than most other computers do. Most claim (say) 25 credits/WU; I claim 18.

If I run a long queue, then the WU has probably validated and been assigned 25 credits by the time I turn it in. Thus the granted credit (25) is higher than my claimed credit (18).
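
In sketch form (assuming, purely for illustration, that granted credit is frozen once the quorum validates and that latecomers simply receive it):

import statistics

def granted_credit(quorum_claims):
    return statistics.median(quorum_claims)  # assumed granting rule

quorum = [24.0, 25.0, 26.0]    # results already returned and validated
late_claim = 18.0              # reported later, after a long queue
print(granted_credit(quorum))  # 25.0 -- the late, low claim never enters the vote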

Regards,

Brian

If you think education is expensive, try ignorance.
ID: 179380
Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 179381 - Posted: 17 Oct 2005, 22:11:19 UTC - in response to Message 179272.  

Read and comment here ...


....

... well, after a quick look at your latest "novel" ... :)


... If I didn't get lost in the text: for this to work at all, projects must add boinc_fpops_cumulative to their application, to get an accurate flops count to "calibrate" results against. Now, if I'm seriously screwed up and you don't need boinc_fpops_cumulative, the rest will not make much sense... :oops:


LHC@home has now had over 3 months to make their scheduling server "talk" to v5 clients, and it's over 4 months since v4.45 was released. So, either way, don't expect your "calibration concept" to be in use for another 3-6 months, whereas projects can start using boinc_fpops_cumulative with v5.2.x immediately after adding it (necessary in any case) to their application. Actually only v4.7x is needed, but I wouldn't really recommend anyone still be running those clients...


As mentioned in another post, even re-crunching the exact same WU can give at least 4% variation. The calibration method will be influenced by this variation, while using boinc_fpops_cumulative directly will not, so the latter will always give more accurate claimed credit.

For cheat protection, at first look the calibration method can seem better, but in any case two users must cheat on the same WU to get any benefit, so in practice, as long as the majority of users doesn't cheat, it shouldn't be a problem. On second look, if you use boinc_fpops_cumulative you can very easily add checks to flag any claimed credit that differs by more than 5% or so from the lowest claim, and take corrective measures if a user/computer is flagged too many times.
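
The flagging check described here is one line of logic (threshold as in the post; names mine):

def flagged_claims(claims, tolerance=0.05):
    """Claims more than 5% above the lowest claim in the quorum."""
    lowest = min(claims)
    return [c for c in claims if c > lowest * (1 + tolerance)]

print(flagged_claims([25.1, 25.3, 31.0]))  # [31.0]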

Not really a problem either way, but if results/day are kept the same, the calibration concept will increase server load, while starting to use boinc_fpops_cumulative will actually decrease the current scheduling-server load slightly.

...

Hmm, cross-project...

Well, computer A is 5% slower running the sulphur-cycle model than computer B is at running the normal slab model, while B is over 20% faster than A at running Einstein@home. So at least these two computers wouldn't work very well for any sort of cross-project calibration, but I can't really say it won't work for other computers...

...

If I haven't got lost by now, using the calibration concept would mean you'll need two applications: one "fast" without boinc_fpops_cumulative and one "slow" with boinc_fpops_cumulative.

The "perfect" application would be one that basically has two loops, one main loop counting n from 1 to k, and another smaller loop doing the same functions each time. After each smaller loop is done, some error-checking is done before n is increased or application terminates.
For this application, boinc_fpops_cumulative = n * constant, meaning there's no overhead and therefore shouldn't really be any reason to use the calibration-concept for crediting at all.
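
A mock-up of that shape (Python standing in for the real C application; boinc_fpops_cumulative() is stubbed here, and FLOPS_PER_PASS is assumed to have been counted once, offline):

FLOPS_PER_PASS = 1.8e9         # assumed: flops in one pass of the inner loop

def boinc_fpops_cumulative(fpops):
    """Stub standing in for the BOINC API call of the same name."""
    print("flops so far: %.3e" % fpops)

def inner_loop_work():
    pass                       # stands in for the fixed per-pass science

def results_look_sane():
    return True                # stands in for the error checking

def crunch(k):
    for n in range(1, k + 1):
        inner_loop_work()                           # same functions each pass
        if not results_look_sane():
            break
        boinc_fpops_cumulative(n * FLOPS_PER_PASS)  # n * constant, no per-op counting

crunch(3)  # reports 1.8e9, 3.6e9, 5.4e9 flops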

Normal applications will of course incur some overhead while counting flops, but if it's possible to keep this overhead small and mostly just multiply loop counts by various constants, you can again be down to only one application.

Since the normal variation in crunch times can be at least 4%, using boinc_fpops_cumulative doesn't need to be perfect, so it can take a couple of "short-cuts" in the calculation. As long as you're within 1% error and use less than 0.1% overhead in the calculation of total flops, it's unlikely the calibration concept will improve anything.


In any case, calibrating Win9x would be a lost cause, since Win9x doesn't have the slightest idea of what CPU time is...

... ... ...

Oh well, lastly, let's look at a "worst-case" calibration box:

On another forum last year, someone wondered why a dual Xeon 2.4GHz/HT with 2x512MB DDR266 reg/ecc was so slow running CPDN compared to a P4. After varying how many CPDN instances ran at once, the results were:

4 instances: 7.554 s/TS
3 instances: 5.441 s/TS
2 instances: 3.667 s/TS
1 instance: 2.626 s/TS

This gives the theoretical daily production:
4 instances: 4.24 trickles/day
3 instances: 4.46 trickles/day
2 instances: 4.38 trickles/day
1 instance: 3.05 trickles/day

In other words, the increase in production from running 2 or more instances, compared to running only 1, was just 40-45%; most likely the mediocre performance was due to limited memory bandwidth.
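
The arithmetic behind that figure, for anyone who wants to check it (throughput is instances divided by seconds per timestep, normalised to the single-instance case):

single = 1 / 2.626
for instances, s_per_ts in [(2, 3.667), (3, 5.441), (4, 7.554)]:
    gain = (instances / s_per_ts) / single - 1
    print("%d instances: +%.0f%%" % (instances, 100 * gain))
# prints +43%, +45%, +39%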

But while, for example, CPDN and SETI@home can be greatly influenced by memory speed, other BOINC projects like Einstein@home are, AFAIK, little influenced by memory.

So chances are that if this box shares with other projects and starts crunching 3 Einstein@home instances alongside one CPDN, it will maybe not get as low as 2.626 s/TS, but probably below 3 s/TS.


This means that if this dual Xeon is calibrated while running 4 CPDN instances, the error can be over 60% when it runs multi-project. If it is calibrated while running only 1 CPDN instance and 3 Einstein@home instances, the error can be over 150%...
ID: 179381
Sergey Broudkov
Joined: 24 May 04
Posts: 221
Credit: 561,897
RAC: 0
Russia
Message 179383 - Posted: 17 Oct 2005, 22:18:12 UTC - in response to Message 179380.  
Last modified: 17 Oct 2005, 22:18:34 UTC

How can a long work buffer increase your credits? Only by not being idle during outages, or what do you mean?


Say my computer, for whatever reason, claims a lower amount of credit on most workunits than most other computers do. Most claim (say) 25 credits/WU; I claim 18.

If I run a long queue, then the WU has probably validated and been assigned 25 credits by the time I turn it in. Thus the granted credit (25) is higher than my claimed credit (18).


Hmm, I see. But OTOH it raises your chances of being too late and getting nothing. There should be an optimum somewhere in between, where you're not the first but not the last, and return your results at just the right time (on average).

Kitty@SETI team (Russia). Our cats also want to know if there is ETI out there
ID: 179383
tekwyzrd
Volunteer tester
Joined: 21 Nov 01
Posts: 767
Credit: 30,009
RAC: 0
United States
Message 179388 - Posted: 17 Oct 2005, 22:35:10 UTC - in response to Message 179371.  


Example > you could award 0.6 Credits Per Minute, which would Equate to 6 Credits Per Hour, 12 Credits for 2 Hours of Processing and so on. Like I said there would be no need for Benchmarks or Optimized App's with a method like this.


10 minutes to the hour? What kind of clock are you using?
Hey, I think we've got one of them there ETs here! ;-D


Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.
Douglas Adams (1952 - 2001)
ID: 179388
Sergey Broudkov
Joined: 24 May 04
Posts: 221
Credit: 561,897
RAC: 0
Russia
Message 179402 - Posted: 17 Oct 2005, 22:59:49 UTC - in response to Message 179272.  

If you look at Improved Benchmarking System Using Calibration Concepts, you'll see I have a new proposal.

Read and comment here ...


At first glance (I haven't yet read it thoroughly), just one question regarding the Certification Levels: have you considered making this certification system continuous instead of discrete? That is, the Certification Level would instead be a Level of Confidence (or Trust; which is the better word?), say from 0 to 100. Your scheme would then be transformed into something like this: instead of

Now we have used our system with a Level 1 Benchmark Certification as a "standard" to certify the performance of the other systems participating within the Quorum of Results. These systems should be awarded a Level 2 Benchmark Certification.

Systems that have a Level 2 Benchmark Certification can be used to award a Level 3 Benchmark Certification.

Systems that have a Level 3 Benchmark Certification cannot be used to award any certification. However, these systems are, next to the uncertified systems, the best target candidates for direct benchmark certification.


I suggest that ANY system with a LoC higher than that of the system in question can be used to approve it and increase its LoC (maybe in some proportion to their difference).

Example (all values and factors are disputable):

System A - 90 LoC (out of 100 possible)
System B - 70
System C - 10

Then, if they give similar (comparable) benchmarks,

A will add nothing to its current LoC (a less trusted system can't increase it)
B will get (90 - 70) / 100 = 0.2 points from A's approval and become 70.2
C will similarly get 0.8 from A and 0.6 from B and become 11.4

What do you think?
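
If I've read it right, the rule is only a couple of lines. A sketch (names are mine; it assumes the benchmarks have already been judged comparable, and credits each gain against the original LoC, as in the example):

def approve(my_loc, partner_locs):
    """Raise a host's LoC from quorum partners with higher LoC."""
    gain = sum((p - my_loc) / 100 for p in partner_locs if p > my_loc)
    return my_loc + gain

print(round(approve(90, [70, 10]), 1))  # A: 90, unchanged
print(round(approve(70, [90]), 1))      # B: 70.2
print(round(approve(10, [90, 70]), 1))  # C: 11.4
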
Kitty@SETI team (Russia). Our cats also want to know if there is ETI out there
ID: 179402
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 21763
Credit: 7,508,002
RAC: 20
United Kingdom
Message 179406 - Posted: 17 Oct 2005, 23:03:19 UTC - in response to Message 179381.  
Last modified: 17 Oct 2005, 23:10:06 UTC

Read and comment here ...

[...]
In any case, calibrating Win9x would be a lost cause, since Win9x doesn't have the slightest idea of what CPU time is...

Those old Windows machines are always going to be a problem, and that is exactly where having a calibrated credit result in the quorum for such machines is such a good idea!

... ... ...

Oh well, lastly, let's look at a "worst-case" calibration box: ...

Well, you can't get much worse than the present benchmark scenario! These are two recent results for my machine:

Result     Workunit  Status             CPU time (s)  Claimed  Granted
126615513  30125636  Over Success Done  8,789.51      27.18    27.18
123543333  29375588  Over Success Done  8,505.98      26.30    10.87


The same claimed credit from me, but a 2x variation in the granted credit!


Using a calibration trail as proposed, starting with just one calibrated client, you can have a million others calibrated at most 13 WUs later. This was discussed some time ago.

If you have a golden "Cobble Computer" to use as the calibration source, you just need this one machine to calibrate WUs from all the projects to start off the calibration trail. You get as many calibrated WUs as the "Cobble Computer" can work through, and then an exponential increase in calibrated WUs as you work through more levels of the "calibration standard". This works cross-project, except for projects such as CPDN that award a fixed constant credit.
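
The "13 WUs" figure checks out if you assume a quorum of 3, i.e. each calibrated host certifies two partners per WU, tripling the calibrated population each round:

hosts, rounds = 1, 0
while hosts < 1_000_000:
    hosts *= 3            # each host certifies the other 2 in its quorum
    rounds += 1
print(rounds)             # 13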

To thwart cheating, the beauty of all this is that the calibrations could be managed entirely on the trusted side, on the BOINC server, during validation.

Secure certificates could be used to make sure the same (untrusted) host is correctly identified, to stop cheaters gaining calibration for a fast machine that is then swapped for a slow machine! (Well, that trick would be picked up in the quorum checks anyhow.)


OK. Paul's mini-novel can be shortened. Regardless, I'm very much in favour of the "calibration trail" to award credits. No flops counting is required on the clients if enough calibration WUs are already seeded. Meanwhile, for the months of transition, the present benchmark and fiddle factors can continue until they are made redundant by the improved server-side calibration scheme.

I very much like that the credit scoring can then all be done on the trusted server side. It then just leaves the cheats trying to juggle the clocks without tripping over the +/-5% limit and getting slammed into the dungeons of zero!

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 179406
STE\/E
Volunteer tester
Joined: 29 Mar 03
Posts: 1137
Credit: 5,334,063
RAC: 0
United States
Message 179412 - Posted: 17 Oct 2005, 23:27:37 UTC - in response to Message 179388.  


Example > you could award 0.6 Credits Per Minute, which would Equate to 6 Credits Per Hour, 12 Credits for 2 Hours of Processing and so on. Like I said there would be no need for Benchmarks or Optimized App's with a method like this.


10 minutes to the hour? What kind of clock are you using?
Hey, I think we've got one of them there ETs here! ;-D



hehe ... Yup, I got that wrong alright; it should have read: Example > you could award 0.06 Credits Per Minute, which would equate to 3.6 Credits Per Hour, 7.2 Credits for 2 Hours of Processing and so on.

It was just an example to show what I was getting at; I would expect the credits to be higher than that anyway ... ;)
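
For what it's worth, the corrected rule in code form (the rate is illustrative only):

def time_based_credit(cpu_seconds, credits_per_minute=0.06):
    """Pay credit for validated CPU time alone -- no benchmarks needed."""
    return credits_per_minute * cpu_seconds / 60

print(time_based_credit(3600))  # 3.6 credits for one hour
print(time_based_credit(7200))  # 7.2 credits for two hours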

ID: 179412
Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 179456 - Posted: 18 Oct 2005, 1:43:46 UTC - in response to Message 179272.  

Paul

I read through the proposal and am late with my comments... The carpet had to come up first...

So, one thought:
What I saw was organized and "do-able," maybe not easy...

I think what I saw and read represents a fix for a long-standing problem. Currently, because I have taken the time to "optimize" my machines, I have a small advantage: I get in sooner with a lower claim, and when the broken validation process hits, I win! I do not receive "representative credit"; I receive "more credit" because I came in first/lowest... And because I have an optimized core client, this spreads to my second project, Einstein... A better reflection of claimed credit, and I lose less when that validation occurs...

I still hate to think about the person running the PII 350 who just lost their shirt... They lost on the benchmark and on the time to complete science that should be properly credited...

If you look at Improved Benchmarking System Using Calibration Concepts, you'll see I have a new proposal.

Read and comment here ...


If the credit all my machines receive is properly reflected, then I will be happier... It would mean that I am being given "appropriate credit" for what I donate! So would many others...


R/

Al

Please consider a Donation to the Seti Project.

ID: 179456
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19720
Credit: 40,757,560
RAC: 67
United Kingdom
Message 179511 - Posted: 18 Oct 2005, 4:08:11 UTC

Paul

Just read your proposal, and I must read it more thoroughly soon, but my first comment is: haven't you missed some points which would help sway the argument for a better benchmarking system?

The points being:


  • improved projected crunch times
  • the correct number of units in the work cache, as per the connect-to-network variable
  • better scheduling performance
  • better communications management for requesting work and reporting units



At the moment, JM7's fine work on the correction factor is just a 'band-aid' to get the scheduler working correctly; with a proper benchmark it would not be needed.
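
Roughly, as I understand it (a paraphrase of the idea, not the client's actual code), the correction factor scales the benchmark-based runtime estimate by the host's observed actual-to-estimated ratio:

def estimate_runtime(wu_fpops_est, host_flops, correction=1.0):
    """Projected crunch time from the benchmark, scaled by the factor."""
    return correction * wu_fpops_est / host_flops

def update_correction(correction, actual_s, estimated_s, weight=0.1):
    """After each finished WU, drift the factor toward reality."""
    return (1 - weight) * correction + weight * (actual_s / estimated_s)

With a proper benchmark the factor would sit at 1.0 and could be dropped, which is the point above.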

Andy

ID: 179511
Tern
Volunteer tester
Joined: 4 Dec 03
Posts: 1122
Credit: 13,376,822
RAC: 44
United States
Message 179571 - Posted: 18 Oct 2005, 8:53:36 UTC

I need to re-read the proposal after having more sleep, but if you addressed MY number one complaint with the current system, at least "in so many words" (I know it was implied), I missed it...

Your approach would allow different projects to have different "benchmarks". Currently, if you run more than one project and use optimized science applications on one or more of them, you have to choose whether or not to run an optimized BOINC client; one way you under-request work on one set of projects, the other way you over-request on the other set. This solves that once and for all: everyone would request the correct amount of work for each project.

Also (again, probably the lack of sleep talking), I got lost on the first pass and it sounded way "too complicated" - then I finally realized what you were saying, and it was like "OH! That's easy!"... I think a lot of people are going to be even more confused by this than by the "simple" benchmarking (which confuses enough people as it is...), so I think a lot of examples are going to be needed to help people understand.
ID: 179571