Improved Benchmarking System Using Calibration Concepts

Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 179272 - Posted: 17 Oct 2005, 15:09:20 UTC

If you look at Improved Benchmarking System Using Calibration Concepts, you'll see I have a new proposal.

Read and comment here ...
ID: 179272
STE\/E
Volunteer tester
Joined: 29 Mar 03
Posts: 1137
Credit: 5,334,063
RAC: 0
United States
Message 179288 - Posted: 17 Oct 2005, 16:28:48 UTC

Participants that use optimized Science Applications may be cheated, or they may be cheating everyone that is not using optimized Science Applications. At this point no analysis conclusively proves this one way or another. But, for example, I added optimized SETI@Home Science Applications to my 8 computers and, with no other changes to my systems, my world position shifted from 1,050 to 910 over a period of a month or two. This indicates that all other things being equal at the moment, the user of the optimized Science Applications may be biasing the system.
=========

I've thought that myself about the optimized apps for a long time, Paul. I know the argument is "well, we do more WUs for science because we can process the WUs faster". But I think what it all boils down to for most people is inflating the benchmarks, and thus claiming higher credit.

Also, like you said, what about the people who don't use the optimized apps, or who don't visit the forums and don't even know about them, or who are just not inclined to use them? They are most definitely getting the short end of the credits, I would think.

ID: 179288
tekwyzrd
Volunteer tester
Joined: 21 Nov 01
Posts: 767
Credit: 30,009
RAC: 0
United States
Message 179316 - Posted: 17 Oct 2005, 17:44:39 UTC


# Those systems that obtain a Quality Persistency Count greater than 75 would be issued with a Level 3 Cross-Project Benchmark Certification.
# Those systems that obtain a Quality Persistency Count less than 25 would be issued with a Level 3 Cross-Project Benchmark Certification.


Typo?

Now we have used our system with a Level 1 Benchmark Certification as a "standard" to certify the performance of the other systems participating within the Quorum of Results. These systems should be awarded a Level 2 Benchmark Certification.

Systems that have a Level 2 Benchmark Certification can be used to award a Level 3 Benchmark Certification.

Systems that have a Level 3 Benchmark Certification cannot be used to award any certification. However, these systems are, next to the uncertified systems, the best target candidates for direct benchmark certification.


Kind of like the National Bureau of Standards traceability guidelines for the certification of TMDE (test, measurement, and diagnostic equipment).
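
For the curious, the quoted chain fits in a few lines. A minimal sketch (the function and names are mine, not the proposal's): a host may certify others only while its level is 1 or 2, and a newly certified host lands one level below its certifier.

def certify(certifier_level, benchmarks_agree):
    """Level awarded to a quorum partner, or None if none can be given."""
    if certifier_level not in (1, 2) or not benchmarks_agree:
        return None             # Level 3 and uncertified hosts certify nobody
    return certifier_level + 1  # Level 1 awards Level 2, Level 2 awards Level 3

assert certify(1, True) == 2
assert certify(2, True) == 3
assert certify(3, True) is None
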
Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.
Douglas Adams (1952 - 2001)
ID: 179316
MikeSW17
Volunteer tester
Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 179346 - Posted: 17 Oct 2005, 19:29:18 UTC

Seems to me that it's a bit late to go changing the benchmark/claim/grant process now.
Not yet having fully read and digested the proposal (sorry, Paul), I already have a (greedy) view...

After running BOINC for 16+ months I have developed certain expectations (of credit) from my 6 hosts. Any change that:

(a) Reduces MY Daily Credit - Not Acceptable
(b) Leaves MY Daily Credit unchanged - Don't give a damn.
(c) Increases MY Daily Credit - Wow! Great! Do It.

Frankly, I believe I have maximized my credit with appropriate combinations of optimized clients/BOINC, and a long work buffer. All of these 'adjustments' are available options to every participant so it is in no sense cheating.

Simply put, every BOINC participant falls into one of these categories.
There will be winners and losers in any change. There may be dissatisfied participants already, but why alienate another group of participants?
Some months ago I was in favour of a 'new' benchmark/credit system, but I fear it will not be a benefit to (selfish) ME <g>


ID: 179346
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 179347 - Posted: 17 Oct 2005, 19:32:46 UTC - in response to Message 179272.  

If you look at Improved Benchmarking System Using Calibration Concepts, you'll see I have a new proposal.

Read and comment here ...

Paul,

My only real comment is that tying the benchmark to work units is probably the best way to get an accurate result, but it is also impractical.

You'd almost have to pick one project as the "benchmark" project if credits are to be comparable across all projects.

I liked your observation that PrimeGrid is very heavy on integer divide performance, as opposed to SETI, which does lots of floating point. Good comment.

It has been widely reported that CPU cache size is important: the benchmark will run entirely out of the L1 cache on practically every machine, while a good portion of the SETI data set will fit in the L2 cache on machines where it's big enough.

There could very easily be a future BOINC project with a huge data set that is dependent on disk performance.

I think the ultimate answer is something like JM7's correction factor.
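
As a rough illustration of the correction-factor idea (my paraphrase, not JM7's actual code): keep a per-host factor and nudge it toward the granted/claimed ratio each time a result validates.

def update_factor(factor, claimed, granted, weight=0.1):
    """Move the host's factor one step toward the granted/claimed ratio."""
    if claimed <= 0:
        return factor
    return (1 - weight) * factor + weight * (granted / claimed)

factor = 1.0
for claimed, granted in [(18.0, 25.0), (18.0, 25.0), (19.0, 24.0)]:
    factor = update_factor(factor, claimed, granted)
print(round(factor, 3))  # ~1.093: creeps above 1.0, this host habitually under-claims
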
ID: 179347
Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 179359 - Posted: 17 Oct 2005, 20:17:38 UTC

Ned,

The whole point of using the full work unit would be to eliminate the problem that arises because the cache-size effect is not measured by the current benchmarks.

But I do not agree that we have to pick one project. To be honest, I think that if we REALLY did get a couple of instrumented applications, we would see very similar results, PrimeGrid being an exception.

Well, anyway, I will see what people say ...

Oh, and yes, there should be a *NOT* in there somewhere ...
ID: 179359
Dorsai
Joined: 7 Sep 04
Posts: 474
Credit: 4,504,838
RAC: 0
United Kingdom
Message 179366 - Posted: 17 Oct 2005, 20:38:40 UTC
Last modified: 17 Oct 2005, 20:46:13 UTC

I thought the theory of the system was that a WU that took "a long time" on one PC would take "a long time" on any PC that did it.
Then, a WU that took "a middle time" would take a "middle time" on all PCs, and so on.

The benchmark would then mean that all PCs that got the WU would claim the same, whether for a "long time", "middle time" or "no time" WU.

This in turn would mean that "long time" WUs would claim "lots of credit", etc.

The benchmark of the PC would mean that a slow PC would claim the same for a "long time" WU as a fast PC would. The difference being that a fast PC would do more WUs than a slow one; hence the fast PC gets more credit per unit of time.

But this is not the case. Some PCs claim 6 credits, and others claim 50, for the same WU.

The result?

Some get pissed off (not I), and there is no clear consensus regarding credit/WU.

The only fair way I can think of is to link claimed credit to a PC's "crunch-time average, calculated over X hundred WUs".

This makes the benchmark the work itself.
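
Something like this, perhaps (a sketch only; the window size and per-WU rate are hypothetical):

from collections import deque

class CrunchHistory:
    """Claim credit from a rolling average of crunch times,
    so the work itself becomes the benchmark."""

    def __init__(self, window=300):              # "X hundred WUs"
        self.times = deque(maxlen=window)

    def record(self, cpu_seconds):
        self.times.append(cpu_seconds)

    def claim(self, wu_cpu_seconds, credit_per_average_wu=25.0):
        if not self.times:
            return credit_per_average_wu         # no history yet
        average = sum(self.times) / len(self.times)
        return credit_per_average_wu * wu_cpu_seconds / average

A "long time" WU then claims proportionally more than the host's average WU, which matches the behaviour described above.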

But this option seems to be "not popular".

But I will carry on crunching, anyway.

Have fun all, and crunch on.

(edit) I am afraid the link posted by Mr Buck went so far over my head that my hair was not ruffled, i.e. I am not clever enough to understand it... (/edit)





Foamy is "Lord and Master".
(Oh, + some Classic WUs too.)
ID: 179366
spacemeat
Joined: 4 Oct 99
Posts: 239
Credit: 8,425,288
RAC: 0
United States
Message 179368 - Posted: 17 Oct 2005, 20:45:43 UTC - in response to Message 179288.  

I know the argument is "well, we do more WUs for science because we can process the WUs faster". But I think what it all boils down to for most people is inflating the benchmarks, and thus claiming higher credit.



If this were truly the case, people would be modifying the BOINC source with something simple, like a 2x factor on the claimed credit or benchmark score, not tweaking and optimizing benchmark methods to better reflect the science work being done by the project binary.
ID: 179368
STE\/E
Volunteer tester
Joined: 29 Mar 03
Posts: 1137
Credit: 5,334,063
RAC: 0
United States
Message 179371 - Posted: 17 Oct 2005, 21:12:58 UTC
Last modified: 17 Oct 2005, 21:14:58 UTC

I know this is a pretty simple way to give out credit, but what in the world would be wrong with just giving out credit based on how long it took you to process the WU? There's no need for benchmarks this way: if it takes you X minutes to process a WU, you would get X credit, no matter how fast or how slow you processed the WU.

Of course, the WU turned in would have to meet the quorum and be validated as a valid WU to start with. If the quorum was 3, then all 3 people would receive different credit, based on the amount of time it took them to process the WU ...

Example > you could award 0.6 Credits Per Minute, which would Equate to 6 Credits Per Hour, 12 Credits for 2 Hours of Processing and so on. Like I said there would be no need for Benchmarks or Optimized App's with a method like this.
ID: 179371
Sergey Broudkov
Joined: 24 May 04
Posts: 221
Credit: 561,897
RAC: 0
Russia
Message 179379 - Posted: 17 Oct 2005, 21:43:58 UTC - in response to Message 179346.  

Frankly, I believe I have maximized my credit with appropriate combinations of optimized clients/BOINC, and a long work buffer.


How can a long work buffer increase your credits? Only by not being idle during outages, or what do you mean?
Kitty@SETI team (Russia). Our cats also want to know if there is ETI out there
ID: 179379
Landroval
Joined: 7 Oct 01
Posts: 188
Credit: 2,098,881
RAC: 1
United States
Message 179380 - Posted: 17 Oct 2005, 21:50:42 UTC - in response to Message 179379.  
Last modified: 17 Oct 2005, 21:51:19 UTC

Frankly, I believe I have maximized my credit with appropriate combinations of optimized clients/BOINC, and a long work buffer.


How can a long work buffer increase your credits? Only by not being idle during outages, or what do you mean?


Say my computer, for whatever reason, claims a lower amount of credit on most workunits than most other computers do. Most claim (say) 25 credits/WU; I claim 18.

If I run a long queue, then the WU has probably validated and been assigned 25 credits by the time I turn it in. Thus the granted credit (25) is higher than my claimed credit (18).
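
In sketch form (assuming, purely for illustration, that granted credit is frozen once the quorum validates and that latecomers simply receive it):

import statistics

def granted_credit(quorum_claims):
    return statistics.median(quorum_claims)  # assumed granting rule

quorum = [24.0, 25.0, 26.0]    # results already returned and validated
late_claim = 18.0              # reported later, after a long queue
print(granted_credit(quorum))  # 25.0 -- the late, low claim never enters the vote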

Regards,

Brian

If you think education is expensive, try ignorance.
ID: 179380
Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 179381 - Posted: 17 Oct 2005, 22:11:19 UTC - in response to Message 179272.  

Read and comment here ...


....

... well, after a quick look at your latest "novel" ... :)


... If I didn't get lost in the text: for this to work at all, projects must add boinc_fpops_cumulative to their application, to get an accurate flops count to "calibrate" results against. Now, if I'm seriously screwed up and you don't need boinc_fpops_cumulative, the rest will not make much sense... :oops:


LHC@home has now had over 3 months to make their scheduling server "talk" to v5 clients, and it's over 4 months since v4.45 was released. So, either way, don't expect your "calibration concept" to be in use for another 3-6 months, whereas projects can start using boinc_fpops_cumulative with v5.2.x immediately after adding it (necessary in any case) to their application. Actually only v4.7x is needed, but I wouldn't really recommend anyone still be running those clients...


As mentioned in another post, even re-crunching the exact same WU can give at least 4% variation. The calibration method will be influenced by this variation, while using boinc_fpops_cumulative directly will not, so the latter will always give more accurate claimed credit.

For cheat protection, at first look the calibration method can seem better, but in any case two users must cheat on the same WU to get any benefit, so in practice, as long as the majority of users doesn't cheat, it shouldn't be a problem. On second look, if you use boinc_fpops_cumulative you can very easily add checks to flag any claimed credit that differs by more than 5% or so from the lowest claim, and take corrective measures if a user/computer is flagged too many times.
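
The flagging check described here is one line of logic (threshold as in the post; names mine):

def flagged_claims(claims, tolerance=0.05):
    """Claims more than 5% above the lowest claim in the quorum."""
    lowest = min(claims)
    return [c for c in claims if c > lowest * (1 + tolerance)]

print(flagged_claims([25.1, 25.3, 31.0]))  # [31.0]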

Not really a problem either way, but if results/day are kept the same, the calibration concept will increase server load, while starting to use boinc_fpops_cumulative will actually decrease the current scheduling-server load slightly.

...

Hmm, cross-project...

Well, computer A is 5% slower running the sulphur-cycle model than computer B is at running the normal slab model, while B is over 20% faster than A at running Einstein@home. So at least these two computers wouldn't work very well for any sort of cross-project calibration, but I can't really say it won't work for other computers...

...

If I haven't got lost by now, using the calibration concept would mean you'll need two applications: one "fast" without boinc_fpops_cumulative and one "slow" with boinc_fpops_cumulative.

The "perfect" application would be one that basically has two loops, one main loop counting n from 1 to k, and another smaller loop doing the same functions each time. After each smaller loop is done, some error-checking is done before n is increased or application terminates.
For this application, boinc_fpops_cumulative = n * constant, meaning there's no overhead and therefore shouldn't really be any reason to use the calibration-concept for crediting at all.
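
A mock-up of that shape (Python standing in for the real C application; boinc_fpops_cumulative() is stubbed here, and FLOPS_PER_PASS is assumed to have been counted once, offline):

FLOPS_PER_PASS = 1.8e9         # assumed: flops in one pass of the inner loop

def boinc_fpops_cumulative(fpops):
    """Stub standing in for the BOINC API call of the same name."""
    print("flops so far: %.3e" % fpops)

def inner_loop_work():
    pass                       # stands in for the fixed per-pass science

def results_look_sane():
    return True                # stands in for the error checking

def crunch(k):
    for n in range(1, k + 1):
        inner_loop_work()                           # same functions each pass
        if not results_look_sane():
            break
        boinc_fpops_cumulative(n * FLOPS_PER_PASS)  # n * constant, no per-op counting

crunch(3)  # reports 1.8e9, 3.6e9, 5.4e9 flops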

Normal applications will of course incur some overhead while counting flops, but if it's possible to keep this overhead small and mostly just multiply loop counts by various constants, you can again be down to only one application.

Since the normal variation in crunch times can be at least 4%, using boinc_fpops_cumulative doesn't need to be perfect, so it can take a couple of "short-cuts" in the calculation. As long as you're within 1% error and use less than 0.1% overhead in the calculation of total flops, it's unlikely the calibration concept will improve anything.


In any case, calibrating Win9x would be a lost cause, since Win9x doesn't have the slightest idea of what CPU time is...

... ... ...

Oh well, lastly, let's look at a "worst-case" calibration box:

On another forum last year, someone wondered why a dual Xeon 2.4GHz/HT with 2x512MB DDR266 reg/ecc was so slow running CPDN compared to a P4. After varying how many CPDN instances ran at once, the results were:

4 instances: 7.554 s/TS
3 instances: 5.441 s/TS
2 instances: 3.667 s/TS
1 instance: 2.626 s/TS

This gives the theoretical daily production:
4 instances: 4.24 trickles/day
3 instances: 4.46 trickles/day
2 instances: 4.38 trickles/day
1 instance: 3.05 trickles/day

In other words, the increase in production from running 2 or more instances, compared to running only 1, was just 40-45%; most likely the mediocre performance was due to limited memory bandwidth.
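
The arithmetic behind that figure, for anyone who wants to check it (throughput is instances divided by seconds per timestep, normalised to the single-instance case):

single = 1 / 2.626
for instances, s_per_ts in [(2, 3.667), (3, 5.441), (4, 7.554)]:
    gain = (instances / s_per_ts) / single - 1
    print("%d instances: +%.0f%%" % (instances, 100 * gain))
# prints +43%, +45%, +39%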

But while, for example, CPDN and SETI@home can be greatly influenced by memory speed, other BOINC projects like Einstein@home are, AFAIK, little influenced by memory.

So chances are that if this box shares with other projects and starts crunching 3 Einstein@home instances alongside one CPDN, it will maybe not get as low as 2.626 s/TS, but probably below 3 s/TS.


This means that if this dual Xeon is calibrated while running 4 CPDN instances, the error can be over 60% when it runs multi-project. If it is calibrated while running only 1 CPDN instance and 3 Einstein@home instances, the error can be over 150%...
ID: 179381
Sergey Broudkov
Joined: 24 May 04
Posts: 221
Credit: 561,897
RAC: 0
Russia
Message 179383 - Posted: 17 Oct 2005, 22:18:12 UTC - in response to Message 179380.  
Last modified: 17 Oct 2005, 22:18:34 UTC

How can a long work buffer increase your credits? Only by not being idle during outages, or what do you mean?


Say my computer, for whatever reason, claims a lower amount of credit on most workunits than most other computers do. Most claim (say) 25 credits/WU; I claim 18.

If I run a long queue, then the WU has probably validated and been assigned 25 credits by the time I turn it in. Thus the granted credit (25) is higher than my claimed credit (18).


Hmm, I see. But OTOH it raises your chances of being too late and getting nothing. There should be an optimum somewhere in between, where you're not the first but not the last, and return your results at just the right time (on average).

Kitty@SETI team (Russia). Our cats also want to know if there is ETI out there
ID: 179383
tekwyzrd
Volunteer tester
Joined: 21 Nov 01
Posts: 767
Credit: 30,009
RAC: 0
United States
Message 179388 - Posted: 17 Oct 2005, 22:35:10 UTC - in response to Message 179371.  


Example > you could award 0.6 Credits Per Minute, which would Equate to 6 Credits Per Hour, 12 Credits for 2 Hours of Processing and so on. Like I said there would be no need for Benchmarks or Optimized App's with a method like this.


10 minutes to the hour? What kind of clock are you using?
Hey, I think we've got one of them there ETs here! ;-D


Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.
Douglas Adams (1952 - 2001)
ID: 179388
Sergey Broudkov
Joined: 24 May 04
Posts: 221
Credit: 561,897
RAC: 0
Russia
Message 179402 - Posted: 17 Oct 2005, 22:59:49 UTC - in response to Message 179272.  

If you look at Improved Benchmarking System Using Calibration Concepts, you'll see I have a new proposal.

Read and comment here ...


At first glance (I haven't yet read it thoroughly), just one question regarding the Certification Levels: have you considered making this certification system continuous instead of discrete? That is, the Certification Level would instead be a Level of Confidence (or Trust; which is the better word?), say from 0 to 100. Your scheme would then be transformed into something like this: instead of

Now we have used our system with a Level 1 Benchmark Certification as a "standard" to certify the performance of the other systems participating within the Quorum of Results. These systems should be awarded a Level 2 Benchmark Certification.

Systems that have a Level 2 Benchmark Certification can be used to award a Level 3 Benchmark Certification.

Systems that have a Level 3 Benchmark Certification cannot be used to award any certification. However, these systems are, next to the uncertified systems, the best target candidates for direct benchmark certification.


I suggest that ANY system with a LoC higher than that of the system in question can be used to approve it and increase its LoC (maybe in some proportion to their difference).

Example (all values and factors are disputable):

System A - 90 LoC (out of 100 possible)
System B - 70
System C - 10

Then, if they give similar (comparable) benchmarks,

A will add nothing to its current LoC (a less trusted system can't increase it)
B will get (90 - 70) / 100 = 0.2 points from A's approval and become 70.2
C will similarly get 0.8 from A and 0.6 from B and become 11.4

What do you think?
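
If I've read it right, the rule is only a couple of lines. A sketch (names are mine; it assumes the benchmarks have already been judged comparable, and credits each gain against the original LoC, as in the example):

def approve(my_loc, partner_locs):
    """Raise a host's LoC from quorum partners with higher LoC."""
    gain = sum((p - my_loc) / 100 for p in partner_locs if p > my_loc)
    return my_loc + gain

print(round(approve(90, [70, 10]), 1))  # A: 90, unchanged
print(round(approve(70, [90]), 1))      # B: 70.2
print(round(approve(10, [90, 70]), 1))  # C: 11.4
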
Kitty@SETI team (Russia). Our cats also want to know if there is ETI out there
ID: 179402
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 21763
Credit: 7,508,002
RAC: 20
United Kingdom
Message 179406 - Posted: 17 Oct 2005, 23:03:19 UTC - in response to Message 179381.  
Last modified: 17 Oct 2005, 23:10:06 UTC

Read and comment here ...

[...]
In any case, calibrating Win9x would be a lost cause, since Win9x doesn't have the slightest idea of what CPU time is...

Those old Windows machines are always going to be a problem, and that is exactly where having a calibrated credit result in the quorum for such machines is such a good idea!

... ... ...

Oh well, lastly, let's look at a "worst-case" calibration box: ...

Well, you can't get much worse than the present benchmark scenario! These are two recent results for my machine:

Result     Workunit  Status             CPU time (s)  Claimed  Granted
126615513  30125636  Over Success Done  8,789.51      27.18    27.18
123543333  29375588  Over Success Done  8,505.98      26.30    10.87


The same claimed credit from me, but a 2x variation in the granted credit!


Using a calibration trail as proposed, starting with just one calibrated client, you can have a million others calibrated at most 13 WUs later. This was discussed some time ago.

If you have a golden "Cobble Computer" to use as the calibration source, you just need this one machine to calibrate WUs from all the projects to start off the calibration trail. You get as many calibrated WUs as the "Cobble Computer" can work through, and then an exponential increase in calibrated WUs as you work through more levels of the "calibration standard". This works cross-project, except for projects such as CPDN that award a fixed constant credit.
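
The "13 WUs" figure checks out if you assume a quorum of 3, i.e. each calibrated host certifies two partners per WU, tripling the calibrated population each round:

hosts, rounds = 1, 0
while hosts < 1_000_000:
    hosts *= 3            # each host certifies the other 2 in its quorum
    rounds += 1
print(rounds)             # 13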

To thwart cheating, the beauty of all this is that the calibrations could be managed entirely on the trusted side, on the BOINC server, during validation.

Secure certificates could be used to make sure the same (untrusted) host is correctly identified, to stop cheaters gaining calibration for a fast machine that is then swapped for a slow machine! (Well, that trick would be picked up in the quorum checks anyhow.)


OK. Paul's mini-novel can be shortened. Regardless, I'm very much in favour of the "calibration trail" to award credits. No flops counting is required on the clients if enough calibration WUs are already seeded. Meanwhile, for the months of transition, the present benchmark and fiddle factors can continue until they are made redundant by the improved server-side calibration scheme.

I very much like that the credit scoring can then all be done on the trusted server side. It then just leaves the cheats trying to juggle the clocks without tripping over the +/-5% limit and getting slammed into the dungeons of zero!

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 179406
STE\/E
Volunteer tester
Joined: 29 Mar 03
Posts: 1137
Credit: 5,334,063
RAC: 0
United States
Message 179412 - Posted: 17 Oct 2005, 23:27:37 UTC - in response to Message 179388.  


Example > you could award 0.6 Credits Per Minute, which would Equate to 6 Credits Per Hour, 12 Credits for 2 Hours of Processing and so on. Like I said there would be no need for Benchmarks or Optimized App's with a method like this.


10 minutes to the hour? What kind of clock are you using?
Hey, I think we've got one of them there ETs here! ;-D



hehe ... Yup, I got that wrong alright; it should have read: Example > you could award 0.06 Credits Per Minute, which would equate to 3.6 Credits Per Hour, 7.2 Credits for 2 Hours of Processing and so on.

It was just an example to show what I was getting at; I would expect the credits to be higher than that anyway ... ;)
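
For what it's worth, the corrected rule in code form (the rate is illustrative only):

def time_based_credit(cpu_seconds, credits_per_minute=0.06):
    """Pay credit for validated CPU time alone -- no benchmarks needed."""
    return credits_per_minute * cpu_seconds / 60

print(time_based_credit(3600))  # 3.6 credits for one hour
print(time_based_credit(7200))  # 7.2 credits for two hours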

ID: 179412
Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 179456 - Posted: 18 Oct 2005, 1:43:46 UTC - in response to Message 179272.  

Paul

I read through the proposal and am late with my comments... The carpet had to come up first...

So, one thought:
What I saw was organized and "do-able," maybe not easy...

I think what I saw and read represents a fix for a long-standing problem. Currently, because I have taken the time to "optimize" my machines, I have a small advantage: I get in sooner with a lower claim, and when the broken validation process hits, I win! I do not receive "representative credit"; I receive "more credit" because I came in first/lowest... And because I have an optimized core client, this spreads to my second project, Einstein... A better reflection of claimed credit, and I lose less when that validation occurs...

I still hate to think about the person running the PII 350 who just lost their shirt... They lost on the benchmark and on the time to complete science that should be properly credited...

If you look at Improved Benchmarking System Using Calibration Concepts, you'll see I have a new proposal.

Read and comment here ...


If the credit all my machines receive is properly reflected, then I will be happier... It would mean that I am being given "appropriate credit" for what I donate! So would many others...


R/

Al

Please consider a Donation to the Seti Project.

ID: 179456
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19720
Credit: 40,757,560
RAC: 67
United Kingdom
Message 179511 - Posted: 18 Oct 2005, 4:08:11 UTC

Paul

Just read your proposal, and I must read it more thoroughly soon, but my first comment is: haven't you missed some points which would help sway the argument for a better benchmarking system?

The points being:


  • improved projected crunch times
  • the correct number of units in the work cache, as per the connect-to-network variable
  • better scheduling performance
  • better communications management for requesting work and reporting units



At the moment, JM7's fine work on the correction factor is just a 'band-aid' to get the scheduler working correctly; with a proper benchmark it would not be needed.
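
Roughly, as I understand it (a paraphrase of the idea, not the client's actual code), the correction factor scales the benchmark-based runtime estimate by the host's observed actual-to-estimated ratio:

def estimate_runtime(wu_fpops_est, host_flops, correction=1.0):
    """Projected crunch time from the benchmark, scaled by the factor."""
    return correction * wu_fpops_est / host_flops

def update_correction(correction, actual_s, estimated_s, weight=0.1):
    """After each finished WU, drift the factor toward reality."""
    return (1 - weight) * correction + weight * (actual_s / estimated_s)

With a proper benchmark the factor would sit at 1.0 and could be dropped, which is the point above.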

Andy

ID: 179511
Tern
Volunteer tester
Joined: 4 Dec 03
Posts: 1122
Credit: 13,376,822
RAC: 44
United States
Message 179571 - Posted: 18 Oct 2005, 8:53:36 UTC

I need to re-read the proposal after having more sleep, but if you addressed MY number one complaint with the current system, at least "in so many words" (I know it was implied), I missed it...

Your approach would allow different projects to have different "benchmarks". Currently, if you run more than one project and use optimized science applications on one or more of them, you have to choose whether or not to run an optimized BOINC client; one way you under-request work on one set of projects, the other way you over-request on the other set. This solves that once and for all: everyone would request the correct amount of work for each project.

Also (again, probably the lack of sleep talking), I got lost on the first pass and it sounded way "too complicated" - then I finally realized what you were saying, and it was like "OH! That's easy!"... I think a lot of people are going to be even more confused by this than by the "simple" benchmarking (which confuses enough people as it is...), so I think a lot of examples are going to be needed to help people understand.
ID: 179571