Message boards :
Technical News :
Thursday Thoughts (Oct 02 2008)
Author | Message |
---|---|
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Not much to report, really. We had a couple of blips or brownouts which were minor and easily corrected. Mostly spending my day working on R&D type stuff (MySQL replication, radar blanking, etc.) and data pipeline management - this included boxing up freshly reformatted drives to ship to Arecibo. One thing in the works, maybe, is changing the workunit redundancy to effectively zero. There is already a mechanism in BOINC to "trust" hosts that continually return validated work. These hosts are then sent workunits that only they will process (no redundant "wingman"). No validation is required (or actually possible) upon returning the result, and no waiting on others for credit, either. Of course, even trusted hosts will get occasional tests to prove they are still trustworthy. Plus there are quick tests we can do on the backend in lieu of "comparison validation." Other pros for doing this include using half the resources for the same amount of science (hooray!) and potentially getting through our backlog of data twice as fast. The cons are mostly concerns. If we try to keep up with current demand for work we'd have to run twice as many splitters, which is impossible given our current resources (we'd at least need more CPUs, more disks, and better disk I/O). Or we could split at today's rate and regularly run out of work, which might upset some people. And if we do increase our splitter production rate and burn through our data, we will be even more likely to run out of work on a regular basis (since we can't pad fresh data with old data once the old data is used up). Just some thoughts for now. We haven't really decided on anything yet. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
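The "trusted host" scheme Matt describes can be sketched in a few lines of server-side logic. This is purely illustrative: the trust threshold, spot-check rate, and every name below are assumptions for the sake of the sketch, not BOINC's actual scheduler code.

```python
import random

TRUST_THRESHOLD = 10   # consecutive valid results before a host is "trusted" (assumed)
SPOT_CHECK_RATE = 0.1  # fraction of a trusted host's tasks still sent redundantly (assumed)

class Host:
    def __init__(self):
        self.consecutive_valid = 0

    @property
    def trusted(self):
        return self.consecutive_valid >= TRUST_THRESHOLD

def replication_for(host, rng=random.random):
    """How many copies of a workunit to issue.

    Untrusted hosts always get a redundant "wingman"; trusted hosts
    usually run alone, but are occasionally spot-checked so they must
    keep proving themselves.
    """
    if not host.trusted or rng() < SPOT_CHECK_RATE:
        return 2   # comparison validation against a wingman
    return 1       # single result; only backend sanity checks apply

def record_result(host, validated):
    # A single invalid result resets trust, so a host that goes bad
    # drops back to redundant processing.
    host.consecutive_valid = host.consecutive_valid + 1 if validated else 0
```

Under this sketch, the "half the resources" gain comes from the trusted population running mostly at replication 1 instead of 2.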
Blurf Send message Joined: 2 Sep 06 Posts: 8962 Credit: 12,678,685 RAC: 0 |
Thanks, Matt. |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
I think I kind of prefer having one wingman. Having two wingmen was a little overkill on redundancy, but I've noticed on the full MB WUs, and more noticeably on the AP WUs, that if a Core 2 and an AMD are paired up, aside from the crunch time, the claimed credit is either really close or somewhere between 30-60 apart, with the lower one being the granted credit for both. I'm not entirely sure why such differences show up, since the MB WUs seem to have a very consistent claim value for all systems, but I have noticed it is there. For instance, my Opteron 2210s usually claim 740-745 credits, and I've seen some Core 2s claim 720-771. Sometimes I'm the one that takes a small hit in granted credit, sometimes the wingman gets hit. I know the vast majority of the participants/volunteers get upset when data runs out, but having a 3-5 day cache helps the situation some, as well. Personally, I'd be fine if we had to go a few days without WUs, and I know some others feel the same way; they would rather be crunching than not, but as gets mentioned during extended outages/failures, there are other projects that can be crunched in the interim. As far as disk I/O and capacity, I support RAID 5, though RAID 10 does have better I/O; the only downside is the decrease in usable storage. I was doing some reading about RAID 5 the other day, and it turns out that the optimal number of disks is nine: performance increases significantly with a hardware controller up to nine disks, and then the gains taper off and plateau. Also, last week I mentioned the Areca RAID controller, and I have seen numerous reports that if you get the model with the SO-DIMM slot on it and put a 1 GB module in there, the burst and sustained read/write increase 20-50%. Just more things to think about, as if there weren't enough already. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
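The RAID 5 versus RAID 10 capacity trade-off mentioned above is easy to quantify: RAID 5 spends one disk's worth of space on parity, while RAID 10 mirrors everything and halves the raw capacity. A minimal sketch (the function name and example disk sizes are illustrative, not from the thread):

```python
def usable_capacity(n_disks, disk_size_tb, level):
    """Usable storage for a few common RAID levels.

    RAID 5: one disk's worth of parity (n-1 usable).
    RAID 6: two disks' worth of parity (n-2 usable), shown for comparison.
    RAID 10: mirrored pairs, so half the raw capacity.
    """
    raw = n_disks * disk_size_tb
    if level == 5:
        return raw - disk_size_tb
    if level == 6:
        return raw - 2 * disk_size_tb
    if level == 10:
        if n_disks % 2 != 0:
            raise ValueError("RAID 10 needs an even disk count")
        return raw / 2
    raise ValueError(f"unsupported RAID level: {level}")
```

For example, nine 1 TB disks give 8 TB usable in RAID 5, while eight disks in RAID 10 give only 4 TB, which is the "decrease in usable storage" being weighed against RAID 10's better I/O.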
Dr. C.E.T.I. Send message Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
|
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
Boxing up freshly reformatted drives to ship to Arecibo. What good news! How long will it take for the drives to be returned filled with new data, and how many tapes are there to be processed while the formatted drives get filled up again? Thanks for the news updates and all the hard work. Matt and team, have a great weekend. Speedy |
John McLeod VII Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
I am all for reducing redundancy if it is not needed. I would, however, worry about credit cheats and how to stop them. Perhaps only those hosts with a recent enough version of BOINC to use FLOPS counting can be trusted? Could it also be the case that any host with a sufficiently recent, FLOPS-counting version of BOINC that shows a significant disagreement over the credit request becomes untrusted for a while? Then you just have to worry about how often you have to do a check, and how fast a host becomes trusted. BOINC WIKI |
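A credit-claim check along the lines John suggests could be very simple: compare the claim against a FLOPS-derived estimate and flag large disagreements. The function name, the estimate, and the 25% tolerance below are all assumptions for illustration; BOINC's real validator logic is not shown here.

```python
def credit_claim_sane(claimed, flops_estimate, tolerance=0.25):
    """Accept a credit claim only if it falls within `tolerance`
    (fractional) of the FLOPS-derived estimate.

    A False result would be grounds for marking the host untrusted
    for a while, as suggested in the post above.
    """
    if flops_estimate <= 0:
        return False  # no usable estimate; treat as suspect
    return abs(claimed - flops_estimate) / flops_estimate <= tolerance
```

The tolerance controls the trade-off John names: tighter checks catch cheats faster but also untrust honest hosts more often, so how quickly a host regains trust matters.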
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 30648 Credit: 53,134,872 RAC: 32 |
One thing in the works, maybe, is changing the workunit redundancy to effectively zero. There is already the mechanism in BOINC to "trust" hosts that continually return validated work. These hosts are then sent workunits that only they will have to process (not a redundant "wingman"). No validation is required (or actually possible) upon returning the result, and no waiting on others for credit, either. Of course, even trusted hosts will get occasional tests to prove they are still trustworthy. Plus there are quick tests we can do on the backend in lieu of "comparison validation." Other pros for doing this include using half the resources for the same amount of science (hooray!) and potentially getting through our backlog of data twice as fast. Just wondering: are we able to crunch faster than the data is collected? I'm not asking about the burst speed, because I'm sure the receiver records much faster than we process, but more like an average, whether that is month to month or quarter to quarter. Gary |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
Just my thoughts....... Don't do it. There are too many examples of hosts gone wacky to trust the science of your project to single reported results. Even my rigs, much OC'd but basically trustworthy, have been known to go off on walkabout once in a while and start doing strange things when the RAM gets a bit confused....... I would much rather wait for a wingman to confirm my results than report something that is not valid science and have no cross-check on what I have reported. It is your project, and DO do whatever you are comfortable with and see fit to implement...... Again, just my thoughts. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
H Elzinga Send message Joined: 20 Aug 99 Posts: 125 Credit: 8,277,116 RAC: 0 |
Just my thoughts....... Agreed. I recently had a power failure. One host took too much time to shut down, so the UPS gave out first. Two results (dual-CPU machine) showed a "checked but no consensus" state and were reissued; after that they were thrown out. Both got reported a few hours after the power failure. This host takes a minimum of 20h to process a unit, so these had to be running at that moment. |
Sirius B Send message Joined: 26 Dec 00 Posts: 24879 Credit: 3,081,182 RAC: 7 |
Just my thoughts....... Definitely agree. Ramsey brought down my farm yesterday - that's all projects, not just Ramsey. Under a trust-based system, that incident would have killed the trust already built up. Don't Do It! |
ML1 Send message Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20 |
Just my thoughts....... I think trust is the keyword there... It would be trivially easy to start inflating the credit returns when a client notices that it has a singular WU. It would then be a game of how brazenly the credits could be bent until the 'trust' is eventually lost... Also, is there not scientific merit in that all the signals listed in the Master Science Database have been validated? Other studies may use that data if the data is known to be reliable. Universe background noise studies? Find the general direction of ET by noticing a rise in the noise floor?? Keep searchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
MarkJ Send message Joined: 17 Feb 08 Posts: 1139 Credit: 80,854,192 RAC: 5 |
One thing in the works, maybe, is changing the workunit redundancy to effectively zero. There is already the mechanism in BOINC to "trust" hosts that continually return validated work. I'd have to agree with the other guys: I wouldn't "trust" any host, mine included. I would recommend we stick with the current arrangements. Thanks for the update Matt; as always, nice to know what's happening. BOINC blog |
Geek@Play Send message Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0 |
One thing in the works, maybe, is changing the workunit redundancy to effectively zero. There is already the mechanism in BOINC to "trust" hosts that continually return validated work. Don't do it! I can wait on my wingman....no problem. The work must be valid before insertion into the scientific data base or questions will arise. Boinc....Boinc....Boinc....Boinc.... |
ML1 Send message Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20 |
Don't do it! I can wait on my wingman....no problem... Hey! More importantly, there'd be no more forum posts for pending-credit angst and wingman chasing! The forums would die!!... That just can't be done! :-p Happy crunchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 30648 Credit: 53,134,872 RAC: 32 |
A story about a computer. Some time back I had a machine that was very happy and crunching lots of work. Then one day I noticed a work unit that didn't validate. Strange. As time went on over a month, it had more work units that didn't validate, but it always had good work units too. I began to suspect the machine might have a problem and made sure backups were being made and were readable. Then one day the machine was dead. If not for wingmen I wouldn't have noticed something was wrong. Gary |
Clyde C. Phillips, III Send message Joined: 2 Aug 00 Posts: 1851 Credit: 5,955,047 RAC: 0 |
Just using one cruncher sounds nice. I wonder what the error rate has been lately? For example, can one be certain that 99 percent of all results crunched are correct? |
KyleFL Send message Joined: 20 May 99 Posts: 17 Credit: 2,332,249 RAC: 0 |
I'll have to agree with the previous posters. I feel safer with a wingman double-checking a task that one of my hosts crunched. 1. You get confirmation that the result you provided is correct. 2. You can check your crunching speed against your wingman, because he crunched the exact same WU. The crunching power will rise in the future, as newer and faster CPUs come onto the market. Of course it can be tempting to double the speed with a single switchover, but I personally think the risk would be greater than the benefit. Cu KyleFL |
jim little Send message Joined: 3 Apr 99 Posts: 112 Credit: 915,934 RAC: 0 |
I had a few, very few, where the wingman had near-zero results while mine were in the teens or perhaps a hundred. Choosing the smaller may not be using correct data! A difference of a digit or two in a value in the teens or hundreds seems no great problem, but a coarse screening when really large disagreements occur would wave a red flag for me. Most of my wingman agreements are within one or two units' difference, which is no big deal. (I hope!) Since I am running on two Macs, one a portable with a dual processor and the other a big Mac with two dual processors, both should give similar answers. Comparing with other Intel chips should also agree; other brands of CPU might show differences. I quit using my older processors, as the new ones are so much faster, and a single dual processor in the portable is a fast machine. BTW, the portable is more efficient than the big box: about 35 watts for two processors, while the big one uses 245 watts for four. Both have a power factor of 0.99, so the watts and true energy are nearly identical. One of these days I want to try the energy use of a dual quad machine. No, I am not going to buy one: one new machine a year is almost too much, and I used that one this spring for the portable. Final thought: most data units will be uninteresting, but it only takes one BINGO...... duke |
[KWSN]John Galt 007 Send message Joined: 9 Nov 99 Posts: 2444 Credit: 25,086,197 RAC: 0 |
Just my thoughts....... Like the time you reported 28,212,776,635,318,302,094,458,356,388,446,235,923,207, 741,744,470,434,104,453,978,968,420,464,012,487,728,615,460,126,025,888,922,495, 329,680,983,667,039,227,542,120,518,375,956,424,460,107,196,736,376,759,573,283, 400,799,476,702,873,620,993,300,697,266,428,815,162,638,403,940,600,522,997,760.00 credits... Could you imagine THAT??? All in good humor, my friend... Clk2HlpSetiCty:::PayIt4ward |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
I think that the Adaptive replication feature would be very worthwhile if the servers can deliver the additional work and effective sanity checks can be provided for uncompared results. The developers did think about possible abusers, so in that mode a Workunit page display does not include the table showing which host(s) have been tasked to produce a result. A user won't know if his host's result will be compared or accepted without comparison, so there's little motivation to play puerile games. It's not perfect, but maybe good enough. OTOH, I'd like to see it modified so that after a user's host has uploaded and reported a result, that user can see full detail on the Workunit. If the project set the criteria for which hosts are considered reliable such that about 20% of active hosts are excluded, even reliable hosts will be doing work requiring validation nearly 20% of the time. I think that would be more than adequate to catch hosts which develop a problem before they've done significant harm. Bear in mind that 99.999...% of "signals" in the master science database are actually random noise, we just hope it isn't 100%. Adaptive replication in effect just adds another small noise factor. Any concern about credit claims could fairly easily be handled by the sanity checks which would replace actual comparative validation. That could be made as tight as necessary, even forcing a reissue to another host in questionable circumstances. Joe |
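Josef's point about catching misbehaving hosts quickly can be made concrete. If some fraction of a trusted host's tasks is still comparison-validated, the number of tasks a newly faulty host completes before one of them is checked follows a geometric distribution. A small sketch (names and the rates are illustrative):

```python
def expected_tasks_until_caught(check_fraction):
    """Expected number of tasks a newly faulty host completes before
    one of them is comparison-validated, assuming each task is
    independently checked with probability `check_fraction`
    (geometric distribution, mean 1/p).
    """
    if not 0 < check_fraction <= 1:
        raise ValueError("check_fraction must be in (0, 1]")
    return 1 / check_fraction
```

At the roughly 20% validation rate Josef estimates, a host that develops a problem would be caught after about five tasks on average, which supports the "before they've done significant harm" argument.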
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.