Father Padilla Meets the Perfect Gnat (Dec 03 2007)



Message boards : Technical News : Father Padilla Meets the Perfect Gnat (Dec 03 2007)

Author Message
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 688417 - Posted: 3 Dec 2007, 22:16:10 UTC

I was out of town all weekend (on the east coast visiting family) but didn't miss much around here. However, we did have a long server meeting this morning, as many things are afoot.

First off, our power outage from last Thursday is now rescheduled for this upcoming Thursday (see notice on the front page). We're hyper-prepared now, so outside of shutting everything down Thursday afternoon and resurrecting the whole project Friday morning, it should be a breeze.

There was discussion about our current workunit storage woes. Namely, we need more, and we have an immediate plan to make more (converting barely-used archive storage). This is because of our 2/2 redundancy, i.e. we send out two redundant workunits and need two results to validate. This means a large number of users finish their workunits quickly, but have to wait for their "partner" (or "wingman") to return the other before validating, during which time the workunit is stuck on disk taking up space. Months ago when we were 3/2 we'd send out three redundant workunits and only need 2 to validate, which means the workunit stays on disk only as long as the two fastest machines take to return their result - so they'd get deleted faster. That's the crux of it.
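The disk-space effect Matt describes can be sketched with a toy simulation (the turnaround-time pool below is invented for illustration, not measured from the project): with 2/2 a workunit waits on disk for the slower of two hosts, while with 3/2 it only waits for the second-fastest of three.

```python
import random

def residence_time(issued, needed, turnaround_pool):
    """Days a workunit sits on disk: it can be deleted once `needed`
    of the `issued` copies have been returned."""
    times = sorted(random.choice(turnaround_pool) for _ in range(issued))
    return times[needed - 1]

# Hypothetical host turnaround times in days (fast hosts plus stragglers).
pool = [1, 2, 3, 5, 10, 20, 40]
random.seed(1)

trials = 50_000
avg_2of2 = sum(residence_time(2, 2, pool) for _ in range(trials)) / trials
avg_3of2 = sum(residence_time(3, 2, pool) for _ in range(trials)) / trials

print(f"2/2: wait for slower of 2 hosts -> {avg_2of2:.1f} days on disk")
print(f"3/2: wait for 2nd of 3 hosts    -> {avg_3of2:.1f} days on disk")
```

With any spread of host speeds, the second-fastest of three returns well before the slower of two, which is why workunits got deleted faster under 3/2.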

Other than that, we chatted about making some minor upgrades to the BOINC backend (employing better trigger file standards, and cleaning up the start/stop scripts, i.e. programming them in something other than Python) and gearing up for the end-of-the-year donation drive. Most of the pieces are in place for that.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

PhonAcq
Joined: 14 Apr 01
Posts: 1622
Credit: 22,062,201
RAC: 4,070
United States
Message 688429 - Posted: 3 Dec 2007, 22:38:43 UTC

My soap box: To solve the storage problem, limit the cache size and/or go back to 3/2. Now that you can cancel wu's on the clients, this makes a lot of sense, at the cost of some wasted bandwidth. But nothing is free.

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 688431 - Posted: 3 Dec 2007, 23:04:06 UTC - in response to Message 688417.
Last modified: 3 Dec 2007, 23:12:49 UTC


There was discussion about our current workunit storage woes. Namely, we need more, and we have an immediate plan to make more (converting barely-used archive storage). This is because of our 2/2 redundancy, i.e. we send out two redundant workunits and need two results to validate. This means a large number of users finish their workunits quickly, but have to wait for their "partner" (or "wingman") to return the other before validating, during which time the workunit is stuck on disk taking up space. Months ago when we were 3/2 we'd send out three redundant workunits and only need 2 to validate, which means the workunit stays on disk only as long as the two fastest machines take to return their result - so they'd get deleted faster. That's the crux of it.


In a "not-totally-selfless" act, I've been "helping" you out with this on my own. On my AMD system, I'm manually processing work as I see that the other person has submitted and reported. This keeps my completed-result storage as low as possible. The "not-totally-selfless" part of it means that I keep my RAC more stable than volatile... and I can "CPCW" from time to time with 4.x BOINC clients. Which, btw, brings this up: the enforcement of 4.19 as a minimum version is broken as there is a 4.13 version running amok out there at least... Also, while on that subject, IMO it's time to "deal with" the "proxy problem" and get the minimum BOINC version up to a flop-counting version, but that's a subject for another time...

That being said, I'm now at a point where no result that I have left in queue has a partner/wingman that has reported completion, so I've resumed tasks and am processing in queue-order.

I too agree with PhonAcq about going back to 3/2 with the server-side abort capability. I would, however, go one step further and suggest that you revisit the deadline determination scheme. The fact that you have such excessively long deadlines is starting to impact at least Einstein@Home, because their deadlines are very tight (perhaps actually short by a couple of days). There have been complaints about how EDF is happening, and people don't understand why you (SETI) offer so much more time for far less actual work. In my opinion, and that of a few other participants here, the deadlines here are excessively long. Your effort to soothe people who threw fits about EDF has now trickled into other projects. An "explainer" about EDF usually doesn't help, as people who throw fits about it tend to want 100% lockstep compliance with "their rules" and don't understand that the EDF system is designed to help ensure that "their rules" are maintained over a longer time horizon while still making sure that the time spent benefits both the participant and the project.

I know that you may have a differing opinion from the project perspective, since you have information that we don't, so I leave my standard disclaimer:

IMO, YMMV, etc, etc, etc...
____________

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 688443 - Posted: 3 Dec 2007, 23:32:28 UTC - in response to Message 688431.

...the enforcement of 4.19 as a minimum version is broken as there is a 4.13 version running amok out there at least...


Worse - several 3.xx clients have been reported (I've been paired with a couple in the past month or so) and they claim 0 credits...

F.

____________

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 688457 - Posted: 4 Dec 2007, 0:08:56 UTC - in response to Message 688443.

...the enforcement of 4.19 as a minimum version is broken as there is a 4.13 version running amok out there at least...


Worse - several 3.xx clients have been reported (I've been paired with a couple in the past month or so) and they claim 0 credits...

F.


Yeah, I know. I feel a little bad about tossing the "tiny fish" (4.x wingman) back in the lake, but my Rotten Luck™ works all the time, so I can measure that I get more than the supposed "1 in 10" chance... Also, my Intel host is on auto-pilot, so I'm still "keeping" a few of the little guys...

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 541,672
RAC: 182
United States
Message 688458 - Posted: 4 Dec 2007, 0:10:25 UTC - in response to Message 688417.
Last modified: 4 Dec 2007, 0:14:37 UTC

Other than completely undoing the 2/2 policy (which I do not recommend), I see only two free solutions. There are more that cost money, of course, such as purchasing more drive space, bandwidth, etc. They are:
1. Increase the size (required number of calculations) per result, and/or
2. Shorten the WU deadlines

Either you have to keep the results on the clients longer (longer crunch time, reduced server storage), or get all results back faster so that post-processing can delete the results sooner. Also, this would significantly lower the number of pending results. For example, by doubling the size of a WU (one less "split"), you'd have half the number of results and each client would spend double the crunching time. Creating more work units by using the 3/2 policy would not help the situation significantly and would add redundant crunching tasks for 33% of the users.
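DJStarfox's option 1 is simple arithmetic; a hedged sketch (all numbers invented for illustration):

```python
def wu_count_and_crunch_time(results_in_flight, hours_per_result, size_multiplier):
    """Scaling up the work per result (size_multiplier = 2 means one less
    "split") proportionally shrinks the number of results the servers must
    store and track, while each client crunches that much longer per result."""
    return (results_in_flight / size_multiplier,
            hours_per_result * size_multiplier)

count, hours = wu_count_and_crunch_time(1000, 3, 2)
print(count, hours)  # 500.0 6
```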

There were more complex solutions discussed, such as matching client computer speeds across results in the same WU. However, this requires server-side code development, which requires development time and expertise. Another solution would be matching receiver angle with CPU speed. I forget if higher or lower angles require more CPU, but just match higher CPU angles with faster machines. These are great ideas for future development or BOINC releases, but they do not help the current situation.

If I had to pick only one solution, I would pick the first one. This will allow you to throttle WU size against the speed increases of CPUs over the years to come. I'm sure there are pluses and minuses to either approach.

Anyway, good luck this Thursday. I know you guys are well-prepared.

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 688470 - Posted: 4 Dec 2007, 0:42:10 UTC - in response to Message 688458.

Other than completely undoing the 2/2 policy (which I do not recommend), I see only two free solutions. There are more that cost money, of course, such as purchasing more drive space, bandwidth, etc. They are:
1. Increase the size (required number of calculations) per result, and/or
2. Shorten the WU deadlines

[snip]
Creating more work units by using the 3/2 policy would not help the situation significantly and would add redundant crunching tasks for 33% of the users.

[snip]

If I had to pick only one solution, I would pick the first one. This will allow you to throttle WU size against the speed increases of CPUs over the years to come. I'm sure there are pluses and minuses to either approach.


The only way I'd be in favor of option #1 would be if the deadlines are not pushed out further AGAIN. I can see that being done, as people would throw fits about seeing EDF crop up again...

Also, the 3/2 policy does not guarantee a redundant task. If the BOINC client in use by the host is version 5.8.17+, on each communication with the scheduler any non-started tasks that already met quorum would get aborted (221 abort code). The percentage of 3rd task redundancy would probably only be 10-20%, depending on the client being used and the percentage of started vs. non-started tasks.

Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1916
Credit: 9,567,772
RAC: 15,111
United States
Message 688471 - Posted: 4 Dec 2007, 0:42:21 UTC - in response to Message 688458.
Last modified: 4 Dec 2007, 0:44:58 UTC

[snip]
There were more complex solutions discussed, such as matching client computer speeds across results in the same WU. However, this requires server-side code development, which requires development time and expertise. Another solution would be matching receiver angle with CPU speed. I forget if higher or lower angles require more CPU, but just match higher CPU angles with faster machines. These are great ideas for future development or BOINC releases, but they do not help the current situation.

If I had to pick only one solution, I would pick the first one. This will allow you to throttle WU size against the speed increases of CPUs over the years to come. I'm sure there are pluses and minuses to either approach.

[snip]


Higher angle range = short crunch time, low angle range = standard crunch time...

Note: I'm one of those that complained about EDF - but only about the very high angle range WU's - I think those should be given a deadline that takes them just outside immediate "high priority" assignment for those with (say) a 4.5 day cache. (For those that don't know, EDF is dead as of the 5.10.13 [give or take] client.) I think that the longer deadlines are not required for "normal" WU's.

...Just my 2¢ worth...
____________
.

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 688553 - Posted: 4 Dec 2007, 3:23:54 UTC - in response to Message 688471.
Last modified: 4 Dec 2007, 3:35:27 UTC


Note: I'm one of those that complained about EDF - but only about the very high angle range WU's - I think those should be given a deadline that takes them just outside immediate "high priority" assignment for those with (say) a 4.5 day cache. (For those that don't know, EDF is dead as of the 5.10.13 [give or take] client.) I think that the longer deadlines are not required for "normal" WU's.

...Just my 2¢ worth...


Hello again ;)

That's a slippery-slope proposition. You're wanting the project to bend to your specific setup. What if I feel that the deadline should be bumped up to accommodate my 10-day cache (if I had one; mine is currently at 3 days)? Not only that, the number of days of cache is not the only thing used to determine deadline trouble. Resource allocations also have to be considered.

Example:
What if I want to have a 10-day cache, but only want to give 25% allocation to a project on a system that is up 24x7? This means that, at a maximum, BOINC could only count on 2.5 days of time being allocated to that project during a 10-day timeframe, before even thinking about debt considerations. For simplicity, let's say there are no ST/LT debt issues. If a result with a 9-day deadline comes in at the bottom of a full 10-day cache, sooner or later it automatically has to be given priority, even at 100% allocation. OK, what kind of deadline will suffice? Well, the way I see it (though I could be wrong), it would require a deadline of cache size / allocation, or 10 / 0.25 = 40 days. If the allocation were only 10%, then 10 / 0.10 = 100 days.
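The back-of-the-envelope formula above, written out (the function name and sample settings are mine, not anything from BOINC):

```python
def minimum_safe_deadline(cache_days: float, resource_share: float) -> float:
    """Rough deadline (in days) needed so a task pulled into the bottom of
    a full cache can still finish without priority/EDF treatment:
    deadline = cache size / resource allocation."""
    return cache_days / resource_share

print(minimum_safe_deadline(10, 0.25))  # 40.0
print(minimum_safe_deadline(10, 0.10))  # 100.0
```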

Now factor in that the newer clients can maintain additional cache beyond 10 days, and you start to get an idea of what asking for something that simply makes one user happy can do...

If you don't want EDF, then maintain a smaller cache or set a higher resource allocation. Beyond that, realize that EDF is there for a reason and it really is not going to "disrespect your resource allocations" over the long-term.

If you want to get really bent out of shape about the short-deadline tasks, then get bent out of shape about what ONE of them will do to your RDCF, which will then force BOINC to think other tasks may be in deadline trouble when they really are not.

IMO, YMMV, etc, etc, etc...

Ncrab
Joined: 1 Dec 07
Posts: 10
Credit: 57,389
RAC: 0
Brazil
Message 688619 - Posted: 4 Dec 2007, 7:00:52 UTC

While I see a debate over whether 2/2 or 3/2 is the best choice... I wonder whether any effort has already been directed at a 1/1 scheme.

Why exactly is this approach not possible?

Think about it: doubling the effective machine power would be a worthy reward...


...just ideas, from a newbie user...

Profile TOM
Volunteer tester
Joined: 5 Apr 01
Posts: 59
Credit: 54,930,063
RAC: 0
Germany
Message 688626 - Posted: 4 Dec 2007, 7:28:40 UTC
Last modified: 4 Dec 2007, 7:56:59 UTC

My oldest pending WU is from 26 Aug 2007, 4:46:22 UTC.

If I click on the counterpart computer the db says "Couldn't find computer".

Maybe you can clean out deleted machines, save some space, and reschedule the pending WUs earlier.
It would be interesting to find out how many of them are in the db.

...just ideas...

TOM
____________

Profile Pooh Bear 27
Volunteer tester
Joined: 14 Jul 03
Posts: 3221
Credit: 2,082,679
RAC: 1,003
United States
Message 688655 - Posted: 4 Dec 2007, 10:33:16 UTC - in response to Message 688619.

While I see a debate over whether 2/2 or 3/2 is the best choice... I wonder whether any effort has already been directed at a 1/1 scheme.

Why exactly is this approach not possible?

Think about it: doubling the effective machine power would be a worthy reward...


...just ideas, from a newbie user...

Science needs validity, and having two people do the work and validate against each other provides it. When Classic was 1/1, some people sent back bad results; how could anyone tell whether a result was good or not?
____________

Ncrab
Joined: 1 Dec 07
Posts: 10
Credit: 57,389
RAC: 0
Brazil
Message 688669 - Posted: 4 Dec 2007, 12:22:43 UTC - in response to Message 688655.

Science needs validity, and having two people do the work and validate against each other provides it. When Classic was 1/1, some people sent back bad results; how could anyone tell whether a result was good or not?


I understand the need for validity. But is recalculating every WU the only way to achieve it?

Since it's the same software doing this job over the same data, wouldn't some kind of partial checksum (MD5) over key steps of the process (not the entire process) be enough?

1. A client receives and solves a WU
2. The client returns the results and attaches the checksums of the key partial steps
3. The server receives these results and then sends a special job for that WU (a "check-WU") to another client
4. The check-WU performs only the key partial steps and returns their checksum
5. The server validates the WU

I suppose plenty of effort has already gone into this issue, and maybe my suggestion is useless... but from a computing point of view, it sounds like there is overhead here.
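A minimal sketch of the proposed check-WU scheme, assuming the client can record intermediate values at a few key steps (every name and value here is hypothetical):

```python
import hashlib

def checkpoint_digest(checkpoints):
    """MD5 over the intermediate values produced at a few key steps of the
    computation, rather than over the whole result."""
    h = hashlib.md5()
    for value in checkpoints:
        h.update(repr(value).encode())
    return h.hexdigest()

# Steps 1-2: the full client crunches the WU and reports its digest.
full_client = checkpoint_digest([0.125, 3.75, 19.0625])

# Steps 3-4: a second client recomputes only the key steps of the check-WU.
check_client = checkpoint_digest([0.125, 3.75, 19.0625])

# Step 5: the server validates by comparing the two digests.
print(full_client == check_client)  # True
```

Note the catch the later replies get at: the digests only prove the two runs produced identical intermediate values, which still requires a second host to recompute those steps.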

Thanks for the reply.

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 541,672
RAC: 182
United States
Message 688670 - Posted: 4 Dec 2007, 12:25:57 UTC - in response to Message 688470.

The only way I'd be in favor of option #1 would be if the deadlines are not pushed out further AGAIN. I can see that being done, as people would throw fits about seeing EDF crop up again...


Yes, when I said double the crunch time, I meant we should keep the same deadlines for this option.

Henk Haneveld
Joined: 16 May 99
Posts: 154
Credit: 1,484,496
RAC: 187
Netherlands
Message 688671 - Posted: 4 Dec 2007, 12:39:25 UTC

Why not go to a fixed deadline of 30 days for all results?
Even somebody running the maximum cache size of 20 days would have an extra 10 days to complete a result. It would take care of the long time-out wait for abandoned results.

____________

Aurora Borealis
Volunteer tester
Joined: 14 Jan 01
Posts: 2975
Credit: 4,986,593
RAC: 1,090
Canada
Message 688694 - Posted: 4 Dec 2007, 14:46:02 UTC - in response to Message 688671.
Last modified: 4 Dec 2007, 14:48:19 UTC

Why not go to a fixed deadline of 30 days for all results?
Even somebody running the maximum cache size of 20 days would have an extra 10 days to complete a result. It would take care of the long time-out wait for abandoned results.

The majority of users still run only as originally intended; that is to say, only when the computer is idle, in screensaver mode. If we had a 30-day deadline, older computers crunching part-time would have greater difficulty completing work by the deadline. I do think that the short deadlines (4 days) are problematic for these same users and for the many who are still using modems, connecting once or twice a week. The minimum deadline, in my opinion, should be at least 7 days. This would eliminate a lot of situations where WUs immediately go into EDF mode for people who require a 1-week cache to work offline.

The idea is to keep as many computers crunching for the project as possible. This is not time-sensitive data. Even if the in-progress database gets a bit bloated, we don't want to cut off contributing crunchers unnecessarily. Once they leave, it would be difficult to ever get them back. They are still the backbone of the project, not the minority of 24/7 crunchers who hang around the forums.
____________
Questions? Answers are in the "Unofficial" BOINC Wiki.

Boinc V7.0.27
Win7 i5 3.33G 4GB, GTX470

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8436
Credit: 47,943,348
RAC: 59,128
United Kingdom
Message 688701 - Posted: 4 Dec 2007, 15:06:20 UTC - in response to Message 688694.

.... I do think that the short deadlines (4 days) are problematic for these same users and for the many who are still using modems, connecting once or twice a week. The minimum deadline, in my opinion, should be at least 7 days. ....

FYI: The 4-day deadline is history, and has been since the release of multibeam data recordings in early August this year. The minimum deadline has been over 8 days for almost four months now.

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13564
Credit: 29,768,105
RAC: 16,597
United States
Message 688702 - Posted: 4 Dec 2007, 15:06:37 UTC - in response to Message 688669.
Last modified: 4 Dec 2007, 15:13:43 UTC

Since it's the same software doing this job over the same data, wouldn't some kind of partial checksum (MD5) over key steps of the process (not the entire process) be enough?


From what I understand about checksums, they only verify that the file hasn't been modified in any way, but they do not verify that the resulting data inside the file is accurate. This will not give proper validation. The only way to verify the math is being done correctly is by comparing it against another machine to see if they both come close to the same result.

For instance (extremely oversimplified), if I create a file that has the following math result:

5 + 6 = 23

If I then attach an MD5 checksum, the checksum will simply ensure that nobody changes the data, but it does not check that the math is accurate (in this case, it obviously isn't). So if somebody changes the "+" to a "-", the checksum comparison will fail, but that still doesn't verify the accuracy of the data itself. The checksum doesn't know that 5 + 6 is not 23, nor that 5 - 6 is 23. It simply knows that somewhere, somebody changed something that wasn't originally there.

Somebody can correct me if I have misunderstood checksums, or if I have misunderstood a different method of using checksums.
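OzzFan's understanding is essentially right; here is the same "5 + 6 = 23" example run through Python's hashlib as a sketch:

```python
import hashlib

record = b"5 + 6 = 23"  # arithmetically wrong, but the checksum can't tell
checksum = hashlib.md5(record).hexdigest()

# Integrity check passes: the bytes are exactly what was checksummed...
print(hashlib.md5(b"5 + 6 = 23").hexdigest() == checksum)  # True

# ...and tampering is detected when "+" becomes "-"...
print(hashlib.md5(b"5 - 6 = 23").hexdigest() == checksum)  # False

# ...but at no point did anything evaluate whether 5 + 6 really is 23.
```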
____________

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8356
Credit: 4,095,592
RAC: 1,195
United Kingdom
Message 688709 - Posted: 4 Dec 2007, 16:07:47 UTC - in response to Message 688702.
Last modified: 4 Dec 2007, 16:08:43 UTC

... method of using checksums.

Essentially correct.

There are many kinds of 'checksum', which do the 'sums' in various ways...

Perhaps the simplest is the parity bit used in serial data that 'counts' the number of '1' (or '0') bits. That can detect if one bit is flipped and flag an error. More advanced is to use Hamming codes...

For blocks of data such as computer files, the binary data that is the file itself (text for example gets encoded as ASCII binary numbers or as unicode binary words) can be 'checksummed' to detect if any one bit or any byte is lost or corrupted. One favoured easy way is to run an XOR through all the bytes. This generates a (simple, fast) parity check on each column of bits through the data.
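The XOR-through-the-bytes scheme can be sketched in a few lines (a toy illustration, not any production format); it also shows the weakness of a simple parity check: two errors in the same bit column cancel out.

```python
def xor_checksum(data: bytes) -> int:
    """Run XOR through all the bytes: a one-byte parity check on each
    column of bits through the data."""
    check = 0
    for b in data:
        check ^= b
    return check

payload = b"workunit payload"
check = xor_checksum(payload)

# A single flipped bit changes the parity byte and is detected...
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
print(xor_checksum(corrupted) == check)  # False

# ...but two flips in the same bit column cancel out and slip through.
double_hit = bytes([payload[0] ^ 0x01, payload[1] ^ 0x01]) + payload[2:]
print(xor_checksum(double_hit) == check)  # True
```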

MD5 is a 'message digest' (cryptographic hash) method, far more thorough than a parity check, that takes all the binary data in a file and generates an effectively unique signature number (a "digital signature"). Changing any bit or byte should cause a different signature to be generated. (This is similar to, but more thorough than, generating a hash code as used in fast searches.)

Aside: MD5 has been compromised in a certain very specific way and so should no longer be used for digital signatures.


Hope that's of interest,

Cheers,
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 688712 - Posted: 4 Dec 2007, 16:11:17 UTC

This is all healthy discussion. Two quick points:

1. There really is no way to verify results unless we have *at least* two results per workunit. This being a scientific project, we need verification.

2. We already have a solution to the disk space problem: adding more disk space. We'll need to do this anyway, as Astropulse is just around the bend and will require even more workunit storage. We're not going to go back to 3/2 unless we really have to - it's a huge increase in bandwidth consumption (a current bottleneck) and wasted computing resources - and we're trying to be as "green" as possible. As for funky deadlines, bogus credit, etc., take that up with BOINC. I don't follow that too closely.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude



Copyright © 2014 University of California