Upwards and Onwards (May 28 2009)



Message boards : Technical News : Upwards and Onwards (May 28 2009)

AlphaLaser
Volunteer tester
Joined: 22 Aug 06
Posts: 262
Credit: 45,707
RAC: 0
United States
Message 906083 - Posted: 11 Jun 2009, 4:48:12 UTC - in response to Message 906042.
Last modified: 11 Jun 2009, 4:53:56 UTC

Don't forget that the first AP task a user gets will be with the stock app (so ~80 hours for that machine), but it will be estimated by BOINC at DCF=1.0, rather than the DCF of ~0.4 typical for stock AP.

If stock apps don't have DCF=~1, it means the run time estimate is WRONG, and the admins should fix it.

Actually, while I agree that DCF should be around 1.0, there is an advantage:

A brand-new computer defaults to 1.0, which means the first work requests will be small, and grow as DCF converges on the right number.

Agreed, but with some concern about AP tasks.
A new host would see the AP tasks estimated at about 2.5 times the actual time. As the tasks are long to start with, this could lead an owner who only wants to contribute a limited number of hours to assume they would take more time than they wish to give within the deadline limit.


I have to agree with Nicolas. Most new users aren't going to know about the existence of DCF, among other things, and it is reasonable for them to expect the initial estimates to be nearly correct from the get-go. For SETI the runtimes are generally predictable ahead of time, and in the worst case your tasks end sooner rather than later (-9 overflows and such). Keeping in mind that this is a "first impressions" moment, users ought to get intuitive feedback from the client with minimal surprises. Using the DCF to deal with overfetch from new and "untrusted" users sounds like a kludge; we should be thinking about a dedicated mechanism for handling it.
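To make the arithmetic concrete, here is a minimal Python sketch of the scaling being discussed. The function name and the 200-hour raw figure are illustrative, not from the BOINC source: the client multiplies the server's raw estimate by the host's DCF, so a fresh host at DCF=1.0 shows 2.5x the runtime of a host whose DCF has converged to ~0.4.

```python
# Simplified model of how the client scales task estimates by DCF.
# Names and numbers are illustrative, not the actual BOINC client source.

def estimated_runtime_hours(raw_estimate_hours, dcf):
    """The client multiplies the server's raw estimate by the host's DCF."""
    return raw_estimate_hours * dcf

# Raw server estimate that lands at ~80 h once DCF converges to ~0.4:
raw = 200.0

converged = estimated_runtime_hours(raw, 0.4)   # experienced host: ~80 h
new_host  = estimated_runtime_hours(raw, 1.0)   # fresh host starts at DCF = 1.0

print(new_host / converged)                     # the 2.5x inflation discussed above
```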
____________

WinterKnight
Volunteer tester
Joined: 18 May 99
Posts: 8630
Credit: 23,723,511
RAC: 19,182
United Kingdom
Message 906091 - Posted: 11 Jun 2009, 5:35:44 UTC - in response to Message 906083.

Don't forget that the first AP task a user gets will be with the stock app (so ~80 hours for that machine), but it will be estimated by BOINC at DCF=1.0, rather than the DCF of ~0.4 typical for stock AP.

If stock apps don't have DCF=~1, it means the run time estimate is WRONG, and the admins should fix it.

Actually, while I agree that DCF should be around 1.0, there is an advantage:

A brand-new computer defaults to 1.0, which means the first work requests will be small, and grow as DCF converges on the right number.

Agreed, but with some concern about AP tasks.
A new host would see the AP tasks estimated at about 2.5 times the actual time. As the tasks are long to start with, this could lead an owner who only wants to contribute a limited number of hours to assume they would take more time than they wish to give within the deadline limit.


I have to agree with Nicolas. Most new users aren't going to know about the existence of DCF, among other things, and it is reasonable for them to expect the initial estimates to be nearly correct from the get-go. For SETI the runtimes are generally predictable ahead of time, and in the worst case your tasks end sooner rather than later (-9 overflows and such). Keeping in mind that this is a "first impressions" moment, users ought to get intuitive feedback from the client with minimal surprises. Using the DCF to deal with overfetch from new and "untrusted" users sounds like a kludge; we should be thinking about a dedicated mechanism for handling it.

DCF may be a kludge, but it is probably one that needs to stay. The benchmarks only really test the core of the CPU, so when running tasks there can be a large difference in performance between CPUs with similar benchmarks.
As we know, AMD CPUs don't, in general, do very well here on SETI, but there is also quite a big difference in performance between Intel CPUs at the same clock speed and similar benchmarks but with differing amounts of cache memory.

Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 357,953
RAC: 37
Germany
Message 906109 - Posted: 11 Jun 2009, 8:18:43 UTC - in response to Message 906091.

DCF may be a kludge, but it is probably one that needs to stay. The benchmarks only really test the core of the CPU, so when running tasks there can be a large difference in performance between CPUs with similar benchmarks.
As we know, AMD CPUs don't, in general, do very well here on SETI, but there is also quite a big difference in performance between Intel CPUs at the same clock speed and similar benchmarks but with differing amounts of cache memory.

The DCF definitely is a kludge, but only because it's per-project rather than per-application.

What the discussion here is about is not the DCF but the initial estimate, as stated by Nicolas.

Regards,
Gundolf
____________
Computers aren't everything in life. (Little joke.)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 906232 - Posted: 11 Jun 2009, 16:47:14 UTC - in response to Message 906109.

DCF may be a kludge, but it is probably one that needs to stay. The benchmarks only really test the core of the CPU, so when running tasks there can be a large difference in performance between CPUs with similar benchmarks.
As we know, AMD CPUs don't, in general, do very well here on SETI, but there is also quite a big difference in performance between Intel CPUs at the same clock speed and similar benchmarks but with differing amounts of cache memory.

The DCF definitely is a kludge, but only because it's per-project rather than per-application.

What the discussion here is about is not the DCF but the initial estimate, as stated by Nicolas.

Regards,
Gundolf

The problem is in fact with the initial estimate. But let me tell you about my newest cruncher.

This machine exists primarily to run web statistics for hosting customers. It is not very busy -- or very big.

It was selected solely for power consumption. It's a VIA C7 at 1.5 GHz.

The first work unit it received was AP. Crunching 24 hours/day, running the stock Astropulse application, it missed the deadline by five days, and some other machine completed the reissued work. The estimate said it had plenty of time.

I mention the C7 because it has a particularly poor FPU. A runtime estimate that is reasonable for any other processor is going to be way off on this one.

If BOINC over-estimates the run-time, and the actual time is shorter, everything is reasonably okay. If BOINC under-estimates, and downloads more work than it can do, there is a problem.

... and no matter how you try to fix this, someone is going to find some issue: if you raise the estimates, you get safer scheduling while DCF dials in; if you force the queue to be artificially small during the first few work units, then people will complain that the queue can't be filled.
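The "dials in" behavior can be pictured as an asymmetric update rule: jump up immediately on an under-estimate (to avoid over-fetching), drift down slowly on an over-estimate. The sketch below is a simplified model of that described behavior, not the actual BOINC client code; the 10% step size and the task numbers are illustrative.

```python
# Asymmetric DCF update: fast up, slow down.
# Simplified model of the described behavior, not the BOINC client source.

def update_dcf(dcf, estimated, actual):
    ratio = actual / estimated          # >1 means we under-estimated
    if ratio > dcf:
        return ratio                    # under-estimate: correct immediately
    return dcf + 0.1 * (ratio - dcf)    # over-estimate: ease down 10% per result

dcf = 1.0                               # a brand-new host
for _ in range(20):                     # stock AP tasks that really need DCF ~0.4
    dcf = update_dcf(dcf, estimated=200.0, actual=80.0)
print(round(dcf, 3))                    # converges toward 0.4 from above
```

The slow downward drift is what keeps a new host's early work requests conservative, which is the safety margin being weighed against the misleading first impression.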
____________

AlphaLaser
Volunteer tester
Joined: 22 Aug 06
Posts: 262
Credit: 45,707
RAC: 0
United States
Message 906572 - Posted: 12 Jun 2009, 12:12:32 UTC
Last modified: 12 Jun 2009, 12:15:28 UTC

Yeah, it's not so much a problem with DCF itself, but rather that it should initialize to near 1.0.

Isn't there a built-in mechanism on the server side to automatically adjust the estimation data sent out to newly attached hosts, based on work previously returned by other hosts? Or perhaps doing that adds too much load? Otherwise, it would seem like a nice feature to have for some projects.

Though DCF would work much better on a per-app basis. I would go even further and say that perhaps projects should be able to define categories of work (for SETI, that would mean one for each group of ARs) -- that would provide flexibility for projects that might have apps with different modes of operation or different categories of input.
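A per-app DCF like the one proposed could look roughly like this. This is a hypothetical structure, not an existing BOINC feature at the time of this thread: one correction factor per application, each starting at 1.0 and updated independently with the same fast-up/slow-down rule.

```python
# Hypothetical per-application DCF, as suggested above.
# Not an existing BOINC feature; names and the 10% step are illustrative.

from collections import defaultdict

class PerAppDCF:
    def __init__(self):
        self.dcf = defaultdict(lambda: 1.0)   # each new app starts at 1.0

    def estimate(self, app, raw_estimate):
        return raw_estimate * self.dcf[app]

    def record_result(self, app, raw_estimate, actual):
        ratio = actual / raw_estimate
        d = self.dcf[app]
        # same asymmetric rule as a single DCF, but scoped to one app
        self.dcf[app] = ratio if ratio > d else d + 0.1 * (ratio - d)

d = PerAppDCF()
d.record_result("astropulse", raw_estimate=200.0, actual=80.0)
d.record_result("seti_enhanced", raw_estimate=10.0, actual=12.0)
print(d.estimate("astropulse", 200.0), d.estimate("seti_enhanced", 10.0))
```

Scoping the factor this way means a fast multibeam app can no longer drag the estimates for long Astropulse tasks with it, which is the cross-contamination complained about earlier in the thread.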
____________

PhonAcq
Joined: 14 Apr 01
Posts: 1622
Credit: 22,106,142
RAC: 3,929
United States
Message 906644 - Posted: 12 Jun 2009, 13:47:53 UTC - in response to Message 906232.

Regarding the missed deadline, does BOINC have a method to ping the host to verify progress? If so, wouldn't that be a better and more universal way to decide about reissuing, at least at first? (This question should probably be in NC, sorry.)

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13581
Credit: 29,943,410
RAC: 16,209
United States
Message 906654 - Posted: 12 Jun 2009, 13:56:16 UTC - in response to Message 906644.

Regarding the missed deadline, does BOINC have a method to ping the host to verify progress? If so, wouldn't that be a better and more universal way to decide about reissuing, at least at first? (This question should probably be in NC, sorry.)


Are you asking if the SETI servers can "ping" the client on the volunteer's PC to verify progress?

If so, the answer is no, and I don't foresee that changing, because it would require an incoming connection to the user's PC. That means firewalls would have to be configured to allow it, and I can't imagine that many corporate donors would be happy with this idea, nor would a few other users.
____________

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 906793 - Posted: 12 Jun 2009, 17:53:20 UTC - in response to Message 906644.

Regarding the missed deadline, does BOINC have a method to ping the host to verify progress? If so, wouldn't that be a better and more universal way to decide about reissuing, at least at first? (This question should probably be in NC, sorry.)

The way to do this would be for the BOINC client to "check in" and report progress periodically.

Two problems: it adds another mechanism, with more work on the servers and more fields in the database, and it doesn't really help for machines with intermittent connections (dialup, portables, etc.).

If you can't verify that a machine is making progress, you don't know if it is broken, or does not have connectivity -- you still have to wait for the deadline.
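The indistinguishability argument can be made concrete with a tiny sketch. This is hypothetical server-side logic, not part of the BOINC scheduler: even with last-contact data, a stale host might still be crunching offline, so the only safe trigger for reissue remains the deadline itself.

```python
# Why check-in data alone can't drive early reissue: a silent host might be
# broken, might have quit, or might be crunching happily while offline.
# Hypothetical logic, not part of the BOINC scheduler; the 7-day cutoff is made up.

from datetime import datetime, timedelta

def reissue_decision(last_contact, deadline, now):
    """All we can safely key on is the deadline itself."""
    if now >= deadline:
        return "reissue"                  # past deadline: reissue regardless
    if now - last_contact > timedelta(days=7):
        # Stale, but indistinguishable from a dial-up host crunching offline,
        # so we still have to wait for the deadline.
        return "wait (stale, but maybe crunching offline)"
    return "wait"

now = datetime(2009, 6, 12)
print(reissue_decision(datetime(2009, 6, 1), datetime(2009, 6, 20), now))
```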
____________

Dena Wiltsie
Joined: 19 Apr 01
Posts: 1121
Credit: 541,668
RAC: 341
United States
Message 906801 - Posted: 12 Jun 2009, 18:09:10 UTC - in response to Message 906793.

You still have some information, in that the last contact time is recorded. While this will not tell you what the status is, it will tell you they are still out there and connecting to the project. If they have not connected in a while, they could have quit the project or gone on vacation without draining their work.
____________

PhonAcq
Joined: 14 Apr 01
Posts: 1622
Credit: 22,106,142
RAC: 3,929
United States
Message 906835 - Posted: 12 Jun 2009, 19:17:02 UTC - in response to Message 906801.

I see. So if the host has checked in but is late, then you might give it another quantum of time, and repeat this process a couple of times with smaller quanta, in order to save the efforts of the slower hosts. Of course, this makes no sense if these hosts continue to get WUs that are 'too much' for them.

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 907036 - Posted: 12 Jun 2009, 23:51:50 UTC - in response to Message 906801.

You still have some information, in that the last contact time is recorded. While this will not tell you what the status is, it will tell you they are still out there and connecting to the project. If they have not connected in a while, they could have quit the project or gone on vacation without draining their work.

Exactly. You have information "if" the machine is checking in.

As you point out, the person can be on vacation, or they could have quit.

The machine could have been destroyed, or the hard drive might have failed.

... or it could be crunching away happily, but not connected to the net.

If it hasn't checked in for a while, that's all you really know.

It could disappear for a week, and resurface just before the due-date.

It could even report late, but before the work has been reassigned and reissued, complete the quorum, and move on.
____________

Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 4,194,867
RAC: 20,384
Norway
Message 907272 - Posted: 13 Jun 2009, 14:06:55 UTC - in response to Message 906835.

I see. So if the host has checked in but is late, then you might give it another quantum of time, and repeat this process a couple of times with smaller quanta, in order to save the efforts of the slower hosts. Of course, this makes no sense if these hosts continue to get WUs that are 'too much' for them.

For a project to get a client to connect again, they can just set <next_rpc_delay>, and apart from computers that are disconnected for some reason, they'll get a scheduler RPC.

But the problem for SETI@home is that this would be useless, since the database server can't handle the extra load of checking for re-issues and aborts, so it wouldn't have the capacity to do extra checks, as part of scheduler requests, for tasks in danger of passing their deadline either.

The Transitioner will check and update the task when the deadline is reached, but you can't extend the deadline based on the last time the host connected to the project, since you have no way of knowing whether the task is just a little late to be returned or is for some reason a "ghost WU" that never made it to the host in the first place, and re-issuing "lost" work is disabled since it puts too much load on the database...
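For reference, a scheduler reply carrying <next_rpc_delay> would look roughly like the fragment below. The surrounding element name and placement are illustrative; consult the BOINC server documentation for the exact reply format. The effect is that the client contacts the scheduler again within the given number of seconds even if it needs no work.

```xml
<!-- Sketch of a scheduler reply carrying next_rpc_delay: the client is asked
     to contact the scheduler again within 24 hours (86400 seconds).
     Element placement is illustrative; check the BOINC server docs. -->
<scheduler_reply>
    <next_rpc_delay>86400</next_rpc_delay>
</scheduler_reply>
```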



____________
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

PhonAcq
Joined: 14 Apr 01
Posts: 1622
Credit: 22,106,142
RAC: 3,929
United States
Message 907811 - Posted: 15 Jun 2009, 12:42:46 UTC - in response to Message 907272.

I had forgotten about that little limitation imposed by the database engineering. So many rough edges still after so many years.

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 907893 - Posted: 15 Jun 2009, 17:38:07 UTC - in response to Message 907811.

I had forgotten about that little limitation imposed by the database engineering. So many rough edges still after so many years.

Probably less of a "rough edge" and more of a case of just trying to carry a really big load on largely "recycled" equipment.

If they had the money for faster servers, it might be different -- but part of what BOINC tries to do is minimize cost.
____________



Copyright © 2014 University of California