DCF when the GPUs are different speeds


Profile red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,025,351
RAC: 18
United Kingdom
Message 1206611 - Posted: 16 Mar 2012, 15:46:33 UTC

I have been trying to get a stable DCF on http://setiathome.berkeley.edu/show_host_detail.php?hostid=6379672 and have managed to improve things quite a lot by adding flops to app_info.xml but I expect I have got it as good as it's going to get. The problem is that I have 4 GPUs of different speeds and when one of the slow GPUs finishes I typically get

16/03/2012 15:20:28 | SETI@home | [dcf] DCF: 0.373319->1.174715, raw_ratio 1.174715, adj_ratio 3.146681

To get things to work vaguely sensibly I have used flops values such that the CPUs and fast GPUs typically have a DCF of 0.4 which means I can get the current 400/50 WU limits and that I don't get timeouts on the slow GPUs. The actual GPU configuration is

16/03/2012 12:32:59 | | NVIDIA GPU 0: GeForce GTX 460 (driver version 28562, CUDA version 4010, compute capability 2.1, 1024MB, 684 GFLOPS peak)
16/03/2012 12:32:59 | | NVIDIA GPU 1: GeForce GT 430 (driver version 28562, CUDA version 4010, compute capability 2.1, 512MB, 179 GFLOPS peak)
16/03/2012 12:32:59 | | NVIDIA GPU 2: GeForce GTX 460 (driver version 28562, CUDA version 4010, compute capability 2.1, 1024MB, 684 GFLOPS peak)
16/03/2012 12:32:59 | | NVIDIA GPU 3: GeForce GT 520 (driver version 28562, CUDA version 4010, compute capability 2.1, 512MB, 104 GFLOPS peak)

and given BOINC reports this then it must know the relative speed of the GPUs.

To my thinking, clearly BOINC should be taking relative speed of the GPUs into account when calculating the DCF for a given WU. Further, I suspect it could even work out the speed of each GPU relative to the CPU and thereby totally remove the need for flops entries in app_info.xml. Were a future release of BOINC to do this, maybe some of the Luddites running old versions of BOINC would finally update!
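As a rough illustration of the "relative speed" idea, the sketch below turns the peak GFLOPS figures from the startup log into a slowdown factor relative to the fastest GPU. The function name is hypothetical and this is not BOINC code; it also assumes runtime scales inversely with peak GFLOPS, which is only an approximation of real application performance.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper, not part of BOINC: for each device, how many times
// longer a task is expected to run than on the fastest device, assuming
// runtime scales inversely with the reported peak GFLOPS.
std::vector<double> relative_slowdown(const std::vector<double>& peak_gflops) {
    assert(!peak_gflops.empty());
    const double fastest =
        *std::max_element(peak_gflops.begin(), peak_gflops.end());
    std::vector<double> slowdown;
    slowdown.reserve(peak_gflops.size());
    for (double g : peak_gflops) {
        assert(g > 0.0);
        slowdown.push_back(fastest / g);  // 1.0 for the fastest device
    }
    return slowdown;
}
```

Fed the four values above (684, 179, 684, 104), this gives roughly 3.8 for the GT 430 and 6.6 for the GT 520, which is the sort of factor by which a single shared speed estimate would be wrong for the slow cards.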

Profile red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,025,351
RAC: 18
United Kingdom
Message 1207809 - Posted: 19 Mar 2012, 13:13:44 UTC - in response to Message 1206611.
Last modified: 19 Mar 2012, 13:22:16 UTC

Will this ever be fixed? Currently I get a lot of the following which trigger a load of high priority running.

19/03/2012 13:08:51 | SETI@home | [dcf] DCF: 0.691461->2.026834, raw_ratio 2.026834, adj_ratio 2.931233

Profile red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,025,351
RAC: 18
United Kingdom
Message 1208249 - Posted: 20 Mar 2012, 15:41:50 UTC
Last modified: 20 Mar 2012, 15:44:36 UTC

Far too often it jumps way too high and takes far too long to recover.

20/03/2012 15:24:03 | SETI@home | [dcf] DCF: 0.591590->4.139177, raw_ratio 4.139177, adj_ratio 6.996701
20/03/2012 15:28:30 | SETI@home | [dcf] DCF: 4.139177->3.772510, raw_ratio 0.472504, adj_ratio 0.114154
20/03/2012 15:29:55 | SETI@home | [dcf] DCF: 3.772510->3.736534, raw_ratio 0.174930, adj_ratio 0.046370
20/03/2012 15:30:18 | SETI@home | [dcf] DCF: 3.736534->3.701055, raw_ratio 0.188607, adj_ratio 0.050476
20/03/2012 15:35:21 | SETI@home | [dcf] DCF: 3.701055->3.382952, raw_ratio 0.520028, adj_ratio 0.140508
20/03/2012 15:35:38 | SETI@home | [dcf] DCF: 3.382952->3.086598, raw_ratio 0.419405, adj_ratio 0.123976
20/03/2012 15:41:55 | SETI@home | [dcf] DCF: 3.086598->3.058384, raw_ratio 0.265261, adj_ratio 0.085940
20/03/2012 15:42:01 | SETI@home | [dcf] DCF: 3.058384->3.030399, raw_ratio 0.259856, adj_ratio 0.084965

John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24544
Credit: 522,233
RAC: 86
United States
Message 1208299 - Posted: 21 Mar 2012, 0:20:19 UTC - in response to Message 1208249.

Far too often it jumps way too high and takes far too long to recover.
20/03/2012 15:24:03 | SETI@home | [dcf] DCF: 0.591590->4.139177, raw_ratio 4.139177, adj_ratio 6.996701
20/03/2012 15:28:30 | SETI@home | [dcf] DCF: 4.139177->3.772510, raw_ratio 0.472504, adj_ratio 0.114154
20/03/2012 15:29:55 | SETI@home | [dcf] DCF: 3.772510->3.736534, raw_ratio 0.174930, adj_ratio 0.046370
20/03/2012 15:30:18 | SETI@home | [dcf] DCF: 3.736534->3.701055, raw_ratio 0.188607, adj_ratio 0.050476
20/03/2012 15:35:21 | SETI@home | [dcf] DCF: 3.701055->3.382952, raw_ratio 0.520028, adj_ratio 0.140508
20/03/2012 15:35:38 | SETI@home | [dcf] DCF: 3.382952->3.086598, raw_ratio 0.419405, adj_ratio 0.123976
20/03/2012 15:41:55 | SETI@home | [dcf] DCF: 3.086598->3.058384, raw_ratio 0.265261, adj_ratio 0.085940
20/03/2012 15:42:01 | SETI@home | [dcf] DCF: 3.058384->3.030399, raw_ratio 0.259856, adj_ratio 0.084965

The DCF is designed to prevent far too much work from being downloaded to a host. It assumes that the estimates for each application from a project will be off in a similar manner. The fix is to have DCF be per application for CPU scheduling. This will not, however, work for work fetch as the work fetch is per project.
____________


BOINC WIKI

Profile red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,025,351
RAC: 18
United Kingdom
Message 1208414 - Posted: 21 Mar 2012, 10:14:09 UTC - in response to Message 1208299.
Last modified: 21 Mar 2012, 10:15:05 UTC

The DCF is designed to prevent far too much work from being downloaded to a host. It assumes that the estimates for each application from a project will be off in a similar manner. The fix is to have DCF be per application for CPU scheduling. This will not, however, work for work fetch as the work fetch is per project.

I wonder do you understand what I am asking for? How can a DCF per application address GPUs of different speeds?

As I said initially there needs to be per device DCF adjustment.

Further the current code that allows the DCF to jump from 0.591590->4.139177 is inappropriate and needs fixing. It allows the DCF to instantly jump so high that the (adj_ratio < 0.1) applies when it should not and it then takes forever and a day for the DCF to return to what it should be.

void PROJECT::update_duration_correction_factor(ACTIVE_TASK* atp) {
    RESULT* rp = atp->result;
    double raw_ratio = atp->elapsed_time/rp->estimated_duration_uncorrected();
    double adj_ratio = atp->elapsed_time/rp->estimated_duration();
    double old_dcf = duration_correction_factor;

    // it's OK to overestimate completion time,
    // but bad to underestimate it.
    // So make it easy for the factor to increase,
    // but decrease it with caution
    //
    if (adj_ratio > 1.1) {
        duration_correction_factor = raw_ratio;
    } else {
        // in particular, don't give much weight to results
        // that completed a lot earlier than expected
        //
        if (adj_ratio < 0.1) {
            duration_correction_factor = duration_correction_factor*0.99 + 0.01*raw_ratio;
        } else {
            duration_correction_factor = duration_correction_factor*0.9 + 0.1*raw_ratio;
        }
    }

    // limit to [.01 .. 100]
    //
    if (duration_correction_factor > 100) duration_correction_factor = 100;
    if (duration_correction_factor < 0.01) duration_correction_factor = 0.01;

    if (log_flags.dcf_debug) {
        msg_printf(this, MSG_INFO,
            "[dcf] DCF: %f->%f, raw_ratio %f, adj_ratio %f",
            old_dcf, duration_correction_factor, raw_ratio, adj_ratio
        );
    }
}
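To make the asymmetry easier to experiment with, here is a minimal standalone restatement of the rule quoted above. It is a sketch, not BOINC code: `next_dcf` and `results_to_recover` are hypothetical names, and `adj_ratio` is taken to be `raw_ratio / old_dcf`, an assumption that is consistent with the logged values (for example 4.139177 / 0.591590 is about 6.9967).

```cpp
#include <cassert>

// Standalone restatement of the quoted update rule; not BOINC code.
// raw_ratio = elapsed time / uncorrected estimate.
// adj_ratio is assumed to equal raw_ratio / old_dcf (matches the log lines).
double next_dcf(double old_dcf, double raw_ratio) {
    assert(old_dcf > 0.0);
    const double adj_ratio = raw_ratio / old_dcf;
    double dcf;
    if (adj_ratio > 1.1) {
        dcf = raw_ratio;                          // jump straight up
    } else if (adj_ratio < 0.1) {
        dcf = old_dcf * 0.99 + 0.01 * raw_ratio;  // 1% step
    } else {
        dcf = old_dcf * 0.9 + 0.1 * raw_ratio;    // 10% step
    }
    if (dcf > 100.0) dcf = 100.0;                 // limit to [.01 .. 100]
    if (dcf < 0.01) dcf = 0.01;
    return dcf;
}

// Number of results, all with the same raw_ratio, needed before the DCF
// falls to or below `target`. Returns -1 if it has not happened after
// max_steps updates (for example when raw_ratio is above the target).
int results_to_recover(double dcf, double raw_ratio, double target,
                       int max_steps = 10000) {
    for (int n = 0; n <= max_steps; ++n) {
        if (dcf <= target) return n;
        dcf = next_dcf(dcf, raw_ratio);
    }
    return -1;
}
```

Replaying the 20 March log through `next_dcf` reproduces the logged sequence: one slow result lifts the DCF from 0.591590 to 4.139177 in a single step, while each fast result afterwards moves it down by only 10% or 1% of the gap.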

John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24544
Credit: 522,233
RAC: 86
United States
Message 1208639 - Posted: 21 Mar 2012, 23:43:33 UTC - in response to Message 1208414.

The DCF is designed to prevent far too much work from being downloaded to a host. It assumes that the estimates for each application from a project will be off in a similar manner. The fix is to have DCF be per application for CPU scheduling. This will not, however, work for work fetch as the work fetch is per project.

I wonder do you understand what I am asking for? How can a DCF per application address GPUs of different speeds?

As I said initially there needs to be per device DCF adjustment.

Further the current code that allows the DCF to jump from 0.591590->4.139177 is inappropriate and needs fixing. It allows the DCF to instantly jump so high that the (adj_ratio < 0.1) applies when it should not and it then takes forever and a day for the DCF to return to what it should be.

Actually a DCF per device is not necessarily a requirement. After all, it is the difference between the actual and expected times, but the servers do not specify the time, they specify the FLoating Point OPerations count. So the speed of the processor is entered into the equation when calculating the original estimated time to compute. Then the actual time is divided by the original time to get a duration correction factor for that task.

I believe that BOINC only maintains one speed for all GPUs and one speed for all CPUs. It is this number that needs to be replicated for each GPU type rather than the DCF.
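The point about speed entering the estimate can be sketched as follows. This is an illustration only, with hypothetical function names, and it assumes the estimate is simply the server's operation count divided by a device speed in FLOPS.

```cpp
#include <cassert>

// Hypothetical sketch, not BOINC code: estimate runtime from the server's
// floating-point operation count and the speed of the device that will
// actually run the task.
double estimated_seconds(double fpops_est, double device_flops) {
    assert(fpops_est > 0.0 && device_flops > 0.0);
    return fpops_est / device_flops;
}

// Ratio of actual to estimated runtime, i.e. the per-task quantity that
// feeds the duration correction factor.
double duration_ratio(double elapsed_seconds, double fpops_est,
                      double device_flops) {
    return elapsed_seconds / estimated_seconds(fpops_est, device_flops);
}
```

If each GPU is given its own speed, two cards that reach the same fraction of their peak produce the same ratio, so a single DCF can serve both. If the slow card's task is estimated with the fast card's speed, its ratio is inflated by the speed difference, which is the kind of jump seen in the logs.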
____________


BOINC WIKI

Profile red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,025,351
RAC: 18
United Kingdom
Message 1208766 - Posted: 22 Mar 2012, 9:22:57 UTC - in response to Message 1208639.
Last modified: 22 Mar 2012, 9:53:25 UTC

I believe that BOINC only maintains one speed for all GPUs and one speed for all CPUs. It is this number that needs to be replicated for each GPU type rather than the DCF.

Yes, that is what I meant by "there needs to be per device DCF adjustment". In my initial post I also said "clearly BOINC should be taking relative speed of the GPUs into account when calculating the DCF".

When will the BOINC that does this or resolves my issue using some other regime be released?

You have not commented on the current issues I have with the current DCF jumping way too high. When will that code be fixed or expunged?

Profile red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,025,351
RAC: 18
United Kingdom
Message 1208768 - Posted: 22 Mar 2012, 9:25:49 UTC - in response to Message 1208639.
Last modified: 22 Mar 2012, 9:28:48 UTC

Actually a DCF per device is not necessarily a requirement. After all, it is the difference between the actual and expected times, but the servers do not specify the time, they specify the FLoating Point OPerations count. So the speed of the processor is entered into the equation when calculating the original estimated time to compute. Then the actual time is divided by the original time to get a duration correction factor for that task.

Actually I have never asked for "a DCF per device". To me it has always been obvious this would not be a good solution to the issue I have.

Profile Ageless
Joined: 9 Jun 99
Posts: 12310
Credit: 2,606,869
RAC: 1,041
Netherlands
Message 1208770 - Posted: 22 Mar 2012, 9:27:49 UTC - in response to Message 1208766.

When will the BOINC that does this be released?

As far as I know, never, since CreditNew will take over the function of TDCF and then it all happens on the server.

The server will maintain host_app_version.et, the statistics (mean and variance) of job runtimes (normalized by wu.fpops_est) per host and application version.
Source, Job runtime estimates.
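For readers unfamiliar with the idea, keeping the mean and variance of normalized runtimes could look something like the sketch below. The struct and its members are invented for illustration and are not the BOINC server implementation; only the notion of runtimes normalized by wu.fpops_est comes from the quoted description.

```cpp
#include <cassert>

// Hypothetical illustration of a running mean and variance of normalized
// runtimes (elapsed seconds / fpops_est), updated one result at a time.
// Names are invented for this sketch; this is not the BOINC server code.
struct RuntimeStats {
    long long n = 0;
    double mean = 0.0;
    double m2 = 0.0;  // sum of squared deviations from the running mean

    void add(double elapsed_seconds, double fpops_est) {
        assert(fpops_est > 0.0);
        const double x = elapsed_seconds / fpops_est;
        ++n;
        const double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    // Sample variance; 0 until at least two results have been recorded.
    double variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }

    // Predicted runtime for a new job of the given size.
    double estimate_seconds(double fpops_est) const {
        return mean * fpops_est;
    }
};
```

Note that one such record per host and application version would still blend results from GPUs of different speeds into a single mean; the quoted description does not say whether devices are distinguished.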
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

Profile red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,025,351
RAC: 18
United Kingdom
Message 1208772 - Posted: 22 Mar 2012, 9:40:37 UTC - in response to Message 1208770.
Last modified: 22 Mar 2012, 9:51:03 UTC

When will the BOINC that does this be released?

As far as I know, never, since CreditNew will take over the function of TDCF and then it all happens on the server.

The server will maintain host_app_version.et, the statistics (mean and variance) of job runtimes (normalized by wu.fpops_est) per host and application version.
Source, Job runtime estimates.

Thank you for the link, which I have just read. I can't see an explicit reference to how GPUs with different speeds are catered for, though. Have I missed it?

Which time will get displayed in the Remaining column for GPU tasks that are not running when a system has GPUs of different speeds?

Which version of BOINC has this?

Profile Ageless
Joined: 9 Jun 99
Posts: 12310
Credit: 2,606,869
RAC: 1,041
Netherlands
Message 1208775 - Posted: 22 Mar 2012, 10:04:08 UTC - in response to Message 1208772.

I answered before your edit, on the notion of "which BOINC will do a per device DCF adjustment". And that's that no BOINC will do that. Also not for per application. As far as I understand from David, DCF is going away and isn't in use in CreditNew and therefore not in use on projects that use CreditNew. Seti is one of the projects that uses CreditNew.
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

Profile red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,025,351
RAC: 18
United Kingdom
Message 1208783 - Posted: 22 Mar 2012, 10:52:13 UTC - in response to Message 1208775.

I answered before your edit, on the notion of "which BOINC will do a per device DCF adjustment". And that's that no BOINC will do that. Also not for per application. As far as I understand from David, DCF is going away and isn't in use in CreditNew and therefore not in use on projects that use CreditNew. Seti is one of the projects that uses CreditNew.

Once I gathered DCF was going I made the edit to make the request general.

The real issue now is will the new regime address GPUs of different speeds being in the same system? Thus far I can't find any information that says it will.

Will the new regime address CPUs with HyperThreading Technology where the CPU speed depends on if there is one or two threads active on each Core?

Profile Ageless
Joined: 9 Jun 99
Posts: 12310
Credit: 2,606,869
RAC: 1,041
Netherlands
Message 1208789 - Posted: 22 Mar 2012, 11:10:54 UTC - in response to Message 1208783.

The real issue now is will the new regime address GPUs of different speeds being in the same system? Thus far I can't find any information that says it will.

Will the new regime address CPUs with HyperThreading Technology where the CPU speed depends on if there is one or two threads active on each Core?

These are things that you shouldn't ask in the Seti forums, as they're a BOINC thing. So best ask it at the BOINC development email list. This list requires registration.
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

Profile red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,025,351
RAC: 18
United Kingdom
Message 1208804 - Posted: 22 Mar 2012, 12:30:07 UTC - in response to Message 1208789.
Last modified: 22 Mar 2012, 12:34:17 UTC

The real issue now is will the new regime address GPUs of different speeds being in the same system? Thus far I can't find any information that says it will.

Will the new regime address CPUs with HyperThreading Technology where the CPU speed depends on if there is one or two threads active on each Core?

These are things that you shouldn't ask in the Seti forums, as they're a BOINC thing. So best ask it at the BOINC development email list. This list requires registration.

It would be better to use a PM rather than posting on this thread, but as you have I have to reply here. I do not wish to join the BOINC development email list as I suspect I would get a large number of emails. Given this what other option do I have but to post the issue here?

I feel there should be a "Developer" thread that requires approval before you are allowed to post on which these types of concerns could be raised. I feel approval is needed to keep the Signal to Noise ratio high.

Profile Ageless
Joined: 9 Jun 99
Posts: 12310
Credit: 2,606,869
RAC: 1,041
Netherlands
Message 1208833 - Posted: 22 Mar 2012, 14:13:03 UTC - in response to Message 1208804.
Last modified: 22 Mar 2012, 14:13:28 UTC

I do not wish to join the BOINC development email list as I suspect I would get a large number of emails. Given this what other option do I have but to post the issue here?

You could of course detach from that list after you've had your question(s) answered, but if you really do not feel like joining that list, you can always try to email David personally. Be nice and eloquent, though. Make sure to explain your problem in detail.

I feel there should be a "Developer" thread that requires approval before you are allowed to post on which these types of concerns could be raised. I feel approval is needed to keep the Signal to Noise ratio high.

But then you'd need such a thread on every one of the 50+ projects, and someone going round those projects on a daily basis to gather information.

It's easier to have forums for that, which we do... but even then, the developers will only check in there when such-and-such a thread and its contents are pointed out to them (by me mostly). They're too busy with all other things BOINC, non-BOINC and personal life to also go read and answer forums 3 times a day.

The email lists enter directly into their email box, which is why I point that out first. This is where the other volunteer developers (such as John McLeod) will also be reading what you have to say/ask and answer if they know about the subject.
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

Profile dads
Joined: 14 Jan 12
Posts: 4
Credit: 158,397
RAC: 0
United States
Message 1209960 - Posted: 25 Mar 2012, 5:06:56 UTC

This is what gets my goat: they send me 25 603 Enhanced non-CUDA units, and when those are done I'm left with 83 CUDA units and my quad core does nothing until I get most of the units done and they send me more work. I need more work for my CPU x 4.
____________


Copyright © 2014 University of California