Using FLOPs estimates in app_info.xml

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1006143 - Posted: 19 Jun 2010, 10:43:16 UTC

This question has been scattered all through recent threads, so I thought I'd provide a consolidation point.

For the details, read BOINC changeset [trac]changeset:21775[/trac]:

- scheduler: estimate peak FLOPS of anon platform app versions based on CPU and GPU usage (or, if missing, 1 CPU).

Previously we were using the user-supplied <flops> element, and if it was missing all hell broke loose.

Too true. David wrote that about 12 hours ago. I don't know if it's active on SETI yet (it probably is), but the direction of travel is certainly clear:

Don't add any new FLOPs entries, and prepare to remove any that you have now.

I say "prepare", because the transition may, yet again, be nasty, and I can't test it until Monday. If you've been following the calculations which MarkJ and I have been posting, your flops figure will declare your GPU to be slower than it really is. So removing it may result in a deluge of new work, which you will have difficulty in completing - I think WinterKnight is in that position right now.

As always, experiments are best performed with a small cache - less to lose if it goes wrong.
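
For reference, here is my reading of what that changeset does - a minimal sketch under my own assumptions, with illustrative names, not the actual scheduler code:

    // Sketch of the changeset 21775 idea (names illustrative, not BOINC's):
    // for an anonymous-platform app version with no user-supplied <flops>
    // element, derive a peak-FLOPS estimate from the resources the version
    // declares in app_info.xml.
    double estimate_app_version_flops(double avg_ncpus, double ngpus,
                                      double cpu_flops, double gpu_peak_flops) {
        if (avg_ncpus == 0 && ngpus == 0) avg_ncpus = 1;  // "or, if missing, 1 CPU"
        return avg_ncpus * cpu_flops + ngpus * gpu_peak_flops;
    }

On that reading, a GPU app version is now credited with something near the card's peak speed - which is exactly why estimates shrink, and a deluge becomes possible, once a conservative <flops> entry stops being used.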
ID: 1006143
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1006153 - Posted: 19 Jun 2010, 11:03:58 UTC - in response to Message 1006150.  

I'll wait and see if it balances out eventually; if not, I'll go back to my backed-up app_info with flops.

You'll probably have to wait until after you've started crunching tasks issued after David implemented that change, and then up to 200 / 300 CUDA tasks beyond that point - it can take that long for DCF to fully stabilise, if it goes completely off-scale.

Let us know how you get on. Once it does get stable, it will be interesting to see what the effect on rescheduled task time estimates is.
ID: 1006153
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19062
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1006154 - Posted: 19 Jun 2010, 11:07:38 UTC - in response to Message 1006143.  

Actually my computer is not in any trouble, because I hit the NNT (No New Tasks) button when it had downloaded ~400 tasks, and so stopped the flow. I think I was very lucky there, because it estimated ~2m:20s for each task, when the fastest VHAR task actually takes ~8m:00s.

When the next task finished, the BOINC client did a re-think and suddenly went into panic mode. It had revised the estimate for AP tasks from the correct ~12 hrs to over 400 hrs.

After a few hours it got sorted out and adjusted the DCF to ~1.00000, as part of the changes.
But it does mean that the 2 days' worth of tasks I had on the computer before the sudden d/load have estimates at least twenty times too big.

And thanks to JM7's latest version of the scheduler, they are being ignored and are still forcing the computer into high-priority mode.

The CPU will run out of work at about 23:00 UTC; at that point the quad gets switched off if it cannot d/load more work.

ID: 1006154
FiveHamlet
Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 1006158 - Posted: 19 Jun 2010, 11:15:25 UTC
Last modified: 19 Jun 2010, 11:17:49 UTC

Richard, my main cruncher has started to get 608s, and the estimated times are as they should be: around 8 mins for normal tasks and 2 mins for shorties.
At the moment the 603s are estimated at the same 8 mins; that is because the CPUs are running through about 20 APs.
AP times are estimated at 49 hrs but are taking around 12 hrs to complete.


Dave
ID: 1006158
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19062
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1006160 - Posted: 19 Jun 2010, 11:28:29 UTC

My quad has now finished the VHAR tasks that arrived in the deluge, and has started on a pre-deluge batch due back on 27 July, estimated at 16:51:21 each; the first four completed in between 00:15:47 and 00:18:23.
Therefore my estimate of 20× was way out; make that 60×.
ID: 1006160
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1006162 - Posted: 19 Jun 2010, 11:34:33 UTC - in response to Message 1006160.  

My quad has now finished the VHAR tasks that arrived in the deluge, and has started on a pre-deluge batch due back on 27 July, estimated at 16:51:21 each; the first four completed in between 00:15:47 and 00:18:23.
Therefore my estimate of 20× was way out; make that 60×.

Any time the estimate is more than 10× wrong, BOINC moves DCF very, very slowly (by 1% of the difference per task) - it doesn't really believe it. Only once the error gets below 10× does it move in more confident 10% steps. That's where I got the 200 / 300 job 'settling time' figure from.
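
If you want to see where that figure comes from, here is a toy simulation of the step rule just described (a sketch of the behaviour as stated above, not the actual client code):

    #include <cstdio>

    // Toy model of the DCF step rule: while the estimate is more than 10x
    // wrong, move DCF by 1% of the difference per completed task; once the
    // error is below 10x, move in 10% steps.
    int main() {
        double dcf = 60.0;          // estimates 60x too big, as in the post above
        const double target = 1.0;  // where DCF should end up
        int tasks = 0;
        while (dcf > target * 1.01) {
            double step = (dcf > 10.0 * target) ? 0.01 : 0.10;
            dcf += step * (target - dcf);
            ++tasks;
        }
        std::printf("DCF settled after %d tasks\n", tasks);
    }

Starting 60× out, that loop settles in roughly 250 tasks - which is where the 200 / 300 range comes from.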
ID: 1006162
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19062
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1006164 - Posted: 19 Jun 2010, 11:49:15 UTC - in response to Message 1006162.  

My quad has now finished the VHAR tasks that arrived in the deluge, and has started on a pre-deluge batch due back on 27 July, estimated at 16:51:21 each; the first four completed in between 00:15:47 and 00:18:23.
Therefore my estimate of 20× was way out; make that 60×.

Any time the estimate is more than 10× wrong, BOINC moves DCF very, very slowly (by 1% of the difference per task) - it doesn't really believe it. Only once the error gets below 10× does it move in more confident 10% steps. That's where I got the 200 / 300 job 'settling time' figure from.

I knew that; I've had many discussions with John about the scheduler.

As the computer is now doing pre-deluge tasks it is lowering the DCF, now at ~0.9.

I'm thinking about what to do next:
1. Let it carry on, and probably run out of work for the CPU; then switch off until the problems are sorted.
2. Adjust DCF to a lower value so that it is no longer in panic mode, and hope the CPU can get some work.
3. Adjust to the correct value (0.04) for the pre-deluge tasks, and risk another GPU deluge and a return to square one.

Urmmmm??????

ID: 1006164
Terror Australis
Volunteer tester
Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 1006192 - Posted: 19 Jun 2010, 13:55:48 UTC
Last modified: 19 Jun 2010, 14:12:13 UTC

A little bit of info.
I looked at one of my rigs and saw the DCF was up at 5.8. I stopped BOINC, reset the DCF to 1.00 and removed the <flops> entries in the app_info file.

On restarting, the estimated times to completion were 24 mins for a GPU unit (about right, but around 4 mins high) and 30 mins for a CPU unit (normally around 2 hours).

I observed the following.

When a GPU unit was completed, the DCF would step down by around 0.1 per unit. Estimated GPU times would drop by about 90 secs towards the "normal" value, and CPU task estimates would increase by about 5 minutes towards the "normal" time.

When a CPU unit was finished, DCF would jump immediately to around 3.5. GPU ETAs would jump out to around 1 hr 20 min, and CPU ETAs to between 1 hr 30 min and 1 hr 40 min. From this point, as GPU units were completed, both GPU and CPU ETAs would reduce by about 5 minutes per completed unit, until the next CPU unit finished and they jumped up again. If the computer reported a mixed bag of GPU and CPU tasks in the normal 6:1 ratio, the DCF would only vary a couple of decimal points either way, the variation from the two types of unit cancelling each other out. 90% of the CPU units and all of the GPU units were downloaded after the start of the 12-hour period you mention above.

At the time others were complaining of errors induced by an out-of-control DCF, there was also a general complaint about a lack of GPU units. This could be the reason: no GPU units were being returned to cancel the wild DCF rises caused by finishing CPU units. Looks like someone needs to check their algorithms.

Edit: If you want to keep an eye on the box, it's this one.


Brodo
ID: 1006192
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1006218 - Posted: 19 Jun 2010, 15:18:52 UTC - in response to Message 1006192.  

My CUDA host uses app_info, but with no flops entries in it.
Currently it has some midrange CUDA tasks with an ETA of 72 hours (!) and some CPU VLARs with an ETA of ~61 hours.
It also has some tasks (VLARs too) on the CPU with the "usual" ETA of ~2 h 30 min.
How can tasks of roughly the same AR be treated so differently? Probably they were received at different stages of this transformation.

BTW, a few hours ago the ETA for midrange CUDA was a little under 60 hours. That is, it's increasing, though the completion time surely remains the same ~20 min per task.
A very bad sign IMO. This host already refuses to ask for any new GPU work.
ID: 1006218
Terror Australis
Volunteer tester
Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 1006220 - Posted: 19 Jun 2010, 15:23:03 UTC - in response to Message 1006218.  

<sounding really techo> What's your DCF on that machine? I'm interested to know if you're seeing the same thing I am.
ID: 1006220
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1006221 - Posted: 19 Jun 2010, 15:28:07 UTC - in response to Message 1006220.  
Last modified: 19 Jun 2010, 15:28:44 UTC

<sounding really techo> What's your DCF on that machine? I'm interested to know if you're seeing the same thing I am.


<duration_correction_factor>3.157689</duration_correction_factor>
I'm actually surprised that there is still ONE value. Though debts are divided per platform now, this single value governs time estimates... surely it's a big flaw in the current BOINC design. GPU and CPU apps behave very differently, and to use a single correction factor for them both...
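
To put invented numbers on that flaw: suppose the CPU app's raw estimates are about right, while the GPU app's are 10× too high. No single DCF can serve both:

    #include <cstdio>
    #include <initializer_list>

    // Invented numbers illustrating one DCF shared by two app types.
    // The displayed estimate is (raw estimate * DCF).
    int main() {
        const double cpu_raw = 2.0, cpu_true = 2.0;  // hours: raw CPU estimate accurate
        const double gpu_raw = 2.0, gpu_true = 0.2;  // hours: raw GPU estimate 10x too high
        for (double dcf : {1.0, 0.55, 0.1}) {
            std::printf("DCF %.2f: CPU shown %.2f h (true %.1f), GPU shown %.2f h (true %.1f)\n",
                        dcf, cpu_raw * dcf, cpu_true, gpu_raw * dcf, gpu_true);
        }
    }

Every CPU completion then drags the shared DCF towards the value the CPU app needs, and every GPU completion drags it back the other way - exactly the see-saw being reported in this thread.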
ID: 1006221
hiamps
Volunteer tester
Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 1006230 - Posted: 19 Jun 2010, 15:40:53 UTC

My CPU completion times are down to 46:33:23. Too funny. As GPU tasks turn in, it goes down, but it seems to reset back to 99 hours if I use the rescheduler. Luckily most of the GPU work I am getting is non-VLAR, although I did do one in a little over an hour this morning.
Official Abuser of Boinc Buttons...
And no good credit hound!
ID: 1006230
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1006234 - Posted: 19 Jun 2010, 15:45:15 UTC - in response to Message 1006230.  
Last modified: 19 Jun 2010, 15:49:34 UTC

I use the re-scheduler on that host occasionally, but I'm sure it hasn't made any moves in the last 24 hours at least. Currently I have disabled the automation, so all tasks belong to the platform for which they were downloaded... And still such strange time estimates. Currently it's working on the tasks with small ETA numbers; we'll see what happens when it meets the first ~70 h GPU task and completes it in 20 min ;D
(though as Richard said already, BOINC will not trust its own "eyes" and will make only a very small correction...)

EDIT: currently the estimate for those CUDA tasks is ~58 h. Dropped again. That is, it looks like Brodo's picture: CUDA task completions lower the ETA until a CPU task finishes and boosts the ETA high again...
EDIT2: and the current DCF is: <duration_correction_factor>2.661938</duration_correction_factor>
ID: 1006234
hiamps
Volunteer tester
Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 1006237 - Posted: 19 Jun 2010, 15:48:51 UTC - in response to Message 1006234.  

As my GPUs turn in work, my time estimates drop by 3 hours; I am now at around 33 hrs.
Official Abuser of Boinc Buttons...
And no good credit hound!
ID: 1006237
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1006238 - Posted: 19 Jun 2010, 15:51:43 UTC - in response to Message 1006221.  

<sounding really techo> What's your DCF on that machine? I'm interested to know if you're seeing the same thing I am.

<duration_correction_factor>3.157689</duration_correction_factor>
I'm actually surprised that there is still ONE value. Though debts are divided per platform now, this single value governs time estimates... surely it's a big flaw in the current BOINC design. GPU and CPU apps behave very differently, and to use a single correction factor for them both...

But that's the point. The current BOINC client - and it's a free choice; "current" can be anything all the way back to v5.10.13 or v5.2.6, whichever you choose - only has one DCF.

They realised (belatedly, under pressure) that multiple values were needed. Rather than recalling all earlier versions, and effectively forcing everyone who wanted to run more than one application within a single project to upgrade to a new version of the BOINC client, they (he) decided to track the additional numbers on the server. It has been said many times - including on the boinc_mail lists that I know you're subscribed to - that the mechanism for reflecting the DCF "corrections" from the server back to the (unmodified) clients is by an adjustment to <rsc_fpops_est>. Sometimes the adjustment results in a value of zero, which is - er - unfortunate, but usually it's finite.
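
Schematically, the arithmetic on the unmodified client stays the same, and the server's only knob is the number it sends out - a sketch under my reading of that mechanism, not confirmed server code:

    // The client's duration estimate is, in essence:
    //     seconds = rsc_fpops_est / flops_estimate * DCF
    double client_estimate_seconds(double rsc_fpops_est, double flops_estimate,
                                   double dcf) {
        return rsc_fpops_est / flops_estimate * dcf;
    }

    // Hypothetical server-side correction: scale the <rsc_fpops_est> sent
    // with each task by the observed ratio of actual to predicted runtime
    // for this host and app version. A ratio that collapses to zero gives
    // the unfortunate zero values mentioned above.
    double corrected_rsc_fpops_est(double raw_fpops_est, double runtime_ratio) {
        return raw_fpops_est * runtime_ratio;
    }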

The trouble is, despite my nasty experience at Beta, nobody has given any consideration to the transitional period. Everybody who still has old tasks in their cache, issued before the server-side correction kicked in, will see these wild client-side DCF fluctuations. It's even worse if they're big enough to trigger EDF, because then you'll be alternating between 'old' and 'new' tasks for even longer.

Until the last of the 'old' tasks are flushed from your cache, you're on the roller-coaster. Only once they're finished, and DCF has finally had a chance to settle down at the 'new' baseline, will we find out if it works.
ID: 1006238
Terror Australis
Volunteer tester
Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 1006243 - Posted: 19 Jun 2010, 16:02:21 UTC

DCF is climbing on all my machines, even my non-CUDA laptop. The laptop's DCF was 2.8 thirty minutes ago, but it has since reported in a bunch of VHARs (around 40 mins crunching time) and is now reading DCF = 1.4. ETAs are showing around 3 hours, which is roughly normal.

Formerly the big crunchers all had DCFs in the 0.3 to 0.4 range; now they are running around the 2 mark.

Don't know if this info is useful or not, but it's there if it's needed.

Brodo
ID: 1006243
hiamps
Volunteer tester
Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 1006258 - Posted: 19 Jun 2010, 16:23:42 UTC - in response to Message 1006237.  

As my GPUs turn in work, my time estimates drop by 3 hours; I am now at around 33 hrs.

Not sure what happened, but my CPU units now say they will finish in 113 hours. BOINC is really boinced. But at least I got a lot of work this morning.
Official Abuser of Boinc Buttons...
And no good credit hound!
ID: 1006258
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1006266 - Posted: 19 Jun 2010, 16:32:55 UTC - in response to Message 1006243.  

Looks like the DCF on all 3 of my machines is over 1 right now. I have current estimates for AP times of 291 hours on the Q8200, 253 on the T7200 and 321 on an X4 630.

Actual times are closer to 24 hours or so. The iMac and the Q8200 are both in high priority.

ID: 1006266
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1006294 - Posted: 19 Jun 2010, 17:45:50 UTC

And on my CUDA host another CPU task has completed - the ETA for CUDA tasks is now 83 h, and the DCF:
<duration_correction_factor>3.815202</duration_correction_factor>
is bigger than before.
This host needs 1-2 days to crunch all the already-downloaded work and get fresh tasks with corrected flops values; we'll see what it looks like then.
ID: 1006294
Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1006307 - Posted: 19 Jun 2010, 18:59:40 UTC

What about the correction between AP and MB on CPU-only machines? Is it OK to keep flops entries in that case? Now that I've got one AP WU, I see that it is marked as "Anonymous platform - CPU", so I assume it will get the same flops estimate as MB, which is wrong by about a factor of 1.6.
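
For anyone joining the thread here, the element in question lives inside each <app_version> block of app_info.xml - a minimal illustrative fragment, with file names and values invented rather than recommended:

    <app_version>
        <app_name>astropulse</app_name>
        <version_num>505</version_num>
        <avg_ncpus>1</avg_ncpus>
        <flops>2.0e9</flops>  <!-- the user-supplied estimate this thread discusses removing -->
        <file_ref>
            <file_name>astropulse_5.05.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>

Each app has its own <app_version> block, so AP and MB can in principle carry different <flops> values - which is exactly the distinction being asked about.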
ID: 1006307