Average processing rate - a little high?

Message boards : Number crunching : Average processing rate - a little high?
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1138282 - Posted: 10 Aug 2011, 7:53:27 UTC

I guess if the Average processing rate is out of whack, on the high side, then strange, probably not good, things can happen. On the CPU with the opt apps, this has normally been ~25.

Average processing rate 1809.5925556394
ID: 1138282
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1138298 - Posted: 10 Aug 2011, 10:18:53 UTC

Yeah, there's something weird there. My slow single-core machine recently got switched from AP-only to MB-only. AP was running ~12 for APR and ~49 hours for an AP. I guess the "normal" MBs were projected to be about 9 hours, with some of them predicting 28 hours. I looked earlier today and did not have an APR even with 14 completed tasks, but now I do with 17 completed, and it is 95, and I see TONS of WUs in the cache, with some shorties estimating 7 minutes (realistically, more like 2.5 hours).

With so much duration on the deadline, I'm not worried about it at all (and that's what panic mode in BOINC is for anyway). I am simply concurring.. something is wonky.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1138298
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1138305 - Posted: 10 Aug 2011, 11:01:45 UTC - in response to Message 1138298.  

Yeah, there's something weird there. My slow single-core machine recently got switched from AP-only to MB-only. AP was running ~12 for APR and ~49 hours for an AP. I guess the "normal" MBs were projected to be about 9 hours, with some of them predicting 28 hours. I looked earlier today and did not have an APR even with 14 completed tasks, but now I do with 17 completed, and it is 95, and I see TONS of WUs in the cache, with some shorties estimating 7 minutes (realistically, more like 2.5 hours).

With so much duration on the deadline, I'm not worried about it at all (and that's what panic mode in BOINC is for anyway). I am simply concurring.. something is wonky.

Is that host 4082448? I'm seeing 0 tasks completed (== validated), which is odd, as well - especially since I see four of them in your task list.

Have you merged any host records recently, or has the host been renumbered? David is struggling with a problem in that area.

Apart from that, what you've seen with the 7 minute estimates is, regrettably, normal for tasks issued just after the 10th validation for any new host or any new application. I've described it, several times, as 'DCF squared'.
ID: 1138305
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1138315 - Posted: 10 Aug 2011, 11:52:45 UTC
Last modified: 10 Aug 2011, 11:53:36 UTC

Richard, in your contacts with David, has anyone asked why BOINC is using the new method of estimates together with DCF?

When starting a new host, the combined result is horrific. By the time the new system kicks in (it has to wait for validations), DCF is already down to the old values of 0.1 or lower, so the displayed figure is approximately true estimate * DCF. If not watched, the result is 1000's of d/loads.
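
To put rough numbers on that, here is a minimal sketch of the arithmetic, assuming the client simply multiplies whatever estimate the server sends by the project-wide DCF. The function name and all the figures are made up for illustration; they are not taken from any host in this thread or from the BOINC source.

```python
def client_estimate(server_estimate_s: float, dcf: float) -> float:
    """Runtime shown by the client: server estimate scaled by the project DCF."""
    return server_estimate_s * dcf


# Hypothetical figures, for illustration only.
TRUE_RUNTIME_S = 9000.0   # what the task really takes (2.5 hours)
RAW_ESTIMATE_S = 90000.0  # old-style server estimate, 10x too long for an opt app
APR_ESTIMATE_S = 9000.0   # new APR-based server estimate, already about right

# While the old estimates are in use, DCF settles at the value that corrects them.
dcf = TRUE_RUNTIME_S / RAW_ESTIMATE_S

before_switch = client_estimate(RAW_ESTIMATE_S, dcf)  # close to the true runtime
after_switch = client_estimate(APR_ESTIMATE_S, dcf)   # corrected a second time

print(f"DCF learned on the old estimates: {dcf:.2f}")
print(f"estimate before the switch: {before_switch / 3600:.2f} h")
print(f"estimate after the switch:  {after_switch / 60:.0f} min")
```

With these made-up numbers the same task goes from a 2.5 hour estimate to a 15 minute one the moment the server starts using APR, and the client asks for work accordingly.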
ID: 1138315
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1138338 - Posted: 10 Aug 2011, 13:07:43 UTC - in response to Message 1138282.  
Last modified: 10 Aug 2011, 13:10:26 UTC

I guess if the Average processing rate is out of whack, on the high side, then strange, probably not good, things can happen. On the CPU with the opt apps, this has normally been ~25.

Average processing rate 1809.5925556394

Your host, in its first 5 tasks, has completed two 'In ap_remove_radar.cpp: get_indices_to_randomize: num_ffts_forecast < 100. Blanking too much RFI?' tasks; that's what has put your APR so high,

Really, MB -9's, and AP's 'In ap_remove_radar.cpp: get_indices_to_randomize: num_ffts_forecast < 100. Blanking too much RFI?' and 'Found 30 single pulses and 30 repeating pulses, exiting.' should be excluded from the APR calculations,
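
A rough illustration of why a couple of early exits swamp the average, assuming the rate for each task is just its estimated operation count divided by the elapsed time, and that the first few tasks are combined with a plain mean. The task figures, field names and function names below are hypothetical, not taken from the server code.

```python
from statistics import mean


def apparent_rate_gflops(fpops_est: float, elapsed_s: float) -> float:
    """Apparent speed of one task: estimated operations / elapsed time."""
    return fpops_est / elapsed_s / 1e9


def average_rate(tasks, skip_early_exits: bool = False) -> float:
    """Plain mean of the per-task rates, optionally ignoring early exits."""
    rates = [
        apparent_rate_gflops(FPOPS_EST, t["elapsed_s"])
        for t in tasks
        if not (skip_early_exits and t["early_exit"])
    ]
    return mean(rates)


# Hypothetical host: every task carries the same operation estimate,
# but two of the first five bail out after a few minutes.
FPOPS_EST = 1.0e15
tasks = [
    {"elapsed_s": 40000.0, "early_exit": False},
    {"elapsed_s": 40000.0, "early_exit": False},
    {"elapsed_s": 40000.0, "early_exit": False},
    {"elapsed_s": 200.0, "early_exit": True},
    {"elapsed_s": 200.0, "early_exit": True},
]

with_exits = average_rate(tasks)
without_exits = average_rate(tasks, skip_early_exits=True)

print(f"APR with early exits counted:  {with_exits:.1f}")
print(f"APR with early exits excluded: {without_exits:.1f}")
```

The early exits are credited with the full operation estimate for a few minutes of work, so each one looks like a machine hundreds of times faster than it is.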

Claggy
ID: 1138338
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1138345 - Posted: 10 Aug 2011, 13:23:06 UTC - in response to Message 1138315.  

Well, the "why" is easy: I, and many other people, called for it - ever since the launch of Astropulse.

That was the point at which, for me at least, it became clear that the previous single-DCF-value-per-project model couldn't handle multiple applications with different running characteristics - exacerbated here by the extensive use of optimised apps and anonymous platform, which widened that initial disparity between MB and AP to something like 4x. I'm sure you'll remember the days when a run of MB work would bring AP estimates down to 10 hours - which led to over-caching and missed deadlines when they really turned out to take 40 hours. We covered all that together in Beta.

David took two decisions: first, to implement per_app_version on the server, rather than in the client; and second, to integrate variable DCF into CreditNew so tightly that it would be difficult to separate it back out again.

I'm not going to talk about CreditNew here - that's for a different discussion. Let's stick to DCF.

I believe (though I'd need to go back through years of emails to find it) that David felt that it would be too difficult to implement multiple DCFs in the client. Jason Gee has disproved that theory comprehensively: but then, of course, we ran into "not invented here" syndrome.

DCF squared was obviously going to be a problem from the very early tests at Beta: but unfortunately, we never completed a proper, full, test and debug at Beta.

We needed urgent remedial action here to cope with the Fermi fiasco last year. And, again for reasons which I don't think were ever fully explained, David chose to migrate the full multiple DCF server code here, ready or not, along with the cuda_fermi app.

It clearly wasn't ready, and the necessary band-aids were quickly applied. But - and this is again, sadly, typical of BOINC development - "the moving finger wrote...

... and, having writ,
Moves on: nor all thy Piety nor Wit
Shall lure it back to cancel half a Line,
Nor all thy Tears wash out a Word of it."

I don't mind having yet another go at persuading him to come back and have another go at 'DCF squared'. I gave Einstein a graphic demonstration of the consequences when they were considering moving to CreditNew earlier this year - which may very well account for Einstein message 113483. That may help, but it's a tough nut to crack.
ID: 1138345
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1138378 - Posted: 10 Aug 2011, 15:19:26 UTC - in response to Message 1138338.  

snipped......
Really, MB -9's, and AP's 'In ap_remove_radar.cpp: get_indices_to_randomize: num_ffts_forecast < 100. Blanking too much RFI?' and 'Found 30 single pulses and 30 repeating pulses, exiting.' should be excluded from the APR calculations,

Claggy

I agree totally.
ID: 1138378
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1138386 - Posted: 10 Aug 2011, 15:34:39 UTC - in response to Message 1138345.  

snipped .....
we ran into "not invented here" syndrome.

That has been around far too long in the BOINC developers' world, along with "you're just a cruncher, how dare you criticise my code. There is no fault here...." until, three weeks later, a project manager notices the same problem and it's fixed within the hour. No apologies etc.

And people wondered why I didn't join his little group.

ID: 1138386
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 1138410 - Posted: 10 Aug 2011, 16:15:52 UTC - in response to Message 1138305.  
Last modified: 10 Aug 2011, 16:16:41 UTC

Is that host 4082448? I'm seeing 0 tasks completed (== validated), which is odd, as well - especially since I see four of them in your task list.

Have you merged any host records recently, or has the host been renumbered? David is struggling with a problem in that area.

My host 7927 also shows
	SETI@home Enhanced (anonymous platform, CPU) 
	Number of tasks completed 0
Though it needs nearly two days to complete a task, it does so, and all are validated!

No merging or renumbering either (I never would renumber a four-digit hostID:-).

Gruß,
Gundolf
ID: 1138410
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1138445 - Posted: 10 Aug 2011, 17:44:42 UTC - in response to Message 1138305.  

Is that host 4082448? I'm seeing 0 tasks completed (== validated), which is odd, as well - especially since I see four of them in your task list.

Have you merged any host records recently, or has the host been renumbered? David is struggling with a problem in that area.

Apart from that, what you've seen with the 7 minute estimates is, regrettably, normal for tasks issued just after the 10th validation for any new host or any new application. I've described it, several times, as 'DCF squared'.


Yes, it is that host. I've never seen "number of completed tasks" be non-zero on either of my two machines.. well, recently anyway. Main cruncher has remnants of evidence that it used to run MB, but that was in the early days of creditnew and separate DCFs. I do not recall when that number stopped counting for me, but I know it just doesn't increment. A few months ago, I had just over 1000 consecutive valid APs, and "completed tasks" was still zero.

I can tell you that up until a few days ago, the single-core machine never had any application details for MB, so the "consecutive valid tasks" is my only evidence of doing work with that app. And now that I've gone from 17 to 20 on that number, APR has come down from 95 to 40, but I'm still getting 7-minute shorties, and in fact, I have LOTS of them.

The machine ID has been around since "20 Dec 2007 | 3:27:47 UTC". It has gone from XP > Linux > Server 2003 over the years, and seen a few hardware changes, but I disabled network comms and set NNT before doing the data folder migration from one OS to another. Haven't had any nuked caches so far. BOINC just runs the benchmarks and resumes crunching where it left off (minor app_info.xml changes since the filenames for the apps are different in Windows and Linux).

Last time I merged any hosts was about 3 years ago, and at that same time, I also deleted 7 old machines from the account (they were going to crunch and got work issued, but had a catastrophic failure after 1-3 returned WUs).
ID: 1138445
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1138738 - Posted: 11 Aug 2011, 4:13:55 UTC

The one time having a low-ish quota comes in handy. I was hoping that maybe once some of the WUs at the top of the cache burn through, it would fix the DCF for all the other "ready to start" tasks.. but so far it hasn't. Hit the quota of 124 tasks for today.. unfortunately, that will only keep me from accumulating more for the next 3 hours..then it's a new day. Not doing NNT.. seeing how far this will go.
ID: 1138738
Iona
Joined: 12 Jul 07
Posts: 790
Credit: 22,438,118
RAC: 0
United Kingdom
Message 1140364 - Posted: 14 Aug 2011, 10:22:09 UTC - in response to Message 1138378.  

snipped......
Really, MB -9's, and AP's 'In ap_remove_radar.cpp: get_indices_to_randomize: num_ffts_forecast < 100. Blanking too much RFI?' and 'Found 30 single pulses and 30 repeating pulses, exiting.' should be excluded from the APR calculations,

Claggy

I agree totally.



Yep, it gets my vote, too. When I first ran the Lunatic GPU apps on my 4890, around half of the first 10 results were 'overflows' and validated as OK. I monitored things as best I could when I first ran the GPU apps, but only found the increase in System and CPU temp, due to the cooler on the GPU not venting the heat out of the rear of the PC, to be something of concern. I like nice cool systems, as everyone else does! It was only after I ended up with insanely short completion times and the subsequent 'computing errors' that Mike pointed out the insanely high APR on the GPU app for MB... it was then that I 'twigged' that the 'overflows' must have been included in arriving at that APR. That's crazy! In time, I'll use Fred's Re-scheduler, but I think it will take an awful lot of WUs to get the APR down to a more realistic figure.... after all, it's only out by a factor of 10X or so.



Don't take life too seriously, as you'll never come out of it alive!
ID: 1140364
Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1140378 - Posted: 14 Aug 2011, 12:57:20 UTC - in response to Message 1140364.  

snipped......
Really, MB -9's, and AP's 'In ap_remove_radar.cpp: get_indices_to_randomize: num_ffts_forecast < 100. Blanking too much RFI?' and 'Found 30 single pulses and 30 repeating pulses, exiting.' should be excluded from the APR calculations,

Claggy

I agree totally.



Yep, it gets my vote, too. When I first ran the Lunatic GPU apps on my 4890, around half of the first 10 results were 'overflows' and validated as OK. I monitored things as best I could when I first ran the GPU apps, but only found the increase in System and CPU temp, due to the cooler on the GPU not venting the heat out of the rear of the PC, to be something of concern. I like nice cool systems, as everyone else does! It was only after I ended up with insanely short completion times and the subsequent 'computing errors' that Mike pointed out the insanely high APR on the GPU app for MB... it was then that I 'twigged' that the 'overflows' must have been included in arriving at that APR. That's crazy! In time, I'll use Fred's Re-scheduler, but I think it will take an awful lot of WUs to get the APR down to a more realistic figure.... after all, it's only out by a factor of 10X or so.




You can also change to 3 instances for MB tasks.
Brings APR down as well.

Feel free to PM me if needed.




With each crime and every kindness we birth our future.
ID: 1140378
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1140745 - Posted: 15 Aug 2011, 10:50:52 UTC

Finally got to that point where there should be a true estimate of the completion time. But because the APR is too high, the estimate is approx 1/3 of actual. Therefore, when it completes, DCF is going to go through the roof, which affects all Seti apps.

Why, oh why, after 6 years or so of BOINC, don't we have a version of BOINC that works? When is BOINC2 going to be released?
ID: 1140745
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1141613 - Posted: 17 Aug 2011, 14:55:27 UTC
Last modified: 17 Aug 2011, 15:08:25 UTC

Finally. Well my APR has come down to what it should be, but I still had a lot of estimated 2-8 minute shorties in the cache.. at least a hundred of them, all with a due date fast approaching, and I was starting to wonder if/when the DCF and ETA were going to figure out what was going on. Finally happened sometime this morning. Those 2-8 minute shorties are actually about 2h20m, so now.. according to BoincTasks.. I have 267d of cache on that machine.

I should probably consider aborting all of it now that the ETA and DCF are reasonable.

edit: there we go. I did the humane thing and released ~580 tasks back into the wild.
ID: 1141613
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1141943 - Posted: 18 Aug 2011, 3:54:08 UTC - in response to Message 1141613.  

I'm still struggling with AP tasks: even the latest to arrive has an estimate of 7h:30m, and the earlier ones were at 4h:56m. The actual crunch time is 12h:30m with no radar blanking; severe blanking can push crunch time out beyond 15h.

So I am forcing it to have at least one AP task running, but when it finishes the DCF shoots up to ~2.5 so then there are no reports/requests for several hours.
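
For what it's worth, the ~2.5 falls straight out of the figures above if the client raises DCF to the actual/estimated ratio in one step and only lets it drift back down slowly. The sketch below is a simplified model of that behaviour, not the client's code: the function name, the 1.1 and 0.1 thresholds and the 10%/1% weights are assumptions, and it assumes the 4h:56m figure was shown with DCF close to 1.

```python
def update_dcf(dcf: float, actual_s: float, raw_estimate_s: float) -> float:
    """Simplified asymmetric DCF update (assumed thresholds and weights)."""
    raw_ratio = actual_s / raw_estimate_s          # actual vs uncorrected estimate
    adj_ratio = actual_s / (raw_estimate_s * dcf)  # actual vs corrected estimate
    if adj_ratio > 1.1:
        # Task ran well over its estimate: jump straight up.
        return raw_ratio
    # Task ran on or under its estimate: move down cautiously.
    weight = 0.01 if adj_ratio < 0.1 else 0.1
    return (1 - weight) * dcf + weight * raw_ratio


ESTIMATE_S = 4 * 3600 + 56 * 60  # 4h:56m estimate for the earlier AP tasks
ACTUAL_S = 12 * 3600 + 30 * 60   # 12h:30m actual crunch time

dcf_after_ap = update_dcf(1.0, ACTUAL_S, ESTIMATE_S)

# Hypothetical follow-up: one task that runs exactly to its raw estimate.
dcf_after_next = update_dcf(dcf_after_ap, 3600.0, 3600.0)

print(f"DCF after one AP task:        {dcf_after_ap:.2f}")
print(f"DCF after one on-target task: {dcf_after_next:.2f}")
```

Under these assumptions a single AP task pushes DCF to about 2.5, and an on-target task afterwards only pulls it back a little, so every estimate in the cache stays inflated for a while.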
ID: 1141943
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1149289 - Posted: 5 Sep 2011, 18:43:53 UTC - in response to Message 1138378.  
Last modified: 5 Sep 2011, 18:44:10 UTC

snipped......
Really, MB -9's, and AP's 'In ap_remove_radar.cpp: get_indices_to_randomize: num_ffts_forecast < 100. Blanking too much RFI?' and 'Found 30 single pulses and 30 repeating pulses, exiting.' should be excluded from the APR calculations,

Claggy

I agree totally.

I emailed the BOINC dev list last week, asking for MB -9's and the two Astropulse early-exit conditions to be excluded from APR calculations.
I've had a reply from DA:

I'll do 2 things:

- short term: put a limit on the impact of ET estimates
in runtime estimation, to avoid -177 errors.

- long term: as you suggest, allow the project's validation function
to say "Ignore this job in timing statistics"
(i.e., because it exited early, like SETI@home's overflow jobs).

It may take a few days to deploy these on S@h.

-- David


Stage one is changeset 24128, which has probably just been applied to the scheduler:

- web: fix warnings in forum pages
- scheduler: when using elapsed time stats to predict runtime,
cap the estimated FLOPS at twice the peak FLOPS;
otherwise, if a host has received a lot of very short jobs
recently, it will get a too-high FLOPS estimate and
will exceed the rsc_fpops_bound limit.
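
As I read that note, the effect can be sketched as below. This is only an illustration, assuming the elapsed-time limit is rsc_fpops_bound divided by the FLOPS estimate sent with the task; the function names and every number are hypothetical and not lifted from the changeset.

```python
def capped_flops(apr_flops: float, peak_flops: float) -> float:
    """FLOPS estimate after the cap: never more than twice the peak FLOPS."""
    return min(apr_flops, 2.0 * peak_flops)


def elapsed_limit_s(fpops_bound: float, flops_estimate: float) -> float:
    """Assumed time limit before a 'maximum elapsed time exceeded' abort."""
    return fpops_bound / flops_estimate


# Hypothetical host and task.
PEAK_FLOPS = 4.0e9            # benchmark-derived peak speed
INFLATED_APR_FLOPS = 1.0e12   # APR blown up by a run of very short jobs
FPOPS_EST = 9.0e13            # estimated operations for the task
FPOPS_BOUND = 10 * FPOPS_EST  # bound set at ten times the estimate
TRUE_RUNTIME_S = 18000.0      # how long the task really takes (5 hours)

limit_uncapped = elapsed_limit_s(FPOPS_BOUND, INFLATED_APR_FLOPS)
limit_capped = elapsed_limit_s(
    FPOPS_BOUND, capped_flops(INFLATED_APR_FLOPS, PEAK_FLOPS)
)

print(f"limit without the cap: {limit_uncapped:.0f} s")
print(f"limit with the cap:    {limit_capped:.0f} s")
print(f"true runtime:          {TRUE_RUNTIME_S:.0f} s")
```

With these numbers the uncapped limit is far shorter than the real runtime, so the task would be killed part way through, while the capped limit leaves plenty of room.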

Claggy
ID: 1149289
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1149390 - Posted: 5 Sep 2011, 22:31:32 UTC - in response to Message 1149289.  

...
Stage one is changeset 24128, which has probably just been applied to the scheduler:

- web: fix warnings in forum pages
- scheduler: when using elapsed time stats to predict runtime,
cap the estimated FLOPS at twice the peak FLOPS;
otherwise, if a host has received a lot of very short jobs
recently, it will get a too-high FLOPS estimate and
will exceed the rsc_fpops_bound limit.

Claggy

The change is in a section which applies only to those running anonymous platform, others should not be affected.

For those running optimized apps it ought to be very effective at preventing -177 "Maximum elapsed time exceeded" errors. For those without <flops> in app_info.xml the limited scaling will likely make runtime estimates quite long until duration correction factor works down to an appropriate fractional level.

I guess Dr. Anderson could have already rebuilt the project scheduler, but this being a U.S. holiday I think it's more likely tomorrow (perhaps during the downtime).
                                                                  Joe
ID: 1149390
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1149463 - Posted: 6 Sep 2011, 10:23:06 UTC - in response to Message 1149289.  

Thanks Claggy.

The APR for AP tasks is still v. high (60+) in comparison to MB (~20) on the CPU. All thanks to the three early exits in the first ten tasks.
ID: 1149463
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1152481 - Posted: 15 Sep 2011, 15:29:40 UTC

I have just checked my outstanding AP tasks that have remained pending for some time, AKA over the deadline. One of these tasks has now been sent to host 5359757 who admittedly has some problems. But he has completed several AP tasks.

If you click on the Application details for Astropulse v505 5.05 windows_intelx86, we get these details:
Number of tasks completed 6
Max tasks per day 110
Number of tasks today 0
Consecutive valid tasks 10
Average processing rate 1022.418334262
Average turnaround time 2.16 days

Somehow he has completed 6 tasks but validated 10, which seems a little odd to me.
You will also see that the APR is as per the thread title.

Going further and checking the actual validated AP tasks, we see 8 tasks at the moment: four completed 100% and four exited early, all with "too much blanking".

From what has been said by those "in the know" the problems of estimating times and the subsequent effects on the scheduler caused by the APR only affect those running with an app_info file.

This host also has 13 files in error, but that doesn't seem to have affected its ability to download either, i.e. max tasks per day has not been reduced.

Can someone please explain?
If this is a typical host that does a fair amount of AP tasks using the default application, and it isn't brought into line, why are those of us running optimised apps punished with restricted d/loads and, I'm told, low credits?
ID: 1152481

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.