Odd GPU Time Estimates

Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1165272 - Posted: 25 Oct 2011, 11:24:42 UTC

On this machine - 6197362 - I have some odd time estimates for GPU (only). My DCF is 1.64 currently; CPU estimates are around 3-4 hours and AP around 45 hours - off somewhat, but in line with the DCF. GPU estimates, however, are around 3.5 - 4 HOURS, off by a factor of 10 or so. The machine thinks that it has too much work, and is not asking for new work.

Any idea what might cause this? I have had almost 900 consecutive valid GPU WUs, so the estimates should be stable. I am running optimized apps, but this hasn't been a problem before... My other cruncher, essentially the same machine physically, has more-or-less correct estimates, despite a DCF of about 0.26, and is pulling new work from time to time. Both machines run a 10 day cache (when I can get it, of course!).
ID: 1165272
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1165285 - Posted: 25 Oct 2011, 13:04:35 UTC

It's a long story, spread over many threads since last month - about 13 September. Some people were encountering problems, especially if bad luck brought them a run of short tasks (-9 overflows, or the equivalent 30/30 exit for Astropulse) shortly after deploying a new computer. The Berkeley crew tried to put in a server modification to circumvent the errors, but unfortunately it caught a lot of GPU users as well as helping the very few people affected by the original problem.

Reversing the server code changes in one fell swoop would cause as many problems as the original error, though different ones. So, for the time being - indeed, for most of the last month - we're running in a slightly constrained emergency mode, and hoping each week that the staff will have time during maintenance to take one more small step towards normality - unfortunately, each week, another more urgent crisis seems to get in the way.

The main, temporary, changes are:

1) 'Resend lost tasks' is turned off - there are ghosts in the machines. I think that could, and should, be turned back on today, while the database is quiescent.

2) The server code which attempts to nudge DCF towards 1.00 for all applications has been - deliberately - crippled. This is the main reason why your GPU tasks are overestimated at the moment. Your CPU tasks are keeping DCF comparatively high: without them, DCF would have dropped (one of mine is around 0.07), your GPU estimates would be much closer to reality, and your computer would be attempting to download much more work.
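
To put rough numbers on that, here is a quick Python sketch (illustrative figures only, not read from either of your hosts) of why one shared DCF can't satisfy CPU and GPU tasks at the same time:

    # BOINC of this vintage keeps ONE Duration Correction Factor per project and
    # shows estimated_runtime = raw_estimate * DCF, nudging DCF towards
    # actual_runtime / raw_estimate after each completed task.
    HOURS = 3600.0

    cpu_raw_est = 2.2 * HOURS   # raw estimate for a CPU task (rsc_fpops_est / flops)
    cpu_actual  = 3.6 * HOURS   # what the CPU task really takes
    gpu_raw_est = 2.2 * HOURS   # without <flops>, the client assumes the GPU app
                                # is no faster, so the raw estimate is the same
    gpu_actual  = 0.35 * HOURS  # what the GPU task really takes (~20 minutes)

    dcf_cpu_wants = cpu_actual / cpu_raw_est   # ~1.6
    dcf_gpu_wants = gpu_actual / gpu_raw_est   # ~0.16
    print(f"CPU tasks want DCF {dcf_cpu_wants:.2f}, GPU tasks want DCF {dcf_gpu_wants:.2f}")

    # CPU tasks dominate, so DCF sits near the CPU value and the GPU estimate
    # is inflated by roughly a factor of ten.
    dcf = dcf_cpu_wants
    print(f"GPU estimate shown: {gpu_raw_est * dcf / HOURS:.1f} h "
          f"vs actual {gpu_actual / HOURS:.2f} h "
          f"(off by about {gpu_raw_est * dcf / gpu_actual:.0f}x)")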

The next stage of correcting the server code will reduce those estimates towards normality. But for people like me who already have a low DCF, that would result in vastly too much work being requested. So it has to be done gently and gradually: but again, I hope, they should be in a position to take a further step in the right direction today.

And finally - that cache. Because the risk at the moment is one of requesting/receiving too much work each time the DCF correction cap is lifted a notch, the project has put quota limits in place - 50 per CPU core, 400 per GPU (plus ghosts as a bonus extra). Those limits will have to stay in place until everyone's DCF is fully back to normal running, but I hope they can be lifted then.
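
As a quick worked example (in Python; the per-core and per-GPU figures are the temporary limits above, and the host figures match the 2 x quad-core + 4 GPU machines discussed below), the in-progress limit comes out as:

    PER_CPU_CORE = 50
    PER_GPU = 400

    cpu_cores, gpus = 8, 4   # 2 x quad-core Opterons plus 4 GT240s
    limit = cpu_cores * PER_CPU_CORE + gpus * PER_GPU
    print(limit)             # -> 2000 tasks in progress, regardless of cache setting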
ID: 1165285
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1165306 - Posted: 25 Oct 2011, 14:36:05 UTC

Richard - many thanks for the detailed answer. It seems to make sense...

One more question. Since my two machines are very similar (2 x quad core Opteron 2356 CPUs + 4 GT240s each), is there some reason why they are NOT being treated similarly by SETI? As mentioned above, the other machine is behaving normally and downloading when needed, even though it has lots more tasks (~1600) than the one I asked about (~1200). By the limits formula you mentioned, both should have a 2000 WU limit.
ID: 1165306
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1165455 - Posted: 26 Oct 2011, 9:36:43 UTC - in response to Message 1165306.  

Richard - many thanks for the detailed answer. It seems to make sense...

One more question. Since my two machines are very similar (2 x quad core Opteron 2356 CPUs + 4 GT240s each), is there some reason why they are NOT being treated similarly by SETI? As mentioned above, the other machine is behaving normally and downloading when needed, even though it has lots more tasks (~1600) than the one I asked about (~1200). By the limits formula you mentioned, both should have a 2000 WU limit.

It's unlikely to be any difference in the treatment "by SETI". More likely to be some difference in the work request patterns at your end. If you run other BOINC projects alongside SETI, there may be differences in the debt ("work fetch priority") for different projects on the two machines, or it may simply be differences in what work was available for download at the precise second each different computer made its request. All would be revealed in your message/event log, if you have the time and inclination to pore through it - I wouldn't bother, frankly.

The good news is that they're planning to turn 'resend lost results' back on, first thing this morning (Pacific time). Since it's Jeff in charge this week, and he's an early riser, that means maybe five hours from now. We should begin to get a clearer picture once that first step is out of the way.
ID: 1165455
James Sotherden

Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1165510 - Posted: 26 Oct 2011, 14:48:05 UTC

Thanks for the info Richard. I was wondering why my GPU had times of 3 hours.

The limit for GPUs is 400 - is that for a 10 day cache? I'm running a two day cache and have 34.

Old James
ID: 1165510
LadyL
Volunteer tester

Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1165526 - Posted: 26 Oct 2011, 16:08:14 UTC - in response to Message 1165510.  

Thanks for the info Richard. I was wondering why my GPU had times of 3 hours.

The limit for GPUs is 400 - is that for a 10 day cache? I'm running a two day cache and have 34.


The limit is on 'work in progress': 50 per CPU core, 400 per GPU - that's server side, so it's unaffected by cache settings.

The high GPU estimates will be impacting the cache, because BOINC doesn't know those tasks will be done much faster than it thinks. Inserting <flops> into app_info.xml should correct the times (for a DCF of 1, iirc).

http://setiathome.berkeley.edu/forum_thread.php?id=60427&nowrap=true#1027270 has the hand-calculated method, and Geek@Play opened a thread (that I can't find right now) explaining how to use the APR value from your host details page.
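
Roughly, both methods boil down to the same arithmetic. A small Python sketch with placeholder numbers (substitute your own task times or your host's APR figure, which iirc is quoted in GFLOPS on that page):

    # Method 1: hand calculation from a representative task.
    rsc_fpops_est = 30e12       # <rsc_fpops_est> from the workunit (placeholder)
    typical_elapsed = 1260.0    # seconds the GPU really takes for such a task
    flops_hand = rsc_fpops_est / typical_elapsed   # ~2.4e10

    # Method 2: take the APR from the host's application details page.
    apr_gflops = 24.0           # placeholder - use your host's figure
    flops_apr = apr_gflops * 1e9

    # Either value goes inside the matching <app_version> block of app_info.xml:
    print(f"<flops>{flops_apr:.0f}</flops>")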
ID: 1165526
James Sotherden

Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1165531 - Posted: 26 Oct 2011, 16:46:45 UTC

Thanks LadyL. I will leave well enough alone. Who knows what might happen if I start mucking about in those secret files:)

Old James
ID: 1165531
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1165539 - Posted: 26 Oct 2011, 17:02:06 UTC - in response to Message 1165455.  

The good news is that they're planning to turn 'resend lost results' back on, first thing this morning (Pacific time). Since it's Jeff in charge this week, and he's an early riser, that means maybe five hours from now. We should begin to get a clearer picture once that first step is out of the way.

And I've got

SETI@home 26/10/2011 17:51:40 Resent lost task 27se11ab.6176.21335.11.10.89_1

That's step one, at least.
ID: 1165539
