More Bandwidth = Raised (or Dropped) Limits?



Message boards : Number crunching : More Bandwidth = Raised (or Dropped) Limits?

jravin
Joined: 25 Mar 02
Posts: 991
Credit: 106,216,788
RAC: 84,972
United States
Message 1354648 - Posted: 8 Apr 2013, 13:43:20 UTC

Is this in the near future?

Now that the bandwidth problem seems to have gone away (good job on a nice smooth transition - all thanks to the guys in Berkeley!!!), will they think about going back to (at least) the old per-CPU-core and per-GPU limits, rather than the current 100/200 per-machine limits?

Pretty please!

Let us pray!!!

Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 7188
Credit: 29,058,826
RAC: 33,035
United Kingdom
Message 1354655 - Posted: 8 Apr 2013, 13:50:11 UTC - in response to Message 1354648.

I understood that the limits were imposed due to database problems, nothing to do with bandwidth.
____________


Today is life, the only life we're sure of. Make the most of today.

Terror Australis
Volunteer tester
Joined: 14 Feb 04
Posts: 1759
Credit: 206,758,761
RAC: 19,118
Australia
Message 1354656 - Posted: 8 Apr 2013, 13:51:45 UTC - in response to Message 1354648.

Is this in the near future?

Now that the bandwidth problem seems to have gone away (good job on a nice smooth transition - all thanks to the guys in Berkeley!!!), will they think about going back to (at least) the old per-CPU-core and per-GPU limits, rather than the current 100/200 per-machine limits?

Pretty please!

Let us pray!!!

Unfortunately, the reason for the current limits is that the science database software has reached the limits of its capacity to handle transactions, not network bandwidth.

Because of this, I can't see the limits being raised or removed in the near future.

T.A.

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8808
Credit: 53,429,012
RAC: 43,894
United Kingdom
Message 1354665 - Posted: 8 Apr 2013, 14:29:34 UTC - in response to Message 1354655.

I understood that the limits were imposed due to database problems, nothing to do with bandwidth.

I'd go further, and say that the smooth-flowing bandwidth and more robust infrastructure (power supply, cooling, remote access, staff available to deal with lockups and hardware failures) should - although it's early days yet - do away with whatever few justifications there were for holding large caches in the first place.

jravin
Joined: 25 Mar 02
Posts: 991
Credit: 106,216,788
RAC: 84,972
United States
Message 1354672 - Posted: 8 Apr 2013, 14:50:54 UTC - in response to Message 1354665.
Last modified: 8 Apr 2013, 14:51:58 UTC

I understood that the limits were imposed due to database problems, nothing to do with bandwidth.

I'd go further, and say that the smooth-flowing bandwidth and more robust infrastructure (power supply, cooling, remote access, staff available to deal with lockups and hardware failures) should - although it's early days yet - do away with whatever few justifications there were for holding large caches in the first place.


That's a good point, and one I hadn't thought of.

But maybe, if there is a planned outage again, they could allow us to load up in such circumstances? To bridge the anticipated gap, I mean.

Dr Grey
Joined: 27 May 99
Posts: 66
Credit: 33,648,807
RAC: 20,868
United Kingdom
Message 1354718 - Posted: 8 Apr 2013, 16:41:47 UTC

Now that the average turnaround time is under 36 hours, would it make sense to shorten the deadline for returning workunits? What is it now, 8 weeks? That means some of those workunits sit in the database for many months, just politely waiting for a quorum. E.g., is this one ever coming in?

http://setiathome.berkeley.edu/workunit.php?wuid=1168264448

Would halving the deadline mean that the client cache size could be doubled with little effect on the size of the database? Does it work that way?
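
A back-of-envelope way to frame that question (all numbers invented for illustration): by Little's Law, rows in the in-flight result table ≈ issue rate × average time a result stays out, and the deadline caps that residence time for the slowest hosts. A minimal Python sketch:

    # Toy model: how the deadline drives the size of the in-flight
    # result table. Every figure here is invented for illustration.

    def results_in_flight(issue_rate_per_day, frac_fast, fast_days, deadline_days):
        """Little's Law: rows ~= arrival rate * average residence time.
        Stragglers sit in the database until the deadline expires."""
        avg_residence = frac_fast * fast_days + (1 - frac_fast) * deadline_days
        return issue_rate_per_day * avg_residence

    # 600k results/day, 90% back in ~1.5 days, the rest timing out:
    print(results_in_flight(600_000, 0.90, 1.5, 56))  # ~8-week deadline: ~4.2M rows
    print(results_in_flight(600_000, 0.90, 1.5, 28))  # halved deadline:  ~2.5M rows

In this toy model, halving the deadline frees roughly 40% of the rows, which is the headroom the question is asking about; the real distribution of turnaround times decides the actual number.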

tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2905
Credit: 218,678,821
RAC: 14,944
United States
Message 1354727 - Posted: 8 Apr 2013, 17:13:11 UTC - in response to Message 1354665.

I understood that the limits were imposed due to database problems, nothing to do with bandwidth.

I'd go further, and say that the smooth-flowing bandwidth and more robust infrastructure (power supply, cooling, remote access, staff available to deal with lockups and hardware failures) should - although it's early days yet - do away with whatever few justifications there were for holding large caches in the first place.


Hi Richard,

The database size shouldn't be an issue if they just dropped the "number of days" down to two or so, from the standard ten. If we were bumping the limit at ten days, then allowing two days should significantly cut the number of database entries.

It's also "smarter" since there are machines out there that will not do 100wu in two days, so they have too much cache while some of us have not-enough. That's probably why the choice in BOINC was made for "days" instead of "units" in the first place.

And halving the number of CPU units allowed to 50 (or even lower), then giving us at least a 150-WU (or larger) cache for GPUs, could be done without changing the number of database entries. (Since a GPU does work 10-20x faster than a CPU, it would make sense to compensate for that a little.)
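
As a rough illustration of the mismatch (throughput figures invented), a days-based cap scales the queue to each machine's speed, while a flat task cap hands slow machines days of work and fast machines only hours:

    # Toy comparison of a days-based cap vs the flat per-host task cap.
    # Tasks-per-day figures are invented for illustration.

    hosts = {"slow CPU box": 20, "fast GPU rig": 800}  # tasks per day

    for name, per_day in hosts.items():
        print(f"{name}: 2-day cap = {2 * per_day} tasks; "
              f"flat 100-task cap = {100 / per_day:.1f} days of work")

    # slow CPU box: 2-day cap = 40 tasks;   flat 100-task cap = 5.0 days of work
    # fast GPU rig: 2-day cap = 1600 tasks; flat 100-task cap = 0.1 days of work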

Why would 100 WUs not be enough for GPUs?

Because there is still the matter of "no work available" messages and the new, longer delays between requests.

Some of us could use a little more buffer against running out of work.

I'm not complaining; I'm just observing. Since the move, a couple of my machines have run dry (some of them more than once) due to "no work available" messages. No big deal, but I'm guessing that some less ham-handed limits could eliminate that problem.

What I don't know is if there is something else, like a set proportion of CPU to GPU work allowed that's been programmed into the scheduler. I admit that it's likely I'm showing my ignorance.

Sakletare
Joined: 18 May 99
Posts: 131
Credit: 21,071,756
RAC: 2,494
Sweden
Message 1354737 - Posted: 8 Apr 2013, 17:57:52 UTC - in response to Message 1354727.

In my opinion, the best way to reduce the size of the database is for the scheduler to match fast hosts with other fast hosts.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5942
Credit: 62,333,274
RAC: 37,417
Australia
Message 1354746 - Posted: 8 Apr 2013, 18:15:51 UTC - in response to Message 1354727.

The database size shouldn't be an issue if they just dropped the "number of days" down to two or so, from the standard ten. If we were bumping the limit at ten days, then allowing two days should significantly cut the number of database entries.

That's been my thought pretty much since the server-side limits came into place. That way the faster crunchers will be able to stay busy even through the usual weekly outage, and the size of the database will remain small.

____________
Grant
Darwin NT.

Tom* (Project donor)
Joined: 12 Aug 11
Posts: 114
Credit: 5,244,774
RAC: 29,911
United States
Message 1354756 - Posted: 8 Apr 2013, 18:28:45 UTC - in response to Message 1354737.
Last modified: 8 Apr 2013, 18:55:47 UTC

BUMP +3

No need to try to match at the system level; just have the feeder feed two queues: a fast queue and a slow queue.
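
A minimal sketch of that feeder idea (the threshold, names, and queue shapes are all hypothetical; the real BOINC feeder works differently):

    # Illustrative two-queue feeder: route work by host turnaround speed
    # so quorum partners tend to return at similar times, shrinking the
    # average life of a row in the database. Entirely hypothetical code.
    from collections import deque

    FAST_TURNAROUND_DAYS = 2.0

    fast_queue = deque()  # results reserved for quick-returning hosts
    slow_queue = deque()  # everything else

    def enqueue(result_id, into_fast):
        (fast_queue if into_fast else slow_queue).append(result_id)

    def next_result_for(host_avg_turnaround_days):
        """Fast hosts draw from the fast queue first; fall back otherwise."""
        if host_avg_turnaround_days <= FAST_TURNAROUND_DAYS and fast_queue:
            return fast_queue.popleft()
        if slow_queue:
            return slow_queue.popleft()
        return fast_queue.popleft() if fast_queue else None

    enqueue("wu_001_r0", into_fast=True)
    enqueue("wu_001_r1", into_fast=True)  # both replications go to fast hosts
    print(next_result_for(host_avg_turnaround_days=0.5))  # -> wu_001_r0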

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4663
Credit: 123,470,271
RAC: 99,424
United States
Message 1354788 - Posted: 8 Apr 2013, 20:18:59 UTC

I think the larger work units, once introduced, will be the solution for longer caches, which will probably also reduce the db load to the point that a limit higher than 100 can easily be chosen.

Hopefully, once they feel that everything is happy in its new home, they can get back to work on that, or maybe on some of the other issues they haven't had time to work on.
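
The arithmetic behind the larger-workunit idea is simple (figures invented): at a fixed number of cached hours per host, N-times-longer tasks mean N-times-fewer rows in the database.

    # Invented figures: longer tasks cut database rows at fixed cache hours.
    cache_hours = 48
    for task_hours in (0.5, 2.0, 8.0):
        print(f"{task_hours}h tasks -> {int(cache_hours / task_hours)} rows per host")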
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Alaun
Joined: 29 Nov 05
Posts: 17
Credit: 5,272,480
RAC: 737
United States
Message 1354814 - Posted: 8 Apr 2013, 21:25:36 UTC

+1 on the larger workunits. That's the way to get more work done per database entry: fewer files to handle, fewer server requests, and fewer load/unload breaks on GPUs. Hopefully a new workunit would keep a GPU loaded for many hours.

William (Project donor)
Volunteer tester
Joined: 14 Feb 13
Posts: 1610
Credit: 9,470,168
RAC: 16
Message 1354984 - Posted: 9 Apr 2013, 11:54:22 UTC - in response to Message 1354727.


The database size issue shouldn't be an issue if they just dropped the "number of days" down to two or so, from the standard ten. If we were bumping the limit at ten days, then allowing two days should significantly cut the number of database entries.

It doesn't work that way.
The 'number of days' is a BOINC client cap; the limit on tasks in progress is a server cap. 'They' have no influence over the amount of work the client requests; AFAIK, they can only limit 'tasks in progress'.
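
For reference, a minimal sketch of the server-side cap William describes (BOINC's project configuration exposes per-host in-progress ceilings along the lines of max_wus_in_progress, though exact names and semantics vary by server version; this is not the real scheduler code):

    # Sketch of a per-host "tasks in progress" ceiling. The client may
    # request 10 days of work; the server fills only up to its own cap.
    MAX_IN_PROGRESS_CPU = 100
    MAX_IN_PROGRESS_GPU = 200

    def tasks_to_send(requested, in_progress, for_gpu):
        cap = MAX_IN_PROGRESS_GPU if for_gpu else MAX_IN_PROGRESS_CPU
        return max(0, min(requested, cap - in_progress))

    # A host asking for 500 CPU tasks with 80 already out gets 20:
    print(tasks_to_send(requested=500, in_progress=80, for_gpu=False))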
____________
A person who won't read has no advantage over one who can't read. (Mark Twain)

tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2905
Credit: 218,678,821
RAC: 14,944
United States
Message 1355040 - Posted: 9 Apr 2013, 15:18:01 UTC - in response to Message 1354984.


The database size shouldn't be an issue if they just dropped the "number of days" down to two or so, from the standard ten. If we were bumping the limit at ten days, then allowing two days should significantly cut the number of database entries.

It doesn't work that way.
The 'number of days' is a BOINC client cap; the limit on tasks in progress is a server cap. 'They' have no influence over the amount of work the client requests; AFAIK, they can only limit 'tasks in progress'.


That does complicate things. I was thinking that each project could customize the allowed values for that field, so that BOINC behaved appropriately for projects with differing needs.

BOINC estimates how much work a computer can do on a given project's tasks in order to know how many work units "10 days" of work for that project is. There must be some way of "tricking" it, or setting it, so that it calculates SETI tasks as taking five times as long as they actually do.

Either that, or set the servers to serve only 20% of what BOINC calls for, but then I guess BOINC would keep calling...

I suppose I really have no idea how to accomplish setting a two-day limit, so I should shut up.
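
For what it's worth, the quantity the client budgets against is each task's estimated floating-point count (rsc_fpops_est in the workunit), so inflating that estimate server-side would make "10 days" buy proportionally fewer tasks. A toy illustration with invented numbers:

    # Toy model: the client fills its cache by estimated runtime, so a
    # 5x-inflated rsc_fpops_est yields ~1/5 as many tasks requested.
    # All figures invented.
    cache_days = 10
    host_flops = 50e9    # effective speed the client assumes, invented
    fpops_est  = 30e12   # per-task floating-point estimate, invented

    def tasks_requested(inflation=1.0):
        est_runtime_s = fpops_est * inflation / host_flops
        return int(cache_days * 86_400 / est_runtime_s)

    print(tasks_requested())     # honest estimate -> 1440 tasks
    print(tasks_requested(5.0))  # 5x inflation    -> 288 tasks

The obvious side effect is that every runtime estimate and progress readout on the client would be off by the same factor.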

So, the next-best ham-handed way of doing things would be to limit CPU work units per machine to 25 or 50 (whatever approximates a day's work) and raise the GPU limits by 50 or 75.

... think, think, think, bret....

Ok, ten days of work was too much, too big of a database... check.

We can't lower the number of days... check.

Ten days of work was thousands of tasks... check.

We're currently limited to 100 tasks ... check.

Raising the limits to 250 tasks should result in a database much smaller than the thousands of tasks that were formerly cached... check.

Conclusion: lower the hard number of CPU tasks, raise the freaking limits by some number of GPU tasks, and see what happens. Slower machines will still be limited to 10 days, and faster machines might stay stocked with work; if it doesn't behave, lower them again... check.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4663
Credit: 123,470,271
RAC: 99,424
United States
Message 1355062 - Posted: 9 Apr 2013, 15:59:13 UTC - in response to Message 1355040.


So, the next-best ham-handed way of doing things would be to limit CPU work units per machine to 25 or 50 (whatever approximates a day's work) and raise the GPU limits by 50 or 75.

[snip]

Conclusion: lower the hard number of CPU tasks, raise the freaking limits by some number of GPU tasks, and see what happens. Slower machines will still be limited to 10 days, and faster machines might stay stocked with work; if it doesn't behave, lower them again... check.


Why would you ask for my 24-core box to go from about a 7-hour cache with 100 tasks to even less? That is just plain rude, sir! :P

If they were to adjust the limits again, I would like something along the lines of how they were doing it last time, which was per CPU core / GPU processor. IIRC they were using 50 for CPU and 100 for GPU. I imagine they didn't use those values again because they knew they wouldn't do the job.

As they seem to have control over the CPU and GPU limits independently, they might consider bumping the GPU limit by 50 or so to see how the db takes it. If not, they would need to back it down again.

Ideally, having the controls on the back end to consistently keep the db under control is the best answer. Larger jobs are another way, and probably easier to accomplish.

I think PrimeGrid also uses a limit of 100 in-progress tasks. However, some of their tasks can run 30+ hours on machines that do SETI@home work in ~2 hours.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!
