GPU max 100wu Why?

Message boards : Number crunching : GPU max 100wu Why?

AuthorMessage
Profile TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 505
Credit: 69,523,653
RAC: 10
Sweden
Message 1979876 - Posted: 11 Feb 2019, 15:20:53 UTC

Is there any chance that the number of WUs per GPU could be increased to something like 200??
With the faster GPUs available now, hosts run out of WUs during server maintenance and outages.

I suggest an increase in the WU limit per GPU!!

//TRuEQ
TRuEQ & TuVaLu
ID: 1979876
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1979883 - Posted: 11 Feb 2019, 15:54:16 UTC

preaching to the choir unfortunately.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1979883
Profile TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 505
Credit: 69,523,653
RAC: 10
Sweden
Message 1979932 - Posted: 11 Feb 2019, 20:02:18 UTC

I know the 100 WU maximum was set some time ago because a few people were draining the servers' supply.
But that was a long time ago, and if I remember correctly it was before the servers were co-located to a different site.

Now, with much faster GPUs, better servers, and a better internet connection, it should be possible to raise the limit from 100 to 150, 200, or maybe even more.

//TRuEQ
ID: 1979932
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1979945 - Posted: 11 Feb 2019, 21:07:09 UTC
Last modified: 11 Feb 2019, 21:08:04 UTC

From what I remember, the 100 WU limit was imposed because after an outage (any outage, in fact) the database was unable to keep up with the large number of results being returned in one go.

Now that GPUs have got faster and there are many multi-GPU machines, it would probably be even worse - add in Linux and the special sauce and well...

Someone should do a few sums on how many WUs the top 100 machines process per hour, multiply that by the nominal length of an outage, and see what total number of WUs would be returned from just the top 100.
ID: 1979945
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979948 - Posted: 11 Feb 2019, 21:32:20 UTC - in response to Message 1979945.  

From what I remember, the 100 WU limit was imposed because after an outage (any outage, in fact) the database was unable to keep up with the large number of results being returned in one go.

Now that GPUs have got faster and there are many multi-GPU machines, it would probably be even worse - add in Linux and the special sauce and well...

Someone should do a few sums on how many WUs the top 100 machines process per hour, multiply that by the nominal length of an outage, and see what total number of WUs would be returned from just the top 100.

I do 17.5K tasks per day, so after a typical 6-hour Tuesday outage I need to report 4,375 tasks that I've finished during the outage. If we take just the Top 20 hosts, which would have similar numbers to report, that tallies up to 87.5K tasks.
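The arithmetic above can be sketched as a quick back-of-envelope calculation (a sketch using the figures quoted in the post; the "top 20 hosts are comparable" scaling is the post's own assumption):

```python
# Back-of-envelope estimate of the post-outage reporting backlog.
tasks_per_day = 17_500   # one top host's daily throughput (from the post)
outage_hours = 6         # typical Tuesday maintenance window

backlog_per_host = tasks_per_day * outage_hours / 24
top20_backlog = backlog_per_host * 20  # assume top 20 hosts are comparable

print(int(backlog_per_host), int(top20_backlog))  # 4375 87500
```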
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979948
Profile TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 505
Credit: 69,523,653
RAC: 10
Sweden
Message 1979949 - Posted: 11 Feb 2019, 21:34:11 UTC - in response to Message 1979945.  

From what I remember, the 100 WU limit was imposed because after an outage (any outage, in fact) the database was unable to keep up with the large number of results being returned in one go.

Now that GPUs have got faster and there are many multi-GPU machines, it would probably be even worse - add in Linux and the special sauce and well...

Someone should do a few sums on how many WUs the top 100 machines process per hour, multiply that by the nominal length of an outage, and see what total number of WUs would be returned from just the top 100.



Then maybe the restriction should be on how the results are returned instead of on sending new WUs. Maybe just 50 or 100 at a time... not all at once...

//TRuEQ
ID: 1979949
Profile TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 505
Credit: 69,523,653
RAC: 10
Sweden
Message 1979955 - Posted: 11 Feb 2019, 21:40:25 UTC
Last modified: 11 Feb 2019, 21:52:49 UTC

//TRuEQ
ID: 1979955
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1979966 - Posted: 11 Feb 2019, 22:22:38 UTC - in response to Message 1979949.  
Last modified: 11 Feb 2019, 22:23:28 UTC

the servers already restrict how many WUs they will take back at one time from each host.

if I have 5000 completed WUs ready to report, on the next communication I will tell the server I have 5000 ready, but the server will only take somewhere around 100-250 at one time. it really doesn't matter how many you have ready - the server only takes what it wants. but you can also limit how many you send back at a time via a command in the app_config file. some people have found this beneficial after a Tuesday maintenance outage, but I haven't really seen it matter too much personally.
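For stock BOINC clients, the reporting cap is usually set in cc_config.xml (the `<max_tasks_reported>` option, which limits how many completed tasks are reported per scheduler RPC); modified clients may read it elsewhere. A minimal sketch, with 100 as an example value:

```xml
<!-- cc_config.xml, placed in the BOINC data directory.
     Caps completed-task reports at 100 per scheduler request. -->
<cc_config>
  <options>
    <max_tasks_reported>100</max_tasks_reported>
  </options>
</cc_config>
```

The client re-reads cc_config.xml on restart (or on a "re-read config files" command), so the cap can be adjusted around maintenance windows without reinstalling anything.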
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1979966
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979983 - Posted: 12 Feb 2019, 1:16:39 UTC - in response to Message 1979966.  

I restrict reporting to 100 tasks at a time, since that is the highest value that connects consistently after an outage. Anything higher and the percentage of successful reports drops rapidly to zero. It takes several hours to report all the work after an outage. Also, reporting usually only succeeds if I set NNT (No New Tasks) first. So it takes several hours to report after the project comes back, and then several hours after that to even start getting new work.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1979983
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1980003 - Posted: 12 Feb 2019, 5:07:48 UTC - in response to Message 1979876.  
Last modified: 12 Feb 2019, 5:08:11 UTC

Is there any chance that the number of wu's per GPU can be increased to like 200??

Probably when the existing servers are replaced with systems that can meet that load.
I figure $500,000 would probably be the minimum needed to meet the extra demand that increasing the server-side limits would create, and $900,000+ for a bit of future-proofing.
Grant
Darwin NT
ID: 1980003
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1980024 - Posted: 12 Feb 2019, 7:23:40 UTC

I thought it wasn't a server restriction, but a database one.

There is a limit to the number of queries it can process irrespective of hardware.

I might be wrong but that is what I seem to remember.
ID: 1980024
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1980029 - Posted: 12 Feb 2019, 8:39:17 UTC

Most of the "top twenty" users don't complain - they just grit their teeth when the weekly outage comes along; it is only one or two who do complain. I very much doubt that the impact of Keith or Petri with their massive crunchers is as big as one might think, as they both have strategies in place to keep uploads and downloads to a "sensible number per cycle".
Indeed, if one looks at the top twenty users, many of them have very large numbers of computers rather than "mega crunchers", and thus might be considered "average" users.
If one takes RAC as an indication of work done per day, then the top twenty computers (not users) represent a very small drop in the ocean of demand at the end of an outage, when everyone is returning results and requesting new work - and the sheer number of users returning one or two tasks results in a very large number of "contact queries", which are among the hardest hitters on the servers. Sending out work is remarkably "query efficient", and indeed sending a bundle of ten tasks to a single computer is more efficient through the whole process than sending the same ten tasks to ten individual computers.
Splitting work off to another server dedicated to one group of users would insert another layer of query complexity into an already messy set of queries, and could well result in a reduction in overall performance, not the desired improvement.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1980029
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1980048 - Posted: 12 Feb 2019, 11:21:41 UTC - in response to Message 1980026.  

I just feel it is somehow wrong that out of 93,895 active users, just the top 20 cause inconvenience to all the others?

How are they inconveniencing others?

The fact is the project keeps asking for more processing power and more users. That will put a much greater load on things than exists now. The top 20 are just providing exactly what Seti has asked of everyone - more computing resources.
Grant
Darwin NT
ID: 1980048
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1980051 - Posted: 12 Feb 2019, 11:25:42 UTC - in response to Message 1980024.  
Last modified: 12 Feb 2019, 11:32:28 UTC

I thought it wasn't a server restriction, but a database one.

There is a limit to the number of queries it can process irrespective of hardware.

I might be wrong but that is what I seem to remember.

The problem is the load on the database, and that limit is due to the hardware it is running on. More powerful hardware would be able to meet that load - the biggest improvement would be suitable flash storage, but then things would be limited by the existing hardware & internal network speeds, so those would need upgrading too to take full advantage of the benefits all-flash storage would provide.
Whenever you remove one bottleneck in a system, it's not long till you find the next one, and the one after that, and the one after that, etc. (think of a never-ending game of whack-a-mole).
Grant
Darwin NT
ID: 1980051
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1980057 - Posted: 12 Feb 2019, 11:48:04 UTC

One thing that can easily be done, and helps a lot, is simply to run as many "shorty" tasks through as you can before maintenance.
They take half the time to run, so having your cache full of the longer-running tasks makes a big difference.

I can usually make it through a normal maintenance with 1500 GPU tasks - with rescheduling.
ID: 1980057
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1980060 - Posted: 12 Feb 2019, 12:01:11 UTC
Last modified: 12 Feb 2019, 12:15:53 UTC

Hi All

As we all know, the 100 WU limit was put in place to try to keep the DB size at a manageable level, given the software/hardware used.
That was an "emergency" solution, and at the time, with less powerful GPUs, it had little impact on heavy users. But that has changed.

The real cause of the extremely high DB size is not the "mega crunchers", as was suggested in this thread.

Why? I will try to explain. Mega crunchers return their work in around 1 day, so in theory their work could be purged very quickly without affecting the size of the DB. But the problem is the way BOINC works: it needs a wingmate to validate the results, and there is the problem. Normally the wingmates of the "mega crunchers" are "common users" who take weeks to send back their work, some an entire month, and some even time out. The DB needs to keep track of all that for the entire period, and that makes its size grow bigger and bigger.

So the problem with the size of the DB can't be attributed to the top 20 "mega cruncher" hosts. Someone could say I'm defending myself - I respect their opinions - but no, I'm just explaining the problem from a different POV.

There is no "easy" solution to that problem: DB size vs. WU cache size.
More powerful servers, a different DB engine, more... more & more... But all of that costs a lot more...

IMHO the right thing to do, without spending $500K or more as suggested, is to limit the cache size of every host (mega or common) based on its turnaround time, and to set the WU timeout to a lower value. So a "mega cruncher" that does 5K WUs per day could receive 5K WUs, and a "slower user" who does 2 WUs per day receives 2. Not exactly, but something like what GPUGrid does, by only sending a new WU when the host reports the old one crunched. That would keep everyone "happy" and help them get through the outages (which are very frequent these days, BTW) without problems.

Some could say nobody promised 24/7 work, and that is right, but why does a host that crunches 2-3 WUs a day need a 100 WU cache? And why do we need these extremely long timeouts - more than a month to time out a WU, in days when even a cell phone can crunch one in a few days? Or a 10-day cache from the old dial-up modem era?
Just simple resource management could make all the users "happy".
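The per-host scheme described above could be sketched like this - a hypothetical sizing rule for illustration, not anything the SETI servers actually implement; the function name, floor, and ceiling are all made up:

```python
def cache_limit(avg_tasks_per_day: float, outage_hours: float = 6,
                floor: int = 10, ceiling: int = 5000) -> int:
    """Size a host's cache to cover a typical outage, based on its
    measured daily throughput (hypothetical scheme, not BOINC's)."""
    need = avg_tasks_per_day * outage_hours / 24
    return max(floor, min(ceiling, round(need)))

print(cache_limit(17500))  # fast host gets 4375
print(cache_limit(3))      # slow host gets the floor of 10
```

The floor keeps slow hosts from starving between contacts, and the ceiling bounds the database exposure from any single host.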

As newer GPUs and high-core-count CPUs spread through the Setiverse, the problem with the 100 limit will grow proportionally.
You no longer need a "mega cruncher" to have a problem with the 100 WU limit.
It's simple math: a 2080 Ti can crunch a WU in less than 30 seconds, so a 100 WU cache lasts well under an hour. And it's not only the GPUs - a new generation of CPUs has 40 or more cores, and in their case 100 WUs will last only a few hours.
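That drain-time arithmetic checks out in a couple of lines (a sketch; the 30-second figure is the post's rough estimate for a 2080 Ti, so treat the result as an upper bound):

```python
# How long a 100 WU cache lasts at a given per-WU crunch time.
cache_size = 100
seconds_per_wu = 30  # the post's rough estimate for a 2080 Ti

drain_minutes = cache_size * seconds_per_wu / 60
print(drain_minutes)  # 50.0 minutes - nowhere near a 6-hour outage
```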

There are workarounds, like rescheduling, yes, but they are not for common users. These workarounds need "special" users who know what they are doing, to avoid adding even more stress to the DB.

My 0.02
ID: 1980060
Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 24877
Credit: 3,081,182
RAC: 7
Ireland
Message 1980065 - Posted: 12 Feb 2019, 12:15:31 UTC - in response to Message 1980060.  

There are workarounds, like rescheduling, yes, but they are not for common users. These workarounds need "special" users who know what they are doing, to avoid adding even more stress to the DB.
That's the most asinine piece of bovine excrement I've ever read!
ID: 1980065
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1980070 - Posted: 12 Feb 2019, 12:30:02 UTC

You don't need high-end GPUs to run out of work.
My lower-end consumer 1050Ti runs out during most outages.
ID: 1980070
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1980073 - Posted: 12 Feb 2019, 12:36:46 UTC - in response to Message 1980070.  

You don't need high-end GPUs to run out of work.
My lower end consumer 1050Ti runs out on most outages.

If you compare the 1050Ti (especially running the Linux Special Sauce builds) with the GPUs we had when the 100 WU limit was introduced, it is a powerful GPU. Back then, the available GPUs crunched a WU in about 15 minutes to half an hour. LOL
Basically any new GPU-based host has the same problem. That's why I said the problem with the 100 WU limit will grow very fast as more and more users add newer hardware.
ID: 1980073
Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 24877
Credit: 3,081,182
RAC: 7
Ireland
Message 1980075 - Posted: 12 Feb 2019, 12:53:17 UTC - in response to Message 1980073.  

Basically any new GPU-based host has the same problem. That's why I said the problem with the 100 WU limit will grow very fast as more and more users add newer hardware.
So you propose a "special elite group" of crunchers, to differentiate them from the "commoners"?
Also, when was the last time you saw a single-processor computer crunching 2-3 WUs a day?
ID: 1980075

 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.