Message boards :
Number crunching :
Please rise the limits... just a little...
bill · Joined: 16 Jun 99 · Posts: 861 · Credit: 29,352,955 · RAC: 0
> I talk about a "little increase" to avoid any problems; a change from 100 WU per GPU host to 100 WU per GPU surely will not crash the DB, and it will keep our GPUs working.

Yes, but has anybody computed the numbers to see whether even "a little increase" will cause the project to crash? Without actual numbers I don't see the limits rising any time soon, if ever. To paraphrase Mr. Spock: 'the needs of the project outweigh the needs of the one, or the few'.
Mike · Joined: 17 Feb 01 · Posts: 34249 · Credit: 79,922,639 · RAC: 80
I totally agree, Bill. Under current conditions I don't think it will change; it's up to the staff to decide. I'd also like to store 500 APs again, but it is what it is.

With each crime and every kindness we birth our future.
Bernie Vine · Joined: 26 May 99 · Posts: 9954 · Credit: 103,452,613 · RAC: 328
There are 200 potential CPUs in the top 20 machines; currently they are allowed 100 WUs each: 20 x 100 = 2,000. Increasing the limit to 100 per CPU would mean just those 20 machines could try to download an extra 18,000 WUs. How many other multi-CPU rigs are out there? Each quad would need an extra 300, each eight-core an extra 700, and so on. Even my little farm of 6 machines would be able to download an extra 1,000. As Bill says, without the numbers it could be a disaster.
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
Sorry, double post.
bill · Joined: 16 Jun 99 · Posts: 861 · Credit: 29,352,955 · RAC: 0
> There are 200 potential CPUs in the top 20 machines; currently they are allowed 100 WUs each: 20 x 100 = 2,000.

BOINCstats says there are 143,000+ active crunchers. Just running some rough numbers through my head, that's an increase of 143,000 x 100 WU = 14,300,000 WUs. Nothing trivial.
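The worst-case arithmetic in the thread so far can be sanity-checked with a short script. Note these are the thread's own assumptions, not measured demand: the hypothetical worst case supposes every active host fills the full extra allowance at once.

```python
def extra_wus(hosts: int, extra_per_host: int) -> int:
    """Extra work units requested if every host fills the new limit."""
    return hosts * extra_per_host

# Top 20 machines (200 CPUs total), limit raised from 100/host to 100/CPU:
top20_now = 20 * 100      # 2,000 WUs cached under the current per-host limit
top20_new = 200 * 100     # 20,000 WUs cached under a per-CPU limit
print(top20_new - top20_now)            # 18000 extra, as Bernie computed

# BOINCstats' ~143,000 active crunchers, absolute worst case:
print(extra_wus(143_000, 100))          # 14300000 extra WUs
```

The gap between 18,000 and 14.3 million shows why the two sides of the argument differ so much: everything hinges on how many hosts would actually draw the extra work.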
Mike · Joined: 17 Feb 01 · Posts: 34249 · Credit: 79,922,639 · RAC: 80
> There are 200 potential CPUs in the top 20 machines; currently they are allowed 100 WUs each: 20 x 100 = 2,000.

Nah. Not nearly everyone downloads 100 a day. Some only 100 a week, or a month.

With each crime and every kindness we birth our future.
bill · Joined: 16 Jun 99 · Posts: 861 · Credit: 29,352,955 · RAC: 0
> There are 200 potential CPUs in the top 20 machines; currently they are allowed 100 WUs each: 20 x 100 = 2,000.

And some download hundreds or more a day.
Mike · Joined: 17 Feb 01 · Posts: 34249 · Credit: 79,922,639 · RAC: 80
But that's not the majority.

With each crime and every kindness we birth our future.
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
> There are 200 potential CPUs in the top 20 machines; currently they are allowed 100 WUs each: 20 x 100 = 2,000.

There is a mistake: I am talking about 100 WU per GPU, not per CPU or core (100 WU per CPU gives enough work to keep crunching through the scheduled outages), so the number is a lot smaller. Let's try to imagine it: if those same top 20 machines have 2.5 GPUs each on average, that is only about 3,000 extra WUs. Now imagine the top 100 hosts (which will include most of the 2- and 3-GPU hosts) with a mean of 2 GPUs per host (surely it is less); that gives about 10K extra WUs, a little compared with the total number of WUs the DB actually handles. So I can't see the "possible disaster" in doing that. But maybe I'm wrong.
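Juan's per-GPU estimate can be sketched the same way. The GPU counts per host (2.5 and 2) are his assumed means, not survey data: moving the limit from 100 WU per host to 100 WU per GPU only adds work for hosts with more than one GPU.

```python
def extra_per_gpu_limit(hosts: int, mean_gpus: float, per_gpu: int = 100) -> int:
    """Extra cached WUs when the limit moves from per-host to per-GPU.

    Each host already caches `per_gpu` WUs under the old per-host limit,
    so the increase is per_gpu * (gpus - 1) per host, on average.
    """
    return round(hosts * (mean_gpus - 1) * per_gpu)

# Top 20 hosts at an assumed mean of 2.5 GPUs each:
print(extra_per_gpu_limit(20, 2.5))    # 3000 extra WUs
# Top 100 hosts at an assumed mean of 2 GPUs each:
print(extra_per_gpu_limit(100, 2.0))   # 10000 extra WUs
```

Under these assumptions the increase is three orders of magnitude below the all-hosts worst case, which is the crux of the disagreement in the thread.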
Bernie Vine · Joined: 26 May 99 · Posts: 9954 · Credit: 103,452,613 · RAC: 328
> There are 200 potential CPUs in the top 20 machines; currently they are allowed 100 WUs each: 20 x 100 = 2,000.

So the top 20 machines would possibly ask for an EXTRA 10,000 WUs per day. And we have no idea how many GPU hosts are out there: if there are only 500 dual-GPU machines, that would be 50,000 extra requests all hitting the servers and database on the day the limits were raised!

I know that as an average cruncher who runs a few slower, single-GPU machines I am not considered as important as the multi-GPU monsters; however, since I put the TCP fix in, SETI@Home has been running better than it has for a long time. I would like to keep it that way.

The project has years of results that have not been analysed. I know that is supposed to change, but it is not a race to crunch as much and as fast as you can. I suspect SETI@Home currently has more data than it will be able to handle in my lifetime. There is no rush!
bill · Joined: 16 Jun 99 · Posts: 861 · Credit: 29,352,955 · RAC: 0
Until someone comes up with hard numbers, it's all assumption anyway. No numbers, no raise in limits.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14644 · Credit: 200,643,578 · RAC: 874
About six weeks ago, Einstein wrote:

> It seems that the large number of tasks (3.7M) is seriously stressing our databases (not only the replica).

Here, we currently have:

3,379,437 MB out in the field
  170,122 AP out in the field
2,585,161 MB results returned
  189,666 AP results returned
---------
6,324,386 tasks in database

- some 70% higher than Einstein's 'serious stress' level.

If the limits were raised, there would be a one-off transitional spike as the fast and/or high-cache hosts transitioned to the new maximum level. There would be frantic splitter and download activity for a couple of days while everybody filled their boots: no problem, we've survived worse than that before. Then we'd settle down to a new steady state. The pipe would stay full. Tasks would be allocated on a 'return one, get one back' basis, as now. The same amount of work would be done. There would be two differences:

1) The database would be fuller - more bloat, more stress, less speed.
2) The upload pipe would be fuller - the same number of scheduler requests, but each file would be bigger.

I can't see any benefit (for the project, that is). And since Matt is in the process of getting everything clean, lean and ship-shape in preparation for moving the servers out of their air-conditioned home of the last 14 years and (down the hill?) to their new co-lo home, is now the time to stuff everything up to and beyond the limit? [Who would want to be driving that buggy? Really?]
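Richard's tally, and the comparison with the figure Einstein reported, can be checked explicitly (the ~3.7M "serious stress" threshold is the number quoted above, rounded):

```python
# Task counts quoted from the server status, tallied explicitly.
tasks = {
    "MB out in the field": 3_379_437,
    "AP out in the field":   170_122,
    "MB results returned": 2_585_161,
    "AP results returned":   189_666,
}
total = sum(tasks.values())
print(total)                       # 6324386 tasks in the database

# Compare with the ~3.7M tasks Einstein flagged as "seriously stressing":
einstein_stress = 3_700_000
print(f"{total / einstein_stress - 1:.0%} above Einstein's level")  # 71% above Einstein's level
```

So "some 70% higher" is accurate, and that is before any limit increase adds to the in-progress count.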
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
If they are already preparing to move, I agree with you: it is safer to keep everything the way it is working now. Let's wait and see what is coming.
ExchangeMan · Joined: 9 Jan 00 · Posts: 115 · Credit: 157,719,104 · RAC: 0
I would be with everyone who wants to raise the limit. As a compromise, could we cut the CPU cache from 100 to 50 and raise the GPU cache to 150? For big GPU crunchers that would work well; there is a great imbalance between CPU crunch power and GPU crunch power.

I've been getting shorties all day (as I'm sure everyone else has). If all I have are shorties in my cache, my big cruncher will burn through 100 in 15 to 20 minutes. I know, when you process at that rate 50 more isn't worth much. Now that the download transfer rate problems appear, temporarily at least, to be solved for some of us Windows users, we don't have to fight that problem so much.

I don't know about any of you, but I get times when the 5-minute cycle gives me no GPU work units (saying none are available) even though I may be reporting 30 completed tasks. It makes me really nervous when several cycles in a row do this. At least a somewhat larger cache increases the chances of riding this 'phenomenon' out without the GPUs running dry.

Oh well, just my 2 cents. Carry on.
kittyman · Joined: 9 Jul 00 · Posts: 51468 · Credit: 1,018,363,574 · RAC: 1,004
> I would be with everyone who wants to raise the limit. As a compromise, could we cut the CPU cache from 100 to 50 and raise the GPU cache to 150? [...] If all I have are shorties in my cache, my big cruncher will burn through 100 in 15 to 20 minutes.

Same here... 100 WUs just don't last very long on a multi-GPU cruncher, especially when they are mostly of the shorty variety. Across 9 rigs, my cache has been floating between around 1,500 and 1,700, so not all work requests are being filled. And of course, several of my rigs won't make it through a Tuesday outage with only 100 GPU WUs.

"Freedom is just Chaos, with better lighting." Alan Dean Foster
HAL9000 · Joined: 11 Sep 99 · Posts: 6534 · Credit: 196,805,888 · RAC: 57
BOINCStats reports that SETI@Home has 420,541 active hosts. The next project down in active hosts is Einstein@Home with 262,645. So I would say that if Einstein is having issues, our servers must be held together with wizardry.

SETI@home classic workunits: 93,865 · CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Sakletare · Joined: 18 May 99 · Posts: 132 · Credit: 23,423,829 · RAC: 0
> our servers must be held together with wizardry.

Do not take Matt for some conjuror of cheap tricks! And I wouldn't be surprised if Matt turns out to be the eighth son of an eighth son. There, that's two nerdy references for you. ;)
Keith White · Joined: 29 May 99 · Posts: 392 · Credit: 13,035,233 · RAC: 22
I believe the problem comes not from the number of WUs any one super-cruncher can do in a day, but from how many of those persist in the database waiting for their mate from a wingman with a much slower system.

The top host can, on average, crunch one GPU-assigned unit in under a minute (I've seen it as low as 35 seconds), thanks to the number of GPUs and to multitasking multiple units on each GPU. In a day we are talking 1,400-2,500 units. My very low-end GPU can do one in a little less than an hour; it takes around 3.5 days to process 100 GPU units. Right now, 15-25% of them per day are still waiting for their wingman when they are reported. Just imagine the percentage for a super-cruncher.

It's the validation-pending results, as well as the assigned in-progress queue, that clog up database lookups for super-crunchers. I have roughly 25% of an average day's worth of work that has been pending for more than 30 days. What's the rate for someone who can crunch several hundred, if not a thousand, in a day? How many persist for weeks, filling up the database and slowing lookup times?

It's not that the super-crunchers are the problem; they are just the ever-accelerating conveyor belt of bonbons that Lucy simply can't box fast enough.

"Life is just nature's way of keeping meat fresh." - The Doctor
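The throughput figures above follow directly from the per-task times. A quick sketch, using the thread's own numbers (35-60 seconds per task for the top host, and roughly 50 minutes per task as an assumed figure for "a little less than an hour" on a low-end GPU):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def tasks_per_day(seconds_per_task: float) -> int:
    """Tasks a device can finish per day at a given per-task crunch time."""
    return int(SECONDS_PER_DAY // seconds_per_task)

# Top host, running one GPU task every 35-60 seconds overall:
print(tasks_per_day(60))   # 1440 tasks/day (slow end)
print(tasks_per_day(35))   # 2468 tasks/day (best case)

# Low-end GPU at ~50 min (3,000 s) per task: days to clear a 100-WU cache.
print(round(100 * 3000 / SECONDS_PER_DAY, 2))   # 3.47 days
```

This matches the 1,400-2,500 units/day and ~3.5 days quoted above, and illustrates why a fixed 100-WU cap means very different things at the two ends of the hardware range.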
tbret · Joined: 28 May 99 · Posts: 3380 · Credit: 296,162,071 · RAC: 40
> I believe the problem comes not from the number of WUs any one super-cruncher can do in a day, but from how many of those persist in the database waiting for their mate from a wingman with a much slower system.

A happier situation might be for the cache limits to go from 100 work units to 0.5- or 1-day caches, and to shorten the timeout limits. That would be a "validate or perish" situation for the database. Not to mention that it would cut the number of database entries for everyone who does fewer than 100 work units in a day (and for CPU work units, of which it would be difficult to do 100 in a day).

More connections? I doubt it. Look at the production of the computers in the 960th-1000th places in the "top computers" list. There are a LOT of computers making a LOT of connections to rebuild a 100-work-unit cache.

I *really* hope Matt is making headway in getting our super-duper servers into a building with a super-duper connection to the outside world.
rob smith · Joined: 7 Mar 03 · Posts: 22149 · Credit: 416,307,556 · RAC: 380
S@H uses multiple servers, so the load on the upload servers does not affect that on the download servers. The upload servers feed into the validators, which aren't affected by the poor performance of the download servers - S@H has always had a large pool of data awaiting validation, and appears to manage that side of things quite well. The download servers, on the other hand, are struggling to cope with demand.

It's bound to be not a simple problem (apart from the lack of bandwidth) but a deep-rooted one that is taking a lot of effort to isolate and resolve. Contributors I can think of include the retry/back-off behaviour of clients, the imbalance between the two download servers (one is massively faster than the other), the "auto-sync" between the various time-outs and hand-overs (they all appear to be based on 5 minutes), and so on. Not to mention that there is a large number of different versions of BOINC out there, all with subtly different "approaches to the world". And finally there are the abusers, sorry, users...

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.