Should BOINC/Seti be stratified?
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
We all know this, but the attached image derived from boincstats graphically argues a point. The top 1% of active users contribute 30% of the average credit each day. There are about 100K active users. What is more, the top handful of users produce 1-2 orders of magnitude more credit each day than the remaining top 1%. To be extensible, isn't it about time to stratify the project? That is, give the eager beavers with Borg-like compute rigs more and/or longer jobs; and separate out the rest of us slow pokes to reduce contention and latency issues? Learning how to do this wouldn't be a bad development for Boinc and will make the system more adaptable to a wider range of compute node technologies and larger numbers of interested users. Quick. Somebody write a grant! https://tinyurl.com/y8d8bnu6 |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319 |
https://tinyurl.com/y8d8bnu6 The file at this link is named SETI Production.bmp, but it is actually a JPEG, so it may have to be renamed to .JPG/.JPEG to open. |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Actually, it is a .bmp. Here is a link to the .jpg: https://tinyurl.com/yb46f4cf |
rob smith Send message Joined: 7 Mar 03 Posts: 22200 Credit: 416,307,556 RAC: 380 |
Stratification, while sounding simple to do, would mean a lot more server activity to produce the two different types of task, cross-validation between the two sets of results, and increased user-management activity, all of which would almost certainly not yield the reduction in server stress that you would expect. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
I agree. The instant calculations that the scheduler has to do already when processing a work request are complex enough. Adding more to the mix would stand a good chance of bringing a downside with it. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
rob smith Send message Joined: 7 Mar 03 Posts: 22200 Credit: 416,307,556 RAC: 380 |
And just have a look at the "panic" thread for the current "issues" some are having with the scheduler not doing what they want it to, and Richard's numerous posts describing his digging into just one small corner of the scheduler code.... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
Exactly. It's a bit of code that has to work instantly on the fly for work to flow properly. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
I understand those points. Yet I wonder how other projects handle, or plan to handle, the variation and improvement in client hardware. By way of a suggestion: to reduce any network-connection bottleneck, the standard WUs could be bundled into packets of 3 or more, the actual number to be determined by trials, or made dynamic based on the requesting host's apparent speed. The client software would first split a bundle locally into its standard WUs, then work on the individual ones as normal, and finally return the individual units as usual; i.e. no re-bundling. Of course the server software would have to be changed as well, though not by much it seems, to keep the bookkeeping correct. This change would reduce the number of outgoing connections, though it would increase the time to download a work unit; the network overhead for outgoing work would go down and thereby reduce connection demand. I think the upload burden wouldn't change, but on uploads the system already benefits from the clients buffering completed work, which keeps a steadier inflow to the servers. I'm sure there are many suggestions. But generally, we should be trying to improve the system, and frankly I don't see that happening much at SETI. Instead it seems (despite the servers' currently better operation) that client technology, with the introduction of ever-faster GPUs at least, is leaping past the original server architecture and design. Why not actively look for ways to make at least incremental steps, if not giant leaps!? |
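The bundling scheme proposed above can be sketched in a few lines. This is a minimal illustration of the idea only, with made-up names; it is not BOINC server or client code, and the bundle size of 3 is the poster's example figure.

```python
# Hypothetical sketch of the proposal: the server groups several standard
# work units into one download, and the client splits the bundle back into
# individual WUs before processing. No re-bundling happens on upload.

def bundle_workunits(workunits, bundle_size=3):
    """Server side: group individual WUs into fixed-size bundles,
    so N WUs go out in ceil(N / bundle_size) downloads."""
    return [workunits[i:i + bundle_size]
            for i in range(0, len(workunits), bundle_size)]

def unbundle(bundle):
    """Client side: recover the standard WUs from one downloaded bundle.
    Each WU is then computed and reported individually, as today."""
    return list(bundle)

wus = [{"id": n, "data": f"wu_{n}.dat"} for n in range(7)]
bundles = bundle_workunits(wus, bundle_size=3)
print(len(bundles))                # 3 downloads instead of 7
print([len(b) for b in bundles])   # [3, 3, 1]
```

Seven WUs need only three download connections here, which is the overhead saving being argued for; the trade-off is that each download is larger and a single failed transfer costs more.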
rob smith Send message Joined: 7 Mar 03 Posts: 22200 Credit: 416,307,556 RAC: 380 |
All that you suggest would make sense if the server/network interface were the bottleneck; however, the current bottleneck is the internal server/database area, and the changes you suggest would increase the load on that interface, not reduce it. For every work request the server would have to analyse the client's performance to establish how to package the work, and then do the packaging, all of which takes server clock cycles and database reads. In recent weeks there have been a number of trials of changes to the server; some have been "less than successful", others have worked to some extent. Any database restructuring needs to be VERY carefully considered: the underlying database has millions, if not billions, of records and occupies many terabytes of disk.... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Thanks for the feedback. Actually, your objection is based on setting the packet size dynamically. Just to get "it" going, a fixed packet of, say, 10 would save almost that much network overhead in the downloading step, it would seem. Over time, I have posted/felt that eventually a lot of steps in the process should be made dynamic, which would better handle a variety of hosts; but that would take some effort, I'm sure. One hot-button example is posting credit (cringe!). Network use may not be at the top of the Pareto, but it has been in the past and is likely to be important if the project grows in size and so forth. I guess if one eats a low-hanging fruit from time to time, it isn't terrible. |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Does anyone have a way to know how many WUs are timing out versus how many are being returned as computed (valid or not)? Time-outs are doubly bad for this system if they amount to a large number. |
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 30649 Credit: 53,134,872 RAC: 32 |
One thing to remember is the science. Right now a work unit is a pixel on the sky. Make it bigger and what happens? Not what the science wants. However, there is one thing that could be done: the return deadlines could be made dynamic, with faster hosts getting shorter deadlines. This might help with the explosion in work awaiting validation. What may also need a look is the punish code. It may be allowing hosts to burn through massive numbers of work units while always returning computation errors. Cut that down and you cut the size of the BOINC DB. On the client side, it may need to put up a notice to fix the issue. (We have the PM to bad host thread, but ...) |
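The dynamic-deadline idea above can be sketched as a simple scaling rule. This is an illustration only, not actual BOINC scheduler code: the baseline deadline, clamp bounds, and reference throughput are all made-up numbers, and a real scheduler would derive host throughput from its own statistics.

```python
# Illustrative sketch: scale the return deadline inversely with the host's
# measured throughput, clamped to sane bounds, so fast hosts get short
# deadlines and slow hosts get long ones. All constants are hypothetical.

BASE_DEADLINE_DAYS = 14.0      # deadline for a reference-speed host
MIN_DEADLINE_DAYS = 2.0        # floor, so fast hosts still have slack
MAX_DEADLINE_DAYS = 28.0       # ceiling, so slow hosts don't wait forever
REFERENCE_TASKS_PER_DAY = 10.0

def dynamic_deadline(tasks_per_day):
    """Return a deadline in days, shorter for faster hosts."""
    if tasks_per_day <= 0:
        return MAX_DEADLINE_DAYS  # unknown or idle host: maximum slack
    scaled = BASE_DEADLINE_DAYS * (REFERENCE_TASKS_PER_DAY / tasks_per_day)
    return max(MIN_DEADLINE_DAYS, min(MAX_DEADLINE_DAYS, scaled))

print(dynamic_deadline(100.0))  # fast host: 2.0 (clamped to the floor)
print(dynamic_deadline(10.0))   # reference host: 14.0
print(dynamic_deadline(1.0))    # slow host: 28.0 (clamped to the ceiling)
```

Shorter deadlines for fast hosts would turn around resends sooner and shrink the pool of work stuck waiting on validation, which is the effect the post is after.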
rob smith Send message Joined: 7 Mar 03 Posts: 22200 Credit: 416,307,556 RAC: 380 |
You can do it yourself - look at the last two characters of the task names. A new task goes out with "_0" and "_1"; for resends this is increased to "_2", "_3", etc. Then look at the data for each of those tasks; the headline error message will tell you if it was resent as a result of a computation error, download error, inconclusive, or timeout. Last time I looked the number of resends was low (well below 1%), and the majority of those were down to computation errors and inconclusives (combined about 75%); timeouts were a small fraction and download errors extremely low. However, there are a few hosts that "vanish" for one reason or another (hardware failure, owner decides to stop, etc.) and a few that have such excessively large caches and such a low work rate that they fail to deliver on time (I recall one host with a 2-core processor that had over 10,000 tasks stashed away and was returning about one task per week.....) Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
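The counting approach described above - reading the trailing "_N" replication index off each task name - can be automated in a few lines. The task names and outcome labels below are made-up examples for illustration; only the "_0/_1 initial, _2+ resend" convention comes from the post.

```python
import re
from collections import Counter

# Example data: (task name, headline outcome). The trailing "_N" is the
# replication index; per the post, "_2" and above indicates a resend.
tasks = [
    ("blc35_2bit_guppi.12ab.vlar_0", "Success"),
    ("blc35_2bit_guppi.12ab.vlar_1", "Success"),
    ("blc35_2bit_guppi.34cd.vlar_2", "Timed out"),
    ("blc35_2bit_guppi.56ef.vlar_3", "Computation error"),
]

def replication_index(name):
    """Extract the trailing _N replication index, or None if absent."""
    m = re.search(r"_(\d+)$", name)
    return int(m.group(1)) if m else None

# Count resends and break them down by the reason they were resent.
resends = [(name, outcome) for name, outcome in tasks
           if (replication_index(name) or 0) >= 2]
reasons = Counter(outcome for _, outcome in resends)
print(len(resends))   # 2 resends out of 4 tasks
print(dict(reasons))  # {'Timed out': 1, 'Computation error': 1}
```

Run over a host's full task list, the `reasons` tally gives the timeout-versus-error split that the earlier question asked about.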
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Bundling of tasks into a supersize task can be done; witness that exact mechanism at MilkyWay, used to reduce the amount of traffic into and out of the servers. I don't know what it entailed in the server code to bundle five individual tasks into one supersize task and then split them apart upon task report, but they did accomplish it, and in a short time too, if I recall. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
mmonnin Send message Joined: 8 Jun 17 Posts: 58 Credit: 10,176,849 RAC: 0 |
Yes, it saved MilkyWay. For them it was the number of constant connections to the server that broke it. A 280X, which has high FP64 performance, runs 4 bundled tasks concurrently (the equivalent of 20 unbundled tasks) at about 2 min each. The client was constantly requesting and sending tasks when the task limit was only 60. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
AFAIK there was a SETI v10 movement some time ago (skipping v9 is the new trend, it seems). Not much info since then, or I have just missed it. But v10 tasks are expected to be more computationally heavy. SETI apps news We're not gonna fight them. We're gonna transcend them. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.