Should BOINC/Seti be stratified?

Message boards : Number crunching : Should BOINC/Seti be stratified?

PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 1928458 - Posted: 6 Apr 2018, 23:07:01 UTC

We all know this, but the attached image derived from boincstats graphically argues a point. The top 1% of active users contribute 30% of the average credit each day. There are about 100K active users. What is more, the top handful of users produce 1-2 orders of magnitude more credit each day than the remaining top 1%.
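
For anyone who wants to check a figure like "the top 1% contribute 30%" against their own export of per-user average credit, here is a minimal Python sketch. The function name `top_share` and the sample numbers are made up for illustration; the sample is not boincstats data.

```python
def top_share(credits, fraction=0.01):
    """Return the share of total credit contributed by the top `fraction` of users."""
    if not credits:
        raise ValueError("credits must not be empty")
    ranked = sorted(credits, reverse=True)
    # Always count at least one user so tiny samples still work.
    top_n = max(1, int(len(ranked) * fraction))
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:top_n]) / total


# Made-up sample: 1 heavy host and 99 light ones (NOT real project data).
sample = [4200.0] + [100.0] * 99
print(f"top 1% share: {top_share(sample):.0%}")
```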

To be extensible, isn't it about time to stratify the project? That is, give the eager beavers with Borg-like compute rigs more and/or longer jobs, and separate out the rest of us slowpokes to reduce contention and latency issues. Learning how to do this wouldn't be a bad development for BOINC, and it would make the system more adaptable to a wider range of compute node technologies and larger numbers of interested users.

Quick. Somebody write a grant!




https://tinyurl.com/y8d8bnu6
ID: 1928458 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 1928504 - Posted: 7 Apr 2018, 1:08:06 UTC - in response to Message 1928458.  

https://tinyurl.com/y8d8bnu6


The file at this link is named SETI Production.bmp but is actually a JPEG, so it may have to be renamed to .JPG/.JPEG to open.
ID: 1928504 · Report as offensive
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 1928576 - Posted: 7 Apr 2018, 10:18:08 UTC - in response to Message 1928504.  
Last modified: 7 Apr 2018, 10:22:15 UTC

Actually, it is a .bmp. Here is a link to the .jpg:
https://tinyurl.com/yb46f4cf
ID: 1928576 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1928591 - Posted: 7 Apr 2018, 12:29:47 UTC

Stratification, while sounding simple to do, would mean a lot more server activity to produce the two different types of task, cross-validation between the two sets of results, and increased user management activities, all of which would almost certainly not deliver the reduction in server stress that you would expect.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1928591 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1928592 - Posted: 7 Apr 2018, 12:35:36 UTC - in response to Message 1928591.  

I agree. The instant calculations that the scheduler has to do already when processing a work request are complex enough. Adding more to the mix would stand a good chance of bringing a downside with it.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1928592 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1928593 - Posted: 7 Apr 2018, 12:38:08 UTC

And just have a look at the "panic" thread for the current "issues" some are having with the scheduler not doing what they want it to, and Richard's numerous posts describing his digging into just one small corner of the scheduler code....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1928593 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1928597 - Posted: 7 Apr 2018, 12:59:01 UTC - in response to Message 1928593.  

Exactly. It's a bit of code that has to work instantly on the fly for work to flow properly.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1928597 · Report as offensive
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 1928838 - Posted: 8 Apr 2018, 18:32:24 UTC

I understand those points. Yet, I wonder how other projects handle or plan to handle the variation and improvement in client hardware.

By way of a suggestion, it seems that to reduce any network connection bottleneck, the standard WUs could be bundled into packets of 3 or more, the actual number to be determined by trials or made dynamic based on the requesting host's apparent speed. The client software would first separate the bundles locally into several standard WUs on the host, then work on the individual ones as normal, and finally return the individual units as usual, i.e. no re-bundling. Of course the server software would have to be changed as well, though not by much it seems, to keep the bookkeeping correct. This change would decrease outgoing connections, though it would increase the time each download takes. Yet the network overhead would go down for outgoing work and thereby reduce connection demand. I think the upload burden wouldn't change, but on uploads the system benefits from having the clients buffer the completed work and thus keep a steadier inflow to the servers.
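
The bundling idea above can be sketched roughly as follows. This is only an illustration in Python, not BOINC or SETI@home code: `WorkUnit`, `make_bundles` and `unbundle` are hypothetical names, and real server-side bookkeeping would be more involved.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class WorkUnit:
    name: str  # hypothetical identifier, not a real SETI@home task name


def make_bundles(units: Iterable[WorkUnit], size: int) -> Iterator[List[WorkUnit]]:
    """Server side: group standard work units into bundles of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    bundle: List[WorkUnit] = []
    for unit in units:
        bundle.append(unit)
        if len(bundle) == size:
            yield bundle
            bundle = []
    if bundle:  # final partial bundle
        yield bundle


def unbundle(bundle: List[WorkUnit]) -> List[WorkUnit]:
    """Client side: split a bundle back into standard work units.

    Each unit is then crunched and returned on its own, with no re-bundling.
    """
    return list(bundle)


units = [WorkUnit(f"wu_{i}") for i in range(7)]
bundles = list(make_bundles(units, size=3))
print(len(units), "work units ->", len(bundles), "downloads")
```

With a bundle size of 3, seven work units go out in three downloads instead of seven, and the client still reports seven individual results.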

I'm sure there are many suggestions. But generally, we should be trying to improve the system, and frankly I don't see that happening much at SETI. Instead, it seems (despite the servers' currently better operation) that the client technology, with the introduction of ever faster GPUs at least, is leaping over the original server architecture and design. Why not look actively for ways to make at least incremental steps, if not giant leaps!?
ID: 1928838 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1928846 - Posted: 8 Apr 2018, 19:04:20 UTC

All that you suggest would make sense if the bottleneck were the server/network interface. However, it is the internal server/database area that is the current bottleneck, and the changes you suggest would increase the load on that interface, not reduce it: for every call for work the server would have to do some analysis of the client performance to establish how to package the work, and then do the packaging, all of which takes server clock cycles and database reads.
In recent weeks there have been a number of trials of changes to the server; some have been "less than successful", others have worked to some extent. Any database restructuring needs to be VERY carefully considered, as the underlying database has millions, if not billions, of records and occupies many terabytes of disk....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1928846 · Report as offensive
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 1928849 - Posted: 8 Apr 2018, 19:11:25 UTC - in response to Message 1928846.  

Thanks for the feedback. Actually, your objection is based on setting packet size dynamically. Just to get "it" going, a fixed packet of, say, 10 would cut the network overhead of the downloading step by almost that factor, it would seem.
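
To put a rough number on the fixed-bundle idea: if each download request carries a fixed connection overhead, bundling 10 WUs per request cuts the number of requests by about a factor of 10, while the bytes transferred stay the same. A minimal sketch, where both function names are made up for illustration and nothing here measures real SETI@home traffic:

```python
import math


def download_requests(work_units: int, bundle_size: int) -> int:
    """Number of download requests needed to fetch `work_units` WUs."""
    if work_units < 0 or bundle_size < 1:
        raise ValueError("work_units must be >= 0 and bundle_size >= 1")
    return math.ceil(work_units / bundle_size)


def overhead_saved(work_units: int, bundle_size: int) -> float:
    """Fraction of per-request overhead avoided versus one request per WU.

    Assumes the overhead is the same for every request and that the payload
    size is unaffected by bundling.
    """
    if work_units == 0:
        return 0.0
    return 1 - download_requests(work_units, bundle_size) / work_units


print(download_requests(1000, 10), "requests instead of 1000")
print(f"overhead saved: {overhead_saved(1000, 10):.0%}")
```

This only counts connections. It says nothing about the extra server and database work needed to build the bundles, which is the objection raised earlier in the thread.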

Over time, I have posted/felt that eventually a lot of steps in the process should be made dynamic, which would better handle a variety of hosts. But that would take some effort, I'm sure. One hot button example is posting credit (cringe!).

Network use may not be at the top of the Pareto, but it has been in the past and is likely to be important if the project grows in size and so forth. I guess if one eats a low-hanging fruit from time to time, it isn't terrible.
ID: 1928849 · Report as offensive
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 1928850 - Posted: 8 Apr 2018, 19:13:02 UTC

Does anyone have a way to know how many WUs are timing out versus how many are being returned as computed (valid or not)?

Time-outs are doubly bad for this system if they amount to a large number.
ID: 1928850 · Report as offensive
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30608
Credit: 53,134,872
RAC: 32
United States
Message 1928858 - Posted: 8 Apr 2018, 19:28:12 UTC

One thing to remember is the science. Right now a work unit is a pixel on the sky. Make it bigger and what happens? Not what the science wants.

However, there is one thing that could be done. The return deadlines could be made dynamic: faster hosts get shorter deadlines. This might help with the explosion in work waiting for validation. What may also need a look is the punish code. It may be allowing hosts to burn through massive numbers of work units while always returning computation errors. Cut that down and you cut the size of the BOINC DB. Client side, it may need to put a notice up to fix the issue. (We have the PM to bad host thread, but ...)
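
One way a dynamic deadline could be derived, purely as a sketch: scale each host's average turnaround by a safety factor and clamp the result. The function name, the safety factor of 3 and the 2-day and 50-day bounds are all assumptions chosen for illustration, not values taken from the project or from BOINC.

```python
from datetime import timedelta


def dynamic_deadline(
    avg_turnaround: timedelta,
    safety_factor: float = 3.0,                # assumed margin, not a BOINC default
    floor: timedelta = timedelta(days=2),      # assumed shortest deadline
    ceiling: timedelta = timedelta(days=50),   # assumed longest deadline
) -> timedelta:
    """Return a per-host deadline: faster hosts get shorter deadlines."""
    scaled = avg_turnaround * safety_factor
    # Clamp so very fast hosts still get some slack and slow hosts are bounded.
    return max(floor, min(scaled, ceiling))


for turnaround in (timedelta(hours=6), timedelta(days=10), timedelta(days=30)):
    print(turnaround, "->", dynamic_deadline(turnaround))
```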
ID: 1928858 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1928860 - Posted: 8 Apr 2018, 19:28:34 UTC

You can do it yourself - look at the last two characters of task names.
A new task goes out with "_0" and "_1"; for resends this is increased to "_2", "_3", etc. Then look at the data for each of those tasks: the headline error message will tell you if it was resent as a result of a computational error, download error, inconclusive result or timeout. Last time I looked the number of resends was low (well below 1%), and the majority of those were down to computational errors and inconclusives (combined about 75%); timeouts were a small fraction and download errors extremely low.
However there are a few hosts that "vanish" for one reason or another (hardware failure, owner decides to stop etc.) and a few that have such excessively large caches and such a low work rate that they fail to deliver on time (I recall one host with a 2-core processor that had over 10,000 tasks stashed away, and was returning about one task per week.....)
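
The suffix check described above is easy to script against a list of task names copied from your own task pages. A minimal Python sketch, assuming names end in an underscore followed by the replica number and that the first two replicas are the initial pair; the function names and the example task names are invented:

```python
def replica_index(task_name: str) -> int:
    """Return the numeric suffix after the final underscore of a task name."""
    _, _, suffix = task_name.rpartition("_")
    return int(suffix)


def is_resend(task_name: str, initial_replicas: int = 2) -> bool:
    """True if the task is a resend, i.e. its suffix is "_2" or higher."""
    return replica_index(task_name) >= initial_replicas


def resend_fraction(task_names) -> float:
    """Fraction of the given tasks that are resends."""
    names = list(task_names)
    if not names:
        return 0.0
    return sum(is_resend(name) for name in names) / len(names)


# Invented names for illustration only.
tasks = ["example_wu_a_0", "example_wu_a_1", "example_wu_b_1", "example_wu_b_2"]
print(f"resends: {resend_fraction(tasks):.0%}")
```

This only tells you that a task was resent. The reason (computation error, inconclusive, timeout, download error) still has to be read from the task details.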
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1928860 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1928886 - Posted: 8 Apr 2018, 20:47:29 UTC

Bundling of tasks into a supersize task can be done. Witness that exact mechanism done at MilkyWay to reduce the amount of traffic into and out of the servers. I don't know what it entailed in the server code to bundle five individual tasks into one supersize task and then split them apart upon task report, but they did accomplish that. And in a short time too if I recall.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1928886 · Report as offensive
mmonnin
Volunteer tester

Joined: 8 Jun 17
Posts: 58
Credit: 10,176,849
RAC: 0
United States
Message 1929219 - Posted: 10 Apr 2018, 9:48:37 UTC

Yes, it saved MilkyWay. For them it was the number of constant connections to the server that broke it. A 280x, which has high FP64 throughput, runs 4 bundled tasks concurrently (so 20 unbundled tasks) at 2 min each. The client was constantly requesting and sending tasks when the task limit was only 60.
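
A back-of-the-envelope version of those numbers, as a sketch: it assumes the limit of 60 counts tasks as issued (bundled or not) and that a batch of concurrent tasks finishes every 2 minutes. The function name is made up.

```python
def cache_minutes(task_limit: int, tasks_per_batch: int, minutes_per_batch: float) -> float:
    """Minutes until a full cache of `task_limit` tasks is used up."""
    if tasks_per_batch < 1:
        raise ValueError("tasks_per_batch must be at least 1")
    return task_limit / tasks_per_batch * minutes_per_batch


# Unbundled: the equivalent of 20 tasks completes every 2 minutes.
print("unbundled:", cache_minutes(60, 20, 2.0), "minutes")
# Bundled: 4 bundled tasks complete every 2 minutes.
print("bundled:  ", cache_minutes(60, 4, 2.0), "minutes")
```

Under these assumptions a full cache lasts about five times longer with bundling, so the host has to contact the server far less often.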
ID: 1929219 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1929894 - Posted: 14 Apr 2018, 10:13:39 UTC

AFAIK there was a SETIv10 movement some time ago (skipping v9 is the new trend, it seems).
Not very much info since then, or I just missed it. But v10 tasks are expected to be more computationally heavy.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1929894 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.