Would servers survive?

Iztok s52d (and friends)

Joined: 12 Jan 01
Posts: 136
Credit: 393,469,375
RAC: 116
Slovenia
Message 83619 - Posted: 7 Mar 2005, 18:40:59 UTC

Hi!

I can imagine 150,000 clients jumping on the poor server,
mostly asking for more work, and trying to upload as well.

Having the server down for so many days (they were mostly down all of last week) is causing the worst possible DoS attack.

I expect they will need a few days simply to recover,
so we might see the "Ready to send" count (currently 1,866) increase while nobody gets a WU to crunch.

BR
Iztok

p.s. When I enjoyed the nice academic life, nobody expected me in the lab early on Mondays; but I had no problem coming to the lab on weekends to baby-sit our boxes.

ID: 83619
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83626 - Posted: 7 Mar 2005, 18:51:26 UTC - in response to Message 83619.  


> Having the server down for so many days (they were mostly down all of last week) is causing the worst possible DoS attack.

From the "news" page:

March 7, 2005 - 18:45 UTC
The graceful shutdown procedure is finally falling into place. The project is up but data service is off while we generate some more work. Data service will be on shortly. We will have a short outage or two today for testing graceful shutdown.

-----

It's pretty clear that the BOINC client runs a little hot after a longish outage: they need a way to tell the clients to "spread out" and keep the number of simultaneous connections down so that transfers can finish.
ID: 83626
Iztok s52d (and friends)

Joined: 12 Jan 01
Posts: 136
Credit: 393,469,375
RAC: 116
Slovenia
Message 83630 - Posted: 7 Mar 2005, 20:15:43 UTC - in response to Message 83626.  


> It's pretty clear that the BOINC client runs a little hot after a longish outage: they need a way to tell the clients to "spread out" and keep the number of simultaneous connections down so that transfers can finish.

Hi!

Now... the only host happy with the server is the one that has sent in all its results
and still has a few hours of work.
How do you manage that when 100,000 boxes are calling in every few minutes?
(Judging by the log of my box with 30 results waiting...)
Simply rejecting them is too much work for the server.

I would put a firewall there and change the allowed IPs every few minutes
(blocking just the TCP connect, to keep old downloads going).
They have all our IPs, so they could open up one group at a time:
if they open 10% of the PCs for 5 minutes, another 10% for the next 5 minutes, and so on,
things might come back to some normal state.
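
A minimal sketch of that rotation idea in Python (the gate in front of the scheduler, the group count, the window length and the hash are all illustrative, not anything SETI actually runs):

import time
import zlib

GROUPS = 10     # admit 10% of the hosts at a time
WINDOW = 300    # rotate the open group every 5 minutes

def is_admitted(client_ip, now=None):
    """True if this IP's group is currently allowed to connect.

    Each IP hashes into one of GROUPS stable buckets; one bucket is
    open per WINDOW, so every host gets a turn once per GROUPS * WINDOW
    seconds and the server keeps no per-client state.
    """
    now = time.time() if now is None else now
    bucket = zlib.crc32(client_ip.encode()) % GROUPS
    open_bucket = int(now // WINDOW) % GROUPS
    return bucket == open_bucket

# Gate only new connections (the TCP connect); transfers already
# in progress are untouched.
print(is_admitted("192.0.2.17"))

Because the bucket is a pure function of the IP and the clock, the firewall rules could be regenerated every five minutes without tracking which hosts have already had a turn.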

I would be glad to hear the story.

BR
Iztok

p.s. And please tell us that it is campus policy NOT to keep boxes alive over the weekend until the problem is fixed.

ID: 83630
Chiana

Joined: 2 Feb 05
Posts: 19
Credit: 165,472
RAC: 0
Sweden
Message 83636 - Posted: 7 Mar 2005, 20:23:11 UTC

Hopefully they will manage to keep it up as much as possible. My crunchers ran out of work sometime yesterday, so they are eager to get in contact with the servers: I have a couple of WUs to upload and need a couple of thousand seconds of work...

I suppose we will see one or two smackdowns in the coming days due to the hammering...

ID: 83636
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83640 - Posted: 7 Mar 2005, 20:26:55 UTC - in response to Message 83630.  


> Now... the only host happy with the server is the one that has sent in all its results
> and still has a few hours of work.
> How do you manage that when 100,000 boxes are calling in every few minutes?
> (Judging by the log of my box with 30 results waiting...)
> Simply rejecting them is too much work for the server.

Let's say that each BOINC client tries to connect every five minutes on average.

Stretching from five minutes to fifty minutes reduces the server load by a factor of ten -- to put it another way, each client that does connect gets more CPU time and bandwidth to get work in while the other clients wait their turn.

The overall throughput goes up because the clients that do connect can complete the transfers.
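
Rough numbers behind that factor of ten, using the 150,000-client figure from the top of the thread (illustrative arithmetic, not measured data):

CLIENTS = 150_000

for retry_minutes in (5, 50):
    attempts_per_sec = CLIENTS / (retry_minutes * 60)
    print(f"retry every {retry_minutes:2d} min -> ~{attempts_per_sec:.0f} connection attempts/sec")

# retry every  5 min -> ~500 connection attempts/sec
# retry every 50 min -> ~50 connection attempts/sec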

> I would put there a firewall and change allowed IPs every few minutes
> (blocking just TCP connect, to keep old downloads going on).
> They do have all our IPs, so they can open only one group at the time,
> If they open up 10% of PCs for 5 minutes, another 10% next 5 minuts, then
> it might come to some normal state.

Don't forget that BOINC controls both the server code and the client code, so they don't need to do something funky with a firewall; they can (theoretically) tell the clients to back off and not hammer the server.
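
For illustration, one shape that back-off could take: the server stamps every scheduler reply with a "come back in N seconds" value scaled by its current load, and the client honors it. The field name and the scaling below are assumptions for the sketch, not the actual BOINC protocol:

import random

def scheduler_reply(load_factor):
    """Build a reply telling the client when to come back.

    load_factor is ~1.0 on a healthy server; 10.0 means ten times too
    many requests, so clients are told to wait ten times longer.
    """
    base_delay = 60  # seconds between requests when the server is healthy
    return {"request_delay": base_delay * max(1.0, load_factor)}

# Client side: honor the advised delay (plus jitter) instead of a fixed timer.
reply = scheduler_reply(load_factor=10.0)
wait = reply["request_delay"] * random.uniform(1.0, 1.5)
print(f"next scheduler contact in {wait:.0f} seconds")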

> p.s. And please tell us it is campus policy NOT to have boxes alive over
> weekend untill problem is fixed.

I'm not on campus and don't represent Berkeley, but like most everyone here I do have an opinion. It's possible that everything would have been safe over the weekend (with other building loads turned off), but it's also entirely possible to have a power failure and lose the database -- or even damage hardware.

They've been doing what they can to prevent that BY HAND. So, until they get the UPSes and servers all getting along and talking to each other, they've been running only when someone was there to turn things off.
ID: 83640
Nick Cole

Joined: 27 May 99
Posts: 97
Credit: 3,806
RAC: 0
United Kingdom
Message 83641 - Posted: 7 Mar 2005, 20:27:10 UTC

Yes, real IT guys (and gals) would have put some more controlling mechanisms in place. What is frustrating is to get a connection and then be bounced off while an update/download is in progress. The end result is that it takes most of a working-day session for the down/uploads to stabilise, and then it all goes off again. Operationally, it really needs to be left up longer once it has stabilised before it is switched off.

Of course, it didn't help announcing that the system would be shut down over the weekend just before the weekend started!

The problem with allowing groups or subnets to connect is that you would never know when it was your turn. And the router would be doing overtime rejecting requests as a result.

The problem is that the array architecture could do with tweaking. If the result download and upload processes were hived off-site onto a mirror, then all users could carry on regardless. They depend on us to process what they have divvied up and return the results. Collation, management and checking of those results can be carried out independently, as they are at present. The problem is that everything is being done on one site, in one system. That, coupled with the woeful UPS situation, is making a bad problem far worse than it needs to be. The project seems to lack IT operational techy expertise.
ID: 83641
Chiana

Joined: 2 Feb 05
Posts: 19
Credit: 165,472
RAC: 0
Sweden
Message 83645 - Posted: 7 Mar 2005, 20:32:39 UTC - in response to Message 83641.  

> Yes, real IT guys (and gals) would have put some more controlling mechanisms in place. What is frustrating is to get a connection and then be bounced off while an update/download is in progress. The end result is that it takes most of a working-day session for the down/uploads to stabilise, and then it all goes off again. Operationally, it really needs to be left up longer once it has stabilised before it is switched off.
>
> Of course, it didn't help announcing that the system would be shut down over the weekend just before the weekend started!
>
> The problem with allowing groups or subnets to connect is that you would never know when it was your turn. And the router would be doing overtime rejecting requests as a result.
>
> The problem is that the array architecture could do with tweaking. If the result download and upload processes were hived off-site onto a mirror, then all users could carry on regardless. They depend on us to process what they have divvied up and return the results. Collation, management and checking of those results can be carried out independently, as they are at present. The problem is that everything is being done on one site, in one system. That, coupled with the woeful UPS situation, is making a bad problem far worse than it needs to be. The project seems to lack IT operational techy expertise.
>

From what I can see, it is not technical knowledge that is lacking, it is funding: mirroring a server pool costs something like an entire server pool, not to mention someplace to put it (renting space, paying power bills, internet connections, etc.).


ID: 83645
JAF

Joined: 9 Aug 00
Posts: 289
Credit: 168,721
RAC: 0
United States
Message 83649 - Posted: 7 Mar 2005, 20:39:55 UTC

A couple of problems, as I see it. First, I had to abort a bunch of SETI WUs because I wouldn't make the deadline. I signed up for Einstein@home to continue crunching (I didn't see any sense in crunching work that would be thrown out anyway).

Second, I have a bunch (over thirty) of completed WUs that are due today. Being on dial-up, I either have to sit here and bang on the servers to try to report my work, or just let it expire. (Ned, you mentioned in another post that the deadline doesn't always apply, but I didn't see a follow-up -- I'll go back and look for that thread.)
ID: 83649
Saenger
Volunteer tester

Joined: 3 Apr 99
Posts: 2452
Credit: 33,281
RAC: 0
Germany
Message 83650 - Posted: 7 Mar 2005, 20:48:40 UTC - in response to Message 83649.  

> (Ned, you mentioned in another post that the deadline doesn't always apply, but I didn't see a follow-up -- I'll go back and look for that thread.)

If you are still faster than the others and send them back before the validator gets to them, you'll be lucky.
And it's not that hard to accomplish that ATM ;)

BTW: they could just turn off the validator for a couple of days, and everything would be fine. We would just have to wait a bit longer for the credits, but who cares?
Greetings from Saenger

For questions about BOINC, look in the BOINC-Wiki
ID: 83650
Nick Cole

Joined: 27 May 99
Posts: 97
Credit: 3,806
RAC: 0
United Kingdom
Message 83652 - Posted: 7 Mar 2005, 20:52:21 UTC - in response to Message 83645.  

> > Yes, real IT guys (and gals) would have put some more controlling mechanisms in place. What is frustrating is to get a connection and then be bounced off while an update/download is in progress. The end result is that it takes most of a working-day session for the down/uploads to stabilise, and then it all goes off again. Operationally, it really needs to be left up longer once it has stabilised before it is switched off.
> >
> > Of course, it didn't help announcing that the system would be shut down over the weekend just before the weekend started!
> >
> > The problem with allowing groups or subnets to connect is that you would never know when it was your turn. And the router would be doing overtime rejecting requests as a result.
> >
> > The problem is that the array architecture could do with tweaking. If the result download and upload processes were hived off-site onto a mirror, then all users could carry on regardless. They depend on us to process what they have divvied up and return the results. Collation, management and checking of those results can be carried out independently, as they are at present. The problem is that everything is being done on one site, in one system. That, coupled with the woeful UPS situation, is making a bad problem far worse than it needs to be. The project seems to lack IT operational techy expertise.
>
> From what I can see, it is not technical knowledge that is lacking, it is funding: mirroring a server pool costs something like an entire server pool, not to mention someplace to put it (renting space, paying power bills, internet connections, etc.).
>
Understood, but that is what sponsoring is all about, and it is only two servers' worth of hard disks that is needed for transferring work units. Architecturally the system appears to prevent such a solution, and if nothing else, not being able to make the UPS work is a techy issue. I am sure that whoever provided the servers would be able to help. Why not relocate the main data-transfer machines to wherever the network switches and routers are located? That area has been powered properly all weekend and overnight. It is not the entire suite that needs to be moved, just the work-unit transfer systems. And from what I remember reading somewhere, it isn't as if the machines are that power-hungry anyway.

Operational emergency solutions require a bit of lateral thinking, that is all. Given the timescale of this disruption, something more could have been done apart from leaving it. While I would love to help, I am 8 timezones away!

I think that after this debacle the project team needs a gold medal for perseverance, as I am sure they are tearing their hair out, but we need silver medals for sticking with it.
ID: 83652
Iztok s52d (and friends)

Joined: 12 Jan 01
Posts: 136
Credit: 393,469,375
RAC: 116
Slovenia
Message 83654 - Posted: 7 Mar 2005, 20:58:02 UTC - in response to Message 83649.  

Hi!

Do not worry, they do give credit for overdue WUs. It is unlikely that all the others will manage to upload while you do not.

Anyhow, it is hard to get through in a huge DoS-like avalanche.
My boxes are trying to contact the server every few minutes. Plenty of WUs have been waiting for several days, and only one PC still has some SETI work.

So, press upload occasionally, and good luck.

To Nick Cole:

> The problem with allowing groups or subnets to connect is that you would never know when it was your turn. And the router would be doing overtime rejecting requests as a result.

Do you know when it is your turn now? Wednesday, 0400 local time?

BR
Iztok

ID: 83654
Nick Cole

Joined: 27 May 99
Posts: 97
Credit: 3,806
RAC: 0
United Kingdom
Message 83658 - Posted: 7 Mar 2005, 21:01:42 UTC - in response to Message 83640.  

But the overall timescale goes up by a factor of 10 (it is not quite as simple as that, but broadly speaking it is on those lines). It also follows that ten times fewer people are processing at any one time, so the overall gain is nil. Considering the large number of users (approx. 500,000 across both projects), simplistic solutions do not provide the answers. If the system is down for 16 hours out of 24, it doesn't take much to realise that even stretching connect intervals a small amount will push most users out of that window. This applies to the whole SETI project, not just BOINC, remember!

The only viable solution is to have an off-site (or out-of-lab, elsewhere on campus) data mirror that can be used instead. Having a dedicated fibre network link to wherever it was relocated wouldn't be hard to arrange, even if it couldn't be re-sited next to the routers/switches.

Disaster recovery and business continuity weren't built in.

> > Now... the only host happy with the server is the one that has sent in all its results
> > and still has a few hours of work.
> > How do you manage that when 100,000 boxes are calling in every few minutes?
> > (Judging by the log of my box with 30 results waiting...)
> > Simply rejecting them is too much work for the server.
>
> Let's say that each BOINC client tries to connect every five minutes on average.
>
> Stretching from five minutes to fifty minutes reduces the server load by a factor of ten -- to put it another way, each client that does connect gets more CPU time and bandwidth to get work in while the other clients wait their turn.
>
> The overall throughput goes up because the clients that do connect can complete the transfers.
>
> > I would put a firewall there and change the allowed IPs every few minutes
> > (blocking just the TCP connect, to keep old downloads going).
> > They have all our IPs, so they could open up one group at a time:
> > if they open 10% of the PCs for 5 minutes, another 10% for the next 5 minutes, and so on,
> > things might come back to some normal state.
>
> Don't forget that BOINC controls both the server code and the client code, so they don't need to do something funky with a firewall; they can (theoretically) tell the clients to back off and not hammer the server.
>
> > p.s. And please tell us that it is campus policy NOT to keep boxes alive over the weekend until the problem is fixed.
>
> I'm not on campus and don't represent Berkeley, but like most everyone here I do have an opinion. It's possible that everything would have been safe over the weekend (with other building loads turned off), but it's also entirely possible to have a power failure and lose the database -- or even damage hardware.
>
> They've been doing what they can to prevent that BY HAND. So, until they get the UPSes and servers all getting along and talking to each other, they've been running only when someone was there to turn things off.
>
ID: 83658
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83672 - Posted: 7 Mar 2005, 21:43:30 UTC - in response to Message 83658.  

> But the overall timescale goes up by a factor of 10 (not quite as simple as
> that but broadly speaking on those lines).

Not exactly. Think Distributed Denial of Service attack. Think of the guy a couple of years ago who took down Google by filling their pipe.

Every BOINC client can behave exactly like a zombie.

> Disaster recovery and business continuity wasn't built in.

"Business continuity" does not apply -- this is an academic program.

The definition of "disaster" is different as well. According to BOINCsynergy.com, there are 76,833 users and 186,315 hosts. Of those, how many are really inconvenienced? Most of them don't even visit these forums.

You've ignored several posts where I've pointed out that there are other projects, and that crunching 2 or 3 of them works great if you want to be crunching all the time. BOINC does offer a highly distributed architecture if you look at BOINC as the project.

Mostly, I think the "IT professionals" who keep deriding the staff at Berkeley are missing the fact that the requirements are just not the same. The project documentation is public, and you can read the requirements there.
ID: 83672
Nick Cole

Joined: 27 May 99
Posts: 97
Credit: 3,806
RAC: 0
United Kingdom
Message 83685 - Posted: 7 Mar 2005, 22:38:09 UTC - in response to Message 83672.  

You are right about the DoS similarities. But it is the impact on users that is the issue. The timescale I was referring to was the overall processing rate, i.e. the number of users. There is no simple correlation between the figures, but the rules of thumb illustrate the problem. Yes, reducing the connectivity helps the machines, but it prevents many users from connecting at all. It is similar to a traffic jam: you either move slowly or keep stopping and starting; if there are enough participants, then arguing about the mechanics is pointless, as there will always be inconvenience for a lot of people. Is it better to allow 10% of users to have access and slow down or stop the remaining 90%, or to keep everybody ticking over but slower? With half a million users, or perhaps a couple of hundred thousand who wish to or can exchange WUs and results frequently, the arguments are academic (excuse the pun).

In real terms, academics pursue research on a business-funded basis. Sponsors (which includes us clients) still expect business standards of rigour in the methodologies used, regardless of who is carrying out that function. If the sponsors miss out on value or results, then it doesn't matter who or what is doing it; the point is the same. While a shoestring approach is fine in a pilot or a short-term development activity, SETI has moved way beyond that. For its results to be credible, its operational methodology has to be above reproach. Since most users are in timezones other than California's, or even the US's, the scale and scope of the thing move it away from mere academia. Half a million (regular) users inconvenienced is still a lot. Maybe it isn't life-threatening, but the point is still the same. We are using our spare time, electricity, machinery and so on to do this, after all. With these outages, instead of the system running itself and maintaining productivity, the user interaction becomes immense, and therefore the overall cost of the project goes up.

Again, while it isn't life-threatening or vital to income generation, the methods used in a business-continuity environment (distributing machinery across power phases; locating it in different rooms, floors and buildings; diversity of networks; colocation with the routers; putting critical machines in secure locations) can all be applied at almost no cost. It is this attention to operational detail (or the lack of it) that is the real issue.

Some (many) current participants are only interested in SETI and do not wish to participate in other things. Your "IT professionals" are actually IT professionals with considerable operational experience. That doesn't knock the project team's achievements, but professional input is extremely valid. It is easy to be critical with hindsight, of course, but when criticism comes from a professional angle it needs to be listened to. How many people, even on home systems, use a UPS with their PCs, set to automatic shutdown, and have dial-up backup to their broadband routers, for example? Good practice is good practice, regardless of where it comes from or where it is applied. Such professional advice is supplied free, as are the many add-ons developed for Classic. Just because it is free doesn't mean it is wrong.

All that said, the project has run almost glitch-free for the last 6 years, which isn't a bad achievement at all. Unfortunately, Murphy's Law has finally kicked in: if something can go wrong, it will; in this case it took those 6 years!

> > But the overall timescale goes up by a factor of 10 (it is not quite as simple as that, but broadly speaking it is on those lines).
>
> Not exactly. Think Distributed Denial of Service attack. Think of the guy a couple of years ago who took down Google by filling their pipe.
>
> Every BOINC client can behave exactly like a zombie.
>
> > Disaster recovery and business continuity weren't built in.
>
> "Business continuity" does not apply -- this is an academic program.
>
> The definition of "disaster" is different as well. According to BOINCsynergy.com, there are 76,833 users and 186,315 hosts. Of those, how many are really inconvenienced? Most of them don't even visit these forums.
>
> You've ignored several posts where I've pointed out that there are other projects, and that crunching 2 or 3 of them works great if you want to be crunching all the time. BOINC does offer a highly distributed architecture if you look at BOINC as the project.
>
> Mostly, I think the "IT professionals" who keep deriding the staff at Berkeley are missing the fact that the requirements are just not the same. The project documentation is public, and you can read the requirements there.
>
ID: 83685
Iztok s52d (and friends)

Joined: 12 Jan 01
Posts: 136
Credit: 393,469,375
RAC: 116
Slovenia
Message 83694 - Posted: 7 Mar 2005, 23:01:41 UTC - in response to Message 83685.  

Hello Friends!

Let us go back to this thread's topic: would the boxes survive?
How long would it take till we restore normal operation?
Empirical result: my boxes got a few WUs and uploaded a few. It works, somehow.
Without any pushing (which is not as easy on Linux as on Wintendo).

Now, back to the subtopic: what can they do quickly with the available HW/SW?

> You are right about the DoS similarities. But it is the impact on users that is the issue. The timescale I was referring to was the overall processing rate, i.e. the number of users. There is no simple correlation between the figures, but the rules of thumb illustrate the problem. Yes, reducing the connectivity helps the machines, but it prevents many users from connecting at all. It is similar to a traffic jam: you either move slowly or keep stopping and starting; if there are enough participants, then arguing about the mechanics is pointless, as there will always be inconvenience for a lot of people. Is it better to allow 10% of users to have access and slow down or stop the remaining 90%, or to keep everybody ticking over but slower?

I really like the traffic-jam similarity. Are you trying to prove we do not need any traffic lights or police on the street? (Cairo, maybe?)

I proposed limiting sessions: rejecting connections due to overload is itself too high a load for the server.
If connection establishment is filtered out, the client times out (as it mostly does just now), but established connections are able to finish regularly. No more "Temporarily failed" messages. It does not matter whether they block connections based on IP, weather or rate: once a connection is established, it should finish properly.

Given unlimited resources, the best solution would be to have several seti@home servers around, processing different tapes. They could just report back validated results and user stats.
If the stats pages can combine SETI + LHC etc., then they can combine seti-eu, seti-na, seti-ch, etc...

Heh, a few more uploaded while I wrote this bla-bla.

Have a nice time; it is midnight here in Ljubljana.

73 Iztok

ID: 83694
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83701 - Posted: 7 Mar 2005, 23:17:33 UTC - in response to Message 83685.  

> You are right about the DoS similarities. But it is the impact on users that is the issue.

Please explain "impact on the users" -- not the perceived impact, but the actual impact.

... and please explain it as the typical user who is doing as the project asked, donating CPU cycles that would otherwise be wasted.

As I see it here, the "inconvenience" is that I have work units that are still on my disk waiting to be sent to Berkeley. They'll go just as quickly if I ignore them as they will if I obsess over them -- and I'll get the same credits either way.
ID: 83701
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83704 - Posted: 7 Mar 2005, 23:27:18 UTC - in response to Message 83694.  


> Let us go back to this thread's topic: would the boxes survive?

The servers are surviving. They're getting a lot of requests that go unanswered, and that is wasted bandwidth.

> How long would it take till we restore normal operation?
> Empirical result: my boxes got a few WUs and uploaded a few. It works, somehow.
> Without any pushing (which is not as easy on Linux as on Wintendo).

Based on other outages of about the same length, probably 48 hours before things are close to "normal."

> I really like the traffic-jam similarity. Are you trying to prove we do not need any traffic lights or police on the street? (Cairo, maybe?)
>
> I proposed limiting sessions: rejecting connections due to overload is itself too high a load for the server.

The problem with limiting sessions on the server side is that the BOINC clients are still trying to connect. You want to maximize throughput at the server by limiting in the client. Ideally, there would be some way for the server to say, "Hey guys, everybody relax, form a queue, and we'll take 'x' connections per second."

That's hard to do when you don't know the number of clients that are actually running, but you can approximate it by telling the clients "wait at least two to four hours on failure" and randomizing that.

It'd be best if there were a way to spread that news while the database and most of the servers were down -- for example, if SETI had been telling everyone all weekend that the minimum interval was 6 hours instead of 1 minute, the servers would be processing uploads and scheduler requests at near-maximum throughput, without the traffic jams.
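
A client-side sketch of that randomized back-off (the doubling schedule is an assumption; the cap is chosen so the spread lands in roughly the two-to-four-hour range mentioned above):

import random

def next_retry_delay(failures):
    """Randomized exponential back-off for a failed scheduler contact.

    The first failure waits 1-2 minutes; each further failure doubles
    the window, capped so thousands of clients don't all retry in the
    same instant when the server comes back.
    """
    base = 60 * (2 ** min(failures, 7))     # 60 s, 120 s, ... capped at 128 min
    return random.uniform(base, 2 * base)   # jitter spreads the retries out

# One client's retry schedule during an extended outage:
for failures in range(1, 9):
    print(f"after failure {failures}: retry in {next_retry_delay(failures) / 60:6.1f} min")

The randomization matters as much as the doubling: without jitter, every client that failed at the same moment would come back at the same moment, recreating the traffic jam on each retry.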

Kind of like Los Angeles traffic during the 1984 Olympics -- they asked businesses to shift their hours (and commuting), and traffic was better during the Olympics than at any time before or after.
ID: 83704
Nick Cole

Joined: 27 May 99
Posts: 97
Credit: 3,806
RAC: 0
United Kingdom
Message 83705 - Posted: 7 Mar 2005, 23:34:22 UTC - in response to Message 83701.  

> > You are right about the DoS similarities. But it is the impact on users that is the issue.
>
> Please explain "impact on the users" -- not the perceived impact, but the actual impact.
>
> ... and please explain it as the typical user who is doing as the project asked, donating CPU cycles that would otherwise be wasted.
>
> As I see it here, the "inconvenience" is that I have work units that are still on my disk waiting to be sent to Berkeley. They'll go just as quickly if I ignore them as they will if I obsess over them -- and I'll get the same credits either way.
>
Ahh, but you cannot get any more work or upload results!!!! What happens when you have processed your cached work?

The impact on users is as I have just written. If no work can be fetched or exchanged, then the wasted CPU cycles remain wasted. And we clients cannot up our stats! No work, no stats.
ID: 83705
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83706 - Posted: 7 Mar 2005, 23:45:11 UTC - in response to Message 83705.  

> > > You are right about the DoS similarities. But it is the impact on users that is the issue.
> >
> > Please explain "impact on the users" -- not the perceived impact, but the actual impact.
> >
> > ... and please explain it as the typical user who is doing as the project asked, donating CPU cycles that would otherwise be wasted.
> >
> > As I see it here, the "inconvenience" is that I have work units that are still on my disk waiting to be sent to Berkeley. They'll go just as quickly if I ignore them as they will if I obsess over them -- and I'll get the same credits either way.
> >
> Ahh, but you cannot get any more work or upload results!!!! What happens when you have processed your cached work?
>
> The impact on users is as I have just written. If no work can be fetched or exchanged, then the wasted CPU cycles remain wasted. And we clients cannot up our stats! No work, no stats.

I only cache a quarter day of work. I ran out of SETI on Friday, and LHC ran out of work over the weekend (they were up). I never had more than two E@H work units all weekend. If they'd all been down, well, my workstation was also being used for MY work.

I expect I'll see some new SETI work units today -- in the past I've always seen new work before uploads started working.

Credits will start showing up soon after work is uploaded.

Back before E@H was open and LHC was on hiatus, I ran my cache at about five days and crunched through several outages -- no problem.

I don't run my workstation any more or any less because of BOINC, and I don't run BOINC on my servers, because they make my money. I'm happy to donate as many spare CPU cycles as Berkeley can use.

... but the requirements are not the same for Berkeley as they are even for a small commerce site, or the file server in a small office.
ID: 83706
JigPu

Joined: 16 Feb 00
Posts: 99
Credit: 2,513,738
RAC: 0
Message 83824 - Posted: 8 Mar 2005, 5:53:40 UTC - in response to Message 83705.  

> The impact on users is as I have just written. If no work can be fetched or exchanged, then the wasted CPU cycles remain wasted. And we clients cannot up our stats! No work, no stats.
>
One simple solution (if you don't like the idea of attaching to a secondary project) is to increase the cache size.

I've been running BOINC almost since it first went into beta testing, and if there's one thing I've learned, it's that when Berkeley runs into a string of bad luck, it's usually a LONG string of bad luck (cut fiber, corrupted databases, faulty RAIDs, etc.). For this reason, I used to have my cache set at 7 days.* Berkeley has had stretches like this when it was more down than up, so it wasn't hard to learn that you run out of work quickly if you don't have a lot of WUs :D I don't think my 7-day cache ever ran dry, and with a simple preference change, it's not hard to get that kind of reliability either.

Admittedly, a huge cache isn't a solution to Berkeley's problems, but it is a very simple fix if somebody cares about making sure their machines always have work. The casual user with a 3- or 4-day cache will be able to survive a reasonable outage, but not anything serious. If they want to complain about Berkeley's reliability killing their caches, fine, but they should also consider giving themselves a bit more work (or attaching to a secondary/tertiary project for the same effect), as the sizing sketch below shows.
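
Back-of-the-envelope for sizing that cache (crunch time per WU varies a lot by machine; these numbers are only an example):

HOURS_PER_WU = 4          # illustrative crunch time per work unit
CACHE_DAYS = 7            # the "connect about every X days" preference

wus_needed = CACHE_DAYS * 24 / HOURS_PER_WU
print(f"a {CACHE_DAYS}-day cache holds about {wus_needed:.0f} WUs per CPU")
# a 7-day cache holds about 42 WUs per CPU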


*I don't run a 7-day cache anymore, since I'm now attached to 2 projects and can easily get away with a smaller cache.

Puffy
ID: 83824
