The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)

Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2033580 - Posted: 22 Feb 2020, 22:19:08 UTC - in response to Message 2033490.  

We might be able to raise the money but how can we help with the Time?


A fairly simple solution.
Recently we've been looking at the capital costs (e.g. buying new disks), but there also needs to be a fair bit of money for wages, rent, power etc - the revenue spend. My figures might be a bit off, but I would guess it would cost about $100k to cover salary and all other costs for a single person for a year. While not as glamorous as a lump of hardware, another pair of hands would ease the burden of daily server tasks and free the likes of Eric to develop the software, prepare grant applications etc.


So how do we raise/funnel the money to fund another Programmer/System Admin?

I assume that the director would have to be willing to hire....
And I assume we would need to come up with more than a single year's worth of funding.

Tom
A proud member of the OFA (Old Farts Association).
ID: 2033580
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14010
Credit: 208,696,464
RAC: 304
Australia
Message 2033583 - Posted: 22 Feb 2020, 22:26:34 UTC - in response to Message 2033530.  

The splitters are down..... again..... This is going to be a long weekend of ups & downs.

No <panic> yet. The WU cache is holding. Need to go for some more six packs for the festivities.

<edit> Just because I wrote this message, they are up again.

Could be a hell of a coincidence, but apparently each time the AP splitters start, the other goes down for some time.
The splitter output is still being limited by the database size.
For a while now it's been running in very short bursts.
Most of my requests don't result in any work, but when I do get work I get plenty of it for a couple of requests & it's enough to keep the Linux system fed and the Windows system full to the server-side limits.
Grant
Darwin NT
ID: 2033583
pututu

Joined: 21 Jul 16
Posts: 12
Credit: 10,108,801
RAC: 6
United States
Message 2033599 - Posted: 23 Feb 2020, 0:03:57 UTC

There are 14 servers of various 2X quad/hex core machines running. How much does their electricity consumption cost annually? Considering that these servers are in California (Berkeley?), where electricity rates are relatively high, it would be good to know the actual annual electricity cost. My gut feeling is that if we can consolidate this down to a few (Ivy Bridge, Haswell, Broadwell or EPYC) servers, the payback should be favorable?

From reading some of the posts below, it seems that there are two or three main items that can help to reduce the current outage/panic/server issues:
1. Tasks/WU management and distribution (e.g. setting short deadline, limit the number of tasks downloaded depending on task time return, etc, etc)
2. Hardware (bandwidth, CPU speed, storage, RAM, etc).
3. Maybe software upgrade/overhaul (goes together with item 2 to take advantage of the newer hardware capabilities?)

From my limited experience and exposure to SETI@home, this is what I think. Item 1 seems like something that can be optimized relatively easily compared to items 2 and 3, and could perhaps serve as a short-term solution. It may not cost a lot of money and time compared to items 2 or 3, or I'm completely out of touch. However, after optimizing the task management/distribution, the project will likely hit a wall again because it will eventually be hardware limited if SETI@home decides to attract more donors.

My humble one cent. :)
ID: 2033599
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2033606 - Posted: 23 Feb 2020, 1:21:26 UTC - in response to Message 2033599.  
Last modified: 23 Feb 2020, 2:21:11 UTC

There are 14 servers of various 2X quad/hex core machines running. How much does their electricity consumption cost annually? Considering that these servers are in California (Berkeley?), where electricity rates are relatively high, it would be good to know the actual annual electricity cost. My gut feeling is that if we can consolidate this down to a few (Ivy Bridge, Haswell, Broadwell or EPYC) servers, the payback should be favorable?

From reading some of the posts below, it seems that there are two or three main items that can help to reduce the current outage/panic/server issues:
1. Tasks/WU management and distribution (e.g. setting short deadline, limit the number of tasks downloaded depending on task time return, etc, etc)
2. Hardware (bandwidth, CPU speed, storage, RAM, etc).
3. Maybe software upgrade/overhaul (goes together with item 2 to take advantage of the newer hardware capabilities?)

From my limited experience and exposure to SETI@home, this is what I think. Item 1 seems like something that can be optimized relatively easily compared to items 2 and 3, and could perhaps serve as a short-term solution. It may not cost a lot of money and time compared to items 2 or 3, or I'm completely out of touch. However, after optimizing the task management/distribution, the project will likely hit a wall again because it will eventually be hardware limited if SETI@home decides to attract more donors.

My humble one cent. :)


These are definitely higher costs than most areas, which is counterintuitive given California's power situation: the state is actually paying Arizona to take power off its grid due to overproduction. One thing which is very debatable, in my most humble of opinions, is the ability for Berkeley to interest a soul in working there for a mere $100K/yr in one of the top 5 highest cost-of-living areas in the nation, possibly the world. Perhaps some undergrad attending the university wishing to enter into an internship. We might as well be talking about funding a solar/wind generation site for the university, for cost reduction and for redundancy when the state or PG&E forces brownouts. Perhaps we can gain interest from the panic crowds in the global climate debates??? :) That would be a long-term reduction of costs for the university as a whole.

#1 is a very feasible suggestion, and most of that work is already done if you look at the breakdown per host on a per-plan-class basis (it needs a bit of tweaking, but it is a good start). It would be feasible to limit outgoing work based on a machine's own track record, and that should be easy to implement with little overhead.

For the sake of the crowd who wants to allow the largest number of members the ability to crunch numbers, I would suggest that we set limits based on a security measure. That is, don't allow hardware which will not run a supported operating system that is still actively receiving fixes, or disallow hardware which falls outside that scope by more than 2 years. That just makes sense, and gives the system a higher floor, which corresponds with actual growth in the field.

There is something to be said about the soft bigotry of low expectations here. The system as a whole should not cut off its nose to spite its face. We want people to participate at maximum potential, but when the entire system suffers because of our desire to allow some to run antiquated hardware, we shoot for mediocrity, not excellence.

Edit* That said, if one sets limits too low on their equipment, or doesn't turn on their computer very often, that is a choice the person makes. Setting a much lower turnaround limit is not an impractical move. If storage space, memory, and software limitations were not an issue, we wouldn't even be having this discussion. We are, however, having this discussion. These policies have led us to limits set not by the hardware, but by our own permissiveness in policy. This is having serious ramifications system-wide, and it is correctable. We can talk about increasing the hardware in a dream list, and come up with other measures at a cost. We can also make a decision to change a policy and avoid the cost, while at the same time correcting the system and extending the life of these systems as they exist. My humble opinion is that policy > hardware changes, at least in the foreseeable timeframe. This can let some make rational purchase decisions for the future based on sensible, planned needs.

My stinky opinion for what it's worth.
ID: 2033606
Ghia

Joined: 7 Feb 17
Posts: 238
Credit: 28,911,438
RAC: 50
Norway
Message 2033626 - Posted: 23 Feb 2020, 7:43:20 UTC - in response to Message 2033503.  

Tnx...makes sense...just finding it funny that it took 3 years... :)
It just took 3 years for you to notice. I guess you didn't watch your queue all the time during those 3 years to make sure no non-SoG task slipped through.

I look through my tasks each morning and at least a couple of times throughout the day. On my slow cruncher, that would catch such strangers.
Humans may rule the world...but bacteria run it...
ID: 2033626
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22922
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033631 - Posted: 23 Feb 2020, 8:43:23 UTC - in response to Message 2033580.  

Simple - just make a donation to SETI@Home via your preferred mechanism; if you don't say it is for a special fundraiser, then it goes into the pool for the general running of the project.
The funding for a new member of staff should not be for one year, but for a number of years - short-term staff (including interns and research students) are OK for doing a specific, well-defined task, but what the project needs is a stable core of staff onto which one adds the short-term staff.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033631
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2033718 - Posted: 24 Feb 2020, 2:59:44 UTC - in response to Message 2033606.  


My stinky opinion for what it's worth.


It does help keep our sinuses clear though... :)

Tom
A proud member of the OFA (Old Farts Association).
ID: 2033718
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2033719 - Posted: 24 Feb 2020, 3:04:14 UTC
Last modified: 24 Feb 2020, 3:04:41 UTC

For what it is worth, of the projects I participate in, SETI@home has the most relaxed "due" schedule. Many of my other projects allow a week or less per task.

An experiment might be to reduce the deadline to, say, 2 weeks and see if the load drops off because tasks are no longer sitting idle in the DB.

If possible, could we split the deadlines, making the GPU task deadlines, say, a week or under?

Tom
A proud member of the OFA (Old Farts Association).
ID: 2033719
Gene Project Donor

Joined: 26 Apr 99
Posts: 150
Credit: 48,393,279
RAC: 118
United States
Message 2033727 - Posted: 24 Feb 2020, 5:48:38 UTC

TomM proposes/suggests reducing the deadline to 2 weeks. I would vote for a less "ambitious" adjustment to the deadlines. I do observe that the AstroPulse tasks are issued with a 26-day deadline, as compared to the 60-day deadline for everything else. If the deadline were reduced to perhaps 40 or 50 days and allowed to remain there a couple of months (i.e. long enough to stabilize to some sort of equilibrium) that ought to give the project some hard data on the effects on database issues and resend statistics. Then decide whether it was a mistake - and revert to previous values; or, decide it was a positive move and, perhaps, continue adjusting deadlines in similar small steps.
ID: 2033727
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14010
Credit: 208,696,464
RAC: 304
Australia
Message 2033731 - Posted: 24 Feb 2020, 7:25:42 UTC - in response to Message 2033719.  

An experiment might be reduce the deadline to say 2 weeks
See all of my previous posts on reducing deadlines.
Grant
Darwin NT
ID: 2033731
Speedy
Volunteer tester

Joined: 26 Jun 04
Posts: 1649
Credit: 12,921,799
RAC: 89
New Zealand
Message 2033732 - Posted: 24 Feb 2020, 7:30:09 UTC

Tasks are being removed after they have sat in your validated list for a period of time. But I am unsure if the purge queue on the server status page is working properly, because I have not seen it rise over 300,000 in probably almost a week. I'm just pleased that results are being deleted.
I also see that we are back up to over 6 million results out in the field.
ID: 2033732
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22922
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033734 - Posted: 24 Feb 2020, 9:16:06 UTC
Last modified: 24 Feb 2020, 9:18:09 UTC

I'm republishing this chart I posted a month ago. The x-axis is time (days pending), the y-axis is the percentage remaining to be validated.
At 10 days pending there are still between 50% & 60% of the tasks waiting; at 30 days this drops to between 20% and 30%; and by 50 days we are down into the noise.



The peculiar gap between 30 and 40 days was the "Christmas Debacle", when nothing really got validated.

So what does this mean?
A deadline reduction to 10 days would shove the resends through the roof, and turn a large number of currently useful-if-slow hosts into slow-but-useless hosts - they just wouldn't return their data in time, and that data forms a very large proportion of the total.
A deadline reduction to 30 days would still push the resends through the roof, and again turn a large number of currently useful-if-slow hosts into slow-but-useless hosts - they just wouldn't return their data in time, and that data forms a fair proportion of the total.
A deadline of 40 days would see between 10% and 20% resends, getting less hostile to the slow-but-productive hosts.
A deadline of 50 days, and the resends drop to ~5%, much less hostile to the slow-but-productive hosts.

I would suggest that the sweet-spot may be deadlines around 40-50 days, where the impact on the slowest hosts is probably about as low as one can reasonably expect.
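As a rough illustration only (the anchor points below are read off the posted chart, so they are approximate, and the interpolation is my own back-of-envelope estimate, not project data), the relationship between deadline length and expected resends described above can be sketched like this:

```python
# Anchor points from the posted pending-vs-days chart (approximate midpoints):
# days pending -> % of tasks still awaiting validation at that age.
pending_pct = {
    10: 55,  # "between 50% and 60%"
    30: 25,  # "between 20% and 30%"
    40: 15,  # "between 10% and 20%"
    50: 5,   # "~5%", down into the noise
}

def estimated_resend_pct(deadline_days: float) -> float:
    """Linearly interpolate between the charted anchor points.

    Tasks still pending when the deadline passes would become resends.
    """
    days = sorted(pending_pct)
    if deadline_days <= days[0]:
        return float(pending_pct[days[0]])
    if deadline_days >= days[-1]:
        return float(pending_pct[days[-1]])
    for lo, hi in zip(days, days[1:]):
        if lo <= deadline_days <= hi:
            frac = (deadline_days - lo) / (hi - lo)
            return pending_pct[lo] + frac * (pending_pct[hi] - pending_pct[lo])

for d in (10, 20, 30, 40, 45, 50):
    print(f"deadline {d:2d} days -> ~{estimated_resend_pct(d):.0f}% resends")
```

By this crude estimate a 45-day deadline lands at roughly 10% resends, which is consistent with the 40-50 day sweet spot suggested above.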
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033734
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033735 - Posted: 24 Feb 2020, 9:37:39 UTC - in response to Message 2033734.  

It would be interesting to know - but we can probably only speculate - under what circumstances a slow-but-useful computer returns a valid task after 50 days.

1) It really is that slow!
2) It only crunches part time.
3) It crunches, but only a small proportion of its time is spent on SETI
4) It broke down, but the owner took time to source and install replacement parts
5) It broke down, but the owner didn't notice
6) It was caught by a driver problem, but the owner didn't know how to handle it
7) It spends a lot of time out of reach of the internet

and so on. I'm sure we can think of many more.
ID: 2033735
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033737 - Posted: 24 Feb 2020, 9:43:47 UTC - in response to Message 2033719.  

For what it is worth, of the projects I participate in, SETI@home has the most relaxed "due" schedule. Many of my other projects allow a week or less per task.
An experiment might be to reduce the deadline to, say, 2 weeks and see if the load drops off because tasks are no longer sitting idle in the DB.
If possible, could we split the deadlines, making the GPU task deadlines, say, a week or under?
Tom


. . Tom! Are you after my nick? Talk like that might get you excommunicated. ...

Stephen

:)
ID: 2033737
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033738 - Posted: 24 Feb 2020, 9:46:50 UTC - in response to Message 2033727.  

TomM proposes/suggests reducing the deadline to 2 weeks. I would vote for a less "ambitious" adjustment to the deadlines. I do observe that the AstroPulse tasks are issued with a 26-day deadline, as compared to the 60-day deadline for everything else. If the deadline were reduced to perhaps 40 or 50 days and allowed to remain there a couple of months (i.e. long enough to stabilize to some sort of equilibrium) that ought to give the project some hard data on the effects on database issues and resend statistics. Then decide whether it was a mistake - and revert to previous values; or, decide it was a positive move and, perhaps, continue adjusting deadlines in similar small steps.


. . I would vote for 28 days myself. I remain convinced the project would be perfectly viable with an even shorter deadline but in the spirit of compromise 28 days seems way more than sufficient.

Stephen

. .
ID: 2033738
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14010
Credit: 208,696,464
RAC: 304
Australia
Message 2033739 - Posted: 24 Feb 2020, 9:49:42 UTC - in response to Message 2033734.  
Last modified: 24 Feb 2020, 9:55:48 UTC

So what does this mean?
A deadline reduction to 10 days would shove the resends through the roof, and turn a large number of currently useful-if-slow hosts into slow-but-useless hosts - they just wouldn't return their data in time and that data forms a very large proportion of the total
....
I don't believe that to be the case.
The reason there are so many systems that take so long to return work is the existing long deadlines. We are allowing it to occur. And even so, the average turnaround time at present is only 34 hours!

Even the slowest of the slow systems can return a WU within 2 days. Even allowing for them spending much of their time not actually processing work, or working on another project, they can still return the longest-to-process WU within a week. But people do have issues - power, comms, system etc.
So we set deadlines at 4 weeks. In that time even the slowest of the slow, one that spends most of its time powered off, will still be able to return several WUs. And even if there are floods, fires, storms etc. that make it impossible for systems to return work within a week, that 28-day deadline will still let people return finished work before it times out.
Grant
Darwin NT
ID: 2033739
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033740 - Posted: 24 Feb 2020, 9:58:56 UTC - in response to Message 2033734.  

I would suggest that the sweet-spot may be deadlines around 40-50 days, where the impact on the slowest hosts is probably about as low as one can reasonably expect.


. . Except that the majority of that 'delay' on the slow hosts is due not to their low productivity so much as their oversized caches. The reason they sit on tasks for 50 days is not that it takes them that long to process a task, but that WUs sit in 'in progress' status for weeks on end before they get around to processing them. Shortening the deadline and, if necessary, reducing their work fetch limits would eliminate that unnecessary period of WUs sitting in purgatory. To avoid large numbers of timeouts and system-imposed work allocation limits, they would have to actually administer their hosts more responsibly and reduce their caches to a size that matches their level of productivity. What a shame that would be ...

Stephen

:(
ID: 2033740
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033741 - Posted: 24 Feb 2020, 10:08:16 UTC - in response to Message 2033740.  

... oversized caches ...
Now there's a challenge! I'll have a look through some of my pendings later, and see how many of my wingmates fall into that category.
ID: 2033741
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14010
Credit: 208,696,464
RAC: 304
Australia
Message 2033743 - Posted: 24 Feb 2020, 10:18:51 UTC - in response to Message 2033740.  

. . Except that the majority of that 'delay' on the slow hosts is not due to their low productivity so much as their oversized caches. The reason they sit on tasks for 50 days is not because it takes them that long to process a task, but because WUs sit in their 'in progress' status for weeks on end before they get around to processing them.
In theory, if a WU is going to be processed at all, it should be done within 20 days (10+10 cache settings).*
Anything longer than that, if the WU is still returned by that host, would most likely be due to outside factors (system, power, comms etc. issues), or a recently connected very slow host, possibly with more than one project and 10+10 cache settings, still figuring things out.



*Unless bunkering or other such user manipulation is at play.
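Purely as an illustration of the arithmetic above (the 10+10 figures are the two BOINC cache-day settings at their maximums; the function is just that sum, not anything the server enforces):

```python
# Illustrative sketch: upper bound on how old a task can be before a host
# even starts it, given the two BOINC cache settings ("store at least N days
# of work" plus "store up to an additional N days") both set to 10 days.

def worst_case_age_days(store_at_least: float, store_additional: float) -> float:
    """A task fetched to top up the cache may wait the whole cache depth."""
    return store_at_least + store_additional

# 10 + 10 day settings -> a WU can sit ~20 days before processing begins,
# so anything returned much later points to outside factors.
print(worst_case_age_days(10, 10))  # 20.0
```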
Grant
Darwin NT
ID: 2033743
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22922
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033747 - Posted: 24 Feb 2020, 11:10:29 UTC

Just take a look at the graph before making ANY assumption along the lines of "it would have no effect" or "it's only because of the long deadlines" - those two claims are totally and utterly WRONG.

The truth is, and some do not accept this, that SETI@Home has a POLICY of supporting a very wide range of computer performance, and of human activity such as holidays, stopping a host and forgetting about it, infrequent processing and so on. A twenty-day deadline would mean about 40% of the tasks sent out would have to be resent, and, as these are probably on hosts that only do a very small number of tasks per year, that means alienating a very large proportion of the user base - which, according to many reports, is shrinking. Do you want to decimate that base overnight?
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033747



 
©2026 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.