The Server Issues / Outages Thread - Panic Mode On! (118)

Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033404 - Posted: 21 Feb 2020, 21:46:32 UTC - in response to Message 2033403.  

And for extremely slow, rarely-on systems, 1 month is plenty of time for them to return a WU. It's actually plenty of time for them to return many WUs.
While having deadlines as short as one week wouldn't affect such systems, it would affect those that are having problems, be it hardware, internet, or power supply (fires, floods, storms etc). A 1 month deadline reduces the time it takes to clear a WU from the database, but still allows people time to recover from problems and not lose any of the work they have processed.


+1
ID: 2033404 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14016
Credit: 208,696,464
RAC: 304
Australia
Message 2033406 - Posted: 21 Feb 2020, 21:57:45 UTC - in response to Message 2033326.  

Theoretically they could replace all of these systems with just a couple AMD Epyc based servers.
A single modern dual socket Epyc server can have more cores than all those listed servers combined! There are even many single core chips - those must be really ancient.


indeed. by my count it's <60 cores and ~1TB RAM total. you can do that in a SINGLE socket Epyc board! 64 cores, 128 threads, MUCH better IPC, 1-2TB of faster DDR4 memory.

but it's probably best to at least spread it out over a couple systems to decrease sources of bottlenecks (network connectivity, disk I/O, etc) and to not have all your eggs in one basket so to speak in the case of hardware issues taking down the whole project lol.

this stuff isn't cheap, but we can dream. the point is, even if they upgrade to more modern setups, but not necessarily bleeding edge, they will be a lot better off. Intel Xeon E5-2600v2 chips can be had cheaply and are available in up to 12c/24t parts, and Registered ECC DDR3 RAM is cheap and plentiful. even a meager upgrade like that on some key systems would go a LONG way.


The present bottleneck is storage I/O (Input/Output). HDDs are not good for non-sequential work. While the recent plans to move to a new storage unit with larger capacity (and so much faster) HDDs may help, the present server accessing the data can no longer keep it all cached in memory.
Hence the present issues.

If the project were to put out their wish list for a new SSD based storage unit, and a new database server with significantly more RAM & faster CPUs (and ideally the replica as well), I'm sure the community would come to the party. That would well and truly fix the present database backlog issues. Of course once you remove one bottleneck, others will show up: as it is, the download servers have repeated issues meeting current demands, the upload server is continually having issues (any news on its replacement?), and the Scheduler is always having random issues meeting demand.
Replacing the existing database server with something much better would allow the replaced hardware to be used for those functions that are already showing signs of not coping. As old and limited as it is, the present database server is much more powerful & has much more RAM than most of the other servers.
Grant
Darwin NT
ID: 2033406 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22958
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033412 - Posted: 21 Feb 2020, 22:40:04 UTC - in response to Message 2033333.  

Yes, SETI@Home only gets a small fraction of the total Breakthrough Listen project's data - it only gets the bit that it is interested in and is capable of processing. A lot of the data collected is from the wrong frequency ranges, or the telescope doesn't have the equipment required to deliver the data to S@H. Just think how long we have gone without access to the data from Parkes, because technical issues have obstructed that feed.
Then there is LOFAR, which produces data from a set of frequencies unusable by S@H but is a contributor to the BTLP, or the UK Lovell telescope, or Merlin (and many other telescopes that don't operate in the manner that is required by S@H, but contribute data to the BTLP).
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033412 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2033417 - Posted: 21 Feb 2020, 23:12:30 UTC - in response to Message 2033320.  

....they really just need better hardware...


These are really powerful systems that they are currently running ¯\_(ツ)_/¯
---Edit--------------
Time and money are in short supply over there from what it seems.


We might be able to raise the money but how can we help with the Time?

I know. Let's put together a Shadow Data Center that keeps a full copy of the Primary. But do it with more Modern hardware.

Then as the other computers die, the Shadow takes over. After all "The Shadow Knows...."

Actually if we could come up with a way to remotely replicate the Primary without impacting its performance, that might not be a half bad idea. Once we are "up" they could take down the old system and move the Shadow hardware into place.

Let's see now.... If we get X number of dual core Epycs with "Maximum" memory for all 8 of those channels.... ;)

Tom
A proud member of the OFA (Old Farts Association).
ID: 2033417 · Report as offensive
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2033439 - Posted: 22 Feb 2020, 0:53:13 UTC - in response to Message 2033372.  

It would be grand if the project could meter work allocation out to the host computers based on their ability to return processed work.
But that would require more programming and a lot more work required on the project servers to figure out what to send or not send on every single work request.
Methinks the overhead would be too high to be worth it.
Meow.


. . Sadly that is a problem. But if an index were created for each host based on the daily return rate of that host this could be applied to work assignment. That would take time to construct and probably be very difficult to incorporate into the current systems. So is very unlikely. :(

Stephen

< shrug >


Would you honestly tell me that we don't have a single mind, either in the community at large or among the big brains in the postgraduate computer science classes at Berkeley, who could use the information already available about each host to create an algorithm in a matter of hours, one that limits every host to a commensurate and sane number? These are the tidbits of information we have:

Created 24 Nov 2019, 19:42:47 UTC
Average credit 46,727.03
Average turnaround time 0.37 days
Last time contacted server 22 Feb 2020, 0:38:19 UTC
Fraction of time BOINC is running 99.66%
While BOINC is running, fraction of time computing is allowed 100.00%

We already process the necessary information to justify daily download numbers, which are based upon each individual computer's participation in the project. Given a fixed limit of four weeks per work unit, that is enough to come up with one simple number that each computer's statistics could hold: tasks per unit of time.
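
A minimal sketch of what such a per-host number could look like, in Python. Everything named here is an assumption for illustration: `HostStats`, `tasks_returned_per_day`, `target_turnaround_days`, and the floor and ceiling values are hypothetical and are not BOINC's actual schema or scheduler logic. The only thing it leans on is Little's law (tasks in flight = return rate x average turnaround), so a cap of "return rate x the turnaround the project is willing to accept" keeps each host's cache proportional to what it actually returns.

```python
import math
from dataclasses import dataclass


@dataclass
class HostStats:
    """Per-host figures. Field names are illustrative, not BOINC's schema."""
    tasks_returned_per_day: float  # assumed to be derivable from server records
    avg_turnaround_days: float     # e.g. 0.37 on the host page quoted above


def estimated_in_progress(host: HostStats) -> float:
    """Little's law: tasks in flight = return rate x average turnaround."""
    return host.tasks_returned_per_day * host.avg_turnaround_days


def task_limit(host: HostStats,
               target_turnaround_days: float = 2.0,
               floor: int = 1,
               ceiling: int = 1000) -> int:
    """Cap on tasks in progress for one host.

    At the host's observed return rate, a full cache of this size is worked
    through in roughly target_turnaround_days (a hypothetical policy knob).
    The result is clamped to [floor, ceiling].
    """
    raw = host.tasks_returned_per_day * target_turnaround_days
    return max(floor, min(ceiling, math.ceil(raw)))


if __name__ == "__main__":
    hosts = {
        "fast GPU rig": HostStats(tasks_returned_per_day=500.0, avg_turnaround_days=0.37),
        "mid-range box": HostStats(tasks_returned_per_day=40.0, avg_turnaround_days=0.5),
        "rarely-on laptop": HostStats(tasks_returned_per_day=0.2, avg_turnaround_days=20.0),
    }
    for name, host in hosts.items():
        print(name, "limit:", task_limit(host),
              "in flight now (est.):", round(estimated_in_progress(host), 1))
```

The observed return rate already reflects how often BOINC is running and how often computing is allowed, so those two fractions are deliberately not multiplied in a second time. The real cost, as noted above, is that the server has to maintain and look up that rate on every work request.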
ID: 2033439 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033466 - Posted: 22 Feb 2020, 3:02:52 UTC

Still no new work? Where is the <panic> button?
ID: 2033466 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 14016
Credit: 208,696,464
RAC: 304
Australia
Message 2033467 - Posted: 22 Feb 2020, 3:08:41 UTC - in response to Message 2033466.  

Still no new work? Where is the <panic> button?
Every now and then some turn up, you just have to be extremely lucky with the timing of your request (even luckier than after the weekly outages).
Grant
Darwin NT
ID: 2033467 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 38678
Credit: 261,360,520
RAC: 489
Australia
Message 2033477 - Posted: 22 Feb 2020, 4:39:38 UTC

I don't know if there's a proper answer or solution to anything here, but remember that our deadlines were (and still are) based on what was considered fair in pre-BOINC days. What I was using back then took almost 10 days (222hrs in fact) to complete a CPU task (a pre-MMX job), while these days any processor that doesn't support at least SSE instructions will not work here (not that you'd want to run 1 of them now anyway). Plus the use of GPUs/iGPUs wasn't even thought of back then.

On a side note my 2 little rigs (just in huge cases) have been able to stay near or close to their cache limits for the last 9.5hrs (they were almost out of GPU work when I got up an hour before that), but maybe my 2 rigs are sitting in some sort of "Sweet Spot" (who actually knows anything these days?).

Cheers.
ID: 2033477 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19988
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2033479 - Posted: 22 Feb 2020, 5:14:12 UTC - in response to Message 2033477.  
Last modified: 22 Feb 2020, 5:14:32 UTC

Actually I think the deadlines were set, after the upgrade that produced the large differences in crunch times at different Angle Ranges, based on the work done by Joe Segur.
2007 - postid692684 - Estimates and Deadlines revisited
ID: 2033479 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033480 - Posted: 22 Feb 2020, 5:21:45 UTC

good news
Workunits waiting for assimilation is less than 4 million
Results returned and awaiting validation is less than 14 million.
progress.
ID: 2033480 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 38678
Credit: 261,360,520
RAC: 489
Australia
Message 2033484 - Posted: 22 Feb 2020, 7:35:07 UTC - in response to Message 2033479.  

Actually I think the deadlines were set, after the upgrade that produced the large differences in crunch times at different Angle Ranges, based on the work done by Joe Segur.
2007 - postid692684 - Estimates and Deadlines revisited
And back in 2007 you could still use a pre SSE instruction CPU. ;-)

Cheers
ID: 2033484 · Report as offensive
Ghia
Joined: 7 Feb 17
Posts: 238
Credit: 28,911,438
RAC: 50
Norway
Message 2033488 - Posted: 22 Feb 2020, 8:14:13 UTC

I've been running Boinc/Seti 24/7 for 3 years and haven't seen a cuda task since my system stabilized with the SOG app.
Imagine my surprise when I discovered 17 of them this morning. What's up with that? I've made no changes to that system in ages....
Humans may rule the world...but bacteria run it...
ID: 2033488 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22958
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033489 - Posted: 22 Feb 2020, 8:46:13 UTC

Every now and then the servers will send out a few tasks destined for other apps to confirm that you are still using the best performing one.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033489 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22958
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033490 - Posted: 22 Feb 2020, 9:01:20 UTC

We might be able to raise the money but how can we help with the Time?


A fairly simple solution.
Recently we've been looking at the capital money (e.g. buying new disks), but there needs to be a fair bit of money for wages, rent, power etc - the revenue spend. My figures might be a bit off, but I would guess it would cost about $100k to cover salary and all other costs for a single person for a year. While not as glamorous as a lump of hardware, having another pair of hands would ease the burden of daily server tasks and allow the likes of Eric to develop the software, prepare grant applications etc.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033490 · Report as offensive
Ghia
Joined: 7 Feb 17
Posts: 238
Credit: 28,911,438
RAC: 50
Norway
Message 2033493 - Posted: 22 Feb 2020, 9:30:42 UTC - in response to Message 2033489.  

Every now and then the servers will send out a few tasks destined for other apps to confirm that you are still using the best performing one.

Tnx...makes sense...just finding it funny that it took 3 years... :)
Humans may rule the world...but bacteria run it...
ID: 2033493 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033503 - Posted: 22 Feb 2020, 10:13:31 UTC - in response to Message 2033493.  

Tnx...makes sense...just finding it funny that it took 3 years... :)
It just took 3 years for you to notice. I guess you didn't watch your queue all the time during those 3 years to make sure no non-SoG task slipped through.
ID: 2033503 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 717
Credit: 8,032,827
RAC: 62
France
Message 2033511 - Posted: 22 Feb 2020, 11:30:59 UTC
Last modified: 22 Feb 2020, 11:32:02 UTC

ID: 2033511 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22958
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033514 - Posted: 22 Feb 2020, 12:11:17 UTC

It hasn't reached its deadline, so why panic?

Actually that is one of mine, and as I'm not going to be able to get back to it for a few weeks and sort out its power supply, you'll just have to be patient.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033514 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 717
Credit: 8,032,827
RAC: 62
France
Message 2033522 - Posted: 22 Feb 2020, 13:56:27 UTC

i don't panic at all ;)
ID: 2033522 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033530 - Posted: 22 Feb 2020, 15:12:48 UTC
Last modified: 22 Feb 2020, 15:15:44 UTC

The splitters are down..... again..... This is going to be a long weekend of ups & downs.

No <panic> yet. The WU cache is holding. Need to go for some more six packs for the festivities.

<edit> Just because I wrote this message, they are up again.

Could be a hell of a coincidence, but apparently each time the AP splitters start, the others go down for some time.
ID: 2033530 · Report as offensive


 
©2026 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.