The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2033387 - Posted: 21 Feb 2020, 21:03:33 UTC - in response to Message 2033381.  

One way to discourage oversized caches would be to include the turnaround time in the credit calculation. Return the result immediately for maximum credit, and the longer you sit on it, the less you get.

Having a two-week cache would be a lot less cool if it hurt your RAC ;)

I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".

GPUGrid rewards fast-turnaround hosts with 50% more credit if work is returned within 24 hours, and 25% more if it is returned within 48 hours. The same could be implemented here.
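
For illustration, a minimal sketch (Python, with an invented function name; not actual BOINC or SETI@home server code) of how a GPUGrid-style turnaround bonus could be applied:

# Hypothetical sketch of a GPUGrid-style turnaround bonus.
# Thresholds and multipliers mirror the figures quoted above;
# this is not actual BOINC/SETI@home validator code.
def credit_with_turnaround_bonus(base_credit: float, turnaround_hours: float) -> float:
    """Scale granted credit by how quickly the result came back."""
    if turnaround_hours <= 24:
        return base_credit * 1.50   # +50% if returned within 24 hours
    if turnaround_hours <= 48:
        return base_credit * 1.25   # +25% if returned within 48 hours
    return base_credit              # no bonus beyond 48 hours

# Example: a 100-credit task returned after 30 hours earns 125 credits.
print(credit_with_turnaround_bonus(100.0, 30.0))
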
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2033387 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2033388 - Posted: 21 Feb 2020, 21:08:06 UTC - in response to Message 2033381.  

I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".

Actually, you can, if you set

<report_results_immediately>1</report_results_immediately>

in the cc_config.xml file. From the client configuration wiki:

<report_results_immediately>0|1</report_results_immediately>
If 1, each job will be reported to the project server as soon as it's finished, with an inbuilt 60-second delay from completion of the result upload (normally reporting is deferred for up to one hour, so that several jobs can be reported in one request). Using this option increases the load on project servers and should generally be avoided; it is intended only for computers whose disks are reformatted daily.


But early overflows that run for only 15 seconds would still only get reported at each 305-second scheduler connect interval.
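
For reference, a minimal cc_config.xml sketch showing where that option sits (all other options omitted):

<cc_config>
    <options>
        <report_results_immediately>1</report_results_immediately>
    </options>
</cc_config>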
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2033388 · Report as offensive
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 715
Credit: 8,032,827
RAC: 62
France
Message 2033389 - Posted: 21 Feb 2020, 21:08:57 UTC

It's not really a problem of the fastest hosts against the slowest ones, but...

1. The slowest hosts are numerous, so getting their pending work cross-validated is not a problem (slow CPUs with few cores, slow GPUs) - like mine, but running 24 hours a day, 7 days a week, 52 weeks a year, and for a whole lifetime I hope...

2. The intermediate hosts (multi-core CPU and a fast GPU) are not the problem either; there are enough of them that their pending work gets cross-validated against another intermediate host.

3. The problem is that there are only a few of the fastest hosts (multi-core CPU, multiple fast GPUs), so in practice they can't get their pending work cross-validated against another equally fast computer... they have to wait for the hosts in groups 2 and 1.


Some possible solutions...

Increase the number of the fastest hosts in group 3... (not really possible: money cost, electricity cost, maintenance, difficult configuration, etc.), plus the server-side problems of feeding all of them.

Split the fastest hosts of group 3 into group 2... more intermediate and upper-intermediate hosts would increase the cross-validation of pending work between them, and they are easier to configure, to supply with work, etc.

Eliminate all the slowest hosts... which goes against the values of the scientific project that SETI is.

Or all the proposals you've made before me :D

(sorry for the language mistakes ) :p
ID: 2033389 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033394 - Posted: 21 Feb 2020, 21:21:11 UTC
Last modified: 21 Feb 2020, 21:22:18 UTC

Looks like the current panic is ending. My hosts are only a few tasks short of having full caches now.
ID: 2033394 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033397 - Posted: 21 Feb 2020, 21:29:43 UTC - in response to Message 2033341.  

With a simple look at the SSP you see: Results returned and awaiting validation 0 35,474 14,150,778
Why is this number so high? Surely not because of the superfast or spoofed hosts. It comes from the slow hosts (the vast majority of hosts) and the long WU deadline.
Actually, about 9 million of those 14 million results are in there not because someone is still crunching them, but because the corresponding workunit is stuck in the assimilation queue.
ID: 2033397 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033400 - Posted: 21 Feb 2020, 21:38:50 UTC - in response to Message 2033369.  

The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
As it should be.

Meow.


. . So how slow a computer would you need to take 12 weeks to process one WU?????

Stephen

. . Just curious ....

:)

Most people do NOT run their computers 24/7, and therefore a slow computer that runs maybe only a few hours a week can take a very long time to crunch one task. Should they be kicked out from their long-time interest in SETI just because the 24/7 club wants all the WUs they can get?

This is getting into elitist territory now, and that is not what SETI is (was?) about.



I think this is a valid point. I didn't think of the people who have slow machines and are only on for a few hours every day. I would love to see the stats on machines in that category and see IF the return-time allowance could be shortened. I want to be as inclusive as possible, but I'll be honest: if the number of machines in that category is small, it might be better to sacrifice a few participants to maybe gain even more. How many people join the project but quit because it isn't stable and they can't reliably get WUs every week? How much shorter would the Tuesday outage be if the database were a more manageable size?

I want SETI to be run by as many people as possible. This project is not only about finding the alien; it is also about PR and making people feel part of something. But what if we are losing more people by not changing the system to run better?
ID: 2033400 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 2033403 - Posted: 21 Feb 2020, 21:44:57 UTC - in response to Message 2033369.  

Most people do NOT run their computers 24/7, and therefore a slow computer that runs maybe only a few hours a week can take a very long time to crunch one task. Should they be kicked out from their long-time interest in SETI just because the 24/7 club wants all the WUs they can get?

This is getting into elitist territory now, and that is not what SETI is (was?) about.
And for extremely slow, rarely-on systems, one month is plenty of time to return a WU. It's actually plenty of time for them to return many WUs.
While deadlines as short as one week wouldn't affect such systems, they would affect those that are having problems, be it hardware, internet, or power supply (fires, floods, storms, etc.). A one-month deadline reduces the time it takes to clear a WU from the database, but still allows people time to recover from problems without losing any of the work they have processed.
Grant
Darwin NT
ID: 2033403 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033404 - Posted: 21 Feb 2020, 21:46:32 UTC - in response to Message 2033403.  

And for extremely slow, rarely-on systems, one month is plenty of time to return a WU. It's actually plenty of time for them to return many WUs.
While deadlines as short as one week wouldn't affect such systems, they would affect those that are having problems, be it hardware, internet, or power supply (fires, floods, storms, etc.). A one-month deadline reduces the time it takes to clear a WU from the database, but still allows people time to recover from problems without losing any of the work they have processed.


+1
ID: 2033404 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 2033406 - Posted: 21 Feb 2020, 21:57:45 UTC - in response to Message 2033326.  

Theoretically they could replace all of these systems with just a couple of AMD Epyc-based servers.
A single modern dual-socket Epyc server can have more cores than all those listed servers combined! There are even many single-core chips in the list - those must be really ancient.


Indeed. By my count it's <60 cores and ~1 TB of RAM in total. You can do that with a SINGLE-socket Epyc board! 64 cores, 128 threads, MUCH better IPC, and 1-2 TB of faster DDR4 memory.

But it's probably best to spread it out over at least a couple of systems, to reduce sources of bottlenecks (network connectivity, disk I/O, etc.) and so as not to have all your eggs in one basket, so to speak, in case a hardware failure takes down the whole project, lol.

This stuff isn't cheap, but we can dream. The point is, even if they upgrade to more modern setups that aren't necessarily bleeding edge, they will be a lot better off. Intel Xeon E5-2600 v2 chips can be had cheaply and are available in up to 12c/24t parts, and registered ECC DDR3 RAM is cheap and plentiful. Even a meagre upgrade like that on some key systems would go a LONG way.


The present bottleneck is storage I/O (input/output): HDDs are not good for non-sequential work. While the recent plans to move to a new storage unit with larger-capacity (and so much faster) HDDs may help, the server accessing the data can no longer keep it all cached in memory.
Hence the present issues.

If the project were to put out their wish list for a new SSD-based storage unit and a new database server with significantly more RAM and faster CPUs (and ideally the replica as well), I'm sure the community would come to the party. That would well and truly fix the present database backlog issues. Of course, once you remove one bottleneck, others will show up - as it is, the download servers have repeated issues meeting current demand, the upload server is continually having issues (any news on its replacement?), and the scheduler is always having random issues meeting demand.
Replacing the existing database server with something much better would allow the replaced hardware to be used for those functions that are already showing signs of not coping. As old and limited as it is, the present database server is much more powerful and has much more RAM than most of the other servers.
Grant
Darwin NT
ID: 2033406 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033412 - Posted: 21 Feb 2020, 22:40:04 UTC - in response to Message 2033333.  

Yes, SETI@Home only gets a small fraction of the total Breakthrough Listen project's data - it only gets the bit that it is interested in and is capable of processing. A lot of the data collected is from the wrong frequency ranges, or the telescope doesn't have the equipment required to deliver the data to S@H. Just think how long it has taken to (still not) gain access to the data from Parkes; technical issues have obstructed that feed.
Then there is LOFAR, which produces data at frequencies unusable by S@H but is a contributor to the BTLP, or the UK's Lovell telescope, or MERLIN (and many other telescopes that don't operate in the manner required by S@H but do contribute data to the BTLP).
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033412 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2033417 - Posted: 21 Feb 2020, 23:12:30 UTC - in response to Message 2033320.  

....they really just need better hardware...


These are really powerful systems that they are currently running ¯\_(ツ)_/¯
---Edit--------------
Time and money are in short supply over there from what it seems.


We might be able to raise the money but how can we help with the Time?

I know. Let's put together a Shadow Data Center that keeps a full copy of the Primary, but do it with more modern hardware.

Then as the other computers die, the Shadow takes over. After all "The Shadow Knows...."

Actually if we could come up with a way to remotely replicate the Primary without impacting its performance, that might not be a half bad idea. Once we are "up" they could take down the old system and move the Shadow hardware into place.

Let's see now... If we get X number of dual-socket Epycs with "maximum" memory in all 8 of those channels... ;)

Tom
A proud member of the OFA (Old Farts Association).
ID: 2033417 · Report as offensive
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2033439 - Posted: 22 Feb 2020, 0:53:13 UTC - in response to Message 2033372.  

It would be grand if the project could meter out work allocation to the host computers based on their ability to return processed work.
But that would require more programming and a lot more work on the project servers to figure out what to send or not send on every single work request.
Methinks the overhead would be too high to be worth it.
Meow.


. . Sadly that is a problem. But if an index were created for each host, based on that host's daily return rate, it could be applied to work assignment. That would take time to construct and would probably be very difficult to incorporate into the current systems, so it is very unlikely. :(

Stephen

< shrug >


Would you honestly tell me that there isn't a single mind, either in the community at large or among the big brains in the post-graduate computer science classes at Berkeley, who could use the information already held about each host to create an algorithm, in a matter of hours, that limits every host to a commensurate and sane number of tasks? These are the tidbits of information we already have:

Created 24 Nov 2019, 19:42:47 UTC
Average credit 46,727.03
Average turnaround time 0.37 days
Last time contacted server 22 Feb 2020, 0:38:19 UTC
Fraction of time BOINC is running 99.66%
While BOINC is running, fraction of time computing is allowed 100.00%

We already process the information needed to justify daily download numbers based on each individual computer's participation in the project. Given a fixed limit of four weeks per work unit, that could be reduced to one simple number held in each computer's statistics: tasks per unit of time.
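
As a rough illustration, a short sketch (Python; the two-day buffer and 150-task ceiling are invented numbers, and this is not actual BOINC scheduler code) of the "tasks per unit of time" limit being described:

# Hypothetical sketch: cap each host's in-progress work at what it
# demonstrably returns. All numbers here are illustrative only.
def allowed_in_progress(tasks_returned_per_day: float,
                        buffer_days: float = 2.0,
                        hard_cap: int = 150) -> int:
    """Allow enough work for buffer_days, based on the host's own measured return rate."""
    return min(hard_cap, max(1, round(tasks_returned_per_day * buffer_days)))

# Example: a host averaging 100 returned tasks/day would get 200 in
# progress by rate alone, clipped to the 150-task ceiling.
print(allowed_in_progress(100.0))   # -> 150
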
ID: 2033439 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033466 - Posted: 22 Feb 2020, 3:02:52 UTC

Still no new work? Where is the <panic> button?
ID: 2033466 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 2033467 - Posted: 22 Feb 2020, 3:08:41 UTC - in response to Message 2033466.  

Still no new work? Where is the <panic> button?
Every now and then some turn up; you just have to be extremely lucky with the timing of your request (even luckier than after the weekly outages).
Grant
Darwin NT
ID: 2033467 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 2033477 - Posted: 22 Feb 2020, 4:39:38 UTC

I don't know if there's a proper answer or solution to anything here, but remember that our deadlines were (and still are) based on what was considered fair back in the pre-BOINC days. What I was using back then took almost 10 days (222 hours, in fact) to complete a CPU task (a pre-MMX job), while these days any processor that doesn't support at least SSE instructions won't work here at all (not that you'd want to run one of them now anyway). Plus, the use of GPUs/iGPUs wasn't even thought of back then.

On a side note, my 2 little rigs (just in huge cases) have been able to stay at or near their cache limits for the last 9.5 hours (they were almost out of GPU work when I got up an hour before that), but maybe my 2 rigs are sitting in some sort of "sweet spot" (who actually knows anything these days?).

Cheers.
ID: 2033477 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19013
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2033479 - Posted: 22 Feb 2020, 5:14:12 UTC - in response to Message 2033477.  
Last modified: 22 Feb 2020, 5:14:32 UTC

Actually I think the deadlines were set, after the upgrade that produced the large differences in crunch times at different Angle Ranges, based on the work done by Joe Segur.
2007 - postid692684 - Estimates and Deadlines revisited
ID: 2033479 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2033480 - Posted: 22 Feb 2020, 5:21:45 UTC

Good news:
Workunits waiting for assimilation is below 4 million.
Results returned and awaiting validation is below 14 million.
Progress.
ID: 2033480 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 2033484 - Posted: 22 Feb 2020, 7:35:07 UTC - in response to Message 2033479.  

Actually I think the deadlines were set, after the upgrade that produced the large differences in crunch times at different Angle Ranges, based on the work done by Joe Segur.
2007 - postid692684 - Estimates and Deadlines revisited
And back in 2007 you could still use a pre-SSE CPU. ;-)

Cheers
ID: 2033484 · Report as offensive
Ghia
Joined: 7 Feb 17
Posts: 238
Credit: 28,911,438
RAC: 50
Norway
Message 2033488 - Posted: 22 Feb 2020, 8:14:13 UTC

I've been running BOINC/SETI 24/7 for 3 years and hadn't seen a CUDA task since my system stabilized with the SoG app.
Imagine my surprise when I discovered 17 of them this morning. What's up with that? I've made no changes to that system in ages...
Humans may rule the world...but bacteria run it...
ID: 2033488 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033489 - Posted: 22 Feb 2020, 8:46:13 UTC

Every now and then the servers will send out a few tasks destined for other apps to confirm that you are still using the best performing one.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033489 · Report as offensive