The Server Issues / Outages Thread - Panic Mode On! (118)
Keith Myers Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
One way to discourage oversized caches would be to include turnaround time in the credit calculation: return the result immediately for maximum credit, and the longer you sit on it, the less you get. GPUGrid rewards fast-turnaround hosts with 50% more credit if work is returned within 24 hours, and 25% more if it is returned within 48 hours. The same could be implemented here. Seti@Home classic workunits: 20,676 CPU time: 74,226 hours A proud member of the OFA (Old Farts Association) |
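The GPUGrid-style tiered bonus Keith describes is simple to state in code. A minimal sketch in Python, assuming a flat base-credit model; the function name and tier boundaries are illustrative, not SETI's actual credit code:

```python
def turnaround_bonus(base_credit: float, hours_to_return: float) -> float:
    """Apply a GPUGrid-style early-return bonus to a task's base credit:
    +50% if the result comes back within 24 hours,
    +25% if within 48 hours, no bonus after that."""
    if hours_to_return <= 24:
        return base_credit * 1.50
    if hours_to_return <= 48:
        return base_credit * 1.25
    return base_credit

# A 100-credit task returned in 12 hours earns 150 credits,
# in 36 hours 125 credits, and in 72 hours the plain 100.
```

A host sitting on a 10-day cache would forfeit the bonus on almost everything it holds, which is exactly the incentive being proposed.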
Keith Myers Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
I see where you are coming from. I believe the only way you can return a result "immediately" is if it is a noise bomb (runs for 10 seconds) and is started as soon as it is downloaded. I cannot see any other way to return a result "immediately".
Actually you can, if you set <report_results_immediately>1</report_results_immediately> in the cc_config.xml file. From the client configuration wiki: <report_results_immediately>0|1</report_results_immediately>. But early overflows that run for only 15 seconds would still only get reported at each 305-second scheduler connect interval. Seti@Home classic workunits: 20,676 CPU time: 74,226 hours A proud member of the OFA (Old Farts Association) |
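For reference, the flag Keith quotes lives in the <options> section of cc_config.xml; the wrapper tags below are the standard BOINC client-configuration layout (set the value back to 0, the default, to restore batched reporting):

```xml
<cc_config>
  <options>
    <!-- Report each finished task on the next scheduler RPC
         instead of batching reports (0 = default batching). -->
    <report_results_immediately>1</report_results_immediately>
  </options>
</cc_config>
```

The client rereads cc_config.xml on restart, or on "Read config files" from the BOINC Manager's advanced menu.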
Kissagogo27 Send message Joined: 6 Nov 99 Posts: 715 Credit: 8,032,827 RAC: 62 |
It's not really a problem of the fastest ones against the slowest ones, but:
1. The slowest ones (slow CPU with few cores, slow GPU) are numerous, so their cross-validation pendings are not a problem (like mine, but it runs 24 h a day, 7 days a week, 52 weeks a year, and a whole life I hope...).
2. The intermediate ones (multi-core CPU and a fast GPU) are not the problem either; there are enough of them to cross-validate their pendings against another intermediate host.
3. The problem is that there are only a few of the fastest ones (multi-core CPU, multiple fast GPUs), which in practice cannot cross-validate their pendings with another fastest computer; they have to wait for the hosts in 2 and 1.
Some possible solutions:
- increase the number of fastest hosts in 3 (not really possible: money cost, electricity cost, maintenance, hard configuration, etc., plus the server-side problems of feeding them all);
- split the fastest hosts in 3 down to 2: more, and higher-end, intermediates to increase cross-validation between them; easier to configure, to supply, etc.;
- eliminate all the slowest ones... which is against the values of a scientific project like SETI;
- or any of the proposals you've made before me :D
(sorry for the language mistakes) :p |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Looks like the current panic is ending. My hosts are only a few tasks short of having full caches now. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
With a simple look at the SSP you see: Results returned and awaiting validation 0 35,474 14,150,778
Actually, about 9 million of those 14 million results are in there not because someone is still crunching them, but because the corresponding workunit is stuck in the assimilation queue. |
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
The reason for the longer deadlines is that the project has always wanted to keep those with old slow computers still able to contribute to the project.
I think this is a valid point. I didn't think of the people who have slow machines and are only on for a few hours every day. I would love to see the stats of machines in that category and see IF the return-time allowance could be shortened. I want to be as inclusive as possible, but I'll be honest: if the number of machines in this category is small, then it might be better to sacrifice a few participants to maybe gain even more. How many people join the project but quit because it isn't stable and they can't reliably get WUs every week?? How much shorter would the Tuesday outage be if the db were a better size? I want seti to be run by as many people as possible. This project is not only about finding the alien, but also PR and making people feel a part of something, but what if we are losing more people by not changing the system to run better?? |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13722 Credit: 208,696,464 RAC: 304 |
Most people do NOT run their computers 24/7, and therefore a slow computer that runs maybe only a few hours/week can take a very long time
And for extremely slow, rarely-on systems, 1 month is plenty of time for them to return a WU. It's actually plenty of time for them to return many WUs. While having deadlines as short as one week wouldn't affect such systems, it would affect those that are having problems, be it hardware, internet, or power supply (fires, floods, storms, etc.). A 1-month deadline reduces the time it takes to clear a WU from the database, but still allows people time to recover from problems and not lose any of the work they have processed. Grant Darwin NT |
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
And for extremely slow, rarely-on systems, 1 month is plenty of time for them to return a WU. It's actually plenty of time for them to return many WUs.
+1 |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13722 Credit: 208,696,464 RAC: 304 |
Theoretically they could replace all of these systems with just a couple AMD Epyc based servers.
A single modern dual-socket Epyc server can have more cores than all those listed servers combined! There are even many single-core chips in the list - those must be really ancient. The present bottleneck is storage I/O (Input/Output). HDDs are not good for non-sequential work. While the recent plans to move to a new storage unit with larger-capacity (and so much faster) HDDs may help, the present server accessing the data can no longer keep it all cached in memory. Hence the present issues. If the project were to put out their wish list for a new SSD-based storage unit, and a new database server with significantly more RAM & faster CPUs (and ideally the replica as well), I'm sure the community would come to the party. That would well and truly fix the present database backlog issues. Of course, once you remove one bottleneck, others will show up - as it is, the download servers have repeated issues meeting current demands, the upload server is continually having issues (any news on its replacement?), and the Scheduler is always having random issues meeting demand. Replacing the existing database server with something much better would allow the replaced hardware to be used for those functions that are already showing signs of not coping. As old and limited as it is, the present database server is much more powerful & has much more RAM than most of the other servers. Grant Darwin NT |
rob smith Send message Joined: 7 Mar 03 Posts: 22160 Credit: 416,307,556 RAC: 380 |
Yes, SETI@Home only gets a small fraction of the total Breakthrough Listen project's data - it only gets the bit that it is interested in and is capable of processing. A lot of the data collected is from the wrong frequency ranges, or the telescope doesn't have the equipment required to deliver the data to S@H. Just think how long it has taken, and we still haven't gained access to the data from Parkes; technical issues have obstructed that feed. Then there is LOFAR, which produces data from a set of frequencies unusable by S@H but is a contributor to the BTLP, or the UK Lovell telescope, or MERLIN (and many other telescopes that don't operate in the manner required by S@H, but contribute data to the BTLP). Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
....they really just need better hardware...
We might be able to raise the money, but how can we help with the time? I know. Let's put together a Shadow Data Center that keeps a full copy of the Primary, but do it with more modern hardware. Then as the other computers die, the Shadow takes over. After all, "The Shadow Knows...." Actually, if we could come up with a way to remotely replicate the Primary without impacting its performance, that might not be a half-bad idea. Once we are "up" they could take down the old system and move the Shadow hardware into place. Let's see now.... If we get X number of dual-socket Epycs with "maximum" memory for all 8 of those channels.... ;) Tom A proud member of the OFA (Old Farts Association). |
AllgoodGuy Send message Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
It would be grand if the project could meter out work to host computers based on their ability to return processed work. Would you honestly tell me that we don't have a single mind, either in the community at large or among the big brains in the post-graduate computer science classes at Berkeley, who could use the available information about each host to create, in a matter of hours, an algorithm that limits every host to a commensurate and sane number? These are the tidbits of information we already have:
Created: 24 Nov 2019, 19:42:47 UTC
Average credit: 46,727.03
Average turnaround time: 0.37 days
Last time contacted server: 22 Feb 2020, 0:38:19 UTC
Fraction of time BOINC is running: 99.66%
While BOINC is running, fraction of time computing is allowed: 100.00%
We already process the necessary information to justify daily download numbers, based on each individual computer's participation in the project. Given a fixed limit of four weeks per work unit, each computer's statistics could be reduced to just one simple number: tasks per unit of time. |
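A minimal sketch of that "tasks per unit of time" idea, with the caveat that the function name, the 7-day window, and the headroom factor are all my own assumptions rather than anything the scheduler actually implements: cap what a host may fetch tomorrow at its demonstrated return rate plus modest headroom, so its cache tracks real throughput.

```python
def daily_task_quota(tasks_returned_last_7_days: int,
                     headroom: float = 1.2) -> int:
    """Hypothetical per-host daily send limit derived from stats the
    project already tracks: the host's demonstrated return rate over
    the past week, plus 20% headroom for growth."""
    rate = tasks_returned_last_7_days / 7.0  # demonstrated tasks/day
    return max(1, int(rate * headroom))      # never starve a host entirely

# A host that returned 700 tasks last week could fetch up to 120/day;
# a brand-new or idle host still gets 1 task to prove itself.
```

Because the quota is driven by validated returns rather than requests, a fast host that hoards work without reporting it would see its own limit shrink.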
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Still no new work? Where is the <panic> button? |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13722 Credit: 208,696,464 RAC: 304 |
Still no new work? Where is the <panic> button?
Every now and then some turn up; you just have to be extremely lucky with the timing of your request (even luckier than after the weekly outages). Grant Darwin NT |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
I don't know if there's a proper answer or solution to anything here, but remember that our deadlines were (and still are) based on what was considered fair back in pre-BOINC days. What I was using back then took almost 10 days (222 hrs in fact) to complete a CPU task (a pre-MMX job), while these days any processor that doesn't support at least SSE instructions will not work here (not that you'd want to run one of them now anyway), plus the use of GPUs/iGPUs wasn't even thought of back then. On a side note, my 2 little rigs (just in huge cases) have been able to stay near or close to their cache limits for the last 9.5 hrs (they were almost out of GPU work when I got up an hour before that), but maybe my 2 rigs are sitting in some sort of "Sweet Spot" (who actually knows anything these days?). Cheers. |
W-K 666 Send message Joined: 18 May 99 Posts: 19013 Credit: 40,757,560 RAC: 67 |
Actually I think the deadlines were set, after the upgrade that produced the large differences in crunch times at different Angle Ranges, based on the work done by Joe Segur. 2007 - postid692684 - Estimates and Deadlines revisited |
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
Good news: Workunits waiting for assimilation is under 4 million, and Results returned and awaiting validation is under 14 million. Progress. |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
Actually I think the deadlines were set, after the upgrade that produced the large differences in crunch times at different Angle Ranges, based on the work done by Joe Segur.
And back in 2007 you could still use a pre-SSE instruction CPU. ;-) Cheers |
Ghia Send message Joined: 7 Feb 17 Posts: 238 Credit: 28,911,438 RAC: 50 |
I've been running Boinc/Seti 24/7 for 3 years and haven't seen a CUDA task since my system stabilized with the SoG app. Imagine my surprise when I discovered 17 of them this morning. What's up with that? I've made no changes to that system in ages.... Humans may rule the world...but bacteria run it... |
rob smith Send message Joined: 7 Mar 03 Posts: 22160 Credit: 416,307,556 RAC: 380 |
Every now and then the servers will send out a few tasks destined for other apps to confirm that you are still using the best performing one. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.