Message boards : Number crunching : Is it really over?
Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0
but without the annoying celebrities. The key word being "annoying".
Joined: 16 Mar 07 Posts: 3949 Credit: 1,604,668 RAC: 0
I'm sure we lose credit hounds from time to time, but I'm here for the science.

I thought it was free sushi? (Which I am still waiting for.) Still, this little bump is nothing. Who remembers May 2007, when one of the servers completely died and the project was down for over a week?

Pure mathematics is, in its way, the poetry of logical ideas. Albert Einstein
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28
but without the annoying celebrities. I'm sure if you ask around, you'll get a few "yes, he is" answers. :)
Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0
Who remembers May 2007 when one of the servers completely died and the project was down for over a week?

You know, I'm pretty sure I had one or two machines running back then, but I can't honestly remember that one. Sort of puts it all in perspective. Edit - Just checked my old notes, and I was on the road for a couple of weeks in May 07. Must have missed the whole thing. Is it too late to start complaining?
Joined: 2 Sep 06 Posts: 8964 Credit: 12,678,685 RAC: 0
To those that are checking out: don't give up. Patience is a virtue.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
Just checked my old notes, and I was on the road for a couple of weeks in May 07. Must have missed the whole thing. Is it too late to start complaining?

Probably. But which road were you on? The fun back then was tracking the new server across the continent, like Matt's post from May 09 2007: Just checked FedEx tracking (via a vendor-only system). Not much resolution when the thing is in transit. The replacement Sun server is on a truck somewhere between Memphis and San Jose.
nemesis Joined: 12 Oct 99 Posts: 1408 Credit: 35,074,350 RAC: 0
Quoting the Blurfmeister: "To those that are checking out: don't give up. Patience is a virtue." And I'm dang near sainthood, then.
Joined: 8 Dec 00 Posts: 19 Credit: 20,552,123 RAC: 0
Stable? Something tells me you haven't been paying attention.

Yes, in terms of continuous uptime versus downtime caused by hardware failures. Now it's very frequent: every week we have an outage, plus all the failures on top of that - I don't remember a month without a failure... Check the weekly graph on Cricket... Do you feel that's the way it should look? I don't think so... And S@H Classic was coping way better - but that's just my impression... (Does anyone have an old Cricket graph to post here?)

There have been worse outages than this one. The project has never been stable in the sense that most people think.

Outages happened on very rare occasions - mainly due to hardware failure. Now I think the crew have too many screws to turn - a web server, database, filesystem, DNS policy and routing. Once I start to get the impression that the BOINC Berkeley farm has stabilized - boom, here we go: RAID out of sync... OK, fixed... five days later, now what - mount lost... next... an outage (this time planned) - three days - new Fedora is out, let's update the system... oops, MySQL got updated and is screwing up the DB... a root partition filled with logs... a never-ending story. Do you remember such things during S@H Classic? (I remember a few quite long failures, and a few DDoSes/hacks - that happens.) The scenario is just an example...

And how come other projects (Rosetta, Einstein, Milkyway) have such long uptimes? Yes, they are smaller, but compared to S@H they also have much shorter experience, and usually less (or a comparable amount of) money (NSF funding etc.) - but it's just my guess.

I'm aware that S@H is a HUGE project, but technology has found solutions for such issues. I think Matt et al. and the two-rack BOINC farm cannot support the demand of this many crunchers or mangle that much science data. I feel there is a need for an additional layer in the S@H processing (i.e. sites holding WUs) which would distribute work for, let's say, a geographical region, and users should be allowed to choose a couple of them (for redundancy). WUs should carry signatures telling where they were downloaded from, and those sites should synchronize with Berkeley to do the science DB inserts. I know it's just a rough plan, but someone from the crew should think about it... before people lose their nerves and switch to another project. In plain words: if you have a HUGE problem, try to divide it into smaller chunks and solve each one of them (not possible in every situation, I know). That's why Beowulf clusters, CFD/MES domain decomposition and basically distributed computing came about - there has to be a way to decentralize the project, providing better failover, load distribution etc.

Cheers, Profi
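Profi's regional-mirror idea, sketched below purely as an illustration. Nothing like this exists in SETI@home or BOINC, and every name here (WorkUnit, RegionalSite, origin_site, sync_to_master, pick_sites) is hypothetical. The point is only that each workunit would carry a signature naming the site it came from, each user would pick a couple of nearby sites for redundancy, and each site would batch its results back to a single master science database:

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WorkUnit:
    wu_id: str
    data: bytes
    origin_site: str = ""              # "signature": which mirror handed it out
    result: Optional[bytes] = None

@dataclass
class RegionalSite:
    name: str
    region: str
    pending: List[WorkUnit] = field(default_factory=list)
    completed: List[WorkUnit] = field(default_factory=list)

    def hand_out(self) -> Optional[WorkUnit]:
        """Give a cruncher a workunit, stamped with this site's signature."""
        if not self.pending:
            return None
        wu = self.pending.pop()
        wu.origin_site = self.name
        return wu

    def receive(self, wu: WorkUnit) -> None:
        """Accept a finished result from a cruncher."""
        self.completed.append(wu)

    def sync_to_master(self, master_science_db: List[WorkUnit]) -> None:
        """Periodic batch insert of finished results into the single science DB."""
        master_science_db.extend(self.completed)
        self.completed.clear()

def pick_sites(sites: List[RegionalSite], user_region: str, k: int = 2) -> List[RegionalSite]:
    """Let a user choose a couple of nearby mirrors for redundancy."""
    nearby = [s for s in sites if s.region == user_region] or sites
    return random.sample(nearby, min(k, len(nearby)))
```

As OzzFan points out in the reply below, the sync_to_master step still funnels every result through the one master database at Berkeley, which is where the bandwidth argument bites.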
Joined: 8 Dec 00 Posts: 19 Credit: 20,552,123 RAC: 0
Patience is a virtue. Yes, quite frequently probed :)...
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28
Stable? Something tells me you haven't been paying attention.

Yes, it has gotten worse as of late, and quite a few weeks have gone by with issues. I'm not saying it isn't bad right now. I'm just saying it has been worse. I've been told that there was a time during Classic where the servers were down nearly a month. The important thing to take from all this is that the project never guaranteed 100% uptime or that workunits would be available at all times.

There have been worse outages than this one. The project has never been stable in the sense that most people think.

If you're referring to Classic, then you're sorely mistaken. It was not on very rare occasions, but just as problematic as it is now. Some times were worse than others back then, just as they are now. People forget.

Now I think the crew have too many screws to turn - a web server, database, filesystem, DNS policy and routing. Once I start to get the impression that the BOINC Berkeley farm has stabilized - boom, here we go: RAID out of sync... OK, fixed... five days later, now what - mount lost... next... an outage (this time planned) - three days - new Fedora is out, let's update the system... oops, MySQL got updated and is screwing up the DB... a root partition filled with logs... a never-ending story. Do you remember such things during S@H Classic? (I remember a few quite long failures, and a few DDoSes/hacks - that happens.)

No, because they didn't share as much with us. We did just as much work with less information. Computers were so slow back then that the project could be down an entire week and most people would still be working on the same workunit. This is one of the reasons why people don't remember very well.

And how come other projects (Rosetta, Einstein, Milkyway) have such long uptimes? Yes, they are smaller, but compared to S@H they also have much shorter experience, and usually less (or a comparable amount of) money (NSF funding etc.) - but it's just my guess.

Yes, they are much smaller. None of them have over 297,000 active participants with such a rabid following, building crunching farms with multiple nVidia cards just to increase their numbers (not saying it doesn't happen - just not on the scale it happens here on SETI). I would even bet that most of those projects, being newer and all, have started with newer hardware - possibly even hardware that isn't donated or in beta. They have had the opportunity to spend their NSF money on decent stuff. SETI, on the other hand, uses donated/beta hardware all the time. SETI paved the way for all of them, and continues to pioneer low-budget distributed computing.

I'm aware that S@H is a HUGE project, but technology has found solutions for such issues. I think Matt et al. and the two-rack BOINC farm cannot support the demand of this many crunchers or mangle that much science data. I feel there is a need for an additional layer in the S@H processing (i.e. sites holding WUs) which would distribute work for, let's say, a geographical region, and users should be allowed to choose a couple of them (for redundancy).

The problem with this idea is that all the workunits must come from a single source (the tapes/hard drives at SETI). Likewise, they must return to the same source. This suggestion just moves the problem around but does nothing to actually solve the bandwidth issue.

WUs should carry signatures telling where they were downloaded from, and those sites should synchronize with Berkeley to do the science DB inserts. I know it's just a rough plan, but someone from the crew should think about it...

Sounds like a lot more technology going toward the solution, which means even more points of failure in the system. Plus, in order to "synchronize" with Berkeley, they'd be using up some of the bandwidth, which simply moves the problem around but does nothing to solve the issue.

...before people lose their nerves and switch to another project. In plain words: if you have a HUGE problem, try to divide it into smaller chunks and solve each one of them (not possible in every situation, I know). That's why Beowulf clusters, CFD/MES domain decomposition and basically distributed computing came about - there has to be a way to decentralize the project, providing better failover, load distribution etc.

If we made the system that complex, how would that be big science on a small budget? The problem is that you're trying to find ways to give the server more uptime. This is contrary to the entire thesis of BOINC. The idea is that you do not need a server with 100% or even 80% uptime. The goal is not to make the SETI servers more reliable, but to understand that people's expectations are far off base. SETI never promised to have work available all the time. The sooner that is understood, the less upset or angry people will get when something happens.
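OzzFan's point that BOINC was designed not to need a highly available server rests on client-side behaviour: hosts keep a local cache of work and back off when the scheduler is unreachable, retrying later with ever longer delays. A minimal sketch of that retry pattern, purely illustrative - the function name, delays and error type here are assumptions, not BOINC's actual implementation:

```python
import random
import time

def request_work(scheduler_contact, base_delay=60, max_delay=4 * 3600):
    """Retry a scheduler request with exponential backoff and jitter.

    scheduler_contact: a callable that raises ConnectionError when the
    server is unreachable (a hypothetical stand-in for the real RPC).
    Delays grow 60 s, 120 s, 240 s, ... up to a cap, so a down server just
    means the client waits and keeps working from its local cache.
    """
    delay = base_delay
    while True:
        try:
            return scheduler_contact()
        except ConnectionError:
            # Server unreachable: sleep, then try again with a longer delay.
            wait = delay * random.uniform(0.5, 1.0)   # jitter avoids a thundering herd
            time.sleep(wait)
            delay = min(delay * 2, max_delay)
```

While the loop is sleeping, the host keeps crunching whatever is in its local cache, which is why participants with multi-day caches barely notice the weekly outages discussed in this thread.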
Odysseus Joined: 26 Jul 99 Posts: 1808 Credit: 6,701,347 RAC: 6
And how come other projects (Rosetta, Einstein, Milkyway) have such long uptimes? Yes, they are smaller, but compared to S@H they also have much shorter experience, and usually less (or a comparable amount of) money (NSF funding etc.) - but it's just my guess.

Rosetta and Einstein are well funded & staffed; the latter has multiple servers, so it has been particularly reliable. Milkyway isn't at all comparable to the other two, being operated by only a couple of researchers, nominally an alpha-test project FWIW, and anyway it has certainly had its share of unscheduled (albeit usually pretty brief) outages, cancelled work, and similar glitches, as one might infer from a glance at the 'complaint' threads in the NC forum over there.
Joined: 16 Apr 00 Posts: 1296 Credit: 45,357,093 RAC: 0
And how come other projects (Rosetta, Einstein, Milkyway) have such long uptimes? Yes, they are smaller, but compared to S@H they also have much shorter experience, and usually less (or a comparable amount of) money (NSF funding etc.) - but it's just my guess.

Rosetta is an excellent backup project. I keep it loaded with 'No New Tasks' enabled. When my SETI cache drops to less than a day, I turn it on.

Join the PACK!
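The 'No New Tasks' juggling described above can be scripted against the BOINC command-line tool. A rough sketch, assuming boinccmd is installed alongside the client; estimate_seti_cache_days() is a deliberately unimplemented placeholder (you might fill it in by parsing `boinccmd --get_tasks` output), and the Rosetta URL is given here only as an example backup project:

```python
import subprocess

ROSETTA_URL = "https://boinc.bakerlab.org/rosetta/"   # example backup-project URL
THRESHOLD_DAYS = 1.0                                   # matches "less than a day"

def boinccmd(*args: str) -> str:
    """Run one boinccmd operation against the local BOINC client."""
    result = subprocess.run(["boinccmd", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout

def estimate_seti_cache_days() -> float:
    """Placeholder: estimate how many days of SETI work remain locally,
    e.g. by parsing `boinccmd --get_tasks` output. Not implemented here."""
    raise NotImplementedError

def toggle_backup_project() -> None:
    # Thin SETI cache: let the backup project fetch work; otherwise hold it back.
    if estimate_seti_cache_days() < THRESHOLD_DAYS:
        boinccmd("--project", ROSETTA_URL, "allowmorework")
    else:
        boinccmd("--project", ROSETTA_URL, "nomorework")
```

Newer BOINC clients can get much the same effect by giving a backup project a resource share of 0, but the manual toggle above matches what the poster describes.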
Joined: 15 Mar 01 Posts: 1011 Credit: 230,314,058 RAC: 0
Come on Mutt....

I'm sticking around, I just couldn't see leaving most of my rigs running without any GPU work units. But you all can thank me for fixing the servers; if I hadn't shut down, we would have been looking at another day of outage! :P Uploads went quick, we must have a new switch...
Joined: 23 May 99 Posts: 4292 Credit: 72,971,319 RAC: 0
Eric isn't sure, but at least he posted in the Technical News... Seems it was campus-wide. Official Abuser of Boinc Buttons... And no good credit hound!
Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3
No, because they didn't share as much with us. We did just as much work with less information. Computers were so slow back then that the project could be down an entire week and most people would still be working on the same workunit. This is one of the reasons why people don't remember very well.

Before someone (a third party) came up with a pool for Seti WUs back then - one you could upload finished WUs to and download "new" ones from ad infinitum - you'd be doing just that one WU per computer and then 'standing by', waiting for Seti to come back once you were done. With the pool you wouldn't notice the outages that much, but you'd only be burning electrons, as all the work in the pool would have been done several times over by the time Seti returned and a fresh batch of WUs could be downloaded. It wasn't really doing much for the science behind the project.
Joined: 21 Jun 01 Posts: 21804 Credit: 2,815,091 RAC: 0
so bailing instead of waiting? I've seen more loyalty from Milkyway crunchers.

Over-inflated credits = over-inflated loyalty. me@rescam.org
Joined: 8 Dec 00 Posts: 19 Credit: 20,552,123 RAC: 0
I'm not saying it isn't bad right now. I'm just saying it has been worse.

It's supposed to be WAY better...

I've been told that there was a time during Classic where the servers were down nearly a month.

I don't remember a month-long break, but a week or so - yes. And even then people were informed through various channels - they knew, more or less, what was going on.

The important thing to take from all this is that the project never guaranteed 100% uptime or that workunits would be available at all times.

Yes - and by "guarantee" you mean "expect the proper functioning of the device or service one has paid for"? OK, I agree - nobody sent me a written statement, and it's not in the license agreement you accept when installing the BOINC client... Nevertheless, it should be getting better, and even without any formal guarantee the project should be getting very near 100% uptime - lately it certainly is not.

If you're referring to Classic, then you're sorely mistaken. It was not on very rare occasions, but just as problematic as it is now. Some times were worse than others back then, just as they are now. People forget.

I really don't think the project's complexity has stayed the same - it has grown, producing new failure points. If people forget... fine, let us forget now, but somehow these downtimes are not letting me forget (how can one forget when it happens every week)?

No, because they didn't share as much with us. We did just as much work with less information. Computers were so slow back then that the project could be down an entire week and most people would still be working on the same workunit. This is one of the reasons why people don't remember very well.

?? I felt very well informed, had plenty of WUs to crunch - totally satisfied. Now the situation has changed: moving WUs to another machine is problematic, so with the server downtimes machines start to drain their caches and sit idle. And that is what the project is for - to keep machines occupied... (And I don't want to participate in another project - I want to FIND ET :) )

Yes, they are much smaller. None of them have over 297,000 active participants with such a rabid following, building crunching farms with multiple nVidia cards just to increase their numbers (not saying it doesn't happen - just not on the scale it happens here on SETI).

Yes, agreed. But people building 4-GPU, 10+ CPU core monsters will quickly find there is no point in participating in a project that cannot keep up with the work demand of such beasts. :) The project will shrink if a single S@H server farm cannot keep up.

I would even bet that most of those projects, being newer and all, have started with newer hardware - possibly even hardware that isn't donated or in beta. They have had the opportunity to spend their NSF money on decent stuff. SETI, on the other hand, uses donated/beta hardware all the time.

I think they are running on donated/beta hardware too - that's the way it goes. Isn't S@H also funded by the NSF? Find me another project with so many major players behind it - Intel, Overland, HP... From S@H's position, getting great hardware from them is just a matter of asking (and BTW, they do donate hardware). Add to that the donations from members, which strengthen the project's funding.

SETI paved the way for all of them, and continues to pioneer low-budget distributed computing.

On "paving the way" - totally agreed.

I'm aware that S@H is a HUGE project, but technology has found solutions for such issues. I think Matt et al. and the two-rack BOINC farm cannot support the demand of this many crunchers or mangle that much science data. I feel there is a need for an additional layer in the S@H processing (i.e. sites holding WUs) which would distribute work for, let's say, a geographical region, and users should be allowed to choose a couple of them (for redundancy).

The problem with this idea is that all the workunits must come from a single source (the tapes/hard drives at SETI). Likewise, they must return to the same source. This suggestion just moves the problem around but does nothing to actually solve the bandwidth issue.

I don't think so. The data comes from a common source - the sky - and it is stored and may be shipped to Berkeley as raw data, but after that it can be sent out to other sites for splitting and the rest of the mangling (sending/validating etc.) to offload the Berkeley site. Only the results have to be returned to Berkeley to build a uniform science database of collected data/results. I think that splitting/sending/validating/assimilating and all the other actions on objects as small as WUs put such stress on HDDs, OSes, Apache servers and network hardware that something fails on a frequent basis (it's like doing random r/w with hundreds of DB queries while running a web service with hundreds of simultaneous connections - sooner or later it has to kill a single machine, or a single small farm).

Sounds like a lot more technology going toward the solution, which means even more points of failure in the system.

Yes - but the probability that two or more sites would be down at the same time is low...

Plus, in order to "synchronize" with Berkeley, they'd be using up some of the bandwidth, which simply moves the problem around but does nothing to solve the issue.

You are distributing the workload and the network stress. And handling a single request/session (or a couple of them) at a time is very different from mangling hundreds of connections at the same time.

If we made the system that complex, how would that be big science on a small budget?

The two things have very little in common. System complexity is one thing and the science side another; combining the two is a shortcut. I know people doing great science with simple tools, and people wasting their time harnessing huge, complex tools to solve small problems.

The problem is that you're trying to find ways to give the server more uptime. This is contrary to the entire thesis of BOINC. The idea is that you do not need a server with 100% or even 80% uptime.

Not true! I think you're missing the point. What is the "entire thesis of BOINC"? I think it's to create an Open Infrastructure for Network Computing - a tool for scientists and developers who want to run a distributed project - to make it uniform, thus providing a platform for users to participate in more than one project if they wish. If not, BOINC is simply a "carrier" of science application(s) - so still, if someone wants to participate in a single project, it's desirable for the project servers to keep up with work distribution. So yes - I would like to see S@H at 98%+ uptime.

The goal is not to make the SETI servers more reliable, but to understand that people's expectations are far off base. SETI never promised to have work available all the time.

I don't believe what I'm reading here... Seven hours have passed since the last failure, and I was unable to get any WUs for new machines or send results for 4 days (3 days of upload failures, a 6-hour break with S@H working, then 5 hours of failure)... And you're saying that "the goal is not to make the SETI servers more reliable" - so what is the purpose of this project if the servers keep failing? What is the point of messing with people's heads if, for example, the servers lose the fruit of their machines' work through frequent failures/misconfiguration? I DO NOT expect S@H to work like my cardiac pacemaker :) - I just want it to be up 28-29 days out of 30 in a month (or close to that) - are my expectations too high, do you think? S@H has become something more than just a scientific project - everyone involved knows that. It's a community of very devoted people, wishing it all the best and admiring the work of the S@H staff. If you're running a project of that scale, you must accept the consequences - you have to bear in mind that the pressure on you from a huge number of participants will be extreme. You don't have to make an explicit "promise" or declaration - it's self-explanatory. And it's quite obvious (at least to me) that the pressure rises when something fails.

The sooner that is understood, the less upset or angry people will get when something happens.

NO, no, no... sorry - I do not agree with that attitude. Being upset, worried or excited about something means that you care. Silently accepting failures, especially when they occur so frequently, is burying your head in the sand. Saying that something is wrong with the project is not just babble - at least someone (in this case at least you, OzzFan) will read it and know that not everything is OK.

Cheers, Profi

p.s. It seems like everything got back to normal after 3 days of upload failure, 6 hours of project uptime, then a Thumper crash and 5 hours of RAID resync... The threads predicting the next failure times are really sarcastic :] and confirm that I'm not alone...
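For scale, Profi's "28-29 days up out of 30" target is easy to express as a percentage; the snippet below only restates the figures already given in the post (including the 98%+ wish earlier in the same reply), with Python used for the arithmetic:

```python
# Restating the uptime figures from the post as percentages and
# as allowed downtime per 30-day month.
for up_days in (28, 29):
    uptime = up_days / 30
    downtime_hours = (30 - up_days) * 24
    print(f"{up_days}/30 days up = {uptime:.1%} uptime "
          f"({downtime_hours} h of downtime per month)")

# The 98% figure mentioned earlier in the post:
print(f"98% uptime allows about {0.02 * 30 * 24:.1f} h of downtime per month")
```

So the ask is roughly 93-97% uptime: far from "five nines", but noticeably tighter than a pattern of multi-day weekly outages delivers.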
Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0
I'm always amazed/amused at all the people who keep predicting that SETI users will leave in large flocks because of the "bad service". Take a look at Scarecrow's graph of users. In particular, look at the graph at the very bottom of this page.

The number of active hosts is pretty flat, while the total number of hosts climbs slowly over the last 365 days (which includes some major "bad service" events). This tells me that there is some turnover in users who sign up, give it a try, and then quit for whatever reasons. This probably means the total crunching power goes up steadily, as newer hardware is introduced by a constant-sized group of steady users or new users.

Since S@H has a hard time generating enough work for this existing user base, why would they bother with any extra effort to retain more users or generate new users? All it would produce is more people griping about lack of work, lack of respect, lack of customer service, etc. I hope they have found more important things to spend their time on.

If this bothers you, goodbye. Thanks for the CPU cycles. There is somebody standing in line to take your place.

Fine Print: this polite kiss-off reflects the views of this user only, and is not the official position of this station, this network, or its sponsors.
Joined: 23 May 99 Posts: 4292 Credit: 72,971,319 RAC: 0
Eric posted at least twice this last outage. That is a 100% improvement over the last outage. Things are looking up. Official Abuser of Boinc Buttons... And no good credit hound!
Joined: 8 Dec 00 Posts: 19 Credit: 20,552,123 RAC: 0
I'm always amazed/amused at all the people who keep predicting that SETI users will leave in large flocks because of the "bad service". Take a look at Scarecrow's graph of users. In particular, look at the graph at the very bottom of this page.

That's only 2 years - look at the growth rate back when S@H was Classic; growth was amazing then... You've been around long enough (even longer than me) to know that. I don't think it will be rapid and everybody will resign. I think people will start considering a project change, and some of them will switch. The project should grow - right now the growth rate is minimal.

The number of active hosts is pretty flat, while the total number of hosts climbs slowly over the last 365 days (which includes some major "bad service" events). This tells me that there is some turnover in users who sign up, give it a try, and then quit for whatever reasons.

Or new users sign up and stay, while a matching number of people get upset or resigned over the frequent failures and leave the project... and active users attach more CPUs.

Since S@H has a hard time generating enough work for this existing user base, why would they bother with any extra effort to retain more users or generate new users?

To be the biggest project and keep that position? To always be on the frontier of science and distributed computing? To finally manage to process all incoming data in real time (or close to it)? To expand the search area (spatially and spectrally)? Take your pick - it's not hard to figure out. Users (and their machines) ARE the true power of S@H - the staff have underlined that many times, and it's self-explanatory. The more users, the more power; the more power, the bigger things you can achieve.

All it would produce is more people griping about lack of work, lack of respect, lack of customer service, etc. I hope they have found more important things to spend their time on.

Like what? That's how it works. If Edison had given up on the 15th burned-out light bulb, you would still be sitting by candlelight. Feeling the pressure (apart from stressing the staff a bit) drives changes and improvement. What could be more important than pushing the hardware to its limits, testing DB engines, figuring out new distribution and scientific algorithms, and using new hardware for data processing - especially when running a distributed scientific project? The purpose is certainly not to gather people who don't care... there is no point in that - the project is held together by a great idea, the devotion of its users and the passion of its staff... otherwise it would break and fall apart.

If this bothers you, goodbye. Thanks for the CPU cycles. There is somebody standing in line to take your place.

It bothers me (in a positive way) - and it's a pity that people like RottenMutt want to quit because of project failures... those people are legends (my hat off here). You're certainly not the one who should be saying goodbye to me... find another tissue to wipe your nose... crunch ten times the credit you have and then we'll start chatting. Instead of posting, start crunching... it seems leaving your computer idle would do more good than writing kiss-offs and goodbyes, because to me you're totally wasting your computer's cycles. And maybe this "somebody" is you? Because being with the project over 10 years and earning not even 400K is like being semi-idle... and waiting in the queue. I'm not griping, and neither are most people - they are concerned, not picky (and maybe a little resigned when failures are too frequent).

Apart from donating idle cycles, I want to use S@H to burn in my new hardware. I was just saying that due to the failures I'm unable to use S@H for that purpose and will have to find another solution for testing (although the S@H client stresses CPU+GPU pretty well - I wouldn't want to change that unless forced to). It's not a matter of complaining - it's a matter of running a business, money, and fulfilling testing duties. I need a decent, reliable tool. Hope I've made myself clear on this one.

Profi

Fine Print: this polite kiss-off reflects the views of this user only, and is not the official position of this station, this network, or its sponsors.

I can do an even smaller Footnote: and I can add colors too. Your kiss-off was not polite - even if "polite" words were used. Sometimes the tone of voice or the way a sentence is composed says more than the words themselves. I don't mean to discourage or offend you - keep crunching - but if you want to say goodbye to me, save it. And you don't need a "Legal Notice/Statement", since you're not representing "this station, this network, or its sponsors".