1)
Message boards :
Technical News :
Maxed (Dec 16 2010)
(Message 1057099)
Posted 17 Dec 2010 - Post:

[quote]In most English-speaking countries a "result" is indeed what comes after something. At Berkeley a result is a duplicate of a work unit, and it is called a result before it is sent, while it is out in the field, and after it comes home.[/quote]

[sarcasm=on] So a "result" is an outcome of the work of the splitter process - raw chunks of telescope data? - not even assigned to a particular machine/user? Ufff, what a mind job... OK, I need to consult some geography and linguistics specialists, because I thought that Berkeley was in the US, and that the US is (for the vast majority) an English-speaking country. [sarcasm=off]

Thanks, guys, for such swift replies.

Cheers
Profi

p.s. I do understand the result states, of course - it's just that a "ready to send" state for a "result" is not very intuitive (at least for me). It may be dictated by the DB data flow or something like that...
2)
Message boards :
Technical News :
Maxed (Dec 16 2010)
(Message 1057036)
Posted 17 Dec 2010 - Post:

[quote][As of 17 Dec 2010 16:00:08 UTC]
Results ready to send: 198,960[/quote]
BTW - shouldn't that be "Workunits" instead of "Results"? (Maybe I'm getting this wrong, but isn't this the amount of work-to-do that has been created?)

[quote]Current result creation rate: 16.4682/sec[/quote]
Same as above - WUs instead of Results? (Is this the work-to-do creation rate, or the rate at which results from users are being inserted into the database?)

Am I misunderstanding the definitions of "Workunit" and "Result"? Is a "Workunit" a portion of raw telescope data "chopped" into units of approx. 380 KB (OK, skipping VLARs and other derivatives) and sent to users? Is a "Result" the outcome of a cruncher's machine calculating on a chunk of raw data, which is returned to the S@H server for cross-checking against results returned by other users (machines) working on the same chunk of data (plus other server-side operations like DB assimilation etc.)? Correct me here if I'm wrong - thanks in advance...

[quote]Results received in last hour: 50,451[/quote]
1 hr = 3600 s, so from a simple "steady-state" calculation: at the current rate, 16.4682 * 3600 = 59,285.5 WUs can be created each hour. So right now the project is more or less balanced - the creation rate is slightly higher than the WU demand. The current WU pool would satisfy about 4 hours of crunchers' demand. If you want to have a 3-day outage, then I'm afraid the pool will run dry, and it will get worse (after an outage the demand is higher), and as time goes by the crunching machines are not getting any slower.... But keep on pushing, folks!! :)

Profi

p.s. I was suggesting a 1 Gbit connection as a temporary solution a long time ago - hopefully it will be realized soon. But I think that sooner or later the "server side" of the project will have to be split somehow across a couple of sites spread around the world, with Berkeley remaining the "capo di tutti capi" of the project - otherwise S@H may become a victim of its own success.
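As a sanity check on the arithmetic above, here is a minimal Python sketch using only the three figures quoted from the server status page (the variable names are mine; nothing here comes from actual BOINC code):

```python
# Back-of-the-envelope check of the server-status numbers quoted above.
ready_to_send = 198_960          # results ready to send
creation_rate = 16.4682          # results created per second
received_last_hour = 50_451      # results returned by clients in the last hour

created_per_hour = creation_rate * 3600
print(f"created per hour:  {created_per_hour:,.0f}")   # ~59,286
print(f"returned per hour: {received_last_hour:,}")    # 50,451

# How long the ready-to-send pool lasts if creation stopped entirely
# and clients kept pulling work at the observed return rate.
buffer_hours = ready_to_send / received_last_hour
print(f"buffer: roughly {buffer_hours:.1f} hours of demand")   # ~3.9 h
```

So creation does outpace demand by a small margin, but the standing buffer covers only about four hours, which is the point the post is making about multi-day outages.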
3)
Message boards :
Number crunching :
Is it really over?
(Message 980829)
Posted 19 Mar 2010 - Post:
Do you need to read them to know the profile of the project you participate in? I'm not much of a reader - I do it only if something is wrong or I can't figure something out by myself... I'm relying on my practice, my feel for the profile of S@H, and my understanding of the idea of distributed computing.

[quote]The goal is to make computing affordable.[/quote]
I think that "affordable" means "easy to deploy" or "available to everybody". The goal of BOINC is, on one side, to create a platform that lets scientists build distributed projects easily, in a uniform way, focusing on the science as much as possible (BOINC takes issues like transmission, security, client-server relations, result validation and so on off their shoulders). On the other side, the platform allows various people using different computer hardware to participate in such projects and create output in a coherent way (results have to be platform-independent) - so it is "affordable" on the clients' side too. That is my understanding of the whole thing, without going into much detail...

[quote]You don't make it affordable by spending lots of money. You make it affordable by minimizing costs.[/quote]
You have to balance that - let's say not minimizing but optimizing. You will not be able to run a project like S@H on a shoestring budget - otherwise you'll build a giant on straw legs, and it will fail sooner or later.

[quote]OzzFan is making this point because he's actually read the white papers. He's bothered to figure it out.[/quote]
You guys may have the time to read the white papers of BOINC, S@H etc. - I don't. I'm glad to hear different opinions... Also, reading does not necessarily mean understanding - it's a pretty complex system, so one may neglect issues that are important to another person (I'm not saying that OzzFan doesn't understand what he reads). Again - I read white papers when I have to. It's like reading the License Agreement when installing the BOINC client - I never do it (yes, I know that may cause some problems; I accept that).

[quote]If you disagree, you're either too lazy to research, or you'd rather make up the rules as you go.[/quote]
And who are you to judge me? Yet another talkative person. Seeing your profile, it seems like you want to become a moderator or something, because you manage to take a position on every case - even without having a clue... As for "lazy" - I've been working 18+ hours a day lately. I DO NOT have time to read white papers or forum posts (which is what you seem to be doing most of the time...). And I'm not making up rules - as far as S@H goes, I'm going with the flow. If not for the failures, I would not be posting. Try not to defend the project blindly. Sometimes I feel like I'm in a socialist regime - maybe it's our (the users') fault that by joining and crunching we put stress on the S@H servers and cause them to fail??

[quote]The goal is not to make the SETI servers more reliable, but to understand that people's expectations are far off base. SETI never promised to have work available all the time. Again, due to a lack of reading or comprehension.[/quote]
It seems like you are the one with a comprehension problem, or you are unable or not patient enough to read the full context... When I start to read a post, I read it fully... And I don't think I have a comprehension problem.

[quote]More reliable servers (and a steady flow of work) are desirable, but they're things we'd like, not necessarily based on reality.[/quote]
Nobody expects you or the S@H staff to invent new servers - sometimes it's a matter of a new approach to solving problems (e.g. combine machines to solve problems, double them to increase stability and QoS, etc.). Let's face it - those guys at Berkeley ARE creating the trends in the distributed computing world and thus creating the distributed computing reality (they made S@H and BOINC - wonderful things I'm participating in, of a complexity I can barely imagine - they just did it). Servers and hardware are just tools - they have to do their best to reach the goal with the tools they have, and if that isn't succeeding, they may have to change the way they use those tools (i.e. figure out new ways to overcome problems with the hardware/software at hand).

[quote]SETI@Home (and Serendip) get a "free ride" because there is no money for telescope time -- they don't control the telescope.[/quote]
100% true, and I agree - those are "piggyback" projects.

[quote]When the telescope is used to transmit (planetary radar) it's off. When they're doing maintenance it's off. When they get data, they get whatever they get.[/quote]
I know all that - I've been in the project as long as you have.... Don't treat me as a newbie...

[quote]The BOINC client has a cache to keep work when things are broken or work is unavailable.[/quote]
Once again - I don't want to create a huge cache on my numerous machines when doing burn-in tests - it would make the situation worse - so I'm using a 3-day cache. If I'm unable to fetch/send results for longer than that - go figure... People with a small RAC may not even notice that something is wrong, but people/maniacs :) building super-machines will detect it instantly. I saw posts where someone claimed that even a fully working project cannot keep up with the demand of their machine (probably when you have 3x Tesla/Fermi and a GTX295 or some Radeons, add 8+ cores and overclock all that, the machine may exceed the allowed number of WUs per day per machine).

[quote]Uploads and downloads are scheduled to try to minimize the peak server loads and deal with congestion and outages.[/quote]
They are fairly random with respect to the number of users, get postponed when there is a problem with the connection/servers, and create a fairly constant load on the servers.

[quote]Reporting is done in batches to limit transactions to what will fit with the relatively tiny servers.[/quote]
Yes - in intervals. And those may be user-enforced (and accepted server-side or not...).

[quote]... and the overall BOINC design accepts that this won't always be enough to keep everyone busy.[/quote]
The BOINC design, yes - very often the people, not...

[quote]Why? Because it's cheaper than building high-reliability servers -- by 10x or more.[/quote]
At some point it's a matter of Quality of Service - one may not find it amusing or interesting to participate in a project that is constantly failing.

[quote]Redundancy is wickedly expensive.[/quote]
So why did the S@H staff decide to double the science database and create backups? Maybe it should all be run on a single-drive PC - we would have a project starting from scratch once every month or so... Redundancy is one of the solutions used in the computing world to get decent uptimes, have failover and build robust systems (i.e. the idea behind RAID, PSU redundancy, server "heartbeat" monitoring). Sometimes you just have to do it. It may not be cheap, but sometimes it's worth spending some extra money, even without expanding capabilities, just to increase reliability.

[quote]But, most people would rather make up the goals instead of read what BOINC or SETI@Home actually says.[/quote]
...Have you read all the articles by the S@H/BOINC staff? The articles from the Planetary Society? Maybe with some NSF/UC Berkeley statements/papers/budgets added? (It seems like you've definitely read all the posts - I didn't even manage to do that... :) OK then - you've read more than I did.) In that case I'm making up my own goals - I'll live with that. If it comes to the point that my "made up (expected) goals" are that different from what the project offers... then I'll humbly accept Bill Walker's kiss-off goodbye proposal...

Cheers & keep on crunching
Profi

p.s. As for this thread - I will not post or reply here anymore. I simply don't have the time - sorry. Got to catch up with my duties... I feel like I'm stressing the S@H servers with such long posts.. :)
4)
Message boards :
Number crunching :
Is it really over?
(Message 980762)
Posted 19 Mar 2010 - Post: OK - just to comment on the post from Bill Walker (Ned Ludd tomorrow - sorry, 4:30 a.m. here... :) - got to be at work at 8:00 a.m.):

[quote]I agree it would be interesting to see the data further back. Does anyone have it? There are graphs of both users and hosts on Scarecrow's site, my comments apply pretty much to both.[/quote]
OK. Maybe if we had some more data we could judge more precisely... On the other hand, it's just my feeling of how the whole thing is evolving....

[quote]Maybe, maybe not. Need more data. Lots of us here in the 10 year club...[/quote]
Agree. Data needed. Maybe a count of new (active) users vs. users without credit within, let's say, 60 days (quitters) would shed some light here... My proposal was just an example..

[quote]According to BOINC Stats and Scarecrow's statistics over half of all the active BOINC users are active SETI users. The other 40% or so are spread over more than 50 other projects. SETI's position is safe for a while.[/quote]
For a while... we'll see how things go if the current failure rate continues.... It would be worth checking the growth ratios of the other projects.. (yeah - I know that not everyone quitting S@H will join another project...).

[quote]I think S@H has proven that very old second hand servers, running freeware and volunteer written code, have limits.[/quote]
Second hand?? You mean betas, or engineering samples?? If you are saying that, for example:
* jocelyn: Sun V40z (4 x 2.2GHz Opteron, 28 GB RAM)
* mork: Intel Server (4 x six-core 2.13GHz Xeon, 64 GB RAM)
* thinman: AMD Server (2 x 2.4GHz Opteron, 16 GB RAM)
* thumper: Sun Fire X4500 (2 x dual-core 2.6GHz Opteron, 16 GB RAM)
* vader: Intel Server (4 x dual-core 3GHz Xeon, 16 GB RAM)
are very old "crappy" machines, then I think you're sorely wrong.. I work on cheaper/weaker machines most of the time.. And I totally agree - even those machines have their load limits (especially when the excessive load is a long-lasting one). Every machine composed of mechanical parts (stresses, wear) and electronic parts (unsteady states, parameters that change with time, e.g. the capacitance of electrolytic capacitors) has its limits - no doubt here. One has to figure out how to provide failover protection (either by doubling machines, distributing their load, or by other means).

[quote]How will any more users add to that knowledge?[/quote]
A law of complex systems... The S@H staff are finding problems in commonly used freeware software that other users never reach (most of them don't come anywhere near the stress put on it by such a complex and demanding project... They are hitting table limits, concurrency problems and many other issues that 99% of people are not even aware exist).

[quote]How can a very limited staff of scientists do science if they have to spend time coddling annoyed users?[/quote]
OK - to put it bluntly - we are their tools for doing science... It's good practice for them to keep people informed (just a short note) when they have problems - an action taking 2 minutes; nobody is so busy that they have to neglect 250,000 users wondering what's going on.

[quote]To finally manage to process all incoming data in the realtime (or close to that)?[/quote]
OK - for me it's self-explanatory. Having instant results from the scientific data input is a scientist's dream - believe me.. No scientific proof needed here... To process all the incoming data from Arecibo in real time, an extreme amount of processing power is required - so yes, time and a colossal amount of money are required, but thanks to technological evolution, CUDA and Stream technologies are affordable for common users, giving S@H a great tool to reach that goal...

[quote]Time and money are very finite resources at S@H.[/quote]
Agreed.

[quote]Every minute and every penny spent on keeping users contented takes away from real science.[/quote]
Not true - it's not "keeping users contented", it's using an uncertain tool in the proper way. S@H science would be nothing without users "kept contented"... (there would be no S@H science at all, or it would be marginal).

[quote]A balance must be struck.[/quote]
A balance - that's what has lately been forgotten and is now being restored... Users (and their machines) ARE the true power of S@H - it has been underlined many times by the S@H staff, and it's well self-explanatory. The more users, the more power - and the more power, the bigger the things you can achieve. I'm trying to be a little more foreseeing... that may become a limit if nothing changes.

[quote]How will adding more users today help us when Arecibo is shut down,[/quote]
We still have plenty of old data to process at enhanced resolution... Arecibo is not shut down forever - when the collection of data starts again, almost ANY number of users/machines will be needed to crunch that amount of data... In the meantime, there is PLENTY of work to do....

[quote]or when the last of the old servers finally dies?[/quote]
Example: you can estimate the workload/CPU power needed for the replacement based on the demand growth from the users - it's just an example....

[quote]Need to spend some time on those topics too, and right now.[/quote]
I have spent enough time on this now, I think..

[quote]All it would produce is more people griping about lack of work, lack of respect, lack of customer service, etc. I hope they have found more important things to spend their time on.[/quote]
Science? It is nothing without users: no users, no computing power; no computing power, no science; no science, no funding; no funding, no new staff.

[quote]Volunteers are lined up and waiting right now, that issue can wait.[/quote]
Returning to square one.... Waiting for what? For WUs to come? If the situation doesn't change, people will start to leave (it will be a slow process) - BOINC is a perfect platform for starting a journey with another, more stable project.. at least I think so.

[quote]I expect S@H would love to have Edison's funding, corrected for inflation. Who said they're giving up on the science?[/quote]
Edison was just an example - if you can't find the deeper meaning in it, I can't help that... So you're claiming that the S@H staff should neglect the opinion of active users and cut off all social relations with the S@H user community? That they simply should not care about the project's growth rate? I'm really sorry for you.... And feeling the pressure (apart from stressing the staff a bit) drives changes and improvement. What do you mean by "vaguely defined 'outsiders'"?

[quote]Certainly not public relations, unless it generates new funding.[/quote]
And in the case of distributed projects, I think the number of active users is a very important parameter for getting appropriate funding...

[quote]In my moments of quiet despair about this project (and I do have them) I wonder if part of the experiment is to push the users, and see at what point we all give up. As I said before, we are not there yet. The 10 year club survivors all appear devoted and passionate after all these years. So... we have a lot in common after all....:)[/quote]
Glad to hear that.

[quote]If this bothers you, goodbye. Thanks for the CPU cycles. There is somebody standing in line to take your place.[/quote]
[quote]So the fact that you spend more money on computers makes you a better person than me, and gives you a bigger say in how S@H is run? Seriously?[/quote]
Nope - I never claimed that.. And as for how S@H is run: comparing our RACs, I struggle roughly 10x harder with project failures than you do. If you are not crunching intensively enough, you may not even notice that something has happened to the project. If you reach 150K of RottenMutt's RAC, or my 6K (it was better, I know - a shortage now :)), you'll see... Seriously...

[quote]Maybe we have an issue in language usage here. It sounds to me like you are griping.[/quote]
Nope - we don't have a language gap here... Being angry or upset is a bit different from caring about something, pointing out that something isn't right, and stating one's own expectations in a non-offensive way. Yet again - if you see my position as that of a "griping person", I can't help it.

[quote]Very clear. You expect S@H to provide a free service to you, for business purposes. What happened to volunteer devotion and spirit?[/quote]
OMG - you're so wrong, very wrong.... "To provide a free service to you, for business purposes" - yes, as a useful piggyback: when it works, it's a superb tool for simultaneously stressing CPU+GPU - and yes, mutual profit - I get tested hardware, S@H gets crunched data. "Volunteer devotion and spirit" - I think I still have it - otherwise I would not spend time posting this message (see the header of the message) and would not run S@H on many machines 24/7..

[quote]The fine print was just an attempt to lighten up a conversation that I feel has become much more serious than it deserves to be lately. I know humour doesn't always translate well.[/quote]
Yep - the pressure rose a bit... now everything is returning to nominal... I have never seen inserted formalism lighten up a conversation - usually it creates the opposite effect. The humour translated well here - I even added a smaller footnote (and with colours)... The other side is that I think you have to trim your sense of humour when referring to things your interlocutor cares about (even things merely in the proximity of them) - so that's where the transmitting function of the language succeeded (message with context received and understood) and the perception and processing of the information failed (jokes refused when speaking about things that matter).

Keep crunching
Profi

P.s. I think this case is closed (at least for me). I'll not reply to this any further.
5)
Message boards :
Number crunching :
Is it really over?
(Message 980571)
Posted 18 Mar 2010 - Post:

[quote]Eric posted at least twice this last outage. That is a 100% improvement over the last outage. Things are looking up.[/quote]
Yep - it's cool to know what's going on...

[quote]Now if the Campus will fulfill the rumor of a gigabit connection between seti and the world, We could at least get rid of one problem and I'd think It would make up for what the campus did recently to our connection. :D[/quote]
:) - yep, agreed. I think that full gigabit all the way would be a partial solution to all the problems.

Cheers,
Profi
6)
Message boards :
Number crunching :
Is it really over?
(Message 980563)
Posted 18 Mar 2010 - Post:

[quote]I'm always amazed/amused at all the people who keep predicting that SETI users will leave in large flocks because of the "bad service". Take a look at Scarecrow's graph of users. In particular, look at the graph at the very bottom of this page.[/quote]
That's only 2 years - look at the growth rate back when S@H was Classic; the growth was amazing then... You've been around long enough (even longer than me) to know that. I don't think it will be rapid, with everybody resigning at once. I think people will start to think about changing projects, and some of them will switch. The project should grow - right now the growth rate is minimal.

[quote]The number of active hosts is pretty flat, while the total number of hosts climbs slowly over the last 365 days (which includes some major "bad service" events). This tells me that there is some turn over in users who sign up, give it a try, and then quit for whatever reasons.[/quote]
Or new users sign up and stay, while the same number of people get upset or resigned because of the frequent failures and leave the project... And active users attach more CPUs..

[quote]Since S@H has a hard time generating enough work for this existing user base, why would they bother with any extra effort to retain more users or generate new users?[/quote]
To be the biggest project and keep that position? To always be on the frontier of science and distributed computing? To finally manage to process all incoming data in real time (or close to it)? To enhance the search area (spatially and spectrally)? You choose - it's not hard to figure out.. Users (and their machines) ARE the true power of S@H - it has been underlined many times by the S@H staff, and it's well self-explanatory. The more users, the more power - and the more power, the bigger the things you can achieve.

[quote]All it would produce is more people griping about lack of work, lack of respect, lack of customer service, etc. I hope they have found more important things to spend their time on.[/quote]
Like what? That's how it works. If Edison had given up on the 15th burned-out light bulb, you would still be sitting by candlelight. And feeling the pressure (apart from stressing the staff a bit) drives changes and improvement. What could be more important than pushing the hardware to its limits, testing DB engines, figuring out new distribution/scientific algorithms, using new hardware for data processing - especially when running a scientific distributed project? Certainly the purpose is not to gather unconcerned people... There is no point in that - the project is held together by the great idea, the devotion of the users and the passion of the staff... otherwise it would break and fall apart.

[quote]If this bothers you, goodbye. Thanks for the CPU cycles. There is somebody standing in line to take your place.[/quote]
It bothers me (in a positive way) - and it's a pity that people like RottenMutt want to quit because of project failures.... Those people are legends (my hat off here). Certainly you're not the one who should say goodbye to me.... Find another tissue to wipe your nose... Crunch about 10x the credits you have and then we'll start chatting.. Instead of posting, start crunching... It seems like leaving your computer idle would do more good than writing kiss-offs and goodbyes - because to me you're totally wasting your computer's cycles. And is this "somebody" maybe you? Because being with the project over 10 years and earning not even 400K is like being semi-idle... and waiting in the queue.. I'm not griping, and neither are most of the people - they are concerned, not picky (and maybe a little resigned if failures are too frequent).

Apart from using idle cycles, I want to use S@H to burn in my new hardware. I was just saying that, due to the failures, I'm unable to use S@H for this purpose and will have to find another solution for testing (although the S@H client stresses CPU+GPU pretty well - I wouldn't want to change it unless forced to). It's not a matter of complaining - it's a matter of running a business, money, and fulfilling testing duties. I have to have a decent, reliable tool. Hope I made myself clear on this one..

Profi

[quote]Fine Print: this polite kiss-off reflects the views of this user only, and is not the official position of this station, this network, or its sponsors.[/quote]
I can do an even smaller Footnote: And I can add colours too. Your kiss-off was not polite - even if "polite" words were used. Sometimes the tone of voice and the composition of a sentence say more than the words spoken. I don't mean to discourage or offend you - keep crunching - but if you want to say goodbye to me, save it.. And you don't need to make a "Legal Notice/Statement", since you're not representing "this station, this network, or its sponsors".
7)
Message boards :
Number crunching :
Is it really over?
(Message 980427)
Posted 18 Mar 2010 - Post:

[quote]I'm not saying it isn't bad right now. I'm just saying it has been worse.[/quote]
It's supposed to be WAY better..

[quote]I've been told that there was a time during Classic where the servers were down nearly a month.[/quote]
I don't remember a month-long break, but a week or so - yes.. But even then people were informed through various channels - they knew what was going on (more or less)..

[quote]The important thing to take from all this is that the project never guaranteed 100% uptime or that workunits would be available at all times.[/quote]
By "guarantee" do you mean "expect the proper functioning of a device or service one has paid for"? OK, I agree - nobody sent me a written statement, and it's not in the License Agreement shown when installing the BOINC client... Nevertheless, it should be getting better, and even without any formal guarantee the project should be getting very near to 100% uptime - lately it certainly is not.

[quote]If you're referring to Classic, then you're sorely mistaken. It was not on very rare occasions, but just as problematic as it is now. Some times were worse than others back then, just as they are now. People forget.[/quote]
I really don't think the project's complexity has stayed the same - it has grown, and that produces new failure points.. If people forget... let us forget now, but somehow these downtimes are not letting me forget (how can one forget if it happens every week)??

[quote]No, because they didn't share as much with us. We did just as much work with less information. Computers were so slow back then that the project could be down an entire week and most people would still be working on the same workunit. This is one of the reasons why people don't remember very well.[/quote]
?? I felt very well informed.. had plenty of WUs to crunch... totally satisfied... Now the situation has changed - moving WUs to another machine is problematic... so due to server downtimes, machines start to drain their caches and sit idle.. and keeping machines occupied is what the project is for... (And I don't want to participate in another project - I want to FIND ET :) )

[quote]Yes, they are much smaller. None of them have over 297,000 active participants with such a rabid following, building crunching farms with multiple nVidia cards just to increase their numbers (not saying it doesn't happen - just not on the scale it happens here on SETI).[/quote]
Yes - agreed. But people building monsters with 4 GPUs and 10+ CPU cores will quickly find that there is no point in participating in a project that is unable to keep up with the work demand of such beasts. :) The project will shrink if a single S@H server farm cannot keep up..

[quote]I would even bet that most of those projects, being newer and all, have started with newer hardware - possibly even hardware that isn't donated or in beta. They have had the opportunity to spend their NSF money on decent stuff. SETI, on the other hand, uses donated/beta hardware all the time.[/quote]
I think they are running on donated/beta hardware as well - that's the way it goes. Isn't S@H also funded by the NSF? Please find another project that has so many major players behind it - Intel, Overland, HP.... For S@H, getting great hardware from them is just a matter of asking.. (BTW - they are donating hardware.) There are also donations from members, which strengthen the project's funding..

[quote]SETI paved the way for all of them, and continues to pioneer low budget distributed computing.[/quote]
As for "paving the way" - totally agree.

[quote]I'm aware that S@H is a HUGE project, but technology has found solutions to such issues. I think that Matt et al. and the BOINC 2-rack farm are not able to support the demand of such a number of crunchers and mangle the science data. I feel there is a need for an additional layer in the S@H processing (i.e. sites with WUs) which would distribute work for, let's say, a geographical region, and users should be allowed to choose a couple of them (to have redundancy).[/quote]
[quote]The problem with this idea is that all the workunits must come from a single source (the tapes/hard drives at SETI). Likewise, they must return to the same source. This suggestion just moves the problem around but does nothing to actually solve the bandwidth issue.[/quote]
I don't think so. The data comes from the common source - the sky - it is stored, and as raw data it may be shipped to Berkeley, but after that it may be sent on to other sites for splitting and other mangling (sending/validating etc.) to offload the Berkeley site. The results have to be returned to Berkeley to create a uniform science database of collected data/results. I think that splitting/sending/validating/assimilating and other actions on such small objects as WUs create so much stress on HDDs, OSes, Apache servers and network hardware that it fails on a frequent basis (it's rather like doing random r/w with hundreds of DB queries while running a web service with hundreds of simultaneous connections - it has to kill a single machine, or a single small farm of machines, sooner or later..).

[quote]Sounds like a lot more technology going toward the solution, which means even more points of failure in the system.[/quote]
Yes - but the probability that 2 or more sites will be down at the same time is low...

[quote]Plus, in order to "synchronize" with Berkeley, they'd be using up some of the bandwidth, which simply moves the problem around but does nothing to solve the issue.[/quote]
You are distributing the workload and the network stress. And processing a single request/session (or a couple of them) at a time is very different from mangling hundreds of connections at the same time..

[quote]If we made the system that complex, how would that be big science on a small budget?[/quote]
The two things have very little in common. System complexity is one thing, the science side another. Conflating the two is a shortcut. I know people doing great science with simple tools, and people wasting their time harnessing huge complex tools to solve little problems.

[quote]The problem is that you're trying to find ways to make the server have more uptime. This is contrary to the entire thesis of BOINC. The idea is that you do not need a server with 100% or even 80% uptime.[/quote]
Not true! I think you're missing the point. What is the "entire thesis of BOINC"? I think it's to create the Berkeley Open Infrastructure for Network Computing - a tool for scientists or developers who want to run a distributed project - to make it uniform, thus providing a platform for users to participate in more than one project if they are willing to. If not, BOINC is a "carrier" of science application(s) - so still, if one wants to participate in a single project, it's desirable for the project servers to keep up with the work distribution. So yes - I would like to see S@H with 98%+ uptime.

[quote]The goal is not to make the SETI servers more reliable, but to understand that people's expectations are far off base. SETI never promised to have work available all the time.[/quote]
I don't believe what I'm reading here... Since the last failure 7 hours have passed, and I was not able to get any WUs for new machines or send results for 4 days (3 days of upload failures, a 6-hour break with S@H working, then 5 hours of failure)... And you're saying that "the goal is not to make the SETI servers more reliable" - so what is the purpose of this project if the servers are "failing crap"? What is the point of messing with people's heads if, for example, the servers lose the fruit of their machines' work through frequent failures/misconfiguration? I DO NOT expect S@H to work like my cardiac pacemaker :) - I just want it to be up 28-29 out of 30 days in a month (or close to that) - do you think my expectations are too high? S@H has become something more than just a scientific project - everyone involved knows that. It's a community of people who are very devoted, wish it all the best, and admire the work of the S@H staff. If you're running a project of that scale, you must accept the consequences - you have to bear in mind that the pressure on you from a huge number of participants will be extreme. You don't have to make an explicit "promise" or declaration - it's self-explanatory. And it's quite obvious (at least to me) that the pressure rises when something fails.

[quote]The sooner that is understood, the sooner people wouldn't get so upset or angry when something happens.[/quote]
No, no, no.. sorry - I do not agree with such an attitude. Being upset, worried or excited about something means that you care. Silently accepting failures, especially when they occur so frequently, is burying your head in the sand. Saying that something is wrong with the project is not just grumbling - at least someone (in this case at least you, OzzFan) will read it and know that not everything is OK.

Cheers,
Profi

p.s. It seems like everything got back to normal after 3 days of upload failure, 6 hours of project uptime, then the thumper crash and 5 hours of RAID resync... The threads predicting the next failure times are really sarcastic :] and confirm that I'm not alone...
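For reference, the "28-29 out of 30 days" expectation converts to a fairly ordinary uptime target; a one-line arithmetic check (no project data involved, just the numbers from this post):

```python
# Convert "N days up out of 30" into an uptime percentage.
for days_up in (28, 29, 30):
    print(f"{days_up}/30 days up -> {days_up / 30:.1%} uptime")
# 28/30 -> 93.3%, 29/30 -> 96.7%, 30/30 -> 100.0%
```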
8)
Message boards :
Number crunching :
Is it really over?
(Message 979629)
Posted 17 Mar 2010 - Post:

[quote]Patience is a virtue.[/quote]
Yes, and one that gets tested quite frequently :)...
9)
Message boards :
Number crunching :
Is it really over?
(Message 979627)
Posted 17 Mar 2010 - Post:

[quote]Stable? Something tells me you haven't been paying attention.[/quote]
Yes, in terms of continuous uptime vs. downtime caused by hardware failures. Now it's very frequent. Every week we have an outage, and on top of that all the failures - I don't remember a month without a failure.... Check the weekly graph on Cricket... Do you feel that's the way it should look? I don't think so... And S@H Classic coped way better - but that's just my impression.... (Does anyone have an old Cricket graph to post here??)

[quote]There have been worse outages than this one. The project has never been stable in the sense that most people think.[/quote]
Outages used to happen on very rare occasions - mainly due to hardware failure. Now I think the crew have too many screws to turn - a web server, database, filesystem, DNS policy, routing. Just when I start to have the impression that the BOINC Berkeley farm has stabilized - boom - here we go... RAID out of sync..... ok, fixed.... 5 days later, now what - a mount lost..... next... uff, an outage (this time planned...) - 3 days - a new Fedora is out, let's update the system.... oops... mysql got updated and is screwing up the DB.... a root partition filled with logs... a never-ending story. Do you remember such things during S@H Classic? (I remember a few quite long failures and a few DDoSes/hacks - that happens.) The scenario above is just an example... And how come other projects (Rosetta, Einstein, Milkyway) have such long uptimes? Yes, they are smaller - but compared to S@H they also have much less experience.. and usually less (or a comparable amount of) money (NSF funding etc.) - but that's just my guess.

I'm aware that S@H is a HUGE project, but technology has found solutions to such issues. I think that Matt et al. and the BOINC 2-rack farm are not able to support the demand of such a number of crunchers and mangle the science data at the same time. I feel there is a need for an additional layer in the S@H processing (i.e. sites serving WUs) which would distribute work for, let's say, a geographical region, and users should be allowed to choose a couple of them (to have redundancy). WUs should carry signatures telling from where they were downloaded. Those sites should synchronize with Berkeley to make the science DB inserts.. I know - it's just a rough sketch.. but someone from the crew should think about it... before people lose their nerve and switch to another project. In plain words: if you have a HUGE problem, try to divide it into smaller chunks and solve each one of them (not possible in every situation, I know) - that's why Beowulf clusters, CFD/MES domain decomposition and, basically, distributed computing came about - there has to be a way to decentralize the project, thus providing better failover, load distribution etc.

Cheers
Profi
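To make the "signature telling where a WU was downloaded from" idea slightly more concrete, here is a purely hypothetical Python sketch; the structure, field names and site list are invented for illustration and do not correspond to any real BOINC or S@H data structures:

```python
# Hypothetical illustration of tagging work units with an origin site, so that a
# central science database could later reconcile results coming back through
# regional mirrors. None of these names are real BOINC/S@H structures.
from dataclasses import dataclass
import hashlib

MIRROR_SITES = ["berkeley", "europe-1", "asia-1"]   # invented example sites


@dataclass
class WorkUnitRecord:
    wu_name: str
    origin_site: str

    def signature(self) -> str:
        """Cheap origin signature: hash of WU name plus the issuing site."""
        payload = f"{self.wu_name}@{self.origin_site}".encode()
        return hashlib.sha256(payload).hexdigest()[:16]


def assign_site(wu_name: str) -> WorkUnitRecord:
    """Deterministically spread WUs across mirrors (hash-based round-robin)."""
    idx = int(hashlib.sha256(wu_name.encode()).hexdigest(), 16) % len(MIRROR_SITES)
    return WorkUnitRecord(wu_name, MIRROR_SITES[idx])


if __name__ == "__main__":
    for name in ("example_wu_0001", "example_wu_0002"):   # made-up WU names
        rec = assign_site(name)
        print(rec.wu_name, "->", rec.origin_site, rec.signature())
```

The point of the sketch is only that an origin tag plus a deterministic assignment rule would let a central site know which mirror handed out (and should collect) each work unit.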
10)
Message boards :
Number crunching :
Is it really over?
(Message 979528)
Posted 16 Mar 2010 - Post: I wonder how many times S@H will bounce before people resign en masse and switch to other projects... Unable to upload WUs for over 3 days now.... Starting to think about changing projects.... And please don't give me the "small budget/manpower" or "consider joining another project" lines - yeah, I know - but instead of joining another project I'm starting to think about switching projects entirely. The idea of SETI@Home is (still) very attractive, and that's why I joined S@H Classic over 9 years ago. It was solid, stable, fun and involving. Now S@H has started to become a victim of its own success - downtimes take 20% of the total time, servers failing/overheating, power breaker failures, air conditioning failures, broken fiber optics.... you name it - every sort of disaster available in the computer world... on a very frequent basis... And people at SSL are saying "we're running the largest distributed project to date".... OK - if the project stays in its current shape, I don't think it will last much longer. Perhaps it's time to stabilize the project somehow (instead of adding new features, scientific code, the DB-clogging NTPCkr and stuff like that)? Maybe a change of approach to such massive data processing should be considered (i.e. distribution of the workload to other sites, etc.)? I'll wait for a day or two.... Since S@H is my primary burn-in test, the WU distribution has been failing for such a long time, and it now happens that I have 8 brand-new PCs to test, I'll have to find a way to test them - either with S@H or without it..

Cheers,
Profi

p.s. I'm not panicking, so "Panic Mode on" suggestions are not for me either..
11)
Message boards :
Technical News :
Working as Expected (Jul 13 2009)
(Message 918305)
Posted 16 Jul 2009 - Post:
I'm not putting any words in your mouth - I'm not claiming that you said so - it's just my perception of what you said, or maybe more correctly, what I understood of what you said.
Totally agree.
1) This happens - what worries me is the downtime frequency nowadays. It's not time-critical work - and it never will be.
The signals are decades old - so? The very same workunits have been mangled several times, each time the processing algorithm changed - no big deal. If they're not processed with the new app now, they certainly will be.. with increased resolution, sensitivity, chirp rate, etc. There are so many knobs to turn to improve the quality of the search - that just wasn't possible when S@H started - now we have our "big shiny powerful" machines...
Not forever, nor on a frequent basis - if you bear in mind that S@H has a big connection / disk I/O problem, outages do not help. It may happen that the project does not recover between outages, and then people will get more and more frustrated. The BOINC people encourage participants to devote CPU time to many projects, but there are people (including me) who wish to dedicate their CPU cycles to one project (or at least the lion's share to one of them).
As mentioned above, here is an example cause-and-effect chain: machine is crunching mostly S@H -> the WU queue has drained -> machine is waiting to upload results to get new WUs -> due to the outage the machine is idle -> switch to another project to occupy the idle time, or wait for the project to recover. If the last step repeats on a frequent basis, one may switch to a more stable project.
As for me - more worried than upset - no harsh feelings towards the S@H team - they are doing exceptionally good work with what they have. And it seems like S@H has grown beyond its developers' imagination. Unfortunately, the popularity of S@H may (or may not) halt, or at least significantly slow, the growth of the project.
Uff.... though you must have to think quicker.. :) - no, don't get offended.. just kidding. Keep up the good work - all help is appreciated.
Sure - track 'em. Since the number of S@H staff members is so limited, they may not come up with an idea, or may not have knowledge as broad as that of 180k users.
Anxious - maybe a bit; worried - sure, I am a bit S@H-addicted.. :P Angry - certainly not - there is no one to blame; moreover, I'm doing this mostly for fun. Gigabit is one solution, but looking at the progress rate of SETI it is only a temporary one - S@H will run into the hardware problem again in the future. I think the better way is to distribute the backend of the project over many sites, preferably spread across the globe.
12)
Message boards :
Technical News :
Working as Expected (Jul 13 2009)
(Message 918281)
Posted 15 Jul 2009 - Post:
OK - so spread the workload. Many people have started their own BOINC projects on the basis of the BOINC platform developed at Berkeley (usually with help from the BOINC developers) - I imagine there is a possibility to launch a site to support the clients from Europe, another site for Asia, etc. within S@H, especially with support from Matt and the other brilliant guys running SETI now. There must be a way to distribute the workunits to different sites...
Dropping a line once a day is not a big deal. A quick note is enough, e.g. "We have an upload server problem and we are working on it", and after the problem is solved, some more elaboration on what the problem was and what solution was found. I'm against a situation in which people who don't have experience with S@H, or sufficient computer knowledge, don't know what is going on. I realize such a note may ignite some discussion, but at least people will know what the situation is. Cheers,
13)
Message boards :
Technical News :
Working as Expected (Jul 13 2009)
(Message 918271)
Posted 15 Jul 2009 - Post:
Good for you. Agreed. As many people, as many approaches. Mine is different - less "I don't give a s*** about....". I care whether S@H is working or not - for you it seems to be "if it works, it works - cool.... if not - ok, who cares - it's just a (stupid) project". It's always a question of how you see things - one person sees the pot as half full, another as half empty.... you know what I mean? (Please don't argue about who has which approach - it doesn't matter here.)
NO - that's your opinion. If the project aims to be a distributed project, its designers have to bear in mind its growth rate, the increasing number of users and many other things. One of them may be the distribution of the servers that hand out WUs and collect results - many projects have done so because they realized they would not withstand the demand from so many clients. 180,000 clients is not a huge number if you consider OSSI or other services... The number of servers has to be increased and load-balanced one way or another. Imagine that processing power is increasing at a geometric (exponential) rate - it doubles every 18 months - and in addition you are enabling technologies such as CUDA.... There is no way that a single farm of servers (limited to one or two racks) and a single 100Base-T connection (sigh...) can support such a project. The solution is either more load-balanced servers and a much faster connection, or distribution of the load to many sites with information exchanged between them (I still don't know how - I realize this issue was already discussed with Matt and "atomization" is a barrier to overcome - it simply has to be done one way or another; suggestions welcome here). I know Matt is trying to squeeze as much as he can from what he's got, but he's only human... (give some ideas here to help him).
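A rough illustration of the "doubles every 18 months" point: under that single assumption, aggregate client demand outgrows any fixed server capacity within a few years. The starting figures below are arbitrary placeholders, not measured project numbers:

```python
# Illustrative only: how quickly exponentially growing client-side demand
# overtakes a fixed server capacity, assuming compute demand doubles every
# 18 months. The absolute numbers are arbitrary placeholders.
server_capacity = 10.0        # units of work the farm can feed per day (arbitrary)
client_demand = 4.0           # units the user base consumes per day today (arbitrary)
doubling_months = 18

months = 0
while client_demand <= server_capacity and months <= 120:
    months += 6                                   # advance half a year at a time
    client_demand *= 2 ** (6 / doubling_months)   # 6 months of exponential growth

print(f"Demand passes a fixed capacity after roughly {months} months "
      f"({client_demand:.1f} vs {server_capacity} units/day).")
```

With these placeholder numbers the crossover comes after about two years; the exact date depends entirely on the starting ratio, but the shape of the curve does not.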
If you try to keep a project like S@H localized in one place, you have to have an infrastructure that keeps up with the demand - otherwise you'll end up with the problems currently plaguing S@H. I imagine that among those 180,000 active users there are people/institutions able to support the distribution of the workload (WU distribution, collection of results, etc.). The current demand of the S@H clients exceeds the capability of the SSL S@H hardware, and it will only get worse as time goes by. So for project continuity some solution has to be found now - otherwise, because of the downtimes, people will migrate to other projects such as Folding@Home, Rosetta@Home or World Community Grid (a list of projects can be found at www.distributedcomputing.info). And concerning limited resources - I know it's always money.... and I know that Matt et al. DO NOT HAVE the money to buy new equipment - so they have to come up with an idea that overcomes the current problem. Cheers,
14)
Message boards :
Technical News :
Working as Expected (Jul 13 2009)
(Message 918133)
Posted 15 Jul 2009 - Post:
OK - you have an ambivalent approach to the whole situation. Fair enough. Nevertheless, I expect S@H to work; otherwise it's a bit of a waste of time for me :). BTW, the subject may be obsolete, because something is going on (the server is no longer just dropping connections with connect() failures or HTTP errors - it's up, but probably overwhelmed with requests). Cheers,
15)
Message boards :
Technical News :
Working as Expected (Jul 13 2009)
(Message 918070)
Posted 15 Jul 2009 - Post:

[quote]Matt, do whatever you're doing, but:

And how do you expect S@H to work?[/quote]
My answer is as simple as this: until now, for 80% (or more) of the time, S@H was doing its job - sending WUs, receiving results, doing the science mangling on the results, updating stats, etc. Scheduled downtimes (let's say 10% of the time) - OK, we know they're a must, and we have all been notified. Normally about 1% (and lately about 9% of the time - the "9" is for people who like arithmetic) is time lost to failures due to hardware/software/human error. In other words: usually S@H worked - now it doesn't (the project is not performing the way it is supposed to, or was designed, to work). I'm just worried that as time goes by the situation is getting worse, despite improvements in system/computer design and computing efficiency. Hope this sheds some light on the situation.

p.s. I may be wrong. I'm just concerned about the frequency of the downtimes and the lack of info. I wish all the best to Matt and the other guys at Berkeley trying to get the most out of the limited hardware resources.
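Since the post invites a bit of arithmetic, here is the stated time budget simply converted into hours per 30-day month (the percentages are the ones given above, nothing measured independently):

```python
# The rough time budget from the post, tallied up in hours per 30-day month.
budget = {
    "working (sending WUs, receiving results, science mangling)": 80,
    "scheduled downtime": 10,
    "unscheduled failures (lately)": 9,   # was ~1% before
}
hours_in_month = 30 * 24
for item, pct in budget.items():
    print(f"{item}: {pct}% = about {hours_in_month * pct / 100:.0f} hours")
print(f"unaccounted: {100 - sum(budget.values())}%")
```

Put that way, "9% of the time" is roughly 65 hours of failures in a month, which is what the complaint boils down to.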
16)
Message boards :
Technical News :
Working as Expected (Jul 13 2009)
(Message 918049)
Posted 15 Jul 2009 - Post: Matt, do whatever you're doing, but:
1) drop a line just to keep people in the loop
2) disable deadline checking, or people might not get credit for all the hard work their computers have already done

Cheers
Profi

p.s. The situation is certainly the opposite of the thread title, "Working as Expected" - especially when looking at it from a cruncher's point of view..
17)
Message boards :
Technical News :
Weirder (Sep 06 2007)
(Message 635334)
Posted 7 Sep 2007 - Post:

[quote]Guess what? A *second* drive on thumper failed this morning, around the same time the other drive failed yesterday. This system is on service, so we should get some replacements soon. But there's no obvious sign of why these two failed in such close succession. They were both on the same drive controller, but there's a 15% chance of that happening at random. The temperatures all look sane.[/quote]
Thanks, Matt, for the update... Anyway, from a brief analysis, SETI@Home has an MTBF of approximately one week....

-Profi
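The quoted 15% figure is consistent with the Thumper's drive layout if one assumes the usual Sun Fire X4500 configuration of 48 drives on 6 controllers (8 per controller) - that layout is my assumption, not something stated in the thread:

```python
# Probability that a second (independently) failed drive lands on the same
# controller as the first, assuming 48 drives spread over 6 controllers
# (8 per controller). The 48/6 layout is an assumption about the X4500,
# not something stated in the thread.
drives, controllers = 48, 6
per_controller = drives // controllers            # 8
p_same = (per_controller - 1) / (drives - 1)      # 7 / 47
print(f"P(second failed drive on same controller) = {p_same:.1%}")   # ~14.9%
```

With 8 drives per controller, a second random failure lands on the same controller roughly one time in seven, so the coincidence is less unlikely than it first looks.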
18)
Questions and Answers :
Windows :
Why can't i upload/download from SETI?
(Message 543789)
Posted 10 Apr 2007 - Post: Hi, I recently had the same problem - "SETI@home|[file_xfer] Temporarily failed download of setiathome_5.15_windows_intelx86.exe: system connect" and the like.. :) It seems the SETI data server was altered/changed. The previous connection rules for the BOINC application - to setiathome.SSL.berkeley.edu (128.32.18.152, 208.68.240.11) and boinc2.ssl.berkeley.edu (66.28.250.124, 208.68.204.13) - are no longer sufficient. I had to add a firewall rule for the BOINC.EXE application allowing outbound connections to setiboincdata.ssl.berkeley.edu (208.68.240.16). After creating it, everything went smoothly..

Cheers,
josuah08
19)
Message boards :
Number crunching :
invalid pointer??
(Message 409405)
Posted 28 Aug 2006 - Post: I have been receiving the following messages in the stderr.txt files...

xxxx:~/BOINC/slots/0$ cat ~/BOINC/slots/0/stderr.txt
free(): invalid pointer 0xbffff6c8!
ar=0.426462 NumCfft=72539 NumGauss= 459965578 NumPulse= 88189923968 NumTriplet= 7647584501760
xxxx:~/BOINC/slots/0$ cat ~/BOINC/slots/1/stderr.txt
free(): invalid pointer 0xbffff6c8!
ar=0.426462 NumCfft=72539 NumGauss= 459965578 NumPulse= 88189923968 NumTriplet= 7647584501760
xxxx:~/BOINC/slots/0$ cat ~/BOINC/slots/2/stderr.txt
free(): invalid pointer 0xbffff6c8!
ar=0.426462 NumCfft=72539 NumGauss= 459965578 NumPulse= 88189923968 NumTriplet= 7647584501760
xxxx:~/BOINC/slots/0$ cat ~/BOINC/slots/3/stderr.txt
free(): invalid pointer 0xbffff6c8!

Running on Linux 2.6.8 SMP i686 GNU/Linux
BOINC version: 5.2.13 i686-pc-linux-gnu
S@H version: setiathome-5.12.i686-pc-linux-gnu

Because I'm running on a Pentium Cascades system (4x Xeon), I have 4 SETI threads running, but all are in the sleeping state. Previously everything was fine - this started to happen with the new version built with a newer gcc.

Any suggestions?? Thanks in advance
Profi