Message boards : Technical News : Feedback Loop (Aug 20 2007)
Matt Lebofsky | Joined: 1 Mar 99 | Posts: 1444 | Credit: 957,058 | RAC: 0
So the weekend was more or less successful: we kept the minimum number of multibeam splitters running and finally started to catch up with demand. We even started building up a nice backlog of work to send out, so I started up the classic splitter so they could cleanly finish the remaining partially-split tapes we have on line. The backend continues to choke occasionally - the bottleneck still being the workunit file server, so there's not much we can do about that. It'll probably be a lot better when we're entirely on multibeam data and fewer splitter processes are hitting the thing. Meanwhile, the sloooow workunits we hoped would time out on their own aren't. Not sure what to do about that exactly. And while the level of fast-returning overflows went down as we moved on to less noisy data, about 10% of all results sent back are still overflowing.

There's been some fairly good discussion in the number crunchers forum about how to get a better "feedback loop" between users and us here at Berkeley in times of crisis. Let me continue the chatter over here with my ten cents: currently the method of "problem hunting" done by me (and probably Eric) is pretty much a random scan of e-mails, private messages, and message board posts as time allows. The key phrase is "as time allows." There could be weeks where I simply don't have a single moment to look at any of the above. So the real bottleneck is our project's utter lack of staff-wide bandwidth for relating to the public. I get tagged a lot for being the "go-to" guy around here when really it's just that writing these posts is a form of micro-procrastination as I context switch between one little project and a dozen others. While I keep tabs on many aspects of the whole project, there are large sections where I don't know what the hell is going on, and I like to keep it that way. Like beta testing. Or compiling/optimizing core clients. Anyway...

For the day-to-day monitoring stuff it's really up to me, Jeff, Eric, and Bob - that's it - and none of us work full time on SETI. A long time ago we had a beeper which woke us up in the middle of the night when servers went down. We've come to learn, especially with the resilience of BOINC, that outages are not crises. As much as we appreciate the drive to help us compute as much as possible, we don't (and cannot possibly) guarantee 24/7 work. So a crisis line to tell us that our network graphs have flatlined would just serve to distract or annoy. Of course, there are REAL crises (potential data corruption, massive client failures), and a core group of y'all know which is which. I feel like, however imperfect and wonky they are, the current modes of getting information to us are at least adequate, and I fear additional channels will get cluttered with noise. You must realize that we all are checking into the lab constantly, even during our off hours. Sometimes we catch a fire before it burns out of control (in some cases we let it burn overnight). Sometimes we all just happen to be busy living our lives and are late to arrive at the scene of a disaster, which, at worst, results in an inelegant recovery - but a recovery nonetheless. Still... I don't claim to have the best answer (or attitude), so I'm willing to entertain improvements that are easy to implement and don't require me to watch or read anything more than I already do. In the meantime I am officially a message board lurker.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
Wm. C. Wright | Joined: 10 Apr 01 | Posts: 1 | Credit: 265,668 | RAC: 0
I love it - great words! I've been in IT for 18 years and totally understand. As long as the stats don't get corrupted, all is OK. SETI is very resilient - I have been watching and crunching for years. Staff bandwidth will be tight for a cycle or two, but attitude is like the weather - it will change at unpredictable times. Keep up the good work, Matt and team! Well done!

Bill
Pooh Bear 27 | Joined: 14 Jul 03 | Posts: 3224 | Credit: 4,603,826 | RAC: 0
Matt,

From someone who also has been doing IT for many years, in many capacities: I believe that you and the team are doing an excellent job with the capacity the team has. I am unsure that you can really do any more than you are. The updates you have been doing (3-4 times a week) tell the story. If people do not listen to the story and understand that things are under a kind of chaotic control, they will always be grumpy about the things they do not understand.

Keep up your great work. Do what you must. Don't let it take over all of your life. Keep your music close. You are a brave person to have kept this gig for this long, with as much abuse as people have thrust upon the group over the past several years. I'll be here as long as the project is run by wonderful people like yourself. I get grumpy at those who yell and scream at the team all the time. The team itself, I am proud to see do the hard work they do.

My movie: https://vimeo.com/manage/videos/502242
Bounce | Joined: 3 Apr 99 | Posts: 66 | Credit: 5,604,569 | RAC: 0
I have to agree with everyone else. A staff can be burned out quickly when there is pressure to perform beyond what there is a willingness to fund. More than once I've seen a situation where management blew a gasket when presented with the cost of running a full 24/7/365 shop. Yet that same group would stomp into the office in the morning looking to lop off heads because no one responded to their 3am voice mail about how their spreadsheet wasn't working right. It's an ongoing effort to educate people that their expectations have to be adjusted according to their willingness to loosen the purse strings.
Dr. C.E.T.I. | Joined: 29 Feb 00 | Posts: 16019 | Credit: 794,685 | RAC: 0
Matt, You & Berkeley are Doing a Great Job . . . Thank Each of You Sincerely . . .
Gary Charpentier | Joined: 25 Dec 00 | Posts: 30638 | Credit: 53,134,872 | RAC: 32
> So the weekend was more or less successful: we kept the minimum number of multibeam splitters running and finally started to catch up with demand. [...] And while the level of fast-returning overflows went down as we moved on to less noisy data, about 10% of all results sent back are still overflowing.

Okay, what you need is some sort of error reporting facility that sends out an SOS when it gets enough similar messages. Not much different from looking for that SETI signal in the random noise. I don't know if there is something for that out here in help desk land or not. Might be worth a few minutes of web searching.
Richard Haselgrove | Joined: 4 Jul 99 | Posts: 14650 | Credit: 200,643,578 | RAC: 874
> Okay, what you need is some sort of error reporting facility that sends out an SOS when it gets enough similar messages. Not much different from looking for that SETI signal in the random noise. I don't know if there is something for that out here in help desk land or not. Might be worth a few minutes of web searching.

As the originator of the thread that I think Matt was referring to, I'd be delighted if your web search turned up a 'bot' which would monitor these boards for an alarm signal - then I could go back to bed with a clear conscience. It would need to be able to tell a real problem from the usual noise, though.

Sounds like the user community to me, but please prove me wrong.
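A minimal sketch, in Python, of the kind of board-watching 'bot' Richard describes: scan a batch of post texts for alarm phrases and raise a flag once enough distinct posts match. The keyword list and threshold here are pure assumptions for illustration; nothing about this reflects how the project's boards actually work.

```python
# Hypothetical alarm keywords and threshold -- both are assumptions
# for illustration, not anything the project actually uses.
ALARM_KEYWORDS = ("download error", "http error", "server down", "no work")
ALARM_THRESHOLD = 5  # distinct posts mentioning a problem before we alert


def count_alarm_posts(posts):
    """Count posts that mention at least one alarm keyword."""
    hits = 0
    for text in posts:
        lowered = text.lower()
        if any(keyword in lowered for keyword in ALARM_KEYWORDS):
            hits += 1
    return hits


def should_alert(posts):
    """True once enough posts in one batch look like trouble reports."""
    return count_alarm_posts(posts) >= ALARM_THRESHOLD
```

The hard part Richard alludes to (telling a genuine outage report from routine grumbling) is exactly what a fixed keyword list cannot do, which is why the thread keeps coming back to the user community as the real filter.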
DJStarfox | Joined: 23 May 01 | Posts: 1066 | Credit: 1,226,053 | RAC: 2
> Okay, what you need is some sort of error reporting facility that sends out an SOS when it gets enough similar messages. Not much different from looking for that SETI signal in the random noise. I don't know if there is something for that out here in help desk land or not. Might be worth a few minutes of web searching.

There IS such a program. It's called Nagios. We are using it at my company to monitor over 300 servers and 1000+ network services with one server: a dedicated box that monitors all the other boxes. It sends emails or pages when things are not working, and it can be configured to only notify during certain hours (e.g., business hours only).
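For flavor, here is a sketch of a check script following the standard Nagios plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN), written in Python. The process name `mb_splitter` and the minimum count are made up for illustration; a real deployment would more likely use one of Nagios's stock plugins or NRPE.

```python
import subprocess

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def status_for_count(count, minimum=1):
    """Map a process count to a Nagios-style (exit_code, message) pair."""
    if count >= minimum:
        return OK, "OK: %d process(es) running" % count
    return CRITICAL, "CRITICAL: process not running"


def check_process(name):
    """Count processes matching 'name' via pgrep and report Nagios-style.

    'name' is whatever daemon you care about; 'mb_splitter' elsewhere in
    this sketch is purely an illustrative guess, not a real binary name.
    """
    try:
        out = subprocess.run(["pgrep", "-c", name],
                             capture_output=True, text=True)
        count = int(out.stdout.strip() or 0)
    except (OSError, ValueError):
        return UNKNOWN, "UNKNOWN: could not run pgrep"
    return status_for_count(count)
```

A Nagios server would run such a plugin on a schedule, read the exit code, and page someone only when the state changes, which is what makes it quieter than a human refreshing server_status.php.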
Richard Haselgrove | Joined: 4 Jul 99 | Posts: 14650 | Credit: 200,643,578 | RAC: 874
> Okay, what you need is some sort of error reporting facility that sends out an SOS when it gets enough similar messages. Not much different from looking for that SETI signal in the random noise. I don't know if there is something for that out here in help desk land or not. Might be worth a few minutes of web searching.

And - you neglected to say - it "is an enterprise-class monitoring solution for hosts, services, and networks released under an Open Source license" (according to the top entry returned by A Well Known Search Engine). Interesting. I'll read up on it. [Not sure yet whether it can parse PHP message boards, though.]
Jim-R. | Joined: 7 Feb 06 | Posts: 1494 | Credit: 194,148 | RAC: 0
> Okay, what you need is some sort of error reporting facility that sends out an SOS when it gets enough similar messages. Not much different from looking for that SETI signal in the random noise. I don't know if there is something for that out here in help desk land or not. Might be worth a few minutes of web searching.

I think the program he is speaking of is more of a service monitor. It would let you know if the entire web server were down, but I doubt it could parse messages looking for keywords, which is what was being mentioned in the other posts.

Jim

Some people plan their life out and look back at the wealth they've had. Others live life day by day and look back at the wealth of experiences and enjoyment they've had.
elendil | Joined: 7 May 02 | Posts: 28 | Credit: 1,908,698 | RAC: 0
> So the weekend was more or less successful: we kept the minimum number of multibeam splitters running and finally started to catch up with demand. [...] And while the level of fast-returning overflows went down as we moved on to less noisy data, about 10% of all results sent back are still overflowing.

I don't want to bother Matt with this in particular, but how do I know if I'm crunching a 'standard' WU or a 'multibeam' WU?

-=[ Not all who wander are lost ]=-
Jim-R. | Joined: 7 Feb 06 | Posts: 1494 | Credit: 194,148 | RAC: 0
> So the weekend was more or less successful: we kept the minimum number of multibeam splitters running and finally started to catch up with demand. [...] And while the level of fast-returning overflows went down as we moved on to less noisy data, about 10% of all results sent back are still overflowing.

I think it's given in the workunit header, which you can view with a text editor - but don't change or save anything! The easiest way is to look at the first six characters of the filename: this is the date the wu was recorded, so 05ma07 would be the fifth of March, 2007. Anything with 07 is a multibeam wu. The only tapes in the splitter for linefeed are from 2000, which would give, say, 02ja00 or similar - the key being the third pair being 00.

Jim

Some people plan their life out and look back at the wealth they've had. Others live life day by day and look back at the wealth of experiences and enjoyment they've had.
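Jim's filename rule can be written as a tiny Python helper. This is just a heuristic distilled from the post above (a two-digit year of '00' in characters 5-6 means classic/linefeed data, a later year means multibeam), not an official naming specification:

```python
def is_multibeam(wu_name):
    """Guess whether a workunit is multi-beam from its name.

    Per the post above, the name starts with a ddMMyy recording date
    using a two-letter month code, e.g. '13fe07aa...'. Tapes dated
    '00' are classic ("linefeed") data; later years are multi-beam.
    Heuristic only -- based on this thread, not a naming spec.
    """
    year = wu_name[4:6]
    return year.isdigit() and year != "00"
```

Applied to the workunit names that appear later in this thread, `13fe07aa.25866.20113.12.5.146_1` would classify as multibeam, while a classic-era name like `02ja00aa...` would not.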
DJStarfox | Joined: 23 May 01 | Posts: 1066 | Credit: 1,226,053 | RAC: 2
> I think the program he is speaking of is more of a service monitor. It would let you know if the entire web server were down, but I doubt it could parse messages looking for keywords, which is what was being mentioned in the other posts.

Actually, there are modules that allow you to check running processes on unix machines. These modules run via a client daemon on each server (with very low memory requirements). I know modules exist for Windows, Linux, and SunOS 5.8. That would be a much preferred option over looking at server_status.php - after all, how would it check the servers if the web server itself was down?
Richard Haselgrove | Joined: 4 Jul 99 | Posts: 14650 | Credit: 200,643,578 | RAC: 874
> I think the program he is speaking of is more of a service monitor. It would let you know if the entire web server were down, but I doubt it could parse messages looking for keywords, which is what was being mentioned in the other posts.

That, coupled with the 'Open Source license', should make it interesting, and potentially very useful, for the staff monitoring the internal networks in the lab. However, to reprise a phrase I coined in another thread, I'm not sure how it would replace the "Distributed Wisdom" which comes in parallel with a 'Distributed Computing' project.
KWSN THE Holy Hand Grenade! | Joined: 20 Dec 05 | Posts: 3187 | Credit: 57,163,290 | RAC: 0
> So the weekend was more or less successful: we kept the minimum number of multibeam splitters running and finally started to catch up with demand. [...] And while the level of fast-returning overflows went down as we moved on to less noisy data, about 10% of all results sent back are still overflowing.

@ elendil: Jim-R's method works, but also - if you're looking at the graphics or the screensaver - look for a "06" or "07": those (and maybe "05" as well...) are multi-beam. As Jim says, anything with a "00" is linefeed (AKA "standard"). A dead giveaway for multibeam from Beta (which I'm not sure has made it to the production app yet...) is a line reading "Beam x, Polariz. y", with x being 0-7 and y being 0 or 1.

[Edit to add] If you look at the "tasks" tab of BOINC Manager (in the "advanced view"), the first 6 characters are in ddmmyy form - again, all 06 and 07 year dates will be multi-beam. [/edit]

Hello, from Albany, CA!...
DJStarfox | Joined: 23 May 01 | Posts: 1066 | Credit: 1,226,053 | RAC: 2
> However, to reprise a phrase I coined in another thread, I'm not sure how it would replace the "Distributed Wisdom" which comes in parallel with a 'Distributed Computing' project.

It wouldn't, because those are two completely different things. The "distributed wisdom" that emerges from thousands of users forming a collective intelligence about the SETI system cannot be replaced by one server and some software. What Nagios can do, however, is provide simple checks and very quick, predictable responses regarding the internal state of all the SETI servers. It is a very useful tool, but it will not replace all performance measures or be able to tell you "how well it's running" the way the community here does very well. Ideally, we would have both, and each would serve its role.
Gary Charpentier | Joined: 25 Dec 00 | Posts: 30638 | Credit: 53,134,872 | RAC: 32
> Okay, what you need is some sort of error reporting facility that sends out an SOS when it gets enough similar messages. Not much different from looking for that SETI signal in the random noise. I don't know if there is something for that out here in help desk land or not. Might be worth a few minutes of web searching.

Well, I don't think that plain-language recognition stuff has gotten that far. [Never mind that it would need to read many languages.] Perhaps, though, some sort of fill-in error report with 10-20 questions about the error might work. Get a bunch with all 10-20 questions answered the same way in a short time period, and it's time to start reading the comments section of the report. Or a bunch about WU's off the same splitter tape. I don't know if the user system error logs get uploaded, but a pattern match on them would show up problems PDQ, I'm sure. Known bugs could easily be screened out, problem machines also. [Maybe you could even automatically e-mail the owner of a problem machine to check it out.] I know everyone knows when the entire system is down, so that can be ignored.

Also, I see there may need to be another function added to BOINC: namely, the ability to tell a/all client(s) to abandon a set of WU's once you know there is a problem with that data.

Enough debugging for the day.
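Gary's screening idea (tally identical reports, filter out known bugs and known-problem machines, and sound an SOS once a signature passes a threshold) might look something like this Python sketch. All the field names, signatures, and thresholds are invented for illustration; none of them come from BOINC:

```python
from collections import Counter

# Illustrative values only -- none of these names come from BOINC.
KNOWN_BUGS = {("download", "http error")}   # signatures already understood
PROBLEM_HOSTS = {12345}                     # machines already flagged
ALERT_THRESHOLD = 10                        # identical reports before an SOS


def triage(reports):
    """Return error signatures reported often enough to warrant an SOS.

    Each report is a dict with 'host', 'category', and 'detail' keys
    (hypothetical fields). Known bugs and known-bad machines are
    screened out first, as the post suggests.
    """
    tally = Counter()
    for report in reports:
        signature = (report["category"], report["detail"])
        if signature in KNOWN_BUGS or report["host"] in PROBLEM_HOSTS:
            continue
        tally[signature] += 1
    return [sig for sig, n in tally.items() if n >= ALERT_THRESHOLD]
```

The appeal of thresholding on a structured signature rather than free text is exactly Gary's point: it sidesteps natural-language recognition entirely, in any language.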
Ingleside | Joined: 4 Feb 03 | Posts: 1546 | Credit: 15,832,022 | RAC: 13
> Also I see there may need to be another function added to BOINC, namely the ability to tell a/all client(s) to abandon a set of WU's once you know there is a problem with that data.

Client-side aborting of cancelled wu's is already included in v5.5.1 and later; the only problem is that the client must do a scheduler request before it's triggered, and the project must have aborted the wu... The BOINC server code includes the ability to abort wu's, but AFAIK this is "abort wu-id X to wu-id Y". With SETI@home having multiple work generators running, "good" wu's would be mixed in with "bad" wu's, so aborting by wu-id wouldn't be a good idea.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
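One way around the wu-id-range problem Ingleside describes would be to select workunits by the tape prefix embedded in their names instead. A hypothetical sketch (nothing in this thread says the server code has such a function; the tape name is just an example from the filenames quoted elsewhere in the thread):

```python
def wus_to_abort(wu_names, bad_tape):
    """Select workunits belonging to one splitter tape by name prefix.

    'bad_tape' is the tape portion of the workunit name (e.g.
    '13fe07aa', borrowed from this thread purely as an example).
    Matching on the name prefix instead of a wu-id range avoids
    sweeping in good work that other concurrently running splitters
    interleaved into the id sequence.
    """
    prefix = bad_tape + "."
    return [name for name in wu_names if name.startswith(prefix)]
```

An admin script could feed the selected names to whatever cancellation mechanism the server provides, leaving the other tapes' workunits untouched.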
Richard Haselgrove | Joined: 4 Jul 99 | Posts: 14650 | Credit: 200,643,578 | RAC: 874
> Also I see there may need to be another function added to BOINC, namely the ability to tell a/all client(s) to abandon a set of WU's once you know there is a problem with that data.

I agree it would be impossibly time-consuming to do it by hand, but haven't they (or Eric in particular) written scripts to do this sort of thing in the past? It tends to crop up in Beta from time to time. I mention it because the problem still keeps cropping up - I've just performed Joe Segur's surgical operation on one this morning (keyhole surgery, because it was on a remote server): WU 147847468, and my copy, issued 22 August, was the first one to count towards quorum.
whong99 | Joined: 29 Oct 99 | Posts: 1 | Credit: 4,764,060 | RAC: 33
Keep up the excellent work... I just wanted to pass on some feedback from out here in 'userland'. My BOINC Manager v5.8.11 on my computer at work has been trying without success to download the file setiathome_5.27_windows_intelx86.exe since Aug 20. It has been 'stuck' trying to download two new work units (i.e. 13fe07aa.25866.20113.12.5.146_1 and 02mr07ah.12398.23290.14.5.210._1). I have aborted the download for the above work units, and BOINC Manager is currently attempting to download another two new work units (10ja07ab.30450.6207.8.5.217 and 10ja07ab.28819.11933.7.5.224), without success either (see BOINC messages below). My Internet connection is fine, since BOINC Manager is able to successfully download work units for other projects running on my computer. Thanks.

William Hong

Sample BOINC messages:

...
8/22/2007 7:42:43 PM|SETI@home|[file_xfer] Started download of file setiathome_5.27_windows_intelx86.exe
8/22/2007 7:42:46 PM|SETI@home|[file_xfer] Temporarily failed download of setiathome_5.27_windows_intelx86.exe: http error
8/22/2007 7:42:46 PM|SETI@home|Backing off 2 hr 35 min 30 sec on download of file setiathome_5.27_windows_intelx86.exe
...
8/23/2007 1:32:35 PM|SETI@home|[file_xfer] Started download of file setiathome_5.27_windows_intelx86.exe
8/23/2007 1:32:35 PM|SETI@home|[file_xfer] Started download of file 10ja07ab.30450.6207.8.5.217
8/23/2007 1:32:48 PM|SETI@home|Sending scheduler request: To fetch work
8/23/2007 1:32:48 PM|SETI@home|Requesting 43803 seconds of new work
8/23/2007 1:32:53 PM|SETI@home|Scheduler RPC succeeded [server version 511]
8/23/2007 1:32:53 PM|SETI@home|Deferring communication for 11 sec
8/23/2007 1:32:53 PM|SETI@home|Reason: requested by project
8/23/2007 1:33:25 PM|SETI@home|[file_xfer] Temporarily failed download of setiathome_5.27_windows_intelx86.exe: http error
8/23/2007 1:33:25 PM|SETI@home|Backing off 1 min 0 sec on download of file setiathome_5.27_windows_intelx86.exe
8/23/2007 1:33:25 PM|SETI@home|[file_xfer] Temporarily failed download of 10ja07ab.30450.6207.8.5.217: http error
8/23/2007 1:33:25 PM|SETI@home|Backing off 1 min 0 sec on download of file 10ja07ab.30450.6207.8.5.217
8/23/2007 1:33:25 PM|SETI@home|[file_xfer] Started download of file 10ja07ab.28819.11933.7.5.224
8/23/2007 1:34:15 PM|SETI@home|[file_xfer] Temporarily failed download of 10ja07ab.28819.11933.7.5.224: http error
8/23/2007 1:34:15 PM|SETI@home|Backing off 1 min 0 sec on download of file 10ja07ab.28819.11933.7.5.224
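The "Backing off ..." lines in William's log reflect the general technique BOINC uses for failed transfers: retry with a randomized, roughly exponential delay so that thousands of clients don't hammer a struggling server in lockstep. A generic sketch of that idea (the constants and jitter range are illustrative, not BOINC's actual values):

```python
import random


def backoff_seconds(failures, base=60, cap=4 * 3600):
    """Randomized exponential backoff, capped at 'cap' seconds.

    The doubling schedule, base delay, cap, and jitter range are all
    illustrative; BOINC's real backoff policy differs in detail but
    follows the same doubling-with-jitter idea.
    """
    delay = min(cap, base * 2 ** max(0, failures - 1))
    return delay * random.uniform(0.5, 1.0)
```

The jitter is the important part: without it, every client that failed at the same moment would retry at the same moment too, recreating the original overload.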
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.