The New Year Unfolds... (Jan 10 2013)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
The new year is unfolding nicely, more or less. Wow - 2013. Every new year now sounds like a science fiction year. I don't really have anything major to report, but here's another update anyway.
We were supposed to have some more lab-wide power repairs last weekend. This got postponed to a later date which has yet to be settled upon.
As I've been mentioning for years, the BOINC server backend (everything pertaining to creating the workunit, sending it out, receiving the result and processing it) performs in many parts on a set of constantly changing servers of disparate make and model and power, and thus some problems involve so many moving targets that they're almost impossible to diagnose. I tend to refer to these times when performance is lower than expected as "server malaise." It also doesn't help that we are dealing with an almost constant malaise given we are pretty much maxed out on our network connection to the world 24 hours a day. This is like running a retail business with a line out the door 24 hours a day - no quiet time to clean the place up, restock the shelves, etc.
Usually when we see some queue backing up, or network traffic drop, the procedure is somewhat like this:
1. check to see if a server or important service (httpd, informix, mysql) isn't running - these are easy to find and hopefully easy to fix.
2. check to see if some BOINC mechanism (validation, assimilation, etc.) is stuck on something - these are relatively easy to find (by scanning logs and process tables) and sometimes easy to fix, but not always.
3. check to see if everything is kind of working, just slowly. If this is true, we tend to write it off as "server malaise" and wait and see if it improves on its own - the functional equivalent of "take two aspirin and call me in the morning."
Usually we find things improve on their own over time, or if not then more obvious clues as to actual problems make themselves clearer.
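The three-step order above can be sketched as a small decision function. This is only an illustration of the ordering, not the project's actual tooling: the function name `triage`, its inputs, and the 0.8 "slow" threshold are all hypothetical, and in practice the inputs would come from process tables, logs and traffic graphs.

```python
from typing import Dict, List


def triage(services_up: Dict[str, bool],
           stuck_daemons: List[str],
           throughput_ratio: float,
           slow_threshold: float = 0.8) -> str:
    """Return the first action suggested by the three-step check.

    services_up      -- hypothetical map of service name -> is it running?
    stuck_daemons    -- hypothetical list of BOINC daemons that look stuck
    throughput_ratio -- observed throughput / expected throughput
    slow_threshold   -- assumed cutoff below which things count as "slow"
    """
    # Step 1: a server or important service is not running at all.
    down = sorted(name for name, up in services_up.items() if not up)
    if down:
        return "restart: " + ", ".join(down)

    # Step 2: a BOINC mechanism (validation, assimilation, ...) is stuck.
    if stuck_daemons:
        return "unstick: " + ", ".join(sorted(stuck_daemons))

    # Step 3: everything is running, just slowly.
    if throughput_ratio < slow_threshold:
        return "server malaise: wait and re-check"

    return "healthy"


print(triage({"httpd": True, "informix": True, "mysql": True}, [], 0.6))
```

The cheap, unambiguous checks come first, so the vague "malaise" verdict is only reached once the obvious causes have been ruled out.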
We simply don't find it an efficient use of our very limited time to understand and solve every problem perfectly.
I mention all this as we certainly had a few malaises over the past few weeks. The one last week was due to one cronjob failing to run, which didn't update some statistics, which led to some splitters running too much and generating too much work, which led to a bloated database and bloated filesystem, which led to slow backend processing, which took about 4 days to clear out, but it eventually did without any effort on our part. During that time general upload/download bandwidth was constrained a tad, but we survived.
Otherwise, things are well. The recent (or relatively recent) server upgrades have been a major blessing, and more are planned. During the outage on Tuesday I actually moved some servers around such that *all* the SETI-related servers are now in the closet (as opposed to our auxiliary lab). This is a first, I think. Outside of our desktops all SETI machines are in the racks.
Of course, this is just in time for the closet a/c to be in need of repair. This surgery is happening on Monday, and may take a couple of days, during which the projects will all be down (with limited servers left up to keep the web site alive with a warning on the front page and status updates). We hope to be back up Tuesday afternoon. There is a chance repairs won't work. We have a plan B (and C) if this happens but let's just be positive and cross that bridge if/when we get there.
Oh yeah, one random note. Yesterday I had some fun with this database weirdness. Somewhere along the line, perhaps during one of many sudden power outages, a small set (i.e. about 10 out of 3,000,000,000) of the spikes in the database were cloned, and became two entries in the database, with the same id #s. This is "impossible" as id #s are primary keys and supposed to be unique.
So which of the clones we saw depended on how you selected these spikes: selecting by id or by some other field, you'd get one clone or the other. This wasn't apparent at all until I tried to update values in these spikes, and then when selecting them I'd get the unupdated clone and it looked like the update wasn't working. Long story short, I finally figured this out and got rid of the clones. But yeah, databases sure can be funny sometimes.
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
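For readers curious how cloned rows like these can be spotted, here is a minimal sketch of one common approach: group by the supposedly unique key and keep the groups with more than one row. It uses Python's built-in `sqlite3` as a stand-in, not the project's real database; the table name `spike`, the column `power`, and the helper `find_duplicate_ids` are assumptions made for illustration, and the table is deliberately created without a primary-key constraint to imitate the corrupted state.

```python
import sqlite3


def find_duplicate_ids(conn, table="spike", key="id"):
    """Return (key, row_count) pairs for keys that appear more than once.

    Table and column names are hypothetical and come from trusted code,
    so they are interpolated directly into the query text.
    """
    query = (
        f"SELECT {key}, COUNT(*) FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) > 1 ORDER BY {key}"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
# No PRIMARY KEY here on purpose: this mimics a table whose uniqueness
# guarantee has already been broken.
conn.execute("CREATE TABLE spike (id INTEGER, power REAL)")
conn.executemany(
    "INSERT INTO spike VALUES (?, ?)",
    [(1, 24.0), (2, 31.5), (2, 31.5), (3, 27.1)],
)

duplicates = find_duplicate_ids(conn)
print(duplicates)
```

A full-table grouping like this ignores which index the database would use for a lookup by id, which is one reason it can reveal clones that a normal select by id hides.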
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Thanks for the update Matt, Happy New Year, Claggy |
Jord Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
Wow - 2013. Every new year now sounds like a science fiction year. Only 50 more years before Zefram rides his rocket, huh? :-) |
Bernie Vine Joined: 26 May 99 Posts: 9958 Credit: 103,452,613 RAC: 328 |
Thanks for the update and insight. |
Gary Charpentier Joined: 25 Dec 00 Posts: 31006 Credit: 53,134,872 RAC: 32 |
Great post and thanks for taking the time to write it. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
Great post and thanks for taking the time to write it. Yep, greatly appreciated. Grant Darwin NT |
Ex: "Socialist" Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2 |
Thanks for the update, appreciated as always. #resist |
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
Great post and thanks for taking the time to write it. Same here. I know writing updates for us isn't the best use of your time, but then maybe it is when you consider the good will it generates among us. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
Brother Frank Joined: 10 Dec 11 Posts: 26 Credit: 15,142,410 RAC: 0 |
Thanks so much Matt. I appreciate the overview of your quick check and diagnosis methods. It makes a great deal of sense given your shortage of staffing. When things go wrong on my six computers here, it is often an internet connectivity issue like Comcast being down or my modem or router getting turned off accidentally. I check that first. If it is common to all my machines, I often just let it go for a while and let the issue rest. As I run out of work, I gradually switch machines over to backup projects. My favorites are GPU Grid for my Nvidia cards and Rosetta at Home for my CPUs. They respond very fast and soon all machines are running flat out on them. The main thing for me has been to learn to relax and take it easy with all this. I do what I can without running around too much from machine to machine. Sometimes, I get a bad update from Nvidia and have to roll it back to the last widely accepted version. Other times, I've had a section of memory go bad or a hard drive fry. Am slowly learning to take things in stride. Like with you, but to a much smaller degree, I often have something not quite working right and every so often, a major repair is involved. It was very kind of you to share with us. It puts me much more at ease because I know now that you attack the problems and find and tweak many issues. I love your idea of computer malaise or network malaise. A vague type of discomfort from an unspecified source describes it well. Generally my malaise turns to specific types of issues as I am with Seti @Home longer and longer. That's headed in the right direction. Brother Frank |
Neil L. Carter Joined: 6 Dec 99 Posts: 62 Credit: 16,385,509 RAC: 27 |
So I think everyone knows you've been getting new hardware (which is great!!!). Is there a plan somewhere to upgrade the internet connection sometime? I know this bottleneck has been discussed many, many times in the past, but I've not heard of any plans for actually resolving the issue. Thanks! |
Gary Charpentier Joined: 25 Dec 00 Posts: 31006 Credit: 53,134,872 RAC: 32 |
So I think everyone knows you've been getting new hardware (which is great!!!). AFAIK the problem with the bandwidth is intractable absent a large infusion of cash. The problem is political in nature and the beast is underfed and is demanding a full meal to pass. |
Adam Weichel Joined: 30 Jul 02 Posts: 22 Credit: 25,877,509 RAC: 46 |
Matt - is there any hardware required presently, or are cash infusions the best way to make an impact? I keep meaning to harass Eric over FB, but he always mentions he'll ask you anyhow. I could probably come up with HDDs, possibly some ECC DDR2/3 RAM (depending on the needs). Let me know. :) Adam Computer nut, Distributed Computing freak, Jeeper and Dodge Ram driver. Life is worth living... and worth discovering. I run VMWare ESXi Free - why don't you? |
rob smith Joined: 7 Mar 03 Posts: 22526 Credit: 416,307,556 RAC: 380 |
The GPUUG have set up a page which lists the current hardware (and monetary) donations. Take a look at this thread http://setiathome.berkeley.edu/forum_thread.php?id=70511 Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
JardaM Joined: 14 Sep 99 Posts: 20 Credit: 12,211,383 RAC: 5 |
A small problem in your code, Matt. WU 1029297452 (it is an AP WU) has had three valid results since mid Aug 2012 and it is still not validated. The problem is that the 3rd task was possibly released shortly before the 2nd was returned from the field, probably just before its deadline. |
John McLeod VII Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
A small problem in your code, Matt. Just before the deadline of which task? If it is the second, this is highly unlikely as not being returned by the deadline is the trigger that generates the third task. BOINC WIKI |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
A small problem in your code, Matt. Indeed, task 2525261366 had a report deadline of 8 Aug 2012, 22:42:42 UTC. The third task was Created 8 Aug 2012, 22:42:45 UTC and Sent 8 Aug 2012, 22:42:47 UTC. The third result may or may not be valid, but it is certainly a problem that when it was reported as a success Validation didn't take place. There are other Astropulse WUs in the same state for the same reason, that sequence of events seems to always lead to a zombie-like state for Astropulse tasks (but not for SETI@home Enhanced). Joe |
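The timing Joe quotes can be checked with a little date arithmetic. The sketch below only uses the three timestamps given in his post; the helper name `parse_utc` is made up for illustration, and nothing here reflects how the BOINC backend itself computes deadlines.

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse a forum-style timestamp such as '8 Aug 2012, 22:42:42 UTC'."""
    parsed = datetime.strptime(stamp, "%d %b %Y, %H:%M:%S UTC")
    return parsed.replace(tzinfo=timezone.utc)


deadline = parse_utc("8 Aug 2012, 22:42:42 UTC")  # report deadline of task 2525261366
created = parse_utc("8 Aug 2012, 22:42:45 UTC")   # third task created
sent = parse_utc("8 Aug 2012, 22:42:47 UTC")      # third task sent

created_after = (created - deadline).total_seconds()
sent_after = (sent - deadline).total_seconds()

print(f"third task created {created_after:.0f} s after the deadline")
print(f"third task sent {sent_after:.0f} s after the deadline")
```

So the replacement task was generated within seconds of the missed deadline, which is consistent with the deadline miss being the trigger, as John describes.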
Wiggo Joined: 24 Jan 00 Posts: 36765 Credit: 261,360,520 RAC: 489 |
A small problem in your code, Matt. But SETI@home Enhanced does have its own problem with old stuck tasks such as this one, Workunit 638353788, and I know of several others with tasks caught in the same limbo dating back to around the same time. But regardless of whether they are AP (of which I have 2 stuck) or MB, these old tasks must be having some effect on the performance of the database. Cheers. |
Siran d'Vel'nahr Joined: 23 May 99 Posts: 7379 Credit: 44,181,323 RAC: 238 |
Greetings, Wow! Finally something I can post about in this section of the forum! WOOHOO! :) I have 2 Astropulse v5.05 WUs that seem to be stuck: 797918561 799691663 Please note that the report dates are in August of 2011. Keep on BOINCing...! :) CAPT Siran d'Vel'nahr - L L & P _\\// Winders 11 OS? "What a piece of junk!" - L. Skywalker "Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath |
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
I have a stuck one, too, although it's not that old. 1150388682 only went to two of us, so none of the usual reasons for it to be stuck apply. I only noticed it because it's on my machine that's currently dead. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
rob smith Joined: 7 Mar 03 Posts: 22526 Credit: 416,307,556 RAC: 380 |
Ladies and Gentlemen, the place to discuss the current downloads is in "number crunching", where I am sure you will find you are not alone. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.