The New Year Unfolds... (Jan 10 2013)

Message boards : Technical News : The New Year Unfolds... (Jan 10 2013)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1326480 - Posted: 10 Jan 2013, 21:55:19 UTC

The new year is unfolding nicely, more or less. Wow - 2013. Every new year now sounds like a science fiction year. I don't really have anything major to report, but here's another update anyway.

We were supposed to have some more lab-wide power repairs last weekend. This got postponed to a later date which has yet to be settled upon.

As I've been mentioning for years, the BOINC server backend (everything pertaining to creating the workunit, sending it out, receiving the result and processing it) runs in many parts on a set of constantly changing servers of disparate make, model, and power, and thus some problems involve so many moving targets that they're almost impossible to diagnose. I tend to refer to these times when performance is lower than expected as "server malaise." It also doesn't help that we are dealing with an almost constant malaise given we are pretty much maxed out on our network connection to the world 24 hours a day. This is like running a retail business with a line out the door 24 hours a day - no quiet time to clean the place up, restock the shelves, etc.

Usually when we see some queue backing up, or network traffic drop, the procedure is somewhat like this: 1. check to see if a server or important service (httpd, informix, mysql) isn't running - these are easy to find and hopefully easy to fix. 2. check to see if some BOINC mechanism (validation, assimilation, etc.) is stuck on something - these are relatively easy to find (by scanning logs and process tables) and sometimes easy to fix, but not always. 3. check to see if everything is kind of working, just slowly. If this is true, we tend to write it off as "server malaise" and wait and see if it improves on its own - the functional equivalent of "take two aspirin and call me in the morning." Usually we find things improve on their own over time, or if not, then more obvious clues as to actual problems make themselves clearer. We simply don't find it an efficient use of our very limited time to understand and solve every problem perfectly.
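
The three checks above can be sketched as a simple first-match decision function. This is only an illustration: the `Snapshot` fields, the `triage` name, and the 0.8 slowness threshold are assumptions made for the example, not anything taken from the actual SETI@home tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical summary of the backend state at one moment."""
    services_up: dict[str, bool]                              # e.g. {"httpd": True, "mysql": False}
    stuck_daemons: list[str] = field(default_factory=list)   # e.g. ["validator"]
    throughput_ratio: float = 1.0                             # observed rate / expected rate


def triage(snap: Snapshot, slow_threshold: float = 0.8) -> str:
    """Return the action for the first of the three checks that matches."""
    # Step 1: is a server or important service down?
    down = sorted(name for name, up in snap.services_up.items() if not up)
    if down:
        return "restart: " + ", ".join(down)
    # Step 2: is some BOINC mechanism (validation, assimilation, ...) stuck?
    if snap.stuck_daemons:
        return "unstick: " + ", ".join(snap.stuck_daemons)
    # Step 3: everything runs, just slowly -> write it off and look again later.
    if snap.throughput_ratio < slow_threshold:
        return "server malaise: wait and re-check"
    return "ok"
```

The ordering mirrors the post: cheap, unambiguous checks first, and the "wait and see" outcome only once nothing specific has been found.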

I mention all this as we certainly had a few malaises over the past few weeks. The one last week was due to one cronjob failing to run, which didn't update some statistics, which led to some splitters running too much and generating too much work, which led to a bloated database and bloated filesystem, which led to slow backend processing, which took about 4 days to clear out, but it eventually did without any effort on our part. During that time general upload/download bandwidth was constrained a tad, but we survived.
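
To make that failure chain concrete: a producer that throttles itself on a statistic will overproduce if the statistic silently stops updating. A minimal sketch of one common guard, treating stale statistics as "do not run", is below. The function name, the queue-size statistic, and both limits are hypothetical; the post does not say how the splitters actually consume their statistics.

```python
def should_split(queue_size, stats_timestamp, now, high_water=1000, max_age_s=3600):
    """Decide whether a (hypothetical) splitter may generate more work.

    queue_size      -- last recorded number of workunits waiting to be sent
    stats_timestamp -- when that number was recorded (seconds since epoch)
    now             -- current time (seconds since epoch)
    """
    # If the statistics job has not run recently, the number cannot be trusted,
    # so fail closed instead of producing work based on an old, low count.
    if now - stats_timestamp > max_age_s:
        return False
    # Fresh statistics: only split while the queue is below the high-water mark.
    return queue_size < high_water
```

Without the age check, a frozen low `queue_size` would keep returning True indefinitely, which is the kind of runaway the paragraph above describes.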

Otherwise, things are well. The recent (or relatively recent) server upgrades have been a major blessing, and more are planned. During the outage on Tuesday I actually moved some servers around such that *all* the SETI related servers are now in the closet (as opposed to our auxiliary lab). This is a first, I think. Outside of our desktops all SETI machines are in the racks.

Of course, this is just in time for the closet a/c to be in need of repair. This surgery is happening on Monday, and may take a couple of days, during which the projects will all be down (with limited servers left up to keep the web site alive with a warning on the front page and status updates). We hope to be back up Tuesday afternoon. There is a chance repairs won't work. We have a plan B (and C) if this happens but let's just be positive and cross that bridge if/when we get there.

Oh yeah, one random note. Yesterday I had some fun with this database weirdness. Somewhere along the line, perhaps during one of many sudden power outages, a small set (i.e. about 10 out of 3,000,000,000) of the spikes in the database were cloned, and became two entries in the database, with the same id #s. This is "impossible" as id #s are primary keys and supposed to be unique. Which of the clones we saw depended on how we selected these spikes - selecting by id or by some other field, you'd get one clone or the other. This wasn't apparent at all until I tried to update values in these spikes, and then when selecting them I'd get the unupdated clone version and it looked like the update wasn't working. Long story short, I finally figured this out and got rid of the clones. But yeah, databases sure can be funny sometimes.
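
For readers curious what hunting for such clones can look like, here is a small sketch. It uses Python's built-in `sqlite3` purely as a stand-in so the snippet runs anywhere; the table name `spike`, the `power` column, and the sample rows are made up, and this is not the query that was actually used. The table is deliberately created without a primary-key constraint so that duplicates can be inserted to imitate the corrupted state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY constraint on purpose: a healthy table would reject the
# duplicate id below, so this only imitates the corrupted situation.
conn.execute("CREATE TABLE spike (id INTEGER, power REAL)")
conn.executemany(
    "INSERT INTO spike (id, power) VALUES (?, ?)",
    [(1, 24.0), (2, 30.5), (2, 30.5), (3, 27.1)],
)


def find_cloned_ids(conn):
    """Return (id, copies) for every id that appears more than once."""
    return conn.execute(
        "SELECT id, COUNT(*) FROM spike "
        "GROUP BY id HAVING COUNT(*) > 1 ORDER BY id"
    ).fetchall()


def remove_clones(conn):
    """Keep the first physical row for each id and delete the other copies."""
    conn.execute(
        "DELETE FROM spike "
        "WHERE rowid NOT IN (SELECT MIN(rowid) FROM spike GROUP BY id)"
    )
```

One caveat: in a case like the one described, a query answered from the primary-key index might only ever see one copy, so a real check may need to force a full table scan rather than an index lookup.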

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1326480 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1326484 - Posted: 10 Jan 2013, 22:06:40 UTC - in response to Message 1326480.  

Thanks for the update Matt, Happy New Year,

Claggy
ID: 1326484 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1326487 - Posted: 10 Jan 2013, 22:26:57 UTC - in response to Message 1326480.  

Wow - 2013. Every new year now sounds like a science fiction year.

Only 50 more years before Zefram rides his rocket, huh? :-)
ID: 1326487 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9958
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1326512 - Posted: 11 Jan 2013, 0:13:01 UTC

Thanks for the update and insight.
ID: 1326512 · Report as offensive
Profile Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 31012
Credit: 53,134,872
RAC: 32
United States
Message 1326544 - Posted: 11 Jan 2013, 4:13:19 UTC

Great post and thanks for taking the time to write it.

ID: 1326544 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1326571 - Posted: 11 Jan 2013, 6:12:47 UTC - in response to Message 1326544.  

Great post and thanks for taking the time to write it.

Yep, greatly appreciated.
Grant
Darwin NT
ID: 1326571 · Report as offensive
Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1326584 - Posted: 11 Jan 2013, 6:35:45 UTC

Thanks for the update, appreciated as always.
#resist
ID: 1326584 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1326666 - Posted: 11 Jan 2013, 14:08:31 UTC - in response to Message 1326571.  

Great post and thanks for taking the time to write it.

Yep, greatly appreciated.

Same here. I know writing updates for us isn't the best use of your time, but then maybe it is when you consider the good will it generates among us.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1326666 · Report as offensive
Profile Brother Frank

Joined: 10 Dec 11
Posts: 26
Credit: 15,142,410
RAC: 0
United States
Message 1327168 - Posted: 12 Jan 2013, 19:41:28 UTC - in response to Message 1326480.  

Thanks so much Matt. I appreciate the overview of your quick check and diagnosis methods. It makes a great deal of sense given your shortage of staffing. When things go wrong on my six computers here, it is often an internet connectivity issue like Comcast being down or my modem or router getting turned off accidentally. I check that first. If it is common to all my machines, I often just let it go for a while and let the issue rest. As I run out of work, I gradually switch machines over to backup projects. My favorites are GPU Grid for my Nvidia cards and Rosetta at Home for my CPUs. They respond very fast and soon all machines are running flat out on them. The main thing for me has been to learn to relax and take it easy with all this. I do what I can without running around too much from machine to machine. Sometimes, I get a bad update from Nvidia and have to roll it back to the last widely accepted version. Other times, I've had a section of memory go bad or a hard drive fry. I am slowly learning to take things in stride. Like you, but to a much smaller degree, I often have something not quite working right and every so often, a major repair is involved. It was very kind of you to share with us. It puts me much more at ease because I know now that you attack the problems and find and tweak many issues. I love your idea of computer malaise or network malaise. A vague type of discomfort from an unspecified source describes it well. Generally my malaise turns to specific types of issues as I am with Seti @Home longer and longer. That's headed in the right direction. Brother Frank
ID: 1327168 · Report as offensive
Neil L. Carter Project Donor
Volunteer tester

Joined: 6 Dec 99
Posts: 62
Credit: 16,385,509
RAC: 27
United States
Message 1327714 - Posted: 15 Jan 2013, 23:01:26 UTC

So I think everyone knows you've been getting new hardware (which is great!!!).

Is there a plan somewhere to upgrade the internet connection sometime? I know this bottleneck has been discussed many, many times in the past, but I've not heard of any plans for actually resolving the issue.

Thanks!
ID: 1327714 · Report as offensive
Profile Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 31012
Credit: 53,134,872
RAC: 32
United States
Message 1327790 - Posted: 16 Jan 2013, 4:01:37 UTC - in response to Message 1327714.  

So I think everyone knows you've been getting new hardware (which is great!!!).

Is there a plan somewhere to upgrade the internet connection sometime? I know this bottleneck has been discussed many, many times in the past, but I've not heard of any plans for actually resolving the issue.

Thanks!


AFAIK the problem with the bandwidth is intractable absent a large infusion of cash. The problem is political in nature and the beast is underfed and is demanding a full meal to pass.

ID: 1327790 · Report as offensive
Profile Adam Weichel

Joined: 30 Jul 02
Posts: 22
Credit: 25,877,509
RAC: 46
Canada
Message 1327910 - Posted: 16 Jan 2013, 13:03:11 UTC

Matt - is there any hardware required presently, or are cash infusions the best way to make an impact? I keep meaning to harass Eric over FB, but he always mentions he'll ask you anyhow.

I could probably come up with HDDs, possibly some ECC DDR2/3 RAM (depending on the needs).

Let me know. :)

Adam
Computer nut, Distributed Computing freak, Jeeper and Dodge Ram driver.

Life is worth living... and worth discovering.

I run VMWare ESXi Free - why don't you?
ID: 1327910 · Report as offensive
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22535
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1327982 - Posted: 16 Jan 2013, 17:11:02 UTC

The GPUUG have set up a page which lists the current hardware (and monetary) donations.
Take a look at this thread
http://setiathome.berkeley.edu/forum_thread.php?id=70511
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1327982 · Report as offensive
JardaM

Joined: 14 Sep 99
Posts: 20
Credit: 12,211,383
RAC: 5
Czech Republic
Message 1331287 - Posted: 25 Jan 2013, 21:16:39 UTC

A small problem in your code, Matt.
WU 1029297452 (it is an AP WU) has had three valid results since mid Aug 2012 and is still not validated. The problem is that the 3rd task was possibly released shortly before the 2nd was returned from the field, probably just before its deadline.
ID: 1331287 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 1331325 - Posted: 25 Jan 2013, 22:37:00 UTC - in response to Message 1331287.  

A small problem in your code, Matt.
WU 1029297452 (it is an AP WU) has had three valid results since mid Aug 2012 and is still not validated. The problem is that the 3rd task was possibly released shortly before the 2nd was returned from the field, probably just before its deadline.

Just before the deadline of which task? If it is the second, this is highly unlikely as not being returned by the deadline is the trigger that generates the third task.


BOINC WIKI
ID: 1331325 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1331389 - Posted: 26 Jan 2013, 2:47:16 UTC - in response to Message 1331325.  

A small problem in your code, Matt.
WU 1029297452 (it is an AP WU) has had three valid results since mid Aug 2012 and is still not validated. The problem is that the 3rd task was possibly released shortly before the 2nd was returned from the field, probably just before its deadline.

Just before the deadline of which task? If it is the second, this is highly unlikely as not being returned by the deadline is the trigger that generates the third task.

Indeed, task 2525261366 had a report deadline of 8 Aug 2012, 22:42:42 UTC. The third task was Created 8 Aug 2012, 22:42:45 UTC and Sent 8 Aug 2012, 22:42:47 UTC.

The third result may or may not be valid, but it is certainly a problem that when it was reported as a success, validation didn't take place. There are other Astropulse WUs in the same state for the same reason; that sequence of events seems to always lead to a zombie-like state for Astropulse tasks (but not for SETI@home Enhanced).
                                                                    Joe
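
For reference, the timing quoted above works out as follows. The snippet only restates the arithmetic from the three timestamps in the post; it does not model how the scheduler or validator actually behaves, and the variable names are chosen here for illustration.

```python
from datetime import datetime, timezone

# Timestamp layout as shown on the task pages; %b assumes English month names.
FMT = "%d %b %Y, %H:%M:%S"


def parse_utc(text):
    """Parse a forum-style timestamp and mark it as UTC."""
    return datetime.strptime(text, FMT).replace(tzinfo=timezone.utc)


deadline = parse_utc("8 Aug 2012, 22:42:42")  # report deadline of task 2525261366
created = parse_utc("8 Aug 2012, 22:42:45")   # third task created
sent = parse_utc("8 Aug 2012, 22:42:47")      # third task sent

# Seconds between the missed deadline and the replacement task's lifecycle.
gap_created = (created - deadline).total_seconds()
gap_sent = (sent - deadline).total_seconds()
```

So the replacement task was created 3 seconds after the deadline passed and sent 2 seconds after that, which leaves only a very narrow window for a late result to arrive in between.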
ID: 1331389 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36814
Credit: 261,360,520
RAC: 489
Australia
Message 1331714 - Posted: 26 Jan 2013, 21:54:56 UTC - in response to Message 1331389.  

A small problem in your code, Matt.
WU 1029297452 (it is an AP WU) has had three valid results since mid Aug 2012 and is still not validated. The problem is that the 3rd task was possibly released shortly before the 2nd was returned from the field, probably just before its deadline.

Just before the deadline of which task? If it is the second, this is highly unlikely as not being returned by the deadline is the trigger that generates the third task.

Indeed, task 2525261366 had a report deadline of 8 Aug 2012, 22:42:42 UTC. The third task was Created 8 Aug 2012, 22:42:45 UTC and Sent 8 Aug 2012, 22:42:47 UTC.

The third result may or may not be valid, but it is certainly a problem that when it was reported as a success, validation didn't take place. There are other Astropulse WUs in the same state for the same reason; that sequence of events seems to always lead to a zombie-like state for Astropulse tasks (but not for SETI@home Enhanced).
                                                                    Joe

But SETI@home Enhanced does have its own problem with old stuck tasks such as this one, Workunit 638353788, and I know of several others with tasks caught in the same limbo dating back to around the same time.

But regardless of whether they are AP (of which I have 2 stuck) or MB, these old tasks must be having some effect on the performance of the database.

Cheers.
ID: 1331714 · Report as offensive
Profile Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 7379
Credit: 44,181,323
RAC: 238
United States
Message 1331965 - Posted: 27 Jan 2013, 12:50:59 UTC

Greetings,

Wow! Finally something I can post about in this section of the forum! WOOHOO! :)

I have 2 Astropulse v5.05 WUs that seem to be stuck:

797918561
799691663

Please note that the report dates are in August of 2011.

Keep on BOINCing...! :)

CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 1331965 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1332225 - Posted: 28 Jan 2013, 14:03:23 UTC

I have a stuck one, too, although it's not that old.

1150388682 only went to two of us, so none of the usual reasons for it to be stuck apply.

I only noticed it because it's on my machine that's currently dead.
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1332225 · Report as offensive
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22535
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1332265 - Posted: 28 Jan 2013, 17:11:49 UTC

Ladies and Gentlemen
The place to discuss the current downloads is in "number crunching", where I am sure you will find you are not alone.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1332265 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.