The New Year Unfolds... (Jan 10 2013)


Message boards : Technical News : The New Year Unfolds... (Jan 10 2013)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1326480 - Posted: 10 Jan 2013, 21:55:19 UTC

The new year is unfolding nicely, more or less. Wow - 2013. Every new year now sounds like a science fiction year. I don't really have anything major to report, but here's another update anyway.

We were supposed to have some more lab-wide power repairs last weekend. This got postponed to a later date which has yet to be settled upon.

As I've been mentioning for years, the BOINC server backend (everything pertaining to creating the workunit, sending it out, receiving the result and processing it) runs in many parts on a set of constantly changing servers of disparate make, model and power, and thus some problems involve so many moving targets that they are almost impossible to diagnose. I tend to refer to these times when performance is lower than expected as "server malaise." It also doesn't help that we are dealing with an almost constant malaise, given we are pretty much maxed out on our network connection to the world 24 hours a day. This is like running a retail business with a line out the door 24 hours a day - no quiet time to clean the place up, restock the shelves, etc.

Usually when we see some queue backing up, or network traffic drop, the procedure is somewhat like this:

1. Check to see if a server or important service (httpd, informix, mysql) isn't running - these are easy to find and hopefully easy to fix.
2. Check to see if some BOINC mechanism (validation, assimilation, etc.) is stuck on something - these are relatively easy to find (by scanning logs and process tables) and sometimes easy to fix, but not always.
3. Check to see if everything is kind of working, just slowly. If this is true, we tend to write it off as "server malaise" and wait and see if it improves on its own - the functional equivalent of "take two aspirin and call me in the morning."

Usually we find things improve on their own over time, or if not, more obvious clues as to the actual problems make themselves clearer. We simply don't find it an efficient use of our very limited time to understand and solve every problem perfectly.
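
For readers who think better in code, the three checks above can be sketched roughly as follows. This is only an illustration of the order of the checks, not the project's actual tooling: `BackendSnapshot`, `triage`, and the 0.8 slowness threshold are all made-up names and values.

```python
from dataclasses import dataclass, field


@dataclass
class BackendSnapshot:
    """Hypothetical summary of backend state (not a BOINC data structure)."""
    services_down: list = field(default_factory=list)  # e.g. ["httpd", "mysql"]
    stuck_daemons: list = field(default_factory=list)  # e.g. ["validator"]
    throughput_ratio: float = 1.0                       # observed / expected


def triage(snapshot: BackendSnapshot, slow_threshold: float = 0.8) -> str:
    """Walk the three checks in order and return the first one that matches."""
    # Step 1: a server or important service is not running.
    if snapshot.services_down:
        return "restart: " + ", ".join(snapshot.services_down)
    # Step 2: a BOINC mechanism (validation, assimilation, ...) is stuck.
    if snapshot.stuck_daemons:
        return "investigate: " + ", ".join(snapshot.stuck_daemons)
    # Step 3: everything works, just slowly: "server malaise", wait and see.
    if snapshot.throughput_ratio < slow_threshold:
        return "server malaise: wait and re-check"
    return "ok"
```

The checks run from cheapest to diagnose to most expensive, so a dead service is reported before anyone spends time wondering about general slowness.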

I mention all this as we certainly had a few malaises over the past few weeks. The one last week was due to one cronjob failing to run, which didn't update some statistics, which led to some splitters running too much and generating too much work, which led to a bloated database and bloated filesystem, which led to slow backend processing, which took about 4 days to clear out, but it eventually did without any effort on our part. During that time general upload/download bandwidth was constrained a tad, but we survived.
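
A cascade like this starts with a producer trusting statistics that have quietly gone stale. One common guard, shown here purely as an illustration and not as something the post says SETI@home does, is to have the producer refuse to run when its input statistics are too old. The function name, the queue limit, and the one-hour age limit are all assumptions.

```python
import time


def splitter_should_run(queue_size, stats_updated_at, now=None,
                        max_queue=100_000, max_stats_age=3600.0):
    """Decide whether a (hypothetical) splitter may generate more work.

    queue_size       -- number of workunits reported as ready to send
    stats_updated_at -- Unix time at which queue_size was last refreshed
    """
    now = time.time() if now is None else now
    # If the statistics are stale, the queue size cannot be trusted,
    # so fail safe and generate nothing until they are refreshed.
    if now - stats_updated_at > max_stats_age:
        return False
    # Otherwise only top up the queue while it is below the limit.
    return queue_size < max_queue
```

With a guard of this kind a missed cronjob would stall work generation rather than flood the database, which is a different failure but an easier one to notice.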

Otherwise, things are well. The recent (or relatively recent) server upgrades have been a major blessing, and more are planned. During the outage on Tuesday I actually moved some servers around such that *all* the SETI related servers are now in the closet (as opposed to our auxiliary lab). This is a first, I think. Outside of our desktops all SETI machines are in the racks.

Of course, this is just in time for the closet a/c to be in need of repair. This surgery is happening on Monday, and may take a couple of days, during which the projects will all be down (with limited servers left up to keep the web site alive with a warning on the front page and status updates). We hope to be back up Tuesday afternoon. There is a chance repairs won't work. We have a plan B (and C) if this happens but let's just be positive and cross that bridge if/when we get there.

Oh yeah, one random note. Yesterday I had some fun with this database weirdness. Somewhere along the line, perhaps during one of many sudden power outages, a small set (i.e. about 10 out of 3,000,000,000) of the spikes in the database were cloned, and became two entries in the database with the same id #s. This is "impossible" as id #s are primary keys and supposed to be unique. So which of the clones we saw depended on how you were selecting these spikes - selecting by id or by some other field, you'd get one clone or the other. This wasn't apparent at all until I tried to update values in these spikes, and then when selecting them I'd get the unupdated clone and it looked like the update wasn't working. Long story short, I finally figured this out and got rid of the clones. But yeah, databases sure can be funny sometimes.
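
As a rough illustration of how duplicate ids can be hunted down, the usual trick is to group by the key and keep the groups with more than one row. The sketch below uses SQLite from Python only because it is self-contained; the science database in the post is Informix, and the table name `spike`, the column `power`, and the helper `find_duplicate_ids` are assumptions. The table is deliberately created without a primary key so that the "impossible" state can be reproduced.

```python
import sqlite3


def find_duplicate_ids(conn, table="spike", key="id"):
    """Return (key value, row count) for every key that appears more than once.

    The table and key names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    query = (
        f"SELECT {key}, COUNT(*) FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) > 1 ORDER BY {key}"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
# No PRIMARY KEY constraint here, to simulate the corrupted state.
conn.execute("CREATE TABLE spike (id INTEGER, power REAL)")
conn.executemany(
    "INSERT INTO spike VALUES (?, ?)",
    [(1, 10.0), (2, 11.5), (2, 11.5), (3, 9.0)],
)
duplicates = find_duplicate_ids(conn)
print(duplicates)  # [(2, 2)]
```

On a genuinely corrupted table the result may depend on which access path the engine picks, as described above, so a scan that bypasses the index on the key is probably the safer way to run such a check.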

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4067
Credit: 32,905,352
RAC: 7,730
United Kingdom
Message 1326484 - Posted: 10 Jan 2013, 22:06:40 UTC - in response to Message 1326480.

Thanks for the update Matt, Happy New Year,

Claggy

Ageless
Joined: 9 Jun 99
Posts: 12285
Credit: 2,575,697
RAC: 755
Netherlands
Message 1326487 - Posted: 10 Jan 2013, 22:26:57 UTC - in response to Message 1326480.

Wow - 2013. Every new year now sounds like a science fiction year.

Only 50 more years before Zefram rides his rocket, huh? :-)
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 6907
Credit: 25,775,836
RAC: 38,690
United Kingdom
Message 1326512 - Posted: 11 Jan 2013, 0:13:01 UTC

Thanks for the update and insight.
____________


Today is life, the only life we're sure of. Make the most of today.

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12405
Credit: 6,715,265
RAC: 8,808
United States
Message 1326544 - Posted: 11 Jan 2013, 4:13:19 UTC

Great post and thanks for taking the time to write it.

____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5792
Credit: 58,053,803
RAC: 48,263
Australia
Message 1326571 - Posted: 11 Jan 2013, 6:12:47 UTC - in response to Message 1326544.

Great post and thanks for taking the time to write it.

Yep, greatly appreciated.
____________
Grant
Darwin NT.

Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,732,094
RAC: 1,208
United States
Message 1326584 - Posted: 11 Jan 2013, 6:35:45 UTC

Thanks for the update, appreciated as always.
____________
-Dave #2

3.2.0-33

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11162
Credit: 13,959,580
RAC: 12,405
United States
Message 1326666 - Posted: 11 Jan 2013, 14:08:31 UTC - in response to Message 1326571.

Great post and thanks for taking the time to write it.

Yep, greatly appreciated.

Same here. I know writing updates for us isn't the best use of your time, but then maybe it is when you consider the good will it generates among us.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Brother Frank
Joined: 10 Dec 11
Posts: 26
Credit: 15,142,410
RAC: 0
United States
Message 1327168 - Posted: 12 Jan 2013, 19:41:28 UTC - in response to Message 1326480.

Thanks so much Matt. I appreciate the overview of your quick check and diagnosis methods. It makes a great deal of sense given your shortage of staffing. When things go wrong on my six computers here, it is often an internet connectivity issue like Comcast being down or my modem or router getting turned off accidentally. I check that first. If it is common to all my machines, I often just let it go for a while and let the issue rest. As I run out of work, I gradually switch machines over to backup projects. My favorites are GPU Grid for my Nvidia cards and Rosetta at Home for my CPUs. They respond very fast and soon all machines are running flat out on them. The main thing for me has been to learn to relax and take it easy with all this. I do what I can without running around too much from machine to machine. Sometimes I get a bad update from Nvidia and have to roll it back to the last widely accepted version. Other times, I've had a section of memory go bad or a hard drive fry. Am slowly learning to take things in stride. Like with you, but to a much smaller degree, I often have something not quite working right and every so often a major repair is involved. It was very kind of you to share with us. It puts me much more at ease because I know now that you attack the problems and find and tweak many issues. I love your idea of computer malaise or network malaise. A vague type of discomfort from an unspecified source describes it well. Generally my malaise turns to specific types of issues as I am with SETI@home longer and longer. That's headed in the right direction.
Brother Frank

Neil L. Carter
Project donor
Volunteer tester
Joined: 6 Dec 99
Posts: 53
Credit: 4,067,219
RAC: 5,291
United States
Message 1327714 - Posted: 15 Jan 2013, 23:01:26 UTC

So I think everyone knows you've been getting new hardware (which is great!!!).

Is there a plan somewhere to upgrade the internet connection sometime? I know this bottleneck has been discussed many, many times in the past, but I've not heard of any plans for actually resolving the issue.

Thanks!
____________

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12405
Credit: 6,715,265
RAC: 8,808
United States
Message 1327790 - Posted: 16 Jan 2013, 4:01:37 UTC - in response to Message 1327714.

So I think everyone knows you've been getting new hardware (which is great!!!).

Is there a plan somewhere to upgrade the internet connection sometime? I know this bottleneck has been discussed many, many times in the past, but I've not heard of any plans for actually resolving the issue.

Thanks!


AFAIK the problem with the bandwidth is intractable absent a large infusion of cash. The problem is political in nature and the beast is underfed and is demanding a full meal to pass.

____________

Adam Weichel
Joined: 30 Jul 02
Posts: 22
Credit: 11,271,039
RAC: 1,854
Canada
Message 1327910 - Posted: 16 Jan 2013, 13:03:11 UTC

Matt - is there any hardware required presently, or are cash infusions the best way to make an impact? I keep meaning to harass Eric over FB, but he always mentions he'll ask you anyhow.

I could probably come up with HDDs, possibly some ECC DDR2/3 RAM (depending on the needs).

Let me know. :)

Adam
____________
Computer nut, Distributed Computing freak, Jeeper and Dodge Ram driver.

Life is worth living... and worth discovering.

I run VMWare ESXi Free - why don't you?

rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8310
Credit: 55,293,030
RAC: 75,539
United Kingdom
Message 1327982 - Posted: 16 Jan 2013, 17:11:02 UTC

The GPUUG have set up a page which lists the current hardware (and monetary) donations.
Take a look at this thread
http://setiathome.berkeley.edu/forum_thread.php?id=70511
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

JardaM
Joined: 14 Sep 99
Posts: 18
Credit: 9,258,912
RAC: 546
Czech Republic
Message 1331287 - Posted: 25 Jan 2013, 21:16:39 UTC

A small problem in your code, Matt.
WU 1029297452 (an AP WU) has had three valid results since mid Aug 2012 and is still not validated. The problem is possibly that the 3rd task was released shortly before the 2nd was returned from the field, probably just before its deadline.
____________

John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24385
Credit: 519,750
RAC: 37
United States
Message 1331325 - Posted: 25 Jan 2013, 22:37:00 UTC - in response to Message 1331287.

A small problem in your code, Matt.
WU 1029297452 (an AP WU) has had three valid results since mid Aug 2012 and is still not validated. The problem is possibly that the 3rd task was released shortly before the 2nd was returned from the field, probably just before its deadline.

Just before the deadline of which task? If it is the second, this is highly unlikely as not being returned by the deadline is the trigger that generates the third task.
____________


BOINC WIKI

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4230
Credit: 1,043,161
RAC: 314
United States
Message 1331389 - Posted: 26 Jan 2013, 2:47:16 UTC - in response to Message 1331325.

A small problem in your code, Matt.
WU 1029297452 (an AP WU) has had three valid results since mid Aug 2012 and is still not validated. The problem is possibly that the 3rd task was released shortly before the 2nd was returned from the field, probably just before its deadline.

Just before the deadline of which task? If it is the second, this is highly unlikely as not being returned by the deadline is the trigger that generates the third task.

Indeed, task 2525261366 had a report deadline of 8 Aug 2012, 22:42:42 UTC. The third task was Created 8 Aug 2012, 22:42:45 UTC and Sent 8 Aug 2012, 22:42:47 UTC.

The third result may or may not be valid, but it is certainly a problem that Validation didn't take place when it was reported as a success. There are other Astropulse WUs in the same state for the same reason; that sequence of events seems to always lead to a zombie-like state for Astropulse tasks (but not for SETI@home Enhanced).
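
To make the timing concrete, here is the arithmetic on the three timestamps quoted above, as a small sketch. Only the timestamps come from the thread; the helper `parse_utc` and the format string are illustrative, and parsing the month name with `%b` assumes an English (C) locale.

```python
from datetime import datetime, timezone

# Format of the timestamps as they appear on the task pages.
FMT = "%d %b %Y, %H:%M:%S"


def parse_utc(stamp):
    """Parse a timestamp like '8 Aug 2012, 22:42:42' as UTC."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


deadline = parse_utc("8 Aug 2012, 22:42:42")  # report deadline of task 2525261366
created = parse_utc("8 Aug 2012, 22:42:45")   # third task created
sent = parse_utc("8 Aug 2012, 22:42:47")      # third task sent

gap_created = (created - deadline).total_seconds()
gap_sent = (sent - deadline).total_seconds()
print(gap_created, gap_sent)  # 3.0 5.0
```

So the third task was created 3 seconds after the deadline passed and sent 2 seconds after that, which fits the deadline miss being the trigger for the resend.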
Joe

Wiggo
Joined: 24 Jan 00
Posts: 6790
Credit: 93,122,185
RAC: 75,947
Australia
Message 1331714 - Posted: 26 Jan 2013, 21:54:56 UTC - in response to Message 1331389.

A small problem in your code, Matt.
WU 1029297452 (an AP WU) has had three valid results since mid Aug 2012 and is still not validated. The problem is possibly that the 3rd task was released shortly before the 2nd was returned from the field, probably just before its deadline.

Just before the deadline of which task? If it is the second, this is highly unlikely as not being returned by the deadline is the trigger that generates the third task.

Indeed, task 2525261366 had a report deadline of 8 Aug 2012, 22:42:42 UTC. The third task was Created 8 Aug 2012, 22:42:45 UTC and Sent 8 Aug 2012, 22:42:47 UTC.

The third result may or may not be valid, but it is certainly a problem that Validation didn't take place when it was reported as a success. There are other Astropulse WUs in the same state for the same reason; that sequence of events seems to always lead to a zombie-like state for Astropulse tasks (but not for SETI@home Enhanced).
Joe

But SETI@home Enhanced does have its own problem with old stuck tasks such as this one, Workunit 638353788, and I know of several others with tasks caught in the same limbo dating back to around the same time.

But regardless of whether they be AP (of which I have 2 stuck) or MB these old tasks would have to be producing some effect on the performance of the database.

Cheers.
____________

Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 5686
Credit: 4,653,062
RAC: 2,886
United States
Message 1331965 - Posted: 27 Jan 2013, 12:50:59 UTC

Greetings,

Wow! Finally something I can post about in this section of the forum! WOOHOO! :)

I have 2 Astropulse v5.05 WUs that seem to be stuck:

797918561
799691663

Please note that the report dates are in August of 2011.

Keep on BOINCing...! :)

____________
CAPT Siran d'Vel'nahr XO
USS Vre'kasht NCC-33187

Siran's website: [ ONLINE! ]

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11162
Credit: 13,959,580
RAC: 12,405
United States
Message 1332225 - Posted: 28 Jan 2013, 14:03:23 UTC

I have a stuck one, too, although it's not that old.

1150388682 only went to two of us, so none of the usual reasons for it to be stuck apply.

I only noticed it because it's on my machine that's currently dead.
____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8310
Credit: 55,293,030
RAC: 75,539
United Kingdom
Message 1332265 - Posted: 28 Jan 2013, 17:11:49 UTC

Ladies and Gentlemen
The place to discuss the current downloads is in "number crunching", where I am sure you will find you are not alone.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?


Copyright © 2014 University of California