Noddy Goes to Sweden (Dec 12 2007)


Message boards : Technical News : Noddy Goes to Sweden (Dec 12 2007)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 690923 - Posted: 12 Dec 2007, 21:27:05 UTC

Blech. The fallout from yesterday's business wasn't very pretty. The science database server had a migraine all night due to the load-intensive index build and subsequent mounting errors due to heavy disk i/o. So the assimilators were off until this morning after we rebooted the system and cleared its pipes.

However, towards the end of the day yesterday I spotted something funny. Of our two scheduling servers, bruno and ptolemy, the former was refusing to send out any work. This wasn't a network issue, nor was it a real lack-of-work issue. There was plenty of work in bruno's queue, and the feeder had it all stowed up in shared memory ready to go, but the scheduler, for no apparent reason, was allowing none of it through. Clients were requesting N seconds of work and bruno would send them 0 workunits. Clients requesting the same N seconds of work from ptolemy were getting work. This was weird and unlike anything we've seen before. Of course, bruno and ptolemy have identical kernels, scheduler executables, apache configurations, database permissions, file server permissions, network routes, etc. etc. etc. Jeff and I have been beating our heads against this for basically all of last night and this morning and we still have no idea. Jeff's adding some new debug code to the scheduler as I type.

We do have a workaround - just dump all the traffic on ptolemy until we figure it out. We may very well do this by the end of the day if the real problem doesn't present itself.

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.

By the way, Bob is taking over adding a "median" form of the result turnaround time query and determining if it will hit the database as hard as I feared. Cool.
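
For anyone wondering why a "median" form of the turnaround figure is worth the trouble, here is a minimal Python sketch, not the actual database query. The function name and the sample timestamps are made up for illustration; the point is only that a single very late result drags the mean far away from what a typical host experiences, while the median barely moves.

```python
from statistics import mean, median


def turnaround_summary(sent_times, received_times):
    """Return (mean, median) turnaround in seconds.

    Hypothetical helper: both arguments are parallel lists of Unix-style
    timestamps (seconds). Pairs where the result arrived before it was
    sent are treated as bad data and skipped.
    """
    turnarounds = [r - s for s, r in zip(sent_times, received_times) if r >= s]
    if not turnarounds:
        raise ValueError("no valid sent/received pairs")
    return mean(turnarounds), median(turnarounds)


# Made-up sample: four results back within a couple of hours,
# one straggler returned ten days later.
sent = [0, 0, 0, 0, 0]
received = [3600, 7200, 5400, 4000, 864000]

avg, med = turnaround_summary(sent, received)
print(f"mean turnaround:   {avg} s")
print(f"median turnaround: {med} s")
```

The catch, and presumably the reason for the database worry, is that a median needs the values in sorted order, whereas a mean can be accumulated in a single pass.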

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile Phil Kline
Joined: 11 Jun 99
Posts: 6
Credit: 121,918
RAC: 0
Australia
Message 690936 - Posted: 12 Dec 2007, 22:53:59 UTC

Keep asking for work, get Message from Server: No work sent. You guys have got one heck of a problem there from the sound of it.

Best of luck,


Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 690939 - Posted: 12 Dec 2007, 23:11:21 UTC

Update: Jeff found the basic gist of the problem. Totally totally totally arcane and still a bit of a mystery to us. More explanation as we figure it out, but we have a band-aid solution in place for now. That pretty much killed an entire day.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile Phil Kline
Joined: 11 Jun 99
Posts: 6
Credit: 121,918
RAC: 0
Australia
Message 690941 - Posted: 12 Dec 2007, 23:17:52 UTC

Back up again, just got one work unit. Great work, guys!!!!

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8266
Credit: 4,071,441
RAC: 350
United Kingdom
Message 690942 - Posted: 12 Dec 2007, 23:21:40 UTC - in response to Message 690939.
Last modified: 12 Dec 2007, 23:22:28 UTC

Update: Jeff found the basic gist of the problem. Totally totally totally arcane ...

Good stuff and sounding intriguing...

Dare I make a wild guess: file-lock problems?

Good luck,

Regards,
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 690944 - Posted: 12 Dec 2007, 23:30:12 UTC - in response to Message 690942.
Last modified: 12 Dec 2007, 23:31:33 UTC

Dare I make a wild guess: file-lock problems?


Good guess but wrong.

Another tease: a long-standing bug in the BOINC backend server code that only manifested itself just now and never before, and on only one system, all of which seems statistically impossible to me at this point.

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 539,987
RAC: 567
United States
Message 690949 - Posted: 13 Dec 2007, 0:01:46 UTC - in response to Message 690944.

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.


At least you found it. Sometimes I never find the bug, only work around it.

Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 10
United States
Message 690960 - Posted: 13 Dec 2007, 0:24:55 UTC
Last modified: 13 Dec 2007, 0:27:25 UTC

. . . sneaky server eh - Nice Work Matt (and to Each of You @ Berkeley) Keep it up

ps - 'Do They Hurt' ;)
____________
BOINC Wiki . . .

Science Status Page . . .

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8266
Credit: 4,071,441
RAC: 350
United Kingdom
Message 691083 - Posted: 13 Dec 2007, 8:42:35 UTC - in response to Message 690944.
Last modified: 13 Dec 2007, 8:44:37 UTC

Dare I make a wild guess: file-lock problems?

Good guess but wrong.

Another tease: a long-standing bug in the BOINC backend server code that only manifested itself just now and never before, and on only one system, all of which seems statistically impossible to me at this point.

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.

Well, that still leaves it at a 'wild guess' without a clue...

Wild guess #2: Something silly with the machine name or IP address, or the routing tables to that machine...?


What changed after/during your last shutdown for that to appear now?...


Happy bug squashing!

Cheers,
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Profile Ace Casino
Joined: 5 Feb 03
Posts: 283
Credit: 20,116,032
RAC: 11,632
United States
Message 691098 - Posted: 13 Dec 2007, 11:23:39 UTC

FYI:
There was an article in the “Washington Post” this past weekend titled “Are They Out There”. The article is about UFOs, but there are a few paragraphs mentioning SETI, the new Allen Telescope Array at Berkeley and its mission to find a radio signal.

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,711,384
RAC: 646,449
United States
Message 691099 - Posted: 13 Dec 2007, 11:26:23 UTC - in response to Message 691098.

FYI:
There was an article in the “Washington Post” this past weekend titled “Are They Out There”. The article is about UFOs, but there are a few paragraphs mentioning SETI, the new Allen Telescope Array at Berkeley and its mission to find a radio signal.

Hmmm.....a link to the article in NC might be in order......
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8266
Credit: 4,071,441
RAC: 350
United Kingdom
Message 691107 - Posted: 13 Dec 2007, 12:58:00 UTC - in response to Message 691083.
Last modified: 13 Dec 2007, 12:59:32 UTC

Dare I make a wild guess: file-lock problems?

Good guess but wrong.

Another tease: a long-standing bug in the BOINC backend server code that only manifested itself just now and never before, and on only one system, all of which seems statistically impossible to me at this point.

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.

Well, that still leaves it at a 'wild guess' without a clue...

Wild guess #2: Something silly with the machine name or IP address, or the routing tables to that machine...?


What changed after/during your last shutdown for that to appear now?...

And I have to clarify also ;-)

You had the BOINC clients trying to download WU data from the wrong server?...


Happy bug squashing!

Cheers,
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4920
Credit: 72,613,315
RAC: 1,286
Australia
Message 691118 - Posted: 13 Dec 2007, 14:10:21 UTC - in response to Message 691083.

Dare I make a wild guess: file-lock problems?

Good guess but wrong.

Another tease: a long-standing bug in the BOINC backend server code that only manifested itself just now and never before, and on only one system, all of which seems statistically impossible to me at this point.

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.

Well, that still leaves it at a 'wild guess' without a clue...

Wild guess #2: Something silly with the machine name or IP address, or the routing tables to that machine...?


What changed after/during your last shutdown for that to appear now?...


Happy bug squashing!

Cheers,
Martin


My wild guess, tongue planted firmly in my cheek... the version of libcurl compiled into the server code isn't playing friendly with a proxy [or proxy-style] configuration somewhere in the line [the load sharing etc... maybe]

Jason

____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 691162 - Posted: 13 Dec 2007, 18:26:06 UTC - in response to Message 691118.


My wild guess, tongue planted firmly in my cheek... the version of libcurl compiled into the server code isn't playing friendly with a proxy [or proxy-style] configuration somewhere in the line [the load sharing etc... maybe]


Maybe they should revert to version 4.45 or something? ;)

Profile Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12130
Credit: 6,411,927
RAC: 8,134
United States
Message 691262 - Posted: 14 Dec 2007, 1:02:26 UTC - in response to Message 690923.

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.
- Matt


Matt:

FYI the mass e-mail was treated by AOL as SPAM and delivered to the spam box. You might want to talk to AOL's e-mail admins to have your outbound mail not classed as spam as everyone has asked for it. Might also help with fundraising if people actually get the e-mail :)


____________

Profile kev1701e
Joined: 28 Dec 99
Posts: 138
Credit: 10,004,553
RAC: 0
United States
Message 691428 - Posted: 14 Dec 2007, 17:02:24 UTC - in response to Message 691262.

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.
- Matt


Matt:

FYI the mass e-mail was treated by AOL as SPAM and delivered to the spam box. You might want to talk to AOL's e-mail admins to have your outbound mail not classed as spam as everyone has asked for it. Might also help with fundraising if people actually get the e-mail :)


It was spam to Yahoo as well

kev

Macroman1
Joined: 30 May 99
Posts: 67
Credit: 12,532,684
RAC: 0
United States
Message 691434 - Posted: 14 Dec 2007, 17:21:04 UTC - in response to Message 691428.

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.
- Matt


Matt:

FYI the mass e-mail was treated by AOL as SPAM and delivered to the spam box. You might want to talk to AOL's e-mail admins to have your outbound mail not classed as spam as everyone has asked for it. Might also help with fundraising if people actually get the e-mail :)


It was spam to Yahoo as well

kev



Was marked as spam on my cox.net account too

____________
"Gentlemen, there are only two types of naval vessels..........Submarines, and Targets" -- U.S. Navy Submarine SONAR Instructor.

Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1898
Credit: 9,176,238
RAC: 12,255
United States
Message 691443 - Posted: 14 Dec 2007, 17:59:16 UTC - in response to Message 691434.

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.
- Matt


Matt:

FYI the mass e-mail was treated by AOL as SPAM and delivered to the spam box. You might want to talk to AOL's e-mail admins to have your outbound mail not classed as spam as everyone has asked for it. Might also help with fundraising if people actually get the e-mail :)


It was spam to Yahoo as well

kev



Was marked as spam on my cox.net account too


Managed to miss Earthlink's SPAM filter.

____________
.

Profile Ghery S. Pettit
Joined: 7 Nov 99
Posts: 283
Credit: 23,382,336
RAC: 4,473
United States
Message 691456 - Posted: 14 Dec 2007, 18:48:40 UTC

Wasn't marked as SPAM on my Comcast account (or by the IEEE e-mail alias server that saw it before Comcast).


____________

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 691516 - Posted: 14 Dec 2007, 23:55:02 UTC

Got through to my Yahoo account too without being filtered.

F.
____________


Copyright © 2014 University of California