Noddy Goes to Sweden (Dec 12 2007)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1441
Credit: 213,689
RAC: 0
United States
Message 690923 - Posted: 12 Dec 2007, 21:27:05 UTC

Blech. The fallout from yesterday's business wasn't very pretty. The science database server had a migraine all night thanks to the load-intensive index build and the subsequent mounting errors caused by heavy disk I/O. So the assimilators were off until this morning, when we rebooted the system and cleared its pipes.

However, towards the end of the day yesterday I spotted something funny. Of our two scheduling servers, bruno and ptolemy, the former was refusing to send out any work. This wasn't a network issue, nor was it a real lack-of-work issue. There was plenty of work in bruno's queue, and the feeder had it all stowed up in shared memory ready to go, but the scheduler for no apparent reason was allowing none of it through. Clients were requesting N seconds of work and bruno would send them 0 workunits. Clients requesting the same N seconds of work from ptolemy were getting it. This was weird and nothing like anything we've seen before. Of course, bruno and ptolemy have identical kernels, scheduler executables, Apache configurations, database permissions, file server permissions, network routes, etc. etc. etc. Jeff and I have been beating our heads on this for basically all of last night and this morning and we still have no idea. Jeff's adding some new debug code to the scheduler as I type.

We do have a workaround - just dump all the traffic on ptolemy until we figure it out. We may very well do this by the end of the day if the real problem doesn't present itself.

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.

By the way, Bob is taking over adding a "median" form of the result turnaround time query and determining if it will hit the database as hard as I feared. Cool.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 690923
Phil Kline
Joined: 11 Jun 99
Posts: 6
Credit: 121,918
RAC: 0
Australia
Message 690936 - Posted: 12 Dec 2007, 22:53:59 UTC

Keep asking for work, get Message from Server: No work sent. You guys have got one heck of a problem there from the sound of it.

Best of luck,


ID: 690936
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1441
Credit: 213,689
RAC: 0
United States
Message 690939 - Posted: 12 Dec 2007, 23:11:21 UTC

Update: Jeff found the basic gist of the problem. Totally totally totally arcane and still a bit of a mystery to us. More explaining as we figure it out but we have a band aid solution in place for now. That pretty much killed an entire day.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 690939
Phil Kline
Joined: 11 Jun 99
Posts: 6
Credit: 121,918
RAC: 0
Australia
Message 690941 - Posted: 12 Dec 2007, 23:17:52 UTC

Back up again, just got one work unit. Great work, guys!!!!
ID: 690941
ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 9202
Credit: 5,938,083
RAC: 1,927
United Kingdom
Message 690942 - Posted: 12 Dec 2007, 23:21:40 UTC - in response to Message 690939.  
Last modified: 12 Dec 2007, 23:22:28 UTC

Update: Jeff found the basic gist of the problem. Totally totally totally arcane ...

Good stuff and sounding intriguing...

Dare I make a wild guess: file-lock problems?

Good luck,

Regards,
Martin
See new freedom: Mageia5
See & try out for yourself: Linux Voice
The Future is what We all make IT (GPLv3)
ID: 690942
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1441
Credit: 213,689
RAC: 0
United States
Message 690944 - Posted: 12 Dec 2007, 23:30:12 UTC - in response to Message 690942.  
Last modified: 12 Dec 2007, 23:31:33 UTC

Dare I make a wild guess: file-lock problems?


Good guess but wrong.

Another tease: a long-standing bug in the BOINC backend server code that only manifested itself just now and never before, and on only one system, all of which seems statistically impossible to me at this point.

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 690944
DJStarfox
Joined: 23 May 01
Posts: 1057
Credit: 802,388
RAC: 86
United States
Message 690949 - Posted: 13 Dec 2007, 0:01:46 UTC - in response to Message 690944.  

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.


At least you found it. Sometimes I never find the bug, only work around it.
ID: 690949
Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16015
Credit: 749,424
RAC: 227
United States
Message 690960 - Posted: 13 Dec 2007, 0:24:55 UTC
Last modified: 13 Dec 2007, 0:27:25 UTC

. . . sneaky server eh - Nice Work Matt (and to Each of You @ Berkeley) Keep it up

ps - 'Do They Hurt' ;)
BOINC Wiki . . .

Science Status Page . . .
ID: 690960
ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 9202
Credit: 5,938,083
RAC: 1,927
United Kingdom
Message 691083 - Posted: 13 Dec 2007, 8:42:35 UTC - in response to Message 690944.  
Last modified: 13 Dec 2007, 8:44:37 UTC

Dare I make a wild guess: file-lock problems?

Good guess but wrong.

Another tease: a long-standing bug in the BOINC backend server code that only manifested itself just now and never before, and on only one system, all of which seems statistically impossible to me at this point.

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.

Well, that still leaves it at a 'wild guess' without a clue...

Wild guess #2: Something silly with the machine name or IP address, or the routing tables to that machine...?


What changed after/during your last shutdown for that to appear now?...


Happy bug squashing!

Cheers,
Martin
See new freedom: Mageia5
See & try out for yourself: Linux Voice
The Future is what We all make IT (GPLv3)
ID: 691083
Ace Casino
Joined: 5 Feb 03
Posts: 285
Credit: 22,983,478
RAC: 8,371
United States
Message 691098 - Posted: 13 Dec 2007, 11:23:39 UTC

FYI:
There was an article in the “Washington Post” this past weekend titled: “Are They Out There”. The article is about UFOs, but there are a few paragraphs mentioning SETI, the new Allen Telescope Array at Berkeley and its mission to find a radio signal.
ID: 691098
kittyman
Project Donor
Volunteer tester
Joined: 9 Jul 00
Posts: 45929
Credit: 815,351,053
RAC: 125,212
United States
Message 691099 - Posted: 13 Dec 2007, 11:26:23 UTC - in response to Message 691098.  

FYI:
There was an article in the “Washington Post” this past weekend titled: “Are They Out There”. The article is about UFOs, but there are a few paragraphs mentioning SETI, the new Allen Telescope Array at Berkeley and its mission to find a radio signal.

Hmmm.....a link to the article in NC might be in order......
Cats.....what more does one need?

Have made friends in this life.
Most were cats.
ID: 691099
ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 9202
Credit: 5,938,083
RAC: 1,927
United Kingdom
Message 691107 - Posted: 13 Dec 2007, 12:58:00 UTC - in response to Message 691083.  
Last modified: 13 Dec 2007, 12:59:32 UTC

Dare I make a wild guess: file-lock problems?

Good guess but wrong.

Another tease: a long-standing bug in the BOINC backend server code that only manifested itself just now and never before, and on only one system, all of which seems statistically impossible to me at this point.

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.

Well, that still leaves it at a 'wild guess' without a clue...

Wild guess #2: Something silly with the machine name or IP address, or the routing tables to that machine...?


What changed after/during your last shutdown for that to appear now?...

And I have to clarify also ;-)

You had the Boinc clients trying to download WU data from the wrong server?...


Happy bug squashing!

Cheers,
Martin
See new freedom: Mageia5
See & try out for yourself: Linux Voice
The Future is what We all make IT (GPLv3)
ID: 691107
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7243
Credit: 87,248,037
RAC: 5,277
Australia
Message 691118 - Posted: 13 Dec 2007, 14:10:21 UTC - in response to Message 691083.  

Dare I make a wild guess: file-lock problems?

Good guess but wrong.

Another tease: a long-standing bug in the BOINC backend server code that only manifested itself just now and never before, and on only one system, all of which seems statistically impossible to me at this point.

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.

Well, that still leaves it at a 'wild guess' without a clue...

Wild guess #2: Something silly with the machine name or IP address, or the routing tables to that machine...?


What changed after/during your last shutdown for that to appear now?...


Happy bug squashing!

Cheers,
Martin


My wild guess, tongue planted firmly in my cheek... the version of libcurl compiled into the server code isn't playing friendly with a proxy [or proxy style] configuration somewhere in the line [The load sharing etc...maybe ]

Jason

"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin
ID: 691118
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 691162 - Posted: 13 Dec 2007, 18:26:06 UTC - in response to Message 691118.  


My wild guess, tongue planted firmly in my cheek... the version of libcurl compiled into the server code isn't playing friendly with a proxy [or proxy style] configuration somewhere in the line [The load sharing etc...maybe ]


Maybe they should revert to version 4.45 or something? ;)
ID: 691162
Gary Charpentier
Crowdfunding Project Donor
Volunteer tester
Joined: 25 Dec 00
Posts: 18644
Credit: 21,477,976
RAC: 19,542
United States
Message 691262 - Posted: 14 Dec 2007, 1:02:26 UTC - in response to Message 690923.  

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.
- Matt


Matt:

FYI the mass e-mail was treated by AOL as SPAM and delivered to the spam box. You might want to talk to AOL's e-mail admins to have your outbound mail not classed as spam as everyone has asked for it. Might also help with fundraising if people actually get the e-mail :)


ID: 691262
kev1701e
Joined: 28 Dec 99
Posts: 138
Credit: 10,216,553
RAC: 0
United States
Message 691428 - Posted: 14 Dec 2007, 17:02:24 UTC - in response to Message 691262.  

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.
- Matt


Matt:

FYI the mass e-mail was treated by AOL as SPAM and delivered to the spam box. You might want to talk to AOL's e-mail admins to have your outbound mail not classed as spam as everyone has asked for it. Might also help with fundraising if people actually get the e-mail :)


It was spam to Yahoo as well

kev
ID: 691428
Macroman1
Joined: 30 May 99
Posts: 67
Credit: 12,532,684
RAC: 0
United States
Message 691434 - Posted: 14 Dec 2007, 17:21:04 UTC - in response to Message 691428.  

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.
- Matt


Matt:

FYI the mass e-mail was treated by AOL as SPAM and delivered to the spam box. You might want to talk to AOL's e-mail admins to have your outbound mail not classed as spam as everyone has asked for it. Might also help with fundraising if people actually get the e-mail :)


It was spam to Yahoo as well

kev



Was marked as spam on my cox.net account too

"Gentlemen, there are only two types of naval vessels..........Submarines, and Targets" -- U.S. Navy Submarine SONAR Instructor.
ID: 691434
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 2581
Credit: 34,767,455
RAC: 20,288
United States
Message 691443 - Posted: 14 Dec 2007, 17:59:16 UTC - in response to Message 691434.  

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.
- Matt


Matt:

FYI the mass e-mail was treated by AOL as SPAM and delivered to the spam box. You might want to talk to AOL's e-mail admins to have your outbound mail not classed as spam as everyone has asked for it. Might also help with fundraising if people actually get the e-mail :)


It was spam to Yahoo as well

kev



Was marked as spam on my cox.net account too


Managed to miss Earthlink's SPAM filter.

.
ID: 691443
Ghery S. Pettit
Joined: 7 Nov 99
Posts: 293
Credit: 24,046,105
RAC: 143
United States
Message 691456 - Posted: 14 Dec 2007, 18:48:40 UTC

Wasn't marked as SPAM on my Comcast account (or by the IEEE e-mail alias server that saw it before Comcast).


ID: 691456
Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 691516 - Posted: 14 Dec 2007, 23:55:02 UTC

Got through to my Yahoo account too without being filtered.

F.
ID: 691516


 
©2016 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.