Noddy Goes to Sweden (Dec 12 2007)

Message boards : Technical News : Noddy Goes to Sweden (Dec 12 2007)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 690923 - Posted: 12 Dec 2007, 21:27:05 UTC

Blech. The fallout from yesterday's business wasn't very pretty. The science database server had a migraine all night thanks to the load-intensive index build and the subsequent mounting errors brought on by heavy disk I/O. So the assimilators were off until this morning, when we rebooted the system and cleared its pipes.

However, towards the end of the day yesterday I spotted something funny. Of our two scheduling servers, bruno and ptolemy, the former was refusing to send out any work. This wasn't a network issue, nor was it a real lack-of-work issue. There was plenty of work in bruno's queue, and the feeder had it all stowed up in shared memory ready to go, but the scheduler, for no apparent reason, was allowing none of it through. Clients were requesting N seconds of work and bruno would send them 0 workunits. Clients requesting the same N seconds of work from ptolemy were getting work. This was weird and like nothing we've seen before. Of course, bruno and ptolemy have identical kernels, scheduler executables, apache configurations, database permissions, file server permissions, network routes, etc. etc. etc. Jeff and I have been beating our heads against this for basically all of last night and this morning, and we still have no idea. Jeff's adding some new debug code to the scheduler as I type.

We do have a workaround - just dump all the traffic on ptolemy until we figure it out. We may very well do this by the end of the day if the real problem doesn't present itself.

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.

By the way, Bob is taking over adding a "median" form of the result turnaround time query and determining if it will hit the database as hard as I feared. Cool.
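
For anyone wondering why a "median" version of the turnaround figure matters, here is a minimal sketch in Python. The numbers are made up for illustration and are not project data, and this is not the actual database query. It only shows that a single long-overdue result drags the mean far above what a typical host experiences, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical turnaround times in hours (illustrative values only,
# not taken from the SETI@home database). One result came back very late.
turnaround_hours = [20.0, 22.0, 25.0, 27.0, 30.0, 31.0, 400.0]

avg = mean(turnaround_hours)    # pulled upward by the 400-hour straggler
mid = median(turnaround_hours)  # middle value of the sorted list

print(f"mean:   {avg:.1f} h")
print(f"median: {mid:.1f} h")
```

The trade-off is that a mean can be accumulated in one pass over the rows, whereas a median needs the values ordered (or an equivalent selection step), which is presumably where the worry about database load comes from.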

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
Profile Phil Kline
Joined: 11 Jun 99
Posts: 6
Credit: 121,918
RAC: 0
Australia
Message 690936 - Posted: 12 Dec 2007, 22:53:59 UTC

Keep asking for work, get Message from Server: No work sent. You guys have got one heck of a problem there from the sound of it.

Best of luck,


Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 690939 - Posted: 12 Dec 2007, 23:11:21 UTC

Update: Jeff found the basic gist of the problem. Totally totally totally arcane and still a bit of a mystery to us. More explaining as we figure it out, but we have a band-aid solution in place for now. That pretty much killed an entire day.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
Profile Phil Kline
Joined: 11 Jun 99
Posts: 6
Credit: 121,918
RAC: 0
Australia
Message 690941 - Posted: 12 Dec 2007, 23:17:52 UTC

Back up again, just got one work unit. Great work, guys!!!!
Profile ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20265
Credit: 7,508,002
RAC: 20
United Kingdom
Message 690942 - Posted: 12 Dec 2007, 23:21:40 UTC - in response to Message 690939.  
Last modified: 12 Dec 2007, 23:22:28 UTC

Update: Jeff found the basic gist of the problem. Totally totally totally arcane ...

Good stuff and sounding intriguing...

Dare I make a wild guess: file-lock problems?

Good luck,

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 690944 - Posted: 12 Dec 2007, 23:30:12 UTC - in response to Message 690942.  
Last modified: 12 Dec 2007, 23:31:33 UTC

Dare I make a wild guess file-lock problems?


Good guess but wrong.

Another tease: a long-standing bug in the BOINC backend server code that only manifested itself just now and never before, and on only one system, all of which seems statistically impossible to me at this point.

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 690949 - Posted: 13 Dec 2007, 0:01:46 UTC - in response to Message 690944.  

Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it.


At least you found it. Sometimes I never find the bug, only work around it.
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 690960 - Posted: 13 Dec 2007, 0:24:55 UTC
Last modified: 13 Dec 2007, 0:27:25 UTC

. . . sneaky server eh - Nice Work Matt (and to Each of You @ Berkeley) Keep it up

ps - 'Do They Hurt' ;)
BOINC Wiki . . .

Science Status Page . . .
Profile ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20265
Credit: 7,508,002
RAC: 20
United Kingdom
Message 691083 - Posted: 13 Dec 2007, 8:42:35 UTC - in response to Message 690944.  
Last modified: 13 Dec 2007, 8:44:37 UTC

Well, that still leaves it at a 'wild guess' without a clue...

Wild guess #2: Something silly with the machine name or IP address, or the routing tables to that machine...?


What changed after/during your last shutdown for that to appear now?...


Happy bug squashing!

Cheers,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
Profile Ace Casino
Joined: 5 Feb 03
Posts: 285
Credit: 29,750,804
RAC: 15
United States
Message 691098 - Posted: 13 Dec 2007, 11:23:39 UTC

FYI:
There was an article in the “Washington Post” this past weekend titled “Are They Out There”. The article is about UFOs, but there are a few paragraphs mentioning SETI, the new Allen Telescope Array at Berkeley and its mission to find a radio signal.
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 691099 - Posted: 13 Dec 2007, 11:26:23 UTC - in response to Message 691098.  

FYI:
There was an article in the “Washington Post” this past weekend titled: “Are They Out There”. The article is about UFO’s, but there are a few paragraphs mentioning SETI, the new Allen Telescope Array at Berkeley and its mission to find a radio signal.

Hmmm.....a link to the article in NC might be in order......
"Freedom is just Chaos, with better lighting." Alan Dean Foster

Profile ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20265
Credit: 7,508,002
RAC: 20
United Kingdom
Message 691107 - Posted: 13 Dec 2007, 12:58:00 UTC - in response to Message 691083.  
Last modified: 13 Dec 2007, 12:59:32 UTC

And I have to clarify also ;-)

You had the Boinc clients trying to download WU data from the wrong server?...


Happy bug squashing!

Cheers,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 691118 - Posted: 13 Dec 2007, 14:10:21 UTC - in response to Message 691083.  

My wild guess, tongue planted firmly in my cheek... the version of libcurl compiled into the server code isn't playing friendly with a proxy [or proxy style] configuration somewhere in the line [The load sharing etc...maybe ]

Jason

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 691162 - Posted: 13 Dec 2007, 18:26:06 UTC - in response to Message 691118.  


My wild guess, tongue planted firmly in my cheek... the version of libcurl compiled into the server code isn't playing friendly with a proxy [or proxy style] configuration somewhere in the line [The load sharing etc...maybe ]


Maybe they should revert to version 4.45 or something? ;)
Profile Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 30639
Credit: 53,134,872
RAC: 32
United States
Message 691262 - Posted: 14 Dec 2007, 1:02:26 UTC - in response to Message 690923.  

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.
- Matt


Matt:

FYI, the mass e-mail was treated by AOL as spam and delivered to the spam box. You might want to talk to AOL's e-mail admins about not classing your outbound mail as spam, since everyone has asked for it. It might also help with fundraising if people actually get the e-mail :)


Profile kev1701e
Joined: 28 Dec 99
Posts: 138
Credit: 10,216,553
RAC: 0
United States
Message 691428 - Posted: 14 Dec 2007, 17:02:24 UTC - in response to Message 691262.  

It was spam to Yahoo as well

kev
Macroman1
Joined: 30 May 99
Posts: 67
Credit: 12,532,684
RAC: 0
United States
Message 691434 - Posted: 14 Dec 2007, 17:21:04 UTC - in response to Message 691428.  

Was marked as spam on my cox.net account too

"Gentlemen, there are only two types of naval vessels..........Submarines, and Targets" -- U.S. Navy Submarine SONAR Instructor.
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 691443 - Posted: 14 Dec 2007, 17:59:16 UTC - in response to Message 691434.  

Managed to miss Earthlink's SPAM filter.

Hello, from Albany, CA!...
Profile Ghery S. Pettit
Joined: 7 Nov 99
Posts: 325
Credit: 28,109,066
RAC: 82
United States
Message 691456 - Posted: 14 Dec 2007, 18:48:40 UTC

Wasn't marked as SPAM on my Comcast account (or by the IEEE e-mail alias server that saw it before Comcast).


Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 691516 - Posted: 14 Dec 2007, 23:55:02 UTC

Got through to my Yahoo account too without being filtered.

F.
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.