Noddy Goes to Sweden (Dec 12 2007)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Blech. The fallout from yesterday's business wasn't very pretty. The science database server had a migraine all night due to the load-intensive index build and subsequent mounting errors due to heavy disk i/o. So the assimilators were off until this morning after we rebooted the system and cleared its pipes. However, towards the end of the day yesterday I spotted something funny. Of two scheduling servers, bruno and ptolemy, the former was refusing to send out any work. This wasn't a network issue, nor was it a real lack-of-work issue. There was plenty of work in bruno's queue, and the feeder had it all stowed up in shared memory ready to go, but the scheduler for no apparent reason was allowing none of it through. Clients were requesting N seconds of work and bruno would send it 0 workunits. The clients requesting the same N seconds of work on ptolemy were getting work. This was weird and nothing like we've seen before. Of course, bruno and ptolemy have identical kernels, scheduler executables, apache configurations, database permissions, file server permissions, network routes, etc. etc. etc. Jeff and I have been beating our heads on this for basically all last night and this morning and we still have no idea. Jeff's adding some new debug code to the scheduler as I type. We do have a workaround - just dump all the traffic on ptolemy until we figure it out. We may very well do this by the end of the day if the real problem doesn't present itself. Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now. By the way, Bob is taking over adding a "median" form of the result turnaround time query and determining if it will hit the database as hard as I feared. Cool. 
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Phil Kline Joined: 11 Jun 99 Posts: 6 Credit: 121,918 RAC: 0 |
Keep asking for work, get Message from Server: No work sent. You guys have got one heck of a problem there from the sound of it. Best of luck, |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Update: Jeff found the basic gist of the problem. Totally totally totally arcane and still a bit of a mystery to us. More explaining as we figure it out but we have a band aid solution in place for now. That pretty much killed an entire day. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Phil Kline Joined: 11 Jun 99 Posts: 6 Credit: 121,918 RAC: 0 |
Back up again, just got one work unit. Great work, guys!!!! |
ML1 Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20 |
Update: Jeff found the basic gist of the problem. Totally totally totally arcane ... Good stuff and sounding intriguing... Dare I make a wild guess file-lock problems? Good luck, Regards, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Dare I make a wild guess file-lock problems? Good guess but wrong. Another tease: a long-standing bug in the BOINC backend server code that only manifested itself just now and never before, and on only one system, all of which seems statistically impossible to me at this point. Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Clarification (I always have to clarify): not a bug in the BOINC code as much as our (SETI@home's) faulty implementation of it. At least you found it. Sometimes I never find the bug, only work around it. |
Dr. C.E.T.I. Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
. . . sneaky server eh - Nice Work Matt (and to Each of You @ Berkeley) Keep it up ps - 'Do They Hurt' ;) BOINC Wiki . . . Science Status Page . . . |
ML1 Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20 |
Dare I make a wild guess file-lock problems? Well, that still leaves it at a 'wild guess' without a clue... Wild guess #2: Something silly with the machine name or IP address, or the routing tables to that machine...? What changed after/during your last shutdown for that to appear now?... Happy bug squashing! Cheers, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
Ace Casino Joined: 5 Feb 03 Posts: 285 Credit: 29,750,804 RAC: 15 |
FYI: There was an article in the “Washington Post” this past weekend titled: “Are They Out There”. The article is about UFO’s, but there are a few paragraphs mentioning SETI, the new Allen Telescope Array at Berkeley and its mission to find a radio signal. |
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
FYI: Hmmm.....a link to the article in NC might be in order...... "Freedom is just Chaos, with better lighting." Alan Dean Foster |
ML1 Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20 |
Dare I make a wild guess file-lock problems? And I have to clarify also ;-) You had the Boinc clients trying to download WU data from the wrong server?... Happy bug squashing! Cheers, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Dare I make a wild guess file-lock problems? My wild guess, tongue planted firmly in my cheek... the version of libcurl compiled into the server code isn't playing friendly with a proxy [or proxy style] configuration somewhere in the line [The load sharing etc...maybe ] Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Brian Silvers Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0 |
Maybe they should revert to version 4.45 or something? ;) |
Gary Charpentier Joined: 25 Dec 00 Posts: 30648 Credit: 53,134,872 RAC: 32 |
Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now. Matt: FYI the mass e-mail was treated by AOL as SPAM and delivered to the spam box. You might want to talk to AOL's e-mail admins to have your outbound mail not classed as spam as everyone has asked for it. Might also help with fundraising if people actually get the e-mail :) |
kev1701e Joined: 28 Dec 99 Posts: 138 Credit: 10,216,553 RAC: 0 |
Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now. It was spam to Yahoo as well kev |
Macroman1 Joined: 30 May 99 Posts: 67 Credit: 12,532,684 RAC: 0 |
Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now. Was marked as spam on my cox.net account too "Gentlemen, there are only two types of naval vessels..........Submarines, and Targets" -- U.S. Navy Submarine SONAR Instructor. |
KWSN THE Holy Hand Grenade! Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now. Managed to miss Earthlink's SPAM filter. . Hello, from Albany, CA!... |
Ghery S. Pettit Joined: 7 Nov 99 Posts: 325 Credit: 28,109,066 RAC: 82 |
Wasn't marked as SPAM on my Comcast account (or by the IEEE e-mail alias server that saw it before Comcast). |
Fred W Joined: 13 Jun 99 Posts: 2524 Credit: 11,954,210 RAC: 0 |
Got through to my Yahoo account too without being filtered. F. |