The Story of the Hare who Lost his Spectacles (Dec 13 2007)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 691189 - Posted: 13 Dec 2007, 20:50:46 UTC

Roll up your sleeves, get the coffee brewing, etc.

So yesterday's "bug" hasn't been 100% solved yet, but there is a workaround in place. Here are the details (continued from yesterday's spiel): We have two redundant schedulers on bruno/ptolemy, both running the exact same executable (mounted from the same NAS, no less), on the exact same Linux OS/kernel. One was sending work, the other was not. By "not" I mean there was work available, but something was causing the scheduler processes on bruno to wrongly think that the work wasn't suitable for sending out.

Since this was all old, stable code, running on identical servers, this naturally pointed to some kind of broken network plumbing on bruno at first. A large part of the day was spent tracking this down. We checked everything: ifconfigs, MTU sizes, DNS records, router settings, routing tables, apache configurations, everything. We rebooted switches and servers to no avail. We had no choice but to begin questioning the actual code that has been working for months and happens to still be working perfectly on ptolemy.

Jeff attached a debugger to the many scheduler CGI processes and eventually spotted something odd. Why was the scheduler tagging the ready-to-send results in shared memory (which is filled by the feeder) as "beta" results? We looked on ptolemy. They were not tagged as "beta" there. A clue!

Scheduler code was pored through and digested and it was determined this was indeed the heart of the problem - results tagged as "beta" were not to be sent out to regular clients asking for non-beta work. So bruno refused to send any of these results out - it was erroneously thinking these were all "beta" results. But why?!

After countless fprintf's were added to the scheduler code we found this actually wasn't the scheduler's fault - it was the feeder! The feeder is a relatively simple part of the back end which keeps a buffer of ready results to send out in shared memory for the hundreds of scheduler processes to pick and choose from. The scheduler plucks results from the array, creating an empty slot which the feeder fills up again. When the feeder first starts up it reads the application info from the database to determine which application is "current" and then gets the pertinent information about the application, including whether or not it is "beta." This information is then tied to the ready-to-send results as they are pulled from the database. We found that even though beta was "0" in the database, it was being set to "1" after that particular row was read into memory.
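
For readers who want to picture the feeder/scheduler handoff described above, here is a minimal sketch in C. It is not the BOINC code: the names `Slot`, `feeder_fill`, `scheduler_take`, and `NSLOTS` are hypothetical, and a plain array stands in for the shared memory segment. It only shows the idea that the feeder stamps each slot with the application's beta flag, and the scheduler skips any slot whose flag doesn't match what the client asked for.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 4  /* hypothetical size; the real buffer size is not given here */

/* Hypothetical slot layout: one ready-to-send result plus the
   application flag the feeder attaches when it loads the result. */
typedef struct {
    int present;    /* 1 if the slot holds a result, 0 if it is empty */
    int result_id;  /* identifier of the ready-to-send result */
    int beta;       /* copied from the application record by the feeder */
} Slot;

/* Feeder side: fill every empty slot with the next result id and the
   application's beta flag. Returns the number of slots filled. */
size_t feeder_fill(Slot *slots, size_t n, int *next_id, int app_beta)
{
    size_t filled = 0;
    assert(slots != NULL && next_id != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!slots[i].present) {
            slots[i].present = 1;
            slots[i].result_id = (*next_id)++;
            slots[i].beta = app_beta;
            filled++;
        }
    }
    return filled;
}

/* Scheduler side: take the first result whose beta flag matches what the
   client asked for. Returns the result id, or -1 if nothing is suitable. */
int scheduler_take(Slot *slots, size_t n, int client_wants_beta)
{
    assert(slots != NULL);
    for (size_t i = 0; i < n; i++) {
        if (slots[i].present && slots[i].beta == client_wants_beta) {
            slots[i].present = 0;  /* leave an empty slot for the feeder */
            return slots[i].result_id;
        }
    }
    return -1;
}
```

If the feeder stamps every slot with beta set to 1, `scheduler_take` returns -1 for every regular client even though the buffer is full, which matches the symptom seen on bruno.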

Was this a database connection problem then? We checked. Both bruno and ptolemy were connecting to the same database and getting at the same rows with the same values, so no. However, during this exercise we noted that the C struct in the BOINC db code for the application had an extra field "weight" and of course this was the penultimate field, just before the final field "beta." What does that mean? Well, when filling this struct with a stream coming from MySQL, whatever value MySQL thinks is "beta" will be put in the struct as "weight" and whatever random data (on disks or in memory) lies beyond that would be put in the struct as "beta." This has been the case for months, if not years (?!) but since these fields are never used by us (our beta project is basically a "real" project that's completely separate from the public project so its beta value is "0" as well), this never was an issue. We were fine as long as beta happened to be set to "0" (correctly or incorrectly) which it always had been...

...until JUST NOW! And only on bruno! This seems statistically impossible without any good explanation, but before getting lost down that road we put in a one-line hack which forces beta to be "0" no matter what bogus values get put in the oversized C struct, and immediately bruno was back in business. Until we get the whole gang in the lab at the same time and we can answer the final questions and confirm the appropriate fixes, it will remain this way.
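
To make the off-by-one-field problem concrete, here is a small sketch in C. It is an illustration only and assumes a lot: the real BOINC struct and its MySQL parsing code are not shown in this post, so `AppRecord`, `parse_row`, and `force_non_beta` are hypothetical names, and an `int` array stands in for the row data. The array is deliberately one cell longer than the table row so the "data beyond the row" is an explicit value rather than a real out-of-bounds read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory struct: one field more than the table has. */
typedef struct {
    int id;
    int weight;  /* extra field with no matching column in the table */
    int beta;
} AppRecord;

#define TABLE_COLUMNS 2  /* hypothetical table: id, beta */
#define STRUCT_FIELDS 3  /* id, weight, beta */

/* Fill the struct positionally, one value per struct field, the way a
   naive row parser would. `cells` must hold at least STRUCT_FIELDS
   values; anything past TABLE_COLUMNS is not row data at all. */
void parse_row(const int *cells, size_t ncells, AppRecord *app)
{
    assert(cells != NULL && app != NULL && ncells >= STRUCT_FIELDS);
    app->id     = cells[0];
    app->weight = cells[1];  /* actually the table's beta column */
    app->beta   = cells[2];  /* past the end of the row: leftover data */
}

/* A workaround in the spirit of the one-line hack: ignore whatever was
   parsed into beta and force it to 0. */
void force_non_beta(AppRecord *app)
{
    assert(app != NULL);
    app->beta = 0;
}
```

With a row of `{7, 0}` followed by a leftover 0, the struct ends up with beta 0 and nobody notices. With the same row followed by a leftover 1, beta becomes 1 even though the database says 0. Calling `force_non_beta` afterwards hides the symptom either way, but the struct and the table still disagree.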

Now back to some actual programming (helping Jeff wrap up work on radar blanking code).

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 691205 - Posted: 13 Dec 2007, 21:55:33 UTC - in response to Message 691189.  

Matt, I salute you - an explanation even I can understand. No mean feat.

Thank you.

F.

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 691222 - Posted: 13 Dec 2007, 22:46:31 UTC

I have no idea what that said but I'm glad you found it and hope you find a permanent fix. :)


PROUD MEMBER OF Team Starfire World BOINC

ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20144
Credit: 7,508,002
RAC: 20
United Kingdom
Message 691239 - Posted: 13 Dec 2007, 23:09:03 UTC - in response to Message 691189.  

Roll up your sleeves, get the coffee brewing, etc.

So yesterday...

Now back to some actual programming (helping Jeff wrap up work on radar blanking code).

Good find and a great story. And all too easy a programming foible to blunder into...

Different word size or different word order on those two machines?

Regards,
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 691247 - Posted: 14 Dec 2007, 0:07:01 UTC
Last modified: 14 Dec 2007, 0:19:53 UTC

Thank You again Berkeley & Matt, Good Post Sir! Nice work all-around i might add . . .

excerpted: for the title of this Thread - Ian's always been one of mi fav's


All this time, it had been quite plain to hare that the others knew
nothing about spectacles.

As for all their tempting ideas, well Hare didn't care.
The lost spectacles were his own affair.

And after all, Hare did have a spare a-pair.

A-pair.

BOINC Wiki . . .

Science Status Page . . .

Pooh Bear 27
Volunteer tester
Joined: 14 Jul 03
Posts: 3224
Credit: 4,603,826
RAC: 0
United States
Message 691248 - Posted: 14 Dec 2007, 0:08:00 UTC

My suspicion is that, when it's found, the culprit will turn out to be not the code but a rogue RAM stick.

Odysseus
Volunteer tester
Joined: 26 Jul 99
Posts: 1808
Credit: 6,701,347
RAC: 6
Canada
Message 691291 - Posted: 14 Dec 2007, 4:18:47 UTC
Last modified: 14 Dec 2007, 4:21:01 UTC

Without his spectacles he was completely helpless. Where were his spectacles? Could someone have stolen them? Had he mislaid them? What was he to do?

Bee wanted to help, and thinking he had the answer, began: “You probably ate them, thinking they were a carrot.”

“No!” interrupted Owl, who was wise. “I have good eyesight, insight, and foresight. How could an intelligent hare make such a silly mistake?”


Jethro Tull: A Passion Play, 1973. There’s some wonderful, if rather silly, wordplay in this strange little interlude … IMO the album is underappreciated; Thick as a Brick cast a long shadow.

Jim Wilkins
Volunteer tester
Joined: 11 Oct 99
Posts: 70
Credit: 1,658,376
RAC: 0
United States
Message 691433 - Posted: 14 Dec 2007, 17:16:59 UTC - in response to Message 691189.  


Makes me long for the days of FORTRAN, the only language I truly understood.

Jim
(not a software developer, too hard!)

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 691446 - Posted: 14 Dec 2007, 18:04:56 UTC

Matt, check the total length of the structs. If one is off by a byte that may be the bug you're looking for... (or if some field is misplaced by a byte...)

It's happened to me! (in COBOL) and was very frustrating to deal with!
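
One way to act on this suggestion in C is to have the compiler compare the struct against the expected column count. This is only a sketch under stated assumptions: `Record`, `RECORD_TABLE_COLUMNS`, and `beta_field_index` are made-up names, not anything from the BOINC sources, and the size trick works here only because every field is an `int`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record whose fields are all ints, so its size is an
   exact multiple of sizeof(int) on common platforms. */
typedef struct {
    int id;
    int weight;
    int beta;
} Record;

#define RECORD_TABLE_COLUMNS 3  /* hypothetical: columns the parser expects */

/* Field count derived from the struct size. */
#define RECORD_FIELDS (sizeof(Record) / sizeof(int))

/* C11: refuse to compile if the struct and the column count drift apart.
   Changing RECORD_TABLE_COLUMNS to 2 makes this line fail at build time. */
static_assert(RECORD_FIELDS == RECORD_TABLE_COLUMNS,
              "Record fields do not match table columns");

/* Position of the beta field, counted in int-sized fields from the start. */
size_t beta_field_index(void)
{
    return offsetof(Record, beta) / sizeof(int);
}
```

A check like this would not say which field is the extra one, but it turns a silent mismatch into a build error.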

Hello, from Albany, CA!...

Ghery S. Pettit
Joined: 7 Nov 99
Posts: 325
Credit: 28,109,066
RAC: 82
United States
Message 691455 - Posted: 14 Dec 2007, 18:46:10 UTC - in response to Message 691291.  


Jethro Tull: A Passion Play, 1973. There’s some wonderful, if rather silly, wordplay in this strange little interlude … IMO the album is underappreciated; Thick as a Brick cast a long shadow.


Can't say that I'd heard of it. Thick as a Brick, on the other hand, is still in my record collection.


nick
Volunteer tester
Joined: 22 Jul 05
Posts: 284
Credit: 3,902,174
RAC: 0
United States
Message 691468 - Posted: 14 Dec 2007, 19:51:52 UTC

How many WUs does the feeder keep on the scheduler? And can you try to make something that tells us how many WUs per second are going out, or over some other amount of time? I think that it would be a great debugging aid, and it would be interesting for the rest of us.

thanks
nick


Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 691498 - Posted: 14 Dec 2007, 22:38:27 UTC

I'm sure this is the wrong place to post this but:


I just noticed on one of my computers a work unit that has been running for 65 hours and has clocked 0% progress.

What should I do? or how does this happen? or should I notify someone so that they may look at the wu, etc????
Dave

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 691518 - Posted: 14 Dec 2007, 23:58:20 UTC - in response to Message 691498.  

Could you post a link to the WU? It may be a bad one that someone may recognise and advise you to abort it. But it would be easier if we could look at it first.

F.

Richard Haselgrove
Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 691523 - Posted: 15 Dec 2007, 0:16:28 UTC - in response to Message 691498.  

It's also a good idea to exit BOINC (full exit, not just minimise to system tray) and restart it, then tell us what happens at the restart.

Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 691548 - Posted: 15 Dec 2007, 2:05:59 UTC

The WU is http://setiathome.berkeley.edu/workunit.php?wuid=166639066

Looks like I am not the only one who had trouble with this WU.

DRS
Dave

Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 691551 - Posted: 15 Dec 2007, 2:09:01 UTC

Upon full restart, the work unit resumes counting time, but not progress.


DRS
Dave

kittyman
Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 691566 - Posted: 15 Dec 2007, 2:49:16 UTC - in response to Message 691551.  

Upon full restart,

The work unit restarts ticking time but, not progress.


DRS

Dave...please post this in Number Crunching and not in the news thread.
You might get more help there as well.
Regards,
Mark.

"Freedom is just Chaos, with better lighting." Alan Dean Foster

DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 691627 - Posted: 15 Dec 2007, 3:59:13 UTC - in response to Message 691189.  

Somebody SCREAM!!! That was a hell of a stereotypical computer programmer one-line fix. I applaud Jeff's and your determination.

Murphy's Law perhaps that the simplest part breaks?

PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 692147 - Posted: 17 Dec 2007, 2:10:22 UTC

Fix or not, can somebody give a three finger salute to the stats server so that ScareCrow's graphs can be updated?

kittyman
Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 692528 - Posted: 18 Dec 2007, 8:53:05 UTC - in response to Message 692147.  

Fix or not, can somebody give a three finger salute to the stats server so that ScareCrow's graphs can be updated?

Updating normally now......
"Freedom is just Chaos, with better lighting." Alan Dean Foster
