Stardust and Sand (Jun 23 2011)



Message boards : Technical News : Stardust and Sand (Jun 23 2011)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1384
Credit: 74,079
RAC: 0
United States
Message 1120614 - Posted: 23 Jun 2011, 21:35:31 UTC

Here's another catch-up tech news report. No big news, but more of the usual.

Last week we got beyond the annoying limits with the Astropulse database. There's still stuff to do "behind the scenes" but we are at least able to insert signals, and thus the assimilators are working again.

The upload server (bruno) keeps locking up. This is load related - it happens more often when we are maxed out, and of course we're pretty much maxed out all the time these days. We're thinking this may actually be a bad CPU. We'll swap it out and see if the problem goes away. Until then.. we randomly lose the ability to upload workunits and human intervention (to power cycle the machine locally or remotely) is required.

We've been moving back-end processes around. I mentioned before how we moved the assimilators to synergy as vader seemed overloaded. This was helpful. However, one thing we forgot about is that the assimilators have a memory leak. This is something that's been an issue forever - like since we were compiling/running this on Sun/Solaris systems - yet completely impossible to find and fix. But an easy band-aid is to have a cron job that restarts the assimilators every so often to clear the pipes. Well, oops, we didn't have that cron job on synergy and the system wedged over the weekend. That cron job is now in place. But still.. not sure why it's so easy for user processes to lock up a whole system to the point you can't even get a root prompt. There should always be enough resources to get a root prompt.
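
A variation on the cron-job band-aid, sketched in Python under stated assumptions: instead of restarting on a fixed schedule, a periodic check could restart a process only once its resident memory passes a limit. The function names, the limit, and the sample text below are hypothetical and not taken from the project's scripts; the only real interface assumed is the `VmRSS` line that Linux exposes in `/proc/<pid>/status`.

```python
import re


def parse_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def should_restart(status_text, limit_kb):
    """True when resident memory is known and has grown past the limit."""
    rss = parse_rss_kb(status_text)
    return rss is not None and rss > limit_kb


# Hypothetical status text for a leaking process (about 2 GB resident).
sample = "Name:\tassimilator\nVmRSS:\t  2048000 kB\nThreads:\t1\n"
print(should_restart(sample, limit_kb=1_000_000))
```

A cron entry would still be needed to run the check; the sketch only covers the decision, not the restart itself.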

The mysql replica continued to fall behind, so the easiest thing to try next was upgrading mysql from 5.1.x to 5.5 (which employs better parallelization, supposedly, and therefore better i/o in times of stress). However, Fedora Core 15 is the first version of Fedora to have mysql 5.5 in its rpm repositories. So I upgraded jocelyn to FC15.. only to find for some reason this version of Fedora cannot load the firmware/drivers for the old QLogic fibre channel card, and therefore can't see the data drives. I've been beating my head on this problem for days now to no avail. We could downgrade, but then we can't use mysql 5.5. I guess we could install mysql 5.5 ourselves instead of yumming it in, but that's given us major headaches in the past. This should all just work like it had in earlier versions of Fedora. Jeez.

Thanks for the kind words in the previous thread. Don't worry - I won't let it get to my head :).

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1120616 - Posted: 23 Jun 2011, 21:41:57 UTC - in response to Message 1120614.


If you end up needing a new CPU for Bruno let me/us know what type and we'll get you sent a replacement asap.
____________


Executive Director GPU Users Group Inc. -
brad@gpuug.org

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 3964
Credit: 31,882,705
RAC: 10,967
United Kingdom
Message 1120621 - Posted: 23 Jun 2011, 21:50:30 UTC - in response to Message 1120614.
Last modified: 23 Jun 2011, 21:50:44 UTC

Thanks for the update Matt, keep up all the good work,

Claggy

Profile Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 11732
Credit: 5,969,877
RAC: 0
United States
Message 1120625 - Posted: 23 Jun 2011, 22:02:01 UTC
Last modified: 23 Jun 2011, 22:06:03 UTC

Matt:

You should be able to get the root prompt, but you might not be able to launch (page in) ssh/login/bash to get any prompt. Not sure how you are configured, but you might need to leave a terminal logged in and set to above-normal priority. Obviously a security risk, so it needs to be behind a physically locked door.

As to that leak, not sure what debugging tools you have, but unless it is one of the POSIX designed-in leaks, you should be able to find and quash it. Perhaps a little personal development time reading up on the different available tools might find a new path to try. Worst case, you will find the right tool but it isn't available for Fedora. e.g. Malloc Debug http://www.manpagez.com/man/3/malloc/
____________

Profile Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 2789
Credit: 11,757,893
RAC: 535
Canada
Message 1120639 - Posted: 23 Jun 2011, 22:58:25 UTC - in response to Message 1120614.

thanks for the update Matt

Best Wishes
Byron

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13307
Credit: 27,902,179
RAC: 16,424
United States
Message 1120651 - Posted: 23 Jun 2011, 23:41:27 UTC - in response to Message 1120614.
Last modified: 23 Jun 2011, 23:47:31 UTC

Glad to see that your music hasn't completely taken you away from us... yet. Thanks for the update Matt!

Berserker
Volunteer tester
Joined: 2 Jun 99
Posts: 105
Credit: 5,386,463
RAC: 0
United Kingdom
Message 1120657 - Posted: 23 Jun 2011, 23:56:10 UTC

I've spent more than my fair share of time with malloc debug and various equivalents. It can work, but for non-trivial cases it can take like, forever (I've spent weeks on this sort of problem).

Hopefully there are some decent memory profilers for *nix. If so, can you dummy up a bucketload of either simulated or actual data and throw it at a testbed assimilator? Memory profiling should at least help you with where to look, if it doesn't give you the smoking gun.

That said, as it's all DB backed, unclosed queries/result sets would be a place to start.
____________
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.

Berserker
Volunteer tester
Joined: 2 Jun 99
Posts: 105
Credit: 5,386,463
RAC: 0
United Kingdom
Message 1120663 - Posted: 24 Jun 2011, 0:08:04 UTC

As for MySQL, I've rolled my own (I use Gentoo, ergo I have no choice), and have had no troubles (but then I don't use InnoDB, replicas or countless other features you do). The trick, as ever, is finding a 'good' version and then figuring out what arcane combination of configure options pushes the right buttons to make it have all the features you want, in the right order. Not sure if Fedora have 'volunteers' as such, but if they do, maybe one of them could help.
____________
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.

bill
Joined: 16 Jun 99
Posts: 848
Credit: 20,620,633
RAC: 15,596
United States
Message 1120664 - Posted: 24 Jun 2011, 0:08:08 UTC

Thanks for the update Matt. Much appreciated.

Profile Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 154,562
RAC: 714
United States
Message 1120711 - Posted: 24 Jun 2011, 3:44:29 UTC

Thanks for the update Matt. Hope things start getting better for everyone there in the lab. Thanks for your hard work and dedication to the project. Good luck with the music.... Play a few songs for me. ;)

Profile Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 22425
Credit: 29,409,617
RAC: 25,866
Germany
Message 1120762 - Posted: 24 Jun 2011, 6:28:55 UTC

Thanks for the update Matt.

____________

rob smith
Volunteer moderator
Joined: 7 Mar 03
Posts: 7681
Credit: 44,999,875
RAC: 76,826
United Kingdom
Message 1120819 - Posted: 24 Jun 2011, 10:36:54 UTC

I assume that when Matt is silent either things are going according to plan, so there is nothing really to report, or things are going so badly that he hasn't got time to report. I hope that in the next few weeks it is the former that dominates, and that his plans to divert more time to his music are well fulfilled, without too much interruption from the lab.





(off topic - Matt what's the guitar in your sig, and how's your young, feline, apprentice doing, looks as if it could be a mean picker.....)
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 527,839
RAC: 70
United States
Message 1120895 - Posted: 24 Jun 2011, 14:56:17 UTC - in response to Message 1120614.

If the driver is still part of the Linux kernel source code, you could just compile a custom kernel as part of your Fedora installation. Copy the /boot/config-2.6.xx file to /usr/src/kernel/2.6.xx/.config before running make menuconfig.
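
Before rebuilding, it may help to check whether the copied config enables the driver at all. A minimal Python sketch, assuming the card uses the mainline qla2xxx driver, whose config symbol is `CONFIG_SCSI_QLA_FC`; whether that is the right driver for this particular card is an assumption, and the function name and sample text are hypothetical.

```python
def driver_setting(config_text, symbol="CONFIG_SCSI_QLA_FC"):
    """Return 'y' or 'm' if the symbol is enabled in .config text, else None."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(symbol + "="):
            return line.split("=", 1)[1]
        if line == f"# {symbol} is not set":
            return None
    return None  # symbol not mentioned at all


# Hypothetical excerpt in the kernel's .config format.
sample = "CONFIG_SCSI=y\n# CONFIG_SCSI_QLA_FC is not set\nCONFIG_EXT4_FS=m\n"
print(driver_setting(sample))
```

A result of `None` would mean the driver has to be switched on in menuconfig; firmware files are loaded separately and are not covered by this check.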

Profile Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 11732
Credit: 5,969,877
RAC: 0
United States
Message 1120944 - Posted: 24 Jun 2011, 16:53:50 UTC - in response to Message 1120657.

If the event causing the leak happens infrequently, finding it in the mounds and mounds of output can take forever. Hence finding a tool that reduces output when all memory is reachable makes the task perhaps possible. There will be a delay in the output, and going backwards to find the issue is another matter. If the issue is library calls that leak - there are some - then the problem may be intractable. If enabling debugging makes the process too slow, that is another issue. But if you can find out what the leaking block is used for, then you can design in debugging to find where it goes missing, if a read-through doesn't tell you.
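
One way to picture "designed-in" debugging, as a hedged Python sketch rather than anything from the assimilator code: tag every block with its purpose when it is handed out, drop the tag on release, and report only what is still outstanding, so a clean run prints nothing. The class name and the purposes are hypothetical.

```python
from collections import Counter
import itertools


class BlockTracker:
    """Records which purposes still hold blocks that were never released."""

    def __init__(self):
        self._live = {}                 # block id -> purpose
        self._ids = itertools.count(1)  # simple id generator

    def acquire(self, purpose):
        block_id = next(self._ids)
        self._live[block_id] = purpose
        return block_id

    def release(self, block_id):
        self._live.pop(block_id, None)

    def report(self):
        """Count outstanding blocks per purpose; empty when nothing leaked."""
        return Counter(self._live.values())


tracker = BlockTracker()
a = tracker.acquire("query buffer")
b = tracker.acquire("signal row")
c = tracker.acquire("signal row")  # never released below
tracker.release(a)
tracker.release(b)
print(dict(tracker.report()))
```

Because the report is keyed by purpose, it shows what kind of block goes missing without logging every allocation.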
____________

Profile Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 2789
Credit: 11,757,893
RAC: 535
Canada
Message 1120954 - Posted: 24 Jun 2011, 17:00:28 UTC - in response to Message 1120614.

Matt, thank you and the rest of the SETI@home crew for all your hard work.

Best Wishes
Byron

Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 5697
Credit: 564,577
RAC: 619
United States
Message 1120962 - Posted: 24 Jun 2011, 17:06:18 UTC - in response to Message 1120614.
Last modified: 24 Jun 2011, 17:06:45 UTC

Thanks for the kind words in the previous thread. Don't worry - I won't let it get to my head :).

- Matt

Matt, that is the least of our worries. (8{)
____________
Donald
Infernal Optimist / Submariner, retired

Profile Chris S
Volunteer tester
Joined: 19 Nov 00
Posts: 29571
Credit: 9,013,120
RAC: 28,332
United Kingdom
Message 1121034 - Posted: 24 Jun 2011, 18:42:31 UTC

Always appreciated to be able to read your updates. Thanks.

Profile Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 2789
Credit: 11,757,893
RAC: 535
Canada
Message 1121246 - Posted: 25 Jun 2011, 1:17:00 UTC

Matt thank you and thanks to the rest of the SETI@home crew for all your hard work.

you guys are the best

Best Wishes
Byron

Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1831
Credit: 7,566,077
RAC: 19,978
United States
Message 1123418 - Posted: 1 Jul 2011, 16:10:12 UTC

Query: how come we're (or, at least I'm) able to upload, even though the upload server shows as "disabled" on the "Server Status" page?
____________
.

DrFoo
Joined: 17 Jul 99
Posts: 25
Credit: 20,611,508
RAC: 22,028
United States
Message 1123443 - Posted: 1 Jul 2011, 17:32:28 UTC

Matt:

Something I ran across recently that is probably right on target for you guys (and a lot of others). If you want the stability of RHEL/CentOS AND the latest key software packages (like MySQL 5.5), I don't know of a better way to go. It's sponsored by RackSpace who obviously knows something about this sort of thing and has a vested interest in making it all work.

http://iuscommunity.org/About

____________


Copyright © 2014 University of California