Stardust and Sand (Jun 23 2011)



Message boards : Technical News : Stardust and Sand (Jun 23 2011)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1120614 - Posted: 23 Jun 2011, 21:35:31 UTC

Here's another catch-up tech news report. No big news, but more of the usual.

Last week we got beyond the annoying limits with the Astropulse database. There's still stuff to do "behind the scenes" but we are at least able to insert signals, and thus the assimilators are working again.

The upload server (bruno) keeps locking up. This is load related - it happens more often when we are maxed out, and of course we're pretty much maxed out all the time these days. We're thinking this may actually be a bad CPU. We'll swap it out and see if the problem goes away. Until then.. we randomly lose the ability to upload workunits and human intervention (to power cycle the machine locally or remotely) is required.

We've been moving back-end processes around. I mentioned before how we moved the assimilators to synergy as vader seemed overloaded. This was helpful. However, one thing we forgot about is that the assimilators have a memory leak. This is something that's been an issue forever - like since we were compiling/running this on Sun/Solaris systems - yet completely impossible to find and fix. But an easy band-aid is to have a cron job that restarts the assimilators every so often to clear the pipes. Well, oops, we didn't have that cron job on synergy and the system wedged over the weekend. That cron job is now in place. But still.. not sure why it's so easy for user processes to lock up a whole system to the point you can't even get a root prompt. There should always be enough resources to get a root prompt.
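
A rough sketch of the kind of check a restart cron job could run, assuming a Linux host where /proc/<pid>/status exposes a VmRSS line. The function names, the 2 GB limit, and the idea of restarting on a memory threshold rather than on a fixed schedule are all assumptions for illustration; the post only says the assimilators get restarted every so often.

```python
import re


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def needs_restart(status_text, limit_kb=2 * 1024 * 1024):
    """True when resident memory exceeds limit_kb (hypothetical 2 GB default)."""
    rss_kb = parse_vm_rss_kb(status_text)
    return rss_kb is not None and rss_kb > limit_kb


def read_status(pid):
    """Read the status file for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return handle.read()
```

A cron-driven wrapper would call needs_restart(read_status(pid)) for each assimilator and restart the ones that return True. How the restart itself is done is left out, since the thread doesn't describe it.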

The mysql replica continued to fall behind, so the easiest thing to try next was upgrading mysql from 5.1.x to 5.5 (which employs better parallelization, supposedly, and therefore better i/o in times of stress). However, Fedora Core 15 is the first version of Fedora to have mysql 5.5 in its rpm repositories. So I upgraded jocelyn to FC15.. only to find that for some reason this version of Fedora cannot load the firmware/drivers for the old QLogic fibre channel card, and therefore can't see the data drives. I've been beating my head on this problem for days now to no avail. We could downgrade, but then we can't use mysql 5.5. I guess we could install mysql 5.5 ourselves instead of yumming it in, but that's given us major headaches in the past. This should all just work as it did in earlier versions of Fedora. Jeez.
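
For anyone wanting to watch whether a replica is catching up: MySQL reports a Seconds_Behind_Master field in the output of SHOW SLAVE STATUS. Below is a minimal sketch that pulls that number out of the vertical (\G) output. The function name and the sample text are made up for illustration, and nothing here reflects how the project actually monitors jocelyn.

```python
def seconds_behind_master(status_text):
    """Return replication lag in seconds, or None if unknown or not reported.

    status_text is the text printed by SHOW SLAVE STATUS in vertical format.
    MySQL prints NULL when the lag cannot be determined.
    """
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Seconds_Behind_Master":
            value = value.strip()
            return None if value == "NULL" else int(value)
    return None


# Hypothetical sample, trimmed to the relevant fields.
SAMPLE = """\
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 5400
"""
```

Sampling this every few minutes and logging the value would show whether the lag is shrinking or growing under load.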

Thanks for the kind words in the previous thread. Don't worry - I won't let it get to my head :).

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1120616 - Posted: 23 Jun 2011, 21:41:57 UTC - in response to Message 1120614.


If you end up needing a new CPU for Bruno let me/us know what type and we'll get you sent a replacement asap.
____________


Executive Director GPU Users Group Inc. -
brad@gpuug.org

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4141
Credit: 33,587,328
RAC: 26,811
United Kingdom
Message 1120621 - Posted: 23 Jun 2011, 21:50:30 UTC - in response to Message 1120614.
Last modified: 23 Jun 2011, 21:50:44 UTC

Thanks for the update Matt, keep up all the good work,

Claggy

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12729
Credit: 7,264,687
RAC: 17,530
United States
Message 1120625 - Posted: 23 Jun 2011, 22:02:01 UTC
Last modified: 23 Jun 2011, 22:06:03 UTC

Matt:

You should be able to get the root prompt, but you might not be able to launch (page in) ssh/login/bash to get any prompt. Not sure how you are configured, but you might need to leave a terminal logged in and set to above-normal priority. Obviously a security risk, so it needs to be behind a physically locked door.

As to that leak, not sure what debugging tools you have, but unless it is one of the POSIX designed-in leaks, you should be able to find and quash it. Perhaps a little personal development time reading up on the different available tools might find a new path to try. Worse, you may find the right tool but discover it isn't available for Fedora, e.g. Malloc Debug http://www.manpagez.com/man/3/malloc/
____________

Byron Leigh Hatch @ team Carl Sagan
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 3619
Credit: 11,909,794
RAC: 1,093
Canada
Message 1120639 - Posted: 23 Jun 2011, 22:58:25 UTC - in response to Message 1120614.

thanks for the update Matt

Best Wishes
Byron

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13625
Credit: 31,040,245
RAC: 20,867
United States
Message 1120651 - Posted: 23 Jun 2011, 23:41:27 UTC - in response to Message 1120614.
Last modified: 23 Jun 2011, 23:47:31 UTC

Glad to see that your music hasn't completely taken you away from us... yet. Thanks for the update Matt!

Berserker
Volunteer tester
Joined: 2 Jun 99
Posts: 105
Credit: 5,386,463
RAC: 0
United Kingdom
Message 1120657 - Posted: 23 Jun 2011, 23:56:10 UTC

I've spent more than my fair share of time with malloc debug and various equivalents. It can work, but for non-trivial cases it can take like, forever (I've spent weeks on this sort of problem).

Hopefully there are some decent memory profilers for *nix. If so, can you dummy up a bucketload of either simulated or actual data and throw it at a testbed assimilator? Memory profiling should at least help you with where to look, if it doesn't give you the smoking gun.

That said, as it's all DB backed, unclosed queries/result sets would be a place to start.
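
To illustrate the "feed it simulated data and see what grows" idea, here is a small sketch of the snapshot-and-diff technique. It uses Python's tracemalloc purely as a stand-in: the assimilators are presumably native code, so a tool for compiled programs (Valgrind, for instance) would be needed there. The leaky work function and all names below are hypothetical.

```python
import tracemalloc

_never_freed = []  # stands in for state the program forgets to release


def process_batch(count):
    """Hypothetical work function that leaks one 1 kB buffer per record."""
    for _ in range(count):
        _never_freed.append(bytearray(1024))


def top_growth(work, rounds=3, batch=1000, limit=3):
    """Run work repeatedly and return the source lines whose memory grew most."""
    tracemalloc.start()
    work(batch)  # warm-up so one-time allocations are not counted
    before = tracemalloc.take_snapshot()
    for _ in range(rounds):
        work(batch)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to sorts by absolute size difference, largest first
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in top_growth(process_batch):
        print(stat)
```

The point of diffing two snapshots, rather than reading one big dump, is that steady-state allocations cancel out and only the growth is left to look at.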
____________
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.

Berserker
Volunteer tester
Joined: 2 Jun 99
Posts: 105
Credit: 5,386,463
RAC: 0
United Kingdom
Message 1120663 - Posted: 24 Jun 2011, 0:08:04 UTC

As for MySQL, I've rolled my own (I use Gentoo, ergo I have no choice), and have had no troubles (but then I don't use InnoDB, replicas or countless other features you do). The trick, as ever, is finding a 'good' version and then figuring out what arcane combination of configure options pushes the right buttons to make it have all the features you want, in the right order. Not sure if Fedora have 'volunteers' as such, but if they do, maybe one of them could help.
____________
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.

bill
Joined: 16 Jun 99
Posts: 861
Credit: 23,975,287
RAC: 13,839
United States
Message 1120664 - Posted: 24 Jun 2011, 0:08:08 UTC

Thanks for the update Matt. Much appreciated.

Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1120711 - Posted: 24 Jun 2011, 3:44:29 UTC

Thanks for the update Matt. Hope things start getting better for everyone there in the lab. Thanks for your hard work and dedication to the project. Good luck with the music.... Play a few songs for me. ;)

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 24530
Credit: 33,858,664
RAC: 23,594
Germany
Message 1120762 - Posted: 24 Jun 2011, 6:28:55 UTC

Thanks for the update Matt.

____________

rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8535
Credit: 59,355,914
RAC: 83,918
United Kingdom
Message 1120819 - Posted: 24 Jun 2011, 10:36:54 UTC

I assume that when Matt is silent either things are going according to plan, so there is nothing really to report, or things are going so badly that he hasn't got time to report. I hope that in the next few weeks it is the former that dominates, and that his plans to divert more time to his music are well fulfilled, without too much interruption from the lab.





(off topic - Matt what's the guitar in your sig, and how's your young, feline, apprentice doing, looks as if it could be a mean picker.....)
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

DJStarfox
Joined: 23 May 01
Posts: 1045
Credit: 560,168
RAC: 442
United States
Message 1120895 - Posted: 24 Jun 2011, 14:56:17 UTC - in response to Message 1120614.

If the driver is still part of the Linux kernel source code, you could just compile a custom kernel as part of your Fedora installation. Copy the /boot/config-2.6.xx file into /usr/src/kernel/2.6.xx/.config before running make menuconfig.
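
Before rebuilding anything, it may be worth checking whether the shipped kernel config has the driver enabled at all. A minimal sketch follows, assuming the card uses the in-kernel qla2xxx driver, whose config symbol is CONFIG_SCSI_QLA_FC. That is an assumption, since the thread never names the driver, and the function name and sample text are invented for illustration.

```python
def kernel_option_state(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...) or None.

    None covers both '# OPTION is not set' and an option that is absent.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# Hypothetical excerpt of a /boot/config-2.6.xx file.
SAMPLE_CONFIG = """\
CONFIG_SCSI=y
# CONFIG_SCSI_QLA_ISCSI is not set
CONFIG_SCSI_QLA_FC=m
"""
```

A result of "m" would mean the driver is built as a module, in which case the module or its firmware failing to load is the more likely suspect than a missing kernel option.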

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12729
Credit: 7,264,687
RAC: 17,530
United States
Message 1120944 - Posted: 24 Jun 2011, 16:53:50 UTC - in response to Message 1120657.

If the event causing the leak happens infrequently, finding it in the mounds and mounds of output can take forever. Hence finding a tool that reduces output when all memory is reachable makes the task perhaps possible. There will be a delay in the output, and going backwards to find the issue is another matter. If the issue is library calls that leak - there are some - then the problem may be intractable. If enabling debugging makes the process too slow, that is another issue. But if you can find out what the leaking block is used for, then you can design in debugging to find where it goes missing, if a read-through doesn't tell you.
____________

Byron Leigh Hatch @ team Carl Sagan
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 3619
Credit: 11,909,794
RAC: 1,093
Canada
Message 1120954 - Posted: 24 Jun 2011, 17:00:28 UTC - in response to Message 1120614.

Matt, thank you and the rest of the SETI@home crew for all your hard work.

Best Wishes
Byron

Donald L. Johnson
Project donor
Joined: 5 Aug 02
Posts: 6261
Credit: 738,486
RAC: 1,157
United States
Message 1120962 - Posted: 24 Jun 2011, 17:06:18 UTC - in response to Message 1120614.
Last modified: 24 Jun 2011, 17:06:45 UTC

Thanks for the kind words in the previous thread. Don't worry - I won't let it get to my head :).

- Matt

Matt, that is the least of our worries. (8{)
____________
Donald
Infernal Optimist / Submariner, retired

Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 32092
Credit: 13,773,703
RAC: 25,376
United Kingdom
Message 1121034 - Posted: 24 Jun 2011, 18:42:31 UTC

Always appreciated to be able to read your updates. Thanks.

Byron Leigh Hatch @ team Carl Sagan
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 3619
Credit: 11,909,794
RAC: 1,093
Canada
Message 1121246 - Posted: 25 Jun 2011, 1:17:00 UTC

Matt thank you and thanks to the rest of the SETI@home crew for all your hard work.

you guys are the best

Best Wishes
Byron

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1961
Credit: 10,479,963
RAC: 10,474
United States
Message 1123418 - Posted: 1 Jul 2011, 16:10:12 UTC

Query: how come we're (or, at least I'm) able to upload, even though the upload server shows as "disabled" on the "Server Status" page?
____________
.

DrFoo
Joined: 17 Jul 99
Posts: 26
Credit: 25,465,512
RAC: 27,278
United States
Message 1123443 - Posted: 1 Jul 2011, 17:32:28 UTC

Matt:

Something I ran across recently that is probably right on target for you guys (and a lot of others). If you want the stability of RHEL/CentOS AND the latest key software packages (like MySQL 5.5), I don't know of a better way to go. It's sponsored by RackSpace who obviously knows something about this sort of thing and has a vested interest in making it all work.

http://iuscommunity.org/About

____________


Copyright © 2014 University of California