Stardust and Sand (Jun 23 2011)

Message boards : Technical News : Stardust and Sand (Jun 23 2011)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1120614 - Posted: 23 Jun 2011, 21:35:31 UTC

Here's another catch-up tech news report. No big news, but more of the usual.

Last week we got beyond the annoying limits with the Astropulse database. There's still stuff to do "behind the scenes" but we are at least able to insert signals, and thus the assimilators are working again.

The upload server (bruno) keeps locking up. This is load related - it happens more often when we are maxed out, and of course we're pretty much maxed out all the time these days. We're thinking this may actually be a bad CPU. We'll swap it out and see if the problem goes away. Until then.. we randomly lose the ability to upload workunits and human intervention (to power cycle the machine locally or remotely) is required.

We've been moving back-end processes around. I mentioned before how we moved the assimilators to synergy as vader seemed overloaded. This was helpful. However, one thing we forgot about is that the assimilators have a memory leak. This is something that's been an issue forever - like since we were compiling/running this on Sun/Solaris systems - yet completely impossible to find and fix. But an easy band-aid is to have a cron job that restarts the assimilators every so often to clear the pipes. Well, oops, we didn't have that cron job on synergy and the system wedged over the weekend. That cron job is now in place. But still.. not sure why it's so easy for user processes to lock up a whole system to the point you can't even get a root prompt. There should always be enough resources to get a root prompt.
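
For illustration only, the restart band-aid can be sketched as a small check that a cron job might run. This is not the actual job used on synergy: the function names, the 2 GiB memory cap, and the 6-hour uptime cap are all assumptions made up for the example. It reads a process's resident memory from /proc (Linux only) and decides whether a restart is due.

```python
import re
from pathlib import Path


def parse_vmrss_kib(status_text: str) -> int:
    """Extract VmRSS (resident memory, in kiB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kib(pid: int) -> int:
    """Read the current resident memory of a running process (Linux only)."""
    return parse_vmrss_kib(Path(f"/proc/{pid}/status").read_text())


def needs_restart(rss_kib: int, uptime_s: float,
                  rss_limit_kib: int = 2 * 1024 * 1024,  # hypothetical 2 GiB cap
                  max_uptime_s: float = 6 * 3600) -> bool:  # hypothetical 6 hours
    """Restart when the process has grown too large or has run too long."""
    return rss_kib >= rss_limit_kib or uptime_s >= max_uptime_s
```

A cron entry would call `read_rss_kib()` on the assimilator's pid and restart it whenever `needs_restart()` returns true. Checking memory as well as elapsed time means a faster-than-usual leak still gets caught before the machine wedges.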

The mysql replica continued to fall behind, so the easiest thing to try next was upgrading mysql from 5.1.x to 5.5 (which employs better parallelization, supposedly, and therefore better i/o in times of stress). However, Fedora Core 15 is the first version of Fedora to have mysql 5.5 in its rpm repositories. So I upgraded jocelyn to FC15.. only to find for some reason this version of Fedora cannot load the firmware/drivers for the old QLogic fibre channel card, and therefore can't see the data drives. I've been beating my head on this problem for days now to no avail. We could downgrade, but then we can't use mysql 5.5. I guess we could install mysql 5.5 ourselves instead of yumming it in, but that's given us major headaches in the past. This should all just work like it had in earlier versions of Fedora. Jeez.
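
As a rough sketch of how "falling behind" can be measured: MySQL 5.1 and 5.5 report replication lag in the `Seconds_Behind_Master` field of `SHOW SLAVE STATUS`. The helper below pulls that field out of the vertical (`\G`) output of the mysql client. The function name is hypothetical, and any status text fed to it in examples is made up, not taken from jocelyn.

```python
from typing import Optional


def seconds_behind_master(status_text: str) -> Optional[int]:
    """Return replication lag in seconds from `SHOW SLAVE STATUS\\G` output.

    Returns None when MySQL reports NULL, which it does when the
    replication SQL thread is not running.
    """
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Seconds_Behind_Master":
            value = value.strip()
            return None if value == "NULL" else int(value)
    raise KeyError("Seconds_Behind_Master not found")
```

Logging this number every few minutes would show whether a change such as the 5.5 upgrade actually lets the replica catch up under load.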

Thanks for the kind words in the previous thread. Don't worry - I won't let it get to my head :).

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1120614 · Report as offensive
Profile Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1120616 - Posted: 23 Jun 2011, 21:41:57 UTC - in response to Message 1120614.  

If you end up needing a new CPU for Bruno let me/us know what type and we'll get you sent a replacement asap.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
ID: 1120616 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1120621 - Posted: 23 Jun 2011, 21:50:30 UTC - in response to Message 1120614.  
Last modified: 23 Jun 2011, 21:50:44 UTC

Thanks for the update Matt, keep up all the good work,

Claggy
ID: 1120621 · Report as offensive
Profile Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30934
Credit: 53,134,872
RAC: 32
United States
Message 1120625 - Posted: 23 Jun 2011, 22:02:01 UTC
Last modified: 23 Jun 2011, 22:06:03 UTC

Matt:

You should be able to get the root prompt, but you might not be able to launch (page in) ssh/login/bash to get any prompt. Not sure how you are configured, but you might need to leave a terminal logged in and set to above-normal priority. Obviously that's a security risk, so it needs to be behind a physically locked door.

As to that leak, not sure what debugging tools you have, but unless it is one of the POSIX designed-in leaks, you should be able to find and quash it. Perhaps a little personal development time reading up on the different available tools might find a new path to try. Worse, you will find the right tool only to discover it isn't available for Fedora, e.g. Malloc Debug http://www.manpagez.com/man/3/malloc/
ID: 1120625 · Report as offensive
Profile Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 4548
Credit: 35,667,570
RAC: 4
Canada
Message 1120639 - Posted: 23 Jun 2011, 22:58:25 UTC - in response to Message 1120614.  

thanks for the update Matt

Best Wishes
Byron
ID: 1120639 · Report as offensive
OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 1120651 - Posted: 23 Jun 2011, 23:41:27 UTC - in response to Message 1120614.  
Last modified: 23 Jun 2011, 23:47:31 UTC

Glad to see that your music hasn't completely taken you away from us... yet. Thanks for the update Matt!
ID: 1120651 · Report as offensive
Berserker
Volunteer tester

Joined: 2 Jun 99
Posts: 105
Credit: 5,440,087
RAC: 0
United Kingdom
Message 1120657 - Posted: 23 Jun 2011, 23:56:10 UTC

I've spent more than my fair share of time with malloc debug and various equivalents. It can work, but for non-trivial cases it can take like, forever (I've spent weeks on this sort of problem).

Hopefully there are some decent memory profilers for *nix. If so, can you dummy up a bucketload of either simulated or actual data and throw it at a testbed assimilator? Memory profiling should at least help you with where to look, even if it doesn't give you the smoking gun.

That said, as it's all DB backed, unclosed queries/result sets would be a place to start.
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.
ID: 1120657 · Report as offensive
Berserker
Volunteer tester

Joined: 2 Jun 99
Posts: 105
Credit: 5,440,087
RAC: 0
United Kingdom
Message 1120663 - Posted: 24 Jun 2011, 0:08:04 UTC

As for MySQL, I've rolled my own (I use Gentoo, ergo I have no choice), and have had no troubles (but then I don't use InnoDB, replicas or countless other features you do). The trick, as ever, is finding a 'good' version and then figuring out what arcane combination of configure options pushes the right buttons to make it have all the features you want, in the right order. Not sure if Fedora have 'volunteers' as such, but if they do, maybe one of them could help.
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.
ID: 1120663 · Report as offensive
bill

Joined: 16 Jun 99
Posts: 861
Credit: 29,352,955
RAC: 0
United States
Message 1120664 - Posted: 24 Jun 2011, 0:08:08 UTC

Thanks for the update Matt. Much appreciated.
ID: 1120664 · Report as offensive
Profile Jeff Mercer

Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1120711 - Posted: 24 Jun 2011, 3:44:29 UTC

Thanks for the update Matt. Hope things start getting better for everyone there in the lab. Thanks for your hard work and dedication to the project. Good luck with the music.... Play a few songs for me. ;)
ID: 1120711 · Report as offensive
Profile Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34353
Credit: 79,922,639
RAC: 80
Germany
Message 1120762 - Posted: 24 Jun 2011, 6:28:55 UTC

Thanks for the update Matt.



With each crime and every kindness we birth our future.
ID: 1120762 · Report as offensive
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22456
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1120819 - Posted: 24 Jun 2011, 10:36:54 UTC

I assume that when Matt is silent either things are going according to plan, so there is nothing really to report, or things are going so badly that he hasn't got time to report. I hope that in the next few weeks it is the former that dominates, and that his plans to divert more time to his music are well fulfilled, without too much interruption from the lab.





(off topic - Matt what's the guitar in your sig, and how's your young, feline, apprentice doing, looks as if it could be a mean picker.....)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1120819 · Report as offensive
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 1120895 - Posted: 24 Jun 2011, 14:56:17 UTC - in response to Message 1120614.  

If the driver is still part of the Linux kernel source code, you could just compile a custom kernel as part of your Fedora installation. Copy the /boot/config-2.6.xx file into /usr/src/kernel/2.6.xx/.config before running make menuconfig.
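
Before rebuilding anything, it may be worth checking whether the stock config already enables the driver. The sketch below parses kernel config text for a single option. It assumes the card is handled by the qla2xxx driver, whose option is `CONFIG_SCSI_QLA_FC`; the thread does not name the card model, so treat that option name as a guess to verify. The function name is hypothetical.

```python
from typing import Optional


def config_option(config_text: str, name: str) -> Optional[str]:
    """Return the value of a kernel config option ('y', 'm', ...) or None.

    None means the option is explicitly "not set" or absent from the file.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {name} is not set":
            return None
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
    return None
```

Running this over the contents of the /boot/config file for the installed kernel shows whether the option is built in (`y`), built as a module (`m`), or disabled, which tells you whether a custom kernel is needed at all.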
ID: 1120895 · Report as offensive
Profile Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30934
Credit: 53,134,872
RAC: 32
United States
Message 1120944 - Posted: 24 Jun 2011, 16:53:50 UTC - in response to Message 1120657.  

If the event causing the leak happens infrequently, finding it in the mounds and mounds of output can take forever. Hence finding a tool that reduces output when all memory is reachable makes the task perhaps possible. There will be a delay in the output, and going backwards to find the issue is another matter. If the issue is library calls that leak - there are some - then the problem may be intractable. If enabling debugging makes the process too slow, that is another issue. But if you can find out what the leaking block is used for, then you can design in debugging to find where it goes missing, if a read-through doesn't tell you.
ID: 1120944 · Report as offensive
Profile Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 4548
Credit: 35,667,570
RAC: 4
Canada
Message 1120954 - Posted: 24 Jun 2011, 17:00:28 UTC - in response to Message 1120614.  

Matt thank you and the rest of the SETI@home crew for all your hard work.

Best Wishes
Byron
ID: 1120954 · Report as offensive
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1120962 - Posted: 24 Jun 2011, 17:06:18 UTC - in response to Message 1120614.  
Last modified: 24 Jun 2011, 17:06:45 UTC

Thanks for the kind words in the previous thread. Don't worry - I won't let it get to my head :).

- Matt

Matt, that is the least of our worries. (8{)
Donald
Infernal Optimist / Submariner, retired
ID: 1120962 · Report as offensive
Profile Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 4548
Credit: 35,667,570
RAC: 4
Canada
Message 1121246 - Posted: 25 Jun 2011, 1:17:00 UTC

Matt thank you and thanks to the rest of the SETI@home crew for all your hard work.

you guys are the best

Best Wishes
Byron

ID: 1121246 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1123418 - Posted: 1 Jul 2011, 16:10:12 UTC

Query: how come we're (or, at least I'm) able to upload, even though the upload server shows as "disabled" on the "Server Status" page?
.

Hello, from Albany, CA!...
ID: 1123418 · Report as offensive
DrFoo

Joined: 17 Jul 99
Posts: 26
Credit: 28,975,189
RAC: 0
United States
Message 1123443 - Posted: 1 Jul 2011, 17:32:28 UTC

Matt:

Something I ran across recently that is probably right on target for you guys (and a lot of others). If you want the stability of RHEL/CentOS AND the latest key software packages (like MySQL 5.5), I don't know of a better way to go. It's sponsored by RackSpace who obviously knows something about this sort of thing and has a vested interest in making it all work.

http://iuscommunity.org/About

ID: 1123443 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1124575 - Posted: 4 Jul 2011, 13:47:29 UTC

Looks (from the Cricket graph) like something broke around 1600 Berkeley time on Sunday, July 3... Can't upload... and only retries are downloading.
.

Hello, from Albany, CA!...
ID: 1124575 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.