According to the Spiral (Mar 22 2012)



Message boards : Technical News : According to the Spiral (Mar 22 2012)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1208991 - Posted: 22 Mar 2012, 19:54:22 UTC
Last modified: 22 Mar 2012, 20:13:49 UTC

Since my last missive we had the usual string of minor bumps in the road. A couple of garden-variety server crashes, mainly. I sometimes wonder how big corporations manage to have such great uptime when we keep hitting fundamental flaws with Linux getting locked up under heavy load. I think the answers are that they (a) have massive redundancy (whereas we generally have very little, mostly due to lack of physical space and available power), (b) have far more manpower (on 24-hour call) to kick machines immediately when they lock up, and (c) under-utilize their servers (whereas we generally tend to push everything to its limits until it breaks).

Meanwhile, we've been slowly deploying the new servers, georgem and paddym (and a 45-drive JBOD), donated with the great help of the GPU Users Group. I have georgem in the closet hooked up to half the JBOD. One snag: of the 24 drives meant for georgem, 5 failed immediately. This is quite high, but given the recent world-wide drive shortage quality control may have taken a hit. Not sure if others are seeing the same. So we're not building a RAID on it just yet. When we get replacement drives it'll soon become the new download server (with workunit storage directly attached) and maybe the upload server (with results also directly attached). Not a pressing need, but the sooner we can retire bruno/vader/anakin the better.

I'm going to get an OS on paddym shortly. It was meant to be a compute server, but may take over science database server duties. You see, we were assuming that oscar, our current science database server, could attach to the other half of the JBOD, thus adding more spindles and therefore much-needed disk i/o to the mix. Our assumptions were wrong: despite having a generic external SATA port on the back, it seems the HP RAID card in the system can only attach to an HP JBOD enclosure, not just any enclosure. Maybe there's a way around that. Not sure yet. Nor are there any free slots to add a 3ware card. Anyway, one option is just to put a 3ware card in paddym and move the science database to that system (which does have more memory and more/faster CPUs). But migration would take a month. Long story short, lots of testing/brainstorming going on to determine the best course of action.

Other progress: we finally launched the new splitters, which are sensitive to VGC values and thus skip (most) noisy data blocks instead of splitting them into workunits that would return quickly and clog up our pipeline. Yay! However, there were unexpected results last night: it turns out it's actually slower to parse such a noisy data file and skip the bad blocks than to just split everything, so the splitters were getting stuck on these files and not generating work. Oops. We ran out of multi-beam work last night due to this, and I reverted back this morning just to get the plumbing working again. I'm going to change the logic to be a little more aggressive, and thus speed up skipping through noisy files, and implement that next week.
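In outline, the skip-noisy-blocks idea is simple. Here is a minimal sketch in Python; the threshold value, field names, and function are all illustrative, not the real splitter code:

```python
VGC_THRESHOLD = 0.5  # assumed noise cutoff; the real VGC criterion is not stated here

def split_blocks(blocks):
    """Turn quiet data blocks into workunits; skip noisy ones entirely.

    Each block is a dict with an 'id' and a 'vgc' noise metric
    (hypothetical representation of a tape's data blocks).
    """
    workunits, skipped = [], 0
    for block in blocks:
        if block["vgc"] > VGC_THRESHOLD:
            skipped += 1          # too noisy: no workunit generated
            continue
        workunits.append(block["id"])  # quiet: becomes a workunit
    return workunits, skipped

# Example: one noisy block among three gets dropped from the pipeline.
units, skipped = split_blocks([
    {"id": 1, "vgc": 0.1},
    {"id": 2, "vgc": 0.9},   # noisy
    {"id": 3, "vgc": 0.3},
])
```

The slowdown Matt describes would be in the scanning itself: deciding a block is noisy still costs a full parse, so a file that is mostly noise takes longer per emitted workunit, not less.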

I'm also working on the old SERENDIP code to bring it more up to date (i.e. make it read from a MySQL database instead of flat files). I actually got the whole suite compiled again for the first time in a decade. Soon chunks of SERENDIP can be used to parse the data currently being collected at Green Bank and help remove RFI.
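The flat-file-to-database migration pattern is the familiar one: swap a line parser for a SQL query returning the same fields. A hedged sketch, using Python's built-in sqlite3 as a stand-in for MySQL; the table and column names here are invented, not SERENDIP's actual schema:

```python
import sqlite3

def load_hits_flat(path):
    """Old style: parse whitespace-separated fields from a flat file."""
    with open(path) as f:
        return [tuple(map(float, line.split())) for line in f]

def load_hits_db(conn):
    """New style: pull the same fields from a database table."""
    return conn.execute("SELECT freq, power FROM hits ORDER BY id").fetchall()

# Demo with an in-memory database standing in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (id INTEGER PRIMARY KEY, freq REAL, power REAL)")
conn.executemany("INSERT INTO hits (freq, power) VALUES (?, ?)",
                 [(1420.4, 12.5), (1420.6, 9.8)])
rows = load_hits_db(conn)
```

The payoff is that downstream RFI-rejection code can query by time, frequency, or beam instead of re-reading whole files.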

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4067
Credit: 32,897,852
RAC: 7,508
United Kingdom
Message 1208995 - Posted: 22 Mar 2012, 20:06:02 UTC - in response to Message 1208991.

Thanks for the update and all the work,

Claggy

Sten-Arne
Volunteer tester
Joined: 1 Nov 08
Posts: 3406
Credit: 19,616,178
RAC: 18,566
Sweden
Message 1208998 - Posted: 22 Mar 2012, 20:09:27 UTC - in response to Message 1208991.
Last modified: 22 Mar 2012, 20:17:26 UTC

Thanks for the update Matt.
____________

Profile Graham Middleton
Project donor
Joined: 1 Sep 00
Posts: 422
Credit: 44,891,086
RAC: 29,698
United Kingdom
Message 1209008 - Posted: 22 Mar 2012, 20:23:49 UTC

Matt,

Many, many thanks for all the work you're doing and for the update.

I am very proud.
____________
Happy Crunching,

Graham

GPUUG Officer




graham@gpuug.org

Profile Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 31452
Credit: 12,178,470
RAC: 29,021
United Kingdom
Message 1209014 - Posted: 22 Mar 2012, 20:40:38 UTC
Last modified: 22 Mar 2012, 20:41:57 UTC

Thanks for the heads up, Matt. Yes Graham, you have good reason to be, and thank you for helping SETI@home in the way that you have.

Profile Ronald R CODNEY
Joined: 19 Nov 11
Posts: 87
Credit: 420,497
RAC: 0
United States
Message 1209020 - Posted: 22 Mar 2012, 20:58:46 UTC

Matt: Ur a stud dude. Thanks is never enough. :-)

Profile Alex Storey
Volunteer tester
Joined: 14 Jun 04
Posts: 536
Credit: 1,644,407
RAC: 574
Greece
Message 1209022 - Posted: 22 Mar 2012, 21:05:13 UTC

Thanx Matt! Thanx Graham!

Profile Khangollo
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1209027 - Posted: 22 Mar 2012, 21:41:28 UTC - in response to Message 1208991.

Thanks for the update, but...

I sometimes wonder how big corporations manage to have such great uptime when we keep hitting fundamental flaws with Linux getting locked up under heavy load. I think the answers are that they (a) have massive redundancy (whereas we generally have very little, mostly due to lack of physical space and available power), (b) have far more manpower (on 24-hour call) to kick machines immediately when they lock up, and (c) under-utilize their servers (whereas we generally tend to push everything to its limits until it breaks).

(d) For their production servers they don't choose Fedora, which is really meant for desktops. (The software is too bleeding-edge, and many experimental bells and whistles are turned on in the kernel they ship!)

____________

Profile Bill G
Project donor
Joined: 1 Jun 01
Posts: 347
Credit: 39,791,485
RAC: 68,780
United States
Message 1209029 - Posted: 22 Mar 2012, 21:47:02 UTC - in response to Message 1209027.

Thanks for the efforts and report Matt.
I am always amazed to see all the hoops you have to jump through.
____________

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11162
Credit: 13,951,570
RAC: 12,418
United States
Message 1209259 - Posted: 23 Mar 2012, 14:41:46 UTC - in response to Message 1208991.

Thanks for the update, Matt.

I see you also doubled the number of RFIs and Nitpickrs and they actually show as running on the Server Status page.
____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Dena Wiltsie
Joined: 19 Apr 01
Posts: 1121
Credit: 541,537
RAC: 340
United States
Message 1209420 - Posted: 23 Mar 2012, 22:36:52 UTC

There is no trick to near-100% uptime. The company I work for produces a device that requires that kind of uptime, so what we did was plant a number of general-purpose diagnostic tools, such as traps or dumps, to catch failures. The hard part follows: every failure needs to be examined to understand it. If a failure lacks the information needed to locate the problem, then additional work is done to the code to capture information that would aid in diagnosing it. It's not something you can do overnight; it can take years. In any case, we moved the failure rate from one every few hours to one in over a year for most customers. The problem is we are now at the point where failures are so far apart that sometimes the customer will reset the unit instead of calling us. Our mistake; all you need to do is press one switch, wait about a minute, and the unit is back in operation.

In addition, some failures can be detected but aren't terminal. In those cases, we may write in some cleanup code and then continue on with the job at hand. The design that came before the current device used this approach, and while it never crashed, it could do some strange things. This required a different debugging approach, where we set aside a chunk of memory (expensive in the 32k-byte days) to hold failure information; the user would then be made aware that we needed to be called to check out a failure.
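The trap-record-and-continue approach described above can be sketched generically. This is not Dena's actual code (that was assembler on an embedded device); it is just an illustration of the pattern in Python, with invented names:

```python
import logging
import traceback

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("diag")

def run_with_trap(task, *args):
    """Run a task; on failure, capture a diagnostic record instead of crashing.

    Returns (True, result) on success, or (False, record) where the record
    holds enough context to examine the failure later.
    """
    try:
        return True, task(*args)
    except Exception as exc:
        record = {
            "task": getattr(task, "__name__", "?"),
            "args": args,
            "error": repr(exc),
            "trace": traceback.format_exc(),
        }
        log.warning("trapped failure: %s", record["error"])
        return False, record

# A non-terminal failure is trapped and recorded; a good call passes through.
ok_bad, record = run_with_trap(lambda x: 10 // x, 0)   # division by zero
ok_good, value = run_with_trap(lambda x: 10 // x, 5)
```

The hard part, as the post says, is not the trap itself but the follow-up: each captured record has to be studied, and the instrumentation refined until every failure explains itself.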

It didn't require many programmers as we never had more than 6, with at least half that number making new bugs for the others to find. It just took someone dumb enough, like me, to keep working on a bug till it was solved.
____________

__W__
Joined: 28 Mar 09
Posts: 114
Credit: 3,145,426
RAC: 2,593
Germany
Message 1209656 - Posted: 24 Mar 2012, 14:45:50 UTC - in response to Message 1208991.

... 5 failed immediately. This is quite high, but given the recent world-wide drive shortage quality control may have taken a hit. Not sure if others are seeing the same.

I think this is more a transportation problem. Some years ago, as I walked through our stockroom, a package of 20 disks "crossed" my way, 1 meter above ground, touching down after a 5-meter flight. That was the "normal" way the truck was unloaded by the driver (I think it was UPS). Most of them don't take care of their packages even if they are labeled as fragile.

__W__

____________
_______________________________________________________________________________

Profile Bill Walker
Joined: 4 Sep 99
Posts: 3352
Credit: 2,041,100
RAC: 2,101
Canada
Message 1209748 - Posted: 24 Mar 2012, 18:56:14 UTC - in response to Message 1209420.

It didn't require many programmers as we never had more than 6 ...


Which leaves this project only 6 full time programmers short of what they need for 100% up-time. Since we are already doing the distributed data processing faster than it can be gathered, and faster than the preliminary analysis can be further processed, there are probably lots of other places to spend time and money than 100% up time.

100% up-time for WU delivery may make us crunchers happier, but it doesn't help the science.
____________

Profile Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1209767 - Posted: 24 Mar 2012, 19:33:55 UTC - in response to Message 1209656.

... 5 failed immediately. This is quite high, but given the recent world-wide drive shortage quality control may have taken a hit. Not sure if others are seeing the same.

I think this is more a transportation problem. Some years ago, as I walked through our stockroom, a package of 20 disks "crossed" my way, 1 meter above ground, touching down after a 5-meter flight. That was the "normal" way the truck was unloaded by the driver (I think it was UPS). Most of them don't take care of their packages even if they are labeled as fragile.

__W__


It could very well be. Good news is 5 replacement drives will be sitting on their desk on Monday.
____________


Executive Director GPU Users Group Inc. -
brad@gpuug.org

__W__
Joined: 28 Mar 09
Posts: 114
Credit: 3,145,426
RAC: 2,593
Germany
Message 1209988 - Posted: 25 Mar 2012, 9:09:43 UTC - in response to Message 1209767.

... It could very well be. Good news is 5 replacement drives will be sitting on their desk on Monday.


+1

__W__

____________
_______________________________________________________________________________

Profile Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 31452
Credit: 12,178,470
RAC: 29,021
United Kingdom
Message 1209993 - Posted: 25 Mar 2012, 11:13:13 UTC

+2

Dena Wiltsie
Joined: 19 Apr 01
Posts: 1121
Credit: 541,537
RAC: 340
United States
Message 1210019 - Posted: 25 Mar 2012, 14:20:33 UTC - in response to Message 1209748.

It didn't require many programmers as we never had more than 6 ...


Which leaves this project only 6 full time programmers short of what they need for 100% up-time. Since we are already doing the distributed data processing faster than it can be gathered, and faster than the preliminary analysis can be further processed, there are probably lots of other places to spend time and money than 100% up time.

100% up-time for WU delivery may make us crunchers happier, but it doesn't help the science.

In our case, we were providing new features/products, and I was also involved in debugging new hardware. After a while you do other things with your time, as the failures are not frequent enough to require all of it.
In our case, we had over 100,000 lines of assembler code, and because of conditional macro coding the final listings were so large we only attempted to print them once.

The big problem with hunting bugs is that programmers would rather write code than look for hard-to-find crashes.
____________

Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1923
Credit: 9,754,408
RAC: 17,142
United States
Message 1210427 - Posted: 26 Mar 2012, 20:14:56 UTC

Umm, at some point are you going to add Georgem to the list of computers on the "Server Status" page?
____________
.

Profile cliff
Joined: 16 Dec 07
Posts: 322
Credit: 2,509,590
RAC: 0
United Kingdom
Message 1210440 - Posted: 26 Mar 2012, 20:37:47 UTC - in response to Message 1210427.

GeorgeM is on the Server Status page; look towards the bottom. It's there :-)

Cheers,
____________
Cliff,
Been there, Done that, Still no damm T shirt!

Profile Graham Middleton
Project donor
Joined: 1 Sep 00
Posts: 422
Credit: 44,891,086
RAC: 29,698
United Kingdom
Message 1210449 - Posted: 26 Mar 2012, 20:49:35 UTC - in response to Message 1210427.
Last modified: 26 Mar 2012, 20:50:20 UTC

Umm, at some point are you going to add Georgem to the list of computers on the "Server Status" page?



I agree it isn't on the Hosts list, but I for one would rather that Matt and the other guys at SETI@home concentrated on fixing the problems they have, and on getting the JBOD array and paddym up and running, before they worry about the cosmetics of the ancillary information on the Status page. Simple priorities.
____________
Happy Crunching,

Graham

GPUUG Officer




graham@gpuug.org


Copyright © 2014 University of California