According to the Spiral (Mar 22 2012)

Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 4
United States
Message 1210798 - Posted: 28 Mar 2012, 0:18:57 UTC - in response to Message 1209027.  
Last modified: 28 Mar 2012, 0:33:36 UTC

Thanks for the update, but...

I sometimes wonder how big corporations manage to have such great uptime when we keep hitting fundamental flaws with linux getting locked up under heavy load. I think the answers are they (a) have massive redundancy (whereas we generally have very little mostly due to lack of physical space and available power), (b) have far more manpower (on 24 hour call) to kick machines immediately when they lock up, and (c) are under-utilizing servers (whereas we generally tend to push everything to their limits until they break).

(d) For their production servers they don't choose Fedora, which is meant for desktops, really. (Too bleeding edge software, also many experimental bells and whistles turned on in their compiled kernel!).


That's Ubuntu's main problem too. Debian itself is as stable as can be, but it lacks the newest hardware support. So Ubuntu uses software from Debian's sid and testing branches, which IMO pushes buggy stuff out to the end user. For my servers I like Ubuntu Server, but I always start with a minimal install and add packages only when really needed.

As far as what distro they should use, I'm not qualified to put an opinion on that out there.

I'm a Debian person, whereas most businesses/organizations are Red Hat (or CentOS) shops.

-Dave
ID: 1210798 · Report as offensive
Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 4
United States
Message 1210793 - Posted: 28 Mar 2012, 0:12:16 UTC

I love these details about the server upgrades. Keep us updated. :-)
Thanks!

(5 out of 24 drives is a LOT, I feel your pain there... I'd have expected 2 duds maximum out of that number, and maybe a third within 6 months)
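
As a rough check on that intuition, here is a minimal Python sketch of the arithmetic. The 5% dead-on-arrival rate is purely an assumed figure for illustration, not a measured one, and the calculation treats drives as failing independently, which would not hold if a whole box was damaged in shipping.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more duds among n drives."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_DRIVES = 24            # from the thread
N_FAILED = 5             # from the thread
ASSUMED_DOA_RATE = 0.05  # hypothetical per-drive failure rate, an assumption

observed_rate = N_FAILED / N_DRIVES
expected_duds = N_DRIVES * ASSUMED_DOA_RATE
tail = prob_at_least(N_FAILED, N_DRIVES, ASSUMED_DOA_RATE)

print(f"Observed failure rate: {observed_rate:.1%}")
print(f"Expected duds at the assumed rate: {expected_duds:.1f}")
print(f"Chance of {N_FAILED}+ duds at the assumed rate: {tail:.2%}")
```

Changing `ASSUMED_DOA_RATE` shows how sensitive the conclusion is to whatever baseline you believe in.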

-Dave
ID: 1210793 · Report as offensive
Profile ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 507
United Kingdom
Message 1210637 - Posted: 27 Mar 2012, 10:18:14 UTC - in response to Message 1210558.  

CERN is using Scientific Linux on their programs, which is Red Hat plus some scientific libraries. I am seeing them run in the BOINC_VM window at Test4Theory@home and it seems very stable, also using very little RAM (256 MB), while my Solaris Virtual Machine needs 1.5 GB RAM just to run SETI@home.
Tullio

...but Scientific Linux CERN is very conservatively out-of-date. Nearly all CERN machines are still running SLC5 which is kernel 2.6.18-274, so I have to run that on our user machines too. I have one machine running on SLC6 which is 2.6.32-220.
ID: 1210637 · Report as offensive
Profile Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1210628 - Posted: 27 Mar 2012, 8:29:40 UTC - in response to Message 1210613.  

Just a FYI for those interested, the 5 replacement drives arrived at SETI today and all tested out okay.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
ID: 1210628 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15157
Credit: 4,362,181
RAC: 6
Netherlands
Message 1210613 - Posted: 27 Mar 2012, 7:39:53 UTC - in response to Message 1210470.  

What HHG means is for GeorgeM to be mentioned in the Hosts list, which now only shows:

anakin: Intel Server (2 x 2.8GHz Xeon, 4 GB RAM)
bane: Intel Server (2 x quad-core 2.66GHz Xeon, 4 GB RAM)
bruno: Intel Server (2 x 2.66GHz Xeon, 8 GB RAM)
carolyn: Intel Server (2 x quad-core 2.4GHz Xeon, 96 GB RAM)
jocelyn: Sun V40z (4 x 2.2GHz Opteron, 28 GB RAM)
lando: Intel Server (4 x 3.20GHz Xeon, 4 GB RAM)
marvin: Intel Server (2 x 2.66GHz Xeon, 16 GB RAM)
maul: Intel Server (4 x 2.66GHz Xeon, 8 GB RAM)
oscar: Intel Server (2 x quad-core 2.4GHz Xeon, 96 GB RAM)
synergy: Intel Server (2 x hexa-core 2.53GHz Xeon, 96 GB RAM)
thinman: AMD Server (2 x 2.4GHz Opteron, 16 GB RAM)
thumper: Sun Fire X4500 (2 x dual-core 2.6GHz Opteron, 16 GB RAM)
vader: Intel Server (4 x dual-core 3GHz Xeon, 16 GB RAM)
ID: 1210613 · Report as offensive
Profile tullio
Volunteer tester

Joined: 9 Apr 04
Posts: 7993
Credit: 2,930,782
RAC: 1
Italy
Message 1210558 - Posted: 27 Mar 2012, 1:54:22 UTC

CERN is using Scientific Linux on their programs, which is Red Hat plus some scientific libraries. I am seeing them run in the BOINC_VM window at Test4Theory@home and it seems very stable, also using very little RAM (256 MB), while my Solaris Virtual Machine needs 1.5 GB RAM just to run SETI@home.
Tullio
ID: 1210558 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1210471 - Posted: 26 Mar 2012, 21:40:51 UTC - in response to Message 1209027.  

Good point, and agreed. We're also still using Apache while it seems the world is moving toward nginx (I've been looking into switching at some point, if I can determine it's worth it). So yes, there are some non-optimal situations in general here, and maybe others that arise from sticking with Fedora... but we kind of need the bleeding edge for BOINC/SETI@home software development purposes, and don't have the capacity for the management overhead of multiple OS flavors. I know it sounds wimpy not wanting to deal with multiple Linuxes, but it actually is a real pain. Plus most of our problems with Linux are kernel-related, not Linux-flavor-related.

- Matt


Thanks for the update, but...

I sometimes wonder how big corporations manage to have such great uptime when we keep hitting fundamental flaws with linux getting locked up under heavy load. I think the answers are they (a) have massive redundancy (whereas we generally have very little mostly due to lack of physical space and available power), (b) have far more manpower (on 24 hour call) to kick machines immediately when they lock up, and (c) are under-utilizing servers (whereas we generally tend to push everything to their limits until they break).

(d) For their production servers they don't choose Fedora, which is meant for desktops, really. (Too bleeding edge software, also many experimental bells and whistles turned on in their compiled kernel!).


-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1210471 · Report as offensive
Profile cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1210470 - Posted: 26 Mar 2012, 21:38:53 UTC - in response to Message 1210468.  
Last modified: 26 Mar 2012, 21:39:04 UTC

Seek and Ye shall find
Cliff,
Been there, Done that, Still no damm T shirt!
ID: 1210470 · Report as offensive
Profile Michel448a
Volunteer tester
Joined: 27 Oct 00
Posts: 1331
Credit: 2,970,814
RAC: 0
Canada
Message 1210468 - Posted: 26 Mar 2012, 21:36:55 UTC

lol is ? or isn't ?

to be or not to be ?
ID: 1210468 · Report as offensive
Profile Graham Middleton

Joined: 1 Sep 00
Posts: 1113
Credit: 86,815,638
RAC: 0
United Kingdom
Message 1210449 - Posted: 26 Mar 2012, 20:49:35 UTC - in response to Message 1210427.  
Last modified: 26 Mar 2012, 20:50:20 UTC

Umm, at some point are you going to add Georgem to the list of computers on the "Server Status" page?



I agree it isn't on the Hosts list, but I for one would rather that Matt and the other guys at SETI@home concentrated on getting the problems they have fixed, and the JBOD array and PaddyM up & running, before they worry about the cosmetics of the ancillary information on the Status page. Simple priorities.
Happy Crunching,

Graham

ID: 1210449 · Report as offensive
Profile cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1210440 - Posted: 26 Mar 2012, 20:37:47 UTC - in response to Message 1210427.  

GeorgeM is on the Server Stats page. Look towards the bottom, it's there :-)

Cheers,
Cliff,
Been there, Done that, Still no damm T shirt!
ID: 1210440 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1210427 - Posted: 26 Mar 2012, 20:14:56 UTC

Umm, at some point are you going to add Georgem to the list of computers on the "Server Status" page?
.

Hello, from Albany, CA!...
ID: 1210427 · Report as offensive
Dena Wiltsie
Volunteer tester

Joined: 19 Apr 01
Posts: 1626
Credit: 24,230,968
RAC: 59
United States
Message 1210019 - Posted: 25 Mar 2012, 14:20:33 UTC - in response to Message 1209748.  

It didn't require many programmers as we never had more than 6 ...


Which leaves this project only 6 full time programmers short of what they need for 100% up-time. Since we are already doing the distributed data processing faster than it can be gathered, and faster than the preliminary analysis can be further processed, there are probably lots of other places to spend time and money than 100% up time.

100% up-time for WU delivery may make us crunchers happier, but it doesn't help the science.

In our case, we were providing new features/products and I was also involved in debugging new hardware as well. After a while you do other things with your time, as the failures are not frequent enough to require all of it.
In our case, we used over 100,000 lines of assembler code, and because of conditional macro coding the final listings were so large we only attempted to print them once.

The big problem with hunting bugs is that programmers would rather code than look for hard-to-find crashes.
ID: 1210019 · Report as offensive
Profile Gone with the wind
Volunteer tester

Joined: 19 Nov 00
Posts: 41704
Credit: 42,645,437
RAC: 95
Message 1209993 - Posted: 25 Mar 2012, 11:13:13 UTC

+2
ID: 1209993 · Report as offensive
__W__
Joined: 28 Mar 09
Posts: 116
Credit: 5,943,642
RAC: 0
Germany
Message 1209988 - Posted: 25 Mar 2012, 9:09:43 UTC - in response to Message 1209767.  

... It could very well be. Good news is 5 replacement drives will be sitting on their desk on Monday.


+1

__W__

_______________________________________________________________________________
ID: 1209988 · Report as offensive
Profile Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1209767 - Posted: 24 Mar 2012, 19:33:55 UTC - in response to Message 1209656.  

... 5 failed immediately. This is quite high, but given the recent world-wide drive shortage quality control may have taken a hit. Not sure if others are seeing the same.

I think this is more of a transportation problem. Some years ago, as I walked through our stockroom, a package of 20 disks "crossed" my path, 1 meter above the ground, touching down after a 5 meter flight. That was the driver's "normal" way of unloading the truck (I think it was UPS). Most of them don't take care of their packages even if they are labeled as fragile.

__W__


It could very well be. Good news is 5 replacement drives will be sitting on their desk on Monday.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
ID: 1209767 · Report as offensive
Profile Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 1209748 - Posted: 24 Mar 2012, 18:56:14 UTC - in response to Message 1209420.  

It didn't require many programmers as we never had more than 6 ...


Which leaves this project only 6 full time programmers short of what they need for 100% up-time. Since we are already doing the distributed data processing faster than it can be gathered, and faster than the preliminary analysis can be further processed, there are probably lots of other places to spend time and money than 100% up time.

100% up-time for WU delivery may make us crunchers happier, but it doesn't help the science.

ID: 1209748 · Report as offensive
__W__
Joined: 28 Mar 09
Posts: 116
Credit: 5,943,642
RAC: 0
Germany
Message 1209656 - Posted: 24 Mar 2012, 14:45:50 UTC - in response to Message 1208991.  

... 5 failed immediately. This is quite high, but given the recent world-wide drive shortage quality control may have taken a hit. Not sure if others are seeing the same.

I think this is more of a transportation problem. Some years ago, as I walked through our stockroom, a package of 20 disks "crossed" my path, 1 meter above the ground, touching down after a 5 meter flight. That was the driver's "normal" way of unloading the truck (I think it was UPS). Most of them don't take care of their packages even if they are labeled as fragile.

__W__

_______________________________________________________________________________
ID: 1209656 · Report as offensive
Dena Wiltsie
Volunteer tester

Joined: 19 Apr 01
Posts: 1626
Credit: 24,230,968
RAC: 59
United States
Message 1209420 - Posted: 23 Mar 2012, 22:36:52 UTC

There is no trick to near 100% uptime. The company I work for produces a device that requires that kind of uptime, so what we did was plant a number of general-purpose diagnostic tools, such as traps or dumps, to catch failures. The hard part follows: every failure needs to be examined until it is understood. If a failure lacks the information needed to locate the problem, then additional work is done on the code to capture information that would aid in diagnosing it. It's not something you can do overnight; it can take years. In any case, we moved the failure rate from one every few hours to one in more than a year for most customers. The problem is that failures are now so far apart that sometimes the customer will reset the unit instead of calling us. Our mistake: all you need to do is press one switch and wait about a minute and the unit is back in operation.

In addition, some failures can be detected but aren't terminal. In those cases, we may write in some cleanup code and then continue on with the job at hand. The design that came before the current device used this approach, and while it never crashed, it could do some strange things. This required a different debugging approach: we set aside a chunk of memory (expensive in the 32k byte days) to hold failure information, and the user would then be made aware that we needed to be called to check out the failure.

It didn't require many programmers as we never had more than 6 with at least half that number making new bugs for the others to find. It just took someone dumb enough like me to keep working on a bug till it was solved.
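
The "record the failure, clean up, keep going" approach described above can be sketched roughly as follows. This is only an illustration in Python, not the actual device code (which was assembler); the names `FailureLog` and `run_job`, the record layout, and the capacity value are all hypothetical.

```python
import time
import traceback
from collections import deque


class FailureLog:
    """Fixed-size failure store, standing in for the reserved chunk of memory.

    Once `capacity` records are held, the oldest one is overwritten.
    """

    def __init__(self, capacity=8):  # capacity is an arbitrary, assumed value
        self.records = deque(maxlen=capacity)

    def capture(self, location, exc):
        """Save when and where the failure happened plus a short description."""
        detail = "".join(traceback.format_exception_only(type(exc), exc)).strip()
        self.records.append(
            {"time": time.time(), "location": location, "detail": detail}
        )

    def needs_service_call(self):
        """True if any failure is waiting to be examined."""
        return len(self.records) > 0


def run_job(job, log, fallback=None):
    """Run job(); on a non-terminal failure, record it and carry on with a fallback."""
    try:
        return job()
    except Exception as exc:
        log.capture(job.__name__, exc)
        return fallback


def divide_by_zero():
    """Deliberately failing job used to exercise the log."""
    return 1 / 0


log = FailureLog(capacity=2)
result = run_job(divide_by_zero, log, fallback=0)

if log.needs_service_call():
    print("Failure recorded, please call support:")
    for record in log.records:
        print(f"  {record['location']}: {record['detail']}")
```

The bounded `deque` mirrors the trade-off in the post: memory is limited, so only the most recent failures are kept, and the user is told to call in before they are overwritten.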
ID: 1209420 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 27
United States
Message 1209259 - Posted: 23 Mar 2012, 14:41:46 UTC - in response to Message 1208991.  

Thanks for the update, Matt.

I see you also doubled the number of RFIs and Nitpickrs and they actually show as running on the Server Status page.
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1209259 · Report as offensive


 
©2020 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.