Eggs (Mar 24 2008)

Message boards : Technical News : Eggs (Mar 24 2008)
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 730034 - Posted: 24 Mar 2008, 22:28:55 UTC

Things have been running rather well over the past couple of weeks. Having effectively unlimited bandwidth really helps. It's a little more hectic behind the scenes as new data keeps getting sent up from Arecibo - we are continually working to offload the data to our local servers (and remote mass storage) so we can send back the blank drives for more. Steps will be taken soon to improve this situation (namely: sending some data to our remote storage via our faster Hurricane connection).

There was a bit of a panic this morning, however. Suddenly gowron, our workunit storage server, reset itself. Not only did it reboot, but it lost all its host/IP information. For all we could tell at first, it had lost everything! We had to connect to it over serial (most difficult part: finding the right cables), but once we got in we found our 2 terabytes of workunits were still intact (whew). So it was mostly a matter of reconfiguring the basic things, and we were back in business. Why did it reset itself? That remains a mystery.

Another minor gripe: I spent a man-day last week testing mdadm's "spare group" feature. That is, if a drive fails on a RAID device without a spare, it can steal a spare from another RAID device in the same spare group - mdadm's way of enabling a "hot spare pool." We never had a case where this would happen, nor did we ever test it. Now that thumper is down two spares (due to making a new small, separate RAID1 for database indexes) I wanted to test this. I made simple test cases and failed drives - but the available spares in the spare group weren't being used. Long story short - I actually compiled my own mdadm with fprintf's all over the place and found mdadm behaving strangely. Thing is, this is mdadm version 2.6.2 we're talking about here, and mdadm is already up to version 2.6.4. So I downloaded that, and it worked, so apparently this bad behavior has been fixed. But Fedora doesn't have the latest version available yet, at least via "yum update," so we're pretty much waiting for the new version to become available there rather than deploying a less trusted version, even if it seems to work better.
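
For readers unfamiliar with the feature: in mdadm.conf, arrays join a pool by carrying the same spare-group= value on their ARRAY lines, and the actual moving of spares is done by mdadm while it runs in monitor mode. The Python sketch below is only a toy model of that policy, not mdadm's code. The class RaidArray, the function rebalance_spares, and the array and drive names are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RaidArray:
    """Toy model of one md array (hypothetical, not mdadm's data structures)."""
    name: str
    spare_group: str
    active: list[str]
    spares: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)

    def fail(self, drive: str) -> None:
        """Mark an active drive as failed."""
        self.active.remove(drive)
        self.failed.append(drive)


def rebalance_spares(arrays: list[RaidArray]) -> list[tuple[str, str, str]]:
    """Give each degraded, spare-less array a spare from a healthy array
    in the same spare group. Returns (drive, donor, recipient) moves."""
    moves = []
    for needy in arrays:
        # Only an array with a failed drive and no spare of its own needs help.
        if not needy.failed or needy.spares:
            continue
        for donor in arrays:
            if donor is needy or donor.spare_group != needy.spare_group:
                continue
            # Assumption of this sketch: never take from a degraded donor.
            if donor.failed or not donor.spares:
                continue
            drive = donor.spares.pop()
            needy.spares.append(drive)
            moves.append((drive, donor.name, needy.name))
            break
    return moves


if __name__ == "__main__":
    md0 = RaidArray("md0", "pool", active=["sda", "sdb"], spares=["sdc"])
    md1 = RaidArray("md1", "pool", active=["sdd", "sde"])
    md1.fail("sdd")
    print(rebalance_spares([md0, md1]))  # [('sdc', 'md0', 'md1')]
```

The test described in the post amounts to the last four lines: fail a drive on the array without a spare and check whether the spare from the other array in the group gets moved over.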

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 730034 · Report as offensive
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 730043 - Posted: 24 Mar 2008, 22:37:09 UTC


. . . Thanks for the Posting Matt - figure 'Murphy' played a bit role eh ;)

> Each of You Keep up the good work @ Berkeley . . .


BOINC Wiki . . .

Science Status Page . . .
ID: 730043 · Report as offensive
aplayer

Joined: 26 Apr 00
Posts: 13
Credit: 15,217,341
RAC: 0
United States
Message 730060 - Posted: 24 Mar 2008, 22:54:24 UTC

Another job well done. Ty for the news.
ID: 730060 · Report as offensive
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30687
Credit: 53,134,872
RAC: 32
United States
Message 730147 - Posted: 25 Mar 2008, 6:07:51 UTC - in response to Message 730034.  

You have to watch out for those cosmic rays. With memory getting as dense as it is today a single particle can switch a bit or two. Better get a stronger magnetic field around the building to deflect them. :)

TY for the post and info.



ID: 730147 · Report as offensive
Profile AndyW Project Donor
Volunteer tester
Joined: 23 Oct 02
Posts: 5862
Credit: 10,957,677
RAC: 18
United Kingdom
Message 730175 - Posted: 25 Mar 2008, 9:44:55 UTC

Always appreciate your updates Matt. Thank you.
I hope Gowron isn't about to steal the limelight and need regular attention, for your sake!
ID: 730175 · Report as offensive
Profile Andy Lee Robinson
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 730297 - Posted: 25 Mar 2008, 23:03:05 UTC - in response to Message 730147.  

You have to watch out for those cosmic rays. With memory getting as dense as it is today a single particle can switch a bit or two. Better get a stronger magnetic field around the building to deflect them. :)


Some particles can do as much damage as a baseball bat in a biscuit factory, as far as chips are concerned! Nothing will stop them.

I don't think a big field would help that much... it could make more dangerous particles out of those that would have missed without it, and the kind with such high energies that can damage or influence a chip probably wouldn't even notice the field.
If it worked, you'd probably need a field so strong that the electrons in the chips would lose their balance and fall off the wires..

Lead lining and ECC offer more hope!
ID: 730297 · Report as offensive
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30687
Credit: 53,134,872
RAC: 32
United States
Message 730307 - Posted: 25 Mar 2008, 23:21:45 UTC - in response to Message 730297.  



Hey, I know, I'm talking about magnets the size of the LHC. After all, the particles are doing well over 99% of c, so bending their flight path will take a bit of power. They already got through the earth's magnetic field, a few dozen miles of air, and a building, so I don't think a little lead lining will make any difference to them. Got to have that 100% uptime, you know. ;)

ID: 730307 · Report as offensive
Profile Andy Lee Robinson
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 730527 - Posted: 26 Mar 2008, 5:58:51 UTC - in response to Message 730307.  
Last modified: 26 Mar 2008, 6:28:14 UTC

Some of the most energetic particles literally have the energy of a tennis ball at first serve, but fortunately they are very rare. It's the slower secondary particles produced by a collision with a dense object that can do the damage.
The magnetic field in a synchrotron is very strong and very confined, and the particles in the ring possess the same polarity. They damage the collider if they go off course, so the timing, strength, and precision of the field are crucial (hence LHC@Home!). I don't think one could make a magnetic field that strong encompassing a whole building without a dedicated power station nearby, and any hard disks within a 100m radius probably wouldn't approve, nor would the staff or their credit cards!
What repels one polarity attracts the other...

I'd move it a km underground where natural radiation in the rocks can be easily blocked, but this can push up the commuting time a little!

For 100% uptime, I'd do a multiple-master MySQL configuration with lots and lots of ECC RAM, dual PSUs and a UPS each, with independent power supplies from two separate providers for each half of the system, but that starts to get expensive...
Then you can bring one database down for backup or repair while the other server(s) keep running.
ID: 730527 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20359
Credit: 7,508,002
RAC: 20
United Kingdom
Message 730573 - Posted: 26 Mar 2008, 11:58:47 UTC

See:

Should every computer chip have a cosmic ray detector?


Note that BOINC is protected from the effects of this to some extent by the quorum checking.
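
The quorum idea can be sketched in a few lines of Python. This is an illustration only, not BOINC's validator: the function name canonical_result and the exact-match comparison are assumptions (real science results generally need a tolerance-based comparison), and min_quorum here is simply a parameter of this sketch.

```python
from collections import Counter
from typing import Optional


def canonical_result(results: list[str], min_quorum: int = 2) -> Optional[str]:
    """Return the result that at least min_quorum hosts agree on, else None.

    Each entry in results stands for the output one host returned for the
    same workunit. A single host with a flipped bit is simply outvoted.
    """
    if not results:
        return None
    value, votes = Counter(results).most_common(1)[0]
    return value if votes >= min_quorum else None


if __name__ == "__main__":
    # Two hosts agree, one returned a corrupted result.
    print(canonical_result(["signal@1420MHz", "signal@1420MHz", "sign%l@1420MHz"]))
    # No agreement yet, so the workunit would have to go out to another host.
    print(canonical_result(["signal@1420MHz", "sign%l@1420MHz"]))
```

The protection is only "to some extent" because it covers errors on the volunteers' machines; a bit flipped on the server before the workunit is sent out would reach every host identically.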

Happy crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 730573 · Report as offensive
Profile Andy Lee Robinson
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 730575 - Posted: 26 Mar 2008, 12:22:45 UTC - in response to Message 730573.  

ID: 730575 · Report as offensive
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 730597 - Posted: 26 Mar 2008, 13:24:06 UTC


. . . so, it's the Cosmic Rays freezin' my system every-now & then - wondered what was goin' on

< maybe more research is needed here: Hard drive wobbles track earthquake spread

> isn't somebody @ Berkeley still doing Research regarding ?

ps - Thanks Martin . . .


BOINC Wiki . . .

Science Status Page . . .
ID: 730597 · Report as offensive
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 730617 - Posted: 26 Mar 2008, 15:04:09 UTC - in response to Message 730297.  



Lead lining and ECC offer more hope!



Check again; lead is a good source of (alpha) radiation, which also flips your bits. Manufacturers try to use isotope-enriched lead to reduce the risk, but that is expensive and not foolproof. ECC and clever design are the answer, short of going to a wider-bandgap (non-Si) material.
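
To show how ECC undoes a single flipped bit, here is a minimal Python sketch of a Hamming(7,4) code. It is a toy, chosen because it fits in a few lines: ECC memory modules use wider codes that protect a whole memory word, but the principle of computing a syndrome that points at the bad bit is the same. The function names encode and decode are made up for this example.

```python
def encode(data: list[int]) -> list[int]:
    """Encode 4 data bits as a 7-bit Hamming codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword: list[int]) -> list[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit, or 0 if clean.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    word[4] ^= 1          # a stray particle flips one stored bit
    print(decode(word))   # [1, 0, 1, 1]
```

This code corrects one flipped bit per codeword; two flips in the same codeword would be miscorrected, which is why memory ECC adds an extra parity bit to at least detect double errors.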
ID: 730617 · Report as offensive
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 731067 - Posted: 27 Mar 2008, 19:57:53 UTC

Having effectively unlimited bandwidth really helps.

That's great to hear. Did I miss something in one of Matt's posts? If not, how did SETI manage to get almost free bandwidth?

Cheers
Speedy
ID: 731067 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 731070 - Posted: 27 Mar 2008, 20:01:44 UTC - in response to Message 731067.  


He didn't say "free" - "effectively unlimited" just means a fatter pipe that doesn't get max'd out.

F.
ID: 731070 · Report as offensive
Profile [KWSN]John Galt 007
Volunteer tester
Joined: 9 Nov 99
Posts: 2444
Credit: 25,086,197
RAC: 0
United States
Message 731074 - Posted: 27 Mar 2008, 20:05:46 UTC - in response to Message 731070.  



Technically, not a fatter pipe, but a bigger faucet (router). But, who's technical anymore? ;-)>
Clk2HlpSetiCty:::PayIt4ward

ID: 731074 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 731088 - Posted: 27 Mar 2008, 21:02:15 UTC - in response to Message 731074.  
Last modified: 27 Mar 2008, 21:03:24 UTC




Technically, not a fatter pipe, but a bigger faucet (router). But, who's technical anymore? ;-)>


Oops! We...ll, er, that word is in my CV...

[Slapped wrist - hangs head.]

Of course, you are right. Thanks.

F.
ID: 731088 · Report as offensive
Profile Mr. Majestic
Volunteer tester
Joined: 26 Nov 07
Posts: 4752
Credit: 258,845
RAC: 0
United States
Message 731250 - Posted: 28 Mar 2008, 4:17:35 UTC

Thanks, your updates are always appreciated, Matt.

ID: 731250 · Report as offensive
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 731255 - Posted: 28 Mar 2008, 4:24:03 UTC - in response to Message 731088.  


Technically, not a fatter pipe, but a bigger faucet (router). But, who's technical anymore? ;-)>

I now understand; thank you for the reminder. Hope you all have a happy crunching weekend.
Cheers
Speedy
ID: 731255 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.