There Goes a Tenner (Jan 20 2011)

Message boards : Technical News : There Goes a Tenner (Jan 20 2011)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1068814 - Posted: 21 Jan 2011, 0:21:17 UTC

As expected it took about 1.5 days to copy all the results from our failed upload server (bruno) to the new upload server (synergy). I was out yesterday hence the lack of update from me, but nothing could get done until the result copy finished anyway.

Jeff and I tackled the remaining stuff this morning to bring synergy back up, and it's now pretending to be bruno. It's working fairly well except, predictably, the disk i/o subsystem isn't happy with lots of little random i/o's (there are only 4 working spindles on synergy, as opposed to 20 on bruno). Still, it's working heroically to recover from the past two days of data distribution silence.

Meanwhile, what the heck is wrong with bruno? I wish we knew. I've been battling this all day since getting synergy online. It seems there are fundamental issues that transcend disks/partitions/controllers. Random drives are disappearing, random partitions are disappearing, and this was still happening after taking the 3ware card out of the system entirely... We're stumped. It might just be a cluster of simple problems with confounding symptoms. I give up for now.

By the way, bruno was named after Giordano Bruno.

Also by the way, somebody asked if we should have two upload servers. We used to have the upload server split across two systems, but this wasn't helping; in fact it was making things worse. The bottleneck is not network bandwidth but disk i/o. The results have to live somewhere, and they require lots of random reads and writes, so it's best if the upload server saves the results on directly attached storage. If it also serves them over NFS (or some equivalent) so that a second upload server can write to them, the overhead is too much of a drag. So the upload server has to be a single server which also (1) holds the results and (2) does as much of the backend processing on these result files as possible. I think right now the only backend processing on results which bruno does NOT do is assimilation, which vader handles. You might think "why not just have each upload server save the results IT gets on ITS own storage?" Then we end up with two piles of results, randomly split, and the NFS/mounting bottleneck is simply pushed down the pike to the validators, which need to read both piles at once.
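One practical consequence of the point about directly attached storage is that it helps to confirm the results directory really is local. A minimal sketch, assuming a Linux host with GNU coreutils `stat`; the `UPLOAD_DIR` variable, its `/tmp` default, and the list of filesystem names are placeholders for illustration, not the project's actual configuration:

```bash
#!/bin/bash
# Sketch: warn if a results directory lives on a network filesystem.
# UPLOAD_DIR and the list of network filesystem names are assumptions.

# Return 0 (true) if the given filesystem type name is a network filesystem.
is_network_fs() {
  case "$1" in
    nfs|nfs4|cifs|smb2|smbfs|afs|fuse.sshfs) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the filesystem type backing a path (GNU coreutils stat).
fs_type_of() {
  stat -f -c %T "$1"
}

UPLOAD_DIR="${UPLOAD_DIR:-/tmp}"   # hypothetical; substitute the real results directory
fstype=$(fs_type_of "$UPLOAD_DIR" 2>/dev/null) || fstype="unknown"

if is_network_fs "$fstype"; then
  echo "WARNING: $UPLOAD_DIR is on $fstype (network storage)"
else
  echo "OK: $UPLOAD_DIR is on $fstype (treated as local)"
fi
```

This only reports where the directory lives; it says nothing about whether the local disks can keep up with the random i/o load.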

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1068814
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1068817 - Posted: 21 Jan 2011, 0:25:27 UTC - in response to Message 1068814.  

Thanks for the update, Matt, and thanks to you and Jeff for getting the replacement Bruno working so quickly.

Claggy
ID: 1068817
Dimly Lit Lightbulb 😀
Volunteer tester
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1068865 - Posted: 21 Jan 2011, 3:39:30 UTC

Random drives are disappearing, random partitions are disappearing

Crikey that's a bit of a conundrum. I'm glad the data was transferred OK, and I hope you have one of those eureka moments :)
ID: 1068865
Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1068867 - Posted: 21 Jan 2011, 3:47:37 UTC - in response to Message 1068814.  
Last modified: 21 Jan 2011, 3:48:56 UTC

Matt, thanks for the news and the whole crew for their work!


BTW.
Just curious..
Normally, after a longer outage, we immediately see a ~50 Mbit/s upload peak on the Cricket graph, and the transfers overview in BOINC empties quickly.

Now, after ~9½ hours, we see a ~20 Mbit/s upload peak.
I still have hundreds of results waiting to upload, and they can't go home (backlog).

Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: connect() failed
and
Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: HTTP error
ID: 1068867
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 1068874 - Posted: 21 Jan 2011, 4:00:41 UTC - in response to Message 1068814.  

Ideally, to run multiple upload and post-processing servers, you'd need a very easy way to partition the results. Essentially, you'd have to create multiple pipelines. For example, odd-numbered results (_1, _3, etc.) would go to upload1 and validator1; even-numbered results (_0, _2, etc.) would go to upload2 and validator2.
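The odd/even split can be sketched as a small routing function. This is only an illustration: the pipeline names are hypothetical, and the only assumption about result names is that they end in `_<number>` as in the examples above.

```bash
#!/bin/bash
# Sketch of an odd/even result split. The pipeline names
# (upload1/validator1, upload2/validator2) are hypothetical.

# Print the pipeline for a result name ending in _<number>.
pipeline_for() {
  local name=$1
  local suffix=${name##*_}          # text after the last underscore
  case "$suffix" in
    ''|*[!0-9]*) echo "unknown"; return 1 ;;   # no numeric suffix
  esac
  if (( 10#$suffix % 2 == 1 )); then
    echo "upload1 validator1"       # odd: _1, _3, ...
  else
    echo "upload2 validator2"       # even: _0, _2, ...
  fi
}

pipeline_for "example_workunit_0"
pipeline_for "example_workunit_1"
```

One caveat worth hedging: if the validator compares the _0 and _1 results of the same workunit, splitting on the result suffix sends each pair to different pipelines, which is the "two piles" problem Matt describes. Partitioning on the workunit name instead would keep each pair together.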
ID: 1068874
Blake Bonkofsky
Volunteer tester
Joined: 29 Dec 99
Posts: 617
Credit: 46,383,149
RAC: 0
United States
Message 1068928 - Posted: 21 Jan 2011, 7:31:55 UTC - in response to Message 1068867.  

Matt, thanks for the news and the whole crew for their work!


BTW.
Just curious..
Normally, after a longer outage, we immediately see a ~50 Mbit/s upload peak on the Cricket graph, and the transfers overview in BOINC empties quickly.

Now, after ~9½ hours, we see a ~20 Mbit/s upload peak.
I still have hundreds of results waiting to upload, and they can't go home (backlog).

Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: connect() failed
and
Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: HTTP error


Previously the backoff was 10 seconds; it is now 5 minutes. I'm sure that accounts for a lot of the "smoothing" we are seeing.
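To put rough numbers on that, here is a back-of-envelope sketch. It assumes each stuck upload retries at a fixed interval, which is a simplification; the client's real retry schedule may vary the delay.

```bash
#!/bin/bash
# Back-of-envelope: upload attempts per hour for one stuck file,
# assuming a fixed retry interval (a simplification).

attempts_per_hour() {
  local backoff_seconds=$1
  echo $(( 3600 / backoff_seconds ))
}

old=$(attempts_per_hour 10)    # 10-second backoff
new=$(attempts_per_hour 300)   # 5-minute backoff
echo "10 s backoff:  $old attempts/hour"
echo "5 min backoff: $new attempts/hour"
echo "reduction factor: $(( old / new ))"
```

Under that assumption each pending file knocks on the server about 30 times less often, which would flatten the initial peak and stretch the recovery out over a longer period.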
ID: 1068928
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22532
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1068945 - Posted: 21 Jan 2011, 9:59:27 UTC

Performance-wise, I'd expect Synergy to deliver about 10-20% of Bruno's throughput. This means that the catch-up from an outage will be slower, with a higher retry count. So we have to sit here and be patient for a bit longer. So what? The data we are processing is already a few months old, and not "time critical".
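As a sanity check on that range, the spindle counts Matt gave earlier in the thread (4 working on synergy, 20 on bruno) can be turned into a rough ratio. This sketch assumes random-i/o throughput scales linearly with the number of spindles, which ignores RAID level, controller cache, and drive speed.

```bash
#!/bin/bash
# Rough estimate only: assumes random-i/o throughput scales linearly
# with the number of working spindles.

relative_throughput_pct() {
  local spindles_new=$1 spindles_old=$2
  echo $(( spindles_new * 100 / spindles_old ))
}

# Spindle counts as reported earlier in the thread.
echo "synergy vs bruno: $(relative_throughput_pct 4 20)%"
```

That lands at the top of the 10-20% estimate; parity overhead on writes could plausibly push the real figure lower.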
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1068945
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1068951 - Posted: 21 Jan 2011, 10:41:37 UTC - in response to Message 1068945.  

Performance-wise, I'd expect Synergy to deliver about 10-20% of Bruno's throughput. This means that the catch-up from an outage will be slower, with a higher retry count. So we have to sit here and be patient for a bit longer. So what? The data we are processing is already a few months old, and not "time critical".

Uploads and downloads seem to have settled down nicely, but there's quite a backlog growing for validations - also running on Synergy (aka 'the new Bruno'). They'll be held back by the lack of disk I/O, too - every validation attempt will require finding and retrieving at least two, and possibly several, previously uploaded result files.
ID: 1068951
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1068956 - Posted: 21 Jan 2011, 11:04:02 UTC

Sooooooooo.
I am not sure where this leaves us.
New server for Bruno?

Or more spindles for Synergy?

What do we need?

Fix Bruno, or add to Synergy?

Both?

My Purrball is sitting here staring at me wondering what to do next.
She really is.

"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1068956
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1068958 - Posted: 21 Jan 2011, 11:16:43 UTC - in response to Message 1068956.  

Fix Bruno, or add to Synergy?

I'd go with fix Bruno.
Synergy is already spoken for.
Grant
Darwin NT
ID: 1068958
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1068960 - Posted: 21 Jan 2011, 11:19:07 UTC - in response to Message 1068958.  

Fix Bruno, or add to Synergy?

I'd go with fix Bruno.
Synergy is already spoken for.

For what?

Last I heard, its duties were not spoken for yet.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1068960
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1068969 - Posted: 21 Jan 2011, 11:50:43 UTC - in response to Message 1068968.  

In the server list it shows it as handling Nitpicker duties?

I think Matt said that was kinda a test routine.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1068969
Blake Bonkofsky
Volunteer tester
Joined: 29 Dec 99
Posts: 617
Credit: 46,383,149
RAC: 0
United States
Message 1068971 - Posted: 21 Jan 2011, 12:01:00 UTC - in response to Message 1068969.  

Either way, synergy is way more powerful than the demands of the upload server require. Look at the old specs of bruno compared to synergy. The RAM is different by a factor of twelve! To replace bruno completely probably wouldn't cost nearly as much as carolyn and oscar, nor even synergy.
ID: 1068971
Andy Lee Robinson
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 1068980 - Posted: 21 Jan 2011, 12:43:04 UTC

Here's a little script I wrote that would gradually open up the flood gates and stop the fileserver from thrashing and dropping half-uploaded files:

#!/bin/bash
# choke - gradually open seti@home gates
# usage: choke [0-3|setup]
#   0-3    accept that many quarters of the source address space
#   setup  create the CHOKE chain and hook it into INPUT (run once)
#
IPT=/sbin/iptables

if [ "$1" = "setup" ]; then
  $IPT -N CHOKE
  $IPT -F CHOKE
  # let existing connections finish; send only new port-80 connections to CHOKE
  $IPT -I INPUT 1 -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT
  $IPT -I INPUT 2 -p tcp -m state --state NEW --dport 80 -j CHOKE
fi

# start from an empty chain; if no rule is added below, all traffic passes
$IPT -F CHOKE

# accept 0/4 traffic
if [ "$1" = 0 ]; then $IPT -A CHOKE -j REJECT; fi

# accept 1/4 traffic - reject *.1,2,3 but pass .0,4,8,12,16 etc.
if [ "$1" = 1 ]; then $IPT -A CHOKE \! -s 0.0.0.0/0.0.0.3 -j REJECT; fi

# accept 2/4 traffic - reject *.2,3 pass 0,1,4,5,8,9 etc.
if [ "$1" = 2 ]; then $IPT -A CHOKE \! -s 0.0.0.0/0.0.0.2 -j REJECT; fi

# accept 3/4 traffic - reject *.3 pass 0,1,2,4,5,6,8,9,10 etc.
if [ "$1" = 3 ]; then $IPT -A CHOKE -s 0.0.0.3/0.0.0.3 -j REJECT; fi


This would open up the address space in 1/4 steps using inverse subnets, though it could suffer from favouritism, since the same addresses are always let in first.
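The mask logic can be checked without touching a firewall. The sketch below emulates in plain bash which last octets each level would accept, on the understanding that a rule with mask 0.0.0.3 compares only the low two bits of the source address; the function name is made up for illustration.

```bash
#!/bin/bash
# Emulate the choke levels on the last octet of a source address.
# accepted_at_level is a hypothetical helper, not part of the script above.

# Return 0 if an address with this last octet is accepted at the level.
accepted_at_level() {
  local octet=$1 level=$2
  case "$level" in
    0) return 1 ;;                    # reject everything
    1) (( (octet & 3) == 0 )) ;;      # pass only low bits 00
    2) (( (octet & 2) == 0 )) ;;      # pass low bits 00 and 01
    3) (( (octet & 3) != 3 )) ;;      # reject only low bits 11
    *) return 0 ;;                    # empty chain: accept all
  esac
}

for level in 0 1 2 3; do
  passed=""
  for octet in 0 1 2 3 4 5 6 7 8 9 10 11; do
    accepted_at_level "$octet" "$level" && passed+="$octet "
  done
  echo "level $level passes: ${passed:-none}"
done
```

The listing makes the favouritism visible: every address admitted at level 1 is also admitted at levels 2 and 3, while addresses whose low two bits are 11 wait until the chain is fully open.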

Another way would be to cycle through the quarters, an hour each, until traffic has dropped enough to open up completely.
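The rotation idea could look something like the following dry-run sketch. The hour-to-quarter mapping and the helper names are assumptions, and the rule is only printed, not applied; it follows the same `\! -s addr/mask` pattern as the script above.

```bash
#!/bin/bash
# Sketch: each hour, admit one quarter of the address space, chosen by
# the low two bits of the last octet. Helper names are hypothetical.

quarter_of() {             # quarter (0-3) an IPv4 address falls in
  local ip=$1
  local last=${ip##*.}
  echo $(( last & 3 ))
}

open_quarter_for_hour() {  # quarter admitted during a given hour (0-23)
  local hour=$1
  echo $(( 10#$hour % 4 )) # 10# avoids octal parsing of "08" and "09"
}

is_admitted() {            # usage: is_admitted <ip> <hour>
  [ "$(quarter_of "$1")" -eq "$(open_quarter_for_hour "$2")" ]
}

hour=$(date +%H)
q=$(open_quarter_for_hour "$hour")

# Dry run: print the rule that would admit only quarter $q.
echo "/sbin/iptables -F CHOKE"
echo "/sbin/iptables -A CHOKE \\! -s 0.0.0.$q/0.0.0.3 -j REJECT"

ip="192.0.2.17"            # documentation address, for illustration
if is_admitted "$ip" "$hour"; then
  echo "$ip admitted in hour $hour"
else
  echo "$ip rejected in hour $hour"
fi
```

Run hourly from cron, something like this would give every quarter an equal turn, which addresses the favouritism of the stepped approach at the cost of making each client wait up to three hours.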
ID: 1068980
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 1069010 - Posted: 21 Jan 2011, 15:30:44 UTC - in response to Message 1068958.  

Fix Bruno, or add to Synergy?

I'd go with fix Bruno.
Synergy is already spoken for.


I disagree. Bruno is an old server that is having hardware problems again (dropped drives in the array). It would cause far fewer headaches just to add spindles to Synergy (a new server).

That's assuming there's money for either operation.
ID: 1069010
Todd Hebert
Volunteer tester
Joined: 16 Jun 00
Posts: 648
Credit: 228,292,957
RAC: 0
United States
Message 1069017 - Posted: 21 Jan 2011, 15:58:37 UTC - in response to Message 1069010.  

Synergy does not have the ability to install additional drives in its chassis.
Adding drives would require a new RAID card to allow external connections, a suitable drive array chassis, and the drives. You could ballpark this at $4k.

At this point you could purchase a SuperMicro storage server, motherboard, RAID card, memory and drives for around $5.5k (I'll donate the processors again). I've already been looking at this as an option.

Synergy was never intended to have the duties of Bruno. It was built as a compute server with 5x 1TB SAS2 hard drives in RAID 6 for reliable operation (RAID 6 has a bunch of overhead but excellent reliability).
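For a sense of the capacity side of that overhead: RAID 6 stores two drives' worth of parity, so usable space is (drives - 2) × drive size. A small sketch with the figures above (the function names are made up for illustration):

```bash
#!/bin/bash
# RAID 6 keeps two drives' worth of parity, so usable capacity is
# (drives - 2) * drive_size. Function names are hypothetical.

raid6_usable_tb() {
  local drives=$1 size_tb=$2
  if (( drives < 4 )); then
    echo "RAID 6 needs at least 4 drives" >&2
    return 1
  fi
  echo $(( (drives - 2) * size_tb ))
}

raid6_efficiency_pct() {
  local drives=$1
  echo $(( (drives - 2) * 100 / drives ))
}

echo "usable: $(raid6_usable_tb 5 1) TB of 5 TB raw"
echo "space efficiency: $(raid6_efficiency_pct 5)%"
```

Capacity is only part of the cost. Small random writes also have to update both parity blocks, which is likely the overhead that matters most for an upload server's workload.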

There was a significant need to extend the overall science of S@H, and this server fit the bill to provide that. This and other resources have been diverted away to meet the demands of the users.

Todd
ID: 1069017
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1069019 - Posted: 21 Jan 2011, 16:10:57 UTC - in response to Message 1069017.  

Synergy does not have the ability to install additional drives in its chassis.
Adding drives would require a new RAID card to allow external connections, a suitable drive array chassis, and the drives. You could ballpark this at $4k.

At this point you could purchase a SuperMicro storage server, motherboard, RAID card, memory and drives for around $5.5k (I'll donate the processors again). I've already been looking at this as an option.

Synergy was never intended to have the duties of Bruno. It was built as a compute server with 5x 1TB SAS2 hard drives in RAID 6 for reliable operation (RAID 6 has a bunch of overhead but excellent reliability).

There was a significant need to extend the overall science of S@H, and this server fit the bill to provide that. This and other resources have been diverted away to meet the demands of the users.

Todd
I suppose we should wait and see if Bruno is still viable.

"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1069019
Todd Hebert
Volunteer tester
Joined: 16 Jun 00
Posts: 648
Credit: 228,292,957
RAC: 0
United States
Message 1069042 - Posted: 21 Jan 2011, 17:07:32 UTC

Oh, I fully agree! If it were just the RAID card, I was prepared to order one and have it drop-shipped to Berkeley, but it sounds like the problem is more than that.

Drive arrays require regular maintenance, and drives should be swapped out proactively to prevent failures.

When I worked at Cray Research we had two large storerooms of full height 5.25" 1GB Micropolis SCSI drives and one storeroom would be empty in a month. Drives were used on average for 2500 - 3000 hours before they were replaced in the array. Granted they got beat up pretty hard with insane throughput needs and were in constant operation. But this is not unlike the needs of S@H.

Todd
ID: 1069042
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1069048 - Posted: 21 Jan 2011, 17:12:12 UTC - in response to Message 1069042.  

Oh, I fully agree! If it were just the RAID card, I was prepared to order one and have it drop-shipped to Berkeley, but it sounds like the problem is more than that.

Drive arrays require regular maintenance, and drives should be swapped out proactively to prevent failures.

When I worked at Cray Research we had two large storerooms of full height 5.25" 1GB Micropolis SCSI drives and one storeroom would be empty in a month. Drives were used on average for 2500 - 3000 hours before they were replaced in the array. Granted they got beat up pretty hard with insane throughput needs and were in constant operation. But this is not unlike the needs of S@H.

Todd

You worked at Cray???

I was impressed with your knowledge, and your generosity.

Now I am REALLY impressed. That explains a lot.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1069048
Todd Hebert
Volunteer tester
Joined: 16 Jun 00
Posts: 648
Credit: 228,292,957
RAC: 0
United States
Message 1069067 - Posted: 21 Jan 2011, 17:52:05 UTC - in response to Message 1069048.  

That was many, many years ago, late '80s into the early '90s. Went back to school to get my master's from UW-Madison and then went to Microsoft as a 5th level enterprise networking tech. Been up and down the road a few times :)
Todd
ID: 1069067
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.