There Goes a Tenner (Jan 20 2011)


Message boards : Technical News : There Goes a Tenner (Jan 20 2011)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1390
Credit: 74,079
RAC: 0
United States
Message 1068814 - Posted: 21 Jan 2011, 0:21:17 UTC

As expected it took about 1.5 days to copy all the results from our failed upload server (bruno) to the new upload server (synergy). I was out yesterday hence the lack of update from me, but nothing could get done until the result copy finished anyway.

Jeff and I tackled the remaining stuff this morning to bring synergy back up, and it's now pretending to be bruno. It's working fairly well except, predictably, the disk i/o subsystem isn't happy with lots of little random i/o's (there are only 4 working spindles on synergy, as opposed to 20 on bruno). Still, it's working heroically to recover from the past two days of data distribution silence.

Meanwhile, what the heck is wrong with bruno? I wish we knew. I've been battling this all day since getting synergy online. It seems there are fundamental issues that transcend disks/partitions/controllers. Random drives are disappearing, random partitions are disappearing, and this was still happening after taking the 3ware card out of the system entirely... We're stumped. It might just be a cluster of simple problems with confounding symptoms. I give up for now.

By the way, bruno was named after Giordano Bruno.

Also by the way, somebody asked if we should have two upload servers. We used to have the upload server split across two systems, but this wasn't helping; in fact it was making things worse. The problem is not a lack of network bandwidth, but disk i/o. The results have to live somewhere, and they require lots of random reads/writes. So it's best if the upload server saves the results on directly attached storage. If it is also serving them over NFS (or something equivalent) so that a second upload server can write to them, it's too much of an overhead drag. So the upload server has to be a single server which also (1) holds the results and (2) does as much of the backend processing on these result files as possible. I think right now the only backend processing on results which bruno does NOT do is assimilation, which vader handles. You might think "why not just have each upload server save the results IT gets on ITS own storage?" Then we end up with two piles of results, randomly split, and the NFS/mounting bottleneck is simply pushed down the pike to the validators, which need to read both piles at once.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4209
Credit: 34,469,243
RAC: 18,525
United Kingdom
Message 1068817 - Posted: 21 Jan 2011, 0:25:27 UTC - in response to Message 1068814.

Thanks for the update, Matt, and thanks to you and Jeff for getting the replacement Bruno working so quickly.

Claggy

Zapped Sparky
Project donor
Volunteer moderator
Volunteer tester
Joined: 30 Aug 08
Posts: 8907
Credit: 1,320,857
RAC: 699
United Kingdom
Message 1068865 - Posted: 21 Jan 2011, 3:39:30 UTC

Random drives are disappearing, random partitions are disappearing

Crikey that's a bit of a conundrum. I'm glad the data was transferred OK, and I hope you have one of those eureka moments :)

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7115
Credit: 61,257,416
RAC: 6,709
Germany
Message 1068867 - Posted: 21 Jan 2011, 3:47:37 UTC - in response to Message 1068814.
Last modified: 21 Jan 2011, 3:48:56 UTC

Matt, thanks for the news and the whole crew for their work!


BTW.
Just curious..
Normally after a longer outage we immediately see a ~50 Mbit/s upload peak on the Cricket graph, and the transfers overview in BOINC empties quickly.

Now, after ~9½ hours, we see a ~20 Mbit/s upload peak.
I still have hundreds of results waiting to upload and they can't go home (backlog).

Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: connect() failed
and
Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: HTTP error
____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.

DJStarfox
Joined: 23 May 01
Posts: 1045
Credit: 568,320
RAC: 353
United States
Message 1068874 - Posted: 21 Jan 2011, 4:00:41 UTC - in response to Message 1068814.

Ideally, to run multiple upload and post-processing servers, you'd need a very easy way to partition the results. Essentially, you'd have to create multiple pipelines. For example, odd numbered results _1, _3, etc. would go to upload1, validator1. Even numbered results _0, _2, etc. would go to upload2, validator2, and so on.
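
A minimal sketch of that split, assuming result names end in an underscore followed by a number (as in the _0, _1 examples above). The function name route_result and the sample result names are hypothetical; upload1/upload2 and validator1/validator2 are just the labels from the post, not real hosts.

```shell
#!/bin/bash
# Hypothetical sketch: pick a pipeline from the parity of a result's
# numeric suffix. The name format "<anything>_<number>" is an assumption.
route_result() {
    local name="$1"
    local suffix="${name##*_}"            # text after the last underscore

    # Refuse names whose suffix is not a plain number.
    if ! [[ "$suffix" =~ ^[0-9]+$ ]]; then
        echo "unroutable"
        return 1
    fi

    # 10# forces base 10 so a suffix like "08" is not read as octal.
    if (( 10#$suffix % 2 == 1 )); then
        echo "upload1 validator1"         # odd results
    else
        echo "upload2 validator2"         # even results
    fi
}

route_result "wu_example_1"
route_result "wu_example_0"
```

One caveat with this particular split: if the results a validator compares are the _0 and _1 copies of the same workunit, they would land in different pipelines, so keying on the workunit name rather than the result suffix might partition more cleanly.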

Blake Bonkofsky
Volunteer tester
Joined: 29 Dec 99
Posts: 617
Credit: 46,332,781
RAC: 0
United States
Message 1068928 - Posted: 21 Jan 2011, 7:31:55 UTC - in response to Message 1068867.

Matt, thanks for the news and the whole crew for their work!


BTW.
Just curious..
Normally after a longer outage we immediately see a ~50 Mbit/s upload peak on the Cricket graph, and the transfers overview in BOINC empties quickly.

Now, after ~9½ hours, we see a ~20 Mbit/s upload peak.
I still have hundreds of results waiting to upload and they can't go home (backlog).

Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: connect() failed
and
Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: HTTP error


Previously there was a 10-second backoff, too; that is now 5 minutes. I'm sure that leads to a lot of the "smoothing" we are seeing.
____________

rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8734
Credit: 61,635,402
RAC: 49,464
United Kingdom
Message 1068945 - Posted: 21 Jan 2011, 9:59:27 UTC

Performance-wise I'd expect Synergy to deliver about 10-20% of the throughput of Bruno. This means that the catch-up from an outage will be slower, with a higher retry count. So we have to sit here and be patient for a bit longer. So what? The data we are processing is already a few months old, and not "time critical".
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8760
Credit: 52,710,302
RAC: 24,754
United Kingdom
Message 1068951 - Posted: 21 Jan 2011, 10:41:37 UTC - in response to Message 1068945.

Performance-wise I'd expect Synergy to deliver about 10-20% of the throughput of Bruno. This means that the catch-up from an outage will be slower, with a higher retry count. So we have to sit here and be patient for a bit longer. So what? The data we are processing is already a few months old, and not "time critical".

Uploads and downloads seem to have settled down nicely, but there's quite a backlog growing for validations - also running on Synergy (aka 'the new Bruno'). They'll be held back by the lack of disk I/O, too - every validation attempt will require finding and retrieving at least two, and possibly several, previously uploaded result files.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5917
Credit: 61,699,828
RAC: 25,926
Australia
Message 1068958 - Posted: 21 Jan 2011, 11:16:43 UTC - in response to Message 1068956.

Fix Bruno, or add to Synergy?

I'd go with fix Bruno.
Synergy is already spoken for.
____________
Grant
Darwin NT.

Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 32309
Credit: 14,271,629
RAC: 10,419
United Kingdom
Message 1068968 - Posted: 21 Jan 2011, 11:49:35 UTC

In the server list it shows it as handling Nitpicker duties?

____________
Damsel Rescuer, Uli Devotee, Julie Supporter, ES99 Admirer,
Raccoon Friend, Anniet fan, Official crusty old fart


Blake Bonkofsky
Volunteer tester
Joined: 29 Dec 99
Posts: 617
Credit: 46,332,781
RAC: 0
United States
Message 1068971 - Posted: 21 Jan 2011, 12:01:00 UTC - in response to Message 1068969.

Either way, synergy is way more powerful than the demands of the upload server require. Look at the old specs of bruno compared to synergy. The RAM is different by a factor of twelve! To replace bruno completely probably wouldn't cost nearly as much as carolyn and oscar, nor even synergy.
____________

Andy Lee Robinson
Joined: 8 Dec 05
Posts: 615
Credit: 43,366,207
RAC: 18,381
Hungary
Message 1068980 - Posted: 21 Jan 2011, 12:43:04 UTC

Here's a little script I wrote that would gradually open up the flood gates and stop the fileserver from thrashing and dropping half-uploaded files:

#!/bin/bash
# choke - gradually open seti@home gates
# usage: choke [0-3|setup]
#
IPT=/sbin/iptables

# One-time setup: create the CHOKE chain and send new port-80 connections through it
if [ "$1" = "setup" ]; then
    $IPT -N CHOKE
    $IPT -F CHOKE
    $IPT -I INPUT 1 -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT
    $IPT -I INPUT 2 -p tcp -m state --state NEW --dport 80 -j CHOKE
fi

# Start from an empty chain (an empty chain rejects nothing)
$IPT -F CHOKE

# accept 0/4 traffic
if [ "$1" = 0 ]; then $IPT -A CHOKE -j REJECT; fi

# accept 1/4 traffic - reject *.1,2,3 but pass .0,4,8,12,16 etc.
if [ "$1" = 1 ]; then $IPT -A CHOKE \! -s 0.0.0.0/0.0.0.3 -j REJECT; fi

# accept 2/4 traffic - reject *.2,3 pass 0,1,4,5,8,9 etc.
if [ "$1" = 2 ]; then $IPT -A CHOKE \! -s 0.0.0.0/0.0.0.2 -j REJECT; fi

# accept 3/4 traffic - reject *.3 pass 0,1,2,4,5,6,8,9,10 etc.
if [ "$1" = 3 ]; then $IPT -A CHOKE -s 0.0.0.3/0.0.0.3 -j REJECT; fi


This would open up address space in 1/4 steps using inverse subnets, though could suffer from favouritism.
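
To see which addresses each step lets through, here is a small sketch that mimics the mask test in plain bash arithmetic, without touching iptables. It assumes a rule written as -s ADDR/MASK matches when (source AND MASK) equals (ADDR AND MASK); the helper name passes is made up for illustration, and only the last octet is checked because the masks in the script only cover its two low bits.

```shell
#!/bin/bash
# passes LEVEL OCTET
# Exit status 0 if a source address with this last octet would get past
# the CHOKE chain at the given level, 1 if it would be rejected.
passes() {
    local level="$1" octet="$2"
    case "$level" in
        0) return 1 ;;                    # reject everything
        1) (( (octet & 3) == 0 )) ;;      # pass only when both low bits are 0
        2) (( (octet & 2) == 0 )) ;;      # pass when bit 1 is 0
        3) (( (octet & 3) != 3 )) ;;      # reject only when both low bits are 1
        *) return 0 ;;                    # no rule in the chain: nothing rejected
    esac
}

# List the accepted last octets from 0 to 11 for each level.
for level in 0 1 2 3; do
    accepted=""
    for octet in {0..11}; do
        if passes "$level" "$octet"; then
            accepted+="$octet "
        fi
    done
    echo "level $level: ${accepted% }"
done
```

Level 0 prints an empty list, level 1 prints 0 4 8, level 2 prints 0 1 4 5 8 9, and level 3 prints 0 1 2 4 5 6 8 9 10.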

Another way would be to cycle quarters for an hour each until traffic reduced enough to open up completely.
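
The hourly cycling idea could be sketched as a dry run like this. It only prints the iptables commands it would run, so nothing is changed on the machine. The function name cycle_rule and the choice to derive the quarter from the hour of day are assumptions for illustration, not part of the script above.

```shell
#!/bin/bash
# Hypothetical dry run: print (do not execute) the CHOKE rules that would
# admit a single quarter of the address space for a given hour.
IPT=/sbin/iptables

cycle_rule() {
    local hour="$1"
    # 10# forces base 10 so hours like "08" are not read as octal.
    local quarter=$(( 10#$hour % 4 ))

    echo "$IPT -F CHOKE"
    # Reject every source whose two low address bits differ from $quarter.
    echo "$IPT -A CHOKE ! -s 0.0.0.${quarter}/0.0.0.3 -j REJECT"
}

# Show what would be applied for the current hour.
cycle_rule "$(date +%H)"
```

Run from cron once an hour (and executed rather than echoed), this would admit addresses whose last two bits equal 0, 1, 2 and 3 in turn, so every client gets a turn within four hours.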

DJStarfox
Joined: 23 May 01
Posts: 1045
Credit: 568,320
RAC: 353
United States
Message 1069010 - Posted: 21 Jan 2011, 15:30:44 UTC - in response to Message 1068958.

Fix Bruno, or add to Synergy?

I'd go with fix Bruno.
Synergy is already spoken for.


I disagree. Bruno is an old server that is having hardware problems again (dropped drives in array). There would be far fewer headaches just to add spindles to Synergy (a new server).

That's assuming there's money for either operation.

Todd Hebert
Volunteer tester
Joined: 16 Jun 00
Posts: 647
Credit: 217,127,962
RAC: 0
United States
Message 1069017 - Posted: 21 Jan 2011, 15:58:37 UTC - in response to Message 1069010.

Synergy does not have the ability to install additional drives in its chassis.
To add drives would require a new RAID card to allow external connections, a suitable drive array chassis, and the drives. You could ballpark this at $4k.

At this point you could purchase a SuperMicro storage server, motherboard, RAID card, memory and drives for around $5.5k (I'll donate the processors again). I've already been looking at this as an option.

Synergy was never intended to have the duties of Bruno - it was a compute server with 5x 1TB SAS2 hard drives to allow reliable operation by using RAID 6 (which has a bunch of overhead but excellent reliability).

There was a significant need to extend the overall science of S@H, and this server fit the bill to provide that; now it and other resources have been diverted away to meet the demands of the users.

Todd
____________

Todd Hebert
Volunteer tester
Joined: 16 Jun 00
Posts: 647
Credit: 217,127,962
RAC: 0
United States
Message 1069042 - Posted: 21 Jan 2011, 17:07:32 UTC

Oh, I fully agree! If it was the RAID card, I was prepared to just order one and have it drop-shipped to Berkeley, but it sounds like the problem is more than just that.

Drive arrays are such that they require regular maintenance and should be swapped out to prevent failures.

When I worked at Cray Research we had two large storerooms of full height 5.25" 1GB Micropolis SCSI drives and one storeroom would be empty in a month. Drives were used on average for 2500 - 3000 hours before they were replaced in the array. Granted they got beat up pretty hard with insane throughput needs and were in constant operation. But this is not unlike the needs of S@H.

Todd
____________


Copyright © 2014 University of California