There Goes a Tenner (Jan 20 2011)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
As expected, it took about 1.5 days to copy all the results from our failed upload server (bruno) to the new upload server (synergy). I was out yesterday, hence the lack of update from me, but nothing could get done until the result copy finished anyway. Jeff and I tackled the remaining stuff this morning to bring synergy back up, and it's now pretending to be bruno. It's working fairly well except, predictably, the disk i/o subsystem isn't happy with lots of little random i/o's (there are only 4 working spindles on synergy, as opposed to 20 on bruno). Still, it's working heroically to recover from the past two days of data distribution silence.
Meanwhile, what the heck is wrong with bruno? I wish we knew. I've been battling this all day since getting synergy on line. It seems there are fundamental issues that transcend disks/partitions/controllers. Random drives are disappearing, random partitions are disappearing, and this was still happening after taking the 3ware card out of the system entirely... We're stumped. It might just be a cluster of simple problems with confounding symptoms. I give up for now.
By the way, bruno was named after Giordano Bruno.
Also by the way, somebody asked if we should have two upload servers. We used to have the upload server split across two systems, but this wasn't helping - in fact it was making things worse. The problem is not a lack of network bandwidth, but a lack of disk i/o. The results have to live somewhere, and they require lots of random reads/writes. So it's best if the upload server saves the results on directly attached storage. If it also serves them over NFS (or some equivalent) so that a second upload server can write to them, it's too much of an overhead drag. So the upload server has to be a single server which also (1) holds the results and (2) does as much of the backend processing on these result files as possible. I think right now the only backend processing on results which bruno does NOT do is assimilation, which vader handles. You might think "why not just have each upload server save the results IT gets on ITS own storage?" Then we end up with two piles of results, randomly split, and the NFS/mounting bottleneck is simply pushed down the pike to the validators, who need to read both piles at once.
- Matt
-- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Thanks for the update Matt, and for you and Jeff getting the replacement Bruno working in quick time, Claggy |
Dimly Lit Lightbulb 😀 Joined: 30 Aug 08 Posts: 15399 Credit: 7,423,413 RAC: 1 |
"Random drives are disappearing, random partitions are disappearing"
Crikey, that's a bit of a conundrum. I'm glad the data was transferred OK, and I hope you have one of those eureka moments :) |
Dirk Sadowski Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Matt, thanks for the news, and thanks to the whole crew for their work! BTW, just curious: normally after a longer outage we immediately see a ~50 MBits/s UL peak on the Cricket graph, and the transfers overview in BOINC empties quickly. Now, after ~9 ½ hours, we see a ~20 MBits/s UL peak. I still have hundreds of results waiting for UL and they can't go home (backlog). The messages are "Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: connect() failed" and "Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: HTTP error" |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Ideally, to do multiple upload and post-processing servers, you'd need a very easy way to partition the results. Essentially, you'd have to create multiple pipelines. For example, odd-numbered results (_1, _3, etc.) would go to upload1 and validator1. Even-numbered results (_0, _2, etc.) would go to upload2 and validator2. |
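A minimal sketch of the odd/even split described in this post, assuming result names end in an underscore followed by a number. The function name `route_result` and the pipeline labels are hypothetical, taken from the example wording rather than from anything in the actual SETI@home backend.

```shell
# Hypothetical sketch: choose a pipeline from the parity of a result's
# trailing _N suffix. Pipeline names are illustrative only.
route_result() {
    # Require an underscore, then take the text after the last one.
    case $1 in
        *_*) suffix=${1##*_} ;;
        *)   echo "unroutable"; return 1 ;;
    esac
    # The suffix must be a non-empty string of digits.
    case $suffix in
        ''|*[!0-9]*) echo "unroutable"; return 1 ;;
    esac
    # Parity depends only on the last digit, so no arithmetic is needed.
    case $suffix in
        *[13579]) echo "upload1 validator1" ;;
        *)        echo "upload2 validator2" ;;
    esac
}

route_result "wu_example_1"   # odd suffix
route_result "wu_example_0"   # even suffix
```

One caveat, hedged: if the suffix distinguishes copies of the same workunit, as the _0/_1 naming suggests, then the results a validator has to compare would sit in different pipelines. That is the "two piles" problem Matt describes, so a split keyed on the workunit rather than the suffix might suit validation better.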
Blake Bonkofsky Joined: 29 Dec 99 Posts: 617 Credit: 46,383,149 RAC: 0 |
"Matt, thanks for the news and the whole crew for their work!"
Previously there was a 10 sec backoff too; that is now 5 minutes. I'm sure that leads to a lot of the "smoothing" we are seeing. |
rob smith Joined: 7 Mar 03 Posts: 22532 Credit: 416,307,556 RAC: 380 |
Performance-wise, I'd expect Synergy to manage about 10-20% of the throughput of Bruno. This means that the catch-up from an outage will be slower, with a higher retry count. So we have to sit here and be patient for a bit longer. So what? The data we are processing is already a few months old, and not "time critical".
Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
"Performance-wise, I'd expect Synergy to manage about 10-20% of the throughput of Bruno. This means that the catch-up from an outage will be slower, with a higher retry count. So we have to sit here and be patient for a bit longer. So what? The data we are processing is already a few months old, and not 'time critical'."
Uploads and downloads seem to have settled down nicely, but there's quite a backlog growing for validations - also running on Synergy (aka 'the new Bruno'). They'll be held back by the lack of disk I/O, too - every validation attempt will require finding and retrieving at least two, and possibly several, previously uploaded result files. |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
Sooooooooo. I am not sure where this leaves us. New server for Bruno? Or more spindles for Synergy? What do we need? Fix Bruno, or add to Synergy? Both? My Purrball is sitting here staring at me wondering what to do next. She really is. "Time is simply the mechanism that keeps everything from happening all at once." |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
"Fix Bruno, or add to Synergy?"
I'd go with fix Bruno. Synergy is already spoken for.
Grant Darwin NT |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
"Fix Bruno, or add to Synergy?"
For what? Last I heard, its duties were not spoken for yet.
"Time is simply the mechanism that keeps everything from happening all at once." |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
In the server list it shows it as handling Nitpicker duties? I think Matt said that was kinda a test routine. "Time is simply the mechanism that keeps everything from happening all at once." |
Blake Bonkofsky Joined: 29 Dec 99 Posts: 617 Credit: 46,383,149 RAC: 0 |
Either way, synergy is way more powerful than the demands of the upload server require. Look at the old specs of bruno compared to synergy. The RAM is different by a factor of twelve! To replace bruno completely probably wouldn't cost nearly as much as carolyn and oscar, nor even synergy. |
Andy Lee Robinson Joined: 8 Dec 05 Posts: 630 Credit: 59,973,836 RAC: 0 |
Here's a little script I wrote that would gradually open up the flood gates and stop the fileserver from thrashing and dropping half-uploaded files:
    #!/bin/bash
    # choke - gradually open seti@home gates
    # usage: choke [0-3|setup]
    #
    IPT=/sbin/iptables
    # setup: create the CHOKE chain and send new port-80 connections through it
    if [ "$1" = "setup" ]; then
        $IPT -N CHOKE
        $IPT -F CHOKE
        $IPT -I INPUT 1 -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT
        $IPT -I INPUT 2 -p tcp -m state --state NEW --dport 80 -j CHOKE
    fi
    # start from an empty chain (an empty CHOKE rejects nothing)
    $IPT -F CHOKE
    # accept 0/4 traffic
    if [ "$1" = 0 ]; then $IPT -A CHOKE -j REJECT; fi
    # accept 1/4 traffic - reject *.1,2,3 but pass .0,4,8,12,16 etc.
    if [ "$1" = 1 ]; then $IPT -A CHOKE \! -s 0.0.0.0/0.0.0.3 -j REJECT; fi
    # accept 2/4 traffic - reject *.2,3 pass 0,1,4,5,8,9 etc.
    if [ "$1" = 2 ]; then $IPT -A CHOKE \! -s 0.0.0.0/0.0.0.2 -j REJECT; fi
    # accept 3/4 traffic - reject *.3 pass 0,1,2,4,5,6,8,9,10 etc.
    if [ "$1" = 3 ]; then $IPT -A CHOKE -s 0.0.0.3/0.0.0.3 -j REJECT; fi
This would open up address space in 1/4 steps using inverse subnets, though it could suffer from favouritism. Another way would be to cycle quarters for an hour each until traffic reduced enough to open up completely. |
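To see which clients each step of the script above would let through, here is a small sketch that mimics the mask tests with shell arithmetic on the last octet of the source address. The helper name `passes` is hypothetical and nothing here touches iptables. It assumes the masks behave as bitwise ANDs on the address, which is how the rule comments in the script describe them.

```shell
# Hypothetical helper: would a client whose address ends in OCTET be let
# through at choke LEVEL (0-3)? Mirrors the mask tests with a bitwise AND.
passes() {
    case $1 in
        0) echo reject ;;                                                        # reject everyone
        1) if [ $(( $2 & 3 )) -eq 0 ]; then echo pass; else echo reject; fi ;;  # low two bits must be 00
        2) if [ $(( $2 & 2 )) -eq 0 ]; then echo pass; else echo reject; fi ;;  # bit 1 must be clear
        3) if [ $(( $2 & 3 )) -ne 3 ]; then echo pass; else echo reject; fi ;;  # only low bits 11 rejected
        *) echo "usage: passes LEVEL(0-3) OCTET" >&2; return 2 ;;
    esac
}

# List the last octets from 0 to 11 that get through at each level.
for level in 0 1 2 3; do
    line="level $level:"
    octet=0
    while [ "$octet" -le 11 ]; do
        if [ "$(passes "$level" "$octet")" = pass ]; then
            line="$line $octet"
        fi
        octet=$(( octet + 1 ))
    done
    echo "$line"
done
```

At level 2 the octets that pass are 0, 1, 4, 5, 8 and 9, so a host ending in .6 is rejected. The same addresses are favoured at every level, which is the favouritism the post mentions. Cycling which quarter is admitted would spread that out.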
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
"Fix Bruno, or add to Synergy?"
I disagree. Bruno is an old server that is having hardware problems again (dropped drives in the array). There would be far fewer headaches in just adding spindles to Synergy (a new server). That's assuming there's money for either operation. |
Todd Hebert Joined: 16 Jun 00 Posts: 648 Credit: 228,292,957 RAC: 0 |
Synergy does not have the ability to install additional drives in its chassis. Adding drives would require a new RAID card to allow external connections, a suitable drive array chassis, and the drives. You could ballpark this at $4k. At that point you could purchase a SuperMicro storage server, motherboard, RAID card, memory and drives for around $5.5k (I'll donate the processors again). I've already been looking at this as an option. Synergy was never intended to take on the duties of Bruno - it was a compute server with 5x 1TB SAS2 hard drives to allow reliable operation by using RAID 6 (which has a bunch of overhead but excellent reliability). There was a significant need to extend the overall science of S@H and this server fit the bill to provide this, and other resources have been diverted away to meet the demands of the users.
Todd |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
"Synergy does not have the ability to install additional drives in its chassis."
I suppose we should wait and see if Bruno is still viable.
"Time is simply the mechanism that keeps everything from happening all at once." |
Todd Hebert Joined: 16 Jun 00 Posts: 648 Credit: 228,292,957 RAC: 0 |
Oh, I fully agree! If it was the RAID card I was prepared to just order one and have it drop-shipped to Berkeley, but it sounds like the problem is more than just that. Drive arrays are such that they require regular maintenance, and drives should be swapped out to prevent failures. When I worked at Cray Research we had two large storerooms of full-height 5.25" 1GB Micropolis SCSI drives, and one storeroom would be empty in a month. Drives were used on average for 2500 - 3000 hours before they were replaced in the array. Granted, they got beat up pretty hard with insane throughput needs and were in constant operation. But this is not unlike the needs of S@H.
Todd |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
"Oh, I fully agree! If it was the raid card I was prepared to just order one and have it drop-shipped to Berkeley but it sounds like it is more than just that being the problem."
You worked at Cray??? I was impressed with your knowledge, and your generosity. Now I am REALLY impressed. That explains a lot.
"Time is simply the mechanism that keeps everything from happening all at once." |
Todd Hebert Joined: 16 Jun 00 Posts: 648 Credit: 228,292,957 RAC: 0 |
That was many, many years ago - late 80's into the early 90's. Went back to school to get my master's from UW-Madison and then went to Microsoft as a 5th level enterprise networking tech. Been up and down the road a few times :)
Todd |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.