There Goes a Tenner (Jan 20 2011)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 1068814 - Posted: 21 Jan 2011, 0:21:17 UTC As expected it took about 1.5 days to copy all the results from our failed upload server (bruno) to the new upload server (synergy). I was out yesterday, hence the lack of update from me, but nothing could get done until the result copy finished anyway. Jeff and I tackled the remaining stuff this morning to bring synergy back up, and it's now pretending to be bruno. It's working fairly well except, predictably, the disk i/o subsystem isn't happy with lots of little random i/o's (there are only 4 working spindles on synergy, as opposed to 20 on bruno). Still, it's working heroically to recover from the past two days of data distribution silence.
Meanwhile, what the heck is wrong with bruno? I wish we knew. I've been battling this all day since getting synergy online. It seems there are fundamental issues that transcend disks/partitions/controllers. Random drives are disappearing, random partitions are disappearing, and this was still happening after taking the 3ware card out of the system entirely... We're stumped. It might just be a cluster of simple problems with confounding symptoms. I give up for now. By the way, bruno was named after Giordano Bruno.
Also by the way, somebody asked if we should have two upload servers. We used to have the upload server split onto two systems but this wasn't helping - in fact it was making it worse. The problem is not a lack of network bandwidth, but disk i/o. The results have to live somewhere, and require lots of random reads/writes. So it's best if the upload server saves the results on directly attached storage. If it is also serving them over NFS (or something equivalent) such that a second upload server can write to them, it's too much of an overhead drag.
So the upload server has to be a single server which also (1) holds the results and (2) does as much of the backend processing on these result files as possible. I think right now the only backend processing on results which bruno does NOT do is assimilation, which vader handles. You might think "why not just have each upload server save the results IT gets on ITS own storage?" Then we end up with two piles of results, randomly split, and then the NFS/mounting bottleneck is simply pushed down the pike to the validators, who need to read both piles at once.
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 1068814 ·

Claggy Volunteer tester Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1068817 - Posted: 21 Jan 2011, 0:25:27 UTC - in response to Message 1068814. Thanks for the update Matt, and for you and Jeff getting the replacement Bruno working in quick time, Claggy ID: 1068817 ·

Dimly Lit Lightbulb 😀 Volunteer tester Send message Joined: 30 Aug 08 Posts: 15399 Credit: 7,423,413 RAC: 1	Message 1068865 - Posted: 21 Jan 2011, 3:39:30 UTC "Random drives are disappearing, random partitions are disappearing" Crikey, that's a bit of a conundrum. I'm glad the data was transferred OK, and I hope you have one of those eureka moments :) ID: 1068865 ·

Sutaru Tsureku Volunteer tester Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1068867 - Posted: 21 Jan 2011, 3:47:37 UTC - in response to Message 1068814. Last modified: 21 Jan 2011, 3:48:56 UTC Matt, thanks for the news and the whole crew for their work! BTW, just curious: normally after a longer outage we immediately see a ~50 MBits/s UL peak on the Cricket graph and the transfers overview in BOINC empties quickly. Now, after ~9½ hours, we see a ~20 MBits/s UL peak. I still have hundreds of results to upload and they can't go home (backlog). Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: connect() failed and Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: HTTP error ID: 1068867 ·

DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 1068874 - Posted: 21 Jan 2011, 4:00:41 UTC - in response to Message 1068814. Ideally, to do multiple upload and post-processing servers, you'd need a very easy way to partition the results. Essentially, you'd have to create multiple pipelines. For example, odd numbered results _1, _3, etc would go to upload1, validator1. Even numbered results _0, _2, etc would go to upload2, validator2, etc. ID: 1068874 ·
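The odd/even split suggested above can be sketched in a few lines of bash. This is only an illustration: the function name `pipeline_for`, the example result names, and the host names `upload1`/`validator1`/`upload2`/`validator2` are placeholders, and it assumes every result name ends in an underscore followed by a number.

```bash
#!/bin/bash
# pipeline_for RESULT_NAME
# Hypothetical helper: picks a pipeline from the numeric suffix after the
# last underscore (e.g. "_0", "_1"). Host names are placeholders.
pipeline_for() {
  local name=$1
  local suffix=${name##*_}
  # Reject names that do not end in "_<digits>"
  if [[ "$name" != *_* || ! "$suffix" =~ ^[0-9]+$ ]]; then
    echo "unknown"
    return 1
  fi
  # Force base 10 so a suffix such as "08" is not parsed as octal
  if (( 10#$suffix % 2 == 1 )); then
    echo "upload1 validator1"   # odd suffix
  else
    echo "upload2 validator2"   # even suffix
  fi
}

# Example names are made up for illustration
for r in example_wu_0 example_wu_1 example_wu_2 example_wu_3; do
  echo "$r -> $(pipeline_for "$r")"
done
```

One caveat with this kind of split: if the suffix distinguishes copies of the same workunit, as it appears to in BOINC result names, then the results a validator has to compare would land in different pipelines, which is the "two piles" problem Matt describes.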

Blake Bonkofsky Volunteer tester Send message Joined: 29 Dec 99 Posts: 617 Credit: 46,383,149 RAC: 0	Message 1068928 - Posted: 21 Jan 2011, 7:31:55 UTC - in response to Message 1068867. Matt, thanks for the news and the whole crew for their work! BTW, just curious: normally after a longer outage we immediately see a ~50 MBits/s UL peak on the Cricket graph and the transfers overview in BOINC empties quickly. Now, after ~9½ hours, we see a ~20 MBits/s UL peak. I still have hundreds of results to upload and they can't go home (backlog). Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: connect() failed and Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxx: HTTP error Previously there was also a 10 sec backoff, which is now 5 minutes. I'm sure that leads to a lot of the "smoothing" we are seeing. ID: 1068928 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22327 Credit: 416,307,556 RAC: 380	Message 1068945 - Posted: 21 Jan 2011, 9:59:27 UTC Performance-wise I'd expect Synergy to have about 10-20% of the throughput of Bruno. This means that the catch-up from an outage will be slower, with a higher retry count. So we have to sit here and be patient for a bit longer. So what? The data we are processing is already a few months old, and not "time critical". Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1068945 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14661 Credit: 200,643,578 RAC: 874	Message 1068951 - Posted: 21 Jan 2011, 10:41:37 UTC - in response to Message 1068945. Performance-wise I'd expect Synergy to have about 10-20% of the throughput of Bruno. This means that the catch-up from an outage will be slower, with a higher retry count. So we have to sit here and be patient for a bit longer. So what? The data we are processing is already a few months old, and not "time critical". Uploads and downloads seem to have settled down nicely, but there's quite a backlog growing for validations - also running on Synergy (aka 'the new Bruno'). They'll be held back by the lack of disk I/O, too - every validation attempt will require finding and retrieving at least two, and possibly several, previously uploaded result files. ID: 1068951 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51470 Credit: 1,018,363,574 RAC: 1,004	Message 1068956 - Posted: 21 Jan 2011, 11:04:02 UTC Sooooooooo. I am not sure where this leaves us. New server for Bruno? Or more spindles for Synergy? What do we need? Fix Bruno, or add to Synergy? Both? My Purrball is sitting here staring at me wondering what to do next. She really is. "Time is simply the mechanism that keeps everything from happening all at once." ID: 1068956 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13797 Credit: 208,696,464 RAC: 304	Message 1068958 - Posted: 21 Jan 2011, 11:16:43 UTC - in response to Message 1068956. Fix Bruno, or add to Synergy? I'd go with fix Bruno. Synergy is already spoken for. Grant Darwin NT ID: 1068958 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51470 Credit: 1,018,363,574 RAC: 1,004	Message 1068960 - Posted: 21 Jan 2011, 11:19:07 UTC - in response to Message 1068958. Fix Bruno, or add to Synergy? I'd go with fix Bruno. Synergy is already spoken for. For what? Last I heard, its duties were not spoken for yet. "Time is simply the mechanism that keeps everything from happening all at once." ID: 1068960 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51470 Credit: 1,018,363,574 RAC: 1,004	Message 1068969 - Posted: 21 Jan 2011, 11:50:43 UTC - in response to Message 1068968. In the server list it shows it as handling Nitpicker duties? I think Matt said that was kinda a test routine. "Time is simply the mechanism that keeps everything from happening all at once." ID: 1068969 ·

Blake Bonkofsky Volunteer tester Send message Joined: 29 Dec 99 Posts: 617 Credit: 46,383,149 RAC: 0	Message 1068971 - Posted: 21 Jan 2011, 12:01:00 UTC - in response to Message 1068969. Either way, synergy is way more powerful than the demands of the upload server require. Look at the old specs of bruno compared to synergy. The RAM is different by a factor of twelve! To replace bruno completely probably wouldn't cost nearly as much as carolyn and oscar, nor even synergy. ID: 1068971 ·

Andy Lee Robinson Send message Joined: 8 Dec 05 Posts: 630 Credit: 59,973,836 RAC: 0	Message 1068980 - Posted: 21 Jan 2011, 12:43:04 UTC Here's a little script I wrote that would gradually open up the flood gates and stop the fileserver from thrashing and dropping half-uploaded files:
#!/bin/bash
# choke - gradually open seti@home gates
# usage: choke [0-3|setup]
#
IPT=/sbin/iptables
if [ "$1" = "setup" ]; then
  $IPT -N CHOKE
  $IPT -F CHOKE
  $IPT -I INPUT 1 -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT
  $IPT -I INPUT 2 -p tcp -m state --state NEW --dport 80 -j CHOKE
fi
$IPT -F CHOKE
# accept 0/4 traffic
if [ "$1" = 0 ]; then $IPT -A CHOKE -j REJECT; fi
# accept 1/4 traffic - reject .1,2,3 but pass .0,4,8,12,16 etc.
if [ "$1" = 1 ]; then $IPT -A CHOKE \! -s 0.0.0.0/0.0.0.3 -j REJECT; fi
# accept 2/4 traffic - reject .2,3 pass 0,1,4,5,8,9 etc.
if [ "$1" = 2 ]; then $IPT -A CHOKE \! -s 0.0.0.0/0.0.0.2 -j REJECT; fi
# accept 3/4 traffic - reject *.3 pass 0,1,2,4,5,6,8,9,10 etc.
if [ "$1" = 3 ]; then $IPT -A CHOKE -s 0.0.0.3/0.0.0.3 -j REJECT; fi
This would open up address space in 1/4 steps using inverse subnets, though could suffer from favouritism. Another way would be to cycle quarters for an hour each until traffic reduced enough to open up completely. ID: 1068980 ·
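To check which source addresses each choke level lets through without touching a firewall, the mask arithmetic can be replayed on the last octet alone. This is a minimal sketch, assuming the same masks as the script above; the function names `choke_passes` and `passing_octets` are hypothetical and nothing here calls iptables.

```bash
#!/bin/bash
# choke_passes LEVEL OCTET
# Hypothetical helper: returns 0 (pass) if a source address whose last
# octet is OCTET would be accepted at choke LEVEL, 1 (rejected) otherwise.
# It mirrors the masks in the choke script; it does not call iptables.
choke_passes() {
  local level=$1 octet=$2
  case "$level" in
    0) return 1 ;;                  # reject everything
    1) (( (octet & 3) == 0 )) ;;    # pass only when the low two bits are 00
    2) (( (octet & 2) == 0 )) ;;    # pass when bit 1 is clear (low bits 00 or 01)
    3) (( (octet & 3) != 3 )) ;;    # reject only when the low two bits are 11
    *) return 0 ;;                  # any other level: chain is empty, fully open
  esac
}

# passing_octets LEVEL MAX: print the passing last octets from 0 to MAX
passing_octets() {
  local level=$1 max=$2 o out=""
  for (( o = 0; o <= max; o++ )); do
    if choke_passes "$level" "$o"; then out+="$o "; fi
  done
  echo "${out% }"
}

for level in 0 1 2 3; do
  echo "level $level: $(passing_octets "$level" 11)"
done
```

For octets 0 through 11 this lists 0 4 8 at level 1, 0 1 4 5 8 9 at level 2, and everything except 3, 7 and 11 at level 3, which shows the favouritism mentioned above: addresses ending in .0, .4, .8 and so on are admitted at every non-zero level, while those ending in .3, .7, .11 wait until the gates are fully open.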

DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 1069010 - Posted: 21 Jan 2011, 15:30:44 UTC - in response to Message 1068958. Fix Bruno, or add to Synergy? I'd go with fix Bruno. Synergy is already spoken for. I disagree. Bruno is an old server that is having hardware problems again (dropped drives in array). There would be far fewer headaches just adding spindles to Synergy (a new server). That's assuming there's money for either operation. ID: 1069010 ·

Todd Hebert Volunteer tester Send message Joined: 16 Jun 00 Posts: 648 Credit: 228,292,957 RAC: 0	Message 1069017 - Posted: 21 Jan 2011, 15:58:37 UTC - in response to Message 1069010. Synergy does not have the ability to install additional drives in its chassis. To add drives would require a new RAID card to allow external connections, a suitable drive array chassis and the drives. You could ballpark this at $4k. At this point you could purchase a SuperMicro storage server, motherboard, RAID card, memory and drives for around $5.5k (I'll donate the processors again). I've already been looking at this as an option. Synergy was never intended to have the duties of Bruno - it was a compute server with 5x 1TB SAS2 hard drives to allow reliable operation by using RAID 6 (which has a bunch of overhead but excellent reliability). There was a significant need to extend the overall science of S@H and this server fit the bill to provide this, and other resources have been diverted away to meet the demands of the users. Todd ID: 1069017 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51470 Credit: 1,018,363,574 RAC: 1,004	Message 1069019 - Posted: 21 Jan 2011, 16:10:57 UTC - in response to Message 1069017. Synergy does not have the ability to install additional drives in its chassis. To add drives would require a new RAID card to allow external connections, a suitable drive array chassis and the drives. You could ballpark this at $4k. At this point you could purchase a SuperMicro storage server, motherboard, RAID card, memory and drives for around $5.5k (I'll donate the processors again). I've already been looking at this as an option. Synergy was never intended to have the duties of Bruno - it was a compute server with 5x 1TB SAS2 hard drives to allow reliable operation by using RAID 6 (which has a bunch of overhead but excellent reliability). There was a significant need to extend the overall science of S@H and this server fit the bill to provide this, and other resources have been diverted away to meet the demands of the users. Todd I suppose we should wait and see if Bruno is still viable. "Time is simply the mechanism that keeps everything from happening all at once." ID: 1069019 ·

Todd Hebert Volunteer tester Send message Joined: 16 Jun 00 Posts: 648 Credit: 228,292,957 RAC: 0	Message 1069042 - Posted: 21 Jan 2011, 17:07:32 UTC Oh, I fully agree! If it was the RAID card I was prepared to just order one and have it drop-shipped to Berkeley, but it sounds like it is more than just that being the problem. Drive arrays require regular maintenance, and drives should be swapped out to prevent failures. When I worked at Cray Research we had two large storerooms of full height 5.25" 1GB Micropolis SCSI drives and one storeroom would be empty in a month. Drives were used on average for 2500 - 3000 hours before they were replaced in the array. Granted they got beat up pretty hard with insane throughput needs and were in constant operation. But this is not unlike the needs of S@H. Todd ID: 1069042 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51470 Credit: 1,018,363,574 RAC: 1,004	Message 1069048 - Posted: 21 Jan 2011, 17:12:12 UTC - in response to Message 1069042. Oh, I fully agree! If it was the RAID card I was prepared to just order one and have it drop-shipped to Berkeley, but it sounds like it is more than just that being the problem. Drive arrays require regular maintenance, and drives should be swapped out to prevent failures. When I worked at Cray Research we had two large storerooms of full height 5.25" 1GB Micropolis SCSI drives and one storeroom would be empty in a month. Drives were used on average for 2500 - 3000 hours before they were replaced in the array. Granted they got beat up pretty hard with insane throughput needs and were in constant operation. But this is not unlike the needs of S@H. Todd You worked at Cray??? I was impressed with your knowledge, and your generosity. Now I am REALLY impressed. That explains a lot. "Time is simply the mechanism that keeps everything from happening all at once." ID: 1069048 ·

Todd Hebert Volunteer tester Send message Joined: 16 Jun 00 Posts: 648 Credit: 228,292,957 RAC: 0	Message 1069067 - Posted: 21 Jan 2011, 17:52:05 UTC - in response to Message 1069048. That was many many years ago - late 80's into the early 90's. Went back to school to get my masters from UW-Madison and then went to Microsoft as a 5th level enterprise networking tech. Been up and down the road a few times :) Todd ID: 1069067 ·
