Barrel of Bottlenecks (Aug 15 2007)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 619900 - Posted: 15 Aug 2007, 22:57:38 UTC
Last modified: 15 Aug 2007, 23:01:08 UTC

First off, I should point out that the server status page isn't the most accurate thing in the world, especially now as I haven't yet converted any of this code to understand how the new multibeam splitters work (I've been busy). So please don't use the data on this particular web page to inspire panic - many splitters are running, and have been all night, even though the page shows none of them are running at all.

That said, we are slowly getting beyond some more of the growing pains in the conversion to multibeam. Here's the past 24 hours in a nutshell: the classic splitters only worked on Solaris/SPARC systems, so they were forced to run on our older (and therefore much slower) servers. So why were the new multibeam splitters, running on state-of-the-art Linux systems, running much, much slower? The first bottleneck: the local network. The only Linux server available as of yesterday (vader) was in our second lab, not in the data closet, so all the reading of raw data and writing of workunits was happening over the lab LAN. The workunit fileserver's scant few nfsd processes were clogged with these slow reads and writes, so the download server was blocked trying to read the freshly created workunits to send to our clients.
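
For anyone curious how you'd spot that kind of nfsd saturation, here's a minimal sketch of the idea - the /proc path is the standard one on a 2.6-era Linux kernel, but the script itself is purely illustrative and not part of our actual monitoring:

    # Sketch: check whether an NFS server's nfsd threads look saturated by
    # parsing the "th" line of /proc/net/rpc/nfsd. Run on the fileserver.
    # Illustrative only; not part of the SETI@home tooling.

    def nfsd_thread_stats(path="/proc/net/rpc/nfsd"):
        with open(path) as f:
            for line in f:
                fields = line.split()
                if fields and fields[0] == "th":
                    threads = int(fields[1])   # number of nfsd threads running
                    all_busy = int(fields[2])  # times every thread was busy at once
                    return threads, all_busy
        raise RuntimeError("no 'th' line found in %s" % path)

    if __name__ == "__main__":
        threads, all_busy = nfsd_thread_stats()
        print("nfsd threads: %d, all-busy events: %d" % (threads, all_busy))
        if all_busy > 0:
            # every thread has been busy simultaneously at least once; more
            # nfsd threads (e.g. RPCNFSDCOUNT on Red Hat-style boxes) may help
            print("nfsd thread pool has maxed out")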

So this morning Jeff and I worked to get some currently underutilized (but not yet completely configured) servers in the data closet up to snuff so they could take over splitting - namely lando and bambi (specs now included on the server status page). It has been taking all day to iron out all the kinks with these newer servers. In fact we quickly hit another bottleneck: the memory in lando - it was thrashing pretty hard. Just now, as I am writing this paragraph, Jeff confirmed that we got bambi working, so we'll see how far we can push that machine and take the load off lando. Jeff's working on this now.
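
(For the record, "thrashing" here just means the box was swapping hard. A rough, purely illustrative sketch of how one might spot that from /proc on a Linux host - again, not our actual monitoring:)

    # Sketch: flag heavy swapping ("thrashing") by sampling the cumulative
    # pswpin/pswpout counters in /proc/vmstat over a short interval.
    # Illustrative only; the threshold below is arbitrary.

    import time

    def swap_counters(path="/proc/vmstat"):
        counts = {}
        with open(path) as f:
            for line in f:
                key, value = line.split()
                if key in ("pswpin", "pswpout"):
                    counts[key] = int(value)
        return counts

    if __name__ == "__main__":
        interval = 10.0                      # seconds between samples
        before = swap_counters()
        time.sleep(interval)
        after = swap_counters()
        rate_in = (after["pswpin"] - before["pswpin"]) / interval
        rate_out = (after["pswpout"] - before["pswpout"]) / interval
        print("pages swapped in/out per second: %.1f / %.1f" % (rate_in, rate_out))
        if rate_in + rate_out > 100:
            print("sustained swapping - this box is likely thrashing")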

Further aggravations: we're still catching up from various recent outages and work shortages, so demand is quite high. On top of that, a bunch of the work we just sent out was terribly noisy - those workunits return very fast, which creates artificially increased demand.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 619900
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 619904 - Posted: 15 Aug 2007, 23:05:35 UTC

Blah! Sounds like a mess! :)

Oddly enough, I actually enjoy some of that frustrating work. ;) But for some reason, I don't envy you (I guess it has to do with the fact that I don't have to listen to 100,000 people on the internet complain that I haven't got a clue!).
ID: 619904
Profile speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 619906 - Posted: 15 Aug 2007, 23:08:49 UTC

Well, seems there is light at the end of the tunnel...

Jeff, Matt, thanx for the good work ! ! ! ! !


mic.


ID: 619906
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 619912 - Posted: 15 Aug 2007, 23:15:08 UTC

Matt,

Sounds like a major fire-fighting exercise, and I hate to land another one on you, but could you possibly have a look at Joe Segur's post in Number Crunching?

Seems like a number of the WUs split last night had a negative value in the <triplet_thresh> parameter in the workunit header, which Joe says is unexpected, meaningless and not catered for in the client. Sounds like a splitter logic error from here.

The effect is to cause WUs to progress very slowly - 0.05% in 2 hours seems typical - and then exit with a -9. Helps to keep the load off the download server, of course, but doesn't contribute much to the science.
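
For anyone who wants to check whether their own cached work is affected, here's a rough sketch of a scan for that parameter. The <triplet_thresh> tag name comes from Joe's report; the glob pattern and file layout are assumptions, so adjust for your own setup:

    # Sketch: look through downloaded workunit files for a negative
    # <triplet_thresh>, the symptom described above. The glob pattern and
    # file layout are assumptions - point it at your own BOINC data dir.

    import glob
    import re

    PATTERN = re.compile(r"<triplet_thresh>\s*([-+0-9.eE]+)\s*</triplet_thresh>")

    def find_bad_workunits(file_glob="projects/setiathome.berkeley.edu/*"):
        bad = []
        for name in glob.glob(file_glob):
            try:
                with open(name) as f:
                    match = PATTERN.search(f.read())
            except (IOError, UnicodeDecodeError):
                continue
            if match and float(match.group(1)) < 0:
                bad.append((name, float(match.group(1))))
        return bad

    if __name__ == "__main__":
        for name, value in find_bad_workunits():
            print("%s: triplet_thresh = %g" % (name, value))
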
ID: 619912
KB7RZF
Volunteer tester
Joined: 15 Aug 99
Posts: 9549
Credit: 3,308,926
RAC: 2
United States
Message 619953 - Posted: 16 Aug 2007, 0:06:09 UTC

Thank you for the news, Matt. Sorry to hear about all the headaches. I hope things get running better soon. I was hoping to come by the campus to meet you and Jeff and whoever was working, as I'll be in Sacramento on Friday, but due to other problems that recently arose I won't be able to make it that far down. Maybe the next time we get to Sac we can go the extra hour or so, come down to meet you guys face to face, and see the new servers. Keep up the awesome work, you guys, and thank you!!!

Jeremy
ID: 619953
Profile Jim Geuin

Joined: 17 May 99
Posts: 6
Credit: 5,538,490
RAC: 32
United States
Message 619972 - Posted: 16 Aug 2007, 0:36:59 UTC

Looks to me like the mechanism that sends the work units is backed up. Whereas in the past I received a new work unit about every 4000 seconds and had the next one queued up when the current unit finished, now I am finishing a new work unit in under 40 seconds and waiting for the next one.

I'd say that in the past, if you were sending say 500,000 units every 4000 seconds, now you are trying to send 500,000 units every 40 seconds. You may not have enough bandwidth to do that.
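
Putting that into numbers - a quick back-of-the-envelope sketch using only the figures above (they're illustrative guesses, not project measurements):

    # Back-of-the-envelope: if the same number of workunits must go out every
    # 40 s instead of every 4000 s, the required send rate scales by the
    # ratio of the intervals. Figures are the post's hypothetical numbers.

    units = 500000            # hypothetical workunits per batch
    old_interval_s = 4000.0   # seconds per batch before
    new_interval_s = 40.0     # seconds per batch now

    old_rate = units / old_interval_s   # 125 workunits per second
    new_rate = units / new_interval_s   # 12500 workunits per second

    print("old send rate: %.0f WU/s" % old_rate)
    print("new send rate: %.0f WU/s" % new_rate)
    print("demand multiplier: %.0fx" % (new_rate / old_rate))   # 100x
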
ID: 619972
Mithotar
Joined: 11 Apr 01
Posts: 88
Credit: 66,037,385
RAC: 50
United States
Message 620051 - Posted: 16 Aug 2007, 3:46:43 UTC - in response to Message 619950.  

Very nice work Matt, But I'm doing NNT now as My RAC was/is in a death dive that shows no signs of quitting and others feel this way too, As the Multiplier was changed to 2.85 from 3.35 and so We may be heading elsewhere or just shutting down until the Multiplier is raised back, As the WUs are taking longer and We're getting less, So If It isn't raised and My cache is empty, I QUIT!

===============================================================================
So you're going to take your PCs and go home.......... gee, didn't you outgrow that when you were 8 years old?

This is a shoestring-run, near volunteer-manned science project.... if finding the signal and having a bit of FUN along the way isn't your goal, and doing some good without getting an instant reward - maybe, just maybe, you are in the wrong place......

Hope you reconsider your words..... and stay. After all, if and when a signal is found....... who's gonna care what your RAC is/was...

Carry On
ID: 620051
Profile Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 30636
Credit: 53,134,872
RAC: 32
United States
Message 620054 - Posted: 16 Aug 2007, 3:59:59 UTC - in response to Message 619900.  

First off, I should point out that the server status page isn't the most accurate thing in the world [...]

- Matt


Wed Aug 15 20:53:16 2007|SETI@home Beta Test|Message from server: Project encountered internal error: shared memory

Sounds like you need to get some more RAM.
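
For anyone curious what "shared memory" refers to there: the BOINC feeder keeps a cache of ready-to-send results in a System V shared-memory segment that the scheduler attaches to, and that message generally means the attach failed. A hedged little sketch of checking the kernel's shared-memory limits on a Linux server (standard /proc entries; illustrative only):

    # Sketch: report the kernel's System V shared-memory limits, which bound
    # segments like the one the BOINC feeder creates. Illustrative only.

    def read_int(path):
        with open(path) as f:
            return int(f.read().strip())

    if __name__ == "__main__":
        shmmax = read_int("/proc/sys/kernel/shmmax")   # max bytes per segment
        shmall = read_int("/proc/sys/kernel/shmall")   # max total shm, in pages
        print("shmmax: %d bytes (%.1f MB)" % (shmmax, shmmax / 1e6))
        print("shmall: %d pages" % shmall)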


ID: 620054
Unbeliever

Joined: 6 Jan 01
Posts: 2
Credit: 1,455,862
RAC: 0
Croatia
Message 620087 - Posted: 16 Aug 2007, 5:41:45 UTC - in response to Message 619950.  

Very nice work Matt, But I'm doing NNT now as My RAC was/is in a death dive that shows no signs of quitting and others feel this way too, As the Multiplier was changed to 2.85 from 3.35 and so We may be heading elsewhere or just shutting down until the Multiplier is raised back, As the WUs are taking longer and We're getting less, So If It isn't raised and My cache is empty, I QUIT!


Batman, please don't speak in others' names. There are lots of us who don't feel that way. We are here to help in a good cause, not to keep score. The statistics are just a bit of fun, to keep an eye on progress. Matt and the others on the SETI team are doing heroic work, struggling to keep the project on track. Thumbs up, folks!

ID: 620087
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 620088 - Posted: 16 Aug 2007, 5:42:30 UTC

Thanx once again Matt, for taking the time to keep those of us who are interested informed of your daily battles. Sounds like you boys have been up to yer butts in alligators again.
I had to LOL when I read the part about Vader trying to help fill the work cache over your lab's LAN. No wonder things were tied in knots.
Keep up the good fight. I'm sure you'll get it all sorted soon.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 620088
Profile M4rtyn
Volunteer tester
Joined: 4 Aug 03
Posts: 48
Credit: 799,965
RAC: 0
United Kingdom
Message 620173 - Posted: 16 Aug 2007, 9:40:49 UTC - in response to Message 620087.  

Very nice work Matt, But I'm doing NNT now as My RAC was/is in a death dive that shows no signs of quitting and others feel this way too, As the Multiplier was changed to 2.85 from 3.35 and so We may be heading elsewhere or just shutting down until the Multiplier is raised back, As the WUs are taking longer and We're getting less, So If It isn't raised and My cache is empty, I QUIT!


Batman, please don't speak in others name. There are lots of us not feeling that way. We are here to help in good cause, not to keep scores. Statistics is just a bit of fun, to keep eye on progress. Matt and others in SETI team are doing heroic work struggling to keep project on track. Thumbs up, folks!


The percentage of people like myself, in it just for the credit, has always been higher at SETI than at other projects, simply because SETI paid more credit than the others, especially with the optimised apps. I know it would be easy to assume from the forum postings that this is not the case. A quick look will show you that, out of the 158,000 active users, the vast majority of posts come from a few devotees - probably fewer than 100. The so-called "credit junkie" is just less vocal here in the forum, as most have little interest in the project outside the numbers.

As the work we contribute is as useful as the next guy's, it would be both unfair and not in the project's interest to ignore complaints about credit.
Personally, I'm all in favour of a level "credit" playing field, as I would then be free to participate in any project I wish and still remain competitive in the overall stats.

m4rtyn
ID: 620173
Profile Sharkbait

Joined: 2 Jun 99
Posts: 11
Credit: 6,333,145
RAC: 0
United States
Message 620186 - Posted: 16 Aug 2007, 10:45:51 UTC

Hello all,

I just wanted to say that over the last couple of weeks I've seen again and again discouraging posts from people who are frustrated with SETI. I'm pretty tired of it. I'm not here for credit, or to run as many workunits as possible, and I only read this page because I like to see what progress the team is making. I'm here because the mission is about searching the universe for other intelligent life, something that I'm pretty sure we all believe in. So please keep your aggravated complaints to yourself. If you have a legitimate issue that is one thing, but if you're upset because your RAC has dived or something just know that I'm sure Matt and the rest of the team also wish that you had plenty of work to process and that everything was going easy for you, because then their job would be easier too! These guys work hard on a project that runs off of volunteer time and donations of money and equipment. If you want to see them succeed - donate! But please take your crying elsewhere.

Thank you SETI team for all your hard work, and I look forward to the day when SETI announces a significant finding because I truly believe that one day all this hard work will pay off. And that is worth more than any amount of credit I could get for a workunit.

To everyone else, keep up the good work!

Ben
ID: 620186
William Roeder
Volunteer tester
Joined: 19 May 99
Posts: 69
Credit: 523,414
RAC: 0
United States
Message 620243 - Posted: 16 Aug 2007, 13:20:04 UTC - in response to Message 619906.  

Well, seems there is light at the end of the tunnel...

The light from the oncoming train?

ID: 620243
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 620292 - Posted: 16 Aug 2007, 15:30:25 UTC

It seems that the current set of patched-together servers is unreliable, leading to partial or complete system failures. This must have the effect of throttling the actual overall work throughput (WUs processed per day). Plus, the complexity of keeping all the patched-together servers 'running' seems to limit the man-hours available to maintain and improve the software environment (web pages, statistics, debugged scripts, etc.).

So, rather than adding more servers and more unreliable hardware and software, wouldn't it make sense to reduce the number of servers? Bump up the RAM to the maximum in each box, adjust some OS parameters, simplify the networking, adjust workloads, and so on.

I realize the potential performance would be reduced, but the actual performance might equal what we are currently experiencing, or even be better. And doing so might free up more man-hours for the rest of the project.
ID: 620292
Bounce

Joined: 3 Apr 99
Posts: 66
Credit: 5,604,569
RAC: 0
United States
Message 620342 - Posted: 16 Aug 2007, 17:46:01 UTC - in response to Message 620292.  

excellent idea. please let the team know when they can expect the killer server you're shipping to them.
ID: 620342
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 620366 - Posted: 16 Aug 2007, 18:36:39 UTC - in response to Message 620342.  

excellent idea. please let the team know when they can expect the killer server you're shipping to them.


I guess you missed my point.
ID: 620366
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 620372 - Posted: 16 Aug 2007, 18:43:50 UTC - in response to Message 620366.  

excellent idea. please let the team know when they can expect the killer server you're shipping to them.


I guess you missed my point.

If I recall correctly, the servers are maxed out on RAM already.


BOINC WIKI
ID: 620372
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 620378 - Posted: 16 Aug 2007, 18:55:31 UTC

Server-wise - oddly enough, we were philosophically debating this yesterday: why not get one big, big, big server instead of screwing around with smaller ones and their apparently unique idiosyncrasies? Obviously we'd benefit from having all the CPUs/RAM/disks at our disposal on one system, but there are two cons off the top of my head:

1. Expense. As noted, pretty much all our servers are donated. We've gotten some big donations among them, but nevertheless nothing bigger than the Sun thumpers (24 TB disk, 8 GB RAM, 2 processors) or something like sidious (4 dual-core Xeons, 16 GB RAM, no disk). Roughly speaking, we'd need a single machine with at least 64 processors and 128 GB RAM before it would be useful. Not sure how much one of those costs, but systems like that are less likely to be thrown at us.

2. Ramp-up time. Since we are short-staffed and all busy doing too many things, it's *far* easier and faster to glom a new small server onto our current setup than to move everything over onto a new one in one fell swoop. A rough guess is that we'd be down for a month to convert our current system to a single piece of super-high-end hardware.

Obviously, #1 is the major con, and #2 is just a fine point. If we did obtain such hardware we'd gladly do whatever it took to actually use it.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 620378
whawn

Joined: 11 Apr 00
Posts: 18
Credit: 1,053,191
RAC: 2
United States
Message 620389 - Posted: 16 Aug 2007, 19:31:52 UTC - in response to Message 619912.  


The effect is to cause WUs to progress very slowly - 0.05% in 2 hours seems typical - and then exit with a -9. Helps to keep the load off the download server, of course, but doesn't contribute much to the science.


Had a WU that went for eleven hours yesterday, and still showed only .05% completed. I suspended that one and the next in line took only two hours. (I'm running SETI on my really old, slow box, because my double proc box seizes up on the BOINC client.) This morning there was no new work, so I resumed work on the very slow WU. It seemed to start over again from scratch, showing zero proc time, and now shows a sensible 2 hours to completion.

Also, BOINC is having a tough time downloading new work. It's been working on two new WUs for an hour or so already.

As for the credit discussion -- with all the credit I've got, and a $1.25, I can get a cup of weak coffee. IOW, I'm not here for the credits. I want ET to be able to phone home, is all.

ID: 620389
Profile Heechee
Joined: 29 Sep 99
Posts: 5
Credit: 13,765,984
RAC: 32
United States
Message 620396 - Posted: 16 Aug 2007, 19:55:56 UTC - in response to Message 620389.  


Had a WU that went for eleven hours yesterday, and still showed only .05% completed. [...]


I have a WU that I just suspended that had been running for 24:49:05 and only showed 0.020% complete with 27:48 left to completion.

The only reason I am here is to search for a signal from elsewhere in the universe. Credits are nice, but that shouldn't be what all this is about.

I think that Matt and the whole crew do an awesome job keeping things running! Thanks for all that you do!!

An ant on the move does more than a dozing ox.

- Lao Tzu
ID: 620396