Barrel of Bottlenecks (Aug 15 2007)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
First off, I should point out that the server status page isn't the most accurate thing in the world, especially now as I haven't yet converted any of this code to understand how the new multibeam splitters work (I've been busy). So please don't use the data on this particular web page to inspire panic - many splitters are running, and have been all night, even though the page shows none of them are running at all. That said, we are slowly getting beyond some more of the growing pains in the conversion to multibeam. Here's the past 24 hours in a nutshell: the classic splitters only worked on Solaris/Sparc systems, so they were forced to run on our older (and therefore much slower) servers. So why were the new multibeam splitters, running on state-of-the-art Linux systems, running much much slower? The first bottleneck: the local network. The only Linux server available as of yesterday (vader) was in our second lab, not in the data closet, so all the reading of raw data and writing of workunits was happening over the lab LAN. The workunit fileserver's scant few nfsd processes were clogged on these slow reads/writes, and therefore the download server was getting blocked reading these freshly created workunits to send to our clients. So this morning Jeff and I worked to get some currently underutilized (but not yet completely configured) servers in the data closet up to snuff so they could take over splitting. Namely lando and bambi (specs now included in the server status page). It has been taking all day to iron out all the cracks with these newer servers. In fact we hit another bottleneck quickly: the memory in lando - it was thrashing pretty hard. Just now as I am writing this paragraph Jeff confirmed that we got bambi working, so we'll see how far we can push that machine and take the load off lando. Jeff's working on this now. Further aggravations: we're still catching up from various recent outages and work shortages, so demand is quite high. That and a bunch of the work we just sent out was terribly noisy - workunits are returning very fast, thus creating an artificially increased demand. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
Blah! Sounds like a mess! :) Oddly enough, I actually enjoy some of that frustrating work. ;) But for some reason, I don't envy you (I guess it has to do with the fact that I don't have to listen to 100,000 people on the internet complain that I haven't got a clue!). |
speedimic Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0 |
Well, seems there is light at the end of the tunnel... Jeff, Matt, thanx for the good work ! ! ! ! ! mic. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Matt, Sounds like a major fire-fighting exercise, and I hate to land another one on you, but could you possibly have a look at Joe Segur's post in Number Crunching? Seems like a number of the WUs split last night had a negative value in the <triplet_thresh> parameter in the workunit header, which Joe says is unexpected, meaningless and not catered for in the client. Sounds like a splitter logic error from here. The effect is to cause WUs to progress very slowly - 0.05% in 2 hours seems typical - and then exit with a -9. Helps to keep the load off the download server, of course, but doesn't contribute much to the science. |
KB7RZF Joined: 15 Aug 99 Posts: 9549 Credit: 3,308,926 RAC: 2 |
Thank you for the news Matt. Sorry to hear about all the heart aches. I hope things get to running better soon. I was hoping on coming by the campus to meet you and Jeff and whoever was working, as I'll be in Sacramento on Friday, but due to other problems that recently arose, I won't be able to make it that far down, maybe the next time we get to Sac we can go the extra hour or so and come down to meet you guys face to face and see the new servers. Keep up the awesome work you guys, and thank you!!! Jeremy |
Jim Geuin Joined: 17 May 99 Posts: 6 Credit: 5,538,490 RAC: 32 |
Looks to me like the mechanism that sends the work units is backed up. Where in the past I received a new work unit about every 4000 seconds and had it queued up when the current unit finished, now I am processing a new work unit in under 40 seconds and waiting for a new one. I'd say that in the past, if you were sending say 500,000 units every 4000 seconds, now you are trying to send 500,000 units every 40 seconds. You may not have enough bandwidth to do that. |
Mithotar Joined: 11 Apr 01 Posts: 88 Credit: 66,037,385 RAC: 50 |
Very nice work Matt, But I'm doing NNT now as My RAC was/is in a death dive that shows no signs of quitting and others feel this way too, As the Multiplier was changed to 2.85 from 3.35 and so We may be heading elsewhere or just shutting down until the Multiplier is raised back, As the WUs are taking longer and We're getting less, So If It isn't raised and My cache is empty, I QUIT! =============================================================================== So you're going to take your PCs and go home..........gee didn't you outgrow that when you were 8 years old ? This is a shoestring run - near volunteer manned science project....if finding the signal and having a bit of FUN along the way isn't your goal and doing some good without getting an instant reward - maybe just maybe you are in the wrong place...... Hope you reconsider your words .....and stay. After all if and when a signal is found.......who's gonna care what your RAC is/was... Carry On |
Gary Charpentier Joined: 25 Dec 00 Posts: 31012 Credit: 53,134,872 RAC: 32 |
First off, I should point out that the server status page isn't the most accurate thing in the world, especially now as I haven't yet converted any of this code to understand how the new multibeam splitters work (I've been busy). So please don't use the data on this particular web page to inspire panic - many splitters are running, and have been all night, even though the page shows none of them are running at all. Wed Aug 15 20:53:16 2007|SETI@home Beta Test|Message from server: Project encountered internal error: shared memory Sounds like you need to get some more ram. |
Unbeliever Joined: 6 Jan 01 Posts: 2 Credit: 1,455,862 RAC: 0 |
Very nice work Matt, But I'm doing NNT now as My RAC was/is in a death dive that shows no signs of quitting and others feel this way too, As the Multiplier was changed to 2.85 from 3.35 and so We may be heading elsewhere or just shutting down until the Multiplier is raised back, As the WUs are taking longer and We're getting less, So If It isn't raised and My cache is empty, I QUIT! Batman, please don't speak in others' names. There are lots of us not feeling that way. We are here to help in a good cause, not to keep scores. Statistics are just a bit of fun, to keep an eye on progress. Matt and others in the SETI team are doing heroic work struggling to keep the project on track. Thumbs up, folks! |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
Thanx once again Matt, for taking the time to keep those of us who are interested informed of your daily battles. Sounds like you boys have been up to yer butts in alligators again. I had to LOL when I read the part about Vader trying to help fill the work cache over your lab's lan. No wonder things were tied in knots. Keep up the good fight. I'm sure you'll get it all sorted soon. "Time is simply the mechanism that keeps everything from happening all at once." |
M4rtyn Joined: 4 Aug 03 Posts: 48 Credit: 799,965 RAC: 0 |
Very nice work Matt, But I'm doing NNT now as My RAC was/is in a death dive that shows no signs of quitting and others feel this way too, As the Multiplier was changed to 2.85 from 3.35 and so We may be heading elsewhere or just shutting down until the Multiplier is raised back, As the WUs are taking longer and We're getting less, So If It isn't raised and My cache is empty, I QUIT! The percentage of people like myself, in it just for the credit, has always been higher at SETI than at other projects, simply because SETI paid more credit than the others, especially with the optimised apps. I know it would be easy to assume from the forum postings that this is not the case. A quick look will show you that out of the 158,000 active users the vast majority of posts are from the few devotees, probably fewer than 100. The so-called "credit junkie" is just less vocal here in the forum, as most have little interest in the project outside the numbers. As the work we contribute is as useful as the next guy's, it would be both unfair and not in the project's interest to ignore complaints about credit. Personally I'm all in favour of a level "credit" playing field, as I would then be free to participate in any project I wish and still remain competitive in the overall stats. m4rtyn |
Sharkbait Joined: 2 Jun 99 Posts: 11 Credit: 6,333,145 RAC: 0 |
Hello all, I just wanted to say that over the last couple of weeks I've seen again and again discouraging posts from people who are frustrated with SETI. I'm pretty tired of it. I'm not here for credit, or to run as many workunits as possible, and I only read this page because I like to see what progress the team is making. I'm here because the mission is about searching the universe for other intelligent life, something that I'm pretty sure we all believe in. So please keep your aggravated complaints to yourself. If you have a legitimate issue that is one thing, but if you're upset because your RAC has dived or something just know that I'm sure Matt and the rest of the team also wish that you had plenty of work to process and that everything was going easy for you, because then their job would be easier too! These guys work hard on a project that runs off of volunteer time and donations of money and equipment. If you want to see them succeed - donate! But please take your crying elsewhere. Thank you SETI team for all your hard work, and I look forward to the day when SETI announces a significant finding because I truly believe that one day all this hard work will pay off. And that is worth more than any amount of credit I could get for a workunit. To everyone else, keep up the good work! Ben |
William Roeder Joined: 19 May 99 Posts: 69 Credit: 523,414 RAC: 0 |
Well, seems there is light at the end of the tunnel... The light from the oncoming train? |
PhonAcq Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
It seems that the current set of patched together servers are unreliable, leading to partial or complete overall system failures. This behavior must have the effect of throttling the actual overall work throughput (wu's processed per day). Plus, the complexity of keeping all the patched together servers 'running' seems to limit the manhours available to maintain and improve the software environment (web pages, statistics, debugged scripts, etc.) So, rather than adding more servers and unreliable hardware and software, wouldn't it make sense to reduce the number of servers? Bump up the RAM to the maximum in each box, adjust some OS parameters, simplify the networking, adjust workloads, etc. I realize the potential performance would be reduced, but the actual performance may equate to what we are currently experiencing or be better. And, doing so might increase the man-hours available for the rest of the project. |
Bounce Joined: 3 Apr 99 Posts: 66 Credit: 5,604,569 RAC: 0 |
excellent idea. please let the team know when they can expect the killer server you're shipping to them. |
PhonAcq Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
excellent idea. please let the team know when they can expect the killer server you're shipping to them. I guess you missed my point. |
John McLeod VII Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
excellent idea. please let the team know when they can expect the killer server you're shipping to them. If I recall correctly, the servers are maxed out on RAM already. |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Server-wise - oddly enough we were philosophically debating this yesterday: why not get a big big big server instead of screwing around with smaller ones and their apparently unique idiosyncrasies? Obviously we'd benefit from having all the CPUs/RAM/disks at our disposal on one system but there are two cons off the top of my head: 1. Expense. As noted, pretty much all our servers are donated. We've gotten some big donations among them, but nevertheless nothing bigger than the Sun thumpers (24 TB disk, 8 GB RAM, 2 processors) or something like sidious (4 dual-core Xeons, 16 GB RAM, no disk). Roughly speaking we need a single machine that has at least 64 processors and 128 GB RAM before making it useful. Not sure how much one of those costs, but systems like these are less likely to be thrown at us. 2. Ramp-up Time. Since we are short staffed and all busy doing too many things, it's *far* easier and faster to glom a new small server onto our current setup than to pour everything over onto a new one in one fell swoop. A rough guess is that we'd be down for a month to convert our current system to a single piece of super-high-end hardware. Obviously, #1 is the major con, and #2 is just a fine point. If we did obtain such hardware we'd gladly do whatever it took to actually use it. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
whawn Joined: 11 Apr 00 Posts: 18 Credit: 1,053,191 RAC: 2 |
Had a WU that went for eleven hours yesterday, and still showed only .05% completed. I suspended that one and the next in line took only two hours. (I'm running SETI on my really old, slow box, because my double proc box seizes up on the BOINC client.) This morning there was no new work, so I resumed work on the very slow WU. It seemed to start over again from scratch, showing zero proc time, and now shows a sensible 2 hours to completion. Also BOINC is having a tough time DLing new work. It's been working on two new WU for an hour or so, already. As for the credit discussion -- with all the credit I've got, and a $1.25, I can get a cup of weak coffee. IOW, I'm not here for the credits. I want ET to be able to phone home, is all. |
Heechee Joined: 29 Sep 99 Posts: 5 Credit: 13,765,984 RAC: 32 |
I have a WU that I just suspended that had been running for 24:49:05 and only showed 0.020% complete with 27:48 left to completion. The only reason I am here is to search for a signal from elsewhere in the universe. Credits are nice, but that shouldn't be what all this is about. I think that Matt and the whole crew do an awesome job keeping things running! Thanks for all that you do!! An ant on the move does more than a dozing ox. - Lao Tzu |
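
Jim Geuin's back-of-the-envelope estimate earlier in the thread (noisy workunits finishing in about 40 seconds instead of about 4000) can be sketched as below. This is only a rough illustration, not anything from the project's code: the function name `request_rate` is made up, the 500,000 figure is the "say 500,000" guess from that post, and the model assumes every host asks for a new workunit the moment it finishes one.

```python
def request_rate(active_hosts: int, seconds_per_workunit: float) -> float:
    """Workunit requests per second, assuming (hypothetically) that every
    host fetches a new workunit as soon as it finishes the current one."""
    if seconds_per_workunit <= 0:
        raise ValueError("seconds_per_workunit must be positive")
    return active_hosts / seconds_per_workunit


# Figures quoted in the thread; illustrative guesses, not measured server data.
HOSTS = 500_000
normal = request_rate(HOSTS, 4000)  # typical workunit, about 4000 s each
noisy = request_rate(HOSTS, 40)     # noisy workunit that exits after about 40 s

print(f"normal: {normal:.0f} requests/s")
print(f"noisy:  {noisy:.0f} requests/s")
print(f"demand multiplier: {noisy / normal:.0f}x")
```

Under these assumptions the multiplier depends only on the ratio of the two run times, so the host count could be any number and the download servers would still see roughly the same relative jump in demand.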
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.