Barrel of Bottlenecks (Aug 15 2007)

Message boards : Technical News : Barrel of Bottlenecks (Aug 15 2007)
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 619900 - Posted: 15 Aug 2007, 22:57:38 UTC
Last modified: 15 Aug 2007, 23:01:08 UTC

First off, I should point out that the server status page isn't the most accurate thing in the world, especially now as I haven't yet converted any of this code to understand how the new multibeam splitters work (I've been busy). So please don't use the data on this particular web page to inspire panic - many splitters are running, and have been all night, even though the page shows none of them are running at all.

That said, we are slowly getting beyond some more of the growing pains in the conversion to multibeam. Here's the past 24 hours in a nutshell: the classic splitters only worked on Solaris/Sparc systems, so they were forced to run on our older (and therefore much slower) servers. So why were the new multibeam splitters, running on state-of-the-art Linux systems, running much much slower? The first bottleneck: the local network. The only Linux server available as of yesterday (vader) was in our second lab, not in the data closet, so all the reading of raw data and writing of workunits was happening over the lab LAN. The workunit fileserver's scant few nfsd processes were clogged on these slow reads/writes, and therefore the download server was getting blocked reading these freshly created workunits to send to our clients.

So this morning Jeff and I worked to get some currently underutilized (but not yet completely configured) servers in the data closet up to snuff so they could take over splitting. Namely lando and bambi (specs now included in the server status page). It has been taking all day to iron out all the kinks with these newer servers. In fact we hit another bottleneck quickly: the memory in lando - it was thrashing pretty hard. Just now as I am writing this paragraph Jeff confirmed that we got bambi working, so we'll see how far we can push that machine and take the load off lando. Jeff's working on this now.

Further aggravations: we're still catching up from various recent outages and work shortages, so demand is quite high. That and a bunch of the work we just sent out was terribly noisy - workunits are returning very fast thus creating an artificially increased demand.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 619900
OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 619904 - Posted: 15 Aug 2007, 23:05:35 UTC

Blah! Sounds like a mess! :)

Oddly enough, I actually enjoy some of that frustrating work. ;) But for some reason, I don't envy you (I guess it has to do with the fact that I don't have to listen to 100,000 people on the internet complain that I haven't got a clue!).
ID: 619904
speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 619906 - Posted: 15 Aug 2007, 23:08:49 UTC

Well, seems there is light at the end of the tunnel...

Jeff, Matt, thanx for the good work ! ! ! ! !


mic.


ID: 619906
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 619912 - Posted: 15 Aug 2007, 23:15:08 UTC

Matt,

Sounds like a major fire-fighting exercise, and I hate to land another one on you, but could you possibly have a look at Joe Segur's post in Number Crunching?

Seems like a number of the WUs split last night had a negative value in the <triplet_thresh> parameter in the workunit header, which Joe says is unexpected, meaningless and not catered for in the client. Sounds like a splitter logic error from here.

The effect is to cause WUs to progress very slowly - 0.05% in 2 hours seems typical - and then exit with a -9. Helps to keep the load off the download server, of course, but doesn't contribute much to the science.
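A sanity check of the kind that would catch this before a workunit goes out might look like the sketch below. It assumes the header is well-formed XML with a `<triplet_thresh>` element somewhere inside it. The sample header, the `find_bad_thresholds` helper, and the rule that every threshold must be positive are illustrative assumptions, not the actual splitter or client code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed workunit header; the real layout may differ.
SAMPLE_HEADER = """
<workunit_header>
  <analysis_cfg>
    <spike_thresh>24</spike_thresh>
    <triplet_thresh>-8.5</triplet_thresh>
  </analysis_cfg>
</workunit_header>
"""


def find_bad_thresholds(header_xml):
    """Return (tag, value) pairs for every *_thresh element that is not positive."""
    root = ET.fromstring(header_xml)
    bad = []
    for elem in root.iter():
        if not elem.tag.endswith("_thresh") or elem.text is None:
            continue
        value = float(elem.text)
        if value <= 0:  # assumed rule: a detection threshold must be positive
            bad.append((elem.tag, value))
    return bad


if __name__ == "__main__":
    for tag, value in find_bad_thresholds(SAMPLE_HEADER):
        print(f"suspicious {tag}: {value}")
```

Run against each freshly split workunit, a check like this would flag the bad header before it was queued for download.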
ID: 619912
KB7RZF
Volunteer tester
Joined: 15 Aug 99
Posts: 9549
Credit: 3,308,926
RAC: 2
United States
Message 619953 - Posted: 16 Aug 2007, 0:06:09 UTC

Thank you for the news, Matt. Sorry to hear about all the headaches. I hope things get running better soon. I was hoping to come by the campus to meet you and Jeff and whoever was working, as I'll be in Sacramento on Friday, but due to other problems that recently arose, I won't be able to make it that far down. Maybe the next time we get to Sac we can go the extra hour or so and come down to meet you guys face to face and see the new servers. Keep up the awesome work you guys, and thank you!!!

Jeremy
ID: 619953
Jim Geuin
Joined: 17 May 99
Posts: 6
Credit: 5,538,490
RAC: 32
United States
Message 619972 - Posted: 16 Aug 2007, 0:36:59 UTC

Looks to me like the mechanism that sends the work units is backed up. Whereas in the past I received a new work unit about every 4000 seconds and had it queued up when the current unit finished, now I am processing a new work unit in under 40 seconds and waiting for a new one.

I'd say that in the past, if you were sending say 500,000 units every 4000 seconds, now you are trying to send 500,000 units every 40 seconds. You may not have enough bandwidth to do that.
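The arithmetic behind that estimate can be sketched in a few lines. The 500,000-unit figure is the hypothetical from the paragraph above, and the per-workunit size is purely an assumption for illustration. The 100x ratio does not depend on either number.

```python
# Figures from the post: the same batch of workunits, requested 100x more often.
UNITS = 500_000
NORMAL_PERIOD_S = 4000
NOISY_PERIOD_S = 40

# Assumption for illustration only: roughly 350 KB per workunit download.
ASSUMED_WU_BYTES = 350 * 1024


def required_mbit_per_s(units, period_s, wu_bytes=ASSUMED_WU_BYTES):
    """Average download bandwidth in megabits per second."""
    return units * wu_bytes * 8 / period_s / 1e6


normal = required_mbit_per_s(UNITS, NORMAL_PERIOD_S)
noisy = required_mbit_per_s(UNITS, NOISY_PERIOD_S)
print(f"normal: {normal:,.0f} Mbit/s, noisy: {noisy:,.0f} Mbit/s, "
      f"ratio: {noisy / normal:.0f}x")
```

Whatever the real batch size and file size are, shortening the turnaround from 4000 seconds to 40 multiplies the average download demand by the same factor of 100.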
ID: 619972
Mithotar
Joined: 11 Apr 01
Posts: 88
Credit: 66,037,385
RAC: 50
United States
Message 620051 - Posted: 16 Aug 2007, 3:46:43 UTC - in response to Message 619950.  

Very nice work Matt, But I'm doing NNT now as My RAC was/is in a death dive that shows no signs of quitting and others feel this way too, As the Multiplier was changed to 2.85 from 3.35 and so We may be heading elsewhere or just shutting down until the Multiplier is raised back, As the WUs are taking longer and We're getting less, So If It isn't raised and My cache is empty, I QUIT!

===============================================================================
So you're going to take your PCs and go home..........gee, didn't you outgrow that
when you were 8 years old?

This is a shoestring run - near volunteer manned science project....if
finding the signal and having a bit of FUN along the way isn't your goal
and doing some good without getting an instant reward - maybe just maybe
you are in the wrong place......

Hope you reconsider your words .....and stay. After all, if and when a signal
is found.......who's gonna care what your RAC is/was...

Carry On
ID: 620051
Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30927
Credit: 53,134,872
RAC: 32
United States
Message 620054 - Posted: 16 Aug 2007, 3:59:59 UTC - in response to Message 619900.  

Wed Aug 15 20:53:16 2007|SETI@home Beta Test|Message from server: Project encountered internal error: shared memory

Sounds like you need to get some more RAM.


ID: 620054
Unbeliever
Joined: 6 Jan 01
Posts: 2
Credit: 1,455,862
RAC: 0
Croatia
Message 620087 - Posted: 16 Aug 2007, 5:41:45 UTC - in response to Message 619950.  

Very nice work Matt, But I'm doing NNT now as My RAC was/is in a death dive that shows no signs of quitting and others feel this way too, As the Multiplier was changed to 2.85 from 3.35 and so We may be heading elsewhere or just shutting down until the Multiplier is raised back, As the WUs are taking longer and We're getting less, So If It isn't raised and My cache is empty, I QUIT!


Batman, please don't speak in others' names. There are lots of us who don't feel that way. We are here to help in a good cause, not to keep score. Statistics are just a bit of fun, to keep an eye on progress. Matt and the others on the SETI team are doing heroic work struggling to keep the project on track. Thumbs up, folks!

ID: 620087
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 620088 - Posted: 16 Aug 2007, 5:42:30 UTC

Thanx once again Matt, for taking the time to keep those of us who are interested informed of your daily battles. Sounds like you boys have been up to yer butts in alligators again.
I had to LOL when I read the part about Vader trying to help fill the work cache over your lab's LAN. No wonder things were tied in knots.
Keep up the good fight. I'm sure you'll get it all sorted soon.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 620088
M4rtyn
Volunteer tester
Joined: 4 Aug 03
Posts: 48
Credit: 799,965
RAC: 0
United Kingdom
Message 620173 - Posted: 16 Aug 2007, 9:40:49 UTC - in response to Message 620087.  

The percentage of people like myself, in it just for the credit, has always been higher at SETI than at other projects, simply because SETI paid more credit than the others, especially with the optimised apps. I know it would be easy to assume from the forum postings that this is not the case. A quick look will show you that, out of the 158,000 active users, the vast majority of posts are from the few devotees, probably fewer than 100. The so-called "credit junkie" is just less vocal here in the forum, as most have little interest in the project outside the numbers.

As the work we contribute is as useful as the next guy's, it would be both unfair and not in the project's interest to ignore complaints about credit.
Personally I'm all in favour of a level "credit" playing field, as I would then be free to participate in any project I wish and still remain competitive in the overall stats.

m4rtyn
ID: 620173
Sharkbait
Joined: 2 Jun 99
Posts: 11
Credit: 6,333,145
RAC: 0
United States
Message 620186 - Posted: 16 Aug 2007, 10:45:51 UTC

Hello all,

I just wanted to say that over the last couple of weeks I've seen again and again discouraging posts from people who are frustrated with SETI. I'm pretty tired of it. I'm not here for credit, or to run as many workunits as possible, and I only read this page because I like to see what progress the team is making. I'm here because the mission is about searching the universe for other intelligent life, something that I'm pretty sure we all believe in. So please keep your aggravated complaints to yourself. If you have a legitimate issue that is one thing, but if you're upset because your RAC has dived or something just know that I'm sure Matt and the rest of the team also wish that you had plenty of work to process and that everything was going easy for you, because then their job would be easier too! These guys work hard on a project that runs off of volunteer time and donations of money and equipment. If you want to see them succeed - donate! But please take your crying elsewhere.

Thank you SETI team for all your hard work, and I look forward to the day when SETI announces a significant finding because I truly believe that one day all this hard work will pay off. And that is worth more than any amount of credit I could get for a workunit.

To everyone else, keep up the good work!

Ben
ID: 620186
William Roeder
Volunteer tester
Joined: 19 May 99
Posts: 69
Credit: 523,414
RAC: 0
United States
Message 620243 - Posted: 16 Aug 2007, 13:20:04 UTC - in response to Message 619906.  

Well, seems there is light at the end of the tunnel...

The light from the oncoming train?

ID: 620243
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 620292 - Posted: 16 Aug 2007, 15:30:25 UTC

It seems that the current set of patched-together servers is unreliable, leading to partial or complete overall system failures. This behavior must have the effect of throttling the actual overall work throughput (WUs processed per day). Plus, the complexity of keeping all the patched-together servers 'running' seems to limit the man-hours available to maintain and improve the software environment (web pages, statistics, debugged scripts, etc.)

So, rather than adding more servers and unreliable hardware and software, wouldn't it make sense to reduce the number of servers? Bump up the RAM to the maximum in each box, adjust some OS parameters, simplify the networking, adjust workloads, etc.

I realize the potential performance would be reduced, but the actual performance may equate to what we are currently experiencing or be better. And, doing so might increase the man-hours available for the rest of the project.
ID: 620292
Bounce
Joined: 3 Apr 99
Posts: 66
Credit: 5,604,569
RAC: 0
United States
Message 620342 - Posted: 16 Aug 2007, 17:46:01 UTC - in response to Message 620292.  

excellent idea. please let the team know when they can expect the killer server you're shipping to them.
ID: 620342
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 620366 - Posted: 16 Aug 2007, 18:36:39 UTC - in response to Message 620342.  

excellent idea. please let the team know when they can expect the killer server you're shipping to them.


I guess you missed my point.
ID: 620366
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 620372 - Posted: 16 Aug 2007, 18:43:50 UTC - in response to Message 620366.  

excellent idea. please let the team know when they can expect the killer server you're shipping to them.


I guess you missed my point.

If I recall correctly, the servers are maxed out on RAM already.


ID: 620372
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 620378 - Posted: 16 Aug 2007, 18:55:31 UTC

Server-wise - oddly enough we were philosophically debating this yesterday: why not get a big big big server instead of screwing around with smaller ones and their apparently unique idiosyncrasies? Obviously we'd benefit from having all the CPUs/RAM/disks at our disposal on one system but there are two cons off the top of my head:

1. Expense. As noted pretty much all our servers are donated. We've gotten some big donations among them, but nevertheless nothing bigger than the Sun thumpers (24 TB disk, 8 GB RAM, 2 processors) or something like sidious (4 dual-core Xeons, 16 GB RAM, no disk). Roughly speaking we need a single machine that has at least 64 processors and 128 GB RAM for it to be useful. Not sure how much one of those costs, but systems like these are less likely to be thrown at us.

2. Ramp-up Time. Since we are short staffed and all busy doing too many things, it's *far* easier and faster to glom a new small server onto our current setup than to pour everything over onto a new one in one fell swoop. A rough guess is that we'd be down for a month to convert our current system to a single piece of super-high-end hardware.

Obviously, #1 is the major con, and #2 is just a fine point. If we did obtain such hardware we'd gladly do whatever it took to actually use it.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 620378
whawn
Joined: 11 Apr 00
Posts: 18
Credit: 1,053,191
RAC: 2
United States
Message 620389 - Posted: 16 Aug 2007, 19:31:52 UTC - in response to Message 619912.  


The effect is to cause WUs to progress very slowly - 0.05% in 2 hours seems typical - and then exit with a -9. Helps to keep the load off the download server, of course, but doesn't contribute much to the science.


Had a WU that went for eleven hours yesterday, and still showed only .05% completed. I suspended that one and the next in line took only two hours. (I'm running SETI on my really old, slow box, because my double proc box seizes up on the BOINC client.) This morning there was no new work, so I resumed work on the very slow WU. It seemed to start over again from scratch, showing zero proc time, and now shows a sensible 2 hours to completion.

Also BOINC is having a tough time DLing new work. It's been working on two new WU for an hour or so, already.

As for the credit discussion -- with all the credit I've got, and a $1.25, I can get a cup of weak coffee. IOW, I'm not here for the credits. I want ET to be able to phone home, is all.

ID: 620389
Heechee
Joined: 29 Sep 99
Posts: 5
Credit: 13,765,984
RAC: 32
United States
Message 620396 - Posted: 16 Aug 2007, 19:55:56 UTC - in response to Message 620389.  


I have a WU that I just suspended that had been running for 24:49:05 and only showed 0.020% complete with 27:48 left to completion.
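To put those numbers in perspective, here is a rough linear extrapolation from elapsed time and reported progress. It is only a sketch: the helper names are made up for illustration, and it assumes the elapsed time is in HH:MM:SS and that progress continues at the same rate, which the BOINC client's own estimate evidently does not assume.

```python
def parse_hms(text):
    """Convert 'HH:MM:SS' into seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def projected_total_hours(elapsed_s, percent_done):
    """Linear projection of total run time from elapsed time and progress."""
    if percent_done <= 0:
        raise ValueError("no progress yet, cannot extrapolate")
    return elapsed_s / (percent_done / 100.0) / 3600


elapsed = parse_hms("24:49:05")
print(f"{projected_total_hours(elapsed, 0.020):,.0f} hours at the current rate")
```

By the same projection, the 0.05% in 2 hours reported earlier in the thread works out to about 4,000 hours, so neither workunit would finish in any practical sense.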

The only reason I am here is to search for a signal from elsewhere in the universe. Credits are nice, but that shouldn't be what all this is about.

I think that Matt and the whole crew do an awesome job keeping things running! Thanks for all that you do!!

An ant on the move does more than a dozing ox.

- Lao Tzu
ID: 620396
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.