Barrel of Bottlenecks (Aug 15 2007)

Profile RichaG
Volunteer tester
Avatar

Send message
Joined: 20 May 99
Posts: 1690
Credit: 19,287,294
RAC: 36
United States
Message 620404 - Posted: 16 Aug 2007, 20:18:48 UTC

I just set up a new computer and I can't even get the SETI application downloaded. I have tried a project reset a couple of times and that doesn't help.

There seems to be a good backlog of WUs now, so why can't it download the application?


Red Bull Air Racing

Gas price by zip at Seti

ID: 620404 · Report as offensive
Christoph
Volunteer tester

Send message
Joined: 21 Apr 03
Posts: 76
Credit: 355,173
RAC: 0
Germany
Message 620412 - Posted: 16 Aug 2007, 20:25:31 UTC - in response to Message 620378.  

Server wise - oddly enough we were philosophically debating this yesterday: why not get a big big big server instead of screwing around with smaller ones and their apparently unique idiosyncrasies? Obviously we'd benefit from having all the CPUs/RAM/disks at our disposal on one system but there are two cons off the top of my head:

1. Expense. As noted pretty much all our servers are donated. We've gotten some big donations among them, but nevertheless nothing bigger than the Sun thumpers (24 TB disk, 8 GB RAM, 2 processors) or something like sidious (4 dual-core Xeons, 16 GB RAM, no disk). Roughly speaking we need a single machine that has at least 64 processors and 128 GB RAM before making it useful. Not sure how much one of those cost, but systems like these are less likely to be thrown at us.

2. Ramp-up Time. Since we are short staffed and all busy doing too many things, it's *far* easier and faster to glom a new small server onto our current setup than to pour everything over onto a new one in one fell swoop. A rough guess is that we'd be down for a month to convert our current system to a single piece of super-high-end hardware.

Obviously, #1 is the major con, and #2 is just a fine point. If we did obtain such hardware we'd gladly do whatever it took to actually use it.

- Matt



I had a look at Sun: more than 500k dollars!
Christoph
ID: 620412 · Report as offensive
Profile yank (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Avatar

Send message
Joined: 15 Aug 99
Posts: 522
Credit: 22,545,639
RAC: 0
United States
Message 620422 - Posted: 16 Aug 2007, 20:35:06 UTC

For a 64-processor gorilla, 2.4 GHz SPARC64 VI dual-core chips, 6 MB of on-chip L2 cache, 128 GB of memory and 64 x 73 GB SAS drives raise the price tag to $10,100,320.

I don't know if this is what the SETI project needs, but if all the volunteers in the SETI program gave $10.00 each we'd be well on the way to buying one.
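
A quick back-of-the-envelope check of that suggestion (the price tag is the one quoted above; everything else is simple arithmetic, not an official figure):

# Back-of-the-envelope: how many $10 donations would cover the quoted price?
# The price comes from the figure above; the donor count just falls out of it.
price_tag = 10_100_320     # quoted price of the 64-processor system, in USD
donation  = 10.00          # suggested gift per volunteer

donors_needed = price_tag / donation
print(f"${donation:.2f} donations needed: {donors_needed:,.0f}")
# -> $10.00 donations needed: 1,010,032
# i.e. on the order of the million machines mentioned elsewhere in this thread.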


Donate to SETI and if possible join the US Navy team.
http://boinc.mundayweb.com/teamStats.php?userID=14824
ID: 620422 · Report as offensive
Profile Edward Lee Michau
Avatar

Send message
Joined: 31 Jul 06
Posts: 138
Credit: 9,640,846
RAC: 0
United States
Message 620479 - Posted: 16 Aug 2007, 21:55:46 UTC - in response to Message 620396.  


The effect is to cause WUs to progress very slowly - 0.05% in 2 hours seems typical - and then exit with a -9. Helps to keep the load off the download server, of course, but doesn't contribute much to the science.


Had a WU that went for eleven hours yesterday, and still showed only .05% completed. I suspended that one and the next in line took only two hours. (I'm running SETI on my really old, slow box, because my double proc box seizes up on the BOINC client.) This morning there was no new work, so I resumed work on the very slow WU. It seemed to start over again from scratch, showing zero proc time, and now shows a sensible 2 hours to completion.

Also BOINC is having a tough time DLing new work. It's been working on two new WU for an hour or so, already.

As for the credit discussion -- with all the credit I've got, and a $1.25, I can get a cup of weak coffee. IOW, I'm not here for the credits. I want ET to be able to phone home, is all.


I have a WU that I just suspended that had been running for 24:49:05 and only showed 0.020% complete with 27:48 left to completion.

The only reason I am here is to search for a signal from elsewhere in the universe. Credits are nice, but that shouldn't be what all this is about.

I think that Matt and the whole crew do an awesome job keeping things running! Thanks for all that you do!!

I have had that problem off and on, and have found that as soon as you notice the CPU time going up while the percent complete stays the same and the time to completion keeps growing, you should shut down the program and turn off the computer. Then restart the computer and the program. Most of the time the WU will start over and run right. If it still won't run right, abort that workunit. It's not worth running ten times too long and still getting nowhere. Somebody told me this on the Message Boards.
Ed
ID: 620479 · Report as offensive
buck.r

Send message
Joined: 1 Apr 06
Posts: 3
Credit: 276,788
RAC: 0
Australia
Message 620500 - Posted: 16 Aug 2007, 22:14:08 UTC

Ben - couldn't agree with you more ............. these guys are doing a terrific job and need all of our support with their current situation! I worked in the computer industry from the early 60's (yes, I'm vintage) and the fact that they keep us up to "speed", LOL, through all their debugging is incredible under the circumstances! Back off all you pretenders - the rest of us are happy just helping to look for the odd alien ...... and to hell with the number crunching!

Judy (from down under Down Under - almost the edge of the planet)
ID: 620500 · Report as offensive
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 620529 - Posted: 16 Aug 2007, 22:30:58 UTC

I don't mean this as criticism of your operations here; I just wanted to offer some food for thought with the following spiel.

Despite what the others think, in my professional opinion, it's better to have a few dedicated servers than have many little servers. But I'll take a slow, reliable server over a fast, flaky one any day.

The other issue that I do not understand is why several of your servers perform multiple roles: e.g., bruno is upload/download, scheduler, feeder, file_deleter1, transitioner, etc. I see you did well with the database servers being dedicated because I bet they work the hardest. If you lose a box like bruno, it would cripple the project.

I'm sure that with a couple of weeks of time, and maybe an additional server or two, you could redesign this setup to be more reliable. Also, if there's a Linux guru there, you can save a lot of money using Linux instead of SunOS; Sun's support contracts are outrageously priced. I think it would be worth your time not to have to come in on weekends to fix the servers. Run at least two servers for each function (for redundancy), and only give a server multiple roles when they are low-impact or low-priority (e.g., file deletion, validation, whatever).
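
To make the "two servers per function" idea concrete, here is a minimal sketch; the hostnames and role table are hypothetical, not the real SETI@home layout, and it only checks a role map for missing redundancy:

# Hypothetical sketch of the "at least two servers per critical function" idea.
# The hostnames and role assignments below are invented for illustration;
# they are NOT the actual SETI@home server layout.
critical_roles = {"upload", "download", "scheduler", "feeder", "transitioner"}

role_assignments = {
    "upload":       ["server-a", "server-b"],
    "download":     ["server-a", "server-b"],
    "scheduler":    ["server-c", "server-d"],
    "feeder":       ["server-c", "server-d"],
    "transitioner": ["server-e"],      # single host -> flagged below
    "file_deleter": ["server-e"],      # treated here as a low-impact role
}

for role in sorted(critical_roles):
    hosts = role_assignments.get(role, [])
    if len(hosts) < 2:
        print(f"WARNING: critical role '{role}' has no redundancy: {hosts}")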
ID: 620529 · Report as offensive
Jesse Viviano

Send message
Joined: 27 Feb 00
Posts: 100
Credit: 3,949,583
RAC: 0
United States
Message 620542 - Posted: 16 Aug 2007, 22:40:07 UTC - in response to Message 620529.  

I don't mean this as criticism of your operations here; I just wanted to offer some food for thought with the following spiel.

Despite what the others think, in my professional opinion, it's better to have a few dedicated servers than have many little servers. But I'll take a slow, reliable server over a fast, flaky one any day.

The other issue that I do not understand is why several of your servers perform multiple roles: e.g., bruno is upload/download, scheduler, feeder, file_deleter1, transitioner, etc. I see you did well with the database servers being dedicated because I bet they work the hardest. If you lose a box like bruno, it would cripple the project.

I'm sure that with a couple of weeks of time, and maybe an additional server or two, you could redesign this setup to be more reliable. Also, if there's a Linux guru there, you can save a lot of money using Linux instead of SunOS; Sun's support contracts are outrageously priced. I think it would be worth your time not to have to come in on weekends to fix the servers. Run at least two servers for each function (for redundancy), and only give a server multiple roles when they are low-impact or low-priority (e.g., file deletion, validation, whatever).

Actually, postprocessing tasks (e.g. validation, assimilation, transitioning, and deletion) are probably the highest-priority tasks in BOINC. Whenever there is a backlog of postprocessing work, it slows everything else down. First, the postprocessing tasks compete intensely for disk access. Second, a backlog means there are more files on the hard drives than there normally would be, which slows the file system down on every access. Third, the admins have noted that the disks get dangerously full whenever something keeps the deleters from operating at top efficiency for too long (e.g. they are disabled, they are slowed down by a file system clogged with too many files, or the network is saturated). So when those tasks fall behind, everything else falls behind with them.
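
A toy illustration of the third point (all the rates are invented, just to show the shape of the problem): if result files land on disk even slightly faster than the deleters can clear them, the file count grows without bound until a disk fills.

# Toy model of a deleter falling behind. The rates are invented placeholders,
# not real SETI@home numbers; the point is only that a small, sustained gap
# between arrival and deletion rates eventually fills the disks.
arrival_rate  = 12.0   # result files landing on disk per second (assumed)
deletion_rate = 10.0   # result files the deleter can clear per second (assumed)

files_queued = 0.0
for hour in range(1, 25):
    files_queued += (arrival_rate - deletion_rate) * 3600   # net growth per hour
    if hour % 6 == 0:
        print(f"after {hour:2d} h: ~{files_queued:,.0f} extra files on disk")
# after  6 h: ~43,200 extra files on disk
# after 12 h: ~86,400 extra files on disk   ...and so on until a disk is full.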
ID: 620542 · Report as offensive
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 620569 - Posted: 16 Aug 2007, 22:59:16 UTC - in response to Message 620542.  


Actually, postprocessing tasks (e.g. validation, assimilation, transitioning, and deletion) are probably the highest-priority tasks in BOINC. Whenever there is a backlog of postprocessing work, it slows everything else down. First, the postprocessing tasks compete intensely for disk access. Second, a backlog means there are more files on the hard drives than there normally would be, which slows the file system down on every access. Third, the admins have noted that the disks get dangerously full whenever something keeps the deleters from operating at top efficiency for too long (e.g. they are disabled, they are slowed down by a file system clogged with too many files, or the network is saturated). So when those tasks fall behind, everything else falls behind with them.


That's very informative, thank you. In that case, perhaps someone could take some performance measurements of the various components of the system. The trouble with fixing one bottleneck is that it often just creates another one further down the chain. It's like widening a road only to force the traffic to merge later: the bottleneck just moves somewhere else. That is why careful planning of the server resources is important. Using donated hardware and funds makes it really difficult to plan for such things, unfortunately, but not impossible.

The goal would be to figure out how fast the system can deal with finished results. Then you would (ideally) have just enough upload servers (and a fast, smart scheduler) to queue the finished results so the post-processing servers can pull from the upload servers at their own pace. I would also completely separate the splitters/result-generation servers from the post-processing ones. Bruno seems the ideal server to be this big upload/scheduler disk. It would be nice to have two download/upload/deleters, in case bruno hiccups.
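
One rough way to frame that measurement (my own sketch, with made-up stage rates): treat upload, validation, assimilation and deletion as a pipeline. Its sustainable throughput is simply the rate of its slowest stage, and that stage is the bottleneck worth fixing first.

# Sketch of the "measure each component" idea. A pipeline can only sustain the
# throughput of its slowest stage, so that stage is the one to fix first.
# All the rates below are invented placeholders, not real measurements.
stage_rates = {            # results per second each stage can handle
    "upload":       45.0,
    "validation":   30.0,
    "assimilation": 25.0,
    "deletion":     20.0,
}

bottleneck = min(stage_rates, key=stage_rates.get)
print(f"sustainable throughput: {stage_rates[bottleneck]:.0f} results/s "
      f"(limited by '{bottleneck}')")
# Speeding up any other stage only shifts the queue; raising the bottleneck
# stage's rate is what raises end-to-end throughput.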
ID: 620569 · Report as offensive
Profile Andy Lee Robinson
Avatar

Send message
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 620601 - Posted: 16 Aug 2007, 23:52:47 UTC

There are a million machines at SETI's disposal, all crunching.
The SETI project is centrally based with distributed crunching, but still a star network dependent on the core.

I have a couple of web servers on backbones that are spending 98% of their power on SETI, while not interfering with their primary purpose.
They could also be employed splitting and/or handling uploads/downloads, batching up the results and feeding them to the MSDB a few times a day.
Many of us have machines we can donate (virtually) to the project, and many of them sit topologically close to Berkeley.

Perhaps it's time for a fresh rethink of the project architecture, and to consider making the most of the P2P resources available - i.e. a million machines!?
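
As a purely hypothetical sketch of how a volunteer-run edge server could batch things up before forwarding them (none of this reflects actual BOINC code; the names and the idea of a scheduled bulk push are assumptions):

# Hypothetical sketch of the "batch up results and feed the MSDB a few times a
# day" idea: an edge server accepts uploads continuously but only forwards them
# upstream in bulk on a schedule, smoothing the load on Berkeley.
import time
from collections import deque

pending = deque()       # results accepted from clients, not yet forwarded

def accept_result(result_blob: bytes) -> None:
    """Called whenever a client uploads a finished result to the edge server."""
    pending.append((time.time(), result_blob))

def flush_upstream(send) -> int:
    """Run a few times a day: forward everything queued so far as one batch."""
    batch = list(pending)
    pending.clear()
    send(batch)         # one bulk transfer instead of many small ones
    return len(batch)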

Andy.
ID: 620601 · Report as offensive
OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Avatar

Send message
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 620606 - Posted: 16 Aug 2007, 23:57:23 UTC - in response to Message 620601.  

Perhaps it's time for a fresh rethink of the project architecture, and to consider making the most of the P2P resources available - i.e. a million machines!?


I've read somewhere that Rom would love to incorporate P2P-type capabilities into the BOINC framework. I'm not sure what the sticking point is, though I'm sure it's a technical one.
ID: 620606 · Report as offensive
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 620610 - Posted: 17 Aug 2007, 0:03:34 UTC - in response to Message 620542.  

I noticed they only budgeted $13,300 for this year's "Upgrade to failure-tolerant server configuration". I think it will take more than that. The biggest contributions come in December, but do you think that by next January you'll have enough to do those upgrades?

I also priced out the "near-time persistency checker compute server" from this year's goals. I found a Dell PowerEdge 6950 rack server for $10,353: two quad-core AMD Opterons, 32 GB RAM, and 500 GB of RAID5 storage (5 disks). You could add an external array if more space is needed.
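
For what it's worth, a quick check on the RAID5 part of that quote (the per-disk size is my assumption, chosen to land near the quoted capacity; I don't know the exact drives Dell priced): RAID5 gives (n - 1) disks' worth of usable space, with one disk's worth going to parity.

# RAID5 usable capacity: one disk's worth of space goes to parity, so
# usable = (n - 1) * disk_size. The 146 GB disk size is an assumption chosen
# to land near the quoted ~500 GB; the actual Dell configuration may differ.
def raid5_usable_gb(num_disks: int, disk_size_gb: float) -> float:
    if num_disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (num_disks - 1) * disk_size_gb

print(raid5_usable_gb(5, 146))   # -> 584 (GB usable from five 146 GB disks)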

I'd venture to say you guys could use about 16 TB more storage across your servers, especially now that multi-beam data is in production.
ID: 620610 · Report as offensive
Profile Pappa
Volunteer tester
Avatar

Send message
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 620623 - Posted: 17 Aug 2007, 0:19:11 UTC

I take it many have not researched the last major outage or the major outage before that, or read the Donations or Hardware Donations II threads. There was information there about how the infrastructure is put together. A lot can also be found here in the Tech News...

I will state a few simple things that I know...
* SETI is moving the machines that can run Linux over to Linux. Yes, that also has other issues. Next...
* It has been stated that the old splitters used prior to MultiBeam can only run on SunOS. Matt spent (I believe) over a week trying to port them...
* Part of the fly in the ointment is a NetApp filer (Fibre Channel) with NFS mounts. I have dealt with several NASes, including a couple of those, and they were cranky way back then (reading the news and some of the problems, it is even crankier now). They were originally built on a stripped-down BSD 3.x kernel... And that doesn't mention that they need a few new drives, as there are no spares.
* A second 24-port Gigabit switch and a couple of 8-port switches are needed... This comes from the end of Donate Hardware II.

So it becomes easy to offer advice (some of it actually good)... I used to do a lot of second-guessing at one point in time. Then I found out that things were not always what they appeared to be on the surface. Unless you can see it, touch it, and/or have someone explain it, it is very tough.

Since Matt started writing the Tech News, we now have more information than we have had in many years past. Thank you, Matt!

He does owe us a few pictures after the Server Closet Cleanup... But that will come in time...

Please be Patient.

Regards Pappa

Please consider a Donation to the Seti Project.

ID: 620623 · Report as offensive
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 620647 - Posted: 17 Aug 2007, 0:56:02 UTC - in response to Message 620623.  

So, they need a couple of switches to make the servers run smoother? I found a 24-port 1000M switch w/ 4 GBIC ports on NewEgg for $530. http://www.newegg.com/Product/Product.aspx?Item=N82E16833122178 I'll dig up what I can on eBay and see if I can donate one to you guys. Enough talk, more action, right? :)

The Linux vs. SunOS thing...yeah, that's a can of worms. Thank God for NFS.

I've heard of bad experiences with NAS units. They are picky pieces of hardware: good once you get them working right, just a pain to get there.

You brought up a good point. No one on the message boards, except the people working in the data center, really knows how things are arranged and configured. Thanks to Matt we can follow their reasoning and, at best, provide feedback on the actions they take. With my years of experience, I try only to offer constructive ideas and feedback in an effort to help. They are doing the best they can.

Yeah, Matt is a posting machine along with Jeff and the rest working late.
ID: 620647 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Avatar

Send message
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 620663 - Posted: 17 Aug 2007, 1:26:36 UTC - in response to Message 620569.  


Actually, postprocessing tasks (e.g. validation, assimilation, transitioning, and deletion) are probably the highest-priority tasks in BOINC. Whenever there is a backlog of postprocessing work, it slows everything else down. First, the postprocessing tasks compete intensely for disk access. Second, a backlog means there are more files on the hard drives than there normally would be, which slows the file system down on every access. Third, the admins have noted that the disks get dangerously full whenever something keeps the deleters from operating at top efficiency for too long (e.g. they are disabled, they are slowed down by a file system clogged with too many files, or the network is saturated). So when those tasks fall behind, everything else falls behind with them.


That's very informative, thank you. In that case, perhaps someone could take some performance measurements of the various components of the system. The trouble with fixing one bottleneck is that it often just creates another one further down the chain. It's like widening a road only to force the traffic to merge later: the bottleneck just moves somewhere else. That is why careful planning of the server resources is important. Using donated hardware and funds makes it really difficult to plan for such things, unfortunately, but not impossible.

The goal would be to figure out how fast the system can deal with finished results. Then you would (ideally) have just enough upload servers (and a fast, smart scheduler) to queue the finished results so the post-processing servers can pull from the upload servers at their own pace. I would also completely separate the splitters/result-generation servers from the post-processing ones. Bruno seems the ideal server to be this big upload/scheduler disk. It would be nice to have two download/upload/deleters, in case bruno hiccups.


And, as has been pointed out to me in the past, all the post-processing tasks have to be on Bruno (formerly kryten, RIP) because of disk access issues.

.

Hello, from Albany, CA!...
ID: 620663 · Report as offensive
zombie67 [MM]
Volunteer tester
Avatar

Send message
Joined: 22 Apr 04
Posts: 758
Credit: 27,771,894
RAC: 0
United States
Message 620696 - Posted: 17 Aug 2007, 2:54:55 UTC - in response to Message 620647.  

The Linux vs. SunOS thing...yeah, that's a can of worms. Thank God for NFS.


NFS...created by Sun in 1985. =;^)
Dublin, California
Team: SETI.USA
ID: 620696 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 620794 - Posted: 17 Aug 2007, 5:24:44 UTC - in response to Message 620647.  

So, they need a couple of switches to make the servers run smoother? I found a 24-port 1000M switch w/ 4 GBIC ports on NewEgg for $530. http://www.newegg.com/Product/Product.aspx?Item=N82E16833122178 I'll dig up what I can on eBay and see if I can donate one to you guys. Enough talk, more action, right? :)

The Linux vs. SunOS thing...yeah, that's a can of worms. Thank God for NFS.

I've heard of bad experiences with NAS units. They are picky pieces of hardware: good once you get them working right, just a pain to get there.

You brought up a good point. No one on the message boards, except the people working in the data center, really knows how things are arranged and configured. Thanks to Matt we can follow their reasoning and, at best, provide feedback on the actions they take. With my years of experience, I try only to offer constructive ideas and feedback in an effort to help. They are doing the best they can.

Yeah, Matt is a posting machine along with Jeff and the rest working late.

You are kinda missing the point.

It's great that you have "years of experience" and are willing to offer advice.

The problem isn't advice (or knowledge), it's money.

Most of what SETI has is either donated, or just plain hand-me-downs.

So, we could launch a fairly effective "distributed denial-of-service attack" with our suggestions (there are enough of us with ideas that we could easily overwhelm them) or we can give hardware, or cash, or both, and really help.
ID: 620794 · Report as offensive
zombie67 [MM]
Volunteer tester
Avatar

Send message
Joined: 22 Apr 04
Posts: 758
Credit: 27,771,894
RAC: 0
United States
Message 620816 - Posted: 17 Aug 2007, 5:46:09 UTC - in response to Message 620794.  

So, we could launch a fairly effective "distributed denial-of-service attack" with our suggestions (there are enough of us with ideas that we could easily overwhelm them) or we can give hardware, or cash, or both, and really help.


Great! Give us the HW list then. It's been asked for time after time. Still waiting.
Dublin, California
Team: SETI.USA
ID: 620816 · Report as offensive
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 621019 - Posted: 17 Aug 2007, 13:08:03 UTC - in response to Message 620794.  

The donation page says we can give a monetary gift or choose a piece of hardware from their list. So why can I not contribute by donating a piece of hardware from said list? I found a good 3Com switch of the kind that was quoted as needed, and I was about to get it....
ID: 621019 · Report as offensive
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 621023 - Posted: 17 Aug 2007, 13:18:09 UTC - in response to Message 620816.  

Great! Give us the HW list then. It's been asked for time after time. Still waiting.


Pappa posted the list about five messages back. The official list on the donation page is out of date, I believe.
http://setiathome.berkeley.edu//forum_thread.php?id=41554&nowrap=true#620623
ID: 621023 · Report as offensive
Bounce

Send message
Joined: 3 Apr 99
Posts: 66
Credit: 5,604,569
RAC: 0
United States
Message 621038 - Posted: 17 Aug 2007, 13:57:11 UTC - in response to Message 620606.  

Perhaps it's time for a fresh rethink of the project architecture, and to consider making the most of the P2P resources available - i.e. a million machines!?


I've read somewhere that Rom would love to incorporate P2P-type capabilities into the BOINC framework. I'm not sure what the sticking point is, though I'm sure it's a technical one.


Many times the biggest hurdle with farming out core processes is that the data "owners" have to be assured of data security and validity. Formal SLAs have to be hammered out so that the "contractor" is held to a certain level of performance, and so that the "contracting entity" (BOINC) knows its data is handled in accordance with its standards and expectations, and that the "farmed out" tasks will be there (day and night) at the same level as if they were running on its own hardware, in its own buildings, managed by its own employees.

These SLAs are often as difficult (or troublesome) to work out as the technology side of things. People offering up "donated services" think twice when they realize they can't pull those resources back as their own demands change.
ID: 621038 · Report as offensive