Squeak (Aug 14 2007)

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 619283 - Posted: 14 Aug 2007, 23:12:53 UTC Last modified: 14 Aug 2007, 23:13:05 UTC Oy! We seem to be pushing our cranky old servers harder than they'd like. Sometimes it seems like a miracle these things performed as well as they have under such strain. Anyway - we had our usual database outage to back up/compress the database. During that we rebooted several machines to fix mounting problems, clean pipes, etc... One exhibited weird behavior on reboot but eventually we realized this was due to its newer kernel not having the right fibre card drivers. Oh yeah that. But then Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately. Still catching up from recent outages? One annoying thing is that our "TCP connection drops" monitor has been silently failing for who knows how long, so we haven't been correctly told how bad we've been suffering from dropped connections. But still, we've recovered much more quickly before. Is it the new multibeam splitters? They are writing to the file server over the lab LAN as opposed to our dedicated switch, but even still the writes amount to about 15 Mbits, tops, which the LAN is quite able to handle. The only major recent change we can think of is that we are now just sending out 2 copies of each workunit initially, as opposed to 3. So we reduced the probability that the workunit is in the file server's memory cache by as much as 33%. Perhaps this accounts for the slower performance. In any case, we spent too much time staring at log files, iostat output, network graphs, etc. and have since moved on to other projects for now. We figure the servers will either claw their way out of this problem on their own or we'll revisit it tomorrow.
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 619283 ·

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 619296 - Posted: 14 Aug 2007, 23:35:45 UTC ok, but does this explain why there are only 32 wu's ready to send right now? ID: 619296 ·

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 619299 - Posted: 14 Aug 2007, 23:50:08 UTC - in response to Message 619296. ok, but does this explain why there are only 32 wu's ready to send right now? Common misconception: every 30 minutes or so we take a snapshot of how many wu's are ready to send at that very second. In this case 32. A second later it may have been 100. Then zero a second after that. Then 5000 five minutes later, etc., but it'll still say 32 on the status page. Basically, the more important number is the result creation rate, which shows how many wu's are being made ready to send per second, and in this case, since the queue isn't growing, all of those are being sent to our users. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 619299 ·
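A toy simulation can illustrate the point about snapshots versus rates made in the reply above. This is only a sketch under stated assumptions: the creation rate of 9 per second, the bursty demand model, and the names `simulate` and `SNAPSHOT_INTERVAL_S` are all hypothetical and are not taken from the project's actual scheduler. Only the 30-minute snapshot interval comes from the post.

```python
import random

SNAPSHOT_INTERVAL_S = 30 * 60  # "every 30 minutes or so", per the post


def simulate(seconds, create_per_s, seed=0):
    """Toy model: return (queue snapshots, average send rate per second)."""
    rng = random.Random(seed)
    queue = 0
    sent_total = 0
    snapshots = []
    for t in range(seconds):
        # Splitters add new results to the ready-to-send queue.
        queue += create_per_s
        # Hypothetical bursty demand, averaging slightly above the creation rate.
        demand = rng.randint(0, 2 * create_per_s + 2)
        sent = min(queue, demand)
        queue -= sent
        sent_total += sent
        # The status page only sees the queue at these instants.
        if t % SNAPSHOT_INTERVAL_S == 0:
            snapshots.append(queue)
    return snapshots, sent_total / seconds


snapshots, send_rate = simulate(seconds=2 * 60 * 60, create_per_s=9)
print("queue snapshots:", snapshots)
print("average send rate per second:", round(send_rate, 2))
```

In this model the snapshots are small and vary from one sample to the next, while the average send rate stays close to the creation rate, which is why the rate is the more informative number when the queue is not growing.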

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 619301 - Posted: 14 Aug 2007, 23:52:17 UTC My point was that the level has dropped from 200K to 32. But obviously you are on top of things, so I'll chill out. ID: 619301 ·

gomeyer Volunteer tester Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0	Message 619303 - Posted: 14 Aug 2007, 23:56:46 UTC - in response to Message 619296. ok, but does this explain why there are only 32 wu's ready to send right now? Perhaps more to the point, the RTS queue steadily dropped all last night and this morning (EDT) so the splitters as they were configured during that timeframe were not keeping up with the load. Perhaps due to still catching up from the weekend? Whatever, more data for you. ID: 619303 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 619304 - Posted: 14 Aug 2007, 23:57:04 UTC Just at the moment, the server status page has been frozen for the last 40 minutes, saying that all splitters are offline (three disabled and four not running), yet the result creation rate is 8.97/sec. Kinda confusing - you can see why the questions get asked. Not urgent, but if you feel like a displacement activity while you mull over the download problem..... ID: 619304 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 619305 - Posted: 15 Aug 2007, 0:00:26 UTC - in response to Message 619299. Common misconception: every 30 minutes or so we take a snapshot of how many wu's are ready to send at that very second. In this case 32. A second later it may have been 100. Then zero a second after that. Then 5000 five minutes later, etc. but it'll still say 32 on the status page. Basically, the more important number is the result creation rate which shows how many wu's are being made ready to send per second, and in this case since the queue isn't growing, all of those are being sent to our users. - Matt Is there some way of also measuring the "assignment rate" that would sort-of mirror the creation rate? ID: 619305 ·

Andy Lee Robinson Send message Joined: 8 Dec 05 Posts: 630 Credit: 59,973,836 RAC: 0	Message 619401 - Posted: 15 Aug 2007, 4:16:47 UTC - in response to Message 619283. Last modified: 15 Aug 2007, 4:17:47 UTC But then Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately. MTU values? A couple of months ago Eric discovered that the MTU sizes weren't correct and reset them to 1476 because of a tunnel. Perhaps they've reset themselves to default values? http://setiathome.berkeley.edu/forum_thread.php?id=39742&nowrap=true#575114 Andy. ID: 619401 ·
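For readers wondering where a value like 1476 comes from, here is a small worked calculation. It assumes a standard 1500-byte Ethernet MTU and a GRE-over-IPv4 tunnel; the post only says "a tunnel", so the tunnel type is an assumption, and `tunnel_mtu` is a hypothetical helper name.

```python
ETHERNET_MTU = 1500  # standard Ethernet payload size in bytes
IPV4_HEADER = 20     # outer IPv4 header without options (assumed tunnel transport)
GRE_HEADER = 4       # basic GRE header without optional fields (assumed tunnel type)


def tunnel_mtu(link_mtu, *overheads):
    """Largest packet that fits through the tunnel without fragmentation."""
    return link_mtu - sum(overheads)


print(tunnel_mtu(ETHERNET_MTU, IPV4_HEADER, GRE_HEADER))  # 1476
```

If an interface falls back to the 1500-byte default while the path still runs through such a tunnel, full-size packets no longer fit and must be fragmented or dropped, which is the kind of slowdown the post is speculating about.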

RottenMutt Send message Joined: 15 Mar 01 Posts: 1011 Credit: 230,314,058 RAC: 0	Message 619426 - Posted: 15 Aug 2007, 5:26:20 UTC - in response to Message 619283. Oy! We seem to be pushing our cranky old servers harder than they'd like... why are the splitters off line??? ID: 619426 ·

RC Motts Send message Joined: 20 Sep 03 Posts: 1 Credit: 5,081,620 RAC: 0	Message 619431 - Posted: 15 Aug 2007, 5:39:45 UTC I haven't been able to get a single work unit from your servers (on any of my machines) for days now. The client just continually tries to download JPGs, help files, etc. on all machines, then simply gives up. What's going on? ID: 619431 ·

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19048 Credit: 40,757,560 RAC: 67	Message 619441 - Posted: 15 Aug 2007, 6:22:45 UTC - in response to Message 619431. I haven't been able to get a single work unit from your servers (on any of my machines) for days now. The client just continually tries to download JPGs, help files, etc. on all machines, then simply gives up. What's going on? If you are having trouble downloading the default application from Berkeley, why don't you download the optimised app from Rev 2.4 Optimised apps. The Intel core 2 one will work on your T7200, but I'm not sure about your T2500. If you do not know its extended capabilities you could try CPUz or the test tools found on the downloads/Tools and benchmark page. If you don't want to run optimised permanently, then when Berkeley is back on an even keel, closing BOINC and renaming or removing the app_info file will return you to normal after BOINC is restarted. Andy ID: 619441 ·

Swibby Bear Send message Joined: 1 Aug 01 Posts: 246 Credit: 7,945,093 RAC: 0	Message 619550 - Posted: 15 Aug 2007, 13:38:35 UTC - in response to Message 619283. Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately. I got a whole bunch of 2-hour WUs. If everyone got some, then you are trying to process 3 or 4 times the normal workload. I forced my box to only do the 8-hour units for now, in hopes that you will get caught up. And yes, everyone is still playing catch up from the outages. Where is Kang when we need him/her? Whit ID: 619550 ·

Neonblue Send message Joined: 15 May 07 Posts: 3 Credit: 18,372,907 RAC: 33	Message 619554 - Posted: 15 Aug 2007, 13:47:57 UTC So it's not a problem on my side that my computer is not getting any work to perform after last weekend? ID: 619554 ·

KWSN THE Holy Hand Grenade! Volunteer tester Send message Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0	Message 619597 - Posted: 15 Aug 2007, 15:16:06 UTC - in response to Message 619550. Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately. I got a whole bunch of 2-hour WUs. If everyone got some, then you are trying to process 3 or 4 times the normal workload. I forced my box to only do the 8-hour units for now, in hopes that you will get caught up. [serious mode] How do you do that?[/serious mode] And yes, everyone is still playing catch up from the outages. Where is Kang when we need him/her? Whit Kang is on the other side of the wormhole at the moment and is unavailable... <G> . Hello, from Albany, CA!... ID: 619597 ·

RandyC Send message Joined: 20 Oct 99 Posts: 714 Credit: 1,704,345 RAC: 0	Message 619627 - Posted: 15 Aug 2007, 16:24:37 UTC - in response to Message 619597. Last modified: 15 Aug 2007, 16:25:18 UTC Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately. I got a whole bunch of 2-hour WUs. If everyone got some, then you are trying to process 3 or 4 times the normal workload. I forced my box to only do the 8-hour units for now, in hopes that you will get caught up. [serious mode] How do you do that?[/serious mode] He probably just suspends the short WUs with Boincmgr [edit typos] ID: 619627 ·

Marko Volunteer tester Send message Joined: 2 Jun 99 Posts: 10 Credit: 659,205 RAC: 0	Message 619660 - Posted: 15 Aug 2007, 17:40:40 UTC - in response to Message 619597. I got a whole bunch of 2-hour WUs. If everyone got some, then you are trying to process 3 or 4 times the normal workload. I forced my box to only do the 8-hour units for now, in hopes that you will get caught up. [serious mode] How do you do that?[/serious mode] And yes, everyone is still playing catch up from the outages. Where is Kang when we need him/her? Whit Kang is on the other side of the wormhole at the moment and is unavailable... <G> Nearest wormhole is straight east from Earth, or, if you wish to travel, go to Spica and then northeast long enough... (got locations from evula's lair, not tested..) : ) Suomi Finland Perkele - Winner of the Eurovision Song contest 2006 ID: 619660 ·

Jesse Viviano Send message Joined: 27 Feb 00 Posts: 100 Credit: 3,949,583 RAC: 0	Message 619774 - Posted: 15 Aug 2007, 20:36:34 UTC If you took a look at the server status, you would notice that there is a big backlog of work units to assimilate into the science database as of this post. It is probably for the best that the splitters are suspended. When there is a big backlog of postprocessing (which includes validation, assimilation, deletion, transitioning, and database purging) at the server-side, there is a bunch of activity that consumes disk I/O and network throughput. When disks are being accessed, they can only serve one thread at a time. They can use tricks like command queuing and caching to speed up average service time, but the end result is that as more threads access the disk at the same time, the time they are blocked waiting for the results of their reads increases. This means that post-processing backlogs slow everything else down. When there is a massive backlog, it needs to be cleared as soon as possible. If splitting was going on at the time that there is a backlog, then the splitters will be competing for the same disk(s) and network throughput that the postprocessing threads are using, allowing the backlog to grow, making the problem worse. This also causes sluggish downloads. Also, as the backlogs grow, the disks fill up. These disks' file system slows down as more files are added to the folders and allowed to linger in them. If they run out of room, then we have a big problem. Therefore, if there is a postprocessing backlog, the admins are right in shutting down the splitters. Once the backlog clears, there will be more disk and network resources available to split work units, serve work units, accept results, and postprocess as the results come in. My point here is that if there is no work to serve, please check the server status page before complaining. 
If there is a big backlog, please lay off the admins who are trying to prevent the catastrophe of full disks. If you want to help this situation, please donate some money so that the administrators can add some speedy disks to the disk array. I can't do this yet because I am a graduate student without a job. ID: 619774 ·
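The argument above, that requests wait longer as more threads share one disk, can be sketched with a very simple queueing model. The assumptions are loud ones: a single disk that serves one request at a time in arrival order, all requests arriving together, and a fixed hypothetical service time of 10 ms. It deliberately ignores the command queuing and caching mentioned in the post, and `mean_completion_ms` is a made-up helper name.

```python
def mean_completion_ms(num_threads, service_ms):
    """Average time until a request finishes when num_threads requests
    arrive together and the disk serves them strictly one at a time."""
    # The i-th request in line finishes after (i + 1) service times.
    finish_times = [(i + 1) * service_ms for i in range(num_threads)]
    return sum(finish_times) / num_threads


for threads in (1, 4, 16):
    print(threads, "threads:", mean_completion_ms(threads, service_ms=10), "ms")
```

Under this model the average completion time grows roughly linearly with the number of competing threads, so adding splitter traffic to a disk already busy with postprocessing slows every request, including downloads.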

Jim Geuin Send message Joined: 17 May 99 Posts: 6 Credit: 5,538,490 RAC: 32	Message 619826 - Posted: 15 Aug 2007, 21:24:16 UTC Last modified: 15 Aug 2007, 21:31:24 UTC Looks to me like the mechanism that sends the work units is backed up. Where in the past, I received a new work unit about every 4000 seconds and had it queued up when the current unit finished, now I am processing a new work unit in under 40 seconds and waiting for a new one. I'd say that in the past, if you were sending say 500,000 units every 4000 seconds, now you are trying to send 500,000 units every 40 seconds. You may not have enough bandwidth to do that. ID: 619826 ·
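The arithmetic in the post above can be checked directly. The figures of 500,000 units, 4000 seconds, and 40 seconds are the poster's own illustrative guesses ("say"), not measured project numbers, and `units_per_second` is a hypothetical helper name.

```python
def units_per_second(units, interval_s):
    """Average number of work units that must be sent each second."""
    return units / interval_s


before = units_per_second(500_000, 4000)  # one unit per host every 4000 s
after = units_per_second(500_000, 40)     # one unit per host every 40 s

print(before)          # 125.0
print(after)           # 12500.0
print(after / before)  # 100.0
```

Whatever the size of a work unit, the bandwidth needed scales by the same factor of 100 if the units are the same size, which is the poster's point about running out of bandwidth.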

Starship Trooper Send message Joined: 25 Jul 04 Posts: 17 Credit: 944,769 RAC: 0	Message 620110 - Posted: 16 Aug 2007, 6:43:25 UTC - in response to Message 619826. Well, it seems the problem is still here. I've now been waiting for 3 days, connecting about 12 hours a day, and no workunit has been available. Same thing this morning, so I take a look at server status and see that:
-------------------------------------------
sah_splitter1   kosh      Not Running
sah_splitter2   klaatu    Not Running
sah_splitter3   penguin   Not Running
mb_splitter1    lando     Not Running
mb_splitter2    lando     Not Running
mb_splitter3    lando     Not Running
mb_splitter4    lando     Disabled
mb_splitter5    lando     Disabled
mb_splitter6    lando     Disabled
mb_splitter7    lando     Disabled
mb_splitter8    lando     Disabled
mb_splitter9    lando     Disabled
mb_splitter10   bambi     Not Running
mb_splitter11   bambi     Not Running
mb_splitter12   bambi     Not Running
-------------------------------------------
That means my (hopefully) Seti-dedicated machine will for some more time run... Einstein, what an irony. ID: 620110 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13732 Credit: 208,696,464 RAC: 304	Message 620125 - Posted: 16 Aug 2007, 7:23:37 UTC - in response to Message 620110. so I take a look at server status and see that... I suggest you read the latest Tech News post. Grant Darwin NT ID: 620125 ·
