Squeak (Aug 14 2007)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 619283 - Posted: 14 Aug 2007, 23:12:53 UTC
Last modified: 14 Aug 2007, 23:13:05 UTC

Oy! We seem to be pushing our cranky old servers harder than they'd like. Sometimes it seems like a miracle these things have performed as well as they have under such strain. Anyway - we had our usual database outage to back up/compress the database. During the outage we rebooted several machines to fix mounting problems, clean pipes, etc... One exhibited weird behavior on reboot but eventually we realized this was due to its newer kernel not having the right fibre card drivers. Oh yeah that.

But then Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately. Still catching up from recent outages? One annoying thing is that our "TCP connection drops" monitor has been silently failing for who knows how long, so we haven't been correctly told how bad we've been suffering from dropped connections. But still, we've recovered much more quickly before. Is it the new multibeam splitters? They are writing to the file server over the lab LAN as opposed to our dedicated switch, but even still the writes amount to about 15 Mbits, tops, which the LAN is quite able to handle.

The only major recent change we can think of is that we are now just sending out 2 copies of each workunit initially, as opposed to 3. So we reduced the probability that the workunit is in the file server's memory cache by as much as 33%. Perhaps this accounts for the slower performance. In any case, we spent too much time staring at log files, iostat output, network graphs, etc. and have since moved on to other projects for now. We figure the servers will either claw their way out of this problem on their own or we'll revisit it tomorrow.
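A rough way to picture the cache effect described above: assume the first download of a workunit always comes from disk and every later copy is served from memory. Under that simplifying assumption, the best-case fraction of downloads served from cache is (copies - 1) / copies. The function name and the model are illustrative only; nothing here is taken from the project's servers.

```python
from fractions import Fraction


def best_case_hit_fraction(copies: int) -> Fraction:
    """Upper bound on the fraction of downloads served from memory cache.

    Assumption (not from the project): the first download of a workunit
    always reads from disk, and every later copy is a cache hit.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return Fraction(copies - 1, copies)


if __name__ == "__main__":
    # Compare the old replication level (3 copies) with the new one (2 copies).
    for copies in (3, 2):
        fraction = best_case_hit_fraction(copies)
        print(f"{copies} copies: at most {float(fraction):.0%} of downloads can hit the cache")
```

In this toy model, going from 3 copies to 2 lowers the ceiling from two thirds to one half. Real hit rates depend on cache size and on how far apart the downloads of the same workunit land.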

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 619283 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 619296 - Posted: 14 Aug 2007, 23:35:45 UTC

ok, but does this explain why there are only 32 wu's ready to send right now?
ID: 619296 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 619299 - Posted: 14 Aug 2007, 23:50:08 UTC - in response to Message 619296.  

Common misconception: every 30 minutes or so we take a snapshot of how many wu's are ready to send at that very second. In this case 32. A second later it may have been 100. Then zero a second after that. Then 5000 five minutes later, etc. but it'll still say 32 on the status page. Basically, the more important number is the result creation rate which shows how many wu's are being made ready to send per second, and in this case since the queue isn't growing, all of those are being sent to our users.
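The relationship between the snapshot and the creation rate comes down to simple bookkeeping: whatever is created but does not show up as queue growth must have been sent out. A minimal sketch, with a hypothetical function name and made-up sample numbers (this is not the status page's code):

```python
def estimated_send_rate(creation_rate: float,
                        queue_before: int,
                        queue_after: int,
                        interval_s: float) -> float:
    """Estimate results sent per second between two ready-to-send snapshots.

    creation_rate is in results per second; queue_before and queue_after are
    the ready-to-send counts at the start and end of the interval.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Positive when the queue grows, negative when it drains.
    queue_growth_rate = (queue_after - queue_before) / interval_s
    return creation_rate - queue_growth_rate


if __name__ == "__main__":
    # Hypothetical: 9 results/sec created, queue unchanged over 30 minutes.
    print(estimated_send_rate(9.0, 32, 32, 1800))
    # Hypothetical: same creation rate, but the queue drained by 1800 results.
    print(estimated_send_rate(9.0, 1832, 32, 1800))
```

A flat queue means the send rate equals the creation rate; a draining queue means results are going out faster than they are being made.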

- Matt

ID: 619299 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 619301 - Posted: 14 Aug 2007, 23:52:17 UTC

My point was that the level has dropped from 200K to 32. But obviously you are on top of things, so I'll chill out.
ID: 619301 · Report as offensive
gomeyer
Volunteer tester

Send message
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 619303 - Posted: 14 Aug 2007, 23:56:46 UTC - in response to Message 619296.  

ok, but does this explain why there are only 32 wu's ready to send right now?

Perhaps more to the point, the RTS queue steadily dropped all last night and this morning (EDT) so the splitters as they were configured during that timeframe were not keeping up with the load. Perhaps due to still catching up from the weekend? Whatever, more data for you.
ID: 619303 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 619304 - Posted: 14 Aug 2007, 23:57:04 UTC

Just at the moment, the server status page has been frozen for the last 40 minutes, saying that all splitters are offline (three disabled and four not running), yet the result creation rate is 8.97/sec.

Kinda confusing - you can see why the questions get asked.

Not urgent, but if you feel like a displacement activity while you mull over the download problem.....
ID: 619304 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 619305 - Posted: 15 Aug 2007, 0:00:26 UTC - in response to Message 619299.  

Common misconception: every 30 minutes or so we take a snapshot of how many wu's are ready to send at that very second. In this case 32. A second later it may have been 100. Then zero a second after that. Then 5000 five minutes later, etc. but it'll still say 32 on the status page. Basically, the more important number is the result creation rate which shows how many wu's are being made ready to send per second, and in this case since the queue isn't growing, all of those are being sent to our users.

- Matt

Is there some way of also measuring the "assignment rate" that would sort-of mirror the creation rate?
ID: 619305 · Report as offensive
Profile Andy Lee Robinson
Avatar

Send message
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 619401 - Posted: 15 Aug 2007, 4:16:47 UTC - in response to Message 619283.  
Last modified: 15 Aug 2007, 4:17:47 UTC

But then Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately.


MTU values?
A couple of months ago Eric discovered that the MTU sizes weren't correct and reset them to 1476 because of a tunnel. Perhaps they've reset themselves to default values?
http://setiathome.berkeley.edu/forum_thread.php?id=39742&nowrap=true#575114
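For what it's worth, 1476 is the figure you get by subtracting a basic GRE-over-IPv4 header from a standard 1500-byte Ethernet MTU. The thread only says "a tunnel", so the GRE assumption and the constant names below are a guess for illustration, not a description of the lab's actual setup.

```python
ETHERNET_MTU = 1500      # standard Ethernet payload size in bytes
OUTER_IPV4_HEADER = 20   # IPv4 header without options
BASIC_GRE_HEADER = 4     # GRE header with no key, checksum, or sequence fields


def tunnel_mtu(link_mtu: int = ETHERNET_MTU,
               overhead: int = OUTER_IPV4_HEADER + BASIC_GRE_HEADER) -> int:
    """Largest inner packet that fits in one outer packet without fragmenting."""
    if overhead >= link_mtu:
        raise ValueError("encapsulation overhead must be smaller than the link MTU")
    return link_mtu - overhead


if __name__ == "__main__":
    print(tunnel_mtu())
```

If the interfaces fell back to 1500, full-size packets would no longer fit through the tunnel in one piece, which is the kind of thing that could show up as slow or stalled transfers.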

Andy.
ID: 619401 · Report as offensive
Profile RottenMutt
Avatar

Send message
Joined: 15 Mar 01
Posts: 1011
Credit: 230,314,058
RAC: 0
United States
Message 619426 - Posted: 15 Aug 2007, 5:26:20 UTC - in response to Message 619283.  

Oy! We seem to be pushing our cranky old servers harder than they'd like...

why are the splitters off line???

ID: 619426 · Report as offensive
RC Motts

Send message
Joined: 20 Sep 03
Posts: 1
Credit: 5,081,620
RAC: 0
United States
Message 619431 - Posted: 15 Aug 2007, 5:39:45 UTC

I haven't been able to get a single work unit from your servers (on any of my machines) for days now. The client just continually tries to download JPGs, help files, etc. on all machines, then simply gives up. What's going on??
ID: 619431 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19048
Credit: 40,757,560
RAC: 67
United Kingdom
Message 619441 - Posted: 15 Aug 2007, 6:22:45 UTC - in response to Message 619431.  

I haven't been able to get a single work unit from your servers (on any of my machines) for days now. The client just continually tries to download JPGs, help files, etc. on all machines, then simply gives up. What's going on??

If you are having trouble downloading the default application from Berkeley, why don't you download the optimised app from Rev 2.4 Optimised apps? The Intel Core 2 one will work on your T7200, but I'm not sure about your T2500. If you do not know its extended capabilities you could try CPU-Z or the test tools found on the downloads/Tools and benchmark page.

If you don't want to run optimised permanently, then when Berkeley is back on an even keel, close BOINC and rename or remove the app_info file. That will return you to normal once BOINC is restarted.

Andy
ID: 619441 · Report as offensive
Swibby Bear

Send message
Joined: 1 Aug 01
Posts: 246
Credit: 7,945,093
RAC: 0
United States
Message 619550 - Posted: 15 Aug 2007, 13:38:35 UTC - in response to Message 619283.  

Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately.


I got a whole bunch of 2-hour WUs. If everyone got some, then you are trying to process 3 or 4 times the normal workload. I forced my box to only do the 8-hour units for now, in hopes that you will get caught up. And yes, everyone is still playing catch up from the outages. Where is Kang when we need him/her?
Whit
ID: 619550 · Report as offensive
Neonblue

Send message
Joined: 15 May 07
Posts: 3
Credit: 18,372,907
RAC: 33
United States
Message 619554 - Posted: 15 Aug 2007, 13:47:57 UTC

So it's not a problem on my side that my computer is not getting any work to perform after last weekend?
ID: 619554 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Avatar

Send message
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 619597 - Posted: 15 Aug 2007, 15:16:06 UTC - in response to Message 619550.  

Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately.


I got a whole bunch of 2-hour WUs. If everyone got some, then you are trying to process 3 or 4 times the normal workload. I forced my box to only do the 8-hour units for now, in hopes that you will get caught up.


[serious mode] How do you do that?[/serious mode]

And yes, everyone is still playing catch up from the outages. Where is Kang when we need him/her?
Whit


Kang is on the other side of the wormhole at the moment and is unavailable...
<G>
.

Hello, from Albany, CA!...
ID: 619597 · Report as offensive
Profile RandyC
Avatar

Send message
Joined: 20 Oct 99
Posts: 714
Credit: 1,704,345
RAC: 0
United States
Message 619627 - Posted: 15 Aug 2007, 16:24:37 UTC - in response to Message 619597.  
Last modified: 15 Aug 2007, 16:25:18 UTC

Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately.


I got a whole bunch of 2-hour WUs. If everyone got some, then you are trying to process 3 or 4 times the normal workload. I forced my box to only do the 8-hour units for now, in hopes that you will get caught up.


[serious mode] How do you do that?[/serious mode]


He probably just suspends the short WUs with Boincmgr
[edit typos]
ID: 619627 · Report as offensive
Marko
Volunteer tester

Send message
Joined: 2 Jun 99
Posts: 10
Credit: 659,205
RAC: 0
Finland
Message 619660 - Posted: 15 Aug 2007, 17:40:40 UTC - in response to Message 619597.  

I got a whole bunch of 2-hour WUs. If everyone got some, then you are trying to process 3 or 4 times the normal workload. I forced my box to only do the 8-hour units for now, in hopes that you will get caught up.

[serious mode] How do you do that?[/serious mode]

And yes, everyone is still playing catch up from the outages. Where is Kang when we need him/her?
Whit


Kang is on the other side of the wormhole at the moment and is unavailable...
<G>


The nearest wormhole is straight east from Earth or, if you wish to travel, go to Spica and then northeast long enough... (got the locations from evula's lair, not tested..)
: )

Suomi Finland Perkele - Winner of the Eurovision Song contest 2006
ID: 619660 · Report as offensive
Jesse Viviano

Send message
Joined: 27 Feb 00
Posts: 100
Credit: 3,949,583
RAC: 0
United States
Message 619774 - Posted: 15 Aug 2007, 20:36:34 UTC

If you took a look at the server status, you would notice that there is a big backlog of work units to assimilate into the science database as of this post. It is probably for the best that the splitters are suspended.

When there is a big backlog of postprocessing (which includes validation, assimilation, deletion, transitioning, and database purging) at the server-side, there is a bunch of activity that consumes disk I/O and network throughput. When disks are being accessed, they can only serve one thread at a time. They can use tricks like command queuing and caching to speed up average service time, but the end result is that as more threads access the disk at the same time, the time they are blocked waiting for the results of their reads increases. This means that post-processing backlogs slow everything else down. When there is a massive backlog, it needs to be cleared as soon as possible. If splitting was going on at the time that there is a backlog, then the splitters will be competing for the same disk(s) and network throughput that the postprocessing threads are using, allowing the backlog to grow, making the problem worse. This also causes sluggish downloads. Also, as the backlogs grow, the disks fill up. These disks' file system slows down as more files are added to the folders and allowed to linger in them. If they run out of room, then we have a big problem.
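The way waiting time balloons as a disk gets busier can be illustrated with the textbook M/M/1 queueing formula, where mean response time is the service time divided by (1 - utilization). This is a generic approximation with made-up numbers, not a model of the SETI@home servers.

```python
import math


def mm1_response_time(service_time_s: float, arrival_rate_per_s: float) -> float:
    """Mean time a request spends waiting plus being served in an M/M/1 queue.

    Returns math.inf when requests arrive at least as fast as the single
    server (here, one disk) can handle them, i.e. the backlog never clears.
    """
    if service_time_s <= 0 or arrival_rate_per_s < 0:
        raise ValueError("service time must be positive and arrival rate non-negative")
    utilization = arrival_rate_per_s * service_time_s
    if utilization >= 1:
        return math.inf
    return service_time_s / (1 - utilization)


if __name__ == "__main__":
    # Hypothetical disk that needs 10 ms per request.
    for rate in (10, 50, 90, 99):
        print(f"{rate} requests/sec -> {mm1_response_time(0.010, rate) * 1000:.1f} ms per request")
```

The point of the sketch is the shape of the curve: the last few percent of utilization cost far more latency than the first fifty, which is why taking one consumer (the splitters) off a nearly saturated disk can help everything else recover.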

Therefore, if there is a postprocessing backlog, the admins are right in shutting down the splitters. Once the backlog clears, there will be more disk and network resources available to split work units, serve work units, accept results, and postprocess as the results come in.

My point here is that if there is no work to serve, please check the server status page before complaining. If there is a big backlog, please lay off the admins who are trying to prevent the catastrophe of full disks.

If you want to help this situation, please donate some money so that the administrators can add some speedy disks to the disk array. I can't do this yet because I am a graduate student without a job.
ID: 619774 · Report as offensive
Profile Jim Geuin

Send message
Joined: 17 May 99
Posts: 6
Credit: 5,538,490
RAC: 32
United States
Message 619826 - Posted: 15 Aug 2007, 21:24:16 UTC
Last modified: 15 Aug 2007, 21:31:24 UTC

Looks to me like the mechanism that sends the work units is backed up. Where in the past, I received a new work unit about every 4000 seconds and had it queued up when the current unit finished, now I am processing a new work unit in under 40 seconds and waiting for a new one.

I'd say that in the past, if you were sending say 500,000 units every 4000 seconds, now you are trying to send 500,000 units every 40 seconds. You may not have enough bandwidth to do that.
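To put rough numbers on that guess: the bandwidth needed is units × size × 8 / interval. The 500,000 figure is the "say" from the paragraph above, and the workunit size is a placeholder assumption (the thread does not give one), so treat the output as illustrative only.

```python
ASSUMED_WORKUNIT_BYTES = 350_000  # hypothetical size; not stated in the thread


def required_mbit_per_s(units: int, unit_size_bytes: int, interval_s: float) -> float:
    """Average bandwidth in megabits per second to send `units` files in `interval_s`."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    total_bits = units * unit_size_bytes * 8
    return total_bits / interval_s / 1_000_000


if __name__ == "__main__":
    for interval in (4000, 40):
        mbit = required_mbit_per_s(500_000, ASSUMED_WORKUNIT_BYTES, interval)
        print(f"500,000 units every {interval} s -> {mbit:,.0f} Mbit/s")
```

Whatever size is assumed, shrinking the interval from 4000 seconds to 40 multiplies the required bandwidth by 100.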
ID: 619826 · Report as offensive
Starship Trooper

Send message
Joined: 25 Jul 04
Posts: 17
Credit: 944,769
RAC: 0
France
Message 620110 - Posted: 16 Aug 2007, 6:43:25 UTC - in response to Message 619826.  

Well, it seems the problem is still here.

I've now been waiting for 3 days, connecting about 12 hours a day, and no workunit has been available.

Same thing this morning, so I took a look at the server status page and saw this:

-------------------------------------------
sah_splitter1    kosh      Not Running
sah_splitter2    klaatu    Not Running
sah_splitter3    penguin   Not Running
mb_splitter1     lando     Not Running
mb_splitter2     lando     Not Running
mb_splitter3     lando     Not Running
mb_splitter4     lando     Disabled
mb_splitter5     lando     Disabled
mb_splitter6     lando     Disabled
mb_splitter7     lando     Disabled
mb_splitter8     lando     Disabled
mb_splitter9     lando     Disabled
mb_splitter10    bambi     Not Running
mb_splitter11    bambi     Not Running
mb_splitter12    bambi     Not Running
-------------------------------------------

That means my (hopefully) SETI-dedicated machine will, for some more time, run... Einstein. What an irony.
ID: 620110 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 620125 - Posted: 16 Aug 2007, 7:23:37 UTC - in response to Message 620110.  

so I take a look at server status and see that...

I suggest you read the latest Tech News post.
Grant
Darwin NT
ID: 620125 · Report as offensive
 