Tweenday Two (Dec 27 2007)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 695137 - Posted: 27 Dec 2007, 20:41:10 UTC

("Tweenday" referring to the scant few work days between Xmas and New Year's holidays).

As we progress in our back-end scientific analysis we need to build many indexes on the science database (which vastly speed up queries). In fact, we need and hope to create 2 indexes a week for the next month or two. Seems easy, but each time you fire off such a build the science database locks up for up to 6 hours, during which there will be no assimilation and no splitting of new workunits. Well, we were planning to build another index today but with the frequent "high demand" due to our fast-return workunits the ready-to-send queue is pretty much at zero. So if we started such an index build y'all would get no work until it was done. We decided to postpone this until next week when hopefully we'll have a more user-friendly window of opportunity.
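For the curious, here's roughly the kind of statement such a build boils down to. This is purely an illustrative sketch - sqlite3 stands in for the real science database, and the table/column names are made up - but it shows the shape of the problem: the build is one long-running statement that ties up the table.

    # Purely illustrative: sqlite3 stands in for the real science database,
    # and the table/column names are made up. The build is a single
    # long-running statement; while it runs, the table is effectively
    # unavailable to the assimilators and splitters.
    import sqlite3

    conn = sqlite3.connect("science_stand_in.db")
    conn.execute("CREATE TABLE IF NOT EXISTS spike (time_received REAL, power REAL)")
    conn.execute("CREATE INDEX IF NOT EXISTS spike_time_idx ON spike (time_received)")
    conn.commit()
    conn.close()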

In the meantime, I've been trying to squeeze more juice out of our current servers. I'm kinda stumped as to why we are hitting this 60 Mbit/sec ceiling of workunit production/sending. I'm not finding any obvious I/O or network bottlenecks.

However, while searching I decided to "fix" the server status page. I changed "results in progress" to "results out in the field" which is more accurate. This number never did include the results waiting for the redundant partners to return. So I added a "results returned/awaiting validation" row, which isn't an exactly accurate description either, but is the shortest phrase I could think up at the time. Basically, these are all the results that have been returned and have yet to enter the validation/assimilation/delete pipeline, after which it is "waiting for db purging." To use a term coined elsewhere, most of these results, if not all, are waiting for their "wingman" (should be "wingperson"). At this point, if you add the results ready to send, out in the field, returned/awaiting validation, and awaiting db purging, you have an exact total of the current number of all results in the BOINC database. Thinking about this more, to get a slightly more accurate number of results waiting to reach redundancy before entering the back-end pipeline, you take the "results returned/awaiting validation" and subtract 2 times the workunits awaiting validation and subtract 2 times the workunits awaiting assimilation. Whatever... you get the basic idea. If I think of an easier/quicker way to describe all this I will.
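If it helps, here's the same bookkeeping written out as a rough sketch - the numbers are made up, and the variable names just stand in for the status page rows:

    # Made-up numbers; the variable names stand in for the server status page rows.
    ready_to_send = 0                       # "results ready to send"
    out_in_field = 2_500_000                # "results out in the field"
    returned_awaiting_validation = 900_000  # "results returned/awaiting validation"
    awaiting_db_purging = 600_000           # "results waiting for db purging"

    wus_awaiting_validation = 150_000       # workunits awaiting validation
    wus_awaiting_assimilation = 50_000      # workunits awaiting assimilation

    # Every result in the BOINC database falls into exactly one of the four rows:
    total_results = (ready_to_send + out_in_field
                     + returned_awaiting_validation + awaiting_db_purging)

    # Results still waiting to reach redundancy (waiting on a "wingman"):
    # subtract the results (2 per workunit) that are already past that point.
    waiting_for_redundancy = (returned_awaiting_validation
                              - 2 * wus_awaiting_validation
                              - 2 * wus_awaiting_assimilation)

    print(total_results, waiting_for_redundancy)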

Answering some posts from yesterday's thread:

Missing files like that prompt me to make an immediate fsck on the filesystem.


Very true - except this is a filesystem on network-attached storage. The filesystem is proprietary and out of our control, therefore no fsck'ing, nor should there be a need for manual fsck'ing.

Why are the bits 'in' larger than the bits 'out'?


Regarding the cricket graphs, the in/out depends on your orientation. The bytes going into the router are coming from the lab, en route to the outside world. So this is "outbound" traffic going "into" the router, and vice versa for the inbound. Basically: green line = workunit downloads, blue line = result uploads - though there is some low-level apache traffic noise mixed in there (web sites and schedulers).

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 695137
Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 24875
Credit: 3,081,182
RAC: 7
Ireland
Message 695154 - Posted: 27 Dec 2007, 21:13:30 UTC - in response to Message 695137.  

Thanks Matt for the update.

I trust that you & the team had a good Xmas.

Regards to everyone & a HAPPY NEW YEAR
ID: 695154
Profile Jan Schotsmans
Joined: 27 Oct 00
Posts: 98
Credit: 92,693
RAC: 0
Belgium
Message 695165 - Posted: 27 Dec 2007, 21:54:16 UTC

Maybe you could make a suggestion to the heavy contributors that they should set their queues to a longer period while you need to build indexes and cause downtime.

I already set my systems to cache 7 days' worth of work. Because of all these 25-minute units I'm getting, this queue is pretty HUGE (150 megs ... 400 or so units in queue), but you guys could go away for a week and it wouldn't hurt me :p

The daily stats won't be all that accurate anymore if the reported date is the timestamp used to see when a unit was completed rather than the date the seti app actually finished processing the unit, but at least everyone will be able to keep crunching without trouble.

Also, I thought I saw someone say something about a 60Mbit cap.

So essentially, you do have more bandwidth available (100Mbit), but for some reason the servers just form a bottleneck somewhere.

Rather strange that you're hitting exactly 60Mbit if you remember that 60Mbit = 7.5MB/s, which incidentally is what most Windows networks tend to crap out on, while good NICs (or sometimes just better drivers for the NICs) and switches bump that straight up to the max 10-12MB/s.

But I seem to remember you guys have Gigabit interlinks between all the servers and probably a Gigabit or higher backbone going almost all the way to the Internet router, so that 60Mbit/s (7.5MB/s) limit is really, really weird.
ID: 695165
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 695170 - Posted: 27 Dec 2007, 22:18:35 UTC - in response to Message 695165.  

Maybe you could make a suggestion to the heavy contributors that they should set their queues to a longer period while you need to build indexes and cause downtime.


I don't have exact stats here, but I believe most of our bandwidth is due to users of the "set it and forget it" variety - they never mess around with queues/caches. So if the "heavy" users did stock up it wouldn't help as much as you'd think.

Also, I thought I saw someone say something about a 60Mbit cap.


I mentioned how we seem to have a 60Mbit ceiling - that's not an enforced cap - it's due to internal disk/database/network I/O bottlenecks which are quite dynamic and always difficult to track down. In reality, we have a 1 Gbit connection to the world via Hurricane Electric, but alas this is constrained by a 100 Mbit fiber coming out of the lab down to campus - it will take some big $$$ to upgrade that, which may happen sooner or later (as we wouldn't be the only ones to benefit).

But I seem to remember you guys have Gigabit interlinks between all the servers and probably a Gigabit or higher backbone going almost all the way to the Internet router, so that 60Mbit/s (7.5MB/s) limit is really, really weird.


We have gigabit all over our server closet (more or less - some older servers are 100Mbit going into the 1Gbit switch). And yes, you are right - it is weird. Maybe I'll take a look at that switch now that you mention it...

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 695170
Profile Jan Schotsmans
Joined: 27 Oct 00
Posts: 98
Credit: 92,693
RAC: 0
Belgium
Message 695178 - Posted: 27 Dec 2007, 22:45:29 UTC - in response to Message 695170.  

Won't just an upgrade of the NICs on both ends of the fibre do the trick? Or is the fibre that low quality?
ID: 695178
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 695188 - Posted: 27 Dec 2007, 22:57:10 UTC


. . . looks like you are getting better @ explainin' "what's up" - Thanks for the Updates Matt

ps - especially considerin' the 'tweendays' ;)

< as an afterthought - Thanks to All others @ Berkeley - a job well done (Thanks)


BOINC Wiki . . .

Science Status Page . . .
ID: 695188
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 695198 - Posted: 27 Dec 2007, 23:24:23 UTC - in response to Message 695178.  

Won't just an upgrade of the NICs on both ends of the fibre do the trick? Or is the fibre that low quality?


That's more or less true, but this fibre is under campus control (not ours), and let's just say they have a very specific way of doing things.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 695198
Profile Jan Schotsmans
Joined: 27 Oct 00
Posts: 98
Credit: 92,693
RAC: 0
Belgium
Message 695201 - Posted: 27 Dec 2007, 23:34:13 UTC - in response to Message 695198.  

I think I remember when you posted the pictures from when that cable was put in!

If crunchers here thought what happened last week was a long wait for units, they shoulda been around when that cable was put in.

Ahh, memories.
ID: 695201
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30608
Credit: 53,134,872
RAC: 32
United States
Message 695309 - Posted: 28 Dec 2007, 5:29:29 UTC - in response to Message 695170.  

Let me get this right: you are paying for a gig but can only use a tenth of it? (And you are asking for donations?!)

Or is the campus paying for 9/10 of it and you are paying for 1/10?

Or are they billing on bits transferred and not pipe width?

Is it possible to get a fibre that isn't campus controlled? (Did they pull a backup?)

ID: 695309
Profile Jan Schotsmans
Joined: 27 Oct 00
Posts: 98
Credit: 92,693
RAC: 0
Belgium
Message 695328 - Posted: 28 Dec 2007, 9:23:17 UTC

The Cricket graphs seem to be for the actual Gigabit Internet port, so maybe the bandwidth problem isn't on the Seti side, but on the campus side.

Can you have them check their side?

If the fibre gets to campus and is linked to a switch of some kind before being hooked up to the Gigabit internet pipe, try hooking up a laptop or a PC to that same switch and benchmarking the bandwidth between one of the Seti servers and that point.

It might be a good idea to do bandwidth tests like that on every hop of the network between the server racks and the actual internet pipe.
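Something like this would do for a rough point-to-point test - just a sketch in Python with an arbitrary port and test size; a proper tool like iperf would give more trustworthy numbers:

    # Rough point-to-point throughput test: run "recv" on one machine and
    # "send <host>" on the other, then compare the measured rate hop by hop.
    # Illustrative only; a dedicated tool such as iperf is more trustworthy.
    import socket, sys, time

    PORT, CHUNK, TOTAL_MB = 5001, 64 * 1024, 100   # arbitrary port and test size

    def recv():
        srv = socket.socket()
        srv.bind(("", PORT))
        srv.listen(1)
        conn, addr = srv.accept()
        got, start = 0, time.time()
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            got += len(data)
        secs = time.time() - start
        print("%.1f Mbit/s from %s" % (got * 8 / secs / 1e6, addr[0]))

    def send(host):
        sock = socket.create_connection((host, PORT))
        payload = b"x" * CHUNK
        for _ in range(TOTAL_MB * 1024 * 1024 // CHUNK):
            sock.sendall(payload)
        sock.close()

    if __name__ == "__main__":
        send(sys.argv[2]) if sys.argv[1] == "send" else recv()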
ID: 695328
Jim Wilkins
Volunteer tester

Joined: 11 Oct 99
Posts: 70
Credit: 1,658,376
RAC: 0
United States
Message 695358 - Posted: 28 Dec 2007, 15:44:12 UTC - in response to Message 695137.  

("Tweenday" referring to the scant few work days between Xmas and New Year's holidays).

As we progress in our back-end scientific analysis we need to build many indexes on the science database (which vastly speed up queries). In fact, we need and hope to create 2 indexes a week for the next month or two. Seems easy, but each time you fire off such a build the science database locks up for up to 6 hours, during which there will be no assimilation and no splitting of new workunits. Well, we were planning to build another index today but with the frequent "high demand" due to our fast-return workunits the ready-to-send queue is pretty much at zero. So if we started such an index build y'all would get no work until it was done. We decided to postpone this until next week when hopefully we'll have a more user-friendly window of opportunity.

In the meantime, I've been trying to squeeze more juice out of our current servers. I'm kinda stumped as to why we are hitting this 60 MB/sec ceiling of workunit production/sending. I'm not finding any obvious I/O or network bottlenecks. However, while searching I decided to "fix" the server status page. I changed "results in progress" to "results out in the field" which is more accurate. This number never did include the results waiting for the redundant partners to return. So I added a "results returned/awaiting validation" row which also isn't exactly an accurate description either but is the shortest phrase I could think up at the time. Basically these are all the results that have been returned and have yet to enter the validation/assimilation/delete pipeline, after which it is "waiting for db purging." To use a term coined elsewhere, most of these results, if not all, are waiting for their "wingman" (should be "wingperson"). At this point if you add the results ready to send, out in the field, returned/awaiting validation, and awaiting db purging, you have an exact total of the current number of all results in the BOINC database. Thinking about this more, to get a slightly more accurate number of results waiting to reach redundancy before entering the back-end pipeline you take the "results returned/awaiting validation" and subtract 2 times the workunits awaiting validation and subtract 2 times the workunits awaiting assimilation. Whatever.. you get the basic idea. If I think of an easier/quicker way to describe all this I will.

Answering some posts from yesterday's thread:

Missing files like that prompt me to make an immediate fsck on the filesystem.


Very true - except this is a filesystem on network attached storage. The filesystem is propietary and out of our control, therefore no fsck'ing, nor should there be a need for manual fsck'ing.

Why are the bits 'in' larger than the bits 'out'?


In regards to the cricket graphs, the in/out depends on your orientation. The bytes going into the router are coming from the lab, en route to the outside world. So this is "outbound" traffic going "into" the router. Vice versa for the inbound. Basically: green = workunit downloads, blue line = result uploads - though there is some low-level apache traffic noise mixed in there (web sites and schedulers).

- Matt


Matt,

Is it possible to do the database builds in parallel with the other Tuesday maintenance tasks? That would make it much less painful. Even if you can't, why not schedule them serially with the other Tuesday tasks? If the heavy crunchers who process nothing but SETI tasks know that this will happen, they can load up the day before. For the rest of us, we'll let the backup projects run (Einstein is more than happy to gobble up my CPUs right now). The set-and-forget types won't notice, and if they do, they are not "forgetting" enough! ;-) So, for a while, Tuesday downtime would be longer than we are used to. I'll bet you that we will get used to it.

Thanks,
Jim
ID: 695358
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 695360 - Posted: 28 Dec 2007, 16:03:06 UTC

During these rather rocky times, has anyone thought of disabling the beta project(s) that share the same servers? I think Astropulse uses the same hardware, so why not throttle that project down to free up the main project? I realize this may not solve the bandwidth issue, but it may help the processing issues.

Assuming that the shorties are valid WUs (and some have validly questioned this assumption), it is clear something is radically wrong with the server architecture as it stands and some focused problem solving is needed. D'oh! I guess everybody is gone/nearly gone for the long weekend.

Well today I'm going to the casino to chill out. I've had enough of slow servers at seti for a while.
ID: 695360
Dudo

Joined: 25 Dec 99
Posts: 2
Credit: 6,648,547
RAC: 0
Croatia
Message 695460 - Posted: 28 Dec 2007, 23:07:28 UTC - in response to Message 695137.  


In the meantime, I've been trying to squeeze more juice out of our current servers. I'm kinda stumped as to why we are hitting this 60 Mbit/sec ceiling of workunit production/sending. I'm not finding any obvious I/O or network bottlenecks.


Guys,

It seems that you are looking at the wrong graph - have a look at the interface packets-per-second one.

In packets per second, both graphs (in/out) are equal.
(It seems that you peaked in that, not bandwidth - or maybe a combination of both.)
Or maybe it is a limit of some firewall.

What (whose) router(s)/firewall(s) do you have in line to the backbone?

I hope it helps !!!

Bye
ID: 695460
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 695461 - Posted: 28 Dec 2007, 23:17:00 UTC - in response to Message 695360.  

During these rather rocky times, has anyone thought of disabling the beta project(s) that share the same servers? I think Astropulse uses the same hardware, so why not throttle that project down to free up the main project? I realize this may not solve the bandwidth issue, but it may help the processing issues.

snipped....

I don't think the minimal traffic on Beta is going to make one bit of difference to the main site problems. As at 28 Dec 2007 22:21:13 UTC, the Beta server status only shows 8,185 results ready to send and 14,849 results in progress.

And it does keep my CPUs from making bigger demands here; units on Beta (AR=0.4xx) with V 6.00 at the moment take ~3:20 (or ~5:00) to crunch, whilst most units here only take <20 mins (or ~40 mins), depending on the CPU.
ID: 695461
Odysseus
Volunteer tester
Joined: 26 Jul 99
Posts: 1808
Credit: 6,701,347
RAC: 6
Canada
Message 695462 - Posted: 28 Dec 2007, 23:18:03 UTC - in response to Message 695360.  

During these rather rocky times, has anyone thought of disabling the beta project(s) that share the same servers? I think Astropulse uses the same hardware, so why not throttle that project down to free up the main project? I realize this may not solve the bandwidth issue, but it may help the processing issues.

Judging from estimates of processing power at the third-party stats sites, S@h Beta produces something like one-third of one percent as much work as the main project. The increase in server capacity from shutting it down would therefore be pretty negligible.

ID: 695462
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 695466 - Posted: 28 Dec 2007, 23:45:22 UTC - in response to Message 695309.  

Let me get this right: you are paying for a gig but can only use a tenth of it? (And you are asking for donations?!)

Or is the campus paying for 9/10 of it and you are paying for 1/10?

Or are they billing on bits transferred and not pipe width?

Is it possible to get a fibre that isn't campus controlled? (Did they pull a backup?)

I don't have the exact numbers, but if I remember correctly....

The gigabit from Hurricane Electric cost about a quarter of what the previous connection cost from Cogent.

So, even if they waste 9/10ths of it, they are saving money. How is that bad?

Universities are highly political. Matt has hinted at that elsewhere, but the short answer is, the only way to get fiber that isn't controlled by campus is to rent space off campus.

There are some other thoughts, like moving most of the servers nearer where that fat pipe enters campus, but that means other logistic issues as well.

More to the point, it's really easy for us (especially those of us who do internetworking for a living) to sit out here and play "armchair quarterback."

I certainly have my ideas, but unless and until I volunteer to actually write the code, I won't criticize the project (or BOINC) for not doing that.

ID: 695466
Profile Jan Schotsmans
Joined: 27 Oct 00
Posts: 98
Credit: 92,693
RAC: 0
Belgium
Message 695469 - Posted: 28 Dec 2007, 23:53:19 UTC

Dudo: good catch, that is weird indeed. But 6000 packets a sec shouldn't be that hard for a gigabit router, especially if you consider its CPU never got over 10% usage and it's only using 1/5th of its RAM.

So I'm thinking we might be back to whatever switch/router or NIC brings in that 100Mbit fibre to the Seti farm.

No matter whether it's the 6000 packets per sec or the actual bandwidth, it should be able to handle either easily.

ID: 695469
Batman

Joined: 17 Dec 00
Posts: 8
Credit: 84,508
RAC: 0
United States
Message 695471 - Posted: 29 Dec 2007, 0:03:56 UTC

Would any of this explain why my client is currently unable to download WUs? It has gotten part of two and can't get one bit of three others. It has already worked through its queue of WUs. I'll bump up my queue to 7 days, but that won't help now since my tank is empty.
ID: 695471
Profile CElliott
Volunteer tester

Joined: 19 Jul 99
Posts: 178
Credit: 79,285,961
RAC: 0
United States
Message 695477 - Posted: 29 Dec 2007, 0:36:32 UTC - in response to Message 695137.  

In the meantime, I've been trying to squeeze more juice out of our current servers. I'm kinda stumped as to why we are hitting this 60 Mbit/sec ceiling of workunit production/sending. I'm not finding any obvious I/O or network bottlenecks.


I have been watching the uploads and downloads when the network is congested the last few times it has happened. I have DSL (10 Mbit/s) or municipal WiFi (1.2 to 2.4 Mbit/s). What is occurring is that many times an upload or download starts, continues fitfully for several packets, gets almost to the end or even to 100%, and then quits without completing. This behavior is very common. In other words, on my machines many WUs or results are essentially being uploaded or downloaded many times before finally completing.

The way I/O and CPU priority are supposed to work is that when a program is given the CPU on I/O completion, it is assigned a higher priority than normal, and then that priority is allowed to degrade slowly to normal. This is so that if a program is I/O-bound, it can begin its I/O, be sent to the I/O wait queue, and be given the CPU again at a higher-than-normal priority when the I/O is complete, so it can do sufficient computation to quickly begin another I/O operation. It appears that programs uploading or downloading WUs are not being allocated sufficient CPU time, because the delays between I/O attempts appear to be causing the connections to time out. I wonder if the priority mechanism is working on the S@H server or if the server is attempting too many simultaneous connections.
ID: 695477
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 695489 - Posted: 29 Dec 2007, 1:08:55 UTC - in response to Message 695477.  

snipped....
I wonder if the priority mechanism is working on the S@H server or if the server is attempting too many simultaneous connections.

Probably the latter is my guess, based on a few months ago when we had problems downloading. On most connections, if you didn't start downloading within 22 seconds the connection was broken and you retried later. Most of my downloads during this period go beyond the 22 sec limit and may or may not download a few bytes in the next 5 mins or so before the connection is terminated, although a few do time out at the 22 sec limit.
ID: 695489