Boiling Down Chicken Soup (Sep 11 2007)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 638994 - Posted: 11 Sep 2007, 22:03:17 UTC

Outside of discussion about not-too-distant-future database replication, we didn't really need to think much today about the science database server that has been giving us grief the past week. As mysterious as the initial fake drive failures were, it's even weirder that they suddenly stopped altogether. I fully tested the "failed" drives - they're fine.

Anyway, we had the usual outage today, which was mundane except that I took the time to move some of the directories off the workunit file server and onto a lesser-used server. We already have all the workunits hashed out over 1024 directories, so it's easy to move whole directories and make symlinks, and everybody's happy. However, these directories are HUGE (of course), so it took about 3 hours to move only 64 of them (going about 40 Mbits/sec over the local network during the transfer). We weren't ready to have the project down for a whole day, so we'll leave it at that for now. So far, then, we've offloaded 64 of the 1024 directories, or roughly 6.25% of the traffic, from the bottlenecked file server. We'll see if that changes anything.
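For anyone who wants to check the figures in that paragraph, here is a rough sketch in Python. It assumes the 40 Mbit/s rate was sustained for the full 3 hours, that traffic is spread evenly across the 1024 directories, and decimal units throughout. The `move_and_symlink` helper and the directory name `3ff` are hypothetical illustrations of the move-then-symlink step, not the project's actual script or layout.

```python
import os
import shutil
import tempfile

TOTAL_DIRS = 1024    # workunit directories (from the post)
MOVED_DIRS = 64      # directories moved during the outage (from the post)
RATE_MBIT_S = 40.0   # observed transfer rate (from the post)
HOURS = 3.0          # approximate duration of the move (from the post)


def fraction_moved(moved: int, total: int) -> float:
    """Share of directories (and, if load is uniform, of traffic) offloaded."""
    return moved / total


def gigabytes_transferred(rate_mbit_s: float, hours: float) -> float:
    """Data volume at a sustained rate, in decimal gigabytes."""
    bits = rate_mbit_s * 1_000_000 * hours * 3600
    return bits / 8 / 1_000_000_000


def hours_to_move_all(moved: int, total: int, hours: float) -> float:
    """Extrapolated time to move every directory at the same pace."""
    return hours * total / moved


def move_and_symlink(src_dir: str, dest_root: str) -> str:
    """Move src_dir under dest_root, then leave a symlink at the old path.

    Hypothetical helper: src_dir must not end in a path separator and the
    destination must not already exist.
    """
    dest = os.path.join(dest_root, os.path.basename(src_dir))
    shutil.move(src_dir, dest)
    os.symlink(dest, src_dir)
    return dest


if __name__ == "__main__":
    print(f"offloaded: {fraction_moved(MOVED_DIRS, TOTAL_DIRS):.2%}")
    print(f"moved: ~{gigabytes_transferred(RATE_MBIT_S, HOURS):.0f} GB")
    print(f"all dirs: ~{hours_to_move_all(MOVED_DIRS, TOTAL_DIRS, HOURS):.0f} h")

    # Demonstrate the move-and-symlink step on throwaway directories.
    with tempfile.TemporaryDirectory() as old_root, \
            tempfile.TemporaryDirectory() as new_root:
        src = os.path.join(old_root, "3ff")  # hypothetical directory name
        os.mkdir(src)
        with open(os.path.join(src, "wu.dat"), "w") as fh:
            fh.write("payload")
        move_and_symlink(src, new_root)
        # The old path is now a symlink, and files remain reachable through it.
        print(os.path.islink(src), os.path.exists(os.path.join(src, "wu.dat")))
```

Under those assumptions the transfer works out to about 54 GB, and moving all 1024 directories at the same pace would take roughly 48 hours, which fits the reluctance to keep the project down for a whole day.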

Meanwhile, Jeff/Eric/I are doing some major cleanup on our internal software suites - so many nagging "make" issues to fix, so little time.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 638994 · Report as offensive
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 639048 - Posted: 11 Sep 2007, 22:51:30 UTC


. . . seems like you're each doing a damn good job - my system is 'flyin' along with grace . . . so keep up the good work, Berkeley. Thanks for the post, Matt, also . . .

ID: 639048 · Report as offensive
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 639158 - Posted: 12 Sep 2007, 1:59:08 UTC - in response to Message 638994.  

Glad to hear the maintenance window went well today.

Was that low performance due to the server being busy overall (accessing the drives), or was it choking on other network traffic? Any respectable RAID 10 configuration should be able to max out a 100BASE-TX network card. Your file server is RAID 10 SCSI or SAN, yes?
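To put the two numbers in this question side by side, here is a small sketch. The 100 Mbit/s figure is just the nominal 100BASE-TX line rate; the thread does not say what link the file server actually uses, so treat the capacity as an assumption.

```python
FAST_ETHERNET_MBIT_S = 100.0  # nominal 100BASE-TX line rate (assumed link)
OBSERVED_MBIT_S = 40.0        # transfer rate reported in Matt's post


def mbit_to_mbyte_per_s(mbit_s: float) -> float:
    """Convert megabits per second to megabytes per second."""
    return mbit_s / 8


def utilisation(observed_mbit_s: float, capacity_mbit_s: float) -> float:
    """Fraction of the nominal link capacity actually used."""
    return observed_mbit_s / capacity_mbit_s


if __name__ == "__main__":
    print(f"observed: {mbit_to_mbyte_per_s(OBSERVED_MBIT_S)} MB/s")
    print(f"nominal:  {mbit_to_mbyte_per_s(FAST_ETHERNET_MBIT_S)} MB/s")
    print(f"share of link: {utilisation(OBSERVED_MBIT_S, FAST_ETHERNET_MBIT_S):.0%}")
```

If the link really were 100 Mbit/s, the copy ran at about 40% of it (5 MB/s against a nominal 12.5 MB/s), which is why the question of disk contention versus network contention is a fair one.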
ID: 639158 · Report as offensive
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 639176 - Posted: 12 Sep 2007, 2:34:22 UTC - in response to Message 638994.  

Matt,
I noticed that 6 splitters have been running this evening. I am getting WUs fine, but downloading them is taking 15 minutes each (stuck waiting for the file). Perhaps 6 splitters are clogging the file server again. It seems that running 4 splitters was just about right. You may want to make an adjustment again in the morning?
ID: 639176 · Report as offensive
Profile KWSN - MajorKong
Volunteer tester
Joined: 5 Jan 00
Posts: 2892
Credit: 1,499,890
RAC: 0
United States
Message 639204 - Posted: 12 Sep 2007, 3:27:52 UTC - in response to Message 639176.  

According to previous posts in here (technical news), they have found that running 3 splitter processes is just about enough to meet demand, but it won't build a surplus. Running more than 3 builds a surplus, but chokes the download process. I have an idea that they are trying to build up a respectable surplus before reducing the number of splitter processes again. They also might be testing it to see if moving those 64 directories has helped any.
ID: 639204 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 639318 - Posted: 12 Sep 2007, 8:21:02 UTC - in response to Message 639204.  

Well, if it was an experiment, it was a partial success. I looked at the state of the shrubbery this morning, and three machines had stuck downloads, but it took far fewer retries to get the pipes cleared than the last time we were in this state, and most new work came down at the first or second attempt. All this while six splitters were making about 17 results/second.

The next experiment comes when one of the splitters starts on a basketweave 'tape'.
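The splitter numbers mentioned in this thread can be turned into a back-of-the-envelope model. This is only a sketch: it assumes output scales linearly with the number of splitters (probably optimistic, given the file server contention discussed here) and it takes the earlier claim that 3 splitters roughly meet demand at face value.

```python
OBSERVED_SPLITTERS = 6
OBSERVED_RATE = 17.0      # results/second with six splitters (from the thread)
BREAK_EVEN_SPLITTERS = 3  # "just about enough to meet demand" (from the thread)


def per_splitter_rate(total_rate: float, splitters: int) -> float:
    """Average output of one splitter, assuming linear scaling."""
    return total_rate / splitters


def surplus_rate(splitters: int, rate_each: float, demand: float) -> float:
    """Results/second produced beyond demand (negative means a shortfall)."""
    return splitters * rate_each - demand


if __name__ == "__main__":
    rate_each = per_splitter_rate(OBSERVED_RATE, OBSERVED_SPLITTERS)
    demand = BREAK_EVEN_SPLITTERS * rate_each  # assumed demand, not measured
    for n in range(2, 7):
        print(f"{n} splitters: surplus {surplus_rate(n, rate_each, demand):+.2f}/s")
```

With these assumptions, demand comes out at about 8.5 results/second, so six splitters would build a surplus of about 8.5 results/second and four would build a little under 3.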
ID: 639318 · Report as offensive
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 639395 - Posted: 12 Sep 2007, 11:53:28 UTC - in response to Message 639204.  

I'm sure you're right. They are trying to find a sweet spot where the system runs as smoothly as it can without failure. Four splitters seem about right from my perspective, but if I post how my downloads are going, they can make the right decision for the servers.
ID: 639395 · Report as offensive
Wasabi Peanut
Joined: 14 Jul 99
Posts: 62
Credit: 32,646,911
RAC: 0
Switzerland
Message 639428 - Posted: 12 Sep 2007, 13:36:03 UTC

Down- and uploads are running pretty smoothly here, too. What I did notice since yesterday's outage is that my pending credit is now increasing again.

FWIW,

Ron
ID: 639428 · Report as offensive
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 639616 - Posted: 12 Sep 2007, 19:42:12 UTC - in response to Message 639428.  

I think that because everyone's work queues are filling up, there is more of a delay for results to finish and get back to the server. That would explain the increase in pending credit: it is waiting for everyone's result for each WU to be finished. Some pending credit means computers are busy crunching...which is good news.
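A toy model of that explanation, for illustration only. The `Workunit` class, the `quorum` value and the credit figures are all hypothetical, and this is not the BOINC validator: real validation also compares the results with each other, whereas here a workunit counts as validated as soon as enough results have arrived.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Workunit:
    """Hypothetical workunit that needs `quorum` returned results."""
    quorum: int
    returned: List[Tuple[str, float]] = field(default_factory=list)

    def report(self, host: str, claimed_credit: float) -> None:
        """Record a result returned by `host` with its claimed credit."""
        self.returned.append((host, claimed_credit))

    def is_validated(self) -> bool:
        """Simplification: enough results returned means validated."""
        return len(self.returned) >= self.quorum


def pending_credit(host: str, workunits: List[Workunit]) -> float:
    """Credit claimed by `host` on workunits still waiting for other hosts."""
    total = 0.0
    for wu in workunits:
        if wu.is_validated():
            continue
        total += sum(credit for h, credit in wu.returned if h == host)
    return total


if __name__ == "__main__":
    wu = Workunit(quorum=2)             # hypothetical quorum size
    wu.report("my_host", 50.0)
    print(pending_credit("my_host", [wu]))   # still waiting on a wingman
    wu.report("slow_host", 48.0)
    print(pending_credit("my_host", [wu]))   # quorum met, nothing pending
```

The longer the other hosts' queues, the longer each workunit sits below its quorum, and the larger the pending total grows in the meantime.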
ID: 639616 · Report as offensive
JLDun
Volunteer tester
Joined: 21 Apr 06
Posts: 573
Credit: 196,101
RAC: 0
United States
Message 640129 - Posted: 13 Sep 2007, 6:38:59 UTC - in response to Message 639616.  

"Some pending credit means computers are busy crunching...which is good news."

Or that some people temporarily abandoned ship for other projects, came back, and are processing multiple projects at once... [raises hand in embarrassment].
ID: 640129 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 640428 - Posted: 13 Sep 2007, 18:32:37 UTC

Just got my first bulk allocation of 'basketweave' WUs - AR > ~1.4, short run time - and guess what: the downloads are as sticky as ever.

Matt, don't be lulled into a false sense of security by the smooth downloads we saw while long-run-estimate work was being allocated. The stress-point is when the short stuff is going through.
ID: 640428 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 640597 - Posted: 13 Sep 2007, 22:46:36 UTC - in response to Message 640129.  

Why embarrassment? I am crunching for about 50 projects.


BOINC WIKI
ID: 640597 · Report as offensive
Profile RottenMutt
Joined: 15 Mar 01
Posts: 1011
Credit: 230,314,058
RAC: 0
United States
Message 641962 - Posted: 15 Sep 2007, 22:34:14 UTC

I seem to be having problems uploading completed work units. This started this morning, and I waited until the evening to post, thinking it might clear, but it hasn't.
Thanks
ID: 641962 · Report as offensive
Profile SMW

Joined: 16 May 99
Posts: 22
Credit: 29,285,238
RAC: 16
United States
Message 641991 - Posted: 15 Sep 2007, 23:34:03 UTC - in response to Message 641962.  

<---having the same issue here on four machines, both Mac and PC. I am thinking it's just the way it will be until Monday. Thank G-D I keep 10 days in my buffer. I am confident that Matt will take care of it when he can. :)
"It is better to be hated for what you are than to be loved for what you are not"
- Andre Gide (1869-1951)
ID: 641991 · Report as offensive
Profile littlegreenmanfrommars
Volunteer tester
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 642002 - Posted: 15 Sep 2007, 23:54:24 UTC - in response to Message 641991.  

The problem may be outside of SETI's control... it also seems to be happening to the Beta project. Just another dose of patience will see us through, I'd say.
ID: 642002 · Report as offensive
Wheel1
Volunteer tester

Joined: 3 Apr 99
Posts: 9
Credit: 2,173,021
RAC: 0
Sweden
Message 642019 - Posted: 16 Sep 2007, 0:30:40 UTC

Slightly frustrating also that the server-stats page shows everything as running just fine.
But http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;view=Octets;ranges=d%3Aw%3Am%3Ay tells us that traffic has dropped to the 15Mbit-range since about noon Berkeley time.
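The grapher link above is hard to read because its parameters are percent-encoded and separated by semicolons. A minimal sketch for decoding it with Python's standard library follows. The function name `decode_grapher_query` is made up for this example, and the query is split by hand on `;` because recent Python versions no longer treat semicolons as separators in `urllib.parse.parse_qs` by default.

```python
from urllib.parse import unquote, urlsplit

URL = (
    "http://fragment1.berkeley.edu/newcricket/grapher.cgi"
    "?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3"
    ";view=Octets;ranges=d%3Aw%3Am%3Ay"
)


def decode_grapher_query(url: str) -> dict:
    """Split a semicolon-separated query string and percent-decode the values."""
    query = urlsplit(url).query
    params = {}
    for pair in query.split(";"):
        key, _, value = pair.partition("=")
        params[key] = unquote(value)
    return params


if __name__ == "__main__":
    for key, value in decode_grapher_query(URL).items():
        print(f"{key} = {value}")
```

Decoded, the target is `/router-interfaces/inr-250/gigabitethernet2_3`, the view is `Octets`, and the ranges are `d:w:m:y` (presumably day, week, month and year).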
ID: 642019 · Report as offensive
kittyman Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 642025 - Posted: 16 Sep 2007, 1:01:05 UTC - in response to Message 642019.  

The server stats page is on the fritz along with the servers. If you check, its last update time is 18:50 UTC, about 7 hours ago.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 642025 · Report as offensive
Profile RottenMutt
Joined: 15 Mar 01
Posts: 1011
Credit: 230,314,058
RAC: 0
United States
Message 642379 - Posted: 16 Sep 2007, 15:39:48 UTC - in response to Message 642019.  
Last modified: 16 Sep 2007, 15:46:03 UTC

As Wheel1 posted, it dropped to 15 Mbit and seems to have cleared at 10pm :)
Everything is working correctly :)
I started to say "back to normal", but I quickly came to the realization that that statement would be ambiguous.

edit: scheduler, feeder and upload/download processes have been stopped, and results ready to send are dropping :(
ID: 642379 · Report as offensive
Profile Seejay
Volunteer tester
Joined: 5 Jul 06
Posts: 42
Credit: 37,125
RAC: 0
Italy
Message 642504 - Posted: 16 Sep 2007, 18:22:53 UTC - in response to Message 642379.  

In fact, there's been no XML update for the stats sites for 2 days. Matt, what's going on??
ID: 642504 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 642581 - Posted: 16 Sep 2007, 20:05:41 UTC - in response to Message 642504.  

Well, a new stat xml run has been done today.

However, if I had to guess, I would say it was yesterday's stat run trying to go off that precipitated the meltdown we just went through. They've been known to bring the backend to its knees in the past for various reasons.

Alinator
ID: 642581 · Report as offensive


 