Boiling Down Chicken Soup (Sep 11 2007)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Outside of discussion about not-too-distant-future database replication, we didn't really need to think much today about the science database server that has been giving us grief the past week. As mysterious as the initial fake drive failures were, it's even weirder that they suddenly stopped altogether. I fully tested the "failed" drives - they're fine. Anyway... we had the usual outage today, which was mundane except that I took the time to move some of the directories off the workunit file server and onto a lesser-used server. We already have all the workunits hashed out over 1024 directories, so it's easy to move whole directories and make symlinks and everybody's happy. However, these directories are HUGE (of course), so it took about 3 hours to move only 64 of them (going about 40 Mbits/sec over the local network during the transfer). We weren't ready to have the project down for a whole day, so we'll leave it at that for now. So, we offloaded 6.25% of the traffic (64 of 1024 directories) from the bottlenecked file server so far. We'll see if that changes anything. Meanwhile, Jeff/Eric/I are doing some major cleanup on our internal software suites - so many nagging "make" issues to fix, so little time. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Dr. C.E.T.I. Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
. . . seems like you're each doing a damn good job - mi system is 'flyin' along with grace . . . so keep up the good work Berkeley. Thanks for the Post Matt, also . . . |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Glad to hear the maintenance window went well today. Was that low performance due to the server being busy overall (accessing the drives) or choking due to other network traffic? Any respectable RAID 10 configuration should be able to max out a 100base-TX network card. Your file server is RAID 10 SCSI or SAN, yes? |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Matt, I noticed that 6 splitters have been running this evening. I am getting WU fine, but downloading them is taking 15 minutes each (stuck waiting for the file). Perhaps 6 splitters are clogging the file server again. It seems 4 splitters running was just about right. You may want to make an adjustment again in the morning? |
KWSN - MajorKong Joined: 5 Jan 00 Posts: 2892 Credit: 1,499,890 RAC: 0 |
According to previous posts in here (Technical News), they have found that running 3 splitter processes is just about enough to meet demand, but it won't build a surplus. Running more than 3 builds a surplus, but chokes the download process. I have an idea that they are trying to build up a respectable surplus before reducing the number of splitter processes again. They also might be testing it to see if moving those 64 directories has helped any. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Well, if it was an experiment, it was a partial success. I looked at the state of the shrubbery this morning, and three machines had stuck downloads: but it took far fewer retries to get the pipes cleared than last time we were in this state: and most new work came down at the first or second attempt. All this while six splitters were making about 17 results/second. The next experiment comes when one of the splitters starts on a basketweave 'tape'. |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
I'm sure you're right. They are trying to find a sweet spot where the system runs as smoothly as it can without failure. Four splitters seem about right from my perspective, but if I post how my downloads are going, they can make the right decision for the servers. |
Wasabi Peanut Joined: 14 Jul 99 Posts: 62 Credit: 32,646,911 RAC: 0 |
Down- and uploads are running pretty smoothly here, too. What I did notice since yesterday's outage is that my pending credit is now increasing again. FWIW, Ron |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
"Down- and uploads are running pretty smoothly here, too. What I did notice since yesterday's outage is that my pending credit is now increasing again." I think that because everyone's work queues are filling up, there is more of a delay for results to finish and get back to the server. That would explain the increase in pending credit: it is waiting for everyone's result for each WU to be finished. Some pending credit means computers are busy crunching... which is good news. |
JLDun Joined: 21 Apr 06 Posts: 573 Credit: 196,101 RAC: 0 |
"What I did notice since yesterday's outage is that my pending credit is now increasing again." Or that some people temporarily abandoned ship for other projects, came back, and are processing multiple projects at once... [raises hand in embarrassment]. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Just got my first bulk allocation of 'basketweave' WUs - AR > ~1.4, short run time - and guess what: the downloads are as sticky as ever. Matt, don't be lulled into a false sense of security by the smooth downloads we saw while long-run-estimate work was being allocated. The stress-point is when the short stuff is going through. |
John McLeod VII Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
"What I did notice since yesterday's outage is that my pending credit is now increasing again." Why embarrassment? I am crunching for about 50 projects. BOINC WIKI |
RottenMutt Joined: 15 Mar 01 Posts: 1011 Credit: 230,314,058 RAC: 0 |
I seem to be having problems uploading completed work units. This started this morning, and I waited to post until the evening thinking it might clear, but it hasn't. Thanks. |
SMW Joined: 16 May 99 Posts: 22 Credit: 29,285,238 RAC: 16 |
"I seem to be having problems uploading completed work units. This started this morning, and I waited to post until the evening thinking it might clear, but it hasn't." <--- Having the same issue here on four machines, both Mac and PC. I am thinking it's just the way it will be until Monday. Thank G-D I keep 10 days in my buffer. I am confident that Matt will take care of it when he can. :) "It is better to be hated for what you are than to be loved for what you are not" - Andre Gide (1869-1951) |
littlegreenmanfrommars Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0 |
"I seem to be having problems uploading completed work units. This started this morning, and I waited to post until the evening thinking it might clear, but it hasn't." Prob may be outside of SETI control... it seems to be also happening to the Beta project. Just another dose of patience will see us through, I'd say. |
Wheel1 Joined: 3 Apr 99 Posts: 9 Credit: 2,173,021 RAC: 0 |
Slightly frustrating also that the server-stats page shows everything as running just fine. But http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;view=Octets;ranges=d%3Aw%3Am%3Ay tells us that traffic has dropped to the 15Mbit-range since about noon Berkeley time. |
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
"Slightly frustrating also that the server-stats page shows everything as running just fine." The server stats page is on the fritz with the servers; if you check, the last update time is 18:50 UTC, about 7 hours ago. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
RottenMutt Joined: 15 Mar 01 Posts: 1011 Credit: 230,314,058 RAC: 0 |
"Slightly frustrating also that the server-stats page shows everything as running just fine." As Wheel1 posted, it dropped to 15 Mb and seems to have cleared at 10pm :) Everything is working correctly :) I started to say "back to normal", but I quickly came to the realization that that statement would be ambiguous. Edit: the scheduler, feeder and upload/download processes have been stopped and results ready to send are dropping :( |
Seejay Joined: 5 Jul 06 Posts: 42 Credit: 37,125 RAC: 0 |
"Slightly frustrating also that the server-stats page shows everything as running just fine." In fact, there's been no XML update for the stats sites for 2 days. Matt, what's going on?? |
Alinator Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
"In fact, there's been no XML update for the stats sites for 2 days. Matt, what's going on??" Well, a new stats XML run has been done today. However, if I had to guess, I would say it was yesterday's stats run trying to go off that precipitated the meltdown we just went through. They've been known to bring the backend to its knees in the past for various reasons. Alinator |
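
Matt's opening post describes moving whole workunit directories to a lesser-used server and leaving symlinks behind. A rough Python sketch of that idea follows, together with the arithmetic behind the numbers he quotes. This is not the project's actual tooling: the function names, the directory name `3ff`, and the paths are all hypothetical, and the way workunits are hashed into the 1024 directories is not given in the thread, so it is not modelled here. The transfer-size figure assumes the 40 Mbits/sec rate was sustained for the full 3 hours.

```python
import os
import shutil
import tempfile


def relocate_with_symlink(src_dir: str, dest_root: str) -> str:
    """Move src_dir under dest_root and leave a symlink at the old path.

    Hypothetical helper: readers that open files through the old path
    keep working because the symlink points at the new location.
    """
    src_dir = os.path.abspath(src_dir)
    dest_dir = os.path.join(os.path.abspath(dest_root), os.path.basename(src_dir))
    shutil.move(src_dir, dest_dir)  # falls back to copy + delete across filesystems
    os.symlink(dest_dir, src_dir, target_is_directory=True)
    return dest_dir


def offloaded_fraction(moved: int, total: int) -> float:
    """Share of directories (and, if load is spread evenly, traffic) moved."""
    return moved / total


def transferred_gigabytes(mbits_per_sec: float, hours: float) -> float:
    """Data moved at a sustained rate, in decimal gigabytes."""
    bytes_moved = mbits_per_sec * 1e6 / 8 * hours * 3600
    return bytes_moved / 1e9


if __name__ == "__main__":
    # Demonstrate on throwaway directories rather than real servers.
    root = tempfile.mkdtemp()
    busy = os.path.join(root, "busy_server")
    quiet = os.path.join(root, "quiet_server")
    os.makedirs(os.path.join(busy, "3ff"))
    os.makedirs(quiet)

    relocate_with_symlink(os.path.join(busy, "3ff"), quiet)
    print(os.path.islink(os.path.join(busy, "3ff")))   # True

    print(f"{offloaded_fraction(64, 1024):.2%}")       # 6.25%
    print(f"{transferred_gigabytes(40, 3):.0f} GB")    # 54 GB
```

Under those assumptions, the 64 directories moved in 3 hours amount to roughly 54 GB, which gives a sense of why moving all 1024 in one outage was not practical.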