Oh no! Bruno! (Oct 29 2008)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 824721 - Posted: 29 Oct 2008, 22:55:54 UTC

Well we haven't really gotten completely around the general problems with our raw data drives being unreadable via our tangled web of SATA enclosures and USB converters, etc. However I did find one thing this morning which helped. Turns out one enclosure just simply stopped working. Long story short, upon very careful inspection I found one of the drive bays had a tiny tiny piece of pink fluff wedged in the SATA power plug. The fluff was from our shipping containers to/from Arecibo. Bits of it get torn off from regular use, and it looks like some got stuck on a drive, which then got wedged into the power plug upon insertion into the enclosure. I dug it out, replaced the drives, and they were visible again. At least for now. I do appreciate the "modprobe" suggestion in the last thread, which may help other similar issues.

Jeff and I were discussing a lot of stuff today, focused mainly on future planning and needs, i.e. what are our current bottlenecks, how do we fix them, and then what will our new bottlenecks be? We're resurrecting conversation with campus, possibly to have them research the current cost/feasibility of increasing our bandwidth. We're also internally discussing needs regarding a potential move towards less redundancy - which will pretty much double our load, if we decide to keep up with demand and are able to. We were also scratching our heads about these semi-regular bandwidth spikes that max out our current bandwidth and wreak general havoc for an hour or so at a time.

As for that last thing, I found an important clue today. The assimilator code has a memory leak - it's had the leak for years now, but it's usually not a problem. It eventually reaches a limit, fails, then restarts within a few minutes. Today I found the assimilators have been dying quite often recently, and their failures are perfectly in tandem with upward bumps we see in upload traffic. No surprise, as the assimilators and uploads happen on the same machine (bruno) - so if bloated, resource-consuming assimilators suddenly disappear from the process queue, more resources are suddenly given to uploads.

The story goes on from there, but I have to get back to work and will leave the conclusion until tomorrow. You see, I put in an "assimilator killer" cronjob today that runs every two hours to restart the assimilators regularly and prevent them from bloating too much. I think observing the effects of that over the next 24 hours will inform what I think about other network problems we've been having...
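
(For the curious, the cron job is roughly along these lines - the script name, paths and binary name here are illustrative rather than a copy of what's actually installed:)

  # /etc/crontab - every two hours, on the hour, recycle the assimilators
  0 */2 * * *  boincadm  /usr/local/bin/restart_assimilators.sh

  # /usr/local/bin/restart_assimilators.sh
  #!/bin/sh
  # Ask the running assimilators to exit cleanly (TERM, not KILL), wait a
  # little, then let the project's start script bring fresh copies back up.
  killall -TERM assimilator
  sleep 30
  cd /home/boincadm/projects/sah && ./bin/start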

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 30651
Credit: 53,134,872
RAC: 32
United States
Message 824751 - Posted: 30 Oct 2008, 0:41:01 UTC - in response to Message 824721.  

Thanks for the update.

I assume if the memory leak has been known about for this long, it is the kind the system forces on you by its allocating a block but not letting you free it. Otherwise I'd just suggest fixing it.

Gary


Neil Blaikie
Volunteer tester
Joined: 17 May 99
Posts: 143
Credit: 6,652,341
RAC: 0
Canada
Message 824752 - Posted: 30 Oct 2008, 0:44:15 UTC

Glad to see after some hard work from you guys, you seem to have sorted quite a bit today.

Hopefully all your much appreciated hard work pays off in the long run.

Thank you to all of you there and fingers crossed for the next 24/48 hours or so.

P.S. Hope the air conditioner problem has been resolved as well - bring them to Canada and they will stay nice and cold with the snow we had last night! (East coast anyway) :-)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 824772 - Posted: 30 Oct 2008, 2:20:07 UTC - in response to Message 824751.  

I assume if the memory leak has been known about for this long, it is the kind the system forces on you by its allocating a block but not letting you free it. Otherwise I'd just suggest fixing it.

Yeah.. from what I understand it's lost in the depths of some Informix library code.

- Matt


-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 824804 - Posted: 30 Oct 2008, 3:37:32 UTC

Good sleuthing guys!

The tangled web of services, processes, databases, hardware, etc., otherwise known as the 'Seti servers', is a much more complex piece of work than many folks realize, I think.

Sounds like you are getting to the bottom of at least some of the persistently perplexing issues.

Keep up the good work, and as usual, thanks for keeping all of us in the loop and informed.

Mark
"Freedom is just Chaos, with better lighting." Alan Dean Foster

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 824878 - Posted: 30 Oct 2008, 10:44:44 UTC - in response to Message 824721.  

As for that last thing, I found an important clue today. The assimilator code has a memory leak - it's had the leak for years now, but it's usually not a problem. It eventually reaches a limit, fails, then restarts within a few minutes. Today I found the assimilators have been dying quite often recently, and their failures are perfectly in tandem with upward bumps we see in upload traffic. No surprise, as the assimilators and uploads happen on the same machine (bruno) - so if bloated, resource-consuming assimilators suddenly disappear from the process queue, more resources are suddenly given to uploads.

It's great that you're making progress on this, but - always hoping to be proved wrong - it doesn't sound as if you've reached the ultimate smoking gun yet.

From where I sit, the upward bumps in upload traffic also coincide with the (end of the) spikes in download traffic. From your explanation (and looking at the processes running on Bruno), I can't quite see why the download traffic should rise so significantly half-an-hour before the assimilators are due to run out of memory! In fact, I think you've attributed the download spikes to the batch-release of Astropulse work: that should reduce the load on the assimilators, because hosts which have just received a big lump of AP work shouldn't need to contact the schedulers (and hence report any work they might have completed) for a long time.

W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 19062
Credit: 40,757,560
RAC: 67
United Kingdom
Message 824886 - Posted: 30 Oct 2008, 11:32:43 UTC - in response to Message 824878.  

that should reduce the load on the assimilators, because hosts which have just received a big lump of AP work shouldn't need to contact the schedulers (and hence report any work they might have completed) for a long time.

That's not necessarily true, is it, if a multi-core host requests work and gets one AP unit. One, it could go to the end of the cache queue while the host is still doing MB tasks; and two, once the AP task starts, the other cores can still be doing MB work and hammering on the door.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 824888 - Posted: 30 Oct 2008, 11:53:00 UTC - in response to Message 824886.  

that should reduce the load on the assimilators, because hosts which have just received a big lump of AP work shouldn't need to contact the schedulers (and hence report any work they might have completed) for a long time.

That's not necessarily true, is it, if a multi-core host requests work and gets one AP unit. One, it could go to the end of the cache queue while the host is still doing MB tasks; and two, once the AP task starts, the other cores can still be doing MB work and hammering on the door.

Absolutely, and if it's one of my 5.10.13 boxes, it'll be reporting the MB work even if the cache doesn't need topping up following the 20-hour work injection.

My only point was that any effect of the issuing of a large number of AP tasks would be a tendency to decrease, rather than increase, the assimilator workload: so I couldn't see any causal link between 'issue AP work' and 'assimilator failure 30 minutes later'.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 824898 - Posted: 30 Oct 2008, 13:02:34 UTC
Last modified: 30 Oct 2008, 13:07:33 UTC

The end of the current download spike seems to illustrate the point perfectly. Data from the Cricket graph page:

Average bits in (for the day): 
Cur: 92.82 Mbits/sec
Avg: 74.49 Mbits/sec
Max: 96.02 Mbits/sec

 Average bits out (for the day): 
Cur: 16.44 Mbits/sec
Avg: 10.80 Mbits/sec
Max: 22.52 Mbits/sec
 
Last updated at Thu Oct 30 05:52:15 2008

All it took was the slightest easing-off in the download rate ('bits in' - 92.8Mb instead of 96.0Mb), and the upload rate ('bits out') jumped by 60%. Four minutes later, download was 92.4Mb, upload 22.75Mb: another four minutes, at 79.5Mb/17.0Mb, the storm is over.

Edit - and a page stall on posting suggests that the database is really, really busy. Could the arrow of causality be running in the opposite direction?

Lots of downloads --> uploads inhibited --> upload spike when the downloads finish --> (reporting?) --> lots of work to assimilate --> assimilators reach their memory leak limit and fall over?

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 824913 - Posted: 30 Oct 2008, 13:56:16 UTC
Last modified: 30 Oct 2008, 14:26:13 UTC

Matt,

Would it be possible for you to give the List of recently connected client types script a kick, please? Not just because other people have mourned its passing, but because the data it would provide is directly relevant to the analysis of this upload/download problem.

The weak link in my theory in the last post was the --> (reporting?) --> stage. But last time I looked at the 'client types' list, I remember being surprised by the number of BOINC v6.1.0 clients reporting. If that remains the case, the weak link in my theory is explained.

v6.1.0 is a BOINC client (actually, a family of at least five BOINC clients) compiled and distributed by Crunch3r, which features various flavours of 'Report Results Immediately' - the early ones in the family did indeed report results immediately, and suffered validate errors as a result. Later family members report 60 seconds after uploading.

If significant numbers of v6.1.0 clients appear on the 'client types' list, then we have an explanation for heavy assimilator activity just after the download spike starts to ease off.

kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 824940 - Posted: 30 Oct 2008, 16:15:38 UTC

And a side note Matt.....

I have not had as much trouble as some, but I am getting a LOT of page load hangs in the forums this morning....much more than usual. Have to hit the reload button and then the second time they usually load OK.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 824942 - Posted: 30 Oct 2008, 16:24:57 UTC

Has anyone mentioned or tried to put the SETI servers (workloads, downloads, uploads, etc.) into a discrete event simulation? All this talk about "this would help" etc. amounts to educated guesses at best. It's my master's research, so I was just wondering, because it can be a very useful exercise.

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 824951 - Posted: 30 Oct 2008, 17:05:15 UTC - in response to Message 824942.  

Has anyone mentioned or tried to put the SETI servers (workloads, downloads, uploads, etc.) into a discrete event simulation?


Actually, for internal purposes, and perhaps for the people who follow these threads in detail, I started the exercise of writing down all the "entities." Entities include: upload server, download server, mysql database, informix database, splitters, web site, forums, etc. etc. etc. as well as what each entity is (a) dependent upon for good performance and (b) affected by tangentially.

Long story short, after a few entities I realized that the dependency permutations were numerous enough to be effectively infinite and I have better things to work on.

Example: why are peaks in downloads sparking a delayed peak in uploads? Shouldn't it be the other way around (you'd expect a peak in uploads to result in the clients then requesting more work)?

Well, the peak in downloads maxes out our bandwidth. That means nfsd's running on gowron (the workunit storage server) are blocked, as clients are holding onto httpd download processes much longer to get through the logjam. That means the other processes that access the workunit storage, like the file deleters and validators, are also getting bogged down. Turns out these processes run on the result storage server, i.e. bruno, which also happens to be the upload server (as it makes sense for the upload server to write to its own disks). Well, since the file deleters and validators (which also like to be running on the server where the results are) are gummed up, this consumes resources that the upload httpds require as well. So when that download peak starts wrapping up, gowron's load goes down, the file deleters and validators get a breath of air, and resources are given back to the uploads.
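
(Purely as an illustration of how that chain shows up on the machines - not a prescription, and not necessarily the exact commands we use - you can watch it with something like:)

  # on gowron: are the nfsd threads struggling to keep up?
  nfsstat -s
  # on bruno: deleters/validators stuck waiting on NFS sit in
  # uninterruptible sleep ("D" state)
  ps -eo state,comm | grep '^D'
  # and how many httpd slots are tied up at the moment
  ps -C httpd --no-headers | wc -l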

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 824953 - Posted: 30 Oct 2008, 17:09:53 UTC - in response to Message 824951.  

Niiiiice ;D

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 824956 - Posted: 30 Oct 2008, 17:20:21 UTC

Would turning off the AP production and downloads for a few days be able to resolve matters a bit? Or are you convinced they are the source of the spikes and isolating that mechanism is not important?

Iztok s52d (and friends)
Joined: 12 Jan 01
Posts: 136
Credit: 393,469,375
RAC: 116
Slovenia
Message 824966 - Posted: 30 Oct 2008, 18:20:48 UTC - in response to Message 824772.  

Hi!

Regarding assimilators: killing them from a cronjob might leave some inconsistency. (I hate killall -KILL.) If killall -TERM assimilator is handled properly, just ignore this.

I would put a counter into the code, like: assimilate 1000 WUs and stop. Then just restart them with a shell script (a sketch of such a wrapper is below).

Benefits:
- controllable memory leak
- they do not all restart at the same time.
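
Purely as a sketch (the binary name and the --max_wus switch are made up here, standing in for the counter described above), the wrapper could be as simple as:

  #!/bin/bash
  # run the assimilator until it has handled 1000 WUs and exits cleanly,
  # then start a fresh copy; stagger restarts so they don't all line up
  while true; do
      ./assimilator --max_wus 1000
      sleep $(( RANDOM % 120 ))
  done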


73
iztok

Ah, 10 years ago some serious SW ran with DOSEMU using this nice feature....

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 824967 - Posted: 30 Oct 2008, 18:20:51 UTC - in response to Message 824898.  

The end of the current download spike seems to illustrate the point perfectly. Data from the Cricket graph page:

Average bits in (for the day): 
Cur: 92.82 Mbits/sec
Avg: 74.49 Mbits/sec
Max: 96.02 Mbits/sec

 Average bits out (for the day): 
Cur: 16.44 Mbits/sec
Avg: 10.80 Mbits/sec
Max: 22.52 Mbits/sec
 
Last updated at Thu Oct 30 05:52:15 2008

All it took was the slightest easing-off in the download rate ('bits in' - 92.8Mb instead of 96.0Mb), and the upload rate ('bits out') jumped by 60%. Four minutes later, download was 92.4Mb, upload 22.75Mb: another four minutes, at 79.5Mb/17.0Mb, the storm is over.

Richard,

If you can track down a copy of "Computer Networks" by Andrew Tanenbaum, it is a really good read.

What you're reporting matches one of the tables in the book exactly: as loading increases to near 100% the actual throughput drops -- the amount of time dealing with errors, lost packets and the like becomes significant and efficiency goes way down.

This is one of the reasons I've mentioned p-Persistence: if "we" can crank down the load to stay right around 92Mb, the overall throughput will go up.

Works really well.

-- Ned

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 824978 - Posted: 30 Oct 2008, 18:50:15 UTC - in response to Message 824967.  

Richard,

If you can track down a copy of "Computer Networks" by Andrew Tanenbaum, it is a really good read.

What you're reporting matches one of the tables in the book exactly: as loading increases to near 100% the actual throughput drops -- the amount of time dealing with errors, lost packets and the like becomes significant and efficiency goes way down.

This is one of the reasons I've mentioned p-Persistence: if "we" can crank down the load to stay right around 92Mb, the overall throughput will go up.

Works really well.

-- Ned

All of which suggests that the most important, but still do-able, task with regard to this set of network congestion is to track down and eliminate the cause of the download 'spikes'. That might be randomised task allocation, a more even flow of AP tasks into the workflow, AP splitter inhibition when the storage is getting full, or some combination of the above.

Of course, 'important' is a relative term, and there must be more important tasks in the overall scheme of things: but I would imagine that consciously running the network up to (and beyond) its throughput limit is going to have undesirable project side-effects - as Matt has described in today's post.

Eliminating the network 'spiking' isn't going to eliminate network congestion entirely: we'll be left with outage recovery congestion, and the 'all WUs are shorty WUs' congestion - neither of which can probably be solved without a fatter pipe down the hill. But a smoothing function on the downloads would be a good start.

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 825002 - Posted: 30 Oct 2008, 21:00:27 UTC - in response to Message 824978.  

Richard,

If you can track down a copy of "Computer Networks" by Andrew Tanenbaum, it is a really good read.

What you're reporting matches one of the tables in the book exactly: as loading increases to near 100% the actual throughput drops -- the amount of time dealing with errors, lost packets and the like becomes significant and efficiency goes way down.

This is one of the reasons I've mentioned p-Persistence: if "we" can crank down the load to stay right around 92Mb, the overall throughput will go up.

Works really well.

-- Ned

All of which suggests that the most important, but still do-able, task with regard to this set of network congestion is to track down and eliminate the cause of the download 'spikes'. That might be randomised task allocation, a more even flow of AP tasks into the workflow, AP splitter inhibition when the storage is getting full, or some combination of the above.

Of course, 'important' is a relative term, and there must be more important tasks in the overall scheme of things: but I would imagine that consciously running the network up to (and beyond) its throughput limit is going to have undesirable project side-effects - as Matt has described in today's post.

Eliminating the network 'spiking' isn't going to eliminate network congestion entirely: we'll be left with outage recovery congestion, and the 'all WUs are shorty WUs' congestion - neither of which can probably be solved without a fatter pipe down the hill. But a smoothing function on the downloads would be a good start.

Richard,

I mentioned the book because I thought you'd find it interesting.

The answer to the congestion question (also from Tanenbaum) is a mechanism to reduce the demand on the network.

The BOINC client decides when it is going to connect (for any reason), and the load at the servers is primarily coming from clients.

In another post, Joe Segur said something to the general effect that "BOINC retries scheduler requests each minute." There is supposed to be a random back-off, but I'm not sure how well it works, and the 11 second delay message comes from the scheduler, so a connection has to work for that to happen.

Worst case, assume 1 minute. If BOINC could be "set" to 0.1-persistent, it would retry at some random time, but on average once every ten minutes. That'd lower the connection rate by an order of magnitude.
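
(A toy sketch of the idea - not BOINC code, just bash showing what "0.1-persistent" means in practice:)

  #!/bin/bash
  # each one-minute tick, connect with probability 0.1; the mean retry
  # interval becomes ~10 minutes instead of 1
  while true; do
      sleep 60
      if [ $(( RANDOM % 10 )) -eq 0 ]; then
          echo "$(date): contacting the scheduler"
          # (here the real client would make its scheduler request)
      fi
  done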

... and except for those who watch BOINC like a hawk, on the client side it would be unnoticeable.

Even better if the project could broadcast this info. DNS can do that nicely.
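
(For instance - the record name and contents here are entirely hypothetical - the project could publish a TXT record like

  backoff.setiathome.berkeley.edu.  300  IN  TXT  "p=0.1 min_retry=600"

and clients could pick it up with nothing more exotic than

  dig +short TXT backoff.setiathome.berkeley.edu

so the policy could be tuned server-side, though of course the clients would first have to be taught to do the lookup.)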

-- Ned

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 825012 - Posted: 30 Oct 2008, 21:38:02 UTC - in response to Message 825002.  

Richard,

I mentioned the book because I thought you'd find it interesting.

I'll certainly keep my eye open for it - thanks.

The answer to the congestion question (also from Tanenbaum) is a mechanism to reduce the demand on the network.

The BOINC client decides when it is going to connect (for any reason), and the load at the servers is primarily coming from clients.

In another post, Joe Segur said something to the general effect that "BOINC retries scheduler requests each minute." There is supposed to be a random back-off, but I'm not sure how well it works, and the 11 second delay message comes from the scheduler, so a connection has to work for that to happen.

From my experience, if BOINC fails to connect with the scheduler at the first attempt, it backs off by 1 minute the first couple of times, then increases the back-off and fairly quickly reaches a state of "back-off by a random amount between 0 and 4 hours". Then, after a while (10 attempts?), it starts to worry that it might be looking in the wrong place: hence the 'master file download', and a reversion to the initial 1-minute backoffs.

Worst case, assume 1 minute. If BOINC could be "set" to 0.1-persistent, it would retry at some random time, but on average once every ten minutes. That'd lower the connection rate by an order of magnitude.

Yes, in the case of these download spikes, which tend to last for ~30 minutes, an initial interval of 10 minutes would smooth things out usefully. Though I'm not sure how it would play with every BOINC project, and every cause of network failures.

... and except for those who watch BOINC like a hawk, on the client side it would be unnoticeable.

Even better if the project could broadcast this info. DNS can do that nicely.

-- Ned

Correct me if I'm wrong, but wouldn't it be necessary for new BOINC client code to be incorporated to listen for, and act upon, the DNS-embedded backoff instructions? That sounds good as a long-term strategy, but we both know how long it can take to get a new concept added into BOINC, and then for it to be taken up by the majority of the user base (heck, some people are still running BOINC v3).

That, in my view, puts it into the un-doable category, at least in the short term. I was hoping Matt might go on a spike-hunt on the grounds that (1) it could be quick, and (2) he wouldn't need any cooperation from other agencies to code or deploy something outside his direct control.