The End of All Things (Oct 30 2008)



Message boards : Technical News : The End of All Things (Oct 30 2008)

Author Message
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 825050 - Posted: 30 Oct 2008, 23:00:44 UTC
Last modified: 30 Oct 2008, 23:00:55 UTC

Okay. So the assimilator memory leak wasn't a problem so much as an effect. Its consumption of resources still needs to be addressed, but it was only affecting itself, and was being aggravated by the other problems around it.

Poring through logs I confirmed that the network bursts were indeed due to Astropulse downloads - during the "baseline" 2 out of 100 workunit downloads are Astropulse, but during the "burst" 40 out of 100 are Astropulse. The Astropulse workunits are much larger than SETI@home workunits, hence the bandwidth consumption. I also confirmed it wasn't a single (or few) clients hitting us at once - connections were randomly distributed over many IP addresses.

It finally dawned on me, and like most things it is now painfully obvious in hindsight. The SETI@home and Astropulse splitters have separate high water marks. For SETI@home, if we get above 50000 results ready to send, we temporarily halt splitting. For Astropulse, it is still set pretty low at 2500. Every so often a splitter process checks the size of the queue to see if it should stop. Since there are many SETI@home splitters running at a time, and there is always a delay in transitioning state, thousands of workunits may be generated before the splitters actually realize they are above the high water mark. And then they go to sleep for a while - an hour or so - until the queue drains enough, at which point they wake up again and get back to work.
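The per-application check described above can be sketched roughly like this. The thresholds come from the post, but the function and names are a hypothetical illustration, not the actual BOINC splitter code:

```python
# Rough sketch of the per-application high-water-mark check.
# Thresholds are from the post; the code itself is illustrative.

HIGH_WATER = {"setiathome": 50_000, "astropulse": 2_500}

def should_split(app: str, ready_to_send: int) -> bool:
    # Each splitter only watches its own application's queue, so
    # Astropulse splitters keep running while SETI@home ones sleep.
    return ready_to_send < HIGH_WATER[app]
```

Because the queue size is only sampled periodically, many parallel splitters can overshoot the mark before any of them notices, which is the delay Matt describes.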

The thing is, during SETI@home's "sleep until we're needed again" phase the Astropulse splitters continue to run, since they haven't reached their high water mark even though it's much lower - those splitters are fewer in number and run slower. Now remember that when workunits are created, the transitioners also create respective results to "send." New results are id'ed serially - i.e. they are tagged with a number in the database which increases automatically. So during these periods you'll get an area in database id space rich in Astropulse results.

Moving on to the feeder. Since it's stupid regarding application types, it fills its own send queue with the oldest results ready to send regardless of application, and the way MySQL works this tends to mean in database id order. Of course with the ready-to-send queue at 50000 or so, we have to send out 50000 results before we finally see the effects of what happened above - many hours, usually. Then suddenly - bam! - 20 times more Astropulse workunits than normal. That arbitrary time delay really confused matters.
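A toy model makes the burst mechanics concrete. The names and data below are invented for illustration (this is not the real feeder), but the behavior matches the description: an application-blind feeder pulls results in id order, so an Astropulse-rich stretch of id space eventually hits the front of the queue all at once.

```python
# Toy model of an application-blind feeder: take the oldest
# ready-to-send results, which with auto-increment ids means
# lowest id first, regardless of application.

def fill_feeder_queue(results, slots):
    return sorted(results, key=lambda r: r["id"])[:slots]

# Simulate an id space where a stretch of Astropulse results was
# created while the SETI@home splitters slept.
results = [{"id": i, "app": "setiathome"} for i in range(100)]
results += [{"id": 100 + i, "app": "astropulse"} for i in range(40)]

batch = fill_feeder_queue(results, 100)   # all SETI@home for hours...
burst = fill_feeder_queue(results[100:], 40)  # ...then suddenly all Astropulse
```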

Anyway, one easy solution is to make the feeder smarter. It does have an "-allapps" flag to send to all applications equally. We were hesitant to use this before due to fear this will give too many shared memory slots to Astropulse - and it may very well cause periods of low work during peak periods as the feeder has half the memory for SETI@home workunits than it did. Nevertheless we turned this on today and it had an immediate, positive smoothing effect. Sweet.
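A hedged sketch of what an "-allapps"-style fill changes: the shared-memory slots get divided evenly among applications, each taking its own oldest results, instead of being filled in plain id order. The function below is an illustration of the idea, not BOINC's actual implementation.

```python
# Sketch of an "-allapps"-style feeder fill: split the shared-memory
# slots evenly among applications. Illustrative only.

def fill_slots_allapps(results_by_app, total_slots):
    per_app = total_slots // len(results_by_app)  # e.g. half each for two apps
    queue = []
    for app, results in results_by_app.items():
        # Each application gets its own oldest results, capped at per_app.
        queue.extend(sorted(results, key=lambda r: r["id"])[:per_app])
    return queue
```

Note that this is exactly the trade-off in the post: SETI@home now gets half the slots it used to, hence the worry about low work during peak periods.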

Other than that today... some data pipeline scripting, and continuing discussions amongst the gang regarding changing redundancy to zero - trying to wrap our brains around all the current bottlenecks and what will suffer depending on what we do. As it stands now, our servers most likely will not be able to support reducing redundancy all the way to zero *and* keeping up with current workunit demand. So we have to either improve our server i/o or figure out what other knobs to turn.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile Dr. C.E.T.I.
Avatar
Send message
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 825054 - Posted: 30 Oct 2008, 23:15:40 UTC
Last modified: 30 Oct 2008, 23:16:19 UTC

.

. . . again - WOW! - busy for You all @ Berkeley eh - Thanks for the Updates Matt (and to each of the others - Thanks too!)

ps - luv the title Sir ;)
____________
BOINC Wiki . . .

Science Status Page . . .

Profile Gary CharpentierProject donor
Volunteer tester
Avatar
Send message
Joined: 25 Dec 00
Posts: 12492
Credit: 6,806,407
RAC: 5,887
United States
Message 825055 - Posted: 30 Oct 2008, 23:16:22 UTC - in response to Message 825050.

Thanks for the update Jeff.

Richard HaselgroveProject donor
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8492
Credit: 49,790,470
RAC: 53,568
United Kingdom
Message 825068 - Posted: 30 Oct 2008, 23:50:29 UTC - in response to Message 825055.

Thanks for the update Jeff.

Jeff?

Is that Jeff, as in Mutt and Jeff?

PhonAcq
Send message
Joined: 14 Apr 01
Posts: 1622
Credit: 22,166,317
RAC: 3,879
United States
Message 825257 - Posted: 31 Oct 2008, 12:32:48 UTC
Last modified: 31 Oct 2008, 12:56:32 UTC

Ahh, was the -allapps flag set and everybody assumed this was going to work for 3 days until Matt gets back? If so, it is possible that somebody is being too optimistic. At this moment, ul/dl's have been stalled for hours and cricket isn't chirping anymore, like a canary in the mine.

// Edit: actually it isn't uploads but the requests for more work that appear to be frozen. But you all probably know this already //

PhonAcq
Send message
Joined: 14 Apr 01
Posts: 1622
Credit: 22,166,317
RAC: 3,879
United States
Message 825276 - Posted: 31 Oct 2008, 14:09:08 UTC

somebody, somewhere gave the network a kick and my ready to report queue has been cleared. thank you Mother Nature.

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 825296 - Posted: 31 Oct 2008, 16:08:21 UTC

The problems last night/this morning were random apache failure on the scheduler, which happens for no good reason from time to time. Jeff restarted it when he got in this morning.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

ruben
Send message
Joined: 30 Oct 08
Posts: 2
Credit: 55,124
RAC: 0
Belgium
Message 825365 - Posted: 31 Oct 2008, 18:13:33 UTC

Is this still an issue ? I keep getting http errors.

Profile the silver surfer
Avatar
Send message
Joined: 24 Feb 01
Posts: 131
Credit: 3,734,972
RAC: 0
Austria
Message 825368 - Posted: 31 Oct 2008, 18:22:31 UTC - in response to Message 825365.

Is this still an issue ? I keep getting http errors.


Sure is, and I´m running out of work.........
But at least I can do a bit for Milkyway !
____________

Tommy
Send message
Joined: 26 Jul 00
Posts: 9
Credit: 530,369
RAC: 0
United States
Message 825371 - Posted: 31 Oct 2008, 18:28:34 UTC

Someone please give the upload server a swift kick - it is very very very slow if it is working at all. Thanks, Tom

John G
Send message
Joined: 29 Dec 01
Posts: 63
Credit: 10,142,278
RAC: 0
Canada
Message 825376 - Posted: 31 Oct 2008, 18:44:19 UTC

Yes, now the upload server is acting up for some reason.

Profile speedimic
Volunteer tester
Avatar
Send message
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 825392 - Posted: 31 Oct 2008, 19:15:23 UTC

The server is up (it shows the Apache test page, at least).
Might be just the 90 Mbit/s wall we've been knocking on for a couple of hrs now.
____________
mic.


Profile the silver surfer
Avatar
Send message
Joined: 24 Feb 01
Posts: 131
Credit: 3,734,972
RAC: 0
Austria
Message 825414 - Posted: 31 Oct 2008, 20:00:42 UTC - in response to Message 825369.

Is this still an issue ? I keep getting http errors.


Sure is, and I´m running out of work.........
But at least I can do a bit for Milkyway !

Travesty....LOL....


Actually Milkyway has a new improved app = MUCH FASTER, released today,
and it crunches like hell !!! That`s what I was referring to !!
But that`s OFF TOPIC for this thread.

All the best,
Kurt
____________

gomeyer
Volunteer tester
Send message
Joined: 21 May 99
Posts: 488
Credit: 50,157,953
RAC: 0
United States
Message 825425 - Posted: 31 Oct 2008, 20:27:20 UTC

I'm getting an occasional upload complete, but absolutely zero downloads. I mean nothing. When the backlog is this bad they usually start dl'ing and time out part way through, but now it's not a single byte making it through.

Profile speedimic
Volunteer tester
Avatar
Send message
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 825450 - Posted: 31 Oct 2008, 21:25:34 UTC

I referred to the upload server mentioned in the post before mine.

This one is still up and I get all my uploads through after some tries (due to network load, I think).
On the download side at least bane is up.
Vader is not responding...again...

____________
mic.


gomeyer
Volunteer tester
Send message
Joined: 21 May 99
Posts: 488
Credit: 50,157,953
RAC: 0
United States
Message 825453 - Posted: 31 Oct 2008, 21:30:41 UTC - in response to Message 825444.

I'm getting an occasional upload complete, but absolutely zero downloads. I mean nothing. When the backlog is this bad they usually start dl'ing and time out part way through, but now it's not a single byte making it through.

Looks like a long weekend........LOL......crunch 'em if you got 'em.........

I'm OK for now with a 6 day cache on all but one machine. I hate a cache that large but the way things have been you almost have no choice.

That one machine was being kept with a small/reasonable queue as a sort of test I guess. This is what I get for trying to maintain a small/reasonable queue. This is why people go nuts and keep 10 days worth, a practice which I despise.

Another thing worries me. That test machine has 26 pending downloads. (All uploads done/reported and the queue is empty.) But the web page shows 31 work units in progress. So, now there are 5 ghosts on just this one machine. If that's an indication of a general problem this could get really ugly.


Copyright © 2014 University of California