The End of All Things (Oct 30 2008)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 825050 - Posted: 30 Oct 2008, 23:00:44 UTC
Last modified: 30 Oct 2008, 23:00:55 UTC

Okay. So the assimilator memory leak wasn't a problem so much as an effect. Its consumption of resources still needs to be addressed, but it was only affecting itself, and being aggravated by the other problems around it.

Poring through logs I confirmed that the network bursts were indeed due to Astropulse downloads - during the "baseline" 2 out of 100 workunit downloads are Astropulse, but during the "burst" 40 out of 100 are Astropulse. The Astropulse workunits are much larger in size than SETI@home workunits, hence the bandwidth consumption. I also confirmed it wasn't a single (or few) clients hitting us at once - connections were randomly distributed over many IP addresses.
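To put a rough number on that shift, here is a back-of-the-envelope sketch. The two workunit sizes are illustrative assumptions, not figures from this post; only the 2-in-100 and 40-in-100 mixes come from the logs described above.

```python
# Illustrative sizes only; the post does not state them.
SETI_WU_MB = 0.37        # assumed size of one SETI@home workunit, in MB
ASTROPULSE_WU_MB = 8.0   # assumed size of one Astropulse workunit, in MB


def mb_per_100_downloads(astropulse_count: int) -> float:
    """Total MB moved by 100 downloads, astropulse_count of which are Astropulse."""
    seti_count = 100 - astropulse_count
    return seti_count * SETI_WU_MB + astropulse_count * ASTROPULSE_WU_MB


baseline = mb_per_100_downloads(2)    # "baseline": 2 of 100 are Astropulse
burst = mb_per_100_downloads(40)      # "burst": 40 of 100 are Astropulse
ratio = burst / baseline

print(f"baseline: {baseline:.1f} MB, burst: {burst:.1f} MB, ratio: {ratio:.1f}x")
```

With these assumed sizes, the same number of downloads moves several times more data during a burst, even though the request count is unchanged.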

It finally dawned on me, and now, like most things, it is painfully obvious in hindsight. The SETI@home and Astropulse splitters have separate high water marks. For SETI@home, if we get above 50000 results ready to send, we temporarily halt splitting. For Astropulse, it is still set pretty low at 2500. Every so often a splitter process checks to see the size of the queue and if it should stop. Since there are many SETI@home splitters running at a time, and there is always a delay in transitioning state, thousands of workunits may be generated before the splitters actually realize they are above the high water mark. And then they go to sleep for a while - like an hour or so - until the queue drains enough and they wake up again and get back to work.
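The overshoot can be illustrated with a toy model. This is only a sketch: the splitter counts, production rates, and polling interval are made-up numbers, and draining of the queue is ignored. Only the two high water marks (50000 and 2500) come from the post.

```python
def simulate_overshoot(high_water, n_splitters, per_tick, check_every, start):
    """Return the queue size at the moment the splitters notice the high water mark.

    Splitters add work every tick but only look at the queue size every
    check_every ticks, so they keep producing between checks.
    """
    if n_splitters * per_tick <= 0:
        raise ValueError("splitters must produce work for the model to terminate")
    queue = start
    tick = 0
    while True:
        if tick % check_every == 0 and queue >= high_water:
            return queue
        queue += n_splitters * per_tick
        tick += 1


# Hypothetical numbers: many fast SETI@home splitters, few slow Astropulse ones.
seti_peak = simulate_overshoot(50000, n_splitters=6, per_tick=10, check_every=60, start=49000)
ap_peak = simulate_overshoot(2500, n_splitters=2, per_tick=1, check_every=60, start=2400)

print("SETI@home overshoot:", seti_peak - 50000)   # 2600 with these numbers
print("Astropulse overshoot:", ap_peak - 2500)     # 20 with these numbers
```

The more splitters there are and the faster they run, the further the queue sails past the mark before anyone notices.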

The thing is, during SETI@home's "sleep until we're needed again" phase the Astropulse splitters continue to run since they haven't reached their high water mark even though it's much lower - those splitters are fewer in number and run slower. Now remember that when workunits are created, the transitioners also create respective results to "send." New results are id'ed serially - i.e. they are tagged with a number in the database which increases automatically. So during these periods you'll get an area in database id space rich in Astropulse results.

Moving on to the feeder. Since it's stupid regarding application types, it fills its own send queue with the oldest results ready to send regardless of application, and the way MySQL works this tends to mean in database id order. Of course with the ready-to-send queue at 50000 or so, we have to send out 50000 results before we finally see the effects of what happened above - many hours, usually. Then suddenly - bam! - 20 times more Astropulse workunits than normal. That arbitrary time delay really confused matters.
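The delayed burst can be reproduced with a small model of an id-ordered queue. This is a simplified sketch, not the feeder's actual code: list position stands in for database id, the queue is scaled down from 50000 to 1000, and the helper names are hypothetical. The 2% and 40% mixes are the ones seen in the logs.

```python
def make_results(n, astropulse_per_100):
    """Create n results in id order; in each run of 100, the first
    astropulse_per_100 are Astropulse and the rest are SETI@home."""
    return [
        "astropulse" if i % 100 < astropulse_per_100 else "setiathome"
        for i in range(n)
    ]


def astropulse_per_window(queue, window=100):
    """Count Astropulse results in each consecutive window of sends."""
    return [
        queue[i:i + window].count("astropulse")
        for i in range(0, len(queue), window)
    ]


# Normal mix, then an Astropulse-rich stretch created while the
# SETI@home splitters slept, then normal mix again.
queue = make_results(1000, 2) + make_results(300, 40) + make_results(200, 2)

# Sending strictly in id order replays the creation-time mix, but only
# after everything queued ahead of the rich stretch has gone out.
counts = astropulse_per_window(queue)
print(counts)
print("windows sent before the burst shows up:", counts.index(40))
```

Nothing looks wrong for the first ten windows; the burst only appears once the older results ahead of it have been sent, which is why cause and effect seemed unrelated.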

Anyway, one easy solution is to make the feeder smarter. It does have an "-allapps" flag to send to all applications equally. We were hesitant to use this before for fear that it would give too many shared memory slots to Astropulse - and it may very well cause periods of low work during peak periods, as the feeder now has half the memory for SETI@home workunits that it did. Nevertheless we turned this on today and it had an immediate, positive smoothing effect. Sweet.
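The difference between the two filling strategies can be sketched as follows. This is a hypothetical illustration of the idea only, not BOINC's feeder implementation; the function names, the slot count, and the sample queue are all assumptions.

```python
from collections import deque


def fill_slots_oldest_first(ready, n_slots):
    """Take the oldest results regardless of application.

    ready is a list of (result_id, app) tuples sorted by result_id.
    """
    return ready[:n_slots]


def fill_slots_all_apps(ready, n_slots):
    """Round-robin across applications, oldest first within each one."""
    per_app = {}
    for result_id, app in ready:
        per_app.setdefault(app, deque()).append((result_id, app))
    slots = []
    while len(slots) < n_slots and any(per_app.values()):
        for app_queue in per_app.values():
            if app_queue and len(slots) < n_slots:
                slots.append(app_queue.popleft())
    return slots


def count_app(slots, app):
    return sum(1 for _, a in slots if a == app)


# An Astropulse-rich stretch sits at the head of the ready-to-send queue.
ready = [(i, "astropulse") for i in range(1, 9)] + \
        [(i, "setiathome") for i in range(9, 21)]

naive = fill_slots_oldest_first(ready, 10)
even = fill_slots_all_apps(ready, 10)

print("oldest first:", count_app(naive, "astropulse"), "Astropulse of", len(naive))
print("all apps:    ", count_app(even, "astropulse"), "Astropulse of", len(even))
```

In this toy case the oldest-first fill hands 8 of 10 slots to Astropulse, while the interleaved fill caps it at 5. The flip side is the trade-off mentioned above: SETI@home also gets only half the slots, even when it has far more work waiting.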

Other than that today... some data pipeline scripting, and continuing discussions amongst the gang regarding changing redundancy to zero - trying to wrap our brains around all the current bottlenecks and what will suffer depending on what we do. As it stands now, our servers most likely will not be able to support reducing redundancy all the way to zero *and* keeping up with current workunit demand. So we have to either improve our server i/o or figure out what other knobs to turn.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 825050 · Report as offensive
Profile Dr. C.E.T.I.
Avatar

Send message
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 825054 - Posted: 30 Oct 2008, 23:15:40 UTC
Last modified: 30 Oct 2008, 23:16:19 UTC

.

. . . again - WOW! - busy for You all @ Berkeley eh - Thanks for the Updates Matt (and to each of the others - Thanks too!)

ps - luv the title Sir ;)
BOINC Wiki . . .

Science Status Page . . .
ID: 825054 · Report as offensive
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 25 Dec 00
Posts: 30608
Credit: 53,134,872
RAC: 32
United States
Message 825055 - Posted: 30 Oct 2008, 23:16:22 UTC - in response to Message 825050.  

Thanks for the update Jeff.
ID: 825055 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 825068 - Posted: 30 Oct 2008, 23:50:29 UTC - in response to Message 825055.  

Thanks for the update Jeff.

Jeff?

Is that Jeff, as in Mutt and Jeff?
ID: 825068 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 825257 - Posted: 31 Oct 2008, 12:32:48 UTC
Last modified: 31 Oct 2008, 12:56:32 UTC

Ahh, was the -allapps flag set and everybody assumed this was going to work for 3 days until Matt gets back? If so, it is possible that somebody is being too optimistic. At this moment, ul/dl's have been stalled for hours and cricket isn't chirping anymore, like a canary in the mine.

// Edit: actually it isn't uploads but the requests for more work that appear to be frozen. but you all probably know this already//
ID: 825257 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 825276 - Posted: 31 Oct 2008, 14:09:08 UTC

somebody, somewhere gave the network a kick and my ready to report queue has been cleared. thank you Mother Nature.
ID: 825276 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 825296 - Posted: 31 Oct 2008, 16:08:21 UTC

The problems last night/this morning were due to a random apache failure on the scheduler, which happens for no good reason from time to time. Jeff restarted it when he got in this morning.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 825296 · Report as offensive
ruben

Send message
Joined: 30 Oct 08
Posts: 2
Credit: 55,124
RAC: 0
Belgium
Message 825365 - Posted: 31 Oct 2008, 18:13:33 UTC

Is this still an issue ? I keep getting http errors.
ID: 825365 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 825367 - Posted: 31 Oct 2008, 18:20:45 UTC - in response to Message 825365.  

Is this still an issue ? I keep getting http errors.

Not too pretty in comms land right now.....no uppys or downys......

Maybe the boyz in the lab have to change into their best kicking boots again.......
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 825367 · Report as offensive
Profile the silver surfer
Avatar

Send message
Joined: 24 Feb 01
Posts: 131
Credit: 3,739,307
RAC: 0
Austria
Message 825368 - Posted: 31 Oct 2008, 18:22:31 UTC - in response to Message 825365.  

Is this still an issue ? I keep getting http errors.


Sure is, and I'm running out of work.........
But at least I can do a bit for Milkyway !

ID: 825368 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 825369 - Posted: 31 Oct 2008, 18:23:06 UTC - in response to Message 825368.  

Is this still an issue ? I keep getting http errors.


Sure is, and I'm running out of work.........
But at least I can do a bit for Milkyway !

Travesty....LOL....
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 825369 · Report as offensive
Tommy

Send message
Joined: 26 Jul 00
Posts: 9
Credit: 530,369
RAC: 0
United States
Message 825371 - Posted: 31 Oct 2008, 18:28:34 UTC

Someone please give the upload server a swift kick - it is very very very slow if it is working at all. Thanks, Tom
ID: 825371 · Report as offensive
John G

Send message
Joined: 29 Dec 01
Posts: 68
Credit: 10,932,850
RAC: 0
Canada
Message 825376 - Posted: 31 Oct 2008, 18:44:19 UTC

Yes now the upload server is acting up for some reason
ID: 825376 · Report as offensive
Profile speedimic
Volunteer tester
Avatar

Send message
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 825392 - Posted: 31 Oct 2008, 19:15:23 UTC

The server is up (shows the apache test page at least).
Might be just the 90 mbit/s wall we've been knocking on for a couple of hrs now.
mic.


ID: 825392 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 825393 - Posted: 31 Oct 2008, 19:16:48 UTC - in response to Message 825392.  

The server is up (shows the apache test page at least).
Might be just the 90 mbit/s wall we've been knocking on for a couple of hrs now.

Uhh.......I dunno......
If that's all it is, something should be getting through once in a while, and my uploads are dead in the water across all rigs.......
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 825393 · Report as offensive
Profile the silver surfer
Avatar

Send message
Joined: 24 Feb 01
Posts: 131
Credit: 3,739,307
RAC: 0
Austria
Message 825414 - Posted: 31 Oct 2008, 20:00:42 UTC - in response to Message 825369.  

Is this still an issue ? I keep getting http errors.


Sure is, and I'm running out of work.........
But at least I can do a bit for Milkyway !

Travesty....LOL....


Actually Milkyway has a new improved app = MUCH FASTER, released today,
and it crunches like hell !!! That's what I was referring to !!
But that's OFF TOPIC for this thread.

All the best,
Kurt

ID: 825414 · Report as offensive
gomeyer
Volunteer tester

Send message
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 825425 - Posted: 31 Oct 2008, 20:27:20 UTC

I'm getting an occasional upload complete, but absolutely zero downloads. I mean nothing. When the backlog is this bad they usually start dl'ing and time out part way through, but now it's not a single byte making it through.
ID: 825425 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 825444 - Posted: 31 Oct 2008, 21:07:01 UTC - in response to Message 825425.  

I'm getting an occasional upload complete, but absolutely zero downloads. I mean nothing. When the backlog is this bad they usually start dl'ing and time out part way through, but now it's not a single byte making it through.

Looka like a long weekend........LOL......crunch 'em if you got 'em.........
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 825444 · Report as offensive
Profile speedimic
Volunteer tester
Avatar

Send message
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 825450 - Posted: 31 Oct 2008, 21:25:34 UTC

I referred to the upload server mentioned in the post before mine.

This one is still up and I get all my uploads through after some tries (due to network load I think).
On the download side at least bane is up.
Vader is not responding...again...

mic.


ID: 825450 · Report as offensive
gomeyer
Volunteer tester

Send message
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 825453 - Posted: 31 Oct 2008, 21:30:41 UTC - in response to Message 825444.  

I'm getting an occasional upload complete, but absolutely zero downloads. I mean nothing. When the backlog is this bad they usually start dl'ing and time out part way through, but now it's not a single byte making it through.

Looka like a long weekend........LOL......crunch 'em if you got 'em.........

I'm OK for now with a 6 day cache on all but one machine. I hate a cache that large but the way things have been you almost have no choice.

That one machine was being kept with a small/reasonable queue as a sort of test I guess. This is what I get for trying to maintain a small/reasonable queue. This is why people go nuts and keep 10 days worth, a practice which I despise.

Another thing worries me. That test machine has 26 pending downloads. (All uploads done/reported and the queue is empty.) But the web page shows 31 work units in progress. So, now there are 5 ghosts on just this one machine. If that's an indication of a general problem this could get really ugly.
ID: 825453 · Report as offensive