The End of All Things (Oct 30 2008)



Message boards : Technical News : The End of All Things (Oct 30 2008)

Author Message
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 825050 - Posted: 30 Oct 2008, 23:00:44 UTC
Last modified: 30 Oct 2008, 23:00:55 UTC

Okay. So the assimilator memory leak wasn't a problem so much as an effect. Its consumption of resources still needs to be addressed, but it was only affecting itself, and was being aggravated by the other problems around it.

Poring through logs I confirmed that the network bursts were indeed due to Astropulse downloads - during the "baseline" 2 out of 100 workunit downloads are Astropulse, but during the "burst" 40 out of 100 are Astropulse. The Astropulse workunits are much larger than SETI@home workunits, hence the bandwidth consumption. I also confirmed it wasn't a single (or few) clients hitting us at once - connections were randomly distributed over many IP addresses.

It finally dawned on me, and like most things it is now painfully obvious in hindsight. The SETI@home and Astropulse splitters have separate high water marks. For SETI@home, if we get above 50000 results ready to send, we temporarily halt splitting. For Astropulse, it is still set pretty low at 2500. Every so often a splitter process checks the size of the queue to see if it should stop. Since there are many SETI@home splitters running at a time, and there is always a delay in transitioning state, thousands of workunits may be generated before the splitters actually realize they are above the high water mark. And then they go to sleep for a while - an hour or so - until the queue drains enough, at which point they wake up again and get back to work.
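The per-application check described above can be sketched roughly like this. The thresholds come from the post, but the function and names are a hypothetical illustration, not the actual BOINC splitter code:

```python
# Rough sketch of the per-application high-water-mark check.
# Thresholds are from the post; the code itself is illustrative.

HIGH_WATER = {"setiathome": 50_000, "astropulse": 2_500}

def should_split(app: str, ready_to_send: int) -> bool:
    # Each splitter only watches its own application's queue, so
    # Astropulse splitters keep running while SETI@home ones sleep.
    return ready_to_send < HIGH_WATER[app]
```

Because the queue size is only sampled periodically, many parallel splitters can overshoot the mark before any of them notices, which is the delay Matt describes.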

The thing is, during SETI@home's "sleep until we're needed again" phase the Astropulse splitters continue to run, since they haven't reached their high water mark even though it's much lower - those splitters are fewer in number and run slower. Now remember that when workunits are created, the transitioners also create respective results to "send." New results are id'ed serially - i.e. they are tagged with a number in the database which increases automatically. So during these periods you'll get an area in database id space rich in Astropulse results.

Moving on to the feeder. Since it's stupid regarding application types, it fills its own send queue with the oldest results ready to send regardless of application, and the way MySQL works this tends to mean in database id order. Of course with the ready-to-send queue at 50000 or so, we have to send out 50000 results before we finally see the effects of what happened above - many hours, usually. Then suddenly - bam! - 20 times more Astropulse workunits than normal. That arbitrary time delay really confused matters.
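A toy model makes the burst mechanics concrete. The names and data below are invented for illustration (this is not the real feeder), but the behavior matches the description: an application-blind feeder pulls results in id order, so an Astropulse-rich stretch of id space eventually hits the front of the queue all at once.

```python
# Toy model of an application-blind feeder: take the oldest
# ready-to-send results, which with auto-increment ids means
# lowest id first, regardless of application.

def fill_feeder_queue(results, slots):
    return sorted(results, key=lambda r: r["id"])[:slots]

# Simulate an id space where a stretch of Astropulse results was
# created while the SETI@home splitters slept.
results = [{"id": i, "app": "setiathome"} for i in range(100)]
results += [{"id": 100 + i, "app": "astropulse"} for i in range(40)]

batch = fill_feeder_queue(results, 100)   # all SETI@home for hours...
burst = fill_feeder_queue(results[100:], 40)  # ...then suddenly all Astropulse
```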

Anyway, one easy solution is to make the feeder smarter. It does have an "-allapps" flag to send to all applications equally. We were hesitant to use this before due to fear this will give too many shared memory slots to Astropulse - and it may very well cause periods of low work during peak periods as the feeder has half the memory for SETI@home workunits than it did. Nevertheless we turned this on today and it had an immediate, positive smoothing effect. Sweet.
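A hedged sketch of what an "-allapps"-style fill changes: the shared-memory slots get divided evenly among applications, each taking its own oldest results, instead of being filled in plain id order. The function below is an illustration of the idea, not BOINC's actual implementation.

```python
# Sketch of an "-allapps"-style feeder fill: split the shared-memory
# slots evenly among applications. Illustrative only.

def fill_slots_allapps(results_by_app, total_slots):
    per_app = total_slots // len(results_by_app)  # e.g. half each for two apps
    queue = []
    for app, results in results_by_app.items():
        # Each application gets its own oldest results, capped at per_app.
        queue.extend(sorted(results, key=lambda r: r["id"])[:per_app])
    return queue
```

Note that this is exactly the trade-off in the post: SETI@home now gets half the slots it used to, hence the worry about low work during peak periods.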

Other than that today... some data pipeline scripting, and continuing discussions amongst the gang regarding changing redundancy to zero - trying to wrap our brains around all the current bottlenecks and what will suffer depending on what we do. As it stands now, our servers most likely will not be able to support reducing redundancy all the way to zero *and* keeping up with current workunit demand. So we have to either improve our server i/o or figure out what other knobs to turn.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile Dr. C.E.T.I.
Avatar
Send message
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 825054 - Posted: 30 Oct 2008, 23:15:40 UTC
Last modified: 30 Oct 2008, 23:16:19 UTC

.

. . . again - WOW! - busy for You all @ Berkeley eh - Thanks for the Updates Matt (and to each of the others - Thanks too!)

ps - luv the title Sir ;)
____________
BOINC Wiki . . .

Science Status Page . . .

Profile Gary CharpentierProject donor
Volunteer tester
Avatar
Send message
Joined: 25 Dec 00
Posts: 12492
Credit: 6,806,407
RAC: 5,887
United States
Message 825055 - Posted: 30 Oct 2008, 23:16:22 UTC - in response to Message 825050.

Thanks for the update Jeff.

Richard HaselgroveProject donor
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8492
Credit: 49,790,470
RAC: 53,568
United Kingdom
Message 825068 - Posted: 30 Oct 2008, 23:50:29 UTC - in response to Message 825055.

Thanks for the update Jeff.

Jeff?

Is that Jeff, as in Mutt and Jeff?

PhonAcq
Send message
Joined: 14 Apr 01
Posts: 1622
Credit: 22,166,317
RAC: 3,879
United States
Message 825257 - Posted: 31 Oct 2008, 12:32:48 UTC
Last modified: 31 Oct 2008, 12:56:32 UTC

Ahh, was the -allapps flag set and everybody assumed this was going to work for 3 days until Matt gets back? If so, it is possible that somebody is being too optimistic. At this moment, ul/dl's have been stalled for hours and cricket isn't chirping anymore, like a canary in the mine.

// Edit: actually it isn't uploads but the requests for more work that appear to be frozen. But you all probably know this already //

PhonAcq
Send message
Joined: 14 Apr 01
Posts: 1622
Credit: 22,166,317
RAC: 3,879
United States
Message 825276 - Posted: 31 Oct 2008, 14:09:08 UTC

somebody, somewhere gave the network a kick and my ready to report queue has been cleared. thank you Mother Nature.

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 825296 - Posted: 31 Oct 2008, 16:08:21 UTC

The problems last night/this morning were random apache failure on the scheduler, which happens for no good reason from time to time. Jeff restarted it when he got in this morning.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

ruben
Send message
Joined: 30 Oct 08
Posts: 2
Credit: 55,124
RAC: 0
Belgium
Message 825365 - Posted: 31 Oct 2008, 18:13:33 UTC

Is this still an issue ? I keep getting http errors.

Profile the silver surfer
Avatar
Send message
Joined: 24 Feb 01
Posts: 131
Credit: 3,734,972
RAC: 0
Austria
Message 825368 - Posted: 31 Oct 2008, 18:22:31 UTC - in response to Message 825365.

Is this still an issue ? I keep getting http errors.


Sure is, and I´m running out of work.........
But at least I can do a bit for Milkyway !
____________

Tommy
Send message
Joined: 26 Jul 00
Posts: 9
Credit: 530,369
RAC: 0
United States
Message 825371 - Posted: 31 Oct 2008, 18:28:34 UTC

Someone please give the upload server a swift kick - it is very very very slow if it is working at all. Thanks, Tom

John G
Send message
Joined: 29 Dec 01
Posts: 63
Credit: 10,142,278
RAC: 0
Canada
Message 825376 - Posted: 31 Oct 2008, 18:44:19 UTC

Yes, now the upload server is acting up for some reason.

Profile speedimic
Volunteer tester
Avatar
Send message
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 825392 - Posted: 31 Oct 2008, 19:15:23 UTC

The server is up (it shows the Apache test page, at least).
Might be just the 90 Mbit/s wall we've been knocking on for a couple of hrs now.
____________
mic.


Profile the silver surfer
Avatar
Send message
Joined: 24 Feb 01
Posts: 131
Credit: 3,734,972
RAC: 0
Austria
Message 825414 - Posted: 31 Oct 2008, 20:00:42 UTC - in response to Message 825369.

Is this still an issue ? I keep getting http errors.


Sure is, and I´m running out of work.........
But at least I can do a bit for Milkyway !

Travesty....LOL....


Actually Milkyway has a new improved app = MUCH FASTER, released today,
and it crunches like hell !!! That`s what I was referring to !!
But that`s OFF TOPIC for this thread.

All the best,
Kurt
____________

gomeyer
Volunteer tester
Send message
Joined: 21 May 99
Posts: 488
Credit: 50,157,953
RAC: 0
United States
Message 825425 - Posted: 31 Oct 2008, 20:27:20 UTC

I'm getting an occasional upload complete, but absolutely zero downloads. I mean nothing. When the backlog is this bad they usually start dl'ing and time out part way through, but now it's not a single byte making it through.

Profile speedimic
Volunteer tester
Avatar
Send message
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 825450 - Posted: 31 Oct 2008, 21:25:34 UTC

I referred to the upload server mentioned in the post before mine.

This one is still up and I get all my uploads through after some tries (due to network load, I think).
On the download side at least bane is up.
Vader is not responding...again...

____________
mic.


gomeyer
Volunteer tester
Send message
Joined: 21 May 99
Posts: 488
Credit: 50,157,953
RAC: 0
United States
Message 825453 - Posted: 31 Oct 2008, 21:30:41 UTC - in response to Message 825444.

I'm getting an occasional upload complete, but absolutely zero downloads. I mean nothing. When the backlog is this bad they usually start dl'ing and time out part way through, but now it's not a single byte making it through.

Looks like a long weekend........LOL......crunch 'em if you got 'em.........

I'm OK for now with a 6 day cache on all but one machine. I hate a cache that large but the way things have been you almost have no choice.

That one machine was being kept with a small/reasonable queue as a sort of test I guess. This is what I get for trying to maintain a small/reasonable queue. This is why people go nuts and keep 10 days worth, a practice which I despise.

Another thing worries me. That test machine has 26 pending downloads. (All uploads done/reported and the queue is empty.) But the web page shows 31 work units in progress. So, now there are 5 ghosts on just this one machine. If that's an indication of a general problem this could get really ugly.


Copyright © 2014 University of California