The End of All Things (Oct 30 2008)
Author | Message |
---|---|
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Okay. So the assimilator memory leak wasn't a problem so much as an effect. Its consumption of resources still needs to be addressed, but it was only affecting itself, and being aggravated by the other problems around it. Poring through logs I confirmed that the network bursts were indeed due to Astropulse downloads - during the "baseline" 2 out of 100 workunit downloads are Astropulse, but during the "burst" 40 out of 100 are Astropulse. The Astropulse workunits are much larger in size than SETI@home workunits, hence the bandwidth consumption. I also confirmed it wasn't a single (or few) clients hitting us at once - connections were randomly distributed over many IP addresses. It finally dawned on me, and now like most things is painfully obvious in hindsight. The SETI@home and Astropulse splitters have separate high water marks. For SETI@home, if we get above 50000 results ready to send, we temporarily halt splitting. For Astropulse, it is still set pretty low at 2500. Every so often a splitter process checks to see the size of the queue and if it should stop. Since there are many SETI@home splitters running at a time, and there is always a delay in transitioning state, thousands of workunits may be generated before the splitters actually realize they are above the high water mark. And then they go to sleep for a while - like an hour or so - until the queue drains enough and they wake up again and get back to work. The thing is, during SETI@home's "sleep until we're needed again" phase the Astropulse splitters continue to run since they haven't reached their high water mark even though it's much lower - those splitters are fewer in number and run slower. Now remember when workunits are created, the transitioners also create respective results to "send." New results are id'ed serially - i.e. they are tagged with a number in the database which increases automatically. So during these periods you'll get an area in database id space rich in Astropulse results. 
Moving on to the feeder. Since it's stupid regarding application types, it fills its own send queue with the oldest results ready to send regardless of application, and the way mysql works this tends to mean in database id order. Of course with the ready-to-send queue at 50000 or so, we have to send out 50000 results before we finally see the effects of what happened above - many hours, usually. Then suddenly - bam! - 20 times more Astropulse workunits than normal. That arbitrary time delay really confused matters. Anyway, one easy solution is to make the feeder smarter. It does have an "-allapps" flag to send to all applications equally. We were hesitant to use this before due to fear this will give too many shared memory slots to Astropulse - and it may very well cause periods of low work during peak periods as the feeder has half the memory for SETI@home workunits that it did. Nevertheless we turned this on today and it had an immediate, positive smoothing effect. Sweet. Other than that today... some data pipeline scripting, and continuing discussions amongst the gang regarding changing redundancy to zero - trying to wrap our brains around all the current bottlenecks and what will suffer depending on what we do. As it stands now, our servers most likely will not be able to support reducing redundancy all the way to zero *and* keeping up with current workunit demand. So we have to either improve our server i/o or figure out what other knobs to turn. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Dr. C.E.T.I. Send message Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
. . . . again - WOW! - busy for You all @ Berkeley eh - Thanks for the Updates Matt (and to each of the others - Thanks too!) ps - luv the title Sir ;) BOINC Wiki . . . Science Status Page . . . |
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 30608 Credit: 53,134,872 RAC: 32 |
Thanks for the update Jeff. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
|
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Ahh, was the -allapps flag set and everybody assumed this was going to work for 3 days until Matt gets back? If so, it is possible that somebody is being too optimistic. At this moment, ul/dl's have been stalled for hours and cricket isn't chirping anymore, like a canary in the mine. // Edit: actually it isn't uploads but the requests for more work that appear to be frozen. but you all probably know this already// |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
somebody, somewhere gave the network a kick and my ready to report queue has been cleared. thank you Mother Nature. |
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
The problems last night/this morning were random apache failure on the scheduler, which happens for no good reason from time to time. Jeff restarted it when he got in this morning. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
ruben Send message Joined: 30 Oct 08 Posts: 2 Credit: 55,124 RAC: 0 |
Is this still an issue ? I keep getting http errors. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
Is this still an issue ? I keep getting http errors. Not too pretty in comms land right now.....no uppys or downys...... Maybe the boyz in the lab have to change into their best kicking boots again....... "Freedom is just Chaos, with better lighting." Alan Dean Foster |
the silver surfer Send message Joined: 24 Feb 01 Posts: 131 Credit: 3,739,307 RAC: 0 |
Is this still an issue ? I keep getting http errors. Sure is, and I'm running out of work......... But at least I can do a bit for Milkyway ! |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
Is this still an issue ? I keep getting http errors. Travesty....LOL.... "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Tommy Send message Joined: 26 Jul 00 Posts: 9 Credit: 530,369 RAC: 0 |
Someone please give the upload server a swift kick - it is very very very slow if it is working at all. Thanks, Tom |
John G Send message Joined: 29 Dec 01 Posts: 68 Credit: 10,932,850 RAC: 0 |
Yes now the upload server is acting up for some reason |
speedimic Send message Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0 |
The server is up (shows the apache test page at least). Might be just the 90 mbit/s wall we're knocking on for a couple of hrs now. mic. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
The server is up (shows the apache test page at least). Uhh.......I dunno...... If that's all it is, something should be getting through once in a while, and my uploads are dead in the water across all rigs....... "Freedom is just Chaos, with better lighting." Alan Dean Foster |
the silver surfer Send message Joined: 24 Feb 01 Posts: 131 Credit: 3,739,307 RAC: 0 |
Is this still an issue ? I keep getting http errors. Actually Milkyway has a new improved app = MUCH FASTER, released today, and it crunches like hell !!! That's what I was referring to !! But that's OFF TOPIC for this thread. All the best, Kurt |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
I'm getting an occasional upload complete, but absolutely zero downloads. I mean nothing. When the backlog is this bad they usually start dl'ing and time out part way through, but now it's not a single byte making it through. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
I'm getting an occasional upload complete, but absolutely zero downloads. I mean nothing. When the backlog is this bad they usually start dl'ing and time out part way through, but now it's not a single byte making it through. Looks like a long weekend........LOL......crunch 'em if you got 'em......... "Freedom is just Chaos, with better lighting." Alan Dean Foster |
speedimic Send message Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0 |
I referred to the upload server mentioned in the post before mine. This one is still up and I get all my uploads through after some tries (due to network load I think). On the download side at least bane is up. Vader is not responding...again... mic. |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
I'm getting an occasional upload complete, but absolutely zero downloads. I mean nothing. When the backlog is this bad they usually start dl'ing and time out part way through, but now it's not a single byte making it through. I'm OK for now with a 6 day cache on all but one machine. I hate a cache that large but the way things have been you almost have no choice. That one machine was being kept with a small/reasonable queue as a sort of test I guess. This is what I get for trying to maintain a small/reasonable queue. This is why people go nuts and keep 10 days worth, a practice which I despise. Another thing worries me. That test machine has 26 pending downloads. (All uploads done/reported and the queue is empty.) But the web page shows 31 work units in progress. So, now there are 5 ghosts on just this one machine. If that's an indication of a general problem this could get really ugly. |
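
For readers trying to picture the mechanism Matt describes in the opening post, here is a minimal sketch in Python. It is a toy model and not BOINC feeder code: the function names, the 100-slot feeder size, and the 5000-result length of the Astropulse-rich region are all hypothetical, chosen only so that the numbers line up with the 2-in-100 baseline, the 40-in-100 burst, and the roughly 50000-result ready-to-send queue quoted in the post. It shows why a feeder that fills strictly in database id order produces a delayed burst, and why splitting slots equally between applications (one simple reading of what "-allapps" does) makes the slot mix independent of where the feeder is in id space.

```python
from collections import Counter
from itertools import islice

SETI, AP = "setiathome", "astropulse"


def build_id_space(baseline_len=50_000, burst_len=5_000):
    """App label of each ready-to-send result, in database id order.

    Hypothetical layout: a baseline region with 2 Astropulse results per
    100, followed by an Astropulse-rich region with 40 per 100 (created
    while the SETI@home splitters were asleep).
    """
    baseline = [AP if i % 50 == 0 else SETI for i in range(baseline_len)]
    burst = [AP if i % 5 < 2 else SETI for i in range(burst_len)]
    return baseline + burst


def fill_by_id_order(ready, n_slots):
    """Application-blind feeder: take the oldest results, whatever they are."""
    return ready[:n_slots]


def fill_allapps(ready, n_slots, apps=(SETI, AP)):
    """Assumed '-allapps'-style fill: equal slot share per application,
    each share filled with that application's oldest results."""
    share = n_slots // len(apps)
    slots = []
    for app in apps:
        slots.extend(islice((r for r in ready if r == app), share))
    return slots


def ap_share(results):
    """Fraction of the given results that are Astropulse."""
    return Counter(results)[AP] / len(results)


def ap_share_per_batch(ready, batch):
    """Astropulse fraction of each consecutive batch sent in id order."""
    return [ap_share(ready[i:i + batch]) for i in range(0, len(ready), batch)]


ready = build_id_space()
shares = ap_share_per_batch(ready, 1000)
first_burst = next(i for i, s in enumerate(shares) if s > 0.1)
print("results sent before the burst shows up:", first_burst * 1000)
print("baseline share:", shares[0], "burst share:", shares[first_burst])
print("id-order slots at burst:", ap_share(fill_by_id_order(ready[50_000:], 100)))
print("allapps slots at burst:", ap_share(fill_allapps(ready[50_000:], 100)))
```

In this toy model the id-order feeder sees nothing unusual until the whole 50000-result backlog has gone out, and then the Astropulse share jumps by a factor of 20. The equal-share fill holds Astropulse at half the slots in both regions, which is also the cost Matt mentions: SETI@home gets only half the slots it used to have. What clients actually download depends on their work requests, which this sketch does not model.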