Splitsville (Aug 16 2007)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
So here's the deal. Getting multibeam data out to the public is having its ups and downs. Thanks to some helpful poking and prodding from various users, we uncovered a problem with the splitter causing it to generate workunits with bogus triplet thresholds. The result: about 50% of the workunits sent out were overflowing quickly and returning, creating network clogs on our already-overwhelmed servers. And about 2.5% of the workunits were sent out with impossibly low thresholds, causing clients to spin on ridiculously slow calculations. The mystery here is why these aren't also immediately overflowing (with such thresholds they should report a lot of garbage right away). This may have to do with when/where the client checks for overflow - it may take several hours to reach 0.001% done, but the hope is that these clients will then finally be bursting with data and returning the results home. This was actually a problem in beta that got fixed but has now somehow resurfaced, which is also a mystery. CVS out of sync? Some stupid code put in to check for config overrides on the command line? Unfortunately the splitter guru is on vacation, so we had to make our best attempt to understand the code and patch it ourselves. Jeff just did so and put the fixed version online, and we're watching the thresholds. So far so good. Meanwhile, we're back to yesterday's problem of just not having enough throughput from the workunit file server, so that's the main bottleneck right now, and there's not much we can do about it except wait for the current artificial demand (caused by the excessive overflows) to die down and see if we catch up. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
HDRW Joined: 18 Oct 02 Posts: 14 Credit: 189,189 RAC: 0 |
Matt, Thanks for keeping us informed, as usual! Look on the bright side: at least you don't work for Skype! :-) Just a thought: I wonder how much of the load on the system is related to the number of work units, and how much to the volume of them? When things get tough, how easy would it be to increase the size of the WUs, so that they take longer to come back, thus reducing the rate that communications are happening? Cheers, Howard |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
"Unfortunately the splitter guru is on vacation" I take it that's Eric's dept? |
Dr. C.E.T.I. Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
Thank You Matt . . . Nice Work all-around |
Pablo_ZPM Joined: 13 Jul 01 Posts: 3 Credit: 367,720 RAC: 0 |
Hello Matt and all, I'm new on the forum, though I've been churning out s@h work for some years now. I've been getting worrying workunits lately, so I decided to post an inquiry. Maybe somebody will be able to make sense of what is happening. Below are the workunits received and sent back in the past 24 hours on one of my computers. 14 of them (traffic!!!) more or less add up to a single standard WU while generating 14 times the traffic! I (my comps) usually need to connect once every 48 hours to get/return work. What do you guys advise? Maybe we (those experiencing similar problems) should lay off for a while and get back online in, say, three or four days to let you guys and your overburdened servers have a breather and straighten things out. ET won't be going anywhere :o) What say you? By the way, for the past several years I've been admiring the great job you are doing. Keep it up and we dreamers will be there for you! Done 79.14 0.02 0.01; Done 120.47 0.06 0.05; Done 9,503.13 19.32 pending; Done 9,538.72 19.32 pending; Done 46,457.27 16.91 16.91; Done 22,682.17 28.43 28.43; Done 79.52 0.02 0.02; Done 167.93 0.02 0.02; Done 9,689.89 19.31 19.31; Done 10,599.06 19.31 pending; Done 8,238.59 16.84 16.84; Done 84.91 0.02 0.02; Done 10,448.47 19.31 19.31; Done 6,712.44 12.15 12.15; Done 129.86 0.03 0.03. Pablo_ZPM |
W-K 666 Joined: 18 May 99 Posts: 19354 Credit: 40,757,560 RAC: 67 |
"Hello Matt and all," Probably what you are seeing is being discussed in "Work Unit Problem" on the Number Crunching board. Andy |
Bounce Joined: 3 Apr 99 Posts: 66 Credit: 5,604,569 RAC: 0 |
Down again? One of my machines is on its last WU, and attempts to connect report that there are no new WUs available. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14676 Credit: 200,643,578 RAC: 874 |
"Hello Matt and all," Actually, I don't think that list shows any significant problems at all. None of your low-credit WUs took longer than 3 minutes, and you only had 6 short ones (out of 15) - that's below the 50% rate that Matt says they were sending out for a while. Now they've fixed that problem, you should see fewer very short units: and you should expect to see the 'normal' units work their way through more quickly (because the new program is much more efficient). The only one which might be a minor cause for concern is the one which took 46,457 seconds for 16.91 credits, but even that doesn't matter if it only happens rarely. Yes, you're good to go - keep on crunching! |
Clyde C. Phillips, III Joined: 2 Aug 00 Posts: 1851 Credit: 5,955,047 RAC: 0 |
I looked at my results today and saw several 10,500-second (for a PD950) 0.5-credit ones, most pending. I also aborted two units that weren't progressing, with the time-to-completion increasing. There were also a lot of 0.5- to 30-second ones that must have been -9 overflow ones. There's still a big mess with the new Multibeam system. I know Matt is working as hard as he can..... |
Pappa Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
Clyde, I hope you reported the workunit IDs in the NC forum for the ones that continued to run and got nowhere... What was seen in Beta in a few of those cases: if you suspend, close BOINC, and then restart, it either completes normally... or errors out. We had a few people play with a few that were captured and could not find what the cause was... "I looked at my results today and saw several 10,500-second (for a PD950) 0.5-credit ones, most pending. I also aborted two units that weren't progressing, with the time-to-completion increasing. There were also a lot of 0.5- to 30-second ones that must have been -9 overflow ones. There's still a big mess with the new Multibeam system. I know Matt is working as hard as he can....." Please consider a Donation to the Seti Project. |
Christopher Coulter Joined: 15 Sep 05 Posts: 1 Credit: 1,898 RAC: 0 |
My computer's time to finish is 29 hrs. Is it possible that mine is in the 2.5 percent? Thanks, Christopher Coulter |
John McLeod VII Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
"So here's the deal. Getting multibeam data out to the public is having its ups and downs. Thanks to some helpful poking and prodding from various users, we uncovered a problem with the splitter causing it to generate workunits with bogus triplet thresholds. The result: about 50% of the workunits sent out were overflowing quickly and returning, creating network clogs on our already-overwhelmed servers. And about 2.5% of the workunits were sent out with impossibly low thresholds, causing clients to spin on ridiculously slow calculations. The mystery here is why these aren't also immediately overflowing (with such thresholds they should report a lot of garbage right away). This may have to do with when/where the client checks for overflow - it may take several hours to reach 0.001% done, but the hope is that these clients will then finally be bursting with data and returning the results home." Is it possible that so much information is generated that the counter overflows and a value that is actually too small (possibly even negative) is reported? BOINC WIKI |
blade148 Joined: 22 Jul 01 Posts: 5 Credit: 985,224 RAC: 0 |
Hi Matt, and thanks for the update. Could I ask a favour (if it's in your realm)? Could you update the news on the front of the SETI web site? It might help other peeps understand and remain patient while you guys get sorted. Thanks for your hard work! Matt |
Clyde C. Phillips, III Joined: 2 Aug 00 Posts: 1851 Credit: 5,955,047 RAC: 0 |
Clyde I'll see if I can find those and report them, with appropriate computation log, to Numbercrunching. Hope it's not too late. |
seti@elrcastor.com Joined: 30 Jan 00 Posts: 35 Credit: 4,879,559 RAC: 0 |
The beta project still has issues 2007-08-16 08:30:06 [SETI@home Beta Test] Sending scheduler request: To fetch work 2007-08-16 08:30:06 [SETI@home Beta Test] Requesting 40883 seconds of new work 2007-08-16 08:30:12 [SETI@home Beta Test] Scheduler RPC succeeded 2007-08-16 08:30:12 [SETI@home Beta Test] Message from server: Project encountered internal error: shared memory 2007-08-16 08:30:12 [SETI@home Beta Test] Deferring communication for 1 hr 0 min 0 sec 2007-08-16 08:30:12 [SETI@home Beta Test] Reason: project is down 2007-08-16 08:30:12 [SETI@home Beta Test] Deferring communication for 3 hr 43 min 59 sec 2007-08-16 08:30:12 [SETI@home Beta Test] Reason: project is down |
Alex Striker Joined: 11 Jan 04 Posts: 42 Credit: 33,960,026 RAC: 20 |
Hi Matt, Jeff, and all of the staff, thanks for the update. Keep up the good work. /Alex Striker, Team Striker Denmark |
Thebrez1 Joined: 19 May 99 Posts: 4 Credit: 16,257,385 RAC: 56 |
I am curious, not hostile. I have been running SETI since 1999. The old Seti used to run great, never any problems and I could store enough units to keep going when the servers did crash. Now I am about ready to take it off my system. It seems SETI is down more than up lately. Is it Seti or Boinc that is the problem? Is it ever going to be fixed? It just seems the "better" the program gets the worse it performs. |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
"I am curious, not hostile. I have been running SETI since 1999. The old Seti used to run great, never any problems and I could store enough units to keep going when the servers did crash. Now I am about ready to take it off my system. It seems SETI is down more than up lately. Is it Seti or Boinc that is the problem? Is it ever going to be fixed? It just seems the "better" the program gets the worse it performs." Yes, the project goes up and down on occasion. In the budget for this year are several server upgrades that will give the project servers some redundancy, providing better uptime. Right now, SETI needs donations to make this happen. Otherwise, if you want to keep your computer busy and you don't have any other projects, consider allocating 50% of BOINC's time to another project. http://boinc.berkeley.edu/projects.php |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
"I am curious, not hostile. I have been running SETI since 1999. The old Seti used to run great, never any problems <snip>" I don't know about that. There's plenty of crunchers that have been here since the beginning that can tell you about some of the nightmares when SETI Classic had problems. They were just less noticeable then. ;) |
RandyC Joined: 20 Oct 99 Posts: 714 Credit: 1,704,345 RAC: 0 |
"So here's the deal. Getting multibeam data out to the public is having its ups and downs." Just a quick question... Now that multibeam data is out (and seemingly stable...for now), what's the status of the old data? Is there any left to be split? |
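
As a cross-check of the numbers discussed in the thread, here is a minimal Python sketch that tallies the result list Pablo_ZPM posted. It assumes the first two numeric columns are CPU time in seconds and claimed credit (the post does not label them), and it treats anything under 3 minutes as a "short" workunit, following Richard Haselgrove's reply. The names `RESULTS`, `SHORT_CUTOFF_SECONDS`, and `summarize` are made up for this illustration and are not part of BOINC or SETI@home.

```python
# Rows copied from Pablo_ZPM's post.
# Assumed column meaning: (CPU time in seconds, claimed credit).
RESULTS = [
    (79.14, 0.02), (120.47, 0.06), (9503.13, 19.32), (9538.72, 19.32),
    (46457.27, 16.91), (22682.17, 28.43), (79.52, 0.02), (167.93, 0.02),
    (9689.89, 19.31), (10599.06, 19.31), (8238.59, 16.84), (84.91, 0.02),
    (10448.47, 19.31), (6712.44, 12.15), (129.86, 0.03),
]

# Assumption: "short" means under 3 minutes, the figure Richard mentions.
SHORT_CUTOFF_SECONDS = 180.0


def summarize(results, cutoff=SHORT_CUTOFF_SECONDS):
    """Count short results and find the slowest one in the list."""
    short_times = [seconds for seconds, _ in results if seconds < cutoff]
    slowest = max(results, key=lambda row: row[0])
    return {
        "total": len(results),
        "short": len(short_times),
        "short_fraction": len(short_times) / len(results),
        "longest_short_seconds": max(short_times),
        "slowest": slowest,
    }


if __name__ == "__main__":
    summary = summarize(RESULTS)
    print(f"{summary['short']} of {summary['total']} results were short "
          f"({summary['short_fraction']:.0%})")
    print(f"longest short result: {summary['longest_short_seconds']} s")
    print(f"slowest result: {summary['slowest'][0]} s "
          f"for {summary['slowest'][1]} credits")
```

On the posted data this gives 6 short results out of 15 (40%), the longest of them at 167.93 seconds, which agrees with Richard's reading that the short ones were under the 50% rate Matt described. The 46,457-second result for 16.91 credits is the outlier he flags.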