Splitsville (Aug 16 2007)

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 620572 - Posted: 16 Aug 2007, 23:03:19 UTC So here's the deal. Getting multibeam data out to the public is having its ups and downs. Thanks to some helpful poking and prodding from various users we uncovered a problem with the splitter causing it to generate workunits with bogus triplet thresholds. The result: about 50% of the workunits sent out were overflowing quickly and returning, creating network clogs on our already-overwhelmed servers. And about 2.5% of the workunits were sent out with impossibly low thresholds, causing clients to spin on ridiculously slow calculations. The mystery here is why these aren't also immediately overflowing (with such thresholds they should report a lot of garbage right away). This may have to do with when/where the client checks for overflow - it may take several hours to reach 0.001% done, but the hope is these clients will then finally be bursting with data and returning the results home.
This was actually a problem in beta that got fixed, but now somehow resurfaced, which is also a mystery. CVS out of sync? Some stupid code put in to check for config overrides on the command line? Unfortunately the splitter guru is on vacation, so we had to make our best attempt to understand the code and patch it ourselves. Jeff just did so and put the fixed version online and we're watching the thresholds. So far so good.
Meanwhile, we're back to yesterday's problem of just not having enough throughput from the workunit file server, so that's the main bottleneck right now, and there's not much we can do about it except wait for the current artificial demand (caused by the excessive overflows) to die down and see if we catch up.
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it."
- Jeanne-Claude ID: 620572 ·
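A rough sketch of the overflow behaviour Matt describes, for readers who want to see why an impossibly low threshold should end a workunit almost immediately, and how the placement of the overflow check could delay that. Everything here is an assumption made for illustration: the cap `MAX_SIGNALS`, the `check_every` parameter and the synthetic `powers` data are hypothetical, and this is not the SETI@home client code.

```python
MAX_SIGNALS = 30  # hypothetical cap on reported signals; the real limit is not given in this thread


def analyze(powers, threshold, max_signals=MAX_SIGNALS, check_every=1):
    """Count values above `threshold`; stop early ("overflow") once the cap is reached.

    `check_every` models a client that only tests for overflow at certain
    points in the computation rather than after every sample.
    Returns (signals_found, overflowed, samples_processed).
    """
    found = 0
    for i, power in enumerate(powers, start=1):
        if power > threshold:
            found += 1
        if i % check_every == 0 and found >= max_signals:
            return found, True, i
    return found, found >= max_signals, len(powers)


# Deterministic stand-in data: values between 0.0 and 9.9 (not real telescope data).
powers = [((i * 37) % 100) / 10.0 for i in range(1, 10_001)]

print(analyze(powers, threshold=25.0))                    # sane threshold: nothing exceeds it
print(analyze(powers, threshold=0.5))                     # bogus low threshold, checked every sample
print(analyze(powers, threshold=0.5, check_every=5000))   # same threshold, overflow checked rarely
```

With the check after every sample, the low threshold trips the cap within a few dozen samples. With a rare check, the same workunit grinds through thousands of samples and piles up "garbage" before it is allowed to stop, which is the kind of delay Matt speculates about.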

HDRW Send message Joined: 18 Oct 02 Posts: 14 Credit: 189,189 RAC: 0	Message 620584 - Posted: 16 Aug 2007, 23:22:03 UTC - in response to Message 620572. Matt, Thanks for keeping us informed, as usual! Look on the bright side: at least you don't work for Skype! :-) Just a thought: I wonder how much of the load on the system is related to the number of work units, and how much to the volume of them? When things get tough, how easy would it be to increase the size of the WUs, so that they take longer to come back, thus reducing the rate that communications are happening? Cheers, Howard ID: 620584 ·
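Howard's question can be put as back-of-the-envelope arithmetic. The sketch below assumes every host crunches continuously and contacts the server a fixed number of times per workunit (once to fetch, once to report); the host count and durations are made-up numbers, and real BOINC clients can batch several workunits into one request, so treat this as an illustration only.

```python
def contacts_per_hour(hosts, seconds_per_workunit, contacts_per_workunit=2):
    """Approximate server contacts per hour under the simplifying assumptions above."""
    workunits_per_host_per_hour = 3600.0 / seconds_per_workunit
    return hosts * workunits_per_host_per_hour * contacts_per_workunit


# Hypothetical example: 1,000 hosts, one-hour workunits versus two-hour workunits.
print(contacts_per_hour(1000, 3600))
print(contacts_per_hour(1000, 7200))
```

Under these assumptions, doubling the time a workunit takes halves the contact rate, while the volume of data per contact grows with the workunit size.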

OzzFan Volunteer tester Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28	Message 620588 - Posted: 16 Aug 2007, 23:27:02 UTC - in response to Message 620572. Last modified: 16 Aug 2007, 23:27:26 UTC "Unfortunately the splitter guru is on vacation" I take it that's Eric's dept? ID: 620588 ·

Dr. C.E.T.I. Send message Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0	Message 620616 - Posted: 17 Aug 2007, 0:10:27 UTC Thank You Matt . . . Nice Work all-around ID: 620616 ·

Pablo_ZPM Send message Joined: 13 Jul 01 Posts: 3 Credit: 367,720 RAC: 0	Message 621011 - Posted: 17 Aug 2007, 12:34:40 UTC Hello Matt and all, I'm new on the forum though I've been churning out s@h work for some years now. I've been getting worrying workunits lately so I decided to post an inquiry. Maybe somebody will be able to make sense of what is happening. Below are the workunits received and sent back in the past 24 hours on one of my computers. 14 of them (traffic!!!) more or less add up to a single standard WU while generating 14 times the traffic! I (my comps) usually need to connect once every 48 hours to get/return work. What do you guys advise? Maybe we (those experiencing similar problems) should lay off for a while and get back online in, say, three or four days to let you guys and your overburdened servers have a breather and straighten things out. ET won't be going anywhere :o) What say you? By the way, for the past several years I've been admiring the great job you are doing. Keep it up and we dreamers will be there for you!
Done 79.14 0.02 0.01
Done 120.47 0.06 0.05
Done 9,503.13 19.32 pending
Done 9,538.72 19.32 pending
Done 46,457.27 16.91 16.91
Done 22,682.17 28.43 28.43
Done 79.52 0.02 0.02
Done 167.93 0.02 0.02
Done 9,689.89 19.31 19.31
Done 10,599.06 19.31 pending
Done 8,238.59 16.84 16.84
Done 84.91 0.02 0.02
Done 10,448.47 19.31 19.31
Done 6,712.44 12.15 12.15
Done 129.86 0.03 0.03
Pablo_ZPM ID: 621011 ·

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19422 Credit: 40,757,560 RAC: 67	Message 621028 - Posted: 17 Aug 2007, 13:32:47 UTC - in response to Message 621011. Probably what you are seeing is being discussed in "Work Unit Problem" on the Number Crunching Board. Andy ID: 621028 ·

Bounce Send message Joined: 3 Apr 99 Posts: 66 Credit: 5,604,569 RAC: 0	Message 621032 - Posted: 17 Aug 2007, 13:44:26 UTC down again? one of my machines is on its last WU and attempts to connect report that there are no new WUs available. ID: 621032 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14680 Credit: 200,643,578 RAC: 874	Message 621053 - Posted: 17 Aug 2007, 14:25:53 UTC - in response to Message 621028. Actually, I don't think that list shows any significant problems at all. None of your low-credit WUs took longer than 3 minutes, and you only had 6 short ones (out of 15) - that's below the 50% rate that Matt says they were sending out for a while. Now that they've fixed that problem, you should see fewer very short units, and you should expect to see the 'normal' units work their way through more quickly (because the new program is much more efficient). The only one which might be a minor cause for concern is the one which took 46,457 seconds for 16.91 credits, but even that doesn't matter if it only happens rarely. Yes, you're good to go - keep on crunching! ID: 621053 ·
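Richard's count can be checked with a few lines of Python. This is only a sketch: the numbers are the first numeric column of Pablo_ZPM's list in Message 621011, assumed here to be CPU time in seconds (consistent with Richard's reading of "46,457 seconds"), and the 3-minute cutoff for a "short" workunit is taken from Richard's remark rather than from any project definition.

```python
# First numeric column of Pablo_ZPM's list, assumed to be CPU time in seconds.
cpu_seconds = [
    79.14, 120.47, 9503.13, 9538.72, 46457.27,
    22682.17, 79.52, 167.93, 9689.89, 10599.06,
    8238.59, 84.91, 10448.47, 6712.44, 129.86,
]

SHORT_CUTOFF = 180.0  # 3 minutes, per Richard's observation; an assumed cutoff

short = [t for t in cpu_seconds if t < SHORT_CUTOFF]
fraction_short = len(short) / len(cpu_seconds)

print(f"{len(short)} of {len(cpu_seconds)} workunits were short ({fraction_short:.0%})")
# prints "6 of 15 workunits were short (40%)"
```

That is 40%, below the roughly 50% overflow rate Matt reported for the period before the splitter fix.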

Clyde C. Phillips, III Send message Joined: 2 Aug 00 Posts: 1851 Credit: 5,955,047 RAC: 0	Message 621227 - Posted: 17 Aug 2007, 18:24:02 UTC I looked at my results today and saw several 10,500-second (for a PD950) 0.5 credit ones, most pending. I also aborted two units that weren't progressing and whose time-to-completion was increasing. There were also a lot of 0.5 to 30-second ones that must have been -9 overflow ones. There's still a big mess with the new Multibeam system. I know Matt is working as hard as he can..... ID: 621227 ·

Pappa Volunteer tester Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0	Message 621298 - Posted: 17 Aug 2007, 20:10:28 UTC - in response to Message 621227. Clyde, I hope you reported in the NC Forum the IDs of the workunits that continued to run and got nowhere... What was seen in Beta in a few of those cases: if you suspend, close BOINC and then restart, it either completes normally... or errors out. We had a few people play with a few that were captured and could not find what the cause was... Please consider a Donation to the Seti Project. ID: 621298 ·

Christopher Coulter Volunteer tester Send message Joined: 15 Sep 05 Posts: 1 Credit: 1,898 RAC: 0	Message 621339 - Posted: 17 Aug 2007, 21:16:16 UTC My computer's time to finish is 29 hrs. Is it possible that mine is in the 2.5 percent? Thanks, Christopher Coulter ID: 621339 ·

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 621561 - Posted: 18 Aug 2007, 0:53:15 UTC - in response to Message 620572. Is it possible that so much information is generated that the counter overflows and a value that is actually too small (possibly even negative) is reported? BOINC WIKI ID: 621561 ·
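The effect John is asking about is ordinary fixed-width integer wraparound. The sketch below simulates a signed 32-bit counter in Python to show how a count that grows too large can come back small or negative. Whether the client actually keeps its signal count in such a counter is not stated anywhere in this thread; `wrap_int32` is a hypothetical helper written only for this illustration.

```python
def wrap_int32(value):
    """Interpret `value` as a signed 32-bit two's-complement integer."""
    value &= 0xFFFFFFFF                      # keep only the low 32 bits
    if value >= 0x80000000:                  # top bit set means negative
        return value - 0x100000000
    return value


INT32_MAX = 2**31 - 1

print(wrap_int32(INT32_MAX))       # 2147483647: still fits
print(wrap_int32(INT32_MAX + 1))   # -2147483648: one more and it goes negative
print(wrap_int32(2**32 + 5))       # 5: a huge count reported as a tiny one
```

If an overflow test compared such a wrapped value against a limit, a workunit producing an enormous number of detections could look as if it had produced almost none.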

blade148 Send message Joined: 22 Jul 01 Posts: 5 Credit: 985,224 RAC: 0	Message 622011 - Posted: 18 Aug 2007, 15:45:54 UTC Hi Matt, and thanks for the update. Could I ask a favour (if it's in your realm)? Could you update the news on the front page of the SETI web site? It might help other peeps understand and remain patient while you guys get sorted. Thanks for your hard work! Matt ID: 622011 ·

Clyde C. Phillips, III Send message Joined: 2 Aug 00 Posts: 1851 Credit: 5,955,047 RAC: 0	Message 622144 - Posted: 18 Aug 2007, 18:22:42 UTC - in response to Message 621298. I'll see if I can find those and report them, with appropriate computation log, to Number Crunching. Hope it's not too late. ID: 622144 ·

seti@elrcastor.com Volunteer tester Send message Joined: 30 Jan 00 Posts: 35 Credit: 4,879,559 RAC: 0	Message 622441 - Posted: 19 Aug 2007, 3:45:35 UTC The beta project still has issues:
2007-08-16 08:30:06 [SETI@home Beta Test] Sending scheduler request: To fetch work
2007-08-16 08:30:06 [SETI@home Beta Test] Requesting 40883 seconds of new work
2007-08-16 08:30:12 [SETI@home Beta Test] Scheduler RPC succeeded
2007-08-16 08:30:12 [SETI@home Beta Test] Message from server: Project encountered internal error: shared memory
2007-08-16 08:30:12 [SETI@home Beta Test] Deferring communication for 1 hr 0 min 0 sec
2007-08-16 08:30:12 [SETI@home Beta Test] Reason: project is down
2007-08-16 08:30:12 [SETI@home Beta Test] Deferring communication for 3 hr 43 min 59 sec
2007-08-16 08:30:12 [SETI@home Beta Test] Reason: project is down
ID: 622441 ·

Alex Striker Send message Joined: 11 Jan 04 Posts: 42 Credit: 33,960,026 RAC: 20	Message 623156 - Posted: 20 Aug 2007, 9:07:23 UTC Hi Matt, Jeff, and all of the staff, thanks for the update. Keep up the good work. /Alex Striker, Team Striker Denmark. Happy Crunching /Alex Striker, founder of Team Striker SETI/BOINC ID: 623156 ·

Thebrez1 Volunteer tester Send message Joined: 19 May 99 Posts: 4 Credit: 16,257,385 RAC: 56	Message 623331 - Posted: 20 Aug 2007, 15:33:33 UTC I am curious, not hostile. I have been running SETI since 1999. The old Seti used to run great, never any problems and I could store enough units to keep going when the servers did crash. Now I am about ready to take it off my system. It seems SETI is down more than up lately. Is it Seti or Boinc that is the problem? Is it ever going to be fixed? It just seems the "better" the program gets the worse it performs. ID: 623331 ·

DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 623409 - Posted: 20 Aug 2007, 16:39:13 UTC - in response to Message 623331. Last modified: 20 Aug 2007, 16:39:48 UTC Yes, the project goes up and down on occasion. This year's budget includes several server upgrades that will give the project servers some redundancy, providing better uptime. Right now, SETI needs donations to make this happen. In the meantime, if you want to keep your computer busy and you don't have any other projects, consider allocating 50% of BOINC's time to another project. http://boinc.berkeley.edu/projects.php ID: 623409 ·

OzzFan Volunteer tester Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28	Message 623435 - Posted: 20 Aug 2007, 16:56:39 UTC - in response to Message 623331. "I am curious, not hostile. I have been running SETI since 1999. The old Seti used to run great, never any problems <snip>" I don't know about that. There's plenty of crunchers that have been here since the beginning that can tell you about some of the nightmares when SETI Classic had problems. They were just less noticeable then. ;) ID: 623435 ·

RandyC Send message Joined: 20 Oct 99 Posts: 714 Credit: 1,704,345 RAC: 0	Message 623453 - Posted: 20 Aug 2007, 17:22:28 UTC - in response to Message 620572. "So here's the deal. Getting multibeam data out to the public is having its ups and downs. - Matt" Just a quick question... Now that multibeam data is out (and seemingly stable...for now), what's the status of the old data? Is there any left to be split? ID: 623453 ·
