Splitsville (Aug 16 2007)

Message boards : Technical News : Splitsville (Aug 16 2007)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 19
United States
Message 620572 - Posted: 16 Aug 2007, 23:03:19 UTC

So here's the deal. Getting multibeam data out to the public is having its ups and downs. Thanks to some helpful poking and prodding from various users we uncovered a problem with the splitter causing it to generate workunits with bogus triplet thresholds. The result: about 50% of the workunits sent out were overflowing quickly and returning, creating network clogs on our already-overwhelmed servers. And about 2.5% of the workunits were sent out with impossibly low thresholds, causing clients to spin on ridiculously slow calculations. The mystery here is why these aren't also immediately overflowing (with such thresholds they should report a lot of garbage right away). This may have to do with when/where the client checks for overflow - it may take several hours to reach 0.001% done, but the hope is that these clients will then finally be bursting with data and returning the results home.
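
To make the overflow behaviour concrete, here is a minimal sketch in Python. It is not the SETI@home client code: `MAX_SIGNALS`, `CHECK_EVERY`, and `analyze` are hypothetical names invented for illustration. The sketch assumes a signal is counted whenever a sample exceeds the workunit threshold, and that the count is only compared against the limit at periodic checkpoints, which is one way a bad workunit might keep running for a while before it overflows.

```python
import random

MAX_SIGNALS = 30     # hypothetical reporting limit, not the real client value
CHECK_EVERY = 1000   # hypothetical: overflow is only tested every N samples


def analyze(samples, threshold, check_every=CHECK_EVERY):
    """Return (samples_processed, signals_found, overflowed)."""
    signals = 0
    for i, power in enumerate(samples, start=1):
        if power > threshold:
            signals += 1
        # The limit is only examined at checkpoints, so the count can
        # exceed MAX_SIGNALS before the run is cut short.
        if i % check_every == 0 and signals > MAX_SIGNALS:
            return i, signals, True
    return len(samples), signals, False


rng = random.Random(0)
samples = [rng.expovariate(1.0) for _ in range(20_000)]

sane = analyze(samples, threshold=25.0)   # almost nothing exceeds this
bogus = analyze(samples, threshold=0.5)   # most samples exceed this
print("sane threshold: ", sane)
print("bogus threshold:", bogus)
```

With the bogus threshold the run stops at the first checkpoint, far short of the full data set, which matches the "overflowing quickly and returning" pattern described above. How the real client schedules its check is not stated in this thread.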

This was actually a problem in beta that got fixed, but it has now somehow resurfaced, which is also a mystery. CVS out of sync? Some stupid code put in to check for config overrides on the command line? Unfortunately the splitter guru is on vacation, so we had to make our best attempt to understand the code and patch it ourselves. Jeff just did so and put the fixed version online, and we're watching the thresholds. So far so good.

Meanwhile, we're back to yesterday's problem of just not having enough throughput from the workunit file server, so that's the main bottleneck right now, and there's not much we can do about it except wait for the current artificial demand (caused by the excessive overflows) to die down and see if we catch up.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 620572 · Report as offensive
Profile HDRW

Joined: 18 Oct 02
Posts: 14
Credit: 189,189
RAC: 0
United Kingdom
Message 620584 - Posted: 16 Aug 2007, 23:22:03 UTC - in response to Message 620572.  

Matt,

Thanks for keeping us informed, as usual! Look on the bright side: at least you don't work for Skype! :-)

Just a thought: I wonder how much of the load on the system is related to the number of work units, and how much to the volume of them? When things get tough, how easy would it be to increase the size of the WUs, so that they take longer to come back, thus reducing the rate that communications are happening?
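
A rough way to reason about this question is sketched below in Python. All figures are illustrative rather than project statistics, and `requests_per_hour` and `requests_per_wu` are hypothetical names. The sketch assumes every host crunches back to back and that runtime scales linearly with workunit size.

```python
def requests_per_hour(active_hosts, wu_runtime_hours, requests_per_wu=2):
    """Server contacts per hour if every host crunches back to back.

    requests_per_wu is a hypothetical constant: one download plus one upload.
    """
    return active_hosts * requests_per_wu / wu_runtime_hours


hosts = 100_000  # illustrative figure, not a project statistic
normal = requests_per_hour(hosts, wu_runtime_hours=3.0)
doubled = requests_per_hour(hosts, wu_runtime_hours=6.0)
overflow = requests_per_hour(hosts, wu_runtime_hours=2 / 60)  # ~2-minute overflow WUs
print(normal, doubled, overflow)
```

Under these assumptions, doubling the workunit size halves the number of connections while the bytes moved per hour stay about the same, and two-minute overflow results multiply the connection rate many times over.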

Cheers,

Howard

ID: 620584 · Report as offensive
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15681
Credit: 81,497,941
RAC: 22,255
United States
Message 620588 - Posted: 16 Aug 2007, 23:27:02 UTC - in response to Message 620572.  
Last modified: 16 Aug 2007, 23:27:26 UTC

Unfortunately the splitter guru is on vacation


I take it that's Eric's dept?
ID: 620588 · Report as offensive
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 620616 - Posted: 17 Aug 2007, 0:10:27 UTC


Thank You Matt . . . Nice Work all-around

ID: 620616 · Report as offensive
Pablo_ZPM

Joined: 13 Jul 01
Posts: 3
Credit: 367,720
RAC: 0
Poland
Message 621011 - Posted: 17 Aug 2007, 12:34:40 UTC

Hello Matt and all,
I'm new on the forum though I've been churning out s@h work for some years now. I've been getting worrying workunits lately so I decided to post an inquiry. Maybe somebody will be able to make sense of what is happening. Below are the workunits received and sent back in the past 24 hours on one of my computers. 14 of them (traffic!!!) more or less add up to a single standard WU while generating 14 times the traffic! I (my comps) usually need to connect once every 48 hours to get/return work. What do you guys advise? Maybe we (those experiencing similar problems) should lay off for a while and get back online in, say, three or four days to let you guys and your overburdened servers have a breather and straighten things out. ET won't be going anywhere :o) What say you? By the way, for the past several years I've been admiring the great job you are doing. Keep it up and we dreamers will be there for you!


State  CPU time (sec)  Claimed credit  Granted credit
Done            79.14            0.02            0.01
Done           120.47            0.06            0.05
Done         9,503.13           19.32         pending
Done         9,538.72           19.32         pending
Done        46,457.27           16.91           16.91
Done        22,682.17           28.43           28.43
Done            79.52            0.02            0.02
Done           167.93            0.02            0.02
Done         9,689.89           19.31           19.31
Done        10,599.06           19.31         pending
Done         8,238.59           16.84           16.84
Done            84.91            0.02            0.02
Done        10,448.47           19.31           19.31
Done         6,712.44           12.15           12.15
Done           129.86            0.03            0.03

Pablo_ZPM
ID: 621011 · Report as offensive
Nick: ID 666
Volunteer tester

Joined: 18 May 99
Posts: 12988
Credit: 35,874,953
RAC: 20,526
United Kingdom
Message 621028 - Posted: 17 Aug 2007, 13:32:47 UTC - in response to Message 621011.  

Probably what you are seeing is being discussed in Work Unit Problem on the Number Crunching Board.

Andy
ID: 621028 · Report as offensive
Bounce

Joined: 3 Apr 99
Posts: 66
Credit: 5,604,569
RAC: 0
United States
Message 621032 - Posted: 17 Aug 2007, 13:44:26 UTC

Down again? One of my machines is on its last WU, and attempts to connect report that there are no new WUs available.
ID: 621032 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 13111
Credit: 148,027,083
RAC: 181,320
United Kingdom
Message 621053 - Posted: 17 Aug 2007, 14:25:53 UTC - in response to Message 621028.  

Actually, I don't think that list shows any significant problems at all. None of your low-credit WUs took longer than 3 minutes, and you only had 6 short ones (out of 15) - that's below the 50% rate that Matt says they were sending out for a while.
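
The count above can be checked with a few lines of Python. The CPU times are copied from the list in Pablo_ZPM's post; the 180-second cut-off is an illustrative reading of "3 minutes", not a project definition of a short workunit.

```python
# CPU seconds copied from the 15 results listed in Pablo_ZPM's post.
cpu_seconds = [
    79.14, 120.47, 9503.13, 9538.72, 46457.27,
    22682.17, 79.52, 167.93, 9689.89, 10599.06,
    8238.59, 84.91, 10448.47, 6712.44, 129.86,
]

SHORT_LIMIT = 180.0  # "3 minutes"; the cut-off itself is an illustrative choice

short = [t for t in cpu_seconds if t < SHORT_LIMIT]
share = len(short) / len(cpu_seconds)
print(f"{len(short)} of {len(cpu_seconds)} results were short ({share:.0%})")
# prints: 6 of 15 results were short (40%)
```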

Now they've fixed that problem, you should see fewer very short units: and you should expect to see the 'normal' units work their way through more quickly (because the new program is much more efficient).

The only one which might be a minor cause for concern is the one which took 46,457 seconds for 16.91 credits, but even that doesn't matter if it only happens rarely.

Yes, you're good to go - keep on crunching!
ID: 621053 · Report as offensive
Profile Clyde C. Phillips, III

Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 621227 - Posted: 17 Aug 2007, 18:24:02 UTC

I looked at my results today and saw several 10,500-second (for a PD950) 0.5 credit ones, most pending. Also I aborted two units that weren't progressing and whose time-to-completion kept increasing. There were also a lot of 0.5 to 30-second ones that must have been -9 overflow ones. There's still a big mess with the new Multibeam system. I know Matt is working as hard as he can.....
ID: 621227 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 621298 - Posted: 17 Aug 2007, 20:10:28 UTC - in response to Message 621227.  

Clyde

I hope you reported the workunit IDs in the NC Forum for the ones that continued to run and got nowhere... What was seen in Beta in a few of those cases: if you suspend, close BOINC and then restart, it either completes normally... or errors out. We had a few people play with a few that were captured and could not find what the cause was...

Please consider a Donation to the Seti Project.

ID: 621298 · Report as offensive
Christopher Coulter
Volunteer tester

Joined: 15 Sep 05
Posts: 1
Credit: 1,898
RAC: 0
United States
Message 621339 - Posted: 17 Aug 2007, 21:16:16 UTC

My computer's time to finish is 29 hrs. Is it possible that mine is in the 2.5 percent? Thanks, Christopher Coulter
ID: 621339 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 621561 - Posted: 18 Aug 2007, 0:53:15 UTC - in response to Message 620572.  

Is it possible that so much information is generated that the counter overflows and a value that is actually too small (possibly even negative) is reported?
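
For readers unfamiliar with the effect being suggested, here is a small Python sketch of a fixed-width counter wrapping around. Nothing in this thread says the client stores its signal count in a 16-bit field; the width and the `store_in_int16` helper are assumptions chosen only to show how a large count can read back as a small or negative one.

```python
import ctypes


def store_in_int16(count):
    """Store count in a 16-bit signed field and read it back (wraps silently)."""
    return ctypes.c_int16(count).value


for true_count in (30_000, 32_767, 32_768, 40_000, 70_000):
    print(true_count, "->", store_in_int16(true_count))
```

A count of 40,000 reads back as -25,536 and 70,000 reads back as 4,464, so any comparison against a limit made after the wrap would see a value that looks too small.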


ID: 621561 · Report as offensive
Profile blade148

Joined: 22 Jul 01
Posts: 5
Credit: 985,224
RAC: 0
United Kingdom
Message 622011 - Posted: 18 Aug 2007, 15:45:54 UTC

Hi Matt and thanks for the Update.

Could I ask a favour? (If it's in your realm?)

Could you update the news on the front page of the SETI web site?

It might help other peeps understand and remain patient while you guys get sorted.


thanks for your hard work !

Matt
ID: 622011 · Report as offensive
Profile Clyde C. Phillips, III

Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 622144 - Posted: 18 Aug 2007, 18:22:42 UTC - in response to Message 621298.  

I'll see if I can find those and report them, with appropriate computation log, to Numbercrunching. Hope it's not too late.

ID: 622144 · Report as offensive
seti@elrcastor.com
Volunteer tester

Joined: 30 Jan 00
Posts: 35
Credit: 4,879,559
RAC: 0
United States
Message 622441 - Posted: 19 Aug 2007, 3:45:35 UTC

The beta project still has issues

2007-08-16 08:30:06 [SETI@home Beta Test] Sending scheduler request: To fetch work
2007-08-16 08:30:06 [SETI@home Beta Test] Requesting 40883 seconds of new work
2007-08-16 08:30:12 [SETI@home Beta Test] Scheduler RPC succeeded
2007-08-16 08:30:12 [SETI@home Beta Test] Message from server: Project encountered internal error: shared memory
2007-08-16 08:30:12 [SETI@home Beta Test] Deferring communication for 1 hr 0 min 0 sec
2007-08-16 08:30:12 [SETI@home Beta Test] Reason: project is down
2007-08-16 08:30:12 [SETI@home Beta Test] Deferring communication for 3 hr 43 min 59 sec
2007-08-16 08:30:12 [SETI@home Beta Test] Reason: project is down
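
If you want to total up how long a client has been told to back off, a log like the one above can be parsed with a short script. This is a sketch that assumes every deferral line uses the "N hr N min N sec" form seen in this excerpt; `deferral_seconds` is a hypothetical helper, and other client versions may word the message differently.

```python
import re

LOG = """\
2007-08-16 08:30:12 [SETI@home Beta Test] Deferring communication for 1 hr 0 min 0 sec
2007-08-16 08:30:12 [SETI@home Beta Test] Reason: project is down
2007-08-16 08:30:12 [SETI@home Beta Test] Deferring communication for 3 hr 43 min 59 sec
2007-08-16 08:30:12 [SETI@home Beta Test] Reason: project is down
"""

DEFER = re.compile(r"Deferring communication for (\d+) hr (\d+) min (\d+) sec")


def deferral_seconds(log_text):
    """Return every deferral found in the log, converted to seconds."""
    return [
        int(h) * 3600 + int(m) * 60 + int(s)
        for h, m, s in DEFER.findall(log_text)
    ]


delays = deferral_seconds(LOG)
print(delays)
# prints: [3600, 13439]
```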
ID: 622441 · Report as offensive
Profile Alex Striker

Joined: 11 Jan 04
Posts: 40
Credit: 32,042,907
RAC: 14,668
Denmark
Message 623156 - Posted: 20 Aug 2007, 9:07:23 UTC

Hi Matt, Jeff, and all of the staff
thanks for the update
keep up the good work

/Alex Striker
Team Striker
Denmark
ID: 623156 · Report as offensive
Profile Thebrez1
Volunteer tester

Joined: 19 May 99
Posts: 4
Credit: 15,247,062
RAC: 4,679
United States
Message 623331 - Posted: 20 Aug 2007, 15:33:33 UTC

I am curious, not hostile. I have been running SETI since 1999. The old Seti used to run great, never any problems and I could store enough units to keep going when the servers did crash. Now I am about ready to take it off my system. It seems SETI is down more than up lately. Is it Seti or Boinc that is the problem? Is it ever going to be fixed? It just seems the "better" the program gets the worse it performs.
ID: 623331 · Report as offensive
DJStarfox

Joined: 23 May 01
Posts: 1060
Credit: 1,092,504
RAC: 61
United States
Message 623409 - Posted: 20 Aug 2007, 16:39:13 UTC - in response to Message 623331.  
Last modified: 20 Aug 2007, 16:39:48 UTC

Yes, the project goes up and down on occasion. In the budget for this year are several server upgrades that will give the project servers some redundancy, providing better uptime. Right now, SETI needs donations to make this happen.

Otherwise, if you want to keep your computer busy and you don't have any other projects, consider allocating 50% of BOINC's time to another project.
http://boinc.berkeley.edu/projects.php
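
As far as I understand it, BOINC splits computing time between attached projects in proportion to each project's resource share, so two projects with equal shares end up with about half each over the long run. The sketch below only illustrates that arithmetic; `cpu_fractions` and the project names are hypothetical, and the actual scheduling is done by the BOINC client, not by anything like this function.

```python
def cpu_fractions(resource_shares):
    """Map each project to its long-run fraction of computing time."""
    total = sum(resource_shares.values())
    return {name: share / total for name, share in resource_shares.items()}


shares = {"SETI@home": 100, "another project": 100}
fractions = cpu_fractions(shares)
print(fractions)
```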
ID: 623409 · Report as offensive
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15681
Credit: 81,497,941
RAC: 22,255
United States
Message 623435 - Posted: 20 Aug 2007, 16:56:39 UTC - in response to Message 623331.  

I am curious, not hostile. I have been running SETI since 1999. The old Seti used to run great, never any problems <snip>


I don't know about that. There's plenty of crunchers that have been here since the beginning that can tell you about some of the nightmares when SETI Classic had problems. They were just less noticeable then. ;)
ID: 623435 · Report as offensive
Profile RandyC
Joined: 20 Oct 99
Posts: 714
Credit: 1,704,345
RAC: 0
United States
Message 623453 - Posted: 20 Aug 2007, 17:22:28 UTC - in response to Message 620572.  

So here's the deal. Getting multibeam data out to the public is having its ups and downs.

- Matt


Just a quick question...

Now that multibeam data is out (and seemingly stable...for now), what's the status of the old data? Is there any left to be split?
ID: 623453 · Report as offensive


 
©2019 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.