Fast One (May 16 2007)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 569009 - Posted: 16 May 2007, 23:43:00 UTC
Last modified: 16 May 2007, 23:43:54 UTC

Quick note as I gotta catch a bus..

Wow - what a mess. I think we're in the middle of our biggest outage recovery to date, and it's breaking everything. The good news is we're coming into some newer hardware which we'll get on line to help somehow.

See Eric's thread in the Staff Blog. He's been working overtime getting a new frankenstein machine together to act as another upload/download server and reduce the load on bruno. The scheduling server (galileo) has been choking - I just now moved all that over to bruno as well. So we may retire galileo soon, too. Jeff has been going nuts trying to track down errors in validator/assimilator code so we can get those on line as well. And our old friend "slow feeder query" is back, probably just being aggravated by the heavy load.

Gotta go..

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 569009
KB7RZF
Volunteer tester
Joined: 15 Aug 99
Posts: 9549
Credit: 3,308,926
RAC: 2
United States
Message 569012 - Posted: 16 May 2007, 23:47:33 UTC - in response to Message 569009.  

Matt, thanks for the quick update. We're all keeping our fingers crossed and wish you good luck getting things sorted. Keep up the awesome work; we know it's a pain in the rear.

Jeremy

ID: 569012
Flyer
Joined: 8 Aug 00
Posts: 3
Credit: 545,047
RAC: 0
United States
Message 569038 - Posted: 17 May 2007, 0:23:34 UTC

Matt and company,

thanks for the great effort. Take your time, get it fixed correctly, and we'll all be better off for it.

Again, thanks
Flyer
ID: 569038
JDenise
Joined: 29 Aug 01
Posts: 12
Credit: 2,493,076
RAC: 3
United States
Message 569060 - Posted: 17 May 2007, 0:51:34 UTC
Last modified: 17 May 2007, 0:52:24 UTC

I know it's all in good hands.

Keep up your spirits. There should be light at the end of the tunnel, so don't let it startle you when you come upon it.

Best of luck & wishes
Jim

USAF Projects Page
My Home Site
ID: 569060
Claudel
Joined: 2 Dec 00
Posts: 1
Credit: 109,396
RAC: 0
Canada
Message 569063 - Posted: 17 May 2007, 0:55:25 UTC

Would it help if everybody stopped asking for new work?
ID: 569063
tombew
Joined: 12 Apr 00
Posts: 111
Credit: 12,182,261
RAC: 0
United States
Message 569070 - Posted: 17 May 2007, 1:04:20 UTC

Thanks for the update.
ID: 569070
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569074 - Posted: 17 May 2007, 1:11:43 UTC - in response to Message 569063.  

Would it help if everybody stopped asking for new work?


While it's a noble effort, you have no chance of getting that level of cooperation.

I agree with what I've seen mentioned elsewhere: the projects need an additional throttle mechanism built into BOINC, a "break glass" option that is only used in dire circumstances... Something that puts a little more control into their hands so they can get wildly out-of-control processes back under control more quickly.
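
As a rough illustration only, a "break glass" throttle like this could be a token bucket that project staff switch on in an emergency. This is a sketch under assumptions, not something that exists in BOINC: the `EmergencyThrottle` class, its `engaged` flag, and the `rate_per_sec` and `burst` parameters are all hypothetical names.

```python
import time


class EmergencyThrottle:
    """Hypothetical token bucket that only limits requests once engaged."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec    # tokens added per second
        self.burst = burst          # maximum number of stored tokens
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()
        self.engaged = False        # the "break glass" switch

    def admit(self):
        """Return (admitted, retry_after_seconds) for one incoming request."""
        if not self.engaged:
            return True, 0.0
        now = self.clock()
        # Refill tokens for the time elapsed, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # Not enough tokens: report how long until one becomes available.
        return False, (1.0 - self.tokens) / self.rate
```

In normal operation `engaged` stays `False` and every request is admitted; once it is set, refused requests get a retry delay that a server could pass back to the client.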
ID: 569074
Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 569093 - Posted: 17 May 2007, 1:20:22 UTC

First, let me say thanks to the SAH staff for their efforts over the last few weeks, both in fixing things and in keeping us informed.

Second, I finally got a WU, and I'm trying to report it, but I'm getting a new error message - new to me anyway.

5/16/2007 8:47:39 PM|SETI@home|Sending scheduler request: To report completed tasks
5/16/2007 8:47:39 PM|SETI@home|Requesting 1863 seconds of new work, and reporting 1 completed tasks
5/16/2007 8:48:04 PM|SETI@home|Scheduler RPC succeeded [server version 509]
5/16/2007 8:48:04 PM|SETI@home|Message from server: Incomplete request received.

Anything I should be doing from my end, or is this part of the general mess already underway?

ID: 569093
divedude
Joined: 5 Jun 06
Posts: 9
Credit: 4,394,705
RAC: 0
United States
Message 569108 - Posted: 17 May 2007, 1:30:20 UTC

You guys do an awesome job, and I hope that the work we all do helping to process the work units results in something. But am I right in understanding from the forums that you rely on one to three servers to upload/download units and process them? With no backup servers? A single server going down should not have resulted in two weeks or more of downtime. I have just now started getting work, but my uploads are not working. We understand that the project is based on donations, but a project this large should have backup servers in place before operating. Can we as a community petition Sun to donate more server hardware to enhance the program?
ID: 569108
Martin Johnson
Joined: 9 Jun 01
Posts: 201
Credit: 224,995
RAC: 0
United Kingdom
Message 569109 - Posted: 17 May 2007, 1:31:18 UTC

I just got this too for the first time, plus New Host Venue (??):

01-44-20|SETI@home|Sending scheduler request: Requested by user
01-44-20|SETI@home|(not requesting new work or reporting completed tasks)
01-44-35|SETI@home|Scheduler RPC succeeded [server version 509]
01-44-35|SETI@home|Message from server: Incomplete request received.
01-44-35|SETI@home|New host venue:
01-44-35||General prefs: from SETI@home (last modified 2007-04-01 01:31:42)
01-44-35||Host location: none
01-44-35||General prefs: using your defaults
01-44-35|SETI@home|Deferring communication for 11 sec
01-44-35|SETI@home|Reason: requested by project
ID: 569109
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569115 - Posted: 17 May 2007, 1:36:30 UTC - in response to Message 569093.  
Last modified: 17 May 2007, 1:38:52 UTC


Anything I should be doing from my end, or is this part of the general mess already underway?


I think everyone is seeing that now. Not sure what it is. It probably won't be fixed until at least 15:00 GMT today (05/17), unless someone is staying late again...

Edit: Had to change "tomorrow" to "today" because I'm in EDT, so it is already "tomorrow" as far as GMT is concerned...
ID: 569115
Fuzzy Hollynoodles
Volunteer tester
Joined: 3 Apr 99
Posts: 9659
Credit: 251,998
RAC: 0
Message 569130 - Posted: 17 May 2007, 2:10:28 UTC - in response to Message 569108.  

You guys do an awesome job, and I hope that the work we all do helping to process the work units results in something. But am I right in understanding from the forums that you rely on one to three servers to upload/download units and process them? With no backup servers? A single server going down should not have resulted in two weeks or more of downtime. I have just now started getting work, but my uploads are not working. We understand that the project is based on donations, but a project this large should have backup servers in place before operating. Can we as a community petition Sun to donate more server hardware to enhance the program?


Hi divedude and welcome to the boards. :-)

Yes, you have got it right: they are operating on a shoestring, relying entirely on donations of money and hardware. They don't have any grants, and the server failures we have seen over the past weeks are a result of working with old, obsolete hardware. They got the Thumper from Sun last year, but it was a beta test model, and the one they got to replace Thumper was not a donation; they got it for a reduced price. So what they have to work with at the moment is donated hardware, and the old servers seem to be giving up one by one. Of fatigue, I suppose.

And no, they don't have any backup servers, hence the long outages and difficulties of the past weeks.

So all donations are welcome, money and usable hardware. You can see their budget here.

In case you would like to donate money please click on the link in my sig. Thank you.


"I'm trying to maintain a shred of dignity in this world." - Me

ID: 569130
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65740
Credit: 55,293,173
RAC: 49
United States
Message 569153 - Posted: 17 May 2007, 2:42:01 UTC - in response to Message 569130.  

Fatigue, figures. The SETI users must be driving the old servers into a metal breakdown. ;) Hopefully nothing else will go wrong, as I'm about to put up a 5th cruncher on a shoestring of my own (I need to replace one or two PSUs eventually).
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 569153
Mithotar
Joined: 11 Apr 01
Posts: 88
Credit: 66,037,385
RAC: 50
United States
Message 569159 - Posted: 17 May 2007, 3:08:13 UTC - in response to Message 569063.  

Would it help if everybody stopped asking for new work?



It might help, but as noted elsewhere it's not likely to happen.
I have 5 PCs doing BOINC. I have shut down BOINC on 4 of
the 5 and left just the 1 running BOINC, my "canary" if you like.
It's not much, but every little bit will help get things
back to normal.



ID: 569159
TarracoServer
Volunteer tester
Joined: 11 Apr 07
Posts: 38
Credit: 595,022
RAC: 0
Spain
Message 569310 - Posted: 17 May 2007, 8:58:21 UTC

I don't think stopping BOINC clients would be the best option, because the problem will come back as soon as they say "You can connect now": another overflow for Bruno.

The best approach is what they're doing: a new upload/download server to free up Bruno.

Keep up the good work!
ID: 569310
Teasel
Joined: 16 May 03
Posts: 2
Credit: 3,467,167
RAC: 0
United Kingdom
Message 569326 - Posted: 17 May 2007, 9:20:20 UTC - in response to Message 569074.  

Would it help if everybody stopped asking for new work?


While it's a noble effort, you have no chance of getting that level of cooperation.

I agree with what I've seen mentioned elsewhere: the projects need an additional throttle mechanism built into BOINC, a "break glass" option that is only used in dire circumstances... Something that puts a little more control into their hands so they can get wildly out-of-control processes back under control more quickly.

I'm no network expert, but how about simply firewalling out half of the internet? That might relieve the load on the servers enough that they could achieve some reasonable throughput and clear some of the backlog. A bit tough on the half that's firewalled, but the chunk of IP addresses allowed through could be changed every few hours to give everyone a chance.
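
To make the rotation idea concrete, here is a minimal sketch of the admission rule, assuming addresses are split into groups by their numeric value and the admitted group changes every few hours. The function name `is_admitted`, the modulo grouping, and the three-hour window are illustrative assumptions; a real firewall would more likely work with address blocks than with a per-request check like this.

```python
import ipaddress


def is_admitted(ip, now_seconds, groups=2, window_seconds=3 * 3600):
    """Admit an address only while its group's time window is active."""
    # Assign the address to a group from its integer value (assumed scheme).
    group = int(ipaddress.ip_address(ip)) % groups
    # The active group advances by one each time a window elapses.
    active = int(now_seconds // window_seconds) % groups
    return group == active
```

With two groups, each address is let through for three hours and then shut out for the next three, so everyone gets a turn.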
ID: 569326
Mephist0
Volunteer tester
Joined: 4 Dec 99
Posts: 12
Credit: 1,401,540
RAC: 0
Sweden
Message 569328 - Posted: 17 May 2007, 9:22:32 UTC - in response to Message 569159.  

Would it help if everybody stopped asking for new work?



It might help, but as noted elsewhere it's not likely to happen.
I have 5 PCs doing BOINC. I have shut down BOINC on 4 of
the 5 and left just the 1 running BOINC, my "canary" if you like.
It's not much, but every little bit will help get things
back to normal.




I have done the same. I have 11 PCs running SETI, but only one is running SETI right now (and right now it's not requesting more work). The other 10 are running Rosetta until the connection problems get better :) I think that will help reduce the load on the project.
ID: 569328
Mephist0
Volunteer tester
Joined: 4 Dec 99
Posts: 12
Credit: 1,401,540
RAC: 0
Sweden
Message 569329 - Posted: 17 May 2007, 9:24:58 UTC - in response to Message 569326.  

Would it help if everybody stopped asking for new work?


While it's a noble effort, you have no chance of getting that level of cooperation.

I agree with what I've seen mentioned elsewhere: the projects need an additional throttle mechanism built into BOINC, a "break glass" option that is only used in dire circumstances... Something that puts a little more control into their hands so they can get wildly out-of-control processes back under control more quickly.

I'm no network expert, but how about simply firewalling out half of the internet? That might relieve the load on the servers enough that they could achieve some reasonable throughput and clear some of the backlog. A bit tough on the half that's firewalled, but the chunk of IP addresses allowed through could be changed every few hours to give everyone a chance.


That's a really good idea! I'm no network expert either, but it sounds like a good idea. If the bottleneck is the servers and not the firewall itself, that would work, I think. :)
ID: 569329
Delong
Volunteer tester
Joined: 12 Jun 99
Posts: 105
Credit: 5,858,225
RAC: 0
United Kingdom
Message 569342 - Posted: 17 May 2007, 11:09:24 UTC

Looks as though everything is back to normal, uploading and downloading no problems here. Well done chaps.
ID: 569342
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20267
Credit: 7,508,002
RAC: 20
United Kingdom
Message 569367 - Posted: 17 May 2007, 12:20:21 UTC - in response to Message 569074.  
Last modified: 17 May 2007, 12:25:35 UTC

Would it help if everybody stopped asking for new work?

While it's a noble effort, you have no chance of getting that level of cooperation.

I agree with what I've seen mentioned elsewhere, that the projects need an additional throttle mechanism built into BOINC...

There is "exponential backoff" built into the Boinc manager that is designed to avoid giving a Boinc project a DDoS from its own clients. Perhaps that feature needs to be looked at again...
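
For anyone unfamiliar with the term, exponential backoff in general looks something like the sketch below. This is not the actual Boinc client code: the function name `backoff_delay`, the 60-second base, the 4-hour cap, and the added randomisation are all assumptions chosen for illustration.

```python
import random


def backoff_delay(failures, base=60.0, cap=4 * 3600.0, rng=random.random):
    """Seconds to wait before retrying after `failures` consecutive failures."""
    # Double the ceiling for each failure, but never exceed the cap.
    ceiling = min(cap, base * (2 ** failures))
    # Pick a delay in [ceiling/2, ceiling] so clients do not retry in lockstep.
    return ceiling * (0.5 + 0.5 * rng())
```

The randomisation matters as much as the doubling: without it, every client that failed at the same moment would come back at the same moment too.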

There's also a problem/vulnerability in the Boinc communication protocol: completing a transaction requires more than one TCP connection. Additional wasteful overhead is then generated if any part of the sequence fails. Worse still, under heavy load, the chance of getting the subsequent connections needed to complete the sequence keeps shrinking (choked by all the first-connection attempts from everyone else) until no one can complete the sequence...

Load shaping that gives higher priority to connections further along the sequence, so that once you're in you are guaranteed to complete the transfer, would likely help greatly.
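
One way to picture that kind of load shaping is a priority queue keyed on how far along its sequence each pending connection is. The sketch below is an assumption-laden toy, not how any real server or traffic shaper is implemented: `StageQueue` and the integer `stage` number are hypothetical.

```python
import heapq
import itertools


class StageQueue:
    """Serve pending connections furthest along their sequence first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: first come, first served

    def push(self, client, stage):
        # Negate the stage so the highest stage pops first from the min-heap.
        heapq.heappush(self._heap, (-stage, next(self._order), client))

    def pop(self):
        _, _, client = heapq.heappop(self._heap)
        return client

    def __len__(self):
        return len(self._heap)
```

A client on the third connection of its sequence is served before any number of clients still on their first, which is what stops new arrivals from starving transactions that are nearly done.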


Happy crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 569367
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.