Fast One (May 16 2007)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 569009 - Posted: 16 May 2007, 23:43:00 UTC Last modified: 16 May 2007, 23:43:54 UTC Quick note as I gotta catch a bus.. Wow - what a mess. I think we're in the middle of our biggest outage recovery to date, and it's breaking everything. The good news is we're coming into some newer hardware which we'll get on line to help somehow. See Eric's thread in the Staff Blog. He's been working overtime getting a new frankenstein machine together to act as another upload/download server and reduce the load on bruno. The scheduling server (galileo) has been choking - I just now moved all that over to bruno as well. So we may retire galileo soon, too. Jeff has been going nuts trying to track down errors in validator/assimilator code so we can get those on line as well. And our old friend "slow feeder query" is back, probably just being aggravated by the heavy load. Gotta go.. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 569009 ·

KB7RZF Volunteer tester Send message Joined: 15 Aug 99 Posts: 9549 Credit: 3,308,926 RAC: 2	Message 569012 - Posted: 16 May 2007, 23:47:33 UTC - in response to Message 569009. Matt, thanks for the quick update. We all keep our fingers crossed, and wish you all good luck on getting things sorted. You guys keep up the awesome job, we know it's a pain in the rear. Jeremy Quick note as I gotta catch a bus.. Wow - what a mess. I think we're in the middle of our biggest outage recovery to date, and it's breaking everything. The good news is we're coming into some newer hardware which we'll get on line to help somehow. See Eric's thread in the Staff Blog. He's been working overtime getting a new frankenstein machine together to act as another upload/download server and reduce the load on bruno. The scheduling server (galileo) has been choking - I just now moved all that over to bruno as well. So we may retire galileo soon, too. Jeff has been going nuts trying to track down errors in validator/assimilator code so we can get those on line as well. And our old friend "slow feeder query" is back, probably just being aggravated by the heavy load. Gotta go.. - Matt ID: 569012 ·

Flyer Send message Joined: 8 Aug 00 Posts: 3 Credit: 545,047 RAC: 0	Message 569038 - Posted: 17 May 2007, 0:23:34 UTC Matt and company, thanks for the great effort. Take your time, get it fixed correctly, and we'll all be better off for it. Again, thanks. Flyer ID: 569038 ·

JDenise Send message Joined: 29 Aug 01 Posts: 12 Credit: 2,493,076 RAC: 3	Message 569060 - Posted: 17 May 2007, 0:51:34 UTC Last modified: 17 May 2007, 0:52:24 UTC I know it's all in good hands. Keep up your spirits; there should be light at the end of the tunnel, so don't let it startle you when you come upon it. Best of luck & wishes, Jim USAF Projects Page My Home Site ID: 569060 ·

Claudel Send message Joined: 2 Dec 00 Posts: 1 Credit: 109,396 RAC: 0	Message 569063 - Posted: 17 May 2007, 0:55:25 UTC would it help if everybody stop asking for new work ? ID: 569063 ·

tombew Send message Joined: 12 Apr 00 Posts: 111 Credit: 12,182,261 RAC: 0	Message 569070 - Posted: 17 May 2007, 1:04:20 UTC Thanks for the update. ID: 569070 ·

Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0	Message 569074 - Posted: 17 May 2007, 1:11:43 UTC - in response to Message 569063. would it help if everybody stop asking for new work ? While a noble effort, you have no chance in getting that level of cooperation. I agree with what I've seen mentioned elsewhere, that the projects need an additional throttle mechanism built into BOINC; a "break glass" that is only performed in dire circumstances... Something that puts a little more control into their hands to get wildly out of control processes back in control quicker. ID: 569074 ·

Bill Walker Send message Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0	Message 569093 - Posted: 17 May 2007, 1:20:22 UTC First, let me say thanks to the SAH staff for their efforts over the last few weeks, both in fixing things and in keeping us informed. Second, I finally got a WU, and I'm trying to report it, but I'm getting a new error message - new to me anyway.
5/16/2007 8:47:39 PM|SETI@home|Sending scheduler request: To report completed tasks
5/16/2007 8:47:39 PM|SETI@home|Requesting 1863 seconds of new work, and reporting 1 completed tasks
5/16/2007 8:48:04 PM|SETI@home|Scheduler RPC succeeded [server version 509]
5/16/2007 8:48:04 PM|SETI@home|Message from server: Incomplete request received.
Anything I should be doing from my end, or is this part of the general mess already underway? ID: 569093 ·

divedude Send message Joined: 5 Jun 06 Posts: 9 Credit: 4,394,705 RAC: 0	Message 569108 - Posted: 17 May 2007, 1:30:20 UTC You guys do an awesome job and I hope that the work we all do helping to process the work units results in something. But, is it my understanding from the forums that you rely on one to three servers to upload/download units and process them? With no backup servers? A single server down should not have resulted in a 2 week or more downtime. I have just now started getting work, but my uploads are not working. We understand that it is based on donations, but a project this large should have backup servers in place before operating.. Can we as a community petition Sun to donate more server hardware to enhance the program? ID: 569108 ·

Martin Johnson Send message Joined: 9 Jun 01 Posts: 201 Credit: 224,995 RAC: 0	Message 569109 - Posted: 17 May 2007, 1:31:18 UTC I just got this too for the first time, plus New Host Venue (??):
01-44-20|SETI@home|Sending scheduler request: Requested by user
01-44-20|SETI@home|(not requesting new work or reporting completed tasks)
01-44-35|SETI@home|Scheduler RPC succeeded [server version 509]
01-44-35|SETI@home|Message from server: Incomplete request received.
01-44-35|SETI@home|New host venue:
01-44-35||General prefs: from SETI@home (last modified 2007-04-01 01:31:42)
01-44-35||Host location: none
01-44-35||General prefs: using your defaults
01-44-35|SETI@home|Deferring communication for 11 sec
01-44-35|SETI@home|Reason: requested by project
ID: 569109 ·

Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0	Message 569115 - Posted: 17 May 2007, 1:36:30 UTC - in response to Message 569093. Last modified: 17 May 2007, 1:38:52 UTC Anything I should be doing from my end, or is this part of the general mess already underway? I think everyone is seeing that now. Not sure what it is. Probably won't be until at least 15:00 GMT today (05/17) before it is fixed (unless someone is staying late again)... Edit: Had to change "tomorrow" to "today" because of being in EDT and so it is already "tomorrow" as far as GMT is concerned... ID: 569115 ·

Fuzzy Hollynoodles Volunteer tester Send message Joined: 3 Apr 99 Posts: 9659 Credit: 251,998 RAC: 0	Message 569130 - Posted: 17 May 2007, 2:10:28 UTC - in response to Message 569108. You guys do an awesome job and I hope that the work we all do helping to process the work units results in something. But, is it my understanding from the forums that you rely on one to three servers to upload/download units and process them? With no backup servers? A single server down should not have resulted in a 2 week or more downtime. I have just now started getting work, but my uploads are not working. We understand that it is based on donations, but a project this large should have backup servers in place before operating.. Can we as a community petition Sun to donate more server hardware to enhance the program? Hi divedude and welcome to the boards. :-) Yes, you have got it right, they are operating on a shoestring, all relied on money and hardware donations. They don't have any grants, and the failings of the servers we have seen here the past weeks are a result of working with old, obsolete hardware. They got the Thumper from Sun last year, but it was a beta test model and the one they got to replace Thumper was not a donation, they got it for a reduced price. So what they have to work with at the moment is donated hardware as it seems that the old servers are giving up one by one. Of fatigue, I suppose. And no, they don't have any backup servers, hence the long outages and difficulties the past weeks. So all donations are welcome, money and usable hardware. You can see their budget here. In case you would like to donate money please click on the link in my sig. Thank you. "I'm trying to maintain a shred of dignity in this world." - Me ID: 569130 ·

zoom3+1=4 Volunteer tester Send message Joined: 30 Nov 03 Posts: 65740 Credit: 55,293,173 RAC: 49	Message 569153 - Posted: 17 May 2007, 2:42:01 UTC - in response to Message 569130. You guys do an awesome job and I hope that the work we all do helping to process the work units results in something. But, is it my understanding from the forums that you rely on one to three servers to upload/download units and process them? With no backup servers? A single server down should not have resulted in a 2 week or more downtime. I have just now started getting work, but my uploads are not working. We understand that it is based on donations, but a project this large should have backup servers in place before operating.. Can we as a community petition Sun to donate more server hardware to enhance the program? Hi divedude and welcome to the boards. :-) Yes, you have got it right, they are operating on a shoestring, all relied on money and hardware donations. They don't have any grants, and the failings of the servers we have seen here the past weeks are a result of working with old, obsolete hardware. They got the Thumper from Sun last year, but it was a beta test model and the one they got to replace Thumper was not a donation, they got it for a reduced price. So what they have to work with at the moment is donated hardware as it seems that the old servers are giving up one by one. Of fatigue, I suppose. And no, they don't have any backup servers, hence the long outages and difficulties the past weeks. So all donations are welcome, money and usable hardware. You can see their budget here. In case you would like to donate money please click on the link in my sig. Thank you. Fatigue, figures, The Seti users must be driving the old servers into a metal breakdown. ;) Hopefully nothing else will go wrong as I'm about to put up a 5th cruncher on a shoestring of My own(I need to replace one or two psus eventually). 
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's ID: 569153 ·

Mithotar Send message Joined: 11 Apr 01 Posts: 88 Credit: 66,037,385 RAC: 50	Message 569159 - Posted: 17 May 2007, 3:08:13 UTC - in response to Message 569063. would it help if everybody stop asking for new work ? It might help but as noted elsewhere it's not likely to happen. I have 5 PCs doing BOINC.......I have shut down BOINC on 4 of the 5 and left just the 1 running BOINC ...my "canary" if you like ......it's not much but every little bit will help get things back to normal....... ID: 569159 ·

TarracoServer Volunteer tester Send message Joined: 11 Apr 07 Posts: 38 Credit: 595,022 RAC: 0	Message 569310 - Posted: 17 May 2007, 8:58:21 UTC I don't think stopping BOINC clients would be the best option, because the problem will come when they say "You can connect now": another overflow for Bruno. The best option is what they're doing: a new upload/download server to free up Bruno. Keep up the good job! ID: 569310 ·

Teasel Send message Joined: 16 May 03 Posts: 2 Credit: 3,467,167 RAC: 0	Message 569326 - Posted: 17 May 2007, 9:20:20 UTC - in response to Message 569074. would it help if everybody stop asking for new work ? While a noble effort, you have no chance in getting that level of cooperation. I agree with what I've seen mentioned elsewhere, that the projects need an additional throttle mechanism built into BOINC; a "break glass" that is only performed in dire circumstances... Something that puts a little more control into their hands to get wildly out of control processes back in control quicker. I'm no network expert, but how about simply firewalling out half of the internet? That might relieve the load on the servers sufficiently that they could achieve some reasonable throughput and clear some of the backlog. A bit tough on the half that's firewalled, but the chunk of IP addresses allowed through could be changed every few hours to give everyone a chance. ID: 569326 ·
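The rotating firewall idea in the post above could be sketched roughly like this. It is only an illustration of the idea, not anything the project runs: the function name `allowed`, the four-hour window, and the choice to split clients by address parity are all assumptions made for the example.

```python
import ipaddress

def allowed(ip, now_seconds, window_seconds=4 * 3600, groups=2):
    """Return True if `ip` belongs to the group let through in the current window.

    Hypothetical sketch: addresses are split into `groups` buckets by their
    numeric value, and the admitted bucket rotates every `window_seconds`.
    """
    addr = int(ipaddress.ip_address(ip))            # numeric form of the address
    window = int(now_seconds // window_seconds)     # index of the current time window
    return (addr + window) % groups == 0            # admitted bucket shifts each window

# Two neighbouring addresses land in different halves...
print(allowed("10.0.0.1", 0), allowed("10.0.0.2", 0))
# ...and the same address changes state when the window rolls over.
print(allowed("10.0.0.1", 0), allowed("10.0.0.1", 4 * 3600))
```

As the follow-up reply points out, this only helps if the servers, and not the firewall itself, are the bottleneck.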

Mephist0 Volunteer tester Send message Joined: 4 Dec 99 Posts: 12 Credit: 1,401,540 RAC: 0	Message 569328 - Posted: 17 May 2007, 9:22:32 UTC - in response to Message 569159. would it help if everybody stop asking for new work ? It might help but as noted elsewhere it's not likely to happen. I have 5 PCs doing BOINC.......I have shut down BOINC on 4 of the 5 and left just the 1 running BOINC ...my "canary" if you like ......it's not much but every little bit will help get things back to normal....... I have done the same.. I have 11 PCs running SETI but only one is running SETI right now (right now it's not requesting more work). The 10 others are running Rosetta until the connection problems get better :) I think that will help reduce the load on the project.. ID: 569328 ·

Mephist0 Volunteer tester Send message Joined: 4 Dec 99 Posts: 12 Credit: 1,401,540 RAC: 0	Message 569329 - Posted: 17 May 2007, 9:24:58 UTC - in response to Message 569326. would it help if everybody stop asking for new work ? While a noble effort, you have no chance in getting that level of cooperation. I agree with what I've seen mentioned elsewhere, that the projects need an additional throttle mechanism built into BOINC; a "break glass" that is only performed in dire circumstances... Something that puts a little more control into their hands to get wildly out of control processes back in control quicker. I'm no network expert, but how about simply firewalling out half of the internet? That might relieve the load on the servers sufficiently that they could achieve some reasonable throughput and clear some of the backlog. A bit tough on the half that's firewalled, but the chunk of IP addresses allowed through could be changed every few hours to give everyone a chance. That's a real good idea! I'm no network expert either, but it sounds like a good idea.. If the bottleneck is the servers and not the firewall itself, that would work, I think.. :) ID: 569329 ·

Delong Volunteer tester Send message Joined: 12 Jun 99 Posts: 105 Credit: 5,858,225 RAC: 0	Message 569342 - Posted: 17 May 2007, 11:09:24 UTC Looks as though everything is back to normal, uploading and downloading no problems here. Well done chaps. ID: 569342 ·

ML1 Volunteer moderator Volunteer tester Send message Joined: 25 Nov 01 Posts: 20267 Credit: 7,508,002 RAC: 20	Message 569367 - Posted: 17 May 2007, 12:20:21 UTC - in response to Message 569074. Last modified: 17 May 2007, 12:25:35 UTC would it help if everybody stop asking for new work ? While a noble effort, you have no chance in getting that level of cooperation. I agree with what I've seen mentioned elsewhere, that the projects need an additional throttle mechanism built into BOINC... There is "exponential backoff" built into the Boinc manager that is designed to avoid giving a Boinc project a DDoS from its own clients. Perhaps that feature needs to be looked at again... There's also a problem/vulnerability in the Boinc communication protocol in that more than one TCP connection is required to complete a transaction. There is then additional wasteful overhead generated if any part of the sequence fails. Worse still, under heavy load, the chance of getting subsequent connections to complete the sequence gets ever smaller (choked by all the other first connection attempts from everyone else) until no one can complete the sequence... Load shaping to give higher priority to connections that are further along the sequence, so that once you're in you are guaranteed to complete the transfer, would likely greatly help. Happy crunchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) ID: 569367 ·
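For readers unfamiliar with the "exponential backoff" mentioned in the post above, a minimal sketch of the general technique follows. This is not BOINC's actual client code: the function name `backoff_delay`, the 60-second base, the 4-hour cap, and the randomisation range are all assumptions chosen for illustration.

```python
import random

def backoff_delay(failures, base=60.0, cap=4 * 3600.0, rng=None):
    """Seconds to wait after `failures` consecutive failed requests.

    Hypothetical sketch: the ceiling doubles with each failure up to `cap`,
    and the actual delay is drawn at random from the upper half of that range
    so that many clients do not all retry at the same instant.
    """
    if failures < 1:
        return 0.0                                   # no failures yet: retry immediately
    rng = rng or random.Random()
    ceiling = min(cap, base * 2 ** (failures - 1))   # 60 s, 120 s, 240 s, ... capped
    return rng.uniform(ceiling / 2, ceiling)         # random spread within [ceiling/2, ceiling]

# Print one sample delay for each of the first six consecutive failures.
for n in range(1, 7):
    print(n, round(backoff_delay(n, rng=random.Random(n)), 1))
```

The random spread matters as much as the doubling: without it, clients that failed together would all come back together and recreate the overload.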


SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.