Panic Mode On (12) Server problems

perryjay Volunteer tester Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0	Message 867792 - Posted: 21 Feb 2009, 23:06:55 UTC - in response to Message 867778. Raistmer just came out with his new V9 package so downloads may go crazy as we try to get the new AP5.03s PROUD MEMBER OF Team Starfire World BOINC ID: 867792 ·

Fred J. Verster Volunteer tester Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0	Message 867804 - Posted: 21 Feb 2009, 23:30:31 UTC - in response to Message 867792. Last modified: 21 Feb 2009, 23:31:03 UTC
Server page shows nothing out of the ordinary, but all my uploads are stuck:
22-2-2009 0:09:07|SETI@home|Backing off 3 hr 45 min 58 sec on upload of 15ja09aa.4494.11524.7.8.142_0_0
22-2-2009 0:09:09||Internet access OK - project servers may be temporarily down.
22-2-2009 0:09:10|SETI@home|Computation for task 14ja09aa.4912.11520.8.8.87_0 finished
22-2-2009 0:09:10|SETI@home|Starting 14ja09aa.4912.11520.8.8.95_0
22-2-2009 0:09:10|SETI@home|Starting task 14ja09aa.4912.11520.8.8.95_0 using setiathome_enhanced version 603
22-2-2009 0:09:13|SETI@home|Started upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:10:12||Project communication failed: attempting access to reference site
22-2-2009 0:10:12|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
22-2-2009 0:10:12|SETI@home|Backing off 1 min 0 sec on upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:10:13||Internet access OK - project servers may be temporarily down.
22-2-2009 0:11:13|SETI@home|Started upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:11:35||Project communication failed: attempting access to reference site
22-2-2009 0:11:35|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
22-2-2009 0:11:35|SETI@home|Backing off 1 min 0 sec on upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:11:36||Internet access OK - project servers may be temporarily down.
22-2-2009 0:11:42|SETI@home|Computation for task 15ja09aa.4494.11524.7.8.136_1 finished
22-2-2009 0:11:42|SETI@home|Starting 14ja09aa.4912.11520.8.8.69_0
22-2-2009 0:11:42|SETI@home|Starting task 14ja09aa.4912.11520.8.8.69_0 using setiathome_enhanced version 603
22-2-2009 0:11:44|SETI@home|Started upload of 15ja09aa.4494.11524.7.8.136_1_0
22-2-2009 0:12:35|SETI@home|Started upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:12:57||Project communication failed: attempting access to reference site
22-2-2009 0:12:57|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
22-2-2009 0:12:57|SETI@home|Backing off 1 min 0 sec on upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:12:58||Internet access OK - project servers may be temporarily down.
22-2-2009 0:12:58|SETI@home|Started upload of 15ja09aa.4494.11524.7.8.152_0_0
Maybe it's solved in a few hours, I hope ;) ID: 867804 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 867851 - Posted: 22 Feb 2009, 1:56:50 UTC - in response to Message 867773. I messaged Eric to see if he could kick the Upload server remotely from home Pete, It's not the upload server he needs to kick. The communications channel has been saturated with downloads for 20 solid hours now. Sometimes this is because of MB 'shorties' (VHAR), but I've seen no sign of that in my own downloads. Instead, I've been getting a larger allocation than recently of AP_v5, and the ones I'm crunching seem to be running at normal speed. But the server status page is showing AP results being returned at well over 3,000 per hour, yet the number awaiting validation - with the validator disabled - is rising at only 40 per hour. The average turnround time for AP tasks will soon fall below 10 hours, which is ludicrous. I suspect that a significant number of hosts are trashing every AP task they receive, and coming back for more. If every AP task returned is replaced by one new download, then 3,000 tasks per hour - almost 1 every second - requires 8 megabytes of download every second, or 64 megabits of pure data. Add check bits, routing data, communications protocol overhead and so on and the 95 Mbs effective throughput is easily reached. In order to protect the servers from overload, and preserve the integrity of the science database, Eric should be putting a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves. ID: 867851 ·
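The bandwidth estimate in the post above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the 3,000 tasks per hour and the roughly 8 megabytes per task are the figures given in the post, while the function name is made up for illustration. At exactly 3,000 tasks per hour the payload works out a little lower than the post's rounded "1 every second" figure, but the conclusion is the same once protocol overhead is added.

```python
# Back-of-envelope check of the download load described in the post.
# The inputs are the post's own figures; the function name is hypothetical.

def download_megabits_per_second(tasks_per_hour: float, megabytes_per_task: float) -> float:
    """Payload bandwidth in Mbit/s if every returned task is replaced by one new download."""
    tasks_per_second = tasks_per_hour / 3600.0
    return tasks_per_second * megabytes_per_task * 8.0  # 8 bits per byte


if __name__ == "__main__":
    exact = download_megabits_per_second(3000, 8)    # the rate quoted in the post
    rounded = download_megabits_per_second(3600, 8)  # rounding up to 1 task per second
    print(f"3,000 tasks/hour:         {exact:.1f} Mbit/s of pure data")
    print(f"rounded to 1 task/second: {rounded:.1f} Mbit/s of pure data")
```

Neither figure includes check bits, routing data or protocol overhead, which the post adds on top to reach the observed throughput.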

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 867853 - Posted: 22 Feb 2009, 1:58:55 UTC - in response to Message 867758. I cannot download new WU's because my completed WUs don't upload thus it doesn't request more work. I've changed my cache size but it still just requests 0 seconds of work. One of the safeties built in is a limit on 2*ncpus uploads before work fetch is halted to that project. BOINC WIKI ID: 867853 ·
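The safety described in the post above can be sketched as a one-line check. This is not BOINC's actual source code: the function name is hypothetical, and whether the real client halts work fetch at exactly 2*ncpus stuck uploads or only above that number is an assumption here.

```python
# Sketch of the rule as stated in the post: once a project has 2 * ncpus
# uploads waiting, the client stops asking that project for more work.
# Hypothetical helper; the "at or above the limit" boundary is an assumption.

def work_fetch_allowed(pending_uploads: int, ncpus: int) -> bool:
    """Return False once the upload backlog reaches the 2 * ncpus limit."""
    return pending_uploads < 2 * ncpus


if __name__ == "__main__":
    # An 8-core host with 16 results stuck in the upload queue requests no work.
    print(work_fetch_allowed(pending_uploads=16, ncpus=8))
    # A quad-core host with 3 stuck uploads can still fetch.
    print(work_fetch_allowed(pending_uploads=3, ncpus=4))
```

Under this reading, changing the cache size makes no difference, because the check looks only at the upload backlog.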

Westsail and Pyxey Volunteer tester Send message Joined: 26 Jul 99 Posts: 338 Credit: 20,544,999 RAC: 0	Message 867866 - Posted: 22 Feb 2009, 2:34:29 UTC "The most exciting phrase to hear in science, the one that heralds new discoveries, is not Eureka! (I found it!) but rather, 'hmm... that's funny...'" -- Isaac Asimov ID: 867866 ·

perryjay Volunteer tester Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0	Message 867872 - Posted: 22 Feb 2009, 2:50:10 UTC Trying to do what little bit I can to help. I've suspended my network activity for the night at the very least. I will try again in the morning and if it's no better I'll suspend it again. PROUD MEMBER OF Team Starfire World BOINC ID: 867872 ·

nutcase Volunteer tester Send message Joined: 13 Jun 05 Posts: 19 Credit: 6,589,801 RAC: 0	Message 867937 - Posted: 22 Feb 2009, 5:14:51 UTC - in response to Message 867853. Last modified: 22 Feb 2009, 5:16:29 UTC I cannot download new WU's because my completed WUs don't upload thus it doesn't request more work. I've changed my cache size but it still just requests 0 seconds of work. One of the safeties built in is a limit on 2*ncpus uploads before work fetch is halted to that project. well, this one is affecting me badly as it is affecting other projects also. My 8 core system refuse to get new work from ANY PROJECT! this is not affecting my quad or dual core systems though. so, basically right now I have 16 cores idle doing nothing because BOINC will not get work from any project I attach to. ID: 867937 ·

Jack Zhang Volunteer tester Send message Joined: 2 Jul 06 Posts: 206 Credit: 6,142,449 RAC: 0	Message 867942 - Posted: 22 Feb 2009, 5:45:50 UTC The net load is at MAX for the past few hours... Too many WU upload connections or an attack? What if Fiction was Fact and Fact was Fiction and vice versa? ID: 867942 ·

littlegreenmanfrommars Volunteer tester Send message Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0	Message 867943 - Posted: 22 Feb 2009, 5:48:42 UTC - in response to Message 867937. I cannot download new WU's because my completed WUs don't upload thus it doesn't request more work. I've changed my cache size but it still just requests 0 seconds of work. One of the safeties built in is a limit on 2*ncpus uploads before work fetch is halted to that project. well, this one is affecting me badly as it is affecting other projects also. My 8 core system refuse to get new work from ANY PROJECT! this is not affecting my quad or dual core systems though. so, basically right now I have 16 cores idle doing nothing because BOINC will not get work from any project I attach to. I would suggest this is a problem with your 8 core system, as the others are working ok. Problems with SETI@home should not affect other projects, except to allow your machine to crunch extra work from them while S@h sorts its life out. Once S@h is working correctly, you should find it has accrued a "debt" from the other projects, so it will catch up with the work it "missed" during the hiccup/outage. I have managed to upload ONE WU, and am still happily downloading new work. One rig has 8 completed WU's in the upload queue, a second rig has three. Both are downloading with no issues. ID: 867943 ·

littlegreenmanfrommars Volunteer tester Send message Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0	Message 867946 - Posted: 22 Feb 2009, 5:56:52 UTC - in response to Message 867942. Last modified: 22 Feb 2009, 5:58:38 UTC The net load is at MAX for the past few hours... Too many WU upload connections or an attack? Every completed WU causes BOINC to contact S@h at regular intervals, trying to upload. (Look under "Transfers" tab). It follows that the more completed WU's a given machine has in its upload queue, the more attempts it will make to contact the S@h servers in a given period of time. Although each packet sent for these attempts is relatively small, there will be several hundred thousand machines, each with a growing list of completed WU's in their queues. This will generate a lot of network traffic. Best advice in such a situation is to turn off network activity for a while. (BOINC tool menu > select Network activity suspended). If all crunchers suspend network activity overnight, this should reduce contacts by approximately 33% at any given time, taking some of the pressure off. As soon as the log jam starts to move, it should clear up pretty quickly. Previous experience says about 4 to 6 hours. ID: 867946 ·

Tribble Send message Joined: 21 Feb 02 Posts: 65 Credit: 7,978,002 RAC: 0	Message 867949 - Posted: 22 Feb 2009, 6:40:38 UTC - in response to Message 867946. If all crunchers suspend network activity overnight, this should reduce contacts by approximately 33% at any given time, taking some of the pressure off. As soon as the log jam starts to move, it should clear up pretty quickly. Previous experience says about 4 to 6 hours. I've suspended network activity as you suggested as there isn't a reason for me to even try as I haven't uploaded anything in 24 hours anyway due to the jam. I hope it gets sorted soon. ID: 867949 ·

Vipin Palazhi Send message Joined: 29 Feb 08 Posts: 286 Credit: 167,386,578 RAC: 0	Message 867950 - Posted: 22 Feb 2009, 6:40:41 UTC One of my systems is out of work. I hadn't connected it for almost a day and a half, and now it is sitting with two days' work which is constantly trying to upload. Switched off the network activity on all the rigs, which have a total of around 500 WUs to upload. Hope the issue gets resolved before Tuesday's weekly outage. ID: 867950 ·

littlegreenmanfrommars Volunteer tester Send message Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0	Message 867955 - Posted: 22 Feb 2009, 7:13:36 UTC If it works, an indication things are returning to normal will be a shortening of your "pending credit" queue. As WU's start to return, they will "match up" with those already returned, so your total credit should begin to rise. The cricket graph (URL below) is a better way to check how much bandwidth is being used at the Berkeley end. Once you see the incoming bits reducing, you can resume network activity. Personally, I do this on one rig at a time, waiting for the queue to clear on one rig before resuming network activity on the next rig. http://fragment1.berkeley.edu/newcricket/mini-graph.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;view=Octets;ranges=d ID: 867955 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13959 Credit: 208,696,464 RAC: 304	Message 867962 - Posted: 22 Feb 2009, 7:59:56 UTC - in response to Message 867851. ... a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves. After having a few more looks at Scarecrow's AP graphs throughout the day, this measure gets my vote. It's all those AP units being downloaded that's clogging up the pipe. Grant Darwin NT ID: 867962 ·

Rob.B Send message Joined: 23 Jul 99 Posts: 157 Credit: 1,439,682 RAC: 0	Message 867963 - Posted: 22 Feb 2009, 8:05:00 UTC I have suspended all network activity on my boxes also, as the advert (UK) says "every little helps". What is strange though, is that although the issue seems to be with SETI, I can return data for other projects but the client refuses to request new work for any project. Same on all boxes. Rob ID: 867963 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 867985 - Posted: 22 Feb 2009, 9:50:22 UTC - in response to Message 867962. ... a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves. After having a few more looks at Scarecrow's AP graphs thoughout the day, this measure gets my vote. It's all those AP units being downloaded that's clogging up the pipe. OK, had a night's sleep and I think I've found the problem - well, the next stage in the chain. Have a look at WU 417685549. Downloaded seven times, mine is the only one which is running - every other copy failed because they couldn't download the executable file. All my recent AP allocations look like that, though this is the most extreme. Eric needs to turn on the 'proxy server' distribution channel used when new MB executables threaten to clog the pipes - or AP distribution needs to be restricted to those who have manually downloaded and installed the new Lunatics r112 optimisation for Astropulse_v5 (plug!). ID: 867985 ·

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 867994 - Posted: 22 Feb 2009, 10:31:46 UTC - in response to Message 867962. ... a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves. After having a few more looks at Scarecrow's AP graphs thoughout the day, this measure gets my vote. It's all those AP units being downloaded that's clogging up the pipe. As usual, Richard's analysis was right on target. It's the Astropulse v5 work erroring out and being resent. To be even more specific, the errors are: <message> app_version download error: couldn't get input files: <file_xfer_error> <file_name>astropulse_5.03_windows_intelx86.exe</file_name> <error_code>-200</error_code> </file_xfer_error> Some example WUs: 417836906, 417836901, and 417768997. There are others with numbers close to those, AP work tends to get runs of contiguous numbers because there are times when the mb_splitter processes are not producing work. I've sent email to staff, hopefully they can either change the URL of the app download to something which works or cut off delivery of Astropulse v5 work. Joe ID: 867994 ·
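For anyone wanting to check their own failed tasks for the same error, the file name and error code can be pulled out of the message quoted above. This is a minimal sketch: the function name is hypothetical, and it uses regular expressions rather than an XML parser because the quoted fragment is not a complete, well-formed document.

```python
import re

# The error text as quoted in the post above.
SAMPLE = (
    "<message> app_version download error: couldn't get input files: "
    "<file_xfer_error> <file_name>astropulse_5.03_windows_intelx86.exe</file_name> "
    "<error_code>-200</error_code> </file_xfer_error>"
)


def parse_file_xfer_error(text):
    """Return (file_name, error_code) from a file_xfer_error fragment, or None if absent."""
    name = re.search(r"<file_name>(.*?)</file_name>", text)
    code = re.search(r"<error_code>(-?\d+)</error_code>", text)
    if not name or not code:
        return None
    return name.group(1).strip(), int(code.group(1))


if __name__ == "__main__":
    print(parse_file_xfer_error(SAMPLE))
```

A task whose output yields this file name points at the failed download of the application executable, not at a problem with the work unit data itself.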

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 868011 - Posted: 22 Feb 2009, 10:58:34 UTC - in response to Message 867994. Last modified: 22 Feb 2009, 10:59:54 UTC I've sent email to staff, hopefully they can either change the URL of the app download to something which works or cut off delivery of Astropulse v5 work. Joe So have I. They have a ready-made solution available: In order to keep our bandwidth from going bonkers due to all the new client downloads, we employ the use of Coral Cache. This is all well and good, except that some ISPs out there firewall http redirects, which means a tiny subset of users cannot download these new clients. This is unfortunate, as we have no choice because we can't handle the new client downloads ourselves. So these few users will suffer a bit until we can remove such caching. (Matt Lebofsky, Dec 17 2008) All they have to do is turn it on. ID: 868011 ·

Hans Kramer Volunteer tester Send message Joined: 16 May 99 Posts: 61 Credit: 8,770,184 RAC: 0	Message 868012 - Posted: 22 Feb 2009, 11:04:31 UTC - in response to Message 867994. @Joe & Richard, True, Astropulse is clogging up the system. But I somewhat disagree with the solutions you propose. The underlying problem is bandwidth: there is just not enough room to supply all the demand, which generates all kinds of problems, including (possibly) the download errors. As I see it, the only real solution for now AND the future is adding bandwidth. From previous posts by Matt c.s. I believe there was some problem at Berkeley to be able to do that. Please correct me if I'm wrong about that. From past experience I know a lot of problems can be solved by one thing, money. If that's the case, I see a challenge for Pete to raise enough greenbacks to make it happen ;-)). Maybe a nice gift for S@H's 10th anniversary? ID: 868012 ·

Jack Zhang Volunteer tester Send message Joined: 2 Jul 06 Posts: 206 Credit: 6,142,449 RAC: 0	Message 868013 - Posted: 22 Feb 2009, 11:18:16 UTC - in response to Message 867963. I have suspended all network activity on my boxes also, as the advert (UK) says "every little helps". What is strange though, is that although the issue seems to be with SETI, I can return data for other projects but the client refuses to request new work for any project. Same on all boxes. Rob As of right now, it's still not letting up; this is an abnormally long amount of time for peak bandwidth (that isn't after an outage). What if Fiction was Fact and Fact was Fiction and vice versa? ID: 868013 ·

©2025 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.