Panic Mode On (12) Server problems

James Sotherden Send message Joined: 16 May 99 Posts: 10436 Credit: 110,373,059 RAC: 54	Message 867776 - Posted: 21 Feb 2009, 22:34:58 UTC My Mac is running stock, but I have never had an AP WU on this machine. My old P4 XP gets lots of AP WUs. I had one of the new AP v5 last week; it took 8 days to crunch with the stock opt. I have one now waiting to run that says 175 hours. As long as the new AP running stock stays stable I can live with it, but I'm considering just letting the Mac run SETI and my old PC run MilkyWay full time. Old James ID: 867776 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 867778 - Posted: 21 Feb 2009, 22:37:13 UTC - in response to Message 867773. "I messaged Eric to see if he could kick the Upload server remotely from home" From the looks of things uploads are still happening occasionally; it's just all the download traffic clogging the pipe at the moment. Once the downloads taper off, the uploads will be able to get through. Grant Darwin NT ID: 867778 ·

perryjay Volunteer tester Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0	Message 867792 - Posted: 21 Feb 2009, 23:06:55 UTC - in response to Message 867778. Raistmer just came out with his new V9 package so downloads may go crazy as we try to get the new AP5.03s PROUD MEMBER OF Team Starfire World BOINC ID: 867792 ·

Fred J. Verster Volunteer tester Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0	Message 867804 - Posted: 21 Feb 2009, 23:30:31 UTC - in response to Message 867792. Last modified: 21 Feb 2009, 23:31:03 UTC Serverpage shows nothing out of the ordinary, but all my uploads are stuck:
22-2-2009 0:09:07|SETI@home|Backing off 3 hr 45 min 58 sec on upload of 15ja09aa.4494.11524.7.8.142_0_0
22-2-2009 0:09:09||Internet access OK - project servers may be temporarily down.
22-2-2009 0:09:10|SETI@home|Computation for task 14ja09aa.4912.11520.8.8.87_0 finished
22-2-2009 0:09:10|SETI@home|Starting 14ja09aa.4912.11520.8.8.95_0
22-2-2009 0:09:10|SETI@home|Starting task 14ja09aa.4912.11520.8.8.95_0 using setiathome_enhanced version 603
22-2-2009 0:09:13|SETI@home|Started upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:10:12||Project communication failed: attempting access to reference site
22-2-2009 0:10:12|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
22-2-2009 0:10:12|SETI@home|Backing off 1 min 0 sec on upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:10:13||Internet access OK - project servers may be temporarily down.
22-2-2009 0:11:13|SETI@home|Started upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:11:35||Project communication failed: attempting access to reference site
22-2-2009 0:11:35|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
22-2-2009 0:11:35|SETI@home|Backing off 1 min 0 sec on upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:11:36||Internet access OK - project servers may be temporarily down.
22-2-2009 0:11:42|SETI@home|Computation for task 15ja09aa.4494.11524.7.8.136_1 finished
22-2-2009 0:11:42|SETI@home|Starting 14ja09aa.4912.11520.8.8.69_0
22-2-2009 0:11:42|SETI@home|Starting task 14ja09aa.4912.11520.8.8.69_0 using setiathome_enhanced version 603
22-2-2009 0:11:44|SETI@home|Started upload of 15ja09aa.4494.11524.7.8.136_1_0
22-2-2009 0:12:35|SETI@home|Started upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:12:57||Project communication failed: attempting access to reference site
22-2-2009 0:12:57|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
22-2-2009 0:12:57|SETI@home|Backing off 1 min 0 sec on upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:12:58||Internet access OK - project servers may be temporarily down.
22-2-2009 0:12:58|SETI@home|Started upload of 15ja09aa.4494.11524.7.8.152_0_0
Maybe it's solved in a few hours, I hope ;) ID: 867804 ·
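For anyone who wants to see how long their client is waiting between upload attempts, the "Backing off" lines in a log like the one above can be converted to seconds with a few lines of Python. This is only a rough sketch and not part of BOINC: the function name `backoff_seconds` is made up for illustration, and the message wording is assumed to match the lines quoted in this thread (including the optional word "file" that some client versions print).

```python
import re

# Matches wording such as "Backing off 3 hr 45 min 58 sec on upload of <name>".
# Treating hours, minutes and seconds as individually optional is an assumption.
BACKOFF_RE = re.compile(
    r"Backing off(?: (\d+) hr)?(?: (\d+) min)?(?: (\d+) sec)?"
    r" on upload of (?:file )?(\S+)"
)

def backoff_seconds(line):
    """Return (file_name, delay_in_seconds) for a back-off line, else None."""
    match = BACKOFF_RE.search(line)
    if match is None:
        return None
    hours, minutes, seconds, file_name = match.groups()
    total = int(hours or 0) * 3600 + int(minutes or 0) * 60 + int(seconds or 0)
    return file_name, total
```

Feeding each log line through `backoff_seconds` and skipping the `None` results gives a list of per-file delays, which makes it easy to spot files that have been pushed out for hours.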

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 867851 - Posted: 22 Feb 2009, 1:56:50 UTC - in response to Message 867773. "I messaged Eric to see if he could kick the Upload server remotely from home" Pete, it's not the upload server he needs to kick. The communications channel has been saturated with downloads for 20 solid hours now. Sometimes this is because of MB 'shorties' (VHAR), but I've seen no sign of that in my own downloads. Instead, I've been getting a larger allocation than recently of AP_v5, and the ones I'm crunching seem to be running at normal speed. But the server status page is showing AP results being returned at well over 3,000 per hour, yet the number awaiting validation - with the validator disabled - is rising at only 40 per hour. The average turnround time for AP tasks will soon fall below 10 hours, which is ludicrous. I suspect that a significant number of hosts are trashing every AP task they receive, and coming back for more. If every AP task returned is replaced by one new download, then 3,000 tasks per hour - almost 1 every second - requires 8 megabytes of download every second, or 64 megabits per second of pure data. Add check bits, routing data, communications protocol overhead and so on, and the 95 Mbit/s effective throughput is easily reached. In order to protect the servers from overload, and preserve the integrity of the science database, Eric should be putting a temporary stop to AP downloads until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves. ID: 867851 ·
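The bandwidth estimate in the post above can be checked with a short calculation. This is a minimal sketch using only the figures quoted there; the 8 MB task size is inferred from "8 megabytes of download every second" at roughly one task per second, and the names `payload_mbit_per_s`, `TASK_SIZE_MB` and `LINK_MBIT_PER_S` are illustrative.

```python
# Figures taken from the post; the per-task size is inferred, not measured.
TASK_SIZE_MB = 8          # megabytes per Astropulse download (assumed)
LINK_MBIT_PER_S = 95      # "effective throughput" quoted in the post

def payload_mbit_per_s(tasks_per_hour, task_size_mb=TASK_SIZE_MB):
    """Raw data rate in megabits per second, before any protocol overhead."""
    tasks_per_second = tasks_per_hour / 3600
    return tasks_per_second * task_size_mb * 8  # 8 bits per byte

for rate in (3000, 3600):
    mbit = payload_mbit_per_s(rate)
    share = mbit / LINK_MBIT_PER_S
    print(f"{rate} tasks/h -> {mbit:.1f} Mbit/s ({share:.0%} of the link)")
```

At exactly 3,000 tasks per hour the payload alone comes to about 53 Mbit/s; the 64 Mbit/s figure corresponds to rounding up to one task per second (3,600 per hour). Either way, downloads alone take up well over half of the quoted 95 Mbit/s before overhead and any other traffic are counted.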

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 867853 - Posted: 22 Feb 2009, 1:58:55 UTC - in response to Message 867758. "I cannot download new WU's because my completed WUs don't upload thus it doesn't request more work. I've changed my cache size but it still just requests 0 seconds of work." One of the safeties built in is a limit on 2*ncpus uploads before work fetch is halted to that project. BOINC WIKI ID: 867853 ·
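The safety described above can be pictured as a simple threshold check. The following is an illustrative sketch only, not the BOINC client's actual code: the function name `work_fetch_allowed` is hypothetical, and whether the real client uses a strict or inclusive comparison is not stated in this thread.

```python
def work_fetch_allowed(pending_uploads, ncpus):
    """Return True while the upload backlog is within the 2 * ncpus limit.

    Assumption for illustration: work fetch stops only once the backlog
    exceeds 2 * ncpus (the thread does not say whether the limit itself
    is allowed).
    """
    return pending_uploads <= 2 * ncpus

# Example: an 8-core host with a growing backlog of stuck uploads.
for backlog in (10, 16, 17):
    print(backlog, work_fetch_allowed(backlog, ncpus=8))
```

Under this reading, a host stops asking for work as soon as enough finished results pile up behind a stalled upload server, which would explain a "requests 0 seconds of work" message even after the cache size is changed.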

Westsail and Pyxey Volunteer tester Send message Joined: 26 Jul 99 Posts: 338 Credit: 20,544,999 RAC: 0	Message 867866 - Posted: 22 Feb 2009, 2:34:29 UTC "The most exciting phrase to hear in science, the one that heralds new discoveries, is not Eureka! (I found it!) but rather, 'hmm... that's funny...'" -- Isaac Asimov ID: 867866 ·

perryjay Volunteer tester Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0	Message 867872 - Posted: 22 Feb 2009, 2:50:10 UTC Trying to do what little bit I can to help. I've suspended my network activity for the night at the very least. I will try again in the morning and if it's no better I'll suspend it again. PROUD MEMBER OF Team Starfire World BOINC ID: 867872 ·

nutcase Volunteer tester Send message Joined: 13 Jun 05 Posts: 19 Credit: 6,589,801 RAC: 0	Message 867937 - Posted: 22 Feb 2009, 5:14:51 UTC - in response to Message 867853. Last modified: 22 Feb 2009, 5:16:29 UTC "I cannot download new WU's because my completed WUs don't upload thus it doesn't request more work. I've changed my cache size but it still just requests 0 seconds of work." "One of the safeties built in is a limit on 2*ncpus uploads before work fetch is halted to that project." Well, this one is affecting me badly, as it is affecting other projects also. My 8 core systems refuse to get new work from ANY PROJECT! This is not affecting my quad or dual core systems though. So, basically, right now I have 16 cores idle doing nothing because BOINC will not get work from any project I attach to. ID: 867937 ·

Jack Zhang Volunteer tester Send message Joined: 2 Jul 06 Posts: 206 Credit: 6,142,449 RAC: 0	Message 867942 - Posted: 22 Feb 2009, 5:45:50 UTC The net load is at MAX for the past few hours... Too many WU upload connections or an attack? What if Fiction was Fact and Fact was Fiction and vice versa? ID: 867942 ·

littlegreenmanfrommars Volunteer tester Send message Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0	Message 867943 - Posted: 22 Feb 2009, 5:48:42 UTC - in response to Message 867937. "I cannot download new WU's because my completed WUs don't upload thus it doesn't request more work. I've changed my cache size but it still just requests 0 seconds of work." "One of the safeties built in is a limit on 2*ncpus uploads before work fetch is halted to that project." "well, this one is affecting me badly as it is affecting other projects also. My 8 core system refuse to get new work from ANY PROJECT! this is not affecting my quad or dual core systems though. so, basically right now I have 16 cores idle doing nothing because BOINC will not get work from any project I attach to." I would suggest this is a problem with your 8 core system, as the others are working OK. Problems with SETI@home should not affect other projects, except to allow your machine to crunch extra work from them while S@h sorts its life out. Once S@h is working correctly, you should find it has accrued a "debt" from the other projects, so it will catch up with the work it "missed" during the hiccup/outage. I have managed to upload ONE WU, and am still happily downloading new work. One rig has 8 completed WU's in the upload queue, a second rig has three. Both are downloading with no issues. ID: 867943 ·

littlegreenmanfrommars Volunteer tester Send message Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0	Message 867946 - Posted: 22 Feb 2009, 5:56:52 UTC - in response to Message 867942. Last modified: 22 Feb 2009, 5:58:38 UTC "The net load is at MAX for the past few hours... Too many WU upload connections or an attack?" Every completed WU causes BOINC to contact S@h at regular intervals, trying to upload. (Look under the "Transfers" tab.) It follows that the more completed WU's a given machine has in its upload queue, the more attempts it will make to contact the S@h servers in a given period of time. Although each packet sent for these attempts is relatively small, there will be several hundred thousand machines, each with a growing list of completed WU's in their queues. This will generate a lot of network traffic. Best advice in such a situation is to turn off network activity for a while. (BOINC tool menu > select Network activity suspended.) If all crunchers suspend network activity overnight, this should reduce contacts by approximately 33% at any given time, taking some of the pressure off. As soon as the log jam starts to move, it should clear up pretty quickly. Previous experience says about 4 to 6 hours. ID: 867946 ·
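The "approximately 33%" figure above is consistent with a simple model, sketched here under loudly stated assumptions that are not in the post: "overnight" is taken to mean an 8-hour window starting at 22:00 local time, and crunchers are assumed to be spread evenly over 24 time zones. The function name is illustrative.

```python
def suspended_fraction_by_hour(night_start=22, night_hours=8, zones=24):
    """Fraction of hosts with networking suspended, for each UTC hour.

    Assumes one equal-sized group of hosts per whole-hour time zone, each
    suspending for `night_hours` hours starting at `night_start` local time.
    """
    fractions = []
    for utc_hour in range(24):
        suspended = 0
        for offset in range(zones):
            local_hour = (utc_hour + offset) % 24
            # Hours elapsed since the local start of the night window.
            if (local_hour - night_start) % 24 < night_hours:
                suspended += 1
        fractions.append(suspended / zones)
    return fractions

print(suspended_fraction_by_hour())
```

With these assumptions, 8 of the 24 groups are suspended at every hour of the day, i.e. one third of hosts. Real crunchers are not spread evenly around the globe, so the actual relief would vary over the day.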

Tribble Send message Joined: 21 Feb 02 Posts: 65 Credit: 7,978,002 RAC: 0	Message 867949 - Posted: 22 Feb 2009, 6:40:38 UTC - in response to Message 867946. "If all crunchers suspend network activity overnight, this should reduce contacts by approximately 33% at any given time, taking some of the pressure off. As soon as the log jam starts to move, it should clear up pretty quickly. Previous experience says about 4 to 6 hours." I've suspended network activity as you suggested, as there isn't a reason for me to even try: I haven't uploaded anything in 24 hours anyway due to the jam. I hope it gets sorted soon. ID: 867949 ·

Vipin Palazhi Send message Joined: 29 Feb 08 Posts: 286 Credit: 167,386,578 RAC: 0	Message 867950 - Posted: 22 Feb 2009, 6:40:41 UTC One of my systems is out of work. I hadn't connected it for almost a day and a half, and now it is sitting with two days' work which is constantly trying to upload. Switched off the network activity on all the rigs, which have a total of around 500 WUs to upload. Hope the issue gets resolved before Tuesday's weekly outage. ID: 867950 ·

littlegreenmanfrommars Volunteer tester Send message Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0	Message 867955 - Posted: 22 Feb 2009, 7:13:36 UTC If it works, an indication things are returning to normal will be a shortening of your "pending credit" queue. As WU's start to return, they will "match up" with those already returned, so your total credit should begin to rise. The cricket graph (URL below) is a better way to check how much bandwidth is being used at the Berkeley end. Once you see the incoming bits reducing, you can resume network activity. Personally, I do this on one rig at a time, waiting for the queue to clear on one rig before resuming network activity on the next rig. http://fragment1.berkeley.edu/newcricket/mini-graph.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;view=Octets;ranges=d ID: 867955 ·

zoom3+1=4 Volunteer tester Send message Joined: 30 Nov 03 Posts: 65749 Credit: 55,293,173 RAC: 49	Message 867958 - Posted: 22 Feb 2009, 7:39:14 UTC I'm just getting HTTP errors on trying to upload, so I too hope this can be fixed before Tuesday. I've suspended network access to SETI until then.
2/21/2009 11:36:20 PM|SETI@home|[file_xfer] Started upload of file 17ja09aa.6195.388982.6.8.109_1_0
2/21/2009 11:36:20 PM|SETI@home|[file_xfer] Started upload of file 17ja09aa.19076.5385.14.8.179_1_0
2/21/2009 11:36:42 PM||Project communication failed: attempting access to reference site
2/21/2009 11:36:42 PM|SETI@home|[file_xfer] Temporarily failed upload of 17ja09aa.6195.388982.6.8.109_1_0: HTTP error
2/21/2009 11:36:42 PM|SETI@home|Backing off 2 hr 14 min 31 sec on upload of file 17ja09aa.6195.388982.6.8.109_1_0
2/21/2009 11:36:42 PM|SETI@home|[file_xfer] Temporarily failed upload of 17ja09aa.19076.5385.14.8.179_1_0: HTTP error
2/21/2009 11:36:42 PM|SETI@home|Backing off 3 hr 25 min 51 sec on upload of file 17ja09aa.19076.5385.14.8.179_1_0
2/21/2009 11:36:43 PM||Access to reference site succeeded - project servers may be temporarily down.
2/21/2009 11:38:00 PM||Suspending network activity - user request
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's ID: 867958 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 867962 - Posted: 22 Feb 2009, 7:59:56 UTC - in response to Message 867851. "... a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves." After having a few more looks at Scarecrow's AP graphs throughout the day, this measure gets my vote. It's all those AP units being downloaded that's clogging up the pipe. Grant Darwin NT ID: 867962 ·

Rob.B Send message Joined: 23 Jul 99 Posts: 157 Credit: 1,439,682 RAC: 0	Message 867963 - Posted: 22 Feb 2009, 8:05:00 UTC I have suspended all network activity on my boxes also; as the advert (UK) says, "every little helps". What is strange, though, is that although the issue seems to be with SETI, I can return data for other projects but the client refuses to request new work for any project. Same on all boxes. Rob ID: 867963 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 867985 - Posted: 22 Feb 2009, 9:50:22 UTC - in response to Message 867962. "... a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves." "After having a few more looks at Scarecrow's AP graphs throughout the day, this measure gets my vote. It's all those AP units being downloaded that's clogging up the pipe." OK, had a night's sleep and I think I've found the problem - well, the next stage in the chain. Have a look at WU 417685549. Downloaded seven times, mine is the only one which is running - every other copy failed because they couldn't download the executable file. All my recent AP allocations look like that, though this is the most extreme. Eric needs to turn on the 'proxy server' distribution channel used when new MB executables threaten to clog the pipes - or AP distribution needs to be restricted to those who have manually downloaded and installed the new Lunatics r112 optimisation for Astropulse_v5 (plug!). ID: 867985 ·

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 867994 - Posted: 22 Feb 2009, 10:31:46 UTC - in response to Message 867962. "... a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves." "After having a few more looks at Scarecrow's AP graphs throughout the day, this measure gets my vote. It's all those AP units being downloaded that's clogging up the pipe." As usual, Richard's analysis was right on target. It's the Astropulse v5 work erroring out and being resent. To be even more specific, the errors are:
<message> app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>astropulse_5.03_windows_intelx86.exe</file_name>
<error_code>-200</error_code>
</file_xfer_error>
Some example WUs: 417836906, 417836901, and 417768997. There are others with numbers close to those; AP work tends to get runs of contiguous numbers because there are times when the mb_splitter processes are not producing work. I've sent email to staff; hopefully they can either change the URL of the app download to something which works or cut off delivery of Astropulse v5 work. Joe ID: 867994 ·
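To check many results for the same failure, the file name and error code can be pulled out of error text like the fragment quoted above. This is a hedged sketch using only the Python standard library; the function name `transfer_errors` is hypothetical, and the sample text is simply the fragment from this thread. No meaning is assigned to the error code here.

```python
import re
import xml.etree.ElementTree as ET

# The quoted <message> block is not closed in the post, so only the
# self-contained <file_xfer_error> elements are extracted and parsed.
FILE_XFER_RE = re.compile(r"<file_xfer_error>.*?</file_xfer_error>", re.DOTALL)

def transfer_errors(error_text):
    """Return a list of (file_name, error_code) pairs found in the text."""
    errors = []
    for fragment in FILE_XFER_RE.findall(error_text):
        node = ET.fromstring(fragment)
        name = node.findtext("file_name", default="").strip()
        code = int(node.findtext("error_code", default="0"))
        errors.append((name, code))
    return errors

sample = """<message> app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>astropulse_5.03_windows_intelx86.exe</file_name>
<error_code>-200</error_code>
</file_xfer_error>"""

print(transfer_errors(sample))
```

Counting how often the same file name appears across failed results would show whether the resends all trace back to the one executable download, as the post suggests.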

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.