Panic Mode On (12) Server problems

Profile James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 867776 - Posted: 21 Feb 2009, 22:34:58 UTC

My Mac is running stock, but I have never had an AP WU on this machine. My old P4 XP gets lots of AP WUs. I had one of the new AP v5 units last week, and it took 8 days to crunch with the stock opt. I have one now waiting to run that says 175 hours. As long as the new AP running stock stays stable I can live with it, but I'm considering just letting the Mac run SETI and my old PC run MilkyWay full time.

Old James
ID: 867776 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13842
Credit: 208,696,464
RAC: 304
Australia
Message 867778 - Posted: 21 Feb 2009, 22:37:13 UTC - in response to Message 867773.  

I messaged Eric to see if he could kick the Upload server remotely from home

From the looks of things uploads are still happening occasionally, it's just all the download traffic clogging the pipe at the moment. Once the downloads taper off, the uploads will be able to get through.
Grant
Darwin NT
ID: 867778 · Report as offensive
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 867792 - Posted: 21 Feb 2009, 23:06:55 UTC - in response to Message 867778.  

Raistmer just came out with his new V9 package so downloads may go crazy as we try to get the new AP5.03s


PROUD MEMBER OF Team Starfire World BOINC
ID: 867792 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 867804 - Posted: 21 Feb 2009, 23:30:31 UTC - in response to Message 867792.  
Last modified: 21 Feb 2009, 23:31:03 UTC

The server status page shows nothing out of the ordinary, but all my uploads are stuck:
22-2-2009 0:09:07|SETI@home|Backing off 3 hr 45 min 58 sec on upload of 15ja09aa.4494.11524.7.8.142_0_0
22-2-2009 0:09:09||Internet access OK - project servers may be temporarily down.
22-2-2009 0:09:10|SETI@home|Computation for task 14ja09aa.4912.11520.8.8.87_0 finished
22-2-2009 0:09:10|SETI@home|Starting 14ja09aa.4912.11520.8.8.95_0
22-2-2009 0:09:10|SETI@home|Starting task 14ja09aa.4912.11520.8.8.95_0 using setiathome_enhanced version 603
22-2-2009 0:09:13|SETI@home|Started upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:10:12||Project communication failed: attempting access to reference site
22-2-2009 0:10:12|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
22-2-2009 0:10:12|SETI@home|Backing off 1 min 0 sec on upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:10:13||Internet access OK - project servers may be temporarily down.
22-2-2009 0:11:13|SETI@home|Started upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:11:35||Project communication failed: attempting access to reference site
22-2-2009 0:11:35|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
22-2-2009 0:11:35|SETI@home|Backing off 1 min 0 sec on upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:11:36||Internet access OK - project servers may be temporarily down.
22-2-2009 0:11:42|SETI@home|Computation for task 15ja09aa.4494.11524.7.8.136_1 finished
22-2-2009 0:11:42|SETI@home|Starting 14ja09aa.4912.11520.8.8.69_0
22-2-2009 0:11:42|SETI@home|Starting task 14ja09aa.4912.11520.8.8.69_0 using setiathome_enhanced version 603
22-2-2009 0:11:44|SETI@home|Started upload of 15ja09aa.4494.11524.7.8.136_1_0
22-2-2009 0:12:35|SETI@home|Started upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:12:57||Project communication failed: attempting access to reference site
22-2-2009 0:12:57|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
22-2-2009 0:12:57|SETI@home|Backing off 1 min 0 sec on upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:12:58||Internet access OK - project servers may be temporarily down.
22-2-2009 0:12:58|SETI@home|Started upload of 15ja09aa.4494.11524.7.8.152_0_0
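
For anyone who wants to count how often each file has failed, here is a minimal Python sketch. It assumes only what the log above shows: each line is `timestamp|project|message`, and a failed attempt contains the text "Temporarily failed upload of". The function name `count_failed_uploads` and the sample text are illustrative, not part of BOINC.

```python
from collections import Counter

# A few lines copied from the log above, used as sample input.
SAMPLE_LOG = """\
22-2-2009 0:10:12||Project communication failed: attempting access to reference site
22-2-2009 0:10:12|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
22-2-2009 0:10:12|SETI@home|Backing off 1 min 0 sec on upload of 14ja09aa.4912.11520.8.8.87_0_0
22-2-2009 0:11:35|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
22-2-2009 0:12:57|SETI@home|Temporarily failed upload of 14ja09aa.4912.11520.8.8.87_0_0: connect() failed
"""

FAILED_MARKER = "Temporarily failed upload of "


def count_failed_uploads(log_text):
    """Return a Counter mapping file name -> number of failed upload attempts."""
    failures = Counter()
    for line in log_text.splitlines():
        # Each line is "timestamp|project|message"; the project field may be empty.
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue
        message = parts[2]
        start = message.find(FAILED_MARKER)
        if start == -1:
            continue
        # The file name sits between the marker and the trailing ": reason".
        rest = message[start + len(FAILED_MARKER):]
        file_name = rest.rsplit(": ", 1)[0]
        failures[file_name] += 1
    return failures


if __name__ == "__main__":
    for name, count in count_failed_uploads(SAMPLE_LOG).items():
        print(name, count)
```

Searching for the marker anywhere in the message, rather than at its start, also matches lines that carry a `[file_xfer]` tag in front of the text.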

Maybe it will be solved in a few hours, I hope ;)
ID: 867804 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14678
Credit: 200,643,578
RAC: 874
United Kingdom
Message 867851 - Posted: 22 Feb 2009, 1:56:50 UTC - in response to Message 867773.  

I messaged Eric to see if he could kick the Upload server remotely from home

Pete,

It's not the upload server he needs to kick.

The communications channel has been saturated with downloads for 20 solid hours now. Sometimes this is because of MB 'shorties' (VHAR), but I've seen no sign of that in my own downloads.

Instead, I've been getting a larger allocation than recently of AP_v5, and the ones I'm crunching seem to be running at normal speed.

But the server status page is showing AP results being returned at well over 3,000 per hour, yet the number awaiting validation - with the validator disabled - is rising at only 40 per hour. The average turnround time for AP tasks will soon fall below 10 hours, which is ludicrous.

I suspect that a significant number of hosts are trashing every AP task they receive, and coming back for more. If every AP task returned is replaced by one new download, then 3,000 tasks per hour - almost 1 every second - requires 8 megabytes of download every second, or 64 megabits per second of pure data. Add check bits, routing data, communications protocol overhead and so on, and the 95 Mbit/s effective throughput is easily reached.
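
The arithmetic above can be checked with a short Python sketch. The 8 MB task size is taken from the figures in this post (one task per second needing 8 megabytes per second), and the function name `download_mbit_per_sec` is only illustrative. Protocol overhead is left out.

```python
TASKS_PER_HOUR = 3000   # AP results returned per hour, per the server status page
MB_PER_TASK = 8         # task size in megabytes, implied by the figures in the post
BITS_PER_BYTE = 8


def download_mbit_per_sec(tasks_per_hour, mb_per_task):
    """Payload bandwidth in megabits per second, ignoring protocol overhead."""
    tasks_per_sec = tasks_per_hour / 3600
    return tasks_per_sec * mb_per_task * BITS_PER_BYTE


exact = download_mbit_per_sec(TASKS_PER_HOUR, MB_PER_TASK)
rounded = download_mbit_per_sec(3600, MB_PER_TASK)  # "almost 1 every second"

print(f"{exact:.1f} Mbit/s at 3,000 tasks per hour")
print(f"{rounded:.1f} Mbit/s at one task per second")
```

Without rounding up to one task per second, the payload comes to roughly 53 Mbit/s, which is still more than half of the 95 Mbit/s pipe before any overhead or other traffic is counted.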

In order to protect the servers from overload, and preserve the integrity of the science database, Eric should be putting a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves.
ID: 867851 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 867853 - Posted: 22 Feb 2009, 1:58:55 UTC - in response to Message 867758.  

I cannot download new WU's because my completed WUs don't upload thus it doesn't request more work.

I've changed my cache size but it still just requests 0 seconds of work.

One of the safeties built in is a limit on 2*ncpus uploads before work fetch is halted to that project.
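
As a rough illustration of that safety, here is a minimal Python sketch. It is not the BOINC client's actual code: the function name `work_fetch_allowed` is hypothetical, and whether the limit is hit at exactly 2*ncpus or one above it is an assumption.

```python
def work_fetch_allowed(pending_uploads, ncpus):
    """Hypothetical check: fetch work only while pending uploads stay under 2 * ncpus."""
    return pending_uploads < 2 * ncpus


# An 8-core host would stop asking for work at 16 stuck uploads,
# a quad-core host at 8 (assuming the limit is inclusive).
for ncpus in (2, 4, 8):
    print(ncpus, "cores: limit", 2 * ncpus, "uploads")
```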


BOINC WIKI
ID: 867853 · Report as offensive
Profile Westsail and *Pyxey*
Volunteer tester
Joined: 26 Jul 99
Posts: 338
Credit: 20,544,999
RAC: 0
United States
Message 867866 - Posted: 22 Feb 2009, 2:34:29 UTC



"The most exciting phrase to hear in science, the one that heralds new discoveries, is not Eureka! (I found it!) but rather, 'hmm... that's funny...'" -- Isaac Asimov
ID: 867866 · Report as offensive
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 867872 - Posted: 22 Feb 2009, 2:50:10 UTC

Trying to do what little bit I can to help. I've suspended my network activity for the night at the very least. I will try again in the morning and if it's no better I'll suspend it again.


PROUD MEMBER OF Team Starfire World BOINC
ID: 867872 · Report as offensive
Profile nutcase
Volunteer tester
Joined: 13 Jun 05
Posts: 19
Credit: 6,589,801
RAC: 0
United States
Message 867937 - Posted: 22 Feb 2009, 5:14:51 UTC - in response to Message 867853.  
Last modified: 22 Feb 2009, 5:16:29 UTC

I cannot download new WU's because my completed WUs don't upload thus it doesn't request more work.

I've changed my cache size but it still just requests 0 seconds of work.

One of the safeties built in is a limit on 2*ncpus uploads before work fetch is halted to that project.


well, this one is affecting me badly as it is affecting other projects also.

My 8-core system refuses to get new work from ANY PROJECT!

this is not affecting my quad or dual core systems though.


so, basically right now I have 16 cores idle doing nothing because BOINC will not get work from any project I attach to.
ID: 867937 · Report as offensive
Profile Jack Zhang
Volunteer tester
Joined: 2 Jul 06
Posts: 206
Credit: 6,142,449
RAC: 0
Canada
Message 867942 - Posted: 22 Feb 2009, 5:45:50 UTC

The net load is at MAX for the past few hours... Too many WU upload connections or an attack?


What if Fiction was Fact and Fact was Fiction and vice versa?
ID: 867942 · Report as offensive
Profile littlegreenmanfrommars
Volunteer tester
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 867943 - Posted: 22 Feb 2009, 5:48:42 UTC - in response to Message 867937.  

I cannot download new WU's because my completed WUs don't upload thus it doesn't request more work.

I've changed my cache size but it still just requests 0 seconds of work.

One of the safeties built in is a limit on 2*ncpus uploads before work fetch is halted to that project.


well, this one is affecting me badly as it is affecting other projects also.

My 8-core system refuses to get new work from ANY PROJECT!

this is not affecting my quad or dual core systems though.


so, basically right now I have 16 cores idle doing nothing because BOINC will not get work from any project I attach to.


I would suggest this is a problem with your 8-core system, as the others are working OK. Problems with SETI@home should not affect other projects, except to allow your machine to crunch extra work from them while S@h sorts its life out. Once S@h is working correctly, you should find it has accrued a "debt" from the other projects, so it will catch up with the work it "missed" during the hiccup/outage.

I have managed to upload ONE WU, and am still happily downloading new work. One rig has 8 completed WU's in the upload queue, a second rig has three. Both are downloading with no issues.
ID: 867943 · Report as offensive
Profile littlegreenmanfrommars
Volunteer tester
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 867946 - Posted: 22 Feb 2009, 5:56:52 UTC - in response to Message 867942.  
Last modified: 22 Feb 2009, 5:58:38 UTC

The net load is at MAX for the past few hours... Too many WU upload connections or an attack?




Every completed WU causes BOINC to contact S@h at regular intervals, trying to upload. (Look under the "Transfers" tab.) It follows that the more completed WUs a given machine has in its upload queue, the more attempts it will make to contact the S@h servers in a given period of time.

Although each packet sent for these attempts is relatively small, there will be several hundred thousand machines, each with a growing list of completed WU's in their queues. This will generate a lot of network traffic.

Best advice in such a situation is to turn off network activity for a while.
(BOINC tool menu > select Network activity suspended).

If all crunchers suspend network activity overnight, this should reduce contacts by approximately 33% at any given time, taking some of the pressure off. As soon as the log jam starts to move, it should clear up pretty quickly. Previous experience says about 4 to 6 hours.
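
The 33% figure can be reproduced with a small Python sketch under two assumptions that are not stated in the post: "overnight" means about 8 hours, and crunchers are spread evenly around the world's time zones. The function name `fraction_suspended` is illustrative.

```python
def fraction_suspended(hours_suspended_per_day):
    """Share of hosts offline at any moment if suspensions are spread evenly around the clock."""
    return hours_suspended_per_day / 24


# Assumed 8-hour night: one third of hosts are silent at any given time.
print(f"{fraction_suspended(8):.0%}")
```

If the real spread of hosts is lopsided across time zones, the reduction would vary through the day rather than staying at a steady third.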
ID: 867946 · Report as offensive
Tribble

Joined: 21 Feb 02
Posts: 65
Credit: 7,978,002
RAC: 0
Australia
Message 867949 - Posted: 22 Feb 2009, 6:40:38 UTC - in response to Message 867946.  


If all crunchers suspend network activity overnight, this should reduce contacts by approximately 33% at any given time, taking some of the pressure off. As soon as the log jam starts to move, it should clear up pretty quickly. Previous experience says about 4 to 6 hours.



I've suspended network activity as you suggested as there isn't a reason for me to even try as I haven't uploaded anything in 24 hours anyway due to the jam.

I hope it gets sorted soon.
ID: 867949 · Report as offensive
Profile Vipin Palazhi
Joined: 29 Feb 08
Posts: 286
Credit: 167,386,578
RAC: 0
India
Message 867950 - Posted: 22 Feb 2009, 6:40:41 UTC

One of my systems is out of work. I hadn't connected it for almost a day and a half, and now it is sitting with two days' work that is constantly trying to upload. I have switched off network activity on all the rigs, which have a total of around 500 WUs to upload.

I hope the issue gets resolved before Tuesday's weekly outage.


ID: 867950 · Report as offensive
Profile littlegreenmanfrommars
Volunteer tester
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 867955 - Posted: 22 Feb 2009, 7:13:36 UTC

If it works, an indication things are returning to normal will be a shortening of your "pending credit" queue. As WU's start to return, they will "match up" with those already returned, so your total credit should begin to rise.

The cricket graph (URL below) is a better way to check how much bandwidth is being used at the Berkeley end. Once you see the incoming bits reducing, you can resume network activity.

Personally, I do this on one rig at a time, waiting for the queue to clear on one rig before resuming network activity on the next rig.

http://fragment1.berkeley.edu/newcricket/mini-graph.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;view=Octets;ranges=d
ID: 867955 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 66286
Credit: 55,293,173
RAC: 49
United States
Message 867958 - Posted: 22 Feb 2009, 7:39:14 UTC

I'm just getting HTTP errors when trying to upload, so I too hope this can be fixed before Tuesday. I've suspended network access to SETI until then.

2/21/2009 11:36:20 PM|SETI@home|[file_xfer] Started upload of file 17ja09aa.6195.388982.6.8.109_1_0
2/21/2009 11:36:20 PM|SETI@home|[file_xfer] Started upload of file 17ja09aa.19076.5385.14.8.179_1_0
2/21/2009 11:36:42 PM||Project communication failed: attempting access to reference site
2/21/2009 11:36:42 PM|SETI@home|[file_xfer] Temporarily failed upload of 17ja09aa.6195.388982.6.8.109_1_0: HTTP error
2/21/2009 11:36:42 PM|SETI@home|Backing off 2 hr 14 min 31 sec on upload of file 17ja09aa.6195.388982.6.8.109_1_0
2/21/2009 11:36:42 PM|SETI@home|[file_xfer] Temporarily failed upload of 17ja09aa.19076.5385.14.8.179_1_0: HTTP error
2/21/2009 11:36:42 PM|SETI@home|Backing off 3 hr 25 min 51 sec on upload of file 17ja09aa.19076.5385.14.8.179_1_0
2/21/2009 11:36:43 PM||Access to reference site succeeded - project servers may be temporarily down.
2/21/2009 11:38:00 PM||Suspending network activity - user request
Savoir-Faire is everywhere!
The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST

ID: 867958 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13842
Credit: 208,696,464
RAC: 304
Australia
Message 867962 - Posted: 22 Feb 2009, 7:59:56 UTC - in response to Message 867851.  

... a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves.

After having a few more looks at Scarecrow's AP graphs throughout the day, this measure gets my vote. It's all those AP units being downloaded that's clogging up the pipe.
Grant
Darwin NT
ID: 867962 · Report as offensive
Rob.B

Joined: 23 Jul 99
Posts: 157
Credit: 1,439,682
RAC: 0
United Kingdom
Message 867963 - Posted: 22 Feb 2009, 8:05:00 UTC

I have suspended all network activity on my boxes also; as the advert (UK) says, "every little helps".

What is strange though, is that although the issue seems to be with SETI, I can return data for other projects but the client refuses to request new work for any project. Same on all boxes.

Rob
ID: 867963 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14678
Credit: 200,643,578
RAC: 874
United Kingdom
Message 867985 - Posted: 22 Feb 2009, 9:50:22 UTC - in response to Message 867962.  

... a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves.

After having a few more looks at Scarecrow's AP graphs throughout the day, this measure gets my vote. It's all those AP units being downloaded that's clogging up the pipe.

OK, had a night's sleep and I think I've found the problem - well, the next stage in the chain.

Have a look at WU 417685549. Downloaded seven times, mine is the only one which is running - every other copy failed because they couldn't download the executable file. All my recent AP allocations look like that, though this is the most extreme.

Eric needs to turn on the 'proxy server' distribution channel used when new MB executables threaten to clog the pipes - or AP distribution needs to be restricted to those who have manually downloaded and installed the new Lunatics r112 optimisation for Astropulse_v5 (plug!).
ID: 867985 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 867994 - Posted: 22 Feb 2009, 10:31:46 UTC - in response to Message 867962.  

... a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves.

After having a few more looks at Scarecrow's AP graphs throughout the day, this measure gets my vote. It's all those AP units being downloaded that's clogging up the pipe.

As usual, Richard's analysis was right on target. It's the Astropulse v5 work erroring out and being resent. To be even more specific, the errors are:

<message>
app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>astropulse_5.03_windows_intelx86.exe</file_name>
<error_code>-200</error_code>
</file_xfer_error>


Some example WUs: 417836906, 417836901, and 417768997. There are others with numbers close to those; AP work tends to get runs of contiguous numbers because there are times when the mb_splitter processes are not producing work.
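
To check many task pages for the same failure, the file name and error code can be pulled out of the error text with a short Python sketch. It uses a regular expression rather than an XML parser because the quoted block is a fragment, not a complete document. The function name `extract_xfer_errors` is illustrative, and the sample text is the error block quoted above.

```python
import re

# The error block quoted above, used as sample input.
ERROR_TEXT = """\
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>astropulse_5.03_windows_intelx86.exe</file_name>
<error_code>-200</error_code>
</file_xfer_error>
"""

XFER_ERROR = re.compile(
    r"<file_name>(.*?)</file_name>\s*<error_code>(-?\d+)</error_code>",
    re.DOTALL,
)


def extract_xfer_errors(text):
    """Return (file_name, error_code) pairs found in a task's error output."""
    return [(name, int(code)) for name, code in XFER_ERROR.findall(text)]


if __name__ == "__main__":
    for name, code in extract_xfer_errors(ERROR_TEXT):
        print(name, code)
```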

I've sent email to staff, hopefully they can either change the URL of the app download to something which works or cut off delivery of Astropulse v5 work.
Joe
ID: 867994 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.