Panic Mode On (20) Server problems

Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 917517 - Posted: 14 Jul 2009, 4:48:13 UTC

Same story as last week. AP becomes available, bandwidth goes to maximum and my uploads and downloads are borked again. When AP work runs out then perhaps one day later I can finally get through. By then all the AP work is gone.

Yea...........this is working really well!
Boinc....Boinc....Boinc....Boinc....
ID: 917517 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 917520 - Posted: 14 Jul 2009, 4:59:18 UTC
Last modified: 14 Jul 2009, 5:37:23 UTC

I normally carry a 4 day cache. I have set all my computers to No New Work. Next Monday (July 20) I am shutting down my computers even if they still have completed work to upload. Then, color me gone!

[edit]On Vacation to Utah!!
Boinc....Boinc....Boinc....Boinc....
ID: 917520 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 917529 - Posted: 14 Jul 2009, 5:27:54 UTC

I have looked at everything I can, and I have a bit of information coming from SETI staff. Given that MB, AP and CUDA used to all share the same upload/download link and still recover, something else must be wrong.

An email sent to SETI staff noted that at one point the 100 Megabit link was full duplex, meaning uploads should not interfere with downloads and vice versa (each has its own channel). It may be that something caused a link or connection coming up the hill to revert to half-duplex mode: a hardware failure, or a lost configuration.
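
To illustrate the difference between the two modes, here is a minimal sketch in Python of how they divide capacity. It is a toy model, not a description of the actual Berkeley link: the function name `effective_rates`, the proportional sharing rule for the shared case, and the demand figures are all assumptions made for illustration.

```python
def effective_rates(up_demand, down_demand, capacity=100.0, full_duplex=True):
    """Return (upload, download) throughput in Mb/s for a toy link model.

    Full duplex: each direction gets its own `capacity`.
    Shared (not full duplex): both directions share one `capacity`; when the
    combined demand exceeds it, the link is split in proportion to demand
    (an assumed sharing rule, chosen only for simplicity).
    """
    if full_duplex:
        return min(up_demand, capacity), min(down_demand, capacity)
    total = up_demand + down_demand
    if total <= capacity:
        return up_demand, down_demand
    scale = capacity / total
    return up_demand * scale, down_demand * scale


# Hypothetical demand: 40 Mb/s of uploads against 90 Mb/s of downloads.
full = effective_rates(40.0, 90.0, full_duplex=True)
shared = effective_rates(40.0, 90.0, full_duplex=False)
print("full duplex:", full)
print("shared link:", shared)
```

In the shared case both directions are squeezed as soon as the combined demand passes the link capacity, which would match the symptom of uploads and downloads hurting each other.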

Regards


Please consider a Donation to the Seti Project.

ID: 917529 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 917531 - Posted: 14 Jul 2009, 5:35:06 UTC - in response to Message 917525.  

Here is a picture of what's going on right now. I'm not able to upload or report, as I'm getting nothing but HTTP errors.

7/13/2009 10:11:15 PM SETI@home Started upload of 24no08ac.23426.13160.15.8.195_1_0
7/13/2009 10:11:15 PM SETI@home Temporarily failed upload of 28au08ag.21792.19295.3.8.174_1_0: HTTP error
7/13/2009 10:11:15 PM SETI@home Backing off 20 min 14 sec on upload of 28au08ag.21792.19295.3.8.174_1_0
7/13/2009 10:11:16 PM Internet access OK - project servers may be temporarily down.
7/13/2009 10:11:16 PM SETI@home Started upload of 17se08aa.20680.2526.11.8.163_1_0
7/13/2009 10:11:18 PM Project communication failed: attempting access to reference site
7/13/2009 10:11:18 PM SETI@home Temporarily failed upload of 24no08ac.23426.13160.15.8.195_1_0: HTTP error
7/13/2009 10:11:18 PM SETI@home Backing off 1 hr 32 min 55 sec on upload of 24no08ac.23426.13160.15.8.195_1_0
7/13/2009 10:11:19 PM Internet access OK - project servers may be temporarily down.
7/13/2009 10:11:19 PM SETI@home Temporarily failed upload of 17se08aa.20680.2526.11.8.163_1_0: HTTP error
7/13/2009 10:11:19 PM SETI@home Backing off 2 hr 42 min 20 sec on upload of 17se08aa.20680.2526.11.8.163_1_0


So they disabled the upload server. That seems like it might be counterproductive to me, as the bandwidth was at about 60 to 70 and now it's between 75 and 90 instead. I've set my end to NNT (No New Tasks), not that it will do any good: I can't stop BOINC from trying to do the impossible short of detaching from S@H, and I don't feel like suspending S@H just to stop BOINC from trying to upload and report.


It seems a bit backwards to me to kill the uploads. However, this is also the machine that processes validations. They might just be trying to catch up before going down for maintenance.
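
The back-off intervals in the log above differ from file to file (20 min, 1 hr 32 min, 2 hr 42 min), which is what a randomized exponential back-off would produce. The thread does not give BOINC's actual algorithm, so the following Python is only a generic sketch: the function names, the one minute floor and the four hour cap are assumptions, not values taken from the client.

```python
import random


def backoff_seconds(failures, min_delay=60.0, max_delay=4 * 3600.0, rng=random):
    """Randomized exponential back-off (generic sketch, not BOINC's own code).

    The ceiling doubles with each consecutive failure, is capped at
    `max_delay`, and the actual wait is drawn uniformly between
    `min_delay` and that ceiling so clients do not retry in lockstep.
    """
    if failures < 1:
        raise ValueError("failures must be at least 1")
    ceiling = min(max_delay, min_delay * 2 ** failures)
    return rng.uniform(min_delay, ceiling)


def format_delay(seconds):
    """Format a delay the way the client log does, e.g. '1 hr 32 min 55 sec'."""
    seconds = int(seconds)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours} hr {minutes} min {secs} sec"
    return f"{minutes} min {secs} sec"


rng = random.Random(2009)  # fixed seed so the sketch is repeatable
for n in range(1, 9):
    print(f"failure {n}: backing off {format_delay(backoff_seconds(n, rng=rng))}")
```

The random spread matters during an outage like this one: if every client retried on the same schedule, the server would be hit by synchronized waves of uploads the moment it came back.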
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 917531 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 917532 - Posted: 14 Jul 2009, 5:43:18 UTC - in response to Message 917525.  


SJ, Please do set No New Tasks. It will leave room for others.

Matt and others come and go at all hours of the day, and many things happen unnoticed. Apparently you are trying to create panic during what may be troubleshooting that is not obvious to anyone...
The hard part is that, rather than asking a question or notifying someone, your response was designed to create panic.

Regards


Please consider a Donation to the Seti Project.

ID: 917532 · Report as offensive
Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 917536 - Posted: 14 Jul 2009, 5:50:07 UTC - in response to Message 917529.  

I have looked at everything I can, and I have a bit of information coming from SETI staff. Given that MB, AP and CUDA used to all share the same upload/download link and still recover, something else must be wrong.

An email sent to SETI staff noted that at one point the 100 Megabit link was full duplex, meaning uploads should not interfere with downloads and vice versa (each has its own channel). It may be that something caused a link or connection coming up the hill to revert to half-duplex mode: a hardware failure, or a lost configuration.

Regards




Heeey

This actually makes perfect sense. If I try to recall what has been happening, this behaviour of not being able to upload etc. started to arise about 2 months ago in a quite distinct manner.
Before this I was hardly ever met in the morning by the message from BOINC that it couldn't connect to the project and that I should check my internet connection.

If full duplex isn't present, the communication is clogged to the extent that it feels like a DSL connection: if uploads on the DSL line go through the roof, downloads slow to a crawl when you have multiple connections to deal with.

Do you know what the internal interconnect between the servers is like?
I hope that it's at least decent switches with paired GBit trunks so the bottlenecks are minimised.
Decent GBit switches cost peanuts nowadays, and HP is a favourite of mine for internal backbone infrastructure.

Kind regards Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 917536 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 917537 - Posted: 14 Jul 2009, 5:50:22 UTC - in response to Message 917532.  


SJ, Please do set No New Tasks. It will leave room for others.

Matt and others come and go at all hours of the day, and many things happen unnoticed. Apparently you are trying to create panic during what may be troubleshooting that is not obvious to anyone...
The hard part is that, rather than asking a question or notifying someone, your response was designed to create panic.

Regards



I read it just as the description says: "Disabled: Program has been disabled by staff (for debugging/maintenance)". I figured it was something for the overall good of things. If it said "Not Running: Program failed or ran out of work (or the project is down)", then I would be a little worried, but not really.

I'm expecting the servers to be down Tuesday. Maybe they are just getting a jump on things this week?

I think it is just the "not knowing" that is hard to deal with sometimes.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 917537 · Report as offensive
Profile TCP JESUS
Joined: 19 Jan 03
Posts: 205
Credit: 1,248,845
RAC: 0
Canada
Message 917538 - Posted: 14 Jul 2009, 5:55:00 UTC - in response to Message 917536.  

Do you know what the internal interconnect between the servers is like?
I hope that it's at least decent switches with paired GBit trunks so the bottlenecks are minimised.


I think it's been stated before that INTERNAL structure already consists of a Gigabit backbone.
I am TCP JESUS...The Carpenter Phenom Jesus....and HAMMERING is what I do best!
formerly known as...MC Hammer.
ID: 917538 · Report as offensive
Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 917539 - Posted: 14 Jul 2009, 5:58:28 UTC
Last modified: 14 Jul 2009, 6:01:09 UTC

And I remember one more thing that I noticed.

About three to four months ago I started to get a lot of messages indicating that I couldn't connect to the scheduler, although downloading and uploading work was fine.

It seems that my GPU machine couldn't keep the link to Berkeley up long enough for Berkeley to receive the schedxxx.xml files. If I set up a proxy server on my DSL line here at home, instead of using my work's fibre ISP connection, and set my GPU machine to talk to that proxy, it could connect to Berkeley and properly reach the scheduler.

Why?! I don't really know, actually! All I know is that this started to happen about three to four months ago, so nowadays I need to manually switch my home proxy on and off so that it can connect properly to S@H. The larger the schedxxxx.xml file is, the harder it is to send it to Berkeley, and it would fail once it grew beyond reporting 500+ WUs.

So the last four to five months have been a constant supervising of the GPU machine, because it simply can't manage itself; it would get borked in one way or another.

Kind regards Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 917539 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 917540 - Posted: 14 Jul 2009, 6:05:18 UTC - in response to Message 917539.  

And I remember one more thing that I noticed.

About three to four months ago I started to get a lot of messages indicating that I couldn't connect to the scheduler, although downloading and uploading work was fine.

It seems that my GPU machine couldn't keep the link to Berkeley up long enough for Berkeley to receive the schedxxx.xml files. If I set up a proxy server on my DSL line here at home, instead of using my work's fibre ISP connection, and set my GPU machine to talk to that proxy, it could connect to Berkeley and properly reach the scheduler.

Why?! I don't really know, actually! All I know is that this started to happen about three to four months ago, so nowadays I need to manually switch my home proxy on and off so that it can connect properly to S@H. The larger the schedxxxx.xml file is, the harder it is to send it to Berkeley, and it would fail once it grew beyond reporting 500+ WUs.

So the last four to five months have been a constant supervising of the GPU machine, because it simply can't manage itself; it would get borked in one way or another.

Kind regards Vyper


I have noticed that BOINC often "forgets" to upload results and I have to do a manual Update to get them sent in. I'll see logs of "finished WU, uploading WU, downloading WU" and such with no errors, yet I will have 6-10 results "Ready for upload" in my tasks. I have read around to see if this is a "normal" thing to occur, but it happens on all of my hosts.

SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 917540 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 917543 - Posted: 14 Jul 2009, 6:28:53 UTC - in response to Message 917529.  

An email sent to SETI staff noted that at one point the 100 Megabit link was full duplex, meaning uploads should not interfere with downloads and vice versa (each has its own channel).

We forget that TCP is a sliding window protocol. If the 100 megabit line is saturated inbound, part of that inbound traffic consists of the ACKs for the outbound traffic.

When the ACKs are delayed or lost, at some point the sender stops sending new data, and waits. When the ACKs don't arrive (because they were lost) data is resent.

In either direction, when the load is very high, data in the other direction will suffer too.
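
The effect of lost ACKs on a windowed sender can be shown with a small Python simulation. This is a deliberately simplified model and not real TCP: it assumes data segments always arrive, each segment is acknowledged individually (no cumulative ACKs), and one "round" stands for one retransmission timeout. The function name and all parameter values are made up for illustration.

```python
import random


def simulate(total_segments, window, ack_loss, seed=1):
    """Toy sliding-window sender; returns (rounds, transmissions).

    Assumptions (for illustration only): data segments always arrive,
    each segment is acknowledged individually, an ACK is lost with
    probability `ack_loss` (must be below 1), and one round equals one
    retransmission timeout.
    """
    rng = random.Random(seed)
    unacked = list(range(total_segments))  # segments not yet acknowledged
    rounds = 0
    transmissions = 0
    while unacked:
        rounds += 1
        in_flight = unacked[:window]       # the window limits what is sent
        transmissions += len(in_flight)
        # Segments whose ACK got through leave the window; the rest are
        # resent next round, blocking new data from entering the window.
        acked = {seg for seg in in_flight if rng.random() >= ack_loss}
        unacked = [seg for seg in unacked if seg not in acked]
    return rounds, transmissions


for loss in (0.0, 0.2, 0.5, 0.8):
    rounds, sent = simulate(total_segments=1000, window=10, ack_loss=loss)
    print(f"ACK loss {loss:.0%}: {rounds} rounds, {sent} transmissions "
          f"for 1000 segments")
```

Even though no data is lost in this model, the sender takes more rounds and puts more traffic on the wire as ACK loss rises, because unacknowledged segments occupy the window and get resent.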

ID: 917543 · Report as offensive
BarryAZ

Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 917545 - Posted: 14 Jul 2009, 6:33:53 UTC - in response to Message 917543.  
Last modified: 14 Jul 2009, 6:34:30 UTC

I offer another analogy -- I call it the snow plow effect. I've used that to describe workload before and after a vacation -- the snowplow clearing the road (vacation) creates very large piles of snow (work) on either side of the road. It seems often enough for SETI that the snowplow (the Tuesday outage) results in very large piles of snow (upload/download congestion) for anywhere from 12 to 24 hours on either side.

Analogies are sloppy, I realize this.



In either direction, when the load is very high, data in the other direction will suffer too.

ID: 917545 · Report as offensive
Profile TCP JESUS
Joined: 19 Jan 03
Posts: 205
Credit: 1,248,845
RAC: 0
Canada
Message 917546 - Posted: 14 Jul 2009, 6:37:38 UTC

So... how does taking the upload server offline the day before a known outage (Tuesday maintenance) help network congestion again?
I am TCP JESUS...The Carpenter Phenom Jesus....and HAMMERING is what I do best!
formerly known as...MC Hammer.
ID: 917546 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 917549 - Posted: 14 Jul 2009, 6:43:53 UTC - in response to Message 917546.  

So... how does taking the upload server offline the day before a known outage (Tuesday maintenance) help network congestion again?


I'm guessing it would give the databases more free time to catch up, so that the servers are not down for 24 hours?
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 917549 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13927
Credit: 208,696,464
RAC: 304
Australia
Message 917556 - Posted: 14 Jul 2009, 8:07:02 UTC - in response to Message 917549.  


Hmm. I see the upload server is disabled.
This would probably explain why I can't upload and why there is no inbound network traffic worth mentioning.
Grant
Darwin NT
ID: 917556 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 21702
Credit: 7,508,002
RAC: 20
United Kingdom
Message 917570 - Posted: 14 Jul 2009, 9:57:45 UTC - in response to Message 917543.  

An email sent to SETI staff noted that at one point the 100 Megabit link was full duplex, meaning uploads should not interfere with downloads and vice versa (each has its own channel).

We forget that TCP is a sliding window protocol. If the 100 megabit line is saturated inbound, part of that inbound traffic consists of the ACKs for the outbound traffic.

When the ACKs are delayed or lost, at some point the sender stops sending new data, and waits. When the ACKs don't arrive (because they were lost) data is resent.

In either direction, when the load is very high, data in the other direction will suffer too.

That's a very 'subdued' way of describing the situation.

Lose the TCP control packets in either direction and the link is DOSed with an exponentially increasing stack of resend attempts that DOS for further attempts that then DOS for... Until the link disgracefully degrades to being totally blocked. Max link utilisation but no useful information gets through.

The only limiting factors are the TCP timeouts and the rate of new connection attempts.
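
A rough back-of-the-envelope version of this, under the assumption (not from the thread) that every data segment is lost with a fixed probability p_d and every ACK with a fixed probability p_a, independently of one another:

```latex
% Assumed model: each data segment is lost with probability p_d and each
% ACK with probability p_a, independently of one another.
\begin{align*}
P(\text{segment acknowledged}) &= (1 - p_d)(1 - p_a) \\
E[\text{transmissions per acknowledged segment}] &= \frac{1}{(1 - p_d)(1 - p_a)} \\
\text{useful fraction of sent data} &= (1 - p_d)(1 - p_a)
\end{align*}
% Example with p_d = p_a = 0.3: only 0.49 of the transmissions are
% acknowledged, so each segment is sent about 2.04 times on average.
```

This fixed-probability picture is optimistic: on a real saturated link the resends add load, which pushes the loss probabilities up further, and that feedback is what produces the runaway described above.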


And I thought the smooth 71Mb/s was due to some cool traffic management. OK, so restricting the available WUs is also a clumsy way to "traffic manage"!


In short, keep the link at never anything more than 89Mb/s MAX and everyone is happy!

Happy smooth crunchin',
Martin


See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 917570 · Report as offensive
Chelski
Joined: 3 Jan 00
Posts: 121
Credit: 8,979,050
RAC: 0
Malaysia
Message 917573 - Posted: 14 Jul 2009, 10:13:24 UTC - in response to Message 917570.  

Interesting observation. I was scratching my head earlier over why downloading was very smooth while uploads didn't even start properly, unlike earlier problems (when they tended to get stuck at 100% because the ACK was lost in space).

Well, very soon the smooth 71Mbps will die down as clients stop requesting work after a certain number of WUs (3?) are stuck on the upload queue.
ID: 917573 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 917575 - Posted: 14 Jul 2009, 10:31:34 UTC


If you have more than 'CPUs x 2' results in the upload overview, BOINC will not ask for new work.

Currently I have ~400 results ready for upload, and the number increases every few minutes.
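
Written out as code, the rule of thumb above looks like this. It is a sketch of the rule as stated in this post, not something checked against the BOINC source; the function name `will_request_work` and the 4-CPU host are hypothetical.

```python
def will_request_work(pending_uploads, ncpus):
    """Rule of thumb as stated in the post: the client stops asking for
    new work once more than 2 x CPUs results are waiting to upload.
    (Taken from the post, not checked against the BOINC source.)
    """
    return pending_uploads <= 2 * ncpus


# Hypothetical 4-CPU host.
for pending in (3, 8, 9, 400):
    state = "requests work" if will_request_work(pending, 4) else "no new work"
    print(pending, "pending uploads ->", state)
```

With the upload server disabled, every host crosses that threshold sooner or later, so download traffic should fall off on its own as caches stop being refilled.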

ID: 917575 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 917581 - Posted: 14 Jul 2009, 11:24:23 UTC - in response to Message 917382.  

Hi, a look at the server status page shows that the upload server is disabled.
I don't know why it's disabled.

Maybe after today's regular maintenance outage they will turn it back on, if it functions correctly.


ID: 917581 · Report as offensive
Zebra3
Joined: 22 Oct 01
Posts: 186
Credit: 13,658,148
RAC: 0
Canada
Message 917584 - Posted: 14 Jul 2009, 12:12:01 UTC

As I wade into this quasi firestorm, I look at my machines and see that my earliest deadline is July 20th which, other than the day's importance in history, is still 6 days away. I have a few hundred WUs waiting to upload, but I carry a decent 4 day cache and have NEVER run out of MB WUs to crunch. Occasionally I will run out of the demon CUDA WUs, but when that happens I load a WU from CPUGRID and the GPU is happy crunching away on it for the next 24 hours or so.

So in my mind, if your cache of WUs is fresh and you crunch and report them in a timely fashion, a 1 or 2 day interruption should be of no concern, other than that you are staring at all those 100% completed WUs and hoping that your hard drive doesn't crash like mine did recently. There are bumps on the road all the time that don't need to be made into mountains... just deal with what is most important, FAMILY, and the rest of the world will come together just fine.

As the song goes...DON'T WORRY...BE HAPPY!

Cheers
http://www.novascotia.com
ID: 917584 · Report as offensive
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.