Panic Mode On (77) Server Problems?



Message boards : Number crunching : Panic Mode On (77) Server Problems?

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2236
Credit: 8,446,444
RAC: 4,085
United States
Message 1301282 - Posted: 2 Nov 2012, 15:45:44 UTC

Mark, I noticed something last night that I think you're talking about. I reported one completed AP and the website showed it as reported, but BOINC didn't get the memo and instead.. "timeout was reached." The next report cleared it up, but every time you talk to the scheduler, your client_state has to be sent. I imagine for the faster rigs, it is probably in the >1MB range, so I agree.. wasted bandwidth for scheduler time-outs.

Kind of like cramming 16MB of data through the pipe only to have it be 100% blanked.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,646,474
RAC: 645,755
United States
Message 1301288 - Posted: 2 Nov 2012, 15:54:01 UTC - in response to Message 1301282.
Last modified: 2 Nov 2012, 15:54:59 UTC

Mark, I noticed something last night that I think you're talking about. I reported one completed AP and the website showed it as reported, but BOINC didn't get the memo and instead.. "timeout was reached." The next report cleared it up, but every time you talk to the scheduler, your client_state has to be sent. I imagine for the faster rigs, it is probably in the >1MB range, so I agree.. wasted bandwidth for scheduler time-outs.

Kind of like cramming 16MB of data through the pipe only to have it be 100% blanked.

Yeah....
All the data gets sent through the pipe, and then the server fumbles the ball......more data then transferred trying to determine who recovered the fumble.

Uhh...back to 1st down and goal to go.

More bandwidth consumed on the next attempt.

4th down......how many times are they gonna try?

Oh crap, they missed the field goal.

1st down.......more bandwidth consumed and we still have not scored.

You get my drift. When things get tangled like this, it's a downward spiral. More bandwidth gets used trying to recover fumbles than moving the dang ball.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

cov_route
Joined: 13 Sep 12
Posts: 286
Credit: 6,174,351
RAC: 15,617
Canada
Message 1301293 - Posted: 2 Nov 2012, 16:01:11 UTC

Now I have over 3000 phantom WUs, up from about 2k last night. This is about 5x my normal cache size. Should I set NNT? Or just let it do its thing?

zoom314
Joined: 30 Nov 03
Posts: 45787
Credit: 36,411,828
RAC: 7,202
Message 1301297 - Posted: 2 Nov 2012, 16:08:36 UTC - in response to Message 1301202.

I have been monitoring my rig and have no issues.
Except Mark is partnered with me on one of my inconclusives. I'm sure it is stuck in the upload problem.

I'm having less trouble uploading than reporting, something needs some computer exlax for the reporting...
____________

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,646,474
RAC: 645,755
United States
Message 1301299 - Posted: 2 Nov 2012, 16:10:48 UTC - in response to Message 1301293.

Now I have over 3000 phantom WUs, up from about 2k last night. This is about 5x my normal cache size. Should I set NNT? Or just let it do its thing?

The kitties are just letting BOINC do its thing.

The only change I made a while back, when the scheduler was tied in knots, was to add a bit to my cc_config file to report only 100 WUs at a time. That helped at the time, but it is not improving things much right now.
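
For anyone wanting to try the same throttle, here is a minimal sketch that builds such a cc_config.xml. It assumes the setting in question is the `max_tasks_reported` option; the post does not name it, so check the BOINC client configuration documentation for your client version before relying on it. The function name `build_cc_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_cc_config(max_reported: int) -> str:
    """Return cc_config.xml text that caps tasks reported per scheduler request."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    # ASSUMPTION: "max_tasks_reported" is the option meant in the post above.
    ET.SubElement(options, "max_tasks_reported").text = str(max_reported)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 100 is the per-request limit mentioned in the post.
    print(build_cc_config(100))
```

If you already have a cc_config.xml, add the element to its existing options section rather than overwriting the file.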

____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Link
Joined: 18 Sep 03
Posts: 823
Credit: 1,544,880
RAC: 315
Germany
Message 1301305 - Posted: 2 Nov 2012, 16:24:57 UTC - in response to Message 1301282.

(...) but every time you talk to the scheduler, your client_state has to be sent. I imagine for the faster rigs, it is probably in the >1MB range, so I agree.. wasted bandwidth for scheduler time-outs.

sched_request_setiathome.berkeley.edu.xml is what gets sent to the scheduler, and it should be quite a bit smaller than client_state.xml, since it doesn't contain all the information about all files and other projects, and only very sparse information about the SETI tasks on that machine.
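
A quick way to check this on your own machine is to compare the two file sizes. This is only a sketch: the function name `request_size_ratio` is invented here, and you have to pass in your own BOINC data directory, since its location differs between installs.

```python
from pathlib import Path


def request_size_ratio(
    data_dir: Path,
    request_name: str = "sched_request_setiathome.berkeley.edu.xml",
    state_name: str = "client_state.xml",
) -> float:
    """Return size of the scheduler request file divided by size of client_state.xml."""
    request_bytes = (data_dir / request_name).stat().st_size
    state_bytes = (data_dir / state_name).stat().st_size
    # Raises ZeroDivisionError if client_state.xml is empty.
    return request_bytes / state_bytes
```

A result well below 1.0 would support the point that each scheduler contact uploads much less than the full client state.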
____________
.

Cherokee150
Joined: 11 Nov 99
Posts: 103
Credit: 23,219,132
RAC: 30,993
United States
Message 1301320 - Posted: 2 Nov 2012, 17:09:22 UTC - in response to Message 1301299.

Mark, I suspect, however, that the kitties are not purring right now. ;)

By the way, I did do one thing to help alleviate the obviously gigantic mess developing between the SETI host and its clients. I suspended network activity on all my machines.

I always keep my caches maxed out to handle these emergencies, so I'm good to go for at least a week. It'll be a while before any of my units time out. When the dust settles and traffic is back to normal, I'll just let my rigs report their stored results one machine at a time. :)

Wibble
Joined: 25 Nov 02
Posts: 4
Credit: 569,386
RAC: 370
United Kingdom
Message 1301354 - Posted: 2 Nov 2012, 18:29:08 UTC - in response to Message 1301112.
Last modified: 2 Nov 2012, 18:30:50 UTC

For maybe the last 20 hours or so there has been no successful scheduler contact. Always:
Scheduler request failed: Timeout was reached


*new tasks* was enabled.

I set *no new tasks* - and then 178 uploaded tasks were accepted from the scheduler server in a bunch (successful report).


That also worked for me. My last successful scheduler request was on the 30th; I just set 'no new tasks,' clicked 'update' and the request/report went through straight away.

I think I'll leave 'no new tasks' set for a while.
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5695
Credit: 56,352,108
RAC: 48,888
Australia
Message 1301358 - Posted: 2 Nov 2012, 18:33:37 UTC - in response to Message 1301354.

That also worked for me. My last successful scheduler request was on the 30th; I just set 'no new tasks,' clicked 'update' and the request/report went through straight away.

I've tried it with NNT set, and without.
I think the only advantage of NNT set is you can click update as soon as you get the timeout response from the Scheduler without it complaining about it being too soon.


The fact is that overnight neither of my systems was able to get a response from the Scheduler that wasn't a timeout. Completed work to be reported piles up, and my caches get smaller & smaller.
____________
Grant
Darwin NT.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5695
Credit: 56,352,108
RAC: 48,888
Australia
Message 1301392 - Posted: 2 Nov 2012, 20:18:24 UTC - in response to Message 1301358.


I hope they can sort this Scheduler out soon. With all the shorties in the system, and the inability to get work on more than 1 attempt in 20, I'm going to run out of work very quickly when almost everything that does get downloaded will be done in minutes.
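
To put a rough number on that 1-in-20 figure: if each attempt succeeds independently with the same probability, the expected number of attempts per success is 1/p, so the traffic per successful contact scales the same way. The sketch below assumes that every failed attempt still uploads the whole request, and the 1 MB request size is a made-up placeholder, not a measured value.

```python
def expected_bytes_per_success(request_bytes: int, success_probability: float) -> float:
    """Expected bytes uploaded per successful scheduler contact.

    Assumes independent attempts (geometric distribution), so the
    expected number of attempts is 1 / success_probability.
    """
    if not 0 < success_probability <= 1:
        raise ValueError("success_probability must be in (0, 1]")
    expected_attempts = 1 / success_probability
    return request_bytes * expected_attempts


if __name__ == "__main__":
    # HYPOTHETICAL request size of 1 MB; 1 success in 20 attempts.
    print(expected_bytes_per_success(1_000_000, 1 / 20))
```

Under those assumptions, one successful contact costs roughly twenty requests' worth of upload traffic.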
____________
Grant
Darwin NT.

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,646,474
RAC: 645,755
United States
Message 1301396 - Posted: 2 Nov 2012, 20:34:58 UTC - in response to Message 1301392.


I hope they can sort this Scheduler out soon. With all the shorties in the system, and the inability to get work on more than 1 attempt in 20, I'm going to run out of work very quickly when almost everything that does get downloaded will be done in minutes.

Scheduler cannot do nothing without bandwidth to send and receive.
This needs to be addressed before you folks complain about much more.

Over and out.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

zoom314
Joined: 30 Nov 03
Posts: 45787
Credit: 36,411,828
RAC: 7,202
Message 1301403 - Posted: 2 Nov 2012, 20:40:38 UTC - in response to Message 1301396.


I hope they can sort this Scheduler out soon. With all the shorties in the system, and the inability to get work on more than 1 attempt in 20, I'm going to run out of work very quickly when almost everything that does get downloaded will be done in minutes.

Scheduler cannot do nothing without bandwidth to send and receive.
This needs to be addressed before you folks complain about much more.

Over and out.

Roger Wilco...
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5695
Credit: 56,352,108
RAC: 48,888
Australia
Message 1301404 - Posted: 2 Nov 2012, 20:40:44 UTC - in response to Message 1301396.

Scheduler cannot do nothing without bandwidth to send and receive.
This needs to be addressed before you folks complain about much more.

That would be so, if the problem is bandwidth related (I've posted a workaround in the Wish list about that, if it is the case).
However, in the past there have been just as many shorties, after even longer outages, and the Scheduler has had no problems responding to requests. The sheer number of these timeouts is something that's only started over the last 4-8 weeks.
____________
Grant
Darwin NT.

David Anderson (not *that* DA)
Joined: 5 Dec 09
Posts: 107
Credit: 22,655,647
RAC: 15,278
United States
Message 1301432 - Posted: 2 Nov 2012, 21:50:59 UTC

One wonders if the incredible volume of incoming data (as shown by the Cricket graph) is not a manifestation of the infamous 'buffer bloat', rather than really a need for more bandwidth.

Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 6819
Credit: 24,582,371
RAC: 26,788
United Kingdom
Message 1301435 - Posted: 2 Nov 2012, 22:11:52 UTC
Last modified: 2 Nov 2012, 22:12:28 UTC

Well, I have given up with SETI@Home and am just running this Intel(R) Celeron(R) CPU 2.53GHz [Family 15 Model 3 Stepping 4](1 processors), bought for £30 at my local computer fair, just to keep an RAC here so I can post.

I have had no problems at all, because I do not demand masses of WUs and it takes time to return the results. As it should be.

I seriously believe that running machines with 4-6-8-12-16 processors and multiple graphics cards is more than the system can cope with! The problem is us, the crunchers.

My 10 CPUs and 4 GPUs are now happily crunching for other projects.
____________


Today is life, the only life we're sure of. Make the most of today.

Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12130
Credit: 6,411,775
RAC: 8,178
United States
Message 1301438 - Posted: 2 Nov 2012, 22:20:33 UTC

Something is strange. I have zero issues with SetiBeta right now, but I can't report on main. As it is on the same communications link, it isn't a link issue. It has to be an issue in the lab. Perhaps a failed disk in an array?

____________

zoom314
Joined: 30 Nov 03
Posts: 45787
Credit: 36,411,828
RAC: 7,202
Message 1301440 - Posted: 2 Nov 2012, 22:23:41 UTC - in response to Message 1301438.

Something is strange. I have zero issues with SetiBeta right now, but I can't report on main. As it is on the same communications link, it isn't a link issue. It has to be an issue in the lab. Perhaps a failed disk in an array?

Me too, no reports...
____________

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8375
Credit: 46,700,059
RAC: 19,996
United Kingdom
Message 1301443 - Posted: 2 Nov 2012, 22:31:22 UTC - in response to Message 1301432.

One wonders if the incredible volume of incoming data (as shown by the Cricket graph) is not a manifestation of the infamous 'buffer bloat', rather than really a need for more bandwidth.

Just checking that you are aware that the Cricket graphs are seen from the point of view of a mid-way router?

The green pixels are data leaving the lab, coming in to the router, and proceeding on its way out to us, the crunchers. That is, our downloads.

The blue line is data leaving the router, in the direction of the SSL lab on the hill. That is, our uploads.

I don't think there's an 'incredible volume of incoming data'.

chromespringer
Joined: 3 Dec 05
Posts: 269
Credit: 18,816,859
RAC: 32,161
United States
Message 1301445 - Posted: 2 Nov 2012, 22:39:24 UTC

Have had 0 tasks since Thurs morning .. haven't been able to report completions or get new work on mach xxxx033 .. have had this response for almost two days now :(

11/2/2012 4:17:19 PM | SETI@home | Reporting 410 completed tasks, requesting new tasks for CPU and ATI
11/2/2012 4:23:01 PM | SETI@home | Scheduler request failed: Timeout was reached
11/2/2012 4:23:05 PM | | Project communication failed: attempting access to reference site
11/2/2012 4:23:06 PM | | Internet access OK - project servers may be temporarily down.
11/2/2012 4:27:33 PM | SETI@home | update requested by user
11/2/2012 4:27:37 PM | SETI@home | Sending scheduler request: Requested by user.
11/2/2012 4:27:37 PM | SETI@home | Reporting 410 completed tasks, requesting new tasks for CPU and ATI
11/2/2012 4:32:57 PM | SETI@home | Scheduler request failed: Timeout was reached
11/2/2012 4:33:00 PM | | Project communication failed: attempting access to reference site
11/2/2012 4:33:01 PM | | Internet access OK - project servers may be temporarily down.
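
For reference, the log above can be used to measure how long each request hung before timing out. The sketch below is one way to do that, assuming the event log keeps this "timestamp | project | message" layout with month/day/year dates; the function name `timeout_waits` is invented for illustration, and only the request and failure lines from the log are included in the sample.

```python
from datetime import datetime

LOG = """\
11/2/2012 4:17:19 PM | SETI@home | Reporting 410 completed tasks, requesting new tasks for CPU and ATI
11/2/2012 4:23:01 PM | SETI@home | Scheduler request failed: Timeout was reached
11/2/2012 4:27:37 PM | SETI@home | Reporting 410 completed tasks, requesting new tasks for CPU and ATI
11/2/2012 4:32:57 PM | SETI@home | Scheduler request failed: Timeout was reached
"""


def timeout_waits(log_text: str) -> list[int]:
    """Return seconds between each 'Reporting' line and the failure that follows it."""
    waits = []
    started = None
    for line in log_text.splitlines():
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3:
            continue  # not a "timestamp | project | message" line
        stamp, _project, message = parts
        when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
        if message.startswith("Reporting"):
            started = when
        elif message.startswith("Scheduler request failed") and started is not None:
            waits.append(int((when - started).total_seconds()))
            started = None
    return waits


if __name__ == "__main__":
    print(timeout_waits(LOG))  # [342, 320]
```

So both attempts waited more than five minutes before the client gave up.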

____________

arkayn
Volunteer tester
Joined: 14 May 99
Posts: 3594
Credit: 47,338,244
RAC: 328
United States
Message 1301498 - Posted: 3 Nov 2012, 0:35:01 UTC

I managed to report 380+ units 25 at a time with NNT set; it just took a while.
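
As a rough count of what that involves, here is a small sketch of how many scheduler requests a backlog needs at a given per-request limit. It assumes every request succeeds on the first try, which was clearly not the case here, so treat the result as a lower bound. The function name `requests_needed` is made up.

```python
def requests_needed(completed_tasks: int, tasks_per_request: int) -> int:
    """Minimum number of scheduler requests to report a backlog (ceiling division)."""
    if tasks_per_request <= 0:
        raise ValueError("tasks_per_request must be positive")
    return -(-completed_tasks // tasks_per_request)


if __name__ == "__main__":
    print(requests_needed(380, 25))   # 16
    print(requests_needed(410, 100))  # 5
```

The second line uses the 410-task backlog and 100-per-request limit mentioned earlier in the thread.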
____________


Copyright © 2014 University of California