Panic Mode On (78) Server Problems?


fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1302176 - Posted: 4 Nov 2012, 17:30:28 UTC - in response to Message 1302169.

Anybody got any idea whether we're being haunted by Astropulse ghosts, to the same degree?

I think it's getting to the point that I need to put out an urgent APB, reluctant though I am to do so on a Sunday - just to get a lid put on the situation, until they can look at it tomorrow or Tuesday.

The question is - is stopping the MB splitters enough, or should I ask them to stop AP as well?

With splitters stopped, and no new work entering the system, we could allow work fetch and fish for resends of the work we've already supposedly got.


Have not crunched any AP tasks in days, but I see 2 among the ghosts.

Fred E. (Project donor)
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 7
United States
Message 1302179 - Posted: 4 Nov 2012, 17:35:13 UTC

The question is - is stopping the MB splitters enough, or should I ask them to stop AP as well?

With splitters stopped, and no new work entering the system, we could allow work fetch and fish for resends of the work we've already supposedly got.


Numbers for AP are not as large, but I have 8 lost AP tasks that were not sent when the scheduler gave me new work instead. "Results out in the field" are pretty big: 10,685,000 for MB and 137k for AP. I suggest they try to shut off both.
____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3247
Credit: 31,806,008
RAC: 3,419
Netherlands
Message 1302180 - Posted: 4 Nov 2012, 17:38:37 UTC - in response to Message 1302037.

One rig couldn't connect to the SETI servers and displayed the message:

don't need a network connection
?!

After reinstalling BOINC several times, my account settings in SETI were altered?!
And in Malaria and Rosetta, too.
My BOOT-drive (C:) was also shared.

In too many cases, it's not a good idea to run multiple projects,
on 1 host.

Apologies to my wingmen for a few hundred MB WUs, which could not be uploaded
due to this network issue and timed out.




fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1302181 - Posted: 4 Nov 2012, 17:41:36 UTC

As Marcel said..."Just shoot up here amongst us, one of us has got to have some relief."

:)

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8551
Credit: 50,415,703
RAC: 51,150
United Kingdom
Message 1302186 - Posted: 4 Nov 2012, 17:54:21 UTC - in response to Message 1302178.

I would just stop the AP splitters.
MB was chugging along just fine until the AP work entered the picture again.

I sympathise with that assessment of the possible trigger, but I think we've got beyond that point now.

The database is horrendously bloated - well over 10.5 million tasks supposedly 'out in the field', which is 50% more than usual. The biggest problem right now isn't communications - when you get resends, they come down quite smoothly - but the slow "thinking time" response of the scheduler. I don't think simply stopping AP production on its own will free up enough scheduler and database resources to stop the timeouts and the creation of new ghosts.

On the other hand, I agree stopping MB production is a drastic step, and it will impact loads of volunteers with hosts like my little one-core server - which has been plinking along exactly as designed, getting new tasks as needed (and rotating through three different projects). No surplus fat to live off there - one task in progress it says, and I can see it running now.

But my gut feeling is saying, quite strongly, that recovery from this problem is going to take an outage of some sort - and the sooner we start it, the shorter it will be.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5831
Credit: 59,452,941
RAC: 47,872
Australia
Message 1302198 - Posted: 4 Nov 2012, 18:13:35 UTC - in response to Message 1302188.


There are going to be tens of thousands of timeouts when we run out of shorties & all the VLAR resends start going to NVidia GPUs.

Overnight I got a few resends on my systems, but it's still mostly timeouts. There's been the odd HTTP error or "couldn't reach scheduler", and when the scheduler was reached, "Project has no tasks available".
But mostly it's still timeouts.
It's been that way pretty much since a few hours after the last weekly outage, and I notice the database queries are still over 1,000/s, where less than 800/s is the usual number.
____________
Grant
Darwin NT.

Highlander
Joined: 5 Oct 99
Posts: 144
Credit: 31,210,982
RAC: 8,563
Germany
Message 1302206 - Posted: 4 Nov 2012, 18:24:54 UTC

Regarding AP ghosts: my host http://setiathome.berkeley.edu/show_host_detail.php?hostid=5553346, which only does AP, has had no ghost units at all so far. My other host, which is set up as MB only, has some, around 100. For that host alone I suspend network communication manually (opening it for 2-3 hours once a day), as I don't want to put extra load on the scheduler because of NNT.

Keith White
Joined: 29 May 99
Posts: 370
Credit: 2,838,916
RAC: 2,222
United States
Message 1302238 - Posted: 4 Nov 2012, 20:00:18 UTC

What I'm seeing now.

This is about MB units, I don't do APs.

Now that the "ready to send" queue has dried up, I've been able to get scheduler requests completed even when allowing new tasks. I'm not getting any new units, but I would expect that with the ready queue running on vapors. However, Ghost Detector shows I currently have 196 ghost units that were assigned but never downloaded, and I should be getting those, eventually. All ATI, almost all shorties BTW.

Problem is that's nearly half of my "in progress" units. And the vast majority of the ones I do have are shorties. I'm guessing I'm down to a couple of days worth of units locally right now.

Que sera sera.
____________
"Life is just nature's way of keeping meat fresh." - The Doctor

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8551
Credit: 50,415,703
RAC: 51,150
United Kingdom
Message 1302257 - Posted: 4 Nov 2012, 20:41:00 UTC

I've just had a note back from Eric:

I've stopped the splitters and doubled the httpd timeout...

I think we're going to need to at least temporarily go back
to restricting workunits in progress on a per host basis and per RPC
basis, regardless of what complaints we get about people being unable
to keep their hosts busy.

The splitters are already showing red/orange on the server status page, and 'ready to send' is as near zero as makes no difference (there'll always be a few errors and timeouts to resend). So I'm going to turn off NNT and see what happens - let's see if we can help get this beast back under control.
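For anyone wondering how the two restrictions Eric mentions would interact, here is a minimal sketch. It is not the BOINC scheduler's actual code: the function name, parameter names and the limit values are all hypothetical, chosen only to illustrate a per-host cap on work in progress combined with a per-RPC cap on what one reply can carry.

```python
def tasks_to_send(requested, in_progress, per_host_limit, per_rpc_limit):
    """Return how many tasks a capped scheduler would hand out for one request.

    All names and limits here are hypothetical, for illustration only.
    """
    # Room left under the per-host cap on tasks in progress (never negative).
    host_headroom = max(0, per_host_limit - in_progress)
    # The reply is bounded by the request, the per-RPC cap and the host headroom.
    return min(requested, per_rpc_limit, host_headroom)


# Made-up limits: 100 in progress per host, 20 per RPC.
print(tasks_to_send(requested=500, in_progress=90, per_host_limit=100, per_rpc_limit=20))
print(tasks_to_send(requested=500, in_progress=0, per_host_limit=100, per_rpc_limit=20))
```

With caps like these, a host with a huge cache setting can ask for as much as it likes, but each reply stays small and the host stops receiving work once it hits its in-progress ceiling.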

fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1302262 - Posted: 4 Nov 2012, 20:48:04 UTC - in response to Message 1302257.

I've just had a note back from Eric:

I've stopped the splitters and doubled the httpd timeout...

I think we're going to need to at least temporarily go back
to restricting workunits in progress on a per host basis and per RPC
basis, regardless of what complaints we get about people being unable
to keep their hosts busy.

The splitters are already showing red/orange on the server status page, and 'ready to send' is as near zero as makes no difference (there'll always be a few errors and timeouts to resend). So I'm going to turn off NNT and see what happens - let's see if we can help get this beast back under control.


I have NNT turned off on three machines with empty caches and lots of ghost tasks. So far all I get is Project has no tasks available.

Fred E. (Project donor)
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 7
United States
Message 1302265 - Posted: 4 Nov 2012, 20:52:09 UTC

I decided to try to get some of my lost tasks since the splitters are disabled and there is no new work available. However, I didn't get any of my 467 lost tasks. I got the "no tasks available" message on two attempts. Will have to see how this plays out. I'm back to NNT for now.

Database queries are still high, but I got fast responses on those requests: 7 and 8 seconds for work requests ain't bad. Bet Synergy would always run well without the splitting duties.

Need a pool to guess how many ghosts are out there, but guess we won't see the number.
____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8431
Credit: 57,518,639
RAC: 73,929
United Kingdom
Message 1302266 - Posted: 4 Nov 2012, 20:55:53 UTC

Lost tasks come back "automagically".....
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

bill
Joined: 16 Jun 99
Posts: 861
Credit: 23,705,385
RAC: 23,498
United States
Message 1302273 - Posted: 4 Nov 2012, 21:11:25 UTC - in response to Message 1302180.

"In too many cases, it's not a good idea to run multiple projects,
on 1 host."

What data do you have to support that assumption?
And your personal experiences are too small a data point to be
statistically significant. I offset them with mine of running 8 projects without
having any of the problems you have.

At the moment I've been out of SETI GPU work for a while, running Einstein instead. I'm down to 56 CPU APs (about 8 days' worth) that are alternating with LHC and Rosetta. When the servers first started screwing up this weekend I went NNT on SETI. Call it foresight, experience, luck, whatever, but I have 0 ghosts and continuous work to do on my cruncher. When the lab boys get this little snafu fixed I'll be waiting because I have what seems to be in short supply around here

patience.

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8551
Credit: 50,415,703
RAC: 51,150
United Kingdom
Message 1302275 - Posted: 4 Nov 2012, 21:12:58 UTC - in response to Message 1302266.

Lost tasks come back "automagically".....

Not yet, they haven't. I'm with Fred and fscheel so far - really quick turnround on requests, but always "Project has no tasks available". I'll let them keep asking, and see what happens over the next few hours.

AllanB
Joined: 2 Sep 12
Posts: 280
Credit: 425,090
RAC: 0
United Kingdom
Message 1302276 - Posted: 4 Nov 2012, 21:13:11 UTC
Last modified: 4 Nov 2012, 21:14:53 UTC

I had set NNT so as to run down my cache before switching off my machine. When I looked after it had finished crunching I had 20 or so ghosts, so I unset NNT for a while to try and see if I could get them. I now appear to have 329 tasks I haven't got.

This is much more than my cache is set for, will I still get them?

PS I have received none so far as well.

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8551
Credit: 50,415,703
RAC: 51,150
United Kingdom
Message 1302279 - Posted: 4 Nov 2012, 21:31:16 UTC - in response to Message 1302276.

I had set NNT so as to run down my cache before switching off my machine. When I looked after it had finished crunching I had 20 or so ghosts, so I unset NNT for a while to try and see if I could get them. I now appear to have 329 tasks I haven't got.

This is much more than my cache is set for, will I still get them?

You should get them as you request work over the next few days, no more than 20 tasks per request. Your computer won't be suddenly overwhelmed with work, and if you don't get through them all in time, don't worry - they'll get sent to somebody else instead.
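To put rough numbers on that, here is a minimal sketch. It assumes every work request comes back with a full batch of 20 resends, which is optimistic while the scheduler is still timing out, so treat the result as a lower bound. The function name is made up for illustration.

```python
import math


def requests_needed(ghosts, per_request=20):
    """Fewest scheduler requests needed to recover `ghosts` lost tasks,
    assuming each request returns a full batch (an optimistic assumption)."""
    return math.ceil(ghosts / per_request)


# 329 is the ghost count reported above; 20 is the per-request cap.
print(requests_needed(329))  # 17
```

So even in the best case it is 17 successful requests, spread over however long the cache takes to make room for them.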

PS I have received none so far as well.

One of my requests has got a resend now, but other machines are still dry. Never mind, the journey of a thousand WUs starts with a single crunch...

Fred E. (Project donor)
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 7
United States
Message 1302280 - Posted: 4 Nov 2012, 21:34:16 UTC
Last modified: 4 Nov 2012, 21:35:26 UTC

I had set NNT so as to run down my cache before switching off my machine. When I looked after it had finished crunching I had 20 or so ghosts, so I unset NNT for a while to try and see if I could get them. I now appear to have 329 tasks I haven't got.

This is much more than my cache is set for, will I still get them?

PS I have received none so far as well.

When they start flowing again, you should get some until BOINC stops asking for work due to the cache setting. No request = no work. You may have to increase that setting, or just be patient and get them over a period of days.

Edit: Richard beat me to it.
____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

Keith White
Joined: 29 May 99
Posts: 370
Credit: 2,838,916
RAC: 2,222
United States
Message 1302289 - Posted: 4 Nov 2012, 21:59:32 UTC

Well there was a dip in the cricket graphs and about then I got 20 members of my Ghost Army downloaded. However now that the cricket graphs are back up, I'm getting scheduler timeouts again as it's trying to get my ghosts and report newly done units. :(
____________
"Life is just nature's way of keeping meat fresh." - The Doctor


Copyright © 2014 University of California