Panic Mode On (78) Server Problems?

Message boards : Number crunching : Panic Mode On (78) Server Problems?

kittyman Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1302178 - Posted: 4 Nov 2012, 17:31:58 UTC - in response to Message 1302169.  

Anybody got any idea whether we're being haunted by Astropulse ghosts, to the same degree?

I think it's getting to the point that I need to put out an urgent APB, reluctant though I am to do so on a Sunday - just to get a lid put on the situation, until they can look at it tomorrow or Tuesday.

The question is - is stopping the MB splitters enough, or should I ask them to stop AP as well?

With splitters stopped, and no new work entering the system, we could allow work fetch and fish for resends of the work we've already supposedly got.

I would just stop the AP splitters.
MB was chugging along just fine until the AP work entered the picture again.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1302178
Fred E.
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1302179 - Posted: 4 Nov 2012, 17:35:13 UTC

The question is - is stopping the MB splitters enough, or should I ask them to stop AP as well?

With splitters stopped, and no new work entering the system, we could allow work fetch and fish for resends of the work we've already supposedly got.


The AP numbers are not as large, but I have 8 lost AP tasks that were not sent when the scheduler gave me new work instead. The "results out in the field" figures are pretty big: 10,685,000 for MB and 137k for AP. I suggest they try to shut off both.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1302179
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1302180 - Posted: 4 Nov 2012, 17:38:37 UTC - in response to Message 1302037.  

One rig couldn't connect to the SETI servers and displayed the message:
don't need a network connection
?!

After reinstalling BOINC several times, my account settings in SETI were altered?!
And in Malaria and Rosetta, too.
My BOOT-drive (C:) was also shared.

In too many cases, it's not a good idea to run multiple projects,
on 1 host.

Apologies to my wingmen for a few hundred MB WUs, which could not be uploaded
due to this network issue and timed out.



ID: 1302180
fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1302181 - Posted: 4 Nov 2012, 17:41:36 UTC

As Marcel said..."Just shoot up here amongst us, one of us has got to have some relief."

:)
ID: 1302181
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1302186 - Posted: 4 Nov 2012, 17:54:21 UTC - in response to Message 1302178.  

I would just stop the AP splitters.
MB was chugging along just fine until the AP work entered the picture again.

I sympathise with that assessment of the possible trigger, but I think we've got beyond that point now.

The database is horrendously bloated - well over 10.5 million tasks supposedly 'out in the field', which is 50% more than usual. The biggest problem right now isn't communications - when you get resends, they come down quite smoothly - but the slow "thinking time" response of the scheduler. I don't think simply stopping AP production on its own will free up enough scheduler and database resources to stop the timeouts and the creation of new ghosts.

On the other hand, I agree stopping MB production is a drastic step, and it will impact loads of volunteers with hosts like my little one-core server - which has been plinking along exactly as designed, getting new tasks as needed (and rotating through three different projects). No surplus fat to live off there - one task in progress it says, and I can see it running now.

But my gut feeling is saying, quite strongly, that recovery from this problem is going to take an outage of some sort - and the sooner we start it, the shorter it will be.
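
As a rough check on the figures above: if 10.5 million tasks in the field is 50% more than usual, the usual level works out to about 7 million. A minimal sketch of that arithmetic, using only the two numbers quoted in this post (the variable names are illustrative, and "well over 10.5 million" makes this a lower bound):

```python
# Figures quoted in the post; the names are illustrative only.
tasks_in_field = 10_500_000   # "well over 10.5 million", so a lower bound
excess_fraction = 0.50        # "50% more than usual"

# If current = usual * (1 + excess), then usual = current / (1 + excess).
usual_level = tasks_in_field / (1 + excess_fraction)
surplus = tasks_in_field - usual_level

print(f"implied usual level: {usual_level:,.0f}")
print(f"surplus over usual:  {surplus:,.0f}")
```

On those assumptions, roughly a third of what the database lists as in progress is surplus over the usual level.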
ID: 1302186
kittyman Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1302188 - Posted: 4 Nov 2012, 17:59:17 UTC - in response to Message 1302186.  

I would just stop the AP splitters.
MB was chugging along just fine until the AP work entered the picture again.

I sympathise with that assessment of the possible trigger, but I think we've got beyond that point now.

The database is horrendously bloated - well over 10.5 million tasks supposedly 'out in the field', which is 50% more than usual. The biggest problem right now isn't communications - when you get resends, they come down quite smoothly - but the slow "thinking time" response of the scheduler. I don't think simply stopping AP production on its own will free up enough scheduler and database resources to stop the timeouts and the creation of new ghosts.

On the other hand, I agree stopping MB production is a drastic step, and it will impact loads of volunteers with hosts like my little one-core server - which has been plinking along exactly as designed, getting new tasks as needed (and rotating through three different projects). No surplus fat to live off there - one task in progress it says, and I can see it running now.

But my gut feeling is saying, quite strongly, that recovery from this problem is going to take an outage of some sort - and the sooner we start it, the shorter it will be.

Well, I would not be the one to second guess your intuition, Richard.
You seldom are off base.

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1302188
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1302198 - Posted: 4 Nov 2012, 18:13:35 UTC - in response to Message 1302188.  


There are going to be tens of thousands of timeouts when we run out of shorties & all the VLAR resends start going to NVidia GPUs.

Overnight I got a few resends on my systems, but it's still mostly timeouts. There's been the odd HTTP error, the occasional "couldn't reach Scheduler", and when it was reached, "Project has no tasks available".
But mostly it's still timeouts.
It's been that way pretty much since a few hours after the last weekly outage, and I notice the database queries are still over 1,000/s, where less than 800/s is the usual number.
Grant
Darwin NT
ID: 1302198
Highlander
Joined: 5 Oct 99
Posts: 167
Credit: 37,987,668
RAC: 16
Germany
Message 1302206 - Posted: 4 Nov 2012, 18:24:54 UTC

Regarding AP ghosts: my host http://setiathome.berkeley.edu/show_host_detail.php?hostid=5553346, which only does AP, has no ghost units at all so far. My other host, which is set up as MB-only, has some, around 100. Only on that host do I manually suspend network communication (opening it once a day for 2-3 hours), since I don't want to put extra load on the scheduler because of NNT.
- Performance is not a simple linear function of the number of CPUs you throw at the problem. -
ID: 1302206
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 1302238 - Posted: 4 Nov 2012, 20:00:18 UTC

What I'm seeing now.

This is about MB units, I don't do APs.

Now that the "ready to send" queue has dried up, I've been able to get scheduler requests completed even when allowing new tasks. I'm not getting any new units, but I would expect that with the ready queue running on vapors. However, Ghost Detector shows I currently have 196 ghost units that were assigned but never downloaded, and I should be getting those, eventually. All ATI, almost all shorties BTW.

Problem is that's nearly half of my "in progress" units. And the vast majority of the ones I do have are shorties. I'm guessing I'm down to a couple of days worth of units locally right now.

Que sera sera.
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 1302238
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1302257 - Posted: 4 Nov 2012, 20:41:00 UTC

I've just had a note back from Eric:

I've stopped the splitters and doubled the httpd timeout...

I think we're going to need to at least temporarily go back
to restricting workunits in progress on a per host basis and per RPC
basis, regardless of what complaints we get about people being unable
to keep their hosts busy.

The splitters are already showing red/orange on the server status page, and 'ready to send' is as near zero as makes no difference (there'll always be a few errors and timeouts to resend). So I'm going to turn off NNT and see what happens - let's see if we can help get this beast back under control.
ID: 1302257
fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1302262 - Posted: 4 Nov 2012, 20:48:04 UTC - in response to Message 1302257.  

I've just had a note back from Eric:

I've stopped the splitters and doubled the httpd timeout...

I think we're going to need to at least temporarily go back
to restricting workunits in progress on a per host basis and per RPC
basis, regardless of what complaints we get about people being unable
to keep their hosts busy.

The splitters are already showing red/orange on the server status page, and 'ready to send' is as near zero as makes no difference (there'll always be a few errors and timeouts to resend). So I'm going to turn off NNT and see what happens - let's see if we can help get this beast back under control.


I have NNT turned off on three machines with empty caches and lots of ghost tasks. So far all I get is "Project has no tasks available".
ID: 1302262
Fred E.
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1302265 - Posted: 4 Nov 2012, 20:52:09 UTC

I decided to try to get some of my lost tasks since the splitters are disabled and there is no new work available. However, I didn't get any of my 467 lost tasks. I got the "no tasks available" message on two attempts. Will have to see how this plays out. I'm back to NNT for now.

Database queries are still high but I got fast responses on those requests - 7 and 8 seconds for work requests ain't bad. Bet Synergy would always run well without the splitting duties.

Need a pool to guess how many ghosts are out there, but guess we won't see the number.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1302265
rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1302266 - Posted: 4 Nov 2012, 20:55:53 UTC

Lost tasks come back "automagically".....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1302266
bill
Joined: 16 Jun 99
Posts: 861
Credit: 29,352,955
RAC: 0
United States
Message 1302273 - Posted: 4 Nov 2012, 21:11:25 UTC - in response to Message 1302180.  

"In too many cases, it's not a good idea to run multiple projects,
on 1 host."

What data do you have to support that assumption?
And your personal experiences are too small a data point to be
statistically significant. I offset them with mine of running 8 projects without
having any of the problems you have.

At the moment I've been out of SETI GPU work for a while, running Einstein instead. I'm down to 56 CPU APs (about 8 days' worth) that are alternating with LHC and Rosetta. When the servers first started screwing up this weekend I went NNT on SETI. Call it foresight, experience, luck, whatever, but I have 0 ghosts and continuous work to do on my cruncher. When the lab boys get this little snafu fixed I'll be waiting, because I have what seems to be in short supply around here

patience.
ID: 1302273
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1302275 - Posted: 4 Nov 2012, 21:12:58 UTC - in response to Message 1302266.  

Lost tasks come back "automagically".....

Not yet, they haven't. I'm with Fred and fscheel so far - really quick turnround on requests, but always "Project has no tasks available". I'll let them keep asking, and see what happens over the next few hours.
ID: 1302275
AllanB
Joined: 2 Sep 12
Posts: 282
Credit: 425,090
RAC: 0
United Kingdom
Message 1302276 - Posted: 4 Nov 2012, 21:13:11 UTC
Last modified: 4 Nov 2012, 21:14:53 UTC

I had set NNT so as to run down my cache before switching off my machine. When I looked after it had finished crunching I had 20 or so ghosts, so I unset NNT for a while to see if I could get them. I now appear to have 329 tasks I haven't got.

This is much more than my cache is set for. Will I still get them?

PS I have received none so far as well.
ID: 1302276
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1302279 - Posted: 4 Nov 2012, 21:31:16 UTC - in response to Message 1302276.  

I had set NNT so as to run down my cache before switching off my machine. When I looked after it had finished crunching I had 20 or so ghosts, so I unset NNT for a while to see if I could get them. I now appear to have 329 tasks I haven't got.

This is much more than my cache is set for. Will I still get them?

You should get them as you request work over the next few days, no more than 20 tasks per request. Your computer won't be suddenly overwhelmed with work, and if you don't get through them all in time, don't worry - they'll get sent to somebody else instead.

PS I have received none so far as well.

One of my requests has got a resend now, but other machines are still dry. Never mind, the journey of a thousand WUs starts with a single crunch...
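
To put the 20-per-request limit in perspective, here is a minimal sketch of how many scheduler requests it takes, at the very least, to recover a given number of lost tasks. It assumes every request succeeds and returns a full batch, which is the best case and not what people are seeing right now. The function and constant names are hypothetical, not part of BOINC.

```python
import math

MAX_TASKS_PER_REQUEST = 20  # resend limit per request, as quoted in this post


def min_requests(lost_tasks: int, per_request: int = MAX_TASKS_PER_REQUEST) -> int:
    """Fewest scheduler requests needed if every request returns a full batch."""
    if lost_tasks < 0 or per_request <= 0:
        raise ValueError("lost_tasks must be >= 0 and per_request must be > 0")
    return math.ceil(lost_tasks / per_request)


# Lost-task counts reported by posters in this thread.
for lost in (196, 329, 467):
    print(lost, "lost tasks ->", min_requests(lost), "requests at minimum")
```

For the 329 tasks asked about here, that is at least 17 successful requests, which is why recovery is spread over days rather than arriving in one lump.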
ID: 1302279
Fred E.
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1302280 - Posted: 4 Nov 2012, 21:34:16 UTC
Last modified: 4 Nov 2012, 21:35:26 UTC

I had set NNT so as to run down my cache before switching off my machine. When I looked after it had finished crunching I had 20 or so ghosts, so I unset NNT for a while to see if I could get them. I now appear to have 329 tasks I haven't got.

This is much more than my cache is set for. Will I still get them?

PS I have received none so far as well.

When they start flowing again, you should get some until BOINC stops asking for work due to the cache setting. No request = no work. You may have to increase that setting, or just be patient and get them over a period of days.

Edit: Richard beat me to it.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1302280
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 1302289 - Posted: 4 Nov 2012, 21:59:32 UTC

Well there was a dip in the cricket graphs and about then I got 20 members of my Ghost Army downloaded. However now that the cricket graphs are back up, I'm getting scheduler timeouts again as it's trying to get my ghosts and report newly done units. :(
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 1302289
AllanB
Joined: 2 Sep 12
Posts: 282
Credit: 425,090
RAC: 0
United Kingdom
Message 1302308 - Posted: 4 Nov 2012, 22:27:27 UTC

Just got two lots of 20!!
ID: 1302308
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.