Panic Mode On (78) Server Problems?


AllanB
Joined: 2 Sep 12
Posts: 280
Credit: 425,090
RAC: 0
United Kingdom
Message 1302308 - Posted: 4 Nov 2012, 22:27:27 UTC

Just got two lots of 20!!

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8465
Credit: 48,942,335
RAC: 75,622
United Kingdom
Message 1302310 - Posted: 4 Nov 2012, 22:37:06 UTC

I'm starting to get the hang of this. If your cache is a long way below normal, it helps to reduce your cache size settings - that way you're not asking for so much in one go.

When you're recovering from dehydration, take small sips of water, not great big gulps.
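
If you'd rather script that than click through the web preferences, something along these lines does it. Only a sketch: the data directory and the 0.5 + 0.25 day figures are assumptions, so adjust them for your own host (and put them back up once the servers recover).

# Shrink the BOINC cache by writing a global_prefs_override.xml and asking
# the running client to re-read it. The data directory and day values are
# only examples - Windows and Mac installs keep the file somewhere else.
import subprocess
from pathlib import Path

BOINC_DATA_DIR = Path("/var/lib/boinc-client")  # assumed Linux default

OVERRIDE = """<global_preferences>
  <work_buf_min_days>0.5</work_buf_min_days>
  <work_buf_additional_days>0.25</work_buf_additional_days>
</global_preferences>
"""

(BOINC_DATA_DIR / "global_prefs_override.xml").write_text(OVERRIDE)
subprocess.run(["boinccmd", "--read_global_prefs_override"], check=True)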

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3238
Credit: 31,692,237
RAC: 5,989
Netherlands
Message 1302324 - Posted: 4 Nov 2012, 23:27:13 UTC - in response to Message 1302310.
Last modified: 4 Nov 2012, 23:47:55 UTC

I'm starting to get the hang of this. If your cache is a long way below normal, it helps to reduce your cache size settings - that way you're not asking for so much in one go.

When you're recovering from dehydration, take small sips of water, not great big gulps.


So very true (in both cases): a smaller cache, e.g. 3 days (or less) plus an additional 2 (or 1), does work better and also gives a shorter turnaround time.
Less work to report in one go and less work needed per day; if we *all* ask for 10 + 10 days, we're surely in for SERVER trouble... :-\

In Holland we have a saying: a donkey doesn't hit the same stone twice.

tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2723
Credit: 208,584,648
RAC: 502,286
United States
Message 1302394 - Posted: 5 Nov 2012, 5:03:37 UTC - in response to Message 1302257.
Last modified: 5 Nov 2012, 5:06:38 UTC

I've just had a note back from Eric:

I've stopped the splitters and doubled the httpd timeout...

I think we're going to need to at least temporarily go back
to restricting workunits in progress on a per host basis and per RPC
basis, regardless of what complaints we get about people being unable
to keep their hosts busy.

The splitters are already showing red/orange on the server status page, and 'ready to send' is as near zero as makes no difference (there'll always be a few errors and timeouts to resend). So I'm going to turn off NNT and see what happens - let's see if we can help get this beast back under control.
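
For anyone wondering what "restricting workunits in progress on a per host basis and per RPC basis" would actually look like: the BOINC server software already has config.xml options for both. The sketch below uses standard option names, but the project path and the limits are my guesses, not anything Eric has announced.

# Illustrative only: set per-host and per-RPC limits in a BOINC project's
# config.xml. Option names are standard BOINC server options; the path and
# the numeric limits here are assumptions.
import xml.etree.ElementTree as ET

CONFIG = "/home/boincadm/projects/sah/config.xml"  # assumed project path

tree = ET.parse(CONFIG)
cfg = tree.getroot().find("config")  # options live under <boinc><config>

def set_option(tag, value):
    node = cfg.find(tag)
    if node is None:
        node = ET.SubElement(cfg, tag)
    node.text = str(value)

set_option("max_wus_in_progress", 20)      # per-host cap, scaled by CPU count
set_option("max_wus_in_progress_gpu", 20)  # per-host cap for GPU tasks
set_option("max_wus_to_send", 10)          # cap on jobs handed out per scheduler RPC

tree.write(CONFIG)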


Richard,

The LAST thing I want to do is get into some sort of trouble, but I read this several hours ago and it's been bugging me ever since.

Does Eric know you couldn't report 6 tasks any better than you could report 6,000?

I'm not talking about "limiting" the reporting to 6 at a time. I'm saying that if all I had was 6 tasks, I couldn't report them.

If there's some really esoteric reason limiting a machine to 20 work units means that another machine would be able to report 6, I can't fathom it.

I can't even make up a story that sounds plausible.

Nor do I understand why using a proxy would eliminate the problem with reporting. I can't invent a reason why it would be better or worse than restricting work units in progress.

I already KNOW I don't know what I'm talking about, but it would make me feel better if someone would explain in layman's terms how Eric's fix might fix a problem that can be overcome by using a proxy.

Methinks Eric "knows" what the problem is; but he really doesn't.

msattler (Project donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 38922
Credit: 578,695,675
RAC: 515,736
United States
Message 1302396 - Posted: 5 Nov 2012, 5:07:03 UTC

Well, the kitties won't be happy having their caches limited, but I guess if that's what it takes to right the ship........
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2723
Credit: 208,584,648
RAC: 502,286
United States
Message 1302401 - Posted: 5 Nov 2012, 5:41:26 UTC - in response to Message 1302396.
Last modified: 5 Nov 2012, 5:45:22 UTC

Well, the kitties won't be happy having their caches limited, but I guess if that's what it takes to right the ship........



You know what? I've just edited this message away.

It doesn't matter.

The obvious doesn't matter, the occult doesn't matter, it just doesn't matter.

msattler (Project donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 38922
Credit: 578,695,675
RAC: 515,736
United States
Message 1302403 - Posted: 5 Nov 2012, 5:50:05 UTC - in response to Message 1302401.

I didn't say I agreed with it, or understand the logic behind it.
But, I am not Eric.
Things went south after last Tuesday's outage.
And personally, I don't see what cache sizes have to do with it.
All was working fine with AP out of the picture. Caches were filled, comms were good, all appeared to be well.
AP fired up, and everything went to Hades in a handbasket.
Could be coincidence, I dunno.

Splitters are off now, and the bandwidth is probably gonna stay maxed out resending ghost tasks for quite a while.

And I stand corrected, tbret....
It does appear that there is a gremlin in the scheduler, and bandwidth is NOT the only problem right now.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Keith White
Joined: 29 May 99
Posts: 370
Credit: 2,773,272
RAC: 2,058
United States
Message 1302404 - Posted: 5 Nov 2012, 5:56:59 UTC
Last modified: 5 Nov 2012, 6:11:03 UTC

Well, all's right with the world now. Ghosts have been downloaded and scheduler requests are working. Okay, there aren't any new units being made right now, but the odd updating behavior and ghost generation are fixed, at least for now. Just waiting for the cricket graph to drop off as the download backlog is cleared up.

Edit: My only problem now is that I have a full 6-day queue for my ATI MB cruncher but only about 2 days for the CPU MB cruncher.
____________
"Life is just nature's way of keeping meat fresh." - The Doctor

musicplayer
Joined: 17 May 10
Posts: 1431
Credit: 687,186
RAC: 3
Message 1302410 - Posted: 5 Nov 2012, 6:31:24 UTC
Last modified: 5 Nov 2012, 6:34:04 UTC

I get the sense that there are currently many tasks which have been returned to the server and are listed as "Completed, waiting for validation".

Meaning that a wingman or two still hasn't completed his or her task, so the task in question can't be validated yet.

Anyway, deadlines in this project are quite generous, at least compared with PrimeGrid, where you are expected to complete a 10-hour task (CPU time, that is) within 2 days.

Would shortening the deadlines, or perhaps extending them even further, help alleviate or reduce the problem or problems?

My best guess is that if there are tasks readily available for processing, I would eventually get them the usual way.

If many tasks have been returned but are still awaiting validation, I assume there may be difficulties in receiving new tasks even when some are available.

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8309
Credit: 55,252,819
RAC: 75,318
United Kingdom
Message 1302411 - Posted: 5 Nov 2012, 6:49:09 UTC

I doubt that the number of tasks awaiting validation is an issue - there is plenty of disk space, and the server doing the validation is well up to it.
Just now there are about 10,000,000 tasks "out in the field" and about 7,600,000 tasks awaiting validation, and no new tasks are being created as all the splitters are down for one reason or another. What is bemusing is that the query rate is sitting at about 1,200 qps against the norm of 700-800 qps, and it has been sitting around there since the last outage...
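
I don't know exactly where the status page pulls that figure from, but on a MySQL box the queries-per-second number is normally just the delta of the server's 'Questions' counter over a sampling interval. A rough sketch, assuming the mysql command-line client can log in from the account running it:

# Sample MySQL's global 'Questions' counter twice and report queries/second.
import re
import subprocess
import time

def questions():
    out = subprocess.run(
        ["mysql", "-N", "-e", "SHOW GLOBAL STATUS LIKE 'Questions'"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(re.search(r"\d+", out).group())

INTERVAL = 10  # seconds between samples
before = questions()
time.sleep(INTERVAL)
after = questions()
print(f"~{(after - before) / INTERVAL:.0f} qps")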
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Mad Fritz
Joined: 20 Jul 01
Posts: 87
Credit: 11,334,904
RAC: 0
Switzerland
Message 1302418 - Posted: 5 Nov 2012, 8:10:57 UTC - in response to Message 1302411.

On a side note - suddenly I have around 2900 WUs with timed out errors...
Anyone else experienced the same?

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5791
Credit: 58,028,572
RAC: 48,174
Australia
Message 1302421 - Posted: 5 Nov 2012, 8:23:26 UTC - in response to Message 1302418.

On a side note - suddenly I have around 2900 WUs with timed out errors...
Anyone else experienced the same?

Not yet, but it will happen when VALRs get re-issued to the CUDA device instead of the CPU.
____________
Grant
Darwin NT.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5791
Credit: 58,028,572
RAC: 48,174
Australia
Message 1302424 - Posted: 5 Nov 2012, 8:26:25 UTC - in response to Message 1302257.
Last modified: 5 Nov 2012, 8:28:54 UTC

I've just had a note back from Eric:

I've stopped the splitters and doubled the httpd timeout...
....

So does that mean it will take 10 minutes for it to time out now?
I figure if it's not going to respond within 5 minutes, that's as good a time as any for it to time out.
Usually, when it did respond while the timeouts were at their worst, it was within a couple of minutes; when things are going well, most responses are within 20 seconds or so.

Do we know why the Scheduler is having such a hard time keeping up with the load - more RAM required, a faster disk subsystem? A new system?
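
For what it's worth, the stock Apache Timeout is 300 seconds, so "doubled" would indeed mean a request can hang for up to 10 minutes before the server gives up on it. A crude way to watch how long a reply actually takes from this end - the URL below is only a placeholder, not the project's real scheduler address:

# Time a single request to the (placeholder) scheduler URL, giving up after
# the assumed doubled httpd timeout of 600 seconds.
import time
import urllib.error
import urllib.request

SCHEDULER_URL = "http://example.invalid/sah_cgi/cgi"  # placeholder only
HTTPD_TIMEOUT = 600  # assumption: Apache default of 300 s, doubled

start = time.monotonic()
try:
    urllib.request.urlopen(SCHEDULER_URL, timeout=HTTPD_TIMEOUT)
except urllib.error.URLError as err:
    print(f"no reply after {time.monotonic() - start:.0f} s: {err.reason}")
else:
    print(f"replied in {time.monotonic() - start:.0f} s")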
____________
Grant
Darwin NT.

Mad Fritz
Joined: 20 Jul 01
Posts: 87
Credit: 11,334,904
RAC: 0
Switzerland
Message 1302425 - Posted: 5 Nov 2012, 8:29:00 UTC - in response to Message 1302421.

Not yet, but it will happen when VALRs get re-issued to the CUDA device instead of the CPU.

Hmm, but they were all initially sent to me as CUDA tasks, as I don't ask for CPU work.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5791
Credit: 58,028,572
RAC: 48,174
Australia
Message 1302426 - Posted: 5 Nov 2012, 8:30:37 UTC - in response to Message 1302425.

Not yet, but it will happen when VALRs get re-issued to the CUDA device instead of the CPU.

Hmm, but they were all initially sent as CUDAs to me as I don't ask for CPU-work.

Might be worth posting a link to a few of them so those that know about these things can have a look.
____________
Grant
Darwin NT.

Tron
Joined: 16 Aug 09
Posts: 180
Credit: 2,236,055
RAC: 0
United States
Message 1302428 - Posted: 5 Nov 2012, 8:40:33 UTC

So much for trying to heat the greenhouse with SETI power tonight... hope my plants don't freeze.
Empty cache on the antique IBM watt-hog.

Mad Fritz
Joined: 20 Jul 01
Posts: 87
Credit: 11,334,904
RAC: 0
Switzerland
Message 1302432 - Posted: 5 Nov 2012, 8:58:43 UTC - in response to Message 1302426.

Might be worth posting a link to a few of them so those that know about these things can have a look.

http://setiathome.berkeley.edu/results.php?hostid=6167352&offset=0&show_names=0&state=6&appid=

Pick one ;-) Hope it helps...

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8465
Credit: 48,942,335
RAC: 75,622
United Kingdom
Message 1302467 - Posted: 5 Nov 2012, 12:43:19 UTC

So, how are we all doing this fine morning?

I found these graphs instructive. I've made a fixed copy - I don't know whether the site is happy about live linking - so this is a snapshot of the position just after 12:00 UTC today (graph times are UTC+1).

We've brought down 'Results in Progress' by over a million overnight, which can only be good for the health of the servers.

We can also see clearly how we got into such a mess yesterday. Somewhere round about 5am UTC on Sunday morning (late Saturday evening in Berkeley), some 300,000 tasks suddenly jumped from 'Ready to Send' to 'Results in Progress'. My guess is that they all became ghosts, but I've no idea why - late Halloween party in the server closet, perhaps? I'd love to be a fly on the wall in this morning's staff meeting while they scratch their heads over that one.

Anyway, back to the present. I'm finding that for hosts which have ghosts in the database (mainly fast hosts with large caches), I'm able to get them resent reasonably easily - provided I don't ask for too much at once. Large work requests are still hitting the timeout. But slower hosts or hosts with smaller caches - which haven't got any ghosts - aren't able to get any new work at all.

Mark Sattler posted an interesting theory yesterday. He wondered whether asking Synergy to run the Scheduler, several MB splitters, and several AP splitters all at the same time might have been too much, and caused the initial slowdown we saw after maintenance last week. Sounds plausible to me.

I've passed it on to the staff, and suggested that they might consider restarting the splitters on Lando - two of each - to provide a trickle of new work for smaller users who are currently getting nothing, while the power users amongst us work our way through the rest of the lost results. We'll see what they make of it.

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5229
Credit: 285,028,834
RAC: 453,466
Brazil
Message 1302468 - Posted: 5 Nov 2012, 12:59:15 UTC - in response to Message 1302467.
Last modified: 5 Nov 2012, 13:07:18 UTC

Mark Sattler posted an interesting theory yesterday. He wondered whether asking Synergy to run the Scheduler, several MB splitters, and several AP splitters all at the same time might have been too much, and caused the inital slowdown we saw after maintenance last week. Sounds plausible to me.

Richard

Congrats, now you are on the right path - I was talking about that months ago. The problem always returns when the AP splitters start.

Maybe a clue: put fewer AP splitters to work for a while and see what happens; we could all be surprised by the results.

Another clue: during the last problem I was able to DL (>150 kbps), UL and report everything with the help of a proxy with no problem (without a proxy: DL <1 kbps, UL OK, report NO). That's interesting, because it points away from a pure bandwidth problem (the proxy uses the same bandwidth). Talk about that with the others at the lab - it could show another path to follow too.

Have a good week.

fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1302476 - Posted: 5 Nov 2012, 13:56:20 UTC

Success!!!
I have now cleared my ~2000 lost tasks.
Life is good.

Frank
