Panic Mode On (78) Server Problems?


Fred E. (Project donor)
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 15
United States
Message 1302482 - Posted: 5 Nov 2012, 14:32:16 UTC

Success!!!
I have now cleared my ~2000 lost tasks.
Life is good.

Frank

I also cleared my 476 lost tasks overnight and earlier this AM. Tried Richard's suggestion on lowering the cache settings, but it didn't help in my case. Couldn't get scheduler for 5-6 hours after the splitters were disabled. I went back to my normal 5.75 days and eventually got them.

There's another issue besides the timeouts. Why did the Scheduler keep assigning work when we already had lost tasks? In the past, it has always filled those first. When mine came down, I was still getting the "no tasks available" (empty feeder) message at the end of each batch of 20, suggesting it was still trying to assign new tasks. I think that may need some looking into; it was the potential for very large numbers that got me worried.

As to the timeouts, I've also been in the "too much load on Synergy" camp, but I'm not so sure now after seeing how long it took me to connect after the splitters were disabled. I don't buy the bandwidth argument as a sole cause, but it certainly contributes, and some packet dumping router may have a role after the load gets heavy. Is there another possibility - database contention or something like that? I freely admit I don't know much about the issue, just that it sometimes caused problems with strange symptoms during my working years.
____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

Josef W. Segur (Project donor)
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4252
Credit: 1,050,045
RAC: 235
United States
Message 1302499 - Posted: 5 Nov 2012, 16:42:41 UTC - in response to Message 1302432.

Might be worth posting a link to a few of them so those that know about these things can have a look.

http://setiathome.berkeley.edu/results.php?hostid=6167352&offset=0&show_names=0&state=6&appid=

Pick one ;-) Hope it helps...

WU 1109239375 is enough to demonstrate that they weren't all VLAR. They were judged infeasible for some other reason.

All 2877 were expired between 3:57:47 UTC and 3:57:54 UTC, so the database or other server delays apparently took about 7 seconds to get through that long list of "lost" tasks.
Joe
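
As a rough check on the figures Joe quotes, the expiry rate can be worked out from the two timestamps. This is only a sketch: the function name `expiry_rate` is made up for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime

def expiry_rate(count, first, last, fmt="%H:%M:%S"):
    """Tasks expired per second between two same-day timestamps (hypothetical helper)."""
    elapsed = (datetime.strptime(last, fmt) - datetime.strptime(first, fmt)).total_seconds()
    return count / elapsed

# Figures from the post: 2877 tasks expired between 3:57:47 and 3:57:54 UTC.
rate = expiry_rate(2877, "3:57:47", "3:57:54")
print(f"{rate:.0f} tasks expired per second")  # 411 tasks expired per second
```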

Tron
Joined: 16 Aug 09
Posts: 180
Credit: 2,236,055
RAC: 0
United States
Message 1302510 - Posted: 5 Nov 2012, 17:24:04 UTC

ok, now all my machines are empty ...what the heck are you guys doing?

turn the work back on! it's freezing in here :P

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5831
Credit: 59,372,163
RAC: 47,373
Australia
Message 1302527 - Posted: 5 Nov 2012, 18:13:45 UTC - in response to Message 1302467.

Mark Sattler posted an interesting theory yesterday. He wondered whether asking Synergy to run the Scheduler, several MB splitters, and several AP splitters all at the same time might have been too much, and caused the initial slowdown we saw after maintenance last week. Sounds plausible to me.

Take a look at the database graphs: usual activity these days is around 700-800 queries/s. Until the splitters were shut down, it didn't drop below 1,000/s, with sustained periods just below 1,500/s and many peaks over 1,500/s.
Even now there are many surges to 1,500/s+, but it's also dropping down to 700/s or less on occasion.
____________
Grant
Darwin NT.

Mad Fritz
Joined: 20 Jul 01
Posts: 87
Credit: 11,334,904
RAC: 0
Switzerland
Message 1302528 - Posted: 5 Nov 2012, 18:13:50 UTC

@Joe
Thanks for looking into it :-)
Am I right that there will not really be any serious harm as the tasks were given to other crunchers in the meantime?

Andy
____________

N9JFE David S (Project donor)
Volunteer tester
Joined: 4 Oct 99
Posts: 11588
Credit: 14,343,495
RAC: 13,281
United States
Message 1302589 - Posted: 5 Nov 2012, 20:17:29 UTC - in response to Message 1302155.

My notional list of "work in progress" has gone up from 1,500 to 2,100 in the last two hours.

Everything is getting set to NNT and staying there, until I see zero tasks ready to send AND the splitters disabled.

It has been 27 hours since you said that. The ready to send has been at or near 0 (I assume it only ticks upward because of occasional timeout reassignments) and the splitters off for six hours that I'm aware of, probably a lot longer, and the Crickets are still maxed out! There was a mild downspike yesterday and an even smaller one just now, but there can't possibly still be that many ghost resends going on, can there? It's got me wondering if either something is wrong with the servers or there's an outside DoS attack going on. Or perhaps a web spider slipped through the filters and is trying to catalog every one of those 9 million results out in the field and 7 million waiting for validation, or something like that.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


N9JFE David S (Project donor)
Volunteer tester
Joined: 4 Oct 99
Posts: 11588
Credit: 14,343,495
RAC: 13,281
United States
Message 1302593 - Posted: 5 Nov 2012, 20:35:23 UTC

I was going to remark on the high number of error WUs for my i7 where I had a short time timeout and now my original wingman and my replacement have both had natural timeouts and it's been sent to two more hosts, but now I'm wondering if the first two hosts completed the work and have been unable to upload and report due to the server problems the last few days. (And if that's the case, they'll eventually report late and the WUs will end up stuck and take even longer to disappear off my error list.)

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 6998
Credit: 26,672,659
RAC: 32,787
United Kingdom
Message 1302595 - Posted: 5 Nov 2012, 20:53:50 UTC

, and the Crickets are still maxed out!


Seems to be slowly dropping back, hopefully this is a good sign!
____________


Today is life, the only life we're sure of. Make the most of today.

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8420
Credit: 57,407,439
RAC: 74,701
United Kingdom
Message 1302597 - Posted: 5 Nov 2012, 21:01:01 UTC

Dave (N9JFE)
A look at one of your timed-out tasks shows that it was sent to you with an "impossible" deadline, so you timed out; the same thing happened, at the same time, to your wingman. The task has now been sent out to two more crunchers, one of whom has reported, while the other is still "in progress". Since the task was delivered to you in September and faulted on the same day, it is pretty safe to say that task is not under the influence of the current server woes.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

N9JFE David S (Project donor)
Volunteer tester
Joined: 4 Oct 99
Posts: 11588
Credit: 14,343,495
RAC: 13,281
United States
Message 1302623 - Posted: 5 Nov 2012, 21:47:40 UTC - in response to Message 1302597.

Dave (N9JFE)
A look at one of your timed-out tasks shows that it was sent to you with an "impossible" deadline, so you timed out; the same thing happened, at the same time, to your wingman. The task has now been sent out to two more crunchers, one of whom has reported, while the other is still "in progress". Since the task was delivered to you in September and faulted on the same day, it is pretty safe to say that task is not under the influence of the current server woes.

I know what happened to me (and also to my wingman in at least one case). What I found remarkable was that so many of my timeouts have now had full-time (i.e., not impossible) timeouts by the other users, and I wondered if *they* might be caused by the server problems.

But that's a minor issue.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2268
Credit: 8,713,548
RAC: 4,075
United States
Message 1302645 - Posted: 5 Nov 2012, 22:45:00 UTC

Well the Crickets aren't maxed out anymore, and I just made a scheduler contact and it took three seconds to acknowledge five completed APs. That's pretty quick.

Regarding having too many AP splitters going.. I've been wondering/asking for a while now if we can just knock it down to one AP splitter. If you load up 10 full tapes and let the splitters go as fast as they can, AP finishes all 10 tapes in usually around the same time as MB takes to get through 2-3. Maybe just slow AP down and limit the hindering effect it has on everything else?
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8549
Credit: 50,324,512
RAC: 50,108
United Kingdom
Message 1302665 - Posted: 5 Nov 2012, 23:35:33 UTC

Here comes fun - they've turned on every splitter, and the cricket graph has fallen through the floor. I'm setting NNT and going to bed - tell me about it in the morning.

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5340
Credit: 298,280,933
RAC: 463,124
Brazil
Message 1302681 - Posted: 6 Nov 2012, 0:03:53 UTC
Last modified: 6 Nov 2012, 0:40:19 UTC

The blindest is the one who doesn't want to see...

Hope I'm wrong...

(edit) 05/11/2012 22:33:08 SETI@home Message from server: This computer has reached a limit on tasks in progress

A new limit?
____________

fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1302695 - Posted: 6 Nov 2012, 0:47:40 UTC - in response to Message 1302681.

The blindest is the one who doesn't want to see...

Hope I'm wrong...

(edit) 05/11/2012 22:33:08 SETI@home Message from server: This computer has reached a limit on tasks in progress

A new limit?


I have one PC that's getting that message also. Wonder what it means or what the limit is.

Fred E. (Project donor)
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 15
United States
Message 1302698 - Posted: 6 Nov 2012, 1:08:10 UTC
Last modified: 6 Nov 2012, 1:10:33 UTC

I'm back to timeouts when just reporting on NNT. Wonder what they worked on today?

Edit: The next try was successful, of course. Will try work fetch now, but may go NNT all night.
____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

Khangollo
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1302717 - Posted: 6 Nov 2012, 2:55:11 UTC
Last modified: 6 Nov 2012, 3:00:00 UTC

Timeouts are back, plus now I can't download the tons of lost tasks I still have; when the scheduler is finally successful, all I'm getting is the limit reached message.
Looks like the limits are as ridiculous as they were before (50 tasks/CPU), which means even my slowest hosts are going to make scheduler requests *endlessly*, never able to fill a 5 day cache, only compounding the scheduler overload problem...
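
To put the cache complaint in numbers, here is a back-of-the-envelope sketch. The 50 tasks/CPU limit and the 5 day cache come from the post; the per-task runtimes are purely hypothetical, the helper names are invented for illustration, and the sketch assumes each CPU core works through its tasks one at a time.

```python
def days_of_work(tasks_per_cpu, hours_per_task):
    """Days one CPU core stays busy on a full allotment of tasks."""
    return tasks_per_cpu * hours_per_task / 24

def breakeven_hours(cache_days, tasks_per_cpu):
    """Per-task runtime (hours) at which the limit exactly fills the cache."""
    return cache_days * 24 / tasks_per_cpu

# Hypothetical runtimes; only the limit of 50 is taken from the post.
for hours in (0.5, 1.0, 2.0):
    print(f"{hours} h/task -> {days_of_work(50, hours):.2f} days of work per core")

print(f"break-even: {breakeven_hours(5, 50):.1f} h/task")
```

Under these assumptions, any host that finishes a task in less than the break-even runtime can never reach a 5 day cache and will keep asking for more work.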
____________

Josef W. Segur (Project donor)
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4252
Credit: 1,050,045
RAC: 235
United States
Message 1302719 - Posted: 6 Nov 2012, 3:05:49 UTC - in response to Message 1302528.

@Joe
Thanks for looking into it :-)
Am I right that there will not really be any serious harm as the tasks were given to other crunchers in the meantime?

Andy

Yes, you're right. Just a slight delay in actually getting tasks to 2 hosts.
Joe

Donald L. Johnson (Project donor)
Posts: 6209
Credit: 709,884
RAC: 1,209
United States
Message 1302760 - Posted: 6 Nov 2012, 8:02:22 UTC - in response to Message 1302717.

Timeouts are back, plus now I can't download the tons of lost tasks I still have; when the scheduler is finally successful, all I'm getting is the limit reached message.
Looks like the limits are as ridiculous as they were before (50 tasks/CPU), which means even my slowest hosts are going to make scheduler requests *endlessly*, never able to fill a 5 day cache, only compounding the scheduler overload problem...

So you might as well reduce your cache settings to something close to the limits, to reduce the number of failed scheduler requests and ease the strain on the servers.....
____________
Donald
Infernal Optimist / Submariner, retired

[seti.international] Dirk Sadowski (Project donor)
Volunteer tester
Joined: 6 Apr 07
Posts: 7091
Credit: 60,496,359
RAC: 18,018
Germany
Message 1302761 - Posted: 6 Nov 2012, 8:07:11 UTC
Last modified: 6 Nov 2012, 8:12:11 UTC

Message from server: This computer has reached a limit on tasks in progress


I don't know what it is this time .. (no admin announced it) ..

If I remember correctly, the last time it was max. 50 WUs/CPU-thread and 400 WUs/GPU in BOINC.

So my Intel Core2 Duo E7600 with NVIDIA GeForce GTX260 should get (50 x 2) + 400 = 500 WUs - maybe also this time.
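
The same arithmetic as a small sketch, for anyone who wants to plug in their own host. The default limits are the ones recalled from memory above (not confirmed by an admin), and `task_limit` is just an illustrative name.

```python
def task_limit(cpu_threads, gpus, per_cpu_thread=50, per_gpu=400):
    """Maximum tasks in progress for a host, given the remembered (unconfirmed) limits."""
    return cpu_threads * per_cpu_thread + gpus * per_gpu

# Core2 Duo E7600 (2 threads) with one GTX260
print(task_limit(cpu_threads=2, gpus=1))  # 500
```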


* Best regards! :-) * Sutaru Tsureku, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. *
____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.


Copyright © 2014 University of California