Panic Mode On (78) Server Problems?

Message boards : Number crunching : Panic Mode On (78) Server Problems?



AuthorMessage
Profile Fred E.
Volunteer tester

Send message
Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1302482 - Posted: 5 Nov 2012, 14:32:16 UTC

Success!!!
I have now cleared my ~2000 lost tasks.
Life is good.

Frank

I also cleared my 476 lost tasks overnight and earlier this AM. Tried Richard's suggestion on lowering the cache settings, but it didn't help in my case. Couldn't get scheduler for 5-6 hours after the splitters were disabled. I went back to my normal 5.75 days and eventually got them.

There's another issue besides the timeouts. Why did the scheduler keep assigning work when we already had lost tasks? In the past, it has always filled those first. When mine came down, I was still getting the "no tasks available" (empty feeder) message at the end of each batch of 20, suggesting it was still trying to assign new tasks. I think that may need some looking into - it was the potential for very large numbers that got me worried.

As to the timeouts, I've also been in the "too much load on Synergy" camp, but I'm not so sure now after seeing how long it took me to connect after the splitters were disabled. I don't buy the bandwidth argument as a sole cause, but it certainly contributes, and some packet dumping router may have a role after the load gets heavy. Is there another possibility - database contention or something like that? I freely admit I don't know much about the issue, just that it sometimes caused problems with strange symptoms during my working years.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

ID: 1302482 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Send message
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1302499 - Posted: 5 Nov 2012, 16:42:41 UTC - in response to Message 1302432.

Might be worth posting a link to a few of them so those that know about these things can have a look.

http://setiathome.berkeley.edu/results.php?hostid=6167352&offset=0&show_names=0&state=6&appid=

Pick one ;-) Hope it helps...

WU 1109239375 is enough to demonstrate that they weren't all VLAR. They were judged infeasible for some other reason.

All 2877 were expired between 3:57:47 UTC and 3:57:54 UTC, so the database or other server delays apparently took about 7 seconds to get through that long list of "lost" tasks.
                                                                   Joe

ID: 1302499 · Report as offensive
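As a rough check on Joe's figures above, the expiry rate can be worked out from the two timestamps and the task count he quotes. This is only a sketch of the arithmetic: the function name `expiry_rate` is made up for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def expiry_rate(first: str, last: str, task_count: int) -> float:
    """Return tasks expired per second between two H:MM:SS timestamps.

    Hypothetical helper; assumes both timestamps are on the same day.
    """
    fmt = "%H:%M:%S"
    elapsed = (
        datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    ).total_seconds()
    if elapsed <= 0:
        raise ValueError("last timestamp must be later than first")
    return task_count / elapsed


# Figures quoted in the post: 2877 tasks between 3:57:47 and 3:57:54 UTC.
rate = expiry_rate("3:57:47", "3:57:54", 2877)
print(f"{rate:.0f} tasks expired per second")
```

On those numbers the server worked through the list at roughly 400 tasks per second.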
Profile Tron

Send message
Joined: 16 Aug 09
Posts: 180
Credit: 2,250,468
RAC: 0
United States
Message 1302510 - Posted: 5 Nov 2012, 17:24:04 UTC

ok, now all my machines are empty ...what the heck are you guys doing?

turn the work back on! it's freezing in here :P

ID: 1302510 · Report as offensive
kittyman
Project Donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 45862
Credit: 814,628,438
RAC: 121,639
United States
Message 1302511 - Posted: 5 Nov 2012, 17:27:24 UTC - in response to Message 1302510.

ok, now all my machines are empty ...what the heck are you guys doing?

turn the work back on! it's freezing in here :P

I don't think any of my rigs actually ran out yet.
But I haven't checked all 9 of them.
If they do, they all have Einstein as a backup project.

But hopefully da boyz in da lab will have on their best thinking hats and kicking boots today and will start to get to the root of the problem.
Best of luck with it, guys.
Kitties make wonderful traveling companions on your journey through life.

Have made a few friends in this life.
Most were cats.

ID: 1302511 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 7474
Credit: 90,878,246
RAC: 45,246
Australia
Message 1302527 - Posted: 5 Nov 2012, 18:13:45 UTC - in response to Message 1302467.

Mark Sattler posted an interesting theory yesterday. He wondered whether asking Synergy to run the Scheduler, several MB splitters, and several AP splitters all at the same time might have been too much, and caused the initial slowdown we saw after maintenance last week. Sounds plausible to me.

Take a look at the database graphs: usual activity these days is around 700-800 queries/s. Until the splitters were shut down, it didn't drop below 1,000/s, with sustained periods of just below 1,500/s & many peaks over 1,500/s.
Even now there are many surges to 1,500/s+, but it's also dropping down to 700/s or less on occasion.
Grant
Darwin NT

ID: 1302527 · Report as offensive
Profile Mad Fritz
Avatar

Send message
Joined: 20 Jul 01
Posts: 87
Credit: 11,334,904
RAC: 0
Switzerland
Message 1302528 - Posted: 5 Nov 2012, 18:13:50 UTC

@Joe
Thanks for looking into it :-)
Am I right that there will not really be any serious harm as the tasks were given to other crunchers in the meantime?

Andy


ID: 1302528 · Report as offensive
David S
Project Donor
Volunteer tester
Avatar

Send message
Joined: 4 Oct 99
Posts: 17034
Credit: 20,917,205
RAC: 5,927
United States
Message 1302589 - Posted: 5 Nov 2012, 20:17:29 UTC - in response to Message 1302155.

My notional list of "work in progress" has gone up from 1,500 to 2,100 in the last two hours.

Everything is getting set to NNT and staying there, until I see zero tasks ready to send AND the splitters disabled.

It has been 27 hours since you said that. The ready-to-send count has been at or near 0 (I assume it only ticks upward because of occasional timeout reassignments) and the splitters have been off for six hours that I'm aware of, probably a lot longer, and the Crickets are still maxed out! There was a mild downspike yesterday and an even smaller one just now, but there can't possibly still be that many ghost resends going on, can there? It's got me wondering if either something is wrong with the servers or there's an outside DoS attack going on. Or perhaps a web spider slipped through the filters and is trying to catalog every one of those 9 million results out in the field and 7 million waiting for validation, or something like that.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


ID: 1302589 · Report as offensive
David S
Project Donor
Volunteer tester
Avatar

Send message
Joined: 4 Oct 99
Posts: 17034
Credit: 20,917,205
RAC: 5,927
United States
Message 1302593 - Posted: 5 Nov 2012, 20:35:23 UTC

I was going to remark on the high number of error WUs for my i7: I had a short-deadline timeout, and now my original wingman and my replacement have both had natural timeouts and the WU has been sent to two more hosts. But now I'm wondering if the first two hosts completed the work and have been unable to upload and report due to the server problems of the last few days. (If that's the case, they'll eventually report late, and the WUs will end up stuck and take even longer to disappear off my error list.)


David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


ID: 1302593 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Avatar

Send message
Joined: 26 May 99
Posts: 8577
Credit: 43,044,868
RAC: 21,004
United Kingdom
Message 1302595 - Posted: 5 Nov 2012, 20:53:50 UTC

, and the Crickets are still maxed out!


Seems to be slowly dropping back, hopefully this is a good sign!
"Sometimes it is the people no one imagines anything of who do the things that no one can imagine."

ID: 1302595 · Report as offensive
rob smith
Project Donor
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 13300
Credit: 154,170,748
RAC: 112,818
United Kingdom
Message 1302597 - Posted: 5 Nov 2012, 21:01:01 UTC

Dave (N9JFE)
A look at one of your timed-out tasks shows that it was sent to you with an "impossible" deadline, so you timed out; the same thing happened, at the same time, to your wingman. The task was now sent out to two more crunchers, one of whom has reported, while the other is still "in progress". Since the task was delivered to you in September and faulted on the same day, it is pretty safe to say that the task is not under the influence of the current server woes.


Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

ID: 1302597 · Report as offensive
David S
Project Donor
Volunteer tester
Avatar

Send message
Joined: 4 Oct 99
Posts: 17034
Credit: 20,917,205
RAC: 5,927
United States
Message 1302623 - Posted: 5 Nov 2012, 21:47:40 UTC - in response to Message 1302597.

Dave (N9JFE)
A look at one of your timed-out tasks shows that it was sent to you with an "impossible" deadline, so you timed out; the same thing happened, at the same time, to your wingman. The task was now sent out to two more crunchers, one of whom has reported, while the other is still "in progress". Since the task was delivered to you in September and faulted on the same day, it is pretty safe to say that the task is not under the influence of the current server woes.

I know what happened to me (and also to my wingman in at least one case). What I found remarkable was that so many of my timeouts have now had full-time (i.e., not impossible) timeouts by the other users, and I wondered if *they* might be caused by the server problems.

But that's a minor issue.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


ID: 1302623 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 2871
Credit: 10,620,139
RAC: 305
United States
Message 1302645 - Posted: 5 Nov 2012, 22:45:00 UTC

Well the Crickets aren't maxed out anymore, and I just made a scheduler contact and it took three seconds to acknowledge five completed APs. That's pretty quick.

Regarding having too many AP splitters going.. I've been wondering/asking for a while now if we can just knock it down to one AP splitter. If you load up 10 full tapes and let the splitters go as fast as they can, AP finishes all 10 tapes in usually around the same time as MB takes to get through 2-3. Maybe just slow AP down and limit the hindering effect it has on everything else?


Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)

ID: 1302645 · Report as offensive
Richard Haselgrove
Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 11136
Credit: 83,514,583
RAC: 41,360
United Kingdom
Message 1302665 - Posted: 5 Nov 2012, 23:35:33 UTC

Here comes fun - they've turned on every splitter, and the cricket graph has fallen through the floor. I'm setting NNT and going to bed - tell me about it in the morning.

ID: 1302665 · Report as offensive
juan BFP
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 5847
Credit: 330,515,136
RAC: 7,705
Panama
Message 1302681 - Posted: 6 Nov 2012, 0:03:53 UTC
Last modified: 6 Nov 2012, 0:40:19 UTC

The blindest are those who don't want to see...

Hope I'm wrong...

(edit) 05/11/2012 22:33:08 SETI@home Message from server: This computer has reached a limit on tasks in progress

A new limit?


ID: 1302681 · Report as offensive
fscheel

Send message
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1302695 - Posted: 6 Nov 2012, 0:47:40 UTC - in response to Message 1302681.

The blindest are those who don't want to see...

Hope I'm wrong...

(edit) 05/11/2012 22:33:08 SETI@home Message from server: This computer has reached a limit on tasks in progress

A new limit?


I have one PC that's getting that message also. Wonder what it means or what the limit is.

ID: 1302695 · Report as offensive
Profile Fred E.
Volunteer tester

Send message
Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1302698 - Posted: 6 Nov 2012, 1:08:10 UTC
Last modified: 6 Nov 2012, 1:10:33 UTC

I'm back to timeouts when just reporting on NNT. Wonder what they worked on today?

Edit: The next try was successful, of course. Will try work fetch now, but may go NNT all night.


Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

ID: 1302698 · Report as offensive
Profile Khangollo
Avatar

Send message
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1302717 - Posted: 6 Nov 2012, 2:55:11 UTC
Last modified: 6 Nov 2012, 3:00:00 UTC

Timeouts are back, plus now I can't download the tons of lost tasks I still have; when the scheduler is finally successful, all I'm getting is the limit-reached message.
Looks like the limits are as ridiculous as they were before (50 tasks/CPU), which means even my slowest hosts are going to make scheduler requests *endlessly*, never able to fill a 5-day cache, only compounding the scheduler overload problem...


ID: 1302717 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Send message
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1302719 - Posted: 6 Nov 2012, 3:05:49 UTC - in response to Message 1302528.

@Joe
Thanks for looking into it :-)
Am I right that there will not really be any serious harm as the tasks were given to other crunchers in the meantime?

Andy

Yes, you're right. Just a slight delay in actually getting tasks to 2 hosts.
                                                                  Joe

ID: 1302719 · Report as offensive
Profile Donald L. Johnson
Avatar

Send message
Joined: 5 Aug 02
Posts: 8205
Credit: 4,331,349
RAC: 5,314
United States
Message 1302760 - Posted: 6 Nov 2012, 8:02:22 UTC - in response to Message 1302717.

Timeouts are back, plus now I can't download the tons of lost tasks I still have; when the scheduler is finally successful, all I'm getting is the limit-reached message.
Looks like the limits are as ridiculous as they were before (50 tasks/CPU), which means even my slowest hosts are going to make scheduler requests *endlessly*, never able to fill a 5-day cache, only compounding the scheduler overload problem...

So you might as well reduce your cache settings to something close to the limits, to reduce the number of failed scheduler requests and ease the strain on the servers.....
Donald
Infernal Optimist / Submariner, retired

ID: 1302760 · Report as offensive
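Donald's suggestion above can be put into numbers: a per-CPU task cap translates into a maximum number of days of cached work, and any cache setting above that can never be filled. The sketch below is illustrative only. The 50 tasks/CPU figure is the one quoted in the thread (not confirmed by the admins), and the 1.5 hours per task is a purely hypothetical runtime; substitute your own host's average.

```python
def max_cache_days(tasks_per_cpu: int, hours_per_task: float) -> float:
    """Days of work one CPU core can hold when capped at tasks_per_cpu tasks."""
    return tasks_per_cpu * hours_per_task / 24.0


def useful_cache_setting(requested_days: float, tasks_per_cpu: int,
                         hours_per_task: float) -> float:
    """Largest cache setting (in days) that the cap still allows to be filled."""
    return min(requested_days, max_cache_days(tasks_per_cpu, hours_per_task))


TASKS_PER_CPU = 50     # limit as quoted in the thread, unconfirmed
HOURS_PER_TASK = 1.5   # hypothetical average runtime, not from the thread

ceiling = max_cache_days(TASKS_PER_CPU, HOURS_PER_TASK)
setting = useful_cache_setting(5.0, TASKS_PER_CPU, HOURS_PER_TASK)
print(f"cap holds about {ceiling:.2f} days; a 5-day cache is effectively {setting:.2f} days")
```

Under those assumptions a 5-day cache setting asks for more work than the cap will ever allow, so every request past the cap comes back with the limit message.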
Profile Dirk Sadowski
Volunteer tester

Send message
Joined: 6 Apr 07
Posts: 7066
Credit: 100,926,336
RAC: 61,453
Germany
Message 1302761 - Posted: 6 Nov 2012, 8:07:11 UTC
Last modified: 6 Nov 2012, 8:12:11 UTC

Message from server: This computer has reached a limit on tasks in progress


I don't know what it is this time (no admin has announced it).

If I remember correctly, the last time it was max. 50 WUs per CPU thread and 400 WUs per GPU in BOINC.

So my Intel Core2 Duo E7600 with NVIDIA GeForce GTX260 should get (50 x 2) + 400 = 500 WUs, maybe this time too.


* Best regards! :-) * Sutaru Tsureku, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. *

ID: 1302761 · Report as offensive
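The arithmetic in Dirk's post generalises to any host. A minimal sketch, assuming the limits are what he remembers from last time (50 WUs per CPU thread, 400 WUs per GPU); the current values had not been announced, and the function name is made up for illustration.

```python
def tasks_in_progress_limit(cpu_threads: int, gpus: int,
                            per_cpu_thread: int = 50,
                            per_gpu: int = 400) -> int:
    """Total tasks a host may hold under per-thread and per-GPU caps.

    The default caps are the remembered values from the post, not
    confirmed server settings.
    """
    return cpu_threads * per_cpu_thread + gpus * per_gpu


# Core2 Duo E7600 (2 threads) with one GTX260, as in the post.
limit = tasks_in_progress_limit(cpu_threads=2, gpus=1)
print(limit)
```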


 
©2016 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.