Panic Mode On (33) Server problems



Message boards : Number crunching : Panic Mode On (33) Server problems

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8634
Credit: 51,651,319
RAC: 49,130
United Kingdom
Message 1005638 - Posted: 18 Jun 2010, 8:49:38 UTC - in response to Message 1005633.

1. Which is where you (plural, not Andy alone) come in. Complain, comment, don't comply, tell how to do it better. Tell it to the developers themselves; do not expect them to trawl through 40 threads looking for your post. And of course, comments like "dictatorship", "this is ****" and "test it better" are best left behind. On the latter, they're testing it here in a live environment just to see what could go wrong.

That's the reality, but it's bad practice. In the real computing world, whenever a major application or upgrade is launched, the developers should be proactively monitoring the rollout and catching issues as they arise. I still remember (with cold shivers down my spine) the Saturday night I migrated a live telesales database from Microsoft Access to SQL Server. I had to wait until the last call ended at 10pm, then perform the transfer. But I regarded it as a consequent duty to be on-site at 10am the following (Sunday) morning, when the sales lines opened again, to monitor that everything was running smoothly. It was - we didn't lose an order.

4. I actually like that. The biggest problem was always that people expected the claimed credit to be theirs, no matter what. OK, you won't get fun threads anymore that you claimed 17 trillion credits, but let's be honest, the method in which the claimed credit is calculated isn't in use here anymore (time * benchmarks).

Not true. The "claimed credit" shown on this project's website has been derived from the flopcounter for years, and is incredibly stable and reliable - except for the minute percentage of users still using the very earliest v5 clients or before.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3505
Credit: 47,803,733
RAC: 46,956
Russia
Message 1005639 - Posted: 18 Jun 2010, 8:51:51 UTC - in response to Message 1005633.
Last modified: 18 Jun 2010, 8:59:10 UTC

Quota
Quota is only counting MB tasks, but the quota limit is applied to all tasks, therefore I cannot download AP tasks.
Quota is not being reset at the end of the 24-hour period; I think this runs on PDT, but it didn't happen at UTC today either. Therefore my quota, and presumably others' also, is now into its third day.
(expecting more complaints on this as most people seem to run 2 or 3 day caches)

This is indeed a performance-limiting factor now.

Resetting the quota from close to infinity to some default value when an error is encountered is OK (before, it was 100*NUMBER_OF_CPU_CORES+100*5*NUMBERS_OF_GPUs AFAIK; the GPU part can differ). If it's only a random error like -12, the host will have enough work to continue and to prove to the server that it's a good one. But if it's the first sign of a big host failure, the sooner fetch is inhibited the better. If we could decrease "close to infinity" only by 1 for each failure, the quota mechanism would be ineffective at dealing with a broken host.
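To make the arithmetic concrete, here is a minimal Python sketch of the two policies compared above. It is only an illustration, not the scheduler's actual code: the function names are hypothetical, the default formula is the one recalled above (the GPU factor may differ), and `10**9` is an arbitrary stand-in for "close to infinity".

```python
def default_quota(n_cpu_cores: int, n_gpus: int) -> int:
    """Default daily quota as recalled in the post (GPU factor may differ)."""
    return 100 * n_cpu_cores + 100 * 5 * n_gpus


def quota_after_error_reset(current: int, n_cpu_cores: int, n_gpus: int) -> int:
    """Reset policy: on an error, drop straight to the default.

    Assumption: an error never raises the quota, hence the min().
    """
    return min(current, default_quota(n_cpu_cores, n_gpus))


def quota_after_error_decrement(current: int) -> int:
    """Alternative policy: reduce the quota by only 1 per failure."""
    return max(1, current - 1)


if __name__ == "__main__":
    huge = 10**9  # hypothetical stand-in for "close to infinity"
    # Host with 4 CPU cores and 1 GPU
    print("default quota:", default_quota(4, 1))
    print("after one error (reset):", quota_after_error_reset(huge, 4, 1))
    print("after one error (decrement):", quota_after_error_decrement(huge))
```

With the decrement policy, a broken host starting near `10**9` would need roughly a billion failures before its quota restricted anything, which is why the reset is the safer choice.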

But currently everything suggests that the new quota was implemented with bugs. My own host still receives the message about a reached quota (294 so far), but it has not downloaded anything even close to this number over the past few days. That is, the reset of the downloaded-tasks counter is broken.
It also looks as if the same quota is still applied to all app versions. I get the quota-reached message on ATI GPU AP work requests too. It's absolutely clear that this host can't have downloaded ~300 AP tasks in the last day under any conditions; actually, it downloaded no AP tasks yesterday and no AP tasks today...
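For illustration, the behaviour being asked for (a separate counter per app version, cleared when the day rolls over) could look roughly like the Python sketch below. This is a hypothetical model, not the BOINC scheduler's implementation; the class name, method names and the "MB"/"AP" keys are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


class DailyQuota:
    """Hypothetical per-app-version daily quota tracker (not BOINC code)."""

    def __init__(self, limit_per_app: int):
        self.limit = limit_per_app
        self.day = None                # day the counters belong to
        self.sent = defaultdict(int)   # tasks sent today, keyed by app version

    def _roll_over(self, today: date) -> None:
        # Clear every counter when a new day starts; this is the step
        # that appears to be broken in the behaviour described above.
        if today != self.day:
            self.day = today
            self.sent.clear()

    def can_send(self, app: str, today: date) -> bool:
        self._roll_over(today)
        return self.sent[app] < self.limit

    def record_sent(self, app: str, today: date, n: int = 1) -> None:
        self._roll_over(today)
        self.sent[app] += n


if __name__ == "__main__":
    q = DailyQuota(limit_per_app=2)
    day1, day2 = date(2010, 6, 17), date(2010, 6, 18)
    q.record_sent("MB", day1, 2)
    print(q.can_send("MB", day1))  # MB has used its own quota for day1
    print(q.can_send("AP", day1))  # AP is counted separately
    print(q.can_send("MB", day2))  # counters were cleared at rollover
```

In this model, MB downloads never block an AP request, and a host that hit its limit yesterday starts from zero today.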

Ageless
Joined: 9 Jun 99
Posts: 12332
Credit: 2,634,248
RAC: 1,191
Netherlands
Message 1005643 - Posted: 18 Jun 2010, 9:03:25 UTC

All, Mark knows best. He'll fix it.
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

WinterKnight
Volunteer tester
Joined: 18 May 99
Posts: 8687
Credit: 25,062,662
RAC: 30,153
United Kingdom
Message 1005646 - Posted: 18 Jun 2010, 9:04:07 UTC

1 & 2. The identification of apps should have been the first step with nothing more done until it was accurate.

Quota,
The quota should be per application, and therefore 1 & 2 apply.

3 & 4. Richard effectively answered that.
5. Credits for AP, because of Eric's modification of the flop count method, are at ~800cr.

Brodo answered what I was going to say about extra tasks downloaded, i.e. we no longer know how much is asked for.


Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 32107
Credit: 13,812,026
RAC: 24,731
United Kingdom
Message 1005663 - Posted: 18 Jun 2010, 9:44:08 UTC

Can we all just calm down and take 5 please. No-one is slighting anyone intentionally, just voicing opinions.

____________
Damsel Rescuer, Uli Devotee, Julie Supporter, ES99 Admirer,
Raccoon Friend, Anniet fan, Dominant Culture Member


[AF>france>pas-de-calais]symaski62
Volunteer tester
Joined: 12 Aug 05
Posts: 258
Credit: 100,548
RAC: 1
France
Message 1005664 - Posted: 18 Jun 2010, 10:01:16 UTC

18/06/2010 11:40:31 SETI@home Sending scheduler request: To fetch work.
18/06/2010 11:40:31 SETI@home Requesting new tasks for GPU
18/06/2010 11:40:36 SETI@home Scheduler request completed: got 0 new tasks
18/06/2010 11:40:36 SETI@home Message from server: Project has no jobs available


RED servers ^^


____________
SETI@Home Informational message -9 result_overflow
With a general disability of 80%, I put a lot of effort into contributing to the community and expressing myself; thank you for being understanding.

Miep
Project donor
Volunteer moderator
Joined: 23 Jul 99
Posts: 2411
Credit: 351,996
RAC: 0
Message 1005665 - Posted: 18 Jun 2010, 10:33:01 UTC - in response to Message 1005664.

Oh Dear:

WU waiting to validate 43000 and climbing - according to status page one of the validators is down.

So the trickle of work people are getting from quota going up from valid tasks will be even smaller.
And it's still a few hours till the guys get in.

There are now so many small fires to put out, it starts looking like the forest is up in flames.

Interesting dilemma: will people be angrier if they fix on the fly (and introduce other problems, or we just can't see the fixes quickly enough to appease the community) or if they shut down for another day?

Seems that nerves are so frayed even the most patient of us are having a hard time.
____________
Carola
-------
I'm multilingual - I can misunderstand people in several languages!

zoom314
Project donor
Joined: 30 Nov 03
Posts: 46521
Credit: 36,867,617
RAC: 5,040
United States
Message 1005742 - Posted: 18 Jun 2010, 15:03:13 UTC

I'm on empty and so far Seti will not send Me any new work, But then I gather It won't send work out unless one has the right app, whatever that is, Yet My useless quota kept going up and My complaints have gone unanswered.

6/18/2010 7:50:26 AM SETI@home Sending scheduler request: To fetch work.
6/18/2010 7:50:26 AM SETI@home Requesting new tasks
6/18/2010 7:50:27 AM SETI@home Scheduler request completed: got 0 new tasks
6/18/2010 7:50:27 AM SETI@home Message from server: No work sent
6/18/2010 7:50:27 AM SETI@home Message from server: Your app_info.xml file doesn't have a usable version of SETI@home Enhanced.
6/18/2010 7:50:27 AM SETI@home Message from server: (reached daily quota of 241 tasks)

____________
My Facebook, War Commander, 2015

Blurf
Volunteer tester
Joined: 2 Sep 06
Posts: 7579
Credit: 6,979,272
RAC: 3,083
United States
Message 1006620 - Posted: 20 Jun 2010, 17:11:48 UTC
Last modified: 20 Jun 2010, 17:15:16 UTC

From Matt's 6/16 post

I know most of you who read these updates know this already, but it bears repeating: nobody working directly on SETI@home (all 5 of us) works full time, and we all have enough other things going on that make it impossible for us to be "on call" in case of outage/emergencies. In my case, I currently have four regular separate sources of income with jobs/gigs in four completely different industries (covering all the bases in case one or more dry up). As for last night, when the httpd problems arose, I was working elsewhere, and when I checked in again around 10:30pm everyone else was asleep and I didn't want to start up the scheduler processes without others' input as they were still effectively on the operating table. We've pretty much given up any hope for 24/7 uptime, but BOINC takes care of that as long as you sign up for other projects.


Something for all of us to keep in mind.


soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,631,148
RAC: 2
United States
Message 1006624 - Posted: 20 Jun 2010, 17:21:37 UTC

Thank you Blurf.. and thank you for helping renew my ticked-offedness.

Yeah we get it. It is not 24/7. Never asked for that. Yeah we get it.
We can sign up for other projects we may or may not agree with or want to
help with. We understand they are under funded/paid.

And we get it that on top of previous server problems, they dumped a bunch
of poorly written non-tested code on us, basically keeping things tied up in a knot for over a week. How about being up 12/2?? Cause it has been ages since I remember seeing all servers up at once.

But.. feel free to quote "so sayeth Matt" again. It really.. helps. Not sure who it helps, but I am sure it does. Or not.

"Bite my shiny metal a**" So sayeth Bender.

Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 32107
Credit: 13,812,026
RAC: 24,731
United Kingdom
Message 1006642 - Posted: 20 Jun 2010, 18:52:01 UTC

Well, I, for one, have calmed down.
I have to resign myself to the fact that since I have chosen to crunch Seti, and Seti only, I must accept the downtime of the project gracefully.

And there will be some times when my rigs cannot get enough work on hand to keep them running during extended or sequential project failures.

Such is what I have chosen to do.

Welcome to life in Setiland.


Mark, that is the most marvellous post I think I have ever seen you make. I always knew it was worth believing in you.

Take care now.

Chris S.
____________
Damsel Rescuer, Uli Devotee, Julie Supporter, ES99 Admirer,
Raccoon Friend, Anniet fan, Dominant Culture Member


Blurf
Volunteer tester
Joined: 2 Sep 06
Posts: 7579
Credit: 6,979,272
RAC: 3,083
United States
Message 1006643 - Posted: 20 Jun 2010, 18:53:58 UTC - in response to Message 1006642.

Well, I, for one, have calmed down.
I have to resign myself to the fact that since I have chosen to crunch Seti, and Seti only, I must accept the downtime of the project gracefully.

And there will be some times when my rigs cannot get enough work on hand to keep them running during extended or sequential project failures.

Such is what I have chosen to do.

Welcome to life in Setiland.


Mark, that is the most marvellous post I think I have ever seen you make. I always knew it was worth believing in you.

Take care now.

Chris S.


The calm is nice to see and appreciated.


Dave Cummings
Volunteer tester
Joined: 16 May 09
Posts: 204
Credit: 928,013
RAC: 329
United Kingdom
Message 1006698 - Posted: 20 Jun 2010, 22:27:44 UTC

Glad to see you back.

Highlander
Joined: 5 Oct 99
Posts: 146
Credit: 31,453,951
RAC: 11,184
Germany
Message 1006799 - Posted: 21 Jun 2010, 6:30:33 UTC

funny msg:

21.06.2010 08:19:04 [error] Error reported by file upload server: Server is out of disk space

It's no wonder, with over 1 million unvalidated WU.

[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1151
Credit: 3,835,956
RAC: 2,281
United Kingdom
Message 1006802 - Posted: 21 Jun 2010, 6:42:46 UTC - in response to Message 1006799.

funny msg:

21.06.2010 08:19:04 [error] Error reported by file upload server: Server is out of disk space

It's no wonder, with over 1 million unvalidated WU.


Got this message as well today with an upload transient error, will just sit and wait until it has been fixed

zoom314
Project donor
Joined: 30 Nov 03
Posts: 46521
Credit: 36,867,617
RAC: 5,040
United States
Message 1006830 - Posted: 21 Jun 2010, 7:50:52 UTC - in response to Message 1006802.

funny msg:

21.06.2010 08:19:04 [error] Error reported by file upload server: Server is out of disk space

It's no wonder, with over 1 million unvalidated WU.


Got this message as well today with an upload transient error, will just sit and wait until it has been fixed

I turned off network access in Boinc earlier when I heard You all talking about It, I'm sure It'll get fixed on Monday or Tuesday, So We wait.
____________
My Facebook, War Commander, 2015

WinterKnight
Volunteer tester
Joined: 18 May 99
Posts: 8687
Credit: 25,062,662
RAC: 30,153
United Kingdom
Message 1006831 - Posted: 21 Jun 2010, 8:35:19 UTC

Had to happen sooner or later.
If the validators are switched off, then the results are not being transferred to the science database. So no space is being emptied to make room for further uploads.
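A toy Python model of that reasoning: uploads add to the upload disk each hour, and validation is what drains it. Everything here is hypothetical (the function name, the units and the numbers); it only shows why the disk must eventually fill once the drain rate drops to zero.

```python
def simulate_upload_disk(capacity, uploads_per_hour, drained_per_hour, hours, used=0):
    """Return (hour the disk filled or None, space used at the end).

    All quantities are in the same arbitrary unit (e.g. results).
    drained_per_hour models validated results leaving the upload area.
    """
    for hour in range(1, hours + 1):
        used = max(0, used - drained_per_hour)  # validators free space
        used += uploads_per_hour                # clients upload new results
        if used >= capacity:
            return hour, capacity               # further uploads are refused
    return None, used


if __name__ == "__main__":
    # Validators running and keeping pace: the disk never fills.
    print(simulate_upload_disk(1000, 100, 100, 24))
    # Validators switched off: nothing is drained, the disk fills.
    print(simulate_upload_disk(1000, 100, 0, 24))
```

With the drain at zero, the disk fills after capacity / uploads_per_hour hours, at which point clients would start seeing an error like the "Server is out of disk space" message quoted earlier in the thread.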


Copyright © 2014 University of California