Panic Mode On (17) Server problems

Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 908412 - Posted: 17 Jun 2009, 18:26:44 UTC - in response to Message 908398.  

Anyone noticed the new website code at play?...

Some weird effect of CSS on the main page, all black with blue letters before it loads correctly, you mean?
ID: 908412 · Report as offensive
Matthew S. McCleary
Joined: 9 Sep 99
Posts: 121
Credit: 2,288,242
RAC: 0
United States
Message 908415 - Posted: 17 Jun 2009, 18:37:06 UTC - in response to Message 908356.  
Last modified: 17 Jun 2009, 18:40:57 UTC


Perhaps we should be looking at capping the number of CPU’s per account.


Yeahhh... that will go over well with enthusiasts like me who have been working hard to throw everything they've got at it.

Somehow I don't think people would like SETI@home saying to them: "We want your help, but only so much. We're cutting you off at 10 CPUs."

In fact, if SETI@home were to adopt such a policy, I would abandon the project. There are plenty of other worthy DC projects which would never do such a thing.
ID: 908415 · Report as offensive
FiveHamlet
Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 908417 - Posted: 17 Jun 2009, 18:42:33 UTC
Last modified: 17 Jun 2009, 18:57:11 UTC

My rig has been trying to upload 100 tasks most of the day.
Also still can't report 111 tasks, and this after having an outage for 5 hrs yesterday.
I am now crunching Aqua on my GPUs and Docking on my CPU to fill in the missing workload. On the main rig I have no 6.03 to crunch and only about 200 6.08, so it will be out of Seti work in 12 hrs. The second rig has no 6.08 but 200 6.03. The scheduler is not doing a very good job.
Seems like a real problem with Seti servers.
Dave

EDIT: just checked the server status page and its info is 6 hours old.
BIG PROBLEM
ID: 908417 · Report as offensive
Profile Clyde C. Phillips, III

Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 908424 - Posted: 17 Jun 2009, 19:03:34 UTC

I see only about three hours of tasks left. It'll be Einstein or else I'll have to turn off my machines tonight.
ID: 908424 · Report as offensive
FiveHamlet
Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 908426 - Posted: 17 Jun 2009, 19:08:42 UTC

I have 4 GPUs to feed, so it looks like Aqua for me, just to keep my room warm lol
ID: 908426 · Report as offensive
Profile UrbanFlux
Joined: 21 Sep 06
Posts: 8
Credit: 1,150,626
RAC: 0
United Kingdom
Message 908436 - Posted: 17 Jun 2009, 19:21:21 UTC

Fixed!
Trust No One. Question Everything.
ID: 908436 · Report as offensive
FiveHamlet
Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 908437 - Posted: 17 Jun 2009, 19:24:30 UTC - in response to Message 908436.  
Last modified: 17 Jun 2009, 19:44:36 UTC

Don't know what is fixed but server status is now showing 7 hrs.
Still no uploads, reporting or downloads.
ID: 908437 · Report as offensive
Profile UrbanFlux
Joined: 21 Sep 06
Posts: 8
Credit: 1,150,626
RAC: 0
United Kingdom
Message 908444 - Posted: 17 Jun 2009, 19:43:23 UTC

Yep sorry, was seeing things.
Trust No One. Question Everything.
ID: 908444 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 908445 - Posted: 17 Jun 2009, 19:47:36 UTC - in response to Message 908415.  
Last modified: 17 Jun 2009, 19:58:18 UTC


Perhaps we should be looking at capping the number of CPU’s per account.


Yeahhh... that will go over well with enthusiasts like me who have been working hard to throw everything they've got at it.

Somehow I don't think people would like SETI@home saying to them: "We want your help, but only so much. We're cutting you off at 10 CPUs."

In fact, if SETI@home were to adopt such a policy, I would abandon the project. There are plenty of other worthy DC projects which would never do such a thing.

I still think the math behind the quota needs to be revised. Sure, one bad task can still be -1, but a good task needs to be something like +1 or +2 instead of x2. Takes 100+ bad tasks to get down to a quota of 1, but it only takes 8 good ones to get back to 100. Sounds like it defeats the purpose of the quota altogether.

$0.02


[edit: and my panic is gone now. The one Linux box that only had an AP to crunch requested 86400 seconds of work (nice number there..) and got 20 MBs. Good to go now.]
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 908445 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 908446 - Posted: 17 Jun 2009, 19:57:32 UTC - in response to Message 908437.  
Last modified: 17 Jun 2009, 20:08:08 UTC

Don't know what is fixed but server status is now showing 7 hrs.
Still no uploads, reporting or downloads.


Before they took the website down and took the servers offline for maintenance, the status time was stuck at [As of 17 Jun 2009 12:20:10 UTC].
Now at least it's updating its status every 10 mins, even though the stats haven't been updated (yet).

Work is getting through: I've got about 60 WUs downloaded and another 20 downloading.

You might have to get some uploaded before you'll request more.

Claggy
ID: 908446 · Report as offensive
FiveHamlet
Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 908447 - Posted: 17 Jun 2009, 19:58:24 UTC - in response to Message 908445.  
Last modified: 17 Jun 2009, 20:01:07 UTC

No point in having quotas if the scheduler gets things totally screwed.
I have 1 rig with 8 cores idle because it has no APs or 6.03s but plenty of 6.08s, and another rig with 4 cores and 2 GPUs with plenty of 6.03s and no 6.08s.
Still have over 150 trying to upload here, been like this all day.
ID: 908447 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 908487 - Posted: 17 Jun 2009, 21:24:56 UTC - in response to Message 908445.  

I still think the math behind the quota needs to be revised. Sure, one bad task can still be -1, but a good task needs to be something like +1 or +2 instead of x2. Takes 100+ bad tasks to get down to a quota of 1, but it only takes 8 good ones to get back to 100. Sounds like it defeats the purpose of the quota altogether.

It really depends on what you're trying to achieve.

If the purpose is to keep a machine that returns consistently bad results from chewing through work, it's good enough. It'll take a while to throttle it down, so an occasional validator error or math bug won't hurt machines that are not broken.

It also means a broken machine (or maybe a bug-fix to the application) can recover quite quickly once it's fixed.

If you did away with the -1/x2 mechanism entirely, things would probably still be okay, there would just be a lot more reissues.
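
To put numbers on the quota arithmetic debated in this exchange, here is a minimal Python sketch. It assumes only what the posts state: a bad result lowers the quota by 1, a good result doubles it, and the quota stays between 1 and 100. The function names and the `+1`/`+2` alternatives are illustrative, and this is not BOINC's actual scheduler code.

```python
def steps_to_throttle(start=100, floor=1):
    """Count bad results needed to fall from `start` to `floor` at -1 each."""
    quota, steps = start, 0
    while quota > floor:
        quota -= 1
        steps += 1
    return steps


def steps_to_recover(on_good, start=1, cap=100):
    """Count good results needed to climb from `start` back up to `cap`."""
    quota, steps = start, 0
    while quota < cap:
        quota = min(cap, on_good(quota))  # never exceed the cap
        steps += 1
    return steps


if __name__ == "__main__":
    print("bad results to reach 1:", steps_to_throttle())
    print("good results, x2 rule:", steps_to_recover(lambda q: q * 2))
    print("good results, +1 rule:", steps_to_recover(lambda q: q + 1))
    print("good results, +2 rule:", steps_to_recover(lambda q: q + 2))
```

Under these assumptions the doubling rule restores a throttled host within a handful of good results, while an additive rule makes recovery roughly as slow as the throttling itself. That is the trade-off between the two positions in this exchange.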
ID: 908487 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 908510 - Posted: 17 Jun 2009, 21:56:02 UTC

I was also thinking of proposing that a new user's quota be at 10, and go from there. That way they don't go and fill up their cache and then abandon ship, leaving all of those tasks to time out and be reissued.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 908510 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 908520 - Posted: 17 Jun 2009, 22:05:06 UTC - in response to Message 908510.  

I was also thinking of proposing that a new user's quota be at 10, and go from there. That way they don't go and fill up their cache and then abandon ship, leaving all of those tasks to time out and be reissued.

I'd think even lower would be fine, maybe as low as two.

Each validated WU would double-up, and they'd get to full quota fairly quickly.
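
As a rough check on how quickly a low starting quota would ramp up, the sketch below assumes the doubling-per-validated-result rule described in this thread and a full quota of 100 (the figure used in the earlier posts). The function name and defaults are hypothetical.

```python
def ramp(start, cap=100):
    """Return the quota after each validated result until the cap is reached."""
    quota, history = start, [start]
    while quota < cap:
        quota = min(cap, quota * 2)  # each validated WU doubles the quota
        history.append(quota)
    return history


if __name__ == "__main__":
    for start in (2, 10):
        history = ramp(start)
        print(f"start={start}: {history} ({len(history) - 1} validated results)")
```

With these assumed numbers, a start of 2 costs only two more validated results than a start of 10 before the host is at full quota.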
ID: 908520 · Report as offensive
Profile [AF>france>pas-de-calais]symaski62
Volunteer tester

Joined: 12 Aug 05
Posts: 258
Credit: 100,548
RAC: 0
France
Message 908521 - Posted: 17 Jun 2009, 22:06:14 UTC

:) PAUSE !

18/06/2009 00:02:40 Internet access OK - project servers may be temporarily down.

---------------------------------

18/06/2009 00:00:36		Resuming network activity
18/06/2009 00:00:36	SETI@home	Started upload of 01mr09ad.27455.4980.11.8.163_1_0
18/06/2009 00:00:36	SETI@home	Started upload of 24fe09ac.31767.5389.3.8.130_0_0
18/06/2009 00:00:36	SETI@home	Sending scheduler request: To fetch work.
18/06/2009 00:00:36	SETI@home	Requesting new tasks
18/06/2009 00:00:41	SETI@home	Scheduler request completed: got 0 new tasks
18/06/2009 00:00:41	SETI@home	Message from server: (Project has no jobs available)
18/06/2009 00:00:44	SETI@home	[error] Error reported by file upload server: can't open file
18/06/2009 00:00:44	SETI@home	Temporarily failed upload of 24fe09ac.31767.5389.3.8.130_0_0: transient upload error
18/06/2009 00:00:44	SETI@home	Backing off 1 min 0 sec on upload of 24fe09ac.31767.5389.3.8.130_0_0
18/06/2009 00:00:47	SETI@home	[error] Error reported by file upload server: can't open file
18/06/2009 00:00:47	SETI@home	Temporarily failed upload of 01mr09ad.27455.4980.11.8.163_1_0: transient upload error
18/06/2009 00:00:47	SETI@home	Backing off 1 min 0 sec on upload of 01mr09ad.27455.4980.11.8.163_1_0
18/06/2009 00:01:44	SETI@home	Started upload of 24fe09ac.31767.5389.3.8.130_0_0
18/06/2009 00:01:47	SETI@home	Started upload of 01mr09ad.27455.4980.11.8.163_1_0
18/06/2009 00:01:56	SETI@home	Sending scheduler request: To fetch work.
18/06/2009 00:01:56	SETI@home	Requesting new tasks
18/06/2009 00:02:01	SETI@home	Scheduler request completed: got 1 new tasks
18/06/2009 00:02:03	SETI@home	Started download of 12mr09ac.29680.16841.4.8.223
18/06/2009 00:02:09		Project communication failed: attempting access to reference site
18/06/2009 00:02:09	SETI@home	Temporarily failed upload of 01mr09ad.27455.4980.11.8.163_1_0: connect() failed
18/06/2009 00:02:09	SETI@home	Backing off 1 min 12 sec on upload of 01mr09ad.27455.4980.11.8.163_1_0
18/06/2009 00:02:09	SETI@home	Finished download of 12mr09ac.29680.16841.4.8.223
18/06/2009 00:02:11		BOINC can't access Internet - check network connection or proxy configuration.
18/06/2009 00:02:38		Project communication failed: attempting access to reference site
18/06/2009 00:02:38	SETI@home	Temporarily failed upload of 24fe09ac.31767.5389.3.8.130_0_0: connect() failed
18/06/2009 00:02:38	SETI@home	Backing off 1 min 0 sec on upload of 24fe09ac.31767.5389.3.8.130_0_0
18/06/2009 00:02:40		Internet access OK - project servers may be temporarily down.
18/06/2009 00:02:56		Suspending network activity - user request

SETI@Home Informational message -9 result_overflow
I have a general disability of 80% and make a great effort for the community and to express myself; thank you for being understanding.
ID: 908521 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 908524 - Posted: 17 Jun 2009, 22:08:59 UTC - in response to Message 908510.  

I was also thinking of proposing that a new user's quota be at 10, and go from there. That way they don't go and fill up their cache and then abandon ship, leaving all of those tasks to time out and be reissued.

I've also thought that the initial DCF should be pretty high. That would keep inefficient machines from over-requesting work initially, and allow more work once things settle in.

The obvious problem is the high projected time, but it should be possible to explain that (or show a "displayed duration" based on a "display-DCF").
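
The effect of a high initial DCF on work fetch can be illustrated with a small sketch. It assumes DCF is a multiplier applied to a task's base runtime estimate, and that the number of tasks sent is roughly the requested seconds divided by the corrected estimate. The function names and the sample figures (a one-day request, one-hour tasks) are made up for illustration and are not the BOINC client's real logic.

```python
def estimated_runtime(base_estimate_s, dcf):
    """Projected runtime in seconds after applying the correction factor."""
    return base_estimate_s * dcf


def tasks_for_request(requested_s, base_estimate_s, dcf):
    """Whole tasks that fit into a work request of `requested_s` seconds."""
    return int(requested_s // estimated_runtime(base_estimate_s, dcf))


if __name__ == "__main__":
    requested = 86400  # assumed request: one day of work
    base = 3600        # assumed base estimate: one hour per task
    for dcf in (1.0, 4.0):
        n = tasks_for_request(requested, base, dcf)
        print(f"DCF {dcf}: {n} tasks, each projected at "
              f"{estimated_runtime(base, dcf):.0f} s")
```

A higher starting factor shrinks the first batch of work and inflates the projected time per task, which is the display problem raised in the post.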
ID: 908524 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 908527 - Posted: 17 Jun 2009, 22:12:07 UTC - in response to Message 908521.  

:) PAUSE !

18/06/2009 00:02:40 Internet access OK - project servers may be temporarily down.


Looking at what is happening with uploads at the moment, suspending network for a few hours seems like the best option. Those of my uploads that are managing to make contact are getting to 100% but then failing to get the final "ack" from the database so it looks like a database access issue again.

F.
ID: 908527 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 908570 - Posted: 17 Jun 2009, 23:51:40 UTC

I was just looking at the cricket graph and there is still about 30mbit leftover (if you add inbound and outbound together, it shouldn't be more than 100mbit..in a theoretical sense), but none of my 40 pending uploads will go through. I don't think it's a lack of bandwidth, I think there's just too many requests happening for apache to handle all of them.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 908570 · Report as offensive
Larry256
Volunteer tester

Joined: 11 Nov 05
Posts: 25
Credit: 5,715,079
RAC: 8
United States
Message 908591 - Posted: 18 Jun 2009, 0:35:23 UTC - in response to Message 908487.  
Last modified: 18 Jun 2009, 0:35:40 UTC

I still think the math behind the quota needs to be revised. Sure, one bad task can still be -1, but a good task needs to be something like +1 or +2 instead of x2. Takes 100+ bad tasks to get down to a quota of 1, but it only takes 8 good ones to get back to 100. Sounds like it defeats the purpose of the quota altogether.

It really depends on what you're trying to achieve.

If the purpose is to keep a machine that returns consistently bad results from chewing through work, it's good enough. It'll take a while to throttle it down, so an occasional validator error or math bug won't hurt machines that are not broken.

It also means a broken machine (or maybe a bug-fix to the application) can recover quite quickly once it's fixed.

If you did away with the -1/x2 mechanism entirely, things would probably still be okay, there would just be a lot more reissues.



Wouldn't it cause fewer?
ID: 908591 · Report as offensive
Larry256
Volunteer tester

Joined: 11 Nov 05
Posts: 25
Credit: 5,715,079
RAC: 8
United States
Message 908595 - Posted: 18 Jun 2009, 0:57:31 UTC - in response to Message 908575.  

I was just looking at the cricket graph and there is still about 30mbit leftover (if you add inbound and outbound together, it shouldn't be more than 100mbit..in a theoretical sense), but none of my 40 pending uploads will go through. I don't think it's a lack of bandwidth, I think there's just too many requests happening for apache to handle all of them.


I finally removed/aborted all (12) of my stuck uploads. Yes, I lost some 600 credits, but all my cores were out of jobs.

Immediately thereafter Boinc downloaded 24 new WU's. So, I don't think it's a matter of lack of bandwidth, but lack of something else, database wise.

Sten-Arne


That will help them out: download, wait 1 hr, then abort.
Repeat until the problem is fixed.
LOL
ID: 908595 · Report as offensive


 