Panic Mode On (57) Server problems?


Previous · 1 · 2 · 3 · 4 · 5 . . . 10 · Next

Profile Jeff Bakle
Volunteer tester

Send message
Joined: 24 Dec 99
Posts: 19
Credit: 5,056,116
RAC: 1
United States
Message 1158235 - Posted: 2 Oct 2011, 13:32:53 UTC - in response to Message 1158177.  

Team,

The following messages are typical today...

10/1/2011 7:40:19 PM | SETI@home | update requested by user
10/1/2011 7:40:21 PM | SETI@home | Sending scheduler request: Requested by user.
10/1/2011 7:40:21 PM | SETI@home | Requesting new tasks for NVIDIA GPU
10/1/2011 7:40:25 PM | SETI@home | Scheduler request completed: got 0 new tasks
10/1/2011 7:40:25 PM | SETI@home | No tasks sent
10/1/2011 7:40:25 PM | SETI@home | No tasks are available for SETI@home Enhanced
10/1/2011 7:40:25 PM | SETI@home | Tasks for CPU are available, but your preferences are set to not accept them


Jeff


I wonder if that means there are a lot of VLARs around (VLARs aren't sent to Nvidia GPUs).


Out of the 88 units I've downloaded overnight, only 6 were VLARs.
Most were VHARs, with a few mid-range units.


I think Jeff should check his project preferences then, and confirm that SETI@home Enhanced is still selected, then report back.

Claggy,

I have checked SETI@home Enhanced, and it is still selected. I have reset my project (no issues since I have not had any work for 2 days), and reset my queue for the maximum 10 days. I am still not getting work with the same response from the server...

I appreciate your help regardless! I will keep working on it.


Jeff
ID: 1158235 · Report as offensive
Cruncher-American · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor

Send message
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1158247 - Posted: 2 Oct 2011, 13:47:05 UTC - in response to Message 1158216.  

No problems here, everything works perfect....


Except that it doesn't :-)

However it will take more than this to make me angry, and start threatening to quit. Much, much more.



Don't look now, but.... I'm sure Berkeley will find a way...
ID: 1158247 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1158258 - Posted: 2 Oct 2011, 14:30:59 UTC - in response to Message 1158249.  

However it will take more than this to make me angry, and start threatening to quit. Much, much more.

SETI could go offline for a year; I would still wait for it to come back.

Don't look now, but.... I'm sure Berkeley will find a way...

What God has joined together let no man put asunder.

Ooops sorry, picked up the wrong hymn book. Ahem. Fight the good fight with ......

Aye aye, Chris.
The kitties would keep sniffing for WUs a loooooooooooooooooooong time if the project were forced into another extended downtime.
And they will keep doing what they can to keep that from happening.

Meanwhile, back at the crunching farm, all rigs have finally been able to top off their caches to the current limits: 50 CPU, 400 GPU per rig.
I am sure the fact that the GPU work is not all VHAR at the moment has helped immensely. Hopefully the limits can be increased a bit next week, although with further adjustments to the BOINC server code in the offing, I am guessing they will not be lifted entirely for some time, until most hosts have had more time to adjust their DCFs accordingly.
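
For readers unfamiliar with DCF: the duration correction factor is a per-host multiplier the BOINC client applies to the server's runtime estimates when deciding how much work to fetch. The Python sketch below is only an illustration of the idea, not the actual client code, and the smoothing constant is an assumption; it shows why hosts take a while to settle after estimates change.

def update_dcf(dcf, estimated_secs, actual_secs):
    # Rough, illustrative DCF update; the real BOINC client's
    # constants and edge cases differ.
    ratio = actual_secs / estimated_secs
    if ratio > dcf:
        # Tasks running longer than estimated: correct upward quickly.
        dcf = ratio
    else:
        # Tasks running shorter: drift downward slowly (assumed 10% step).
        dcf += 0.1 * (ratio - dcf)
    return dcf

The client's planned cache is then based on roughly server_estimate * dcf per task, so a cache sized in days only becomes accurate once the DCF has converged.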
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1158258 · Report as offensive
Profile soft^spirit
Avatar

Send message
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1158288 - Posted: 2 Oct 2011, 15:46:33 UTC - in response to Message 1158221.  

So.. I keep hearing about this limit of 50/cpu... my single core machine has 99 in progress and the messages tab says nothing about a limit. Hm. Now it's 100, still no message. Either it's a glitch, or it's because a 10-day cache on this machine is somewhere between 90-110 MBs on average, at least with the weird estimates and most of them being shorties.

I keep hitting the limit on 2 PCs. What I don't understand is why BOINC still keeps asking for CPU work; just now it has asked for new CPU work 5 times in a row. Wouldn't it be better if it waited until a CPU task is finished before asking for more?

BOINC hasn't been programmed to take any notice of the reason why no work is issued - even when the reason is stated (it isn't always).

The newer BOINC v6.12.xx are programmed to back off and ask less frequently when no work is forthcoming - but the backoff is reset to zero when a task completes, and BOINC allows time for the just-completed task to upload and be available for reporting before requesting new work. So, its behaviour is quite close to what you're suggesting: if you have reached the quota limit, there's a fair likelihood that your next scheduler request will be 'report one, get one replacement'.

You, on the other hand, are running BOINC v6.10.58/60 - that version of BOINC simply knows how much work you've said that you'd like to have, and keeps asking 'more, more, more', even when the server is replying 'no, no, no'. If you are concerned about the strain that your repeated fruitless requests are placing on the servers, you could consider upgrading to BOINC v6.12.34, or temporarily reduce your cache size to something which matches the current (temporary) quota limit more closely.
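
To make the back-off behaviour described above concrete, here is a minimal Python sketch of the idea. It is not the actual BOINC client code, and the minimum and maximum intervals are assumptions; it only captures "lengthen the wait after each empty reply, reset it when a task finishes".

import random

MIN_BACKOFF = 60          # seconds; assumed floor, not BOINC's real value
MAX_BACKOFF = 4 * 3600    # seconds; assumed ceiling, not BOINC's real value

class WorkFetchBackoff:
    def __init__(self):
        self.interval = 0   # no back-off while work is flowing

    def on_empty_reply(self):
        # Scheduler answered "got 0 new tasks": lengthen the wait.
        self.interval = max(MIN_BACKOFF, min(self.interval * 2, MAX_BACKOFF))
        # Randomise a little so many hosts don't retry in lock-step.
        return self.interval * random.uniform(0.5, 1.0)

    def on_task_completed(self):
        # A finished task clears the back-off, so the next request can
        # report the result and ask for a replacement straight away.
        self.interval = 0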


While I truly support the IDEA of 6.12.XX asking less frequently, I have found the reality of trying to use it frustrating beyond belief, especially on a fast machine. The backoffs are currently horrendous, unmanageable, and require constant button abuse in order to try to get a cache of any size. This is especially true if there is ANY network congestion (not that we have seen that lately).

If you take a look at the first couple of pages of "Top hosts", it seems I am not the only person with similar experiences. I saw only 1 of the top 40 machines running 6.12. I myself tried it, and had to go back.

I really do not want to beat up servers any more than necessary, but 6.12 is IMO not ready for prime time.

This, along with the current limits providing only enough work for a fast machine to last a few hours, makes every little bump in the road a major issue. Keeping the limits in place seems a necessary band-aid for the recent "upgrades" (*cough*) to the server software, but an increase would be much appreciated by the faster machines.
Janice
ID: 1158288 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1158291 - Posted: 2 Oct 2011, 15:50:43 UTC

That wasn't a problem for me for the last 12 months, s^s.



With each crime and every kindness we birth our future.
ID: 1158291 · Report as offensive
Claggy
Volunteer tester

Send message
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1158294 - Posted: 2 Oct 2011, 15:57:41 UTC - in response to Message 1158288.  
Last modified: 2 Oct 2011, 15:59:11 UTC

While I truly support the IDEA of 6.12.XX asking less frequently, I have found the reality of trying to use it frustrating beyond belief, especially on a fast machine. The backoffs are currently horrendous, unmanageable, and require constant button abuse in order to try to get a cache of any size. This is especially true if there is ANY network congestion (not that we have seen that lately).

I do find it's rather hard keeping a full cache of Astropulse with BOINC 6.12.x; it just doesn't ask enough, which means my T8100 is empty.

Claggy
ID: 1158294 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 17 Feb 01
Posts: 34249
Credit: 79,922,639
RAC: 80
Germany
Message 1158297 - Posted: 2 Oct 2011, 16:01:52 UTC - in response to Message 1158294.  

While I truly support the IDEA of 6.12.XX asking less frequently, I have found the reality of trying to use it frustrating beyond belief, especially on a fast machine. The backoffs are currently horrendous, unmanageable, and require constant button abuse in order to try to get a cache of any size. This is especially true if there is ANY network congestion (not that we have seen that lately).

I do find it's rather hard keeping a full cache of Astropulse with BOINC 6.12.x; it just doesn't ask enough, which means my T8100 is empty.

Claggy


When the servers run stably for a few weeks it's also no problem.
But SETI hasn't been stable for the last couple of months.



With each crime and every kindness we birth our future.
ID: 1158297 · Report as offensive
Profile soft^spirit
Avatar

Send message
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1158298 - Posted: 2 Oct 2011, 16:02:53 UTC - in response to Message 1158291.  

That wasn't a problem for me for the last 12 months, s^s.


As I said, the slower machines do not have nearly as much of an issue. A couple of hundred units lasts a while. Right now with clean upload/download pipes, it is not an issue. But during congestion it certainly is.
Janice
ID: 1158298 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1158308 - Posted: 2 Oct 2011, 16:27:33 UTC - in response to Message 1158221.  

So.. I keep hearing about this limit of 50/cpu... my single core machine has 99 in progress and the messages tab says nothing about a limit. Hm. Now it's 100, still no message. Either it's a glitch, or it's because a 10-day cache on this machine is somewhere between 90-110 MBs on average, at least with the weird estimates and most of them being shorties.

I keep hitting the limit on 2 PCs. What I don't understand is why BOINC still keeps asking for CPU work; just now it has asked for new CPU work 5 times in a row. Wouldn't it be better if it waited until a CPU task is finished before asking for more?

BOINC hasn't been programmed to take any notice of the reason why no work is issued - even when the reason is stated (it isn't always).

The newer BOINC v6.12.xx are programmed to back off and ask less frequently when no work is forthcoming - but the backoff is reset to zero when a task completes, and BOINC allows time for the just-completed task to upload and be available for reporting before requesting new work. So, its behaviour is quite close to what you're suggesting: if you have reached the quota limit, there's a fair likelihood that your next scheduler request will be 'report one, get one replacement'.

You, on the other hand, are running BOINC v6.10.58/60 - that version of BOINC simply knows how much work you've said that you'd like to have, and keeps asking 'more, more, more', even when the server is replying 'no, no, no'. If you are concerned about the strain that your repeated fruitless requests are placing on the servers, you could consider upgrading to BOINC v6.12.34, or temporarily reduce your cache size to something which matches the current (temporary) quota limit more closely.

I ended up topping out at 103 tasks. BOINC finally stopped asking for work, since three tasks have completed and the DCFs have adjusted slightly. Between 98 and 103, it was a case of asking for <100 seconds of work and getting 1 new task in response.

I'm still curious why this machine gets past that 50 limit for CPU.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1158308 · Report as offensive
Kevin Olley

Send message
Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 1158315 - Posted: 2 Oct 2011, 17:30:41 UTC - in response to Message 1158288.  


While I truly support the IDEA of 6.12.XX asking less frequently, I have found the reality of trying to use it frustrating beyond belief, especially on a fast machine. The backoffs are currently horrendous, unmanageable, and require constant button abuse in order to try to get a cache of any size. This is especially true if there is ANY network congestion (not that we have seen that lately).

If you take a look at the first couple of pages of "Top hosts", It seems I am not the only person with similar experiences. I saw only 1 of the top 40 machines running 6.12. I myself tried, and had to go back.



Never got round to trying 6.12, but if there had been a surge of interest among the "Top hosts" I would have seen it and followed suit.

This machine is not the fastest by any means, but to give some of the readers of this board an idea of what it needs to keep running:

Regular WUs: approx. 500 per day for the GPUs.

VHAR WUs: almost THREE thousand per day for the GPUs (1 VHAR processed every 30 seconds).

CPU: I only use 3 cores and it's only an old AMD, so it only does 30-50 WUs per day.

Multiply that out and you should see what I need to download to have a few days cache for the next hiccup.
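
Worked out from the figures above: one VHAR every 30 seconds is 86,400 / 30 = 2,880 WUs per day on the GPUs alone, so even a three-day cache of shorties for this single host is already on the order of 8,000 to 9,000 tasks.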

The very top machines need 2 to 3 times that amount to keep going flat out and even more to rebuild a cache.


Kevin


ID: 1158315 · Report as offensive
SupeRNovA
Volunteer tester
Avatar

Send message
Joined: 25 Oct 04
Posts: 131
Credit: 12,741,814
RAC: 0
Bulgaria
Message 1158387 - Posted: 2 Oct 2011, 22:14:04 UTC

I think that the top 100 hosts need a bigger WU limit that allows them to store more work.
ID: 1158387 · Report as offensive
Dave Stegner
Volunteer tester
Avatar

Send message
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1158388 - Posted: 2 Oct 2011, 22:28:10 UTC

I keep reading about the WU limit. I am not sure it isn't an indication of some other problem.

I.e.:

I have 2 hardware-identical machines, within a few tens of RAC points of each other. One has over 400 WUs and the other 120. The one with 120 SOMETIMES tells me I have reached the WU limit, sometimes not. The one with 400+ does not.

With all the different response messages, at different times, it would almost appear that the scheduler response message is purely random.
Dave

ID: 1158388 · Report as offensive
Kevin Olley

Send message
Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 1158402 - Posted: 2 Oct 2011, 23:24:01 UTC - in response to Message 1158388.  

I keep reading about the WU limit. I am not sure it isn't an indication of some other problem.

I.e.:

I have 2 hardware-identical machines, within a few tens of RAC points of each other. One has over 400 WUs and the other 120. The one with 120 SOMETIMES tells me I have reached the WU limit, sometimes not. The one with 400+ does not.

With all the different response messages, at different times, it would almost appear that the scheduler response message is purely random.


The message may appear to be random, but when everything is running OK my "tasks in progress" tops out at 450 WUs.


Kevin


ID: 1158402 · Report as offensive
Kevin Olley

Send message
Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 1158410 - Posted: 3 Oct 2011, 0:09:33 UTC - in response to Message 1158387.  

I think that the top 100 hosts need a bigger WU limit that allows them to store more work.


What we need is a gradual increase in the WU limit, so that the work flows in and out without swamping the pipe; this could be done for all hosts, not just the top performers.

What we don't need is the limit just being removed and the usual swamped pipe problems.

It would also be nice if the relaxing of the limit were done when we are not in the middle of a shortie storm.


Kevin


ID: 1158410 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Avatar

Send message
Joined: 30 Nov 03
Posts: 65690
Credit: 55,293,173
RAC: 49
United States
Message 1158419 - Posted: 3 Oct 2011, 2:10:55 UTC

Every time I try to get more WUs (since about 2:21pm) I get an HTTP internal server error, and yep, no downloads as a result. I'm 41 short of the 400, of course. Hopefully this will be fixed sometime on Monday.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1158419 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1158431 - Posted: 3 Oct 2011, 3:28:02 UTC

Scheduler is coming and going. Most of the evening it has been "couldn't connect to server", then I got a successful contact, and the next one was "no headers, no data".
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1158431 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1158433 - Posted: 3 Oct 2011, 3:34:27 UTC
Last modified: 3 Oct 2011, 3:35:00 UTC

I think the database is getting tied in knots...and the servers with it.
Uploads seem to be connecting and going through A-OK.
And downloads, when you finally get some, are pretty zippy too.

But connecting to the scheduler to report tasks and/or request work is very hit and miss right now, and even once you seem to connect, most attempts are resulting in 'http internal server error'.

Will probably remain that way until tomorrow morning when somebody's back in the lab to sort things again.

Meowsigh.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1158433 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1158435 - Posted: 3 Oct 2011, 3:41:28 UTC

There has to be something wrong with db_purge. Unless Jeff forgot to turn it back on after the Tuesday maintenance. Bloated DB seems most likely.

If on Monday it goes down to give the database a chance to catch up on the backlog and then do the compression and backup on Tuesday, I'm fine with that. If it means fixing things, I think we can all agree to ~30 hours of downtime for fixing instead of what we're dealing with now.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1158435 · Report as offensive
Profile soft^spirit
Avatar

Send message
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1158440 - Posted: 3 Oct 2011, 4:15:59 UTC - in response to Message 1158435.  

There has to be something wrong with db_purge. Unless Jeff forgot to turn it back on after the Tuesday maintenance. Bloated DB seems most likely.

If on Monday it goes down to give the database a chance to catch up on the backlog and then do the compression and backup on Tuesday, I'm fine with that. If it means fixing things, I think we can all agree to ~30 hours of downtime for fixing instead of what we're dealing with now.



I would agree to however long they need to fix it, presuming of course they have the tools to fix it.
Janice
ID: 1158440 · Report as offensive
Blake Bonkofsky
Volunteer tester
Avatar

Send message
Joined: 29 Dec 99
Posts: 617
Credit: 46,383,149
RAC: 0
United States
Message 1158441 - Posted: 3 Oct 2011, 4:19:13 UTC - in response to Message 1158440.  

Looks like it's finally broken for the night. Haven't been able to connect in an hour now, and the times on the status page are falling behind. The page itself is updating on time, but various statistics on the right are all an hour behind now, and climbing.
ID: 1158441 · Report as offensive
Previous · 1 · 2 · 3 · 4 · 5 . . . 10 · Next
