Panic Mode On (78) Server Problems?


Cherokee150
Joined: 11 Nov 99
Posts: 108
Credit: 25,119,543
RAC: 18,731
United States
Message 1303135 - Posted: 7 Nov 2012, 14:51:45 UTC - in response to Message 1303061.

Fred,
It appears that the 64 unit reporting limit you were told has already been changed by the staff. If you look at my post http://setiathome.berkeley.edu/forum_thread.php?id=69890&postid=1302985 and Paul's post http://setiathome.berkeley.edu/forum_thread.php?id=69890&postid=1303071 it would appear that they have now set the limit to 100 CPU, 100 GPU and 100 reports, or, probably, 100 on everything.

Sadly, this is not going to fix what is obviously a different problem, as the system has proven for a long time that it can handle far larger numbers of uploads, downloads, and caches per host. As most of us know, this whole problem has been recently introduced and will require a specific fix. Limiting our hosts only compounds the problem by forcing our hosts to hit the servers far more often.

I wish Matt were back, as this is his expertise (I know this to be true from my extensive, personally conducted tour of SETI and visit with him not too long ago). He would be able to track down the problem faster and patch things up quicker. Furthermore, there is also just too much work for the remaining staff to tackle problems quickly, and we are beginning to see the results of this.

Oh, for a government that would realize the importance of and would fund true scientific research, or, should I say, true scientific research that does not lead to more powerful weapons and surveillance technology!

But I digress... If you only knew what I knew....... (that almost sounds like lyrics to a song, lol!)

Fred E.
Project donor
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 2
United States
Message 1303147 - Posted: 7 Nov 2012, 15:29:37 UTC

Okay, I didn't pick up on the point you made that 100 worked now. I think I'll stay with a lower number to increase the probability of success in this environment. I also miss Matt's expertise, and funding is the key. I was crunching Orbit@Home when it ran out of money, and it's not pretty.


____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8629
Credit: 51,368,255
RAC: 50,221
United Kingdom
Message 1303160 - Posted: 7 Nov 2012, 16:18:14 UTC - in response to Message 1303061.

Sorry for the late response:

Scheduler was modified a few weeks ago to accept a max of 64...

It was actually months ago (May), and the figure was increased relatively quickly to accept 256 tasks reported at once.

BarryAZ
Joined: 1 Apr 01
Posts: 2580
Credit: 12,268,625
RAC: 4,144
United States
Message 1303215 - Posted: 7 Nov 2012, 18:37:44 UTC - in response to Message 1303160.

I believe others have been reporting the scheduler problem -- here's what I see.

No problems with uploads.

No problems with reporting *IF* I have set no new work.

Big problem with reporting and getting work if I have not set no new work. The scheduler goes into suspended animation -- it takes 5 to 10 minutes for the scheduler to time out and release that action. The report back is 'Timeout was reached, servers may be down.' They are not, but the scheduler is in trouble, as it has been for the past week.

My approach at this point, pending some confirmation back at the shop that this problem -- which others have reported, from what I've read -- is acknowledged and that a fix is in view, is simply to have all my SETI systems configured for no new work, let them complete and clear out, and have other projects pick up the slack.

I am certain that once folks acknowledge the problem and work on it, we'll get some information regarding anticipated resolution.

I would note I've seen this problem during the past week or so on multiple systems running multiple different versions of the BOINC software.

S@NL Etienne Dokkum
Volunteer tester
Joined: 11 Jun 99
Posts: 165
Credit: 16,947,878
RAC: 20,317
Netherlands
Message 1303224 - Posted: 7 Nov 2012, 18:57:30 UTC

Besides the max. number of tasks set, has anyone else noticed the following, or is it just me:

Just got 1 task, an AP. Nothing strange there, but it got an ETA of 4996:28:06 hours.

Weird, as even the laptop it runs on normally crunches it away in under 15 hours.
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8629
Credit: 51,368,255
RAC: 50,221
United Kingdom
Message 1303227 - Posted: 7 Nov 2012, 19:00:53 UTC - in response to Message 1303215.

Big problem with reporting and getting work if I have not set no new work. The scheduler goes into suspended animation -- it takes 5 to 10 minutes for the scheduler to time out and release that action. The report back is 'Timeout was reached, servers may be down.' They are not, but the scheduler is in trouble, as it has been for the past week.

There is some general confusion over the timeout message, and I'm not surprised: even the project staff were caught out by this one.

In fact, you see the timeout message when the boinc client - your own computer - decides that nothing is coming back from the server, and gives up - it stops listening.

The full message includes lines like

04/11/2012 22:13:51 | SETI@home | [sched_op] NVIDIA GPU work request: 43326.80 seconds; 0.00 GPUs
04/11/2012 22:19:00 | | [http] [ID#1] Info: Operation too slow. Less than 10 bytes/sec transferred the last 300 seconds
04/11/2012 22:19:00 | | [http] [ID#1] Info: Closing connection #0
04/11/2012 22:19:00 | | [http] HTTP error: Timeout was reached
04/11/2012 22:19:01 | SETI@home | Scheduler request failed: Timeout was reached

The actual timeout value - as in the 'too slow' line above - is 300 seconds or five minutes, but from what I've seen, some other activity (maybe a task finishing and uploading) can fool the communications subsystem into thinking that something is happening after all, and it allows some extra time.

The root cause of the problem is, obviously, the server taking too long to work out which tasks to send and assemble them into a suitable reply to your request. But there's no simple switch on the server to say 'extend the time limit': we'll just have to wait for them to uncover the root cause, and fix that instead.
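For anyone who wants to check the timing in their own event log, here is a minimal sketch that measures the gap between the work request line and the timeout line quoted above. It assumes the timestamps are in dd/mm/yyyy format and that the first '|'-separated field is always the timestamp; the function names are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# Two lines copied from the log excerpt above.
# Assumption: timestamps are dd/mm/yyyy hh:mm:ss.
LOG_LINES = [
    "04/11/2012 22:13:51 | SETI@home | [sched_op] NVIDIA GPU work request: 43326.80 seconds; 0.00 GPUs",
    "04/11/2012 22:19:00 | | [http] HTTP error: Timeout was reached",
]


def parse_timestamp(line: str) -> datetime:
    """Return the timestamp held in the first '|'-separated field of a log line."""
    stamp = line.split("|", 1)[0].strip()
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")


def seconds_between(first: str, second: str) -> float:
    """Return the number of seconds elapsed from the first log line to the second."""
    return (parse_timestamp(second) - parse_timestamp(first)).total_seconds()


elapsed = seconds_between(LOG_LINES[0], LOG_LINES[1])
print(f"Request to timeout: {elapsed:.0f} s")  # prints "Request to timeout: 309 s"
```

The 309 seconds here is the 300 second 'too slow' window plus a few seconds of overhead. A noticeably longer gap in your own log would fit the 'extra time' behaviour described above.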

Keith White
Joined: 29 May 99
Posts: 370
Credit: 2,884,953
RAC: 2,514
United States
Message 1303228 - Posted: 7 Nov 2012, 19:02:18 UTC - in response to Message 1303061.
Last modified: 7 Nov 2012, 19:17:36 UTC

It appears I don't have a cc_config.xml on my system already. I believe it goes in programdata\boinc (I'm running Win 7 64-bit) but it didn't seem to help (yes I shut the BOINC Manager down and restarted it). I'll paste the cc_config.xml below to get an opinion if I built it right.

<cc_config>
<options>
<max_tasks_reported>100</max_tasks_reported>
</options>
</cc_config>


That looks okay - and yes, it belongs in the top-level data directory, not one of the project directories or the program directory. Scheduler was modified a few weeks ago to accept a max of 64, so there's little point in using higher values unless you run other projects and need it there.

Project is running inconsistently. Looking at my log for last 8 hours, I see failure to connect, timeouts, no tasks available, over the limit messages, and I got 18 cpu tasks overnight. I couldn't get any for 24 hours before that. I'm still below limit for cpu work and down to 99 cpu tasks for 6 cores. Could be a problem here, so I'm looking for corroboration that Scheduler is refusing you when you're certain that you are below the limit (50/cpu core and 400/gpu). Hard to tell with the variety of responses and connection issues.

Thanks. I didn't think the cc_config file would do anything in my case, considering I'm seeing the same symptoms as last week. It appears that the uploaded units are being reported and processed OK; it's just that the client isn't getting the reply, or the new units being assigned. Unless of course I do a scheduler request with NNT, and then those "ready to report" marked units are wiped from my tasks tab.
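As a quick sanity check on the file itself, here is a minimal sketch that confirms the cc_config.xml quoted above is well-formed and reads back the max_tasks_reported value. The helper name is hypothetical, and this only checks the structure shown in this post; it says nothing about whether the client or the scheduler will honour the setting.

```python
import xml.etree.ElementTree as ET

# The file contents exactly as pasted earlier in this post.
CC_CONFIG = """<cc_config>
<options>
<max_tasks_reported>100</max_tasks_reported>
</options>
</cc_config>"""


def read_max_tasks_reported(xml_text: str) -> int:
    """Parse the XML and return the max_tasks_reported value as an integer.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed,
    and ValueError if the expected elements are missing.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "cc_config":
        raise ValueError("root element is not <cc_config>")
    node = root.find("options/max_tasks_reported")
    if node is None or node.text is None:
        raise ValueError("<options>/<max_tasks_reported> not found")
    return int(node.text.strip())


print(read_max_tasks_reported(CC_CONFIG))  # prints 100
```

If the parse fails, the usual suspects are a missing closing tag or the file having been saved with a hidden .txt extension.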

Also Ghost Detector was failing with the message "Hmm, Server indicates less Work Units 'In Progress' than client_state.xml thinks you have on board ... Aborted". I assume that's due to the syncing error. Once the done units were cleared out with NNT/report it ran fine and surprise, ghosts once again.

Edit: Oh, and I'm running around 300-340 pending with 3 CPU and 1 GPU task running when the schedule Nazi said "no more units for you" (Seinfeld joke variant, US TV comedy series for those who aren't familiar, I'm talking about the "This computer has reached a limit on tasks in progress" message). Split was around 90 for the CPU and the rest GPU but something like 95% of the GPU ones were shorties.
____________
"Life is just nature's way of keeping meat fresh." - The Doctor

BarryAZ
Joined: 1 Apr 01
Posts: 2580
Credit: 12,268,625
RAC: 4,144
United States
Message 1303251 - Posted: 7 Nov 2012, 19:41:52 UTC - in response to Message 1303227.

I don't want to have it extend the time limit -- I have other projects going and when the scheduler does this, it preempts other projects reporting for the duration. I'm hoping that the root cause is acknowledged, identified and dealt with -- until then, I'm completing SETI work on hand and getting and processing work from other projects.




The root cause of the problem is, obviously, the server taking too long to work out which tasks to send and assemble them into a suitable reply to your request. But there's no simple switch on the server to say 'extend the time limit': we'll just have to wait for them to uncover the root cause, and fix that instead.

Michael W.F. Miles
Joined: 24 Mar 07
Posts: 244
Credit: 28,805,627
RAC: 12,567
Canada
Message 1303256 - Posted: 7 Nov 2012, 20:03:07 UTC

It would seem that these server troubles started right when the New York disaster took place.

I am not sure if this is directly related to what is going on with the scheduler, but I hope it gets fixed very soon.
The limits imposed this time around make it very hard: just reporting finished tasks is a problem, as is getting enough work to feed 6 CPU cores and one GTX 460 for one day, when I am supposed to have enough work to keep all happy for 3 days according to my prefs settings.




N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11919
Credit: 14,593,683
RAC: 12,074
United States
Message 1303270 - Posted: 7 Nov 2012, 20:43:13 UTC
Last modified: 7 Nov 2012, 20:46:13 UTC

I just had a thought, but I don't know how feasible it is or whether it might help anything.

Would it help if they turn off downloads for five minutes every hour or half hour to allow uploads, reports, and miscellaneous traffic to get through unimpeded? They'd probably have to even stop ghost resends for this to work, but if it works it should dramatically reduce the number of ghosts after a few days.

edit: Even better would be to turn off uploads except during that five minute period, but to have any real effect on the network it would probably have to be a mod to the client software so it wouldn't try to send when it shouldn't.
____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Philhnnss
Joined: 22 Feb 08
Posts: 57
Credit: 10,534,064
RAC: 0
United States
Message 1303271 - Posted: 7 Nov 2012, 20:44:31 UTC

11/7/2012 2:32:31 PM
SETI@home
Not requesting tasks: some download is stalled


OK, I can kinda understand setting limits. That way more people can get work.
But the above I just don't get. Now you need to babysit your systems to make
sure all your downloads run as they should, or you can not have any work? I
am really starting to understand why people are becoming upset. This was
supposed to be a set and forget system wasn't it?

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8629
Credit: 51,368,255
RAC: 50,221
United Kingdom
Message 1303274 - Posted: 7 Nov 2012, 20:50:51 UTC - in response to Message 1303271.

11/7/2012 2:32:31 PM
SETI@home
Not requesting tasks: some download is stalled

OK, I can kinda understand setting limits. That way more people can get work.
But the above I just don't get. Now you need to babysit your systems to make
sure all your downloads run as they should, or you can not have any work? I
am really starting to understand why people are becoming upset. This was
supposed to be a set and forget system wasn't it?

The behaviour is unchanged - the BOINC client has never requested work when some download is stalled.

The trouble is, the old (silent) way generated thread after thread saying 'SETI isn't sending me any work'. Wrong. The computer wasn't requesting work, but the user didn't know it.

Now, with the additional messages (there are several of them), you can see at a glance what the reason for the drought is, and decide whether it's one you can (or want to) do something about yourself.

Philhnnss
Joined: 22 Feb 08
Posts: 57
Credit: 10,534,064
RAC: 0
United States
Message 1303277 - Posted: 7 Nov 2012, 20:59:10 UTC - in response to Message 1303274.

I guess I never got the old messages. It just upset me last night before I
went to work I had about 4 hours worth of cache built up. Thought that was
good. It would get more work as it ran. I guess as soon as I left, one
stalled. So when I checked after I got home, nothing. Aborted that download
and started over. Now I got it again. Frustrating!! It just seems like such a
waste to leave my systems on and not have any work because I can not babysit
them 24/7

Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 359,137
RAC: 27
Germany
Message 1303281 - Posted: 7 Nov 2012, 21:06:22 UTC - in response to Message 1303277.

I guess I never got the old messages...

You couldn't get the old messages, because there haven't been any (as Richard just stated ;-).

Gruß,
Gundolf

Philhnnss
Joined: 22 Feb 08
Posts: 57
Credit: 10,534,064
RAC: 0
United States
Message 1303283 - Posted: 7 Nov 2012, 21:10:36 UTC - in response to Message 1303281.

I guess I never got the old messages...

You couldn't get the old messages, because there haven't been any (as Richard just stated ;-).

Gruß,
Gundolf


Oops, didn't catch that. Sorry, just frustrated and venting. I'm good now, LOL!!

Mad Fritz
Joined: 20 Jul 01
Posts: 87
Credit: 11,334,904
RAC: 0
Switzerland
Message 1303329 - Posted: 7 Nov 2012, 23:21:49 UTC

Just let me know if that project will ever be working again... ATM I'm sick of it
____________

fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1303336 - Posted: 7 Nov 2012, 23:51:12 UTC

:)...This reminds me of a song from HeeHaw.

Gloom despair and agony on me
Deep dark depression excessive misery.

bluestar
Joined: 5 Sep 12
Posts: 258
Credit: 1,192,267
RAC: 297
Message 1303498 - Posted: 8 Nov 2012, 13:25:00 UTC
Last modified: 8 Nov 2012, 13:25:19 UTC

Got a computational error on one of my CUDA tasks.

This was because I was out shopping for the weekend.

Maybe it would be wise to not let CUDA tasks run if you are away from your computer?

Perhaps not so easy. You may happen to get new tasks. Unless you have suspended some tasks, those CUDA tasks which are either "Ready to Start" or "Waiting to Run" will start running automatically after 3 minutes of keyboard or mouse inactivity with the default settings in place.

I guess in this project things are never going to become perfect in its workings.

Fred E.
Project donor
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 2
United States
Message 1303507 - Posted: 8 Nov 2012, 14:06:57 UTC

Got a computational error on one of my CUDA tasks.

This was because I was out shopping for the weekend.

Maybe it would be wise to not let CUDA tasks run if you are away from your computer?

Perhaps not so easy. You may happen to get new tasks. Unless you have suspended some tasks, those CUDA tasks which are either "Ready to Start" or "Waiting to Run" will start running automatically after 3 minutes of keyboard or mouse inactivity with the default settings in place.

I guess in this project things are never going to become perfect in its workings.

Easy fix - don't shop on weekends! :=}

Don't worry about that one. It's a -12 error due to the application and data. These are usually regarded as no-fault. I often have one or two in my errors list.

On your website task page, you can drill down on the task to see the Std_Err report, which includes this:

SETI@home error -12 Unknown error
cudaAcc_find_triplets doesn't support more than MAX_TRIPLETS_ABOVE_THRESHOLD numBinsAboveThreshold in find_triplets_kernel

Don't do anything special. Most of us run unattended and check it periodically. If you encounter more errors or just have some questions, feel free to open your own thread and ask for help. We usually use this one for comments and observations on how the project is running. Welcome to the project!
____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4418
Credit: 118,408,769
RAC: 136,568
United States
Message 1303519 - Posted: 8 Nov 2012, 14:40:07 UTC - in response to Message 1303498.

Got a computational error on one of my CUDA tasks.

This was because I was out shopping for the weekend.

Maybe it would be wise to not let CUDA tasks run if you are away from your computer?

Perhaps not so easy. You may happen to get new tasks. Unless you have suspended some tasks, those CUDA tasks which are either "Ready to Start" or "Waiting to Run" will start running automatically after 3 minutes of keyboard or mouse inactivity with the default settings in place.

I guess in this project things are never going to become perfect in its workings.

If you desire to suspend GPU computing while you are away from your computer you can tell BOINC to do that. There are options from the manager for "snooze" and "snooze GPU". However, they are only for 1 hour IIRC. If you wanted to suspend GPU processing while you are out for a few hours there is a command line option.
If you wanted to suspend GPU processing for 4 hours you could use.
boinccmd --set_gpu_mode never 14400
Then after 4 hours BOINC would resume the previous state you had the GPU processing set. Normally it would be "auto" or "based on preferences".
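Since the duration is given in seconds, a small helper can save the mental arithmetic. This is only a sketch: the function name is made up for illustration, and it just builds the argument list for the command shown above without running anything.

```python
import shlex
from typing import List


def gpu_suspend_command(hours: float) -> List[str]:
    """Build the boinccmd argument list that suspends GPU work for the given hours.

    The duration argument is in seconds, so 4 hours becomes 14400.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    seconds = int(round(hours * 3600))
    return ["boinccmd", "--set_gpu_mode", "never", str(seconds)]


cmd = gpu_suspend_command(4)
print(shlex.join(cmd))  # prints "boinccmd --set_gpu_mode never 14400"
```

To actually run it you could pass the list to subprocess.run, assuming boinccmd is on your PATH and the BOINC client is running and reachable from where you launch it.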

I just leave the GPU to do what it will. Sometimes there will be an error which I have no control over & the GPU trashes several hundred tasks. I just restart the machine when I find there is an issue & make sure everything is OK afterwards.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!


Copyright © 2014 University of California