Panic Mode On (78) Server Problems?

Profile Fred E.
Volunteer tester

Send message
Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1303147 - Posted: 7 Nov 2012, 15:29:37 UTC

Okay, I didn't pick up on the point you made that 100 worked now. I think I'll stay with a lower number to increase the probability of success in this environment. I also miss Matt's expertise, and funding is the key. I was crunching Orbit@Home when it ran out of money, and it's not pretty.


Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1303147 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1303160 - Posted: 7 Nov 2012, 16:18:14 UTC - in response to Message 1303061.  

Sorry for the late response:

Scheduler was modified a few weeks ago to accept a max of 64...

It was actually months ago (May), and the figure was increased relatively quickly to accept 256 tasks reported at once.
ID: 1303160 · Report as offensive
BarryAZ

Send message
Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 1303215 - Posted: 7 Nov 2012, 18:37:44 UTC - in response to Message 1303160.  

I believe others have been reporting the scheduler problem -- here's what I see.

No problems with uploads.

No problems with reporting *IF* I have set no new work.

Big problem with reporting and getting work if I have not set no new work. The scheduler goes into suspended animation -- it takes 5 to 10 minutes for the scheduler to time out and release that action. The report back is 'Timeout was reached, servers may be down.' They are not, but the scheduler is in trouble as it has been for the past week.

My approach at this point, pending some confirmation back at the shop that this problem -- which others have reported, from what I've read -- is recognized and that a fix is in view, is simply to have all my SETI systems configured for no new work, let them complete and clear out, and have other projects pick up the slack.

I am certain that once folks acknowledge the problem and work on it, we'll get some information regarding anticipated resolution.

I would note I've seen this problem during the past week or so on multiple systems running multiple different versions of the BOINC software.
ID: 1303215 · Report as offensive
Profile S@NL Etienne Dokkum
Volunteer tester
Avatar

Send message
Joined: 11 Jun 99
Posts: 212
Credit: 43,822,095
RAC: 0
Netherlands
Message 1303224 - Posted: 7 Nov 2012, 18:57:30 UTC

Besides the max. number of tasks set, has anyone else noticed the following, or is it just me:

Just got 1 task, an AP. Nothing strange there but it got an ETA of 4996:28:06 hours

weird as even the laptop it runs on normally crunches it away in under 15 hours
ID: 1303224 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1303227 - Posted: 7 Nov 2012, 19:00:53 UTC - in response to Message 1303215.  

Big problem with reporting and getting work if I have not set no new work. The scheduler goes into suspended animation -- it takes 5 to 10 minutes for the scheduler to time out and release that action. The report back is 'Timeout was reached, servers may be down.' They are not, but the scheduler is in trouble as it has been for the past week.

There is some general confusion over the timeout message, and I'm not surprised: even the project staff were caught out by this one.

In fact, you see the timeout message when the boinc client - your own computer - decides that nothing is coming back from the server, and gives up - it stops listening.

The full message includes lines like

04/11/2012 22:13:51 | SETI@home | [sched_op] NVIDIA GPU work request: 43326.80 seconds; 0.00 GPUs
04/11/2012 22:19:00 | | [http] [ID#1] Info: Operation too slow. Less than 10 bytes/sec transferred the last 300 seconds
04/11/2012 22:19:00 | | [http] [ID#1] Info: Closing connection #0
04/11/2012 22:19:00 | | [http] HTTP error: Timeout was reached
04/11/2012 22:19:01 | SETI@home | Scheduler request failed: Timeout was reached

The actual timeout value - as in the 'too slow' line above - is 300 seconds or five minutes, but from what I've seen, some other activity (maybe a task finishing and uploading) can fool the communications subsystem into thinking that something is happening after all, and it allows some extra time.

The root cause of the problem is, obviously, the server taking too long to work out which tasks to send and assemble them into a suitable reply to your request. But there's no simple switch on the server to say 'extend the time limit': we'll just have to wait for them to uncover the root cause, and fix that instead.
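
For anyone who wants to check this in their own log, here is a rough Python sketch (not part of BOINC) that measures how long the client waited between the work request and the 'Scheduler request failed' line. It assumes the pipe-separated layout and day/month/year timestamps of the excerpt above; the function names are made up for illustration, and other client versions or locales may well format the log differently.

```python
from datetime import datetime

# Log excerpt copied from the post above.
LOG = """\
04/11/2012 22:13:51 | SETI@home | [sched_op] NVIDIA GPU work request: 43326.80 seconds; 0.00 GPUs
04/11/2012 22:19:00 | | [http] [ID#1] Info: Operation too slow. Less than 10 bytes/sec transferred the last 300 seconds
04/11/2012 22:19:00 | | [http] [ID#1] Info: Closing connection #0
04/11/2012 22:19:00 | | [http] HTTP error: Timeout was reached
04/11/2012 22:19:01 | SETI@home | Scheduler request failed: Timeout was reached
"""


def parse_line(line):
    """Split one log line into (timestamp, project, message).

    Assumes 'dd/mm/yyyy hh:mm:ss | project | message' as in the excerpt.
    """
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message


def seconds_until_timeout(log_text):
    """Return seconds between the last work request and the scheduler failure.

    Returns None if the log has no request followed by a failure.
    """
    request_time = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        when, _project, message = parse_line(line)
        if "work request" in message:
            request_time = when
        elif message.startswith("Scheduler request failed") and request_time:
            return (when - request_time).total_seconds()
    return None


if __name__ == "__main__":
    print(seconds_until_timeout(LOG))
```

On the excerpt above this gives 310 seconds: the 300-second 'too slow' window plus a few seconds before the client gives up.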
ID: 1303227 · Report as offensive
Keith White
Avatar

Send message
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 1303228 - Posted: 7 Nov 2012, 19:02:18 UTC - in response to Message 1303061.  
Last modified: 7 Nov 2012, 19:17:36 UTC

It appears I don't have a cc_config.xml on my system already. I believe it goes in programdata\boinc (I'm running Win 7 64-bit) but it didn't seem to help (yes I shut the BOINC Manager down and restarted it). I'll paste the cc_config.xml below to get an opinion if I built it right.

<cc_config>
<options>
<max_tasks_reported>100</max_tasks_reported>
</options>
</cc_config>


That looks okay - and yes, it belongs in the top-level data directory, not one of the project directories or the program directory. Scheduler was modified a few weeks ago to accept a max of 64, so there's little point in using higher values unless you run other projects and need it there.
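
If you would rather not hand-edit the file, a small Python sketch along these lines can build the same XML and read the value back to confirm it is well-formed. This is only an illustration: the helper names are invented here, it only handles the one `max_tasks_reported` option from the post, and where you save the result (the top-level BOINC data directory) is up to you.

```python
import xml.etree.ElementTree as ET


def build_cc_config(max_tasks_reported):
    """Return cc_config.xml text containing only the max_tasks_reported option."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "max_tasks_reported").text = str(max_tasks_reported)
    return ET.tostring(root, encoding="unicode")


def read_max_tasks_reported(xml_text):
    """Parse cc_config.xml text and return max_tasks_reported as an int.

    Returns None if the option is absent or empty.
    Raises xml.etree.ElementTree.ParseError if the XML is malformed.
    """
    node = ET.fromstring(xml_text).find("options/max_tasks_reported")
    return int(node.text) if node is not None and node.text else None


if __name__ == "__main__":
    text = build_cc_config(64)
    print(text)
    print(read_max_tasks_reported(text))
```

A parse error from `read_max_tasks_reported` would point to a typo in the tags, which is the most likely way to get a hand-built file wrong.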

Project is running inconsistently. Looking at my log for last 8 hours, I see failure to connect, timeouts, no tasks available, over the limit messages, and I got 18 cpu tasks overnight. I couldn't get any for 24 hours before that. I'm still below limit for cpu work and down to 99 cpu tasks for 6 cores. Could be a problem here, so I'm looking for corroboration that Scheduler is refusing you when you're certain that you are below the limit (50/cpu core and 400/gpu). Hard to tell with the variety of responses and connection issues.

Thanks. I didn't think the cc_config file would do anything in my case, considering I'm seeing the same symptoms as last week. It appears that the uploaded units are being reported and processed OK; it's just that the client isn't getting the reply, or the new units being assigned. Unless of course I do a scheduler request with NNT, and then those "ready to report" marked units are wiped from my tasks tab.

Also Ghost Detector was failing with the message "Hmm, Server indicates less Work Units 'In Progress' than client_state.xml thinks you have on board ... Aborted". I assume that's due to the syncing error. Once the done units were cleared out with NNT/report it ran fine and surprise, ghosts once again.

Edit: Oh, and I'm running around 300-340 pending with 3 CPU and 1 GPU task running when the schedule Nazi said "no more units for you" (Seinfeld joke variant, US TV comedy series for those who aren't familiar, I'm talking about the "This computer has reached a limit on tasks in progress" message). Split was around 90 for the CPU and the rest GPU but something like 95% of the GPU ones were shorties.
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 1303228 · Report as offensive
BarryAZ

Send message
Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 1303251 - Posted: 7 Nov 2012, 19:41:52 UTC - in response to Message 1303227.  

I don't want to have it extend the time limit -- I have other projects going and when the scheduler does this, it preempts other projects reporting for the duration. I'm hoping that the root cause is acknowledged, identified and dealt with -- until then, I'm completing SETI work on hand and getting and processing work from other projects.




The root cause of the problem is, obviously, the server taking too long to work out which tasks to send and assemble them into a suitable reply to your request. But there's no simple switch on the server to say 'extend the time limit': we'll just have to wait for them to uncover the root cause, and fix that instead.

ID: 1303251 · Report as offensive
Profile Michael W.F. Miles
Avatar

Send message
Joined: 24 Mar 07
Posts: 268
Credit: 34,410,870
RAC: 0
Canada
Message 1303256 - Posted: 7 Nov 2012, 20:03:07 UTC

It would seem that these server troubles started right when the New York disaster took place.

I am not sure if this is directly related to what is going on with the scheduler, but I hope it gets fixed very soon.
The limits imposed this time around make it very hard: just reporting finished tasks is a problem, as is getting enough work to feed 6 CPU cores and one GTX 460 for one day, when I am supposed to have enough work to keep all happy for 3 days according to my prefs settings.




ID: 1303256 · Report as offensive
David S
Volunteer tester
Avatar

Send message
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1303270 - Posted: 7 Nov 2012, 20:43:13 UTC
Last modified: 7 Nov 2012, 20:46:13 UTC

I just had a thought, but I don't know how feasible it is or whether it might help anything.

Would it help if they turn off downloads for five minutes every hour or half hour to allow uploads, reports, and miscellaneous traffic to get through unimpeded? They'd probably have to even stop ghost resends for this to work, but if it works it should dramatically reduce the number of ghosts after a few days.

edit: Even better would be to turn off uploads except during that five minute period, but to have any real effect on the network it would probably have to be a mod to the client software so it wouldn't try to send when it shouldn't.
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1303270 · Report as offensive
Philhnnss
Volunteer tester

Send message
Joined: 22 Feb 08
Posts: 63
Credit: 30,694,327
RAC: 162
United States
Message 1303271 - Posted: 7 Nov 2012, 20:44:31 UTC

11/7/2012 2:32:31 PM
SETI@home
Not requesting tasks: some download is stalled


OK, I can kinda understand setting limits. That way more people can get work.
But the above I just don't get. Now you need to babysit your systems to make
sure all your downloads run as they should, or you can not have any work? I
am really starting to understand why people are becoming upset. This was
supposed to be a set and forget system wasn't it?
ID: 1303271 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1303274 - Posted: 7 Nov 2012, 20:50:51 UTC - in response to Message 1303271.  

11/7/2012 2:32:31 PM
SETI@home
Not requesting tasks: some download is stalled

OK, I can kinda understand setting limits. That way more people can get work.
But the above I just don't get. Now you need to babysit your systems to make
sure all your downloads run as they should, or you can not have any work? I
am really starting to understand why people are becoming upset. This was
supposed to be a set and forget system wasn't it?

The behaviour is unchanged - the BOINC client has never requested work when some download is stalled.

The trouble is, the old (silent) way generated thread after thread saying 'SETI isn't sending me any work'. Wrong. The computer wasn't requesting work, but the user didn't know it.

Now, with the additional messages (there are several of them), you can see at a glance what the reason for the drought is, and decide whether it's one you can (or want to) do something about yourself.
ID: 1303274 · Report as offensive
Philhnnss
Volunteer tester

Send message
Joined: 22 Feb 08
Posts: 63
Credit: 30,694,327
RAC: 162
United States
Message 1303277 - Posted: 7 Nov 2012, 20:59:10 UTC - in response to Message 1303274.  

I guess I never got the old messages. It just upset me last night before I
went to work I had about 4 hours worth of cache built up. Thought that was
good. It would get more work as it ran. I guess as soon as I left, one
stalled. So when I checked after I got home, nothing. Aborted that download
and started over. Now I got it again. Frustrating!! It just seems like such a
waste to leave my systems on and not have any work because I can not babysit
them 24/7
ID: 1303277 · Report as offensive
Profile Gundolf Jahn

Send message
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 1303281 - Posted: 7 Nov 2012, 21:06:22 UTC - in response to Message 1303277.  

I guess I never got the old messages...

You couldn't get the old messages, because there haven't been any (as Richard just stated ;-).

Gruß,
Gundolf
ID: 1303281 · Report as offensive
Philhnnss
Volunteer tester

Send message
Joined: 22 Feb 08
Posts: 63
Credit: 30,694,327
RAC: 162
United States
Message 1303283 - Posted: 7 Nov 2012, 21:10:36 UTC - in response to Message 1303281.  

I guess I never got the old messages...

You couldn't get the old messages, because there haven't been any (as Richard just stated ;-).

Gruß,
Gundolf


Oops, didn't catch that. Sorry, just frustrated and venting. I'm good now, LOL!!
ID: 1303283 · Report as offensive
Profile Mad Fritz
Avatar

Send message
Joined: 20 Jul 01
Posts: 87
Credit: 11,334,904
RAC: 0
Switzerland
Message 1303329 - Posted: 7 Nov 2012, 23:21:49 UTC

Just let me know if that project will ever be working again... ATM I'm sick of it
ID: 1303329 · Report as offensive
fscheel

Send message
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1303336 - Posted: 7 Nov 2012, 23:51:12 UTC

:)...This reminds me of a song from HeeHaw.

Gloom despair and agony on me
Deep dark depression excessive misery.
ID: 1303336 · Report as offensive
bluestar

Send message
Joined: 5 Sep 12
Posts: 6979
Credit: 2,084,789
RAC: 3
Message 1303498 - Posted: 8 Nov 2012, 13:25:00 UTC
Last modified: 8 Nov 2012, 13:25:19 UTC

Got a computational error on one of my CUDA tasks.

This because I was out shopping for the weekend.

Maybe it should be wise to not let CUDA tasks run if you are away from your computer?

Perhaps not so easy. You may happen to get new tasks. Unless you have suspended some tasks, those CUDA tasks which are either "Ready to Start" or "Waiting to Run" will start running automatically after 3 minutes of keyboard or mouse inactivity with the default settings in place.

I guess in this project things are never going to become perfect in its workings.
ID: 1303498 · Report as offensive
Profile Fred E.
Volunteer tester

Send message
Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1303507 - Posted: 8 Nov 2012, 14:06:57 UTC

Got a computational error on one of my CUDA tasks.

This because I was out shopping for the weekend.

Maybe it should be wise to not let CUDA tasks run if you are away from your computer?

Perhaps not so easy. You may happen to get new tasks. Unless you have suspended some tasks, those CUDA tasks which are either "Ready to Start" or "Waiting to Run" will start running automatically after 3 minutes of keyboard or mouse inactivity with the default settings in place.

I guess in this project things are never going to become perfect in its workings.

Easy fix - don't shop on weekends! :=}

Don't worry about that one. It's a -12 error due to the application and data. These are usually regarded as no-fault. I often have one or two in my errors list.

On your website task page, you can drill down on the task to see the Std_Err report, which includes this:

SETI@home error -12 Unknown error
cudaAcc_find_triplets doesn't support more than MAX_TRIPLETS_ABOVE_THRESHOLD numBinsAboveThreshold in find_triplets_kernel

Don't do anything special. Most of us run unattended and check it periodically. If you encounter more errors or just have some questions, feel free to open your own thread and ask for help. We usually use this one for comments and observations on how the project is running. Welcome to the project!
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1303507 · Report as offensive
Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1303519 - Posted: 8 Nov 2012, 14:40:07 UTC - in response to Message 1303498.  

Got a computational error on one of my CUDA tasks.

This because I was out shopping for the weekend.

Maybe it should be wise to not let CUDA tasks run if you are away from your computer?

Perhaps not so easy. You may happen to get new tasks. Unless you have suspended some tasks, those CUDA tasks which are either "Ready to Start" or "Waiting to Run" will start running automatically after 3 minutes of keyboard or mouse inactivity with the default settings in place.

I guess in this project things are never going to become perfect in its workings.

If you desire to suspend GPU computing while you are away from your computer you can tell BOINC to do that. There are options from the manager for "snooze" and "snooze GPU". However they are only for 1 hour IIRC. If you wanted to suspend GPU processing while you are out for a few hours there is a command line option.
If you wanted to suspend GPU processing for 4 hours you could use:
boinccmd --set_gpu_mode never 14400
Then after 4 hours BOINC would resume the previous state you had the GPU processing set. Normally it would be "auto" or "based on preferences".
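
The number at the end is just the duration in seconds (4 hours x 3600 = 14400). As a minimal sketch, assuming `boinccmd` is on your PATH and accepts the arguments exactly as in the command above, a small Python helper (the name `gpu_snooze_command` is made up here) could build the command for any number of hours:

```python
import subprocess


def gpu_snooze_command(hours):
    """Build the boinccmd argument list that suspends GPU work for `hours`.

    Mirrors the command quoted above: boinccmd --set_gpu_mode never <seconds>
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    seconds = int(round(hours * 3600))
    return ["boinccmd", "--set_gpu_mode", "never", str(seconds)]


if __name__ == "__main__":
    command = gpu_snooze_command(4)
    print(" ".join(command))
    # Uncomment to actually run it (assumes boinccmd is installed and on PATH):
    # subprocess.run(command, check=True)
```

The call to `subprocess.run` is left commented out so that running the script only prints the command until you decide to use it.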

I just leave the GPU to do what it will. Sometimes there will be an error which I have no control over & the GPU trashes several hundred tasks. I just restart the machine when I find there is an issue & make sure everything is OK afterwards.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1303519 · Report as offensive
bluestar

Send message
Joined: 5 Sep 12
Posts: 6979
Credit: 2,084,789
RAC: 3
Message 1303538 - Posted: 8 Nov 2012, 15:19:27 UTC
Last modified: 8 Nov 2012, 15:23:00 UTC

Thanks for those comments both of you!

Good catch there, but not that task anyway.

I just had a dream about watching some tasks dated 1999 -> by means of some tapes having been split and run.

I have been able to locate some tasks dated 2005 on the three discs that are in my computer right now. The rest of it will have to wait for a couple more weeks or months to get back to the surface.

Perhaps I missed something there which I should have been able to take a note of.
ID: 1303538 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.