Panic Mode On (78) Server Problems?




Message boards : Number crunching : Panic Mode On (78) Server Problems?

Author Message
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 26 May 99
Posts: 7082
Credit: 27,544,151
RAC: 36,221
United Kingdom
Message 1304872 - Posted: 11 Nov 2012, 11:56:09 UTC - in response to Message 1304870.

I cannot believe this!!

I accidentally unset NNT on one machine. I realised quite quickly and reset it. However, although it reported "Project communication failed", when I checked I had 56 ghosts.

Now to get them back I had to unset NNT again, and I got them plus another 81 ghosts!! As this was a mistake, and I don't crunch SETI on this machine any more, I am trying to decide what to do. I don't really want to abandon them, but SETI@Home no longer deserves my time and effort!

Very annoying!!!

Set a smaller cache size, and don't try to get the full 20 resends at once. Just because AP isn't being split doesn't mean everything is magically fixed; scheduler contacts still sometimes take a long time:

11/11/2012 10:26:12 SETI@home [sched_op_debug] Starting scheduler request
11/11/2012 10:26:12 SETI@home Sending scheduler request: Requested by user.
11/11/2012 10:26:12 SETI@home Reporting 1 completed tasks, requesting new tasks for CPU
11/11/2012 10:26:12 SETI@home [sched_op_debug] CPU work request: 409918.70 seconds; 0.00 CPUs
11/11/2012 10:26:12 SETI@home [sched_op_debug] NVIDIA GPU work request: 0.00 seconds; 0.00 GPUs
11/11/2012 10:26:12 SETI@home [sched_op_debug] ATI GPU work request: 0.00 seconds; 0.00 GPUs
11/11/2012 10:30:20 SETI@home Scheduler request completed: got 7 new tasks
11/11/2012 10:30:20 SETI@home [sched_op_debug] Server version 701
11/11/2012 10:30:20 SETI@home Message from server: Resent lost task 04se12aa.20122.12750.140733193388047.10.230_0
11/11/2012 10:30:20 SETI@home Message from server: Resent lost task 04se12aa.20122.12750.140733193388047.10.233_0
11/11/2012 10:30:20 SETI@home Message from server: Resent lost task 04se12aa.20269.12750.140733193388048.10.255_0
11/11/2012 10:30:20 SETI@home Message from server: Resent lost task 04se12aa.20122.12750.140733193388047.10.241_0
11/11/2012 10:30:20 SETI@home Message from server: Resent lost task 04se12aa.20122.12750.140733193388047.10.247_0
11/11/2012 10:30:20 SETI@home Message from server: Resent lost task 04se12aa.20269.12750.140733193388048.10.205_1
11/11/2012 10:30:20 SETI@home Message from server: Resent lost task 04se12aa.20269.12750.140733193388048.10.200_0
11/11/2012 10:30:20 SETI@home Project requested delay of 303 seconds
11/11/2012 10:30:20 SETI@home [sched_op_debug] estimated total CPU job duration: 8399 seconds
11/11/2012 10:30:20 SETI@home [sched_op_debug] estimated total NVIDIA GPU job duration: 0 seconds
11/11/2012 10:30:20 SETI@home [sched_op_debug] estimated total ATI GPU job duration: 0 seconds
11/11/2012 10:30:20 SETI@home [sched_op_debug] handle_scheduler_reply(): got ack for result 29se12ab.30551.24198.140733193388036.10.127_1
11/11/2012 10:30:20 SETI@home [sched_op_debug] Deferring communication for 5 min 3 sec
11/11/2012 10:30:20 SETI@home [sched_op_debug] Reason: requested by project
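As an aside, the ghost resends in a log excerpt like the one above can be tallied by script rather than counted by eye. A minimal sketch (the sample data is abbreviated from the log above; point it at your own event log text in practice):

```python
# Tally "Resent lost task" lines in a pasted BOINC event-log excerpt.
# Sample data abbreviated from the log above.
sample_log = """\
11/11/2012 10:30:20 SETI@home Message from server: Resent lost task 04se12aa.20122.12750.140733193388047.10.230_0
11/11/2012 10:30:20 SETI@home Message from server: Resent lost task 04se12aa.20122.12750.140733193388047.10.233_0
11/11/2012 10:30:20 SETI@home Project requested delay of 303 seconds
"""

resent = [line.rsplit(" ", 1)[1]                # task name is the last field
          for line in sample_log.splitlines()
          if "Resent lost task" in line]
print(len(resent), "ghosts resent this contact")   # 2 ghosts resent this contact
```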

Claggy

Sorry, I've realised I don't care. I will abandon the ones I have, and the others will time out naturally.
____________


Today is life, the only life we're sure of. Make the most of today.

ClaggyProject donor
Volunteer tester
Send message
Joined: 5 Jul 99
Posts: 4141
Credit: 33,627,110
RAC: 28,119
United Kingdom
Message 1304877 - Posted: 11 Nov 2012, 12:03:10 UTC - in response to Message 1304872.

The servers are still in recovery after the AP splitting; no doubt it'll be some time before everyone's ghosts are resent.

There is a scheduler bug fix in the works; hopefully it'll be deployed at SETI Beta on Monday. I'm not expecting it to be a total cure, just a step in the right direction.

Claggy

Profile Khangollo
Avatar
Send message
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1304911 - Posted: 11 Nov 2012, 14:01:15 UTC - in response to Message 1304863.

My RAC has been steadily declining, and I have noticed that the task list for two of my rigs shows most of the assigned tasks under Error, with the status as abandoned. Could anyone tell me why this might occur? The rigs still have all the tasks and are crunching them but, obviously, not gaining any credit for the work being done. Should I reset the rig, or is this something that will get sorted out automatically?

The same happened to me on one computer. After thousands of scheduler timeouts, one request apparently got mangled/misinterpreted badly enough that the server thought I had reset the project and decided to abandon all tasks. That's at least what I think happened...
In any case, you should abort all those tasks with BOINC Manager, as they will not get deleted automatically and the server will just ignore them (you won't get any credit; they were already marked abandoned for you and re-sent to other crunchers).
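If there are hundreds of them, aborting tasks one at a time in BOINC Manager gets tedious; one possible shortcut is to script boinccmd's `--task <URL> <name> abort` operation. This is only a sketch: the task names below are hypothetical placeholders (substitute the ones shown as abandoned on your host), it assumes boinccmd is on your PATH, and the dry-run flag means nothing is aborted until you flip it:

```python
import subprocess

PROJECT_URL = "http://setiathome.berkeley.edu/"
# Hypothetical placeholder names; use the ones marked abandoned on your host.
abandoned = [
    "example_task_name_1_0",
    "example_task_name_2_0",
]

DRY_RUN = True  # set to False to actually issue the aborts

commands = [["boinccmd", "--task", PROJECT_URL, name, "abort"]
            for name in abandoned]

for cmd in commands:
    if DRY_RUN:
        print(" ".join(cmd))        # preview only
    else:
        subprocess.run(cmd, check=True)
```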

____________

Profile Bill GProject donor
Avatar
Send message
Joined: 1 Jun 01
Posts: 349
Credit: 43,136,905
RAC: 47,816
United States
Message 1304961 - Posted: 11 Nov 2012, 16:19:17 UTC - in response to Message 1304877.

The servers are still in recovery after the AP splitting; no doubt it'll be some time before everyone's ghosts are resent.

There is a scheduler bug fix in the works; hopefully it'll be deployed at SETI Beta on Monday. I'm not expecting it to be a total cure, just a step in the right direction.

Claggy


Yes, it will be some time. I am now under 3000 ghosts on one computer and with the cache limit it will be a long time before they are all sent to me. But at least downloads are working as they should.

Most of the CPU tasks are being set to run at High Priority once they are scheduled to run. It is interesting to note that only 5 of the 8 CPU tasks are at High Priority at a time; they start in normal running mode, then become High Priority as a WU gets finished.
____________

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5868
Credit: 60,607,065
RAC: 47,475
Australia
Message 1304998 - Posted: 11 Nov 2012, 18:13:53 UTC - in response to Message 1304877.

There is a scheduler Bug fix in the works, hopefully it'll be deployed at Seti Beta on Monday, not expecting it to be a total cure, just a step in the right direction,

Will that fix the Scheduler timeout problems, or fix the increasing number of Ghosts that are created when the Scheduler keeps timing out?

____________
Grant
Darwin NT.

Profile Michael W.F. Miles
Avatar
Send message
Joined: 24 Mar 07
Posts: 244
Credit: 28,850,004
RAC: 11,160
Canada
Message 1305011 - Posted: 11 Nov 2012, 18:39:04 UTC

What gets me here is that, after fixing the connection issue with a proxy server, I now have to check on this machine every four hours, as I am only getting enough work to keep it going for four hours before the limits kick in.

I have been running Ghostdet, and two days ago I was getting 54% ghost tasks.
Yesterday was 15%
Today 0% with 200 tasks on board.

They are all mostly shorties though.

Seems to be working itself out.

Now every time I put in a work request, the servers will only do one-to-one:
report one, get one.

I hope this gets solved really fast, as my patience is wearing thin, as most people's is.

We have built the fastest computer system in the world; let's keep it busy.

fscheel
Send message
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1305017 - Posted: 11 Nov 2012, 19:01:29 UTC

How does one go about finding a proxy that is safe to use?

Frank

ClaggyProject donor
Volunteer tester
Send message
Joined: 5 Jul 99
Posts: 4141
Credit: 33,627,110
RAC: 28,119
United Kingdom
Message 1305020 - Posted: 11 Nov 2012, 19:06:52 UTC - in response to Message 1304998.

There is a scheduler Bug fix in the works, hopefully it'll be deployed at Seti Beta on Monday, not expecting it to be a total cure, just a step in the right direction,

Will that fix the Scheduler timeout problems, or fix the increasing number of Ghosts that are created when the Scheduler keeps timing out?

It'll fix the bug of resending work to the wrong device, i.e. BOINC asks for CPU work only but gets resends for the GPU instead (which wasn't asking for work), which then times out on any VLARs it encounters.

Claggy

Profile Vipin Palazhi
Avatar
Send message
Joined: 29 Feb 08
Posts: 249
Credit: 107,294,587
RAC: 74,691
India
Message 1305226 - Posted: 12 Nov 2012, 2:49:02 UTC - in response to Message 1304911.

The same happened to me on one computer. After thousands of scheduler timeouts, one request apparently got mangled/misinterpreted badly enough that the server thought I had reset the project and decided to abandon all tasks. That's at least what I think happened...
In any case, you should abort all those tasks with BOINC Manager, as they will not get deleted automatically and the server will just ignore them (you won't get any credit; they were already marked abandoned for you and re-sent to other crunchers).

Thanks Khangollo, I shall do that.

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5868
Credit: 60,607,065
RAC: 47,475
Australia
Message 1305282 - Posted: 12 Nov 2012, 5:52:14 UTC - in response to Message 1305226.


While I was at work, for some reason my internet connection died.
When I was able to reconnect and upload all the work that had piled up, naturally the Scheduler timed out on all requests for work & reporting.
Even with No New Tasks set, it took several attempts to get a response from the Scheduler.
And even now, with only one task to report on one system and a couple on the other, all I'm getting are Scheduler timeout errors.

A few more hours and I'll be completely out of work, even before the weekly outage, during which I was expecting to run out of GPU work at least.
____________
Grant
Darwin NT.

Profile Fred E.Project donor
Volunteer tester
Send message
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 1
United States
Message 1305351 - Posted: 12 Nov 2012, 13:47:33 UTC

While I was at work, for some reason my internet connection died.
When I was able to reconnect and upload all the work that had piled up, naturally the Scheduler timed out on all requests for work & reporting.
Even with No New Tasks set, it took several attempts to get a response from the Scheduler.
And even now, with only one task to report on one system and a couple on the other, all I'm getting are Scheduler timeout errors.

A few more hours and I'll be completely out of work, even before the weekly outage, during which I was expecting to run out of GPU work at least.

Also getting timeouts, even on NNT. Dropping my max reported setting. I'll also run out because I've got mostly shorties. Low limits and shorties = cruelty to crunchers!

I crunched 3 tasks this morning with 60-day deadlines - 10 Jan. I don't remember seeing that before.
____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

N9JFE David SProject donor
Volunteer tester
Avatar
Send message
Joined: 4 Oct 99
Posts: 11992
Credit: 14,659,228
RAC: 12,191
United States
Message 1305380 - Posted: 12 Nov 2012, 15:25:16 UTC

I haven't put my hands on my i7 for a few days, but I will have to when I get home from work today. What concerns me, though, is that my account page says it only has 83 in progress, well below its limit of 200. Before all the trouble started, it typically ran in the 1100-1600 range. I know it just downloaded some new units from Einstein, but I don't know what the cause/effect relationship is. Is it getting Einstein because it can't get SETI, or is it feeling debt to Einstein and favoring it for now? My other two machines each reported one unit back to Einstein over the weekend without asking for more, leaving one of them with only SETI work on board. And I just slid back a position in my joining date class. :-(

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


N9JFE David SProject donor
Volunteer tester
Avatar
Send message
Joined: 4 Oct 99
Posts: 11992
Credit: 14,659,228
RAC: 12,191
United States
Message 1305533 - Posted: 12 Nov 2012, 20:29:04 UTC - in response to Message 1305408.
Last modified: 12 Nov 2012, 20:31:31 UTC

Media alert......
The kitties have inbound WUs!!!!

Purrrrr......

[edit]Purr also for the fact that, according to the weather thing in my signature at the time I posted this, we actually got up to 33F here. Woo hoo.
____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Keith White
Avatar
Send message
Joined: 29 May 99
Posts: 370
Credit: 2,896,916
RAC: 2,406
United States
Message 1305604 - Posted: 12 Nov 2012, 23:03:11 UTC - in response to Message 1305408.

You do seem to have a metric buttload of GPU tasks, even though your CPU pile finally dropped below 100 on two kitties.

I think the all tasks web page still needs a filter for CPU vs GPU tasks, or at least a count.
____________
"Life is just nature's way of keeping meat fresh." - The Doctor

Profile ivan
Volunteer tester
Avatar
Send message
Joined: 5 Mar 01
Posts: 625
Credit: 143,542,422
RAC: 150,031
United Kingdom
Message 1305632 - Posted: 13 Nov 2012, 0:13:01 UTC - in response to Message 1305604.

You do seem to have a metric buttload of GPU tasks, even though your CPU pile finally dropped below 100 on two kitties.

I think the all tasks web page still needs a filter for CPU vs GPU tasks, or at least a count.

Well, this does it for me, but for a particular machine with particular software. On a Windows machine you'd need either Cygwin or some replacement for wc (word count); ISTR there's a DOS equivalent of grep -- find? Bug: the grep for 'fermi' returns a line not associated with jobs in progress, so the third line overcounts by one.

[eesridr:BOINC] > cat showjobs
date
grep 'received_time' client_state.xml|wc
grep 'fermi' client_state.xml|wc
[eesridr:BOINC] > . showjobs
Tue Nov 13 00:06:50 GMT 2012
    687     687   36411
    589     589   23560
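For anyone without Cygwin, the same counts can be taken with a few lines of Python instead of grep|wc. A sketch only: it mimics grep's line-matching on a tiny stand-in string (read the real client_state.xml from your BOINC data directory instead), and it inherits the same 'fermi' overcount ivan mentions:

```python
def count_matching_lines(text, pattern):
    """Equivalent of `grep PATTERN file | wc -l`: lines containing the pattern."""
    return sum(1 for line in text.splitlines() if pattern in line)

# Tiny stand-in for client_state.xml; in practice use open("client_state.xml").read().
sample = """\
<result>
<received_time>1352764010.000000</received_time>
</result>
<result>
<received_time>1352764020.000000</received_time>
</result>
<plan_class>cuda_fermi</plan_class>
"""

print(count_matching_lines(sample, "received_time"))  # 2 tasks on board
print(count_matching_lines(sample, "fermi"))          # 1 GPU-plan line
```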

____________

Keith White
Avatar
Send message
Joined: 29 May 99
Posts: 370
Credit: 2,896,916
RAC: 2,406
United States
Message 1305645 - Posted: 13 Nov 2012, 0:56:52 UTC - in response to Message 1305640.

I was just talking about one of the rigs that recently got CPU units. You still had around 1500 GPU units for the 3 GPUs; at 500 seconds per GPU unit, that's nearly 3 days' worth left. Even if you get down to 100 per GPU, that's still half a day's worth. What did you normally run your queue at? 10 days?

It doesn't make a difference in bandwidth usage in the long run, once the whole seti@home ecosystem hits steady state; it'll just mean that when a super cruncher's nVidia card goes off the rails, they can only shaft at most 100 wingmen per GPU as opposed to thousands. (Please check your results daily -- not directed at you msattler, just nVidia users in general -- to catch when your system starts to produce mostly inconclusive/error/invalid GPU results.)
____________
"Life is just nature's way of keeping meat fresh." - The Doctor

juan BFBProject donor
Volunteer tester
Avatar
Send message
Joined: 16 Mar 07
Posts: 5414
Credit: 306,538,657
RAC: 328,896
Brazil
Message 1305773 - Posted: 13 Nov 2012, 12:12:47 UTC - in response to Message 1305645.

I was just talking about one of the rigs that recently got CPU units. You still had around 1500 GPU units for the 3 GPUs; at 500 seconds per GPU unit, that's nearly 3 days' worth left. Even if you get down to 100 per GPU, that's still half a day's worth. What did you normally run your queue at? 10 days?

It doesn't make a difference in bandwidth usage in the long run, once the whole seti@home ecosystem hits steady state; it'll just mean that when a super cruncher's nVidia card goes off the rails, they can only shaft at most 100 wingmen per GPU as opposed to thousands. (Please check your results daily -- not directed at you msattler, just nVidia users in general -- to catch when your system starts to produce mostly inconclusive/error/invalid GPU results.)

Each 690 crunches a WU in less than 7 min, running 3 WUs at a time on each GPU (it has 2), which is about 48 per hour or more. So on a big cruncher (3x690) a 100 WU cache is simply ridiculous; it won't last even 1 hour. I have 2x690 sleeping on a bed waiting for the limits to be raised; with the current limits it is a waste of time/resources to put them to work, as they simply will not receive the WUs they need to keep busy.
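For what it's worth, juan's rates can be sanity-checked from his own figures (a sketch; the 7-minute runtime and 3-WUs-at-a-time numbers are taken from the post, not measured):

```python
# Figures from the post: a GTX 690 has 2 GPUs, each running 3 WUs at a time,
# and a WU finishes in about 7 minutes.
gpus_per_card = 2
wus_at_a_time = 3
minutes_per_wu = 7

per_card_per_hour = gpus_per_card * wus_at_a_time * 60 / minutes_per_wu
print(round(per_card_per_hour))       # ~51, i.e. "about 48 per hour or more"

# On a 3x690 rig, a 100-WU cache drains in well under an hour:
rig_per_hour = per_card_per_hour * 3
print(round(100 / rig_per_hour, 2))   # ~0.65 hours
```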
____________

WezH
Volunteer tester
Send message
Joined: 19 Aug 99
Posts: 113
Credit: 4,536,287
RAC: 30,446
Finland
Message 1305804 - Posted: 13 Nov 2012, 19:55:14 UTC

Yay! Back from the normal Tuesday outage. (BTW, the people in the lab are really morning people...)

Let's see what comes next... Cricket at the top now, AP splitting disabled... Let's watch and hope for better...
____________



Copyright © 2014 University of California