Panic Mode On (38) Server problems


arkayn (Project donor)
Volunteer tester
Joined: 14 May 99
Posts: 3688
Credit: 48,717,194
RAC: 6,752
United States
Message 1030586 - Posted: 3 Sep 2010, 23:50:58 UTC

Time for a new one.
____________

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,897,216
RAC: 10,980
United States
Message 1030589 - Posted: 3 Sep 2010, 23:54:30 UTC - in response to Message 1030586.

Time for a new one.


WHAT??? WHY??? What happened???? I didn't break it!!!!

____________


PROUD MEMBER OF Team Starfire World BOINC

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5862
Credit: 60,437,973
RAC: 46,518
Australia
Message 1030593 - Posted: 3 Sep 2010, 23:59:35 UTC - in response to Message 1030589.


Real cause for concern: Scarecrow's graphs are broken.
____________
Grant
Darwin NT.

soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,631,148
RAC: 2
United States
Message 1030735 - Posted: 4 Sep 2010, 4:11:24 UTC

I have taken a temporary measure to try to help with the backlog waiting for validation.

I have "do not accept new work" turned on, and all of my _0's and _1's suspended until I clean out all of the _2+'s.

Just trying to help. I will unsuspend and re-allow as soon as those _2+'s are gone.
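
As a rough illustration of the sorting described above, here is a minimal Python sketch that splits a list of task names into resends (_2 and higher) and first issues (_0 and _1). It assumes each task name ends in an underscore followed by a number, as the "_0" and "_2+" shorthand in this thread suggests; the function names and the sample task names are hypothetical, and the sketch only classifies names, it does not talk to the BOINC client.

```python
import re

# Assumption: a task name ends in "_<number>", e.g. "example.wu_2".
_SUFFIX = re.compile(r"_(\d+)$")


def replication_index(task_name):
    """Return the trailing _N number of a task name, or None if it has none."""
    match = _SUFFIX.search(task_name)
    return int(match.group(1)) if match else None


def split_by_resend(task_names, threshold=2):
    """Split names into (resends, first_issues) using the trailing _N number."""
    resends, first_issues = [], []
    for name in task_names:
        index = replication_index(name)
        if index is not None and index >= threshold:
            resends.append(name)       # candidates to crunch first
        else:
            first_issues.append(name)  # candidates to suspend for now
    return resends, first_issues


# Hypothetical task names, for illustration only.
tasks = ["example.wu_0", "example.wu_1", "example.wu_2", "example.wu_6"]
resends, first_issues = split_by_resend(tasks)
print("crunch first:", resends)
print("suspend for now:", first_issues)
```

The threshold is a parameter so the same split can be reused for the "_3+" case mentioned later in the thread.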
____________

Janice

BANZAI56
Volunteer tester
Joined: 17 May 00
Posts: 123
Credit: 33,974,747
RAC: 0
United States
Message 1030765 - Posted: 4 Sep 2010, 5:23:39 UTC

Ran cache dry to install v0.37 on C2D with GT 240.

Asked for new work and promptly was hit with around 120 ghosts.

Bye bye with a detach/reattach and will try again tomorrow. :(


Got Orbit to keep it busy till then...
____________

[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1150
Credit: 3,831,723
RAC: 2,719
United Kingdom
Message 1030769 - Posted: 4 Sep 2010, 6:09:10 UTC

I see that my validate errors have now been sent out again. Well, that's 12 hrs of work gone.
____________

MadMaC
Volunteer tester
Joined: 4 Apr 01
Posts: 201
Credit: 47,158,217
RAC: 0
United Kingdom
Message 1030789 - Posted: 4 Sep 2010, 8:38:30 UTC - in response to Message 1030765.

Ran cache dry to install v0.37 on C2D with GT 240.

Asked for new work and promptly was hit with around 120 ghosts.

Bye bye with a detach/reattach and will try again tomorrow. :(


Got Orbit to keep it busy till then...



How can you tell if you have a ghost unit???
____________

Sten-Arne
Volunteer tester
Joined: 1 Nov 08
Posts: 3513
Credit: 20,667,996
RAC: 21,929
Sweden
Message 1030791 - Posted: 4 Sep 2010, 9:03:07 UTC - in response to Message 1030789.

How can you tell if you have a ghost unit???


That's easy. Compare the WUs that have been allocated to your computer (from the web page) with what you really have on your computer (BOINC Manager).
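
The comparison above amounts to a set difference. A minimal Python sketch, assuming you have already copied the two lists of task names by hand (one from the web page, one from BOINC Manager); the function name and the sample names are hypothetical, and nothing here fetches data from the project or the client:

```python
def find_ghosts(server_tasks, local_tasks):
    """Return task names the server lists for this host that the client lacks."""
    return sorted(set(server_tasks) - set(local_tasks))


# Hypothetical task names, for illustration only.
server_tasks = ["wu_a_0", "wu_b_1", "wu_c_0"]  # copied from the web page
local_tasks = ["wu_a_0", "wu_c_0"]             # copied from BOINC Manager

for name in find_ghosts(server_tasks, local_tasks):
    print("possible ghost:", name)
```

Anything printed is only a candidate: a task that is still downloading would also show up on the server side but not yet locally.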

____________

Aristoteles Doukas
Joined: 11 Apr 08
Posts: 1091
Credit: 2,140,913
RAC: 0
Finland
Message 1030906 - Posted: 4 Sep 2010, 16:31:28 UTC

This doesn't actually belong here, but the BOINC all-projects stats have been broken for a couple of weeks, so BOINC on Facebook is showing results as warped as the all-projects stats, etc. Does anybody know when someone will fix them, or how to contact someone?

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,897,216
RAC: 10,980
United States
Message 1030913 - Posted: 4 Sep 2010, 17:34:34 UTC - in response to Message 1030769.
Last modified: 4 Sep 2010, 17:42:02 UTC

I see that my validate errors have now been sent out again. Well, that's 12 hrs of work gone.


Looks like they managed to fix three out of four Validate errors I had. One had already been completed by the two new guys it was sent to but the others are gone now. I haven't gone looking for them yet to make sure though.


Found them... they have been granted credit for me and are still open for the new guys to crunch and get credit too. Guess that fourth one just slipped through the cracks.
____________


PROUD MEMBER OF Team Starfire World BOINC

[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1150
Credit: 3,831,723
RAC: 2,719
United Kingdom
Message 1031131 - Posted: 5 Sep 2010, 11:35:58 UTC - in response to Message 1030913.

Looks like they have managed to sort out my validate errors and thrown up a new account number again.
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5862
Credit: 60,437,973
RAC: 46,518
Australia
Message 1031303 - Posted: 6 Sep 2010, 4:48:25 UTC
Last modified: 6 Sep 2010, 4:52:00 UTC

Hmm. One of the MB assimilators isn't running, and the number of results to be assimilated is growing rapidly. How long till we run out of disk space again?
Place your bets.
____________
Grant
Darwin NT.

Dave
Joined: 29 Mar 02
Posts: 774
Credit: 23,193,139
RAC: 0
United Kingdom
Message 1031390 - Posted: 6 Sep 2010, 17:25:27 UTC - in response to Message 1030735.
Last modified: 6 Sep 2010, 17:27:21 UTC

I have taken a temporary measure to try to help with the backlog waiting for validation.

I have "do not accept new work" turned on, and all of my _0's and _1's suspended until I clean out all of the _2+'s.

Just trying to help. I will unsuspend and re-allow as soon as those _2+'s are gone.


This is so scarily similar to what I've been doing I wondered if there'd been some sort of forum mix-up + my post had appeared under someone else's name...

Only I've stopped, because if I suspend my 0s + 1s, when my CUDA stops current unit it then stops completely. Also when I suspend current CPU units to start _2+s, when I unsuspend (resume ;)) it cuts out of those 2+s & starts new ones, saying they're "running (high priority)" (which tbh they are because they're soon to expire).

So for now I'm tempted to leave things to do themselves. That _6 I saw yesterday has been eaten at least...

soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,631,148
RAC: 2
United States
Message 1031444 - Posted: 6 Sep 2010, 20:10:36 UTC - in response to Message 1031390.

This is so scarily similar to what I've been doing I wondered if there'd been some sort of forum mix-up + my post had appeared under someone else's name...

Only I've stopped, because if I suspend my 0s + 1s, when my CUDA stops current unit it then stops completely. Also when I suspend current CPU units to start _2+s, when I unsuspend (resume ;)) it cuts out of those 2+s & starts new ones, saying they're "running (high priority)" (which tbh they are because they're soon to expire).

So for now I'm tempted to leave things to do themselves. That _6 I saw yesterday has been eaten at least...


nah... more like GMTA or FSD.. First thing I had to do was set "no new tasks".. then one by one suspend the 0's and 1's not currently being crunched..
It did get everything done, but my CPU's ran dry and for some reason that seemed to slow the GPU work??

I got 2 _5's killed at least, and the units I have received since are 97% 0's and 1's.

Took it off Sunday so I could fill up on a "slow" day for the servers.

I still think the ultimate solution might be to turn off the splitters until the waitings drop to.. maybe 3 million?

None of this is causing a problem locally. But I do fear for those poor disk drives. Situation is all normal here, and I am loaded for the outage.
____________

Janice

Dave
Joined: 29 Mar 02
Posts: 774
Credit: 23,193,139
RAC: 0
United Kingdom
Message 1031447 - Posted: 6 Sep 2010, 20:16:14 UTC - in response to Message 1031444.
Last modified: 6 Sep 2010, 20:18:23 UTC

Seems to be controllable this end mostly.

I've got a lovely _6 going atm, just about to finish (07mr07ad.3283.11524.3.10.164), which I thought was done earlier but then I found it again.

Think I've either got loads of ghosts or just loads of shorties & the DCF is now adapting to my CUDA addition ;). Got control of my _3+s next, though there's a _4 that doesn't want to start.

Shame to shut things down for a week or 2 to aid pending but what effect would that have on RAC?

I'll be dumping & reloading my other machines (including the "1KH" machine which now has Lunatics SSSE3 on it - no more 30-hr units ;)) Tue morn UK-time.

soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,631,148
RAC: 2
United States
Message 1031466 - Posted: 6 Sep 2010, 21:03:25 UTC - in response to Message 1031447.

I doubt it would take that long if ALL new units were _2's or more.. I think at the end of an outage a slight delay on the splitters (maybe 12-24 hours) would clean most of it up.. just keep the machines "hungry" for a bit.

That ghost resend code someone is working on looks promising too as an even better final solution. In the meantime.. as long as the disks hold up..
____________

Janice

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,897,216
RAC: 10,980
United States
Message 1031478 - Posted: 6 Sep 2010, 22:06:30 UTC - in response to Message 1031466.


That ghost resend code someone is working on looks promising too as an even better final solution.


I am cleaning up a lot of ghosts, but a big part of my _2s and _3s are -9 overflows. Hopefully this new installer will take care of a lot of them too. I say a lot of them because some are non-Fermi cards overheating or going bad.
____________


PROUD MEMBER OF Team Starfire World BOINC

Dave
Joined: 29 Mar 02
Posts: 774
Credit: 23,193,139
RAC: 0
United Kingdom
Message 1031556 - Posted: 7 Sep 2010, 5:43:57 UTC

Can't download fresh units ;)...

Don't think I'm going to go dry though over the next 4 days.

Donald L. Johnson (Project donor)
Joined: 5 Aug 02
Posts: 6257
Credit: 735,397
RAC: 1,139
United States
Message 1031608 - Posted: 7 Sep 2010, 14:57:24 UTC

Looks like we had another Tuesday morning power interruption today. Either that or the server memory got full again. Tried to report 1 last completed task before the shutdown ("normally" about 0830 PDT/1530 UTC), but at 0740 PDT the Upload & Download servers were already disabled. Hope everyone got a full load for the outage.



____________
Donald
Infernal Optimist / Submariner, retired

James Sotherden
Joined: 16 May 99
Posts: 8901
Credit: 35,819,679
RAC: 42,942
United States
Message 1031609 - Posted: 7 Sep 2010, 15:00:14 UTC

Well, the day is here again for the shutdown. My Mac has plenty of work, along with the P4; I have an AP on that, so I'm set for 3 days. The i7 looks filled up with an AP also.

However, I just saw some -12's on the i7. Hope I don't get many more of them, or ghosts, this time either.

I did what Sten does and set NNT this time. Let work on the computer Saturday night, so I hope to see no timeouts Friday.
____________

Old James


Copyright © 2014 University of California