Panic Mode On (38) Server problems

Message boards : Number crunching : Panic Mode On (38) Server problems

arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1030586 - Posted: 3 Sep 2010, 23:50:58 UTC

Time for a new one.

ID: 1030586
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1030589 - Posted: 3 Sep 2010, 23:54:30 UTC - in response to Message 1030586.  

Time for a new one.


WHAT??? WHY??? What happened???? I didn't break it!!!!



PROUD MEMBER OF Team Starfire World BOINC
ID: 1030589
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1030593 - Posted: 3 Sep 2010, 23:59:35 UTC - in response to Message 1030589.  


Real cause for concern- Scarecrow's graphs are broken.
Grant
Darwin NT
ID: 1030593
soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1030735 - Posted: 4 Sep 2010, 4:11:24 UTC

I have taken a temporary measure to attempt to try and help with the backlog waiting for validation..

I have do not accept new work turned on, all of my _0 and _1's suspended until I clean out all of the _2+'s.

Just trying to help. I will unsuspend and re-allow as soon as those _2+'s are gone.
Janice
ID: 1030735
BANZAI56
Volunteer tester
Joined: 17 May 00
Posts: 139
Credit: 47,299,948
RAC: 2
United States
Message 1030765 - Posted: 4 Sep 2010, 5:23:39 UTC

Ran cache dry to install v0.37 on C2D with GT 240.

Asked for new work and promptly was hit with around 120 ghosts.

Bye bye with a detach/reattach and will try again tomorrow. :(


Got Orbit to keep it busy till then...
ID: 1030765
[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 1030769 - Posted: 4 Sep 2010, 6:09:10 UTC

I see that my validate errors have now been sent out again, well that's 12 hrs work gone.
ID: 1030769
MadMaC
Volunteer tester
Joined: 4 Apr 01
Posts: 201
Credit: 47,158,217
RAC: 0
United Kingdom
Message 1030789 - Posted: 4 Sep 2010, 8:38:30 UTC - in response to Message 1030765.  

Ran cache dry to install v0.37 on C2D with GT 240.

Asked for new work and promptly was hit with around 120 ghosts.

Bye bye with a detach/reattach and will try again tomorrow. :(


Got Orbit to keep it busy till then...



How can you tell if you have a ghost unit???
ID: 1030789
Aristoteles Doukas
Joined: 11 Apr 08
Posts: 1091
Credit: 2,140,913
RAC: 0
Finland
Message 1030906 - Posted: 4 Sep 2010, 16:31:28 UTC

This doesn't actually belong here, but the BOINC all-projects stats have been broken for a couple of weeks, hence BOINC on Facebook is showing results as warped as the BOINC all-projects stats, etc. Does anybody know when someone will fix them, or how to contact someone?
ID: 1030906
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1030913 - Posted: 4 Sep 2010, 17:34:34 UTC - in response to Message 1030769.  
Last modified: 4 Sep 2010, 17:42:02 UTC

I see that my validate errors have now been sent out again, well that's 12 hrs work gone.


Looks like they managed to fix three out of four Validate errors I had. One had already been completed by the two new guys it was sent to but the others are gone now. I haven't gone looking for them yet to make sure though.


Found them... they have been granted credit for me and are still open for the new guys to crunch and get credit too. Guess that fourth one just slipped through the cracks.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1030913
[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 1031131 - Posted: 5 Sep 2010, 11:35:58 UTC - in response to Message 1030913.  

Looks like they have managed to sort out my validate errors and thrown up a new account number again.
ID: 1031131
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1031303 - Posted: 6 Sep 2010, 4:48:25 UTC
Last modified: 6 Sep 2010, 4:52:00 UTC

Hmm. One of the MB assimilators isn't running, and the number of results to be assimilated is growing rapidly. How long till we run out of disk space again?
Place your bets.
Grant
Darwin NT
ID: 1031303
Dave
Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 1031390 - Posted: 6 Sep 2010, 17:25:27 UTC - in response to Message 1030735.  
Last modified: 6 Sep 2010, 17:27:21 UTC

I have taken a temporary measure to attempt to try and help with the backlog waiting for validation..

I have do not accept new work turned on, all of my _0 and _1's suspended until I clean out all of the _2+'s.

Just trying to help. I will unsuspend and re-allow as soon as those _2+'s are gone.


This is so scarily similar to what I've bn doing I wondered if there'd been some sort of forum-mix-up + my post had appeared under someone-else's name...

Only I've stopped, because if I suspend my 0s + 1s, when my CUDA stops current unit it then stops completely. Also when I suspend current CPU units to start _2+s, when I unsuspend (resume ;)) it cuts out of those 2+s & starts new ones, saying they're "running (high priority)" (which tbh they are because they're soon to expire).

So for now I'm tempted to leave things to do themselves. That _6 I saw yesterday has been eaten at least...
ID: 1031390
soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1031444 - Posted: 6 Sep 2010, 20:10:36 UTC - in response to Message 1031390.  

I have taken a temporary measure to attempt to try and help with the backlog waiting for validation..

I have do not accept new work turned on, all of my _0 and _1's suspended until I clean out all of the _2+'s.

Just trying to help. I will unsuspend and re-allow as soon as those _2+'s are gone.


This is so scarily similar to what I've bn doing I wondered if there'd been some sort of forum-mix-up + my post had appeared under someone-else's name...

Only I've stopped, because if I suspend my 0s + 1s, when my CUDA stops current unit it then stops completely. Also when I suspend current CPU units to start _2+s, when I unsuspend (resume ;)) it cuts out of those 2+s & starts new ones, saying they're "running (high priority)" (which tbh they are because they're soon to expire).

So for now I'm tempted to leave things to do themselves. That _6 I saw yesterday has been eaten at least...


nah... more like GMTA or FSD.. First thing I had to do was set "no new tasks".. then one by one suspend the 0's and 1's not currently being crunched..
It did get everything done, but my CPU's ran dry and for some reason that seemed to slow the GPU work??

I got 2 _5's killed at least. and the units I have received since are 97% 0's and 1's.

Took it off sunday so I could fill up on a "slow" day for the servers.

I still think the ultimate solution might be to turn off the splitters until the waitings drop to.. maybe 3Million?

None of this is causing a problem locally. But I do fear for those poor disk drives. Situation is all normal here, and I am loaded for the outage.
Janice
ID: 1031444
Dave
Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 1031447 - Posted: 6 Sep 2010, 20:16:14 UTC - in response to Message 1031444.  
Last modified: 6 Sep 2010, 20:18:23 UTC

Seems to be controllable this end mostly.

I've got a lovely _6 going atm, just about to finish (07mr07ad.3283.11524.3.10.164), which I thought was done earlier but then I found it again.

Think I've either got loads of ghosts or just loads of shorties & the DCF is now adapting to my CUDA addition ;). Got control of my _3+s next, though there's a _4 that doesn't want to start.

Shame to shut things down for a week or 2 to aid pending but what effect would that have on RAC?

I'll be dumping & reloading my other machines (including the "1KH" machine which now has Lunatics SSSE3 on it - no more 30-hr units ;)) Tue morn UK-time.
ID: 1031447
soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1031466 - Posted: 6 Sep 2010, 21:03:25 UTC - in response to Message 1031447.  

I doubt it would take that long if ALL new units were _2's or more.. I think at the end of an outage a slight delay on the splitters (maybe 12-24 hours) would clean most of it up.. just keep the machines "hungry" for a bit.

That ghost resend code someone is working on looks promising too as an even better final solution. In the mean time.. as long as the disks hold up..
Janice
ID: 1031466
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1031478 - Posted: 6 Sep 2010, 22:06:30 UTC - in response to Message 1031466.  


That ghost resend code someone is working on looks promising too as an even better final solution.


I am cleaning up a lot of ghosts but a big part of my _2 and _3s are -9 overflows. Hopefully this new installer will take care of a lot of them too. I say a lot of them because some are non-fermi cards overheating or going bad.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1031478
Dave
Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 1031556 - Posted: 7 Sep 2010, 5:43:57 UTC

Can't download fresh units ;)...

Don't think I'm going to go dry though over the next 4 days.
ID: 1031556
Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1031608 - Posted: 7 Sep 2010, 14:57:24 UTC

Looks like we had another Tuesday morning power interruption today. Either that or the server memory got full again. Tried to report 1 last completed task before the shutdown ("normally" about 0830 PDT/1530 UTC), but at 0740 PDT the Upload & Download servers were already disabled. Hope everyone got a full load for the outage.



Donald
Infernal Optimist / Submariner, retired
ID: 1031608
James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1031609 - Posted: 7 Sep 2010, 15:00:14 UTC

Well, the day is here again for the shutdown. My Mac has plenty of work, along with the P4; I have an AP on that, so I'm set for 3 days. The i7 looks filled up with an AP also.

However, I just saw some -12's on the i7. Hope I don't get many more of them, or ghosts, this time either.

I did what Sten does and set NNT this time. Let work on the computer Saturday night, so I hope to see no timeouts Friday.

Old James
ID: 1031609
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1031613 - Posted: 7 Sep 2010, 15:09:54 UTC
Last modified: 7 Sep 2010, 15:10:27 UTC

At ~12:45 UTC I suspended network activity in my BOINC clients, so as not to get 'validate errors' again.

So far I don't see any in the hosts overview.
ID: 1031613
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.