Really strange problem

Message boards : Number crunching : Really strange problem
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 838408 - Posted: 10 Dec 2008, 4:38:18 UTC

Like many of us crunchers, I have 5 computers running at home on a LAN. I use Boinc View and Boinc Manager to log into the remote computers and check up on Boinc. One particular box suddenly started having problems today. If I use Boinc Manager to log into Boinc on that box, it displays about 8 or 10 work units in the work list and then Boinc Manager locks up. If I log into that box over a remote desktop connection and run Boinc Manager on the machine itself, it locks up at the same point. Task Manager on that box shows 4 work units being crunched, and they continue to crunch normally.

Boinc View reads the box once, then fails to read it again at the next poll. The work list in Boinc View stalls at the same point.

Anybody seen this before??
Boinc....Boinc....Boinc....Boinc....
ID: 838408 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 838477 - Posted: 10 Dec 2008, 13:31:42 UTC
Last modified: 10 Dec 2008, 13:32:02 UTC

Same problem this morning. I can't figure out a way to get into Boinc and tell it to do anything. I can't set no new work, reset, or anything else. I can only start and stop the service; beyond that, all I can do is let it run.
Boinc....Boinc....Boinc....Boinc....
ID: 838477 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14658
Credit: 200,643,578
RAC: 874
United Kingdom
Message 838482 - Posted: 10 Dec 2008, 13:48:09 UTC

I suggest you use boinccmd (or boinc_cmd if your system pre-dates the name change) to set nomorework on every project on that host, and report anything completed once you see Task Manager drop to idle on all cores. Then stop the service and examine the entrails in client_state.xml, or just use boinccmd again to reset all projects.
ID: 838482 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 838487 - Posted: 10 Dec 2008, 14:03:16 UTC
Last modified: 10 Dec 2008, 14:13:51 UTC

Thanks Richard..........I get an error message about unrecognized command.

I tried...."boinccmd nomorework"

Anyone know the exact syntax to stop work requests?

Never mind......I figured it out. At least it ran the command without an error message. Hopefully it will not request more work and will run down the cache.

Thanks Richard.

It is this computer number 3966329 in case anyone is interested.
Boinc....Boinc....Boinc....Boinc....
ID: 838487 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14658
Credit: 200,643,578
RAC: 874
United Kingdom
Message 838490 - Posted: 10 Dec 2008, 14:13:00 UTC

That's why I made the boinccmd a clickable link into the Wiki where the syntax is defined in full!

It'll be something like

boinccmd --project http://setiathome.berkeley.edu nomorework

One more thought: if you do succeed in stopping new work and flushing the queue(s) down to zero, then after reporting (see the Wiki again!), whatever is still shown as 'in progress' on the project task list(s) may give you a clue as to which project(s) need resetting.
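
For anyone who wants to script this for several projects or hosts, here is a minimal Python sketch that builds the same command line. The `--project URL nomorework` form is the one quoted above; `--host` and `--passwd` are boinccmd's options for talking to a remote client. The helper name `nomorework_command` and the project list are illustrative assumptions, and the sketch only prints the commands rather than running them.

```python
import shlex
from typing import List, Optional


def nomorework_command(project_url: str,
                       host: Optional[str] = None,
                       password: Optional[str] = None) -> List[str]:
    """Build the boinccmd argument list that stops new work for one project."""
    cmd = ["boinccmd"]
    if host:
        # Connection options go before the command itself.
        cmd += ["--host", host]
    if password:
        cmd += ["--passwd", password]
    cmd += ["--project", project_url, "nomorework"]
    return cmd


# Assumption: the list of attached projects is maintained by hand.
PROJECTS = ["http://setiathome.berkeley.edu"]

for url in PROJECTS:
    # Print a copy-and-paste friendly command line; nothing is executed here.
    print(" ".join(shlex.quote(arg) for arg in nomorework_command(url)))
```

Passing the returned list to `subprocess.run` would actually execute the command, assuming boinccmd is on the PATH of the machine running the script.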
ID: 838490 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 838491 - Posted: 10 Dec 2008, 14:17:29 UTC

Thanks Richard. At least it continued to run normally overnight, as it should when running as a service. Now I can let the cache run down, and when it empties I can reset it or something.

With a 7 day cache I didn't want to reset it now and have several hundred people upset with me.

Still an interesting situation though. To lose control of Boinc.
Boinc....Boinc....Boinc....Boinc....
ID: 838491 · Report as offensive
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 838496 - Posted: 10 Dec 2008, 14:28:06 UTC - in response to Message 838491.  

That is very strange. Two things I always ask:
1) Has anything changed on your workstation or the box in question?
2) Have you rebooted the box in question since this event occurred?

Other than that, seems like you're on the right track to fixing it.
ID: 838496 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 838498 - Posted: 10 Dec 2008, 14:30:14 UTC - in response to Message 838491.  

Sounds like greeblies in your tcp/ip stack. I find beer helps.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 838498 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 838499 - Posted: 10 Dec 2008, 14:36:54 UTC

The computer in question runs without a keyboard or monitor. It sits in the corner of the dining room, creating much-needed warmth at this time of year. Yes, I logged into the computer with remote desktop and rebooted after I discovered this problem.

It is still running normally except I am blind with respect to Boinc.
Boinc....Boinc....Boinc....Boinc....
ID: 838499 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 838502 - Posted: 10 Dec 2008, 14:39:39 UTC

Which version of BOINC and, for that matter, which version of BOINC View?
ID: 838502 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14658
Credit: 200,643,578
RAC: 874
United Kingdom
Message 838504 - Posted: 10 Dec 2008, 14:43:41 UTC - in response to Message 838502.  

Which version of BOINC?

Crunch3r's
ID: 838504 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 838508 - Posted: 10 Dec 2008, 14:50:56 UTC

I use Boinc version 5.10.28, which is then overwritten with Crunch3r's Boinc 6.1.0.32 V5. I removed Boinc and reinstalled it, but the results were the same.

I believe that some corruption of the Client_State file happened after yesterday's scheduled outage. If this is true it won't do any good to revert to the Client_State_Previous file because by now the corruption would be there also.
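
One way to test the corruption idea before resetting anything is to see whether the state files still parse as XML. This is only a rough sketch: it assumes the files are named client_state.xml and client_state_prev.xml and live in the BOINC data directory, and the function names are made up for illustration. A parse failure points at a damaged file, but a clean parse does not prove the contents are sane.

```python
import os
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Assumed file names for the current and previous client state.
STATE_FILES = ("client_state.xml", "client_state_prev.xml")


def xml_parse_error(path: str) -> Optional[str]:
    """Return None if the file parses as XML, else a short error description."""
    try:
        ET.parse(path)
    except ET.ParseError as exc:
        return "not well-formed: {}".format(exc)
    except OSError as exc:
        return "cannot read file: {}".format(exc)
    return None


def check_state_files(data_dir: str) -> Dict[str, Optional[str]]:
    """Check each assumed state file in data_dir and map its name to an error or None."""
    return {name: xml_parse_error(os.path.join(data_dir, name))
            for name in STATE_FILES}
```

Calling `check_state_files` with the path of the BOINC data directory (stop the service first so the file is not mid-write) returns a dictionary with `None` for each file that parsed cleanly.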

The box is crunching. I will just let it go until the cache is done or it finally runs into the corruption in the state file. I had tried removing 4 work units from the projects folder that I thought might be related. Upon restarting the service the missing work was downloaded again and the problem was not solved.

Just don't know what to do now except let it run. I certainly do not want to abort all this work just yet.
Boinc....Boinc....Boinc....Boinc....
ID: 838508 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 838517 - Posted: 10 Dec 2008, 15:36:09 UTC

Can you check if there is any information on this in the stderrgui.txt and stdoutgui.txt files?
ID: 838517 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 838545 - Posted: 10 Dec 2008, 16:41:59 UTC

From stdoutgui

[12/09/08 21:03:17] TRACE [2948]: RPC_CLIENT::init boinc_socket returned 516
[12/09/08 21:03:17] TRACE [2948]: RPC_CLIENT::init connect returned -1
[12/09/08 21:03:17] TRACE [2948]: RPC_CLIENT::init attempting connect
[12/09/08 21:03:18] TRACE [2948]: RPC_CLIENT::init_poll sock = 516
[12/09/08 21:03:18] TRACE [2948]: RPC_CLIENT::init_poll connected to port 31416
[12/09/08 21:03:18] TRACE [2948]: CAN'T FIND PROJECT http://setiathome.berkeley.edu/
[12/09/08 22:06:49] TRACE [2848]: RPC_CLIENT::init boinc_socket returned 516
[12/09/08 22:06:49] TRACE [2848]: RPC_CLIENT::init connect returned -1
[12/09/08 22:06:49] TRACE [2848]: RPC_CLIENT::init attempting connect
[12/09/08 22:06:49] TRACE [2848]: RPC_CLIENT::init_poll sock = 516
[12/09/08 22:06:49] TRACE [2848]: RPC_CLIENT::init_poll sock = 516
[12/09/08 22:06:49] TRACE [2848]: RPC_CLIENT::init_poll connected to port 31416
[12/09/08 22:06:49] TRACE [2848]: CAN'T FIND PROJECT http://setiathome.berkeley.edu/
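
When the log is long, a short script can pull out the interesting lines. The sketch below assumes every line follows the layout seen in the trace above (timestamp, TRACE, a bracketed number, then the message) and counts the "CAN'T FIND PROJECT" messages per project URL. What the bracketed number means is not stated in the log, so it is captured under the neutral name `ident`.

```python
import re
from collections import Counter
from typing import Iterable

# Layout assumed from the lines above:
# [MM/DD/YY HH:MM:SS] TRACE [number]: message
TRACE_LINE = re.compile(
    r"^\[(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\] "
    r"TRACE \[(?P<ident>\d+)\]: (?P<message>.*)$"
)

PREFIX = "CAN'T FIND PROJECT "


def missing_projects(lines: Iterable[str]) -> Counter:
    """Count 'CAN'T FIND PROJECT' messages per project URL."""
    counts = Counter()
    for line in lines:
        match = TRACE_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that does not look like a trace line
        message = match.group("message")
        if message.startswith(PREFIX):
            counts[message[len(PREFIX):].strip()] += 1
    return counts
```

Feeding it the open file object for stdoutgui.txt (`missing_projects(open(path))`) gives a per-project tally of how often the Manager failed to find a project.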

From stderrgui; this seems to occur every time I try to run Boinc Manager.

BOINC Windows Runtime Debugger Version 5.10.28
Dump Timestamp : 12/09/08 20:44:18
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004A0DE4 read attempt to address 0x00000064
Engaging BOINC Windows Runtime Debugger...
********************

(What follows is very long............
If anyone really wants to see the file send me a PM with an email address and I can send it along.)

Boinc....Boinc....Boinc....Boinc....
ID: 838545 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 838553 - Posted: 10 Dec 2008, 16:53:59 UTC - in response to Message 838545.  

PM sent.
ID: 838553 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 838560 - Posted: 10 Dec 2008, 17:12:29 UTC

Files Sent
Boinc....Boinc....Boinc....Boinc....
ID: 838560 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 838592 - Posted: 10 Dec 2008, 18:53:28 UTC

Just spent a couple of hours emailing with Ageless. Thanks to his patience and wonderful troubleshooting, I have located and replaced the offending file in the Boinc directory. Microsoft.VC80.CRT.manifest.dll was at fault for the entire mess. I copied the file from another computer over the offending file; the file sizes were the same before and after.

Ageless spends countless hours here troubleshooting and helping folks, and I for one do not have the words to thank him enough for his help. I did not want to dump all the cached work and start over, and with his help I didn't have to.

Thank you, Ageless
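
Since a damaged file can have exactly the same size as a good one, comparing content hashes is a more reliable check than comparing sizes. Here is a minimal Python sketch using SHA-256; the function names are illustrative, and the two paths would be the suspect file and a known-good copy from another machine.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True if both files have identical contents, judged by their digests."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If `same_content` returns False for two files of equal size, the copy on the problem box differs from the good one even though the sizes match.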
Boinc....Boinc....Boinc....Boinc....
ID: 838592 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 838593 - Posted: 10 Dec 2008, 18:57:11 UTC - in response to Message 838592.  

You're welcome.

I'll now go sit here with the window open, waiting for the blush to recede. :-)
ID: 838593 · Report as offensive
Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 24884
Credit: 3,081,182
RAC: 7
Ireland
Message 838654 - Posted: 10 Dec 2008, 23:28:31 UTC

This is what I like about the N/C board, terrific help whenever needed.

Well done Ageless.
ID: 838654 · Report as offensive
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 838656 - Posted: 10 Dec 2008, 23:40:25 UTC - in response to Message 838593.  

You're welcome.

I'll now go sit here with the window open, waiting for the blush to recede. :-)



. . . well Sir - looks like You're going to be on top: 'Kudos system on the BOINC forums' - Well done [Kudos to You]


BOINC Wiki . . .

Science Status Page . . .
ID: 838656 · Report as offensive


 