Panic Mode On (102) Server Problems?

Message boards : Number crunching : Panic Mode On (102) Server Problems?

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1763614 - Posted: 9 Feb 2016, 7:04:50 UTC - in response to Message 1761158.  

A further update on my single-core machine.

I got the new PSU on Monday (Corsair CX430) and installed it. Seemed fine for a few hours. But the machine is still freezing randomly. I'm thinking it must be something to do with the board at this point.

I might dig out the previous board that was in it (that I replaced some caps on, three times) and replace the caps again and see if that does it. I might have to ebay some more caps though...I think I ran out.

I'm pretty sure it is done with the remaining MBv7's that it had.. but I'm trying to get it to boot up long enough to be able to report them.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1763614 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1763663 - Posted: 9 Feb 2016, 13:10:04 UTC - in response to Message 1763614.  
Last modified: 9 Feb 2016, 13:12:34 UTC

Update again.

After a few more hours and many "flip the rocker switch and wait 20 minutes" sessions, I finally got the remaining tasks reported.

Opened the machine up and took the board out, and... wow. That Compaq board was way worse than I saw with a flashlight last week. There are at least a dozen caps on it that are at "critical mass" amount of bulged, and some have oozed a little bit. That one will need extensive repairs if it is to ever live again.

So for the time being, I put the previous board back in there. A Biostar Geforce6100-M7. I've replaced a few caps on it before (history on that in this thread), and.......to be honest, I don't recall why I stopped using it and swapped-in that Compaq board from a dumpster-dived Presario, because the CPU in the Compaq is a downgrade from the one in the Biostar.

Surely there was a reason I stopped using it. I know that I have since learned (thanks to Jason, and others) to just cut the old caps off and leave their legs in the board and just solder replacement parts to those legs (since PCBs are multiple layers). But I don't know why I stopped using that board.

Guess I'll find out if it is glitchy or not. The system shuts off about 3 seconds into loading Windows, but it has no problem going through the burn-in test with my QT Pro CD.

I'm thinking Windows doesn't like the hardware change. Going to try safe mode and go from there. I should have done a sysprep before pulling the old board, but I couldn't keep the system running long enough to do something like that, because it froze so frequently.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1763663 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1763694 - Posted: 9 Feb 2016, 16:26:38 UTC - in response to Message 1763663.  

Okay, it's dead, Jim. I don't know why this Biostar board randomly shuts off, but that's what it does. It went through basically 20 passes of memtest86 without any issues. Booted into safe mode, no problems there. Decided to reboot and let it come up normally, did fine. Logged-in, and about 5 seconds after login.. complete shut-off like I pulled the plug.

Looked around for some old socket 754 boards.. and I'm not paying $25 for one.... and it only has AGP on it instead of PCI-e. So.. :( Altair is dead.

I may possibly keep the BOINC folder from it and use that in whatever the next machine for it ends up being, so that machine ID may live on... but the Altair I've known and loved for all these years is dead. It is also my WSUS server, so I'm going to have to figure out what to do about that.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1763694 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1763818 - Posted: 10 Feb 2016, 9:45:51 UTC - in response to Message 1763694.  

My first v8 Computation Error.
27jl15ab.8952.394725.16.43.52

Another system errored on the same WU as soon as it tried to start it. Will be interesting to see how the other 2 systems cope with it.
Grant
Darwin NT
ID: 1763818 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1763830 - Posted: 10 Feb 2016, 11:18:34 UTC - in response to Message 1763818.  

My first v8 Computation Error.
27jl15ab.8952.394725.16.43.52

Another system errored on the same WU as soon as it tried to start it. Will be interesting to see how the other 2 systems cope with it.

I've tried running it offline, and got a normal finish with no errors (cuda50)

Flopcounter: 38315226103165.727000

Spike count: 9
Autocorr count: 0
Pulse count: 2
Triplet count: 4
Gaussian count: 0
Worker preemptively acknowledging a normal exit.->
called boinc_finish
Exit Status: 0
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
Cuda threadsafe ExitProcess() initiated, rval 0
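
For anyone who wants to compare runs like this without eyeballing them, here is a minimal Python sketch that pulls the signal counts out of a stderr excerpt and checks whether two runs agree. The function names (`signal_counts`, `counts_agree`) are made up for illustration, and comparing counts alone is a much weaker check than whatever the project's validator actually does.

```python
import re

# Matches the "X count: N" lines as they appear in the stderr excerpt above.
COUNT_RE = re.compile(
    r"^(Spike|Autocorr|Pulse|Triplet|Gaussian) count:\s*(\d+)\s*$",
    re.MULTILINE,
)

def signal_counts(stderr_text):
    """Return {signal_name: count} for every 'X count: N' line found."""
    return {name: int(value) for name, value in COUNT_RE.findall(stderr_text)}

def counts_agree(run_a, run_b):
    """True when both runs report the same signals with the same counts.

    Hypothetical helper: it only compares totals, not the signals themselves.
    """
    return signal_counts(run_a) == signal_counts(run_b)

# Counts copied from the offline cuda50 run quoted above.
offline_run = """\
Spike count: 9
Autocorr count: 0
Pulse count: 2
Triplet count: 4
Gaussian count: 0
"""

print(signal_counts(offline_run))
```

Feeding it the stderr text from two hosts and calling `counts_agree` gives a quick first sanity check before looking at anything in more detail.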
ID: 1763830 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1763845 - Posted: 10 Feb 2016, 13:23:27 UTC - in response to Message 1763663.  

Okay, I'll quit spamming about the situation on my puny single-core machine now, unless I get it back up and running again.

The sit-rep is: The Biostar board: I don't know what even needs to be fixed on that to make it work again. It is a lost cause.

The Compaq board is actually an Asus A8AE-LE.

I looked at the caps on it and did some ebay searching, and found one seller that has two of the caps I need, but not the other two, so I sent them a message asking if they had the other two and they just weren't listed. Awaiting response on that (I'm trying to be cheap and get all four cap sizes from one seller to save on shipping).

The list of what I need for the Compaq/Asus board:

5x 820 µF 6.3 V 105°C
4x 1500 µF 6.3 V 105°C
2x 1000 µF 16 V 105°C
1x 2200 µF 10 V 105°C

Realistically, there's nothing wrong with going with something like 25 V 125°C for all of them, but I need those quantities of those µF values. If I get those, I can get the board up and running again.
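
The substitution rule I'm going by (same µF, voltage and temperature rating equal or higher) can be sketched in a few lines of Python. This is only an illustration: the `Cap` class and `is_acceptable` helper are names I made up, and the check deliberately ignores things that also matter in practice, such as can diameter, lead spacing and ESR.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cap:
    uf: int        # capacitance in µF
    volts: float   # rated voltage in V
    temp_c: int    # rated temperature in °C

def is_acceptable(original: Cap, candidate: Cap) -> bool:
    """Same capacitance, with voltage and temperature ratings at least as high.

    Assumption: physical size, lead spacing and ESR are not checked here.
    """
    return (
        candidate.uf == original.uf
        and candidate.volts >= original.volts
        and candidate.temp_c >= original.temp_c
    )

# (quantity, part) pairs taken from the list above.
NEEDED = [
    (5, Cap(820, 6.3, 105)),
    (4, Cap(1500, 6.3, 105)),
    (2, Cap(1000, 16, 105)),
    (1, Cap(2200, 10, 105)),
]

total = sum(qty for qty, _ in NEEDED)
print(f"{total} caps to replace")

for qty, part in NEEDED:
    # A 25 V / 125°C part of the same capacitance, as suggested above.
    upgrade = Cap(part.uf, 25, 125)
    print(qty, part.uf, "µF ->", "ok" if is_acceptable(part, upgrade) else "no")
```

Going the other way, for example a 10 V part in place of a 16 V one, fails the check, which is the case to watch for when buying a mixed lot.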

Since I found the spec sheet on the HP website for the board, I looked at ebay prices for picking up a better CPU, and I'm not paying $80 for a s939 Athlon X2 4800+. Not when the AM1/2 version can be had for just $6 and free shipping.



I'm still looking at other options that might be available to me. I'm almost considering looking on newegg or amazon for old board+CPU refurbs and seeing what they have in the sub-$20 range.

There are two reasons I want to get Altair running again: 1) it is my WSUS server. 2) The statistics file for it stretches back to mid-2008. Sure, there's some straight lines for some gaps in the data here and there due to previous downtimes, but I've kept that system going for a while.

It started off with XP, then I moved it over to Linux for a few years, and then to Server 2k3. I kept moving the BOINC folder through all of these transitions, so it never lost its identity, nor its stats. At this point.. it is a historical artifact.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1763845 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1763922 - Posted: 10 Feb 2016, 18:37:43 UTC - in response to Message 1763830.  

My first v8 Computation Error.
27jl15ab.8952.394725.16.43.52

Another system errored on the same WU as soon as it tried to start it. Will be interesting to see how the other 2 systems cope with it.

I've tried running it offline, and got a normal finish with no errors (cuda50)

Flopcounter: 38315226103165.727000

Spike count: 9
Autocorr count: 0
Pulse count: 2
Triplet count: 4
Gaussian count: 0
Worker preemptively acknowledging a normal exit.->
called boinc_finish
Exit Status: 0
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
Cuda threadsafe ExitProcess() initiated, rval 0


One of the other systems has now run it as well with no issues, and got the same results you did.
Grant
Darwin NT
ID: 1763922 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1764001 - Posted: 11 Feb 2016, 7:03:40 UTC - in response to Message 1763922.  

One of the other systems has now run it as well with no issues, and got the same results you did.

And the second system ran it with no problems.
Grant
Darwin NT
ID: 1764001 · Report as offensive
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1764030 - Posted: 11 Feb 2016, 11:00:00 UTC - in response to Message 1764001.  
Last modified: 11 Feb 2016, 11:03:26 UTC

It's easy to forget that consumer GeForce cards don't have ECC memory. If you can check back on the original host's GPU, chances are it'd come up clear, or otherwise give some clues for further diagnosis if a repeatable event is found on the same unit. If it's not repeatable, then you could easily be dealing with a soft error (i.e. more to do with ambient, cosmic and semiconductor-packaging radiation, while hardware, driver or firmware issues should tend to be more repeatable).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1764030 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1764081 - Posted: 11 Feb 2016, 16:03:54 UTC

*long sigh* Must be bad ju-ju this week or something. Overnight, I lost access to my external enclosure. I didn't even realize anything was amiss until I saw that the Computer window no longer had a scrollbar. Hm.

Dug into all the troubleshooting steps and unless the eSATA cable suddenly went bad overnight without being touched (and the rig stays running 24/7), then the 5-to-1 port multiplier has gone bad.

I've sent an email to iStarUSA to ask what the make/model of the multiplier is for this enclosure, and even said that I know it is out of warranty, but if they have a spare, I'll buy another multiplier module from them.

If any of you recognize this, or have better results than I had with googling the numbers, I'm open to suggestions. http://i.imgur.com/rBktB1P.jpg

I'm almost sure it is an Addonics 5-to-1 though. So I'm probably just better off googling around to find another 5-to-1 module.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1764081 · Report as offensive
Jeanette
Volunteer tester
Joined: 25 Apr 15
Posts: 55
Credit: 7,827,469
RAC: 0
Denmark
Message 1764089 - Posted: 11 Feb 2016, 16:38:49 UTC - in response to Message 1764081.  

Looks different than the one you show, but could it be something like this?

http://www.addonics.com/products/ad5sapm.php
ID: 1764089 · Report as offensive
JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1764092 - Posted: 11 Feb 2016, 17:01:14 UTC

Wow, 'when it rains......'

Good luck,

"Sour Grapes make a bitter Whine." <(0)>
ID: 1764092 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1764094 - Posted: 11 Feb 2016, 17:08:26 UTC - in response to Message 1764089.  

Looks different than the one you show, but could it be something like this?

http://www.addonics.com/products/ad5sapm.php

Pretty sure this will work without modification to the enclosure.

http://www.addonics.com/products/ad5sarm6g.php
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1764094 · Report as offensive
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1764095 - Posted: 11 Feb 2016, 17:10:38 UTC - in response to Message 1764094.  

Have you tried contacting Addonics customer service and linking them to your pic to ask their advice?
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1764095 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1764102 - Posted: 11 Feb 2016, 17:34:10 UTC - in response to Message 1764095.  
Last modified: 11 Feb 2016, 17:42:58 UTC

Have you tried contacting Addonics customer service and linking them to your pic to ask their advice?

I have not. I'm pretty sure that's their product though, from the 2008 era. Unless they do lifetime warranties (pretty sure they don't), it would have been long-since expired anyway.

I went ahead and bought the one I linked. Will be coming by mail soon.

It's not the end of the world for me though. I just can't use the enclosure until then, but if I have to have access to any of the five drives, I can just plug them directly into the mobo in the meantime.

This new multiplier is an upgrade though, at the very least. It does SATA III at 6 Gb/s instead of the previous SATA II at 3.0 Gb/s. It's not going to add any performance, since the Silicon Image 3132 controller is only SATA II. But if I ever decide to upgrade the PCI-e x1 controller to a SATA III one that can do port multipliers.. I'd be set.
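
To put rough numbers on why the new multiplier won't be faster yet, here is a back-of-the-envelope Python sketch. It relies on SATA's 8b/10b line encoding (10 bits on the wire per data byte) and otherwise makes simplifying assumptions that are mine, not from any spec sheet: no other protocol overhead, and all five drives equally busy at once. The function names are just for illustration.

```python
# Rough link-speed arithmetic for a 5-to-1 port multiplier.
# Assumptions: overhead beyond 8b/10b encoding is ignored, and all five
# drives share the single eSATA link evenly.

def payload_mb_per_s(line_rate_gbps: float) -> float:
    """SATA uses 8b/10b encoding, so each data byte costs 10 bits on the wire."""
    return line_rate_gbps * 1000 / 10

def effective_link_mb_per_s(controller_gbps: float, multiplier_gbps: float) -> float:
    """The slower end of the link sets the ceiling."""
    return payload_mb_per_s(min(controller_gbps, multiplier_gbps))

SATA_II, SATA_III = 3.0, 6.0
DRIVES = 5

now = effective_link_mb_per_s(SATA_II, SATA_III)     # SATA II controller + new multiplier
later = effective_link_mb_per_s(SATA_III, SATA_III)  # hypothetical SATA III controller

print(f"today: {now:.0f} MB/s shared, about {now / DRIVES:.0f} MB/s per drive")
print(f"after a controller upgrade: {later:.0f} MB/s shared, about {later / DRIVES:.0f} MB/s per drive")
```

So the ceiling stays at the controller's SATA II rate until the card itself is swapped.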

Addonics makes some pretty neat stuff though. I remember years ago before SSDs were even rumors spoken through backchannels, Addonics had a 2.5" cartridge for laptops that would take six 32GB SD cards and put them into JBOD mode to make your own DIY SSD. The cartridge was $90, and then there was the cost of six 32GB SD cards (at the time, those were about $60/ea). It was a little pricey, but that was available as an option like.. two years before the first mainstream SSDs came out.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1764102 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1764123 - Posted: 11 Feb 2016, 18:49:03 UTC - in response to Message 1764030.  

It's easy to forget that consumer GeForce cards don't have ECC memory. If you can check back on the original host's GPU, chances are it'd come up clear, or otherwise give some clues for further diagnosis if a repeatable event is found on the same unit. If it's not repeatable, then you could easily be dealing with a soft error (i.e. more to do with ambient, cosmic and semiconductor-packaging radiation, while hardware, driver or firmware issues should tend to be more repeatable).

I'm guessing it was "just one of those things". It had been the only computation error to date since starting v8 work.
This morning I noticed another error on the same system, but on the other video card. However, this one was a "Finish file present too long", and the cause was probably Windows Update: it was sucking up CPU cycles again, causing extended system pauses yet again.
Much more of this and I'll just kill it off altogether.
Grant
Darwin NT
ID: 1764123 · Report as offensive
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1764139 - Posted: 11 Feb 2016, 19:49:39 UTC - in response to Message 1764102.  

Cosmic,

I don't mean to be rude, but what do your computer issues have to do with Server Problems?

There are computer builds threads ...
ID: 1764139 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1764147 - Posted: 11 Feb 2016, 20:43:30 UTC - in response to Message 1764139.  

Cosmic,

I don't mean to be rude, but what do your computer issues have to do with Server Problems?

There are computer builds threads ...

It doesn't have anything to do with the servers. Unofficially, these "panic mode" threads have kind of been a catch-all "cafe" of sorts, as long as it doesn't get out-of-hand.

I did realize after about 4 posts about the board in the other system that I was talking about it too much in this thread and decided to stop, but then a new problem came up, and I'm not going to continue blabbing-on about it and flooding the thread with that one, either. The only other mention I'll make of it is when the replacement part arrives and fixes the problem, then that'll be that.




More related to the topic at hand though.. the list of tapes seems to be getting pretty short again... new tapes should show up soon, which means more APs. So that's something to look forward to pretty soon.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1764147 · Report as offensive
betreger (Project Donor)
Joined: 29 Jun 99
Posts: 11358
Credit: 29,581,041
RAC: 66
United States
Message 1764199 - Posted: 12 Feb 2016, 1:11:49 UTC

With only 110 MB channels left to be split, either new disks will be mounted or we will soon run out of work.
ID: 1764199 · Report as offensive
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1764334 - Posted: 12 Feb 2016, 12:51:09 UTC - in response to Message 1764139.  

Cosmic,

I don't mean to be rude, but what do your computer issues have to do with Server Problems?

There are computer builds threads ...


I have no problem with these posts, and seeing as it is my thread...

ID: 1764334 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.