recent woes


Message boards : Technical News : recent woes

Jeff Cobb
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 111
Credit: 40,367
RAC: 0
United States
Message 1038820 - Posted: 6 Oct 2010, 17:57:16 UTC

It's been a painful week, but with some progress.

The server run before last was cut short by our upload space filling up. That was fixed by the bruno migration and we started the last server run a bit early.

But a crash of our primary boinc db machine, mork, got the secondary db server, jocelyn, out of sync. That meant that all of the read only queries had to go to mork instead of jocelyn. This overwhelmed mork and I turned off web access just so the server run could continue. Then mork crashed again Monday evening. Ouch.
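The read routing described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: `DbServer`, `in_sync`, and `pick_server` are hypothetical names made up for this example. It shows why an out-of-sync jocelyn pushes every read-only query back onto mork.

```python
from dataclasses import dataclass


@dataclass
class DbServer:
    """Hypothetical stand-in for a database host (not a real BOINC class)."""
    name: str
    in_sync: bool = True


def pick_server(is_read_only: bool, primary: DbServer, replica: DbServer) -> DbServer:
    """Send read-only queries to the replica, but only while it is in sync.

    Writes always go to the primary. If the replica has fallen out of sync,
    reads fall back to the primary too, which adds to its load.
    """
    if is_read_only and replica.in_sync:
        return replica
    return primary


mork = DbServer("mork")        # primary
jocelyn = DbServer("jocelyn")  # secondary (replica)

print(pick_server(True, mork, jocelyn).name)   # jocelyn: reads are offloaded
jocelyn.in_sync = False                        # replica falls out of sync
print(pick_server(True, mork, jocelyn).name)   # mork: reads land on the primary
```

With the fallback in place the site stays correct but the primary carries all of the load, which matches the overload described above.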

Yesterday, we did our normal backup of mork and are recovering jocelyn from that today. The forums are up, but result viewing is disabled at the moment. We need to clear the back end queues ahead of the next server run and mork resources are needed for that.

Mork's tendency to crash seems to have accelerated. Perhaps this is secondary to the cooling crisis we had a couple of weeks ago. Actually, "crash" is not the correct term. It simply hangs and requires a power cycle to boot. Fortunately, we have mork on a networked power strip and can power cycle it remotely. Upon boot, there are no footprints whatsoever as to the cause of the hang. This sounds like hardware. So today we are going to bring mork down to swap out all of the memory and remove a couple of unused components in a desperate attempt to fix the problem. The forums of course will be down during this operation.
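For readers wondering how a hang like this could be caught and handled remotely, here is a minimal sketch, assuming a simple "N missed checks in a row" rule. Everything in it is hypothetical: `watch_host`, `probe`, and `power_cycle` are names invented for the example, and the real power strip interface is not described in this thread, so it is represented by a plain callback.

```python
from typing import Callable


def watch_host(
    probe: Callable[[], bool],
    power_cycle: Callable[[], None],
    max_misses: int = 3,
    checks: int = 10,
) -> int:
    """Run `checks` health probes and power cycle after repeated misses.

    `probe` returns True when the host responds. After `max_misses`
    consecutive failures, `power_cycle` is called and the miss counter
    resets. Returns the number of power cycles triggered. A real
    watchdog would also sleep between probes; that is left out here.
    """
    misses = 0
    cycles = 0
    for _ in range(checks):
        if probe():
            misses = 0
        else:
            misses += 1
            if misses >= max_misses:
                power_cycle()
                cycles += 1
                misses = 0
    return cycles


# Simulated run: the host answers once, hangs for three checks, then recovers.
responses = iter([True, False, False, False, True, True])
cycles = watch_host(
    probe=lambda: next(responses),
    power_cycle=lambda: print("power cycling host"),
    max_misses=3,
    checks=6,
)
print(cycles)  # 1
```

Requiring several consecutive misses avoids power cycling a machine that is merely slow to answer a single check.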
____________

ScarabDrowner
Volunteer tester
Joined: 13 Sep 03
Posts: 90
Credit: 456,378
RAC: 0
United States
Message 1038823 - Posted: 6 Oct 2010, 18:04:49 UTC - in response to Message 1038820.

Thanks for the update, it's appreciated. And thanks for the advance warning of the forum downtime.
____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4241
Credit: 34,947,649
RAC: 22,773
United Kingdom
Message 1038825 - Posted: 6 Oct 2010, 18:08:59 UTC - in response to Message 1038820.

Thanks for the update, Jeff. Good luck finding the cause of mork's foibles.

Claggy

Tom95134
Project donor
Joined: 27 Nov 01
Posts: 213
Credit: 3,433,125
RAC: 618
United States
Message 1038841 - Posted: 6 Oct 2010, 18:56:54 UTC

Could these latest issues be a result of the flood of work coming in after opening the pipe to the outside world?

Just because the SETI machines have gone into maintenance doesn't mean that all those BOINC machines are dozing. I believe that SETI members have generally increased the amount of local work they hold to cover the normal 3-day outage and are building up a massive backlog that has to look like a tsunami once the gates are reopened. Now that the outage is longer, the backlog is likely even larger, causing even more problems for the SETI systems.

Is there any way that the SETI team can force the SETI preferences to limit work to only one day on local machines until this gets sorted out?

Sometimes shutting down is the worst thing you can do to a system that is supposed to be available 24x7.
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5945
Credit: 62,379,565
RAC: 38,345
Australia
Message 1038842 - Posted: 6 Oct 2010, 19:01:57 UTC - in response to Message 1038841.

Sometimes shutting down is the worst thing you can do to a system that is supposed to be available 24x7.

Yep, but as Seti isn't meant to be up 24/7 that's not a problem here.
____________
Grant
Darwin NT.

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 13179
Credit: 7,923,968
RAC: 14,775
United States
Message 1038866 - Posted: 6 Oct 2010, 19:59:49 UTC

Thanks for the updates Jeff. News, any kind, is always appreciated.

____________

jravin
Joined: 25 Mar 02
Posts: 991
Credit: 106,330,508
RAC: 87,883
United States
Message 1038867 - Posted: 6 Oct 2010, 20:01:31 UTC - in response to Message 1038841.

Could these latest issues be a result of the flood of work coming in after opening the pipe to the outside world?

Just because the SETI machines have gone into maintenance doesn't mean that all those BOINC machines are dozing. I believe that SETI members have generally increased the amount of local work they hold to cover the normal 3-day outage and are building up a massive backlog that has to look like a tsunami once the gates are reopened. Now that the outage is longer, the backlog is likely even larger, causing even more problems for the SETI systems.

Is there any way that the SETI team can force the SETI preferences to limit work to only one day on local machines until this gets sorted out?

Sometimes shutting down is the worst thing you can do to a system that is supposed to be available 24x7.


Uploads have been running through this last outage and the current "normal" 3-day shutdown, so it may be that the usual tsunami will be minimal when the servers open for business on Friday (I hope). Results still need to be reported, of course, but the uploads shouldn't be a problem...
____________

Jeff Cobb
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 111
Credit: 40,367
RAC: 0
United States
Message 1038919 - Posted: 6 Oct 2010, 22:42:22 UTC - in response to Message 1038820.

So today we are going to bring mork down to swap out all of the memory and remove a couple of unused components in a desperate attempt to fix the problem. The forums of course will be down during this operation.


The memory swap did not go well. Some part of the memory we swapped in was faulty, so we fell back. Oh well, an easy test given we had the box open anyway to remove some unused SSDs. We'll likely boot memtest on mork soon to give it a proper test.
____________

SciManStev
Project donor
Volunteer tester
Joined: 20 Jun 99
Posts: 4907
Credit: 84,335,067
RAC: 28,726
United States
Message 1038922 - Posted: 6 Oct 2010, 22:48:26 UTC - in response to Message 1038919.

Thank you Jeff! We know it will be fixed when it gets fixed. The information is very much appreciated!

Steve
____________
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website

Invisible Man
Joined: 24 Jun 01
Posts: 22
Credit: 1,129,336
RAC: 0
United Kingdom
Message 1038923 - Posted: 6 Oct 2010, 22:51:35 UTC

Many thanks Jeff for your latest update.
In case things go wrong later this week the following might need to be considered:
The Weekly “Outage” (what a horrible word this is) originally was on Tuesdays only; now it seems that the system is closed on Tuesday until an indeterminate time on Friday, followed by things going wrong on Saturday/Sunday.
May I offer a suggestion to get things going as we would all like? Why not shut down SAH completely for a whole month, so that the staff can have the time to get all the systems up & running correctly, once and for all?
Your comments would be appreciated.

____________

magpii
Joined: 28 Dec 05
Posts: 9
Credit: 2,936,797
RAC: 3,113
United Kingdom
Message 1038933 - Posted: 6 Oct 2010, 23:05:32 UTC

I currently have around 30 tasks waiting to be uploaded, and according to my statistics chart, it hasn't been updated since Sept 21st. I have been trying to upload manually for the past week or so and keep getting the "internet access is ok but servers may be temporarily down" message. Is this due to ongoing tech issues on your end, or is there something wrong on my end?

Eric Korpela
Project donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1120
Credit: 10,678,418
RAC: 19,174
United States
Message 1038940 - Posted: 6 Oct 2010, 23:21:10 UTC - in response to Message 1038933.
Last modified: 6 Oct 2010, 23:23:02 UTC

The memory swap on mork didn't work, probably because at least one of the replacement DIMMs was bad. Jeff has suggested some more detailed memory tests next time mork is down. We're also considering replacing the power supplies in case the hangs are caused by power supply glitches.

Dan has contacted HP and Sun to see if they can give us a deep discount on a machine that could replace mork, hopefully deep enough that we can purchase it on what's remaining from the donations made in the Number Crunching threads.

[edit]I see now that Jeff had already posted an update. Oops[/edit]
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8813
Credit: 53,487,643
RAC: 45,388
United Kingdom
Message 1038942 - Posted: 6 Oct 2010, 23:27:29 UTC - in response to Message 1038940.

The memory swap on mork didn't work, probably because at least one of the replacement DIMMs was bad. Jeff has suggested some more detailed memory tests next time mork is down. We're also considering replacing the power supplies in case the hangs are caused by power supply glitches.

Dan has contacted HP and Sun to see if they can give us a deep discount on a machine that could replace mork, hopefully deep enough that we can purchase it on what's remaining from the donations made in the Number Crunching threads.

Don't be afraid to post the details if there turns out to be a gap between the minimum viable quote and the balance available. I'm sure there are still discretionary funds available out here in the real world, for a worthy and well-documented cause.

[edit]I see now that Jeff had already posted an update. Oops[/edit]

Don't worry about it! Too much information is always better than too little.

ScarabDrowner
Volunteer tester
Joined: 13 Sep 03
Posts: 90
Credit: 456,378
RAC: 0
United States
Message 1038944 - Posted: 6 Oct 2010, 23:29:19 UTC - in response to Message 1038940.

Thanks Eric and Jeff, the updates are appreciated.

I know the new server is already "intended" for a certain job, but if push comes to shove, could it be used to get the project back up and stable?
____________

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 16,370,202
RAC: 9,042
United States
Message 1038946 - Posted: 6 Oct 2010, 23:50:00 UTC - in response to Message 1038940.

Dan has contacted HP and Sun to see if they can give us a deep discount on a machine that could replace mork, hopefully deep enough that we can purchase it on what's remaining from the donations made in the Number Crunching threads.


Good luck Dan, that would be great, sort of like two for the price of one. I hope it doesn't cut into the amount of RAM and hard drives you guys were going to get with Oscar though. At least not too much. If it does, let us know so we can wake up Kittyman to start another drive.
____________


PROUD MEMBER OF Team Starfire World BOINC

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 13179
Credit: 7,923,968
RAC: 14,775
United States
Message 1038955 - Posted: 7 Oct 2010, 0:40:50 UTC

Thanks to both Eric and Jeff for the updates. Chasing hardware problems can be a real nightmare.

____________

Richard Huelbig
Joined: 21 Sep 04
Posts: 1
Credit: 301,125
RAC: 0
United States
Message 1038958 - Posted: 7 Oct 2010, 0:46:34 UTC - in response to Message 1038820.

I'm sure you already know this, and the equipment you're using is very likely much better than what I've used, but overheating problems followed by "hangs" might be caused by bad motherboard capacitors. Check the tops of the caps to see if they're expanded or open and leaking--if they are the motherboard is a goner. Sorry if this is obvious to you folks, but figured I would throw this in since I've run across it in the past. Good luck.

Derek Kennedy
Joined: 2 Oct 00
Posts: 13
Credit: 599,389
RAC: 0
Canada
Message 1039040 - Posted: 7 Oct 2010, 4:08:17 UTC - in response to Message 1038958.
Last modified: 7 Oct 2010, 4:21:36 UTC

I've had issues with a new rig I recently built. It would hang for however long it took me to figure out it wasn't crunching/folding (if I was at work, for example) and get home to reboot. Of course, this only happened about once a week.

I think for me the issue was that my 6-core processor kept having a core shut down. Since I found out a core was being shut down by my BIOS and remedied this, the computer has not given me issues. So far.

It doesn't make sense to me what happened, but like I said, since I fixed the issue in the BIOS it hasn't happened again.

With me folding on a GPU, and folding (now also crunching) using all cores, losing the one seemed to make the rig hang.
____________

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7122
Credit: 61,605,855
RAC: 16,256
Germany
Message 1039060 - Posted: 7 Oct 2010, 5:26:03 UTC

Jeff & Eric, thanks for the news!

____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.


Copyright © 2014 University of California