recent woes

Message boards : Technical News : recent woes

Jeff Cobb · Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 122
Credit: 40,367
RAC: 0
United States
Message 1038820 - Posted: 6 Oct 2010, 17:57:16 UTC

It's been a painful week, but with some progress.

The server run before last was cut short by our upload space filling up. That was fixed by the bruno migration and we started the last server run a bit early.

But a crash of our primary boinc db machine, mork, got the secondary db server, jocelyn, out of sync. That meant that all of the read only queries had to go to mork instead of jocelyn. This overwhelmed mork and I turned off web access just so the server run could continue. Then mork crashed again Monday evening. Ouch.
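
To illustrate why losing the secondary hurts so much, here is a minimal sketch of read/write routing between a primary and a replica. The routing rule, the 90/10 read/write mix, and the names `DbServer`, `route_query`, and `simulate` are assumptions made for illustration, not the project's actual setup.

```python
from dataclasses import dataclass


@dataclass
class DbServer:
    name: str
    in_sync: bool = True      # a replica is only usable while in sync with the primary
    queries_served: int = 0


def route_query(is_read_only: bool, primary: DbServer, replica: DbServer) -> DbServer:
    """Pick a server for one query (hypothetical routing rule)."""
    # Read-only queries go to the replica only while it is in sync;
    # writes, and reads during a replica outage, go to the primary.
    if is_read_only and replica.in_sync:
        target = replica
    else:
        target = primary
    target.queries_served += 1
    return target


def simulate(replica_in_sync: bool, reads: int = 90, writes: int = 10) -> dict:
    """Count queries per server for an assumed mix of reads and writes."""
    primary = DbServer("mork")
    replica = DbServer("jocelyn", in_sync=replica_in_sync)
    for _ in range(reads):
        route_query(True, primary, replica)
    for _ in range(writes):
        route_query(False, primary, replica)
    return {primary.name: primary.queries_served, replica.name: replica.queries_served}


if __name__ == "__main__":
    print(simulate(replica_in_sync=True))    # reads are absorbed by jocelyn
    print(simulate(replica_in_sync=False))   # every query lands on mork
```

With the replica out of sync, the primary in this toy model takes the whole query load instead of only the writes, which is the kind of pile-up described above.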

Yesterday, we did our normal backup of mork and are recovering jocelyn from that today. The forums are up, but result viewing is disabled at the moment. We need to clear the back end queues ahead of the next server run and mork resources are needed for that.

Mork's tendency to crash seems to have accelerated. Perhaps this is secondary to the cooling crisis we had a couple of weeks ago. Actually, "crash" is not the correct term. It simply hangs and requires a power cycle to boot. Fortunately, we have mork on a networked power strip and can power cycle it remotely. Upon boot, there are no footprints whatsoever as to the cause of the hang. This sounds like hardware. So today we are going to bring mork down to swap out all of the memory and remove a couple of unused components in a desperate attempt to fix the problem. The forums of course will be down during this operation.
ID: 1038820
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1038821 - Posted: 6 Oct 2010, 18:02:56 UTC

Thank you, Jeff.
Appreciate the update, and the kitties are wishing you the best of luck with mork!

Meow meow.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1038821
ScarabDrowner
Volunteer tester
Joined: 13 Sep 03
Posts: 90
Credit: 456,378
RAC: 0
United States
Message 1038823 - Posted: 6 Oct 2010, 18:04:49 UTC - in response to Message 1038820.  

Thanks for the update, it's appreciated. And thanks for the advance warning of the forum downtime.
ID: 1038823
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1038825 - Posted: 6 Oct 2010, 18:08:59 UTC - in response to Message 1038820.  

Thanks for the update, Jeff. Good luck finding the cause of Mork's foibles.

Claggy
ID: 1038825
Tom95134
Joined: 27 Nov 01
Posts: 216
Credit: 3,790,200
RAC: 0
United States
Message 1038841 - Posted: 6 Oct 2010, 18:56:54 UTC

Could these latest issues be a result of the flood of work coming in after opening the pipe to the outside world?

Just because the SETI machines have gone into maintenance doesn't mean that all those BOINC machines are dozing. I believe that SETI members have generally increased the amount of local work they hold to cover the normal 3-day outage, and are building up a massive backlog that has to look like a tsunami once the gates are reopened. Now that the outage is longer, the backlog is likely even larger, causing even more problems for the SETI systems.

Is there any way that the SETI team can force the SETI preferences to limit work to only one day on local machines until this gets sorted out?

Sometimes shutting down is the worst thing you can do to a system that is supposed to be available 24x7.
ID: 1038841
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1038842 - Posted: 6 Oct 2010, 19:01:57 UTC - in response to Message 1038841.  

Sometimes shutting down is the worst thing you can do to a system that is supposed to be available 24x7.

Yep, but as SETI isn't meant to be up 24/7, that's not a problem here.
Grant
Darwin NT
ID: 1038842
Gary Charpentier · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30923
Credit: 53,134,872
RAC: 32
United States
Message 1038866 - Posted: 6 Oct 2010, 19:59:49 UTC

Thanks for the updates Jeff. News, any kind, is always appreciated.

ID: 1038866
Cruncher-American · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1038867 - Posted: 6 Oct 2010, 20:01:31 UTC - in response to Message 1038841.  

Could these latest issues be a result of the flood of work coming in after opening the pipe to the outside world?

Just because the SETI machines have gone into maintenance doesn't mean that all those BOINC machines are dozing. I believe that SETI members have generally increased the amount of local work they hold to cover the normal 3-day outage, and are building up a massive backlog that has to look like a tsunami once the gates are reopened. Now that the outage is longer, the backlog is likely even larger, causing even more problems for the SETI systems.

Is there any way that the SETI team can force the SETI preferences to limit work to only one day on local machines until this gets sorted out?

Sometimes shutting down is the worst thing you can do to a system that is supposed to be available 24x7.


Uploads have been running through this last outage and the current "normal" 3-day shutdown, so it may be that the usual tsunami will be minimal when the servers open for business on Friday (I hope). Results still need to be reported, of course, but the uploads shouldn't be a problem...
ID: 1038867
Jeff Cobb · Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 122
Credit: 40,367
RAC: 0
United States
Message 1038919 - Posted: 6 Oct 2010, 22:42:22 UTC - in response to Message 1038820.  

So today we are going to bring mork down to swap out all of the memory and remove a couple of unused components in a desperate attempt to fix the problem. The forums of course will be down during this operation.


The memory swap did not go well. Some part of the memory we swapped in was faulty, so we fell back. Oh well, an easy test given we had the box open anyway to remove some unused SSDs. We'll likely boot memtest on mork soon to give it a proper test.
ID: 1038919
SciManStev · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Jun 99
Posts: 6657
Credit: 121,090,076
RAC: 0
United States
Message 1038922 - Posted: 6 Oct 2010, 22:48:26 UTC - in response to Message 1038919.  

Thank you Jeff! We know it will be fixed when it gets fixed. The information is very much appreciated!

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1038922
Invisible Man
Joined: 24 Jun 01
Posts: 22
Credit: 1,129,336
RAC: 0
United Kingdom
Message 1038923 - Posted: 6 Oct 2010, 22:51:35 UTC

Many thanks Jeff for your latest update.
In case things go wrong later this week the following might need to be considered:
The Weekly “Outage” (what a horrible word this is) originally was on Tuesdays only; now it seems that the system is closed on Tuesday until an indeterminate time on Friday, followed by things going wrong on Saturday/Sunday.
May I offer a suggestion to get things going as we would all like? Why not shut down SAH completely for a whole month, so that the staff can have the time to get all the systems up & running correctly, once and for all?
Your comments would be appreciated.

ID: 1038923
magpii
Joined: 28 Dec 05
Posts: 9
Credit: 8,694,267
RAC: 0
United Kingdom
Message 1038933 - Posted: 6 Oct 2010, 23:05:32 UTC

I currently have around 30 tasks waiting to be uploaded, and according to my statistics chart, it hasn't been updated since Sept 21st. I have been trying to upload manually for the past week or so and keep getting the "internet access is ok but servers may be temporarily down" message. Is this due to ongoing tech issues on your end, or is there something wrong on my end?
ID: 1038933
Eric Korpela · Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 1038940 - Posted: 6 Oct 2010, 23:21:10 UTC - in response to Message 1038933.  
Last modified: 6 Oct 2010, 23:23:02 UTC

The memory swap on mork didn't work, probably because at least one of the replacement DIMMs was bad. Jeff has suggested some more detailed memory tests next time mork is down. We're also considering replacing the power supplies in case the hangs are caused by power supply glitches.

Dan has contacted HP and Sun to see if they can give us a deep discount on a machine that could replace mork, hopefully deep enough that we can purchase it on what's remaining from the donations made in the Number Crunching threads.

[edit]I see now that Jeff had already posted an update. Oops[/edit]
@SETIEric@qoto.org (Mastodon)

ID: 1038940
Richard Haselgrove · Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14673
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1038942 - Posted: 6 Oct 2010, 23:27:29 UTC - in response to Message 1038940.  

The memory swap on mork didn't work, probably because at least one of the replacement DIMMs was bad. Jeff has suggested some more detailed memory tests next time mork is down. We're also considering replacing the power supplies in case the hangs are caused by power supply glitches.

Dan has contacted HP and Sun to see if they can give us a deep discount on a machine that could replace mork, hopefully deep enough that we can purchase it on what's remaining from the donations made in the Number Crunching threads.

Don't be afraid to post the details if there turns out to be a gap between the minimum viable quote and the balance available. I'm sure there are still discretionary funds available out here in the real world, for a worthy and well-documented cause.

[edit]I see now that Jeff had already posted an update. Oops[/edit]

Don't worry about it! Too much information is always better than too little.
ID: 1038942
ScarabDrowner
Volunteer tester
Joined: 13 Sep 03
Posts: 90
Credit: 456,378
RAC: 0
United States
Message 1038944 - Posted: 6 Oct 2010, 23:29:19 UTC - in response to Message 1038940.  

Thanks Eric and Jeff, the updates are appreciated.

I know the new server is already "intended" for a certain job, but if push comes to shove, could it be used to get the project back up and stable?
ID: 1038944
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1038946 - Posted: 6 Oct 2010, 23:50:00 UTC - in response to Message 1038940.  

Dan has contacted HP and Sun to see if they can give us a deep discount on a machine that could replace mork, hopefully deep enough that we can purchase it on what's remaining from the donations made in the Number Crunching threads.


Good luck Dan, that would be great, sort of like two for the price of one. I hope it doesn't cut into the amount of RAM and hard drives you guys were going to get with Oscar, though. At least not too much. If it does, let us know so we can wake up Kittyman to start another drive.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1038946
Gary Charpentier · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30923
Credit: 53,134,872
RAC: 32
United States
Message 1038955 - Posted: 7 Oct 2010, 0:40:50 UTC

Thanks to both Eric and Jeff for the updates. Chasing hardware problems can be a real nightmare.

ID: 1038955
Richard Huelbig
Joined: 21 Sep 04
Posts: 1
Credit: 301,125
RAC: 0
United States
Message 1038958 - Posted: 7 Oct 2010, 0:46:34 UTC - in response to Message 1038820.  

I'm sure you already know this, and the equipment you're using is very likely much better than what I've used, but overheating problems followed by "hangs" might be caused by bad motherboard capacitors. Check the tops of the caps to see if they're bulging or split open and leaking. If they are, the motherboard is a goner. Sorry if this is obvious to you folks, but I figured I would throw this in since I've run across it in the past. Good luck.
ID: 1038958
Derek Kennedy
Joined: 2 Oct 00
Posts: 13
Credit: 599,389
RAC: 0
Canada
Message 1039040 - Posted: 7 Oct 2010, 4:08:17 UTC - in response to Message 1038958.  
Last modified: 7 Oct 2010, 4:21:36 UTC

I've had issues with a new rig I recently built. It would hang for however long it took me to figure out it wasn't crunching/folding (if I was at work, for example) and get home to reboot. Of course, this only happened about once a week.

I think for me the issue was that my 6-core processor kept getting a core shut down. Since I found out a core was being shut down by my BIOS and remedied this, the computer has not given me issues. So far.

It doesn't make sense to me what happened, but like I said, since I fixed the issue in the BIOS it hasn't happened again.

With me folding on a GPU, and folding (now also crunching) using all cores, losing the one seemed to make the rig hang.
ID: 1039040
Dirk Sadowski
Volunteer tester
Send message
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1039060 - Posted: 7 Oct 2010, 5:26:03 UTC

Jeff & Eric, thanks for the news!

ID: 1039060

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.