recent woes
Author | Message |
---|---|
Jeff Cobb Joined: 1 Mar 99 Posts: 122 Credit: 40,367 RAC: 0 |
It's been a painful week, but with some progress. The server run before last was cut short by our upload space filling up. That was fixed by the bruno migration and we started the last server run a bit early. But a crash of our primary boinc db machine, mork, got the secondary db server, jocelyn, out of sync. That meant that all of the read-only queries had to go to mork instead of jocelyn. This overwhelmed mork and I turned off web access just so the server run could continue. Then mork crashed again Monday evening. Ouch. Yesterday, we did our normal backup of mork and are recovering jocelyn from that today. The forums are up, but result viewing is disabled at the moment. We need to clear the back end queues ahead of the next server run and mork resources are needed for that. Mork's tendency to crash seems to have accelerated. Perhaps this is secondary to the cooling crisis we had a couple of weeks ago. Actually, "crash" is not the correct term. It simply hangs and requires a power cycle to boot. Fortunately, we have mork on a networked power strip and can power cycle it remotely. Upon boot, there are no footprints whatsoever as to the cause of the hang. This sounds like hardware. So today we are going to bring mork down to swap out all of the memory and remove a couple of unused components in a desperate attempt to fix the problem. The forums of course will be down during this operation. |
kittyman Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
Thank you, Jeff. Appreciate the update, and the kitties are wishing you the best of luck with mork! Meow meow. "Time is simply the mechanism that keeps everything from happening all at once." |
ScarabDrowner Joined: 13 Sep 03 Posts: 90 Credit: 456,378 RAC: 0 |
Thanks for the update, it's appreciated. And thanks for the advance warning of the forum downtime. |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Thanks for the update, Jeff. Good luck finding the cause of Mork's foibles. Claggy |
Tom95134 Joined: 27 Nov 01 Posts: 216 Credit: 3,790,200 RAC: 0 |
Could these latest issues be a result of the flood of work coming in after opening the pipe to the outside world? Just because the SETI machines have gone into maintenance doesn't mean that all those BOINC machines are dozing. I believe that SETI members have generally increased the amount of local work they hold to cover the normal 3-day outage and are building up a massive backlog that has to look like a tsunami once the gates are reopened. Now that the outage is longer, the backlog is likely even larger, causing even more problems for the SETI systems. Is there any way that the SETI team can force the SETI preferences to limit work to only one day on local machines until this gets sorted out? Sometimes shutting down is the worst thing you can do to a system that is supposed to be available 24x7. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
Sometimes shutting down is the worst thing you can do to a system that is supposed to be available 24x7. Yep, but as Seti isn't meant to be up 24/7, that's not a problem here. Grant Darwin NT |
Gary Charpentier Joined: 25 Dec 00 Posts: 30923 Credit: 53,134,872 RAC: 32 |
Thanks for the updates Jeff. News, any kind, is always appreciated. |
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340 |
Could these latest issues be a result of the flood of work coming in after opening the pipe to the outside world? Uploads have been running through this last outage and the current "normal" 3-day shutdown, so it may be that the usual tsunami will be minimal when the servers open for business on Friday (I hope). Still need to report, of course, but the uploads shouldn't be a problem... |
Jeff Cobb Joined: 1 Mar 99 Posts: 122 Credit: 40,367 RAC: 0 |
So today we are going to bring mork down to swap out all of the memory and remove a couple of unused components in a desperate attempt to fix the problem. The forums of course will be down during this operation. The memory swap did not go well. Some part of the memory we swapped in was faulty, so we fell back. Oh well, an easy test given we had the box open anyway to remove some unused SSDs. We'll likely boot memtest on mork soon to give it a proper test. |
SciManStev Joined: 20 Jun 99 Posts: 6657 Credit: 121,090,076 RAC: 0 |
Thank you Jeff! We know it will be fixed when it gets fixed. The information is very much appreciated! Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
Invisible Man Joined: 24 Jun 01 Posts: 22 Credit: 1,129,336 RAC: 0 |
Many thanks Jeff for your latest update. In case things go wrong later this week the following might need to be considered: The Weekly “Outage” (what a horrible word this is) originally was on Tuesdays only; now it seems that the system is closed on Tuesday until an indeterminate time on Friday, followed by things going wrong on Saturday/Sunday. May I offer a suggestion to get things going as we would all like? Why not shut down SAH completely for a whole month, so that the staff can have the time to get all the systems up & running correctly, once and for all? Your comments would be appreciated. |
magpii Joined: 28 Dec 05 Posts: 9 Credit: 8,694,267 RAC: 0 |
I currently have around 30 tasks waiting to be uploaded, and according to my statistics chart, it hasn't been updated since Sept 21st. I have been trying to manually upload for the past week or so and keep getting the "internet access is ok but servers may be temporarily down" message. Is this due to ongoing tech issues on your end or is there something wrong on my end? |
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
The memory swap on mork didn't work, probably because at least one of the replacement DIMMs was bad. Jeff has suggested some more detailed memory tests next time mork is down. We're also considering replacing the power supplies in case the hangs are caused by power supply glitches. Dan has contacted HP and Sun to see if they can give us a deep discount on a machine that could replace mork, hopefully deep enough that we can purchase it on what's remaining from the donations made in the Number Crunching threads. [edit]I see now that Jeff had already posted an update. Oops[/edit] @SETIEric@qoto.org (Mastodon) |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14673 Credit: 200,643,578 RAC: 874 |
The memory swap on mork didn't work, probably because at least one of the replacement DIMMs was bad. Jeff has suggested some more detailed memory tests next time mork is down. We're also considering replacing the power supplies in case the hangs are caused by power supply glitches. Don't be afraid to post the details if there turns out to be a gap between the minimum viable quote and the balance available. I'm sure there are still discretionary funds available out here in the real world, for a worthy and well-documented cause. [edit]I see now that Jeff had already posted an update. Oops[/edit] Don't worry about it! Too much information is always better than too little. |
ScarabDrowner Joined: 13 Sep 03 Posts: 90 Credit: 456,378 RAC: 0 |
Thanks Eric and Jeff, the updates are appreciated. I know the new server is already "intended" for a certain job, but if push comes to shove, could it be used to get the project back up and stable? |
perryjay Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
Dan has contacted HP and Sun to see if they can give us a deep discount on a machine that could replace mork, hopefully deep enough that we can purchase it on what's remaining from the donations made in the Number Crunching threads. Good luck Dan, that would be great, sort of like two for the price of one. I hope it doesn't cut into the amount of RAM and hard drives you guys were going to get with Oscar though. At least not too much. If it does, let us know so we can wake up Kittyman to start another drive. PROUD MEMBER OF Team Starfire World BOINC |
Gary Charpentier Joined: 25 Dec 00 Posts: 30923 Credit: 53,134,872 RAC: 32 |
Thanks to both Eric and Jeff for the updates. Chasing hardware problems can be a real nightmare. |
Richard Huelbig Joined: 21 Sep 04 Posts: 1 Credit: 301,125 RAC: 0 |
I'm sure you already know this, and the equipment you're using is very likely much better than what I've used, but overheating problems followed by "hangs" might be caused by bad motherboard capacitors. Check the tops of the caps to see if they're expanded or open and leaking--if they are the motherboard is a goner. Sorry if this is obvious to you folks, but figured I would throw this in since I've run across it in the past. Good luck. |
Derek Kennedy Joined: 2 Oct 00 Posts: 13 Credit: 599,389 RAC: 0 |
I've had issues with a new rig I recently built. It would hang for however long it took me to figure out it wasn't crunching/folding (if I was at work, for example) and get home to reboot it. Of course, this only happened about once a week. I think for me the issue was that my 6-core processor kept getting a core shut down. Since I found out a core was being shut down by my BIOS and remedied this, the computer has not given me issues. So far. It doesn't make sense to me what happened, but like I said, since I fixed the issue in BIOS it hasn't happened again. With me folding on a GPU, and folding (now also crunching) using all cores, losing the one seemed to make the rig hang. |
Dirk Sadowski Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Jeff & Eric, thanks for the news! |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.