Weekend server problems (Oct 1, 2007)

Author	Message
Eric Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60	Message 652617 - Posted: 2 Oct 2007, 1:43:18 UTC Last modified: 2 Oct 2007, 1:45:45 UTC What a weekend. Three server crashes in two days, followed by most of today getting things back up and running. First bruno went down, hard. We needed to come up to the lab and power it down in order to get it back up. A lot of the server processes didn't come back up and needed help. But bruno is up now, and will hopefully stay that way. Then lando and isaac went down. It looks like the UPS they were hooked up to failed without warning. They have single power supplies so when the UPS failed, they both went down. Until we get a replacement, they are hooked directly into an outlet. On top of that, automount on bruno is not mounting local devices into their proper places in the NFS tree that gets shared among our systems. That prevented the file deleter and file uploads from working and resulted in the work unit store getting overfilled. Thank the FSM for the "-o bind" option to mount. @SETIEric@qoto.org (Mastodon) ID: 652617 ·

DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 652644 - Posted: 2 Oct 2007, 2:58:15 UTC - in response to Message 652617. Last modified: 2 Oct 2007, 2:58:33 UTC I've read several recent donations on the message boards, plus a donation myself just this week. Hopefully, you guys can get a 2nd dedicated splitter machine (not bambi as she's a database server) and some gigabit LAN action. I'll keep my fingers crossed anyway. :) ID: 652644 ·

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 652657 - Posted: 2 Oct 2007, 3:42:01 UTC With all these troubles, would switching to Microsoft Server help any? (Just baiting here) ID: 652657 ·

Gary Charpentier Volunteer tester Send message Joined: 25 Dec 00 Posts: 30812 Credit: 53,134,872 RAC: 32	Message 652729 - Posted: 2 Oct 2007, 6:17:27 UTC - in response to Message 652657. Don't think you'll catch any fish here. With all these troubles, would switching to Microsoft Server help any? (Just baiting here) ID: 652729 ·

speedimic Volunteer tester Send message Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0	Message 652744 - Posted: 2 Oct 2007, 7:39:37 UTC With all these troubles, would switching to Microsoft Server help any? LOL mic. ID: 652744 ·

Kenn Benoît-Hutchins Volunteer tester Send message Joined: 24 Aug 99 Posts: 46 Credit: 18,091,320 RAC: 31	Message 652752 - Posted: 2 Oct 2007, 7:55:30 UTC - in response to Message 652657. I have two Macintosh SE machines (with hard drives). I just plugged them in and they booted up. Bound to be better than a Microsoft Server (dontcha think?). Kenn There's got to be time for humour!!!!!!! PS I really do have two operational Macintosh SE machines. With all these troubles, would switching to Microsoft Server help any? (Just baiting here) Kenn What is left unsaid is neither heard, nor heeded. Ce qui est laissé inexprimé ni n'est entendu, ni est observé. ID: 652752 ·

Dr. C.E.T.I. Send message Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0	Message 652767 - Posted: 2 Oct 2007, 9:42:15 UTC Thank You Dr. Korpela for the Time & Effort @ Berkeley . . . It is Much Appreciated . . . ID: 652767 ·

Andy Lee Robinson Send message Joined: 8 Dec 05 Posts: 630 Credit: 59,973,836 RAC: 0	Message 652816 - Posted: 2 Oct 2007, 13:24:17 UTC - in response to Message 652772. I set my computer to 10 days worth of info That's selfish overkill. I keep a day's worth, and didn't run out. 2 days cache is more than enough for a dedicated cruncher. ID: 652816 ·

Ghery S. Pettit Send message Joined: 7 Nov 99 Posts: 325 Credit: 28,109,066 RAC: 82	Message 652892 - Posted: 2 Oct 2007, 16:22:22 UTC - in response to Message 652816. I set my computer to 10 days worth of info That's selfish overkill. I keep a day's worth, and didn't run out. 2 days cache is more than enough for a dedicated cruncher. Not sure that I'd agree with that. I keep my crunchers well supplied with work as I'm away from home on business too much and when network problems happen at home they've got plenty to work on until I get back and can correct things. And, other than the initial filling of the queue, how is this any different over time than 1 day's worth? You're collecting more as you report finished WUs. The overall load on the system remains the same. ID: 652892 ·

archae86 Send message Joined: 31 Aug 99 Posts: 909 Credit: 1,582,816 RAC: 0	Message 652914 - Posted: 2 Oct 2007, 21:15:58 UTC - in response to Message 652892. The overall load on the system remains the same. Long term average, yes, but not during recovery from an extended outage. Then the long-queue machines rebuilding an already adequate queue definitely get in the way of service to out-of-work short-queue machines, and extend the length of the high system loading recovery period. On the other side of the coin, however, this impact is felt only when the outage was long enough to drive comparable short-queue machines to starvation. So, unless I'm missing something, I don't see the sense in simultaneously holding that a long queue does the system harm and gains no outage advantage for those using it. On the edge of the coin, there is the issue of the Result Duration Correction Factor (RDCF) changing with the current work load. When a big batch of "easy chewing" results gets issued and drives everybody's RDCF down, there is extra server load as queues formerly thought balanced are seen as needing refill. The long-queue machines amplify this effect. On the other hand they also time-displace it, so I'm not at all sure what the net impact is. ID: 652914 ·

Alinator Volunteer tester Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0	Message 652929 - Posted: 2 Oct 2007, 21:43:44 UTC Last modified: 2 Oct 2007, 21:46:54 UTC Well, one effect I noticed, which had nothing to do with MB per se, is that it sure seems like since Thumper died back in May, when the project came back up from that, the reaction of a lot of participants was to max out their cache. As the problem continued over the summer, and then with the rollout of MB and the work shortage issues which arose from that, even more folks have maxed the cache. So the net effect is there is an awful lot more work in progress than there really needs to be, and that can't be helping matters. The other downside to it is if you max out and then go back to auto-pilot, if you have a ton of work on board, that can all get irrecoverably dumped/abandoned with a host side malfunction, and with the somewhat off the mark deadline estimates for some AR ranges currently can result in a lot of work hanging around even longer than it would have under the same conditions on ESAH. Personally, I think the best thing we could do to help out right now is to set the cache to 3 or 4 days max and let things settle down a bit. I'm not saying that if you have a legitimate need to carry more temporarily for some reason to not do so, but to carry 10 days routinely is just asking for trouble eventually on your side, and makes for more burden on the backend with little return benefit. Alinator ID: 652929 ·

hanyou23 Send message Joined: 14 May 00 Posts: 46 Credit: 2,357,323 RAC: 0	Message 652957 - Posted: 2 Oct 2007, 22:11:58 UTC I have a couple of old Macintosh Workgroup Servers if it helps out at all :E ~..... ID: 652957 ·

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 653023 - Posted: 2 Oct 2007, 23:00:21 UTC - in response to Message 652929. Well, one effect I noticed, which had nothing to do with MB per se, is that it sure seems like since Thumper died back in May, when the project came back up from that, the reaction of a lot of participants was to max out their cache. As the problem continued over the summer, and then with the rollout of MB and the work shortage issues which arose from that, even more folks have maxed the cache. So the net effect is there is an awful lot more work in progress than there really needs to be, and that can't be helping matters. The other downside to it is if you max out and then go back to auto-pilot, if you have a ton of work on board, that can all get irrecoverably dumped/abandoned with a host side malfunction, and with the somewhat off the mark deadline estimates for some AR ranges currently can result in a lot of work hanging around even longer than it would have under the same conditions on ESAH. Personally, I think the best thing we could do to help out right now is to set the cache to 3 or 4 days max and let things settle down a bit. I'm not saying that if you have a legitimate need to carry more temporarily for some reason to not do so, but to carry 10 days routinely is just asking for trouble eventually on your side, and makes for more burden on the backend with little return benefit. Alinator Of course, there is the option of attaching to multiple projects. I have enough projects attached to each machine so that several can be down at the same time and I don't really notice. The hosts with the always-on connection have a "connect every X" of 0 days with 0.1 days of extra work. I also have some that cannot connect during the day, and those have queues large enough to get through the day. BOINC WIKI ID: 653023 ·

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 653050 - Posted: 2 Oct 2007, 23:37:16 UTC We need a rejuvenated Wiki with some "best known methods" listed to answer recurring questions like this, once and for all (at least until the next architecture change). However, it seems to me that the best thing we can do for the server side operation is to have an absolute minimum number of wu's in each of our caches. Each wu we have lying dormant on our clients is a drain on the overall system. For example, right now we have something like 2 million wu's in progress that must be kept in the server system, even though we can only return about 300K wu's per day as a group. So we have about 4-6x inefficiency here. (Rough numbers here; yes some people compute faster than others) A minimum cache would also permit faster crediting of results, obviously. Plus most of us crunch on different projects, giving our favorite project the greatest resource share. So this has the effect of keeping us computing most of the time, even if it is for different projects from time to time. Having a short/zero cache doesn't seem to imply an increase in the loading on the servers. People may fear that the user base would make repeated wu requests and clobber the servers that way. However, a request/download cycle should be considerably faster than a wu compute time, and sequential multiple downloads should be faster than a repeated request/download cycle. So perhaps the servers should determine an optimal number of wu's to download for each request, based on recent loading and the client's recent performance. ID: 653050 ·
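The rough numbers in the post above can be checked with a quick back-of-envelope calculation. This is only a sketch: the two figures are the approximations quoted by the poster (about 2 million work units in progress, about 300K returned per day), and the variable names are made up for illustration.

```python
# Back-of-envelope estimate using the rough figures quoted in the post.
# Both values are the poster's approximations, not measured project data.
wus_in_progress = 2_000_000      # work units currently out on clients
wus_returned_per_day = 300_000   # work units returned per day, project-wide

# Days of work outstanding if clients keep returning at the same rate.
days_outstanding = wus_in_progress / wus_returned_per_day

print(f"Roughly {days_outstanding:.1f} days of work outstanding")
```

With these inputs the ratio comes out a little under 7, which is in the same ballpark as, if slightly above, the "4-6x" quoted in the post.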

Pappa Volunteer tester Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0	Message 653180 - Posted: 3 Oct 2007, 3:24:01 UTC - in response to Message 652644. DJStarfox et al Last Friday, I had the chance to visit the Seti Lab at Berkeley. It was nice to meet with Eric again and allow my wife to meet with Eric, as she missed the first meeting. To see what so many of "us" have donated so much time, effort and contributions for. I did get to see the machine that was Seti Classic, which is now dark (that was close to half a million computer hours on my part). One thing that you mention is that the older servers are actually Sun workstations that were pressed into service to support Seti. Bruno was constructed through the efforts of many users, not to mention other hardware that was donated, which helped to upgrade other things to increase the reliability of the Seti Servers. Some of those same donations have managed to get some Gigabit into Seti; that is not to say more could not be put to good use (getting it configured and working was also covered in one of Matt's posts in the tech news). Only that some is in place and recently expanded. That said, Blurf has information in Hardware Donations. The next donation push is still in planning. Eric once posted requesting Volunteers to help with that planning; there is still time to help with ideas to make it better. I will note that your donation is very much appreciated by everyone that is working on Seti. It is a bit sad that more do not know of the need. Together we can keep Seti Alive! Thank You Regards Pappa I've read several recent donations on the message boards, plus a donation myself just this week. Hopefully, you guys can get a 2nd dedicated splitter machine (not bambi as she's a database server) and some gigabit LAN action. I'll keep my fingers crossed anyway. :) Please consider a Donation to the Seti Project. ID: 653180 ·

Mentor397 Send message Joined: 16 May 99 Posts: 25 Credit: 6,794,344 RAC: 108	Message 653245 - Posted: 3 Oct 2007, 5:53:23 UTC - in response to Message 653050. Having a short/zero cache doesn't seem to imply an increase in the loading on the servers. People may fear that the user base would make repeated wu requests and clobber the servers that way. However, a request/download cycle should be considerably faster than a wu compute time, and sequential multiple downloads should be faster than a repeated request/download cycle. So perhaps the servers should determine an optimal number of wu's to download for each request, based on recent loading and the client's recent performance. Doesn't it already though? If you have a fast computer, the system will take into account your past performance and typically download you more WU's if your cache is set for five days (as an example) than it would download for a considerably slower system, running the same amount of time with the same configuration. Uh, I realize that suddenly I'm going to sound very old fashioned, but when I started way back when, I did it because I liked Seti@home. The science aspect is nice, but I'm not so interested in diseases or climates and therefore haven't found another BOINC project I'm interested in. Unless I was willing to kill the other projects when SETI went back up, I'd have to accept a permanent decrease in SETI capacity in order to keep the other projects current. While I'm sure it would be great for credit, it doesn't give me the most bang for what-I'm-interested-in. If that made sense. ID: 653245 ·
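The point about faster hosts receiving more work for the same cache setting can be illustrated with a small sketch. This is not the actual BOINC scheduler logic; `estimated_wus_for_cache` is a hypothetical helper, and the per-WU times (2 hours and 10 hours) are invented for the example.

```python
SECONDS_PER_DAY = 86400

def estimated_wus_for_cache(cache_days, est_seconds_per_wu, fraction_on=1.0):
    """Hypothetical helper: how many work units fit in a cache of `cache_days`.

    `fraction_on` is the assumed fraction of the day the host is crunching.
    """
    seconds_available = cache_days * SECONDS_PER_DAY * fraction_on
    return int(seconds_available // est_seconds_per_wu)

# Same five-day cache setting, two hosts with different (made-up) speeds.
fast_host = estimated_wus_for_cache(5, est_seconds_per_wu=2 * 3600)   # 2 h per WU
slow_host = estimated_wus_for_cache(5, est_seconds_per_wu=10 * 3600)  # 10 h per WU

print(fast_host, slow_host)
```

Under these assumptions the faster host holds five times as many work units for the same number of days, which is the effect described in the post.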

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19227 Credit: 40,757,560 RAC: 67	Message 653296 - Posted: 3 Oct 2007, 10:03:20 UTC - in response to Message 653245. Having a short/zero cache doesn't seem to imply an increase in the loading on the servers. People may fear that the user base would make repeated wu requests and clobber the servers that way. However, a request/download cycle should be considerably faster than a wu compute time, and sequential multiple downloads should be faster than a repeated request/download cycle. So perhaps the servers should determine an optimal number of wu's to download for each request, based on recent loading and the client's recent performance. Doesn't it already though? If you have a fast computer, the system will take into account your past performance and typically download you more WU's if your cache is set for five days (as an example) than it would download for a considerably slower system, running the same amount of time with the same configuration. Uh, I realize that suddenly I'm going to sound very old fashioned, but when I started way back when, I did it because I liked Seti@home. The science aspect is nice, but I'm not so interested in diseases or climates and therefore haven't found another BOINC project I'm interested in. Unless I was willing to kill the other projects when SETI went back up, I'd have to accept a permanent decrease in SETI capacity in order to keep the other projects current. While I'm sure it would be great for credit, it doesn't give me the most bang for what-I'm-interested-in. If that made sense. The amount a host downloads is calculated from the benchmark figures and the RDCF. The benchmarks are notoriously flawed; just ask anybody with a dual boot computer on Linux/Win. The RDCF was introduced into the scheduler to compensate for incorrect benchmarks but is initially one (1) when a host is first attached.
The RDCF is lowered exponentially, but slowly, if the actual time to complete is lower than predicted, but increases immediately if longer than predicted. You can set a very low resource share for a backup project, e.g. Seti 100:1 backup, or even 100,000:1. The BOINC manager will ensure the initial backup units are crunched on time, at the expense of making the Long Term Debt (LTD) a large negative number, so that under normal circumstances BOINC will not download from the backup project unless Seti is down and BOINC decides you need more work. ID: 653296 ·
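The asymmetric RDCF behaviour described above (slow exponential decrease, immediate increase) can be sketched roughly as follows. This is an illustration of the idea only, not the BOINC client's actual code: `update_rdcf` is a hypothetical function, the `decay` value of 0.1 is an assumed smoothing constant, and the run times are invented.

```python
def update_rdcf(rdcf, raw_estimate_s, actual_s, decay=0.1):
    """Illustrative asymmetric update of a duration correction factor.

    raw_estimate_s: benchmark-based run time estimate before correction.
    actual_s:       measured run time of the finished result.
    decay:          assumed smoothing constant for the slow downward move.
    """
    target = actual_s / raw_estimate_s
    if target > rdcf:
        # Result took longer than predicted: jump up immediately.
        return target
    # Result was quicker than predicted: move a fraction of the way down.
    return rdcf + decay * (target - rdcf)

rdcf = 1.0  # starting value for a newly attached host, as stated in the post
for _ in range(3):
    # Three results that each finish in half the raw estimate.
    rdcf = update_rdcf(rdcf, raw_estimate_s=10_000, actual_s=5_000)
after_fast_results = rdcf

# One result that overruns the raw estimate by 20%.
rdcf = update_rdcf(rdcf, raw_estimate_s=10_000, actual_s=12_000)

print(after_fast_results, rdcf)
```

In this sketch, three quick results only bring the factor down gradually, while a single overrun pushes it straight up to the observed ratio.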

ML1 Volunteer moderator Volunteer tester Send message Joined: 25 Nov 01 Posts: 20635 Credit: 7,508,002 RAC: 20	Message 653300 - Posted: 3 Oct 2007, 10:13:05 UTC - in response to Message 653245. Last modified: 3 Oct 2007, 10:14:45 UTC ... Unless I was willing to kill the other projects when SETI went back up, I'd have to accept a permanent decrease in SETI capacity in order to keep the other projects current. While I'm sure it would be great for credit, it doesn't give me the most bang for what-i'm-interested in. If that made sense. If you're 's@h-only' and want to use other projects as a 'backup' project, choose two that have short(est) WUs and attach to them with a very small resource share. You can even set as low as 0.1 for example so that they rarely (if ever) get to run. There is more about that in the Boinc-Help link below. Happy crunchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) ID: 653300 ·
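To get a feel for how small a backup project's slice becomes with a very low resource share, here is a minimal sketch. It only computes each project's long-run fraction of the total share; real BOINC scheduling also depends on debts and deadlines, as noted earlier in the thread. The project names "backup A" and "backup B" and the share of 100 for SETI@home are assumptions for the example.

```python
def share_fractions(shares):
    """Return each project's fraction of the total resource share."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}

# SETI@home as the main project plus two hypothetical backup projects
# attached with a very small share of 0.1 each.
fractions = share_fractions({
    "SETI@home": 100,
    "backup A": 0.1,
    "backup B": 0.1,
})

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2%}")
```

With these shares the two backup projects together get roughly 0.2% of the long-run processing time, so they would rarely run while SETI@home has work available.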

Mike46360 Send message Joined: 1 Jan 07 Posts: 65 Credit: 40,307 RAC: 0	Message 653349 - Posted: 3 Oct 2007, 12:48:31 UTC I can't attach to the Seti@Home project..are the servers messed up again? ID: 653349 ·

n7rfa Volunteer tester Send message Joined: 13 Apr 04 Posts: 370 Credit: 9,058,599 RAC: 0	Message 653358 - Posted: 3 Oct 2007, 13:12:39 UTC - in response to Message 653349. I can't attach to the Seti@Home project..are the servers messed up again? They don't appear to be. I'm still able to upload and download. ID: 653358 ·
