Weekend server problems (Oct 1, 2007)

Message boards : Technical News : Weekend server problems (Oct 1, 2007)

Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 652617 - Posted: 2 Oct 2007, 1:43:18 UTC
Last modified: 2 Oct 2007, 1:45:45 UTC

What a weekend. Three server crashes in two days, followed by most of today getting things back up and running.

First bruno went down, hard. We needed to come up to the lab and power it down in order to get it back up. A lot of the server processes didn't come back up and needed help. But bruno is up now, and will hopefully stay that way.

Then lando and isaac went down. It looks like the UPS they were hooked up to failed without warning. They have single power supplies so when the UPS failed, they both went down. Until we get a replacement, they are hooked directly into an outlet.

On top of that, automount on bruno is not mounting local devices into their proper places in the NFS tree that gets shared among our systems. That prevented the file deleter and file uploads from working and resulted in the work unit store getting overfilled. Thank the FSM for the "-o bind" option to mount.
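For readers unfamiliar with the `-o bind` option mentioned above: a bind mount makes an already-mounted directory appear at a second path, which is one way to put a local device into its expected place in a shared tree by hand. The sketch below only builds the command line and does not run it. Both paths are hypothetical; the post does not say which directories were involved.

```python
import shlex


def bind_mount_command(source, target):
    """Build the argv for a Linux bind mount; nothing is executed here."""
    return ["mount", "-o", "bind", source, target]


# Hypothetical paths, chosen only for illustration.
cmd = bind_mount_command("/disks/local/uploads", "/nfs/bruno/uploads")
print(shlex.join(cmd))
```

Actually running such a command needs root privileges and an existing target directory.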

@SETIEric@qoto.org (Mastodon)

ID: 652617 · Report as offensive
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 652644 - Posted: 2 Oct 2007, 2:58:15 UTC - in response to Message 652617.  
Last modified: 2 Oct 2007, 2:58:33 UTC

I've read about several recent donations on the message boards, and made a donation myself just this week. Hopefully, you guys can get a 2nd dedicated splitter machine (not bambi as she's a database server) and some gigabit LAN action.

I'll keep my fingers crossed anyway. :)
ID: 652644 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 652657 - Posted: 2 Oct 2007, 3:42:01 UTC

With all these troubles, would switching to Microsoft Server help any?


(Just baiting here)
ID: 652657 · Report as offensive
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 25 Dec 00
Posts: 30801
Credit: 53,134,872
RAC: 32
United States
Message 652729 - Posted: 2 Oct 2007, 6:17:27 UTC - in response to Message 652657.  

Don't think you'll catch any fish here.

With all these troubles, would switching to Microsoft Server help any?


(Just baiting here)


ID: 652729 · Report as offensive
Profile speedimic
Volunteer tester
Avatar

Send message
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 652744 - Posted: 2 Oct 2007, 7:39:37 UTC

With all these troubles, would switching to Microsoft Server help any?


LOL
mic.


ID: 652744 · Report as offensive
Profile Kenn Benoît-Hutchins
Volunteer tester
Avatar

Send message
Joined: 24 Aug 99
Posts: 46
Credit: 18,091,320
RAC: 31
Canada
Message 652752 - Posted: 2 Oct 2007, 7:55:30 UTC - in response to Message 652657.  

I have two Macintosh SE machines (with hard drives). I just plugged them in and
they booted up. Bound to be better than a Microsoft Server (dontcha think?).

Kenn

There’s got to be time for humour!!!!!!!

PS

I really do have two operational Macintosh SE machines.

With all these troubles, would switching to Microsoft Server help any?


(Just baiting here)


Kenn

What is left unsaid is neither heard, nor heeded.
Ce qui est laissé inexprimé ni n'est entendu, ni est observé.
ID: 652752 · Report as offensive
Profile Dr. C.E.T.I.
Avatar

Send message
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 652767 - Posted: 2 Oct 2007, 9:42:15 UTC


Thank You Dr. Korpela for the Time & Effort @ Berkeley . . . It is Much Appreciated . . .

ID: 652767 · Report as offensive
Profile Andy Lee Robinson
Avatar

Send message
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 652816 - Posted: 2 Oct 2007, 13:24:17 UTC - in response to Message 652772.  

I set my computer to 10 days worth of info

That's selfish overkill.
I keep a day's worth, and didn't run out.
2 days cache is more than enough for a dedicated cruncher.
ID: 652816 · Report as offensive
Profile Ghery S. Pettit
Avatar

Send message
Joined: 7 Nov 99
Posts: 325
Credit: 28,109,066
RAC: 82
United States
Message 652892 - Posted: 2 Oct 2007, 16:22:22 UTC - in response to Message 652816.  

I set my computer to 10 days worth of info

That's selfish overkill.
I keep a day's worth, and didn't run out.
2 days cache is more than enough for a dedicated cruncher.


Not sure that I'd agree with that. I keep my crunchers well supplied with work as I'm away from home on business too much and when network problems happen at home they've got plenty to work on until I get back and can correct things. And, other than the initial filling of the queue, how is this any different over time than 1 day's worth? You're collecting more as you report finished WUs. The overall load on the system remains the same.


ID: 652892 · Report as offensive
archae86

Send message
Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 652914 - Posted: 2 Oct 2007, 21:15:58 UTC - in response to Message 652892.  

The overall load on the system remains the same.

Long term average, yes, but not during recovery from an extended outage. Then the long queue machines rebuilding an already adequate queue definitely get in the way of service to out-of-work short queue machines, and extend the length of the high system loading recovery period.

On the other side of the coin, however, this impact is felt when the outage was long enough to drive comparable short-queue machines to starvation. So, unless I'm missing something, I don't see the sense in simultaneously holding that a long queue does the system harm and gives no outage advantage to those using it.

On the edge of the coin, there is the issue of changing Result duration correction factor with current work load. When a big batch of "easy chewing" results gets issued and drives everybody's RDCF down, there is extra server load as queues formerly thought balanced are seen as needing refill. The long-queue machines amplify this effect. On the other hand they also time-displace it, so I'm not at all sure what the net impact is.

ID: 652914 · Report as offensive
Alinator
Volunteer tester

Send message
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 652929 - Posted: 2 Oct 2007, 21:43:44 UTC
Last modified: 2 Oct 2007, 21:46:54 UTC

Well, one effect I noticed, which had nothing to do with MB per se, is that since Thumper died back in May, the reaction of a lot of participants when the project came back up seems to have been to max out their cache.

As the problem continued over the summer, and then with the rollout of MB and the work shortage issues which arose from that, even more folks have maxed the cache.

So the net effect is there is an awful lot more work in progress than there really needs to be, and that can't be helping matters.

The other downside is that if you max out and then go back to auto-pilot with a ton of work on board, it can all get irrecoverably dumped/abandoned by a host-side malfunction, and with the somewhat off-the-mark deadline estimates for some AR ranges currently, that can result in a lot of work hanging around even longer than it would have under the same conditions on ESAH.

Personally, I think the best thing we could do to help out right now is to set the cache to 3 or 4 days max and let things settle down a bit. I'm not saying you shouldn't carry more temporarily if you have a legitimate need, but carrying 10 days routinely is just asking for trouble eventually on your side, and makes for more burden on the backend with little return benefit.

Alinator
ID: 652929 · Report as offensive
Profile hanyou23
Avatar

Send message
Joined: 14 May 00
Posts: 46
Credit: 2,357,323
RAC: 0
United States
Message 652957 - Posted: 2 Oct 2007, 22:11:58 UTC

I have a couple of old Macintosh Workgroup Servers if it helps out at all :E ~.....
ID: 652957 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 653023 - Posted: 2 Oct 2007, 23:00:21 UTC - in response to Message 652929.  

Well, one effect I noticed, which had nothing to do with MB per se, is that since Thumper died back in May, the reaction of a lot of participants when the project came back up seems to have been to max out their cache.

As the problem continued over the summer, and then with the rollout of MB and the work shortage issues which arose from that, even more folks have maxed the cache.

So the net effect is there is an awful lot more work in progress than there really needs to be, and that can't be helping matters.

The other downside is that if you max out and then go back to auto-pilot with a ton of work on board, it can all get irrecoverably dumped/abandoned by a host-side malfunction, and with the somewhat off-the-mark deadline estimates for some AR ranges currently, that can result in a lot of work hanging around even longer than it would have under the same conditions on ESAH.

Personally, I think the best thing we could do to help out right now is to set the cache to 3 or 4 days max and let things settle down a bit. I'm not saying you shouldn't carry more temporarily if you have a legitimate need, but carrying 10 days routinely is just asking for trouble eventually on your side, and makes for more burden on the backend with little return benefit.

Alinator

Of course, there is the option of attaching to multiple projects. I have enough projects attached to each machine so that several can be down at the same time and I don't really notice. The hosts with the always-on connection have a "connect every X" of 0 days with 0.1 days of extra work. I also have some that cannot connect during the day, and those have queues large enough to get through the day.


BOINC WIKI
ID: 653023 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 653050 - Posted: 2 Oct 2007, 23:37:16 UTC

We need a rejuvenated Wiki with some "best known methods" listed to answer recurring questions like this, once and for all (at least until the next architecture change).

However, it seems to me that the best thing we can do for the server side operation is to have an absolute minimum number of wu's in each of our caches. Each wu we have lying dormant on our clients is a drain on the overall system. For example, right now we have something like 2 million wu's in progress that must be kept in the server system, even though we can only return about 300K wu's per day as a group. So we have about 4-6x inefficiency here. (Rough numbers here; yes some people compute faster than others)
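As a quick check on the rough figures above, the ratio of work in progress to daily returns can be worked out directly. This is only arithmetic on the two numbers quoted in the post (about 2 million in progress, about 300K returned per day); the function name is made up for illustration.

```python
def days_of_work_in_flight(in_progress, returned_per_day):
    """Ratio of outstanding work units to the daily return rate."""
    return in_progress / returned_per_day


# Rough figures taken from the post, not measured values.
ratio = days_of_work_in_flight(2_000_000, 300_000)
print(f"about {ratio:.1f} days of work outstanding")
```

Taken at face value, those two numbers put the backlog a little above the top of the 4-6x range, which fits the "rough numbers" caveat.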

A minimum cache would also permit faster crediting of results, obviously.

Plus most of us crunch on different projects, giving our favorite project the greatest resource share. So this has the effect of keeping us computing most of the time, even if it is for different projects from time to time.

Having a short/zero cache doesn't seem to imply an increase in the loading on the servers. People may fear that the user base would make repeated wu requests and clobber the servers that way. However, a request/download cycle should be considerably faster than a wu compute time, and sequential multiple downloads should be faster than a repeated request/download cycle. So perhaps the servers should determine an optimal number of wu's to download for each request, based on recent loading and the client's recent performance.

ID: 653050 · Report as offensive
Profile Pappa
Volunteer tester
Avatar

Send message
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 653180 - Posted: 3 Oct 2007, 3:24:01 UTC - in response to Message 652644.  

DJStarfox et al

Last Friday, I had the chance to visit the Seti Lab at Berkeley. It was nice to meet with Eric again and to let my wife meet him, as she missed the first meeting, and to see what so many of "us" have donated so much time, effort and contributions for. I did get to see the machine that was Seti Classic, which is now dark (that was close to half a million computer hours on my part).

One thing worth mentioning is that the older servers are actually Sun workstations that were pressed into service to support Seti. Bruno was built through the efforts of many users, and other donated hardware helped upgrade other things to increase the reliability of the Seti servers. Some of those same donations have managed to get some Gigabit into Seti. That is not to say more couldn't be put to good use (getting it configured and working was also the subject of one of Matt's posts in the tech news), only that some is in place and was recently expanded.

That said, Blurf has information in Hardware Donations. The next donation push is still in planning. Eric once posted requesting Volunteers to help with that planning, there is still time to help with ideas to make it better.

I will note that your donation is very much appreciated by everyone that is working on Seti. It is a bit sad that more do not know of the need. Together we can keep Seti Alive!

Thank You

Regards

Pappa


I've read about several recent donations on the message boards, and made a donation myself just this week. Hopefully, you guys can get a 2nd dedicated splitter machine (not bambi as she's a database server) and some gigabit LAN action.

I'll keep my fingers crossed anyway. :)


Please consider a Donation to the Seti Project.

ID: 653180 · Report as offensive
Profile Mentor397
Avatar

Send message
Joined: 16 May 99
Posts: 25
Credit: 6,794,344
RAC: 108
United States
Message 653245 - Posted: 3 Oct 2007, 5:53:23 UTC - in response to Message 653050.  



Having a short/zero cache doesn't seem to imply an increase in the loading on the servers. People may fear that the user base would make repeated wu requests and clobber the servers that way. However, a request/download cycle should be considerably faster than a wu compute time, and sequential multiple downloads should be faster than a repeated request/download cycle. So perhaps the servers should determine an optimal number of wu's to download for each request, based on recent loading and the client's recent performance.



Doesn't it already though? If you have a fast computer, the system will take into account your past performance and typically download you more WU's if your cache is set for five days (as an example) than it would download for a considerably slower system, running the same amount of time with the same configuration.

Uh, I realize that suddenly I'm going to sound very old fashioned, but when I started way back when, I did it because I liked Seti@home. The science aspect is nice, but I'm not so interested in diseases or climates and therefore haven't found another boinc project I'm interested in. Unless I was willing to kill the other projects when SETI went back up, I'd have to accept a permanent decrease in SETI capacity in order to keep the other projects current. While I'm sure it would be great for credit, it doesn't give me the most bang for what-i'm-interested in. If that made sense.


ID: 653245 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19207
Credit: 40,757,560
RAC: 67
United Kingdom
Message 653296 - Posted: 3 Oct 2007, 10:03:20 UTC - in response to Message 653245.  



Having a short/zero cache doesn't seem to imply an increase in the loading on the servers. People may fear that the user base would make repeated wu requests and clobber the servers that way. However, a request/download cycle should be considerably faster than a wu compute time, and sequential multiple downloads should be faster than a repeated request/download cycle. So perhaps the servers should determine an optimal number of wu's to download for each request, based on recent loading and the client's recent performance.



Doesn't it already though? If you have a fast computer, the system will take into account your past performance and typically download you more WU's if your cache is set for five days (as an example) than it would download for a considerably slower system, running the same amount of time with the same configuration.

Uh, I realize that suddenly I'm going to sound very old fashioned, but when I started way back when, I did it because I liked Seti@home. The science aspect is nice, but I'm not so interested in diseases or climates and therefore haven't found another boinc project I'm interested in. Unless I was willing to kill the other projects when SETI went back up, I'd have to accept a permanent decrease in SETI capacity in order to keep the other projects current. While I'm sure it would be great for credit, it doesn't give me the most bang for what-i'm-interested in. If that made sense.


The amount a host downloads is calculated from the benchmark figures and the RDCF (Result Duration Correction Factor). The benchmarks are notoriously flawed; just ask anybody with a computer that dual boots Linux and Windows. The RDCF was introduced into the scheduler to compensate for incorrect benchmarks, but it is initially one (1) when a host is first attached.
The RDCF is lowered exponentially, but slowly, if the actual time to complete is lower than predicted, and is raised immediately if the actual time is longer than predicted.
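
A rough sketch of the asymmetric behaviour described above, not the actual BOINC code: the function name `update_rdcf`, the `decay` constant of 0.1 and the example run times are all assumptions made for illustration. Only the shape (slow exponential decrease, immediate increase, starting value of 1) comes from the description.

```python
def update_rdcf(rdcf, raw_estimate, actual, decay=0.1):
    """Return the new correction factor after one result finishes.

    raw_estimate is the benchmark-based run time before correction;
    the corrected prediction is raw_estimate * rdcf.
    """
    target = actual / raw_estimate  # factor that would have predicted exactly
    if target > rdcf:
        # Result took longer than predicted: jump up immediately.
        return target
    # Result was faster than predicted: move only part of the way down.
    return rdcf + decay * (target - rdcf)


rdcf = 1.0  # value for a newly attached host, per the description above
for actual_hours in (5.0, 5.0, 12.0):  # hypothetical run times
    rdcf = update_rdcf(rdcf, raw_estimate=10.0, actual=actual_hours)
    print(f"actual={actual_hours:>4}h -> RDCF={rdcf:.3f}")
```

With these made-up numbers, two fast results only nudge the factor down a little, while a single slow result pushes it straight up past its starting value.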

You can set a very low resource share for a backup project, e.g. a Seti-to-backup ratio of 100:1, or even 100,000:1. The BOINC manager will ensure the initial backup units are crunched on time, at the expense of making the backup project's Long Term Debt (LTD) a large negative number, so that under normal circumstances BOINC will not download from the backup project unless Seti is down and BOINC decides you need more work.
ID: 653296 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Send message
Joined: 25 Nov 01
Posts: 20605
Credit: 7,508,002
RAC: 20
United Kingdom
Message 653300 - Posted: 3 Oct 2007, 10:13:05 UTC - in response to Message 653245.  
Last modified: 3 Oct 2007, 10:14:45 UTC

... Unless I was willing to kill the other projects when SETI went back up, I'd have to accept a permanent decrease in SETI capacity in order to keep the other projects current. While I'm sure it would be great for credit, it doesn't give me the most bang for what-i'm-interested in. If that made sense.

If you're 's@h-only' and want to use other projects as a 'backup' project, choose two that have short(est) WUs and attach to them with a very small resource share. You can even set as low as 0.1 for example so that they rarely (if ever) get to run.
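
To give a feel for how small such a share is, here is a minimal sketch that assumes a project's long-run fraction of crunching time is simply its resource share divided by the sum of all shares. The share of 100 for SETI@home and the two backup project names are placeholders for illustration; only the 0.1 figure comes from the suggestion above.

```python
def share_fractions(shares):
    """Map each project to its share of the total (share / sum of shares)."""
    total = sum(shares.values())
    return {name: share / total for name, share in shares.items()}


# Hypothetical setup: one main project and two tiny backup projects.
fractions = share_fractions({"SETI@home": 100, "backup A": 0.1, "backup B": 0.1})
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2%}")
```

Under that assumption the backups together are entitled to well under one percent of the time, so they mostly matter when the main project has no work to give.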

There is more about that in the Boinc-Help link below.

Happy crunchin',
Martin


See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 653300 · Report as offensive
Mike46360

Send message
Joined: 1 Jan 07
Posts: 65
Credit: 40,307
RAC: 0
United States
Message 653349 - Posted: 3 Oct 2007, 12:48:31 UTC

I can't attach to the Seti@Home project..are the servers messed up again?
ID: 653349 · Report as offensive
n7rfa
Volunteer tester
Avatar

Send message
Joined: 13 Apr 04
Posts: 370
Credit: 9,058,599
RAC: 0
United States
Message 653358 - Posted: 3 Oct 2007, 13:12:39 UTC - in response to Message 653349.  

I can't attach to the Seti@Home project..are the servers messed up again?

They don't appear to be. I'm still able to upload and download.
ID: 653358 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.