Moribund Monday (Apr 14 2008)


Message boards : Technical News : Moribund Monday (Apr 14 2008)

Author Message
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1388
Credit: 74,079
RAC: 0
United States
Message 739049 - Posted: 14 Apr 2008, 19:03:42 UTC

Continuing problems with the workunit storage server... There were more resets over the weekend, ultimately resulting in one that caused the server to think enough drives have failed to call the entire RAID dead. We are confident we can trick the server into thinking otherwise - we actually have some helpful techs logged in doing that as I type. We still want to replace the whole box, which we'll hopefully do today, and then the drives will have to resync again. Chances are we'll be down until tomorrow (Tuesday).

So while we are down we'll try to catch up on several things. Moving servers around the closet, incorporating the new drive enclosure that arrived today, getting more stuff on the new KVM, etc.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,094,367
RAC: 0
United Kingdom
Message 739055 - Posted: 14 Apr 2008, 19:10:49 UTC

Thanks for the update Matt, we know you do as much as you can.
____________

JPP
Joined: 31 May 99
Posts: 15
Credit: 16,021,533
RAC: 12,505
France
Message 739068 - Posted: 14 Apr 2008, 19:52:52 UTC - in response to Message 739055.

Hi,
Perhaps you *may* also wish to review the "work unit allocation" algorithm.
My PCs are starving! While the servers were still up, I did not have a chance to receive fresh units because my PCs were not asking, and then when I started asking, the servers were down...
So I wish to mention that this is the first time I can recall, since 1999, that my favourite PC has nothing left to work on; a bit weird indeed.
Of course I run the latest software load. Perhaps you should allow clients to request more workunits? I'm a bit confused.
Cheers,
jeanpierr€@jpp
____________

Sagittarius
Joined: 3 Jan 08
Posts: 10
Credit: 90,431
RAC: 0
Canada
Message 739091 - Posted: 14 Apr 2008, 20:49:18 UTC

Hi Matt, just wonderin'. If you get it up and running by tomorrow AM, any chance of forgoing or delaying the dreaded maintenance day until Wednesday so we can all load up on WU's? At least we'd all be working and not sitting idle another whole day ;)

Cheers
____________

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3232
Credit: 31,585,541
RAC: 0
Netherlands
Message 739106 - Posted: 14 Apr 2008, 21:42:50 UTC - in response to Message 739091.
Last modified: 14 Apr 2008, 21:47:33 UTC

Hi Matt, just wonderin'. If you get it up and running by tomorrow AM, any chance of forgoing or delaying the dreaded maintenance day until Wednesday so we can all load up on WU's? At least we'd all be working and not sitting idle another whole day ;)

Cheers


Maybe a good time to check your hosts as well: defragmenting disks, cleaning the registry, removing never-used programs, running a virus/spyware scan, getting e-mail, etc.
Vacuum cleaning your fans & coolers ;)
My lady says, get rid of the cables ?@#$%
____________


Knight Who Says Ni N!, OUT numbered.................

Andy Worth
Volunteer tester
Joined: 23 Oct 02
Posts: 5807
Credit: 10,408,581
RAC: 3
United Kingdom
Message 739124 - Posted: 14 Apr 2008, 22:23:32 UTC

It's a constant battle isn't it?!

Good luck Matt. Just make sure you have it all working by Thursday so I can feed my new quad. Wouldn't like to see it going hungry ;)
____________


Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1388
Credit: 74,079
RAC: 0
United States
Message 739127 - Posted: 14 Apr 2008, 22:27:55 UTC

The Adaptec guys just left - the switchover to the new server looks like a complete success. Plus they coughed up an extra 2GB RAM for the new server while they were here - though that won't show up as a performance boost until the next rev of the OS.

So the RAIDs are all resync'ing again now, but we should be good to go by tomorrow morning.

I'd like to do the BOINC database reorg/backup on Tuesday like we usually do, but I'll try to get here early and get that out of the way while we're still down.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

gomeyer
Volunteer tester
Joined: 21 May 99
Posts: 488
Credit: 50,157,953
RAC: 0
United States
Message 739154 - Posted: 14 Apr 2008, 23:26:17 UTC - in response to Message 739127.

The Adaptec guys just left - the switchover to the new server looks like a complete success. Plus they coughed up an extra 2GB RAM for the new server while they were here - though that won't show up as a performance boost until the next rev of the OS.

So the RAIDs are all resync'ing again now, but we should be good to go by tomorrow morning.

I'd like to do the BOINC database reorg/backup on Tuesday like we usually do, but I'll try to get here early and get that out of the way while we're still down.

- Matt

Good news there. Thank you for the extra effort!

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 12
United States
Message 739170 - Posted: 15 Apr 2008, 0:00:48 UTC


. . . Thanks to Each of You @ Berkeley for All that You are Doing

@ Matt - as usual - Thanks for the Updates - It is Appreciated Sir!




____________
BOINC Wiki . . .

Science Status Page . . .

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 532,447
RAC: 19
United States
Message 739229 - Posted: 15 Apr 2008, 2:14:40 UTC - in response to Message 739049.
Last modified: 15 Apr 2008, 2:15:30 UTC

Continuing problems with the workunit storage server... There were more resets over the weekend, ultimately resulting in one that caused the server to think enough drives have failed to call the entire RAID dead. We are confident we can trick the server into thinking otherwise - we actually have some helpful techs logged in doing that as I type. We still want to replace the whole box, which we'll hopefully do today, and then the drives will have to resync again. Chances are we'll be down until tomorrow (Tuesday).

So while we are down we'll try to catch up on several things. Moving servers around the closet, incorporating the new drive enclosure that arrived today, getting more stuff on the new KVM, etc.

- Matt


Just out of curiosity, is it wise to let clients get more work but without downloading the data files? What happens when the download server comes online and everybody tries to download the missing files (hours or days later)?

Would it be better for the scheduler to respond "no work from project" until the download servers are back up? If not, then why not?

Jesse Viviano
Joined: 27 Feb 00
Posts: 95
Credit: 474,230
RAC: 0
United States
Message 739240 - Posted: 15 Apr 2008, 3:16:21 UTC - in response to Message 739229.

Continuing problems with the workunit storage server... There were more resets over the weekend, ultimately resulting in one that caused the server to think enough drives have failed to call the entire RAID dead. We are confident we can trick the server into thinking otherwise - we actually have some helpful techs logged in doing that as I type. We still want to replace the whole box, which we'll hopefully do today, and then the drives will have to resync again. Chances are we'll be down until tomorrow (Tuesday).

So while we are down we'll try to catch up on several things. Moving servers around the closet, incorporating the new drive enclosure that arrived today, getting more stuff on the new KVM, etc.

- Matt


Just out of curiosity, is it wise to let clients get more work but without downloading the data files? What happens when the download server comes online and everybody tries to download the missing files (hours or days later)?

Would it be better for the scheduler to respond "no work from project" until the download servers are back up? If not, then why not?

While the database cleanup and backup is going on, the download and upload server normally keeps running. This allows clients to download and upload files as needed, but uploaded results cannot be reported until the cleanup and backup completes. Therefore, if clients are assigned work units today, the data files can be downloaded tomorrow while the database is down.

The administrators once shut down the upload/download server during database cleanups and backups, hoping that the absence of upload/download activity would shorten the downtime. However, the post-downtime crunch was awful. Leaving the upload/download server active during the downtime caused only a slight slowdown, and it let the post-downtime crunch finish much more quickly: with uploads already out of the way, most of the packets going through the router afterwards were scheduler requests, their responses, and downloads, which took a sizable load off the then-overloaded router.
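To put rough numbers on the trade-off described above, here is a toy model in Python. It is only a sketch under stated assumptions, not anything from the SETI@home servers: the function name, the host count, and the per-host figures are all hypothetical, and it simply counts how many transfers and requests are still waiting for the router when the database comes back.

```python
def post_downtime_backlog(hosts, results_per_host, keep_transfers_up):
    """Count the requests queued for the router when the database returns.

    Toy model (hypothetical): each host has finished results to upload,
    then contacts the scheduler once, then downloads replacement work.
    """
    # Uploads are still pending only if the upload/download server was
    # shut down together with the database.
    uploads = 0 if keep_transfers_up else hosts * results_per_host
    scheduler_requests = hosts              # one report/request per host
    downloads = hosts * results_per_host    # replacement work units
    return {
        "uploads": uploads,
        "scheduler_requests": scheduler_requests,
        "downloads": downloads,
        "total": uploads + scheduler_requests + downloads,
    }


# Hypothetical figures: 1000 hosts, 4 finished results each.
transfers_up = post_downtime_backlog(1000, 4, keep_transfers_up=True)
transfers_down = post_downtime_backlog(1000, 4, keep_transfers_up=False)
print(transfers_up["total"], transfers_down["total"])
```

In this simplified picture the backlog is noticeably smaller when uploads were allowed to drain during the downtime, which matches the experience described above; real traffic also depends on retries, backoff, and file sizes, none of which are modeled here.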

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38149
Credit: 555,395,595
RAC: 613,218
United States
Message 739259 - Posted: 15 Apr 2008, 4:53:39 UTC

Thanx again for the continued updates Matt. Sorry that you have had so many trials as of late.....hope the replacement download server solves that issue at least.....
Chin up, my man. Your efforts are not unnoticed or unappreciated.

Regards,
Mark.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

cholupa3
Joined: 13 Jan 08
Posts: 1
Credit: 527,261
RAC: 0
Message 739366 - Posted: 15 Apr 2008, 13:31:13 UTC

It seems to have been a while since the last post, and I'm still having difficulty getting WUs. I was hoping that someone could post regarding their own situation, or on the success/failure/delay of the necessary upgrades/repairs. I just want to see if others are having any success, or if it's still a problem on my end. I know you guys are working hard so thank you all for allowing us to participate in SETI.

-Eric AKA Cholupa

Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 738
Credit: 231,168
RAC: 0
United Kingdom
Message 739385 - Posted: 15 Apr 2008, 14:44:55 UTC - in response to Message 739366.
Last modified: 15 Apr 2008, 14:48:17 UTC

It seems to have been a while since the last post, and I'm still having difficulty getting WUs. I was hoping that someone could post regarding their own situation, or on the success/failure/delay of the necessary upgrades/repairs. I just want to see if others are having any success, or if it's still a problem on my end. I know you guys are working hard so thank you all for allowing us to participate in SETI.

-Eric AKA Cholupa


This page will tell you when the WU's start flowing again.

As you can see from the graph there have been no WU's out for more than 24 hours.

When the servers come back online, I expect there will be very heavy traffic for several hours, so if you run out of SETI work you may need a backup project at a small resource share.

I ran out of SETI work last night (have 2 WU's stuck downloading) but my main PC still has work for 6 other projects.
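As a rough illustration of what "a small resource share" means: BOINC treats resource shares as relative weights, so each project's slice of computing time is its share divided by the sum of all shares. The Python sketch below just does that arithmetic; the project names and share values are made up for the example.

```python
def resource_fractions(shares):
    """Turn relative resource shares into fractions of computing time."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one project needs a positive share")
    return {name: share / total for name, share in shares.items()}


# Hypothetical setup: SETI@home at 100 and one backup project at 10.
fractions = resource_fractions({"SETI@home": 100, "backup project": 10})
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

With these example values the backup project gets roughly one eleventh of the time while SETI has work, and can take over when SETI has none.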

[edit]Other BOINC projects[/edit]
____________
Sir Arthur C Clarke 1917-2008

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 739420 - Posted: 15 Apr 2008, 15:39:41 UTC - in response to Message 739366.

It seems to have been a while since the last post, and I'm still having difficulty getting WUs. I was hoping that someone could post regarding their own situation, or on the success/failure/delay of the necessary upgrades/repairs. I just want to see if others are having any success, or if it's still a problem on my end. I know you guys are working hard so thank you all for allowing us to participate in SETI.

-Eric AKA Cholupa

Your post was just after 6:00am in Berkeley. Since Matt said the server was fixed, but would need time to sync, I wouldn't expect it to be up until after they get in this morning and have a chance to check everything out....
____________

Mentor397
Joined: 16 May 99
Posts: 17
Credit: 4,756,815
RAC: 434
United States
Message 739421 - Posted: 15 Apr 2008, 15:41:23 UTC

I finally got around to checking the computer. I just wanted to say that you guys are doing a fantastic job in spite of enormous difficulties.

- Jim

____________

Daniel Michel
Volunteer tester
Joined: 2 Feb 04
Posts: 14895
Credit: 1,325,875
RAC: 31
United States
Message 739446 - Posted: 15 Apr 2008, 15:59:29 UTC

I hope the DB backup goes well today...And that means No Nasty Surprises for you guys...Good luck!
____________


Proud to be TFFE

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13541
Credit: 29,272,294
RAC: 15,347
United States
Message 739504 - Posted: 15 Apr 2008, 21:02:59 UTC

I'm surprised to see that our friend, the "Reverend" hasn't been around to complain about this recent outage. He always insisted that it was his job to let the SETI team know when they aren't doing theirs.

Warden Dios
Joined: 28 May 99
Posts: 1
Credit: 118,586
RAC: 5
United Kingdom
Message 739543 - Posted: 15 Apr 2008, 22:38:06 UTC - in response to Message 739504.

I'm happy to say six work units have downloaded within
the last hour or so, and mine is running fine now. I'm
wondering if there's an option to take more units for
pending processing, since my system gets through
them reasonably quickly.

-W.D.

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13541
Credit: 29,272,294
RAC: 15,347
United States
Message 739551 - Posted: 15 Apr 2008, 22:50:09 UTC - in response to Message 739543.

I'm happy to say six work units have downloaded within
the last hour or so, and mine is running fine now. I'm
wondering if there's an option to take more units for
pending processing, since my system gets through
them reasonably quickly.

-W.D.



You can always increase your cache via your preferences in your account.
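For a back-of-the-envelope idea of how much work a given cache setting holds, the Python sketch below multiplies the cache length in days by the number of cores and divides by the time one work unit takes. It is a simplification with hypothetical numbers and a made-up function name; the BOINC client uses its own runtime estimates, so the real count will differ.

```python
import math


def work_units_buffered(cache_days, hours_per_unit, cores=1):
    """Estimate how many work units fit in a cache of `cache_days` days.

    Simplified: assumes every work unit takes the same time and every
    core crunches around the clock.
    """
    if hours_per_unit <= 0:
        raise ValueError("hours_per_unit must be positive")
    return math.floor(cache_days * 24 * cores / hours_per_unit)


# Hypothetical host: 4 cores, 6 hours per work unit, 3-day cache.
print(work_units_buffered(cache_days=3, hours_per_unit=6, cores=4))
```

A larger cache helps ride out outages like this one, at the cost of results being returned later.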
____________


Copyright © 2014 University of California