Panic Mode On (34) Server Problems

Message boards : Number crunching : Panic Mode On (34) Server Problems

soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1007331 - Posted: 23 Jun 2010, 10:07:41 UTC - in response to Message 1007326.  

Message from server:Project is temporarily shut down for maintenance

When I checked at 5:30 UTC the boards were down, but part of the system was up. Now it's vice versa. And I think it was somewhat greener half an hour ago. Looks like somebody was working really late.


Isn't this just... special.
Janice
ID: 1007331 · Report as offensive
Biffa
Volunteer tester
Joined: 27 Oct 99
Posts: 41
Credit: 22,750,323
RAC: 0
United Kingdom
Message 1007370 - Posted: 23 Jun 2010, 12:47:26 UTC - in response to Message 1007331.  

Isn't this just... special.


No, normal :)

ID: 1007370 · Report as offensive
James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1007416 - Posted: 23 Jun 2010, 15:24:50 UTC

This is strange. I set BOINC to no new tasks. I watched my GPU crunch to 0, and lo and behold, another GPU task shows up and starts running. Do we have invisible caches now? Maybe ET is beaming me work units :)

Old James
ID: 1007416 · Report as offensive
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 1007421 - Posted: 23 Jun 2010, 15:45:52 UTC
Last modified: 23 Jun 2010, 15:46:12 UTC

I hope everyone read the news on the home page today. Upload server RAID failed. So much for RAID meaning redundant.

<rant>
Maybe they should just shut this whole thing down for a month, reconfigure & replace stuff to be more reliable (probably with lower throughput), and re-launch the project in August. The whole point of science is not pace but progress. Servers crashing every 48 hours is not progress, and the whole SETI team works too hard to have their efforts set back repeatedly because of reliability issues.
</rant>

Sorry if the rant is inappropriate; I had to get that out.
ID: 1007421 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1007424 - Posted: 23 Jun 2010, 15:53:03 UTC

Total cash donations since 1 June 2008 have been $ 217770.11 so about half that annually. It amounts to about 1/5 of a reasonable budget, so perhaps we should consider anything more than 20% uptime a testament to the skills and dedication of project staff.
                                                               Joe
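
The arithmetic behind that estimate can be checked with a rough sketch. It assumes the donation total runs from 1 June 2008 to the date of this post (23 June 2010) and takes the "1/5 of a reasonable budget" figure at face value; the variable names are illustrative only.

```python
from datetime import date

# Figures quoted in the post above; the end date is assumed to be the post date.
total_donations = 217770.11
start, end = date(2008, 6, 1), date(2010, 6, 23)

# Convert the elapsed days to years and derive an annual donation rate.
years = (end - start).days / 365.25
annual = total_donations / years

# If donations cover about 1/5 of a reasonable budget, the implied budget is 5x.
implied_budget = 5 * annual

print(f"{years:.2f} years, about ${annual:,.0f} per year")
print(f"implied annual budget: about ${implied_budget:,.0f}")
```

Under those assumptions the annual figure lands a little over $100,000, which matches "about half that annually".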
ID: 1007424 · Report as offensive
Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1007438 - Posted: 23 Jun 2010, 16:39:39 UTC - in response to Message 1007421.  

Upload server RAID failed. So much for RAID meaning redundant.

That does not necessarily mean that data was lost. It could be hardware other than the hard drives that failed, or it could even be a software problem (IIRC they are using software RAID).
ID: 1007438 · Report as offensive
Robert Ribbeck
Joined: 7 Jun 02
Posts: 644
Credit: 5,283,174
RAC: 0
United States
Message 1007440 - Posted: 23 Jun 2010, 16:46:30 UTC - in response to Message 1007424.  

Total cash donations since 1 June 2008 have been $ 217770.11 so about half that annually. It amounts to about 1/5 of a reasonable budget, so perhaps we should consider anything more than 20% uptime a testament to the skills and dedication of project staff.
                                                               Joe


Guess you can't count the continuing grant money they receive
or hardware donations to the contributions
ID: 1007440 · Report as offensive
Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1007443 - Posted: 23 Jun 2010, 16:57:59 UTC

Robert................

In case you don't know this: Josef W. Segur has been with SETI from its inception. He has made or has been responsible for countless bug squashes and enhancements to BOINC and the science apps. He has given countless hours to the project and also from his checkbook. He should by all rights have the "project scientist" title, but he prefers to remain on the sidelines.

Now if the project is not up to your standards I invite you to make a financial donation to the project so that perhaps the project can more closely identify with your standards.
Boinc....Boinc....Boinc....Boinc....
ID: 1007443 · Report as offensive
Berserker
Volunteer tester
Joined: 2 Jun 99
Posts: 105
Credit: 5,440,087
RAC: 0
United Kingdom
Message 1007444 - Posted: 23 Jun 2010, 17:02:03 UTC - in response to Message 1007421.  

I hope everyone read the news on the home page today. Upload server RAID failed. So much for RAID meaning redundant.


Only the disks are redundant, so there's no redundancy in the controller (hardware or software). Additionally, most common RAID levels only handle a single disk failure at a time. I've seen three disks fail nearly simultaneously (out of a four-disk array). That's always fatal. Also, since recovery from a failure is an inherently intensive task, it will sometimes trigger a secondary disk failure.

Hopefully, what the SETI staff are dealing with just requires a resync. It is often possible to do a resync with the server online, but it causes a significant performance loss - which is bad news if the servers are as heavily loaded as the SETI servers are. Better to do the recovery offline then.
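
To illustrate the single-failure limit, here is a minimal sketch of XOR parity, the idea behind single-parity RAID levels. It is not the SETI setup (the thread does not say which RAID level is in use), it keeps parity on one dedicated disk rather than rotating it as RAID 5 does, and the function names and sample data are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(stripe, parity):
    """Return the stripe with a missing block (marked None) reconstructed."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        # One parity block carries enough information for one lost block only.
        raise ValueError("single parity cannot recover more than one lost block")
    restored = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        restored[missing[0]] = xor_blocks(survivors + [parity])
    return restored

data = [b"SETI", b"HOME", b"WORK"]  # three hypothetical data disks
parity = xor_blocks(data)           # a fourth disk holds the parity

# Losing one disk is recoverable from the survivors plus parity.
print(rebuild([data[0], None, data[2]], parity) == data)

# Losing two disks at once is not.
try:
    rebuild([None, None, data[2]], parity)
except ValueError as err:
    print("unrecoverable:", err)
```

The rebuild has to read every surviving disk in full, which is why a resync is so heavy on an already loaded array.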
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.
ID: 1007444 · Report as offensive
Robert Ribbeck
Joined: 7 Jun 02
Posts: 644
Credit: 5,283,174
RAC: 0
United States
Message 1007445 - Posted: 23 Jun 2010, 17:07:31 UTC - in response to Message 1007443.  

Robert................

In case you don't know this: Josef W. Segur has been with SETI from its inception. He has made or has been responsible for countless bug squashes and enhancements to BOINC and the science apps. He has given countless hours to the project and also from his checkbook. He should by all rights have the "project scientist" title, but he prefers to remain on the sidelines.

Now if the project is not up to your standards I invite you to make a financial donation to the project so that perhaps the project can more closely identify with your standards.


ya ya blah blah blah
Good for him
That does not make him ALWAYS correct

In this case he made an incomplete analysis

And you took it on yourself to SLAM me and imply that I HAVE some problem
for correcting him

SHAME ON YOU
ID: 1007445 · Report as offensive
Wandering Willie
Volunteer tester
Joined: 19 Aug 99
Posts: 136
Credit: 2,127,073
RAC: 0
United Kingdom
Message 1007450 - Posted: 23 Jun 2010, 17:10:56 UTC

Away guys, you have already had one thread locked today; don't continue on this one.
ID: 1007450 · Report as offensive
Fayvitt
Volunteer tester
Joined: 29 Nov 09
Posts: 217
Credit: 1,190,636
RAC: 0
Australia
Message 1007453 - Posted: 23 Jun 2010, 17:14:49 UTC

Is there any chance we could get our resident "forum experts" into Berkeley to donate their time and expertise for the SETI@home cause?

There seems to be a wealth of talent presenting itself. Wondering if there's some way we can tap into it, and harness this plethora of knowledge.

ID: 1007453 · Report as offensive
Fayvitt
Volunteer tester
Joined: 29 Nov 09
Posts: 217
Credit: 1,190,636
RAC: 0
Australia
Message 1007454 - Posted: 23 Jun 2010, 17:16:56 UTC

Sorry WW, saw your post after I posted the previous.

Everyone have a pleasant evening!
ID: 1007454 · Report as offensive
Allie in Vancouver
Volunteer tester
Joined: 16 Mar 07
Posts: 3949
Credit: 1,604,668
RAC: 0
Canada
Message 1007459 - Posted: 23 Jun 2010, 17:31:20 UTC - in response to Message 1007450.  

Away guys, you have already had one thread locked today; don't continue on this one.



lol

Pure mathematics is, in its way, the poetry of logical ideas.

Albert Einstein
ID: 1007459 · Report as offensive
soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1007460 - Posted: 23 Jun 2010, 17:33:10 UTC

Had about 2 pages of units "to do" on one machine. I was going to suspend additional ones when I got a window complaining that my login was unknown on the host(?) ... I had not clicked the button, by the way.

It was a persistent error, so I rebooted. It came back and the SETI units had vanished.

Buffer has been set at 4 days, now running on 1 machine, approx 36 hours of work left.




Janice
ID: 1007460 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1007464 - Posted: 23 Jun 2010, 17:48:21 UTC - in response to Message 1007440.  

Total cash donations since 1 June 2008 have been $ 217770.11 so about half that annually. It amounts to about 1/5 of a reasonable budget, so perhaps we should consider anything more than 20% uptime a testament to the skills and dedication of project staff.
                                                               Joe


Guess you can't count the continuing grant money they receive
or hardware donations to the contributions

That's exactly why I said "cash donations". Certainly the grant Eric managed to get after that budget changes the picture somewhat, though I don't know how much. I doubt there's any significant amount from the old Astropulse grant. The hardware donations by Overland Storage and Intel, and continuing support from Sun are certainly important too.

Will 1/10 of the 1 GBit line being run to SSL cover S@H needs? Perhaps so; the ALFA receiver system at Arecibo is being used less than they had hoped. But if enough had been donated to fund a S@H-specific 1 GBit line being run up the hill as well as the 1 GBit for the rest of the Space Sciences Lab, the capacity would be there if they can get enough funding to record more than 1/100 of the ALFA bandwidth.

If they had been able to get another staff member, would that have improved the uptime? Maybe, and maybe it would have meant more posts initiated by staff members.
                                                              Joe
ID: 1007464 · Report as offensive
Berserker
Volunteer tester
Joined: 2 Jun 99
Posts: 105
Credit: 5,440,087
RAC: 0
United Kingdom
Message 1007466 - Posted: 23 Jun 2010, 17:54:12 UTC - in response to Message 1007460.  


Careful! Some here might enjoy that sort of thing! :D

Had about 2 pages of units "to do" on one machine. I was going to suspend additional ones when I got a window complaining that my login was unknown on the host(?) ... I had not clicked the button, by the way.

It was a persistent error, so I rebooted. It came back and the SETI units had vanished.

Open BOINC Manager and check that it says 'Connected to localhost' at the bottom right of the window. If it doesn't, the problem is probably local. If it does, something got hosed - possibly client_state.xml.
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.
ID: 1007466 · Report as offensive
Scarecrow
Joined: 15 Jul 00
Posts: 4520
Credit: 486,601
RAC: 0
United States
Message 1007467 - Posted: 23 Jun 2010, 17:58:24 UTC - in response to Message 1007466.  


Careful! Some here might enjoy that sort of thing! :D

Me! Me! Me! Me!


ID: 1007467 · Report as offensive
Dave
Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 1007468 - Posted: 23 Jun 2010, 18:00:03 UTC

Forums are always down @ weekly outage.

+ think of all the server CPU time wasted dealing with all these horrible posts...
ID: 1007468 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1007470 - Posted: 23 Jun 2010, 18:00:32 UTC - in response to Message 1007464.  
Last modified: 23 Jun 2010, 18:01:04 UTC


If they had been able to get another staff member, would that have improved the uptime? Maybe, and maybe it would have meant more posts initiated by staff members.
                                                              Joe

IMO it's much more important than any bandwidth or any other hardware upgrades. It's the first priority indeed...
ID: 1007470 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.