Crashy (Feb 21 2013)

Message boards : Technical News : Crashy (Feb 21 2013)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1340001 - Posted: 21 Feb 2013, 20:34:01 UTC

I already posted this on the front page, but FYI there's going to be another lab-wide power outage all weekend, during which all our servers will be unreachable. Hopefully this is the last of this sort of thing, and/or we relocate to the colocation facility before it happens again.

Meanwhile, we've hit a few bumps in the road. I don't think anything dire is happening outside of normal, expected drive failures and kernel hangs. But it's been causing cascading failures on the public-facing servers, thanks to the web of dependencies each machine has on the others. It may seem bad, but everything is more or less okay. I think. I continue to aggressively upgrade and prepare for the impending probable move to the colocation facility, so maybe I'm exercising some lingering, forgotten hardware and configuration issues.

That's all I have to report for now, tech-wise. Behind-the-scenes development has been largely focused on getting a new polyphase filter bank splitter into production. The current splitter has standard, known FFT artifacts that cause dips in sensitivity at the edges of workunits and rolloffs at the edges of the whole 2.5MHz band. The new splitter will create workunits that exhibit more even sensitivity across the whole spectrum, as well as more sensitivity in general for finding signals in the noise. We are also turning corners on (finally) getting the NTPCkr back into regular production.
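
For readers who have not met the technique, here is a minimal sketch of why a polyphase filter bank leaks less than a plain FFT. It is not the SETI@home splitter code: the channel count, the tap count, the Hann-windowed sinc prototype filter, and every function name below are assumptions chosen for illustration. It measures how much of a tone's power spills away from its own bin when the tone sits halfway between two bin centers.

```python
import cmath
import math


def dft(x):
    """Plain O(n^2) discrete Fourier transform; fine for a small demo."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]


def prototype_filter(nchan, ntaps):
    """Hann-windowed sinc low-pass, one channel wide, nchan * ntaps samples long."""
    length = nchan * ntaps
    coeffs = []
    for i in range(length):
        t = (i - (length - 1) / 2) / nchan  # time in units of one FFT block
        sinc = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1))
        coeffs.append(sinc * hann)
    return coeffs


def pfb_spectrum(samples, nchan, ntaps):
    """One filter bank frame: weight ntaps blocks, sum them, then transform."""
    h = prototype_filter(nchan, ntaps)
    summed = [sum(samples[t * nchan + k] * h[t * nchan + k] for t in range(ntaps))
              for k in range(nchan)]
    return dft(summed)


def leakage(spectrum, peak_bin):
    """Fraction of total power outside peak_bin and its two neighbours."""
    power = [abs(v) ** 2 for v in spectrum]
    n = len(power)
    near = sum(power[(peak_bin + d) % n] for d in (-1, 0, 1))
    return 1.0 - near / sum(power)


# Hypothetical test signal: a complex tone halfway between bins 5 and 6.
nchan, ntaps = 32, 4
tone = [cmath.exp(2j * math.pi * 5.5 * n / nchan) for n in range(nchan * ntaps)]

fft_leak = leakage(dft(tone[:nchan]), 5)
pfb_leak = leakage(pfb_spectrum(tone, nchan, ntaps), 5)
print(f"plain FFT leakage: {fft_leak:.3f}, filter bank leakage: {pfb_leak:.6f}")
```

With these illustrative settings the plain FFT spreads a noticeable fraction of the tone's power into distant bins, while the filter bank keeps nearly all of it in the bins next to the tone. The cost is that each output spectrum needs ntaps blocks of input instead of one.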

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1340001
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 21019
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1340005 - Posted: 21 Feb 2013, 20:37:19 UTC
Last modified: 21 Feb 2013, 20:39:43 UTC

What is it with all the power loss?...


Can you conscript a few students to pedal like fury to drive a few dynamos to keep the closet powered?...

;-)


And very good news for the improved data splitting for analysis.

Hang in there!

Keep crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1340005
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1340008 - Posted: 21 Feb 2013, 20:48:17 UTC - in response to Message 1340001.  

Thanks for the update Matt,

Claggy
ID: 1340008
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1340029 - Posted: 21 Feb 2013, 21:35:25 UTC
Last modified: 21 Feb 2013, 21:35:36 UTC

...By the way I'm fully aware we are about to temporarily run out of workunits to send. I'm just waiting for a RAID resync to finish (in about 90 minutes) before firing up the splitters again.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1340029
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1340044 - Posted: 21 Feb 2013, 22:36:31 UTC

Matt, thanks for the news!


* Best regards! :-) * Philip J. Fry formerly Sutaru Tsureku, team seti.international founder. * Optimize your PC for higher RAC. * SETI@home needs your help. *
ID: 1340044
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1340206 - Posted: 22 Feb 2013, 14:11:16 UTC - in response to Message 1340029.  

...By the way I'm fully aware we are about to temporarily run out of workunits to send. I'm just waiting for a RAID resync to finish (in about 90 minutes) before firing up the splitters again.

- Matt

Thanks, especially, for explaining that.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1340206
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1340208 - Posted: 22 Feb 2013, 14:17:02 UTC

Thank you, Matt.
As usual, your insights into what goes on behind the scenes help many of us understand what goes on besides just server crashes...LOL.
You are a very much appreciated part of the Seti lab team. (Not to slight the others, of course.)
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1340208
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1341610 - Posted: 28 Feb 2013, 12:22:24 UTC - in response to Message 1340001.  
Last modified: 28 Feb 2013, 12:28:33 UTC

Behind the scenes development has been largely focused on getting a new polyphase filter bank splitter into production.

Matt,

It looks as if in compiling the new splitters with

  <splitter_cfg>
    ...
    <pfb_ntaps>0</pfb_ntaps>
    <pfb_width_factor>0</pfb_width_factor>
  </splitter_cfg>

you've picked up the faulty library code which puts a 64-bit representation of the receiver ID into the WU name. Eric confirmed that this was harmless when we first questioned the long names at Beta, but the huge long names do look a bit ugly in BOINC Manager and task lists.

Edit - sorry, not a library problem, but a g++ type conversion problem between enum() and integer - I didn't read on to message 44106.
ID: 1341610
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1341685 - Posted: 28 Feb 2013, 17:11:02 UTC

And another left-over, perhaps from the crash or power-down.

People are reporting that the stats dump for external stats sites, http://setiathome.berkeley.edu/stats/, hasn't been updated since 25 February. Could you kick the daemon, please?
ID: 1341685
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1341702 - Posted: 28 Feb 2013, 18:11:47 UTC - in response to Message 1341685.  

And another left-over, perhaps from the crash or power-down.

People are reporting that the stats dump for external stats sites, http://setiathome.berkeley.edu/stats/, hasn't been updated since 25 February. Could you kick the daemon, please?


Ah! Thanks. I just fixed that (the scripts were trying to run an older binary that doesn't work after the OS upgrade). Should be back to normal again within the next 24 hours...

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1341702
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1341703 - Posted: 28 Feb 2013, 18:13:33 UTC - in response to Message 1341702.  

And another left-over, perhaps from the crash or power-down.

People are reporting that the stats dump for external stats sites, http://setiathome.berkeley.edu/stats/, hasn't been updated since 25 February. Could you kick the daemon, please?


Ah! Thanks. I just fixed that (the scripts were trying to run an older binary that doesn't work after the OS upgrade). Should be back to normal again within the next 24 hours...

- Matt

Thanks for the fix, Matt.
If you happen to see Eric, you can tell him to ignore the email I sent him on the subject a while ago.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1341703
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34353
Credit: 79,922,639
RAC: 80
Germany
Message 1341777 - Posted: 28 Feb 2013, 22:19:39 UTC

Thanks Matt.



With each crime and every kindness we birth our future.
ID: 1341777
Thomas
Volunteer tester

Joined: 9 Dec 11
Posts: 1499
Credit: 1,345,576
RAC: 0
France
Message 1343198 - Posted: 5 Mar 2013, 7:11:22 UTC

Thanks for the heads-up, Matt!
Good news about this new splitter! :)
ID: 1343198
Profile soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1343948 - Posted: 8 Mar 2013, 3:42:49 UTC
Last modified: 8 Mar 2013, 3:43:06 UTC

The suggested registry entry seems to be improving throughput in spite of the eternal log jam of the bandwidth. I would be very curious whether the amount of work accomplished increases as people make this edit, or whether any negative side effects are noted on the server end.
Janice
ID: 1343948
Profile David Mueller
Volunteer tester

Joined: 8 Feb 01
Posts: 5
Credit: 21,378,404
RAC: 16
Canada
Message 1344314 - Posted: 9 Mar 2013, 2:49:55 UTC

Not sure if you can help. I have been trying to download AP files since the scheduled shutdown, and I have tried different connections to the internet.
Normally I can download AP files at a good speed, greater than 25, but they just continue to download slowly, at less than 10.
I even rebooted my system and the SETI app.
Thanks
Dave
ID: 1344314
Profile Gatekeeper
Joined: 14 Jul 04
Posts: 887
Credit: 176,479,616
RAC: 0
United States
Message 1344325 - Posted: 9 Mar 2013, 3:16:47 UTC - in response to Message 1344314.  

You need to read this thread.
ID: 1344325
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1345372 - Posted: 11 Mar 2013, 16:30:31 UTC

@ Matt,

I'm occasionally seeing

Data Distribution State                      SETI@home #   Astropulse #   As of*
Results ready to send                        315,972       0              -1m
Current result creation rate                 41.2241/sec   0.0441/sec     5m
Results out in the field                     3,584,585     150,826        -1m
Results received in last hour                98,472        1,194          0m
Result turnaround time (last hour average)   29.95 hours   149.99 hours   0m
Results returned and awaiting validation     2,819,506     176,961        -1m
Workunits waiting for validation             52            3              -1m
Workunits waiting for assimilation           84            49             -1m
Workunit files waiting for deletion          32            0              -1m
Result files waiting for deletion            208           0              -1m
Workunits waiting for db purging             880,516       23,465         -1m
Results waiting for db purging               1,850,801     63,477         -1m

on the server status page.

Could those "As of minus one minute" lines indicate that one or more servers have lost their NTP syncs again?
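
To illustrate how a negative "As of" value could arise, here is a small sketch. It assumes the status page computes each age by subtracting a timestamp written by another host from its own clock; that is a guess about the mechanism, not the actual status-page code, and the function name and the 50-second skew are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def as_of_minutes(measured_at, page_built_at):
    """Age of a statistic in whole minutes, as a status page might show it."""
    return round((page_built_at - measured_at).total_seconds() / 60)


# Hypothetical clock reading on the host that builds the status page.
page_clock = datetime(2013, 3, 11, 16, 30, 0, tzinfo=timezone.utc)

# A statistic stamped five minutes ago by a host whose clock agrees.
in_sync_stamp = page_clock - timedelta(minutes=5)

# A statistic stamped "now" by a host whose clock runs 50 seconds fast.
fast_host_stamp = page_clock + timedelta(seconds=50)

print(as_of_minutes(in_sync_stamp, page_clock))
print(as_of_minutes(fast_host_stamp, page_clock))
```

Under that assumption, a host running less than a minute fast is enough to turn a fresh statistic into a negative age, which would fit the pattern of most rows showing the same small negative value.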
ID: 1345372
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22456
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1345400 - Posted: 11 Mar 2013, 17:12:43 UTC

Daylight saving strikes again...
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1345400
Darth Beaver (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1345639 - Posted: 12 Mar 2013, 0:48:21 UTC - in response to Message 1345400.  

Hello, sorry about this guys, but I'm having trouble with Einstein@Home: I can't upload units, and I can't get to the web pages or message boards. Yes, I know this is not Einstein, but does anybody out there know what has happened to Einstein@Home and how long the problem will take to fix? Thanks, I know I'm a pain in the ass.
ID: 1345639
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1345654 - Posted: 12 Mar 2013, 1:40:23 UTC - in response to Message 1345639.  
Last modified: 12 Mar 2013, 1:40:53 UTC

We are all in the same boat: the E@H site is down. I have no idea why, but today I saw something about trouble with the replica server. Let's see tomorrow.
ID: 1345654
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.