One Last Note... (Dec 20 2012)


Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1317819 - Posted: 20 Dec 2012, 21:11:10 UTC

One more quick update before the apocalypse. Or holiday week off. Or whatever.

We still seem to be having minor headaches due to fallout from the power failures of a couple weeks ago. The various back end queues aren't draining as fast as we'd like. We mostly see that in the assimilator queue size. We recently realized that the backlog is such that one of the four assimilators is dealing with over 99% of the backlog - so effectively we're only 25% as efficient in dealing with this particular queue. We're letting this clear itself out "naturally" as opposed to adding more complexity to solve a temporary problem.
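
To put a rough number on that imbalance, here is a minimal Python sketch. It assumes (this is not stated in the post) that each assimilator drains only its own share of the queue and that all four run at the same rate; the function name and the share figures are illustrative, not taken from the project's code.

```python
def effective_speedup(shares):
    """Speedup over a single worker when each worker drains only its own share.

    shares: fraction of the backlog assigned to each worker.
    With equal per-worker rates, total drain time is set by the largest share.
    """
    total = sum(shares)
    largest = max(shares)
    return total / largest


# Four assimilators with an even split versus one holding ~99% of the backlog.
balanced = effective_speedup([0.25, 0.25, 0.25, 0.25])
skewed = effective_speedup([0.99, 0.004, 0.003, 0.003])

print(f"balanced: {balanced:.2f}x one worker")
print(f"skewed:   {skewed:.2f}x one worker")
print(f"skewed relative to balanced: {skewed / balanced:.0%}")
```

Under those assumptions the skewed case behaves almost like a single assimilator, which is where the "only 25% as efficient" figure comes from.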

I did cause a couple more headaches this morning moving archives from one full partition on one server to a less full partition on another. This caused all the queues to expand, and all network traffic to slow down. This is a bit of a clue as to our general woes. Maybe there's some faulty internal network wiring or switching or configuration...?

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

Okay. See you on the other side...

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12695
Credit: 7,172,515
RAC: 14,963
United States
Message 1317825 - Posted: 20 Dec 2012, 21:29:09 UTC

Have a happy and safe end of world.

____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4139
Credit: 33,413,152
RAC: 18,982
United Kingdom
Message 1317850 - Posted: 20 Dec 2012, 22:13:11 UTC - in response to Message 1317819.

Thanks for the update Matt, have a merry christmas and a happy new year,

Claggy

ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 621
Credit: 142,737,814
RAC: 145,044
United Kingdom
Message 1317880 - Posted: 20 Dec 2012, 22:57:27 UTC - in response to Message 1317819.
Last modified: 20 Dec 2012, 23:03:45 UTC

Thanks for the update, Matt. A Merry Christmas and a Happy New Year to you, all the other personnel at the Lab, and to all my fellow crunchers. It's been a good year for science and I was lucky enough to play a small part in it. Now we've found a potentially-habitable planet just 12 light-years away -- can anyone invent instantaneous teleportation over that distance? (Not without proving Einstein wrong... :-( )
____________

Zapped "Sixth Sense" Sparky
Volunteer moderator
Volunteer tester
Joined: 30 Aug 08
Posts: 8328
Credit: 1,293,820
RAC: 1,095
United Kingdom
Message 1317886 - Posted: 20 Dec 2012, 23:07:18 UTC
Last modified: 20 Dec 2012, 23:07:43 UTC

Thanks for the news Matt, to you and everyone in the lab, have a very Merry Christmas and a Happy New Year!
____________
In an alternate universe, it was a ZX81 that asked for clothes, boots and motorcycle.

Client error 418: I'm a teapot

Tropical Goldfish Fish 15: Squeaky bras 'R us

Illusions of normality sufferer

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2286
Credit: 8,791,861
RAC: 3,808
United States
Message 1318004 - Posted: 21 Dec 2012, 6:17:34 UTC

If moving data from one machine to another via the network is causing a global issue like that, you are right to suspect equipment or wiring. However, it could just be some limitation in the drivers for the NICs themselves.

Do you have jumbo frame support? Maybe some Rx/Tx buffer sizes need to be adjusted, or checksum offloading needs to be enabled.

Jumbo frames on gigabit are definitely nice. I have two machines on my network that, on gigabit with the default MTU of 1500, can only manage about 270 Mbit/s, with the slower machine's CPU maxed out. I switched over to 9K for the MTU and now get 890 Mbit/s at about 75% CPU load. This is moving data across NFS, between Windows and Linux, I might add.
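
A back-of-the-envelope Python sketch of why a larger MTU can ease the CPU load: fewer frames have to be handled for the same amount of data. This is an approximation that assumes every frame is full-size and treats the quoted rates as payload rates; the 38-byte figure is the standard per-frame Ethernet overhead on the wire (header, FCS, preamble and inter-frame gap), and the function names are made up for illustration.

```python
ETHERNET_OVERHEAD_BYTES = 38  # header 14 + FCS 4 + preamble/SFD 8 + inter-frame gap 12


def frames_per_second(throughput_mbit, mtu):
    """Approximate number of full-size frames per second for a payload rate."""
    payload_bits_per_frame = mtu * 8
    return throughput_mbit * 1_000_000 / payload_bits_per_frame


def wire_efficiency(mtu):
    """Fraction of on-the-wire bytes that carry the MTU-sized payload."""
    return mtu / (mtu + ETHERNET_OVERHEAD_BYTES)


# The two measurements quoted in the post above.
for mtu, mbit in [(1500, 270), (9000, 890)]:
    print(
        f"MTU {mtu}: {mbit} Mbit/s is about {frames_per_second(mbit, mtu):,.0f} frames/s, "
        f"wire efficiency {wire_efficiency(mtu):.1%}"
    )
```

The per-frame work (interrupts, checksums, buffer handling) is what tends to load the CPU, so moving more data in fewer frames is consistent with the drop in CPU load described above, though drivers and offload settings matter too.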

TL;DR: it may just be some parameter tweaking in the NIC's drivers.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Ageless
Joined: 9 Jun 99
Posts: 12323
Credit: 2,625,391
RAC: 938
Netherlands
Message 1318074 - Posted: 21 Dec 2012, 10:46:02 UTC
Last modified: 21 Dec 2012, 10:46:09 UTC

Coming up to noon here in The Netherlands, we're still not dying or otherwise feeling very doom-dayey. ;-)
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

{BDC} Thomas Dupont
Project donor
Volunteer tester
Joined: 9 Dec 11
Posts: 3876
Credit: 1,325,438
RAC: 271
France
Message 1318141 - Posted: 21 Dec 2012, 13:19:41 UTC - in response to Message 1317819.

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

Good news, Matt ! :)
Good luck fixing the rest.
THX for the update.
Merry Christmas and Happy New Year to you and all your loved ones !
____________
Founder of team BRIGADE DU COSMOS
Ranked 55th !

Tcarey
Joined: 20 Aug 99
Posts: 26
Credit: 33,209,061
RAC: 30,774
United States
Message 1318646 - Posted: 22 Dec 2012, 5:11:58 UTC

Thanks for the update. I would like to say that since the machines got over the database crash the communications rates and unit feeding frequency have been excellent on my systems.

Looks like I'm not getting as many units total in my local buffers which is fine since replacements are coming quickly after uploading.

Is limiting the number of units allowed to each machine part of the fix for the database overload that was going on?

Merry Christmas, Happy New year to all.

Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 32024
Credit: 13,698,717
RAC: 29,063
United Kingdom
Message 1318766 - Posted: 22 Dec 2012, 12:32:40 UTC

As usual Matt, very many thanks for the update, it is appreciated. But I did catch your UPS comment.

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

I truly think that ALL the Seti servers should be on a similar UPS system. A New Year fundraiser for the GPUUG seems to be beckoning .....

In the meantime, may I wish you and the other guys in the lab, a very happy Christmas and a peaceful New Year. You've earned it!

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4296
Credit: 1,065,441
RAC: 951
United States
Message 1318885 - Posted: 22 Dec 2012, 17:39:59 UTC - in response to Message 1318646.

...
Looks like I'm not getting as many units total in my local buffers which is fine since replacements are coming quickly after uploading.

Is limiting the number of units allowed to each machine part of the fix for the database overload that was going on?
...

Yes, each host is allowed to have up to 100 CPU tasks and 100 GPU tasks in progress. That has reduced the number of tasks the database has to keep track of by about 6 million.
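
As an illustration of how such a per-host cap behaves, here is a minimal Python sketch. It is not the BOINC scheduler's actual code: the `HostLimiter` class and its method names are hypothetical, and only the two limits of 100 come from the post above.

```python
from collections import Counter

# Limits quoted in the post: 100 CPU tasks and 100 GPU tasks in progress per host.
MAX_IN_PROGRESS = {"cpu": 100, "gpu": 100}


class HostLimiter:
    """Tracks tasks in progress per (host, resource) and caps new grants."""

    def __init__(self, limits=MAX_IN_PROGRESS):
        self.limits = dict(limits)
        self.in_progress = Counter()  # (host_id, resource) -> tasks in progress

    def grant(self, host_id, resource, requested):
        """Return how many of the requested tasks the host may receive now."""
        key = (host_id, resource)
        room = max(0, self.limits[resource] - self.in_progress[key])
        granted = min(requested, room)
        self.in_progress[key] += granted
        return granted

    def report(self, host_id, resource, completed):
        """Record returned tasks, freeing room for replacements."""
        key = (host_id, resource)
        self.in_progress[key] = max(0, self.in_progress[key] - completed)


limiter = HostLimiter()
print(limiter.grant("host-1", "cpu", 150))  # capped at the CPU limit
limiter.report("host-1", "cpu", 30)         # 30 results uploaded
print(limiter.grant("host-1", "cpu", 50))   # only the freed slots are refilled
```

A cap like this trades larger local buffers for fewer rows the database has to track, and it relies on replacements arriving quickly after uploads, as described earlier in the thread.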
Joe

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11919
Credit: 14,593,683
RAC: 12,074
United States
Message 1319380 - Posted: 23 Dec 2012, 19:03:47 UTC - in response to Message 1318766.

As usual Matt, very many thanks for the update, it is appreciated. But I did catch your UPS comment.

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

I truly think that ALL the Seti servers should be on a similar UPS system. A New Year fundraiser for the GPUUG seems to be beckoning .....

In the meantime, may I wish you and the other guys in the lab, a very happy Christmas and a peaceful New Year. You've earned it!

Everything is on a UPS. However, as has been explained, it's not that easy. Different processes, running on different machines, have to be stopped in a specific order to avoid all the corruption that occurred last time. That requires either someone to be there to do it, or (if it's even possible) a very complex script overseeing all the shutdowns.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12695
Credit: 7,172,515
RAC: 14,963
United States
Message 1319414 - Posted: 23 Dec 2012, 20:18:25 UTC - in response to Message 1319380.

As usual Matt, very many thanks for the update, it is appreciated. But I did catch your UPS comment.

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

I truly think that ALL the Seti servers should be on a similar UPS system. A New Year fundraiser for the GPUUG seems to be beckoning .....

In the meantime, may I wish you and the other guys in the lab, a very happy Christmas and a peaceful New Year. You've earned it!

Everything is on a UPS. However, as has been explained, it's not that easy. Different processes, running on different machines, have to be stopped in a specific order to avoid all the corruption that occurred last time. That requires either someone to be there to do it, or (if it's even possible) a very complex script overseeing all the shutdowns.

Tad more than that. As was explained it used to all shut down when the UPS(s) said they were on battery. The issue was the mains at the lab are a bit flaky. So it was shutting down all the time on momentary brownout conditions. To restart after a shutdown someone has to actually be there.

As to a script, I think that is something that needs investigation. As many of the machines pull double duty perhaps they can find a charge number that isn't on the Seti@home budget to write the script. If the script waited to begin the shutdown until say one minute of mains failure, then you can be rather sure something is really up. Hopefully that isn't so long that a UPS would run dry before an orderly shutdown is complete. But you test!
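
A minimal Python sketch of that idea: ignore short brownouts, start the shutdown only after mains has been out for a continuous hold period, then stop things in a fixed order. Everything here is an assumption for illustration; the sample format, the function names and the step names are hypothetical, and a real script would read the UPS state from its monitoring software rather than from a list.

```python
def shutdown_start_time(samples, hold_seconds=60):
    """Return the timestamp at which shutdown should begin, or None.

    samples: (timestamp_seconds, on_battery) pairs in time order.
    Shutdown begins once the UPS has been on battery continuously for
    at least hold_seconds; any return of mains power resets the timer.
    """
    outage_start = None
    for timestamp, on_battery in samples:
        if not on_battery:
            outage_start = None       # mains is back: forget this outage
        elif outage_start is None:
            outage_start = timestamp  # first sample of a new outage
        elif timestamp - outage_start >= hold_seconds:
            return timestamp
    return None


def run_in_order(steps):
    """Run (name, action) pairs in order and stop at the first failure."""
    completed = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed


# A 20-second brownout followed by a real outage.
readings = [(0, False), (10, True), (20, True), (30, False), (40, True), (100, True)]
print(shutdown_start_time(readings))

# Hypothetical ordering; the real dependencies would have to come from the lab.
steps = [
    ("stop back-end daemons", lambda: True),
    ("stop mysql", lambda: True),
    ("power off hosts", lambda: True),
]
print(run_in_order(steps))
```

The hold period has to be weighed against UPS runtime: the hold time plus the full ordered shutdown must fit within what the batteries can deliver, which is the part that needs testing.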

____________

Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 32024
Credit: 13,698,717
RAC: 29,063
United Kingdom
Message 1320453 - Posted: 27 Dec 2012, 16:29:44 UTC

The issue was the mains at the lab are a bit flaky. So it was shutting down all the time on momentary brownout conditions. To restart after a shutdown someone has to actually be there.

Ah, that is a different ball game. I would have been totally amazed if the kit wasn't on UPS; it just wouldn't have been logical. But I thought UPSs could detect brownouts and, knowing that they were transitory, not shut down. Anyway, isn't that the function of the UPS controlling program, e.g. PowerChute, which can be configured to react to various scenarios?

If I told UCB what I really thought of their power supplies, they would not like it one little bit, it's almost a public scandal, and it is high time something was done about it. Although the politics will probably preclude making too much fuss. We will plod on despite UCB, not because of them.



Neil L. Carter
Project donor
Volunteer tester
Joined: 6 Dec 99
Posts: 53
Credit: 4,332,375
RAC: 5,656
United States
Message 1321591 - Posted: 29 Dec 2012, 19:54:12 UTC - in response to Message 1317819.

Greetings:

A couple of requests for your website. Both to improve our understanding of what your systems have to deal with on a continuing basis.

1. You have a 'Server Status' page with a lot of very good information. I suggest you change it to a 'Systems Status' page and include some networking throughput details as well as the server status and splitter status sections. You already have 'Results received in last hour', but it appears to me your network issues would be better spelled out in Kb/s in and out, or something like that, maybe separated into different types of data.....

2. Again, in relation to the 'Server Status' page, you have some very precise definitions in your 'Glossary' section. Could someone put together a data/systems flowchart so we can better understand how the data flows through your systems?

Just some thoughts to assist us not as technically aware of the processes involved...

Thanks!

Neil
____________

ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 621
Credit: 142,737,814
RAC: 145,044
United Kingdom
Message 1321623 - Posted: 29 Dec 2012, 20:57:54 UTC - in response to Message 1321591.

1. You have a 'Server Status' page with a lot of very good information. I suggest you change it to a 'Systems Status' page and include some networking throughput details as well as the server status and splitter status sections. You already have 'Results received in last hour', but it appears to me your network issues would be better spelled out in Kb/s in and out, or something like that, maybe separated into different types of data.....

Something like this, perhaps. Green is data out from the Lab, blue is incoming. We commonly call this the "cricket graph" for reasons that may be obvious...
____________

Neil L. Carter
Project donor
Volunteer tester
Joined: 6 Dec 99
Posts: 53
Credit: 4,332,375
RAC: 5,656
United States
Message 1322339 - Posted: 30 Dec 2012, 19:17:34 UTC - in response to Message 1321623.

Greetings:

It would figure that something like this already existed...

So, why not include the summary data, not the graph, on the Status page?

This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results. Update queries?

Thanks!

Neil
____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4139
Credit: 33,413,152
RAC: 18,982
United Kingdom
Message 1322360 - Posted: 30 Dec 2012, 19:44:00 UTC - in response to Message 1322339.

This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results.

Because the router is facing the other way: green is downloads to us, blue is uploads to the servers.

Claggy

ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 621
Credit: 142,737,814
RAC: 145,044
United Kingdom
Message 1322364 - Posted: 30 Dec 2012, 19:45:49 UTC - in response to Message 1322339.


So, why not include the summary data, not the graph, on the Status page?

I don't think that's actually SETI's graph, but the Berkeley network group's. They probably don't want to draw overly much traffic, tho' it's well-known on the forum.

This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results. Update queries?

In and out are from the router's point of view: green is into the router from the Lab and thus out to The World, while blue is in from outside and out to inside.
____________


Copyright © 2014 University of California