One Last Note... (Dec 20 2012)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1317819 - Posted: 20 Dec 2012, 21:11:10 UTC

One more quick update before the apocalypse. Or holiday week off. Or whatever.

We still seem to be having minor headaches due to fallout from the power failures of a couple of weeks ago. The various back-end queues aren't draining as fast as we'd like. We mostly see that in the assimilator queue size. We recently realized that the backlog is such that one of the four assimilators is dealing with over 99% of the backlog, so effectively we're only 25% as efficient dealing with this particular queue. We're letting this clear itself out "naturally" as opposed to adding more complexity to solve a temporary problem.
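
To make the 25% figure concrete, here is a minimal sketch with made-up numbers. The backlog size, the per-assimilator rate, and the function name `drain_time` are assumptions for illustration, not project figures. It assumes each assimilator can only work through its own share of the queue, so the busiest one sets the time needed to clear the backlog.

```python
def drain_time(backlog, shares, rate_per_worker):
    """Time until every worker has cleared its own share of the backlog.

    Assumes work cannot move between workers, so the most heavily
    loaded worker determines the total time.
    """
    return max(backlog * share / rate_per_worker for share in shares)


# Hypothetical numbers, for illustration only.
backlog = 1_000_000   # results waiting to be assimilated
rate = 100.0          # results per second, per assimilator

even = drain_time(backlog, [0.25, 0.25, 0.25, 0.25], rate)
skewed = drain_time(backlog, [0.99, 0.01 / 3, 0.01 / 3, 0.01 / 3], rate)

print(f"even split:   {even:.0f} s")
print(f"skewed split: {skewed:.0f} s")
print(f"relative efficiency: {even / skewed:.0%}")
```

With one worker holding 99% of the backlog, the other three finish almost immediately and then sit idle, which is where the "roughly a quarter as efficient" estimate comes from.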

I did cause a couple more headaches this morning moving archives from one full partition on one server to a less full partition on another. This caused all the queues to expand, and all network traffic to slow down. This is a bit of a clue as to our general woes. Maybe there's some faulty internal network wiring or switching or configuration...?

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

Okay. See you on the other side...

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1317819
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 31043
Credit: 53,134,872
RAC: 32
United States
Message 1317825 - Posted: 20 Dec 2012, 21:29:09 UTC

Have a happy and safe end of world.

ID: 1317825
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51488
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1317833 - Posted: 20 Dec 2012, 21:49:26 UTC
Last modified: 20 Dec 2012, 22:14:15 UTC

Thanks for the news, Matt.
It really is good to have you around to try to dig into some of the 'hidden' gremlins that some of us have always suspected were throwing a wrench into things behind the obvious scenes.

Wish you all success in finding and sorting them.

In whatever kind of world we have tomorrow...LOL.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1317833
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1317850 - Posted: 20 Dec 2012, 22:13:11 UTC - in response to Message 1317819.  

Thanks for the update, Matt. Have a Merry Christmas and a Happy New Year,

Claggy
ID: 1317850
Profile ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1317880 - Posted: 20 Dec 2012, 22:57:27 UTC - in response to Message 1317819.  
Last modified: 20 Dec 2012, 23:03:45 UTC

Thanks for the update, Matt. A Merry Christmas and a Happy New Year to you, all the other personnel at the Lab, and to all my fellow crunchers. It's been a good year for science and I was lucky enough to play a small part in it. Now we've found a potentially habitable planet just 12 light-years away -- can anyone invent instantaneous teleportation over that distance? (Not without proving Einstein wrong... :-( )
ID: 1317880
Profile Dimly Lit Lightbulb 😀
Volunteer tester
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1317886 - Posted: 20 Dec 2012, 23:07:18 UTC
Last modified: 20 Dec 2012, 23:07:43 UTC

Thanks for the news Matt, to you and everyone in the lab, have a very Merry Christmas and a Happy New Year!

Member of the People Encouraging Niceness In Society club.

ID: 1317886
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1318004 - Posted: 21 Dec 2012, 6:17:34 UTC

If moving data from one machine to another via the network is causing a global issue like that, you are right to suspect equipment or wiring. However, it could just be some limitation in the drivers for the NICs themselves.

Do you have jumbo frame support? Maybe some Rx/Tx buffer sizes need to be adjusted, or checksum offloading needs to be enabled.

Jumbo frames on gigabit are definitely nice. I have two machines on my network that, on gigabit with the default MTU of 1500, can only manage about 270 Mbit/s, and the slower machine's CPU is maxed out. I switched over to 9K for the MTU and I get 890 Mbit/s and about 75% CPU load. This is moving data across NFS, between Windows and Linux, I might add.
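
A rough back-of-the-envelope sketch of why a larger MTU can lower CPU load: for the same throughput, the host has far fewer frames per second to handle. The header sizes assume plain IPv4 and TCP without options over standard Ethernet, and the function names are illustrative only.

```python
# Per-frame bytes on the wire beyond the MTU-sized packet:
# preamble/SFD 8 + Ethernet header 14 + FCS 4 + inter-frame gap 12
ETH_OVERHEAD = 38
IP_TCP_HEADERS = 40  # IPv4 (20) + TCP (20), no options


def frames_per_second(throughput_bits, mtu):
    """Frames needed per second to carry `throughput_bits` of TCP payload."""
    payload = mtu - IP_TCP_HEADERS
    return throughput_bits / 8 / payload


def wire_efficiency(mtu):
    """Fraction of on-the-wire bytes that is TCP payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


for mtu in (1500, 9000):
    fps = frames_per_second(890e6, mtu)
    print(f"MTU {mtu}: {fps:,.0f} frames/s, "
          f"{wire_efficiency(mtu):.1%} payload efficiency")
```

The gain in wire efficiency is only a few percent; the roughly sixfold drop in frames per second is the part that relieves a CPU-bound machine, which fits the numbers reported above.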

TL;DR: it may just be some parameter tweaking in the NIC's drivers.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1318004
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1318074 - Posted: 21 Dec 2012, 10:46:02 UTC
Last modified: 21 Dec 2012, 10:46:09 UTC

Coming up to noon here in The Netherlands, we're still not dying or otherwise feeling very doom-dayey. ;-)
ID: 1318074
Thomas
Volunteer tester
Joined: 9 Dec 11
Posts: 1499
Credit: 1,345,576
RAC: 0
France
Message 1318141 - Posted: 21 Dec 2012, 13:19:41 UTC - in response to Message 1317819.  

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

Good news, Matt ! :)
Good luck fixing the rest.
THX for the update.
Merry Christmas and Happy New Year to you and all your loved ones !
ID: 1318141
Tcarey
Joined: 20 Aug 99
Posts: 30
Credit: 70,655,757
RAC: 24
United States
Message 1318646 - Posted: 22 Dec 2012, 5:11:58 UTC

Thanks for the update. I would like to say that since the machines got over the database crash the communications rates and unit feeding frequency have been excellent on my systems.

Looks like I'm not getting as many units total in my local buffers which is fine since replacements are coming quickly after uploading.

Is limiting the number of units allowed to each machine part of the fix for the database overload that was going on?

Merry Christmas, Happy New year to all.
ID: 1318646
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1318885 - Posted: 22 Dec 2012, 17:39:59 UTC - in response to Message 1318646.  

...
Looks like I'm not getting as many units total in my local buffers which is fine since replacements are coming quickly after uploading.

Is limiting the number of units allowed to each machine part of the fix for the database overload that was going?
...

Yes, each host is allowed to have up to 100 CPU tasks and 100 GPU tasks in progress. That has reduced the number of tasks the database has to keep track of by about 6 million.
Joe
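
A minimal sketch of how a per-host cap like this could be applied when deciding how many tasks to send. The 100/100 limits come from the post above; the function and parameter names are hypothetical, and this is not the actual BOINC scheduler code.

```python
MAX_IN_PROGRESS = {"cpu": 100, "gpu": 100}  # per-host limits quoted above


def tasks_to_send(requested, in_progress, resource):
    """How many new tasks a host may receive for one resource type."""
    headroom = max(0, MAX_IN_PROGRESS[resource] - in_progress)
    return min(requested, headroom)


print(tasks_to_send(requested=50, in_progress=80, resource="cpu"))   # capped by headroom
print(tasks_to_send(requested=10, in_progress=20, resource="gpu"))   # request fits
print(tasks_to_send(requested=5, in_progress=100, resource="cpu"))   # already at the limit
```

Because replacements are issued as soon as results are uploaded, a host that stays under the cap keeps working while the database tracks far fewer tasks at any one time.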
ID: 1318885
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1319380 - Posted: 23 Dec 2012, 19:03:47 UTC - in response to Message 1318766.  

As usual Matt, very many thanks for the update, it is appreciated. But I did catch your UPS comment.

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

I truly think that ALL the Seti servers should be on a similar UPS system. A New Year fundraiser for the GPUUG seems to be beckoning .....

In the meantime, may I wish you and the other guys in the lab, a very happy Christmas and a peaceful New Year. You've earned it!

Everything is on a UPS. However, as has been explained, it's not that easy. Different processes, running on different machines, have to be stopped in a specific order to avoid all the corruption that occurred last time. That requires either someone to be there to do it, or (if it's even possible) a very complex script overseeing all the shutdowns.
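
As a rough illustration of what the "specific order" problem looks like in a script, the sketch below orders shutdown steps by dependency using Python's standard `graphlib` module (Python 3.9 or later). The service names and their dependencies are invented for the example; the real ordering in the lab is not described in this thread.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each service lists the services that must stop BEFORE it.
# For example, a database may only stop once everything writing to it has stopped.
must_stop_first = {
    "scheduler": [],
    "assimilators": [],
    "validators": [],
    "mysql_replica": ["scheduler", "assimilators", "validators"],
    "mysql_primary": ["mysql_replica"],
}


def shutdown_order(deps):
    """Return a stop order in which every prerequisite comes first."""
    return list(TopologicalSorter(deps).static_order())


order = shutdown_order(must_stop_first)
print(" -> ".join(order))
```

Working out the order is the easy part. The hard part alluded to above is confirming that each step has really finished on its machine before the next one starts.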

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1319380
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 31043
Credit: 53,134,872
RAC: 32
United States
Message 1319414 - Posted: 23 Dec 2012, 20:18:25 UTC - in response to Message 1319380.  

As usual Matt, very many thanks for the update, it is appreciated. But I did catch your UPS comment.

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

I truly think that ALL the Seti servers should be on a similar UPS system. A New Year fundraiser for the GPUUG seems to be beckoning .....

In the meantime, may I wish you and the other guys in the lab, a very happy Christmas and a peaceful New Year. You've earned it!

Everything is on a UPS. However, as has been explained, it's not that easy. Different processes, running on different machines, have to be stopped in a specific order to avoid all the corruption that occurred last time. That requires either someone to be there to do it, or (if it's even possible) a very complex script overseeing all the shutdowns.

Tad more than that. As was explained it used to all shut down when the UPS(s) said they were on battery. The issue was the mains at the lab are a bit flaky. So it was shutting down all the time on momentary brownout conditions. To restart after a shutdown someone has to actually be there.

As to a script, I think that is something that needs investigation. As many of the machines pull double duty perhaps they can find a charge number that isn't on the Seti@home budget to write the script. If the script waited to begin the shutdown until say one minute of mains failure, then you can be rather sure something is really up. Hopefully that isn't so long that a UPS would run dry before an orderly shutdown is complete. But you test!
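
The "wait a minute before shutting down" idea could be sketched like this. It assumes some way of sampling whether the UPS is on battery, represented here as a plain list of timestamped readings. The 60-second threshold comes from the suggestion above, and all names are hypothetical.

```python
def should_shut_down(samples, hold_seconds=60.0):
    """Decide whether mains power has been out long enough to act.

    `samples` is a time-ordered list of (timestamp_seconds, on_battery)
    readings. Returns True only if the most recent unbroken run of
    on-battery readings spans at least `hold_seconds`.
    """
    outage_start = None
    for timestamp, on_battery in samples:
        if on_battery:
            if outage_start is None:
                outage_start = timestamp
        else:
            outage_start = None  # mains came back: reset the timer
    if outage_start is None:
        return False
    return samples[-1][0] - outage_start >= hold_seconds


brownout = [(0, False), (10, True), (15, False), (20, False)]
real_outage = [(0, False), (10, True), (40, True), (75, True)]

print(should_shut_down(brownout))     # False
print(should_shut_down(real_outage))  # True
```

The hold time has to be chosen so that the hold plus the full orderly shutdown still fits within the UPS runtime, which is exactly the part that needs testing.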

ID: 1319414
Neil L. Carter Project Donor
Volunteer tester
Joined: 6 Dec 99
Posts: 62
Credit: 16,385,509
RAC: 27
United States
Message 1321591 - Posted: 29 Dec 2012, 19:54:12 UTC - in response to Message 1317819.  

Greetings:

A couple of requests for your website. Both to improve our understanding of what your systems have to deal with on a continuing basis.

1. You have a 'Server Status' page with a lot of very good information. I suggest you change it to a 'Systems Status' page and include some networking throughput details as well as the server status and splitter status sections. You already have 'Results received in last hour', but it appears to me your network issues would be better spelled out in Kb/s in and out, or something like that, maybe separated into different types of data.....

2. Again, in relation to the 'Server Status' page, you have some very precise definitions in your 'Glossary' section. Could someone put together a data/systems flowchart so we can better understand how the data flows through your systems?

Just some thoughts to assist those of us who are not as technically aware of the processes involved...

Thanks!

Neil
ID: 1321591
Profile ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1321623 - Posted: 29 Dec 2012, 20:57:54 UTC - in response to Message 1321591.  

1. You have a 'Server Status' page with a lot of very good information. I suggest you change it to a 'Systems Status' page and include some networking throughput details as well as the server status and splitter status sections. You already have 'Results received in last hour', but it appears to me your network issues would be better spelled out in Kb/s in and out, or something like that, maybe separated into different types of data.....

Something like this, perhaps. Green is data out from the Lab, blue is incoming. We commonly call this the "cricket graph" for reasons that may be obvious...
ID: 1321623
Neil L. Carter Project Donor
Volunteer tester
Joined: 6 Dec 99
Posts: 62
Credit: 16,385,509
RAC: 27
United States
Message 1322339 - Posted: 30 Dec 2012, 19:17:34 UTC - in response to Message 1321623.  

Greetings:

It would figure that something like this already existed...

So, why not include the summary data, not the graph, on the Status page?

This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results. Update queries?

Thanks!

Neil
ID: 1322339
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1322360 - Posted: 30 Dec 2012, 19:44:00 UTC - in response to Message 1322339.  

This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results.

Because the router is facing the other way: green is downloads to us, blue is uploads to the servers.

Claggy
ID: 1322360
Profile ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1322364 - Posted: 30 Dec 2012, 19:45:49 UTC - in response to Message 1322339.  


So, why not include the summary data, not the graph, on the Status page?

I don't think that's actually SETI's graph, but the Berkeley network group's. They probably don't want to draw overly much traffic, though it's well known on the forum.

This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results. Update queries?

In and out are from the router's point of view: green is into the router from the Lab and thus out to The World, while blue is in from outside and out to inside.
ID: 1322364
Neil L. Carter Project Donor
Volunteer tester
Joined: 6 Dec 99
Posts: 62
Credit: 16,385,509
RAC: 27
United States
Message 1323411 - Posted: 1 Jan 2013, 23:50:23 UTC - in response to Message 1322364.  

Ok, thanks guys!!

Happy New Year!

Neil
ID: 1323411
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.