One Last Note... (Dec 20 2012)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 1317819 - Posted: 20 Dec 2012, 21:11:10 UTC One more quick update before the apocalypse. Or holiday week off. Or whatever. We seem to be still having minor headaches due to fallout from the power failures of a couple weeks ago. The various back end queues aren't draining as fast as we'd like. We mostly see that in the assimilator queue size. We recently realized that the backlog is such that one of the four assimilators is dealing with over 99% of the backlog - so effectively we're only 25% as efficient dealing with this particular queue. We're letting this clear itself out "naturally" as opposed to adding more complexity to solve a temporary problem. I did cause a couple more headaches this morning moving archives from one full partition on one server to a less full partition on another. This caused all the queues to expand, and all network traffic to slow down. This is a bit of a clue as to our general woes. Maybe there's some faulty internal network wiring or switching or configuration...? On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down. Okay. See you on the other side... - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 1317819 ·
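
Editor's aside: the "only 25% as efficient" figure in the post above can be checked with a little arithmetic. The sketch below is illustrative only. It assumes every assimilator drains work at the same rate and cannot take work from another assimilator's queue, and the backlog counts are made up to match the "over 99% on one of four" description. The function names are hypothetical and are not part of BOINC.

```python
def effective_parallelism(backlog_per_worker):
    """Number of workers' worth of useful throughput while the backlog drains.

    With equal drain rates and no work sharing, the drain time is set by the
    busiest worker, so the useful parallelism is total backlog / largest share.
    """
    total = sum(backlog_per_worker)
    busiest = max(backlog_per_worker)
    if busiest == 0:
        # Nothing queued anywhere: treat every worker as fully available.
        return float(len(backlog_per_worker))
    return total / busiest


def efficiency(backlog_per_worker):
    """Useful parallelism as a fraction of the number of workers."""
    return effective_parallelism(backlog_per_worker) / len(backlog_per_worker)


# Made-up numbers: one assimilator holds 99% of a 1000-item backlog.
skewed = [990, 4, 3, 3]
balanced = [250, 250, 250, 250]

print(f"skewed:   {efficiency(skewed):.2f}")
print(f"balanced: {efficiency(balanced):.2f}")
```

Under these assumptions the skewed case comes out at roughly a quarter of the balanced case, which matches the estimate in the post: three of the four assimilators finish almost immediately and then sit idle.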

Gary Charpentier Volunteer tester Send message Joined: 25 Dec 00 Posts: 31304 Credit: 53,134,872 RAC: 32	Message 1317825 - Posted: 20 Dec 2012, 21:29:09 UTC Have a happy and safe end of world. ID: 1317825 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51531 Credit: 1,018,363,574 RAC: 1,004	Message 1317833 - Posted: 20 Dec 2012, 21:49:26 UTC Last modified: 20 Dec 2012, 22:14:15 UTC Thanks for the news, Matt. It really is good to have you around to try to dig into some of the 'hidden' gremlins that some of us have always suspected were throwing a wrench into things behind the obvious scenes. Wish you all success in finding and sorting them. In whatever kind of world we have tomorrow...LOL. "Time is simply the mechanism that keeps everything from happening all at once." ID: 1317833 ·

Claggy Volunteer tester Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1317850 - Posted: 20 Dec 2012, 22:13:11 UTC - in response to Message 1317819. Thanks for the update Matt, have a merry christmas and a happy new year, Claggy ID: 1317850 ·

ivan Volunteer tester Send message Joined: 5 Mar 01 Posts: 783 Credit: 348,560,338 RAC: 223	Message 1317880 - Posted: 20 Dec 2012, 22:57:27 UTC - in response to Message 1317819. Last modified: 20 Dec 2012, 23:03:45 UTC Thanks for the update, Matt. A Merry Christmas and a Happy New Year to you, all the other personnel at the Lab, and to all my fellow crunchers. It's been a good year for science and I was lucky enough to play a small part in it. Now we've found a potentially-habitable planet just 12 light-years away -- can anyone invent instantaneous teleportation over that distance? (No, without proving Einstein wrong... :-( ) ID: 1317880 ·

Dimly Lit Lightbulb 😀 Volunteer tester Send message Joined: 30 Aug 08 Posts: 15401 Credit: 7,423,413 RAC: 1	Message 1317886 - Posted: 20 Dec 2012, 23:07:18 UTC Last modified: 20 Dec 2012, 23:07:43 UTC Thanks for the news Matt, to you and everyone in the lab, have a very Merry Christmas and a Happy New Year! Member of the People Encouraging Niceness In Society club. ID: 1317886 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1318004 - Posted: 21 Dec 2012, 6:17:34 UTC If moving data from one machine to another via the network is causing a global issue like that, you are right to suspect equipment or wiring. However, it could just be some limitation in the drivers for the NICs themselves. Do you have jumbo frame support? Maybe some Rx/Tx buffer sizes need to be adjusted, or checksum offloading needs to be enabled. Jumbo frames on gigabit are definitely nice. I have two machines on my network that, on gigabit with the default MTU of 1500, can only manage about 270 Mbit/s, and the slower machine's CPU is maxed out. I switched over to 9K for the MTU and I get 890 Mbit/s and about 75% CPU load. This is moving data across NFS, between Windows and Linux, I might add. TL;DR: it may just be some parameter tweaking in the NIC's drivers. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 1318004 ·
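
Editor's aside: one plausible reading of the numbers in the post above is that the slower machine is limited by per-frame work rather than by raw bandwidth. The rough sketch below estimates frames per second for the two reported cases. It is back-of-the-envelope arithmetic only: it treats the MTU as the whole frame, ignores Ethernet, IP and TCP header overhead, and assumes "9K" means an MTU of 9000 bytes. The function name is hypothetical.

```python
def frames_per_second(throughput_mbit, mtu_bytes):
    """Approximate full-size frames per second at a given throughput.

    Treats each frame as exactly mtu_bytes long and ignores header overhead,
    so the result is an estimate, not a measurement.
    """
    bytes_per_second = throughput_mbit * 1_000_000 / 8
    return bytes_per_second / mtu_bytes


# Figures reported in the post; MTU 9000 is an assumption for "9K".
standard = frames_per_second(270, 1500)
jumbo = frames_per_second(890, 9000)

print(f"MTU 1500 at 270 Mbit/s: about {standard:,.0f} frames/s")
print(f"MTU 9000 at 890 Mbit/s: about {jumbo:,.0f} frames/s")
```

On these assumptions the jumbo-frame case moves more than three times the data while handling roughly half as many frames per second, which is consistent with the lower CPU load the poster saw.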

Jord Volunteer tester Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3	Message 1318074 - Posted: 21 Dec 2012, 10:46:02 UTC Last modified: 21 Dec 2012, 10:46:09 UTC Coming up to noon here in The Netherlands, we're still not dying or otherwise feeling very doom-dayey. ;-) ID: 1318074 ·

Thomas Volunteer tester Send message Joined: 9 Dec 11 Posts: 1499 Credit: 1,345,576 RAC: 0	Message 1318141 - Posted: 21 Dec 2012, 13:19:41 UTC - in response to Message 1317819. On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down. Good news, Matt ! :) Good luck to fix the rest. THX for the update. Merry Christmas and Happy New Year to you and all your loved ones ! ID: 1318141 ·

Tcarey Send message Joined: 20 Aug 99 Posts: 30 Credit: 70,655,757 RAC: 24	Message 1318646 - Posted: 22 Dec 2012, 5:11:58 UTC Thanks for the update. I would like to say that since the machines got over the database crash, the communication rates and unit feeding frequency have been excellent on my systems. Looks like I'm not getting as many units total in my local buffers, which is fine since replacements are coming quickly after uploading. Is limiting the number of units allowed to each machine part of the fix for the database overload that was going on? Merry Christmas, Happy New Year to all. ID: 1318646 ·

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1318885 - Posted: 22 Dec 2012, 17:39:59 UTC - in response to Message 1318646. ... Looks like I'm not getting as many units total in my local buffers which is fine since replacements are coming quickly after uploading. Is limiting the number of units allowed to each machine part of the fix for the database overload that was going? ... Yes, each host is allowed to have up to 100 CPU tasks and 100 GPU tasks in progress. That has reduced the number of tasks the database has to keep track of by about 6 million. Joe ID: 1318885 ·

David S Volunteer tester Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12	Message 1319380 - Posted: 23 Dec 2012, 19:03:47 UTC - in response to Message 1318766. As usual Matt, very many thanks for the update, it is appreciated. But I did catch your UPS comment. On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down. I truly think that ALL the Seti servers should be on a similar UPS system. A New Year fundraiser for the GPUUG seems to be beckoning ..... In the meantime, may I wish you and the other guys in the lab, a very happy Christmas and a peaceful New Year. You've earned it! Everything is on a UPS. However, as has been explained, it's not that easy. Different processes, running on different machines, have to be stopped in a specific order to avoid all the corruption that occurred last time. That requires either someone to be there to do it, or (if it's even possible) a very complex script overseeing all the shutdowns. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. ID: 1319380 ·
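
Editor's aside: the "stopped in a specific order" requirement described above is a dependency-ordering problem. The sketch below shows one way such an order could be computed with Python's standard `graphlib` module (Python 3.9 or later). The service names and the dependency map are entirely hypothetical; the thread does not say which processes depend on which at the lab.

```python
from graphlib import TopologicalSorter

# Hypothetical map: service -> set of services that must be stopped BEFORE it.
# These names are placeholders, not the lab's actual processes.
MUST_STOP_FIRST = {
    "database": {"backend_queue_daemons"},
    "backend_queue_daemons": {"client_facing_services"},
    "client_facing_services": set(),
}


def shutdown_order(must_stop_first):
    """Return the services in an order that respects every 'stop first' rule.

    TopologicalSorter yields a node only after all of its predecessors,
    so anything that must stop first appears earlier in the list.
    It raises graphlib.CycleError if the rules contradict each other.
    """
    return list(TopologicalSorter(must_stop_first).static_order())


for step, service in enumerate(shutdown_order(MUST_STOP_FIRST), start=1):
    print(f"{step}. stop {service}")
```

Computing the order is the easy part. As the post says, the hard part is that the steps run on different machines, and each one has to be confirmed finished before the next begins.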

Gary Charpentier Volunteer tester Send message Joined: 25 Dec 00 Posts: 31304 Credit: 53,134,872 RAC: 32	Message 1319414 - Posted: 23 Dec 2012, 20:18:25 UTC - in response to Message 1319380. As usual Matt, very many thanks for the update, it is appreciated. But I did catch your UPS comment. On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down. I truly think that ALL the Seti servers should be on a similar UPS system. A New Year fundraiser for the GPUUG seems to be beckoning ..... In the meantime, may I wish you and the other guys in the lab, a very happy Christmas and a peaceful New Year. You've earned it! Everything is on a UPS. However, as has been explained, it's not that easy. Different processes, running on different machines, have to be stopped in a specific order to avoid all the corruption that occurred last time. That requires either someone to be there to do it, or (if it's even possible) a very complex script overseeing all the shutdowns. Tad more than that. As was explained it used to all shut down when the UPS(s) said they were on battery. The issue was the mains at the lab are a bit flaky. So it was shutting down all the time on momentary brownout conditions. To restart after a shutdown someone has to actually be there. As to a script, I think that is something that needs investigation. As many of the machines pull double duty perhaps they can find a charge number that isn't on the Seti@home budget to write the script. If the script waited to begin the shutdown until say one minute of mains failure, then you can be rather sure something is really up. Hopefully that isn't so long that a UPS would run dry before an orderly shutdown is complete. But you test! ID: 1319414 ·
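
Editor's aside: the "wait one minute of mains failure before shutting down" idea above is a debounce. The sketch below shows the decision logic only, as a pure function over a series of readings. The `(timestamp, on_battery)` sample format, the function name and the 60-second default are assumptions for illustration; how a real UPS reports its state is not covered here.

```python
def shutdown_time(samples, grace_seconds=60.0):
    """Decide when an orderly shutdown should begin.

    samples: iterable of (timestamp_seconds, on_battery) pairs in time order.
    Returns the timestamp of the first sample at which the UPS has been on
    battery continuously for at least grace_seconds, or None if that never
    happens. Any sample showing mains power resets the timer, so momentary
    brownouts do not trigger a shutdown.
    """
    outage_start = None
    for timestamp, on_battery in samples:
        if not on_battery:
            outage_start = None  # mains is back: forget the earlier outage
            continue
        if outage_start is None:
            outage_start = timestamp  # first on-battery sample of this outage
        if timestamp - outage_start >= grace_seconds:
            return timestamp
    return None


# Two short brownouts: never a full minute on battery, so no shutdown.
brownouts = [(0, False), (10, True), (15, False), (20, True), (30, False)]

# A sustained outage starting at t=10: shutdown begins at the t=70 sample.
outage = [(0, False), (10, True), (40, True), (70, True), (80, True)]

print(shutdown_time(brownouts))
print(shutdown_time(outage))
```

The grace period has to be chosen against the battery runtime, as the post points out: the delay plus the time the orderly shutdown itself takes must fit comfortably within what the UPS can supply, and that needs testing on the real hardware.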

Neil L. Carter Volunteer tester Send message Joined: 6 Dec 99 Posts: 62 Credit: 16,385,509 RAC: 27	Message 1321591 - Posted: 29 Dec 2012, 19:54:12 UTC - in response to Message 1317819. Greetings: A couple of requests for your website. Both to improve our understanding of what your systems have to deal with on a continuing basis. 1. You have a 'Server Status' page with a lot of very good information. I suggest you change it to a 'Systems Status' page and include some networking throughput details as well as the server status and splitter status sections. You already have 'Results received in last hour', but it appears to me your network issues would be better spelled out in Kb/s in and out, or something like that, maybe separated into different types of data..... 2. Again, in relation to the 'Server Status' page, you have some very precise definitions in your 'Glossary' section. Could someone put together a data/systems flowchart so we can better understand how the data flows through your systems? Just some thoughts to assist us not as technically aware of the processes involved... Thanks! Neil ID: 1321591 ·

ivan Volunteer tester Send message Joined: 5 Mar 01 Posts: 783 Credit: 348,560,338 RAC: 223	Message 1321623 - Posted: 29 Dec 2012, 20:57:54 UTC - in response to Message 1321591. 1. You have a 'Server Status' page with a lot of very good information. I suggest you change it to a 'Systems Status' page and include some networking throughput details as well as the server status and splitter status sections. You already have 'Results received in last hour', but it appears to me your network issues would be better spelled out in Kb/s in and out, or something like that, maybe separated into different types of data..... Something like this, perhaps. Green is data out from the Lab, blue is incoming. We commonly call this the "cricket graph" for reasons that may be obvious... ID: 1321623 ·

Neil L. Carter Volunteer tester Send message Joined: 6 Dec 99 Posts: 62 Credit: 16,385,509 RAC: 27	Message 1322339 - Posted: 30 Dec 2012, 19:17:34 UTC - in response to Message 1321623. Greetings: It would figure that something like this already existed... So, why not include the summary data, not the graph, on the Status page? This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results. Update queries? Thanks! Neil ID: 1322339 ·

Claggy Volunteer tester Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1322360 - Posted: 30 Dec 2012, 19:44:00 UTC - in response to Message 1322339. This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results. Because the router is facing the other way, Green is downloads to us, Blue is uploads to the Servers, Claggy ID: 1322360 ·

ivan Volunteer tester Send message Joined: 5 Mar 01 Posts: 783 Credit: 348,560,338 RAC: 223	Message 1322364 - Posted: 30 Dec 2012, 19:45:49 UTC - in response to Message 1322339. So, why not include the summary data, not the graph, on the Status page? I don't think that's actually SETI's graph, but the Berkeley network group's. They probably don't want to draw overly much traffic, tho' it's well-known on the forum. This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results. Update queries? In and out are from the router's point-of-view: green is into the router from the Lab and thus out to The World, while blue is in from outside and out to inside. ID: 1322364 ·

Neil L. Carter Volunteer tester Send message Joined: 6 Dec 99 Posts: 62 Credit: 16,385,509 RAC: 27	Message 1323411 - Posted: 1 Jan 2013, 23:50:23 UTC - in response to Message 1322364. Ok, thanks guys!! Happy New Year! Neil ID: 1323411 ·

©2025 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.