One Last Note... (Dec 20 2012)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
One more quick update before the apocalypse. Or holiday week off. Or whatever. We seem to be still having minor headaches due to fallout from the power failures of a couple weeks ago. The various back end queues aren't draining as fast as we'd like. We mostly see that in the assimilator queue size. We recently realized that the backlog is such that one of the four assimilators is dealing with over 99% of the backlog - so effectively we're only 25% as efficient dealing with this particular queue. We're letting this clear itself out "naturally" as opposed to adding more complexity to solve a temporary problem. I did cause a couple more headaches this morning moving archives from one full partition on one server to a less full partition on another. This caused all the queues to expand, and all network traffic to slow down. This is a bit of a clue as to our general woes. Maybe there's some faulty internal network wiring or switching or configuration...? On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect-storm-type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down. Okay. See you on the other side... - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
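As a rough illustration of the 25% figure in the post above: if all four assimilators run at the same rate and work cannot move between them, the queue drains only as fast as the busiest one. This is a back-of-the-envelope model, not SETI@home's code; the function name and the even split of the remaining 1% are assumptions made for the example.

```python
def parallel_efficiency(shares):
    """Fraction of the ideal speed-up when work is split unevenly
    among equal-speed workers.

    `shares` holds each worker's fraction of the backlog (summing to 1).
    Drain time is set by the most loaded worker, so the effective
    speed-up is 1 / max(shares), against an ideal of len(shares).
    """
    effective_workers = 1.0 / max(shares)
    return effective_workers / len(shares)


# Assumed split: one assimilator holds 99% of the backlog,
# the other three share the remaining 1% evenly.
skewed = [0.99, 0.01 / 3, 0.01 / 3, 0.01 / 3]
balanced = [0.25, 0.25, 0.25, 0.25]

print(f"skewed:   {parallel_efficiency(skewed):.1%}")
print(f"balanced: {parallel_efficiency(balanced):.1%}")
```

Under these assumptions the skewed split works out to roughly a quarter of the ideal throughput, which matches the estimate in the post.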
Gary Charpentier Joined: 25 Dec 00 Posts: 31043 Credit: 53,134,872 RAC: 32 |
Have a happy and safe end of world. |
kittyman Joined: 9 Jul 00 Posts: 51488 Credit: 1,018,363,574 RAC: 1,004 |
Thanks for the news, Matt. It really is good to have you around to try to dig into some of the 'hidden' gremlins that some of us have always suspected were throwing a wrench into things behind the obvious scenes. Wish you all success in finding and sorting them. In whatever kind of world we have tomorrow...LOL. "Time is simply the mechanism that keeps everything from happening all at once." |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Thanks for the update Matt, have a merry christmas and a happy new year, Claggy |
ivan Joined: 5 Mar 01 Posts: 783 Credit: 348,560,338 RAC: 223 |
Thanks for the update, Matt. A Merry Christmas and a Happy New Year to you, all the other personnel at the Lab, and to all my fellow crunchers. It's been a good year for science and I was lucky enough to play a small part in it. Now we've found a potentially-habitable planet just 12 light-years away -- can anyone invent instantaneous teleportation over that distance? (No, without proving Einstein wrong... :-( ) |
Dimly Lit Lightbulb 😀 Joined: 30 Aug 08 Posts: 15399 Credit: 7,423,413 RAC: 1 |
Thanks for the news Matt, to you and everyone in the lab, have a very Merry Christmas and a Happy New Year! Member of the People Encouraging Niceness In Society club. |
Cosmic_Ocean Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
If moving data from one machine to another via the network is causing a global issue like that, you are right to suspect equipment or wiring, however, it could just be some limitation in the drivers for the NICs themselves. Do you have jumbo frame support? Maybe some Rx/Tx buffer sizes need to be adjusted, or checksum offloading needs to be enabled. Jumbo frames on gigabit are definitely nice. I have two machines on my network that on gigabit with the default of 1500 for the MTU can only manage about 270mbit and the slower machine's CPU is maxed out. I switched over to 9K for the MTU and I get 890mbit and about 75% CPU load. This is moving data across NFS, between Windows and Linux, I might add. TL;DR: it may just be some parameter tweaking in the NIC's drivers. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
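One way to see why a larger MTU can relieve a CPU-bound machine is to compare how many packets per second each configuration has to handle. The sketch below uses the throughput figures from the post above and treats every packet as carrying a full MTU of data; that simplification (ignoring Ethernet, IP and TCP header overhead) and the function name are assumptions for illustration only.

```python
def packets_per_second(throughput_bits_per_s, mtu_bytes):
    """Approximate packet rate, assuming every packet is MTU-sized
    and ignoring per-packet header overhead."""
    return throughput_bits_per_s / (8 * mtu_bytes)


# Figures quoted in the post: about 270 Mbit/s at MTU 1500,
# about 890 Mbit/s at MTU 9000 (9K jumbo frames).
standard = packets_per_second(270e6, 1500)
jumbo = packets_per_second(890e6, 9000)

print(f"MTU 1500: {standard:,.0f} packets/s")
print(f"MTU 9000: {jumbo:,.0f} packets/s")
print(f"jumbo frames move {890 / 270:.1f}x the data "
      f"with {jumbo / standard:.0%} of the packets")
```

Since much of the per-packet cost (interrupts, checksums, protocol processing) does not depend on packet size, fewer and larger packets would be consistent with the lower CPU load reported at the higher throughput.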
Jord Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
Coming up to noon here in The Netherlands, we're still not dying or otherwise feeling very doom-dayey. ;-) |
Thomas Joined: 9 Dec 11 Posts: 1499 Credit: 1,345,576 RAC: 0 |
On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down. Good news, Matt! :) Good luck fixing the rest. Thanks for the update. Merry Christmas and Happy New Year to you and all your loved ones! |
Tcarey Joined: 20 Aug 99 Posts: 30 Credit: 70,655,757 RAC: 24 |
Thanks for the update. I would like to say that since the machines got over the database crash, the communication rates and unit feeding frequency have been excellent on my systems. It looks like I'm not getting as many units in total in my local buffers, which is fine since replacements are coming quickly after uploading. Is limiting the number of units allowed to each machine part of the fix for the database overload that was going on? Merry Christmas and Happy New Year to all. |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
... Yes, each host is allowed to have up to 100 CPU tasks and 100 GPU tasks in progress. That has reduced the number of tasks the database has to keep track of by about 6 million. Joe |
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
As usual Matt, very many thanks for the update, it is appreciated. But I did catch your UPS comment. Everything is on a UPS. However, as has been explained, it's not that easy. Different processes, running on different machines, have to be stopped in a specific order to avoid all the corruption that occurred last time. That requires either someone to be there to do it, or (if it's even possible) a very complex script overseeing all the shutdowns. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
Gary Charpentier Joined: 25 Dec 00 Posts: 31043 Credit: 53,134,872 RAC: 32 |
As usual Matt, very many thanks for the update, it is appreciated. But I did catch your UPS comment. Tad more than that. As was explained it used to all shut down when the UPS(s) said they were on battery. The issue was the mains at the lab are a bit flaky. So it was shutting down all the time on momentary brownout conditions. To restart after a shutdown someone has to actually be there. As to a script, I think that is something that needs investigation. As many of the machines pull double duty perhaps they can find a charge number that isn't on the Seti@home budget to write the script. If the script waited to begin the shutdown until say one minute of mains failure, then you can be rather sure something is really up. Hopefully that isn't so long that a UPS would run dry before an orderly shutdown is complete. But you test! |
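The "wait a minute before shutting down" idea from the post above can be sketched as a simple debounce over UPS status samples. This is only an illustration of the logic, not anything the lab runs: the function name, the `(timestamp, on_battery)` sample format and the 60-second default are assumptions, and a real script would also have to stop the machines in the right order and finish before the batteries run out.

```python
from typing import Iterable, Optional, Tuple


def shutdown_trigger_time(
    samples: Iterable[Tuple[float, bool]],
    hold_seconds: float = 60.0,
) -> Optional[float]:
    """Return the timestamp at which an orderly shutdown should begin,
    or None if mains power never stayed off for `hold_seconds`.

    `samples` is a time-ordered sequence of (timestamp_seconds, on_battery)
    readings. Any reading with mains restored resets the timer, so a
    momentary brownout does not trigger a shutdown.
    """
    outage_start = None
    for timestamp, on_battery in samples:
        if not on_battery:
            outage_start = None  # mains is back: forget the outage
            continue
        if outage_start is None:
            outage_start = timestamp  # first reading of a new outage
        if timestamp - outage_start >= hold_seconds:
            return timestamp
    return None


# Hypothetical readings: a 5-second brownout, then a sustained failure.
brownout_then_outage = [
    (0, False), (10, True), (15, False),
    (20, True), (50, True), (80, True),
]
brownouts_only = [(0, True), (5, False), (100, True), (130, False)]

print(shutdown_trigger_time(brownout_then_outage))
print(shutdown_trigger_time(brownouts_only))
```

In the first sequence the brief dip at 10 seconds is ignored and the trigger only fires once the second outage has lasted a full minute; in the second sequence no outage lasts long enough, so nothing is shut down.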
Neil L. Carter Joined: 6 Dec 99 Posts: 62 Credit: 16,385,509 RAC: 27 |
Greetings: A couple of requests for your website. Both to improve our understanding of what your systems have to deal with on a continuing basis. 1. You have a 'Server Status' page with a lot of very good information. I suggest you change it to a 'Systems Status' page and include some networking throughput details as well as the server status and splitter status sections. You already have 'Results received in last hour', but it appears to me your network issues would be better spelled out in Kb/s in and out, or something like that, maybe separated into different types of data..... 2. Again, in relation to the 'Server Status' page, you have some very precise definitions in your 'Glossary' section. Could someone put together a data/systems flowchart so we can better understand how the data flows through your systems? Just some thoughts to assist us not as technically aware of the processes involved... Thanks! Neil |
ivan Joined: 5 Mar 01 Posts: 783 Credit: 348,560,338 RAC: 223 |
1. You have a 'Server Status' page with a lot of very good information. I suggest you change it to a 'Systems Status' page and include some networking throughput details as well as the server status and splitter status sections. You already have 'Results received in last hour', but it appears to me your network issues would be better spelled out in Kb/s in and out, or something like that, maybe separated into different types of data..... Something like this, perhaps. Green is data out from the Lab, blue is incoming. We commonly call this the "cricket graph" for reasons that may be obvious... |
Neil L. Carter Joined: 6 Dec 99 Posts: 62 Credit: 16,385,509 RAC: 27 |
Greetings: It would figure that something like this already existed... So, why not include the summary data, not the graph, on the Status page? This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results. Update queries? Thanks! Neil |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
This raises another question. Why so much more data in than out? One would think the downloads from the servers would be higher than the uploads, since the download package sizes are so much larger than the uploaded results. Because the router is facing the other way, Green is downloads to us, Blue is uploads to the Servers, Claggy |
ivan Joined: 5 Mar 01 Posts: 783 Credit: 348,560,338 RAC: 223 |
I don't think that's actually SETI's graph, but the Berkeley network group's. They probably don't want to draw overly much traffic, tho' it's well-known on the forum.
In and out are from the router's point-of-view, green is into the router from the Lab and thus out to The World while blue is in from outside and out to inside. |
Neil L. Carter Joined: 6 Dec 99 Posts: 62 Credit: 16,385,509 RAC: 27 |
Ok, thanks guys!! Happy New Year! Neil |