Message boards :
Technical News :
Tired of Waiting (Oct 23 2008)
Author | Message |
---|---|
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
There have been some problems with the web server lately which have been hard to track down. However, this morning I found we were being crawled fairly severely by a googlebot. I thought I took care of that ages ago with a proper robots.txt file! I then realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for beta-specific web pages. This may help.

We've also been getting these frequent, scary, but ultimately harmless kernel warnings on bruno, our upload server. Research by Jeff showed a quick kernel upgrade would fix that. We brought the kernel in yesterday and rebooted this morning to pick it up. The new kernel was fine but exercised a set of other mysterious problems, mostly centered on our upload storage partition (which is software RAIDed): lots of confusing/misleading fsck freakouts, mounting failures, disk label conflicts, etc. Eventually we were able to convince the system everything was okay, but not before a series of long, boring reboots.

Speaking of RAID, I still haven't put in the new spare on bambi. It's late enough in the week not to mess around with any hardware, especially after dealing with the above. Plus the particular RAID array in question is still one drive away from degradation (no big deal), and two drives away from failure. Plus it's a replica of the science database - the primary is in good shape and is backed up weekly. So no need to panic - we'll get the drive in there early next week.

Speaking of the science database, I'm finding our signal tables (pulse, triplet, spike, gaussian) are sufficiently large that Informix is automatically guessing that with certain "expensive" queries indexes aren't worth using, and is reverting to sequential scans which take forever. This has to be addressed sooner rather than later. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
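Matt doesn't show the actual file, but the fix he describes is a few extra "Disallow" lines scoped to the beta pages. A minimal sketch of that kind of robots.txt, with hypothetical paths (the real SETI@home URL layout isn't given in the post), might look like:

```
# Hypothetical robots.txt sketch -- paths are illustrative, not the real site layout
User-agent: *
Disallow: /sah/results.php        # main project result pages (already covered)
Disallow: /sah/workunit.php
Disallow: /beta/results.php       # beta-specific pages (the lines that were missing)
Disallow: /beta/workunit.php
```

Each `Disallow` is a URL-path prefix; a compliant crawler skips any URL that starts with one of the listed prefixes for its user agent.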
BroncoBob9 Send message Joined: 29 May 03 Posts: 62 Credit: 2,443,241 RAC: 0 |
Thanks for the update Matt! |
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 30673 Credit: 53,134,872 RAC: 32 |
Thanks for the update. Looks like you are going to be busy for a while. |
Dr. C.E.T.I. Send message Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
. . . Each of You @ Berkeley are to be Commended on a job well done - and Thanks for the Post Update Matt BOINC Wiki . . . Science Status Page . . . |
AndyW Send message Joined: 23 Oct 02 Posts: 5862 Credit: 10,957,677 RAC: 18 |
There's been some problems with the web server lately which are hard to track down. However, this morning I found we were being crawled fairly severely by a googlebot. I thought I took care of that ages ago with a proper robots.txt file! I then realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for beta-specific web pages. This may help. I've suffered that before with my own web server. Poor thing was crawled to a slow and painful death by bots! |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
There's been some problems with the web server lately which are hard to track down. However, this morning I found we were being crawled fairly severely by a googlebot. I thought I took care of that ages ago with a proper robots.txt file! I then realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for beta-specific web pages. This may help. Can you explain what is happening when a 'googlebot' attacks? Is this a DoS thing, with lots of rapid fire hits? |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
The problem with having all of those work units available for everyone to see is that it also allowed the googlebot to index all of those pages. Google (and the rest of the search engines) allow a site to publish a set of instructions (a "robots.txt" file) that tell the spider what parts of the site are not to be indexed or visited. The standard is here: http://www.robotstxt.org/. If you click on our names to the left of each post, it goes to our "computers" page (if they aren't hidden) and from there into the results. Those link to other computers and other results. ... and all of those pages will disappear when the work units transition. There is no reason for a search engine to visit them -- and apparently, the main project had the right exclusions, but the beta project didn't. So he updated the robots.txt file. |
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 30673 Credit: 53,134,872 RAC: 32 |
The problem is having all of those work units available for everyone to see, it also allowed the googlebot to index all of those pages. As you know from being a volunteer, it isn't quite like that on the Beta site. The results there aren't deleted 24 hours after validation but hang around for a long time for error-checking reasons. Millions of pages to index. |
AndyW Send message Joined: 23 Oct 02 Posts: 5862 Credit: 10,957,677 RAC: 18 |
It's not an attack, it's just bots "doing their job". When you search for a page via a search engine (Google, Yahoo, MSN, etc.), the way they build the search results is to constantly scan Internet servers around the world. An index of the results is kept for easy & quick reference, and some sites (Google, for example) keep a copy (cache) of those pages. If you do not have a robots.txt file then a lot of bandwidth / page views can be consumed by bots indexing and downloading all your files to store them in a cache. There are also sites that specialise in keeping an archive of the Internet, so again they send bots to your server to download everything and keep copies of it. Matt has spotted this going on and modified the robots.txt file on the SETI server, which should hopefully cut the traffic and save these files from being (unnecessarily) scanned and cached. |
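The "check robots.txt, then fetch" behaviour described above is exactly what Python's standard library models, so a small sketch can show what a well-behaved bot does before touching a URL (the domain and paths below are made up for illustration):

```python
# Sketch of the Robots Exclusion Protocol from a polite crawler's point of view.
# urllib.robotparser is Python's stdlib implementation; paths/domain are hypothetical.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally the bot would fetch http://example.org/robots.txt; here we parse
# an inline file equivalent to "keep everyone out of the beta pages".
rp.parse("""\
User-agent: *
Disallow: /beta/
""".splitlines())

# The crawler asks before each fetch; disallowed URLs are simply skipped.
print(rp.can_fetch("Googlebot", "http://example.org/beta/result.php"))   # False
print(rp.can_fetch("Googlebot", "http://example.org/forum/index.php"))   # True
```

Note that this check happens entirely on the crawler's side, which is why the follow-up question below about enforcement is a good one.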
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything. |
John McLeod VII Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything. Yes, it is voluntary. However, if a bot does not follow the rules and annoys a site, the IP address of the bot can be blocked from the web site entirely. It is in the best interests of the bots to follow the rules. BOINC WIKI |
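Blocking a misbehaving bot is typically done in the web server configuration rather than in robots.txt. As a sketch only (Apache 2.2-era syntax; the address range is illustrative, not a recommendation to block Google):

```
# Hypothetical Apache 2.2 snippet: deny a crawler's address range outright.
<Directory "/var/www/beta">
    Order allow,deny
    Allow from all
    Deny from 192.0.2.0/24    # example range for an abusive bot
</Directory>
```

Unlike robots.txt, this is enforced by the server itself: matching clients get a 403 regardless of whether they ever read the robots file.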
KWSN THE Holy Hand Grenade! Send message Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
Matt - early warning, the .XML stats is on the fritz again, or the numbers have become stagnant. I currently show 4000 credits on this site that the stats sites don't know about. . Hello, from Albany, CA!... |
Dr Who Fan Send message Joined: 8 Jan 01 Posts: 3225 Credit: 715,342 RAC: 4 |
Matt - early warning, the .XML stats is on the fritz again, or the numbers have become stagnant. I currently show 4000 credits on this site that the stats sites don't know about. According to http://boincstats.com/stats/project_graph.php?pr=sah it's been over 24 hours since SETI last exported stats. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything. As John pointed out, it isn't enforced, and there are even some spiders that read the robots.txt file and only search the excluded resources (usually fishing for spammable addresses). If they spider slowly, I don't really care what someone else does to access my server. If they start to become a drain on resources, they get attention, and there are always ways to enforce the rules. ... and if a spider like EMailSiphon wants to scan an excluded resource, I've got pages full of bogus E-Mail addresses at the ready. |
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 30673 Credit: 53,134,872 RAC: 32 |
Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything. Which just slows the net for everyone else. Just deny them access to the site in its entirety when you find them. |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
Matt - early warning, the .XML stats is on the fritz again, or the numbers have become stagnant. I currently show 4000 credits on this site that the stats sites don't know about. This has been updated on BoincStats to 2008-10-25 18:54:21 GMT BUT the numbers reported in the XML are about the same as the previous day. In other words, the XML's are being created but are not reporting any of the work done. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
This has been updated on BoincStats to 2008-10-25 18:54:21 GMT BUT the numbers reported in the XML are about the same as the previous day. In other words, the XML's are being created but are not reporting any of the work done. Yes, BOINCstats is showing a daily increment of 271,784 credits, where it would normally be over 40 million. That normally happens when the replica BOINC database (sidious) fails to keep in step with the primary database (jocelyn). |
DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Yes, BOINCstats is showing a daily increment of 271,784 credits, where it would normally be over 40 million. There was a bad drive that was left to rot over the weekend. Hopefully when Matt (or somebody) replaces the bad drive on Monday, he can check the DB replication performance. I still can't believe they don't have "hot swap". |
KWSN THE Holy Hand Grenade! Send message Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
BTW, the "client connection stats" script needs a kick again - and has for the last 3 weeks... currently showing only the line: last updated: January 01 1970 00:00:00. Which can't be right! . Hello, from Albany, CA!... |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.