Message boards :
Technical News :
Tired of Waiting (Oct 23 2008)
Author | Message |
---|---|
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
There have been some problems with the web server lately which have been hard to track down. However, this morning I found we were being crawled fairly severely by a googlebot. I thought I took care of that ages ago with a proper robots.txt file! I then realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for beta-specific web pages. This may help.

We've also been getting these frequent, scary, but ultimately harmless kernel warnings on bruno, our upload server. Research by Jeff showed a quick kernel upgrade would fix that. We brought the kernel in yesterday and rebooted this morning to pick it up. The new kernel was fine but exercised a set of other mysterious problems, mostly centered on our upload storage partition (which is software RAIDed): lots of confusing/misleading fsck freakouts, mounting failures, disk label conflicts, etc. Eventually we were able to convince the system everything was okay, but not before a series of long, boring reboots.

Speaking of RAID, I still haven't put in the new spare on bambi. It's late enough in the week not to mess around with any hardware, especially after dealing with the above. Plus the particular RAID array in question is still one drive away from degradation (no big deal), and two drives away from failure. Plus it's a replica of the science database - the primary is in good shape and is backed up weekly. So no need to panic - we'll get the drive in there early next week.

Speaking of the science database, I'm finding our signal tables (pulse, triplet, spike, gaussian) are sufficiently large that Informix is automatically guessing that with certain "expensive" queries indexes aren't worth using, and is reverting to sequential scans which take forever. This has to be addressed sooner rather than later. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
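Matt doesn't show the actual file, but the fix he describes is a few extra "Disallow" lines scoped to the beta pages. A minimal sketch of that kind of robots.txt, with hypothetical paths (the real SETI@home URL layout isn't given in the post), might look like:

```
# Hypothetical robots.txt sketch -- paths are illustrative, not the real site layout
User-agent: *
Disallow: /sah/results.php        # main project result pages (already covered)
Disallow: /sah/workunit.php
Disallow: /beta/results.php       # beta-specific pages (the lines that were missing)
Disallow: /beta/workunit.php
```

Each `Disallow` is a URL-path prefix; a compliant crawler skips any URL that starts with one of the listed prefixes for its user agent.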
BroncoBob9 Send message Joined: 29 May 03 Posts: 62 Credit: 2,443,241 RAC: 0 |
Thanks for the update Matt! |
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 30673 Credit: 53,134,872 RAC: 32 |
Thanks for the update. Looks like you are going to be busy for a while. |
Dr. C.E.T.I. Send message Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
. . . Each of You @ Berkeley are to be Commended on a job well done - and Thanks for the Post Update Matt BOINC Wiki . . . Science Status Page . . . |
AndyW Send message Joined: 23 Oct 02 Posts: 5862 Credit: 10,957,677 RAC: 18 |
There's been some problems with the web server lately which are hard to track down. However, this morning I found we were being crawled fairly severely by a googlebot. I thought I took care of that ages ago with a proper robots.txt file! I then realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for beta-specific web pages. This may help. I've suffered that before with my own web server. Poor thing was crawled to a slow and painful death by bots! |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
There's been some problems with the web server lately which are hard to track down. However, this morning I found we were being crawled fairly severely by a googlebot. I thought I took care of that ages ago with a proper robots.txt file! I then realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for beta-specific web pages. This may help. Can you explain what is happening when a 'googlebot' attacks? Is this a DoS thing, with lots of rapid fire hits? |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
The problem with having all of those work units available for everyone to see is that it also allowed the googlebot to index all of those pages. Google (and the rest of the search engines) allow a site to publish a set of instructions (a "robots.txt" file) that tell the spider what parts of the site are not to be indexed or visited. The standard is here: http://www.robotstxt.org/. If you click on our names to the left of each post, it goes to our "computers" page (if they aren't hidden) and from there into the results. Those link to other computers and other results. ... and all of those pages will disappear when the work units transition. There is no reason for a search engine to visit them -- and apparently, the main project had the right exclusions, but the beta project didn't. So he updated the robots.txt file. |
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 30673 Credit: 53,134,872 RAC: 32 |
The problem is having all of those work units available for everyone to see, it also allowed the googlebot to index all of those pages. As you know from being a volunteer, it isn't quite like that on the Beta site. The results there aren't deleted 24 hours after validation but hang around for a long time for error-checking reasons. Millions of pages to index. |
AndyW Send message Joined: 23 Oct 02 Posts: 5862 Credit: 10,957,677 RAC: 18 |
It's not an attack, it's just bots "doing their job". When you search for a page via a search engine (Google, Yahoo, MSN, etc.), the way they build the search results is to constantly scan Internet servers around the world. An index of the results is kept for easy & quick reference, and some sites (Google, for example) keep a copy (cache) of those pages. If you do not have a robots.txt file then a lot of bandwidth / page views can be consumed by bots indexing and downloading all your files to store them in a cache. There are also sites that specialise in keeping an archive of the Internet, so again they send bots to your server to download everything and keep copies of it. Matt has spotted this going on and modified the robots.txt file on the SETI server, which should hopefully cut the traffic and save these files from being (unnecessarily) scanned and cached. |
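The "check robots.txt, then fetch" behaviour described above is exactly what Python's standard library models, so a small sketch can show what a well-behaved bot does before touching a URL (the domain and paths below are made up for illustration):

```python
# Sketch of the Robots Exclusion Protocol from a polite crawler's point of view.
# urllib.robotparser is Python's stdlib implementation; paths/domain are hypothetical.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally the bot would fetch http://example.org/robots.txt; here we parse
# an inline file equivalent to "keep everyone out of the beta pages".
rp.parse("""\
User-agent: *
Disallow: /beta/
""".splitlines())

# The crawler asks before each fetch; disallowed URLs are simply skipped.
print(rp.can_fetch("Googlebot", "http://example.org/beta/result.php"))   # False
print(rp.can_fetch("Googlebot", "http://example.org/forum/index.php"))   # True
```

Note that this check happens entirely on the crawler's side, which is why the follow-up question below about enforcement is a good one.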
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything. |
John McLeod VII Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything. Yes, it is voluntary. However, if a bot does not follow the rules and annoys a site, the IP address of the bot can be blocked from the web site entirely. It is in the best interests of the bots to follow the rules. BOINC WIKI |
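Blocking a misbehaving bot is typically done in the web server configuration rather than in robots.txt. As a sketch only (Apache 2.2-era syntax; the address range is illustrative, not a recommendation to block Google):

```
# Hypothetical Apache 2.2 snippet: deny a crawler's address range outright.
<Directory "/var/www/beta">
    Order allow,deny
    Allow from all
    Deny from 192.0.2.0/24    # example range for an abusive bot
</Directory>
```

Unlike robots.txt, this is enforced by the server itself: matching clients get a 403 regardless of whether they ever read the robots file.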
KWSN THE Holy Hand Grenade! Send message Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
Matt - early warning, the .XML stats is on the fritz again, or the numbers have become stagnant. I currently show 4000 credits on this site that the stats sites don't know about. . Hello, from Albany, CA!... |
Dr Who Fan Send message Joined: 8 Jan 01 Posts: 3225 Credit: 715,342 RAC: 4 |
Matt - early warning, the .XML stats is on the fritz again, or the numbers have become stagnant. I currently show 4000 credits on this site that the stats sites don't know about. According to http://boincstats.com/stats/project_graph.php?pr=sah it's been over 24 hours since SETI last exported stats. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything. As John pointed out, it isn't enforced, and there are even some spiders that read the robots.txt file and only search the excluded resources (usually fishing for spammable addresses). If they spider slowly, I don't really care what someone else does to access my server. If they start to become a drain on resources, they get attention, and there are always ways to enforce the rules. ... and if a spider like EMailSiphon wants to scan an excluded resource, I've got pages full of bogus E-Mail addresses at the ready. |
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 30673 Credit: 53,134,872 RAC: 32 |
Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything. Which just slows the net for everyone else. Just deny them access to the site in its entirety when you find them. |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
Matt - early warning, the .XML stats is on the fritz again, or the numbers have become stagnant. I currently show 4000 credits on this site that the stats sites don't know about. This has been updated on BoincStats to 2008-10-25 18:54:21 GMT BUT the numbers reported in the XML are about the same as the previous day. In other words, the XML's are being created but are not reporting any of the work done. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
This has been updated on BoincStats to 2008-10-25 18:54:21 GMT BUT the numbers reported in the XML are about the same as the previous day. In other words, the XML's are being created but are not reporting any of the work done. Yes, BOINCstats is showing a daily increment of 271,784 credits, where it would normally be over 40 million. That normally happens when the replica BOINC database (sidious) fails to keep in step with the primary database (jocelyn). |
DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Yes, BOINCstats is showing a daily increment of 271,784 credits, where it would normally be over 40 million. There was a bad drive that was left to rot over the weekend. Hopefully when Matt (or somebody) replaces the bad drive on Monday, he can check the DB replication performance. I still can't believe they don't have "hot swap". |
KWSN THE Holy Hand Grenade! Send message Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
BTW, the "client connection stats" script needs a kick again - and has for the last 3 weeks... currently showing only the line: last updated: January 01 1970 00:00:00. Which can't be right! . Hello, from Albany, CA!... |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.