Tired of Waiting (Oct 23 2008)

Matt Lebofsky (Volunteer moderator · Project administrator · Project developer · Project scientist)
Joined: 1 Mar 99 · Posts: 1444 · Credit: 957,058 · RAC: 0 · United States
Message 822345 - Posted: 23 Oct 2008, 20:55:56 UTC

There have been some problems with the web server lately that have been hard to track down. However, this morning I found we were being crawled fairly heavily by the Googlebot. I thought I had taken care of that ages ago with a proper robots.txt file! Then I realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for the beta-specific web pages. This may help.
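For readers wondering what those extra lines look like: robots.txt is just a plain-text file served from the site's web root, and each "Disallow" line names a path prefix crawlers should skip. A minimal sketch of the kind of addition described here; the paths are hypothetical placeholders, not the beta site's actual layout:

    # robots.txt - read voluntarily by well-behaved crawlers
    User-agent: *
    # main-project result pages (already excluded)
    Disallow: /result.php
    Disallow: /results.php
    # beta-specific result pages (the newly added lines; paths illustrative)
    Disallow: /beta/result.php
    Disallow: /beta/results.php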

So we've been getting these frequent, scary, but ultimately harmless kernel warnings on bruno, our upload server. Research by Jeff showed that a quick kernel upgrade would fix them. We brought the kernel in yesterday and rebooted this morning to pick it up. The new kernel was fine, but it exposed a set of other mysterious problems, mostly centered on our upload storage partition (which is software RAIDed): lots of confusing/misleading fsck freakouts, mounting failures, disk label conflicts, etc. Eventually we convinced the system everything was okay, though not before a series of long, boring reboots.
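An aside for anyone untangling a similar post-reboot mess: on a Linux software-RAID box the usual first checks look something like the following sketch (the device names are hypothetical, not bruno's actual layout):

    cat /proc/mdstat           # overview: which md arrays exist, whether any are degraded or resyncing
    mdadm --detail /dev/md0    # per-array detail: state, member disks, failed and spare counts
    fsck -n /dev/md0           # read-only filesystem check: reports problems, changes nothing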

Speaking of RAID, I still haven't put the new spare into bambi. It's late enough in the week that I don't want to mess around with any hardware, especially after dealing with the above. Besides, the particular RAID array in question is now one drive away from degradation (no big deal), and two drives away from failure. And it's a replica of the science database - the primary is in good shape and is backed up weekly. So no need to panic - we'll get the drive in there early next week.
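If bambi's array is Linux software RAID like bruno's upload partition (an assumption; the post doesn't say), dropping in the spare later is typically a one-liner. md keeps the new disk as a hot spare and pulls it in automatically when a member drive fails, or immediately if the array is already degraded:

    mdadm --add /dev/md1 /dev/sdh1    # hypothetical array and disk names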

Speaking of the science database, I'm finding that our signal tables (pulse, triplet, spike, gaussian) are now large enough that Informix guesses, for certain "expensive" queries, that the indexes aren't worth using, and reverts to sequential scans which take forever. This has to be addressed sooner rather than later.
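For the database-minded: the standard first moves in Informix here are to refresh the optimizer's statistics, so its cost estimates reflect the tables' current sizes, and then to inspect the plan it actually picks. A sketch; the table name comes from the post, but the statistics mode and the query itself are illustrative:

    -- refresh optimizer statistics so index selectivity is estimated from current data
    UPDATE STATISTICS MEDIUM FOR TABLE spike;

    -- write the optimizer's chosen plans (index path vs. sequential scan) to sqexplain.out
    SET EXPLAIN ON;
    SELECT COUNT(*) FROM spike;    -- stand-in for one of the "expensive" queries
    SET EXPLAIN OFF;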

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

BroncoBob9
Joined: 29 May 03 · Posts: 62 · Credit: 2,443,241 · RAC: 0 · United States
Message 822347 - Posted: 23 Oct 2008, 21:00:26 UTC - in response to Message 822345.

Thanks for the update, Matt!

Gary Charpentier (Volunteer tester)
Joined: 25 Dec 00 · Posts: 30651 · Credit: 53,134,872 · RAC: 32 · United States
Message 822348 - Posted: 23 Oct 2008, 21:02:32 UTC - in response to Message 822345.

Thanks for the update. Looks like you are going to be busy for a while.


Dr. C.E.T.I.
Joined: 29 Feb 00 · Posts: 16019 · Credit: 794,685 · RAC: 0 · United States
Message 822371 - Posted: 23 Oct 2008, 22:12:34 UTC


. . . Each of You @ Berkeley are to be Commended on a job well done - and Thanks for the Post Update Matt


BOINC Wiki . . .

Science Status Page . . .

AndyW (Volunteer tester)
Joined: 23 Oct 02 · Posts: 5862 · Credit: 10,957,677 · RAC: 18 · United Kingdom
Message 822825 - Posted: 24 Oct 2008, 18:10:36 UTC - in response to Message 822345.

There have been some problems with the web server lately that have been hard to track down. However, this morning I found we were being crawled fairly heavily by the Googlebot. I thought I had taken care of that ages ago with a proper robots.txt file! Then I realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for the beta-specific web pages. This may help.

I've suffered that before with my own web server. Poor thing was crawled to a slow and painful death by bots!

PhonAcq
Joined: 14 Apr 01 · Posts: 1656 · Credit: 30,658,217 · RAC: 1 · United States
Message 822886 - Posted: 24 Oct 2008, 22:32:36 UTC - in response to Message 822825.

There have been some problems with the web server lately that have been hard to track down. However, this morning I found we were being crawled fairly heavily by the Googlebot. I thought I had taken care of that ages ago with a proper robots.txt file! Then I realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for the beta-specific web pages. This may help.

I've suffered that before with my own web server. Poor thing was crawled to a slow and painful death by bots!

Can you explain what is happening when a 'googlebot' attacks? Is this a DoS thing, with lots of rapid-fire hits?

arkayn (Volunteer tester)
Joined: 14 May 99 · Posts: 4438 · Credit: 55,006,323 · RAC: 0 · United States
Message 822921 - Posted: 25 Oct 2008, 0:33:20 UTC - in response to Message 822886.

No, it is just indexing the site.

The problem is that having all of those work units available for everyone to see also allowed the Googlebot to index all of those pages.


1mp0£173 (Volunteer tester)
Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0 · United States
Message 822976 - Posted: 25 Oct 2008, 3:37:45 UTC - in response to Message 822921.

The problem is that having all of those work units available for everyone to see also allowed the Googlebot to index all of those pages.

Google (and the rest of the search engines) allows a site to publish a set of instructions (a "robots.txt" file) that tell the spider which parts of the site are not to be indexed or visited.

The standard is here: http://www.robotstxt.org/.

If you click on one of our names to the left of a post, it leads to that user's "computers" page (if it isn't hidden) and from there into the results. Those link to other computers and other results.

... and all of those pages will disappear when the work units transition.

There is no reason for a search engine to visit them -- and apparently, the main project had the right exclusions, but the beta project didn't.

So he updated the robots.txt file.
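To make the "voluntary" part concrete: a polite crawler fetches /robots.txt first and checks each URL against it before requesting the page. Python's standard library even ships a parser for the format. A minimal sketch (the URLs are illustrative):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://setiathome.berkeley.edu/robots.txt")   # illustrative URL
    rp.read()   # fetch and parse the site's exclusion rules

    # a compliant bot simply skips any URL for which can_fetch() returns False
    print(rp.can_fetch("Googlebot", "http://setiathome.berkeley.edu/results.php?hostid=12345"))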

Gary Charpentier (Volunteer tester)
Joined: 25 Dec 00 · Posts: 30651 · Credit: 53,134,872 · RAC: 32 · United States
Message 823048 - Posted: 25 Oct 2008, 6:11:15 UTC - in response to Message 822976.

The problem is that having all of those work units available for everyone to see also allowed the Googlebot to index all of those pages.

Google (and the rest of the search engines) allows a site to publish a set of instructions (a "robots.txt" file) that tell the spider which parts of the site are not to be indexed or visited.

The standard is here: http://www.robotstxt.org/.

If you click on one of our names to the left of a post, it leads to that user's "computers" page (if it isn't hidden) and from there into the results. Those link to other computers and other results.

... and all of those pages will disappear when the work units transition.

There is no reason for a search engine to visit them -- and apparently, the main project had the right exclusions, but the beta project didn't.

So he updated the robots.txt file.

As you know, being a volunteer, it isn't quite like that on the Beta site. The results there aren't deleted 24 hours after validation; they hang around for a long time for error-checking reasons. Millions of pages to index.



AndyW (Volunteer tester)
Joined: 23 Oct 02 · Posts: 5862 · Credit: 10,957,677 · RAC: 18 · United Kingdom
Message 823077 - Posted: 25 Oct 2008, 8:34:18 UTC - in response to Message 822886.


Can you explain what is happening when a 'googlebot' attacks? Is this a DoS thing, with lots of rapid-fire hits?

It's not an attack; it's just bots "doing their job".

When you search for a page via a search engine (Google, Yahoo, MSN, etc.), the results come from an index that the engine builds by constantly scanning web servers around the world. The index is kept for easy, quick reference, and some engines (Google, for example) also keep a copy (cache) of the pages themselves.

If you do not have a robots.txt file, a lot of bandwidth and page views can be consumed by bots indexing and downloading all your files to store them in a cache.

There are also sites that specialise in keeping an archive of the Internet, so again they send bots to your server to download everything and keep copies of it.

Matt spotted this going on and modified the robots.txt file on the SETI server, which should hopefully cut the traffic and save these files from being (unnecessarily) scanned and cached.
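One more robots.txt knob worth mentioning here: some crawlers of the day (Yahoo's and MSN's, though notably not Google's) honored a nonstandard Crawl-delay directive, which throttles how fast a bot may hit a server even on pages it is allowed to index. A sketch:

    User-agent: *
    # nonstandard extension: asks compliant bots to wait ~10 seconds between requests
    Crawl-delay: 10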

PhonAcq
Joined: 14 Apr 01 · Posts: 1656 · Credit: 30,658,217 · RAC: 1 · United States
Message 823134 - Posted: 25 Oct 2008, 13:13:53 UTC

Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything.

John McLeod VII (Volunteer developer · Volunteer tester)
Joined: 15 Jul 99 · Posts: 24806 · Credit: 790,712 · RAC: 0 · United States
Message 823219 - Posted: 25 Oct 2008, 18:33:31 UTC - in response to Message 823134.

Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything.

Yes, it is voluntary. However, if a bot does not follow the rules and annoys a site, the IP address of the bot can be blocked from the web site entirely. It is in the best interests of the bots to follow the rules.
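For the curious, "blocked entirely" usually happens at the web server or firewall rather than in robots.txt, which can only ask nicely. Two common forms, as a sketch (the address range is a documentation example, and the Apache directives shown are the 2.2-era syntax):

    # Apache 2.2: refuse a misbehaving bot by source address
    <Directory "/var/www/html">
        Order allow,deny
        Allow from all
        Deny from 192.0.2.0/24
    </Directory>

    # or drop its packets outright at the firewall
    iptables -A INPUT -s 192.0.2.0/24 -j DROP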


BOINC WIKI

KWSN THE Holy Hand Grenade! (Volunteer tester)
Joined: 20 Dec 05 · Posts: 3187 · Credit: 57,163,290 · RAC: 0 · United States
Message 823221 - Posted: 25 Oct 2008, 18:34:57 UTC

Matt - early warning: the XML stats are on the fritz again, or the numbers have become stagnant. I currently show 4000 credits on this site that the stats sites don't know about.

Hello, from Albany, CA!...

Dr Who Fan (Volunteer tester)
Joined: 8 Jan 01 · Posts: 3213 · Credit: 715,342 · RAC: 4 · United States
Message 823258 - Posted: 25 Oct 2008, 21:09:19 UTC - in response to Message 823221.

Matt - early warning: the XML stats are on the fritz again, or the numbers have become stagnant. I currently show 4000 credits on this site that the stats sites don't know about.

According to http://boincstats.com/stats/project_graph.php?pr=sah, it's been over 24 hours since SETI last exported stats:

  • Last update user XML 2008-10-24 18:53:55 GMT
  • Last update host XML 2008-10-24 21:01:37 GMT
  • Last update team XML 2008-10-24 21:08:12 GMT



1mp0£173 (Volunteer tester)
Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0 · United States
Message 823276 - Posted: 25 Oct 2008, 21:45:41 UTC - in response to Message 823134.

Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything.

As John pointed out, it isn't enforced, and there are even some spiders that read the robots.txt file and then search only the excluded resources (usually phishing for spammable addresses).

If they spider slowly, I don't really care what someone else does to access my server.

If they start to become a drain on resources, they get attention, and there are always ways to enforce the rules.

... and if a spider like EMailSiphon wants to scan an excluded resource, I've got pages full of bogus E-Mail addresses at the ready.

Gary Charpentier (Volunteer tester)
Joined: 25 Dec 00 · Posts: 30651 · Credit: 53,134,872 · RAC: 32 · United States
Message 823310 - Posted: 25 Oct 2008, 23:09:20 UTC - in response to Message 823276.

Curious. How is this protocol enforced? It sounds like the bot 'voluntarily' reads the /robots.txt file, but I don't see where the server can enforce anything.

As John pointed out, it isn't enforced, and there are even some spiders that read the robots.txt file and then search only the excluded resources (usually phishing for spammable addresses).

If they spider slowly, I don't really care what someone else does to access my server.

If they start to become a drain on resources, they get attention, and there are always ways to enforce the rules.

... and if a spider like EMailSiphon wants to scan an excluded resource, I've got pages full of bogus E-Mail addresses at the ready.


Which just slows the net for everyone else. Just deny them access to the site in its entirety when you find them.


gomeyer (Volunteer tester)
Joined: 21 May 99 · Posts: 488 · Credit: 50,370,425 · RAC: 0 · United States
Message 823362 - Posted: 26 Oct 2008, 1:28:48 UTC - in response to Message 823258.

Matt - early warning: the XML stats are on the fritz again, or the numbers have become stagnant. I currently show 4000 credits on this site that the stats sites don't know about.

According to http://boincstats.com/stats/project_graph.php?pr=sah, it's been over 24 hours since SETI last exported stats:

  • Last update user XML 2008-10-24 18:53:55 GMT
  • Last update host XML 2008-10-24 21:01:37 GMT
  • Last update team XML 2008-10-24 21:08:12 GMT


This has been updated on BoincStats to 2008-10-25 18:54:21 GMT, BUT the numbers reported in the XML are about the same as the previous day. In other words, the XMLs are being created but are not reporting any of the work done.

Richard Haselgrove (Volunteer tester)
Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874 · United Kingdom
Message 823451 - Posted: 26 Oct 2008, 9:08:36 UTC - in response to Message 823362.

This has been updated on BoincStats to 2008-10-25 18:54:21 GMT, BUT the numbers reported in the XML are about the same as the previous day. In other words, the XMLs are being created but are not reporting any of the work done.

Yes, BOINCstats is showing a daily increment of 271,784 credits, where it would normally be over 40 million.

That normally happens when the replica BOINC database (sidious) fails to keep in step with the primary database (jocelyn).
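Background for anyone following along: the BOINC database runs on MySQL, and the thread above suggests the stats dumps are produced from the replica so the primary isn't loaded by them; when replication lags, the exports keep appearing on schedule but carry stale numbers. The standard check on the replica, in the MySQL syntax of the day:

    -- run in the mysql client on the replica (sidious);
    -- a large or NULL Seconds_Behind_Master means it has fallen behind
    SHOW SLAVE STATUS\G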

DJStarfox
Joined: 23 May 01 · Posts: 1066 · Credit: 1,226,053 · RAC: 2 · United States
Message 823537 - Posted: 26 Oct 2008, 16:10:36 UTC - in response to Message 823451.

Yes, BOINCstats is showing a daily increment of 271,784 credits, where it would normally be over 40 million.

That normally happens when the replica BOINC database (sidious) fails to keep in step with the primary database (jocelyn).


There was a bad drive that was left to rot over the weekend. Hopefully when Matt (or somebody) replaces the bad drive on Monday, he can check the DB replication performance. I still can't believe they don't have "hot swap".

KWSN THE Holy Hand Grenade! (Volunteer tester)
Joined: 20 Dec 05 · Posts: 3187 · Credit: 57,163,290 · RAC: 0 · United States
Message 823898 - Posted: 27 Oct 2008, 17:04:25 UTC

BTW, the "client connection stats" script needs a kick again - and has needed one for the last 3 weeks... currently it's showing only the line:

last updated: January 01 1970 00:00:00.

Which can't be right!
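A note on that timestamp: "January 01 1970 00:00:00" is the Unix epoch, which is what you get when a script formats a time value of zero, i.e. one that was never set. A Python one-liner reproduces the exact string:

    import time
    # formatting time zero (the Unix epoch) yields exactly the string the stats page shows
    print(time.strftime("%B %d %Y %H:%M:%S", time.gmtime(0)))   # January 01 1970 00:00:00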

Hello, from Albany, CA!...