Tuesday, January 23, 2007

Robots.txt and Search Engines

It is not sexy, but it is useful. The robots.txt file is supposed to tell robots/bots/crawlers where they can crawl on a web site. It must be placed in the site's root folder. The Robot Exclusion Standard (RES) is a proposal that has never been formally agreed on. However, the respectable search engines on the web voluntarily follow it, with their own extensions. There are, however, reported cases where they allegedly fail to comply with the instructions in robots.txt.


Key points

  • # - anything that follows it on the line is a comment
  • * - wildcard meaning everything (in User-agent, all robots)
  • User-agent: - the name of the robot/bot/spider/crawler the rules apply to
  • Disallow: - (empty value) allow access to all of the site
  • Disallow: / - Disallow access to all of site
  • Disallow: /images/ - do not crawl the images folder
  • Disallow: /folder/filename.html - do not crawl filename.html in folder
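
The key points above can be tried out with a short Python sketch. It uses the standard library's urllib.robotparser; the helper name allowed, the agent name ExampleBot and the host www.example.com are made up for illustration. Different crawlers may interpret rules slightly differently, so treat this as an approximation rather than a statement of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text.

    `ExampleBot` and the example.com host are placeholders, not real crawlers or sites.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "http://www.example.com" + path)


# Three tiny robots.txt files, one per rule from the key points.
ALLOW_ALL = "User-agent: *\nDisallow:\n"              # empty Disallow: nothing is blocked
BLOCK_ALL = "User-agent: *\nDisallow: /\n"            # everything is blocked
BLOCK_IMAGES = "User-agent: *\nDisallow: /images/\n"  # only the images folder is blocked

print(allowed(ALLOW_ALL, "/images/logo.png"))     # True
print(allowed(BLOCK_ALL, "/index.html"))          # False
print(allowed(BLOCK_IMAGES, "/images/logo.png"))  # False
print(allowed(BLOCK_IMAGES, "/index.html"))       # True
```

Note that a Disallow value is matched as a path prefix, which is why /images/ covers every file inside that folder.
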
Sample File
----------------------------------------------------
# Prevent all user agents from accessing the following folders
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

# Specific instructions for specific user-agents
User-agent: msnbot
Disallow: /cgi-bin/
Disallow: /images/
Crawl-delay: 10

User-agent: Teoma
Disallow: /cgi-bin/
Disallow: /images/
Crawl-delay: 10

User-agent: Slurp
Disallow: /cgi-bin/
Disallow: /images/
Crawl-delay: 10


------------------------------------------------------------

Allow: and Crawl-delay: are not part of the standard, but they are supported by MSN and Yahoo.
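
As a rough check of a file like the sample, the sketch below feeds a shortened copy of it to Python's standard urllib.robotparser and asks what a given robot may fetch and how long it is asked to wait. The host www.example.com and the agent name SomeOtherBot are placeholders, and crawl_delay() requires Python 3.6 or later. This shows how Python's parser reads the file, which is not necessarily how MSN, Yahoo or Ask read it.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the sample file: the catch-all block plus the msnbot block.
SAMPLE = """\
# Prevent all user agents from accessing the following folders
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

# Specific instructions for specific user-agents
User-agent: msnbot
Disallow: /cgi-bin/
Disallow: /images/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# msnbot matches its own block, so it gets the folder rules and the delay.
print(parser.can_fetch("msnbot", "http://www.example.com/images/logo.png"))  # False
print(parser.can_fetch("msnbot", "http://www.example.com/index.html"))       # True
print(parser.crawl_delay("msnbot"))                                          # 10

# A robot with no block of its own falls back to "User-agent: *",
# which sets no Crawl-delay in this sample.
print(parser.crawl_delay("SomeOtherBot"))                                    # None
```

To test a live site instead of a string, the same class offers set_url() and read(), which download the robots.txt before can_fetch() is called.
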

For information on the what, why and how, see:
http://www.robotstxt.org/wc/faq.html

What the Big Three plus one say

ASK
http://about.ask.com/en/docs/about/webmasters.shtml

Yahoo
http://help.yahoo.com/help/us/ysearch/slurp/slurp-02.html

Google - Using the robots.txt analysis tool
http://www.google.com/support/webmasters/bin/topic.py?topic=8475

MSN
http://search.msn.com.sg/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm

Tougher action

see:

Using .htaccess for site control
http://linuxreviews.org/webdesign/htaccess/

How to keep bad robots, spiders and web crawlers away
http://www.fleiner.com/bots/



robots.txt generator
http://www.mcanerin.com/EN/search-engine/robots-txt.asp


http://www.outfront.net/tutorials_02/adv_tech/robots.htm
