Key points
- # - indicates that anything following it on the line is a comment
- * - wildcard meaning everything (as a User-agent value it matches every robot)
- User-agent - The name of the robot/bot/spider/crawler
- Disallow: - (empty value) allow access to the whole site
- Disallow: / - disallow access to the whole site
- Disallow: /images/ - do not crawl the images folder
- Disallow: /folder/filename.html - do not crawl filename.html in folder
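
The key points above can be checked with a short script. This is only a sketch, assuming Python 3 and its standard-library urllib.robotparser module; the helper name can_crawl, the agent name "examplebot" and the test paths are made up for illustration. Real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`.

    Hypothetical helper for illustration; it only wraps the stdlib parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# One tiny rule set per key point (all apply to every robot via "*").
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_IMAGES = "User-agent: *\nDisallow: /images/  # comments are ignored\n"
BLOCK_ONE_FILE = "User-agent: *\nDisallow: /folder/filename.html\n"

if __name__ == "__main__":
    print(can_crawl(ALLOW_ALL, "examplebot", "/page.html"))
    print(can_crawl(BLOCK_ALL, "examplebot", "/page.html"))
    print(can_crawl(BLOCK_IMAGES, "examplebot", "/images/pic.gif"))
    print(can_crawl(BLOCK_ONE_FILE, "examplebot", "/folder/other.html"))
```

An empty Disallow: value lets everything through, while Disallow: / blocks everything, which is the distinction that trips most people up.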
----------------------------------------------------
# Prevent all user agents from accessing the following folders
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
# Specific instructions for specific user-agents
User-agent: msnbot
Disallow: /cgi-bin/
Disallow: /images/
Crawl-delay: 10
User-agent: Teoma
Disallow: /cgi-bin/
Disallow: /images/
Crawl-delay: 10
User-agent: Slurp
Disallow: /cgi-bin/
Disallow: /images/
Crawl-delay: 10
------------------------------------------------------------
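
To see how a parser might read the sample file above, here is a rough sketch, again assuming Python 3 (3.6 or later for crawl_delay) and the standard-library urllib.robotparser. The agent name "otherbot" and the test paths are invented for the example; this shows how one parser treats the file, not how msnbot, Teoma or Slurp themselves behave.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from above, embedded as a string for testing.
SAMPLE = """\
# Prevent all user agents from accessing the following folders
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
# Specific instructions for specific user-agents
User-agent: msnbot
Disallow: /cgi-bin/
Disallow: /images/
Crawl-delay: 10
User-agent: Teoma
Disallow: /cgi-bin/
Disallow: /images/
Crawl-delay: 10
User-agent: Slurp
Disallow: /cgi-bin/
Disallow: /images/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

if __name__ == "__main__":
    for agent in ("msnbot", "Teoma", "Slurp", "otherbot"):
        print(
            agent,
            parser.can_fetch(agent, "/images/pic.gif"),  # blocked folder
            parser.can_fetch(agent, "/index.html"),      # not listed, so allowed
            parser.crawl_delay(agent),                   # None if no Crawl-delay applies
        )
```

Robots with no section of their own fall back to the "*" rules, so they are kept out of the same folders but get no crawl delay.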
Allow: and Crawl-delay: are not part of the original standard, but they are supported by MSN and Yahoo
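
As a rough illustration of Allow:, the sketch below opens up one file inside an otherwise blocked folder. It assumes Python 3's standard-library urllib.robotparser, which applies rules in the order they appear in the file, so the Allow: line is placed first; other robots may resolve overlapping rules differently. The file names and the agent name "examplebot" are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /images/ except for one file.
RULES = """\
User-agent: *
Allow: /images/logo.png
Disallow: /images/
"""

allow_parser = RobotFileParser()
allow_parser.parse(RULES.splitlines())

if __name__ == "__main__":
    # The Allow: line matches first, so the logo stays crawlable.
    print(allow_parser.can_fetch("examplebot", "/images/logo.png"))
    # Everything else under /images/ hits the Disallow: line.
    print(allow_parser.can_fetch("examplebot", "/images/photo.png"))
```

Because support for Allow: varies between robots, it is safer to test a rule like this against each search engine's own tools than to rely on one parser's behaviour.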
For information on the what, why and how, see
http://www.robotstxt.org/wc/faq.html
What the Big Three plus one say
ASK
http://about.ask.com/en/docs/about/webmasters.shtml
Yahoo
http://help.yahoo.com/help/us/ysearch/slurp/slurp-02.html
Google (using the robots.txt analysis tool)
http://www.google.com/support/webmasters/bin/topic.py?topic=8475
MSN
http://search.msn.com.sg/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm
Tougher action
see:
Using .htaccess for site control
http://linuxreviews.org/webdesign/htaccess/
How to keep bad robots, spiders and web crawlers away
http://www.fleiner.com/bots/
robots.txt generator
http://www.mcanerin.com/EN/search-engine/robots-txt.asp
http://www.outfront.net/tutorials_02/adv_tech/robots.htm