Robots.txt Exclusion Implementation Guide
If you are reading this article to learn how to implement robots.txt, I assume that you are a webmaster and know how a robots.txt file works.
For those who do not know much about robots.txt, here is some background.
What is robots.txt and do I need it?
The robots.txt file is a structured text file placed in the root folder of a website. It tells search engine spiders which files, folders, or URLs should be excluded from crawling and indexing. It acts somewhat like an access control list, although it is only advisory: well-behaved robots honor it, but it does not actually block access. Since all major search engines look for a robots.txt file on your site, it’s recommended that you provide one. Even if you do not want to exclude any content from being indexed, you will be following a web standard [the Robots Exclusion Protocol, long a de facto standard and published as RFC 9309 in 2022].
Format of a robots.txt file
It contains records of one or more lines in the format shown below:-
#This is comment line. Comment line starts with #
User-agent: [Name or *]
Disallow:[optionalspace][what-to-exclude][optionalspace]
Disallow:[optionalspace][what-to-exclude][optionalspace]
Each Disallow line follows a User-agent line, which names the robot the record applies to [or * for all robots]. Any line that starts with # is treated as a comment.
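The record format above can be sketched in a few lines of Python. This is only an illustration: the function name build_robots_txt and the (user agent, paths) input shape are assumptions made for this example, not part of any standard tool.

```python
def build_robots_txt(records):
    """Render (user_agent, [paths]) pairs as robots.txt text.

    Hypothetical helper for illustration only.
    """
    blocks = []
    for user_agent, disallowed_paths in records:
        lines = [f"User-agent: {user_agent}"]
        # An empty path list becomes a bare "Disallow:", which excludes nothing.
        for path in disallowed_paths or [""]:
            lines.append(f"Disallow: {path}".rstrip())
        blocks.append("\n".join(lines))
    # Records are separated from each other by a blank line.
    return "\n\n".join(blocks) + "\n"


example = build_robots_txt([
    ("googlebot", ["/images", "/content/html/"]),
    ("*", []),
])
print(example)
```

Each tuple becomes one record: a User-agent line followed by one Disallow line per path.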
An example robots.txt file:-
# An example of robots.txt
User-agent: *
Disallow: /
In the above example, the line [User-agent: *] applies to all robots. The second line tells all robots not to index anything on this site. The forward slash stands for the start of every path that follows your domain name, as in http://www.your-domain.com/ [Here the trailing slash is the start of every possible link on the website, such as http://www.your-domain.com/about or http://www.your-domain.com/sitemap/]
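One way to check how a compliant robot would read this file is Python's standard urllib.robotparser module. The sketch below feeds it the same three lines directly instead of fetching them from a server; the robot name ExampleBot is made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
# An example of robots.txt
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
# parse() takes the file content as a list of lines; no network access is needed.
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical robot name; the * record applies to it.
about_allowed = parser.can_fetch("ExampleBot", "http://www.your-domain.com/about")
sitemap_allowed = parser.can_fetch("ExampleBot", "http://www.your-domain.com/sitemap/")
print(about_allowed, sitemap_allowed)
```

Both checks should come back False, since every path on the site begins with the forward slash.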
Another example, but this time we will restrict only Google’s search engine spider.
#This is to limit Google from indexing the site
User-agent: googlebot
Disallow: /images
Disallow: /content/html/
Disallow: /content/index/index.html
In the above example, googlebot is the name of Google’s search engine spider. It is disallowed from indexing anything under the /images and /content/html/ folders, and from indexing the index.html file [only] in the /content/index/ folder. Because rules match by path prefix, /images also covers any other path that begins with /images. [See a list of known robot names]
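The effect of this record can be tried out with Python's standard urllib.robotparser module. This is a rough sketch that parses the lines locally; the file names logo.png and other.html and the robot name OtherBot are invented for the illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
#This is to limit Google from indexing the site
User-agent: googlebot
Disallow: /images
Disallow: /content/html/
Disallow: /content/index/index.html
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://www.your-domain.com"
# Hypothetical paths used only to exercise the rules.
image_blocked = not parser.can_fetch("googlebot", site + "/images/logo.png")
index_blocked = not parser.can_fetch("googlebot", site + "/content/index/index.html")
sibling_allowed = parser.can_fetch("googlebot", site + "/content/index/other.html")
# A robot with a different name has no matching record, so nothing is excluded for it.
other_bot_allowed = parser.can_fetch("OtherBot", site + "/images/logo.png")
print(image_blocked, index_blocked, sibling_allowed, other_bot_allowed)
```

All four values should be True: googlebot is kept out of the listed paths, the other files in /content/index/ remain open to it, and robots with other names are unaffected.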
How to create my own robots.txt
It’s very simple. Using Notepad or any plain text editor, create a plain text file named robots.txt [lower case], put your required content in it, and upload it to your domain’s web server. Place it in the root folder so that it is accessible at the URL http://www.your-domain.com/robots.txt and you are done. [You must be able to access robots.txt this way]
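Since robots look for the file only at the root of the site, it can help to work out the expected address from any page URL. The following is a minimal sketch; the function name robots_txt_url is an assumption made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site that serves page_url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_txt_url("http://www.your-domain.com/content/html/page.html")
print(location)
```

Opening the printed address in a browser after uploading is a quick way to confirm that the file is where robots will look for it.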
A robots.txt file for your site that excludes nothing looks like this:
User-agent: *
Disallow:
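As a quick check that an empty Disallow line really excludes nothing, the same two lines can be parsed with Python's standard urllib.robotparser module. The robot name ExampleBot is made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow:"])

# "ExampleBot" is a hypothetical robot name; an empty Disallow blocks no path.
home_allowed = parser.can_fetch("ExampleBot", "http://www.your-domain.com/")
deep_allowed = parser.can_fetch("ExampleBot", "http://www.your-domain.com/content/html/page.html")
print(home_allowed, deep_allowed)
```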
Other uses of a robots.txt file
A robots.txt file is mainly used to keep content from being indexed. If some pages of your site have already been indexed and you now want them to drop out of the search engine’s results and cache, you can list those URLs in robots.txt. Compliant spiders will then stop fetching them, though removal from the results can take time and is not guaranteed by robots.txt alone.
Tools that generate robots.txt automatically can be found by searching Google, and you may also use the free tool at seochat.com
For more information on robots.txt visit - robotstxt.org
