Wednesday, April 18, 2012

Effective use of robots.txt

Of late I have been doing some work on SEO and got an opportunity to play around with the robots.txt file and apply various rules. I would like to share the common understanding around it; feel free to provide your comments or share your experiences.

As part of sensible SEO practice, it's important to keep a firm grasp on exactly what information we don't want crawled!
A robots.txt file restricts access to your site by the search engine robots that crawl the web. These bots are automated, and before they access pages of a site they check whether a robots.txt file exists that prevents them from accessing certain pages.
You need a robots.txt file only if your site includes content that you don't want search engines to index. If you want search engines to index everything on your site, you don't need a robots.txt file at all.

The simplest robots.txt file uses two rules:
User-agent: the robot the following rule applies to
Disallow: the URL you want to block

These two lines are considered a single entry in the file. You can include as many entries as you want, and a single entry can include multiple Disallow lines and apply to multiple user-agents, as shown in the examples below.

Some examples below:

User-agent: *
Disallow: /images/

User-agent: Googlebot
Disallow: /archive/
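
A single entry can also combine several user-agents and Disallow lines. A sketch of such a combined entry is below (the /tmp/ path is just a placeholder, and Bingbot is simply another common crawler used for illustration):

User-agent: Googlebot
User-agent: Bingbot
Disallow: /tmp/
Disallow: /checkout.jsp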

The Disallow line lists the pages you want to block. You can list a specific URL or a pattern; the value should begin with a forward slash (/).

  • To block the entire site, use a forward slash.

Disallow: /

  • To block a directory and everything in it, follow the directory name with a forward slash.

Disallow: /archive-directory/

  • To block a page, list the page.

Disallow: /checkout.jsp

  • To remove a specific image from Google Images, add the following:

User-agent: Googlebot-Image
Disallow: /images/logo.jpg

  • To remove all images on your site from Google Images:

User-agent: Googlebot-Image
Disallow: /

  • To block files of a specific file type (for example, .gif), use the following:

User-agent: Googlebot
Disallow: /*.gif$

  • To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:

User-agent: Googlebot
Disallow: /*.xls$
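
Putting a few of these rules together, a complete robots.txt could look something like this (the paths are only illustrative, not a recommendation for any particular site):

User-agent: *
Disallow: /images/
Disallow: /checkout.jsp

User-agent: Googlebot-Image
Disallow: /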

We can restrict crawling where it's not needed with robots.txt.
A "robots.txt" file tells search engines whether they can access, and therefore crawl, parts of your site. This file, which must be named "robots.txt", is placed in the root directory of your site, e.g. www.example.com/robots.txt.

If you have a multi-country site, then each country site should have its own robots.txt.
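
For example, if the country sites are served from separate hosts (the hosts below are just illustrative), each host needs its own file at its root:

www.example.com/robots.txt
www.example.co.uk/robots.txt
www.example.de/robots.txt
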
For further reading, follow these links on generating and using robots.txt:

robots.txt generator
Using robots.txt files
Caveats of each URL blocking method

Kindly note that Google can process only up to 500 KB of your robots.txt file; content beyond that limit may be ignored.
