Robots.txt Optimization and Its Benefits in SEO

Introduction to Robots.txt

The robots.txt file is a very important on-page SEO factor: it tells search engines which parts of your website or blog they may crawl and index and which pages they should leave out. The robots.txt file needs to be placed in the root directory of your website. You can find an example of a well-formed robots.txt file here: http://www.delhitrainingcourses.com/robots.txt . This file gives search engines instructions about which parts of your website they are allowed to visit and index.

For the robots.txt file to work, you must upload it to the root directory of your website, that is, directly into your www folder and not into any subdirectory such as www.yourdomain.com/subdirectory/robots.txt.
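In other words, crawlers only look for the file at the root of your host, so (using the example domain above) it should be reachable like this:

Read by crawlers: http://www.yourdomain.com/robots.txt
Ignored by crawlers: http://www.yourdomain.com/subdirectory/robots.txt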

All major search engines, including Google, check the robots.txt file in your root folder for instructions related to crawling and indexing.

You can even generate a robots.txt file with an online tool such as: http://www.mcanerin.com/EN/search-engine/robots-txt.asp .

What does a robots.txt file look like?

Always use a plain text editor such as Notepad to create the robots.txt file. Once the file is created, paste the following directives into it and save it as robots.txt.

User-agent: *
Disallow:

This is the simplest form of a robots.txt file: it allows all search engines to visit, crawl and index every page of your website or blog, including all directories and sub-directories.

But if you want to block everything on your website or blog, use:

User-agent: *
Disallow: /

This will remove all the pages of your website from the search engines' indexes.

The difference between the two codes is a single slash (/), so use this feature carefully. If you add the slash by accident, your entire site will drop out of the search engines.

Now, if certain parts of your website should only be available to a particular group of people and should not be publicly viewable, such as presentations in .ppt format that contain information about your company, you can restrict them with the following code:

User-agent: *
Disallow: /presentations/*.ppt

Major Known Spiders / Crawlers

Google
Google Image Search
Bing
Yahoo
Ask (Mozilla/2.0 (compatible; Ask Jeeves/Teoma))
Scrub The Web
Robozilla (DMOZ)
Gigabot (Gigablast)
Twiceler (Cuil)

How to restrict a specific search engine with robots.txt

Suppose you want to stop a particular crawler from indexing part of your website or blog. For example, you may want the Google Image bot not to index the images on your website because they are copyrighted and you do not want others to use them. Images normally come in .png, .gif or .jpg format, and you can use the following robots.txt code to restrict the Google Image bot:

User-agent: Googlebot-Image
Disallow: /*.gif$
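
The same pattern can be extended to the other image formats mentioned above; as a sketch:

User-agent: Googlebot-Image
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.png$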

You can do the same for other search engine bots, such as those of Yahoo or Bing.

Examples:

User-agent: msnbot
Disallow: /*.ppt$
Disallow: /*.png$
Disallow: /*.exe$

Benefits of using Robots.txt File

There are several benefits of using a robots.txt file in SEO:

It saves your website's bandwidth: Your web hosting provider gives you a bandwidth or traffic limit. By restricting unnecessary pages of your website, robots.txt eliminates the unnecessary crawler traffic they attract, so spiders do not visit irrelevant pages and directories such as your cgi-bin folder.
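For example, assuming your scripts live in a /cgi-bin/ directory, a rule along these lines keeps all crawlers out of that folder:

User-agent: *
Disallow: /cgi-bin/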

It gives you a degree of protection: It is not a strong security measure, but it stops people who arrive through search engines from finding the restricted parts of your blog or website. Keep in mind that people can still access a restricted document by typing its exact URL directly into the browser.
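As a sketch, assuming the confidential files sit in a hypothetical /private/ directory, you could keep them out of search results with:

User-agent: *
Disallow: /private/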

It cleans up your logs: Every time a search engine comes to your site in response to a user query, it also requests your robots.txt file, and this can happen many times in a single day. If the search engine does not find a robots.txt file, it triggers a "404 Not Found" error each time, and these errors pile up in your site's logs. 404 errors are also not good from an SEO point of view.

It protects against Google's duplicate content penalties: If you have several pages containing the same content, Google can catch you for duplicate content. But if you block the unnecessary duplicates from search engines and allow only the one genuine page to be indexed, you will stay clear of the penalization policies of Google and other search engines.
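As a sketch, assuming the duplicates are printer-friendly copies served from a hypothetical /print/ directory, the following would leave only the original pages available for indexing:

User-agent: *
Disallow: /print/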

Go to Chapter 8: Ways to Improve Your Internal Linking Strategy

 
