
How to create a robots.txt

Category: Javascript.
Posted on 05/12/2013.
Last updated on 05/12/2013.

Description

This tutorial explains how to create a robots.txt file.

Introduction

robots.txt files are used by webmasters to give instructions to web crawlers (also called spiders or robots) about how they may index a website. This mechanism is called the Robots Exclusion Protocol.

In this tutorial, we will consider that our website (http://www.tutorielsenfolie.com) has the following structure:

Folder      Files
/           index.html, robots.txt
users/      customPage.html, public.html
secret/     secret.html

How does it work?

When a robot visits your website, it starts by reading the file http://www.tutorielsenfolie.com/robots.txt. Then, depending on its content, it decides whether it may read pages such as http://www.tutorielsenfolie.com/index.html. If robots.txt contains the following instructions:

User-agent: *
Disallow: /

  • User-agent: * means that the following section applies to all indexing robots.
  • Disallow: / means that the robots should not visit any page of the website.
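To see how a well-behaved crawler applies these rules, here is a minimal sketch (an illustration, not part of the original tutorial) using Python's standard urllib.robotparser module, which implements the Robots Exclusion Protocol:

from urllib.robotparser import RobotFileParser

# The example robots.txt above: every robot is excluded from the whole site.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler asks before fetching each URL.
print(parser.can_fetch("*", "http://www.tutorielsenfolie.com/index.html"))  # False

Since the Disallow: / rule applies to all user agents, can_fetch() returns False for every page of the site.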

When using robots.txt, there are two important things to keep in mind:

  1. Robots can ignore the robots.txt file. This is particularly true for malicious robots that scan your website to find security holes, harvest e-mail addresses, or for other reasons.
  2. The robots.txt file is public, which means that everyone can see which sections you want to hide from the indexing robots. robots.txt should therefore not be used to hide sensitive data.

How to write a robots.txt

The robots.txt file is composed of sections, each divided into two parts:

  • The first part describes which indexing robots the section applies to. It starts with User-agent: followed by the name of a robot (* for all robots); to target several robots, repeat the User-agent line once per robot.
  • The second part describes the folders or files that should not be taken into account when indexing the website. It is composed of lines that start with Disallow: followed by the folder or file that should not be indexed.

Example:

User-agent: Google
Disallow: /secret/
Disallow: /users/customPage.html

Note that each path needs its own Disallow line: putting several paths on a single line, such as Disallow: /secret/ /users/customPage.html, is not supported by the protocol.

NB: A section must not contain empty lines, because empty lines are used to separate two sections. Furthermore, regular expressions are not supported. The character * is a special value that means "all robots" and is only valid in the User-agent field.

Examples

Exclude all robots from the server

User-agent: *
Disallow: /

Exclude one robot from the directory secret/

User-agent: BadBot
Disallow: /secret/

Allow only one robot on the server

User-agent: Google
Disallow:
User-agent: *
Disallow: /

Exclude all robots from the server except for one file

The issue comes from the fact that the protocol has no keyword for explicitly authorizing a single file on the server. The easiest way is to put all the files in a directory (for example /noRobot/) and to place the file to be indexed outside this folder. Then you just need to write the robots.txt file like this:

User-agent: *
Disallow: /noRobot/
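As a side note, many modern crawlers (Googlebot and Bingbot among them, and the behaviour is now standardized in RFC 9309) also understand an Allow directive that was not part of the original protocol described here. Under that assumption, the same goal can be reached without moving any files:

User-agent: *
Allow: /index.html
Disallow: /

Robots that do not support Allow will simply fall back to the Disallow: / rule and ignore the whole site.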

Alternative to the robots.txt

If you don't want to, or cannot, create a robots.txt file, an alternative exists: use a meta tag to tell the robots whether they may index the page or follow the links inside the document.

<meta name="ROBOTS" content="NOINDEX" />

Indicates that the page should not be indexed.

<meta name="ROBOTS" content="NOFOLLOW" />

Indicates that the robots must not follow the links on this page.
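The two directives can also be combined in a single tag:

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />

Indicates that the page should neither be indexed nor have its links followed.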
