This tutorial explains how to create a robots.txt file.
The robots.txt file is used by webmasters to give instructions to web spiders about how they may index the website. This mechanism is called the Robots Exclusion Protocol.
In this tutorial, we will assume that our website (http://www.tutorielsenfolie.com) has the following structure:
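For illustration, assume a minimal layout like the one below; the file and directory names are taken from the examples used later in this tutorial:

http://www.tutorielsenfolie.com/
    index.html
    robots.txt
    secret/
    users/
        customPage.html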
How does it work?
When a robot visits your website, it starts by reading the file http://www.tutorielsenfolie.com/robots.txt. Then, depending on its content, it decides whether to read the file http://www.tutorielsenfolie.com/index.html. Suppose the robots.txt contains the following instructions:
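User-agent: *
Disallow: /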
- User-agent: * means that the following section applies to all indexing robots.
- Disallow: / means that robots must not visit the website.
When using robots.txt, there are two important things to keep in mind:
- Robots can ignore the robots.txt file, in particular malicious robots that scan your website to find security holes, harvest email addresses, or for other reasons.
- The robots.txt file is public, which means everyone can see which sections you want to hide from indexing robots. robots.txt should therefore not be used to hide sensitive data.
How to write a robots.txt
The robots.txt file is composed of sections, each divided into two parts:
- The first part indicates which indexing robots the second part applies to. It starts with User-agent: followed by the name of a robot (* for all); to target several robots, write one User-agent line per robot.
- The second part lists the folders or files that must not be taken into account when the website is indexed. It is composed of lines starting with Disallow: followed by the path of the folder or file that should not be indexed.
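For example, a section that excludes the directory /secret/ and the file /users/customPage.html from all robots looks like this:

User-agent: *
Disallow: /secret/
Disallow: /users/customPage.html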
Note that each Disallow rule must appear on its own line with a single path; a line combining two paths, such as Disallow: /secret/ /users/customPage.html, is not valid.
NB: A section must not contain an empty line, because empty lines are used to separate sections. Furthermore, regular expressions are not supported; the character * is a special value that means "all robots".
Exclude all robots from the server
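To do this, the robots.txt contains a single section targeting all robots and disallowing the root of the site:

User-agent: *
Disallow: /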
Exclude one robot from the directory secret/
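Here the section targets one robot by name; ExampleBot is a placeholder, to be replaced by the actual user-agent string of the robot you want to exclude:

User-agent: ExampleBot
Disallow: /secret/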
Allow only one robot on the server
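One way to express this is a first section that allows the chosen robot everywhere (an empty Disallow disallows nothing), followed by a second section that excludes all other robots. ExampleBot is again a placeholder name:

User-agent: ExampleBot
Disallow:

User-agent: *
Disallow: /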
Exclude all robots from the server except for one file
The issue comes from the fact that there is no keyword to authorize a single file or a single robot on the server. The simplest solution is to put all the files you do not want indexed in a dedicated directory (for example /noRobot) and to place the file to be indexed outside this folder. Then you just need to write the robots.txt file as follows:
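With this layout, the robots.txt contains a single rule:

User-agent: *
Disallow: /noRobot/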
An alternative to robots.txt
If you do not want to use, or cannot create, a robots.txt file, an alternative exists: a meta tag that tells robots whether they may index the page or follow the links inside the document.
<meta name="ROBOTS" content="NOINDEX" />
Indicates that the page should not be indexed.
<meta name="ROBOTS" content="NOFOLLOW" />
Indicates that robots must not follow the links on this page.
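The two values can also be combined in a single tag to forbid both indexing the page and following its links:

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />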