1. So what's the robots.txt file?
The robots.txt file is simply a text file placed on the web server that tells web crawlers whether or not they may access a given file.
2. What's the point of using robots.txt?
The robots.txt file is a very powerful way to keep pages without quality content out of the index. For instance, suppose you have two versions of a web page: one for viewing in the browser and one for printing. You would prefer the printing version to be excluded from crawling, otherwise you risk being hit with a duplicate content penalty.
See robots.txt examples:
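For instance, a minimal robots.txt that keeps the printable versions mentioned above out of the index might look like this (the /print/ path is just a placeholder):

User-agent: *
Disallow: /print/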
N.B.: For this to work, the robots.txt must be placed in the top-level directory of the web server, i.e. https://yoursite.com/robots.txt. For a Blogger blog, follow these steps.
3. How to create a robots.txt file
Don't get scared when you hear the term robot and think you have to build yourself a robot. A robots.txt file is just a text file, so you can use Notepad or any other plain text editor, create it in a code editor, or even copy and paste it from somewhere else.
Don't put too much weight on the idea that you are "creating a robots.txt file"; think of it as writing a simple note, since the procedure is pretty much the same. A robots.txt file can be created either manually or by using online services.
Manual method: As mentioned earlier, a robots.txt file can be created in any plain text editor. Write the content according to your requirements, then save it as a plain text file named robots.txt.
Online method: There are plenty of online robots.txt generators. You may choose whichever tool you prefer, but check the resulting file carefully for directives that could harm your blog's performance. The robots.txt file is somewhat delicate, so the online method is not as safe as the manual one.
4. How to set up a robots.txt file
A properly configured robots.txt file prevents private information from being found by search engines and displayed to the public. However, remember that robots.txt commands are not full protection, just a guide for crawling. Googlebot follows the instructions in a robots.txt, but other robots can easily ignore them, so to get the intended result you have to understand and use robots.txt correctly. A correct robots.txt begins with the "User-agent" directive, naming the robot that the following directives apply to.
See example below:
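For instance, the "User-agent" line can address all robots at once or a single one (Googlebot is used here only as an illustration):

User-agent: *          # addresses all robots at once
User-agent: Googlebot  # addresses only Google's robot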
N.B.: This setting makes each robot use only the directives listed under its own user-agent name, as shown in the examples below:
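A sketch with a separate block for each robot (the /*utm value is the one discussed in the next paragraph):

User-agent: *
Disallow: /*utm

User-agent: Googlebot
Disallow: /*utm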
The User-agent directive only names a particular robot; the rules for that robot come right after it. In the example above you can see the prohibitive directive "Disallow" with the value "/*utm", which is how you close pages with UTM marks from crawling.
See example of incorrect line in robots below:
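For instance, an empty line inside a block separates the directive from its User-agent (see also the syntax rules further below):

User-agent: Googlebot

Disallow: /*utm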
And below you get the example of correct line in robots:
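Here the directives follow their User-agent line directly, and the blocks are separated from each other:

User-agent: Googlebot
Disallow: /*utm

User-agent: *
Disallow: /*utm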
As seen in the example above, the directives in robots.txt come in blocks. Each block holds the instructions for one particular robot, or for all robots ("*"). It is also very important to keep the right order of directives when "Allow" and "Disallow" are used together.
"Allow" is the permitting directive, while "Disallow" is its opposite, restricting the robot's access.
Here below is an example of using the both directives:
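For instance (the /contact paths are placeholders):

User-agent: *
Allow: /contact/page
Disallow: /contact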
The example above forbids all robots to index pages beginning with "/contact", while permitting them to index pages beginning with "/contact/page".
So let's see the same example again, in the right order:
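That is:

User-agent: *
Disallow: /contact
Allow: /contact/page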
As you can see in the example above, we first forbid the whole section and then permit some of its parts. The example below shows another way to use both directives:
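For instance, a variant that first opens the whole site and then narrows it down (again with placeholder paths):

User-agent: *
Allow: /
Disallow: /contact
Allow: /contact/page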
The directives "Allow" and "Disallow" can also be used without a value; an empty value is read as the opposite of the value "/".
Below is an example of directives used without a value:
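Following that rule:

User-agent: *
Disallow:    # empty value, read as the opposite of "Disallow: /", i.e. everything is allowed

User-agent: *
Allow:       # empty value, read as the opposite of "Allow: /", i.e. nothing is allowed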
It is up to you which form to use, as both variants are valid. Just be attentive and don't get confused: set the right priorities and put exactly what is forbidden in the value of the directives.
5. Robots.txt syntax
Search engine robots execute the commands of robots.txt, but every search engine may read the robots.txt syntax a little differently. Check your file against the set of rules below to prevent the most common robots.txt mistakes:
- Every directive should begin on a new line.
- Avoid putting more than one directive on a line.
- Avoid putting spaces at the start of a line.
- The directive value must be on one line.
- Do not put the directive value in quotes.
- Do not put a semicolon after a directive.
- A robots.txt directive has the form: [directive_name]:[optional space][value][optional space].
- Comments must be added after a hash mark (#).
- An empty line is read as the end of the current User-agent block.
- The directive "Disallow:" with an empty value is equal to "Allow: /", which means allow everything. Only one value can be given in an "Allow" or "Disallow" directive.
- The file name is case-sensitive, so uppercase letters are not allowed: Robots.txt or ROBOTS.TXT is incorrect.
- Directive names in robots.txt are not particularly case-sensitive, while the names of files and directories are very much case-sensitive.
- When the directive value is a directory, put a slash "/" before the directory name, e.g. Disallow: /category.
- An overly large robots.txt (exceeding 32 KB) is read as fully permissive, equal to "Disallow:".
- An unavailable robots.txt may be read as fully permissive.
- An empty robots.txt is read as fully permissive.
- Several "User-agent" blocks listed without empty lines between them may be ignored, except the first one.
- National (non-ASCII) characters are not allowed in robots.txt.
Different search engines may read the robots.txt syntax in their own way, so some of these rules can be missed by some of them. As a rule, put only the necessary content into robots.txt: the fewer lines you have, the better the result, and you can spend the saved effort on content quality.
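As a sketch, a small file that follows the rules above might look like this (the paths are placeholders):

User-agent: *            # block for all robots
Disallow: /print/        # one directive per line, value not quoted
Allow: /print/help       # only one value per directive

User-agent: Googlebot    # a new block, preceded by an empty line
Disallow: /print/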
6. Testing your robots.txt file
To check whether the syntax and file structure are correct, you may use an online tool such as the one Google provides: https://www.google.com/webmasters/tools/siteoverview?hl=ru
Googlebot is the robot Google uses to index websites in its search engine, and it understands a few more instructions than other robots. To check your robots.txt file online, put it in the root directory of the website; otherwise the checker will not detect it. It is also recommended to confirm that your robots.txt is reachable, e.g. at yoursite.com/robots.txt. There are many different online robots.txt validators; which one to use is simply your own choice.
7. Robots.txt Allow
Allow is the opposite of Disallow and uses a syntax similar to "Disallow", as the example below shows:
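For instance (the /page path is a placeholder):

User-agent: *
Disallow: /
Allow: /page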
Here, indexing of the whole website is forbidden except for pages whose addresses begin with /page.
Below is an example of Allow and Disallow with empty values:
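For instance:

User-agent: *
Disallow:    # empty value, equal to "Allow: /": the whole site may be crawled

# or, the other way round:
User-agent: *
Allow:       # empty value, equal to "Disallow: /": nothing may be crawled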
8. Robots.txt Disallow
Disallow is the prohibitive directive of the robots.txt file. "Disallow" forbids indexing of the website or some of its parts, depending on the path given as the directive value.
See the example of forbidden website indexation:
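For instance:

User-agent: *
Disallow: /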
The example above closes access for all robots to index the whole website.
The special symbols * and $ are allowed in the Disallow directive value.
* matches any quantity of any characters. For example, the value /page* matches /page, /page1, /page-about-me and /page/good-food.
$ marks the exact end of the value. For example, Disallow: /page$ forbids /page, but /page1, /page-about-me or /page/good-food will still be allowed to be indexed.
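A sketch combining both special symbols (the paths are placeholders):

User-agent: *
Disallow: /page*    # forbids /page, /page1, /page-about-me, /page/good-food
Disallow: /admin$   # forbids exactly /admin, while /admin/settings stays open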
When a page is closed from indexing this way, search engines may report a "URL restricted by robots.txt" message. If you need to prohibit indexing of a page, you can use not only robots.txt but also similar HTML meta tags, as shown below:
- <meta name="robots" content="noindex"/> : do not index the page content;
- <meta name="robots" content="nofollow"/> : do not follow the links;
- <meta name="robots" content="none"/> : do not index the page content and do not follow the links;
- <meta name="robots" content="noindex, nofollow"/> : equal to content="none".
9. Robots.txt sitemap
The "Sitemap" directive is used to point out the location of sitemap.xml in the robots.txt file. See the example below of a robots.txt containing a sitemap.xml entry:
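For instance (replace the URL with your own sitemap address):

User-agent: *
Disallow: /page
Sitemap: https://yoursite.com/sitemap.xml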
10. Directive Clean-Param
The "Clean-param" directive (supported by Yandex) allows excluding from indexing pages with dynamic parameters, i.e. pages that serve the same content under different URLs. It is up to you whether such a page should stay available at several addresses. The main task is to get rid of all the extra dynamic addresses, which can be numerous, and to do this we list the dynamic parameters in the "Clean-param" directive.
Let's consider the example of a page that is available at the following URLs:
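A sketch with made-up parameter names (parm1, parm2) and a placeholder page:

www.yoursite.com/page.html?&parm1=1&parm2=2
www.yoursite.com/page.html?&parm1=3&parm2=4

Both addresses return the same /page.html, so the extra parameters can be listed in the directive:

Clean-param: parm1&parm2 /page.html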
11. Directive Crawl-delay
This instruction helps avoid server overload when crawlers visit your site too often; it is therefore useful mainly for sites with a huge number of pages. Below is our robots.txt "Crawl-delay" example:
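For instance, matching the three-second delay described below (the /page path is a placeholder):

User-agent: *
Disallow: /page
Crawl-delay: 3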
In the example above we ask robots to download pages of our website no more often than once per three seconds. Keep in mind that not every crawler honours this directive (Googlebot, for instance, ignores Crawl-delay), while some search engines also accept fractional numbers as the "Crawl-delay" value.
12. Comments in robots.txt file
Comments in robots.txt begin with the hash sign (#), are valid until the end of the current line, and are ignored by robots, as shown below:
User-agent: *
# Comment can start the line
Disallow: /page # Comment can also continue the line
# Robots
# ignore
# comments
Host: www.yoursite.com
The Common Mistakes
1. Mistake in the syntax:
Wrong:
User-agent: /
Disallow: Yahoo
Correct:
User-agent: Yahoo
Disallow: /
2. Several "Disallow" directives on one line:
Wrong:
Disallow: /css/ /cgi-bin/ /images/
Correct:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
3. Wrong file name:
Wrong:
Robot.txt (capital letter and missing 's')
robot.txt (missing 's')
ROBOT.TXT (uppercase)
Correct:
robots.txt
The robots.txt file is one of the most important SEO tools, as it has a direct impact on how your website gets indexed.