Robots.txt: How to Create the Perfect File for SEO
In this article, we will tell you what robots.txt is in SEO, what it looks like, and how to create it correctly. It is the file responsible for blocking the indexing of pages, or even of an entire site. An incorrect file structure is common even among experienced SEO specialists, so we will also dwell on frequent mistakes made when editing robots.txt.
What Is Robots.txt?
Robots.txt is a text file that informs search robots which of the files or pages are closed for crawling and indexing. The document is placed in the root directory of the site.
Let’s take a look at how robots.txt works. Search engines have two goals:
- To crawl the web to discover content;
- To index the found content so it can be shown to users for relevant search queries.
To discover and index content, a search robot follows URLs from one site to another, crawling billions of links and web resources. After opening a site, the system looks for a robots.txt file. If the crawler finds the document, it scans it first, and after receiving instructions from it, it continues to crawl the site.
When there are no directives in the file, or it is not created at all, the robot will continue crawling and indexing without taking into account the data on how the system should perform these actions. This can result in the search engine’s indexing of unwanted content.
However, many SEO specialists point out that some crawlers ignore the instructions in the robots.txt file, for example, email harvesters and malicious bots. Google also does not treat the document as a strict directive; it considers it a recommendation when crawling a site.
User-agent and Main Directives
Each search engine has its own user-agents. Robots.txt prescribes rules for each. Here is a list of the most popular search bots:
- Google: Googlebot
- Bing: Bingbot
- Yahoo: Slurp
- Baidu: Baiduspider
When creating a rule for all search engines, use the asterisk wildcard (*). For example, let’s create a ban for all robots except Bing. In the document, it will look like this:
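A sketch of such a file (an empty Disallow value means the agent may crawl everything):

```
User-agent: *
Disallow: /

User-agent: Bingbot
Disallow:
```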
A robots.txt document can contain different numbers of rules for different search agents. At the same time, each robot follows only its own directives. That is, the instructions for Google, for example, are not relevant for Yahoo or any other search engine. An exception is when you specify the same agent more than once; then the system will execute all of its directives.
It is important to indicate the exact names of search bots; otherwise, the robots will not follow the specified rules.
Directives are instructions for search robots on how to crawl and index sites.
This is the list of directives supported by Google:
- User-agent
- Disallow
- Allow
- Sitemap
Disallow. This directive closes search engines’ access to content. For example, if you need to hide a directory and all of its pages from the scanner for all systems, the robots.txt file will look as follows:
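A sketch, using a hypothetical /private/ directory:

```
User-agent: *
Disallow: /private/
```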
If it is for a specific crawler, then it will look like this:
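For example, targeting only Googlebot (same hypothetical directory):

```
User-agent: Googlebot
Disallow: /private/
```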
Note: Write the path after the directive, otherwise, robots will ignore it.
Allow. This directive lets robots scan a specific page even if its directory has been restricted. For example, you can allow search engines to crawl only one blog post:
Disallow: /blog/
Allow: /blog/what-is-seo
It is also possible to specify robots.txt to allow all the content:
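A minimal sketch:

```
User-agent: *
Allow: /
```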
Note: Google and Bing search engines support this directive. As with the previous directive, always indicate the path after allow.
If you make a mistake in robots.txt, disallow and allow can conflict. For example, suppose you have specified:
Disallow: /blog/what-is-seo
Allow: /blog/what-is-seo
As you can see, the URL is allowed and prohibited from indexing at the same time. The Google and Bing search engines will prioritize the directive with more characters; in this case, it is disallow. If the number of characters is the same, then the allow directive will be used, i.e., the less restrictive one.
Other search engines will select the first directive from the list. In our example, this is disallow.
Sitemap. Stipulated in robots.txt, this directive indicates the address of the sitemap to search robots. Here’s an example of such a robots.txt:
Sitemap: https://www.sitename.com/sitemap.xml
If the sitemap is listed in Google Search Console, then this information will be enough for Google. But other search engines, like Bing, look for it in robots.txt.
You don’t have to repeat the directive for different robots, as it works for all of them. We recommend that you write it at the beginning of the file.
Note: You can specify any number of sitemaps.
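For example (hypothetical sitemap addresses):

```
Sitemap: https://www.sitename.com/sitemap-posts.xml
Sitemap: https://www.sitename.com/sitemap-products.xml
```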
You can also read the related article Sitemaps XML Guide: Best Tricks, Tips, and Tools.
Crawl-delay. Previously, this directive set the time delay between scans. Google no longer supports it, but it can still be specified for Bing. For Googlebot, the crawl rate is set in Google Search Console.
Noindex. Googlebot has never supported noindex in robots.txt. To exclude a page from the search index, use the robots meta tag instead.
Nofollow. Google has not supported this directive in robots.txt since 2019. The rel="nofollow" link attribute is used instead.
Examples of robots.txt
Let’s consider a standard robots.txt file example:
Sitemap: https://www.sitename.com/sitemap.xml
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Note: You can specify any number of user-agents and directives you want. Always write commands on a new line.
Why Is Robots.txt Important for SEO?
The robots.txt file plays an important role in SEO, as it lets you instruct search robots which pages of your site should be crawled and which should not. In addition, the file allows you to:
- Avoid duplicate content in search results;
- Block non-public pages, for example, a staging or intermediate version of the site;
- Prohibit indexing of certain files, such as PDFs or images; and
- Make better use of your Google crawl budget. This is the number of pages that Googlebot can crawl. If there are many pages on the site, the search robot will take longer to view all the content, which can negatively affect the site’s ranking. You can close non-priority pages from the scanner so that the bot indexes only the pages that are important for promotion.
If your site has no content you need to control access to, you may not need a robots.txt file at all. But we still recommend creating one to better optimize your site.
Robots.txt and Robots Meta-tags
Robots meta tags are not robots.txt directives; they are fragments of HTML code. They are commands that tell search bots how to crawl and index a page’s content, and they are added to the page’s <head> section.
The robots meta-tags have two parts:
- name="". Here you need to type the name of the search agent, for example, bingbot.
- content="". Here are the directions on what the bot should do.
So, what does a robots meta tag look like? Have a look at our example:
<meta name="bingbot" content="noindex">
There are two types of robot meta-tags:
- The Meta Robots Tag: This instructs search engines on how to crawl specific files, pages, and subfolders of a site.
- The X-Robots-Tag: This fulfills the same function, but in the HTTP headers. Many experts believe X-Robots-Tag is more functional, but it requires access to .php and .htaccess files or to the server configuration, so it is not always possible to use.
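For example, on an Apache web server (with the mod_headers module enabled), an X-Robots-Tag header can be set in .htaccess; a sketch assuming you want to keep PDF files out of the index:

```
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```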
The table below shows the main directives for the robots meta-tags, taking into account search engines.
The content of the robots.txt file must match the robots meta tags. The most common mistake SEO specialists make is closing a page from crawling in robots.txt while opening it in the robots meta tags.
Many search engines, including Google, prioritize the content of robots.txt, so an important page may end up hidden from indexing. You can fix this inconsistency by aligning the content of the robots meta tags and the robots.txt document.
How to Find Robots.txt?
Robots.txt can be viewed from the site’s front end, and this method works for any site; you can also use it to view the file of any other resource. Just enter the site URL into your browser’s address bar and add /robots.txt at the end. If the file exists, you will see its contents; otherwise, an empty page will open, or you will see a 404 error message.
If, when checking robots.txt on your site, you found a blank page or a 404 error, then no file was created for the resource, or it contains errors.
For sites developed on the basis of CMS WordPress and Magento 2, there are alternative ways to check the file:
- You can find robots.txt WordPress in the WP-admin section. In the sidebar, you will find one of the Yoast SEO, Rank Math, or All in One SEO plugins that generate the file. Read more in the articles Yoast vs Rank Math SEO, Rank Math Plugin Installation Walkthrough, Setting Up SEO Plugins, Yoast vs All in One SEO Pack.
- In Magento 2, the file can be found in the Content-Configuration section under the Design tab.
For the Shopware platform, you first need to install a plugin that will allow you to create and edit robots.txt in the future.
How to Create Robots.txt
To create robots.txt, you need any text editor; most often, specialists choose Windows Notepad. If the document already exists on the site and you need to edit it, delete only its content, not the entire document.
Regardless of your purposes, the document format will look like this standard robots.txt sample:
Sitemap: URL-address (we recommend always specifying it)
User-agent: * (or the name of a specific search bot)
Disallow: / (the path to the content you would like to hide)
Then, add the remaining directives in their necessary amount.
You can check out the complete guide from Google on how to create rules for search bots here. The information is updated if the search engine makes changes to the algorithm for creating a document.
Save the file under the name “robots.txt.”
To generate the file, you can also use a robots.txt generator. The main advantage of such a service is that it helps avoid syntax errors.
Where to Place Robots.txt?
By default, robots.txt is located in the root folder of the site. To control the scanner on sitename.com, the document must be located at sitename.com/robots.txt.
If you want to control the crawling of content on subdomains, like blog.sitename.com, then the document should be located at this URL: blog.sitename.com/robots.txt.
Use any FTP client to connect to the root directory.
Best Practices of Robots.txt Optimization for SEO
- The wildcard (*) can be used to match not only all search crawlers but also groups of similar URLs on the site. For example, if you want to close all product categories or blog sections with certain parameters from indexing, then instead of listing them one by one, you can write:
Disallow: /blog/*?
Bots will then not scan any addresses in the /blog/ subfolder that contain a question mark.
- Do not use a robots.txt document to hide sensitive information in search results. Sometimes other pages may link to your site’s content, and the data will be indexed bypassing the directives. To block the page, use a password or NoIndex.
- Some search engines have multiple bots. For example, Google has an agent for general content search – Googlebot, and Googlebot-Image that crawls images. We recommend prescribing directives for each of them to better control the scanning process on the site.
- Use the $ symbol to indicate the end of a URL. For example, if you need to disable the scanning of PDF files, the directive will look like this: Disallow: /*.pdf$.
- You can hide the printable version of the page as it is technically duplicate content. Tell bots which one can be scanned. This is useful if you need to test pages with identical content but different designs.
- Typically, when changes are made, robots.txt content is cached after 24 hours. It is possible to speed up this process by sending a file address to Google.
- When writing the rules, specify the path as specifically as possible. For example, let’s say that you are testing the French version of the site located in the /fr/ subfolder. If you write the directive like this: Disallow: /fr, then you will also close access to other content that begins with /fr, for example, /french-perfumery/. Therefore, always add a “/” at the end: Disallow: /fr/.
- A separate robots.txt file should be created for each subdomain.
- You can leave comments in the document for optimizers, or yourself if you are working on several projects. To enter text, start a line with the “#” character.
How to Check Robots.txt
You can check the correctness of the created document in Google Search Console. The search engine offers a free robots.txt tester.
To start the process, open your profile for webmasters.
Select the necessary website and click on the Crawl button on the left sidebar.
You will get access to the service of Google robots.txt tester.
If a robots.txt address has already been entered in the field, remove it and enter your own. Click on the test button in the bottom right corner.
If the text changes to “allowed,” then your file has been created correctly.
You can also test new directives right in the tool to check how correct they are. If there are no errors, you can copy the text and add it to your robots.txt document. For detailed instructions on how to use the service, read the information here.
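You can also sanity-check simple rules locally with Python’s standard urllib.robotparser module. A rough sketch using hypothetical rules and URLs (note that this parser only does prefix matching and does not understand the * and $ wildcards inside paths):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the admin area but allow one endpoint inside it.
rules = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Regular pages are crawlable, the admin area is not,
# and the explicit Allow carves out one admin endpoint.
print(parser.can_fetch("*", "https://sitename.com/blog/"))
print(parser.can_fetch("*", "https://sitename.com/wp-admin/"))
print(parser.can_fetch("*", "https://sitename.com/wp-admin/admin-ajax.php"))
```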
Common Mistakes in Robots.txt Files
Below is a list of the most common mistakes that webmasters make when working with a robots.txt file.
- The name contains uppercase letters. The file is simply called “robots.txt.” Do not use capital letters.
- It contains an incorrect search agent format. For example, some specialists write the bot name in the directive itself: Disallow: Googlebot. Always specify robots on the user-agent line.
- Each directory, file, or page should be written on a new line. If you add them in one, the bots will ignore the data.
- Correctly write the host directive if you need it in your work.
- The HTTP header is incorrect.
Be sure to check the Coverage Report in Google Search Console. Errors in the document will be displayed there.
Let’s consider the most common ones.
1. Access to the URL-address is blocked:
This error appears when one of the URLs in the sitemap is blocked by robots.txt. You need to find these pages and make changes to the file to remove the scan ban. To find a directive that blocks a URL, you can use robots.txt tester from Google. The main goal is to exclude further errors in blocking priority content.
2. Forbidden in robots.txt:
The site contains content that is blocked by robots.txt and is not indexed by the search engine. If these pages are essential, then you need to remove the blocking, after making sure that the page is not prohibited from indexing by using noindex.
If you need to close access to a page or file to exclude it from the search engine index, we recommend using the robots meta tag instead of the disallow directive, as this reliably removes the page from the index. If you do not remove the crawl blocking, the search engine will never see the noindex tag, and the content may remain indexed.
3. Content is indexed bypassing blocking in robots.txt document:
Some pages or files may still be in the search engine index despite being prohibited in robots.txt. You might have accidentally blocked content that you actually want indexed; to fix this, correct the document. In other cases, add the noindex robots meta tag to the page. Read more in the article Power of Nofollow Links. New SEO Tactics.
How to Close a Page from Indexing in Robots.txt
One of the main tasks of robots.txt is to hide certain pages, files, and directories from indexing in search engines. Here are some examples of what content is most often blocked from bots:
- Duplicate content;
- Pagination pages;
- Categories of goods and services;
- Content pages for moderators;
- Online shopping baskets;
- Chats and feedback forms; and
- Thank you pages.
To stop content from crawling, the disallow directive should be used. Let’s look at examples of how you can block search agents from accessing different types of pages.
1. If you need to close a specific subfolder:
User-agent: * (or specify the bot name if the rule should apply to only one search engine)
Disallow: /name-subfolder/
2. If you close a specific web page on the site:
User-agent: * (or the name of the robot)
Disallow: /name-subfolder/page.html
Here is an example of how prohibiting directives is specified by an online store:
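A sketch of what such a file might look like (the store paths below are hypothetical):

```
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /search/
Disallow: /*?sort=
```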
Optimizers have blocked all content and pages that are non-priority for promotion in search results. This preserves the crawl budget of search robots such as Googlebot, which can help improve the site’s rankings in the future, taking other significant factors into account.
We do not recommend hiding confidential information using the disallow directive, as malicious systems can still bypass the block. Some experts use bait to blacklist IP addresses: a directive with a name attractive to scammers is added to the file, for example, Disallow: /logins/page.html. Any visitor requesting that address can then be added to your own blacklist of IP addresses.
Robots.txt is a simple, yet essential, document for the practice of SEO. With its help, search robots can effectively crawl and index a resource, and display only useful and priority content for users in the SERP. Search results will be formed more accurately, which will help attract more targeted visitors to your site and improve your click-through rate.
Typically, creating robots.txt is a one-time job; afterwards, you only need to adjust the document’s content as the site develops. Most SEO specialists recommend using robots.txt regardless of the resource type.