A web robot’s primary job is to crawl or scan websites and pages for information; these bots work tirelessly to collect data for search engines and other applications. For many site owners, there is good reason to keep some pages away from search engines. Whether you want to fine-tune access to your site or work on a development site without showing up in Google results, the robots.txt file, once implemented, lets web crawlers and bots know what information they can collect.
A robots.txt is a plain text website file at the root of your site that follows the Robots Exclusion Standard. For example, www.yourdomain.com would have a robots.txt file at www.yourdomain.com/robots.txt. The file consists of one or more rules that allow or block access to crawlers, constraining them to a specified file path in the website. By default, all files are entirely allowed for crawling unless otherwise specified.
The robots.txt file is one of the first things a crawler checks when it visits your site. It is important to note that your site can only have one robots.txt file. Its rules can apply to individual pages, to directories, or to an entire site to discourage search engines from crawling your content.
This article will provide five steps to create a robots.txt file and the syntax needed to keep bots at bay.
You must have access to the root of your domain. Your web hosting provider can confirm whether you have the appropriate access.
The most important parts of the file are its creation and location. Use any text editor to create your robots.txt file, and place it at the root of your domain.
Finally, you will need to ensure that your robots.txt file is a UTF-8 encoded text file. Google and other popular search engines and crawlers may ignore characters outside of the UTF-8 range, possibly making your robots.txt rules invalid.
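As a quick illustration, here is a minimal Python sketch that writes a robots.txt file with explicit UTF-8 encoding and reads it back to confirm it decodes cleanly. The rule it writes is just a placeholder; substitute your own directives.

```python
from pathlib import Path

# A placeholder rule set; substitute your own directives.
rules = "User-agent: *\nDisallow: /private/\n"

# Write the file with explicit UTF-8 encoding.
Path("robots.txt").write_text(rules, encoding="utf-8")

# Read the raw bytes back and confirm they decode cleanly as UTF-8.
text = Path("robots.txt").read_bytes().decode("utf-8")
print(text == rules)  # True
```

If the file contains characters outside the UTF-8 range, the `decode` call will raise an error instead of printing True, which is exactly the condition you want to catch before uploading.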
The next step in how to create robots.txt files is to set the user-agent. The user-agent pertains to the web crawlers or search engines that you wish to allow or block. Several entities could be the user-agent. We have listed a few crawlers below, as well as their associations.
There are three different ways to establish a user-agent within your robots.txt file.
The syntax used to set the user-agent is User-agent: NameOfBot. Below, DuckDuckBot is the only user-agent established.
# Example of how to set user-agent
User-agent: DuckDuckBot
To add more than one, follow the same process as you did for the DuckDuckBot user-agent on a subsequent line, entering the name of the additional user-agent. In this example, we used Facebot.
# Example of how to set more than one user-agent
User-agent: DuckDuckBot
User-agent: Facebot
To block all bots or crawlers, substitute the name of the bot with an asterisk (*).
# Example of how to set all crawlers as user-agent
User-agent: *
A robots.txt file is read in groups. A group specifies the user-agent and contains one or more rules or directives indicating which files or directories that user-agent can or cannot access.
Here are the directives used:
Web crawlers process the groups from top to bottom. As mentioned before, they may access any page or directory not explicitly disallowed. To block specific user agents from crawling your entire website, add Disallow: / beneath the user-agent information in each group.
# Example of how to block DuckDuckBot
User-agent: DuckDuckBot
Disallow: /
# Example of how to block more than one user-agent
User-agent: DuckDuckBot
User-agent: Facebot
Disallow: /
# Example of how to block all crawlers
User-agent: *
Disallow: /
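One quick way to sanity-check the grouping behavior above is Python’s standard-library urllib.robotparser, which applies robots.txt rules much the way compliant crawlers do (real crawlers may differ in edge cases). This sketch uses the bot names from the examples above:

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above: block DuckDuckBot and Facebot,
# leave everything else unrestricted.
rules = """\
User-agent: DuckDuckBot
User-agent: Facebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Both named bots are blocked from every path...
print(parser.can_fetch("DuckDuckBot", "/about.html"))  # False
print(parser.can_fetch("Facebot", "/about.html"))      # False

# ...while any other crawler is still allowed.
print(parser.can_fetch("SomeOtherBot", "/about.html"))  # True
```

Note how the two consecutive User-agent lines form a single group, so the one Disallow rule applies to both bots.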
Rules in a robots.txt file apply only to the host that serves the file, so a disallow rule on your main domain cannot reference a subdomain by URL. To block a specific subdomain from all crawlers, serve a separate robots.txt file at the root of that subdomain and disallow everything in it.
# Example: robots.txt served at https://page.yourdomain.com/robots.txt
User-agent: *
Disallow: /
If you want to block a directory, follow the same process by adding a forward slash and your directory name, ending with another forward slash.
# Example
User-agent: *
Disallow: /images/
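You can verify the directory rule the same way with urllib.robotparser; the paths below are hypothetical examples:

```python
from urllib.robotparser import RobotFileParser

# The directory rule from the example above.
rules = """\
User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Anything under /images/ is blocked...
print(parser.can_fetch("AnyBot", "/images/logo.png"))  # False

# ...but the rest of the site remains crawlable.
print(parser.can_fetch("AnyBot", "/index.html"))  # True
```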
Finally, if you would like all search engines to be able to crawl every page of your site, you can write either an allow or a disallow rule; be sure to add a forward slash when using the allow rule, and to leave the path empty when using the disallow rule. Examples of both rules are shown below.
# Allow example to allow all crawlers
User-agent: *
Allow: /
# Disallow example to allow all crawlers
User-agent: *
Disallow:
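A short urllib.robotparser sketch confirms that the two forms above behave identically; the bot name and path here are hypothetical:

```python
from urllib.robotparser import RobotFileParser

def allows_everything(rules: str) -> bool:
    # Parse a robots.txt body and ask whether an arbitrary page may be crawled.
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("AnyBot", "/any/page.html")

# Allow with a forward slash permits everything...
print(allows_everything("User-agent: *\nAllow: /"))   # True

# ...and so does an empty Disallow rule.
print(allows_everything("User-agent: *\nDisallow:"))  # True
```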
Websites do not automatically come with a robots.txt file, as one is not required. Once you decide to create one, upload the file to your website’s root directory. Uploading depends on your site’s file structure and your web hosting environment. Reach out to your hosting provider for assistance uploading your robots.txt file.
There are several ways to test and make sure your robots.txt file functions correctly. With any one of these, you can see any errors in your syntax or logic. Here are a few of them:
If you use WordPress with the Yoast SEO plugin, you’ll see a section within the admin window for creating a robots.txt file.
Log into the backend of your WordPress website, open Tools under the SEO section, and then click File editor.