How to create a robots.txt that follows Google’s new standards

by RANKtoo, July 3, 2019

Recently, Google announced a formal specification for the robots.txt file and reminded webmasters that certain unsupported rules in robots.txt will no longer be honored. Here we have compiled the key points of the robots.txt specification for your reference.

Google defines the new specification for robots.txt in the following areas:

1. A clear definition of file location and range of validity

The robots.txt file must be placed in the top-level directory of the host. For instance, http://example.com/folder/robots.txt is not a valid robots.txt file, because crawlers do not check for robots.txt files in subdirectories. In addition, robots.txt is no longer limited to HTTP; it can also be used on other protocols such as FTP.
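To illustrate the location rule, here is a small Python sketch (the helper name robots_txt_url is our own) that derives where a crawler would look for robots.txt for any given page URL, keeping only the scheme and host and discarding the path:

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    # The robots.txt file lives at the root of the host, so any path
    # component of the original URL is discarded.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://example.com/folder/page.html"))
# -> http://example.com/robots.txt (not .../folder/robots.txt)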

2. File format and size

The expected file format is plain text encoded in UTF-8, and Google currently enforces a size limit of 500 kibibytes (KiB); content beyond that limit is ignored.
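As a minimal sketch of what this means for anyone parsing the file themselves (the constant and helper below are our own illustration; the 500 KiB figure comes from Google's published limit):

MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB

def decode_robots_body(raw_bytes):
    # Keep only the first 500 KiB; anything beyond the limit is ignored.
    truncated = raw_bytes[:MAX_ROBOTS_BYTES]
    # The file is expected to be UTF-8 plain text; replace invalid bytes
    # instead of failing outright.
    return truncated.decode("utf-8", errors="replace")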

3. How Google handles HTTP result codes like 2XX, 3XX, 4XX, 5XX

Google also defined how the HTTP result codes returned when fetching the robots.txt file are handled. Broadly, a 2xx response means the file is parsed as served, 3xx redirects are followed for a limited number of hops, a 4xx response is treated as if no robots.txt exists (a full allow), and a 5xx response is treated as a temporary full disallow.
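As a rough sketch of how a client might mirror this behavior (the function name fetch_robots and the exact classification here are our own illustration, following Google's documented handling only at a high level):

import urllib.error
import urllib.request

def fetch_robots(url):
    # Returns (outcome, body), where outcome is one of the three results
    # discussed later in this article: 'conditional allow', 'full allow',
    # or 'full disallow'. urllib follows 3xx redirects automatically up to
    # its built-in limit.
    try:
        with urllib.request.urlopen(url) as response:
            body = response.read(500 * 1024).decode("utf-8", errors="replace")
            return "conditional allow", body   # 2xx: parse the file as served
    except urllib.error.HTTPError as err:
        if 400 <= err.code < 500:
            return "full allow", ""            # 4xx: as if no robots.txt exists
        return "full disallow", ""             # 5xx: temporary full disallow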

4. A standard for URL matching based on path values

For instance:

User-agent: *
Disallow: /calendar/
Disallow: /junk/

This disallows crawling of a directory and its contents.

User-agent: Googlebot
Disallow: /*.gif$

This disallows crawling of files of a specific file type (for example, .gif files) for the Googlebot crawler.



Why do you need a robots.txt file for your website?

1. Making the most of your website's crawl budget. Disallowing crawlers from crawling low-SEO-value pages helps you optimize your crawl budget.

2. Preventing duplicate content from appearing in SERPs (note that meta robots is often a better choice for this)

3. Specifying the location of sitemap(s)

4. Preventing search engines from indexing certain files on your website (images, PDFs, etc.)

5. Specifying a crawl delay in order to prevent your servers from being overloaded when crawlers load multiple pieces of content at once


How to create a robots.txt that follows Google’s new standards

First, you need to understand that there are generally three possible outcomes when a robots.txt file is fetched:

full allow: All content may be crawled.

full disallow: No content may be crawled.

conditional allow: The directives in the robots.txt determine the ability to crawl certain content.


For example, if you want all search engine bots to crawl your webpages except for the 404.html page, you can start the robots.txt file like this:

User-agent: *
Allow: /
Disallow: /404.html

Besides these directives, you should also indicate the absolute URL of your sitemap:
Sitemap: https://www.example.com/sitemap.xml
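If you want to sanity-check a file like this from Python, the standard library's urllib.robotparser module can parse it and expose the sitemap entries (site_maps() requires Python 3.8 or newer). Note that its precedence logic is simpler than Google's longest-match rule, so for overlapping allow/disallow or wildcard questions see the matching sketch near the end of this article:

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /
Disallow: /404.html
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.site_maps())
# -> ['https://www.example.com/sitemap.xml']
print(parser.can_fetch("*", "https://www.example.com/products/"))
# -> True
# Caution: urllib.robotparser applies rules in file order (first match wins)
# rather than Google's longest-match precedence, so it can disagree with
# Google on URLs where Allow and Disallow rules overlap, such as /404.html.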




The URL matching rules based on path values are more complex. Google supports two wildcard characters:

* designates 0 or more instances of any valid character.

$ designates the end of the URL.


If the value of allow: <path> or disallow: <path> contains no * or $, it matches any URL that starts with that value. For instance, disallow: /fish matches these URLs:

/fish

/fish.html

/fish/salmon.html

/fishheads

/fishheads/yummy.html

/fish.php?id=anything

A rule such as disallow: /*.php$ matches these URLs:

/filename.php

/folder/filename.php

but does not match these URLs (the $ requires the URL to end with .php, and matching is case-sensitive):

/filename.php?parameters

/filename.php/

/filename.php5

/windows.PHP
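To make these rules concrete, here is a small Python sketch of the matching logic (rule_to_regex and rule_matches are names we made up for this illustration; it is not Google's implementation): * is translated to ".*", a trailing $ anchors the pattern to the end of the URL, and matching is case-sensitive.

import re

def rule_to_regex(rule_path):
    # '*' matches zero or more characters; a trailing '$' anchors the rule
    # to the end of the URL path. Everything else is matched literally and
    # case-sensitively.
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))

def rule_matches(rule_path, url_path):
    return rule_to_regex(rule_path).match(url_path) is not None

print(rule_matches("/fish", "/fishheads/yummy.html"))       # True (prefix match)
print(rule_matches("/*.php$", "/folder/filename.php"))      # True
print(rule_matches("/*.php$", "/filename.php?parameters"))  # False ($ anchors the end)
print(rule_matches("/*.php$", "/windows.PHP"))               # False (case-sensitive)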



You can visit Google's official documentation for more information about URL matching rules, file location, and range of validity.