Google recently announced a standard format for the robots.txt file and reminded webmasters that some rules in robots.txt are unsupported. We have compiled the robots.txt specifications here for your information.
Google defines the new specifications for robots.txt in the following aspects:
1. A clear definition of file location and range of validity
The robots.txt file must be saved in the top-level directory of the host. For instance: http://example.com/folder/robots.txt is not a valid robots.txt file as crawlers don't check for robots.txt files in subdirectories. Besides, it's not limited to HTTP anymore and can be used for FTP.
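The location rule can be sketched in a few lines of Python. This is only an illustration: the helper name robots_txt_url is our own, and it simply keeps the scheme and host of a page URL and replaces the path with /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the top-level robots.txt URL for the host of page_url.

    Hypothetical helper for illustration: keeps scheme and host,
    drops the path, query and fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A robots.txt in a subdirectory is not the one that applies to the host.
print(robots_txt_url("http://example.com/folder/robots.txt"))  # http://example.com/robots.txt
# The same rule applies to FTP hosts.
print(robots_txt_url("ftp://example.com/pub/file.txt"))  # ftp://example.com/robots.txt
```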
2. File format and size
The expected file format is plain text encoded in UTF-8, and Google currently enforces a size limit of 500 kibibytes (KiB).
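As a rough sketch, the size of a robots.txt file can be checked before it is uploaded. The function name within_size_limit is hypothetical; the only fact taken from the specification is the 500 KiB limit, measured here on the UTF-8 encoded bytes.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB = 512,000 bytes


def within_size_limit(robots_text: str) -> bool:
    """Check the UTF-8 encoded size of a robots.txt body against 500 KiB."""
    return len(robots_text.encode("utf-8")) <= MAX_ROBOTS_BYTES


print(within_size_limit("User-agent: *\nDisallow: /404.html\n"))  # True
# Non-ASCII characters take more than one byte each in UTF-8.
print(within_size_limit("é" * 256_001))  # False
```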
3. How Google handles HTTP result codes like 2XX, 3XX, 4XX, 5XX
Google also defined how the HTTP result codes of the robots.txt file would be handled.
4. The standard for URL matching based on path values
For instance, within a User-agent: * group, a disallow rule can name a directory, which means disallow crawling of that directory and its contents.
A rule can also use a wildcard pattern, which means disallow crawling of files of a specific file type (for example, .gif).
Why do you need a robots.txt file for your website?
1. Making the most of your website's crawl budget. Disallowing crawlers from crawling webpages with low SEO value helps you optimize the crawl budget.
2. Preventing duplicate content from appearing in SERPs (note that meta robots is often a better choice for this)
3. Specifying the location of sitemap(s)
4. Preventing search engines from indexing certain files on your website (images, PDFs, etc.)
5. Specifying a crawl delay in order to prevent your servers from being overloaded when crawlers load multiple pieces of content at once
First you need to understand that there are generally three different outcomes when robots.txt files are fetched:
full allow: All content may be crawled.
full disallow: No content may be crawled.
conditional allow: The directives in the robots.txt determine the ability to crawl certain content.
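How the HTTP result codes relate to these three outcomes can be sketched as below. The mapping is an assumption based on our reading of Google's documentation and may change over time; the function name outcome_for_status and the "follow redirect" label are our own.

```python
def outcome_for_status(status: int) -> str:
    """Map the HTTP result code of a robots.txt fetch to a crawl outcome.

    Assumed mapping (not spelled out in this article):
      2xx           -> the file is parsed, so the rules decide
      3xx           -> the redirect is followed before deciding
      4xx (not 429) -> treated as if no robots.txt exists
      5xx and 429   -> treated as a temporary error
    """
    if 200 <= status < 300:
        return "conditional allow"
    if 300 <= status < 400:
        return "follow redirect"
    if status == 429 or 500 <= status < 600:
        return "full disallow"
    if 400 <= status < 500:
        return "full allow"
    raise ValueError(f"unexpected HTTP status: {status}")


for code in (200, 301, 404, 429, 503):
    print(code, outcome_for_status(code))
```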
For example, if you want all search engine bots to crawl your webpages except for the 404.html page, you can start the robots.txt file with a User-agent: * group that disallows /404.html.
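A minimal sketch of such a file, checked with Python's standard urllib.robotparser module, could look like this. The host example.com and the page index.html are placeholders chosen for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt body: every crawler may fetch everything
# except /404.html.
ROBOTS_TXT = """\
User-agent: *
Disallow: /404.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/404.html"))    # False
print(parser.can_fetch("Googlebot", "https://example.com/index.html"))  # True
```

Note that urllib.robotparser only does plain prefix matching, so it is suitable for simple rules like this one but not for patterns that use * or $.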
Besides these rules, you also need to give the absolute URL of the sitemap.
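As a small illustration, the sitemap is declared on its own Sitemap line with a full URL. The sitemap address below is a made-up example, and the site_maps() method used to read it back requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt body; the sitemap URL is a hypothetical example
# and must be absolute (scheme and host included).
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Disallow: /404.html

Sitemap: https://example.com/sitemap.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_WITH_SITEMAP.splitlines())

print(sitemap_parser.site_maps())  # ['https://example.com/sitemap.xml']
```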
The URL matching rules based on path values are fairly complex.
* designates 0 or more instances of any valid character.
$ designates the end of the URL.
If the value of allow: <path> or disallow: <path> contains neither * nor $, the value matches every URL that starts with it. For instance, disallow: /fish matches every URL whose path starts with /fish.
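Prefix matching can be illustrated with a one-line check. The function name and the sample paths are our own choices for this sketch; note that the comparison is case-sensitive.

```python
def prefix_rule_matches(rule_path: str, url_path: str) -> bool:
    """A rule without * or $ matches any path that starts with it."""
    return url_path.startswith(rule_path)


samples = ["/fish", "/fish.html", "/fish/salmon.html", "/fishheads",
           "/Fish.asp", "/catfish"]
for path in samples:
    print(path, prefix_rule_matches("/fish", path))
# The first four paths match; /Fish.asp (different case) and
# /catfish (does not start with /fish) do not.
```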
With disallow: /*.php$, the rule matches URLs whose path ends in .php,
but it does not match URLs where anything follows .php ($ means the URL must end with .php).
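The two special characters can be sketched by translating a rule into a regular expression. This is a simplified illustration rather than Google's actual matcher; the function name rule_matches and the sample paths are our own.

```python
import re


def rule_matches(pattern: str, url_path: str) -> bool:
    """Check a robots.txt path pattern against a URL path.

    '*' stands for zero or more characters, a trailing '$' anchors the
    end of the URL, and without '$' the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start of the path.
    return re.match(regex, url_path) is not None


print(rule_matches("/*.php$", "/filename.php"))             # True
print(rule_matches("/*.php$", "/folder/filename.php"))      # True
print(rule_matches("/*.php$", "/filename.php?parameters"))  # False
print(rule_matches("/*.php$", "/filename.php5"))            # False
print(rule_matches("/*.php$", "/windows.PHP"))              # False
```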
You can visit Google's official article for more information about URL matching rules, file location, and range of validity.