Professional website operators generally aim to make their sites more visible to search engines. One requirement for this is making sure that all URLs can be read by search bots and then properly indexed. While this may sound like a simple task, it must be noted that search engines only rarely crawl websites in full. Even Google’s capacity for gathering and storing website content is limited. Instead, every domain is allotted a certain crawling budget, which determines how many URLs are read out and, if necessary, indexed. Operators of large websites are advised to tackle this topic strategically by signaling to search bots which areas of a given page should be crawled and which pages should be ignored. Important tools for index management include: robots data in meta tags, canonical tags, redirects, as well as the robots.txt file, which is what this tutorial is about.

What is robots.txt?

robots.txt is a text file that’s stored in the root directory of a domain. By blocking some or all search robots from selected parts of a site, this file allows website operators to control search engines’ access to websites. The information found in the robots.txt file refers to the entire directory tree. This latter aspect sets this indexing management tool significantly apart from robots meta tags and redirects, which only apply to individual HTML documents. The word ‘block’ should be given special attention in this context. Search engines interpret robots.txt merely as a guideline: the file can’t force any specific crawling behavior upon search engines. Google and other large search engines claim that they heed these instructions. However, the only way to prevent any unwarranted access is to implement strong password protection measures.

Creating a robots.txt

In order to give search bots access to individual crawling guidelines, a plain text file has to be named ‘robots.txt’ and then stored in the domain’s root directory. If, for example, crawling guidelines are to be defined for the domain example.com, then the robots.txt needs to be stored in the same directory as www.example.com. When accessed over the internet, the file can then be found at www.example.com/robots.txt. If the hosting model for the website doesn’t offer access to the server’s root directory, and instead only to a subfolder (e.g. www.example.com/user/), then implementing indexing management with a robots.txt file isn’t possible. Website operators setting up a robots.txt should use a plain text editor, like vi (Linux) or notepad.exe (Windows); when carrying out an FTP transfer, it’s also important to make sure that the file is transferred in ASCII mode. Online, the file can be created with a robots.txt generator. Given that syntax errors can have devastating effects on a web project’s indexing, it’s recommended to test the text file prior to uploading it. Google Search Console offers a tool for this.
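
Crawl rules can also be checked programmatically before relying on them. The following sketch uses Python’s standard urllib.robotparser module to load a robots.txt and query it for a specific user agent; the domain and paths are placeholders taken from the examples in this tutorial, not a real setup.

from urllib.robotparser import RobotFileParser

# Load the robots.txt of the (hypothetical) example domain
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# Ask whether a given user agent may fetch a given URL
print(parser.can_fetch("Googlebot", "https://www.example.com/temp/data.html"))
print(parser.can_fetch("Googlebot", "https://www.example.com/index.html"))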

robots.txt structure

Every robots.txt file contains records that are composed of two parts. The first part is introduced with the keyword user-agent and addresses a search bot, which can be given instructions in the second part. These instructions deal with rules for crawling bans. Initiated by the keyword disallow, these commands name a directory or multiple files. The result is the following basic structure:

user-agent: Googlebot
disallow: /temp/ 
disallow: /news.html
disallow: /print

The robots.txt in the example above only applies to web crawlers with the name ‘Googlebot’ and ‘prohibits’ them from reading out the directory /temp/ and the file news.html. Additionally, all files and directories with paths beginning with print are also blocked. Notice here how disallow: /temp/ and disallow: /print differ in syntax only by a missing slash (/) at the end, yet this accounts for a considerably different meaning in the robots.txt’s syntax.
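
The difference becomes clearer when the record is tested against concrete paths. The following sketch feeds the example rules into Python’s urllib.robotparser, which applies the same prefix matching; the sample URLs are invented for illustration.

from urllib.robotparser import RobotFileParser

rules = """\
user-agent: Googlebot
disallow: /temp/
disallow: /news.html
disallow: /print
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# /temp/ only blocks paths inside the /temp/ directory ...
print(parser.can_fetch("Googlebot", "/temp/report.html"))  # False (blocked)
print(parser.can_fetch("Googlebot", "/temperature.html"))  # True (allowed)

# ... while /print blocks every path that begins with "print"
print(parser.can_fetch("Googlebot", "/print/flyer.html"))  # False (blocked)
print(parser.can_fetch("Googlebot", "/printable.html"))    # False (blocked)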

Inserting comments

robots.txt files can be supplemented with comments if needed. These are labeled with a preceding hash sign (#).

# robots.txt for http://www.example.com
user-agent: Googlebot
disallow: /temp/ # directory contains temporary data 
disallow: /print/ # directory contains print pages
disallow: /news.html # file changes daily

Addressing multiple user agents

Should multiple user agents be addressed, then the robots.txt can contain any number of blocks written in accordance with this structure:

# robots.txt for http://www.example.com
user-agent: Googlebot
disallow: /temp/

user-agent: Bingbot
disallow: /print/

While Google’s web crawler is prohibited from searching through the directory /temp/, the Bing bot is prevented from crawling /print/.

Addressing all user agents

Should a certain directory or file need to be blocked for all web crawlers, then an asterisk (*) is used as a wildcard representing all user agents:

# robots.txt for http://www.example.com
user-agent: *
disallow: /temp/
disallow: /print/
disallow: /pictures/

The robots.txt file blocks the directories /temp/, /print/ and /pictures/ for all web crawlers.

Excluding all directories from indexing

Should a website need to be blocked completely for all user agents, then just a slash after the keyword disallow is needed:

# robots.txt for http://www.example.com
user-agent: *
disallow: /

All web crawlers are instructed to ignore the entire website. Such robots.txt files can be used, for example, for web projects that are still in their test phase.

Allowing indexing for all directories

Website operators can allow search bots to crawl and index an entire website by using the keyword disallow without a slash:

# robots.txt for http://www.example.com
user-agent: Googlebot
disallow:

If the robots.txt file contains a disallow without a slash, then the entire website is freely available to the web crawlers defined under user-agent.

Table 1: robots.txt’s basic functions

Command        Example                                      Function
user-agent:    user-agent: Googlebot                        Address a specific web crawler
user-agent:    user-agent: *                                Address all web crawlers
disallow:      disallow:                                    The entire website can be crawled
disallow:      disallow: /                                  The entire website is blocked
disallow:      disallow: /directory/                        A specific directory is blocked
disallow:      disallow: /file.html                         A specific file is blocked
disallow:      disallow: /example                           All directories and files with paths beginning with example are blocked

Further functions

In addition to the de facto standard functions listed above, search engines also support some additional parameters that can be used in the robots.txt.

The following functions can be found in Google’s support section. They are based on an agreement made with Microsoft and Yahoo!.

Defining exceptions

In addition to disallow, Google also supports allow, a further keyword in the robots.txt that enables exceptions to be defined for blocked directories.

# robots.txt for http://www.example.com
user-agent: Googlebot
disallow: /news/ 
allow: /news/index.html

The keyword allow enables the file http://www.example.com/news/index.html to be read by Googlebot despite the fact that the parent directory /news/ is blocked.
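
Google documents that, when several rules match, the most specific one (the one with the longest matching path) takes precedence, which is why the allow entry wins for index.html. The following is a simplified sketch of that precedence logic, written purely for illustration and not taken from any official implementation; it ignores wildcards and other details.

# Simplified illustration of the "most specific rule wins" precedence.
# The rules mirror the example record above.
rules = [
    ("disallow", "/news/"),
    ("allow", "/news/index.html"),
]

def is_allowed(path):
    # Collect all rules whose path is a prefix of the requested path
    matches = [(len(p), kind == "allow") for kind, p in rules if path.startswith(p)]
    if not matches:
        return True  # no rule matches: crawling is permitted
    # The longest matching path decides; on a tie, allow is preferred
    return max(matches)[1]

print(is_allowed("/news/index.html"))    # True  (the allow entry is more specific)
print(is_allowed("/news/archive.html"))  # False (only disallow /news/ matches)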

Blocking files with specific endings

Website operators wishing to prevent Googlebot from reading out files with specific endings can use records like the following example:

# robots.txt for http://www.example.com
user-agent: Googlebot
disallow: /*.pdf$

The disallow entry refers to all files ending in .pdf and protects them from Googlebot. The asterisk symbol (*) functions as a wildcard for any sequence of characters before the file ending. The entry is completed with a dollar sign ($), which serves as an anchor for the end of the URL.
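
Since the asterisk stands for any sequence of characters and the dollar sign anchors the end of the URL, the matching behavior can be illustrated by translating such a pattern into a regular expression. This is only a rough sketch of the matching logic, not how any particular crawler is actually implemented.

import re

def robots_pattern_to_regex(pattern):
    # Escape the path, then turn the robots.txt wildcards back into regex syntax:
    # "*" matches any sequence of characters, "$" anchors the end of the URL path.
    escaped = re.escape(pattern)
    return re.compile(escaped.replace(r"\*", ".*").replace(r"\$", "$"))

rule = robots_pattern_to_regex("/*.pdf$")

print(bool(rule.match("/downloads/manual.pdf")))  # True  -> matched, i.e. blocked
print(bool(rule.match("/manual.pdf.html")))       # False -> not matched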

Referring web crawlers to sitemaps

In addition to controlling crawling behavior, robots.txt files also allow search bots to be referred to a website’s sitemap. A robots.txt with a sitemap reference looks as follows:

# robots.txt for http://www.example.com
user-agent: *
disallow: /temp/
sitemap: http://www.example.com/sitemap.xml
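
Such a sitemap reference can also be read out programmatically; Python’s RobotFileParser offers a site_maps() method for this (available from Python 3.8). A minimal sketch based on the example file above:

from urllib.robotparser import RobotFileParser

robots_txt = """\
user-agent: *
disallow: /temp/
sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none
print(parser.site_maps())  # ['http://www.example.com/sitemap.xml']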

Table 2: expanded robots.txt functions

Command           Example                                       Function
allow:            allow: /example.html                          The entered file or directory may be crawled
disallow: /*…$    disallow: /*.jpg$                             Files with certain endings are blocked
sitemap:          sitemap: http://www.example.com/sitemap.xml   The XML sitemap is found under the entered address