What is Robots.txt? A Guide for SEOs

What is a Robots.txt File?

Robots.txt files are used to communicate to web robots how we want them to crawl our site. Placed at the root of a website, this file directs these robots on which pages they should or should not access. 

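Using robots.txt files helps webmasters keep search engine crawlers away from irrelevant or low-value content, so crawling is focused on the pages you want found. Keep in mind that robots.txt controls crawling, not indexing: a blocked URL can still appear in search results if other pages link to it. The file is also publicly readable, so it should not be relied on to protect sensitive content.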

This not only helps in controlling the site's bandwidth usage by reducing unnecessary crawler traffic, but also aids in keeping the search results clean and relevant for users, which can have a positive impact on your SEO.

Robots.txt Example

Here’s a basic example of what a robots.txt file may look like:

User-agent: *
Disallow: /private/
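
One way to see how a crawler might interpret this file is Python's standard-library urllib.robotparser module. The sketch below is illustrative only: the crawler name ExampleBot and the URLs are made up, and this parser is simpler than the ones search engines actually use.

```python
from urllib.robotparser import RobotFileParser

# The two-line example file shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name used for illustration.
private_ok = parser.can_fetch("ExampleBot", "https://www.example.com/private/report.html")
blog_ok = parser.can_fetch("ExampleBot", "https://www.example.com/blog/")

print(private_ok)  # False: the path starts with /private/
print(blog_ok)     # True: no rule matches /blog/
```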

How Does Robots.txt Work?

When search engine crawlers encounter a website, their first stop is the robots.txt file. This is like checking the rules before playing a game.

The robot reads the file to understand which parts of the site it's allowed to visit and which it should avoid. If the robots.txt specifies that certain areas of the site are off-limits, the robot will skip those sections and move on to the rest of the site that's open for exploration.

It’s worth noting that robots.txt is not foolproof, because compliance is voluntary. Reputable search engine crawlers generally follow these instructions, but other robots may ignore them.

How To Find a Robots.txt File

Robots.txt files should be placed at the root of the host they apply to, and each subdomain needs its own file.

Specifically, if your website is www.example.com, then the robots.txt file should be accessible at www.example.com/robots.txt (like www.seerinteractive.com/robots.txt!). This exact /robots.txt formatting is a requirement to ensure the crawlers don’t miss it.

By positioning it here, every visiting web robot knows exactly where to find it before exploring further.
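
As a small illustration of that fixed location, the sketch below derives the robots.txt address from any page URL on the same host. The function name robots_txt_url and the sample URL are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://www.example.com/blog/post?id=1"))
# https://www.example.com/robots.txt
```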

How to Read Robots.txt: Syntax and Examples

So, how do you actually read and interpret robots.txt files? Start with the four main fields described in Google’s documentation: user-agent, allow, disallow, and sitemap.

We’ll also review Crawl-delay and wildcards, which can provide additional control over how your site is crawled.

  1. Disallow: a URL path that should not be crawled
  2. Allow: a URL path that may be crawled
  3. User-agent: the crawler that the rules apply to
  4. Sitemap: the full location of the sitemap

User-agent examples include Googlebot, Bingbot, and DuckDuckBot.

User-Agent

What is it:

The robots.txt ‘user-agent’ is the name that identifies a crawler, usually reflecting its purpose and/or origin. Define user-agents when you want to give specific crawlers different access across your site.

Example: 

  • User-agent: Googlebot-Image

What it means:

This is a user-agent from Google for their image search engine. 

The directives following this will only apply to the ‘Googlebot-Image’ user agent.
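
To illustrate how rules are grouped per user-agent, here is a minimal sketch using Python's standard-library urllib.robotparser. The file contents, the /photos/ path, and the URLs are hypothetical and chosen only to show that each crawler is judged by the group that names it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one group for Googlebot-Image, one for every other crawler.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /photos/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

photo_url = "https://www.example.com/photos/cat.jpg"
image_bot_ok = parser.can_fetch("Googlebot-Image", photo_url)
other_bot_ok = parser.can_fetch("Bingbot", photo_url)

print(image_bot_ok)  # False: the Googlebot-Image group disallows /photos/
print(other_bot_ok)  # True: Bingbot falls back to the * group, which allows everything
```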

Wildcards

Two wildcard characters are used in the robots.txt file: * and $.

* (Match Sequence)

What is it

The * wildcard character matches any sequence of characters, including an empty one.

Example: 

  • User-agent: *

What it means

This addresses all user-agents for the directives following this line of instruction. 

$ (Match URL End)

What is it

The $ wildcard marks the end of the URL: a rule ending in $ only matches URLs that end exactly where the pattern ends.

Example: 

  • Disallow: /no-crawl.php$

What it means

The crawler would not access /no-crawl.php but could still access /no-crawl.php?crawl, because that URL does not end in .php.
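
The behavior of both wildcards can be sketched by translating a robots.txt path pattern into a regular expression. This is a simplified illustration, not how any particular search engine implements matching. The helper names are hypothetical, and details such as percent-encoding are ignored.

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * stand for "any sequence".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # A trailing $ pins the match to the end of the URL; otherwise it is a prefix match.
    return re.compile(body + (r"\Z" if anchored else ""))

def path_matches(pattern: str, path: str) -> bool:
    """True if the rule pattern applies to the given URL path."""
    return robots_pattern_to_regex(pattern).match(path) is not None

print(path_matches("/no-crawl.php$", "/no-crawl.php"))        # True
print(path_matches("/no-crawl.php$", "/no-crawl.php?crawl"))  # False
print(path_matches("/*.pdf$", "/docs/guide.pdf"))             # True
```

Without the trailing $, the same rule would also match /no-crawl.php?crawl, because rules are otherwise matched from the start of the path.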

Allow and Disallow

Allow

What is it

Robots.txt ‘Allow:’ directs the crawlers to crawl the site, section, or page. If there’s no path specified then the ‘Allow’ gets ignored. 

Example: 

  • Allow: /crawl-this/

What it means

URLs that begin with example.com/crawl-this/ can be accessed unless a more specific rule says otherwise.

Disallow

What is it

Robots.txt ‘Disallow:’ directs the crawlers to not crawl the specified site, section(s), or page(s). 

Example: 

  • Disallow: /?s=

What it means

URLs that begin with example.com/?s= should not be accessed unless a more specific rule says otherwise.

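💡 Note: if there are contradicting directives, the crawler will follow the more specific one. For Google, that means the rule with the longest matching path, and if two matching rules are equally specific, the less restrictive one (Allow) wins.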
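
The "more specific rule wins" idea can be sketched as a longest-match lookup. This is a simplified model that assumes plain path prefixes with no wildcards. The function name is_allowed, the rule format, and the /blog/ paths are made up for illustration.

```python
def is_allowed(path, rules):
    """Decide whether a path may be crawled.

    rules is a list of (directive, prefix) tuples, e.g. ("disallow", "/blog/").
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # The longer prefix wins; on a tie, prefer the less restrictive Allow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

# Hypothetical rules: block the blog, except for its public section.
RULES = [("disallow", "/blog/"), ("allow", "/blog/public/")]

print(is_allowed("/blog/public/post", RULES))  # True
print(is_allowed("/blog/drafts/post", RULES))  # False
print(is_allowed("/about", RULES))             # True
```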

Crawl Delay

What is it

The robots.txt crawl delay directive specifies the number of seconds a crawler should wait between requests to the site. It is a non-standard directive: Google ignores it, but some other search engines honor it.

Example: 

  • Crawl-delay: 10

What it means

The crawler should wait 10 seconds between requests to the site.
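
For crawlers that do honor the directive, reading it might look like the sketch below, which uses crawl_delay() from Python's standard-library urllib.robotparser. The crawler name ExampleBot and the polite_pause helper are hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns None when no Crawl-delay applies to the given crawler.
delay = parser.crawl_delay("ExampleBot")
print(delay)  # 10

def polite_pause(seconds):
    """Sleep between two requests, but only if a delay was requested."""
    if seconds:
        time.sleep(seconds)
```

A crawler built this way would call polite_pause(delay) between consecutive requests to the same site.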

Sitemap

What is it

The robots.txt sitemap field provides crawlers with the location of a website’s sitemap. The address is provided as the absolute URL. If more than one sitemap exists then multiple Sitemap: fields can be used. 

Example: 

  • Sitemap: https://www.example.com/sitemap.xml

What it means

The sitemap for https://www.example.com is available at the path /sitemap.xml.
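
A minimal sketch of reading Sitemap fields with Python's standard-library urllib.robotparser follows. It assumes Python 3.8 or later, where site_maps() is available, and the second sitemap URL is made up to show that multiple entries are collected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file listing two sitemaps.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/blog-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
print(sitemaps)
```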

Comments

Leave comments, or annotations, in your robots.txt file using the pound sign (#) to explain the intention behind specific rules. Crawlers ignore everything after the #, and the comments make the file easier for you and your coworkers to read, understand, and update.

Example:

  • # This is a comment explaining that the file allows access to all user agents
  • User-agent: *
  • Allow: /

Robots.txt Allow All Example

A simple robots.txt file that allows all user agents full access includes:

  1. The User-agent directive with the ‘match any’ wildcard character
    • User-agent: *
  2. Either an empty Disallow or an Allow with the forward slash
    • Disallow: or Allow: /
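
As a quick check that both variants behave the same way, the sketch below runs each through Python's standard-library urllib.robotparser. The allows helper, the crawler name ExampleBot, and the test URL are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Variant 1: an empty Disallow (the comment line is ignored by the parser).
EMPTY_DISALLOW = """\
# Allow every crawler to access everything
User-agent: *
Disallow:
"""

# Variant 2: an Allow with the forward slash.
ALLOW_SLASH = """\
User-agent: *
Allow: /
"""

def allows(robots_txt, url, user_agent="ExampleBot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

url = "https://www.example.com/any/page"
print(allows(EMPTY_DISALLOW, url))  # True
print(allows(ALLOW_SLASH, url))     # True
```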

💡 Note: adding the sitemap to the robots file is recommended but not mandatory.

Testing Robots.txt

Always test your robots file before and after implementing! You can validate your robots.txt file in Google Search Console.
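
Alongside Search Console, a small scripted check can catch obvious mistakes before a file goes live. The sketch below is one possible approach, not an official tool: find_mismatches, ExampleBot, and the URLs are hypothetical. Python's built-in parser applies the first matching rule and does not interpret wildcards in paths, so its results can differ from Google's.

```python
from urllib.robotparser import RobotFileParser

def find_mismatches(robots_txt, user_agent, expectations):
    """Return the URLs whose crawlability differs from what was expected.

    expectations maps each URL to True (should be crawlable) or False.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url, expected in expectations.items()
        if parser.can_fetch(user_agent, url) != expected
    ]

# Draft file and the behavior we expect from it.
DRAFT = """\
User-agent: *
Disallow: /private/
"""
EXPECTED = {
    "https://www.example.com/": True,
    "https://www.example.com/private/": False,
}

mismatches = find_mismatches(DRAFT, "ExampleBot", EXPECTED)
print(mismatches)  # an empty list means every URL behaved as expected
```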

If you think you need help creating or configuring your robots.txt file to get your website crawled more effectively, Seer is happy to help.

Final Thoughts On Robots.txt Files

The robots.txt file, which lives at the root of a domain, provides site owners with the ability to give directions to crawlers on how their site should be crawled. 

  • When used correctly, the file can help your site be crawled more effectively and provide additional information about your site to search engines.
  • When used incorrectly, the robots.txt file can be the reason your content isn’t able to be displayed within search results.

Pop Quiz!

Can you write a robots file that includes the following?

a) Links to the sitemap

b) Does not allow website.com/no-crawl to be crawled

c) Does allow website.com/no-crawl-robots-guide to be crawled

d) A time delay

e) Comments which explain what each line does

💡 Share your answers with us on Twitter (@seerinteractive)!
