What is a robots.txt File and Why Does It Matter?
A robots.txt file is a plain text file placed at the root directory of your website (e.g., https://yourdomain.com/robots.txt). It follows the Robots Exclusion Protocol (REP), a long-standing web standard (formalized as RFC 9309 in 2022) that lets website owners tell search engine crawlers which parts of the site they may crawl and which they should stay out of. Note that it controls crawling, not indexing.
Before crawling a site, well-behaved search engine bots (like Googlebot, Bingbot, or Yandexbot) fetch its robots.txt file and follow the rules in it, usually caching the file for a while rather than re-requesting it on every visit. This makes robots.txt one of the most powerful, yet often overlooked, tools in technical SEO.
Key Directives Explained
- User-agent: Specifies the crawler the rule applies to. A * wildcard means the rule applies to all crawlers.
- Disallow: Tells the crawler NOT to access a specific URL path. A blank Disallow line means everything on the site is allowed.
- Allow: Explicitly permits access to a URL path, even if a broader Disallow rule blocks the parent directory.
- Sitemap: Points crawlers to the absolute URL of your XML sitemap, helping them discover all your pages more efficiently.
- Crawl-delay: Asks bots to wait a specified number of seconds between requests, which can protect server resources from aggressive crawlers. It is a non-standard directive: some crawlers honor it, but Googlebot ignores it.
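To make the interaction between Allow and Disallow concrete, here is a minimal Python sketch of the "most specific rule wins" precedence described in RFC 9309 and in Google's documentation: the longest matching path prefix decides, and Allow wins a tie. The names `Rule` and `is_allowed` are hypothetical, the sketch ignores the `*` and `$` wildcards, and individual crawlers may resolve conflicts differently.

```python
from typing import List, Tuple

# Each rule is (directive, path_prefix); directive is "allow" or "disallow".
Rule = Tuple[str, str]


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Return True if `path` may be crawled under `rules` (prefix matching only)."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if prefix == "":
            continue  # an empty Disallow/Allow value matches nothing
        if not path.startswith(prefix):
            continue
        # A longer prefix wins; on a tie, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and directive == "allow"):
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed


rules = [("disallow", "/private/"), ("allow", "/private/press/")]
print(is_allowed(rules, "/private/notes.html"))      # False
print(is_allowed(rules, "/private/press/kit.html"))  # True
print(is_allowed(rules, "/blog/"))                   # True
```

Because the longer `/private/press/` prefix is more specific than `/private/`, the Allow rule carves an exception out of the blocked directory.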
Practical Example
A typical robots.txt file for a standard website looks like this:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://www.yourdomain.com/sitemap.xml
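One way to sanity-check a file like this before deploying it is Python's standard-library `urllib.robotparser`. The sketch below parses the example above from a string and asks whether two illustrative URLs (the paths `/admin/login` and `/blog/post-1` are made up for the demonstration) may be fetched. Note that Python's parser applies rules in file order and stops at the first match, which is not exactly how every search engine resolves conflicts, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://www.yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory; no network request

base = "https://www.yourdomain.com"
print(parser.can_fetch("Googlebot", base + "/admin/login"))  # False
print(parser.can_fetch("Googlebot", base + "/blog/post-1"))  # True
print(parser.site_maps())  # ['https://www.yourdomain.com/sitemap.xml']
```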
Which Bots Should You Block?
Most legitimate search engine bots (Googlebot, Bingbot, Yandexbot) should be allowed to crawl your website. However, you may want to control access for:
- AI Training Bots: GPTBot (OpenAI), ClaudeBot (Anthropic), and CCBot (Common Crawl) collect content that is used to train large language models. You can block these if you prefer your content not to be used for AI training.
- Scrapers: Aggressive bots that scrape content without adding SEO value. These rarely respect robots.txt, so a Disallow rule alone is unlikely to stop them; reducing the load they cause usually means blocking them at the server or firewall level.
- Redundant Bots: If you don't target certain markets, you can safely disallow bots like Baiduspider (China) or YandexBot (Russia) to cut crawl traffic that brings you no benefit.
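If you maintain a list of crawlers to block, the file can be generated rather than edited by hand. The following is a rough sketch, not a definitive tool: `build_robots_txt` is a hypothetical helper that emits one `Disallow: /` group per blocked user agent, followed by a default group that allows everyone else.

```python
from typing import Iterable, Optional


def build_robots_txt(blocked_agents: Iterable[str], sitemap_url: Optional[str] = None) -> str:
    """Build robots.txt text that blocks the named crawlers and allows all others."""
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Default group: an empty Disallow permits everything for everyone else.
    lines += ["User-agent: *", "Disallow:"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


print(build_robots_txt(["GPTBot", "ClaudeBot", "CCBot"],
                       "https://www.yourdomain.com/sitemap.xml"))
```

Keeping the blocked agents in one list makes it easier to review the policy and to add or remove a crawler later.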
Frequently Asked Questions (FAQs)
Does disallowing a page in robots.txt remove it from Google?
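No. Disallowing a URL in robots.txt prevents Google from crawling it, but if other pages link to it, Google may still index the URL without visiting it. To keep a page out of search results, add a <meta name="robots" content="noindex"> tag and leave the page crawlable, because Google can only see the tag if it is allowed to fetch the page. For a faster but temporary removal, use the Removals tool in Google Search Console.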
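As an illustration of what a crawler looks for, the sketch below uses Python's standard `html.parser` to detect a robots meta tag containing `noindex`. The names `RobotsMetaParser` and `has_noindex` are hypothetical, and the check is deliberately narrow: it ignores bot-specific tags such as `name="googlebot"` and the `X-Robots-Tag` HTTP header, both of which can also carry a noindex directive.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # content is a comma-separated list, e.g. "noindex, follow"
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))  # True
```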
Where should I place the robots.txt file?
The file must be placed in the root directory of your domain, accessible at https://www.yourdomain.com/robots.txt. It cannot be placed in a subdirectory and still function correctly. After uploading, verify it's accessible at that URL.
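Because the location is fixed, the robots.txt URL for any page can be derived from the page's scheme and host alone. Here is a small sketch using `urllib.parse`; `robots_txt_url` is a hypothetical helper name, and `shop.yourdomain.com` is an invented subdomain. It also shows that each host is covered by its own file, so a subdomain needs a separate robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs `page_url` (same scheme and host)."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected an absolute URL, got {page_url!r}")
    # Drop the path, query, and fragment; robots.txt always lives at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.yourdomain.com/blog/post?id=7"))
# https://www.yourdomain.com/robots.txt
print(robots_txt_url("https://shop.yourdomain.com/cart"))
# https://shop.yourdomain.com/robots.txt
```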
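Does Googlebot respect the Crawl-delay directive?
No. Googlebot does not support the Crawl-delay directive. Google also retired the crawl rate limiter setting in Search Console in early 2024, so Googlebot now adjusts its crawl rate automatically based on how your server responds. Other bots such as Bingbot and Slurp have typically respected this directive, but support varies by crawler, so check the documentation for each bot you care about.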
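For crawlers that do honor the directive, the delay is declared per user-agent group. The sketch below uses `urllib.robotparser` (its `crawl_delay` method is available from Python 3.6) to read the value back; the `/search/` path and the 10-second delay are invented for illustration. It only reports what the file declares and says nothing about whether a given bot will actually obey it.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10
Disallow: /search/

User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.crawl_delay("bingbot"))    # 10
print(parser.crawl_delay("Googlebot"))  # None (the * group sets no delay)
```

A polite custom crawler could read this value and sleep for that many seconds between requests to the same host.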
Is robots.txt a security measure?
No! Robots.txt is a public file that anyone can read. Listing your private directories in robots.txt can actually expose them to malicious users who check this file specifically. For real security, use HTTP authentication, server-level access rules, or application-level authentication.
Should I block AI training bots like GPTBot?
This is a personal or business decision. If you don't want your content used to train AI models like ChatGPT, you can add Disallow: / under User-agent: GPTBot. Note that compliance is voluntary: ethical crawlers will respect it, but unethical ones may not.