How to Write a robots.txt File (with Examples)
robots.txt tells crawlers where they can and cannot go. Here is the syntax, real examples, the crawl-vs-index distinction, and handling AI crawlers like GPTBot.
The robots.txt file is one of the smallest files on your site and one of the easiest to get catastrophically wrong — a single stray character can hide your entire website from Google. In short: robots.txt is a plain-text file at your domain root (yourdomain.com/robots.txt) that tells crawlers which paths they may and may not request, using User-agent, Disallow, Allow and Sitemap directives; it controls crawling, not indexing, so it is the wrong tool for keeping a page out of search results. This guide covers the syntax, real-world examples, the crucial crawl-versus-index distinction, the mistakes that cause real damage, and how to handle AI crawlers like GPTBot and ClaudeBot.
It is the foundation of crawl control, so it pairs directly with your XML sitemap and with the question of whether AI crawlers can access your site.
What robots.txt is and what it does
robots.txt implements the Robots Exclusion Protocol, a long-standing convention that lets site owners give instructions to automated crawlers. When a well-behaved crawler arrives, it first fetches yourdomain.com/robots.txt, reads the rules, and obeys the ones that apply to its user agent before requesting anything else. Googlebot, Bingbot and the major AI crawlers all honour it.
What it is for is managing crawler traffic and steering crawlers away from areas you do not want crawled — duplicate sections, infinite filter combinations, internal search results, or heavy resources that waste crawl budget. What it is not for is security or privacy: it is a public file that anyone can read, and it is a request, not an enforcement mechanism, so it cannot protect anything. Misbehaving bots can simply ignore it.
The syntax: directives you need to know
robots.txt is built from a small set of directives, grouped into blocks by user agent:
- User-agent names the crawler the following rules apply to. User-agent: * is a wildcard matching any crawler without its own specific block.
- Disallow specifies a path the crawler may not request. Disallow: /admin/ blocks that folder; Disallow: / blocks the whole site; an empty Disallow: blocks nothing.
- Allow carves out an exception inside a disallowed area. Allow: /admin/public/ permits one subfolder within an otherwise blocked /admin/.
- Sitemap declares the full URL of your XML sitemap so crawlers can find it. It is independent of user-agent blocks.
Two wildcard characters are widely supported: * matches any sequence of characters, and $ matches the end of a URL. So Disallow: /*.pdf$ blocks URLs ending in .pdf. When multiple rules could match a URL, Google applies the most specific (longest matching path) rule, which is how Allow exceptions override broader Disallow rules.
Here is a small, well-formed file that ties the directives together:
User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
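To make the wildcard and longest-match behaviour described above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not any crawler's actual implementation: the helper names pattern_to_regex and is_allowed are hypothetical, rules are passed in as simple (directive, pattern) pairs, and a tie between an equally long Allow and Disallow is resolved in favour of Allow, which reflects Google's documented "least restrictive" behaviour.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into an anchored regular expression."""
    anchored = pattern.endswith("$")  # a trailing $ pins the match to the end of the URL
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then let each * stand for "any sequence of characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs, e.g. ("disallow", "/admin/")."""
    best = None  # (pattern length, is_allow) of the most specific matching rule
    for directive, pattern in rules:
        if not pattern:  # an empty Disallow: matches nothing
            continue
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            # Longer pattern wins; on equal length, Allow wins (assumption noted above).
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


rules = [
    ("disallow", "/admin/"),
    ("allow", "/admin/public/"),
    ("disallow", "/*.pdf$"),
]
print(is_allowed(rules, "/admin/settings"))         # blocked by /admin/
print(is_allowed(rules, "/admin/public/logo.png"))  # the longer Allow rule wins
print(is_allowed(rules, "/docs/guide.pdf"))         # blocked by the .pdf pattern
```

Running a handful of real paths from your own site through a check like this is a quick way to see whether an Allow exception is actually longer than the Disallow it is meant to override.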
Where the file goes
Placement is not optional: robots.txt must live at the root of the domain, at exactly https://yourdomain.com/robots.txt. Crawlers look only there — a file in a subdirectory is ignored entirely. Each subdomain and each protocol (http versus https) is treated as a separate host with its own robots.txt, so blog.example.com and example.com need their own files if you want different rules. The file must be served as plain text with a 200 status; if it returns a 404, crawlers assume there are no restrictions and crawl freely.
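Because the location is derived purely from the scheme and host, you can work out which robots.txt governs any page mechanically. The sketch below uses Python's standard urllib.parse; the function name robots_txt_url is a hypothetical helper introduced only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that applies to page_url (same scheme, same host)."""
    parts = urlsplit(page_url)
    # Keep scheme and host exactly as given; discard the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?id=1"))
print(robots_txt_url("https://blog.example.com/a/b/"))
```

The two calls return different URLs, which is the point: blog.example.com is governed by its own file, not by the one on example.com.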
Common patterns with examples
Most real robots.txt files are variations on a handful of patterns. Here are the useful ones:
Allow everything (the default, made explicit):
User-agent: *
Disallow:
Block one folder for all crawlers:
User-agent: *
Disallow: /private/
Block everything (use with extreme care — staging only):
User-agent: *
Disallow: /
Block a specific crawler, allow the rest:
User-agent: AhrefsBot
Disallow: /
User-agent: *
Disallow:
Block URL parameters (e.g. internal search):
User-agent: *
Disallow: /*?
The "block everything" example is the one to handle carefully: a Disallow: / left over from a staging environment, accidentally pushed to production, is one of the most common and damaging SEO mistakes there is. It tells every crawler to stay out of the entire site.
The crawl-versus-index distinction (the big one)
This is the single most misunderstood thing about robots.txt, so it deserves its own section. robots.txt controls crawling, not indexing.
If you Disallow a URL, crawlers will not fetch its content. But Google can still index the URL itself if other pages link to it — and because it was not allowed to read the page, it may show it in results with no description, just the URL, which looks broken. Worse, because Google cannot crawl the page, it can never see a noindex tag on it.
So the rule is:
| Goal | Right tool | Wrong tool |
|---|---|---|
| Stop a page appearing in search | noindex meta tag or header | robots.txt Disallow |
| Stop crawlers wasting time on a section | robots.txt Disallow | noindex (page still gets crawled) |
| Hide truly private content | Authentication / password | robots.txt (it is public) |
To reliably keep a page out of the index, you must let crawlers reach it and serve a noindex directive — the opposite of blocking it in robots.txt. Blocking a page you want de-indexed can actually keep it stuck in the results, because Google never crawls it to discover the noindex.
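The interaction between the two controls can be written down as a small decision function. This is a deliberately simplified sketch of the reasoning in this section, not a model of how any search engine is implemented; the function name search_outcome and its return strings are hypothetical. In practice the noindex signal is typically a robots meta tag with content="noindex" or an X-Robots-Tag: noindex response header.

```python
def search_outcome(blocked_by_robots, has_noindex):
    """Describe what can happen to a URL given its robots.txt and noindex state."""
    if blocked_by_robots:
        # The crawler never fetches the page, so any noindex on it goes unseen.
        return "not crawled; URL may still be indexed if other pages link to it"
    if has_noindex:
        return "crawled; kept out of the index"
    return "crawled; eligible for indexing"


for blocked in (False, True):
    for noindex in (False, True):
        print(blocked, noindex, "->", search_outcome(blocked, noindex))
```

Note that the has_noindex argument makes no difference once blocked_by_robots is true, which is exactly why blocking a page you want de-indexed backfires.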
The big mistakes to avoid
A few robots.txt errors cause real, measurable harm:
- Blocking CSS and JavaScript. Google renders pages to understand them, so blocking the resources it needs to render can hurt how it sees your page. Google's guidance is explicit: do not block assets required for rendering.
- Using it to hide private pages. robots.txt is public and lists exactly which paths you wanted hidden — a roadmap for anyone curious. Use authentication for anything genuinely sensitive.
- The accidental Disallow: /. As noted, a leftover staging rule can de-index a whole site. Always check this line first when traffic drops.
- Expecting it to remove indexed pages. Covered above: it does the opposite for already-linked URLs.
- Typos and case errors. Paths are case-sensitive and the syntax is unforgiving; a small mistake can block more (or less) than intended.
Because the blast radius is so large, test before you trust: Google Search Console includes a robots.txt report, and you can always fetch the live file in a browser to confirm it says what you think it says.
robots.txt and AI crawlers
robots.txt is also the primary place to manage AI crawlers, which makes it central to GEO. The major AI crawlers respect robots.txt and are identified by their own user-agent names, so you can allow or restrict each one individually:
| AI user agent | Operated by |
|---|---|
| GPTBot | OpenAI |
| CCBot | Common Crawl (feeds many models) |
| Google-Extended | Google's AI products |
| ClaudeBot | Anthropic |
| PerplexityBot | Perplexity |
For example, to keep OpenAI's crawler out entirely while allowing everyone else, you would add a named block:
User-agent: GPTBot
Disallow: /
User-agent: *
Disallow:
The decision of whether to allow or block these agents is strategic, not technical: allowing them enables citation and visibility in AI answers, while some publishers restrict them to control how their content is used. That trade-off is covered in depth in "Can AI crawlers access your site?". One subtlety worth knowing: Google-Extended governs use in Google's AI products and does not affect how Googlebot ranks you in classic search, so you can treat AI access separately from your normal SEO.
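If you manage the list of restricted agents in code or configuration, the named blocks can be generated rather than hand-edited. The following is a minimal sketch: the function build_robots_txt and its parameters are hypothetical, and it assumes you want each listed agent blocked from the whole site while everyone else is allowed.

```python
def build_robots_txt(blocked_agents, sitemap_url=None):
    """Build a robots.txt that blocks each named agent and allows all other crawlers."""
    lines = []
    for agent in blocked_agents:
        # One named group per agent, each disallowing the entire site.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # The wildcard group: an empty Disallow allows everything.
    lines += ["User-agent: *", "Disallow:"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


print(build_robots_txt(["GPTBot", "CCBot"], "https://example.com/sitemap.xml"))
```

Generating the file from an explicit list keeps the allow-or-block decision visible in one place, which helps when that decision is revisited.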
How robots.txt relates to llms.txt
It is easy to lump robots.txt and llms.txt together, but they do different jobs. robots.txt is an established standard about access — which crawlers may fetch which paths — and it applies to search and AI crawlers alike. llms.txt is a newer, proposed standard about understanding — a curated, described map of your best content for AI models. robots.txt gates the door; llms.txt is a guide for what to read once inside. A complete modern setup tends to have robots.txt (access), an XML sitemap (discovery) and, increasingly, llms.txt (AI curation), each doing its own complementary job.
How crawlers interpret your rules
Knowing how a crawler reads robots.txt helps you avoid surprises. When a crawler fetches the file, it looks for the most specific group of rules that matches its user-agent string. If there is a block named for that exact crawler, it uses that group and ignores the wildcard * group entirely — the groups do not combine. This catches people out: if you add a User-agent: GPTBot block with one rule, GPTBot follows only that block and no longer inherits your general * rules, so anything you intended to apply site-wide must be repeated inside the named block.
Within the matching group, when more than one rule could apply to a URL, the crawler follows the rule with the longest (most specific) path, which is precisely how an Allow exception overrides a broader Disallow. A few more behaviours are worth committing to memory:
- No robots.txt at all (a 404) is treated as "no restrictions": the whole site is crawlable.
- An empty Disallow: means nothing is blocked, the explicit way to allow everything.
- A server error (5xx) when fetching robots.txt may cause Google to temporarily pause crawling, treating the site as fully disallowed until the file is reachable again, which is why robots.txt availability matters as much as its contents.
- Comments start with # and are ignored by crawlers, useful for documenting why a rule exists.
Because the rules are matched per user agent and per longest path, the safest way to reason about a complex file is to ask, for a given crawler and URL, "which single group matches, and which single rule inside it is most specific?" That is exactly the logic the crawler applies.
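The "which single group, then which single rule" question can be sketched in a few lines of Python. This is a simplified illustration, not a full parser: the names parse_groups, rules_for and can_crawl are hypothetical, the crawler is assumed to be identified by its exact product token (case-insensitive), and paths are matched as plain prefixes with no wildcard support.

```python
SAMPLE = """\
# Keep OpenAI's crawler out of drafts only
User-agent: GPTBot
Disallow: /drafts/

User-agent: *
Disallow: /admin/
Allow: /admin/public/
"""


def parse_groups(text):
    """Return {lowercased user agent: [(directive, path), ...]}."""
    groups, agents, seen_rule = {}, [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # comments are ignored
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                agents, seen_rule = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in agents:
                groups[agent].append((field, value))
    return groups


def rules_for(groups, crawler):
    """A named group replaces the * group entirely; the two never combine."""
    return groups.get(crawler.lower(), groups.get("*", []))


def can_crawl(groups, crawler, path):
    best = None  # (prefix length, is_allow) of the most specific matching rule
    for directive, prefix in rules_for(groups, crawler):
        if prefix and path.startswith(prefix):
            candidate = (len(prefix), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


groups = parse_groups(SAMPLE)
print(can_crawl(groups, "GPTBot", "/admin/settings"))     # GPTBot does not inherit the * rules
print(can_crawl(groups, "Googlebot", "/admin/settings"))  # falls back to the * group
```

In this sample GPTBot may fetch /admin/settings even though every other crawler is kept out, which is the inheritance surprise described above: the /admin/ rule would need to be repeated inside the GPTBot block.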
Testing and auditing your robots.txt
Given how much damage a wrong rule can do, treat testing as mandatory rather than optional. There are several complementary ways to verify the file does what you intend:
- Read the live file. The simplest check is to open yourdomain.com/robots.txt in a browser and confirm it returns the rules you expect as plain text with a 200 status. Surprisingly often, the live file differs from what someone believes is deployed.
- Use Google Search Console. The robots.txt report shows the version Google last fetched, when it fetched it, and any parsing issues, so you can confirm Google is seeing your current file and reading it without errors.
- Test specific URLs. A robots.txt tester lets you enter a URL and a user agent and see whether that combination is allowed or blocked, which is the fastest way to confirm a tricky Allow/Disallow interaction resolves the way you meant.
- Crawl the site. A crawler (Screaming Frog-style) will report which URLs are blocked by robots.txt across the whole site, surfacing accidental blocks (like a disallowed CSS directory or a whole section gated by a stray rule) that are invisible when you only read the file top to bottom.
Build a habit of re-checking robots.txt after any platform change, plugin update, or migration, because these are the moments a Disallow: / or an over-broad rule sneaks back in. A broad audit tool that inspects crawlability, such as StackOptic, will flag robots.txt problems alongside the rest of a site's technical and AI-readiness signals, so you catch an accidental block before it quietly costs you traffic.
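The first of those checks is easy to automate after each deploy. The sketch below is one possible shape for such a check, not a complete validator: the function names fetch and audit_robots are hypothetical, only three conditions are tested (status, plain-text content type, and a blanket Disallow: / in the wildcard group), and the network call is kept separate so the checks can be run against any text.

```python
import urllib.error
import urllib.request


def fetch(url):
    """Return (status, content_type, body) for url, including for error responses."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", "replace")
            return resp.status, resp.headers.get("Content-Type", ""), body
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Content-Type", ""), ""


def audit_robots(status, content_type, body):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"not served as plain text: {content_type}")
    agents, in_rules = [], False
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a new group begins
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                problems.append("blanket 'Disallow: /' applies to all crawlers")
    return problems


# Example usage against a live site (requires network access):
# print(audit_robots(*fetch("https://example.com/robots.txt")))
print(audit_robots(200, "text/plain", "User-agent: *\nDisallow: /\n"))
```

Wiring a check like this into a deployment pipeline catches a leftover staging rule at the moment it ships rather than when traffic drops.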
A quick checklist
- The file is at the domain root and returns plain text with a 200 status.
- User-agent, Disallow, Allow and Sitemap directives are used correctly.
- CSS and JavaScript needed for rendering are not blocked.
- No stray Disallow: / is hiding the whole site.
- noindex, not robots.txt, is used to keep pages out of search.
- Truly private content is protected by authentication, not robots.txt.
- AI crawlers are allowed or restricted deliberately, by name.
- The sitemap is referenced, and changes are tested before going live.
Go deeper
- Help crawlers find your pages: how to create an XML sitemap and submit it.
- The AI access decision: can AI crawlers access your site?
- The AI guidance layer: what is llms.txt and how to check yours.
- The full audit: how to check if your site is ready for AI search.
Want to see your robots.txt, sitemap and crawler access checked automatically? Analyse any URL with StackOptic — a full technical and AI-readiness report, free, no sign-up.
Frequently asked questions
What is a robots.txt file?
A robots.txt file is a plain-text file placed at the root of your domain, at yourdomain.com/robots.txt, that gives instructions to web crawlers about which parts of your site they may and may not request. It follows the Robots Exclusion Protocol and uses simple directives like User-agent, Disallow and Allow. Well-behaved crawlers, including Googlebot and the major AI crawlers, read it before crawling and obey the rules that match them.
Does robots.txt stop a page from being indexed?
No, and this is the most important misconception. robots.txt controls crawling, not indexing. If you Disallow a page, crawlers will not fetch its content, but Google can still index the URL if other pages link to it, sometimes showing it with no description. To reliably keep a page out of search results, allow crawling and use a noindex meta tag or header instead, so the directive can actually be read.
Where does the robots.txt file go?
It must live at the root of your domain or subdomain, at exactly yourdomain.com/robots.txt. Crawlers only look in that one location; a robots.txt in a subfolder is ignored. Each subdomain and protocol is treated separately, so a site and its subdomains can each have their own file. The file must be served as plain text with a 200 status to be honoured.
How do I block AI crawlers in robots.txt?
Reference the AI crawler by its user-agent name and add a Disallow rule. For example, a User-agent: GPTBot line followed by Disallow: / blocks OpenAI's crawler from your whole site; the same pattern works for CCBot, Google-Extended, ClaudeBot and others. The major AI crawlers respect robots.txt, so naming them lets you allow or restrict each one individually depending on your GEO strategy.
What is the difference between robots.txt and llms.txt?
robots.txt is an established standard that controls which crawlers may access which paths; it is about permission and applies to search and AI crawlers alike. llms.txt is a newer, proposed standard that curates and describes your best content for AI models; it is about understanding and prioritisation, not access control. They are complementary: robots.txt gates the door, while llms.txt acts as a guide once a model is inside.