Can AI Crawlers (GPTBot, ClaudeBot, PerplexityBot) Access Your Site?

AI engines can only cite content they can crawl. Here are the major AI user agents, how to check whether you are blocking them, and how to decide who to allow.

StackOptic Research Team | 11 Apr 2026 | 7 min read
Every GEO (generative engine optimisation) tactic is wasted if AI engines cannot read your content in the first place. Generative engines can only cite what their crawlers can fetch, which makes crawler access the foundation of AI visibility and one of its most commonly overlooked parts. This guide covers the major AI crawlers and their user agents, how to check whether you are (perhaps accidentally) blocking them, the ways blocking can happen below the robots.txt layer, and how to make a deliberate decision about who to allow.

This guide covers the access layer of the broader GEO audit; pair it with "What is llms.txt?" for the guidance layer.

Why crawler access is the foundation

Think of it as a gate that everything else sits behind. You can write the most quotable, best-structured, most authoritative page on the internet, but if GPTBot or PerplexityBot cannot fetch it, no AI engine can cite it — your effort is invisible. Conversely, simply allowing the right crawlers, on content you already have, can be the single change that moves you from absent to present in AI answers. That asymmetry is why access is the first thing to check and the cheapest thing to fix. It is also why an accidental block is so costly: it silently caps the return on all your other content work.

The major AI crawlers and their user agents

Each AI provider crawls with one or more named user agents. The most important to know are:

User agent                   | Operated by  | Purpose
GPTBot                       | OpenAI       | Crawls content (training and product use)
OAI-SearchBot / ChatGPT-User | OpenAI       | Search and on-demand fetching for ChatGPT
ClaudeBot / anthropic-ai     | Anthropic    | Crawls content for Claude
PerplexityBot                | Perplexity   | Crawls and cites for Perplexity answers
Google-Extended              | Google       | Governs use in Google's AI products
Applebot-Extended            | Apple        | Governs use in Apple's AI
CCBot                        | Common Crawl | Open crawl that feeds many models

You reference these strings in robots.txt to allow or disallow each one. Note that some, like Google-Extended, are AI-specific controls that do not affect the provider's classic search crawler: allowing or blocking Google-Extended does not change how Googlebot crawls your site or how your pages rank in classic search.

How robots.txt controls access

robots.txt, the plain-text file at your site root, is the primary mechanism. It lists rules per user agent, granting or denying access to paths. An AI crawler that respects the standard (the major ones do) will read your robots.txt and obey the rules that match its user-agent string, falling back to the wildcard * rules if there is no specific entry for it. So a blanket User-agent: * with a broad Disallow: / will keep AI crawlers out just as surely as a rule that names them. This is exactly why a carelessly copied robots.txt is such a common cause of accidental invisibility — a rule meant to hide a staging path can end up gating your whole site to every bot, AI included.
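The wildcard fallback described above can be checked with Python's standard-library `urllib.robotparser`. This is a minimal sketch: the robots.txt content and the example.com URL are invented for illustration, the helper name `can_agent_fetch` is hypothetical, and individual crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a blanket block plus one named exception.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /

User-agent: PerplexityBot
Allow: /
"""


def can_agent_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/pricing"  # placeholder URL
    for agent in ("GPTBot", "PerplexityBot"):
        # GPTBot has no group of its own, so the * rules decide its access.
        print(agent, can_agent_fetch(SAMPLE_ROBOTS, agent, page))
```

In this sample, GPTBot is kept out by the wildcard group even though no rule names it, while PerplexityBot is let in by its own group.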

How to check whether you are blocking them

Start by reading your own robots.txt: visit yourdomain.com/robots.txt and scan for Disallow rules under the wildcard agent and under any of the AI user agents above. A line like User-agent: GPTBot followed by Disallow: / means you are blocking OpenAI's crawler entirely; the absence of any AI-specific rules means they fall under whatever your * rules permit. Beyond robots.txt, confirm two more things: that your important pages return real content to a plain, script-free request (so crawlers that do not run JavaScript still see your content), and that no layer in front of your site is silently blocking these agents (see below). A tool that audits AI/GEO readiness will run all of these checks for you and report which agents can and cannot reach you.
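The manual scan above can be approximated in a few lines of Python. The sketch below takes robots.txt text you have already downloaded and reports, for each AI user agent, whether the site root is allowed and whether that result comes from a rule naming the agent or from the wildcard fallback. The function name `audit_robots`, the agent list, and the sample file are assumptions for illustration, not an official tool.

```python
from urllib.robotparser import RobotFileParser

# Agent names taken from the table in this article.
AI_AGENTS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Google-Extended", "Applebot-Extended", "CCBot",
)


def audit_robots(robots_txt: str, agents=AI_AGENTS, path: str = "/") -> dict:
    """Map each agent to (allowed, rule_source) for `path`.

    rule_source is "named" when robots.txt contains a User-agent line for
    that agent, otherwise "wildcard" (the * group, or no rule at all).
    """
    named = set()
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            # Drop trailing comments and normalise case.
            named.add(value.split("#")[0].strip().lower())

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    report = {}
    for agent in agents:
        source = "named" if agent.lower() in named else "wildcard"
        report[agent] = (parser.can_fetch(agent, path), source)
    return report


# Hypothetical robots.txt: GPTBot blocked by name, staging hidden from everyone.
EXAMPLE = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /staging/\n"

if __name__ == "__main__":
    for agent, (allowed, source) in audit_robots(EXAMPLE).items():
        print(f"{agent:18} allowed={allowed!s:5} via {source} rule")
```

To run it against a live site, the text could first be fetched with `urllib.request.urlopen` from yourdomain.com/robots.txt. This only covers the robots.txt layer, so the rendering and firewall checks still need to be done separately.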

Blocking that happens below robots.txt

robots.txt is not the only place access is decided. Several layers can block AI crawlers regardless of what robots.txt says:

  • CDN and WAF firewalls. A bot-management rule on Cloudflare, Akamai or similar can challenge or block AI agents, sometimes by default, treating them like scrapers to be stopped.
  • Security and SEO plugins. Some content-management plugins include a setting to block AI bots, which a site owner may have toggled without thinking through the GEO consequences.
  • Aggressive rate limiting. Throttling that is too tight can cause crawlers to give up before they index much.
  • JavaScript-only rendering. If your content only appears after client-side scripts run, crawlers that do not execute JavaScript see an empty page — an effective block even with a permissive robots.txt.

The practical lesson is to verify access end to end, not just in robots.txt, because the gate can be anywhere between the request and your content.
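One of these layers, JavaScript-only rendering, is easy to approximate offline. The sketch below counts the words a script-free client would see in a page's raw HTML, ignoring anything inside script, style and template tags. The names `visible_word_count` and `looks_js_only`, the 20-word threshold, and both sample pages are assumptions chosen for illustration; a real check would feed in the HTML returned by a plain HTTP request to your own page.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that is present in the HTML without running any scripts."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(html: str) -> int:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


def looks_js_only(html: str, min_words: int = 20) -> bool:
    """Heuristic: very little static text suggests client-side rendering."""
    return visible_word_count(html) < min_words


# Invented samples: an empty app shell and a server-rendered page.
SPA_SHELL = ('<html><head><title>App</title><script src="/app.js"></script></head>'
             '<body><div id="root"></div><script>render()</script></body></html>')
SERVER_RENDERED = ("<html><body><h1>Pricing plans</h1>"
                   "<p>Three plans are available.</p></body></html>")

if __name__ == "__main__":
    print(visible_word_count(SPA_SHELL), visible_word_count(SERVER_RENDERED))
```

A low count is only a hint, since short pages exist, but it is a quick way to flag pages worth inspecting by hand.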

Should you allow them? A deliberate decision

Allowing AI crawlers is a strategy choice, not a default to accept blindly. On one side, allowing them is the GEO-positive path: it enables your content to be cited in AI answers, which can build visibility, authority and referral traffic, and it is how most businesses want to be positioned as AI search grows. On the other side, some publishers — particularly those whose business is their content — choose to restrict certain agents to retain control over how their work is used or to negotiate terms. Both positions are legitimate. The key is to decide per agent and on purpose: you might welcome search-and-cite agents like PerplexityBot and OAI-SearchBot while taking a different view on broad training crawlers. Whatever you choose, the failure mode to avoid is an accidental decision — being blocked without knowing it, or being open without having considered it.

How to allow them properly and verify

If your goal is GEO visibility, make sure your robots.txt explicitly permits the agents you want, and that no firewall or rendering issue overrides that. After any change, verify rather than assume: re-read the live robots.txt, test fetching a page the way a simple crawler would, and check back over time, because plugins and CDN settings can reintroduce blocks during unrelated updates. Treat crawler access as something you confirm on a schedule, the same way you would monitor uptime — because a silent block is just as damaging to AI visibility as downtime is to everything else.
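One way to keep the decision and the file in sync is to write the per-agent policy down once, generate robots.txt from it, and then check any robots.txt text against the same policy. The sketch below does this with the standard library. The policy shown is only an example of the "welcome search-and-cite agents, restrict broad crawlers" stance, not a recommendation, and `render_robots` and `mismatches` are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# Example policy only: True means allow the whole site, False means block it.
EXAMPLE_POLICY = {
    "PerplexityBot": True,
    "OAI-SearchBot": True,
    "GPTBot": False,
    "CCBot": False,
}


def render_robots(policy: dict, default_allow: bool = True) -> str:
    """Build robots.txt text with one group per agent plus a * group."""
    groups = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}\n")
    default_rule = "Allow: /" if default_allow else "Disallow: /"
    groups.append(f"User-agent: *\n{default_rule}\n")
    return "\n".join(groups)  # blank line between groups


def mismatches(robots_txt: str, policy: dict, path: str = "/") -> list:
    """Return the agents whose access under robots_txt differs from policy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent, allowed in policy.items()
            if parser.can_fetch(agent, path) != allowed]


if __name__ == "__main__":
    generated = render_robots(EXAMPLE_POLICY)
    print(generated)
    print("mismatches:", mismatches(generated, EXAMPLE_POLICY))
```

Running `mismatches` against the live file on a schedule is one way to notice when a plugin or deployment has rewritten robots.txt. It does not detect blocks applied by a CDN or firewall.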

What happens after you allow them

Allowing a crawler is the start, not the finish. After you open access, the engine has to actually crawl your pages, store or index what it finds, and then decide — at answer time — whether your content is the best source for a given question. That means citations do not appear the instant you change robots.txt; there is a lag while the crawler revisits and while your content competes on quality. It also means access alone is necessary but not sufficient: once the door is open, everything else in GEO — structure, sourcing, authority, freshness — determines whether you are actually chosen. The right mental model is that allowing crawlers buys you a ticket to the competition, and the rest of your optimisation decides whether you win it.

Legitimate AI crawlers versus bad bots

A fair concern when opening access is telling genuine AI crawlers apart from scrapers that merely impersonate them. User-agent strings can be faked, so a request claiming to be GPTBot is not proof that it is. The major providers publish the IP ranges their crawlers use and, in some cases, support reverse-DNS verification, which lets you confirm a crawler is what it says before treating it as such. This matters because the sensible posture is rarely "block all bots" or "allow everything"; it is to welcome verified, well-behaved AI crawlers while still defending against abusive scraping and automated attacks. Your CDN or WAF can usually distinguish the two, so configure it to verify rather than to block indiscriminately. Indiscriminate blocking is how sites accidentally lock out the very crawlers they want.
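The IP-range check can be sketched with Python's `ipaddress` module. The ranges below are placeholders taken from the reserved documentation blocks, not any provider's real addresses; in practice you would load the ranges each provider publishes and refresh them regularly. The function name `is_verified` is hypothetical.

```python
import ipaddress

# PLACEHOLDER ranges (RFC 5737 documentation blocks), NOT real crawler IPs.
# Replace with the ranges each provider actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["203.0.113.0/28"],
}


def is_verified(claimed_agent: str, remote_ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """True if remote_ip falls inside a range listed for the claimed agent."""
    address = ipaddress.ip_address(remote_ip)
    for cidr in ranges.get(claimed_agent, []):
        if address in ipaddress.ip_network(cidr):
            return True
    return False


if __name__ == "__main__":
    # A request that says "GPTBot" from inside and outside the listed range.
    print(is_verified("GPTBot", "192.0.2.44"))
    print(is_verified("GPTBot", "198.51.100.7"))
```

A request that claims a crawler's name but fails this check is a candidate for your normal anti-scraping rules rather than for crawler treatment.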

Monitoring crawler activity in your logs

The most direct way to know what is really happening is your server access logs. They record every request by user agent, so you can see which AI crawlers visit, how often, and which pages they fetch — ground truth that no amount of guessing about robots.txt can replace. Watching the logs tells you whether GPTBot or PerplexityBot is actually crawling after you allowed it, whether a crawler is being throttled or challenged before it gets far, and whether your most important pages are being fetched at all. For a serious GEO effort, periodically reviewing crawler activity in your logs (or in a log-analysis tool) closes the loop between "we allowed them" and "they are genuinely reading us," and surfaces silent problems long before they show up as missing citations.
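As a rough illustration of that review, the sketch below tallies requests per AI crawler and HTTP status from access-log lines. It assumes the common "combined" log format used by many web servers, and the sample lines, IP addresses and shortened user-agent strings are invented. Adjust the pattern to your own server's format.

```python
import re
from collections import Counter, defaultdict

AI_AGENTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
             "PerplexityBot", "CCBot")

# Assumes combined log format: "REQUEST" STATUS BYTES "REFERER" "USER-AGENT"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def summarise(lines) -> dict:
    """Return {agent: Counter({status: hits})} for known AI user agents."""
    summary = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        user_agent = match.group("ua").lower()
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                summary[agent][int(match.group("status"))] += 1
                break
    return dict(summary)


# Invented sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [11/Apr/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [11/Apr/2026:10:00:02 +0000] "GET /blog/ HTTP/1.1" 200 8801 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.9 - - [11/Apr/2026:10:01:00 +0000] "GET /pricing HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.77 - - [11/Apr/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 4000 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]

if __name__ == "__main__":
    for agent, statuses in summarise(SAMPLE_LOG).items():
        print(agent, dict(statuses))
```

A crawler that appears only with 403 or 429 responses is likely being challenged or throttled by a layer in front of the site. A crawler that never appears may be blocked earlier or may simply not have visited yet.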

Frequently asked questions

How do I check if AI crawlers can access my site?

Open yourdomain.com/robots.txt and look for any Disallow rules that apply to AI user agents such as GPTBot, ClaudeBot, PerplexityBot, Google-Extended and CCBot, or to the wildcard * agent. Also confirm your content renders as real HTML without requiring JavaScript, and that no CDN or firewall rule is blocking these agents. A GEO/AI-readiness tool like StackOptic checks crawler access for you.

What are the main AI crawler user agents?

The most important ones are GPTBot and OAI-SearchBot (OpenAI / ChatGPT), ClaudeBot and anthropic-ai (Anthropic / Claude), PerplexityBot (Perplexity), Google-Extended (Google's AI products), Applebot-Extended (Apple) and CCBot (Common Crawl, which feeds many models). Each is identified by its user-agent string, which is what you reference in robots.txt to allow or disallow it.

Should I allow or block AI crawlers?

It depends on your goals. If you want visibility in AI answers and the traffic and authority that can follow, allow them — that is the GEO-positive choice. If you are a publisher concerned about your content being used to train or answer without attribution, you may choose to restrict some of them. There is no universally right answer; decide deliberately per agent, and make sure your robots.txt actually reflects the decision.

Can a site block AI crawlers without meaning to?

Yes, and it is common. A robots.txt copied from a template, a security or SEO plugin with a bot-blocking setting, or a CDN/WAF firewall rule that treats AI agents as scrapers can all block them silently. Sites that render content only via JavaScript can also be effectively invisible to crawlers that do not execute scripts. That is why verifying access is the first step in any GEO audit.

Does allowing AI crawlers affect my normal SEO?

Allowing AI-specific agents like GPTBot or Google-Extended does not change how your pages are crawled by Googlebot or ranked in classic search; they are separate user-agent tokens with separate purposes. Google-Extended, for instance, governs use in Google's AI products without affecting standard Search indexing. So you can allow AI crawlers for GEO while your traditional SEO continues exactly as before.
