How to Protect Your Website from Bots and Scrapers
Not all bots are bad. Tell good crawlers from abusive scrapers, spot the signals of bot traffic, and layer rate limiting, CAPTCHA, a WAF and bot management.
If you want to protect your website from bots and scrapers, the first thing to accept is that you cannot, and should not, block all automated traffic. A large share of most sites' traffic is automated, and much of it comes from bots you actively want: the search-engine crawlers that index you, the uptime monitors that watch you, the AI assistants that increasingly send referrals. The real task is to separate the bots you want from the ones abusing your site, and stop the latter without harming the former. This guide explains the difference, the signals that expose abusive automation, and the layered defences (rate limiting, CAPTCHA, a WAF with bot management, honeypots and fingerprinting) that actually work.
It sits alongside how to protect your website from common attacks and leans on the same protective layer described in what is a web application firewall (WAF).
Good bots versus bad bots
Automation is not the enemy; abusive automation is. It helps to picture three buckets.
Good bots do work you benefit from. Search-engine crawlers (Googlebot, Bingbot and others) index your pages so people can find them. AI crawlers from assistants and answer engines fetch content that may surface your brand. Uptime and performance monitors check that your site is up and fast. Social and preview bots generate the link cards when someone shares your URL. Blocking these is self-harm — you lose visibility, monitoring or shareability.
Bad bots exist to extract value from you or to attack. Content scrapers copy your articles, listings or prices wholesale, sometimes to republish, sometimes to undercut you. Price and inventory scrapers track your catalogue for competitors. Credential-stuffing and brute-force tools hammer your login with leaked passwords. Spam bots flood your forms and comments. Vulnerability scanners probe for weaknesses to exploit. Inventory-hoarding bots grab limited stock before humans can. None of these benefit you, and several are outright malicious.
Grey-area bots sit in between — aggressive SEO-tool crawlers, academic scrapers, or commercial data harvesters — which you may tolerate, throttle or block depending on your priorities.
The defining question is never "is this a bot?" but "is this bot's behaviour acceptable?" That framing keeps you from the classic mistake of a blunt rule that blocks Googlebot alongside the scrapers and quietly wrecks your search visibility.
The signals of abusive bot traffic
Bots betray themselves through behaviour that humans do not produce. No single signal is proof, but several together are a strong indication — the same weigh-the-evidence logic that applies when judging whether a website is safe.
- Implausible request rates. A single IP or small range fetching hundreds of pages a minute is not a person. Humans pause, read and click; bots do not.
- Targeted page patterns. Traffic concentrated on your login, search, pricing or API endpoints rather than spread naturally across the site suggests scraping or credential abuse.
- Missing or fake user-agents. Real browsers send consistent, recognisable user-agent strings. Blank, generic (a bare scripting-library default) or obviously spoofed agents are a flag — though sophisticated bots forge convincing ones, so treat this as supporting evidence.
- No assets, cookies or JavaScript. Many bots fetch only HTML and never load the images, CSS, fonts or scripts a browser would, never accept cookies, and never execute JavaScript. A "visitor" that ignores all of that is usually automated.
- Datacentre IP origins. Genuine consumer traffic comes largely from residential and mobile networks. A surge from cloud and hosting providers' IP ranges often signals automation — although determined operators rent residential proxies to hide.
- Bandwidth and load spikes that do not match any real campaign or event, sometimes hitting deep or obscure pages a human would rarely reach.
- Uniform, robotic timing — perfectly regular intervals, no natural variation, identical headers across thousands of requests.
| Signal | What it suggests | Caveat |
|---|---|---|
| Hundreds of requests/min from one source | Scraping or brute force | Could be a shared corporate NAT — verify |
| Traffic focused on login/pricing/API | Credential stuffing or price scraping | Cross-check failed-login and error rates |
| Missing or generic user-agent | Unsophisticated bot | Advanced bots spoof real ones |
| Never loads images/CSS/JS, no cookies | Headless scraper | Some monitors behave similarly |
| Surge from datacentre IP ranges | Automated traffic | Residential proxies evade this |
| Regular, identical-timing requests | Scripted automation | Confirm with header fingerprinting |
Your server access logs and analytics are where these patterns surface first, so reviewing them periodically is the cheapest detection you have. For reading what requests and responses actually contain, see how to read a website's HTTP headers.
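As a rough illustration of that log review, the sketch below scores clients against three of the signals listed above: request rate, missing asset loads and generic user-agents. It assumes your log lines have already been parsed into simple records; the `Hit` type, the `flag_clients` function, the thresholds and the agent prefixes are all hypothetical choices for this example, not values from any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass

ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".svg", ".woff2")
# Prefixes of some scripting-library default agents (illustrative, not exhaustive).
GENERIC_AGENT_PREFIXES = ("python-requests", "curl", "go-http-client")


@dataclass
class Hit:
    ip: str
    ts: float         # request time in Unix seconds
    path: str
    user_agent: str


def flag_clients(hits, max_per_minute=120, min_hits_for_asset_check=20):
    """Return {ip: [signal names]} for clients that trip at least one signal."""
    by_ip = defaultdict(list)
    for hit in hits:
        by_ip[hit.ip].append(hit)

    flagged = {}
    for ip, rows in by_ip.items():
        signals = []

        # Signal: the busiest one-minute bucket exceeds the threshold.
        per_minute = defaultdict(int)
        for row in rows:
            per_minute[int(row.ts // 60)] += 1
        if max(per_minute.values()) > max_per_minute:
            signals.append("high_rate")

        # Signal: plenty of requests, but never a static asset.
        paths = [row.path.split("?", 1)[0].lower() for row in rows]
        if len(rows) >= min_hits_for_asset_check and not any(
            p.endswith(ASSET_SUFFIXES) for p in paths
        ):
            signals.append("no_assets")

        # Signal: blank or scripting-library default user-agent.
        agents = {row.user_agent.strip().lower() for row in rows}
        if any(a in ("", "-") or a.startswith(GENERIC_AGENT_PREFIXES) for a in agents):
            signals.append("generic_agent")

        if signals:
            flagged[ip] = signals
    return flagged
```

A client that trips several signals at once is a far better candidate for throttling than one that trips a single signal, which matches the caveats in the table: treat each flag as evidence to weigh, not a verdict.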
robots.txt is advisory, not a fence
Before the real defences, clear up a persistent misconception: robots.txt cannot stop a bad bot. It is a plain text file at the root of your site that asks crawlers to avoid certain paths, and well-behaved bots like Googlebot honour it; Google documents the standard and obeys it. But compliance is entirely voluntary. A scraper or attacker either never requests robots.txt at all, or reads it and ignores it.
Worse, because robots.txt openly lists the directories you would rather not have crawled, it can act as a map of where your sensitive or interesting areas live — handing a curious adversary a hint. So use robots.txt for what it is good at: politely steering cooperative crawlers away from low-value or duplicate pages, and pointing them at your sitemap. Never treat it as a security control. Real enforcement happens in the layers below. (For the constructive use of the file, see how to write a robots.txt file.)
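To make the "advisory" point concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` name and the `polite_crawler_may_fetch` helper are made up for illustration. The thing to notice is where the check runs: inside the crawler, by the crawler's own choice.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt steering crawlers away from low-value pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart/
Sitemap: https://example.com/sitemap.xml
"""


def polite_crawler_may_fetch(robots_txt, user_agent, url):
    """What a cooperative crawler asks itself before requesting a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A cooperative crawler calls something like this and skips `/search`. An abusive one simply never makes the call, and nothing on your server notices the difference, which is why enforcement has to live elsewhere.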
The defences that actually work
Effective bot defence is layered — no single control catches everything, so you stack measures that each raise the cost of abuse.
Rate limiting and throttling
The foundation. Rate limiting caps how many requests a given client (by IP, token, session or fingerprint) may make in a window, and throttling slows them once they cross a threshold. This blunts scraping and brute-force attempts without affecting normal users, who never come close to the limits. Apply tighter limits to expensive or sensitive endpoints — login, search, API — than to ordinary pages. Rate limiting alone will not stop a distributed attack spread across thousands of IPs, but it defeats the many crude, single-source bots and forces sophisticated ones to slow down.
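One common way to implement this is a sliding-window counter per client. The sketch below is a minimal in-memory version, assuming a single process; the `SlidingWindowLimiter` class and its parameters are illustrative names, and a real deployment would more likely use the limiter built into your web server, CDN or a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock                  # injectable so it can be tested
        self._hits = defaultdict(deque)     # client key -> recent timestamps

    def allow(self, client_key):
        now = self.clock()
        hits = self._hits[client_key]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False                    # over the limit: reject or slow down
        hits.append(now)
        return True
```

The client key can be an IP, an API token, a session or a fingerprint. To apply tighter limits to sensitive endpoints, you could keep separate limiter instances, for example a strict one for login and a looser one for ordinary pages.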
CAPTCHA and invisible challenges
A challenge asks the client to prove it is human (or at least not an obvious bot). Modern challenges are far less annoying than the old "type the squiggly letters" boxes: Cloudflare Turnstile and hCaptcha can run largely invisibly, scoring the client in the background and only escalating to an interactive test when suspicious. Google's reCAPTCHA works similarly. Place challenges on the highest-risk actions — registration, login, password reset, checkout, comment and contact forms — rather than across the whole site, where they would add friction for everyone. A challenge is a strong filter against automated submission while staying nearly frictionless for genuine visitors.
A web application firewall with bot management
A web application firewall (WAF) inspects incoming traffic and blocks malicious requests before they reach your application, and most modern WAFs (typically delivered through a CDN) include bot management that classifies traffic as human, known-good bot or likely-bad bot using reputation, behaviour and device signals. This is the most capable single layer for bot defence because it adapts: it maintains threat intelligence on known bad sources and patterns, allow-lists verified good crawlers, and challenges or blocks the rest. For the full picture of what a WAF does, see what is a web application firewall (WAF). A WAF is a layer on top of secure code and the other measures here, not a replacement for them.
Honeypots
A honeypot is a trap a human never springs but a bot does. The classic version is a hidden form field — invisible to real users via CSS — that automated submitters fill in regardless; any submission with that field populated is flagged as a bot and discarded. A variant is a hidden link that only an indiscriminate crawler would follow, marking the follower as suspicious. Honeypots are cheap, add zero friction for real visitors, and catch a lot of unsophisticated spam. Advanced bots can learn to avoid them, so they are one layer among several, not a complete answer.
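The hidden-field version can be sketched in a few lines. The field name `website`, the CSS class and the `is_bot_submission` helper below are hypothetical; the form data is assumed to arrive as a plain dictionary from whatever framework you use.

```python
# Hypothetical honeypot field name. Pick something a bot finds tempting.
HONEYPOT_FIELD = "website"

# The field is present in the markup but hidden from people with CSS,
# and kept out of the tab order and autofill.
FORM_HTML = """
<form method="post" action="/contact">
  <input name="email" type="email">
  <textarea name="message"></textarea>
  <input name="website" class="hp-field" tabindex="-1" autocomplete="off">
  <button type="submit">Send</button>
</form>
<style>.hp-field { position: absolute; left: -9999px; }</style>
"""


def is_bot_submission(form):
    """True when the hidden field was filled in, which a human would not do."""
    value = form.get(HONEYPOT_FIELD) or ""
    return bool(value.strip())
```

Flagged submissions are usually best discarded silently, so the bot's operator gets no feedback about which field gave it away.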
Behavioural and device fingerprinting
Rather than trusting easily-forged signals like the user-agent, fingerprinting builds a fuller picture from many attributes — browser and device characteristics, header order and consistency, TLS handshake details, and on-page behaviour such as mouse movement, timing and navigation patterns. Bots struggle to convincingly mimic the full profile of a real browser driven by a real person, so fingerprinting helps distinguish a genuine visitor from a headless tool dressed up to look like one. It is typically a feature of commercial bot-management products and is what lets them catch sophisticated bots that defeat simpler checks.
Blocking by behaviour and ASN, not single IPs
Blocking individual IP addresses is fragile: attackers rotate through huge pools of addresses, so a single-IP block is obsolete almost immediately, and you risk blocking a shared corporate or carrier gateway that many real users sit behind. More durable approaches:
- Behaviour-based blocking — act on what a client does (rate, targeted endpoints, lack of asset loading) rather than who it appears to be, so the rule survives IP rotation.
- ASN-level decisions — an Autonomous System Number identifies the network an IP belongs to. If abusive traffic clearly originates from a hosting/datacentre ASN that should not be sending you human visitors, you can throttle or challenge that whole network — more robust than chasing individual IPs, though still to be applied carefully to avoid collateral damage.
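The two approaches above can be combined in a small decision function. This sketch assumes you have some way to map an IP to its ASN; here that is a hard-coded table using documentation-only address ranges and ASNs, whereas a real system would use an IP-to-ASN database or your CDN's metadata. `HOSTING_NETWORKS`, `CHALLENGED_ASNS`, `lookup_asn` and `decide` are all illustrative names.

```python
import ipaddress

# Hypothetical IP-to-ASN table. The ranges and ASNs are reserved for
# documentation, so they stand in for real hosting networks.
HOSTING_NETWORKS = {
    ipaddress.ip_network("198.51.100.0/24"): 64500,
    ipaddress.ip_network("203.0.113.0/24"): 64501,
}

# Networks you have decided to challenge after seeing sustained abuse.
CHALLENGED_ASNS = {64500}


def lookup_asn(ip):
    """Return the ASN for an IP, or None if it is not in the table."""
    addr = ipaddress.ip_address(ip)
    for network, asn in HOSTING_NETWORKS.items():
        if addr in network:
            return asn
    return None


def decide(ip, abusive_behaviour=False):
    """Pick an action from network origin and observed behaviour."""
    if lookup_asn(ip) in CHALLENGED_ASNS:
        return "challenge"      # whole-network decision, survives IP rotation
    if abusive_behaviour:
        return "throttle"       # behaviour-based, regardless of origin
    return "allow"
```

Returning "challenge" rather than "block" for a whole network is a deliberate choice in this sketch: a real person who happens to sit behind that network can still get through, which limits the collateral damage.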
Protecting APIs and login flows specifically
Two areas deserve dedicated attention because they are the prime targets.
APIs are attractive to scrapers because they return clean, structured data. Defend them with authentication (require keys or tokens, not open access), per-key rate limiting and quotas, pagination limits so a single call cannot dump your whole dataset, and monitoring for keys behaving abnormally. Never assume an undocumented API is safe by obscurity — bots find them.
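Two of those API controls, pagination caps and per-key quotas, are simple enough to sketch. The limits, the `clamp_page_size` helper and the `DailyQuota` class are assumptions made up for this example; in practice the counters would live in a shared store rather than process memory.

```python
from collections import defaultdict

MAX_PAGE_SIZE = 100      # illustrative cap; no single call can exceed this
DEFAULT_PAGE_SIZE = 25


def clamp_page_size(requested):
    """Bound the client's requested page size to a safe range."""
    if requested is None:
        return DEFAULT_PAGE_SIZE
    return max(1, min(int(requested), MAX_PAGE_SIZE))


class DailyQuota:
    """Track how many requests each API key has used on a given day."""

    def __init__(self, limit):
        self.limit = limit
        self._used = defaultdict(int)   # (api_key, day) -> requests used

    def try_consume(self, api_key, day, cost=1):
        bucket = (api_key, day)
        if self._used[bucket] + cost > self.limit:
            return False                # quota exhausted for this key today
        self._used[bucket] += cost
        return True
```

Together these bound how much data one key can extract per day to roughly the page cap multiplied by the quota, and a key that repeatedly hits its quota is exactly the abnormal behaviour worth monitoring.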
Login flows are the target of credential stuffing, where leaked username/password pairs are replayed at scale in the hope that someone reused a breached password. The layered defence:
- Multi-factor authentication (MFA) — the single most effective control, because a correct password alone no longer grants access. Enable it everywhere you can, especially on admin accounts.
- Rate limiting and lockouts on repeated failed attempts from the same source.
- A challenge (Turnstile/hCaptcha) on the login and password-reset flows.
- Monitoring for failed-login spikes, a clear fingerprint of a stuffing attack.
- A WAF with bot management to block known credential-stuffing tools before they reach the form.
Credential stuffing is one of the common attacks covered more broadly in how to protect your website from common attacks; the bot-specific angle is simply that it is automation, and the anti-bot layers above are what stop it at volume.
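The lockout and spike-monitoring items from the list above can be sketched together, since both come down to counting recent failed logins. `FailedLoginMonitor` and its thresholds are hypothetical, the caller is assumed to pass timestamps in seconds, and state is kept in memory for simplicity.

```python
from collections import defaultdict, deque


class FailedLoginMonitor:
    """Count recent failed logins per source and across the whole site."""

    def __init__(self, max_failures=5, window=900, spike_threshold=100):
        self.max_failures = max_failures        # per-source lockout trigger
        self.window = window                    # look-back period in seconds
        self.spike_threshold = spike_threshold  # site-wide alert trigger
        self._by_source = defaultdict(deque)
        self._all = deque()

    def record_failure(self, source, now):
        self._by_source[source].append(now)
        self._all.append(now)

    def _prune(self, timestamps, now):
        # Drop failures that are older than the look-back window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()

    def is_locked_out(self, source, now):
        recent = self._by_source[source]
        self._prune(recent, now)
        return len(recent) >= self.max_failures

    def spike_detected(self, now):
        self._prune(self._all, now)
        return len(self._all) >= self.spike_threshold
```

The site-wide counter matters because a stuffing run spread across many sources may never trip any single source's lockout, yet still shows up as a jump in total failures.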
Balancing defence with not blocking the bots you want
The recurring risk in all of this is over-blocking. A defence aggressive enough to stop every scraper can also block Googlebot, Bingbot, legitimate AI crawlers and your own monitoring — and the damage from de-indexing yourself can dwarf the harm the scrapers were doing. Keep the balance with a few habits:
- Allow-list known good bots explicitly, so your aggressive rules never apply to them.
- Verify legitimate crawlers rather than trusting their user-agent. Google publishes a method to confirm Googlebot by reverse DNS lookup (the IP should resolve to a googlebot.com or google.com host and forward-resolve back), and other major operators document similar verification. This catches bots pretending to be Googlebot to slip past your defences.
- Target behaviour, not automation per se: challenge or throttle abusive patterns, not every non-human request.
- Decide your AI-crawler policy deliberately. You may welcome AI assistants for the referrals and visibility, or restrict them — but make it a conscious choice, applied precisely, not an accident of a broad block.
- Monitor what you block. Review your block and challenge logs for false positives, and confirm in Google Search Console that crawling is healthy.
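The reverse-then-forward DNS check for Googlebot can be sketched with Python's standard `socket` module. The function name and the injectable lookup parameters are choices made for this example so the logic can be exercised without network access. The default forward lookup here only handles IPv4, and the accepted host suffixes are the two named in this guide, so check Google's current documentation before relying on them.

```python
import socket

# Host suffixes named in this guide for genuine Googlebot addresses.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-resolve back."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, IPv4 address list).
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False            # reverse DNS is not a Google host
        # The forward lookup must lead back to the same IP. Otherwise anyone
        # who controls their own reverse DNS could claim a Google hostname.
        return ip in forward_lookup(host)
    except OSError:
        return False                # lookup failed: treat as unverified
```

Because DNS lookups are slow, a real deployment would normally cache the result per IP and run the check only for requests whose user-agent claims to be Googlebot.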
A quick bot-defence checklist
- Review server logs and analytics for the bot signals above.
- Use robots.txt to guide good crawlers — never as a security control.
- Apply rate limiting and throttling, tighter on login, search and API endpoints.
- Add an invisible challenge (Turnstile/hCaptcha) to high-risk forms.
- Put a WAF with bot management in front of the site.
- Plant honeypots on forms to catch unsophisticated bots for free.
- Authenticate, rate-limit and quota your APIs; cap pagination.
- Enable MFA and failed-login monitoring to defeat credential stuffing.
- Prefer behaviour- and ASN-based decisions over fragile single-IP blocks.
- Allow-list and verify Googlebot, Bingbot and legitimate AI crawlers so you never block the bots you want.
Go deeper
- The protective layer in depth: what is a web application firewall (WAF).
- The broader defensive picture: how to protect your website from common attacks.
- Guide cooperative crawlers properly: how to write a robots.txt file.
- Read the raw signals: how to read a website's HTTP headers.
Want a fast read on a site's headers, configuration and defensive posture? Analyse any URL with StackOptic — free, no sign-up.
Frequently asked questions
How do I know if bots are hitting my website?
Look for patterns humans do not produce: a single IP or small range making hundreds of requests per minute, traffic concentrated on your login, search or pricing pages, requests with missing, generic or spoofed user-agents, visits that never load images, CSS or JavaScript, and sudden bandwidth or server-load spikes that do not match real campaigns. Your server logs and analytics usually reveal this. Several of these signals together strongly indicate automated traffic rather than genuine visitors.
Will blocking bots hurt my SEO?
Only if you block the wrong ones. Search engines such as Google and Bing use crawlers you actively want to allow, and many AI assistants now crawl too. The risk is a blunt rule that catches Googlebot along with the scrapers. Avoid that by verifying legitimate crawlers (Google publishes how to confirm Googlebot by reverse DNS), allow-listing known good bots, and targeting your defences at abusive behaviour rather than blocking automation wholesale. Done carefully, bot defence does not harm SEO.
Can robots.txt stop scrapers?
No. robots.txt is a published request that crawlers voluntarily honour; well-behaved bots like Googlebot obey it, but a scraper or attacker can simply ignore it. It is a coordination tool for cooperative crawlers, not a security control, and because it lists paths you would rather not have crawled it can even hint at where sensitive areas are. Use it to guide good bots, and rely on rate limiting, a WAF and bot management to actually stop the bad ones.
What is the best way to stop credential stuffing on my login page?
Credential stuffing replays leaked username and password pairs against your login form at scale, so the defence is to make automated, high-volume attempts impractical. Combine multi-factor authentication (which stops most attacks even when a password is correct), rate limiting and lockouts on repeated failures, a challenge such as Turnstile or hCaptcha on the login flow, and monitoring for spikes in failed logins. A web application firewall with bot management can block known credential-stuffing tools before they reach the form.
What is a honeypot for catching bots?
A honeypot is a trap that humans never trigger but bots do. A common form is a hidden form field, invisible to real users via CSS, that automated submitters fill in anyway, flagging the submission as a bot. Another is a hidden link that only an indiscriminate crawler would follow. Honeypots are cheap, friction-free for genuine visitors (unlike a visible CAPTCHA), and useful as one layer of a wider defence, though sophisticated bots can learn to avoid them, so they are not sufficient alone.