robots.txt Checker: Will This Site Allow Your Scraper?
Paste a robots.txt and the path you want to scrape. Get an instant verdict per user-agent. Runs in your browser — nothing sent anywhere.
How to read the results
- "allowed (no matching rule)": the path doesn't match any Disallow: rule for this user-agent. Scrape away.
- "allowed (specific Allow rule)": there's a Disallow: that would have blocked you, but a more specific Allow: overrides it. Scrape away.
- "blocked by Disallow: /path/": the user-agent is blocked from this path. Don't scrape (or use a different user-agent string if you have a legitimate reason).
- Crawl-delay: the minimum number of seconds the site asks you to wait between requests. Not legally binding, but worth respecting.
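The verdicts above can be reproduced with a small longest-match rule check. The following is a minimal sketch, not the checker's actual implementation: the `verdict` function and the `(directive, prefix)` rule format are hypothetical, matching is by plain path prefix (no `*` or `$` wildcards), and it assumes you have already picked out the rules for the relevant user-agent.

```python
def verdict(rules, path):
    """Return a verdict string for `path`.

    `rules` is a list of (directive, prefix) pairs for one user-agent group,
    e.g. [("disallow", "/private/"), ("allow", "/private/press/")].
    Hypothetical format, chosen for illustration only.
    """
    best = None  # (prefix length, is_allow, prefix) of the strongest match so far
    for directive, prefix in rules:
        # An empty pattern matches nothing; skip rules that don't cover the path.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        candidate = (len(prefix), is_allow, prefix)
        # Longest prefix wins; on equal length, Allow beats Disallow (True > False).
        if best is None or candidate[:2] > best[:2]:
            best = candidate

    if best is None:
        return "allowed (no matching rule)"
    _, is_allow, prefix = best
    if is_allow:
        return "allowed (specific Allow rule)"
    return f"blocked by Disallow: {prefix}"


rules = [("disallow", "/private/"), ("allow", "/private/press/")]
for p in ["/blog", "/private/press/2024", "/private/report"]:
    print(p, "->", verdict(rules, p))
```

The longer `/private/press/` prefix beats `/private/`, which is the "more specific Allow" case from the list.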
What robots.txt actually means
robots.txt is a voluntary politeness convention dating from 1994, formalized as RFC 9309 in 2022. It is not legally binding in most jurisdictions, and the legal weight of ignoring it is contested.
In practice:
- Major search engines generally honor it
- Major LLM crawlers (GPTBot, ClaudeBot, PerplexityBot) are documented by their operators as respecting it
- Most web scraping libraries don't enforce it by default
- Some sites auto-ban clients that ignore it
The recommendation: respect robots.txt for any crawler you build. It's a clear signal of what the site owner wants.
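Because most scraping libraries won't do this for you, one option is to filter URLs through Python's standard-library `urllib.robotparser` before fetching anything. This is a minimal sketch: the `filter_allowed` helper, the `example-bot` user-agent, and the example.com URLs are placeholders, and the robots.txt text is passed in as a string so the example runs without network access. Note that `urllib.robotparser` applies rules in file order rather than by longest match, so on files that mix Allow: and Disallow: its answer can differ from a strict RFC 9309 matcher.

```python
from urllib.robotparser import RobotFileParser


def filter_allowed(robots_txt, user_agent, urls):
    """Return only the URLs that `user_agent` may fetch under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots_txt = "User-agent: *\nDisallow: /private/\n"
urls = [
    "https://example.com/blog/post-1",
    "https://example.com/private/report",
]
print(filter_allowed(robots_txt, "example-bot", urls))
```

In a real crawler you would typically call `set_url()` and `read()` on the parser to download the site's live robots.txt instead of passing a string.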
Common patterns you'll see
User-agent: *
Disallow: /
Blocks all bots from everything. Rare but absolute. Don't scrape.
User-agent: GPTBot
Disallow: /
Site is opting out of OpenAI's training crawler. Their crawl, their call.
User-agent: *
Disallow: /admin/
Disallow: /search
Crawl-delay: 1
The standard polite robots.txt. Stay out of /admin/ and /search, and send at most one request per second elsewhere.
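A crawler could turn this file into a simple crawl plan: drop the disallowed paths and attach the requested delay to the rest. The sketch below uses the standard-library `urllib.robotparser`; the `plan_crawl` helper, the `example-bot` user-agent, the sample paths, and the 1-second fallback are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search
Crawl-delay: 1
"""


def plan_crawl(robots_txt, user_agent, paths, default_delay=1.0):
    """Return (path, seconds_to_wait) pairs for the paths the agent may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when the file sets no delay for this agent.
    delay = parser.crawl_delay(user_agent)
    wait = float(delay) if delay is not None else default_delay
    return [(path, wait) for path in paths if parser.can_fetch(user_agent, path)]


plan = plan_crawl(ROBOTS_TXT, "example-bot", ["/products/42", "/admin/users", "/search"])
print(plan)
```

A fetch loop would then call `time.sleep(wait)` between requests to stay within the site's requested rate.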
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
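Blocks everyone except Google: a crawler follows the group that names it specifically, so Googlebot uses its own Allow: / and ignores the * group. Aggressive; the site owner only wants their data in one engine.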
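To see the per-agent grouping in action, the same file can be checked for several user-agents with the standard-library `urllib.robotparser`. This is a sketch under stated assumptions: the `access_by_agent` helper, the `example-bot` agent, and the `/pricing` path are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
"""


def access_by_agent(robots_txt, agents, path):
    """Map each user-agent name to True/False for whether it may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A group naming the agent is consulted first; the * group is the fallback.
    return {agent: parser.can_fetch(agent, path) for agent in agents}


result = access_by_agent(ROBOTS_TXT, ["Googlebot", "example-bot"], "/pricing")
print(result)
```

Only the agent named in its own group gets through; every other agent falls back to the `*` group and is blocked.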
What to read next
- Web Scraping Legal & Ethics — full landscape including robots.txt's legal weight
- Web Scraping FAQ — 25 most-asked questions
- Getting Started with Web Scraping