Web Scraping Legal & Ethics: 2026 State of Play
"Is web scraping legal?" is the question I get asked most often. The answer most lawyers give is some version of "it depends, please retain me at $500/hour to give you a proper answer."
I'm not a lawyer. What follows is the practical landscape I navigate when running my own data business and shipping client work. Not legal advice. Use it to understand the terrain; consult a lawyer for specific risks.
The 2026 baseline
The single most important ruling: in hiQ Labs v. LinkedIn (Ninth Circuit, 2022), the court held that scraping publicly accessible web data likely does not violate the Computer Fraud and Abuse Act (CFAA), the federal anti-hacking law that was historically the biggest US legal threat to scrapers.
This is the foundation everything else rests on. In the US, scraping public data is unlikely to violate the CFAA. State laws and contractual restrictions still apply: later in the same litigation, the district court found that hiQ had breached LinkedIn's User Agreement.
The rules I actually follow
After 200+ scraping engagements, these are the rules that keep me safe:
- Public data only. If it requires login or auth to see, I don't scrape it for clients without their explicit credentials and their written confirmation that scraping their own logged-in account is OK.
- Respect robots.txt even though it's not legally binding. It's a clear signal of intent.
- Set a real User-Agent that identifies who I am or includes contact info. Hiding intent escalates anti-scraping responses; transparency de-escalates.
- Don't compete with paid APIs the source sells. If the site sells the data via API, scraping that data for free is a direct contractual challenge. Legal exposure is much higher.
- Don't aggregate personal data at scale without GDPR/CCPA compliance review.
- Honor Retry-After headers and 429 responses. Hammering a server until it falls over is a different conversation from polite scraping.
- Stop on cease-and-desist. Every C&D I've received, I've stopped, documented, and complied. None has escalated.
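The User-Agent and Retry-After rules above are the easiest ones to encode. Here is a minimal sketch in Python, standard library only. The USER_AGENT string, the retry_delay helper, the 300-second cap, and treating 503 the same as 429 are all my own illustrative assumptions, not a standard; adapt them to your HTTP client.

```python
import email.utils
import time
from datetime import timezone
from typing import Optional

# Hypothetical identity string: name the bot and give a way to reach you.
USER_AGENT = "example-scraper/1.0 (+https://example.com/bot-info; ops@example.com)"


def retry_delay(status: int, retry_after: Optional[str], attempt: int,
                now: Optional[float] = None, cap: float = 300.0) -> Optional[float]:
    """Return seconds to wait before retrying, or None if no back-off is needed."""
    if status not in (429, 503):  # assumption: back off on 503 as well as 429
        return None
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():
            # Retry-After given as a number of seconds
            return min(float(value), cap)
        try:
            # Retry-After given as an HTTP date
            when = email.utils.parsedate_to_datetime(value)
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            current = time.time() if now is None else now
            return min(max(when.timestamp() - current, 0.0), cap)
    # No usable header: fall back to capped exponential back-off
    return min(2.0 ** attempt, cap)
```

In a real fetch loop you would send USER_AGENT as the User-Agent request header, then sleep for whatever retry_delay returns before trying the same URL again.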
What's clearly OK
- Scraping publicly visible product catalogs (most e-commerce)
- Scraping public business directories (BBB, Yelp's public listings)
- Scraping public government data (SEC EDGAR, USGS, World Bank, EU open-data portals)
- Scraping public Wikipedia / Wikidata / Wikimedia content (Wikipedia text is CC BY-SA; Wikidata is CC0)
- Scraping public academic data (arXiv, PubMed, OpenAlex, Crossref)
- Scraping public job boards that don't explicitly forbid it (RemoteOK, We Work Remotely)
- Scraping with explicit ToS allowing automated access (Reddit had this until 2023; some still do)
What's clearly NOT OK
- Scraping data behind paid APIs (Twitter/X, Reddit since 2023, LinkedIn — all explicitly forbid it)
- Scraping personal data at scale without GDPR/CCPA basis (EU residents especially)
- Bypassing technical access controls (CAPTCHA-evading at scale, exploiting auth bugs)
- Using scraped data to clone a service or compete head-on with the source
- Scraping copyrighted content for redistribution (news articles, paid books, image libraries)
The gray zones
These are the situations where the answer is "it depends" — I either get explicit client sign-off in writing, or I decline.
Personal data on professional networks (LinkedIn, AngelList)
The profiles are public, but the operating company explicitly forbids scraping in its ToS and has a track record of litigating. Even with the hiQ ruling at the federal level, you can lose a contract-law case.
My rule: don't, ever. Use the official API or pay Apollo/Lusha/ZoomInfo.
News articles and editorial content
Copyrighted, but quotation/research-oriented use can fall under fair use. Aggregating headlines is generally OK; reproducing full articles is not.
My rule: extract metadata and excerpts. Don't redistribute the full text.
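A minimal sketch of what "metadata and excerpts" can look like in a pipeline, assuming you already have the article fields parsed. The article_record helper and the 280-character limit are illustrative choices of mine, not a legal threshold; fair use is not decided by a character count.

```python
import textwrap


def article_record(url: str, title: str, published: str, body: str,
                   max_chars: int = 280) -> dict:
    """Keep metadata plus a short excerpt; the full body is never stored."""
    # shorten() collapses whitespace and cuts at a word boundary
    excerpt = textwrap.shorten(body, width=max_chars, placeholder="...")
    return {
        "url": url,
        "title": title,
        "published": published,
        "excerpt": excerpt,
    }
```

The point of the design is that the full text never reaches storage, so nothing downstream can republish it by accident.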
E-commerce competitor data
Generally OK if public, but pay attention to how the data is used. Building a "Walmart-priced" search competitor on Walmart's data is the kind of thing that triggers C&D letters and contract suits.
My rule: scrape for monitoring and analytics. Don't republish the data as a competing product.
Reviews and user-generated content
User-generated content (Yelp reviews, Amazon product reviews) is a contract minefield. The site has terms with users about how their content can be used; users have terms with the site about what they own.
My rule: extract aggregated metrics (star ratings, review counts) freely; don't republish full user reviews.
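One way to enforce that rule in code is to reduce reviews to numbers before anything is stored. This is a rough sketch under my own assumptions: the ReviewSummary type, the summarize helper, and the "rating" and "text" field names are hypothetical and will differ per site.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Mapping


@dataclass(frozen=True)
class ReviewSummary:
    product_id: str
    review_count: int
    average_rating: float


def summarize(product_id: str, reviews: Iterable[Mapping]) -> ReviewSummary:
    """Aggregate star ratings only; review text and author names are dropped."""
    ratings = [float(r["rating"]) for r in reviews if r.get("rating") is not None]
    average = round(mean(ratings), 2) if ratings else 0.0
    return ReviewSummary(product_id, len(ratings), average)
```

For example, summarize("sku-1", [{"rating": 5, "text": "great"}, {"rating": 4, "text": "fine"}]) yields a summary with review_count 2 and average_rating 4.5, and the review text never leaves the function.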
Personal email harvesting
Even from public web pages, email harvesting can trigger CAN-SPAM (US) and GDPR (EU) issues if you use the emails for unsolicited marketing.
My rule: only extract emails when the client has a legitimate-interest basis for contacting them and will mention how they got the contact info.
Jurisdictional differences
United States
- The hiQ ruling sets the federal baseline: scraping public data is not a CFAA violation
- State laws vary; California's CCPA + some state computer-trespass laws add wrinkles
- Contract law (ToS) is the realistic risk vector, not criminal law
European Union (GDPR)
- Personal data of EU residents is regulated regardless of where you scrape from
- "Legitimate interest" is the most relevant lawful basis for scraped business contact data
- Mass-aggregation of personal data without basis is a hard no
United Kingdom
- Similar to EU under UK GDPR
- Powerful Computers vs Glasswall (2022) reinforced that "publicly available" doesn't mean "use freely"
Other major jurisdictions
- Canada: PIPEDA applies to personal data; commercial data generally fair game
- Australia: a privacy framework similar to a lighter version of GDPR
- Japan, South Korea: less aggressive enforcement but copyright laws are strong
The role of robots.txt
robots.txt is a polite-protocol convention from 1994. It is not legally binding in most jurisdictions. The legal weight of ignoring it is contested.
In practice: respect it anyway. It's a clear signal of intent. Ignoring it raises the chances a site escalates from auto-block to lawyer-letter.
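Checking robots.txt before fetching takes a few lines with Python's standard urllib.robotparser. The sketch below parses an in-memory example file so it runs without network access; the ROBOTS_TXT content, the AGENT name, and the example.com URLs are placeholders I made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content; in practice you would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "example-scraper"  # hypothetical bot name


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch(AGENT, "https://example.com/products/1")
blocked = parser.can_fetch(AGENT, "https://example.com/private/report")
delay = parser.crawl_delay(AGENT)  # seconds between requests, if the site declares one

print(allowed, blocked, delay)
```

Against a live site, the usual pattern is to call set_url() with the site's robots.txt URL and then read() instead of parse(), and to skip any URL for which can_fetch() returns False.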
ToS as a contract
Most sites' Terms of Service include a clause forbidding automated access. Whether ToS forms an enforceable contract for someone who hasn't created an account is legally murky.
In practice:
- If you've created an account → strong contract claim against you
- If you've only visited, without creating an account → weaker but not zero claim
- If the data is genuinely public and the ToS is the only barrier → the hiQ line of cases generally protects you
Practical advice for freelancers
When a client asks me to scrape something I'm uncomfortable with, my response template:
"This site's ToS forbids automated access and they've actively pursued similar cases. The data you want is available via their official API ($X/month) or a third-party data provider (Apollo / Lusha / Bright Data datasets). Both are likely cheaper than the legal cost if this escalates. Want me to spec the API integration instead?"
Most clients respect this. The few who don't are clients I don't want.
What I never do
- Scrape behind a login the client doesn't own
- Resell scraped personal data
- Build "API replacements" that compete with paid APIs the source sells
- Hide who I am via fake User-Agent headers (impersonation is a different legal exposure than scraping)
- Help clients circumvent published access restrictions to data they have no right to
The summary heuristic
If you'd be uncomfortable explaining to the source site's lawyer exactly what you scraped, why, and how — don't do it. Most legitimate scraping work passes that test trivially.
What to read next
- Getting Started with Web Scraping — your first scraper
- Web Scraping FAQ — the 25 most-asked questions
- Web Scraping Tools Comparison — Scrapy vs Playwright vs etc.
If you're unsure whether a specific job is in safe territory, email me at info@luba.media with the source URL and what you're trying to extract. I'll give you my honest take, free.
---
Disclaimer: I am not a lawyer. This is the practical landscape I navigate, not legal advice. For specific risk assessment, consult a lawyer in your jurisdiction.
Hire me to build this for your site
I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.