Eyal Rosenthal · Web scraping at scale

Web Scraping Legal & Ethics: 2026 State of Play

"Is web scraping legal?" is the question I get asked most often. The answer most lawyers give is some version of "it depends, please retain me at $500/hour to give you a proper answer."

I'm not a lawyer. What follows is the practical landscape I navigate when running my own data business and shipping client work. Not legal advice. Use it to understand the terrain; consult a lawyer for specific risks.

The 2026 baseline

The single most important ruling: hiQ Labs v. LinkedIn (Ninth Circuit, 2022) held that scraping publicly accessible web data likely does not violate the Computer Fraud and Abuse Act (CFAA), the federal anti-hacking law that was historically the biggest US legal threat to scrapers.

This is the foundation everything else rests on. In the US, scraping public data is unlikely to be a federal computer crime, although the ruling binds only courts within the Ninth Circuit. State laws and contractual restrictions still apply.

The rules I actually follow

After 200+ scraping engagements, the rules that keep me safe are these:

  1. Public data only. If it requires login or auth to see, I only scrape it when the client supplies their own credentials and confirms in writing that scraping their own logged-in account is OK.
  2. Respect robots.txt even though it's not legally binding. It's a clear signal of intent.
  3. Set a real User-Agent that identifies who I am or includes contact info. Hiding intent escalates anti-scraping responses; transparency de-escalates.
  4. Don't compete with paid APIs the source sells. If the site sells the data via API, scraping that data for free undercuts a commercial product and invites a contract claim. Legal exposure is much higher.
  5. Don't aggregate personal data at scale without GDPR/CCPA compliance review.
  6. Honor Retry-After headers and 429 responses. Hammering a server until it falls over is a different conversation from polite scraping.
  7. Stop on cease-and-desist. Every C&D I've received, I've stopped, documented, and complied. None has escalated.
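
Rules 3 and 6 are the two that translate directly into code. Below is a minimal Python sketch using only the standard library, not a production client. The User-Agent string, the 30-second fallback wait, and the function names retry_after_seconds and polite_get are all assumptions made for illustration. Treating 503 the same as 429 is also a choice of this sketch.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical identity string: swap in your own project name and contact.
USER_AGENT = "example-scraper/1.0 (+mailto:ops@example.com)"


def retry_after_seconds(value, now=None, default=30.0):
    """Convert a Retry-After header (delta-seconds or HTTP-date) to a wait in seconds."""
    if value is None:
        return default  # header absent: fall back to an assumed default wait
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: use the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def polite_get(url, max_attempts=3):
    """Fetch url with an identifying User-Agent, waiting as instructed on 429/503."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on other errors, or once the attempts are used up.
            if err.code not in (429, 503) or attempt == max_attempts - 1:
                raise
            time.sleep(retry_after_seconds(err.headers.get("Retry-After")))
```

Calling polite_get("https://example.com/catalog") would send the identifying header and back off for as long as the server asks. The attempt cap is there so a persistently throttled job stops instead of hammering the site.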

What's clearly OK

  • Scraping publicly visible product catalogs (most e-commerce)
  • Scraping public business directories (BBB, Yelp's public listings)
  • Scraping public government data (SEC EDGAR, USGS, World Bank, EU open-data portals)
  • Scraping public Wikipedia / Wikidata / Wikimedia content (Wikipedia text is CC BY-SA, Wikidata is CC0; check the license on individual media files)
  • Scraping public academic data (arXiv, PubMed, OpenAlex, Crossref)
  • Scraping public job boards that don't explicitly forbid it (RemoteOK, We Work Remotely)
  • Scraping sites whose ToS explicitly allow automated access (Reddit had this until 2023; some still do)

What's clearly NOT OK

  • Scraping data behind paid APIs (Twitter/X, Reddit since 2023, LinkedIn — all explicitly forbid it)
  • Scraping personal data at scale without GDPR/CCPA basis (EU residents especially)
  • Bypassing technical access controls (evading CAPTCHAs at scale, exploiting auth bugs)
  • Using scraped data to clone a service or compete head-on with the source
  • Scraping copyrighted content for redistribution (news articles, paid books, image libraries)

The gray zones

These are the situations where the answer is "it depends" — I either get explicit client sign-off in writing, or I decline.

Personal data on professional networks (LinkedIn, AngelList)

Public profiles, but the operating company explicitly forbids scraping in its ToS and has a track record of litigating. Even with the hiQ ruling on the CFAA, you can lose a contract-law case: on remand, the district court found that hiQ had breached LinkedIn's User Agreement.

My rule: don't, ever. Use the official API or pay Apollo/Lusha/ZoomInfo.

News articles and editorial content

Copyrighted, but quotation/research-oriented use can fall under fair use. Aggregating headlines is generally OK; reproducing full articles is not.

My rule: extract metadata and excerpts. Don't redistribute the full text.

E-commerce competitor data

Generally OK if public, but pay attention to how the data is used: building a "Walmart-priced" search competitor on Walmart's data is the kind of thing that triggers C&D letters and contract suits.

My rule: scrape for monitoring and analytics. Don't republish the data as a competing product.

Reviews and user-generated content

User-generated content (Yelp reviews, Amazon product reviews) is a contract minefield. The site's terms with its users cover both what the site may do with their content and what the users still own.

My rule: extract aggregated metrics (star ratings, review counts) freely; don't republish full user reviews.
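
One way to enforce that rule is to reduce reviews to aggregates at extraction time, so the full text never reaches storage. A minimal Python sketch follows. The field names ("rating", "text", "author") and the function name summarize_reviews are hypothetical and depend on how your parser represents a review.

```python
from statistics import mean


def summarize_reviews(reviews):
    """Reduce raw reviews to aggregate metrics, dropping text and author fields."""
    ratings = [r["rating"] for r in reviews if r.get("rating") is not None]
    return {
        "review_count": len(reviews),
        "rated_count": len(ratings),
        "average_rating": round(mean(ratings), 2) if ratings else None,
    }


# Hypothetical parsed reviews; only the numeric rating survives the summary.
sample = [
    {"rating": 5, "text": "Great", "author": "a"},
    {"rating": 4, "text": "Good", "author": "b"},
    {"rating": 4, "text": "Fine", "author": "c"},
    {"rating": None, "text": "No stars given", "author": "d"},
]
summary = summarize_reviews(sample)
print(summary)
```

Only the summary dictionary would be written to the client's dataset. The review bodies and author names are discarded in memory.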

Personal email harvesting

Even from public web pages, email harvesting can trigger CAN-SPAM (US) and GDPR (EU) issues if you use the emails for unsolicited marketing.

My rule: only extract emails when the client has a legitimate-interest basis for contacting them and will mention how they got the contact info.

Jurisdictional differences

United States

  • The hiQ ruling sets the baseline, at least within the Ninth Circuit: scraping public data is unlikely to be a CFAA violation
  • State laws vary; California's CCPA + some state computer-trespass laws add wrinkles
  • Contract law (ToS) is the realistic risk vector, not criminal law

European Union (GDPR)

  • Personal data of EU residents is regulated regardless of where you scrape from
  • "Legitimate interest" is the most relevant lawful basis for scraped business contact data
  • Mass-aggregation of personal data without basis is a hard no

United Kingdom

  • Similar to EU under UK GDPR
  • Powerful Computers vs Glasswall (2022) reinforced that "publicly available" doesn't mean "use freely"

Other major jurisdictions

  • Canada: PIPEDA applies to personal data; commercial data generally fair game
  • Australia: the Privacy Act works roughly like a lighter version of GDPR
  • Japan, South Korea: less aggressive enforcement but copyright laws are strong

The role of robots.txt

robots.txt is a politeness convention dating from 1994, formalized as RFC 9309 in 2022. It is not legally binding in most jurisdictions, and the legal weight of ignoring it is contested.

In practice: respect it anyway. It's a clear signal of intent. Ignoring it raises the chances a site escalates from auto-block to lawyer-letter.
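
Checking robots.txt before each fetch takes a few lines with Python's standard urllib.robotparser module. The sketch below parses an inlined, hypothetical robots.txt so it runs without network access. The agent name "example-scraper" and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_text):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)
agent = "example-scraper"  # placeholder: use the token from your real User-Agent

allowed_public = rules.can_fetch(agent, "https://example.com/products/1")
allowed_private = rules.can_fetch(agent, "https://example.com/private/report")
delay = rules.crawl_delay(agent)  # None when the file sets no Crawl-delay

print(allowed_public, allowed_private, delay)
```

Against a live site, calling set_url() with the site's robots.txt address followed by read() would replace the inlined text. Crawl-delay is a common extension to the protocol, so treat a missing value as "pick a conservative delay yourself", not as permission to go fast.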

ToS as a contract

Most sites' Terms of Service include a clause forbidding automated access. Whether ToS forms an enforceable contract for someone who hasn't created an account is legally murky.

In practice:

  • If you've created an account → strong contract claim against you
  • If you haven't, just visited → weaker but not zero claim
  • If the data is genuinely public and the ToS is the only barrier → the hiQ line of cases generally protects you against CFAA claims, though a contract claim can still be argued

Practical advice for freelancers

When a client asks me to scrape something I'm uncomfortable with, my response template:

"This site's ToS forbids automated access and they've actively pursued similar cases. The data you want is available via their official API ($X/month) or a third-party data provider (Apollo / Lusha / Bright Data datasets). Both are likely cheaper than the legal cost if this escalates. Want me to spec the API integration instead?"

Most clients respect this. The few who don't are clients I don't want.

What I never do

  • Scrape behind a login the client doesn't own
  • Resell scraped personal data
  • Build "API replacements" that compete with paid APIs the source sells
  • Hide who I am via fake User-Agent headers (impersonation is a different legal exposure than scraping)
  • Help clients circumvent published access restrictions to data they have no right to

The summary heuristic

If you'd be uncomfortable explaining to the source site's lawyer exactly what you scraped, why, and how — don't do it. Most legitimate scraping work passes that test trivially.

If you're unsure whether a specific job is in safe territory, email me at info@luba.media with the source URL and what you're trying to extract. I'll give you my honest take, free.

---

Disclaimer: I am not a lawyer. This is the practical landscape I navigate, not legal advice. For specific risk assessment, consult a lawyer in your jurisdiction.
