Eyal Rosenthal · Web scraping at scale

Web Scraping FAQ: Every Question I Get Asked


The 25 questions I get asked most often, answered directly. No "it depends" weasel-language unless it actually depends.

Is web scraping legal?

In the US, yes, for publicly accessible data. In hiQ Labs v. LinkedIn (2022), the Ninth Circuit held that scraping publicly accessible web data likely does not violate the Computer Fraud and Abuse Act. State and EU laws vary; the GDPR adds constraints around personal data even when it's public.

The high-confidence safe playbook: scrape only data that is publicly accessible without a login, respect robots.txt even though it's not legally binding, set a real User-Agent that identifies you, and don't scrape data the site sells through a paid API.
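A minimal sketch of the robots.txt and User-Agent parts of that playbook, using only the standard library. The User-Agent string and the sample rules are made up for illustration; in a real run you would download the site's own robots.txt and send the same User-Agent on every request.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identifying User-Agent; swap in your own name and contact address.
USER_AGENT = "example-scraper/1.0 (+mailto:you@example.com)"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if the parsed rules permit USER_AGENT to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


# Sample rules, invented for this example.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = build_parser(SAMPLE_ROBOTS)
print(is_allowed(parser, "https://example.com/products"))   # True
print(is_allowed(parser, "https://example.com/private/x"))  # False
```

Check each URL before fetching it and skip the ones that come back False.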

The high-risk territory: bypassing logins/auth, scraping data the site sells via API (paid databases, financial data providers), and aggregating personal data at scale. Get a lawyer if you're operating here.

Will I get sued?

Almost never, if you're scraping publicly accessible data. Lawsuits in this space are rare and typically target large-scale commercial competitors of the source site, not individual freelancers building one-off CSVs.

What does happen often: cease-and-desist letters. Standard response is to stop, document, comply. Don't ignore them.

What's the difference between scraping and crawling?

Scraping = extracting data from a page you already have the URL for. Crawling = discovering URLs by following links. Most "scraping" jobs are actually crawling-then-scraping.
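To make the distinction concrete, here is a small sketch that does both in one loop: it follows links (crawling) and pulls the page title out of each page (scraping). The "site" is an in-memory dict I made up so the example runs offline; in practice the fetch argument would be a function that makes an HTTP request.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects href values (for crawling) and the <title> text (for scraping)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl_and_scrape(start_url, fetch):
    """Breadth-first crawl from start_url; returns {url: page title}."""
    seen, queue, results = {start_url}, deque([start_url]), {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTitleParser()
        parser.feed(html)
        results[url] = parser.title.strip()  # scraping: extract data from this page
        for href in parser.links:            # crawling: discover new URLs
            next_url = urljoin(url, href)
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return results


# Made-up three-page site, keyed by URL.
FAKE_SITE = {
    "https://example.com/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<title>Page A</title><a href="/">home</a>',
    "https://example.com/b": "<title>Page B</title>",
}

titles = crawl_and_scrape("https://example.com/", FAKE_SITE.get)
```

The seen set is what keeps the crawl from looping when pages link back to each other.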

Should I learn Python or JavaScript for scraping?

Python. The ecosystem is far better — requests, BeautifulSoup, Scrapy, Playwright, curl_cffi, nodriver, every LLM SDK, every cloud SDK. JavaScript scraping (Puppeteer, Cheerio) works but the libraries are thinner and you'll fight more rough edges.

How much do web scrapers cost to build?

Order-of-magnitude estimates for a freelancer / agency to build:

| Scope | Typical cost (fixed-price) |
| --- | --- |
| One-shot CSV from a public site, 100-1,000 records | $50-300 |
| Recurring monitor, single site, daily Slack alerts | $300-1,500 setup + $200-1,000/mo retainer |
| Multi-site aggregation, normalized schema | $1,500-5,000 setup + $500-2,000/mo |
| Anti-bot-protected site, JS-rendered, residential proxies | +$300-1,500 over baseline |
| Custom AI extraction, self-healing, vision-LLM fallback | +$500-2,000 over baseline |

Avoid hourly pricing if you can; a freelancer who bills hourly earns more when the scope creeps, so nothing pushes them to contain it.

How long does it take to build a scraper?

Same scope, varied by experience:

| Scope | Junior dev | Mid-level dev | Senior |
| --- | --- | --- | --- |
| Simple bulk-extract (no anti-bot, structured HTML) | 1-2 days | 2-4 hours | <1 hour |
| Recurring monitor with Slack alerts | 1 week | 1-2 days | 4-6 hours |
| Anti-bot bypassed (Cloudflare) | 2-3 days | 4-8 hours | 1-2 hours |
| Self-healing AI extractor | 3-5 days | 1 day | 4-6 hours |

The "senior" column assumes someone who has shipped 50+ scrapers. If you're hiring, ask "how many scrapers have you shipped?" not "how many years of experience do you have?"

What's the cheapest way to run a scraper 24/7?

$5/month VPS at Hetzner (CX22) or DigitalOcean Basic + cron for scheduling. Total cost including a small Webshare proxy pool: $8-20/month. Handles ~100,000 page fetches per day comfortably.

The full math: Why a $5/mo VPS Beats a $1,200/mo ScrapingBee Plan.
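For a sense of how light that load is, a quick back-of-the-envelope calculation turns the ~100,000 fetches/day figure into an average request rate. The function name is mine, and it assumes the fetches are spread evenly across the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(fetches_per_day: int) -> float:
    """Average requests per second if fetches are spread evenly over 24 hours."""
    return fetches_per_day / SECONDS_PER_DAY


rate = average_rate(100_000)
print(f"{rate:.2f} requests/second on average")  # 1.16 requests/second on average
```

Roughly one request per second is well within what a small VPS can sustain, which is why the cheap setup holds up at this volume.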

Should I use Selenium or Playwright?

Playwright. Selenium is older, slower, and has worse async support. Playwright is the modern default for new work.

Selenium still has a place in legacy codebases and CI environments where it's already installed.

What's "anti-bot" and how do I get past it?

Anti-bot is software that site operators run to detect and block scripts. The major vendors are Cloudflare, DataDome, PerimeterX, and Akamai. They detect you via:

  • TLS fingerprinting: Python's TLS stack sends a different handshake (cipher suites, extensions, ordering) from Chrome's
  • Browser fingerprinting: Canvas, WebGL, fonts, plugins
  • Behavioral analysis: Mouse movement, request timing patterns
  • IP reputation: Datacenter IPs vs residential IPs

The defeat playbook in order of escalation:

  1. curl_cffi with browser impersonation — defeats ~80% of cases
  2. nodriver for headless stealth — defeats most JS challenges
  3. Webshare residential proxies — defeats IP-reputation gates
  4. Mouse-movement simulation — defeats DataDome's behavioral layer
  5. CAPTCHA solving services (2Captcha, CapSolver) — defeats final-stage challenges

Full playbook: Bypassing Cloudflare, DataDome, and PerimeterX in 2026.
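The escalation idea can be sketched as a ladder of fetchers tried in order, cheapest first. Everything here is illustrative: the block-detection heuristic is an assumption, and the three fetchers are stand-ins that return canned responses. In a real scraper they would wrap plain HTTP, a TLS-impersonating client such as curl_cffi, and a stealth browser.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a URL and returns (status_code, body).
Fetcher = Callable[[str], Tuple[int, str]]

BLOCK_STATUSES = {403, 429, 503}  # assumed signals of an anti-bot block


def looks_blocked(status: int, body: str) -> bool:
    """Heuristic only: typical block statuses or a challenge page count as blocked."""
    return status in BLOCK_STATUSES or "captcha" in body.lower()


def fetch_with_escalation(
    url: str, ladder: Sequence[Tuple[str, Fetcher]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each fetcher in order; return (name of the one that worked, body)."""
    for name, fetcher in ladder:
        status, body = fetcher(url)
        if not looks_blocked(status, body):
            return name, body
    return None, None  # every rung was blocked


# Stand-in fetchers with canned responses, for illustration only.
def plain_http(url):
    return 403, "Access denied"


def impersonated_tls(url):
    return 200, "<html>ok</html>"


def stealth_browser(url):
    return 200, "<html>ok via browser</html>"


ladder = [
    ("plain", plain_http),
    ("tls-impersonation", impersonated_tls),
    ("stealth-browser", stealth_browser),
]
used, body = fetch_with_escalation("https://example.com/", ladder)
print(used)  # tls-impersonation
```

Starting at the cheapest rung matters because each step up costs more in time, money, or both.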

How do I scrape sites behind a login?

Two options.

Option 1 — requests with cookie session. Log in once via the form, capture the session cookies, reuse them on subsequent requests.

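import requests

session = requests.Session()
# "user" and "pass" are placeholders; use the field names from the site's login form
session.post("https://site.com/login", data={"user": "...", "pass": "..."})
r = session.get("https://site.com/protected-page")  # session cookies are sent automatically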

Option 2 — Playwright with persistent context. Launch a real browser, log in once interactively, save the session state. Subsequent runs reuse the saved state.

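# After the first interactive login, save it with: context.storage_state(path="state.json")
context = browser.new_context(storage_state="state.json")  # later runs reuse the saved login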

For 2FA-protected sites, Option 2 is the only practical approach.

What's the difference between scraping and an API?

An API is the structured way the site lets you ask for data — JSON in, JSON out, documented, rate-limited, often authenticated. Scraping is what you do when there is no API or when the API doesn't expose the fields you need.

Always prefer the API when one exists. It's faster, more reliable, less brittle, and almost always within ToS.

Do I need to learn Scrapy?

No, not as a beginner. Start with requests + BeautifulSoup. Move to Scrapy only when you're crawling thousands of pages with link-following, deduplication, and pipelines.

I shipped 30+ production scrapers before I touched Scrapy. The framework adds value above ~10,000 pages/day. Below that, the learning curve isn't worth it.

How do I avoid getting IP-banned?

Five rules in order of impact:

  1. Rate-limit: time.sleep(uniform(0.5, 1.5)) between requests
  2. Real User-Agent: set it to a current Chrome/Firefox UA
  3. curl_cffi instead of requests: defeats TLS fingerprinting
  4. Residential proxies above ~10,000 requests/day: Webshare, IPRoyal
  5. Avoid datacenter IPs on anti-bot-protected sites: DigitalOcean and AWS are flagged
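Rule 1 as a runnable sketch: a sequential fetch loop with a randomized pause between requests. The function and parameter names are my own, and fetch is whatever function you use to download a URL; sleep is injectable so the loop can be tested without actually waiting.

```python
import random
import time


def polite_fetch_all(urls, fetch, min_delay=0.5, max_delay=1.5, sleep=time.sleep):
    """Fetch urls one at a time, pausing a random interval between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Jitter makes the timing look less mechanical than a fixed interval.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

A call like polite_fetch_all(urls, my_fetch) spends on average about one second per URL on waiting alone, so budget run time accordingly.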

How do I scrape JavaScript-rendered sites?

Use Playwright. The page's data isn't in the source HTML — it's loaded after the JS runs, which only happens in a real browser engine.

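from playwright.sync_api import sync_playwright

url = "https://example.com"  # placeholder: the page you want to scrape

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    page.wait_for_selector(".dynamic-content")  # wait for the JS-loaded element
    html = page.content()  # fully rendered HTML, after the JS has run
    browser.close()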

Same BeautifulSoup parsing afterward.

How do I scrape Twitter/X / LinkedIn / Reddit?

Short answer: don't, unless you accept the risk. Twitter/X and Reddit changed their ToS in 2023 to ban scraping; LinkedIn has been actively litigating against it.

Use their official APIs:

  • Twitter/X: developer.twitter.com (paid)
  • Reddit: reddit.com/dev/api
  • LinkedIn: official partnership programs only (no public scraping API)

If you absolutely must, use proxies + headless browsers + low rate, and understand the legal exposure.

How do I download all videos from a YouTube channel?

yt-dlp. One command:

yt-dlp -o "%(playlist_index)03d - %(title)s.%(ext)s" \
       "https://www.youtube.com/@CHANNEL/videos"

For just the transcripts (no video files), use --write-auto-subs --skip-download. Or use the youtube-transcript-api Python library.

What should I store data in?

Order of escalation:

  • CSV — for one-shot extracts and quick deliverables
  • SQLite — for medium-volume work (single-machine, persistent state, no setup)
  • Postgres — for production pipelines where multiple processes need access
  • BigQuery / Snowflake — for analytics-heavy use cases at large volume

Don't reach for Postgres on day one. SQLite handles surprisingly large datasets and has zero setup overhead.
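A minimal SQLite sketch using the standard library. The table and column names are invented for the example. Keying rows on the URL and upserting means re-running the scraper updates existing rows instead of duplicating them; the ON CONFLICT clause assumes SQLite 3.24 or newer.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database; pass a file path for persistent storage."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               url TEXT PRIMARY KEY,
               title TEXT,
               price REAL,
               fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def save_item(conn, url, title, price):
    """Insert a record, or update it if this URL was already stored."""
    conn.execute(
        """INSERT INTO items (url, title, price) VALUES (?, ?, ?)
           ON CONFLICT(url) DO UPDATE SET
               title = excluded.title,
               price = excluded.price,
               fetched_at = CURRENT_TIMESTAMP""",
        (url, title, price),
    )
    conn.commit()


conn = open_store()
save_item(conn, "https://example.com/p/1", "Widget", 9.99)
save_item(conn, "https://example.com/p/1", "Widget", 8.49)  # same URL: updates the row
print(conn.execute("SELECT COUNT(*), MAX(price) FROM items").fetchone())  # (1, 8.49)
```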

How do I handle pagination?

Three patterns:

  1. Numbered pages (?page=1, ?page=2...) — increment until you get an empty page or 404
  2. Cursor-based (?after=abc123) — read the next-cursor from each response
  3. Scroll-loaded (infinite scroll) — Playwright + page.evaluate("window.scrollTo(0, document.body.scrollHeight)")

Always make the loop resume-safe: persist the last-fetched page/cursor to disk, so a crash mid-run picks up where it left off.
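One way to make a numbered-page loop resume-safe, sketched with a JSON checkpoint file. The file layout and function names are assumptions; fetch_page and handle_records are whatever you use to download one page and to store its records.

```python
import json
import os


def load_checkpoint(path):
    """Return the next page to fetch, or 1 if there is no checkpoint yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["next_page"]
    return 1


def save_checkpoint(path, next_page):
    """Write the checkpoint via a temp file so a crash can't leave it half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_page": next_page}, f)
    os.replace(tmp, path)  # atomic rename


def scrape_pages(fetch_page, handle_records, checkpoint_path):
    """Fetch pages until one comes back empty, checkpointing after each page."""
    page = load_checkpoint(checkpoint_path)
    while True:
        records = fetch_page(page)
        if not records:  # empty page means we've run past the last one
            break
        handle_records(records)
        page += 1
        save_checkpoint(checkpoint_path, page)  # only after the page is handled
    return page
```

Because the checkpoint is written after handle_records, a crash in between means one page gets processed twice on restart, so the storage step should tolerate duplicates. The same pattern works for cursors: store the cursor string instead of the page number.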

What if the site redesigns and breaks my scraper?

Two answers.

Cheap fix: rewrite the CSS selectors. Takes 10-30 minutes per scraper.

Real fix: switch to schema-driven LLM extraction (the self-healing pattern). The LLM doesn't care about CSS classes. The schema (what "price" means) doesn't change when the markup does. Cost: ~$0.0003/page.

Full implementation: Self-Healing AI Web Extractors.
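The shape of the schema-driven pattern, as a rough sketch rather than the implementation from the linked article. The Product schema, the prompt wording, and the llm callable are all hypothetical; llm stands for any function that sends a prompt to a model and returns its text reply.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Product:
    """The schema: what we want, independent of how the page is marked up."""
    name: str
    price: float
    currency: str


PROMPT_TEMPLATE = (
    "Extract the product from the page text below. "
    'Reply with JSON only, using exactly these keys: "name" (string), '
    '"price" (number), "currency" (ISO 4217 code).\n\n{page_text}'
)


def extract_product(page_text: str, llm: Callable[[str], str]) -> Product:
    """Ask the model for JSON, then validate it against the schema."""
    raw = llm(PROMPT_TEMPLATE.format(page_text=page_text))
    data = json.loads(raw)  # raises if the reply is not valid JSON
    return Product(
        name=str(data["name"]),       # KeyError if a field is missing
        price=float(data["price"]),   # ValueError if the price is not numeric
        currency=str(data["currency"]),
    )


# Stand-in for a real model call, returning a canned reply.
def fake_llm(prompt: str) -> str:
    return '{"name": "Widget", "price": "19.99", "currency": "USD"}'


product = extract_product("Widget, now only $19.99", fake_llm)
print(product)  # Product(name='Widget', price=19.99, currency='USD')
```

The validation step is the important part: when the model returns something malformed, the scraper fails loudly instead of writing bad rows. At ~$0.0003/page, 100,000 pages comes to about $30.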

Can I scrape data and then sell it?

Depends entirely on the source's ToS, the data type, and your jurisdiction. Public data with no ToS restrictions is generally fair game. Personal data is regulated by GDPR (EU) and CCPA (California). Data behind paywalls or APIs is contractually restricted.

Don't sell anything you scraped without lawyer review. The rule of thumb: if you're charging money for the data itself (not for the work), you need legal cover.

What's a "rate limit"?

The maximum requests-per-second/minute/hour a server will accept from one IP before throttling or blocking. Polite scrapers stay under the limit by:

  • Sleeping 0.5 to 2.0 seconds between requests, e.g. time.sleep(random.uniform(0.5, 2.0))
  • Reading Retry-After headers when rate-limited and waiting accordingly
  • Backing off exponentially on 429 status codes
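The last two points can be sketched together: honor Retry-After when the server sends it, otherwise back off exponentially. The function name and the fetch signature are assumptions, and this version only handles Retry-After given in seconds, not the HTTP-date form.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry on 429. fetch(url) must return (status_code, headers_dict, body)."""
    for attempt in range(max_attempts):
        status, headers, body = fetch(url)
        if status != 429:
            return status, body
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)           # the server said how long to wait
        else:
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, ...
        sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts: {url}")
```

Capping max_attempts keeps a permanently throttled scraper from hanging forever; let the error surface and alert on it.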

How do I know if a site is using Cloudflare?

Two checks:

  • Visit the site, check the response headers (browser DevTools → Network tab). If you see Server: cloudflare, yes.
  • Run curl -I https://site.com/ from your terminal. Same Server header.
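The same check in code, as a small helper that inspects response headers you have already fetched. The function name is mine. Besides the Server header, it also looks for CF-RAY, a request ID header that Cloudflare adds to responses it proxies.

```python
def uses_cloudflare(headers) -> bool:
    """Return True if response headers show the site is served through Cloudflare."""
    # Header names are case-insensitive, so normalize before comparing.
    normalized = {k.lower(): str(v).lower() for k, v in headers.items()}
    return normalized.get("server", "") == "cloudflare" or "cf-ray" in normalized


print(uses_cloudflare({"Server": "cloudflare", "CF-RAY": "8a1b2c3d4e5f-TLV"}))  # True
print(uses_cloudflare({"Server": "nginx"}))                                     # False
```

Pass it the headers from any HTTP client, for example dict(response.headers) when using requests.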

Cloudflare is on >35% of the web's top sites in 2026. You will encounter it.

Should I use AI to write my scraper code?

Use it as an accelerator, not a replacement. AI is great at boilerplate (HTTP request setup, BeautifulSoup selectors, CSV output). It's poor at the operational discipline that turns scripts into pipelines (idempotency, audit logs, resume safety, rate-limit handling).

Use it for the first 80% of code, do the last 20% (production-readiness) yourself. Or hire someone who's done it 50 times.

What if my client wants me to scrape something I think is unethical?

Tell them. I've turned down jobs that asked for personal-data harvesting from sources that explicitly forbid it. The freelancing market is large enough that you don't need every job. Reputation compounds; bad jobs erode it.

Where do I find more help?

If you have a specific question that isn't here, email info@luba.media. I read every message and reply within 24 hours.

Hire me to build this for your site

I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.
