Web Scraping FAQ: Every Question I Get Asked
Web Scraping Frequently Asked Questions
The 25 questions I get asked most often, answered directly. No "it depends" weasel-language unless it actually depends.
Is web scraping legal?
In the US, yes, for publicly accessible data. In hiQ Labs v. LinkedIn (2022), the Ninth Circuit held that scraping publicly accessible web data likely does not violate the Computer Fraud and Abuse Act. That ruling covers the CFAA only; it does not rule out other claims, such as breach of a site's terms of service. State and EU laws vary; the GDPR adds constraints around personal data even when it's public.
The high-confidence safe playbook: scrape only data that is publicly accessible without login, respect robots.txt even though it's not legally binding, set a real User-Agent that identifies you, and don't scrape data that the site sells under contract (for example, through a paid API).
The high-risk territory: bypassing logins/auth, scraping data the site sells via API (paid databases, financial data providers), aggregating personal data at scale. Get a lawyer if you're operating here.
Will I get sued?
Almost never, if you're scraping publicly accessible data. Lawsuits in this space are rare and typically target large-scale commercial competitors of the source site, not individual freelancers building one-off CSVs.
What does happen often: cease-and-desist letters. Standard response is to stop, document, comply. Don't ignore them.
What's the difference between scraping and crawling?
Scraping = extracting data from a page you already have the URL for. Crawling = discovering URLs by following links. Most "scraping" jobs are actually crawling-then-scraping.
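To make the distinction concrete, here is a minimal sketch using only Python's standard library. The class name `LinkAndTitleParser` and the sample HTML are made up for illustration; a real job would fetch the page over HTTP and would probably use BeautifulSoup instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects <a href> targets (crawling) and the <h1> text (scraping)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []   # URLs discovered on the page
        self.title = ""   # data extracted from the page
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title += data.strip()


# Hypothetical page content standing in for a fetched response.
SAMPLE_HTML = """
<html><body>
  <h1>Blue Widget</h1>
  <a href="/products/2">Next product</a>
  <a href="https://example.com/about">About</a>
</body></html>
"""

parser = LinkAndTitleParser("https://example.com/products/1")
parser.feed(SAMPLE_HTML)

print(parser.title)  # scraping: the record you wanted
print(parser.links)  # crawling: where to go next
```

A crawl-then-scrape job repeats this per page: push the discovered links onto a queue, store the extracted record, and move on.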
Should I learn Python or JavaScript for scraping?
Python. The ecosystem is far better — requests, BeautifulSoup, Scrapy, Playwright, curl_cffi, nodriver, every LLM SDK, every cloud SDK. JavaScript scraping (Puppeteer, Cheerio) works but the libraries are thinner and you'll fight more rough edges.
How much do web scrapers cost to build?
Order-of-magnitude estimates for a freelancer / agency to build:
| Scope | Typical cost (fixed-price) |
|---|---|
| One-shot CSV from a public site, 100-1,000 records | $50-300 |
| Recurring monitor, single site, daily Slack alerts | $300-1,500 setup + $200-1,000/mo retainer |
| Multi-site aggregation, normalized schema | $1,500-5,000 setup + $500-2,000/mo |
| Anti-bot-protected site, JS-rendered, residential proxies | +$300-1,500 over baseline |
| Custom AI extraction, self-healing, vision-LLM fallback | +$500-2,000 over baseline |
Avoid hourly pricing if you can; an hourly quote gives the freelancer little incentive to keep the scope tight, and scope creep ends up on your invoice.
How long does it take to build a scraper?
Same scope, varied by experience:
| Scope | Junior dev | Mid-level dev | Senior |
|---|---|---|---|
| Simple bulk-extract (no anti-bot, structured HTML) | 1-2 days | 2-4 hours | <1 hour |
| Recurring monitor with Slack alerts | 1 week | 1-2 days | 4-6 hours |
| Anti-bot bypassed (Cloudflare) | 2-3 days | 4-8 hours | 1-2 hours |
| Self-healing AI extractor | 3-5 days | 1 day | 4-6 hours |
The "senior" column assumes someone who has shipped 50+ scrapers. If you're hiring, ask "how many scrapers have you shipped?" not "how many years of experience do you have?"
What's the cheapest way to run a scraper 24/7?
$5/month VPS at Hetzner (CX22) or DigitalOcean Basic + cron for scheduling. Total cost including a small Webshare proxy pool: $8-20/month. Handles ~100,000 page fetches per day comfortably.
The full math: Why a $5/mo VPS Beats a $1,200/mo ScrapingBee Plan.
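As a quick sanity check on the ~100,000 fetches/day figure, the arithmetic below converts a daily page target into an average request rate. The function name `required_rate` is illustrative, not from any library.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_rate(pages_per_day):
    """Average requests per second needed to hit a daily page target."""
    return pages_per_day / SECONDS_PER_DAY


rate = required_rate(100_000)
print(f"{rate:.2f} requests/second")          # about 1.16
print(f"{1 / rate:.2f} seconds per request")  # about 0.86
```

At roughly one request every 0.86 seconds, a single sequential worker that also sleeps about a second between requests would fall a little short, so this volume presumably assumes a few concurrent workers or several target sites.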
Should I use Selenium or Playwright?
Playwright. Selenium is older, slower, and has worse async support. Playwright is the modern default for new work.
Selenium still has a place in legacy codebases and CI environments where it's already installed.
What's "anti-bot" and how do I get past it?
Anti-bot is software that site owners run to detect and block scripts. Major vendors: Cloudflare, DataDome, PerimeterX, Akamai. They detect you via:
- TLS fingerprinting: Python's TLS stack sends a different handshake (cipher suites, extensions) from Chrome's
- Browser fingerprinting: Canvas, WebGL, fonts, plugins
- Behavioral analysis: Mouse movement, request timing patterns
- IP reputation: Datacenter IPs vs residential IPs
The defeat playbook in order of escalation:
- curl_cffi with browser impersonation: defeats ~80% of cases
- nodriver for headless stealth: defeats most JS challenges
- Webshare residential proxies: defeats IP-reputation gates
- Mouse-movement simulation: defeats DataDome's behavioral layer
- CAPTCHA solving services (2Captcha, CapSolver): defeats final-stage challenges
Full playbook: Bypassing Cloudflare, DataDome, and PerimeterX in 2026.
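The "order of escalation" idea can be sketched as a loop that tries the cheapest strategy first and only moves up when it gets blocked. Everything here is hypothetical scaffolding: `fetch_with_escalation` and the two stand-in fetchers are made up so the sketch runs offline. In practice each fetcher would wrap one rung of the ladder (curl_cffi, nodriver, a proxy pool, and so on).

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetcher takes a URL and returns the page body, or None if it was blocked.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_escalation(
    url: str, ladder: Iterable[Tuple[str, Fetcher]]
) -> Tuple[str, str]:
    """Try each fetcher in order; return (strategy_name, body) for the first success."""
    tried = []
    for name, fetcher in ladder:
        tried.append(name)
        body = fetcher(url)
        if body is not None:
            return name, body
    raise RuntimeError(f"all strategies blocked for {url}: {tried}")


# Stand-in fetchers (assumptions for illustration only).
def plain_http(url):
    return None  # pretend the anti-bot layer blocked this attempt


def impersonated_tls(url):
    return "<html>ok</html>"  # pretend browser impersonation got through


ladder = [("plain_http", plain_http), ("impersonated_tls", impersonated_tls)]
used, html = fetch_with_escalation("https://example.com/", ladder)
print(used)
```

Logging which rung succeeded per site is useful: it tells you the cheapest strategy that works, so later runs can start there.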
How do I scrape sites behind a login?
Two options.
Option 1 — requests with cookie session. Log in once via the form, capture the session cookies, reuse them on subsequent requests.
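```python
import requests

session = requests.Session()
# Form field names ("user", "pass") are placeholders; copy the real ones from the site's login form.
session.post("https://site.com/login", data={"user": "...", "pass": "..."})
r = session.get("https://site.com/protected-page")  # cookies persisted
```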
Option 2 — Playwright with persistent context. Launch a real browser, log in once interactively, save the session state. Subsequent runs reuse the saved state.
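```python
# `browser` is an already-launched Playwright browser; state.json was written earlier with context.storage_state(path="state.json")
context = browser.new_context(storage_state="state.json")  # reuses last login
```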
For 2FA-protected sites, Option 2 is the only practical approach.
What's the difference between scraping and an API?
An API is the structured way the site lets you ask for data — JSON in, JSON out, documented, rate-limited, often authenticated. Scraping is what you do when there is no API or when the API doesn't expose the fields you need.
Always prefer the API when one exists. It's faster, more reliable, less brittle, and almost always within ToS.
Do I need to learn Scrapy?
No, not as a beginner. Start with requests + BeautifulSoup. Move to Scrapy only when you're crawling thousands of pages with link-following, deduplication, and pipelines.
I shipped 30+ production scrapers before I touched Scrapy. The framework adds value above ~10,000 pages/day. Below that, the learning curve isn't worth it.
How do I avoid getting IP-banned?
Five rules in order of impact:
- Rate-limit: time.sleep(uniform(0.5, 1.5)) between requests
- Real User-Agent: set it to a current Chrome/Firefox UA
- curl_cffi instead of requests: defeats TLS fingerprinting
- Residential proxies above ~10,000 requests/day: Webshare, IPRoyal
- Avoid datacenter IPs on anti-bot-protected sites: DigitalOcean and AWS are flagged
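The first two rules can be sketched with the standard library alone. The helper names (`polite_delay`, `build_request`, `fetch_all`) are made up for this example, and the User-Agent string is only a sample of the format; copy a current one from your own browser.

```python
import random
import time
import urllib.request

# Sample UA string (an assumption for illustration); replace with a current one.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def polite_delay(low=0.5, high=1.5):
    """Random pause length in seconds, so requests don't arrive on a fixed beat."""
    return random.uniform(low, high)


def build_request(url):
    """Attach the browser-like User-Agent to every outgoing request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_all(urls, opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch URLs one by one, pausing a jittered interval after each."""
    pages = []
    for url in urls:
        with opener(build_request(url)) as resp:
            pages.append(resp.read())
        sleep(polite_delay())
    return pages
```

`opener` and `sleep` are parameters only so the loop can be exercised without network access or real waiting; calling `fetch_all(urls)` with the defaults does the real thing.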
How do I scrape JavaScript-rendered sites?
Use Playwright. The page's data isn't in the source HTML — it's loaded after the JS runs, which only happens in a real browser engine.
```python
from playwright.sync_api import sync_playwright

url = "https://example.com/"  # placeholder: the JS-rendered page you want

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    page.wait_for_selector(".dynamic-content")  # wait for the JS-loaded element
    html = page.content()
    browser.close()
```
Same BeautifulSoup parsing afterward.
How do I scrape Twitter/X / LinkedIn / Reddit?
Short answer: don't, unless you accept the risk. Twitter/X and Reddit changed their ToS in 2023 to ban scraping; LinkedIn has been actively litigating against it.
Use their official APIs:
- Twitter/X: developer.twitter.com (paid)
- Reddit: reddit.com/dev/api
- LinkedIn: official partnership programs only (no public scraping API)
If you absolutely must, use proxies + headless browsers + low rate, and understand the legal exposure.
How do I download all videos from a YouTube channel?
yt-dlp. One command:
```shell
yt-dlp -o "%(playlist_index)03d - %(title)s.%(ext)s" \
  "https://www.youtube.com/@CHANNEL/videos"
```
For just the transcripts (no video files), use --write-auto-subs --skip-download. Or use the youtube-transcript-api Python library.
What should I store data in?
Order of escalation:
- CSV — for one-shot extracts and quick deliverables
- SQLite — for medium-volume work (single-machine, persistent state, no setup)
- Postgres — for production pipelines where multiple processes need access
- BigQuery / Snowflake — for analytics-heavy use cases at large volume
Don't reach for Postgres on day one. SQLite handles surprisingly large datasets and has zero setup overhead.
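A minimal sketch of the SQLite step, using the `sqlite3` module from the standard library. The `products` table and its columns are an assumed schema for illustration; keying rows on the URL means a re-scrape overwrites the old row instead of duplicating it.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database; pass a file path for persistent storage."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS products (
               url        TEXT PRIMARY KEY,
               title      TEXT NOT NULL,
               price      REAL,
               scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def save_product(conn, url, title, price):
    """Insert a record, replacing any earlier row with the same URL."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO products (url, title, price) VALUES (?, ?, ?)",
            (url, title, price),
        )


conn = open_store()
save_product(conn, "https://example.com/p/1", "Blue Widget", 19.99)
save_product(conn, "https://example.com/p/1", "Blue Widget", 17.49)  # re-scrape, new price
rows = conn.execute("SELECT url, title, price FROM products").fetchall()
print(rows)
```

If you need price history rather than the latest value, drop the primary key on `url` and append a row per scrape instead.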
How do I handle pagination?
Three patterns:
- Numbered pages (?page=1, ?page=2, ...): increment until you get an empty page or a 404
- Cursor-based (?after=abc123): read the next cursor from each response
- Scroll-loaded (infinite scroll): Playwright + page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
Always make the loop resume-safe: persist the last-fetched page/cursor to disk, so a crash mid-run picks up where it left off.
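One way to sketch a resume-safe loop for the numbered-pages pattern is shown below. The function names and the JSON checkpoint format are assumptions for illustration, and `fake_fetch` stands in for a real HTTP call so the example runs offline.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the next page to fetch, or 1 if there is no checkpoint yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["next_page"]
    return 1


def save_checkpoint(path, next_page):
    """Write the checkpoint via a temp file so a crash never leaves it half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_page": next_page}, f)
    os.replace(tmp, path)  # atomic rename


def crawl_pages(fetch_page, checkpoint_path, handle):
    """Fetch pages until one comes back empty, checkpointing after each page."""
    page = load_checkpoint(checkpoint_path)
    while True:
        items = fetch_page(page)
        if not items:
            break
        handle(items)  # store the records first...
        page += 1
        save_checkpoint(checkpoint_path, page)  # ...then record progress
    return page


# Stand-in for a paginated site: page 3 is empty, which ends the crawl.
FAKE_SITE = {1: ["a", "b"], 2: ["c"], 3: []}


def fake_fetch(page):
    return FAKE_SITE.get(page, [])


collected = []
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
crawl_pages(fake_fetch, checkpoint, collected.extend)
print(collected)
```

Because records are handled before the checkpoint is saved, a crash between the two steps re-processes one page on restart, so the storage step should tolerate duplicates. The same structure works for cursors: persist the cursor string instead of the page number.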
What if the site redesigns and breaks my scraper?
Two answers.
Cheap fix: rewrite the CSS selectors. Takes 10-30 minutes per scraper.
Real fix: switch to schema-driven LLM extraction (the self-healing pattern). The LLM doesn't care about CSS classes. The schema (what "price" means) doesn't change when the markup does. Cost: ~$0.0003/page.
Full implementation: Self-Healing AI Web Extractors.
Can I scrape data and then sell it?
Depends entirely on the source's ToS, the data type, and your jurisdiction. Public data with no ToS restrictions is generally fair game. Personal data is regulated by GDPR (EU) and CCPA (California). Data behind paywalls or APIs is contractually restricted.
Don't sell anything you scraped without lawyer review. The rule of thumb: if you're charging money for the data itself (not for the work), you need legal cover.
What's a "rate limit"?
The maximum requests-per-second/minute/hour a server will accept from one IP before throttling or blocking. Polite scrapers stay under the limit by:
- Adding a 0.5-2.0 second time.sleep() between requests
- Reading Retry-After headers when rate-limited and waiting accordingly
- Backing off exponentially on 429 status codes
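The last two points combine naturally into one helper: use the server's Retry-After value when it gives one, and fall back to exponential backoff otherwise. This is a sketch under stated assumptions: `backoff_delay` and its defaults (1 second base, 60 second cap, up to 10% jitter) are illustrative choices, and only the seconds form of Retry-After is parsed.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based) after a 429.

    `retry_after` is the raw Retry-After header value, if the server sent one.
    """
    if retry_after is not None:
        try:
            # The server told us how long to wait; honor it.
            return max(0.0, float(retry_after))
        except ValueError:
            # Retry-After can also be an HTTP date; this sketch falls back
            # to exponential backoff in that case.
            pass
    delay = min(cap, base * 2 ** attempt)  # 1s, 2s, 4s, ... up to the cap
    return delay + random.uniform(0, delay * 0.1)  # small jitter


print(backoff_delay(0, retry_after="30"))
print(backoff_delay(3))
```

In a request loop you would call `time.sleep(backoff_delay(attempt, response.headers.get("Retry-After")))` after each 429 and give up after a fixed number of attempts.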
How do I know if a site is using Cloudflare?
Two checks:
- Visit the site and check the response headers (browser DevTools → Network tab). If you see Server: cloudflare, yes.
- Run curl -I https://site.com/ from your terminal. Same Server header.
Cloudflare is on >35% of the web's top sites in 2026. You will encounter it.
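The header check can also be scripted. In this sketch the function names are made up; besides the Server header described above, it also looks for the CF-RAY header that Cloudflare adds to responses, which is an addition to the two checks listed.

```python
import urllib.error
import urllib.request


def looks_like_cloudflare(headers):
    """True if the response headers suggest Cloudflare sits in front of the site."""
    normalized = {k.lower(): v for k, v in headers.items()}
    server = normalized.get("server", "").lower()
    return "cloudflare" in server or "cf-ray" in normalized


def check_site(url):
    """Send a HEAD request and inspect the headers (needs network access)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return looks_like_cloudflare(resp.headers)
    except urllib.error.HTTPError as err:
        # Blocked responses (403, 503) still carry the headers we need.
        return looks_like_cloudflare(err.headers)


print(looks_like_cloudflare({"Server": "cloudflare"}))
print(looks_like_cloudflare({"Server": "nginx"}))
```

`check_site("https://site.com/")` runs the live check; the pure `looks_like_cloudflare` function also works on headers you already captured from another client.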
Should I use AI to write my scraper code?
Use it as an accelerator, not a replacement. AI is great at boilerplate (HTTP request setup, BeautifulSoup selectors, CSV output). It's poor at the operational discipline that turns scripts into pipelines (idempotency, audit logs, resume safety, rate-limit handling).
Use it for the first 80% of code, do the last 20% (production-readiness) yourself. Or hire someone who's done it 50 times.
What if my client wants me to scrape something I think is unethical?
Tell them. I've turned down jobs that asked for personal-data harvesting from sources that explicitly forbid it. The freelancing market is large enough that you don't need every job. Reputation compounds; bad jobs erode it.
Where do I find more help?
- Getting Started with Web Scraping — first scraper in 30 min
- Web Scraping Glossary — every term defined
- Tools Comparison — Scrapy vs Playwright vs etc.
- The repo: 46 production demos, all runnable
If you have a specific question that isn't here, email info@luba.media. I read every message and reply within 24 hours.
Hire me to build this for your site
I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.