Eyal Rosenthal · Web scraping at scale

Web Scraping Tools Comparison 2026: Scrapy vs Playwright vs Beautiful Soup vs ScrapingBee vs DIY

I run my own data business on these tools. I also build scrapers for clients who've burned $500-5,000/month on tools they didn't need. Here's the unflinching comparison — no affiliate links, no vendor relationships, no "best tool for everyone" weasel-language.

TL;DR decision tree

| If you... | Use |
| --- | --- |
| Are scraping <1,000 pages/day, no anti-bot | requests + BeautifulSoup |
| Need JavaScript rendering | Playwright |
| Hit Cloudflare 403s | curl_cffi (impersonation) → nodriver (headless stealth) |
| Need to scrape thousands of pages with link-following | Scrapy |
| Need an LLM to handle markup that changes often | Self-healing extractor (custom Python + OpenAI/Anthropic) |
| Have no Python team and need a managed solution today | Apify or Bright Data |
| Want a one-line API call and don't care about cost | ScrapingBee |
| Have specific compliance / proxy requirements | Build your own with curl_cffi + Webshare |
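If it helps to see the table as logic, here is a rough sketch of it as a function. The `Job` fields, the 1,000-page threshold for Scrapy, and above all the order of the checks are my assumptions for illustration; the table itself does not define a precedence.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a scraping job; field names are illustrative."""
    pages_per_day: int
    has_python_team: bool = True
    needs_js_rendering: bool = False
    hits_cloudflare_403: bool = False
    follows_links: bool = False
    markup_changes_often: bool = False


def pick_tool(job: Job) -> str:
    # Assumed precedence: team constraints first, then blockers, then scale.
    if not job.has_python_team:
        return "Apify or Bright Data"
    if job.hits_cloudflare_403:
        return "curl_cffi, then nodriver"
    if job.needs_js_rendering:
        return "Playwright"
    if job.markup_changes_often:
        return "self-healing extractor"
    if job.follows_links and job.pages_per_day >= 1_000:
        return "Scrapy"
    return "requests + BeautifulSoup"  # the default for everything else


print(pick_tool(Job(pages_per_day=500)))
print(pick_tool(Job(pages_per_day=5_000, follows_links=True)))
```

Real jobs usually tick more than one box, so treat the return value as a starting point rather than a verdict.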

The honest summary: for solo developers and small teams running real production scrapers, the right answer is almost always custom Python on a $5 VPS. Managed tools earn their cost only in narrow situations.

The five open-source libraries

requests — the baseline HTTP client

The default. Synchronous, simple, blocking. Every Python tutorial uses it.

import requests
r = requests.get("https://example.com", headers={"User-Agent": "..."})
print(r.text)

When to use: any site that doesn't have anti-bot at the TLS layer. ~70% of public sites.

When to switch away: you get HTTP 403, or empty HTML on sites that work in a browser. Switch to curl_cffi.
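A minimal sketch of that "time to switch" check, assuming a 500-character floor for what counts as real content. The function name and the threshold are mine, not part of requests; tune the threshold per site.

```python
def looks_blocked(status_code: int, body: str, min_body_chars: int = 500) -> bool:
    """Heuristic: True when a response looks like an anti-bot block, not real content."""
    if status_code == 403:
        return True
    # A 200 with a near-empty body is the "empty HTML" case described above.
    return status_code == 200 and len(body.strip()) < min_body_chars
```

With requests you would call it as `looks_blocked(r.status_code, r.text)` and escalate when it returns True.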

Verdict: every Python developer should know this. It's not the answer for production, but it's the foundation.

BeautifulSoup (bs4) — the HTML parser

Parses HTML and exposes a fluent API for navigating and searching it.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
prices = [el.get_text() for el in soup.select("span.price")]

When to use: any time you have HTML and need structured data out of it. Most scrapers in production use bs4.

Alternative: lxml. Faster, uglier API, and the engine underneath Scrapy's selectors (via parsel).

Verdict: the default HTML parser. Don't overthink this choice.

curl_cffi — the anti-bot bypass

A Python library that wraps curl-impersonate, giving your scripts the exact TLS fingerprint of Chrome/Firefox/Safari.

from curl_cffi import requests as cf
# impersonate selects which browser's TLS fingerprint to present
r = cf.get("https://example.com", impersonate="chrome131")

When to use: any time requests returns 403 or empty HTML on Cloudflare/DataDome-protected sites. ~80% of "anti-bot" cases are solved by switching to curl_cffi.

When to switch away: when the site challenges with an actual JavaScript proof-of-work (Cloudflare's "Verifying you are human"). At that point you need a headless browser.

Verdict: the single biggest scraper-stack upgrade of 2026. If you're still using requests for anti-bot-protected sites, switch tomorrow.

Playwright — the modern headless browser

Cross-browser automation library (Chromium, Firefox, WebKit). Async-friendly, multi-language. Replaced Selenium for new work.

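import asyncio
from playwright.async_api import async_playwright

async def main():
    # "async with" is only valid inside a coroutine, so the session lives in main()
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://js-rendered-site.com")
        html = await page.content()  # HTML after JavaScript has run
        await browser.close()
    return html

html = asyncio.run(main())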

When to use:

  • JavaScript-rendered sites (data not in the source HTML)
  • Sites that require interaction (clicking, scrolling, filling forms)
  • Login flows
  • Cloudflare's JS challenge that curl_cffi can't pass

When to switch away: if curl_cffi works, use that — Playwright is 10-50× slower because it spins up a real browser engine.
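In practice this becomes a cheapest-first ladder: plain HTTP, then impersonated HTTP, then a browser. The sketch below takes the fetchers as plain callables so the ordering logic is visible on its own; the fetcher signature and the tier names are assumptions, and real fetchers would wrap requests, curl_cffi, and Playwright.

```python
from typing import Callable, Optional

# Each fetcher takes a URL and returns (status_code, html). Hypothetical
# signature for illustration, not an API of any of the libraries above.
Fetcher = Callable[[str], tuple[int, str]]


def fetch_cheapest_first(
    url: str, tiers: list[tuple[str, Fetcher]]
) -> Optional[tuple[str, str]]:
    """Try each tier in order; return (tier_name, html) for the first usable response."""
    for name, fetch in tiers:
        status, html = fetch(url)
        if status == 200 and html.strip():
            return name, html
    return None  # every tier was blocked


# Demo with stub fetchers standing in for the real clients.
def plain_http(url: str) -> tuple[int, str]:
    return 403, ""


def impersonated_http(url: str) -> tuple[int, str]:
    return 200, "<html>ok</html>"


result = fetch_cheapest_first(
    "https://example.com",
    [("requests", plain_http), ("curl_cffi", impersonated_http)],
)
print(result)
```

Because the browser tier sits last, it only runs when both cheaper tiers fail, which is where the 10-50× cost stays contained.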

Verdict: for any modern site, Playwright is the universal "if all else fails" tool. Default headless browser of 2026.

Scrapy — the framework

Full-featured crawling framework. Defines spiders, pipelines, item-loaders, middlewares.

import scrapy

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

When to use: large-scale crawls (10,000+ pages) with complex link-following, deduplication, and pipelines. Production work where you have a team that knows Scrapy.
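To make "link-following and deduplication" concrete, here is roughly the part Scrapy's scheduler and duplicate filter take off your hands, reduced to a standard-library sketch. The `get_links` callable is a placeholder for "fetch the page and pull out its hrefs"; retries, throttling, and pipelines are left out entirely.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
) -> list[str]:
    """Breadth-first crawl with URL deduplication; returns pages in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop #fragments so duplicates collapse.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

If your job needs much more than this loop (politeness delays, retry policies, item pipelines), that is the point where the framework starts paying for its learning curve.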

When NOT to use: smaller jobs (1-1,000 pages). The framework overhead outweighs the benefit. Use requests + BeautifulSoup instead.

Verdict: powerful, mature, learning-curve-heavy. Most freelance scraping jobs don't need it. If you're not sure you need Scrapy, you don't need Scrapy.

The managed services

ScrapingBee

Scraping-as-a-service API. You make one HTTP call, they handle headless browsers, proxies, anti-bot.

Pricing (2026): ~$0.001-$0.005 per request depending on plan + features.

When it's worth it:

  • You're scraping <1,000 pages/day
  • You have no Python team
  • You don't want to manage infrastructure
  • You're prototyping and don't know yet whether the volume justifies custom code

When it's NOT worth it:

  • Volume above ~1,000/day (the per-request cost compounds fast)
  • You have anything resembling a developer on your team
  • Long-running monitoring (a $5 VPS plus your own code wins on 12-month TCO by an order of magnitude)
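The compounding is easy to check with the per-request range quoted above. This is back-of-envelope arithmetic only: the 30-day month and the flat $5 VPS figure are simplifying assumptions, and proxy spend or developer time on the DIY side is not counted.

```python
def monthly_api_cost(requests_per_day: int, price_per_request: float, days: int = 30) -> float:
    """Monthly spend on a per-request scraping API, in dollars."""
    return round(requests_per_day * price_per_request * days, 2)


VPS_MONTHLY = 5.00  # flat cost of the self-hosted alternative

for volume in (1_000, 10_000, 50_000):
    low = monthly_api_cost(volume, 0.001)
    high = monthly_api_cost(volume, 0.005)
    print(f"{volume:>6} req/day: ${low:,.0f}-${high:,.0f}/month vs ${VPS_MONTHLY:,.0f} VPS")
```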

Verdict: legitimate convenience product, dramatically overpriced at scale.

Bright Data

Largest residential-proxy provider. Also offers a scraping IDE, a web unblocker, and a dataset marketplace.

Pricing: $500-5,000/month typical for serious volume.

When it's worth it:

  • You need geo-diverse residential IPs across 100+ countries
  • You're scraping at >50,000 pages/day with anti-bot protections
  • You need their dataset products (pre-scraped LinkedIn, Amazon, etc., where ToS makes DIY sketchy)

When it's NOT worth it:

  • Anything below 10,000 pages/day (Webshare residential at $3-15/month does the same job)
  • General-purpose scraping (their pricing model assumes enterprise)

Verdict: industry-grade, enterprise-priced. If you're solo, you don't need this.

Apify

Marketplace of pre-built "actors" (scrapers) plus an automation platform.

Pricing: $49-499/month for compute + per-actor fees.

When it's worth it:

  • You need a pre-built scraper right now and one exists in their marketplace
  • You want a shared platform across your team
  • You don't want to build your own deployment

When it's NOT worth it:

  • The marketplace actor you depend on breaks (community-maintained actors often do)
  • You need custom logic; you're better off writing it from scratch

Verdict: the "App Store of scraping." Useful if you find an actor that does exactly what you need. Otherwise it's a lock-in trap.

ScrapFly

Newer scraping-API platform with strong technical reputation.

Pricing: similar to ScrapingBee.

When it's worth it: you specifically need their anti-bot bypass for hard targets (DataDome, PerimeterX) and don't want to learn nodriver.

Verdict: best technical depth among managed services. Same caveats as ScrapingBee on cost-at-volume.

Self-healing extraction (the 2026 pattern)

Not a tool, an approach. Use an LLM (gpt-4o-mini, Claude Haiku, or a local model served through Ollama) to extract structured data from page text against a JSON schema, instead of relying on CSS selectors.
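A minimal sketch of the pattern, assuming a three-field product schema. The schema, the prompt wording, and `call_llm` are all placeholders: `call_llm` stands for whatever thin wrapper you write around your provider's client, and is passed in so nothing here depends on a specific SDK.

```python
import json
from typing import Callable

# Hypothetical schema for a product page: field name -> accepted JSON types.
SCHEMA = {"title": (str,), "price": (int, float), "currency": (str,)}


def build_prompt(page_text: str) -> str:
    return (
        "Extract these fields from the page text and reply with one JSON object only: "
        + ", ".join(SCHEMA)
        + "\n\nPAGE TEXT:\n"
        + page_text[:8000]  # arbitrary cap to keep the prompt small
    )


def extract(page_text: str, call_llm: Callable[[str], str]) -> dict:
    """Ask the model for JSON, then validate it before trusting it."""
    data = json.loads(call_llm(build_prompt(page_text)))
    record = {}
    for name, kinds in SCHEMA.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, kinds):
            raise ValueError(f"missing or mistyped field: {name}")
        record[name] = value
    return record
```

The validation step is what makes this safe to run unattended: a redesign that confuses the model surfaces as a ValueError you can alert on, not as silently wrong data.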

When to use:

  • Sites that redesign frequently
  • Long-running monitors (>3 months) where maintenance burden is the dominant cost
  • When you want to extract content (not markup positions) — descriptions, summaries, entity-shaped data

When NOT to use:

  • High-volume, low-margin scraping (per-page LLM cost adds up at 1M+ pages/day)
  • Sites with stable, well-structured HTML or an official API (Wikipedia, GitHub's API, etc.)

Cost: ~$0.0003/page with gpt-4o-mini. For 10,000 pages/day, that is about $3/day in LLM cost.
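One way to sanity-check the per-page figure is to price it from token counts. Both inputs are assumptions: the per-token prices are the gpt-4o-mini list prices as I understand them and may have changed, and the page size of ~1,500 tokens in and ~100 tokens out is a guess at a trimmed product page.

```python
# Assumed gpt-4o-mini list prices in USD per 1M tokens; check current pricing.
INPUT_PER_M = 0.15
OUTPUT_PER_M = 0.60


def cost_per_page(input_tokens: int, output_tokens: int) -> float:
    """LLM cost in dollars for extracting one page."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000


# Assumed page size: ~1,500 tokens of page text in, ~100 tokens of JSON out.
per_page = cost_per_page(1_500, 100)
per_day = per_page * 10_000
print(f"${per_page:.6f}/page, ${per_day:.2f}/day at 10,000 pages/day")
```

Sending raw HTML rather than extracted text can multiply the input token count several times over, so strip markup before the call.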

Verdict: the single biggest pattern shift in 2026 scraping. Once you've shipped one self-healing extractor, you'll struggle to justify CSS selectors for new long-lived monitors.

Full guide: Self-Healing AI Web Extractors.

The honest cost comparison

For a typical "monitor 5 e-commerce stores daily, alert on price changes" pipeline:

| Approach | Setup time | Monthly cost | Annual cost |
| --- | --- | --- | --- |
| ScrapingBee (per-request) | 30 min | $99-300 | $1,188-3,600 |
| Bright Data (Web Unblocker) | 1 hour | $500+ | $6,000+ |
| Apify (community actor) | 30 min | $49-99 | $588-1,188 |
| Custom Python on $5 VPS | 4-6 hours | $5-15 | $60-180 |

For one client, the difference is $1,000-5,000/year of margin you can put in your own pocket if you can build the custom pipeline.

This is the math behind why a $5 VPS beats a ScrapingBee plan that runs about $1,200 a year.
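The "alert on price changes" half of that pipeline is mostly a diff between two snapshots. A minimal sketch, assuming each day's scrape is stored as a `{product_url: price}` dict; the function name and the 1% threshold are illustrative choices.

```python
def price_changes(
    previous: dict[str, float],
    current: dict[str, float],
    min_pct: float = 1.0,
) -> list[str]:
    """Compare two {product_url: price} snapshots and describe changes worth alerting on."""
    alerts = []
    for url, new_price in current.items():
        old_price = previous.get(url)
        if old_price is None or old_price == 0:
            continue  # new product or unusable baseline; nothing to compare against
        pct = (new_price - old_price) / old_price * 100
        if abs(pct) >= min_pct:
            alerts.append(f"{url}: {old_price:.2f} -> {new_price:.2f} ({pct:+.1f}%)")
    return alerts


yesterday = {"https://shop.example/a": 20.00, "https://shop.example/b": 10.00}
today = {"https://shop.example/a": 18.00, "https://shop.example/b": 10.00}
print(price_changes(yesterday, today))
```

Persist the snapshots however you like (a JSON file or SQLite both fit on the $5 VPS) and send the returned strings to email or a chat webhook.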

If you're evaluating tools for a specific brief and want a second opinion, send the brief to info@luba.media. I quote fixed-price and ship in 7-10 days, no managed-service markup.
