Eyal Rosenthal · Web scraping at scale

Scrapy vs Playwright vs Selenium: 2026 Decision Tree (with the Honest Verdict)

The most common mistake I see freelancers and clients make: treating Scrapy, Playwright, and Selenium as alternatives to each other. They aren't. They solve different problems.

This guide helps you pick the right one for any scraping brief: the decision tree and verdict first, then the honest tradeoffs.

The decision tree

START
  │
  ├── Is the data in the HTML source? (View Source, NOT Inspect Element)
  │     │
  │     ├── YES (server-rendered) → requests + BeautifulSoup
  │     │     │
  │     │     └── Is it a >10,000-page crawl with link-following? → Scrapy
  │     │
  │     └── NO (data loaded by JS) → Playwright
  │
  └── Need browser interaction (click, scroll, fill forms)?
        │
        ├── YES → Playwright
        │
        └── Legacy codebase already on Selenium → keep Selenium

In one sentence: requests + BeautifulSoup for static HTML, Playwright for JavaScript or interaction, Scrapy for industrial-scale crawls. Selenium is what you keep if you already have it.
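
The tree can also be written down as a small function, which makes the order of the questions explicit. This is only a sketch: the Brief class, its field names, and choose_tool are hypothetical, and it assumes an existing Selenium codebase only matters once a browser is actually needed.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    """Hypothetical description of a scraping job (field names are illustrative)."""
    data_in_html_source: bool   # visible in View Source, not just Inspect Element
    needs_interaction: bool     # clicks, scrolling, form fills
    page_count: int             # rough size of the crawl
    follows_links: bool         # the crawl discovers URLs as it goes
    has_selenium_codebase: bool = False


def choose_tool(brief: Brief) -> str:
    """Walk the decision tree top to bottom and return the recommended stack."""
    needs_browser = brief.needs_interaction or not brief.data_in_html_source
    if needs_browser:
        # Assumption: legacy Selenium wins only when a browser is required anyway.
        return "Selenium" if brief.has_selenium_codebase else "Playwright"
    if brief.page_count > 10_000 and brief.follows_links:
        return "Scrapy"
    return "requests + BeautifulSoup"


# A 200-page static catalogue with no interaction.
print(choose_tool(Brief(True, False, 200, False)))
```

The browser question is checked first because it overrides everything else: a 50,000-page crawl of a JS-only site still needs a browser.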

The three tools, in plain English

requests + BeautifulSoup

The baseline. HTTP client + HTML parser. Synchronous, simple, no browser involved.

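import requests
from bs4 import BeautifulSoup

# A timeout stops a stalled server from hanging the script.
r = requests.get("https://example.com/page", headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
r.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(r.text, "html.parser")
# Text of every <span class="price"> element.
prices = [el.get_text() for el in soup.select("span.price")]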

Wins when: the data is in the HTML the server sends. Most static sites, server-rendered React (Next.js with default config), Wikipedia, blogs, government data portals.

Loses when: the page is empty until JavaScript runs. SPA-heavy sites where requests.get() returns a hollow shell.
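
A quick way to automate the View Source check is to take one value you can see in the browser and look for it in the raw response body. The sketch below uses only the standard library; locate_value and the three sample pages are hypothetical. It separates a value found in ordinary markup from one that only appears inside a script block (embedded data that is still in the HTML source) and from one that is missing entirely.

```python
from html.parser import HTMLParser


class _SourceScanner(HTMLParser):
    """Splits page text into ordinary markup text and <script> bodies."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.markup_text = []
        self.script_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        target = self.script_text if self._in_script else self.markup_text
        target.append(data)


def locate_value(raw_html: str, known_value: str) -> str:
    """Return 'markup', 'script', or 'absent' for a value seen in the browser."""
    scanner = _SourceScanner()
    scanner.feed(raw_html)
    scanner.close()
    if known_value in "".join(scanner.markup_text):
        return "markup"
    if known_value in "".join(scanner.script_text):
        return "script"
    return "absent"


# Hypothetical responses: server-rendered, embedded data, hollow shell.
rendered = '<html><body><span class="price">£51.77</span></body></html>'
embedded = '<div id="root"></div><script>window.__DATA__={"price":"£51.77"}</script>'
hollow = '<div id="root"></div><script src="/app.js"></script>'

for page in (rendered, embedded, hollow):
    print(locate_value(page, "£51.77"))
```

In practice raw_html would be r.text from the HTTP client. An "absent" result is the hollow-shell case described above.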

Playwright

Modern headless browser automation. Drives a real Chromium / Firefox / WebKit instance. Multi-language (Python, JS, .NET, Java).

import asyncio
from playwright.async_api import async_playwright

async def main() -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://js-rendered-site.com")
        # Wait until the JS-rendered element exists before reading the DOM.
        await page.wait_for_selector(".dynamic-content")
        html = await page.content()
        await browser.close()
        return html

html = asyncio.run(main())

Wins when: data only exists after JavaScript runs. Sites with infinite scroll, login flows, multi-step interactions, click-to-reveal content.

Loses when: your target is server-rendered (you're paying 10-50× the time/memory for nothing). Or when scraping at very high volume (each browser instance is heavy).

Scrapy

Full-featured crawling framework with spiders, item loaders, middleware, pipelines, request deduplication, and (via extensions such as scrapy-redis or a hosted platform) distributed-crawl support.

import scrapy

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Wins when: large crawls (10,000+ pages) with complex link-following, dedup, pipelines. When you need persistent crawl state, distributed workers, integration with Scrapy Cloud or Spider Cloud.

Loses when: small jobs (1-1,000 pages). The framework overhead outweighs the benefit. Or jobs with heavy JavaScript rendering (Scrapy + scrapy-playwright works, but it's awkward — just use Playwright directly).

Selenium

Older browser automation framework. Pre-dates Playwright by roughly 15 years (2004 vs 2020).

Wins when: you have an existing Selenium codebase or your team's expertise is there.

Loses when: you're starting fresh. Playwright is faster and has cleaner async support, better cross-browser coverage, and a better developer experience. Selenium isn't broken, just legacy.

The honest performance comparison

For "fetch 1,000 pages from a static site":

Stack                              Wall time (cold cache)   Memory peak
requests (sync)                    ~12 minutes              ~50 MB
httpx (async, 50 parallel)         ~30 seconds              ~80 MB
Scrapy (CONCURRENT_REQUESTS=16)    ~45 seconds              ~150 MB
Playwright (sync, 1 browser)       ~25 minutes              ~700 MB
Playwright (async, 5 browsers)     ~6 minutes               ~3.5 GB

For static-HTML targets, Playwright is 10-50× slower than the right tool. This matters at scale.
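
The 10-50× range can be checked against the table's own figures. The short calculation below converts each wall time to seconds and divides by a baseline; treating the httpx row as "the right tool" is my assumption, and the dictionary keys are just labels.

```python
PAGES = 1_000

# Wall times from the table, converted to seconds.
wall_seconds = {
    "requests (sync)": 12 * 60,
    "httpx (async, 50 parallel)": 30,
    "Scrapy (16 concurrent)": 45,
    "Playwright (sync, 1 browser)": 25 * 60,
    "Playwright (async, 5 browsers)": 6 * 60,
}

# Assumption: async httpx is the baseline for a static site.
baseline = wall_seconds["httpx (async, 50 parallel)"]

slowdown = {name: secs / baseline for name, secs in wall_seconds.items()}
pages_per_second = {name: PAGES / secs for name, secs in wall_seconds.items()}

for name in wall_seconds:
    print(f"{name:32s} {slowdown[name]:5.1f}x  {pages_per_second[name]:6.2f} pages/s")
```

Against that baseline, five async browsers come out at 12× slower and a single sync browser at 50×, which is where the range comes from.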

For JS-rendered targets, Playwright is often the only practical option: requests gets back the hollow shell with no data in it.

When you actually need Scrapy

Scrapy earns its learning curve when:

  1. Crawl size is >10,000 pages with complex link-following
  2. You need pipelines — built-in deduplication, item-loaders, post-processing
  3. You need persistent crawl state — pause-resume, distributed workers, restart-safety
  4. You're integrating with Scrapy Cloud or Zyte's commercial offerings

For everything else, requests + BeautifulSoup (with httpx for async) wins on simplicity, speed-to-ship, and maintainability.
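
For a sense of what the simple route involves once link-following enters the picture, here is a minimal hand-rolled crawl loop with deduplication. It is a sketch, not a Scrapy replacement: crawl, get_links, and the SITE link graph are hypothetical, and it keeps all state in memory, so it has none of the pause-resume or restart-safety from item 3.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin


def crawl(start_url: str,
          get_links: Callable[[str], Iterable[str]],
          max_pages: int = 10_000) -> list[str]:
    """Breadth-first crawl that visits each URL at most once."""

    def normalize(url: str) -> str:
        # Drop #fragments so /a and /a#reviews count as the same page.
        return urldefrag(url).url

    start = normalize(start_url)
    seen = {start}            # dedup set: every URL ever queued
    frontier = deque([start])  # URLs waiting to be fetched
    visited = []

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for href in get_links(url):
            link = normalize(urljoin(url, href))  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical in-memory link graph standing in for real pages.
SITE = {
    "https://example.test/": ["/a", "/b", "#top"],
    "https://example.test/a": ["/b", "/"],
    "https://example.test/b": ["/a#reviews", "/c"],
    "https://example.test/c": [],
}

pages = crawl("https://example.test/", lambda url: SITE.get(url, []))
print(pages)
```

In a real job get_links would fetch the page and extract its href values. Once you also need retries, throttling, and state that survives a restart, you are rebuilding what Scrapy already ships.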

I shipped 30+ production scrapers before I touched Scrapy. The framework's benefits are real, but it's overkill 90% of the time.

When Playwright wins decisively

Playwright is the right choice when any of these are true:

  1. The page's data isn't in the HTML source
  2. You need to click, scroll, or fill forms
  3. You're scraping behind a login flow
  4. You need to bypass Cloudflare's JS challenge (Playwright executes the challenge JS naturally)
  5. You're testing your own web app

For #1 + #4, Playwright is often the only working choice. Don't waste time trying to reverse-engineer the API or hack around the JS challenge — just spin up a browser.

What about httpx, aiohttp, requests-html?

  • httpx — HTTP client with a requests-like API and both sync and async support. Excellent. Use this when you need parallel requests but not browser rendering.
  • aiohttp — older async HTTP library. Still solid. httpx is generally cleaner.
  • requests-html — HTTP + headless rendering in one package. Was popular ~2018-2021; now mostly superseded by Playwright + bs4.
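
The "parallel requests without a browser" pattern is mostly a concurrency cap around an async fetch. The sketch below shows that cap with asyncio alone so it runs without a network: fetch_all and fake_fetch are hypothetical names, and in a real job the fetch callable would wrap something like an httpx.AsyncClient request.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def fetch_all(urls: Sequence[str],
                    fetch: Callable[[str], Awaitable[str]],
                    limit: int = 50) -> list[str]:
    """Fetch every URL with at most `limit` requests in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with sem:  # wait for a free slot before starting the request
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP call; sleeps briefly and returns a stub body."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


urls = [f"https://example.test/page/{i}" for i in range(200)]
bodies = asyncio.run(fetch_all(urls, fake_fetch, limit=50))
print(len(bodies))
```

The semaphore is what keeps 50 parallel requests from becoming 1,000: without it, gather starts every request at once.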

What about curl_cffi?

Drop-in requests replacement that impersonates Chrome/Firefox at the TLS-fingerprint level. Use this in place of requests for any anti-bot-protected site. It's not an alternative to Playwright; it's an upgrade to your HTTP client.

from curl_cffi import requests as cf

URL = "https://example.com/page"  # placeholder target
r = cf.get(URL, impersonate="chrome131")  # one-line upgrade

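Defeats TLS-fingerprint checks such as Cloudflare's and DataDome's TLS layer. It does not change your IP reputation; that is what proxies are for. In my experience, ~80% of "anti-bot" cases are solved by switching from requests to curl_cffi.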

The full 2026 stack ranked

For a typical production scraping job, the stack I reach for in order:

  1. curl_cffi + BeautifulSoup — works for ~70% of jobs
  2. curl_cffi + BeautifulSoup + Webshare residential proxy — adds ~15% (sites with IP-rep gates)
  3. Playwright + Webshare residential — adds ~10% (JS-rendered sites)
  4. Playwright + Webshare + 2Captcha + behavioral simulation — adds ~4% (DataDome / PerimeterX)
  5. Scrapy + the above — adds ~1% (large-scale industrial crawls)

If a target still resists all five tiers, it's usually because of legal/ToS issues, not technical ones, and the right answer is a different source (official API, paid dataset, partner data feed).
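
The ranked stack reads naturally as an escalation ladder: try the cheapest tier first and move up only when the response is blocked or unusable. Here is a minimal sketch of that control flow. Everything in it is hypothetical (fetch_with_escalation, the tier names, the stub fetchers, and the validity check); real tiers would wrap curl_cffi, a proxy, or a browser.

```python
from typing import Callable, Optional, Sequence

Fetcher = Callable[[str], Optional[str]]


def fetch_with_escalation(url: str,
                          tiers: Sequence[tuple[str, Fetcher]],
                          looks_valid: Callable[[str], bool]) -> tuple[str, str]:
    """Try each tier in order; return (tier_name, body) for the first usable response."""
    for name, fetch in tiers:
        try:
            body = fetch(url)
        except Exception:
            # Sketch only: any error counts as "blocked at this tier".
            continue
        if body is not None and looks_valid(body):
            return name, body
    raise RuntimeError(f"all tiers failed for {url}")


# Stub fetchers standing in for the real stack.
def plain_http(url: str) -> str:
    return "<title>Just a moment...</title>"  # challenge page, no data


def http_with_proxy(url: str) -> str:
    raise ConnectionError("blocked")


def browser(url: str) -> str:
    return '<span class="price">£51.77</span>'


tiers = [("http", plain_http), ("http+proxy", http_with_proxy), ("browser", browser)]
tier, body = fetch_with_escalation(
    "https://example.test/item/1", tiers, looks_valid=lambda b: 'class="price"' in b
)
print(tier)
```

The looks_valid check matters as much as the tiers: a blocked request often comes back as a 200 with a challenge page, so a status code alone will not tell you to escalate.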

What I'd avoid in 2026

  • Selenium for new projects. Pick Playwright instead.
  • requests for any site with anti-bot. Use curl_cffi instead.
  • Scrapy for crawls under ~5,000 pages. At that size the framework's overhead and learning curve aren't worth it.
  • Headless Chrome via raw CDP. Playwright wraps this much more cleanly.
  • Custom HTTP/2 fingerprint forging. curl_cffi does this; don't roll your own.

If you have a specific brief and want a tool recommendation, send the use case to info@luba.media. I'll match the tool to the job, free.

Hire me to build this for your site

I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.
