Scrapy vs Playwright vs Selenium: 2026 Decision Tree (with the Honest Verdict)
The most common mistake I see freelancers and clients make: treating Scrapy, Playwright, and Selenium as alternatives to each other. They aren't. They solve different problems.
This guide picks the right one for any scraping brief, with the verdict-first decision tree, then the honest tradeoffs.
The decision tree
START
│
├── Is the data in the HTML source? (View Source, NOT Inspect Element)
│   │
│   ├── YES (server-rendered) → requests + BeautifulSoup
│   │   │
│   │   └── Is it a >10,000-page crawl with link-following? → Scrapy
│   │
│   └── NO (data loaded by JS) → Playwright
│
└── Need browser interaction (click, scroll, fill forms)?
    │
    ├── YES → Playwright
    │
    └── Legacy codebase already on Selenium → keep Selenium
In one sentence: requests + BeautifulSoup for static HTML, Playwright for JavaScript or interaction, Scrapy for industrial-scale crawls. Selenium is what you keep if you already have it.
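One way to read the tree is as a small function. The sketch below is an interpretation, not a rule engine: the function name, the parameter names, and the choice to consult the Selenium branch only when a browser is needed are assumptions layered on top of the tree. The 10,000-page threshold comes from the tree itself.

```python
def choose_tool(
    data_in_html_source,
    needs_interaction=False,
    page_count=1,
    follows_links=False,
    has_selenium_codebase=False,
):
    """Return a tool name for a scraping brief, following the decision tree."""
    # Anything that needs a real browser: JS-loaded data or user interaction.
    if not data_in_html_source or needs_interaction:
        # Assumption: an existing Selenium codebase only matters here,
        # because the static-HTML branch never needs a browser at all.
        return "Selenium" if has_selenium_codebase else "Playwright"

    # Static HTML, but large enough to justify a crawling framework.
    if page_count > 10_000 and follows_links:
        return "Scrapy"

    # Static HTML at ordinary scale.
    return "requests + BeautifulSoup"
```

For example, `choose_tool(True, page_count=50_000, follows_links=True)` lands on Scrapy, while `choose_tool(False)` lands on Playwright.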
The three tools, in plain English
requests + BeautifulSoup
The baseline. HTTP client + HTML parser. Synchronous, simple, no browser involved.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://example.com/page", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")
prices = [el.get_text() for el in soup.select("span.price")]
Wins when: the data is in the HTML the server sends. Most static sites, server-rendered React (Next.js with default config), Wikipedia, blogs, government data portals.
Loses when: the page is empty until JavaScript runs. SPA-heavy sites where requests.get() returns a hollow shell.
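A quick way to spot a hollow shell is to measure how much visible text the raw response contains once scripts and styles are ignored. The sketch below uses only the standard library. The helper names and the 200-character threshold are assumptions chosen for illustration, and the check is a heuristic rather than a reliable detector.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that is not inside <script>, <style>, or <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the human-visible text of an HTML document as one string."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_hollow_shell(html, min_chars=200):
    """Heuristic: very little visible text suggests the page needs JavaScript."""
    return len(visible_text(html)) < min_chars
```

In practice you would pass it `r.text` from the request above. A `True` result is a hint to open the page in a browser and compare before reaching for Playwright.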
Playwright
Modern headless browser automation. Drives a real Chromium / Firefox / WebKit instance. Multi-language (Python, JS, .NET, Java).
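import asyncio
from playwright.async_api import async_playwright

async def main():
    # Launch headless Chromium, wait for the JS-rendered node, then read the final DOM
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://js-rendered-site.com")
        await page.wait_for_selector(".dynamic-content")
        html = await page.content()
        await browser.close()
    return html

html = asyncio.run(main())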
Wins when: data only exists after JavaScript runs. Sites with infinite scroll, login flows, multi-step interactions, click-to-reveal content.
Loses when: your target is server-rendered (you're paying 10-50× the time/memory for nothing). Or when scraping at very high volume (each browser instance is heavy).
Scrapy
Full-featured crawling framework with spiders, item-loaders, middleware, pipelines, deduplication, distributed-crawl support.
import scrapy

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        # One item per book card on the listing page
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
        # Follow pagination until there is no "next" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
Wins when: large crawls (10,000+ pages) with complex link-following, dedup, pipelines. When you need persistent crawl state, distributed workers, integration with Scrapy Cloud or Spider Cloud.
Loses when: small jobs (1-1,000 pages). The framework overhead outweighs the benefit. Or jobs with heavy JavaScript rendering (Scrapy + scrapy-playwright works, but it's awkward — just use Playwright directly).
Selenium
Older browser automation framework. Pre-dates Playwright by roughly 15 years (Selenium dates to 2004; Playwright launched in 2020).
Wins when: you have an existing Selenium codebase or your team's expertise is there.
Loses when: you're starting fresh. Playwright is faster and has cleaner async support, better cross-browser coverage, and better DX. Selenium isn't broken, just legacy.
The honest performance comparison
For "fetch 1,000 pages from a static site":
| Stack | Wall time (cold cache) | Memory peak |
|---|---|---|
| requests (sync) | ~12 minutes | ~50 MB |
| httpx (async, 50 parallel) | ~30 seconds | ~80 MB |
| Scrapy (concurrent_requests=16) | ~45 seconds | ~150 MB |
| Playwright (sync, 1 browser) | ~25 minutes | ~700 MB |
| Playwright (async, 5 browsers) | ~6 minutes | ~3.5 GB |
For static-HTML targets, Playwright is 10-50× slower than the right tool. This matters at scale.
For JS-rendered targets, Playwright is the only option — requests returns nothing useful.
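The slowdown figure can be checked directly against the table. The sketch below converts the approximate wall times to seconds and divides them. The dictionary keys and the `slowdown` helper are illustrative names, and the inputs are the article's rounded numbers, so the ratios are rough.

```python
# Approximate wall times from the table, converted to seconds.
WALL_SECONDS = {
    "requests (sync)": 12 * 60,
    "httpx (async, 50 parallel)": 30,
    "Scrapy (16 concurrent)": 45,
    "Playwright (sync, 1 browser)": 25 * 60,
    "Playwright (async, 5 browsers)": 6 * 60,
}


def slowdown(slow, fast, table=WALL_SECONDS):
    """How many times longer `slow` takes than `fast` for the same 1,000 pages."""
    return table[slow] / table[fast]


if __name__ == "__main__":
    for browser in ("Playwright (sync, 1 browser)", "Playwright (async, 5 browsers)"):
        for http in ("httpx (async, 50 parallel)", "Scrapy (16 concurrent)"):
            print(f"{browser} vs {http}: {slowdown(browser, http):.0f}x")
```

With these inputs the ratios run from 8x (five async browsers against Scrapy) to 50x (one sync browser against httpx), which brackets the range quoted above.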
When you actually need Scrapy
Scrapy earns its learning curve when:
- Crawl size is >10,000 pages with complex link-following
- You need pipelines — built-in deduplication, item-loaders, post-processing
- You need persistent crawl state — pause-resume, distributed workers, restart-safety
- You're integrating with Scrapy Cloud or Zyte's commercial offerings
For everything else, requests + BeautifulSoup (with httpx for async) wins on simplicity, speed-to-ship, and maintainability.
I shipped 30+ production scrapers before I touched Scrapy. The framework is real but it's also overkill 90% of the time.
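To make the "deduplication" and "persistent crawl state" items concrete, here is roughly what you end up hand-rolling without a framework. This is a minimal sketch and not how Scrapy implements it: the `Frontier` class, its method names, and the fragment-stripping rule are assumptions made for illustration.

```python
import json
from collections import deque
from urllib.parse import urldefrag


class Frontier:
    """A tiny crawl queue with URL deduplication and JSON save/restore."""

    def __init__(self):
        self.seen = set()
        self.queue = deque()

    def add(self, url):
        """Queue a URL unless it was already seen. Returns True if queued."""
        url, _fragment = urldefrag(url)  # treat /page and /page#top as one URL
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None

    def dump(self):
        """Serialize state so a stopped crawl can resume later."""
        return json.dumps({"seen": sorted(self.seen), "queue": list(self.queue)})

    @classmethod
    def load(cls, blob):
        """Rebuild a Frontier from the string produced by dump()."""
        state = json.loads(blob)
        frontier = cls()
        frontier.seen = set(state["seen"])
        frontier.queue = deque(state["queue"])
        return frontier
```

Thirty lines covers a small job. Once you also need retries, politeness delays, and multiple workers, the hand-rolled version starts to resemble a framework, and that is the point where Scrapy earns its keep.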
When Playwright wins decisively
Playwright is the right choice when any of these are true:
- The page's data isn't in the HTML source
- You need to click, scroll, or fill forms
- You're scraping behind a login flow
- You need to bypass Cloudflare's JS challenge (Playwright executes the challenge JS naturally)
- You're testing your own web app
For the first and fourth cases (data missing from the HTML source, Cloudflare's JS challenge), Playwright is often the only working choice. Don't waste time trying to reverse-engineer the API or hack around the JS challenge. Just spin up a browser.
What about httpx, aiohttp, requests-html?
- httpx: async HTTP client, drop-in successor to requests for async needs. Excellent. Use this when you need parallel requests but not browser rendering.
- aiohttp: older async HTTP library. Still solid. httpx is generally cleaner.
- requests-html: HTTP + headless rendering in one package. Was popular ~2018-2021; now mostly superseded by Playwright + bs4.
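The "parallel requests without a browser" case usually comes down to bounded concurrency. The sketch below shows the pattern with the standard library only. `fetch_all` and its `fetch` parameter are illustrative names, and `fetch` stands in for whatever coroutine performs the request (with httpx, a thin wrapper around an `httpx.AsyncClient` instance's `get` method would be one option).

```python
import asyncio


async def fetch_all(urls, fetch, limit=50):
    """Run `fetch(url)` for every URL with at most `limit` requests in flight.

    `fetch` is any coroutine function taking a URL. Results come back in the
    same order as `urls`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # The semaphore blocks here once `limit` fetches are already running.
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(bounded(url) for url in urls))
```

Passing the fetcher in as an argument keeps the concurrency logic independent of the HTTP library, so the same function works if you later swap clients.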
What about curl_cffi?
Drop-in requests replacement that impersonates Chrome/Firefox at the TLS-fingerprint level. Use this in place of requests for any anti-bot-protected site. It's not an alternative to Playwright; it's an upgrade to your HTTP client.
from curl_cffi import requests as cf
URL = "https://example.com/page"  # placeholder target
r = cf.get(URL, impersonate="chrome131")  # one-line upgrade over requests.get
Defeats Cloudflare, DataDome's TLS layer, basic IP-rep gates. ~80% of "anti-bot" cases solved by switching from requests to curl_cffi.
The full 2026 stack ranked
For a typical production scraping job, the stack I reach for in order:
- curl_cffi + BeautifulSoup: works for ~70% of jobs
- curl_cffi + BeautifulSoup + Webshare residential proxy: adds ~15% (sites with IP-rep gates)
- Playwright + Webshare residential: adds ~10% (JS-rendered sites)
- Playwright + Webshare + 2Captcha + behavioral simulation: adds ~4% (DataDome / PerimeterX)
- Scrapy + the above: adds ~1% (large-scale industrial crawls)
If a target falls outside that 100%, it's usually because of legal/ToS issues, not technical ones, and the right answer is a different source (official API, paid dataset, partner data feed).
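The ranking works as an escalation ladder: try the cheapest tier first and move up only when it fails. The sketch below adds up the rough shares listed above and shows the "first tier that works" loop. The function names are illustrative, and the fetchers are plain callables you would supply yourself, not real integrations with any of the named tools.

```python
from itertools import accumulate

# Tier names and approximate shares as listed in the ranking.
TIERS = [
    ("curl_cffi + BeautifulSoup", 70),
    ("curl_cffi + BeautifulSoup + residential proxy", 15),
    ("Playwright + residential proxy", 10),
    ("Playwright + proxy + CAPTCHA solving + behavioral simulation", 4),
    ("Scrapy + the above", 1),
]


def cumulative_coverage(tiers=TIERS):
    """Return (tier name, running total of share) pairs."""
    names = [name for name, _ in tiers]
    totals = accumulate(share for _, share in tiers)
    return list(zip(names, totals))


def first_working_tier(url, fetchers):
    """Try each (name, fetch) pair in order; return the first non-empty result.

    A fetcher signals failure by raising or by returning an empty value.
    Real code should catch the specific exceptions its client raises.
    """
    for name, fetch in fetchers:
        try:
            result = fetch(url)
        except Exception:
            continue  # blocked or errored: escalate to the next tier
        if result:
            return name, result
    return None, None
```

The running totals come out to 70, 85, 95, 99, and 100, which is why the first two tiers are worth trying before any browser gets launched.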
What I'd avoid in 2026
- Selenium for new projects. Pick Playwright instead.
- requests for any site with anti-bot. Use curl_cffi instead.
- Scrapy for crawls under ~5,000 pages. The framework overhead isn't worth the learning curve.
- Headless Chrome via raw CDP. Playwright wraps this much more cleanly.
- Custom HTTP/2 fingerprint forging. curl_cffi does this; don't roll your own.
What to read next
- Getting Started with Web Scraping — for first-timers
- Bypassing Cloudflare, DataDome, and PerimeterX — when the basic stack hits a wall
- Self-Healing AI Web Extractors — the parsing-layer upgrade
- Best Residential Proxy Services 2026 — the IP-rotation layer
If you have a specific brief and want a tool recommendation, send the use case to info@luba.media. I'll match the tool to the job, free.
Hire me to build this for your site
I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.