Eyal Rosenthal · Web scraping at scale

Bypassing Cloudflare, DataDome, and PerimeterX in 2026: A Working Playbook

Every scraping job that pays well lives behind an anti-bot wall. The cheap jobs ($30 for a CSV) are on bare WordPress sites. The retainer jobs ($300-1,000/month) are on Cloudflare-protected e-commerce, DataDome-fronted enterprise B2B, and PerimeterX-shielded ad platforms.

The difference between a $50 freelancer and a $150 freelancer on Upwork is often just whether they know which anti-bot vendor a target uses, and which tool to reach for next.

Here's the playbook I run in production.

The decision tree

Before writing any code, identify the wall. The right tool depends on what you're up against, and 95% of sites use one of four configurations:

  1. Plain HTTP with rate limits: requests + a real User-Agent + a 0.5s delay handles it.
  2. Cloudflare with JS challenge: curl_cffi with browser impersonation handles 80%, headless nodriver handles the remainder.
  3. DataDome / PerimeterX: only headless undetected browsers work, and you usually need residential IPs.
  4. CAPTCHAs (Cloudflare Turnstile, hCaptcha, reCAPTCHA): solve via a service ($1.50/1000), don't roll your own.
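
One way to identify the wall is to look at the headers and cookie names on a first response. The sketch below is a rough heuristic, not a complete detector: the function name identify_wall is hypothetical, and the markers it checks (the cf-ray header and Server: cloudflare, a cookie named datadome, cookies starting with _px) are the common ones rather than an exhaustive list.

```python
from typing import Iterable, Mapping


def identify_wall(headers: Mapping[str, str], cookie_names: Iterable[str]) -> str:
    """Guess the anti-bot vendor from response headers and cookie names."""
    h = {k.lower(): v.lower() for k, v in headers.items()}
    cookies = {c.lower() for c in cookie_names}

    # Check the stricter vendors first: they often sit behind a Cloudflare CDN too.
    # DataDome sets a cookie literally named "datadome".
    if "datadome" in cookies or any(k.startswith("x-datadome") for k in h):
        return "datadome"
    # PerimeterX cookies start with "_px" (_px3, _pxvid, _pxhd, ...).
    if any(c.startswith("_px") for c in cookies):
        return "perimeterx"
    # Cloudflare adds a cf-ray header and names itself in the Server header.
    if "cf-ray" in h or h.get("server", "") == "cloudflare":
        return "cloudflare"
    return "plain"
```

With requests, something like identify_wall(r.headers, r.cookies.keys()) on the first response should be enough to pick a tier before writing the real fetcher.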

Most freelancers blow up because they reach for Selenium / undetected-chromedriver as the default. That's the sledgehammer. The right tool for tier 1 and tier 2 (the majority of paying jobs) is curl_cffi, which is two orders of magnitude cheaper and faster.

Tier 1: plain HTTP with rate limits

The trap on tier-1 sites is over-engineering. Realistic browser headers and a respectful delay are enough.

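import time
from random import uniform

import requests

session = requests.Session()
session.headers.update({
    # Real Chrome reports a frozen macOS version (10_15_7) and a zeroed minor version.
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # requests only decodes "br" when the brotli package is installed, so don't advertise it.
    "Accept-Encoding": "gzip, deflate",
})

urls = ["https://example.com/page/1"]  # placeholder: the pages you want


def process(html: str) -> None:
    """Placeholder: parse and store the page."""


for url in urls:
    r = session.get(url, timeout=20)
    r.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    process(r.text)
    time.sleep(uniform(0.5, 1.5))  # jitter: kills the steady-rate fingerprint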

The jitter is non-negotiable. Rate limiters look for steady-rate timing patterns. Random sleep breaks the pattern.
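
When a tier-1 site does push back, it usually answers with a 429 or 503. A minimal sketch of how the sleep could adapt, assuming the server may send a numeric Retry-After header (the date form of that header is ignored here); next_delay and its base and cap parameters are hypothetical names, not part of requests.

```python
import random
from typing import Optional


def next_delay(status: int, retry_after: Optional[str], attempt: int,
               base: float = 0.5, cap: float = 60.0) -> float:
    """Seconds to sleep before the next request."""
    if status in (429, 503):
        # Honour a numeric Retry-After header when the server sends one.
        if retry_after is not None and retry_after.strip().isdigit():
            return min(float(retry_after), cap)
        # Otherwise back off exponentially, with jitter so retries don't align.
        backoff = min(base * (2 ** attempt), cap)
        return random.uniform(backoff / 2, backoff)
    # Normal case: the same 0.5-1.5s jitter used in the loop above.
    return random.uniform(base, base * 3)
```

In the loop, this would replace the fixed uniform(0.5, 1.5) call: pass r.status_code, r.headers.get("Retry-After"), and a counter of consecutive throttled responses.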

Tier 2: Cloudflare's JS challenge

Cloudflare, even on its free tier, challenges clients whose TLS fingerprint doesn't match a real browser. requests and httpx use Python's ssl module (OpenSSL), whose handshake has a distinctive signature that Cloudflare can flag before a single HTTP header is read.

The fix is curl_cffi, which compiles curl against the same BoringSSL library Chrome uses, and exposes a requests-shaped API. Same code, real browser fingerprint, ~0.1s overhead.

from curl_cffi import requests as cf

r = cf.get("https://protected-site.com/page",
           impersonate="chrome131",
           timeout=20)

That's it. No headless browser, no JS engine, no proxy. ~80% of Cloudflare-protected sites I've encountered open up with this single change.

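For the 20% where Cloudflare escalates to a JS challenge, route to nodriver. It's the successor to undetected-chromedriver from the same author: it drops Selenium and chromedriver and drives Chrome directly over the DevTools protocol, with a cleaner async API.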

import nodriver

async def fetch_with_jschallenge(url):
    browser = await nodriver.start(headless=False)  # headless=True works on most sites
    try:
        page = await browser.get(url)
        await page.wait_for_ready_state("complete")
        return await page.get_content()
    finally:
        browser.stop()  # synchronous in nodriver: no await

# Run with: html = nodriver.loop().run_until_complete(fetch_with_jschallenge(url))

The reason this works: Cloudflare's challenge is a real JS execution test, and a real browser engine passes it. nodriver drives a stock Chrome install without the WebDriver signals that automation checks look for, so the session looks real enough to pass.
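
The escalation from the cheap client to a browser can be sketched as a small router. This assumes the challenge response carries Cloudflare's cf-mitigated: challenge header; fetch_html, http_get, and browser_get are hypothetical names, and the two fetchers are injected as plain callables so the routing logic stays independent of curl_cffi and nodriver.

```python
from typing import Callable, Mapping, Tuple

# (headers, body) as returned by the cheap HTTP client.
Response = Tuple[Mapping[str, str], str]


def is_cloudflare_challenge(headers: Mapping[str, str]) -> bool:
    """True when the response is a Cloudflare challenge page rather than content."""
    h = {k.lower(): v.lower() for k, v in headers.items()}
    # Cloudflare marks challenge responses with "cf-mitigated: challenge".
    return "challenge" in h.get("cf-mitigated", "")


def fetch_html(url: str,
               http_get: Callable[[str], Response],
               browser_get: Callable[[str], str]) -> str:
    """Try the cheap client first; pay for a browser only on a challenge."""
    headers, body = http_get(url)
    if is_cloudflare_challenge(headers):
        return browser_get(url)
    return body
```

In practice http_get would wrap the curl_cffi call and return (r.headers, r.text), and browser_get would be a synchronous wrapper around the nodriver coroutine. Keeping the browser behind a check like this is what preserves the cost advantage on the 80% of pages that never challenge.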

Tier 3: DataDome and PerimeterX

DataDome is meaner. It fingerprints:

  • TLS handshake (defeats requests)
  • Canvas rendering (defeats most headless setups)
  • Mouse movement patterns (defeats naive automation)
  • IP reputation (defeats datacenter IPs)

You need three things stacked: a stealth-patched browser, residential IPs, and human-shaped mouse movement.

import nodriver

async def fetch_datadome(url, residential_proxy: str):
    browser = await nodriver.start(
        headless=True,
        browser_args=[f"--proxy-server={residential_proxy}"],
    )
    try:
        page = await browser.get(url)
        await page.wait_for_ready_state("complete")
        # Move the mouse before reading content: DataDome scores you on input behavior.
        # Script-dispatched events carry isTrusted=false, so treat this as a weak signal.
        await page.evaluate("""
            window.dispatchEvent(new MouseEvent('mousemove', {clientX: 200, clientY: 300}));
            window.dispatchEvent(new MouseEvent('mousemove', {clientX: 400, clientY: 500}));
        """)
        return await page.get_content()
    finally:
        browser.stop()  # synchronous in nodriver: no await

For residential IPs I use Webshare ($3-15/month) or IPRoyal for higher volume. Avoid datacenter IPs (DigitalOcean, AWS, Linode) on DataDome — instant flag.

PerimeterX is similar to DataDome but more sensitive to canvas fingerprints. The same stack works, but increase your patience window — sometimes the first 2-3 requests get challenged before you're trusted.

Tier 4: CAPTCHAs

If you hit a CAPTCHA, stop trying to solve it programmatically. Use a service.

  • 2Captcha: $1.50 per 1,000 reCAPTCHA solves. Decent for hCaptcha and Turnstile too.
  • CapSolver: $0.50 per 1,000 for image-based, $1.00 for hCaptcha. Faster than 2Captcha most days.

The integration is two HTTP calls, a submit and a polling loop:

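import time

import requests

def solve_recaptcha(site_key: str, page_url: str, api_key: str, max_wait: float = 180.0) -> str:
    r = requests.post("https://2captcha.com/in.php", data={
        "key": api_key, "method": "userrecaptcha",
        "googlekey": site_key, "pageurl": page_url, "json": 1,
    }, timeout=30)
    submitted = r.json()
    if submitted["status"] != 1:
        raise RuntimeError(f"2Captcha rejected the task: {submitted['request']}")
    captcha_id = submitted["request"]
    deadline = time.monotonic() + max_wait  # give up instead of polling forever
    while time.monotonic() < deadline:
        time.sleep(5)
        r = requests.get("https://2captcha.com/res.php", params={
            "key": api_key, "action": "get", "id": captcha_id, "json": 1,
        }, timeout=30)
        result = r.json()
        if result["status"] == 1:
            return result["request"]  # the token to submit
        if result["request"] != "CAPCHA_NOT_READY":  # spelling is 2Captcha's own
            raise RuntimeError(f"2Captcha error: {result['request']}")
    raise TimeoutError(f"no solution within {max_wait:.0f}s")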

You inject the returned token into the page's hidden field (g-recaptcha-response for reCAPTCHA), submit the form, and you're through. Expect ~12-30 seconds per solve. Build it into the request flow once and forget about it.

The math: when to pay for ScrapingBee vs build it yourself

ScrapingBee charges ~$0.001 per request on their cheapest plan, ~$0.005 on the API-credit plan. Sounds cheap. Here's what those numbers really mean:

Volume          ScrapingBee cost    Self-built cost                  Break-even
1k req/day      $30/mo              $5 VPS + $3 proxy = $8/mo        270 req/day
10k req/day     $300/mo             $5 VPS + $15 proxy = $20/mo      --
100k req/day    $3,000/mo           $40 VPS + $50 proxy = $90/mo     --

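The break-even is around 270 requests/day: $8/mo of self-built cost divided by $0.001 per request, spread over 30 days. Above that, self-built wins, and by 10k requests/day the gap is an order of magnitude. Below that, ScrapingBee's convenience may justify the price, but you're also paying to be locked into their feature set.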
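
The arithmetic behind those figures, as a small sketch you can rerun with your own prices. The function names are hypothetical, a 30-day month is assumed, and the self-built cost is held fixed at the table's 1k-requests/day figure.

```python
def monthly_api_cost(req_per_day: int, price_per_request: float,
                     days_per_month: int = 30) -> float:
    """What a per-request API costs per month at a given daily volume."""
    return req_per_day * days_per_month * price_per_request


def break_even_req_per_day(self_built_monthly: float, price_per_request: float,
                           days_per_month: int = 30) -> float:
    """Daily volume at which the API bill equals the self-built bill."""
    return self_built_monthly / price_per_request / days_per_month


# Figures from the table: $0.001 per request, $8/mo self-built at low volume.
print(round(break_even_req_per_day(8, 0.001)))  # daily break-even volume
print(round(monthly_api_cost(100_000, 0.001)))  # API bill at 100k req/day
```

The break-even moves linearly with both inputs: double the proxy bill and the threshold doubles, halve the per-request price and it doubles again.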

The freelancer math: at 100k req/day, you can quote $200/mo for "managed scraping pipeline" and pocket the $110/mo margin while delivering the same outcome ScrapingBee would charge $3,000 for. Clients love this when they figure out the spread.

When the wall wins

There are sites I won't touch:

  • Sites with active legal enforcement (LinkedIn after the hiQ ruling unwound)
  • Sites with paid-tier APIs and explicit ToS bans on scraping (Twitter/X, Reddit since 2023)
  • Sites with Akamai Bot Manager + active human review (most major airlines)

These aren't technical problems, they're risk problems. I quote fixed-price scraping work only on targets where ToS doesn't ban third-party data collection. If a brief asks me to scrape LinkedIn, I tell the client to look at LinkedIn's official APIs or Apollo / Lusha instead.

Related reading

  • Self-Healing AI Web Extractors: once you're past the wall, the next problem is markup that changes. This is the schema-driven fix.
  • Why $5/mo VPS Beats $1,200/mo ScrapingBee: the cost math above, with the actual VPS / proxy / scheduler stack I run in production.
  • The repo: portfolio_demos/self_healing_scraper/ and portfolio_demos/competitor_watch/. Both ship with curl_cffi as the primary HTTP client.

If you have a Cloudflare / DataDome / PerimeterX target that's blocking your team, send the URL to info@luba.media. Quote within 24h, fixed-price.

Hire me to build this for your site

I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.
