Eyal Rosenthal · Web scraping at scale

Web Scraping Glossary: Every Term Defined Plainly

Every term you'll see in scraping tutorials, defined plainly. Bookmark this page; you'll come back to it.

A

Anti-bot — Software designed to detect and block automated traffic. Examples: Cloudflare, DataDome, PerimeterX, Akamai Bot Manager. Vendors sell it as fraud prevention, but it catches scrapers as a side effect.

API (Application Programming Interface) — A direct, structured way to ask a website for data, returning JSON or XML instead of HTML. Always prefer an API over scraping when one exists; scraping is what you do when there's no API.

Async (Asynchronous) — A programming style where you don't wait for one request to finish before starting the next. Critical when you need to fetch hundreds of pages concurrently. Python's asyncio module, together with the httpx or aiohttp libraries, enables this.
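
A minimal sketch of the idea, using only the standard library. The fetch function here is a stand-in that sleeps instead of making a real HTTP call, and the URLs and the concurrency limit of 10 are assumptions chosen for illustration; in practice you would swap in an httpx or aiohttp request.

```python
import asyncio
import time


async def fetch(url: str) -> str:
    # Stand-in for a real HTTP call; sleeps 0.1s to simulate network latency.
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"


async def fetch_all(urls: list[str], limit: int = 10) -> list[str]:
    # The semaphore caps how many fetches are in flight at once.
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with sem:
            return await fetch(url)

    # gather() runs the coroutines concurrently and keeps results in input order.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(20)]  # hypothetical URLs
start = time.perf_counter()
pages = asyncio.run(fetch_all(urls))
elapsed = time.perf_counter() - start
print(f"{len(pages)} pages in {elapsed:.2f}s")
```

Fetched one at a time, the 20 simulated requests would take about two seconds; run concurrently in two waves of ten they finish in a fraction of that.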

B

BeautifulSoup (bs4) — A Python library for parsing HTML and XML. The ergonomic, beginner-friendly choice: easier to use than lxml, but slower. Most beginner tutorials use this.

Browser fingerprint — The unique combination of attributes a browser exposes (TLS version, supported ciphers, screen resolution, fonts, WebGL renderer, plugin list, timezone). Anti-bot systems use this to identify scrapers; for example, Python's default TLS stack has a different fingerprint from Chrome's.

C

CAPTCHA — A test designed to tell humans from bots. reCAPTCHA, hCaptcha, and Cloudflare Turnstile are the major variants. Can be solved programmatically via services like 2Captcha or CapSolver (~$1.50 per 1,000 solves).

Cloudflare — The most-deployed anti-bot/CDN on the public web. Its "I'm Under Attack" mode and Bot Fight Mode are the most common things you'll hit. Roughly 80% of cases are defeated by curl_cffi (TLS impersonation) and the remaining ~20% by nodriver (JS challenge).

Crawler / Crawling — Programmatically discovering URLs by following links. Distinct from scraping, which extracts data from URLs you already know. Most jobs are crawling-then-scraping.
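
The discovery half of that pipeline can be sketched as a breadth-first walk over links. This is an illustrative sketch only: the three-page site is a hypothetical in-memory dictionary so the code runs offline, and a real crawler would fetch each page over HTTP and respect rate limits and robots.txt.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page site, held in memory so the sketch runs offline.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<h1>Page A</h1> <a href="/b">B</a>',
    "https://example.com/b": '<h1>Page B</h1> <a href="/">home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start):
    # Breadth-first discovery; `seen` prevents visiting a URL twice.
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkParser()
        parser.feed(SITE.get(url, ""))
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute in SITE and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


discovered = crawl("https://example.com/")
print(discovered)
```

The list of discovered URLs is what you would then hand to the scraping step.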

curl_cffi — Python library that wraps curl-impersonate to give your scripts the TLS fingerprint of Chrome, Firefox, or Safari. Its API mirrors requests closely enough to serve as a near drop-in replacement. The single biggest anti-bot bypass tool of 2026.

D

DataDome — Aggressive anti-bot vendor that fingerprints TLS, canvas rendering, and mouse-movement patterns. Harder to bypass than Cloudflare; usually requires headless browsers + residential proxies.

DOM (Document Object Model) — The in-memory tree representation of an HTML page after the browser parses it. "DOM scrambling" is when a site redesign renames CSS classes, which is what breaks traditional selector-based scrapers.

Docker — Container platform. Useful for deploying scrapers to identical environments across machines. Not required for beginners.

E

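ETag — HTTP header servers send to identify a specific version of a resource. You can use it for conditional requests: send the stored ETag back in an If-None-Match header, and the server replies 304 Not Modified with no body if the resource hasn't changed, so you skip the re-download.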
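
The caching logic can be sketched without touching the network. Everything here is illustrative: fake_server is a hypothetical stand-in for a real server holding one resource, and the URL and ETag value are made up. The If-None-Match header and the 304 status are the standard HTTP pieces.

```python
def fake_server(request_headers: dict) -> tuple[int, dict, bytes]:
    # Stand-in for a real server that holds one resource with a fixed ETag.
    current_etag = '"v1"'
    if request_headers.get("If-None-Match") == current_etag:
        return 304, {"ETag": current_etag}, b""  # unchanged: no body sent
    return 200, {"ETag": current_etag}, b"<html>catalog</html>"


cache = {}  # url -> (etag, body)


def fetch_cached(url: str) -> tuple[int, bytes]:
    headers = {}
    if url in cache:
        # Tell the server which version we already have.
        headers["If-None-Match"] = cache[url][0]
    status, response_headers, body = fake_server(headers)
    if status == 304:
        return status, cache[url][1]  # reuse the cached body
    cache[url] = (response_headers["ETag"], body)
    return status, body


first = fetch_cached("https://example.com/catalog")
second = fetch_cached("https://example.com/catalog")
print(first[0], second[0])
```

The first call downloads the body and stores its ETag; the second sends the ETag back and gets the cached body without a re-download.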

Extractor — The component that takes raw HTML/JSON and produces structured records. Self-healing extractors use schemas + LLMs. Traditional extractors use CSS selectors or XPath.

F

FAQPage schema — A schema.org structured-data type for FAQ pages, which Google reads to build rich results (the expandable Q&A boxes in search). It can carry SEO value, though since 2023 Google has limited which sites are shown FAQ rich results.

Fingerprinting — See Browser fingerprint.

G

GraphQL — An API query language where the client specifies exactly which fields to return. Increasingly common; some sites expose GraphQL endpoints alongside their HTML pages, and finding one is the gold standard for scrapers.
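
A GraphQL call is typically a POST with a JSON body holding the query text and its variables. The sketch below only builds the request; the endpoint, the product type, and the title and price fields are all hypothetical, since every site defines its own schema.

```python
import json
import urllib.request

# Hypothetical endpoint and schema; real sites define their own field names.
ENDPOINT = "https://example.com/graphql"
QUERY = """
query Product($id: ID!) {
  product(id: $id) {
    title
    price
  }
}
"""


def build_request(product_id: str) -> urllib.request.Request:
    # The client names exactly the fields it wants; nothing else comes back.
    payload = json.dumps({"query": QUERY, "variables": {"id": product_id}})
    return urllib.request.Request(
        ENDPOINT,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("42")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted because the endpoint is made up.
```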

H

Headless browser — A real browser engine running without a visible window. Playwright, Puppeteer, and Selenium drive headless browsers. Used when JavaScript rendering is required.

HEAD request — An HTTP request that asks for just the headers, not the body. Useful for checking that a URL exists, or what its Content-Type is, without downloading the whole page.
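
With the standard library, a HEAD request is an ordinary request with the method overridden. This is a sketch under assumptions: the URL is a placeholder, the helper names are invented for illustration, and the network call itself is defined but not executed here.

```python
import urllib.request
from typing import Optional


def build_head_request(url: str) -> urllib.request.Request:
    # method="HEAD" asks the server for headers only.
    return urllib.request.Request(url, method="HEAD")


def content_type(url: str, timeout: float = 10.0) -> Optional[str]:
    # Performs the request; raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(build_head_request(url), timeout=timeout) as resp:
        return resp.headers.get("Content-Type")


req = build_head_request("https://example.com/report.pdf")  # placeholder URL
print(req.get_method())
```

Some servers handle HEAD poorly or reject it outright, so a fallback to a normal GET is worth keeping in mind.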

HLS (HTTP Live Streaming) — Apple's streaming protocol; a video is split into short media segments (often .ts files) that are listed in .m3u8 playlists. Many video platforms use HLS, and tools like yt-dlp parse these playlists to download video.
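
To make the playlist format concrete, here is a small parser for a media playlist. It is a simplified sketch, not how any particular downloader works: the playlist text and URL are made up, and it handles only #EXTINF durations and segment URIs, ignoring master playlists, encryption, and byte ranges.

```python
from urllib.parse import urljoin

PLAYLIST_URL = "https://example.com/video/index.m3u8"  # hypothetical
PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:9.6,
seg000.ts
#EXTINF:9.6,
seg001.ts
#EXTINF:4.2,
seg002.ts
#EXT-X-ENDLIST
"""


def parse_media_playlist(text, base_url):
    segments = []
    duration = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            # "#EXTINF:<seconds>,<optional title>" describes the next segment.
            duration = float(line[len("#EXTINF:"):].split(",")[0])
        elif line and not line.startswith("#"):
            # Any non-tag line is a segment URI, relative to the playlist URL.
            segments.append((urljoin(base_url, line), duration))
            duration = None
    return segments


segments = parse_media_playlist(PLAYLIST, PLAYLIST_URL)
total_seconds = sum(d for _, d in segments)
print(len(segments), "segments")
```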

HTTPX — Modern async-friendly Python HTTP client (package name httpx); an alternative to requests. Lacks curl_cffi's TLS impersonation, so it's not a complete anti-bot solution.

I

Idempotent — Describes a function or pipeline run that produces the same end result whether you run it once or many times. Critical for production scrapers: re-running after a partial failure should not duplicate data.
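
One common way to get this property is to write records keyed by a stable ID so that a re-run overwrites rather than appends. The sketch below assumes a hypothetical products table keyed by SKU and uses an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, title TEXT, price REAL)")


def save(records):
    # INSERT OR REPLACE overwrites any existing row with the same primary key,
    # so saving the same batch twice leaves the table unchanged.
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO products (sku, title, price) VALUES (?, ?, ?)",
            records,
        )


batch = [("A-1", "Widget", 9.99), ("B-2", "Gadget", 19.5)]  # made-up records
save(batch)
save(batch)  # simulated re-run after a partial failure
count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)  # 2
```

A plain INSERT into a table without a key would have left four rows after the second run.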

ISBN — International Standard Book Number, 10 or 13 digits (the final character of an ISBN-10 may be X). ISBN-pattern SKUs are how you confirm a Shopify catalog actually sells books, rather than just a few books among merch.
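
Checking the pattern means validating the check digit, not just counting characters. The sketch below implements the standard ISBN-13 and ISBN-10 checksums; the helper names and the sample SKU list are assumptions for illustration.

```python
def is_isbn13(code: str) -> bool:
    digits = code.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3 over the first 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])


def is_isbn10(code: str) -> bool:
    chars = code.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10 or not (chars.isascii() and chars[:9].isdigit()):
        return False
    if chars[9] not in "0123456789X":
        return False
    # Weights run 10 down to 1; a trailing X stands for the value 10.
    values = [int(c) for c in chars[:9]] + [10 if chars[9] == "X" else int(chars[9])]
    return sum(v * w for v, w in zip(values, range(10, 0, -1))) % 11 == 0


def looks_like_isbn(sku: str) -> bool:
    return is_isbn13(sku) or is_isbn10(sku)


# Made-up catalog: two valid ISBNs and two merch SKUs.
skus = ["9780306406157", "0306406152", "TSHIRT-XL", "MUG-01"]
book_share = sum(looks_like_isbn(s) for s in skus) / len(skus)
print(book_share)  # 0.5
```

A high share of valid ISBNs across a catalog's SKUs suggests a bookseller; a low share suggests books are incidental.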

J

JSON-LD — JSON-formatted Linked Data, embedded in HTML via