Web Scraping Glossary: Every Term Defined Plainly
Every term you'll see in scraping tutorials, defined plainly. Bookmark this page; you'll come back to it.
A
Anti-bot — Software designed to detect and block automated traffic. Examples: Cloudflare, DataDome, PerimeterX, Akamai Bot Manager. Vendors market it as fraud prevention, but it catches scrapers as a side effect.
API (Application Programming Interface) — A direct, structured way to ask a website for data, returning JSON or XML instead of HTML. Always prefer an API over scraping when one exists; scraping is what you do because there's no API.
Async (Asynchronous) — A programming style where you don't wait for one request to finish before starting the next. Critical when you need to fetch hundreds of pages in parallel. Python's asyncio and httpx/aiohttp libraries enable this.
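A minimal sketch of the pattern, using only the standard library. The `fetch` function is a hypothetical stand-in that sleeps instead of making a real HTTP call; in practice you would swap in an httpx or aiohttp request. The `max_concurrency` cap is an assumption added here so the example does not open every connection at once.

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for a real HTTP call; the sleep simulates network latency.
    await asyncio.sleep(0.05)
    return f"<html>{url}</html>"

async def fetch_all(urls: list[str], max_concurrency: int = 10) -> list[str]:
    # The semaphore caps how many fetches are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"https://example.com/page/{i}" for i in range(20)]
pages = asyncio.run(fetch_all(urls))
print(len(pages), "pages fetched")
```

With 20 URLs and a cap of 10, the run takes roughly two sleep intervals rather than twenty.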
B
BeautifulSoup (bs4) — A Python library for parsing HTML and XML. It favors an ergonomic, forgiving API: easier to use than lxml, slightly slower. Most beginner tutorials use it.
Browser fingerprint — The unique combination of attributes a browser exposes (TLS version, supported ciphers, screen resolution, fonts, WebGL renderer, plugin list, timezone). Anti-bot uses this to identify scrapers because Python's TLS stack has a different fingerprint from Chrome's.
C
CAPTCHA — A test designed to tell humans from bots. reCAPTCHA, hCaptcha, and Cloudflare Turnstile are the major variants. Can be solved programmatically via services like 2Captcha or CapSolver (~$1.50 per 1,000 solves).
Cloudflare — The most-deployed anti-bot/CDN on the public web. Its "I'm Under Attack" mode and Bot Fight Mode are the most common things you'll hit. Roughly 80% of cases are defeated by curl_cffi (TLS impersonation); the remaining ~20% need nodriver (JS challenge).
Crawler / Crawling — Programmatically discovering URLs by following links. Distinct from scraping, which extracts data from URLs you already know. Most jobs are crawling-then-scraping.
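The discovery half of crawling can be sketched with the standard library alone. The names `LinkCollector` and `discover_links` and the sample HTML are illustrative assumptions, not part of any particular framework; a real crawler would also fetch each discovered URL and track which ones it has visited.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop #fragments so duplicates collapse.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(absolute)

def discover_links(html: str, base_url: str) -> list[str]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    # dict.fromkeys de-duplicates while preserving first-seen order.
    return list(dict.fromkeys(collector.links))

sample_html = (
    '<a href="/books/1">One</a> '
    '<a href="/books/2#reviews">Two</a> '
    '<a href="/books/1">One again</a>'
)
links = discover_links(sample_html, "https://example.com/catalog")
print(links)
```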
curl_cffi — Python library that wraps curl-impersonate to give your scripts the exact TLS fingerprint of Chrome/Firefox/Safari. Drop-in replacement for requests. The single biggest anti-bot bypass tool of 2026.
D
DataDome — Aggressive anti-bot vendor that fingerprints TLS, canvas rendering, and mouse-movement patterns. Harder to bypass than Cloudflare; usually requires headless browsers + residential proxies.
DOM (Document Object Model) — The in-memory tree representation of an HTML page after the browser parses it. "DOM scrambling" is when site redesigns rename CSS classes — what kills traditional scrapers.
Docker — Container platform. Useful for deploying scrapers to identical environments across machines. Not required for beginners.
E
ETag — HTTP response header that identifies a specific version of a resource. Use it for conditional requests: send the stored value back in an If-None-Match header, and the server answers 304 Not Modified instead of re-sending the body when nothing has changed.
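A rough sketch of ETag-based caching. The `If-None-Match` request header and the 304 status are standard HTTP; `fake_server` is a hypothetical stand-in so the example runs offline, and the dict-based cache is an assumption for illustration.

```python
def conditional_headers(cached_etag):
    # If-None-Match asks the server to reply 304 when the ETag still matches.
    return {"If-None-Match": cached_etag} if cached_etag else {}

def fake_server(headers, current_etag='"v2"', body="<html>fresh</html>"):
    # Hypothetical server: returns (status, response_headers, body).
    if headers.get("If-None-Match") == current_etag:
        return 304, {"ETag": current_etag}, ""
    return 200, {"ETag": current_etag}, body

def fetch_with_cache(cache):
    status, resp_headers, body = fake_server(conditional_headers(cache.get("etag")))
    if status == 304:
        # Nothing changed: reuse the stored body, no download needed.
        return cache["body"], "cache"
    cache["etag"], cache["body"] = resp_headers["ETag"], body
    return body, "network"

cache = {}
first = fetch_with_cache(cache)   # no ETag stored yet, so the body is downloaded
second = fetch_with_cache(cache)  # ETag matches, so the cached body is reused
print(first[1], second[1])
```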
Extractor — The component that takes raw HTML/JSON and produces structured records. Self-healing extractors use schemas + LLMs. Traditional extractors use CSS selectors or XPath.
F
FAQPage schema — Google's structured-data markup for FAQ pages. Earns you the rich-snippet treatment in search results (the expandable Q&A boxes). High SEO value.
Fingerprinting — See Browser fingerprint.
G
GraphQL — An API protocol where the client specifies exactly which fields to return. Increasingly common; some sites expose GraphQL endpoints alongside their HTML pages, which is the gold standard for scrapers.
H
Headless browser — A real browser engine running without a visible window. Playwright, Puppeteer, and Selenium drive headless browsers. Used when JavaScript rendering is required.
HEAD request — An HTTP request that asks for just the headers, not the body. Useful for checking a URL exists or its Content-Type without downloading the whole page.
HLS (HTTP Live Streaming) — Apple's streaming protocol; videos are split into .m3u8 playlists and .ts chunks. Many video platforms use HLS, which is what yt-dlp parses for video downloads.
httpx — Modern async-friendly Python HTTP client; an alternative to requests. Lacks curl_cffi's TLS impersonation, so it's not a complete anti-bot solution.
I
Idempotent — A function or pipeline run that produces the same result regardless of how many times you run it. Critical for production scrapers — re-running on a partial failure should not duplicate data.
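One common way to get idempotent writes is to key each record on a stable identifier and overwrite on conflict. The sketch below uses SQLite from the standard library; the `products` table, its columns, and the sample records are assumptions for illustration.

```python
import sqlite3

def save_records(conn, records):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(sku TEXT PRIMARY KEY, title TEXT, price REAL)"
    )
    # INSERT OR REPLACE overwrites the row with the same primary key,
    # so re-running the same batch never creates duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO products (sku, title, price) "
        "VALUES (:sku, :title, :price)",
        records,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
batch = [
    {"sku": "A-1", "title": "First item", "price": 9.99},
    {"sku": "A-2", "title": "Second item", "price": 14.50},
]
save_records(conn, batch)
save_records(conn, batch)  # simulated re-run after a partial failure

row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(row_count)
```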
ISBN — International Standard Book Number, 10 or 13 digits. ISBN-pattern SKUs are how you confirm a Shopify catalog actually sells books, rather than a few books among other merchandise.
J
JSON-LD — JSON-formatted Linked Data, embedded in HTML via <script type="application/ld+json">. Many sites bake structured data into JSON-LD blocks; read those first before scraping the visible HTML.
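A minimal sketch of pulling JSON-LD blocks out of a page with the standard library. The class name `JsonLdParser` and the sample product markup are illustrative assumptions; real pages often carry several blocks, some of them malformed, which is why parse errors are skipped here.

```python
import json
from html.parser import HTMLParser

class JsonLdParser(HTMLParser):
    """Collects parsed JSON from <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than abort the whole page

sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example Book", "offers": {"price": "12.99"}}
</script>
</head><body><h1>Example Book</h1></body></html>
"""

parser = JsonLdParser()
parser.feed(sample_html)
product = parser.blocks[0]
print(product["name"], product["offers"]["price"])
```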
JS rendering — When a page's data is loaded by JavaScript after the HTML arrives. requests.get() returns the empty shell; you need a headless browser to get the data.
L
LLM extraction — Using a large language model (Claude, GPT-4, Llama) to read page text and extract structured fields per a JSON schema. The basis of self-healing extractors. Cost: ~$0.0003/page with gpt-4o-mini.
M
m3u8 — File extension for HLS playlists. Open DevTools → Network tab → filter .m3u8 to find video manifest URLs.
N
nodriver — Python library for headless Chromium with stealth patches that defeat most JS challenges. Spiritual successor to undetected-chromedriver. Used when curl_cffi isn't enough.
P
Pagination — When a list spans multiple pages (?page=1, ?page=2, etc.). Production scrapers must handle pagination, and the gotchas are: detecting the last page, deduping when items shift between pages, and resume-safety after crashes.
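The three gotchas can be sketched in one loop. Here `fake_fetch_page` is a hypothetical data source in which item "b" shifts from page 1 to page 2 between requests; stopping on the first empty page is one assumed convention for detecting the end, and some sites signal it differently (a missing "next" link, a total count).

```python
def fake_fetch_page(page: int) -> list[str]:
    # Hypothetical listing: three pages, with "b" appearing on two of them.
    pages = {1: ["a", "b"], 2: ["b", "c"], 3: ["d"]}
    return pages.get(page, [])

def scrape_all(fetch_page, start_page: int = 1, max_pages: int = 1000):
    seen = set()
    items = []
    page = start_page
    while page < start_page + max_pages:  # hard cap guards against endless loops
        batch = fetch_page(page)
        if not batch:
            break  # an empty page is treated as "past the last page"
        for item in batch:
            if item not in seen:  # dedupe items that shifted between pages
                seen.add(item)
                items.append(item)
        page += 1
    # Returning the page number lets a caller persist it and resume after a crash.
    return items, page

items, next_page = scrape_all(fake_fetch_page)
print(items, next_page)
```

Passing a saved page number back in as `start_page` is the resume path; in a real pipeline the `seen` set would also need to be persisted.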
Playwright — Modern headless-browser library, the 2026 default. Async, multi-language (Python, JS, .NET, Java), supports Chromium / Firefox / WebKit. Replaces Selenium for new projects.
Proxy — Intermediate server that forwards your requests, hiding your real IP. Two types: datacenter (cheap, easily detected) and residential (real consumer IPs, ~$3-15/month for small pools). Critical at high volume.
Proxy rotation — Cycling through a pool of proxy IPs to avoid per-IP rate limits. Most providers handle rotation automatically (sticky vs. rotating sessions).
R
Rate limiting — When a server caps how many requests you can send per time window. Polite practice: wait 0.5-2 seconds between requests and randomize the delay, e.g. time.sleep(random.uniform(0.5, 1.5)), to avoid steady-rate detection.
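A small helper for the jittered delay described above. The function name `polite_sleep` and the URLs are assumptions for illustration, and the demo loop uses much shorter delays than you would in practice so that it finishes quickly.

```python
import random
import time

def polite_sleep(min_delay: float = 0.5, max_delay: float = 1.5) -> float:
    """Sleep for a random duration in [min_delay, max_delay] and return it."""
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    return delay

# Hypothetical URLs; delays are shortened here only to keep the demo fast.
for url in ["https://example.com/a", "https://example.com/b"]:
    waited = polite_sleep(0.01, 0.02)
    print(f"waited {waited:.3f}s before fetching {url}")
```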
Residential proxy — Proxy whose exit IP is a real consumer ISP IP, not a datacenter. The gold standard for anti-bot bypass at scale. Costs more, blocked less.
REST API — Convention for HTTP APIs (GET, POST, PUT, DELETE methods + JSON bodies). Almost every modern site has one underneath the HTML — sometimes you can call it directly and skip scraping entirely.
robots.txt — File at /robots.txt at a site's root listing which paths bots are allowed or disallowed to fetch. Not every site publishes one. Not legally binding (in most jurisdictions) but always worth respecting.
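Python's standard library can evaluate these rules for you. The sketch below parses an inline sample file so it runs offline; the rules and the "my-scraper" user agent are assumptions for illustration. Against a live site you would point the parser at the real file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() answers: may this user agent request this URL?
can_fetch_product = parser.can_fetch("my-scraper", "https://example.com/products/1")
can_fetch_admin = parser.can_fetch("my-scraper", "https://example.com/admin/users")
print(can_fetch_product, can_fetch_admin)
```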
S
Scrapy — Mature Python framework for large-scale crawling. Steeper learning curve than requests + BeautifulSoup. Best for crawls of thousands+ pages with complex link-following. Maintained by Zyte.
Scraping — Extracting structured data from a website's HTML or API. The verb. The result is usually a CSV, JSON, or database.
Self-healing extractor — Scraper that survives site redesigns by using an LLM + JSON schema as the contract instead of CSS selectors. New 2026-default pattern.
Selenium — Older browser-automation library. Largely superseded by Playwright for new work. Still common in legacy codebases and CI.
Sitemap (sitemap.xml) — XML file at /sitemap.xml listing every URL the site wants indexed. The free shortcut for "give me every product/article URL on this site."
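Extracting the URL list takes a few lines with the standard library. The namespace below is the one defined by the sitemaps.org protocol; the sample document and the function name `sitemap_urls` are illustrative. Large sites often publish a sitemap index that points to further sitemap files, which this sketch does not follow.

```python
import xml.etree.ElementTree as ET

# Sitemap files declare this default namespace, so lookups must include it.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list[str]:
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc></url>
  <url><loc>https://example.com/products/2</loc></url>
</urlset>"""

urls = sitemap_urls(sample)
print(urls)
```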
SKU (Stock Keeping Unit) — Unique product identifier. ISBN-pattern SKUs are a strong signal of "this is a bookstore" in Shopify catalog scraping.
T
TLS fingerprint — The unique signature your TLS handshake produces. Python's requests has a different one from Chrome's, which is why anti-bot can tell them apart. Defeated by curl_cffi's impersonation.
tel: and mailto: links — Standard ways pages encode phone numbers and emails, as <a href="tel:..."> and <a href="mailto:...">. They beat regex-scraping for phone/email extraction because they're machine-readable.
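A short sketch of reading these links instead of regex-matching the page text. The class name `ContactLinkParser`, the phone number, and the email address are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

class ContactLinkParser(HTMLParser):
    """Collects phone numbers and emails from tel: and mailto: links."""

    def __init__(self):
        super().__init__()
        self.phones = []
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = (dict(attrs).get("href") or "").strip()
        scheme, _, rest = href.partition(":")
        if scheme.lower() == "tel":
            self.phones.append(rest)
        elif scheme.lower() == "mailto":
            # Drop any ?subject=... query and decode percent-escapes.
            self.emails.append(unquote(rest.split("?", 1)[0]))

sample_html = (
    '<a href="tel:+15550100">Call us</a> '
    '<a href="mailto:sales@example.com?subject=Hi">Email us</a> '
    '<a href="/about">About</a>'
)
contacts = ContactLinkParser()
contacts.feed(sample_html)
print(contacts.phones, contacts.emails)
```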
U
Undetected-chromedriver — Older library that patched Chromedriver to evade anti-bot. Largely deprecated; replaced by nodriver in 2025+.
User-Agent — HTTP header that identifies what client is making the request. python-requests/2.31.0 is the giveaway that you're a bot. Set it to a real browser UA always.
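Setting the header is a one-liner in most clients. The sketch below builds a request with the standard library without sending it; the URL is a placeholder and the UA string is only an example of the browser format, so copy a current one from your own browser in practice.

```python
import urllib.request

# Example browser-style UA string (an assumption; version numbers go stale).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

request = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": BROWSER_UA},
)

# urllib normalizes header names to "User-agent" internally.
print(request.get_header("User-agent"))
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```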
V
Vision LLM — A multi-modal language model that can read images. Used as a fallback when text-extraction fails (e.g., prices baked into images). Cost: ~5× a text-only LLM call.
.vtt (WebVTT) — Subtitle file format used by HTML5 video. YouTube auto-captions are typically served as VTT or JSON3.
W
Webhook — A URL that another service POSTs to when an event happens. Used in scraping pipelines to trigger downstream actions (Slack alert, email, BigQuery insert) when new data is detected.
Webshare — Affordable residential-proxy provider, ~$3-15/month for small pools. The default residential-proxy choice for solo developers.
X
XBRL (eXtensible Business Reporting Language) — XML-based standard for business and financial reporting, used for SEC financial filings. Multi-candidate field resolution is the hard part; see SEC EDGAR + XBRL: From Filings to Clean CSV.
XPath — Older alternative to CSS selectors for navigating XML/HTML trees. More expressive but uglier syntax. Most beginners can avoid it; CSS selectors cover 95% of scraping needs.
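For a feel of the syntax, the standard library's ElementTree supports a limited subset of XPath (full XPath needs a library such as lxml). The catalog document below is made-up sample data.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<catalog>
  <book category="fiction"><title>Dune</title><price>9.99</price></book>
  <book category="science"><title>Cosmos</title><price>14.50</price></book>
  <book category="fiction"><title>Neuromancer</title><price>7.25</price></book>
</catalog>
""")

# Attribute predicate: titles of every book whose category is "fiction".
fiction_titles = [t.text for t in doc.findall("./book[@category='fiction']/title")]

# Positional predicate: XPath positions are 1-based, so [2] is the second book.
second_title = doc.find("./book[2]/title").text

print(fiction_titles, second_title)
```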
Y
yt-dlp — The de-facto command-line tool for downloading videos and subtitles from YouTube, Vimeo, Wistia, Teachable, and 1,000+ other platforms. Python library + CLI.
What to read next
- Getting Started with Web Scraping in 2026 — your first working scraper in 30 minutes
- Self-Healing AI Web Extractors — the schema-driven pattern
- Web Scraping Tools Comparison 2026 — Scrapy vs Playwright vs Beautiful Soup vs ScrapingBee vs DIY
- Web Scraping Legal & Ethics: 2026 State of Play — what's legal, what's not, where the gray zones are
Hire me to build this for your site
I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.