Eyal Rosenthal · Web scraping at scale

Getting Started with Web Scraping in 2026: From Zero to First Working Scraper in 30 Minutes


If you've never written a web scraper before, this is the place to start. By the end of this page you'll have a working Python script that pulls real data off a real website, and you'll understand why the "easy tutorials" you'll find elsewhere stop working two weeks later.

I've shipped 46 production scrapers across 40+ different brief types. The patterns repeat. Once you have one working scraper, you have all of them.

What web scraping actually is

Web scraping is programming a computer to do what you'd do with a browser, copy-paste, and a spreadsheet — but at the speed and scale of a script.

You go to a website. You see a list of products with prices. You want all of them in a CSV. You don't want to copy-paste 500 rows. So you write a program that:

  1. Asks the website for the page (an HTTP request)
  2. Parses the response (HTML, JSON, or sometimes JavaScript-rendered content)
  3. Extracts the fields you care about (prices, names, URLs)
  4. Saves them somewhere structured (CSV, database, Google Sheet)

That's it. Everything else — anti-bot bypass, JavaScript rendering, headless browsers, proxy rotation — is solving the complications that arise when websites notice you're a script.

What you need installed

For your first working scraper, three things:

pip install requests beautifulsoup4 curl_cffi

  • requests — sends HTTP requests, the cleanest API in Python.
  • beautifulsoup4 (a.k.a. bs4) — parses HTML so you can ask "give me every tag with this name."
  • curl_cffi — drop-in replacement for requests that mimics a real browser at the TLS level. You won't need it for your first scraper, but you will need it for your second.

That's the entire toolkit. Anything else (Scrapy, Selenium, Playwright) is for specific cases we'll cover later.

Your first working scraper

Pick a site that wants to be scraped: books.toscrape.com. It's a fake bookstore built specifically for practice. No anti-bot, no surprises.

import requests
from bs4 import BeautifulSoup
import csv

URL = "https://books.toscrape.com/catalogue/page-1.html"

# 1. Ask for the page
r = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
r.raise_for_status()  # blow up if the request failed

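# 2. Parse the HTML (pass raw bytes so BeautifulSoup reads the charset from the
#    page itself; r.text can mis-guess the encoding and garble the £ sign)
soup = BeautifulSoup(r.content, "html.parser")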

# 3. Find every book on the page
books = []
for article in soup.select("article.product_pod"):
    title = article.select_one("h3 a")["title"]
    price = article.select_one("p.price_color").get_text(strip=True)
    in_stock = "In stock" in article.select_one("p.instock").get_text()
    books.append({"title": title, "price": price, "in_stock": in_stock})

# 4. Save to CSV (explicit UTF-8 so the £ sign and accented titles survive)
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["title", "price", "in_stock"])
    w.writeheader()
    w.writerows(books)

print(f"Saved {len(books)} books to books.csv")

Save that as scrape_books.py and run python3 scrape_books.py. You should see Saved 20 books to books.csv and find a books.csv file with 20 rows in your folder.

Congratulations. That's a working web scraper.

The three things that will trip you up next

The scraper above works because books.toscrape.com is a friendly target. The real world is less friendly. Here's what breaks first.

Trip #1: the site gives you HTML but the data is loaded by JavaScript

Modern sites often render data after the page loads. Your requests.get() returns the empty shell; the data shows up only when JavaScript runs in a browser.

How to spot it: Right-click the page → "View page source" (NOT inspect element — that shows you the rendered DOM). If the data you want isn't in the source HTML, JavaScript is loading it.
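The same check can be scripted. This is a minimal sketch, assuming you already have the raw response body as a string (for example r.text from requests) and one value you can see in the browser. The function name and both sample responses are made up for illustration.

```python
import html


def data_in_source(raw_html: str, sample_value: str) -> bool:
    """Return True if a value you can see in the browser is present in the raw HTML."""
    # Decode entities such as &amp; so "Tom & Jerry" matches "Tom &amp; Jerry".
    decoded = html.unescape(raw_html)
    return sample_value.lower() in decoded.lower()


# Two made-up responses: one server-rendered, one an empty JavaScript shell.
server_rendered = '<h3><a title="A Light in the Attic">A Light in the ...</a></h3>'
js_shell = '<div id="root"></div><script src="/static/app.js"></script>'

print(data_in_source(server_rendered, "A Light in the Attic"))  # True
print(data_in_source(js_shell, "A Light in the Attic"))         # False
```

If the function returns False for a value that is clearly visible in the browser, the page is probably filling in its data with JavaScript.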

The fix: switch from requests to a headless browser. Playwright is the modern answer:

pip install playwright
playwright install chromium

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://js-rendered-site.com/")
    html = page.content()  # this is the rendered HTML — JS has run
    browser.close()

Same BeautifulSoup parsing afterward, just a different way to get the HTML.

Trip #2: the site detects your script and blocks you

You'll see HTTP 403 errors, or pages that come back with "Are you a robot?" challenges. The site's anti-bot is doing its job.

How to spot it: your script worked five times, then started returning short HTML or 403 errors.
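A rough way to automate that check is to look at the status code, the body length, and a few challenge phrases. This is a sketch only: the status codes, marker phrases, and the 2000-character threshold are assumptions you would tune per site, not values from any anti-bot vendor.

```python
# Assumed signals of a block; adjust per target site.
BLOCK_STATUSES = {403, 429, 503}
CHALLENGE_MARKERS = ("are you a robot", "captcha", "verify you are human")


def looks_blocked(status_code: int, body: str, min_length: int = 2000) -> bool:
    """Heuristic: True if a response looks like a block page rather than real content."""
    if status_code in BLOCK_STATUSES:
        return True
    lowered = body.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True
    # A suspiciously short page is often a challenge or an empty shell.
    return len(body) < min_length


print(looks_blocked(200, "x" * 5000))  # False
print(looks_blocked(403, "x" * 5000))  # True
print(looks_blocked(200, "<html>Are you a robot?</html>" + "x" * 5000))  # True
```

Calling this on every response and stopping the run when it returns True keeps a blocked scraper from writing block pages into your output.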

The fix (level 1): switch from requests to curl_cffi. It exposes a requests-style API but impersonates Chrome's TLS fingerprint, so checks that fingerprint the TLS handshake see what looks like a real browser.

from curl_cffi import requests as cf

r = cf.get(URL, impersonate="chrome131")  # one-line change, often enough

The fix (level 2): add residential proxies. A provider such as Webshare ($3-15/month) gives you a pool of residential-IP exit nodes, so your traffic comes from the same kind of addresses as ordinary visitors and is much harder to block by IP.

For the full anti-bot playbook see Bypassing Cloudflare, DataDome, and PerimeterX in 2026.

Trip #3: the site changes its HTML and your scraper breaks silently

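You wrote soup.select_one("div.product-price"). A month later the site refactors and that class becomes div.PriceTag__amount-2eHfC. select_one now returns None. If you call .text on it directly you get an AttributeError, which at least tells you something broke. If you guarded against None, or used select (which just returns an empty list), nothing crashes: your CSV has a column of empty strings and you don't notice for a week.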

How to spot it: your output suddenly has missing fields or far fewer rows than usual, or the source site has recently been redesigned.
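You can catch this on the run where it happens by checking the output before saving it. This is a minimal sketch, assuming rows shaped like the dicts from the first scraper. The function name and the 10% threshold are illustrative choices, not a standard.

```python
def check_output(rows, required_fields, max_missing=0.1):
    """Return a list of problems found in scraped rows (empty list means OK)."""
    if not rows:
        return ["no rows extracted"]
    problems = []
    for field in required_fields:
        # Count None and empty strings as missing; False and 0 are real values.
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        rate = missing / len(rows)
        if rate > max_missing:
            problems.append(f"{field}: {rate:.0%} missing")
    return problems


good = [{"title": "A", "price": "£51.77", "in_stock": False}]
broken = [{"title": "A", "price": "", "in_stock": True}]

print(check_output(good, ["title", "price", "in_stock"]))    # []
print(check_output(broken, ["title", "price", "in_stock"]))  # ['price: 100% missing']
```

Failing the run (or sending yourself an alert) when the list is non-empty turns a silent breakage into a loud one.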

The fix: stop using CSS selectors as the contract. Use a JSON schema and an LLM. The schema is what "product price" means; the LLM finds it on the page regardless of markup. This is the self-healing extractor pattern, and it's the single biggest 2026 shift in scraping.

# instead of brittle selectors
price = soup.select_one("div.product-price").text  # breaks on redesign

# use schema-driven extraction
# (llm.extract is a placeholder for your own LLM wrapper, not a real library call)
schema = {"title": "string", "price": "number", "in_stock": "boolean"}
result = llm.extract(page_text, schema)  # survives redesigns

Cost: roughly $0.0003/page with gpt-4o-mini, depending on how much page text you send. Negligible. Full implementation in Self-Healing AI Web Extractors: A Complete Implementation Guide.
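Whatever produces the result, it is worth checking it against the schema before trusting it. The sketch below assumes the simple {"field": "type name"} schema format used above and supports only the three type names shown there. It is a plain type check, not a JSON Schema validator, and the function name is made up.

```python
# Maps the schema's type names to Python types (only the three used above).
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}


def validate(result: dict, schema: dict) -> list:
    """Return a list of mismatches between an extracted result and the schema."""
    errors = []
    for field, type_name in schema.items():
        if field not in result:
            errors.append(f"{field}: missing")
            continue
        value = result[field]
        # bool is a subclass of int in Python, so reject True/False as a number.
        if type_name == "number" and isinstance(value, bool):
            ok = False
        else:
            ok = isinstance(value, TYPE_MAP[type_name])
        if not ok:
            errors.append(f"{field}: expected {type_name}, got {type(value).__name__}")
    return errors


schema = {"title": "string", "price": "number", "in_stock": "boolean"}

print(validate({"title": "A", "price": 51.77, "in_stock": True}, schema))  # []
print(validate({"title": "A", "price": "£51.77"}, schema))
# ['price: expected number, got str', 'in_stock: missing']
```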

What to learn next, in order

  1. HTTP basics — what a request, response, header, and status code are. 5 minutes of reading saves 50 hours of debugging.
  2. CSS selectors — div.foo > a.bar notation. The "Inspect Element" right-click menu is where you'll spend most of your beginner time.
  3. Pagination — most lists have multiple pages. Loop. Save progress. Resume on crash.
  4. Rate limiting — time.sleep() between requests. Don't hammer servers. Be polite.
  5. Storage — start with CSV, graduate to SQLite (sqlite3 is in the Python standard library), then maybe Postgres.
  6. JavaScript-rendered sites — Playwright is the answer.
  7. Anti-bot — curl_cffi first, residential proxies second, headless browser third.
  8. Self-healing extraction — when sites redesign and you can't keep chasing CSS selectors.

The order matters. Don't skip ahead to anti-bot and self-healing until you've shipped 3-5 simple scrapers. Most "real" scraping problems are pagination + storage + rate-limiting, not exotic anti-bot.
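Pagination, rate limiting, and storage fit together in one small loop. The sketch below is one possible shape, not a prescribed design. fetch_page is a hypothetical callable you supply (it would wrap the requests + BeautifulSoup code from the first scraper and return a list of row dicts, or an empty list past the last page). The table names and the one-second default delay are also assumptions.

```python
import sqlite3
import time


def crawl(fetch_page, db_path, first_page=1, last_page=50, delay_seconds=1.0):
    """Fetch pages in order, store rows in SQLite, and resume after a crash."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books (title TEXT, price TEXT, in_stock INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress "
        "(id INTEGER PRIMARY KEY CHECK (id = 1), last_done INTEGER)"
    )
    # Resume from the page after the last one that was fully saved.
    saved = conn.execute("SELECT last_done FROM progress WHERE id = 1").fetchone()
    page = saved[0] + 1 if saved else first_page
    pages_saved = 0
    try:
        while page <= last_page:
            rows = fetch_page(page)
            if not rows:  # an empty page means we ran past the end
                break
            # One transaction per page: rows and progress commit together,
            # so a crash never leaves a half-saved page marked as done.
            with conn:
                conn.executemany(
                    "INSERT INTO books VALUES (:title, :price, :in_stock)", rows
                )
                conn.execute(
                    "INSERT OR REPLACE INTO progress (id, last_done) VALUES (1, ?)",
                    (page,),
                )
            pages_saved += 1
            page += 1
            time.sleep(delay_seconds)  # be polite between requests
    finally:
        conn.close()
    return pages_saved
```

Because progress lives in the same database as the data, rerunning the script after a crash picks up at the first unsaved page instead of starting over.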

When you're ready for production

A script runs once and produces a CSV. A production pipeline runs forever, alerts on changes, and survives crashes. The transition from one to the other is mostly operational discipline, not code. See Why $5/mo VPS Beats $1,200/mo ScrapingBee for the full discipline.

Frequently-asked questions

Is web scraping legal? Generally yes in the US for publicly accessible data, with caveats. In hiQ Labs v. LinkedIn (2022), the Ninth Circuit held that scraping publicly accessible web data likely does not violate the Computer Fraud and Abuse Act. That ruling does not cover other claims such as breach of contract or copyright, and state and EU rules vary, so respect robots.txt and Terms of Service for safety. See our legal-and-ethics guide for the nuanced version.
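Respecting robots.txt can be done in code with Python's standard library. This is a minimal sketch: the robots.txt content, the "my-scraper" user agent, and the example.com URLs are made up, and in a real scraper you would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt for illustration.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed(ROBOTS, "my-scraper", "https://example.com/catalogue/page-1.html"))  # True
print(allowed(ROBOTS, "my-scraper", "https://example.com/private/data.html"))      # False
```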

Do I need to learn Scrapy? No, not as a beginner. Scrapy is a powerful framework but it has a steep learning curve. Start with requests + BeautifulSoup. Move to Scrapy only if you're crawling thousands of pages with complex link-following.

What's the difference between scraping and crawling? Scraping extracts data from a page you already have the URL for. Crawling discovers URLs by following links. Most "scraping" jobs are actually crawling-then-scraping.
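The crawling half is just link discovery. The sketch below uses the standard library's html.parser instead of BeautifulSoup to stay dependency-free. The class and function names are made up, and the sample HTML is a simplified stand-in for a catalogue page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def discover_links(page_html: str, base_url: str) -> list:
    collector = LinkCollector(base_url)
    collector.feed(page_html)
    return collector.links


sample = '<a href="page-2.html">next</a> <a href="/catalogue/some-book_1/index.html">book</a>'
for link in discover_links(sample, "https://books.toscrape.com/catalogue/page-1.html"):
    print(link)
# https://books.toscrape.com/catalogue/page-2.html
# https://books.toscrape.com/catalogue/some-book_1/index.html
```

A crawler feeds each discovered URL back into a queue (skipping ones it has already seen), and the scraper runs on every page the queue yields.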

How do I scrape sites behind a login? Use a requests.Session() object, which persists cookies across requests once you have logged in. Or use a headless browser (Playwright) and authenticate via the login form once.

What's the cheapest way to run a scraper 24/7? A $5/month VPS at Hetzner or DigitalOcean, plus cron for scheduling. We have a whole tutorial on this.

Should I use Selenium or Playwright? Playwright. Selenium is older, slower, and has worse async support. Playwright is the modern default.

How do I avoid getting IP-banned? Rate-limit your requests, set a real User-Agent, use curl_cffi instead of plain requests, rotate residential IPs at high volume.

Where do I put my scrapers in production? Cron job on a $5 VPS for most pipelines. GitHub Actions is great for low-frequency runs (daily/weekly). Cloud Run or AWS Lambda for higher-frequency event-driven work.

If you have a target site you're trying to scrape and you're stuck, send the URL to info@luba.media. I'll tell you which approach to use, free.

Hire me to build this for your site

I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.
