Getting Started with Web Scraping in 2026: From Zero to First Working Scraper in 30 Minutes
If you've never written a web scraper before, this is the place to start. By the end of this page you'll have a working Python script that pulls real data off a real website, and you'll understand why the "easy tutorials" you'll find elsewhere stop working two weeks later.
I've shipped 46 production scrapers across 40+ different brief types. The patterns repeat. Once you have one working scraper, you have all of them.
What web scraping actually is
Web scraping is programming a computer to do what you'd do with a browser, copy-paste, and a spreadsheet — but at the speed and scale of a script.
You go to a website. You see a list of products with prices. You want all of them in a CSV. You don't want to copy-paste 500 rows. So you write a program that:
- Asks the website for the page (an HTTP request)
- Parses the response (HTML, JSON, or sometimes JavaScript-rendered content)
- Extracts the fields you care about (prices, names, URLs)
- Saves them somewhere structured (CSV, database, Google Sheet)
That's it. Everything else — anti-bot bypass, JavaScript rendering, headless browsers, proxy rotation — is solving the complications that arise when websites notice you're a script.
What you need installed
For your first working scraper, three things:
pip install requests beautifulsoup4 curl_cffi
- requests: sends HTTP requests, the cleanest API in Python.
- beautifulsoup4 (a.k.a. bs4): parses HTML so you can ask "give me every tag."
- curl_cffi: drop-in replacement for requests that mimics a real browser at the TLS level. You won't need it for your first scraper, but you will need it for your second.
That's the entire toolkit. Anything else (Scrapy, Selenium, Playwright) is for specific cases we'll cover later.
Your first working scraper
Pick a site that wants to be scraped: books.toscrape.com. It's a fake bookstore built specifically for practice. No anti-bot, no surprises.
import requests
from bs4 import BeautifulSoup
import csv

URL = "https://books.toscrape.com/catalogue/page-1.html"

# 1. Ask for the page (the timeout stops the script hanging on a dead connection)
r = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
r.raise_for_status()  # blow up if the request failed

# 2. Parse the HTML. Passing the raw bytes lets BeautifulSoup read the charset
#    the page declares; r.text can mis-decode the £ sign on this site.
soup = BeautifulSoup(r.content, "html.parser")

# 3. Find every book on the page
books = []
for article in soup.select("article.product_pod"):
    title = article.select_one("h3 a")["title"]
    price = article.select_one("p.price_color").get_text(strip=True)
    in_stock = "In stock" in article.select_one("p.instock").get_text()
    books.append({"title": title, "price": price, "in_stock": in_stock})

# 4. Save to CSV (explicit UTF-8 so the £ sign is written the same on every OS)
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["title", "price", "in_stock"])
    w.writeheader()
    for b in books:
        w.writerow(b)

print(f"Saved {len(books)} books to books.csv")
Save that as scrape_books.py and run python3 scrape_books.py. You should see Saved 20 books to books.csv and find a books.csv file with 20 rows in your folder.
Congratulations. That's a working web scraper.
The three things that will trip you up next
The scraper above works because books.toscrape.com is a friendly target. The real world is less friendly. Here's what breaks first.
Trip #1: the site gives you HTML but the data is loaded by JavaScript
Modern sites often render data after the page loads. Your requests.get() returns the empty shell; the data shows up only when JavaScript runs in a browser.
How to spot it: Right-click the page → "View page source" (NOT inspect element — that shows you the rendered DOM). If the data you want isn't in the source HTML, JavaScript is loading it.
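The view-source check can also be scripted. The following is a minimal sketch, assuming you know one piece of text that is visible on the rendered page. The helper name `data_in_raw_html` and its `needle` argument are illustrative, not part of any library, and the two sample pages are made up.

```python
import html
import re


def data_in_raw_html(raw_html: str, needle: str) -> bool:
    """Return True if `needle` appears in the server-sent HTML.

    `raw_html` is what "View page source" (or requests.get(...).text) gives
    you, before any JavaScript has run. `needle` is a value you can see in
    the browser, such as a product name or a price.
    """
    def normalise(text: str) -> str:
        # Unescape entities (&pound; -> £) and collapse whitespace so that
        # formatting differences do not cause a false "missing".
        return re.sub(r"\s+", " ", html.unescape(text)).strip().lower()

    return normalise(needle) in normalise(raw_html)


# Made-up sample pages: one server-rendered, one an empty JavaScript shell.
static_page = "<ul><li class='name'>Blue   Mug</li><li>&pound;4.50</li></ul>"
js_shell = "<div id='root'></div><script src='/app.js'></script>"

print(data_in_raw_html(static_page, "Blue Mug"))  # True: plain requests will do
print(data_in_raw_html(js_shell, "Blue Mug"))     # False: the data arrives via JavaScript
```

A True result does not always mean the value sits in visible markup. It may be embedded in a JSON blob inside a script tag, which is still reachable without a browser.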
The fix: switch from requests to a headless browser. Playwright is the modern answer:
pip install playwright
playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://js-rendered-site.com/")
    html = page.content()  # this is the rendered HTML: JS has run
    browser.close()
Same BeautifulSoup parsing afterward, just a different way to get the HTML.
Trip #2: the site detects your script and blocks you
You'll see HTTP 403 errors, or pages that come back with "Are you a robot?" challenges. The site's anti-bot is doing its job.
How to spot it: your script worked five times, then started returning short HTML or 403 errors.
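Those symptoms can be turned into a check that runs on every response. This is a rough sketch under stated assumptions. The 403 status and the "Are you a robot?" text come from the description above. The 429 and 503 statuses, the other marker strings, and the 2000-character threshold are guesses to tune per site. The name `looks_blocked` is hypothetical.

```python
# Status codes treated as "probably blocked". 403 is the one named in the
# text; 429 and 503 are assumptions that often accompany throttling.
BLOCK_STATUS_CODES = {403, 429, 503}

# Phrases that suggest a challenge page rather than real content (assumed list).
CHALLENGE_MARKERS = ("are you a robot", "captcha", "verify you are human")

# Pages shorter than this are treated as suspicious. Assumed value: measure a
# known-good page from your target site and pick a fraction of its length.
MIN_EXPECTED_CHARS = 2000


def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic: does this response look like an anti-bot block?"""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True
    return len(body) < MIN_EXPECTED_CHARS
```

Calling `looks_blocked(r.status_code, r.text)` after each request and stopping on the first True keeps a blocked run from quietly writing junk rows.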
The fix (level 1): switch from requests to curl_cffi. The API is nearly the same, but it impersonates Chrome's TLS fingerprint, so checks that flag the default Python TLS handshake no longer single you out.
from curl_cffi import requests as cf
r = cf.get(URL, impersonate="chrome131") # one-line change, often enough
The fix (level 2): add residential proxies. Webshare ($3-15/month) gives you a pool of real-residential-IP exit nodes, so your requests arrive from addresses that look like ordinary home connections rather than a datacenter.
For the full anti-bot playbook see Bypassing Cloudflare, DataDome, and PerimeterX in 2026.
Trip #3: the site changes its HTML and your scraper breaks silently
You wrote soup.select_one("div.product-price"). A month later the site refactors and that class becomes div.PriceTag__amount-2eHfC. The selector now returns None. If your code guards against None or wraps extraction in a try/except, nothing crashes: your CSV just fills with a column of empty strings, and you don't notice for a week.
How to spot it: your output suddenly has missing or empty fields, often right after the source site ships a redesign.
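Whatever extraction method you use, a cheap safeguard is to make missing fields fail loudly before anything is written. This is a minimal sketch. The function names and the 10% threshold are assumptions for illustration, not an established convention.

```python
def missing_rate(rows: list[dict], required: list[str]) -> dict[str, float]:
    """Fraction of rows in which each required field is missing or empty."""
    if not rows:
        # No rows at all is the worst case: treat every field as fully missing.
        return {field: 1.0 for field in required}
    rates = {}
    for field in required:
        # None and "" count as missing; False and 0 are real values.
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        rates[field] = missing / len(rows)
    return rates


def check_rows(rows: list[dict], required: list[str], max_missing: float = 0.1) -> None:
    """Raise ValueError if any required field is missing too often."""
    rates = missing_rate(rows, required)
    bad = {field: rate for field, rate in rates.items() if rate > max_missing}
    if bad:
        raise ValueError(f"possible site redesign, missing-field rates: {bad}")
```

Calling `check_rows(books, ["title", "price"])` just before the CSV is written turns a week of silent empty columns into an immediate error.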
The fix: stop using CSS selectors as the contract. Use a JSON schema and an LLM. The schema is what "product price" means; the LLM finds it on the page regardless of markup. This is the self-healing extractor pattern, and it's the single biggest 2026 shift in scraping.
# instead of brittle selectors
price = soup.select_one("div.product-price").text # breaks on redesign
# use schema-driven extraction
schema = {"title": "string", "price": "number", "in_stock": "boolean"}
result = llm.extract(page_text, schema) # survives redesigns
Cost: ~$0.0003/page with gpt-4o-mini. Negligible. Full implementation in Self-Healing AI Web Extractors: A Complete Implementation Guide.
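The `llm.extract(...)` call above is shorthand rather than a real library function. One way to sketch the idea without committing to a vendor is to pass the model call in as a plain function and validate whatever comes back against the schema. Everything named here (`call_llm`, `build_prompt`, `TYPE_CHECKS`, the prompt wording, and the stand-in `fake_llm`) is hypothetical. In real use `call_llm` would wrap your provider's client.

```python
import json
from typing import Callable

SCHEMA = {"title": "string", "price": "number", "in_stock": "boolean"}

# Map the schema's type names onto Python checks used for validation.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def build_prompt(page_text: str, schema: dict) -> str:
    """Assumed prompt wording; adjust for the model you use."""
    return (
        "Extract the following fields from the page text and reply with "
        "a single JSON object and nothing else.\n"
        f"Fields and types: {json.dumps(schema)}\n"
        f"Page text:\n{page_text}"
    )


def extract(page_text: str, schema: dict, call_llm: Callable[[str], str]) -> dict:
    """call_llm is a hypothetical function: prompt in, model's text reply out."""
    reply = call_llm(build_prompt(page_text, schema))
    data = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    for field, type_name in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not TYPE_CHECKS[type_name](data[field]):
            raise ValueError(f"{field} is not a {type_name}: {data[field]!r}")
    return {field: data[field] for field in schema}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return '{"title": "Blue Mug", "price": 4.5, "in_stock": true}'


print(extract("Blue Mug ... £4.50 ... In stock", SCHEMA, fake_llm))
```

The validation step is the part that matters: a model reply that drops a field or returns the price as a string raises an error instead of flowing into your data.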
What to learn next, in order
- HTTP basics: what a request, response, header, and status code are. 5 minutes of reading saves 50 hours of debugging.
- CSS selectors: div.foo > a.bar notation. The "Inspect Element" right-click menu is where you'll spend most of your beginner time.
- Pagination: most lists have multiple pages. Loop. Save progress. Resume on crash.
- Rate limiting: time.sleep() between requests. Don't hammer servers. Be polite.
- Storage: start with CSV, graduate to SQLite (sqlite3 is in the Python standard library), then maybe Postgres.
- JavaScript-rendered sites: Playwright is the answer.
- Anti-bot: curl_cffi first, residential proxies second, headless browser third.
- Self-healing extraction: when sites redesign and you can't keep chasing CSS selectors.
The order matters. Don't skip ahead to anti-bot and self-healing until you've shipped 3-5 simple scrapers. Most "real" scraping problems are pagination + storage + rate-limiting, not exotic anti-bot.
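Pagination, rate limiting, and storage fit together in one small loop. The sketch below assumes a `scrape_page(url)` function of your own that returns a list of dicts with `title` and `price` keys, like the first scraper above. The table names, the one-second delay, and the `page-{n}.html` URL pattern are illustrative assumptions.

```python
import sqlite3
import time
from typing import Callable, Iterable


def crawl_pages(
    db: sqlite3.Connection,
    page_urls: Iterable[str],
    scrape_page: Callable[[str], list[dict]],
    delay_seconds: float = 1.0,
) -> int:
    """Scrape each URL once, storing rows and remembering finished pages."""
    db.execute("CREATE TABLE IF NOT EXISTS done_pages (url TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, price TEXT, page_url TEXT)")
    scraped = 0
    for url in page_urls:
        already_done = db.execute(
            "SELECT 1 FROM done_pages WHERE url = ?", (url,)
        ).fetchone()
        if already_done:
            continue  # resume: skip pages finished in an earlier run
        rows = scrape_page(url)
        with db:  # one transaction per page: rows and progress marker commit together
            db.executemany(
                "INSERT INTO books (title, price, page_url) VALUES (?, ?, ?)",
                [(row["title"], row["price"], url) for row in rows],
            )
            db.execute("INSERT INTO done_pages (url) VALUES (?)", (url,))
        scraped += 1
        time.sleep(delay_seconds)  # be polite between requests
    return scraped


def fake_scrape_page(url: str) -> list[dict]:
    """Stand-in for a real page scraper so the sketch runs offline."""
    return [{"title": f"Book from {url}", "price": "£1.00"}]


db = sqlite3.connect(":memory:")  # use a file path instead to survive restarts
urls = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 4)]
print(crawl_pages(db, urls, fake_scrape_page, delay_seconds=0))  # 3
print(crawl_pages(db, urls, fake_scrape_page, delay_seconds=0))  # 0: every page is already done
```

Because the rows and the progress marker are written in the same transaction, a crash mid-page leaves that page unmarked and it is simply retried on the next run.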
When you're ready for production
A script runs once and produces a CSV. A production pipeline runs forever, alerts on changes, and survives crashes. The transition from one to the other is mostly operational discipline, not code. See Why $5/mo VPS Beats $1,200/mo ScrapingBee for the full discipline.
Frequently-asked questions
Is web scraping legal? In the US, generally yes for publicly accessible data, with caveats. In hiQ Labs v. LinkedIn (2022), the Ninth Circuit held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act. That ruling addresses the CFAA only; Terms of Service, copyright, and privacy claims are separate questions, and state and EU rules vary. Respect robots.txt and Terms of Service for safety. See our legal-and-ethics guide for the nuanced version.
Do I need to learn Scrapy? No, not as a beginner. Scrapy is a powerful framework but it has a steep learning curve. Start with requests + BeautifulSoup. Move to Scrapy only if you're crawling thousands of pages with complex link-following.
What's the difference between scraping and crawling? Scraping extracts data from a page you already have the URL for. Crawling discovers URLs by following links. Most "scraping" jobs are actually crawling-then-scraping.
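To make the distinction concrete, here is a sketch of the crawling half using only the standard library: it discovers same-site URLs on a page and leaves the scraping to whatever parser you already have. The class and function names are illustrative, and the sample HTML is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_links(page_url: str, html_text: str) -> list[str]:
    """Crawling step: return absolute same-site URLs linked from the page."""
    collector = LinkCollector()
    collector.feed(html_text)
    site = urlparse(page_url).netloc
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc == site and absolute not in found:
            found.append(absolute)
    return found


sample = '<a href="page-2.html">next</a> <a href="https://example.org/">elsewhere</a>'
print(discover_links("https://books.toscrape.com/catalogue/page-1.html", sample))
# ['https://books.toscrape.com/catalogue/page-2.html']
```

A crawl-then-scrape job feeds each discovered URL into a queue, scrapes the page, and repeats until the queue is empty.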
How do I scrape sites behind a login? Use the requests Session() object to persist cookies. Or use a headless browser (Playwright) and authenticate via the login form once.
What's the cheapest way to run a scraper 24/7? A $5/month VPS at Hetzner or DigitalOcean, plus cron for scheduling. We have a whole tutorial on this.
Should I use Selenium or Playwright? Playwright. Selenium is older, slower, and has worse async support. Playwright is the modern default.
How do I avoid getting IP-banned? Rate-limit your requests, set a real User-Agent, use curl_cffi instead of plain requests, rotate residential IPs at high volume.
Where do I put my scrapers in production? Cron job on a $5 VPS for most pipelines. GitHub Actions is great for low-frequency runs (daily/weekly). Cloud Run or AWS Lambda for higher-frequency event-driven work.
What to read next
- Self-Healing AI Web Extractors — the schema-driven scraping pattern that survives site redesigns
- Bypassing Cloudflare, DataDome, and PerimeterX — the modern anti-bot playbook
- 100 Production Web Scrapers, One Repo — the six patterns that cover almost every brief
- Web Scraping Glossary — every term you'll see in tutorials, defined plainly
If you have a target site you're trying to scrape and you're stuck, send the URL to info@luba.media. I'll tell you which approach to use, free.
Hire me to build this for your site
I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.