Beginner 4 min read · Updated 2026-05-05

How to Scrape Wikipedia (The Easy Target Everyone Overcomplicates)

How to Scrape Wikipedia in 2026

Wikipedia is the easiest legitimate scraping target on the public internet. The data is licensed for reuse (CC-BY-SA), the official APIs are excellent and free, and there's no anti-bot system to work around.

The mistake most tutorials make is showing you BeautifulSoup on rendered Wikipedia pages. Don't do that. Wikipedia exposes four better paths.

The four paths, ranked

1. Wikipedia REST API: fastest for "give me the content of [article]"
2. Wikidata SPARQL endpoint: for structured queries ("every female Nobel laureate after 1990")
3. MediaWiki Action API: for advanced needs (categories, revision history, edit metadata)
4. HTML scraping: only when you need the rendered visual layout

For 90% of use cases, path 1 (REST API) is all you need. Don't reach for BeautifulSoup until you've checked the API.

Path 1: REST API (the easy answer)

curl "https://en.wikipedia.org/api/rest_v1/page/summary/Web_scraping"

Returns JSON:

{
  "title": "Web scraping",
  "extract": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites...",
  "thumbnail": {"source": "https://...", "width": 320, "height": 200},
  "wikibase_item": "Q665452",
  "lang": "en"
}
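
A minimal sketch of pulling the commonly used fields out of that payload. The helper name `pick_fields` is hypothetical, the keys are the ones shown in the sample response above, and `thumbnail` is treated as optional on the assumption that not every article has a lead image.

```python
def pick_fields(summary: dict) -> dict:
    """Flatten a /page/summary/ payload into the fields most scripts need."""
    thumb = summary.get("thumbnail") or {}  # assumed optional
    return {
        "title": summary.get("title"),
        "extract": summary.get("extract", ""),
        "wikidata_id": summary.get("wikibase_item"),
        "thumbnail_url": thumb.get("source"),
    }

# The sample response from above, trimmed the same way.
sample = {
    "title": "Web scraping",
    "extract": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites...",
    "thumbnail": {"source": "https://...", "width": 320, "height": 200},
    "wikibase_item": "Q665452",
    "lang": "en",
}
fields = pick_fields(sample)
print(fields["title"], fields["wikidata_id"])
```

The `wikidata_id` value is the bridge to Path 2: it is the entity you would query on Wikidata.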

Three useful endpoints:

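import requests
from urllib.parse import quote

# Send the same descriptive User-Agent (with contact details) on every request.
HEADERS = {"User-Agent": "research-script/1.0 (you@example.com)"}
REST = "https://en.wikipedia.org/api/rest_v1/page"

def _get(kind: str, title: str) -> requests.Response:
    # Titles use underscores for spaces; percent-encode everything else, including "/".
    path = quote(title.replace(" ", "_"), safe="")
    r = requests.get(f"{REST}/{kind}/{path}", headers=HEADERS, timeout=30)
    r.raise_for_status()  # fail loudly on missing pages and throttling
    return r

def get_summary(title: str) -> dict:
    return _get("summary", title).json()

def get_html(title: str) -> str:
    return _get("html", title).text  # clean rendered HTML

def get_pdf(title: str) -> bytes:
    return _get("pdf", title).content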

Rate limit: ~200 requests/second, no auth needed. Set a real User-Agent per Wikipedia's etiquette.
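
One way to stay under that limit is to space requests out on the client side. The sketch below is an illustration, not part of any Wikimedia library: the `Throttle` class and its parameters are hypothetical names, it is not thread-safe, and the clock and sleep functions are injectable only so the logic can be checked without waiting.

```python
import time

class Throttle:
    """Space calls at least 1/per_second seconds apart."""

    def __init__(self, per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / per_second
        self.clock = clock
        self.sleep = sleep
        self.next_slot = clock()  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next slot is free; return the seconds slept."""
        now = self.clock()
        delay = max(0.0, self.next_slot - now)
        if delay:
            self.sleep(delay)
        # Reserve the following slot, whether or not we had to sleep.
        self.next_slot = max(now, self.next_slot) + self.interval
        return delay

# Usage: call throttle.wait() immediately before each requests.get(...).
throttle = Throttle(per_second=100)
```

Running at half the stated limit, as in the usage line, leaves headroom for retries.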

Path 2: Wikidata SPARQL (structured queries)

Wikidata is Wikipedia's structured-data sibling. Almost every Wikipedia article has a Wikidata entity (an ID like Q12345) with typed properties.

import requests

QUERY = """
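SELECT DISTINCT ?person ?personLabel ?birthDate WHERE {
  # Laureates are usually tagged with a specific prize (Physics, Peace, ...),
  # so also follow instance-of / subclass-of links up to Nobel Prize (Q7191).
  ?person wdt:P166/(wdt:P31|wdt:P279)* wd:Q7191.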
  ?person wdt:P21 wd:Q6581072.     # gender = female
  ?person wdt:P569 ?birthDate.
  FILTER(YEAR(?birthDate) >= 1950)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": QUERY, "format": "json"},
                 headers={"User-Agent": "research-script/1.0 (you@example.com)"},
                 timeout=60)
r.raise_for_status()  # surface query timeouts and throttling before parsing JSON
for binding in r.json()["results"]["bindings"]:
    print(binding["personLabel"]["value"], binding["birthDate"]["value"][:10])

This pulls female Nobel laureates born in 1950 or later (up to the LIMIT of 100) in one HTTP call. Try doing that with HTML scraping: you'd be at it for a week.

Wikidata SPARQL is the underrated weapon for any "give me a list of all [X] that match [criteria]" job. See the Wikidata SPARQL demo in the repo for production patterns.
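
SPARQL JSON results nest every value one level down, which gets tedious once a query returns more than two columns. A minimal sketch of flattening them into plain rows follows. The helper name `flatten_bindings` is hypothetical and the sample payload is made-up placeholder data, not real query output.

```python
def flatten_bindings(payload: dict) -> list[dict]:
    """Turn SPARQL JSON results into plain {variable: value} rows."""
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        # Variables left unbound (e.g. by OPTIONAL clauses) are absent from the binding.
        rows.append({v: binding[v]["value"] if v in binding else None for v in variables})
    return rows

# Placeholder payload in the standard SPARQL JSON results shape.
example = {
    "head": {"vars": ["person", "personLabel", "birthDate"]},
    "results": {"bindings": [
        {
            "person": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
            "personLabel": {"type": "literal", "value": "Example Person"},
        },
    ]},
}
rows = flatten_bindings(example)
print(rows)
```

Rows in this shape drop straight into `csv.DictWriter` or a DataFrame constructor.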

Path 3: MediaWiki Action API

For when you need things the REST API doesn't expose: edit history, category trees, page revisions, contributor lists.

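import requests

def get_revisions(title: str, n: int = 10) -> list[dict]:
    r = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query", "format": "json", "prop": "revisions",
        "titles": title, "rvlimit": n, "rvprop": "timestamp|user|comment|size",
    }, headers={"User-Agent": "research-script/1.0 (you@example.com)"}, timeout=30)
    r.raise_for_status()
    pages = r.json()["query"]["pages"]
    page = next(iter(pages.values()))  # one title in, one page out
    return page.get("revisions", [])   # missing pages have no "revisions" key

for rev in get_revisions("Web scraping", n=5):
    # "user" and "comment" can be absent when a revision's details are hidden
    print(rev["timestamp"], rev.get("user", "?"), "-", rev.get("comment", "")[:80])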
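
A single request like the one in `get_revisions` returns only one batch. When more results exist, the Action API includes a `continue` object whose keys you send back with the next request. The sketch below shows that loop. `iter_revisions` and the `fetch` callable are hypothetical names, and `fetch` is passed in so the paging logic can be exercised without network access.

```python
from typing import Callable, Iterator

def iter_revisions(title: str, fetch: Callable[[dict], dict], batch: int = 50) -> Iterator[dict]:
    """Yield revisions for `title`, following Action API continuation tokens."""
    params = {
        "action": "query", "format": "json", "prop": "revisions",
        "titles": title, "rvlimit": batch, "rvprop": "timestamp|user|comment|size",
    }
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            yield from page.get("revisions", [])
        if "continue" not in data:
            break  # no continuation object means this was the last batch
        # Merge the server's continuation keys (e.g. rvcontinue) into the next request.
        params = {**params, **data["continue"]}
```

In real use, `fetch` could be something like `lambda p: requests.get("https://en.wikipedia.org/w/api.php", params=p, headers=HEADERS, timeout=30).json()`, with `HEADERS` carrying your User-Agent.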

Path 4: HTML scraping (only when you need it)

Use case: you want the rendered visual structure (infobox, table, image gallery) without re-implementing MediaWiki's renderer.

import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/Web_scraping",
                 headers={"User-Agent": "research-script/1.0"})
soup = BeautifulSoup(r.text, "html.parser")

# Infobox extraction
infobox = soup.select_one("table.infobox")
data = {}
if infobox:
    for tr in infobox.select("tr"):
        th = tr.select_one("th")
        td = tr.select_one("td")
        if th and td:
            data[th.get_text(strip=True)] = td.get_text(" ", strip=True)
print(data)

When you reach for HTML scraping on Wikipedia, ask first whether the REST API's /page/html/ endpoint gives you the same data more cheaply. Usually yes.

Bulk processing patterns

For "extract data from 1,000 articles":

1. Get the list of titles via category membership (MediaWiki Action API: list=categorymembers) or a Wikidata SPARQL query
2. For each title, fetch via the REST API (/page/summary/ or /page/html/) in parallel, using httpx async or concurrent.futures.ThreadPoolExecutor
3. Parse what you need with BeautifulSoup (HTML) or mwparserfromhell (raw wikitext)
4. Write to CSV / DB

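The bottleneck is HTTP, not parsing. With 50 parallel requests and roughly 500 ms per response (an assumed latency), you sustain about 100 requests/second, which stays under the 200/sec rate limit and fetches 1,000 articles in ~10 seconds.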
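
The parallel fetch step can be sketched with `concurrent.futures.ThreadPoolExecutor`. This is an illustration under assumptions: `fetch_all` and `fetch_one` are hypothetical names, and `fetch_one` stands in for whatever single-article function you use (for example a REST summary call).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

def fetch_all(titles: list[str], fetch_one: Callable[[str], dict],
              workers: int = 50) -> tuple[dict, dict]:
    """Fetch every title in parallel; return (results, errors) keyed by title."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_one, t): t for t in titles}
        for fut in as_completed(futures):
            title = futures[fut]
            try:
                results[title] = fut.result()
            except Exception as exc:  # one bad article should not sink the batch
                errors[title] = exc
    return results, errors
```

Collecting errors separately lets you retry just the failed titles instead of rerunning the whole batch.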

Production demo

See portfolio_demos/wikipedia_infobox_extractor/ in the repo for a complete bulk-infobox extractor:

Input: list of article titles (or a category)
Process: fetch via REST API, parse infobox, normalize fields across articles
Output: CSV with one row per article, columns merged across heterogeneous infoboxes
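
The "columns merged across heterogeneous infoboxes" step is the part that tends to trip people up, so here is a minimal sketch of one way to do it with the standard library. This is not the repo's code: `write_merged_csv` is a hypothetical helper, and it only takes the union of keys without normalizing field names.

```python
import csv
import io

def write_merged_csv(rows: list[dict], out) -> list[str]:
    """Write dict rows with differing keys to one CSV; return the column order."""
    columns: list[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)  # first-seen order keeps the output stable
    # restval fills the cells for fields a given infobox did not have.
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return columns

# Two rows with different infobox fields (placeholder values).
buf = io.StringIO()
cols = write_merged_csv(
    [{"title": "A", "Born": "1900"}, {"title": "B", "Founded": "1950"}],
    buf,
)
print(buf.getvalue())
```

For a file on disk, open it with `newline=""` as the `csv` module documentation recommends.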

Legal & ToS

Wikipedia content is licensed CC-BY-SA. You can:

✓ Reuse it commercially
✓ Modify it
✓ Redistribute it
Required: attribution + share-alike licensing of derivative works

Wikipedia's ToS asks (doesn't legally require) that you:

Set a real User-Agent identifying yourself
Respect rate limits (200 req/sec is the soft cap; below that you're fine)
Don't impersonate Wikipedia or its editors

For massive-scale extraction (>1M articles), Wikimedia provides full database dumps at dumps.wikimedia.org. Use those instead of API hammering.
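
A dump is one very large XML file, so it needs to be streamed rather than loaded whole. The sketch below assumes the standard MediaWiki export layout (`page` elements containing `title` and a revision `text`) and strips the XML namespace because its version string varies between dumps. `iter_pages` is a hypothetical helper name.

```python
import bz2
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator

def iter_pages(stream: BinaryIO) -> Iterator[tuple[str, str]]:
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, ""
    for _, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the versioned XML namespace
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to free their text

# Usage with a compressed dump (file name is an example, adjust to your download):
# with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
#     for title, wikitext in iter_pages(f):
#         ...
```

The yielded wikitext is raw markup, so this pairs naturally with mwparserfromhell rather than BeautifulSoup.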

What to build with this

A few high-leverage Wikipedia-data ideas:

Topic-specific bibliography — every "Books in [genre]" category → infobox extraction → enriched CSV
Biographical dataset — every "Living people" sub-category → birth dates, places, occupations → analytics
Geographic coordinates dataset — every "Geographical objects" category → lat/lon → map
Cross-language alignment — same article in 10 languages → translation pair extraction
Citation graph — outbound links from articles → who-cites-who network

These are all friendly to the API path. None require HTML scraping.

Hire me to build this for your site

I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.
