How to Scrape Wikipedia (The Easy Target Everyone Overcomplicates)
How to Scrape Wikipedia in 2026
Wikipedia is the easiest legitimate scraping target on the public internet. The data is licensed for reuse (CC-BY-SA), the official APIs are excellent and free, and there's no anti-bot layer to work around.
The mistake most tutorials make is showing you BeautifulSoup on rendered Wikipedia pages. Don't do that. Wikipedia offers four paths to its content, and three of them beat parsing rendered pages.
The four paths, ranked
- Wikipedia REST API: fastest for "give me the content of [article]"
- Wikidata SPARQL endpoint: for structured queries ("every female Nobel laureate after 1990")
- MediaWiki Action API: for advanced needs (categories, revision history, edit metadata)
- HTML scraping: only when you need the rendered visual layout
For 90% of use cases, path 1 (REST API) is all you need. Don't reach for BeautifulSoup until you've checked the API.
Path 1: REST API (the easy answer)
curl "https://en.wikipedia.org/api/rest_v1/page/summary/Web_scraping"
Returns JSON:
{
  "title": "Web scraping",
  "extract": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites...",
  "thumbnail": {"source": "https://...", "width": 320, "height": 200},
  "wikibase_item": "Q665452",
  "lang": "en"
}
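The summary payload is flat enough to reduce to a row with plain dict access. Here is a minimal sketch, assuming the response carries the fields shown above. `summary_to_row` is a hypothetical helper name, and the sample dict simply mirrors the example response rather than a live call.

```python
def summary_to_row(summary: dict) -> dict:
    """Pick the fields most pipelines need from a /page/summary/ response."""
    thumb = summary.get("thumbnail") or {}  # not every article has a thumbnail
    return {
        "title": summary.get("title", ""),
        "extract": summary.get("extract", ""),
        # wikibase_item is the Wikidata entity ID, useful as a join key for SPARQL.
        "wikidata_id": summary.get("wikibase_item"),
        "thumbnail_url": thumb.get("source"),
        "lang": summary.get("lang"),
    }


# Sample input copied from the example response (truncated values left as-is).
sample = {
    "title": "Web scraping",
    "extract": "Web scraping, web harvesting, or web data extraction is ...",
    "thumbnail": {"source": "https://...", "width": 320, "height": 200},
    "wikibase_item": "Q665452",
    "lang": "en",
}
row = summary_to_row(sample)
print(row["title"], row["wikidata_id"])
```

Using `.get` throughout means an article without a thumbnail or Wikidata item yields `None` instead of raising `KeyError`.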
Three useful endpoints:
import requests
from urllib.parse import quote

BASE = "https://en.wikipedia.org/api/rest_v1/page"
HEADERS = {"User-Agent": "research-script/1.0 (you@example.com)"}

def _get(kind: str, title: str) -> requests.Response:
    # Percent-encode the title: spaces, slashes and "?" would otherwise break the path.
    url = f"{BASE}/{kind}/{quote(title.replace(' ', '_'), safe='')}"
    r = requests.get(url, headers=HEADERS, timeout=30)
    r.raise_for_status()  # fail loudly on 404s and rate-limit responses
    return r

def get_summary(title: str) -> dict:
    return _get("summary", title).json()

def get_html(title: str) -> str:
    return _get("html", title).text  # clean rendered HTML

def get_pdf(title: str) -> bytes:
    return _get("pdf", title).content
Rate limit: ~200 requests/second, no auth needed. Set a descriptive User-Agent that includes contact details on every request, per Wikimedia's User-Agent policy.
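Staying under the rate limit is easier with a small throttle in front of each request. The sketch below is one simple way to do it, not something the API requires. `Throttle` is a hypothetical class, it is meant for single-threaded use, and the `clock` and `sleep` parameters exist only so the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Space calls at least 1/rate_per_sec seconds apart (single-threaded use)."""

    def __init__(self, rate_per_sec: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self.clock()
        delay = 0.0
        if self.next_allowed is not None and now < self.next_allowed:
            delay = self.next_allowed - now
            self.sleep(delay)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
        return delay
```

Call `throttle.wait()` immediately before each `requests.get(...)`. Picking a rate well below the cap (for example `Throttle(50)`) leaves headroom for other clients sharing your IP.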
Path 2: Wikidata SPARQL (structured queries)
Wikidata is Wikipedia's structured-data sibling. Nearly every Wikipedia article has a matching Wikidata entity (an ID like Q12345) with typed properties.
import requests

QUERY = """
SELECT DISTINCT ?person ?personLabel ?birthDate WHERE {
  ?person wdt:P166/wdt:P279* wd:Q7191.  # award received: Nobel Prize or a subclass of it
  ?person wdt:P21 wd:Q6581072.          # sex or gender = female
  ?person wdt:P569 ?birthDate.          # date of birth
  FILTER(YEAR(?birthDate) >= 1950)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": QUERY, "format": "json"},
                 headers={"User-Agent": "research-script/1.0 (you@example.com)"},
                 timeout=60)
r.raise_for_status()
for binding in r.json()["results"]["bindings"]:
    print(binding["personLabel"]["value"], binding["birthDate"]["value"][:10])
This pulls every female Nobel laureate born in 1950 or later (up to the LIMIT of 100) in one HTTP call. Try doing that with HTML scraping: you'd be at it for a week.
Wikidata SPARQL is the underrated weapon for any "give me a list of all [X] that match [criteria]" job. See the Wikidata SPARQL demo in the repo for production patterns.
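SPARQL JSON results wrap every value in a small object, which gets tedious once you have more than two columns. A minimal sketch of flattening them into plain rows follows. `flatten_bindings` is a hypothetical helper, and the sample data is made up for illustration (the entity URIs are placeholders, not real items).

```python
def flatten_bindings(result: dict) -> list[dict]:
    """Turn SPARQL JSON results into one plain dict per row."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # Variables with no value for a row are simply absent from its binding.
        rows.append({v: (binding[v]["value"] if v in binding else None)
                     for v in variables})
    return rows


# Made-up sample in the standard SPARQL JSON results shape.
sample = {
    "head": {"vars": ["person", "personLabel", "birthDate"]},
    "results": {"bindings": [
        {
            "person": {"type": "uri", "value": "http://www.wikidata.org/entity/PLACEHOLDER1"},
            "personLabel": {"type": "literal", "value": "Example Person"},
            "birthDate": {"type": "literal", "value": "1955-03-01T00:00:00Z"},
        },
        {"person": {"type": "uri", "value": "http://www.wikidata.org/entity/PLACEHOLDER2"}},
    ]},
}
rows = flatten_bindings(sample)
print(rows[0]["personLabel"], rows[0]["birthDate"][:10])
```

Filling missing variables with `None` keeps every row the same shape, which matters once the rows go to CSV or a DataFrame.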
Path 3: MediaWiki Action API
For when you need things the REST API doesn't expose: edit history, category trees, page revisions, contributor lists.
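import requests

def get_revisions(title: str, n: int = 10) -> list[dict]:
    r = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query", "format": "json", "prop": "revisions",
        "titles": title, "rvlimit": n, "rvprop": "timestamp|user|comment|size",
    }, headers={"User-Agent": "research-script/1.0 (you@example.com)"}, timeout=30)
    r.raise_for_status()
    pages = r.json()["query"]["pages"]
    page = next(iter(pages.values()))  # one title in, one page out
    return page.get("revisions", [])   # missing pages have no "revisions" key

for rev in get_revisions("Web scraping", n=5):
    # "user" and "comment" are absent when a revision's details are hidden.
    print(rev["timestamp"], rev.get("user", "?"), "-", rev.get("comment", "")[:80])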
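Category listings are the other common Action API job, and they rarely fit in one response: the API returns a `continue` object whose contents you send back with the next request. The sketch below follows that continuation loop for `list=categorymembers`. `iter_category_members` and the injected `fetch` callable are hypothetical names, and `fetch` is assumed to take a params dict and return the decoded JSON.

```python
from typing import Callable, Iterator


def iter_category_members(category: str,
                          fetch: Callable[[dict], dict]) -> Iterator[str]:
    """Yield page titles in a category, following Action API continuation."""
    params = {
        "action": "query", "format": "json", "list": "categorymembers",
        "cmtitle": f"Category:{category}", "cmlimit": "max",
    }
    cont: dict = {}
    while True:
        data = fetch({**params, **cont})
        for member in data.get("query", {}).get("categorymembers", []):
            yield member["title"]
        if "continue" not in data:
            break  # no more batches
        # The API hands back the exact parameters to add to the next request.
        cont = data["continue"]
```

In real use, `fetch` could be something like `lambda p: requests.get("https://en.wikipedia.org/w/api.php", params=p, headers=HEADERS, timeout=30).json()`, where `HEADERS` holds your User-Agent. Injecting it keeps the pagination logic testable without network access.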
Path 4: HTML scraping (only when you need it)
Use case: you want the rendered visual structure (infobox, table, image gallery) without re-implementing MediaWiki's renderer.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/Web_scraping",
                 headers={"User-Agent": "research-script/1.0 (you@example.com)"},
                 timeout=30)
r.raise_for_status()
soup = BeautifulSoup(r.text, "html.parser")

# Infobox extraction: each row pairs a header cell (label) with a data cell (value).
infobox = soup.select_one("table.infobox")
data = {}
if infobox:  # not every article has an infobox
    for tr in infobox.select("tr"):
        th = tr.select_one("th")
        td = tr.select_one("td")
        if th and td:
            data[th.get_text(strip=True)] = td.get_text(" ", strip=True)
print(data)
When you reach for HTML scraping on Wikipedia, ask first whether the REST API's /page/html/ endpoint gives you the same data more cheaply. Usually yes.
Bulk processing patterns
For "extract data from 1,000 articles":
- Get the list of titles via category membership (MediaWiki Action API: list=categorymembers) or a Wikidata SPARQL query
- For each title, fetch via REST API (/page/summary/ or /page/html/) in parallel, using httpx async or concurrent.futures.ThreadPoolExecutor
- Parse what you need with BeautifulSoup (HTML) or mwparserfromhell (raw wikitext)
- Write to CSV / DB
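The bottleneck is HTTP, not parsing. With 50 requests in flight and roughly 0.5 seconds per request (an assumed latency), you get about 100 requests/second, comfortably under the 200/sec rate limit, so 1,000 articles take ~10 seconds.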
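The parallel-fetch step can be sketched with `concurrent.futures.ThreadPoolExecutor`. This is one possible shape, not the repo's implementation: `fetch_all` is a hypothetical helper, and `fetch_one` stands in for whatever per-title function you use (for example a REST summary fetcher).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def fetch_all(titles: list[str],
              fetch_one: Callable[[str], dict],
              max_workers: int = 50) -> tuple[dict, dict]:
    """Fetch many titles in parallel; return (results, errors) keyed by title."""
    results: dict = {}
    errors: dict = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, t): t for t in titles}
        for fut in as_completed(futures):
            title = futures[fut]
            try:
                results[title] = fut.result()
            except Exception as exc:
                # Keep going: one bad title should not sink the whole batch.
                errors[title] = exc
    return results, errors
```

Collecting errors per title instead of raising lets you retry just the failures afterwards. `max_workers` caps the number of requests in flight, which is the main knob for staying under the rate limit.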
Production demo
See portfolio_demos/wikipedia_infobox_extractor/ in the repo for a complete bulk-infobox extractor:
- Input: list of article titles (or a category)
- Process: fetch via REST API, parse infobox, normalize fields across articles
- Output: CSV with one row per article, columns merged across heterogeneous infoboxes
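The column-merging part of that output step is simple enough to show in isolation. What follows is an illustrative sketch, not the code from the repo. `rows_to_csv` is a hypothetical helper, and it covers only the merge: every key seen in any infobox becomes a column, and missing values are left empty.

```python
import csv
import io


def rows_to_csv(rows: list[dict]) -> str:
    """Write dicts with differing keys to CSV using the union of all keys."""
    # Union of keys in first-seen order, so each infobox field gets one column.
    columns: list[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Two made-up infoboxes with different fields.
sample_rows = [
    {"title": "A", "Born": "1950"},
    {"title": "B", "Founded": "1999"},
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Real infoboxes also need label normalization (for instance "Born" versus "Date of birth") before merging. That step is left out here because it depends on the article set.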
Legal & ToS
Wikipedia content is licensed CC-BY-SA. You can:
- ✓ Reuse it commercially
- ✓ Modify it
- ✓ Redistribute it
- Required: attribution + share-alike licensing of derivative works
Wikipedia's ToS asks (doesn't legally require) that you:
- Set a real User-Agent identifying yourself
- Respect rate limits (200 req/sec is the soft cap; below that you're fine)
- Don't impersonate Wikipedia or its editors
For massive-scale extraction (>1M articles), Wikimedia provides full database dumps at dumps.wikimedia.org. Use those instead of API hammering.
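Dump files are large bz2-compressed XML, so they are best read as a stream rather than loaded whole. A rough sketch with the standard library follows, assuming the usual `<page>` / `<title>` / `<text>` layout of MediaWiki XML exports. `iter_pages` and `iter_dump` are hypothetical names, and the inline sample is a hand-written stand-in for a real dump.

```python
import bz2
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_pages(stream: BinaryIO) -> Iterator[tuple[str, str]]:
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        # Tags carry a versioned XML namespace prefix in braces; strip it.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = child.tag.rsplit("}", 1)[-1]
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the page's subtree to keep memory flat


def iter_dump(path: str) -> Iterator[tuple[str, str]]:
    """Stream pages straight from a .xml.bz2 dump file on disk."""
    with bz2.open(path, "rb") as f:
        yield from iter_pages(f)


# Hand-written sample shaped like an export file (not real dump content).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
<page><title>Example</title><revision><text>'''Example''' text.</text></revision></page>
</mediawiki>"""
pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

The yielded text is raw wikitext, so a parser such as mwparserfromhell is the natural next step for pulling out templates and infobox fields.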
What to build with this
A few high-leverage Wikipedia-data ideas:
- Topic-specific bibliography — every "Books in [genre]" category → infobox extraction → enriched CSV
- Biographical dataset — every "Living people" sub-category → birth dates, places, occupations → analytics
- Geographic coordinates dataset — every "Geographical objects" category → lat/lon → map
- Cross-language alignment — same article in 10 languages → translation pair extraction
- Citation graph — outbound links from articles → who-cites-who network
These are all friendly to the API path. None require HTML scraping.
What to read next
- Self-Healing AI Web Extractors — the schema-driven pattern for messier infoboxes
- Web Scraping Tools Comparison — when paid services make sense (almost never for Wikipedia)
- The repo: portfolio_demos/wikipedia_infobox_extractor/
If you have a Wikipedia/Wikidata-shaped data project, send to info@luba.media. Most are quick fixed-price gigs ($100-400) that ship in 24-48h.