Wikipedia Infobox Bulk Extractor

Take a list of Wikipedia article titles → fetch each via the public MediaWiki parse API → extract the infobox table → flatten label/value rows to a per-article CSV. Fits briefs of the form "extract this dataset of [companies / people / films / drugs / countries] from Wikipedia".
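
The fetch step can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in extract.py: the function names, the User-Agent string, and the choice of urllib are assumptions. The query parameters (action=parse, prop=text, redirects, formatversion=2) are standard MediaWiki API options.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical User-Agent; replace with something that identifies your own client.
USER_AGENT = "infobox-extractor-sketch/0.1 (contact: you@example.com)"


def build_parse_url(title: str) -> str:
    """Return a parse-API URL asking for the rendered HTML of one article."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",
        "redirects": 1,       # follow redirects to the canonical article
        "format": "json",
        "formatversion": 2,   # makes parse.text a plain string
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_response(payload: dict) -> tuple[str, str]:
    """Pull (canonical title, rendered HTML) out of a decoded API response."""
    if "error" in payload:
        raise LookupError(payload["error"].get("info", "unknown API error"))
    parsed = payload["parse"]
    return parsed["title"], parsed["text"]


def fetch_article_html(title: str) -> tuple[str, str]:
    """Fetch one article over the network and return (title, html)."""
    request = Request(build_parse_url(title), headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=30) as response:
        return parse_response(json.load(response))
```

Keeping URL construction and response handling separate from the network call means both can be checked offline; only fetch_article_html touches Wikipedia.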

Built 2026-05-03 as Demo #15. Different from the wikitable demo (#19) — wikitables list things, infoboxes describe them.

Run

. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/wikipedia_infobox_extractor
python extract.py                # uses titles.txt
python extract.py --limit 10     # cap to first 10

Result (8 well-known SaaS / AI companies)

8 / 8 articles produced an infobox ✅
Total fields extracted: ~108 across 37 distinct columns ✅
Auto-handles redirects (e.g. "Stripe (company)" → canonical "Stripe, Inc.") ✅
Strips footnote markers [1], [2], edit links, and inline references ✅
0.6s politeness sleep between API calls ✅
Tenacity retries with exponential backoff ✅
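
As an illustration of the marker stripping listed above, here is a small sketch. It covers only the text cleanup (not the sleep or the Tenacity retries), and the regular expression and the clean_value name are assumptions rather than what extract.py actually contains.

```python
import re

# Bracketed residue commonly left in rendered infobox text:
# [1], [23], [a], [note 2], [citation needed], [edit]
BRACKET_MARKER = re.compile(
    r"\[(?:\d+|[a-z]|note \d+|citation needed|edit)\]", re.IGNORECASE
)
WHITESPACE = re.compile(r"\s+")  # \s also matches non-breaking spaces in str patterns


def clean_value(text: str) -> str:
    """Drop footnote and edit markers, then collapse runs of whitespace."""
    text = BRACKET_MARKER.sub("", text)
    return WHITESPACE.sub(" ", text).strip()


print(clean_value("4,500 (2026)[3]"))
print(clean_value("San Francisco,\xa0California[1][2] [edit]"))
```

The pattern is deliberately narrow, so bracketed text that is part of the value (for example "[formerly Twitter]") is left alone.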

Structured output sample

company	founded	industry	hq	employees	revenue
Anthropic	2021	AI	San Francisco	2,500 (2026)	—
OpenAI	2015	AI	San Francisco	4,500 (2026)	$13.1B (2025)
Hugging Face	2016	AI / ML	New York City	250 (2025)	$15M (2022)
Cloudflare	2009	Web infrastructure	San Francisco	4,800 (2025)	$1.67B (2024)
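
A wide table like the sample above can be produced by taking the union of labels across articles and leaving missing fields blank. The sketch below assumes each article has already been reduced to a dict of label to value; the rows_to_csv name, the "title" key column, and the blank placeholder are illustrative choices, not the extractor's actual output code.

```python
import csv
import io


def rows_to_csv(records: list[dict[str, str]], key_column: str = "title") -> str:
    """Write one CSV row per article, with one column per distinct label."""
    # Key column first, then every other label in first-seen order.
    columns = [key_column]
    for record in records:
        for label in record:
            if label not in columns:
                columns.append(label)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # labels an article lacks are filled with restval
    return buffer.getvalue()


records = [
    {"title": "Anthropic", "founded": "2021", "employees": "2,500 (2026)"},
    {"title": "OpenAI", "founded": "2015", "revenue": "$13.1B (2025)"},
]
print(rows_to_csv(records))
```

Values containing commas, such as the employee counts, are quoted automatically by the csv module, so they survive a round trip through csv.reader.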

Adapting to a new entity type

Wikipedia uses different infobox templates for different things — but the rendered HTML structure is consistent (th class="infobox-label" + td class="infobox-data"). The extractor is template-agnostic, so the same code runs against:

People (Infobox person)
Films (Infobox film)
Software (Infobox software)
Diseases (Infobox medical condition)
Countries (Infobox country)

For a new entity type, just supply a different titles.txt. No code changes.
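
To show why no template-specific code is needed, here is a minimal sketch of label/value extraction keyed only on the infobox-label and infobox-data classes. It uses the standard library's html.parser so that it runs without extra dependencies; the class and function names are hypothetical, and extract.py may well use a different HTML library.

```python
from html.parser import HTMLParser


class InfoboxRowParser(HTMLParser):
    """Collect (label, value) pairs from th.infobox-label / td.infobox-data cells."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.rows = []        # list of (label, value) tuples
        self._cell = None     # "label", "data", or None when outside a cell
        self._buffer = []     # text fragments of the cell being read
        self._label = ""      # most recent label awaiting its value

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "th" and "infobox-label" in classes:
            self._cell, self._buffer = "label", []
        elif tag == "td" and "infobox-data" in classes:
            self._cell, self._buffer = "data", []
        elif tag == "br" and self._cell:
            self._buffer.append(" ")  # treat line breaks inside a cell as spaces

    def handle_data(self, data):
        if self._cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag not in ("th", "td") or self._cell is None:
            return
        text = " ".join("".join(self._buffer).split())
        if self._cell == "label":
            self._label = text
        elif self._label:
            self.rows.append((self._label, text))
            self._label = ""
        self._cell = None


def extract_infobox(html: str) -> dict[str, str]:
    """Return the infobox rows of a rendered article as a label -> value dict."""
    parser = InfoboxRowParser()
    parser.feed(html)
    parser.close()
    return dict(parser.rows)
```

Nothing here mentions companies, people, or films; the labels come from whatever the page renders. One known limitation of this sketch is that a table nested inside a data cell would end that cell early.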
