Eyal Rosenthal · Web scraping at scale

Wikipedia Infobox Bulk Extractor

Wikipedia Infobox Bulk Extractor — Per-Title CSV via MediaWiki Parse API

Take a list of Wikipedia article titles → fetch each via the public MediaWiki parse API → extract the infobox → flatten label/value rows to a per-article CSV. Maps to "extract this dataset of [companies / people / films / drugs / countries] from Wikipedia" briefs.
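
The fetch step can be sketched roughly as follows. This is a minimal illustration and not the code in extract.py: the function names, the User-Agent string, and the use of urllib instead of a third-party HTTP client are all assumptions. The query parameters (action=parse, page, prop=text, redirects, format=json, formatversion=2) are standard MediaWiki Action API parameters.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical contact string; Wikimedia asks clients to send a descriptive User-Agent.
USER_AGENT = "infobox-extractor-demo/0.1 (contact: you@example.com)"


def build_parse_url(title: str) -> str:
    """Return the parse-API URL that requests one article's rendered HTML."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",       # only the rendered HTML is needed
        "redirects": 1,       # resolve redirects to the target page
        "format": "json",
        "formatversion": 2,   # makes parse.text a plain string
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def fetch_article_html(title: str, timeout: float = 30.0) -> tuple[str, str]:
    """Fetch one article and return (resolved_title, html). Makes a network call."""
    request = urllib.request.Request(
        build_parse_url(title), headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    parsed = payload["parse"]
    return parsed["title"], parsed["text"]
```

Calling fetch_article_html("Cloudflare") would return the resolved page title together with the HTML that the later extraction step walks.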

Built 2026-05-03 as Demo #15. Different from the wikitable demo (#19) — wikitables list things, infoboxes describe them.

Run

. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/wikipedia_infobox_extractor
python extract.py                # uses titles.txt
python extract.py --limit 10     # cap to first 10

Result (8 well-known SaaS / AI companies)

  • 8 / 8 articles produced an infobox ✅
  • Total fields extracted: ~108 across 37 distinct columns ✅
  • Auto-handles redirects (e.g. "Stripe (company)" → canonical "Stripe, Inc.") ✅
  • Strips footnote markers [1], [2], edit links, and inline references ✅
  • 0.6s politeness sleep between API calls ✅
  • Tenacity retries with exponential backoff ✅
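
The cleanup, pacing, and retry behaviour listed above might look something like this sketch. The demo itself uses Tenacity for retries; the stdlib-only call_with_backoff below is a stand-in written for illustration, and every name here (clean_value, call_with_backoff, fetch_all, the marker regex) is hypothetical. Only the 0.6 s pause comes from the list above.

```python
import re
import time

# Bracketed markers such as [1], [a], [edit], [citation needed].
_MARKER_RE = re.compile(
    r"\[\s*(?:\d+|[a-z]|edit|citation needed)\s*\]", re.IGNORECASE
)


def clean_value(text):
    """Drop bracketed footnote/edit markers and collapse whitespace."""
    without_markers = _MARKER_RE.sub("", text)
    return " ".join(without_markers.split())


def call_with_backoff(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**n seconds, then retry."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


def fetch_all(titles, fetch_one, pause=0.6, sleep=time.sleep):
    """Fetch titles one at a time, pausing between calls to stay polite."""
    results = {}
    for index, title in enumerate(titles):
        if index:
            sleep(pause)  # no pause before the very first request
        results[title] = call_with_backoff(lambda: fetch_one(title), sleep=sleep)
    return results
```

Passing sleep as a parameter keeps the pacing logic testable without real delays.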

Structured output sample

company      | founded | industry           | hq            | employees    | revenue
Anthropic    | 2021    | AI                 | San Francisco | 2,500 (2026) |
OpenAI       | 2015    | AI                 | San Francisco | 4,500 (2026) | $13.1B (2025)
Hugging Face | 2016    | AI / ML            | New York City | 250 (2025)   | $15M (2022)
Cloudflare   | 2009    | Web infrastructure | San Francisco | 4,800 (2025) | $1.67B (2024)
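
One way to produce a wide table like this from per-article label/value pairs is sketched below. It assumes one CSV row per article with the union of all labels as columns, as the sample suggests; write_wide_csv and the "article" key column are hypothetical names, not taken from extract.py.

```python
import csv
import io


def write_wide_csv(records, out, key_column="article"):
    """Write one row per article; columns are the union of all labels seen.

    records maps an article title to a {label: value} dict.
    """
    columns = []
    for fields in records.values():
        for label in fields:
            if label not in columns:
                columns.append(label)  # keep first-seen order
    # restval="" leaves a blank cell where an article lacks a label.
    writer = csv.DictWriter(out, fieldnames=[key_column, *columns], restval="")
    writer.writeheader()
    for title, fields in records.items():
        writer.writerow({key_column: title, **fields})


if __name__ == "__main__":
    demo = {
        "Anthropic": {"Founded": "2021", "Industry": "AI"},
        "OpenAI": {"Founded": "2015", "Revenue": "$13.1B (2025)"},
    }
    buffer = io.StringIO()
    write_wide_csv(demo, buffer)
    print(buffer.getvalue())
```

Articles that lack a given label simply get an empty cell, which matches the blank revenue entry in the sample.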

Adapting to a new entity type

Wikipedia uses different infobox templates for different things, but the rendered HTML structure is consistent (th class="infobox-label" + td class="infobox-data"). The extractor is template-agnostic, so the same code runs against other entity types, such as people, films, drugs, or countries.

For a new entity type, just supply a different titles.txt. No code changes.
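
To illustrate why no code changes are needed, here is a rough sketch of pairing label and data cells by class name alone, using only the standard library. It is an assumption about how such an extractor could work, not the demo's actual parser (which may well use a different HTML library); InfoboxRowParser and extract_infobox_rows are hypothetical names. Footnote markers are left in the values in this sketch.

```python
from html.parser import HTMLParser


class InfoboxRowParser(HTMLParser):
    """Collect (label, value) pairs from th.infobox-label / td.infobox-data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []         # finished (label, value) pairs
        self._cell = None      # "label" or "data" while inside a target cell
        self._cell_tag = None  # tag name that opened the current cell
        self._depth = 0        # same-named tags nested inside the cell
        self._parts = []       # text fragments of the current cell
        self._label = None     # last label still waiting for its data cell

    def handle_starttag(self, tag, attrs):
        if self._cell is not None:
            if tag == self._cell_tag:
                self._depth += 1
            if tag in ("br", "li"):
                self._parts.append(" ")  # keep line-broken values separated
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "th" and "infobox-label" in classes:
            self._open("label", tag)
        elif tag == "td" and "infobox-data" in classes:
            self._open("data", tag)

    def _open(self, kind, tag):
        self._cell, self._cell_tag, self._depth, self._parts = kind, tag, 0, []

    def handle_data(self, data):
        if self._cell is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._cell is None or tag != self._cell_tag:
            return
        if self._depth:
            self._depth -= 1
            return
        text = " ".join("".join(self._parts).split())
        if self._cell == "label":
            self._label = text
        elif self._label is not None:
            self.rows.append((self._label, text))
            self._label = None
        self._cell = None


def extract_infobox_rows(html):
    """Return a list of (label, value) tuples found in the given HTML."""
    parser = InfoboxRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Because the parser keys on the two class names rather than on any template name, the same function would apply to a company, film, or country infobox as long as the rendered markup keeps those classes.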

Hire me to build this for your stack

Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.

info@luba.media