Wikipedia Infobox Bulk Extractor
Take a list of Wikipedia article titles → fetch each via the public MediaWiki parse API → extract the infobox → flatten label/value rows to a per-article CSV. Maps to "extract this dataset of [companies / people / films / drugs / countries] from Wikipedia" briefs.

Built 2026-05-03 as Demo #15. Different from the wikitable demo (#19): wikitables list things, infoboxes describe them.

Run

. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/wikipedia_infobox_extractor
python extract.py             # uses titles.txt
python extract.py --limit 10  # cap to first 10

Result (8 well-known SaaS / AI companies)

- Footnote markers [1], [2], edit links, and inline references stripped ✅

Structured output sample

| company      | founded | industry           | hq            | employees    | revenue        |
|--------------|---------|--------------------|---------------|--------------|----------------|
| Anthropic    | 2021    | AI                 | San Francisco | 2,500 (2026) | —              |
| OpenAI       | 2015    | AI                 | San Francisco | 4,500 (2026) | $13.1B (2025)  |
| Hugging Face | 2016    | AI / ML            | New York City | 250 (2025)   | $15M (2022)    |
| Cloudflare   | 2009    | Web infrastructure | San Francisco | 4,800 (2025) | $1.67B (2024)  |

Adapting to a new entity type

Wikipedia uses different infobox templates for different things, but the rendered HTML structure is consistent (th class="infobox-label" + td class="infobox-data"). The extractor is template-agnostic, so the same code runs against:

- Infobox person
- Infobox film
- Infobox software
- Infobox medical condition
- Infobox country

For a new entity type, just supply a different titles.txt. No code changes.

Hire me to build this for your stack

Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.
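
Appendix: a rough sketch of the fetch → extract → flatten idea

This is not the code in extract.py, only a minimal illustration of why the approach under "Adapting to a new entity type" can be template-agnostic: it keys on the infobox-label / infobox-data cell classes and nothing else. The names fetch_rendered_html, InfoboxRowParser, and infobox_to_csv are hypothetical, the English Wikipedia endpoint and the User-Agent string are assumptions, and only the Python standard library is used. Footnote handling is simplified to dropping sup elements with class "reference".

```python
import csv
import io
import json
import re
import urllib.parse
import urllib.request
from html.parser import HTMLParser

# Assumption: English Wikipedia; other wikis expose the same API path.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_rendered_html(title: str) -> str:
    """Fetch an article's rendered HTML through the MediaWiki action=parse API."""
    query = urllib.parse.urlencode({
        "action": "parse", "page": title, "prop": "text",
        "format": "json", "formatversion": "2", "redirects": "1",
    })
    request = urllib.request.Request(
        f"{API_URL}?{query}",
        # Placeholder User-Agent for illustration only.
        headers={"User-Agent": "infobox-sketch/0.1 (illustrative example)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["parse"]["text"]


class InfoboxRowParser(HTMLParser):
    """Collect (label, value) pairs from infobox-label / infobox-data cells."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.rows: list[tuple[str, str]] = []
        self._cell = None        # "label" or "data" while inside a matching cell
        self._skipping = False   # True while inside a footnote <sup>
        self._buffer: list[str] = []
        self._label = ""

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "th" and "infobox-label" in classes:
            self._cell, self._buffer = "label", []
        elif tag == "td" and "infobox-data" in classes:
            self._cell, self._buffer = "data", []
        elif tag == "sup" and "reference" in classes:
            self._skipping = True          # drop markers such as [1], [2]
        elif tag == "br" and self._cell:
            self._buffer.append(" ")       # keep line breaks as spaces

    def handle_endtag(self, tag):
        if tag == "sup" and self._skipping:
            self._skipping = False
        elif tag == "th" and self._cell == "label":
            self._label, self._cell = self._clean(), None
        elif tag == "td" and self._cell == "data":
            self.rows.append((self._label, self._clean()))
            self._label, self._cell = "", None

    def handle_data(self, data):
        if self._cell and not self._skipping:
            self._buffer.append(data)

    def _clean(self) -> str:
        # Collapse runs of whitespace left behind by nested markup.
        return re.sub(r"\s+", " ", "".join(self._buffer)).strip()


def infobox_to_csv(html: str) -> str:
    """Flatten the label/value rows of rendered HTML into CSV text."""
    parser = InfoboxRowParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["label", "value"])
    writer.writerows(parser.rows)
    return out.getvalue()


# Hand-written sample in the shape of a rendered infobox (no network needed).
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th scope="row" class="infobox-label">Founded</th>
      <td class="infobox-data">2009<sup class="reference"><a href="#cite_note-1">[1]</a></sup></td></tr>
  <tr><th scope="row" class="infobox-label">Headquarters</th>
      <td class="infobox-data"><a href="/wiki/San_Francisco">San Francisco</a>,<br/>California</td></tr>
</table>
"""

if __name__ == "__main__":
    print(infobox_to_csv(SAMPLE_HTML))
```

For a live page, infobox_to_csv(fetch_rendered_html("Cloudflare")) would be the equivalent call. A production version would also want rate limiting, error handling for missing pages, and a decision about multi-value cells, none of which this sketch attempts.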