Sitemap-Driven JSON-LD Bulk Extractor
Maps to "scrape every X on this site" Upwork briefs. Two-stage pipeline: pull sitemap.xml (handles sitemap-index nesting), filter URLs by pattern, then extract every JSON-LD block from each page and project to a flat CSV.
Built 2026-05-03 as Demo #13. Pattern works on ~60-80% of major commercial sites (e-commerce, recipes, news, real estate, events) without writing a single CSS selector.
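The first stage (sitemap in, filtered URL list out) can be sketched roughly as below. This is an illustration, not the code in extract.py: `collect_urls` and the injected `fetch` callable are hypothetical names, and the fetcher is passed in so the recursion through a sitemap index can be shown without any network access.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url: str, fetch: Callable[[str], str],
                 url_filter: str = "") -> Iterator[str]:
    """Yield page URLs from a sitemap, recursing into sitemap indexes.

    `fetch` is a hypothetical callable returning the XML text for a URL.
    `url_filter` is a plain substring; an empty string keeps every URL.
    """
    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        # A sitemap index lists child sitemaps rather than pages.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            child = (loc.text or "").strip()
            if child:
                yield from collect_urls(child, fetch, url_filter)
    else:
        # A regular urlset lists the pages themselves.
        for loc in root.findall("sm:url/sm:loc", NS):
            url = (loc.text or "").strip()
            if url and url_filter in url:
                yield url
```

In a real run, `fetch` would wrap an HTTP GET with the retry and politeness delay listed under Result.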
Run
. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/sitemap_jsonld_extractor
python extract.py --reset --limit 8 # extract 8 BBC Good Food recipes
python extract.py --limit 4 # add 4 more (resume — auto-dedupes)
Result
- 12 recipes extracted from BBC Good Food's quarterly sitemap ✅
- Idempotent resume (already-seen URLs skipped on subsequent runs) ✅
- Structured fields: name, prep time, cook time, ingredients count, calories, rating ✅
- Tenacity retries with exponential backoff ✅
- 1-second politeness sleep between fetches ✅
- 0 failures across 12 pages ✅
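One way the idempotent resume could work is to treat the output CSV itself as the record of what has been seen. The sketch below assumes that design; extract.py may track state differently, and `load_seen`, `append_new`, and the two-column `FIELDS` list are hypothetical names chosen for illustration.

```python
import csv
import os
from typing import Dict, Iterable, Set

FIELDS = ["url", "name"]  # illustrative columns only


def load_seen(csv_path: str) -> Set[str]:
    """Return the URLs already present in the output CSV (empty if no file)."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        return {row["url"] for row in csv.DictReader(fh)}


def append_new(csv_path: str, rows: Iterable[Dict[str, str]]) -> int:
    """Append rows whose URL is not yet in the CSV; return how many were added."""
    seen = load_seen(csv_path)
    write_header = not os.path.exists(csv_path)
    added = 0
    with open(csv_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for row in rows:
            if row["url"] in seen:
                continue  # already extracted on an earlier run
            writer.writerow(row)
            seen.add(row["url"])
            added += 1
    return added
```

In practice the `seen` set would be checked before fetching a page, so a resumed run skips the HTTP request as well as the write.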
Why JSON-LD
schema.org JSON-LD is the structured data the site itself publishes for Google. It survives template redesigns, A/B tests, and JS-heavy SPAs better than any CSS selector chain. The same extractor works on Recipe, Product, NewsArticle, Event, RealEstateListing, and Organization: just swap the projection function for the schema.org type you care about.
Adapting to a new site / type
- Find the site's sitemap (try /sitemap.xml, /sitemap_index.xml, or /robots.txt for the actual location).
- Pass it via --sitemap. Adjust --filter to the URL substring that identifies the page type you want.
- Add a project_* function for the schema.org type (Product, NewsArticle, etc.) and switch the call in main().
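As an illustration of the last step, a projection for Product pages might look something like this. The function name `project_product`, its signature, and the chosen columns are assumptions made for the sketch; the property names (`offers`, `priceCurrency`, `aggregateRating`, `ratingValue`) come from the schema.org vocabulary.

```python
from typing import Any, Dict


def project_product(node: Dict[str, Any], url: str) -> Dict[str, Any]:
    """Flatten a schema.org Product node into one CSV-friendly row."""
    offers = node.get("offers") or {}
    if isinstance(offers, list):  # some sites publish a list of offers
        offers = offers[0] if offers else {}
    if not isinstance(offers, dict):
        offers = {}

    brand = node.get("brand")
    if isinstance(brand, dict):  # brand may be a Brand object or a plain string
        brand = brand.get("name")

    rating = node.get("aggregateRating")
    if not isinstance(rating, dict):
        rating = {}

    return {
        "url": url,
        "name": node.get("name"),
        "brand": brand,
        "price": offers.get("price"),
        "currency": offers.get("priceCurrency"),
        "rating": rating.get("ratingValue"),
    }
```

Missing fields come back as None instead of raising, which keeps one sparse page from aborting a bulk run.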
The extract_jsonld() function handles the three JSON-LD shapes that sites use in the wild: bare object, array of objects, and @graph-wrapped object — so the parsing layer rarely needs changing.
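For readers who want to see the three shapes concretely, here is a standard-library sketch of the same idea. It is not the actual extract_jsonld() from this repo: `flatten_jsonld` and `_LdJsonCollector` are hypothetical names, and the real function may well use a different HTML parser.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class _LdJsonCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[str] = []
        self._buf: Optional[List[str]] = None  # set while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buf is not None:
            self.blocks.append("".join(self._buf))
            self._buf = None


def flatten_jsonld(html: str) -> List[Dict[str, Any]]:
    """Return every JSON-LD node on the page as a flat list of dicts."""
    parser = _LdJsonCollector()
    parser.feed(html)
    parser.close()
    nodes: List[Dict[str, Any]] = []
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than fail the page
        # Shapes 1 and 2: a bare object or an array of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            graph = item.get("@graph")
            if isinstance(graph, list):
                # Shape 3: an object whose nodes sit under "@graph".
                nodes.extend(n for n in graph if isinstance(n, dict))
            else:
                nodes.append(item)
    return nodes
```

A projection function can then pick the nodes whose "@type" matches the type of interest.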
Hire me to build this for your stack
Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.
info@luba.media