Wayback Machine History Extractor
Wayback Machine History Extractor
Extract historical snapshots of any URL from the Internet Archive's Wayback Machine via the public CDX API. Maps to "what did this page look like in [year]?", "track competitor's homepage messaging over time", "audit my brand's historical claims" briefs.
Built 2026-05-03 as Demo #24.
Run
. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/wayback_history_extractor
python extract.py --target https://www.anthropic.com --from 20230101 --to 20260501 --limit 8
Result
- 8 unique-digest snapshots from anthropic.com (Jan-Feb 2023) ✅
- CDX
collapse=digestserver-side de-dupes identical content snapshots ✅ - Per-snapshot fields: date, snapshot URL, title, h1, meta-description ✅
- 1.5s politeness sleep — Wayback is a free public service ✅
- Tenacity retries with exponential backoff (4-20s) ✅
Why this brief class matters
Three real Upwork-style use cases:
- Competitor messaging archaeology — "What did Stripe's homepage say in 2018 vs 2024?"
- Compliance / fact-check — "Did this site claim X before regulators flagged it?"
- SEO / content history audit — "What did our title tag look like in 2022 when we ranked #1?"
The CDX API is free, has no quota, and returns up to millions of snapshots per URL.
Hire me to build this for your stack
Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.
info@luba.media