Self-Healing AI Web Extractor

Self-Healing AI Extractor (Spec #1 demo)

Extracts structured book data (titles and prices) from HTML regardless of layout, and survives full DOM scrambles that break every CSS selector.

Run

cd ~/freelance/portfolio_demos/self_healing_scraper
. ~/freelance/.venv/bin/activate
python3 test_layout_change.py    # full acceptance test (passes)
python3 extractor.py http://books.toscrape.com/   # one-shot extraction

Result (last run)

Original layout: 20/20 books extracted, 20/20 prices ✅
Mutated layout (every CSS selector broken): 20/20 books, 100% title overlap ✅
Cost: ~1860 tokens / page (~$0.0003 with gpt-4o-mini)
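
The "title overlap" figure can be read as the share of titles from the original layout that also come back from the mutated one. A minimal sketch of such a metric, assuming titles are compared after whitespace and case normalization; the function names are hypothetical and test_layout_change.py may compute it differently:

```python
def normalize(title: str) -> str:
    """Collapse whitespace and ignore case so cosmetic differences do not count."""
    return " ".join(title.split()).casefold()


def title_overlap(original: list[str], mutated: list[str]) -> float:
    """Fraction of original titles that also appear in the mutated extraction."""
    expected = {normalize(t) for t in original}
    found = {normalize(t) for t in mutated}
    if not expected:
        return 0.0  # nothing to compare against: treat as a failure, not a pass
    return len(expected & found) / len(expected)


if __name__ == "__main__":
    before = ["A Light in the Attic", "Tipping the Velvet"]
    after = ["a light in the  attic", "Tipping the Velvet"]
    print(title_overlap(before, after))
```

Returning 0.0 for an empty baseline keeps an extraction that found nothing from being scored as a perfect match.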

What the test mutates

All class and id attributes stripped
Three tags renamed (… → …, … → …, … → …)
All data-* and itemprop attributes removed
Each … wrapped in an extra … layer
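
A mutation pass of this kind can be sketched with the standard library alone. The version below is an illustration, not the code in test_layout_change.py: the tag names in RENAME and WRAP_TAG are placeholders chosen for the example, and the Mutator class and mutate function are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser

# Placeholder choices for illustration only; not the demo's actual mapping.
RENAME = {"h3": "h4", "p": "span"}
WRAP_TAG = "article"  # each of these gets an extra <div> around it


def keep_attr(name: str) -> bool:
    """Drop the attributes that CSS-selector scrapers usually key on."""
    return name not in ("class", "id", "itemprop") and not name.startswith("data-")


class Mutator(HTMLParser):
    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    @staticmethod
    def _attrs(attrs):
        return "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
            if keep_attr(k)
        )

    def handle_starttag(self, tag, attrs):
        if tag == WRAP_TAG:
            self.out.append("<div>")
        self.out.append(f"<{RENAME.get(tag, tag)}{self._attrs(attrs)}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{RENAME.get(tag, tag)}>")
        if tag == WRAP_TAG:
            self.out.append("</div>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{RENAME.get(tag, tag)}{self._attrs(attrs)} />")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def mutate(html: str) -> str:
    """Return the page with selector hooks removed and structure altered."""
    parser = Mutator()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<article class="product_pod"><h3 id="t">Title</h3><p class="price_color">£51.77</p></article>'
    print(mutate(page))
```

The visible text and links are left untouched, so anything that reads the content rather than the selectors still has everything it needs.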

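A traditional CSS-selector scraper returns 0 books after these changes, because every selector it depends on is gone. The LLM-based extractor still returns 20/20: it is asked for fields defined by a JSON schema (such as title and price), not for elements located by markup, so it works at the schema layer rather than the markup layer.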
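
To make "schema layer" concrete, here is a minimal sketch of the two pieces that never mention tags or classes: the prompt and the validation of the reply. The schema fields, build_prompt, and parse_books are assumptions for illustration; extractor.py's actual schema and model call are not shown here, and the call to the model itself is deliberately left out.

```python
import json

# Hypothetical schema: the fields extractor.py actually requests may differ.
BOOK_SCHEMA = {
    "type": "object",
    "properties": {
        "books": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "price": {"type": "string"},
                },
                "required": ["title", "price"],
            },
        }
    },
    "required": ["books"],
}


def build_prompt(html: str) -> str:
    """Describe the wanted fields only; no selector or tag name appears here."""
    return (
        "Extract every book on this page as JSON matching this schema:\n"
        + json.dumps(BOOK_SCHEMA)
        + "\n\nHTML:\n"
        + html
    )


def parse_books(raw: str) -> list[dict]:
    """Check the model's JSON reply for the required string fields."""
    books = json.loads(raw)["books"]
    required = BOOK_SCHEMA["properties"]["books"]["items"]["required"]
    for book in books:
        for field in required:
            if not isinstance(book.get(field), str):
                raise ValueError(f"missing or non-string field: {field}")
    return books


if __name__ == "__main__":
    # A canned reply standing in for the model's output.
    reply = '{"books": [{"title": "A Light in the Attic", "price": "£51.77"}]}'
    print(parse_books(reply))
```

Because the prompt and the validator depend only on field names, the same code applies unchanged to the original and the mutated page.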

Files

extractor.py — the extraction logic (LLM + JSON schema)
test_layout_change.py — acceptance test (mutates DOM + verifies parity)
fixtures/original.html — saved snapshot for replay
fixtures/mutated.html — DOM-scrambled version for the proof

Hire me to build this for your stack

Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.
info@luba.media
