Eyal Rosenthal · Web scraping at scale

Self-Healing AI Web Extractor

Self-Healing AI Extractor (Spec #1 demo)

Extracts structured book data (titles and prices) from HTML without relying on CSS selectors, so it survives full DOM scrambles that would break every selector.

Run

cd ~/freelance/portfolio_demos/self_healing_scraper
. ~/freelance/.venv/bin/activate
python3 test_layout_change.py    # full acceptance test (passes)
python3 extractor.py http://books.toscrape.com/   # one-shot extraction

Result (last run)

  • Original layout: 20/20 books extracted, 20/20 prices ✅
  • Mutated layout (every CSS selector broken): 20/20 books, 100% title overlap ✅
  • Cost: ~1860 tokens / page (~$0.0003 with gpt-4o-mini)
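
The per-page cost figure is simple arithmetic on token counts and per-token prices. A minimal sketch follows; the function name is hypothetical, and the rates in the usage note are assumptions that should be checked against the provider's current price list.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_usd_per_million: float,
    output_usd_per_million: float,
) -> float:
    """Estimate the cost of one LLM call from token counts and per-million rates."""
    return (
        input_tokens * input_usd_per_million
        + output_tokens * output_usd_per_million
    ) / 1_000_000


if __name__ == "__main__":
    # Assumption: all ~1860 tokens billed at an assumed input rate of $0.15 per million.
    print(f"{estimate_cost_usd(1860, 0, 0.15, 0.60):.6f}")
```

If all ~1860 tokens were billed at an assumed input rate of $0.15 per million, this comes to about $0.00028 per page, in line with the ~$0.0003 figure above. The real split between input and output tokens is not stated here, so treat this as a rough estimate.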

What the test mutates

  • All class and id attributes stripped
  • All data-* and itemprop attributes removed
  • Each element wrapped in an extra layer
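
The attribute-stripping mutations listed above could be reproduced roughly as follows. This is a standard-library sketch, not the code in test_layout_change.py: the class and function names are hypothetical, and it covers only the removal of class, id, data-* and itemprop attributes, not the wrapping step.

```python
from html import escape
from html.parser import HTMLParser

# Attribute names dropped outright; data-* is handled by prefix below.
DROPPED = {"class", "id", "itemprop"}


class AttributeStripper(HTMLParser):
    """Re-emits HTML with class, id, itemprop and data-* attributes removed."""

    def __init__(self):
        # Keep entity references as-is so text content round-trips unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _render(self, tag, attrs, closing=">"):
        kept = [
            (k, v) for k, v in attrs
            if k not in DROPPED and not k.startswith("data-")
        ]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        return f"<{tag}{rendered}{closing}"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_attributes(html: str) -> str:
    """Return html with selector-bearing attributes removed (comments are dropped)."""
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

After this pass, any selector that keys on a class, id, data-* or itemprop attribute has nothing left to match, while the visible text and links are unchanged.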

A traditional CSS-selector scraper returns 0 books after these changes, because every selector it depends on no longer matches. The LLM-based extractor still returns 20/20 because it is driven by the output schema (the fields to extract), not by the markup.
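
To illustrate what working at the schema layer can look like, here is a minimal sketch. It is not the contents of extractor.py: BOOK_SCHEMA, build_prompt, extract_books and the complete callable are all hypothetical names, and complete stands in for whatever LLM client is used (it takes a prompt string and returns the model's text reply).

```python
import json
from typing import Callable

# Hypothetical output schema: the extractor is told what to return,
# not where in the markup to find it.
BOOK_SCHEMA = {
    "type": "object",
    "properties": {
        "books": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "price": {"type": "string"},
                },
                "required": ["title", "price"],
            },
        }
    },
    "required": ["books"],
}


def build_prompt(page_text: str) -> str:
    """Build a prompt that asks for JSON matching BOOK_SCHEMA."""
    return (
        "Extract every book from the page below. "
        "Reply with JSON only, matching this JSON schema:\n"
        f"{json.dumps(BOOK_SCHEMA)}\n\nPAGE:\n{page_text}"
    )


def extract_books(page_text: str, complete: Callable[[str], str]) -> list[dict]:
    """Ask the model for books and keep only records with both fields present."""
    raw = complete(build_prompt(page_text))
    data = json.loads(raw)
    books = data.get("books", []) if isinstance(data, dict) else []
    return [
        b for b in books
        if isinstance(b, dict) and b.get("title") and b.get("price")
    ]
```

Because nothing in the prompt or the validation refers to tag names, classes or ids, the same call can be made against the original and the mutated page.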

Files

  • extractor.py — the extraction logic (LLM + JSON schema)
  • test_layout_change.py — acceptance test (mutates DOM + verifies parity)
  • fixtures/original.html — saved snapshot for replay
  • fixtures/mutated.html — DOM-scrambled version for the proof
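
The parity check in the acceptance test can be pictured as a title-overlap ratio between the two extraction runs. A minimal sketch, assuming titles are compared after whitespace and case normalization (the function names are hypothetical and the actual test may compare differently):

```python
def normalize(title: str) -> str:
    """Collapse whitespace and ignore case so cosmetic differences do not count."""
    return " ".join(title.split()).casefold()


def title_overlap(original: list[str], mutated: list[str]) -> float:
    """Fraction of titles from the original layout also found in the mutated layout."""
    expected = {normalize(t) for t in original}
    if not expected:
        return 0.0
    found = {normalize(t) for t in mutated}
    return len(expected & found) / len(expected)
```

A result of 1.0 corresponds to the 100% title overlap reported under Result.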

Hire me to build this for your stack

Same patterns, your target site. Send a brief and I'll quote a fixed price within 24 hours.

info@luba.media