Eyal Rosenthal · Web scraping at scale

Self-Healing AI Web Extractors: A Complete Implementation Guide

Every CSS-selector scraper dies on the next redesign. The div.product-price that worked yesterday is span.PriceTag__amount-2eHfC today. Then it's data-test="price-display". Then it's gone entirely because they moved to a JS-rendered Stripe widget.

You can spend 6 months chasing selectors. Or you can stop using them as the contract.

This is how I build extractors that survive site redesigns: the LLM reads the page, the JSON Schema defines what "product" means, and the markup becomes irrelevant. I've stress-tested this approach against full DOM scrambles — every class stripped, every tag swapped, every attribute removed — and the extractor still returns 100% of records with full title parity. A traditional CSS scraper drops to 0% on the same input.

The cost is roughly $0.0003 per page with gpt-4o-mini. For 10,000 pages a day that's about $3/day in LLM cost. Cheaper than the Octoparse plan you'd otherwise need to keep retraining.

The core idea in one paragraph

Stop telling the scraper where the data is. Tell it what the data is. The schema is the contract. The page is just text. The LLM reads the text and fills the schema.

# WRONG: selector-driven, brittle to markup changes
price = soup.select_one("div.product-price > span").text

# RIGHT: schema-driven, survives markup changes
schema = {"name": "string", "price": "number", "in_stock": "boolean"}
result = llm.extract(page_text, schema)

That's the whole shift. Everything else is implementation detail.

The minimal working extractor

Here's the smallest version that actually ships in production. Drop it in your project, swap the URL, and you have a working scraper.

import json
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

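SCHEMA = {
    "type": "object",
    "properties": {
        # Every field also admits null, so the model can follow the
        # "use null" instruction in the system prompt without breaking the schema.
        "title": {"type": ["string", "null"], "description": "Product name"},
        "price": {"type": ["number", "null"], "description": "Price in USD, no currency symbol"},
        "in_stock": {"type": ["boolean", "null"], "description": "Is the item currently purchasable"},
        "description": {"type": ["string", "null"]},
    },
    "required": ["title", "price"],
}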

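def extract(url: str, client: OpenAI) -> dict:
    # Fetch the page. The timeout and status check keep a hung request or an
    # error page from being sent to the model as if it were content.
    page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    page.raise_for_status()
    # Flatten to plain text and cap by character count, not by tag position.
    text = BeautifulSoup(page.text, "html.parser").get_text(" ", strip=True)[:8000]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Extract structured data per the schema. Use null for fields you cannot find."},
            {"role": "user",
             "content": f"Schema:\n{json.dumps(SCHEMA)}\n\nPage:\n{text}"},
        ],
        response_format={"type": "json_schema", "json_schema": {"name": "Product", "schema": SCHEMA}},
    )
    # The structured response arrives as a JSON string in message.content.
    return json.loads(resp.choices[0].message.content)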

About thirty lines. Ship it.
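Because the system prompt lets the model answer null for anything it cannot find, the returned dict is worth checking before it reaches storage. Here is a minimal sketch using only the standard library. The helper name validate_record and the JSON_TYPES table are my own illustration, not part of the extractor above, and it only covers the three scalar types the schema uses.

```python
# Hypothetical post-check for the dict returned by extract().
# Maps the JSON Schema type names used in SCHEMA to Python types.
JSON_TYPES = {"string": (str,), "number": (int, float), "boolean": (bool,)}


def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    # Required fields must be present and non-null.
    for field in schema.get("required", []):
        if record.get(field) is None:
            problems.append(f"missing required field: {field}")
    # Present, non-null fields must match their declared type.
    for field, spec in schema.get("properties", {}).items():
        value = record.get(field)
        if value is None:
            continue
        declared = spec.get("type", [])
        names = [declared] if isinstance(declared, str) else list(declared)
        allowed = tuple(t for name in names for t in JSON_TYPES.get(name, ()))
        if not allowed:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it for "number".
        if isinstance(value, bool) and bool not in allowed:
            problems.append(f"wrong type for {field}: {value!r}")
        elif not isinstance(value, allowed):
            problems.append(f"wrong type for {field}: {value!r}")
    return problems
```

Calling validate_record(extract(url, client), SCHEMA) and treating a non-empty list as a failed extraction is one way to keep half-filled records out of the output.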

Why this works (and CSS selectors don't)

CSS selectors are a brittle abstraction over a page's internal markup. The internal markup is an implementation detail of the site, owned by the site's frontend team. Their job is to refactor it constantly. Your CSS selectors are coupled to their refactoring schedule.

The schema is an abstraction over the content of the page. The content is owned by the business team — it changes when products change, not when CSS classes change. By moving the contract from "where" to "what," you decouple from the frontend churn.

This is the same pattern as moving from XPath to JSON APIs ten years ago. The lesson is the same: contracts should live at the layer that changes for business reasons, not at the layer that changes for refactor reasons.

Stress-testing your extractor

The discipline that makes this approach reliable is testing against scrambled markup. Without the test, you don't actually know if your extractor is selector-coupled.

The acceptance test I run on every self-healing extractor I ship:

  1. Fetch the live site, run the extractor, capture the result. This is the ground-truth set.
  2. Apply a destructive transformation to the HTML:
     • Strip every class attribute
     • Strip every id attribute
     • Replace every data-* attribute with data-x="x"
     • Swap every semantic tag for a generic container tag
     • Wrap every text node in three random wrapper elements
  3. Re-run the extractor on the mangled HTML.
  4. Diff the results. Title parity should be 100%. Numeric fields should be within 1% (LLMs occasionally re-format prices).
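The destructive transformation can be sketched with the standard library's html.parser. This is an illustration under my own assumptions, not the exact scrambler behind the numbers in this article: the replacement tag is div, the wrappers are span elements with random class names, void tags such as br keep their names, and script and style blocks pass through untouched. The names scramble_html and _Scrambler are hypothetical.

```python
import random
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
RAW_TEXT_TAGS = {"script", "style"}


class _Scrambler(HTMLParser):
    """Re-emits HTML with class/id stripped, tags swapped, and text wrapped."""

    def __init__(self, seed: int):
        super().__init__(convert_charrefs=True)
        self.rng = random.Random(seed)
        self.out = []
        self.raw_depth = 0  # > 0 while inside <script> or <style>

    def _clean_attrs(self, attrs) -> str:
        kept, saw_data = [], False
        for name, value in attrs:
            if name in ("class", "id"):
                continue  # strip every class and id attribute
            if name.startswith("data-"):
                saw_data = True  # collapse all data-* attributes into one
                continue
            kept.append(name if value is None else f'{name}="{escape(value)}"')
        if saw_data:
            kept.append('data-x="x"')
        return "".join(f" {attr}" for attr in kept)

    def handle_starttag(self, tag, attrs):
        if tag in RAW_TEXT_TAGS:
            self.raw_depth += 1
            self.out.append(f"<{tag}>")
        elif tag in VOID_TAGS:
            self.out.append(f"<{tag}{self._clean_attrs(attrs)}>")
        else:
            self.out.append(f"<div{self._clean_attrs(attrs)}>")  # swap the tag

    def handle_endtag(self, tag):
        if tag in RAW_TEXT_TAGS:
            self.raw_depth = max(0, self.raw_depth - 1)
            self.out.append(f"</{tag}>")
        elif tag not in VOID_TAGS:
            self.out.append("</div>")

    def handle_data(self, data):
        if self.raw_depth or not data.strip():
            self.out.append(data)  # leave scripts, styles, and whitespace alone
            return
        # Wrap the text node in three spans carrying random class names.
        wrappers = "".join(
            f'<span class="c{self.rng.getrandbits(24):06x}">' for _ in range(3)
        )
        self.out.append(wrappers + escape(data, quote=False) + "</span>" * 3)


def scramble_html(html: str, seed: int = 0) -> str:
    """Return a mangled copy of html; the same seed gives the same output."""
    parser = _Scrambler(seed)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Feeding scramble_html(html) into the same text-flattening step the extractor uses gives the mangled run. Comments and the doctype are dropped by this sketch, which is harmless for a text-based extractor.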

If the test fails, you didn't write a self-healing extractor. You wrote a CSS scraper with extra steps. Common mistakes:

  • Using BeautifulSoup to pre-extract specific elements before passing to the LLM (you re-coupled to selectors)
  • Truncating page text by tag-position instead of by character count (you re-coupled to structure)
  • Asking the LLM to "read the navigation" or "skip the footer" (those are markup concepts, not content concepts)

The fix is always the same: make the schema speak the business domain. "Price" is a content concept. "The price element" is a markup concept. Use the first.
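The diff step of the acceptance test can be automated so a failing run is unambiguous. Here is a minimal sketch, assuming both runs produce a list of record dicts in the same page order. The function name parity_report and its defaults are hypothetical; the 1% tolerance is the one stated in the acceptance test.

```python
def parity_report(baseline: list[dict], mangled: list[dict],
                  text_field: str = "title",
                  numeric_fields: tuple[str, ...] = ("price",),
                  tolerance: float = 0.01) -> dict:
    """Compare extractor output before and after scrambling, record by record."""
    if len(baseline) != len(mangled):
        raise ValueError("both runs must cover the same pages in the same order")
    title_hits = 0
    numeric_failures = []
    for index, (before, after) in enumerate(zip(baseline, mangled)):
        # Titles must match exactly.
        if before.get(text_field) == after.get(text_field):
            title_hits += 1
        # Numeric fields may drift by at most `tolerance` of the baseline value.
        for field in numeric_fields:
            a, b = before.get(field), after.get(field)
            if a is None and b is None:
                continue
            if a is None or b is None or abs(a - b) > tolerance * abs(a):
                numeric_failures.append((index, field, a, b))
    total = len(baseline)
    return {
        "title_parity": title_hits / total if total else 1.0,
        "numeric_failures": numeric_failures,
    }
```

A run passes when title_parity is 1.0 and numeric_failures is empty; anything else points at a record worth inspecting by hand.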

The vision-LLM fallback

Some pages are genuinely hostile to text-based extraction:

  • Heavy graphics with prices baked into image paths
  • React apps that render text inside Canvas elements
  • Sites that show prices as composed background-image numerals
  • Anything CAPTCHA-shaped

For these, route to a vision LLM as a confidence-driven fallback. The pattern:

def extract_with_fallback(url, client):
    text_result = extract(url, client)
    if confidence(text_result) < 0.7:
        screenshot = render_to_png(url)  # Playwright + page.screenshot()
        return extract_from_image(screenshot, client)
    return text_result
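The two helpers in that pattern are left undefined, so here is one way they might look. Treat it as a sketch under assumptions: it uses Playwright's sync API for the screenshot and the chat completions image_url content part for the vision call, the model choice is a placeholder, and extract_from_image takes the schema as an explicit third argument, so the call in the pattern would pass SCHEMA along. build_vision_messages is a hypothetical helper of mine.

```python
import base64
import json


def render_to_png(url: str) -> bytes:
    """Render the page in headless Chromium and return a full-page PNG."""
    # Imported lazily so the rest of the module loads without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.screenshot(full_page=True)
        finally:
            browser.close()


def build_vision_messages(png: bytes, schema: dict) -> list[dict]:
    """Build chat messages that carry the screenshot as a base64 data URL."""
    data_url = "data:image/png;base64," + base64.b64encode(png).decode("ascii")
    return [
        {"role": "system",
         "content": "Extract structured data per the schema. Use null for fields you cannot find."},
        {"role": "user",
         "content": [
             {"type": "text", "text": f"Schema:\n{json.dumps(schema)}"},
             {"type": "image_url", "image_url": {"url": data_url}},
         ]},
    ]


def extract_from_image(png: bytes, client, schema: dict, model: str = "gpt-4o-mini") -> dict:
    """Ask a vision-capable model to fill the schema from the screenshot."""
    resp = client.chat.completions.create(
        model=model,  # assumption: any vision-capable chat model works here
        messages=build_vision_messages(png, schema),
        response_format={"type": "json_schema",
                         "json_schema": {"name": "Product", "schema": schema}},
    )
    return json.loads(resp.choices[0].message.content)
```

Keeping the message construction in its own function makes the payload testable without a browser or an API key.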

Pre-2025 the vision LLM was 30x the cost of text. Today it's roughly 5x. For the ~5% of pages where text extraction is unreliable, vision fallback raises end-to-end accuracy from ~92% to ~99%. If only those pages take the 5x call, the LLM bill rises by about 25% (0.05 × 5). Math says do it.

Confidence scoring without an ML pipeline

You don't need a separate ML model to score extraction confidence. The LLM already knows when it's guessing. Two cheap signals:

1. Self-rated confidence. Add a _confidence field to your schema and ask the model to fill it on a 0.0-1.0 scale. The score is rough and uncalibrated, but in practice it correlates well enough with real accuracy to drive routing.

SCHEMA["properties"]["_confidence"] = {
    "type": "number",
    "description": "Your subjective confidence 0-1 that this extraction is correct."
}
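The confidence() call in the fallback pattern can then be a thin wrapper around that field. A minimal sketch follows; the function body and the choice to score a missing required field or a missing self-rating as 0.0 are my assumptions, not something the model or the API provides.

```python
def confidence(result: dict, required: tuple[str, ...] = ("title", "price")) -> float:
    """Score an extraction between 0.0 and 1.0 for the fallback check."""
    # A missing or null required field means the extraction failed outright.
    if any(result.get(field) is None for field in required):
        return 0.0
    score = result.get("_confidence")
    # Treat an absent or non-numeric self-rating as unknown, not as certain.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return 0.0
    # Clamp, since the model is not guaranteed to stay inside the 0-1 range.
    return max(0.0, min(1.0, float(score)))
```

With this in place, a result that lacks a price routes to the vision fallback regardless of how confident the model claims to be.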

2. Re-run consistency. Call the LLM twice, once with temperature=0 and once with temperature=0.7. If the two results disagree on any required field, route to vision fallback. Cost: 2x. Worth it on long-lived monitors.
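The re-run check reduces to comparing two dicts on their required fields. In the sketch below, run_extraction is a hypothetical callable that wraps the LLM call and accepts a temperature. Normalizing whitespace and case before comparing strings is my own assumption about what should count as agreement.

```python
def _normalize(value):
    """Collapse whitespace and case so cosmetic differences do not count."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold()
    return value


def disagreeing_fields(first: dict, second: dict,
                       required: tuple[str, ...] = ("title", "price")) -> list[str]:
    """List the required fields on which two extraction results differ."""
    return [
        field for field in required
        if _normalize(first.get(field)) != _normalize(second.get(field))
    ]


def consistent_extract(run_extraction, required: tuple[str, ...] = ("title", "price")):
    """Run the extraction at temperature 0 and 0.7 and report disagreements.

    run_extraction(temperature) -> dict is a hypothetical wrapper around
    the LLM call; it is not defined in this article.
    """
    cold = run_extraction(0.0)
    warm = run_extraction(0.7)
    return cold, disagreeing_fields(cold, warm, required)
```

An empty list means the two runs agree and the temperature-0 result can be used; a non-empty list is the signal to route the page to the vision fallback.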

Cost math at scale

The numbers I run in production with gpt-4o-mini (as of mid-2026):

  • ~$0.15 per million input tokens
  • ~$0.60 per million output tokens
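  • Typical product page: ~6,000 tokens of raw text, trimmed to roughly 2,000 tokens in by the 8,000-character cap in the extractor above, plus ~200 tokens of structured JSON out
  • Per-page cost: ~$0.0003 to $0.0004 at the capped size (an uncapped 6,000-token page would be closer to $0.001)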

For a daily monitor of 1,000 SKUs across 10 stores: $3/day, $90/month. The Bright Data Web Scraper IDE plan that does this would charge you $500/month minimum.
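The per-page figure depends on how many tokens actually reach the model, so it is worth recomputing from your own measurements. A small sketch of the arithmetic, with the gpt-4o-mini prices listed above as defaults; the function names are mine, and the token counts you pass in are whatever your pages produce.

```python
def per_page_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float = 0.15,
                  output_price_per_m: float = 0.60) -> float:
    """Dollar cost of one extraction call, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def monthly_cost(pages_per_day: int, cost_per_page: float, days: int = 30) -> float:
    """Dollar cost of a monitor that scrapes pages_per_day every day."""
    return pages_per_day * cost_per_page * days
```

For example, monthly_cost(10_000, 0.0003) reproduces the $90/month figure for 1,000 SKUs across 10 stores. Plugging your measured token counts into per_page_cost shows how sensitive that number is to the character cap.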

The catch: at very high volume (>100k pages/day) you should benchmark gpt-4o-mini against Claude Haiku and a local Ollama model. Which one is cheapest changes with every model release. Put the LLM call behind an interface so you can swap.
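One way to put the LLM call behind an interface is a typing.Protocol with a single method. The class and function names below (Extractor, OpenAIExtractor, CannedExtractor, run_monitor) are illustrative assumptions. Only the OpenAI adapter is sketched, and it mirrors the call used in the minimal extractor.

```python
import json
from typing import Protocol


class Extractor(Protocol):
    """Anything that can fill a schema from page text."""

    def extract(self, page_text: str, schema: dict) -> dict: ...


class OpenAIExtractor:
    """Adapter around an OpenAI client; the model name is a constructor argument."""

    def __init__(self, client, model: str = "gpt-4o-mini"):
        self.client = client
        self.model = model

    def extract(self, page_text: str, schema: dict) -> dict:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system",
                 "content": "Extract structured data per the schema. Use null for fields you cannot find."},
                {"role": "user",
                 "content": f"Schema:\n{json.dumps(schema)}\n\nPage:\n{page_text}"},
            ],
            response_format={"type": "json_schema",
                             "json_schema": {"name": "Product", "schema": schema}},
        )
        return json.loads(resp.choices[0].message.content)


class CannedExtractor:
    """Test double that returns a fixed record; handy for benchmarks and CI."""

    def __init__(self, record: dict):
        self.record = record

    def extract(self, page_text: str, schema: dict) -> dict:
        return dict(self.record)


def run_monitor(pages: dict[str, str], extractor: Extractor, schema: dict) -> dict[str, dict]:
    """Extract every page with whichever backend was passed in."""
    return {url: extractor.extract(text, schema) for url, text in pages.items()}
```

Swapping providers then means writing one more adapter class with the same extract method; run_monitor and everything downstream stay unchanged.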

When NOT to use this approach

I run CSS-selector scrapers in production too. The decision rule:

  • Use CSS scrapers when: site is yours or has a stable contract, you scrape <100 pages a day, the schema is one or two fields, the content is heavily structured (HTML tables, JSON-LD).
  • Use self-healing AI extractors when: you don't control the site, you need a long-lived monitor, the content is descriptive (product blurbs, news articles, profiles), or the site has a history of redesigns.

The default for any new monitor I build is self-healing. CSS is an optimization I add when volume justifies it.

If you want this built for your site, I quote fixed-price and ship in 7-10 days. Send the target site to info@luba.media.
