Eyal Rosenthal · Web scraping at scale

46 production scraping & pipeline demos.

All real, all runnable. Each one ships with a working extractor, a sample output, and the operational discipline that makes it production-grade.

Self-Healing AI Web Extractor — Survives DOM Changes That Break Every CSS Selector Demo · #01

Self-Healing AI Web Extractor

A web extractor that does not break when sites redesign. Pages are converted to text and passed to an LLM with a strict JSON schema; the schema (not the markup)

Real-Time Competitor Price Watch — Slack Alert in Under 60s on Any Price or Stock Change Demo · #02

Real-Time Competitor Price Watch

Catalog-monitoring pipeline that snapshots a competitor's product list on a schedule, diffs against the last run, and posts a structured Slack alert the moment
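
The diff step can be sketched roughly as below. This is a minimal illustration, not the demo's actual code: the snapshot shape (a dict keyed by a stable product ID, with `price` and `in_stock` fields) and the `diff_snapshots` name are assumptions, and treating new and missing products as changes is an extra assumption beyond the price and stock changes named in the title.

```python
from typing import Dict, List, TypedDict


class Product(TypedDict):
    # Hypothetical snapshot fields; a real catalog would carry more.
    price: float
    in_stock: bool


def diff_snapshots(previous: Dict[str, Product], current: Dict[str, Product]) -> List[dict]:
    """Compare two catalog snapshots keyed by a stable product ID."""
    changes: List[dict] = []
    for product_id, now in current.items():
        before = previous.get(product_id)
        if before is None:
            changes.append({"id": product_id, "type": "added"})
            continue
        if before["price"] != now["price"]:
            changes.append({"id": product_id, "type": "price",
                            "old": before["price"], "new": now["price"]})
        if before["in_stock"] != now["in_stock"]:
            changes.append({"id": product_id, "type": "stock",
                            "old": before["in_stock"], "new": now["in_stock"]})
    # Products present last run but missing now; sorted for stable output.
    for product_id in sorted(previous.keys() - current.keys()):
        changes.append({"id": product_id, "type": "removed"})
    return changes
```

An empty result means nothing changed and no alert needs to be posted.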

GitHub Trending Monitor — Daily Tech-Stack Intelligence with Viral-Repo Alerts Demo · #03

GitHub Trending Monitor

Daily monitor across GitHub's trending pages (Python / TypeScript / General). Alerts on new repos entering the trending list, star-count deltas, and language dr

Government Facility Monitor — Schema-Aware Wikitable Scraper with Diff Alerts Demo · #04

Government Facility Monitor

Drop-in monitor for any government / municipal / open-data wikitable listing. Extracts structured facility records (name, location, attributes), diffs against l

BigCommerce Store Monitor — Twice-Daily Inventory Crawl with Email Change Reports Demo · #05

BigCommerce Store Monitor

Production Python monitor that crawls a BigCommerce storefront's category pages on a schedule, detects inventory changes (new products, removed products, price

Hacker News Monitor — Multi-Feed Score & Comment Tracking with Viral-Story Alerts Demo · #06

Hacker News Monitor

Recurring monitor across Hacker News front page + newest + best feeds. Tracks every story, diffs score and comment_count between runs, fires structured alerts o

Hugging Face Trending Monitor — Daily Model Release Alerts via Hub API Demo · #07

Hugging Face Trending Monitor

Daily monitor across Hugging Face's trending models, datasets, and spaces via the public Hub API. Alerts on new entries, like surges, download spikes, and trend

arXiv Papers Monitor — Multi-Category Research Alerts via Atom API Demo · #08

arXiv Papers Monitor

Daily monitor across arXiv submission categories (cs.AI / cs.LG / cs.CL — easily extended) via the public arXiv Atom API. Alerts on new submissions, paper revis

RemoteOK Jobs Monitor — Tag-Filtered Hiring-Signal Alerts via Public JSON API Demo · #09

RemoteOK Jobs Monitor

Hourly job-board monitor across RemoteOK's public JSON feed, filtered by tag (python, javascript, ai — easily extended). Alerts on new postings, salary updates,

Shopify Storefront Monitor — Variant-Level Inventory + Price Alerts via /products.json Demo · #10

Shopify Storefront Monitor

Drop-in monitor for any public Shopify store via the universal /products.json endpoint every Shopify storefront exposes by default. Tracks per-variant price, co
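
As a rough sketch of per-variant tracking, the function below flattens an already-fetched `/products.json` payload into one row per variant. The field names (`products`, `variants`, `id`, `title`, `price`, `available`) are assumptions about the payload shape, and `flatten_variants` is a hypothetical helper, not the demo's code.

```python
from typing import Any, Dict, List


def flatten_variants(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a parsed /products.json payload into one row per variant."""
    rows: List[Dict[str, Any]] = []
    for product in payload.get("products", []):
        for variant in product.get("variants", []):
            rows.append({
                "product_id": product.get("id"),
                "product_title": product.get("title"),
                "variant_id": variant.get("id"),
                "variant_title": variant.get("title"),
                # Kept exactly as received; parse to Decimal downstream if needed.
                "price": variant.get("price"),
                "available": variant.get("available"),
            })
    return rows
```

Keying the rows by `variant_id` makes two runs directly comparable.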

Substack & Newsletter Publication Monitor — Multi-Feed RSS with Edit/Drift Alerts Demo · #11

Substack & Newsletter Publication Monitor

Generic RSS 2.0 monitor for Substack publications, Medium pubs, Ghost blogs, WordPress feeds — anything with public RSS. Tracks per-post link, title, author, ca

PDF Invoice Extractor — Batch Extract to CSV with 100% Acceptance-Test Coverage Demo · #12

PDF Invoice Extractor

Production batch extractor that ingests a directory of invoice PDFs and produces two structured CSVs (per-line-item + per-invoice summary). Two-pass strategy: p

Sitemap → JSON-LD Bulk Extractor — Universal Pattern for 'Scrape Every Recipe / Product / Article' Demo · #13

Sitemap → JSON-LD Bulk Extractor

Two-stage pipeline mapping to 'scrape every X on this site' brief class. Stage 1: pull sitemap.xml (handles sitemap-index nesting), filter URLs by pattern. Stag
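
Stage 1 might look something like the sketch below, which uses only the standard library. The `collect_urls` name, the injected `fetch` callable (any function that returns the XML text for a URL), and the `max_depth` guard are illustrative assumptions rather than the demo's real interface.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url: str, fetch: Callable[[str], str],
                 pattern: str, max_depth: int = 3) -> List[str]:
    """Return page URLs matching `pattern`, following sitemap-index nesting."""
    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        if max_depth <= 0:  # guard against runaway or cyclic nesting
            return []
        urls: List[str] = []
        for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS):
            if loc.text:
                urls.extend(collect_urls(loc.text.strip(), fetch, pattern, max_depth - 1))
        return urls
    matcher = re.compile(pattern)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text and matcher.search(loc.text)]
```

Passing `fetch` in as a parameter keeps the parsing logic testable without network access.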

Lead-Gen Contact Extractor — Batch Email/Phone/Social Harvest with QA-Tight Regex Demo · #14

Lead-Gen Contact Extractor

Take a list of company URLs → fetch homepage + auto-discovered contact/about/team/press/imprint pages → extract emails, phones, social handles (twitter, linkedi

Wikipedia Infobox Bulk Extractor — Per-Title CSV via MediaWiki Parse API Demo · #15

Wikipedia Infobox Bulk Extractor

Take a list of Wikipedia article titles → fetch via MediaWiki parse API → locate infobox table → flatten label/value rows to per-article CSV. Maps to 'extract t

OpenStreetMap POI Bulk Extractor — Bounding-Box Queries via Overpass API Demo · #16

OpenStreetMap POI Bulk Extractor

Pull every point-of-interest of a given OSM tag (cafes / pharmacies / EV chargers / schools / clinics — any amenity, shop, or leisure tag) within a bounding box

Paginated Catalog Scraper — Multi-Page Walk with Idempotent Resume + Progress State Demo · #17

Paginated Catalog Scraper

Walks every page of a paginated listing (search results, e-com catalogs, real-estate listings, classifieds). Different from single-page monitors — iterates N pa
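
One way to sketch the resume behaviour is to record the last completed page in a small state file and continue from the page after it. Everything named here (`walk_pages`, the `fetch_page` callable, the JSON state layout, the JSONL output) is a hypothetical illustration; the demo's own state format is not shown in the listing.

```python
import json
from pathlib import Path
from typing import Callable, List


def walk_pages(fetch_page: Callable[[int], List[dict]], state_path: Path,
               out_path: Path, max_pages: int = 1000) -> int:
    """Walk pages after the last completed one; return the last completed page."""
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"last_page": 0}
    page = state["last_page"] + 1
    while page <= max_pages:
        items = fetch_page(page)
        if not items:  # assumption: an empty page marks the end of the listing
            break
        with out_path.open("a", encoding="utf-8") as out:
            for item in items:
                out.write(json.dumps(item) + "\n")
        # Progress is saved only after the page's rows are on disk.
        state_path.write_text(json.dumps({"last_page": page}))
        page += 1
    return page - 1
```

A crash between the two writes would repeat one page on the next run, so a production version would likely also deduplicate rows by item ID.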

PyPI Releases Monitor — Global + Per-Project Release Tracking via RSS Demo · #18

PyPI Releases Monitor

Recurring monitor across PyPI's public RSS feeds — /rss/updates.xml (last 40 globally) + per-project /rss/project/<name>/releases.xml. Maps to dependency intel

GitHub Releases Tracker — Multi-Repo Release Watch via REST API Demo · #19

GitHub Releases Tracker

Multi-repo GitHub release monitor via public REST API. Token-friendly: anonymous 60/h auto-upgrades to authenticated 5K/h with $GITHUB_TOKEN. Maps to multi-repo
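
The token upgrade can be sketched as a header builder that adds credentials only when `$GITHUB_TOKEN` is set. The helper names `github_headers` and `releases_url` are hypothetical; the endpoint path and header values follow the public GitHub REST API as I understand it and should be checked against the current docs.

```python
import os
from typing import Dict, Mapping, Optional


def github_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build request headers; authenticate only if GITHUB_TOKEN is present."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        # Authenticated requests get the higher rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers


def releases_url(owner: str, repo: str, per_page: int = 30) -> str:
    """URL for a repository's releases list."""
    return f"https://api.github.com/repos/{owner}/{repo}/releases?per_page={per_page}"
```

The same code path then serves both anonymous and authenticated runs, with no separate configuration flag.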

YouTube Channel Monitor — Multi-Channel Video + View-Count Alerts via Public RSS Demo · #20

YouTube Channel Monitor

Track new uploads, view-count surges, rating changes across N YouTube channels via public RSS (/feeds/videos.xml?channel_id=UC...) — no API key, no quota cost.

Product Hunt Launches Monitor — Daily Startup Intel + Tagline Edit Detection Demo · #21

Product Hunt Launches Monitor

Track Product Hunt's daily launches via public Atom feed (/feed + /feed?category=tech + /feed?category=ai). Maps to launch tracking / startup intel / SaaS compet

CVE / NVD Security Monitor — Track Newly Published Vulnerabilities + CVSS Re-Score Signals Demo · #22

CVE / NVD Security Monitor

Track newly published CVEs via the NVD v2 API. Alerts on CVSS re-scoring (analysts revising severity), status transitions (Awaiting Analysis → Analyzed → Modifi

Stack Exchange Q&A Monitor — Tag-Filtered New-Question + Score-Surge Alerts Demo · #23

Stack Exchange Q&A Monitor

Track new Stack Overflow / Stack Exchange questions by tag (170+ SE sites supported). Diff alerts on new questions, score deltas, view-count surges, is_answered

Wayback Machine History Extractor — Per-Date Title / H1 / Meta Across Years Demo · #24

Wayback Machine History Extractor

Extract historical snapshots of any URL via Wayback CDX API. Maps to 'what did this page look like in year X?', 'track competitor messaging over time', 'audit h

SEC EDGAR Bulk Extractor — Tickers → CIK → 10-K Filings + XBRL Financials in One CSV Demo · #25

SEC EDGAR Bulk Extractor

Built specifically against active Upwork brief ~022050416 (SEC EDGAR Extraction, US, fixed-price, 5-10 proposals, $700+ verified client, posted 2026-05-02) — no

CoinGecko Market Monitor — Crypto Price + Market-Cap Rank Diff Alerts Demo · #26

CoinGecko Market Monitor

Track top-N coins by market cap via CoinGecko's free public API. Alerts on price changes, rank shifts (coin enters/exits top-N), 24h % moves, ATH-distance. Acce

NPM Registry Monitor — Per-Package Version + Deprecation Diff Alerts Demo · #27

NPM Registry Monitor

Per-package npm release monitor via public registry. Parallel to PyPI demo for JS ecosystem. Tracks latest 20 versions per package + dist-tags. Alerts on new ve

Y Combinator Companies Bulk Extractor — API-Driven · Batch + Status Filters · Resume Demo · #28

Y Combinator Companies Bulk Extractor

Bulk-extract every YC company via public api.ycombinator.com endpoint. Real recurring Upwork brief class — VCs, sales-intel, recruitment, outbound platforms pos

DEV.to Articles Monitor — Per-Tag Dev-Community Watch with Edit Detection Demo · #29

DEV.to Articles Monitor

Track new articles + edits across DEV.to per-tag feeds. Maps to dev community / content discovery / dev-tool brand monitoring briefs. Complements Substack #11 (

GitHub Issues Multi-Repo Extractor — Issue + PR State, Comment, Label Diff Alerts Demo · #30

GitHub Issues Multi-Repo Extractor

Track issues + PRs across N GitHub repos via public REST API. Diff alerts on state transitions (open → closed), comment surges, label updates, last-update drift

Steam Catalog Bulk Extractor — Game Metadata + Pricing + Genres → CSV Demo · #31

Steam Catalog Bulk Extractor

Bulk extract Steam game metadata via public Store API. Real recurring Upwork brief class — gaming media, indie analytics, market research, recommender-system da

Open Library Books Extractor — Bulk Book Metadata via Internet Archive's Open DB Demo · #32

Open Library Books Extractor

Bulk-extract book metadata from Open Library (IA's open catalog) — search.json + optional per-work enrichment. Maps to book-recommendation, library catalog, use

HN Algolia Search Monitor — Full-History HN Brand & Topic Watch Demo · #33

HN Algolia Search Monitor

Watch entire HN history via Algolia search API for brand mentions, topic surges, old-thread re-discovery. Different from #06 (live feeds) — this is full-history

App Store Metadata Bulk Extractor — iOS Apps via iTunes Search + Lookup API Demo · #34

App Store Metadata Bulk Extractor

Bulk-extract iOS app metadata via public iTunes Search + Lookup API. Real recurring Upwork brief class — mobile-app analytics, ASO consultancies, recommender da

WHOIS / RDAP Bulk Domain Lookup — Registrar, Expiry, Nameservers, DNSSEC Demo · #35

WHOIS / RDAP Bulk Domain Lookup

Bulk WHOIS-style lookup via RDAP (modern HTTPS+JSON WHOIS replacement). Maps to domain investor expiry tracking, infosec DNSSEC + NS audits, brand-protection ty
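
The four fields named in the title can be pulled from an RDAP domain response along these lines. This is a sketch over an already-fetched JSON object: the member names (`events`, `entities`, `vcardArray`, `nameservers`, `secureDNS`) follow the RDAP JSON format in RFC 9083 as I read it, while `summarize_rdap` and the output keys are hypothetical.

```python
from typing import Any, Dict, Optional


def summarize_rdap(record: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a parsed RDAP domain object to registrar, expiry, NS, DNSSEC."""
    expiry: Optional[str] = None
    for event in record.get("events", []):
        if event.get("eventAction") == "expiration":
            expiry = event.get("eventDate")

    registrar: Optional[str] = None
    for entity in record.get("entities", []):
        if "registrar" in entity.get("roles", []):
            # vcardArray is ["vcard", [[name, params, type, value], ...]]
            vcard = entity.get("vcardArray") or ["vcard", []]
            for prop in vcard[1]:
                if len(prop) >= 4 and prop[0] == "fn":
                    registrar = prop[3]

    nameservers = sorted(
        ns["ldhName"].lower() for ns in record.get("nameservers", []) if ns.get("ldhName")
    )
    dnssec = bool(record.get("secureDNS", {}).get("delegationSigned", False))
    return {
        "domain": record.get("ldhName"),
        "registrar": registrar,
        "expiry": expiry,
        "nameservers": nameservers,
        "dnssec_signed": dnssec,
    }
```

Fields missing from a registry's response come back as `None` or empty rather than raising, which matters in bulk runs where registries differ in how much they return.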

Crates.io Rust Package Monitor — Per-Crate Yank + Download Diff Alerts Demo · #36

Crates.io Rust Package Monitor

Per-crate Rust release monitor via public crates.io API. Parallel to PyPI #18 + npm #27 — completes package-registry trio across Python/JS/Rust. Tracks latest 2

CKAN Open-Data Extractor — One Scraper for 200+ Government Portals Demo · #37

CKAN Open-Data Extractor

Bulk-extract dataset metadata from any CKAN-based open-data portal (data.gov.uk, data.gov.au, NYC OpenData, EU Open Data, 200+ municipal portals globally). All

Open-Meteo Weather Extractor — Multi-City Forecast Data via Free Public API Demo · #38

Open-Meteo Weather Extractor

Pull current + 7-day weather forecast for any list of cities via Open-Meteo (free, no key, no quota — DWD ICON / NOAA GFS / ECMWF IFS aggregated). Maps to weath

PubMed Research Paper Extractor — NCBI E-utilities → CSV Demo · #39

PubMed Research Paper Extractor

Bulk-extract medical research papers via NCBI E-utilities (PubMed API). Maps to PubMed papers on topic/drug/disease briefs — medical research, pharma competitiv

GitHub Org / User Bulk Repo Extractor — Stars / Lang / License / Topics CSV Demo · #40

GitHub Org / User Bulk Repo Extractor

Bulk-extract every public repo from N orgs/users via GitHub REST API. Real recurring Upwork brief — dev-tool sales targeting, recruitment tech-stack profiling,

Mastodon Hashtag Monitor — Multi-Instance Fediverse Brand Mention Watch Demo · #41

Mastodon Hashtag Monitor

Track Mastodon public hashtag timelines across multiple instances. Maps to social listening / brand mention briefs — practical X/Twitter alternative now that X

Crossref DOI Bulk Extractor — Academic Publication Metadata + Citation Counts Demo · #42

Crossref DOI Bulk Extractor

Extract academic publication metadata via Crossref (official DOI registry covering ~150M+ scholarly works). Maps to academic metadata / citation graph / journal

WordPress Plugin Directory Bulk Extractor — Downloads / Ratings / Installs Demo · #43

WordPress Plugin Directory Bulk Extractor

Bulk-extract WP plugin metadata via WP.org plugin info API. Real recurring Upwork brief — WP agencies (competitive intel), SEO-tooling startups, plugin devs (TA

Nominatim Bulk Geocoder — Address → Lat/Lon + Structured Address Demo · #44

Nominatim Bulk Geocoder

Bulk-geocode addresses (string → lat/lon + structured) via OSM Nominatim. Maps to geocode N addresses briefs — real estate, logistics, delivery routing, retail

Dental IT MSP Lead Finder — Bulk-Extract + Tier-Score B2B Leads from Public Sources Demo · #45

Dental IT MSP Lead Finder

Bulk lead-list builder for vertical-niche B2B prospecting. Sweeps DuckDuckGo + Bing organic results across 20 metros × 4 query variants, fetches each candidate

Shopify Bookstore Lead Finder — Platform-Verified Leads via /products.json Demo · #46

Shopify Bookstore Lead Finder

Verifier-driven lead-finder for B2B vertical-niche prospecting. Two-stage pipeline: candidate sourcing from curated seed lists + Bing search → platform verifica