Eyal Rosenthal · Web scraping at scale

Paginated Catalog Scraper

Paginated Catalog Scraper — Multi-Page Walk with Idempotent Resume + Progress State

Paginated Catalog Scraper

Scrape every page of a paginated listing — search results, e-commerce catalogs, real-estate listings, court records, any multi-page index. Maps to "scrape every result from this paginated search" briefs.

Built 2026-05-03 as Demo #17. Different from monitor scrapers (#18-#26 walk page 1 only); this one walks N pages with resume support so an interrupted run picks up exactly where it left off.

Run

. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/paginated_catalog_scraper

python scrape.py --reset --max-pages 3   # crawl first 3 pages
python scrape.py --max-pages 5           # resume — crawls pages 4-8
python scrape.py                         # resume + run to completion

Result (books.toscrape.com — 50 pages × 20 books)

  • 50 pages crawled ✅
  • 1,000 unique books captured ✅
  • Resume verified: stop at page 3, restart, picks up at page 4 with no duplicates ✅
  • Idempotent: re-running a completed catalog adds 0 new rows ✅
  • 0.4s politeness sleep between pages ✅
  • Tenacity retries with exponential backoff (transient HTTP errors don't lose progress) ✅

Why resume support matters

For a 50-page catalog this scraper completes in ~30 seconds. For a real-estate site with 600 pages × 30 listings, it might take 10 minutes — and if the network blips at page 423, the naive scraper starts over. With .progress.json snapshotted after every page, a re-run picks up at page 424 instead of paying for 423 wasted requests.

Adapting to a new site

Two functions to edit:

def parse_page(html: str, page_url: str) -> tuple[list[dict], str | None]:
    # 1. Extract items from the current page
    # 2. Detect the next-page URL (return None when there's no next page)

Common next-page detection patterns:

  • Selector (most sites): li.next a, a[rel="next"], .pagination .next
  • Constructed URL: ?page=N+1 until empty results
  • Cursor: ?after= from the prev response

Hire me to build this for your stack

Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.

info@luba.media