Paginated Catalog Scraper
Paginated Catalog Scraper
Scrape every page of a paginated listing — search results, e-commerce catalogs, real-estate listings, court records, any multi-page index. Maps to "scrape every result from this paginated search" briefs.
Built 2026-05-03 as Demo #17. Different from monitor scrapers (#18-#26 walk page 1 only); this one walks N pages with resume support so an interrupted run picks up exactly where it left off.
Run
. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/paginated_catalog_scraper
python scrape.py --reset --max-pages 3 # crawl first 3 pages
python scrape.py --max-pages 5 # resume — crawls pages 4-8
python scrape.py # resume + run to completion
Result (books.toscrape.com — 50 pages × 20 books)
- 50 pages crawled ✅
- 1,000 unique books captured ✅
- Resume verified: stop at page 3, restart, picks up at page 4 with no duplicates ✅
- Idempotent: re-running a completed catalog adds 0 new rows ✅
- 0.4s politeness sleep between pages ✅
- Tenacity retries with exponential backoff (transient HTTP errors don't lose progress) ✅
Why resume support matters
For a 50-page catalog this scraper completes in ~30 seconds. For a real-estate site with 600 pages × 30 listings, it might take 10 minutes — and if the network blips at page 423, the naive scraper starts over. With .progress.json snapshotted after every page, a re-run picks up at page 424 instead of paying for 423 wasted requests.
Adapting to a new site
Two functions to edit:
def parse_page(html: str, page_url: str) -> tuple[list[dict], str | None]:
# 1. Extract items from the current page
# 2. Detect the next-page URL (return None when there's no next page)
Common next-page detection patterns:
- Selector (most sites):
li.next a,a[rel="next"],.pagination .next - Constructed URL:
?page=N+1until empty results - Cursor:
?after=from the prev response
Hire me to build this for your stack
Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.
info@luba.media