Eyal Rosenthal · Web scraping at scale

100 Production Web Scrapers, One Repo: The Patterns That Repeat

I shipped 100 working web-scraping demos over the past six months. They're all in one public repo. Every one of them runs end-to-end, with audit logs, schema validation, and the operational discipline I outlined in the pipeline-as-product tutorial.

After 100 of them, the patterns are obvious. There aren't that many. Most freelancing briefs are variations on six core patterns. Once you have one of each, you can quote any new brief in 30 minutes.

This tutorial is the taxonomy. Six patterns, real examples for each, when to use which.

Pattern 1: Bulk one-shot extraction

A client has a list of URLs (or a search query that produces a list). They want a CSV with the structured data extracted from each. One-time job, no recurring runs.

This is 60% of Upwork's web-scraping job board. Low budget, high volume, race-to-bottom pricing — unless you can deliver in <24h with a working sample attached. Then you skip the pricing competition entirely.

Examples in the repo:

  • wikipedia_infobox_extractor — bulk Wikipedia infoboxes by article-title list
  • pdf_invoice_extractor — bulk PDF → CSV with arithmetic validation
  • osm_poi_extractor — bulk POIs by bounding-box from OpenStreetMap
  • dental_msp_lead_finder — bulk dental-IT lead list across 20 metros

The bulk-extract template is ~200 lines of Python. It handles: input list iteration, retry logic, schema validation (Pydantic), output CSV with a source_url column on every row, idempotent resume via a done.txt archive.
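A minimal sketch of that loop, to make the resume behaviour concrete. This is not the repo's template: the names (Row, with_retry, run_bulk) and the injected fetch callable are hypothetical, and a stdlib dataclass stands in for the Pydantic model so the example has no dependencies.

```python
import csv
import time
from dataclasses import asdict, dataclass, fields
from pathlib import Path
from typing import Callable, Iterable


@dataclass
class Row:
    """Output schema (illustrative); every row carries the URL it came from."""
    source_url: str
    title: str

    def __post_init__(self) -> None:
        # Stand-in for Pydantic validation: reject rows with an empty title.
        if not self.title.strip():
            raise ValueError(f"empty title for {self.source_url}")


def with_retry(fn: Callable[[str], dict], url: str,
               attempts: int = 3, delay: float = 0.0) -> dict:
    """Call fn(url); on any exception retry, sleeping delay * attempt between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fn(url)
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)
    raise ValueError("attempts must be >= 1")


def run_bulk(urls: Iterable[str], fetch: Callable[[str], dict],
             out_csv: Path, done_file: Path) -> int:
    """Extract each URL once. URLs already listed in done_file are skipped."""
    done = set(done_file.read_text().splitlines()) if done_file.exists() else set()
    write_header = not out_csv.exists()
    written = 0
    with out_csv.open("a", newline="") as out, done_file.open("a") as done_log:
        writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(Row)])
        if write_header:
            writer.writeheader()
        for url in urls:
            if url in done:
                continue
            try:
                row = Row(source_url=url, **with_retry(fetch, url))
            except Exception as exc:
                # Failed URLs stay out of done.txt, so the next run retries them.
                print(f"skip {url}: {exc}")
                continue
            writer.writerow(asdict(row))
            out.flush()
            done_log.write(url + "\n")
            done_log.flush()
            done.add(url)
            written += 1
    return written
```

Writing the URL to done.txt only after its row is flushed means a crash mid-run costs at most one repeated fetch, never a missing row.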

Pattern 2: Recurring monitor + diff

A client wants daily/hourly tracking of a target. The output is changes: new items, removed items, price changes, stock changes. Slack alert when something changes; silent otherwise.

This is the retainer pattern. $300-1,000/month per pipeline, infinite renewal once it's running.

Examples in the repo:

  • competitor_watch — Slack alert in <60s on any catalog price/stock change
  • bigcommerce_monitor — twice-daily inventory crawl with email change reports
  • github_releases_tracker — multi-repo release-watch via REST API
  • cve_nvd_monitor — newly-published vulnerability + CVSS re-score signals

The diff template is ~300 lines. The expensive part isn't the scraper — it's the diff logic. Subtle: if the source paginates, "removed item" can mean "moved to page 3" rather than "actually removed." Solving that distinction is what makes monitors usable vs. noisy.
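One way to handle the pagination problem, sketched under assumptions: every item has a stable "id", the crawler can tell whether it walked every page, and the watched fields are "price" and "in_stock". All names here are hypothetical. The idea is to key the snapshot by item ID across all pages and to report removals only when the crawl was complete.

```python
from dataclasses import dataclass, field


@dataclass
class Diff:
    added: list = field(default_factory=list)
    removed: list = field(default_factory=list)
    changed: list = field(default_factory=list)  # (item_id, field, old, new)

    def is_empty(self) -> bool:
        # An empty diff means no alert: the monitor stays silent.
        return not (self.added or self.removed or self.changed)


def merge_pages(pages):
    """Flatten all pages into {item_id: item}; page position no longer matters."""
    return {item["id"]: item for page in pages for item in page}


def diff_snapshots(prev, curr, crawl_complete, watch=("price", "in_stock")):
    """Compare two {item_id: item} snapshots from consecutive runs."""
    diff = Diff()
    for item_id, item in curr.items():
        if item_id not in prev:
            diff.added.append(item_id)
            continue
        for name in watch:
            old, new = prev[item_id].get(name), item.get(name)
            if old != new:
                diff.changed.append((item_id, name, old, new))
    # A missing item only counts as removed if every page was crawled.
    if crawl_complete:
        diff.removed = [i for i in prev if i not in curr]
    return diff
```

An item that moved from page 1 to page 3 lands in the same merged map, so it produces no diff entry. A crawl that died on page 2 reports changes and additions but no removals.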

Pattern 3: Multi-source aggregation

A client wants one dataset assembled from N sources. Examples: financial data from EDGAR + market-cap from CoinGecko, jobs from RemoteOK + GitHub Jobs + LinkedIn, papers from arXiv + Semantic Scholar + OpenAlex.

This is the highest-value pattern when it's done right. The aggregation layer is where domain expertise lives, and clients pay 3-5x for that vs. point scrapers.

Examples in the repo:

  • ai_pulse_watchtower (the capstone) — parallel fan-out across HN + arXiv + HuggingFace + GitHub + Papers-With-Code, normalized schema
  • sec_edgar_extractor + coingecko_market_monitor (combinable) — financial vertical
  • crossref_doi_extractor + pubmed_research_extractor + orcid_researcher_extractor — academic stack

The pattern: each source gets its own thin extractor. A separate orchestration layer fans out in parallel, applies a normalized schema, and isolates per-source failures (one source going down doesn't kill the others). Clean abstraction, low coupling.
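A sketch of what that orchestration layer could look like using the standard library's thread pool. The function names and the three-field normalized schema (source, title, url) are assumptions for illustration, not the capstone's actual code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def normalize(source, raw):
    """Map one raw record onto the shared schema (illustrative fields)."""
    return {"source": source, "title": raw["title"].strip(), "url": raw["url"]}


def fan_out(extractors, max_workers=8):
    """Run every extractor in parallel.

    extractors: {source_name: zero-argument callable returning a list of raw dicts}
    Returns (records, errors); a failing source lands in errors and is skipped.
    """
    records, errors = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in extractors.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                # Build the whole batch first so a bad record cannot leave
                # a half-normalized source in the output.
                batch = [normalize(name, raw) for raw in fut.result()]
            except Exception as exc:
                errors[name] = repr(exc)
                continue
            records.extend(batch)
    # Completion order is nondeterministic; sort for stable output.
    records.sort(key=lambda r: (r["source"], r["url"]))
    return records, errors
```

Returning the errors alongside the records lets the caller decide whether a partial dataset is acceptable for this run or worth an alert.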

Pattern 4: Lead-list builder

A client wants a vetted list of N companies/people matching specific criteria. Output: CSV with name, website, contact, location, scoring tier.

This is its own pattern (different from bulk-extract) because the input is a query, not a URL list. You drive a search engine + filter the results.

Examples in the repo:

  • leadgen_contact_extractor — batch email/phone/social harvest with QA-tight regex
  • dental_msp_lead_finder — see the demo, built against a real Upwork brief

The template handles: query templating across geos, search-engine traversal (DuckDuckGo + Bing fallback for resilience), candidate-domain dedup, dual-keyword filtering (high precision), tiered scoring by contact-data completeness.
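The last three steps are easy to show in isolation. In this sketch, "dual-keyword" is read as "must match at least one term from each of two lists", and the tiers count how many of three contact fields are filled. Both readings, the field names, and the function names are my assumptions for illustration. Search-engine traversal is left out.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Lower-cased host without a leading 'www.'; accepts URLs with no scheme."""
    host = urlparse(url if "://" in url else "//" + url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def passes_dual_keyword(text, primary, secondary):
    """Keep a candidate only if it matches a term from BOTH keyword lists."""
    t = text.lower()
    return any(k in t for k in primary) and any(k in t for k in secondary)


def tier(lead):
    """A = all three contact fields present, down to D = none (assumed scheme)."""
    filled = sum(bool(lead.get(k)) for k in ("email", "phone", "contact_name"))
    return {3: "A", 2: "B", 1: "C"}.get(filled, "D")


def build_leads(candidates, primary, secondary):
    """Filter, dedup by domain, and score. candidates: dicts with 'url' and 'text'."""
    seen, leads = set(), []
    for cand in candidates:
        dom = domain_of(cand["url"])
        if not dom or dom in seen:
            continue
        if not passes_dual_keyword(cand.get("text", ""), primary, secondary):
            continue
        seen.add(dom)
        leads.append({**cand, "domain": dom, "tier": tier(cand)})
    return sorted(leads, key=lambda lead: lead["tier"])
```

Requiring a hit from both lists trades recall for precision: a bakery page that mentions "IT support" once is dropped, and so is a dental clinic that never mentions IT.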

Pattern 5: Filing / document extraction (vertical depth)

A client wants structured data from semi-structured documents: SEC filings, court records, government PDFs, academic papers, regulatory disclosures. Vertical-specific. Premium pricing.

Examples in the repo:

  • sec_edgar_extractor — tickers → CIK → 10-K + XBRL multi-candidate field resolution
  • pdf_invoice_extractor — batch PDF → CSV with arithmetic validation
  • gov_facility_monitor — schema-aware Wikitable scraper with diff alerts

The pattern: domain knowledge matters more than scraping skill. Extracting "Revenue" from a 10-K is non-trivial because there are 12 different XBRL tags that mean "revenue" depending on filer (us-gaap:Revenues, us-gaap:SalesRevenueNet, us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax, etc.). The solution is a multi-candidate field-resolution engine, which is how my SEC EDGAR demo works.
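The core of a multi-candidate resolver fits in a few lines. This is a simplified sketch, not the demo's engine: it assumes the filing's facts have already been parsed into a {tag: value} dict, and the priority order of the three tags below is an illustrative choice.

```python
# Candidate tags for "revenue", tried in order. The order is an assumption.
REVENUE_CANDIDATES = [
    "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax",
    "us-gaap:Revenues",
    "us-gaap:SalesRevenueNet",
]


def resolve_field(facts, candidates):
    """Return (tag, value) for the first candidate tag with a value, else (None, None)."""
    for tag in candidates:
        value = facts.get(tag)
        if value is not None:
            return tag, value
    return None, None


# Usage: two filers reporting revenue under different tags.
filer_a = {"us-gaap:Revenues": 1_250_000}
filer_b = {"us-gaap:SalesRevenueNet": 480_000, "us-gaap:Assets": 9_000_000}
print(resolve_field(filer_a, REVENUE_CANDIDATES))
print(resolve_field(filer_b, REVENUE_CANDIDATES))
```

Returning the winning tag alongside the value keeps the output auditable: the client can see which XBRL concept each number came from.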

This is the highest-paying scraping niche on Upwork. $80-150/hr, low competition, premium clients (hedge funds, accounting firms, regulatory consultants).

Pattern 6: Pipeline-as-product

A client wants the entire scraping infrastructure as a turnkey product, not a one-off CSV. They want: dashboard, alerts, BI integration, recurring updates, observability.

This is where pricing crosses from project-based ($500-2,000) to retainer-based ($300-1,500/month). Same scraper code as Pattern 2, plus deployment + dashboard + integration.

Examples in the repo:

  • The recurring-monitor Gig template — turnkey deploy of any monitor onto a $5/mo VPS with cron + Slack + BI export

The decision point: does the client want a result (CSV) or a capability (the data exists in their stack forever)? If the latter, you're doing Pattern 6 and pricing accordingly.

Other patterns I didn't include

A few smaller patterns that show up occasionally:

  • Auth-required scraping — scraping behind a login. Spec-wise it's Pattern 1 with cookies. Examples: wayback-history-extractor.
  • Forum / discussion scraping — Reddit-shaped sources with threaded replies. Examples: hn-algolia-search, stackexchange-questions-monitor.
  • Geographic / mapping — OSM, Wikidata, civic data. Examples: osm-poi-extractor, nominatim-geocoder.
  • Real-time ticker / market data — different infra (websocket-shaped). Less common on Upwork; more common in direct-hire fintech work.

Each of these is Pattern 1, 2, or 3 with vertical specifics. Once you have the six core patterns, the verticals are interpolation.

Why six patterns matter for pricing

Most freelancers quote scraping by the hour or by the line of code. That's a trap. The pattern itself anchors the price:

Pattern                                     | Typical fixed-price  | Typical retainer
1: Bulk one-shot                            | $50-300              | $0
2: Recurring monitor                        | $300-1,500 setup     | $200-1,000/mo
3: Multi-source aggregation                 | $1,500-5,000 setup   | $500-2,000/mo
4: Lead-list builder                        | $100-500             | $0 (or "list refresh" $50/mo)
5: Filing / document extraction (vertical)  | $500-3,000 setup     | $300-1,500/mo
6: Pipeline-as-product                      | $1,000-2,500 setup   | $300-1,500/mo

Same code in many cases. The price difference is the framing. A "bulk one-shot" priced as a "pipeline-as-product" is a 5-10x revenue swing on the same delivery — if you wrap it in the operational discipline that justifies the framing.

Reading the brief

When a new Upwork brief lands, the first thing I do is classify it into one of the six patterns. Decision tree:

  1. Does the client want this to run more than once? If no → Pattern 1.
  2. Is the input a query (not a URL list)? If yes → Pattern 4.
  3. Are there multiple sources to combine? If yes → Pattern 3.
  4. Does the input include filings / PDFs / regulatory documents? If yes → Pattern 5.
  5. Does the client mention dashboards / Slack / "always-on" / "live"? If yes → Pattern 6.
  6. Otherwise → Pattern 2 (recurring monitor).
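The tree above translates directly into a function. The Brief fields are hypothetical flags you would fill in by hand after reading the brief; the order of the checks mirrors the six steps exactly.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    """Hand-filled answers to the questions in the decision tree (illustrative)."""
    recurring: bool               # step 1: should it run more than once?
    query_input: bool = False     # step 2: input is a query, not a URL list
    multi_source: bool = False    # step 3: several sources to combine
    documents: bool = False       # step 4: filings / PDFs / regulatory documents
    always_on: bool = False       # step 5: dashboards / Slack / "live"


def classify(brief: Brief) -> int:
    """Return the pattern number (1-6), checking conditions in tree order."""
    if not brief.recurring:
        return 1
    if brief.query_input:
        return 4
    if brief.multi_source:
        return 3
    if brief.documents:
        return 5
    if brief.always_on:
        return 6
    return 2  # default: recurring monitor


# Usage: a weekly price tracker with no other signals is a plain monitor.
print(classify(Brief(recurring=True)))
```

Because the checks run in order, a recurring brief that is both query-driven and multi-source comes back as Pattern 4; the second pattern in a combined brief still has to be spotted by hand.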

90% of briefs classify cleanly. The other 10% are usually two patterns combined (e.g., a Pattern 4 lead-builder with a Pattern 2 weekly refresh = the start of a retainer).

If you have a brief that doesn't classify cleanly, send it to info@luba.media. I'll quote in 24h.

Hire me to build this for your site

I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.
