Eyal Rosenthal · Web scraping at scale

Why a $5/mo VPS Beats a $1,200/mo ScrapingBee Plan

Most freelancers deliver scrapers as scripts. The client gets a .py file, runs it once, gets a CSV. A month later something breaks, the scraper goes stale, and the client calls back asking for help (or worse, a refund).

The $300-1,000/month retainer business is built on a different artifact: a pipeline. Persistent state, idempotent runs, audit logs, retry logic, alerts on drift. Same code as the script, plus the operational discipline that makes it run unattended.

This is the stack I run in production for my own data business and for client retainers. Total cost: ~$15/month per pipeline. The closest equivalent commercial setup (ScrapingBee + their scheduler) starts at $99/month and scales to $1,200/month at the volume we're talking about.

The stack at a glance

Layer | What I use | Cost
Compute | Hetzner CX22 (2 vCPU, 4 GB RAM) or DigitalOcean Basic | $5/mo
Anti-bot | curl_cffi (free) + Webshare residential rotation | $3-15/mo
Scheduling | cron (free) or GitHub Actions ($0 on free tier) | $0
State | SQLite file with WAL mode, or Postgres on a Supabase free tier | $0
Alerts | Slack webhook | $0
Observability | JSONL audit log + a tail -f SSH habit | $0
Total | | $8-20/mo

A 100k-page-per-day pipeline runs comfortably on this. I've operated nine of them in parallel on a single $5 box.
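
As a rough sanity check on that claim, the sustained request rate involved is small. The sketch below only does the arithmetic. It assumes fetches are spread evenly over 24 hours, which real runs usually are not, so peak rates will be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def requests_per_second(pages_per_day: int, pipelines: int = 1) -> float:
    """Average request rate if fetches are spread evenly over a day."""
    return pages_per_day * pipelines / SECONDS_PER_DAY

single = requests_per_second(100_000)
nine = requests_per_second(100_000, pipelines=9)
print(f"one pipeline:   {single:.2f} req/s")   # 1.16 req/s
print(f"nine pipelines: {nine:.2f} req/s")     # 10.42 req/s
```

At these rates the bottleneck is more likely to be the target site's rate limits and proxy bandwidth than the CPU.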

The script-vs-pipeline distinction

A script is what most freelancers ship. It runs once, it produces output, it stops. The artifact is the data.

A pipeline is what retainer clients pay for. It runs forever. The artifact is the capability. Every day at 06:00 UTC, fresh data lands in the client's BigQuery table. They forget the scraper exists. That's the value.

The transformation from script to pipeline is mostly operational discipline, not code. Five rules:

  1. Persistent state — the pipeline knows what it's already fetched.
  2. Idempotent runs — running it twice in a row produces the same result, no duplicates.
  3. Resume safety — if the box crashes mid-run, the next run picks up from where it left off.
  4. Audit log — every fetch, every parse, every diff is appended to a JSONL file you can grep later.
  5. Alerts on drift — if the source schema changes or extraction confidence drops, the pipeline pings Slack.

Each rule is roughly 10-30 lines of code. None of them is glamorous. All of them are what separates the $50 freelancer from the $300/month retainer.
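
To make rules 1-3 concrete for a multi-page crawl, here is a minimal sketch of a resume-safe URL queue in SQLite. The table name `queue`, the helper names, and the `fetch` callback are all hypothetical; it illustrates the idea and is not the exact code I run.

```python
import sqlite3

def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS queue (
            url TEXT PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'pending'
        )
    """)
    conn.commit()
    return conn

def enqueue(conn, urls):
    # INSERT OR IGNORE makes re-enqueueing idempotent: known URLs keep their status.
    conn.executemany("INSERT OR IGNORE INTO queue (url) VALUES (?)",
                     [(u,) for u in urls])
    conn.commit()

def pending(conn):
    rows = conn.execute(
        "SELECT url FROM queue WHERE status = 'pending' ORDER BY url")
    return [r[0] for r in rows]

def mark_done(conn, url):
    conn.execute("UPDATE queue SET status = 'done' WHERE url = ?", (url,))
    conn.commit()  # commit per URL so a crash loses at most the in-flight page

def run(conn, fetch):
    # Only pending URLs are visited, so a restarted run skips finished work.
    for url in pending(conn):
        fetch(url)
        mark_done(conn, url)
```

If `fetch` raises halfway through, the URLs already marked done stay done, and calling `run` again continues with the rest. A daily job would also need to reset statuses or key them by run date; that part is left out here.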

A complete pipeline in 60 lines

Here's a real pipeline shape. Daily price-monitoring of one e-commerce store, idempotent, resume-safe, alerts to Slack on price changes.

import json
import sqlite3
import time
from pathlib import Path
from curl_cffi import requests
from bs4 import BeautifulSoup

DB = Path("state.db")
LOG = Path("audit.jsonl")
SLACK = "https://hooks.slack.com/services/..."  # webhook URL

def setup_db():
    conn = sqlite3.connect(DB)
    # WAL mode lets you query the DB while a run is writing to it.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS items (
            sku TEXT PRIMARY KEY,
            name TEXT, price REAL,
            last_seen TEXT, last_changed TEXT
        );
    """)
    conn.commit()
    return conn

def audit(event: dict):
    event["ts"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    with LOG.open("a") as f:
        f.write(json.dumps(event) + "\n")

def fetch_catalog(url: str) -> list[dict]:
    r = requests.get(url, impersonate="chrome131", timeout=30)
    audit({"event": "fetch", "url": url, "status": r.status_code, "bytes": len(r.content)})
    r.raise_for_status()  # don't parse an error or block page as if it were the catalog
    soup = BeautifulSoup(r.text, "html.parser")
    items = []
    for el in soup.select("[data-product]"):
        price_text = el.select_one(".price").get_text(strip=True)
        items.append({
            "sku": el["data-product"],
            "name": el.select_one(".name").get_text(strip=True),
            # strip the currency symbol and thousands separators, e.g. "$1,299.00"
            "price": float(price_text.replace("$", "").replace(",", "")),
        })
    return items

def diff_and_alert(conn, items):
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    for it in items:
        row = conn.execute("SELECT price FROM items WHERE sku = ?", (it["sku"],)).fetchone()
        alert = None
        if row is None:
            conn.execute("INSERT INTO items VALUES (?, ?, ?, ?, ?)",
                         (it["sku"], it["name"], it["price"], now, now))
            audit({"event": "new_item", "sku": it["sku"]})
            alert = f"NEW: {it['name']} @ ${it['price']}"
        elif row[0] != it["price"]:
            conn.execute("UPDATE items SET price=?, last_changed=?, last_seen=? WHERE sku=?",
                         (it["price"], now, now, it["sku"]))
            audit({"event": "price_change", "sku": it["sku"], "from": row[0], "to": it["price"]})
            alert = f"PRICE CHANGED: {it['name']}  ${row[0]} → ${it['price']}"
        else:
            conn.execute("UPDATE items SET last_seen=? WHERE sku=?", (now, it["sku"]))
        # Commit each item before alerting, so a crash mid-run cannot replay
        # alerts for items that were already recorded.
        conn.commit()
        if alert:
            slack_post(alert)

def slack_post(text: str):
    requests.post(SLACK, json={"text": text}, timeout=10)

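def main():
    conn = setup_db()
    items = fetch_catalog("https://store.example.com/products")
    if not items:
        # Zero matches usually means the page layout changed, not an empty store.
        audit({"event": "drift", "reason": "no items extracted"})
        slack_post("DRIFT: no items extracted; check the selectors")
        return
    diff_and_alert(conn, items)
    audit({"event": "run_complete", "items_seen": len(items)})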

if __name__ == "__main__":
    main()

Sixty lines. Every one of the five pipeline rules is in there. Drop it on a $5 VPS, add a cron entry:

0 6 * * * cd /home/scraper/pipeline && python3 main.py >> run.log 2>&1

Done. Runs daily at 06:00, alerts on changes, recovers from crashes (SQLite + idempotent insert means no duplicate alerts). The client never thinks about it again.
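
The drift rule can go further than a single check. As a hedged sketch, a function like the one below could run between fetch_catalog and diff_and_alert and post any warnings to Slack. The name `drift_warnings`, the 50% threshold, and the idea of passing in the previous run's item count (for example from `SELECT COUNT(*) FROM items`) are assumptions for illustration. The missing-field check also assumes a parser that tolerates absent elements, which the strict select_one calls in the pipeline do not.

```python
def drift_warnings(items, previous_count, min_ratio=0.5,
                   required=("sku", "name", "price")):
    """Return human-readable warnings if the extraction looks wrong."""
    if not items:
        return ["no items extracted; selectors may no longer match"]
    warnings = []
    # A sharp drop in item count usually means a layout or pagination change.
    if previous_count and len(items) < previous_count * min_ratio:
        warnings.append(
            f"item count dropped from {previous_count} to {len(items)}")
    # Count items where any required field is missing or empty.
    incomplete = sum(
        1 for it in items
        if any(it.get(key) in (None, "") for key in required)
    )
    if incomplete:
        warnings.append(
            f"{incomplete} of {len(items)} items missing required fields")
    return warnings
```

An empty list means the run looks normal. Anything else is worth a Slack message before the diff step touches the database.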

Why ScrapingBee is overpriced for this

ScrapingBee's pitch is "don't worry about anti-bot, we handle it." That's worth something — until you realize anti-bot at typical e-commerce volume is a one-line curl_cffi change. You're paying $99-1,200/month for an HTTP call.

The hidden cost is lock-in. Their scheduler is theirs. Their proxy pool is theirs. If they raise prices (they will), you migrate the entire pipeline. Self-built, you migrate one library.

The other hidden cost is observability. Their dashboard is one line per request. Mine is a JSONL audit log I can grep, jq, pipe into BigQuery, alert on. The cost difference is also a flexibility difference.
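
As an illustration of what "grep-able" buys you, here is a small sketch that summarizes an audit log shaped like the one the pipeline writes (one JSON object per line with an "event" key, plus "status" on fetch events). The function name `summarize` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

def summarize(log_path):
    """Count events by type and collect fetches that did not return 200."""
    events = Counter()
    failed_fetches = []
    with Path(log_path).open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            events[record.get("event", "unknown")] += 1
            if record.get("event") == "fetch" and record.get("status") != 200:
                failed_fetches.append(record)
    return events, failed_fetches
```

Calling `summarize("audit.jsonl")` returns a Counter of event types and the list of failed fetch records, which is enough to drive a weekly report or an extra alert.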

The retainer pricing model

Here's the math that makes this a real business, not a cost-cutting exercise:

  • Setup cost (one-time): 4-8 hours of work to build the pipeline + cron config + Slack integration → quote $800-1,500 fixed-price for setup.
  • Ongoing cost (yours): $5 VPS + $5 proxy = $10/month per client.
  • Ongoing price (theirs): $250-500/month per client (justified by "real-time alerts, never breaks, observability included").
  • Margin: $240-490/month per client, recurring.

Run 10 of these in parallel on the same $5 box (yes, a single CX22 handles 10 small pipelines easily). $2,500-5,000/month recurring revenue. About $55/month total infrastructure cost: one $5 box plus $5 of proxy per client.
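
The arithmetic can be written out as a small worked example. It assumes one shared $5 VPS and $5 of proxy spend per client; the function name and default values are illustrative, so adjust them to your own costs.

```python
def monthly_margin(clients, price_per_client, vps_cost=5, proxy_per_client=5):
    """Return (revenue, infrastructure cost, margin) per month in dollars."""
    revenue = clients * price_per_client
    infra = vps_cost + clients * proxy_per_client  # one shared box, per-client proxies
    return revenue, infra, revenue - infra

for price in (250, 500):
    revenue, infra, margin = monthly_margin(10, price)
    print(f"${price}/client: revenue ${revenue}, infra ${infra}, margin ${margin}")
# $250/client: revenue $2500, infra $55, margin $2445
# $500/client: revenue $5000, infra $55, margin $4945
```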

This is the pricing model that turns scraping from "I deliver CSV files for $50" into "I run your competitor-monitoring infrastructure for $400/month." Same code, fundamentally different conversation with the client.

What clients actually pay for

When I close a retainer-grade conversation, the client's question is never "how do you handle Cloudflare." It's:

  • "What happens if the source site changes?" → I show them the audit log + the schema drift alert.
  • "How fast will I know about a price change?" → I show them the run cadence + Slack post timestamp.
  • "What if your VPS goes down?" → I show them the GitHub Actions backup that runs on workflow_dispatch.
  • "Can I see the raw data?" → I give them read-only Postgres access.

None of these are scraping questions. They're operations questions. The pipeline-as-product mindset answers them by default. Script-as-deliverable doesn't.

When to use a managed service anyway

I'm not anti-managed. ScrapingBee, Scrapy Cloud, and Apify all earn their price in two scenarios:

  1. You don't have the operational chops yet. If your client is a non-technical agency that doesn't have anyone to SSH into a VPS, charging them $400/month and using ScrapingBee under the hood ($100/month) for the first 6 months is a reasonable bridge. Migrate to self-hosted once you have the cash flow.
  2. You need geo-diverse residential IPs at low volume. At high volume, Webshare residential is cheaper. At <1k requests/day across 5+ countries, ScrapingBee's pre-baked residential pool is hard to beat on per-request economics.

For everything else, the math is on self-hosted.

Send a brief to info@luba.media if you want this built for your business. I quote fixed-price + monthly retainer.

