Why a $5/mo VPS Beats a $1,200/mo ScrapingBee Plan
Most freelancers deliver scrapers as scripts. The client gets a .py file, runs it once, gets a CSV. A month later something breaks, the scraper goes stale, and the client calls back asking for help (or worse, a refund).
The $300-1,000/month retainer business is built on a different artifact: a pipeline. Persistent state, idempotent runs, audit logs, retry logic, alerts on drift. Same code as the script, plus the operational discipline that makes it run unattended.
This is the stack I run in production for my own data business and for client retainers. Total cost: ~$15/month per pipeline. The closest equivalent commercial setup (ScrapingBee + their scheduler) starts at $99/month and scales to $1,200/month at the volume we're talking about.
The stack at a glance
| Layer | What I use | Cost |
|---|---|---|
| Compute | Hetzner CX22 (2 vCPU, 4GB RAM) or DigitalOcean Basic | $5/mo |
| Anti-bot | curl_cffi (free) + Webshare residential rotation | $3-15/mo |
| Scheduling | cron (free) or GitHub Actions ($0 on free tier) | $0 |
| State | SQLite file with WAL mode, or Postgres on a Supabase free tier | $0 |
| Alerts | Slack webhook | $0 |
| Observability | JSONL audit log + a tail -f SSH habit | $0 |
| Total | | $8-20/mo |
A 100k-page-per-day pipeline runs comfortably on this. I've operated nine of them in parallel on a single $5 box.
The script-vs-pipeline distinction
A script is what most freelancers ship. It runs once, it produces output, it stops. The artifact is the data.
A pipeline is what retainer clients pay for. It runs forever. The artifact is the capability. Every day at 06:00 UTC, fresh data lands in the client's BigQuery table. They forget the scraper exists. That's the value.
The transformation from script to pipeline is mostly operational discipline, not code. Five rules:
- Persistent state — the pipeline knows what it's already fetched.
- Idempotent runs — running it twice in a row produces the same result, no duplicates.
- Resume safety — if the box crashes mid-run, the next run picks up from where it left off.
- Audit log — every fetch, every parse, every diff is appended to a JSONL file you can grep later.
- Alerts on drift — if the source schema changes or extraction confidence drops, the pipeline pings Slack.
Each rule is roughly 10-30 lines of code. None of them is glamorous. All of them are what separates the $50 freelancer from the $300/month retainer.
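To make the first three rules concrete, here is a minimal sketch of a resume-safe fetch queue in SQLite. The table name `pages`, the `run_date` key, and the helper names are assumptions for illustration, not part of the production pipelines described here.

```python
import sqlite3


def open_state(path: str) -> sqlite3.Connection:
    """Open (or create) the state DB and make sure the queue table exists."""
    conn = sqlite3.connect(path)
    # WAL lets you read the file (e.g. from a shell) while a run is writing.
    # On an in-memory database this pragma is accepted but has no effect.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            run_date TEXT NOT NULL,
            url      TEXT NOT NULL,
            status   TEXT NOT NULL DEFAULT 'pending',
            PRIMARY KEY (run_date, url)
        )
    """)
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, run_date: str, urls: list[str]) -> None:
    """Idempotent: re-enqueueing the same run leaves existing rows untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO pages (run_date, url) VALUES (?, ?)",
        [(run_date, u) for u in urls],
    )
    conn.commit()


def pending(conn: sqlite3.Connection, run_date: str) -> list[str]:
    """URLs still to fetch for this run, in a stable order."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE run_date = ? AND status = 'pending' ORDER BY url",
        (run_date,),
    )
    return [r[0] for r in rows]


def mark_done(conn: sqlite3.Connection, run_date: str, url: str) -> None:
    """Commit after every page so a crash loses at most the page in flight."""
    conn.execute(
        "UPDATE pages SET status = 'done' WHERE run_date = ? AND url = ?",
        (run_date, url),
    )
    conn.commit()
```

A run then becomes: `enqueue` today's URLs, loop over `pending`, and call `mark_done` after each successful fetch. If the box dies halfway, the next invocation enqueues the same URLs (a no-op) and only sees the unfinished ones.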
A complete pipeline in 60 lines
Here's a real pipeline shape. Daily price-monitoring of one e-commerce store, idempotent, resume-safe, alerts to Slack on price changes.

    import json
    import sqlite3
    import time
    from pathlib import Path

    from bs4 import BeautifulSoup
    from curl_cffi import requests

    DB = Path("state.db")
    LOG = Path("audit.jsonl")
    SLACK = "https://hooks.slack.com/services/..."  # webhook URL


    def setup_db():
        conn = sqlite3.connect(DB)
        conn.execute("PRAGMA journal_mode=WAL")  # WAL mode, as listed in the stack table
        conn.executescript("""
            CREATE TABLE IF NOT EXISTS items (
                sku TEXT PRIMARY KEY,
                name TEXT, price REAL,
                last_seen TEXT, last_changed TEXT
            );
        """)
        conn.commit()
        return conn


    def audit(event: dict):
        # Append-only JSONL: one timestamped event per line.
        event["ts"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        with LOG.open("a") as f:
            f.write(json.dumps(event) + "\n")


    def fetch_catalog(url: str) -> list[dict]:
        r = requests.get(url, impersonate="chrome131", timeout=30)
        audit({"event": "fetch", "url": url, "status": r.status_code, "bytes": len(r.text)})
        soup = BeautifulSoup(r.text, "html.parser")
        items = []
        for el in soup.select("[data-product]"):
            price_text = el.select_one(".price").get_text(strip=True)
            items.append({
                "sku": el["data-product"],
                "name": el.select_one(".name").get_text(strip=True),
                # Strip the currency sign and thousands separators before parsing.
                "price": float(price_text.replace("$", "").replace(",", "")),
            })
        return items


    def diff_and_alert(conn, items):
        now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        for it in items:
            row = conn.execute("SELECT price FROM items WHERE sku = ?", (it["sku"],)).fetchone()
            if row is None:
                conn.execute("INSERT INTO items VALUES (?, ?, ?, ?, ?)",
                             (it["sku"], it["name"], it["price"], now, now))
                audit({"event": "new_item", "sku": it["sku"]})
                slack_post(f"NEW: {it['name']} @ ${it['price']}")
            elif row[0] != it["price"]:
                audit({"event": "price_change", "sku": it["sku"], "from": row[0], "to": it["price"]})
                slack_post(f"PRICE CHANGED: {it['name']} ${row[0]} → ${it['price']}")
                conn.execute("UPDATE items SET price=?, last_changed=?, last_seen=? WHERE sku=?",
                             (it["price"], now, now, it["sku"]))
            else:
                conn.execute("UPDATE items SET last_seen=? WHERE sku=?", (now, it["sku"]))
            # Commit per item so a crash mid-run can repeat at most one alert.
            conn.commit()


    def slack_post(text: str):
        requests.post(SLACK, json={"text": text}, timeout=10)


    def main():
        conn = setup_db()
        items = fetch_catalog("https://store.example.com/products")
        if not items:
            # Zero parsed items usually means a block page or a markup change.
            audit({"event": "drift", "reason": "no items parsed"})
            slack_post("DRIFT: 0 items parsed; check selectors and HTTP status")
            return
        diff_and_alert(conn, items)
        audit({"event": "run_complete", "items_seen": len(items)})


    if __name__ == "__main__":
        main()

Sixty lines. Every one of the five pipeline rules is in there. Drop it on a $5 VPS, add a cron entry:
0 6 * * * cd /home/scraper/pipeline && python3 main.py >> run.log 2>&1
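Done. Runs daily at 06:00 in the server's time zone (keep the box on UTC so that means 06:00 UTC), alerts on changes, and tolerates reruns: state lives in SQLite, so a second run only alerts on SKUs whose stored price still differs from the page. The client never thinks about it again.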
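One thing the short version leaves out is retry logic for transient failures (timeouts, a flaky proxy). A minimal sketch of exponential backoff with jitter follows; `with_retries` and its parameters are hypothetical names, and the injectable `sleep` argument exists only so the helper can be exercised without actually waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on any exception wait base_delay * 2**n plus jitter, then retry."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: let cron's run.log capture the traceback
            # Up to one second of jitter keeps parallel pipelines from retrying in lockstep.
            sleep(base_delay * 2 ** n + random.uniform(0, 1))
    raise AssertionError("unreachable")
```

In the pipeline above this would wrap the fetch, for example `items = with_retries(lambda: fetch_catalog(url))`. It only retries on exceptions, so a fetch that returns a 403 page without raising would still need its own status check.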
Why ScrapingBee is overpriced for this
ScrapingBee's pitch is "don't worry about anti-bot, we handle it." That's worth something — until you realize anti-bot at typical e-commerce volume is a one-line curl_cffi change. You're paying $99-1,200/month for an HTTP call.
The hidden cost is lock-in. Their scheduler is theirs. Their proxy pool is theirs. If they raise prices (they will), you migrate the entire pipeline. Self-built, you migrate one library.
The other hidden cost is observability. Their dashboard is one line per request. Mine is a JSONL audit log I can grep, jq, pipe into BigQuery, alert on. The cost difference is also a flexibility difference.
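As an illustration of what a JSONL log buys you, here is a small sketch that counts events per day. It assumes each line has the `ts` and `event` fields that the `audit()` function in the pipeline above writes; the function name `summarize` is made up for this example.

```python
import json
from collections import Counter
from typing import Iterable


def summarize(lines: Iterable[str]) -> Counter:
    """Count audit events keyed by (YYYY-MM-DD, event name)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # a crash mid-write can leave a truncated final line
        day = str(event.get("ts", ""))[:10]
        counts[(day, event.get("event", "unknown"))] += 1
    return counts
```

Because an open file is an iterable of lines, `summarize(open("audit.jsonl"))` works directly, and the result is easy to turn into a weekly client report or a "no fetch event today" alert.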
The retainer pricing model
Here's the math that makes this a real business, not a cost-cutting exercise:
- Setup cost (one-time): 4-8 hours of work to build the pipeline + cron config + Slack integration → quote $800-1,500 fixed-price for setup.
- Ongoing cost (yours): $5 VPS + $5 proxy = $10/month per client.
- Ongoing price (theirs): $250-500/month per client (justified by "real-time alerts, never breaks, observability included").
- Margin: $240-490/month per client, recurring.
Run 10 of these in parallel on the same $5 box (yes, a single CX22 handles 10 small pipelines easily) and that's $2,500-5,000/month in recurring revenue against roughly $55/month in total infrastructure: one $5 box plus ten $5 proxy plans.
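The arithmetic is simple enough to keep in a helper when quoting. This sketch uses the article's figures as defaults ($5 VPS, $5 proxy plan per client); the function name and the `shared_box` switch are illustrative assumptions.

```python
def monthly_margin(
    clients: int,
    price: float,
    vps: float = 5.0,
    proxy_per_client: float = 5.0,
    shared_box: bool = True,
) -> dict:
    """Monthly revenue, infrastructure cost, and margin for a set of retainers."""
    boxes = 1 if shared_box else clients  # one shared VPS, or one per client
    infra = boxes * vps + clients * proxy_per_client
    revenue = clients * price
    return {"revenue": revenue, "infra": infra, "margin": revenue - infra}
```

With ten clients at $250 on one shared box, `monthly_margin(10, 250)` gives $2,500 revenue, $55 infrastructure, and $2,445 margin. With a dedicated box per client (`shared_box=False`) infrastructure rises to $100, which is the $10 per client figure from the list above.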
This is the pricing model that turns scraping from "I deliver CSV files for $50" into "I run your competitor-monitoring infrastructure for $400/month." Same code, fundamentally different conversation with the client.
What clients actually pay for
When I close a retainer-grade conversation, the client's question is never "how do you handle Cloudflare." It's:
- "What happens if the source site changes?" → I show them the audit log + the schema drift alert.
- "How fast will I know about a price change?" → I show them the run cadence + Slack post timestamp.
- "What if your VPS goes down?" → I show them the GitHub Actions backup that runs on workflow_dispatch.
- "Can I see the raw data?" → I give them read-only Postgres access.
None of these are scraping questions. They're operations questions. The pipeline-as-product mindset answers them by default. Script-as-deliverable doesn't.
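The schema drift alert from the first answer can be very simple. Below is a rough sketch, assuming items are dicts with `sku`, `name`, and `price` as in the pipeline above; the thresholds and the name `drift_reasons` are placeholders to tune per source, not values taken from a production pipeline.

```python
def drift_reasons(
    items: list[dict],
    previous_count: int,
    required: tuple = ("sku", "name", "price"),
    min_ratio: float = 0.5,
    min_complete: float = 0.9,
) -> list[str]:
    """Return human-readable reasons to suspect the source changed (empty if none)."""
    reasons = []
    # A sharp drop in item count often means a selector stopped matching.
    if previous_count and len(items) < previous_count * min_ratio:
        reasons.append(f"item count dropped from {previous_count} to {len(items)}")
    # Items with missing or empty required fields suggest a partial markup change.
    complete = sum(
        1 for it in items if all(it.get(k) not in (None, "") for k in required)
    )
    if items and complete / len(items) < min_complete:
        reasons.append(
            f"only {complete}/{len(items)} items have all of {', '.join(required)}"
        )
    return reasons
```

If the returned list is non-empty, post it to Slack and skip the diff step, so a broken selector does not get recorded as a wave of price changes.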
When to use a managed service anyway
I'm not anti-managed. ScrapingBee, Scrapy Cloud, and Apify all earn their price in two scenarios:
- You don't have the operational chops yet. If your client is a non-technical agency that doesn't have anyone to SSH into a VPS, charging them $400/month and using ScrapingBee under the hood ($100/month) for the first 6 months is a reasonable bridge. Migrate to self-hosted once you have the cash flow.
- You need geo-diverse residential IPs at low volume. Webshare residential at high volume is cheaper. At <1k requests/day across 5+ countries, ScrapingBee's pre-baked residential pool is hard to beat on per-request economics.
For everything else, the math is on self-hosted.
What to read next
- Bypassing Cloudflare, DataDome, and PerimeterX in 2026 — the underlying tools that make this stack work.
- Self-Healing AI Web Extractors — the parsing layer that survives site redesigns, paired with this pipeline pattern.
- The repos: portfolio_demos/competitor_watch/, portfolio_demos/bigcommerce_monitor/, portfolio_demos/shopify_storefront_monitor/ — three production pipelines running this exact pattern.
Send a brief to info@luba.media if you want this built for your business. I quote fixed-price + monthly retainer.
Hire me to build this for your site
I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.