Eyal Rosenthal · Web scraping at scale

Y Combinator Companies Bulk Extractor

Y Combinator Companies Bulk Extractor — API-Driven · Batch + Status Filters · Resume

YC Companies Directory Bulk Extractor

Pull every Y Combinator-funded company via the public API at api.ycombinator.com/v0.1/companies. This maps to a recurring class of real Upwork briefs: VCs, sales-intel teams, recruitment agencies, and sourcing tools post jobs every month asking for "scrape every YC company in batch X" or "every Active YC company with team_size > 10."
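
A minimal sketch of the paginated pull, not the actual extract.py. It assumes the endpoint returns JSON with `companies` and `totalPages` keys and accepts `page` and `batch` query parameters; those names, along with the helper names `fetch_page` and `iter_companies`, are assumptions to check against a live response.

```python
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://api.ycombinator.com/v0.1/companies"


def fetch_page(params):
    """GET one page of results and return the decoded JSON body."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_companies(batch=None, max_pages=None, fetch=fetch_page,
                   pause=0.4, sleep=time.sleep):
    """Yield company records page by page until the API runs out."""
    page = 1
    while max_pages is None or page <= max_pages:
        params = {"page": page}
        if batch:
            params["batch"] = batch  # assumed filter parameter name
        body = fetch(params)
        companies = body.get("companies", [])  # assumed response key
        if not companies:
            break
        yield from companies
        if page >= body.get("totalPages", page):  # assumed response key
            break
        page += 1
        sleep(pause)  # politeness delay between pages
```

Passing `fetch` and `sleep` as arguments keeps the paging logic testable without touching the network: a fake `fetch` that returns canned pages is enough to exercise the stop conditions.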

Built 2026-05-03 as Demo #28 — second hybrid-mode demo (real recurring brief class, not a pure pattern).

Run

. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/yc_companies_extractor

python extract.py --reset --batch P26 --max-pages 3   # quick smoke test
python extract.py --batch P26                         # latest batch only
python extract.py --batch P26 --status Active         # active P26 companies
python extract.py                                     # full directory (~5000 companies)

Result (P26 batch slice)

  • 75 P26 companies extracted in 3 pages ✅
  • Per-row fields: id, name, website, yc_url, batch, status, team_size, one-liner, industries, tags, regions, locations, badges ✅
  • Resume + dedupe via in-CSV id set ✅
  • 0.4s politeness sleep between pages ✅
  • Tenacity retries with exponential backoff ✅
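
The resume and dedupe behaviour can be sketched as follows. This is an illustration under assumptions, not the demo's code: the function names `load_seen_ids` and `append_new_rows` are hypothetical, and the column list is trimmed to four fields to keep the example short.

```python
import csv
import os

FIELDS = ["id", "name", "batch", "status"]  # trimmed column list for the sketch


def load_seen_ids(path):
    """Return the set of ids already written to the CSV (empty if no file)."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as fh:
        return {row["id"] for row in csv.DictReader(fh)}


def append_new_rows(path, rows, seen):
    """Append rows whose id is not in `seen`; return how many were written."""
    is_new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    written = 0
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        if is_new_file:
            writer.writeheader()
        for row in rows:
            key = str(row["id"])  # CSV stores text, so compare as strings
            if key in seen:
                continue
            writer.writerow(row)
            seen.add(key)
            written += 1
    return written
```

Because the id set is rebuilt from the output file itself, an interrupted run can be restarted without a separate checkpoint file, and re-fetched pages never produce duplicate rows.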

Why this beats hand-rolling the same scraper

The naive bid solution is to scrape the HTML at ycombinator.com/companies, which is fragile and JS-heavy. The API has been stable for years and ships clean structured data, including filterable fields the HTML doesn't surface (long descriptions, badges, region tags).
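
To show what "clean structured data" buys, here is a hedged sketch of flattening one API record into the per-row fields listed under Result. The source keys `url`, `teamSize`, and `oneLiner` are assumptions about the response shape, and `join_list` and `project` are hypothetical helper names.

```python
def join_list(value, sep="; "):
    """Flatten a list field into one CSV cell; pass scalars through as text."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple)):
        return sep.join(str(v) for v in value)
    return str(value)


def project(company):
    """Map one API record (a dict) to a flat row of CSV-ready values."""
    return {
        "id": company.get("id"),
        "name": company.get("name", ""),
        "website": company.get("website", ""),
        "yc_url": company.get("url", ""),          # assumed source key
        "batch": company.get("batch", ""),
        "status": company.get("status", ""),
        "team_size": company.get("teamSize"),      # assumed source key
        "one-liner": company.get("oneLiner", ""),  # assumed source key
        "industries": join_list(company.get("industries")),
        "tags": join_list(company.get("tags")),
        "regions": join_list(company.get("regions")),
        "locations": join_list(company.get("locations")),
        "badges": join_list(company.get("badges")),
    }
```

With JSON input there is no selector to break when the page layout changes; the only maintenance point is this one mapping.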

The brief-class fit covers:

  • Sales intel: filter to Active + specific industry → outbound list
  • Recruitment: filter to team_size > N and a region → talent pipeline
  • Investment scouting: latest batch + specific tag (e.g., AI) → deal-flow source
  • Competitive intel: track when a competitor's portfolio company changes status
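
The first three use cases reduce to a row-level predicate over the extracted fields. A minimal sketch, assuming rows carry `status`, `team_size`, `tags`, and `regions`, with list fields stored either as lists or as "; "-joined strings; `as_list` and `matches` are hypothetical names.

```python
def as_list(value, sep="; "):
    """Accept a real list or a joined CSV cell and return a list of strings."""
    if not value:
        return []
    if isinstance(value, str):
        return [part.strip() for part in value.split(sep) if part.strip()]
    return [str(v) for v in value]


def matches(row, status=None, min_team_size=None, tag=None, region=None):
    """Return True when the row passes every filter that was supplied."""
    if status and (row.get("status") or "").lower() != status.lower():
        return False
    if min_team_size is not None:
        try:
            size = int(row.get("team_size") or 0)
        except (TypeError, ValueError):
            return False
        if size <= min_team_size:  # strictly greater, as in "team_size > N"
            return False
    if tag and tag.lower() not in [t.lower() for t in as_list(row.get("tags"))]:
        return False
    if region and region.lower() not in [
        r.lower() for r in as_list(row.get("regions"))
    ]:
        return False
    return True
```

For example, `[r for r in rows if matches(r, status="Active", min_team_size=10)]` would produce the "every Active YC company with team_size > 10" list from an already extracted CSV.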

The same pattern works on Crunchbase Pro (paid), Dealroom, AngelList Talent, and Indie Hackers: drop in the new endpoint and projection.
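
One way to picture "endpoint plus projection" is a small per-source spec that the rest of the extractor reads. Everything here is illustrative: `SourceSpec` and `extract_rows` are hypothetical names, the second source's URL and keys are invented placeholders, and the YC keys are assumptions about the response shape.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceSpec:
    name: str
    endpoint: str
    items_key: str  # key holding the list of records in each response
    projection: dict = field(default_factory=dict)  # output column -> source key


def extract_rows(body, spec):
    """Project every item in one decoded response body using the spec."""
    rows = []
    for item in body.get(spec.items_key, []):
        rows.append({col: item.get(src) for col, src in spec.projection.items()})
    return rows


YC = SourceSpec(
    name="yc",
    endpoint="https://api.ycombinator.com/v0.1/companies",
    items_key="companies",  # assumed response key
    projection={"id": "id", "name": "name", "team_size": "teamSize"},
)

# Placeholder source: the URL and field names below are made up.
OTHER = SourceSpec(
    name="example_directory",
    endpoint="https://example.com/api/startups",
    items_key="results",
    projection={"id": "uuid", "name": "title", "team_size": "headcount"},
)
```

Both specs yield rows with the same columns, so the resume, dedupe, and filter steps would not need to change per source.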

Hire me to build this for your stack

Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.

info@luba.media