Eyal Rosenthal · Web scraping at scale

Lead-Gen Contact Extractor

Lead-Gen Contact Extractor — Batch Email/Phone/Social Harvest with QA-Tight Regex

Take a list of company URLs → fetch homepage + contact / about / team pages → extract emails, phone numbers, and social handles → write a deduplicated CSV. Maps to the highest-volume Upwork lead-gen brief class: "scrape contact info from these N companies."
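The final "deduplicated CSV" step can be sketched as follows. This is a minimal illustration, not the code in extract.py: the function name `write_contacts_csv`, the column names, and the choice to deduplicate on a case-insensitive (company, kind, value) key are all assumptions.

```python
import csv
import io

def write_contacts_csv(rows, out):
    """Write (company_url, kind, value) rows to `out`, skipping duplicates.

    Hypothetical helper: the column layout and the case-insensitive
    dedup key are assumptions, not taken from extract.py.
    """
    seen = set()
    writer = csv.writer(out)
    writer.writerow(["company_url", "kind", "value"])
    written = 0
    for company_url, kind, value in rows:
        cleaned = value.strip()
        key = (company_url, kind, cleaned.lower())
        if key in seen:
            continue  # same contact already recorded for this company
        seen.add(key)
        writer.writerow([company_url, kind, cleaned])
        written += 1
    return written

# Usage: any text file object works; StringIO keeps the example self-contained.
buf = io.StringIO()
count = write_contacts_csv(
    [
        ("https://example.com", "email", "Info@Example.com"),
        ("https://example.com", "email", "info@example.com "),
        ("https://example.com", "phone", "(415) 555-0100"),
    ],
    buf,
)
```

Keeping the first-seen spelling of each value means the CSV shows the contact as it appeared on the page, while the lowercase key catches case and whitespace variants.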

Built 2026-05-03 as Demo #14.

Run

. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/leadgen_contact_extractor
python extract.py                        # uses companies.txt as input
python extract.py --input my_list.txt    # custom input
python extract.py --limit 20             # cap to first 20 rows

Result (8 sample SaaS companies)

  • 8 / 8 companies processed without errors ✅
  • Email coverage: 4 / 8 ✅
  • Phone coverage: 3 / 8 ✅
  • Social coverage: 8 / 8 (twitter, linkedin, github, instagram, facebook) ✅
  • Cloudflare alone yielded 21 international phone numbers ✅
  • 0 false-positive phones (SVG path coordinates rejected) ✅

Quality discipline

The naive version of this scraper produces garbage: a phone regex run over a JS-rendered page matches SVG path coordinates and decimal numbers (0.142857142857, 10.5582 14.7391) as "phone numbers." This extractor enforces three guarantees:

  1. Three canonical phone formats only: +CC XXX XXX XXXX, (XXX) XXX-XXXX, XXX.XXX.XXXX. Pure-decimal strings are rejected unless they match the classic US dot format exactly.
  2. tel: href harvesting: when a site uses <a href="tel:...">, we trust that number explicitly.
  3. Email blocklist — strips sentry.io DSNs, image filenames, and JSON-escaped fragments that the typical email regex falsely matches.
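The three guarantees above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patterns used in extract.py: every regex, constant, and function name here (`PHONE_PATTERNS`, `TEL_HREF`, `extract_phones`, `extract_emails`) is hypothetical, and the real blocklist is likely longer.

```python
import re

# Assumed patterns for the three canonical formats; the real ones may differ.
PHONE_PATTERNS = [
    re.compile(r"(?<![\w.+])\+\d{1,3} \d{3} \d{3} \d{4}(?!\.?\d)"),  # +CC XXX XXX XXXX
    re.compile(r"(?<![\w.+])\(\d{3}\) \d{3}-\d{4}(?!\.?\d)"),        # (XXX) XXX-XXXX
    re.compile(r"(?<![\w.+])\d{3}\.\d{3}\.\d{4}(?!\.?\d)"),          # XXX.XXX.XXXX
]
# Numbers the site itself marks up as phone links.
TEL_HREF = re.compile(r"""href=["']tel:([^"']+)["']""", re.IGNORECASE)

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BLOCKED_DOMAIN_SUFFIXES = ("sentry.io",)  # Sentry DSNs look like emails
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")

def extract_phones(html: str) -> list[str]:
    found = [m.group(1).strip() for m in TEL_HREF.finditer(html)]
    for pattern in PHONE_PATTERNS:
        found.extend(m.group(0) for m in pattern.finditer(html))
    # Deduplicate on digits only, keeping the first spelling seen.
    unique = {}
    for raw in found:
        unique.setdefault(re.sub(r"\D", "", raw), raw)
    return list(unique.values())

def extract_emails(html: str) -> list[str]:
    unique = {}
    for m in EMAIL_RE.finditer(html):
        email = m.group(0).lower()
        # A preceding backslash means a JSON escape such as \u003e leaked
        # into the local part, so the match is dropped.
        if m.start() > 0 and html[m.start() - 1] == "\\":
            continue
        if email.endswith(IMAGE_SUFFIXES):  # e.g. logo@2x.png
            continue
        if email.split("@", 1)[1].endswith(BLOCKED_DOMAIN_SUFFIXES):
            continue
        unique.setdefault(email, email)
    return list(unique.values())
```

The lookarounds are what keep decimals out: a string such as 0.142857142857 has only one dot and never fits the dotted pattern, and a candidate that is immediately followed by more digits is refused rather than truncated.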

Requests are also rate-limited per company: 0.6 s between sub-pages, 1 s between companies, and at most 6 contact-page candidates per site.
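A minimal sketch of that pacing, assuming the candidate URLs for each company are already known. The function name `crawl`, the constant names, and the injectable `fetch` and `sleep` callables are hypothetical, and whether the homepage counts toward the cap of 6 is an assumption (here it does).

```python
import time
from typing import Callable

SUBPAGE_DELAY_S = 0.6   # pause between pages of the same company
COMPANY_DELAY_S = 1.0   # pause before moving to the next company
MAX_CANDIDATES = 6      # cap on pages fetched per site (assumed to include the homepage)

def crawl(
    companies: dict[str, list[str]],
    fetch: Callable[[str], str],
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, list[str]]:
    """Fetch up to MAX_CANDIDATES pages per company with fixed delays."""
    pages: dict[str, list[str]] = {}
    for i, (company, candidates) in enumerate(companies.items()):
        if i > 0:
            sleep(COMPANY_DELAY_S)
        bodies = []
        for j, url in enumerate(candidates[:MAX_CANDIDATES]):
            if j > 0:
                sleep(SUBPAGE_DELAY_S)
            bodies.append(fetch(url))
        pages[company] = bodies
    return pages
```

Passing `sleep` and `fetch` in as parameters keeps the pacing logic checkable without real network calls or real waiting.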