Lead-Gen Contact Extractor
Take a list of company URLs → fetch homepage + contact / about / team pages → extract emails, phone numbers, and social handles → write a deduplicated CSV. Maps to the highest-volume Upwork lead-gen brief class: "scrape contact info from these N companies."
Built 2026-05-03 as Demo #14.
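The final step of the pipeline above, writing a deduplicated CSV, can be sketched roughly as follows. This is an illustration, not the code in extract.py: the function name `write_deduped_csv`, the three-column layout (company, kind, value), and the case-insensitive dedup key are all assumptions.

```python
import csv
import io


def write_deduped_csv(rows, out):
    """Write (company, kind, value) rows to `out`, skipping duplicates.

    Hypothetical helper: the column layout is assumed, not taken from extract.py.
    Returns the number of data rows written.
    """
    seen = set()
    writer = csv.writer(out)
    writer.writerow(["company", "kind", "value"])
    written = 0
    for company, kind, value in rows:
        value = value.strip()
        # Compare case-insensitively so Info@acme.com and info@acme.com collapse.
        key = (company.lower(), kind, value.lower())
        if key in seen:
            continue
        seen.add(key)
        writer.writerow([company, kind, value])
        written += 1
    return written


if __name__ == "__main__":
    buf = io.StringIO()
    sample = [
        ("acme.com", "email", "Info@acme.com"),
        ("acme.com", "email", "info@acme.com "),
        ("acme.com", "phone", "(415) 555-0199"),
    ]
    print(write_deduped_csv(sample, buf))
    print(buf.getvalue())
```

The first spelling seen is the one kept, so the output preserves whatever casing the site used.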
Run
. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/leadgen_contact_extractor
python extract.py # uses companies.txt as input
python extract.py --input my_list.txt # custom input
python extract.py --limit 20 # cap to first 20 rows
Result (8 sample SaaS companies)
- 8 / 8 companies processed without errors ✅
- Email coverage: 4 / 8 ✅
- Phone coverage: 3 / 8 ✅
- Social coverage: 8 / 8 (twitter, linkedin, github, instagram, facebook) ✅
- Cloudflare alone yielded 21 international phone numbers ✅
- 0 false-positive phones (SVG path coordinates rejected) ✅
Quality discipline
The naive version of this scraper produces garbage. A phone regex run over a JS-rendered page matches SVG path coordinates and decimal numbers (0.142857142857, 10.5582 14.7391) as "phone numbers." This extractor applies three safeguards:
- Three canonical phone formats only: +CC XXX XXX XXXX, (XXX) XXX-XXXX, XXX.XXX.XXXX. Pure-decimal strings are rejected unless they match the US classic dot format exactly.
- tel: href harvesting: when a site marks a number up as a tel: link, we trust it explicitly.
- Email blocklist: strips sentry.io DSNs, image filenames, and JSON-escaped fragments that the typical email regex falsely matches.
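A minimal sketch of how these three filters might look in Python, using only the standard `re` module. The regexes, the image-extension list, the `u00XX` escape heuristic, and the function names `extract_phones` and `extract_emails` are assumptions for illustration; the patterns in extract.py may differ.

```python
import re

# Assumed patterns for the three canonical formats (not copied from extract.py).
PHONE_PATTERNS = [
    re.compile(r"(?<![\w.])\+\d{1,3} \d{3} \d{3} \d{4}(?![\w.])"),  # +CC XXX XXX XXXX
    re.compile(r"(?<![\w.])\(\d{3}\) \d{3}-\d{4}(?![\w.-])"),       # (XXX) XXX-XXXX
    re.compile(r"(?<![\w.])\d{3}\.\d{3}\.\d{4}(?![\w.])"),          # XXX.XXX.XXXX
]
TEL_HREF = re.compile(r'href=["\']tel:([^"\']+)["\']', re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")
JSON_ESCAPE = re.compile(r"^u[0-9a-f]{4}")  # residue of \u003e and similar escapes


def extract_phones(html):
    """Return tel: link numbers first, then strict-format matches, deduplicated."""
    found = [m.group(1).strip() for m in TEL_HREF.finditer(html)]
    for pattern in PHONE_PATTERNS:
        found.extend(pattern.findall(html))
    unique, seen = [], set()
    for phone in found:
        digits = re.sub(r"\D", "", phone)  # dedupe on digits only
        if digits not in seen:
            seen.add(digits)
            unique.append(phone)
    return unique


def extract_emails(text):
    """Return lowercased emails that survive the blocklist, in first-seen order."""
    kept = []
    for email in EMAIL_RE.findall(text):
        email = email.lower()
        local, domain = email.split("@", 1)
        if domain == "sentry.io" or domain.endswith(".sentry.io"):
            continue  # Sentry DSN, not a contact address
        if email.endswith(IMAGE_EXTS):
            continue  # image filename such as logo@2x.png
        if JSON_ESCAPE.match(local):
            continue  # JSON-escaped fragment glued to the local part
        if email not in kept:
            kept.append(email)
    return kept


if __name__ == "__main__":
    page = ('<path d="M10.5582 14.7391 0.142857142857"/> '
            '<a href="tel:+14155550123">Call</a> (415) 555-0199 415.555.0142')
    print(extract_phones(page))
    print(extract_emails("info@example.com abc123@o1.ingest.sentry.io logo@2x.png"))
```

The lookarounds on each phone pattern are what keep long decimal runs out: a match may not be preceded or followed by a digit, letter, or dot, so a fragment of an SVG coordinate cannot pass as a dotted US number.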
Plus per-company politeness: 0.6s between sub-pages, 1s between companies, capped at 6 contact-page candidates per site.
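The politeness rules can be sketched as a small crawl loop. This is a hedged illustration: the `crawl` function, its `fetch` and `sleep` parameters, and the choice not to count the homepage against the 6-candidate cap are assumptions, not details confirmed by extract.py.

```python
import time

SUBPAGE_DELAY = 0.6   # seconds between pages of the same company
COMPANY_DELAY = 1.0   # seconds between companies
MAX_CANDIDATES = 6    # contact-page candidates per site


def crawl(companies, fetch, sleep=time.sleep):
    """Fetch each homepage plus capped candidate pages, pausing between requests.

    companies: dict mapping homepage URL -> list of candidate sub-page URLs.
    fetch: callable taking a URL and returning page content (hypothetical).
    sleep: injectable so the delays can be checked without waiting.
    """
    pages = {}
    for i, (home, candidates) in enumerate(companies.items()):
        if i > 0:
            sleep(COMPANY_DELAY)
        urls = [home] + list(candidates)[:MAX_CANDIDATES]
        fetched = []
        for j, url in enumerate(urls):
            if j > 0:
                sleep(SUBPAGE_DELAY)
            fetched.append(fetch(url))
        pages[home] = fetched
    return pages


if __name__ == "__main__":
    demo = {"https://example.com": ["https://example.com/contact"]}
    print(crawl(demo, fetch=lambda url: f"<html>{url}</html>"))
```

Passing `sleep` as a parameter keeps the loop testable: a fake sleeper can record the requested delays instead of blocking.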
Hire me to build this for your stack
Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.
info@luba.media