Lead-Gen Contact Extractor

Take a list of company URLs → fetch homepage + contact / about / team pages → extract emails, phone numbers, and social handles → write a deduplicated CSV. Maps to the highest-volume Upwork lead-gen brief class: "scrape contact info from these N companies."

Built 2026-05-03 as Demo #14.

Run

. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/leadgen_contact_extractor
python extract.py                        # uses companies.txt as input
python extract.py --input my_list.txt    # custom input
python extract.py --limit 20             # cap to first 20 rows
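
For readers who want to see how the two flags could be wired up, here is a minimal sketch using argparse. It is an illustration, not the code in extract.py: the names parse_args and load_urls are hypothetical, and it assumes the input file holds one URL per line with blank lines ignored.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags shown above; defaults mirror the README."""
    parser = argparse.ArgumentParser(description="Lead-gen contact extractor")
    parser.add_argument("--input", default="companies.txt",
                        help="text file with one company URL per line")
    parser.add_argument("--limit", type=int, default=None,
                        help="process only the first N rows")
    return parser.parse_args(argv)


def load_urls(lines, limit=None):
    """Strip whitespace, drop blank lines, then apply the optional cap."""
    urls = [line.strip() for line in lines if line.strip()]
    return urls if limit is None else urls[:limit]


if __name__ == "__main__":
    args = parse_args()
    with open(args.input, encoding="utf-8") as handle:
        for url in load_urls(handle, args.limit):
            print(url)
```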

Result (8 sample SaaS companies)

8 / 8 companies processed without errors ✅
Email coverage: 4 / 8 ✅
Phone coverage: 3 / 8 ✅
Social coverage: 8 / 8 (twitter, linkedin, github, instagram, facebook) ✅
Cloudflare alone yielded 21 international phone numbers ✅
0 false-positive phones (SVG path coordinates rejected) ✅

Quality discipline

The naive version of this scraper produces garbage. A phone regex run over a JS-rendered page matches SVG path coordinates and decimal numbers (0.142857142857, 10.5582 14.7391) as "phone numbers." This extractor applies three safeguards:

Three canonical phone formats only: +CC XXX XXX XXXX, (XXX) XXX-XXXX, XXX.XXX.XXXX. Pure-decimal strings are rejected unless they match the classic US dot format exactly.
tel: href harvesting: when a site marks a number up as a tel: link, we trust it explicitly.
Email blocklist: strips sentry.io DSNs, image filenames, and JSON-escaped fragments that a typical email regex falsely matches.
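
The three rules above can be sketched roughly as follows. This is a simplified illustration rather than the code in extract.py: the regexes, the function names extract_phones and extract_emails, and the exact blocklist entries are all assumptions chosen to match the descriptions above.

```python
import re

# Assumed patterns for the three accepted phone shapes.
INTL = re.compile(r"(?<![\w+])\+\d{1,3}(?:[ .-]\d{2,4}){2,4}(?!\d)")
US_PAREN = re.compile(r"\(\d{3}\) ?\d{3}-\d{4}(?!\d)")
US_DOT = re.compile(r"(?<![\d.])\d{3}\.\d{3}\.\d{4}(?![\d.])")
# Numbers the site itself marked up as tel: links are taken as-is.
TEL_HREF = re.compile(r"""href=["']tel:([^"']+)["']""", re.IGNORECASE)

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")
JSON_ESCAPE = re.compile(r"u00[0-9a-f]{2}")  # residue of \u003e and similar


def _normalize(raw: str) -> str:
    """Keep digits only, preserving a leading + for international numbers."""
    digits = re.sub(r"\D", "", raw)
    return "+" + digits if raw.strip().startswith("+") else digits


def extract_phones(html: str) -> list[str]:
    """tel: links first, then the three strict formats, deduplicated in order."""
    candidates = TEL_HREF.findall(html)
    for pattern in (INTL, US_PAREN, US_DOT):
        candidates += pattern.findall(html)
    seen, phones = set(), []
    for raw in candidates:
        phone = _normalize(raw)
        if phone and phone not in seen:
            seen.add(phone)
            phones.append(phone)
    return phones


def extract_emails(text: str) -> list[str]:
    """Apply a typical email regex, then drop the known false positives."""
    emails = []
    for match in EMAIL.finditer(text):
        email = match.group(0).lower()
        domain = email.split("@", 1)[1]
        if domain == "sentry.io" or domain.endswith(".sentry.io"):
            continue  # Sentry DSN, not a mailbox
        if email.endswith(IMAGE_SUFFIXES):
            continue  # e.g. logo@2x.png
        if JSON_ESCAPE.match(email):
            continue  # JSON-escaped fragment glued onto the local part
        if email not in emails:
            emails.append(email)
    return emails
```

Because none of the phone patterns accepts a bare run of digits with a single decimal point, strings such as 0.142857142857 or 10.5582 14.7391 never reach the output.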

Plus per-company politeness: 0.6s between sub-pages, 1s between companies, capped at 6 contact-page candidates per site.
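
A rough sketch of that pacing, with the fetch function injected so the loop can be exercised without network access. The names crawl, fetch, and sleep are hypothetical, and the sketch assumes the cap of 6 applies to contact-page candidates only, with the homepage fetched in addition.

```python
import time
from typing import Callable

SUBPAGE_DELAY_S = 0.6   # pause between pages of the same company
COMPANY_DELAY_S = 1.0   # pause before moving on to the next company
MAX_CANDIDATES = 6      # contact-page candidates tried per site


def crawl(companies: dict[str, list[str]],
          fetch: Callable[[str], str],
          sleep: Callable[[float], None] = time.sleep) -> dict[str, list[str]]:
    """Fetch each homepage plus up to MAX_CANDIDATES sub-pages, with delays."""
    pages: dict[str, list[str]] = {}
    for i, (homepage, candidates) in enumerate(companies.items()):
        if i > 0:
            sleep(COMPANY_DELAY_S)
        bodies = []
        urls = [homepage] + candidates[:MAX_CANDIDATES]
        for j, url in enumerate(urls):
            if j > 0:
                sleep(SUBPAGE_DELAY_S)
            bodies.append(fetch(url))
        pages[homepage] = bodies
    return pages
```

Passing sleep as a parameter keeps the delays testable: a test can record the requested pauses instead of actually waiting.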

Hire me to build this for your stack

Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.

info@luba.media