Eyal Rosenthal · Web scraping at scale

PDF Invoice Extractor

PDF Invoice Extractor — Batch Extract to CSV with 100% Acceptance-Test Coverage

PDF Invoice Extractor

Extract structured data from invoice PDFs into a CSV — with arithmetic validation and field-level acceptance testing against ground truth. Maps to "extract data from these invoices", "PDF table extraction", "AP automation" briefs.

Built 2026-05-03 as Demo #12. Different scaffold from the monitor demos (#16-#26) — this is "one-shot batch extraction with validation", not "recurring scrape with diff".

Run

. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/pdf_invoice_extractor
python generate_fixtures.py     # creates 5 invoice PDFs + ground_truth.json
python extract.py --verify      # extracts, writes CSVs, exits 1 if accuracy < 100%

Result

  • 5 invoice PDFs processed ✅
  • 99 / 99 fields extracted correctly (100% accuracy) ✅
  • Per-line-item CSV (18 line items across 5 invoices) ✅
  • Per-invoice summary CSV (5 rows with totals) ✅
  • Arithmetic validation: every line_total cross-checked against qty × unit_price ✅
  • Grand-total validation: subtotal + tax_amount = total checked per invoice ✅

How it generalizes

The extractor uses a two-pass strategy that maps to most invoice formats:

  1. pdfplumber.extract_tables() for the line-item grid (structured, fixed-column).
  2. Regex over raw text for header fields (invoice no, dates, vendor, client) and totals (subtotal, tax, grand total) — robust to layout variations where pdfplumber's table detection misses rows because of inconsistent grid lines.

The validation layer (arithmetic checks + ground-truth comparison) is the differentiator vs naive extraction — it catches drift the moment a vendor changes their invoice template, instead of silently producing bad data.

Adapting to a new invoice format

  1. Add 3-5 sample PDFs from the new format to fixtures/ and update ground_truth.json.
  2. Tune the regex patterns in extract.py (typically INVOICE_NO_RE, DATE_LINE_RE, vendor/client line markers).
  3. Run python extract.py --verify — fail-loud if accuracy drops below 100%.

Hire me to build this for your stack

Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.

info@luba.media