PDF Invoice Extractor
Extract structured data from invoice PDFs into CSV files, with arithmetic validation and field-level acceptance testing against ground truth. Maps to briefs such as "extract data from these invoices", "PDF table extraction", and "AP automation".
Built 2026-05-03 as Demo #12. Different scaffold from the monitor demos (#16-#26) — this is "one-shot batch extraction with validation", not "recurring scrape with diff".
Run
. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/pdf_invoice_extractor
python generate_fixtures.py # creates 5 invoice PDFs + ground_truth.json
python extract.py --verify # extracts, writes CSVs, exits 1 if accuracy < 100%
Result
- 5 invoice PDFs processed ✅
- 99 / 99 fields extracted correctly (100% accuracy) ✅
- Per-line-item CSV (18 line items across 5 invoices) ✅
- Per-invoice summary CSV (5 rows with totals) ✅
- Arithmetic validation: every line_total cross-checked against qty × unit_price ✅
- Grand-total validation: subtotal + tax_amount = total checked per invoice ✅
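The two arithmetic checks listed above can be sketched roughly as follows. This is a minimal illustration, not the code in extract.py: the function names, the dictionary keys (`line_items`, `qty`, `unit_price`, `line_total`, `subtotal`, `tax_amount`, `total`) and the one-cent tolerance are all assumptions made for the example.

```python
from decimal import Decimal

# Assumed rounding tolerance of one cent; the real extractor may be stricter.
TOLERANCE = Decimal("0.01")


def check_line_item(qty, unit_price, line_total):
    """Return True when qty * unit_price matches line_total within tolerance."""
    expected = Decimal(qty) * Decimal(unit_price)
    return abs(expected - Decimal(line_total)) <= TOLERANCE


def check_invoice_totals(subtotal, tax_amount, total):
    """Return True when subtotal + tax_amount matches total within tolerance."""
    expected = Decimal(subtotal) + Decimal(tax_amount)
    return abs(expected - Decimal(total)) <= TOLERANCE


def validate_invoice(invoice):
    """Collect readable error messages for one invoice (hypothetical dict layout)."""
    errors = []
    for index, item in enumerate(invoice["line_items"], start=1):
        if not check_line_item(item["qty"], item["unit_price"], item["line_total"]):
            errors.append(f"line {index}: line_total != qty * unit_price")
    if not check_invoice_totals(
        invoice["subtotal"], invoice["tax_amount"], invoice["total"]
    ):
        errors.append("totals: subtotal + tax_amount != total")
    return errors
```

Amounts are passed as strings and converted to `Decimal` so that values such as 19.99 are compared exactly rather than through binary floating point.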
How it generalizes
The extractor uses a two-pass strategy that maps to most invoice formats:
- pdfplumber.extract_tables() for the line-item grid (structured, fixed-column).
- Regex over raw text for header fields (invoice no, dates, vendor, client) and totals (subtotal, tax, grand total) — robust to layout variations where pdfplumber's table detection misses rows because of inconsistent grid lines.
The validation layer (arithmetic checks + ground-truth comparison) is the differentiator vs naive extraction — it catches drift the moment a vendor changes their invoice template, instead of silently producing bad data.
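As a rough illustration of the second pass, the sketch below pulls header fields and totals out of raw text with regular expressions. The names INVOICE_NO_RE and DATE_LINE_RE come from extract.py, but the patterns shown here, `TOTALS_RE`, `TOTAL_KEYS`, `extract_header_fields` and the sample text are hypothetical stand-ins. In practice the text would come from the PDF (for example via pdfplumber's `page.extract_text()`); a literal string is used so the example runs without a PDF.

```python
import re

# Hypothetical patterns: only the names INVOICE_NO_RE and DATE_LINE_RE are
# taken from the project; the expressions themselves are illustrative guesses.
INVOICE_NO_RE = re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE)
DATE_LINE_RE = re.compile(
    r"(Invoice Date|Due Date)\s*:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE
)
TOTALS_RE = re.compile(
    r"^(Subtotal|Tax|Total)\b.*?([\d,]+\.\d{2})\s*$", re.IGNORECASE | re.MULTILINE
)

# Maps the label printed on the invoice to an output field name (assumed names).
TOTAL_KEYS = {"subtotal": "subtotal", "tax": "tax_amount", "total": "total"}

# Made-up sample text standing in for one page of extracted invoice text.
SAMPLE_TEXT = """Acme Supplies Ltd
Invoice No: INV-2026-0042
Invoice Date: 2026-04-28
Due Date: 2026-05-28
Subtotal: 1,250.00
Tax (8%): 100.00
Total: 1,350.00
"""


def extract_header_fields(text):
    """Return header and totals fields found in the raw text as strings."""
    fields = {}
    match = INVOICE_NO_RE.search(text)
    if match:
        fields["invoice_no"] = match.group(1)
    for label, value in DATE_LINE_RE.findall(text):
        # "Invoice Date" becomes "invoice_date", "Due Date" becomes "due_date".
        fields[label.lower().replace(" ", "_")] = value
    for label, amount in TOTALS_RE.findall(text):
        # Drop thousands separators so the amount parses cleanly later.
        fields[TOTAL_KEYS[label.lower()]] = amount.replace(",", "")
    return fields


if __name__ == "__main__":
    print(extract_header_fields(SAMPLE_TEXT))
```

Because these fields are matched line by line in the text, they do not depend on the table detector finding grid lines around them.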
Adapting to a new invoice format
- Add 3-5 sample PDFs from the new format to fixtures/ and update ground_truth.json.
- Tune the regex patterns in extract.py (typically INVOICE_NO_RE, DATE_LINE_RE, vendor/client line markers).
- Run python extract.py --verify, which fails loudly if accuracy drops below 100%.
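The field-level comparison behind --verify could look something like the sketch below. The layout of ground_truth.json is not shown in this README, so the mapping of invoice id to a flat dict of field values is an assumption, as are the function names `field_accuracy` and `verify`.

```python
def field_accuracy(extracted, ground_truth):
    """Compare extracted values to ground truth, field by field.

    Both arguments are assumed to map invoice id -> {field name: value}.
    Returns (correct_count, total_count, mismatches).
    """
    correct = 0
    total = 0
    mismatches = []
    for invoice_id, expected_fields in ground_truth.items():
        actual_fields = extracted.get(invoice_id, {})
        for field, expected in expected_fields.items():
            total += 1
            actual = actual_fields.get(field)
            if actual == expected:
                correct += 1
            else:
                mismatches.append((invoice_id, field, expected, actual))
    return correct, total, mismatches


def verify(extracted, ground_truth):
    """Print a report and return a process exit code: 0 only at 100% accuracy."""
    correct, total, mismatches = field_accuracy(extracted, ground_truth)
    for invoice_id, field, expected, actual in mismatches:
        print(f"MISMATCH {invoice_id}.{field}: expected {expected!r}, got {actual!r}")
    print(f"{correct} / {total} fields extracted correctly")
    # An empty ground truth counts as a failure rather than a silent pass.
    return 0 if total and correct == total else 1
```

A caller would pass the result to `sys.exit(verify(extracted, ground_truth))`, so a template change that breaks even one field fails the run instead of producing bad CSVs quietly.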
Hire me to build this for your stack
Same patterns, your invoice formats. Send the brief and I'll quote fixed-price within 24 hours.
info@luba.media