Extracting line items from commercial invoices with pdfplumber
Commercial invoices are the primary valuation and classification documents for cross-border trade, yet their structural heterogeneity routinely breaks naive text-scraping pipelines. Within modern Document Ingestion & Parsing Workflows, line-item extraction is the choke point that determines the quality of downstream HS code classification, duty assessment, and origin verification. Trade compliance officers and logistics developers routinely encounter invoices where column boundaries are implied rather than explicit, where scanned pages (once OCR has added a text layer) introduce coordinate drift, and where mixed-script text corrupts numeric fields. A production-grade extraction pipeline should rely on spatially aware parsing, coordinate clustering, and deterministic validation rather than on regex heuristics applied to flattened text. pdfplumber exposes each page's characters, lines, and rectangles together with their coordinates, enabling customs ETL teams to reconstruct tabular data without relying on fragile DOM-like abstractions.
Spatial Parsing Architecture
The foundational challenge in commercial invoice parsing lies in mapping visual layout to logical data structures. pdfplumber exposes page-level primitives through page.chars, page.lines, and page.rects. Commercial invoices rarely embed semantic table tags. Instead, they rely on horizontal rules, vertical gutters, and whitespace to delineate line items. The extraction algorithm must first identify these boundaries, then project character coordinates into a normalized grid.
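As a rough sketch of the boundary-detection step, the snippet below splits line primitives into horizontal rules and vertical gutters. It assumes each line is a dict with x0, x1, top, and bottom keys, which is the shape pdfplumber uses for entries in page.lines. The split_rules name and the one-point tol default are illustrative choices, not part of the library.

```python
from typing import Dict, List, Tuple


def split_rules(lines: List[Dict], tol: float = 1.0) -> Tuple[List[float], List[float]]:
    """Return (row_ys, column_xs) from line primitives shaped like page.lines."""
    row_ys: List[float] = []
    column_xs: List[float] = []
    for ln in lines:
        width = abs(ln["x1"] - ln["x0"])
        height = abs(ln["bottom"] - ln["top"])
        if height <= tol < width:
            row_ys.append(ln["top"])    # flat and wide: horizontal rule
        elif width <= tol < height:
            column_xs.append(ln["x0"])  # thin and tall: vertical gutter
    return sorted(row_ys), sorted(column_xs)


# Synthetic primitives standing in for page.lines
sample = [
    {"x0": 40, "x1": 560, "top": 200, "bottom": 200},
    {"x0": 300, "x1": 300, "top": 200, "bottom": 420},
    {"x0": 120, "x1": 120, "top": 200, "bottom": 420},
]
rows_y, cols_x = split_rules(sample)
```

Invoices that draw their table borders as rectangles would need the same classification applied to the edges of page.rects.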
We implement a coordinate-aware clustering routine that groups text into rows based on vertical proximity, then segments columns using detected vertical lines or whitespace thresholds. This approach decouples layout detection from data normalization, ensuring deterministic output regardless of vendor formatting variations.
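The two-stage idea can be sketched on plain dictionaries before touching a real PDF. The example below assumes characters shaped like pdfplumber's page.chars entries (only text, x0, and top are used) and a sorted list of vertical boundary x-positions. build_grid is a hypothetical helper name, and the whitespace-threshold variant is left to the full pipeline.

```python
from bisect import bisect_right
from typing import Dict, List


def build_grid(chars: List[Dict], column_xs: List[float], y_tol: float = 3.0) -> List[List[str]]:
    """Cluster chars into rows by vertical proximity, then bucket them into columns."""
    rows: List[List[Dict]] = []
    for ch in sorted(chars, key=lambda c: c["top"]):
        if rows and ch["top"] - rows[-1][-1]["top"] <= y_tol:
            rows[-1].append(ch)   # close enough to the previous char: same row
        else:
            rows.append([ch])     # vertical jump: start a new row

    grid: List[List[str]] = []
    for row in rows:
        cells = [""] * (len(column_xs) + 1)
        # Sort left to right so each cell's text reads in visual order
        for ch in sorted(row, key=lambda c: c["x0"]):
            cells[bisect_right(column_xs, ch["x0"])] += ch["text"]
        grid.append([cell.strip() for cell in cells])
    return grid


chars = [
    {"text": "A", "x0": 50, "top": 100.0},
    {"text": "1", "x0": 130, "top": 100.4},
    {"text": "B", "x0": 50, "top": 115.0},
    {"text": "2", "x0": 130, "top": 114.8},
]
grid = build_grid(chars, column_xs=[120])
```

Keeping row clustering and column bucketing as separate passes means either one can be swapped out (for example, whitespace gaps instead of ruled lines) without touching the other.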
Production-Grade Extraction Pipeline
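The following implementation sketches an extraction routine intended for production use. It handles multi-page invoices, applies Unicode normalization for multi-language invoice parsing, and emits typed records that can be mapped onto customs filing schemas. It assumes a fixed column order (line number, description, HS code, quantity, unit price) and defaults the unit to PCS, so adjust the column indices for each vendor template. Explicit type hints and structured logging ensure traceability during audit reviews.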
import logging
import re
import unicodedata
from dataclasses import dataclass
from typing import Dict, List, Optional

import pdfplumber
from pdfminer.pdfparser import PDFSyntaxError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger(__name__)


@dataclass
class LineItem:
    line_no: int
    description: str
    hs_code: str
    quantity: float
    unit: str
    unit_price: float
    total_price: float
    origin: Optional[str] = None


class CommercialInvoiceExtractor:
    def __init__(self, x_tol: float = 5.0, y_tol: float = 3.0, min_line_width: float = 20.0):
        self.x_tol = x_tol
        self.y_tol = y_tol
        self.min_line_width = min_line_width
        # 8 digits with optional dots between pairs (84.71.30.00), or 4 to 10 bare digits
        self.hs_pattern = re.compile(r"^(?:\d{2}\.?\d{2}\.?\d{2}\.?\d{2}|\d{4,10})$")
        # Strips everything except digits, separators, and the minus sign
        self.currency_pattern = re.compile(r"[^\d.,-]")

    def _normalize_text(self, text: str) -> str:
        """Handles multi-language invoice parsing via NFKC normalization."""
        return unicodedata.normalize("NFKC", text).strip()

    def _extract_horizontal_lines(self, page) -> List[Dict]:
        # Vertical lines have x0 == x1, so the width filter drops them
        return [ln for ln in page.lines if abs(ln["x0"] - ln["x1"]) > self.min_line_width]

    def _cluster_rows(self, chars: List[Dict]) -> List[List[Dict]]:
        """Group characters into rows when their `top` values differ by at most y_tol."""
        if not chars:
            return []
        sorted_chars = sorted(chars, key=lambda c: (c["top"], c["x0"]))
        rows: List[List[Dict]] = []
        current_row: List[Dict] = [sorted_chars[0]]
        for char in sorted_chars[1:]:
            if abs(char["top"] - current_row[-1]["top"]) <= self.y_tol:
                current_row.append(char)
            else:
                rows.append(current_row)
                current_row = [char]
        rows.append(current_row)
        return rows

    def _segment_columns(self, row_chars: List[Dict], h_lines: List[Dict]) -> List[str]:
        """Split a row into columns wherever the horizontal gap is unusually wide.

        h_lines is accepted for rule-based segmentation but is not used yet.
        """
        # Sort left to right: characters in one row may differ slightly in `top`,
        # which would otherwise scramble their reading order.
        ordered = sorted(row_chars, key=lambda c: c["x0"])
        x_coords = [c["x0"] for c in ordered]
        if len(x_coords) < 2:
            return [self._normalize_text("".join(c["text"] for c in ordered))]
        gaps = [x_coords[i + 1] - x_coords[i] for i in range(len(x_coords) - 1)]
        # A gap wider than 2.5x the mean gap is treated as a column boundary
        threshold = sum(gaps) / len(gaps) * 2.5
        boundaries = [0.0]
        for i, gap in enumerate(gaps):
            if gap > threshold:
                boundaries.append(x_coords[i] + gap / 2)
        boundaries.append(float("inf"))
        columns = [""] * (len(boundaries) - 1)
        for char in ordered:
            for i in range(len(boundaries) - 1):
                if boundaries[i] <= char["x0"] < boundaries[i + 1]:
                    columns[i] += char["text"]
                    break
        return [self._normalize_text(c) for c in columns]

    def _parse_numeric(self, val: str) -> float:
        # Assumes ',' is a thousands separator; '1.234,56' style values need other handling
        cleaned = self.currency_pattern.sub("", val).replace(",", "")
        try:
            return float(cleaned)
        except ValueError:
            logger.warning(f"Could not parse numeric value {val!r}; defaulting to 0.0")
            return 0.0

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        retry=retry_if_exception_type((IOError, PDFSyntaxError)),
        before_sleep=lambda retry_state: logger.warning(
            f"Retry {retry_state.attempt_number} | {retry_state.outcome.exception()}"
        ),
    )
    def extract(self, pdf_path: str) -> List[LineItem]:
        items: List[LineItem] = []
        logger.info(f"Initializing extraction pipeline for {pdf_path}")
        with pdfplumber.open(pdf_path) as pdf:
            for page_num, page in enumerate(pdf.pages, start=1):
                logger.info(f"Processing page {page_num} | {len(page.chars)} chars detected")
                h_lines = self._extract_horizontal_lines(page)
                rows = self._cluster_rows(page.chars)
                for row_chars in rows:
                    cols = self._segment_columns(row_chars, h_lines)
                    if len(cols) < 4:
                        continue
                    # Assumed column order: line no, description, HS code, quantity, unit price
                    desc = cols[1]
                    hs_candidate = cols[2]
                    if not self.hs_pattern.match(hs_candidate):
                        continue  # Not a line-item row (header, footer, totals)
                    quantity = self._parse_numeric(cols[3])
                    unit_price = self._parse_numeric(cols[4] if len(cols) > 4 else "0")
                    items.append(LineItem(
                        line_no=len(items) + 1,
                        description=desc,
                        hs_code=hs_candidate,
                        quantity=quantity,
                        unit="PCS",  # Placeholder default; the unit column is not parsed here
                        unit_price=unit_price,
                        total_price=quantity * unit_price,  # Derived, not read from the invoice
                    ))
        logger.info(f"Extraction complete. {len(items)} line items parsed.")
        return items
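The _parse_numeric helper strips every comma before conversion, which suits 1,234.56 but misreads European-style amounts such as 1.234,56. One possible alternative is sketched below. The parse_amount name and the "three digits after a lone comma means thousands" rule are assumptions made for illustration, and genuinely ambiguous values such as 1.234 are still read as decimals.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

_NON_NUMERIC = re.compile(r"[^\d.,-]")


def parse_amount(raw: str) -> Optional[Decimal]:
    """Parse '1,234.56' or '1.234,56' style amounts; return None if unparseable."""
    cleaned = _NON_NUMERIC.sub("", raw)
    last_dot, last_comma = cleaned.rfind("."), cleaned.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # Both separators present: the right-most one is the decimal mark
        if last_comma > last_dot:
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif last_comma != -1:
        tail = cleaned[last_comma + 1:]
        if cleaned.count(",") == 1 and len(tail) != 3:
            cleaned = cleaned.replace(",", ".")  # '12,5' -> '12.5'
        else:
            cleaned = cleaned.replace(",", "")   # '1,234' or '1,234,567'
    elif cleaned.count(".") > 1:
        cleaned = cleaned.replace(".", "")       # '1.234.567' -> '1234567'
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


us_style = parse_amount("USD 1,234.56")
eu_style = parse_amount("1.234,56 €")
missing = parse_amount("n/a")
```

Returning None instead of 0.0 lets the caller distinguish an unreadable cell from a genuine zero, and Decimal avoids binary floating-point rounding in monetary totals.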
Debugging & Calculation Verification
Deterministic extraction requires explicit validation against trade compliance standards. Follow these debugging steps to verify coordinate mapping and arithmetic integrity:
- Coordinate Drift Calibration: Export page.chars to a CSV and plot x0 against top in a scatter plot. Verify that row clusters fall within the y_tol threshold. As starting points, try a y_tol of about 2.5 for high-DPI scans and about 4.0 for compressed PDFs.
- Column Boundary Validation: Log boundaries from _segment_columns and compare the detected gaps against known invoice templates. If adjacent columns merge, lower the whitespace multiplier below 2.5 so that narrower gaps count as boundaries; if a single column splits apart, raise it toward 3.0.
- Arithmetic Checksums: Calculate Σ(unit_price × quantity) and compare it against the invoice subtotal. Allow an absolute tolerance of ±0.02 to account for currency rounding, and flag discrepancies exceeding 0.5% of the subtotal for manual review.
- HS Code Validation: Cross-reference extracted codes against the WCO Harmonized System Nomenclature. Reject entries with fewer than 6 digits or with non-numeric characters other than separator dots.
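The checksum and HS format checks can be sketched as two small functions. The function names, the three-way status labels, and the 10-digit upper bound are assumptions made for this example. The HS check validates format only, since cross-referencing the WCO nomenclature requires a reference dataset that is not shown here.

```python
import re
from decimal import Decimal
from typing import Iterable, Tuple

HS_DIGITS = re.compile(r"\d{6,10}")


def is_valid_hs_format(code: str) -> bool:
    """Format check only: 6 to 10 digits once dots and spaces are removed."""
    return HS_DIGITS.fullmatch(code.replace(".", "").replace(" ", "")) is not None


def check_subtotal(
    lines: Iterable[Tuple[Decimal, Decimal]],
    stated_subtotal: Decimal,
    abs_tol: Decimal = Decimal("0.02"),
    review_ratio: Decimal = Decimal("0.005"),
) -> str:
    """Compare the sum of quantity * unit_price against the invoice subtotal."""
    computed = sum((qty * price for qty, price in lines), Decimal("0"))
    diff = abs(computed - stated_subtotal)
    if diff <= abs_tol:
        return "ok"             # within currency-rounding tolerance
    if stated_subtotal and diff / abs(stated_subtotal) > review_ratio:
        return "manual_review"  # more than 0.5% off the stated subtotal
    return "warn"               # outside tolerance but under the review threshold


# (quantity, unit_price) pairs: 10 * 2.50 + 3 * 19.99 = 84.97
lines = [(Decimal("10"), Decimal("2.50")), (Decimal("3"), Decimal("19.99"))]
status_close = check_subtotal(lines, Decimal("84.98"))
status_far = check_subtotal(lines, Decimal("90.00"))
```

Decimal is used so that the ±0.02 comparison is not distorted by binary floating-point error.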
Scaling & Resilience Patterns
High-volume customs clearance demands asynchronous execution and fault isolation. Wrap the extractor in an async queue to enable Async Batch Processing for High Volume ingestion pipelines. Use asyncio.gather() with semaphore limits to prevent memory exhaustion during peak filing windows.
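A minimal sketch of the semaphore-bounded pattern follows. It assumes Python 3.9+ for asyncio.to_thread; extract_batch and fake_extract are hypothetical names, with fake_extract standing in for the real extractor's extract method.

```python
import asyncio
from typing import Callable, List, Sequence


async def extract_batch(
    paths: Sequence[str],
    extract_fn: Callable[[str], list],
    max_concurrent: int = 4,
) -> List[object]:
    """Run a blocking extractor over many files with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(path: str):
        async with semaphore:
            # Blocking PDF parsing runs in a worker thread so the event loop stays responsive
            return await asyncio.to_thread(extract_fn, path)

    # return_exceptions=True keeps one bad invoice from cancelling the rest of the batch
    return await asyncio.gather(*(worker(p) for p in paths), return_exceptions=True)


def fake_extract(path: str) -> list:
    """Stand-in for a real extractor; fails on one file to show error isolation."""
    if path.endswith("bad.pdf"):
        raise ValueError(f"cannot parse {path}")
    return [path.upper()]


results = asyncio.run(extract_batch(["a.pdf", "bad.pdf", "c.pdf"], fake_extract, max_concurrent=2))
```

The semaphore caps how many PDFs are open at once, which is what bounds memory. Because PDF parsing is largely CPU-bound, a process pool may give better throughput than threads; that trade-off is worth measuring for a given workload.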
Implement Error Handling & Retry Logic with exponential backoff for transient I/O failures. Integrate an Emergency Pause & Circuit Breaker Logic module that monitors failure rates. If extraction failures exceed 5% of a batch, open the circuit, halt the queue, and route the remaining payloads to a quarantine directory for forensic analysis. This prevents corrupted invoices from cascading into downstream duty calculation engines.
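The failure-rate rule can be illustrated with a small in-memory breaker. BatchCircuitBreaker is a hypothetical class, and the sketch assumes the 5% threshold applies to cumulative failures within one batch of known size. Moving files into a quarantine directory is represented here by collecting indices.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BatchCircuitBreaker:
    batch_size: int
    max_failure_ratio: float = 0.05
    failures: int = 0
    is_open: bool = False

    def record(self, success: bool) -> None:
        """Count a result and open the circuit once failures exceed the ratio."""
        if not success:
            self.failures += 1
            if self.failures / self.batch_size > self.max_failure_ratio:
                self.is_open = True


# Simulated batch of 40 invoices: 5% of 40 is 2, so the third failure opens the circuit
outcomes = [True] * 10 + [False] * 3 + [True] * 27
breaker = BatchCircuitBreaker(batch_size=len(outcomes))
processed = 0
quarantined: List[int] = []
for idx, ok in enumerate(outcomes):
    if breaker.is_open:
        quarantined.append(idx)  # queue halted: remaining payloads go to quarantine
        continue
    breaker.record(ok)
    processed += 1
```

In a real deployment the breaker state would typically live outside the worker process so that all consumers of the queue observe the same open or closed state.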
Compliance & Multi-Language Normalization
Trade documentation frequently mixes Latin, Cyrillic, and CJK scripts. Apply NFKC Unicode normalization before parsing to resolve ligature corruption and full-width character drift. Before downstream submission, map currency symbols to ISO 4217 currency codes and units of measure to UN/ECE Recommendation No. 20 codes.
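The effect of NFKC on typical invoice text is easy to demonstrate. The lookup tables below are deliberately tiny, illustrative subsets (a real pipeline would load the full code lists), and normalize_field and to_rec20 are hypothetical helper names.

```python
import unicodedata
from typing import Optional

# Illustrative subsets only; load the complete code lists in a real pipeline
UNIT_TO_REC20 = {"PCS": "H87", "KG": "KGM", "L": "LTR", "M": "MTR"}
SYMBOL_TO_ISO4217 = {"US$": "USD", "€": "EUR", "£": "GBP"}


def normalize_field(text: str) -> str:
    """Fold compatibility forms (full-width digits, ligatures) into plain equivalents."""
    return unicodedata.normalize("NFKC", text).strip()


def to_rec20(unit: str) -> Optional[str]:
    """Map a free-text unit to a UN/ECE Recommendation No. 20 code, or None if unknown."""
    return UNIT_TO_REC20.get(normalize_field(unit).upper().rstrip("."))


full_width = normalize_field("１２３．４５")  # full-width digits and full stop
ligature = normalize_field("ﬁtting")        # U+FB01 'fi' ligature
unit_code = to_rec20("ｐｃｓ")               # full-width Latin letters
```

Normalizing before the numeric and HS-code patterns run matters because the regex and float conversion steps expect ASCII separators.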
Align extracted SKUs and gross/net weights with Packing List Data Normalization routines to reconcile declared quantities against physical manifests. Discrepancies between commercial invoice quantities and the corresponding packing list quantities or weights can trigger holds under ACE/ATLAS validation rules.
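A reconciliation pass might look like the following sketch. It assumes both documents have already been reduced to SKU-to-quantity mappings, which the extractor above does not produce by itself; reconcile_quantities and the sample SKUs are hypothetical.

```python
from typing import Dict, List


def reconcile_quantities(
    invoice_qty: Dict[str, float],
    packing_qty: Dict[str, float],
    tol: float = 1e-6,
) -> List[str]:
    """Return SKUs whose quantities disagree or that appear on only one document."""
    mismatched: List[str] = []
    for sku in sorted(set(invoice_qty) | set(packing_qty)):
        inv = invoice_qty.get(sku)
        pack = packing_qty.get(sku)
        if inv is None or pack is None or abs(inv - pack) > tol:
            mismatched.append(sku)
    return mismatched


invoice = {"SKU-1": 10, "SKU-2": 5, "SKU-3": 2}
packing = {"SKU-1": 10, "SKU-2": 4, "SKU-4": 1}
holds = reconcile_quantities(invoice, packing)
```

Iterating over the union of keys catches items that are missing from either document, not only those with differing quantities.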
For enterprise deployments, integrate spatial extraction into a dedicated Commercial Invoice PDF Extraction microservice. Maintain strict audit trails by logging coordinate bounding boxes, parsing confidence scores, and validation checksums. This architecture satisfies CBP and EU customs data retention mandates while enabling automated duty forecasting and origin verification.
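One way to record the audit trail is a JSON line per extracted item, as sketched below. The record fields and function names are assumptions for illustration. The bounding box is derived from character dicts carrying x0, x1, top, and bottom, as pdfplumber's page.chars entries do. A confidence score is omitted because the extractor above does not compute one.

```python
import json
from typing import Dict, List, Tuple


def bounding_box(chars: List[Dict]) -> Tuple[float, float, float, float]:
    """Smallest box (x0, top, x1, bottom) enclosing all characters of a row."""
    return (
        min(c["x0"] for c in chars),
        min(c["top"] for c in chars),
        max(c["x1"] for c in chars),
        max(c["bottom"] for c in chars),
    )


def audit_record(pdf_path: str, page_number: int, line_no: int,
                 chars: List[Dict], checksum_ok: bool) -> str:
    """Serialize one line item's provenance as a JSON string for an append-only log."""
    return json.dumps({
        "source": pdf_path,
        "page": page_number,
        "line_no": line_no,
        "bbox": bounding_box(chars),
        "checksum_ok": checksum_ok,
    }, sort_keys=True)


row_chars = [
    {"x0": 50.0, "x1": 56.0, "top": 100.0, "bottom": 110.0},
    {"x0": 56.0, "x1": 62.5, "top": 100.5, "bottom": 110.5},
]
record = audit_record("invoice_001.pdf", 1, 1, row_chars, checksum_ok=True)
```

Storing the bounding box alongside each value lets a reviewer locate the exact region of the source page that produced a disputed figure.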