Why does regex-only extraction fail on commercial invoices?

Commercial invoices have extreme layout variance across shippers and jurisdictions. Merged cells, multi-line descriptions, and nested subtotals shift field positions, so a fixed regex captures the wrong column. Coordinate-based extraction anchors to the table's spatial geometry instead, preserving relationships that text patterns lose.

How is an HS code validated during PDF extraction?

The Pydantic schema enforces the WCO nomenclature digit-length rule: a 6-digit base with optional 8- or 10-digit national subdivisions, never 7 or 9 digits. A value that fails the pattern raises at the boundary and never reaches the classification engine.

What happens to an invoice the extractor cannot parse deterministically?

It is routed, not dropped. A typed ExtractionError carries the source hash and failing stage to a quarantine queue; scans with marginal OCR confidence escalate to drift correction, and unresolved lines reach a human-in-the-loop gate where a licensed broker confirms the classification before filing.

Commercial Invoice PDF Extraction

Commercial invoice PDF extraction is the deterministic ingestion stage that turns a shipper’s PDF into a schema-validated payload the rest of a customs pipeline can trust. It sits inside the Document Ingestion & Parsing Workflows reference architecture, upstream of every tariff and duty engine, and it must be ruthlessly reproducible: given the same source file and the same parser version, it must always emit the same invoice number, declared transaction value, Incoterms, country of origin, and line-item table — or route the document to an exception queue with a machine-readable reason. For licensed brokers, that reproducibility is what makes an extracted value defensible under a CBP Focused Assessment; for Python ETL teams, it is the property that lets a re-run reconstruct a filed entry cent-for-cent from a fixed input.

Problem Framing: Why Regex Extraction Breaks on Invoices

A commercial invoice is a legal declaration of value under 19 CFR §141.86, but it arrives in no canonical schema. Every shipper, freight forwarder, and origin-country agent lays out its own template, and the same trade lane will send native text PDFs one day and flattened scans the next. Three failure modes dominate any pipeline that treats extraction as string-scraping rather than a compliance boundary:

Layout collapse. Regex-only parsers assume fixed field positions. Merged cells, multi-line product descriptions, and nested subtotals shift those positions, so a pattern that matched last week’s invoice silently captures the wrong column this week — a unit_price read as an extended_value, or a freight line folded into a commodity row.
Coordinate drift on scans. Rasterized invoices routed through optical character recognition inherit sub-pixel skew and DPI variance. A column boundary that held at 300 DPI drifts at 200 DPI, and text blocks migrate into the wrong bucket. The detection and repair of that drift is owned by OCR Drift Correction & Validation; this stage’s job is to hand it clean coordinate data or a typed failure.
Numeric corruption. A locale that writes 1.234,56 for one thousand two hundred thirty-four and fifty-six hundredths, or a multi-language invoice that uses a non-breaking space as its group separator, defeats naive parsing: float() raises on 1.234,56, and a separator stripped the wrong way silently rescales the declared value. A single wrong digit in a customs value propagates straight into the duty base and a rejected ABI filing.

The extraction layer answers all three by anchoring to spatial geometry instead of text patterns, validating every field against the harmonized contract before it leaves the stage, and diverting anything it cannot resolve deterministically rather than guessing.

Schema / Data Contract

The stage’s output contract is a pair of Pydantic models. InvoiceLineItem is the validated unit of a line; CommercialInvoicePayload is the immutable envelope that binds those lines to their source document. Every constraint on these models exists to satisfy a later audit question — which digit-length rule the HS code passed, which ISO standard the currency and origin conformed to, and which source file produced the record.

import re
import hashlib
import logging
from typing import Optional
from pathlib import Path
from datetime import datetime, timezone

import pdfplumber
from pdfminer.pdfparser import PDFSyntaxError
from pydantic import BaseModel, Field, ValidationError, field_validator

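# Set the level explicitly: the default (WARNING) would suppress the logger.info calls below.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s %(message)s")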
logger = logging.getLogger(__name__)

# WCO/HTSUS nomenclature: 6-digit base, with 8- or 10-digit national subdivisions — never 7 or 9.
HS_CODE_PATTERN = r"^\d{6}(?:\d{2}|\d{4})?$"


class InvoiceLineItem(BaseModel):
    line_number: int
    description: str
    quantity: float
    unit_of_measure: str
    unit_price: float
    extended_value: float
    hs_code: str = Field(..., pattern=HS_CODE_PATTERN)
    country_of_origin: str = Field(..., pattern=r"^[A-Z]{2}$")  # ISO 3166-1 alpha-2
    extraction_confidence: float = Field(ge=0.0, le=1.0)

    @field_validator("hs_code")
    @classmethod
    def validate_hs_format(cls, v: str) -> str:
        # Defence in depth behind the Field pattern: enforce the WCO HS digit-length rule
        # before the code reaches classification. fullmatch also rejects a trailing newline.
        if not re.fullmatch(HS_CODE_PATTERN, v):
            raise ValueError(f"Invalid HTS/HS code format: {v}")
        return v


class CommercialInvoicePayload(BaseModel):
    invoice_number: str
    issue_date: datetime
    currency: str = Field(..., pattern=r"^[A-Z]{3}$")  # ISO 4217
    incoterms: Optional[str]
    line_items: list[InvoiceLineItem]
    document_hash: str
    extraction_timestamp: datetime = Field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

The hs_code pattern rejects the 7- and 9-digit values that a mis-parsed column produces, country_of_origin and currency are held to the ISO 3166-1 alpha-2 and ISO 4217 code shapes, and extraction_confidence carries the per-line certainty a reviewer needs when a scan is marginal. A payload that violates any constraint never reaches a classification engine; it raises at the boundary.
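The digit-length rule can be exercised with the standard library alone. The following is a minimal sketch, not part of the pipeline: the helper name `is_valid_hs_length` and the sample codes are illustrative assumptions.

```python
import re

# Same pattern as HS_CODE_PATTERN above: 6 digits, optionally extended to 8 or 10.
HS_CODE_PATTERN = r"^\d{6}(?:\d{2}|\d{4})?$"


def is_valid_hs_length(code: str) -> bool:
    """Return True when the code is exactly 6, 8, or 10 ASCII digits."""
    # fullmatch requires the whole string to match, so a trailing newline fails;
    # re.ASCII keeps \d from accepting non-ASCII digit characters.
    return re.fullmatch(HS_CODE_PATTERN, code, flags=re.ASCII) is not None


# Illustrative inputs: three well-formed lengths, two mis-parsed lengths, one dotted form.
samples = ["847130", "84713001", "8471300100", "8471300", "847130010", "8471.30"]
results = {code: is_valid_hs_length(code) for code in samples}
```

Only the first three samples pass. The dotted form is rejected as well, so any punctuation a shipper prints inside the code has to be stripped before validation.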

Step-by-Step Implementation

The extractor is a linear pipeline. Each stage below states its purpose, inputs, outputs, and the error condition that diverts a document rather than corrupting the entry.

Stage 1 — Fingerprint and route

Purpose: bind every downstream field to an immutable source identity, and split native PDFs from scans so heavy OCR only runs where it is needed. Input: a raw file path. Output: a SHA-256 hash plus a routing decision. Error condition: a non-PDF MIME type or an unreadable stream is rejected before any parsing begins. When documents arrive in high-frequency batches, this routing decision is delegated to the Async Batch Processing for High Volume framework so memory-intensive OCR never blocks synchronous API consumers.

class ExtractionError(Exception):
    """Typed pipeline failure carrying the source hash and failing stage."""

    def __init__(self, message: str, doc_hash: str, stage: str):
        super().__init__(message)
        self.doc_hash = doc_hash
        self.stage = stage


def compute_sha256(file_path: Path) -> str:
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


def has_text_layer(pdf: "pdfplumber.PDF") -> bool:
    # A native text layer means we can skip OCR entirely and extract coordinates directly.
    return any((page.extract_text() or "").strip() for page in pdf.pages[:2])
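The helpers above do not show the non-PDF rejection named in the error condition. One way it might look, as a sketch under assumptions: the function name `fingerprint_and_check` is hypothetical, and the check is a strict match on the leading `%PDF-` bytes rather than a full MIME sniff. Some readers tolerate a few bytes before the header; this version rejects such files.

```python
import hashlib
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # header bytes at the start of a well-formed PDF file


def fingerprint_and_check(file_path: Path) -> str:
    """Reject input that does not start with the PDF header, then return its SHA-256."""
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        head = f.read(len(PDF_MAGIC))
        if head != PDF_MAGIC:
            # Fail before any parser touches the stream.
            raise ValueError(f"Not a PDF (leading bytes {head!r}): {file_path.name}")
        sha256.update(head)
        # Hash the remainder in fixed-size chunks so large files stay off the heap.
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()
```

Because the header bytes are fed into the digest before the remaining chunks, the result is the same hash `compute_sha256` would produce for the file.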

Stage 2 — Coordinate-aware table extraction

Purpose: reconstruct the line-item table from spatial geometry, not text patterns, so merged cells and multi-line descriptions survive. Input: the fingerprinted PDF and a minimum confidence threshold. Output: a list of validated InvoiceLineItem records. Error condition: no detectable table, a truncated row, or an unparseable numeric field raises a typed ExtractionError carrying the exact stage. A deeper treatment of coordinate clustering and column anchoring lives in Extracting line items from commercial invoices with pdfplumber.

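def _to_decimal_str(raw: str) -> float:
    # Locale-safe conversion: strip thousands separators and normalize the decimal mark.
    cleaned = str(raw).strip().replace("\xa0", "").replace(" ", "")
    if "," in cleaned and "." in cleaned:
        # Both marks present: the right-most one is the decimal mark.
        if cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")  # 1.234,56 -> 1234.56
        else:
            cleaned = cleaned.replace(",", "")                    # 1,234.56 -> 1234.56
    elif "," in cleaned:
        # Comma only: treated as a decimal mark (1234,56 -> 1234.56). A bare "1,234"
        # is ambiguous and needs the shipper's locale to resolve.
        cleaned = cleaned.replace(",", ".")
    return float(cleaned)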


def extract_line_items_from_pdf(
    file_path: Path, min_confidence: float = 0.85
) -> list[InvoiceLineItem]:
    doc_hash = compute_sha256(file_path)
    items: list[InvoiceLineItem] = []

    try:
        with pdfplumber.open(file_path) as pdf:
            if not has_text_layer(pdf):
                # Flattened scan: defer to the OCR preprocessing branch upstream.
                raise ExtractionError("No text layer; route to OCR", doc_hash, "routing")

            page = pdf.pages[0]
            tables = page.extract_tables()
            if not tables:
                raise ExtractionError("No tabular structure detected", doc_hash, "table_detection")

            raw_table = tables[0]  # commercial invoices use one primary line-item table
            for idx, row in enumerate(raw_table[1:], start=1):  # skip header row
                if len(row) < 6:
                    # A short row means the column map is unreliable: divert, do not skip.
                    raise ExtractionError(
                        f"Row {idx} truncated: {row!r}", doc_hash, "row_shape"
                    )

                try:
                    qty = _to_decimal_str(row[1])
                    unit_price = _to_decimal_str(row[3])
                    ext_val = _to_decimal_str(row[4])
                except ValueError as e:
                    raise ExtractionError(
                        f"Numeric parse failure at row {idx}: {e}", doc_hash, "field_parsing"
                    ) from e

                # In production, source this from Tesseract/Textract per-token confidence.
                confidence = 0.92 if len(str(row[0])) > 10 else 0.78
                if confidence < min_confidence:
                    logger.warning("Low-confidence row %d (conf=%.2f), flag for review", idx, confidence)

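                try:
                    items.append(
                        InvoiceLineItem(
                            line_number=idx,
                            description=str(row[0]).strip(),
                            quantity=qty,
                            unit_of_measure=str(row[2]).strip().upper(),
                            unit_price=unit_price,
                            extended_value=ext_val,
                            hs_code=str(row[5]).strip(),
                            country_of_origin="US",  # override via header-extraction logic
                            extraction_confidence=confidence,
                        )
                    )
                except ValidationError as e:
                    # A bad HS code or origin surfaces as a typed failure, not a raw Pydantic error.
                    raise ExtractionError(
                        f"Line validation failure at row {idx}: {e}", doc_hash, "line_validation"
                    ) from e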
    except PDFSyntaxError as e:
        raise ExtractionError(f"Corrupted PDF structure: {e}", doc_hash, "pdf_open") from e

    return items
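Stage 2 relies on pdfplumber's own table detection. The underlying idea of anchoring cells to geometry rather than text can be sketched in plain Python. This is an illustration under assumptions, not the pipeline's method: `assign_columns`, the boundary values, and the row tolerance are hypothetical, and the word boxes are dictionaries with the `x0`, `x1`, `top`, and `text` keys that pdfplumber's `extract_words()` returns.

```python
from bisect import bisect_right


def assign_columns(words, boundaries, row_tolerance=3.0):
    """Bucket word boxes into table rows (by `top`) and columns (by x-midpoint)."""
    rows = []  # each entry: [anchor_top, {column_index: [(x0, text), ...]}]
    for word in sorted(words, key=lambda w: w["top"]):
        # Start a new row when the word sits further below the row anchor than the tolerance.
        if not rows or abs(word["top"] - rows[-1][0]) > row_tolerance:
            rows.append([word["top"], {}])
        mid_x = (word["x0"] + word["x1"]) / 2
        column = bisect_right(boundaries, mid_x)  # index of the column containing mid_x
        rows[-1][1].setdefault(column, []).append((word["x0"], word["text"]))

    n_columns = len(boundaries) + 1
    table = []
    for _, cells in rows:
        # Within a cell, restore left-to-right reading order before joining.
        table.append(
            [" ".join(text for _, text in sorted(cells.get(c, []))) for c in range(n_columns)]
        )
    return table


# Hypothetical word boxes: one three-column row and a second row with only a description.
words = [
    {"text": "Steel", "x0": 10, "x1": 40, "top": 100.0},
    {"text": "bolts", "x0": 44, "x1": 70, "top": 100.4},
    {"text": "500", "x0": 210, "x1": 230, "top": 100.2},
    {"text": "0,12", "x0": 310, "x1": 335, "top": 100.0},
    {"text": "Washers", "x0": 10, "x1": 50, "top": 115.0},
]
table = assign_columns(words, boundaries=[200, 300])
```

A word is assigned by where it sits on the page, so a description that runs long or a missing cell does not shift the values to its right into the wrong column. Merging continuation lines of a multi-line description into the row above is left out of this sketch.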

Stage 3 — Assemble and validate the payload

Purpose: bind the validated lines to header fields and emit the immutable envelope. Input: the source path and the extracted line items. Output: a CommercialInvoicePayload. Error condition: a Pydantic ValidationError on any header or line field routes the document to quarantine instead of returning a partial record.

def build_compliance_payload(
    file_path: Path, items: list[InvoiceLineItem]
) -> CommercialInvoicePayload:
    doc_hash = compute_sha256(file_path)
    try:
        payload = CommercialInvoicePayload(
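            invoice_number="INV-2024-8891",  # placeholder; extracted via header regex in production
            issue_date=datetime.now(timezone.utc),  # placeholder; parse the printed invoice date in production
            currency="USD",  # placeholder; read from the invoice header in production
            incoterms="FOB",  # placeholder; read from the invoice header in production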
            line_items=items,
            document_hash=doc_hash,
        )
        logger.info("Built payload %s with %d line items", doc_hash, len(items))
        return payload
    except ValidationError as e:
        logger.error("Schema validation failed for %s: %s", doc_hash, e.json())
        raise ExtractionError("Schema validation failure", doc_hash, "pydantic_validation") from e

Validation and Determinism

Because a declared value filed with a customs authority is a legal statement, correctness is verified, not assumed. Four cross-checks gate a payload before any downstream engine consumes it:

HS digit-length. The HS_CODE_PATTERN constraint enforces the WCO 6-digit base plus 8- or 10-digit national subdivisions, rejecting the 7- and 9-digit artifacts a mis-parsed column produces.
Arithmetic reconciliation. For every line, quantity × unit_price must equal extended_value within a one-cent tolerance; a larger delta signals a column mis-map and quarantines the record rather than filing a wrong value.
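ISO conformance. currency is checked against the ISO 4217 three-letter shape and country_of_origin against the ISO 3166-1 alpha-2 shape, so a malformed origin or a truncated currency code cannot reach the duty base. The patterns shown validate format only; confirming that a code is actually assigned requires a lookup against the published code lists. Cross-document reconciliation with Packing List Data Normalization confirms gross/net weights and package counts agree before assessment.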
Determinism. Extraction is a pure function of the fixed source file and a fixed parser version, bound together by the SHA-256 hash — re-running the same document against the same pdfplumber release reproduces byte-identical line items, the property a CBP ACE reconciliation depends on.
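The arithmetic reconciliation check is not implemented in the Stage 2 code. A minimal sketch of it follows, with assumptions stated plainly: the function name `reconciles` is hypothetical, amounts are passed as strings and compared as `Decimal` rather than the `float` fields of the schema, and rounding half-up to the cent is an illustrative choice.

```python
from decimal import Decimal, ROUND_HALF_UP

ONE_CENT = Decimal("0.01")


def reconciles(
    quantity: str,
    unit_price: str,
    extended_value: str,
    tolerance: Decimal = ONE_CENT,
) -> bool:
    """Return True when quantity x unit_price matches extended_value within the tolerance."""
    # Decimal arithmetic avoids binary floating-point error in the product.
    expected = (Decimal(quantity) * Decimal(unit_price)).quantize(
        ONE_CENT, rounding=ROUND_HALF_UP
    )
    return abs(expected - Decimal(extended_value)) <= tolerance


ok = reconciles("500", "0.12", "60.00")       # consistent line
mis_mapped = reconciles("500", "0.12", "500")  # quantity read into the value column
```

Here `ok` is true and `mis_mapped` is false. A false result is the signal to quarantine the line rather than file it.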

Records that fail a check are routed, not dropped. Marginal OCR confidence and template mismatches escalate through OCR Drift Correction & Validation, which compares extracted strings against known vendor templates before the line is trusted.
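Routing with a machine-readable reason could look like the sketch below. The record layout and the function name `to_quarantine_record` are assumptions, and the `ExtractionError` class is a stand-in that mirrors the Stage 1 definition so the block runs on its own.

```python
import json
from datetime import datetime, timezone


class ExtractionError(Exception):
    """Stand-in mirroring the Stage 1 ExtractionError (source hash plus failing stage)."""

    def __init__(self, message: str, doc_hash: str, stage: str):
        super().__init__(message)
        self.doc_hash = doc_hash
        self.stage = stage


def to_quarantine_record(err: ExtractionError) -> str:
    """Serialize a typed failure into a JSON line for a quarantine queue."""
    record = {
        "doc_hash": err.doc_hash,
        "stage": err.stage,
        "reason": str(err),
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
    }
    # sort_keys keeps the field order stable across runs.
    return json.dumps(record, sort_keys=True)


line = to_quarantine_record(
    ExtractionError("No tabular structure detected", "abc123", "table_detection")
)
```

Carrying `stage` in the record lets a reviewer or a downstream router separate layout failures from numeric failures without parsing the message text.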

Downstream Integration

The stage is deliberately stateless: it emits a validated payload and mutates nothing. That payload feeds two directions. Line items with resolved HS codes flow forward into the classification and duty engines of the Core Architecture & Tariff Mapping domain, where a well-formed but unmapped code is caught by fallback routing rather than defaulting to zero duty. Documents whose numeric or encoding failures survive Stage 2 hand off to Multi-language Invoice Parsing for Unicode normalization and locale-aware decimal recovery. Transient extraction faults — a locked file, a momentary OCR-service timeout — are retried through the Error Handling & Retry Logic framework, while deterministic validation failures are never retried because a re-run only reproduces the same rejection.
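The split between retried and non-retried failures can be illustrated with a small wrapper. This is a sketch only: the two exception classes, `run_with_retry`, and the backoff values are hypothetical and do not describe the Error Handling & Retry Logic framework itself.

```python
import time


class TransientExtractionFault(Exception):
    """Hypothetical marker for faults worth retrying (locked file, service timeout)."""


class DeterministicValidationFailure(Exception):
    """Hypothetical marker for failures a re-run would only reproduce."""


def run_with_retry(task, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call task(), retrying transient faults with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientExtractionFault:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the fault to the caller
            sleep(base_delay * 2 ** (attempt - 1))
        # DeterministicValidationFailure is deliberately not caught here,
        # so it propagates on the first attempt without a retry.
```

The `sleep` parameter is injectable so the backoff can be skipped in tests.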

Scaling and Resilience

Invoice volume is bursty: end-of-quarter and pre-holiday shipment surges drive spikes that must not degrade extraction latency or exhaust memory on large multi-page manifests. The extractor streams documents through the batch framework rather than materializing whole queues, keeping resident memory flat regardless of batch size. An asyncio.Semaphore bounds concurrent OCR workers so the CPU-heavy rasterization path never starves synchronous API consumers, and a circuit breaker around the extraction service trips after a threshold of consecutive ExtractionError events — shedding load to the exception queue instead of cascading timeouts into the classification tier. When validation failures exceed a configurable error budget, an emergency pause halts queue consumption so a systemic template change (a shipper that silently redesigned its invoice) cannot flood downstream engines with corrupted payloads before a human notices.
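The semaphore bound and the consecutive-failure breaker can be sketched together. Everything named here is an assumption for illustration (`ConsecutiveFailureBreaker`, `run_bounded`, the thresholds), and the breaker is simplified: once open it stays open, with no half-open recovery.

```python
import asyncio


class ConsecutiveFailureBreaker:
    """Opens after `threshold` failures in a row; any success resets the count."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1


async def run_bounded(jobs, worker, max_concurrent=4, breaker=None):
    """Run worker(job) for every job with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    shed = []  # jobs diverted to the exception queue

    async def guarded(job):
        async with semaphore:
            if breaker is not None and breaker.open:
                shed.append(job)  # breaker open: shed load without calling the worker
                return None
            try:
                result = await worker(job)
            except Exception:  # in the pipeline this would be ExtractionError
                if breaker is not None:
                    breaker.record(ok=False)
                shed.append(job)
                return None
            if breaker is not None:
                breaker.record(ok=True)
            return result

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, shed
```

A call such as `asyncio.run(run_bounded(paths, ocr_worker, max_concurrent=3))` would keep at most three OCR jobs running at once, where `paths` and `ocr_worker` stand for the caller's own inputs.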

Compliance Obligations

Every extracted payload must be reconstructible point-in-time. The audit record persists the document SHA-256, the extraction timestamp, the parser version, and per-line confidence scores: the exact fields a CBP Focused Assessment or post-entry correction reconstructs to prove which source document produced a given declared value. The CommercialInvoicePayload shown earlier carries the hash, the timestamp, and the confidences; the parser version is not one of its fields and has to be recorded alongside it. Structured JSON logs capture the coordinate boundaries, extraction-engine version, and validation-rule snapshot for each run, supporting the transparency expected of automated valuation in systems such as CBP ACE and Germany's ATLAS. Audit records are written to immutable storage and retained for the five-year record-keeping window that applies to a US entry, so an invoice extracted today remains explainable years later even as vendor templates and tariff schedules move on. Regulatory notices (Federal Register valuation rulings, WCO HS 2022 nomenclature amendments) enter as versioned template and schema updates, never ad-hoc code edits, keeping the effective-date trail intact. Any line the pipeline cannot resolve deterministically escalates to a human-in-the-loop gate where a licensed broker confirms or corrects the classification before the corrected line re-enters assessment; no confidence score, however high, substitutes for that sign-off.
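A structured audit line carrying those fields might be built as follows. This is a sketch under assumptions: the field names, the logger name `invoice.audit`, and both function names are hypothetical, and the timestamp is injectable so a record can be reproduced exactly.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("invoice.audit")  # hypothetical logger name


def build_audit_record(doc_hash, parser_version, line_confidences, extracted_at=None):
    """Collect the fields needed to tie a declared value back to its source file."""
    extracted_at = extracted_at or datetime.now(timezone.utc)
    return {
        "document_sha256": doc_hash,
        "parser_version": parser_version,
        "extraction_timestamp": extracted_at.isoformat(),
        "line_confidences": list(line_confidences),
    }


def log_audit_record(record: dict) -> str:
    """Emit the record as one compact JSON line with a stable key order."""
    line = json.dumps(record, sort_keys=True, separators=(",", ":"))
    audit_logger.info(line)
    return line
```

Sorting the keys means two runs over the same inputs produce identical log lines, which makes them straightforward to diff or hash during a later review.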

Related

Extracting line items from commercial invoices with pdfplumber: coordinate clustering and column anchoring in depth.
pdfplumber vs camelot vs tabula for customs invoice extraction — choosing a table-extraction engine per document class.
Handling multi-currency invoices with Babel — locale-aware decimal recovery and ISO 4217 minor-unit precision.
OCR Drift Correction & Validation — repairs coordinate drift and confirms scanned fields against vendor templates.
Packing List Data Normalization — reconciles weights and package counts against the extracted invoice.
Multi-language Invoice Parsing — Unicode normalization and locale-aware decimal recovery.
Async Batch Processing for High Volume — the queue framework that routes OCR-heavy batches off the synchronous path.

Authoritative references: World Customs Organization HS Nomenclature · USITC Harmonized Tariff Schedule · Pydantic validation · pdfplumber
