Why not just use pdfplumber's built-in extract_tables() for commercial invoices?

extract_tables() relies on ruling lines or a uniform lattice, which most commercial invoices lack — they delimit columns with whitespace and merged multi-line descriptions. A coordinate-aware pass over page.chars clusters rows by vertical proximity and segments columns by gap width, so it survives borderless layouts and vendor-to-vendor variance where the built-in table finder returns empty or misaligned cells.

How does the routine avoid mis-reading an HS code that shares a column with a SKU?

Every candidate is matched against a strict pattern that accepts only 6, 8, or 10 digits (with optional dot separators): a WCO HS subheading is 6 digits and an HTSUS line is 8 or 10, so 5-, 7-, and 9-digit values cannot be valid, and a bare 4-digit heading is too coarse to classify a line item. A SKU like 12345 or a 7-digit part number fails the pattern and the row is skipped rather than mis-classified, which keeps an invalid code out of the downstream duty calculation.

Does coordinate clustering survive scanned invoices?

Only if the scan carries a text layer. pdfplumber reads embedded characters, not pixels, so a rasterized scan yields no page.chars and must first pass through OCR drift correction and validation. High-DPI OCR output also drifts vertically, so y_tol should be tightened to about 2.5 points and calibrated against a scatter plot before the row clusterer is trusted.

Extracting line items from commercial invoices with pdfplumber

This page answers one narrow implementation question: how do you pull structured line items — description, HS code, quantity, unit price, and line total — out of a borderless commercial-invoice PDF when the columns are implied by whitespace rather than drawn as a table? This is the exact stage referenced from Commercial Invoice PDF Extraction, and it is the critical choke point in the pipeline: line items feed HS classification, duty assessment, and origin verification downstream, so a mis-segmented column or a swallowed decimal propagates straight into an incorrect entry.

The concrete failure mode targeted here is column collapse on borderless invoices. pdfplumber’s built-in extract_tables() needs ruling lines or a uniform lattice; most commercial invoices have neither. When you feed one to the lattice finder you get empty cells or two adjacent fields fused into one, and a regex-only fallback then splits a multi-line product description across the wrong rows. The fix is to abandon table abstractions and work directly on page.chars: cluster characters into rows by vertical proximity, then segment each row into columns by gap width. This decouples layout detection from data normalization and produces deterministic output regardless of vendor formatting.
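To make the two-step idea concrete, here is a minimal sketch on synthetic character dicts rather than a real PDF. The helper names cluster_rows and split_columns, the fixed min_gap of 20 points, and the sample coordinates are illustrative assumptions; the routine described on this page uses a multiple of the mean gap instead of a fixed gap.

```python
# Synthetic stand-ins for pdfplumber char dicts; only the keys used here are included.
chars = [
    {"text": "A", "x0": 10.0, "top": 100.0},
    {"text": "B", "x0": 16.0, "top": 100.4},
    {"text": "7", "x0": 120.0, "top": 99.8},
    {"text": "C", "x0": 10.0, "top": 114.0},
    {"text": "9", "x0": 120.0, "top": 114.2},
]


def cluster_rows(chars, y_tol=3.0):
    """Group chars whose `top` lies within y_tol of the first char in the row."""
    rows = []
    for ch in sorted(chars, key=lambda c: c["top"]):
        if rows and abs(ch["top"] - rows[-1][0]["top"]) <= y_tol:
            rows[-1].append(ch)
        else:
            rows.append([ch])
    # Restore left-to-right reading order inside each row.
    return [sorted(row, key=lambda c: c["x0"]) for row in rows]


def split_columns(row, min_gap=20.0):
    """Start a new column wherever consecutive x0 values are more than min_gap apart."""
    cols = [row[0]["text"]]
    for prev, cur in zip(row, row[1:]):
        if cur["x0"] - prev["x0"] > min_gap:
            cols.append(cur["text"])
        else:
            cols[-1] += cur["text"]
    return cols


table = [split_columns(r) for r in cluster_rows(chars)]
print(table)  # [['AB', '7'], ['C', '9']]
```

No ruling lines are involved: the two rows fall out of the `top` values alone, and the column break falls out of the 104-point jump between "B" and "7".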

Prerequisites

Pin the following before applying this routine. The coordinate clustering depends on pdfplumber’s character-level geometry, and the numeric parsing assumes locale-normalized input, so the versions and upstream state below are load-bearing.

Python 3.10+: the code uses builtin generics such as list[dict], and 3.10 is the baseline already established across these ingestion workflows.
pdfplumber >= 0.11 (which pins pdfminer.six >= 20231228). The char dict keys top, x0, x1, and text are stable in this range; older pdfminer builds shift top origins between releases, which silently breaks the row clusterer.
pandas >= 2.0 and tenacity >= 8.2 for downstream tabulation and the retry wrapper around transient I/O.
A native or OCR-recovered text layer. pdfplumber reads embedded characters, not pixels — a rasterized scan yields an empty page.chars. Route image-only invoices through OCR drift correction and validation first, and normalize mixed scripts through multi-language invoice parsing so full-width digits and ligatures do not corrupt numeric fields before they reach this stage.
A known column order. This implementation assumes the line, description, HS code, quantity, and price columns appear left to right; remap the index offsets in extract() if a vendor template reorders them.
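One way to make the column-order assumption explicit is to carry the positions in a small mapping object instead of hard-coded indices. This is only a sketch: ColumnMap, its pick method, and the VENDOR_X template are hypothetical names that do not exist in the extractor, and the default positions mirror the left-to-right order stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMap:
    """Zero-based position of each field in a segmented row (hypothetical helper)."""
    line_no: int = 0
    description: int = 1
    hs_code: int = 2
    quantity: int = 3
    unit_price: int = 4

    def pick(self, cols: list[str]) -> dict[str, str]:
        """Return field name -> cell text, using "" for any index the row does not reach."""
        return {
            name: cols[idx] if idx < len(cols) else ""
            for name, idx in vars(self).items()
        }


DEFAULT = ColumnMap()
# Hypothetical vendor template that prints the HS code before the description.
VENDOR_X = ColumnMap(description=2, hs_code=1)

row = ["1", "8471300100", "Laptop computer", "5", "1,200.00"]
fields = VENDOR_X.pick(row)
print(fields["hs_code"], "|", fields["description"])  # 8471300100 | Laptop computer
```

Keeping one map per vendor template keeps the reordering out of the extraction loop, so a new layout becomes a data change rather than a code change.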

Implementation

The extractor makes three passes per page. First, _cluster_rows sorts every character by (top, x0) and groups characters whose vertical position falls within y_tol into a single logical row. Then _segment_columns measures the horizontal spacing between consecutive characters and opens a column boundary wherever the spacing exceeds a multiple of the mean; this is what recovers implied columns without ruling lines. Finally, each column is NFKC-normalized, the HS candidate is validated against a strict digit-length pattern, and only surviving rows are materialized as a LineItem. Currency symbols and thousands separators are stripped before any arithmetic, so an amount such as $1,234.56 parses to 1234.56. ISO 4217 governs currency codes, not number formatting, so decimal-comma amounts such as 1.234,56 deserve an explicit test against _parse_numeric before a new vendor goes live.

import pdfplumber
import pandas as pd
import logging
import re
import unicodedata
from typing import Optional
from dataclasses import dataclass
from pdfminer.pdfparser import PDFSyntaxError
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(name)s | %(message)s")
logger = logging.getLogger("customs_etl.invoice")

# Tariff-line codes are 6 (WCO HS), 8, or 10 (HTSUS) digits; 5, 7, or 9 never occur.
# Dots are optional after the 4th, 6th, and 8th digit (e.g. "8471.30.01.00").
HS_PATTERN = re.compile(r"^\d{4}\.?\d{2}(?:\.?\d{2}(?:\.?\d{2})?)?$")


@dataclass
class LineItem:
    line_no: int
    description: str
    hs_code: str
    quantity: float
    unit: str
    unit_price: float
    total_price: float
    origin: Optional[str] = None


class CommercialInvoiceExtractor:
    def __init__(self, x_tol: float = 5.0, y_tol: float = 3.0, min_line_width: float = 20.0):
        self.x_tol = x_tol
        self.y_tol = y_tol                 # vertical band that defines one logical row
        self.min_line_width = min_line_width
        self.currency_pattern = re.compile(r"[^\d.,-]")

    def _normalize_text(self, text: str) -> str:
        """NFKC folds full-width digits and ligatures — see multi-language parsing."""
        return unicodedata.normalize("NFKC", text).strip()

    def _extract_horizontal_lines(self, page) -> list[dict]:
        # Rules wide enough to separate the item table from headers/footers.
        return [ln for ln in page.lines if abs(ln["x0"] - ln["x1"]) > self.min_line_width]

    def _cluster_rows(self, chars: list[dict]) -> list[list[dict]]:
        if not chars:
            return []
        sorted_chars = sorted(chars, key=lambda c: (c["top"], c["x0"]))
        rows: list[list[dict]] = []
        current_row: list[dict] = [sorted_chars[0]]
        for char in sorted_chars[1:]:
            # Same row while `top` stays within y_tol of the row's first character;
            # anchoring on the first character stops gradual drift from chaining rows.
            if abs(char["top"] - current_row[0]["top"]) <= self.y_tol:
                current_row.append(char)
            else:
                rows.append(current_row)
                current_row = [char]
        rows.append(current_row)
        return rows

    def _segment_columns(self, row_chars: list[dict], h_lines: list[dict]) -> list[str]:
        # Note: h_lines is not consulted here; segmentation relies on gap width alone.
        # Work left to right: rows arrive sorted by (top, x0), so characters with
        # slightly different tops would otherwise be concatenated out of order.
        ordered = sorted(row_chars, key=lambda c: c["x0"])
        x_coords = [c["x0"] for c in ordered]
        if len(x_coords) < 2:
            return [self._normalize_text("".join(c["text"] for c in ordered))]

        # x0-to-x0 spacing (glyph width plus whitespace) stands in for the true gap.
        gaps = [x_coords[i + 1] - x_coords[i] for i in range(len(x_coords) - 1)]

        # A column break is any gap markedly wider than the mean inter-char gap.
        threshold = sum(gaps) / len(gaps) * 2.5
        boundaries = [float("-inf")]
        for i, gap in enumerate(gaps):
            if gap > threshold:
                boundaries.append(x_coords[i] + gap / 2)
        boundaries.append(float("inf"))

        columns = [""] * (len(boundaries) - 1)
        for char in ordered:
            for i in range(len(boundaries) - 1):
                if boundaries[i] <= char["x0"] < boundaries[i + 1]:
                    columns[i] += char["text"]
                    break
        return [self._normalize_text(c) for c in columns]

    def _parse_numeric(self, val: str) -> float:
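        # Strip currency glyphs, then treat the right-most separator as the decimal mark
        # when both appear (1.234,56 vs 1,234.56); a lone comma is read as thousands.
        cleaned = self.currency_pattern.sub("", val)
        if "." in cleaned and cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")  # 1.234,56 -> 1234.56
        else:
            cleaned = cleaned.replace(",", "")                    # 1,234.56 -> 1234.56
        try:
            return float(cleaned)
        except ValueError:
            return 0.0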

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        retry=retry_if_exception_type((IOError, PDFSyntaxError)),
        before_sleep=lambda rs: logger.warning("Retry %s | %s", rs.attempt_number, rs.outcome.exception()),
    )
    def extract(self, pdf_path: str) -> list[LineItem]:
        items: list[LineItem] = []
        logger.info("Initializing extraction pipeline for %s", pdf_path)

        with pdfplumber.open(pdf_path) as pdf:
            for page_num, page in enumerate(pdf.pages, start=1):
                logger.info("Processing page %d | %d chars detected", page_num, len(page.chars))
                h_lines = self._extract_horizontal_lines(page)
                rows = self._cluster_rows(page.chars)

                for row_chars in rows:
                    cols = self._segment_columns(row_chars, h_lines)
                    if len(cols) < 4:
                        continue  # header, footer, or subtotal band — not a line item

                    desc = cols[1]
                    hs_candidate = cols[2]
                    qty_str = cols[3]
                    price_str = cols[4] if len(cols) > 4 else "0"

                    # Digit-length gate: keeps SKUs and part numbers out of hs_code.
                    if not HS_PATTERN.match(hs_candidate):
                        continue

                    qty = self._parse_numeric(qty_str)
                    unit_price = self._parse_numeric(price_str)
                    items.append(LineItem(
                        line_no=len(items) + 1,
                        description=desc,
                        hs_code=hs_candidate,
                        quantity=qty,
                        unit="PCS",
                        unit_price=unit_price,
                        total_price=round(qty * unit_price, 2),  # recompute, never trust printed total
                    ))
        logger.info("Extraction complete. %d line items parsed.", len(items))
        return items

The validated hs_code on each LineItem is the join key the rest of the system depends on: it is resolved against the schedule defined in HTS Schedule Database Design, and only a code that passes the digit-length gate above will match a row there. Recomputing total_price from quantity * unit_price rather than trusting the printed figure is deliberate: it turns the invoice's own subtotal into an independent checksum, which the arithmetic check in the verification steps below relies on.

Verification steps

Run these checks against a labeled sample before the extractor carries production filings. Each one is deterministic and reproducible in staging.

Calibrate row clustering. Export page.chars to CSV and scatter-plot x0 against top. Confirm every logical row collapses into a single cluster within y_tol. Tighten y_tol to 2.5 for high-DPI scans and relax it to 4.0 for compressed PDFs; a row that splits in two means the band is too tight.
Validate column boundaries. Log the boundaries list from _segment_columns for a known template and compare against the expected header count. If two columns merge, raise the gap multiplier from 2.5 to 3.0 until the counts match; if a description wraps into a phantom column, the multiplier is too high.
Checksum the arithmetic. Compute Σ(unit_price × quantity) across all rows and compare against the printed invoice subtotal. Apply a ±0.02 tolerance for currency rounding and flag any divergence above 0.5% for broker review — this catches a swallowed decimal or a mis-segmented price column that nothing else will.
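Enforce HS digit-length rules. Assert that every hs_code matches HS_PATTERN, that is, 6, 8, or 10 digits. Track the rejection rate as well: header and subtotal rows with four or more columns normally fail the gate, so a 0% rejection rate on a real batch usually means rows are being dropped before the gate runs or the pattern has been loosened, not that every code is valid, while a sudden spike suggests the SKU column is leaking into the HS position.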
Reconcile against the packing list. Cross-check declared quantities and gross/net weights against Packing List Data Normalization output; a mismatch between invoice line totals and manifest weights triggers an automatic hold under ACE/ATLAS validation before duty assessment.
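The arithmetic checksum from the list above can be sketched as follows. The function name subtotal_status and the three status labels are assumptions made for illustration, as is the middle "mismatch" band for differences that exceed the rounding tolerance but stay under the 0.5% review threshold. Decimal is used so the comparison is not disturbed by binary floating-point rounding; float fields would need to be converted through str() first.

```python
from decimal import Decimal


def subtotal_status(lines, printed_subtotal, abs_tol="0.02", rel_flag="0.005"):
    """Compare sum(quantity * unit_price) with the printed subtotal.

    lines is an iterable of (quantity, unit_price) pairs given as strings.
    Returns (computed, status); status is "pass", "mismatch", or "broker_review".
    """
    computed = sum((Decimal(q) * Decimal(p) for q, p in lines), Decimal("0"))
    printed = Decimal(printed_subtotal)
    diff = abs(computed - printed)
    if diff <= Decimal(abs_tol):
        return computed, "pass"            # within currency-rounding tolerance
    if printed == 0 or diff / abs(printed) > Decimal(rel_flag):
        return computed, "broker_review"   # divergence above 0.5% of the printed figure
    return computed, "mismatch"            # assumption: fails, but below the review flag


lines = [("5", "1200.00"), ("10", "3.35")]
print(subtotal_status(lines, "6033.51"))  # rounding-level difference
print(subtotal_status(lines, "6100.00"))  # large divergence
```

A price column that lost its decimal point typically inflates one line by a factor of 100, which lands well inside the broker_review band.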

Edge cases & gotchas

The failure modes below are specific to coordinate-based extraction over real commercial invoices, and most only surface once you leave a single clean vendor template.

top drift between pdfminer releases. pdfplumber inherits its coordinate origin from pdfminer.six, and a minor bump can shift top values by a few points across a page. A y_tol calibrated on one version then over- or under-merges rows on another. Pin pdfminer.six exactly, and re-run the scatter-plot calibration whenever you upgrade rather than assuming the tolerance carries over.
Right-aligned numeric columns fool gap segmentation. Quantity and price are often right-aligned, so their leading x0 drifts by magnitude — a 9 and a 1,234 no longer share a left edge. Cluster the price column on x1 (the right edge) rather than x0, or the segmenter will occasionally slice a wide number into two columns.
Multi-line descriptions lose their continuation line. A product description that wraps to a second visual line sits at a different top, so the clusterer emits it as its own row with no HS code, and the len(cols) < 4 guard silently drops it, losing the continuation text. Merge a row into its predecessor when it lacks a numeric quantity and its top is within one line-height, before the HS gate runs.
Arabic-Indic digits survive normalization. NFKC folds full-width １２３ to 123, but Arabic-Indic ٤٥٦ has no compatibility mapping and passes through unchanged. Python's \d and float() both accept those digits, so a bare integer still parses, but the failure moves elsewhere: the Arabic decimal separator ٫ (U+066B) is stripped by currency_pattern, which silently shifts the magnitude, and an HS code printed in Arabic-Indic digits passes HS_PATTERN yet never matches an ASCII key in the schedule. Map locale digit sets and separators explicitly during multi-language invoice parsing rather than relying on NFKC alone.
Dotted vs. bare HS forms compare unequal. 8471.30.01.00 and 8471300100 denote the same tariff line but are different strings, so a naive join against the schedule misses. Canonicalize to the bare-digit form (dots removed, length 6, 8, or 10 preserved) before the record enters the HTS Schedule Database Design lookup, and keep the dotted form only for display.
A 0.0 from _parse_numeric is indistinguishable from a genuine zero. The except ValueError: return 0.0 fallback masks an unparseable quantity as a legitimate zero, which zeroes the line total and passes the subtotal checksum only by luck. Return None and route the row to a quarantine list for error handling and retry logic rather than coercing failures to zero.

Related

Commercial Invoice PDF Extraction — the parent workflow this line-item stage plugs into, covering ingestion, routing, and schema validation.
Packing List Data Normalization — reconciles the quantities and weights this extractor emits against the physical manifest.
OCR Drift Correction & Validation — recovers a text layer for scanned invoices before page.chars is usable.
Multi-Language Invoice Parsing — locale digit maps and Unicode normalization that protect numeric fields.
HTS Schedule Database Design — the schema the validated hs_code is resolved against for duty assessment.

Authoritative references: WCO HS 2022 Nomenclature, HTSUS (USITC), CBP ACE / ABI submission formats, EU ATLAS validation rules, ISO 4217 currency codes, UN/ECE Recommendation No. 20 (units of measure).
