Why normalize invoice text to NFC before hashing or classification?

The same character can arrive as a composed code point or a base-plus-combining-mark sequence, and zero-width joiners can slip into a description invisibly. Without a single canonical form, two identical descriptions hash differently and exact-match glossary lookups fail. NFC composition collapses every equivalent sequence to one representation, making lineage hashing and deduplication deterministic.

How does the parser avoid corrupting declared values across locales?

Monetary tokens are parsed under their source BCP 47 locale with Babel rather than the process default, so 1.234,56 and 1,234.56 are both recovered correctly, and non-breaking-space thousands separators are stripped first. Values are carried as exact Decimal, never float, to satisfy CBP ACE decimal-precision rules. Any ambiguous token routes to quarantine instead of being guessed.

How is semantic drift from machine translation prevented?

Trade-of-art terms that carry HS meaning — galvanized, annealed, rolled — are swapped for opaque private-use tokens before translation and restored afterward, so the translator can never paraphrase them into words that shift the line into the wrong HS chapter. Candidates are then constrained to the WCO digit-length rule before they can reach duty assessment.

13 min read
6 code samples

Multi-language Invoice Parsing

Multi-language invoice parsing is the deterministic normalization stage that turns a mixed-script commercial invoice into a locale-neutral payload the rest of a customs pipeline can classify without ambiguity. It sits inside the Document Ingestion & Parsing Workflows reference architecture, downstream of raw capture and upstream of every tariff and duty engine, and its output must be reproducible cent-for-cent: given the same source file, the same Unicode normalization form, and the same constrained glossary, it must always emit the same normalized description, the same recovered decimal transaction value, and the same HS candidate — or route the line to a quarantine queue with a machine-readable reason. For licensed brokers, that reproducibility is what makes a declared value defensible under a CBP Focused Assessment; for Python ETL teams, it is the property that lets a re-run reconstruct a filed entry byte-for-byte from a fixed input.

Problem Framing: Where Language Variance Breaks Classification

A single cross-border shipment routinely carries product descriptions in the exporter’s language, terms of sale in another, and regulatory declarations in a third — sometimes interleaved inside one line-item cell. Treating that document as ASCII text scraped in filing order is the source of three failure modes that a compliance pipeline cannot tolerate:

Encoding corruption. Non-Latin scripts, mixed code pages, zero-width joiners, and ligature glyphs survive extraction as inconsistent byte sequences. Two visually identical descriptions hash differently, defeating deduplication and lineage, and a decomposed character silently fails an exact-match glossary lookup that its composed twin would have passed.
Locale decimal drift. An invoice that writes 1.234,56 for one-thousand-two-hundred, or encodes a thousands separator as a non-breaking space (U+00A0), truncates the moment float() touches it. A single misread separator moves a declared value by three orders of magnitude and propagates straight into the duty base and a rejected ABI filing.
Semantic drift on translation. Unconstrained machine translation rewrites trade-of-art terms — “annealed”, “galvanized”, “of iron or non-alloy steel” — into paraphrases that no longer map to the WCO nomenclature the classifier expects, silently shifting a line into the wrong HS chapter.

The extraction that precedes this stage — coordinate-anchored table capture in Commercial Invoice PDF Extraction — hands off clean text blocks, and marginal-confidence scans are repaired by OCR Drift Correction & Validation before they reach here. This stage answers all three failures by normalizing to a single canonical Unicode form, recovering decimals against the source locale rather than a hard-coded separator, and translating only through a glossary that pins trade terminology in place.

Schema / Data Contract

The stage’s output contract is a pair of dataclasses. InvoiceLineItem is the normalized unit of a line; every field on it exists to answer a later audit question — which normalization form the text passed through, which locale recovered the decimal, and which source coordinate produced the record. NormalizationResult binds the transformation chain to an immutable lineage hash so the whole path is reconstructible point-in-time.

import hashlib
import unicodedata
from dataclasses import dataclass, field
from decimal import Decimal
from enum import Enum
from typing import Optional


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass(frozen=True)
class NormalizationResult:
    raw: str
    normalized: str
    source_locale: str        # BCP 47 tag, e.g. "es-MX", "de-DE"
    source_coords: str        # page + bounding box from the extractor
    lineage_hash: str         # SHA-256 over (coords | raw | normalized)


@dataclass
class InvoiceLineItem:
    line_id: str
    raw_description: str
    source_lang: str          # ISO 639-1, e.g. "es"
    target_lang: str = "en"
    source_locale: str = "en-US"
    normalized_description: str = ""
    translated_description: str = ""
    declared_value: Optional[Decimal] = None   # exact, never float
    currency: str = "USD"                        # ISO 4217
    hs_code_candidate: Optional[str] = None
    confidence_score: float = 0.0
    lineage_hash: str = ""
    quarantine_reason: Optional[str] = None

Monetary values are carried as Decimal, never float: CBP ACE decimal-precision rules and duty-base arithmetic require exact representation, and binary floating point cannot represent 0.10 without rounding error that accumulates across a multi-line entry.

Step-by-Step Implementation

The pipeline runs four ordered stages per line. Each stage has a single responsibility, a typed input and output, and a defined error condition that routes the line to quarantine rather than emitting a partial record.

Stage 1 — Deterministic Unicode normalization

Purpose: collapse every visually-equivalent byte sequence to one canonical form so hashing, deduplication, and glossary lookup are stable. Input: raw extracted text. Output: an NFC-composed string with control and zero-width characters stripped. Error condition: input that is empty after stripping routes to quarantine as EMPTY_AFTER_NORMALIZE.

def normalize_text(text: str) -> str:
    """Apply deterministic NFC composition and strip non-printable controls."""
    # NFC composition: canonical decomposition followed by canonical composition,
    # so "é" (U+0065 U+0301) and "é" (U+00E9) collapse to one representation.
    normalized = unicodedata.normalize("NFC", text)
    # Drop control chars (category C*) except the whitespace parsers rely on,
    # and explicitly remove the zero-width joiner/non-joiner and BOM.
    cleaned = "".join(
        ch for ch in normalized
        if (unicodedata.category(ch)[0] != "C" or ch in {"\n", "\t", " "})
        and ch not in {"", "‌", "‍", ""}
    )
    return cleaned.strip()

Stage 2 — Locale-aware decimal recovery

Purpose: recover the true numeric value from a locale’s grouping and decimal conventions before any arithmetic runs. Input: a numeric token plus its BCP 47 source locale. Output: an exact Decimal. Error condition: an ambiguous or unparseable token routes to quarantine as AMBIGUOUS_DECIMAL.

from babel.numbers import parse_decimal, NumberFormatError


def recover_decimal(token: str, locale: str) -> Decimal:
    """Parse a monetary token under its source locale, never the process default."""
    # Normalize the non-breaking space some locales use as a thousands separator,
    # then let Babel apply the correct grouping/decimal marks for the locale.
    token = token.replace(" ", "").replace(" ", "")
    try:
        value = parse_decimal(token, locale=locale, strict=True)
    except NumberFormatError as exc:
        raise ValueError(f"AMBIGUOUS_DECIMAL: {token!r} under {locale}") from exc
    return Decimal(str(value))

Stage 3 — Glossary-constrained translation

Purpose: convert descriptions to the target customs-jurisdiction language while pinning trade-of-art terminology in place so tariff meaning is preserved. Input: normalized text plus source/target languages. Output: a translated string with glossary terms untouched. Error condition: a translation-service failure records against the circuit breaker and re-raises for retry/dead-letter routing.

Direct machine translation without domain constraints is the single largest source of misclassification here. The glossary is applied as pre-substitution placeholders so the translator never rewrites a term that carries HS meaning.

GLOSSARY = {
    "es": {"galvanizado": "galvanized", "recocido": "annealed", "laminado": "rolled"},
    "de": {"verzinkt": "galvanized", "geglüht": "annealed", "gewalzt": "rolled"},
}


def protect_terms(text: str, source_lang: str) -> tuple[str, dict[str, str]]:
    """Swap trade terms for opaque tokens the translator will not alter."""
    guards: dict[str, str] = {}
    for i, (term, canonical) in enumerate(GLOSSARY.get(source_lang, {}).items()):
        token = f"{i}"  # private-use sentinels survive translation
        text = text.replace(term, token)
        guards[token] = canonical
    return text, guards


def restore_terms(text: str, guards: dict[str, str]) -> str:
    for token, canonical in guards.items():
        text = text.replace(token, canonical)
    return text

Stage 4 — Lineage hashing and HS candidate handoff

Purpose: bind the transformation chain to an immutable fingerprint and pass the normalized line to classification. Input: raw text, normalized text, and source coordinates. Output: a SHA-256 lineage hash written onto the line. Error condition: none — hashing is total; the line then enters HS candidate scoring, and anything below the confidence floor routes to human review.

def compute_lineage_hash(raw: str, normalized: str, source_coords: str) -> str:
    """SHA-256 over the transformation chain for audit-trail reconstruction."""
    payload = f"{source_coords}|{raw}|{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

Validation & Determinism

Normalization is only defensible if it is verifiably deterministic. Three cross-checks gate every line before it leaves the stage:

Idempotence check. normalize_text(normalize_text(x)) == normalize_text(x) must hold; a normalization form that is not a fixed point indicates a decomposition bug and fails the run rather than emitting drifting hashes.
Round-trip decimal tolerance. The recovered Decimal re-formatted back into the source locale must equal the original token modulo whitespace. A mismatch means the separators were misread and routes the line to AMBIGUOUS_DECIMAL quarantine instead of guessing.
HS nomenclature constraint. Any candidate the classifier proposes must satisfy the WCO digit-length rule — a 6-digit base with optional 8- or 10-digit national subdivisions, never 7 or 9 — the same constraint the Commercial Invoice PDF Extraction schema enforces upstream, so a malformed code can never reach duty assessment.

Lines that fail any check carry a quarantine_reason and are diverted with full context; nothing is silently dropped, and no confidence score substitutes for a broker’s sign-off on a quarantined line.

Downstream Integration

Normalized, hashed lines are the input contract for the rest of the platform. The translated descriptions and HS candidates feed the classification engines that consume this stage’s output, and the recovered Decimal values become the declared transaction values that duty computation rounds under CBP rules. Two sibling stages coordinate directly with this one:

Weight, package-count, and unit-of-measure reconciliation happens in Packing List Data Normalization, which cross-references this stage’s line items against the shipment manifest so declared values and quantities agree across all documents before submission.
High-volume corridors decouple ingestion from this normalization work through Async Batch Processing for High Volume, which buffers invoice payloads on a queue and carries each batch’s correlation ID through normalization, translation, and HS alignment.

Because the lineage hash travels with every line, a downstream engine can always resolve a classified entry back to the exact source coordinate and transformation rule that produced it.

Scaling & Resilience

At peak submission windows a single trade lane can push tens of thousands of lines per minute, and the translation service is the stage’s most failure-prone dependency. Resilience is enforced with a circuit breaker that isolates translation faults from the rest of the pipeline, backed by Error Handling & Retry Logic for the transient cases.

import time
from tenacity import (
    retry, stop_after_attempt, wait_exponential, retry_if_exception_type,
)


class TranslationCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: int = 60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = CircuitState.CLOSED
        self.last_failure_time = 0.0

    def record_success(self) -> None:
        self.failures = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = CircuitState.OPEN

    def allow_request(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN  # probe with one request
                return True
            return False
        return True  # HALF_OPEN admits a single trial request


class MultiLanguageInvoiceParser:
    def __init__(self, circuit_breaker: TranslationCircuitBreaker):
        self.circuit_breaker = circuit_breaker

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),  # backoff with jitter upstream
        retry=retry_if_exception_type((ConnectionError, TimeoutError)),
        reraise=True,
    )
    def _translate(self, text: str, source: str, target: str) -> str:
        if not self.circuit_breaker.allow_request():
            raise RuntimeError("Translation circuit breaker OPEN")
        guarded, guards = protect_terms(text, source)
        # In production: call the NMT/DeepL service on `guarded`, then restore.
        translated = f"[{target}] {guarded}"
        return restore_terms(translated, guards)

    def parse_line_item(self, item: InvoiceLineItem, source_coords: str) -> InvoiceLineItem:
        """Run the deterministic normalization pipeline for a single line."""
        try:
            item.normalized_description = normalize_text(item.raw_description)
            if not item.normalized_description:
                item.quarantine_reason = "EMPTY_AFTER_NORMALIZE"
                return item

            item.lineage_hash = compute_lineage_hash(
                item.raw_description, item.normalized_description, source_coords
            )

            if item.source_lang != item.target_lang:
                item.translated_description = self._translate(
                    item.normalized_description, item.source_lang, item.target_lang
                )
                self.circuit_breaker.record_success()
            else:
                item.translated_description = item.normalized_description

            item.confidence_score = 0.95 if len(item.translated_description) > 5 else 0.60
            return item
        except Exception as exc:
            self.circuit_breaker.record_failure()
            raise RuntimeError(f"Line parsing failed: {item.line_id}") from exc

A per-worker semaphore caps concurrent translation calls so a burst cannot exhaust the service’s rate budget, and once the breaker opens, new batches pause at ingestion while in-flight lines drain to completion — bounding memory to the in-flight window rather than the whole submission. Persistent failures (malformed EDI, unsupported encodings) route to a dead-letter queue with full lineage context for manual review.

Compliance Obligations

Every normalized line must be reconstructible point-in-time. The lineage hash, the applied Unicode normalization form, the source locale used for decimal recovery, and the glossary version are persisted alongside each record — the exact fields a CBP Focused Assessment or post-entry correction reconstructs to prove which byte sequence produced a given declared value. Structured JSON logs capture the original byte sequence, the transformation rule applied at each stage, and the resulting canonical string, satisfying the algorithmic-transparency expectations CBP ACE and EU ATLAS impose on automated valuation. Audit records are written to immutable storage and retained for the full US record-keeping window a filed entry can be reached over (five years), so a line normalized today stays explainable years later even as vendor templates, locales, and the WCO HS 2022 nomenclature move on. Regulatory notices — Federal Register valuation rulings, HS nomenclature amendments — enter as versioned glossary and schema updates carrying an effective date, never ad-hoc code edits, keeping the effective-date trail intact. Any line the pipeline cannot resolve deterministically escalates to a human-in-the-loop gate where a licensed broker confirms or corrects the classification before it re-enters assessment.

Normalizing CJK invoice fields with Unicode NFC — canonical folding of fullwidth and compatibility forms before hashing.
Commercial Invoice PDF Extraction — coordinate-anchored capture that hands clean text blocks to this stage.
OCR Drift Correction & Validation — repairs scanned-character drift before normalization runs.
Packing List Data Normalization — reconciles weights and quantities against the normalized invoice lines.
Async Batch Processing for High Volume — the queue framework that buffers multi-language batches off the synchronous path.
Error Handling & Retry Logic — backoff, dead-letter routing, and the circuit-breaker patterns this stage depends on.

Up: Document Ingestion & Parsing Workflows

Authoritative references: WCO HS Nomenclature 2022 · USITC Harmonized Tariff Schedule · Python unicodedata · Babel number parsing · ISO 4217 currency codes

Multi-language Invoice Parsing

# Problem Framing: Where Language Variance Breaks Classification

# Schema / Data Contract

# Step-by-Step Implementation

# Stage 1 — Deterministic Unicode normalization

# Stage 2 — Locale-aware decimal recovery

# Stage 3 — Glossary-constrained translation

# Stage 4 — Lineage hashing and HS candidate handoff

# Validation & Determinism

# Downstream Integration

# Scaling & Resilience

# Compliance Obligations

# Related