Why normalize packing list units to UN/ECE Rec 20 codes rather than keeping the shipper's labels?

Downstream reconciliation and ACE transmission require a single deterministic vocabulary. Mapping kg/lbs/ctn/pallet to KGM/LBR/BOX/PAL up front means the reconciliation join compares like against like, and audit reconstruction never has to re-interpret a shipper-specific abbreviation years later.

How does normalization handle OCR errors in scanned packing lists?

Unit tokens are resolved with bounded fuzzy matching against the canonical dictionary. Above the confidence floor the token is corrected deterministically; below it the line is quarantined for human-in-the-loop review rather than mapped to the nearest guess, so a misread never silently enters the duty base.

What tolerance triggers a compliance hold during invoice reconciliation?

Package-count mismatches, net or gross weight deltas exceeding 2%, and volumetric discrepancies against the declared commercial invoice values route the record to a hold queue before any HS code is assigned.

15 min read
4 code samples

Packing List Data Normalization

Packing list data normalization is the deterministic reconciliation stage that turns a shipper’s package manifest into a schema-validated record the rest of a customs pipeline can trust. It sits inside the Document Ingestion & Parsing Workflows reference architecture, alongside invoice extraction, and its job is to make physical quantities defensible: package counts, net and gross weights, volumetric dimensions, and the parent-child structure of master cartons and inner packs must all emerge in one canonical vocabulary or be diverted with a machine-readable reason. For licensed brokers, that normalized record is what verifies declared quantities against the commercial invoice under a CBP Focused Assessment; for Python ETL teams, it is the property that lets a re-run reconstruct a filed entry’s physical basis from a fixed source file.

Problem Framing: Why Packing Lists Resist Naive Parsing

A packing list is the physical counterpart to the value declaration, but it arrives in no canonical schema. Every carrier, forwarder, and origin agent lays out its own template, mixes native-text PDFs with flattened scans, and labels the same unit a dozen different ways. Three failure modes dominate any pipeline that treats normalization as string-scraping rather than a compliance boundary:

Unit-of-measure divergence. The same weight is written kg, KGS, Kilos, or a localized Stück for a package count. A pipeline that keeps these raw labels cannot reconcile a 1000 KGS gross against an invoice’s 1000 kg without ad-hoc string comparisons that break on the next template.
Hierarchy collapse. Packing lists encode nested packaging — pallets containing master cartons containing inner packs. Flatten that tree and a container-utilization check double-counts weight, or a per-carton quantity is read as a shipment total, corrupting the count that reconciliation depends on.
OCR drift on scans. Rasterized packing lists inherit character substitution and positional skew. A 0 read as O in a unit token, or a decimal point lost to a scan artifact, silently truncates a declared weight the moment float() touches it — and a single wrong digit in gross weight propagates straight into the duty base and a rejected ABI filing.

The normalization layer answers all three by mapping every unit to a single canonical vocabulary, preserving the packaging tree as explicit parent references, and diverting anything it cannot resolve deterministically to a review queue rather than guessing. Drift that originates upstream in the scan itself is detected and repaired by OCR Drift Correction & Validation before it reaches this stage; normalization then applies a final bounded fuzzy pass over unit tokens as a defensive backstop.

Schema / Data Contract

The stage’s output contract is a pair of Pydantic V2 models. PackageLine is the validated unit of a single packing line; PackingListPayload is the immutable envelope binding those lines to their source document. Every constraint exists to answer a later audit question — which unit standard the measure conformed to, whether gross weight was physically consistent with net, and which source file produced the record. Units resolve to UN/ECE Rec 20 codes so that the vocabulary matches what Syncing packing lists to shipment records via API transmits downstream.

from typing import Optional
from pydantic import BaseModel, Field, field_validator, ValidationInfo

# Canonical UOM Mapping (UN/ECE Rec 20 / ISO 80000)
UOM_MAP: dict[str, str] = {
    "kg": "KGM", "kgs": "KGM", "kilo": "KGM", "kilos": "KGM",
    "lbs": "LBR", "lb": "LBR", "pound": "LBR",
    "m3": "MTQ", "cbm": "MTQ", "cubic_meter": "MTQ",
    "ctn": "BOX", "box": "BOX", "carton": "BOX", "pkg": "PKG",
    "stuck": "PKG", "colis": "BOX",
    "pallet": "PAL", "plt": "PAL",
}


class PackageLine(BaseModel):
    """A single normalized packing-list line, keyed to its source document."""
    line_number: int = Field(ge=1)
    package_type: str
    quantity: int = Field(gt=0)
    net_weight: float = Field(ge=0)
    gross_weight: float = Field(ge=0)
    volume: float = Field(ge=0)
    uom_net: str
    uom_gross: str
    uom_volume: str
    parent_package_id: Optional[str] = None
    source_hash: str

    @field_validator("package_type")
    @classmethod
    def normalize_package_type(cls, v: str) -> str:
        token = v.strip().lower()
        return UOM_MAP.get(token, v.strip().upper())

    @field_validator("uom_net", "uom_gross", "uom_volume")
    @classmethod
    def standardize_uom(cls, v: str) -> str:
        return UOM_MAP.get(v.strip().lower(), v.strip().upper())

    @field_validator("gross_weight")
    @classmethod
    def gross_not_below_net(cls, v: float, info: ValidationInfo) -> float:
        net = info.data.get("net_weight")
        if net is not None and v < net:
            raise ValueError("gross weight cannot be less than net weight")
        return v


class PackingListPayload(BaseModel):
    """Immutable envelope binding validated lines to one source document."""
    shipment_id: str
    consignee: str
    packages: list[PackageLine]
    document_hash: str

    @field_validator("document_hash")
    @classmethod
    def verify_hash(cls, v: str) -> str:
        if len(v) != 64:
            raise ValueError("document_hash must be a 64-char SHA-256 digest")
        return v

The contract is intentionally strict: quantity must be positive, weights and volume non-negative, and the gross_not_below_net validator rejects any line where declared gross falls under declared net — a physical impossibility that almost always signals a transposed OCR digit. The source_hash on every line and the document_hash on the envelope carry the SHA-256 fingerprint of the raw bytes, so each record remains traceable to the exact file that produced it.

Step-by-Step Implementation

Normalization runs as an ordered set of stages. Each stage has one purpose, typed inputs and outputs, and an explicit failure mode; nothing advances until the prior stage emits a valid, non-degraded result.

Stage 1 — Fingerprint and admit

Purpose: bind every derived record to an immutable source identity before any interpretation happens. Input: raw document bytes plus extracted tokens. Output: a SHA-256 digest. Error condition: an empty token set is a hard admission failure, not an empty payload.

import hashlib


def generate_audit_hash(raw_bytes: bytes) -> str:
    """SHA-256 digest of the source document, used for audit lineage."""
    return hashlib.sha256(raw_bytes).hexdigest()

Stage 2 — Resolve units against the canonical dictionary

Purpose: collapse every unit variant to one UN/ECE Rec 20 code, absorbing OCR drift without guessing. Input: a raw unit token. Output: a canonical dictionary key. Error condition: a token whose best match scores below the confidence floor is unresolvable and quarantines the line rather than mapping to the nearest option.

from rapidfuzz import fuzz, process

UOM_CONFIDENCE_FLOOR = 75  # below this, route to human-in-the-loop review


def resolve_uom(raw_uom: str) -> str:
    """Map a raw unit token to a canonical UOM_MAP key, or raise."""
    candidate = raw_uom.strip().lower()
    if candidate in UOM_MAP:
        return candidate
    match = process.extractOne(candidate, UOM_MAP.keys(), scorer=fuzz.ratio)
    if match and match[1] >= UOM_CONFIDENCE_FLOOR:
        return match[0]
    raise UOMConversionError(f"unresolvable UOM token: {raw_uom!r}")

Stage 3 — Assemble and validate the payload

Purpose: build typed lines and the envelope, letting the schema enforce the contract. Input: the digest, per-line tokens, and a document header. Output: a PackingListPayload. Error conditions: schema violations raise SchemaValidationError; any other failure raises the base NormalizationError. Header fields (shipment reference, consignee) come from the document header, never from the last loop token — a subtle bug that would otherwise stamp every payload with the final line’s stray values.

import logging
from typing import Any
from pydantic import ValidationError

logger = logging.getLogger(__name__)
logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s %(message)s")


class NormalizationError(Exception):
    """Base class for deterministic normalization failures."""


class UOMConversionError(NormalizationError):
    """A unit token could not be resolved above the confidence floor."""


class SchemaValidationError(NormalizationError):
    """A line or payload failed the canonical schema contract."""


def normalize_packing_list(
    raw_doc: bytes,
    extracted_tokens: list[dict[str, Any]],
    header: dict[str, Any],
) -> PackingListPayload:
    if not extracted_tokens:
        raise NormalizationError("no package tokens extracted from document")

    doc_hash = generate_audit_hash(raw_doc)
    lines: list[PackageLine] = []

    try:
        for idx, token in enumerate(extracted_tokens, start=1):
            weight_uom = resolve_uom(token.get("uom", ""))
            volume_uom = resolve_uom(token.get("vol_uom", "m3"))
            lines.append(
                PackageLine(
                    line_number=idx,
                    package_type=token.get("type", "PKG"),
                    quantity=int(token.get("qty", 0)),
                    net_weight=float(token.get("net_wt", 0)),
                    gross_weight=float(token.get("gross_wt", 0)),
                    volume=float(token.get("vol", 0)),
                    uom_net=weight_uom,
                    uom_gross=weight_uom,
                    uom_volume=volume_uom,
                    parent_package_id=token.get("parent_id"),
                    source_hash=doc_hash,
                )
            )

        # Header-level fields come from the document header — never from a
        # per-line token, which would leak the last iteration's values.
        return PackingListPayload(
            shipment_id=header.get("shipment_ref", "UNKNOWN"),
            consignee=header.get("consignee", "UNKNOWN"),
            packages=lines,
            document_hash=doc_hash,
        )
    except ValidationError as exc:
        logger.error("schema validation failed: %s", exc.json())
        raise SchemaValidationError("payload failed canonical schema") from exc
    except UOMConversionError:
        raise
    except Exception as exc:  # noqa: BLE001 — converted to typed failure
        logger.critical("normalization pipeline failure: %s", exc)
        raise NormalizationError("deterministic normalization failed") from exc

Validation & Determinism

Determinism is the property that makes a normalized packing list defensible: the same source bytes and the same dictionary version must always emit the same payload or the same typed failure. Three checks enforce it.

Physical consistency. The gross_not_below_net validator is a hard invariant — gross weight below net is impossible in the physical world, so any line that violates it is a data-corruption signal, not a warning. Package hierarchy is validated by confirming every parent_package_id resolves to a known package on the same list; a dangling parent reference means the packaging tree was flattened or partially captured.

Bounded fuzzy resolution. Unit resolution never maps below the 75% confidence floor. This is the difference between correcting a KGS→kg typo and silently accepting a garbage token: above the floor is a deterministic correction, below it is a quarantine. Because the dictionary version is pinned, the same token always resolves the same way, so a re-run over a fixed snapshot reproduces byte-identical dispositions.

Reconciliation tolerances. Before any HS code is assigned, normalized quantities are cross-checked against the value declaration extracted by Commercial Invoice PDF Extraction. The reconciliation engine performs a deterministic join on SKU or line-item identifiers and applies fixed tolerances: package-count mismatches, net/gross weight deltas beyond 2%, and volumetric discrepancies all trip a compliance hold. Anything the join cannot match, or any delta beyond tolerance, routes to quarantine with the offending fields recorded rather than being forced through to classification.

Downstream Integration

A validated payload is not the end of the pipeline — it is the trusted input several downstream consumers depend on. Once reconciliation passes, the serialized payload flows to Syncing packing lists to shipment records via API, which posts it to the master shipment record so brokers and compliance officers see real-time updates in their brokerage platform. That integration layer enforces idempotency keys, so a retried post can never create a duplicate filing across ACE, ABI, and internal ERP systems.

The normalized record also becomes an authoritative quantity source for classification: the reconciled package counts and weights anchor the physical basis of the entry that duty and tariff engines consume. Locale-specific unit terminology — a German Stück or a French colis that must resolve deterministically to PKG or BOX — is aligned with the dictionaries maintained by Multi-language Invoice Parsing, so regional nomenclature is normalized consistently across both the invoice and the packing list rather than diverging between the two documents.

Scaling & Resilience

High-volume batch windows demand that normalization stay non-blocking without letting failures cascade. Payloads are drained through the concurrency and back-pressure patterns defined by Async Batch Processing for High Volume: a bounded semaphore caps concurrent normalization workers so a burst of scanned documents cannot spawn unbounded tasks or exhaust memory, and surplus documents stay durably buffered in the broker rather than in process heap. Ordering guarantees are preserved for reconciliation sequences even as extraction runs in parallel.

Failures integrate through an explicit taxonomy rather than being swallowed. A transient dependency stall — a UOM translation service or an OCR microservice exhibiting elevated latency — is a retryable condition handed to the backoff and dead-letter path defined by Error Handling & Retry Logic. Retry uses exponential backoff with full jitter, strictly capped at three attempts; a payload that fails three times dead-letters with its full stack trace, source-document hash, and violation codes rather than looping forever. A circuit breaker bounds the blast radius: once the failure rate against an external dependency crosses its threshold, the breaker opens and diverts affected payloads to a deferred-processing queue while alerting compliance staff.

Compliance Obligations

Customs compliance demands immutable auditability. Every normalized record carries the SHA-256 hash of its source document, and every quarantine or hold event logs the original token, the applied resolution, the confidence delta, and the final disposition to a tamper-evident store. The retention window aligns with CBP recordkeeping under 19 CFR § 163 — typically 5–7 years — and the store must preserve enough to reconstruct any filed quantity from its source packing list during a CBP Focused Assessment.

Quarantine is a recorded escalation, not a discard. A line whose unit token falls below the confidence floor, whose parent reference dangles, or whose reconciliation delta exceeds tolerance is held with its full lineage and surfaced to a licensed broker through an audited job ledger, so no document exits the pipeline unrecorded. Emergency-pause mechanisms let compliance officers halt normalization during a regulatory transition — a WCO HS 2022 revision or an HTSUS schedule update from the USITC published in the Federal Register — preventing misclassification during the changeover. Circuit-breaker transitions, tolerance thresholds, and per-document-type confidence floors are configured through versioned infrastructure-as-code parameters, so the operational envelope is reviewable rather than buried in code, and periodic reconciliation jobs mine quarantine partitions for recurring signatures — a specific carrier template, a specific scanner, a specific glyph pair — to feed root-cause analysis back to the ingestion team.

Commercial Invoice PDF Extraction — the value-declaration extractor whose line items this stage reconciles against.
OCR Drift Correction & Validation — the upstream repair layer that hands normalization clean coordinate and token data.
Multi-language Invoice Parsing — the locale-aware dictionaries that keep regional packaging terms mapping consistently.
Async Batch Processing for High Volume — the concurrency and back-pressure patterns that drain normalization at scale.
Syncing packing lists to shipment records via API — the downstream integration that posts the validated payload to the master shipment record.
Reconciling net vs gross weight discrepancies — tolerance thresholds and tare inference across packing list and invoice.

Up: Document Ingestion & Parsing Workflows

Authoritative references: WCO HS Nomenclature 2022 Edition, HTSUS (USITC), CBP ACE / ABI submission formats, UN/ECE Recommendation 20 (units of measure), WCO Data Model 3.x, 19 CFR § 163 recordkeeping.

Packing List Data Normalization

# Problem Framing: Why Packing Lists Resist Naive Parsing

# Schema / Data Contract

# Step-by-Step Implementation

# Stage 1 — Fingerprint and admit

# Stage 2 — Resolve units against the canonical dictionary

# Stage 3 — Assemble and validate the payload

# Validation & Determinism

# Downstream Integration

# Scaling & Resilience

# Compliance Obligations

# Related