Why correct OCR drift by field type instead of running a single spell-checker over the whole document?

Each customs field has a different validity domain: an HS code must match a strict 6/8/10-digit rule, a currency must be a member of the ISO 4217 set, and a free-text description has no closed vocabulary. A generic spell-checker has no notion of these constraints, so it will happily 'correct' a valid code into an invalid one. Routing each token to a field-specific corrector lets the routine reject 0/O and 8/B substitutions deterministically for structured fields and reserve fuzzy matching for closed sets like currency mnemonics.

When should the fuzzy currency match accept a candidate?

Only when RapidFuzz's best candidate clears a tuned score cutoff. process.extractOne returns None when nothing scores above zero, so the unpack must be guarded before destructuring. A cutoff around 78 recovers U5D to USD while rejecting an ambiguous mnemonic that could map to two currencies; the exact value is calibrated against a labeled replay set so precision stays above 0.95.

What happens to a token the routine cannot repair?

It is never guessed into the duty base. An unresolved token increments the failure counter and is quarantined with its raw text, confidence score, and bounding box, so a licensed broker can reconstruct exactly what the scanner read. If the batch-wide failure rate crosses the circuit-breaker threshold, the pipeline halts entirely rather than let systematic scanner drift contaminate a filed entry.

13 min read
1 code sample

Correcting OCR drift in scanned customs forms

This page answers one narrow implementation question: once a scanned commercial invoice or packing list has been recognized, how do you deterministically repair lexical OCR drift — the 0/O, 8/B, 5/S glyph substitutions that corrupt HS codes and currency mnemonics — before the token reaches a duty engine? It is the repair stage of OCR Drift Correction & Validation, sitting after raster-to-text recognition and before any record reaches Duty Formula Calculation Frameworks. Because the OCR output is the only surviving evidence of what the source document declared, this layer behaves as a compliance boundary: it either resolves a token deterministically, or it quarantines it with a machine-readable reason.

The concrete failure mode targeted here is lexical drift in structured fields. Compression artifacts, low-contrast toner, and scanner wear substitute visually similar glyphs one character at a time — a U5D where the source read USD, an 8471.3O.01.00 where the classification held 8471.30.01.00. A generic spell-checker has no notion of a customs field’s validity domain and will “correct” a valid code into an invalid one. The fix is to route each recognized token to a field-specific corrector: a strict digit-length rule for HS codes, membership in the ISO 4217 set for currency, and confidence-gated fuzzy matching only where a closed vocabulary exists. Coordinate (spatial) drift is handled upstream by Commercial Invoice PDF Extraction; this stage owns lexical repair on the tokens that survive it.

Prerequisites

Pin the following before applying this routine. The fuzzy matching and the async batching are load-bearing, and older releases change the return contract the code depends on.

Python 3.10+ — the code uses X | Y union hints and builtin generics already established across these ingestion workflows.
rapidfuzz >= 3.0. The process.extractOne contract matters: in 3.x it returns a (choice, score, key) tuple, or None when no candidate scores above zero. Code written against fuzzywuzzy or pre-3.0 rapidfuzz assumed a non-None return and will raise on an empty candidate set.
A recognized text layer with confidence scores. This stage consumes the typed OCRToken contract emitted by the parent OCR Drift Correction & Validation stage — raw string, confidence, bounding box, and field type. A rasterized image with no recognition pass has nothing to correct.
A canonical ISO 4217 reference set. The example inlines the common trade currencies; in production, load the full ISO 4217 alpha-3 list so a legitimate but uncommon currency is not fuzzy-matched onto a neighbour.
asyncio for batch fan-out. High-volume repair reuses the same non-blocking pattern as Async Batch Processing for High Volume; this page focuses on the per-token correctors that run inside it.

Implementation

The routine routes each ExtractionResult to a corrector chosen by field_type. correct_hs_code strips every non-digit and admits the token only if it clears the WCO/HTSUS length rule — 6, 8, or 10 digits, never 4, 5, 7, or 9. correct_currency normalizes case and symbols, checks direct ISO 4217 membership, and only then falls back to RapidFuzz — guarding the extractOne unpack because it returns None on an empty candidate set. A PipelineState counter feeds a CircuitBreaker so that a batch whose failure rate crosses the threshold halts before corrupt tokens reach duty assessment.

import asyncio
import logging
import re
from typing import List, Optional, Tuple
from dataclasses import dataclass, field
from datetime import datetime, timezone
from rapidfuzz import process, fuzz

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    handlers=[logging.StreamHandler()],
)
logger = logging.getLogger("customs_ocr_pipeline")

# Regulatory reference sets. Load the full ISO 4217 alpha-3 list in production.
VALID_CURRENCIES = {"USD", "EUR", "GBP", "CAD", "JPY", "CNY", "MXN", "AUD", "CHF"}
# WCO/HTSUS codes are exactly 6, 8, or 10 digits — never 4, 5, 7, or 9.
HS_CODE_PATTERN = re.compile(r"^\d{6}(?:\d{2}(?:\d{2})?)?$")
DRIFT_THRESHOLD = 78              # RapidFuzz score cutoff, calibrated on a replay set
CIRCUIT_BREAKER_THRESHOLD = 0.35  # 35% batch failure rate halts the pipeline
MAX_RETRIES = 3


@dataclass
class ExtractionResult:
    raw_text: str
    field_type: str                                  # "HS_CODE" | "CURRENCY" | other
    confidence: float
    bbox: Optional[Tuple[float, float, float, float]] = None
    corrected_value: Optional[str] = None
    is_valid: bool = False


@dataclass
class PipelineState:
    total_processed: int = 0
    total_failed: int = 0
    circuit_open: bool = False
    last_reset: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def failure_rate(self) -> float:
        return self.total_failed / self.total_processed if self.total_processed else 0.0


class CircuitBreaker:
    def __init__(self, state: PipelineState):
        self.state = state

    def check(self) -> None:
        if self.state.failure_rate > CIRCUIT_BREAKER_THRESHOLD:
            self.state.circuit_open = True
            logger.critical(
                "Circuit breaker tripped: failure rate %.2f%% exceeds threshold.",
                self.state.failure_rate * 100,
            )
            raise RuntimeError("Emergency pause: high OCR drift rate detected.")


def correct_hs_code(raw: str) -> Optional[str]:
    """Strip non-digits and admit only WCO/HTSUS-valid lengths (6, 8, 10)."""
    cleaned = re.sub(r"[^0-9]", "", raw)
    return cleaned if HS_CODE_PATTERN.match(cleaned) else None


def correct_currency(raw: str) -> Optional[str]:
    """ISO 4217 membership first; confidence-gated fuzzy fallback second."""
    cleaned = raw.strip().upper().replace(" ", "").replace("$", "").replace("€", "")
    if not cleaned:
        return None
    if cleaned in VALID_CURRENCIES:
        return cleaned
    # Fuzzy fallback for OCR corruption (e.g. "U5D" -> "USD").
    # extractOne returns None when nothing clears the internal cutoff, so guard
    # the result before destructuring the (match, score, key) tuple.
    result = process.extractOne(cleaned, VALID_CURRENCIES, scorer=fuzz.ratio)
    if result is None:
        return None
    match, score, _ = result
    return match if score >= DRIFT_THRESHOLD else None


async def process_field_with_retry(
    extraction: ExtractionResult, state: PipelineState
) -> ExtractionResult:
    CircuitBreaker(state).check()
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            if extraction.field_type == "HS_CODE":
                extraction.corrected_value = correct_hs_code(extraction.raw_text)
            elif extraction.field_type == "CURRENCY":
                extraction.corrected_value = correct_currency(extraction.raw_text)
            else:
                extraction.corrected_value = extraction.raw_text.strip()

            extraction.is_valid = extraction.corrected_value is not None
            state.total_processed += 1
            if extraction.is_valid:
                logger.info("Corrected %s -> %s", extraction.raw_text, extraction.corrected_value)
            else:
                state.total_failed += 1
                logger.warning("Drift unresolved (attempt %d): %r", attempt, extraction.raw_text)
            return extraction
        except Exception as exc:  # transient scorer/IO failure — back off and retry
            logger.error("Attempt %d failed: %s", attempt, exc)
            if attempt == MAX_RETRIES:
                state.total_processed += 1
                state.total_failed += 1
                extraction.is_valid = False
                return extraction
            await asyncio.sleep(2 ** attempt)  # exponential backoff
    return extraction


async def run_batch_pipeline(extractions: List[ExtractionResult]) -> List[ExtractionResult]:
    state = PipelineState()
    tasks = [process_field_with_retry(ex, state) for ex in extractions]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    valid = [r for r in results if isinstance(r, ExtractionResult)]
    logger.info(
        "Batch complete. processed=%d failed=%d circuit=%s",
        state.total_processed, state.total_failed, "OPEN" if state.circuit_open else "CLOSED",
    )
    return valid

Every token that survives correction carries a resolved corrected_value; every token that does not carries is_valid = False and its original bbox, so a broker reviewing the quarantine can locate the value on the scanned page. Records the routine cannot repair are held for review rather than routed into Fallback Routing for Unmapped Codes, which handles valid codes with no schedule match — a different failure class from a corrupted token.

Verification steps

Run these checks against a labeled sample before the corrector carries production filings. Each one is deterministic and reproducible in staging.

Enforce HS digit-length rules. Assert every corrected_value for an HS_CODE field matches HS_CODE_PATTERN — exactly 6, 8, or 10 digits. A 0% rejection rate on a real scanned batch usually means a SKU column is leaking into the HS field, not that every code recognized cleanly.
Bound the fuzzy currency threshold. Replay a labeled set of corrupted mnemonics (U5D, EUR0, G8P) through correct_currency. Plot the RapidFuzz score against confirmed corrections and set DRIFT_THRESHOLD so precision stays above 0.95 while recall holds above 0.88. Record any threshold change in version control.
Compute the duty impact delta. Run parallel duty calculations on raw versus corrected values: ΔDuty = (corrected − raw) × tariff_rate. Escalate any line item where |ΔDuty| > $50.00 for licensed-broker review before ABI submission — this catches a transposed digit in a declared value that the length gate alone would pass.
Reconcile quarantine counts. Confirm that every token where is_valid is False lands in the quarantine list, and that the count equals state.total_failed. A drift between the two means a failure path is swallowing a token without recording it.
Exercise the circuit breaker. Inject a batch whose failure rate exceeds CIRCUIT_BREAKER_THRESHOLD and confirm CircuitBreaker.check trips, halts processing, and prevents any corrected token from that batch reaching the duty engine. Reset requires a compliance-officer sign-off, then a warm-up batch of verified documents.

Edge cases & gotchas

The failure modes below are specific to lexical correction over real scanned customs forms, and most only surface once you leave a single clean vendor template.

extractOne returns None, not a tuple. In rapidfuzz >= 3.0, process.extractOne yields None when no candidate scores above zero — an empty or heavily corrupted mnemonic hits this. Destructuring match, score, _ = result without the None guard raises TypeError and crashes the whole batch. Keep the guard, and never assume a non-empty candidate set.
A valid-length HS code can still be the wrong code. correct_hs_code only proves the shape is right; 8471.3O.01.00 cleaned to 8471300100 passes the length gate even if the O→0 substitution changed the tariff line. Digit-length validation is necessary but not sufficient — cross-check the corrected prefix against the schedule during HTS Schedule Database Design lookup before duty assessment, and treat a checksum mismatch as quarantine, not a pass.
Fuzzy matching onto a truncated currency set. The inline VALID_CURRENCIES covers common trade currencies; a legitimate SGD or NOK invoice will fuzzy-match onto the nearest included member and silently corrupt the value. Load the full ISO 4217 list in production so membership, not proximity, decides.
Non-ASCII digit corruption before recognition. Full-width １２３ and Arabic-Indic ٤٥٦ survive as distinct code points; re.sub(r"[^0-9]", "", raw) strips them to empty and the token fails the length gate as if unreadable. Normalize scripts through Multi-Language Invoice Parsing before this corrector runs, rather than letting NFKC gaps read as drift.
The circuit breaker can starve a healthy tail. check() runs at the start of each task, so once the rate crosses the threshold every remaining task in the gather raises, including tokens that would have corrected cleanly. That is intentional — a systematic scanner fault contaminates the whole batch — but pair it with the retry-and-backoff design in Error Handling & Retry Logic so transient scorer failures do not trip the breaker prematurely.
gather(return_exceptions=True) hides raised tasks. A task that raises past its retries returns the exception object, not an ExtractionResult; the isinstance filter then silently drops it from the output. Count the difference between input length and valid length and reconcile it against state.total_failed, or a swallowed exception will under-report the true quarantine volume.

OCR Drift Correction & Validation — the parent workflow this repair stage plugs into, covering the spatial/lexical/statistical drift taxonomy and the token contract.
Commercial Invoice PDF Extraction — hands this stage clean coordinate geometry so only lexical drift remains to repair.
Multi-Language Invoice Parsing — locale digit maps and Unicode normalization that keep non-ASCII digits from reading as drift.
Error Handling & Retry Logic — the backoff and dead-letter patterns behind the retry wrapper and circuit breaker.
HTS Schedule Database Design — the schema a corrected HS code is resolved against before duty assessment.

Up: OCR Drift Correction & Validation

Authoritative references: WCO HS 2022 Nomenclature, HTSUS (USITC), CBP ACE / ABI submission formats, EU ATLAS validation rules, ISO 4217 currency codes, UN/ECE Recommendation No. 20 (units of measure).

Correcting OCR drift in scanned customs forms

# Prerequisites

# Implementation

# Verification steps

# Edge cases & gotchas

# Related

Prerequisites

Implementation

Verification steps

Edge cases & gotchas

Related