Async Batch Processing for High Volume

High-volume customs brokerage operations need an ingestion architecture that decouples document receipt from the CPU- and I/O-bound work of parsing, validation, and Harmonized System classification. When a broker ingests thousands of commercial invoices, packing lists, and certificates of origin inside a narrow clearance window, a synchronous request-response path collapses under thread exhaustion and unbounded latency: a single slow OCR pass or a rate-limited tariff lookup stalls the entire queue. Within the broader Document Ingestion & Parsing Workflows domain, this page specifies the async batch pattern that removes that bottleneck: a pipeline built on a message broker that buffers payloads, fans work out to stateless Celery workers, and pairs at-least-once delivery with idempotent processing, so that every line item takes effect in the classification engine exactly once and leaves a complete audit trail from receipt to duty determination.

Problem Framing: Where Synchronous Ingestion Breaks

The failure mode is specific. A REST endpoint that parses inline holds a worker thread for the full duration of pdfplumber extraction, an OCR call, and a database write, which takes hundreds of milliseconds to several seconds per document. At peak filing volume (tens of thousands of documents per day, arriving in bursts around vessel ETAs and cutoff times), the thread pool saturates, health checks time out, and the load balancer sheds traffic that carries legally required entry data. Worse, a partial failure midway through a synchronous batch leaves ambiguous state: some line items committed, some not, and no clean way to replay only the failures.

Async batch processing removes the coupling. Documents are acknowledged at the ingestion boundary in single-digit milliseconds, serialized into an immutable envelope, and pushed to a broker. Workers claim micro-batches on their own schedule, and every stage transition is recorded so a compliance officer can reconstruct exactly what happened to a shipment reference. The design goal is not raw speed — it is deterministic, replayable, audit-defensible throughput under load.
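
The ingestion boundary described above can be sketched as follows. This is a minimal illustration, not the production producer: `accept_document` and the in-memory `BROKER` queue are hypothetical stand-ins for the real endpoint and message broker, and the envelope here omits the compliance payload for brevity.

```python
import hashlib
import json
import queue
import uuid
from datetime import datetime, timezone

# Hypothetical stand-in for the message broker; a real deployment would
# publish to Redis (or another broker) through Celery instead.
BROKER: "queue.Queue[str]" = queue.Queue()


def accept_document(file_bytes: bytes, source_system: str) -> dict:
    """Acknowledge a document at the ingestion boundary without parsing it."""
    envelope = {
        "correlation_id": str(uuid.uuid4()),                   # audit anchor
        "file_hash": hashlib.sha256(file_bytes).hexdigest(),   # 64 hex characters
        "source_system": source_system,
        "received_at": datetime.now(timezone.utc).isoformat(),
    }
    # Serialize once, then enqueue; no OCR, parsing, or database work happens here.
    BROKER.put(json.dumps(envelope, sort_keys=True))
    # The caller only gets a receipt; a worker picks the envelope up later.
    return {"correlation_id": envelope["correlation_id"], "state": "queued"}
```

The only work on the request path is a hash and an enqueue, which is what keeps the acknowledgement fast regardless of how slow parsing turns out to be.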

Schema / Data Contract

The contract between producer and consumer is an immutable envelope. It carries a correlation ID (the audit anchor), a SHA-256 file hash (deduplication and integrity), source metadata, and the extracted compliance payload. The payload models the WCO Data Model fields a customs entry depends on: line items with HS codes, currency, and origin. Formalizing this with Pydantic gives the worker a single deterministic validation gate — anything that fails the model never reaches the classification engine.

from datetime import datetime, timezone
from enum import Enum
from typing import Any

from pydantic import BaseModel, Field


class ProcessingState(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    VALIDATED = "validated"
    FAILED = "failed"
    DLQ = "dead_letter_queue"


class LineItem(BaseModel):
    description: str
    quantity: float = Field(gt=0)
    unit_price: float = Field(ge=0)
    currency: str = Field(pattern=r"^[A-Z]{3}$")          # ISO 4217
    hs_code: str = Field(pattern=r"^(?:\d{6}|\d{8}|\d{10})$")  # HS-6 or national line
    origin_country: str = Field(pattern=r"^[A-Z]{2}$")    # ISO 3166-1 alpha-2


class CompliancePayload(BaseModel):
    invoice_number: str
    line_items: list[LineItem]
    total_value: float
    incoterms: str = Field(pattern=r"^[A-Z]{3}$")         # Incoterms 2020 three-letter


class DocumentEnvelope(BaseModel):
    correlation_id: str = Field(..., description="Immutable audit anchor")
    file_hash: str = Field(..., min_length=64, max_length=64)   # SHA-256 hex
    source_system: str
    compliance_data: CompliancePayload
    received_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

The envelope is written once and never mutated in place; state transitions are recorded as separate audit rows keyed on correlation_id. This is what makes redelivery safe — a redelivered envelope is byte-identical, so an idempotent consumer can detect the duplicate by hash and skip it.
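
A minimal sketch of hash-based duplicate detection, assuming an in-memory set of completed hashes. `ProcessedHashes` and `handle_once` are hypothetical names; in production the set would need to live in shared storage (for example a unique database constraint) so that every worker sees the same view.

```python
from typing import Callable


class ProcessedHashes:
    """Tracks file hashes whose processing has fully completed."""

    def __init__(self) -> None:
        self._done: set[str] = set()

    def is_duplicate(self, file_hash: str) -> bool:
        return file_hash in self._done

    def mark_done(self, file_hash: str) -> None:
        self._done.add(file_hash)


def handle_once(envelope: dict, done: ProcessedHashes,
                process: Callable[[dict], None]) -> str:
    """Process an envelope unless an identical file was already completed."""
    if done.is_duplicate(envelope["file_hash"]):
        return "skipped"
    process(envelope)
    # Mark only after success: if the worker dies mid-flight, the redelivered
    # envelope is processed again instead of being silently dropped.
    done.mark_done(envelope["file_hash"])
    return "processed"
```

Marking the hash after the work finishes, rather than before, is the design choice that keeps this compatible with at-least-once redelivery.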

Step-by-Step Implementation

The pipeline runs in four ordered stages. Each stage has a defined purpose, inputs, outputs, and error condition, and each is isolated so a fault in parsing cannot corrupt classification state.

Stage 1 — Envelope validation (input: raw broker message; output: DocumentEnvelope; error: schema violation → Dead Letter Queue). The worker first deserializes and validates the envelope. A malformed envelope is a permanent failure — retrying will never fix it — so it routes straight to the Dead Letter Queue with full context rather than consuming retry budget.

Stage 2 — HTS/HS validation (input: CompliancePayload; output: validated payload; error: ValueError → DLQ). Digit-length and nomenclature rules are enforced deterministically before any downstream work.

Stage 3 — Async I/O (input: validated payload; output: persisted record; error: transient exception → retry). OCR verification, tariff lookups, and the database write run concurrently under asyncio.gather. These are the operations that fail transiently, so they are the ones wrapped in retry logic. Field-level OCR quality is handled by OCR Drift Correction & Validation, and locale-aware value parsing by Multi-language Invoice Parsing.

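Stage 4: Commit and record success (input: persisted record; output: incremented success counters; error: none beyond those raised in Stage 3). The commit is the idempotent UPSERT on correlation_id that the Stage 3 write issues; once it lands, the batch's processed count and the circuit-breaker success counter are incremented.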

Because Celery workers are synchronous, the async pipeline is driven by asyncio.run on a per-task event loop — the task wrapper stays thin, and all awaitable work lives in the coroutine.

import asyncio
import logging
from typing import Any

from celery import Celery
from pydantic import ValidationError

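# Set the level explicitly: the default (WARNING) would hide the INFO-level DLQ log lines.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s %(message)s")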
log = logging.getLogger("customs.async_batch")

CELERY_APP = Celery(
    "customs_etl",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)
MAX_RETRIES = 3


async def _process_document_batch_async(
    envelopes: list[dict[str, Any]],
    retries: int,
    retry_handler,
) -> dict[str, int]:
    """Async pipeline. Kept separate from the Celery wrapper because Celery
    workers do not natively await coroutines."""
    check_circuit_breaker()  # halts consumption when downstream is degraded
    results = {"processed": 0, "dlq_routed": 0}

    for raw_env in envelopes:
        correlation_id = raw_env.get("correlation_id", "<unknown>")
        try:
            env = DocumentEnvelope(**raw_env)                    # Stage 1
            validate_hts_compliance(env.compliance_data)         # Stage 2
            await asyncio.gather(                                # Stage 3
                _verify_ocr(env.file_hash),
                _persist_idempotent(env.correlation_id, env.compliance_data),
            )
            results["processed"] += 1                            # Stage 4
            record_success()

        except (ValidationError, ValueError) as exc:
            # Permanent: schema or nomenclature violation — do not retry.
            log.warning("Permanent failure %s: %s", correlation_id, exc)
            await _route_to_dlq(raw_env, ProcessingState.DLQ, str(exc))
            results["dlq_routed"] += 1
            record_failure()

        except Exception as exc:  # noqa: BLE001 — transient I/O
            log.error("Transient failure %s: %s", correlation_id, exc)
            record_failure()
            countdown = (2 ** retries) * 1.5  # exponential backoff: 1.5s, 3s, 6s (no jitter in this sketch)
            retry_handler(exc=exc, countdown=countdown)

    return results


@CELERY_APP.task(bind=True, max_retries=MAX_RETRIES, acks_late=True)
def process_document_batch(self, envelopes: list[dict[str, Any]]) -> dict[str, int]:
    """Celery wrapper. `acks_late=True` ensures the broker redelivers a batch
    if the worker dies mid-flight — idempotent Stage 4 makes that safe."""
    def _retry(exc: Exception, countdown: float) -> None:
        raise self.retry(exc=exc, countdown=countdown, max_retries=MAX_RETRIES)

    return asyncio.run(
        _process_document_batch_async(envelopes, self.request.retries, _retry)
    )


async def _verify_ocr(file_hash: str) -> None:
    await asyncio.sleep(0.1)  # placeholder: call OCR engine, check drift confidence


async def _persist_idempotent(correlation_id: str, payload: "CompliancePayload") -> None:
    await asyncio.sleep(0.1)  # placeholder: asyncpg UPSERT keyed on correlation_id


async def _route_to_dlq(envelope: dict[str, Any], state: ProcessingState, reason: str) -> None:
    log.info("Routed %s to DLQ (%s): %s", envelope.get("correlation_id"), state.value, reason)

Batch size is not fixed. It is tuned against queue depth, per-worker memory, and downstream API rate limits. Micro-batches of roughly 50–200 records keep memory bounded, and throughput scales close to linearly with added workers until the broker or a downstream rate limit becomes the bottleneck. The exact broker-backpressure tuning is worked through in Implementing async queues for bulk customs docs.
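
As a rough illustration of micro-batching, the helper below splits any stream of envelopes into fixed-size chunks. `micro_batches` is a hypothetical name, and the default of 100 is simply a value inside the 50–200 range mentioned above, not a tuned recommendation.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def micro_batches(envelopes: Iterable[dict], size: int = 100) -> Iterator[list[dict]]:
    """Yield consecutive chunks of at most `size` envelopes."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(envelopes)
    # islice pulls up to `size` items per pass; an empty list ends the loop.
    while batch := list(islice(it, size)):
        yield batch
```

Each chunk could then be handed to the Celery task shown earlier, for example with `process_document_batch.delay(batch)`, so that no single task holds more than one micro-batch in memory.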

Validation & Determinism

Validation is the point of the pipeline. Every line item must satisfy Harmonized System digit-length rules before it can reach duty calculation: HS-6 is the globally standardized base under WCO HS 2022, while 8- and 10-digit national lines (HTSUS under the USITC schedule, or the EU's Combined Nomenclature and TARIC subdivisions) must resolve against the correct jurisdictional tariff. Items with an invalid HS length, a missing or ambiguous description, or a non-ISO currency code are quarantined, not propagated: a corrupt classification input is far more expensive than a delayed one.

def validate_hts_compliance(payload: "CompliancePayload") -> None:
    """Enforce HS digit-length and WCO nomenclature rules before classification."""
    for idx, item in enumerate(payload.line_items):
        code = item.hs_code
        if len(code) == 6:
            log.debug("HS-6 base validated (line %d): %s", idx, code)          # WCO global
        elif len(code) in (8, 10):
            log.debug("National tariff line validated (line %d): %s", idx, code)  # HTSUS / TARIC
        else:
            raise ValueError(f"Invalid HS code length at line {idx}: {code!r}")

        if item.currency == "XXX":  # ISO 4217 'no currency' sentinel — never valid on an entry
            raise ValueError(f"Unresolvable currency at line {idx}")
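
To make the digit-length rules concrete, the sketch below splits a code into its hierarchical parts: the first two digits are the HS chapter, the first four the heading, the first six the subheading, and anything beyond six is the national extension. `split_hs_code` and `HSParts` are hypothetical helpers, and the function checks structure only; it does not confirm that a code exists in any tariff schedule.

```python
from typing import NamedTuple, Optional


class HSParts(NamedTuple):
    chapter: str                    # first 2 digits
    heading: str                    # first 4 digits
    subheading: str                 # first 6 digits (the global HS-6 base)
    national_suffix: Optional[str]  # digits 7-8 or 7-10, if present


def split_hs_code(code: str) -> HSParts:
    """Split a 6-, 8-, or 10-digit code into its hierarchical components."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (6, 8, 10):
        raise ValueError(f"Invalid HS code: {code!r}")
    return HSParts(
        chapter=code[:2],
        heading=code[:4],
        subheading=code[:6],
        national_suffix=code[6:] or None,  # empty string for a bare HS-6 code
    )
```

Comparing the `subheading` of two national lines is a cheap way to tell whether they share the same HS-6 base even when their national extensions differ.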

Determinism extends to redelivery: because Stage 4 UPSERTs on correlation_id, replaying a batch produces identical state. The SHA-256 file hash in the envelope lets the consumer reject byte-identical duplicates outright, so at-least-once broker delivery never becomes at-least-twice processing.
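
The replay-safety claim can be demonstrated with a small UPSERT example. This sketch uses the standard-library `sqlite3` module as a stand-in for PostgreSQL (it needs SQLite 3.24 or newer for `ON CONFLICT ... DO UPDATE`); the table layout and function names are assumptions for illustration, and PostgreSQL's `INSERT ... ON CONFLICT` has the same shape.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table keyed on correlation_id (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        " correlation_id TEXT PRIMARY KEY,"
        " invoice_number TEXT NOT NULL,"
        " total_value REAL NOT NULL)"
    )
    return conn


def persist_idempotent(conn: sqlite3.Connection, correlation_id: str,
                       invoice_number: str, total_value: float) -> None:
    """Insert the record, or overwrite it in place if the key already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO entries (correlation_id, invoice_number, total_value) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(correlation_id) DO UPDATE SET "
            "invoice_number = excluded.invoice_number, "
            "total_value = excluded.total_value",
            (correlation_id, invoice_number, total_value),
        )
```

Calling `persist_idempotent` twice with the same arguments leaves exactly one row, which is the property that makes redelivery harmless.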

Downstream Integration

This stage sits between raw extraction and classification, and its validated output is the contract every downstream consumer depends on. Once a worker claims a document it invokes Commercial Invoice PDF Extraction to parse structured fields, line items, and Incoterms, then reconciles physical quantities through Packing List Data Normalization so gross/net weights, package counts, and container seals agree before duty is assessed. The clean, schema-valid payload that leaves this pipeline is what the classification and duty engines consume — parsing faults are isolated upstream so classification logic only ever sees structurally sound inputs.

Scaling & Resilience

Two controls keep the pipeline stable under peak load. Retry strategy: transient failures (OCR timeouts, dropped database connections, rate-limited tariff APIs) retry with exponential backoff and jitter while preserving the original correlation_id; permanent failures (schema violations, unsupported formats, hash mismatches, unresolvable HS codes) skip retries entirely and route to the DLQ for manual review. The backoff schedule and its interaction with acks_late redelivery are derived in Designing exponential backoff for failed parsing jobs.
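
One way to add jitter to the backoff is sketched below. It keeps the 1.5-second base used in the task code and draws the delay from the upper half of the exponential ceiling (often called "equal jitter"). The 60-second cap and the function name are assumptions, not values taken from the pipeline.

```python
import random
from typing import Optional


def backoff_with_jitter(retries: int, base: float = 1.5, cap: float = 60.0,
                        rng: Optional[random.Random] = None) -> float:
    """Return a retry delay in seconds for the given attempt number."""
    source = rng or random                       # injectable RNG for repeatable tests
    ceiling = min(cap, base * (2 ** retries))    # 1.5, 3, 6, ... capped at `cap`
    # Half the delay is fixed and half is random, so workers that failed
    # together do not all retry at the same instant.
    return source.uniform(ceiling / 2, ceiling)
```

The result can be passed as the `countdown` argument of the retry call, so each worker that hit the same outage comes back at a slightly different time.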

Circuit breaker: when the failure rate exceeds a threshold (for example, 15% over a rolling 5-minute window), the breaker opens, halts new task consumption, and drains in-flight batches — protecting the downstream classification database from a flood of corrupt writes during a broker outage or an upstream API degradation. After a cooldown it moves to half-open and lets a single probe batch verify recovery before full throughput resumes. In production the breaker state lives in Redis (not process memory) so every worker shares one view.

from datetime import datetime, timezone
from typing import Any

CIRCUIT_FAILURE_THRESHOLD = 0.15
CIRCUIT_WINDOW_SECONDS = 300
CIRCUIT_MIN_SAMPLES = 20  # assumed floor so a single early failure cannot trip the breaker

# Demonstration state; back this with Redis so all workers share it in production.
_circuit: dict[str, Any] = {"failures": 0, "processed": 0, "last_reset": datetime.now(timezone.utc)}


def check_circuit_breaker() -> None:
    now = datetime.now(timezone.utc)
    if (now - _circuit["last_reset"]).total_seconds() > CIRCUIT_WINDOW_SECONDS:
        # Fixed-window reset: a simple approximation of the rolling window.
        _circuit.update(failures=0, processed=0, last_reset=now)
    total = _circuit["processed"] + _circuit["failures"]
    if total >= CIRCUIT_MIN_SAMPLES and _circuit["failures"] / total >= CIRCUIT_FAILURE_THRESHOLD:
        raise RuntimeError(
            "Circuit breaker OPEN: halting consumption to protect downstream compliance systems."
        )


def record_success() -> None:
    _circuit["processed"] += 1


def record_failure() -> None:
    _circuit["failures"] += 1
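
The demonstration state above has no cooldown or half-open phase. A minimal sketch of those transitions follows, assuming in-process state, an injectable clock, and hypothetical parameter values; it leaves out the time window and the shared Redis backing that production would need.

```python
import time
from typing import Callable


class CircuitBreaker:
    """closed -> open on high failure rate; open -> half_open after cooldown."""

    def __init__(self, threshold: float = 0.15, min_samples: int = 20,
                 cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold, self._min_samples = threshold, min_samples
        self._cooldown, self._clock = cooldown, clock
        self.state = "closed"
        self._failures = self._total = 0
        self._opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a batch may be consumed right now."""
        if self.state == "open" and self._clock() - self._opened_at >= self._cooldown:
            self.state = "half_open"   # admit exactly one probe batch
            return True
        return self.state == "closed"  # half_open blocks until the probe reports

    def record(self, success: bool) -> None:
        """Report the outcome of a batch."""
        if self.state == "open":
            return                     # ignore results from draining batches
        if self.state == "half_open":
            if success:                # probe passed: resume full throughput
                self.state = "closed"
                self._failures = self._total = 0
            else:                      # probe failed: restart the cooldown
                self._trip()
            return
        self._total += 1
        self._failures += 0 if success else 1
        if (self._total >= self._min_samples
                and self._failures / self._total >= self._threshold):
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()
```

A worker would call `allow()` before claiming a batch and `record(...)` after finishing it; while the breaker is half-open, only the single probe batch gets through.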

Semaphore limits on concurrent OCR and tariff-API calls bound the memory footprint of each worker, keeping micro-batch processing within its allocation even when a burst of large multi-page invoices arrives at once.
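
A short sketch of the semaphore pattern, using `asyncio.Semaphore` to cap how many calls are in flight at once. `run_bounded` is a hypothetical helper, and the jobs are assumed to be zero-argument coroutine functions such as a wrapped OCR or tariff-API call.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import Any


async def run_bounded(jobs: Sequence[Callable[[], Awaitable[Any]]],
                      limit: int) -> list[Any]:
    """Run all jobs concurrently, with at most `limit` executing at a time."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[Any]]) -> Any:
        async with sem:          # waits here while `limit` jobs are already running
            return await job()

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Because every job is still scheduled up front, the batch finishes as fast as the limit allows, but peak memory is bounded by `limit` concurrent documents rather than by the batch size.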

Compliance Obligations

Async batch processing is, above all, a compliance control. Decoupling ingestion from parsing lets the pipeline enforce deterministic validation gates, preserve cryptographic integrity, and isolate transient infrastructure faults from tariff classification, all while producing the records an audit demands. Every state transition (QUEUED → PROCESSING → VALIDATED/DLQ) is logged with its correlation_id, file hash, engine version, and timestamp, giving a point-in-time reconstruction of any shipment’s path from receipt to HS assignment. That trail satisfies CBP recordkeeping obligations under 19 CFR Part 163 and equivalent EU customs requirements, and it drives the human-in-the-loop escalation gate: anything in the DLQ carries enough context for a broker to correct and replay it without guesswork. Retention rules apply to the DLQ payloads and the audit rows equally, since both are part of the entry record. When a tariff bulletin or Federal Register notice changes a national line, the immutable envelope history makes it possible to identify exactly which prior entries used the superseded code.
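
The audit trail described above can be sketched as an append-only log of immutable rows. `AuditRow` and `AuditLog` are hypothetical names, the in-memory list stands in for a durable table, and the field set simply mirrors the four attributes listed in the paragraph.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a row cannot be edited after it is written
class AuditRow:
    correlation_id: str
    file_hash: str
    state: str
    engine_version: str
    recorded_at: datetime


class AuditLog:
    """Append-only record of state transitions, keyed on correlation_id."""

    def __init__(self) -> None:
        self._rows: list[AuditRow] = []

    def record(self, correlation_id: str, file_hash: str,
               state: str, engine_version: str) -> AuditRow:
        row = AuditRow(correlation_id, file_hash, state, engine_version,
                       datetime.now(timezone.utc))
        self._rows.append(row)   # rows are only ever appended, never updated
        return row

    def history(self, correlation_id: str) -> list[str]:
        """Return the ordered list of states one document passed through."""
        return [r.state for r in self._rows if r.correlation_id == correlation_id]
```

Reconstructing a shipment's path is then a filter on correlation_id, and because rows are never updated, the answer does not change after the fact.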

Related

Commercial Invoice PDF Extraction — parses the structured fields this pipeline validates.
Packing List Data Normalization — reconciles weights and package counts before duty assessment.
OCR Drift Correction & Validation — confidence-threshold checks invoked during Stage 3 I/O.
Error Handling & Retry Logic — DLQ routing and backoff strategy for the transient path.
Implementing async queues for bulk customs docs — broker, envelope, and backpressure tuning in depth.
asyncpg bulk COPY vs executemany benchmarks — the fastest way to land parsed rows in PostgreSQL.

Authoritative references: WCO HS Nomenclature 2022 · USITC Harmonized Tariff Schedule · CBP recordkeeping — 19 CFR Part 163
