Why hash the raw bytes instead of the document URI for deduplication?

The same commercial invoice is frequently re-uploaded under a new object key by different forwarders, so a URI-based key would let duplicates through. A SHA-256 hash of the raw bytes is stable across renames and object-store re-writes, so the Redis SET NX gate catches the true duplicate and prevents a second, conflicting customs entry.

Should the ingestion queue enforce ordering?

Only where it is required. Use a FIFO queue partitioned by shipment reference or importer of record so that an invoice amendment never overtakes the original within a shipment, but keep unrelated shipments on independent partitions so ordering does not serialize the whole pipeline and destroy throughput.
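
As a rough sketch of the partitioning idea, assuming the AWS SQS FIFO option: SQS preserves order only within a MessageGroupId, so using the shipment reference as the group ID keeps an amendment behind its original while other shipments proceed in parallel. The helper name fifo_send_kwargs, the queue URL, and the sample envelopes are hypothetical; MessageGroupId and MessageDeduplicationId are the standard send_message parameters for FIFO queues.

```python
import json
from typing import Any


def fifo_send_kwargs(queue_url: str, envelope: dict[str, Any]) -> dict[str, Any]:
    """Build keyword arguments for an SQS FIFO send_message call (hypothetical helper)."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(envelope, sort_keys=True),
        # Ordering is enforced per group, so one shipment maps to one ordered lane.
        "MessageGroupId": envelope["shipment_ref"],
        # Assumption: trace_id is unique per receipt, so it can double as the
        # broker-level deduplication ID.
        "MessageDeduplicationId": envelope["trace_id"],
    }


# Hypothetical queue URL and envelopes, for illustration only.
QUEUE_URL = "https://sqs.example.invalid/customs-ingest.fifo"
original = {"shipment_ref": "SHP-001", "trace_id": "t-1", "doc_type": "commercial_invoice"}
amendment = {"shipment_ref": "SHP-001", "trace_id": "t-2", "doc_type": "commercial_invoice"}
unrelated = {"shipment_ref": "SHP-002", "trace_id": "t-3", "doc_type": "packing_list"}

kwargs = [fifo_send_kwargs(QUEUE_URL, e) for e in (original, amendment, unrelated)]
# With boto3 this would be sent as: boto3.client("sqs").send_message(**kwargs[0])
```

The first two messages share a group and are delivered in order; the third sits in its own group and can be consumed concurrently. This broker-level deduplication is separate from the content-hash gate in the consumer.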

How does the circuit breaker avoid losing documents when it opens?

Opening the breaker stops consumption, not production. In-flight messages are acked late, and unconsumed messages remain durably buffered in the broker. When the cooldown elapses and a probe succeeds, consumption resumes from the buffered backlog with no message loss and no partial customs declarations committed.

Implementing async queues for bulk customs docs

This page answers one narrow implementation question: how do you build an ingestion queue that survives a peak filing window without double-filing an entry, starving its consumers, or committing a half-parsed declaration? High-volume customs pipelines routinely exceed synchronous processing thresholds — trade compliance teams and logistics developers must ingest thousands of commercial invoices, packing lists, and certificates of origin inside narrow clearance windows. The concrete failure mode targeted here is duplicate-or-lost documents under backpressure: a synchronous handler blocks on OCR and object-store I/O, the thread pool exhausts, retries re-submit the same invoice, and a second customs entry is created for a shipment that already cleared.

The fix is an asynchronous queue that decouples document receipt from resource-intensive extraction, normalization, and classification. This is the exact configuration referenced from Async Batch Processing for High Volume: a lightweight producer that builds the envelope and enqueues it, an idempotent consumer that hashes the raw bytes and de-duplicates before it does any other work, dead-letter routing for poison payloads, and a circuit breaker that pauses cleanly rather than corrupting a declaration mid-flight. The design goals are payload immutability, idempotent consumers, and strict alignment with the WCO Data Model and CBP ACE submission formats.

Prerequisites

Pin the following environment before applying this pattern. The idempotency and circuit-breaker logic below depend on Redis SET with the NX flag and per-key TTLs, and the retry semantics assume Celery late-acknowledgement.

Python 3.10+ — the code uses X | Y union hints and dict[str, Any] builtins already established across these ingestion workflows.
celery >= 5.3 with RabbitMQ 3.12+ as the broker (or AWS SQS FIFO where ordering by shipment reference is mandatory). Enable task_acks_late = True and task_reject_on_worker_lost = True so a crashed worker redelivers rather than silently drops a document.
redis >= 5.0 (server and redis-py) for the SHA-256 deduplication store and the rolling-window circuit-breaker counters. Keep the dedup keys and the breaker counters in the same logical DB so a single FLUSHDB never desynchronizes them.
boto3 >= 1.34 for object-storage reads, with raw document bytes already landed in a bucket keyed by ingestion date. This page assumes the raw-capture layer exists and the extraction stages downstream — Commercial Invoice PDF Extraction and Packing List Data Normalization — are already implemented as separate tasks this queue routes into.
A propagated trace_id minted at receipt and carried through Redis, object metadata, and every DLQ payload, so the audit trail reconstructs end to end.

Implementation

The producer serializes each document into an immutable JSON envelope (document URI, document type, shipment reference, importer EORI/VAT, trace ID, and a UTC receipt timestamp) and enqueues only that envelope, so raw bytes never travel through the broker and network I/O never blocks CPU-bound extraction. The consumer fetches the raw bytes, hashes them, and de-duplicates before doing any other work: the Redis SET ... NX EX acts as an atomic claim, so a redelivered or re-uploaded invoice is dropped rather than re-filed. Only after the claim succeeds does it route to a document-type-specific task.

import hashlib
import logging
from datetime import datetime, timezone
from typing import Any

import boto3
import redis
from celery import Celery

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(name)s | %(message)s")
logger = logging.getLogger("customs_etl.ingest")

app = Celery("customs_etl", broker="amqp://guest:guest@rabbitmq:5672//")
app.conf.update(task_acks_late=True, task_reject_on_worker_lost=True)

redis_client = redis.Redis(host="redis", port=6379, db=0, decode_responses=True)
s3 = boto3.client("s3")

# Retain the dedup claim for 24h — long enough to cover a broker-outage
# redelivery storm without permanently blocking a legitimate re-filing.
DEDUP_TTL_SECONDS = 86_400


def build_envelope(doc_uri: str, doc_type: str, shipment_ref: str, importer_eori: str, trace_id: str) -> dict[str, Any]:
    """Immutable receipt envelope. No raw bytes travel through the broker."""
    return {
        "doc_uri": doc_uri,
        "doc_type": doc_type,
        "shipment_ref": shipment_ref,
        "importer_eori": importer_eori,          # EORI/VAT — importer of record
        "trace_id": trace_id,                    # propagated to Redis, S3 meta, DLQ
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


@app.task(bind=True, max_retries=3, default_retry_delay=60, acks_late=True)
def ingest_customs_document(self, envelope: dict[str, Any]) -> dict[str, Any]:
    trace_id = envelope["trace_id"]
    claimed_key: str | None = None  # set once this attempt owns the dedup claim
    try:
        obj = s3.get_object(Bucket="customs-ingestion", Key=envelope["doc_uri"])
        raw_bytes = obj["Body"].read()
        # Hash the BYTES, not the URI: the same invoice is re-uploaded under
        # new keys by different forwarders, and only a content hash catches it.
        payload_hash = hashlib.sha256(raw_bytes).hexdigest()

        # Atomic idempotency claim. If the key already exists, this is a
        # duplicate customs document; skipping prevents a second ACE entry.
        dedup_key = f"dedup:{payload_hash}"
        if not redis_client.set(dedup_key, trace_id, nx=True, ex=DEDUP_TTL_SECONDS):
            logger.warning("Duplicate document skipped: hash=%s trace_id=%s", payload_hash, trace_id)
            return {"status": "skipped", "hash": payload_hash}
        claimed_key = dedup_key

        # Route to the doc-type-specific extraction task. Each downstream task
        # maps fields to the WCO Data Model before classification.
        route = {
            "commercial_invoice": "customs_etl.parse.process_invoice",
            "packing_list": "customs_etl.parse.process_packing_list",
        }.get(envelope["doc_type"])
        if route is None:
            raise ValueError(f"Unroutable doc_type: {envelope['doc_type']!r}")

        app.send_task(route, args=[envelope["doc_uri"], payload_hash, envelope["shipment_ref"], trace_id])
        logger.info("Queued %s for shipment=%s trace_id=%s", envelope["doc_type"], envelope["shipment_ref"], trace_id)
        return {"status": "queued", "hash": payload_hash}

    except Exception as exc:
        logger.error("Ingestion failed uri=%s trace_id=%s: %s", envelope["doc_uri"], trace_id, exc)
        # Release the claim taken by this attempt; otherwise the retry would
        # collide with its own dedup key and be skipped as a duplicate.
        if claimed_key is not None:
            redis_client.delete(claimed_key)
        # Plain exponential backoff here (1s, 2s, 4s). Once max_retries is
        # exhausted Celery re-raises exc; moving the envelope to the DLQ relies
        # on separate wiring that is not shown in this task.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
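
The claim semantics can be exercised without a Redis server. The sketch below uses a hypothetical in-memory stand-in (InMemoryClaims) that mimics only the NX behavior of SET as redis-py exposes it: a truthy result when the key is written, None when NX blocks the write. It ignores TTL expiry and is not a substitute for testing against a real Redis instance.

```python
import hashlib
from typing import Optional


class InMemoryClaims:
    """Hypothetical stand-in for the Redis SET NX claim; TTLs are accepted but ignored."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def set(self, key: str, value: str, nx: bool = False, ex: Optional[int] = None) -> Optional[bool]:
        if nx and key in self._store:
            return None  # mirrors redis-py: None when NX prevents the write
        self._store[key] = value
        return True

    def delete(self, key: str) -> int:
        return 1 if self._store.pop(key, None) is not None else 0


def claim(store: InMemoryClaims, raw_bytes: bytes, trace_id: str) -> bool:
    """Return True if this caller now owns the dedup claim for these bytes."""
    key = "dedup:" + hashlib.sha256(raw_bytes).hexdigest()
    return bool(store.set(key, trace_id, nx=True, ex=86_400))


store = InMemoryClaims()
invoice = b"%PDF-1.7 sample invoice bytes"

first = claim(store, invoice, "trace-a")    # first upload wins the claim
second = claim(store, invoice, "trace-b")   # same bytes, new upload: rejected
other = claim(store, b"different document", "trace-c")  # different bytes: accepted
```

Only the first caller for a given byte sequence proceeds to routing; deleting the key reopens the gate for that content.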

Transient broker outages, malformed payloads, and downstream rate limits need deterministic recovery. Retries use exponential backoff with jitter to avoid a thundering herd on broker recovery; payloads that exhaust the retry budget land in a dead-letter queue with the original envelope preserved for forensic review — the same discipline detailed in Error Handling & Retry Logic and its companion page on designing exponential backoff for failed parsing jobs.

import random
import logging

logger = logging.getLogger("customs_etl.retry")


def calculate_backoff(retry_count: int, base_delay: int = 60, max_delay: int = 900) -> int:
    """Truncated exponential backoff with jitter to de-synchronize retries."""
    delay = min(base_delay * (2 ** retry_count), max_delay)
    jitter = random.uniform(0, delay * 0.25)
    return int(delay + jitter)


def handle_consumer_error(payload: dict, exc: Exception, retry_count: int) -> int | None:
    """Return the next delay in seconds, or None to route the payload to the DLQ."""
    logger.error("Consumer error shipment=%s: %s", payload.get("shipment_ref"), exc)
    if retry_count >= 5:
        logger.critical("Routing to DLQ after %d retries: %s", retry_count, payload["doc_uri"])
        return None
    delay = calculate_backoff(retry_count)
    logger.info("Scheduling retry in %ds (attempt %d)", delay, retry_count + 1)
    return delay
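
To make the schedule concrete, the sketch below computes the lower and upper bound of each delay under the same defaults (base 60 s, cap 900 s, up to 25% jitter). backoff_bounds is a hypothetical helper written for this illustration, not part of the retry module. Because the jitter is added after truncation, the upper bound can exceed max_delay.

```python
def backoff_bounds(retry_count: int, base_delay: int = 60, max_delay: int = 900) -> tuple[int, int]:
    """Smallest and largest delay the jittered backoff can return for one attempt."""
    capped = min(base_delay * (2 ** retry_count), max_delay)
    # Jitter is drawn from [0, 25% of the capped delay] and added on top.
    return capped, int(capped * 1.25)


schedule = [backoff_bounds(n) for n in range(5)]
# schedule == [(60, 75), (120, 150), (240, 300), (480, 600), (900, 1125)]
```

If a hard ceiling matters, for example to stay inside a clearance window, one option is to apply the cap after adding the jitter rather than before.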

Regulatory updates, broker maintenance, or a burst of malformed filings sometimes require an immediate, clean pause rather than blind retrying. A circuit breaker watches the failure ratio over a rolling window; when failures cross 15% inside a 5-minute window, it opens and consumers stop claiming new work — messages simply stay durably buffered in the broker. This mirrors the emergency pause and circuit breaker logic used elsewhere in the pipeline and keeps corrupted extractions out of the classification stage.

import logging
import redis

logger = logging.getLogger("customs_etl.breaker")


class CircuitBreaker:
    """
    Rolling-window failure-ratio breaker. Both successes and failures
    increment `total_processed`; only failures increment `failures`. The
    ratio is evaluated on every recorded result. Both counters carry the
    same TTL, so once the keys expire the next event starts a fresh window.
    """

    FAILURE_KEY = "circuit_breaker:customs_queue:failures"
    TOTAL_KEY = "circuit_breaker:customs_queue:total_processed"
    OPEN_KEY = "circuit_breaker:customs_queue:open"

    def __init__(self, redis_client: redis.Redis, failure_threshold: float = 0.15, window_seconds: int = 300):
        self.redis = redis_client
        self.threshold = failure_threshold
        self.window = window_seconds

    def is_open(self) -> bool:
        return bool(self.redis.get(self.OPEN_KEY))

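    def _bump(self, key: str) -> None:
        # Set the TTL only when INCR creates the key; refreshing it on every
        # event would keep the window open indefinitely under steady traffic.
        if self.redis.incr(key) == 1:
            self.redis.expire(key, self.window)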

    def record_success(self) -> None:
        self._bump(self.TOTAL_KEY)
        self._evaluate_state()

    def record_failure(self) -> None:
        self._bump(self.FAILURE_KEY)
        self._bump(self.TOTAL_KEY)
        self._evaluate_state()

    def _evaluate_state(self) -> None:
        failures = int(self.redis.get(self.FAILURE_KEY) or 0)
        total = int(self.redis.get(self.TOTAL_KEY) or 0)
        if total > 0 and (failures / total) >= self.threshold:
            # Buffer messages in the broker for a 30-min cooldown, then probe.
            self.redis.set(self.OPEN_KEY, "1", ex=1800)
            logger.warning("Circuit breaker OPEN: pausing customs queue consumption.")

Verification steps

Validate the queue against these checks before it carries production filings. Each one is deterministic and reproducible in staging.

Confirm effective throughput. Compute TPS = Total_Processed_Messages / Elapsed_Seconds and compare against the broker publish rate. A divergence above 20% indicates consumer starvation or blocking I/O — profile the object-store read and the dedup round-trip first.
Measure consumer lag. Run rabbitmqctl list_queues name messages consumers or aws sqs get-queue-attributes --queue-url <URL> --attribute-names ApproximateNumberOfMessages. A queue depth that stays above 2 × your peak TPS (roughly two seconds of peak traffic waiting) calls for horizontal worker additions or payload chunking.
Prove idempotency. Re-enqueue a byte-identical document and confirm the second run returns status="skipped" and does not dispatch a downstream extraction task. Then delete the corresponding dedup:<hash> key and confirm a legitimate re-filing is allowed.
Validate extraction accuracy. Score a statistical sample of 1,000 processed documents against ground-truth manifests: Precision = TP / (TP + FP). Target ≥ 0.98 for HS codes and invoice totals before feeding the classification engine.
Trace retry storms. Query Celery Flower or broker logs for max_retries exhaustion and correlate spikes with external rate limits (CBP ABI, EU TARIC). Add a token-bucket limiter if retries pile up against an upstream 429.
Audit the trace trail. Confirm every message carries a trace_id propagated through Redis, object-storage metadata, and DLQ payloads, and cross-reference it against the customs filing log to satisfy CBP recordkeeping and ISO 27001 controls.
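
The arithmetic behind the throughput and precision checks can be sketched as follows. The function names and the sample figures are illustrative assumptions; only the formulas and the thresholds (20% divergence, 0.98 precision) come from the steps above.

```python
def effective_tps(total_processed: int, elapsed_seconds: float) -> float:
    """TPS = Total_Processed_Messages / Elapsed_Seconds."""
    return total_processed / elapsed_seconds


def divergence(publish_rate: float, consume_rate: float) -> float:
    """Fraction by which consumption trails the broker publish rate."""
    return (publish_rate - consume_rate) / publish_rate


def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


# Illustrative numbers, not measurements.
tps = effective_tps(total_processed=45_000, elapsed_seconds=600)   # 75.0 msg/s
gap = divergence(publish_rate=100.0, consume_rate=tps)             # 0.25
starved = gap > 0.20                                               # True

hs_precision = precision(true_positives=985, false_positives=15)   # 0.985
ready_for_classification = hs_precision >= 0.98                    # True
```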

Edge cases & gotchas

The failure modes below are specific to running an idempotent async queue over customs documents at volume, and most surface only under load or during broker recovery.

Ack-early loses documents on crash. If task_acks_late is off, Celery acknowledges the message before extraction runs; a worker OOM then drops the document with no redelivery. Keep late-ack on and pair it with task_reject_on_worker_lost so a lost worker requeues rather than silently discards.
Dedup TTL vs. legitimate re-filing. The 24-hour dedup window blocks any byte-identical re-submission that arrives inside it, including a legitimate one; a corrected document whose bytes changed hashes differently and passes. Keep the dedup entry keyed on the content hash only, and delete it explicitly when a customs broker authorizes re-filing of the same bytes, rather than lowering the TTL and reopening the double-entry hole.
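Redis eviction silently disarms idempotency. If Redis is configured with allkeys-lru and runs hot, dedup keys are evicted early and duplicates slip through. Use the noeviction policy on the Redis instance that holds the dedup and breaker keys: maxmemory-policy applies to the whole instance, and volatile-ttl would still evict these keys because they carry TTLs. Alarm on the evicted_keys counter reported by INFO stats.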
FIFO ordering must be partitioned. Enforcing global ordering to protect invoice-then-amendment sequencing serializes the entire queue and collapses throughput. Partition the FIFO group by shipment reference or importer of record so ordering holds within a shipment while unrelated shipments still run in parallel.
Breaker counter drift under two Redis nodes. If the failures and total_processed counters land on different shards during a failover, the ratio is computed against mismatched windows and the breaker either flaps or never opens. Pin both keys (and the dedup keys) to one logical DB, and treat a breaker that never trips under an injected fault as a failed verification.
Multi-language payloads corrupt the hash comparison, not the hash. The SHA-256 gate is byte-exact and safe, but re-encoding a UTF-8 invoice to Latin-1 upstream changes the bytes and defeats dedup. Normalize encoding at capture, before hashing — the same discipline enforced in multi-language invoice parsing and OCR drift correction and validation.
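
The encoding gotcha is easy to reproduce. The sketch below assumes a text payload whose source encoding is known at capture (binary formats such as PDF would be hashed as-is); normalized_hash is a hypothetical helper that decodes, applies Unicode NFC normalization, and re-encodes to UTF-8 before hashing.

```python
import hashlib
import unicodedata


def normalized_hash(raw: bytes, source_encoding: str) -> str:
    """Hash a canonical UTF-8/NFC form so re-encoded copies collapse to one key."""
    text = raw.decode(source_encoding)
    canonical = unicodedata.normalize("NFC", text).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Sample line item with accented characters (escaped to keep the source ASCII).
line_item = "Soci\u00e9t\u00e9 G\u00e9n\u00e9rale, caf\u00e9 moulu, 12 kg"
as_utf8 = line_item.encode("utf-8")
as_latin1 = line_item.encode("latin-1")

# Same text, different bytes: the raw hashes disagree.
raw_match = hashlib.sha256(as_utf8).digest() == hashlib.sha256(as_latin1).digest()
# After normalization both copies produce the same dedup key.
normalized_match = normalized_hash(as_utf8, "utf-8") == normalized_hash(as_latin1, "latin-1")
```

Here raw_match is False and normalized_match is True, which is why the normalization has to happen before the hash is taken, not after.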

Related

Async Batch Processing for High Volume: the parent workflow defining broker-mediated batch ingestion and envelope design.
Error Handling & Retry Logic: DLQ routing, retry budgets, and circuit-breaker standards this queue inherits.
Designing Exponential Backoff for Failed Parsing Jobs: the backoff-with-jitter curve used by the retry helper above.
Commercial Invoice PDF Extraction: the extraction task this queue routes invoice documents into.
Packing List Data Normalization: the normalization task this queue routes packing lists into.

Authoritative references: WCO Data Model 3.x, CBP ACE / ABI submission formats, EU ICS2 filing requirements, ISO 4217 currency codes, ISO 27001 (audit-trail controls).
