Why truncate the backoff curve with a max delay?

Uncapped exponential growth pushes late attempts into minutes-long sleeps that blow past the clearance SLA and leave documents sitting well beyond the point where a broker should have been paged. Truncating at a cap — 300 seconds is typical for customs ingestion — keeps the worst-case wait bounded and predictable, so cumulative delay across the whole attempt budget stays inside the filing window.

Should I use full jitter or equal jitter for parsing retries?

Full jitter — drawing the delay uniformly between zero and the exponential ceiling — spreads a co-failed batch most evenly across the recovering OCR or classification service and is the safest default for high-volume customs windows. Equal jitter keeps half the ceiling as a fixed floor, which preserves more ordering but reconcentrates load; for parsing jobs where any attempt can succeed independently, full jitter's flatter load profile is worth the lost ordering.
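The difference between the two strategies is easiest to see side by side. The following is a minimal sketch, assuming the 2-second base and 300-second cap used elsewhere on this page; the helper names (ceiling, full_jitter, equal_jitter) are illustrative and not part of any library.

```python
import random

BASE_DELAY = 2.0   # seconds (assumed, matching the bounds on this page)
MAX_DELAY = 300.0  # seconds (assumed truncation cap)


def ceiling(attempt: int) -> float:
    """Truncated exponential ceiling for a 1-indexed attempt number."""
    return min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)


def full_jitter(attempt: int, rng: random.Random) -> float:
    """Delay drawn uniformly from [0, ceiling]."""
    return rng.uniform(0.0, ceiling(attempt))


def equal_jitter(attempt: int, rng: random.Random) -> float:
    """Half the ceiling as a fixed floor, plus a uniform draw over the rest."""
    half = ceiling(attempt) / 2
    return half + rng.uniform(0.0, half)


if __name__ == "__main__":
    rng = random.Random(7)  # seeded only to make the comparison repeatable
    for name, fn in (("full", full_jitter), ("equal", equal_jitter)):
        delays = [fn(4, rng) for _ in range(1000)]
        print(f"{name:>5}: min={min(delays):.2f}s max={max(delays):.2f}s")
```

For attempt 4 the ceiling is 16 seconds, so full jitter spreads retries over the whole 0 to 16 second range while equal jitter confines them to 8 to 16 seconds, which is the reconcentration described above.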

Why must retry state be persisted rather than held in memory?

Container orchestration can evict or restart a worker mid-backoff during a rolling deploy or node drain. If the attempt counter lives only in process memory, the restarted worker either loses the document or restarts the curve from attempt one, which can re-dispatch a job that was already close to succeeding and risks a duplicate customs entry. Persisting attempt_number and the next-earliest timestamp keyed on document_id lets a fresh worker resume exactly where the curve left off.

Designing exponential backoff for failed parsing jobs

A customs document-ingestion pipeline that retries failed parsing jobs on a fixed interval will, during peak filing windows, turn a single throttled OCR endpoint into a synchronized retry storm — every job that failed together retries together, re-saturating the recovering service and cascading 429s across every worker. This page answers one narrow implementation question inside the Error Handling & Retry Logic workflow: how do you shape the retry delay for a failed parsing job so recovery is fast when the service is healthy, gentle when it is degraded, bounded so no document breaches its clearance SLA, and restart-safe so an evicted worker never double-files an entry? The answer is truncated exponential backoff with full jitter, persisted attempt state, and a hard attempt cap that routes exhausted documents to a dead-letter queue for broker review.

Backoff here is not just an engineering nicety. Every retry, every sleep interval, and every eventual escalation has to be reconstructable for a CBP audit, so the delay curve is designed alongside the audit trail, not bolted on after.

Prerequisites

This page assumes the failure taxonomy and immutable payload contract established upstream in the Error Handling & Retry Logic cluster: only exceptions already classified as transient reach the backoff loop, and each job carries a stable document_id. Before applying the code below, confirm:

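Python 3.10+ and tenacity 8.2+. The state hook below relies on RetryCallState.next_action and on AsyncRetrying awaiting a coroutine before_sleep callback; not every tenacity release awaits coroutine hooks, so confirm the behaviour against the version you pin.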
A persistence layer for attempt state. The examples target PostgreSQL 14+ via asyncpg 0.29+, matching the Async Batch Processing for High Volume worker pool. A small retry_state table keyed on document_id is enough.
A distinct transient exception raised by the parsing microservice on throttling, timeouts, and TLS or DNS failures — so backoff never fires on a permanent defect such as an encrypted file or a document missing mandatory customs fields, which belong in Commercial Invoice PDF Extraction validation and are escalated, not retried.
Calibrated bounds. For trade-document ingestion, base_delay = 2s, max_delay = 300s, and max_attempts = 5 keep the cumulative backoff wait comfortably inside a 90-second clearance window while respecting third-party API rate limits.
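The calibrated bounds can be checked with a few lines of arithmetic. This is a sketch that assumes tenacity consults the wait strategy only between attempts, so five attempts produce four sleeps; the helper names are illustrative.

```python
BASE_DELAY = 2.0    # seconds
MAX_DELAY = 300.0   # seconds
MAX_ATTEMPTS = 5


def ceiling(attempt: int) -> float:
    """Worst-case sleep after the given 1-indexed failed attempt."""
    return min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)


def sleep_ceilings(max_attempts: int = MAX_ATTEMPTS) -> list[float]:
    """Ceilings for every sleep in the budget.

    The final attempt is followed by dead-letter routing rather than a
    sleep, so there are max_attempts - 1 entries.
    """
    return [ceiling(n) for n in range(1, max_attempts)]


def first_capped_attempt() -> int:
    """First attempt number whose uncapped ceiling reaches MAX_DELAY."""
    n = 1
    while BASE_DELAY * 2 ** (n - 1) < MAX_DELAY:
        n += 1
    return n


if __name__ == "__main__":
    ceilings = sleep_ceilings()
    print("per-sleep ceilings:", ceilings)
    print("worst-case total sleep:", sum(ceilings))
    print("cap first binds at attempt:", first_capped_attempt())
```

With these bounds the worst-case sleep budget is 2 + 4 + 8 + 16 = 30 seconds, and the 300-second cap would first bind at attempt 9, so it only comes into play if max_attempts is raised. The figure covers sleeps only; time spent inside each extraction call is on top of it.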

Implementation

The core is a single delay function and a tenacity retry policy that only fires on the transient exception, persists its attempt state before each sleep, and routes to the dead-letter queue when the cap is reached. Full jitter is expressed directly rather than through tenacity’s plain wait_exponential, because the customs workload needs the delay drawn uniformly from [0, ceiling] — not a fixed exponential value — to flatten a co-failed batch across the recovering service.

from __future__ import annotations

import logging
import random
from typing import Any

import asyncpg
from tenacity import (
    AsyncRetrying,
    RetryCallState,
    retry_if_exception_type,
    stop_after_attempt,
)
from tenacity.wait import wait_base  # base class is defined in tenacity.wait

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("customs_etl.backoff")

BASE_DELAY = 2.0      # seconds; first-attempt ceiling
MAX_DELAY = 300.0     # seconds; truncation cap keeps worst case within SLA
MAX_ATTEMPTS = 5      # then route to the dead-letter queue


class TransientParsingError(Exception):
    """Recoverable OCR/classification failure: 429, timeout, TLS/DNS error."""


class wait_full_jitter(wait_base):
    """Truncated exponential backoff with full jitter (AWS 'full jitter').

    ceiling = min(BASE_DELAY * 2**(attempt-1), MAX_DELAY)
    delay   = uniform(0, ceiling)

    Drawing from [0, ceiling] — rather than returning the ceiling itself —
    spreads a batch that failed together evenly across the recovering
    service, which is what prevents a second 429 storm during a peak
    filing window.
    """

    def __call__(self, retry_state: RetryCallState) -> float:
        attempt = retry_state.attempt_number  # 1-indexed
        ceiling = min(BASE_DELAY * (2 ** (attempt - 1)), MAX_DELAY)
        return random.uniform(0.0, ceiling)


async def _extract(doc_id: str, payload: bytes) -> dict[str, Any]:
    """Placeholder for the pipeline's own OCR/classification call, which is
    not shown on this page. It should raise TransientParsingError on
    throttling, timeouts, and TLS/DNS failures."""
    raise NotImplementedError("wire this to the parsing microservice client")


async def _persist_attempt(pool: asyncpg.Pool, doc_id: str,
                           retry_state: RetryCallState) -> None:
    """Write the attempt counter and the next-earliest retry time BEFORE the
    sleep, so an evicted worker resumes the curve instead of restarting it
    (and never re-dispatches a job that would double-file a customs entry)."""
    delay = retry_state.next_action.sleep if retry_state.next_action else 0.0
    await pool.execute(
        """
        INSERT INTO retry_state (document_id, attempt_number, next_earliest_at)
        VALUES ($1, $2, now() + ($3 || ' seconds')::interval)
        ON CONFLICT (document_id) DO UPDATE
          SET attempt_number   = EXCLUDED.attempt_number,
              next_earliest_at  = EXCLUDED.next_earliest_at
        """,
        doc_id, retry_state.attempt_number, str(delay),
    )
    logger.warning(
        "backoff doc_id=%s attempt=%d sleeping=%.2fs",
        doc_id, retry_state.attempt_number, delay,
    )


async def parse_with_backoff(pool: asyncpg.Pool, doc_id: str,
                             payload: bytes) -> dict[str, Any]:
    """Run one parsing job under truncated full-jitter backoff.

    Only TransientParsingError is retried; a permanent defect propagates
    on the first raise and is escalated to broker review by the caller.
    Exhausting MAX_ATTEMPTS re-raises for dead-letter routing.
    """
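    async def _before_sleep(rs: RetryCallState) -> None:
        # A coroutine function, not a lambda that returns a coroutine, so
        # AsyncRetrying can recognise it as async and await the write
        # before it sleeps.
        await _persist_attempt(pool, doc_id, rs)

    async for attempt in AsyncRetrying(
        stop=stop_after_attempt(MAX_ATTEMPTS),
        wait=wait_full_jitter(),
        retry=retry_if_exception_type(TransientParsingError),
        before_sleep=_before_sleep,
        reraise=True,
    ):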
        with attempt:
            result = await _extract(doc_id, payload)   # OCR/classification call
            await pool.execute(
                "DELETE FROM retry_state WHERE document_id = $1", doc_id,
            )
            logger.info("parsed doc_id=%s attempts=%d",
                        doc_id, attempt.retry_state.attempt_number)
            return result
    raise RuntimeError("unreachable")  # reraise=True surfaces the last error

The before_sleep hook is the restart-safety hinge. Because attempt state is committed to the Packing List Data Normalization sequence’s shared retry_state table before the process ever sleeps, a rolling deploy that evicts the pod mid-backoff leaves a row that a fresh worker can read for attempt_number and next_earliest_at, then continue the curve instead of resetting to attempt one and re-dispatching a job that may already be one retry from success. The listing covers only the write side: the resuming worker still has to load that row before its first attempt, because a new AsyncRetrying loop always starts counting at attempt one.
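The read side of that contract might look like the following. It is a sketch under stated assumptions: the persisted attempt_number is taken to be the number of attempts already executed, the caller supplies the database's now() alongside the row, and ResumePlan and plan_resume are hypothetical names rather than part of the listing above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_ATTEMPTS = 5


@dataclass(frozen=True)
class ResumePlan:
    attempts_used: int    # failed attempts already recorded in retry_state
    attempts_left: int    # executions still allowed under MAX_ATTEMPTS
    initial_wait: float   # seconds to wait before the next execution


def plan_resume(attempt_number: Optional[int],
                next_earliest_at: Optional[datetime],
                now: datetime) -> ResumePlan:
    """Turn a retry_state row (or its absence) into a resume plan.

    `now` should come from the same clock that wrote next_earliest_at,
    which on this page is the database's now().
    """
    if attempt_number is None or next_earliest_at is None:
        # No row: the document has never entered the backoff loop.
        return ResumePlan(0, MAX_ATTEMPTS, 0.0)
    remaining = (next_earliest_at - now).total_seconds()
    return ResumePlan(
        attempts_used=attempt_number,
        attempts_left=max(MAX_ATTEMPTS - attempt_number, 0),
        initial_wait=max(remaining, 0.0),  # already due if in the past
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(plan_resume(2, now + timedelta(seconds=3), now))
```

A resuming worker would sleep for initial_wait, cap its loop at attempts_left, and offset the exponent in the wait strategy by attempts_used so the ceilings continue from where the evicted worker stopped. How that offset is threaded into wait_full_jitter is left open here.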

Verification steps

Validate the curve in staging before it touches a live filing window. Force a retry storm by making _extract raise TransientParsingError on a fixed fraction of calls, then confirm each item:

Only transient errors enter the loop. Inject one permanent defect (encrypted PDF) and one throttle (HTTP 429). Assert the permanent defect propagates on attempt 1 with zero sleeps logged, while the throttle retries.
Every delay stays inside its jitter band. Log all sleep intervals and assert each delay falls within [0, min(2 * 2**(n-1), 300)], where n is the attempt that just failed: the sleep after attempt 3 is between 0s and 8s, and the sleep after attempt 4 is between 0s and 16s. With MAX_ATTEMPTS = 5 there is no sleep after attempt 5, because exhaustion re-raises instead. Any value above the ceiling means jitter is misconfigured.
Cumulative wait respects the SLA. Sum the worst-case ceilings of the four sleeps (2 + 4 + 8 + 16 = 30s) and confirm the total still clears the 90-second clearance window with margin once the time spent in the extraction calls themselves is added. This is the budget shared with Multi-language Invoice Parsing during peak trade seasons.
Restart resumes, not restarts. Kill the worker after attempt 2 commits, restart it, and assert attempt_number reads 2 from retry_state and the next sleep honours next_earliest_at — no reset to attempt 1, no second customs entry.
Exhaustion routes cleanly. Drive one document past MAX_ATTEMPTS and confirm it re-raises for dead-letter routing carrying its full attempt history, and that its retry_state row is retained for the OCR Drift Correction & Validation audit rather than silently dropped.

Enable DEBUG on the retry logger to capture attempt_number, idle_for, and outcome per call, and correlate doc_id with the persisted state for CBP recordkeeping. Emit the sleep intervals as Prometheus histograms so drift in retry frequency surfaces as an early warning of upstream OCR degradation.
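The jitter-band check can be automated against captured log output. The sketch below assumes the warning format emitted by _persist_attempt ("backoff doc_id=... attempt=... sleeping=...s") and the 2-second base and 300-second cap; band_violations is an illustrative helper, not an existing tool.

```python
import re

BASE_DELAY = 2.0
MAX_DELAY = 300.0

# Matches the message part of the backoff warning, wherever it sits in the line.
LINE = re.compile(r"backoff doc_id=(\S+) attempt=(\d+) sleeping=(-?[0-9.]+)s")


def band_violations(log_lines):
    """Return (doc_id, attempt, delay, ceiling) for each out-of-band sleep."""
    bad = []
    for line in log_lines:
        match = LINE.search(line)
        if match is None:
            continue  # not a backoff record
        doc_id = match.group(1)
        attempt = int(match.group(2))
        delay = float(match.group(3))
        ceiling = min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)
        if not 0.0 <= delay <= ceiling:
            bad.append((doc_id, attempt, delay, ceiling))
    return bad


if __name__ == "__main__":
    sample = [
        "... | WARNING | customs_etl.backoff | backoff doc_id=INV-1 attempt=1 sleeping=1.25s",
        "... | WARNING | customs_etl.backoff | backoff doc_id=INV-2 attempt=2 sleeping=4.50s",
    ]
    for violation in band_violations(sample):
        print("out of band:", violation)
```

In a staging run the function would be fed the captured log file and the test would assert that it returns an empty list.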

Edge cases & gotchas

stop_after_attempt counts attempts, not retries. stop_after_attempt(5) allows the initial call plus four retries — five executions total, not six. Off-by-one here silently over- or under-runs the SLA budget; assert it in a test.
Full jitter can return a near-zero first delay. Because attempt 1’s ceiling is BASE_DELAY and jitter draws from [0, ceiling], a retry can fire almost immediately. That is intended load-spreading, but if the upstream 429 carries a Retry-After header, honour it as a floor — clamp max(retry_after, jitter_delay) — or you will hammer a service that explicitly asked you to wait.
Persisting state after the sleep loses the restart guarantee. The write must land in before_sleep, before the await. Moving it into the retry body means an eviction during the sleep leaves stale or missing state and reopens the double-filing window.
asyncpg interval coercion. Passing the delay as a float into an interval cast can raise type errors across asyncpg versions; bind it as text and build the interval in SQL with ($3 || ' seconds')::interval (as above), or bind a datetime.timedelta to an interval-typed parameter, and never interpolate the value into the SQL string.
A tripped circuit breaker must short-circuit backoff, not stack with it. When systemic degradation opens the breaker described in the Error Handling & Retry Logic cluster, in-flight jobs should stop scheduling new sleeps and drain — otherwise documents keep accumulating multi-minute backoff waits behind a breaker that has already halted ingestion, quietly breaching the SLA while appearing “still retrying.”
Clock source for next_earliest_at. Use the database now() (as shown) rather than each worker’s wall clock; skewed pod clocks otherwise let a resumed worker retry early or late, corrupting the audit reconstruction of when each attempt actually ran.
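The Retry-After floor from the jitter gotcha above can be sketched as a small clamp. This assumes the header arrives in its delay-seconds form (the HTTP-date form is ignored here) and that the caller extracts it from the 429 response; both function names are hypothetical.

```python
import random
from typing import Optional

BASE_DELAY = 2.0
MAX_DELAY = 300.0


def parse_retry_after_seconds(value: Optional[str]) -> Optional[float]:
    """Return the delay-seconds form of Retry-After, or None if absent
    or given as an HTTP-date (not handled in this sketch)."""
    if value is None:
        return None
    value = value.strip()
    return float(value) if value.isdigit() else None


def delay_with_retry_after(attempt: int,
                           retry_after: Optional[float],
                           rng=random) -> float:
    """Full-jitter delay, floored at the server's Retry-After when present."""
    ceiling = min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)
    jittered = rng.uniform(0.0, ceiling)
    if retry_after is None:
        return jittered
    return max(float(retry_after), jittered)


if __name__ == "__main__":
    floor = parse_retry_after_seconds("30")
    print(delay_with_retry_after(1, floor))
```

Whether a Retry-After larger than max_delay should be honoured in full or sent straight to escalation is a policy decision this sketch leaves open.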

Related

Error Handling & Retry Logic: the failure taxonomy, idempotency contract, and circuit-breaker design this page builds on.
Async Batch Processing for High Volume: the worker pool and queue depth targets the backoff bounds are calibrated against.
Commercial Invoice PDF Extraction: where permanent parsing defects are detected and kept out of the retry loop.
Packing List Data Normalization: the downstream stage whose sequence integrity depends on restart-safe retries.
OCR Drift Correction & Validation: the validation stage that consumes exhausted, dead-lettered documents.

Authoritative references: tenacity documentation, CBP Automated Commercial Environment (ACE).
