Building fallback logic for ambiguous tariff classifications

Automated customs classification pipelines inevitably hit product descriptions that resist deterministic mapping to Harmonized Tariff Schedule (HTSUS) codes: a min_confidence gate returns nothing, or two GRI-plausible chapter headings tie. The exact failure this page solves is the ambiguous case — not a missing code, but a description that maps to more than one defensible subheading — and how to route it deterministically without either halting the batch or silently picking a wrong code. Both outcomes violate reasonable-care recordkeeping under 19 CFR 141.89 and expose the importer to liquidation penalties. This is a specific stage inside Fallback Routing for Unmapped Codes: the tie-break and quarantine tier that runs after a straight lookup fails but before a broker is paged.

Prerequisites

Before applying the router below, confirm the surrounding pipeline state:

Python 3.10+. The code uses dataclasses and standard typing annotations. Install pandas>=2.1 and rapidfuzz>=3.6 (the stdlib substitute shown inline is for illustration only; rapidfuzz.fuzz.token_sort_ratio is what you run in production).
A validated active tariff snapshot. Ambiguity resolution assumes the declared code already failed exact validation. If your snapshot is stale or malformed, fix ingestion first — see Handling Missing HTS Codes in ETL Pipelines for the upstream lookup and bitemporal snapshot contract this page depends on.
A historical classification log keyed by cleaned product description → accepted 10-digit HTSUS code, sourced from prior liquidated entries (never from unreviewed fallbacks).
A weighted rule table encoding material composition, manufacturing process, and declared end-use, with weights calibrated against WCO General Rules of Interpretation (GRI) 1–3.
A quarantine table with write access restricted to compliance staff via the Security Boundary & Data Isolation layer, so exception records cannot be mutated by ingestion workers.
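
As a minimal sketch of how the historical classification log might be turned into the lookup the router consumes, the helper below keeps only well-formed 10-digit codes and sets aside any description that prior entries mapped to more than one code. The function name build_historical_map, the (description, code) row shape, and the digits-only code normalization are assumptions made for illustration, not part of the pipeline described on this page.

```python
import re
from typing import Dict, Iterable, Set, Tuple

# Assumption: codes may arrive with or without dots; compare on digits only.
_HTS10 = re.compile(r"\d{10}")


def build_historical_map(
    rows: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, str], Set[str]]:
    """Return (clean_map, conflicted_descriptions) from (description, code) rows."""
    seen: Dict[str, Set[str]] = {}
    for description, code in rows:
        digits = code.replace(".", "").strip()
        if not _HTS10.fullmatch(digits):
            continue  # skip anything that is not a full 10-digit code
        key = " ".join(description.lower().split())  # collapse case and whitespace
        seen.setdefault(key, set()).add(digits)

    # Keep a description only when every prior entry agrees on one code.
    clean = {d: next(iter(codes)) for d, codes in seen.items() if len(codes) == 1}
    conflicted = {d for d, codes in seen.items() if len(codes) > 1}
    return clean, conflicted
```

Descriptions returned in the conflicted set are candidates for compliance review rather than Tier 1 seeds, since they would otherwise surface later as tie-breaks.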

The router resolves only on an unambiguous winner: a single Tier 1 code at or above min_confidence (0.75) with no runner-up inside the 0.05 tie margin, or a Tier 2 matrix score at or above 0.75. A two-code tie inside the margin, a below-gate fall-through with no Tier 2 resolution, or a bare exception all route to the Tier 3 quarantine queue, which never emits a rate.
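
The routing rule can be sketched as a small pure function. This is an illustrative reduction, not the engine itself: the function name routing_path and its arguments are hypothetical, it assumes the Tier 1 candidates arrive sorted by descending score, and it reuses the routing labels and default thresholds from the implementation that follows.

```python
from typing import List, Optional, Tuple


def routing_path(
    tier1_ranked: List[Tuple[str, float]],
    tier2_score: Optional[float],
    min_confidence: float = 0.75,
    tie_margin: float = 0.05,
) -> str:
    """Return the routing label for one record; only two labels emit a code."""
    if tier1_ranked and tier1_ranked[0][1] >= min_confidence:
        runner_up = tier1_ranked[1][1] if len(tier1_ranked) > 1 else None
        if runner_up is not None and tier1_ranked[0][1] - runner_up < tie_margin:
            return "TIER3_TIE_BREAK"  # near-tie: quarantine, no rate
        return "TIER1_FUZZY"
    if tier2_score is not None and tier2_score >= min_confidence:
        return "TIER2_MATRIX"
    return "TIER3_EXCEPTION"  # below-gate fall-through: quarantine, no rate


# Two candidates 0.02 apart sit inside the 0.05 margin, so the record is held.
print(routing_path([("6204.62", 0.82), ("6204.63", 0.80)], None))
```

Keeping the decision free of I/O makes it straightforward to unit-test each branch before wiring it into a batch run.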

Implementation

The router evaluates tiers sequentially and, critically, treats a tie between two candidate codes as an ambiguity signal that forces quarantine rather than an arbitrary pick. Confidence gating, structured logging for the audit trail, and chunked ingestion keep it production-safe under high-volume batch windows.

import logging
import pandas as pd
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, field
from pathlib import Path

# Configure structured logging for audit compliance
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    handlers=[logging.StreamHandler()],
)
logger = logging.getLogger("hts_fallback_router")


@dataclass
class ClassificationResult:
    hts_code: str
    confidence: float
    routing_path: str
    requires_review: bool
    metadata: Dict[str, str] = field(default_factory=dict)


class FallbackClassificationEngine:
    """Deterministic tie-break + quarantine router for ambiguous HTSUS mappings."""

    def __init__(
        self,
        fuzzy_threshold: float = 0.15,
        min_confidence: float = 0.75,
        tie_margin: float = 0.05,
        chunk_size: int = 50_000,
    ) -> None:
        self.fuzzy_threshold = fuzzy_threshold
        self.min_confidence = min_confidence
        # If the two best candidates are within tie_margin, treat as ambiguous.
        self.tie_margin = tie_margin
        self.chunk_size = chunk_size
        logger.info(
            "Engine ready: min_confidence=%.2f, tie_margin=%.2f, chunk_size=%d",
            self.min_confidence, self.tie_margin, self.chunk_size,
        )

    def _rank_fuzzy_matches(
        self, description: str, historical_map: Dict[str, str]
    ) -> List[Tuple[str, float]]:
        """Tier 1: rank historical codes by similarity (use rapidfuzz in prod)."""
        scored: Dict[str, float] = {}
        for hist_desc, hts_code in historical_map.items():
            a, b = set(description.lower()), set(hist_desc.lower())
            # Jaccard distance over character sets: 0.0 when the sets are
            # identical, so an exact repeat of a description always matches.
            distance = 1 - (len(a & b) / max(len(a | b), 1))
            if distance <= self.fuzzy_threshold:
                confidence = 1.0 - distance
                # Keep the best score seen per distinct HTS code.
                scored[hts_code] = max(scored.get(hts_code, 0.0), confidence)
        return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

    def _evaluate_weighted_matrix(
        self, sku_meta: Dict[str, str], rule_weights: Dict[str, float]
    ) -> Optional[Tuple[str, float]]:
        """Tier 2: GRI-anchored material/end-use weighted scoring."""
        score = sum(
            weight for attr, weight in rule_weights.items()
            if sku_meta.get(attr)
        )
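        code = sku_meta.get("default_hts_candidate", "")
        # Never emit a placeholder code: without a candidate, fall through
        # to Tier 3 quarantine instead of returning "UNKNOWN" as resolved.
        if score >= self.min_confidence and code:
            return code, min(score, 1.0)
        return None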

    def _classify(
        self,
        description: str,
        sku_meta: Dict[str, str],
        historical_map: Dict[str, str],
        rule_weights: Dict[str, float],
        sku_id: str,
    ) -> ClassificationResult:
        ranked = self._rank_fuzzy_matches(description, historical_map)
        if ranked and ranked[0][1] >= self.min_confidence:
            # Ambiguity guard: two near-equal top codes must not be auto-picked.
            if len(ranked) > 1 and (ranked[0][1] - ranked[1][1]) < self.tie_margin:
                logger.warning(
                    "Ambiguous Tier 1 match for %s: %s vs %s within tie_margin",
                    sku_id, ranked[0][0], ranked[1][0],
                )
                return ClassificationResult(
                    "QUARANTINE", 0.0, "TIER3_TIE_BREAK", True,
                    {"sku_id": sku_id, "candidates": f"{ranked[0][0]}|{ranked[1][0]}"},
                )
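            # Carry sku_id on resolved records too, so the audit trail can
            # tie every emitted code back to its source SKU.
            return ClassificationResult(
                ranked[0][0], ranked[0][1], "TIER1_FUZZY", False,
                {"sku_id": sku_id},
            )

        tier2 = self._evaluate_weighted_matrix(sku_meta, rule_weights)
        if tier2:
            return ClassificationResult(
                tier2[0], tier2[1], "TIER2_MATRIX", False,
                {"sku_id": sku_id},
            )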

        return ClassificationResult(
            "QUARANTINE", 0.0, "TIER3_EXCEPTION", True,
            {"sku_id": sku_id, "description": description},
        )

    def process_chunk(
        self,
        chunk: pd.DataFrame,
        historical_map: Dict[str, str],
        rule_weights: Dict[str, float],
    ) -> pd.DataFrame:
        """Execute fallback routing on a memory-bounded DataFrame chunk."""
        results: List[ClassificationResult] = []
        for _, row in chunk.iterrows():
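            raw_desc = row.get("product_description", "")
            # Empty CSV cells load as NaN; str(NaN) would yield the text "nan".
            desc = "" if pd.isna(raw_desc) else str(raw_desc)
            # Drop missing attributes so Tier 2 does not score them as present.
            meta = {
                k: str(v) for k, v in row.items()
                if k not in ("product_description", "sku_id") and pd.notna(v)
            }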
            results.append(
                self._classify(
                    desc, meta, historical_map, rule_weights,
                    str(row.get("sku_id", "")),
                )
            )
        return pd.DataFrame([r.__dict__ for r in results])

    def run_pipeline(
        self,
        input_path: Path,
        output_path: Path,
        historical_map: Dict[str, str],
        rule_weights: Dict[str, float],
    ) -> None:
        """Chunked execution with strict memory bounds.

        Note: pandas' chunked CSV iterator (`chunksize`) is not compatible with
        `engine='pyarrow'`; use the default C engine for streamed ingestion.
        """
        logger.info("Starting chunked fallback processing: %s", input_path)
        first_chunk = True
        for chunk in pd.read_csv(input_path, chunksize=self.chunk_size):
            processed = self.process_chunk(chunk, historical_map, rule_weights)
            processed.to_csv(
                output_path,
                mode="w" if first_chunk else "a",
                header=first_chunk,
                index=False,
            )
            logger.info("Wrote %d records to %s", len(processed), output_path)
            first_chunk = False
        logger.info("Pipeline completed.")

The tie-break guard is the load-bearing change: a description that scores 0.82 against subheading 6204.62 and 0.80 against 6204.63 is ambiguous, not resolved. Auto-selecting the marginal winner would violate GRI 3(a)'s specific-over-general rule and corrupt the downstream duty rate. Quarantined records carry no rate and flow to broker review before any Duty Formula Calculation Frameworks run against them.

Verification steps

Confidence threshold calibration. Run a 10,000-record sample against historical CBP rulings. Raise min_confidence until the false-positive rate drops below 0.5%. Log every record scoring 0.60–0.75 for manual GRI cross-referencing.
Tie-margin audit. Count TIER3_TIE_BREAK versus TIER3_EXCEPTION rows. If tie-breaks exceed ~2% of volume, your historical map likely contains near-identical descriptions mapped to different codes; reconcile them before trusting Tier 1.
GRI alignment. Map Tier 2 matrix outputs against WCO HS 2022 Explanatory Notes; confirm material-composition weights prioritize GRI 1 over GRI 3(c) when headings conflict.
Duty-impact isolation. Route quarantined outputs through a staging environment and confirm every QUARANTINE flag bypasses automated rate application and pauses Rule of Origin Logic Engines until a broker assigns a definitive code.
Memory and throughput profiling. Monitor peak RSS per worker under chunk_size=50_000. Keep it below 2 GB; if it overshoots that budget by more than 15%, lower chunk_size and coerce string columns with explicit dtype.
Audit-trail completeness. Confirm every routed record persists routing_path, confidence, and requires_review to immutable storage — CBP audits require unbroken lineage from SKU ingestion to final declaration.
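
The first two checks above can be approximated with the standard library. The sketch below is one possible reading, under stated assumptions: the function names are hypothetical, the labelled sample is assumed to be a list of (confidence, matches_ruling) pairs, and the false-positive rate is taken to mean the share of auto-resolved records whose code disagrees with the ruling.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def tie_break_share(routing_paths: Iterable[str]) -> float:
    """Fraction of all routed records labelled TIER3_TIE_BREAK."""
    counts = Counter(routing_paths)
    total = sum(counts.values())
    return counts["TIER3_TIE_BREAK"] / total if total else 0.0


def calibrate_min_confidence(
    scored: List[Tuple[float, bool]],
    max_false_positive_rate: float = 0.005,
    start: float = 0.75,
    step: float = 0.01,
) -> Optional[float]:
    """Lowest threshold >= start whose auto-resolved records meet the FP cap.

    Returns None when no threshold up to 1.0 satisfies the cap.
    """
    steps = int(round((1.0 - start) / step))
    for i in range(steps + 1):
        threshold = round(start + i * step, 4)
        accepted = [ok for conf, ok in scored if conf >= threshold]
        if not accepted:
            return None  # nothing would auto-resolve at this gate or above
        false_positive_rate = accepted.count(False) / len(accepted)
        if false_positive_rate < max_false_positive_rate:
            return threshold
    return None
```

A tie_break_share result above roughly 0.02 corresponds to the 2% trigger in the tie-margin audit; a None from calibrate_min_confidence suggests the similarity measure, not the gate, needs attention.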

Edge cases & gotchas

Set-overlap similarity is not real fuzzy matching. The inline _rank_fuzzy_matches uses character-set overlap so the example runs without dependencies; it ignores word order and will mis-rank multi-word descriptors. Swap in rapidfuzz.fuzz.token_sort_ratio before production and re-calibrate fuzzy_threshold. Note that token_sort_ratio returns scores on a 0–100 scale, so divide by 100 before comparing against min_confidence.
Multi-language encoding corruption. Supplier descriptions arriving as Latin-1 or GBK bytes decoded as UTF-8 produce mojibake that silently deflates similarity scores and over-quarantines. Normalize to NFC UTF-8 at ingestion, not inside the router.
pandas chunk iterator vs pyarrow. As the docstring warns, pd.read_csv(..., chunksize=...) cannot use engine="pyarrow"; pandas raises ValueError as soon as read_csv is called, before any chunk is read. If a wrapper swallows that exception, the batch produces no output at all, so let it propagate and fail the run loudly.
Historical-map poisoning. If accepted fallbacks are written back into historical_map without broker sign-off, one wrong guess propagates as a high-confidence Tier 1 match on the next run. Only liquidated, reviewed entries may seed the history.
Tariff-update reconciliation. When USITC publishes revised subheadings, re-run quarantined SKUs against the updated tree so newly deterministic records auto-promote out of quarantine while genuinely ambiguous ones remain held. Records built against the HTS Schedule Database Design schema must be re-validated on every snapshot swap, not cached across versions.
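
For the encoding item above, a minimal ingestion-side sketch using only the standard library might look like the following. The function name and the declared_encoding argument are hypothetical; the sketch assumes each supplier feed states its encoding, which this page does not specify.

```python
import unicodedata


def normalize_description(raw: bytes, declared_encoding: str) -> str:
    """Decode supplier bytes with their declared encoding, then apply NFC.

    Strict decoding raises UnicodeDecodeError when the bytes are invalid for
    the declared encoding, instead of passing corrupted text to the router.
    """
    text = raw.decode(declared_encoding, errors="strict")
    # NFC composes sequences such as "e" + combining accent into one code
    # point, so visually identical descriptions compare equal downstream.
    return unicodedata.normalize("NFC", text)
```

Strict decoding cannot catch every mismatch (any byte sequence is valid Latin-1, for example), so it complements rather than replaces knowing each feed's actual encoding.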

Related

Handling Missing HTS Codes in ETL Pipelines: the upstream lookup and bitemporal validation this page assumes has already run.
Duty Formula Calculation Frameworks: where resolved (non-quarantined) codes have rates applied.
Rule of Origin Logic Engines: origin determination that must pause on quarantine.
Security Boundary & Data Isolation: access controls around the exception queue.
