HTS Schedule Database Design

The Harmonized Tariff Schedule database is the foundational reference layer for automated customs brokerage and HS code classification workflows, and its design determines whether every downstream duty calculation is defensible under audit. Part of the Core Architecture & Tariff Mapping reference architecture, this schema reconciles the rigid hierarchical nomenclature maintained by the World Customs Organization with jurisdiction-specific regulatory overlays, treating the tariff schedule as a versioned, temporally aware, audit-ready data structure rather than a static lookup table.

Problem Framing: Classification Drift Across Tariff Revisions

The failure mode this design exists to prevent is classification drift — the silent divergence between the tariff code applied at entry filing and the code that was legally in force on the date of import. HTSUS revisions ship at least twice a year through USITC, punctuated by ad-hoc Presidential proclamations, Section 301/232 actions, and Federal Register notices that alter rates or add exclusions mid-cycle. A schema that overwrites active rows on each revision destroys the historical state that a CBP post-entry audit requires, so retroactive duty reconciliation becomes impossible and every prior entry inherits the newest description.

For trade compliance officers, customs brokers, logistics developers, and Python ETL teams, the database must therefore answer a harder question than “what is code 8471.30.0100?” It must answer “what was code 8471.30.0100 on 2024-06-15, in the US jurisdiction, and what regulatory flags applied then?” That point-in-time requirement drives every schema decision below: temporal validity intervals, immutable audit trails, and a hierarchy model that survives schedule rotations without orphaning records.
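The point-in-time question can be sketched in plain Python before any SQL is involved. The snippet below is a minimal illustration, not the production lookup: `CodeVersion` and `resolve_as_of` are hypothetical names, and the two history rows (dates and descriptions) are invented purely to show how a half-open validity interval selects exactly one version.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class CodeVersion:
    """One validity interval of a tariff code (hypothetical in-memory stand-in for a row)."""
    jurisdiction: str
    full_code: str
    description: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None = still in force


def resolve_as_of(
    versions: Iterable[CodeVersion],
    jurisdiction: str,
    full_code: str,
    as_of: datetime,
) -> Optional[CodeVersion]:
    """Return the single version in force at `as_of`, or None if there is none."""
    matches = [
        v for v in versions
        if v.jurisdiction == jurisdiction
        and v.full_code == full_code
        and v.valid_from <= as_of
        and (v.valid_to is None or as_of < v.valid_to)  # half-open [from, to)
    ]
    if len(matches) > 1:
        # Overlapping intervals make the answer ambiguous; refuse to guess.
        raise ValueError(f"overlapping intervals for {jurisdiction}/{full_code}")
    return matches[0] if matches else None


UTC = timezone.utc
# Invented history: the cut-over date and both descriptions are illustrative only.
history = [
    CodeVersion("US", "8471300100", "Illustrative earlier wording",
                datetime(2022, 1, 1, tzinfo=UTC), datetime(2024, 7, 1, tzinfo=UTC)),
    CodeVersion("US", "8471300100", "Illustrative later wording",
                datetime(2024, 7, 1, tzinfo=UTC)),
]

hit = resolve_as_of(history, "US", "8471300100", datetime(2024, 6, 15, tzinfo=UTC))
print(hit.description)  # Illustrative earlier wording
```

Because the upper bound is exclusive, a lookup dated exactly at the cut-over resolves to the newer version, which is the behaviour a back-to-back pair of intervals needs.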

The schedule itself is a fixed-width numeric hierarchy. The first six digits are internationally harmonised by the WCO; digits 7–10 are national subdivisions. In the HTSUS, digits 7–8 define the tariff rate line and digits 9–10 are the statistical suffix; the EU counterpart is the 8-digit Combined Nomenclature, extended to 10 digits in TARIC.
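As a small illustration of that fixed-width layout, a 10-digit code can be sliced into its hierarchy levels by position alone. `HTSParts` and `split_hts` are hypothetical helper names, and the split of the national digits into an 8-digit rate line plus a 10-digit statistical level assumes the HTSUS layout; other jurisdictions subdivide digits 7–10 differently.

```python
from typing import NamedTuple


class HTSParts(NamedTuple):
    chapter: str       # digits 1-2
    heading: str       # digits 1-4
    subheading: str    # digits 1-6, harmonized by the WCO
    rate_line: str     # digits 1-8, national (HTSUS layout assumed)
    statistical: str   # digits 1-10, national (HTSUS layout assumed)


def split_hts(code: str) -> HTSParts:
    """Split a dotted or plain 10-digit code into cumulative hierarchy prefixes."""
    digits = "".join(ch for ch in code if ch.isascii() and ch.isdigit())
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}: {code!r}")
    return HTSParts(digits[:2], digits[:4], digits[:6], digits[:8], digits)


parts = split_hts("8471.30.0100")
print(parts.chapter, parts.heading, parts.subheading)  # 84 8471 847130
```

Each level is a prefix of the next, which is why a parent row can always be located from a child code without a separate lookup table.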

Schema & Data Contract

A production-grade schema normalizes the chapter → heading → subheading → statistical-suffix hierarchy while preserving the full 10-digit code as a deterministic query key. The core hts_code table stores the canonical identifier, jurisdiction, effective date range, and regulatory metadata. Temporal validity is enforced through valid_from and valid_to timestamps: a NULL valid_to marks the currently active row, and the exclusion constraint prevents two overlapping intervals for the same (jurisdiction, full_code) pair from ever coexisting.

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE hts_code (
    code_id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    jurisdiction     VARCHAR(3) NOT NULL,
    full_code        VARCHAR(10) NOT NULL,
    description      TEXT NOT NULL,
    valid_from       TIMESTAMPTZ NOT NULL,
    valid_to         TIMESTAMPTZ,                       -- NULL = currently active
    regulatory_flag  BOOLEAN DEFAULT FALSE,
    created_at       TIMESTAMPTZ DEFAULT NOW(),
    updated_at       TIMESTAMPTZ DEFAULT NOW(),
    CONSTRAINT chk_valid_range CHECK (valid_to IS NULL OR valid_to > valid_from),
    -- No two rows for the same code+jurisdiction may cover overlapping time.
    CONSTRAINT ex_no_overlap EXCLUDE USING gist (
        jurisdiction WITH =,
        full_code    WITH =,
        tstzrange(valid_from, COALESCE(valid_to, 'infinity')) WITH &&
    )
);

CREATE TABLE hts_hierarchy (
    code_id          UUID PRIMARY KEY REFERENCES hts_code(code_id) ON DELETE CASCADE,
    parent_code_id   UUID REFERENCES hts_code(code_id),
    level            SMALLINT NOT NULL CHECK (level BETWEEN 1 AND 5)  -- 1=chapter … 5=10-digit
);

The input contract for any ingestion job is formalized as a Pydantic model so that malformed rows are rejected before they touch the transaction. The model encodes the two hard regulatory invariants — a 10-digit numeric full_code and an ISO jurisdiction code — that the WCO nomenclature and HTSUS structure require.

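from datetime import datetime
from pydantic import BaseModel, ValidationInfo, field_validator


class HTSRecord(BaseModel):
    jurisdiction: str          # ISO 3166 alpha-2/3, e.g. "US"
    full_code: str             # exactly 10 numeric digits
    description: str
    valid_from: datetime
    valid_to: datetime | None = None
    regulatory_flag: bool = False

    @field_validator("jurisdiction")
    @classmethod
    def _iso_shape(cls, v: str) -> str:
        # Shape check only (2-3 ASCII letters); membership in ISO 3166 is not verified here.
        code = v.strip().upper()
        if not (code.isascii() and code.isalpha() and len(code) in (2, 3)):
            raise ValueError(f"jurisdiction must be an ISO 3166 alpha-2/3 code, got {v!r}")
        return code

    @field_validator("full_code")
    @classmethod
    def _ten_digits(cls, v: str) -> str:
        # Drop separators such as dots, then require exactly 10 digits.
        digits = "".join(ch for ch in v if ch.isdigit())
        if len(digits) != 10:
            raise ValueError(f"HTS code must be 10 digits, got {len(digits)}: {v!r}")
        return digits

    @field_validator("valid_to")
    @classmethod
    def _range_ordered(cls, v: datetime | None, info: ValidationInfo) -> datetime | None:
        # valid_from is declared first, so it is already in info.data if it validated.
        vf = info.data.get("valid_from")
        if v is not None and vf is not None and v <= vf:
            raise ValueError("valid_to must be strictly after valid_from")
        return v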

Step-by-Step Implementation

The ingestion path runs as four ordered stages. Each stage has a defined purpose, input, output, and error condition; a failure at any stage aborts the surrounding transaction rather than committing a partial schedule.

1. Extract and structurally validate

Purpose: turn a raw USITC/WCO extract into typed records. Input: a Polars DataFrame from the source file. Output: a clean frame of validated rows. Error condition: zero surviving rows raises ComplianceIngestionError, halting the run before any write.

import polars as pl


class ComplianceIngestionError(Exception):
    """Raised when a source extract cannot be safely ingested."""


def structural_clean(raw_df: pl.DataFrame) -> pl.DataFrame:
    required = {"jurisdiction", "full_code", "description", "valid_from", "valid_to"}
    if not required.issubset(raw_df.columns):
        raise ComplianceIngestionError(f"Missing HTS columns: {required - set(raw_df.columns)}")

    clean = (
        raw_df.with_columns(
            # Strip every non-digit character but never truncate, so over-long codes
            # are dropped by the length filter below instead of silently shortened.
            pl.col("full_code").str.replace_all(r"\D", ""),
            # Parse as UTC so the values are unambiguous when written to TIMESTAMPTZ.
            pl.col("valid_from").str.to_datetime("%Y-%m-%d", time_zone="UTC"),
            # strict=False turns an empty or unparseable valid_to into null (open-ended).
            pl.col("valid_to").str.to_datetime("%Y-%m-%d", time_zone="UTC", strict=False),
        )
        .filter(pl.col("full_code").str.len_chars() == 10)
        .unique(subset=["jurisdiction", "full_code", "valid_from"])
    )
    if clean.is_empty():
        raise ComplianceIngestionError("No valid records after structural validation.")
    return clean

2. Bulk-load into an isolated staging table

Purpose: stage the batch for set-based promotion without contending with live reads. Input: the cleaned frame. Output: a truncated-and-repopulated stg_hts_codes. Error condition: any COPY type-coercion failure propagates and rolls the transaction back.

import asyncpg


async def bulk_stage(conn: asyncpg.Connection, df: pl.DataFrame, staging: str) -> None:
    records = [
        (r["jurisdiction"], r["full_code"], r["description"], r["valid_from"], r["valid_to"])
        for r in df.to_dicts()
    ]
    await conn.execute(f"TRUNCATE TABLE {staging}")
    await conn.copy_records_to_table(
        staging,
        records=records,
        columns=("jurisdiction", "full_code", "description", "valid_from", "valid_to"),
    )

3. Detect overlaps and quarantine conflicts

Purpose: isolate rows whose validity window collides with an existing active interval, since those cannot be auto-applied without erasing history. Input: staged rows. Output: conflicting rows written to quarantine_conflicts for broker review. Error condition: none — quarantine is the safe default, and unresolved conflicts are surfaced downstream through Fallback Routing for Unmapped Codes.
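The overlap test at the heart of this stage can be sketched in plain Python. This is an illustrative mirror of a half-open range comparison, not the production query: `intervals_overlap` and `split_conflicts` are hypothetical names, rows are plain dictionaries, and the sample dates are invented.

```python
from datetime import datetime, timezone
from typing import Optional


def intervals_overlap(
    a_from: datetime, a_to: Optional[datetime],
    b_from: datetime, b_to: Optional[datetime],
) -> bool:
    """Half-open [from, to) overlap test; None as an upper bound means unbounded."""
    a_ends_after_b_starts = a_to is None or a_to > b_from
    b_ends_after_a_starts = b_to is None or b_to > a_from
    return a_ends_after_b_starts and b_ends_after_a_starts


def split_conflicts(staged: list[dict], existing: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition staged rows into (clean, conflicting) against existing intervals."""
    clean, conflicting = [], []
    for row in staged:
        key = (row["jurisdiction"], row["full_code"])
        clash = any(
            (e["jurisdiction"], e["full_code"]) == key
            and intervals_overlap(row["valid_from"], row["valid_to"],
                                  e["valid_from"], e["valid_to"])
            for e in existing
        )
        (conflicting if clash else clean).append(row)
    return clean, conflicting


def d(y: int, m: int, day: int) -> datetime:
    return datetime(y, m, day, tzinfo=timezone.utc)


# Invented sample data: one closed historical interval already in the schedule.
existing = [{"jurisdiction": "US", "full_code": "8471300100",
             "valid_from": d(2022, 1, 1), "valid_to": d(2024, 7, 1)}]
staged = [
    # Starts exactly where the old interval ends: adjacent, so not a conflict.
    {"jurisdiction": "US", "full_code": "8471300100",
     "valid_from": d(2024, 7, 1), "valid_to": None},
    # Falls inside the old interval: a conflict that needs broker review.
    {"jurisdiction": "US", "full_code": "8471300100",
     "valid_from": d(2024, 6, 1), "valid_to": d(2024, 7, 1)},
]

clean, conflicting = split_conflicts(staged, existing)
print(len(clean), len(conflicting))  # 1 1
```

Treating the upper bound as exclusive is what lets a successor interval begin at the same instant its predecessor ends without being quarantined.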

4. Idempotent promotion to production

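Purpose: merge non-conflicting rows deterministically so that replaying the same batch is a no-op. Input: staged rows minus quarantined ones. Output: upserted hts_code rows. Error condition: none is raised for overlaps. ON CONFLICT ON CONSTRAINT ex_no_overlap DO NOTHING silently skips any row the exclusion constraint would reject, so the invariant still holds at the storage layer, and skipped rows have to be surfaced by the post-load count reconciliation rather than by an exception.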

async def promote(conn: asyncpg.Connection) -> None:
    """Overlap detection + quarantine + idempotent upsert, in one statement."""
    await conn.execute("""
        WITH conflicts AS (
            -- DISTINCT: a staged row may overlap several existing intervals.
            SELECT DISTINCT s.full_code, s.jurisdiction
            FROM stg_hts_codes s
            JOIN hts_code h
              ON s.full_code = h.full_code AND s.jurisdiction = h.jurisdiction
            WHERE tstzrange(s.valid_from, COALESCE(s.valid_to, 'infinity'))
               && tstzrange(h.valid_from, COALESCE(h.valid_to, 'infinity'))
              -- A byte-identical row is a replay, not a conflict; without this
              -- guard, re-running a batch would quarantine every row it loaded.
              AND NOT (s.valid_from = h.valid_from
                       AND s.valid_to IS NOT DISTINCT FROM h.valid_to
                       AND s.description = h.description)
        ),
        quarantined AS (
            INSERT INTO quarantine_conflicts (code, jurisdiction, conflict_reason, ingested_at)
            SELECT full_code, jurisdiction, 'OVERLAPPING_VALIDITY', NOW() FROM conflicts
            RETURNING code, jurisdiction
        )
        INSERT INTO hts_code (jurisdiction, full_code, description, valid_from, valid_to)
        SELECT s.jurisdiction, s.full_code, s.description, s.valid_from, s.valid_to
        FROM stg_hts_codes s
        WHERE NOT EXISTS (
            SELECT 1 FROM conflicts c
            WHERE c.full_code = s.full_code AND c.jurisdiction = s.jurisdiction
        )
        -- Replayed rows collide with themselves here and are skipped.
        ON CONFLICT ON CONSTRAINT ex_no_overlap DO NOTHING;
    """)

Wrapping stages 2–4 in a single async with conn.transaction() block makes the whole promotion atomic: either the batch lands intact or the schedule is left untouched. The same set-based promotion pattern is reused by Tariff Update Ingestion Pipelines, which schedule these jobs against the biannual HTSUS release cadence.

Validation & Determinism

Structural validation alone is not enough for audit defensibility; the schema enforces regulatory cross-checks that a raw extract can violate. Three deterministic gates run before promotion:

Digit-length rule. Every full_code must be exactly 10 numeric digits (HTSUS). The Pydantic validator enforces this per row and the Polars len_chars() == 10 filter enforces it per batch, rejecting truncated 8-digit rate-line codes that would otherwise silently mis-key lookups.
Nomenclature continuity. A 10-digit code’s leading six digits must resolve to a WCO subheading already present in the schedule; an orphaned statistical suffix is quarantined rather than inserted.
Interval integrity. The btree_gist exclusion constraint is the last line of defence: even if application-layer overlap detection has a bug, the database refuses to store two overlapping intervals for the same code and jurisdiction, so a point-in-time lookup can never resolve to more than one row.
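The nomenclature-continuity gate is the only one of the three without code elsewhere in this section, so here is a minimal sketch of it. `split_orphans` is a hypothetical helper, the set of known subheadings would in practice come from the schedule itself, and the second sample code is deliberately made up so that it fails the check.

```python
from typing import Iterable


def split_orphans(
    codes: Iterable[str], known_subheadings: set[str]
) -> tuple[list[str], list[str]]:
    """Partition 10-digit codes by whether their 6-digit prefix is a known subheading."""
    accepted: list[str] = []
    orphaned: list[str] = []
    for code in codes:
        # The first six digits are the WCO subheading the suffix must hang from.
        (accepted if code[:6] in known_subheadings else orphaned).append(code)
    return accepted, orphaned


known = {"847130"}                       # assumed to be loaded from the schedule
batch = ["8471300100", "9999990000"]     # second code is invented for illustration
accepted, orphaned = split_orphans(batch, known)
print(accepted, orphaned)  # ['8471300100'] ['9999990000']
```

Orphaned codes would then follow the same quarantine path as overlap conflicts rather than being dropped.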

Point-in-time resolution is the read-side counterpart. A recursive CTE walks the hierarchy from a 10-digit leaf up to its chapter, filtering every level by the declaration date so the returned lineage reflects exactly the regulatory state in force at that instant.

WITH RECURSIVE code_lineage AS (
    -- Anchor: the 10-digit leaf as it stood on the declaration date.
    SELECT hc.code_id, hc.full_code, hc.description, hc.regulatory_flag,
           hc.valid_from, hc.valid_to, hh.parent_code_id, hh.level
    FROM hts_code hc
    JOIN hts_hierarchy hh ON hc.code_id = hh.code_id
    WHERE hc.jurisdiction = 'US'
      AND hc.full_code   = '8471300100'
      AND hc.valid_from  <= '2024-06-15T00:00:00Z'
      AND (hc.valid_to IS NULL OR hc.valid_to > '2024-06-15T00:00:00Z')
    UNION ALL
    -- Recursive step: follow parent_code_id upward, applying the same as-of filter.
    SELECT hc.code_id, hc.full_code, hc.description, hc.regulatory_flag,
           hc.valid_from, hc.valid_to, hh.parent_code_id, hh.level
    FROM hts_code hc
    JOIN hts_hierarchy hh ON hc.code_id = hh.code_id
    JOIN code_lineage cl   ON hc.code_id = cl.parent_code_id
    WHERE hc.valid_from <= '2024-06-15T00:00:00Z'
      AND (hc.valid_to IS NULL OR hc.valid_to > '2024-06-15T00:00:00Z')
)
-- Order by hierarchy level, not by date: leaf (level 5) first, chapter (level 1) last.
SELECT * FROM code_lineage ORDER BY level DESC;

A useful post-load determinism check is a count reconciliation: rows staged minus rows quarantined must equal rows newly present in hts_code for the batch’s valid_from. A mismatch signals a silently swallowed conflict and blocks the release.
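The reconciliation itself is simple arithmetic, sketched below. `ReconciliationError` and `reconcile` are hypothetical names, and the three counts are assumed to be gathered by the caller (for example from the staging table, the quarantine table, and the rows inserted for the batch).

```python
class ReconciliationError(Exception):
    """Raised when post-load row counts do not add up."""


def reconcile(staged: int, quarantined: int, promoted: int) -> None:
    """Require staged - quarantined == promoted; raise to block the release otherwise."""
    expected = staged - quarantined
    if promoted != expected:
        raise ReconciliationError(
            f"expected {expected} promoted rows "
            f"({staged} staged - {quarantined} quarantined), found {promoted}; "
            f"{expected - promoted} row(s) unaccounted for"
        )


reconcile(staged=100, quarantined=3, promoted=97)   # balanced: returns silently
```

Raising rather than logging keeps the check fail-closed: a release cannot proceed while rows are unaccounted for.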

Downstream Integration

The HTS database is the authoritative source that the rest of the tariff-mapping architecture reads from. It feeds Rule of Origin Logic Engines, which consume hierarchical lineage to evaluate regional value content and tariff-shift criteria against a bill-of-materials. It also backs Duty Formula Calculation Frameworks, which join the hts_code table to resolve ad valorem, specific, or compound duty rates for the exact code state in force at entry.

When a commercial description cannot be resolved to a code above a deterministic confidence threshold, the workflow escalates to Fallback Routing for Unmapped Codes, which parks the record in a human-in-the-loop queue while keeping a shadow row for audit continuity. Every one of these cross-system reads is mediated by Security Boundary & Data Isolation controls, which keep the reference schema read-only for downstream consumers and ensure commercial invoice PII never leaks into tariff reference data.

Scaling & Resilience

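High-throughput clearance windows demand both fast reads and bounded memory. On the read path, the schema carries a composite B-tree index on (jurisdiction, full_code, valid_from) plus a partial index over active rows (WHERE valid_to IS NULL) so the common “current code” lookup never scans historical intervals. For multi-million-row national schedules, declarative range partitioning by valid_from (monthly or quarterly) keeps index depth flat and accelerates the range scans that retroactive reconciliation depends on. Partitioning has a cost, though: PostgreSQL requires a primary key on a partitioned table to include the partition key, and it cannot enforce an exclusion constraint across partitions unless the partition key is compared for equality (before version 17, not at all). Under partitioning by valid_from, ex_no_overlap therefore holds only within each partition, and cross-partition overlaps must be caught by the promotion query.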

On the write path, memory footprint is controlled by slicing the cleaned frame with Polars iter_slices() and moving each slice with asyncpg's copy_records_to_table(), so only one batch at a time is expanded into Python tuples. Concurrency is bounded with an asyncio.Semaphore so a burst of release jobs cannot exhaust the connection pool, and each job runs under a statement timeout that acts as a circuit breaker: a stalled promotion is aborted, leaving the caller free to retry, rather than holding locks against live classification traffic.

import asyncio

SEM = asyncio.Semaphore(4)          # cap concurrent promotions
BATCH_ROWS = 50_000                 # rows per iter_slices() slice


async def ingest_release(pool: asyncpg.Pool, frame: pl.DataFrame) -> None:
    async with SEM:
        clean = structural_clean(frame)
        async with pool.acquire() as conn:
            await conn.execute("SET statement_timeout = '30s'")   # circuit breaker
            async with conn.transaction():
                for batch in clean.iter_slices(n_rows=BATCH_ROWS):
                    await bulk_stage(conn, batch, "stg_hts_codes")
                    await promote(conn)

Application-layer caching (Redis or Memcached) memoizes resolved point-in-time classifications for high-frequency SKUs, holding lookup latency under 50 ms even at peak. Cache keys embed the declaration date so a mid-cycle revision invalidates only the affected code, never the whole namespace.
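A date-scoped key scheme along these lines could look as follows. This is a sketch under assumptions: the `hts:` prefix and both function names are hypothetical, a plain dictionary stands in for Redis or Memcached, and the cached values and second code are illustrative only.

```python
from datetime import date


def cache_key(jurisdiction: str, full_code: str, as_of: date) -> str:
    """Build a key that is unique per jurisdiction, code, and declaration date."""
    return f"hts:{jurisdiction.upper()}:{full_code}:{as_of.isoformat()}"


def invalidate_code(cache: dict, jurisdiction: str, full_code: str) -> int:
    """Drop every cached date for one code; return how many entries were removed."""
    prefix = f"hts:{jurisdiction.upper()}:{full_code}:"
    stale = [key for key in cache if key.startswith(prefix)]
    for key in stale:
        del cache[key]
    return len(stale)


# Dictionary stand-in for the real cache; values are placeholders.
cache = {
    cache_key("US", "8471300100", date(2024, 6, 15)): "classification A",
    cache_key("US", "8471300100", date(2024, 6, 16)): "classification A",
    cache_key("US", "8517130000", date(2024, 6, 15)): "classification B",
}

removed = invalidate_code(cache, "US", "8471300100")
print(removed, len(cache))  # 2 1
```

Putting the code before the date in the key is what makes per-code invalidation a prefix match; with a real cache the same prefix would drive a pattern scan instead of a dictionary walk.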

Compliance Obligations

Post-entry audits under CBP standards require exact reconstruction of the tariff environment at the moment of entry filing, so the schema treats history as immutable evidence rather than mutable state. Every mutation writes an append-only audit row via a PostgreSQL trigger, capturing the actor, timestamp, prior value, and new value; audit rows are never updated or deleted.

CREATE TABLE hts_audit (
    audit_id     BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    code_id      UUID NOT NULL,
    action       TEXT NOT NULL,          -- INSERT | UPDATE | DELETE
    actor        TEXT NOT NULL DEFAULT current_user,
    changed_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    old_row      JSONB,
    new_row      JSONB
);
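The trigger behaviour described above can be demonstrated end to end with a self-contained stand-in. Note the substitution: the sketch uses SQLite from Python's standard library so it runs without a server, whereas the design here targets PostgreSQL, where the equivalent would be a PL/pgSQL trigger function. Table shapes are simplified, trigger names are hypothetical, and the descriptions are invented.

```python
import sqlite3

# Simplified SQLite analogue of hts_code / hts_audit (not the PostgreSQL DDL).
DDL = """
CREATE TABLE hts_code (
    code_id     INTEGER PRIMARY KEY,
    full_code   TEXT NOT NULL,
    description TEXT NOT NULL
);
CREATE TABLE hts_audit (
    audit_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    code_id    INTEGER NOT NULL,
    action     TEXT NOT NULL,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    old_desc   TEXT,
    new_desc   TEXT
);
-- Every mutation of hts_code appends one audit row.
CREATE TRIGGER trg_audit_insert AFTER INSERT ON hts_code
BEGIN
    INSERT INTO hts_audit (code_id, action, new_desc)
    VALUES (NEW.code_id, 'INSERT', NEW.description);
END;
CREATE TRIGGER trg_audit_update AFTER UPDATE ON hts_code
BEGIN
    INSERT INTO hts_audit (code_id, action, old_desc, new_desc)
    VALUES (NEW.code_id, 'UPDATE', OLD.description, NEW.description);
END;
-- The audit table itself rejects updates and deletes.
CREATE TRIGGER trg_audit_no_update BEFORE UPDATE ON hts_audit
BEGIN
    SELECT RAISE(ABORT, 'hts_audit is append-only');
END;
CREATE TRIGGER trg_audit_no_delete BEFORE DELETE ON hts_audit
BEGIN
    SELECT RAISE(ABORT, 'hts_audit is append-only');
END;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

conn.execute("INSERT INTO hts_code (full_code, description) VALUES (?, ?)",
             ("8471300100", "Illustrative wording v1"))
conn.execute("UPDATE hts_code SET description = ? WHERE full_code = ?",
             ("Illustrative wording v2", "8471300100"))

actions = [row[0] for row in conn.execute("SELECT action FROM hts_audit ORDER BY audit_id")]
print(actions)  # ['INSERT', 'UPDATE']

try:
    conn.execute("DELETE FROM hts_audit")   # tampering attempt
    tamper_blocked = False
except sqlite3.DatabaseError:
    tamper_blocked = True
print(tamper_blocked)  # True
```

Enforcing append-only behaviour inside the database, rather than in application code, means the guarantee also covers ad-hoc sessions and maintenance scripts.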

Retention follows the CBP recordkeeping horizon: audit rows and superseded intervals are retained for at least five years past the entry date and stored on an immutable tier so a Focused Assessment can be answered from primary records. Regulatory notices — Federal Register updates, Section 301 actions, tariff bulletins — are mapped onto the regulatory_flag column so automated screening can quarantine affected shipments before they reach the ACE portal, and any code that cannot be auto-resolved is escalated to a broker through the human-in-the-loop gate rather than defaulted. Treating the schedule as a version-controlled reference system, not a flat dictionary, is what makes classification deterministic, duty calculations auditable, and pipeline replays safe across global trade corridors.

Related

Point-in-time HTS resolution with recursive CTEs: climb the chapter→heading→subheading chain under an as-of window
asyncpg vs psycopg3 for HTS schedule bulk upserts: driver benchmark for high-volume schedule merges
Tariff Update Ingestion Pipelines: scheduled delta processing that drives writes into this schema
Rule of Origin Logic Engines: consumes hierarchical lineage for RVC and tariff-shift tests
Duty Formula Calculation Frameworks: joins hts_code to resolve ad valorem, specific, and compound rates
Fallback Routing for Unmapped Codes: broker-review queue for quarantined and low-confidence records
Security Boundary & Data Isolation: read-only isolation and PII controls around the reference schema
