How to parse WCO HS 2024 updates automatically

The exact problem this page solves: the World Customs Organization ships its Harmonized System 2024 revision as a bundle of nested XML deliverables containing roughly 350 structural changes — heading splits, chapter reorganizations, merges, and revised explanatory notes — and a naive parser either loads the whole multi-megabyte tree into memory and dies on a national extension file, or flattens the recursive nomenclature into a lookup table and silently loses the parent-child lineage that every downstream duty calculation depends on. Both outcomes are unacceptable in a production clearance environment: the first misses the ingestion window, the second manufactures classification drift that surfaces later as duty miscalculation under a CBP Focused Assessment. The correct behavior is deterministic — stream the nomenclature node-by-node, resolve every code to a valid ancestor, flag split/merge transitions so the transition mapping is explicit, and emit immutable typed records before anything reaches the Tariff Update Ingestion Pipelines staging layer. This page gives you a single runnable streaming parser that enforces that contract, plus the verification checklist and the encoding, ordering, and lxml-lifecycle gotchas that break naive implementations.

Prerequisites

This solution assumes a specific toolchain and upstream pipeline state. Confirm each before applying it:

Python 3.10+: the parser uses dataclass(frozen=True) for immutable records and str | int union annotations, with type hints throughout. The immutability is load-bearing: the fields of a frozen HSNode cannot be reassigned after a downstream stage reads it, which is what makes the audit hash reproducible.
lxml 4.9+: iterparse with a tag filter and start/end events is required for bounded-memory streaming. The stdlib xml.etree.ElementTree works for small files, but its iterparse has no tag filter and its elements lack getprevious()/getparent(), so it cannot prune parsed siblings on a 500 MB national extension.
The raw WCO deliverable staged on local or object storage, not streamed over HTTP. The parser is a pure function of (xml_path,); do not parse directly off a network handle, because a mid-stream disconnect leaves the internal stack half-populated and silently drops the tail of a chapter.
A digit-only code convention already agreed with the schema. The parser accepts only bare \d{2,8} codes: it strips surrounding whitespace but does not remove dots or other separators, so normalize upstream. The bitemporal storage contract is defined in HTS Schedule Database Design and expects the same key shape.
Structured logging configured (stdlib logging with the %(asctime)s | %(levelname)s | ... format, or structlog) so every rejected node is greppable for audit. High-frequency Invalid HS code format lines are the primary signal of an XML namespace or encoding mismatch.
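
The digit-only convention can be enforced with a small normalizer that runs before validation. The following is a minimal sketch, not part of the parser on this page: the name normalize_hs_code is hypothetical, and it assumes upstream codes may arrive with dots or spaces (for example 0101.21), which the source deliverable may or may not do.

```python
import re
from typing import Optional

# Same shape the parser validates against: 2 to 8 digits, nothing else.
CODE_PATTERN = re.compile(r"\d{2,8}")
# Assumed separators in upstream codes; extend only if the source really uses others.
SEPARATORS = re.compile(r"[.\s]")


def normalize_hs_code(raw: str) -> Optional[str]:
    """Return the bare digit form of an HS code, or None if it cannot be one.

    Hypothetical helper: it strips dots and whitespace only, so letters or
    other punctuation still fail validation instead of being silently repaired.
    """
    candidate = SEPARATORS.sub("", raw)
    return candidate if CODE_PATTERN.fullmatch(candidate) else None


if __name__ == "__main__":
    for raw in ("0101.21", " 8471 30 ", "01", "0101.2A", "123456789"):
        print(repr(raw), "->", normalize_hs_code(raw))
```

Returning None rather than raising keeps the decision about quarantine with the caller, which matches the rule that a rejected code is routed somewhere rather than discarded.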

Implementation

The parser below streams the WCO XML with event-driven iterparse, maintains an explicit ancestor stack to resolve parent_code, validates each code against the WCO digit-length rule before emitting, and prunes each finished subtree so memory stays bounded regardless of file size. Every yielded HSNode is immutable and carries the is_split/is_merged flags that the transition-mapping routines in the Duty Formula Calculation Frameworks rely on to reconcile legacy 8-digit codes against the new 2024 configuration.

import logging
import re
from dataclasses import dataclass
from typing import Generator, Optional, List, Dict
from lxml import etree

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)-8s | %(funcName)s:%(lineno)d | %(message)s",
)
logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class HSNode:
    code: str                              # digit-only HS code, 2–8 chars
    level: int                             # 0 = chapter, deeper = subheading
    description: str
    parent_code: Optional[str] = None
    notes: str = ""
    is_split: bool = False                 # WCO type="split" — 2024 heading divergence
    is_merged: bool = False                # WCO type="merge" — legacy codes collapsed
    effective_date: Optional[str] = None   # ISO 8601; drives the bitemporal window


class HS2024Parser:
    VALID_TAGS = {"Chapter", "Heading", "SubHeading", "SubSubHeading"}
    # WCO nomenclature is 2/4/6 international digits; national extensions push to 8.
    CODE_PATTERN = re.compile(r"^\d{2,8}$")

    def __init__(self, xml_path: str) -> None:
        self.xml_path = xml_path
        self._stack: List[Dict[str, str | int]] = []

    def _resolve_parent(self, current_level: int) -> Optional[str]:
        if current_level == 0:
            return None
        return self._stack[-1]["code"] if self._stack else None

    def _validate_node(self, code: str, level: int) -> bool:
        if not self.CODE_PATTERN.match(code):
            logger.warning("Invalid HS code format: %s", code)
            return False
        if level < 0 or level > 8:
            logger.error("Hierarchy depth out of bounds: level=%d", level)
            return False
        return True

    def parse(self) -> Generator[HSNode, None, None]:
        context = etree.iterparse(
            self.xml_path, events=("start", "end"), tag=self.VALID_TAGS
        )

        for event, elem in context:
            attrib = elem.attrib
            code = attrib.get("code", "").strip()

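            if event == "start":
                # Push before descending so children see the correct parent.
                if self.CODE_PATTERN.match(code):
                    self._stack.append({"code": code, "level": len(self._stack)})
                else:
                    # Never pushed, so the matching end event is skipped below.
                    logger.warning("Invalid HS code format: %s", code)
                continue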

            # event == "end": pop only the node we actually pushed.
            if not self._stack or self._stack[-1]["code"] != code:
                logger.debug("Skipping orphaned end tag: %s", code)
                continue

            node_data = self._stack.pop()
            level = node_data["level"]
            parent = self._resolve_parent(level)

            if not self._validate_node(code, level):
                continue

            desc = (elem.text or "").strip()
            # Use explicit `is not None` checks: an lxml element with no children
            # is falsy, so `find(...) or find(...)` would skip a valid but empty
            # <Notes> node and lose the explanatory-note lineage.
            notes = elem.find("Notes")
            if notes is None:
                notes = elem.find("ExplanatoryNotes")
            notes_text = notes.text.strip() if notes is not None and notes.text else ""

            # WCO tags structural transitions on the metadata attributes.
            is_split = attrib.get("type", "").lower() == "split"
            is_merged = attrib.get("type", "").lower() == "merge"
            eff_date = attrib.get("effectiveDate", None)

            yield HSNode(
                code=code,
                level=level,
                description=desc,
                parent_code=parent,
                notes=notes_text,
                is_split=is_split,
                is_merged=is_merged,
                effective_date=eff_date,
            )

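            # Release the finished subtree: clear the node, then delete already
            # emitted preceding siblings so peak memory stays flat. Stop at the
            # first non-hierarchy sibling (e.g. <Notes>) so the parent can still
            # read it on its own end event.
            elem.clear()
            prev = elem.getprevious()
            while prev is not None and prev.tag in self.VALID_TAGS:
                elem.getparent().remove(prev)
                prev = elem.getprevious()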

The generator yields one immutable HSNode per valid nomenclature node. Records are emitted on end events, so children arrive before their parent (post-order), not in opening-tag order. Because elem.clear() runs after every emitted node and the sibling-prune loop deletes already emitted siblings, peak memory tracks the depth of the tree (a handful of ancestors) rather than its total size, so a 500 MB national extension parses in roughly the same footprint as a single chapter.
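
Because parse() is a generator, a consumer can land records in fixed-size batches without materializing the whole nomenclature. A minimal sketch of that consumption pattern follows; write_batch stands for a hypothetical staging writer, the batch size of 500 is arbitrary, and the demo feeds a stand-in iterator instead of HS2024Parser(path).parse() so the block runs on its own.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(records: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` records, pulling lazily from `records`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def land(
    records: Iterable[T],
    write_batch: Callable[[List[T]], None],
    size: int = 500,
) -> int:
    """Feed records to `write_batch` (a hypothetical staging writer); return the count."""
    total = 0
    for batch in batches(records, size):
        write_batch(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    # Stand-in for HS2024Parser(path).parse(); any iterator of records works.
    fake_records = (f"{n:04d}" for n in range(1, 8))
    landed: List[List[str]] = []
    count = land(fake_records, landed.append, size=3)
    print(count, [len(b) for b in landed])
```

Only one batch is held in memory at a time, so the consumer keeps the same bounded footprint as the parser.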

Verification steps

Run these checks against a representative WCO deliverable before promoting parsed data to staging:

Node-count parity. Compare the count of yielded HSNode records against the official WCO revision index, per chapter. A shortfall almost always means an XML namespace is defeating the tag= filter: a bare name such as Chapter matches only elements in no namespace, so pass namespace-qualified tags or strip the namespace first.
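Parent-child integrity. Load the records and assert every non-chapter node’s parent_code resolves to a code in the yielded set. Parents are emitted after their children because records are yielded on end events, so run the check once the stream is complete rather than against earlier-yielded codes only. A recursive CTE over the landed rows should reach a chapter root for every leaf; any node whose ancestor walk exceeds the parser’s depth bound of 8 signals a mis-pushed stack entry.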
Split/merge coverage. Count is_split and is_merged records and reconcile against the published 2024 correlation table. Every legacy code retired by a merge must have a surviving successor, or the transition mapping consumed by Rule of Origin Logic Engines will strand preference claims on dead headings.
Checksum validation. Compute a SHA-256 over the raw XML bytes and a second hash over the canonicalized record stream. Store both in the audit table; a matching input hash with a diverging output hash proves a parser-version regression rather than a source change.
Effective-date presence. Assert every node carrying type="split" or type="merge" also has a non-null effective_date. A structural change with no validity window cannot be placed in the bitemporal schema and will silently overwrite the prior cycle.
Log-level sweep. Grep the structured logs for WARNING/ERROR. A burst of consecutive Invalid HS code format entries points to encoding corruption. Skipping orphaned end tag is logged at DEBUG, so rerun with level=logging.DEBUG to see it; at high frequency it points to a malformed or truncated file.
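
Several of the checks above reduce to a few lines over the landed records. The sketch below covers parent-child integrity, effective-date presence, and the record-stream half of the checksum check. It is illustrative only: Record is a stand-in with the same fields as HSNode so the block runs on its own, the function names are hypothetical, and canonicalizing by sorting on code with sorted JSON keys is an assumption rather than a format defined on this page.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    """Stand-in with the same fields as HSNode, so the sketch runs on its own."""
    code: str
    level: int
    description: str
    parent_code: Optional[str] = None
    notes: str = ""
    is_split: bool = False
    is_merged: bool = False
    effective_date: Optional[str] = None


def dangling_parents(records: Iterable[Record]) -> List[str]:
    """Codes whose parent_code is set but never appears in the full record set."""
    rows = list(records)
    known = {r.code for r in rows}
    return [
        r.code for r in rows
        if r.parent_code is not None and r.parent_code not in known
    ]


def undated_transitions(records: Iterable[Record]) -> List[str]:
    """Split or merge records that carry no effective_date."""
    return [
        r.code for r in records
        if (r.is_split or r.is_merged) and not r.effective_date
    ]


def stream_hash(records: Iterable[Record]) -> str:
    """SHA-256 over one canonical JSON line per record, sorted by code.

    Sorting and sort_keys are assumptions made here so the digest does not
    depend on emit order or field order.
    """
    digest = hashlib.sha256()
    for rec in sorted(records, key=lambda r: r.code):
        line = json.dumps(asdict(rec), sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8") + b"\n")
    return digest.hexdigest()


if __name__ == "__main__":
    sample = [
        Record("010121", 2, "Pure-bred breeding horses", parent_code="0101", is_split=True),
        Record("0101", 1, "Live horses", parent_code="01"),
        Record("01", 0, "Live animals"),
    ]
    print("dangling:", dangling_parents(sample))
    print("undated:", undated_transitions(sample))
    print("digest:", stream_hash(sample))
```

dangling_parents tests membership in the complete set, so it does not depend on the order in which records were emitted.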

Edge cases & gotchas

XML namespaces defeat the tag= filter. WCO deliverables sometimes ship with a default namespace, and iterparse(tag="Chapter") then matches nothing, yielding zero nodes with no error. Either pass the fully-qualified {uri}Chapter tags or run the file through a namespace-stripping pass first, and let verification step 1 catch the silent empty result.
Character-encoding corruption in descriptions and notes. A Latin-1 or Windows-1252 source re-declared as UTF-8 either fails to decode or turns accented commodity names into mojibake, which poisons the description field and mis-hashes the audit payload. Validate the raw bytes against the declared encoding with a strict decode (errors="strict", never "replace") before parsing, and quarantine on a decode failure rather than lossily patching bytes.
Document-order dependence of the stack. The parser assumes children appear inside their parent element in document order. If a deliverable emits a flat list of headings with parentCode attributes instead of nesting, the ancestor stack resolves the wrong parent — detect the flat shape up front and switch to attribute-based resolution rather than trusting positional nesting.
elem.clear() before reading .text. The sibling-prune block must run after every .text/.find() access. Moving elem.clear() earlier — a common “optimization” — wipes the note children before they are read and returns empty descriptions with no exception.
Codes shorter than the national granularity. A 6-digit international subheading is valid nomenclature but is not filable as a 10-digit national line. The parser accepts 2–8 digits for lineage, but a bare 6-digit match must still route through Fallback Routing for Unmapped Codes before it reaches an entry, or the duty engine will invoke a rate against an incomplete code.
Malformed codes must never bypass validation. Any record failing CODE_PATTERN or parent resolution is dropped from the emit stream — but “dropped” must mean “routed to quarantine with error_reason and raw_payload”, not silently discarded. Wire the _validate_node failure branch to the quarantine schema so verification step 6’s counts reconcile against landed quarantine rows.
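
The namespace and flat-shape gotchas can both be caught with a cheap probe before the real parse. The sketch below is one possible shape for it, under stated assumptions: probe and split_tag are hypothetical names, the probe uses the standard-library xml.etree.ElementTree only so that it runs without lxml, the sample size of 200 is arbitrary, and "flat" is taken to mean that a sampled element carries the parentCode attribute mentioned above.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Optional, Set, Tuple

LOCAL_TAGS = ("Chapter", "Heading", "SubHeading", "SubSubHeading")


def split_tag(tag: str) -> Tuple[Optional[str], str]:
    """Split ElementTree's '{uri}local' form into (uri or None, local)."""
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return None, tag


def probe(source: BinaryIO, sample: int = 200) -> Tuple[Set[str], bool]:
    """Inspect the first `sample` nomenclature elements of a deliverable.

    Returns (tag names to filter on, True if the file looks flat). The tag
    names are namespace-qualified when the first matching element has one.
    """
    namespace: Optional[str] = None
    flat = False
    seen = 0
    for _, elem in ET.iterparse(source, events=("start",)):
        uri, local = split_tag(elem.tag)
        if local not in LOCAL_TAGS:
            continue
        if seen == 0:
            namespace = uri
        if "parentCode" in elem.attrib:
            flat = True
        seen += 1
        if seen >= sample:
            break
    prefix = f"{{{namespace}}}" if namespace else ""
    return {prefix + name for name in LOCAL_TAGS}, flat


if __name__ == "__main__":
    # "urn:example:hs" is a made-up namespace URI for the demo.
    doc = (
        b'<Nomenclature xmlns="urn:example:hs">'
        b'<Chapter code="01"><Heading code="0101"/></Chapter>'
        b"</Nomenclature>"
    )
    tags, is_flat = probe(io.BytesIO(doc))
    print(sorted(tags), is_flat)
```

The returned set is what would replace the bare names in the tag= filter. Child lookups such as find("Notes") need the same namespace prefix, and a True flat flag is the signal to switch to attribute-based parent resolution.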

Authoritative references: WCO HS Nomenclature 2022 Edition · lxml.etree iterparse documentation.
