HS Code Classifier (English)
A deep learning model for automatic classification of goods by Harmonized System (HS) codes based on English-language product descriptions. The model predicts HS codes at three levels of granularity: 2-digit (chapter), 4-digit (heading), and 6-digit (subheading).
Overview
The Harmonized System is an internationally standardized nomenclature for the classification of traded products. Manual assignment of HS codes is time-consuming and error-prone. This model automates that process from plain English product text, providing multi-level predictions with confidence scores.
Task: Multi-class text classification
Input: English product description (free-form text)
Output: HS code predictions at 2-, 4-, and 6-digit levels with confidence scores
Base model: bert-base-uncased
Performance
The model was trained for 25 epochs and evaluated on a held-out validation set. The results below reflect the best checkpoint selected during training.
| Level | Granularity | Accuracy |
|---|---|---|
| 2-digit | Chapter | 97.74% |
| 4-digit | Heading | 97.50% |
| 6-digit | Subheading | 90.12% |
Training and validation loss curves showed stable convergence with no sign of overfitting, aided by learning-rate scheduling and weight averaging over the final epochs.
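Weight averaging here means averaging model parameters across late-epoch checkpoints (in the spirit of stochastic weight averaging). The exact procedure used in training is not published; a minimal sketch of the idea, with a toy layer standing in for the real model, might look like:

```python
import torch
import torch.nn as nn

def average_state_dicts(state_dicts):
    """Element-wise mean of parameters across several checkpoints."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# Toy demonstration: averaging two synthetic "checkpoints" of a tiny layer
layer = nn.Linear(4, 2)
ckpt_a = {k: torch.zeros_like(v) for k, v in layer.state_dict().items()}
ckpt_b = {k: torch.ones_like(v) for k, v in layer.state_dict().items()}
layer.load_state_dict(average_state_dicts([ckpt_a, ckpt_b]))
```

In practice the averaged checkpoints would come from the last few training epochs rather than synthetic tensors.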
Training Details
| Parameter | Value |
|---|---|
| Training started | 2026-03-09 |
| Total epochs | 25 |
| Final training loss | 0.40 |
| Hardware | GPU |
| Framework | PyTorch + Transformers |
Usage
This model uses a custom PyTorch architecture. Loading requires the class definition from the original inference script. Below is a high-level usage example.
Requirements
```
pip install torch transformers sentencepiece safetensors
```
Loading and Running Inference
```python
import json
from transformers import AutoTokenizer

# Load the architecture configuration
with open("model/model_config.json") as f:
    config = json.load(f)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("model/tokenizer")

# Load the subheading-level label mapping
with open("model/label2id_6.json") as f:
    label2id_6 = json.load(f)
id2label_6 = {v: k for k, v in label2id_6.items()}

# Full inference requires the custom model class; use the provided script:
# python inference.py
```
For full inference, use the inference.py script included in the repository. It loads all model components, accepts a product description as input, and returns the top-5 HS code candidates with confidence scores at each level.
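The exact model class lives in the inference script and is not reproduced here. As a rough, hypothetical sketch of what a cascaded classifier of this kind typically looks like (class and attribute names are assumptions, not the repository's actual code), a shared encoder feeds one linear head per HS level:

```python
import torch
import torch.nn as nn

class CascadedHSClassifier(nn.Module):
    """Hypothetical sketch: shared transformer encoder with one head per HS level."""

    def __init__(self, encoder, hidden_size, n_chapters, n_headings, n_subheadings):
        super().__init__()
        self.encoder = encoder                            # e.g. AutoModel.from_pretrained(...)
        self.head_2 = nn.Linear(hidden_size, n_chapters)      # 2-digit chapter logits
        self.head_4 = nn.Linear(hidden_size, n_headings)      # 4-digit heading logits
        self.head_6 = nn.Linear(hidden_size, n_subheadings)   # 6-digit subheading logits

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head_2(pooled), self.head_4(pooled), self.head_6(pooled)
```

With `transformers`, the encoder could be created via `AutoModel.from_pretrained("model/base_model")` and `hidden_size` read from `encoder.config.hidden_size`; the real architecture in `inference.py` may differ.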
Output Format
Input: "wireless bluetooth headphones with noise cancellation"

```
Rank | Code   | Score    | Confidence
-----|--------|----------|------------------------
1    | 851830 | 4.12e-01 | 85.21 -> 91.43 -> 87.62
2    | 851890 | 2.87e-01 | 85.21 -> 91.43 -> 72.18
3    | 852520 | 1.03e-01 | ...
```
Each result shows the predicted 6-digit subheading with a chain of probabilities: chapter (2-digit) -> heading (4-digit) -> subheading (6-digit).
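The chained percentages can be reproduced from the per-level logits by applying softmax at each level and ranking subheadings with `torch.topk`. The helper below is a hypothetical illustration (its name and the `chapter_idx`/`heading_idx` parent lookups are assumptions, not part of the released code):

```python
import torch
import torch.nn.functional as F

def top_k_with_chain(logits_2, logits_4, logits_6, chapter_idx, heading_idx, k=5):
    """Rank 6-digit candidates and attach the chapter -> heading -> subheading
    probability chain. chapter_idx / heading_idx map each subheading index to
    its parent chapter and heading indices (hypothetical lookup tables)."""
    p2 = F.softmax(logits_2, dim=-1)
    p4 = F.softmax(logits_4, dim=-1)
    p6 = F.softmax(logits_6, dim=-1)
    scores, ids = torch.topk(p6, k)
    results = []
    for score, idx in zip(scores.tolist(), ids.tolist()):
        chain = (p2[chapter_idx[idx]].item(), p4[heading_idx[idx]].item(), score)
        results.append((idx, score, chain))
    return results
```

Multiplying the three chain probabilities would give a joint confidence; the table above reports them separately per level.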
Model Files
| File | Description |
|---|---|
| cascaded_best.pt | Trained model weights |
| model_config.json | Model architecture configuration |
| label2id_2.json | Chapter-level (2-digit) label map |
| label2id_4.json | Heading-level (4-digit) label map |
| label2id_6.json | Subheading-level (6-digit) label map |
| tokenizer/ | Tokenizer files |
| base_model/ | Fine-tuned base transformer weights |
Limitations
- The model was trained on English-language product descriptions. Other languages are not supported.
- Coverage is limited to HS codes present in the training data. Very rare or newly introduced subheadings may not be recognized.
- Confidence scores should be treated as relative rankings rather than calibrated probabilities.
- The model predicts based on text alone. Physical measurements, materials composition, or country-specific tariff rulings are not taken into account.
License
This model is released under the MIT License.
Contact
Developed by ENTUM-AI. For questions or collaboration, contact us via the Hugging Face profile page.
