HS Code Classifier (English)
A deep learning model for automatic classification of goods by Harmonized System (HS) codes based on English-language product descriptions. The model predicts HS codes at three levels of granularity: 2-digit (chapter), 4-digit (heading), and 6-digit (subheading).
Overview
The Harmonized System is an internationally standardized nomenclature for the classification of traded products. Manual assignment of HS codes is time-consuming and error-prone. This model automates that process from plain English product text, providing multi-level predictions with confidence scores.
Task: Multi-class text classification
Input: English product description (free-form text)
Output: HS code predictions at 2-, 4-, and 6-digit levels with confidence scores
Base model: bert-base-uncased
Performance
The model was trained for 25 epochs and evaluated on a held-out validation set. The results below reflect the best checkpoint selected during training.
| Level | Granularity | Accuracy |
|---|---|---|
| 2-digit | Chapter | 97.74% |
| 4-digit | Heading | 97.50% |
| 6-digit | Subheading | 90.12% |
Training and validation loss curves showed stable convergence with no sign of overfitting, aided by learning-rate scheduling and weight averaging over the final epochs.
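Weight averaging here means averaging model parameters across late-epoch checkpoints (in the spirit of stochastic weight averaging). The exact procedure used in training is not published; a minimal sketch of the idea, with a toy layer standing in for the real model, might look like:

```python
import torch
import torch.nn as nn

def average_state_dicts(state_dicts):
    """Element-wise mean of parameters across several checkpoints."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# Toy demonstration: averaging two synthetic "checkpoints" of a tiny layer
layer = nn.Linear(4, 2)
ckpt_a = {k: torch.zeros_like(v) for k, v in layer.state_dict().items()}
ckpt_b = {k: torch.ones_like(v) for k, v in layer.state_dict().items()}
layer.load_state_dict(average_state_dicts([ckpt_a, ckpt_b]))
```

In practice the averaged checkpoints would come from the last few training epochs rather than synthetic tensors.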
Training Details
| Parameter | Value |
|---|---|
| Training started | 2026-03-09 |
| Total epochs | 25 |
| Final training loss | 0.40 |
| Hardware | GPU |
| Framework | PyTorch + Transformers |
Usage
This model uses a custom PyTorch architecture. Loading requires the class definition from the original inference script. Below is a high-level usage example.
Requirements
```
pip install torch transformers sentencepiece safetensors
```
Loading and Running Inference
```python
import json
from transformers import AutoTokenizer

# Load the architecture configuration
with open("model/model_config.json") as f:
    config = json.load(f)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("model/tokenizer")

# Load the subheading-level label mapping
with open("model/label2id_6.json") as f:
    label2id_6 = json.load(f)
id2label_6 = {v: k for k, v in label2id_6.items()}

# Full inference requires the custom model class; use the provided script:
# python inference.py
```
For full inference, use the inference.py script included in the repository. It loads all model components, accepts a product description as input, and returns the top-5 HS code candidates with confidence scores at each level.
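The exact model class lives in the inference script and is not reproduced here. As a rough, hypothetical sketch of what a cascaded classifier of this kind typically looks like (class and attribute names are assumptions, not the repository's actual code), a shared encoder feeds one linear head per HS level:

```python
import torch
import torch.nn as nn

class CascadedHSClassifier(nn.Module):
    """Hypothetical sketch: shared transformer encoder with one head per HS level."""

    def __init__(self, encoder, hidden_size, n_chapters, n_headings, n_subheadings):
        super().__init__()
        self.encoder = encoder                            # e.g. AutoModel.from_pretrained(...)
        self.head_2 = nn.Linear(hidden_size, n_chapters)      # 2-digit chapter logits
        self.head_4 = nn.Linear(hidden_size, n_headings)      # 4-digit heading logits
        self.head_6 = nn.Linear(hidden_size, n_subheadings)   # 6-digit subheading logits

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head_2(pooled), self.head_4(pooled), self.head_6(pooled)
```

With `transformers`, the encoder could be created via `AutoModel.from_pretrained("model/base_model")` and `hidden_size` read from `encoder.config.hidden_size`; the real architecture in `inference.py` may differ.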
Output Format
Input: "wireless bluetooth headphones with noise cancellation"

```
Rank | Code   | Score    | Confidence
-----|--------|----------|------------------------
1    | 851830 | 4.12e-01 | 85.21 -> 91.43 -> 87.62
2    | 851890 | 2.87e-01 | 85.21 -> 91.43 -> 72.18
3    | 852520 | 1.03e-01 | ...
```
Each result shows the predicted 6-digit subheading with a chain of probabilities: chapter (2-digit) -> heading (4-digit) -> subheading (6-digit).
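The chained percentages can be reproduced from the per-level logits by applying softmax at each level and ranking subheadings with `torch.topk`. The helper below is a hypothetical illustration (its name and the `chapter_idx`/`heading_idx` parent lookups are assumptions, not part of the released code):

```python
import torch
import torch.nn.functional as F

def top_k_with_chain(logits_2, logits_4, logits_6, chapter_idx, heading_idx, k=5):
    """Rank 6-digit candidates and attach the chapter -> heading -> subheading
    probability chain. chapter_idx / heading_idx map each subheading index to
    its parent chapter and heading indices (hypothetical lookup tables)."""
    p2 = F.softmax(logits_2, dim=-1)
    p4 = F.softmax(logits_4, dim=-1)
    p6 = F.softmax(logits_6, dim=-1)
    scores, ids = torch.topk(p6, k)
    results = []
    for score, idx in zip(scores.tolist(), ids.tolist()):
        chain = (p2[chapter_idx[idx]].item(), p4[heading_idx[idx]].item(), score)
        results.append((idx, score, chain))
    return results
```

Multiplying the three chain probabilities would give a joint confidence; the table above reports them separately per level.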
Model Files
| File | Description |
|---|---|
| cascaded_best.pt | Trained model weights |
| model_config.json | Model architecture configuration |
| label2id_2.json | Chapter-level (2-digit) label map |
| label2id_4.json | Heading-level (4-digit) label map |
| label2id_6.json | Subheading-level (6-digit) label map |
| tokenizer/ | Tokenizer files |
| base_model/ | Fine-tuned base transformer weights |
Limitations
- The model was trained on English-language product descriptions. Other languages are not supported.
- Coverage is limited to HS codes present in the training data. Very rare or newly introduced subheadings may not be recognized.
- Confidence scores should be treated as relative rankings rather than calibrated probabilities.
- The model predicts based on text alone. Physical measurements, materials composition, or country-specific tariff rulings are not taken into account.
License
This model is released under the MIT License.
Contact
Developed by ENTUM-AI. For questions or collaboration, contact us via the Hugging Face profile page.
