---
language:
- en
tags:
- classification
- customs
- trade
- hscode
- product-classification
- pytorch
license: mit
pipeline_tag: text-classification
---

![HS Code Classifier](hscode_class_entum.webp)

# HS Code Classifier (English)

A deep learning model for automatic classification of goods by Harmonized System (HS) codes based on English-language product descriptions. The model predicts HS codes at three levels of granularity: 2-digit (chapter), 4-digit (heading), and 6-digit (subheading).

---

## Overview

The Harmonized System is an internationally standardized nomenclature for the classification of traded products. Manual assignment of HS codes is time-consuming and error-prone. This model automates that process from plain English product text, providing multi-level predictions with confidence scores.

**Task:** Multi-class text classification
**Input:** English product description (free-form text)
**Output:** HS code predictions at 2-, 4-, and 6-digit levels with confidence scores
**Base model:** `bert-base-uncased`

---

## Performance

The model was trained for 25 epochs and evaluated on a held-out validation set. The results below reflect the best checkpoint selected during training.

| Level   | Granularity | Accuracy   |
|---------|-------------|------------|
| 2-digit | Chapter     | **97.74%** |
| 4-digit | Heading     | **97.50%** |
| 6-digit | Subheading  | **90.12%** |

Training and validation loss progression confirmed stable convergence without overfitting, supported by learning-rate scheduling and weight averaging over the final epochs.

---

## Training Details

| Parameter           | Value                  |
|---------------------|------------------------|
| Training started    | 2026-03-09             |
| Total epochs        | 25                     |
| Final training loss | 0.40                   |
| Hardware            | GPU                    |
| Framework           | PyTorch + Transformers |

---

## Usage

This model uses a custom PyTorch architecture, so loading it requires the class definition from the original inference script. Below is a high-level usage example.
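Because the checkpoint expects a custom class, the cascaded multi-head design can be pictured roughly as follows. This is a hypothetical sketch only, not the actual class shipped with `inference.py`: the class name, head dimensions, and pooling strategy are assumptions.

```python
import torch
import torch.nn as nn


class CascadedHSClassifier(nn.Module):
    """Illustrative sketch of a cascaded multi-level HS classifier.

    A shared transformer encoder feeds three linear heads, one per
    HS granularity level. All names and sizes here are assumptions;
    the real definition ships with inference.py.
    """

    def __init__(self, encoder, n_chapters, n_headings, n_subheadings, hidden=768):
        super().__init__()
        self.encoder = encoder  # e.g. a fine-tuned bert-base-uncased
        self.head_2 = nn.Linear(hidden, n_chapters)     # 2-digit chapter logits
        self.head_4 = nn.Linear(hidden, n_headings)     # 4-digit heading logits
        self.head_6 = nn.Linear(hidden, n_subheadings)  # 6-digit subheading logits

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head_2(pooled), self.head_4(pooled), self.head_6(pooled)
```

Under this assumed layout, softmaxing each head's logits independently would yield the chained chapter -> heading -> subheading confidence readout shown in the output format below.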
### Requirements

```bash
pip install torch transformers sentencepiece safetensors
```

### Loading and Running Inference

```python
import json

from transformers import AutoTokenizer

# Load the architecture configuration
with open("model/model_config.json") as f:
    config = json.load(f)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("model/tokenizer")

# Load the 6-digit label mapping and build the reverse map
with open("model/label2id_6.json") as f:
    label2id_6 = json.load(f)
id2label_6 = {v: k for k, v in label2id_6.items()}

# Run inference using the provided inference script:
# python inference.py
```

For full inference, use the `inference.py` script included in the repository. It loads all model components, accepts a product description as input, and returns the top-5 HS code candidates with confidence scores at each level.

### Output Format

```
Input: "wireless bluetooth headphones with noise cancellation"

Rank | Code   | Score    | Confidence
-----|--------|----------|------------------------
1    | 851830 | 4.12e-01 | 85.21 -> 91.43 -> 87.62
2    | 851890 | 2.87e-01 | 85.21 -> 91.43 -> 72.18
3    | 852520 | 1.03e-01 | ...
```

Each result shows the predicted 6-digit subheading together with a chain of probabilities: chapter (2-digit) -> heading (4-digit) -> subheading (6-digit).

---

## Model Files

| File                | Description                          |
|---------------------|--------------------------------------|
| `cascaded_best.pt`  | Trained model weights                |
| `model_config.json` | Model architecture configuration     |
| `label2id_2.json`   | Chapter-level (2-digit) label map    |
| `label2id_4.json`   | Heading-level (4-digit) label map    |
| `label2id_6.json`   | Subheading-level (6-digit) label map |
| `tokenizer/`        | Tokenizer files                      |
| `base_model/`       | Fine-tuned base transformer weights  |

---

## Limitations

- The model was trained on English-language product descriptions; other languages are not supported.
- Coverage is limited to HS codes present in the training data. Very rare or newly introduced subheadings may not be recognized.
- Confidence scores should be treated as relative rankings rather than calibrated probabilities.
- The model predicts from text alone; physical measurements, material composition, and country-specific tariff rulings are not taken into account.

---

## License

This model is released under the MIT License.

---

## Contact

Developed by **ENTUM-AI**. For questions or collaboration, contact us via the Hugging Face profile page.