---
language:
- en
tags:
- classification
- customs
- trade
- hscode
- product-classification
- pytorch
license: mit
pipeline_tag: text-classification
---

# HS Code Classifier (English)

A deep learning model for automatic classification of goods by Harmonized System (HS) codes based on English-language product descriptions. The model predicts HS codes at three levels of granularity: 2-digit (chapter), 4-digit (heading), and 6-digit (subheading).

---

## Overview

The Harmonized System is an internationally standardized nomenclature for the classification of traded products. Manual assignment of HS codes is time-consuming and error-prone. This model automates that process from plain English product text, providing multi-level predictions with confidence scores.

**Task:** Multi-class text classification
**Input:** English product description (free-form text)
**Output:** HS code predictions at 2-, 4-, and 6-digit levels with confidence scores
**Base model:** `bert-base-uncased`
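Because the three levels are nested, the coarser predictions are prefixes of the 6-digit code: the first two digits give the chapter and the first four give the heading. A minimal sketch of that relationship (the helper name is illustrative, not part of the model's API):

```python
def split_hs_levels(hs6: str) -> dict:
    """Derive the chapter and heading prefixes of a 6-digit HS code."""
    if len(hs6) != 6 or not hs6.isdigit():
        raise ValueError(f"expected a 6-digit HS code, got {hs6!r}")
    return {"chapter": hs6[:2], "heading": hs6[:4], "subheading": hs6}

print(split_hs_levels("851830"))
# {'chapter': '85', 'heading': '8518', 'subheading': '851830'}
```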

---

## Performance

The model was trained for 25 epochs and evaluated on a held-out validation set. The results below reflect the best checkpoint selected during training.

| Level   | Granularity | Accuracy   |
|---------|-------------|------------|
| 2-digit | Chapter     | **97.74%** |
| 4-digit | Heading     | **97.50%** |
| 6-digit | Subheading  | **90.12%** |

Training and validation loss curves showed stable convergence without overfitting, aided by learning rate scheduling and weight averaging over the final epochs.
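The exact averaging scheme is not documented here, but weight averaging over the final epochs (in the spirit of stochastic weight averaging) can be sketched schematically. In this toy version, state dicts are plain name-to-list-of-floats mappings standing in for real `torch` tensors:

```python
def average_state_dicts(state_dicts):
    """Element-wise mean of several parameter dicts (schematic:
    lists of floats stand in for torch tensors)."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    return {
        name: [sum(vals) / len(state_dicts)
               for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in state_dicts[0]
    }

# Average a toy two-parameter model over its "last three epochs"
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
print(average_state_dicts(ckpts))
# {'w': [3.0, 4.0], 'b': [1.0]}
```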

---

## Training Details

| Parameter           | Value                  |
|---------------------|------------------------|
| Training started    | 2026-03-09             |
| Total epochs        | 25                     |
| Final training loss | 0.40                   |
| Hardware            | GPU                    |
| Framework           | PyTorch + Transformers |

---

## Usage

This model uses a custom PyTorch architecture. Loading requires the class definition from the original inference script. Below is a high-level usage example.

### Requirements

```bash
pip install torch transformers sentencepiece safetensors
```

### Loading and Running Inference

```python
import json

from transformers import AutoTokenizer

# Load the model architecture configuration
with open("model/model_config.json") as f:
    config = json.load(f)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("model/tokenizer")

# Load the 6-digit label mapping and build the inverse map
with open("model/label2id_6.json") as f:
    label2id_6 = json.load(f)
id2label_6 = {v: k for k, v in label2id_6.items()}

# Full inference requires the custom model class; run the provided script:
# python inference.py
```

For full inference, use the `inference.py` script included in the repository. It loads all model components, accepts a product description as input, and returns the top-5 HS code candidates with confidence scores at each level.
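The top-5 selection can be sketched as a softmax over the 6-digit logits followed by ranking. The function and variable names below are illustrative, not the actual `inference.py` internals:

```python
import math

def top_k_predictions(logits, id2label, k=5):
    """Convert raw logits to probabilities and return the k best labels."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Toy 3-class example with made-up logits and labels
id2label = {0: "851830", 1: "851890", 2: "852520"}
print(top_k_predictions([2.0, 1.0, -1.0], id2label, k=2))
```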

### Output Format

```
Input: "wireless bluetooth headphones with noise cancellation"

Rank | Code   | Score    | Confidence
-----|--------|----------|------------------------
1    | 851830 | 4.12e-01 | 85.21 -> 91.43 -> 87.62
2    | 851890 | 2.87e-01 | 85.21 -> 91.43 -> 72.18
3    | 852520 | 1.03e-01 | ...
```

Each result shows the predicted 6-digit subheading with a chain of probabilities: chapter (2-digit) -> heading (4-digit) -> subheading (6-digit).
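How `inference.py` combines the per-level outputs is not specified here, but assembling such a chain for one candidate from per-level probability lookups might look like this (a sketch; the dict names and values are made up for illustration):

```python
def confidence_chain(hs6, p2, p4, p6):
    """Format chapter -> heading -> subheading confidences (in percent)
    for one 6-digit candidate, given per-level probability dicts."""
    parts = [p2[hs6[:2]], p4[hs6[:4]], p6[hs6]]
    return " -> ".join(f"{100 * p:.2f}" for p in parts)

# Hypothetical per-level probabilities for one candidate
p2 = {"85": 0.8521}
p4 = {"8518": 0.9143}
p6 = {"851830": 0.8762}
print(confidence_chain("851830", p2, p4, p6))
# 85.21 -> 91.43 -> 87.62
```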

---

## Model Files

| File                | Description                          |
|---------------------|--------------------------------------|
| `cascaded_best.pt`  | Trained model weights                |
| `model_config.json` | Model architecture configuration     |
| `label2id_2.json`   | Chapter-level (2-digit) label map    |
| `label2id_4.json`   | Heading-level (4-digit) label map    |
| `label2id_6.json`   | Subheading-level (6-digit) label map |
| `tokenizer/`        | Tokenizer files                      |
| `base_model/`       | Fine-tuned base transformer weights  |

---

## Limitations

- The model was trained on English-language product descriptions. Other languages are not supported.
- Coverage is limited to HS codes present in the training data. Very rare or newly introduced subheadings may not be recognized.
- Confidence scores should be treated as relative rankings rather than calibrated probabilities.
- The model predicts based on text alone. Physical measurements, materials composition, or country-specific tariff rulings are not taken into account.

---

## License

This model is released under the MIT License.

---

## Contact

Developed by **ENTUM-AI**. For questions or collaboration, contact us via the Hugging Face profile page.