---
title: HS Code Classifier Micro
emoji: ⚡
colorFrom: pink
colorTo: blue
sdk: docker
app_port: 7860
---
# HSClassify_micro 🔍

Machine-learning model for multilingual HS/HTS classification in trade finance and customs workflows, built with FastAPI + OCR. It classifies product descriptions into Harmonized System (HS) codes using sentence embeddings and k-NN search, with an interactive latent-space visualization.
## Live Demo
- Hugging Face Space: https://huggingface.co/spaces/Troglobyte/MicroHS/
## Features

- 🌍 Multilingual — ships with example English, Thai, Vietnamese, and Chinese product descriptions
- ⚡ Real-time classification — top-3 HS code predictions with confidence scores
- 📊 Latent space visualization — interactive UMAP plot showing embedding clusters
- 🎯 KNN-based — simple, interpretable nearest-neighbor approach using fine-tuned `multilingual-e5-small`
- 🧾 Official HS coverage — training-data generation incorporates the `datasets/harmonized-system` 6-digit nomenclature
## Dataset Attribution
This project includes HS nomenclature content sourced from:
- `datasets/harmonized-system`
- Upstream references listed by that dataset:
- WCO HS nomenclature documentation
- UN Comtrade data extraction API
Related datasets (evaluated during development):
- Customs-Declaration-Datasets — 54,000 synthetic customs declaration records derived from 24.7M real Korean customs entries. Provides structured trade metadata (HS codes, country of origin, price, weight, fraud labels) but does not include free-text product descriptions. Cited as a reference for customs data research. See: S. Kim et al., "DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection," KDD 2020.
Licensing:
- Upstream HS source data: ODC Public Domain Dedication and License (PDDL) v1.0
- Project-added synthetic multilingual examples and labels: MIT (this repo)
## Quick Start

```bash
# Clone
git clone https://github.com/JamesEBall/HSClassify_micro.git
cd HSClassify_micro

# Install dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Generate training data & train model
python scripts/generate_training_data.py
python scripts/train_model.py

# Run the web app
uvicorn app:app --reload --port 8000
```
Open http://localhost:8000 to classify products.
## Deployment

- The Space runs in Docker (`sdk: docker`, `app_port: 7860`).
- OCR endpoints require OS packages; the Dockerfile installs `tesseract-ocr` and `poppler-utils` (for PDF conversion via `pdf2image`).
- Model and data loading is resilient in hosted environments:
  - Large artifacts (model weights, embeddings, classifier, training data) are hosted on HF Hub and downloaded automatically at startup if not present locally
  - Set `SENTENCE_MODEL_NAME` to override the HF model repo (default: `Mead0w1ark/multilingual-e5-small-hs-codes`)
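The override follows the usual environment-variable fallback pattern; a minimal sketch (the variable name and default repo come from above, but the exact lookup in `app.py` may differ):

```python
import os

# Default HF model repo, used when SENTENCE_MODEL_NAME is not set.
DEFAULT_MODEL = "Mead0w1ark/multilingual-e5-small-hs-codes"

def resolve_model_name() -> str:
    """Return the HF model repo to load, honoring the SENTENCE_MODEL_NAME override."""
    return os.environ.get("SENTENCE_MODEL_NAME", DEFAULT_MODEL)
```

Setting `SENTENCE_MODEL_NAME=my-org/my-model` before launching the container swaps in a different sentence-transformer without a code change.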
## Auto Sync (GitHub -> Hugging Face Space)

This repo includes a GitHub Action at `.github/workflows/sync_to_hf_space.yml` that syncs `main` to `spaces/Troglobyte/MicroHS`.

Required GitHub secret:

- `HF_TOKEN`: Hugging Face token with write access to the Space
## Publish Dataset to Hugging Face Datasets

Use the included publish helper:

```bash
bash scripts/publish_dataset_to_hf.sh <namespace>/<dataset-repo>

# Example:
bash scripts/publish_dataset_to_hf.sh Troglobyte/hsclassify-micro-dataset
```
The script creates/updates a Dataset repo and uploads:

- `training_data_indexed.csv`
- `harmonized-system.csv` (attributed source snapshot)
- `hs_codes_reference.json`
- Dataset card + attribution notes
## Model

The classifier uses `multilingual-e5-small` fine-tuned with contrastive learning (`MultipleNegativesRankingLoss`) on 9,829 curated HS-coded training pairs. Fine-tuned weights are hosted on HF Hub at `Mead0w1ark/multilingual-e5-small-hs-codes`.
| Metric | Before Fine-Tuning | After Fine-Tuning |
|---|---|---|
| Training accuracy (80/20 split) | 77.2% | 87.0% |
| Benchmark Top-1 (in-label-space) | 88.6% | 92.9% |
| Benchmark Top-3 (in-label-space) | — | 97.1% |
To fine-tune from scratch:

```bash
python scripts/train_model.py --finetune
```
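`MultipleNegativesRankingLoss` scores each (description, positive) pair against every other positive in the batch as an implicit negative: scaled cosine similarities form a logit matrix whose diagonal holds the true pairs. A minimal NumPy sketch of that in-batch objective (illustration only, not the project's training code, which uses `sentence-transformers`):

```python
import numpy as np

def mnr_loss(query_emb: np.ndarray, pos_emb: np.ndarray, scale: float = 20.0) -> float:
    """In-batch contrastive loss: row i's positive is column i; all other
    columns in the batch act as negatives. Embeddings are L2-normalized."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    logits = scale * q @ p.T  # scaled cosine-similarity matrix
    # Cross-entropy with the diagonal as the target class for each row
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: 3 pairs of 4-dim embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
loss_random = mnr_loss(q, rng.normal(size=(3, 4)))   # unrelated pairs: high loss
loss_aligned = mnr_loss(q, q)                        # identical pairs: near-zero loss
```

Training pulls each description toward its own HS-coded positive and pushes it away from the rest of the batch, which is what sharpens the embedding clusters the k-NN step relies on.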
## How It Works

- Embedding: Product descriptions are encoded using fine-tuned `multilingual-e5-small` (384-dim sentence embeddings)
- Classification: K-nearest neighbors (k=5) over pre-computed embeddings of HS-coded training examples
- Visualization: UMAP reduction to 2D for interactive cluster exploration via Plotly
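The classification step above can be sketched in plain NumPy: cosine similarity against the pre-computed embedding bank, a vote over the k=5 nearest neighbours, and the neighbour share as a confidence score (toy 2-dim vectors and made-up HS codes for illustration; the real app uses 384-dim e5 embeddings):

```python
import numpy as np
from collections import Counter

def knn_top3(query: np.ndarray, bank: np.ndarray, labels: list[str], k: int = 5):
    """Return up to 3 (hs_code, confidence) pairs, where confidence is the
    share of the k nearest neighbours (by cosine similarity) with that label."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q                          # cosine similarity to every example
    nearest = np.argsort(-sims)[:k]       # indices of the k most similar
    votes = Counter(labels[i] for i in nearest)
    return [(code, n / k) for code, n in votes.most_common(3)]

# Toy embedding bank: two clusters with HS-style labels
bank = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
                 [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]])
labels = ["090111", "090111", "090111", "640399", "640399", "640399"]
preds = knn_top3(np.array([0.95, 0.05]), bank, labels)
# preds → [("090111", 0.6), ("640399", 0.4)]
```

Because the prediction is just a neighbour vote, each result is directly explainable by pointing at the training examples that produced it.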
## Project Structure

```
├── app.py                          # FastAPI web application
├── dataset/
│   ├── README.md                   # HF dataset card (attribution + schema)
│   └── ATTRIBUTION.md              # Source and license attribution details
├── requirements.txt                # Python dependencies
├── scripts/
│   ├── generate_training_data.py   # Synthetic training data generator
│   ├── train_model.py              # Model training (embeddings + KNN)
│   └── publish_dataset_to_hf.sh    # Publish dataset artifacts to HF Datasets
├── data/
│   ├── hs_codes_reference.json     # HS code definitions
│   ├── harmonized-system/harmonized-system.csv  # Upstream HS source snapshot
│   ├── training_data.csv           # Generated training examples
│   └── training_data_indexed.csv   # App/latent-ready training examples
├── models/                         # Trained artifacts (generated)
│   ├── sentence_model/             # Cached sentence transformer
│   ├── embeddings.npy              # Pre-computed embeddings
│   ├── knn_classifier.pkl          # Trained KNN model
│   └── label_encoder.pkl           # Label encoder
└── templates/
    └── index.html                  # Web UI
```
## Context
Built as a rapid POC exploring whether multilingual sentence embeddings can simplify HS code classification for customs authorities.
## License
MIT — see LICENSE