---
title: HS Code Classifier Micro
emoji:
colorFrom: pink
colorTo: blue
sdk: docker
app_port: 7860
---
# HSClassify_micro 🔍
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
**Machine learning model for multilingual HS/HTS classification** for trade finance and customs workflows, built with FastAPI + OCR.
Classifies product descriptions into [Harmonized System (HS) codes](https://en.wikipedia.org/wiki/Harmonized_System) using sentence embeddings and k-NN search, with an interactive latent space visualization.
## Live Demo
- Hugging Face Space: [https://huggingface.co/spaces/Mead0w1ark/MicroHS](https://huggingface.co/spaces/Mead0w1ark/MicroHS)
## Features
- 🌍 **Multilingual** — supports English, Thai, Vietnamese, and Chinese product descriptions
- ⚡ **Real-time classification** — top-3 HS code predictions with confidence scores
- 📊 **Latent space visualization** — interactive UMAP plot showing embedding clusters
- 🎯 **KNN-based** — simple, interpretable nearest-neighbor approach using fine-tuned `multilingual-e5-small`
- 🧾 **Official HS coverage** — training generation incorporates the [datasets/harmonized-system](https://github.com/datasets/harmonized-system) 6-digit nomenclature
## Dataset Attribution
This project includes HS nomenclature content sourced from:
- [datasets/harmonized-system](https://github.com/datasets/harmonized-system)
- Upstream references listed by that dataset:
- WCO HS nomenclature documentation
- UN Comtrade data extraction API
Related datasets (evaluated during development):
- [Customs-Declaration-Datasets](https://github.com/Seondong/Customs-Declaration-Datasets) — 54,000 synthetic customs declaration records derived from 24.7M real Korean customs entries. Provides structured trade metadata (HS codes, country of origin, price, weight, fraud labels) but does not include free-text product descriptions. Cited as a reference for customs data research. See: *S. Kim et al., "DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection," KDD 2020.*
Licensing:
- Upstream HS source data: **ODC Public Domain Dedication and License (PDDL) v1.0**
- Project-added synthetic multilingual examples and labels: **MIT** (this repo)
## Quick Start
```bash
# Clone
git clone https://github.com/JamesEBall/HSClassify_micro.git
cd HSClassify_micro
# Install dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Generate training data & train model
python scripts/generate_training_data.py
python scripts/train_model.py
# Run the web app
uvicorn app:app --reload --port 8000
```
Open [http://localhost:8000](http://localhost:8000) to classify products.
## Deployment
- The Space runs in Docker (`sdk: docker`, `app_port: 7860`).
- OCR endpoints require OS packages; `Dockerfile` installs:
- `tesseract-ocr`
- `poppler-utils` (for PDF conversion via `pdf2image`)
- Model and data loading is resilient in hosted environments:
- Large artifacts (model weights, embeddings, classifier, training data) are hosted on [HF Hub](https://huggingface.co/Mead0w1ark/multilingual-e5-small-hs-codes) and downloaded automatically at startup if not present locally
- Set `SENTENCE_MODEL_NAME` to override the HF model repo (default: `Mead0w1ark/multilingual-e5-small-hs-codes`)
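For example, to point a local run at a different fine-tuned model repo (the repo name below is a placeholder, not a real repo):

```shell
# Override the default model repo before starting the app (placeholder repo name)
export SENTENCE_MODEL_NAME="your-namespace/your-e5-finetune"
uvicorn app:app --port 7860
```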
### Auto Sync (GitHub -> Hugging Face Space)
This repo includes a GitHub Action at `.github/workflows/sync_to_hf_space.yml` that syncs `main` to:
- `spaces/Troglobyte/MicroHS`
Required GitHub secret:
- `HF_TOKEN`: Hugging Face token with write access to the Space
## Publish Dataset to Hugging Face Datasets
Use the included publish helper:
```bash
bash scripts/publish_dataset_to_hf.sh <namespace>/<dataset-repo>
# Example:
bash scripts/publish_dataset_to_hf.sh Troglobyte/hsclassify-micro-dataset
```
The script creates/updates a Dataset repo and uploads:
- `training_data_indexed.csv`
- `harmonized-system.csv` (attributed source snapshot)
- `hs_codes_reference.json`
- Dataset card + attribution notes
## Model
The classifier uses [`multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small) fine-tuned with contrastive learning (MultipleNegativesRankingLoss) on 9,829 curated HS-coded training pairs. Fine-tuned weights are hosted on HF Hub at [`Mead0w1ark/multilingual-e5-small-hs-codes`](https://huggingface.co/Mead0w1ark/multilingual-e5-small-hs-codes).
| Metric | Before Fine-Tuning | After Fine-Tuning |
|---|---|---|
| Training accuracy (80/20 split) | 77.2% | **87.0%** |
| Benchmark Top-1 (in-label-space) | 88.6% | **92.9%** |
| Benchmark Top-3 (in-label-space) | — | **97.1%** |
To fine-tune from scratch:
```bash
python scripts/train_model.py --finetune
```
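MultipleNegativesRankingLoss treats every other pair in the batch as a negative: for each anchor it applies softmax cross-entropy over the scaled cosine similarities to all positives in the batch, so only the matching pair should score highest. A minimal NumPy illustration of that objective (toy vectors only, not the project's training code):

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch softmax cross-entropy over scaled cosine similarities,
    the objective behind MultipleNegativesRankingLoss."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)                   # (batch, batch); true pairs on the diagonal
    sims -= sims.max(axis=1, keepdims=True)    # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()        # penalize when a wrong pair outranks the true one

rng = np.random.default_rng(0)
aligned = rng.normal(size=(4, 8))
loss_aligned = mnr_loss(aligned, aligned)                  # identical pairs: near-zero loss
loss_random = mnr_loss(aligned, rng.normal(size=(4, 8)))   # mismatched pairs: higher loss
print(loss_aligned, loss_random)
```

Minimizing this loss pulls matching description/label embeddings together while pushing apart everything else in the batch, which is why the fine-tuned model's neighborhoods become cleaner for k-NN.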
## How It Works
1. **Embedding**: Product descriptions are encoded using fine-tuned `multilingual-e5-small` (384-dim sentence embeddings)
2. **Classification**: K-nearest neighbors (k=5) over pre-computed embeddings of HS-coded training examples
3. **Visualization**: UMAP reduction to 2D for interactive cluster exploration via Plotly
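Steps 1–2 can be sketched with toy vectors (pure NumPy, no model download); the embedding dimension, clusters, and HS codes below are illustrative stand-ins, not the project's actual data:

```python
import numpy as np

# Toy stand-ins for multilingual-e5-small sentence embeddings (real dim: 384).
# Two synthetic clusters: tea (HS 0902.10) and smartphones (HS 8517.12).
rng = np.random.default_rng(0)
base_tea = np.array([1.0, 0.0, 0.0, 0.0])
base_phone = np.array([0.0, 1.0, 0.0, 0.0])
train_emb = np.stack(
    [base_tea + 0.05 * rng.normal(size=4) for _ in range(3)]
    + [base_phone + 0.05 * rng.normal(size=4) for _ in range(3)]
)
train_labels = ["0902.10"] * 3 + ["8517.12"] * 3

def classify(query_emb, k=5):
    """Cosine-similarity k-NN: weight each label by its neighbors' similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    t = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    sims = t @ q
    top = np.argsort(sims)[::-1][:k]
    scores = {}
    for i in top:
        scores[train_labels[i]] = scores.get(train_labels[i], 0.0) + max(float(sims[i]), 0.0)
    total = sum(scores.values())
    return sorted(((code, s / total) for code, s in scores.items()),
                  key=lambda x: x[1], reverse=True)

# A query near the tea cluster should rank HS 0902.10 first.
pred = classify(base_tea + 0.05 * rng.normal(size=4))
print(pred[0])
```

The real app follows the same shape, with embeddings produced by the fine-tuned sentence model and confidence derived from neighbor similarity.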
## Project Structure
```
├── app.py # FastAPI web application
├── dataset/
│ ├── README.md # HF dataset card (attribution + schema)
│ └── ATTRIBUTION.md # Source and license attribution details
├── requirements.txt # Python dependencies
├── scripts/
│ ├── generate_training_data.py # Synthetic training data generator
│ ├── train_model.py # Model training (embeddings + KNN)
│ └── publish_dataset_to_hf.sh # Publish dataset artifacts to HF Datasets
├── data/
│ ├── hs_codes_reference.json # HS code definitions
│ ├── harmonized-system/harmonized-system.csv # Upstream HS source snapshot
│ ├── training_data.csv # Generated training examples
│ └── training_data_indexed.csv # App/latent-ready training examples
├── models/ # Trained artifacts (generated)
│ ├── sentence_model/ # Cached sentence transformer
│ ├── embeddings.npy # Pre-computed embeddings
│ ├── knn_classifier.pkl # Trained KNN model
│ └── label_encoder.pkl # Label encoder
└── templates/
└── index.html # Web UI
```
## Context
Built as a rapid POC exploring whether multilingual sentence embeddings can simplify HS code classification for customs authorities.
## License
MIT — see [LICENSE](LICENSE)