---
title: HS Code Classifier Micro
emoji: ⚡
colorFrom: pink
colorTo: blue
sdk: docker
app_port: 7860
---
# HSClassify_micro 🔍

[MIT License](https://opensource.org/licenses/MIT) · [Python](https://www.python.org/downloads/)

**Machine learning model for multilingual HS/HTS classification** for trade finance and customs workflows, built with FastAPI + OCR.

Classifies product descriptions into [Harmonized System (HS) codes](https://en.wikipedia.org/wiki/Harmonized_System) using sentence embeddings and k-NN search, with an interactive latent space visualization.
## Live Demo

- Hugging Face Space: [Mead0w1ark/MicroHS](https://huggingface.co/spaces/Mead0w1ark/MicroHS)
## Features

- 🌍 **Multilingual** — supports English, Thai, Vietnamese, and Chinese product descriptions
- ⚡ **Real-time classification** — top-3 HS code predictions with confidence scores
- 📊 **Latent space visualization** — interactive UMAP plot showing embedding clusters
- 🎯 **KNN-based** — simple, interpretable nearest-neighbor approach using fine-tuned `multilingual-e5-small`
- 🧾 **Official HS coverage** — training data generation incorporates the [datasets/harmonized-system](https://github.com/datasets/harmonized-system) 6-digit nomenclature
## Dataset Attribution

This project includes HS nomenclature content sourced from:

- [datasets/harmonized-system](https://github.com/datasets/harmonized-system)
- Upstream references listed by that dataset:
  - WCO HS nomenclature documentation
  - UN Comtrade data extraction API

Related datasets (evaluated during development):

- [Customs-Declaration-Datasets](https://github.com/Seondong/Customs-Declaration-Datasets) — 54,000 synthetic customs declaration records derived from 24.7M real Korean customs entries. Provides structured trade metadata (HS codes, country of origin, price, weight, fraud labels) but does not include free-text product descriptions. Cited as a reference for customs data research. See: *S. Kim et al., "DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection," KDD 2020.*

Licensing:

- Upstream HS source data: **ODC Public Domain Dedication and License (PDDL) v1.0**
- Project-added synthetic multilingual examples and labels: **MIT** (this repo)
## Quick Start

```bash
# Clone
git clone https://github.com/JamesEBall/HSClassify_micro.git
cd HSClassify_micro

# Install dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Generate training data & train model
python scripts/generate_training_data.py
python scripts/train_model.py

# Run the web app
uvicorn app:app --reload --port 8000
```

Open [http://localhost:8000](http://localhost:8000) to classify products.
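The running app can also be called programmatically. The sketch below is illustrative only: the endpoint path `/classify` and the `description` request field are assumptions about the API, not documented routes — check `app.py` for the actual ones.

```python
import json
import urllib.request


def build_request(description: str) -> dict:
    """Build the JSON payload for a classification request.
    The "description" field name is an assumption, not a documented schema."""
    return {"description": description}


if __name__ == "__main__":
    # Requires the app to be running locally (see Quick Start above).
    payload = json.dumps(build_request("stainless steel kitchen knives")).encode()
    req = urllib.request.Request(
        "http://localhost:8000/classify",  # hypothetical endpoint path
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))  # top-3 HS codes with confidence scores
```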
## Deployment

- The Space runs in Docker (`sdk: docker`, `app_port: 7860`).
- OCR endpoints require OS packages; the `Dockerfile` installs:
  - `tesseract-ocr`
  - `poppler-utils` (for PDF conversion via `pdf2image`)
- Model and data loading is resilient in hosted environments:
  - Large artifacts (model weights, embeddings, classifier, training data) are hosted on the [HF Hub](https://huggingface.co/Mead0w1ark/multilingual-e5-small-hs-codes) and downloaded automatically at startup if not present locally.
  - Set `SENTENCE_MODEL_NAME` to override the HF model repo (default: `Mead0w1ark/multilingual-e5-small-hs-codes`).
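A minimal `Dockerfile` along these lines would satisfy the OS-package requirements above (a sketch only — the base image, Python version, and final `CMD` are assumptions; the repo's actual `Dockerfile` may differ):

```dockerfile
FROM python:3.11-slim

# OCR and PDF dependencies required by the OCR endpoints
RUN apt-get update && apt-get install -y --no-install-recommends \
        tesseract-ocr poppler-utils \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```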
### Auto Sync (GitHub -> Hugging Face Space)

This repo includes a GitHub Action at `.github/workflows/sync_to_hf_space.yml` that syncs `main` to:

- `spaces/Troglobyte/MicroHS`

Required GitHub secret:

- `HF_TOKEN`: a Hugging Face token with write access to the Space
## Publish Dataset to Hugging Face Datasets

Use the included publish helper:

```bash
bash scripts/publish_dataset_to_hf.sh <namespace>/<dataset-repo>

# Example:
bash scripts/publish_dataset_to_hf.sh Troglobyte/hsclassify-micro-dataset
```

The script creates/updates a Dataset repo and uploads:

- `training_data_indexed.csv`
- `harmonized-system.csv` (attributed source snapshot)
- `hs_codes_reference.json`
- Dataset card + attribution notes
## Model

The classifier uses [`multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small) fine-tuned with contrastive learning (MultipleNegativesRankingLoss) on 9,829 curated HS-coded training pairs. Fine-tuned weights are hosted on the HF Hub at [`Mead0w1ark/multilingual-e5-small-hs-codes`](https://huggingface.co/Mead0w1ark/multilingual-e5-small-hs-codes).

| Metric | Before Fine-Tuning | After Fine-Tuning |
|---|---|---|
| Training accuracy (80/20 split) | 77.2% | **87.0%** |
| Benchmark Top-1 (in-label-space) | 88.6% | **92.9%** |
| Benchmark Top-3 (in-label-space) | — | **97.1%** |

To fine-tune from scratch:

```bash
python scripts/train_model.py --finetune
```
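For intuition, MultipleNegativesRankingLoss treats every other in-batch positive as a negative for a given anchor, then applies cross-entropy over the similarity matrix. The project uses the sentence-transformers implementation; the NumPy sketch below only illustrates the objective:

```python
import numpy as np


def multiple_negatives_ranking_loss(anchors: np.ndarray,
                                    positives: np.ndarray,
                                    scale: float = 20.0) -> float:
    """Illustrative NumPy version of MultipleNegativesRankingLoss:
    row i's correct "class" is column i of the in-batch similarity matrix."""
    # L2-normalise so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (batch, batch) scaled similarity matrix
    # Cross-entropy with the diagonal (the matching pair) as the target
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Training pulls each description toward its paired HS-coded example and pushes it away from the rest of the batch, which is what sharpens the embedding clusters the KNN step relies on.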
## How It Works

1. **Embedding**: product descriptions are encoded with the fine-tuned `multilingual-e5-small` (384-dim sentence embeddings).
2. **Classification**: k-nearest neighbors (k=5) over pre-computed embeddings of HS-coded training examples.
3. **Visualization**: UMAP reduction to 2D for interactive cluster exploration via Plotly.
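The classification step can be sketched with toy data. Assumptions in the sketch: the real artifacts are the 384-dim e5 embeddings and the pickled KNN in `models/`; here they are replaced by small random vectors and made-up 6-digit codes:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins for models/embeddings.npy and its HS-code labels;
# the real vectors are 384-dim e5 sentence embeddings.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(30, 8))
labels = np.array([f"0{i % 3 + 1}0110" for i in range(30)])  # fake HS codes

# k=5 cosine KNN, as in the app
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(embeddings, labels)

# "Embed" a new product description and rank codes by neighbor vote share
query = embeddings[0] + 0.01 * rng.normal(size=8)
probs = knn.predict_proba(query.reshape(1, -1))[0]
top3 = np.argsort(probs)[::-1][:3]
for idx in top3:
    print(knn.classes_[idx], round(float(probs[idx]), 2))
```

The confidence scores shown in the UI correspond to this neighbor vote share, which is what makes the approach easy to inspect: the supporting training examples are the explanation.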
## Project Structure

```
├── app.py                            # FastAPI web application
├── dataset/
│   ├── README.md                     # HF dataset card (attribution + schema)
│   └── ATTRIBUTION.md                # Source and license attribution details
├── requirements.txt                  # Python dependencies
├── scripts/
│   ├── generate_training_data.py     # Synthetic training data generator
│   ├── train_model.py                # Model training (embeddings + KNN)
│   └── publish_dataset_to_hf.sh      # Publish dataset artifacts to HF Datasets
├── data/
│   ├── hs_codes_reference.json       # HS code definitions
│   ├── harmonized-system/harmonized-system.csv  # Upstream HS source snapshot
│   ├── training_data.csv             # Generated training examples
│   └── training_data_indexed.csv     # App/latent-ready training examples
├── models/                           # Trained artifacts (generated)
│   ├── sentence_model/               # Cached sentence transformer
│   ├── embeddings.npy                # Pre-computed embeddings
│   ├── knn_classifier.pkl            # Trained KNN model
│   └── label_encoder.pkl             # Label encoder
└── templates/
    └── index.html                    # Web UI
```
## Context

Built as a rapid POC exploring whether multilingual sentence embeddings can simplify HS code classification for customs authorities.

## License

MIT — see [LICENSE](LICENSE)