---
title: HS Code Classifier Micro
emoji: ⚡
colorFrom: pink
colorTo: blue
sdk: docker
app_port: 7860
---
# HSClassify_micro 🔍
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python](https://img.shields.io/badge/python-blue.svg)](https://www.python.org/downloads/)
**Machine-learning model for multilingual HS/HTS classification** for trade finance and customs workflows, served through a FastAPI web app with OCR support.
Classifies product descriptions into [Harmonized System (HS) codes](https://en.wikipedia.org/wiki/Harmonized_System) using sentence embeddings and k-NN search, with an interactive latent space visualization.
## Live Demo
- Hugging Face Space: [https://huggingface.co/spaces/Mead0w1ark/MicroHS](https://huggingface.co/spaces/Mead0w1ark/MicroHS)
## Features
- 🌍 **Multilingual** — supports English, Thai, Vietnamese, and Chinese product descriptions
- ⚡ **Real-time classification** — top-3 HS code predictions with confidence scores
- 📊 **Latent space visualization** — interactive UMAP plot showing embedding clusters
- 🎯 **KNN-based** — simple, interpretable nearest-neighbor approach using fine-tuned `multilingual-e5-small`
- 🧾 **Official HS coverage** — training generation incorporates the [datasets/harmonized-system](https://github.com/datasets/harmonized-system) 6-digit nomenclature
## Dataset Attribution
This project includes HS nomenclature content sourced from:
- [datasets/harmonized-system](https://github.com/datasets/harmonized-system)
- Upstream references listed by that dataset:
- WCO HS nomenclature documentation
- UN Comtrade data extraction API
Related datasets (evaluated during development):
- [Customs-Declaration-Datasets](https://github.com/Seondong/Customs-Declaration-Datasets) — 54,000 synthetic customs declaration records derived from 24.7M real Korean customs entries. Provides structured trade metadata (HS codes, country of origin, price, weight, fraud labels) but does not include free-text product descriptions. Cited as a reference for customs data research. See: *S. Kim et al., "DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection," KDD 2020.*
Licensing:
- Upstream HS source data: **ODC Public Domain Dedication and License (PDDL) v1.0**
- Project-added synthetic multilingual examples and labels: **MIT** (this repo)
## Quick Start
```bash
# Clone
git clone https://github.com/JamesEBall/HSClassify_micro.git
cd HSClassify_micro
# Install dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Generate training data & train model
python scripts/generate_training_data.py
python scripts/train_model.py
# Run the web app
uvicorn app:app --reload --port 8000
```
Open [http://localhost:8000](http://localhost:8000) to classify products.
## Deployment
- The Space runs in Docker (`sdk: docker`, `app_port: 7860`).
- OCR endpoints require OS packages; `Dockerfile` installs:
- `tesseract-ocr`
- `poppler-utils` (for PDF conversion via `pdf2image`)
- Model and data loading is resilient in hosted environments:
- Large artifacts (model weights, embeddings, classifier, training data) are hosted on [HF Hub](https://huggingface.co/Mead0w1ark/multilingual-e5-small-hs-codes) and downloaded automatically at startup if not present locally
- Set `SENTENCE_MODEL_NAME` to override the HF model repo (default: `Mead0w1ark/multilingual-e5-small-hs-codes`)
### Auto Sync (GitHub -> Hugging Face Space)
This repo includes a GitHub Action at `.github/workflows/sync_to_hf_space.yml` that syncs `main` to:
- `spaces/Troglobyte/MicroHS`
Required GitHub secret:
- `HF_TOKEN`: Hugging Face token with write access to the Space
## Publish Dataset to Hugging Face Datasets
Use the included publish helper:
```bash
bash scripts/publish_dataset_to_hf.sh <namespace>/<dataset-repo>
# Example:
bash scripts/publish_dataset_to_hf.sh Troglobyte/hsclassify-micro-dataset
```
The script creates/updates a Dataset repo and uploads:
- `training_data_indexed.csv`
- `harmonized-system.csv` (attributed source snapshot)
- `hs_codes_reference.json`
- Dataset card + attribution notes
## Model
The classifier uses [`multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small) fine-tuned with contrastive learning (MultipleNegativesRankingLoss) on 9,829 curated HS-coded training pairs. Fine-tuned weights are hosted on HF Hub at [`Mead0w1ark/multilingual-e5-small-hs-codes`](https://huggingface.co/Mead0w1ark/multilingual-e5-small-hs-codes).
| Metric | Before Fine-Tuning | After Fine-Tuning |
|---|---|---|
| Held-out accuracy (80/20 split) | 77.2% | **87.0%** |
| Benchmark Top-1 (in-label-space) | 88.6% | **92.9%** |
| Benchmark Top-3 (in-label-space) | — | **97.1%** |
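MultipleNegativesRankingLoss treats the other pairs in a batch as negatives: each description embedding should score highest against its own paired example. A toy NumPy illustration of that loss (not the actual training code, which uses `sentence-transformers`; the `scale` value is an assumption):

```python
import numpy as np


def mnr_loss(queries: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch contrastive loss: row i of `queries` should match row i of `positives`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = scale * (q @ p.T)  # batch x batch cosine-similarity matrix
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_softmax.diagonal().mean())  # cross-entropy with diagonal targets
```

The loss is near zero when matched pairs dominate their row of the similarity matrix, which is exactly the geometry the k-NN classifier relies on at inference time.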
To fine-tune from scratch:
```bash
python scripts/train_model.py --finetune
```
## How It Works
1. **Embedding**: Product descriptions are encoded using fine-tuned `multilingual-e5-small` (384-dim sentence embeddings)
2. **Classification**: K-nearest neighbors (k=5) over pre-computed embeddings of HS-coded training examples
3. **Visualization**: UMAP reduction to 2D for interactive cluster exploration via Plotly
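With embeddings in hand, the prediction step is a cosine-similarity vote among the nearest training examples. A self-contained sketch with toy 2-D vectors (the real app first encodes text with `multilingual-e5-small`; the `knn_predict` helper and vote-share confidence are illustrative assumptions):

```python
from collections import Counter

import numpy as np


def knn_predict(query: np.ndarray, train_emb: np.ndarray, labels: list[str],
                k: int = 5, top_n: int = 3) -> list[tuple[str, float]]:
    """Top-N HS codes by vote share among the k nearest neighbours (cosine similarity)."""
    q = query / np.linalg.norm(query)
    t = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    nearest = np.argsort(t @ q)[::-1][:k]  # indices of the k most similar rows
    votes = Counter(labels[i] for i in nearest)
    return [(code, n / k) for code, n in votes.most_common(top_n)]
```

Because the prediction is just a neighbour vote, every result can be traced back to concrete training examples, which is what makes the approach interpretable.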
## Project Structure
```
├── app.py # FastAPI web application
├── dataset/
│ ├── README.md # HF dataset card (attribution + schema)
│ └── ATTRIBUTION.md # Source and license attribution details
├── requirements.txt # Python dependencies
├── scripts/
│ ├── generate_training_data.py # Synthetic training data generator
│ ├── train_model.py # Model training (embeddings + KNN)
│ └── publish_dataset_to_hf.sh # Publish dataset artifacts to HF Datasets
├── data/
│ ├── hs_codes_reference.json # HS code definitions
│ ├── harmonized-system/harmonized-system.csv # Upstream HS source snapshot
│ ├── training_data.csv # Generated training examples
│ └── training_data_indexed.csv # App/latent-ready training examples
├── models/ # Trained artifacts (generated)
│ ├── sentence_model/ # Cached sentence transformer
│ ├── embeddings.npy # Pre-computed embeddings
│ ├── knn_classifier.pkl # Trained KNN model
│ └── label_encoder.pkl # Label encoder
└── templates/
└── index.html # Web UI
```
## Context
Built as a rapid POC exploring whether multilingual sentence embeddings can simplify HS code classification for customs authorities.
## License
MIT — see [LICENSE](LICENSE)