---
title: HS Code Classifier Micro
emoji: ⚡
colorFrom: pink
colorTo: blue
sdk: docker
app_port: 7860
---

# HSClassify_micro 🔍

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

**Machine learning model for multilingual HS/HTS classification** for trade finance and customs workflows, built with FastAPI + OCR.

Classifies product descriptions into [Harmonized System (HS) codes](https://en.wikipedia.org/wiki/Harmonized_System) using sentence embeddings and k-NN search, with an interactive latent space visualization.

## Live Demo

- Hugging Face Space: [https://huggingface.co/spaces/Troglobyte/MicroHS/](https://huggingface.co/spaces/Mead0w1ark/MicroHS)

## Features

- 🌍 **Multilingual**: example supports English, Thai, Vietnamese, and Chinese product descriptions
- ⚡ **Real-time classification**: top-3 HS code predictions with confidence scores
- 📊 **Latent space visualization**: interactive UMAP plot showing embedding clusters
- 🎯 **KNN-based**: simple, interpretable nearest-neighbor approach using fine-tuned `multilingual-e5-small`
- 🧾 **Official HS coverage**: training generation incorporates the [datasets/harmonized-system](https://github.com/datasets/harmonized-system) 6-digit nomenclature

## Dataset Attribution

This project includes HS nomenclature content sourced from:

- [datasets/harmonized-system](https://github.com/datasets/harmonized-system)
- Upstream references listed by that dataset:
  - WCO HS nomenclature documentation
  - UN Comtrade data extraction API

Related datasets (evaluated during development):

- [Customs-Declaration-Datasets](https://github.com/Seondong/Customs-Declaration-Datasets): 54,000 synthetic customs declaration records derived from 24.7M real Korean customs entries.
  Provides structured trade metadata (HS codes, country of origin, price, weight, fraud labels) but does not include free-text product descriptions. Cited as a reference for customs data research. See: *S. Kim et al., "DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection," KDD 2020.*

Licensing:

- Upstream HS source data: **ODC Public Domain Dedication and License (PDDL) v1.0**
- Project-added synthetic multilingual examples and labels: **MIT** (this repo)

## Quick Start

```bash
# Clone
git clone https://github.com/JamesEBall/HSClassify_micro.git
cd HSClassify_micro

# Install dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Generate training data & train model
python scripts/generate_training_data.py
python scripts/train_model.py

# Run the web app
uvicorn app:app --reload --port 8000
```

Open [http://localhost:8000](http://localhost:8000) to classify products.

## Deployment

- The Space runs in Docker (`sdk: docker`, `app_port: 7860`).
- OCR endpoints require OS packages; `Dockerfile` installs:
  - `tesseract-ocr`
  - `poppler-utils` (for PDF conversion via `pdf2image`)
- Model and data loading is resilient in hosted environments:
  - Large artifacts (model weights, embeddings, classifier, training data) are hosted on [HF Hub](https://huggingface.co/Mead0w1ark/multilingual-e5-small-hs-codes) and downloaded automatically at startup if not present locally
  - Set `SENTENCE_MODEL_NAME` to override the HF model repo (default: `Mead0w1ark/multilingual-e5-small-hs-codes`)

### Auto Sync (GitHub -> Hugging Face Space)

This repo includes a GitHub Action at `.github/workflows/sync_to_hf_space.yml` that syncs `main` to:

- `spaces/Troglobyte/MicroHS`

Required GitHub secret:

- `HF_TOKEN`: Hugging Face token with write access to the Space

## Publish Dataset to Hugging Face Datasets

Use the included publish helper:

```bash
bash scripts/publish_dataset_to_hf.sh /

# Example:
bash scripts/publish_dataset_to_hf.sh Troglobyte/hsclassify-micro-dataset
```

The script creates/updates a Dataset repo and uploads:

- `training_data_indexed.csv`
- `harmonized-system.csv` (attributed source snapshot)
- `hs_codes_reference.json`
- Dataset card + attribution notes

## Model

The classifier uses [`multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small) fine-tuned with contrastive learning (MultipleNegativesRankingLoss) on 9,829 curated HS-coded training pairs. Fine-tuned weights are hosted on HF Hub at [`Mead0w1ark/multilingual-e5-small-hs-codes`](https://huggingface.co/Mead0w1ark/multilingual-e5-small-hs-codes).

| Metric | Before Fine-Tuning | After Fine-Tuning |
|---|---|---|
| Training accuracy (80/20 split) | 77.2% | **87.0%** |
| Benchmark Top-1 (in-label-space) | 88.6% | **92.9%** |
| Benchmark Top-3 (in-label-space) | - | **97.1%** |

To fine-tune from scratch:

```bash
python scripts/train_model.py --finetune
```

## How It Works

1. **Embedding**: Product descriptions are encoded using fine-tuned `multilingual-e5-small` (384-dim sentence embeddings)
2. **Classification**: K-nearest neighbors (k=5) over pre-computed embeddings of HS-coded training examples
3. **Visualization**: UMAP reduction to 2D for interactive cluster exploration via Plotly

## Project Structure

```
├── app.py                          # FastAPI web application
├── dataset/
│   ├── README.md                   # HF dataset card (attribution + schema)
│   └── ATTRIBUTION.md              # Source and license attribution details
├── requirements.txt                # Python dependencies
├── scripts/
│   ├── generate_training_data.py   # Synthetic training data generator
│   ├── train_model.py              # Model training (embeddings + KNN)
│   └── publish_dataset_to_hf.sh    # Publish dataset artifacts to HF Datasets
├── data/
│   ├── hs_codes_reference.json     # HS code definitions
│   ├── harmonized-system/harmonized-system.csv  # Upstream HS source snapshot
│   ├── training_data.csv           # Generated training examples
│   └── training_data_indexed.csv   # App/latent-ready training examples
├── models/                         # Trained artifacts (generated)
│   ├── sentence_model/             # Cached sentence transformer
│   ├── embeddings.npy              # Pre-computed embeddings
│   ├── knn_classifier.pkl          # Trained KNN model
│   └── label_encoder.pkl           # Label encoder
└── templates/
    └── index.html                  # Web UI
```

## Context

Built as a rapid POC exploring whether multilingual sentence embeddings can simplify HS code classification for customs authorities.

## License

MIT. See [LICENSE](LICENSE).
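
## Classification Sketch

As a rough illustration of the classification step described under How It Works, the pure-Python sketch below ranks HS codes by a similarity-weighted vote among the k nearest training embeddings. Everything in it is an assumption for illustration: `cosine` and `top_k_codes` are hypothetical helper names, the 3-dimensional vectors are made-up stand-ins for the 384-dimensional model embeddings, and the confidence is simply the normalized vote weight. The repository's actual `knn_classifier.pkl` may score neighbors differently.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_codes(query, embeddings, labels, k=5, top_n=3):
    """Rank labels by similarity-weighted votes among the k nearest neighbors."""
    # Score every training example against the query and keep the k closest.
    nearest = sorted(
        ((cosine(query, emb), lab) for emb, lab in zip(embeddings, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Each neighbor votes for its label, weighted by its similarity.
    votes = defaultdict(float)
    for sim, lab in nearest:
        votes[lab] += max(sim, 0.0)  # negative similarities contribute nothing
    total = sum(votes.values()) or 1.0
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    # Normalize vote weights so the returned confidences sum to at most 1.
    return [(lab, weight / total) for lab, weight in ranked[:top_n]]


# Toy data (made up for this sketch): three vectors near each of two labels.
TRAIN_EMBEDDINGS = [
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.8, 0.0, 0.1],
    [0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.2, 0.8, 0.1],
]
TRAIN_LABELS = ["090111", "090111", "090111", "100630", "100630", "100630"]

predictions = top_k_codes([0.95, 0.1, 0.05], TRAIN_EMBEDDINGS, TRAIN_LABELS)
for code, confidence in predictions:
    print(code, round(confidence, 3))
```

In the app, the query vector would come from encoding a product description with the fine-tuned model rather than from a hand-written list.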