---
title: HS Code Classifier Micro
emoji: 
colorFrom: pink
colorTo: blue
sdk: docker
app_port: 7860
---

# HSClassify_micro 🔍

License: MIT · Python 3.10+

A machine-learning model for multilingual HS/HTS classification in trade finance and customs workflows, built with FastAPI and OCR.

Classifies product descriptions into Harmonized System (HS) codes using sentence embeddings and k-NN search, with an interactive latent space visualization.

Live Demo

## Features

- 🌍 Multilingual — supports English, Thai, Vietnamese, and Chinese product descriptions
- Real-time classification — top-3 HS code predictions with confidence scores
- 📊 Latent space visualization — interactive UMAP plot showing embedding clusters
- 🎯 KNN-based — a simple, interpretable nearest-neighbor approach using fine-tuned `multilingual-e5-small`
- 🧾 Official HS coverage — training-data generation incorporates the datasets/harmonized-system 6-digit nomenclature

## Dataset Attribution

This project includes HS nomenclature content sourced from:

- datasets/harmonized-system
- Upstream references listed by that dataset:
  - WCO HS nomenclature documentation
  - UN Comtrade data extraction API

Related datasets (evaluated during development):

- Customs-Declaration-Datasets — 54,000 synthetic customs declaration records derived from 24.7M real Korean customs entries. Provides structured trade metadata (HS codes, country of origin, price, weight, fraud labels) but does not include free-text product descriptions. Cited as a reference for customs data research. See: S. Kim et al., "DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection," KDD 2020.

Licensing:

- Upstream HS source data: ODC Public Domain Dedication and License (PDDL) v1.0
- Project-added synthetic multilingual examples and labels: MIT (this repo)

## Quick Start

```bash
# Clone
git clone https://github.com/JamesEBall/HSClassify_micro.git
cd HSClassify_micro

# Install dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Generate training data & train model
python scripts/generate_training_data.py
python scripts/train_model.py

# Run the web app
uvicorn app:app --reload --port 8000
```

Open http://localhost:8000 to classify products.

## Deployment

- The Space runs in Docker (`sdk: docker`, `app_port: 7860`).
- OCR endpoints require OS packages; the Dockerfile installs:
  - `tesseract-ocr`
  - `poppler-utils` (for PDF conversion via `pdf2image`)
- Model and data loading is resilient in hosted environments:
  - Large artifacts (model weights, embeddings, classifier, training data) are hosted on HF Hub and downloaded automatically at startup if not present locally
  - Set `SENTENCE_MODEL_NAME` to override the HF model repo (default: `Mead0w1ark/multilingual-e5-small-hs-codes`)
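The startup behavior above can be sketched as follows. This is a minimal illustration of the env-var override and download-if-missing logic, not the actual code in app.py; `resolve_model_repo` and `ensure_artifact` are hypothetical helper names:

```python
import os
from pathlib import Path

# Default HF Hub repo for the fine-tuned sentence model (from this README)
DEFAULT_REPO = "Mead0w1ark/multilingual-e5-small-hs-codes"

def resolve_model_repo() -> str:
    # SENTENCE_MODEL_NAME, when set, overrides the default model repo
    return os.environ.get("SENTENCE_MODEL_NAME", DEFAULT_REPO)

def ensure_artifact(local_path: Path, download) -> Path:
    # Download only when the artifact is missing locally, so restarts
    # in a hosted environment reuse the cached copy
    if not local_path.exists():
        download(local_path)
    return local_path
```

In practice the `download` callable would wrap a `huggingface_hub` download call; the pattern keeps cold starts cheap when artifacts are already on disk.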

## Auto Sync (GitHub -> Hugging Face Space)

This repo includes a GitHub Action at `.github/workflows/sync_to_hf_space.yml` that syncs `main` to:

- spaces/Troglobyte/MicroHS

Required GitHub secret:

- `HF_TOKEN`: Hugging Face token with write access to the Space

## Publish Dataset to Hugging Face Datasets

Use the included publish helper:

```bash
bash scripts/publish_dataset_to_hf.sh <namespace>/<dataset-repo>
# Example:
bash scripts/publish_dataset_to_hf.sh Troglobyte/hsclassify-micro-dataset
```

The script creates/updates a Dataset repo and uploads:

- `training_data_indexed.csv`
- `harmonized-system.csv` (attributed source snapshot)
- `hs_codes_reference.json`
- Dataset card + attribution notes

## Model

The classifier uses `multilingual-e5-small` fine-tuned with contrastive learning (`MultipleNegativesRankingLoss`) on 9,829 curated HS-coded training pairs. Fine-tuned weights are hosted on HF Hub at `Mead0w1ark/multilingual-e5-small-hs-codes`.

| Metric | Before Fine-Tuning | After Fine-Tuning |
| --- | --- | --- |
| Training accuracy (80/20 split) | 77.2% | 87.0% |
| Benchmark Top-1 (in-label-space) | 88.6% | 92.9% |
| Benchmark Top-3 (in-label-space) | | 97.1% |

To fine-tune from scratch:

```bash
python scripts/train_model.py --finetune
```
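The actual fine-tuning uses sentence-transformers' `MultipleNegativesRankingLoss`; the objective itself can be illustrated in plain NumPy. In this sketch (an illustration of the math, not the training code), each (anchor, positive) pair in a batch is a match, and every other positive in the batch serves as an in-batch negative:

```python
import numpy as np

def mnr_loss(anchor_embs: np.ndarray, positive_embs: np.ndarray,
             scale: float = 20.0) -> float:
    """Multiple-negatives ranking loss over a batch of embedding pairs."""
    # L2-normalize so the dot product is cosine similarity
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    p = positive_embs / np.linalg.norm(positive_embs, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (batch, batch) scaled similarity matrix
    # Softmax cross-entropy with the diagonal (the true pair) as the target
    m = scores.max(axis=1, keepdims=True)
    log_softmax = scores - m - np.log(np.exp(scores - m).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))
```

Minimizing this pulls matching description/HS-label pairs together while pushing each anchor away from the other labels in the batch; perfectly matched orthogonal pairs give a loss near zero.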

## How It Works

  1. Embedding: product descriptions are encoded using fine-tuned `multilingual-e5-small` (384-dim sentence embeddings)
  2. Classification: k-nearest neighbors (k=5) over pre-computed embeddings of HS-coded training examples
  3. Visualization: UMAP reduction to 2D for interactive cluster exploration via Plotly
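Steps 1–2 can be sketched as a plain-NumPy k-NN vote. This is a simplified illustration (the app loads a pickled scikit-learn classifier; `knn_top3` and its confidence-as-vote-fraction convention are assumptions for the sketch):

```python
import numpy as np

def knn_top3(query_emb, train_embs, hs_labels, k=5):
    """Top-3 HS codes from a k-NN vote over cosine similarity.

    query_emb:  (d,) embedding of one product description
    train_embs: (n, d) pre-computed training embeddings
    hs_labels:  n HS codes aligned with the rows of train_embs
    """
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    nearest = np.argsort(-(t @ q))[:k]  # indices of the k closest examples
    votes = {}
    for i in nearest:
        votes[hs_labels[i]] = votes.get(hs_labels[i], 0) + 1
    ranked = sorted(votes.items(), key=lambda kv: -kv[1])
    # Confidence here is simply each code's share of the k votes
    return [(code, count / k) for code, count in ranked[:3]]
```

Because the decision is just a vote among labeled neighbors, each prediction can be explained by pointing at the concrete training examples that produced it.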

## Project Structure

```text
├── app.py                  # FastAPI web application
├── dataset/
│   ├── README.md           # HF dataset card (attribution + schema)
│   └── ATTRIBUTION.md      # Source and license attribution details
├── requirements.txt        # Python dependencies
├── scripts/
│   ├── generate_training_data.py   # Synthetic training data generator
│   ├── train_model.py              # Model training (embeddings + KNN)
│   └── publish_dataset_to_hf.sh    # Publish dataset artifacts to HF Datasets
├── data/
│   ├── hs_codes_reference.json     # HS code definitions
│   ├── harmonized-system/harmonized-system.csv  # Upstream HS source snapshot
│   ├── training_data.csv           # Generated training examples
│   └── training_data_indexed.csv   # App/latent-ready training examples
├── models/                 # Trained artifacts (generated)
│   ├── sentence_model/     # Cached sentence transformer
│   ├── embeddings.npy      # Pre-computed embeddings
│   ├── knn_classifier.pkl  # Trained KNN model
│   └── label_encoder.pkl   # Label encoder
└── templates/
    └── index.html          # Web UI
```

## Context

Built as a rapid POC exploring whether multilingual sentence embeddings can simplify HS code classification for customs authorities.

## License

MIT — see LICENSE