---
title: HS Code Classifier Micro
emoji:
colorFrom: pink
colorTo: blue
sdk: docker
app_port: 7860
---
# HSClassify_micro 🔍
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
**Machine learning model for multilingual HS/HTS classification** for trade finance and customs workflows, built with FastAPI + OCR.
Classifies product descriptions into [Harmonized System (HS) codes](https://en.wikipedia.org/wiki/Harmonized_System) using sentence embeddings and k-NN search, with an interactive latent space visualization.
## Live Demo
- Hugging Face Space: [https://huggingface.co/spaces/Mead0w1ark/MicroHS](https://huggingface.co/spaces/Mead0w1ark/MicroHS)
## Features
- 🌍 **Multilingual** — supports English, Thai, Vietnamese, and Chinese product descriptions
- ⚡ **Real-time classification** — top-3 HS code predictions with confidence scores
- 📊 **Latent space visualization** — interactive UMAP plot showing embedding clusters
- 🎯 **KNN-based** — simple, interpretable nearest-neighbor approach using fine-tuned `multilingual-e5-small`
- 🧾 **Official HS coverage** — training generation incorporates the [datasets/harmonized-system](https://github.com/datasets/harmonized-system) 6-digit nomenclature
## Dataset Attribution
This project includes HS nomenclature content sourced from:
- [datasets/harmonized-system](https://github.com/datasets/harmonized-system)
- Upstream references listed by that dataset:
- WCO HS nomenclature documentation
- UN Comtrade data extraction API
Related datasets (evaluated during development):
- [Customs-Declaration-Datasets](https://github.com/Seondong/Customs-Declaration-Datasets) — 54,000 synthetic customs declaration records derived from 24.7M real Korean customs entries. Provides structured trade metadata (HS codes, country of origin, price, weight, fraud labels) but does not include free-text product descriptions. Cited as a reference for customs data research. See: *S. Kim et al., "DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection," KDD 2020.*
Licensing:
- Upstream HS source data: **ODC Public Domain Dedication and License (PDDL) v1.0**
- Project-added synthetic multilingual examples and labels: **MIT** (this repo)
## Quick Start
```bash
# Clone
git clone https://github.com/JamesEBall/HSClassify_micro.git
cd HSClassify_micro
# Install dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Generate training data & train model
python scripts/generate_training_data.py
python scripts/train_model.py
# Run the web app
uvicorn app:app --reload --port 8000
```
Open [http://localhost:8000](http://localhost:8000) to classify products.
## Deployment
- The Space runs in Docker (`sdk: docker`, `app_port: 7860`).
- OCR endpoints require OS packages; `Dockerfile` installs:
- `tesseract-ocr`
- `poppler-utils` (for PDF conversion via `pdf2image`)
- Model and data loading is resilient in hosted environments:
- Large artifacts (model weights, embeddings, classifier, training data) are hosted on [HF Hub](https://huggingface.co/Mead0w1ark/multilingual-e5-small-hs-codes) and downloaded automatically at startup if not present locally
- Set `SENTENCE_MODEL_NAME` to override the HF model repo (default: `Mead0w1ark/multilingual-e5-small-hs-codes`)
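For example, to point a local run at a different fine-tuned model repo (the repo name below is a placeholder, not a real repo):

```shell
# Override the default model repo before starting the app (placeholder repo name)
export SENTENCE_MODEL_NAME="your-namespace/your-e5-finetune"
uvicorn app:app --port 7860
```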
### Auto Sync (GitHub -> Hugging Face Space)
This repo includes a GitHub Action at `.github/workflows/sync_to_hf_space.yml` that syncs `main` to:
- `spaces/Troglobyte/MicroHS`
Required GitHub secret:
- `HF_TOKEN`: Hugging Face token with write access to the Space
## Publish Dataset to Hugging Face Datasets
Use the included publish helper:
```bash
bash scripts/publish_dataset_to_hf.sh <namespace>/<dataset-repo>
# Example:
bash scripts/publish_dataset_to_hf.sh Troglobyte/hsclassify-micro-dataset
```
The script creates/updates a Dataset repo and uploads:
- `training_data_indexed.csv`
- `harmonized-system.csv` (attributed source snapshot)
- `hs_codes_reference.json`
- Dataset card + attribution notes
## Model
The classifier uses [`multilingual-e5-small`](https://huggingface.co/intfloat/multilingual-e5-small) fine-tuned with contrastive learning (MultipleNegativesRankingLoss) on 9,829 curated HS-coded training pairs. Fine-tuned weights are hosted on HF Hub at [`Mead0w1ark/multilingual-e5-small-hs-codes`](https://huggingface.co/Mead0w1ark/multilingual-e5-small-hs-codes).
| Metric | Before Fine-Tuning | After Fine-Tuning |
|---|---|---|
| Training accuracy (80/20 split) | 77.2% | **87.0%** |
| Benchmark Top-1 (in-label-space) | 88.6% | **92.9%** |
| Benchmark Top-3 (in-label-space) | — | **97.1%** |
To fine-tune from scratch:
```bash
python scripts/train_model.py --finetune
```
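MultipleNegativesRankingLoss treats every other pair in the batch as a negative: for each anchor it applies softmax cross-entropy over the scaled cosine similarities to all positives in the batch, so only the matching pair should score highest. A minimal NumPy illustration of that objective (toy vectors only, not the project's training code):

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch softmax cross-entropy over scaled cosine similarities,
    the objective behind MultipleNegativesRankingLoss."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)                   # (batch, batch); true pairs on the diagonal
    sims -= sims.max(axis=1, keepdims=True)    # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()        # penalize when a wrong pair outranks the true one

rng = np.random.default_rng(0)
aligned = rng.normal(size=(4, 8))
loss_aligned = mnr_loss(aligned, aligned)                  # identical pairs: near-zero loss
loss_random = mnr_loss(aligned, rng.normal(size=(4, 8)))   # mismatched pairs: higher loss
print(loss_aligned, loss_random)
```

Minimizing this loss pulls matching description/label embeddings together while pushing apart everything else in the batch, which is why the fine-tuned model's neighborhoods become cleaner for k-NN.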
## How It Works
1. **Embedding**: Product descriptions are encoded using fine-tuned `multilingual-e5-small` (384-dim sentence embeddings)
2. **Classification**: K-nearest neighbors (k=5) over pre-computed embeddings of HS-coded training examples
3. **Visualization**: UMAP reduction to 2D for interactive cluster exploration via Plotly
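Steps 1–2 can be sketched with toy vectors (pure NumPy, no model download); the embedding dimension, clusters, and HS codes below are illustrative stand-ins, not the project's actual data:

```python
import numpy as np

# Toy stand-ins for multilingual-e5-small sentence embeddings (real dim: 384).
# Two synthetic clusters: tea (HS 0902.10) and smartphones (HS 8517.12).
rng = np.random.default_rng(0)
base_tea = np.array([1.0, 0.0, 0.0, 0.0])
base_phone = np.array([0.0, 1.0, 0.0, 0.0])
train_emb = np.stack(
    [base_tea + 0.05 * rng.normal(size=4) for _ in range(3)]
    + [base_phone + 0.05 * rng.normal(size=4) for _ in range(3)]
)
train_labels = ["0902.10"] * 3 + ["8517.12"] * 3

def classify(query_emb, k=5):
    """Cosine-similarity k-NN: weight each label by its neighbors' similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    t = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    sims = t @ q
    top = np.argsort(sims)[::-1][:k]
    scores = {}
    for i in top:
        scores[train_labels[i]] = scores.get(train_labels[i], 0.0) + max(float(sims[i]), 0.0)
    total = sum(scores.values())
    return sorted(((code, s / total) for code, s in scores.items()),
                  key=lambda x: x[1], reverse=True)

# A query near the tea cluster should rank HS 0902.10 first.
pred = classify(base_tea + 0.05 * rng.normal(size=4))
print(pred[0])
```

The real app follows the same shape, with embeddings produced by the fine-tuned sentence model and confidence derived from neighbor similarity.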
## Project Structure
```
├── app.py # FastAPI web application
├── dataset/
│ ├── README.md # HF dataset card (attribution + schema)
│ └── ATTRIBUTION.md # Source and license attribution details
├── requirements.txt # Python dependencies
├── scripts/
│ ├── generate_training_data.py # Synthetic training data generator
│ ├── train_model.py # Model training (embeddings + KNN)
│ └── publish_dataset_to_hf.sh # Publish dataset artifacts to HF Datasets
├── data/
│ ├── hs_codes_reference.json # HS code definitions
│ ├── harmonized-system/harmonized-system.csv # Upstream HS source snapshot
│ ├── training_data.csv # Generated training examples
│ └── training_data_indexed.csv # App/latent-ready training examples
├── models/ # Trained artifacts (generated)
│ ├── sentence_model/ # Cached sentence transformer
│ ├── embeddings.npy # Pre-computed embeddings
│ ├── knn_classifier.pkl # Trained KNN model
│ └── label_encoder.pkl # Label encoder
└── templates/
└── index.html # Web UI
```
## Context
Built as a rapid POC exploring whether multilingual sentence embeddings can simplify HS code classification for customs authorities.
## License
MIT — see [LICENSE](LICENSE)