---
license: gemma
pipeline_tag: text-generation
language:
- en
- km
tags:
- customs
- hs-code
- classification
- cambodia
- gemma
- unsloth
- qlora
base_model:
- unsloth/gemma-4-E4B-it
---

# Gemma‑4 HS Code Classifier (Cambodia Customs)

A **Gemma‑4‑E4B‑it** model fine‑tuned with QLoRA to classify product descriptions into **8‑digit HS codes** and return corresponding Cambodian trade rates (Customs Duty, Special Tax, VAT, Excise Tax).

Built with **[Unsloth](https://github.com/unslothai/unsloth)** for fast, memory‑efficient fine‑tuning on a single T4 GPU.

---

## 🎯 What it does

Given a plain‑English product description, the model generates:

```text
HS Code: 61091000
Unit: PIECE
Customs Duty: 25%
Special Tax: 0%
VAT: 10%
Excise Tax: 0%
```

**⚠️ Important**: The rates in this output are generated by the model and **may be wrong**.  
For production, always take the rates from the included **lookup table** (`hs_code_lookup.json`) – see [Production use](#-production-use) below.

---

## 🚀 Quick start (in Colab or locally)

This repository contains **only the LoRA adapter**, not the full model.  
Loading it will automatically download the base model (`unsloth/gemma-4-E4B-it`) and apply the adapter in 4-bit.

```python
%%capture
# Cell 1 [Install]: Colab/Jupyter only. `%%capture` must be the first line of the cell
# and hides that cell's output, so keep the install step in its own cell.
# Install everything needed for the T4 Colab environment
!pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
!pip install --no-deps unsloth_zoo bitsandbytes accelerate xformers peft trl triton unsloth
!pip install --no-deps --upgrade "torchao>=0.16.0"
!pip install --no-deps transformers==5.5.0 "tokenizers>=0.22.0,<=0.23.0"
!pip install torchcodec

# Cell 2 [Load model + inference]: run everything below in a separate cell.
import warnings

import torch

torch._dynamo.config.recompile_limit = 64

# Suppress the specific PyTorch size check warning from bitsandbytes
warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    message=".*_check_is_size will be removed in a future PyTorch release.*"
)

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    "Sothay/gemma4-hscode-classifier",   # LoRA adapter on Hugging Face
    load_in_4bit = True,                 # required – the adapter was trained in 4-bit
    max_seq_length = 1024,
)

# ---------- Inference with the authoritative lookup table (recommended) ----------
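import json, re
from huggingface_hub import hf_hub_download

# Fetch the lookup table from the model repository (cached locally after the first call)
lookup_path = hf_hub_download("Sothay/gemma4-hscode-classifier", "hs_code_lookup.json")
with open(lookup_path, encoding="utf-8") as f:
    rate_lookup = json.load(f)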

def predict_hs_code(description: str) -> dict:
    system_prompt = (
        "You are a customs compliance AI. Classify the product description to its "
        "correct 8-digit HS code and output the corresponding trade rates (Customs Duty, "
        "Special Tax, VAT, Excise Tax) and unit."
    )
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user",   "content": [{"type": "text", "text": f"Description: {description}"}]},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to("cuda")
    out = model.generate(**inputs, max_new_tokens=80, do_sample=False)
    # Decode only the newly generated tokens, not the prompt
    text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    m = re.search(r"HS Code:\s*([0-9]{4,10})", text)
    code = m.group(1) if m else None
    if code and code in rate_lookup:
        return {"hs_code": code, "source": "lookup_table", **rate_lookup[code]}
    return {"hs_code": code, "source": "model_only_UNVERIFIED", "raw_output": text}

print(predict_hs_code("Men's cotton knitted T-shirt"))
```
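
The regex in `predict_hs_code` accepts anything from 4 to 10 digits, while the model is meant to emit 8-digit codes. The sketch below shows one way to tighten that check before the lookup. It assumes the keys of `hs_code_lookup.json` are plain 8-digit strings, which should be confirmed by inspecting the file; `normalize_hs_code` and `split_hs_code` are hypothetical helpers, not part of this repository.

```python
import re
from typing import Optional


def normalize_hs_code(raw: Optional[str], digits: int = 8) -> Optional[str]:
    """Strip dots, spaces and hyphens; return the code only if it has exactly `digits` digits."""
    if raw is None:
        return None
    cleaned = re.sub(r"[\s.\-]", "", raw)
    if re.fullmatch(rf"[0-9]{{{digits}}}", cleaned):
        return cleaned
    return None


def split_hs_code(code: str) -> dict:
    """Break a normalized code into the standard HS levels (2, 4 and 6 digits)."""
    return {
        "chapter": code[:2],      # first 2 digits
        "heading": code[:4],      # first 4 digits
        "subheading": code[:6],   # first 6 digits (international HS level)
        "tariff_line": code,      # full code as used in the national tariff
    }


print(normalize_hs_code("6109.10.00"))   # 61091000
print(normalize_hs_code("6109"))         # None: too short to be a tariff line
print(split_hs_code("61091000")["heading"])
```

Returning `None` for a malformed code lets the caller fall through to the `model_only_UNVERIFIED` branch instead of attempting a lookup with a partial code.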

---

## 🔍 Raw model output (debugging)

If you want to see exactly what the model generated (including the rates it predicted) without the lookup table, use the raw‑output function below.  
**Do not** use these rates in production – they are only for debugging or confidence evaluation.

```python
def predict_hs_code_raw(description: str, max_new_tokens=100) -> dict:
    system_prompt = (
        "You are a customs compliance AI. Classify the product description to its "
        "correct 8-digit HS code and output the corresponding trade rates (Customs Duty, "
        "Special Tax, VAT, Excise Tax) and unit."
    )
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user",   "content": [{"type": "text", "text": f"Description: {description}"}]},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to("cuda")

    out = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True, do_sample=False)
    raw_text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    def extract(pattern, text):
        m = re.search(pattern, text)
        return m.group(1).strip() if m else None

    return {
        "hs_code":   extract(r"HS Code:\s*([0-9.]+)", raw_text),
        "unit":      extract(r"Unit:\s*(.*)", raw_text),
        "cd_rate":   extract(r"Customs Duty:\s*([\d.]+)%?", raw_text),
        "st_rate":   extract(r"Special Tax:\s*([\d.]+)%?", raw_text),
        "vat_rate":  extract(r"VAT:\s*([\d.]+)%?", raw_text),
        "et_rate":   extract(r"Excise Tax:\s*([\d.]+)%?", raw_text),
        "raw_output": raw_text
    }

# Example
raw = predict_hs_code_raw("Men's cotton knitted T-shirt")
print(raw["raw_output"])
print(raw["hs_code"])   # model’s guess
```
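
For the debugging use mentioned above, it can help to see which generated rates disagree with the lookup table. The following is a minimal sketch, assuming each lookup entry is a dict that uses the same rate keys as `predict_hs_code_raw` (`cd_rate`, `st_rate`, `vat_rate`, `et_rate`); the real schema of `hs_code_lookup.json` may differ. `to_float` and `rate_mismatches` are hypothetical helpers, and the values are illustrative, not real tariff data.

```python
from typing import Optional

# Assumed to match the lookup schema; adjust after inspecting hs_code_lookup.json.
RATE_KEYS = ("cd_rate", "st_rate", "vat_rate", "et_rate")


def to_float(value) -> Optional[float]:
    """Parse 25, '25' or '25%' into 25.0; return None when missing or unparseable."""
    if value is None:
        return None
    try:
        return float(str(value).strip().rstrip("%"))
    except ValueError:
        return None


def rate_mismatches(model_out: dict, official: dict) -> dict:
    """Return {key: (model_value, official_value)} for every rate that differs."""
    diffs = {}
    for key in RATE_KEYS:
        model_val = to_float(model_out.get(key))
        official_val = to_float(official.get(key))
        if model_val != official_val:
            diffs[key] = (model_val, official_val)
    return diffs


# Illustrative values only.
model_out = {"hs_code": "61091000", "cd_rate": "15", "st_rate": "0", "vat_rate": "10", "et_rate": "0"}
official = {"cd_rate": "25%", "st_rate": "0%", "vat_rate": "10%", "et_rate": "0%"}
print(rate_mismatches(model_out, official))   # {'cd_rate': (15.0, 25.0)}
```

An empty result means the model reproduced the table's rates for that code; any entry marks a rate the model got wrong or failed to emit.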

---

## 🧠 Training details

- **Base model**: `unsloth/gemma-4-E4B-it` (4‑bit QLoRA)
- **Adapter rank**: r=16, alpha=16, targeting all language & attention layers
- **Gradient checkpointing**: Unsloth’s own implementation (avoids Gemma‑4 KV‑shared layer bug)
- **Dataset**: Custom Cambodian HS‑code dataset (`hs_code.csv`) with descriptions, codes, and official rates
  - Cleaned, deduplicated, split into 90/10 train/validation
  - Chat roles fixed to system/user/assistant (Gemma‑4 standard)
- **Training config**: 3 epochs, effective batch size 8, learning rate 2e‑4, linear schedule, eval & save every epoch, best model loaded
- **Hardware**: Google Colab T4 (16 GB) – peak memory ~10 GB thanks to QLoRA
- **Accuracy**: Evaluated on held‑out examples (exact HS‑code match) – see model card for current numbers
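
The data preparation steps listed above (deduplication, 90/10 split, system/user/assistant roles) can be sketched roughly as follows. This is not the author's preprocessing code: the column names `description` and `hs_code`, the seed, and the shortened assistant reply are assumptions made for illustration.

```python
import random


def dedupe_and_split(rows, val_fraction=0.1, seed=42):
    """Drop duplicate (description, code) pairs, shuffle, and split into train/validation."""
    seen, unique = set(), []
    for row in rows:
        key = (row["description"].strip().lower(), row["hs_code"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    random.Random(seed).shuffle(unique)          # seeded for a reproducible split
    n_val = max(1, round(len(unique) * val_fraction))
    return unique[n_val:], unique[:n_val]


def to_chat_messages(row, system_prompt):
    """Build one training conversation with system/user/assistant roles."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Description: {row['description']}"},
        # The real target also lists unit and rates; only the code is shown here.
        {"role": "assistant", "content": f"HS Code: {row['hs_code']}"},
    ]


# 20 distinct synthetic rows plus 5 repeats.
demo_rows = [{"description": f"item {i}", "hs_code": f"{i:08d}"} for i in range(20)]
demo_rows += demo_rows[:5]
train_rows, val_rows = dedupe_and_split(demo_rows)
print(len(train_rows), len(val_rows))   # 18 2
```

Deduplicating before the split matters here: a description that appears in both train and validation would inflate the exact-match accuracy.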

---

## ⚖️ Production use

> **Always use the lookup table – never trust the model’s generated rates.**

The model is a **classifier**: description → HS code.  
Rates are fetched deterministically from `hs_code_lookup.json`, a file extracted from the same official tariff data used during training.

Why?  
- A causal LM recalling a rate from memory will occasionally hallucinate – a customs tool with confident, wrong numbers is worse than one that says “I don’t know”.
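- Given a correct HS code, the lookup table returns the rates exactly as recorded in the source tariff data, so rate accuracy depends only on the classification and on the table being current.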

The `hs_code_lookup.json` file is included in this repository and can be downloaded via:

```python
from huggingface_hub import hf_hub_download

# Returns the local path of the downloaded (and cached) file
lookup_path = hf_hub_download("Sothay/gemma4-hscode-classifier", "hs_code_lookup.json")
```

---

## 📦 Files in this repository

| File | Description |
|------|-------------|
| `adapter_model.safetensors` | LoRA adapter weights (few MB) |
| `adapter_config.json` | Adapter configuration (references base model) |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer files |
| `hs_code_lookup.json` | Authoritative rate table for production inference |
| `README.md` | This file |

> **Note**: Only the adapter is stored here – the Gemma‑4 base model is fetched automatically from the Hugging Face Hub (`unsloth/gemma-4-E4B-it`) when you call `FastModel.from_pretrained`.  
> If you need a **merged 16‑bit model** (for vLLM, TGI, etc.), generate it locally with Unsloth:
> ```python
> model.save_pretrained_merged("merged_fp16", tokenizer, save_method="merged_16bit")
> ```

---

## 🦙 Ollama / llama.cpp (GGUF)

Export a quantized GGUF directly from the loaded adapter:

```python
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="q4_k_m")
```

Then use with Ollama (see [`Modelfile` example](https://ollama.com) – set temperature 0, deterministic sampling).
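
Once the GGUF is registered with Ollama, it can be queried from Python over Ollama's local HTTP API. The sketch below assumes a server on the default port and a model created under the hypothetical name `hscode-classifier`; it reuses the system prompt from the quick start and sets `temperature` to 0 as recommended above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"   # default local Ollama endpoint
MODEL_NAME = "hscode-classifier"                 # hypothetical name chosen at `ollama create`

SYSTEM_PROMPT = (
    "You are a customs compliance AI. Classify the product description to its "
    "correct 8-digit HS code and output the corresponding trade rates (Customs Duty, "
    "Special Tax, VAT, Excise Tax) and unit."
)


def build_payload(description: str) -> dict:
    """Build a non-streaming chat request with deterministic sampling."""
    return {
        "model": MODEL_NAME,
        "stream": False,
        "options": {"temperature": 0},
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Description: {description}"},
        ],
    }


def classify_via_ollama(description: str, timeout: float = 60.0) -> str:
    """Send the request to the local server and return the raw generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(description)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["message"]["content"]


# Requires a running Ollama server:
# print(classify_via_ollama("Men's cotton knitted T-shirt"))
```

The returned text has the same shape as the Unsloth output, so the HS code should still be extracted and resolved against `hs_code_lookup.json` rather than trusting the generated rates.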

---

## 📊 Example predictions

| Description | Predicted HS Code | Unit | CD | ST | VAT | ET |
|-------------|-------------------|------|----|----|-----|----|
| Toyota Hilux pickup, diesel 2.8L | 87042110 | UNIT | 35% | 50% | 10% | 0% |
| iPhone 15 Pro Max 256GB | 85171200 | UNIT | 0% | 0% | 10% | 0% |
| Heineken beer 330ml can | 22030010 | LTR | 35% | 30% | 10% | 0% |

*(Rates from lookup table – not generated by the model.)*

---

## ⚠️ Limitations

- The model may output incorrect HS codes for ambiguous, misspelled, or region‑specific descriptions.
- It was trained on a fixed set of Cambodian HS codes; revisions after the training data cutoff are not covered.
- Duty rates can become outdated – always cross‑check with the latest official tariff schedule.
- The model is a classifier, **not** a legal authority. For binding decisions, consult a customs professional.

---

## 📝 License

This model is a derivative of **Gemma‑4‑E4B‑it** and is subject to the [Gemma license](https://ai.google.dev/gemma/terms).  
The HS‑code dataset and lookup table are the property of their respective owners.

---

## 🙏 Acknowledgments

- [Unsloth](https://github.com/unslothai/unsloth) – made QLoRA + Gemma‑4 on a T4 effortless
- [Google DeepMind](https://deepmind.google) – for the Gemma family of models

---

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{gemma4-hscode-classifier,
  author = {Sothay},
  title = {Gemma-4 HS Code Classifier (Cambodia Customs)},
  year = 2025,
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Sothay/gemma4-hscode-classifier}}
}
```

---

**Author**: [Sothay](https://huggingface.co/Sothay)  
**Model card version**: 1.2