qox's picture
Upload README.md with huggingface_hub
9b1f192 verified
---
language:
- vi
tags:
- address-normalization
- vietnamese
- seq2seq
- named-entity-recognition
license: mit
pipeline_tag: other
---
# Vietnamese Address Normalizer
Normalizes Vietnamese address strings to canonical post-2025 administrative form.
Uses a Transformer seq2seq model with province-constrained beam search over
187K clean canonical addresses.
## Quick Start
```bash
git clone https://huggingface.co/<your-username>/vn-address-normalizer
cd vn-address-normalizer
pip install -r requirements.txt
python inference.py "p tan dinh q1 tphcm"
```
Expected output:
```
Input: p tan dinh q1 tphcm
Canonical: Phường Tân Định, Thành phố Hồ Chí Minh
Valid: True
Province: Thành phố Hồ Chí Minh
Ward hint: tan dinh
Space: 170 candidates
Latency: ~150 ms (warm), ~1.5 s (cold — model load + trie build)
```
## Python API
```python
from inference import normalize
# With or without Vietnamese diacritics
result = normalize("p tan dinh q1 tphcm")
result = normalize("Phường Tân Định, Quận 1, TP.HCM")
result = normalize("Xa Cu Chi TP HCM")
print(result["canonical"]) # "Phường Tân Định, Thành phố Hồ Chí Minh"
print(result["valid"]) # True
print(result["latency_ms"]) # ~150 ms warm
```
### Return fields
| Field | Type | Description |
|---|---|---|
| `canonical` | str | Normalized address; empty if not found |
| `valid` | bool | True if canonical exists in address database |
| `confidence` | float | Log-prob score (higher = more confident) |
| `province` | str | Resolved province name, or None |
| `ward_hint` | str | Detected ward slug, or None |
| `search_space` | int | Number of trie candidates searched |
| `latency_ms` | float | Wall-clock inference time in milliseconds |
## Input / Output Examples
| Input | Output |
|---|---|
| `p tan dinh q1 tphcm` | `Phường Tân Định, Thành phố Hồ Chí Minh` |
| `Phuong Ba Dinh Ha Noi` | `Phường Ba Đình, Thành phố Hà Nội` |
| `Xa Cu Chi TP HCM` | `Xã Củ Chi, Thành phố Hồ Chí Minh` |
| `P. Bến Nghé Q.1 HCM` | `Phường Sài Gòn, Thành phố Hồ Chí Minh` *(pre-2025 rename)* |
| `phuong 14 quan 10 tphcm` | *(empty — numbered wards removed post-2025)* |
| `duong le loi phuong ben nghe q1 tphcm` | `Đường Lê Lợi, Phường Sài Gòn, Thành phố Hồ Chí Minh` |
## Architecture
**Seq2Seq Transformer** (26M parameters):
- Encoder: 4-layer, 256-dim, 4-head, GELU, Pre-LN
- Decoder: 3-layer, same config
- Input: character-level tokenization (287-token vocab)
- Output: character-level decoding (269-token vocab)
**Inference pipeline:**
1. Detect province from raw text (regex + alias table)
2. Detect ward hint (regex + ward slug index + legacy ward map for pre-2025 names)
3. Build constrained trie: ward candidates (~10–500) → province candidates (~3K–52K) → full (~187K)
4. Province-constrained beam search (beam=5, max 96 steps)
5. Result is guaranteed in the canonical address database (trie acceptance)
**Training data:**
- 187K clean canonical addresses (34 provinces, 3,321 wards, post-2025 boundaries)
- Coverage: 79.3% of Vietnamese address variants
## Files
| File | Description |
|---|---|
| `inference.py` | Standalone inference — the only file you need to run |
| `model_v3_final/model.safetensors` | Model weights (26M params) |
| `model_v3_final/config.json` | Model hyperparameters |
| `model_v3_final/src_vocab.json` | Source character vocabulary |
| `model_v3_final/tgt_vocab.json` | Target character vocabulary |
| `model_v3_final/clean_canonicals.json` | 187K pre-filtered canonical addresses |
| `model_v3_final/legacy_ward_idx.json` | Pre-2025 → 2025 ward name mapping (13K entries) |
## Limitations
- **ML-only mode.** `inference.py` uses the neural model without the rule-based FST
engine (which requires the `vietnam_provinces` package and a heavier index build).
For production use with higher accuracy, integrate `normalizer.py` + `fst.py` from
the full server-side stack.
- **Post-2025 boundaries only.** Canonical addresses reflect the 2025 administrative
reorganization. Pre-2025 ward names (e.g. "Bến Nghé" → "Sài Gòn") are handled via
the legacy map, but coverage is not exhaustive.
- **Numbered wards rejected.** Wards identified only by number (e.g. "Phường 14",
"Quận 10") are rejected — these no longer exist post-2025.
- **2-variable input.** Model handles up to 3 address components (street, ward,
province). Highly complex or apartment-style addresses may not simplify cleanly.
- **Cold start latency.** First call takes ~2–4 s (model load + trie construction).
Subsequent calls: ~10–150 ms depending on search space.
- **CPU inference only.** GPU not required or used.
## Requirements
```
torch>=1.13.0
unidecode>=1.3.0
safetensors>=0.3.0
```
Python 3.10+ required (uses structural pattern matching in pipeline internals).