---
language:
- vi
tags:
- address-normalization
- vietnamese
- seq2seq
- named-entity-recognition
license: mit
pipeline_tag: other
---

# Vietnamese Address Normalizer

Normalizes Vietnamese address strings to canonical post-2025 administrative form.
Uses a Transformer seq2seq model with province-constrained beam search over
187K clean canonical addresses.

## Quick Start

```bash
git clone https://huggingface.co/<your-username>/vn-address-normalizer
cd vn-address-normalizer
pip install -r requirements.txt
python inference.py "p tan dinh q1 tphcm"
```

Expected output:

```
Input:       p tan dinh q1 tphcm
Canonical:   Phường Tân Định, Thành phố Hồ Chí Minh
Valid:       True
Province:    Thành phố Hồ Chí Minh
Ward hint:   tan dinh
Space:       170 candidates
Latency:     ~150 ms (warm), ~1.5 s (cold — model load + trie build)
```

## Python API

```python
from inference import normalize

# With or without Vietnamese diacritics
result = normalize("p tan dinh q1 tphcm")
result = normalize("Phường Tân Định, Quận 1, TP.HCM")
result = normalize("Xa Cu Chi TP HCM")

print(result["canonical"])  # "Phường Tân Định, Thành phố Hồ Chí Minh"
print(result["valid"])      # True
print(result["latency_ms"]) # ~150 ms warm
```

### Return fields

| Field | Type | Description |
|---|---|---|
| `canonical` | str | Normalized address; empty if not found |
| `valid` | bool | True if canonical exists in address database |
| `confidence` | float | Log-prob score (higher = more confident) |
| `province` | str | Resolved province name, or None |
| `ward_hint` | str | Detected ward slug, or None |
| `search_space` | int | Number of trie candidates searched |
| `latency_ms` | float | Wall-clock inference time in milliseconds |

## Input / Output Examples

| Input | Output |
|---|---|
| `p tan dinh q1 tphcm` | `Phường Tân Định, Thành phố Hồ Chí Minh` |
| `Phuong Ba Dinh Ha Noi` | `Phường Ba Đình, Thành phố Hà Nội` |
| `Xa Cu Chi TP HCM` | `Xã Củ Chi, Thành phố Hồ Chí Minh` |
| `P. Bến Nghé Q.1 HCM` | `Phường Sài Gòn, Thành phố Hồ Chí Minh` *(pre-2025 rename)* |
| `phuong 14 quan 10 tphcm` | *(empty — numbered wards removed post-2025)* |
| `duong le loi phuong ben nghe q1 tphcm` | `Đường Lê Lợi, Phường Sài Gòn, Thành phố Hồ Chí Minh` |

## Architecture

**Seq2Seq Transformer** (26M parameters):
- Encoder: 4-layer, 256-dim, 4-head, GELU, Pre-LN
- Decoder: 3-layer, same config
- Input: character-level tokenization (287-token vocab)
- Output: character-level decoding (269-token vocab)

**Inference pipeline:**
1. Detect province from raw text (regex + alias table)
2. Detect ward hint (regex + ward slug index + legacy ward map for pre-2025 names)
3. Build constrained trie: ward candidates (~10–500) → province candidates (~3K–52K) → full (~187K)
4. Province-constrained beam search (beam=5, max 96 steps)
5. Result is guaranteed in the canonical address database (trie acceptance)

**Training data:**
- 187K clean canonical addresses (34 provinces, 3,321 wards, post-2025 boundaries)
- Coverage: 79.3% of Vietnamese address variants

## Files

| File | Description |
|---|---|
| `inference.py` | Standalone inference — the only file you need to run |
| `model_v3_final/model.safetensors` | Model weights (26M params) |
| `model_v3_final/config.json` | Model hyperparameters |
| `model_v3_final/src_vocab.json` | Source character vocabulary |
| `model_v3_final/tgt_vocab.json` | Target character vocabulary |
| `model_v3_final/clean_canonicals.json` | 187K pre-filtered canonical addresses |
| `model_v3_final/legacy_ward_idx.json` | Pre-2025 → 2025 ward name mapping (13K entries) |

## Limitations

- **ML-only mode.** `inference.py` uses the neural model without the rule-based FST
  engine (which requires the `vietnam_provinces` package and a heavier index build).
  For production use with higher accuracy, integrate `normalizer.py` + `fst.py` from
  the full server-side stack.

- **Post-2025 boundaries only.** Canonical addresses reflect the 2025 administrative
  reorganization. Pre-2025 ward names (e.g. "Bến Nghé" → "Sài Gòn") are handled via
  the legacy map, but coverage is not exhaustive.

- **Numbered wards rejected.** Wards identified only by number (e.g. "Phường 14",
  "Quận 10") are rejected — these no longer exist post-2025.

- **2-variable input.** Model handles up to 3 address components (street, ward,
  province). Highly complex or apartment-style addresses may not simplify cleanly.

- **Cold start latency.** First call takes ~2–4 s (model load + trie construction).
  Subsequent calls: ~10–150 ms depending on search space.

- **CPU inference only.** GPU not required or used.

## Requirements

```
torch>=1.13.0
unidecode>=1.3.0
safetensors>=0.3.0
```

Python 3.10+ required (uses structural pattern matching in pipeline internals).