--- language: - vi tags: - address-normalization - vietnamese - seq2seq - named-entity-recognition license: mit pipeline_tag: other --- # Vietnamese Address Normalizer Normalizes Vietnamese address strings to canonical post-2025 administrative form. Uses a Transformer seq2seq model with province-constrained beam search over 187K clean canonical addresses. ## Quick Start ```bash git clone https://huggingface.co//vn-address-normalizer cd vn-address-normalizer pip install -r requirements.txt python inference.py "p tan dinh q1 tphcm" ``` Expected output: ``` Input: p tan dinh q1 tphcm Canonical: Phường Tân Định, Thành phố Hồ Chí Minh Valid: True Province: Thành phố Hồ Chí Minh Ward hint: tan dinh Space: 170 candidates Latency: ~150 ms (warm), ~1.5 s (cold — model load + trie build) ``` ## Python API ```python from inference import normalize # With or without Vietnamese diacritics result = normalize("p tan dinh q1 tphcm") result = normalize("Phường Tân Định, Quận 1, TP.HCM") result = normalize("Xa Cu Chi TP HCM") print(result["canonical"]) # "Phường Tân Định, Thành phố Hồ Chí Minh" print(result["valid"]) # True print(result["latency_ms"]) # ~150 ms warm ``` ### Return fields | Field | Type | Description | |---|---|---| | `canonical` | str | Normalized address; empty if not found | | `valid` | bool | True if canonical exists in address database | | `confidence` | float | Log-prob score (higher = more confident) | | `province` | str | Resolved province name, or None | | `ward_hint` | str | Detected ward slug, or None | | `search_space` | int | Number of trie candidates searched | | `latency_ms` | float | Wall-clock inference time in milliseconds | ## Input / Output Examples | Input | Output | |---|---| | `p tan dinh q1 tphcm` | `Phường Tân Định, Thành phố Hồ Chí Minh` | | `Phuong Ba Dinh Ha Noi` | `Phường Ba Đình, Thành phố Hà Nội` | | `Xa Cu Chi TP HCM` | `Xã Củ Chi, Thành phố Hồ Chí Minh` | | `P. Bến Nghé Q.1 HCM` | `Phường Sài Gòn, Thành phố Hồ Chí Minh` *(pre-2025 rename)* | | `phuong 14 quan 10 tphcm` | *(empty — numbered wards removed post-2025)* | | `duong le loi phuong ben nghe q1 tphcm` | `Đường Lê Lợi, Phường Sài Gòn, Thành phố Hồ Chí Minh` | ## Architecture **Seq2Seq Transformer** (26M parameters): - Encoder: 4-layer, 256-dim, 4-head, GELU, Pre-LN - Decoder: 3-layer, same config - Input: character-level tokenization (287-token vocab) - Output: character-level decoding (269-token vocab) **Inference pipeline:** 1. Detect province from raw text (regex + alias table) 2. Detect ward hint (regex + ward slug index + legacy ward map for pre-2025 names) 3. Build constrained trie: ward candidates (~10–500) → province candidates (~3K–52K) → full (~187K) 4. Province-constrained beam search (beam=5, max 96 steps) 5. Result is guaranteed in the canonical address database (trie acceptance) **Training data:** - 187K clean canonical addresses (34 provinces, 3,321 wards, post-2025 boundaries) - Coverage: 79.3% of Vietnamese address variants ## Files | File | Description | |---|---| | `inference.py` | Standalone inference — the only file you need to run | | `model_v3_final/model.safetensors` | Model weights (26M params) | | `model_v3_final/config.json` | Model hyperparameters | | `model_v3_final/src_vocab.json` | Source character vocabulary | | `model_v3_final/tgt_vocab.json` | Target character vocabulary | | `model_v3_final/clean_canonicals.json` | 187K pre-filtered canonical addresses | | `model_v3_final/legacy_ward_idx.json` | Pre-2025 → 2025 ward name mapping (13K entries) | ## Limitations - **ML-only mode.** `inference.py` uses the neural model without the rule-based FST engine (which requires the `vietnam_provinces` package and a heavier index build). For production use with higher accuracy, integrate `normalizer.py` + `fst.py` from the full server-side stack. - **Post-2025 boundaries only.** Canonical addresses reflect the 2025 administrative reorganization. Pre-2025 ward names (e.g. "Bến Nghé" → "Sài Gòn") are handled via the legacy map, but coverage is not exhaustive. - **Numbered wards rejected.** Wards identified only by number (e.g. "Phường 14", "Quận 10") are rejected — these no longer exist post-2025. - **2-variable input.** Model handles up to 3 address components (street, ward, province). Highly complex or apartment-style addresses may not simplify cleanly. - **Cold start latency.** First call takes ~2–4 s (model load + trie construction). Subsequent calls: ~10–150 ms depending on search space. - **CPU inference only.** GPU not required or used. ## Requirements ``` torch>=1.13.0 unidecode>=1.3.0 safetensors>=0.3.0 ``` Python 3.10+ required (uses structural pattern matching in pipeline internals).