---
language:
- multilingual
- ar
- zh
- ru
- ja
- ko
- he
- fa
- hi
- el
- ka
- am
- hy
- ur
- bn
- ta
- te
- th
tags:
- toponym-matching
- cross-script
- phonetic-embeddings
- geospatial
- named-entity
- information-retrieval
- teacher-student
- knowledge-distillation
license: cc-by-4.0
datasets:
- geonames
- wikidata
- getty-tgn
metrics:
- recall@k
- mrr
pipeline_tag: feature-extraction
---
# Symphonym v7 — Universal Phonetic Embeddings for Cross-Script Toponym Matching
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18682017.svg)](https://doi.org/10.5281/zenodo.18682017)
Symphonym maps toponyms (place names) from **20 writing systems** into a unified
**128-dimensional phonetic embedding space**, enabling direct cross-script similarity
comparison without runtime phonetic conversion or language identification.
> *"London" / "Лондон" / "伦敦" / "لندن" → [0.12, -0.34, …] (all nearby)*
## Intended Use
- **Cross-script toponym matching** in geographic databases and gazetteers
- **Phonetic search** — retrieve results for a place name entered in any script
- **Historical record linkage** — match pre-standardisation spelling variants
- **Multilingual named entity linking** in NLP pipelines
- **Digital humanities** — reconciling place references across archival sources
The model operates on **phonetic similarity**, not semantic or orthographic similarity.
It is designed as a **candidate retrieval** component within a larger reconciliation
pipeline, where candidates are subsequently filtered by geographic proximity and
other constraints.
## Quick Start
```python
from inference import SymphonymModel
model = SymphonymModel() # loads weights from this directory
# Single similarity score
sim = model.similarity("London", "en", "Лондон", "ru")
print(f"London / Лондон: {sim:.3f}") # → 0.991
# Batch embeddings (N × 128 numpy array)
embeddings = model.batch_embed([
("London", "en"),
("Лондон", "ru"),
("伦敦", "zh"),
("لندن", "ar"),
("ლონდონი", "ka"),
])
```
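Because the returned embeddings are L2-normalised (see Architecture), cosine similarity reduces to a dot product, so candidate retrieval is a single matrix multiply. A minimal sketch over the batch above; the query string is illustrative and the NumPy ranking is not part of the shipped API:

```python
import numpy as np

# Embeddings are L2-normalised, so cosine similarity is a plain dot product.
query = model.batch_embed([("Londra", "it")])   # (1, 128); illustrative query
scores = (embeddings @ query.T).ravel()         # cosine score per candidate
ranked = np.argsort(-scores)                    # candidate indices, best first
print(ranked[:3], scores[ranked[:3]])
```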
### With HuggingFace `huggingface_hub`
```python
import sys

from huggingface_hub import snapshot_download

model_dir = snapshot_download("docuracy/symphonym-v7")
sys.path.insert(0, model_dir)  # inference.py ships inside the snapshot

from inference import SymphonymModel

model = SymphonymModel(model_dir=model_dir)
```
## Representative Cross-Script Similarities
| Pair | Scripts | Similarity |
|------|---------|-----------|
| London / Лондон | Latin–Cyrillic | 0.991 |
| Athens / Αθήνα | Latin–Greek | 0.980 |
| Beijing / 北京 | Latin–CJK | 0.955 |
| Baghdad / بغداد | Latin–Arabic | 0.969 |
| Jerusalem / ירושלים | Latin–Hebrew | 0.892 |
| Tokyo / とうきょう | Latin–Hiragana | ~0.94 |
| London / Londres | Latin–Latin | 0.474 *(correct: phonetically distinct)* |
## Model Architecture
Symphonym uses a **Teacher–Student knowledge distillation** framework.
### Teacher (PhoneticEncoder) — training only
- Input: IPA transcriptions via Epitran (+ 102 extensions), Phonikud, CharsiuG2P
- Representation: PanPhon192 — 24-dim articulatory feature vectors,
  8-bin positional pooling → 192-dim fixed-length input (sketched below)
- Architecture: BiLSTM → Self-Attention → Attention Pooling → 128-dim projection
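The PanPhon192 pooling referenced above can be sketched as follows; mean pooling within each positional bin and the bin-assignment rule are assumptions, not confirmed implementation details:

```python
import numpy as np

def panphon192(seg_feats: np.ndarray, n_bins: int = 8) -> np.ndarray:
    """Pool per-segment articulatory features (n_segments x 24) into a
    fixed-length vector: 8 positional bins x 24 features = 192 dims."""
    n_segments, n_feats = seg_feats.shape
    pooled = np.zeros((n_bins, n_feats), dtype=np.float32)
    # Assign each segment to a bin by its relative position in the word.
    bins = np.floor(np.arange(n_segments) / n_segments * n_bins).astype(int)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            pooled[b] = seg_feats[mask].mean(axis=0)  # mean pool (assumed)
    return pooled.reshape(-1)
```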
### Student (UniversalEncoder) — deployed model
- Input: raw Unicode characters + script ID + language ID + length bucket
- Vocabulary: **113,280 tokens** across 20 scripts, 1,944 language codes
- Architecture: Character/Script/Language/Length embeddings →
Input projection → BiLSTM → Self-Attention (residual) →
Attention Pooling → 128-dim projection → L2 normalisation
- Parameters: ~8.3M
The **length bucket embedding** (16 buckets, 8-dim) conditions every character
representation on sequence length, mitigating spurious matches between short
toponyms and long compound strings.
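A condensed PyTorch sketch of this stack; the vocabulary sizes and the 16 × 8-dim length-bucket embedding come from the description above, while hidden sizes, head count, and wiring details are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniversalEncoderSketch(nn.Module):
    """Illustrative Student encoder; not the shipped implementation."""
    def __init__(self, n_chars=113_280, n_scripts=20, n_langs=1_944,
                 n_len_buckets=16, d_char=64, d_cond=8, d_model=256, d_out=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_char, padding_idx=0)
        self.script_emb = nn.Embedding(n_scripts, d_cond)
        self.lang_emb = nn.Embedding(n_langs, d_cond)
        self.len_emb = nn.Embedding(n_len_buckets, d_cond)  # 16 buckets, 8-dim
        self.in_proj = nn.Linear(d_char + 3 * d_cond, d_model)
        self.bilstm = nn.LSTM(d_model, d_model // 2, batch_first=True,
                              bidirectional=True)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.pool_score = nn.Linear(d_model, 1)  # attention-pooling scores
        self.out_proj = nn.Linear(d_model, d_out)

    def forward(self, chars, script, lang, len_bucket):
        B, T = chars.shape
        # Condition every character on script / language / length bucket.
        cond = torch.cat([self.script_emb(script), self.lang_emb(lang),
                          self.len_emb(len_bucket)], dim=-1)        # (B, 24)
        x = torch.cat([self.char_emb(chars),
                       cond.unsqueeze(1).expand(B, T, -1)], dim=-1)
        x = self.in_proj(x)
        x, _ = self.bilstm(x)
        attn_out, _ = self.attn(x, x, x)
        x = x + attn_out                            # residual self-attention
        w = self.pool_score(x).softmax(dim=1)       # (B, T, 1) weights
        pooled = (w * x).sum(dim=1)                 # attention pooling
        return F.normalize(self.out_proj(pooled), dim=-1)  # L2-normed 128-d
```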
### Three-Phase Training Curriculum
| Phase | Objective | Epochs | Notes |
|-------|-----------|--------|-------|
| 1 | Teacher: triplet margin loss on PanPhon192 features | 50 | val_loss 0.0056 |
| 2 | Student–Teacher distillation: α·MSE + (1−α)·cosine | 50 | α=0.5, Student-Teacher cosine 0.942 |
| 3 | Hard negative fine-tuning (triplet, margin=0.3) | 30 | val_loss 0.02122 |
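Both objectives are standard; a minimal PyTorch sketch (the batch reduction is an assumption):

```python
import torch.nn.functional as F

def distill_loss(student, teacher, alpha=0.5):
    """Phase 2: alpha * MSE + (1 - alpha) * cosine distance to the Teacher."""
    mse = F.mse_loss(student, teacher)
    cos = 1.0 - F.cosine_similarity(student, teacher, dim=-1).mean()
    return alpha * mse + (1.0 - alpha) * cos

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Phases 1 and 3: triplet margin loss (margin = 0.3 in Phase 3)."""
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)
```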
## Evaluation
### MEHDIE Hebrew–Arabic Historical Benchmark (Sagi et al., 2025)
Independent evaluation on medieval Hebrew and Arabic geographical sources — **not in training data**.
| Method | R@1 | R@5 | R@10 | MRR |
|--------|-----|-----|------|-----|
| PanPhon192 (ablation) | 41.1% | 48.2% | 52.3% | 45.0% |
| Levenshtein + AnyAscii | 81.5% | **97.5%** | **99.4%** | 88.5% |
| Jaro-Winkler + AnyAscii | 78.5% | 96.2% | 97.8% | 86.3% |
| **Symphonym v7** | **85.2%** | 97.0% | 97.6% | **90.8%** |
The PanPhon192 ablation (raw articulatory features, no neural training) achieves only
45.0% MRR — less than half Symphonym's score and below the string baselines —
confirming that performance derives from the training curriculum, not the phonetic
features alone.
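For reference, both metrics are computed from the 1-based rank of the correct match in each query's candidate list (a minimal sketch, assuming the correct match is always retrieved):

```python
import numpy as np

def recall_at_k(ranks: np.ndarray, k: int) -> float:
    """Fraction of queries whose correct match ranks in the top k."""
    return float((ranks <= k).mean())

def mrr(ranks: np.ndarray) -> float:
    """Mean reciprocal rank of the correct match over all queries."""
    return float((1.0 / ranks).mean())
```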
### Cross-Script Pair Validation (11,723 pairs, 170+ script combinations)
Systematically sampled from training data (up to 10 pairs per script-pair bin); these
test embedding retrieval quality over the full 67M-toponym index, not generalisation
to unseen sources.
| Metric | v6 | v7 |
|--------|----|----|
| Pass rate (≥0.75 cosine) | — | **90.7%** |
| Embedding coverage | ~98% | **100%** |
| Hiragana↔Katakana mean similarity | 0.000 | **0.981** |
**Best-performing script pairs:** Hiragana–Katakana (0.981), Devanagari–Kannada (0.976),
Devanagari–Telugu (0.976), Cyrillic–Latin (0.923, n=1,334), Arabic–Latin (0.898, n=800).
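The pass-rate criterion is a threshold on cosine similarity between paired embeddings; a sketch over two row-aligned (N × 128) matrices as produced by `batch_embed`:

```python
import numpy as np

def pass_rate(emb_a: np.ndarray, emb_b: np.ndarray, threshold=0.75) -> float:
    """Fraction of aligned pairs with cosine similarity >= threshold.
    Rows are L2-normalised, so cosine is a row-wise dot product."""
    cos = (emb_a * emb_b).sum(axis=1)
    return float((cos >= threshold).mean())
```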
## v7 Changes
v6 exhibited 0% IPA coverage for Hiragana (151,980 toponyms) and Katakana (340,555 toponyms)
despite both being natively supported by Epitran (`jpn-Hira`, `jpn-Kana`). The pipeline
was dispatching by language first (`lang=ja`), routing all Japanese toponyms to CharsiuG2P
which only processes CJK/Kanji. v7 fixes this by dispatching on detected script *before*
language code, restoring IPA coverage for 492,535 toponyms. The model was retrained from scratch.
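In outline, the corrected routing looks like this; `to_ipa` and `charsiu_g2p` are hypothetical names standing in for the actual pipeline code:

```python
import epitran

_SCRIPT_ROUTES = {"Hiragana": "jpn-Hira", "Katakana": "jpn-Kana"}

def to_ipa(toponym: str, lang: str, script: str) -> str:
    # v7: the detected script wins over the language code. v6 checked
    # lang == "ja" first, sending every Japanese entry to CharsiuG2P,
    # which only handles CJK/Kanji.
    code = _SCRIPT_ROUTES.get(script)
    if code is not None:
        return epitran.Epitran(code).transliterate(toponym)
    return charsiu_g2p(toponym, lang)

def charsiu_g2p(toponym: str, lang: str) -> str:
    raise NotImplementedError("placeholder for the CharsiuG2P backend")
```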
## Training Data
Trained on 66.9 million unique toponyms from:
| Source | License |
|--------|---------|
| GeoNames | CC BY 4.0 |
| Wikidata | CC0 |
| Getty TGN | ODC-By 1.0 |
54.0% of toponyms in the training data received an IPA transcription; the
remainder contribute to the Student's character-level learning via distillation.
## Repository Contents
```
model.safetensors Student (UniversalEncoder) weights
config.json Architecture hyperparameters
inference.py Self-contained inference module
requirements.txt Dependencies
vocab/
char_vocab.json 113,280-character vocabulary
lang_vocab.json 1,944 ISO language codes
script_vocab.json 20 script categories
evaluation/
mehdie_results_v7_ranking.json
symphonym_v7_pairs_test_report.json
training_stats/
coverage_stats.json IPA coverage by script and language
phase{1,2,3}_metrics.json
epitran_extensions/ 102 custom CSV G2P files
```
## Limitations
- **Phonetic similarity only**: The model does not use geographic coordinates,
semantic information, or entity types. Phonetically similar but geographically
unrelated names (Austria/Australia: 0.883) will score highly.
- **Training bias**: Sources over-represent populated places with official names
  in high-resource languages. Performance on under-represented scripts and
  less prominent places may be weaker.
- **Tonal languages**: PanPhon encodes segmental articulatory features but not
tone. Tonal minimal pairs in place names are rare in practice.
- **CJK–Hiragana pairs**: Mean similarity 0.437, reflecting that CharsiuG2P
produces Mandarin phonetics for Kanji while Epitran produces Japanese readings
for Hiragana — a genuine phonological mismatch, not a model deficiency.
## Citation
If you use Symphonym in your research, please cite the preprint and the Zenodo dataset:
```bibtex
@misc{symphonym2025,
  author        = {Gadd, Stephen},
  title         = {Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching},
  year          = {2026},
  eprint        = {2601.06932},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2601.06932},
  doi           = {10.48550/arXiv.2601.06932}
}

@dataset{symphonym_v7_zenodo,
  author = {Gadd, Stephen},
  title  = {Symphonym v7 — Universal Phonetic Embeddings for Cross-Script Toponym Matching},
  year   = {2026},
  doi    = {10.5281/zenodo.18682017},
  url    = {https://doi.org/10.5281/zenodo.18682017}
}
```