---
language:
- multilingual
- ar
- zh
- ru
- ja
- ko
- he
- fa
- hi
- el
- ka
- am
- hy
- ur
- bn
- ta
- te
- th
tags:
- toponym-matching
- cross-script
- phonetic-embeddings
- geospatial
- named-entity
- information-retrieval
- teacher-student
- knowledge-distillation
license: cc-by-4.0
datasets:
- geonames
- wikidata
- getty-tgn
metrics:
- recall@k
- mrr
pipeline_tag: feature-extraction
---

# Symphonym v7 — Universal Phonetic Embeddings for Cross-Script Toponym Matching

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18682017.svg)](https://doi.org/10.5281/zenodo.18682017)

Symphonym maps toponyms (place names) from **20 writing systems** into a unified
**128-dimensional phonetic embedding space**, enabling direct cross-script similarity
comparison without runtime phonetic conversion or language identification.

> *"London" / "Лондон" / "伦敦" / "لندن" → [0.12, -0.34, …] (all nearby)*

## Intended Use

- **Cross-script toponym matching** in geographic databases and gazetteers
- **Phonetic search** — retrieve results for a place name entered in any script
- **Historical record linkage** — match pre-standardisation spelling variants
- **Multilingual named entity linking** in NLP pipelines
- **Digital humanities** — reconciling place references across archival sources

The model operates on **phonetic similarity**, not semantic or orthographic similarity.
It is designed as a **candidate retrieval** component within a larger reconciliation
pipeline, where candidates are subsequently filtered by geographic proximity and
other constraints.
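
The two-stage use described above (phonetic candidate retrieval, then geographic filtering) could be sketched as follows. This is a minimal illustration, not part of the shipped `inference.py`: the function names, the `k` and `max_km` parameters, and the toy vectors and coordinates are all assumptions made for the example.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def retrieve_then_filter(query_vec, query_latlon, index_vecs, index_latlons,
                         k=10, max_km=50.0):
    """Hypothetical helper: top-k phonetic candidates, kept only if within max_km."""
    scores = index_vecs @ query_vec            # cosine, assuming unit-length vectors
    top = np.argsort(-scores)[:k]              # stage 1: phonetic candidates
    dist = haversine_km(query_latlon[0], query_latlon[1],
                        index_latlons[top, 0], index_latlons[top, 1])
    keep = dist <= max_km                      # stage 2: geographic filter
    return [(int(i), float(scores[i]), float(d))
            for i, d in zip(top[keep], dist[keep])]

# Toy index: three unit vectors standing in for real 128-dim embeddings.
index_vecs = np.zeros((3, 128))
index_vecs[0, 0] = 1.0
index_vecs[1, 0], index_vecs[1, 1] = 0.99, np.sqrt(1 - 0.99 ** 2)
index_vecs[2, 2] = 1.0
index_latlons = np.array([[51.507, -0.128],    # phonetic match, nearby
                          [42.984, -81.246],   # phonetic match, another continent
                          [35.680, 139.690]])  # unrelated name
hits = retrieve_then_filter(index_vecs[0], (51.5, -0.12),
                            index_vecs, index_latlons, k=2, max_km=50.0)
print(hits)
```

The second entry scores 0.99 phonetically but is discarded by the distance check, which is the behaviour the pipeline relies on for names that sound alike but refer to different places.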

## Quick Start

```python
from inference import SymphonymModel

model = SymphonymModel()   # loads weights from this directory

# Single similarity score
sim = model.similarity("London", "en", "Лондон", "ru")
print(f"London / Лондон: {sim:.3f}")   # → 0.991

# Batch embeddings  (N × 128 numpy array)
embeddings = model.batch_embed([
    ("London",   "en"),
    ("Лондон",   "ru"),
    ("伦敦",     "zh"),
    ("لندن",     "ar"),
    ("ლონდონი",  "ka"),
])
```
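
Because the Student ends in an L2 normalisation step, a full pairwise similarity matrix can be obtained from the batch output with a single matrix product. The sketch below uses random vectors as a stand-in for the array returned by `batch_embed`, so it runs without the model; `pairwise_cosine` is a hypothetical helper, not part of `inference.py`.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Return the N x N cosine similarity matrix for an N x D embedding array."""
    emb = np.asarray(embeddings, dtype=np.float64)
    # Re-normalising is a no-op for unit-length inputs and guards against others.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return emb @ emb.T

rng = np.random.default_rng(0)
stand_in = rng.normal(size=(5, 128))   # placeholder for model.batch_embed([...])
sims = pairwise_cosine(stand_in)
print(sims.shape)
```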

### With Hugging Face `huggingface_hub`

```python
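import sys
from huggingface_hub import snapshot_download

# Download the repository (weights, vocabularies, inference.py) to the local cache
model_dir = snapshot_download("docuracy/symphonym-v7")

# inference.py ships inside the repository, so put that directory on the import path
sys.path.insert(0, model_dir)
from inference import SymphonymModel

model = SymphonymModel(model_dir=model_dir)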
```

## Representative Cross-Script Similarities

| Pair | Scripts | Similarity |
|------|---------|-----------|
| London / Лондон | Latin–Cyrillic | 0.991 |
| Athens / Αθήνα | Latin–Greek | 0.980 |
| Beijing / 北京 | Latin–CJK | 0.955 |
| Baghdad / بغداد | Latin–Arabic | 0.969 |
| Jerusalem / ירושלים | Latin–Hebrew | 0.892 |
| Tokyo / とうきょう | Latin–Hiragana | ~0.94 |
| London / Londres | Latin–Latin | 0.474 *(correct: phonetically distinct)* |

## Model Architecture

Symphonym uses a **Teacher–Student knowledge distillation** framework.

### Teacher (PhoneticEncoder) — training only
- Input: IPA transcriptions via Epitran (+ 102 extensions), Phonikud, CharsiuG2P
- Representation: PanPhon192 — 24-dim articulatory feature vectors,
  8-bin positional pooling → 192-dim fixed-length input
- Architecture: BiLSTM → Self-Attention → Attention Pooling → 128-dim projection
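
The 8-bin positional pooling that turns a variable-length sequence of 24-dim feature vectors into a fixed 192-dim input could look roughly like this. The pooling operator is an assumption: the sketch takes the mean of the segments falling in each of eight equal-width position bins and leaves empty bins at zero; the actual Teacher pipeline may pool differently.

```python
import numpy as np

def positional_pool(features, n_bins=8):
    """Pool an (n_segments, n_features) array into a flat n_bins * n_features vector."""
    feats = np.asarray(features, dtype=np.float32)
    n_seg, n_feat = feats.shape
    pooled = np.zeros((n_bins, n_feat), dtype=np.float32)
    # Assign each segment to a bin according to its relative position in the name.
    bin_ids = np.minimum((np.arange(n_seg) * n_bins) // max(n_seg, 1), n_bins - 1)
    for b in range(n_bins):
        members = feats[bin_ids == b]
        if len(members):                 # bins with no segment stay at zero
            pooled[b] = members.mean(axis=0)
    return pooled.reshape(-1)

# Five segments with 24 features each (all ones, purely for illustration).
vec = positional_pool(np.ones((5, 24)))
print(vec.shape)
```

With five segments and eight bins, five bins are populated and three stay empty, so short names produce sparse but still fixed-length inputs.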

### Student (UniversalEncoder) — deployed model
- Input: raw Unicode characters + script ID + language ID + length bucket
- Vocabulary: **113,280 tokens** across 20 scripts, 1,944 language codes
- Architecture: Character/Script/Language/Length embeddings →
  Input projection → BiLSTM → Self-Attention (residual) →
  Attention Pooling → 128-dim projection → L2 normalisation
- Parameters: ~8.3M

The **length bucket embedding** (16 buckets, 8-dim) conditions every character
representation on sequence length, mitigating spurious matches between short
toponyms and long compound strings.
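
As a rough illustration of this conditioning, the sketch below maps a name to one of 16 buckets, looks up an 8-dim vector, and appends it to every character embedding. The bucket boundaries (one bucket per length up to 15, a shared bucket for 16 and above), the 64-dim character embedding, and the random tables are all assumptions; the real values live in `config.json` and the trained weights.

```python
import numpy as np

N_BUCKETS, BUCKET_DIM, CHAR_DIM = 16, 8, 64   # CHAR_DIM is a hypothetical size

def length_bucket(name, n_buckets=N_BUCKETS):
    """Map a name's character count to a bucket ID in [0, n_buckets - 1]."""
    return min(max(len(name), 1), n_buckets) - 1

rng = np.random.default_rng(0)
bucket_table = rng.normal(size=(N_BUCKETS, BUCKET_DIM))   # stand-in for learned table

def condition_on_length(char_embeddings, name):
    """Append the name's length-bucket vector to every character embedding."""
    bucket_vec = bucket_table[length_bucket(name)]
    tiled = np.tile(bucket_vec, (char_embeddings.shape[0], 1))
    return np.concatenate([char_embeddings, tiled], axis=1)

name = "London"
char_embeddings = rng.normal(size=(len(name), CHAR_DIM))  # stand-in character vectors
conditioned = condition_on_length(char_embeddings, name)
print(length_bucket(name), conditioned.shape)
```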

### Three-Phase Training Curriculum

| Phase | Objective | Epochs | Notes |
|-------|-----------|--------|-------|
| 1 | Teacher: triplet margin loss on PanPhon192 features | 50 | val_loss 0.0056 |
| 2 | Student–Teacher distillation: α·MSE + (1−α)·cosine | 50 | α=0.5, Student-Teacher cosine 0.942 |
| 3 | Hard negative fine-tuning (triplet, margin=0.3) | 30 | val_loss 0.02122 |
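
The Phase 2 and Phase 3 objectives in the table can be written out in a few lines of NumPy. Two details are assumptions rather than facts from the training code: the "cosine" term is taken to be 1 minus the cosine similarity, and the triplet loss is taken to use Euclidean distance.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two (N, D) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / den

def distillation_loss(student, teacher, alpha=0.5):
    """alpha * MSE + (1 - alpha) * (1 - cosine similarity), averaged over the batch."""
    mse = np.mean((student - teacher) ** 2)
    cos = np.mean(1.0 - cosine_sim(student, teacher))
    return alpha * mse + (1 - alpha) * cos

def triplet_margin_loss(anchor, positive, negative, margin=0.3):
    """max(d(a, p) - d(a, n) + margin, 0), averaged over the batch."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.mean(np.maximum(d_pos - d_neg + margin, 0.0))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 128))
perfect = distillation_loss(teacher, teacher)        # student matches teacher exactly

a = np.array([[1.0, 0.0]])
easy = triplet_margin_loss(a, a, np.array([[0.0, 1.0]]))   # negative already far away
hard = triplet_margin_loss(a, a, a)                        # negative coincides with anchor
print(perfect, easy, hard)
```

A hard negative that sits on top of the anchor incurs the full margin as loss, which is why Phase 3 concentrates on such cases.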

## Evaluation

### MEHDIE Hebrew–Arabic Historical Benchmark (Sagi et al., 2025)

Independent evaluation on medieval Hebrew and Arabic geographical sources — **not in training data**.

| Method | R@1 | R@5 | R@10 | MRR |
|--------|-----|-----|------|-----|
| PanPhon192 (ablation) | 41.1% | 48.2% | 52.3% | 45.0% |
| Levenshtein + AnyAscii | 81.5% | **97.5%** | **99.4%** | 88.5% |
| Jaro-Winkler + AnyAscii | 78.5% | 96.2% | 97.8% | 86.3% |
| **Symphonym v7** | **85.2%** | 97.0% | 97.6% | **90.8%** |

The PanPhon192 ablation (raw articulatory features, no neural training) achieves only
45.0% MRR, less than half of Symphonym's 90.8% and below both string baselines,
which confirms that performance derives from the training curriculum rather than from
the phonetic features alone. Symphonym leads on R@1 and MRR; the Levenshtein baseline
is higher at R@5 (97.5%) and R@10 (99.4%).
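
For readers reproducing the table, R@k and MRR can be computed from the rank at which the correct match appears for each query. This is a generic sketch of the standard definitions, not the evaluation script used for the benchmark; the example ranks are made up, and `None` marks a query whose correct match was not retrieved.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose correct match appears at rank k or better."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries; unretrieved matches contribute 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Illustrative ranks only (1 = correct match ranked first).
ranks = [1, 3, 1, None, 2]
r1 = recall_at_k(ranks, 1)
r5 = recall_at_k(ranks, 5)
mrr = mean_reciprocal_rank(ranks)
print(f"R@1={r1:.3f}  R@5={r5:.3f}  MRR={mrr:.3f}")
```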

### Cross-Script Pair Validation (11,723 pairs, 170+ script combinations)

Pairs were sampled systematically from the training data (up to 10 pairs per
script-pair bin). They test embedding retrieval quality over the full 67M-toponym
index, not generalisation to unseen sources.

| Metric | v6 | v7 |
|--------|----|----|
| Pass rate (≥0.75 cosine) | — | **90.7%** |
| Embedding coverage | ~98% | **100%** |
| Hiragana↔Katakana mean similarity | 0.000 | **0.981** |

**Best-performing script pairs:** Hiragana–Katakana (0.981), Devanagari–Kannada (0.976),
Devanagari–Telugu (0.976), Cyrillic–Latin (0.923, n=1,334), Arabic–Latin (0.898, n=800).
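
The pass rate and per-script-pair means reported above are simple aggregates over pair similarities. A minimal sketch of that aggregation follows; `summarise_pairs` is a hypothetical helper and the four input rows are invented for illustration, not taken from `symphonym_v7_pairs_test_report.json`.

```python
from collections import defaultdict

def summarise_pairs(results, threshold=0.75):
    """Return (pass rate, mean similarity per unordered script pair)."""
    by_pair = defaultdict(list)
    for script_a, script_b, sim in results:
        # Treat (A, B) and (B, A) as the same script combination.
        by_pair[tuple(sorted((script_a, script_b)))].append(sim)
    total = sum(len(sims) for sims in by_pair.values())
    passed = sum(s >= threshold for sims in by_pair.values() for s in sims)
    means = {pair: sum(sims) / len(sims) for pair, sims in by_pair.items()}
    return passed / total, means

# Invented similarities, for illustration only.
example = [
    ("Hiragana", "Katakana", 0.98),
    ("Katakana", "Hiragana", 0.96),
    ("Latin", "Cyrillic", 0.90),
    ("Latin", "Cyrillic", 0.60),
]
pass_rate, means = summarise_pairs(example)
print(pass_rate, means)
```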

## v7 Changes

v6 had 0% IPA coverage for Hiragana (151,980 toponyms) and Katakana (340,555 toponyms),
even though Epitran supports both natively (`jpn-Hira`, `jpn-Kana`). The pipeline
dispatched on language first (`lang=ja`), which routed all Japanese toponyms to CharsiuG2P,
and CharsiuG2P only processes CJK/Kanji. v7 dispatches on the detected script *before*
the language code, restoring IPA coverage for 492,535 toponyms. The model was retrained from scratch.
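
The script-first dispatch can be illustrated with the standard library alone. This is a sketch of the idea, not the pipeline's code: `detect_script` and `choose_g2p` are hypothetical names, script detection here relies on Unicode character names, and the returned strings are labels rather than real backend objects. Only the Epitran codes `jpn-Hira` and `jpn-Kana` come from the description above.

```python
import unicodedata

def detect_script(text):
    """Return the most frequent coarse script label among the letters of text."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("HIRAGANA"):
            label = "Hiragana"
        elif name.startswith("KATAKANA"):
            label = "Katakana"
        elif name.startswith("CJK"):
            label = "CJK"
        else:
            label = "Other"
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get) if counts else "Other"

def choose_g2p(text, lang):
    """Pick a G2P backend label: script is checked before the language code."""
    script = detect_script(text)
    if script == "Hiragana":
        return "epitran:jpn-Hira"
    if script == "Katakana":
        return "epitran:jpn-Kana"
    if script == "CJK":
        return "charsiu"
    return f"by-language:{lang}"      # fall back to language-based routing

for name in ("とうきょう", "トウキョウ", "東京", "Tokyo"):
    print(name, "->", choose_g2p(name, "ja"))
```

All four inputs carry `lang=ja`, yet only the Kanji form is sent to the CharsiuG2P label, which is the routing change v7 introduces.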

## Training Data

Trained on 66.9 million unique toponyms from:

| Source | License |
|--------|---------|
| GeoNames | CC BY 4.0 |
| Wikidata | CC0 |
| Getty TGN | ODC-By 1.0 |

54.0% of training-namespace toponyms received IPA transcription; the remainder
contribute to the Student's character-level learning via distillation.

## Repository Contents

```
model.safetensors          Student (UniversalEncoder) weights
config.json                Architecture hyperparameters
inference.py               Self-contained inference module
requirements.txt           Dependencies
vocab/
  char_vocab.json          113,280-character vocabulary
  lang_vocab.json          1,944 ISO language codes
  script_vocab.json        20 script categories
evaluation/
  mehdie_results_v7_ranking.json
  symphonym_v7_pairs_test_report.json
training_stats/
  coverage_stats.json      IPA coverage by script and language
  phase{1,2,3}_metrics.json
epitran_extensions/        102 custom CSV G2P files
```

## Limitations

- **Phonetic similarity only**: The model does not use geographic coordinates,
  semantic information, or entity types. Phonetically similar but geographically
  unrelated names (Austria/Australia: 0.883) will score highly.
- **Training bias**: Sources over-represent populated places with official names
  in high-resource languages. Performance on under-represented scripts and
  minor places may be weaker.
- **Tonal languages**: PanPhon encodes segmental articulatory features but not
  tone, so names that differ only in tone are not distinguished. Such tonal
  minimal pairs are rare among place names in practice.
- **CJK–Hiragana pairs**: Mean similarity 0.437, reflecting that CharsiuG2P
  produces Mandarin phonetics for Kanji while Epitran produces Japanese readings
  for Hiragana — a genuine phonological mismatch, not a model deficiency.

## Citation

If you use Symphonym in your research, please cite the preprint and the Zenodo dataset:

```bibtex
@misc{symphonym2025,
    author       = {Gadd, Stephen},
    title        = {Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching},
    year         = {2026},
    eprint       = {2601.06932},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
    url          = {https://arxiv.org/abs/2601.06932},
    doi          = {10.48550/arXiv.2601.06932}
}

@dataset{symphonym_v7_zenodo,
    title   = {Symphonym v7 — Universal Phonetic Embeddings for Cross-Script Toponym Matching},
    year    = {2026},
    doi     = {10.5281/zenodo.18682017},
    url     = {https://doi.org/10.5281/zenodo.18682017}
}
```