---
language:
- multilingual
- ar
- zh
- ru
- ja
- ko
- he
- fa
- hi
- el
- ka
- am
- hy
- ur
- bn
- ta
- te
- th
tags:
- toponym-matching
- cross-script
- phonetic-embeddings
- geospatial
- named-entity
- information-retrieval
- teacher-student
- knowledge-distillation
license: cc-by-4.0
datasets:
- geonames
- wikidata
- getty-tgn
metrics:
- recall@k
- mrr
pipeline_tag: feature-extraction
---

# Symphonym v7 — Universal Phonetic Embeddings for Cross-Script Toponym Matching

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18682017.svg)](https://doi.org/10.5281/zenodo.18682017)

Symphonym maps toponyms (place names) from **20 writing systems** into a unified
**128-dimensional phonetic embedding space**, enabling direct cross-script similarity
comparison without runtime phonetic conversion or language identification.

> *"London" / "Лондон" / "伦敦" / "لندن" → [0.12, -0.34, …] (all nearby)*

## Intended Use

- **Cross-script toponym matching** in geographic databases and gazetteers
- **Phonetic search** — retrieve results for a place name entered in any script
- **Historical record linkage** — match pre-standardisation spelling variants
- **Multilingual named entity linking** in NLP pipelines
- **Digital humanities** — reconciling place references across archival sources

The model operates on **phonetic similarity**, not semantic or orthographic similarity.
It is designed as a **candidate retrieval** component within a larger reconciliation
pipeline, where candidates are subsequently filtered by geographic proximity and
other constraints.

## Quick Start

```python
from inference import SymphonymModel

model = SymphonymModel()  # loads weights from this directory

# Single similarity score
sim = model.similarity("London", "en", "Лондон", "ru")
print(f"London / Лондон: {sim:.3f}")  # → 0.991

# Batch embeddings (N × 128 numpy array)
embeddings = model.batch_embed([
    ("London", "en"),
    ("Лондон", "ru"),
    ("伦敦", "zh"),
    ("لندن", "ar"),
    ("ლონდონი", "ka"),
])
```

### With `huggingface_hub`

```python
import sys

from huggingface_hub import snapshot_download

model_dir = snapshot_download("docuracy/symphonym-v7")
sys.path.insert(0, model_dir)  # inference.py ships alongside the weights

from inference import SymphonymModel

model = SymphonymModel(model_dir=model_dir)
```

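Because the embeddings are L2-normalised (see the architecture below), cosine similarity reduces to a dot product, so candidate retrieval over a small gazetteer can be sketched with plain NumPy. The mini-gazetteer and the Italian query are illustrative; `model` is the instance created above:

```python
import numpy as np

# Illustrative mini-gazetteer, embedded once with the model from Quick Start
entries = [("London", "en"), ("Лондон", "ru"), ("伦敦", "zh"),
           ("لندن", "ar"), ("ლონდონი", "ka")]
index = model.batch_embed(entries)                 # (5, 128), rows L2-normalised

query = model.batch_embed([("Londra", "it")])[0]   # (128,)
scores = index @ query                             # cosine similarity via dot product
for i in np.argsort(-scores)[:3]:                  # top-3 candidates
    print(f"{entries[i][0]}\t{scores[i]:.3f}")
```

At gazetteer scale (tens of millions of rows), the same dot-product search would typically be delegated to an approximate nearest-neighbour index.
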
## Representative Cross-Script Similarities

| Pair | Scripts | Similarity |
|------|---------|------------|
| London / Лондон | Latin–Cyrillic | 0.991 |
| Athens / Αθήνα | Latin–Greek | 0.980 |
| Beijing / 北京 | Latin–CJK | 0.955 |
| Baghdad / بغداد | Latin–Arabic | 0.969 |
| Jerusalem / ירושלים | Latin–Hebrew | 0.892 |
| Tokyo / とうきょう | Latin–Hiragana | ~0.94 |
| London / Londres | Latin–Latin | 0.474 *(correct: phonetically distinct)* |

## Model Architecture

Symphonym uses a **Teacher–Student knowledge distillation** framework.

### Teacher (PhoneticEncoder) — training only
- Input: IPA transcriptions via Epitran (+ 102 extensions), Phonikud, CharsiuG2P
- Representation: PanPhon192 — 24-dim articulatory feature vectors,
  8-bin positional pooling → 192-dim fixed-length input (sketched after this list)
- Architecture: BiLSTM → Self-Attention → Attention Pooling → 128-dim projection

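How a PanPhon192 input can be assembled is sketched below, assuming the `panphon` package's standard 24-feature segment vectors; the `panphon192` helper and its bin handling are illustrative, not the exact training code:

```python
import numpy as np
import panphon

ft = panphon.FeatureTable()

def panphon192(ipa: str) -> np.ndarray:
    """Pool variable-length IPA segment features into 8 positional bins
    of mean 24-dim articulatory vectors -> fixed 192-dim input."""
    segs = np.array(ft.word_to_vector_list(ipa, numeric=True), dtype=np.float32)
    pooled = [b.mean(axis=0) if len(b) else np.zeros(24, dtype=np.float32)
              for b in np.array_split(segs, 8)]   # 8 bins along the segment axis
    return np.concatenate(pooled)                 # shape (192,)

print(panphon192("lʌndən").shape)                 # (192,)
```
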
### Student (UniversalEncoder) — deployed model
- Input: raw Unicode characters + script ID + language ID + length bucket
- Vocabulary: **113,280 tokens** across 20 scripts, 1,944 language codes
- Architecture: Character/Script/Language/Length embeddings →
  Input projection → BiLSTM → Self-Attention (residual) →
  Attention Pooling → 128-dim projection → L2 normalisation
- Parameters: ~8.3M

The **length bucket embedding** (16 buckets, 8-dim) conditions every character
representation on sequence length, mitigating spurious matches between short
toponyms and long compound strings.

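A minimal PyTorch sketch of how these components might compose. Only the figures stated above (128-dim output, 16 length buckets, 8-dim length embedding, vocabulary sizes) come from this card; the hidden size, head count, and embedding widths for script and language are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniversalEncoderSketch(nn.Module):
    def __init__(self, n_chars=113_280, n_scripts=20, n_langs=1_944,
                 n_len_buckets=16, d_char=64, d_ctx=8, d_hidden=256, d_out=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_char)
        self.script_emb = nn.Embedding(n_scripts, d_ctx)
        self.lang_emb = nn.Embedding(n_langs, d_ctx)
        self.len_emb = nn.Embedding(n_len_buckets, d_ctx)  # 16 buckets, 8-dim
        self.proj_in = nn.Linear(d_char + 3 * d_ctx, d_hidden)
        self.bilstm = nn.LSTM(d_hidden, d_hidden // 2, batch_first=True,
                              bidirectional=True)
        self.attn = nn.MultiheadAttention(d_hidden, 4, batch_first=True)
        self.pool = nn.Linear(d_hidden, 1)   # attention-pooling scores
        self.proj_out = nn.Linear(d_hidden, d_out)

    def forward(self, chars, script, lang, len_bucket):
        B, T = chars.shape
        # Broadcast script/language/length context onto every character
        ctx = torch.cat([self.script_emb(script), self.lang_emb(lang),
                         self.len_emb(len_bucket)], dim=-1)        # (B, 3*d_ctx)
        x = torch.cat([self.char_emb(chars),
                       ctx.unsqueeze(1).expand(B, T, -1)], dim=-1)
        x = self.proj_in(x)
        x, _ = self.bilstm(x)
        a, _ = self.attn(x, x, x)
        x = x + a                                    # residual self-attention
        w = self.pool(x).softmax(dim=1)              # (B, T, 1)
        x = (w * x).sum(dim=1)                       # attention pooling
        return F.normalize(self.proj_out(x), dim=-1) # L2-normalised 128-dim
```
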
### Three-Phase Training Curriculum

| Phase | Objective | Epochs | Notes |
|-------|-----------|--------|-------|
| 1 | Teacher: triplet margin loss on PanPhon192 features | 50 | val_loss 0.0056 |
| 2 | Student–Teacher distillation: α·MSE + (1−α)·cosine | 50 | α=0.5, Student–Teacher cosine 0.942 |
| 3 | Hard negative fine-tuning (triplet, margin=0.3) | 30 | val_loss 0.02122 |

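Read as a loss, the Phase 2 objective combines an MSE term with a cosine term. A minimal sketch, assuming mean reductions and that the cosine term is cosine *distance* (1 − similarity), with α = 0.5 as in the table:

```python
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, alpha=0.5):
    """Phase 2 objective: alpha * MSE + (1 - alpha) * cosine distance."""
    mse = F.mse_loss(student_emb, teacher_emb)
    cos_dist = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    return alpha * mse + (1.0 - alpha) * cos_dist
```
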
## Evaluation

### MEHDIE Hebrew–Arabic Historical Benchmark (Sagi et al., 2025)

Independent evaluation on medieval Hebrew and Arabic geographical sources — **not in training data**.

| Method | R@1 | R@5 | R@10 | MRR |
|--------|-----|-----|------|-----|
| PanPhon192 (ablation) | 41.1% | 48.2% | 52.3% | 45.0% |
| Levenshtein + AnyAscii | 81.5% | 97.5% | 99.4% | 88.5% |
| Jaro–Winkler + AnyAscii | 78.5% | 96.2% | 97.8% | 86.3% |
| **Symphonym v7** | **85.2%** | **97.0%** | **97.6%** | **90.8%** |

The PanPhon192 ablation (raw articulatory features, no neural training) achieves only
45.0% MRR — less than half Symphonym's score and below the string baselines —
confirming that performance derives from the training curriculum, not the phonetic
features alone.

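For reference, the metrics reported here can be computed from per-query ranked candidate lists as follows; this is a generic sketch, not the benchmark's evaluation harness:

```python
def recall_at_k(ranked, gold, k):
    """Fraction of queries whose gold match appears in the top-k candidates."""
    return sum(g in r[:k] for r, g in zip(ranked, gold)) / len(gold)

def mrr(ranked, gold):
    """Mean reciprocal rank of the gold match (contributes 0 when absent)."""
    return sum(1.0 / (r.index(g) + 1)
               for r, g in zip(ranked, gold) if g in r) / len(gold)
```
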
### Cross-Script Pair Validation (11,723 pairs, 170+ script combinations)

Systematically sampled from training data (up to 10 pairs per script-pair bin); these
test embedding retrieval quality over the full 67M-toponym index, not generalisation
to unseen sources.

| Metric | v6 | v7 |
|--------|----|----|
| Pass rate (≥0.75 cosine) | — | **90.7%** |
| Embedding coverage | ~98% | **100%** |
| Hiragana↔Katakana mean similarity | 0.000 | **0.981** |

**Best-performing script pairs:** Hiragana–Katakana (0.981), Devanagari–Kannada (0.976),
Devanagari–Telugu (0.976), Cyrillic–Latin (0.923, n=1,334), Arabic–Latin (0.898, n=800).

## v7 Changes

v6 exhibited 0% IPA coverage for Hiragana (151,980 toponyms) and Katakana (340,555 toponyms),
despite both being natively supported by Epitran (`jpn-Hira`, `jpn-Kana`). The pipeline
dispatched by language first (`lang=ja`), routing all Japanese toponyms to CharsiuG2P,
which only processes CJK/Kanji. v7 fixes this by dispatching on detected script *before*
language code, restoring IPA coverage for 492,535 toponyms; see the sketch below. The
model was retrained from scratch.

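A schematic of the corrected, script-first routing; the function names, backend labels, and script-detection heuristic are illustrative assumptions rather than the pipeline's actual code:

```python
import unicodedata

def detect_scripts(s: str) -> set:
    """First word of each character's Unicode name, e.g. HIRAGANA, KATAKANA, CJK."""
    return {unicodedata.name(c, "?").split()[0] for c in s if not c.isspace()}

def g2p_route(toponym: str, lang: str) -> str:
    """v7: dispatch on detected script *before* the language code."""
    scripts = detect_scripts(toponym)
    if "HIRAGANA" in scripts:
        return "epitran:jpn-Hira"
    if "KATAKANA" in scripts:
        return "epitran:jpn-Kana"
    if "CJK" in scripts:
        return "charsiu-g2p"      # Kanji / Han ideographs
    return f"epitran:{lang}"      # fall back to language-code dispatch

print(g2p_route("とうきょう", "ja"))  # epitran:jpn-Hira (v6 sent this to CharsiuG2P)
```
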
## Training Data

Trained on 66.9 million unique toponyms from:

| Source | License |
|--------|---------|
| GeoNames | CC BY 4.0 |
| Wikidata | CC0 |
| Getty TGN | ODC-By 1.0 |

54.0% of training-namespace toponyms received IPA transcription; the remainder
contribute to the Student's character-level learning via distillation.

## Repository Contents

```
model.safetensors      Student (UniversalEncoder) weights
config.json            Architecture hyperparameters
inference.py           Self-contained inference module
requirements.txt       Dependencies
vocab/
  char_vocab.json      113,280-character vocabulary
  lang_vocab.json      1,944 ISO language codes
  script_vocab.json    20 script categories
evaluation/
  mehdie_results_v7_ranking.json
  symphonym_v7_pairs_test_report.json
training_stats/
  coverage_stats.json  IPA coverage by script and language
  phase{1,2,3}_metrics.json
epitran_extensions/    102 custom CSV G2P files
```

## Limitations

- **Phonetic similarity only**: The model does not use geographic coordinates,
  semantic information, or entity types. Phonetically similar but geographically
  unrelated names (Austria/Australia: 0.883) will score highly.
- **Training bias**: Sources over-represent populated places with official names
  in high-resource languages. Performance on under-represented scripts and
  lesser-known places may be weaker.
- **Tonal languages**: PanPhon encodes segmental articulatory features but not
  tone. Tonal minimal pairs in place names are rare in practice.
- **CJK–Hiragana pairs**: Mean similarity 0.437, reflecting that CharsiuG2P
  produces Mandarin phonetics for Kanji while Epitran produces Japanese readings
  for Hiragana — a genuine phonological mismatch, not a model deficiency.

## Citation

If you use Symphonym in your research, please cite the preprint and the Zenodo dataset:

```bibtex
@misc{symphonym2025,
  author        = {Gadd, Stephen},
  title         = {Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching},
  year          = {2026},
  eprint        = {2601.06932},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2601.06932},
  doi           = {10.48550/arXiv.2601.06932}
}

@dataset{symphonym_v7_zenodo,
  title = {Symphonym v7 — Universal Phonetic Embeddings for Cross-Script Toponym Matching},
  year  = {2026},
  doi   = {10.5281/zenodo.18682017},
  url   = {https://doi.org/10.5281/zenodo.18682017}
}
```