---
language:
- multilingual
- ar
- zh
- ru
- ja
- ko
- he
- fa
- hi
- el
- ka
- am
- hy
- ur
- bn
- ta
- te
- th
tags:
- toponym-matching
- cross-script
- phonetic-embeddings
- geospatial
- named-entity
- information-retrieval
- teacher-student
- knowledge-distillation
license: cc-by-4.0
datasets:
- geonames
- wikidata
- getty-tgn
metrics:
- recall@k
- mrr
pipeline_tag: feature-extraction
---

# Symphonym v7 — Universal Phonetic Embeddings for Cross-Script Toponym Matching

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18682017.svg)](https://doi.org/10.5281/zenodo.18682017)

Symphonym maps toponyms (place names) from **20 writing systems** into a unified
**128-dimensional phonetic embedding space**, enabling direct cross-script similarity
comparison without runtime phonetic conversion or language identification.

> *"London" / "Лондон" / "伦敦" / "لندن" → [0.12, -0.34, …] (all nearby)*

## Intended Use

- **Cross-script toponym matching** in geographic databases and gazetteers
- **Phonetic search** — retrieve results for a place name entered in any script
- **Historical record linkage** — match pre-standardisation spelling variants
- **Multilingual named entity linking** in NLP pipelines
- **Digital humanities** — reconciling place references across archival sources

The model operates on **phonetic similarity**, not semantic or orthographic similarity.
It is designed as a **candidate retrieval** component within a larger reconciliation
pipeline, where candidates are subsequently filtered by geographic proximity and
other constraints.

## Quick Start

```python
from inference import SymphonymModel

model = SymphonymModel()  # loads weights from this directory

# Single similarity score
sim = model.similarity("London", "en", "Лондон", "ru")
print(f"London / Лондон: {sim:.3f}")  # → 0.991

# Batch embeddings (N × 128 numpy array)
embeddings = model.batch_embed([
    ("London", "en"),
    ("Лондон", "ru"),
    ("伦敦", "zh"),
    ("لندن", "ar"),
    ("ლონდონი", "ka"),
])
```

### With `huggingface_hub`

```python
import sys

from huggingface_hub import snapshot_download

model_dir = snapshot_download("docuracy/symphonym-v7")
sys.path.insert(0, model_dir)  # inference.py ships alongside the weights

from inference import SymphonymModel

model = SymphonymModel(model_dir=model_dir)
```

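Because the embeddings are L2-normalised (see the architecture below), cosine similarity reduces to a dot product, so candidate retrieval over a small gazetteer can be sketched with plain NumPy. The mini-gazetteer and the Italian query are illustrative; `model` is the instance created above:

```python
import numpy as np

# Illustrative mini-gazetteer, embedded once with the model from Quick Start
entries = [("London", "en"), ("Лондон", "ru"), ("伦敦", "zh"),
           ("لندن", "ar"), ("ლონდონი", "ka")]
index = model.batch_embed(entries)                 # (5, 128), rows L2-normalised

query = model.batch_embed([("Londra", "it")])[0]   # (128,)
scores = index @ query                             # cosine similarity via dot product
for i in np.argsort(-scores)[:3]:                  # top-3 candidates
    print(f"{entries[i][0]}\t{scores[i]:.3f}")
```

At gazetteer scale (tens of millions of rows), the same dot-product search would typically be delegated to an approximate nearest-neighbour index.
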
## Representative Cross-Script Similarities

| Pair | Scripts | Similarity |
|------|---------|------------|
| London / Лондон | Latin–Cyrillic | 0.991 |
| Athens / Αθήνα | Latin–Greek | 0.980 |
| Beijing / 北京 | Latin–CJK | 0.955 |
| Baghdad / بغداد | Latin–Arabic | 0.969 |
| Jerusalem / ירושלים | Latin–Hebrew | 0.892 |
| Tokyo / とうきょう | Latin–Hiragana | ~0.94 |
| London / Londres | Latin–Latin | 0.474 *(correct: phonetically distinct)* |

## Model Architecture

Symphonym uses a **Teacher–Student knowledge distillation** framework.

### Teacher (PhoneticEncoder) — training only
- Input: IPA transcriptions via Epitran (+ 102 extensions), Phonikud, CharsiuG2P
- Representation: PanPhon192 — 24-dim articulatory feature vectors,
  8-bin positional pooling → 192-dim fixed-length input (sketched after this list)
- Architecture: BiLSTM → Self-Attention → Attention Pooling → 128-dim projection

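How a PanPhon192 input can be assembled is sketched below, assuming the `panphon` package's standard 24-feature segment vectors; the `panphon192` helper and its bin handling are illustrative, not the exact training code:

```python
import numpy as np
import panphon

ft = panphon.FeatureTable()

def panphon192(ipa: str) -> np.ndarray:
    """Pool variable-length IPA segment features into 8 positional bins
    of mean 24-dim articulatory vectors -> fixed 192-dim input."""
    segs = np.array(ft.word_to_vector_list(ipa, numeric=True), dtype=np.float32)
    pooled = [b.mean(axis=0) if len(b) else np.zeros(24, dtype=np.float32)
              for b in np.array_split(segs, 8)]   # 8 bins along the segment axis
    return np.concatenate(pooled)                 # shape (192,)

print(panphon192("lʌndən").shape)                 # (192,)
```
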
### Student (UniversalEncoder) — deployed model
- Input: raw Unicode characters + script ID + language ID + length bucket
- Vocabulary: **113,280 tokens** across 20 scripts, 1,944 language codes
- Architecture: Character/Script/Language/Length embeddings →
  Input projection → BiLSTM → Self-Attention (residual) →
  Attention Pooling → 128-dim projection → L2 normalisation
- Parameters: ~8.3M

The **length bucket embedding** (16 buckets, 8-dim) conditions every character
representation on sequence length, mitigating spurious matches between short
toponyms and long compound strings.

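A minimal PyTorch sketch of how these components might compose. Only the figures stated above (128-dim output, 16 length buckets, 8-dim length embedding, vocabulary sizes) come from this card; the hidden size, head count, and embedding widths for script and language are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniversalEncoderSketch(nn.Module):
    def __init__(self, n_chars=113_280, n_scripts=20, n_langs=1_944,
                 n_len_buckets=16, d_char=64, d_ctx=8, d_hidden=256, d_out=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_char)
        self.script_emb = nn.Embedding(n_scripts, d_ctx)
        self.lang_emb = nn.Embedding(n_langs, d_ctx)
        self.len_emb = nn.Embedding(n_len_buckets, d_ctx)  # 16 buckets, 8-dim
        self.proj_in = nn.Linear(d_char + 3 * d_ctx, d_hidden)
        self.bilstm = nn.LSTM(d_hidden, d_hidden // 2, batch_first=True,
                              bidirectional=True)
        self.attn = nn.MultiheadAttention(d_hidden, 4, batch_first=True)
        self.pool = nn.Linear(d_hidden, 1)   # attention-pooling scores
        self.proj_out = nn.Linear(d_hidden, d_out)

    def forward(self, chars, script, lang, len_bucket):
        B, T = chars.shape
        # Broadcast script/language/length context onto every character
        ctx = torch.cat([self.script_emb(script), self.lang_emb(lang),
                         self.len_emb(len_bucket)], dim=-1)        # (B, 3*d_ctx)
        x = torch.cat([self.char_emb(chars),
                       ctx.unsqueeze(1).expand(B, T, -1)], dim=-1)
        x = self.proj_in(x)
        x, _ = self.bilstm(x)
        a, _ = self.attn(x, x, x)
        x = x + a                                    # residual self-attention
        w = self.pool(x).softmax(dim=1)              # (B, T, 1)
        x = (w * x).sum(dim=1)                       # attention pooling
        return F.normalize(self.proj_out(x), dim=-1) # L2-normalised 128-dim
```
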
### Three-Phase Training Curriculum

| Phase | Objective | Epochs | Notes |
|-------|-----------|--------|-------|
| 1 | Teacher: triplet margin loss on PanPhon192 features | 50 | val_loss 0.0056 |
| 2 | Student–Teacher distillation: α·MSE + (1−α)·cosine | 50 | α=0.5, Student–Teacher cosine 0.942 |
| 3 | Hard negative fine-tuning (triplet, margin=0.3) | 30 | val_loss 0.02122 |

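Read as a loss, the Phase 2 objective combines an MSE term with a cosine term. A minimal sketch, assuming mean reductions and that the cosine term is cosine *distance* (1 − similarity), with α = 0.5 as in the table:

```python
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, alpha=0.5):
    """Phase 2 objective: alpha * MSE + (1 - alpha) * cosine distance."""
    mse = F.mse_loss(student_emb, teacher_emb)
    cos_dist = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    return alpha * mse + (1.0 - alpha) * cos_dist
```
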
## Evaluation

### MEHDIE Hebrew–Arabic Historical Benchmark (Sagi et al., 2025)

Independent evaluation on medieval Hebrew and Arabic geographical sources — **not in training data**.

| Method | R@1 | R@5 | R@10 | MRR |
|--------|-----|-----|------|-----|
| PanPhon192 (ablation) | 41.1% | 48.2% | 52.3% | 45.0% |
| Levenshtein + AnyAscii | 81.5% | 97.5% | 99.4% | 88.5% |
| Jaro–Winkler + AnyAscii | 78.5% | 96.2% | 97.8% | 86.3% |
| **Symphonym v7** | **85.2%** | **97.0%** | **97.6%** | **90.8%** |

The PanPhon192 ablation (raw articulatory features, no neural training) achieves only
45.0% MRR — less than half Symphonym's score and below the string baselines —
confirming that performance derives from the training curriculum, not the phonetic
features alone.

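For reference, the metrics reported here can be computed from per-query ranked candidate lists as follows; this is a generic sketch, not the benchmark's evaluation harness:

```python
def recall_at_k(ranked, gold, k):
    """Fraction of queries whose gold match appears in the top-k candidates."""
    return sum(g in r[:k] for r, g in zip(ranked, gold)) / len(gold)

def mrr(ranked, gold):
    """Mean reciprocal rank of the gold match (contributes 0 when absent)."""
    return sum(1.0 / (r.index(g) + 1)
               for r, g in zip(ranked, gold) if g in r) / len(gold)
```
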
### Cross-Script Pair Validation (11,723 pairs, 170+ script combinations)

Systematically sampled from training data (up to 10 pairs per script-pair bin); these
test embedding retrieval quality over the full 67M-toponym index, not generalisation
to unseen sources.

| Metric | v6 | v7 |
|--------|----|----|
| Pass rate (≥0.75 cosine) | — | **90.7%** |
| Embedding coverage | ~98% | **100%** |
| Hiragana↔Katakana mean similarity | 0.000 | **0.981** |

**Best-performing script pairs:** Hiragana–Katakana (0.981), Devanagari–Kannada (0.976),
Devanagari–Telugu (0.976), Cyrillic–Latin (0.923, n=1,334), Arabic–Latin (0.898, n=800).

## v7 Changes

v6 exhibited 0% IPA coverage for Hiragana (151,980 toponyms) and Katakana (340,555 toponyms),
despite both being natively supported by Epitran (`jpn-Hira`, `jpn-Kana`). The pipeline
dispatched by language first (`lang=ja`), routing all Japanese toponyms to CharsiuG2P,
which only processes CJK/Kanji. v7 fixes this by dispatching on detected script *before*
language code, restoring IPA coverage for 492,535 toponyms; see the sketch below. The
model was retrained from scratch.

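A schematic of the corrected, script-first routing; the function names, backend labels, and script-detection heuristic are illustrative assumptions rather than the pipeline's actual code:

```python
import unicodedata

def detect_scripts(s: str) -> set:
    """First word of each character's Unicode name, e.g. HIRAGANA, KATAKANA, CJK."""
    return {unicodedata.name(c, "?").split()[0] for c in s if not c.isspace()}

def g2p_route(toponym: str, lang: str) -> str:
    """v7: dispatch on detected script *before* the language code."""
    scripts = detect_scripts(toponym)
    if "HIRAGANA" in scripts:
        return "epitran:jpn-Hira"
    if "KATAKANA" in scripts:
        return "epitran:jpn-Kana"
    if "CJK" in scripts:
        return "charsiu-g2p"      # Kanji / Han ideographs
    return f"epitran:{lang}"      # fall back to language-code dispatch

print(g2p_route("とうきょう", "ja"))  # epitran:jpn-Hira (v6 sent this to CharsiuG2P)
```
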
## Training Data

Trained on 66.9 million unique toponyms from:

| Source | License |
|--------|---------|
| GeoNames | CC BY 4.0 |
| Wikidata | CC0 |
| Getty TGN | ODC-By 1.0 |

54.0% of training-namespace toponyms received IPA transcription; the remainder
contribute to the Student's character-level learning via distillation.

## Repository Contents

```
model.safetensors      Student (UniversalEncoder) weights
config.json            Architecture hyperparameters
inference.py           Self-contained inference module
requirements.txt       Dependencies
vocab/
  char_vocab.json      113,280-character vocabulary
  lang_vocab.json      1,944 ISO language codes
  script_vocab.json    20 script categories
evaluation/
  mehdie_results_v7_ranking.json
  symphonym_v7_pairs_test_report.json
training_stats/
  coverage_stats.json  IPA coverage by script and language
  phase{1,2,3}_metrics.json
epitran_extensions/    102 custom CSV G2P files
```

## Limitations

- **Phonetic similarity only**: The model does not use geographic coordinates,
  semantic information, or entity types. Phonetically similar but geographically
  unrelated names (Austria/Australia: 0.883) will score highly.
- **Training bias**: Sources over-represent populated places with official names
  in high-resource languages. Performance on under-represented scripts and
  lesser-known places may be weaker.
- **Tonal languages**: PanPhon encodes segmental articulatory features but not
  tone. Tonal minimal pairs in place names are rare in practice.
- **CJK–Hiragana pairs**: Mean similarity 0.437, reflecting that CharsiuG2P
  produces Mandarin phonetics for Kanji while Epitran produces Japanese readings
  for Hiragana — a genuine phonological mismatch, not a model deficiency.

## Citation

If you use Symphonym in your research, please cite the preprint and the Zenodo dataset:

```bibtex
@misc{symphonym2025,
  author        = {Gadd, Stephen},
  title         = {Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching},
  year          = {2026},
  eprint        = {2601.06932},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2601.06932},
  doi           = {10.48550/arXiv.2601.06932}
}

@dataset{symphonym_v7_zenodo,
  title = {Symphonym v7 — Universal Phonetic Embeddings for Cross-Script Toponym Matching},
  year  = {2026},
  doi   = {10.5281/zenodo.18682017},
  url   = {https://doi.org/10.5281/zenodo.18682017}
}
```