Naga Languages Similarity Analysis
The first computational study measuring similarity among 36 Naga and Naga-adjacent languages, with English as a reference, using parallel Bible translations.
36 Naga languages + English · 6 similarity methods · NPMI word extraction · Interactive cognate explorer
Live Explorer | Results & Visualizations
What This Is
A computational study of how similar Naga languages are to each other, using parallel Bible translations as a controlled corpus. Thirty-six Naga and Naga-adjacent languages, plus English as a reference, are compared using character-level patterns to find shared vocabulary, cognates, and structural connections.
The goal is not to rank languages or tribes. It is to find the common threads: the shared words, the similar sounds, and the patterns that connect communities across mountains and state borders.
Key Findings
1. Six similarity methods evaluated; three are primary
| Method | Cohen's d (Full) | ARI (Full) | Cohen's d (NT) | ARI (NT) | Status |
|---|---|---|---|---|---|
| Jaccard (top-1000 trigrams) | 1.64 | 0.65 | 1.57 | 0.55 | Best clustering |
| TF-IDF (2-4 char n-grams) | 1.59 | 0.46 | 1.94 | 0.66 | Best pair ranking |
| Subword Vocabulary JSD | 1.27 | 0.44 | 1.37 | 0.32 | Complement |
| Character Trigram Frequency | 1.11 | 0.36 | 1.12 | 0.35 | Marginal |
| Character Frequency | 0.63 | 0.28 | 0.63 | 0.29 | Fails |
| Glot500-m Embeddings | 0.27 | 0.15 | 0.27 | 0.15 | Fails |
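To make the Jaccard row in the table concrete, here is a minimal sketch of set overlap on the most frequent character trigrams. It is an illustration only, not the repository's script: the function names, the lowercasing and whitespace normalisation, and the toy input strings are all assumptions.

```python
from collections import Counter


def top_trigrams(text: str, k: int = 1000) -> set[str]:
    """Return the k most frequent character trigrams in text."""
    # Assumed preprocessing: lowercase and collapse runs of whitespace.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {gram for gram, _ in counts.most_common(k)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy strings standing in for full Bible texts (not real corpus data).
sample_a = "lung lung mei kum thing"
sample_b = "long long mei kum thing"
sim = jaccard(top_trigrams(sample_a), top_trigrams(sample_b))
print(f"Jaccard similarity: {sim:.2f}")
```

Because only set membership matters, this measure ignores how often each trigram occurs once it makes the top-k cut, which is one plausible reason it behaves differently from the frequency-based rows in the table.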
2. Known language families cluster correctly
The dendrograms, t-SNE, and PCA all confirm:
- Ao (Central Naga) and South Patkaian cluster together
- Angami-Pochuri forms a distinct cluster
- Monsang-Moyon pair is the most similar in the dataset
- English is correctly the most distant language
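The ARI columns in the methods table score how well a clustering agrees with known groupings such as these. As a rough sketch, the Adjusted Rand Index can be computed with the standard library alone. The predicted cluster ids below are hypothetical, and the choice of sub-branch names as reference labels is an assumption about how the study was scored.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth: list, pred: list) -> float:
    """ARI between two labelings of the same items (1.0 = identical partitions)."""
    if len(truth) != len(pred):
        raise ValueError("labelings must have the same length")
    # Count item pairs that share a cluster in both, in truth, and in pred.
    pairs_both = sum(comb(n, 2) for n in Counter(zip(truth, pred)).values())
    pairs_truth = sum(comb(n, 2) for n in Counter(truth).values())
    pairs_pred = sum(comb(n, 2) for n in Counter(pred).values())
    total = comb(len(truth), 2)
    expected = pairs_truth * pairs_pred / total if total else 0.0
    max_index = (pairs_truth + pairs_pred) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


# Reference labels use sub-branch names from this README's language table.
languages = ["Ao", "Sangtam", "Angami", "Chokri", "Monsang", "Moyon"]
truth = ["Ao (Central Naga)", "Ao (Central Naga)",
         "Angami-Pochuri", "Angami-Pochuri", "Pakan", "Pakan"]
hypothetical_pred = [0, 0, 1, 2, 2, 2]  # invented cluster ids, not real output
score = adjusted_rand_index(truth, hypothetical_pred)
print(f"ARI: {score:.3f}")
```

ARI is corrected for chance, so a random assignment scores near 0 regardless of the number of clusters, and cluster ids can be permuted without changing the result.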
3. Word extraction reveals shared vocabulary
Using NPMI (normalized pointwise mutual information) on verse-aligned parallel text, plus an independent RAT pass (TF-IDF + BM25 retrieval):
- ~2,000 English concepts mapped to 36 Naga languages
- 9,087 high-confidence translations (both methods agree)
- Genuine cognates confirmed: stone (lung/hlung/long), fire (mei/mela/meze), years (kum), wood (thing)
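The NPMI step can be sketched as follows, assuming co-occurrence is counted once per aligned verse and tokens are split on whitespace. This is a simplified illustration rather than the repository's v8 pipeline: `npmi_table`, `min_count`, and the toy verse pairs (with filler tokens `w1` to `w3`) are invented for the example.

```python
import math
from collections import Counter


def npmi_table(verse_pairs, min_count=2):
    """Score (source_word, target_word) pairs by NPMI over aligned verses."""
    n = len(verse_pairs)
    src_counts, tgt_counts, joint_counts = Counter(), Counter(), Counter()
    for src_verse, tgt_verse in verse_pairs:
        # Count each word at most once per verse (presence, not frequency).
        src_words = set(src_verse.lower().split())
        tgt_words = set(tgt_verse.lower().split())
        src_counts.update(src_words)
        tgt_counts.update(tgt_words)
        joint_counts.update((s, t) for s in src_words for t in tgt_words)
    scores = {}
    for (s, t), c in joint_counts.items():
        if c < min_count:
            continue  # too rare to trust
        p_joint = c / n
        if p_joint == 1.0:
            continue  # both words in every verse: NPMI is 0/0, so skip
        pmi = math.log(p_joint / ((src_counts[s] / n) * (tgt_counts[t] / n)))
        scores[(s, t)] = pmi / -math.log(p_joint)
    return scores


# Invented toy alignment; "lung" and "mei" are forms quoted in this README,
# w1..w3 are placeholder tokens, and none of this is real Bible text.
toy_verses = [
    ("stone fell", "lung w1"),
    ("stone fire", "lung mei"),
    ("fire burned", "mei w2"),
    ("fire water", "mei w3"),
]
scores = npmi_table(toy_verses)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

NPMI ranges from -1 to 1, with 1 meaning the two words always occur together, which makes scores comparable across frequent and rare words in a way raw PMI is not.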
4. Interactive explorer
Open the Naga Cognate Explorer
Search for any English word and see how all 36 Naga languages express it, with similar-sounding words grouped by color.
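One way to group similar-sounding forms is greedy clustering on string similarity. The sketch below uses `difflib.SequenceMatcher` from the standard library. The threshold, the palette, and the greedy strategy are assumptions for illustration, and the explorer itself may use a different measure.

```python
from difflib import SequenceMatcher

PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]  # arbitrary example colours


def group_forms(forms, threshold=0.6):
    """Greedily group forms whose similarity to a group's first member meets threshold."""
    groups = []
    for form in forms:
        for group in groups:
            if SequenceMatcher(None, form, group[0]).ratio() >= threshold:
                group.append(form)
                break
        else:
            groups.append([form])  # no existing group is close enough
    return groups


def colour_map(groups):
    """Assign one palette colour per group, cycling if groups outnumber colours."""
    return {form: PALETTE[i % len(PALETTE)]
            for i, group in enumerate(groups) for form in group}


# Forms quoted in this README's findings; "mei" is included only to show
# that an unrelated form ends up in its own group.
forms = ["lung", "hlung", "long", "mei"]
groups = group_forms(forms)
colours = colour_map(groups)
print(groups)
```

Orthographic similarity is only a proxy for sound: spelling conventions differ between translations, so groups produced this way are candidates for review rather than confirmed cognate sets.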
Languages Studied (37)
| Sub-Branch | Languages | Count |
|---|---|---|
| Ao (Central Naga) | Ao, Sangtam, Yimchunger | 3 |
| Angami-Pochuri | Angami, Chokri, Pochuri, Poumei, Rengma N, Rengma S, Sumi | 7 |
| South Patkaian | Chang, Phom, Konyak, Wancho, Khiamniungan | 5 |
| North Patkaian | ASII Lakdap (Tangsa), Nocte, Tutsa, Hawa, Muklom, Chuyo | 6 |
| Tangkhul-Maring | Maring, Tangkhul | 2 |
| Zemeic | Liangmai, Rongmei, Zeme, Maram | 4 |
| Pakan | Anal, Chothe, Kharam, Lamkang, Monsang, Moyon, Tarao, Kom | 8 |
| Creole | Nagamese | 1 |
| Reference | English (NRSVUE) | 1 |
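The sub-branch labels in this table are a natural grouping for a separability score. As a sketch, and assuming the Cohen's d values reported in the findings compare same-sub-branch pairs against cross-sub-branch pairs (this README does not spell that out), the computation could look like this. The similarity scores are invented, and `split_pairs` and `pair` are hypothetical helper names.

```python
from itertools import combinations
from statistics import mean, variance


def cohens_d(within, between):
    """Cohen's d with pooled sample standard deviation (needs >= 2 values per group)."""
    n1, n2 = len(within), len(between)
    pooled_var = ((n1 - 1) * variance(within)
                  + (n2 - 1) * variance(between)) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("zero variance: effect size is undefined")
    return (mean(within) - mean(between)) / pooled_var ** 0.5


def split_pairs(similarity, branch_of):
    """Split pairwise scores into same-branch and cross-branch lists."""
    within, between = [], []
    for a, b in combinations(sorted(branch_of), 2):
        score = similarity[frozenset((a, b))]
        (within if branch_of[a] == branch_of[b] else between).append(score)
    return within, between


def pair(a, b):
    return frozenset((a, b))


# Sub-branch labels come from the table; the scores below are made up.
branch_of = {"Ao": "Ao (Central Naga)", "Sangtam": "Ao (Central Naga)",
             "Monsang": "Pakan", "Moyon": "Pakan"}
similarity = {
    pair("Ao", "Sangtam"): 0.60, pair("Monsang", "Moyon"): 0.80,
    pair("Ao", "Monsang"): 0.20, pair("Ao", "Moyon"): 0.30,
    pair("Sangtam", "Monsang"): 0.25, pair("Sangtam", "Moyon"): 0.35,
}
within, between = split_pairs(similarity, branch_of)
d = cohens_d(within, between)
print(f"Cohen's d: {d:.2f}")
```

A larger d means same-branch pairs score clearly higher than cross-branch pairs relative to the spread of the scores, which is the property that separates the useful methods from the failing ones.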
Running
git clone https://huggingface.co/Rantunvsn/naga-languages-similarity
cd naga-languages-similarity
pip install -r requirements.txt
# Similarity analysis
python scripts/02d_clean_bible_texts.py
python scripts/03_compute_cosine_similarity.py
python scripts/03b_additional_similarity_methods.py
python scripts/04_tsne_visualization.py
# Word extraction pipeline (NPMI + RAT + merge + atlas)
python scripts/05_extract_similar_words.py # NPMI v8
python scripts/05b_cognate_supported_translations.py # Score-filtered cognate rows
python scripts/10_rat_word_retrieval.py # RAT v10.3 (independent)
python scripts/11_compare_pmi_rat.py # Quality-filtered merge
python scripts/15_naga_cognate_atlas.py # Cognate atlas
python scripts/16_naga_cognate_viz.py # Interactive HTML explorer
Repository Contents
- 35 Python scripts: full pipeline from data collection to visualization
- 17 documentation files: methodology, findings, verification for every phase
- Similarity matrices: 6 methods × full/NT, with Cohen's d and ARI metrics
- Visualizations: dendrograms, heatmaps, t-SNE, PCA, phylogenetic trees
- Word tables: NPMI translations, RAT extractions, merged high-confidence table
- Interactive explorer: results/final/naga_cognate_explorer.html
Data Source
Bible texts were obtained from bible.com (YouVersion). All translations © Bible Society of India (BSI). Original texts are NOT redistributed; only derived analytical outputs are included.
References
- Burling, R. (2003). The Tibeto-Burman languages of NE India. In The Sino-Tibetan Languages. Routledge.
- Hammarström, H., et al. (2024). Glottolog 5.0. https://glottolog.org
- Wang, Y., et al. (2023). mPLM-Sim. EMNLP 2023. arXiv:2305.13684.
- Cavnar, W.B. & Trenkle, J.M. (1994). N-gram based text categorization.
- Lu, Y., et al. (2023). Chain-of-Dictionary Prompting. EMNLP 2023. arXiv:2305.06575.
Citation
If you use this work, please cite:
@misc{naga-languages-similarity,
title={Naga Languages Computational Similarity Analysis},
author={Rantunvsn},
year={2026},
url={https://huggingface.co/Rantunvsn/naga-languages-similarity}
}
37 languages · 6 methods · 173 files · ~34 MB analytical outputs
