🏔️ Naga Languages Similarity Analysis

The first computational study measuring similarity among 36 Naga and Naga-adjacent languages, with English as a reference, using parallel Bible translations.

36 Naga languages + English · 6 similarity methods · NPMI word extraction · Interactive cognate explorer

🔗 Live Explorer → | 📊 Results & Visualizations

What This Is

A computational study of how similar Naga languages are to each other, measured using parallel Bible translations as a controlled corpus. Thirty-six Naga and Naga-adjacent languages, plus English, are compared using character-level patterns to find shared vocabulary, cognates, and structural connections.

The goal is not to rank languages or tribes. It is to find the common threads: the shared words, the similar sounds, the patterns that connect communities across mountains and state borders.

Key Findings

1. Six similarity methods evaluated; three are primary

| Method | Cohen's d (Full) | ARI (Full) | Cohen's d (NT) | ARI (NT) | Status |
|---|---|---|---|---|---|
| Jaccard (top-1000 trigrams) | 1.64 | 0.65 | 1.57 | 0.55 | ✅ Best clustering |
| TF-IDF (2-4 char n-grams) | 1.59 | 0.46 | 1.94 | 0.66 | ✅ Best pair ranking |
| Subword Vocabulary JSD | 1.27 | 0.44 | 1.37 | 0.32 | ✅ Complement |
| Character Trigram Frequency | 1.11 | 0.36 | 1.12 | 0.35 | ⚠️ Marginal |
| Character Frequency | 0.63 | 0.28 | 0.63 | 0.29 | ❌ Fails |
| Glot500-m Embeddings | 0.27 | 0.15 | 0.27 | 0.15 | ❌ Fails |
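As a rough illustration of the top-ranked Jaccard method, the sketch below builds the set of most frequent character trigrams for each text and compares two such sets. It is a minimal sketch, not the repository's implementation: the function names `top_trigrams` and `jaccard`, the lowercasing and whitespace normalisation, and the toy strings are all assumptions.

```python
from collections import Counter


def top_trigrams(text: str, k: int = 1000) -> set[str]:
    """Return the k most frequent character trigrams in text."""
    # Assumed preprocessing: lowercase and collapse runs of whitespace.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {gram for gram, _ in counts.most_common(k)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy strings for illustration only, not real corpus data.
sample_a = "lung lung mei"
sample_b = "long lung mei"
print(jaccard(top_trigrams(sample_a), top_trigrams(sample_b)))
```

Because only set membership matters, this measure ignores how often a trigram occurs once it makes the top-k cut, which may be why it behaves differently from the frequency-based trigram method in the table.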

2. Known language families cluster correctly

[Figure: Dendrogram, Jaccard similarity]

The dendrograms, t-SNE, and PCA all confirm:

  • Ao (Central Naga) and South Patkaian cluster together
  • Angami-Pochuri forms a distinct cluster
  • Monsang-Moyon pair is the most similar in the dataset
  • English is correctly the most distant language
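This README does not define how the Cohen's d column is computed. One common reading, assumed here, is that it measures how far apart within-family and between-family pair similarities are, in pooled standard deviations. The sketch below follows that assumption; `split_pairs`, `cohens_d`, the placeholder labels, and the scores are hypothetical and do not come from the study.

```python
from statistics import mean, variance


def split_pairs(sim: dict[tuple[str, str], float], family: dict[str, str]):
    """Split pairwise similarities into within-family and between-family lists."""
    within, between = [], []
    for (a, b), score in sim.items():
        (within if family[a] == family[b] else between).append(score)
    return within, between


def cohens_d(within: list[float], between: list[float]) -> float:
    """Cohen's d using sample variances and a pooled standard deviation."""
    n1, n2 = len(within), len(between)
    pooled_var = (
        (n1 - 1) * variance(within) + (n2 - 1) * variance(between)
    ) / (n1 + n2 - 2)
    return (mean(within) - mean(between)) / pooled_var ** 0.5


# Placeholder languages and made-up scores, for illustration only.
family = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
sim = {
    ("a1", "a2"): 0.8, ("b1", "b2"): 0.9,
    ("a1", "b1"): 0.1, ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.1, ("a2", "b2"): 0.2,
}
within, between = split_pairs(sim, family)
print(round(cohens_d(within, between), 2))
```

Under this reading, a larger d means a method separates related pairs from unrelated pairs more cleanly.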

3. Word extraction reveals shared vocabulary

Using NPMI (normalized pointwise mutual information) on verse-aligned parallel text, plus an independent RAT pass (TF-IDF + BM25 retrieval):

  • ~2,000 English concepts mapped to 36 Naga languages
  • 9,087 high-confidence translations (both methods agree)
  • Genuine cognates confirmed: stone (lung/hlung/long), fire (mei/mela/meze), years (kum), wood (thing)
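The NPMI step can be sketched as follows, assuming probabilities are estimated from verse-level co-occurrence: p(x) is the share of verses whose English side contains x, p(y) the share whose target side contains y, and p(x, y) the share containing both. The names `npmi_table` and `best_match` and the three toy verse pairs are hypothetical; the repository's scripts are likely to add filtering and tokenisation steps not shown here.

```python
import math
from collections import Counter


def npmi_table(pairs: list[tuple[str, str]]) -> dict[tuple[str, str], float]:
    """NPMI for every (source word, target word) pair seen in aligned verses."""
    n = len(pairs)
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update((s, t) for s in src_words for t in tgt_words)

    scores = {}
    for (s, t), c in joint.items():
        p_xy = c / n
        if p_xy == 1.0:
            # Both words occur in every verse: -log p(x, y) is 0, NPMI undefined.
            continue
        pmi = math.log(p_xy / ((src_count[s] / n) * (tgt_count[t] / n)))
        scores[(s, t)] = pmi / -math.log(p_xy)
    return scores


def best_match(scores: dict[tuple[str, str], float], word: str) -> str:
    """Target word with the highest NPMI for the given source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)


# Toy aligned "verses" built from forms listed above; not real corpus text.
verses = [("stone fire", "lung mei"), ("stone wood", "lung thing"),
          ("fire wood", "mei thing")]
scores = npmi_table(verses)
print(best_match(scores, "stone"))
```

NPMI ranges from -1 to 1, so a pair that always co-occurs scores 1 regardless of how rare it is, which makes scores comparable across frequent and infrequent words.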

4. Interactive explorer

→ Open the Naga Cognate Explorer

Search for any English word and see how all 36 Naga languages express it, with similar-sounding words grouped by color.
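How the explorer decides that two words sound similar is not described here. One simple approach, shown purely as an assumption, is to group word forms whose string similarity to a group's first member exceeds a threshold. The sketch uses `difflib.SequenceMatcher` from the Python standard library; `group_similar`, the 0.6 threshold, the language labels, and the form "kaba" are hypothetical.

```python
from difflib import SequenceMatcher


def group_similar(words: dict[str, str], threshold: float = 0.6) -> list[list[str]]:
    """Greedily group languages by similarity to each group's first word form."""
    groups: list[tuple[str, list[str]]] = []
    for lang, form in words.items():
        for representative, members in groups:
            if SequenceMatcher(None, representative, form).ratio() >= threshold:
                members.append(lang)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append((form, [lang]))
    return [members for _, members in groups]


# Placeholder language labels; "kaba" is a made-up unrelated form.
forms = {"lang_a": "lung", "lang_b": "hlung", "lang_c": "long", "lang_d": "kaba"}
print(group_similar(forms))
```

Each resulting group could then be assigned one color. Plain string similarity is only a proxy for sound similarity, so a phonetically informed distance would likely group forms more faithfully.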

Languages Studied (37)

| Sub-Branch | Languages | Count |
|---|---|---|
| Ao (Central Naga) | Ao, Sangtam, Yimchunger | 3 |
| Angami-Pochuri | Angami, Chokri, Pochuri, Poumei, Rengma N, Rengma S, Sumi | 7 |
| South Patkaian | Chang, Phom, Konyak, Wancho, Khiamniungan | 5 |
| North Patkaian | ASII Lakdap (Tangsa), Nocte, Tutsa, Hawa, Muklom, Chuyo | 6 |
| Tangkhul-Maring | Maring, Tangkhul | 2 |
| Zemeic | Liangmai, Rongmei, Zeme, Maram | 4 |
| Pakan | Anal, Chothe, Kharam, Lamkang, Monsang, Moyon, Tarao, Kom | 8 |
| Creole | Nagamese | 1 |
| Reference | English (NRSVUE) | 1 |

Running

git clone https://huggingface.co/Rantunvsn/naga-languages-similarity
cd naga-languages-similarity
pip install -r requirements.txt

# Similarity analysis
python scripts/02d_clean_bible_texts.py
python scripts/03_compute_cosine_similarity.py
python scripts/03b_additional_similarity_methods.py
python scripts/04_tsne_visualization.py

# Word extraction pipeline (NPMI + RAT + merge + atlas)
python scripts/05_extract_similar_words.py           # NPMI v8
python scripts/05b_cognate_supported_translations.py  # Score-filtered cognate rows
python scripts/10_rat_word_retrieval.py               # RAT v10.3 (independent)
python scripts/11_compare_pmi_rat.py                  # Quality-filtered merge
python scripts/15_naga_cognate_atlas.py               # Cognate atlas
python scripts/16_naga_cognate_viz.py                 # Interactive HTML explorer

Repository Contents

  • 35 Python scripts: full pipeline from data collection to visualization
  • 17 documentation files: methodology, findings, verification for every phase
  • Similarity matrices: 6 methods × full/NT, with Cohen's d and ARI metrics
  • Visualizations: dendrograms, heatmaps, t-SNE, PCA, phylogenetic trees
  • Word tables: NPMI translations, RAT extractions, merged high-confidence table
  • Interactive explorer: results/final/naga_cognate_explorer.html

Data Source

Bible texts obtained from bible.com (YouVersion). All translations © Bible Society of India (BSI). Original texts are NOT redistributed; only derived analytical outputs are included.

References

  • Burling, R. (2003). The Tibeto-Burman languages of NE India. In The Sino-Tibetan Languages. Routledge.
  • Hammarström, H., et al. (2024). Glottolog 5.0. https://glottolog.org
  • Wang, Y., et al. (2023). mPLM-Sim. EMNLP 2023. arXiv:2305.13684.
  • Cavnar, W.B. & Trenkle, J.M. (1994). N-gram based text categorization.
  • Lu, Y., et al. (2023). Chain-of-Dictionary Prompting. EMNLP 2023. arXiv:2305.06575.

Citation

If you use this work, please cite:

@misc{naga-languages-similarity,
  title={Naga Languages Computational Similarity Analysis},
  author={Rantunvsn},
  year={2026},
  url={https://huggingface.co/Rantunvsn/naga-languages-similarity}
}

37 languages · 6 methods · 173 files · ~34 MB analytical outputs
