# BharatMorph Embedding — Phoneme-Aware Multilingual Indic Embedding Model

A 76.8M parameter multilingual embedding model built from scratch for Indic languages. Trained on 330K Wikipedia samples across 6 languages using an MLM objective with morpheme diversity and cross-lingual alignment losses.

**Honest note:** This model is best suited for cross-lingual retrieval tasks. Within-language semantic search requires contrastive fine-tuning (planned for v2).
## What Makes This Different
Most multilingual embedding models treat all languages the same — they tokenize text and learn embeddings purely from context. BharatMorph takes a different approach:
- **Phoneme-aware character encoding** — Tamil க, Hindi क, and Malayalam ക all map to the same phoneme ID. The model understands that these characters represent the same sound across scripts, giving it a structural advantage for Indic cross-lingual tasks.
- **Morpheme-type soft mixture** — Each token is analyzed as a soft mixture of 8 morpheme types (root, prefix, suffix, infix, compound, sandhi, clitic, stem). This is differentiable — no hard decisions, and gradients flow through.
- **Cross-lingual concept alignment** — A language-neutral concept space pulls same-meaning representations together across languages without requiring parallel data.
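The phoneme-ID idea can be illustrated with a small sketch. This is *not* the model's actual mapping — the function and block table below are hypothetical — but it shows why the trick works: the Unicode blocks for the major Indic scripts are ISCII-aligned, so the same consonant sits at the same offset within every block.

```python
# Hypothetical sketch of phoneme-aware character normalization.
# Unicode Indic blocks are ISCII-aligned: "ka" is at offset 0x15
# in Devanagari, Tamil, Telugu, Kannada, and Malayalam alike.
INDIC_BLOCKS = {
    "deva": 0x0900,  # Devanagari (Hindi)
    "taml": 0x0B80,  # Tamil
    "telu": 0x0C00,  # Telugu
    "knda": 0x0C80,  # Kannada
    "mlym": 0x0D00,  # Malayalam
}

def phoneme_id(ch: str) -> int:
    """Map a character to a script-neutral phoneme ID (its block offset)."""
    cp = ord(ch)
    for base in INDIC_BLOCKS.values():
        if base <= cp < base + 0x80:
            return cp - base  # same offset => same phoneme across scripts
    return cp  # non-Indic characters keep their own code point

# Tamil க (U+0B95), Hindi क (U+0915), Malayalam ക (U+0D15) share offset 0x15
print([phoneme_id(c) for c in "கकക"])  # [21, 21, 21]
```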
## Architecture

```
Input tokens
     │
     ├──► Token Embedding (V × 1024)
     │
     └──► CharCNN (phoneme IDs)
            k=3 : local morpheme patterns
            k=7 : sandhi boundary context
                     │
                     ▼
            MorphemeAnalyzer
              8-type soft mixture
                     │
                     ▼
            MorphemeAttn (bidirectional)
              Q,K from morpheme space
                     │
                     ▼
            Gate(base, morph_vec) — content words use more morpheme signal
                     │
                     ▼
            CrossLingualAligner
              language-neutral concept space
                     │
                     ▼
            Pooled sentence embedding (L2 normalized, dim=1024)
```
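For illustration, the `Gate(base, morph_vec)` step in the diagram might look like the following PyTorch sketch. The module name, layer shape, and concatenation-based gate are assumptions — the card does not specify the exact form:

```python
import torch
import torch.nn as nn

class MorphemeGate(nn.Module):
    """Hypothetical sketch of Gate(base, morph_vec): a learned sigmoid
    gate decides, per token and per dimension, how much morpheme
    signal to mix into the base representation."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, base: torch.Tensor, morph_vec: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([base, morph_vec], dim=-1)))
        # Content words would learn g -> 1 (more morpheme signal)
        return g * morph_vec + (1 - g) * base

fused = MorphemeGate(1024)(torch.randn(2, 8, 1024), torch.randn(2, 8, 1024))
print(fused.shape)  # torch.Size([2, 8, 1024])
```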
| Component | Details |
|---|---|
| Total parameters | 76.8M |
| Embedding dimension | 1024 |
| Morpheme types | 8 (soft mixture) |
| Languages | Tamil, Hindi, Telugu, Kannada, Malayalam, English |
| Max sequence length | 256 tokens |
| Tokenizer | Sarvam AI (sarvamai/sarvam-2b-v0.5) |
| Output | L2-normalized sentence embeddings |
## Training
### Dataset — 330K Wikipedia Samples
| Language | Samples | Script range |
|---|---|---|
| Tamil | 80,000 | U+0B80–0BFF |
| Hindi | 80,000 | U+0900–097F |
| Telugu | 60,000 | U+0C00–0C7F |
| Kannada | 40,000 | U+0C80–0CFF |
| Malayalam | 40,000 | U+0D00–0D7F |
| English | 30,000 | Latin |
| Validation | 2,000 | mixed |
### Training Config

```
GPU            : NVIDIA Tesla T4 (Kaggle, single GPU)
Epochs         : 3
Batch size     : 32
Grad accum     : 4 (effective batch = 128)
Max seq len    : 256
Learning rate  : 2e-4
LR schedule    : Cosine with warmup (1000 steps)
Optimizer      : AdamW (β1=0.9, β2=0.95, ε=1e-8)
Weight decay   : 0.01
Grad clip      : 1.0
Mixed precision: FP16 (AMP)
NaN batches    : 0
```
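The config above translates into a fairly standard AMP training step. The sketch below uses a toy model and random data purely as stand-ins — only the AMP / gradient-accumulation / clipping / scheduling structure reflects the config; nothing here is the actual BharatMorph training loop:

```python
import torch
import torch.nn as nn

# Toy stand-ins; the real loop would use the BharatMorph model and dataloader.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,
                              betas=(0.9, 0.95), eps=1e-8, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # FP16 loss scaling
ACCUM = 4  # grad accumulation: 32 × 4 = effective batch 128

for step in range(8):  # stand-in for iterating the real dataloader
    x = torch.randn(32, 16)
    with torch.autocast("cuda" if use_cuda else "cpu", enabled=use_cuda):
        loss = model(x).pow(2).mean() / ACCUM  # divide so grads average
    scaler.scale(loss).backward()
    if (step + 1) % ACCUM == 0:
        scaler.unscale_(optimizer)  # clip on true (unscaled) gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        scheduler.step()
```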
**Loss:**

```
Total = MLM loss + 0.01 × morpheme diversity loss + 0.005 × alignment loss
```
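The card does not give the exact forms of the two auxiliary losses; a plausible sketch of the weighted combination, assuming an entropy-based diversity term and a cosine-based alignment term (both assumptions), would be:

```python
import torch
import torch.nn.functional as F

def total_loss(mlm_loss, morph_probs, emb_a, emb_b,
               w_div=0.01, w_align=0.005):
    """Hypothetical combined objective. morph_probs: (N, 8) soft morpheme
    mixtures; emb_a/emb_b: L2-normalized pooled embeddings that should
    sit near each other in the language-neutral concept space."""
    # Diversity: reward high-entropy mixtures so morpheme types don't collapse
    entropy = -(morph_probs * morph_probs.clamp_min(1e-9).log()).sum(-1).mean()
    diversity_loss = -entropy
    # Alignment: pull aligned embeddings together (1 - cosine similarity)
    align_loss = (1 - (emb_a * emb_b).sum(-1)).mean()
    return mlm_loss + w_div * diversity_loss + w_align * align_loss

# Demo with random stand-in tensors
demo = total_loss(torch.tensor(1.0),
                  torch.softmax(torch.randn(4, 8), dim=-1),
                  F.normalize(torch.randn(4, 1024), dim=-1),
                  F.normalize(torch.randn(4, 1024), dim=-1))
```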
### Training Results
| Epoch | Val Loss | Val PPL |
|---|---|---|
| 1 | 0.9713 | 2.64 |
| 2 | 0.7500 | 2.12 |
| 3 | 0.7230 | 2.06 |
## Evaluation Results
### Cross-lingual Similarity (Same Meaning)
| Pair | Cosine Similarity |
|---|---|
| Tamil ↔ Malayalam | 0.9546 |
| Tamil ↔ Telugu | 0.9342 |
| Tamil ↔ Hindi | 0.8932 |
| Tamil ↔ English | 0.8754 |
All pairs exceed the 0.6 threshold — cross-lingual alignment is working well.
## Honest Limitations
- Within-language semantic search: Different Tamil sentences score ~0.96 cosine similarity regardless of meaning. The model does not yet separate "cat sleeping" from "car going fast" within the same language. This is because MLM-only training does not push apart unrelated sentences — contrastive loss is needed.
- Contrastive fine-tuning: Planned for v2 using SimCSE-style training.
- Factual accuracy: Not applicable — this is an embedding model, not a generative model.
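For reference, a SimCSE-style objective is typically an in-batch InfoNCE loss: two dropout-noised encodings of the same sentence are positives, and the other sentences in the batch are negatives. The sketch below is illustrative only (temperature and names are assumptions, not the planned v2 implementation):

```python
import torch
import torch.nn.functional as F

def simcse_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, tau: float = 0.05):
    """Hypothetical SimCSE-style InfoNCE loss. emb_a and emb_b are two
    encodings of the same batch of sentences (e.g. different dropout
    masks); off-diagonal batch entries act as negatives."""
    # (N, N) matrix of pairwise cosine similarities, temperature-scaled
    sim = F.cosine_similarity(emb_a.unsqueeze(1), emb_b.unsqueeze(0), dim=-1) / tau
    labels = torch.arange(sim.size(0))  # positive pair sits on the diagonal
    return F.cross_entropy(sim, labels)

loss = simcse_loss(torch.randn(8, 1024), torch.randn(8, 1024))
```

This is the ingredient MLM-only training lacks: it explicitly pushes unrelated sentences apart, which is what would fix the compressed within-language scores.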
## Quick Start
### Installation

```bash
pip install transformers torch safetensors huggingface_hub

# Optional — improves phoneme mapping accuracy for Indic scripts
pip install indic-transliteration
```

You also need `bharatmorph_embedding.py` — download it from the model repo or copy from below.
### Load and Encode

```python
import torch
import torch.nn.functional as F
import safetensors.torch
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

from bharatmorph_embedding import (
    BharatMorphEmbeddingConfig,
    BharatMorphEmbeddingModel,
    build_char_table,
    fast_char_ids,
)

# ── Download model ────────────────────────────────────────────
local_dir = snapshot_download("Girinath11/bharatmorph-embedding")

# ── Load tokenizer (must use Sarvam tokenizer) ────────────────
tok = AutoTokenizer.from_pretrained("sarvamai/sarvam-2b-v0.5", trust_remote_code=True)
tok.add_special_tokens({"additional_special_tokens": ["[TA]","[HI]","[TE]","[KN]","[ML]","[EN]"]})
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# ── Load model ────────────────────────────────────────────────
ecfg = BharatMorphEmbeddingConfig.from_pretrained(local_dir)
model = BharatMorphEmbeddingModel(ecfg)
model.resize_token_embeddings(len(tok))

state_dict = safetensors.torch.load_file(f"{local_dir}/model.safetensors")
state_dict.pop("mlm_head.3.weight", None)  # tied weight — safe to skip
model.load_state_dict(state_dict, strict=False)
model = model.eval().cuda()

# ── Build char table (do once, reuse) ─────────────────────────
CHAR_CPU = build_char_table(tok, len(tok), 20, 512)

# ── Language IDs: ta=0 hi=1 te=2 kn=3 ml=4 en=5 ───────────────
def encode(texts, lang_ids):
    """
    texts    : list of strings
    lang_ids : list of ints matching the language of each text
    returns  : (N, 1024) L2-normalized tensor
    """
    enc = tok(texts, max_length=256, truncation=True,
              padding="max_length", return_tensors="pt")
    ids = enc["input_ids"].cuda()
    mask = enc["attention_mask"].cuda()
    cids = fast_char_ids(ids.cpu(), CHAR_CPU).cuda()
    lids = torch.tensor(lang_ids, dtype=torch.long).cuda()
    with torch.no_grad():
        out = model(input_ids=ids, attention_mask=mask,
                    char_ids=cids, lang_ids=lids, run_mlm=False)
    return out.pooled  # (N, 1024) L2-normalized

# ── Cosine similarity ─────────────────────────────────────────
def similarity(a, b):
    return F.cosine_similarity(a, b, dim=-1).item()
```
### Cross-lingual Retrieval Example

```python
# Same meaning, different languages — should score high
ta = encode(["அம்மா சாப்பிட்டாள்"], [0])   # Tamil: "Mother ate"
hi = encode(["माँ ने खाना खाया"], [1])      # Hindi: "Mother ate food"
ml = encode(["അമ്മ ഭക്ഷണം കഴിച്ചു"], [4])  # Malayalam: "Mother ate food"
en = encode(["Mother ate food"], [5])       # English

print(f"Tamil ↔ Hindi     : {similarity(ta, hi):.4f}")  # 0.8932
print(f"Tamil ↔ Malayalam : {similarity(ta, ml):.4f}")  # 0.9546
print(f"Tamil ↔ English   : {similarity(ta, en):.4f}")  # 0.8754
```
### Cross-lingual Document Search

```python
# Query in English, find matching documents in Tamil/Hindi
query = encode(["agriculture and farming"], [5])

tamil_docs = [
    "விவசாயம் தமிழ்நாட்டின் முக்கிய தொழில்",  # Agriculture is TN's main industry
    "கணினி அறிவியல் படிப்பு பயனுள்ளது",       # CS education is useful
    "நெல் சாகுபடி அதிகமாக உள்ளது",            # Rice cultivation is high
]
hindi_docs = [
    "किसान खेती में मेहनत करते हैं",   # Farmers work hard in farming
    "मोबाइल फोन आज जरूरी है",          # Mobile phones are necessary today
]

all_docs = tamil_docs + hindi_docs
all_lids = [0] * 3 + [1] * 2
doc_embs = encode(all_docs, all_lids)

scores = F.cosine_similarity(query, doc_embs, dim=-1)
ranked = sorted(zip(scores.tolist(), all_docs), reverse=True)

print("Query: 'agriculture and farming'\n")
for score, doc in ranked:
    print(f"  {score:.4f}  {doc}")
```
### Batch Encoding

```python
# Encode multiple sentences at once (efficient)
sentences = [
    "தமிழ் மொழி மிகவும் பழமையானது",     # Tamil: Tamil is very ancient
    "हिंदी भारत की राजभाषा है",           # Hindi: Hindi is India's official language
    "Telugu is a Dravidian language",     # English
    "ಕನ್ನಡ ಕರ್ನಾಟಕದ ಅಧಿಕೃತ ಭಾಷೆ",     # Kannada: Kannada is Karnataka's official language
    "Malayalam's name is a palindrome",   # English
]
lang_ids = [0, 1, 5, 3, 5]  # last two sentences are English (id 5)

embeddings = encode(sentences, lang_ids)
print(f"Shape: {embeddings.shape}")  # (5, 1024)

# Pairwise similarity matrix (embeddings are L2-normalized,
# so the dot product is cosine similarity)
sim_matrix = torch.matmul(embeddings, embeddings.T)
print(sim_matrix)
```
## Use Cases
| Task | Suitable? | Notes |
|---|---|---|
| Cross-lingual document retrieval | Yes | Main strength |
| Multilingual clustering | Yes | Language-neutral space |
| Cross-lingual semantic textual similarity | Yes | High cosine scores |
| Within-language semantic search | Partial | v2 with contrastive training planned |
| Within-language sentence ranking | Partial | Scores are compressed |
| Named entity recognition | No | Not designed for this |
| Text generation | No | Embedding model only |
## What I Learned / What's Next
**What worked:**

- Phoneme-aware char encoding — excellent cross-lingual similarity
- Morpheme soft mixture — differentiable and stable
- Cross-lingual alignment loss — no parallel data needed
- NaN-safe training — 0 NaN batches across 3 epochs
- Healthy loss curve — val loss 0.97 → 0.72
**What needs improvement (v2):**

- SimCSE contrastive loss — within-language semantic separation
- More training data — 330K samples is small
- Harder negative mining — unrelated sentences score too similarly
- Evaluation on standard benchmarks (MIRACL, XQuAD)
## Citation

```bibtex
@misc{girinath2026bharatmorph,
  author       = {Girinath V},
  title        = {BharatMorph Embedding: Phoneme-Aware Multilingual Embedding Model for Indic Languages},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Girinath11/bharatmorph-embedding}},
  note         = {76.8M parameter embedding model trained from scratch with phoneme-aware CharCNN and cross-lingual alignment}
}
```
## Acknowledgments
- Sarvam AI — tokenizer (sarvamai/sarvam-2b-v0.5)
- Wikimedia Foundation — Wikipedia training data
- HuggingFace — Transformers library
- Kaggle — Free GPU access (T4)
**Model status:** Research / cross-lingual retrieval use
**Author:** Girinath V
**Last updated:** April 2026
**License:** MIT