BharatMorph Embedding — Phoneme-Aware Multilingual Indic Embedding Model

A 76.8M-parameter multilingual embedding model built from scratch for Indic languages, trained on 330K Wikipedia samples across 6 languages with a masked language modeling (MLM) objective plus morpheme-diversity and cross-lingual alignment losses.

Honest note: This model is best suited for cross-lingual retrieval tasks. Within-language semantic search requires contrastive fine-tuning (planned for v2).


What Makes This Different

Most multilingual embedding models treat all languages the same — they tokenize text and learn embeddings purely from context. BharatMorph takes a different approach:

Phoneme-aware character encoding — Tamil க, Hindi क, Malayalam ക all map to the same phoneme ID. This means the model understands that these characters represent the same sound across scripts — giving it a structural advantage for Indic cross-lingual tasks.
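A minimal sketch of the idea (an illustration, not the model's actual mapping table): the major Indic Unicode blocks inherit a parallel layout from ISCII, so the offset of a character within its script's block already acts as a crude shared phoneme ID.

```python
# Sketch of shared phoneme IDs via the parallel Unicode block layout.
# Illustrative only; the model's real phoneme table is learned/curated.
BLOCK_START = {
    "tamil":      0x0B80,
    "devanagari": 0x0900,
    "malayalam":  0x0D00,
}

def phoneme_id(ch: str, script: str) -> int:
    """Offset of a character within its script's Unicode block."""
    return ord(ch) - BLOCK_START[script]

# "ka" in three scripts lands at the same offset (0x15 = 21),
# hence the same phoneme ID:
print(phoneme_id("க", "tamil"))       # Tamil ka  -> 21
print(phoneme_id("क", "devanagari"))  # Hindi ka  -> 21
print(phoneme_id("ക", "malayalam"))   # Malayalam ka -> 21
```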

Morpheme-type soft mixture — Each token is analyzed as a soft mixture of 8 morpheme types (root, prefix, suffix, infix, compound, sandhi, clitic, stem). This is differentiable — no hard decisions, gradients flow through.
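The soft mixture can be sketched as a softmax over 8 type logits followed by a weighted sum of type embeddings (names and shapes here are illustrative, not the model's real layer):

```python
import math

# Illustrative soft morpheme-type mixture: softmax weights over 8 types,
# then a weighted sum of learned type embeddings. No argmax anywhere,
# so every step is differentiable and gradients flow through.
MORPHEME_TYPES = ["root", "prefix", "suffix", "infix",
                  "compound", "sandhi", "clitic", "stem"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def morpheme_mixture(type_logits, type_embeddings):
    """Soft mixture: weighted sum of type embeddings, no hard decision."""
    weights = softmax(type_logits)
    dim = len(type_embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, type_embeddings))
            for d in range(dim)]

# Toy example: 8 logits, 8 type embeddings of dimension 2
logits = [2.0, 0.1, 0.1, 0.0, 0.5, 0.3, 0.0, 1.0]
embs = [[float(i), float(-i)] for i in range(8)]
weights = softmax(logits)
print(MORPHEME_TYPES[weights.index(max(weights))])  # "root" dominates,
print(morpheme_mixture(logits, embs))               # but all 8 contribute
```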

Cross-lingual concept alignment — A language-neutral concept space pulls same-meaning representations together across languages without requiring parallel data.
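As intuition for what "language-neutral" means, one common baseline (not necessarily what the learned aligner does) is per-language mean-centering: subtracting each language's mean embedding removes the dominant "which language is this" direction, leaving meaning-related variation.

```python
# Baseline intuition only: per-language mean-centering as a crude
# language-neutral space. The model's CrossLingualAligner is learned;
# this just illustrates the goal of removing language-identity signal.
def mean_center(embs):
    dim = len(embs[0])
    mean = [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
    return [[e[d] - mean[d] for d in range(dim)] for e in embs]

tamil_embs = [[1.0, 2.0], [1.2, 1.8]]    # toy 2-d "Tamil" embeddings
hindi_embs = [[5.0, -1.0], [4.8, -0.8]]  # toy 2-d "Hindi" embeddings

# After centering, each language group has zero mean: the shared offset
# that identified the language is gone.
print(mean_center(tamil_embs))
print(mean_center(hindi_embs))
```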


Architecture

Input tokens
    │
    ├──► Token Embedding (V × 1024)
    │
    └──► CharCNN (phoneme IDs)
              k=3 : local morpheme patterns
              k=7 : sandhi boundary context
                │
                ▼
         MorphemeAnalyzer
         8-type soft mixture
                │
                ▼
         MorphemeAttn (bidirectional)
         Q,K from morpheme space
                │
                ▼
    Gate(base, morph_vec) — content words use more morpheme signal
                │
                ▼
         CrossLingualAligner
         language-neutral concept space
                │
                ▼
    Pooled sentence embedding (L2 normalized, dim=1024)
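The Gate(base, morph_vec) step in the diagram can be sketched as a sigmoid blend (illustrative shapes and names, not the exact layer):

```python
import math

# Illustrative sketch of the gating step: a sigmoid gate decides, per
# dimension, how much morpheme signal to mix into the base representation.
# Content words can learn a higher gate and thus use more morpheme signal.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(base, morph, gate_logits):
    """out = g * morph + (1 - g) * base, elementwise."""
    return [sigmoid(l) * m + (1.0 - sigmoid(l)) * b
            for b, m, l in zip(base, morph, gate_logits)]

base  = [0.5, -0.2, 0.1]
morph = [1.0,  0.4, -0.3]
print(gate(base, morph, [10.0, 0.0, -10.0]))
# gate ~ 1: morph dominates; gate = 0.5: even blend; gate ~ 0: base dominates
```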

Component Details

Total parameters    : 76.8M
Embedding dimension : 1024
Morpheme types      : 8 (soft mixture)
Languages           : Tamil, Hindi, Telugu, Kannada, Malayalam, English
Max sequence length : 256 tokens
Tokenizer           : Sarvam AI (sarvamai/sarvam-2b-v0.5)
Output              : L2-normalized sentence embeddings

Training

Dataset — 330K Wikipedia Samples

Language     Samples   Script range
Tamil         80,000   U+0B80–U+0BFF
Hindi         80,000   U+0900–U+097F
Telugu        60,000   U+0C00–U+0C7F
Kannada       40,000   U+0C80–U+0CFF
Malayalam     40,000   U+0D00–U+0D7F
English       30,000   Latin
Validation     2,000   mixed

Training Config

GPU            : NVIDIA Tesla T4 (Kaggle single GPU)
Epochs         : 3
Batch size     : 32
Grad accum     : 4  (effective batch = 128)
Max seq len    : 256
Learning rate  : 2e-4
LR schedule    : Cosine with warmup (1000 steps)
Optimizer      : AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
Weight decay   : 0.01
Grad clip      : 1.0
Mixed precision: FP16 (AMP)
NaN batches    : 0

Loss:
  Total = MLM loss + 0.01 × Morpheme diversity loss + 0.005 × Alignment loss
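With made-up loss values, the weighting plays out like this: the MLM term dominates, and the two auxiliary terms act as mild regularizers.

```python
# Worked example of the training loss weighting (loss values are made up).
mlm_loss       = 0.80
diversity_loss = 2.00
align_loss     = 4.00

total = mlm_loss + 0.01 * diversity_loss + 0.005 * align_loss
print(total)  # ~0.84: the auxiliary terms nudge, the MLM term drives
```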

Training Results

Epoch   Val Loss   Val PPL
1       0.9713     2.64
2       0.7500     2.12
3       0.7230     2.06
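The reported perplexities are consistent with PPL = exp(validation loss), which is a quick sanity check on any MLM training log:

```python
import math

# Sanity check: perplexity is the exponential of the validation loss.
for loss in (0.9713, 0.7500, 0.7230):
    print(f"loss {loss:.4f} -> PPL {math.exp(loss):.2f}")
# Reproduces the table values: 2.64, 2.12, 2.06
```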

Evaluation Results

Cross-lingual Similarity (Same Meaning)

Pair                 Cosine similarity
Tamil ↔ Malayalam    0.9546
Tamil ↔ Telugu       0.9342
Tamil ↔ Hindi        0.8932
Tamil ↔ English      0.8754

All pairs exceed the 0.6 threshold — cross-lingual alignment is working well.

Honest Limitations

  • Within-language semantic search: Different Tamil sentences score ~0.96 cosine similarity regardless of meaning. The model does not yet separate "cat sleeping" from "car going fast" within the same language. This is because MLM-only training does not push apart unrelated sentences — contrastive loss is needed.
  • Contrastive fine-tuning: Planned for v2 using SimCSE-style training.
  • Factual accuracy: Not applicable — this is an embedding model, not a generative model.

Quick Start

Installation

pip install transformers torch safetensors huggingface_hub
# Optional — improves phoneme mapping accuracy for Indic scripts
pip install indic-transliteration

You also need bharatmorph_embedding.py — download it from the model repo or copy from below.

Load and Encode

import torch
import torch.nn.functional as F
import safetensors.torch
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download
from bharatmorph_embedding import (
    BharatMorphEmbeddingConfig,
    BharatMorphEmbeddingModel,
    build_char_table,
    fast_char_ids,
)

# ── Download model ────────────────────────────────────────────
local_dir = snapshot_download("Girinath11/bharatmorph-embedding")

# ── Load tokenizer (must use Sarvam tokenizer) ────────────────
tok = AutoTokenizer.from_pretrained("sarvamai/sarvam-2b-v0.5", trust_remote_code=True)
tok.add_special_tokens({"additional_special_tokens": ["[TA]","[HI]","[TE]","[KN]","[ML]","[EN]"]})
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# ── Load model ────────────────────────────────────────────────
ecfg  = BharatMorphEmbeddingConfig.from_pretrained(local_dir)
model = BharatMorphEmbeddingModel(ecfg)
model.resize_token_embeddings(len(tok))

state_dict = safetensors.torch.load_file(f"{local_dir}/model.safetensors")
state_dict.pop("mlm_head.3.weight", None)   # tied weight — safe to skip
model.load_state_dict(state_dict, strict=False)
model = model.eval().cuda()

# ── Build char table (do once, reuse) ────────────────────────
CHAR_CPU = build_char_table(tok, len(tok), 20, 512)

# ── Language IDs ─────────────────────────────────────────────
# ta=0  hi=1  te=2  kn=3  ml=4  en=5

def encode(texts, lang_ids):
    """
    texts    : list of strings
    lang_ids : list of ints matching language of each text
    returns  : (N, 1024) L2-normalized tensor
    """
    enc  = tok(texts, max_length=256, truncation=True,
               padding="max_length", return_tensors="pt")
    ids  = enc["input_ids"].cuda()
    mask = enc["attention_mask"].cuda()
    cids = fast_char_ids(ids.cpu(), CHAR_CPU).cuda()
    lids = torch.tensor(lang_ids, dtype=torch.long).cuda()
    with torch.no_grad():
        out = model(input_ids=ids, attention_mask=mask,
                    char_ids=cids, lang_ids=lids, run_mlm=False)
    return out.pooled   # (N, 1024) L2-normalized

# ── Cosine similarity ─────────────────────────────────────────
def similarity(a, b):
    return F.cosine_similarity(a, b, dim=-1).item()

Cross-lingual Retrieval Example

# Same meaning, different languages — should score high
ta = encode(["அம்மா சாப்பிட்டாள்"],  [0])   # Tamil
hi = encode(["माँ ने खाना खाया"],     [1])   # Hindi
ml = encode(["അമ്മ ഭക്ഷണം കഴിച്ചു"], [4])   # Malayalam
en = encode(["Mother ate food"],       [5])   # English

print(f"Tamil  ↔ Hindi     : {similarity(ta, hi):.4f}")   # 0.8932
print(f"Tamil  ↔ Malayalam : {similarity(ta, ml):.4f}")   # 0.9546
print(f"Tamil  ↔ English   : {similarity(ta, en):.4f}")   # 0.8754

Cross-lingual Document Search

# Query in English, find matching documents in Tamil/Hindi
query = encode(["agriculture and farming"], [5])

tamil_docs = [
    "விவசாயம் தமிழ்நாட்டின் முக்கிய தொழில்",   # Agriculture is TN's main industry
    "கணினி அறிவியல் படிப்பு பயனுள்ளது",          # CS education is useful
    "நெல் சாகுபடி அதிகமாக உள்ளது",               # Rice cultivation is high
]
hindi_docs = [
    "किसान खेती में मेहनत करते हैं",              # Farmers work hard in farming
    "मोबाइल फोन आज जरूरी है",                    # Mobile phones are necessary today
]

all_docs  = tamil_docs + hindi_docs
all_lids  = [0]*3 + [1]*2
doc_embs  = encode(all_docs, all_lids)

scores = F.cosine_similarity(query, doc_embs, dim=-1)
ranked = sorted(zip(scores.tolist(), all_docs), reverse=True)

print("Query: 'agriculture and farming'\n")
for score, doc in ranked:
    print(f"  {score:.4f}  {doc}")

Batch Encoding

# Encode multiple sentences at once (efficient)
sentences = [
    "தமிழ் மொழி மிகவும் பழமையானது",    # Tamil is very ancient
    "हिंदी भारत की राजभाषा है",          # Hindi is India's official language
    "Telugu is a Dravidian language",
    "ಕನ್ನಡ ಕರ್ನಾಟಕದ ಅಧಿಕೃತ ಭಾಷೆ",       # Kannada is Karnataka's official language
    "Malayalam has palindrome script",
]
lang_ids = [0, 1, 5, 3, 4]

embeddings = encode(sentences, lang_ids)
print(f"Shape: {embeddings.shape}")   # (5, 1024)

# Pairwise similarity matrix
sim_matrix = torch.matmul(embeddings, embeddings.T)
print(sim_matrix)

Use Cases

Task                                        Suitable?   Notes
Cross-lingual document retrieval            Yes         Main strength
Multilingual clustering                     Yes         Language-neutral space
Cross-lingual semantic textual similarity   Yes         High cosine scores
Within-language semantic search             Partial     Contrastive v2 training planned
Within-language sentence ranking            Partial     Scores are compressed
Named entity recognition                    No          Not designed for this
Text generation                             No          Embedding model only

What I Learned / What's Next

What worked:
  • Phoneme-aware char encoding — excellent cross-lingual similarity
  • Morpheme soft mixture — differentiable and stable
  • Cross-lingual alignment loss — no parallel data needed
  • NaN-safe training — 0 NaN batches across 3 epochs
  • Healthy loss curve — 0.97 → 0.72 val loss

What needs improvement (v2):
  • SimCSE-style contrastive loss — within-language semantic separation
  • More training data — 330K samples is small
  • Harder negative mining — unrelated sentences score too similarly
  • Evaluation on standard benchmarks (MIRACL, XQuAD)

Citation

@misc{girinath2026bharatmorph,
  author       = {Girinath V},
  title        = {BharatMorph Embedding: Phoneme-Aware Multilingual
                  Embedding Model for Indic Languages},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Girinath11/bharatmorph-embedding}},
  note         = {76.8M parameter embedding model trained from scratch
                  with phoneme-aware CharCNN and cross-lingual alignment}
}

Acknowledgments

  • Sarvam AI — tokenizer (sarvamai/sarvam-2b-v0.5)
  • Wikimedia Foundation — Wikipedia training data
  • HuggingFace — Transformers library
  • Kaggle — Free GPU access (T4)

Model status : Research / cross-lingual retrieval use
Author       : Girinath V
Last updated : April 2026
License      : MIT
Training dataset : Girinath11/bharatmorph_indic_crosslingual