muril-lang-id-v7

Fine-tuned google/muril-base-cased for language identification on Indian banking chatbot messages. Covers 16 Indian languages plus English (17 language labels), in both native and Romanized script, with an 18th undetermined class for out-of-distribution inputs.

This is v7 of an iterative series (v1 → v7). Relative to the v6 training mix, v7 adds brand-laden English banking Q&A, expanded Romanized Dravidian templates, and banking-style European OOD data.

Labels (0–17)

as, bn, en, gu, hi, kn, ks, ml, mr, ne, or, pa, sa, sd, ta, te, ur, undetermined

Evaluation

On a held-out 1882-row banking chat test set:

version	overall	en	hi	kn	ta	te	undetermined
v5	91.82%	98.2%	99.7%	91.6%	81.3%	80.5%	77.8%
v6	93.25%	96.0%	99.8%	97.5%	92.4%	78.8%	82.5%
v7	96.07%	100%	99.8%	99.2%	97.9%	93.8%	84.0%

Held-out stratified test (from the training-mix distribution): accuracy 0.9731, f1_macro 0.9675.
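
For readers less familiar with the two summary metrics, here is a minimal sketch of how accuracy and macro-averaged F1 are conventionally computed. The toy labels are illustrative only and are not drawn from either test set. The function names are hypothetical, and this is not the author's evaluation script.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (made up): one "hi" row is misread as "ta".
y_true = ["hi", "hi", "ta", "te"]
y_pred = ["hi", "ta", "ta", "te"]
print(accuracy(y_true, y_pred))  # 0.75
print(round(f1_macro(y_true, y_pred), 4))  # 0.7778
```

Because macro F1 weights every class equally, a weak low-resource class pulls it down more than it pulls down overall accuracy.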

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "dnivra26/muril-lang-id-v7"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

LABELS = ["as","bn","en","gu","hi","kn","ks","ml","mr","ne","or","pa","sa","sd","ta","te","ur","undetermined"]
ENERGY_THRESHOLD = -7.0  # energy > threshold ⇒ flag as undetermined

text = "mera balance kitna hai"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.inference_mode():
    logits = model(**inputs).logits.squeeze(0)
energy = -torch.logsumexp(logits, dim=0).item()
top = int(logits.argmax())
label = "undetermined" if energy > ENERGY_THRESHOLD else LABELS[top]
print(label)  # → hi
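
The decision rule in the snippet above can be isolated as a pure-Python sketch, which makes the role of the threshold easier to see. The energy is the negative log-sum-exp of the logits: one large, peaked logit gives a very negative energy, while flat logits give an energy near -log(18), which lies above -7.0 and is therefore flagged. The logit values below are made up for illustration, and energy_score and decide are hypothetical helper names.

```python
import math

LABELS = ["as", "bn", "en", "gu", "hi", "kn", "ks", "ml", "mr", "ne",
          "or", "pa", "sa", "sd", "ta", "te", "ur", "undetermined"]
ENERGY_THRESHOLD = -7.0  # same value as in the usage snippet


def energy_score(logits):
    """Return -logsumexp(logits), computed stably by factoring out the max."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(x - m) for x in logits)))


def decide(logits, threshold=ENERGY_THRESHOLD):
    """Flag high-energy inputs as undetermined, otherwise return the argmax label."""
    if energy_score(logits) > threshold:
        return "undetermined"
    top = max(range(len(logits)), key=logits.__getitem__)
    return LABELS[top]


# Illustrative logits only: one confident prediction and one flat, uncertain one.
confident = [0.0] * 18
confident[4] = 12.0          # index 4 is "hi"
flat = [0.0] * 18

print(decide(confident))     # hi
print(decide(flat))          # undetermined
```

Note that the model can also land on undetermined directly, since index 17 is a trained class. The energy check is an additional guard on top of the argmax.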

Training

Base: google/muril-base-cased
Epochs: 3
Batch size: 128, lr: 4e-5, precision: bf16 + TF32
Max seq length: 128
Datasets: AI4Bharat Bhasha-Abhijnaanam, AI4Bharat Aksharantar, SST-2, suhani-sarvam/google-dakshina, findnitai/english-to-hinglish, AmazonScience/MASSIVE, community-datasets/offenseval_dravidian (non-offensive only), bitext retail-banking, FLORES-200 (OOD), synthetic brand-laden English banking Q&A, banking-style European OOD (DE/FR/PT/ES/IT/TR/SV/NL), synthetic gibberish.

Notes

Romanized Urdu and Hindi are merged to hi at inference time (Hindustani is effectively one spoken language).
Pre-v6 checkpoints in this series only emit labels 0–16 and need a tighter energy threshold (-11.22).
Works best inside a pipeline that first short-circuits on Unicode script, so native-script inputs whose script uniquely identifies the language skip the model entirely.
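
The notes above describe two steps that sit outside the model: a Unicode-script short-circuit and a Hindi/Urdu merge for Romanized text. A minimal sketch of both follows, under stated assumptions. The choice of which scripts count as unambiguous is a guess based on the label set, not the author's actual pipeline. Every function name here is hypothetical, and model_predict stands in for any callable that wraps the model call.

```python
# Assumed mapping: Unicode blocks used by exactly one language in the label set.
# Devanagari, Bengali-Assamese, and Arabic script are shared by several labels,
# so they are deliberately left out and fall through to the model.
UNAMBIGUOUS_BLOCKS = [
    (0x0A00, 0x0A7F, "pa"),  # Gurmukhi
    (0x0A80, 0x0AFF, "gu"),  # Gujarati
    (0x0B00, 0x0B7F, "or"),  # Odia
    (0x0B80, 0x0BFF, "ta"),  # Tamil
    (0x0C00, 0x0C7F, "te"),  # Telugu
    (0x0C80, 0x0CFF, "kn"),  # Kannada
    (0x0D00, 0x0D7F, "ml"),  # Malayalam
]


def block_label(ch):
    """Return the label for a character in an unambiguous block, else None."""
    cp = ord(ch)
    for low, high, label in UNAMBIGUOUS_BLOCKS:
        if low <= cp <= high:
            return label
    return None


def script_short_circuit(text):
    """Return a label only if every letter maps to the same unambiguous script."""
    labels = {block_label(ch) for ch in text if ch.isalpha()}
    return labels.pop() if len(labels) == 1 else None


def merge_romanized_urdu(label, text):
    """Map a 'ur' prediction to 'hi' when the text is written in Latin letters."""
    letters = [ch for ch in text if ch.isalpha()]
    is_latin = bool(letters) and all(ch.isascii() for ch in letters)
    return "hi" if label == "ur" and is_latin else label


def identify(text, model_predict):
    """Try the script shortcut first; otherwise call the model and merge."""
    shortcut = script_short_circuit(text)
    if shortcut is not None:
        return shortcut
    return merge_romanized_urdu(model_predict(text), text)


# The lambda is a stand-in for the real model call.
print(identify("வணக்கம்", lambda t: "en"))         # ta (model never consulted)
print(identify("aap kaise hain", lambda t: "ur"))  # hi (merged)
```

Mixed-script and Latin-only inputs return None from the shortcut and go to the model, which is where the energy threshold from the usage snippet would apply.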

Model size: 0.2B params (Safetensors, tensor type F32)
