muril-lang-id-v7

Fine-tuned google/muril-base-cased for language identification on Indian banking chatbot messages. Covers 17 languages (16 Indian languages plus English) in both native and Romanized script, with an 18th undetermined class for out-of-distribution inputs.

This is v7 of an iterative series (v1 → v6). v7 adds brand-laden English banking Q&A, expanded Dravidian Romanized templates, and banking-style European OOD to the v6 training mix.

Labels (0–17)

as, bn, en, gu, hi, kn, ks, ml, mr, ne, or, pa, sa, sd, ta, te, ur, undetermined

Evaluation

On a held-out 1882-row banking chat test set:

version  overall  en      hi      kn      ta      te      undetermined
v5       91.82%   98.2%   99.7%   91.6%   81.3%   80.5%   77.8%
v6       93.25%   96.0%   99.8%   97.5%   92.4%   78.8%   82.5%
v7       96.07%   100%    99.8%   99.2%   97.9%   93.8%   84.0%

Held-out stratified test (from the training-mix distribution): accuracy 0.9731, f1_macro 0.9675.
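
The card does not say which tooling produced these figures. As a reminder of what the two metrics measure, here is a minimal pure-Python sketch on made-up labels; the helper names `accuracy` and `macro_f1` and the toy data are illustrative assumptions, not part of the released evaluation code.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of rows where the predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # gold class gains a false negative
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data (invented for illustration only).
gold = ["hi", "hi", "ta", "te", "en", "undetermined"]
pred = ["hi", "hi", "ta", "ta", "en", "undetermined"]
print(round(accuracy(gold, pred), 4))  # 0.8333
print(round(macro_f1(gold, pred), 4))  # 0.7333
```

Because macro F1 weights every class equally, one badly served low-volume language pulls it down more than it pulls down accuracy, which is why the two numbers can diverge.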

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "dnivra26/muril-lang-id-v7"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

LABELS = ["as","bn","en","gu","hi","kn","ks","ml","mr","ne","or","pa","sa","sd","ta","te","ur","undetermined"]
ENERGY_THRESHOLD = -7.0  # energy > threshold ⇒ flag as undetermined

text = "mera balance kitna hai"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.inference_mode():
    logits = model(**inputs).logits.squeeze(0)
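# Energy score: negative log-sum-exp of the logits (higher = less confident).
energy = -torch.logsumexp(logits, dim=0).item()
top = int(logits.argmax())
# High energy overrides the argmax; otherwise report the top-scoring class.
label = "undetermined" if energy > ENERGY_THRESHOLD else LABELS[top]
print(label)  # → hi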
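
To see how the energy gate in the snippet above behaves without downloading the model, the same rule can be run on hand-made logits. This is a sketch in pure Python: `energy_score` and `decide` are hypothetical helper names, and the logit vectors are invented for illustration rather than taken from the model.

```python
import math

LABELS = ["as", "bn", "en", "gu", "hi", "kn", "ks", "ml", "mr", "ne",
          "or", "pa", "sa", "sd", "ta", "te", "ur", "undetermined"]
ENERGY_THRESHOLD = -7.0  # same value as in the usage snippet

def energy_score(logits):
    """Negative log-sum-exp of the logits, computed stably."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(x - m) for x in logits)))

def decide(logits, threshold=ENERGY_THRESHOLD):
    """Return 'undetermined' when energy exceeds the threshold, else the argmax label."""
    if energy_score(logits) > threshold:
        return "undetermined"
    top = max(range(len(logits)), key=logits.__getitem__)
    return LABELS[top]

# Invented logits: one sharply peaked on "hi", one completely flat.
confident = [0.0] * len(LABELS)
confident[LABELS.index("hi")] = 12.0
flat = [0.0] * len(LABELS)

print(decide(confident))  # hi
print(decide(flat))       # undetermined
```

The peaked vector has an energy of about -12, which is below the threshold, so the argmax label is kept. The flat vector has an energy of -log(18), roughly -2.9, which is above the threshold, so it is flagged.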

Training

  • Base: google/muril-base-cased
  • Epochs: 3
  • Batch size: 128, lr: 4e-5, precision: bf16 + TF32
  • Max seq length: 128
  • Datasets: AI4Bharat Bhasha-Abhijnaanam, AI4Bharat Aksharantar, SST-2, suhani-sarvam/google-dakshina, findnitai/english-to-hinglish, AmazonScience/MASSIVE, community-datasets/offenseval_dravidian (non-offensive only), bitext retail-banking, FLORES-200 (OOD), synthetic brand-laden English banking Q&A, banking-style European OOD (DE/FR/PT/ES/IT/TR/SV/NL), synthetic gibberish.

Notes

  • Romanized Urdu and Hindi are merged to hi at inference time (Hindustani is effectively one spoken language).
  • Pre-v6 checkpoints in this series only emit labels 0–16 and need a tighter energy threshold (-11.22).
  • Works best when wrapped in a pipeline that runs Unicode-script short-circuiting first, so native-script inputs whose script identifies the language deterministically skip the model entirely.
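
The wrapping described in these notes might look like the sketch below. Everything in it is an assumption made for illustration: the script-to-label table, the helper names `letter_scripts` and `identify`, the rule that only single-script inputs are short-circuited, and the use of "all letters are Latin" as the test for Romanized text. The `model_predict` argument is a hypothetical callable standing in for the model call from the Usage section.

```python
import unicodedata

# Assumed mapping: scripts that, within this label set, point to one language.
# Devanagari, Bengali and Arabic script are shared by several labels, so they
# are deliberately left out and fall through to the model.
SINGLE_LANGUAGE_SCRIPTS = {
    "TAMIL": "ta", "TELUGU": "te", "KANNADA": "kn", "MALAYALAM": "ml",
    "GUJARATI": "gu", "GURMUKHI": "pa", "ORIYA": "or",
}

def letter_scripts(text):
    """Set of script names (first word of the Unicode name) over all letters."""
    return {unicodedata.name(ch, "").split(" ")[0] for ch in text if ch.isalpha()}

def identify(text, model_predict):
    """Short-circuit on unambiguous scripts, otherwise defer to the model."""
    scripts = letter_scripts(text)
    if len(scripts) == 1:
        shortcut = SINGLE_LANGUAGE_SCRIPTS.get(next(iter(scripts)))
        if shortcut is not None:
            return shortcut  # the model is never called
    label = model_predict(text)
    # Merge Romanized Urdu into Hindi, as described in the first note.
    if label == "ur" and scripts == {"LATIN"}:
        return "hi"
    return label

# Stub predictor used only for this demonstration.
def always_ur(text):
    return "ur"

print(identify("என் கணக்கு இருப்பு என்ன", always_ur))  # ta
print(identify("mera balance kitna hai", always_ur))      # hi
```

Restricting the shortcut to single-script inputs is a conservative choice: code-mixed messages such as a Latin brand name inside a Tamil sentence still go to the model.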