muril-lang-id-v7
Fine-tuned google/muril-base-cased for language identification on Indian banking chatbot messages. Covers 17 languages (16 Indian languages plus English) in both native and Romanized script, with an 18th undetermined class for out-of-distribution inputs.
This is v7 of an iterative series (v1 → v6). v7 adds brand-laden English banking Q&A, expanded Dravidian Romanized templates, and banking-style European OOD to the v6 training mix.
Labels (0–17)
as, bn, en, gu, hi, kn, ks, ml, mr, ne, or, pa, sa, sd, ta, te, ur, undetermined
Evaluation
On a held-out 1882-row banking chat test set:
| version | overall | en | hi | kn | ta | te | undetermined |
|---|---|---|---|---|---|---|---|
| v5 | 91.82% | 98.2% | 99.7% | 91.6% | 81.3% | 80.5% | 77.8% |
| v6 | 93.25% | 96.0% | 99.8% | 97.5% | 92.4% | 78.8% | 82.5% |
| v7 | 96.07% | 100% | 99.8% | 99.2% | 97.9% | 93.8% | 84.0% |
Held-out stratified test (from the training-mix distribution): accuracy 0.9731, f1_macro 0.9675.
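For readers comparing the two numbers above, here is a minimal sketch of how accuracy and macro-averaged F1 are conventionally computed. It assumes the card uses the standard definitions; the function names and the toy labels are illustrative, not taken from the author's evaluation code.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of rows where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_macro(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never correct scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: the single "ta" row is misclassified as "hi".
y_true = ["hi", "hi", "hi", "ta"]
y_pred = ["hi", "hi", "hi", "hi"]
acc = accuracy(y_true, y_pred)
macro = f1_macro(y_true, y_pred)
```

Macro F1 weights every class equally, so a rare language that is often misclassified pulls it down more than it pulls down accuracy.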
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
repo = "dnivra26/muril-lang-id-v7"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()
LABELS = ["as","bn","en","gu","hi","kn","ks","ml","mr","ne","or","pa","sa","sd","ta","te","ur","undetermined"]
ENERGY_THRESHOLD = -7.0 # energy > threshold → flag as undetermined
text = "mera balance kitna hai"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.inference_mode():
    logits = model(**inputs).logits.squeeze(0)
# Energy score: negative log-sum-exp over the class logits
energy = -torch.logsumexp(logits, dim=0).item()
top = int(logits.argmax())
label = "undetermined" if energy > ENERGY_THRESHOLD else LABELS[top]
print(label) # → hi
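The energy gate in the snippet above can be illustrated without loading the model. The following is a minimal sketch in pure Python that mirrors the same rule on hand-written logits; `energy_score` and `decide` are hypothetical helper names, and the logit values are made up for illustration rather than produced by the checkpoint.

```python
import math
from typing import Sequence

LABELS = ["as", "bn", "en", "gu", "hi", "kn", "ks", "ml", "mr", "ne",
          "or", "pa", "sa", "sd", "ta", "te", "ur", "undetermined"]


def energy_score(logits: Sequence[float]) -> float:
    """Negative log-sum-exp of the logits, with the max subtracted for stability."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))


def decide(logits: Sequence[float], threshold: float = -7.0) -> str:
    """Return "undetermined" when the energy exceeds the threshold, else the argmax label."""
    if energy_score(logits) > threshold:
        return "undetermined"
    top = max(range(len(logits)), key=lambda i: logits[i])
    return LABELS[top]


# Made-up logits: one strongly peaked vector and one flat vector.
confident = [0.0] * 18
confident[4] = 12.0          # large logit on index 4 ("hi")
flat = [1.0] * 18            # no class stands out

confident_label = decide(confident)
flat_label = decide(flat)
```

A peaked logit vector yields a very negative energy and passes the gate, while a flat one sits closer to zero and is flagged, which is why a higher energy is treated as a sign of out-of-distribution input.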
Training
- Base: google/muril-base-cased
- Epochs: 3
- Batch size: 128, lr: 4e-5, precision: bf16 + TF32
- Max seq length: 128
- Datasets: AI4Bharat Bhasha-Abhijnaanam, AI4Bharat Aksharantar, SST-2, suhani-sarvam/google-dakshina, findnitai/english-to-hinglish, AmazonScience/MASSIVE, community-datasets/offenseval_dravidian (non-offensive only), bitext retail-banking, FLORES-200 (OOD), synthetic brand-laden English banking Q&A, banking-style European OOD (DE/FR/PT/ES/IT/TR/SV/NL), synthetic gibberish.
Notes
- Romanized Urdu and Hindi are merged into the hi label at inference time (Hindustani is effectively one spoken language).
- Pre-v6 checkpoints in this series only emit labels 0–16 and need a tighter energy threshold (-11.22).
- Works best when wrapped in a pipeline that runs Unicode-script short-circuiting first, so deterministic native-script inputs skip the model entirely.
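The script short-circuit mentioned in the last note could look roughly like the sketch below. It is an assumption about one reasonable design, not the author's pipeline: the `UNAMBIGUOUS_SCRIPTS` table and the `script_short_circuit` function are hypothetical, and only scripts that map to a single label in this model's label set are handled. Scripts shared by several labels (for example Devanagari or Bengali-Assamese) return `None` and should still go to the model.

```python
from typing import Optional

# Unicode block ranges for scripts that correspond to exactly one label here.
# This mapping is an illustrative assumption, not part of the released model.
UNAMBIGUOUS_SCRIPTS = {
    "pa": (0x0A00, 0x0A7F),  # Gurmukhi
    "gu": (0x0A80, 0x0AFF),  # Gujarati
    "or": (0x0B00, 0x0B7F),  # Odia
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
    "kn": (0x0C80, 0x0CFF),  # Kannada
    "ml": (0x0D00, 0x0D7F),  # Malayalam
}


def script_short_circuit(text: str) -> Optional[str]:
    """Return a label when every letter falls in one unambiguous script block.

    Returns None for Latin, mixed-script, ambiguous-script, or letter-free
    input, signalling that the classifier should be called instead.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None
    for label, (start, end) in UNAMBIGUOUS_SCRIPTS.items():
        if all(start <= ord(ch) <= end for ch in letters):
            return label
    return None


tamil = script_short_circuit("வணக்கம்")
romanized = script_short_circuit("mera balance kitna hai")
devanagari = script_short_circuit("मेरा बैलेंस")
```

In a wrapper, the model would only be invoked when this function returns `None`, so pure native-script messages in these seven scripts never reach the classifier.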