---
library_name: transformers
license: mit
pipeline_tag: text-classification
tags:
- language-identification
- indian-languages
- multilingual
- muril
---
# ILID: Native Script Language Identification for Indian Languages
The model `yashingle-ai/ILID` is a MuRIL-based model fine-tuned for language identification, capable of identifying English and all 22 official Indian languages. It was presented in the paper *ILID: Native Script Language Identification for Indian Languages*.

- Project Page: https://yashingle-ai.github.io/ILID/
- Code Repository: https://github.com/yashingle-ai/TextLangDetect
## Model Details

### Model Description
This model is a fine-tuned version of `google/muril-base-cased` for language identification, capable of distinguishing between English and all 22 official Indian languages. It addresses the challenges of distinguishing languages in noisy, short, and code-mixed text, particularly relevant for the diverse linguistic landscape of India.
- **Developed by:** Pruthwik Mishra, Yash Ingle
- **Model type:** BERT-based model for sequence classification
- **Language(s) (NLP):** Multilingual (22 official Indian languages + English)
- **Finetuned from model:** `google/muril-base-cased`
## Uses
The model can be directly used for English and Indian language identification, serving as a preprocessing step for applications like multilingual machine translation, information retrieval, and question answering.
### Out-of-Scope Use
The model is not designed for languages other than English and the official Indian languages. Performance may vary on very low-resource Indian languages.
## Bias, Risks, and Limitations
The model may not perform optimally on very resource-poor languages such as Manipuri (in Meitei script), Sindhi, or Maithili, as highlighted in the original paper.
### Recommendations
Users should be aware of the model's limitations and potential biases when applying it to specific use cases or languages outside its primary training scope.
## How to Get Started with the Model
You can use the model directly with the `transformers` library for text classification:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "yashingle-ai/ILID"  # Replace with actual model ID if different
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage (Hindi text)
text_to_classify_hi = "नमस्ते, यह एक परीक्षण वाक्य है।"
inputs_hi = tokenizer(text_to_classify_hi, return_tensors="pt")
with torch.no_grad():
    logits_hi = model(**inputs_hi).logits
predicted_class_id_hi = logits_hi.argmax().item()
predicted_label_hi = model.config.id2label[predicted_class_id_hi]  # Maps to LABEL_X
print(f"Text: '{text_to_classify_hi}'")
print(f"Predicted language label: {predicted_label_hi}")

# Example usage (English text)
text_to_classify_en = "Hello, this is a test sentence."
inputs_en = tokenizer(text_to_classify_en, return_tensors="pt")
with torch.no_grad():
    logits_en = model(**inputs_en).logits
predicted_class_id_en = logits_en.argmax().item()
predicted_label_en = model.config.id2label[predicted_class_id_en]
print(f"Text: '{text_to_classify_en}'")
print(f"Predicted language label: {predicted_label_en}")
```
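When the model feeds a downstream pipeline (e.g. routing text to a language-specific MT system), it can help to turn the raw logits into a confidence distribution and only act on predictions above a threshold — especially for the low-resource languages noted above. The label map and logit values below are illustrative placeholders, not the model's actual output; only the softmax/threshold logic is the point of this sketch:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_threshold(logits, id2label, threshold=0.5):
    """Return (label, confidence); label is None when confidence < threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = id2label[best] if probs[best] >= threshold else None
    return label, probs[best]

# Hypothetical 3-way label map and logits, for illustration only
id2label = {0: "hin_Deva", 1: "eng_Latn", 2: "mar_Deva"}
label, conf = predict_with_threshold([4.1, 0.2, 1.3], id2label)
print(label, round(conf, 3))  # top prediction with its confidence
```

In practice you would pass `model(**inputs).logits[0].tolist()` and `model.config.id2label` into `predict_with_threshold`, and fall back to a default (or abstain) when the label comes back `None`.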
## Training Details

### Training Data
The model was trained on the newly created ILID (Indian Language Identification Dataset).
### Training Hyperparameters
The model was fine-tuned from `google/muril-base-cased`.
## Evaluation
The model was evaluated on the created ILID corpus and the Bhasha-Abhijnaanam benchmark.
### Metrics
The primary evaluation metric was the F1-score.
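For reference, a macro-averaged F1 (per-language F1, then the unweighted mean across languages) can be sketched as below. The gold/predicted label lists are made-up illustrations, not results from the paper:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for lang in labels:
        tp = sum(t == lang and p == lang for t, p in zip(y_true, y_pred))
        fp = sum(t != lang and p == lang for t, p in zip(y_true, y_pred))
        fn = sum(t == lang and p != lang for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Made-up gold and predicted language labels, for illustration only
gold = ["hin", "eng", "hin", "tam", "eng"]
pred = ["hin", "eng", "eng", "tam", "eng"]
print(round(macro_f1(gold, pred), 3))
```

Macro averaging weights every language equally, so errors on low-resource languages are not masked by strong performance on high-resource ones.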
### Results
The model achieved an average F1-score of 0.96.
## Citation
If you find our work helpful or inspiring, please cite it:
```bibtex
@misc{ingle2025ilidnativescriptlanguage,
      title={ILID: Native Script Language Identification for Indian Languages},
      author={Yash Ingle and Pruthwik Mishra},
      year={2025},
      eprint={2507.11832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11832},
}
```