---
library_name: transformers
license: mit
pipeline_tag: text-classification
tags:
  - language-identification
  - indian-languages
  - multilingual
  - muril
---

# ILID: Native Script Language Identification for Indian Languages

`yashingle-ai/ILID` is a MuRIL-based model fine-tuned for language identification, capable of identifying English and all 22 official Indian languages. It was presented in the paper [ILID: Native Script Language Identification for Indian Languages](https://arxiv.org/abs/2507.11832).

- **Project Page:** https://yashingle-ai.github.io/ILID/
- **Code Repository:** https://github.com/yashingle-ai/TextLangDetect

## Model Details

### Model Description

This model is a fine-tuned version of google/muril-base-cased on the language identification task, capable of distinguishing between English and all 22 official Indian languages. It addresses the challenges of distinguishing languages in noisy, short, and code-mixed environments, particularly relevant for the diverse linguistic landscape of India.

- **Developed by:** Pruthwik Mishra, Yash Ingle
- **Model type:** BERT-based model for sequence classification
- **Language(s) (NLP):** Multilingual (22 official Indian languages + English)
- **Finetuned from model:** google/muril-base-cased

## Uses

The model can be directly used for English and Indian language identification, serving as a preprocessing step for applications like multilingual machine translation, information retrieval, and question answering.
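Once a language label has been predicted, it can act as a routing gate in such a pipeline, dispatching each input to a language-specific downstream component. A minimal sketch of that routing logic (the label strings and handlers below are hypothetical placeholders, not the model's actual `id2label` values):

```python
# Route a sentence to a per-language handler based on the predicted label.
# In practice the label would come from the classifier; here it is stubbed.

def route(text: str, predicted_label: str, handlers: dict) -> str:
    # Fall back to a default handler for labels without a dedicated pipeline.
    handler = handlers.get(predicted_label, handlers["default"])
    return handler(text)

handlers = {
    "hin": lambda t: f"[hi-en MT] {t}",        # hypothetical Hindi handler
    "eng": lambda t: f"[en passthrough] {t}",  # hypothetical English handler
    "default": lambda t: f"[generic] {t}",
}

print(route("नमस्ते", "hin", handlers))  # → routed to the Hindi handler
```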

### Out-of-Scope Use

The model is not designed for languages other than English and the official Indian languages. Performance may vary on very low-resource Indian languages.

## Bias, Risks, and Limitations

The model may not perform optimally on very resource-poor languages such as Manipuri (in Meitei script), Sindhi, or Maithili, as highlighted in the original paper.

### Recommendations

Users should be aware of the model's limitations and potential biases when applying it to specific use cases or languages outside its primary training scope.

## How to Get Started with the Model

You can use the model directly with the transformers library for text classification:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "yashingle-ai/ILID"  # Replace with actual model ID if different
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage (Hindi text)
text_to_classify_hi = "नमस्ते, यह एक परीक्षण वाक्य है।"
inputs_hi = tokenizer(text_to_classify_hi, return_tensors="pt")

with torch.no_grad():
    logits_hi = model(**inputs_hi).logits

predicted_class_id_hi = logits_hi.argmax(-1).item()
predicted_label_hi = model.config.id2label[predicted_class_id_hi]  # Maps to LABEL_X

print(f"Text: '{text_to_classify_hi}'")
print(f"Predicted language label: {predicted_label_hi}")

# Example usage (English text)
text_to_classify_en = "Hello, this is a test sentence."
inputs_en = tokenizer(text_to_classify_en, return_tensors="pt")

with torch.no_grad():
    logits_en = model(**inputs_en).logits

predicted_class_id_en = logits_en.argmax(-1).item()
predicted_label_en = model.config.id2label[predicted_class_id_en]

print(f"Text: '{text_to_classify_en}'")
print(f"Predicted language label: {predicted_label_en}")
```
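The snippet above takes the argmax of the logits directly; to also report a confidence score, apply a softmax first. A minimal pure-Python sketch on stand-in logits (in practice you would call `torch.softmax(logits, dim=-1)` on the model output and look names up in `model.config.id2label`):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Stand-in logits for a 4-class toy example (NOT the model's real output).
logits = [0.5, 3.2, -1.0, 0.1]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

print(f"class {best} with confidence {probs[best]:.3f}")
```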

## Training Details

### Training Data

The model was trained on the newly created ILID (Indian Language Identification Dataset).

### Training Hyperparameters

The model was fine-tuned from google/muril-base-cased.

## Evaluation

The model was evaluated on the created ILID corpus and the Bhasha-Abhijnaanam benchmark.

### Metrics

The primary evaluation metric used was F1-score.
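If you want to evaluate the model on your own held-out data, macro-averaged F1 can be computed without extra dependencies. A minimal sketch (the label values below are illustrative; the paper does not specify its exact averaging setup, so treat this as one common variant):

```python
from collections import Counter

def macro_f1(gold, pred):
    # Per-class F1, averaged with equal weight over all observed classes.
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = []
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(labels)

gold = ["hin", "eng", "hin", "tam"]
pred = ["hin", "eng", "eng", "tam"]
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```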

### Results

The model achieved an average F1-score of 0.96.

## Citation

If you find our work helpful or inspiring, please cite:

```bibtex
@misc{ingle2025ilidnativescriptlanguage,
      title={ILID: Native Script Language Identification for Indian Languages},
      author={Yash Ingle and Pruthwik Mishra},
      year={2025},
      eprint={2507.11832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11832},
}
```