|
|
--- |
|
|
library_name: transformers |
|
|
license: mit |
|
|
pipeline_tag: text-classification |
|
|
tags: |
|
|
- language-identification |
|
|
- indian-languages |
|
|
- multilingual |
|
|
- muril |
|
|
--- |
|
|
|
|
|
# ILID: Native Script Language Identification for Indian Languages |
|
|
|
|
|
`yashingle-ai/ILID` is a MuRIL-based model fine-tuned for language identification, covering English and all 22 official Indian languages. It was presented in the paper [ILID: Native Script Language Identification for Indian Languages](https://huggingface.co/papers/2507.11832).
|
|
|
|
|
**Project Page:** [https://yashingle-ai.github.io/ILID/](https://yashingle-ai.github.io/ILID/) |
|
|
**Code Repository:** [https://github.com/yashingle-ai/TextLangDetect](https://github.com/yashingle-ai/TextLangDetect) |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
This model is a fine-tuned version of `google/muril-base-cased` for language identification, distinguishing English and all 22 official Indian languages. It addresses the challenge of identifying languages in noisy, short, and code-mixed text, a setting particularly relevant to India's diverse linguistic landscape.
|
|
|
|
|
- **Developed by:** Pruthwik Mishra, Yash Ingle |
|
|
- **Model type:** BERT-based model for sequence classification
|
|
- **Language(s) (NLP):** Multilingual (22 official Indian languages + English) |
|
|
- **Finetuned from model:** `google/muril-base-cased` |
|
|
|
|
|
## Uses |
|
|
|
|
|
The model can be directly used for English and Indian language identification, serving as a preprocessing step for applications like multilingual machine translation, information retrieval, and question answering. |
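For quick experiments, the model can also be loaded through the high-level `pipeline` API. A minimal sketch (the returned label names depend on the model's `id2label` configuration, so they are not hard-coded here):

```python
from transformers import pipeline

# Load the model as a ready-made text-classification pipeline
classifier = pipeline("text-classification", model="yashingle-ai/ILID")

# Returns a list of dicts with "label" and "score" keys
result = classifier("Hello, this is a test sentence.")
print(result)
```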
|
|
|
|
|
### Out-of-Scope Use |
|
|
The model is not designed for languages other than English and the official Indian languages. Performance may vary on very low-resource Indian languages. |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
The model may not perform optimally on very resource-poor languages such as Manipuri (in Meitei script), Sindhi, or Maithili, as highlighted in the original paper. |
|
|
|
|
|
### Recommendations |
|
|
Users should be aware of the model's limitations and potential biases when applying it to specific use cases or languages outside its primary training scope. |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
You can use the model directly with the `transformers` library for text classification: |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "yashingle-ai/ILID"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)


def identify_language(text: str) -> str:
    """Return the predicted language label for a piece of text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_class_id = logits.argmax(dim=-1).item()
    # id2label maps the class index to the language label
    return model.config.id2label[predicted_class_id]


# Example usage (Hindi text)
text_hi = "नमस्ते, यह एक परीक्षण वाक्य है।"
print(f"Text: '{text_hi}'")
print(f"Predicted language label: {identify_language(text_hi)}")

# Example usage (English text)
text_en = "Hello, this is a test sentence."
print(f"Text: '{text_en}'")
print(f"Predicted language label: {identify_language(text_en)}")
```
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
The model was trained on the newly created ILID (Indian Language Identification Dataset). |
|
|
|
|
|
### Training Hyperparameters |
|
|
The model was fine-tuned from `google/muril-base-cased`. |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
The model was evaluated on the newly created ILID corpus and the Bhasha-Abhijnaanam benchmark.
|
|
|
|
|
### Metrics |
|
|
The primary evaluation metric was the F1-score.
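For reference, a macro-averaged F1 (the unweighted mean of per-label F1 scores) can be computed as in the sketch below. The toy labels `hin`, `eng`, and `tam` are illustrative only and are not the model's actual label names:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)


# Toy example with three hypothetical language labels
y_true = ["hin", "eng", "hin", "tam", "eng"]
y_pred = ["hin", "eng", "eng", "tam", "eng"]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.822
```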
|
|
|
|
|
### Results |
|
|
The model achieved an average F1-score of 0.96. |
|
|
|
|
|
## Citation |
|
|
If you find our work helpful, please cite it:
|
|
|
|
|
```bibtex |
|
|
@misc{ingle2025ilidnativescriptlanguage, |
|
|
title={ILID: Native Script Language Identification for Indian Languages}, |
|
|
author={Yash Ingle and Pruthwik Mishra}, |
|
|
year={2025}, |
|
|
eprint={2507.11832}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL}, |
|
|
url={https://arxiv.org/abs/2507.11832}, |
|
|
} |
|
|
``` |