---
library_name: transformers
license: mit
pipeline_tag: text-classification
tags:
  - language-identification
  - indian-languages
  - multilingual
  - muril
---

# ILID: Native Script Language Identification for Indian Languages

The model `yashingle-ai/ILID` is a MuRIL-based model fine-tuned for the language identification task, capable of identifying English and all 22 official Indian languages. It was presented in the paper [ILID: Native Script Language Identification for Indian Languages](https://huggingface.co/papers/2507.11832).

**Project Page:** [https://yashingle-ai.github.io/ILID/](https://yashingle-ai.github.io/ILID/)
**Code Repository:** [https://github.com/yashingle-ai/TextLangDetect](https://github.com/yashingle-ai/TextLangDetect)

## Model Details

### Model Description
This model is a fine-tuned version of `google/muril-base-cased` for language identification across English and the 22 official Indian languages. It addresses the challenges of identifying languages in noisy, short, and code-mixed text, which is particularly relevant for the diverse linguistic landscape of India.

-   **Developed by:** Pruthwik Mishra, Yash Ingle
-   **Model type:** BERT-based for Sequence Classification
-   **Language(s) (NLP):** Multilingual (22 official Indian languages + English)
-   **Finetuned from model:** `google/muril-base-cased`

## Uses

The model can be directly used for English and Indian language identification, serving as a preprocessing step for applications like multilingual machine translation, information retrieval, and question answering.
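Since this is a standard sequence-classification checkpoint, the generic `transformers` text-classification `pipeline` should also work as a lighter-weight entry point (a sketch; it is assumed the pipeline picks up the model's built-in label mapping):

```python
from transformers import pipeline

# Load the checkpoint through the generic text-classification pipeline.
classifier = pipeline("text-classification", model="yashingle-ai/ILID")

# Returns a list of {"label": ..., "score": ...} dicts, one per input.
result = classifier("Hello, this is a test sentence.")
print(result)
```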

### Out-of-Scope Use
The model is not designed for languages other than English and the official Indian languages. Performance may vary on very low-resource Indian languages.

## Bias, Risks, and Limitations
The model may not perform optimally on very resource-poor languages such as Manipuri (in Meitei script), Sindhi, or Maithili, as highlighted in the original paper.

### Recommendations
Users should be aware of the model's limitations and potential biases when applying it to specific use cases or languages outside its primary training scope.

## How to Get Started with the Model

You can use the model directly with the `transformers` library for text classification:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "yashingle-ai/ILID" # Replace with actual model ID if different
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage (Hindi text)
text_to_classify_hi = "नमस्ते, यह एक परीक्षण वाक्य है।"  # "Hello, this is a test sentence."
inputs_hi = tokenizer(text_to_classify_hi, return_tensors="pt")

with torch.no_grad():
    logits_hi = model(**inputs_hi).logits

predicted_class_id_hi = logits_hi.argmax().item()
predicted_label_hi = model.config.id2label[predicted_class_id_hi]  # map class id to language label

print(f"Text: '{text_to_classify_hi}'")
print(f"Predicted language label: {predicted_label_hi}")

# Example usage (English text)
text_to_classify_en = "Hello, this is a test sentence."
inputs_en = tokenizer(text_to_classify_en, return_tensors="pt")

with torch.no_grad():
    logits_en = model(**inputs_en).logits

predicted_class_id_en = logits_en.argmax().item()
predicted_label_en = model.config.id2label[predicted_class_id_en]

print(f"Text: '{text_to_classify_en}'")
print(f"Predicted language label: {predicted_label_en}")
```
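The snippet above reports only the argmax label. To also report a confidence score, apply a softmax over the label dimension of the logits and take the top-k entries. The tensor below is a dummy stand-in for `model(**inputs).logits`, which for this model has shape `(batch, 23)` (22 official Indian languages plus English); with the real model, map each class id through `model.config.id2label` as shown earlier.

```python
import torch

# Dummy logits standing in for model(**inputs).logits, shape (1, 23).
logits = torch.tensor([[2.5, 0.1, -1.0] + [0.0] * 20])

probs = torch.softmax(logits, dim=-1)  # probabilities over the 23 labels
top = torch.topk(probs, k=3, dim=-1)   # the 3 most likely labels

for prob, class_id in zip(top.values[0], top.indices[0]):
    # With the real model: model.config.id2label[class_id.item()]
    print(f"class {class_id.item()}: {prob.item():.3f}")
```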

## Training Details

### Training Data
The model was trained on the newly created ILID (Indian Language Identification Dataset).

### Training Hyperparameters
The model was fine-tuned from `google/muril-base-cased`.

## Evaluation

The model was evaluated on the created ILID corpus and the Bhasha-Abhijnaanam benchmark.

### Metrics
The primary evaluation metric used was F1-score.

### Results
The model achieved an average F1-score of 0.96.

## Citation
If you find our work helpful or inspiring, please feel free to cite it.

```bibtex
@misc{ingle2025ilidnativescriptlanguage,
      title={ILID: Native Script Language Identification for Indian Languages},
      author={Yash Ingle and Pruthwik Mishra},
      year={2025},
      eprint={2507.11832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11832},
}
```