---
library_name: transformers
license: mit
datasets:
- yash-ingle/ILID_Indian_Language_Identification_Dataset
language:
- as
- en
- gu
- or
- mr
- ml
- te
- ta
- hi
- pa
- ur
- kn
- bn
metrics:
- f1
base_model:
- google/muril-base-cased
pipeline_tag: text-classification
---
## Model Details

### Model Description

The model is a MuRIL-based model fine-tuned for the language identification task using the Hugging Face Transformers library.

- **Developed by:** Pruthwik Mishra, Yash Ingle
- **Funded by:** SVNIT, Surat
- **License:** MIT
- **Finetuned from model:** google/muril-base-cased

### Model Sources

- **Repository:** [https://github.com/yashingle-ai/TextLangDetect](https://github.com/yashingle-ai/TextLangDetect)
- **Paper:** [ILID: Native Script Language Identification for Indian Languages](https://arxiv.org/abs/2507.11832)

### Uses

The model can be directly used for English and Indian language identification.

### How to Get Started with the Model
```python
"""Language Identification using fine-tuned model."""
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TextClassificationPipeline
from datasets import Dataset
import torch

# tokenizer of the cased MuRIL base model
tokenizer_model = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)
# use the GPU if one is available, otherwise fall back to the CPU
device = 0 if torch.cuda.is_available() else -1


model = AutoModelForSequenceClassification.from_pretrained("pruthwik/ilid-muril-model")


def preprocess_function(examples):
    """Preprocess function for processing the data."""
    tokenized_inputs = tokenizer(examples['text'], truncation=True, max_length=256, padding='max_length')
    return tokenized_inputs


index_to_label_dict = {0: 'asm', 1: 'ben', 2: 'brx', 3: 'doi', 4: 'eng', 5: 'gom', 6: 'guj', 7: 'hin', 8: 'kan', 9: 'kas', 10: 'mai', 11: 'mal', 12: 'mar', 13: 'mni_Beng', 14: 'mni_Mtei', 15: 'npi', 16: 'ory', 17: 'pan', 18: 'san', 19: 'sat', 20: 'snd_Arab', 21: 'snd_Deva', 22: 'tam', 23: 'tel', 24: 'urd'}
test_texts = ["Hello, how are you?", "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।", "आनी हो एक गंभीर मूर्खपणा.", "ਮਨੁੱਖੀ ਦਿਮਾਗ਼ ਦੀ ਕਾਢ ਨੇ ਭਾਵੇਂ ਸਭ ਕੁਝ ਸੌਖਾ ਕਰ ਦਿੱਤਾ ਹੈ ਪਰ ਫਿਰ ਵੀ ਸਭ ਕੁਝ ਸਮਝਣਾ ਜਾਂ ਕਰਨਾ ਨਿਯਮਾਂ ਵਿੱਚ ਬੱਝਾ ਪਿਆ ਹੈ।", "માં વરસાદનું પાણી મોટા જથ્થામાં જમીનની નીચે જ ઉતરી જાય છે।", "କିନ୍ତୁ ପୁଅ, ତୁମେ ଛୋଟ।"]
test_dataset_raw = Dataset.from_dict({'text': test_texts})
# number of labels the classifier distinguishes
num_labels = len(index_to_label_dict)
print(f'Number of labels: {num_labels}')
# optionally tokenize the texts as a Dataset; the pipeline below accepts raw strings directly
test_tokenized_dataset = test_dataset_raw.map(preprocess_function, batched=True)
test_tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
# build a text-classification pipeline from the fine-tuned model and tokenizer
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=device)
# run the pipeline on the test sentences
predictions_test = pipe(test_texts, truncation=True, max_length=256)
# pipeline labels look like 'LABEL_7'; recover the index and map it to a language code
actual_labels_test = []
for prediction in predictions_test:
    pred_label = prediction['label']
    pred_index = pred_label.split('_')[1]
    actual_labels_test.append(index_to_label_dict[int(pred_index)])
print(actual_labels_test)
"""Output:
['eng', 'hin', 'gom', 'pan', 'guj', 'ory']
"""
```
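If per-language confidence scores are needed, the pipeline can also return the full score distribution. The sketch below reuses `pipe` and `index_to_label_dict` from the example above; the `top_k=None` argument is available in recent transformers releases (older versions used `return_all_scores=True` instead).

```python
# in recent transformers versions, top_k=None returns scores for every class
# instead of only the best one
all_scores = pipe("Hello, how are you?", truncation=True, max_length=256, top_k=None)

# map LABEL_<i> names to language codes and keep the three most likely languages
ranked = sorted(all_scores, key=lambda item: item['score'], reverse=True)[:3]
for item in ranked:
    code = index_to_label_dict[int(item['label'].split('_')[-1])]
    print(code, round(item['score'], 4))
```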
### Label Indices For Languages
- 0: asm (Assamese)
- 1: ben (Bengali)
- 2: brx (Bodo)
- 3: doi (Dogri)
- 4: eng (English)
- 5: gom (Konkani)
- 6: guj (Gujarati)
- 7: hin (Hindi)
- 8: kan (Kannada)
- 9: kas (Kashmiri)
- 10: mai (Maithili)
- 11: mal (Malayalam)
- 12: mar (Marathi)
- 13: mni_Beng (Manipuri in Bengali Script)
- 14: mni_Mtei (Manipuri in Meitei Script)
- 15: npi (Nepali)
- 16: ory (Oriya/Odia)
- 17: pan (Punjabi)
- 18: san (Sanskrit)
- 19: sat (Santhali)
- 20: snd_Arab (Sindhi in Perso-Arabic Script)
- 21: snd_Deva (Sindhi in Devanagari Script)
- 22: tam (Tamil)
- 23: tel (Telugu)
- 24: urd (Urdu)
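If you prefer the pipeline to return these language codes directly instead of generic `LABEL_n` names, the mapping can be attached to the model config before building the pipeline. This is a minimal sketch, assuming the label order of the fine-tuned checkpoint matches the indices listed above.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          TextClassificationPipeline)

index_to_label_dict = {0: 'asm', 1: 'ben', 2: 'brx', 3: 'doi', 4: 'eng', 5: 'gom',
                       6: 'guj', 7: 'hin', 8: 'kan', 9: 'kas', 10: 'mai', 11: 'mal',
                       12: 'mar', 13: 'mni_Beng', 14: 'mni_Mtei', 15: 'npi', 16: 'ory',
                       17: 'pan', 18: 'san', 19: 'sat', 20: 'snd_Arab', 21: 'snd_Deva',
                       22: 'tam', 23: 'tel', 24: 'urd'}

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("pruthwik/ilid-muril-model")

# attach human-readable label names so the pipeline reports e.g. 'hin' instead of 'LABEL_7'
model.config.id2label = index_to_label_dict
model.config.label2id = {label: idx for idx, label in index_to_label_dict.items()}

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)
print(pipe("जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।"))
# e.g. [{'label': 'hin', 'score': ...}]
```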
### Downstream Use

The model can be integrated into any pipeline that requires language identification for the languages listed above.

### Out-of-Scope Use

The model is not intended for languages other than English and the Indian languages listed above.

## Limitations

The model may not perform well on very low-resource languages such as Manipuri (in the Meitei script), Sindhi, and Maithili.


## Training Details

### Training Data

[Train Data](https://huggingface.co/datasets/yash-ingle/ILID_Indian_Language_Identification_Dataset)

### Dev Data
[Dev Data](https://huggingface.co/datasets/yash-ingle/ILID_Indian_Language_Identification_Dataset)
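The train and dev splits can be pulled straight from the Hugging Face Hub with the `datasets` library. This is a minimal sketch; the split names and the `text`/`label` column names are assumptions and may differ from the actual dataset configuration.

```python
from datasets import load_dataset

# load the ILID corpus from the Hub; split names are assumed here
ilid = load_dataset("yash-ingle/ILID_Indian_Language_Identification_Dataset")
print(ilid)                 # inspect the available splits and columns
train_data = ilid["train"]  # assumed split name
print(train_data[0])        # e.g. {'text': ..., 'label': ...} (assumed columns)
```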

### Training Procedure
Training fine-tunes the MuRIL base cased model for 10 epochs; a minimal fine-tuning sketch follows the hyperparameter list below.


#### Training Hyperparameters

- **Training regime:** fp32
- **Training Batch Size:** 32
- **Evaluation Batch Size:** 32
- **Learning Rate:** 2e-5
- **Weight Decay:** 0.01
- **Epochs:** 10
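
The sketch below shows how such a fine-tuning run could be set up with the `Trainer` API using the hyperparameters above. It is an illustration, not the authors' training script, and the dataset split and column names are assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("google/muril-base-cased",
                                                           num_labels=25)

# split and column names are assumptions about the ILID dataset layout
dataset = load_dataset("yash-ingle/ILID_Indian_Language_Identification_Dataset")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="ilid-muril-model",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],      # assumed split name
    eval_dataset=tokenized["validation"],  # assumed split name
)
trainer.train()
```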

#### Dataset Size

The ILID (Indian Language Identification Dataset) corpus contains 250K instances.

## Evaluation

The model is evaluated on the created corpus and on the Bhasha-Abhijnaanam benchmark.
### Testing Data, Factors & Metrics

#### Testing Data

[Test Data](https://huggingface.co/datasets/yash-ingle/ILID_Indian_Language_Identification_Dataset)


#### Metrics

F1-score
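
For reference, a macro-averaged F1 score on a labelled test set could be computed along these lines. This is a sketch, assuming the test split exposes `text` and integer `label` columns matching the label indices above.

```python
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          TextClassificationPipeline)

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("pruthwik/ilid-muril-model")
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)

# split and column names are assumptions about the ILID dataset layout
test_split = load_dataset("yash-ingle/ILID_Indian_Language_Identification_Dataset",
                          split="test")

predictions = pipe(test_split["text"], truncation=True, max_length=256)
pred_ids = [int(p["label"].split("_")[-1]) for p in predictions]  # labels look like 'LABEL_7'
gold_ids = test_split["label"]

print("macro F1:", f1_score(gold_ids, pred_ids, average="macro"))
```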

### Results

0.96 F1 score on average

#### Summary

The model is a MuRIL-based model fine-tuned for the language identification task. It can identify English and all 22 official Indian languages.

### Compute Infrastructure

The model was trained on a single NVIDIA H100 GPU with 94 GB of memory.

## Citation
**BibTeX:**
```
@misc{ingle2025ilidnativescriptlanguage,
      title={ILID: Native Script Language Identification for Indian Languages}, 
      author={Yash Ingle and Pruthwik Mishra},
      year={2025},
      eprint={2507.11832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11832}, 
}
```