---
library_name: transformers
license: mit
datasets:
- yash-ingle/ILID_Indian_Language_Identification_Dataset
language:
- as
- en
- gu
- or
- mr
- ml
- te
- ta
- hi
- pa
- ur
- kn
- bn
metrics:
- f1
base_model:
- google/muril-base-cased
pipeline_tag: text-classification
---
## Model Details
### Model Description
The model is a fine-tuned version of MuRIL for the language identification task, trained with the Hugging Face Transformers library.
- **Developed by:** Pruthwik Mishra, Yash Ingle
- **Funded by:** SVNIT, Surat
- **License:** MIT
- **Finetuned from model:** google/muril-base-cased
### Model Sources
- **Repository:** [https://github.com/yashingle-ai/TextLangDetect](https://github.com/yashingle-ai/TextLangDetect)
- **Paper:** [https://arxiv.org/abs/2507.11832](https://arxiv.org/abs/2507.11832)
### Uses
The model can be directly used for English and Indian language identification.
### How to Get Started with the Model
```python
"""Language identification using the fine-tuned ILID MuRIL model."""
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import TextClassificationPipeline

# tokenizer of the cased MuRIL base model the classifier was fine-tuned from
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("pruthwik/ilid-muril-model")

# mapping from class indices to language codes
index_to_label_dict = {0: 'asm', 1: 'ben', 2: 'brx', 3: 'doi', 4: 'eng', 5: 'gom', 6: 'guj', 7: 'hin', 8: 'kan', 9: 'kas', 10: 'mai', 11: 'mal', 12: 'mar', 13: 'mni_Beng', 14: 'mni_Mtei', 15: 'npi', 16: 'ory', 17: 'pan', 18: 'san', 19: 'sat', 20: 'snd_Arab', 21: 'snd_Deva', 22: 'tam', 23: 'tel', 24: 'urd'}
print(f'Number of labels: {len(index_to_label_dict)}')

test_texts = ["Hello, how are you?", "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।", "आनी हो एक गंभीर मूर्खपणा.", "ਮਨੁੱਖੀ ਦਿਮਾਗ਼ ਦੀ ਕਾਢ ਨੇ ਭਾਵੇਂ ਸਭ ਕੁਝ ਸੌਖਾ ਕਰ ਦਿੱਤਾ ਹੈ ਪਰ ਫਿਰ ਵੀ ਸਭ ਕੁਝ ਸਮਝਣਾ ਜਾਂ ਕਰਨਾ ਨਿਯਮਾਂ ਵਿੱਚ ਬੱਝਾ ਪਿਆ ਹੈ।", "માં વરસાદનું પાણી મોટા જથ્થામાં જમીનની નીચે જ ઉતરી જાય છે।", "କିନ୍ତୁ ପୁଅ, ତୁମେ ଛୋଟ।"]

# device=0 runs inference on the first GPU; pass device=-1 to run on CPU
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=0)
predictions_test = pipe(test_texts, truncation=True, max_length=256)

# the pipeline emits labels such as 'LABEL_7'; map them back to language codes
actual_labels_test = []
for prediction in predictions_test:
    pred_index = int(prediction['label'].split('_')[1])
    actual_labels_test.append(index_to_label_dict[pred_index])
print(actual_labels_test)
"""Output:
['eng', 'hin', 'gom', 'pan', 'guj', 'ory']
"""
```
### Label Indices For Languages
- 0: asm (Assamese)
- 1: ben (Bengali)
- 2: brx (Bodo)
- 3: doi (Dogri)
- 4: eng (English)
- 5: gom (Konkani)
- 6: guj (Gujarati)
- 7: hin (Hindi)
- 8: kan (Kannada)
- 9: kas (Kashmiri)
- 10: mai (Maithili)
- 11: mal (Malayalam)
- 12: mar (Marathi)
- 13: mni_Beng (Manipuri in Bengali Script)
- 14: mni_Mtei (Manipuri in Meitei Script)
- 15: npi (Nepali)
- 16: ory (Oriya/Odia)
- 17: pan (Punjabi)
- 18: san (Sanskrit)
- 19: sat (Santhali)
- 20: snd_Arab (Sindhi in Perso-Arabic Script)
- 21: snd_Deva (Sindhi in Devanagari Script)
- 22: tam (Tamil)
- 23: tel (Telugu)
- 24: urd (Urdu)
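The pipeline in the quick-start example returns labels of the form `LABEL_<index>`; the mapping above can be wrapped in a small decoding helper (a sketch; the `decode_label` name is illustrative, not part of the model's API):

```python
# index-to-language mapping, as listed above
index_to_label = {
    0: 'asm', 1: 'ben', 2: 'brx', 3: 'doi', 4: 'eng', 5: 'gom',
    6: 'guj', 7: 'hin', 8: 'kan', 9: 'kas', 10: 'mai', 11: 'mal',
    12: 'mar', 13: 'mni_Beng', 14: 'mni_Mtei', 15: 'npi', 16: 'ory',
    17: 'pan', 18: 'san', 19: 'sat', 20: 'snd_Arab', 21: 'snd_Deva',
    22: 'tam', 23: 'tel', 24: 'urd',
}

def decode_label(pipeline_label: str) -> str:
    """Map a pipeline label such as 'LABEL_7' to its language code."""
    index = int(pipeline_label.split('_')[1])
    return index_to_label[index]

print(decode_label("LABEL_7"))   # hin
print(decode_label("LABEL_16"))  # ory
```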
### Downstream Use
The model can be integrated into any pipeline that requires language identification for the supported languages.
### Out-of-Scope Use
The model may not work for languages other than English and Indian languages.
## Limitations
The model may not perform well on very low-resource languages such as Manipuri (in Meitei script), Sindhi, and Maithili.
## Training Details
### Training Data
[Train Data](https://huggingface.co/datasets/yash-ingle/ILID_Indian_Language_Identification_Dataset)
### Dev Data
[Dev Data](https://huggingface.co/datasets/yash-ingle/ILID_Indian_Language_Identification_Dataset)
### Training Procedure
Training consists of fine-tuning the MuRIL base cased model for 10 epochs.
#### Training Hyperparameters
- **Training regime:** fp32
- **Training Batch Size:** 32
- **Evaluation Batch Size:** 32
- **Learning Rate:** 0.00002
- **Weight Decay:** 0.01
- **Epoch:** 10
#### Size
250K instances in the ILID (Indian Language Identification Dataset) corpus.
## Evaluation
The models are evaluated on the created corpora and the Bhasha-Abhijnaanam benchmark.
### Testing Data, Factors & Metrics
#### Testing Data
[Test Data](https://huggingface.co/datasets/yash-ingle/ILID_Indian_Language_Identification_Dataset)
#### Metrics
F1-score
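The F1-score can be computed from gold and predicted language codes with `sklearn.metrics.f1_score`; a sketch with toy labels (illustrative data, not the actual evaluation set, assuming macro averaging):

```python
from sklearn.metrics import f1_score

# toy gold and predicted language codes, for illustration only
y_true = ['eng', 'hin', 'eng', 'hin', 'tam']
y_pred = ['eng', 'hin', 'hin', 'hin', 'tam']

# macro-averaged F1 weights every language equally, regardless of frequency
macro_f1 = f1_score(y_true, y_pred, average='macro')
print(round(macro_f1, 4))  # 0.8222
```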
### Results
0.96 F1 on average
#### Summary
The model is a MuRIL-based model fine-tuned for the language identification task. It can identify English and all 22 official Indian languages.
### Compute Infrastructure
The model was trained on one NVIDIA H100 GPU with 94 GB RAM.
## Citation
**BibTeX:**
```
@misc{ingle2025ilidnativescriptlanguage,
title={ILID: Native Script Language Identification for Indian Languages},
author={Yash Ingle and Pruthwik Mishra},
year={2025},
eprint={2507.11832},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.11832},
}
``` |