---
library_name: transformers
license: mit
datasets:
- yash-ingle/ILID_Indian_Language_Identification_Dataset
language:
- as
- en
- gu
- or
- mr
- ml
- te
- ta
- hi
- pa
- ur
- kn
- bn
metrics:
- f1
base_model:
- google/muril-base-cased
pipeline_tag: text-classification
---

## Model Details

### Model Description

This model is a fine-tuned version of MuRIL (`google/muril-base-cased`) for the language identification task, trained with the Hugging Face Transformers library.

- **Developed by:** Pruthwik Mishra, Yash Ingle
- **Funded by:** SVNIT, Surat
- **License:** MIT
- **Finetuned from model:** google/muril-base-cased

### Model Sources

- **Repository:** [https://github.com/yashingle-ai/TextLangDetect](https://github.com/yashingle-ai/TextLangDetect)
- **Paper:** [https://arxiv.org/abs/2507.11832](https://arxiv.org/abs/2507.11832)

### Uses

The model can be used directly to identify English and Indian languages in text.

### How to Get Started with the Model

```
"""Language identification using the fine-tuned model."""
import torch
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import TextClassificationPipeline

# The tokenizer comes from the cased MuRIL base model
tokenizer_model = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)

# Load the fine-tuned classification model
model = AutoModelForSequenceClassification.from_pretrained("pruthwik/ilid-muril-model")

# Use the first GPU if one is available, otherwise fall back to CPU (-1)
device = 0 if torch.cuda.is_available() else -1

# Map the model's output indices to language codes
index_to_label_dict = {0: 'asm', 1: 'ben', 2: 'brx', 3: 'doi', 4: 'eng', 5: 'gom', 6: 'guj', 7: 'hin', 8: 'kan', 9: 'kas', 10: 'mai', 11: 'mal', 12: 'mar', 13: 'mni_Beng', 14: 'mni_Mtei', 15: 'npi', 16: 'ory', 17: 'pan', 18: 'san', 19: 'sat', 20: 'snd_Arab', 21: 'snd_Deva', 22: 'tam', 23: 'tel', 24: 'urd'}
num_labels = len(index_to_label_dict)
print(f'Number of labels: {num_labels}')

test_texts = ["Hello, how are you?", "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।", "आनी हो एक गंभीर मूर्खपणा.", "ਮਨੁੱਖੀ ਦਿਮਾਗ਼ ਦੀ ਕਾਢ ਨੇ ਭਾਵੇਂ ਸਭ ਕੁਝ ਸੌਖਾ ਕਰ ਦਿੱਤਾ ਹੈ ਪਰ ਫਿਰ ਵੀ ਸਭ ਕੁਝ ਸਮਝਣਾ ਜਾਂ ਕਰਨਾ ਨਿਯਮਾਂ ਵਿੱਚ ਬੱਝਾ ਪਿਆ ਹੈ।", "માં વરસાદનું પાણી મોટા જથ્થામાં જમીનની નીચે જ ઉતરી જાય છે।", "କିନ୍ତୁ ପୁଅ, ତୁମେ ଛୋଟ।"]

# Build the classification pipeline and predict a label for each text
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=device)
predictions_test = pipe(test_texts, truncation=True, max_length=256)

# Each predicted label ends in "_<index>"; convert that index to a language code
actual_labels_test = []
for prediction in predictions_test:
    pred_label = prediction['label']
    pred_index = pred_label.split('_')[1]
    actual_labels_test.append(index_to_label_dict[int(pred_index)])
print(actual_labels_test)

"""Output:
['eng', 'hin', 'gom', 'pan', 'guj', 'ory']
"""
```

### Label Indices For Languages

- 0: asm (Assamese)
- 1: ben (Bengali)
- 2: brx (Bodo)
- 3: doi (Dogri)
- 4: eng (English)
- 5: gom (Konkani)
- 6: guj (Gujarati)
- 7: hin (Hindi)
- 8: kan (Kannada)
- 9: kas (Kashmiri)
- 10: mai (Maithili)
- 11: mal (Malayalam)
- 12: mar (Marathi)
- 13: mni_Beng (Manipuri in Bengali Script)
- 14: mni_Mtei (Manipuri in Meitei Script)
- 15: npi (Nepali)
- 16: ory (Oriya/Odia)
- 17: pan (Punjabi)
- 18: san (Sanskrit)
- 19: sat (Santhali)
- 20: snd_Arab (Sindhi in Perso-Arabic Script)
- 21: snd_Deva (Sindhi in Devanagari Script)
- 22: tam (Tamil)
- 23: tel (Telugu)
- 24: urd (Urdu)

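
The index list above can be turned into a small lookup helper. The following is a minimal sketch, assuming the classifier emits labels of the form `<prefix>_<index>` (for example `LABEL_7`); the names `INDEX_TO_LABEL` and `decode_label` are illustrative and not part of the released code.

```python
# Index-to-code mapping copied from the label list in this model card.
INDEX_TO_LABEL = {
    0: "asm", 1: "ben", 2: "brx", 3: "doi", 4: "eng", 5: "gom", 6: "guj",
    7: "hin", 8: "kan", 9: "kas", 10: "mai", 11: "mal", 12: "mar",
    13: "mni_Beng", 14: "mni_Mtei", 15: "npi", 16: "ory", 17: "pan",
    18: "san", 19: "sat", 20: "snd_Arab", 21: "snd_Deva", 22: "tam",
    23: "tel", 24: "urd",
}


def decode_label(pipeline_label: str) -> str:
    """Convert a label such as 'LABEL_7' into a language code such as 'hin'."""
    # Assumption: the numeric index is whatever follows the last underscore.
    _, index = pipeline_label.rsplit("_", 1)
    return INDEX_TO_LABEL[int(index)]


print(decode_label("LABEL_7"))
print(decode_label("LABEL_13"))
```

Splitting on the last underscore keeps the helper working even if the label prefix itself contains underscores.
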

### Downstream Use

The model can be integrated into any pipeline that requires language identification for the supported languages.

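
One way such an integration might look is routing texts into per-language buckets before further processing. This is only a sketch: `group_by_language` and `toy_identify` are hypothetical names, and `toy_identify` is a stand-in that would be replaced by a function wrapping the model's prediction.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    texts: Iterable[str], identify: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group texts by the language code returned by `identify`."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        groups[identify(text)].append(text)
    return dict(groups)


def toy_identify(text: str) -> str:
    """Hypothetical stand-in for the model: ASCII-only text counts as English."""
    return "eng" if text.isascii() else "other"


routed = group_by_language(
    ["Hello, how are you?", "କିନ୍ତୁ ପୁଅ, ତୁମେ ଛୋଟ।"], toy_identify
)
print(routed)
```

Passing the identifier as a callable keeps the routing logic independent of how the model is loaded or batched.
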

### Out-of-Scope Use

The model may not work for languages other than English and the Indian languages it was trained on.

## Limitations

The model may not perform well on very low-resource languages such as Manipuri (in Meitei script), Sindhi, and Maithili.

## Training Details

### Training Data

[Train Data](https://huggingface.co/datasets/yash-ingle/ILID_Indian_Language_Identification_Dataset)

### Dev Data

[Dev Data](https://huggingface.co/datasets/yash-ingle/ILID_Indian_Language_Identification_Dataset)

### Training Procedure

Training consists of fine-tuning the MuRIL base cased model for 10 epochs.

#### Training Hyperparameters

- **Training regime:** fp32
- **Training Batch Size:** 32
- **Evaluation Batch Size:** 32
- **Learning Rate:** 2e-5 (0.00002)
- **Weight Decay:** 0.01
- **Epochs:** 10

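
The hyperparameters above could be expressed for the Transformers `Trainer` roughly as follows. This is a sketch rather than the authors' training script: it assumes the batch sizes are per-device values (training used a single GPU), and `HYPERPARAMS` and `build_training_args` are illustrative names.

```python
# Values taken from the hyperparameter list in this model card.
# Assumption: the listed batch sizes are per-device batch sizes.
HYPERPARAMS = {
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 32,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "num_train_epochs": 10,
    "fp16": False,  # fp32 training regime: no mixed precision
    "bf16": False,
}


def build_training_args(output_dir: str):
    """Create TrainingArguments from HYPERPARAMS (requires `transformers`)."""
    # Imported lazily so the settings can be inspected without the library.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)


print(HYPERPARAMS)
```

Calling `build_training_args("some_output_dir")` would then yield an object that can be passed to `Trainer`; the output directory name is a placeholder.
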

#### Size

The training corpus is ILID (Indian Language Identification Dataset), a dataset of size 250K.

## Evaluation

The model is evaluated on the corpora created for this work and on the Bhasha-Abhijnaanam benchmark.

### Testing Data, Factors & Metrics

#### Testing Data

[Test Data](https://huggingface.co/datasets/yash-ingle/ILID_Indian_Language_Identification_Dataset)

#### Metrics

F1-score

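
For readers unfamiliar with the metric, the sketch below computes a per-language F1-score and averages it across languages. The card does not state which averaging scheme was used, so macro-averaging here is an assumption; the function name `macro_f1` and the toy labels are illustrative only.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero here
        # because every label occurs in y_true or y_pred at least once.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


gold = ["eng", "hin", "hin", "guj"]
pred = ["eng", "hin", "guj", "guj"]
print(round(macro_f1(gold, pred), 3))  # 0.778
```

In the toy example, `eng` scores 1.0 while `hin` and `guj` each score 2/3, giving a mean of 7/9.
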

### Results

The model achieves an average F1-score of 0.96.

#### Summary

The model is a MuRIL-based model fine-tuned for the language identification task. It can identify English and all 22 official Indian languages.

### Compute Infrastructure

The model was trained on a single NVIDIA H100 GPU with 94 GB of RAM.

## Citation

**BibTeX:**

```
@misc{ingle2025ilidnativescriptlanguage,
      title={ILID: Native Script Language Identification for Indian Languages},
      author={Yash Ingle and Pruthwik Mishra},
      year={2025},
      eprint={2507.11832},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11832},
}
```