Model Details
Model Description
mt5-small-indic-gec is a multilingual grammatical error correction (GEC) model fine-tuned from mT5-small, trained on Hindi data and evaluated in a zero-shot setting on four additional Indic languages. It follows a text-to-text generation paradigm to produce grammatically corrected outputs and performs best on monolingual Indic text, though it can accept mixed-script inputs. The model was developed as part of the BHASHA workshop (INDICGEC-26).
- Developed by: Rucha Ambaliya, Mahika Dugar, Pruthwik Mishra
- License: MIT
- Finetuned from model: google/mt5-small
Model Sources
- Base Model: https://huggingface.co/google/mt5-small
- Repository: https://github.com/Rucha-Ambaliya/bhasha-workshop
- Paper / Report: https://openreview.net/attachment?id=vEHj9e66Zd&name=pdf
Uses
Direct Use
The model can be used for:
- Grammatical Error Correction (GEC)
- Text normalization for Indic languages
- Preprocessing noisy user-generated content
- Improving downstream NLP tasks such as translation, summarization, and classification
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load the fine-tuned GEC model and its tokenizer
model_name = "Rucha-Ambaliya/mt5-small-indic-gec"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Run on GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Example sentences in Bangla, Hindi, and Telugu
inputs = [
    "আমি স্কুলে যাই",
    "मैं स्कूल जाता है",
    "నేను పాఠశాలకు వెళ్ళి ఉంటాను"
]

# Tokenize the batch with padding and truncation to the model's max length
encoded = tokenizer(
    inputs,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128
).to(device)

# Generate corrections with beam search
outputs = model.generate(
    **encoded,
    max_length=128,
    num_beams=4
)

corrected_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for inp, out in zip(inputs, corrected_texts):
    print(f"Input: {inp}")
    print(f"Corrected: {out}\n")
Downstream Use
- Grammar-aware chatbots
- Educational tools for language learners
- Preprocessing pipelines for Indic NLP (see the sketch after this list)
- Assistive writing tools
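As a quick illustration of the preprocessing use case above, the sketch below wraps the model in the Transformers text2text-generation pipeline so noisy sentences can be corrected before reaching a downstream component. The helper name normalize_for_downstream and the generation settings are illustrative assumptions, not part of the released code.

from transformers import pipeline

# Load the GEC model behind a simple text2text-generation pipeline
gec = pipeline("text2text-generation", model="Rucha-Ambaliya/mt5-small-indic-gec")

def normalize_for_downstream(sentences, max_length=128):
    # Illustrative helper: correct a batch of noisy sentences before downstream use
    results = gec(sentences, max_length=max_length)
    return [r["generated_text"] for r in results]

# Example: clean noisy user-generated text before, e.g., translation or classification
cleaned = normalize_for_downstream(["मैं स्कूल जाता है"])
print(cleaned)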
Out-of-Scope Use
- Languages not seen during training
- Highly domain-specific or technical grammar
- Heavily code-mixed text beyond the trained distribution
Limitations
Performance may degrade on:
- Low-resource Indic languages
- Heavy code-mixing
- Informal slang or dialectal variations
Training Details
Training Data
- Parallel noisy–clean (incorrect–corrected) sentence pairs; training used Hindi data, with the remaining Indic languages evaluated zero-shot (see Evaluation)
- Dataset size and composition depend on the chosen corpus release
Dataset link: https://github.com/Rucha-Ambaliya/bhasha-workshop/tree/main/data
Training Procedure
The model was fine-tuned using a standard sequence-to-sequence objective where the input is an incorrect sentence and the target is its corrected version.
Training Hyperparameters
- Learning Rate: 2e-4
- Effective batch size: 8 (batch size 2 × gradient accumulation 4)
- Epochs: 21
- Max Sequence Length: 128
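For reference, below is a minimal fine-tuning sketch using the Hugging Face Seq2SeqTrainer with the hyperparameters listed above. The CSV loading step and the column names "incorrect" and "corrected" are assumptions about the data format, not a verbatim copy of the training script.

from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments
)
from datasets import load_dataset

base = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Assumed format: a CSV of parallel pairs with "incorrect" and "corrected" columns
dataset = load_dataset("csv", data_files={"train": "train.csv"})["train"]

def preprocess(batch):
    # Tokenize the noisy sentence as input and the corrected sentence as the target
    return tokenizer(
        batch["incorrect"],
        text_target=batch["corrected"],
        max_length=128,
        truncation=True,
    )

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-small-indic-gec",
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size 8
    num_train_epochs=21,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()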
Evaluation
Metrics
- google_bleu
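A minimal sketch of how google_bleu (GLEU) scores can be computed with the evaluate library is shown below; the example strings are illustrative placeholders for model outputs and gold corrections.

import evaluate

# GLEU ("google_bleu") compares model corrections against gold references
gleu = evaluate.load("google_bleu")

predictions = ["मैं स्कूल जाता हूँ"]     # model outputs
references = [["मैं स्कूल जाता हूँ"]]    # one or more gold corrections per sentence

print(gleu.compute(predictions=predictions, references=references)["google_bleu"])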
Results
- Hindi: 78.98
- Bangla (zero-shot): 81.83
- Malayalam (zero-shot): 89.77
- Tamil (zero-shot): 84.48
- Telugu (zero-shot): 85.03
Compute Infrastructure
- GPUs: 2 × NVIDIA T4
- Framework: Hugging Face Transformers
- Training Environment: Kaggle
Responsible Use
The model is intended for educational and assistive purposes. It should not be used as the sole decision-making system in high-stakes language evaluation scenarios.
Citation
If you use this model in your research, please cite:
@misc{ambaliya2026mt5indicgec,
  title={Niyamika at BHASHA Task 1: Word-Level Transliteration for English-Hindi Mixed Text in Grammar Correction Using MT5},
  author={Rucha Ambaliya and Mahika Dugar and Pruthwik Mishra},
  year={2026},
  howpublished={\url{https://openreview.net/attachment?id=vEHj9e66Zd&name=pdf}}
}