Text Normalization Model for Indic Languages
Overview
This model is fine-tuned for text normalization in Hindi. It converts non-standard entities—such as dates, currencies, and scientific units—into their fully normalized forms.
Model Details
- Model Architecture: T5-small
- Dataset: Augmented version of SPRINGLab/IndicVoices-R_Hindi, enriched with synthetic examples for dates, currencies, and units.
- Hyperparameters:
  - Learning rate: 2e-5
  - Epochs: 3
  - Per-device batch size: 2 (with gradient accumulation)
  - FP16 mixed-precision training enabled
- Training Environment: Trained on Google Colab with a GPU.
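The hyperparameters listed above can be gathered into a small configuration object, which also makes the effect of gradient accumulation explicit. This is only a sketch: the class name `FineTuneConfig` is hypothetical, and the accumulation step count of 8 is an assumption, since this card says gradient accumulation was used but does not give the number of steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the settings listed in this card."""

    base_model: str = "t5-small"
    learning_rate: float = 2e-5
    num_epochs: int = 3
    per_device_batch_size: int = 2
    fp16: bool = True
    # ASSUMPTION: the card does not state the accumulation step count.
    gradient_accumulation_steps: int = 8

    def effective_batch_size(self, num_devices: int = 1) -> int:
        """Number of examples that contribute to each optimizer update."""
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * num_devices
        )


config = FineTuneConfig()
print(config.effective_batch_size())
```

With a per-device batch size of 2, accumulation is what keeps each update from being computed on only two examples; the effective batch size scales linearly with the step count chosen.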
Usage
You can use the model with the transformers library:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")
# Example input
input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
inputs = tokenizer(input_text, return_tensors="pt", padding=True)
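# Generate normalized text. max_new_tokens is set explicitly because the
# default generation length is short and can truncate long sentences;
# 128 is an assumed value, adjust it to your inputs.
outputs = model.generate(**inputs, max_new_tokens=128)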
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
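To normalize many sentences, the single-sentence call above can be wrapped in a batching helper. The following is a minimal sketch, not part of the model's own API: the function names `batched` and `normalize_texts` and the defaults `batch_size=8` and `max_new_tokens=128` are assumptions chosen for illustration. The tokenizer and model are passed in as arguments, so the helper works with the objects loaded in the Usage snippet.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def normalize_texts(
    texts: Sequence[str],
    tokenizer,
    model,
    batch_size: int = 8,        # assumed default, tune to available memory
    max_new_tokens: int = 128,  # assumed default, tune to sentence length
) -> List[str]:
    """Run the model over `texts` in batches and return the decoded outputs."""
    results: List[str] = []
    for batch in batched(texts, batch_size):
        # Pad to the longest sentence in the batch so tensors are rectangular.
        inputs = tokenizer(
            list(batch), return_tensors="pt", padding=True, truncation=True
        )
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```

With `tokenizer` and `model` loaded as shown earlier, `normalize_texts([input_text], tokenizer, model)` returns a list containing one normalized string per input sentence, in the same order.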