Text Normalization Model for Indic Languages

Overview

This model is fine-tuned for text normalization in Hindi. It converts non-standard entities—such as dates, currencies, and scientific units—into their fully normalized forms.

Model Details

Model Architecture: T5-small
Dataset: Augmented version of SPRINGLab/IndicVoices-R_Hindi, enriched with synthetic examples for dates, currencies, and units.
Hyperparameters:
- Learning rate: 2e-5
- Epochs: 3
- Per-device batch size: 2 (with gradient accumulation)
- Mixed-precision training: FP16 enabled
Training Environment: Google Colab with a GPU.
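
As a rough illustration of how the hyperparameters above fit together, the sketch below collects them in a plain dictionary and computes the effective batch size. The card does not state the number of gradient accumulation steps, so the value 8 is a placeholder assumption, and the helper name effective_batch_size is hypothetical.

```python
# Values stated in the model card.
hyperparams = {
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "fp16": True,  # mixed precision; needs a CUDA GPU at training time
}

# ASSUMPTION: the card does not give the accumulation step count;
# 8 is a placeholder chosen only for illustration.
hyperparams["gradient_accumulation_steps"] = 8


def effective_batch_size(params, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return (
        params["per_device_train_batch_size"]
        * params["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(hyperparams))  # 16 with the placeholder value above
```

The dictionary keys match keyword names accepted by Seq2SeqTrainingArguments in the transformers library, so they could be passed as Seq2SeqTrainingArguments(output_dir="out", **hyperparams). Whether the author trained with that class is not stated in the card.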

Usage

You can use the model with the transformers library:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")

# Example input
input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
inputs = tokenizer(input_text, return_tensors="pt", padding=True)

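# Generate normalized text. generate() defaults to a short output length,
# so set max_new_tokens explicitly (128 is an illustrative value).
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))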
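
The snippet above normalizes a single sentence. For several sentences, one possible approach is to process them in fixed-size batches. The helper below is a minimal sketch, not part of the model's published API: the name normalize_batch and the default values for batch_size and max_new_tokens are assumptions, and it expects the tokenizer and model objects loaded above to be passed in.

```python
def normalize_batch(texts, tokenizer, model, batch_size=8, max_new_tokens=128):
    """Normalize a list of sentences in fixed-size batches.

    `tokenizer` and `model` are the objects loaded with AutoTokenizer and
    AutoModelForSeq2SeqLM. The default sizes are illustrative assumptions.
    """
    results = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Padding aligns sentences of different lengths within one batch.
        inputs = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # batch_decode converts every generated sequence back to text.
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results


# Example call (requires the tokenizer and model loaded earlier):
# normalized = normalize_batch(["15 / 03 / 1990", "$120"], tokenizer, model)
```

Batching keeps memory use bounded on a small GPU, since only batch_size sentences are tokenized and generated at a time.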
