Text Normalization Model for Indic Languages

Overview

This model is fine-tuned for text normalization in Hindi. It converts non-standard entities, such as dates, currencies, and scientific units, into their fully normalized forms.

Model Details

  • Model Architecture: T5-small
  • Dataset: Augmented version of SPRINGLab/IndicVoices-R_Hindi, enriched with synthetic examples for dates, currencies, and units.
  • Hyperparameters:
    • Learning rate: 2e-5
    • Epochs: 3
    • Per-device batch size: 2 (with gradient accumulation)
    • Mixed-precision training: FP16 enabled
  • Training Environment: Google Colab with a GPU.
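
As a rough illustration, the hyperparameters above could be collected into keyword arguments for `Seq2SeqTrainingArguments` from the transformers library. This is only a sketch: the card does not say that the Hugging Face Trainer was used, and it does not state the number of gradient accumulation steps, so the value `8` below is a placeholder assumption, as is the `output_dir` name in the closing comment.

```python
# Values stated in the model card.
LEARNING_RATE = 2e-5
NUM_EPOCHS = 3
PER_DEVICE_BATCH_SIZE = 2
USE_FP16 = True

# ASSUMPTION: the card mentions gradient accumulation but not the step count.
# 8 is a placeholder chosen only for illustration.
GRADIENT_ACCUMULATION_STEPS = 8

# Keyword arguments using the parameter names of Seq2SeqTrainingArguments.
training_kwargs = {
    "learning_rate": LEARNING_RATE,
    "num_train_epochs": NUM_EPOCHS,
    "per_device_train_batch_size": PER_DEVICE_BATCH_SIZE,
    "gradient_accumulation_steps": GRADIENT_ACCUMULATION_STEPS,
    "fp16": USE_FP16,
}


def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer step."""
    return per_device * accumulation_steps * num_devices


print(effective_batch_size(PER_DEVICE_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS))

# With transformers installed and a GPU available, these could be passed as:
#   from transformers import Seq2SeqTrainingArguments
#   args = Seq2SeqTrainingArguments(output_dir="t5-hindi-norm", **training_kwargs)  # hypothetical path
```

A small per-device batch with accumulation is a common way to fit training into the limited GPU memory of a Colab session while keeping a larger effective batch size.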

Usage

You can use the model with the transformers library:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")

# Example input
input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
inputs = tokenizer(input_text, return_tensors="pt", padding=True)

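# Generate normalized text. max_new_tokens=128 is an assumed cap, not a value
# from the model card; the default generation length may truncate long sentences.
outputs = model.generate(**inputs, max_new_tokens=128)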
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
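
To normalize several sentences at once, the same calls can be wrapped in a small helper. The sketch below is not part of the published model; `normalize_batch`, `batch_size`, and `max_new_tokens` are hypothetical names and defaults chosen for illustration. It expects the `tokenizer` and `model` objects loaded above.

```python
from typing import List


def normalize_batch(
    texts: List[str],
    tokenizer,
    model,
    batch_size: int = 8,        # assumed default, tune to available memory
    max_new_tokens: int = 128,  # assumed cap on generated length
) -> List[str]:
    """Normalize a list of sentences in fixed-size batches."""
    results: List[str] = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Pad to the longest sentence in the chunk so the batch forms one tensor.
        inputs = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results


# Usage, with tokenizer and model loaded as in the example above:
#   for line in normalize_batch(["500 mg का नमूना", "$120 में खरीदा"], tokenizer, model):
#       print(line)
```

Padding is needed here because sentences in a batch usually differ in length; for a single input, as in the example above, it has no effect.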