---
license: apache-2.0
---

# Text Normalization Model for Indic Languages

## Overview

This model is fine-tuned for text normalization in Hindi. It converts non-standard entities, such as dates, currencies, and scientific units, into their fully normalized forms.

## Model Details

- **Model Architecture:** `T5-small`
- **Dataset:** Augmented version of [`SPRINGLab/IndicVoices-R_Hindi`](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi), enriched with synthetic examples for dates, currencies, and units.
- **Hyperparameters:**
  - Learning rate: `2e-5`
  - Epochs: `3`
  - Per-device batch size: `2` (with gradient accumulation)
  - Mixed-precision training: FP16 enabled
- **Training Environment:** Trained on Google Colab with a GPU.

## Usage

You can use the model with the `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")

# Example input
input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
inputs = tokenizer(input_text, return_tensors="pt", padding=True)

# Generate normalized text
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
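
To normalize several sentences at once, the usage example can be wrapped in a small helper. The following is a minimal sketch, not part of the original model card: the function names `normalize_batch` and `load` and the `max_new_tokens=128` limit are illustrative assumptions. An explicit length limit is passed because `generate` may otherwise fall back to a short default maximum length, depending on the checkpoint's generation config, which could cut off longer normalized sentences.

```python
def normalize_batch(texts, tokenizer, model, max_new_tokens=128):
    """Normalize a list of strings in a single padded batch.

    `max_new_tokens=128` is an assumed limit, not a value from the model card.
    """
    # Pad to the longest sentence in the batch; truncate overly long inputs.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # Generate one output sequence per input sentence.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode every sequence, dropping padding and end-of-sequence tokens.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


def load(model_id="shubham-Bgs/Text-normalization-hindi"):
    """Load the tokenizer and model (downloads the checkpoint on first use)."""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


# Usage:
# tokenizer, model = load()
# sentences = ["15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"]
# for normalized in normalize_batch(sentences, tokenizer, model):
#     print(normalized)
```

Passing the tokenizer and model in as arguments keeps the helper independent of how they were loaded, so the same function works with a local copy of the checkpoint.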