---
license: apache-2.0
---

# Text Normalization Model for Indic Languages

## Overview

This model is fine-tuned for text normalization in Hindi. It converts non-standard entities, such as dates, currencies, and scientific units, into their fully normalized forms.

## Model Details

- **Model Architecture:** `T5-small`
- **Dataset:** An augmented version of [`SPRINGLab/IndicVoices-R_Hindi`](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi), enriched with synthetic examples for dates, currencies, and units.
- **Hyperparameters:**
  - Learning rate: `2e-5`
  - Epochs: `3`
  - Per-device batch size: `2` (with gradient accumulation)
  - Mixed-precision training: FP16 enabled
- **Training Environment:** Trained on Google Colab with a GPU.
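
The hyperparameters above can be collected into a small configuration object. The sketch below is illustrative only: the `TrainConfig` name is invented for this example, and the `gradient_accumulation_steps` value of `8` and the single-GPU setting are assumptions, since this card does not state them. It shows how a per-device batch size of `2` combines with gradient accumulation to give the effective batch size per optimizer step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in this card, plus two labeled assumptions."""

    learning_rate: float = 2e-5
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 2
    fp16: bool = True  # mixed-precision training
    # ASSUMPTION: the card does not state the accumulation steps; 8 is a placeholder.
    gradient_accumulation_steps: int = 8
    # ASSUMPTION: a single Colab GPU.
    num_devices: int = 1

    def effective_batch_size(self) -> int:
        """Number of examples that contribute to each optimizer update."""
        return (
            self.per_device_train_batch_size
            * self.gradient_accumulation_steps
            * self.num_devices
        )


config = TrainConfig()
print(config.effective_batch_size())  # 2 * 8 * 1 = 16 under the assumptions above
```

The fields `learning_rate`, `num_train_epochs`, `per_device_train_batch_size`, `gradient_accumulation_steps`, and `fp16` share their names with arguments of `transformers.Seq2SeqTrainingArguments`, so the values could be passed through if that class is used for training. Whether the original training run used it is not stated here.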

## Usage

You can use the model with the `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")

# Example input containing a date, a currency amount, and a unit
input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
inputs = tokenizer(input_text, return_tensors="pt", padding=True)

# Generate normalized text.
# max_new_tokens=128 is an assumed value, not from the original card: the
# library's default length limit can truncate full sentences, so raise it as needed.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))