---
license: apache-2.0
---
# Text Normalization Model for Indic Languages
## Overview
This model is fine-tuned for text normalization in Hindi. It rewrites non-standard entities, such as dates, currency amounts, and scientific units, in their fully normalized forms.
## Model Details
- **Model Architecture:** `T5-small`
- **Dataset:** Augmented version of [`SPRINGLab/IndicVoices-R_Hindi`](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi), enriched with synthetic examples for dates, currencies, and units.
- **Hyperparameters:**
  - Learning rate: `2e-5`
  - Epochs: `3`
  - Per-device batch size: `2` (with gradient accumulation)
  - Mixed-precision training: FP16 enabled
- **Training Environment:** Trained on Google Colab with a GPU.
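
As a rough illustration of how the hyperparameters above fit together, the sketch below collects them in a plain dictionary and computes the effective batch size. The number of gradient accumulation steps is not stated in this card, so the value `4` is purely an assumption, and `effective_batch_size` is a hypothetical helper, not part of the original training script.

```python
# Hyperparameters stated in this model card.
training_kwargs = {
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "fp16": True,  # FP16 mixed precision (requires a GPU)
    # ASSUMPTION: the card only says gradient accumulation was used;
    # the actual number of steps is not given. 4 is a placeholder.
    "gradient_accumulation_steps": 4,
}


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Examples contributing to each optimizer step (hypothetical helper)."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_kwargs))
```

The dictionary keys follow the parameter names of `Seq2SeqTrainingArguments` in `transformers`, so one plausible way to reuse them would be `Seq2SeqTrainingArguments(output_dir="out", **training_kwargs)`, where `output_dir="out"` is a placeholder. Whether the original run used that class is not confirmed here.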
## Usage
You can use the model with the `transformers` library:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_id = "shubham-Bgs/Text-normalization-hindi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Example input containing a date, a currency amount, and a unit
input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
inputs = tokenizer(input_text, return_tensors="pt", padding=True)

# Generate normalized text. max_new_tokens=128 is an assumed limit, set
# because the default generation length can truncate longer sentences.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```