---
license: apache-2.0
---

# Text Normalization Model for Indic Languages

## Overview

This model is fine-tuned for text normalization in Hindi. It converts non-standard entities, such as dates, currencies, and scientific units, into their fully normalized forms.

## Model Details

- **Model Architecture:** `T5-small`
- **Dataset:** Augmented version of [`SPRINGLab/IndicVoices-R_Hindi`](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi), enriched with synthetic examples for dates, currencies, and units.
- **Hyperparameters:**
  - Learning rate: `2e-5`
  - Epochs: `3`
  - Per-device batch size: `2` (with gradient accumulation)
  - Mixed-precision (FP16) training: enabled
- **Training Environment:** Trained on Google Colab with a GPU.
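
The hyperparameters above can be collected into a small configuration object, which also makes the effect of gradient accumulation explicit. The following is a minimal sketch, not the author's training script: `FineTuneConfig` is a hypothetical helper, and the accumulation step count of `8` is an assumed placeholder, since this card does not state the value that was used.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters listed in this card, plus one assumed value."""

    learning_rate: float = 2e-5
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 2
    fp16: bool = True
    # ASSUMPTION: the card does not give the accumulation step count;
    # 8 is a placeholder for illustration only.
    gradient_accumulation_steps: int = 8

    def effective_batch_size(self, num_devices: int = 1) -> int:
        """Number of examples that contribute to each optimizer step."""
        return (
            self.per_device_train_batch_size
            * self.gradient_accumulation_steps
            * num_devices
        )


config = FineTuneConfig()
training_kwargs = asdict(config)  # plain dict of the settings above

print(training_kwargs)
print("Effective batch size:", config.effective_batch_size())
```

The field names match keyword arguments of `Seq2SeqTrainingArguments` in `transformers`, so the dictionary could be unpacked into it together with an `output_dir`. With a per-device batch size of 2, the accumulation step count is what determines how many examples each weight update sees.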

## Usage

You can use the model with the `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")

# Example input
input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
inputs = tokenizer(input_text, return_tensors="pt", padding=True)

# Generate normalized text
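# Set an explicit output-length cap; the library default may truncate longer
# sentences. 128 is an illustrative value, adjust it to your inputs.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```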