shubham-Bgs/Text-normalization-hindi

shubham-Bgs committed on Feb 9, 2025

Commit 998e35d (verified) · 1 parent: 56a6ded

Update README.md

Files changed (1)

README.md +30 -13

@@ -1,30 +1,47 @@
 ---
 license: apache-2.0
 ---
-Text Normalization Model for Indic Languages
-Overview: This model is fine-tuned for text normalization in Hindi. It converts non-standard entities—such as dates (e.g., "15/03/1990"), currencies (e.g., "$120"), and scientific units (e.g., "500 mg")—into their fully normalized forms.
-Training Details
-Model Architecture: T5-small
-Dataset: An augmented version of SPRINGLab/IndicVoices-R_Hindi, further enriched with synthetic examples for dates, currencies, and units.
-Hyperparameters:
-Learning rate: 2e-5;
-Epochs: 3;
-Per-device batch size: 2 (with gradient accumulation);
-FP16 enabled for mixed precision training;
-Environment: Trained on Google Colab with a GPU.
 from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
-# Load model and tokenizer from Hugging Face Hub
 tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
 model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")
 input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
 inputs = tokenizer(input_text, return_tensors="pt", padding=True)
 outputs = model.generate(**inputs)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))

 ---
 license: apache-2.0
 ---
+# Text Normalization Model for Indic Languages
+## Overview
+This model is fine-tuned for text normalization in Hindi. It converts non-standard entities—such as dates, currencies, and scientific units—into their fully normalized forms.
+### Examples:
+| Input | Normalized Output |
+|--------|------------------|
+| `"15/03/1990"` | `"15 मार्च 1990"` |
+| `"$120"` | `"120 डॉलर"` |
+| `"500 mg"` | `"500 मिलीग्राम"` |
+## Model Details
+- **Model Architecture:** `T5-small`
+- **Dataset:** Augmented version of [`SPRINGLab/IndicVoices-R_Hindi`](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi), enriched with synthetic examples for dates, currencies, and units.
+- **Hyperparameters:**
+  - Learning rate: `2e-5`
+  - Epochs: `3`
+  - Per-device batch size: `2` (with gradient accumulation)
+  - FP16 enabled for mixed-precision training
+- **Training Environment:** Trained on Google Colab with a GPU.
+## Usage
+You can use the model with the `transformers` library:
+```python
 from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+# Load model and tokenizer
 tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
 model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")
+# Example input
 input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
 inputs = tokenizer(input_text, return_tensors="pt", padding=True)
+# Generate normalized text
 outputs = model.generate(**inputs)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
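
Note on the usage snippet: it normalizes a single sentence and calls `generate()` with its default settings. As a minimal sketch, assuming several sentences need to be normalized at once, the same calls can be wrapped in a batch helper. The names `normalize_batch` and `load_model` and the `max_new_tokens=64` limit are illustrative assumptions, not part of the README or the commit.

```python
from typing import Any, List


def normalize_batch(texts: List[str], tokenizer: Any, model: Any,
                    max_new_tokens: int = 64) -> List[str]:
    """Normalize several sentences with one generate() call.

    Hypothetical helper: `max_new_tokens=64` is an assumed limit,
    not a value taken from the model card.
    """
    if not texts:
        return []
    # padding=True pads every sentence to the longest one in the batch
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # batch_decode turns each generated id sequence back into text
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


def load_model(repo_id: str = "shubham-Bgs/Text-normalization-hindi"):
    """Load the tokenizer and model as the README does (downloads from the Hub)."""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
    return tokenizer, model
```

Typical use would be `tokenizer, model = load_model()` followed by `normalize_batch(["15/03/1990", "$120"], tokenizer, model)`. Passing the tokenizer and model in as arguments keeps the helper independent of how they were loaded.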