shubham-Bgs/Text-normalization-hindi


shubham-Bgs committed on Feb 9, 2025

Commit 400f446 · verified · 1 Parent(s): 5342a97

Update README.md

Files changed (1)

README.md +28 -3


@@ -1,3 +1,28 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+---
+Text Normalization Model for Indic Languages
+Overview: This model is fine-tuned for text normalization in Hindi. It converts non-standard entities—such as dates (e.g., "15/03/1990"), currencies (e.g., "$120"), and scientific units (e.g., "500 mg")—into their fully normalized forms.
+Training Details
+Model Architecture: T5-small
+Dataset: An augmented version of SPRINGLab/IndicVoices-R_Hindi, further enriched with synthetic examples for dates, currencies, and units.
+Hyperparameters:
+Learning rate: 2e-5
+Epochs: 3
+Per-device batch size: 2 (with gradient accumulation)
+FP16 enabled for mixed precision training
+Environment: Trained on Google Colab with a GPU.
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+# Load model and tokenizer from Hugging Face Hub
+tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
+model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")
+input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
+inputs = tokenizer(input_text, return_tensors="pt", padding=True)
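+outputs = model.generate(**inputs, max_new_tokens=128)  # 128 is an arbitrary cap; without one, generation may stop at a short default length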
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
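
The hyperparameters listed in the README can be collected into a small configuration object to make the effect of gradient accumulation concrete. The following is a minimal sketch, not the author's training script: `FineTuneConfig` is a hypothetical name, the field names only mirror the like-named arguments of `transformers.TrainingArguments`, and the accumulation step count (8) and device count (1) are assumptions because the README does not state them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the hyperparameters listed in the README."""

    learning_rate: float = 2e-5           # from the README
    num_train_epochs: int = 3             # from the README
    per_device_train_batch_size: int = 2  # from the README
    fp16: bool = True                     # from the README (mixed precision)
    # ASSUMPTION: the README says "with gradient accumulation" but gives no
    # step count; 8 is a placeholder chosen only for illustration.
    gradient_accumulation_steps: int = 8
    # ASSUMPTION: a single GPU, as is typical for a Colab session.
    num_devices: int = 1

    def effective_batch_size(self) -> int:
        """Examples that contribute to one optimizer update."""
        return (
            self.per_device_train_batch_size
            * self.gradient_accumulation_steps
            * self.num_devices
        )

    def approx_optimizer_steps(self, num_examples: int) -> int:
        """Rough count of optimizer updates, rounding each epoch up."""
        steps_per_epoch = math.ceil(num_examples / self.effective_batch_size())
        return steps_per_epoch * self.num_train_epochs


config = FineTuneConfig()
print(config.effective_batch_size())
print(config.approx_optimizer_steps(1000))
```

Under these assumed values, a per-device batch size of 2 behaves like a batch of 16 for each optimizer update, which is the usual reason to pair a small batch with gradient accumulation on a memory-limited GPU. The step count is an approximation; a real trainer may round partial batches differently.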