	---
	license: apache-2.0
	---

	# Text Normalization Model for Indic Languages

	## Overview

	This is a `T5-small` model fine-tuned for text normalization in Hindi. It converts non-standard entities, such as dates, currencies, and scientific units, into their fully normalized forms.

	## Model Details

	- Model Architecture: `T5-small`
	- Dataset: Augmented version of [`SPRINGLab/IndicVoices-R_Hindi`](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi), enriched with synthetic examples for dates, currencies, and units.
	- Hyperparameters:
	  - Learning rate: `2e-5`
	  - Epochs: `3`
	  - Per-device batch size: `2` (with gradient accumulation)
	  - Mixed-precision training: FP16 enabled
	- Training Environment: Trained on Google Colab with a GPU.
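
The hyperparameters listed above could be collected roughly as in the sketch below. This is not the original training script: the card does not state the number of gradient accumulation steps or an output directory, so `grad_accum_steps=8` and `output_dir` are placeholder assumptions, and the helper names (`build_training_kwargs`, `effective_batch_size`, `make_training_args`) are hypothetical.

```python
# Values stated in the model card.
LEARNING_RATE = 2e-5
NUM_EPOCHS = 3
PER_DEVICE_BATCH_SIZE = 2


def build_training_kwargs(grad_accum_steps=8, output_dir="t5-small-hindi-normalization"):
    """Collect the card's hyperparameters as keyword arguments.

    grad_accum_steps and output_dir are assumptions: the card says gradient
    accumulation was used but gives neither a step count nor a directory.
    """
    return {
        "output_dir": output_dir,
        "learning_rate": LEARNING_RATE,
        "num_train_epochs": NUM_EPOCHS,
        "per_device_train_batch_size": PER_DEVICE_BATCH_SIZE,
        "gradient_accumulation_steps": grad_accum_steps,
        "fp16": True,  # FP16 mixed precision; requires a CUDA GPU at training time
    }


def effective_batch_size(kwargs, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def make_training_args(**overrides):
    """Build Seq2SeqTrainingArguments from the values above.

    transformers is imported lazily so the helpers above run without it.
    """
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**{**build_training_kwargs(), **overrides})


kwargs = build_training_kwargs()
print(effective_batch_size(kwargs))  # 2 * 8 * 1 = 16 under the assumed step count
```

With a per-device batch size of `2`, gradient accumulation is what brings the effective batch size up to a practical value on a single Colab GPU; changing `grad_accum_steps` scales it linearly.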

	## Usage

	You can use the model with the `transformers` library:

	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	# Load model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
	model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")

	# Example input
	input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
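	inputs = tokenizer(input_text, return_tensors="pt")

	# Generate normalized text. max_new_tokens=128 is an assumed limit; without an
	# explicit limit, the default generation length may cut the output short.
	outputs = model.generate(**inputs, max_new_tokens=128)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```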