shubham-Bgs committed on
Commit 400f446 · verified · 1 Parent(s): 5342a97

Update README.md

  1. README.md +28 -3
README.md CHANGED
@@ -1,3 +1,28 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+ Text Normalization Model for Indic Languages
+
+ Overview: This model is fine-tuned for text normalization in Hindi. It converts non-standard entities, such as dates (e.g., "15/03/1990"), currencies (e.g., "$120"), and scientific units (e.g., "500 mg"), into their fully normalized forms.
+
+ Training Details
+ Model Architecture: T5-small
+ Dataset: An augmented version of SPRINGLab/IndicVoices-R_Hindi, further enriched with synthetic examples for dates, currencies, and units.
+ Hyperparameters:
+ Learning rate: 2e-5
+ Epochs: 3
+ Per-device batch size: 2 (with gradient accumulation)
+ FP16 enabled for mixed precision training
+ Environment: Trained on Google Colab with a GPU.
+
+
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ # Load model and tokenizer from Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
+ model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")
+
+ input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
+ inputs = tokenizer(input_text, return_tensors="pt", padding=True)
+ outputs = model.generate(**inputs)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
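
The usage snippet in this README normalizes one sentence with default generation settings. The sketch below shows one possible way to normalize several sentences at once and to set an explicit output-length budget, since normalized text is often longer than its input and default generation limits may cut it short. It is not part of the commit. The helper names `load_model` and `normalize_texts` and the `max_new_tokens=128` value are assumptions chosen for illustration, not settings documented by the model author.

```python
from typing import List


def load_model(model_id: str = "shubham-Bgs/Text-normalization-hindi"):
    """Load the tokenizer and model from the Hugging Face Hub."""
    # Imported lazily so normalize_texts can be reused without transformers loaded.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def normalize_texts(
    texts: List[str],
    tokenizer,
    model,
    max_new_tokens: int = 128,  # assumed budget, not taken from the README
) -> List[str]:
    """Normalize a batch of sentences with a seq2seq model (hypothetical helper)."""
    # Pad to the longest sentence in the batch and truncate over-long inputs.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # An explicit max_new_tokens avoids relying on the default length limit.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode every sequence in the batch, dropping padding and end-of-sequence tokens.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

A typical call would be `tokenizer, model = load_model()` followed by `normalize_texts(["15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"], tokenizer, model)`, which returns one normalized string per input sentence.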