shubham-Bgs committed on
Commit 998e35d · verified · 1 Parent(s): 56a6ded

Update README.md

Files changed (1)
  1. README.md +30 -13
README.md CHANGED
@@ -1,30 +1,47 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
- Text Normalization Model for Indic Languages
5
 
6
- Overview: This model is fine-tuned for text normalization in Hindi. It converts non-standard entities—such as dates (e.g., "15/03/1990"), currencies (e.g., "$120"), and scientific units (e.g., "500 mg")—into their fully normalized forms.
7
 
8
- Training Details
9
- Model Architecture: T5-small
10
 
11
- Dataset: An augmented version of SPRINGLab/IndicVoices-R_Hindi, further enriched with synthetic examples for dates, currencies, and units.
12
 
13
- Hyperparameters:
14
- Learning rate: 2e-5;
15
- Epochs: 3;
16
- Per-device batch size: 2 (with gradient accumulation);
17
- FP16 enabled for mixed precision training;
18
- Environment: Trained on Google Colab with a GPU.
19
 
20
 
21
  from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
22
 
23
- # Load model and tokenizer from Hugging Face Hub
24
  tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
25
  model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")
26
 
 
27
  input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
28
  inputs = tokenizer(input_text, return_tensors="pt", padding=True)
 
 
29
  outputs = model.generate(**inputs)
30
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 
1
  ---
2
  license: apache-2.0
3
  ---
 
4
 
5
+ # Text Normalization Model for Indic Languages
6
 
7
+ ## Overview
 
8
 
9
+ This model is fine-tuned for text normalization in Hindi. It converts non-standard entities—such as dates, currencies, and scientific units—into their fully normalized forms.
10
 
11
+ ### Examples:
 
 
 
 
 
12
 
13
+ | Input | Normalized Output |
14
+ |--------|------------------|
15
+ | `"15/03/1990"` | `"15 मार्च 1990"` |
16
+ | `"$120"` | `"120 डॉलर"` |
17
+ | `"500 mg"` | `"500 मिलीग्राम"` |
18
 
19
+ ## Model Details
20
+
21
+ - **Model Architecture:** `T5-small`
22
+ - **Dataset:** Augmented version of [`SPRINGLab/IndicVoices-R_Hindi`](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi), enriched with synthetic examples for dates, currencies, and units.
23
+ - **Hyperparameters:**
24
+ - Learning rate: `2e-5`
25
+ - Epochs: `3`
26
+ - Per-device batch size: `2` (with gradient accumulation)
27
+ - FP16 mixed-precision training enabled
28
+ - **Training Environment:** Trained on Google Colab with a GPU.
29
+
30
+ ## Usage
31
+
32
+ You can use the model with the `transformers` library:
33
+
34
+ ```python
35
  from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
36
 
37
+ # Load model and tokenizer
38
  tokenizer = AutoTokenizer.from_pretrained("shubham-Bgs/Text-normalization-hindi")
39
  model = AutoModelForSeq2SeqLM.from_pretrained("shubham-Bgs/Text-normalization-hindi")
40
 
41
+ # Example input
42
  input_text = "15 / 03 / 1990 को, वैज्ञानिक ने $120 में 500 mg का नमूना खरीदा।"
43
  inputs = tokenizer(input_text, return_tensors="pt", padding=True)
44
+
45
+ # Generate normalized text
46
  outputs = model.generate(**inputs)
47
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
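
A note on the hyperparameters in the updated README: it lists a per-device batch size of `2` "with gradient accumulation" but does not state the number of accumulation steps. As a rough sketch of how the two combine, the snippet below computes the effective batch size per optimizer step. The function name and the accumulation-step values are hypothetical, chosen only for illustration; they are not taken from the actual training run.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer step."""
    values = (per_device_batch_size, gradient_accumulation_steps, num_devices)
    if any(v < 1 for v in values):
        raise ValueError("all arguments must be positive integers")
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Per-device batch size as stated in the README.
PER_DEVICE_BATCH_SIZE = 2

# Hypothetical accumulation-step values: the README does not state the real one.
for steps in (1, 4, 8, 16):
    size = effective_batch_size(PER_DEVICE_BATCH_SIZE, steps)
    print(f"accumulation steps = {steps:2d} -> effective batch size = {size}")
```

If the accumulation count used for training is known, stating it next to the batch size in the README would make the setup easier to reproduce.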