Akaash1 committed
Commit db3cc54 Β· 1 Parent(s): 80bee52

Upload Khmer ITN model

Files changed (5)
  1. README.md +142 -0
  2. config.json +32 -0
  3. generation_config.json +6 -0
  4. model.safetensors +3 -0
  5. training_args.bin +3 -0
README.md ADDED
@@ -0,0 +1,142 @@
+ ---
+ language:
+ - km
+ license: apache-2.0
+ tags:
+ - text2text-generation
+ - mt5
+ - khmer
+ - inverse-text-normalization
+ - number-normalization
+ datasets:
+ - custom
+ metrics:
+ - exact_match
+ library_name: transformers
+ pipeline_tag: text2text-generation
+ ---
+
+ # Khmer Inverse Text Normalization (ITN) Model
+
+ This model converts Khmer number words to digits using a fine-tuned mT5-small model.
+
+ ## Model Description
+
+ - **Model**: mT5-small (fine-tuned)
+ - **Language**: Khmer (αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžš)
+ - **Task**: Inverse Text Normalization (ITN)
+ - **Training Data**: 121,097 Khmer text samples with number normalization
+
+ ## Usage
+
+ ### Quick Start
+
+ ```python
+ from transformers import MT5ForConditionalGeneration, MT5Tokenizer
+
+ # Load model and tokenizer
+ model_name = "Akaash1/NLP_mt5"
+ tokenizer = MT5Tokenizer.from_pretrained(model_name)
+ model = MT5ForConditionalGeneration.from_pretrained(model_name)
+
+ # Normalize Khmer number words
+ text = "αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ†"
+ inputs = tokenizer(text, return_tensors="pt")
+ outputs = model.generate(**inputs, num_beams=4, max_length=256)
+ result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ print(result)  # Output: αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ†
+ ```
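+
+ The same checkpoint can also be used through the high-level `pipeline` API. This is a minimal sketch; `num_beams` and `max_length` are passed at call time because the shipped `generation_config.json` does not set them.
+
+ ```python
+ from transformers import pipeline
+
+ # Text-to-text generation pipeline around the same checkpoint
+ itn_pipe = pipeline("text2text-generation", model="Akaash1/NLP_mt5")
+
+ outputs = itn_pipe("αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ†", num_beams=4, max_length=256)
+ print(outputs[0]["generated_text"])  # Expected: αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ†
+ ```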
+
+ ### Advanced Usage with Custom Class
+
+ ```python
+ import torch
+ from transformers import MT5ForConditionalGeneration, MT5Tokenizer
+
+ class KhmerITN:
+     def __init__(self, model_name="Akaash1/NLP_mt5"):
+         self.tokenizer = MT5Tokenizer.from_pretrained(model_name)
+         self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
+         self.device = "cuda" if torch.cuda.is_available() else "cpu"
+         self.model.to(self.device)
+         self.model.eval()
+
+     def normalize(self, text, num_beams=4):
+         inputs = self.tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
+         inputs = {k: v.to(self.device) for k, v in inputs.items()}
+
+         with torch.no_grad():
+             outputs = self.model.generate(**inputs, num_beams=num_beams, max_length=256)
+
+         return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ # Use it
+ itn = KhmerITN()
+ result = itn.normalize("αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ")
+ print(result)  # Output: αž†αŸ’αž“αžΆαŸ† 2013
+ ```
+
+ ## Examples
+
+ | Input (Khmer words) | Output (with digits) |
+ |---------------------|----------------------|
+ | αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ† | αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ† |
+ | αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ | αž†αŸ’αž“αžΆαŸ† 2013 |
+ | តអរអ αžœαŸαž™ αžŸαžΆαž˜αžŸαž·αž” αž”αž½αž“ αž†αŸ’αž“αžΆαŸ† | តអរអ αžœαŸαž™ 34 αž†αŸ’αž“αžΆαŸ† |
+ | αž˜αžΆαž“ αžŸαžšαž»αž” αž˜αŸ’αž—αŸƒ αž˜αž½αž™ αž“αžΆαž€αŸ‹ | αž˜αžΆαž“ αžŸαžšαž»αž” 21 αž“αžΆαž€αŸ‹ |
+ | αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› αžŠαž”αŸ‹ αž†αŸ’αž“αžΆαŸ† | αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› 10 αž†αŸ’αž“αžΆαŸ† |
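+
+ The metric listed for this model is `exact_match`. Below is a minimal sketch of a sentence-level exact-match spot check; the reference pairs are copied from the table above and the generation settings mirror the Quick Start example.
+
+ ```python
+ import torch
+ from transformers import MT5ForConditionalGeneration, MT5Tokenizer
+
+ tokenizer = MT5Tokenizer.from_pretrained("Akaash1/NLP_mt5")
+ model = MT5ForConditionalGeneration.from_pretrained("Akaash1/NLP_mt5").eval()
+
+ # Reference pairs taken from the examples table
+ pairs = [
+     ("αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ†", "αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ†"),
+     ("αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› αžŠαž”αŸ‹ αž†αŸ’αž“αžΆαŸ†", "αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› 10 αž†αŸ’αž“αžΆαŸ†"),
+ ]
+
+ correct = 0
+ for source, reference in pairs:
+     inputs = tokenizer(source, return_tensors="pt")
+     with torch.no_grad():
+         output_ids = model.generate(**inputs, num_beams=4, max_length=256)
+     prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+     correct += int(prediction == reference)
+
+ print(f"Sentence-level exact match: {correct}/{len(pairs)}")
+ ```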
+
+ ## Training Details
+
+ ### Training Data
+
+ - **Size**: 121,097 text pairs
+ - **Source**: Khmer text corpus with number words
+ - **Split**: 95% train, 5% validation
+
+ ### Training Procedure
+
+ - **Base Model**: google/mt5-small
+ - **Epochs**: 5
+ - **Batch Size**: 8 (per device) Γ— 4 (gradient accumulation) = 32 effective
+ - **Learning Rate**: 5e-4
+ - **Optimizer**: AdamW
+ - **Max Sequence Length**: 256
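+
+ For reference, the hyperparameters above correspond roughly to the following `Seq2SeqTrainingArguments`. This is a reconstruction, not the original training script; dataset loading, tokenization, and the `Seq2SeqTrainer` call are omitted, and the output directory is a placeholder.
+
+ ```python
+ from transformers import Seq2SeqTrainingArguments
+
+ # Sketch of the training configuration described above (not the original script)
+ training_args = Seq2SeqTrainingArguments(
+     output_dir="khmer-itn-mt5",        # placeholder path
+     num_train_epochs=5,
+     per_device_train_batch_size=8,
+     gradient_accumulation_steps=4,     # 8 x 4 = 32 effective batch size
+     learning_rate=5e-4,
+     optim="adamw_torch",               # AdamW
+     predict_with_generate=True,
+     generation_max_length=256,
+ )
+ ```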
+
+ ### Supported Number Types
+
+ The model can convert various Khmer number expressions:
+
+ - **Units**: αžŸαžΌαž“αŸ’αž™ (0), αž˜αž½αž™ (1), αž–αžΈαžš (2), αž”αžΈ (3), αž”αž½αž“ (4), αž”αŸ’αžšαžΆαŸ† (5), etc.
+ - **Tens**: αžŠαž”αŸ‹ (10), αž˜αŸ’αž—αŸƒ (20), αžŸαžΆαž˜αžŸαž·αž” (30), etc.
+ - **Hundreds**: αžšαž™ (100)
+ - **Thousands**: αž–αžΆαž“αŸ‹ (1,000), αž˜αŸ‰αžΊαž“ (10,000), αžŸαŸ‚αž“ (100,000)
+ - **Large numbers**: αž›αžΆαž“ (1,000,000), αž€αŸ„αžŠαž· (10,000,000)
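+
+ For intuition about how these words compose, here is a small rule-based converter covering only the vocabulary listed above. It is purely illustrative (the mT5 model learns the mapping end to end, and this sketch ignores nested multipliers): units and tens add to a running value, while multipliers scale it, e.g. αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ β†’ 18 and αž–αžΈαžš αž–αžΆαž“αŸ‹ β†’ 2000.
+
+ ```python
+ # Illustrative rule-based composition of Khmer number words (not the model's method)
+ UNITS = {"αžŸαžΌαž“αŸ’αž™": 0, "αž˜αž½αž™": 1, "αž–αžΈαžš": 2, "αž”αžΈ": 3, "αž”αž½αž“": 4, "αž”αŸ’αžšαžΆαŸ†": 5, "αž”αŸ’αžšαžΆαŸ†αž”αžΈ": 8}
+ TENS = {"αžŠαž”αŸ‹": 10, "αž˜αŸ’αž—αŸƒ": 20, "αžŸαžΆαž˜αžŸαž·αž”": 30}
+ MULTIPLIERS = {"αžšαž™": 100, "αž–αžΆαž“αŸ‹": 1_000, "αž˜αŸ‰αžΊαž“": 10_000, "αžŸαŸ‚αž“": 100_000, "αž›αžΆαž“": 1_000_000, "αž€αŸ„αžŠαž·": 10_000_000}
+
+ def words_to_number(tokens):
+     """Units and tens add to a running value; a multiplier scales whatever has accumulated."""
+     total, current = 0, 0
+     for token in tokens:
+         if token in UNITS:
+             current += UNITS[token]
+         elif token in TENS:
+             current += TENS[token]
+         elif token in MULTIPLIERS:
+             total += max(current, 1) * MULTIPLIERS[token]
+             current = 0
+     return total + current
+
+ print(words_to_number(["αžŠαž”αŸ‹", "αž”αŸ’αžšαžΆαŸ†αž”αžΈ"]))     # 18
+ print(words_to_number(["αž–αžΈαžš", "αž–αžΆαž“αŸ‹", "αžŠαž”αŸ‹"]))  # 2010
+ ```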
+
+ ## Limitations
+
+ - Input text should be pre-segmented into space-separated Khmer tokens.
+ - The model was trained on specific number-word patterns and may not generalize to unseen phrasings.
+ - Some idiomatic expressions are preserved rather than converted (e.g., "αž˜αž½αž™ αžšαž™αŸˆ", meaning "a while").
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{khmer-itn-mt5,
+   title={Khmer Inverse Text Normalization using mT5},
+   author={Your Name},
+   year={2024},
+   url={https://huggingface.co/Akaash1/NLP_mt5}
+ }
+ ```
+
+ ## Model Card Authors
+
+ [Your Name]
+
+ ## Contact
+
+ For questions or feedback, please open an issue on the model repository.
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "google/mt5-small",
+   "architectures": [
+     "MT5ForConditionalGeneration"
+   ],
+   "classifier_dropout": 0.0,
+   "d_ff": 1024,
+   "d_kv": 64,
+   "d_model": 512,
+   "decoder_start_token_id": 0,
+   "dense_act_fn": "gelu_new",
+   "dropout_rate": 0.1,
+   "eos_token_id": 1,
+   "feed_forward_proj": "gated-gelu",
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "is_gated_act": true,
+   "layer_norm_epsilon": 1e-06,
+   "model_type": "mt5",
+   "num_decoder_layers": 8,
+   "num_heads": 6,
+   "num_layers": 8,
+   "pad_token_id": 0,
+   "relative_attention_max_distance": 128,
+   "relative_attention_num_buckets": 32,
+   "tie_word_embeddings": false,
+   "tokenizer_class": "T5Tokenizer",
+   "torch_dtype": "float32",
+   "transformers_version": "4.35.2",
+   "use_cache": true,
+   "vocab_size": 250112
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "decoder_start_token_id": 0,
+   "eos_token_id": 1,
+   "pad_token_id": 0,
+   "transformers_version": "4.35.2"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a5ef3e883ec8a1ccd09539b68110b6db9dab93b853156d1112f19185fa366123
+ size 1200729512
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:827afca5242cf3c7c6c352131ea97d970bfcb185b39dd2b8240a3956691746ce
+ size 4728