Akaash1 committed
Commit db3cc54 Β· 1 Parent(s): 80bee52

Upload Khmer ITN model

Files changed (5)
  1. README.md +142 -0
  2. config.json +32 -0
  3. generation_config.json +6 -0
  4. model.safetensors +3 -0
  5. training_args.bin +3 -0
README.md ADDED
@@ -0,0 +1,142 @@
+ ---
+ language:
+ - km
+ license: apache-2.0
+ tags:
+ - text2text-generation
+ - mt5
+ - khmer
+ - inverse-text-normalization
+ - number-normalization
+ datasets:
+ - custom
+ metrics:
+ - exact_match
+ library_name: transformers
+ pipeline_tag: text2text-generation
+ ---
+
+ # Khmer Inverse Text Normalization (ITN) Model
+
+ This model converts Khmer number words to digits using a fine-tuned mT5-small model.
+
+ ## Model Description
+
+ - **Model**: mT5-small (fine-tuned)
+ - **Language**: Khmer (αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžš)
+ - **Task**: Inverse Text Normalization (ITN)
+ - **Training Data**: 121,097 Khmer text samples with number normalization
+
+ ## Usage
+
+ ### Quick Start
+
+ ```python
+ from transformers import MT5ForConditionalGeneration, MT5Tokenizer
+
+ # Load model and tokenizer
+ model_name = "Akaash1/NLP_mt5"
+ tokenizer = MT5Tokenizer.from_pretrained(model_name)
+ model = MT5ForConditionalGeneration.from_pretrained(model_name)
+
+ # Normalize Khmer number words
+ text = "αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ†"
+ inputs = tokenizer(text, return_tensors="pt")
+ outputs = model.generate(**inputs, num_beams=4, max_length=256)
+ result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ print(result)  # Output: αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ†
+ ```
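+
+ The same checkpoint can also be used through the high-level `pipeline` API. This is a minimal sketch; `num_beams` and `max_length` are passed at call time because the shipped `generation_config.json` does not set them.
+
+ ```python
+ from transformers import pipeline
+
+ # Text-to-text generation pipeline around the same checkpoint
+ itn_pipe = pipeline("text2text-generation", model="Akaash1/NLP_mt5")
+
+ outputs = itn_pipe("αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ†", num_beams=4, max_length=256)
+ print(outputs[0]["generated_text"])  # Expected: αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ†
+ ```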
+
+ ### Advanced Usage with Custom Class
+
+ ```python
+ import torch
+ from transformers import MT5ForConditionalGeneration, MT5Tokenizer
+
+ class KhmerITN:
+     def __init__(self, model_name="Akaash1/NLP_mt5"):
+         self.tokenizer = MT5Tokenizer.from_pretrained(model_name)
+         self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
+         self.device = "cuda" if torch.cuda.is_available() else "cpu"
+         self.model.to(self.device)
+         self.model.eval()
+
+     def normalize(self, text, num_beams=4):
+         inputs = self.tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
+         inputs = {k: v.to(self.device) for k, v in inputs.items()}
+
+         with torch.no_grad():
+             outputs = self.model.generate(**inputs, num_beams=num_beams, max_length=256)
+
+         return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ # Use it
+ itn = KhmerITN()
+ result = itn.normalize("αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ")
+ print(result)  # Output: αž†αŸ’αž“αžΆαŸ† 2013
+ ```
+
+ ## Examples
+
+ | Input (Khmer words) | Output (with digits) |
+ |---------------------|----------------------|
+ | αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ† | αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ† |
+ | αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ | αž†αŸ’αž“αžΆαŸ† 2013 |
+ | តអរអ αžœαŸαž™ αžŸαžΆαž˜αžŸαž·αž” αž”αž½αž“ αž†αŸ’αž“αžΆαŸ† | តអរអ αžœαŸαž™ 34 αž†αŸ’αž“αžΆαŸ† |
+ | αž˜αžΆαž“ αžŸαžšαž»αž” αž˜αŸ’αž—αŸƒ αž˜αž½αž™ αž“αžΆαž€αŸ‹ | αž˜αžΆαž“ αžŸαžšαž»αž” 21 αž“αžΆαž€αŸ‹ |
+ | αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› αžŠαž”αŸ‹ αž†αŸ’αž“αžΆαŸ† | αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› 10 αž†αŸ’αž“αžΆαŸ† |
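+
+ The metric listed for this model is `exact_match`. Below is a minimal sketch of a sentence-level exact-match spot check; the reference pairs are copied from the table above and the generation settings mirror the Quick Start example.
+
+ ```python
+ import torch
+ from transformers import MT5ForConditionalGeneration, MT5Tokenizer
+
+ tokenizer = MT5Tokenizer.from_pretrained("Akaash1/NLP_mt5")
+ model = MT5ForConditionalGeneration.from_pretrained("Akaash1/NLP_mt5").eval()
+
+ # Reference pairs taken from the examples table
+ pairs = [
+     ("αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ†", "αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ†"),
+     ("αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› αžŠαž”αŸ‹ αž†αŸ’αž“αžΆαŸ†", "αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› 10 αž†αŸ’αž“αžΆαŸ†"),
+ ]
+
+ correct = 0
+ for source, reference in pairs:
+     inputs = tokenizer(source, return_tensors="pt")
+     with torch.no_grad():
+         output_ids = model.generate(**inputs, num_beams=4, max_length=256)
+     prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+     correct += int(prediction == reference)
+
+ print(f"Sentence-level exact match: {correct}/{len(pairs)}")
+ ```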
+
+ ## Training Details
+
+ ### Training Data
+
+ - **Size**: 121,097 text pairs
+ - **Source**: Khmer text corpus with number words
+ - **Split**: 95% train, 5% validation
+
+ ### Training Procedure
+
+ - **Base Model**: google/mt5-small
+ - **Epochs**: 5
+ - **Batch Size**: 8 (per device) Γ— 4 (gradient accumulation) = 32 effective
+ - **Learning Rate**: 5e-4
+ - **Optimizer**: AdamW
+ - **Max Sequence Length**: 256
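+
+ For reference, the hyperparameters above correspond roughly to the following `Seq2SeqTrainingArguments`. This is a reconstruction, not the original training script; dataset loading, tokenization, and the `Seq2SeqTrainer` call are omitted, and the output directory is a placeholder.
+
+ ```python
+ from transformers import Seq2SeqTrainingArguments
+
+ # Sketch of the training configuration described above (not the original script)
+ training_args = Seq2SeqTrainingArguments(
+     output_dir="khmer-itn-mt5",        # placeholder path
+     num_train_epochs=5,
+     per_device_train_batch_size=8,
+     gradient_accumulation_steps=4,     # 8 x 4 = 32 effective batch size
+     learning_rate=5e-4,
+     optim="adamw_torch",               # AdamW
+     predict_with_generate=True,
+     generation_max_length=256,
+ )
+ ```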
+
+ ### Supported Number Types
+
+ The model can convert various Khmer number expressions:
+
+ - **Units**: αžŸαžΌαž“αŸ’αž™ (0), αž˜αž½αž™ (1), αž–αžΈαžš (2), αž”αžΈ (3), αž”αž½αž“ (4), αž”αŸ’αžšαžΆαŸ† (5), etc.
+ - **Tens**: αžŠαž”αŸ‹ (10), αž˜αŸ’αž—αŸƒ (20), αžŸαžΆαž˜αžŸαž·αž” (30), etc.
+ - **Hundreds**: αžšαž™ (100)
+ - **Thousands**: αž–αžΆαž“αŸ‹ (1,000), αž˜αŸ‰αžΊαž“ (10,000), αžŸαŸ‚αž“ (100,000)
+ - **Large numbers**: αž›αžΆαž“ (1,000,000), αž€αŸ„αžŠαž· (10,000,000)
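+
+ For intuition about how these words compose, here is a small rule-based converter covering only the vocabulary listed above. It is purely illustrative (the mT5 model learns the mapping end to end, and this sketch ignores nested multipliers): units and tens add to a running value, while multipliers scale it, e.g. αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ β†’ 18 and αž–αžΈαžš αž–αžΆαž“αŸ‹ β†’ 2000.
+
+ ```python
+ # Illustrative rule-based composition of Khmer number words (not the model's method)
+ UNITS = {"αžŸαžΌαž“αŸ’αž™": 0, "αž˜αž½αž™": 1, "αž–αžΈαžš": 2, "αž”αžΈ": 3, "αž”αž½αž“": 4, "αž”αŸ’αžšαžΆαŸ†": 5, "αž”αŸ’αžšαžΆαŸ†αž”αžΈ": 8}
+ TENS = {"αžŠαž”αŸ‹": 10, "αž˜αŸ’αž—αŸƒ": 20, "αžŸαžΆαž˜αžŸαž·αž”": 30}
+ MULTIPLIERS = {"αžšαž™": 100, "αž–αžΆαž“αŸ‹": 1_000, "αž˜αŸ‰αžΊαž“": 10_000, "αžŸαŸ‚αž“": 100_000, "αž›αžΆαž“": 1_000_000, "αž€αŸ„αžŠαž·": 10_000_000}
+
+ def words_to_number(tokens):
+     """Units and tens add to a running value; a multiplier scales whatever has accumulated."""
+     total, current = 0, 0
+     for token in tokens:
+         if token in UNITS:
+             current += UNITS[token]
+         elif token in TENS:
+             current += TENS[token]
+         elif token in MULTIPLIERS:
+             total += max(current, 1) * MULTIPLIERS[token]
+             current = 0
+     return total + current
+
+ print(words_to_number(["αžŠαž”αŸ‹", "αž”αŸ’αžšαžΆαŸ†αž”αžΈ"]))     # 18
+ print(words_to_number(["αž–αžΈαžš", "αž–αžΆαž“αŸ‹", "αžŠαž”αŸ‹"]))  # 2010
+ ```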
+
+ ## Limitations
+
+ - Input text should be pre-segmented into space-separated Khmer tokens.
+ - The model was trained on specific number-word patterns and may not generalize to unseen phrasings.
+ - Some idiomatic expressions are preserved rather than converted (e.g., "αž˜αž½αž™ αžšαž™αŸˆ", meaning "a while").
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{khmer-itn-mt5,
+   title={Khmer Inverse Text Normalization using mT5},
+   author={Your Name},
+   year={2024},
+   url={https://huggingface.co/Akaash1/NLP_mt5}
+ }
+ ```
+
+ ## Model Card Authors
+
+ [Your Name]
+
+ ## Contact
+
+ For questions or feedback, please open an issue on the model repository.
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "google/mt5-small",
+   "architectures": [
+     "MT5ForConditionalGeneration"
+   ],
+   "classifier_dropout": 0.0,
+   "d_ff": 1024,
+   "d_kv": 64,
+   "d_model": 512,
+   "decoder_start_token_id": 0,
+   "dense_act_fn": "gelu_new",
+   "dropout_rate": 0.1,
+   "eos_token_id": 1,
+   "feed_forward_proj": "gated-gelu",
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "is_gated_act": true,
+   "layer_norm_epsilon": 1e-06,
+   "model_type": "mt5",
+   "num_decoder_layers": 8,
+   "num_heads": 6,
+   "num_layers": 8,
+   "pad_token_id": 0,
+   "relative_attention_max_distance": 128,
+   "relative_attention_num_buckets": 32,
+   "tie_word_embeddings": false,
+   "tokenizer_class": "T5Tokenizer",
+   "torch_dtype": "float32",
+   "transformers_version": "4.35.2",
+   "use_cache": true,
+   "vocab_size": 250112
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "decoder_start_token_id": 0,
+   "eos_token_id": 1,
+   "pad_token_id": 0,
+   "transformers_version": "4.35.2"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a5ef3e883ec8a1ccd09539b68110b6db9dab93b853156d1112f19185fa366123
+ size 1200729512
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:827afca5242cf3c7c6c352131ea97d970bfcb185b39dd2b8240a3956691746ce
+ size 4728