---
language:
- km
license: apache-2.0
tags:
- text2text-generation
- mt5
- khmer
- inverse-text-normalization
- number-normalization
datasets:
- custom
metrics:
- exact_match
library_name: transformers
pipeline_tag: text2text-generation
---

# Khmer Inverse Text Normalization (ITN) Model

This model is a fine-tuned mT5-small that converts spelled-out Khmer number words into digits.

## Model Description

- **Model**: mT5-small (fine-tuned)
- **Language**: Khmer (ភាសាខ្មែរ)
- **Task**: Inverse Text Normalization (ITN)
- **Training Data**: 121,097 Khmer text samples with number normalization

## Usage

### Quick Start

```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# Load model and tokenizer
model_name = "Akaash1/NLP_mt5"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Normalize Khmer number words (input tokens are space-separated)
text = "វ័យ ត្រឹម ដប់ ប្រាំបី ឆ្នាំ"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)  # Output: វ័យ ត្រឹម 18 ឆ្នាំ
```

### Advanced Usage with Custom Class

```python
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer


class KhmerITN:
    def __init__(self, model_name="Akaash1/NLP_mt5"):
        self.tokenizer = MT5Tokenizer.from_pretrained(model_name)
        self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
        # Run on GPU when one is available, otherwise fall back to CPU
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model.eval()

    def normalize(self, text, num_beams=4):
        # Truncate to the 256-token maximum sequence length used in training
        inputs = self.tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model.generate(**inputs, num_beams=num_beams, max_length=256)

        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)


# Use it
itn = KhmerITN()
result = itn.normalize("ឆ្នាំ ពីរ ពាន់ ដប់ ប្រាំបី")
print(result)  # Output: ឆ្នាំ 2018
```

## Examples

| Input (Khmer words) | Output (with digits) |
|---------------------|----------------------|
| វ័យ ត្រឹម ដប់ ប្រាំបី ឆ្នាំ | វ័យ ត្រឹម 18 ឆ្នាំ |
| ឆ្នាំ ពីរ ពាន់ ដប់ ប្រាំបី | ឆ្នាំ 2018 |
| តារា វ័យ សាមសិប បួន ឆ្នាំ | តារា វ័យ 34 ឆ្នាំ |
| មាន សរុប ម្ភៃ មួយ នាក់ | មាន សរុប 21 នាក់ |
| ក្នុង រយៈពេល ដប់ ឆ្នាំ | ក្នុង រយៈពេល 10 ឆ្នាំ |

## Training Details

### Training Data

- **Size**: 121,097 text pairs
- **Source**: Khmer text corpus with number words
- **Split**: 95% train, 5% validation

### Training Procedure

- **Base Model**: google/mt5-small
- **Epochs**: 5
- **Batch Size**: 8 (per device) × 4 (gradient accumulation) = 32 effective
- **Learning Rate**: 5e-4
- **Optimizer**: AdamW
- **Max Sequence Length**: 256

### Supported Number Types

The model can convert various Khmer number expressions:

- **Units**: សូន្យ (0), មួយ (1), ពីរ (2), បី (3), បួន (4), ប្រាំ (5), etc.
- **Tens**: ដប់ (10), ម្ភៃ (20), សាមសិប (30), etc.
- **Hundreds**: រយ (100)
- **Thousands**: ពាន់ (1,000), ម៉ឺន (10,000), សែន (100,000)
- **Large numbers**: លាន (1,000,000), កោដិ (10,000,000)

## Limitations

- Input text must be pre-segmented into space-separated Khmer tokens.
- The model was trained on specific number-word patterns, so expressions outside those patterns may not be converted correctly.
- Some idiomatic expressions are left unchanged (e.g., "មួយ រយៈ", meaning "a while").

## Citation

If you use this model, please cite:

```bibtex
@misc{khmer-itn-mt5,
  title={Khmer Inverse Text Normalization using mT5},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/Akaash1/NLP_mt5}
}
```

## Model Card Authors

[Your Name]

## Contact

For questions or feedback, please open an issue on the model repository.
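
## Appendix: Rule-Based Cross-Check

The number words listed under Supported Number Types compose additively (ដប់ ប្រាំបី = 10 + 8) and multiplicatively (ពីរ ពាន់ = 2 × 1,000). The sketch below illustrates that arithmetic with a small rule-based converter, which may be handy for spot-checking model outputs. It is not how the model works and is not part of this repository: the names `UNITS`, `MULTIPLIERS`, `words_to_int`, and `rule_based_itn` are hypothetical, the lookup tables cover only the words that appear in this card, and the parser assumes magnitudes appear in descending order.

```python
# Hypothetical lookup tables, limited to number words that appear in this card.
UNITS = {
    "សូន្យ": 0, "មួយ": 1, "ពីរ": 2, "បី": 3, "បួន": 4, "ប្រាំ": 5,
    "ប្រាំបី": 8, "ដប់": 10, "ម្ភៃ": 20, "សាមសិប": 30,
}
MULTIPLIERS = {
    "រយ": 100, "ពាន់": 1_000, "ម៉ឺន": 10_000, "សែន": 100_000,
    "លាន": 1_000_000, "កោដិ": 10_000_000,
}


def words_to_int(tokens):
    """Convert a run of number-word tokens to an integer.

    Assumes magnitudes appear in descending order (thousands before tens).
    """
    total, current = 0, 0
    for tok in tokens:
        if tok in UNITS:
            # Units and tens accumulate by addition: 10 + 8 = 18.
            current += UNITS[tok]
        elif tok in MULTIPLIERS:
            # A multiplier scales what precedes it; a bare multiplier counts as 1.
            total += max(current, 1) * MULTIPLIERS[tok]
            current = 0
        else:
            raise ValueError(f"not a known number word: {tok!r}")
    return total + current


def rule_based_itn(text):
    """Replace each maximal run of known number words with its digits."""
    out, run = [], []
    for tok in text.split():
        if tok in UNITS or tok in MULTIPLIERS:
            run.append(tok)
            continue
        if run:
            out.append(str(words_to_int(run)))
            run = []
        out.append(tok)
    if run:
        out.append(str(words_to_int(run)))
    return " ".join(out)


print(rule_based_itn("ឆ្នាំ ពីរ ពាន់ ដប់ ប្រាំបី"))  # ឆ្នាំ 2018
```

A purely lexical baseline like this has no notion of context: it would rewrite the មួយ in "មួយ រយៈ" as 1, whereas the Limitations section notes that such idiomatic expressions are left unchanged by the model.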