---
language:
- km
license: apache-2.0
tags:
- text2text-generation
- mt5
- khmer
- inverse-text-normalization
- number-normalization
datasets:
- custom
metrics:
- exact_match
library_name: transformers
pipeline_tag: text2text-generation
---
# Khmer Inverse Text Normalization (ITN) Model
This model converts Khmer number words to digits using a fine-tuned mT5-small model.
## Model Description
- **Model**: mT5-small (fine-tuned)
- **Language**: Khmer (ភាសាខ្មែរ)
- **Task**: Inverse Text Normalization (ITN)
- **Training Data**: 121,097 Khmer text samples with number normalization
## Usage
### Quick Start
```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
# Load model and tokenizer
model_name = "Akaash1/NLP_mt5"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)
# Normalize Khmer number words
text = "ααα ααααΉα ααα ααααΆαααΈ ααααΆα"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result) # Output: ααα ααααΉα 18 ααααΆα
```
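Because the card sets `pipeline_tag: text2text-generation`, the high-level `pipeline` API should work as well; this is a minimal sketch using the same checkpoint and generation settings as above, not an officially documented entry point:
```python
from transformers import pipeline
# Sketch: text2text-generation pipeline wrapping the same checkpoint.
itn_pipe = pipeline("text2text-generation", model="Akaash1/NLP_mt5")
text = "ααα ααααΉα ααα ααααΆαααΈ ααααΆα"  # same example as above
result = itn_pipe(text, num_beams=4, max_length=256)
print(result[0]["generated_text"])
```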
### Advanced Usage with Custom Class
```python
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
class KhmerITN:
    def __init__(self, model_name="Akaash1/NLP_mt5"):
        self.tokenizer = MT5Tokenizer.from_pretrained(model_name)
        self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model.eval()

    def normalize(self, text, num_beams=4):
        inputs = self.tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.generate(**inputs, num_beams=num_beams, max_length=256)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
# Use it
itn = KhmerITN()
result = itn.normalize("ααααΆα ααΈα ααΆαα ααα ααααΆαααΈ")
print(result) # Output: ααααΆα 2013
```
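If you need to normalize many sentences, batching the tokenizer call is usually faster than looping over `normalize()`. The sketch below continues from the class above; `normalize_batch` is illustrative and not part of the released code:
```python
def normalize_batch(itn, texts, num_beams=4):
    # Tokenize all sentences together with padding so generate() runs a single batch.
    inputs = itn.tokenizer(texts, return_tensors="pt", padding=True,
                           max_length=256, truncation=True)
    inputs = {k: v.to(itn.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = itn.model.generate(**inputs, num_beams=num_beams, max_length=256)
    return itn.tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(normalize_batch(itn, ["ααα ααααΉα ααα ααααΆαααΈ ααααΆα", "ααααΆα ααΈα ααΆαα ααα ααααΆαααΈ"]))
```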
## Examples
| Input (Khmer words) | Output (with digits) |
|---------------------|----------------------|
| ααα ααααΉα ααα ααααΆαααΈ ααααΆα | ααα ααααΉα 18 ααααΆα |
| ααααΆα ααΈα ααΆαα ααα ααααΆαααΈ | ααααΆα 2013 |
| ααΆααΆ ααα ααΆααα·α αα½α ααααΆα | ααΆααΆ ααα 34 ααααΆα |
| ααΆα ααα»α αααα αα½α ααΆαα | ααΆα ααα»α 21 ααΆαα |
| αααα»α αααααα ααα ααααΆα | αααα»α αααααα 10 ααααΆα |
## Training Details
### Training Data
- **Size**: 121,097 text pairs
- **Source**: Khmer text corpus with number words
- **Split**: 95% train, 5% validation
### Training Procedure
- **Base Model**: google/mt5-small
- **Epochs**: 5
- **Batch Size**: 8 (per device) × 4 (gradient accumulation) = 32 effective
- **Learning Rate**: 5e-4
- **Optimizer**: AdamW
- **Max Sequence Length**: 256
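A hedged sketch of how these settings map onto `Seq2SeqTrainingArguments`; the actual training script was not released, so everything except the listed numbers is an assumption:
```python
from transformers import (
    MT5ForConditionalGeneration, MT5Tokenizer,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)

# Assumed setup; only the numeric values are taken from the list above.
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
args = Seq2SeqTrainingArguments(
    output_dir="khmer-itn-mt5",        # hypothetical output path
    num_train_epochs=5,                # Epochs: 5
    per_device_train_batch_size=8,     # Batch size per device: 8
    gradient_accumulation_steps=4,     # 8 x 4 = 32 effective
    learning_rate=5e-4,                # Learning rate: 5e-4
    optim="adamw_torch",               # Optimizer: AdamW
    predict_with_generate=True,
)
# train_dataset / eval_dataset are assumed to be tokenized with max_length=256.
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    # train_dataset=..., eval_dataset=...,
)
# trainer.train()
```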
### Supported Number Types
The model can convert various Khmer number expressions:
- **Units**: សូន្យ (0), មួយ (1), ពីរ (2), បី (3), បួន (4), ប្រាំ (5), etc.
- **Tens**: ដប់ (10), ម្ភៃ (20), សាមសិប (30), etc.
- **Hundreds**: រយ (100)
- **Thousands**: ពាន់ (1,000), ម៉ឺន (10,000), សែន (100,000)
- **Large numbers**: លាន (1,000,000), កោដិ (10,000,000)
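For quick sanity checks outside the model, the single-word values above can also be kept as a plain lookup table; this is purely an illustrative reference, not something the model uses (it learns these mappings from the training pairs):
```python
# Reference table of the single number words listed above (not used by the model).
KHMER_NUMBER_WORDS = {
    "សូន្យ": 0, "មួយ": 1, "ពីរ": 2, "បី": 3, "បួន": 4, "ប្រាំ": 5,
    "ដប់": 10, "ម្ភៃ": 20, "សាមសិប": 30,
    "រយ": 100, "ពាន់": 1_000, "ម៉ឺន": 10_000, "សែន": 100_000,
    "លាន": 1_000_000, "កោដិ": 10_000_000,
}
print(KHMER_NUMBER_WORDS["ដប់"])  # 10
```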
## Limitations
- Input text should be space-separated Khmer tokens
- The model was trained on specific number-word patterns and may not generalize to unseen formulations
- Some idiomatic expressions are intentionally left unconverted (e.g., "αα½α ααα", meaning "a while")
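Since the model expects space-separated tokens, collapsing stray whitespace before inference is a cheap safeguard; word segmentation itself is assumed to happen upstream of this step. A hypothetical helper:
```python
import re

def clean_for_itn(text: str) -> str:
    # Collapse runs of whitespace and trim; assumes tokens are already space-separated.
    return re.sub(r"\s+", " ", text).strip()

print(clean_for_itn("  ααα   ααααΉα  ααα ααααΆαααΈ   ααααΆα "))
```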
## Citation
If you use this model, please cite:
```bibtex
@misc{khmer-itn-mt5,
  title={Khmer Inverse Text Normalization using mT5},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/Akaash1/NLP_mt5}
}
```
## Model Card Authors
[Your Name]
## Contact
For questions or feedback, please open an issue on the model repository.