---
language:
- km
license: apache-2.0
tags:
- text2text-generation
- mt5
- khmer
- inverse-text-normalization
- number-normalization
datasets:
- custom
metrics:
- exact_match
library_name: transformers
pipeline_tag: text2text-generation
---
|
|
|
|
|
# Khmer Inverse Text Normalization (ITN) Model |
|
|
|
|
|
This model converts spelled-out Khmer number words into digits, using a fine-tuned mT5-small.
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Model**: mT5-small (fine-tuned)
- **Language**: Khmer (ភាសាខ្មែរ)
- **Task**: Inverse Text Normalization (ITN)
- **Training Data**: 121,097 Khmer text samples with number normalization
|
|
|
|
|
## Usage |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# Load model and tokenizer
model_name = "Akaash1/NLP_mt5"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Normalize Khmer number words
text = "ααα ααααΉα ααα ααααΆαααΈ ααααΆα"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)  # Output: ααα ααααΉα 18 ααααΆα
```
|
|
|
|
|
### Advanced Usage with Custom Class |
|
|
|
|
|
```python
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

class KhmerITN:
    def __init__(self, model_name="Akaash1/NLP_mt5"):
        self.tokenizer = MT5Tokenizer.from_pretrained(model_name)
        self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model.eval()

    def normalize(self, text, num_beams=4):
        inputs = self.tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.generate(**inputs, num_beams=num_beams, max_length=256)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Use it
itn = KhmerITN()
result = itn.normalize("ααααΆα ααΈα ααΆαα ααα ααααΆαααΈ")
print(result)  # Output: ααααΆα 2013
```
|
|
|
|
|
## Examples |
|
|
|
|
|
| Input (Khmer words) | Output (with digits) |
|---------------------|----------------------|
| ααα ααααΉα ααα ααααΆαααΈ ααααΆα | ααα ααααΉα 18 ααααΆα |
| ααααΆα ααΈα ααΆαα ααα ααααΆαααΈ | ααααΆα 2013 |
| ααΆααΆ ααα ααΆααα·α αα½α ααααΆα | ααΆααΆ ααα 34 ααααΆα |
| ααΆα ααα»α αααα αα½α ααΆαα | ααΆα ααα»α 21 ααΆαα |
| αααα»α αααααα ααα ααααΆα | αααα»α αααααα 10 ααααΆα |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Size**: 121,097 text pairs
- **Source**: Khmer text corpus with number words
- **Split**: 95% train, 5% validation
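The validation metric reported for this model is exact match. As a minimal sketch (the helper name and the whitespace trimming are assumptions, not the card's actual evaluation code), whole-sequence exact match over prediction/reference pairs can be computed like this:

```python
def exact_match(predictions, references):
    """Fraction of predictions that equal their reference exactly,
    after trimming surrounding whitespace."""
    if not references:
        return 0.0
    hits = sum(
        pred.strip() == ref.strip()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)

# Hypothetical validation pairs (references already digit-normalized)
preds = ["x 18 y", "y 2013", "z 35"]
refs = ["x 18 y", "y 2013", "z 34"]
print(exact_match(preds, refs))  # 2 of 3 match
```

Because ITN outputs are short and deterministic, whole-sequence exact match is a stricter but more meaningful score here than token-level metrics.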
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
- **Base Model**: google/mt5-small
- **Epochs**: 5
- **Batch Size**: 8 (per device) Γ 4 (gradient accumulation) = 32 effective
- **Learning Rate**: 5e-4
- **Optimizer**: AdamW
- **Max Sequence Length**: 256
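A hedged sketch of how the hyperparameters above might map onto Hugging Face `Seq2SeqTrainingArguments` (the output directory is an assumption; the card does not publish its training script):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-itn-mt5",      # assumed path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # 8 x 4 = 32 effective batch size
    learning_rate=5e-4,
    optim="adamw_torch",             # AdamW
    predict_with_generate=True,      # decode during evaluation
    generation_max_length=256,
)
```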
|
|
|
|
|
### Supported Number Types |
|
|
|
|
|
The model can convert various Khmer number expressions: |
|
|
|
|
|
- **Units**: សូន្យ (0), មួយ (1), ពីរ (2), បី (3), បួន (4), ប្រាំ (5), etc.
- **Tens**: ដប់ (10), ម្ភៃ (20), សាមសិប (30), etc.
- **Hundreds**: រយ (100)
- **Thousands**: ពាន់ (1,000), ម៉ឺន (10,000), សែន (100,000)
- **Large numbers**: លាន (1,000,000), កោដិ (10,000,000)
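For quick sanity checks of model outputs on single number words, this vocabulary can be written down as a lookup table. This is a hand-made sketch covering only the words listed above; the model itself learns these mappings from data, not from such a table:

```python
# Khmer number words and their values (units and scale words)
KHMER_NUMBER_WORDS = {
    "សូន្យ": 0, "មួយ": 1, "ពីរ": 2, "បី": 3, "បួន": 4, "ប្រាំ": 5,
    "ដប់": 10, "ម្ភៃ": 20, "សាមសិប": 30,
    "រយ": 100,
    "ពាន់": 1_000, "ម៉ឺន": 10_000, "សែន": 100_000,
    "លាន": 1_000_000, "កោដិ": 10_000_000,
}

def word_value(word):
    """Return the numeric value of a single Khmer number word, or None."""
    return KHMER_NUMBER_WORDS.get(word)

print(word_value("ប្រាំ"))  # 5
```

Note that composing multi-word numbers (e.g. tens plus units) needs more than a lookup, which is exactly what the seq2seq model handles.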
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Input text should be space-separated Khmer tokens
- The model was trained on specific number-word patterns
- Some idiomatic expressions are preserved (e.g., "មួយ រយៈ" meaning "a while")
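Since the model expects space-separated tokens, collapsing stray whitespace before inference can help. A minimal helper is sketched below; it assumes the text is already word-segmented (Khmer script is normally written without spaces, so raw text would first need a separate word segmenter, which is outside this card's scope):

```python
import re

def normalize_spaces(text):
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

print(normalize_spaces("  ពីរ   ពាន់ "))  # "ពីរ ពាន់"
```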
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{khmer-itn-mt5,
  title={Khmer Inverse Text Normalization using mT5},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/Akaash1/NLP_mt5}
}
```
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
[Your Name] |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or feedback, please open an issue on the model repository. |
|
|
|