---
language:
- km
license: apache-2.0
tags:
- text2text-generation
- mt5
- khmer
- inverse-text-normalization
- number-normalization
datasets:
- custom
metrics:
- exact_match
library_name: transformers
pipeline_tag: text2text-generation
---
# Khmer Inverse Text Normalization (ITN) Model
This model converts spelled-out Khmer number words into digits using a fine-tuned mT5-small model.
## Model Description
- **Model**: mT5-small (fine-tuned)
- **Language**: Khmer (αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžš)
- **Task**: Inverse Text Normalization (ITN)
- **Training Data**: 121,097 Khmer text samples with number normalization
## Usage
### Quick Start
```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# Load model and tokenizer
model_name = "Akaash1/NLP_mt5"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Normalize Khmer number words
text = "αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ†"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)  # Output: αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ†
```
### Advanced Usage with Custom Class
```python
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

class KhmerITN:
    def __init__(self, model_name="Akaash1/NLP_mt5"):
        self.tokenizer = MT5Tokenizer.from_pretrained(model_name)
        self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model.eval()

    def normalize(self, text, num_beams=4):
        inputs = self.tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.generate(**inputs, num_beams=num_beams, max_length=256)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Use it
itn = KhmerITN()
result = itn.normalize("αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ")
print(result)  # Output: αž†αŸ’αž“αžΆαŸ† 2018
```
## Examples
| Input (Khmer words) | Output (with digits) |
|---------------------|----------------------|
| αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ† | αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ† |
| αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ | αž†αŸ’αž“αžΆαŸ† 2018 |
| តអរអ αžœαŸαž™ αžŸαžΆαž˜αžŸαž·αž” αž”αž½αž“ αž†αŸ’αž“αžΆαŸ† | តអរអ αžœαŸαž™ 34 αž†αŸ’αž“αžΆαŸ† |
| αž˜αžΆαž“ αžŸαžšαž»αž” αž˜αŸ’αž—αŸƒ αž˜αž½αž™ αž“αžΆαž€αŸ‹ | αž˜αžΆαž“ αžŸαžšαž»αž” 21 αž“αžΆαž€αŸ‹ |
| αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› αžŠαž”αŸ‹ αž†αŸ’αž“αžΆαŸ† | αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› 10 αž†αŸ’αž“αžΆαŸ† |
## Training Details
### Training Data
- **Size**: 121,097 text pairs
- **Source**: Khmer text corpus with number words
- **Split**: 95% train, 5% validation
### Training Procedure
- **Base Model**: google/mt5-small
- **Epochs**: 5
- **Batch Size**: 8 (per device) Γ— 4 (gradient accumulation) = 32 effective
- **Learning Rate**: 5e-4
- **Optimizer**: AdamW
- **Max Sequence Length**: 256
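For quick reference, the hyperparameters above can be collected into a plain dictionary. The key names here are illustrative summaries of the listed values, not necessarily the actual argument names used in the training script:

```python
# Hypothetical summary of the training setup described above.
training_config = {
    "base_model": "google/mt5-small",
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-4,
    "optimizer": "adamw",
    "max_seq_length": 256,
}

# Effective batch size = per-device batch size x gradient accumulation steps
effective_batch = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
)
print(effective_batch)  # 32
```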
### Supported Number Types
The model can convert various Khmer number expressions:
- **Units**: αžŸαžΌαž“αŸ’αž™ (0), αž˜αž½αž™ (1), αž–αžΈαžš (2), αž”αžΈ (3), αž”αž½αž“ (4), αž”αŸ’αžšαžΆαŸ† (5), etc.
- **Tens**: αžŠαž”αŸ‹ (10), αž˜αŸ’αž—αŸƒ (20), αžŸαžΆαž˜αžŸαž·αž” (30), etc.
- **Hundreds**: αžšαž™ (100)
- **Thousands**: αž–αžΆαž“αŸ‹ (1,000), αž˜αŸ‰αžΊαž“ (10,000), αžŸαŸ‚αž“ (100,000)
- **Large numbers**: αž›αžΆαž“ (1,000,000), αž€αŸ„αžŠαž· (10,000,000)
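For intuition about how these words compose, the additive/multiplicative pattern can be sketched as a small rule-based converter. This is only an illustration of the Khmer number system using the word values listed above; it is not how the mT5 model works, and its vocabulary is deliberately incomplete:

```python
# Illustrative word values taken from the lists above (not exhaustive).
UNITS = {"αžŸαžΌαž“αŸ’αž™": 0, "αž˜αž½αž™": 1, "αž–αžΈαžš": 2, "αž”αžΈ": 3, "αž”αž½αž“": 4,
         "αž”αŸ’αžšαžΆαŸ†": 5, "αž”αŸ’αžšαžΆαŸ†αž˜αž½αž™": 6, "αž”αŸ’αžšαžΆαŸ†αž–αžΈαžš": 7, "αž”αŸ’αžšαžΆαŸ†αž”αžΈ": 8, "αž”αŸ’αžšαžΆαŸ†αž”αž½αž“": 9}
TENS = {"αžŠαž”αŸ‹": 10, "αž˜αŸ’αž—αŸƒ": 20, "αžŸαžΆαž˜αžŸαž·αž”": 30}
MULTIPLIERS = {"αžšαž™": 100, "αž–αžΆαž“αŸ‹": 1_000, "αž˜αŸ‰αžΊαž“": 10_000, "αžŸαŸ‚αž“": 100_000,
               "αž›αžΆαž“": 1_000_000, "αž€αŸ„αžŠαž·": 10_000_000}

def compose(tokens):
    """Combine number-word tokens: units/tens add, multipliers scale."""
    total, current = 0, 0
    for tok in tokens:
        if tok in UNITS:
            current += UNITS[tok]
        elif tok in TENS:
            current += TENS[tok]
        elif tok in MULTIPLIERS:
            # e.g. "αž–αžΈαžš αž–αžΆαž“αŸ‹" -> 2 * 1000; a bare multiplier counts as 1x
            total += (current or 1) * MULTIPLIERS[tok]
            current = 0
    return total + current

print(compose("αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ".split()))        # 10 + 8 = 18
print(compose("αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ".split()))  # 2*1000 + 10 + 8 = 2018
```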
## Limitations
- Input text should be space-separated Khmer tokens
- Model trained on specific number word patterns
- Some idiomatic expressions preserved (e.g., "αž˜αž½αž™ αžšαž™αŸˆ" meaning "a while")
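Because the model expects space-separated tokens, collapsing irregular whitespace before inference may help. This helper is a hypothetical preprocessing step, not part of the released model:

```python
import re

def normalize_spacing(text):
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

print(normalize_spacing("  αžŠαž”αŸ‹   αž”αŸ’αžšαžΆαŸ†αž”αžΈ  "))  # "αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ"
```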
## Citation
If you use this model, please cite:
```bibtex
@misc{khmer-itn-mt5,
  title={Khmer Inverse Text Normalization using mT5},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/Akaash1/NLP_mt5}
}
```
## Model Card Authors
[Your Name]
## Contact
For questions or feedback, please open an issue on the model repository.