Khmer Inverse Text Normalization (ITN) Model
This model is a fine-tuned mT5-small that converts spelled-out Khmer number words into digits.
Model Description
- Model: mT5-small (fine-tuned)
- Language: Khmer (ភាសាខ្មែរ)
- Task: Inverse Text Normalization (ITN)
- Training Data: 121,097 Khmer text pairs for number normalization
Usage
Quick Start
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
# Load model and tokenizer
model_name = "Akaash1/NLP_mt5"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)
# Normalize Khmer number words
text = "ααα ααααΉα ααα ααααΆαααΈ ααααΆα"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result) # Output: ααα ααααΉα 18 ααααΆα
Advanced Usage with Custom Class
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
class KhmerITN:
    def __init__(self, model_name="Akaash1/NLP_mt5"):
        self.tokenizer = MT5Tokenizer.from_pretrained(model_name)
        self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model.eval()

    def normalize(self, text, num_beams=4):
        inputs = self.tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.generate(**inputs, num_beams=num_beams, max_length=256)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
# Use it
itn = KhmerITN()
result = itn.normalize("ααααΆα ααΈα ααΆαα ααα ααααΆαααΈ")
print(result) # Output: ααααΆα 2013
Examples
| Input (Khmer words) | Output (with digits) |
|---|---|
| ααα ααααΉα ααα ααααΆαααΈ ααααΆα | ααα ααααΉα 18 ααααΆα |
| ααααΆα ααΈα ααΆαα ααα ααααΆαααΈ | ααααΆα 2013 |
| ααΆααΆ ααα ααΆααα·α αα½α ααααΆα | ααΆααΆ ααα 34 ααααΆα |
| ααΆα ααα»α αααα αα½α ααΆαα | ααΆα ααα»α 21 ααΆαα |
| αααα»α αααααα ααα ααααΆα | αααα»α αααααα 10 ααααΆα |
Training Details
Training Data
- Size: 121,097 text pairs
- Source: Khmer text corpus with number words
- Split: 95% train, 5% validation
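The card does not say how the 95/5 split was produced. As a minimal sketch, assuming a seeded random shuffle and placeholder pairs in place of the real corpus, a split of this shape could look like the following. The function name `split_pairs` and the seed value are hypothetical.

```python
import random


def split_pairs(pairs, val_fraction=0.05, seed=0):
    """Shuffle (input, target) pairs and split them into train and validation lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder pairs standing in for the real corpus (assumption, not the actual data).
pairs = [(f"input {i}", f"target {i}") for i in range(121_097)]
train, val = split_pairs(pairs)
print(len(train), len(val))  # 115042 6055
```

With 121,097 pairs this gives roughly 115,000 training pairs and 6,000 validation pairs; the exact counts depend on how the fraction is rounded.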
Training Procedure
- Base Model: google/mt5-small
- Epochs: 5
- Batch Size: 8 (per device) × 4 (gradient accumulation) = 32 effective
- Learning Rate: 5e-4
- Optimizer: AdamW
- Max Sequence Length: 256
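To make the batch-size arithmetic concrete, here is a rough estimate of the number of optimizer steps these settings imply. It assumes a single device and a training set of about 95% of the 121,097 pairs; neither is confirmed by the card, and frameworks differ in how they count the last partial batch, so treat the figures as approximate.

```python
import math

per_device_batch = 8
grad_accum_steps = 4
epochs = 5
# Assumed training-set size: 95% of the corpus, rounded down.
train_samples = int(121_097 * 0.95)

effective_batch = per_device_batch * grad_accum_steps         # 8 * 4 = 32
steps_per_epoch = math.ceil(train_samples / effective_batch)  # optimizer updates per epoch
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 32 3596 17980
```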
Supported Number Types
The model can convert various Khmer number expressions:
- Units: សូន្យ (0), មួយ (1), ពីរ (2), បី (3), បួន (4), ប្រាំ (5), etc.
- Tens: ដប់ (10), ម្ភៃ (20), សាមសិប (30), etc.
- Hundreds: រយ (100)
- Thousands: ពាន់ (1,000), ម៉ឺន (10,000), សែន (100,000)
- Large numbers: លាន (1,000,000), កោដិ (10,000,000)
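To illustrate how these word classes combine into a single value, here is a small rule-based sketch. It is not how the mT5 model works internally (the model learns the mapping from data); it only shows the arithmetic the output digits correspond to. The function name `words_to_number` and the lookup tables are hypothetical and cover only the words listed above.

```python
UNITS = {"សូន្យ": 0, "មួយ": 1, "ពីរ": 2, "បី": 3, "បួន": 4, "ប្រាំ": 5}
TENS = {"ដប់": 10, "ម្ភៃ": 20, "សាមសិប": 30}
HUNDRED = {"រយ": 100}
LARGE = {"ពាន់": 1_000, "ម៉ឺន": 10_000, "សែន": 100_000,
         "លាន": 1_000_000, "កោដិ": 10_000_000}


def words_to_number(tokens):
    """Combine space-separated Khmer number words into an integer."""
    total = 0    # value of groups already closed by a large multiplier
    current = 0  # value of the group being built
    for tok in tokens:
        if tok in UNITS:
            current += UNITS[tok]
        elif tok in TENS:
            current += TENS[tok]
        elif tok in HUNDRED:
            # A bare "hundred" is read as one hundred.
            current = max(current, 1) * HUNDRED[tok]
        elif tok in LARGE:
            total += max(current, 1) * LARGE[tok]
            current = 0
        else:
            raise ValueError(f"not a known number word: {tok!r}")
    return total + current


print(words_to_number("ពីរ ពាន់ ដប់ បី".split()))  # 2013
print(words_to_number("សាមសិប បួន".split()))       # 34
```

A rule table like this breaks on idioms and unlisted spellings, which is one reason a learned sequence-to-sequence model is attractive for this task.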
Limitations
- Input text should be space-separated Khmer tokens
- The model was trained on specific number-word patterns, so phrasings outside those patterns may not be converted
- Some idiomatic expressions are preserved rather than converted (e.g., "មួយ រយៈ", meaning "a while")
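Because the model expects space-separated tokens, input may need light cleanup first. The sketch below assumes the upstream text marks word boundaries with zero-width spaces (U+200B), which is common in Khmer text but not guaranteed; text with no boundary markers at all would need a proper Khmer word segmenter, which is outside this example. The helper name `to_space_separated` is hypothetical.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"


def to_space_separated(text):
    """Turn zero-width-space word breaks into spaces and collapse repeated whitespace."""
    text = text.replace(ZERO_WIDTH_SPACE, " ")
    return re.sub(r"\s+", " ", text).strip()


print(to_space_separated("ពីរ\u200bពាន់  ដប់\u200bបី\n"))  # ពីរ ពាន់ ដប់ បី
```

The cleaned string can then be passed to the tokenizer as in the usage examples.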
Citation
If you use this model, please cite:
@misc{khmer-itn-mt5,
title={Khmer Inverse Text Normalization using mT5},
author={Your Name},
year={2024},
url={https://huggingface.co/Akaash1/NLP_mt5}
}
Model Card Authors
[Your Name]
Contact
For questions or feedback, please open an issue on the model repository.