Khmer Inverse Text Normalization (ITN) Model

This model converts Khmer number words to digits using a fine-tuned mT5-small model.

Model Description

  • Model: mT5-small (fine-tuned)
  • Language: Khmer (αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžš)
  • Task: Inverse Text Normalization (ITN)
  • Training Data: 121,097 Khmer text samples with number normalization

Usage

Quick Start

from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# Load model and tokenizer
model_name = "Akaash1/NLP_mt5"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Normalize Khmer number words
text = "αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ†"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)  # Output: αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ†

Advanced Usage with Custom Class

import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

class KhmerITN:
    def __init__(self, model_name="Akaash1/NLP_mt5"):
        self.tokenizer = MT5Tokenizer.from_pretrained(model_name)
        self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model.eval()
    
    def normalize(self, text, num_beams=4):
        inputs = self.tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = self.model.generate(**inputs, num_beams=num_beams, max_length=256)
        
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Use it
itn = KhmerITN()
result = itn.normalize("αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ")
print(result)  # Output: αž†αŸ’αž“αžΆαŸ† 2013

Examples

| Input (Khmer words) | Output (with digits) |
|---|---|
| αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ† | αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ† |
| αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ | αž†αŸ’αž“αžΆαŸ† 2013 |
| តអរអ αžœαŸαž™ αžŸαžΆαž˜αžŸαž·αž” αž”αž½αž“ αž†αŸ’αž“αžΆαŸ† | តអរអ αžœαŸαž™ 34 αž†αŸ’αž“αžΆαŸ† |
| αž˜αžΆαž“ αžŸαžšαž»αž” αž˜αŸ’αž—αŸƒ αž˜αž½αž™ αž“αžΆαž€αŸ‹ | αž˜αžΆαž“ αžŸαžšαž»αž” 21 αž“αžΆαž€αŸ‹ |
| αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› αžŠαž”αŸ‹ αž†αŸ’αž“αžΆαŸ† | αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› 10 αž†αŸ’αž“αžΆαŸ† |

Training Details

Training Data

  • Size: 121,097 text pairs
  • Source: Khmer text corpus with number words
  • Split: 95% train, 5% validation
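The 95/5 split can be reproduced with a simple shuffled slice. This is a sketch on a stand-in corpus (the dummy pairs and the random seed are assumptions; the card does not report how the actual split was made):

```python
import random

# Stand-in corpus: the real training set has 121,097 (input, output) text pairs
pairs = [(f"khmer input {i}", f"normalized {i}") for i in range(121097)]

random.seed(42)  # assumed seed; the actual seed is not reported
random.shuffle(pairs)

cut = int(len(pairs) * 0.95)  # 95% train, 5% validation
train, val = pairs[:cut], pairs[cut:]
print(len(train), len(val))  # 115042 6055
```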

Training Procedure

  • Base Model: google/mt5-small
  • Epochs: 5
  • Batch Size: 8 (per device) Γ— 4 (gradient accumulation) = 32 effective
  • Learning Rate: 5e-4
  • Optimizer: AdamW
  • Max Sequence Length: 256
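The hyperparameters above map directly onto a `Seq2SeqTrainingArguments` configuration. This is a hedged reconstruction, not the authors' actual training script; `output_dir` is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Reconstruction of the reported hyperparameters (sketch, not the original script)
args = Seq2SeqTrainingArguments(
    output_dir="khmer-itn-mt5",        # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,     # 8 x 4 = 32 effective batch size
    learning_rate=5e-4,
    optim="adamw_torch",               # AdamW optimizer
    predict_with_generate=True,        # generate during evaluation
)
```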

Supported Number Types

The model can convert various Khmer number expressions:

  • Units: αžŸαžΌαž“αŸ’αž™ (0), αž˜αž½αž™ (1), αž–αžΈαžš (2), αž”αžΈ (3), αž”αž½αž“ (4), αž”αŸ’αžšαžΆαŸ† (5), etc.
  • Tens: αžŠαž”αŸ‹ (10), αž˜αŸ’αž—αŸƒ (20), αžŸαžΆαž˜αžŸαž·αž” (30), etc.
  • Hundreds: αžšαž™ (100)
  • Thousands: αž–αžΆαž“αŸ‹ (1,000), αž˜αŸ‰αžΊαž“ (10,000), αžŸαŸ‚αž“ (100,000)
  • Large numbers: αž›αžΆαž“ (1,000,000), αž€αŸ„αžŠαž· (10,000,000)
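As a plain-Python illustration of how these words compose, the sketch below adds small values and lets words of 100 or more act as multipliers. The model itself is learned, not rule-based, and the compound word αž”αŸ’αžšαžΆαŸ†αž”αžΈ (5 + 3 = 8) is added here as an assumption:

```python
# Word-to-value map built from the list above; αž”αŸ’αžšαžΆαŸ†αž”αžΈ (8) is an assumed compound
VALUES = {
    "αžŸαžΌαž“αŸ’αž™": 0, "αž˜αž½αž™": 1, "αž–αžΈαžš": 2, "αž”αžΈ": 3, "αž”αž½αž“": 4, "αž”αŸ’αžšαžΆαŸ†": 5, "αž”αŸ’αžšαžΆαŸ†αž”αžΈ": 8,
    "αžŠαž”αŸ‹": 10, "αž˜αŸ’αž—αŸƒ": 20, "αžŸαžΆαž˜αžŸαž·αž”": 30,
    "αžšαž™": 100, "αž–αžΆαž“αŸ‹": 1_000, "αž˜αŸ‰αžΊαž“": 10_000, "αžŸαŸ‚αž“": 100_000,
    "αž›αžΆαž“": 1_000_000, "αž€αŸ„αžŠαž·": 10_000_000,
}

def compose(tokens):
    """Combine number words: values >= 100 scale the running sum, others add."""
    total, current = 0, 0
    for tok in tokens:
        v = VALUES[tok]
        if v >= 100:                       # αžšαž™, αž–αžΆαž“αŸ‹, ... act as multipliers
            total += max(current, 1) * v
            current = 0
        else:                              # units and tens simply add
            current += v
    return total + current

print(compose("αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ".split()))   # 18
print(compose("αžŸαžΆαž˜αžŸαž·αž” αž”αž½αž“".split()))    # 34
```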

Limitations

  • Input text should be space-separated Khmer tokens
  • The model was trained on specific number-word patterns and may not generalize to unseen constructions
  • Some idiomatic expressions are preserved rather than converted (e.g., "αž˜αž½αž™ αžšαž™αŸˆ" meaning "a while")

Citation

If you use this model, please cite:

@misc{khmer-itn-mt5,
  title={Khmer Inverse Text Normalization using mT5},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/Akaash1/NLP_mt5}
}

Model Card Authors

[Your Name]

Contact

For questions or feedback, please open an issue on the model repository.
