Khmer Inverse Text Normalization (ITN) Model

This model converts Khmer number words to digits using a fine-tuned mT5-small model.

Model Description

  • Model: mT5-small (fine-tuned)
  • Language: Khmer (αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžš)
  • Task: Inverse Text Normalization (ITN)
  • Training Data: 121,097 Khmer text samples with number normalization

Usage

Quick Start

from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# Load model and tokenizer
model_name = "Akaash1/NLP_mt5"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Normalize Khmer number words
text = "αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ†"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)  # Output: αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ†

Advanced Usage with Custom Class

import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

class KhmerITN:
    def __init__(self, model_name="Akaash1/NLP_mt5"):
        self.tokenizer = MT5Tokenizer.from_pretrained(model_name)
        self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model.eval()
    
    def normalize(self, text, num_beams=4):
        inputs = self.tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = self.model.generate(**inputs, num_beams=num_beams, max_length=256)
        
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Use it
itn = KhmerITN()
result = itn.normalize("αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ")
print(result)  # Output: αž†αŸ’αž“αžΆαŸ† 2013

Examples

| Input (Khmer words) | Output (with digits) |
|---|---|
| αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ† | αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ† |
| αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ | αž†αŸ’αž“αžΆαŸ† 2013 |
| តអរអ αžœαŸαž™ αžŸαžΆαž˜αžŸαž·αž” αž”αž½αž“ αž†αŸ’αž“αžΆαŸ† | តអរអ αžœαŸαž™ 34 αž†αŸ’αž“αžΆαŸ† |
| αž˜αžΆαž“ αžŸαžšαž»αž” αž˜αŸ’αž—αŸƒ αž˜αž½αž™ αž“αžΆαž€αŸ‹ | αž˜αžΆαž“ αžŸαžšαž»αž” 21 αž“αžΆαž€αŸ‹ |
| αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› αžŠαž”αŸ‹ αž†αŸ’αž“αžΆαŸ† | αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› 10 αž†αŸ’αž“αžΆαŸ† |

Training Details

Training Data

  • Size: 121,097 text pairs
  • Source: Khmer text corpus with number words
  • Split: 95% train, 5% validation
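The 95/5 split can be reproduced with a simple shuffled slice. This is a sketch on a stand-in corpus (the dummy pairs and the random seed are assumptions; the card does not report how the actual split was made):

```python
import random

# Stand-in corpus: the real training set has 121,097 (input, output) text pairs
pairs = [(f"khmer input {i}", f"normalized {i}") for i in range(121097)]

random.seed(42)  # assumed seed; the actual seed is not reported
random.shuffle(pairs)

cut = int(len(pairs) * 0.95)  # 95% train, 5% validation
train, val = pairs[:cut], pairs[cut:]
print(len(train), len(val))  # 115042 6055
```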

Training Procedure

  • Base Model: google/mt5-small
  • Epochs: 5
  • Batch Size: 8 (per device) Γ— 4 (gradient accumulation) = 32 effective
  • Learning Rate: 5e-4
  • Optimizer: AdamW
  • Max Sequence Length: 256
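The hyperparameters above map directly onto a `Seq2SeqTrainingArguments` configuration. This is a hedged reconstruction, not the authors' actual training script; `output_dir` is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Reconstruction of the reported hyperparameters (sketch, not the original script)
args = Seq2SeqTrainingArguments(
    output_dir="khmer-itn-mt5",        # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,     # 8 x 4 = 32 effective batch size
    learning_rate=5e-4,
    optim="adamw_torch",               # AdamW optimizer
    predict_with_generate=True,        # generate during evaluation
)
```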

Supported Number Types

The model can convert various Khmer number expressions:

  • Units: αžŸαžΌαž“αŸ’αž™ (0), αž˜αž½αž™ (1), αž–αžΈαžš (2), αž”αžΈ (3), αž”αž½αž“ (4), αž”αŸ’αžšαžΆαŸ† (5), etc.
  • Tens: αžŠαž”αŸ‹ (10), αž˜αŸ’αž—αŸƒ (20), αžŸαžΆαž˜αžŸαž·αž” (30), etc.
  • Hundreds: αžšαž™ (100)
  • Thousands: αž–αžΆαž“αŸ‹ (1,000), αž˜αŸ‰αžΊαž“ (10,000), αžŸαŸ‚αž“ (100,000)
  • Large numbers: αž›αžΆαž“ (1,000,000), αž€αŸ„αžŠαž· (10,000,000)
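As a plain-Python illustration of how these words compose, the sketch below adds small values and lets words of 100 or more act as multipliers. The model itself is learned, not rule-based, and the compound word αž”αŸ’αžšαžΆαŸ†αž”αžΈ (5 + 3 = 8) is added here as an assumption:

```python
# Word-to-value map built from the list above; αž”αŸ’αžšαžΆαŸ†αž”αžΈ (8) is an assumed compound
VALUES = {
    "αžŸαžΌαž“αŸ’αž™": 0, "αž˜αž½αž™": 1, "αž–αžΈαžš": 2, "αž”αžΈ": 3, "αž”αž½αž“": 4, "αž”αŸ’αžšαžΆαŸ†": 5, "αž”αŸ’αžšαžΆαŸ†αž”αžΈ": 8,
    "αžŠαž”αŸ‹": 10, "αž˜αŸ’αž—αŸƒ": 20, "αžŸαžΆαž˜αžŸαž·αž”": 30,
    "αžšαž™": 100, "αž–αžΆαž“αŸ‹": 1_000, "αž˜αŸ‰αžΊαž“": 10_000, "αžŸαŸ‚αž“": 100_000,
    "αž›αžΆαž“": 1_000_000, "αž€αŸ„αžŠαž·": 10_000_000,
}

def compose(tokens):
    """Combine number words: values >= 100 scale the running sum, others add."""
    total, current = 0, 0
    for tok in tokens:
        v = VALUES[tok]
        if v >= 100:                       # αžšαž™, αž–αžΆαž“αŸ‹, ... act as multipliers
            total += max(current, 1) * v
            current = 0
        else:                              # units and tens simply add
            current += v
    return total + current

print(compose("αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ".split()))   # 18
print(compose("αžŸαžΆαž˜αžŸαž·αž” αž”αž½αž“".split()))    # 34
```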

Limitations

  • Input text should be space-separated Khmer tokens
  • The model was trained on specific number-word patterns and may not generalize to unseen constructions
  • Some idiomatic expressions are preserved rather than converted (e.g., "αž˜αž½αž™ αžšαž™αŸˆ" meaning "a while")

Citation

If you use this model, please cite:

@misc{khmer-itn-mt5,
  title={Khmer Inverse Text Normalization using mT5},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/Akaash1/NLP_mt5}
}

Model Card Authors

[Your Name]

Contact

For questions or feedback, please open an issue on the model repository.
