---
language:
- km
license: apache-2.0
tags:
- text2text-generation
- mt5
- khmer
- inverse-text-normalization
- number-normalization
datasets:
- custom
metrics:
- exact_match
library_name: transformers
pipeline_tag: text2text-generation
---
|
|
|
|
|
# Khmer Inverse Text Normalization (ITN) Model |
|
|
|
|
|
This model converts spelled-out Khmer number words into digits, using a fine-tuned mT5-small.
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Model**: mT5-small (fine-tuned)
- **Language**: Khmer (ភាសាខ្មែរ)
- **Task**: Inverse Text Normalization (ITN)
- **Training Data**: 121,097 Khmer text samples with number normalization
|
|
|
|
|
## Usage |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# Load model and tokenizer
model_name = "Akaash1/NLP_mt5"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Normalize Khmer number words
text = "ααα ααααΉα ααα ααααΆαααΈ ααααΆα"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)  # Output: ααα ααααΉα 18 ααααΆα
```
|
|
|
|
|
### Advanced Usage with Custom Class |
|
|
|
|
|
```python
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

class KhmerITN:
    def __init__(self, model_name="Akaash1/NLP_mt5"):
        self.tokenizer = MT5Tokenizer.from_pretrained(model_name)
        self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model.eval()

    def normalize(self, text, num_beams=4):
        inputs = self.tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.generate(**inputs, num_beams=num_beams, max_length=256)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Use it
itn = KhmerITN()
result = itn.normalize("ααααΆα ααΈα ααΆαα ααα ααααΆαααΈ")
print(result)  # Output: ααααΆα 2013
```
|
|
|
|
|
## Examples |
|
|
|
|
|
| Input (Khmer words) | Output (with digits) |
|---------------------|----------------------|
| ααα ααααΉα ααα ααααΆαααΈ ααααΆα | ααα ααααΉα 18 ααααΆα |
| ααααΆα ααΈα ααΆαα ααα ααααΆαααΈ | ααααΆα 2013 |
| ααΆααΆ ααα ααΆααα·α αα½α ααααΆα | ααΆααΆ ααα 34 ααααΆα |
| ααΆα ααα»α αααα αα½α ααΆαα | ααΆα ααα»α 21 ααΆαα |
| αααα»α αααααα ααα ααααΆα | αααα»α αααααα 10 ααααΆα |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Size**: 121,097 text pairs
- **Source**: Khmer text corpus with number words
- **Split**: 95% train, 5% validation
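The validation metric reported for this model is exact match. As a minimal sketch (the helper name and the whitespace trimming are assumptions, not the card's actual evaluation code), whole-sequence exact match over prediction/reference pairs can be computed like this:

```python
def exact_match(predictions, references):
    """Fraction of predictions that equal their reference exactly,
    after trimming surrounding whitespace."""
    if not references:
        return 0.0
    hits = sum(
        pred.strip() == ref.strip()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)

# Hypothetical validation pairs (references already digit-normalized)
preds = ["x 18 y", "y 2013", "z 35"]
refs = ["x 18 y", "y 2013", "z 34"]
print(exact_match(preds, refs))  # 2 of 3 match
```

Because ITN outputs are short and deterministic, whole-sequence exact match is a stricter but more meaningful score here than token-level metrics.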
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
- **Base Model**: google/mt5-small
- **Epochs**: 5
- **Batch Size**: 8 (per device) Γ 4 (gradient accumulation) = 32 effective
- **Learning Rate**: 5e-4
- **Optimizer**: AdamW
- **Max Sequence Length**: 256
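A hedged sketch of how the hyperparameters above might map onto Hugging Face `Seq2SeqTrainingArguments` (the output directory is an assumption; the card does not publish its training script):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-itn-mt5",      # assumed path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # 8 x 4 = 32 effective batch size
    learning_rate=5e-4,
    optim="adamw_torch",             # AdamW
    predict_with_generate=True,      # decode during evaluation
    generation_max_length=256,
)
```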
|
|
|
|
|
### Supported Number Types |
|
|
|
|
|
The model can convert various Khmer number expressions: |
|
|
|
|
|
- **Units**: សូន្យ (0), មួយ (1), ពីរ (2), បី (3), បួន (4), ប្រាំ (5), etc.
- **Tens**: ដប់ (10), ម្ភៃ (20), សាមសិប (30), etc.
- **Hundreds**: រយ (100)
- **Thousands**: ពាន់ (1,000), ម៉ឺន (10,000), សែន (100,000)
- **Large numbers**: លាន (1,000,000), កោដិ (10,000,000)
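For quick sanity checks of model outputs on single number words, this vocabulary can be written down as a lookup table. This is a hand-made sketch covering only the words listed above; the model itself learns these mappings from data, not from such a table:

```python
# Khmer number words and their values (units and scale words)
KHMER_NUMBER_WORDS = {
    "សូន្យ": 0, "មួយ": 1, "ពីរ": 2, "បី": 3, "បួន": 4, "ប្រាំ": 5,
    "ដប់": 10, "ម្ភៃ": 20, "សាមសិប": 30,
    "រយ": 100,
    "ពាន់": 1_000, "ម៉ឺន": 10_000, "សែន": 100_000,
    "លាន": 1_000_000, "កោដិ": 10_000_000,
}

def word_value(word):
    """Return the numeric value of a single Khmer number word, or None."""
    return KHMER_NUMBER_WORDS.get(word)

print(word_value("ប្រាំ"))  # 5
```

Note that composing multi-word numbers (e.g. tens plus units) needs more than a lookup, which is exactly what the seq2seq model handles.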
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Input text should be space-separated Khmer tokens
- The model was trained on specific number-word patterns
- Some idiomatic expressions are preserved (e.g., "មួយ រយៈ" meaning "a while")
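Since the model expects space-separated tokens, collapsing stray whitespace before inference can help. A minimal helper is sketched below; it assumes the text is already word-segmented (Khmer script is normally written without spaces, so raw text would first need a separate word segmenter, which is outside this card's scope):

```python
import re

def normalize_spaces(text):
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

print(normalize_spaces("  ពីរ   ពាន់ "))  # "ពីរ ពាន់"
```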
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{khmer-itn-mt5,
  title={Khmer Inverse Text Normalization using mT5},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/Akaash1/NLP_mt5}
}
```
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
[Your Name] |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or feedback, please open an issue on the model repository. |
|
|
|