---
language:
- km
license: apache-2.0
tags:
- text2text-generation
- mt5
- khmer
- inverse-text-normalization
- number-normalization
datasets:
- custom
metrics:
- exact_match
library_name: transformers
pipeline_tag: text2text-generation
---

# Khmer Inverse Text Normalization (ITN) Model

This model converts Khmer number words to digits using a fine-tuned mT5-small model.

## Model Description

- **Model**: mT5-small (fine-tuned)
- **Language**: Khmer (αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžš)
- **Task**: Inverse Text Normalization (ITN)
- **Training Data**: 121,097 Khmer text samples with number normalization

## Usage

### Quick Start

```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# Load model and tokenizer
model_name = "Akaash1/NLP_mt5"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Normalize Khmer number words
text = "αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ†"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)  # Output: αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ†
```

### Advanced Usage with Custom Class

```python
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

class KhmerITN:
    def __init__(self, model_name="Akaash1/NLP_mt5"):
        self.tokenizer = MT5Tokenizer.from_pretrained(model_name)
        self.model = MT5ForConditionalGeneration.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        self.model.eval()
    
    def normalize(self, text, num_beams=4):
        inputs = self.tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = self.model.generate(**inputs, num_beams=num_beams, max_length=256)
        
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Use it
itn = KhmerITN()
result = itn.normalize("αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ")
print(result)  # Output: αž†αŸ’αž“αžΆαŸ† 2018
```

## Examples

| Input (Khmer words) | Output (with digits) |
|---------------------|----------------------|
| αžœαŸαž™ αžαŸ’αžšαžΉαž˜ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ αž†αŸ’αž“αžΆαŸ† | αžœαŸαž™ αžαŸ’αžšαžΉαž˜ 18 αž†αŸ’αž“αžΆαŸ† |
| αž†αŸ’αž“αžΆαŸ† αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†αž”αžΈ | αž†αŸ’αž“αžΆαŸ† 2018 |
| តអរអ αžœαŸαž™ αžŸαžΆαž˜αžŸαž·αž” αž”αž½αž“ αž†αŸ’αž“αžΆαŸ† | តអរអ αžœαŸαž™ 34 αž†αŸ’αž“αžΆαŸ† |
| αž˜αžΆαž“ αžŸαžšαž»αž” αž˜αŸ’αž—αŸƒ αž˜αž½αž™ αž“αžΆαž€αŸ‹ | αž˜αžΆαž“ αžŸαžšαž»αž” 21 αž“αžΆαž€αŸ‹ |
| αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› αžŠαž”αŸ‹ αž†αŸ’αž“αžΆαŸ† | αž€αŸ’αž“αž»αž„ αžšαž™αŸˆαž–αŸαž› 10 αž†αŸ’αž“αžΆαŸ† |

## Training Details

### Training Data

- **Size**: 121,097 text pairs
- **Source**: Khmer text corpus with number words
- **Split**: 95% train, 5% validation

### Training Procedure

- **Base Model**: google/mt5-small
- **Epochs**: 5
- **Batch Size**: 8 (per device) Γ— 4 (gradient accumulation) = 32 effective
- **Learning Rate**: 5e-4
- **Optimizer**: AdamW
- **Max Sequence Length**: 256
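
The training script itself is not published, but the hyperparameters above map onto `transformers`' standard seq2seq setup roughly as follows. This is a hedged sketch only: the `output_dir` name is hypothetical, and the actual run may have differed in details (scheduler, warmup, evaluation cadence) not stated in this card.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the listed hyperparameters expressed as Seq2SeqTrainingArguments
# (assumed setup, not the authors' actual script).
training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-itn-mt5",       # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,    # 8 x 4 = 32 effective batch size
    learning_rate=5e-4,
    optim="adamw_torch",              # AdamW optimizer
    predict_with_generate=True,       # generate during evaluation
)
```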

### Supported Number Types

The model can convert various Khmer number expressions:

- **Units**: αžŸαžΌαž“αŸ’αž™ (0), αž˜αž½αž™ (1), αž–αžΈαžš (2), αž”αžΈ (3), αž”αž½αž“ (4), αž”αŸ’αžšαžΆαŸ† (5), etc.
- **Tens**: αžŠαž”αŸ‹ (10), αž˜αŸ’αž—αŸƒ (20), αžŸαžΆαž˜αžŸαž·αž” (30), etc.
- **Hundreds**: αžšαž™ (100)
- **Thousands**: αž–αžΆαž“αŸ‹ (1,000), αž˜αŸ‰αžΊαž“ (10,000), αžŸαŸ‚αž“ (100,000)
- **Large numbers**: αž›αžΆαž“ (1,000,000), αž€αŸ„αžŠαž· (10,000,000)
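
How these words compose can be illustrated with a small rule-based sketch. The word-to-value mapping below is built only from the values listed above; the `compose` helper is hypothetical and is not how the model works (the model learns the mapping end to end), but it shows the multiplier/additive structure the model must capture.

```python
# Word values taken from the list above (hypothetical helper, for illustration).
KHMER_NUMBER_WORDS = {
    "αžŸαžΌαž“αŸ’αž™": 0, "αž˜αž½αž™": 1, "αž–αžΈαžš": 2, "αž”αžΈ": 3, "αž”αž½αž“": 4, "αž”αŸ’αžšαžΆαŸ†": 5,
    "αžŠαž”αŸ‹": 10, "αž˜αŸ’αž—αŸƒ": 20, "αžŸαžΆαž˜αžŸαž·αž”": 30,
    "αžšαž™": 100, "αž–αžΆαž“αŸ‹": 1_000, "αž˜αŸ‰αžΊαž“": 10_000, "αžŸαŸ‚αž“": 100_000,
    "αž›αžΆαž“": 1_000_000, "αž€αŸ„αžŠαž·": 10_000_000,
}

def compose(tokens):
    """Combine unit words with multiplier words (e.g. αž–αžΈαžš αž–αžΆαž“αŸ‹ -> 2000)."""
    total, current = 0, 0
    for tok in tokens:
        value = KHMER_NUMBER_WORDS[tok]
        if value >= 100:   # multiplier word: scale the accumulated units
            total += (current or 1) * value
            current = 0
        else:              # unit or tens word: accumulate additively
            current += value
    return total + current

compose("αž–αžΈαžš αž–αžΆαž“αŸ‹ αžŠαž”αŸ‹ αž”αŸ’αžšαžΆαŸ†".split())  # 2 * 1000 + 10 + 5 = 2015
```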

## Limitations

- Input text should be pre-segmented into space-separated Khmer tokens
- The model was trained on specific number-word patterns; unseen constructions may not be normalized correctly
- Some idiomatic expressions are intentionally preserved (e.g., "αž˜αž½αž™ αžšαž™αŸˆ" meaning "a while")
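
Given the first limitation, it may help to normalize whitespace before calling the model. A minimal sketch (hypothetical helper, not part of the released code):

```python
import re

def prepare_input(text: str) -> str:
    # Collapse tabs, newlines, and runs of spaces into single spaces so the
    # input matches the space-separated token format the model expects.
    return re.sub(r"\s+", " ", text).strip()
```

The cleaned string can then be passed to the tokenizer or the `normalize` method shown above.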

## Citation

If you use this model, please cite:

```bibtex
@misc{khmer-itn-mt5,
  title={Khmer Inverse Text Normalization using mT5},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/Akaash1/NLP_mt5}
}
```

## Model Card Authors

[Your Name]

## Contact

For questions or feedback, please open an issue on the model repository.