An abstractive summarization model for the Khmer language, fine-tuned from
google/mt5-base on two Khmer news datasets.
Note: Khmer has no spaces between words.
The mT5 SentencePiece tokenizer handles all subword segmentation automatically,
so do not apply any word-splitting pre-processing.
import re
import unicodedata

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("phonsobon/khmer-text-summarization", use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained("phonsobon/khmer-text-summarization")
model.eval()

def clean_khmer(text):
    text = unicodedata.normalize("NFC", text)          # canonical Unicode composition
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)  # strip HTML tags and URLs
    text = re.sub(r"[ \t]+", " ", text)                # collapse runs of spaces and tabs
    return text.strip()

article = "..."  # your Khmer article

inputs = tokenizer(
    "summarize: " + clean_khmer(article),
    return_tensors="pt",
    max_length=512,
    truncation=True,
)
output_ids = model.generate(**inputs)  # generation_config baked in
summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(summary)
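To see what the cleaning step does before any text reaches the tokenizer, the sketch below runs the same `clean_khmer` logic on a made-up input. The sample string is hypothetical and uses Latin text only so the effect is easy to read; it needs nothing beyond the standard library.

```python
import re
import unicodedata

def clean_khmer(text):
    # Same three steps as the usage snippet: NFC-normalize,
    # replace HTML tags and URLs with a space, collapse spaces and tabs.
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

# Hypothetical input with markup, a URL, and irregular spacing
raw = "<p>Hello   world</p> see https://example.com/news\tnow"
cleaned = clean_khmer(raw)
print(cleaned)  # Hello world see now
```

Note that the whitespace pattern only covers spaces and tabs, so line breaks inside an article are left as they are.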
| Metric | Score |
|---|---|
| n/a | n/a |

| Metric | Score |
|---|---|
| n/a | n/a |
| Setting | Value |
|---|---|
| Base model | google/mt5-base |
| Fine-tuning method | LoRA (merged) |
| Task prefix | summarize: |
| Max input length | 512 tokens |
| Max target length | 128 tokens |
| Epochs | 10 |
| Learning rate | 0.0002 |
| Beam search | 4 beams |
| No-repeat n-gram | 3 |
| Training date | 2026-06-20T09:20:50.974517 |
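If you prefer to pass the decoding settings explicitly rather than rely on the stored generation config, the table's values could be written out as keyword arguments along the lines below. This is a sketch, not the author's configuration: the helper names are hypothetical, and mapping "Max target length" to `max_new_tokens` is an assumption. The second function only illustrates what a no-repeat n-gram size of 3 rules out in the output.

```python
def generation_kwargs(max_target_tokens=128, num_beams=4, no_repeat_ngram_size=3):
    """Build keyword arguments for `model.generate` from the settings table."""
    return {
        # Assumption: the 128-token target length is applied as max_new_tokens.
        "max_new_tokens": max_target_tokens,
        "num_beams": num_beams,
        "no_repeat_ngram_size": no_repeat_ngram_size,
    }

def has_repeated_ngram(token_ids, n=3):
    """Return True if any n-gram of token ids occurs more than once."""
    seen = set()
    for i in range(len(token_ids) - n + 1):
        gram = tuple(token_ids[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False

kwargs = generation_kwargs()
print(kwargs)
print(has_repeated_ngram([1, 2, 3, 4, 1, 2, 3]))  # True: (1, 2, 3) appears twice
print(has_repeated_ngram([1, 2, 3, 4, 1, 2, 5]))  # False: every 3-gram is unique
```

With the model and inputs from the usage example, the call would then look like `model.generate(**inputs, **generation_kwargs())`.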
| Dataset | Columns used |
|---|---|
| phonsobon/khmer-artical-summaries | content → summaries |
| phonsobon/khmer-text-summarization-v2 | article → summaries |
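Because the two datasets name their input column differently, combining them requires mapping each row to a common shape. The following is a minimal sketch of one way to do that, assuming the task prefix is prepended to the input text; `to_example`, `SOURCE_COLUMNS`, and the sample rows are hypothetical and not taken from the training code.

```python
TASK_PREFIX = "summarize: "

# Input column per dataset, as listed in the table; both use "summaries" as the target.
SOURCE_COLUMNS = {
    "phonsobon/khmer-artical-summaries": "content",
    "phonsobon/khmer-text-summarization-v2": "article",
}

def to_example(row, dataset_name):
    """Map a raw dataset row to a {"source", "target"} pair."""
    source_col = SOURCE_COLUMNS[dataset_name]
    return {
        "source": TASK_PREFIX + row[source_col],
        "target": row["summaries"],
    }

# Placeholder rows standing in for real dataset records
row_a = {"content": "article text A", "summaries": "summary A"}
row_b = {"article": "article text B", "summaries": "summary B"}

example_a = to_example(row_a, "phonsobon/khmer-artical-summaries")
example_b = to_example(row_b, "phonsobon/khmer-text-summarization-v2")
print(example_a)
print(example_b)
```

With the `datasets` library, a function like this could be applied to each loaded dataset through `Dataset.map` before concatenating them; the split names to load are not stated here and would need to be checked on each dataset page.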