Khmer Text Summarization — mT5

An abstractive summarization model for the Khmer language, fine-tuned from google/mt5-base on two Khmer news datasets.

Note: Khmer has no spaces between words.
The mT5 SentencePiece tokenizer handles all subword segmentation automatically —
do not apply any word-splitting pre-processing.

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import re
import unicodedata

tokenizer = AutoTokenizer.from_pretrained("phonsobon/khmer-text-summarization", use_fast=False)
model     = AutoModelForSeq2SeqLM.from_pretrained("phonsobon/khmer-text-summarization")
model.eval()

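def clean_khmer(text):
    """Normalize Unicode, strip HTML tags and URLs, and collapse spaces and tabs."""
    text = unicodedata.normalize("NFC", text)            # canonical composition
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)    # drop HTML tags and URLs
    text = re.sub(r"[ \t]+", " ", text)                  # collapse runs of spaces/tabs (newlines are kept)
    return text.strip()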

article = "បញ្ចូលអត្ថបទខ្មែររបស់អ្នកនៅទីនេះ ..."   # your Khmer article

inputs = tokenizer(
    "summarize: " + clean_khmer(article),
    return_tensors="pt",
    max_length=512,
    truncation=True,
)

output_ids = model.generate(**inputs)   # generation settings come from the generation_config saved with the model
summary    = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(summary)
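
To make the cleaning step concrete, the sketch below repeats `clean_khmer` from the usage snippet and applies it to a small made-up input (the sample string and the `example.com` URL are illustrative only). It needs only the standard library, so it can be run without downloading the model.

```python
import re
import unicodedata

def clean_khmer(text):
    # Same steps as the usage snippet: NFC, strip tags/URLs, collapse spaces and tabs.
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

# Illustrative input: an HTML wrapper, a run of spaces, and a trailing URL.
raw = "<p>សួស្តី   ពិភពលោក</p> https://example.com/news"
cleaned = clean_khmer(raw)
print(cleaned)
```

The tags and the URL are replaced by spaces, and the resulting runs of spaces collapse to one. No word boundaries are inserted, which is consistent with leaving segmentation to the tokenizer.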

Evaluation — Validation set

Metric	Score
—	—

Evaluation — Test set

Metric	Score
—	—

Training details

Setting	Value
Base model	`google/mt5-base`
Fine-tuning method	LoRA (merged)
Task prefix	`summarize:`
Max input length	512 tokens
Max target length	128 tokens
Epochs	10
Learning rate	0.0002
Beam search	4 beams
No-repeat n-gram	3
Training date	2026-06-20T09:20:50.974517
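
The beam-search and length settings in the table can also be passed explicitly at inference time. The sketch below assumes the table values map onto the `num_beams`, `no_repeat_ngram_size`, and `max_length` arguments of `generate`, and that the 128-token target length is also a sensible generation limit. `GENERATION_KWARGS` and `summarize` are hypothetical names, not part of the model repository.

```python
# Values copied from the "Training details" table.
# Assumption: the generation config saved with the model uses the same settings.
GENERATION_KWARGS = {
    "num_beams": 4,
    "no_repeat_ngram_size": 3,
    "max_length": 128,  # assumed: reuse the max target length as the generation limit
}

def summarize(model, tokenizer, text, **overrides):
    """Summarize one article; keyword overrides replace the defaults above."""
    inputs = tokenizer(
        "summarize: " + text,   # task prefix from the table
        return_tensors="pt",
        max_length=512,         # max input length; longer articles are truncated
        truncation=True,
    )
    kwargs = {**GENERATION_KWARGS, **overrides}
    output_ids = model.generate(**inputs, **kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With `model` and `tokenizer` loaded as in the Usage section, `summarize(model, tokenizer, article, num_beams=2)` would trade some quality for speed while keeping the other settings.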

Datasets

Dataset	Columns used
`phonsobon/khmer-artical-summaries`	`content` → `summaries`
`phonsobon/khmer-text-summarization-v2`	`article` → `summaries`
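
Because the two datasets name their input column differently, rows need to be mapped to a common schema before they can be mixed. The sketch below shows one way to do that with plain dictionaries standing in for dataset rows. `SOURCE_COLUMNS`, `to_pair`, and `unify` are hypothetical helpers, and the actual preprocessing used for training is not documented here.

```python
# Input column per dataset, taken from the table above.
SOURCE_COLUMNS = {
    "phonsobon/khmer-artical-summaries": "content",
    "phonsobon/khmer-text-summarization-v2": "article",
}
TARGET_COLUMN = "summaries"  # shared by both datasets

def to_pair(row, source_column):
    """Map one dataset row to a common {"source", "target"} record."""
    return {"source": row[source_column], "target": row[TARGET_COLUMN]}

def unify(rows_by_dataset):
    """Flatten rows from several datasets into one list of source/target pairs."""
    pairs = []
    for name, rows in rows_by_dataset.items():
        column = SOURCE_COLUMNS[name]
        pairs.extend(to_pair(row, column) for row in rows)
    return pairs

# Placeholder rows, not real data from either dataset.
example = unify({
    "phonsobon/khmer-artical-summaries": [{"content": "A1", "summaries": "S1"}],
    "phonsobon/khmer-text-summarization-v2": [{"article": "A2", "summaries": "S2"}],
})
```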

Limitations

Optimised for Khmer-language news articles.
ROUGE scores are computed at the character level because no Khmer word segmenter is used; treat them as a relative measure for comparing models, not as an absolute measure of quality.
The model may struggle with very short or colloquial Khmer text that falls outside the training distribution.
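
To illustrate what character-level ROUGE means, here is a minimal sketch that treats each non-whitespace code point as one token and computes ROUGE-1 and ROUGE-L F1. It is not the evaluation script used for this model, which is not shown here, and the real implementation may tokenize or aggregate differently. All function names are illustrative.

```python
from collections import Counter

def char_tokens(text):
    """Split text into single characters (code points), dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]

def _f1(matches, ref_len, cand_len):
    """F1 from a match count and the two sequence lengths."""
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 over characters."""
    ref, cand = Counter(char_tokens(reference)), Counter(char_tokens(candidate))
    overlap = sum((ref & cand).values())  # clipped character matches
    return _f1(overlap, sum(ref.values()), sum(cand.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rougeL_f1(reference, candidate):
    """Longest-common-subsequence F1 over characters."""
    ref, cand = char_tokens(reference), char_tokens(candidate)
    return _f1(lcs_length(ref, cand), len(ref), len(cand))
```

Since Khmer vowel signs and subscript markers are separate code points, scores computed this way reward partial character overlap and tend to run higher than word-level ROUGE, which is why they are best read as relative.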

Model size: 0.6B params (Safetensors, tensor type F32)
