T5 Indonesian Summarization (Augmented 3x)
A T5-base model fine-tuned with 3x data augmentation (utterance-level paraphrasing) to summarize Indonesian-language conversations. Training used 5-fold cross-validation.
Model Details
- Base Model: cahya/t5-base-indonesian-summarization-cased
- Architecture: T5-base (encoder-decoder, 12 layers, 768 hidden, 12 heads)
- Parameters: ~220M
- Language: Indonesian (Bahasa Indonesia)
- Task: Abstractive Summarization of Indonesian Conversations
- Training: 5-Fold Cross Validation
- Available Folds: all 5 folds are available as branches (fold_0 through fold_4). The main branch contains fold 3 (best performance).
Usage
from transformers import T5Tokenizer, T5ForConditionalGeneration
# Load the model and tokenizer
model_name = "aloisiusedwin/t5-id-summarization-augmented3x"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
# Example conversation (the input starts with the "summarize: " prefix)
conversation = "summarize: S1: Halo, gimana kabarmu? S2: Baik, aku lagi sibuk ngerjain tugas nih."
# Generate the summary
inputs = tokenizer(conversation, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    inputs["input_ids"],
    max_length=150,
    num_beams=1,  # greedy decoding
    no_repeat_ngram_size=2,
    early_stopping=True,  # only takes effect when num_beams > 1
)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)
Loading Specific Fold
# Load a specific fold (e.g. fold_0)
model = T5ForConditionalGeneration.from_pretrained(model_name, revision="fold_0")
tokenizer = T5Tokenizer.from_pretrained(model_name, revision="fold_0")
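To compare the folds on the same input, the per-branch loading above can be wrapped in a small helper. This is a minimal sketch: the function names `fold_revisions` and `summarize_with_fold` are hypothetical, and the branch names are assumed to follow the fold_0 through fold_4 pattern listed under Model Details.

```python
from typing import List

MODEL_NAME = "aloisiusedwin/t5-id-summarization-augmented3x"


def fold_revisions(n_folds: int = 5) -> List[str]:
    """Return the branch name for each fold: fold_0 ... fold_{n_folds - 1}."""
    return [f"fold_{i}" for i in range(n_folds)]


def summarize_with_fold(text: str, revision: str) -> str:
    """Load one fold's checkpoint and summarize `text` (downloads the weights)."""
    # Imported lazily so fold_revisions() works without transformers installed.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME, revision=revision)
    model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, revision=revision)
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=150)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Calling `summarize_with_fold(conversation, rev)` for each `rev` in `fold_revisions()` loads five separate checkpoints, so expect it to be slow on first use while the weights download.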
Training Details
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 5e-5 |
| Batch Size | 8 |
| Epochs | 10 |
| Early Stopping Patience | 2 |
| Weight Decay | 0.01 |
| Label Smoothing | 0.1 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.10 |
| Max Grad Norm | 1.0 |
| Max Input Length | 512 |
| Max Target Length | 128 |
| FP16 | True |
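For readers who want to reproduce a similar setup, the table can be expressed as keyword arguments for the Hugging Face `Seq2SeqTrainingArguments` class. This is a sketch under assumptions, not the author's training script: the card does not say which trainer was used, whether "Batch Size" is per device, or which metric drove early stopping, and argument names can differ between transformers versions.

```python
# Hyperparameters from the table, keyed by TrainingArguments field names.
TRAINING_KWARGS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,  # assumption: "Batch Size" is per device
    "num_train_epochs": 10,
    "weight_decay": 0.01,
    "label_smoothing_factor": 0.1,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.10,
    "max_grad_norm": 1.0,
    "fp16": True,  # needs a CUDA GPU
}

# Values that are not TrainingArguments fields.
EARLY_STOPPING_PATIENCE = 2  # e.g. EarlyStoppingCallback(early_stopping_patience=2)
MAX_INPUT_LENGTH = 512       # tokenizer truncation length for conversations
MAX_TARGET_LENGTH = 128      # tokenizer truncation length for summaries


def build_training_args(output_dir: str):
    """Create Seq2SeqTrainingArguments from the table (requires transformers)."""
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

Early stopping with `EarlyStoppingCallback` additionally requires periodic evaluation and `load_best_model_at_end=True`; those settings are not listed in the table and are left out here.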
Data Augmentation
The dataset was expanded 3x using utterance-level paraphrasing with the Wikidepia/IndoT5-base-paraphrase model. Each conversation was paraphrased sentence by sentence.
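The idea of utterance-level paraphrasing can be sketched as follows. Several details here are assumptions rather than the author's pipeline: utterances are taken to be speaker turns tagged `S1:`, `S2:`, and so on (as in the usage example), "3x" is taken to mean the original plus two paraphrased copies, and the paraphraser is passed in as a plain callable instead of calling Wikidepia/IndoT5-base-paraphrase directly.

```python
import re
from typing import Callable, List, Tuple

# Assumption: speaker turns are marked "S1:", "S2:", ... as in the usage example.
SPEAKER_TAG = re.compile(r"(S\d+):")


def split_utterances(conversation: str) -> List[Tuple[str, str]]:
    """Split a conversation into (speaker, utterance) pairs."""
    parts = SPEAKER_TAG.split(conversation)
    # parts looks like [text_before_first_tag, "S1", " text", "S2", " text", ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


def paraphrase_conversation(conversation: str, paraphrase: Callable[[str], str]) -> str:
    """Paraphrase each utterance separately, keeping speaker tags and order."""
    turns = split_utterances(conversation)
    return " ".join(f"{speaker}: {paraphrase(text)}" for speaker, text in turns)


def augment(conversation: str, paraphrase: Callable[[str], str], copies: int = 2) -> List[str]:
    """Return the original conversation plus `copies` paraphrased variants."""
    variants = [paraphrase_conversation(conversation, paraphrase) for _ in range(copies)]
    return [conversation] + variants
```

With a sampling-based paraphraser each copy would differ; with a deterministic one the copies are identical, so a real pipeline would likely need sampling or deduplication.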
Evaluation Results
Per-Fold Results
| Fold | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore F1 | Eval Loss |
|---|---|---|---|---|---|
| 0 | 28.63 | 9.66 | 24.54 | 0.7310 | 4.6597 |
| 1 | 28.39 | 9.54 | 24.08 | 0.7314 | 4.6101 |
| 2 | 24.83 | 8.12 | 22.21 | 0.7201 | 4.6530 |
| 3 (best) | 28.80 | 10.38 | 26.07 | 0.7339 | 4.5295 |
| 4 | 27.49 | 8.85 | 24.57 | 0.7295 | 4.5674 |
(best) = best fold (used as the main branch)
Aggregated (5-Fold Cross Validation)
| Metric | Mean | Std |
|---|---|---|
| ROUGE-1 | 27.63 | 1.47 |
| ROUGE-2 | 9.31 | 0.77 |
| ROUGE-L | 24.29 | 1.24 |
| BERTScore F1 | 0.7292 | 0.0048 |
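The aggregated figures can be recomputed from the per-fold table. The sketch below uses the population standard deviation (`statistics.pstdev`), which matches the Std column when rounded; using the sample standard deviation instead would give larger values.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# Per-fold scores copied from the Per-Fold Results table (folds 0 to 4).
PER_FOLD: Dict[str, List[float]] = {
    "ROUGE-1": [28.63, 28.39, 24.83, 28.80, 27.49],
    "ROUGE-2": [9.66, 9.54, 8.12, 10.38, 8.85],
    "ROUGE-L": [24.54, 24.08, 22.21, 26.07, 24.57],
    "BERTScore F1": [0.7310, 0.7314, 0.7201, 0.7339, 0.7295],
}


def aggregate(scores: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Return (mean, population std) for each metric."""
    return {name: (mean(vals), pstdev(vals)) for name, vals in scores.items()}


for name, (mu, sigma) in aggregate(PER_FOLD).items():
    digits = 4 if name.startswith("BERTScore") else 2
    print(f"{name}: mean={mu:.{digits}f} std={sigma:.{digits}f}")
```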
Comparison with Baseline
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore F1 |
|---|---|---|---|---|
| Baseline (pretrained) | 15.92 | 4.40 | 13.12 | 0.6626 |
| T5 Indonesian Summarization (Augmented 3x), 5-fold mean | 27.63 | 9.31 | 24.29 | 0.7292 |
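The size of the improvement over the baseline can be worked out directly from the two rows. This is a small illustrative calculation; the `gains` helper is hypothetical and simply reports the absolute difference and the relative change in percent for each metric.

```python
from typing import Dict, Tuple

# Rows copied from the comparison table.
BASELINE = {"ROUGE-1": 15.92, "ROUGE-2": 4.40, "ROUGE-L": 13.12, "BERTScore F1": 0.6626}
AUGMENTED = {"ROUGE-1": 27.63, "ROUGE-2": 9.31, "ROUGE-L": 24.29, "BERTScore F1": 0.7292}


def gains(baseline: Dict[str, float], model: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Return (absolute gain, relative gain in percent) per metric."""
    result = {}
    for metric, base in baseline.items():
        delta = model[metric] - base
        result[metric] = (delta, 100.0 * delta / base)
    return result


for metric, (delta, pct) in gains(BASELINE, AUGMENTED).items():
    print(f"{metric}: +{delta:.4f} ({pct:.1f}% relative)")
```

Relative gains on ROUGE and BERTScore are not directly comparable, since BERTScore values tend to occupy a much narrower range.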
Intended Use
This model is designed to summarize conversations in Indonesian.
Limitations
- Input must start with the "summarize:" prefix for best results.
- Maximum input length is 512 tokens.
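Both limitations can be guarded against before calling the model. The helpers below are a hypothetical sketch: `with_prefix` adds the prefix only when it is missing, and `exceeds_limit` accepts any tokenizer-like callable that returns a mapping with an `input_ids` list, which is how Hugging Face tokenizers behave when called on a single string.

```python
from typing import Any, Callable, Mapping

PREFIX = "summarize: "
MAX_INPUT_TOKENS = 512


def with_prefix(text: str) -> str:
    """Ensure the input starts with the task prefix, without duplicating it."""
    stripped = text.strip()
    if stripped.lower().startswith("summarize:"):
        return stripped
    return PREFIX + stripped


def exceeds_limit(
    text: str,
    tokenizer: Callable[[str], Mapping[str, Any]],
    max_tokens: int = MAX_INPUT_TOKENS,
) -> bool:
    """Return True if the tokenized text is longer than max_tokens.

    Inputs over the limit are cut off when truncation=True is passed to the
    tokenizer, so the end of a long conversation would not reach the model.
    """
    return len(tokenizer(text)["input_ids"]) > max_tokens
```

A typical call would be `exceeds_limit(with_prefix(conversation), tokenizer)` using the tokenizer loaded in the Usage section.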
Citation
@thesis{edwin2026summarization,
title={Pengaruh Augmentasi Data terhadap Kualitas Ringkasan Percakapan Bahasa Indonesia menggunakan T5},
author={Aloisius Edwin},
year={2026},
school={Institut Teknologi Sumatera}
}