---
license: mit
datasets:
- 8Opt/vietnamese-summarization-dataset-0001
language:
- vi
metrics:
- bertscore
- rouge
base_model:
- VietAI/vit5-base
pipeline_tag: summarization
library_name: transformers
---
# ViT5 Vietnamese Summarization
Fine-tuned ViT5 model for Vietnamese text summarization.
## Model Description
This model is a fine-tuned version of ViT5 trained on a Vietnamese summarization dataset. It produces unified extractive/abstractive summaries of Vietnamese documents.
- **Base Model:** VietAI/vit5-base
- **Task:** Abstractive Text Summarization
- **Language:** Vietnamese
## Training Configuration
- **max_input_length:** 1280 tokens
- **max_output_length:** 256 tokens
- **Training dataset:** [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001)
## Usage
### Installation
```bash
pip install transformers torch
```
### Basic Usage
```python
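import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "thnhan3/sft_model"
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
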
document = """
Ngày 16 tháng 11 năm 2025, Chính phủ Việt Nam công bố kế hoạch phát triển kinh tế số
trong giai đoạn 2025-2030. Kế hoạch tập trung vào 3 trọng tâm chính: phát triển hạ tầng
số, đào tạo nguồn nhân lực công nghệ cao, và thúc đẩy chuyển đổi số doanh nghiệp.
Mục tiêu đặt ra là đến năm 2030, kinh tế số chiếm 30% GDP và tạo ra 2 triệu việc làm mới.
"""
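
# Tokenize the document; input beyond 1280 tokens is truncated
inputs = tokenizer(
    document,
    max_length=1280,
    truncation=True,
    return_tensors="pt"
).to(model.device)  # keep the inputs on the same device as the model

# Beam search with the recommended generation settings
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)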
```
### Batch Processing
```python
import torch

documents = [
    "Văn bản 1...",
    "Văn bản 2...",
    "Văn bản 3...",
]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Pad to the longest document in the batch so the tensors are rectangular
inputs = tokenizer(
    documents,
    max_length=1280,
    truncation=True,
    padding=True,
    return_tensors="pt"
).to(device)

# Pass the attention mask so padding tokens are ignored
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3
)

summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for i, summary in enumerate(summaries):
    print(f"Summary {i+1}: {summary}")
```
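When there are more documents than fit in memory at once, the same call can be applied to fixed-size slices of the list. The sketch below is a minimal illustration: `iter_batches`, `summarize_all`, and the `summarize_batch` callback are hypothetical names, not part of transformers, and the default batch size of 8 is an arbitrary assumption to tune for your hardware.

```python
from typing import Callable, Iterator, List


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def summarize_all(
    documents: List[str],
    summarize_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumption: tune to the available memory
) -> List[str]:
    """Apply `summarize_batch` to each mini-batch, preserving input order."""
    summaries: List[str] = []
    for batch in iter_batches(documents, batch_size):
        summaries.extend(summarize_batch(batch))
    return summaries
```

Here `summarize_batch` would wrap the tokenizer, `model.generate`, and `batch_decode` calls from the block above.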
### Optimized Inference with FP16
```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
use_fp16 = device.type == "cuda"
if use_fp16:
    model = model.half()  # convert weights to FP16; only worthwhile on a GPU

with torch.inference_mode():  # skip autograd bookkeeping during generation
    inputs = tokenizer(
        document,
        max_length=1280,
        truncation=True,
        return_tensors="pt"
    ).to(device)
    # Autocast is enabled only when running on CUDA
    with torch.amp.autocast("cuda", enabled=use_fp16):
        outputs = model.generate(
            inputs.input_ids,
            max_new_tokens=256,
            num_beams=4,
            length_penalty=1.0,
            early_stopping=True,
            no_repeat_ngram_size=3
        )

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
## Generation Parameters
Recommended parameters for best quality:
- `max_new_tokens`: 256 (matches training configuration)
- `num_beams`: 4 (beam search for better quality)
- `length_penalty`: 1.0 (the library default; beam scores are normalized by sequence length)
- `early_stopping`: True (stop beam search as soon as `num_beams` finished candidates exist)
- `no_repeat_ngram_size`: 3 (avoid repetitive phrases)
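Because the same settings recur in every `generate` call, they can be kept in one place. This is a minimal sketch; `RECOMMENDED_GENERATION_KWARGS` and `generation_kwargs` are hypothetical helper names introduced here for illustration.

```python
# Recommended settings from the list above, collected in a single dict.
RECOMMENDED_GENERATION_KWARGS = {
    "max_new_tokens": 256,
    "num_beams": 4,
    "length_penalty": 1.0,
    "early_stopping": True,
    "no_repeat_ngram_size": 3,
}


def generation_kwargs(**overrides):
    """Return a fresh copy of the recommended settings with optional overrides."""
    return {**RECOMMENDED_GENERATION_KWARGS, **overrides}
```

A call could then look like `model.generate(**inputs, **generation_kwargs(num_beams=2))`, where `inputs` is the tokenizer output.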
You can adjust these parameters based on your needs:
- Increase `num_beams` (5-8) for potentially better quality at the cost of slower generation
- Decrease `num_beams` (2-3) for faster generation with a slight quality trade-off
- Adjust `length_penalty` (0.8-1.2) to control summary length: higher values favor longer summaries, lower values favor shorter ones
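To see why `length_penalty` shifts summary length: in transformers beam search, a finished hypothesis is ranked by the sum of its token log-probabilities divided by its length raised to the power `length_penalty`. The sketch below applies that formula to two made-up candidates. The log-probability totals and lengths are illustrative assumptions, not outputs of this model, and `beam_score` is a hypothetical helper name.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score: sum_logprobs / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Hypothetical candidates: (total log-probability, number of generated tokens).
short_hyp = (-6.0, 10)
long_hyp = (-10.0, 20)

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_hyp, penalty)
    long_score = beam_score(*long_hyp, penalty)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.3f} "
          f"long={long_score:.3f} -> {winner}")
```

With no normalization (`0.0`) the shorter candidate wins on raw log-probability; at `1.0` and above, the longer candidate's better per-token score puts it ahead.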
## Model Performance
Evaluated on the test set of [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001):
Results coming soon.
## Limitations
- Maximum input length: 1280 tokens. Longer documents will be truncated.
- Trained on Vietnamese news/formal text.
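Since input beyond 1280 tokens is dropped, one common workaround is to split a long document into pieces that each fit the limit, summarize every piece, and join the results. The sketch below is an illustration under stated assumptions, not a method evaluated for this model: `split_by_token_budget` is a hypothetical helper, it splits sentences naively on ". ", and `count_tokens` defaults to whitespace word counting as a stand-in for the real tokenizer.

```python
from typing import Callable, List


def split_by_token_budget(
    text: str,
    max_tokens: int = 1280,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack sentences into chunks of at most `max_tokens` tokens.

    A single sentence longer than the budget becomes its own chunk and
    would still be truncated by the tokenizer.
    """
    # Naive sentence split on ". "; adequate for a sketch only.
    sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        candidate = ". ".join(current + [sentence])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(". ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        chunks.append(". ".join(current))
    return chunks
```

With the real tokenizer, `count_tokens` could be `lambda s: len(tokenizer(s).input_ids)`. Summaries of separate chunks may repeat or miss cross-chunk context, so the joined result should be checked.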
## Citation
If you use this model, please cite:
```bibtex
@misc{vit5-vietnamese-summarization,
author = {Tran Huu Nhan},
title = {ViT5 Vietnamese Summarization},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/thnhan3/sft_model}}
}
```
## License
MIT
## Acknowledgments
- Base model: [VietAI/vit5-base](https://huggingface.co/VietAI/vit5-base)
- Training dataset: [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001)