---
license: mit
datasets:
- 8Opt/vietnamese-summarization-dataset-0001
language:
- vi
metrics:
- bertscore
- rouge
base_model:
- VietAI/vit5-base
pipeline_tag: summarization
library_name: transformers
---
|
|
|
|
|
# ViT5 Vietnamese Summarization |
|
|
|
|
|
Fine-tuned ViT5 model for Vietnamese text summarization. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is a fine-tuned version of ViT5 on a Vietnamese summarization dataset. It generates abstractive summaries of Vietnamese documents.
|
|
|
|
|
**Base Model:** VietAI/vit5-base |
|
|
**Task:** Abstractive Text Summarization |
|
|
**Language:** Vietnamese |
|
|
|
|
|
## Training Configuration |
|
|
|
|
|
- **max_input_length:** 1280 tokens |
|
|
- **max_output_length:** 256 tokens |
|
|
- **Training dataset:** [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001) |
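
The full fine-tuning script is not published here; the sketch below shows tokenization consistent with the lengths above, using the standard `transformers`/`datasets` APIs. The `document` and `summary` column names are assumptions about the dataset schema and may need adjusting.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")
dataset = load_dataset("8Opt/vietnamese-summarization-dataset-0001")

def preprocess(batch):
    # Truncate source documents to the 1280-token input budget.
    model_inputs = tokenizer(batch["document"], max_length=1280, truncation=True)
    # Tokenize reference summaries as labels, capped at 256 tokens.
    labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)
```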
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
# sentencepiece backs the T5-style tokenizer used by ViT5
pip install transformers torch sentencepiece
```
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "thnhan3/sft_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Use the GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# Sample Vietnamese news passage to summarize.
document = """
Ngày 16 tháng 11 năm 2025, Chính phủ Việt Nam công bố kế hoạch phát triển kinh tế số
trong giai đoạn 2025-2030. Kế hoạch tập trung vào 3 trọng tâm chính: phát triển hạ tầng
số, đào tạo nguồn nhân lực công nghệ cao, và thúc đẩy chuyển đổi số doanh nghiệp.
Mục tiêu đặt ra là đến năm 2030, kinh tế số chiếm 30% GDP và tạo ra 2 triệu việc làm mới.
"""

# Truncate inputs to the model's 1280-token budget.
inputs = tokenizer(
    document,
    max_length=1280,
    truncation=True,
    return_tensors="pt",
).to(device)

outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3,
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)
```
|
|
|
|
|
### Batch Processing |
|
|
|
|
|
```python
import torch

# Reuses `tokenizer` and `model` from the basic example above.
documents = [
    "Văn bản 1...",
    "Văn bản 2...",
    "Văn bản 3...",
]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Pad to the longest document so the batch forms a single tensor,
# and pass the attention mask so padding tokens are ignored.
inputs = tokenizer(
    documents,
    max_length=1280,
    truncation=True,
    padding=True,
    return_tensors="pt",
).to(device)

outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3,
)

summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for i, summary in enumerate(summaries):
    print(f"Summary {i+1}: {summary}")
```
|
|
|
|
|
### Optimized Inference with FP16 |
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load weights in FP16 on GPU to roughly halve memory use; keep FP32
# on CPU, where half precision is slow and can fail during generate().
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if device.type == "cuda" else torch.float32

model_name = "thnhan3/sft_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=dtype).to(device)

# `document` is the sample passage from the basic example above.
inputs = tokenizer(
    document,
    max_length=1280,
    truncation=True,
    return_tensors="pt",
).to(device)

# inference_mode() disables autograd bookkeeping for faster decoding;
# a separate autocast context is unnecessary once the weights are FP16.
with torch.inference_mode():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=256,
        num_beams=4,
        length_penalty=1.0,
        early_stopping=True,
        no_repeat_ngram_size=3,
    )

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
|
|
|
|
|
## Generation Parameters |
|
|
|
|
|
Recommended parameters for best quality: |
|
|
|
|
|
- `max_new_tokens`: 256 (matches training configuration) |
|
|
- `num_beams`: 4 (beam search for better quality) |
|
|
- `length_penalty`: 1.0 (neutral length preference) |
|
|
- `early_stopping`: True (stop when EOS token generated) |
|
|
- `no_repeat_ngram_size`: 3 (avoid repetitive phrases) |
|
|
|
|
|
You can adjust these parameters based on your needs: |
|
|
|
|
|
- Increase `num_beams` (5-8) for potentially better quality but slower generation |
|
|
- Decrease `num_beams` (2-3) for faster generation with slight quality trade-off |
|
|
- Adjust `length_penalty` (0.8-1.2) to control summary length |
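
For convenience, these settings can be bundled into a reusable `GenerationConfig`; a small sketch reusing `model` and `inputs` from the examples above:

```python
from transformers import GenerationConfig

# Recommended decoding settings collected in one reusable object.
gen_config = GenerationConfig(
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3,
)

outputs = model.generate(**inputs, generation_config=gen_config)
```

Passing one config object keeps decoding settings consistent across the single-document and batch examples.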
|
|
|
|
|
## Model Performance |
|
|
|
|
|
Evaluation on the test set of [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001) with ROUGE and BERTScore:

*Coming soon.*
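
Until numbers are posted, here is a minimal sketch of how such an evaluation could be run with the `evaluate` library. The `test` split and the `document`/`summary` column names are assumptions about the dataset, and `summarize` is a hypothetical helper wrapping the generation code above:

```python
import evaluate
from datasets import load_dataset

# Hypothetical helper wrapping the tokenizer/model.generate code above.
def summarize(text: str) -> str:
    enc = tokenizer(text, max_length=1280, truncation=True, return_tensors="pt").to(model.device)
    out = model.generate(enc.input_ids, max_new_tokens=256, num_beams=4)
    return tokenizer.decode(out[0], skip_special_tokens=True)

test_set = load_dataset("8Opt/vietnamese-summarization-dataset-0001", split="test")
predictions = [summarize(doc) for doc in test_set["document"]]  # column name assumed
references = test_set["summary"]  # column name assumed

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

bertscore = evaluate.load("bertscore")
scores = bertscore.compute(predictions=predictions, references=references, lang="vi")
print(sum(scores["f1"]) / len(scores["f1"]))  # mean BERTScore F1
```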
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Maximum input length is 1280 tokens; longer documents are truncated (see the chunking sketch below for one workaround).
- Trained on Vietnamese news and formal text; quality on informal or conversational text may be lower.
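
For inputs beyond the limit, one possible workaround (not part of the original training setup) is two-pass chunked summarization: summarize overlapping token windows, then summarize the concatenated partial summaries. A sketch reusing the hypothetical `summarize` helper from the evaluation example:

```python
def summarize_long(text: str, chunk_tokens: int = 1024, stride: int = 896) -> str:
    # Sliding token window, kept below the 1280-token cap for headroom;
    # stride < chunk_tokens makes consecutive windows overlap.
    ids = tokenizer(text, truncation=False).input_ids
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), stride)]
    partial = [summarize(tokenizer.decode(c, skip_special_tokens=True)) for c in chunks]
    if len(partial) == 1:
        return partial[0]
    # Second pass: condense the concatenated partial summaries.
    return summarize(" ".join(partial))
```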
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{vit5-vietnamese-summarization,
  author       = {Tran Huu Nhan},
  title        = {ViT5 Vietnamese Summarization},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/thnhan3/sft_model}}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
MIT |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- Base model: [VietAI/vit5-base](https://huggingface.co/VietAI/vit5-base) |
|
|
- Training dataset: [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001) |