---
license: mit
datasets:
- 8Opt/vietnamese-summarization-dataset-0001
language:
- vi
metrics:
- bertscore
- rouge
base_model:
- VietAI/vit5-base
pipeline_tag: summarization
library_name: transformers
---
|
|
|
|
|
# ViT5 Vietnamese Summarization |
|
|
|
|
|
Fine-tuned ViT5 model for Vietnamese text summarization. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is a fine-tuned version of ViT5 on a Vietnamese summarization dataset. It generates abstractive summaries of Vietnamese documents.
|
|
|
|
|
**Base Model:** VietAI/vit5-base |
|
|
**Task:** Abstractive Text Summarization |
|
|
**Language:** Vietnamese |
|
|
|
|
|
## Training Configuration |
|
|
|
|
|
- **max_input_length:** 1280 tokens |
|
|
- **max_output_length:** 256 tokens |
|
|
- **Training dataset:** [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001) |
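
The full fine-tuning script is not published here; the sketch below shows tokenization consistent with the lengths above, using the standard `transformers`/`datasets` APIs. The `document` and `summary` column names are assumptions about the dataset schema and may need adjusting.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")
dataset = load_dataset("8Opt/vietnamese-summarization-dataset-0001")

def preprocess(batch):
    # Truncate source documents to the 1280-token input budget.
    model_inputs = tokenizer(batch["document"], max_length=1280, truncation=True)
    # Tokenize reference summaries as labels, capped at 256 tokens.
    labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)
```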
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
# sentencepiece backs the T5-style tokenizer used by ViT5
pip install transformers torch sentencepiece
```
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "thnhan3/sft_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Use the GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# Sample Vietnamese news passage to summarize.
document = """
Ngày 16 tháng 11 năm 2025, Chính phủ Việt Nam công bố kế hoạch phát triển kinh tế số
trong giai đoạn 2025-2030. Kế hoạch tập trung vào 3 trọng tâm chính: phát triển hạ tầng
số, đào tạo nguồn nhân lực công nghệ cao, và thúc đẩy chuyển đổi số doanh nghiệp.
Mục tiêu đặt ra là đến năm 2030, kinh tế số chiếm 30% GDP và tạo ra 2 triệu việc làm mới.
"""

# Truncate inputs to the model's 1280-token budget.
inputs = tokenizer(
    document,
    max_length=1280,
    truncation=True,
    return_tensors="pt",
).to(device)

outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3,
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)
```
|
|
|
|
|
### Batch Processing |
|
|
|
|
|
```python
import torch

# Reuses `tokenizer` and `model` from the basic example above.
documents = [
    "Văn bản 1...",
    "Văn bản 2...",
    "Văn bản 3...",
]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Pad to the longest document so the batch forms a single tensor,
# and pass the attention mask so padding tokens are ignored.
inputs = tokenizer(
    documents,
    max_length=1280,
    truncation=True,
    padding=True,
    return_tensors="pt",
).to(device)

outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3,
)

summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for i, summary in enumerate(summaries):
    print(f"Summary {i+1}: {summary}")
```
|
|
|
|
|
### Optimized Inference with FP16 |
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load weights in FP16 on GPU to roughly halve memory use; keep FP32
# on CPU, where half precision is slow and can fail during generate().
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if device.type == "cuda" else torch.float32

model_name = "thnhan3/sft_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=dtype).to(device)

# `document` is the sample passage from the basic example above.
inputs = tokenizer(
    document,
    max_length=1280,
    truncation=True,
    return_tensors="pt",
).to(device)

# inference_mode() disables autograd bookkeeping for faster decoding;
# a separate autocast context is unnecessary once the weights are FP16.
with torch.inference_mode():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=256,
        num_beams=4,
        length_penalty=1.0,
        early_stopping=True,
        no_repeat_ngram_size=3,
    )

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
|
|
|
|
|
## Generation Parameters |
|
|
|
|
|
Recommended parameters for best quality: |
|
|
|
|
|
- `max_new_tokens`: 256 (matches training configuration) |
|
|
- `num_beams`: 4 (beam search for better quality) |
|
|
- `length_penalty`: 1.0 (neutral length preference) |
|
|
- `early_stopping`: True (stop when EOS token generated) |
|
|
- `no_repeat_ngram_size`: 3 (avoid repetitive phrases) |
|
|
|
|
|
You can adjust these parameters based on your needs: |
|
|
|
|
|
- Increase `num_beams` (5-8) for potentially better quality but slower generation |
|
|
- Decrease `num_beams` (2-3) for faster generation with slight quality trade-off |
|
|
- Adjust `length_penalty` (0.8-1.2) to control summary length |
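
For convenience, these settings can be bundled into a reusable `GenerationConfig`; a small sketch reusing `model` and `inputs` from the examples above:

```python
from transformers import GenerationConfig

# Recommended decoding settings collected in one reusable object.
gen_config = GenerationConfig(
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3,
)

outputs = model.generate(**inputs, generation_config=gen_config)
```

Passing one config object keeps decoding settings consistent across the single-document and batch examples.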
|
|
|
|
|
## Model Performance |
|
|
|
|
|
Evaluation on the test set of [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001) with ROUGE and BERTScore:

*Coming soon.*
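
Until numbers are posted, here is a minimal sketch of how such an evaluation could be run with the `evaluate` library. The `test` split and the `document`/`summary` column names are assumptions about the dataset, and `summarize` is a hypothetical helper wrapping the generation code above:

```python
import evaluate
from datasets import load_dataset

# Hypothetical helper wrapping the tokenizer/model.generate code above.
def summarize(text: str) -> str:
    enc = tokenizer(text, max_length=1280, truncation=True, return_tensors="pt").to(model.device)
    out = model.generate(enc.input_ids, max_new_tokens=256, num_beams=4)
    return tokenizer.decode(out[0], skip_special_tokens=True)

test_set = load_dataset("8Opt/vietnamese-summarization-dataset-0001", split="test")
predictions = [summarize(doc) for doc in test_set["document"]]  # column name assumed
references = test_set["summary"]  # column name assumed

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

bertscore = evaluate.load("bertscore")
scores = bertscore.compute(predictions=predictions, references=references, lang="vi")
print(sum(scores["f1"]) / len(scores["f1"]))  # mean BERTScore F1
```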
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Maximum input length is 1280 tokens; longer documents are truncated (see the chunking sketch below for one workaround).
- Trained on Vietnamese news and formal text; quality on informal or conversational text may be lower.
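
For inputs beyond the limit, one possible workaround (not part of the original training setup) is two-pass chunked summarization: summarize overlapping token windows, then summarize the concatenated partial summaries. A sketch reusing the hypothetical `summarize` helper from the evaluation example:

```python
def summarize_long(text: str, chunk_tokens: int = 1024, stride: int = 896) -> str:
    # Sliding token window, kept below the 1280-token cap for headroom;
    # stride < chunk_tokens makes consecutive windows overlap.
    ids = tokenizer(text, truncation=False).input_ids
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), stride)]
    partial = [summarize(tokenizer.decode(c, skip_special_tokens=True)) for c in chunks]
    if len(partial) == 1:
        return partial[0]
    # Second pass: condense the concatenated partial summaries.
    return summarize(" ".join(partial))
```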
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{vit5-vietnamese-summarization,
  author       = {Tran Huu Nhan},
  title        = {ViT5 Vietnamese Summarization},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/thnhan3/sft_model}}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
MIT |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- Base model: [VietAI/vit5-base](https://huggingface.co/VietAI/vit5-base) |
|
|
- Training dataset: [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001) |