	---
	library_name: transformers
	license: mit
	tags:
	- summarization
	- controllable-text-generation
	- custom-architecture
	- length-control
	- bart
	- cnn-dailymail
	- nlp
	---

	# PreBART — Progress Ratio Embeddings (PRE) for Length-Controlled Summarization

	<!-- Author: [Ivanhoé Botcazou](https://huggingface.co/Ivanhoe9) -->
	- Model ID: `Ivanhoe9/prebart-large-cnn`
	- Based on: `facebook/bart-large-cnn`

	---

	## 🧩 Model Summary

	`PreBART` is a custom architecture extending BART for length-controlled text generation.
	It introduces Progress Ratio Embeddings (PRE) — a novel mechanism that encodes decoding progress as a continuous trigonometric “impatience signal”.
	This allows the model to maintain high summarization quality while precisely following a target length provided at generation time.

	---

	## 🧠 Model Description

	Modern neural language models excel at abstractive summarization, but they lack robust control over the length of generated texts.
	Discrete countdown methods such as Reverse Positional Embeddings (RPE) often degrade when the requested length falls outside the training distribution.
	To address this, Progress Ratio Embeddings (PRE) represent the normalized decoding ratio \( r = t/l \in [0,1] \), where \( t \) is the current decoding step and \( l \) is the target length, through a continuous sinusoidal embedding that provides a smooth temporal conditioning signal.
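
As a rough illustration of the idea, the sketch below maps the progress ratio to a sinusoidal vector using the frequency schedule of standard Transformer positional encodings. The function name `progress_ratio_embedding`, the `dim` and `resolution` parameters, and the choice of frequencies are assumptions made for this example; the embedding actually implemented in `modeling_prebart.py` may differ.

```python
import math

def progress_ratio_embedding(step, target_len, dim=8, resolution=100.0):
    """Hypothetical sinusoidal embedding of the progress ratio r = step / target_len."""
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even integer")
    # Normalized progress, clamped to [0, 1].
    r = min(max(step / target_len, 0.0), 1.0)
    # Treat r * resolution as a continuous "position" (resolution is an assumed constant).
    position = r * resolution
    embedding = []
    for i in range(dim // 2):
        # Same geometric frequency schedule as standard sinusoidal positional encodings.
        angle = position / (10000.0 ** (2 * i / dim))
        embedding.extend((math.sin(angle), math.cos(angle)))
    return embedding

# The signal depends only on the ratio, not on the absolute number of tokens.
halfway_short = progress_ratio_embedding(10, 20)
halfway_long = progress_ratio_embedding(50, 100)
```

Because only the ratio enters the encoding, a decoder at step 10 of 20 and one at step 50 of 100 receive the same signal in this sketch, which is the contrast with a countdown tied to the absolute remaining token count.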

	### Key features
	- Compatible with any Transformer encoder–decoder architecture.
	- Length control via a continuous progress signal rather than discrete countdown.
	- Stable performance on both in-distribution and unseen target lengths.
	- Seamless integration with standard Hugging Face APIs.

	### Base model
	`facebook/bart-large-cnn`

	### Language(s)
	English 🇬🇧

	### Tasks
	- Abstractive summarization
	- Length-controlled text generation
	- Reformulation and controlled paraphrasing

	### License
	MIT

	---

	## 🚀 How to Use

	```python
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
	import torch

	# Load the model and tokenizer
	model = AutoModelForSeq2SeqLM.from_pretrained("Ivanhoe9/prebart-large-cnn", trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained("Ivanhoe9/prebart-large-cnn", trust_remote_code=True)

	ARTICLE_TO_SUMMARIZE = """
	Modern neural language models achieve high
	accuracy in text generation, yet precise control
	over generation length remains underdeveloped.
	In this paper, we first investigate a
	recent length control method based on Reverse
	Positional Embeddings (RPE) and show its limits
	when control is requested beyond the training
	distribution. In particular, using a discrete
	countdown signal tied to the absolute remaining
	token count leads to instability. To provide
	robust length control, we introduce Progress
	Ratio Embeddings (PRE), as continuous
	embeddings tied to a trigonometric impatience
	signal. PRE integrates seamlessly into
	standard Transformer architectures, providing
	stable length fidelity without degrading text
	accuracy under standard evaluation metrics. We
	further show that PRE generalizes well to
	unseen target lengths. Experiments on two widely
	used news-summarization benchmarks validate
	these findings.
	""".replace("\n", " ")

	inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, truncation=True, return_tensors="pt")
	```

	### Example 1 — Long summary (≈ 75 tokens)
	```python
	# Requested summary length, in tokens
	target_len = torch.tensor([75], dtype=torch.long)
	summary_ids = model.generate(
	    inputs["input_ids"],
	    target_len=target_len,
	    **model.config.task_specific_params["summarization"]
	)
	summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
	print(summary)
	```

	Output
	```
	Modern neural language models achieve high accuracy in text generation, yet precise control over generation length remains underdeveloped.
	Using a discrete countdown signal tied to the absolute remaining token count leads to instability.
	To provide robust length control, we introduce Progress Ratio Embeddings (PRE), as continuous embeddings tied to a trigonometric impatience signal.
	```

	### Example 2 — Short summary (≈ 25 tokens)
	```python
	# Requested summary length, in tokens
	target_len = torch.tensor([25], dtype=torch.long)
	summary_ids = model.generate(
	    inputs["input_ids"],
	    target_len=target_len,
	    **model.config.task_specific_params["summarization"]
	)
	summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
	print(summary)
	```

	Output
	```
	Modern neural language models achieve high accuracy in text generation.
	But precise control over generation length remains underdeveloped.
	```

	---

	## 🧪 Training Details

	### Dataset
	- CNN/DailyMail summarization dataset
	- Standard train/validation/test splits from Hugging Face Datasets

	### Procedure
	- Fine-tuned from `facebook/bart-large-cnn`
	- Objective: minimize sequence loss while conditioning on continuous progress ratios
	- Gaussian noise added to progress ratio embeddings during training for smoother interpolation
	- Standard early stopping on validation loss
	- Trained on the Jean-Zay HPC cluster (GENCI-IDRIS)
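
The noise-injection step in the list above could look roughly like the following sketch. It is only an illustration: the function name `add_gaussian_noise` and the noise scale `sigma=0.01` are hypothetical, and the training code may apply the noise differently (for example, on tensors and only in training mode).

```python
import random

def add_gaussian_noise(embedding, sigma=0.01, rng=None):
    """Return a noisy copy of `embedding`; sigma is a hypothetical noise scale."""
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    rng = rng if rng is not None else random.Random()
    # Independent zero-mean Gaussian perturbation on every component.
    return [value + rng.gauss(0.0, sigma) for value in embedding]

clean = [0.0, 1.0, 0.5, -0.5]
noisy = add_gaussian_noise(clean, sigma=0.01, rng=random.Random(0))
```

In this sketch the input list is left unchanged, and setting `sigma=0` returns the clean values, which would correspond to inference without noise.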

	### Hardware
	- 8× NVIDIA V100 32 GB (multi-GPU DDP)
	- PyTorch + Transformers 4.45.1

	---

	## 📊 Evaluation
	Evaluation on the full CNN/DailyMail test split; mean scores are reported in the table below (↑ higher is better, ↓ lower is better):

	| Metric | PreBART | BART baseline |
	|:-------|:--------|:--------------|
	| ↑ ROUGE-1 | 45.3 | 44.2 |
	| ↑ ROUGE-2 | 21.9 | 21.1 |
	| ↑ ROUGE-L | 42.2 | 40.9 |
	| ↓ MAE (token length error) | 0.5 | 19.2 |
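
For reference, the length error in the last row can be computed along the following lines, assuming MAE means the mean absolute difference between the number of generated tokens and the requested target length. The function name `length_mae` and the sample values are illustrative only and are not taken from the evaluation run; how special tokens were counted for the reported figure is not specified here.

```python
def length_mae(generated_lengths, target_lengths):
    """Mean absolute error between generated and requested lengths, in tokens."""
    if len(generated_lengths) != len(target_lengths):
        raise ValueError("both sequences must have the same number of entries")
    if not generated_lengths:
        raise ValueError("at least one pair of lengths is required")
    # Absolute deviation per summary, averaged over the evaluation set.
    total_error = sum(abs(g - t) for g, t in zip(generated_lengths, target_lengths))
    return total_error / len(generated_lengths)

# Illustrative values: four summaries with their requested lengths.
example_mae = length_mae([73, 25, 51, 100], [75, 25, 50, 100])
```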

	---

	## ⚠️ Bias, Risks, and Limitations

	- The model inherits dataset biases from the CNN/DailyMail news corpus.
	- Length control is designed for summarization tasks — not for unrestricted story generation.
	- Generated summaries may reflect stylistic bias from the news domain.
	- No guarantees on factual accuracy; always verify outputs for factual tasks.

	---

	## 🌍 Environmental Impact (Approx.)

	| Resource | Details |
	|:---------|:--------|
	| Hardware | 8× V100 32 GB (Jean-Zay HPC) |
	| Training Duration | ≈ 36 GPU hours |
	| Framework | PyTorch + Transformers |
	| Estimated CO₂ emissions | ~ 39 kg CO₂eq |

	---

	## 🧩 Technical Specifications

	- Architecture: Encoder-decoder Transformer (BART variant)
	- New modules: Progress Ratio Embeddings (PRE) — sinusoidal encoding of decoding progress
	- Model size: 406 M parameters
	- Base model: `facebook/bart-large-cnn`
	- Custom files:
	  - `configuration_prebart.py`
	  - `modeling_prebart.py`

	---

	<!-- ## 📚 Citation


	BibTeX
	```bibtex
	@misc{botcazou2025prebart,
	title = {PreBART: Progress Ratio Embeddings for Robust Length-Controlled Summarization},
	author = {Ivanhoé Botcazou and Tassadit Amghar and Sylvain Lamprier and Frederic Saubion},
	year = {2025},
	howpublished = {\url{https://huggingface.co/Ivanhoe9/prebart-large-cnn}}
	}
	```

	---

	## 📬 Contact

	- Author: Ivanhoé Botcazou
	- Affiliation: LERIA - University of Angers - France
	- Hugging Face: [Ivanhoe9](https://huggingface.co/Ivanhoe9)

	---

	### ✨ Acknowledgements
	Thanks to the Jean-Zay (IDRIS–GENCI) compute resources (2025–AD011016042) and open-source contributors from the Hugging Face community.

	--- -->