---
library_name: transformers
license: mit
tags:
- summarization
- controllable-text-generation
- custom-architecture
- length-control
- bart
- cnn-dailymail
- nlp
---

# PreBART — Progress Ratio Embeddings (PRE) for Length-Controlled Summarization

<!-- **Author:** [Ivanhoé Botcazou](https://huggingface.co/Ivanhoe9) -->

**Model ID:** `Ivanhoe9/prebart-large-cnn`

**Based on:** `facebook/bart-large-cnn`

---

## 🧩 Model Summary

`PreBART` is a custom architecture extending **BART** for **length-controlled text generation**.
It introduces **Progress Ratio Embeddings (PRE)** — a novel mechanism that encodes decoding progress as a continuous trigonometric “impatience signal”.
This allows the model to maintain **high summarization quality** while precisely following a **target length** provided at generation time.

---

## 🧠 Model Description

Modern neural language models excel at abstractive summarization, but they lack robust control over the **length** of generated texts.
Discrete countdown methods such as **Reverse Positional Embeddings (RPE)** often degrade beyond the training distribution.
To address this, **Progress Ratio Embeddings (PRE)** represent the normalized decoding ratio \( r = t/l \in [0,1] \), where \( t \) is the current decoding step and \( l \) is the target length, as a continuous sinusoidal embedding, providing a smooth temporal conditioning signal.
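
To make this concrete, here is a minimal sketch of how a progress ratio could be mapped to a sinusoidal embedding. It assumes PRE reuses the standard Transformer sinusoidal formulation with the continuous ratio \( r \) in place of an integer position; the function name, frequency schedule, and dimensionality below are illustrative, and the actual implementation lives in `modeling_prebart.py`.

```python
import torch

def progress_ratio_embedding(t: torch.Tensor, target_len: torch.Tensor, d_model: int = 1024) -> torch.Tensor:
    """Illustrative PRE: encode the decoding progress r = t / l as a sinusoidal vector.

    t          -- current decoding step(s), shape (batch,)
    target_len -- requested output length(s) l, shape (batch,)
    """
    r = t.float() / target_len.float()                        # progress ratio in [0, 1]
    i = torch.arange(d_model // 2, dtype=torch.float32)       # one frequency per (sin, cos) pair
    freqs = 1.0 / (10000.0 ** (2.0 * i / d_model))            # standard positional-encoding frequencies (assumed)
    angles = r.unsqueeze(-1) * freqs                          # (batch, d_model // 2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (batch, d_model)
```

A natural integration point, assumed here rather than taken from the released code, is to add this vector to the decoder inputs at every step, so the decoder is always conditioned on how far it has progressed toward the requested length.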

### Key features

- Compatible with any Transformer encoder–decoder architecture.
- Length control via a continuous progress signal rather than discrete countdown.
- Stable performance on both in-distribution and unseen target lengths.
- Seamless integration with standard Hugging Face APIs.

### Base model

`facebook/bart-large-cnn`

### Language(s)

English 🇬🇧

### Tasks

- Abstractive summarization
- Length-controlled text generation
- Reformulation and controlled paraphrasing

### License

MIT

---

## 🚀 How to Use

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("Ivanhoe9/prebart-large-cnn", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Ivanhoe9/prebart-large-cnn", trust_remote_code=True)

ARTICLE_TO_SUMMARIZE = """
Modern neural language models achieve high
accuracy in text generation, yet precise control
over generation length remains underdeveloped.
In this paper, we first investigate a
recent length control method based on Reverse
Positional Embeddings (RPE) and show its limits
when control is requested beyond the training
distribution. In particular, using a discrete
countdown signal tied to the absolute remaining
token count leads to instability. To provide
robust length control, we introduce Progress
Ratio Embeddings (PRE), as continuous
embeddings tied to a trigonometric impatience
signal. PRE integrates seamlessly into
standard Transformer architectures, providing
stable length fidelity without degrading text
accuracy under standard evaluation metrics. We
further show that PRE generalizes well to
unseen target lengths. Experiments on two widely
used news-summarization benchmarks validate
these findings.
""".replace("\n", " ")

inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, truncation=True, return_tensors="pt")
```

### Example 1 — Long summary (≈ 75 tokens)

```python
target_len = torch.tensor([75], dtype=torch.long)

summary_ids = model.generate(
    inputs["input_ids"],
    target_len=target_len,
    **model.config.task_specific_params["summarization"]
)

summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
print(summary)
```

**Output**

```
Modern neural language models achieve high accuracy in text generation, yet precise control over generation length remains underdeveloped.
Using a discrete countdown signal tied to the absolute remaining token count leads to instability.
To provide robust length control, we introduce Progress Ratio Embeddings (PRE), as continuous embeddings tied to a trigonometric impatience signal.
```

### Example 2 — Short summary (≈ 25 tokens)

```python
target_len = torch.tensor([25], dtype=torch.long)

summary_ids = model.generate(
    inputs["input_ids"],
    target_len=target_len,
    **model.config.task_specific_params["summarization"]
)

summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
print(summary)
```

**Output**

```
Modern neural language models achieve high accuracy in text generation.
But precise control over generation length remains underdeveloped.
```

---

## 🧪 Training Details

### Dataset

- **CNN/DailyMail** summarization dataset
- Standard train/validation/test splits from Hugging Face Datasets

### Procedure

- Fine-tuned from `facebook/bart-large-cnn`
- Objective: minimize sequence loss while conditioning on continuous progress ratios
- Gaussian noise added to the progress ratio embeddings during training for smoother interpolation (see the sketch after this list)
- Standard early stopping on validation loss
- Training on the **Jean-Zay** HPC cluster (GENCI-IDRIS)
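
As a rough illustration of the noise step above, the sketch below perturbs a PRE vector during training; the function name and standard deviation are illustrative assumptions, not values taken from the training recipe.

```python
import torch

def add_pre_noise(pre_embedding: torch.Tensor, std: float = 0.05, training: bool = True) -> torch.Tensor:
    """Illustrative training-time Gaussian perturbation of a Progress Ratio Embedding.

    pre_embedding -- PRE vector(s) for the current decoding step, shape (batch, d_model)
    std           -- noise scale; an assumed value, not from the paper
    """
    if training:
        pre_embedding = pre_embedding + std * torch.randn_like(pre_embedding)
    return pre_embedding
```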

### Hardware

- 8× NVIDIA V100 32 GB (multi-GPU DDP)
- PyTorch + Transformers 4.45.1

---

## 📊 Evaluation

Evaluation on the full CNN/DailyMail test split; mean scores are reported in the table below:

| Metric | PreBART | BART baseline |
|:-------|:--------|:--------------|
| ↑ ROUGE-1 | 45.3 | 44.2 |
| ↑ ROUGE-2 | 21.9 | 21.1 |
| ↑ ROUGE-L | 42.2 | 40.9 |
| ↓ MAE (token length error) | 0.5 | 19.2 |
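
Here, MAE is read as the mean absolute difference, in tokens, between the length of the generated summary and the requested target length; the helper below is an illustrative sketch of that computation, not the actual evaluation script.

```python
def length_mae(generated_lengths: list[int], target_lengths: list[int]) -> float:
    """Mean absolute error between generated and requested summary lengths, in tokens."""
    assert len(generated_lengths) == len(target_lengths) and target_lengths
    return sum(abs(g - t) for g, t in zip(generated_lengths, target_lengths)) / len(target_lengths)
```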

---

## ⚠️ Bias, Risks, and Limitations

- The model inherits dataset biases from the **CNN/DailyMail** news corpus.
- Length control is designed for summarization tasks, not for unrestricted story generation.
- Generated summaries may reflect stylistic bias from the news domain.
- No guarantees on factual accuracy; always verify outputs for factual tasks.

---

## 🌍 Environmental Impact (Approx.)

| Resource | Details |
|:---------|:--------|
| Hardware | 8× V100 32 GB (Jean-Zay HPC) |
| Training Duration | ≈ 36 GPU hours |
| Framework | PyTorch + Transformers |
| Estimated CO₂ emissions | ~ 39 kg CO₂eq |

---

## 🧩 Technical Specifications

- **Architecture:** Encoder-decoder Transformer (BART variant)
- **New modules:** Progress Ratio Embeddings (PRE) — sinusoidal encoding of decoding progress
- **Model size:** 406 M parameters
- **Base model:** `facebook/bart-large-cnn`
- **Custom files:**
  - `configuration_prebart.py`
  - `modeling_prebart.py`

---

<!-- ## 📚 Citation

**BibTeX**

```bibtex
@misc{botcazou2025prebart,
  title        = {PreBART: Progress Ratio Embeddings for Robust Length-Controlled Summarization},
  author       = {Ivanhoé Botcazou and Tassadit Amghar and Sylvain Lamprier and Frederic Saubion},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Ivanhoe9/prebart-large-cnn}}
}
```

---

## 📬 Contact

- **Author:** Ivanhoé Botcazou
- **Affiliation:** LERIA - University of Angers - France
- **Hugging Face:** [Ivanhoe9](https://huggingface.co/Ivanhoe9)

---

### ✨ Acknowledgements

Thanks to the **Jean-Zay (IDRIS–GENCI)** compute resources (**2025–AD011016042**) and open-source contributors from the Hugging Face community.

--- -->