TL;DR-Sci: Extreme Summarization of Scientific Papers
A LoRA adapter for T5-base that compresses scientific paper abstracts (150-300 words) into single-sentence TLDRs (15-25 words). Trained on the SciTLDR dataset using Parameter-Efficient Fine-Tuning.
Model Details
- Base model: t5-base (223M parameters)
- PEFT method: LoRA (r=16, α=32, dropout=0.05)
- Target modules: q, k, v, o (all attention projections)
- Trainable parameters: 3.5M / 223M (~1.6%)
- Adapter size: ~14MB
- Language: English
- License: MIT
- Training hardware: Google Colab free tier (NVIDIA T4, 15GB VRAM)
- Training time: ~30-35 minutes (5 epochs)
- Training precision: float32
Why T5-base and not FLAN-T5-base?
We discovered that the google/flan-t5-base checkpoint on HuggingFace has corrupted weight tying — lm_head.weight (norm 3,958) and shared.weight (norm 54,486) are untied with a 14x norm mismatch, causing initial loss of ~9.7 and degenerate outputs. Plain t5-base loads correctly with tied weights and healthy initial loss of ~1.35.
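To reproduce this check yourself, here is a minimal diagnostic sketch (it downloads both checkpoints; exact norms may vary slightly across transformers versions):

```python
from transformers import T5ForConditionalGeneration

# Compare the LM head and shared embedding weights for each checkpoint
for name in ("t5-base", "google/flan-t5-base"):
    m = T5ForConditionalGeneration.from_pretrained(name)
    tied = m.lm_head.weight.data_ptr() == m.shared.weight.data_ptr()
    print(
        f"{name}: tied={tied}, "
        f"lm_head norm={m.lm_head.weight.norm().item():.0f}, "
        f"shared norm={m.shared.weight.norm().item():.0f}"
    )
```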
Quick Start
```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
repo_id = "ArenaRune/scitldr-t5-base-lora"
config = PeftConfig.from_pretrained(repo_id)
base_model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model.eval()

# Generate TLDR
abstract = (
    "We propose a new simple network architecture, the Transformer, "
    "based solely on attention mechanisms, dispensing with recurrence "
    "and convolutions entirely. Experiments on two machine translation "
    "tasks show these models to be superior in quality while being more "
    "parallelizable and requiring significantly less time to train."
)
inputs = tokenizer(
    "summarize: " + abstract,
    return_tensors="pt",
    max_length=512,
    truncation=True,
)
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
tldr = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(tldr)
```
Training Data
- Dataset: allenai/scitldr (SciTLDR)
- Domain: Computer science research papers
- Task: Abstract → single-sentence TLDR
- Splits: 1,992 train / 619 validation / 618 test
- Input format: `"summarize: " + abstract text`
- Target: First expert-written TLDR per paper
Preprocessing
- Source sentences joined into a single abstract string
- Prepended with T5's native `"summarize: "` prefix
- Tokenized with `text_target=` for proper decoder-side formatting
- Max input length: 512 tokens | Max target length: 64 tokens
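A minimal sketch of this preprocessing, assuming the `Abstract` configuration of `allenai/scitldr` with its `source` (list of sentences) and `target` (list of TLDRs) fields:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
dataset = load_dataset("allenai/scitldr", "Abstract")

def preprocess(example):
    # Join the abstract's sentences and prepend T5's summarization prefix
    abstract = " ".join(example["source"])
    model_inputs = tokenizer(
        "summarize: " + abstract, max_length=512, truncation=True
    )
    # First expert-written TLDR as the target, tokenized on the decoder side
    labels = tokenizer(
        text_target=example["target"][0], max_length=64, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, remove_columns=dataset["train"].column_names)
```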
Training Procedure
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha (α) | 32 |
| Dropout | 0.05 |
| Target modules | q, k, v, o |
| Bias | none |
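In peft terms, this table corresponds roughly to the following configuration (a sketch, not the exact training script):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base_model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q", "k", "v", "o"],  # all T5 attention projections
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # ~3.5M trainable adapter parameters
```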
Hyperparameters
Best configuration selected from a grid search over 3 learning rates (1e-4, 3e-4, 5e-4) using ROUGE-L on the validation set.
| Parameter | Value |
|---|---|
| Epochs | 5 |
| Batch size | 8 |
| Optimizer | AdamW (weight_decay=0.01) |
| LR schedule | Cosine with 100 warmup steps |
| Gradient clipping | max_norm=1.0 |
| Precision | float32 |
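A sketch of an equivalent setup with Seq2SeqTrainer, reusing `model` and `tokenized` from the sketches above. The exact training loop is not part of this card, and the winning learning rate from the grid is not stated, so the value below is a placeholder from the searched range:

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="scitldr-t5-base-lora",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=3e-4,         # placeholder from the searched grid (1e-4, 3e-4, 5e-4)
    weight_decay=0.01,          # AdamW is the default optimizer
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_grad_norm=1.0,          # gradient clipping
    fp16=False,                 # float32 training
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```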
Evaluation
Metrics
- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence (primary metric)
All scores computed with stemming on 100 test samples.
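Scores of this kind can be computed with the evaluate library; `use_stemmer=True` matches the stemming noted above (the predictions and references below are placeholders):

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["a single-sentence tldr generated by the model"]
references = ["the expert-written tldr for the same paper"]

scores = rouge.compute(
    predictions=predictions, references=references, use_stemmer=True
)
print(scores["rouge1"], scores["rouge2"], scores["rougeL"])
```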
Comparative Results
| Method | Type | ROUGE-1 | ROUGE-2 | ROUGE-L | Avg Len |
|---|---|---|---|---|---|
| Lead sentence | Extractive | 0.2594 | 0.0926 | 0.1975 | 23.7 |
| Last sentence | Extractive | 0.1526 | 0.0144 | 0.1123 | 29.9 |
| Longest sentence | Extractive | 0.1977 | 0.0456 | 0.1309 | 78.9 |
| T5-base (zero-shot) | Generative | 0.2955 | 0.1105 | 0.2123 | 41.1 |
| T5-base + LoRA (ours) | Generative | 0.3953 | 0.1931 | 0.3344 | 22.1 |
Improvement over Zero-shot Baseline
| Metric | Baseline | Fine-Tuned | Δ | % Change |
|---|---|---|---|---|
| ROUGE-1 | 0.2955 | 0.3953 | +0.0999 | +33.8% |
| ROUGE-2 | 0.1105 | 0.1931 | +0.0826 | +74.8% |
| ROUGE-L | 0.2123 | 0.3344 | +0.1221 | +57.5% |
| Avg Length | 41.1 words | 22.1 words | -19.0 | — |
The fine-tuned model achieves substantial improvements across all ROUGE metrics while generating outputs closer to the target length range (15-25 words) compared to the verbose zero-shot baseline (41 words).
Uses
Intended Use
- Screening tool for researchers scanning large volumes of papers
- Rapid literature review and paper triage
- Generating paper summaries for reading lists or feeds
Limitations
- CS domain only: Trained exclusively on computer science papers. Quality on biomedical, legal, physics, or social science abstracts is untested and likely lower.
- Not a replacement for reading: TLDRs may omit critical caveats, overstate findings, or miss nuance. Always read the full abstract before citing.
- English only: Cannot process or generate TLDRs in other languages.
- No factual verification: May generate plausible-sounding but inaccurate summaries.
Out-of-Scope Use
- Generating authoritative summaries for citation without reading the original paper
- Medical, legal, or safety-critical applications where omitted details could cause harm
- Non-English abstracts
Bias, Risks, and Limitations
- Misrepresentation risk: TLDRs may drop qualifiers (e.g., "under controlled conditions") making results appear more general than they are
- Domain bias: Reflects CS research conventions; may mishandle terminology from other fields
- Temporal bias: Trained on papers from a specific time period; novel terminology may not be handled well
- Automation bias: Users may over-rely on TLDRs and stop reading abstracts
Recommendations
- Always label outputs as machine-generated
- Use as a screening tool only, not as a substitute for reading
- Verify key claims against the original abstract
- Exercise extra caution when applying to non-CS domains
Environmental Impact
- Hardware: NVIDIA Tesla T4 (15GB VRAM)
- Training time: ~30-35 minutes
- Cloud provider: Google Colab (free tier)
- Estimated emissions: Minimal (~0.005 kg CO2eq based on T4 power consumption of ~70W)
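For reference, 70 W over ~35 minutes is roughly 0.04 kWh; the ~0.005 kg CO2eq figure corresponds to an assumed grid carbon intensity of about 0.12 kg CO2eq/kWh.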
Technical Specifications
Model Architecture
- Architecture: Encoder-decoder (T5)
- Base model parameters: 223M (frozen)
- Adapter parameters: 3.5M (trainable)
- Total adapter size: ~14MB
Software
- Python 3.10+
- transformers >= 4.36.0
- peft >= 0.7.0
- torch >= 2.0.0
Citation
If you use this model, please cite the underlying dataset and techniques:
```bibtex
@inproceedings{cachola2020tldr,
  title     = {TLDR: Extreme Summarization of Scientific Documents},
  author    = {Cachola, Isabel and Lo, Kyle and Cohan, Arman and Weld, Daniel},
  booktitle = {Findings of EMNLP},
  year      = {2020}
}

@inproceedings{hu2022lora,
  title     = {LoRA: Low-Rank Adaptation of Large Language Models},
  author    = {Hu, Edward and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  booktitle = {ICLR},
  year      = {2022}
}

@article{raffel2020t5,
  title   = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  author  = {Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter},
  journal = {JMLR},
  year    = {2020}
}
```
Model Card Author
ArenaRune