# Scientific Paper Summarizer (SciTLDR)

A fine-tuned Qwen2.5-0.5B model for generating one-sentence TL;DR summaries of scientific paper abstracts. Trained entirely on consumer hardware (an RTX 2070 with 8 GB VRAM) using QLoRA.
## Benchmark Results

Evaluated on the official SciTLDR-A test set (618 papers), scoring each prediction against all reference TLDRs for a paper and keeping the maximum ROUGE (a scoring sketch appears below the table):
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Params | Training |
|---|---|---|---|---|---|
| PACSUM | 19.30 | 4.00 | 15.10 | - | Unsupervised |
| BERTSUMEXT | 38.50 | 16.60 | 30.50 | 110M | Full FT |
| This Model | 39.91 | 16.98 | 32.94 | 0.5B | QLoRA |
| MatchSum | 42.70 | 20.00 | 34.00 | 125M | Full FT |
| BART-large | 43.30 | 20.80 | 35.00 | 400M | Full FT |
| CATTS (SOTA) | 44.30 | 21.30 | 35.90 | 400M | Full FT |
**Key Achievement:** Outperforms the fully fine-tuned BERTSUMEXT while training only QLoRA adapters on consumer hardware.
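The multi-reference protocol simply keeps the best score per metric over all gold TLDRs. A minimal sketch, assuming the `rouge_score` package (the exact evaluation script behind the numbers above is not published):

```python
# Illustrative max-ROUGE scoring; `rouge_score` is an assumption, not
# necessarily the package used for the table above.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def max_rouge(prediction: str, references: list[str]) -> dict[str, float]:
    """Score a prediction against every reference and keep the best F1 per metric."""
    best = {name: 0.0 for name in ("rouge1", "rouge2", "rougeL")}
    for ref in references:
        scores = scorer.score(ref, prediction)  # signature: score(target, prediction)
        for name in best:
            best[name] = max(best[name], scores[name].fmeasure)
    return best
```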
## Model Description

### Training Pipeline

This model was trained using a two-stage fine-tuning approach:

```
Qwen2.5-0.5B-Instruct → CPT (Domain Adaptation) → SFT (Task Training) → This Model
```

#### Stage 1: Continued Pretraining (CPT)
- **Purpose:** Adapt the model to scientific language and terminology
- **Data:** 50,000 arXiv abstracts from the `scientific_papers` dataset
- **Objective:** Next-token prediction on domain text
- **Result:** 44% lower perplexity on scientific text (a measurement sketch follows this list)
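A rough illustration of how such a perplexity comparison can be measured; a sketch assuming a held-out list `texts` of scientific abstracts, not the actual CPT evaluation code:

```python
import math
import torch

def perplexity(model, tokenizer, texts: list[str], max_length: int = 512) -> float:
    """Token-weighted perplexity: exp of the mean next-token NLL over `texts`."""
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=max_length).to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # HF shifts labels internally
        n = enc["input_ids"].numel() - 1   # loss averages over the shifted positions
        total_nll += out.loss.item() * n   # out.loss is mean NLL per token
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```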
#### Stage 2: Supervised Fine-Tuning (SFT)

- **Purpose:** Teach the specific summarization task format
- **Data:** 1,992 examples from SciTLDR (abstract → TL;DR pairs)
- **Objective:** Generate one-sentence summaries
- **Result:** +39.5% ROUGE-L over the base model
### Why Two Stages?

We found that CPT alone degrades task performance (-18.5% ROUGE-L): the model picks up domain vocabulary but loses its instruction-following ability. SFT is essential to recover and then surpass the base model's task performance. A rough compliance check is sketched after the table.
| Stage | ROUGE-L | Format Compliance |
|---|---|---|
| Base | 23.62 | 45% |
| After CPT | 19.24 | 10% ⚠️ |
| After SFT | 32.94 | 100% ✅ |
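The card does not pin down the compliance criterion precisely; a plausible check, assuming "compliant" means the output is exactly one sentence:

```python
import re

def is_single_sentence(tldr: str) -> bool:
    # Compliant if there is at most one sentence terminator followed by
    # whitespace or end of string. A heuristic stand-in, not necessarily
    # the exact criterion behind the table above.
    terminators = re.findall(r"[.!?](?=\s|$)", tldr.strip())
    return len(terminators) <= 1
```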
## Training Details

### Hardware

- **GPU:** NVIDIA GeForce RTX 2070 (8 GB VRAM)
- **Training Time:** ~24 hours total (17 h CPT + 7 h SFT)
- **Framework:** PyTorch + Transformers + PEFT + TRL

### Quantization (QLoRA)
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute precision
)
```
**Memory Usage:** ~3-4 GB VRAM during training (vs. 6+ GB for full fine-tuning); one way to confirm the peak is sketched below.
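An illustrative way to check a figure like this during your own run:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training steps here ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak VRAM: {peak_gb:.1f} GB")
```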
### LoRA Configuration

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,               # rank of the adapter matrices
    lora_alpha=32,      # scaling factor (alpha/r = 2)
    lora_dropout=0.05,  # regularization
    target_modules=[    # applied to the attention projections
        "q_proj", "k_proj",
        "v_proj", "o_proj",
    ],
    task_type="CAUSAL_LM",
)
```
**Trainable Parameters:** 2.16M (0.44% of the 496M total)
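PEFT can report this count directly; a sketch assuming the quantized base model loaded as in the Usage section below, together with the `lora_config` above (output format varies by PEFT version):

```python
from peft import get_peft_model

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# e.g. trainable params: ~2.16M || all params: ~496M || trainable%: 0.44
```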
### CPT Hyperparameters
| Parameter | Value | Rationale |
|---|---|---|
| Learning Rate | 1e-4 | Conservative for domain adaptation |
| Epochs | 1 | Single pass sufficient for vocabulary |
| Batch Size | 1 | Memory constraint |
| Gradient Accumulation | 8 | Effective batch = 8 |
| Max Sequence Length | 512 | Covers 94% of abstracts |
| LR Scheduler | Cosine | Smooth decay |
| Warmup | 3% | Stabilize early training |
| Precision | fp16 | Mixed precision |
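A hedged sketch of `TrainingArguments` matching the table above; the actual training script is not published, and `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

cpt_args = TrainingArguments(
    output_dir="qwen-scitldr-cpt",   # placeholder path
    learning_rate=1e-4,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size = 8
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    fp16=True,                       # mixed precision
)
```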
### SFT Hyperparameters
| Parameter | Value | Rationale |
|---|---|---|
| Learning Rate | 2e-4 | Higher for stronger task signal |
| Epochs | 3 | Small dataset needs more passes |
| Batch Size | 1 | Memory constraint |
| Gradient Accumulation | 8 | Effective batch = 8 |
| Max Sequence Length | 512 | Matches CPT |
| LR Scheduler | Cosine | Smooth decay |
| Warmup | 3% | Stabilize early training |
| Precision | bf16 | Match adapter dtype |
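The analogous SFT settings, sketched with TRL's `SFTConfig`; field names drift across TRL versions (e.g. `max_seq_length` became `max_length` in newer releases), so adjust for your install:

```python
from trl import SFTConfig

sft_args = SFTConfig(
    output_dir="qwen-scitldr-sft",   # placeholder path
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size = 8
    max_seq_length=512,              # matches the CPT context length
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,                       # match adapter dtype
)
```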
### Data

**CPT Data (Domain Adaptation):**

- **Source:** `scientific_papers` dataset (arXiv subset)
- **Size:** 50,000 abstracts
- **Processing:** Minimal cleaning; @xmath/@xcite tokens kept
- **Average length:** 1,597 characters
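Loading the corpus might look like this; illustrative, and note that newer `datasets` versions require `trust_remote_code=True` for this script-based dataset:

```python
from datasets import load_dataset

ds = load_dataset("scientific_papers", "arxiv", split="train", trust_remote_code=True)
abstracts = ds.select(range(50_000))["abstract"]  # 50k subset; selection is illustrative
```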
**SFT Data (Task Training):**

- **Source:** `allenai/scitldr` (Abstract configuration)
- **Size:** 1,992 training examples
- **Format:** Chat messages (system/user/assistant), illustrated below
- **Average lengths:** 1,121-char abstract → 137-char TL;DR
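Each training example takes the same chat shape used at inference time; the placeholder strings here are paraphrased from this card, not rows copied from the training set:

```python
example = {
    "messages": [
        {"role": "system", "content": "You are a scientific paper summarizer. "
                                      "Provide a one-sentence summary (TL;DR) of the given abstract."},
        {"role": "user", "content": "Summarize this abstract:\n\n<abstract text>"},
        {"role": "assistant", "content": "<one-sentence TL;DR>"},
    ]
}
```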
## Usage

### Installation

```bash
pip install transformers peft bitsandbytes accelerate torch
```
### Loading the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "tcy93/scitldr-qwen2.5-0.5b-summarizer",
)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
```
### Generating Summaries

```python
def summarize(abstract: str) -> str:
    messages = [
        {"role": "system", "content": "You are a scientific paper summarizer. Provide a one-sentence summary (TL;DR) of the given abstract."},
        {"role": "user", "content": f"Summarize this abstract:\n\n{abstract}"},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False,   # greedy decoding for deterministic summaries
        )
    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
    return response.strip()


# Example
abstract = """
We propose a new optimization algorithm called AdamW that decouples
weight decay from the gradient update. This simple modification leads
to better generalization performance compared to L2 regularization in Adam.
"""
print(summarize(abstract))
# Output: "We propose AdamW, an optimizer that decouples weight decay from gradient updates for improved generalization."
```
## Example Outputs
| Paper | Generated TL;DR |
|---|---|
| Attention Is All You Need | We propose a simple but effective model for sequence transduction that is parallelizable and requires significantly less training time than existing models. |
| BERT | A new language representation model that pre-trains bidirectional representations from unlabeled text by joint conditioning on both left and right context in all layers. |
| GPT-3 | We show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. |
## Limitations

- **Domain:** Optimized for scientific/academic abstracts (computer science, physics, etc.)
- **Language:** English only
- **Output:** Single-sentence summaries only (by design)
- **Model Size:** 0.5B parameters limits complex reasoning compared to larger models
- **Generalization:** May not perform well on non-scientific text (news, social media, etc.)
## Ethical Considerations

- Generated summaries may occasionally be inaccurate or misleading
- Always verify important claims against the original source
- Not intended for generating fake scientific content
## Citation

```bibtex
@misc{scitldr-qwen-summarizer-2025,
  author       = {Turgut Cem Yilmazturk},
  title        = {Scientific Paper Summarizer: Fine-tuned Qwen2.5-0.5B on SciTLDR},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/tcy93/scitldr-qwen2.5-0.5b-summarizer}}
}
```
## Acknowledgments

- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct
- **SFT Dataset:** SciTLDR by AllenAI
- **CPT Dataset:** scientific_papers (arXiv)
- **Benchmark:** "TLDR: Extreme Summarization of Scientific Documents" (EMNLP 2020)
## License
Apache 2.0 (following the base model license)