# Scientific Paper Summarizer (SciTLDR)

A fine-tuned Qwen2.5-0.5B model for generating one-sentence TL;DR summaries of scientific paper abstracts. Trained entirely on consumer hardware (an RTX 2070 with 8 GB VRAM) using QLoRA.
## Benchmark Results

Evaluated on the official SciTLDR-A test set (618 papers), scoring each prediction against all reference TLDRs for a paper and keeping the maximum ROUGE (a scoring sketch appears below the table):
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Params | Training |
|---|---|---|---|---|---|
| PACSUM | 19.30 | 4.00 | 15.10 | - | Unsupervised |
| BERTSUMEXT | 38.50 | 16.60 | 30.50 | 110M | Full FT |
| This Model | 39.91 | 16.98 | 32.94 | 0.5B | QLoRA |
| MatchSum | 42.70 | 20.00 | 34.00 | 125M | Full FT |
| BART-large | 43.30 | 20.80 | 35.00 | 400M | Full FT |
| CATTS (SOTA) | 44.30 | 21.30 | 35.90 | 400M | Full FT |
**Key Achievement:** Outperforms the fully fine-tuned BERTSUMEXT while training only QLoRA adapters on consumer hardware.
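The multi-reference protocol simply keeps the best score per metric over all gold TLDRs. A minimal sketch, assuming the `rouge_score` package (the exact evaluation script behind the numbers above is not published):

```python
# Illustrative max-ROUGE scoring; `rouge_score` is an assumption, not
# necessarily the package used for the table above.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def max_rouge(prediction: str, references: list[str]) -> dict[str, float]:
    """Score a prediction against every reference and keep the best F1 per metric."""
    best = {name: 0.0 for name in ("rouge1", "rouge2", "rougeL")}
    for ref in references:
        scores = scorer.score(ref, prediction)  # signature: score(target, prediction)
        for name in best:
            best[name] = max(best[name], scores[name].fmeasure)
    return best
```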
## Model Description

### Training Pipeline

This model was trained using a two-stage fine-tuning approach:

```
Qwen2.5-0.5B-Instruct → CPT (Domain Adaptation) → SFT (Task Training) → This Model
```

#### Stage 1: Continued Pretraining (CPT)
- **Purpose:** Adapt the model to scientific language and terminology
- **Data:** 50,000 arXiv abstracts from the `scientific_papers` dataset
- **Objective:** Next-token prediction on domain text
- **Result:** 44% lower perplexity on scientific text (a measurement sketch follows this list)
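A rough illustration of how such a perplexity comparison can be measured; a sketch assuming a held-out list `texts` of scientific abstracts, not the actual CPT evaluation code:

```python
import math
import torch

def perplexity(model, tokenizer, texts: list[str], max_length: int = 512) -> float:
    """Token-weighted perplexity: exp of the mean next-token NLL over `texts`."""
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=max_length).to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # HF shifts labels internally
        n = enc["input_ids"].numel() - 1   # loss averages over the shifted positions
        total_nll += out.loss.item() * n   # out.loss is mean NLL per token
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```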
#### Stage 2: Supervised Fine-Tuning (SFT)

- **Purpose:** Teach the specific summarization task format
- **Data:** 1,992 examples from SciTLDR (abstract → TL;DR pairs)
- **Objective:** Generate one-sentence summaries
- **Result:** +39.5% ROUGE-L over the base model
### Why Two Stages?

We found that CPT alone degrades task performance (-18.5% ROUGE-L): the model picks up domain vocabulary but loses its instruction-following ability. SFT is essential to recover and then surpass the base model's task performance. A rough compliance check is sketched after the table.
| Stage | ROUGE-L | Format Compliance |
|---|---|---|
| Base | 23.62 | 45% |
| After CPT | 19.24 | 10% ⚠️ |
| After SFT | 32.94 | 100% ✅ |
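The card does not pin down the compliance criterion precisely; a plausible check, assuming "compliant" means the output is exactly one sentence:

```python
import re

def is_single_sentence(tldr: str) -> bool:
    # Compliant if there is at most one sentence terminator followed by
    # whitespace or end of string. A heuristic stand-in, not necessarily
    # the exact criterion behind the table above.
    terminators = re.findall(r"[.!?](?=\s|$)", tldr.strip())
    return len(terminators) <= 1
```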
## Training Details

### Hardware

- **GPU:** NVIDIA GeForce RTX 2070 (8 GB VRAM)
- **Training Time:** ~24 hours total (17 h CPT + 7 h SFT)
- **Framework:** PyTorch + Transformers + PEFT + TRL

### Quantization (QLoRA)
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute precision
)
```
**Memory Usage:** ~3-4 GB VRAM during training (vs. 6+ GB for full fine-tuning); one way to confirm the peak is sketched below.
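An illustrative way to check a figure like this during your own run:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training steps here ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak VRAM: {peak_gb:.1f} GB")
```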
### LoRA Configuration

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,               # rank of the adapter matrices
    lora_alpha=32,      # scaling factor (alpha/r = 2)
    lora_dropout=0.05,  # regularization
    target_modules=[    # applied to the attention projections
        "q_proj", "k_proj",
        "v_proj", "o_proj",
    ],
    task_type="CAUSAL_LM",
)
```
**Trainable Parameters:** 2.16M (0.44% of the 496M total)
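PEFT can report this count directly; a sketch assuming the quantized base model loaded as in the Usage section below, together with the `lora_config` above (output format varies by PEFT version):

```python
from peft import get_peft_model

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# e.g. trainable params: ~2.16M || all params: ~496M || trainable%: 0.44
```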
### CPT Hyperparameters
| Parameter | Value | Rationale |
|---|---|---|
| Learning Rate | 1e-4 | Conservative for domain adaptation |
| Epochs | 1 | Single pass sufficient for vocabulary |
| Batch Size | 1 | Memory constraint |
| Gradient Accumulation | 8 | Effective batch = 8 |
| Max Sequence Length | 512 | Covers 94% of abstracts |
| LR Scheduler | Cosine | Smooth decay |
| Warmup | 3% | Stabilize early training |
| Precision | fp16 | Mixed precision |
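A hedged sketch of `TrainingArguments` matching the table above; the actual training script is not published, and `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

cpt_args = TrainingArguments(
    output_dir="qwen-scitldr-cpt",   # placeholder path
    learning_rate=1e-4,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size = 8
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    fp16=True,                       # mixed precision
)
```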
### SFT Hyperparameters
| Parameter | Value | Rationale |
|---|---|---|
| Learning Rate | 2e-4 | Higher for stronger task signal |
| Epochs | 3 | Small dataset needs more passes |
| Batch Size | 1 | Memory constraint |
| Gradient Accumulation | 8 | Effective batch = 8 |
| Max Sequence Length | 512 | Matches CPT |
| LR Scheduler | Cosine | Smooth decay |
| Warmup | 3% | Stabilize early training |
| Precision | bf16 | Match adapter dtype |
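The analogous SFT settings, sketched with TRL's `SFTConfig`; field names drift across TRL versions (e.g. `max_seq_length` became `max_length` in newer releases), so adjust for your install:

```python
from trl import SFTConfig

sft_args = SFTConfig(
    output_dir="qwen-scitldr-sft",   # placeholder path
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size = 8
    max_seq_length=512,              # matches the CPT context length
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,                       # match adapter dtype
)
```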
### Data

**CPT Data (Domain Adaptation):**

- **Source:** `scientific_papers` dataset (arXiv subset)
- **Size:** 50,000 abstracts
- **Processing:** Minimal cleaning; @xmath/@xcite tokens kept
- **Average length:** 1,597 characters
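Loading the corpus might look like this; illustrative, and note that newer `datasets` versions require `trust_remote_code=True` for this script-based dataset:

```python
from datasets import load_dataset

ds = load_dataset("scientific_papers", "arxiv", split="train", trust_remote_code=True)
abstracts = ds.select(range(50_000))["abstract"]  # 50k subset; selection is illustrative
```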
**SFT Data (Task Training):**

- **Source:** `allenai/scitldr` (Abstract configuration)
- **Size:** 1,992 training examples
- **Format:** Chat messages (system/user/assistant), illustrated below
- **Average lengths:** 1,121-char abstract → 137-char TL;DR
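Each training example takes the same chat shape used at inference time; the placeholder strings here are paraphrased from this card, not rows copied from the training set:

```python
example = {
    "messages": [
        {"role": "system", "content": "You are a scientific paper summarizer. "
                                      "Provide a one-sentence summary (TL;DR) of the given abstract."},
        {"role": "user", "content": "Summarize this abstract:\n\n<abstract text>"},
        {"role": "assistant", "content": "<one-sentence TL;DR>"},
    ]
}
```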
## Usage

### Installation

```bash
pip install transformers peft bitsandbytes accelerate torch
```
### Loading the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "tcy93/scitldr-qwen2.5-0.5b-summarizer",
)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
```
### Generating Summaries

```python
def summarize(abstract: str) -> str:
    messages = [
        {"role": "system", "content": "You are a scientific paper summarizer. Provide a one-sentence summary (TL;DR) of the given abstract."},
        {"role": "user", "content": f"Summarize this abstract:\n\n{abstract}"},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False,   # greedy decoding for deterministic summaries
        )
    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
    return response.strip()


# Example
abstract = """
We propose a new optimization algorithm called AdamW that decouples
weight decay from the gradient update. This simple modification leads
to better generalization performance compared to L2 regularization in Adam.
"""
print(summarize(abstract))
# Output: "We propose AdamW, an optimizer that decouples weight decay from gradient updates for improved generalization."
```
## Example Outputs
| Paper | Generated TL;DR |
|---|---|
| Attention Is All You Need | We propose a simple but effective model for sequence transduction that is parallelizable and requires significantly less training time than existing models. |
| BERT | A new language representation model that pre-trains bidirectional representations from unlabeled text by joint conditioning on both left and right context in all layers. |
| GPT-3 | We show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. |
## Limitations

- **Domain:** Optimized for scientific/academic abstracts (computer science, physics, etc.)
- **Language:** English only
- **Output:** Single-sentence summaries only (by design)
- **Model Size:** 0.5B parameters limits complex reasoning compared to larger models
- **Generalization:** May not perform well on non-scientific text (news, social media, etc.)
## Ethical Considerations

- Generated summaries may occasionally be inaccurate or misleading
- Always verify important claims against the original source
- Not intended for generating fake scientific content
## Citation

```bibtex
@misc{scitldr-qwen-summarizer-2025,
  author       = {Turgut Cem Yilmazturk},
  title        = {Scientific Paper Summarizer: Fine-tuned Qwen2.5-0.5B on SciTLDR},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/tcy93/scitldr-qwen2.5-0.5b-summarizer}}
}
```
## Acknowledgments

- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct
- **SFT Dataset:** SciTLDR by AllenAI
- **CPT Dataset:** scientific_papers (arXiv)
- **Benchmark:** "TLDR: Extreme Summarization of Scientific Documents" (EMNLP 2020)
## License
Apache 2.0 (following the base model license)