Scientific Paper Summarizer (SciTLDR)

A fine-tuned Qwen2.5-0.5B model for generating one-sentence TL;DR summaries of scientific paper abstracts.

Trained entirely on consumer hardware (RTX 2070, 8GB VRAM) using QLoRA.

Benchmark Results

Evaluated on the official SciTLDR-A test set (618 papers), reporting the maximum ROUGE score across each paper's multiple reference TL;DRs:

Model ROUGE-1 ROUGE-2 ROUGE-L Params Training
PACSUM 19.30 4.00 15.10 - Unsupervised
BERTSUMEXT 38.50 16.60 30.50 110M Full FT
This Model 39.91 16.98 32.94 0.5B QLoRA
MatchSum 42.70 20.00 34.00 125M Full FT
BART-large 43.30 20.80 35.00 400M Full FT
CATTS (SOTA) 44.30 21.30 35.90 400M Full FT

Key Achievement: Outperforms BERTSUMEXT while using QLoRA on consumer hardware.
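
The exact evaluation script is not reproduced in this card; as a rough illustration of max-ROUGE scoring with the rouge_score package (the helper name and settings are assumptions, not the original code), the per-paper score can be taken as the best F1 over all reference TL;DRs:

# Illustrative max-ROUGE scoring: keep the best F1 over all reference TL;DRs per paper.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def max_rouge(prediction: str, references: list[str]) -> dict:
    best = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for ref in references:
        scores = scorer.score(ref, prediction)  # signature: score(target, prediction)
        for key in best:
            best[key] = max(best[key], scores[key].fmeasure)
    return best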

Model Description

Training Pipeline

This model was trained using a two-stage fine-tuning approach:

Qwen2.5-0.5B-Instruct → CPT (Domain Adaptation) → SFT (Task Training) → This Model

Stage 1: Continued Pretraining (CPT)

  • Purpose: Adapt the model to scientific language and terminology
  • Data: 50,000 arXiv abstracts from the scientific_papers dataset
  • Objective: Next-token prediction on domain text (sketched after this list)
  • Result: 44% lower perplexity on scientific text
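
A minimal sketch of the CPT data preparation, assuming the arXiv subset of the scientific_papers dataset and its "abstract" column (the original preprocessing script is not reproduced here):

# Illustrative CPT data preparation: tokenize abstracts for next-token prediction.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Column name "abstract" is assumed from the scientific_papers arXiv subset;
# recent datasets versions may require trust_remote_code=True for this dataset.
abstracts = load_dataset("scientific_papers", "arxiv", split="train[:50000]")

def tokenize(batch):
    return tokenizer(batch["abstract"], truncation=True, max_length=512)

tokenized = abstracts.map(tokenize, batched=True, remove_columns=abstracts.column_names)

# mlm=False makes the collator build plain causal-LM (next-token) labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)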

Stage 2: Supervised Fine-Tuning (SFT)

  • Purpose: Teach the specific summarization task format
  • Data: 1,992 examples from SciTLDR (abstract → TL;DR pairs)
  • Objective: Generate one-sentence summaries
  • Result: +39.5% ROUGE-L improvement over base model

Why Two Stages?

We discovered that CPT alone degrades task performance (-18.5% ROUGE-L). The model learns domain vocabulary but loses instruction-following ability. SFT is essential to recover and improve task performance.

Stage ROUGE-L Format Compliance
Base 23.62 45%
After CPT 19.24 10% ⚠️
After SFT 32.94 100% ✅
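
The card does not define how format compliance was measured; one plausible, purely illustrative check counts an output as compliant when it is exactly one sentence:

import re

def is_single_sentence(summary: str) -> bool:
    # Assumed definition: compliant means the output contains exactly one sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    return len(sentences) == 1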

Training Details

Hardware

  • GPU: NVIDIA GeForce RTX 2070 (8GB VRAM)
  • Training Time: ~24 hours total (17h CPT + 7h SFT)
  • Framework: PyTorch + Transformers + PEFT + TRL

Quantization (QLoRA)

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute precision
)

Memory Usage: ~3-4GB VRAM during training (vs 6+ GB for full fine-tuning)
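
Peak usage during training can be checked with PyTorch's CUDA memory counters (this tracks PyTorch allocations only, not total GPU memory):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training steps ...
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")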

LoRA Configuration

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                              # Rank of adapter matrices
    lora_alpha=32,                     # Scaling factor (α/r = 2)
    lora_dropout=0.05,                 # Regularization
    target_modules=[                   # Applied to attention projections
        "q_proj", "k_proj",
        "v_proj", "o_proj",
    ],
    task_type="CAUSAL_LM",
)

Trainable Parameters: 2.16M (0.44% of 496M total)
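
The count can be reproduced with peft's built-in helper, assuming the quantized base model from the Usage section and the lora_config above:

from peft import get_peft_model

# base_model: the 4-bit Qwen2.5-0.5B-Instruct loaded as shown in the Usage section below.
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# Prints the trainable and total parameter counts plus the trainable percentage (~0.44%).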

CPT Hyperparameters

Parameter Value Rationale
Learning Rate 1e-4 Conservative for domain adaptation
Epochs 1 Single pass sufficient for vocabulary
Batch Size 1 Memory constraint
Gradient Accumulation 8 Effective batch = 8
Max Sequence Length 512 Covers 94% of abstracts
LR Scheduler Cosine Smooth decay
Warmup 3% Stabilize early training
Precision fp16 Mixed precision
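
Expressed as transformers TrainingArguments, this table corresponds roughly to the following (the output path is illustrative; the original training script is not reproduced here):

from transformers import TrainingArguments

cpt_args = TrainingArguments(
    output_dir="cpt-qwen2.5-0.5b",       # illustrative path
    learning_rate=1e-4,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,       # effective batch size 8
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    fp16=True,
)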

SFT Hyperparameters

Parameter Value Rationale
Learning Rate 2e-4 Higher for stronger task signal
Epochs 3 Small dataset needs more passes
Batch Size 1 Memory constraint
Gradient Accumulation 8 Effective batch = 8
Max Sequence Length 512 Matches CPT
LR Scheduler Cosine Smooth decay
Warmup 3% Stabilize early training
Precision bf16 Match adapter dtype
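
The SFT settings map onto TRL's SFTConfig in much the same way (argument names vary slightly across TRL versions; the output path is illustrative):

from trl import SFTConfig

sft_args = SFTConfig(
    output_dir="sft-scitldr",            # illustrative path
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,       # effective batch size 8
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    max_seq_length=512,                  # renamed max_length in newer TRL releases
)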

Data

CPT Data (Domain Adaptation):

  • Source: scientific_papers dataset (arXiv subset)
  • Size: 50,000 abstracts
  • Processing: Minimal cleaning, kept @xmath/@xcite tokens
  • Average length: 1,597 characters

SFT Data (Task Training):

  • Source: allenai/scitldr (Abstract configuration)
  • Size: 1,992 training examples
  • Format: Chat messages (system/user/assistant), as sketched after this list
  • Average abstract: 1,121 chars → Average TL;DR: 137 chars
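
A minimal sketch of converting one SciTLDR example into the chat format used at inference; the field names ("source" as a list of abstract sentences, "target" as the reference TL;DRs) are assumed from allenai/scitldr:

# Illustrative conversion of one SciTLDR example into chat messages.
SYSTEM = ("You are a scientific paper summarizer. "
          "Provide a one-sentence summary (TL;DR) of the given abstract.")

def to_chat_example(example: dict) -> dict:
    abstract = " ".join(example["source"])   # "source" holds the abstract as a list of sentences
    tldr = example["target"][0]              # first reference TL;DR used as the label
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Summarize this abstract:\n\n{abstract}"},
        {"role": "assistant", "content": tldr},
    ]}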

Usage

Installation

pip install transformers peft bitsandbytes accelerate torch

Loading the Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Load adapter
model = PeftModel.from_pretrained(
    base_model, 
    "tcy93/scitldr-qwen2.5-0.5b-summarizer"
)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

Generating Summaries

def summarize(abstract: str) -> str:
    messages = [
        {"role": "system", "content": "You are a scientific paper summarizer. Provide a one-sentence summary (TL;DR) of the given abstract."},
        {"role": "user", "content": f"Summarize this abstract:\n\n{abstract}"},
    ]
    
    text = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False,
        )
    
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], 
        skip_special_tokens=True
    )
    return response.strip()

# Example
abstract = """
We propose a new optimization algorithm called AdamW that decouples 
weight decay from the gradient update. This simple modification leads 
to better generalization performance compared to L2 regularization in Adam.
"""

print(summarize(abstract))
# Output: "We propose AdamW, an optimizer that decouples weight decay from gradient updates for improved generalization."

Example Outputs

Generated TL;DRs for three well-known papers:

  • Attention Is All You Need: "We propose a simple but effective model for sequence transduction that is parallelizable and requires significantly less training time than existing models."
  • BERT: "A new language representation model that pre-trains bidirectional representations from unlabeled text by joint conditioning on both left and right context in all layers."
  • GPT-3: "We show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches."

Limitations

  • Domain: Optimized for scientific/academic abstracts (computer science, physics, etc.)
  • Language: English only
  • Output: Single-sentence summaries only (by design)
  • Model Size: 0.5B parameters limits complex reasoning compared to larger models
  • Generalization: May not perform well on non-scientific text (news, social media, etc.)

Ethical Considerations

  • This model generates summaries and may occasionally produce inaccurate or misleading content
  • Always verify important claims against the original source
  • Not intended for generating fake scientific content

Citation

@misc{scitldr-qwen-summarizer-2025,
  author = {Turgut Cem Yilmazturk},
  title = {Scientific Paper Summarizer: Fine-tuned Qwen2.5-0.5B on SciTLDR},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/tcy93/scitldr-qwen2.5-0.5b-summarizer}}
}

Acknowledgments

License

Apache 2.0 (following the base model license)
