📰 ArXiv NewsBriefing v4.2

Transform dense academic abstracts into clear, engaging news-style briefings that anyone can understand in 30 seconds.

Built on Qwen2.5-1.5B with QLoRA fine-tuning on 1,845 research papers from ArXiv.

Python 3.11+ | License: Apache 2.0 | Model: Qwen2.5-1.5B

🎯 Overview

This project converts academic research papers into accessible 2-sentence news briefings (~45 words) in the style of NPR/BBC news broadcasts. Designed for science communicators, journalists, educators, and anyone interested in staying current with research.

Key Features

  • 🎯 Strict Format: Exactly 2 sentences, maximum 45 words
  • 📰 News Style: NPR/BBC broadcasting tone
  • 🚫 Anti-Hallucination: No fabricated numbers or facts
  • ⚡ Fast Inference: 3-4s per briefing on CPU (GGUF), 1s or less on GPU
  • 🧠 CPU Optimized: GGUF quantization for deployment
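The strict format rule above (exactly 2 sentences, at most 45 words) can be checked mechanically. The following is a minimal sketch, not the project's own validator: `split_sentences` and `check_format` are hypothetical helper names, and the sentence splitter is deliberately naive (abbreviations such as "e.g." or "Dr." would be miscounted).

```python
import re

MAX_WORDS = 45           # word limit stated in Key Features
REQUIRED_SENTENCES = 2   # sentence count stated in Key Features


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_format(briefing: str) -> dict:
    """Count sentences and whitespace-separated words and report compliance."""
    sentences = split_sentences(briefing)
    words = briefing.split()
    return {
        "sentences": len(sentences),
        "words": len(words),
        "ok": len(sentences) == REQUIRED_SENTENCES and len(words) <= MAX_WORDS,
    }


sample = (
    "Scientists developed a new quantum computing method using topological "
    "qubits that achieves record 99.9% accuracy. The breakthrough could enable "
    "more reliable quantum computers for practical applications."
)
print(check_format(sample))
```

Decimal numbers such as "99.9%" are not split because the period is not followed by whitespace.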

📊 Performance

| Metric | Baseline | v4.2 | Improvement |
| --- | --- | --- | --- |
| ROUGE-1 | 0.42 | 0.48 | +14% |
| ROUGE-2 | 0.18 | 0.21 | +17% |
| Format Compliance | 67% | 98.9% | +48% |
| LLM Judge Score | 6.5/10 | 8.6/10 | +32% |
| Avg Word Count | 50.3 | 43.2 | 96% under limit |
| Factual Accuracy | - | 98% | ✅ Pass |
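The Improvement column appears to be a relative change over the baseline, not a difference in percentage points. A short sketch that reproduces the four percentage figures from the Baseline and v4.2 columns (the function name `relative_improvement` is only illustrative):

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative change over the baseline, in percent."""
    return (new - baseline) / baseline * 100


# (baseline, v4.2) pairs taken from the Performance table
rows = {
    "ROUGE-1": (0.42, 0.48),
    "ROUGE-2": (0.18, 0.21),
    "Format Compliance": (67.0, 98.9),
    "LLM Judge Score": (6.5, 8.6),
}

for name, (baseline, new) in rows.items():
    print(f"{name}: {relative_improvement(baseline, new):+.0f}%")
# ROUGE-1: +14%
# ROUGE-2: +17%
# Format Compliance: +48%
# LLM Judge Score: +32%
```

For Format Compliance, the same change expressed in absolute terms is 31.9 percentage points.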

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • 8GB RAM minimum (16GB recommended)
  • For training: CUDA-capable GPU (T4/A100)
  • For inference: CPU (GGUF) or GPU

Installation

# Clone repository
git clone https://huggingface.co/chopeace/arxiv-newsbriefing-v4.2
cd arxiv-newsbriefing-v4.2

# Install dependencies
pip install -r requirements.txt

Usage

Option 1: GGUF (CPU Inference) ⭐ Recommended

from llama_cpp import Llama

# Load GGUF model
model = Llama(
    model_path="ArXiv-NewsBrief-Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8
)

# Summarize abstract
abstract = """
We present a novel approach to quantum computing using topological 
qubits that achieves 99.9% gate fidelity, significantly improving 
upon previous methods.
"""

result = model(
    f"Convert this ArXiv abstract into a 2-sentence news briefing: {abstract}",
    max_tokens=120,
    temperature=0.4
)

print(result['choices'][0]['text'])
# Output: Scientists developed a new quantum computing method using 
# topological qubits that achieves record 99.9% accuracy. The breakthrough 
# could enable more reliable quantum computers for practical applications.
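Because sampling at temperature 0.4 is not deterministic, an occasional output may miss the 2-sentence, 45-word format. One possible way to handle this is to retry until the output passes a format check. The sketch below is an assumption, not part of this repository: `is_compliant` and `brief_with_retry` are hypothetical helpers, and `generate` is any callable that maps a prompt string to generated text.

```python
import re
from typing import Callable


def is_compliant(text: str, max_words: int = 45) -> bool:
    """Rough check: exactly two sentences and at most max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return len(sentences) == 2 and len(text.split()) <= max_words


def brief_with_retry(
    abstract: str,
    generate: Callable[[str], str],
    max_attempts: int = 3,
) -> str:
    """Call generate up to max_attempts times; return the first compliant
    briefing, or the last attempt if none passes the check."""
    prompt = f"Convert this ArXiv abstract into a 2-sentence news briefing: {abstract}"
    text = ""
    for _ in range(max_attempts):
        text = generate(prompt).strip()
        if is_compliant(text):
            break
    return text
```

With the `model` object loaded in Option 1, the callable could be `lambda p: model(p, max_tokens=120, temperature=0.4)["choices"][0]["text"]`.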

Option 2: Transformers (GPU)

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load base model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Apply LoRA adapters (after training)
# from peft import PeftModel
# model = PeftModel.from_pretrained(model, "./final_model")

# Prepare input
abstract = "Your research abstract here..."
system_message = "Convert this ArXiv abstract into a 2-sentence news briefing."

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": abstract}
]

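# Generate
# add_generation_prompt=True appends the assistant turn marker so the model
# answers instead of continuing the user message.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    temperature=0.4,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens, not the echoed prompt
prompt_length = inputs["input_ids"].shape[-1]
summary = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(summary)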

📦 Model Files

| File | Size | Format | Use Case | Status |
| --- | --- | --- | --- | --- |
| ArXiv-NewsBrief-Q4_K_M.gguf | 0.9GB | GGUF Q4_K_M | ⭐ CPU inference | ✅ Available |
| ArXiv-NewsBrief-Q8_0.gguf | 1.5GB | GGUF Q8_0 | High quality | Coming soon |

Download & Usage

# GGUF model is included in this repository
# Or download directly:
wget https://huggingface.co/chopeace/arxiv-newsbriefing-v4.2/resolve/main/ArXiv-NewsBrief-Q4_K_M.gguf

🏗️ Training from Scratch

1. Generate Training Data

# Set API key for teacher model (Gemini)
export GOOGLE_API_KEY="your-api-key"

# Generate dataset
python scripts/dataset_generator.py

Output: data/processed/train.csv (1,660 samples)

Configuration:

  • Teacher Model: Gemini-3-27b-it
  • Source: ArXiv abstracts (Physics, CS, Math)
  • Validation: Automated format checking

2. Train Model

# Full training (MODE 1)
python scripts/sft_train.py

Training Settings:

  • Base Model: Qwen/Qwen2.5-1.5B-Instruct
  • Method: QLoRA (4-bit quantization)
  • LoRA Config: r=16, α=32, dropout=0.05
  • Training: 5 epochs, 2e-4 learning rate
  • Hardware: Google Colab A100 (~50 minutes)
  • Cost: ~$0.50
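To get a feel for how small the trained adapter is, the LoRA settings above can be turned into a rough trainable-parameter count. Each adapted linear layer adds r × (in_features + out_features) weights. The layer shapes below are assumptions about the Qwen2.5-1.5B architecture (hidden size 1536, 28 layers, key/value projection width 256) and are not stated in this README; the target modules are the four attention projections listed under Training Configuration.

```python
# Assumed Qwen2.5-1.5B shapes (not taken from this README)
HIDDEN = 1536      # hidden size
KV_DIM = 256       # key/value projection width (2 KV heads x 128 head dim)
NUM_LAYERS = 28    # transformer layers

# LoRA settings from the Training Settings list
R, ALPHA = 16, 32

# (in_features, out_features) for each adapted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A (r x in) plus B (out x r) low-rank matrices."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"{total:,} trainable parameters, scaling factor {scaling}")
# 4,358,144 trainable parameters, scaling factor 2.0
```

Under these assumptions the adapter is roughly 0.3% of the 1.5B base parameters.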

3-Mode Training System:

  • MODE 0: Practice (50 samples) - Quick validation
  • MODE 1: Full Training (1,845 samples) - Production
  • MODE 2: Inference Only - CPU testing

3. Convert to GGUF

python scripts/sft_merge_gguf.py

Output:

  • ArXiv-NewsBrief-Q4_K_M.gguf (0.9GB) - Recommended
  • ArXiv-NewsBrief-Q8_0.gguf (1.5GB) - High quality

🔬 Technical Details

System Prompt

Convert this ArXiv abstract into a concise 2-sentence news briefing.

Requirements:
- Use clear, accessible language without technical jargon
- Maximum 45 words total
- Exactly 2 sentences
- Style: NPR/BBC news broadcast tone
- No hallucinations: only use information from the abstract
- Focus on practical implications and significance

Format:
[Sentence 1: Main finding/innovation in simple terms]
[Sentence 2: Significance/implications for practical applications]
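The system prompt above can be paired with an abstract to form chat messages in the same shape used in the Transformers usage example. A minimal sketch, assuming a hypothetical helper `build_messages` that also collapses line breaks in the abstract:

```python
SYSTEM_PROMPT = """Convert this ArXiv abstract into a concise 2-sentence news briefing.

Requirements:
- Use clear, accessible language without technical jargon
- Maximum 45 words total
- Exactly 2 sentences
- Style: NPR/BBC news broadcast tone
- No hallucinations: only use information from the abstract
- Focus on practical implications and significance

Format:
[Sentence 1: Main finding/innovation in simple terms]
[Sentence 2: Significance/implications for practical applications]"""


def build_messages(abstract: str) -> list[dict]:
    """Return system + user messages; whitespace in the abstract is normalized."""
    cleaned = " ".join(abstract.split())
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": cleaned},
    ]


messages = build_messages("""
We present a novel approach to quantum computing using topological
qubits that achieves 99.9% gate fidelity.
""")
print(messages[1]["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template` as in Option 2.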

Training Configuration

# Base Model
base_model = "Qwen/Qwen2.5-1.5B-Instruct"

# LoRA Configuration
lora_r = 16
lora_alpha = 32
lora_dropout = 0.05
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Training
num_epochs = 5
learning_rate = 2e-4
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
max_seq_length = 512

# Generation
temperature = 0.4
top_k = 40
top_p = 0.9
max_new_tokens = 120
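From the batch settings above, the effective batch size and an approximate number of optimizer steps follow directly. This is a back-of-the-envelope sketch that assumes a single GPU and the 1,660-sample training split; the exact step count reported by a trainer can differ slightly depending on how it handles the final partial batch.

```python
import math

train_samples = 1660              # size of data/processed/train.csv
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_epochs = 5
num_devices = 1                   # assumption: one Colab GPU

effective_batch = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(effective_batch, steps_per_epoch, total_steps)
# 8 208 1040
```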

Model Specifications

  • Base: Qwen2.5-1.5B-Instruct (1.5B parameters)
  • Context Length: 32K tokens
  • Fine-tuning: QLoRA with 4-bit quantization
  • Training Data: 1,845 pairs (90/10 train/val split)
  • Quantization: Q4_K_M (4-bit) for GGUF
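The 90/10 split of 1,845 pairs yields the 1,660 training and 185 validation samples listed elsewhere in this README. A minimal sketch of such a split follows; the shuffling, the seed, and the helper name `split_pairs` are assumptions, since the README does not describe how the split was made.

```python
import random


def split_pairs(pairs: list, train_fraction: float = 0.9, seed: int = 42):
    """Shuffle a copy of pairs and cut it into train and validation parts."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:n_train], shuffled[n_train:]


train, val = split_pairs(list(range(1845)))
print(len(train), len(val))
# 1660 185
```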

📁 Repository Structure

arxiv-newsbriefing-v4.2/
├── README.md                           # This file
├── LICENSE                             # Apache 2.0
├── requirements.txt                    # Dependencies
├── QUICKSTART.md                       # Quick start guide
│
├── ArXiv-NewsBrief-Q4_K_M.gguf         # ⭐ GGUF model (CPU-optimized)
│
├── scripts/
│   ├── dataset_generator.py            # Multi-LLM data generation
│   ├── sft_train.py                    # 3-mode QLoRA training
│   ├── sft_merge_gguf.py               # Merge & GGUF conversion
│   └── web.py                          # Streamlit web interface
│
├── data/
│   ├── processed/
│   │   ├── train.csv                   # 1,660 training samples
│   │   └── validation.csv              # 185 validation samples
│   └── teacher_prompt.md               # System prompt template
│
├── samples/
│   └── example_outputs.md              # Sample model outputs
│
├── docs/                               # Detailed documentation
│   ├── DATASET_CONSTRUCTION.md
│   ├── TRAINING_GUIDE.md
│   ├── LLM_JUDGE_GUIDE.md
│   └── UV_GUIDE.md
│
└── reports/
    ├── evaluation/                     # Quality assessment
    │   ├── chatgpt_result_v4.2.md
    │   ├── claude_result_v4.2.md
    │   └── LLM-as-a-Judge.md
    ├── training/                       # Training logs
    │   └── colab_train_result_v4.2.md
    └── colab/                          # Jupyter notebooks
        ├── Dataset_Generater_V4_2.ipynb
        ├── V4_0_SFT_DATASET_maker.ipynb
        └── DATA_Merger_GGUF.ipynb

📊 Evaluation

Automated Metrics

| Metric | Score | Description |
| --- | --- | --- |
| ROUGE-L | 0.42 | Longest common subsequence |
| ROUGE-1 | 0.48 | Unigram overlap |
| ROUGE-2 | 0.21 | Bigram overlap |
| BERTScore | 0.88 | Semantic similarity |
| Format Compliance | 98.9% | 2 sentences, ≤45 words |
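As an illustration of what "unigram overlap" means, here is a simplified ROUGE-1 F1 computation. It is a sketch only: it tokenizes on whitespace with lowercasing and does no stemming, so its values will not match a full ROUGE implementation exactly, and the README does not say which library produced the scores above.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """F1 over clipped unigram counts shared by candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(round(rouge1_f1("the cat sat", "the cat ran"), 3))
# 0.667
```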

LLM-as-a-Judge (100-point scale)

Format Evaluation (50 points):

  • Sentence count (exactly 2): 20 pts
  • Word count (≤45): 15 pts
  • No special characters: 10 pts
  • No prompt leakage: 5 pts

Content Evaluation (50 points):

  • Key contribution captured: 20 pts
  • Factual accuracy: 15 pts
  • Clarity for general audience: 10 pts
  • TTS readability: 5 pts

Average Score: 86/100
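The rubric above can be written down as a small data structure so that judge outputs are summed consistently. This is a hypothetical sketch of the aggregation only (the criterion keys and the `total_score` function are illustrative names); how the judge model assigns points to each criterion is not shown here.

```python
# Maximum points per criterion, copied from the rubric
RUBRIC = {
    "format": {
        "sentence_count": 20,
        "word_count": 15,
        "no_special_characters": 10,
        "no_prompt_leakage": 5,
    },
    "content": {
        "key_contribution": 20,
        "factual_accuracy": 15,
        "clarity": 10,
        "tts_readability": 5,
    },
}


def total_score(awarded: dict[str, float]) -> float:
    """Sum awarded points, rejecting values outside each criterion's range.
    Criteria missing from awarded count as 0."""
    total = 0.0
    for section in RUBRIC.values():
        for criterion, max_points in section.items():
            points = awarded.get(criterion, 0.0)
            if not 0 <= points <= max_points:
                raise ValueError(f"{criterion}: {points} not in 0..{max_points}")
            total += points
    return total


full_marks = {name: pts for section in RUBRIC.values() for name, pts in section.items()}
print(total_score(full_marks))
# 100.0
```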

Performance Benchmarks

Inference Speed

| Environment | Hardware | Speed | Throughput |
| --- | --- | --- | --- |
| CPU | 16 cores | 3-4s | 15-20/min |
| GPU | T4 (16GB) | 1s | 60/min |
| GPU | A100 | 0.5s | 120/min |
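The Throughput column is consistent with running briefings one after another, that is, 60 seconds divided by the time per briefing. A quick check, assuming sequential single-stream inference with no batching:

```python
def briefings_per_minute(seconds_per_briefing: float) -> float:
    """Sequential throughput for a given per-briefing latency."""
    return 60.0 / seconds_per_briefing


for label, latency in [
    ("CPU, slow end", 4.0),
    ("CPU, fast end", 3.0),
    ("GPU T4", 1.0),
    ("GPU A100", 0.5),
]:
    print(f"{label}: {briefings_per_minute(latency):.0f}/min")
# CPU, slow end: 15/min
# CPU, fast end: 20/min
# GPU T4: 60/min
# GPU A100: 120/min
```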

Model Sizes

| Format | Size | Memory | Quality | Use Case |
| --- | --- | --- | --- | --- |
| FP16 | 3.0GB | 6GB | 100% | Training |
| Merged | 3.0GB | 3GB | 100% | GPU inference |
| Q4_K_M | 0.9GB | 1GB | 98% | ⭐ CPU deployment |
| Q8_0 | 1.5GB | 2GB | 99% | High quality |

🎯 Use Cases

Science Communication

  • Quick briefings for press releases
  • Newsletter content generation
  • Social media posts about research

Journalism

  • Research trend monitoring
  • Story lead generation
  • Background research acceleration

Education

  • Latest discoveries for students
  • Curriculum updates
  • Academic summaries

General Public

  • Stay current with science
  • Understand research impact
  • Daily research highlights

🔄 Reproducibility

All training artifacts can be recreated:

# Step 1: Generate training data
python scripts/dataset_generator.py
# Output: data/processed/train.csv

# Step 2: Train LoRA
python scripts/sft_train.py
# Output: ./final_model/

# Step 3: Merge and convert
python scripts/sft_merge_gguf.py
# Output: ArXiv-NewsBrief-Q4_K_M.gguf

Cost: ~$0.50 (Google Colab A100)
Time: ~50 minutes

🛠️ Requirements

transformers>=4.46.0
torch>=2.0.0
peft>=0.13.0
bitsandbytes>=0.43.0
datasets>=3.2.0
accelerate>=1.2.0
llama-cpp-python  # For GGUF inference

🔧 Troubleshooting

CUDA Out of Memory

# Reduce batch size in scripts/sft_train.py
per_device_train_batch_size = 2  # default: 4

Slow CPU Inference

# Increase threads
n_threads = 16  # use all CPU cores
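Rather than hard-coding 16, the thread count can be derived from the machine. A small sketch using only the standard library (`pick_n_threads` is a hypothetical helper). Note that `os.cpu_count()` reports logical CPUs, so on machines with hyper-threading it may be worth trying a lower value as well.

```python
import os


def pick_n_threads(reserve: int = 0) -> int:
    """Logical CPU count minus a reserve for other work, never below 1."""
    available = os.cpu_count() or 1
    return max(1, available - reserve)


n_threads = pick_n_threads()
print(n_threads)
```

The result can be passed as `n_threads` when constructing `Llama(...)`.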

Missing API Key

# Create .env file
echo "GOOGLE_API_KEY=your-key" > .env

📚 Documentation

🚀 Future Roadmap

v4.3 (In Progress)

  • Expand training data to 5,000 pairs
  • Multi-domain specialization
  • Batch inference optimization

v5.0 (Planned)

  • Multilingual support (Korean, Spanish, French)
  • Topic-aware briefing styles
  • Mobile app development

🙏 Acknowledgments

  • Qwen2.5-1.5B: Alibaba Cloud
  • Gemini: Google DeepMind (data generation)
  • llama.cpp: Georgi Gerganov (GGUF runtime)
  • Transformers: Hugging Face
  • ArXiv: Cornell University

📜 License

Apache 2.0 - Free to use, modify, and distribute with attribution

See LICENSE for details.

📧 Contact

Author: Peace Cho (chopeaceus@gmail.com)

📖 Citation

If you use this model in your research, please cite:

@software{arxiv_newsbriefing_2025,
  author = {Peace Cho},
  title = {ArXiv NewsBriefing: Transform Academic Papers into News Briefings},
  year = {2025},
  version = {4.2},
  publisher = {Hugging Face},
  url = {https://huggingface.co/chopeace/arxiv-newsbriefing-v4.2}
}

🌟 Support

If you find this project useful:

  • ⭐ Star this repository
  • 🐛 Report issues on GitHub
  • 💬 Join discussions
  • 📢 Share with others

Built with ❤️ using Qwen2.5, QLoRA, and Hugging Face Transformers

Quick Start • Training • Evaluation • Documentation
