📰 ArXiv NewsBriefing v4.2

Transform dense academic abstracts into clear, engaging news-style briefings that anyone can understand in 30 seconds.

Built on Qwen2.5-1.5B-Instruct, fine-tuned with QLoRA on 1,845 abstract-briefing pairs from ArXiv papers.

🎯 Overview

This project converts ArXiv abstracts into accessible 2-sentence news briefings (at most 45 words) in the style of NPR/BBC news broadcasts. It is designed for science communicators, journalists, educators, and anyone who wants to stay current with research.

Key Features

🎯 Strict Format: Exactly 2 sentences, maximum 45 words
📰 News Style: NPR/BBC broadcasting tone
🚫 Anti-Hallucination: Trained to use only information from the abstract (98% factual accuracy in evaluation)
⚡ Fast Inference: 3-4s on CPU (GGUF), 0.5-1s on GPU
🧠 CPU Optimized: GGUF quantization for deployment

📊 Performance

Metric	Baseline	v4.2	Improvement
ROUGE-1	0.42	0.48	+14%
ROUGE-2	0.18	0.21	+17%
Format Compliance	67%	98.9%	+48%
LLM Judge Score	6.5/10	8.6/10	+32%
Avg Word Count	50.3	43.2	96% under limit
Factual Accuracy	-	98%	✅ Pass
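
The Improvement column appears to report relative change over the baseline rather than percentage points. A minimal sketch under that assumption (the helper name `relative_improvement` is illustrative and not part of this repository):

```python
def relative_improvement(baseline: float, new: float) -> int:
    """Relative change over the baseline, as a whole-number percentage."""
    return round((new - baseline) / baseline * 100)

# (baseline, v4.2) pairs copied from the table above
results = {
    "ROUGE-1": (0.42, 0.48),
    "ROUGE-2": (0.18, 0.21),
    "Format Compliance": (67.0, 98.9),
    "LLM Judge Score": (6.5, 8.6),
}

improvements = {name: relative_improvement(b, n) for name, (b, n) in results.items()}
for name, pct in improvements.items():
    print(f"{name}: +{pct}%")
```

Read this way, Format Compliance rose by about 32 percentage points, which is a 48% relative gain.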

🚀 Quick Start

Prerequisites

Python 3.11+
8GB RAM minimum (16GB recommended)
For training: CUDA-capable GPU (T4/A100)
For inference: CPU (GGUF) or GPU

Installation

# Clone repository (Git LFS is needed to fetch the GGUF file)
git lfs install
git clone https://huggingface.co/chopeace/arxiv-newsbriefing-v4.2
cd arxiv-newsbriefing-v4.2

# Install dependencies
pip install -r requirements.txt

Usage

Option 1: GGUF (CPU Inference) ⭐ Recommended

from llama_cpp import Llama

# Load GGUF model
model = Llama(
    model_path="ArXiv-NewsBrief-Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8
)

# Summarize abstract
abstract = """
We present a novel approach to quantum computing using topological 
qubits that achieves 99.9% gate fidelity, significantly improving 
upon previous methods.
"""

result = model(
    f"Convert this ArXiv abstract into a 2-sentence news briefing: {abstract}",
    max_tokens=120,
    temperature=0.4
)

print(result['choices'][0]['text'])
# Example output (will vary with sampling): Scientists developed a new quantum computing method using
# topological qubits that achieves record 99.9% accuracy. The breakthrough 
# could enable more reliable quantum computers for practical applications.
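
The call above sends a raw prompt string. Because the base model is an instruct model, a chat-formatted request may match the fine-tuning format more closely. The sketch below uses llama-cpp-python's `create_chat_completion`. It assumes the GGUF file carries a usable chat template in its metadata, and the helper names `build_messages` and `brief` are illustrative, not part of this repository.

```python
SYSTEM_PROMPT = "Convert this ArXiv abstract into a 2-sentence news briefing."

def build_messages(abstract: str) -> list[dict]:
    """Wrap an abstract in system/user chat messages, collapsing whitespace."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": " ".join(abstract.split())},
    ]

def brief(abstract: str, model_path: str = "ArXiv-NewsBrief-Q4_K_M.gguf") -> str:
    """Generate a briefing through the chat-completion API."""
    # Imported lazily so build_messages() works without llama-cpp-python installed
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048, n_threads=8, verbose=False)
    out = llm.create_chat_completion(
        messages=build_messages(abstract),
        max_tokens=120,
        temperature=0.4,
        top_p=0.9,
        top_k=40,
    )
    return out["choices"][0]["message"]["content"].strip()
```

Calling `brief(abstract)` returns only the assistant's reply, so no prompt text needs to be stripped from the result.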

Option 2: Transformers (GPU)

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load base model (device_map="auto" places it on the GPU when one is available)
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Apply LoRA adapters (after training)
# from peft import PeftModel
# model = PeftModel.from_pretrained(model, "./final_model")

# Prepare input
abstract = "Your research abstract here..."
system_message = "Convert this ArXiv abstract into a 2-sentence news briefing."

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": abstract}
]

# Tokenize with the chat template; add_generation_prompt opens the assistant turn
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    temperature=0.4,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens, not the echoed prompt
prompt_length = inputs["input_ids"].shape[-1]
summary = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(summary)

📦 Model Files

File	Size	Format	Use Case	Status
ArXiv-NewsBrief-Q4_K_M.gguf	0.9GB	GGUF Q4_K_M	⭐ CPU inference	✅ Available
ArXiv-NewsBrief-Q8_0.gguf	1.5GB	GGUF Q8_0	High quality	Coming soon

Download & Usage

# GGUF model is included in this repository
# Or download directly:
wget https://huggingface.co/chopeace/arxiv-newsbriefing-v4.2/resolve/main/ArXiv-NewsBrief-Q4_K_M.gguf

🏗️ Training from Scratch

1. Generate Training Data

# Set API key for teacher model (Gemini)
export GOOGLE_API_KEY="your-api-key"

# Generate dataset
python scripts/dataset_generator.py

Output: data/processed/train.csv (1,660 samples)

Configuration:

Teacher Model: Gemma-3-27b-it
Source: ArXiv abstracts (Physics, CS, Math)
Validation: Automated format checking
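
The automated format check could look roughly like the sketch below. It is an assumption about how such a filter might work, not the code in `scripts/dataset_generator.py`. The function names and the naive sentence splitter are illustrative.

```python
import re

MAX_WORDS = 45
REQUIRED_SENTENCES = 2

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def passes_format(briefing: str) -> bool:
    """True if the briefing has exactly 2 sentences and at most 45 words."""
    return (
        len(split_sentences(briefing)) == REQUIRED_SENTENCES
        and len(briefing.split()) <= MAX_WORDS
    )
```

Decimals such as "99.9%" are not split because no whitespace follows the period. Abbreviations like "e.g." would be miscounted, so a production filter would likely need a more careful splitter.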

2. Train Model

# Full training (MODE 1)
python scripts/sft_train.py

Training Settings:

Base Model: Qwen/Qwen2.5-1.5B-Instruct
Method: QLoRA (4-bit quantization)
LoRA Config: r=16, α=32, dropout=0.05
Training: 5 epochs, 2e-4 learning rate
Hardware: Google Colab A100 (~50 minutes)
Cost: ~$0.50

3-Mode Training System:

MODE 0: Practice (50 samples) - Quick validation
MODE 1: Full Training (1,845 samples: 1,660 train + 185 validation) - Production
MODE 2: Inference Only - CPU testing

3. Convert to GGUF

python scripts/sft_merge_gguf.py

Output:

ArXiv-NewsBrief-Q4_K_M.gguf (0.9GB) - Recommended
ArXiv-NewsBrief-Q8_0.gguf (1.5GB) - High quality

🔬 Technical Details

System Prompt

Convert this ArXiv abstract into a concise 2-sentence news briefing.

Requirements:
- Use clear, accessible language without technical jargon
- Maximum 45 words total
- Exactly 2 sentences
- Style: NPR/BBC news broadcast tone
- No hallucinations: only use information from the abstract
- Focus on practical implications and significance

Format:
[Sentence 1: Main finding/innovation in simple terms]
[Sentence 2: Significance/implications for practical applications]

Training Configuration

# Base Model
base_model = "Qwen/Qwen2.5-1.5B-Instruct"

# LoRA Configuration
lora_r = 16
lora_alpha = 32
lora_dropout = 0.05
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Training
num_epochs = 5
learning_rate = 2e-4
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
max_seq_length = 512

# Generation
temperature = 0.4
top_k = 40
top_p = 0.9
max_new_tokens = 120
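
A few quantities follow from these settings. The sketch below works them out, assuming a single GPU and the 1,660-row train.csv as the training set (both are assumptions about the setup). The step count the trainer reports may differ slightly depending on how it handles the final partial batch.

```python
import math

# Values copied from the configuration above
lora_r, lora_alpha = 16, 32
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_epochs = 5

# Assumptions: one GPU, and the 1,660-row train.csv as the training set
num_gpus = 1
train_samples = 1660

lora_scaling = lora_alpha / lora_r  # LoRA scales its low-rank update by alpha / r
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(lora_scaling, effective_batch, steps_per_epoch, total_steps)
# 2.0 8 208 1040
```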

Model Specifications

Base: Qwen2.5-1.5B-Instruct (1.5B parameters)
Context Length: 32K tokens
Fine-tuning: QLoRA with 4-bit quantization
Training Data: 1,845 pairs (90/10 train/val split)
Quantization: Q4_K_M (4-bit) for GGUF
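
The 90/10 split of 1,845 pairs yields the 1,660 / 185 counts listed elsewhere in this README. The sketch below shows one way to produce such a split reproducibly. The seed and the function name are illustrative, and the repository's actual split logic may differ.

```python
import random

def train_val_split(pairs: list, val_fraction: float = 0.1, seed: int = 42):
    """Shuffle deterministically, then hold out val_fraction for validation."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * (1 - val_fraction))
    return shuffled[:n_train], shuffled[n_train:]

train, val = train_val_split(list(range(1845)))
print(len(train), len(val))
# 1660 185
```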

📁 Repository Structure

arxiv-newsbriefing-v4.2/
├── README.md                           # This file
├── LICENSE                             # Apache 2.0
├── requirements.txt                    # Dependencies
├── QUICKSTART.md                       # Quick start guide
│
├── ArXiv-NewsBrief-Q4_K_M.gguf        # ⭐ GGUF model (CPU-optimized)
│
├── scripts/
│   ├── dataset_generator.py           # Multi-LLM data generation
│   ├── sft_train.py                   # 3-mode QLoRA training
│   ├── sft_merge_gguf.py              # Merge & GGUF conversion
│   └── web.py                         # Streamlit web interface
│
├── data/
│   ├── processed/
│   │   ├── train.csv                  # 1,660 training samples
│   │   └── validation.csv             # 185 validation samples
│   └── teacher_prompt.md              # System prompt template
│
├── samples/
│   └── example_outputs.md             # Sample model outputs
│
├── docs/                               # Detailed documentation
│   ├── DATASET_CONSTRUCTION.md
│   ├── TRAINING_GUIDE.md
│   ├── LLM_JUDGE_GUIDE.md
│   └── UV_GUIDE.md
│
└── reports/
    ├── evaluation/                     # Quality assessment
    │   ├── chatgpt_result_v4.2.md
    │   ├── claude_result_v4.2.md
    │   └── LLM-as-a-Judge.md
    ├── training/                       # Training logs
    │   └── colab_train_result_v4.2.md
    └── colab/                          # Jupyter notebooks
        ├── Dataset_Generater_V4_2.ipynb
        ├── V4_0_SFT_DATASET_maker.ipynb
        └── DATA_Merger_GGUF.ipynb

📊 Evaluation

Automated Metrics

Metric	Score	Description
ROUGE-L	0.42	Longest common subsequence
ROUGE-1	0.48	Unigram overlap
ROUGE-2	0.21	Bigram overlap
BERTScore	0.88	Semantic similarity
Format Compliance	98.9%	2 sentences, ≤45 words
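
For readers unfamiliar with ROUGE-N, here is a minimal pure-Python sketch of the F1 variant. It uses lowercased whitespace tokens with no stemming, so its numbers will not exactly match a standard ROUGE package. This README does not state which implementation or variant (precision, recall, or F1) produced the scores above.

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """F1 over clipped n-gram overlap between a reference and a candidate."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # '&' keeps the minimum count per n-gram
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_f1(reference, candidate, n=1))
print(rouge_n_f1(reference, candidate, n=2))
```

In this toy pair, 5 of 6 unigrams and 3 of 5 bigrams overlap.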

LLM-as-a-Judge (100-point scale)

Format Evaluation (50 points):

Sentence count (exactly 2): 20 pts
Word count (≤45): 15 pts
No special characters: 10 pts
No prompt leakage: 5 pts

Content Evaluation (50 points):

Key contribution captured: 20 pts
Factual accuracy: 15 pts
Clarity for general audience: 10 pts
TTS readability: 5 pts

Average Score: 86/100
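
The rubric above can be encoded as a simple point table. The sketch below is illustrative only: the key names and the sample scores are made up for the example and are not actual judge output.

```python
FORMAT_RUBRIC = {
    "sentence_count": 20,     # exactly 2 sentences
    "word_count": 15,         # at most 45 words
    "no_special_chars": 10,
    "no_prompt_leakage": 5,
}
CONTENT_RUBRIC = {
    "key_contribution": 20,
    "factual_accuracy": 15,
    "clarity": 10,
    "tts_readability": 5,
}
RUBRIC = {**FORMAT_RUBRIC, **CONTENT_RUBRIC}

def total_score(awarded: dict[str, float]) -> float:
    """Sum judge-awarded points, rejecting values outside each criterion's range."""
    unknown = set(awarded) - set(RUBRIC)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    total = 0.0
    for criterion, max_points in RUBRIC.items():
        points = awarded.get(criterion, 0.0)
        if not 0 <= points <= max_points:
            raise ValueError(f"{criterion} must be between 0 and {max_points}")
        total += points
    return total

# Hypothetical scores for one briefing (not real evaluation data)
example = {
    "sentence_count": 20, "word_count": 15, "no_special_chars": 10, "no_prompt_leakage": 5,
    "key_contribution": 16, "factual_accuracy": 12, "clarity": 6, "tts_readability": 2,
}
print(total_score(example))
# 86.0
```

Dividing the 100-point total by 10 gives the 10-point figure used in the Performance table (86/100 corresponds to 8.6/10).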

Performance Benchmarks

Inference Speed

Environment	Hardware	Time per briefing	Throughput (briefings)
CPU	16 cores	3-4s	15-20/min
GPU	T4 (16GB)	1s	60/min
GPU	A100	0.5s	120/min
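
The Throughput column follows directly from the per-briefing time, assuming requests are processed one at a time with no batching:

```python
def briefings_per_minute(seconds_per_briefing: float) -> float:
    """Sequential throughput implied by a per-briefing latency."""
    return 60.0 / seconds_per_briefing

# Latencies copied from the table above (CPU shown at both ends of its 3-4s range)
latencies = {"CPU (4s)": 4.0, "CPU (3s)": 3.0, "T4": 1.0, "A100": 0.5}
throughput = {name: briefings_per_minute(s) for name, s in latencies.items()}
print(throughput)
# {'CPU (4s)': 15.0, 'CPU (3s)': 20.0, 'T4': 60.0, 'A100': 120.0}
```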

Model Sizes

Format	Size	Memory	Quality	Use Case
FP16	3.0GB	6GB	100%	Training
Merged	3.0GB	3GB	100%	GPU inference
Q4_K_M	0.9GB	1GB	98%	⭐ CPU deployment
Q8_0	1.5GB	2GB	99%	High quality
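
As a rough sanity check on the sizes above, file size scales with parameter count times bits per weight. The sketch below uses the nominal 1.5B parameter count and the rounded sizes from the table. Real GGUF files also carry metadata and mix precisions across tensors, so treat the results as estimates.

```python
PARAMS = 1.5e9  # nominal parameter count, taken from the model name

def size_gb(bits_per_weight: float) -> float:
    """Approximate weight storage in GB at a given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

def implied_bits_per_weight(file_gb: float) -> float:
    """Average bits per weight implied by a file size in GB."""
    return file_gb * 1e9 * 8 / PARAMS

print(size_gb(16))                              # FP16 weights
print(round(implied_bits_per_weight(0.9), 2))   # Q4_K_M row
print(round(implied_bits_per_weight(1.5), 2))   # Q8_0 row
```

With the table's figures, FP16 comes to 3.0 GB, and the Q4_K_M and Q8_0 files work out to roughly 4.8 and 8 bits per weight.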

🎯 Use Cases

Science Communication

Quick briefings for press releases
Newsletter content generation
Social media posts about research

Journalism

Research trend monitoring
Story lead generation
Background research acceleration

Education

Latest discoveries for students
Curriculum updates
Academic summaries

General Public

Stay current with science
Understand research impact
Daily research highlights

🔄 Reproducibility

All training artifacts can be recreated:

# Step 1: Generate training data
python scripts/dataset_generator.py
# Output: data/processed/train.csv

# Step 2: Train LoRA
python scripts/sft_train.py
# Output: ./final_model/

# Step 3: Merge and convert
python scripts/sft_merge_gguf.py
# Output: ArXiv-NewsBrief-Q4_K_M.gguf

Cost: ~$0.50 (Google Colab A100)
Time: ~50 minutes

🛠️ Requirements

transformers>=4.46.0
torch>=2.0.0
peft>=0.13.0
bitsandbytes>=0.43.0
datasets>=3.2.0
accelerate>=1.2.0
llama-cpp-python  # For GGUF inference

🔧 Troubleshooting

CUDA Out of Memory

# Reduce batch size in scripts/sft_train.py
per_device_train_batch_size = 2  # default: 4
gradient_accumulation_steps = 4  # default: 2; keeps the effective batch size at 8

Slow CPU Inference

# Increase threads (the n_threads argument to Llama)
n_threads = 16  # match your physical CPU core count; the example above uses 8

Missing API Key

# Create .env file
echo "GOOGLE_API_KEY=your-key" > .env
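
How `scripts/dataset_generator.py` reads the key is not shown in this README. As a hedged sketch, a loader that prefers the environment variable and falls back to a simple `.env` file could look like this (the function name `load_api_key` is hypothetical):

```python
import os
from pathlib import Path

def load_api_key(name: str = "GOOGLE_API_KEY", env_file: str = ".env") -> str:
    """Return the key from the environment, falling back to a simple .env file."""
    value = os.environ.get(name)
    if value:
        return value
    path = Path(env_file)
    if path.is_file():
        for line in path.read_text().splitlines():
            line = line.strip()
            if line.startswith("#") or "=" not in line:
                continue  # skip comments and malformed lines
            key, _, val = line.partition("=")
            if key.strip() == name:
                return val.strip().strip('"').strip("'")
    raise RuntimeError(f"{name} is not set; export it or add it to {env_file}")
```

Failing early with a clear message avoids a confusing authentication error partway through dataset generation.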

📚 Documentation

Quick Start: QUICKSTART.md
Dataset Construction: docs/DATASET_CONSTRUCTION.md
Training Guide: docs/TRAINING_GUIDE.md
LLM Judge: docs/LLM_JUDGE_GUIDE.md
Evaluation Results: reports/evaluation/
Training Logs: reports/training/

🚀 Future Roadmap

v4.3 (In Progress)

Expand training data to 5,000 pairs
Multi-domain specialization
Batch inference optimization

v5.0 (Planned)

Multilingual support (Korean, Spanish, French)
Topic-aware briefing styles
Mobile app development

🙏 Acknowledgments

Qwen2.5-1.5B: Alibaba Cloud
Gemini: Google DeepMind (data generation)
llama.cpp: Georgi Gerganov (GGUF runtime)
Transformers: Hugging Face
ArXiv: Cornell University

📜 License

Apache 2.0 - Free to use, modify, and distribute with attribution

See LICENSE for details.

📧 Contact

Author: Peace Cho (chopeaceus@gmail.com)

GitHub: @chopeace
Hugging Face: @chopeace
Project: GitHub Repository

📖 Citation

If you use this model in your research, please cite:

@software{arxiv_newsbriefing_2025,
  author = {Peace Cho},
  title = {ArXiv NewsBriefing: Transform Academic Papers into News Briefings},
  year = {2025},
  version = {4.2},
  publisher = {Hugging Face},
  url = {https://huggingface.co/chopeace/arxiv-newsbriefing-v4.2}
}

🌟 Support

If you find this project useful:

⭐ Star this repository
🐛 Report issues on GitHub
💬 Join discussions
📢 Share with others

Built with ❤️ using Qwen2.5, QLoRA, and Hugging Face Transformers

