ArXiv NewsBriefing v4.2
Transform dense academic abstracts into clear, engaging news-style briefings that anyone can understand in 30 seconds.
Built on Qwen2.5-1.5B-Instruct, fine-tuned with QLoRA on 1,845 abstract/briefing pairs drawn from ArXiv papers.
Overview
This project converts ArXiv paper abstracts into accessible 2-sentence news briefings (45 words at most) in the style of NPR/BBC news broadcasts. It is designed for science communicators, journalists, educators, and anyone who wants to stay current with research.
Key Features
- Strict Format: Exactly 2 sentences, maximum 45 words
- News Style: NPR/BBC broadcasting tone
- Anti-Hallucination: No fabricated numbers or facts
- Fast Inference: 3-4s on CPU (GGUF), about 1s or less on GPU
- CPU Optimized: GGUF quantization for deployment
Performance
| Metric | Baseline | v4.2 | Improvement |
|---|---|---|---|
| ROUGE-1 | 0.42 | 0.48 | +14% |
| ROUGE-2 | 0.18 | 0.21 | +17% |
| Format Compliance | 67% | 98.9% | +48% |
| LLM Judge Score | 6.5/10 | 8.6/10 | +32% |
| Avg Word Count | 50.3 | 43.2 | 96% under limit |
| Factual Accuracy | - | 98% | Pass |
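The Improvement column appears to report relative change against the baseline rather than a difference in points. The short check below is a sketch under that assumption: it recomputes the four percentage rows from the table values. Note that for Format Compliance the relative gain is 48%, while the absolute gain is 31.9 percentage points.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# (baseline, v4.2) values copied from the table above.
metrics = {
    "ROUGE-1": (0.42, 0.48),
    "ROUGE-2": (0.18, 0.21),
    "Format Compliance (%)": (67.0, 98.9),
    "LLM Judge Score": (6.5, 8.6),
}

# Round to whole percent, as the table does.
improvements = {
    name: round(relative_improvement(baseline, new))
    for name, (baseline, new) in metrics.items()
}

for name, pct in improvements.items():
    print(f"{name}: +{pct}%")
```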
Quick Start
Prerequisites
- Python 3.11+
- 8GB RAM minimum (16GB recommended)
- For training: CUDA-capable GPU (T4/A100)
- For inference: CPU (GGUF) or GPU
Installation
# Clone repository
git clone https://huggingface.co/chopeace/arxiv-newsbriefing-v4.2
cd arxiv-newsbriefing-v4.2
# Install dependencies
pip install -r requirements.txt
Usage
Option 1: GGUF (CPU Inference, Recommended)
from llama_cpp import Llama
# Load GGUF model
model = Llama(
    model_path="ArXiv-NewsBrief-Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
)
# Summarize abstract
abstract = """
We present a novel approach to quantum computing using topological
qubits that achieves 99.9% gate fidelity, significantly improving
upon previous methods.
"""
result = model(
    f"Convert this ArXiv abstract into a 2-sentence news briefing: {abstract}",
    max_tokens=120,
    temperature=0.4,
)
print(result['choices'][0]['text'])
# Output: Scientists developed a new quantum computing method using
# topological qubits that achieves record 99.9% accuracy. The breakthrough
# could enable more reliable quantum computers for practical applications.
Option 2: Transformers (GPU)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load base model (on the GPU when one is available)
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# Apply LoRA adapters (after training)
# from peft import PeftModel
# model = PeftModel.from_pretrained(model, "./final_model")

# Prepare input
abstract = "Your research abstract here..."
system_message = "Convert this ArXiv abstract into a 2-sentence news briefing."
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": abstract},
]

# Tokenize with the chat template; add_generation_prompt opens the assistant turn
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    temperature=0.4,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
summary = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(summary)
Model Files
| File | Size | Format | Use Case | Status |
|---|---|---|---|---|
| ArXiv-NewsBrief-Q4_K_M.gguf | 0.9GB | GGUF Q4_K_M | CPU inference (recommended) | Available |
| ArXiv-NewsBrief-Q8_0.gguf | 1.5GB | GGUF Q8_0 | High quality | Coming soon |
Download & Usage
# GGUF model is included in this repository
# Or download directly:
wget https://huggingface.co/chopeace/arxiv-newsbriefing-v4.2/resolve/main/ArXiv-NewsBrief-Q4_K_M.gguf
Training from Scratch
1. Generate Training Data
# Set API key for teacher model (Gemini)
export GOOGLE_API_KEY="your-api-key"
# Generate dataset
python scripts/dataset_generator.py
Output: data/processed/train.csv (1,660 samples)
Configuration:
- Teacher Model: Gemini-3-27b-it
- Source: ArXiv abstracts (Physics, CS, Math)
- Validation: Automated format checking
2. Train Model
# Full training (MODE 1)
python scripts/sft_train.py
Training Settings:
- Base Model: Qwen/Qwen2.5-1.5B-Instruct
- Method: QLoRA (4-bit quantization)
- LoRA Config: r=16, α=32, dropout=0.05
- Training: 5 epochs, 2e-4 learning rate
- Hardware: Google Colab A100 (~50 minutes)
- Cost: ~$0.50
3-Mode Training System:
- MODE 0: Practice (50 samples) - Quick validation
- MODE 1: Full Training (1,845 samples) - Production
- MODE 2: Inference Only - CPU testing
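The three modes could be wired up roughly as follows. This is only an illustrative sketch: the `Mode` enum, `MODE_SETTINGS`, and `select_samples` are hypothetical names, and the actual logic in `scripts/sft_train.py` may differ.

```python
from enum import IntEnum


class Mode(IntEnum):
    PRACTICE = 0   # quick validation run
    FULL = 1       # production training
    INFERENCE = 2  # no training, CPU testing only


# Hypothetical per-mode settings; sample counts are taken from the list above.
MODE_SETTINGS = {
    Mode.PRACTICE: {"train": True, "max_samples": 50},
    Mode.FULL: {"train": True, "max_samples": 1845},
    Mode.INFERENCE: {"train": False, "max_samples": 0},
}


def select_samples(rows: list, mode: int) -> list:
    """Return the rows to train on for the given mode (empty when not training)."""
    settings = MODE_SETTINGS[Mode(mode)]
    if not settings["train"]:
        return []
    return rows[: settings["max_samples"]]
```

Passing an integer outside 0-2 raises `ValueError` from the enum lookup, which makes a mistyped mode fail early.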
3. Convert to GGUF
python scripts/sft_merge_gguf.py
Output:
- ArXiv-NewsBrief-Q4_K_M.gguf (0.9GB) - Recommended
- ArXiv-NewsBrief-Q8_0.gguf (1.5GB) - High quality
Technical Details
System Prompt
Convert this ArXiv abstract into a concise 2-sentence news briefing.
Requirements:
- Use clear, accessible language without technical jargon
- Maximum 45 words total
- Exactly 2 sentences
- Style: NPR/BBC news broadcast tone
- No hallucinations: only use information from the abstract
- Focus on practical implications and significance
Format:
[Sentence 1: Main finding/innovation in simple terms]
[Sentence 2: Significance/implications for practical applications]
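To reuse the prompt above at inference time, it can be paired with an abstract in the chat-message format that both usage options accept. The helper below is a minimal sketch; `build_messages` is a hypothetical name, and collapsing whitespace in the abstract is an added assumption rather than something the project documents.

```python
SYSTEM_PROMPT = """Convert this ArXiv abstract into a concise 2-sentence news briefing.
Requirements:
- Use clear, accessible language without technical jargon
- Maximum 45 words total
- Exactly 2 sentences
- Style: NPR/BBC news broadcast tone
- No hallucinations: only use information from the abstract
- Focus on practical implications and significance
Format:
[Sentence 1: Main finding/innovation in simple terms]
[Sentence 2: Significance/implications for practical applications]"""


def build_messages(abstract: str) -> list[dict[str, str]]:
    """Pair the system prompt with one abstract as chat messages."""
    # Assumption: collapse line breaks and repeated spaces copied from PDFs.
    cleaned = " ".join(abstract.split())
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": cleaned},
    ]
```

The returned list can be passed to `tokenizer.apply_chat_template` in Transformers or to `create_chat_completion` in llama-cpp-python.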
Training Configuration
# Base Model
base_model = "Qwen/Qwen2.5-1.5B-Instruct"
# LoRA Configuration
lora_r = 16
lora_alpha = 32
lora_dropout = 0.05
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
# Training
num_epochs = 5
learning_rate = 2e-4
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
max_seq_length = 512
# Generation
temperature = 0.4
top_k = 40
top_p = 0.9
max_new_tokens = 120
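As a rough sanity check on the settings above, the effective batch size, the approximate number of optimizer steps, and the LoRA scaling factor can be derived by hand. The sketch assumes a single GPU and the 1,660-row training split; the exact step count depends on how the trainer handles the final partial batch.

```python
import math

train_samples = 1660                 # rows in data/processed/train.csv
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_epochs = 5
num_devices = 1                      # assumption: one Colab GPU
lora_r, lora_alpha = 16, 32

# Samples consumed per optimizer update.
effective_batch = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Approximate optimizer updates, counting the last partial batch as a step.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs

# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_r

print(effective_batch, steps_per_epoch, total_steps, lora_scaling)
```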
Model Specifications
- Base: Qwen2.5-1.5B-Instruct (1.5B parameters)
- Context Length: 32K tokens
- Fine-tuning: QLoRA with 4-bit quantization
- Training Data: 1,845 pairs (90/10 train/val split)
- Quantization: Q4_K_M (4-bit) for GGUF
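A 90/10 split of 1,845 pairs yields the 1,660 training and 185 validation rows listed elsewhere in this README. The sketch below shows one way to produce such a split; the shuffle and the seed value are assumptions, since the project does not state how the split was drawn.

```python
import random


def split_pairs(pairs, train_fraction: float = 0.9, seed: int = 42):
    """Shuffle deterministically, then cut into train and validation lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # floor of 1660.5 is 1660
    return shuffled[:cut], shuffled[cut:]


train, val = split_pairs(range(1845))
print(len(train), len(val))  # 1660 185
```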
Repository Structure
arxiv-newsbriefing-v4.2/
├── README.md # This file
├── LICENSE # Apache 2.0
├── requirements.txt # Dependencies
├── QUICKSTART.md # Quick start guide
│
├── ArXiv-NewsBrief-Q4_K_M.gguf # GGUF model (CPU-optimized)
│
├── scripts/
│   ├── dataset_generator.py # Multi-LLM data generation
│   ├── sft_train.py # 3-mode QLoRA training
│   ├── sft_merge_gguf.py # Merge & GGUF conversion
│   └── web.py # Streamlit web interface
│
├── data/
│   ├── processed/
│   │   ├── train.csv # 1,660 training samples
│   │   └── validation.csv # 185 validation samples
│   └── teacher_prompt.md # System prompt template
│
├── samples/
│   └── example_outputs.md # Sample model outputs
│
├── docs/ # Detailed documentation
│   ├── DATASET_CONSTRUCTION.md
│   ├── TRAINING_GUIDE.md
│   ├── LLM_JUDGE_GUIDE.md
│   └── UV_GUIDE.md
│
└── reports/
    ├── evaluation/ # Quality assessment
    │   ├── chatgpt_result_v4.2.md
    │   ├── claude_result_v4.2.md
    │   └── LLM-as-a-Judge.md
    ├── training/ # Training logs
    │   └── colab_train_result_v4.2.md
    └── colab/ # Jupyter notebooks
        ├── Dataset_Generater_V4_2.ipynb
        ├── V4_0_SFT_DATASET_maker.ipynb
        └── DATA_Merger_GGUF.ipynb
Evaluation
Automated Metrics
| Metric | Score | Description |
|---|---|---|
| ROUGE-L | 0.42 | Longest common subsequence |
| ROUGE-1 | 0.48 | Unigram overlap |
| ROUGE-2 | 0.21 | Bigram overlap |
| BERTScore | 0.88 | Semantic similarity |
| Format Compliance | 98.9% | 2 sentences, ≤45 words |
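The format-compliance rule (exactly 2 sentences, at most 45 words) can be approximated with a small checker. This is a sketch, not the project's evaluation code: the function names are hypothetical, and the sentence counter is a simple punctuation heuristic that would miscount abbreviations such as "e.g." followed by a space.

```python
import re

MAX_WORDS = 45
REQUIRED_SENTENCES = 2


def count_sentences(text: str) -> int:
    """Count runs of . ! ? that are followed by whitespace or end of text."""
    return len(re.findall(r"[.!?]+(?=\s|$)", text.strip()))


def count_words(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def is_compliant(text: str) -> bool:
    """True when the briefing has exactly 2 sentences and at most 45 words."""
    return (
        count_sentences(text) == REQUIRED_SENTENCES
        and count_words(text) <= MAX_WORDS
    )
```

Because the pattern requires whitespace or end of text after the punctuation, a decimal such as "99.9%" is not treated as a sentence break.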
LLM-as-a-Judge (100-point scale)
Format Evaluation (50 points):
- Sentence count (exactly 2): 20 pts
- Word count (≤45): 15 pts
- No special characters: 10 pts
- No prompt leakage: 5 pts
Content Evaluation (50 points):
- Key contribution captured: 20 pts
- Factual accuracy: 15 pts
- Clarity for general audience: 10 pts
- TTS readability: 5 pts
Average Score: 86/100
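The rubric above sums to 100 points, so a judge's per-criterion verdicts can be combined into a single score. The sketch below assumes each criterion is reported as a fraction of its points between 0 and 1; the key names and the `total_score` helper are hypothetical, not the project's judge implementation.

```python
# Point weights copied from the rubric above (format: 50, content: 50).
RUBRIC = {
    "sentence_count": 20,
    "word_count": 15,
    "no_special_characters": 10,
    "no_prompt_leakage": 5,
    "key_contribution": 20,
    "factual_accuracy": 15,
    "clarity": 10,
    "tts_readability": 5,
}


def total_score(fractions: dict[str, float]) -> float:
    """Combine per-criterion fractions (0.0 to 1.0) into a 100-point score."""
    missing = set(RUBRIC) - set(fractions)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    # Clamp each fraction so a malformed judge output cannot exceed the weight.
    return sum(
        points * min(max(fractions[name], 0.0), 1.0)
        for name, points in RUBRIC.items()
    )
```

For example, full marks everywhere except half credit on "key_contribution" gives 90 points.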
Performance Benchmarks
Inference Speed
| Environment | Hardware | Speed | Throughput |
|---|---|---|---|
| CPU | 16 cores | 3-4s | 15-20/min |
| GPU | T4 (16GB) | 1s | 60/min |
| GPU | A100 | 0.5s | 120/min |
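The Throughput column follows directly from the per-briefing latency, assuming briefings are generated one at a time with no batching. A small sketch of that arithmetic (the function name is hypothetical):

```python
def briefings_per_minute(seconds_per_briefing: float) -> float:
    """Sequential throughput implied by a per-briefing latency."""
    if seconds_per_briefing <= 0:
        raise ValueError("latency must be positive")
    return 60.0 / seconds_per_briefing


# Latencies taken from the table above.
for label, latency in [("CPU (slow)", 4.0), ("CPU (fast)", 3.0),
                       ("T4", 1.0), ("A100", 0.5)]:
    print(f"{label}: {briefings_per_minute(latency):.0f}/min")
```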
Model Sizes
| Format | Size | Memory | Quality | Use Case |
|---|---|---|---|---|
| FP16 | 3.0GB | 6GB | 100% | Training |
| Merged | 3.0GB | 3GB | 100% | GPU inference |
| Q4_K_M | 0.9GB | 1GB | 98% | CPU deployment (recommended) |
| Q8_0 | 1.5GB | 2GB | 99% | High quality |
Use Cases
Science Communication
- Quick briefings for press releases
- Newsletter content generation
- Social media posts about research
Journalism
- Research trend monitoring
- Story lead generation
- Background research acceleration
Education
- Latest discoveries for students
- Curriculum updates
- Academic summaries
General Public
- Stay current with science
- Understand research impact
- Daily research highlights
Reproducibility
All training artifacts can be recreated:
# Step 1: Generate training data
python scripts/dataset_generator.py
# Output: data/processed/train.csv
# Step 2: Train LoRA
python scripts/sft_train.py
# Output: ./final_model/
# Step 3: Merge and convert
python scripts/sft_merge_gguf.py
# Output: ArXiv-NewsBrief-Q4_K_M.gguf
Cost: ~$0.50 (Google Colab A100)
Time: ~50 minutes
Requirements
transformers>=4.46.0
torch>=2.0.0
peft>=0.13.0
bitsandbytes>=0.43.0
datasets>=3.2.0
accelerate>=1.2.0
llama-cpp-python # For GGUF inference
Troubleshooting
CUDA Out of Memory
# Reduce batch size in scripts/sft_train.py
per_device_train_batch_size = 2 # default: 4
Slow CPU Inference
# Increase threads
n_threads = 16 # use all CPU cores
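Rather than hard-coding 16, the thread count can be derived from the machine at runtime. This is a minimal sketch with a hypothetical helper name; note that `os.cpu_count()` reports logical cores, and on some machines a value closer to the number of physical cores may run faster, so treat the result as a starting point to tune.

```python
import os


def pick_thread_count(reserve: int = 0) -> int:
    """Return logical core count minus `reserve`, never less than 1."""
    logical = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, logical - reserve)


n_threads = pick_thread_count()
print(f"n_threads = {n_threads}")
```

The resulting value can be passed as `n_threads` when constructing `Llama(...)`.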
Missing API Key
# Create .env file
echo "GOOGLE_API_KEY=your-key" > .env
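Whether `scripts/dataset_generator.py` reads `.env` itself is not stated here, so the following is only a dependency-free sketch of how the key could be resolved: it prefers the environment variable and falls back to a `KEY=value` line in `.env`. The `load_api_key` helper is hypothetical.

```python
import os
from pathlib import Path


def load_api_key(env_path: str = ".env", name: str = "GOOGLE_API_KEY") -> str:
    """Return the key from the environment, else from a .env file."""
    value = os.environ.get(name)
    if value:
        return value
    path = Path(env_path)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, raw = line.partition("=")
            if key.strip() == name:
                return raw.strip().strip("\"'")
    raise RuntimeError(f"{name} is not set; export it or add it to {env_path}")
```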
Documentation
- Quick Start: QUICKSTART.md
- Dataset Construction: docs/DATASET_CONSTRUCTION.md
- Training Guide: docs/TRAINING_GUIDE.md
- LLM Judge: docs/LLM_JUDGE_GUIDE.md
- Evaluation Results: reports/evaluation/
- Training Logs: reports/training/
Future Roadmap
v4.3 (In Progress)
- Expand training data to 5,000 pairs
- Multi-domain specialization
- Batch inference optimization
v5.0 (Planned)
- Multilingual support (Korean, Spanish, French)
- Topic-aware briefing styles
- Mobile app development
Acknowledgments
- Qwen2.5-1.5B: Alibaba Cloud
- Gemini: Google DeepMind (data generation)
- llama.cpp: Georgi Gerganov (GGUF runtime)
- Transformers: Hugging Face
- ArXiv: Cornell University
License
Apache 2.0 - Free to use, modify, and distribute with attribution
See LICENSE for details.
Contact
Author: Peace Cho (chopeaceus@gmail.com)
- GitHub: @chopeace
- Hugging Face: @chopeace
- Project: GitHub Repository
Citation
If you use this model in your research, please cite:
@software{arxiv_newsbriefing_2025,
author = {Peace Cho},
title = {ArXiv NewsBriefing: Transform Academic Papers into News Briefings},
year = {2025},
version = {4.2},
publisher = {Hugging Face},
url = {https://huggingface.co/chopeace/arxiv-newsbriefing-v4.2}
}
Support
If you find this project useful:
- Star this repository
- Report issues on GitHub
- Join discussions
- Share with others
Built with ❤️ using Qwen2.5, QLoRA, and Hugging Face Transformers
Quick Start • Training • Evaluation • Documentation