LearningToPresent-RL-Qwen-2.5-Coder-7B-Instruct-GRPO-Finetuned
A GRPO-finetuned LoRA adapter for Qwen2.5-Coder-7B-Instruct, trained to generate professional slide presentations through multi-turn tool use in a reinforcement learning environment.
The model reaches 91.2% of Claude Opus 4.6's quality score (0.724 vs. 0.794) and improves 33.1% over the untuned base model (0.544), while training only 0.53% of its parameters (40.4M of 7.62B).
Model Details
- Developed by: Karthik Ragunath Ananda Kumar (University of Texas at Dallas, Tavus Inc.), Subrahmanyam Arunachalam (Texas A&M University)
- Model type: LoRA adapter (PEFT) for causal language model
- Language: English
- License: Apache 2.0
- Base model: unsloth/qwen2.5-coder-7b-instruct-bnb-4bit (Qwen2.5-Coder-7B-Instruct, 4-bit quantized)
- Fine-tuning method: GRPO (Group Relative Policy Optimization) with LoRA
Model Sources
- Repository: https://github.com/pushing-the-frontier/slide-forge-llm
- Paper: Learning to Present: Inverse Specification Rewards for Agentic Slide Generation (pending arXiv submission)
- Dataset: KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts
Architecture
LoRA Configuration
| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 16 |
| LoRA dropout | 0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 40.4M (0.53% of 7.62B) |
| Base model precision | 4-bit (bitsandbytes NF4) |
| Adapter precision | bfloat16 |
| Task type | CAUSAL_LM |
Base Model Architecture
- Model: Qwen2.5-Coder-7B-Instruct (7.62B parameters)
- Layers: 28 transformer decoder blocks
- Attention: Grouped-Query Attention (28 query heads, 4 KV heads, head dim 128)
- FFN: SwiGLU (intermediate dim 18,944)
- Vocabulary: 151,936 tokens
- Context length: 8,192 tokens (training), 32,768 tokens (base model max)
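The trainable-parameter figure can be sanity-checked from the architecture numbers above: each LoRA pair adds r·(d_in + d_out) parameters per adapted projection. A quick back-of-envelope check:

```python
# Standard LoRA accounting for the seven target modules listed above,
# using Qwen2.5-Coder-7B dimensions from the architecture table.
r = 16
hidden = 28 * 128   # 28 query heads * head dim 128 = 3584
kv_dim = 4 * 128    # 4 KV heads * head dim 128 = 512
ffn = 18944         # SwiGLU intermediate dim
layers = 28

per_layer = (
    r * (hidden + hidden) * 2   # q_proj, o_proj
    + r * (hidden + kv_dim) * 2 # k_proj, v_proj
    + r * (hidden + ffn) * 3    # gate_proj, up_proj, down_proj
)
total = per_layer * layers
print(total)                            # 40,370,176 ~= 40.4M
print(round(100 * total / 7.62e9, 2))   # 0.53 (% of 7.62B)
```

Both printed values match the table: ~40.4M trainable parameters, 0.53% of the base model.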
Uses
Direct Use
This adapter is designed for agentic slide presentation generation through multi-turn tool use. The model generates JSON tool calls to interact with a slide generation environment (14 tools across 5 categories: research, content planning, design, deck structure, and meta).
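Each model turn is a JSON tool call consumed by the environment. The actual SlideForge tool schema is not reproduced in this card, so the tool name and arguments below are hypothetical; the sketch only illustrates the parse-and-dispatch shape of a turn:

```python
import json

# Hypothetical tool call: "add_slide" and its arguments are illustrative
# placeholders, not the real SlideForge interface.
raw_completion = '{"tool": "add_slide", "args": {"title": "Q3 Revenue", "layout": "title_and_chart"}}'

def parse_tool_call(text: str) -> tuple[str, dict]:
    """Parse one model turn into a (tool_name, arguments) pair."""
    call = json.loads(text)
    return call["tool"], call.get("args", {})

tool, args = parse_tool_call(raw_completion)
print(tool, args["title"])  # add_slide Q3 Revenue
```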
How to Load
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the base model with 4-bit NF4 quantization (matching training),
# then attach the LoRA adapter on top.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
model = PeftModel.from_pretrained(
    base_model,
    "KarthikRagunathAnandaKumar/LearningToPresent-RL-Qwen-2.5B-Coder-Instruct-GRPO-Finetuned",
)
```
Or with Unsloth for faster inference:
```python
from unsloth import FastLanguageModel

# Unsloth loads the 4-bit base model and applies the adapter in one call.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="KarthikRagunathAnandaKumar/LearningToPresent-RL-Qwen-2.5B-Coder-Instruct-GRPO-Finetuned",
    max_seq_length=8192,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable optimized inference kernels
```
Training Details
Training Data
Expert trajectories collected using Claude Opus 4.6 as the agent operating in the SlideForge environment. Each trajectory is a complete episode (20-35 turns) from research through finalization of a slide deck.
- 48 diverse business presentation briefs spanning financial reports, investor pitches, market analyses, technical reviews, and strategic planning
- Dataset: SlideRL — 288 full-episode trajectories (48 briefs × 6 models)
Training Procedure
- Algorithm: GRPO (Group Relative Policy Optimization)
- Reward system: Multi-component reward with 6 dimensions:
- Structural validation
- Render quality assessment
- LLM-based aesthetic scoring
- Content quality metrics
- Inverse specification reward (novel)
- Dense step rewards (quality deltas + action bonuses)
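The card does not specify how the six components are weighted, so the weights below are placeholders; a minimal sketch of a weighted-sum combination, assuming each component is scored in [0, 1]:

```python
# Illustrative weighted-sum combination of the six reward dimensions.
# The weights are NOT from the paper; they are placeholder values.
WEIGHTS = {
    "structural": 0.15,
    "render": 0.15,
    "aesthetic": 0.20,
    "content": 0.20,
    "inverse_spec": 0.20,
    "dense_step": 0.10,
}

def total_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each assumed in [0, 1]."""
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

print(round(total_reward({k: 1.0 for k in WEIGHTS}), 6))  # 1.0 for a perfect episode
```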
Training Hyperparameters
| Parameter | Value |
|---|---|
| Training regime | bf16 mixed precision |
| Learning rate | 5e-5 |
| Temperature | 1.0 |
| Num generations (per prompt) | 2 |
| Max completion length | 1,024 tokens |
| Max sequence length | 8,192 tokens |
| Total training steps | 200 (checkpoint) |
| Save steps | 25 |
| Gradient checkpointing | Unsloth optimized |
| Optimizer | AdamW |
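GRPO's defining step is the group-relative advantage: the rewards for the completions sampled from the same prompt (here 2, per the num generations setting above) are normalized against their own group mean and standard deviation. A minimal sketch, not the TRL trainer internals:

```python
# Group-relative advantage, the core of GRPO: normalize each reward
# against the other completions sampled from the same prompt.
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one prompt's generation group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# With two generations per prompt, the better completion gets ~+1 and
# the worse ~-1, regardless of the absolute reward scale.
print(group_advantages([0.7, 0.3]))
```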
Checkpoint Info
- Checkpoint step: 200
- Epoch: ~0.617
- Checkpoint size: ~321 MB (LoRA adapter only)
Evaluation
Evaluated on 48 diverse business briefs using the same environment and reward pipeline.
Results
| Model | Quality Score | vs. Claude Opus |
|---|---|---|
| Claude Opus 4.6 (expert) | 0.794 | 100% |
| Llama 4 Scout | 0.779 | 98.1% |
| SlideRL (this model) | 0.724 | 91.2% |
| Claude Sonnet 4.6 | 0.698 | 87.9% |
| Qwen 7B (base, untuned) | 0.544 | 68.5% |
| GPT OSS 120B | 0.249 | 31.4% |
Key findings:
- +33.1% improvement over the untuned base model
- Achieves 91.2% of the expert model (Claude Opus 4.6) that generated the training data
- Outperforms Claude Sonnet 4.6 despite being 7B parameters
- Demonstrates that instruction adherence and tool-use compliance matter more than raw parameter count
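The "vs. Claude Opus" column and the +33.1% figure follow directly from the quality scores in the table above:

```python
# Reproduce the relative-score column from the absolute quality scores.
scores = {
    "Claude Opus 4.6 (expert)": 0.794,
    "Llama 4 Scout": 0.779,
    "SlideRL (this model)": 0.724,
    "Claude Sonnet 4.6": 0.698,
    "Qwen 7B (base, untuned)": 0.544,
    "GPT OSS 120B": 0.249,
}
expert = scores["Claude Opus 4.6 (expert)"]
for model_name, s in scores.items():
    print(f"{model_name}: {100 * s / expert:.1f}%")
# SlideRL: 91.2% of expert; base: 68.5%; and 0.724/0.544 gives the
# 33.1% relative improvement over the untuned base model.
```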
Environmental Impact
- Hardware: Single NVIDIA GPU with 4-bit quantization
- Training run: 200 GRPO steps (~0.62 epochs) with gradient checkpointing
- Efficiency: Only 0.53% of parameters trained, base model in 4-bit reduces memory from ~15 GB to ~4 GB
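The ~15 GB vs. ~4 GB figure is consistent with a simple bytes-per-parameter estimate for the model weights alone (activations, KV cache, and optimizer state excluded):

```python
# Back-of-envelope memory estimate for the base model weights only.
params = 7.62e9
bf16_gb = params * 2 / 1e9    # 2 bytes per parameter
nf4_gb = params * 0.5 / 1e9   # ~4 bits per parameter, ignoring
                              # quantization block metadata overhead
print(f"bf16: {bf16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
# -> roughly 15.2 GB vs. 3.8 GB, matching the ~15 GB and ~4 GB above
```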
Citation
```bibtex
@article{anandakumar2026learning,
  title={Learning to Present: Inverse Specification Rewards for Agentic Slide Generation via Reinforcement Learning},
  author={Ananda Kumar, Karthik Ragunath and Arunachalam, Subrahmanyam},
  year={2026},
  note={Pending arXiv submission}
}
```
Framework Versions
- PEFT: 0.18.1
- Transformers: compatible with Qwen2 architecture
- TRL: GRPO trainer
- Unsloth: 4-bit quantization and optimized training
- Python: 3.10+