LearningToPresent-RL-Qwen-2.5-Coder-7B-Instruct-GRPO-Finetuned

A GRPO-finetuned LoRA adapter for Qwen2.5-Coder-7B-Instruct, trained to generate professional slide presentations through multi-turn tool use in a reinforcement learning environment.

The model achieves 91.2% of Claude Opus 4.6's quality score (0.724 vs. 0.794) and improves on the untuned base model (0.544) by 33.1%, while training only 0.53% of the parameters (40.4M of 7.62B).

Model Details

  • Developed by: Karthik Ragunath Ananda Kumar (University of Texas at Dallas, Tavus Inc.), Subrahmanyam Arunachalam (Texas A&M University)
  • Model type: LoRA adapter (PEFT) for causal language model
  • Language: English
  • License: Apache 2.0
  • Base model: unsloth/qwen2.5-coder-7b-instruct-bnb-4bit (Qwen2.5-Coder-7B-Instruct, 4-bit quantized)
  • Fine-tuning method: GRPO (Group Relative Policy Optimization) with LoRA

Architecture

LoRA Configuration

Parameter              Value
LoRA rank (r)          16
LoRA alpha             16
LoRA dropout           0
Target modules         q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable parameters   40.4M (0.53% of 7.62B)
Base model precision   4-bit (bitsandbytes NF4)
Adapter precision      bfloat16
Task type              CAUSAL_LM
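The table above maps directly onto a peft LoraConfig. A minimal sketch, illustrative rather than the exact training script:

```python
from peft import LoraConfig

# Sketch of a LoraConfig matching the table above; field names follow
# peft's public API, but this is an illustration, not the training code.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```

With alpha equal to the rank, the effective LoRA scaling factor (alpha / r) is 1.0.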

Base Model Architecture

  • Model: Qwen2.5-Coder-7B-Instruct (7.62B parameters)
  • Layers: 28 transformer decoder blocks
  • Attention: Grouped-Query Attention (28 query heads, 4 KV heads, head dim 128)
  • FFN: SwiGLU (intermediate dim 18,944)
  • Vocabulary: 151,936 tokens
  • Context length: 8,192 tokens (training), 32,768 tokens (base model max)
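The attention geometry above can be sanity-checked with a few lines of arithmetic (pure Python, no model download needed):

```python
# GQA geometry quoted above: 28 query heads, 4 KV heads, head dim 128.
num_q_heads, num_kv_heads, head_dim = 28, 4, 128

hidden_size = num_q_heads * head_dim      # query projection width
kv_width = num_kv_heads * head_dim        # shared K/V projection width
group_size = num_q_heads // num_kv_heads  # query heads sharing each KV head

print(hidden_size, kv_width, group_size)  # 3584 512 7
```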

Uses

Direct Use

This adapter is designed for agentic slide presentation generation through multi-turn tool use. The model generates JSON tool calls to interact with a slide generation environment (14 tools across 5 categories: research, content planning, design, deck structure, and meta).
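The SlideForge tool schema itself is not reproduced in this card; the snippet below only illustrates the general shape of a JSON tool call an agent might emit, with a hypothetical tool name and arguments:

```python
import json

# Hypothetical single-turn tool call; the tool name and argument names
# are invented for illustration, not the actual SlideForge schema.
raw_model_output = (
    '{"tool": "add_slide",'
    ' "arguments": {"title": "Q3 Revenue", "layout": "two_column"}}'
)

call = json.loads(raw_model_output)
print(call["tool"], call["arguments"]["layout"])  # add_slide two_column
```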

How to Load

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

model = PeftModel.from_pretrained(
    base_model,
    "KarthikRagunathAnandaKumar/LearningToPresent-RL-Qwen-2.5B-Coder-Instruct-GRPO-Finetuned",
)

Or with Unsloth for faster inference:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="KarthikRagunathAnandaKumar/LearningToPresent-RL-Qwen-2.5B-Coder-Instruct-GRPO-Finetuned",
    max_seq_length=8192,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

Training Details

Training Data

Expert trajectories collected using Claude Opus 4.6 as the agent operating in the SlideForge environment. Each trajectory is a complete episode (20-35 turns) from research through finalization of a slide deck.

  • 48 diverse business presentation briefs spanning financial reports, investor pitches, market analyses, technical reviews, and strategic planning
  • Dataset: SlideRL — 288 full-episode trajectories (48 briefs × 6 models)

Training Procedure

  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Reward system: Multi-component reward with 6 dimensions:
    • Structural validation
    • Render quality assessment
    • LLM-based aesthetic scoring
    • Content quality metrics
    • Inverse specification reward (novel)
    • Dense step rewards (quality deltas + action bonuses)
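The exact component weighting is not stated in this card; as a minimal sketch, a multi-component reward of this kind is typically aggregated as a weighted sum. The component keys and weights below are invented for illustration:

```python
# Hypothetical weighted aggregation of the six reward components listed
# above; the keys mirror the list, the weights are invented.
def total_reward(components: dict[str, float], weights: dict[str, float]) -> float:
    return sum(weights[name] * value for name, value in components.items())

weights = {
    "structural": 0.2, "render": 0.15, "aesthetic": 0.2,
    "content": 0.2, "inverse_spec": 0.15, "dense_step": 0.1,
}
components = {name: 0.8 for name in weights}  # dummy per-component scores

# Weights sum to 1.0, so uniform 0.8 scores give a total of 0.8.
print(round(total_reward(components, weights), 3))  # 0.8
```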

Training Hyperparameters

Parameter                      Value
Training regime                bf16 mixed precision
Learning rate                  5e-5
Temperature                    1.0
Num generations (per prompt)   2
Max completion length          1,024 tokens
Max sequence length            8,192 tokens
Total training steps           200 (checkpoint)
Save steps                     25
Gradient checkpointing         Unsloth optimized
Optimizer                      AdamW
Checkpoint Info

  • Checkpoint step: 200
  • Epoch: ~0.617
  • Checkpoint size: ~321 MB (LoRA adapter only)

Evaluation

Evaluated on 48 diverse business briefs using the same environment and reward pipeline.

Results

Model                      Quality score   vs. Claude Opus
Claude Opus 4.6 (expert)   0.794           100%
Llama 4 Scout              0.779           98.1%
SlideRL (this model)       0.724           91.2%
Claude Sonnet 4.6          0.698           87.9%
Qwen 7B (base, untuned)    0.544           68.5%
GPT OSS 120B               0.249           31.4%

Key findings:

  • +33.1% improvement over the untuned base model
  • Achieves 91.2% of the expert model (Claude Opus 4.6) that generated the training data
  • Outperforms Claude Sonnet 4.6 despite having only 7B parameters
  • Demonstrates that instruction adherence and tool-use compliance matter more than raw parameter count
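These percentages follow directly from the raw quality scores in the table:

```python
# Reproduce the relative figures quoted above from the raw quality scores.
opus, slide_rl, base = 0.794, 0.724, 0.544

print(round(slide_rl / opus * 100, 1))           # 91.2 (% of expert)
print(round(base / opus * 100, 1))               # 68.5 (% of expert, untuned base)
print(round((slide_rl - base) / base * 100, 1))  # 33.1 (% gain over base)
```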

Environmental Impact

  • Hardware: Single NVIDIA GPU with 4-bit quantization
  • Training compute: 200 GRPO steps with gradient checkpointing enabled
  • Efficiency: Only 0.53% of parameters trained, base model in 4-bit reduces memory from ~15 GB to ~4 GB
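A quick back-of-envelope check of these efficiency figures:

```python
# Verify the trainable-parameter fraction and the 4-bit memory estimate.
trainable, total = 40.4e6, 7.62e9

print(round(trainable / total * 100, 2))  # 0.53 (% of parameters trained)

# 4-bit weights take ~0.5 bytes/param, vs ~2 bytes/param in bf16 (~15 GB).
print(round(total * 0.5 / 1e9, 1))  # 3.8 (GB of 4-bit weight storage)
```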

Citation

@article{anandakumar2026learning,
  title={Learning to Present: Inverse Specification Rewards for Agentic Slide Generation via Reinforcement Learning},
  author={Ananda Kumar, Karthik Ragunath and Arunachalam, Subrahmanyam},
  year={2026}
}

Framework Versions

  • PEFT: 0.18.1
  • Transformers: compatible with Qwen2 architecture
  • TRL: GRPO trainer
  • Unsloth: 4-bit quantization and optimized training
  • Python: 3.10+