LearningToPresent-RL-Qwen-2.5-Coder-7B-Instruct-GRPO-Finetuned
A GRPO-finetuned LoRA adapter for Qwen2.5-Coder-7B-Instruct, trained to generate professional slide presentations through multi-turn tool use in a reinforcement learning environment.
The model achieves 91.2% of Claude Opus 4.6's quality score (0.724 vs. 0.794), a 33.1% improvement over the untuned base model (0.544), while training only 0.53% of the parameters (40.4M out of 7.62B).
Model Details
- Developed by: Karthik Ragunath Ananda Kumar (University of Texas at Dallas, Tavus Inc.), Subrahmanyam Arunachalam (Texas A&M University)
- Model type: LoRA adapter (PEFT) for causal language model
- Language: English
- License: Apache 2.0
- Base model: unsloth/qwen2.5-coder-7b-instruct-bnb-4bit (Qwen2.5-Coder-7B-Instruct, 4-bit quantized)
- Fine-tuning method: GRPO (Group Relative Policy Optimization) with LoRA
Model Sources
- Repository: https://github.com/pushing-the-frontier/slide-forge-llm
- Paper: Learning to Present: Inverse Specification Rewards for Agentic Slide Generation (pending arXiv submission)
- Dataset: KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts
Architecture
LoRA Configuration
| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 16 |
| LoRA dropout | 0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 40.4M (0.53% of 7.62B) |
| Base model precision | 4-bit (bitsandbytes NF4) |
| Adapter precision | bfloat16 |
| Task type | CAUSAL_LM |
Base Model Architecture
- Model: Qwen2.5-Coder-7B-Instruct (7.62B parameters)
- Layers: 28 transformer decoder blocks
- Attention: Grouped-Query Attention (28 query heads, 4 KV heads, head dim 128)
- FFN: SwiGLU (intermediate dim 18,944)
- Vocabulary: 151,936 tokens
- Context length: 8,192 tokens (training), 32,768 tokens (base model max)
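The trainable-parameter figure in the LoRA table can be cross-checked against the architecture numbers above. The sketch below is a rough calculation, not the authors' code. It assumes the hidden size equals query heads × head dim (28 × 128 = 3,584), that each adapted linear layer adds r × (d_in + d_out) parameters with no bias terms, and that all seven target modules are adapted in each of the 28 blocks. All variable names are illustrative.

```python
# Rough cross-check of the LoRA trainable-parameter count.
# Assumptions: hidden size = query heads * head dim; LoRA adds
# rank * (d_in + d_out) parameters per adapted linear layer, no biases.
RANK = 16
NUM_LAYERS = 28
HEAD_DIM = 128
HIDDEN = 28 * HEAD_DIM   # 28 query heads -> 3,584
KV_DIM = 4 * HEAD_DIM    # 4 KV heads -> 512
FFN_DIM = 18_944

# (input dim, output dim) of each target module inside one decoder block
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN_DIM),
    "up_proj": (HIDDEN, FFN_DIM),
    "down_proj": (FFN_DIM, HIDDEN),
}


def lora_params(d_in, d_out, rank=RANK):
    # Matrix A is (rank x d_in) and matrix B is (d_out x rank).
    return rank * (d_in + d_out)


per_block = sum(lora_params(d_in, d_out) for d_in, d_out in TARGET_SHAPES.values())
total = per_block * NUM_LAYERS
fraction = total / 7.62e9

print(f"{total:,} trainable parameters ({fraction:.2%} of 7.62B)")
# prints "40,370,176 trainable parameters (0.53% of 7.62B)"
```

Under these assumptions the count comes to roughly 40.4M, which agrees with the value reported in the table.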
Uses
Direct Use
This adapter is designed for agentic slide presentation generation through multi-turn tool use. The model generates JSON tool calls to interact with a slide generation environment (14 tools across 5 categories: research, content planning, design, deck structure, and meta).
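Because the model communicates with the environment through JSON tool calls, a caller typically has to pull a JSON object out of the generated text before dispatching it. The sketch below shows one way this could be done with the standard library. The call schema (`"tool"` and `"arguments"` keys) and the tool name `create_slide` are hypothetical placeholders; the actual SlideForge tool schema is defined in the project repository and may differ.

```python
import json


def extract_tool_call(text):
    """Return the first JSON object in `text` that looks like a tool call, else None.

    Hypothetical schema: {"tool": <str>, "arguments": <dict>}.
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj.get("tool"), str) and isinstance(obj.get("arguments"), dict):
            return obj
    return None


# Illustrative model output; the tool name and arguments are made up.
sample = 'Adding the title slide. {"tool": "create_slide", "arguments": {"title": "Q3 Results"}}'
call = extract_tool_call(sample)
print(call["tool"], call["arguments"])
```

Returning `None` for malformed output lets the calling loop decide whether to retry, penalize, or end the episode.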
How to Load
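import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the base model in 4-bit NF4 with bfloat16 compute, matching the configuration above.
# Passing load_in_4bit directly to from_pretrained is deprecated in recent Transformers releases.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Attach the LoRA adapter weights to the quantized base model.
model = PeftModel.from_pretrained(
    base_model,
    "KarthikRagunathAnandaKumar/LearningToPresent-RL-Qwen-2.5B-Coder-Instruct-GRPO-Finetuned",
)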
Or with Unsloth for faster inference:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="KarthikRagunathAnandaKumar/LearningToPresent-RL-Qwen-2.5B-Coder-Instruct-GRPO-Finetuned",
    max_seq_length=8192,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
Training Details
Training Data
Expert trajectories were collected with Claude Opus 4.6 acting as the agent in the SlideForge environment. Each trajectory is a complete episode (20-35 turns), from research through finalization of a slide deck.
- 48 diverse business presentation briefs spanning financial reports, investor pitches, market analyses, technical reviews, and strategic planning
- Dataset: SlideRL, with 288 full-episode trajectories (48 briefs × 6 models)
Training Procedure
- Algorithm: GRPO (Group Relative Policy Optimization)
- Reward system: Multi-component reward with 6 dimensions:
  - Structural validation
  - Render quality assessment
  - LLM-based aesthetic scoring
  - Content quality metrics
  - Inverse specification reward (novel)
  - Dense step rewards (quality deltas + action bonuses)
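GRPO scores each completion relative to the other completions sampled for the same prompt, so no separate value model is needed. The sketch below illustrates the commonly described group-normalized advantage; it is not the training code used for this adapter. The reward values are made up, and the choice of population standard deviation and the `eps` constant are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two completions per prompt, as in the hyperparameter table; rewards are illustrative.
advantages = group_relative_advantages([0.70, 0.50])
print(advantages)
```

With a group of two, the better completion always receives a positive advantage and the worse one an equal negative advantage, and two completions with identical rewards contribute no learning signal.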
Training Hyperparameters
| Parameter | Value |
|---|---|
| Training regime | bf16 mixed precision |
| Learning rate | 5e-5 |
| Temperature | 1.0 |
| Num generations (per prompt) | 2 |
| Max completion length | 1,024 tokens |
| Max sequence length | 8,192 tokens |
| Total training steps | 200 (checkpoint) |
| Save steps | 25 |
| Gradient checkpointing | Unsloth optimized |
| Optimizer | AdamW |
Checkpoint Info
- Checkpoint step: 200
- Epoch: ~0.617
- Checkpoint size: ~321 MB (LoRA adapter only)
Evaluation
Evaluated on 48 diverse business briefs using the same environment and reward pipeline.
Results
| Model | Quality Score | vs. Claude Opus |
|---|---|---|
| Claude Opus 4.6 (expert) | 0.794 | 100% |
| Llama 4 Scout | 0.779 | 98.1% |
| SlideRL (this model) | 0.724 | 91.2% |
| Claude Sonnet 4.6 | 0.698 | 87.9% |
| Qwen 7B (base, untuned) | 0.544 | 68.5% |
| GPT OSS 120B | 0.249 | 31.4% |
Key findings:
- A 33.1% relative improvement over the untuned base model (0.544 to 0.724)
- Achieves 91.2% of the quality score of the expert model (Claude Opus 4.6) that generated the training data
- Outperforms Claude Sonnet 4.6 despite having only 7B parameters
- Suggests that instruction adherence and tool-use compliance matter more than raw parameter count: the 120B-parameter GPT OSS model scores lowest (0.249)
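The percentages in the results table and key findings follow directly from the raw quality scores. The short sketch below recomputes them. The dictionary keys copy the table labels, and the variable names are illustrative.

```python
# Quality scores copied from the results table.
scores = {
    "Claude Opus 4.6 (expert)": 0.794,
    "Llama 4 Scout": 0.779,
    "SlideRL (this model)": 0.724,
    "Claude Sonnet 4.6": 0.698,
    "Qwen 7B (base, untuned)": 0.544,
    "GPT OSS 120B": 0.249,
}

reference = scores["Claude Opus 4.6 (expert)"]

# Score of each model as a percentage of the expert's score.
relative = {name: round(100 * s / reference, 1) for name, s in scores.items()}

# Relative improvement of the fine-tuned adapter over the untuned base model.
tuned = scores["SlideRL (this model)"]
base = scores["Qwen 7B (base, untuned)"]
gain = round(100 * (tuned - base) / base, 1)

for name, pct in relative.items():
    print(f"{name}: {pct}%")
print(f"Improvement over base: +{gain}%")
```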
Environmental Impact
- Hardware: Single NVIDIA GPU with 4-bit quantization
- Training time: ~200 steps with gradient checkpointing
- Efficiency: Only 0.53% of parameters are trained, and loading the base model in 4-bit reduces its memory footprint from ~15 GB to ~4 GB
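The memory figures above are consistent with a back-of-the-envelope estimate of weight storage alone. The sketch below assumes 16 bits per parameter for an unquantized bf16 model and 4 bits per parameter for NF4, with 1 GB taken as 10^9 bytes. It ignores quantization constants, activations, the KV cache, and the adapter weights, so real usage will be somewhat higher.

```python
PARAMS = 7.62e9  # base model parameter count from the model card


def weight_memory_gb(num_params, bits_per_param):
    # Bytes needed to store the weights only, converted to GB (1 GB = 1e9 bytes).
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(PARAMS, 16)
nf4_gb = weight_memory_gb(PARAMS, 4)

print(f"bf16 weights: ~{bf16_gb:.2f} GB")  # ~15.24 GB
print(f"4-bit weights: ~{nf4_gb:.2f} GB")  # ~3.81 GB
```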
Citation
@article{anandakumar2026learning,
  title={Learning to Present: Inverse Specification Rewards for Agentic Slide Generation via Reinforcement Learning},
  author={Ananda Kumar, Karthik Ragunath and Arunachalam, Subrahmanyam},
  year={2026}
}
Framework Versions
- PEFT: 0.18.1
- Transformers: compatible with Qwen2 architecture
- TRL: GRPO trainer
- Unsloth: 4-bit quantization and optimized training
- Python: 3.10+