LearningToPresent-RL-Qwen-2.5-Coder-7B-Instruct-GRPO-Finetuned

A GRPO-finetuned LoRA adapter for Qwen2.5-Coder-7B-Instruct, trained to generate professional slide presentations through multi-turn tool use in a reinforcement learning environment.

The model achieves 91.2% of Claude Opus 4.6's quality score (0.724 vs. 0.794) and improves on the untuned base model (0.544) by 33.1%, while training only 0.53% of the parameters (40.4M of 7.62B).

Model Details

  • Developed by: Karthik Ragunath Ananda Kumar (University of Texas at Dallas, Tavus Inc.), Subrahmanyam Arunachalam (Texas A&M University)
  • Model type: LoRA adapter (PEFT) for causal language model
  • Language: English
  • License: Apache 2.0
  • Base model: unsloth/qwen2.5-coder-7b-instruct-bnb-4bit (Qwen2.5-Coder-7B-Instruct, 4-bit quantized)
  • Fine-tuning method: GRPO (Group Relative Policy Optimization) with LoRA

Architecture

LoRA Configuration

Parameter              Value
LoRA rank (r)          16
LoRA alpha             16
LoRA dropout           0
Target modules         q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable parameters   40.4M (0.53% of 7.62B)
Base model precision   4-bit (bitsandbytes NF4)
Adapter precision      bfloat16
Task type              CAUSAL_LM
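The table above maps directly onto a peft LoraConfig. A minimal sketch, illustrative rather than the exact training script:

```python
from peft import LoraConfig

# Sketch of a LoraConfig matching the table above; field names follow
# peft's public API, but this is an illustration, not the training code.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```

With alpha equal to the rank, the effective LoRA scaling factor (alpha / r) is 1.0.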

Base Model Architecture

  • Model: Qwen2.5-Coder-7B-Instruct (7.62B parameters)
  • Layers: 28 transformer decoder blocks
  • Attention: Grouped-Query Attention (28 query heads, 4 KV heads, head dim 128)
  • FFN: SwiGLU (intermediate dim 18,944)
  • Vocabulary: 151,936 tokens
  • Context length: 8,192 tokens (training), 32,768 tokens (base model max)
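The attention geometry above can be sanity-checked with a few lines of arithmetic (pure Python, no model download needed):

```python
# GQA geometry quoted above: 28 query heads, 4 KV heads, head dim 128.
num_q_heads, num_kv_heads, head_dim = 28, 4, 128

hidden_size = num_q_heads * head_dim      # query projection width
kv_width = num_kv_heads * head_dim        # shared K/V projection width
group_size = num_q_heads // num_kv_heads  # query heads sharing each KV head

print(hidden_size, kv_width, group_size)  # 3584 512 7
```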

Uses

Direct Use

This adapter is designed for agentic slide presentation generation through multi-turn tool use. The model generates JSON tool calls to interact with a slide generation environment (14 tools across 5 categories: research, content planning, design, deck structure, and meta).
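The SlideForge tool schema itself is not reproduced in this card; the snippet below only illustrates the general shape of a JSON tool call an agent might emit, with a hypothetical tool name and arguments:

```python
import json

# Hypothetical single-turn tool call; the tool name and argument names
# are invented for illustration, not the actual SlideForge schema.
raw_model_output = (
    '{"tool": "add_slide",'
    ' "arguments": {"title": "Q3 Revenue", "layout": "two_column"}}'
)

call = json.loads(raw_model_output)
print(call["tool"], call["arguments"]["layout"])  # add_slide two_column
```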

How to Load

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

model = PeftModel.from_pretrained(
    base_model,
    "KarthikRagunathAnandaKumar/LearningToPresent-RL-Qwen-2.5B-Coder-Instruct-GRPO-Finetuned",
)

Or with Unsloth for faster inference:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="KarthikRagunathAnandaKumar/LearningToPresent-RL-Qwen-2.5B-Coder-Instruct-GRPO-Finetuned",
    max_seq_length=8192,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

Training Details

Training Data

Expert trajectories collected using Claude Opus 4.6 as the agent operating in the SlideForge environment. Each trajectory is a complete episode (20-35 turns) from research through finalization of a slide deck.

  • 48 diverse business presentation briefs spanning financial reports, investor pitches, market analyses, technical reviews, and strategic planning
  • Dataset: SlideRL — 288 full-episode trajectories (48 briefs × 6 models)

Training Procedure

  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Reward system: Multi-component reward with 6 dimensions:
    • Structural validation
    • Render quality assessment
    • LLM-based aesthetic scoring
    • Content quality metrics
    • Inverse specification reward (novel)
    • Dense step rewards (quality deltas + action bonuses)
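The exact component weighting is not stated in this card; as a minimal sketch, a multi-component reward of this kind is typically aggregated as a weighted sum. The component keys and weights below are invented for illustration:

```python
# Hypothetical weighted aggregation of the six reward components listed
# above; the keys mirror the list, the weights are invented.
def total_reward(components: dict[str, float], weights: dict[str, float]) -> float:
    return sum(weights[name] * value for name, value in components.items())

weights = {
    "structural": 0.2, "render": 0.15, "aesthetic": 0.2,
    "content": 0.2, "inverse_spec": 0.15, "dense_step": 0.1,
}
components = {name: 0.8 for name in weights}  # dummy per-component scores

# Weights sum to 1.0, so uniform 0.8 scores give a total of 0.8.
print(round(total_reward(components, weights), 3))  # 0.8
```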

Training Hyperparameters

Parameter                      Value
Training regime                bf16 mixed precision
Learning rate                  5e-5
Temperature                    1.0
Num generations (per prompt)   2
Max completion length          1,024 tokens
Max sequence length            8,192 tokens
Total training steps           200 (checkpoint)
Save steps                     25
Gradient checkpointing         Unsloth optimized
Optimizer                      AdamW
Checkpoint Info

  • Checkpoint step: 200
  • Epoch: ~0.617
  • Checkpoint size: ~321 MB (LoRA adapter only)

Evaluation

Evaluated on 48 diverse business briefs using the same environment and reward pipeline.

Results

Model                      Quality score   vs. Claude Opus
Claude Opus 4.6 (expert)   0.794           100%
Llama 4 Scout              0.779           98.1%
SlideRL (this model)       0.724           91.2%
Claude Sonnet 4.6          0.698           87.9%
Qwen 7B (base, untuned)    0.544           68.5%
GPT OSS 120B               0.249           31.4%

Key findings:

  • +33.1% improvement over the untuned base model
  • Achieves 91.2% of the expert model (Claude Opus 4.6) that generated the training data
  • Outperforms Claude Sonnet 4.6 despite having only 7B parameters
  • Demonstrates that instruction adherence and tool-use compliance matter more than raw parameter count
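These percentages follow directly from the raw quality scores in the table:

```python
# Reproduce the relative figures quoted above from the raw quality scores.
opus, slide_rl, base = 0.794, 0.724, 0.544

print(round(slide_rl / opus * 100, 1))           # 91.2 (% of expert)
print(round(base / opus * 100, 1))               # 68.5 (% of expert, untuned base)
print(round((slide_rl - base) / base * 100, 1))  # 33.1 (% gain over base)
```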

Environmental Impact

  • Hardware: Single NVIDIA GPU with 4-bit quantization
  • Training compute: 200 GRPO steps with gradient checkpointing enabled
  • Efficiency: Only 0.53% of parameters trained, base model in 4-bit reduces memory from ~15 GB to ~4 GB
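A quick back-of-envelope check of these efficiency figures:

```python
# Verify the trainable-parameter fraction and the 4-bit memory estimate.
trainable, total = 40.4e6, 7.62e9

print(round(trainable / total * 100, 2))  # 0.53 (% of parameters trained)

# 4-bit weights take ~0.5 bytes/param, vs ~2 bytes/param in bf16 (~15 GB).
print(round(total * 0.5 / 1e9, 1))  # 3.8 (GB of 4-bit weight storage)
```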

Citation

@article{anandakumar2026learning,
  title={Learning to Present: Inverse Specification Rewards for Agentic Slide Generation via Reinforcement Learning},
  author={Ananda Kumar, Karthik Ragunath and Arunachalam, Subrahmanyam},
  year={2026}
}

Framework Versions

  • PEFT: 0.18.1
  • Transformers: compatible with Qwen2 architecture
  • TRL: GRPO trainer
  • Unsloth: 4-bit quantization and optimized training
  • Python: 3.10+