Clinical Recruitment GRPO Agent (Best Run, 80 Steps)

This is the best trained LoRA adapter for the Adaptive Clinical Recruitment Environment, produced by SFT warmup followed by 80-step GRPO on Qwen3-1.7B (NVIDIA L4, 24GB).
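Loading the adapter for inference could look roughly like the sketch below. It assumes the repository id shown on this page (`pratimassaravanan/grpo`) holds the adapter files at its root, and that `transformers` and `peft` are installed. `load_agent` is a hypothetical helper name, not something shipped with the adapter.

```python
BASE_MODEL_ID = "Qwen/Qwen3-1.7B"      # base model named in this card
ADAPTER_ID = "pratimassaravanan/grpo"  # repo id from this page; adapter location is an assumption


def load_agent(base_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model and attach the LoRA adapter (downloads weights on first call)."""
    # Imported lazily so this module can be read without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()  # inference mode, so LoRA dropout is inactive
    return tokenizer, model
```

Calling `tokenizer, model = load_agent()` fetches both the base weights and the adapter, so it needs network access and enough memory for a 1.7B-parameter model.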

Hackathon positioning: Theme #2 (Super Long-Horizon Planning).

Training Summary

| Metric | Value |
| --- | --- |
| Base model | Qwen/Qwen3-1.7B |
| Method | SFT warmup + 80-step GRPO |
| GPU | NVIDIA L4 (24GB) via HF Jobs |
| Duration | 142 min total (3.5 min SFT + 141.7 min GRPO) |
| Reward (start) | 0.269 |
| Reward (end) | 0.331 |
| Tool calls/step (start) | 3.5 |
| Tool calls/step (end) | 11 |
| Enrollment/rollout | 3-4 patients |
| Zero-std collapse rate | 10% (8/80 steps, intermittent) |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
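The zero-std collapse rate matters because GRPO scores each rollout relative to the other rollouts in its group. When every rollout in a group earns the same reward, the standard deviation is zero, all advantages vanish, and that step carries no learning signal. The sketch below illustrates the idea in plain Python. The function names, the use of the population standard deviation, and the `eps` value are illustrative assumptions, not the training code behind this run.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages: (r - mean) / (std + eps) over one prompt's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_std(rewards, tol=1e-8):
    """True when every rollout in the group scored (nearly) the same reward."""
    return pstdev(rewards) <= tol


def collapse_rate(groups):
    """Fraction of steps whose reward group has zero spread."""
    return sum(is_zero_std(g) for g in groups) / len(groups)
```

With 8 collapsed groups out of 80 steps, `collapse_rate` returns 0.1, which matches the 10% reported in the table.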

What the Model Learned

  • Calls screen_patient as first action (previously collapsed to adjust_strategy)
  • Follows the correct pipeline: screen -> recontact -> enrollment
  • Enrolls 3-4 patients per episode on easy_bench
  • Makes 11+ tool calls per rollout (up from 3.5)
  • Recovers from intermittent collapse without sustained degradation
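The behaviors listed above can be checked on a recorded rollout with a small helper like the one below. This is a minimal sketch: only `screen_patient` and `adjust_strategy` are tool names taken from this card, while `recontact_patient` and `enroll_patient` are hypothetical stand-ins for the recontact and enrollment stages.

```python
# Stage order from the card. The second and third tool names are hypothetical.
PIPELINE = ["screen_patient", "recontact_patient", "enroll_patient"]


def first_action_is_screen(calls):
    """True if the rollout opens with screen_patient."""
    return bool(calls) and calls[0] == "screen_patient"


def follows_pipeline(calls):
    """True if every stage appears and first occurrences are in pipeline order."""
    try:
        positions = [calls.index(stage) for stage in PIPELINE]
    except ValueError:
        return False  # at least one stage never appears
    return positions == sorted(positions)
```

A rollout that only repeats `adjust_strategy` fails both checks, which is the collapsed behavior the earlier runs showed.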

Training Plots

Figure: GRPO 80-Step Training

Figure: All Runs Comparison

Honest Limits

  • Enrollment stays at 3-4 of the 80 target patients per episode
  • Zero-std collapse still occurs on ~10% of steps
  • Reward plateaued at 0.33 after 80 steps
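To put these limits in proportion, the short calculation below derives the relative changes from the figures reported in the Training Summary. The variable names are illustrative, and no new measurements are introduced.

```python
# Figures copied from the Training Summary in this card.
REWARD_START, REWARD_END = 0.269, 0.331
TOOL_CALLS_START, TOOL_CALLS_END = 3.5, 11
ENROLLED_RANGE, TARGET = (3, 4), 80

reward_gain = REWARD_END - REWARD_START           # absolute change in mean reward
relative_gain = reward_gain / REWARD_START        # change relative to the starting reward
tool_call_ratio = TOOL_CALLS_END / TOOL_CALLS_START
enrollment_fraction = tuple(n / TARGET for n in ENROLLED_RANGE)

print(f"reward gain: {reward_gain:+.3f} ({relative_gain:.1%} relative)")
print(f"tool calls per step: {tool_call_ratio:.2f}x the starting rate")
print(f"enrollment: {enrollment_fraction[0]:.2%} to {enrollment_fraction[1]:.2%} of target")
```

The reward gain is modest in absolute terms, and enrollment covers only a small share of the 80-patient target, which is consistent with the plateau noted above.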

Links

Framework Versions

  • TRL: 1.2.0
  • Transformers: 5.6.2
  • PyTorch: 2.11.0
  • PEFT: 0.19.1

Model tree for pratimassaravanan/grpo: adapter of Qwen/Qwen3-1.7B