DrugEnv Policy: Qwen2.5-3B-Instruct (GRPO)

A drug target validation agent trained with Group Relative Policy Optimization (GRPO) on the DrugEnv RL environment.

Given a gene and a disease, the model decides whether the target is a go or no_go for drug development, under a budget that forces it to choose what to investigate next rather than running every tool.

What It Learned

Before training, the base model sees high expression on CD33 and submits go with 0.9 confidence. That is wrong: targeting CD33 also hits healthy blood stem cells. After GRPO, the same model runs query_expression, sees the signal, then calls off_target_screen and toxicity_panel, finds the liability, and submits no_go with 0.7 confidence.

The shift is not from wrong to right. It is from confidently wrong to cautiously right.

Training Pipeline

Stage 1: SFT Warm-Start (LoRA on oracle trajectories)

LoRA adapters (r=16, α=32, dropout=0.05) on q_proj, k_proj, v_proj, and o_proj, trained on ~200 expert trajectories. This stage teaches the model the rhythm of investigation before RL begins.
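The adapter settings above map naturally onto keyword arguments of the PEFT library's `LoraConfig`. The sketch below shows one way they might be written down; it is not the author's training script. The values are kept in a plain dict so the snippet runs without PEFT installed, and `task_type` is an assumption, since the card does not state it.

```python
# Stage 1 LoRA settings from the model card, written with the keyword names
# used by peft.LoraConfig (an assumed mapping, not the author's code).
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the card
}

# In standard LoRA the low-rank update is scaled by alpha / r.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {lora_scaling}")
```

With PEFT installed, `LoraConfig(**lora_kwargs)` would build the corresponding config object.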

Stage 2: GRPO

| Hyperparameter | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Learning rate | 5e-6 |
| Epochs | 2 |
| Dataset episodes | 350 |
| Rollout steps / episode | 8 |
| Num generations (group size) | 4 |
| Per-device batch size | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| Total training examples | ~2,800 |
| Max prompt length | 1024 tokens |
| Max completion length | 256 tokens |
| Hardware | A100 GPU |
| Collection policy | heuristic |
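Two of the values above appear to be derived from the others. The short check below reproduces them, assuming a single GPU (the card lists "A100 GPU" without a count) and assuming each rollout step contributes one training example.

```python
# Values copied from the hyperparameter table.
per_device_batch_size = 4
gradient_accumulation_steps = 4
dataset_episodes = 350
rollout_steps_per_episode = 8
num_devices = 1  # assumption: the card does not state a device count

# Effective batch size = per-device batch x accumulation steps x devices.
effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * num_devices
)

# Assumption: one training example per rollout step of each episode.
total_training_examples = dataset_episodes * rollout_steps_per_episode

print(effective_batch_size, total_training_examples)
```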

Training plots (reward curve, loss, KL divergence) are in the evidence/ directory of this repository.
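For readers new to GRPO, the group size of 4 means each prompt is sampled four times and every completion is scored relative to its siblings rather than against a learned value function. The following is a minimal sketch of that normalization in one common form (reward minus group mean, divided by group standard deviation). The reward values are hypothetical, and the use of the population standard deviation and the `eps` constant are assumptions, not details taken from this model's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for a group of 4 completions.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean are discouraged.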

Reward Function

Not a single scalar, but a panel of independent judges:

  • Step: evidence novelty, investigation coherence, credit efficiency, rule violations
  • Terminal: 40% decision accuracy + 35% evidence coverage + 15% efficiency + 10% coherence
  • Confidence calibration: a confident wrong answer is penalized far more than a hesitant wrong one
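The terminal weights above can be read as a weighted sum, sketched below under the assumption that each judge returns a score in [0, 1]. The function names, the example scores, and the quadratic form of the calibration penalty are all hypothetical illustrations; the card gives only the weights and the qualitative behavior, not the actual formulas.

```python
# Terminal weights as stated in the card; they sum to 1.0.
TERMINAL_WEIGHTS = {
    "decision_accuracy": 0.40,
    "evidence_coverage": 0.35,
    "efficiency": 0.15,
    "coherence": 0.10,
}


def terminal_reward(scores):
    """Weighted sum of judge scores, each assumed to lie in [0, 1]."""
    return sum(TERMINAL_WEIGHTS[name] * scores[name] for name in TERMINAL_WEIGHTS)


def calibration_penalty(confidence, correct):
    """Hypothetical penalty: zero when correct, grows with confidence when wrong."""
    return 0.0 if correct else confidence ** 2


# Hypothetical judge scores for one finished episode.
example = {
    "decision_accuracy": 1.0,
    "evidence_coverage": 0.5,
    "efficiency": 0.5,
    "coherence": 1.0,
}
print(terminal_reward(example))
print(calibration_penalty(0.9, correct=False), calibration_penalty(0.5, correct=False))
```

Any penalty that is zero for correct answers and increases with stated confidence for wrong ones would show the same qualitative behavior; the quadratic is just one convenient choice.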

Scenarios

| Scenario | Difficulty | Correct decision | What it tests |
| --- | --- | --- | --- |
| egfr_nsclc_viable | Easy | go | Clear positive |
| kras_pdac_borderline | Medium | go | Temporal: recent literature decides |
| cd33_aml_misleading | Hard | no_go | Expression trap; safety kills it |
| tp53_solid_tumors_clear_fail | Easy-Medium | no_go | Fame ≠ druggability |
| ptpn11_juvenile_mml_complex | Very Hard | go | Allosteric + stratification |

Plus 20 procedurally generated oncology scenarios (seed=42).

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("skinnysidd/drugenv-policy")
model = AutoModelForCausalLM.from_pretrained("skinnysidd/drugenv-policy")

Prompts must follow the DrugEnv format. Use build_training_prompt() from training/training_script.py in the environment Space.

Links
