DrugEnv Policy — Qwen2.5-3B-Instruct (GRPO)

A drug target validation agent trained with Group Relative Policy Optimization (GRPO) on the DrugEnv RL environment.

Given a gene and a disease, the model decides whether the target is a go or no_go for drug development — under a budget that forces it to choose what to investigate next rather than running every tool.

What It Learned

Before training, the base model sees high expression of CD33 and submits go with 0.9 confidence. That is wrong: targeting CD33 also hits healthy blood stem cells. After GRPO, the same model runs query_expression, sees the same signal, then calls off_target_screen and toxicity_panel, finds the liability, and submits no_go with 0.7 confidence.

The shift is not from wrong to right. It is from confidently wrong to cautiously right.

Training Pipeline

Stage 1 — SFT Warm-Start (LoRA on oracle trajectories)

LoRA adapters (r=16, α=32, dropout=0.05) on q_proj, k_proj, v_proj, and o_proj, trained on ~200 expert trajectories. This stage teaches the model the rhythm of investigation before RL begins.
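
The Stage 1 adapter settings can be written down as a plain config. This is a minimal sketch, not the project's training script: the key names mirror the argument names of `peft.LoraConfig`, the use of `peft` and the `task_type` value are assumptions, and the dict is kept dependency-free. It also shows the update scale α/r implied by these values.

```python
# Stage 1 LoRA settings from the model card, as keyword arguments.
# Key names follow peft.LoraConfig; using peft at all is an assumption here.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the card
}

# Standard LoRA scales the low-rank update B @ A by alpha / r.
lora_scale = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scale alpha/r = {lora_scale}")
```

If `peft` is installed, `LoraConfig(**lora_kwargs)` should build the equivalent config object.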

Stage 2 — GRPO

| Hyperparameter | Value |
| --- | --- |
| Base model | `Qwen/Qwen2.5-3B-Instruct` |
| Learning rate | `5e-6` |
| Epochs | `2` |
| Dataset episodes | `350` |
| Rollout steps / episode | `8` |
| Num generations (group size) | `4` |
| Per-device batch size | `4` |
| Gradient accumulation steps | `4` |
| Effective batch size | `16` |
| Total training examples | ~2,800 |
| Max prompt length | `1024` tokens |
| Max completion length | `256` tokens |
| Hardware | A100 GPU |
| Collection policy | `heuristic` |
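
With a group size of 4, GRPO scores each completion relative to the other completions sampled for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage follows. The reward values are made up for illustration, and the population standard deviation and the `eps` term are assumptions (implementations differ on both). The last lines re-derive two table entries, assuming a single GPU.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group's mean and spread."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumption: population std; some libraries use sample std
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for the 4 completions sampled from one prompt.
group = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(group)
for r, a in zip(group, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")

# Derived quantities from the hyperparameter table (single GPU assumed).
effective_batch = 4 * 4   # per-device batch size x gradient accumulation steps
total_examples = 350 * 8  # dataset episodes x rollout steps per episode
```

Completions that beat their group's mean get a positive advantage and are reinforced; a group where every completion scores the same contributes no gradient signal.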

Training plots (reward curve, loss, KL divergence) are in the evidence/ directory of this repository.

Reward Function

Not a single scalar — a panel of independent judges:

Step: evidence novelty, investigation coherence, credit efficiency, rule violations
Terminal: 40% decision accuracy + 35% evidence coverage + 15% efficiency + 10% coherence
Confidence calibration: a confident wrong answer is penalized far more than a hesitant wrong one
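
The terminal weights above combine as a weighted sum. The sketch below assumes each component is already scored in [0, 1]. The card does not give the calibration formula, so `calibration_penalty` is a hypothetical quadratic form chosen only to show the stated property: a confident wrong answer costs more than a hesitant one.

```python
# Terminal weights from the model card; they sum to 1.0.
TERMINAL_WEIGHTS = {
    "decision_accuracy": 0.40,
    "evidence_coverage": 0.35,
    "efficiency": 0.15,
    "coherence": 0.10,
}

def terminal_reward(scores):
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    return sum(w * scores[name] for name, w in TERMINAL_WEIGHTS.items())

def calibration_penalty(confidence, correct):
    """Hypothetical penalty: zero when correct, confidence squared when wrong."""
    return 0.0 if correct else confidence ** 2

# Illustrative component scores, not measured values.
scores = {"decision_accuracy": 1.0, "evidence_coverage": 0.6,
          "efficiency": 0.5, "coherence": 0.8}
print(round(terminal_reward(scores), 3))
print(calibration_penalty(0.9, correct=False), calibration_penalty(0.5, correct=False))
```

Under this illustrative form, a wrong go at 0.9 confidence is penalized more than three times as heavily as a wrong go at 0.5.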

Scenarios

| Scenario | Difficulty | Correct decision | What it tests |
| --- | --- | --- | --- |
| `egfr_nsclc_viable` | Easy | `go` | Clear positive |
| `kras_pdac_borderline` | Medium | `go` | Temporal: recent literature decides |
| `cd33_aml_misleading` | Hard | `no_go` | Expression trap; safety kills it |
| `tp53_solid_tumors_clear_fail` | Easy-Medium | `no_go` | Fame ≠ druggability |
| `ptpn11_juvenile_mml_complex` | Very Hard | `go` | Allosteric + stratification |

Plus 20 procedurally generated oncology scenarios (seed=42).

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("skinnysidd/drugenv-policy")
model = AutoModelForCausalLM.from_pretrained("skinnysidd/drugenv-policy")

Prompts must follow the DrugEnv format. Use build_training_prompt() from training/training_script.py in the environment Space.
