DrugEnv Policy: Qwen2.5-3B-Instruct (GRPO)
A drug target validation agent trained with Group Relative Policy Optimization (GRPO) on the DrugEnv RL environment.
Given a gene and a disease, the model decides whether the target is a go or no_go
for drug development, under a budget that forces it to choose what to investigate next
rather than running every tool.
What It Learned
Before training, the base model sees high CD33 expression and submits go with 0.9
confidence. That is wrong: targeting CD33 also hits healthy blood stem cells. After GRPO,
the same model runs query_expression, sees the signal, then calls off_target_screen and
toxicity_panel, finds the liability, and submits no_go with 0.7 confidence.
The shift is not from wrong to right. It is from confidently wrong to cautiously right.
Training Pipeline
Stage 1: SFT Warm-Start (LoRA on oracle trajectories)
LoRA adapters (r=16, α=32, dropout=0.05) on q_proj, k_proj, v_proj, o_proj.
~200 expert trajectories. Teaches the model the rhythm of investigation before RL begins.
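The adapter settings above map onto a PEFT `LoraConfig` roughly as follows. This is a minimal sketch, not the author's training script: `task_type="CAUSAL_LM"` is an assumption, and the `peft` import is optional so the snippet still runs where the library is not installed.

```python
# Stage 1 LoRA settings as stated in this card.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the card
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

try:
    from peft import LoraConfig

    lora_config = LoraConfig(**lora_kwargs)
except ImportError:
    lora_config = None  # peft not installed; lora_kwargs still documents the setup

print(f"LoRA scaling factor: {scaling}")
```

With α set to twice the rank, the adapter update is scaled by 2, a common default pairing.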
Stage 2: GRPO
| Hyperparameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Learning rate | 5e-6 |
| Epochs | 2 |
| Dataset episodes | 350 |
| Rollout steps / episode | 8 |
| Num generations (group size) | 4 |
| Per-device batch size | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| Total training examples | ~2,800 |
| Max prompt length | 1024 tokens |
| Max completion length | 256 tokens |
| Hardware | A100 GPU |
| Collection policy | heuristic |
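The derived rows in the table follow from the others by simple arithmetic. A quick check, assuming a single GPU and that each rollout step yields one training example (the card does not say so explicitly, but it is consistent with the ~2,800 figure):

```python
# Values copied from the hyperparameter table.
per_device_batch_size = 4
gradient_accumulation_steps = 4
dataset_episodes = 350
rollout_steps_per_episode = 8
num_devices = 1  # assumption: the table lists a single "A100 GPU"

# Effective batch = samples contributing to one optimizer update.
effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * num_devices
)

# Assumption: one training example per rollout step of each episode.
total_training_examples = dataset_episodes * rollout_steps_per_episode

print(effective_batch_size)     # 16
print(total_training_examples)  # 2800
```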
Training plots (reward curve, loss, KL divergence) are in the evidence/ directory of
this repository.
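GRPO scores each completion relative to the other completions sampled for the same prompt, here a group of 4 as listed in the hyperparameter table. A minimal sketch of that normalization with made-up rewards; the population standard deviation and the `eps` value are assumptions, and library implementations may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for the 4 completions sampled from one prompt.
group_rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(group_rewards)

for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:.2f}  advantage={advantage:+.3f}")
```

Completions that beat their group's mean get a positive advantage and are reinforced; those below it are pushed down, with no separate value network involved.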
Reward Function
The reward is not a single scalar but a panel of independent judges:
- Step: evidence novelty, investigation coherence, credit efficiency, rule violations
- Terminal: 40% decision accuracy + 35% evidence coverage + 15% efficiency + 10% coherence
- Confidence calibration: a confident wrong answer is penalized far more than a hesitant wrong one
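The terminal weighting above can be sketched as a weighted sum. The weights come from this card; the assumption that each judge returns a score in [0, 1] and the quadratic form of `calibration_penalty` are purely illustrative, since the card does not give the actual formulas.

```python
# Terminal reward weights as stated in the card.
TERMINAL_WEIGHTS = {
    "decision_accuracy": 0.40,
    "evidence_coverage": 0.35,
    "efficiency": 0.15,
    "coherence": 0.10,
}


def terminal_reward(scores):
    """Weighted sum of per-judge scores, each assumed to lie in [0, 1]."""
    return sum(TERMINAL_WEIGHTS[name] * scores[name] for name in TERMINAL_WEIGHTS)


def calibration_penalty(confidence, correct):
    """Hypothetical penalty: zero when correct, quadratic in confidence when wrong."""
    return 0.0 if correct else confidence ** 2


# Hypothetical judge scores for one finished episode.
scores = {
    "decision_accuracy": 1.0,
    "evidence_coverage": 0.8,
    "efficiency": 0.5,
    "coherence": 1.0,
}
print(round(terminal_reward(scores), 3))          # 0.855
print(calibration_penalty(0.9, correct=False))    # confident and wrong
print(calibration_penalty(0.3, correct=False))    # hesitant and wrong
```

Any penalty that grows faster than linearly in confidence has the stated property that a confident wrong answer costs far more than a hesitant one; the quadratic is just the simplest such choice.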
Scenarios
| Scenario | Difficulty | Correct | Tests |
|---|---|---|---|
| egfr_nsclc_viable | Easy | go | Clear positive |
| kras_pdac_borderline | Medium | go | Temporal: recent literature decides |
| cd33_aml_misleading | Hard | no_go | Expression trap; safety kills it |
| tp53_solid_tumors_clear_fail | Easy-Medium | no_go | Fame ≠ druggability |
| ptpn11_juvenile_mml_complex | Very Hard | go | Allosteric + stratification |
Plus 20 procedurally generated oncology scenarios (seed=42).
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("skinnysidd/drugenv-policy")
model = AutoModelForCausalLM.from_pretrained("skinnysidd/drugenv-policy")
Prompts must follow the DrugEnv format. Use build_training_prompt() from
training/training_script.py in the environment Space.
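Once a prompt has been built, generation might look like the sketch below. The signature of build_training_prompt() is not shown in this card, so the helper takes a ready-made prompt string; `generate_action` is a hypothetical name, and the 256-token cap simply mirrors the max completion length used in training.

```python
MAX_NEW_TOKENS = 256  # mirrors the max completion length from training


def generate_action(prompt, repo_id="skinnysidd/drugenv-policy"):
    """Generate the model's next action for an already-formatted DrugEnv prompt."""
    # Imported lazily so that defining this helper does not require the libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)

    # Keep only the newly generated tokens, dropping the echoed prompt.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0, prompt_length:], skip_special_tokens=True)
```

Calling `generate_action(prompt)` downloads the weights on first use, so expect the first call to be slow.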