openenv-clinical-trial / docs /phase_workflow.md
Roopalgn
Add spec/contract files to main for Suyash: ARCHITECTURE, reward spec, scenarios, milestones, phase workflow
863b4cb
|
raw
history blame
10.4 kB

Clinical Trial Phase-Aware Workflow & Scoring

Inspired by: KubeSRE's triage → investigate → fix → verify phase-order bonus (+0.2) and skip penalty. Bio Experiment's prerequisite rules as hard constraints. KubeSRE's judge persona scaling (junior/senior/principal) with curriculum tier.

Clinical Trial Workflow Phases

A clinical trial follows a strict sequential workflow. The agent should learn this ordering through reward signal, not hard-coding. Phase-order bonuses and skip penalties teach the agent to follow the correct diagnostic sequence — the same pattern that won 1st place (KubeSRE's triage → investigate → fix → verify).

Our workflow maps KubeSRE's 5-phase SRE pattern to clinical trial design:

KubeSRE Phase Our Phase Why it maps
Triage literature_review + hypothesis Understand the problem before acting
Investigation phase_i_design + phase_i_analysis Gather data to inform fix
Mitigation phase_ii_design + regulatory Implement the solution design
Fix enrollment + monitoring Execute and monitor
Verification analysis + conclusion Verify the outcome

Phase Definitions

# Phase Description Typical Actions
0 literature_review Understand the disease, existing treatments, and target population Review scenario description, note constraints
1 hypothesis Form a hypothesis about the drug's mechanism and target population Estimate expected effect size, identify potential responders
2 phase_i_design Design Phase I safety/dose-finding study run_dose_escalation, observe_safety_signal
3 phase_i_analysis Analyze Phase I results estimate_effect_size, identify MTD
4 phase_ii_design Design Phase II efficacy trial based on Phase I findings set_primary_endpoint, set_sample_size, set_inclusion_criteria, set_exclusion_criteria, set_dosing_schedule, set_control_arm, set_randomization_ratio, set_blinding
5 regulatory Submit protocol for FDA review submit_to_fda_review, request_protocol_amendment
6 enrollment Enroll patients, begin trial execution (implicit — happens after FDA approval)
7 monitoring Run interim analyses, monitor safety run_interim_analysis, modify_sample_size, add_biomarker_stratification
8 analysis Run final statistical analysis run_primary_analysis
9 conclusion Synthesize results and conclusions synthesize_conclusion

Phase Detection Heuristic

def _detect_phase(action: TrialAction, history: list) -> str:
    """Classify action into clinical workflow phase."""
    action_type = action.action_type

    # Phase I
    if action_type in ("run_dose_escalation", "observe_safety_signal"):
        return "phase_i_design"
    if action_type == "estimate_effect_size":
        return "phase_i_analysis"

    # Phase II design
    if action_type in ("set_primary_endpoint", "set_sample_size",
                        "set_inclusion_criteria", "set_exclusion_criteria",
                        "set_dosing_schedule", "set_control_arm",
                        "set_randomization_ratio", "set_blinding"):
        return "phase_ii_design"

    # Regulatory
    if action_type in ("submit_to_fda_review", "request_protocol_amendment"):
        return "regulatory"

    # Monitoring
    if action_type in ("run_interim_analysis", "modify_sample_size",
                        "add_biomarker_stratification"):
        return "monitoring"

    # Analysis
    if action_type == "run_primary_analysis":
        return "analysis"

    # Conclusion
    if action_type == "synthesize_conclusion":
        return "conclusion"

    return "literature_review"

Phase Order Map

PHASE_ORDER = {
    "literature_review": 0,
    "hypothesis": 1,
    "phase_i_design": 2,
    "phase_i_analysis": 3,
    "phase_ii_design": 4,
    "regulatory": 5,
    "enrollment": 6,
    "monitoring": 7,
    "analysis": 8,
    "conclusion": 9,
}

Phase-Order Scoring Rules

Bonus: Correct Phase Ordering (+0.2)

If the current action's phase is equal to or one step ahead of the highest phase seen so far, the agent gets +0.2 bonus. This rewards following the natural clinical trial workflow.

def _is_phase_order_correct(current_phase: str, history: list) -> bool:
    current_order = PHASE_ORDER[current_phase]
    if not history:
        return current_order <= 2  # literature_review, hypothesis, or phase_i_design are valid starts
    past_phases = [_detect_phase(h["action"], history[:i]) for i, h in enumerate(history)]
    max_past_order = max(PHASE_ORDER[p] for p in past_phases)
    return current_order <= max_past_order + 1

Penalty: Skipping Phases (-0.3)

If the agent jumps ahead by 2+ phases (e.g., going from Phase I directly to analysis without Phase II design), it gets -0.3 penalty per skipped phase. This penalizes shortcutting the clinical workflow.

def _get_skipped_phases(current_phase: str, history: list) -> list[str]:
    current_order = PHASE_ORDER[current_phase]
    if current_order <= 2:
        return []
    past_phases = {_detect_phase(h["action"], history[:i]) for i, h in enumerate(history)}
    past_phases.add(current_phase)
    skipped = []
    for phase, order in PHASE_ORDER.items():
        if order < current_order and phase not in past_phases:
            skipped.append(phase)
    return skipped

Examples

Good workflow (total phase bonus: +2.0):

Step 1: run_dose_escalation       → phase_i_design  (correct start, +0.2)
Step 2: observe_safety_signal     → phase_i_design  (same phase, +0.2)
Step 3: estimate_effect_size      → phase_i_analysis (+0.2)
Step 4: set_primary_endpoint      → phase_ii_design (+0.2)
Step 5: set_sample_size           → phase_ii_design (same phase, +0.2)
Step 6: set_inclusion_criteria    → phase_ii_design (same phase, +0.2)
Step 7: submit_to_fda_review      → regulatory      (+0.2)
Step 8: run_interim_analysis      → monitoring      (+0.2)
Step 9: run_primary_analysis      → analysis        (+0.2)
Step 10: synthesize_conclusion    → conclusion      (+0.2)

Bad workflow (total phase penalty: -0.9):

Step 1: set_sample_size           → phase_ii_design (skipped phase_i_design, phase_i_analysis → -0.6)
Step 2: run_primary_analysis      → analysis        (skipped regulatory, monitoring → -0.6)
Step 3: synthesize_conclusion     → conclusion      (correct ordering from analysis, +0.2)
Net: -0.6 + -0.6 + 0.2 = -1.0

Prerequisite Hard Constraints

In addition to soft phase-order scoring, some actions have hard prerequisites enforced by the rule engine. These are not reward signals — they block the action entirely and return an error:

Action Prerequisite Rationale
estimate_effect_size At least 1 run_dose_escalation completed Can't estimate without Phase I data
set_sample_size estimate_effect_size completed Need effect size to calculate power
submit_to_fda_review set_primary_endpoint + set_sample_size completed FDA requires these minimum protocol elements
run_interim_analysis submit_to_fda_review passed Can't do interim before FDA approval
run_primary_analysis submit_to_fda_review passed Can't analyze before approved protocol
synthesize_conclusion run_primary_analysis completed Can't conclude without analysis
modify_sample_size run_interim_analysis completed Adaptation requires interim look
add_biomarker_stratification estimate_effect_size completed Need data to stratify on

Integration with Reward

The phase-order bonus/penalty is one component of the per-step reward:

r_step = (
    w_validity  × r_validity     +       # FDA rule compliance
    w_ordering  × r_ordering     +       # Phase-order bonus/penalty (THIS DOC)
    w_info_gain × r_info_gain    +       # Information gained
    w_efficiency × r_efficiency  +       # Budget/time efficiency
    w_novelty   × r_novelty      +       # Action diversity bonus
    w_penalty   × r_penalty      +       # Soft constraint violations
    γ × (φ(s') − φ(s))                  # Potential-based shaping
)

Where r_ordering = +0.2 if correct order, -0.3 × len(skipped_phases) if phases were skipped.

Judge Persona Scaling by Difficulty

Inspired by: KubeSRE's judge persona system: junior (lenient, gives hints) at low difficulty, senior (standard) at medium, principal (strict, penalizes inefficiency) at expert.

The phase-order evaluation strictness scales with curriculum tier:

Tier Judge Persona Phase-Order Behavior
Warmup (0.0–0.25) Junior Lenient: allows 1 phase skip without penalty. Includes hint in observation (e.g., "Consider running dose escalation before setting sample size"). Phase bonus still +0.2.
Beginner (0.25–0.40) Junior→Senior Standard: phase skip penalty -0.3/skip. No hints. Phase bonus +0.2.
Intermediate (0.40–0.60) Senior Strict: phase skip penalty -0.3/skip. Phase bonus reduced to +0.15 (expects correct ordering as baseline).
Advanced (0.60–0.80) Senior→Principal Very strict: phase skip penalty -0.5/skip. Phase bonus +0.1. Penalizes redundant actions within a phase (-0.1 for repeating already-completed actions).
Expert (0.80–0.95) Principal Harshest: phase skip penalty -0.5/skip. Phase bonus +0.05 (near-zero, expects perfection). Redundancy penalty -0.15. Efficiency penalty for taking >N steps in any phase.

This mirrors KubeSRE's approach: the junior judge at warmup is lenient and gives hints, while the principal judge at expert actively punishes inefficiency.

Protocol Amendment & Recovery

Inspired by: KubeSRE's mitigation phase — agents can partially fix before full resolution. Bio Experiment's design_followup meta-action.

The request_protocol_amendment action allows recovery from earlier mistakes:

  • If FDA review fails, agent can amend the protocol and resubmit
  • Amendment costs time and budget (realistic consequence)
  • Successful recovery after failure gets a +0.3 recovery bonus (like KubeSRE's mitigation credit)
  • Maximum 2 amendments per episode (prevents infinite retry loops)
  • This teaches the agent that mistakes are recoverable but costly — a key real-world lesson