Spaces:
Sleeping
Sleeping
Roopalgn
Add spec/contract files to main for Suyash: ARCHITECTURE, reward spec, scenarios, milestones, phase workflow
863b4cb | # Clinical Trial Phase-Aware Workflow & Scoring | |
| > **Inspired by:** KubeSRE's triage → investigate → fix → verify phase-order bonus (+0.2) and skip penalty. Bio Experiment's prerequisite rules as hard constraints. KubeSRE's judge persona scaling (junior/senior/principal) with curriculum tier. | |
| ## Clinical Trial Workflow Phases | |
| A clinical trial follows a strict sequential workflow. The agent should learn this ordering through reward signal, not hard-coding. Phase-order bonuses and skip penalties teach the agent to follow the correct diagnostic sequence — the same pattern that won 1st place (KubeSRE's triage → investigate → fix → verify). | |
| Our workflow maps KubeSRE's 5-phase SRE pattern to clinical trial design: | |
| | KubeSRE Phase | Our Phase | Why it maps | | |
| |---|---|---| | |
| | Triage | literature_review + hypothesis | Understand the problem before acting | | |
| | Investigation | phase_i_design + phase_i_analysis | Gather data to inform fix | | |
| | Mitigation | phase_ii_design + regulatory | Implement the solution design | | |
| | Fix | enrollment + monitoring | Execute and monitor | | |
| | Verification | analysis + conclusion | Verify the outcome | | |
| ### Phase Definitions | |
| | # | Phase | Description | Typical Actions | | |
| |---|---|---|---| | |
| | 0 | `literature_review` | Understand the disease, existing treatments, and target population | Review scenario description, note constraints | | |
| | 1 | `hypothesis` | Form a hypothesis about the drug's mechanism and target population | Estimate expected effect size, identify potential responders | | |
| | 2 | `phase_i_design` | Design Phase I safety/dose-finding study | run_dose_escalation, observe_safety_signal | | |
| | 3 | `phase_i_analysis` | Analyze Phase I results | estimate_effect_size, identify MTD | | |
| | 4 | `phase_ii_design` | Design Phase II efficacy trial based on Phase I findings | set_primary_endpoint, set_sample_size, set_inclusion_criteria, set_exclusion_criteria, set_dosing_schedule, set_control_arm, set_randomization_ratio, set_blinding | | |
| | 5 | `regulatory` | Submit protocol for FDA review | submit_to_fda_review, request_protocol_amendment | | |
| | 6 | `enrollment` | Enroll patients, begin trial execution | (implicit — happens after FDA approval) | | |
| | 7 | `monitoring` | Run interim analyses, monitor safety | run_interim_analysis, modify_sample_size, add_biomarker_stratification | | |
| | 8 | `analysis` | Run final statistical analysis | run_primary_analysis | | |
| | 9 | `conclusion` | Synthesize results and conclusions | synthesize_conclusion | | |
| ### Phase Detection Heuristic | |
| ```python | |
| def _detect_phase(action: TrialAction, history: list) -> str: | |
| """Classify action into clinical workflow phase.""" | |
| action_type = action.action_type | |
| # Phase I | |
| if action_type in ("run_dose_escalation", "observe_safety_signal"): | |
| return "phase_i_design" | |
| if action_type == "estimate_effect_size": | |
| return "phase_i_analysis" | |
| # Phase II design | |
| if action_type in ("set_primary_endpoint", "set_sample_size", | |
| "set_inclusion_criteria", "set_exclusion_criteria", | |
| "set_dosing_schedule", "set_control_arm", | |
| "set_randomization_ratio", "set_blinding"): | |
| return "phase_ii_design" | |
| # Regulatory | |
| if action_type in ("submit_to_fda_review", "request_protocol_amendment"): | |
| return "regulatory" | |
| # Monitoring | |
| if action_type in ("run_interim_analysis", "modify_sample_size", | |
| "add_biomarker_stratification"): | |
| return "monitoring" | |
| # Analysis | |
| if action_type == "run_primary_analysis": | |
| return "analysis" | |
| # Conclusion | |
| if action_type == "synthesize_conclusion": | |
| return "conclusion" | |
| return "literature_review" | |
| ``` | |
| ### Phase Order Map | |
| ```python | |
| PHASE_ORDER = { | |
| "literature_review": 0, | |
| "hypothesis": 1, | |
| "phase_i_design": 2, | |
| "phase_i_analysis": 3, | |
| "phase_ii_design": 4, | |
| "regulatory": 5, | |
| "enrollment": 6, | |
| "monitoring": 7, | |
| "analysis": 8, | |
| "conclusion": 9, | |
| } | |
| ``` | |
| ## Phase-Order Scoring Rules | |
| ### Bonus: Correct Phase Ordering (+0.2) | |
| If the current action's phase is equal to or one step ahead of the highest phase seen so far, the agent gets +0.2 bonus. This rewards following the natural clinical trial workflow. | |
| ```python | |
| def _is_phase_order_correct(current_phase: str, history: list) -> bool: | |
| current_order = PHASE_ORDER[current_phase] | |
| if not history: | |
| return current_order <= 2 # literature_review, hypothesis, or phase_i_design are valid starts | |
| past_phases = [_detect_phase(h["action"], history[:i]) for i, h in enumerate(history)] | |
| max_past_order = max(PHASE_ORDER[p] for p in past_phases) | |
| return current_order <= max_past_order + 1 | |
| ``` | |
| ### Penalty: Skipping Phases (-0.3) | |
| If the agent jumps ahead by 2+ phases (e.g., going from Phase I directly to analysis without Phase II design), it gets -0.3 penalty per skipped phase. This penalizes shortcutting the clinical workflow. | |
| ```python | |
| def _get_skipped_phases(current_phase: str, history: list) -> list[str]: | |
| current_order = PHASE_ORDER[current_phase] | |
| if current_order <= 2: | |
| return [] | |
| past_phases = {_detect_phase(h["action"], history[:i]) for i, h in enumerate(history)} | |
| past_phases.add(current_phase) | |
| skipped = [] | |
| for phase, order in PHASE_ORDER.items(): | |
| if order < current_order and phase not in past_phases: | |
| skipped.append(phase) | |
| return skipped | |
| ``` | |
| ### Examples | |
| **Good workflow** (total phase bonus: +2.0): | |
| ``` | |
| Step 1: run_dose_escalation → phase_i_design (correct start, +0.2) | |
| Step 2: observe_safety_signal → phase_i_design (same phase, +0.2) | |
| Step 3: estimate_effect_size → phase_i_analysis (+0.2) | |
| Step 4: set_primary_endpoint → phase_ii_design (+0.2) | |
| Step 5: set_sample_size → phase_ii_design (same phase, +0.2) | |
| Step 6: set_inclusion_criteria → phase_ii_design (same phase, +0.2) | |
| Step 7: submit_to_fda_review → regulatory (+0.2) | |
| Step 8: run_interim_analysis → monitoring (+0.2) | |
| Step 9: run_primary_analysis → analysis (+0.2) | |
| Step 10: synthesize_conclusion → conclusion (+0.2) | |
| ``` | |
| **Bad workflow** (total phase penalty: -0.9): | |
| ``` | |
| Step 1: set_sample_size → phase_ii_design (skipped phase_i_design, phase_i_analysis → -0.6) | |
| Step 2: run_primary_analysis → analysis (skipped regulatory, monitoring → -0.6) | |
| Step 3: synthesize_conclusion → conclusion (correct ordering from analysis, +0.2) | |
| Net: -0.6 + -0.6 + 0.2 = -1.0 | |
| ``` | |
| ## Prerequisite Hard Constraints | |
| In addition to soft phase-order scoring, some actions have hard prerequisites enforced by the rule engine. These are not reward signals — they block the action entirely and return an error: | |
| | Action | Prerequisite | Rationale | | |
| |---|---|---| | |
| | `estimate_effect_size` | At least 1 `run_dose_escalation` completed | Can't estimate without Phase I data | | |
| | `set_sample_size` | `estimate_effect_size` completed | Need effect size to calculate power | | |
| | `submit_to_fda_review` | `set_primary_endpoint` + `set_sample_size` completed | FDA requires these minimum protocol elements | | |
| | `run_interim_analysis` | `submit_to_fda_review` passed | Can't do interim before FDA approval | | |
| | `run_primary_analysis` | `submit_to_fda_review` passed | Can't analyze before approved protocol | | |
| | `synthesize_conclusion` | `run_primary_analysis` completed | Can't conclude without analysis | | |
| | `modify_sample_size` | `run_interim_analysis` completed | Adaptation requires interim look | | |
| | `add_biomarker_stratification` | `estimate_effect_size` completed | Need data to stratify on | | |
| ## Integration with Reward | |
| The phase-order bonus/penalty is one component of the per-step reward: | |
| ``` | |
| r_step = ( | |
| w_validity × r_validity + # FDA rule compliance | |
| w_ordering × r_ordering + # Phase-order bonus/penalty (THIS DOC) | |
| w_info_gain × r_info_gain + # Information gained | |
| w_efficiency × r_efficiency + # Budget/time efficiency | |
| w_novelty × r_novelty + # Action diversity bonus | |
| w_penalty × r_penalty + # Soft constraint violations | |
| γ × (φ(s') − φ(s)) # Potential-based shaping | |
| ) | |
| ``` | |
| Where `r_ordering` = +0.2 if correct order, -0.3 × len(skipped_phases) if phases were skipped. | |
| ## Judge Persona Scaling by Difficulty | |
| > **Inspired by:** KubeSRE's judge persona system: junior (lenient, gives hints) at low difficulty, senior (standard) at medium, principal (strict, penalizes inefficiency) at expert. | |
| The phase-order evaluation strictness scales with curriculum tier: | |
| | Tier | Judge Persona | Phase-Order Behavior | | |
| |---|---|---| | |
| | Warmup (0.0–0.25) | **Junior** | Lenient: allows 1 phase skip without penalty. Includes hint in observation (e.g., "Consider running dose escalation before setting sample size"). Phase bonus still +0.2. | | |
| | Beginner (0.25–0.40) | **Junior→Senior** | Standard: phase skip penalty -0.3/skip. No hints. Phase bonus +0.2. | | |
| | Intermediate (0.40–0.60) | **Senior** | Strict: phase skip penalty -0.3/skip. Phase bonus reduced to +0.15 (expects correct ordering as baseline). | | |
| | Advanced (0.60–0.80) | **Senior→Principal** | Very strict: phase skip penalty -0.5/skip. Phase bonus +0.1. Penalizes redundant actions within a phase (-0.1 for repeating already-completed actions). | | |
| | Expert (0.80–0.95) | **Principal** | Harshest: phase skip penalty -0.5/skip. Phase bonus +0.05 (near-zero, expects perfection). Redundancy penalty -0.15. Efficiency penalty for taking >N steps in any phase. | | |
| This mirrors KubeSRE's approach: the junior judge at warmup is lenient and gives hints, while the principal judge at expert actively punishes inefficiency. | |
| ## Protocol Amendment & Recovery | |
| > **Inspired by:** KubeSRE's mitigation phase — agents can partially fix before full resolution. Bio Experiment's `design_followup` meta-action. | |
| The `request_protocol_amendment` action allows recovery from earlier mistakes: | |
| - If FDA review fails, agent can amend the protocol and resubmit | |
| - Amendment costs time and budget (realistic consequence) | |
| - Successful recovery after failure gets a +0.3 **recovery bonus** (like KubeSRE's mitigation credit) | |
| - Maximum 2 amendments per episode (prevents infinite retry loops) | |
| - This teaches the agent that mistakes are recoverable but costly — a key real-world lesson | |