# Knowledge Base: OpenEnv Hackathon
## Clinical Trial Designer Project
## 1. What OpenEnv Actually Is
OpenEnv is a framework for building RL training environments served as FastAPI apps.
```text
Agent (LLM) ──sends action──► Environment.step() ──returns──► observation + reward
                                      ↑
                            YOUR world lives here
```
The contract:
```python
class YourEnvironment(Environment):
    def reset(self) -> YourObservation: ...         # start new episode, inject scenario
    def step(self, action) -> YourObservation: ...  # apply action, return obs + reward + done
    @property
    def state(self) -> State: ...                   # current episode state
```
Served via:
```python
app = create_app(YourEnvironment, YourAction, YourObservation, env_name="your_env")
# -> FastAPI with POST /reset, POST /step, GET /state, GET /schema, WS /ws
```
Deployed as: Docker-based HuggingFace Space using openenv.yaml
Trained via: HF TRL GRPO. The agent generates rollouts against the live environment, receives a reward signal, and updates its weights.
## 2. What Every Winner Had in Common
### The Non-Negotiable Pattern
1. Real world state (not just text)
2. Actions that change that state (real commands / real math)
3. Verification WITHOUT an LLM judge (system state, math, test pass/fail)
4. Curriculum (easy → hard, progressive difficulty)
5. Long episodes (15–100+ steps)
6. Clear reward variance (GRPO needs +high vs -low separation)
### The Single Most Important Rule
Ground truth must be objective. Either the pod is running or it isn't. Either the p-value is < 0.05 or it isn't. Either the books balance or they don't.
If you need an LLM to judge whether the agent succeeded, your environment is weak.
## 3. Past Winners β€” What They Built & Why They Won
### 🥇 1st Place: Kube SRE (kube-sre-gym)
Domain: Kubernetes Site Reliability Engineering
What it is: Agent receives a PagerDuty alert about a broken K8s cluster. Must diagnose and fix using real kubectl commands against a live GKE cluster.

Real-world grounding:
- Live GKE cluster (not simulated)
- Real kubectl commands execute against real pods
- Real failure modes: OOMKill, CrashLoopBackOff, ImagePullBackOff, scale-to-zero
- Real SRE workflow: triage → investigate → fix → verify

Verification (no LLM needed for core check):
- Pod status is ground truth: Running or not
- Restart counts and OOM flags are real K8s events
- LLM judge used only as a secondary confirmation layer

Reward structure:
- Per-step: LLM judge score (-1.0 to +1.0) for SRE workflow quality
- Repeat penalty: -0.15 per repeated command
- Resolution bonus: +1.0 to +5.0 (efficiency-scaled, faster = higher)
- Timeout: net -2.0 for failed episodes
- Phase-order bonus: +0.2 for correct triage → investigate → fix → verify sequence

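The reward rules above can be sketched as one function. This is a minimal illustration, not the kube-sre-gym implementation: the function name, its arguments, and the linear efficiency scaling of the resolution bonus are assumptions, since the source only gives the +1.0 to +5.0 range.

```python
def episode_reward(judge_scores, commands, resolved, steps_used, max_steps,
                   phases_in_order=False):
    """Total episode reward under the rules listed above (hypothetical helper)."""
    total = sum(judge_scores)  # each per-step judge score lies in [-1.0, +1.0]

    # Repeat penalty: -0.15 for every command already issued this episode.
    seen = set()
    for cmd in commands:
        if cmd in seen:
            total -= 0.15
        seen.add(cmd)

    # Phase-order bonus for triage -> investigate -> fix -> verify.
    if phases_in_order:
        total += 0.2

    # Timeout: accumulated reward is wiped and the episode nets -2.0.
    if not resolved:
        return -2.0

    # ASSUMPTION: the resolution bonus scales linearly from +5.0 (instant fix)
    # down to +1.0 (fixed on the last allowed step).
    efficiency = 1.0 - steps_used / max_steps
    return total + 1.0 + 4.0 * efficiency
```

Returning a flat -2.0 on failure keeps failed episodes clearly below successful ones, which is the separation GRPO relies on.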
Curriculum:
- Warmup (0.0–0.25): single easy faults (OOM, crashloop, image pull)
- Beginner (0.25–0.40): medium faults (bad config, scale zero)
- Intermediate (0.40–0.60): harder investigation required
- Advanced (0.60–0.80): compound multi-fault scenarios
- Expert (0.80–0.95): adversarial LLM-designed incidents across all 3 namespaces

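A difficulty score can be mapped to one of these tiers with a sorted-bounds lookup. The sketch below is illustrative only: `tier_for` is a hypothetical helper, and assigning boundary values such as 0.25 to the harder tier is an assumption the source does not settle.

```python
from bisect import bisect_right

# Upper bounds and names copied from the tier list above.
TIER_BOUNDS = [0.25, 0.40, 0.60, 0.80, 0.95]
TIER_NAMES = ["warmup", "beginner", "intermediate", "advanced", "expert"]

def tier_for(difficulty):
    """Return the curriculum tier name for a difficulty in [0.0, 0.95]."""
    if not 0.0 <= difficulty <= 0.95:
        raise ValueError(f"difficulty out of range: {difficulty}")
    # ASSUMPTION: a value exactly on a boundary belongs to the harder tier.
    idx = min(bisect_right(TIER_BOUNDS, difficulty), len(TIER_NAMES) - 1)
    return TIER_NAMES[idx]
```

The same lookup would apply to any curriculum that uses the same five difficulty bands.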
Adversarial Designer:
- Claude designs incidents targeting the agent's tracked weak spots
- Multi-fault scenarios spread across namespaces with red herrings
- Scenarios must be solvable within the step budget (inject/fix pairs validated)

Judge personas (scale with difficulty):
- Junior (< 0.4): lenient, gives hints
- Senior (0.4–0.7): standard SRE expectations
- Principal (> 0.7): strict, penalizes inefficiency

Key insight that won it: Environment co-evolved with the agent. Training exposed bugs in the command parser, judge truncation, and health check race conditions. Fixing them made both the environment and agent better.

Episode length: 15–25 steps (scales with difficulty)

Model: Qwen3-1.7B + LoRA, GRPO with 8 parallel rollouts
### 🥈 2nd Place: Bio Experiment Environment
Domain: Biological Research / Single-Cell Genomics
What it is: Agent plans a biological experiment pipeline step-by-step. Hidden ground truth (true DE genes, true effect sizes, true cell populations) is never revealed. Agent must design experiments that would discover it.

Real-world grounding:
- Real bioinformatics tools: Scanpy, Seurat, DESeq2, Monocle3, SCENIC (all real)
- Real scientific workflow: collect → QC → normalize → cluster → DE → conclude
- Real lab constraints: budget ($80K–$120K), time (120–180 days), action costs
- Literature-backed scenarios with real DOIs and true DE genes with log2FC values
- 4 real biological scenarios: cardiac disease, hematopoiesis, perturbation, biomarker validation

Verification:
- Prerequisite rules are hard constraints (can't run DE before normalization, as in real science)
- Budget/time math is ground truth
- Terminal reward: conclusions compared against hidden ground truth markers/mechanisms
- Calibration score: how well the agent's claims match true biology

Reward structure (decomposed):
```text
R_t = r_validity(0.3) + r_ordering(0.2) + r_info_gain(0.4) + r_efficiency(0.3)
    + r_novelty(+0.1) + r_penalty(-0.15/violation) + shaping(γ=0.99)
```
Terminal reward adds: pipeline completeness (3.0), calibration (4.0), efficiency (1.0), overconfidence penalty (-0.5/wrong high-confidence claim)
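One way to read the per-step formula is as a weighted sum of component scores plus bonus, penalty, and shaping terms. The sketch below rests on stated assumptions: the parenthesized numbers are treated as weights on scores in [0, 1], and the shaping term is taken to be potential-based, γ·Φ(s') − Φ(s). The source confirms neither, and `step_reward` is a hypothetical name.

```python
# Weights copied from the formula above; treating them as multipliers
# on component scores in [0, 1] is an ASSUMPTION.
STEP_WEIGHTS = {"validity": 0.3, "ordering": 0.2, "info_gain": 0.4, "efficiency": 0.3}

def step_reward(components, novel=False, violations=0,
                phi_prev=0.0, phi_next=0.0, gamma=0.99):
    """Decomposed per-step reward (illustrative sketch only)."""
    base = sum(weight * components[name] for name, weight in STEP_WEIGHTS.items())
    novelty = 0.1 if novel else 0.0
    penalty = -0.15 * violations
    # ASSUMPTION: potential-based shaping with discount gamma.
    shaping = gamma * phi_next - phi_prev
    return base + novelty + penalty + shaping
```

Keeping each term as its own named quantity is what makes the reward easy to log and debug component by component.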

POMDP structure:
- Hidden: true cell populations, true DE genes, technical noise, failure conditions
- Visible: task spec, pipeline history, resource usage, intermediate outputs, discovered markers

Episode length: Up to 30 steps

Key insight: Decomposed reward makes it easy to debug and train against. Each component is independently verifiable.
### 🥉 3rd Place: EcomRLVE
Domain: E-commerce Shopping Assistant
What it is: Agent helps a simulated customer (LLM-driven persona) find products, manage cart, handle returns. Uses real 2M product catalog (Amazon dataset) indexed with FAISS.

Real-world grounding:
- 2M real Amazon products with FAISS HNSW index (3.4GB, ~10ms search)
- Real e-commerce tools: catalog.search, cart.add, order.list, return.initiate
- Real return policies with eligibility windows
- Persona-driven customer simulator with hidden preferences

Reward:
```text
r_total = w_task × r_task + w_eff × r_eff + w_hall × r_hall
r_task  = clip(0.55 × r_rank + 0.35 × r_constraints + 0.10 × r_oos, -1, 1)
```
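The two formulas translate directly into code. In this sketch the 0.55 / 0.35 / 0.10 coefficients come from the source, while the outer weights `w_task`, `w_eff`, and `w_hall` are not given there, so they are left as required arguments rather than guessed. Function names are illustrative.

```python
def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def task_reward(r_rank, r_constraints, r_oos):
    """r_task = clip(0.55*r_rank + 0.35*r_constraints + 0.10*r_oos, -1, 1)."""
    return clip(0.55 * r_rank + 0.35 * r_constraints + 0.10 * r_oos, -1.0, 1.0)

def total_reward(r_task, r_eff, r_hall, w_task, w_eff, w_hall):
    """r_total as a weighted sum; the weights are caller-supplied (not in the source)."""
    return w_task * r_task + w_eff * r_eff + w_hall * r_hall
```

The clip keeps a single component from dominating the task term, so the outer weights control the balance between task success, efficiency, and hallucination.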
8 environment types: Product Discovery, Substitution, Cart Building, Return+Replacement, Order Tracking, Policy QA, Bundle Planning, Multi-Intent Journey
Episode length: Up to 14 turns
### Finalist: VRAM (Voyager-VRAM)
Domain: Workplace Project Management / Memory
What it is: Agent manages a 6-week software project across 31 tools (Email, Slack, Calendar, Drive, Sheets, Notes, Meta-Search). Hidden state includes stale spreadsheets, chat-only constraints, changed deadlines.
Key innovation: Voyager architecture with three parts: Skill Library (reusable tool sequences), Working Memory (structured within-episode state), and Episodic Memory (cross-episode learning).

Training: Expert Iteration (best-of-4 rejection sampling + SFT) × 3 rounds

Result: 21% improvement (5.75 vs 4.74 shaped reward) before any training, just from architecture.
## 4. OpenEnv Technical Requirements (Minimum to Submit)
- Use OpenEnv v0.2.1 (openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git@v0.2.1)
- Minimal training script using Unsloth or HF TRL in Colab
- Mini-blog on HuggingFace or mini-video on YouTube (< 2 minutes)
- Deployed on HF Spaces as Docker app
- Judging weights:
- Environment Innovation (40%)
- Storytelling (30%)
- Showing Improvement in Rewards (20%)
- Reward + Training Script Setup (10%)
## 5. Our Project β€” Clinical Trial Designer
### Core Concept
Agent designs a clinical trial to detect a drug effect. The simulator holds hidden ground truth (true effect size, true side effect rate, true responder population). Agent must design a trial that would statistically detect it.
- **Theme:** #3.1 (World Modeling / Professional Tasks)

Real-world grounding:
- FDA trial design rules are real and codified (Phase I/II/III requirements)
- Statistical power calculations are pure math (no LLM needed)
- Trial simulation runs with hidden true parameters → p-value is ground truth
- Clinical trial protocols follow established procedures (ICH E9, FDA guidance)
### The World (Hidden State)
When reset() is called, the simulator secretly sets:
```python
from dataclasses import dataclass

@dataclass
class TrialGroundTruth:
    true_effect_size: float         # e.g. 0.23 (23% tumor reduction)
    true_side_effect_rate: float    # e.g. 0.08 (8% serious adverse events)
    true_responder_population: str  # e.g. "BRCA1+ only" (agent doesn't know this)
    true_mechanism: str             # e.g. "inhibits VEGF pathway"
    true_dose_response: dict        # dose -> effect curve (hidden)
    placebo_response_rate: float    # background noise
    dropout_rate: float             # patients who leave trial
```
Agent never sees this. It must design a trial that would detect it.
### Agent Actions (What the Agent Does)
```python
from dataclasses import dataclass

@dataclass
class TrialAction:
    action_type: "ActionType"  # one of the actions below
    parameters: dict
    justification: str
    confidence: float          # 0.0–1.0
```
Action vocabulary:

| Phase | Action | Real-world analog |
|---|---|---|
| Design | set_primary_endpoint | Choose what to measure (OS, PFS, ORR) |
| Design | set_sample_size | Power calculation → n patients |
| Design | set_inclusion_criteria | Who can enroll |
| Design | set_exclusion_criteria | Who is excluded |
| Design | set_dosing_schedule | Dose, frequency, cycle length |
| Design | set_control_arm | Placebo vs standard of care |
| Design | set_randomization_ratio | 1:1, 2:1, etc. |
| Design | set_blinding | Open-label, single-blind, double-blind |
| Phase I | run_dose_escalation | 3+3 design, find MTD |
| Phase I | observe_safety_signal | Read adverse event data |
| Phase I | estimate_effect_size | Estimate from Phase I data |
| Phase II | run_interim_analysis | Check futility/efficacy at 50% enrollment |
| Phase II | modify_sample_size | Adaptive design adjustment |
| Phase II | add_biomarker_stratification | Enrich for responders |
| Regulatory | submit_to_fda_review | Check protocol compliance |
| Regulatory | request_protocol_amendment | Change design mid-trial |
| Analysis | run_primary_analysis | Final statistical test |
| Analysis | synthesize_conclusion | Write trial conclusion |

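The action vocabulary above could back the `ActionType` used by `TrialAction`. A minimal sketch, assuming an `Enum` built from the table; `ACTIONS_BY_PHASE` and `PHASE_OF` are hypothetical names introduced for illustration.

```python
from enum import Enum

# Action names grouped by phase, copied from the table above.
ACTIONS_BY_PHASE = {
    "Design": [
        "set_primary_endpoint", "set_sample_size", "set_inclusion_criteria",
        "set_exclusion_criteria", "set_dosing_schedule", "set_control_arm",
        "set_randomization_ratio", "set_blinding",
    ],
    "Phase I": ["run_dose_escalation", "observe_safety_signal", "estimate_effect_size"],
    "Phase II": ["run_interim_analysis", "modify_sample_size",
                 "add_biomarker_stratification"],
    "Regulatory": ["submit_to_fda_review", "request_protocol_amendment"],
    "Analysis": ["run_primary_analysis", "synthesize_conclusion"],
}

# Build the enum with the functional API: member name SET_BLINDING -> "set_blinding".
ActionType = Enum(
    "ActionType",
    {name.upper(): name for names in ACTIONS_BY_PHASE.values() for name in names},
)

# Reverse lookup so the environment can tell which phase an action belongs to.
PHASE_OF = {
    ActionType(name): phase
    for phase, names in ACTIONS_BY_PHASE.items()
    for name in names
}
```

Parsing the model's output with `ActionType("run_interim_analysis")` raises `ValueError` on unknown action names, which gives the environment a cheap validity check before any state changes.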
### Verification (No LLM Judge Needed)
1. Statistical Power (pure math)
```python
from math import sqrt

from scipy.stats import norm

def calculate_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided, two-arm z-test with n patients per arm."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = effect_size * sqrt(n / 2) - z_alpha
    return norm.cdf(z_beta)

# If power < 0.80 -> underpowered -> reward penalty
# Agent must estimate effect_size from Phase I data (the true value is hidden)
```
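Inverting the power formula gives the sample size the agent needs for a target power: n per arm = 2·((z_α + z_β) / d)², rounded up. The sketch below uses the standard library's `NormalDist` so it runs without dependencies; `norm.ppf` from scipy.stats returns the same quantiles. The name `required_n_per_arm` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1

def required_n_per_arm(effect_size, alpha=0.05, target_power=0.80):
    """Smallest n per arm reaching target_power under the normal approximation."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = _STD_NORMAL.inv_cdf(target_power)    # quantile for the target power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(required_n_per_arm(0.5))  # medium standardized effect -> 63
print(required_n_per_arm(1.2))  # large standardized effect  -> 11
```

A standardized effect of 1.2 needs only 11 patients per arm, so a large effect remains detectable even when total enrollment has to stay under 50.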
2. FDA Rule Compliance (hard rules, binary pass/fail)
```python
FDA_RULES = {
    "phase_ii_min_n": 100,
    "primary_endpoint_must_be_prespecified": True,
    "interim_analysis_requires_alpha_spending": True,
    "randomization_required_for_phase_iii": True,
    "safety_monitoring_committee_required": True,
    "informed_consent_required": True,
}
# Each rule is a hard check; no LLM needed
```
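A rule engine over this table can be a plain function that returns the names of the failed rules. The sketch below is illustrative: the `design` dictionary and all of its keys (`phase`, `n_patients`, `endpoint_prespecified`, and so on) are assumptions, since the source does not define the design schema.

```python
FDA_RULES = {
    "phase_ii_min_n": 100,
    "primary_endpoint_must_be_prespecified": True,
    "interim_analysis_requires_alpha_spending": True,
    "randomization_required_for_phase_iii": True,
    "safety_monitoring_committee_required": True,
    "informed_consent_required": True,
}

def check_fda_compliance(design, rules=FDA_RULES):
    """Return the list of violated rule names; an empty list means compliant.

    ASSUMPTION: `design` is a dict with the hypothetical keys used below.
    """
    failed = []
    if design["phase"] == "II" and design["n_patients"] < rules["phase_ii_min_n"]:
        failed.append("phase_ii_min_n")
    if rules["primary_endpoint_must_be_prespecified"] and not design["endpoint_prespecified"]:
        failed.append("primary_endpoint_must_be_prespecified")
    if (rules["interim_analysis_requires_alpha_spending"]
            and design["interim_analysis_planned"]
            and not design["alpha_spending_plan"]):
        failed.append("interim_analysis_requires_alpha_spending")
    if (rules["randomization_required_for_phase_iii"]
            and design["phase"] == "III" and not design["randomized"]):
        failed.append("randomization_required_for_phase_iii")
    if rules["safety_monitoring_committee_required"] and not design["has_safety_committee"]:
        failed.append("safety_monitoring_committee_required")
    if rules["informed_consent_required"] and not design["informed_consent"]:
        failed.append("informed_consent_required")
    return failed
```

Returning rule names rather than a single boolean lets the reward function count passed rules and lets the observation tell the agent exactly what to fix.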
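3. Trial Simulation (run it with the hidden truth)
```python
from collections import namedtuple

import numpy as np
from scipy.stats import ttest_ind

TrialResult = namedtuple("TrialResult", "p_value success adverse_events")

def simulate_trial(design, ground_truth, rng=None):
    rng = rng or np.random.default_rng()
    # Sample patients per arm, then remove dropouts (design.n_per_arm is an assumed field)
    n = max(2, round(design.n_per_arm * (1 - ground_truth.dropout_rate)))
    # Control arm gets the placebo response; treatment arm adds the true effect.
    # Unit-variance Gaussian noise is a simplifying assumption of this sketch.
    control_outcomes = rng.normal(ground_truth.placebo_response_rate, 1.0, n)
    treatment_outcomes = rng.normal(
        ground_truth.placebo_response_rate + ground_truth.true_effect_size, 1.0, n)
    adverse_events = rng.binomial(n, ground_truth.true_side_effect_rate)
    # Run the pre-specified statistical test
    p_value = ttest_ind(treatment_outcomes, control_outcomes).pvalue
    success = p_value < design.alpha  # ground truth: did it work?
    return TrialResult(p_value, success, adverse_events)
```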
4. Budget (math)
```python
cost = n_patients * cost_per_patient + site_costs + regulatory_fees
over_budget = cost > trial_budget  # binary
```
### Reward Structure
Per-step rewards:

| Component | Verification | Weight |
|---|---|---|
| FDA rule compliance | Hard rule engine | +0.3 per rule passed |
| Valid action sequence | Prerequisite check | +0.2 |
| Information gain from Phase I | Bayesian update quality | +0.4 |
| Budget efficiency | Math | +0.1 |
| Soft violation penalty | Rule engine | -0.15 each |

Terminal rewards (when trial simulation runs):

| Component | Verification | Weight |
|---|---|---|
| Trial detects true effect (p < 0.05) | Simulation math | +5.0 |
| Statistical power ≥ 0.80 | Formula | +2.0 |
| All FDA rules pass | Rule engine | +2.0 |
| Correct responder population identified | Hidden state match | +3.0 |
| Budget under limit | Math | +1.0 |
| Interim analysis catches futility early | Simulation | +1.0 bonus |
| Underpowered design | Formula | -2.0 |
| Wrong primary endpoint | Domain rules | -1.5 |
| Overconfident wrong claims | Calibration check | -0.5 each |

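The terminal table can be applied as a lookup over boolean outcome flags. In this sketch the weights are copied from the table, while the flag names and the `terminal_reward` signature are hypothetical.

```python
# Weights copied from the terminal reward table; the key names are illustrative.
TERMINAL_WEIGHTS = {
    "detects_true_effect": 5.0,
    "power_at_least_0_80": 2.0,
    "all_fda_rules_pass": 2.0,
    "correct_responder_population": 3.0,
    "under_budget": 1.0,
    "interim_caught_futility": 1.0,
    "underpowered": -2.0,
    "wrong_primary_endpoint": -1.5,
}
OVERCONFIDENCE_PENALTY = -0.5  # per wrong high-confidence claim

def terminal_reward(flags, overconfident_wrong_claims=0):
    """Sum the weights of every flag that is True, plus the calibration penalty."""
    total = sum(w for name, w in TERMINAL_WEIGHTS.items() if flags.get(name, False))
    return total + OVERCONFIDENCE_PENALTY * overconfident_wrong_claims
```

With every positive flag set and no penalties the total is 14.0, which is the top of the +8 to +14 range quoted for a successful trial.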
Reward variance for GRPO:
- Successful trial: +8 to +14
- Failed trial (wrong population): -2 to 0
- Timeout / FDA rejection: -3

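The separation matters because GRPO scores each rollout relative to the other rollouts sampled for the same prompt: each reward is centered on the group mean and scaled by the group standard deviation. The sketch below shows that normalization. Using the population standard deviation and `eps = 1e-6` are assumptions here, and trainers differ in such details.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # ASSUMPTION: population std; some trainers use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight rollouts of one scenario, with rewards drawn from the ranges above.
mixed = [12.0, 10.0, -1.0, -3.0, -2.0, 9.0, -3.0, 0.0]
print(group_advantages(mixed))      # successes positive, failures negative

# If every rollout earns the same reward, all advantages are zero.
print(group_advantages([3.0] * 8))
```

A group in which every rollout scores the same yields all-zero advantages and therefore no gradient, so an environment whose rewards cluster together gives GRPO nothing to learn from.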
### Episode Structure (Long Horizon)
- Phase I (20–30 steps):
  - run_dose_escalation × 6 cohorts
  - observe_safety_signal × 3
  - estimate_effect_size (Bayesian update)
  - decide: go/no-go to Phase II
- Phase II (30–40 steps):
  - set_primary_endpoint
  - set_sample_size (power calculation)
  - set_inclusion_criteria (try to find responder population)
  - set_dosing_schedule
  - submit_to_fda_review
  - run_interim_analysis (at 50% enrollment)
  - modify_sample_size if needed
  - run_primary_analysis
- Conclusion (5–10 steps):
  - synthesize_conclusion
  - Terminal reward fires

Total: 55–80 steps per episode (the sum of the three phase budgets above)
### Curriculum
| Tier | Difficulty | What changes |
|---|---|---|
| Warmup | 0.0–0.25 | Large effect size (easy to detect), homogeneous population |
| Beginner | 0.25–0.40 | Medium effect, some noise |
| Intermediate | 0.40–0.60 | Small effect, need correct population enrichment |
| Advanced | 0.60–0.80 | Hidden responder subgroup, misleading Phase I signal |
| Expert | 0.80–0.95 | Tiny effect, high dropout, adaptive design required |

Scenarios (4 to start, like Bio project)

| Name | Disease | Challenge | True effect |
|---|---|---|---|
| solid_tumor_chemo | Non-small cell lung cancer | Find EGFR+ subgroup | 31% PFS improvement in EGFR+ only |
| autoimmune_biologic | Rheumatoid arthritis | Dose-response curve, find optimal dose | U-shaped response, 200mg optimal |
| cns_depression | Treatment-resistant depression | High placebo response masks drug effect | 18% improvement over placebo |
| rare_disease_orphan | Rare pediatric metabolic disorder | Tiny n, adaptive design required | Large effect (Cohen's d = 1.2) but n < 50 |

### Hidden State Structure
```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TrialLatentState:
    # Biology
    true_effect_size: float
    true_responder_criteria: List[str]  # e.g. ["BRCA1+", "age < 65"]
    true_dose_response: Dict[float, float]
    true_mechanism: str
    # Technical
    placebo_response_rate: float
    dropout_rate: float
    site_variability: float
    measurement_noise: float
    # Progress flags (18 milestones like Bio project)
    phase_i_complete: bool
    mtd_identified: bool
    effect_estimated: bool
    protocol_submitted: bool
    interim_complete: bool
    trial_complete: bool
    # Resources
    budget_remaining: float
    time_remaining_days: int
    patients_enrolled: int
```
### Key Design Decisions
- Real statistical math: scipy.stats does the power calculations. No LLM.
- FDA rules as hard constraints: ICH E9 guidelines encoded as a rule engine (like the Bio project's prerequisite rules).
- Simulation is ground truth: the trial either detects the effect or doesn't. Same as KubeSRE's pod status.
- Phase I → Phase II information flow: the agent must use Phase I observations to update its Phase II design. This is the long-horizon planning challenge.
- Hidden responder population: the hardest part. The agent must figure out that the drug only works in BRCA1+ patients by designing smart inclusion criteria. This is where the curriculum earns its keep.
- Decomposed reward: like the Bio project, each component is independently verifiable and debuggable.
## 6. Rules Learned from Winners
### Environment Design Rules
- One clear success criterion: pod running, p < 0.05, books balance
- Real tools/APIs: not mocked. Real kubectl, real scipy, real SQL
- Prerequisite chains: can't run Phase II without Phase I (like the Bio project's rule engine)
- Reward variance: GRPO needs clear separation between good and bad episodes
- No reward hacking: multi-layer verification (programmatic + optional LLM)
- Environment must fight back: too-easy rewards cause plateaus (KubeSRE lesson)
- Repeat penalty: prevents the agent from spamming the same action
### Training Rules
- GRPO over PPO: better for sparse delayed rewards, no value function needed
- 8 parallel rollouts: gives GRPO enough variance to compute advantages
- Curriculum is mandatory: cold start on hard problems = no learning signal
- Fast-track advancement: 90%+ success rate → skip min_episodes requirement
- Episode transcripts: save to JSONL for debugging and offline analysis

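The fast-track rule can be sketched as a small gate on tier advancement. Only the 90% threshold and the `min_episodes` idea come from the list above. The `pass_rate` argument and the `fast_track_min` floor, which stops a single lucky episode from triggering promotion, are assumptions added for illustration.

```python
def should_advance(outcomes, min_episodes, pass_rate,
                   fast_track_rate=0.90, fast_track_min=5):
    """Decide whether the agent moves to the next curriculum tier.

    outcomes: list of booleans, one per episode at the current tier.
    ASSUMPTION: pass_rate and fast_track_min are hypothetical parameters.
    """
    if not outcomes:
        return False
    success_rate = sum(outcomes) / len(outcomes)
    # Fast-track: a 90%+ success rate skips the min_episodes requirement.
    if len(outcomes) >= fast_track_min and success_rate >= fast_track_rate:
        return True
    # Normal path: enough episodes played and the pass rate reached.
    return len(outcomes) >= min_episodes and success_rate >= pass_rate
```

An agent that has mastered a tier moves on after a handful of episodes, while a borderline agent still has to play out the full `min_episodes` window.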
### Reward Rules
- Timeout = net negative: wipe accumulated rewards, set to -2.0 total
- Efficiency scaling: faster fixes get higher bonuses (prevents lazy solutions)
- Phase-order bonus: reward correct workflow sequence
- Overconfidence penalty: high-confidence wrong claims get penalized (Bio project)
- Decompose rewards: makes debugging and training easier
### Pitfalls to Avoid
- LLM-only verification: too slow, too expensive, too noisy
- Too-generous rewards: agent finds a plateau and stops improving
- Static scenarios: agent memorizes, doesn't generalize
- Single-fault only: too easy, no curriculum progression
- Mocked tool responses: agent learns to exploit the mock, not real behavior
- Truncated observations: KubeSRE bug where the judge was cutting off pods alphabetically
## 7. Tech Stack (Based on Winners)
- Environment: openenv-core[core] @ v0.2.1
- Server: FastAPI + uvicorn
- Training: HF TRL (GRPOTrainer) + vLLM colocate
- Model: Qwen3-1.7B or Qwen2.5-7B + LoRA (BF16)
- Deployment: Docker → HuggingFace Spaces
- Compute: H100 80GB (training) + GKE/cloud (environment)
- Stats: scipy.stats (power calculations)
### Training command pattern
```bash
# Terminal 1: Environment server
uv run server
# Terminal 2: GRPO training
python train.py --vllm-mode colocate --num-generations 8 --max-steps 100
```
## 8. Pitch Strategy (3 min)
Based on judging criteria (40% innovation, 30% storytelling, 20% reward improvement, 10% pipeline):

Minute 1: Story (30% of score)

"A drug works. But only in 15% of patients. The FDA needs proof. How do you design a trial that finds those patients before you run out of money?"

Minute 2: Environment Innovation (40% of score)

Show: hidden ground truth, statistical verification, FDA rule engine, Phase I → Phase II information flow

Minute 3: Reward Curves + Demo (30% of score)

Show the reward curve improving. Show the agent learning to enrich for the responder population. Show before/after: random inclusion criteria vs. learned BRCA1+ enrichment.