# Knowledge Base: OpenEnv Hackathon

## Clinical Trial Designer Project

## 1. What OpenEnv Actually Is

OpenEnv is a framework for building RL training environments served as FastAPI apps.

    Agent (LLM) --sends action--> Environment.step() --returns--> observation + reward
                                          ^
                                          |
                                 YOUR world lives here

The contract:

    class YourEnvironment(Environment):
        def reset(self) -> YourObservation: ...         # start new episode, inject scenario
        def step(self, action) -> YourObservation: ...  # apply action, return obs + reward + done
        @property
        def state(self) -> State: ...                   # current episode state

Served via:

    app = create_app(YourEnvironment, YourAction, YourObservation, env_name="your_env")
    # → FastAPI with POST /reset, POST /step, GET /state, GET /schema, WS /ws

Deployed as: Docker-based HuggingFace Space using openenv.yaml

Trained via: HF TRL GRPO. The agent generates rollouts against the live environment, gets a reward signal, and updates its weights.
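
To make the contract concrete, here is a minimal, self-contained sketch of the same reset/step/state shape. It deliberately does not import OpenEnv: `GuessEnvironment`, `GuessAction`, `GuessObservation`, and `run_episode` are hypothetical stand-ins, and a real environment would subclass `Environment` and be served through `create_app`.

```python
import random
from dataclasses import dataclass


@dataclass
class GuessAction:
    guess: int


@dataclass
class GuessObservation:
    message: str
    reward: float
    done: bool


class GuessEnvironment:
    """Toy stand-in for the reset/step/state contract (not an OpenEnv class)."""

    MAX_STEPS = 5

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)
        self._target = 0  # hidden state: never placed in an observation
        self._steps = 0

    def reset(self) -> GuessObservation:
        self._target = self._rng.randint(1, 10)
        self._steps = 0
        return GuessObservation("guess a number from 1 to 10", 0.0, False)

    def step(self, action: GuessAction) -> GuessObservation:
        self._steps += 1
        if action.guess == self._target:
            return GuessObservation("correct", 1.0, True)
        hint = "higher" if action.guess < self._target else "lower"
        return GuessObservation(hint, -0.1, self._steps >= self.MAX_STEPS)

    @property
    def state(self) -> dict:
        return {"steps": self._steps}


def run_episode(env: GuessEnvironment) -> float:
    """Binary-search policy standing in for the LLM agent."""
    low, high, total = 1, 10, 0.0
    obs = env.reset()
    while not obs.done:
        guess = (low + high) // 2
        obs = env.step(GuessAction(guess))
        total += obs.reward
        if obs.message == "higher":
            low = guess + 1
        elif obs.message == "lower":
            high = guess - 1
    return total
```

The success check is objective (the guess either matches the hidden target or it does not), which is the property the rest of this document asks of every environment.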

## 2. What Every Winner Had in Common

### The Non-Negotiable Pattern

1. Real world state (not just text)
2. Actions that change that state (real commands / real math)
3. Verification WITHOUT an LLM judge (system state, math, test pass/fail)
4. Curriculum (easy → hard, progressive difficulty)
5. Long episodes (15–100+ steps)
6. Clear reward variance (GRPO needs +high vs -low separation)

### The Single Most Important Rule

Ground truth must be objective. Either the pod is running or it isn't. Either the p-value is < 0.05 or it isn't. Either the books balance or they don't.

If you need an LLM to judge whether the agent succeeded, your environment is weak.

## 3. Past Winners: What They Built & Why They Won

### 🥇 1st Place: Kube SRE (kube-sre-gym)

**Domain:** Kubernetes Site Reliability Engineering

**What it is:** The agent receives a PagerDuty alert about a broken K8s cluster and must diagnose and fix it using real kubectl commands against a live GKE cluster.

**Real-world grounding:**
- Live GKE cluster (not simulated)
- Real kubectl commands execute against real pods
- Real failure modes: OOMKill, CrashLoopBackOff, ImagePullBackOff, scale-to-zero
- Real SRE workflow: triage → investigate → fix → verify

**Verification (no LLM needed for core check):**
- Pod status is ground truth: Running or not
- Restart counts and OOM flags are real K8s events
- LLM judge used only as a secondary confirmation layer

**Reward structure:**
- Per-step: LLM judge score (-1.0 to +1.0) for SRE workflow quality
- Repeat penalty: -0.15 per repeated command
- Resolution bonus: +1.0 to +5.0 (efficiency-scaled, faster = higher)
- Timeout: net -2.0 for failed episodes
- Phase-order bonus: +0.2 for correct triage → investigate → fix → verify sequence
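
The reward rules above can be sketched as a single episode-level function. This is an illustration, not the kube-sre-gym implementation: the linear efficiency scaling in `resolution_bonus`, applying the phase-order bonus once per episode, and all function and parameter names are assumptions.

```python
def resolution_bonus(steps_used: int, max_steps: int,
                     low: float = 1.0, high: float = 5.0) -> float:
    """Assumed linear scaling: fewer steps used gives a bonus closer to `high`."""
    fraction_left = max(0.0, 1.0 - steps_used / max_steps)
    return low + (high - low) * fraction_left


def episode_reward(commands, judge_scores, resolved, max_steps,
                   phase_order_ok=False) -> float:
    """Combine per-step judge scores, repeat penalties, and episode bonuses."""
    if not resolved:
        return -2.0  # timeout: accumulated reward is wiped, net -2.0

    total = sum(judge_scores)  # each score is expected in [-1.0, +1.0]

    seen = set()
    for command in commands:
        if command in seen:
            total -= 0.15  # repeat penalty per repeated command
        seen.add(command)

    if phase_order_ok:
        total += 0.2  # triage -> investigate -> fix -> verify followed in order

    return total + resolution_bonus(len(commands), max_steps)
```

Returning a flat -2.0 for unresolved episodes keeps failed rollouts clearly separated from successful ones, which is the reward variance GRPO relies on.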

**Curriculum:**
- Warmup (0.0–0.25): single easy faults (OOM, crashloop, image pull)
- Beginner (0.25–0.40): medium faults (bad config, scale zero)
- Intermediate (0.40–0.60): harder investigation required
- Advanced (0.60–0.80): compound multi-fault scenarios
- Expert (0.80–0.95): adversarial LLM-designed incidents across all 3 namespaces

**Adversarial Designer:**
- Claude designs incidents targeting the agent's tracked weak spots
- Multi-fault scenarios spread across namespaces with red herrings
- Scenarios must be solvable within the step budget (inject/fix pairs validated)

**Judge personas (scale with difficulty):**
- Junior (< 0.4): lenient, gives hints
- Senior (0.4–0.7): standard SRE expectations
- Principal (> 0.7): strict, penalizes inefficiency
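
Both the curriculum tiers and the judge personas are lookups on one difficulty score, which might be sketched as below. The function names are hypothetical. Treating each tier as a half-open interval (so 0.25 is already Beginner) and mapping anything above 0.80 to Expert are assumptions; the source only lists the ranges.

```python
from bisect import bisect_right

# Upper bounds of the first four tiers; anything at or above 0.80 is Expert.
TIER_BOUNDS = [0.25, 0.40, 0.60, 0.80]
TIER_NAMES = ["Warmup", "Beginner", "Intermediate", "Advanced", "Expert"]


def tier_for(difficulty: float) -> str:
    """Map a difficulty score to its curriculum tier (half-open intervals)."""
    return TIER_NAMES[bisect_right(TIER_BOUNDS, difficulty)]


def persona_for(difficulty: float) -> str:
    """Map a difficulty score to the judge persona listed above."""
    if difficulty < 0.4:
        return "Junior"
    if difficulty <= 0.7:
        return "Senior"
    return "Principal"
```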

**Key insight that won it:** The environment co-evolved with the agent. Training exposed bugs in the command parser, judge truncation, and health check race conditions. Fixing them made both the environment and the agent better.

**Episode length:** 15–25 steps (scales with difficulty)

**Model:** Qwen3-1.7B + LoRA, GRPO with 8 parallel rollouts

### 🥈 2nd Place: Bio Experiment Environment

**Domain:** Biological Research / Single-Cell Genomics

**What it is:** The agent plans a biological experiment pipeline step by step. Hidden ground truth (true DE genes, true effect sizes, true cell populations) is never revealed. The agent must design experiments that would discover it.

**Real-world grounding:**
- Real bioinformatics tools: Scanpy, Seurat, DESeq2, Monocle3, SCENIC
- Real scientific workflow: collect → QC → normalize → cluster → DE → conclude
- Real lab constraints: budget ($80K–$120K), time (120–180 days), action costs
- Literature-backed scenarios with real DOIs and true DE genes with log2FC values
- 4 real biological scenarios: cardiac disease, hematopoiesis, perturbation, biomarker validation

**Verification:**
- Prerequisite rules are hard constraints (can't run DE before normalization, as in real science)
- Budget/time math is ground truth
- Terminal reward: conclusions compared against hidden ground truth markers/mechanisms
- Calibration score: how well the agent's claims match true biology

**Reward structure (decomposed):**

    R_t = r_validity(0.3) + r_ordering(0.2) + r_info_gain(0.4) + r_efficiency(0.3)
          + r_novelty(+0.1) + r_penalty(-0.15/violation) + shaping(γ=0.99)

Terminal reward adds: pipeline completeness (3.0), calibration (4.0), efficiency (1.0), overconfidence penalty (-0.5 per wrong high-confidence claim)

**POMDP structure:**
- Hidden: true cell populations, true DE genes, technical noise, failure conditions
- Visible: task spec, pipeline history, resource usage, intermediate outputs, discovered markers

**Episode length:** Up to 30 steps

**Key insight:** Decomposed reward makes it easy to debug and train against. Each component is independently verifiable.

### 🥉 3rd Place: EcomRLVE

**Domain:** E-commerce Shopping Assistant

**What it is:** The agent helps a simulated customer (LLM-driven persona) find products, manage the cart, and handle returns. It uses a real 2M-product catalog (Amazon dataset) indexed with FAISS.

**Real-world grounding:**
- 2M real Amazon products with FAISS HNSW index (3.4GB, ~10ms search)
- Real e-commerce tools: catalog.search, cart.add, order.list, return.initiate
- Real return policies with eligibility windows
- Persona-driven customer simulator with hidden preferences

**Reward:**

    r_total = w_task × r_task + w_eff × r_eff + w_hall × r_hall
    r_task = clip(0.55 × r_rank + 0.35 × r_constraints + 0.10 × r_oos, -1, 1)
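
The two reward formulas above translate directly into code. In this sketch the function names are hypothetical, and the outer weights `w_task`, `w_eff`, and `w_hall` are left as required arguments because their values are not given here.

```python
def clip(value: float, low: float, high: float) -> float:
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


def task_reward(r_rank: float, r_constraints: float, r_oos: float) -> float:
    """r_task: weighted sum of the three task components, clipped to [-1, 1]."""
    return clip(0.55 * r_rank + 0.35 * r_constraints + 0.10 * r_oos, -1.0, 1.0)


def total_reward(r_task: float, r_eff: float, r_hall: float,
                 w_task: float, w_eff: float, w_hall: float) -> float:
    """r_total: weighted sum of the task, efficiency, and hallucination terms."""
    return w_task * r_task + w_eff * r_eff + w_hall * r_hall
```

Keeping `task_reward` separate from `total_reward` means each term can be logged and inspected on its own, in the same spirit as the decomposed reward in the Bio project.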

**8 environment types:** Product Discovery, Substitution, Cart Building, Return+Replacement, Order Tracking, Policy QA, Bundle Planning, Multi-Intent Journey

**Episode length:** Up to 14 turns

### Finalist: VRAM (Voyager-VRAM)

**Domain:** Workplace Project Management / Memory

**What it is:** The agent manages a 6-week software project across 31 tools (Email, Slack, Calendar, Drive, Sheets, Notes, Meta-Search). Hidden state includes stale spreadsheets, chat-only constraints, and changed deadlines.

**Key innovation:** Voyager architecture with a Skill Library (reusable tool sequences), Working Memory (structured within-episode state), and Episodic Memory (cross-episode learning).

**Training:** Expert Iteration: Best-of-4 rejection sampling + SFT × 3 rounds

**Result:** 21% improvement (5.75 vs 4.74 shaped reward) before any training, just from architecture.

## 4. OpenEnv Technical Requirements (Minimum to Submit)

- Use OpenEnv v0.2.1 (openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git@v0.2.1)
- Minimal training script using Unsloth or HF TRL in Colab
- Mini-blog on HuggingFace or mini-video on YouTube (< 2 minutes)
- Deployed on HF Spaces as Docker app
- Judging weights:
  - Environment Innovation (40%)
  - Storytelling (30%)
  - Showing Improvement in Rewards (20%)
  - Reward + Training Script Setup (10%)

## 5. Our Project: Clinical Trial Designer

### Core Concept

The agent designs a clinical trial to detect a drug effect. The simulator holds hidden ground truth (true effect size, true side effect rate, true responder population). The agent must design a trial that would statistically detect it.

- **Theme:** #3.1, World Modeling / Professional Tasks

**Real-world grounding:**
- FDA trial design rules are real and codified (Phase I/II/III requirements)
- Statistical power calculations are pure math (no LLM needed)
- Trial simulation runs with hidden true parameters; the p-value is ground truth
- Clinical trial protocols follow established procedures (ICH E9, FDA guidance)

### The World (Hidden State)

When reset() is called, the simulator secretly sets:

    from dataclasses import dataclass

    @dataclass
    class TrialGroundTruth:
        true_effect_size: float         # e.g. 0.23 (23% tumor reduction)
        true_side_effect_rate: float    # e.g. 0.08 (8% serious adverse events)
        true_responder_population: str  # e.g. "BRCA1+ only" (agent doesn't know this)
        true_mechanism: str             # e.g. "inhibits VEGF pathway"
        true_dose_response: dict        # dose → effect curve (hidden)
        placebo_response_rate: float    # background noise
        dropout_rate: float             # patients who leave trial

The agent never sees this. It must design a trial that would detect it.

### Agent Actions (What the Agent Does)

    @dataclass
    class TrialAction:
        action_type: ActionType  # one of the actions below
        parameters: dict
        justification: str
        confidence: float        # 0.0–1.0

Action vocabulary:

| Phase | Action | Real-world analog |
|---|---|---|
| Design | set_primary_endpoint | Choose what to measure (OS, PFS, ORR) |
| Design | set_sample_size | Power calculation → n patients |
| Design | set_inclusion_criteria | Who can enroll |
| Design | set_exclusion_criteria | Who is excluded |
| Design | set_dosing_schedule | Dose, frequency, cycle length |
| Design | set_control_arm | Placebo vs standard of care |
| Design | set_randomization_ratio | 1:1, 2:1, etc. |
| Design | set_blinding | Open-label, single-blind, double-blind |
| Phase I | run_dose_escalation | 3+3 design, find MTD |
| Phase I | observe_safety_signal | Read adverse event data |
| Phase I | estimate_effect_size | Estimate from Phase I data |
| Phase II | run_interim_analysis | Check futility/efficacy at 50% enrollment |
| Phase II | modify_sample_size | Adaptive design adjustment |
| Phase II | add_biomarker_stratification | Enrich for responders |
| Regulatory | submit_to_fda_review | Check protocol compliance |
| Regulatory | request_protocol_amendment | Change design mid-trial |
| Analysis | run_primary_analysis | Final statistical test |
| Analysis | synthesize_conclusion | Write trial conclusion |
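
One way to pin down the action vocabulary is an enum plus a validated dataclass, sketched below. The 18 action names come from the table; the default field values, the string-to-enum coercion, and the range check on `confidence` are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActionType(str, Enum):
    # Design
    SET_PRIMARY_ENDPOINT = "set_primary_endpoint"
    SET_SAMPLE_SIZE = "set_sample_size"
    SET_INCLUSION_CRITERIA = "set_inclusion_criteria"
    SET_EXCLUSION_CRITERIA = "set_exclusion_criteria"
    SET_DOSING_SCHEDULE = "set_dosing_schedule"
    SET_CONTROL_ARM = "set_control_arm"
    SET_RANDOMIZATION_RATIO = "set_randomization_ratio"
    SET_BLINDING = "set_blinding"
    # Phase I
    RUN_DOSE_ESCALATION = "run_dose_escalation"
    OBSERVE_SAFETY_SIGNAL = "observe_safety_signal"
    ESTIMATE_EFFECT_SIZE = "estimate_effect_size"
    # Phase II
    RUN_INTERIM_ANALYSIS = "run_interim_analysis"
    MODIFY_SAMPLE_SIZE = "modify_sample_size"
    ADD_BIOMARKER_STRATIFICATION = "add_biomarker_stratification"
    # Regulatory
    SUBMIT_TO_FDA_REVIEW = "submit_to_fda_review"
    REQUEST_PROTOCOL_AMENDMENT = "request_protocol_amendment"
    # Analysis
    RUN_PRIMARY_ANALYSIS = "run_primary_analysis"
    SYNTHESIZE_CONCLUSION = "synthesize_conclusion"


@dataclass
class TrialAction:
    action_type: ActionType
    parameters: dict = field(default_factory=dict)
    justification: str = ""
    confidence: float = 0.5

    def __post_init__(self):
        # Accept raw strings from the agent; unknown names raise ValueError.
        self.action_type = ActionType(self.action_type)
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
```

Rejecting malformed actions at construction time gives the environment a cheap, deterministic place to apply a validity penalty before any simulation runs.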

### Verification (No LLM Judge Needed)

**1. Statistical Power (pure math)**

    from math import sqrt
    from scipy.stats import norm

    def calculate_power(effect_size, n, alpha=0.05):
        # Two-sided, two-arm z-test approximation; n is the number of patients per arm.
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = effect_size * sqrt(n / 2) - z_alpha
        return norm.cdf(z_beta)

    # If power < 0.80 → underpowered → reward penalty
    # Agent must estimate effect_size from Phase I data (hidden true value)
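
Inverting the same formula gives the sample size the agent needs for a target power, which is what `set_sample_size` has to get right. The sketch below uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` so it runs without dependencies. `required_n_per_arm` is a hypothetical helper, and it inherits the assumptions of the power formula (two-sided test, equal arms, standardized effect size).

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def calculate_power(effect_size: float, n_per_arm: int, alpha: float = 0.05) -> float:
    """Normal-approximation power of a two-sided, two-arm comparison."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(effect_size * sqrt(n_per_arm / 2) - z_alpha)


def required_n_per_arm(effect_size: float, power: float = 0.80,
                       alpha: float = 0.05) -> int:
    """Smallest n per arm whose approximate power reaches the target."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)
```

Because the required n grows with 1/effect_size², an agent that overestimates the effect from noisy Phase I data will under-enroll, which is exactly the failure the underpowered-design penalty targets.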

**2. FDA Rule Compliance (hard rules, binary pass/fail)**

    FDA_RULES = {
        "phase_ii_min_n": 100,
        "primary_endpoint_must_be_prespecified": True,
        "interim_analysis_requires_alpha_spending": True,
        "randomization_required_for_phase_iii": True,
        "safety_monitoring_committee_required": True,
        "informed_consent_required": True,
    }
    # Each rule is a hard check, no LLM needed
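
A rule engine over this table can be a plain function that returns one boolean per rule. The sketch below is illustrative only: the `TrialDesign` fields are hypothetical names for whatever the protocol object ends up exposing, and the way each rule maps onto those fields is an assumption.

```python
from dataclasses import dataclass

FDA_RULES = {
    "phase_ii_min_n": 100,
    "primary_endpoint_must_be_prespecified": True,
    "interim_analysis_requires_alpha_spending": True,
    "randomization_required_for_phase_iii": True,
    "safety_monitoring_committee_required": True,
    "informed_consent_required": True,
}


@dataclass
class TrialDesign:  # hypothetical protocol fields
    phase: str  # "I", "II", or "III"
    n_patients: int
    endpoint_prespecified: bool
    has_interim_analysis: bool
    uses_alpha_spending: bool
    randomized: bool
    has_safety_committee: bool
    has_informed_consent: bool


def check_fda_rules(design: TrialDesign, rules: dict = FDA_RULES) -> dict:
    """Return {rule_name: passed} for every rule in the table."""
    results = {
        "phase_ii_min_n": design.phase != "II"
        or design.n_patients >= rules["phase_ii_min_n"],
    }
    boolean_checks = {
        "primary_endpoint_must_be_prespecified": design.endpoint_prespecified,
        "interim_analysis_requires_alpha_spending":
            not design.has_interim_analysis or design.uses_alpha_spending,
        "randomization_required_for_phase_iii":
            design.phase != "III" or design.randomized,
        "safety_monitoring_committee_required": design.has_safety_committee,
        "informed_consent_required": design.has_informed_consent,
    }
    for name, passed in boolean_checks.items():
        # A rule switched off (False) in the table is treated as passing.
        results[name] = passed or not rules[name]
    return results
```

Returning the per-rule results, rather than a single pass/fail, lets the reward code count passed rules and lets the observation tell the agent which rule failed.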

**3. Trial Simulation (run it with hidden truth)**

    def simulate_trial(design, ground_truth):
        # Pseudocode: the sampling steps are described, not implemented.
        # 1. Sample patients from the true population
        # 2. Apply the true effect to the treatment arm → treatment_outcomes
        # 3. Apply the placebo response to the control arm → control_outcomes
        # 4. Add dropout, noise, and adverse events → adverse_events
        # 5. Run the pre-specified statistical test
        p_value = run_statistical_test(treatment_outcomes, control_outcomes)
        success = p_value < design.alpha  # ground truth: did it work?
        # Returns the p-value, the success flag, and the adverse events
        return TrialResult(p_value, success, adverse_events)
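
A runnable version of the simulation might look like the sketch below. It is heavily simplified and every modelling choice is an assumption: outcomes are continuous with unit variance, the placebo response shifts both arms equally, dropout is independent per patient, adverse events are omitted, and the "pre-specified test" is a large-sample two-sided z-test. The flat parameter list replaces the `design` and `ground_truth` objects to keep the example self-contained.

```python
import random
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist, mean, variance


@dataclass
class TrialResult:
    p_value: float
    success: bool
    n_completed: int


def simulate_trial(n_per_arm: int, true_effect_size: float,
                   dropout_rate: float = 0.1, placebo_response: float = 0.0,
                   alpha: float = 0.05, seed: int = 0) -> TrialResult:
    """Simulate a two-arm trial and test the difference in mean outcomes."""
    rng = random.Random(seed)

    def completed_outcomes(mean_outcome: float) -> list:
        # A patient who drops out contributes no outcome.
        return [rng.gauss(mean_outcome, 1.0)
                for _ in range(n_per_arm)
                if rng.random() >= dropout_rate]

    control = completed_outcomes(placebo_response)
    treatment = completed_outcomes(placebo_response + true_effect_size)

    standard_error = sqrt(variance(treatment) / len(treatment)
                          + variance(control) / len(control))
    z = (mean(treatment) - mean(control)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

    return TrialResult(p_value, p_value < alpha, len(treatment) + len(control))
```

Seeding the generator makes each episode reproducible, which helps when a surprising reward needs to be replayed and debugged.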

**4. Budget (math)**

    cost = n_patients * cost_per_patient + site_costs + regulatory_fees
    over_budget = cost > trial_budget  # binary

### Reward Structure

Per-step rewards:

| Component | Verification | Weight |
|---|---|---|
| FDA rule compliance | Hard rule engine | +0.3 per rule passed |
| Valid action sequence | Prerequisite check | +0.2 |
| Information gain from Phase I | Bayesian update quality | +0.4 |
| Budget efficiency | Math | +0.1 |
| Soft violation penalty | Rule engine | -0.15 each |

Terminal rewards (when trial simulation runs):

| Component | Verification | Weight |
|---|---|---|
| Trial detects true effect (p < 0.05) | Simulation math | +5.0 |
| Statistical power ≥ 0.80 | Formula | +2.0 |
| All FDA rules pass | Rule engine | +2.0 |
| Correct responder population identified | Hidden state match | +3.0 |
| Budget under limit | Math | +1.0 |
| Interim analysis catches futility early | Simulation | +1.0 bonus |
| Underpowered design | Formula | -2.0 |
| Wrong primary endpoint | Domain rules | -1.5 |
| Overconfident wrong claims | Calibration check | -0.5 each |

Reward variance for GRPO:
- Successful trial: +8 to +14
- Failed trial (wrong population): -2 to 0
- Timeout / FDA rejection: -3
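
The terminal table can be sketched as a function that returns each component alongside the total, so individual terms stay debuggable. The function and argument names are hypothetical, and treating "power ≥ 0.80" and "underpowered design" as two sides of one check is an assumption.

```python
def terminal_reward(*, detected_effect: bool, power: float,
                    all_fda_rules_pass: bool, responder_population_correct: bool,
                    under_budget: bool, caught_futility_early: bool = False,
                    wrong_primary_endpoint: bool = False,
                    overconfident_wrong_claims: int = 0):
    """Return (total, components) using the weights from the terminal table."""
    components = {
        "detects_effect": 5.0 if detected_effect else 0.0,
        "power": 2.0 if power >= 0.80 else -2.0,  # bonus, or underpowered penalty
        "fda_rules": 2.0 if all_fda_rules_pass else 0.0,
        "responder_population": 3.0 if responder_population_correct else 0.0,
        "budget": 1.0 if under_budget else 0.0,
        "futility_bonus": 1.0 if caught_futility_early else 0.0,
        "wrong_endpoint": -1.5 if wrong_primary_endpoint else 0.0,
        "overconfidence": -0.5 * overconfident_wrong_claims,
    }
    return sum(components.values()), components
```

With every positive component earned and no penalties, the total is 14.0, which matches the top of the "successful trial" range above.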

### Episode Structure (Long Horizon)

**Phase I (20–30 steps):**
- dose_escalation × 6 cohorts
- observe_safety_signal × 3
- estimate_effect_size (Bayesian update)
- decide: go/no-go to Phase II

**Phase II (30–40 steps):**
- set_primary_endpoint
- set_sample_size (power calculation)
- set_inclusion_criteria (try to find responder population)
- set_dosing_schedule
- submit_to_fda_review
- run_interim_analysis (at 50% enrollment)
- modify_sample_size if needed
- run_primary_analysis

**Conclusion (5–10 steps):**
- synthesize_conclusion
- Terminal reward fires

**Total:** 55–80 steps per episode (the sum of the phase ranges above)

### Curriculum

| Tier | Difficulty | What changes |
|---|---|---|
| Warmup | 0.0–0.25 | Large effect size (easy to detect), homogeneous population |
| Beginner | 0.25–0.40 | Medium effect, some noise |
| Intermediate | 0.40–0.60 | Small effect, need correct population enrichment |
| Advanced | 0.60–0.80 | Hidden responder subgroup, misleading Phase I signal |
| Expert | 0.80–0.95 | Tiny effect, high dropout, adaptive design required |

Scenarios (4 to start, like Bio project):

| Name | Disease | Challenge | True Effect |
|---|---|---|---|
| solid_tumor_chemo | Non-small cell lung cancer | Find EGFR+ subgroup | 31% PFS improvement in EGFR+ only |
| autoimmune_biologic | Rheumatoid arthritis | Dose-response curve, find optimal dose | U-shaped response, 200mg optimal |
| cns_depression | Treatment-resistant depression | High placebo response masks drug effect | 18% improvement over placebo |
| rare_disease_orphan | Rare pediatric metabolic disorder | Tiny n, adaptive design required | Large effect (Cohen's d = 1.2) but n < 50 |

### Hidden State Structure

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class TrialLatentState:
        # Biology
        true_effect_size: float
        true_responder_criteria: List[str]  # e.g. ["BRCA1+", "age < 65"]
        true_dose_response: Dict[float, float]
        true_mechanism: str
        # Technical
        placebo_response_rate: float
        dropout_rate: float
        site_variability: float
        measurement_noise: float
        # Progress flags (18 milestones like Bio project)
        phase_i_complete: bool
        mtd_identified: bool
        effect_estimated: bool
        protocol_submitted: bool
        interim_complete: bool
        trial_complete: bool
        # Resources
        budget_remaining: float
        time_remaining_days: int
        patients_enrolled: int

### Key Design Decisions

- Real statistical math: scipy.stats does the power calculations. No LLM.
- FDA rules as hard constraints: ICH E9 guidelines encoded as a rule engine (like Bio project's prerequisite rules).
- Simulation is ground truth: the trial either detects the effect or doesn't. Same as KubeSRE's pod status.
- Phase I → Phase II information flow: the agent must use Phase I observations to update its Phase II design. This is the long-horizon planning challenge.
- Hidden responder population: the hardest part. The agent must figure out that the drug only works in a subgroup (for example, BRCA1+ patients) by designing smart inclusion criteria. This is where the curriculum earns its keep.
- Decomposed reward: like Bio project, each component is independently verifiable and debuggable.

## 6. Rules Learned from Winners

### Environment Design Rules

- One clear success criterion: pod running, p < 0.05, books balance
- Real tools/APIs, not mocked: real kubectl, real scipy, real SQL
- Prerequisite chains: can't run Phase II without Phase I (like Bio project's rule engine)
- Reward variance: GRPO needs clear separation between good and bad episodes
- No reward hacking: multi-layer verification (programmatic + optional LLM)
- Environment must fight back: too-easy rewards cause plateaus (KubeSRE lesson)
- Repeat penalty: prevents the agent from spamming the same action
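
The prerequisite-chain rule can be sketched as a lookup from each action to the actions that must already be complete. The action names come from the project's action vocabulary, but the specific dependencies in `PREREQUISITES` are illustrative assumptions, not a settled protocol.

```python
# Hypothetical dependency table: action -> actions that must come first.
PREREQUISITES = {
    "estimate_effect_size": {"run_dose_escalation"},
    "set_sample_size": {"estimate_effect_size"},
    "submit_to_fda_review": {"set_primary_endpoint", "set_sample_size"},
    "run_interim_analysis": {"submit_to_fda_review"},
    "run_primary_analysis": {"submit_to_fda_review"},
    "synthesize_conclusion": {"run_primary_analysis"},
}


def missing_prerequisites(action: str, completed) -> list:
    """Return the prerequisite actions not yet completed, sorted by name."""
    return sorted(PREREQUISITES.get(action, set()) - set(completed))


def is_valid_step(action: str, completed) -> bool:
    """An action is valid when none of its prerequisites are missing."""
    return not missing_prerequisites(action, completed)
```

Reporting which prerequisites are missing, rather than only rejecting the action, gives the agent an observation it can act on in the next step.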

### Training Rules

- GRPO over PPO: better for sparse delayed rewards, no value function needed
- 8 parallel rollouts: gives GRPO enough variance to compute advantages
- Curriculum is mandatory: cold start on hard problems = no learning signal
- Fast-track advancement: 90%+ success rate → skip min_episodes requirement
- Episode transcripts: save to JSONL for debugging and offline analysis
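
The link between parallel rollouts and reward variance shows up in how GRPO scores each rollout relative to its group. The sketch below illustrates that group-relative normalization only; it is not the TRL implementation, and the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps: float = 1e-4) -> list:
    """Normalize each rollout's reward against its own group of rollouts."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(reward - baseline) / (spread + eps) for reward in rewards]
```

If all 8 rollouts earn the same reward, every advantage is zero and the update carries no signal. That is why the rules above insist on clear separation between good and bad episodes and on a curriculum that avoids cold starts.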

### Reward Rules

- Timeout = net negative: wipe accumulated rewards, set to -2.0 total
- Efficiency scaling: faster fixes get higher bonuses (prevents lazy solutions)
- Phase-order bonus: reward correct workflow sequence
- Overconfidence penalty: high-confidence wrong claims get penalized (Bio project)
- Decompose rewards: makes debugging and training easier

### Pitfalls to Avoid

- LLM-only verification: too slow, too expensive, too noisy
- Too-generous rewards: agent finds a plateau and stops improving
- Static scenarios: agent memorizes, doesn't generalize
- Single-fault only: too easy, no curriculum progression
- Mocked tool responses: agent learns to exploit the mock, not real behavior
- Truncated observations: KubeSRE bug where the judge was cutting off pods alphabetically

## 7. Tech Stack (Based on Winners)

    Environment:  openenv-core[core] @ v0.2.1
    Server:       FastAPI + uvicorn
    Training:     HF TRL (GRPOTrainer) + vLLM colocate
    Model:        Qwen3-1.7B or Qwen2.5-7B + LoRA (BF16)
    Deployment:   Docker → HuggingFace Spaces
    Compute:      H100 80GB (training) + GKE/cloud (environment)
    Stats:        scipy.stats (power calculations)

### Training command pattern

    # Terminal 1: Environment server
    uv run server

    # Terminal 2: GRPO training
    python train.py --vllm-mode colocate --num-generations 8 --max-steps 100

## 8. Pitch Strategy (3 min)

Based on judging criteria (40% innovation, 30% storytelling, 20% reward improvement, 10% pipeline):

**Minute 1: Story (30% of score)**

"A drug works. But only in 15% of patients. The FDA needs proof. How do you design a trial that finds those patients before you run out of money?"

**Minute 2: Environment Innovation (40% of score)**

Show: hidden ground truth, statistical verification, FDA rule engine, Phase I → Phase II information flow

**Minute 3: Reward Curves + Demo (20% reward improvement + 10% pipeline = 30% of score)**

Show the reward curve improving. Show the agent learning to enrich for the responder population. Show before/after: random inclusion criteria vs. learned BRCA1+ enrichment.