# Knowledge Base: OpenEnv Hackathon
## Clinical Trial Designer Project

## 1. What OpenEnv Actually Is
OpenEnv is a framework for building RL training environments served as FastAPI apps.

Agent (LLM) ──sends action──► Environment.step() ──returns──► observation + reward
                                      ↑
                         YOUR world lives here
The contract:

class YourEnvironment(Environment):
    def reset(self) -> YourObservation: ...         # start new episode, inject scenario
    def step(self, action) -> YourObservation: ...  # apply action, return obs + reward + done
    @property
    def state(self) -> State: ...                   # current episode state
Served via:

app = create_app(YourEnvironment, YourAction, YourObservation, env_name="your_env")
# → FastAPI with POST /reset, POST /step, GET /state, GET /schema, WS /ws
Deployed as: Docker-based HuggingFace Space using openenv.yaml

Trained via: HF TRL GRPO. The agent generates rollouts against the live environment, gets a reward signal, and updates its weights.

## 2. What Every Winner Had in Common
### The Non-Negotiable Pattern
1. Real world state (not just text)
2. Actions that change that state (real commands / real math)
3. Verification WITHOUT an LLM judge (system state, math, test pass/fail)
4. Curriculum (easy → hard, progressive difficulty)
5. Long episodes (15–100+ steps)
6. Clear reward variance (GRPO needs +high vs -low separation)
### The Single Most Important Rule
Ground truth must be objective. Either the pod is running or it isn't. Either the p-value is < 0.05 or it isn't. Either the books balance or they don't.

If you need an LLM to judge whether the agent succeeded, your environment is weak.

## 3. Past Winners: What They Built & Why They Won
### 🥇 1st Place: Kube SRE (kube-sre-gym)
Domain: Kubernetes Site Reliability Engineering

What it is: Agent receives a PagerDuty alert about a broken K8s cluster. Must diagnose and fix using real kubectl commands against a live GKE cluster.

Real-world grounding:

- Live GKE cluster (not simulated)
- Real kubectl commands execute against real pods
- Real failure modes: OOMKill, CrashLoopBackOff, ImagePullBackOff, scale-to-zero
- Real SRE workflow: triage → investigate → fix → verify

Verification (no LLM needed for core check):

- Pod status is ground truth: Running or not
- Restart counts and OOM flags are real K8s events
- LLM judge used only as a secondary confirmation layer

Reward structure:

- Per-step: LLM judge score (-1.0 to +1.0) for SRE workflow quality
- Repeat penalty: -0.15 per repeated command
- Resolution bonus: +1.0 to +5.0 (efficiency-scaled, faster = higher)
- Timeout: net -2.0 for failed episodes
- Phase-order bonus: +0.2 for correct triage → investigate → fix → verify sequence
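
The reward structure above could be accumulated roughly as follows. This is a minimal sketch, not the kube-sre-gym implementation: the class name `EpisodeRewards`, the linear efficiency scaling, and the omission of the phase-order bonus are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

REPEAT_PENALTY = -0.15           # per repeated command (from the list above)
TIMEOUT_TOTAL = -2.0             # net episode reward when the incident is not resolved
MIN_BONUS, MAX_BONUS = 1.0, 5.0  # resolution bonus range


@dataclass
class EpisodeRewards:
    """Hypothetical accumulator for one episode's reward."""
    max_steps: int
    seen: set = field(default_factory=set)
    total: float = 0.0
    steps: int = 0

    def step(self, command: str, judge_score: float) -> float:
        """Per-step reward: clamped judge score plus the repeat penalty."""
        reward = max(-1.0, min(1.0, judge_score))
        if command in self.seen:
            reward += REPEAT_PENALTY
        self.seen.add(command)
        self.steps += 1
        self.total += reward
        return reward

    def finish(self, resolved: bool) -> float:
        """Final episode return."""
        if not resolved:
            self.total = TIMEOUT_TOTAL  # wipe accumulated reward
            return self.total
        # Assumed linear scaling: fewer steps used means a bonus closer to MAX_BONUS.
        unused = 1.0 - self.steps / self.max_steps
        self.total += MIN_BONUS + (MAX_BONUS - MIN_BONUS) * unused
        return self.total
```

Overwriting the total on timeout, rather than subtracting from it, keeps a failed episode strictly worse than any resolved one regardless of how many per-step points it collected.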

Curriculum:

- Warmup (0.0–0.25): single easy faults (OOM, crashloop, image pull)
- Beginner (0.25–0.40): medium faults (bad config, scale zero)
- Intermediate (0.40–0.60): harder investigation required
- Advanced (0.60–0.80): compound multi-fault scenarios
- Expert (0.80–0.95): adversarial LLM-designed incidents across all 3 namespaces

Adversarial Designer:

- Claude designs incidents targeting the agent's tracked weak spots
- Multi-fault scenarios spread across namespaces with red herrings
- Scenarios must be solvable within step budget (inject/fix pairs validated)

Judge personas (scale with difficulty):

- Junior (< 0.4): lenient, gives hints
- Senior (0.4–0.7): standard SRE expectations
- Principal (> 0.7): strict, penalizes inefficiency

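
One way to turn the difficulty bands above into lookups is sketched here. The function names are hypothetical, and the boundary handling (a value on a shared edge goes to the harder tier, 0.4 and 0.7 both count as Senior) is an assumption, since the listed ranges share their endpoints.

```python
# (exclusive upper bound of difficulty, tier name), in increasing order
TIERS = [
    (0.25, "warmup"),
    (0.40, "beginner"),
    (0.60, "intermediate"),
    (0.80, "advanced"),
    (0.95, "expert"),
]


def tier_for(difficulty: float) -> str:
    """Map a difficulty score to a curriculum tier name."""
    for upper, name in TIERS:
        if difficulty < upper:
            return name
    return "expert"  # anything at or above 0.95 stays in the top tier


def judge_persona(difficulty: float) -> str:
    """Pick the judge persona that matches the current difficulty."""
    if difficulty < 0.4:
        return "junior"
    if difficulty <= 0.7:
        return "senior"
    return "principal"
```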
Key insight that won it: Environment co-evolved with the agent. Training exposed bugs in the command parser, judge truncation, and health check race conditions. Fixing them made both the environment and agent better.

Episode length: 15–25 steps (scales with difficulty)

Model: Qwen3-1.7B + LoRA, GRPO with 8 parallel rollouts

### 🥈 2nd Place: Bio Experiment Environment
Domain: Biological Research / Single-Cell Genomics

What it is: Agent plans a biological experiment pipeline step-by-step. Hidden ground truth (true DE genes, true effect sizes, true cell populations) is never revealed. Agent must design experiments that would discover it.

Real-world grounding:

- Real bioinformatics tools: Scanpy, Seurat, DESeq2, Monocle3, SCENIC
- Real scientific workflow: collect → QC → normalize → cluster → DE → conclude
- Real lab constraints: budget ($80K–$120K), time (120–180 days), action costs
- Literature-backed scenarios with real DOIs and true DE genes with log2FC values
- 4 real biological scenarios: cardiac disease, hematopoiesis, perturbation, biomarker validation

Verification:

- Prerequisite rules are hard constraints (can't run DE before normalization, as in real science)
- Budget/time math is ground truth
- Terminal reward: conclusions compared against hidden ground truth markers/mechanisms
- Calibration score: how well the agent's claims match true biology

Reward structure (decomposed):

R_t = r_validity(0.3) + r_ordering(0.2) + r_info_gain(0.4) + r_efficiency(0.3)
      + r_novelty(+0.1) + r_penalty(-0.15/violation) + shaping(γ=0.99)

Terminal reward adds: pipeline completeness (3.0), calibration (4.0), efficiency (1.0), overconfidence penalty (-0.5/wrong high-confidence claim)
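
A decomposed per-step reward of this shape might look like the sketch below. It rests on several assumptions that the notes do not confirm: the bracketed numbers are read as weights on component scores in [0, 1], the shaping term is read as potential-based shaping with γ = 0.99, and every function and argument name is hypothetical.

```python
WEIGHTS = {"validity": 0.3, "ordering": 0.2, "info_gain": 0.4, "efficiency": 0.3}
NOVELTY_BONUS = 0.1
VIOLATION_PENALTY = -0.15
GAMMA = 0.99


def step_reward(components, novel, violations, phi_prev=0.0, phi_next=0.0):
    """Return (total, parts) so each reward term can be logged and debugged alone."""
    parts = {name: WEIGHTS[name] * components[name] for name in WEIGHTS}
    parts["novelty"] = NOVELTY_BONUS if novel else 0.0
    parts["penalty"] = VIOLATION_PENALTY * violations
    # Assumed potential-based shaping: gamma * phi(s') - phi(s)
    parts["shaping"] = GAMMA * phi_next - phi_prev
    return sum(parts.values()), parts
```

Returning the per-term breakdown alongside the total is what makes the reward easy to debug: a surprising total can be traced to one component.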

POMDP structure:

- Hidden: true cell populations, true DE genes, technical noise, failure conditions
- Visible: task spec, pipeline history, resource usage, intermediate outputs, discovered markers

Episode length: Up to 30 steps

Key insight: Decomposed reward makes it easy to debug and train against. Each component is independently verifiable.

### 🥉 3rd Place: EcomRLVE
Domain: E-commerce Shopping Assistant

What it is: Agent helps a simulated customer (LLM-driven persona) find products, manage a cart, and handle returns. Uses a real 2M-product catalog (Amazon dataset) indexed with FAISS.

Real-world grounding:

- 2M real Amazon products with FAISS HNSW index (3.4GB, ~10ms search)
- Real e-commerce tools: catalog.search, cart.add, order.list, return.initiate
- Real return policies with eligibility windows
- Persona-driven customer simulator with hidden preferences

Reward:

r_total = w_task × r_task + w_eff × r_eff + w_hall × r_hall
r_task = clip(0.55 × r_rank + 0.35 × r_constraints + 0.10 × r_oos, -1, 1)
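
The two formulas translate directly into code. In this sketch the inner coefficients come from the formula above, while the outer weights `w_task`, `w_eff`, and `w_hall` are placeholder values, since the notes do not state them.

```python
def clip(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Clamp x into [lo, hi]."""
    return max(lo, min(hi, x))


def task_reward(r_rank: float, r_constraints: float, r_oos: float) -> float:
    """r_task: weighted mix of ranking, constraint, and out-of-stock terms, clipped to [-1, 1]."""
    return clip(0.55 * r_rank + 0.35 * r_constraints + 0.10 * r_oos)


def total_reward(r_task: float, r_eff: float, r_hall: float,
                 w_task: float = 0.7, w_eff: float = 0.2, w_hall: float = 0.1) -> float:
    """r_total: weighted sum; the default weights are illustrative placeholders."""
    return w_task * r_task + w_eff * r_eff + w_hall * r_hall
```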
8 environment types: Product Discovery, Substitution, Cart Building, Return+Replacement, Order Tracking, Policy QA, Bundle Planning, Multi-Intent Journey

Episode length: Up to 14 turns

### Finalist: VRAM (Voyager-VRAM)
Domain: Workplace Project Management / Memory

What it is: Agent manages a 6-week software project across 31 tools (Email, Slack, Calendar, Drive, Sheets, Notes, Meta-Search). Hidden state includes stale spreadsheets, chat-only constraints, changed deadlines.

Key innovation: Voyager architecture with three parts: Skill Library (reusable tool sequences), Working Memory (structured within-episode state), and Episodic Memory (cross-episode learning).

Training: Expert Iteration (Best-of-4 rejection sampling + SFT, × 3 rounds)

Result: 21% improvement (5.75 vs 4.74 shaped reward) before any training, just from architecture.

## 4. OpenEnv Technical Requirements (Minimum to Submit)
- Use OpenEnv v0.2.1 (openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git@v0.2.1)
- Minimal training script using Unsloth or HF TRL in Colab
- Mini-blog on HuggingFace or mini-video on YouTube (< 2 minutes)
- Deployed on HF Spaces as Docker app
- Judging weights:
  - Environment Innovation (40%)
  - Storytelling (30%)
  - Showing Improvement in Rewards (20%)
  - Reward + Training Script Setup (10%)
## 5. Our Project β€” Clinical Trial Designer
### Core Concept
Agent designs a clinical trial to detect a drug effect. The simulator holds hidden ground truth (true effect size, true side effect rate, true responder population). Agent must design a trial that would statistically detect it.

- **Theme:** #3.1, World Modeling / Professional Tasks

Real-world grounding:

- FDA trial design rules are real and codified (Phase I/II/III requirements)
- Statistical power calculations are pure math (no LLM needed)
- Trial simulation runs with hidden true parameters → p-value is ground truth
- Clinical trial protocols follow established procedures (ICH E9, FDA guidance)
### The World (Hidden State)
When reset() is called, the simulator secretly sets:

class TrialGroundTruth:
    true_effect_size: float          # e.g. 0.23 (23% tumor reduction)
    true_side_effect_rate: float     # e.g. 0.08 (8% serious adverse events)
    true_responder_population: str   # e.g. "BRCA1+ only" (agent doesn't know this)
    true_mechanism: str              # e.g. "inhibits VEGF pathway"
    true_dose_response: dict         # dose → effect curve (hidden)
    placebo_response_rate: float     # background noise
    dropout_rate: float              # patients who leave trial
Agent never sees this. It must design a trial that would detect it.
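
To make the hidden/visible split concrete, here is a minimal sketch of how `reset()` might draw a ground truth and build the first observation. The sampling ranges, the reduced field set, and the helper names `sample_ground_truth` and `initial_observation` are assumptions for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialGroundTruth:
    true_effect_size: float
    true_side_effect_rate: float
    true_responder_population: str
    placebo_response_rate: float
    dropout_rate: float


def sample_ground_truth(seed: int) -> TrialGroundTruth:
    """Draw one hidden scenario; every range here is an illustrative placeholder."""
    rng = random.Random(seed)
    return TrialGroundTruth(
        true_effect_size=rng.uniform(0.10, 0.40),
        true_side_effect_rate=rng.uniform(0.02, 0.15),
        true_responder_population=rng.choice(["all patients", "BRCA1+ only"]),
        placebo_response_rate=rng.uniform(0.10, 0.35),
        dropout_rate=rng.uniform(0.05, 0.25),
    )


def initial_observation(truth: TrialGroundTruth, budget: float) -> dict:
    """What the agent sees after reset(): no field of `truth` is copied over."""
    return {"budget_remaining": budget, "patients_enrolled": 0, "phase": "I"}
```

Seeding the generator makes each scenario reproducible, which helps when replaying a failed episode.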

### Agent Actions (What the Agent Does)
class TrialAction:
    action_type: ActionType  # one of the actions below
    parameters: dict
    justification: str
    confidence: float  # 0.0–1.0
Action vocabulary:

| Phase | Action | Real-world analog |
| --- | --- | --- |
| Design | set_primary_endpoint | Choose what to measure (OS, PFS, ORR) |
| Design | set_sample_size | Power calculation → n patients |
| Design | set_inclusion_criteria | Who can enroll |
| Design | set_exclusion_criteria | Who is excluded |
| Design | set_dosing_schedule | Dose, frequency, cycle length |
| Design | set_control_arm | Placebo vs standard of care |
| Design | set_randomization_ratio | 1:1, 2:1, etc. |
| Design | set_blinding | Open-label, single-blind, double-blind |
| Phase I | run_dose_escalation | 3+3 design, find MTD |
| Phase I | observe_safety_signal | Read adverse event data |
| Phase I | estimate_effect_size | Estimate from Phase I data |
| Phase II | run_interim_analysis | Check futility/efficacy at 50% enrollment |
| Phase II | modify_sample_size | Adaptive design adjustment |
| Phase II | add_biomarker_stratification | Enrich for responders |
| Regulatory | submit_to_fda_review | Check protocol compliance |
| Regulatory | request_protocol_amendment | Change design mid-trial |
| Analysis | run_primary_analysis | Final statistical test |
| Analysis | synthesize_conclusion | Write trial conclusion |
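
Ordering constraints between these actions can be checked with a plain lookup. The prerequisite map below is hypothetical (the notes do not list the actual chains); it only illustrates the shape of such a check using action names from the table.

```python
from typing import Dict, List, Set

# Hypothetical prerequisite map: action -> actions that must already have run.
PREREQUISITES: Dict[str, Set[str]] = {
    "estimate_effect_size": {"run_dose_escalation"},
    "set_sample_size": {"set_primary_endpoint"},
    "submit_to_fda_review": {"set_primary_endpoint", "set_sample_size"},
    "run_interim_analysis": {"submit_to_fda_review"},
    "run_primary_analysis": {"submit_to_fda_review"},
    "synthesize_conclusion": {"run_primary_analysis"},
}


def missing_prerequisites(action: str, history: List[str]) -> Set[str]:
    """Return the prerequisite actions not yet present in the episode history."""
    return PREREQUISITES.get(action, set()) - set(history)
```

An empty result means the action is valid at this point in the episode; a non-empty one can feed the soft violation penalty.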
### Verification (No LLM Judge Needed)
1. Statistical Power: pure math

from math import sqrt

from scipy.stats import norm

def calculate_power(effect_size, n, alpha=0.05):
    # Two-sided z-test; n is read as patients per arm, effect_size as a standardized difference
    z_alpha = norm.ppf(1 - alpha/2)
    z_beta = effect_size * sqrt(n/2) - z_alpha
    return norm.cdf(z_beta)

# If power < 0.80 → underpowered → reward penalty
# Agent must estimate effect_size from Phase I data (hidden true value)
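
Inverting the same approximation gives the sample size needed for a target power. The sketch below uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` so it runs without dependencies; `required_n_per_arm` is a hypothetical helper name, and `n` is assumed to mean patients per arm.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def calculate_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-arm z-test with n patients per arm."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(effect_size * sqrt(n / 2) - z_alpha)


def required_n_per_arm(effect_size: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Smallest n per arm reaching the target power: n = 2 * ((z_alpha + z_beta) / d)^2."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)
```

For a medium standardized effect of 0.5 this yields 63 patients per arm, and the power formula confirms that 62 falls just short of 0.80.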
2. FDA Rule Compliance: hard rules (binary pass/fail)

FDA_RULES = {
    "phase_ii_min_n": 100,
    "primary_endpoint_must_be_prespecified": True,
    "interim_analysis_requires_alpha_spending": True,
    "randomization_required_for_phase_iii": True,
    "safety_monitoring_committee_required": True,
    "informed_consent_required": True,
}
# Each rule is a hard check: no LLM needed
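
A rule engine over such a table can be a handful of explicit checks. In this sketch the `design` dictionary keys (`phase`, `n_patients`, `primary_endpoint`, `interim_analysis`, `alpha_spending`) are assumed names, and only three of the rules are shown.

```python
from typing import Dict, List

FDA_RULES = {
    "phase_ii_min_n": 100,
    "primary_endpoint_must_be_prespecified": True,
    "interim_analysis_requires_alpha_spending": True,
}


def failed_rules(design: Dict) -> List[str]:
    """Return the names of violated rules; an empty list means the protocol passes."""
    failed = []
    if design.get("phase") == "II" and design.get("n_patients", 0) < FDA_RULES["phase_ii_min_n"]:
        failed.append("phase_ii_min_n")
    if FDA_RULES["primary_endpoint_must_be_prespecified"] and not design.get("primary_endpoint"):
        failed.append("primary_endpoint_must_be_prespecified")
    if (FDA_RULES["interim_analysis_requires_alpha_spending"]
            and design.get("interim_analysis")
            and not design.get("alpha_spending")):
        failed.append("interim_analysis_requires_alpha_spending")
    return failed
```

Returning rule names rather than a single boolean lets the per-step reward count rules passed and lets the observation tell the agent what to amend.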
3. Trial Simulation: run it with hidden truth

def simulate_trial(design, ground_truth):
    # Sample patients from true population
    # Apply true effect to treatment arm
    # Apply placebo response to control arm
    # Add dropout, noise, adverse events
    # Run pre-specified statistical test
    # Return: p_value, confidence_interval, adverse_event_rate
    
    p_value = run_statistical_test(treatment_outcomes, control_outcomes)
    success = p_value < design.alpha  # ground truth: did it work?
    return TrialResult(p_value, success, adverse_events)
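
A runnable version of this outline, under simplifying assumptions, might look like the following. It assumes a binary response endpoint, an additive treatment effect on the response rate, and a pooled two-proportion z-test computed with the standard library; the signature and the returned dictionary are illustrative, not the project's actual `TrialResult`.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_trial(n_per_arm, placebo_rate, true_effect, dropout_rate, alpha=0.05, seed=0):
    """Simulate a two-arm trial with a binary response endpoint."""
    rng = random.Random(seed)

    def run_arm(response_rate):
        responders = completers = 0
        for _ in range(n_per_arm):
            if rng.random() < dropout_rate:
                continue  # patient left the trial before assessment
            completers += 1
            responders += rng.random() < response_rate
        return responders, completers

    x_t, n_t = run_arm(min(1.0, placebo_rate + true_effect))  # treatment arm
    x_c, n_c = run_arm(placebo_rate)                          # control arm
    if n_t == 0 or n_c == 0:
        return {"p_value": 1.0, "success": False}

    pooled = (x_t + x_c) / (n_t + n_c)
    se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    if se == 0:
        return {"p_value": 1.0, "success": False}
    z = (x_t / n_t - x_c / n_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Success requires significance in the beneficial direction.
    return {"p_value": p_value, "success": p_value < alpha and z > 0}
```

Because the random generator is seeded, the same design and ground truth always produce the same p-value, so a reward can be reproduced exactly when debugging.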
4. Budget: math

cost = n_patients * cost_per_patient + site_costs + regulatory_fees
over_budget = cost > trial_budget  # binary
### Reward Structure
Per-step rewards:

| Component | Verification | Weight |
| --- | --- | --- |
| FDA rule compliance | Hard rule engine | +0.3 per rule passed |
| Valid action sequence | Prerequisite check | +0.2 |
| Information gain from Phase I | Bayesian update quality | +0.4 |
| Budget efficiency | Math | +0.1 |
| Soft violation penalty | Rule engine | -0.15 each |

Terminal rewards (when trial simulation runs):

| Component | Verification | Weight |
| --- | --- | --- |
| Trial detects true effect (p < 0.05) | Simulation math | +5.0 |
| Statistical power ≥ 0.80 | Formula | +2.0 |
| All FDA rules pass | Rule engine | +2.0 |
| Correct responder population identified | Hidden state match | +3.0 |
| Budget under limit | Math | +1.0 |
| Interim analysis catches futility early | Simulation | +1.0 bonus |
| Underpowered design | Formula | -2.0 |
| Wrong primary endpoint | Domain rules | -1.5 |
| Overconfident wrong claims | Calibration check | -0.5 each |

Reward variance for GRPO:

- Successful trial: +8 to +14
- Failed trial (wrong population): -2 to 0
- Timeout / FDA rejection: -3
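
The terminal table can be summed mechanically once the simulation has run. This sketch assumes the power bonus and the underpowered penalty are two sides of one check, and the `TrialOutcome` fields are hypothetical names for the verified facts.

```python
from dataclasses import dataclass


@dataclass
class TrialOutcome:
    p_value: float
    power: float
    all_fda_rules_pass: bool
    responder_population_correct: bool
    under_budget: bool
    futility_caught_early: bool = False
    wrong_primary_endpoint: bool = False
    overconfident_wrong_claims: int = 0


def terminal_reward(o: TrialOutcome) -> float:
    """Sum the terminal components from the table above."""
    reward = 0.0
    reward += 5.0 if o.p_value < 0.05 else 0.0
    reward += 2.0 if o.power >= 0.80 else -2.0  # assumed: bonus and penalty are exclusive
    reward += 2.0 if o.all_fda_rules_pass else 0.0
    reward += 3.0 if o.responder_population_correct else 0.0
    reward += 1.0 if o.under_budget else 0.0
    reward += 1.0 if o.futility_caught_early else 0.0
    reward -= 1.5 if o.wrong_primary_endpoint else 0.0
    reward -= 0.5 * o.overconfident_wrong_claims
    return reward
```

With every positive component firing, the total is 14.0, which matches the upper end of the successful-trial range.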
### Episode Structure (Long Horizon)
Phase I (20–30 steps):
  → dose_escalation × 6 cohorts
  → observe_safety_signal × 3
  → estimate_effect_size (Bayesian update)
  → decide: go/no-go to Phase II

Phase II (30–40 steps):
  → set_primary_endpoint
  → set_sample_size (power calculation)
  → set_inclusion_criteria (try to find responder population)
  → set_dosing_schedule
  → submit_to_fda_review
  → run_interim_analysis (at 50% enrollment)
  → modify_sample_size if needed
  → run_primary_analysis

Conclusion (5–10 steps):
  → synthesize_conclusion
  → Terminal reward fires
Total: 55–80 steps per episode (the sum of the three phase ranges)

### Curriculum
| Tier | Difficulty | What changes |
| --- | --- | --- |
| Warmup | 0.0–0.25 | Large effect size (easy to detect), homogeneous population |
| Beginner | 0.25–0.40 | Medium effect, some noise |
| Intermediate | 0.40–0.60 | Small effect, need correct population enrichment |
| Advanced | 0.60–0.80 | Hidden responder subgroup, misleading Phase I signal |
| Expert | 0.80–0.95 | Tiny effect, high dropout, adaptive design required |

Scenarios (4 to start, like Bio project)

| Name | Disease | Challenge | True Effect |
| --- | --- | --- | --- |
| solid_tumor_chemo | Non-small cell lung cancer | Find EGFR+ subgroup | 31% PFS improvement in EGFR+ only |
| autoimmune_biologic | Rheumatoid arthritis | Dose-response curve, find optimal dose | U-shaped response, 200mg optimal |
| cns_depression | Treatment-resistant depression | High placebo response masks drug effect | 18% improvement over placebo |
| rare_disease_orphan | Rare pediatric metabolic disorder | Tiny n, adaptive design required | Large effect (Cohen's d = 1.2) but n < 50 |

### Hidden State Structure
from typing import Dict, List

class TrialLatentState:
    # Biology
    true_effect_size: float
    true_responder_criteria: List[str]   # e.g. ["BRCA1+", "age < 65"]
    true_dose_response: Dict[float, float]
    true_mechanism: str
    
    # Technical
    placebo_response_rate: float
    dropout_rate: float
    site_variability: float
    measurement_noise: float
    
    # Progress flags (18 milestones like Bio project)
    phase_i_complete: bool
    mtd_identified: bool
    effect_estimated: bool
    protocol_submitted: bool
    interim_complete: bool
    trial_complete: bool
    
    # Resources
    budget_remaining: float
    time_remaining_days: int
    patients_enrolled: int
### Key Design Decisions
- Real statistical math: scipy.stats does the power calculations. No LLM.
- FDA rules as hard constraints: ICH E9 guidelines encoded as a rule engine (like the Bio project's prerequisite rules).
- Simulation is ground truth: the trial either detects the effect or it doesn't. Same as KubeSRE's pod status.
- Phase I → Phase II information flow: the agent must use Phase I observations to update its Phase II design. This is the long-horizon planning challenge.
- Hidden responder population: the hardest part. The agent must figure out that the drug only works in BRCA1+ patients by designing smart inclusion criteria. This is where the curriculum earns its keep.
- Decomposed reward: like the Bio project, each component is independently verifiable and debuggable.
## 6. Rules Learned from Winners
### Environment Design Rules
- One clear success criterion: pod running, p < 0.05, books balance
- Real tools/APIs, not mocked: real kubectl, real scipy, real SQL
- Prerequisite chains: can't run Phase II without Phase I (like the Bio project's rule engine)
- Reward variance: GRPO needs clear separation between good and bad episodes
- No reward hacking: multi-layer verification (programmatic + optional LLM)
- Environment must fight back: too-easy rewards cause plateaus (KubeSRE lesson)
- Repeat penalty: prevents the agent from spamming the same action
### Training Rules
- GRPO over PPO: better for sparse delayed rewards, no value function needed
- 8 parallel rollouts: gives GRPO enough variance to compute advantages
- Curriculum is mandatory: cold start on hard problems = no learning signal
- Fast-track advancement: 90%+ success rate → skip min_episodes requirement
- Episode transcripts: save to JSONL for debugging and offline analysis
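
Why reward variance matters for GRPO can be seen from the group-relative advantage, which normalizes each rollout's reward against the other rollouts for the same prompt. This is a minimal sketch of that normalization, not TRL's implementation; it uses the population standard deviation and a small `eps` as simplifying assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If all rollouts in a group score the same, every advantage is zero and the update carries no signal, which is why rewards need clear separation between good and bad episodes.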
### Reward Rules
- Timeout = net negative: wipe accumulated rewards, set to -2.0 total
- Efficiency scaling: faster fixes get higher bonuses (prevents lazy solutions)
- Phase-order bonus: reward correct workflow sequence
- Overconfidence penalty: high-confidence wrong claims get penalized (Bio project)
- Decompose rewards: makes debugging and training easier
### Pitfalls to Avoid
- LLM-only verification: too slow, too expensive, too noisy
- Too-generous rewards: agent finds a plateau and stops improving
- Static scenarios: agent memorizes, doesn't generalize
- Single-fault only: too easy, no curriculum progression
- Mocked tool responses: agent learns to exploit the mock, not real behavior
- Truncated observations: KubeSRE bug where the judge was cutting off pods alphabetically
## 7. Tech Stack (Based on Winners)
Environment:     openenv-core[core] @ v0.2.1
Server:          FastAPI + uvicorn
Training:        HF TRL (GRPOTrainer) + vLLM colocate
Model:           Qwen3-1.7B or Qwen2.5-7B + LoRA (BF16)
Deployment:      Docker → HuggingFace Spaces
Compute:         H100 80GB (training) + GKE/cloud (environment)
Stats:           scipy.stats (power calculations)
### Training command pattern

# Terminal 1: Environment server
uv run server

# Terminal 2: GRPO training
python train.py --vllm-mode colocate --num-generations 8 --max-steps 100
## 8. Pitch Strategy (3 min)
Based on judging criteria (40% innovation, 30% storytelling, 20% reward improvement, 10% pipeline):

Minute 1: Story (30% of score)

"A drug works. But only in 15% of patients. The FDA needs proof. How do you design a trial that finds those patients before you run out of money?"

Minute 2: Environment Innovation (40% of score)

Show: hidden ground truth, statistical verification, FDA rule engine, Phase I → Phase II information flow

Minute 3: Reward Curves + Demo (30% of score)

Show reward curve improving. Show agent learning to enrich for responder population. Show before/after: random inclusion criteria vs. learned BRCA1+ enrichment.