File size: 12,109 Bytes
7bdbe90
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
# OpenEnv Environment Research - Key Learnings

Research conducted on 5 top OpenEnv environments to inform hackathon project development.

## Executive Summary

| Environment | Domain | Key Strength | Best For Learning |
|-------------|--------|--------------|-------------------|
| **calendar_env** | Calendar Management | Generic MCP wrapper architecture | Multi-tenant systems, database-backed tasks |
| **reasoning_gym_env** | Reasoning Tasks | Minimal, single-step episodes | Simple task structures, dataset integration |
| **tbench2_env** | Terminal/Tool Use | Dual execution modes (local/docker) | Tool benchmarking, session management |
| **carla_env** | Autonomous Driving | Scenario-based design | Complex simulations, ethical dilemmas |
| **repl_env** | Code Execution | Recursive LLM architecture | Interactive environments, reward shaping |

---

## 1. Calendar Environment (calendar_env)

### Architecture Highlights
- **Generic MCP Wrapper**: Fully reusable `openenv_wrapper/` for any MCP server
- **Multi-Tenancy**: SQLite per agent via `x-database-id` header
- **Rich Database Schema**: Google Calendar API v3 compliant models

### Action/Observation Pattern
```python
# Action
class MCPAction(Action):
    action_type: Literal["ListToolsAction", "ToolCallAction"]
    tool_name: Optional[str]
    arguments: Optional[Dict]

# Observation
class MCPObservation(Observation):
    success: bool
    error_message: Optional[str]
    tool_result: Optional[Dict]
    reward: Optional[float]
    done: bool
```

### Task Definition Pattern
- **JSON Scenarios**: Version-controlled task definitions
- **SQL Verifiers**: Programmatic graders checking database state
- **3 Verifier Types**: database_state, response_check, tool_execution

### Reward Design
- Sparse binary rewards: +1.0 (success), -0.5 (error)
- ListToolsAction: +0.1 (discovery reward)
- Status code based with metadata for flexibility

### Worth Copying
1. **Generic wrapper architecture** - Copy `openenv_wrapper/` for new MCPs
2. **Session manager pattern** - Multi-tenant database isolation
3. **Verifier-driven tasks** - No code changes for new tasks
4. **Config-driven tool discovery** - Dynamic tool handlers via importlib

---

## 2. Reasoning Gym Environment (reasoning_gym_env)

### Architecture Highlights
- **Minimal footprint**: ~200 lines core logic
- **Single-step episodes**: reset() → step() → done
- **Dataset persistence**: Reuse datasets across resets

### Action/Observation Pattern
```python
# Action
class ReasoningGymAction(Action):
    answer: str  # Agent's answer

# Observation
class ReasoningGymObservation(Observation):
    question: Optional[str]      # Only in reset()
    score: Optional[float]       # Only after step()
    correct_answer: Optional[str]
    done: bool
```

### Task Definition Pattern
- **External library**: `reasoning_gym` handles generation + scoring
- **Simple datasets**: Single task type (leg_counting, reverse_sort, etc.)
- **Composite datasets**: Mix multiple tasks with weights

### Reward Design
- **Binary/partial**: Depends on dataset scoring function
- **Terminal only**: reward=0.0 on reset, actual score after step()
- **Single-step**: No trajectory rewards

### Worth Copying
1. **Iterator pattern** - Seamless dataset cycling with StopIteration handling
2. **Parameter idempotency** - reset() continues, reset(seed=...) restarts
3. **Dataset caching** - Compare config to avoid rebuilding
4. **Minimal state** - Just episode_id and step_count

---

## 3. TB2 Environment (tbench2_env)

### Architecture Highlights
- **Dual execution modes**: Local (CAMEL toolkit) vs Docker (TB2 fidelity)
- **Session management**: Streaming process support via session_id
- **Task auto-discovery**: Download from GitHub + cache locally

### Action/Observation Pattern
```python
# Action
class Tbench2Action(Action):
    action_type: str  # exec, write, view, wait, kill, evaluate, etc.
    command: str
    session_id: Optional[str]
    block: bool = True

# Observation
class Tbench2Observation(Observation):
    instruction: str
    output: str
    success: bool
    error: str
    reward: Optional[float]  # Only on evaluate
    done: bool              # Only on evaluate
```

### Task Definition Pattern
- **TOML-based**: `task.toml` with environment + verifier config
- **Pytest graders**: Each task has tests/ directory
- **External benchmark**: Terminal-Bench 2 suite

### Reward Design
- **Binary**: 1.0 if all pytest tests pass, 0.0 otherwise
- **Terminal only**: reward=None until evaluate action
- **Exit code parsing**: `__TB2_EXIT_CODE__:$?` marker pattern

### Worth Copying
1. **Dual mode pattern** - Local + Docker execution with env var switching
2. **Lazy dependency loading** - Import errors surface only when used
3. **Docker-in-Docker safe** - Tar streaming instead of bind mounts
4. **Session isolation** - Unique working directories per episode_id
5. **Metadata-driven discovery** - Tasks self-describe requirements

---

## 4. CARLA Environment (carla_env)

### Architecture Highlights
- **Scenario system**: BaseScenario ABC with composable tasks
- **Rubric factory**: Auto-select reward function by scenario type
- **Mock mode**: Test without GPU/CARLA
- **GPU-accelerated**: T4 16GB minimum for real mode

### Action/Observation Pattern
```python
# Action
class CarlaAction(Action):
    action_type: str  # observe, control, navigate, capture_image, etc.
    throttle: Optional[float]  # [0, 1] with Pydantic validation
    steer: Optional[float]     # [-1, 1]
    brake: Optional[float]     # [0, 1]

# Observation
class CarlaObservation(Observation):
    scene_description: str
    vehicle_state: Dict  # speed, location, rotation
    collision_detected: bool
    nearby_actors: List[Dict]
    camera_images: Optional[Dict]
    rubric_reward: float
```

### Task Definition Pattern
- **9 Trolley scenarios**: Ethical dilemmas with expected outcomes
- **Navigation tasks**: Maze (goal-directed), Free-roam (open-world)
- **JSON externalized**: Benchmark definitions separate from code

### Reward Design
- **Trajectory-based (Trolley)**: r_t = 0.0 until terminal, then gamma-discounted final
- **Step-level (Navigation)**: Progress + arrival bonus - collision penalty - time cost
- **Scenario-specific**: compute_outcome() owns scoring logic

### Worth Copying
1. **Scenario ABC** - Each task owns physics + scoring independently
2. **Rubric factory** - Auto-select reward function by task type
3. **Dual mode** - Mock for testing, real for evaluation
4. **Layered config** - Common + scenario-specific fields
5. **JSON externalization** - Decouple task data from code

---

## 5. REPL Environment (repl_env)

### Architecture Highlights
- **Layered design**: Environment → Runner → Backend separation
- **Recursive LLM**: Depth-limited child spawning with RLM pattern
- **Composable rubrics**: Outcome + process rewards
- **Thread-safe batching**: Multiple concurrent child queries

### Action/Observation Pattern
```python
# Action
class REPLAction(Action):
    code: str
    is_final: bool = False
    final_answer: Optional[str] = None

# Observation
class REPLObservation(Observation):
    result: CodeBlockResult  # stdout, stderr, locals_snapshot
    available_variables: List[str]
    iteration: int
    done: bool
    reward: float
```

### Task Definition Pattern
- **Rubric-driven**: Ground truth passed at reset()
- **Multiple finalization patterns**: FINAL(), FINAL_VAR(), dict with ready flag
- **External graders**: CustomMetricRubric for user-provided scoring

### Reward Design
- **Composable**: REPLRubric = outcome + process
- **Outcome (terminal)**: ExactMatch, FuzzyMatch, or CustomMetric
- **Process (per-step)**: +success_reward, -error_penalty
- **Failure**: -failure_reward if max_iterations without answer

### Worth Copying
1. **Composable rubrics** - outcome + process separation
2. **Recursive backend** - Protocol-based with depth limits
3. **Message-based loop** - Explicit iteration with timeout checks
4. **Variable snapshots** - Serialize namespace state
5. **Dual API** - Sync + async with same models
6. **Cooperative timeout** - perf_counter() checks, not interrupts
7. **Injected helpers** - llm_query, rlm_query available in namespace

---

## Cross-Cutting Patterns

### 1. Pydantic Models Everywhere
All environments use Pydantic BaseModel for:
- Type safety + validation
- JSON serialization
- OpenAPI schema generation
- Field descriptions for documentation

### 2. FastAPI App Factory
```python
from openenv.core.env_server.http_server import create_app

app = create_app(
    MyEnvironment,
    MyAction,
    MyObservation,
    env_name="my_env",
    max_concurrent_envs=1,
)
```

### 3. Client-Server Separation
- Server: Implements Environment[Action, Observation, State]
- Client: EnvClient[Action, Observation, State] wraps HTTP/WebSocket
- Local variants for in-process testing

### 4. Episode State Management
```python
class State(BaseModel):
    episode_id: str        # UUID per episode
    step_count: int        # Actions taken
    # Environment-specific metrics
```

### 5. Metadata for Flexibility
- Actions have optional `metadata: Dict[str, Any]`
- Observations include `metadata` for extra context
- Enables custom reward signals without model changes

### 6. Docker + openenv.yaml
```yaml
spec_version: 1
name: my_env
type: space
runtime: fastapi
app: server.app:app
port: 8000
```

### 7. Concurrent Sessions Support
```python
class MyEnvironment(Environment):
    SUPPORTS_CONCURRENT_SESSIONS: bool = True
```

---

## Recommendations for Hackathon Project

### Use calendar_env approach if:
- Building database-backed environment (customer support, data cleaning)
- Need multi-agent evaluation isolation
- Want reusable wrapper for other MCPs

### Use reasoning_gym_env approach if:
- Simple single-step tasks (email triage, classification)
- Dataset-based evaluation
- Minimal code complexity desired

### Use tbench2_env approach if:
- Tool use benchmarking (API integration, CLI tools)
- Need Docker isolation
- Session-based interaction required

### Use carla_env approach if:
- Complex simulation with physics
- Scenario-based curriculum learning
- Trajectory-based rewards

### Use repl_env approach if:
- Code execution environment
- Recursive reasoning needed
- Composable reward functions

---

## Quick Start Checklist

For your hackathon environment, ensure:

- [ ] **3+ tasks with graders** returning scores 0.0-1.0
- [ ] **Pydantic models** for Action, Observation, State
- [ ] **openenv.yaml** with correct metadata
- [ ] **inference.py** in root (uses HF Router, not OpenAI)
- [ ] **STDOUT logging** with [START], [STEP], [END] format
- [ ] **Dockerfile** in root directory (not /server)
- [ ] **Meaningful rewards** that distinguish performance levels
- [ ] **Real-world task** with genuine value
- [ ] **< 20 min runtime** on vcpu=2, memory=8GB
- [ ] **Passes `openenv validate`**

---

## Key Files to Reference

### For Implementation Patterns:
- `calendar_env/server/openenv_wrapper/mcp_env_environment.py` - Generic wrapper
- `reasoning_gym_env/server/reasoning_gym_environment.py` - Minimal implementation
- `tbench2_env/server/tbench2_env_environment.py` - Session management
- `carla_env/server/benchmark_scenarios/base.py` - Scenario ABC
- `repl_env/rubrics.py` - Composable reward design

### For Client Usage:
- `*/client.py` - All environments have reference implementations
- `repl_env/runner.py` - Message-based orchestration loop

### For Server Setup:
- `*/server/app.py` - FastAPI app factory usage
- `*/openenv.yaml` - Configuration examples
- `*/Dockerfile` - Docker image patterns

---

## Next Steps

1. **Choose architecture**: Pick closest reference environment to your task
2. **Copy skeleton**: Use `openenv init` or copy from reference
3. **Define models**: Start with Action/Observation Pydantic models
4. **Implement graders**: 3 tasks with programmatic scoring
5. **Test locally**: Use client.py pattern for rapid iteration
6. **Validate**: Run `openenv validate` before deployment
7. **Deploy**: `openenv push` to Hugging Face Spaces