| # OpenEnv Environment Research - Key Learnings | |
| Research conducted on 5 top OpenEnv environments to inform hackathon project development. | |
| ## Executive Summary | |
| | Environment | Domain | Key Strength | Best For Learning | | |
| |-------------|--------|--------------|-------------------| | |
| | **calendar_env** | Calendar Management | Generic MCP wrapper architecture | Multi-tenant systems, database-backed tasks | | |
| | **reasoning_gym_env** | Reasoning Tasks | Minimal, single-step episodes | Simple task structures, dataset integration | | |
| | **tbench2_env** | Terminal/Tool Use | Dual execution modes (local/docker) | Tool benchmarking, session management | | |
| | **carla_env** | Autonomous Driving | Scenario-based design | Complex simulations, ethical dilemmas | | |
| | **repl_env** | Code Execution | Recursive LLM architecture | Interactive environments, reward shaping | | |
| --- | |
| ## 1. Calendar Environment (calendar_env) | |
| ### Architecture Highlights | |
| - **Generic MCP Wrapper**: Fully reusable `openenv_wrapper/` for any MCP server | |
| - **Multi-Tenancy**: SQLite per agent via `x-database-id` header | |
| - **Rich Database Schema**: Google Calendar API v3 compliant models | |
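The multi-tenancy bullet above can be pictured with a small sketch. This is an illustration only, not the code in `calendar_env`: the `SessionManager` class, its method name, and the `events` table are hypothetical, and in-memory databases stand in for per-agent SQLite files. Only the `x-database-id` header name comes from the notes.

```python
import sqlite3
from typing import Dict, Mapping

HEADER_NAME = "x-database-id"  # header name taken from the notes above


class SessionManager:
    """Hands out one SQLite connection per tenant (hypothetical helper)."""

    def __init__(self) -> None:
        self._connections: Dict[str, sqlite3.Connection] = {}

    def connection_for(self, headers: Mapping[str, str]) -> sqlite3.Connection:
        # HTTP header names are case-insensitive, so normalise before lookup.
        normalised = {key.lower(): value for key, value in headers.items()}
        database_id = normalised.get(HEADER_NAME)
        if not database_id:
            raise ValueError(f"missing required header: {HEADER_NAME}")
        if database_id not in self._connections:
            # Assumption: in-memory databases keep the sketch self-contained;
            # a real server would more likely open one file per database_id.
            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT)")
            self._connections[database_id] = conn
        return self._connections[database_id]


manager = SessionManager()
agent_a = manager.connection_for({"x-database-id": "agent-a"})
agent_b = manager.connection_for({"X-Database-Id": "agent-b"})

# A write by agent A is invisible to agent B.
agent_a.execute("INSERT INTO events (title) VALUES ('standup')")
count_a = agent_a.execute("SELECT COUNT(*) FROM events").fetchone()[0]
count_b = agent_b.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Keying every lookup on the header means the tool handlers never need to know which agent they are serving.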
| ### Action/Observation Pattern | |
| ```python | |
| from typing import Dict, Literal, Optional | |
| # Action (Action and Observation are the OpenEnv base classes) | |
| class MCPAction(Action): | |
|     action_type: Literal["ListToolsAction", "ToolCallAction"] | |
|     tool_name: Optional[str] | |
|     arguments: Optional[Dict] | |
| # Observation | |
| class MCPObservation(Observation): | |
|     success: bool | |
|     error_message: Optional[str] | |
|     tool_result: Optional[Dict] | |
|     reward: Optional[float] | |
|     done: bool | |
| ``` | |
| ### Task Definition Pattern | |
| - **JSON Scenarios**: Version-controlled task definitions | |
| - **SQL Verifiers**: Programmatic graders checking database state | |
| - **3 Verifier Types**: database_state, response_check, tool_execution | |
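To make the JSON scenario plus SQL verifier idea concrete, here is a minimal sketch of a `database_state` check. The scenario fields (`task_id`, `query`, `expected`), the `events` table, and the function name are assumptions made for illustration; the actual scenario schema in `calendar_env` may differ.

```python
import json
import sqlite3
from typing import Any, Dict

# Hypothetical scenario file content; only the verifier type name
# "database_state" comes from the notes above.
SCENARIO_JSON = """
{
  "task_id": "create_standup",
  "verifier": {
    "type": "database_state",
    "query": "SELECT COUNT(*) FROM events WHERE title = 'standup'",
    "expected": 1
  }
}
"""


def run_database_state_verifier(
    conn: sqlite3.Connection, verifier: Dict[str, Any]
) -> bool:
    """Run the verifier query and compare its first column to the expected value."""
    row = conn.execute(verifier["query"]).fetchone()
    return row is not None and row[0] == verifier["expected"]


scenario = json.loads(SCENARIO_JSON)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT)")

# Before the agent acts, the task is not yet satisfied.
before = run_database_state_verifier(conn, scenario["verifier"])

# Simulate the agent creating the event through a tool call.
conn.execute("INSERT INTO events (title) VALUES ('standup')")
after = run_database_state_verifier(conn, scenario["verifier"])
```

Because the grader is data (a query and an expected value), a new task only needs a new JSON file.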
| ### Reward Design | |
| - Sparse rewards: +1.0 (success), -0.5 (error) | |
| - ListToolsAction: +0.1 (discovery reward) | |
| - Rewards are derived from status codes, with metadata carrying extra detail for flexibility | |
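The reward values listed above could be combined as in the following sketch. The function name and the precedence rule (a failed `ListToolsAction` still receives the error penalty) are assumptions; the notes only give the three numbers.

```python
SUCCESS_REWARD = 1.0     # successful tool call
ERROR_REWARD = -0.5      # any failed action
DISCOVERY_REWARD = 0.1   # successful ListToolsAction


def compute_reward(action_type: str, success: bool) -> float:
    """Map an action outcome to a scalar reward (hypothetical helper)."""
    if not success:
        # Assumption: errors are penalised regardless of action type.
        return ERROR_REWARD
    if action_type == "ListToolsAction":
        return DISCOVERY_REWARD
    return SUCCESS_REWARD
```

The small discovery reward gives an agent a reason to list tools first without making that step as valuable as completing a call.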
| ### Worth Copying | |
| 1. **Generic wrapper architecture** - Copy `openenv_wrapper/` for new MCPs | |
| 2. **Session manager pattern** - Multi-tenant database isolation | |
| 3. **Verifier-driven tasks** - No code changes for new tasks | |
| 4. **Config-driven tool discovery** - Dynamic tool handlers via importlib | |
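Config-driven discovery via `importlib` can be sketched as below. The `module:attribute` spec format and the `TOOL_CONFIG` mapping are hypothetical, and standard-library functions stand in for real tool handlers so the example runs anywhere.

```python
import importlib
from typing import Any, Callable, Dict

# Hypothetical config: tool name -> "module:attribute" spec.
TOOL_CONFIG: Dict[str, str] = {
    "sqrt": "math:sqrt",
    "to_json": "json:dumps",
}


def load_handler(spec: str) -> Callable[..., Any]:
    """Resolve a 'module:attribute' string to a callable at runtime."""
    module_name, _, attribute = spec.partition(":")
    if not module_name or not attribute:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    handler = getattr(module, attribute)
    if not callable(handler):
        raise TypeError(f"{spec!r} does not resolve to a callable")
    return handler


# Build the dispatch table once; adding a tool means editing config only.
HANDLERS: Dict[str, Callable[..., Any]] = {
    name: load_handler(spec) for name, spec in TOOL_CONFIG.items()
}
```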
| --- | |
| ## 2. Reasoning Gym Environment (reasoning_gym_env) | |
| ### Architecture Highlights | |
| - **Minimal footprint**: ~200 lines core logic | |
| - **Single-step episodes**: reset() → step() → done | |
| - **Dataset persistence**: Reuse datasets across resets | |
| ### Action/Observation Pattern | |
| ```python | |
| from typing import Optional | |
| # Action | |
| class ReasoningGymAction(Action): | |
|     answer: str  # Agent's answer | |
| # Observation | |
| class ReasoningGymObservation(Observation): | |
|     question: Optional[str]  # Only in reset() | |
|     score: Optional[float]  # Only after step() | |
|     correct_answer: Optional[str] | |
|     done: bool | |
| ``` | |
| ### Task Definition Pattern | |
| - **External library**: `reasoning_gym` handles generation + scoring | |
| - **Simple datasets**: Single task type (leg_counting, reverse_sort, etc.) | |
| - **Composite datasets**: Mix multiple tasks with weights | |
| ### Reward Design | |
| - **Binary/partial**: Depends on dataset scoring function | |
| - **Terminal only**: reward=0.0 on reset, actual score after step() | |
| - **Single-step**: No trajectory rewards | |
| ### Worth Copying | |
| 1. **Iterator pattern** - Seamless dataset cycling with StopIteration handling | |
| 2. **Parameter idempotency** - reset() with no arguments continues through the current dataset; reset(seed=...) restarts it | |
| 3. **Dataset caching** - Compare config to avoid rebuilding | |
| 4. **Minimal state** - Just episode_id and step_count | |
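The iterator, reset, and caching patterns above fit together roughly as follows. This is a sketch under stated assumptions, not the `reasoning_gym_env` source: `SingleStepEnv`, `build_dataset`, the `size` config key, and the toy addition task are all hypothetical stand-ins for the `reasoning_gym` library.

```python
import random
from typing import Dict, Iterator, List, Optional, Tuple

Item = Tuple[str, str]  # (question, correct answer)
DEFAULT_CONFIG: Dict[str, int] = {"size": 3}


def build_dataset(config: Dict[str, int], seed: Optional[int]) -> List[Item]:
    """Toy generator standing in for an external dataset library."""
    size = int(config["size"])
    if size <= 0:
        raise ValueError("size must be positive")
    rng = random.Random(seed)
    items: List[Item] = []
    for _ in range(size):
        a, b = rng.randint(0, 9), rng.randint(0, 9)
        items.append((f"{a} + {b} = ?", str(a + b)))
    return items


class SingleStepEnv:
    def __init__(self) -> None:
        self._config: Optional[Dict[str, int]] = None
        self._dataset: List[Item] = []
        self._iterator: Iterator[Item] = iter(())
        self._current: Optional[Item] = None
        self.builds = 0  # counts rebuilds so the caching is observable

    def reset(self, config: Optional[Dict[str, int]] = None,
              seed: Optional[int] = None) -> str:
        config_changed = config is not None and config != self._config
        # Rebuild only for a new seed, a changed config, or the first call.
        if seed is not None or config_changed or not self._dataset:
            self._config = dict(config or self._config or DEFAULT_CONFIG)
            self._dataset = build_dataset(self._config, seed)
            self._iterator = iter(self._dataset)
            self.builds += 1
        try:
            self._current = next(self._iterator)
        except StopIteration:
            # Dataset exhausted: cycle back to the start.
            self._iterator = iter(self._dataset)
            self._current = next(self._iterator)
        return self._current[0]

    def step(self, answer: str) -> Tuple[float, bool]:
        if self._current is None:
            raise RuntimeError("call reset() before step()")
        score = 1.0 if answer.strip() == self._current[1] else 0.0
        self._current = None
        return score, True  # single-step episode: always done


env = SingleStepEnv()
first = env.reset(config={"size": 2}, seed=1)
second = env.reset()           # continues to the next item
wrapped = env.reset()          # iterator exhausted, cycles back
restarted = env.reset(seed=1)  # explicit seed rebuilds and restarts
```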
| --- | |
| ## 3. TB2 Environment (tbench2_env) | |
| ### Architecture Highlights | |
| - **Dual execution modes**: Local (CAMEL toolkit) vs Docker (TB2 fidelity) | |
| - **Session management**: Streaming process support via session_id | |
| - **Task auto-discovery**: Download from GitHub + cache locally | |
| ### Action/Observation Pattern | |
| ```python | |
| from typing import Optional | |
| # Action | |
| class Tbench2Action(Action): | |
|     action_type: str  # exec, write, view, wait, kill, evaluate, etc. | |
|     command: str | |
|     session_id: Optional[str] | |
|     block: bool = True | |
| # Observation | |
| class Tbench2Observation(Observation): | |
|     instruction: str | |
|     output: str | |
|     success: bool | |
|     error: str | |
|     reward: Optional[float]  # Only on evaluate | |
|     done: bool  # Only on evaluate | |
| ``` | |
| ### Task Definition Pattern | |
| - **TOML-based**: `task.toml` with environment + verifier config | |
| - **Pytest graders**: Each task has tests/ directory | |
| - **External benchmark**: Terminal-Bench 2 suite | |
| ### Reward Design | |
| - **Binary**: 1.0 if all pytest tests pass, 0.0 otherwise | |
| - **Terminal only**: reward=None until evaluate action | |
| - **Exit code parsing**: `__TB2_EXIT_CODE__:$?` marker pattern | |
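The exit-code marker pattern can be sketched as follows: append an `echo` of `$?` to the command, then split the marker off the captured output. The helper names and the exact wrapping are assumptions for illustration; only the `__TB2_EXIT_CODE__:$?` marker comes from the notes.

```python
import re
from typing import Optional, Tuple

MARKER = "__TB2_EXIT_CODE__"
_MARKER_AT_END = re.compile(rf"{MARKER}:(-?\d+)\s*$")


def wrap_command(command: str) -> str:
    """Append a marker line so the exit code travels with the output."""
    return f'{command}; echo "{MARKER}:$?"'


def split_output(raw: str) -> Tuple[str, Optional[int]]:
    """Return (output without marker, exit code), or (raw, None) if absent."""
    match = _MARKER_AT_END.search(raw)
    if match is None:
        return raw, None
    return raw[: match.start()], int(match.group(1))


def binary_reward(exit_code: Optional[int]) -> float:
    """1.0 only when the test command exited cleanly, 0.0 otherwise."""
    return 1.0 if exit_code == 0 else 0.0


output, code = split_output("2 passed\n__TB2_EXIT_CODE__:0\n")
```

This approach works even when output is streamed through a channel that does not expose the process exit status directly.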
| ### Worth Copying | |
| 1. **Dual mode pattern** - Local + Docker execution with env var switching | |
| 2. **Lazy dependency loading** - Import errors surface only when used | |
| 3. **Docker-in-Docker safe** - Tar streaming instead of bind mounts | |
| 4. **Session isolation** - Unique working directories per episode_id | |
| 5. **Metadata-driven discovery** - Tasks self-describe requirements | |
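The dual-mode and lazy-loading items above might look like this in outline. The `TB2_MODE` variable name and all function names are hypothetical, and `docker` (the Docker SDK for Python) is used only as an example of an optional dependency; the real environment may switch modes and load backends differently.

```python
import importlib
import os
from types import ModuleType
from typing import Mapping, Optional


def require(module_name: str, hint: str = "") -> ModuleType:
    """Import a module on first use and raise a readable error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        message = f"'{module_name}' is needed for this execution mode"
        if hint:
            message += f" ({hint})"
        raise ImportError(message) from exc


def select_mode(environ: Optional[Mapping[str, str]] = None) -> str:
    """Pick the execution mode from an environment variable."""
    environ = os.environ if environ is None else environ
    mode = environ.get("TB2_MODE", "local").lower()  # hypothetical variable name
    if mode not in {"local", "docker"}:
        raise ValueError(f"unknown mode: {mode!r}")
    return mode


def load_backend(mode: str) -> ModuleType:
    # The optional dependency is imported only when its mode is selected,
    # so local-only users never hit an ImportError at module import time.
    if mode == "docker":
        return require("docker", hint="pip install docker")
    return require("subprocess")
```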
| --- | |
| ## 4. CARLA Environment (carla_env) | |
| ### Architecture Highlights | |
| - **Scenario system**: BaseScenario ABC with composable tasks | |
| - **Rubric factory**: Auto-select reward function by scenario type | |
| - **Mock mode**: Test without GPU/CARLA | |
| - **GPU-accelerated**: T4 16GB minimum for real mode | |
| ### Action/Observation Pattern | |
| ```python | |
| from typing import Dict, List, Optional | |
| # Action | |
| class CarlaAction(Action): | |
|     action_type: str  # observe, control, navigate, capture_image, etc. | |
|     throttle: Optional[float]  # [0, 1] with Pydantic validation | |
|     steer: Optional[float]  # [-1, 1] | |
|     brake: Optional[float]  # [0, 1] | |
| # Observation | |
| class CarlaObservation(Observation): | |
|     scene_description: str | |
|     vehicle_state: Dict  # speed, location, rotation | |
|     collision_detected: bool | |
|     nearby_actors: List[Dict] | |
|     camera_images: Optional[Dict] | |
|     rubric_reward: float | |
| ``` | |
| ### Task Definition Pattern | |
| - **9 Trolley scenarios**: Ethical dilemmas with expected outcomes | |
| - **Navigation tasks**: Maze (goal-directed), Free-roam (open-world) | |
| - **JSON externalized**: Benchmark definitions separate from code | |
| ### Reward Design | |
| - **Trajectory-based (Trolley)**: r_t = 0.0 at every step until the terminal step, which receives the gamma-discounted final outcome score | |
| - **Step-level (Navigation)**: Progress + arrival bonus - collision penalty - time cost | |
| - **Scenario-specific**: compute_outcome() owns scoring logic | |
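The two reward styles and the rubric factory can be sketched together. All names, default weights, and the discounting formula here are assumptions: in particular, "gamma-discounted final" is read as `gamma ** (T - 1) * outcome` on the last of `T` steps, which is one plausible interpretation rather than the confirmed one.

```python
from typing import Callable, Dict, List


def navigation_step_reward(
    prev_distance: float,
    distance: float,
    arrived: bool,
    collided: bool,
    arrival_bonus: float = 10.0,     # hypothetical weight
    collision_penalty: float = 5.0,  # hypothetical weight
    time_cost: float = 0.01,         # hypothetical weight
) -> float:
    """Step-level reward: progress + arrival bonus - collision penalty - time cost."""
    reward = (prev_distance - distance) - time_cost
    if arrived:
        reward += arrival_bonus
    if collided:
        reward -= collision_penalty
    return reward


def trolley_rewards(num_steps: int, outcome_score: float,
                    gamma: float = 0.99) -> List[float]:
    """Zero reward until the terminal step, then a discounted outcome."""
    if num_steps < 1:
        raise ValueError("an episode needs at least one step")
    terminal = (gamma ** (num_steps - 1)) * outcome_score
    return [0.0] * (num_steps - 1) + [terminal]


RUBRICS: Dict[str, Callable[..., object]] = {
    "navigation": navigation_step_reward,
    "trolley": trolley_rewards,
}


def make_rubric(scenario_type: str) -> Callable[..., object]:
    """Factory: pick the reward function from the scenario type."""
    try:
        return RUBRICS[scenario_type]
    except KeyError:
        raise ValueError(f"no rubric for scenario type {scenario_type!r}") from None
```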
| ### Worth Copying | |
| 1. **Scenario ABC** - Each task owns physics + scoring independently | |
| 2. **Rubric factory** - Auto-select reward function by task type | |
| 3. **Dual mode** - Mock for testing, real for evaluation | |
| 4. **Layered config** - Common + scenario-specific fields | |
| 5. **JSON externalization** - Decouple task data from code | |
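A scenario ABC in which each task owns its dynamics and scoring might be shaped like the sketch below. `BaseScenario` and `compute_outcome()` are named in the notes, but the `step_world` method, the state dictionary, and the toy `StraightRoadScenario` are hypothetical and use mock physics instead of CARLA.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseScenario(ABC):
    """Each scenario owns its own dynamics and its own scoring."""

    name: str = "base"

    @abstractmethod
    def step_world(self, state: Dict[str, Any],
                   action: Dict[str, Any]) -> Dict[str, Any]:
        """Advance the world by one tick and return the new state."""

    @abstractmethod
    def compute_outcome(self, state: Dict[str, Any]) -> float:
        """Score the final state."""


class StraightRoadScenario(BaseScenario):
    """Mock scenario: drive along one axis until a goal is reached."""

    name = "straight_road"

    def __init__(self, goal_x: float = 3.0) -> None:
        self.goal_x = goal_x

    def step_world(self, state: Dict[str, Any],
                   action: Dict[str, Any]) -> Dict[str, Any]:
        # Clamp throttle to [0, 1], mirroring the validated action range.
        throttle = min(max(float(action.get("throttle", 0.0)), 0.0), 1.0)
        return {**state, "x": state["x"] + throttle}

    def compute_outcome(self, state: Dict[str, Any]) -> float:
        return 1.0 if state["x"] >= self.goal_x else 0.0


scenario = StraightRoadScenario(goal_x=3.0)
state: Dict[str, Any] = {"x": 0.0}
for _ in range(3):
    state = scenario.step_world(state, {"throttle": 2.0})  # clamped to 1.0
outcome = scenario.compute_outcome(state)
```

A mock scenario like this is also what makes GPU-free testing possible: the environment loop only depends on the abstract interface.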
| --- | |
| ## 5. REPL Environment (repl_env) | |
| ### Architecture Highlights | |
| - **Layered design**: Environment → Runner → Backend separation | |
| - **Recursive LLM**: Depth-limited child spawning with RLM pattern | |
| - **Composable rubrics**: Outcome + process rewards | |
| - **Thread-safe batching**: Multiple concurrent child queries | |
| ### Action/Observation Pattern | |
| ```python | |
| from typing import List, Optional | |
| # Action | |
| class REPLAction(Action): | |
|     code: str | |
|     is_final: bool = False | |
|     final_answer: Optional[str] = None | |
| # Observation | |
| class REPLObservation(Observation): | |
|     result: CodeBlockResult  # stdout, stderr, locals_snapshot | |
|     available_variables: List[str] | |
|     iteration: int | |
|     done: bool | |
|     reward: float | |
| ``` | |
| ### Task Definition Pattern | |
| - **Rubric-driven**: Ground truth passed at reset() | |
| - **Multiple finalization patterns**: FINAL(), FINAL_VAR(), dict with ready flag | |
| - **External graders**: CustomMetricRubric for user-provided scoring | |
| ### Reward Design | |
| - **Composable**: REPLRubric = outcome + process | |
| - **Outcome (terminal)**: ExactMatch, FuzzyMatch, or CustomMetric | |
| - **Process (per-step)**: +success_reward, -error_penalty | |
| - **Failure**: -failure_reward if max_iterations is reached without a final answer | |
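The outcome plus process split can be sketched with a small dataclass. The parameter names `success_reward`, `error_penalty`, and `failure_reward` follow the notes, but the class name, default values, and method names are assumptions, and a plain dataclass stands in for whatever base class `repl_env` actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def exact_match(answer: str, expected: str) -> float:
    """Outcome scorer: 1.0 on an exact (whitespace-trimmed) match."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


@dataclass
class ComposedRubric:
    expected: str
    outcome: Callable[[str, str], float] = exact_match
    success_reward: float = 0.1   # hypothetical default
    error_penalty: float = 0.1    # hypothetical default
    failure_reward: float = 1.0   # hypothetical default

    def process_reward(self, had_error: bool) -> float:
        """Per-step signal: small bonus for clean execution, penalty on error."""
        return -self.error_penalty if had_error else self.success_reward

    def final_reward(self, final_answer: Optional[str]) -> float:
        """Terminal signal: outcome score, or a penalty if no answer was given."""
        if final_answer is None:
            return -self.failure_reward
        return self.outcome(final_answer, self.expected)


rubric = ComposedRubric(expected="42")
```

Swapping `outcome` for a fuzzy or custom scorer changes the terminal signal without touching the per-step shaping.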
| ### Worth Copying | |
| 1. **Composable rubrics** - outcome + process separation | |
| 2. **Recursive backend** - Protocol-based with depth limits | |
| 3. **Message-based loop** - Explicit iteration with timeout checks | |
| 4. **Variable snapshots** - Serialize namespace state | |
| 5. **Dual API** - Sync + async with same models | |
| 6. **Cooperative timeout** - perf_counter() checks, not interrupts | |
| 7. **Injected helpers** - llm_query, rlm_query available in namespace | |
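Variable snapshots and the cooperative timeout can be combined in one small loop. This is a sketch under stated assumptions: the function names and the observation dictionary layout are hypothetical, `exec` runs unsandboxed here, and the real runner is message-based rather than fed a fixed list of code blocks.

```python
import contextlib
import io
import json
from time import perf_counter
from typing import Any, Dict, List, Tuple


def snapshot(namespace: Dict[str, Any]) -> Dict[str, Any]:
    """Serialise the namespace: JSON-safe values as-is, everything else as repr."""
    result: Dict[str, Any] = {}
    for name, value in namespace.items():
        if name.startswith("_"):
            continue  # skip __builtins__ and private names
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            result[name] = repr(value)
        else:
            result[name] = value
    return result


def run_block(code: str, namespace: Dict[str, Any]) -> Tuple[str, str]:
    """Execute one code block, returning (stdout, error text)."""
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)  # no sandboxing in this sketch
    except Exception as exc:
        return stdout.getvalue(), f"{type(exc).__name__}: {exc}"
    return stdout.getvalue(), ""


def run_episode(blocks: List[str], timeout_s: float,
                max_iterations: int = 10) -> List[Dict[str, Any]]:
    namespace: Dict[str, Any] = {}
    start = perf_counter()
    observations: List[Dict[str, Any]] = []
    for iteration, code in enumerate(blocks[:max_iterations], start=1):
        # Cooperative timeout: checked between blocks, never interrupts one.
        if perf_counter() - start >= timeout_s:
            break
        out, err = run_block(code, namespace)
        observations.append({
            "iteration": iteration,
            "stdout": out,
            "stderr": err,
            "locals_snapshot": snapshot(namespace),
        })
    return observations


observations = run_episode(["x = 2", "print(x * 21)", "1 / 0"], timeout_s=60.0)
```

The trade-off of a cooperative check is that a single long-running block can overrun the deadline; it is only noticed before the next block starts.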
| --- | |
| ## Cross-Cutting Patterns | |
| ### 1. Pydantic Models Everywhere | |
| All environments use Pydantic BaseModel for: | |
| - Type safety + validation | |
| - JSON serialization | |
| - OpenAPI schema generation | |
| - Field descriptions for documentation | |
| ### 2. FastAPI App Factory | |
| ```python | |
| from openenv.core.env_server.http_server import create_app | |
| app = create_app( | |
|     MyEnvironment, | |
|     MyAction, | |
|     MyObservation, | |
|     env_name="my_env", | |
|     max_concurrent_envs=1, | |
| ) | |
| ``` | |
| ### 3. Client-Server Separation | |
| - Server: Implements Environment[Action, Observation, State] | |
| - Client: EnvClient[Action, Observation, State] wraps HTTP/WebSocket | |
| - Local variants for in-process testing | |
| ### 4. Episode State Management | |
| ```python | |
| from pydantic import BaseModel | |
| class State(BaseModel): | |
|     episode_id: str  # UUID per episode | |
|     step_count: int  # Actions taken | |
|     # Add environment-specific metrics here | |
| ``` | |
| ### 5. Metadata for Flexibility | |
| - Actions have optional `metadata: Dict[str, Any]` | |
| - Observations include `metadata` for extra context | |
| - Enables custom reward signals without model changes | |
| ### 6. Docker + openenv.yaml | |
| ```yaml | |
| spec_version: 1 | |
| name: my_env | |
| type: space | |
| runtime: fastapi | |
| app: server.app:app | |
| port: 8000 | |
| ``` | |
| ### 7. Concurrent Sessions Support | |
| ```python | |
| class MyEnvironment(Environment): | |
|     SUPPORTS_CONCURRENT_SESSIONS: bool = True | |
| ``` | |
| --- | |
| ## Recommendations for Hackathon Project | |
| ### Use calendar_env approach if: | |
| - Building database-backed environment (customer support, data cleaning) | |
| - Need multi-agent evaluation isolation | |
| - Want reusable wrapper for other MCPs | |
| ### Use reasoning_gym_env approach if: | |
| - Simple single-step tasks (email triage, classification) | |
| - Dataset-based evaluation | |
| - Minimal code complexity desired | |
| ### Use tbench2_env approach if: | |
| - Tool use benchmarking (API integration, CLI tools) | |
| - Need Docker isolation | |
| - Session-based interaction required | |
| ### Use carla_env approach if: | |
| - Complex simulation with physics | |
| - Scenario-based curriculum learning | |
| - Trajectory-based rewards | |
| ### Use repl_env approach if: | |
| - Code execution environment | |
| - Recursive reasoning needed | |
| - Composable reward functions | |
| --- | |
| ## Quick Start Checklist | |
| For your hackathon environment, ensure: | |
| - [ ] **3+ tasks with graders** returning scores 0.0-1.0 | |
| - [ ] **Pydantic models** for Action, Observation, State | |
| - [ ] **openenv.yaml** with correct metadata | |
| - [ ] **inference.py** in root (uses HF Router, not OpenAI) | |
| - [ ] **STDOUT logging** with [START], [STEP], [END] format | |
| - [ ] **Dockerfile** in root directory (not /server) | |
| - [ ] **Meaningful rewards** that distinguish performance levels | |
| - [ ] **Real-world task** with genuine value | |
| - [ ] **< 20 min runtime** on vcpu=2, memory=8GB | |
| - [ ] **Passes `openenv validate`** | |
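For the first checklist item, a grader registry that always returns scores in the 0.0-1.0 range could be sketched as follows. The three task names and scoring rules are hypothetical examples chosen to show exact, partial-credit, and numeric grading; they are not prescribed by the hackathon.

```python
from typing import Callable, Dict


def clamp01(score: float) -> float:
    """Force any raw score into the required [0.0, 1.0] range."""
    return max(0.0, min(1.0, float(score)))


def grade_exact(prediction: str, target: str) -> float:
    return 1.0 if prediction.strip().lower() == target.strip().lower() else 0.0


def grade_token_overlap(prediction: str, target: str) -> float:
    """Partial credit: fraction of target tokens present in the prediction."""
    predicted = set(prediction.lower().split())
    expected = set(target.lower().split())
    if not expected:
        return 0.0
    return len(predicted & expected) / len(expected)


def grade_numeric(prediction: str, target: str) -> float:
    """Partial credit that decays with relative error; may go negative before clamping."""
    try:
        error = abs(float(prediction) - float(target))
    except ValueError:
        return 0.0
    return 1.0 - error / (abs(float(target)) + 1.0)


# Hypothetical task names mapped to their graders.
GRADERS: Dict[str, Callable[[str, str], float]] = {
    "label": grade_exact,
    "summary": grade_token_overlap,
    "estimate": grade_numeric,
}


def score(task: str, prediction: str, target: str) -> float:
    return clamp01(GRADERS[task](prediction, target))
```

Graders with partial credit, such as the overlap and numeric ones, are what let rewards distinguish performance levels instead of collapsing to pass or fail.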
| --- | |
| ## Key Files to Reference | |
| ### For Implementation Patterns: | |
| - `calendar_env/server/openenv_wrapper/mcp_env_environment.py` - Generic wrapper | |
| - `reasoning_gym_env/server/reasoning_gym_environment.py` - Minimal implementation | |
| - `tbench2_env/server/tbench2_env_environment.py` - Session management | |
| - `carla_env/server/benchmark_scenarios/base.py` - Scenario ABC | |
| - `repl_env/rubrics.py` - Composable reward design | |
| ### For Client Usage: | |
| - `*/client.py` - All environments have reference implementations | |
| - `repl_env/runner.py` - Message-based orchestration loop | |
| ### For Server Setup: | |
| - `*/server/app.py` - FastAPI app factory usage | |
| - `*/openenv.yaml` - Configuration examples | |
| - `*/Dockerfile` - Docker image patterns | |
| --- | |
| ## Next Steps | |
| 1. **Choose architecture**: Pick closest reference environment to your task | |
| 2. **Copy skeleton**: Use `openenv init` or copy from reference | |
| 3. **Define models**: Start with Action/Observation Pydantic models | |
| 4. **Implement graders**: At least 3 tasks with programmatic scoring | |
| 5. **Test locally**: Use client.py pattern for rapid iteration | |
| 6. **Validate**: Run `openenv validate` before deployment | |
| 7. **Deploy**: `openenv push` to Hugging Face Spaces | |