OpenEnv Environment Research - Key Learnings
Research on five top OpenEnv environments, conducted to inform hackathon project development.
Executive Summary
| Environment | Domain | Key Strength | Best For Learning |
|---|---|---|---|
| calendar_env | Calendar Management | Generic MCP wrapper architecture | Multi-tenant systems, database-backed tasks |
| reasoning_gym_env | Reasoning Tasks | Minimal, single-step episodes | Simple task structures, dataset integration |
| tbench2_env | Terminal/Tool Use | Dual execution modes (local/docker) | Tool benchmarking, session management |
| carla_env | Autonomous Driving | Scenario-based design | Complex simulations, ethical dilemmas |
| repl_env | Code Execution | Recursive LLM architecture | Interactive environments, reward shaping |
1. Calendar Environment (calendar_env)
Architecture Highlights
- Generic MCP Wrapper: Fully reusable openenv_wrapper/ for any MCP server
- Multi-Tenancy: SQLite per agent via x-database-id header
- Rich Database Schema: Google Calendar API v3 compliant models
Action/Observation Pattern
# Action
class MCPAction(Action):
    action_type: Literal["ListToolsAction", "ToolCallAction"]
    tool_name: Optional[str]
    arguments: Optional[Dict]

# Observation
class MCPObservation(Observation):
    success: bool
    error_message: Optional[str]
    tool_result: Optional[Dict]
    reward: Optional[float]
    done: bool
Task Definition Pattern
- JSON Scenarios: Version-controlled task definitions
- SQL Verifiers: Programmatic graders checking database state
- 3 Verifier Types: database_state, response_check, tool_execution
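A database_state verifier can be pictured as a SQL query plus an expected value, run against the agent's SQLite database after the episode. The sketch below is only an illustration: the scenario layout, the `query` and `expected` field names, and the `events` table are assumptions, not the actual calendar_env schema.

```python
import sqlite3

# Hypothetical scenario entry; the real JSON schema in calendar_env may differ.
scenario = {
    "verifier": {
        "type": "database_state",
        "query": "SELECT COUNT(*) FROM events WHERE summary = 'Team sync'",
        "expected": 1,
    }
}


def run_database_state_verifier(conn: sqlite3.Connection, verifier: dict) -> bool:
    """Run the verifier's SQL and compare the first column of the first row."""
    row = conn.execute(verifier["query"]).fetchone()
    return row is not None and row[0] == verifier["expected"]


# Stand-in for the per-agent database after the agent has acted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, summary TEXT)")
conn.execute("INSERT INTO events (summary) VALUES ('Team sync')")

passed = run_database_state_verifier(conn, scenario["verifier"])
```

Because the check lives in data rather than code, adding a task under this scheme would mean adding a JSON entry, not a new grader function.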
Reward Design
- Sparse binary rewards: +1.0 (success), -0.5 (error)
- ListToolsAction: +0.1 (discovery reward)
- Status code based with metadata for flexibility
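The reward values above can be expressed as a small mapping. This is a minimal sketch that assumes "success" means the tool call returned without error; the function name and signature are hypothetical, and the real environment derives the outcome from status codes.

```python
def compute_reward(action_type: str, success: bool) -> float:
    """Map an action outcome to the reward values listed above."""
    if not success:
        return -0.5  # any failed action
    if action_type == "ListToolsAction":
        return 0.1  # small discovery reward
    return 1.0  # successful tool call
```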
Worth Copying
- Generic wrapper architecture - Copy openenv_wrapper/ for new MCPs
- Session manager pattern - Multi-tenant database isolation
- Verifier-driven tasks - No code changes for new tasks
- Config-driven tool discovery - Dynamic tool handlers via importlib
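Config-driven tool discovery via importlib might look roughly like the following. The `TOOL_CONFIG` mapping and the `module:function` spec format are assumptions made for illustration, and standard-library functions stand in for real MCP tool handlers so the sketch runs on its own.

```python
import importlib
from typing import Any, Callable, Dict

# Hypothetical config: tool name -> "module:function". Standard-library targets
# are used here only so the sketch runs without the real MCP handlers.
TOOL_CONFIG: Dict[str, str] = {
    "to_json": "json:dumps",
    "sqrt": "math:sqrt",
}


def load_handler(spec: str) -> Callable[..., Any]:
    """Resolve a 'module:function' string to a callable at call time."""
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def call_tool(tool_name: str, *args: Any, **kwargs: Any) -> Any:
    """Look up the handler in the config and invoke it."""
    if tool_name not in TOOL_CONFIG:
        raise KeyError(f"Unknown tool: {tool_name}")
    return load_handler(TOOL_CONFIG[tool_name])(*args, **kwargs)
```

With this shape, registering a new tool is a one-line config change rather than an edit to the dispatcher.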
2. Reasoning Gym Environment (reasoning_gym_env)
Architecture Highlights
- Minimal footprint: ~200 lines core logic
- Single-step episodes: reset() → step() → done
- Dataset persistence: Reuse datasets across resets
Action/Observation Pattern
# Action
class ReasoningGymAction(Action):
    answer: str  # Agent's answer

# Observation
class ReasoningGymObservation(Observation):
    question: Optional[str]  # Only in reset()
    score: Optional[float]  # Only after step()
    correct_answer: Optional[str]
    done: bool
Task Definition Pattern
- External library: reasoning_gym handles generation + scoring
- Simple datasets: Single task type (leg_counting, reverse_sort, etc.)
- Composite datasets: Mix multiple tasks with weights
Reward Design
- Binary/partial: Depends on dataset scoring function
- Terminal only: reward=0.0 on reset, actual score after step()
- Single-step: No trajectory rewards
Worth Copying
- Iterator pattern - Seamless dataset cycling with StopIteration handling
- Parameter idempotency - reset() continues, reset(seed=...) restarts
- Dataset caching - Compare config to avoid rebuilding
- Minimal state - Just episode_id and step_count
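The iterator and reset semantics listed above can be sketched without the reasoning_gym library. Everything here is illustrative: `DatasetCycler` and the tiny `QUESTIONS` list are hypothetical stand-ins, and the real environment may handle exhaustion differently.

```python
import random
from typing import Iterator, List, Optional, Tuple

# Toy dataset of (question, answer) pairs standing in for a generated dataset.
QUESTIONS: List[Tuple[str, str]] = [("2 + 2", "4"), ("3 * 3", "9"), ("10 - 7", "3")]


class DatasetCycler:
    """Hands out one question per reset(), restarting the iterator when exhausted."""

    def __init__(self, seed: int = 0) -> None:
        self._seed = seed
        self._iterator = self._build(seed)

    def _build(self, seed: int) -> Iterator[Tuple[str, str]]:
        items = list(QUESTIONS)
        random.Random(seed).shuffle(items)  # deterministic order per seed
        return iter(items)

    def reset(self, seed: Optional[int] = None) -> Tuple[str, str]:
        if seed is not None:
            # Explicit seed: rebuild and start from the beginning.
            self._seed = seed
            self._iterator = self._build(seed)
        try:
            return next(self._iterator)
        except StopIteration:
            # Dataset exhausted: cycle back to the start with the same seed.
            self._iterator = self._build(self._seed)
            return next(self._iterator)
```

Calling `reset()` with no arguments advances through the dataset, while `reset(seed=...)` restarts it, which matches the "continues vs restarts" behaviour described in the list.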
3. TB2 Environment (tbench2_env)
Architecture Highlights
- Dual execution modes: Local (CAMEL toolkit) vs Docker (TB2 fidelity)
- Session management: Streaming process support via session_id
- Task auto-discovery: Download from GitHub + cache locally
Action/Observation Pattern
# Action
class Tbench2Action(Action):
    action_type: str  # exec, write, view, wait, kill, evaluate, etc.
    command: str
    session_id: Optional[str]
    block: bool = True

# Observation
class Tbench2Observation(Observation):
    instruction: str
    output: str
    success: bool
    error: str
    reward: Optional[float]  # Only on evaluate
    done: bool  # Only on evaluate
Task Definition Pattern
- TOML-based: task.toml with environment + verifier config
- Pytest graders: Each task has tests/ directory
- External benchmark: Terminal-Bench 2 suite
Reward Design
- Binary: 1.0 if all pytest tests pass, 0.0 otherwise
- Terminal only: reward=None until evaluate action
- Exit code parsing: __TB2_EXIT_CODE__:$? marker pattern
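The exit-code marker pattern can be illustrated as follows: the shell command is wrapped so that it echoes its own exit status after a known marker, and the caller strips that marker from the captured output. The helper names and the exact wrapping are assumptions; only the marker string comes from the notes above.

```python
import re
from typing import Optional, Tuple

MARKER = "__TB2_EXIT_CODE__"
_MARKER_RE = re.compile(rf"{MARKER}:(-?\d+)\s*$")


def wrap_command(command: str) -> str:
    """Append an echo so the shell prints the command's exit status last."""
    return f"{command}; echo {MARKER}:$?"


def split_output(raw: str) -> Tuple[str, Optional[int]]:
    """Separate the command output from the trailing exit-code marker."""
    match = _MARKER_RE.search(raw)
    if match is None:
        return raw, None  # marker missing, e.g. the process was killed
    return raw[: match.start()].rstrip("\n"), int(match.group(1))
```

This approach is useful when output arrives as a single stream (for example from a container exec call) and the exit status is not otherwise available to the caller.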
Worth Copying
- Dual mode pattern - Local + Docker execution with env var switching
- Lazy dependency loading - Import errors surface only when used
- Docker-in-Docker safe - Tar streaming instead of bind mounts
- Session isolation - Unique working directories per episode_id
- Metadata-driven discovery - Tasks self-describe requirements
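Two of the patterns above, env var mode switching and lazy dependency loading, combine naturally. In this sketch the `EXAMPLE_EXEC_MODE` variable name and the helper functions are hypothetical; tbench2_env uses its own variable and backend classes.

```python
import importlib
import os
from typing import Any


def get_execution_mode() -> str:
    """Read the mode from a hypothetical env var, defaulting to local."""
    mode = os.environ.get("EXAMPLE_EXEC_MODE", "local").lower()
    if mode not in {"local", "docker"}:
        raise ValueError(f"Unsupported mode: {mode}")
    return mode


def lazy_import(module_name: str) -> Any:
    """Import on first use so a missing optional dependency fails late."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"'{module_name}' is required for this mode; install it first"
        ) from exc


def make_backend() -> str:
    """Select a backend, importing Docker support only if it is requested."""
    mode = get_execution_mode()
    if mode == "docker":
        lazy_import("docker")  # only needed when Docker mode is selected
    return mode
```

Users who only run local mode never pay for, or trip over, the Docker dependency.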
4. CARLA Environment (carla_env)
Architecture Highlights
- Scenario system: BaseScenario ABC with composable tasks
- Rubric factory: Auto-select reward function by scenario type
- Mock mode: Test without GPU/CARLA
- GPU-accelerated: T4 16GB minimum for real mode
Action/Observation Pattern
# Action
class CarlaAction(Action):
    action_type: str  # observe, control, navigate, capture_image, etc.
    throttle: Optional[float]  # [0, 1] with Pydantic validation
    steer: Optional[float]  # [-1, 1]
    brake: Optional[float]  # [0, 1]

# Observation
class CarlaObservation(Observation):
    scene_description: str
    vehicle_state: Dict  # speed, location, rotation
    collision_detected: bool
    nearby_actors: List[Dict]
    camera_images: Optional[Dict]
    rubric_reward: float
Task Definition Pattern
- 9 Trolley scenarios: Ethical dilemmas with expected outcomes
- Navigation tasks: Maze (goal-directed), Free-roam (open-world)
- JSON externalized: Benchmark definitions separate from code
Reward Design
- Trajectory-based (Trolley): r_t = 0.0 until the terminal step, then a gamma-discounted final reward
- Step-level (Navigation): Progress + arrival bonus - collision penalty - time cost
- Scenario-specific: compute_outcome() owns scoring logic
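The two reward shapes can be written out as plain functions. This is one possible reading of the notes, not the carla_env implementation: the coefficient defaults are made up, and the choice to discount the terminal reward by gamma raised to the number of preceding steps is an assumption.

```python
from typing import List


def navigation_step_reward(
    prev_distance: float,
    curr_distance: float,
    arrived: bool,
    collided: bool,
    arrival_bonus: float = 10.0,      # assumed value
    collision_penalty: float = 5.0,   # assumed value
    time_cost: float = 0.01,          # assumed value
) -> float:
    """Progress + arrival bonus - collision penalty - time cost."""
    reward = prev_distance - curr_distance  # positive when moving toward the goal
    if arrived:
        reward += arrival_bonus
    if collided:
        reward -= collision_penalty
    return reward - time_cost


def trolley_rewards(final_outcome: float, num_steps: int, gamma: float = 0.99) -> List[float]:
    """Zero reward on every step except the last, which carries the
    outcome discounted by gamma ** (num_steps - 1)."""
    if num_steps < 1:
        return []
    rewards = [0.0] * num_steps
    rewards[-1] = (gamma ** (num_steps - 1)) * final_outcome
    return rewards
```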
Worth Copying
- Scenario ABC - Each task owns physics + scoring independently
- Rubric factory - Auto-select reward function by task type
- Dual mode - Mock for testing, real for evaluation
- Layered config - Common + scenario-specific fields
- JSON externalization - Decouple task data from code
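A scenario ABC paired with a rubric factory might be structured like this. Only `BaseScenario` and `compute_outcome()` are named in the notes; the concrete scenario classes, the `scenario_type` attribute, the state keys, and the rubric functions are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict


class BaseScenario(ABC):
    """Each scenario owns its own outcome scoring."""

    scenario_type: str = "base"

    @abstractmethod
    def compute_outcome(self, state: Dict[str, Any]) -> float:
        ...


class MazeScenario(BaseScenario):
    scenario_type = "navigation"

    def compute_outcome(self, state: Dict[str, Any]) -> float:
        return 1.0 if state["distance_to_goal"] < 1.0 else 0.0


class TrolleyScenario(BaseScenario):
    scenario_type = "trolley"

    def compute_outcome(self, state: Dict[str, Any]) -> float:
        return 1.0 if state["chosen_outcome"] == state["expected_outcome"] else 0.0


Rubric = Callable[[BaseScenario, Dict[str, Any]], float]


def terminal_rubric(scenario: BaseScenario, state: Dict[str, Any]) -> float:
    """Score only the final outcome."""
    return scenario.compute_outcome(state)


def step_rubric(scenario: BaseScenario, state: Dict[str, Any]) -> float:
    """Outcome minus a small per-step time cost (assumed coefficient)."""
    return scenario.compute_outcome(state) - 0.01 * state.get("steps", 0)


RUBRICS: Dict[str, Rubric] = {"trolley": terminal_rubric, "navigation": step_rubric}


def make_rubric(scenario: BaseScenario) -> Rubric:
    """Factory: pick the reward function from the scenario's declared type."""
    try:
        return RUBRICS[scenario.scenario_type]
    except KeyError:
        raise ValueError(f"No rubric for type {scenario.scenario_type!r}") from None
```

The environment loop then needs no scenario-specific branches: it asks the factory for a rubric once and calls it with the current state.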
5. REPL Environment (repl_env)
Architecture Highlights
- Layered design: Environment → Runner → Backend separation
- Recursive LLM: Depth-limited child spawning with RLM pattern
- Composable rubrics: Outcome + process rewards
- Thread-safe batching: Multiple concurrent child queries
Action/Observation Pattern
# Action
class REPLAction(Action):
    code: str
    is_final: bool = False
    final_answer: Optional[str] = None

# Observation
class REPLObservation(Observation):
    result: CodeBlockResult  # stdout, stderr, locals_snapshot
    available_variables: List[str]
    iteration: int
    done: bool
    reward: float
Task Definition Pattern
- Rubric-driven: Ground truth passed at reset()
- Multiple finalization patterns: FINAL(), FINAL_VAR(), dict with ready flag
- External graders: CustomMetricRubric for user-provided scoring
Reward Design
- Composable: REPLRubric = outcome + process
- Outcome (terminal): ExactMatch, FuzzyMatch, or CustomMetric
- Process (per-step): +success_reward, -error_penalty
- Failure: -failure_reward if max_iterations is reached without an answer
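The outcome + process composition can be sketched with three small dataclasses. The class names, default values, and the `step_reward` signature below are illustrative assumptions and do not mirror the actual REPLRubric API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExactMatchOutcome:
    """Terminal score: 1.0 if the final answer matches exactly, else 0.0."""
    expected: str

    def score(self, final_answer: Optional[str]) -> float:
        if final_answer is not None and final_answer.strip() == self.expected:
            return 1.0
        return 0.0


@dataclass
class ProcessReward:
    """Per-step score: small bonus for clean execution, penalty for errors."""
    success_reward: float = 0.1  # assumed value
    error_penalty: float = 0.2   # assumed value

    def score(self, had_error: bool) -> float:
        return -self.error_penalty if had_error else self.success_reward


@dataclass
class ComposedRubric:
    outcome: ExactMatchOutcome
    process: ProcessReward
    failure_reward: float = 1.0  # assumed value
    max_iterations: int = 5

    def step_reward(
        self, iteration: int, had_error: bool, is_final: bool, final_answer: Optional[str]
    ) -> float:
        reward = self.process.score(had_error)
        if is_final:
            reward += self.outcome.score(final_answer)
        elif iteration >= self.max_iterations:
            reward -= self.failure_reward  # ran out of iterations without answering
        return reward
```

Keeping the two components separate makes it straightforward to swap the outcome scorer (exact, fuzzy, custom) without touching the per-step shaping.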
Worth Copying
- Composable rubrics - outcome + process separation
- Recursive backend - Protocol-based with depth limits
- Message-based loop - Explicit iteration with timeout checks
- Variable snapshots - Serialize namespace state
- Dual API - Sync + async with same models
- Cooperative timeout - perf_counter() checks, not interrupts
- Injected helpers - llm_query, rlm_query available in namespace
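A cooperative timeout checks the clock between iterations instead of interrupting running code. The following is a minimal sketch with hypothetical names (`run_loop`, `LoopTimeout`); the actual runner in repl_env is message-based and more involved.

```python
import time
from typing import Callable, List


class LoopTimeout(Exception):
    """Raised when the deadline passes between iterations."""


def run_loop(step: Callable[[int], str], max_iterations: int, timeout_s: float) -> List[str]:
    """Call step(i) up to max_iterations times, checking the deadline first."""
    deadline = time.perf_counter() + timeout_s
    results: List[str] = []
    for i in range(max_iterations):
        # Cooperative check: a step already in progress is never interrupted.
        if time.perf_counter() >= deadline:
            raise LoopTimeout(f"timed out before iteration {i}")
        results.append(step(i))
    return results
```

The trade-off is that a single long-running step can overshoot the deadline; in exchange, no signal handlers or thread kills are needed, which keeps the loop safe to run inside worker threads.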
Cross-Cutting Patterns
1. Pydantic Models Everywhere
All environments use Pydantic BaseModel for:
- Type safety + validation
- JSON serialization
- OpenAPI schema generation
- Field descriptions for documentation
2. FastAPI App Factory
from openenv.core.env_server.http_server import create_app

app = create_app(
    MyEnvironment,
    MyAction,
    MyObservation,
    env_name="my_env",
    max_concurrent_envs=1,
)
3. Client-Server Separation
- Server: Implements Environment[Action, Observation, State]
- Client: EnvClient[Action, Observation, State] wraps HTTP/WebSocket
- Local variants for in-process testing
4. Episode State Management
class State(BaseModel):
    episode_id: str  # UUID per episode
    step_count: int  # Actions taken
    # Environment-specific metrics
5. Metadata for Flexibility
- Actions have optional metadata: Dict[str, Any]
- Observations include metadata for extra context
- Enables custom reward signals without model changes
6. Docker + openenv.yaml
spec_version: 1
name: my_env
type: space
runtime: fastapi
app: server.app:app
port: 8000
7. Concurrent Sessions Support
class MyEnvironment(Environment):
    SUPPORTS_CONCURRENT_SESSIONS: bool = True
Recommendations for Hackathon Project
Use calendar_env approach if:
- Building database-backed environment (customer support, data cleaning)
- Need multi-agent evaluation isolation
- Want reusable wrapper for other MCPs
Use reasoning_gym_env approach if:
- Simple single-step tasks (email triage, classification)
- Dataset-based evaluation
- Minimal code complexity desired
Use tbench2_env approach if:
- Tool use benchmarking (API integration, CLI tools)
- Need Docker isolation
- Session-based interaction required
Use carla_env approach if:
- Complex simulation with physics
- Scenario-based curriculum learning
- Trajectory-based rewards
Use repl_env approach if:
- Code execution environment
- Recursive reasoning needed
- Composable reward functions
Quick Start Checklist
For your hackathon environment, ensure:
- 3+ tasks with graders returning scores 0.0-1.0
- Pydantic models for Action, Observation, State
- openenv.yaml with correct metadata
- inference.py in root (uses HF Router, not OpenAI)
- STDOUT logging with [START], [STEP], [END] format
- Dockerfile in root directory (not /server)
- Meaningful rewards that distinguish performance levels
- Real-world task with genuine value
- < 20 min runtime on vcpu=2, memory=8GB
- Passes openenv validate
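For the grader items in the checklist, a partial-credit scorer is one simple way to return values in 0.0-1.0 that distinguish performance levels. The `grade` function and its field-matching scheme are hypothetical; any scoring rule that stays within the range would satisfy the checklist.

```python
from typing import Dict


def grade(expected: Dict[str, str], predicted: Dict[str, str]) -> float:
    """Fraction of expected fields the agent got right, clamped to [0.0, 1.0]."""
    if not expected:
        return 0.0
    correct = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return max(0.0, min(1.0, correct / len(expected)))
```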
Key Files to Reference
For Implementation Patterns:
- calendar_env/server/openenv_wrapper/mcp_env_environment.py - Generic wrapper
- reasoning_gym_env/server/reasoning_gym_environment.py - Minimal implementation
- tbench2_env/server/tbench2_env_environment.py - Session management
- carla_env/server/benchmark_scenarios/base.py - Scenario ABC
- repl_env/rubrics.py - Composable reward design
For Client Usage:
- */client.py - All environments have reference implementations
- repl_env/runner.py - Message-based orchestration loop
For Server Setup:
- */server/app.py - FastAPI app factory usage
- */openenv.yaml - Configuration examples
- */Dockerfile - Docker image patterns
Next Steps
- Choose architecture: Pick closest reference environment to your task
- Copy skeleton: Use openenv init or copy from reference
- Define models: Start with Action/Observation Pydantic models
- Implement graders: 3 tasks with programmatic scoring
- Test locally: Use client.py pattern for rapid iteration
- Validate: Run openenv validate before deployment
- Deploy: openenv push to Hugging Face Spaces