OpenEnv Environment Research - Key Learnings

Notes from a review of five prominent OpenEnv environments, compiled to inform hackathon project development.

Executive Summary

Environment	Domain	Key Strength	Best For Learning
calendar_env	Calendar Management	Generic MCP wrapper architecture	Multi-tenant systems, database-backed tasks
reasoning_gym_env	Reasoning Tasks	Minimal, single-step episodes	Simple task structures, dataset integration
tbench2_env	Terminal/Tool Use	Dual execution modes (local/docker)	Tool benchmarking, session management
carla_env	Autonomous Driving	Scenario-based design	Complex simulations, ethical dilemmas
repl_env	Code Execution	Recursive LLM architecture	Interactive environments, reward shaping

1. Calendar Environment (calendar_env)

Architecture Highlights

Generic MCP Wrapper: Fully reusable openenv_wrapper/ for any MCP server
Multi-Tenancy: SQLite per agent via x-database-id header
Rich Database Schema: Google Calendar API v3 compliant models
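
The per-agent isolation noted above can be sketched roughly as follows. This is a minimal illustration, not the calendar_env code: `SessionManager`, `connection_for`, the `"default"` fallback, and the `events` table are hypothetical names, and in-memory databases stand in for per-agent SQLite files.

```python
import sqlite3
from typing import Dict, Mapping


class SessionManager:
    """Hands out one isolated SQLite database per tenant id (hypothetical helper)."""

    def __init__(self) -> None:
        self._connections: Dict[str, sqlite3.Connection] = {}

    def connection_for(self, headers: Mapping[str, str]) -> sqlite3.Connection:
        # The x-database-id header selects the tenant; "default" is an assumed fallback.
        database_id = headers.get("x-database-id", "default")
        if database_id not in self._connections:
            # In-memory databases keep the sketch self-contained; a real server
            # would more likely open one database file per tenant.
            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, summary TEXT)")
            self._connections[database_id] = conn
        return self._connections[database_id]


manager = SessionManager()
agent_a = manager.connection_for({"x-database-id": "agent-a"})
agent_b = manager.connection_for({"x-database-id": "agent-b"})

# A write by agent A must not be visible to agent B.
agent_a.execute("INSERT INTO events (summary) VALUES ('standup')")
count_a = agent_a.execute("SELECT COUNT(*) FROM events").fetchone()[0]
count_b = agent_b.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Keying the connection on a request header keeps the tool handlers unaware of tenancy: they only ever see the connection they are given.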

Action/Observation Pattern

# Action
class MCPAction(Action):
    action_type: Literal["ListToolsAction", "ToolCallAction"]
    tool_name: Optional[str]
    arguments: Optional[Dict]

# Observation
class MCPObservation(Observation):
    success: bool
    error_message: Optional[str]
    tool_result: Optional[Dict]
    reward: Optional[float]
    done: bool

Task Definition Pattern

JSON Scenarios: Version-controlled task definitions
SQL Verifiers: Programmatic graders checking database state
3 Verifier Types: database_state, response_check, tool_execution
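
A database_state verifier might look roughly like the sketch below. The scenario keys (`task`, `verifiers`, `query`, `expected`) and the `events` table are assumptions made for illustration; the actual calendar_env scenario schema may differ.

```python
import sqlite3
from typing import Any, Dict

# Hypothetical scenario, shaped like a JSON task definition.
scenario: Dict[str, Any] = {
    "task": "Create a meeting titled 'Design review'",
    "verifiers": [
        {
            "type": "database_state",
            "query": "SELECT COUNT(*) FROM events WHERE summary = 'Design review'",
            "expected": 1,
        }
    ],
}


def run_database_state_verifier(conn: sqlite3.Connection, verifier: Dict[str, Any]) -> bool:
    """Run the verifier's SQL and compare the first column of the first row."""
    row = conn.execute(verifier["query"]).fetchone()
    actual = row[0] if row else None
    return actual == verifier["expected"]


def task_passed(conn: sqlite3.Connection, task: Dict[str, Any]) -> bool:
    """A task passes only if every verifier attached to it passes."""
    return all(run_database_state_verifier(conn, v) for v in task["verifiers"])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, summary TEXT)")

passed_before = task_passed(conn, scenario)  # nothing created yet
conn.execute("INSERT INTO events (summary) VALUES ('Design review')")
passed_after = task_passed(conn, scenario)   # the agent's write is now visible
```

Because the grader is data (a query plus an expected value), adding a task means adding a scenario entry rather than writing new Python.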

Reward Design

Sparse binary rewards: +1.0 (success), -0.5 (error)
ListToolsAction: +0.1 (discovery reward)
Status-code-based rewards, with metadata available for extra flexibility
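
The reward values above amount to a small lookup. The sketch below assumes that a failed call is penalized regardless of action type, which the notes do not state; `mcp_reward` is a hypothetical name.

```python
def mcp_reward(action_type: str, success: bool) -> float:
    """Map an action outcome to the reward values listed above."""
    if not success:
        # Assumption: errors are penalized the same way for every action type.
        return -0.5
    if action_type == "ListToolsAction":
        # Small discovery reward for listing the available tools.
        return 0.1
    return 1.0


rewards = [
    mcp_reward("ListToolsAction", True),
    mcp_reward("ToolCallAction", True),
    mcp_reward("ToolCallAction", False),
]
```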

Worth Copying

Generic wrapper architecture - Copy openenv_wrapper/ for new MCPs
Session manager pattern - Multi-tenant database isolation
Verifier-driven tasks - No code changes for new tasks
Config-driven tool discovery - Dynamic tool handlers via importlib
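
Config-driven discovery via importlib can be sketched as follows. The `"module:function"` config format and the `load_handlers` helper are assumptions for illustration, and standard-library callables stand in for real MCP tool handlers.

```python
import importlib
from typing import Any, Callable, Dict

# Hypothetical config mapping tool names to "module:function" targets.
TOOL_CONFIG: Dict[str, str] = {
    "sqrt": "math:sqrt",
    "dumps": "json:dumps",
}


def load_handlers(config: Dict[str, str]) -> Dict[str, Callable[..., Any]]:
    """Resolve each configured target to a callable at startup."""
    handlers: Dict[str, Callable[..., Any]] = {}
    for tool_name, target in config.items():
        module_name, _, attribute = target.partition(":")
        module = importlib.import_module(module_name)
        handlers[tool_name] = getattr(module, attribute)
    return handlers


handlers = load_handlers(TOOL_CONFIG)
sqrt_result = handlers["sqrt"](16.0)
dumps_result = handlers["dumps"]({"a": 1})
```

Adding a tool then means adding a config entry; the dispatch code does not change.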

2. Reasoning Gym Environment (reasoning_gym_env)

Architecture Highlights

Minimal footprint: ~200 lines core logic
Single-step episodes: reset() → step() → done
Dataset persistence: Reuse datasets across resets

Action/Observation Pattern

# Action
class ReasoningGymAction(Action):
    answer: str  # Agent's answer

# Observation
class ReasoningGymObservation(Observation):
    question: Optional[str]      # Only in reset()
    score: Optional[float]       # Only after step()
    correct_answer: Optional[str]
    done: bool

Task Definition Pattern

External library: reasoning_gym handles generation + scoring
Simple datasets: Single task type (leg_counting, reverse_sort, etc.)
Composite datasets: Mix multiple tasks with weights

Reward Design

Binary/partial: Depends on dataset scoring function
Terminal only: reward=0.0 on reset, actual score after step()
Single-step: No trajectory rewards

Worth Copying

Iterator pattern - Seamless dataset cycling with StopIteration handling
Parameter idempotency - reset() with no arguments continues through the current dataset; reset(seed=...) restarts it
Dataset caching - Compare config to avoid rebuilding
Minimal state - Just episode_id and step_count
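
The iterator and caching patterns above can be combined in a few lines. This is a simplified sketch, not the reasoning_gym_env code: `DatasetCycler` is a hypothetical class, a config dict stands in for the seed and dataset parameters, and a list comprehension stands in for the real dataset build.

```python
from typing import Any, Dict, Iterator, List, Optional


class DatasetCycler:
    """Cycles through a cached dataset, rebuilding only when the config changes."""

    def __init__(self) -> None:
        self._config: Optional[Dict[str, Any]] = None
        self._entries: List[str] = []
        self._iterator: Iterator[str] = iter(())
        self.builds = 0  # counts rebuilds so the caching is observable

    def _build(self, config: Dict[str, Any]) -> List[str]:
        # Stand-in for an expensive dataset build.
        self.builds += 1
        return [f"{config['name']}-q{i}" for i in range(config["size"])]

    def reset(self, config: Optional[Dict[str, Any]] = None) -> str:
        # Rebuild only when a config is supplied and differs from the cached one.
        if config is not None and config != self._config:
            self._config = dict(config)
            self._entries = self._build(config)
            self._iterator = iter(self._entries)
        if not self._entries:
            raise RuntimeError("reset() needs a config before the first episode")
        try:
            return next(self._iterator)
        except StopIteration:
            # Dataset exhausted: wrap around instead of failing the episode.
            self._iterator = iter(self._entries)
            return next(self._iterator)


cycler = DatasetCycler()
config = {"name": "leg_counting", "size": 2}
questions = [
    cycler.reset(config),  # builds the dataset, returns the first entry
    cycler.reset(),        # continues with the second entry
    cycler.reset(),        # exhausted, wraps around to the first entry
    cycler.reset(config),  # same config: no rebuild, just continues
]
```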

3. TB2 Environment (tbench2_env)

Architecture Highlights

Dual execution modes: Local (CAMEL toolkit) vs Docker (TB2 fidelity)
Session management: Streaming process support via session_id
Task auto-discovery: Download from GitHub + cache locally

Action/Observation Pattern

# Action
class Tbench2Action(Action):
    action_type: str  # exec, write, view, wait, kill, evaluate, etc.
    command: str
    session_id: Optional[str]
    block: bool = True

# Observation
class Tbench2Observation(Observation):
    instruction: str
    output: str
    success: bool
    error: str
    reward: Optional[float]  # Only on evaluate
    done: bool              # Only on evaluate

Task Definition Pattern

TOML-based: task.toml with environment + verifier config
Pytest graders: Each task has tests/ directory
External benchmark: Terminal-Bench 2 suite

Reward Design

Binary: 1.0 if all pytest tests pass, 0.0 otherwise
Terminal only: reward=None until evaluate action
Exit code parsing: __TB2_EXIT_CODE__:$? marker pattern
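
The exit-code marker pattern can be sketched as below. Only the marker string comes from the notes; `wrap_command`, `split_exit_code`, and `reward_from_exit_code` are hypothetical helpers, and the sketch assumes the marker is the last thing printed.

```python
import re
from typing import Optional, Tuple

EXIT_MARKER = "__TB2_EXIT_CODE__"


def wrap_command(command: str) -> str:
    """Append an echo so the shell prints the command's exit status after its output."""
    return f"{command}; echo {EXIT_MARKER}:$?"


def split_exit_code(raw_output: str) -> Tuple[str, Optional[int]]:
    """Separate the trailing marker from the real output and parse the exit status."""
    match = re.search(rf"{EXIT_MARKER}:(-?\d+)\s*$", raw_output)
    if match is None:
        return raw_output, None
    return raw_output[: match.start()].rstrip("\n"), int(match.group(1))


def reward_from_exit_code(exit_code: Optional[int]) -> float:
    """Binary reward: pytest exits with 0 only when every test passes."""
    return 1.0 if exit_code == 0 else 0.0


wrapped = wrap_command("pytest tests/")
output, code = split_exit_code("2 passed\n__TB2_EXIT_CODE__:0\n")
reward = reward_from_exit_code(code)
```

Printing the status into the stream avoids needing a separate channel for exit codes, which helps when output arrives through a streaming session.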

Worth Copying

Dual mode pattern - Local + Docker execution with env var switching
Lazy dependency loading - Import errors surface only when the optional dependency is actually used
Docker-in-Docker safe - Tar streaming instead of bind mounts
Session isolation - Unique working directories per episode_id
Metadata-driven discovery - Tasks self-describe requirements
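
Env-var mode switching and lazy dependency loading fit together naturally. In the sketch below the variable name `TB2_MODE`, the backend classes, and `lazy_import` are assumptions for illustration, not confirmed tbench2_env names.

```python
import importlib
import os
from types import ModuleType
from typing import Mapping


def lazy_import(module_name: str) -> ModuleType:
    """Import an optional dependency only at the point where it is needed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"This mode needs the optional dependency '{module_name}'"
        ) from exc


class LocalBackend:
    mode = "local"

    def run(self, command: str) -> str:
        return f"[local] {command}"


class DockerBackend:
    mode = "docker"

    def __init__(self) -> None:
        # The Docker SDK is imported only if docker mode is actually selected.
        self._docker = lazy_import("docker")

    def run(self, command: str) -> str:
        return f"[docker] {command}"


def select_backend(env: Mapping[str, str] = os.environ):
    """Pick the execution backend from an environment variable (assumed name)."""
    mode = env.get("TB2_MODE", "local").lower()
    if mode == "local":
        return LocalBackend()
    if mode == "docker":
        return DockerBackend()
    raise ValueError(f"Unknown execution mode: {mode!r}")


backend = select_backend({})  # no variable set, so the local default applies
```

Users who only ever run local mode never need the Docker SDK installed, because nothing imports it at module load time.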

4. CARLA Environment (carla_env)

Architecture Highlights

Scenario system: BaseScenario ABC with composable tasks
Rubric factory: Auto-select reward function by scenario type
Mock mode: Test without GPU/CARLA
GPU-accelerated: T4 16GB minimum for real mode

Action/Observation Pattern

# Action
class CarlaAction(Action):
    action_type: str  # observe, control, navigate, capture_image, etc.
    throttle: Optional[float]  # [0, 1] with Pydantic validation
    steer: Optional[float]     # [-1, 1]
    brake: Optional[float]     # [0, 1]

# Observation
class CarlaObservation(Observation):
    scene_description: str
    vehicle_state: Dict  # speed, location, rotation
    collision_detected: bool
    nearby_actors: List[Dict]
    camera_images: Optional[Dict]
    rubric_reward: float

Task Definition Pattern

9 Trolley scenarios: Ethical dilemmas with expected outcomes
Navigation tasks: Maze (goal-directed), Free-roam (open-world)
JSON externalized: Benchmark definitions separate from code

Reward Design

Trajectory-based (Trolley): r_t = 0.0 until the terminal step, then a gamma-discounted final reward
Step-level (Navigation): Progress + arrival bonus - collision penalty - time cost
Scenario-specific: compute_outcome() owns scoring logic
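
The two reward shapes can be written out as below. The coefficient values, the use of distance reduction as "progress", and the choice to discount by the terminal step index are all assumptions; the notes give only the structure of each reward.

```python
from typing import List


def navigation_step_reward(
    prev_distance: float,
    distance: float,
    arrived: bool,
    collided: bool,
    *,
    arrival_bonus: float = 10.0,
    collision_penalty: float = 5.0,
    time_cost: float = 0.01,
) -> float:
    """Step-level reward: progress + arrival bonus - collision penalty - time cost."""
    reward = (prev_distance - distance) - time_cost
    if arrived:
        reward += arrival_bonus
    if collided:
        reward -= collision_penalty
    return reward


def trolley_rewards(num_steps: int, outcome_score: float, gamma: float = 0.99) -> List[float]:
    """Trajectory reward: zero until the terminal step, then a discounted outcome."""
    if num_steps < 1:
        raise ValueError("an episode needs at least one step")
    rewards = [0.0] * num_steps
    # Assumption: the final outcome is discounted by the terminal step index.
    rewards[-1] = (gamma ** (num_steps - 1)) * outcome_score
    return rewards


step_reward = navigation_step_reward(10.0, 8.0, arrived=False, collided=False, time_cost=0.5)
episode_rewards = trolley_rewards(3, 1.0, gamma=0.5)
```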

Worth Copying

Scenario ABC - Each task owns physics + scoring independently
Rubric factory - Auto-select reward function by task type
Dual mode - Mock for testing, real for evaluation
Layered config - Common + scenario-specific fields
JSON externalization - Decouple task data from code
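
A scenario ABC paired with a rubric factory might look like the following sketch. `BaseScenario` and `compute_outcome()` are named in the notes, but their signatures here are guesses; the `kind` attribute, the two concrete scenarios, and `make_rubric` are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict

ScenarioState = Dict[str, Any]


class BaseScenario(ABC):
    kind: str = "base"  # hypothetical tag used by the factory below

    @abstractmethod
    def compute_outcome(self, state: ScenarioState) -> float:
        """Each scenario owns its own scoring logic."""


class TrolleyScenario(BaseScenario):
    kind = "trolley"

    def __init__(self, expected_action: str) -> None:
        self.expected_action = expected_action

    def compute_outcome(self, state: ScenarioState) -> float:
        return 1.0 if state.get("action") == self.expected_action else 0.0


class MazeScenario(BaseScenario):
    kind = "navigation"

    def compute_outcome(self, state: ScenarioState) -> float:
        reached = bool(state.get("reached_goal"))
        return 1.0 if reached and not state.get("collision") else 0.0


Rubric = Callable[[BaseScenario, ScenarioState, bool], float]


def terminal_rubric(scenario: BaseScenario, state: ScenarioState, done: bool) -> float:
    # Scores only once the episode has ended.
    return scenario.compute_outcome(state) if done else 0.0


def step_rubric(scenario: BaseScenario, state: ScenarioState, done: bool) -> float:
    # Scores on every step.
    return scenario.compute_outcome(state)


RUBRICS: Dict[str, Rubric] = {"trolley": terminal_rubric, "navigation": step_rubric}


def make_rubric(scenario: BaseScenario) -> Rubric:
    """Rubric factory: select the reward function from the scenario type."""
    return RUBRICS[scenario.kind]


trolley = TrolleyScenario(expected_action="swerve")
rubric = make_rubric(trolley)
mid_episode = rubric(trolley, {"action": "swerve"}, False)
at_terminal = rubric(trolley, {"action": "swerve"}, True)
```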

5. REPL Environment (repl_env)

Architecture Highlights

Layered design: Environment → Runner → Backend separation
Recursive LLM: Depth-limited child spawning with RLM pattern
Composable rubrics: Outcome + process rewards
Thread-safe batching: Multiple concurrent child queries

Action/Observation Pattern

# Action
class REPLAction(Action):
    code: str
    is_final: bool = False
    final_answer: Optional[str] = None

# Observation
class REPLObservation(Observation):
    result: CodeBlockResult  # stdout, stderr, locals_snapshot
    available_variables: List[str]
    iteration: int
    done: bool
    reward: float

Task Definition Pattern

Rubric-driven: Ground truth passed at reset()
Multiple finalization patterns: FINAL(), FINAL_VAR(), dict with ready flag
External graders: CustomMetricRubric for user-provided scoring

Reward Design

Composable: REPLRubric = outcome + process
Outcome (terminal): ExactMatch, FuzzyMatch, or CustomMetric
Process (per-step): +success_reward, -error_penalty
Failure: -failure_reward if max_iterations is reached without an answer
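
The outcome + process split can be sketched with three small classes. The class names below are deliberately not the repl_env ones, and the default values for `success_reward`, `error_penalty`, and `failure_reward` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessRubric:
    """Per-step signal: small bonus for clean execution, penalty for errors."""
    success_reward: float = 0.1
    error_penalty: float = 0.1

    def score(self, had_error: bool) -> float:
        return -self.error_penalty if had_error else self.success_reward


@dataclass
class ExactMatchOutcome:
    """Terminal signal: 1.0 only when the final answer equals the ground truth."""
    expected: str

    def score(self, final_answer: str) -> float:
        return 1.0 if final_answer.strip() == self.expected.strip() else 0.0


@dataclass
class ComposedRubric:
    outcome: ExactMatchOutcome
    process: ProcessRubric
    failure_reward: float = 1.0

    def step_reward(
        self,
        *,
        had_error: bool,
        final_answer: Optional[str] = None,
        out_of_iterations: bool = False,
    ) -> float:
        if final_answer is not None:
            return self.outcome.score(final_answer)   # terminal: outcome rubric
        if out_of_iterations:
            return -self.failure_reward               # gave up without an answer
        return self.process.score(had_error)          # ordinary step: process rubric


rubric = ComposedRubric(ExactMatchOutcome("42"), ProcessRubric(0.1, 0.2))
clean_step = rubric.step_reward(had_error=False)
error_step = rubric.step_reward(had_error=True)
final_step = rubric.step_reward(had_error=False, final_answer=" 42 ")
timed_out = rubric.step_reward(had_error=False, out_of_iterations=True)
```

Keeping the two rubrics separate lets the outcome grader be swapped (exact, fuzzy, custom) without touching the per-step shaping.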

Worth Copying

Composable rubrics - outcome + process separation
Recursive backend - Protocol-based with depth limits
Message-based loop - Explicit iteration with timeout checks
Variable snapshots - Serialize namespace state
Dual API - Sync + async with same models
Cooperative timeout - perf_counter() checks, not interrupts
Injected helpers - llm_query, rlm_query available in namespace
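
The cooperative timeout listed above can be sketched as a loop that checks a deadline between iterations instead of interrupting running code. `run_loop` and its return shape are hypothetical; only the use of `perf_counter()` comes from the notes.

```python
import time
from typing import Callable, Dict, Union


def run_loop(
    step: Callable[[int], bool],
    max_iterations: int,
    timeout_s: float,
) -> Dict[str, Union[str, int]]:
    """Run step() until it reports a final answer, the budget ends, or time runs out."""
    deadline = time.perf_counter() + timeout_s
    for iteration in range(1, max_iterations + 1):
        # Cooperative check: the deadline is only examined between iterations,
        # so a step that is already running is never interrupted.
        if time.perf_counter() >= deadline:
            return {"status": "timeout", "iterations": iteration - 1}
        if step(iteration):
            return {"status": "final", "iterations": iteration}
    return {"status": "max_iterations", "iterations": max_iterations}


finished = run_loop(lambda i: i == 3, max_iterations=10, timeout_s=5.0)
exhausted = run_loop(lambda i: False, max_iterations=4, timeout_s=5.0)
expired = run_loop(lambda i: False, max_iterations=4, timeout_s=-1.0)
```

The trade-off is that a single long-running step can overshoot the deadline; in exchange, no threads or signals are needed and state is never left half-updated.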

Cross-Cutting Patterns

1. Pydantic Models Everywhere

All environments use Pydantic BaseModel for:

Type safety + validation
JSON serialization
OpenAPI schema generation
Field descriptions for documentation

2. FastAPI App Factory

from openenv.core.env_server.http_server import create_app

app = create_app(
    MyEnvironment,
    MyAction,
    MyObservation,
    env_name="my_env",
    max_concurrent_envs=1,
)

3. Client-Server Separation

Server: Implements Environment[Action, Observation, State]
Client: EnvClient[Action, Observation, State] wraps HTTP/WebSocket
Local variants for in-process testing

4. Episode State Management

class State(BaseModel):
    episode_id: str        # UUID per episode
    step_count: int        # Actions taken
    # Environment-specific metrics
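
How this state evolves over an episode can be sketched without any framework. A standard-library dataclass stands in for the Pydantic model here, and `CounterEnvironment` is a hypothetical toy environment, not an OpenEnv class.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class EpisodeState:
    # A fresh UUID is generated for every new state object.
    episode_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    step_count: int = 0


class CounterEnvironment:
    """Toy environment showing only the state bookkeeping."""

    def __init__(self) -> None:
        self.state = EpisodeState()

    def reset(self) -> EpisodeState:
        # A new episode gets a new id and a zeroed step counter.
        self.state = EpisodeState()
        return self.state

    def step(self) -> EpisodeState:
        self.state.step_count += 1
        return self.state


env = CounterEnvironment()
first_episode_id = env.reset().episode_id
env.step()
env.step()
steps_in_first_episode = env.state.step_count
second_episode_id = env.reset().episode_id
```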

5. Metadata for Flexibility

Actions have optional metadata: Dict[str, Any]
Observations include metadata for extra context
Enables custom reward signals without model changes

6. Docker + openenv.yaml

spec_version: 1
name: my_env
type: space
runtime: fastapi
app: server.app:app
port: 8000

7. Concurrent Sessions Support

class MyEnvironment(Environment):
    SUPPORTS_CONCURRENT_SESSIONS: bool = True

Recommendations for Hackathon Project

Use calendar_env approach if:

Building database-backed environment (customer support, data cleaning)
Need multi-agent evaluation isolation
Want reusable wrapper for other MCPs

Use reasoning_gym_env approach if:

Simple single-step tasks (email triage, classification)
Dataset-based evaluation
Minimal code complexity desired

Use tbench2_env approach if:

Tool use benchmarking (API integration, CLI tools)
Need Docker isolation
Session-based interaction required

Use carla_env approach if:

Complex simulation with physics
Scenario-based curriculum learning
Trajectory-based rewards

Use repl_env approach if:

Code execution environment
Recursive reasoning needed
Composable reward functions

Quick Start Checklist

For your hackathon environment, ensure:

3+ tasks with graders returning scores 0.0-1.0
Pydantic models for Action, Observation, State
openenv.yaml with correct metadata
inference.py in root (uses HF Router, not OpenAI)
STDOUT logging with [START], [STEP], [END] format
Dockerfile in root directory (not /server)
Meaningful rewards that distinguish performance levels
Real-world task with genuine value
Runtime under 20 min on vcpu=2, memory=8GB
Passes openenv validate
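
One simple way to get graders that return scores in the 0.0-1.0 range and still distinguish performance levels is partial credit over independent checks. The sketch below is an illustration under that assumption; `grade`, the check functions, and the result fields are hypothetical.

```python
from typing import Any, Callable, Dict, List

Check = Callable[[Dict[str, Any]], bool]


def grade(result: Dict[str, Any], checks: List[Check]) -> float:
    """Return the fraction of checks passed, always within [0.0, 1.0]."""
    if not checks:
        return 0.0
    passed = sum(1 for check in checks if check(result))
    return passed / len(checks)


# Hypothetical checks for a scheduling task.
checks: List[Check] = [
    lambda r: r.get("scheduled") is True,
    lambda r: r.get("conflicts", 1) == 0,
    lambda r: r.get("duration_min") == 30,
]

partial_score = grade({"scheduled": True, "conflicts": 0, "duration_min": 45}, checks)
full_score = grade({"scheduled": True, "conflicts": 0, "duration_min": 30}, checks)
empty_score = grade({}, checks)
```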

Key Files to Reference

For Implementation Patterns:

calendar_env/server/openenv_wrapper/mcp_env_environment.py - Generic wrapper
reasoning_gym_env/server/reasoning_gym_environment.py - Minimal implementation
tbench2_env/server/tbench2_env_environment.py - Session management
carla_env/server/benchmark_scenarios/base.py - Scenario ABC
repl_env/rubrics.py - Composable reward design

For Client Usage:

*/client.py - All environments have reference implementations
repl_env/runner.py - Message-based orchestration loop

For Server Setup:

*/server/app.py - FastAPI app factory usage
*/openenv.yaml - Configuration examples
*/Dockerfile - Docker image patterns

Next Steps

Choose architecture: Pick closest reference environment to your task
Copy skeleton: Use openenv init or copy from reference
Define models: Start with Action/Observation Pydantic models
Implement graders: At least 3 tasks with programmatic scoring
Test locally: Use client.py pattern for rapid iteration
Validate: Run openenv validate before deployment
Deploy: openenv push to Hugging Face Spaces