Spaces:

SidraMiconi
/

exec-assistant-arena

SidraMiconi committed on Mar 8

Commit

378cf8e

verified ·

1 Parent(s): e3a3154

Upload folder using huggingface_hub

Files changed (13)

Dockerfile +81 -0
README.md +249 -4
__init__.py +11 -0
client.py +63 -0
models.py +46 -0
openenv.yaml +7 -0
pyproject.toml +45 -0
server/__init__.py +11 -0
server/app.py +32 -0
server/exec_assistant_arena_environment.py +211 -0
server/requirements.txt +6 -0
server/reward.py +207 -0
server/scenario_generator.py +266 -0

Dockerfile ADDED

	@@ -0,0 +1,81 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# Multi-stage build using openenv-base
+# This Dockerfile is flexible and works for both:
+# - In-repo environments (with local OpenEnv sources)
+# - Standalone environments (with openenv from PyPI/Git)
+# The build script (openenv build) handles context detection and sets appropriate build args.
+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+# Ensure git is available (required for installing dependencies from VCS)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+# Build argument to control whether we're building standalone or in-repo
+ARG BUILD_MODE=in-repo
+ARG ENV_NAME=exec_assistant_arena
+# Copy environment code (always at root of build context)
+COPY . /app/env
+# For in-repo builds, openenv is already vendored in the build context
+# For standalone builds, openenv will be installed via pyproject.toml
+WORKDIR /app/env
+# Ensure uv is available (for local builds where base image lacks it)
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
+# Install dependencies using uv sync
+# If uv.lock exists, use it; otherwise resolve on the fly
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-install-project --no-editable; \
+    else \
+        uv sync --no-install-project --no-editable; \
+    fi
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-editable; \
+    else \
+        uv sync --no-editable; \
+    fi
+# Final runtime stage
+FROM ${BASE_IMAGE}
+WORKDIR /app
+# Copy the virtual environment from builder
+COPY --from=builder /app/env/.venv /app/.venv
+# Copy the environment code
+COPY --from=builder /app/env /app/env
+# Set PATH to use the virtual environment
+ENV PATH="/app/.venv/bin:$PATH"
+# Set PYTHONPATH so imports work correctly
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:8000/health || exit 1
+# Run the FastAPI server
+# The module path is constructed to work with the /app/env structure
+ENV ENABLE_WEB_INTERFACE=true
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]

README.md CHANGED

@@ -1,10 +1,255 @@
 ---
-title: Exec Assistant Arena
-emoji: 📉
-colorFrom: pink
 colorTo: green
 sdk: docker
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Exec Assistant Arena Environment Server
+emoji: 🎣
+colorFrom: gray
 colorTo: green
 sdk: docker
 pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
 ---
+# Exec Assistant Arena Environment
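+An environment that simulates a personal assistant's morning inbox. The agent must resolve calendar conflicts, draft email replies, infer user preferences, and handle late-breaking changes. Episodes run 10-20 steps and rewards are rule-based.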
+## Quick Start
+The simplest way to use the Exec Assistant Arena environment is through the `ExecAssistantArenaEnv` class:
+```python
+from exec_assistant_arena import AssistantAction, ExecAssistantArenaEnv
+try:
+    # Create environment from Docker image
+    exec_assistant_arenaenv = ExecAssistantArenaEnv.from_docker_image("exec_assistant_arena-env:latest")
+    # Reset
+    result = exec_assistant_arenaenv.reset()
+    print(f"Reset: {result.observation.tool_result}")
+    # Call several tools in sequence
+    tools = ["check_calendar", "check_inbox", "done"]
+    for tool in tools:
+        result = exec_assistant_arenaenv.step(AssistantAction(tool=tool))
+        print(f"Tool: '{tool}'")
+        print(f"  → Result: '{result.observation.tool_result}'")
+        print(f"  → Conflicts: {result.observation.conflicts}")
+        print(f"  → Reward: {result.reward}")
+finally:
+    # Always clean up
+    exec_assistant_arenaenv.close()
+```
+That's it! The `ExecAssistantArenaEnv.from_docker_image()` method handles:
+- Starting the Docker container
+- Waiting for the server to be ready
+- Connecting to the environment
+- Container cleanup when you call `close()`
+## Building the Docker Image
+Before using the environment, you need to build the Docker image:
+```bash
+# From project root
+docker build -t exec_assistant_arena-env:latest -f server/Dockerfile .
+```
+## Deploying to Hugging Face Spaces
+You can easily deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:
+```bash
+# From the environment directory (where openenv.yaml is located)
+openenv push
+# Or specify options
+openenv push --namespace my-org --private
+```
+The `openenv push` command will:
+1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
+2. Prepare a custom build for Hugging Face Docker space (enables web interface)
+3. Upload to Hugging Face (ensuring you're logged in)
+### Prerequisites
+- Authenticate with Hugging Face: The command will prompt for login if not already authenticated
+### Options
+- `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
+- `--repo-id`, `-r`: Repository ID in format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml)
+- `--base-image`, `-b`: Base Docker image to use (overrides Dockerfile FROM)
+- `--private`: Deploy the space as private (default: public)
+### Examples
+```bash
+# Push to your personal namespace (defaults to username/env-name from openenv.yaml)
+openenv push
+# Push to a specific repository
+openenv push --repo-id my-org/my-env
+# Push with a custom base image
+openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
+# Push as a private space
+openenv push --private
+# Combine options
+openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
+```
+After deployment, your space will be available at:
+`https://huggingface.co/spaces/<repo-id>`
+The deployed space includes:
+- **Web Interface** at `/web` - Interactive UI for exploring the environment
+- **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
+- **Health Check** at `/health` - Container health monitoring
+- **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
+## Environment Details
+### Action
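+**AssistantAction**: A tool call with optional arguments
+- `tool` (str) - Tool to invoke: `check_calendar`, `check_inbox`, `reschedule`, `draft_reply`, `delegate_task`, or `done`
+- `arguments` (dict) - Tool arguments, e.g. `{'event_id': 'mtg_3', 'new_time': '2pm'}`
+### Observation
+**AssistantObservation**: The assistant's current view of the inbox and calendar
+- `inbox_summary` (str) - Current emails/messages
+- `calendar_view` (str) - Today's schedule as text
+- `pending_tasks` (list[str]) - Unresolved items
+- `tool_result` (str) - Output of the last tool call
+- `conflicts` (list[str]) - Detected scheduling conflicts
+- `reward` (float) - Reward for the last step
+- `done` (bool) - True after the `done` tool is called or the episode reaches 20 steps
+- `metadata` (dict) - Additional info
+### Reward
+Rewards are rule-based and decomposed into six components: conflict resolution, preference inference, email quality, deadline adherence, efficiency penalty, and late-change recovery. For example:
+- Resolving a conflict with `reschedule` → conflict reward: 1.0 (before preference adjustments)
+- Rescheduling into a slot that creates a new conflict → conflict reward: -0.5
+- Calling an unknown tool → reward: -0.2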
+## Advanced Usage
+### Connecting to an Existing Server
+If you already have an Exec Assistant Arena environment server running, you can connect directly:
+```python
+from exec_assistant_arena import AssistantAction, ExecAssistantArenaEnv
+# Connect to existing server
+exec_assistant_arenaenv = ExecAssistantArenaEnv(base_url="<ENV_HTTP_URL_HERE>")
+# Use as normal
+result = exec_assistant_arenaenv.reset()
+result = exec_assistant_arenaenv.step(AssistantAction(tool="check_inbox"))
+```
+Note: When connecting to an existing server, `exec_assistant_arenaenv.close()` will NOT stop the server.
+### Using the Context Manager
+The client supports context manager usage for automatic connection management:
+```python
+from exec_assistant_arena import AssistantAction, ExecAssistantArenaEnv
+# Connect with context manager (auto-connects and closes)
+with ExecAssistantArenaEnv(base_url="http://localhost:8000") as env:
+    result = env.reset()
+    print(f"Reset: {result.observation.tool_result}")
+    # Multiple steps with low latency
+    for tool in ["check_calendar", "check_inbox", "done"]:
+        result = env.step(AssistantAction(tool=tool))
+        print(f"Result: {result.observation.tool_result}")
+```
+The client uses WebSocket connections for:
+- **Lower latency**: No HTTP connection overhead per request
+- **Persistent session**: Server maintains your environment state
+- **Efficient for episodes**: Better for many sequential steps
+### Concurrent WebSocket Sessions
+The server supports multiple concurrent WebSocket connections. `server/app.py`
+uses factory mode for this, passing the environment class rather than an instance:
+```python
+# In server/app.py - factory mode for concurrent sessions
+app = create_app(
+    ExecAssistantArenaEnvironment,  # Pass class, not instance
+    AssistantAction,
+    AssistantObservation,
+    env_name="exec_assistant_arena",
+    max_concurrent_envs=5,  # Allow 5 concurrent sessions
+)
+```
+Then multiple clients can connect simultaneously:
+```python
+from exec_assistant_arena import AssistantAction, ExecAssistantArenaEnv
+from concurrent.futures import ThreadPoolExecutor
+def run_episode(client_id: int):
+    with ExecAssistantArenaEnv(base_url="http://localhost:8000") as env:
+        result = env.reset()
+        for tool in ["check_calendar", "check_inbox", "done"]:
+            result = env.step(AssistantAction(tool=tool))
+        return client_id, result.reward
+# Run 4 episodes concurrently
+with ThreadPoolExecutor(max_workers=4) as executor:
+    results = list(executor.map(run_episode, range(4)))
+```
+## Development & Testing
+### Direct Environment Testing
+Test the environment logic directly without starting the HTTP server:
+```bash
+# From the server directory
+python3 server/exec_assistant_arena_environment.py
+```
+This verifies that:
+- Environment resets correctly
+- Step executes actions properly
+- State tracking works
+- Rewards are calculated correctly
+### Running Locally
+Run the server locally for development:
+```bash
+uvicorn server.app:app --reload
+```
+## Project Structure
+```
+exec_assistant_arena/
+├── .dockerignore         # Docker build exclusions
+├── __init__.py            # Module exports
+├── README.md              # This file
+├── openenv.yaml           # OpenEnv manifest
+├── pyproject.toml         # Project metadata and dependencies
+├── uv.lock                # Locked dependencies (generated)
+├── client.py              # ExecAssistantArenaEnv client
+├── models.py              # Action and Observation models
+└── server/
+    ├── __init__.py        # Server module exports
+    ├── exec_assistant_arena_environment.py  # Core environment logic
+    ├── app.py             # FastAPI application (HTTP + WebSocket endpoints)
+    └── Dockerfile         # Container image definition
+```

__init__.py ADDED

	@@ -0,0 +1,11 @@

+"""Executive Assistant Arena Environment."""
+from .client import ExecAssistantArenaEnv
+from .models import AssistantAction, AssistantObservation, AssistantState
+__all__ = [
+    "AssistantAction",
+    "AssistantObservation",
+    "AssistantState",
+    "ExecAssistantArenaEnv",
+]

client.py ADDED

	@@ -0,0 +1,63 @@

+"""Executive Assistant Arena Environment Client."""
+from typing import Dict
+from openenv.core.client_types import StepResult
+from openenv.core import EnvClient
+from .models import AssistantAction, AssistantObservation, AssistantState
+class ExecAssistantArenaEnv(
+    EnvClient[AssistantAction, AssistantObservation, AssistantState]
+):
+    """
+    Client for the Executive Assistant Arena Environment.
+    Example:
+        >>> with ExecAssistantArenaEnv(base_url="http://localhost:8000") as client:
+        ...     result = client.reset(difficulty="medium")
+        ...     result = client.step(AssistantAction(tool="check_calendar"))
+    """
+    def _step_payload(self, action: AssistantAction) -> Dict:
+        return {
+            "tool": action.tool,
+            "arguments": action.arguments,
+        }
+    def _parse_result(self, payload: Dict) -> StepResult[AssistantObservation]:
+        obs_data = payload.get("observation", {})
+        observation = AssistantObservation(
+            inbox_summary=obs_data.get("inbox_summary", ""),
+            calendar_view=obs_data.get("calendar_view", ""),
+            pending_tasks=obs_data.get("pending_tasks", []),
+            tool_result=obs_data.get("tool_result", ""),
+            conflicts=obs_data.get("conflicts", []),
+            done=payload.get("done", False),
+            reward=payload.get("reward"),
+            metadata=obs_data.get("metadata", {}),
+        )
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward"),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: Dict) -> AssistantState:
+        return AssistantState(
+            episode_id=payload.get("episode_id"),
+            step_count=payload.get("step_count", 0),
+            conflicts_resolved=payload.get("conflicts_resolved", 0),
+            total_conflicts=payload.get("total_conflicts", 0),
+            preferences_inferred=payload.get("preferences_inferred", 0),
+            total_preferences=payload.get("total_preferences", 0),
+            emails_drafted=payload.get("emails_drafted", 0),
+            total_emails=payload.get("total_emails", 0),
+            deadlines_met=payload.get("deadlines_met", 0),
+            deadlines_missed=payload.get("deadlines_missed", 0),
+            unnecessary_actions=payload.get("unnecessary_actions", 0),
+            late_changes_handled=payload.get("late_changes_handled", 0),
+            total_late_changes=payload.get("total_late_changes", 0),
+            cumulative_reward=payload.get("cumulative_reward", 0.0),
+        )
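The parsing in `_parse_result` is easiest to follow with a concrete payload. The sketch below is a stand-in, not the real client: `ObservationSketch` and `parse_result` are hypothetical names, a plain dataclass replaces the pydantic `AssistantObservation`, and the sample payload (including the `mtg_9` ID) is made up. It only illustrates that observation fields are read from the nested `"observation"` object, that `done` and `reward` come from the top level, and that missing keys fall back to defaults.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObservationSketch:
    """Stand-in for AssistantObservation with the same field defaults."""
    inbox_summary: str = ""
    calendar_view: str = ""
    pending_tasks: list = field(default_factory=list)
    tool_result: str = ""
    conflicts: list = field(default_factory=list)
    done: bool = False
    reward: Optional[float] = None


def parse_result(payload: dict) -> ObservationSketch:
    # Observation fields live under "observation"; done and reward are top-level keys.
    obs = payload.get("observation", {})
    return ObservationSketch(
        inbox_summary=obs.get("inbox_summary", ""),
        calendar_view=obs.get("calendar_view", ""),
        pending_tasks=obs.get("pending_tasks", []),
        tool_result=obs.get("tool_result", ""),
        conflicts=obs.get("conflicts", []),
        done=payload.get("done", False),
        reward=payload.get("reward"),
    )


# A partial payload: only tool_result and reward are present.
partial = {"observation": {"tool_result": "Event mtg_9 not found."}, "reward": -0.2}
parsed = parse_result(partial)
print(parsed.tool_result, parsed.reward, parsed.done, parsed.conflicts)
```

Note that a payload without a `"reward"` key yields `None` rather than `0.0`, so callers that sum rewards may want to guard against that.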

models.py ADDED

	@@ -0,0 +1,46 @@

+"""Data models for the Executive Assistant Arena Environment."""
+from typing import Optional
+from pydantic import Field
+from openenv.core.env_server.types import Action, Observation, State
+class AssistantAction(Action):
+    """Action for the assistant environment - tool calls to manage calendar/email."""
+    tool: str = Field(
+        ...,
+        description="Tool to invoke: check_calendar, check_inbox, reschedule, draft_reply, delegate_task, done",
+    )
+    arguments: dict = Field(
+        default_factory=dict,
+        description="Tool arguments, e.g. {'event_id': 'mtg_3', 'new_time': '2pm'}",
+    )
+class AssistantObservation(Observation):
+    """Observation from the assistant environment."""
+    inbox_summary: str = Field(default="", description="Current emails/messages")
+    calendar_view: str = Field(default="", description="Today's schedule as text")
+    pending_tasks: list[str] = Field(default_factory=list, description="Unresolved items")
+    tool_result: str = Field(default="", description="Output of last tool call")
+    conflicts: list[str] = Field(default_factory=list, description="Detected scheduling conflicts")
+class AssistantState(State):
+    """Internal state tracking for the assistant environment."""
+    conflicts_resolved: int = Field(default=0)
+    total_conflicts: int = Field(default=0)
+    preferences_inferred: int = Field(default=0)
+    total_preferences: int = Field(default=0)
+    emails_drafted: int = Field(default=0)
+    total_emails: int = Field(default=0)
+    deadlines_met: int = Field(default=0)
+    deadlines_missed: int = Field(default=0)
+    unnecessary_actions: int = Field(default=0)
+    late_changes_handled: int = Field(default=0)
+    total_late_changes: int = Field(default=0)
+    cumulative_reward: float = Field(default=0.0)
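To make the action shape concrete, here is a minimal sketch of the payload for each tool. `ActionSketch` and `to_payload` are hypothetical stand-ins (the real `AssistantAction` is a pydantic model built on OpenEnv's `Action`). The argument keys are the ones `step()` reads in `server/exec_assistant_arena_environment.py`; the IDs, names, and the `2:00pm` slot string are made-up values, and whether a given time string is accepted depends on `TIME_SLOTS`, which is not shown in this diff.

```python
from dataclasses import dataclass, field


@dataclass
class ActionSketch:
    """Stand-in for AssistantAction: a tool name plus free-form arguments."""
    tool: str
    arguments: dict = field(default_factory=dict)


def to_payload(action: ActionSketch) -> dict:
    # Same two keys that ExecAssistantArenaEnv._step_payload sends.
    return {"tool": action.tool, "arguments": action.arguments}


# One example per tool; all IDs, names, and times are illustrative only.
examples = [
    ActionSketch("check_calendar"),
    ActionSketch("check_inbox"),
    ActionSketch("reschedule", {"event_id": "mtg_3", "new_time": "2:00pm"}),
    ActionSketch("draft_reply", {"email_id": "email_1", "body": "Confirmed, see you then."}),
    ActionSketch("delegate_task", {"task": "Book a room", "to": "Sam"}),
    ActionSketch("done"),
]

for action in examples:
    print(to_payload(action))
```

Tools that take no arguments can omit them entirely, since `arguments` defaults to an empty dict.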

openenv.yaml ADDED

	@@ -0,0 +1,7 @@

+spec_version: 1
+name: exec_assistant_arena
+type: space
+runtime: fastapi
+app: server.app:app
+port: 8000

pyproject.toml ADDED

	@@ -0,0 +1,45 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-exec_assistant_arena"
+version = "0.1.0"
+description = "Exec Assistant Arena environment for OpenEnv"
+requires-python = ">=3.10"
+dependencies = [
+    # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+    # install from github
+    # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+    "openenv-core[core]>=0.2.0",
+    # Environment-specific dependencies
+    # Add all dependencies needed for your environment here
+    # Examples:
+    # "numpy>=1.19.0",
+    # "torch>=2.0.0",
+    # "gymnasium>=0.29.0",
+    # "openspiel>=1.0.0",
+    # "smolagents>=1.22.0,<2",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+# Server entry point - enables running via: uv run --project . server
+# or: python -m exec_assistant_arena.server.app
+server = "exec_assistant_arena.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["exec_assistant_arena", "exec_assistant_arena.server"]
+package-dir = { "exec_assistant_arena" = ".", "exec_assistant_arena.server" = "server" }

server/__init__.py ADDED

	@@ -0,0 +1,11 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Exec Assistant Arena environment server components."""
+from .exec_assistant_arena_environment import ExecAssistantArenaEnvironment
+__all__ = ["ExecAssistantArenaEnvironment"]

server/app.py ADDED

	@@ -0,0 +1,32 @@

+"""FastAPI application for the Executive Assistant Arena Environment."""
+try:
+    from openenv.core.env_server.http_server import create_app
+except Exception as e:
+    raise ImportError(
+        "openenv is required. Install with: pip install 'openenv-core[core]>=0.2.0'"
+    ) from e
+from models import AssistantAction, AssistantObservation
+from .exec_assistant_arena_environment import ExecAssistantArenaEnvironment
+app = create_app(
+    ExecAssistantArenaEnvironment,
+    AssistantAction,
+    AssistantObservation,
+    env_name="exec_assistant_arena",
+    max_concurrent_envs=5,
+)
+def main(host: str = "0.0.0.0", port: int = 8000):
+    import uvicorn
+    uvicorn.run(app, host=host, port=port)
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--port", type=int, default=8000)
+    args = parser.parse_args()
+    main(port=args.port)

server/exec_assistant_arena_environment.py ADDED

	@@ -0,0 +1,211 @@

+"""Executive Assistant Arena Environment Implementation."""
+from uuid import uuid4
+from openenv.core.env_server.interfaces import Environment
+from models import AssistantAction, AssistantObservation, AssistantState
+from .scenario_generator import generate_scenario, Scenario, CalendarEvent, TIME_SLOTS
+from .reward import score_reschedule, score_email_reply, score_terminal, RewardBreakdown
+class ExecAssistantArenaEnvironment(Environment):
+    """
+    An environment that simulates a personal assistant's morning inbox.
+    The agent must resolve calendar conflicts, draft email replies,
+    infer user preferences, and handle late-breaking changes.
+    Episodes are 10-20 steps. Rewards are rule-based and decomposed
+    into 6 components for training visibility.
+    """
+    SUPPORTS_CONCURRENT_SESSIONS: bool = True
+    def __init__(self):
+        self._state = AssistantState(episode_id=str(uuid4()), step_count=0)
+        self.scenario: Scenario | None = None
+        self.late_change_injected = False
+        self.late_change_step: int | None = None
+        self.replied_emails: set[str] = set()
+        self.reward_breakdown = RewardBreakdown()
+    def reset(self, seed=None, difficulty="medium", **kwargs) -> AssistantObservation:
+        """Reset the environment with a new procedural scenario."""
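+        if isinstance(seed, str):
+            # str hash() is salted per process, so use a stable digest for reproducible seeds
+            import zlib
+            seed = zlib.crc32(seed.encode("utf-8")) % (2**31)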
+        self.scenario = generate_scenario(difficulty, seed)
+        self.late_change_injected = False
+        self.late_change_step = None
+        self.replied_emails = set()
+        self.reward_breakdown = RewardBreakdown()
+        self._state = AssistantState(
+            episode_id=str(uuid4()),
+            step_count=0,
+            total_conflicts=len(self.scenario.conflicts),
+            total_emails=len([e for e in self.scenario.emails if e.requires_reply]),
+            total_preferences=len(self.scenario.preferences),
+            total_late_changes=len(self.scenario.late_changes),
+        )
+        # Build the welcome observation
+        pref_hints = "\n".join(f"  - {desc}" for _, desc in self.scenario.preferences)
+        return AssistantObservation(
+            inbox_summary=self.scenario.inbox_text(),
+            calendar_view=self.scenario.calendar_text(),
+            pending_tasks=self.scenario.pending_tasks_text(),
+            tool_result=f"Good morning. You have {len(self.scenario.conflicts)} scheduling conflicts and {self._state.total_emails} emails needing replies.\n\nUser preferences:\n{pref_hints}",
+            conflicts=self.scenario.conflicts_text(),
+            done=False,
+            reward=0.0,
+        )
+    def step(self, action: AssistantAction, **kwargs) -> AssistantObservation:
+        """Process one assistant action."""
+        if self.scenario is None:
+            self.reset()
+        self._state.step_count += 1
+        reward = 0.0
+        tool_result = ""
+        # Inject late change at step 7+
+        if self._state.step_count >= 7 and not self.late_change_injected:
+            change_desc = self.scenario.inject_late_change()
+            if change_desc:
+                self.late_change_injected = True
+                self.late_change_step = self._state.step_count
+                tool_result = f"*** LATE CHANGE: {change_desc} ***\n\n"
+        # Process tool call
+        tool = action.tool
+        args = action.arguments
+        if tool == "check_calendar":
+            tool_result += self.scenario.calendar_text()
+            # Free action - no reward
+        elif tool == "check_inbox":
+            tool_result += self.scenario.inbox_text()
+            # Free action
+        elif tool == "reschedule":
+            event_id = args.get("event_id", "")
+            new_time = args.get("new_time", "")
+            conflict_r, pref_r, msg = score_reschedule(
+                self.scenario, event_id, new_time, self.scenario.preferences
+            )
+            reward += conflict_r + pref_r
+            self.reward_breakdown.conflict_resolution += conflict_r
+            self.reward_breakdown.preference_inference += pref_r
+            if conflict_r > 0:
+                self._state.conflicts_resolved += 1
+            if pref_r > 0:
+                self._state.preferences_inferred += 1
+            tool_result += msg
+        elif tool == "draft_reply":
+            email_id = args.get("email_id", "")
+            body = args.get("body", "")
+            if email_id in self.replied_emails:
+                reward -= 0.2
+                self._state.unnecessary_actions += 1
+                self.reward_breakdown.efficiency_penalty -= 0.2
+                tool_result += f"Already replied to {email_id}."
+            else:
+                email_r, pref_r, msg = score_email_reply(
+                    email_id, body, self.scenario, self.scenario.preferences
+                )
+                reward += email_r + pref_r
+                self.reward_breakdown.email_quality += email_r
+                self.reward_breakdown.preference_inference += pref_r
+                self._state.emails_drafted += 1
+                if pref_r > 0:
+                    self._state.preferences_inferred += 1
+                self.replied_emails.add(email_id)
+                # Mark deadline as met
+                for e in self.scenario.emails:
+                    if e.email_id == email_id and e.deadline:
+                        self._state.deadlines_met += 1
+                        self.reward_breakdown.deadline_adherence += 0.5
+                tool_result += msg
+        elif tool == "delegate_task":
+            task_desc = args.get("task", "")
+            to = args.get("to", "")
+            if task_desc and to:
+                tool_result += f"Delegated '{task_desc}' to {to}."
+                # Small positive for any delegation made after a late change was injected
+                if self.late_change_injected and self.late_change_step:
+                    reward += 0.5
+                    self.reward_breakdown.late_change_recovery += 0.5
+                    self._state.late_changes_handled += 1
+            else:
+                reward -= 0.2
+                self._state.unnecessary_actions += 1
+                self.reward_breakdown.efficiency_penalty -= 0.2
+                tool_result += "Delegate requires 'task' and 'to' arguments."
+        elif tool == "done":
+            # Compute terminal rewards
+            terminal = score_terminal(self.scenario)
+            # Credit back deadlines that were met
+            terminal.deadline_adherence += self._state.deadlines_met * 1.0
+            # Credit late changes handled
+            if self.late_change_injected:
+                # Counts as handled only if the agent delegated a task after the late change
+                handled = self._state.late_changes_handled > 0
+                if handled:
+                    terminal.late_change_recovery += 2.0
+                    self._state.late_changes_handled = max(1, self._state.late_changes_handled)
+            reward += terminal.total
+            self.reward_breakdown.deadline_adherence += terminal.deadline_adherence
+            self.reward_breakdown.late_change_recovery += terminal.late_change_recovery
+            self.reward_breakdown.conflict_resolution += terminal.conflict_resolution
+            tool_result += "Episode complete. Final breakdown:\n"
+            tool_result += f"  Conflicts resolved: {self._state.conflicts_resolved}/{self._state.total_conflicts}\n"
+            tool_result += f"  Emails drafted: {self._state.emails_drafted}/{self._state.total_emails}\n"
+            tool_result += f"  Preferences inferred: {self._state.preferences_inferred}/{self._state.total_preferences}\n"
+            tool_result += f"  Deadlines met: {self._state.deadlines_met}\n"
+            tool_result += f"  Late changes handled: {self._state.late_changes_handled}/{self._state.total_late_changes}\n"
+        else:
+            self._state.unnecessary_actions += 1
+            reward -= 0.2
+            self.reward_breakdown.efficiency_penalty -= 0.2
+            tool_result += f"Unknown tool: {tool}. Available: check_calendar, check_inbox, reschedule, draft_reply, delegate_task, done"
+        done = tool == "done" or self._state.step_count >= 20
+        self._state.cumulative_reward += reward
+        # If we hit max steps without "done", compute terminal penalties
+        if self._state.step_count >= 20 and tool != "done":
+            terminal = score_terminal(self.scenario)
+            terminal.deadline_adherence += self._state.deadlines_met * 1.0
+            reward += terminal.total
+            self._state.cumulative_reward += terminal.total
+            tool_result += "\n[Max steps reached - episode terminated]"
+        return AssistantObservation(
+            inbox_summary=self.scenario.inbox_text(),
+            calendar_view=self.scenario.calendar_text(),
+            pending_tasks=self.scenario.pending_tasks_text(),
+            tool_result=tool_result,
+            conflicts=self.scenario.conflicts_text(),
+            done=done,
+            reward=reward,
+        )
+    @property
+    def state(self) -> AssistantState:
+        return self._state
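`reset()` accepts string seeds and reduces them to an integer below 2**31. Python's built-in `hash()` on `str` is salted per interpreter process unless `PYTHONHASHSEED` is fixed, so the same string can map to different integers across runs. The sketch below shows one process-independent alternative; `stable_seed` is a hypothetical helper, and the choice of `zlib.crc32` is an assumption made for illustration (any stable digest would do).

```python
import zlib


def stable_seed(seed):
    """Map a str seed to an int in [0, 2**31); ints and None pass through unchanged."""
    if isinstance(seed, str):
        # crc32 depends only on the input bytes, so it is identical in every process.
        return zlib.crc32(seed.encode("utf-8")) % (2**31)
    return seed


print(stable_seed("monday-medium"))
print(stable_seed(42), stable_seed(None))
```

With a mapping like this, two training runs that pass the same string seed would request the same scenario from `generate_scenario`, assuming the generator itself is deterministic for a given integer seed.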

server/requirements.txt ADDED

	@@ -0,0 +1,6 @@

+openenv[core]>=0.2.0
+fastapi>=0.115.0
+uvicorn>=0.24.0

server/reward.py ADDED

	@@ -0,0 +1,207 @@

+"""Decomposed reward computation for the Executive Assistant Arena.
+All rewards are rule-based and deterministic. No LLM judges.
+Each component is logged separately for W&B tracking.
+"""
+from dataclasses import dataclass
+from .scenario_generator import Scenario, TIME_SLOTS
+@dataclass
+class RewardBreakdown:
+    conflict_resolution: float = 0.0
+    preference_inference: float = 0.0
+    email_quality: float = 0.0
+    deadline_adherence: float = 0.0
+    efficiency_penalty: float = 0.0
+    late_change_recovery: float = 0.0
+    @property
+    def total(self) -> float:
+        return (
+            self.conflict_resolution
+            + self.preference_inference
+            + self.email_quality
+            + self.deadline_adherence
+            + self.efficiency_penalty
+            + self.late_change_recovery
+        )
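A short usage sketch of `RewardBreakdown`, to show how per-step components accumulate and how `total` combines them. The dataclass is copied from the diff above; the three increments are illustrative values taken from the scoring rules in this commit (a resolved conflict, a move away from an early slot, and one unknown tool call), not output from a real episode.

```python
from dataclasses import dataclass


@dataclass
class RewardBreakdown:
    conflict_resolution: float = 0.0
    preference_inference: float = 0.0
    email_quality: float = 0.0
    deadline_adherence: float = 0.0
    efficiency_penalty: float = 0.0
    late_change_recovery: float = 0.0

    @property
    def total(self) -> float:
        # Penalties are stored as negative numbers, so a plain sum is enough.
        return (
            self.conflict_resolution
            + self.preference_inference
            + self.email_quality
            + self.deadline_adherence
            + self.efficiency_penalty
            + self.late_change_recovery
        )


breakdown = RewardBreakdown()
breakdown.conflict_resolution += 1.0   # one conflict resolved
breakdown.preference_inference += 0.5  # meeting moved away from an early slot
breakdown.efficiency_penalty -= 0.2    # one unknown tool call
print(breakdown.total)
```

Keeping the components separate means each one can be logged on its own while the agent is still trained on the single summed value.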
+def score_reschedule(
+    scenario: Scenario,
+    event_id: str,
+    new_time: str,
+    preferences: list[tuple[str, str]],
+) -> tuple[float, float, str]:
+    """Score a reschedule action. Returns (conflict_reward, pref_reward, message)."""
+    event = None
+    for e in scenario.calendar:
+        if e.event_id == event_id:
+            event = e
+            break
+    if event is None:
+        return -0.2, 0.0, f"Event {event_id} not found."
+    if not event.can_reschedule:
+        return -0.5, 0.0, f"Event {event_id} cannot be rescheduled (high priority)."
+    if new_time not in TIME_SLOTS:
+        return -0.2, 0.0, f"Invalid time slot: {new_time}."
+    # Check if this resolves a conflict
+    old_time = event.time
+    was_in_conflict = any(
+        event_id in (a, b) for a, b in scenario.conflicts
+    )
+    # Temporarily move event and check new conflicts
+    event.time = new_time
+    time_index = {t: i for i, t in enumerate(TIME_SLOTS)}
+    creates_new_conflict = False
+    for other in scenario.calendar:
+        if other.event_id == event_id:
+            continue
+        if other.time in time_index and new_time in time_index:
+            o_start = time_index[other.time]
+            n_start = time_index[new_time]
+            o_slots = other.duration_min // 30
+            e_slots = event.duration_min // 30
+            if n_start < o_start + o_slots and o_start < n_start + e_slots:
+                creates_new_conflict = True
+                break
+    conflict_reward = 0.0
+    if was_in_conflict and not creates_new_conflict:
+        conflict_reward = 1.0
+        # Remove resolved conflicts
+        scenario.conflicts = [
+            (a, b) for a, b in scenario.conflicts
+            if event_id not in (a, b)
+        ]
+        msg = f"Conflict resolved: {event_id} moved to {new_time}."
+    elif creates_new_conflict:
+        conflict_reward = -0.5
+        event.time = old_time  # revert
+        msg = f"Cannot move {event_id} to {new_time} - creates new conflict."
+    else:
+        conflict_reward = 0.0
+        msg = f"Moved {event_id} to {new_time} (no conflict impact)."
+    # Check preference alignment
+    pref_reward = 0.0
+    pref_ids = [p[0] for p in preferences]
+    if "no_early_meetings" in pref_ids and new_time in ["9:00am", "9:30am"]:
+        pref_reward -= 0.3
+        msg += " Warning: user prefers no early meetings."
+    if "lunch_block" in pref_ids and new_time in ["12:00pm", "12:30pm"]:
+        pref_reward -= 0.3
+        msg += " Warning: moved into lunch block."
+    if "no_early_meetings" in pref_ids and old_time in ["9:00am", "9:30am"] and new_time not in ["9:00am", "9:30am"]:
+        pref_reward += 0.5
+        msg += " Good: moved away from early slot per preference."
+    if "buffer_time" in pref_ids or "no_back_to_back" in pref_ids:
+        # Check adjacent meetings: start slots exactly one apart (durations are not considered)
+        n_idx = time_index.get(new_time, -1)
+        for other in scenario.calendar:
+            if other.event_id == event_id:
+                continue
+            o_idx = time_index.get(other.time, -1)
+            if o_idx < 0:
+                continue  # unknown time slot; -1 would look adjacent to index 0
+            if abs(n_idx - o_idx) == 1:
+                pref_reward -= 0.3
+                msg += " Warning: back-to-back meeting created."
+                break
+    return conflict_reward, pref_reward, msg
+def score_email_reply(
+    email_id: str,
+    reply_body: str,
+    scenario: Scenario,
+    preferences: list[tuple[str, str]],
+) -> tuple[float, float, str]:
+    """Score an email reply. Returns (email_reward, pref_reward, message)."""
+    email = None
+    for e in scenario.emails:
+        if e.email_id == email_id:
+            email = e
+            break
+    if email is None:
+        return -0.2, 0.0, f"Email {email_id} not found."
+    if not reply_body or len(reply_body.strip()) < 10:
+        return 0.0, 0.0, "Reply too short."
+    reply_lower = reply_body.lower()
+    # Score: addresses_issue (0.4)
+    addresses_score = 0.0
+    for kp in email.key_points:
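+        # Simple keyword matching: a key point counts as addressed when at least
+        # 30% of its words appear (as substrings) in the reply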
+        keywords = kp.lower().split()
+        matches = sum(1 for kw in keywords if kw in reply_lower)
+        if matches >= len(keywords) * 0.3:
+            addresses_score += 0.4 / len(email.key_points)
+    # Score: tone (0.3)
+    formal_markers = ["dear", "regards", "sincerely", "please find", "i would like to"]
+    informal_markers = ["hey", "hi!", "thanks!", "sounds good", "sure thing", "no worries"]
+    formal_count = sum(1 for m in formal_markers if m in reply_lower)
+    informal_count = sum(1 for m in informal_markers if m in reply_lower)
+    tone_score = 0.0
+    if email.tone_expected == "formal" and formal_count > informal_count:
+        tone_score = 0.3
+    elif email.tone_expected == "informal" and informal_count >= formal_count:
+        tone_score = 0.3
+    elif formal_count == 0 and informal_count == 0:
+        tone_score = 0.15  # neutral is ok
+    # Score: preference alignment (0.3)
+    pref_score = 0.0
+    pref_ids = [p[0] for p in preferences]
+    if "informal_tone" in pref_ids and informal_count > 0:
+        pref_score += 0.3
+    elif "formal_tone" in pref_ids and formal_count > 0:
+        pref_score += 0.3
+    elif "informal_tone" not in pref_ids and "formal_tone" not in pref_ids:
+        pref_score += 0.15  # no tone preference
+    email_reward = addresses_score + tone_score + pref_score
+    pref_reward = 0.0
+    if pref_score > 0:
+        pref_reward = 0.5  # preference inferred
+    msg = f"Email reply scored: addresses={addresses_score:.2f}, tone={tone_score:.2f}, pref={pref_score:.2f}"
+    return email_reward, pref_reward, msg
+def score_terminal(scenario: Scenario) -> RewardBreakdown:
+    """Compute terminal rewards at episode end."""
+    breakdown = RewardBreakdown()
+    # Deadline adherence
+    for email in scenario.emails:
+        if email.deadline and email.requires_reply:
+            breakdown.deadline_adherence -= 1.0  # missed deadline (unreplied)
+        elif email.deadline is None and email.requires_reply:
+            breakdown.deadline_adherence -= 0.5  # unreplied but no deadline
+    # Unresolved conflicts
+    remaining = len(scenario.conflicts)
+    breakdown.conflict_resolution -= remaining * 0.5
+    # Late changes: placeholder, injected changes currently add 0.0 (no bonus or penalty)
+    for lc in scenario.late_changes:
+        if lc.injected:
+            breakdown.late_change_recovery += 0.0  # no-op: handling is not tracked here
+    return breakdown
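
The key-point check in `score_email_reply` above can be exercised in isolation. The following is a minimal sketch, not part of the commit: `key_point_coverage` is a hypothetical helper name that reproduces the 30% substring-match rule and the 0.4 weight split evenly across key points.

```python
def key_point_coverage(key_points: list[str], reply_body: str, weight: float = 0.4) -> float:
    """Hypothetical helper mirroring the addresses_issue rule in score_email_reply."""
    reply_lower = reply_body.lower()
    score = 0.0
    for kp in key_points:
        keywords = kp.lower().split()
        # Substring match: a keyword counts if it appears anywhere in the reply.
        matches = sum(1 for kw in keywords if kw in reply_lower)
        # A key point is credited once at least 30% of its words are present.
        if matches >= len(keywords) * 0.3:
            score += weight / len(key_points)
    return score


if __name__ == "__main__":
    points = ["Address the q3 budget review timeline", "Confirm next steps"]
    reply = "Happy to confirm the next steps on the budget review."
    print(key_point_coverage(points, reply))
```

Because matching is by substring, common words such as "the" count toward the threshold, so a short key point can be credited by a reply that only loosely touches on it.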

server/scenario_generator.py ADDED

	@@ -0,0 +1,266 @@

+"""Procedural scenario generation for the Executive Assistant Arena."""
+import random
+from dataclasses import dataclass
+NAMES = [
+    "Alice Chen", "Bob Martinez", "Carol Park", "David Kim", "Eve Johnson",
+    "Frank Lee", "Grace Wang", "Henry Brown", "Irene Davis", "Jack Wilson",
+]
+EMAIL_SUBJECTS = [
+    "Q3 Budget Review", "Team Offsite Planning", "Client Demo Prep",
+    "Performance Review Follow-up", "Product Launch Timeline",
+    "Vendor Contract Renewal", "Board Presentation Draft", "Hiring Update",
+    "Customer Escalation", "Partnership Proposal",
+]
+MEETING_TYPES = [
+    "1:1", "team standup", "client call", "design review", "sprint planning",
+    "all-hands", "interview", "lunch meeting", "board prep", "strategy session",
+]
+TIME_SLOTS = [
+    "9:00am", "9:30am", "10:00am", "10:30am", "11:00am", "11:30am",
+    "12:00pm", "12:30pm", "1:00pm", "1:30pm", "2:00pm", "2:30pm",
+    "3:00pm", "3:30pm", "4:00pm", "4:30pm", "5:00pm",
+]
+PREFERENCES = [
+    ("no_early_meetings", "User prefers no meetings before 10am"),
+    ("lunch_block", "User always blocks 12pm-1pm for lunch"),
+    ("informal_tone", "User prefers informal/casual tone in emails"),
+    ("formal_tone", "User prefers formal/professional tone in emails"),
+    ("short_meetings", "User prefers 30-min meetings over 60-min"),
+    ("no_friday_meetings", "User avoids meetings on Fridays"),
+    ("boss_priority", "Meetings with the boss always take priority"),
+    ("client_priority", "Client meetings cannot be rescheduled"),
+    ("buffer_time", "User needs 15-min buffer between meetings"),
+    ("no_back_to_back", "User dislikes back-to-back meetings"),
+]
+LATE_CHANGES = [
+    "boss_reschedule",   # Boss moves a meeting to a conflicting time
+    "urgent_client",     # Urgent client call appears
+    "meeting_cancelled", # A meeting gets cancelled, opening a slot
+    "deadline_moved",    # A deadline moves earlier
+]
+@dataclass
+class CalendarEvent:
+    event_id: str
+    title: str
+    time: str
+    duration_min: int
+    attendees: list[str]
+    priority: str  # "high", "medium", "low"
+    can_reschedule: bool = True
+    def to_text(self) -> str:
+        att = ", ".join(self.attendees)
+        return f"[{self.event_id}] {self.time} ({self.duration_min}min) - {self.title} with {att} [priority: {self.priority}]"
+@dataclass
+class Email:
+    email_id: str
+    sender: str
+    subject: str
+    body: str
+    requires_reply: bool
+    tone_expected: str  # "formal" or "informal"
+    key_points: list[str]  # what the reply must address
+    deadline: str | None = None
+    def to_text(self) -> str:
+        dl = f" [DEADLINE: {self.deadline}]" if self.deadline else ""
+        return f"[{self.email_id}] From: {self.sender} | Subject: {self.subject}{dl}\n  {self.body}"
+@dataclass
+class LateChange:
+    change_type: str
+    description: str
+    affected_event_id: str | None
+    new_time: str | None = None
+    injected: bool = False
+@dataclass
+class Scenario:
+    calendar: list[CalendarEvent]
+    emails: list[Email]
+    preferences: list[tuple[str, str]]  # (pref_id, description)
+    late_changes: list[LateChange]
+    conflicts: list[tuple[str, str]]  # pairs of conflicting event_ids
+    difficulty: str
+    def calendar_text(self) -> str:
+        return "\n".join(e.to_text() for e in self.calendar)
+    def inbox_text(self) -> str:
+        return "\n\n".join(e.to_text() for e in self.emails)
+    def conflicts_text(self) -> list[str]:
+        return [f"CONFLICT: {a} overlaps with {b}" for a, b in self.conflicts]
+    def pending_tasks_text(self) -> list[str]:
+        tasks = []
+        for a, b in self.conflicts:
+            tasks.append(f"Resolve conflict between {a} and {b}")
+        for e in self.emails:
+            if e.requires_reply:
+                tasks.append(f"Reply to email {e.email_id} from {e.sender}")
+        return tasks
+    def inject_late_change(self) -> str | None:
+        """Inject the next un-injected late change. Returns description or None."""
+        for lc in self.late_changes:
+            if not lc.injected:
+                lc.injected = True
+                if lc.change_type == "boss_reschedule" and lc.affected_event_id:
+                    for ev in self.calendar:
+                        if ev.event_id == lc.affected_event_id and lc.new_time:
+                            ev.time = lc.new_time
+                            # This may create a new conflict
+                            self._recompute_conflicts()
+                elif lc.change_type == "meeting_cancelled" and lc.affected_event_id:
+                    self.calendar = [e for e in self.calendar if e.event_id != lc.affected_event_id]
+                    self._recompute_conflicts()
+                return lc.description
+        return None
+    def _recompute_conflicts(self):
+        """Recompute conflicts based on current calendar."""
+        time_index = {t: i for i, t in enumerate(TIME_SLOTS)}
+        self.conflicts = []
+        events = self.calendar
+        for i in range(len(events)):
+            for j in range(i + 1, len(events)):
+                a, b = events[i], events[j]
+                if a.time in time_index and b.time in time_index:
+                    a_start = time_index[a.time]
+                    b_start = time_index[b.time]
+                    a_slots = a.duration_min // 30
+                    b_slots = b.duration_min // 30
+                    if a_start < b_start + b_slots and b_start < a_start + a_slots:
+                        self.conflicts.append((a.event_id, b.event_id))
+def generate_scenario(difficulty: str = "medium", seed: int | None = None) -> Scenario:
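+    """Generate a procedural scenario with the given difficulty.
+
+    Pass `seed` for reproducible output. Any difficulty other than "easy" or
+    "medium" is treated as "hard".
+    """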
+    rng = random.Random(seed)
+    if difficulty == "easy":
+        n_events, n_conflicts, n_emails, n_prefs, n_late = 4, 2, 1, 2, 0
+    elif difficulty == "medium":
+        n_events, n_conflicts, n_emails, n_prefs, n_late = 6, 4, 3, 4, 1
+    else:  # hard
+        n_events, n_conflicts, n_emails, n_prefs, n_late = 8, 6, 5, 6, 2
+    # Generate calendar events
+    people = rng.sample(NAMES, min(n_events + n_emails, len(NAMES)))
+    meeting_types = rng.sample(MEETING_TYPES, min(n_events, len(MEETING_TYPES)))
+    # Pick time slots - intentionally create conflicts
+    available_slots = list(TIME_SLOTS)
+    events = []
+    used_slots = []
+    for i in range(n_events):
+        eid = f"mtg_{i+1}"
+        title = meeting_types[i] if i < len(meeting_types) else f"Meeting {i+1}"
+        attendee = people[i] if i < len(people) else rng.choice(NAMES)
+        duration = rng.choice([30, 60])
+        priority = rng.choice(["high", "medium", "low"])
+        can_resched = priority != "high" or rng.random() > 0.5
+        if i < n_conflicts and used_slots:
+            # Intentionally pick a conflicting time
+            time = rng.choice(used_slots)
+        else:
+            time = rng.choice(available_slots)
+        used_slots.append(time)
+        events.append(CalendarEvent(
+            event_id=eid, title=title, time=time,
+            duration_min=duration, attendees=[attendee],
+            priority=priority, can_reschedule=can_resched,
+        ))
+    # Compute actual conflicts
+    time_index = {t: i for i, t in enumerate(TIME_SLOTS)}
+    conflicts = []
+    for i in range(len(events)):
+        for j in range(i + 1, len(events)):
+            a, b = events[i], events[j]
+            if a.time in time_index and b.time in time_index:
+                a_start = time_index[a.time]
+                b_start = time_index[b.time]
+                a_slots = a.duration_min // 30
+                b_slots = b.duration_min // 30
+                if a_start < b_start + b_slots and b_start < a_start + a_slots:
+                    conflicts.append((a.event_id, b.event_id))
+    # Generate emails
+    emails = []
+    for i in range(n_emails):
+        sender = people[n_events + i] if n_events + i < len(people) else rng.choice(NAMES)
+        subject = rng.choice(EMAIL_SUBJECTS)
+        tone = rng.choice(["formal", "informal"])
+        key_points = [f"Address the {subject.lower()} timeline"]
+        if rng.random() > 0.5:
+            key_points.append("Confirm next steps")
+        deadline = rng.choice(["today", "tomorrow", None])
+        body = f"Hi, I wanted to follow up on {subject.lower()}. Could you get back to me{' by ' + deadline if deadline else ''}? Thanks, {sender}"
+        emails.append(Email(
+            email_id=f"email_{i+1}", sender=sender, subject=subject,
+            body=body, requires_reply=True, tone_expected=tone,
+            key_points=key_points, deadline=deadline,
+        ))
+    # Pick preferences
+    prefs = rng.sample(PREFERENCES, min(n_prefs, len(PREFERENCES)))
+    # Generate late changes
+    late_changes = []
+    for i in range(n_late):
+        change_type = rng.choice(LATE_CHANGES)
+        if change_type == "boss_reschedule" and events:
+            target = rng.choice(events)
+            new_time = rng.choice([t for t in TIME_SLOTS if t != target.time])
+            late_changes.append(LateChange(
+                change_type=change_type,
+                description=f"URGENT: Boss has rescheduled {target.title} ({target.event_id}) to {new_time}",
+                affected_event_id=target.event_id,
+                new_time=new_time,
+            ))
+        elif change_type == "meeting_cancelled" and events:
+            target = rng.choice(events)
+            late_changes.append(LateChange(
+                change_type=change_type,
+                description=f"CANCELLED: {target.title} ({target.event_id}) has been cancelled",
+                affected_event_id=target.event_id,
+            ))
+        elif change_type == "urgent_client":
+            time = rng.choice(TIME_SLOTS)
+            late_changes.append(LateChange(
+                change_type=change_type,
+                description=f"URGENT: New client call scheduled at {time} - must attend",
+                affected_event_id=None,
+                new_time=time,
+            ))
+        else:
+            late_changes.append(LateChange(
+                change_type="deadline_moved",
+                description="URGENT: Q3 report deadline moved to today",
+                affected_event_id=None,
+            ))
+    return Scenario(
+        calendar=events, emails=emails, preferences=prefs,
+        late_changes=late_changes, conflicts=conflicts, difficulty=difficulty,
+    )
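
The pairwise overlap test appears three times in this commit (in `score_reschedule`, `_recompute_conflicts`, and `generate_scenario`). Below is a sketch of how it could be factored into one helper, assuming the same 30-minute slot grid. `slots_overlap` and `find_conflicts` are hypothetical names, not functions in the repository, and the slot list is abbreviated for the example.

```python
TIME_SLOTS = ["9:00am", "9:30am", "10:00am", "10:30am", "11:00am"]  # abbreviated for the example
TIME_INDEX = {t: i for i, t in enumerate(TIME_SLOTS)}


def slots_overlap(a_time: str, a_duration_min: int, b_time: str, b_duration_min: int) -> bool:
    """Half-open interval test on slot indices: [start, start + duration // 30)."""
    if a_time not in TIME_INDEX or b_time not in TIME_INDEX:
        return False  # unknown slots never conflict, matching the original guards
    a_start, b_start = TIME_INDEX[a_time], TIME_INDEX[b_time]
    a_end = a_start + a_duration_min // 30
    b_end = b_start + b_duration_min // 30
    return a_start < b_end and b_start < a_end


def find_conflicts(events: list[tuple[str, str, int]]) -> list[tuple[str, str]]:
    """events are (event_id, time, duration_min); returns overlapping id pairs."""
    conflicts = []
    for i in range(len(events)):
        for j in range(i + 1, len(events)):
            (a_id, a_time, a_dur), (b_id, b_time, b_dur) = events[i], events[j]
            if slots_overlap(a_time, a_dur, b_time, b_dur):
                conflicts.append((a_id, b_id))
    return conflicts


if __name__ == "__main__":
    calendar = [("mtg_1", "9:00am", 60), ("mtg_2", "9:30am", 30), ("mtg_3", "10:00am", 30)]
    print(find_conflicts(calendar))
```

With half-open intervals, a 60-minute meeting at 9:00am and a 30-minute meeting at 10:00am touch but do not conflict, which is the behavior of the comparisons in the diff.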