Spaces: Jayant2304/commitment-os

jayantaggarwal-sketch committed on Apr 25

Commit

6762657

0 Parent(s):

CommitmentOS: temporal commitment coherence RL environment

Files changed (26)

.gitignore +13 -0
Dockerfile +25 -0
HF_README.md +77 -0
README.md +190 -0
__init__.py +9 -0
conftest.py +10 -0
constants.py +14 -0
inference.py +225 -0
models.py +87 -0
openenv.yaml +82 -0
pyproject.toml +43 -0
requirements.txt +6 -0
server/__init__.py +0 -0
server/app.py +54 -0
server/domain.py +131 -0
server/environment.py +244 -0
server/graders.py +236 -0
server/mcp.py +65 -0
server/tasks.py +616 -0
server/world.py +290 -0
tests/__init__.py +0 -0
tests/test_environment.py +523 -0
training/__init__.py +0 -0
training/env_factory.py +167 -0
training/train_grpo.py +174 -0
uv.lock +0 -0

.gitignore ADDED

	@@ -0,0 +1,13 @@

+__pycache__/
+*.pyc
+*.pyo
+.venv/
+venv/
+.env
+*.egg-info/
+dist/
+build/
+.pytest_cache/
+.ruff_cache/
+*.log
+.DS_Store

Dockerfile ADDED

	@@ -0,0 +1,25 @@

+FROM python:3.11-slim
+WORKDIR /app
+RUN apt-get update && apt-get install -y --no-install-recommends curl \
+    && rm -rf /var/lib/apt/lists/*
+COPY requirements.txt requirements.txt
+RUN pip install --no-cache-dir -r requirements.txt
+COPY constants.py ./constants.py
+COPY models.py ./models.py
+COPY __init__.py ./__init__.py
+COPY server/ ./server/
+COPY openenv.yaml ./openenv.yaml
+COPY inference.py ./inference.py
+ENV PORT=7860
+EXPOSE 7860
+HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:7860/health || exit 1
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]

HF_README.md ADDED

	@@ -0,0 +1,77 @@

+---
+title: CommitmentOS
+emoji: 📋
+colorFrom: blue
+colorTo: green
+sdk: docker
+app_port: 7860
+tags:
+  - openenv
+  - reinforcement-learning
+  - commitment-coherence
+  - personal-task-management
+  - multi-turn
+---
+# CommitmentOS: Training Temporal Commitment Coherence in LLMs
+**The first RL environment that trains LLMs to keep their promises.**
+CommitmentOS is a multi-turn personal task management environment where
+agents manage calendars, emails, and dining reservations across realistic
+scenarios. The key innovation: the agent's own prior decisions create
+binding future constraints tracked via a **commitment ledger**, and
+violations are penalised regardless of how many turns have elapsed.
+## Quick Start
+```bash
+# Reset to a scenario
+curl -X POST "https://jayant2304-commitment-os.hf.space/reset?task_id=easy_001"
+# Make a tool call
+curl -X POST "https://jayant2304-commitment-os.hf.space/step" \
+  -H "Content-Type: application/json" \
+  -d '{"action": {"action_type": "view_calendar", "date": "2026-04-25"}}'
+# Get state
+curl "https://jayant2304-commitment-os.hf.space/state"
+```
+## API Endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/reset` | POST | Start a new episode (optional: `task_id`, `difficulty`) |
+| `/step` | POST | Execute one tool call |
+| `/state` | GET | Current episode state |
+| `/health` | GET | Health check |
+| `/tasks` | GET | List all available scenarios |
+| `/mcp` | POST | MCP JSON-RPC 2.0 |
+## 15 Scenarios (5 Easy / 5 Medium / 5 Hard)
+Scenarios range from simple calendar reschedules to multi-crisis cascades
+with information asymmetry and production incidents interrupting a full day
+of commitments.
+## Reward Function (5 components)
+| Component | Weight | Signal |
+|-----------|--------|--------|
+| Constraint Satisfaction | 35% | Binary per-constraint checks |
+| Conflict Resolution | 20% | Calendar free of overlaps |
+| **Commitment Coherence** | **20%** | **Violations tracked via ledger** |
+| Communication Quality | 15% | Keyword matching on emails |
+| Step Efficiency | 10% | Fewer steps = higher score |
+## What Makes This Novel
+Existing constraint-satisfaction environments compute dependency graphs
+upfront. CommitmentOS is different: constraints **emerge from the agent's
+own decisions** as the episode unfolds. A meeting scheduled in turn 2
+becomes a binding constraint in turn 7. Breaking it without communication
+is a tracked, penalised violation.
+This is **temporal commitment coherence** — a capability no existing RL
+environment trains.

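The ledger idea described in HF_README.md above can be illustrated with a minimal, self-contained sketch. The names `LedgerEntry`, `Ledger`, `add`, `notify`, `break_commitment`, and `coherence` are hypothetical and are not taken from the repository's `server/` code; the scoring line mirrors the `(total - silent_violations) / total` formula quoted in README.md. The sketch also assumes that one notification to a person covers every commitment made to that person, which the real grader may not do.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class LedgerEntry:
    """One promise the agent made (hypothetical shape, for illustration)."""
    turn_created: int
    description: str
    to_whom: str
    active: bool = True
    silently_broken: bool = False


@dataclass
class Ledger:
    entries: List[LedgerEntry] = field(default_factory=list)
    notified: Set[str] = field(default_factory=set)

    def add(self, turn: int, description: str, to_whom: str) -> LedgerEntry:
        entry = LedgerEntry(turn, description, to_whom)
        self.entries.append(entry)
        return entry

    def notify(self, person: str) -> None:
        # Record that the affected person was told about a change.
        self.notified.add(person)

    def break_commitment(self, entry: LedgerEntry) -> None:
        # A broken promise counts as silent unless the person was notified first.
        entry.active = False
        entry.silently_broken = entry.to_whom not in self.notified

    def coherence(self) -> float:
        total = len(self.entries)
        if total == 0:
            return 1.0  # assumption: no commitments means nothing was violated
        silent = sum(1 for e in self.entries if e.silently_broken)
        return (total - silent) / total


ledger = Ledger()
demo = ledger.add(turn=2, description="14:00 demo", to_whom="Client_Jones")
sync = ledger.add(turn=3, description="16:00 sync", to_whom="VP_Chen")

ledger.notify("Client_Jones")    # turn 6: email sent before the change
ledger.break_commitment(demo)    # turn 7: renegotiated, so not silent
ledger.break_commitment(sync)    # turn 7: dropped without any email

score = ledger.coherence()       # one silent violation out of two commitments
```

The point of the sketch is that the penalty depends on the order of actions, not on how many turns separate the promise from the change.
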
README.md ADDED

	@@ -0,0 +1,190 @@

+# CommitmentOS: Training Temporal Commitment Coherence in LLMs
+> *The first RL environment that trains LLMs to keep their promises.*
+**Innovation claim**: The first RL environment for training temporal commitment coherence — where the agent's own prior decisions create binding future constraints, tracked and penalised across multi-turn episodes.
+**Theme**: Primary 3.2 (Personal Tasks) + Secondary Theme 2 (Long-Horizon Planning)
+---
+## Architecture
+```
+┌──────────────── Client ────────────────┐     ┌────────────── CommitmentOS Server ──────────────┐
+│                                        │     │                                                 │
+│  inference.py ──HTTP──▶ POST /reset    │────▶│  FastAPI App                                    │
+│  (LLM agent)    HTTP──▶ POST /step     │     │    │                                            │
+│                 HTTP──▶ GET  /state    │     │    ▼                                            │
+│                                        │     │  CommitmentEnvironment                          │
+│  train_grpo.py                         │     │    ├── WorldState (calendar, contacts,          │
+│  (GRPO+TRL)                            │     │    │   restaurants, inbox)                      │
+│                                        │     │    ├── CommitmentLedger (tracks promises)       │
+│                                        │     │    └── Grader (5-component reward)              │
+└────────────────────────────────────────┘     └─────────────────────────────────────────────────┘
+```
+## Why CommitmentOS is Novel
+Existing constraint-satisfaction environments (GAP, LGC-MARL, NeMo Gym, PEARL) compute dependency graphs **upfront**. CommitmentOS is fundamentally different:
+- **Constraints emerge from the agent's own decisions** as the episode unfolds
+- A meeting scheduled in turn 2 becomes a **binding constraint** in turn 7
+- Breaking it without communication is a **tracked, penalised violation**
+- The commitment ledger persists across the full episode — the agent must remember what it promised
+This is **temporal commitment coherence** — a capability no existing RL environment trains.
+---
+## Quick Start
+### Local Development
+```bash
+cd commitment_os
+# Create virtual environment
+python3 -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+# Start server
+uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
+# Run tests
+pip install pytest httpx
+pytest tests/ -v
+```
+### Docker
+```bash
+docker build -t commitment-os .
+docker run -p 7860:7860 commitment-os
+```
+### API Usage
+```bash
+# Reset to a scenario
+curl -X POST "http://localhost:7860/reset?task_id=easy_001"
+# Make a tool call (multi-turn — one per step)
+curl -X POST "http://localhost:7860/step" \
+  -H "Content-Type: application/json" \
+  -d '{"action": {"action_type": "view_calendar", "date": "2026-04-25"}}'
+# Get state
+curl "http://localhost:7860/state"
+# List all scenarios
+curl "http://localhost:7860/tasks"
+```
+---
+## Reward Function (5 Components)
+| Component | Weight | How it's Measured |
+|-----------|--------|-------------------|
+| **Constraint Satisfaction** | 35% | Binary per-constraint checks |
+| **Conflict Resolution** | 20% | Final calendar free of overlapping events |
+| **Commitment Coherence** | 20% | `(total - silent_violations) / total` from ledger |
+| **Communication Quality** | 15% | Keyword matching on sent emails |
+| **Step Efficiency** | 10% | `max(0, 1 - (steps - optimal) × 0.1)` |
+**Example** (easy_001 — perfect run):
+```
+constraints: 3/3 met         → 0.35 × 1.0 = 0.350
+conflicts:   0 overlaps      → 0.20 × 1.0 = 0.200
+commitments: 1 honored       → 0.20 × 1.0 = 0.200
+emails:      Team notified   → 0.15 × 1.0 = 0.150
+efficiency:  3 steps (opt 3) → 0.10 × 1.0 = 0.100
+─────────────────────────────────────────────
+total = 1.000 → 0.99 (clamped to [0.01, 0.99])
+```
+---
+## 15 Scenarios
+### Easy (2-4 steps)
+| ID | Description |
+|----|-------------|
+| easy_001 | Double-booked meetings — reschedule by priority |
+| easy_002 | Book dinner with cuisine/price/distance constraints |
+| easy_003 | Check availability and propose meeting slots |
+| easy_004 | Cancel conflicting work meeting for personal appointment |
+| easy_005 | Triage inbox by urgency priority |
+### Medium (5-8 steps)
+| ID | Description |
+|----|-------------|
+| med_006 | Cascading reschedule chain (A→B→C dependency) |
+| med_007 | Team dinner with 3 dietary + distance + budget constraints |
+| med_008 | Boss's urgent request during client call (commitment conflict) |
+| med_009 | Disambiguate vague "push our thing" across 3 recurring meetings |
+| med_010 | Client visit: conference room + lunch + itinerary |
+### Hard (8-15 steps)
+| ID | Description |
+|----|-------------|
+| hard_011 | VP investor dinner: cascade, restaurant, multi-party notification |
+| hard_012 | Triple conference room conflict with diplomatic resolution |
+| hard_013 | Triple crisis: cancelled flight + moved board prep + lost reservation |
+| hard_014 | Information asymmetry — schedule without revealing confidential reasons |
+| hard_015 | **SRE Crisis** — production incident interrupts day of commitments |
+---
+## Training
+### GRPO + TRL + LoRA
+```bash
+pip install trl transformers peft datasets torch
+python training/train_grpo.py \
+  --model Qwen/Qwen2.5-1.5B-Instruct \
+  --epochs 2 \
+  --lr 5e-6 \
+  --lora_rank 16 \
+  --batch_size 4
+```
+**What improves with training:**
+- Constraint satisfaction score ↑
+- Commitment violation rate ↓
+- Steps per episode ↓
+- Communication quality ↑
+---
+## Submission Compliance
+| Requirement | Status |
+|-------------|--------|
+| reset() / step() / state() | ✅ |
+| openenv.yaml with 15 tasks | ✅ |
+| Programmatic graders, scores ∈ (0, 1) | ✅ |
+| inference.py at root using openai client | ✅ |
+| [START]/[STEP]/[END] log format | ✅ |
+| API_BASE_URL / MODEL_NAME / HF_TOKEN from env | ✅ |
+| Dockerfile builds and responds to /reset | ✅ |
+| pyproject.toml with [project.scripts] | ✅ |
+| uv.lock generated | ✅ |
+| server/app.py main() with if __name__ | ✅ |
+---
+## Story Hook
+> "Every AI assistant today can schedule one meeting. But your real life is never one meeting. CommitmentOS trains AI to juggle the chaos — and penalises it when it breaks its own promises."
+**Connection to Round 1**: In Round 1, we trained agents to diagnose production incidents. In Round 2, we asked: *what happens when that incident interrupts a day full of commitments?* CommitmentOS was born. Hard scenario `hard_015` directly reuses SRE incident data from Round 1.
+---
+## License
+MIT

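The reward table and the worked `easy_001` example in README.md can be reproduced with a short sketch. The weights, the efficiency formula, and the `[0.01, 0.99]` clamp come from the README; the function names `step_efficiency` and `total_reward`, the component keys, and the upper cap on efficiency are assumptions made for illustration and may differ from `server/graders.py`.

```python
from typing import Dict

# Weights as listed in the README reward table.
WEIGHTS: Dict[str, float] = {
    "constraints": 0.35,
    "conflicts": 0.20,
    "commitments": 0.20,
    "communication": 0.15,
    "efficiency": 0.10,
}


def step_efficiency(steps: int, optimal: int) -> float:
    # README formula: max(0, 1 - (steps - optimal) * 0.1).
    # The min(1.0, ...) cap is an added assumption so that finishing in
    # fewer than `optimal` steps cannot score above 1.
    return min(1.0, max(0.0, 1.0 - (steps - optimal) * 0.1))


def total_reward(components: Dict[str, float]) -> float:
    raw = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    # Clamp to [0.01, 0.99], as in the README example.
    return max(0.01, min(0.99, raw))


# Perfect easy_001 run from the README: every component scores 1.0.
perfect = {
    "constraints": 3 / 3,
    "conflicts": 1.0,
    "commitments": 1.0,
    "communication": 1.0,
    "efficiency": step_efficiency(steps=3, optimal=3),
}
perfect_total = total_reward(perfect)

# A weaker run: three extra steps and half of the commitments broken silently.
slow = dict(perfect, commitments=0.5, efficiency=step_efficiency(steps=6, optimal=3))
slow_total = total_reward(slow)
```

Under these assumptions the perfect run sums to 1.0 before clamping and is reported as 0.99, while the weaker run lands at roughly 0.87.
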
__init__.py ADDED

	@@ -0,0 +1,9 @@

+"""CommitmentOS — Temporal Commitment Coherence RL Environment."""
+from models import CommitmentAction, CommitmentObservation, CommitmentState
+__all__ = [
+    "CommitmentAction",
+    "CommitmentObservation",
+    "CommitmentState",
+]

conftest.py ADDED

	@@ -0,0 +1,10 @@

+"""Pytest configuration — ensures the project root is on sys.path for all tests."""
+from __future__ import annotations
+import sys
+from pathlib import Path
+PROJECT_ROOT = str(Path(__file__).resolve().parent)
+if PROJECT_ROOT not in sys.path:
+    sys.path.insert(0, PROJECT_ROOT)

constants.py ADDED

	@@ -0,0 +1,14 @@

+"""Project-wide constants — single source of truth for version and metadata."""
+from __future__ import annotations
+VERSION = "0.1.0"
+PROJECT_NAME = "commitment-os"
+PROJECT_DESCRIPTION = (
+    "CommitmentOS: the first RL environment that trains temporal commitment "
+    "coherence in LLMs. Agents manage a simulated personal world (calendar, "
+    "email, restaurants, contacts) across multi-turn episodes where their own "
+    "prior decisions create binding constraints tracked and penalised via a "
+    "commitment ledger."
+)
+AUTHOR = "Jayant Aggarwal"

inference.py ADDED

	@@ -0,0 +1,225 @@

+"""Baseline inference script for CommitmentOS.
+Uses an OpenAI-compatible LLM to play through all 15 scenarios.
+Multi-turn: the agent gets the briefing, makes tool calls, then submits.
+Required environment variables:
+  API_BASE_URL  — OpenAI-compatible endpoint
+  MODEL_NAME    — model identifier
+  HF_TOKEN      — API key (also checked as OPENAI_API_KEY)
+  ENV_BASE_URL  — CommitmentOS server URL (default: HF Space)
+"""
+from __future__ import annotations
+import json
+import os
+import sys
+import time
+from typing import Any, Dict, List
+import requests
+from openai import OpenAI
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+API_BASE_URL = os.getenv("API_BASE_URL", "https://api.openai.com/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "gpt-4o-mini")
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or ""
+ENV_BASE_URL = os.getenv("ENV_BASE_URL", "https://jayant2304-commitment-os.hf.space")
+MAX_STEPS = 12
+SYSTEM_PROMPT = """You are an expert executive assistant AI. You manage calendars, emails, and dining reservations.
+You will be given a scenario briefing describing a situation with calendar conflicts, emails, or planning tasks.
+For each turn, you must respond with EXACTLY ONE JSON object choosing a tool to call:
+Available tools:
+- {"action_type": "view_calendar", "date": "2026-04-25"}
+- {"action_type": "check_availability", "person": "Client_Jones"}
+- {"action_type": "search_restaurants", "cuisine": "Italian", "max_price": 50, "dietary": "vegetarian", "max_distance_miles": 3.0, "near_airport": false}
+- {"action_type": "schedule_meeting", "title": "Demo", "date": "2026-04-25", "time": "14:00", "duration_min": 60, "participants": ["Client_Jones"], "location": "Room A"}
+- {"action_type": "reschedule_event", "event_id": "evt_1", "new_time": "15:00"}
+- {"action_type": "cancel_event", "event_id": "evt_1"}
+- {"action_type": "send_email", "to": "VP_Chen", "subject": "Meeting update", "body": "Hi, I need to reschedule..."}
+- {"action_type": "book_restaurant", "restaurant_name": "Sky Lounge"}
+- {"action_type": "submit_plan"}
+IMPORTANT RULES:
+1. Respond with ONLY a JSON object, no markdown, no explanation
+2. Handle higher-priority items before lower-priority ones
+3. When cancelling or rescheduling commitments, ALWAYS send an email to affected parties BEFORE submitting
+4. Call submit_plan when you have resolved all issues
+5. Never silently drop a commitment — always notify the affected person"""
+# ---------------------------------------------------------------------------
+# Logging helpers — exact format required by hackathon evaluator
+# ---------------------------------------------------------------------------
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: str | None = None) -> None:
+    err = error if error else "null"
+    print(f"[STEP] step={step} action={action} reward={reward:.2f} done={'true' if done else 'false'} error={err}", flush=True)
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(f"[END] success={'true' if success else 'false'} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
+# ---------------------------------------------------------------------------
+# Environment interaction
+# ---------------------------------------------------------------------------
+def env_reset(task_id: str) -> Dict[str, Any]:
+    resp = requests.post(f"{ENV_BASE_URL}/reset", params={"task_id": task_id}, timeout=30)
+    resp.raise_for_status()
+    data = resp.json()
+    return data.get("observation", data)
+def env_step(action: Dict[str, Any]) -> Dict[str, Any]:
+    resp = requests.post(f"{ENV_BASE_URL}/step", json={"action": action}, timeout=30)
+    resp.raise_for_status()
+    data = resp.json()
+    obs = data.get("observation", data)
+    obs["done"] = data.get("done", obs.get("done", False))
+    obs["reward"] = data.get("reward", obs.get("reward", 0.0))
+    return obs
+def get_task_ids() -> List[str]:
+    resp = requests.get(f"{ENV_BASE_URL}/tasks", timeout=30)
+    resp.raise_for_status()
+    data = resp.json()
+    ids: List[str] = []
+    for difficulty in ["easy", "medium", "hard"]:
+        ids.extend(data.get(difficulty, []))
+    return ids
+# ---------------------------------------------------------------------------
+# LLM call
+# ---------------------------------------------------------------------------
+def call_llm(client: OpenAI, messages: List[Dict[str, str]]) -> str:
+    response = client.chat.completions.create(
+        model=MODEL_NAME,
+        messages=messages,
+        temperature=0.2,
+        max_tokens=512,
+        stream=False,
+    )
+    # message.content can be None (e.g. on a refusal or empty completion).
+    return (response.choices[0].message.content or "").strip()
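+def parse_action(text: str) -> Dict[str, Any]:
+    """Parse the model's reply into an action dict; fall back to submit_plan."""
+    text = text.strip()
+    if text.startswith("```"):
+        # Strip a surrounding Markdown code fence, keeping only the inner lines.
+        lines = text.split("\n")
+        text = "\n".join(lines[1:-1]) if len(lines) > 2 else lines[0]
+    try:
+        action = json.loads(text)
+    except json.JSONDecodeError:
+        return {"action_type": "submit_plan"}
+    # Valid JSON that is not an object (list, string, number) is not a tool call.
+    return action if isinstance(action, dict) else {"action_type": "submit_plan"}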
+# ---------------------------------------------------------------------------
+# Run one task
+# ---------------------------------------------------------------------------
+def run_task(client: OpenAI, task_id: str) -> Dict[str, Any]:
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.01
+    success = False
+    try:
+        obs = env_reset(task_id)
+        log_start(task=task_id, env="commitment-os", model=MODEL_NAME)
+        briefing = obs.get("briefing", "")
+        calendar = json.dumps(obs.get("calendar_snapshot", []), indent=2)
+        inbox = json.dumps(obs.get("inbox", []), indent=2)
+        messages: List[Dict[str, str]] = [
+            {"role": "system", "content": SYSTEM_PROMPT},
+            {"role": "user", "content": f"SCENARIO: {briefing}\n\nCALENDAR:\n{calendar}\n\nINBOX:\n{inbox}\n\nWhat is your first action?"},
+        ]
+        for step_num in range(1, MAX_STEPS + 1):
+            llm_output = call_llm(client, messages)
+            action = parse_action(llm_output)
+            step_data = env_step(action)
+            reward = float(step_data.get("reward", 0.0) or 0.0)
+            done = step_data.get("done", False)
+            steps_taken = step_num
+            rewards.append(reward)
+            action_str = json.dumps(action, separators=(",", ":"))
+            log_step(step=step_num, action=action_str, reward=reward, done=done)
+            if done:
+                score = max(0.01, min(0.99, reward))
+                success = score > 0.01
+                break
+            tool_result = step_data.get("tool_result", "")
+            messages.append({"role": "assistant", "content": llm_output})
+            messages.append({"role": "user", "content": f"TOOL RESULT: {tool_result}\n\nWhat is your next action?"})
+        if not done:
+            step_data = env_step({"action_type": "submit_plan"})
+            reward = float(step_data.get("reward", 0.0) or 0.0)
+            steps_taken += 1
+            rewards.append(reward)
+            score = max(0.01, min(0.99, reward))
+            success = score > 0.01
+            log_step(step=steps_taken, action='{"action_type":"submit_plan"}', reward=reward, done=True)
+    except Exception as exc:
+        steps_taken = max(steps_taken, 1)
+        if not rewards:
+            rewards.append(0.01)
+        log_step(step=steps_taken, action="error", reward=0.01, done=True, error=str(exc))
+    finally:
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+    return {"task_id": task_id, "reward": score, "success": success}
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+def main() -> None:
+    if not API_KEY:
+        print("ERROR: Set HF_TOKEN or OPENAI_API_KEY environment variable", file=sys.stderr)
+        sys.exit(1)
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    task_ids = get_task_ids()
+    results: List[Dict[str, Any]] = []
+    for tid in task_ids:
+        result = run_task(client, tid)
+        results.append(result)
+    total = len(results)
+    successes = sum(1 for r in results if r["success"])
+    mean_reward = sum(r["reward"] for r in results) / total if total > 0 else 0.0
+    print(f"\n# Summary: {successes}/{total} tasks succeeded, mean_reward={mean_reward:.3f}", flush=True)
+if __name__ == "__main__":
+    main()

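`parse_action` in inference.py expects the reply to be either bare JSON or JSON wrapped in a code fence. A more tolerant alternative, sketched below, scans for the first JSON object that carries an `action_type` key, which also covers replies with extra prose around the JSON. The helper name `extract_action` and the `FALLBACK` constant are hypothetical and not part of the repository; `json.JSONDecoder.raw_decode` is the standard-library call doing the work.

```python
import json
from typing import Any, Dict

FALLBACK: Dict[str, Any] = {"action_type": "submit_plan"}


def extract_action(text: str) -> Dict[str, Any]:
    """Return the first JSON object with an action_type found in `text`."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "action_type" in obj:
            return obj
        start = text.find("{", start + 1)
    # Nothing usable found: same fallback as parse_action.
    return dict(FALLBACK)


fence = "`" * 3  # built this way to avoid a literal fence inside this example
reply = f'{fence}json\n{{"action_type": "view_calendar", "date": "2026-04-25"}}\n{fence}'
action = extract_action(reply)

chatty = 'Sure! {"action_type": "cancel_event", "event_id": "evt_1"} Done.'
chatty_action = extract_action(chatty)
```

One trade-off: because the scan retries at every `{`, an object nested inside an unrelated outer object can be picked up, so this is looser than strict parsing.
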
models.py ADDED

	@@ -0,0 +1,87 @@

+"""API-facing Pydantic models — the public contract of CommitmentOS."""
+from __future__ import annotations
+from typing import Any, Dict, List
+from pydantic import Field
+from openenv.core.env_server import Action, Observation, State
+class CommitmentAction(Action):
+    """Agent's tool call submitted via POST /step.
+    Each step is one tool invocation. The agent fills ``action_type`` and
+    the relevant subset of optional parameters for that tool.
+    """
+    action_type: str = Field(
+        ...,
+        description=(
+            "Tool to invoke: 'view_calendar' | 'check_availability' | "
+            "'search_restaurants' | 'schedule_meeting' | 'reschedule_event' | "
+            "'cancel_event' | 'send_email' | 'book_restaurant' | 'submit_plan'"
+        ),
+    )
+    # calendar operations
+    date: str = Field("", description="ISO date for calendar queries (yyyy-mm-dd)")
+    event_id: str = Field("", description="Event ID for reschedule / cancel")
+    new_time: str = Field("", description="New start time HH:MM for reschedule")
+    title: str = Field("", description="Title for new meetings")
+    participants: List[str] = Field(default_factory=list, description="Attendee names")
+    time: str = Field("", description="Start time HH:MM for new meetings")
+    duration_min: int = Field(60, description="Meeting duration in minutes")
+    location: str = Field("", description="Room or location")
+    # contact queries
+    person: str = Field("", description="Contact name for availability check")
+    # restaurant search
+    cuisine: str = Field("", description="Cuisine filter")
+    max_price: int = Field(0, description="Max price per person (0 = no limit)")
+    dietary: str = Field("", description="Dietary requirement filter")
+    max_distance_miles: float = Field(0.0, description="Max distance (0 = no limit)")
+    near_airport: bool = Field(False, description="Filter for airport proximity")
+    restaurant_name: str = Field("", description="Specific restaurant to book")
+    # email
+    to: str = Field("", description="Recipient name for send_email")
+    subject: str = Field("", description="Email subject line")
+    body: str = Field("", description="Email body text")
+class CommitmentObservation(Observation):
+    """Observation from reset() and step(). Inherits ``done``, ``reward``."""
+    scenario_id: str = Field(default="", description="Current scenario identifier")
+    difficulty: str = Field(default="", description="easy | medium | hard")
+    briefing: str = Field(default="", description="Scenario description shown on reset")
+    tool_result: str = Field(default="", description="Output of the last tool call")
+    calendar_snapshot: List[Dict[str, Any]] = Field(
+        default_factory=list, description="Current calendar events",
+    )
+    inbox: List[Dict[str, Any]] = Field(
+        default_factory=list, description="Unread inbox emails",
+    )
+    pending_commitments: int = Field(0, description="Number of active commitments in ledger")
+    step_number: int = Field(0, description="Current step within this episode")
+    max_steps: int = Field(15, description="Maximum steps before forced submission")
+    reward_breakdown: Dict[str, float] = Field(
+        default_factory=dict, description="Per-component reward scores",
+    )
+    feedback: str = Field(default="", description="Human-readable grader feedback")
+class CommitmentState(State):
+    """Episode metadata from GET /state."""
+    scenario_id: str = Field(default="", description="Current scenario identifier")
+    difficulty: str = Field(default="", description="Current difficulty level")
+    completed: bool = Field(default=False, description="Whether episode is finished")
+    cumulative_reward: float = Field(default=0.0, description="Sum of rewards this episode")
+    commitment_count: int = Field(default=0, description="Total commitments created")
+    violation_count: int = Field(default=0, description="Silent commitment violations")
+    available_tasks: List[str] = Field(
+        default_factory=list, description="All scenario IDs in the dataset",
+    )

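Since `CommitmentAction` carries every tool's parameters as optional fields, a client only needs to send `action_type` plus the subset relevant to that tool. The sketch below builds such a request body with the standard library only. The helper `build_step_payload` and the `KNOWN_TOOLS` set are hypothetical; the tool names are the ones listed in the inference.py system prompt and in openenv.yaml, and the outer `"action"` key follows the README curl example and `env_step` in inference.py.

```python
from typing import Any, Dict

# Tool names as listed in the inference.py system prompt and openenv.yaml.
KNOWN_TOOLS = frozenset({
    "view_calendar", "check_availability", "search_restaurants",
    "schedule_meeting", "reschedule_event", "cancel_event",
    "send_email", "book_restaurant", "submit_plan",
})


def build_step_payload(action_type: str, **params: Any) -> Dict[str, Any]:
    """Build the JSON body for POST /step (hypothetical helper)."""
    if action_type not in KNOWN_TOOLS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    # Drop unset parameters so only the relevant subset is sent;
    # the server-side model supplies defaults for the rest.
    action: Dict[str, Any] = {"action_type": action_type}
    action.update({k: v for k, v in params.items() if v is not None})
    # Wrap the action under an "action" key, as inference.py does.
    return {"action": action}


payload = build_step_payload("reschedule_event", event_id="evt_1", new_time="15:00")
```

The resulting dictionary can be passed as the `json=` argument of an HTTP client call to `/step`.
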
openenv.yaml ADDED

	@@ -0,0 +1,82 @@

+spec_version: 1
+name: commitment-os
+description: >
+  CommitmentOS: the first RL environment that trains temporal commitment
+  coherence in LLMs. Multi-turn episodes where agents manage calendar,
+  email, and dining across scenarios where their own decisions create
+  binding constraints tracked via a commitment ledger.
+author: Jayant Aggarwal
+version: 0.1.0
+action_model: CommitmentAction
+observation_model: CommitmentObservation
+state_model: CommitmentState
+endpoints:
+  reset: POST /reset
+  step: POST /step
+  state: GET /state
+  health: GET /health
+  metadata: GET /metadata
+  schema: GET /schema
+  mcp: POST /mcp
+tasks:
+  - name: easy_001
+    difficulty: easy
+    description: Resolve double-booked meetings by priority and notify team
+  - name: easy_002
+    difficulty: easy
+    description: Book dinner with cuisine, price, and distance constraints
+  - name: easy_003
+    difficulty: easy
+    description: Check availability and propose meeting slots to client via email
+  - name: easy_004
+    difficulty: easy
+    description: Cancel conflicting work meeting for personal appointment and notify
+  - name: easy_005
+    difficulty: easy
+    description: Triage inbox by urgency and respond to critical emails first
+  - name: med_006
+    difficulty: medium
+    description: Resolve cascading reschedule chain across 3 dependent meetings
+  - name: med_007
+    difficulty: medium
+    description: Plan team dinner with 3 dietary restrictions and multi-constraint search
+  - name: med_008
+    difficulty: medium
+    description: Handle urgent boss request while in a client call without abandoning commitments
+  - name: med_009
+    difficulty: medium
+    description: Disambiguate vague reschedule request across 3 recurring meetings
+  - name: med_010
+    difficulty: medium
+    description: Plan client visit with conference room, lunch, and itinerary dependencies
+  - name: hard_011
+    difficulty: hard
+    description: VP investor dinner with calendar cascade, restaurant constraints, and multi-party notifications
+  - name: hard_012
+    difficulty: hard
+    description: Resolve triple conference room conflict with diplomatic priority-based emails
+  - name: hard_013
+    difficulty: hard
+    description: Triple crisis recovery — cancelled flight, moved board prep, lost restaurant
+  - name: hard_014
+    difficulty: hard
+    description: Navigate information asymmetry — schedule meeting without revealing confidential constraints
+  - name: hard_015
+    difficulty: hard
+    description: Production incident interrupts day of commitments — triage, renegotiate, notify all parties
+observation_space:
+  description: >
+    Current scenario context including calendar snapshot, inbox messages,
+    tool call results, commitment count, step number, reward breakdown,
+    and grader feedback.
+action_space:
+  description: >
+    Single tool invocation per step. Agent selects action_type (view_calendar,
+    check_availability, search_restaurants, schedule_meeting, reschedule_event,
+    cancel_event, send_email, book_restaurant, submit_plan) and fills relevant
+    parameters. Episodes are multi-turn with 2-15 steps per scenario.

pyproject.toml ADDED

	@@ -0,0 +1,43 @@

+[build-system]
+requires = ["setuptools>=68.0", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "commitment-os"
+version = "0.1.0"
+description = "CommitmentOS: the first RL environment that trains temporal commitment coherence in LLMs"
+requires-python = ">=3.10"
+license = {text = "MIT"}
+authors = [
+    {name = "Jayant Aggarwal"},
+]
+dependencies = [
+    "openenv-core>=0.2.0",
+    "fastapi>=0.110.0",
+    "uvicorn[standard]>=0.29.0",
+    "pydantic>=2.0.0",
+    "python-dotenv>=1.0.0",
+]
+[project.scripts]
+server = "server.app:main"
+[project.optional-dependencies]
+inference = [
+    "openai>=1.0.0",
+    "requests>=2.31.0",
+]
+dev = [
+    "pytest>=8.0.0",
+    "httpx>=0.27.0",
+    "openai>=1.0.0",
+    "requests>=2.31.0",
+]
+training = [
+    "trl>=0.14.0",
+    "transformers>=4.45.0",
+    "torch>=2.0.0",
+    "peft>=0.14.0",
+    "datasets>=3.0.0",
+]

requirements.txt ADDED

	@@ -0,0 +1,6 @@

+# Docker build dependencies — must stay in sync with pyproject.toml [project.dependencies]
+openenv-core>=0.2.0
+fastapi>=0.110.0
+uvicorn[standard]>=0.29.0
+pydantic>=2.0.0
+python-dotenv>=1.0.0

server/__init__.py ADDED

File without changes

server/app.py ADDED

	@@ -0,0 +1,54 @@

+"""FastAPI composition root — wires environment, MCP, and custom endpoints."""
+from __future__ import annotations
+import os
+from openenv.core.env_server import create_fastapi_app
+from constants import PROJECT_DESCRIPTION, VERSION
+from models import CommitmentAction, CommitmentObservation, CommitmentState
+from server.environment import CommitmentEnvironment
+from server.mcp import router as mcp_router
+from server.tasks import get_scenario_ids_grouped
+_shared_env = CommitmentEnvironment()
+app = create_fastapi_app(
+    env=lambda: _shared_env,
+    action_cls=CommitmentAction,
+    observation_cls=CommitmentObservation,
+)
+app.title = "CommitmentOS"
+app.description = PROJECT_DESCRIPTION
+app.version = VERSION
+app.routes[:] = [
+    r for r in app.routes
+    if not (hasattr(r, "path") and r.path in ("/state", "/mcp"))
+]
+@app.get("/state", response_model=CommitmentState)
+def get_state() -> CommitmentState:
+    return _shared_env.state
+@app.get("/tasks")
+def list_tasks() -> dict[str, list[str]]:
+    return get_scenario_ids_grouped()
+app.include_router(mcp_router)
+def main() -> None:
+    import uvicorn
+    port = int(os.environ.get("PORT", 7860))
+    uvicorn.run(app, host="0.0.0.0", port=port)
+if __name__ == "__main__":
+    main()

server/domain.py ADDED

	@@ -0,0 +1,131 @@

+"""Internal domain types — not exposed via the HTTP API."""
+from __future__ import annotations
+from dataclasses import dataclass, field
+from typing import Any, Dict, List, Optional
+from pydantic import BaseModel, Field
+# ---------------------------------------------------------------------------
+# Commitment ledger entry
+# ---------------------------------------------------------------------------
+@dataclass
+class Commitment:
+    """A binding constraint the agent created via its own actions."""
+    turn_created: int
+    commitment_type: str       # "meeting_scheduled" | "email_promise" | "reservation_made"
+    description: str           # human-readable: "3pm meeting with Client X"
+    constraint: str            # machine key: "2026-04-25T15:00"
+    to_whom: str               # who was promised
+    active: bool = True
+    renegotiated_at: Optional[int] = None
+# ---------------------------------------------------------------------------
+# Scenario / task definition
+# ---------------------------------------------------------------------------
+class CalendarEvent(BaseModel):
+    """A single calendar entry."""
+    event_id: str = Field(..., description="Unique event identifier")
+    title: str = Field(..., description="Event title")
+    date: str = Field(..., description="ISO date yyyy-mm-dd")
+    time: str = Field(..., description="Start time HH:MM")
+    duration_min: int = Field(60, description="Duration in minutes")
+    participants: List[str] = Field(default_factory=list)
+    location: str = Field("", description="Room or location name")
+    priority: str = Field("normal", description="low | normal | high | critical")
+    is_personal: bool = Field(False, description="Personal vs work event")
+class Contact(BaseModel):
+    """A person the agent can interact with."""
+    name: str
+    role: str = ""
+    email: str = ""
+    priority_level: int = Field(1, description="1 (lowest) to 5 (highest)")
+    availability: Dict[str, List[str]] = Field(
+        default_factory=dict,
+        description="date -> list of free time slots e.g. {'2026-04-25': ['09:00','10:00','14:00']}",
+    )
+    dietary: str = Field("", description="Dietary restrictions if any")
+class Restaurant(BaseModel):
+    """A restaurant option the agent can search/book."""
+    name: str
+    cuisine: str
+    price_per_person: int
+    distance_miles: float
+    dietary_options: List[str] = Field(default_factory=list)
+    capacity: int = 20
+    hours: str = "11:00-22:00"
+    has_private_room: bool = False
+    near_airport: bool = False
+class InboxEmail(BaseModel):
+    """An email in the agent's inbox."""
+    email_id: str
+    sender: str
+    subject: str
+    body: str
+    urgency: str = Field("normal", description="low | normal | high | critical")
+    received_at: str = Field("", description="ISO datetime")
+    requires_response: bool = True
+    context_hint: str = Field("", description="Hidden hint for grader about what the correct action is")
+class ConstraintDef(BaseModel):
+    """A single verifiable constraint for grading."""
+    description: str = Field(..., description="Human-readable: 'Restaurant must have vegan options'")
+    check_type: str = Field(..., description="'calendar_no_conflict' | 'restaurant_match' | 'email_sent' | 'event_exists' | 'event_cancelled' | 'priority_order'")
+    check_params: Dict[str, Any] = Field(default_factory=dict)
+class CommunicationReq(BaseModel):
+    """A required outgoing communication for grading."""
+    to: str = Field(..., description="Recipient name")
+    required_keywords: List[str] = Field(default_factory=list, description="Keywords that should appear")
+    purpose: str = Field("", description="'notify_reschedule' | 'propose_alternative' | 'propose_slots' | 'acknowledge' | 'renegotiate'")
+class ScenarioDef(BaseModel):
+    """Complete definition of a single task scenario."""
+    scenario_id: str
+    difficulty: str = Field(..., description="easy | medium | hard")
+    briefing: str = Field(..., description="The scenario description the agent sees on reset")
+    initial_calendar: List[CalendarEvent] = Field(default_factory=list)
+    initial_inbox: List[InboxEmail] = Field(default_factory=list)
+    available_restaurants: List[Restaurant] = Field(default_factory=list)
+    contacts: List[Contact] = Field(default_factory=list)
+    constraints: List[ConstraintDef] = Field(default_factory=list)
+    priority_ordering: List[str] = Field(
+        default_factory=list,
+        description="Ordered list from highest to lowest priority contact/event",
+    )
+    communication_requirements: List[CommunicationReq] = Field(default_factory=list)
+    optimal_steps: int = Field(3, description="Minimum steps to solve perfectly")
+    max_steps: int = Field(15, description="Maximum allowed steps before timeout")
+    # ground-truth for grading
+    expected_final_events: List[str] = Field(
+        default_factory=list,
+        description="Event IDs that should exist in final calendar",
+    )
+    expected_cancelled_events: List[str] = Field(
+        default_factory=list,
+        description="Event IDs that should be cancelled",
+    )
+    expected_restaurant: str = Field("", description="Name of the correct restaurant pick")
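
To illustrate how a `Commitment` ledger entry from `server/domain.py` might be used, here is a small sketch that redeclares the dataclass with the same fields and stamps an entry as renegotiated. The `renegotiate` helper is hypothetical; the real bookkeeping lives in `server/world.py`, which this sketch does not reproduce.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Commitment:
    """Same fields as the ledger entry defined in server/domain.py."""
    turn_created: int
    commitment_type: str
    description: str
    constraint: str
    to_whom: str
    active: bool = True
    renegotiated_at: Optional[int] = None


def renegotiate(ledger: List[Commitment], constraint: str, turn: int) -> bool:
    """Hypothetical helper: stamp the first active, not yet renegotiated
    entry with this constraint key. Returns True if an entry was updated."""
    for c in ledger:
        if c.active and c.constraint == constraint and c.renegotiated_at is None:
            c.renegotiated_at = turn
            return True
    return False


ledger = [
    Commitment(
        turn_created=1,
        commitment_type="meeting_scheduled",
        description="3pm meeting with Client X",
        constraint="2026-04-25T15:00",
        to_whom="Client X",
    )
]

first = renegotiate(ledger, "2026-04-25T15:00", turn=4)
second = renegotiate(ledger, "2026-04-25T15:00", turn=5)
print(first, second, ledger[0].renegotiated_at)
```

Keeping `renegotiated_at` as a turn number rather than a boolean lets a grader distinguish commitments that were explicitly renegotiated from those that simply stopped being honored.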

server/environment.py ADDED

	@@ -0,0 +1,244 @@

+"""CommitmentOS environment — multi-turn personal task management with
+temporal commitment coherence tracking.
+Episode lifecycle:
+  1. reset()  -> agent receives scenario briefing + calendar + inbox
+  2. step()   -> agent makes one tool call per step (done=False)
+  3. step(submit_plan) or max_steps reached -> grading + done=True
+"""
+from __future__ import annotations
+import random
+import uuid
+from typing import Any, Optional
+from openenv.core.env_server import Environment
+from openenv.core.env_server.types import EnvironmentMetadata
+from constants import AUTHOR, PROJECT_DESCRIPTION, PROJECT_NAME, VERSION
+from models import CommitmentAction, CommitmentObservation, CommitmentState
+from server.domain import ScenarioDef
+from server.world import WorldState
+class CommitmentEnvironment(
+    Environment[CommitmentAction, CommitmentObservation, CommitmentState]
+):
+    def __init__(self) -> None:
+        super().__init__()
+        self._world: Optional[WorldState] = None
+        self._scenario: Optional[ScenarioDef] = None
+        self._episode_id: str = ""
+        self._step_count: int = 0
+        self._done: bool = False
+        self._cumulative_reward: float = 0.0
+        self._last_tool_result: str = ""
+        self._last_breakdown: dict[str, float] = {}
+        self._last_feedback: str = ""
+    # ------------------------------------------------------------------
+    # Task selection
+    # ------------------------------------------------------------------
+    def _select_scenario(
+        self,
+        scenario_id: Optional[str] = None,
+        difficulty: Optional[str] = None,
+    ) -> ScenarioDef:
+        from server.tasks import get_all_scenarios, get_scenario, get_scenarios_by_difficulty
+        if scenario_id:
+            s = get_scenario(scenario_id)
+            if s is None:
+                raise ValueError(f"Unknown scenario_id: {scenario_id}")
+            return s
+        if difficulty:
+            candidates = get_scenarios_by_difficulty(difficulty)
+            if not candidates:
+                raise ValueError(f"No scenarios for difficulty: {difficulty}")
+            return random.choice(candidates)
+        return random.choice(list(get_all_scenarios().values()))
+    # ------------------------------------------------------------------
+    # Core API
+    # ------------------------------------------------------------------
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        **kwargs: Any,
+    ) -> CommitmentObservation:
+        if seed is not None:
+            random.seed(seed)
+        scenario = self._select_scenario(
+            scenario_id=kwargs.get("scenario_id") or kwargs.get("task_id"),
+            difficulty=kwargs.get("difficulty"),
+        )
+        self._scenario = scenario
+        self._world = WorldState(scenario)
+        self._episode_id = episode_id or str(uuid.uuid4())
+        self._step_count = 0
+        self._done = False
+        self._cumulative_reward = 0.0
+        self._last_tool_result = ""
+        self._last_breakdown = {}
+        self._last_feedback = "New episode started. Read the briefing and use tools to manage the situation."
+        return self._build_observation(reward=0.0, done=False)
+    def step(
+        self,
+        action: CommitmentAction,
+        timeout_s: Optional[float] = None,
+        **kwargs: Any,
+    ) -> CommitmentObservation:
+        if self._world is None or self._scenario is None:
+            raise ValueError("No active episode. Call reset() first.")
+        if self._done:
+            raise ValueError("Episode already completed. Call reset() to start a new one.")
+        self._step_count += 1
+        self._world.step_count = self._step_count
+        at = action.action_type.lower().strip()
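+        # When the step limit is reached, the current action is not dispatched;
+        # the episode is graded on the world state as it stands.
+        if at == "submit_plan" or self._step_count >= self._scenario.max_steps:
+            return self._finish_episode()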
+        step_reward = 0.0
+        tool_result = self._dispatch_tool(action, at)
+        self._last_tool_result = tool_result
+        if "CONFLICT" in tool_result:
+            step_reward = -0.05
+        elif at in ("schedule_meeting", "reschedule_event", "send_email", "book_restaurant"):
+            step_reward = 0.05
+        self._cumulative_reward += step_reward
+        self._last_feedback = ""
+        self._last_breakdown = {}
+        return self._build_observation(reward=step_reward, done=False)
+    def _finish_episode(self) -> CommitmentObservation:
+        from server.graders import grade_scenario
+        assert self._world is not None
+        assert self._scenario is not None
+        total_reward, breakdown, feedback = grade_scenario(
+            self._scenario, self._world,
+        )
+        self._done = True
+        self._cumulative_reward += total_reward
+        self._last_breakdown = breakdown
+        self._last_feedback = feedback
+        self._last_tool_result = "Plan submitted. Episode graded."
+        return self._build_observation(reward=total_reward, done=True)
+    # ------------------------------------------------------------------
+    # Tool dispatch
+    # ------------------------------------------------------------------
+    def _dispatch_tool(self, action: CommitmentAction, at: str) -> str:
+        assert self._world is not None
+        turn = self._step_count
+        if at == "view_calendar":
+            return self._world.view_calendar(action.date)
+        elif at == "check_availability":
+            return self._world.check_availability(action.person)
+        elif at == "search_restaurants":
+            return self._world.search_restaurants(
+                cuisine=action.cuisine,
+                max_price=action.max_price,
+                dietary=action.dietary,
+                max_distance_miles=action.max_distance_miles,
+                near_airport=action.near_airport,
+            )
+        elif at == "schedule_meeting":
+            return self._world.schedule_meeting(
+                title=action.title,
+                date=action.date,
+                time=action.time,
+                duration_min=action.duration_min,
+                participants=action.participants,
+                location=action.location,
+                turn=turn,
+            )
+        elif at == "reschedule_event":
+            return self._world.reschedule_event(
+                event_id=action.event_id,
+                new_time=action.new_time,
+                turn=turn,
+            )
+        elif at == "cancel_event":
+            return self._world.cancel_event(action.event_id, turn=turn)
+        elif at == "send_email":
+            return self._world.send_email(
+                to=action.to,
+                subject=action.subject,
+                body=action.body,
+                turn=turn,
+            )
+        elif at == "book_restaurant":
+            return self._world.book_restaurant(action.restaurant_name, turn=turn)
+        else:
+            return f"Unknown action_type: '{at}'. Valid types: view_calendar, check_availability, search_restaurants, schedule_meeting, reschedule_event, cancel_event, send_email, book_restaurant, submit_plan"
+    # ------------------------------------------------------------------
+    # Observation builder
+    # ------------------------------------------------------------------
+    def _build_observation(self, *, reward: float, done: bool) -> CommitmentObservation:
+        assert self._world is not None
+        assert self._scenario is not None
+        return CommitmentObservation(
+            scenario_id=self._scenario.scenario_id,
+            difficulty=self._scenario.difficulty,
+            briefing=self._scenario.briefing if self._step_count == 0 else "",
+            tool_result=self._last_tool_result,
+            calendar_snapshot=self._world.get_calendar_snapshot(),
+            inbox=self._world.get_inbox_snapshot(),
+            pending_commitments=len(self._world.get_active_commitments()),
+            step_number=self._step_count,
+            max_steps=self._scenario.max_steps,
+            reward=reward,
+            reward_breakdown=self._last_breakdown,
+            done=done,
+            feedback=self._last_feedback,
+        )
+    # ------------------------------------------------------------------
+    # State property
+    # ------------------------------------------------------------------
+    @property
+    def state(self) -> CommitmentState:
+        from server.tasks import get_all_scenarios
+        violations = self._world.get_silent_violations() if self._world else []
+        return CommitmentState(
+            episode_id=self._episode_id,
+            step_count=self._step_count,
+            scenario_id=self._scenario.scenario_id if self._scenario else "",
+            difficulty=self._scenario.difficulty if self._scenario else "",
+            completed=self._done,
+            cumulative_reward=self._cumulative_reward,
+            commitment_count=len(self._world.commitment_ledger) if self._world else 0,
+            violation_count=len(violations),
+            available_tasks=list(get_all_scenarios().keys()),
+        )
+    def get_metadata(self) -> EnvironmentMetadata:
+        return EnvironmentMetadata(
+            name=PROJECT_NAME,
+            description=PROJECT_DESCRIPTION,
+            version=VERSION,
+            author=AUTHOR,
+        )
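
The per-step shaping in `step()` above can be traced by hand: a tool result containing `CONFLICT` costs 0.05, a successful state-changing call from the rewarded set earns 0.05, and everything else (including `cancel_event` and read-only tools) earns nothing. The sketch below mirrors that branch logic in a standalone function; the tool result strings and the final graded reward of 0.90 are made-up values for illustration.

```python
# Action types that earn a small positive reward in step() above.
REWARDED = ("schedule_meeting", "reschedule_event", "send_email", "book_restaurant")


def step_reward(action_type: str, tool_result: str) -> float:
    """Mirror of the shaping branches in CommitmentEnvironment.step()."""
    at = action_type.lower().strip()
    if "CONFLICT" in tool_result:
        return -0.05
    if at in REWARDED:
        return 0.05
    return 0.0


# Hypothetical episode: (action_type, tool_result) pairs before submit_plan.
episode = [
    ("view_calendar", "3 events on 2026-04-25"),
    ("reschedule_event", "Moved evt_2"),
    ("send_email", "Email sent to Team"),
]
final_graded_reward = 0.90  # assumed output of grade_scenario(), for illustration

cumulative = sum(step_reward(a, r) for a, r in episode) + final_graded_reward
print(round(cumulative, 4))
```

Because the final grade is added on top of the shaping terms, `cumulative_reward` in the state can exceed the 0.99 cap that applies to the graded component alone.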

server/graders.py ADDED

	@@ -0,0 +1,236 @@

+"""Deterministic grading — 5-component reward for CommitmentOS.
+Components:
+  constraint_satisfaction (0.35) — binary per scenario constraint
+  conflict_resolution     (0.20) — final calendar free of overlaps
+  commitment_coherence    (0.20) — ledger violations penalised
+  communication_quality   (0.15) — keyword matching on sent emails
+  step_efficiency         (0.10) — fewer steps = higher score
+"""
+from __future__ import annotations
+from typing import Any, Dict, List, Tuple
+from server.domain import ScenarioDef
+from server.world import WorldState, _time_to_min
+WEIGHTS: Dict[str, float] = {
+    "constraint_satisfaction": 0.35,
+    "conflict_resolution": 0.20,
+    "commitment_coherence": 0.20,
+    "communication_quality": 0.15,
+    "step_efficiency": 0.10,
+}
+def _keyword_score(text: str, keywords: List[str], min_matches: int = 2) -> Tuple[float, List[str]]:
+    """0 hits -> 0.0, < min_matches -> 0.5 (partial), >= min_matches -> 1.0."""
+    text_lower = text.lower()
+    matched = [kw for kw in keywords if kw.lower() in text_lower]
+    if len(matched) == 0:
+        return 0.0, matched
+    if len(matched) < min_matches:
+        return 0.5, matched
+    return 1.0, matched
+def _check_constraint(constraint, world: WorldState) -> bool:
+    """Evaluate a single ConstraintDef against the world state."""
+    ct = constraint.check_type
+    params = constraint.check_params
+    if ct == "calendar_no_conflict":
+        return _calendar_has_no_overlaps(world)
+    elif ct == "event_exists":
+        eid = params.get("event_id", "")
+        return eid in world.calendar
+    elif ct == "event_cancelled":
+        eid = params.get("event_id", "")
+        return eid not in world.calendar
+    elif ct == "email_sent":
+        to = params.get("to", "").lower()
+        keywords = params.get("keywords", [])
+        for em in world.emails_sent:
+            if to in em.get("to", "").lower():
+                if keywords:
+                    score, _ = _keyword_score(em.get("body", ""), keywords, min_matches=1)
+                    if score > 0:
+                        return True
+                else:
+                    return True
+        return False
+    elif ct == "restaurant_match":
+        name = params.get("name", "")
+        if name:
+            return world.booked_restaurant.lower() == name.lower()
+        criteria = params.get("criteria", {})
+        if not world.booked_restaurant:
+            return False
+        r = world.restaurants.get(world.booked_restaurant)
+        if r is None:
+            return False
+        if "dietary" in criteria and criteria["dietary"].lower() not in [d.lower() for d in r.dietary_options]:
+            return False
+        if "max_price" in criteria and r.price_per_person > criteria["max_price"]:
+            return False
+        if "max_distance" in criteria and r.distance_miles > criteria["max_distance"]:
+            return False
+        if "near_airport" in criteria and criteria["near_airport"] and not r.near_airport:
+            return False
+        return True
+    elif ct == "priority_order":
+        higher = params.get("higher", "").lower()
+        lower = params.get("lower", "").lower()
+        higher_kept = any(
+            ev.title.lower() == higher or higher in ev.title.lower()
+            for ev in world.calendar.values()
+        )
+        lower_moved = not any(
+            ev.title.lower() == lower or lower in ev.title.lower()
+            for ev in world.calendar.values()
+        ) or any(
+            em.get("to", "").lower() == lower or lower in em.get("body", "").lower()
+            for em in world.emails_sent
+        )
+        return higher_kept and lower_moved
+    return False
+def _calendar_has_no_overlaps(world: WorldState) -> bool:
+    events = list(world.calendar.values())
+    for i, a in enumerate(events):
+        for b in events[i + 1:]:
+            if a.date != b.date:
+                continue
+            a_start = _time_to_min(a.time)
+            a_end = a_start + a.duration_min
+            b_start = _time_to_min(b.time)
+            b_end = b_start + b.duration_min
+            if a_start < b_end and b_start < a_end:
+                return False
+    return True
+def _score_constraint_satisfaction(scenario: ScenarioDef, world: WorldState) -> Tuple[float, str]:
+    if not scenario.constraints:
+        return 1.0, "No constraints defined"
+    met = sum(1 for c in scenario.constraints if _check_constraint(c, world))
+    total = len(scenario.constraints)
+    score = met / total
+    return score, f"{met}/{total} constraints met"
+def _score_conflict_resolution(world: WorldState) -> Tuple[float, str]:
+    clean = _calendar_has_no_overlaps(world)
+    return (1.0 if clean else 0.0), ("No calendar conflicts" if clean else "Calendar has overlapping events")
+def _score_commitment_coherence(world: WorldState) -> Tuple[float, str]:
+    total = len(world.commitment_ledger)
+    if total == 0:
+        return 1.0, "No commitments created"
+    violations = world.get_silent_violations()
+    silent_count = len(violations)
+    renegotiated = sum(1 for c in world.commitment_ledger if c.renegotiated_at is not None)
+    honored = total - silent_count - renegotiated
+    score = (total - silent_count) / total
+    parts = []
+    if honored > 0:
+        parts.append(f"{honored} honored")
+    if renegotiated > 0:
+        parts.append(f"{renegotiated} renegotiated")
+    if silent_count > 0:
+        parts.append(f"{silent_count} SILENTLY BROKEN")
+    return score, " | ".join(parts) if parts else "OK"
+def _score_communication(scenario: ScenarioDef, world: WorldState) -> Tuple[float, str]:
+    reqs = scenario.communication_requirements
+    if not reqs:
+        return 1.0, "No communication requirements"
+    total_score = 0.0
+    feedback_parts: List[str] = []
+    for req in reqs:
+        to_lower = req.to.lower()
+        matching_emails = [
+            em for em in world.emails_sent
+            if to_lower in em.get("to", "").lower()
+        ]
+        if not matching_emails:
+            feedback_parts.append(f"MISSING email to {req.to}")
+            continue
+        best_score = 0.0
+        for em in matching_emails:
+            body = em.get("body", "") + " " + em.get("subject", "")
+            if req.required_keywords:
+                ks, matched = _keyword_score(body, req.required_keywords, min_matches=1)
+                best_score = max(best_score, ks)
+            else:
+                best_score = 1.0
+        total_score += best_score
+        if best_score >= 1.0:
+            feedback_parts.append(f"Email to {req.to}: full credit")
+        elif best_score > 0:
+            feedback_parts.append(f"Email to {req.to}: partial ({best_score:.1f})")
+        else:
+            feedback_parts.append(f"Email to {req.to}: missing keywords")
+    score = total_score / len(reqs) if reqs else 1.0
+    return score, " | ".join(feedback_parts)
+def _score_step_efficiency(scenario: ScenarioDef, world: WorldState) -> Tuple[float, str]:
+    optimal = scenario.optimal_steps
+    actual = world.step_count
+    if actual <= optimal:
+        return 1.0, f"{actual} steps (optimal: {optimal})"
+    penalty = (actual - optimal) * 0.1
+    score = max(0.0, 1.0 - penalty)
+    return score, f"{actual} steps (optimal: {optimal}, penalty: -{penalty:.1f})"
+def grade_scenario(
+    scenario: ScenarioDef,
+    world: WorldState,
+) -> Tuple[float, Dict[str, float], str]:
+    """Returns ``(total_reward, breakdown, feedback)``."""
+    breakdown: Dict[str, float] = {}
+    feedback_parts: List[str] = []
+    cs_score, cs_fb = _score_constraint_satisfaction(scenario, world)
+    breakdown["constraint_satisfaction"] = round(cs_score * WEIGHTS["constraint_satisfaction"], 4)
+    feedback_parts.append(f"[constraints] {cs_fb}")
+    cr_score, cr_fb = _score_conflict_resolution(world)
+    breakdown["conflict_resolution"] = round(cr_score * WEIGHTS["conflict_resolution"], 4)
+    feedback_parts.append(f"[conflicts] {cr_fb}")
+    cc_score, cc_fb = _score_commitment_coherence(world)
+    breakdown["commitment_coherence"] = round(cc_score * WEIGHTS["commitment_coherence"], 4)
+    feedback_parts.append(f"[commitments] {cc_fb}")
+    cq_score, cq_fb = _score_communication(scenario, world)
+    breakdown["communication_quality"] = round(cq_score * WEIGHTS["communication_quality"], 4)
+    feedback_parts.append(f"[communication] {cq_fb}")
+    se_score, se_fb = _score_step_efficiency(scenario, world)
+    breakdown["step_efficiency"] = round(se_score * WEIGHTS["step_efficiency"], 4)
+    feedback_parts.append(f"[efficiency] {se_fb}")
+    total_reward = round(sum(breakdown.values()), 4)
+    total_reward = max(0.01, min(0.99, total_reward))
+    feedback = " | ".join(feedback_parts)
+    return total_reward, breakdown, feedback
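
As a worked example of the weighting in `grade_scenario`, the sketch below recomputes the total from five raw component scores using the same weights, rounding, and clamping as above. The raw scores are invented: two of three constraints met, a clean calendar, coherent commitments, full communication credit, and 5 steps against an optimum of 3.

```python
from typing import Dict

WEIGHTS: Dict[str, float] = {
    "constraint_satisfaction": 0.35,
    "conflict_resolution": 0.20,
    "commitment_coherence": 0.20,
    "communication_quality": 0.15,
    "step_efficiency": 0.10,
}


def step_efficiency(actual: int, optimal: int) -> float:
    """Same rule as _score_step_efficiency: lose 0.1 per step over optimal."""
    if actual <= optimal:
        return 1.0
    return max(0.0, 1.0 - (actual - optimal) * 0.1)


def total_reward(raw: Dict[str, float]) -> float:
    """Weight each raw score, round to 4 places, sum, then clamp to [0.01, 0.99]."""
    breakdown = {k: round(raw[k] * w, 4) for k, w in WEIGHTS.items()}
    total = round(sum(breakdown.values()), 4)
    return max(0.01, min(0.99, total))


raw_scores = {
    "constraint_satisfaction": 2 / 3,  # 2 of 3 constraints met
    "conflict_resolution": 1.0,
    "commitment_coherence": 1.0,
    "communication_quality": 1.0,
    "step_efficiency": step_efficiency(actual=5, optimal=3),
}

partial = total_reward(raw_scores)
perfect = total_reward({k: 1.0 for k in WEIGHTS})
print(partial, perfect)
```

The clamp means a flawless episode reports 0.99 rather than 1.0, and a completely failed one reports 0.01 rather than 0.0.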

server/mcp.py ADDED

	@@ -0,0 +1,65 @@

+"""MCP JSON-RPC 2.0 endpoint for OpenEnv validator compliance."""
+from __future__ import annotations
+from fastapi import APIRouter, Request
+from fastapi.responses import JSONResponse
+from constants import PROJECT_NAME, VERSION
+from models import CommitmentAction
+router = APIRouter()
+_CAPABILITIES = {
+    "tools": {"listChanged": False},
+    "resources": {"subscribe": False, "listChanged": False},
+}
+_TOOLS = [
+    {
+        "name": "reset",
+        "description": "Start a new CommitmentOS episode",
+        "inputSchema": CommitmentAction.model_json_schema(),
+    },
+    {
+        "name": "step",
+        "description": "Execute one tool call in the current episode",
+        "inputSchema": CommitmentAction.model_json_schema(),
+    },
+    {
+        "name": "state",
+        "description": "Get current episode state",
+        "inputSchema": {"type": "object", "properties": {}},
+    },
+]
+def _jsonrpc_response(rpc_id: object, result: dict) -> JSONResponse:
+    return JSONResponse({"jsonrpc": "2.0", "id": rpc_id, "result": result})
+def _jsonrpc_error(rpc_id: object, code: int, message: str) -> JSONResponse:
+    return JSONResponse({"jsonrpc": "2.0", "id": rpc_id, "error": {"code": code, "message": message}})
+@router.post("/mcp")
+async def mcp_endpoint(request: Request) -> JSONResponse:
+    try:
+        body = await request.json()
+    except Exception:
+        return _jsonrpc_error(None, -32700, "Parse error")
+    if not isinstance(body, dict):
+        return _jsonrpc_error(None, -32600, "Invalid Request")
+    rpc_id = body.get("id")
+    method = body.get("method", "")
+    if method == "initialize":
+        return _jsonrpc_response(rpc_id, {
+            "protocolVersion": "2024-11-05",
+            "capabilities": _CAPABILITIES,
+            "serverInfo": {"name": PROJECT_NAME, "version": VERSION},
+        })
+    if method == "tools/list":
+        return _jsonrpc_response(rpc_id, {"tools": _TOOLS})
+    return _jsonrpc_error(rpc_id, -32601, f"Method not found: {method}")
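
The JSON-RPC 2.0 envelopes handled by `/mcp` can be exercised without FastAPI. The sketch below is a plain-function approximation of the endpoint's dispatch. `handle`, `SERVER_INFO`, and the single-entry `TOOLS` list are hypothetical stand-ins, and the check for non-object bodies is an assumption of this sketch.

```python
from typing import Any, Dict

SERVER_INFO = {"name": "example-env", "version": "0.0.0"}  # placeholder values
TOOLS = [
    {
        "name": "state",
        "description": "Get current episode state",
        "inputSchema": {"type": "object", "properties": {}},
    }
]


def _error(rpc_id: Any, code: int, message: str) -> Dict[str, Any]:
    return {"jsonrpc": "2.0", "id": rpc_id, "error": {"code": code, "message": message}}


def handle(body: Any) -> Dict[str, Any]:
    """Dispatch one parsed JSON-RPC request and return the response envelope."""
    if not isinstance(body, dict):
        # -32600 is the JSON-RPC 2.0 code for a malformed request object.
        return _error(None, -32600, "Invalid Request")
    rpc_id = body.get("id")
    method = body.get("method", "")
    if method == "initialize":
        result = {
            "protocolVersion": "2024-11-05",
            "capabilities": {"tools": {"listChanged": False}},
            "serverInfo": SERVER_INFO,
        }
    elif method == "tools/list":
        result = {"tools": TOOLS}
    else:
        # -32601 is the JSON-RPC 2.0 code for an unknown method.
        return _error(rpc_id, -32601, f"Method not found: {method}")
    return {"jsonrpc": "2.0", "id": rpc_id, "result": result}


init = handle({"jsonrpc": "2.0", "id": 1, "method": "initialize"})
listing = handle({"jsonrpc": "2.0", "id": 2, "method": "tools/list"})
unknown = handle({"jsonrpc": "2.0", "id": 3, "method": "tools/call"})
print(init["result"]["protocolVersion"], unknown["error"]["code"])
```

Echoing the request `id` back in every response is what lets a client match replies to requests.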

server/tasks.py ADDED

	@@ -0,0 +1,616 @@

+"""Scenario dataset — 15 tasks across 3 difficulty tiers.
+Each scenario is a validated ``ScenarioDef`` Pydantic model containing the
+initial world state and deterministic grader keys.
+"""
+from __future__ import annotations
+from typing import Dict, List, Optional
+from server.domain import (
+    CalendarEvent,
+    CommunicationReq,
+    ConstraintDef,
+    Contact,
+    InboxEmail,
+    Restaurant,
+    ScenarioDef,
+)
+# ===================================================================
+# EASY — 2-4 tool calls, single constraint domain
+# ===================================================================
+_EASY_001 = ScenarioDef(
+    scenario_id="easy_001",
+    difficulty="easy",
+    briefing=(
+        "You have two meetings at 2:00 PM today (2026-04-25): a 1-on-1 with your boss "
+        "VP_Chen and a team standup with 6 people. They are in different rooms. "
+        "VP_Chen's meeting is higher priority. Reschedule the standup to a free slot "
+        "and notify the team."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_1", title="1-on-1 with VP_Chen", date="2026-04-25", time="14:00", duration_min=30, participants=["VP_Chen"], location="Room A", priority="high"),
+        CalendarEvent(event_id="evt_2", title="Team Standup", date="2026-04-25", time="14:00", duration_min=30, participants=["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"], location="Room B", priority="normal"),
+        CalendarEvent(event_id="evt_3", title="Lunch", date="2026-04-25", time="12:00", duration_min=60, participants=[], priority="low", is_personal=True),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_1", sender="VP_Chen", subject="Our 1-on-1 today", body="Looking forward to our 2pm chat. I have some feedback on the Q3 roadmap.", urgency="high"),
+    ],
+    contacts=[
+        Contact(name="VP_Chen", role="VP Engineering", priority_level=5),
+        Contact(name="Alice", role="Engineer", priority_level=2),
+        Contact(name="Team", role="Engineering Team", priority_level=2, email="team@company.com"),
+    ],
+    constraints=[
+        ConstraintDef(description="1-on-1 with VP_Chen must remain at 14:00", check_type="event_exists", check_params={"event_id": "evt_1"}),
+        ConstraintDef(description="Team standup must not conflict with 1-on-1", check_type="calendar_no_conflict", check_params={}),
+        ConstraintDef(description="Team must be notified of reschedule", check_type="email_sent", check_params={"to": "Team", "keywords": ["reschedule", "standup", "move"]}),
+    ],
+    priority_ordering=["VP_Chen", "Team"],
+    communication_requirements=[
+        CommunicationReq(to="Team", required_keywords=["reschedule", "standup"], purpose="notify_reschedule"),
+    ],
+    optimal_steps=3,
+    max_steps=8,
+    expected_cancelled_events=[],
+    expected_final_events=["evt_1"],
+)
+_EASY_002 = ScenarioDef(
+    scenario_id="easy_002",
+    difficulty="easy",
+    briefing=(
+        "Book a dinner tonight (2026-04-25) for 4 people. Requirements: "
+        "Italian cuisine, under $50 per person, within 3 miles. "
+        "Search restaurants and book the best match."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_10", title="Morning Standup", date="2026-04-25", time="09:00", duration_min=30, participants=["Team"]),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_10", sender="Alice", subject="Dinner tonight?", body="Can you find us a nice Italian place? Budget is $50/person max. Needs to be close to the office.", urgency="normal"),
+    ],
+    available_restaurants=[
+        Restaurant(name="Bella Italia", cuisine="Italian", price_per_person=40, distance_miles=2.0, dietary_options=["vegetarian", "gluten-free"], capacity=30),
+        Restaurant(name="Chez Pierre", cuisine="French", price_per_person=80, distance_miles=1.5, dietary_options=["vegetarian"], capacity=40),
+        Restaurant(name="Pasta Palace", cuisine="Italian", price_per_person=55, distance_miles=1.0, dietary_options=["vegan", "vegetarian"], capacity=20),
+        Restaurant(name="Dragon Wok", cuisine="Chinese", price_per_person=25, distance_miles=4.0, dietary_options=["vegan", "vegetarian"], capacity=50),
+    ],
+    contacts=[
+        Contact(name="Alice", role="Friend", priority_level=2),
+    ],
+    constraints=[
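+        # restaurant_match has no cuisine criterion, and an empty "dietary" value
+        # never matches any option, so the expected Italian pick is matched by name.
+        ConstraintDef(description="Restaurant must be Italian", check_type="restaurant_match", check_params={"name": "Bella Italia"}),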
+        ConstraintDef(description="Restaurant must be under $50/pp", check_type="restaurant_match", check_params={"criteria": {"max_price": 50}}),
+        ConstraintDef(description="Restaurant must be within 3 miles", check_type="restaurant_match", check_params={"criteria": {"max_distance": 3.0}}),
+    ],
+    optimal_steps=2,
+    max_steps=6,
+    expected_restaurant="Bella Italia",
+)
+_EASY_003 = ScenarioDef(
+    scenario_id="easy_003",
+    difficulty="easy",
+    briefing=(
+        "Client_Jones has emailed asking for a meeting this week. Check your "
+        "calendar for 2026-04-25 and Client_Jones's availability, then propose "
+        "3 available slots via email."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_20", title="Team Sync", date="2026-04-25", time="10:00", duration_min=60, participants=["Team"]),
+        CalendarEvent(event_id="evt_21", title="Lunch", date="2026-04-25", time="12:00", duration_min=60, is_personal=True),
+        CalendarEvent(event_id="evt_22", title="Design Review", date="2026-04-25", time="15:00", duration_min=60, participants=["Bob", "Carol"]),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_20", sender="Client_Jones", subject="Meeting this week?", body="Hi, I'd love to catch up this week. Do you have any openings? Need about 30 minutes.", urgency="high"),
+    ],
+    contacts=[
+        Contact(name="Client_Jones", role="Client", priority_level=4, availability={"2026-04-25": ["09:00", "11:00", "14:00", "16:00"]}),
+    ],
+    constraints=[
+        ConstraintDef(description="Email must be sent to Client_Jones", check_type="email_sent", check_params={"to": "Client_Jones", "keywords": ["slot", "available", "meet"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="Client_Jones", required_keywords=["available", "slot", "time"], purpose="propose_slots"),
+    ],
+    optimal_steps=3,
+    max_steps=8,
+)
+_EASY_004 = ScenarioDef(
+    scenario_id="easy_004",
+    difficulty="easy",
+    briefing=(
+        "Your personal doctor appointment at 3:00 PM today (2026-04-25) conflicts "
+        "with the weekly team sync. The doctor appointment was booked first and is "
+        "important. Cancel the team sync and notify the team."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_30", title="Weekly Team Sync", date="2026-04-25", time="15:00", duration_min=60, participants=["Team"], priority="normal"),
+        CalendarEvent(event_id="evt_31", title="Doctor Appointment", date="2026-04-25", time="15:00", duration_min=60, priority="high", is_personal=True),
+    ],
+    initial_inbox=[],
+    contacts=[
+        Contact(name="Team", role="Engineering Team", priority_level=2),
+    ],
+    constraints=[
+        ConstraintDef(description="Doctor appointment must remain", check_type="event_exists", check_params={"event_id": "evt_31"}),
+        ConstraintDef(description="Team sync must be cancelled", check_type="event_cancelled", check_params={"event_id": "evt_30"}),
+        ConstraintDef(description="Team must be notified", check_type="email_sent", check_params={"to": "Team", "keywords": ["cancel", "sync"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="Team", required_keywords=["cancel", "sync", "apologi"], purpose="notify_reschedule"),
+    ],
+    optimal_steps=2,
+    max_steps=6,
+    expected_cancelled_events=["evt_30"],
+    expected_final_events=["evt_31"],
+)
+_EASY_005 = ScenarioDef(
+    scenario_id="easy_005",
+    difficulty="easy",
+    briefing=(
+        "You have 3 unread emails. Triage them by urgency and respond to the most "
+        "urgent one first. VP_Chen's email is critical, Client_Jones is high, "
+        "and Alice is low priority."
+    ),
+    initial_calendar=[],
+    initial_inbox=[
+        InboxEmail(email_id="em_50", sender="Alice", subject="Lunch tomorrow?", body="Want to grab lunch tomorrow?", urgency="low"),
+        InboxEmail(email_id="em_51", sender="Client_Jones", subject="Contract review", body="Please review the attached contract by end of day.", urgency="high"),
+        InboxEmail(email_id="em_52", sender="VP_Chen", subject="URGENT: Board deck", body="I need the Q3 numbers for the board deck. Can you send them in the next hour?", urgency="critical"),
+    ],
+    contacts=[
+        Contact(name="VP_Chen", role="VP Engineering", priority_level=5),
+        Contact(name="Client_Jones", role="Client", priority_level=4),
+        Contact(name="Alice", role="Engineer", priority_level=2),
+    ],
+    constraints=[
+        ConstraintDef(description="VP_Chen must be responded to", check_type="email_sent", check_params={"to": "VP_Chen", "keywords": ["Q3", "number", "board"]}),
+        ConstraintDef(description="Client_Jones must be responded to", check_type="email_sent", check_params={"to": "Client_Jones", "keywords": ["contract", "review"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="VP_Chen", required_keywords=["Q3", "numbers", "send"], purpose="acknowledge"),
+        CommunicationReq(to="Client_Jones", required_keywords=["contract", "review"], purpose="acknowledge"),
+    ],
+    optimal_steps=2,
+    max_steps=6,
+)
+# ===================================================================
+# MEDIUM — 5-8 tool calls, cross-domain with commitment tracking
+# ===================================================================
+_MED_006 = ScenarioDef(
+    scenario_id="med_006",
+    difficulty="medium",
+    briefing=(
+        "Meeting A ('Design Review') has been moved from 2:00 PM to 3:00 PM today "
+        "(2026-04-25). But you have Meeting B ('Sprint Planning') at 3:00 PM, and "
+        "Meeting C ('Demo Prep') at 4:00 PM depends on Sprint Planning's output. "
+        "Resolve the cascade: reschedule B without conflicting with C, and notify "
+        "all affected parties."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_40", title="Design Review", date="2026-04-25", time="14:00", duration_min=60, participants=["Bob", "Carol"], priority="high"),
+        CalendarEvent(event_id="evt_41", title="Sprint Planning", date="2026-04-25", time="15:00", duration_min=60, participants=["Team"], priority="normal"),
+        CalendarEvent(event_id="evt_42", title="Demo Prep", date="2026-04-25", time="16:00", duration_min=60, participants=["Alice", "Dave"], priority="normal"),
+        CalendarEvent(event_id="evt_43", title="Morning Standup", date="2026-04-25", time="09:00", duration_min=30, participants=["Team"]),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_40", sender="Bob", subject="Design Review moved", body="Hey, I need to push our 2pm design review to 3pm. Apologies for the late change.", urgency="high"),
+    ],
+    contacts=[
+        Contact(name="Bob", role="Lead Designer", priority_level=3),
+        Contact(name="Team", role="Engineering Team", priority_level=2),
+        Contact(name="Alice", role="Engineer", priority_level=2),
+    ],
+    constraints=[
+        ConstraintDef(description="Design Review must be at 15:00", check_type="calendar_no_conflict", check_params={}),
+        ConstraintDef(description="Sprint Planning must not conflict", check_type="calendar_no_conflict", check_params={}),
+        ConstraintDef(description="Demo Prep must remain after Sprint Planning", check_type="event_exists", check_params={"event_id": "evt_42"}),
+        ConstraintDef(description="Team notified about Sprint Planning change", check_type="email_sent", check_params={"to": "Team", "keywords": ["sprint", "reschedule", "move"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="Team", required_keywords=["sprint", "planning", "reschedule"], purpose="notify_reschedule"),
+    ],
+    optimal_steps=4,
+    max_steps=10,
+)
+_MED_007 = ScenarioDef(
+    scenario_id="med_007",
+    difficulty="medium",
+    briefing=(
+        "Plan a team dinner for 6 people tonight (2026-04-25). Constraints: "
+        "Alice is vegan, Bob has a nut allergy, must be within 3 miles, "
+        "under $45 per person, and needs a private room for 6+. "
+        "Search restaurants, book the right one, and email the team with details."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_50", title="Afternoon Focus", date="2026-04-25", time="14:00", duration_min=120),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_50", sender="Alice", subject="Dinner tonight", body="Can you book a place? Remember I'm vegan. Bob has a nut allergy. We need a private room.", urgency="normal"),
+    ],
+    available_restaurants=[
+        Restaurant(name="Green Garden", cuisine="Mediterranean", price_per_person=38, distance_miles=2.5, dietary_options=["vegan", "nut-free", "vegetarian"], capacity=30, has_private_room=True),
+        Restaurant(name="Steak House Prime", cuisine="American", price_per_person=55, distance_miles=1.0, dietary_options=["gluten-free"], capacity=50, has_private_room=True),
+        Restaurant(name="Lotus Thai", cuisine="Thai", price_per_person=30, distance_miles=3.5, dietary_options=["vegan", "vegetarian"], capacity=25, has_private_room=False),
+        Restaurant(name="Cafe Novo", cuisine="Fusion", price_per_person=42, distance_miles=2.0, dietary_options=["vegan", "nut-free", "gluten-free", "vegetarian"], capacity=15, has_private_room=True),
+        Restaurant(name="Burgers & Brew", cuisine="American", price_per_person=20, distance_miles=0.5, dietary_options=["vegetarian"], capacity=40, has_private_room=False),
+    ],
+    contacts=[
+        Contact(name="Alice", role="Engineer", priority_level=2, dietary="vegan"),
+        Contact(name="Bob", role="Engineer", priority_level=2, dietary="nut-free"),
+        Contact(name="Team", role="Engineering Team", priority_level=2),
+    ],
+    constraints=[
+        ConstraintDef(description="Restaurant has vegan options", check_type="restaurant_match", check_params={"criteria": {"dietary": "vegan"}}),
+        ConstraintDef(description="Restaurant under $45/pp", check_type="restaurant_match", check_params={"criteria": {"max_price": 45}}),
+        ConstraintDef(description="Restaurant within 3 miles", check_type="restaurant_match", check_params={"criteria": {"max_distance": 3.0}}),
+        ConstraintDef(description="Team notified of dinner details", check_type="email_sent", check_params={"to": "Team", "keywords": ["dinner", "restaurant"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="Team", required_keywords=["dinner", "tonight", "restaurant"], purpose="notify_reschedule"),
+    ],
+    optimal_steps=3,
+    max_steps=8,
+    expected_restaurant="Green Garden",
+)
+_MED_008 = ScenarioDef(
+    scenario_id="med_008",
+    difficulty="medium",
+    briefing=(
+        "You are currently in a client call (Client_Jones) that ends at 3:15 PM. "
+        "Your boss VP_Chen just emailed saying 'Need Q3 numbers in 30 minutes — "
+        "board meeting moved up.' It's currently 2:45 PM on 2026-04-25. "
+        "You cannot leave the client call early. Acknowledge VP_Chen with a "
+        "realistic ETA and do NOT cancel the client meeting."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_60", title="Client Call with Jones", date="2026-04-25", time="14:30", duration_min=45, participants=["Client_Jones"], priority="high"),
+        CalendarEvent(event_id="evt_61", title="Focus Time", date="2026-04-25", time="16:00", duration_min=60, priority="low"),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_60", sender="VP_Chen", subject="URGENT: Q3 numbers NOW", body="Board meeting moved up. I need the Q3 revenue numbers in the next 30 minutes. This is critical.", urgency="critical"),
+    ],
+    contacts=[
+        Contact(name="VP_Chen", role="VP Engineering", priority_level=5),
+        Contact(name="Client_Jones", role="Client", priority_level=4),
+    ],
+    constraints=[
+        ConstraintDef(description="Client call must NOT be cancelled", check_type="event_exists", check_params={"event_id": "evt_60"}),
+        ConstraintDef(description="VP_Chen must be acknowledged", check_type="email_sent", check_params={"to": "VP_Chen", "keywords": ["Q3", "numbers"]}),
+        ConstraintDef(description="Realistic ETA communicated", check_type="email_sent", check_params={"to": "VP_Chen", "keywords": ["after", "3:15", "call", "send"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="VP_Chen", required_keywords=["Q3", "numbers", "after", "client"], purpose="acknowledge"),
+    ],
+    optimal_steps=2,
+    max_steps=6,
+    expected_final_events=["evt_60"],
+)
+_MED_009 = ScenarioDef(
+    scenario_id="med_009",
+    difficulty="medium",
+    briefing=(
+        "You received an email from Bob saying 'Can we push our thing to next week?' "
+        "You have 3 recurring meetings with Bob: Monday Design Review (evt_70), "
+        "Wednesday Code Review (evt_71), and Friday Retrospective (evt_72) — all on "
+        "different days this week (2026-04-25 is Friday). Check the context and "
+        "determine which meeting Bob means, then confirm via email."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_70", title="Design Review with Bob", date="2026-04-21", time="10:00", duration_min=60, participants=["Bob"]),
+        CalendarEvent(event_id="evt_71", title="Code Review with Bob", date="2026-04-23", time="14:00", duration_min=60, participants=["Bob"]),
+        CalendarEvent(event_id="evt_72", title="Retrospective with Bob", date="2026-04-25", time="11:00", duration_min=60, participants=["Bob"]),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_70", sender="Bob", subject="Push our thing?", body="Hey, can we push our thing to next week? I'm swamped with the release today.", urgency="normal", context_hint="Bob means the Retrospective (today, Friday) since he says 'today'"),
+    ],
+    contacts=[
+        Contact(name="Bob", role="Lead Designer", priority_level=3, availability={"2026-05-02": ["11:00", "14:00"]}),
+    ],
+    constraints=[
+        ConstraintDef(description="Bob must be responded to", check_type="email_sent", check_params={"to": "Bob", "keywords": ["retrospective", "next week"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="Bob", required_keywords=["retrospective", "next week", "reschedule"], purpose="renegotiate"),
+    ],
+    optimal_steps=4,
+    max_steps=10,
+)
+_MED_010 = ScenarioDef(
+    scenario_id="med_010",
+    difficulty="medium",
+    briefing=(
+        "Client_Jones is visiting your office tomorrow (2026-04-26). You need to: "
+        "(1) book a conference room for a 10 AM demo, "
+        "(2) arrange lunch at a restaurant with vegetarian options, "
+        "and (3) send Client_Jones an itinerary email with all details."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_80", title="Team Standup", date="2026-04-26", time="09:00", duration_min=30, participants=["Team"]),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_80", sender="Client_Jones", subject="Visit tomorrow", body="Looking forward to the demo tomorrow. Is 10am still good? I'm vegetarian by the way.", urgency="high"),
+    ],
+    available_restaurants=[
+        Restaurant(name="Garden Bistro", cuisine="Mediterranean", price_per_person=35, distance_miles=0.5, dietary_options=["vegetarian", "vegan"], capacity=20),
+        Restaurant(name="BBQ Pit", cuisine="American BBQ", price_per_person=30, distance_miles=1.0, dietary_options=[], capacity=40),
+    ],
+    contacts=[
+        Contact(name="Client_Jones", role="Client", priority_level=4, availability={"2026-04-26": ["10:00", "11:00", "12:00", "13:00"]}, dietary="vegetarian"),
+    ],
+    constraints=[
+        ConstraintDef(description="Demo meeting scheduled at 10:00", check_type="calendar_no_conflict", check_params={}),
+        ConstraintDef(description="Restaurant has vegetarian options", check_type="restaurant_match", check_params={"criteria": {"dietary": "vegetarian"}}),
+        ConstraintDef(description="Client_Jones receives itinerary", check_type="email_sent", check_params={"to": "Client_Jones", "keywords": ["itinerary", "10", "demo", "lunch"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="Client_Jones", required_keywords=["itinerary", "demo", "lunch", "10"], purpose="notify_reschedule"),
+    ],
+    optimal_steps=4,
+    max_steps=10,
+    expected_restaurant="Garden Bistro",
+)
+# ===================================================================
+# HARD — 8-15 tool calls, full cross-task cascade + SRE crisis
+# ===================================================================
+_HARD_011 = ScenarioDef(
+    scenario_id="hard_011",
+    difficulty="hard",
+    briefing=(
+        "VP_Chen just emailed: an important investor (Investor_Park) is in town tonight "
+        "(2026-04-25) and needs a dinner meeting. Investor_Park has a 9:00 PM flight "
+        "so dinner must end by 8:00 PM. Investor_Park is vegetarian. Your calendar: "
+        "6:00 PM Yoga (personal), 7:00 PM Team Happy Hour (you organised it and "
+        "promised the team last week). You must: find a restaurant near the airport "
+        "with vegetarian options under $60/pp, handle the calendar conflicts by "
+        "priority (investor > happy hour > yoga), and email everyone affected."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_90", title="Yoga", date="2026-04-25", time="18:00", duration_min=60, priority="low", is_personal=True),
+        CalendarEvent(event_id="evt_91", title="Team Happy Hour", date="2026-04-25", time="19:00", duration_min=120, participants=["Team"], priority="normal"),
+        CalendarEvent(event_id="evt_92", title="Afternoon Focus", date="2026-04-25", time="14:00", duration_min=120),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_90", sender="VP_Chen", subject="Investor dinner TONIGHT", body="Investor_Park is in town tonight only. We need dinner before their 9pm flight. They're vegetarian. Book something near the airport. This is top priority.", urgency="critical"),
+    ],
+    available_restaurants=[
+        Restaurant(name="Sky Lounge", cuisine="International", price_per_person=55, distance_miles=1.0, dietary_options=["vegetarian", "vegan", "gluten-free"], capacity=30, near_airport=True, has_private_room=True),
+        Restaurant(name="Terminal Grill", cuisine="American", price_per_person=35, distance_miles=0.5, dietary_options=["vegetarian"], capacity=50, near_airport=True),
+        Restaurant(name="Downtown Sushi", cuisine="Japanese", price_per_person=45, distance_miles=8.0, dietary_options=["vegetarian"], capacity=20),
+        Restaurant(name="Fancy Steak", cuisine="Steakhouse", price_per_person=70, distance_miles=0.8, dietary_options=[], capacity=40, near_airport=True),
+    ],
+    contacts=[
+        Contact(name="VP_Chen", role="VP Engineering", priority_level=5),
+        Contact(name="Investor_Park", role="Investor", priority_level=5, dietary="vegetarian"),
+        Contact(name="Team", role="Engineering Team", priority_level=2),
+    ],
+    constraints=[
+        ConstraintDef(description="Restaurant near airport", check_type="restaurant_match", check_params={"criteria": {"near_airport": True}}),
+        ConstraintDef(description="Restaurant has vegetarian options", check_type="restaurant_match", check_params={"criteria": {"dietary": "vegetarian"}}),
+        ConstraintDef(description="Restaurant under $60/pp", check_type="restaurant_match", check_params={"criteria": {"max_price": 60}}),
+        ConstraintDef(description="Yoga cancelled (lower priority)", check_type="event_cancelled", check_params={"event_id": "evt_90"}),
+        ConstraintDef(description="Team notified about Happy Hour change", check_type="email_sent", check_params={"to": "Team", "keywords": ["happy hour", "reschedule", "sorry"]}),
+        ConstraintDef(description="VP_Chen sent dinner plan", check_type="email_sent", check_params={"to": "VP_Chen", "keywords": ["dinner", "restaurant", "investor"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="Team", required_keywords=["happy hour", "reschedule", "sorry", "apologi"], purpose="renegotiate"),
+        CommunicationReq(to="VP_Chen", required_keywords=["dinner", "restaurant", "investor", "vegetarian"], purpose="acknowledge"),
+    ],
+    optimal_steps=7,
+    max_steps=15,
+    expected_restaurant="Sky Lounge",
+    expected_cancelled_events=["evt_90"],
+)
+_HARD_012 = ScenarioDef(
+    scenario_id="hard_012",
+    difficulty="hard",
+    briefing=(
+        "Three VPs all want Conference Room Alpha at 2:00 PM today (2026-04-25) for "
+        "different meetings. VP_Chen: Board Prep (critical). VP_Lee: Client Demo "
+        "(high). VP_Kumar: Team Retro (normal). You must assess priority, keep the "
+        "highest-priority meeting in Alpha, propose alternative rooms/times for the "
+        "other two, and send diplomatic emails to all three VPs."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_100", title="Board Prep", date="2026-04-25", time="14:00", duration_min=60, participants=["VP_Chen"], location="Alpha", priority="critical"),
+        CalendarEvent(event_id="evt_101", title="Client Demo", date="2026-04-25", time="14:00", duration_min=60, participants=["VP_Lee", "Client_Jones"], location="Alpha", priority="high"),
+        CalendarEvent(event_id="evt_102", title="Team Retro", date="2026-04-25", time="14:00", duration_min=60, participants=["VP_Kumar", "Team"], location="Alpha", priority="normal"),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_100", sender="Admin", subject="Room conflict alert", body="Conference Room Alpha has 3 bookings at 2pm. Please resolve.", urgency="critical"),
+    ],
+    contacts=[
+        Contact(name="VP_Chen", role="VP Engineering", priority_level=5),
+        Contact(name="VP_Lee", role="VP Sales", priority_level=4),
+        Contact(name="VP_Kumar", role="VP Product", priority_level=3),
+    ],
+    constraints=[
+        ConstraintDef(description="Board Prep stays in Alpha at 14:00", check_type="event_exists", check_params={"event_id": "evt_100"}),
+        ConstraintDef(description="No calendar conflicts after resolution", check_type="calendar_no_conflict", check_params={}),
+        ConstraintDef(description="VP_Lee notified of room change", check_type="email_sent", check_params={"to": "VP_Lee", "keywords": ["room", "move", "demo"]}),
+        ConstraintDef(description="VP_Kumar notified of room change", check_type="email_sent", check_params={"to": "VP_Kumar", "keywords": ["room", "move", "retro"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="VP_Lee", required_keywords=["room", "move", "alternative", "apologi"], purpose="renegotiate"),
+        CommunicationReq(to="VP_Kumar", required_keywords=["room", "move", "alternative", "apologi"], purpose="renegotiate"),
+    ],
+    optimal_steps=6,
+    max_steps=15,
+)
+_HARD_013 = ScenarioDef(
+    scenario_id="hard_013",
+    difficulty="hard",
+    briefing=(
+        "Triple crisis on 2026-04-25: (1) Your 4:00 PM flight (evt_110) was cancelled — "
+        "you need to rebook before the board prep (evt_111) tomorrow. "
+        "(2) Board prep moved from 4:00 PM to 2:00 PM tomorrow (2026-04-26), "
+        "conflicting with your lunch with Client_Jones (evt_112). "
+        "(3) Your dinner reservation at Downtown Sushi was lost. "
+        "Handle all three crises: rebook flight constraints, reschedule lunch "
+        "with Client_Jones, find a new dinner restaurant, email all affected parties."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_110", title="Flight to NYC", date="2026-04-25", time="16:00", duration_min=180, priority="high"),
+        CalendarEvent(event_id="evt_111", title="Board Prep", date="2026-04-26", time="16:00", duration_min=120, participants=["VP_Chen"], priority="critical"),
+        CalendarEvent(event_id="evt_112", title="Lunch with Client_Jones", date="2026-04-26", time="12:00", duration_min=90, participants=["Client_Jones"], priority="high"),
+        CalendarEvent(event_id="evt_113", title="Morning Standup", date="2026-04-26", time="09:00", duration_min=30, participants=["Team"]),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_110", sender="Airline", subject="Flight cancelled", body="Your flight at 4:00 PM today has been cancelled. Next available flight: 6:00 PM or 8:00 PM.", urgency="critical"),
+        InboxEmail(email_id="em_111", sender="VP_Chen", subject="Board prep moved up", body="Board prep is now at 2pm tomorrow instead of 4pm. Non-negotiable.", urgency="critical"),
+        InboxEmail(email_id="em_112", sender="Downtown Sushi", subject="Reservation cancelled", body="We regret to inform you that we had to cancel your reservation due to a private event.", urgency="high"),
+    ],
+    available_restaurants=[
+        Restaurant(name="Sakura Garden", cuisine="Japanese", price_per_person=40, distance_miles=2.0, dietary_options=["vegetarian", "vegan"], capacity=25),
+        Restaurant(name="Pizza Corner", cuisine="Italian", price_per_person=25, distance_miles=1.0, dietary_options=["vegetarian"], capacity=30),
+    ],
+    contacts=[
+        Contact(name="VP_Chen", role="VP Engineering", priority_level=5),
+        Contact(name="Client_Jones", role="Client", priority_level=4, availability={"2026-04-26": ["09:30", "10:00", "11:00"]}),
+    ],
+    constraints=[
+        ConstraintDef(description="Board Prep rescheduled to 14:00", check_type="calendar_no_conflict", check_params={}),
+        ConstraintDef(description="Client_Jones notified of lunch reschedule", check_type="email_sent", check_params={"to": "Client_Jones", "keywords": ["lunch", "reschedule", "move"]}),
+        ConstraintDef(description="New dinner restaurant booked", check_type="restaurant_match", check_params={"criteria": {}}),
+        ConstraintDef(description="VP_Chen acknowledged board prep change", check_type="email_sent", check_params={"to": "VP_Chen", "keywords": ["board", "prep", "2pm", "confirmed"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="Client_Jones", required_keywords=["lunch", "reschedule", "sorry", "alternative"], purpose="renegotiate"),
+        CommunicationReq(to="VP_Chen", required_keywords=["board", "prep", "confirmed"], purpose="acknowledge"),
+    ],
+    optimal_steps=8,
+    max_steps=15,
+)
+_HARD_014 = ScenarioDef(
+    scenario_id="hard_014",
+    difficulty="hard",
+    briefing=(
+        "VP_Chen asks you to schedule a meeting with Client_Jones 'sometime this week' "
+        "(2026-04-21 to 2026-04-25). Client_Jones privately told you they're unavailable "
+        "Mon-Wed due to a family emergency — this is confidential. VP_Chen doesn't know. "
+        "You must propose Thu/Fri slots without revealing Client_Jones's private reason. "
+        "Navigate the information asymmetry diplomatically."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_120", title="Team Sync", date="2026-04-24", time="10:00", duration_min=60, participants=["Team"]),
+        CalendarEvent(event_id="evt_121", title="1-on-1 with VP_Chen", date="2026-04-25", time="14:00", duration_min=30, participants=["VP_Chen"]),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_120", sender="VP_Chen", subject="Meeting with Jones", body="Can you set up a meeting with Client_Jones this week? 30 minutes. Any day works for me.", urgency="high"),
+        InboxEmail(email_id="em_121", sender="Client_Jones", subject="Availability - confidential", body="I'm dealing with a family emergency Mon-Wed. I'd prefer to keep this private. I'm free Thu after 2pm and all day Friday.", urgency="normal", context_hint="CONFIDENTIAL: do not share reason with VP_Chen"),
+    ],
+    contacts=[
+        Contact(name="VP_Chen", role="VP Engineering", priority_level=5, availability={"2026-04-24": ["09:00", "10:00", "14:00", "15:00"], "2026-04-25": ["09:00", "10:00", "15:00", "16:00"]}),
+        Contact(name="Client_Jones", role="Client", priority_level=4, availability={"2026-04-24": ["14:00", "15:00", "16:00"], "2026-04-25": ["09:00", "10:00", "11:00", "14:00", "15:00"]}),
+    ],
+    constraints=[
+        ConstraintDef(description="Meeting scheduled Thu or Fri only", check_type="calendar_no_conflict", check_params={}),
+        ConstraintDef(description="VP_Chen notified of proposed time", check_type="email_sent", check_params={"to": "VP_Chen", "keywords": ["Thursday", "Friday", "Client_Jones", "slot"]}),
+        ConstraintDef(description="Client_Jones notified", check_type="email_sent", check_params={"to": "Client_Jones", "keywords": ["meeting", "VP", "time"]}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="VP_Chen", required_keywords=["Thursday", "Friday", "Client_Jones", "available"], purpose="propose_slots"),
+        CommunicationReq(to="Client_Jones", required_keywords=["meeting", "time", "VP_Chen"], purpose="propose_slots"),
+    ],
+    optimal_steps=5,
+    max_steps=12,
+)
+_HARD_015 = ScenarioDef(
+    scenario_id="hard_015",
+    difficulty="hard",
+    briefing=(
+        "PRODUCTION INCIDENT: At 11:45 AM on 2026-04-25, PagerDuty fires — "
+        "payment-service is returning 503s with 94% error rate. HikariPool connection "
+        "pool exhausted. You're the on-call engineer.\n\n"
+        "Your existing commitments today:\n"
+        "- 12:00 PM: Team lunch at Garden Bistro (you organised, 6 people attending)\n"
+        "- 2:00 PM: Client demo with Client_Jones (promised last week)\n"
+        "- 3:30 PM: 1-on-1 with VP_Chen\n"
+        "- 6:00 PM: Personal dinner reservation\n\n"
+        "You must triage the incident (acknowledge, page backup), handle your "
+        "commitments (which ones to keep, which to reschedule), and properly "
+        "notify everyone affected. The incident is highest priority."
+    ),
+    initial_calendar=[
+        CalendarEvent(event_id="evt_130", title="Team Lunch", date="2026-04-25", time="12:00", duration_min=90, participants=["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"], location="Garden Bistro", priority="normal"),
+        CalendarEvent(event_id="evt_131", title="Client Demo", date="2026-04-25", time="14:00", duration_min=60, participants=["Client_Jones"], priority="high"),
+        CalendarEvent(event_id="evt_132", title="1-on-1 with VP_Chen", date="2026-04-25", time="15:30", duration_min=30, participants=["VP_Chen"], priority="high"),
+        CalendarEvent(event_id="evt_133", title="Dinner", date="2026-04-25", time="18:00", duration_min=120, priority="low", is_personal=True),
+    ],
+    initial_inbox=[
+        InboxEmail(email_id="em_130", sender="PagerDuty", subject="[CRITICAL] payment-service 503 — 94% error rate", body="payment-service ERROR HikariPool-1 Connection not available, timed out after 30000ms. Active: 10, Idle: 0, Waiting: 47. Circuit breaker OPEN.", urgency="critical"),
+    ],
+    contacts=[
+        Contact(name="VP_Chen", role="VP Engineering", priority_level=5),
+        Contact(name="Client_Jones", role="Client", priority_level=4),
+        Contact(name="Team", role="Engineering Team", priority_level=2),
+        Contact(name="Alice", role="Engineer (Backup On-Call)", priority_level=3),
+    ],
+    constraints=[
+        ConstraintDef(description="Incident acknowledged via email", check_type="email_sent", check_params={"to": "Team", "keywords": ["incident", "payment", "503"]}),
+        ConstraintDef(description="Team lunch cancelled or rescheduled", check_type="event_cancelled", check_params={"event_id": "evt_130"}),
+        ConstraintDef(description="Client_Jones notified of demo reschedule", check_type="email_sent", check_params={"to": "Client_Jones", "keywords": ["reschedule", "demo", "apologi"]}),
+        ConstraintDef(description="VP_Chen informed of incident", check_type="email_sent", check_params={"to": "VP_Chen", "keywords": ["incident", "payment", "on-call"]}),
+        ConstraintDef(description="No unresolved calendar conflicts", check_type="calendar_no_conflict", check_params={}),
+    ],
+    communication_requirements=[
+        CommunicationReq(to="Team", required_keywords=["incident", "payment", "cancel", "lunch"], purpose="notify_reschedule"),
+        CommunicationReq(to="Client_Jones", required_keywords=["reschedule", "demo", "sorry", "apologi", "production"], purpose="renegotiate"),
+        CommunicationReq(to="VP_Chen", required_keywords=["incident", "payment", "1-on-1", "reschedule"], purpose="renegotiate"),
+    ],
+    optimal_steps=8,
+    max_steps=15,
+    expected_cancelled_events=["evt_130"],
+)
+# ===================================================================
+# Registry helpers
+# ===================================================================
+_ALL_SCENARIOS: Dict[str, ScenarioDef] = {
+    s.scenario_id: s
+    for s in [
+        _EASY_001, _EASY_002, _EASY_003, _EASY_004, _EASY_005,
+        _MED_006, _MED_007, _MED_008, _MED_009, _MED_010,
+        _HARD_011, _HARD_012, _HARD_013, _HARD_014, _HARD_015,
+    ]
+}
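+def get_all_scenarios() -> Dict[str, ScenarioDef]:
+    """Return the full registry, keyed by scenario_id (the shared dict, not a copy)."""
+    return _ALL_SCENARIOS
+def get_scenario(scenario_id: str) -> Optional[ScenarioDef]:
+    """Return the scenario with the given ID, or None if it is not registered."""
+    return _ALL_SCENARIOS.get(scenario_id)
+def get_scenarios_by_difficulty(difficulty: str) -> List[ScenarioDef]:
+    """Return all scenarios whose difficulty equals the given label."""
+    return [s for s in _ALL_SCENARIOS.values() if s.difficulty == difficulty]
+def get_scenario_ids_grouped() -> Dict[str, List[str]]:
+    """Return scenario IDs grouped by difficulty, in registration order."""
+    grouped: Dict[str, List[str]] = {"easy": [], "medium": [], "hard": []}
+    for s in _ALL_SCENARIOS.values():
+        grouped.setdefault(s.difficulty, []).append(s.scenario_id)
+    return grouped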
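
The registry helpers that close server/tasks.py can be exercised roughly as follows. This is a minimal sketch, not the repository's code: `ScenarioStub` and `REGISTRY` are hypothetical stand-ins for `ScenarioDef` and `_ALL_SCENARIOS`, carrying only the two fields the helpers read, so the block runs without importing `server.domain`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ScenarioStub:
    """Hypothetical stand-in for ScenarioDef; only the fields used below."""
    scenario_id: str
    difficulty: str


# Hypothetical miniature registry, keyed by scenario_id like _ALL_SCENARIOS.
REGISTRY: Dict[str, ScenarioStub] = {
    s.scenario_id: s
    for s in [
        ScenarioStub("easy_004", "easy"),
        ScenarioStub("med_008", "medium"),
        ScenarioStub("hard_015", "hard"),
        ScenarioStub("hard_011", "hard"),
    ]
}


def get_scenario(scenario_id: str) -> Optional[ScenarioStub]:
    # Unknown IDs yield None instead of raising KeyError.
    return REGISTRY.get(scenario_id)


def get_scenario_ids_grouped() -> Dict[str, List[str]]:
    # Pre-seed the three known difficulty labels so each key is always present,
    # even when no scenario of that difficulty is registered.
    grouped: Dict[str, List[str]] = {"easy": [], "medium": [], "hard": []}
    for s in REGISTRY.values():
        grouped.setdefault(s.difficulty, []).append(s.scenario_id)
    return grouped


grouped = get_scenario_ids_grouped()
missing = get_scenario("does_not_exist")
print(grouped)
print(missing)
```

Because the grouping iterates over the dict's values, IDs within each difficulty follow registration order, so a caller sampling training episodes by difficulty sees a stable ordering between runs.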

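
Several scenarios above use the `calendar_no_conflict` check type. The grader that evaluates it lives in server/graders.py and is not shown in this part of the diff, so the following is only a sketch of what such a check might compute, assuming a conflict means two events whose time intervals overlap on the same date. `Event`, `interval`, and `find_conflicts` are hypothetical names; the sample data mirrors med_006 after Design Review (evt_40) has been moved to 15:00 and nothing else has been changed.

```python
from datetime import datetime, timedelta
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    """Hypothetical slimmed-down CalendarEvent with only the timing fields."""
    event_id: str
    date: str          # "YYYY-MM-DD"
    time: str          # "HH:MM", 24-hour clock
    duration_min: int


def interval(ev: Event) -> Tuple[datetime, datetime]:
    # Turn the date/time strings plus duration into a [start, end) interval.
    start = datetime.strptime(f"{ev.date} {ev.time}", "%Y-%m-%d %H:%M")
    return start, start + timedelta(minutes=ev.duration_min)


def find_conflicts(events: List[Event]) -> List[Tuple[str, str]]:
    # Two half-open intervals overlap when each starts before the other ends.
    # Back-to-back events (one ends at 16:00, the next starts at 16:00) do not.
    conflicts = []
    for a, b in combinations(events, 2):
        a_start, a_end = interval(a)
        b_start, b_end = interval(b)
        if a_start < b_end and b_start < a_end:
            conflicts.append((a.event_id, b.event_id))
    return conflicts


events = [
    Event("evt_40", "2026-04-25", "15:00", 60),  # Design Review, moved to 15:00
    Event("evt_41", "2026-04-25", "15:00", 60),  # Sprint Planning
    Event("evt_42", "2026-04-25", "16:00", 60),  # Demo Prep
    Event("evt_43", "2026-04-25", "09:00", 30),  # Morning Standup
]
conflicts = find_conflicts(events)
print(conflicts)
```

Under this assumption the only remaining clash is between evt_40 and evt_41, which is the cascade the med_006 briefing asks the agent to resolve by rescheduling Sprint Planning.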
server/world.py ADDED

	@@ -0,0 +1,290 @@

+"""Simulated personal world — calendar, contacts, restaurants, email state."""
+from __future__ import annotations
+from copy import deepcopy
+from typing import Any, Dict, List, Optional
+from server.domain import (
+    CalendarEvent,
+    Commitment,
+    Contact,
+    InboxEmail,
+    Restaurant,
+    ScenarioDef,
+)
+class WorldState:
+    """Mutable in-memory state for a single episode."""
+    def __init__(self, scenario: ScenarioDef) -> None:
+        self.scenario = scenario
+        self.calendar: Dict[str, CalendarEvent] = {
+            e.event_id: deepcopy(e) for e in scenario.initial_calendar
+        }
+        self.contacts: Dict[str, Contact] = {
+            c.name: deepcopy(c) for c in scenario.contacts
+        }
+        self.restaurants: Dict[str, Restaurant] = {
+            r.name: deepcopy(r) for r in scenario.available_restaurants
+        }
+        self.inbox: List[InboxEmail] = deepcopy(scenario.initial_inbox)
+        self.emails_sent: List[Dict[str, Any]] = []  # "turn" is stored as an int
+        self.commitment_ledger: List[Commitment] = []
+        self.step_count: int = 0
+        self.booked_restaurant: str = ""
+        self._next_event_id: int = 100
+    # ------------------------------------------------------------------
+    # Tool implementations
+    # ------------------------------------------------------------------
+    def view_calendar(self, date: str) -> str:
+        events = [
+            e for e in self.calendar.values()
+            if e.date == date
+        ]
+        if not events:
+            return f"No events on {date}."
+        events.sort(key=lambda e: e.time)
+        lines = [f"Calendar for {date}:"]
+        for ev in events:
+            parts = ev.participants
+            part_str = f" with {', '.join(parts)}" if parts else ""
+            loc_str = f" at {ev.location}" if ev.location else ""
+            lines.append(
+                f"  [{ev.event_id}] {ev.time} ({ev.duration_min}min) "
+                f"{ev.title}{part_str}{loc_str} "
+                f"[priority={ev.priority}]"
+            )
+        return "\n".join(lines)
+    def check_availability(self, person: str) -> str:
+        contact = self.contacts.get(person)
+        if contact is None:
+            return f"Contact '{person}' not found."
+        if not contact.availability:
+            return f"{person} has no availability information on file."
+        lines = [f"Availability for {person} (role: {contact.role}):"]
+        for date, slots in sorted(contact.availability.items()):
+            lines.append(f"  {date}: {', '.join(slots)}")
+        if contact.dietary:
+            lines.append(f"  Dietary: {contact.dietary}")
+        return "\n".join(lines)
+    def search_restaurants(
+        self,
+        cuisine: str = "",
+        max_price: int = 0,
+        dietary: str = "",
+        max_distance_miles: float = 0.0,
+        near_airport: bool = False,
+    ) -> str:
+        """Filter restaurants; an empty string, zero, or False disables that filter."""
+        matches: List[Restaurant] = []
+        for r in self.restaurants.values():
+            if cuisine and cuisine.lower() not in r.cuisine.lower():
+                continue
+            if max_price > 0 and r.price_per_person > max_price:
+                continue
+            if dietary and dietary.lower() not in [d.lower() for d in r.dietary_options]:
+                continue
+            if max_distance_miles > 0 and r.distance_miles > max_distance_miles:
+                continue
+            if near_airport and not r.near_airport:
+                continue
+            matches.append(r)
+        if not matches:
+            return "No restaurants match your criteria."
+        lines = ["Matching restaurants:"]
+        for r in matches:
+            lines.append(
+                f"  {r.name} — {r.cuisine}, ${r.price_per_person}/pp, "
+                f"{r.distance_miles}mi, dietary: {', '.join(r.dietary_options)}, "
+                f"capacity: {r.capacity}, hours: {r.hours}"
+                f"{', near airport' if r.near_airport else ''}"
+                f"{', private room' if r.has_private_room else ''}"
+            )
+        return "\n".join(lines)
+    def schedule_meeting(
+        self,
+        title: str,
+        date: str,
+        time: str,
+        duration_min: int = 60,
+        participants: Optional[List[str]] = None,
+        location: str = "",
+        turn: int = 0,
+    ) -> str:
+        conflict = self._find_conflict(date, time, duration_min)
+        if conflict is not None:
+            return (
+                f"CONFLICT: '{title}' at {time} overlaps with "
+                f"'{conflict.title}' at {conflict.time}. "
+                f"Resolve the conflict first."
+            )
+        eid = f"evt_{self._next_event_id}"
+        self._next_event_id += 1
+        event = CalendarEvent(
+            event_id=eid,
+            title=title,
+            date=date,
+            time=time,
+            duration_min=duration_min,
+            participants=participants or [],
+            location=location,
+        )
+        self.calendar[eid] = event
+        self.commitment_ledger.append(Commitment(
+            turn_created=turn,
+            commitment_type="meeting_scheduled",
+            description=f"{time} {title} on {date}",
+            constraint=f"{date}T{time}",
+            to_whom=", ".join(participants or ["self"]),
+        ))
+        return f"Meeting scheduled: [{eid}] {date} {time} — {title}"
+    def reschedule_event(self, event_id: str, new_time: str, turn: int = 0) -> str:
+        event = self.calendar.get(event_id)
+        if event is None:
+            return f"Event '{event_id}' not found."
+        conflict = self._find_conflict(event.date, new_time, event.duration_min, exclude=event_id)
+        if conflict is not None:
+            return (
+                f"CONFLICT: moving '{event.title}' to {new_time} would overlap "
+                f"with '{conflict.title}' at {conflict.time}."
+            )
+        old_time = event.time
+        event.time = new_time
+        for c in self.commitment_ledger:
+            if c.active and c.constraint == f"{event.date}T{old_time}":
+                c.active = False
+                c.renegotiated_at = turn
+        self.commitment_ledger.append(Commitment(
+            turn_created=turn,
+            commitment_type="meeting_scheduled",
+            description=f"{new_time} {event.title} on {event.date} (rescheduled from {old_time})",
+            constraint=f"{event.date}T{new_time}",
+            to_whom=", ".join(event.participants) if event.participants else "self",
+        ))
+        return f"Rescheduled [{event_id}] '{event.title}' from {old_time} to {new_time}."
+    def cancel_event(self, event_id: str, turn: int = 0) -> str:
+        event = self.calendar.pop(event_id, None)
+        if event is None:
+            return f"Event '{event_id}' not found."
+        for c in self.commitment_ledger:
+            if c.active and c.constraint == f"{event.date}T{event.time}":
+                if event.is_personal:
+                    c.active = False
+                    c.renegotiated_at = turn
+                # non-personal cancellations remain active until email is sent
+        return f"Cancelled [{event_id}] '{event.title}' at {event.time} on {event.date}."
+    def send_email(self, to: str, subject: str, body: str, turn: int = 0) -> str:
+        self.emails_sent.append({
+            "to": to,
+            "subject": subject,
+            "body": body,
+            "turn": turn,
+        })
+        body_lower = body.lower()
+        renegotiation_keywords = ["reschedule", "move", "cancel", "change", "instead", "alternative", "postpone"]
+        is_renegotiation = any(kw in body_lower for kw in renegotiation_keywords)
+        if is_renegotiation:
+            for c in self.commitment_ledger:
+                if c.active and to.lower() in c.to_whom.lower():
+                    c.renegotiated_at = turn
+        return f"Email sent to {to}: '{subject}'"
+    def book_restaurant(self, restaurant_name: str, turn: int = 0) -> str:
+        r = self.restaurants.get(restaurant_name)
+        if r is None:
+            return f"Restaurant '{restaurant_name}' not found."
+        self.booked_restaurant = restaurant_name
+        self.commitment_ledger.append(Commitment(
+            turn_created=turn,
+            commitment_type="reservation_made",
+            description=f"Reservation at {restaurant_name}",
+            constraint=restaurant_name,
+            to_whom="group",
+        ))
+        return f"Reservation confirmed at {restaurant_name}."
+    # ------------------------------------------------------------------
+    # Internal helpers
+    # ------------------------------------------------------------------
+    def _find_conflict(
+        self, date: str, time: str, duration_min: int, exclude: str = "",
+    ) -> Optional[CalendarEvent]:
+        new_start = _time_to_min(time)
+        new_end = new_start + duration_min
+        for eid, ev in self.calendar.items():
+            if eid == exclude:
+                continue
+            if ev.date != date:
+                continue
+            ev_start = _time_to_min(ev.time)
+            ev_end = ev_start + ev.duration_min
+            # Half-open interval overlap: back-to-back events do not conflict.
+            if new_start < ev_end and new_end > ev_start:
+                return ev
+        return None
+    def get_calendar_snapshot(self) -> List[Dict[str, Any]]:
+        return [ev.model_dump() for ev in sorted(self.calendar.values(), key=lambda e: (e.date, e.time))]
+    def get_inbox_snapshot(self) -> List[Dict[str, Any]]:
+        return [e.model_dump(exclude={"context_hint"}) for e in self.inbox]
+    def get_active_commitments(self) -> List[Commitment]:
+        return [c for c in self.commitment_ledger if c.active]
+    def get_silent_violations(self) -> List[Commitment]:
+        """Commitments that are still active but whose constraint no longer holds."""
+        violations: List[Commitment] = []
+        for c in self.commitment_ledger:
+            if not c.active:
+                continue
+            if c.renegotiated_at is not None:
+                continue
+            if c.commitment_type == "meeting_scheduled":
+                time_key = c.constraint
+                parts = time_key.split("T")
+                if len(parts) == 2:
+                    date_str, time_str = parts
+                    found = any(
+                        ev.date == date_str and ev.time == time_str
+                        for ev in self.calendar.values()
+                    )
+                    if not found:
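+                        # Match the recipient against to_whom the same way
+                        # send_email does, so multi-participant commitments count.
+                        has_email = any(
+                            em.get("to", "").lower() in c.to_whom.lower()
+                            for em in self.emails_sent
+                            if em.get("to")
+                        )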
+                        if not has_email:
+                            violations.append(c)
+        return violations
+def _time_to_min(t: str) -> int:
+    """Convert 'HH:MM' to minutes since midnight."""
+    parts = t.split(":")
+    return int(parts[0]) * 60 + int(parts[1])
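
The conflict check in `_find_conflict` relies on the standard overlap test for half-open intervals, built on the same minutes-since-midnight conversion as `_time_to_min`. The sketch below isolates that logic so it can be tried on its own. `overlaps` is a hypothetical helper name, not a function in this repository, and it assumes both events fall on the same date and use 'HH:MM' times.

```python
def time_to_min(t: str) -> int:
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = t.split(":")
    return int(hours) * 60 + int(minutes)


def overlaps(start_a: str, dur_a: int, start_b: str, dur_b: int) -> bool:
    """Return True when two same-day events share at least one minute.

    Each event is treated as the half-open interval [start, start + duration),
    so an event ending at 15:00 does not clash with one starting at 15:00.
    """
    a0 = time_to_min(start_a)
    a1 = a0 + dur_a
    b0 = time_to_min(start_b)
    b1 = b0 + dur_b
    return a0 < b1 and a1 > b0


if __name__ == "__main__":
    print(overlaps("14:00", 60, "14:30", 30))  # partial overlap
    print(overlaps("14:00", 60, "15:00", 30))  # back-to-back
```

One consequence of the half-open convention is that an agent can move a meeting into the slot directly after another one without triggering a CONFLICT response.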

tests/__init__.py ADDED Viewed

File without changes

tests/test_environment.py ADDED Viewed

	@@ -0,0 +1,523 @@

+"""Comprehensive test suite for CommitmentOS.
+Tests cover:
+  - Grader (perfect/partial/zero for each component)
+  - Environment lifecycle (reset/step/state/multi-turn)
+  - Commitment ledger (creation, violation, renegotiation)
+  - Task dataset integrity
+  - API endpoints
+  - Difficulty verification
+"""
+from __future__ import annotations
+import sys
+from pathlib import Path
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+import json
+from typing import Any, Dict
+import pytest
+from models import CommitmentAction, CommitmentObservation, CommitmentState
+from server.domain import CalendarEvent, ConstraintDef, ScenarioDef
+from server.environment import CommitmentEnvironment
+from server.graders import (
+    _calendar_has_no_overlaps,
+    _keyword_score,
+    _score_commitment_coherence,
+    _score_conflict_resolution,
+    _score_step_efficiency,
+    grade_scenario,
+)
+from server.tasks import get_all_scenarios, get_scenario, get_scenarios_by_difficulty
+from server.world import WorldState, _time_to_min
+# ===================================================================
+# Fixtures
+# ===================================================================
+@pytest.fixture
+def env() -> CommitmentEnvironment:
+    return CommitmentEnvironment()
+@pytest.fixture
+def easy_env(env: CommitmentEnvironment) -> CommitmentEnvironment:
+    env.reset(task_id="easy_001")
+    return env
+# ===================================================================
+# 1. Task dataset integrity
+# ===================================================================
+class TestTaskDataset:
+    def test_15_scenarios_loaded(self) -> None:
+        scenarios = get_all_scenarios()
+        assert len(scenarios) == 15
+    def test_5_easy_5_medium_5_hard(self) -> None:
+        for difficulty, count in [("easy", 5), ("medium", 5), ("hard", 5)]:
+            tasks = get_scenarios_by_difficulty(difficulty)
+            assert len(tasks) == count, f"Expected {count} {difficulty} tasks, got {len(tasks)}"
+    def test_each_scenario_has_required_fields(self) -> None:
+        for sid, scenario in get_all_scenarios().items():
+            assert scenario.scenario_id == sid
+            assert scenario.difficulty in ("easy", "medium", "hard")
+            assert len(scenario.briefing) > 20, f"{sid}: briefing too short"
+            assert scenario.optimal_steps >= 2, f"{sid}: optimal_steps too low"
+            assert scenario.max_steps >= scenario.optimal_steps
+            assert len(scenario.constraints) >= 1, f"{sid}: no constraints defined"
+    def test_scenario_ids_unique(self) -> None:
+        ids = list(get_all_scenarios().keys())
+        assert len(ids) == len(set(ids))
+    def test_get_scenario_returns_none_for_missing(self) -> None:
+        assert get_scenario("nonexistent_999") is None
+    def test_get_scenario_returns_correct(self) -> None:
+        s = get_scenario("easy_001")
+        assert s is not None
+        assert s.difficulty == "easy"
+# ===================================================================
+# 2. Grader unit tests
+# ===================================================================
+class TestKeywordScore:
+    def test_full_match(self) -> None:
+        score, matched = _keyword_score("I need to reschedule the standup meeting", ["reschedule", "standup"], min_matches=2)
+        assert score == 1.0
+        assert len(matched) == 2
+    def test_partial_match(self) -> None:
+        score, matched = _keyword_score("I need to reschedule", ["reschedule", "standup"], min_matches=2)
+        assert score == 0.5
+        assert len(matched) == 1
+    def test_no_match(self) -> None:
+        score, matched = _keyword_score("Hello world", ["reschedule", "standup"], min_matches=2)
+        assert score == 0.0
+        assert len(matched) == 0
+    def test_case_insensitive(self) -> None:
+        score, _ = _keyword_score("RESCHEDULE THE STANDUP", ["reschedule", "standup"], min_matches=2)
+        assert score == 1.0
+class TestCalendarConflicts:
+    def test_no_conflicts(self) -> None:
+        scenario = get_scenario("easy_002")
+        assert scenario is not None
+        world = WorldState(scenario)
+        assert _calendar_has_no_overlaps(world) is True
+    def test_conflict_detected(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        assert _calendar_has_no_overlaps(world) is False
+class TestCommitmentCoherence:
+    def test_no_commitments_full_score(self) -> None:
+        scenario = get_scenario("easy_005")
+        assert scenario is not None
+        world = WorldState(scenario)
+        score, _ = _score_commitment_coherence(world)
+        assert score == 1.0
+    def test_honored_commitment(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_001")
+        env.step(CommitmentAction(action_type="reschedule_event", event_id="evt_2", new_time="15:00"))
+        assert env._world is not None
+        score, feedback = _score_commitment_coherence(env._world)
+        assert score == 1.0
+    def test_silent_violation_detected(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_001")
+        env.step(CommitmentAction(action_type="schedule_meeting", title="New Meeting", date="2026-04-25", time="16:00", participants=["Alice"]))
+        assert env._world is not None
+        # Remove the new event behind the ledger's back (no cancel_event, no email)
+        # so that its commitment is silently broken.
+        env._world.calendar.pop("evt_100", None)
+        for eid, ev in list(env._world.calendar.items()):
+            if ev.time == "16:00" and ev.date == "2026-04-25" and ev.title == "New Meeting":
+                del env._world.calendar[eid]
+        violations = env._world.get_silent_violations()
+        assert len(violations) >= 1
+class TestStepEfficiency:
+    def test_optimal_steps(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        world.step_count = 3
+        score, _ = _score_step_efficiency(scenario, world)
+        assert score == 1.0
+    def test_over_optimal(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        world.step_count = 8
+        score, _ = _score_step_efficiency(scenario, world)
+        assert score == 0.5
+# ===================================================================
+# 3. Environment lifecycle
+# ===================================================================
+class TestEnvironmentLifecycle:
+    def test_reset_returns_observation(self, env: CommitmentEnvironment) -> None:
+        obs = env.reset(task_id="easy_001")
+        assert isinstance(obs, CommitmentObservation)
+        assert obs.scenario_id == "easy_001"
+        assert obs.done is False
+        assert obs.reward == 0.0
+        assert len(obs.briefing) > 0
+    def test_step_before_reset_raises(self, env: CommitmentEnvironment) -> None:
+        with pytest.raises(ValueError, match="No active episode"):
+            env.step(CommitmentAction(action_type="view_calendar", date="2026-04-25"))
+    def test_step_after_done_raises(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_001")
+        env.step(CommitmentAction(action_type="submit_plan"))
+        with pytest.raises(ValueError, match="already completed"):
+            env.step(CommitmentAction(action_type="view_calendar", date="2026-04-25"))
+    def test_state_property(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_001")
+        state = env.state
+        assert isinstance(state, CommitmentState)
+        assert state.scenario_id == "easy_001"
+        assert state.completed is False
+        assert len(state.available_tasks) == 15
+    def test_multi_turn_episode(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_001")
+        obs = env.step(CommitmentAction(action_type="view_calendar", date="2026-04-25"))
+        assert obs.done is False
+        assert obs.step_number == 1
+        obs = env.step(CommitmentAction(action_type="reschedule_event", event_id="evt_2", new_time="15:00"))
+        assert obs.done is False
+        assert obs.step_number == 2
+        obs = env.step(CommitmentAction(action_type="submit_plan"))
+        assert obs.done is True
+        assert obs.reward > 0
+    def test_max_steps_auto_submits(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_002")
+        for _ in range(20):
+            obs = env.step(CommitmentAction(action_type="view_calendar", date="2026-04-25"))
+            if obs.done:
+                break
+        assert obs.done is True
+    def test_reset_clears_state(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_001")
+        env.step(CommitmentAction(action_type="view_calendar", date="2026-04-25"))
+        env.reset(task_id="easy_002")
+        assert env.state.scenario_id == "easy_002"
+        assert env.state.step_count == 0
+    def test_unknown_action_type(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_001")
+        obs = env.step(CommitmentAction(action_type="fly_to_moon"))
+        assert "Unknown action_type" in obs.tool_result
+    def test_random_reset(self, env: CommitmentEnvironment) -> None:
+        obs = env.reset(seed=42)
+        assert obs.scenario_id in get_all_scenarios()
+    def test_difficulty_filter_reset(self, env: CommitmentEnvironment) -> None:
+        obs = env.reset(difficulty="hard", seed=1)
+        assert obs.difficulty == "hard"
+# ===================================================================
+# 4. World simulation (tool functions)
+# ===================================================================
+class TestWorldTools:
+    def test_view_calendar(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        result = world.view_calendar("2026-04-25")
+        assert "evt_1" in result
+        assert "14:00" in result
+    def test_view_calendar_empty(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        result = world.view_calendar("2099-01-01")
+        assert "No events" in result
+    def test_check_availability(self) -> None:
+        scenario = get_scenario("easy_003")
+        assert scenario is not None
+        world = WorldState(scenario)
+        result = world.check_availability("Client_Jones")
+        assert "09:00" in result
+    def test_check_availability_unknown(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        result = world.check_availability("NonExistentPerson")
+        assert "not found" in result
+    def test_search_restaurants_filters(self) -> None:
+        scenario = get_scenario("med_007")
+        assert scenario is not None
+        world = WorldState(scenario)
+        result = world.search_restaurants(dietary="vegan", max_price=45, max_distance_miles=3.0)
+        assert "Green Garden" in result
+        assert "Steak House Prime" not in result
+    def test_schedule_meeting_creates_commitment(self) -> None:
+        scenario = get_scenario("easy_002")
+        assert scenario is not None
+        world = WorldState(scenario)
+        result = world.schedule_meeting("Test Meeting", "2026-04-25", "14:00", turn=1)
+        assert "scheduled" in result.lower()
+        assert len(world.commitment_ledger) == 1
+        assert world.commitment_ledger[0].commitment_type == "meeting_scheduled"
+    def test_schedule_meeting_conflict(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        result = world.schedule_meeting("Conflicting", "2026-04-25", "14:00", turn=1)
+        assert "CONFLICT" in result
+    def test_reschedule_event(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        result = world.reschedule_event("evt_2", "15:00", turn=1)
+        assert "Rescheduled" in result
+        assert world.calendar["evt_2"].time == "15:00"
+    def test_cancel_event(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        result = world.cancel_event("evt_2", turn=1)
+        assert "Cancelled" in result
+        assert "evt_2" not in world.calendar
+    def test_send_email(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        result = world.send_email("Team", "Hello", "Testing email body", turn=1)
+        assert "sent" in result.lower()
+        assert len(world.emails_sent) == 1
+    def test_book_restaurant(self) -> None:
+        scenario = get_scenario("easy_002")
+        assert scenario is not None
+        world = WorldState(scenario)
+        result = world.book_restaurant("Bella Italia", turn=1)
+        assert "confirmed" in result.lower()
+        assert world.booked_restaurant == "Bella Italia"
+# ===================================================================
+# 5. Commitment ledger behaviour
+# ===================================================================
+class TestCommitmentLedger:
+    def test_schedule_creates_commitment(self) -> None:
+        scenario = get_scenario("easy_002")
+        assert scenario is not None
+        world = WorldState(scenario)
+        world.schedule_meeting("Test", "2026-04-25", "10:00", turn=1)
+        assert len(world.commitment_ledger) == 1
+        c = world.commitment_ledger[0]
+        assert c.turn_created == 1
+        assert c.active is True
+        assert c.renegotiated_at is None
+    def test_reschedule_marks_old_renegotiated(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        world.reschedule_event("evt_2", "15:00", turn=1)
+        renegotiated = [c for c in world.commitment_ledger if c.renegotiated_at is not None]
+        assert len(renegotiated) == 0  # initial events don't create ledger entries
+        new_commits = [c for c in world.commitment_ledger if c.active]
+        assert len(new_commits) >= 1
+    def test_email_renegotiation_detection(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
+        world.schedule_meeting("Important", "2026-04-25", "16:00", participants=["Alice"], turn=1)
+        world.send_email("Alice", "Change of plans", "I need to reschedule our meeting", turn=2)
+        renegotiated = [c for c in world.commitment_ledger if c.renegotiated_at is not None]
+        assert len(renegotiated) >= 1
+    def test_cancel_personal_marks_renegotiated(self) -> None:
+        scenario = get_scenario("easy_001")
+        assert scenario is not None
+        world = WorldState(scenario)
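+        # evt_3 is Lunch (personal)
+        world.cancel_event("evt_3", turn=1)
+        # Personal cancellations are auto-OK: the event is gone and nothing is flagged.
+        assert "evt_3" not in world.calendar
+        assert world.get_silent_violations() == []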
+# ===================================================================
+# 6. Full scenario scoring
+# ===================================================================
+class TestFullScoring:
+    def test_perfect_easy_001(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_001")
+        env.step(CommitmentAction(action_type="reschedule_event", event_id="evt_2", new_time="15:00"))
+        env.step(CommitmentAction(action_type="send_email", to="Team", subject="Standup moved", body="Hi team, I've rescheduled the standup to 3:00 PM. Sorry for the move."))
+        obs = env.step(CommitmentAction(action_type="submit_plan"))
+        assert obs.done is True
+        assert obs.reward >= 0.85
+    def test_zero_effort_gets_low_score(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_001")
+        obs = env.step(CommitmentAction(action_type="submit_plan"))
+        assert obs.done is True
+        assert obs.reward <= 0.50
+    def test_hard_011_perfect_run(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="hard_011")
+        env.step(CommitmentAction(action_type="view_calendar", date="2026-04-25"))
+        env.step(CommitmentAction(action_type="cancel_event", event_id="evt_90"))
+        env.step(CommitmentAction(action_type="search_restaurants", dietary="vegetarian", near_airport=True, max_price=60))
+        env.step(CommitmentAction(action_type="book_restaurant", restaurant_name="Sky Lounge"))
+        env.step(CommitmentAction(action_type="send_email", to="Team", subject="Happy Hour Rescheduled", body="Sorry team, I need to reschedule the happy hour to Thursday. An investor dinner came up tonight. Apologies!"))
+        env.step(CommitmentAction(action_type="send_email", to="VP_Chen", subject="Investor dinner plan", body="I've booked Sky Lounge for dinner tonight with Investor_Park. Vegetarian options available, near the airport."))
+        obs = env.step(CommitmentAction(action_type="submit_plan"))
+        assert obs.done is True
+        assert obs.reward >= 0.85
+    def test_hard_015_sre_crisis(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="hard_015")
+        env.step(CommitmentAction(action_type="view_calendar", date="2026-04-25"))
+        env.step(CommitmentAction(action_type="cancel_event", event_id="evt_130"))
+        env.step(CommitmentAction(action_type="send_email", to="Team", subject="Lunch cancelled - incident", body="Team, I'm cancelling our lunch due to a production incident. Payment service returning 503s. Will handle this first."))
+        env.step(CommitmentAction(action_type="send_email", to="Client_Jones", subject="Demo reschedule needed", body="Hi Client_Jones, I sincerely apologize but I need to reschedule our demo. We have a production incident with the payment system. Can we find another time this week?"))
+        env.step(CommitmentAction(action_type="send_email", to="VP_Chen", subject="Incident + 1-on-1", body="VP_Chen, we have a production incident — payment service is returning 503s. I'm on-call and handling it. May need to reschedule our 1-on-1 depending on resolution time."))
+        obs = env.step(CommitmentAction(action_type="submit_plan"))
+        assert obs.done is True
+        assert obs.reward >= 0.60
+# ===================================================================
+# 7. Reward clamping
+# ===================================================================
+class TestRewardClamping:
+    def test_reward_never_zero(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_001")
+        obs = env.step(CommitmentAction(action_type="submit_plan"))
+        assert obs.reward >= 0.01
+    def test_reward_never_one(self, env: CommitmentEnvironment) -> None:
+        env.reset(task_id="easy_001")
+        env.step(CommitmentAction(action_type="reschedule_event", event_id="evt_2", new_time="15:00"))
+        env.step(CommitmentAction(action_type="send_email", to="Team", subject="Standup moved", body="Hi team, the standup is rescheduled to 3pm. Sorry for the move."))
+        obs = env.step(CommitmentAction(action_type="submit_plan"))
+        assert obs.reward <= 0.99
+        assert obs.reward > 0.01
+# ===================================================================
+# 8. Time utility
+# ===================================================================
+class TestTimeUtil:
+    def test_time_to_min(self) -> None:
+        assert _time_to_min("00:00") == 0
+        assert _time_to_min("09:30") == 570
+        assert _time_to_min("14:00") == 840
+        assert _time_to_min("23:59") == 1439
+# ===================================================================
+# 9. API endpoint tests (via TestClient)
+# ===================================================================
+class TestAPI:
+    @pytest.fixture
+    def client(self):
+        from fastapi.testclient import TestClient
+        from server.app import app
+        return TestClient(app)
+    def test_health(self, client) -> None:
+        resp = client.get("/health")
+        assert resp.status_code == 200
+    def test_tasks(self, client) -> None:
+        resp = client.get("/tasks")
+        assert resp.status_code == 200
+        data = resp.json()
+        assert len(data["easy"]) == 5
+        assert len(data["medium"]) == 5
+        assert len(data["hard"]) == 5
+    def test_reset_step_state(self, client) -> None:
+        resp = client.post("/reset", params={"task_id": "easy_001"})
+        assert resp.status_code == 200
+        resp = client.post("/step", json={"action": {"action_type": "view_calendar", "date": "2026-04-25"}})
+        assert resp.status_code == 200
+        data = resp.json()
+        assert data.get("done") is False
+        resp = client.get("/state")
+        assert resp.status_code == 200
+        state = resp.json()
+        assert "step_count" in state
+    def test_mcp_initialize(self, client) -> None:
+        resp = client.post("/mcp", json={
+            "jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {},
+        })
+        assert resp.status_code == 200
+        data = resp.json()
+        assert data["result"]["serverInfo"]["name"] == "commitment-os"
+    def test_mcp_tools_list(self, client) -> None:
+        resp = client.post("/mcp", json={
+            "jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {},
+        })
+        assert resp.status_code == 200
+        tools = resp.json()["result"]["tools"]
+        assert len(tools) == 3
+# ===================================================================
+# 10. Metadata
+# ===================================================================
+class TestMetadata:
+    def test_get_metadata(self, env: CommitmentEnvironment) -> None:
+        meta = env.get_metadata()
+        assert meta.name == "commitment-os"
+        assert "Jayant" in meta.author
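
Note on `TestTimeUtil`: the four assertions above pin the conversion down as minutes since midnight, that is `hours * 60 + minutes`. The sketch below is consistent with them. The name `time_to_min` is a stand-in chosen here; the repository's own `_time_to_min` is not shown in this part of the diff and may differ, for example in how it validates input.

```python
def time_to_min(hhmm: str) -> int:
    """Convert an "HH:MM" string to minutes since midnight.

    Hypothetical stand-in for the project's `_time_to_min`; it does no
    validation beyond what `int()` and tuple unpacking enforce.
    """
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


# Worked arithmetic for one of the test cases: 9 * 60 + 30 = 570.
print(time_to_min("09:30"))
```

Working in integer minutes keeps overlap checks between calendar events down to plain integer comparisons.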

training/__init__.py ADDED

File without changes

training/env_factory.py ADDED

	@@ -0,0 +1,167 @@

+"""Environment factory for TRL GRPOTrainer integration.
+Wraps CommitmentOS as a callable that accepts model completions and
+returns rewards, making it compatible with TRL's ``environment_factory``
+pattern for multi-turn RL training.
+"""
+from __future__ import annotations
+import json
+import sys
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from server.domain import ScenarioDef
+from server.environment import CommitmentEnvironment
+from server.tasks import get_all_scenarios
+from models import CommitmentAction
+TOOL_DESCRIPTIONS = """Available tools (respond with JSON):
+- {"action_type": "view_calendar", "date": "2026-04-25"}
+- {"action_type": "check_availability", "person": "Name"}
+- {"action_type": "search_restaurants", "cuisine": "...", "max_price": 50, "dietary": "..."}
+- {"action_type": "schedule_meeting", "title": "...", "date": "...", "time": "HH:MM", "participants": [...]}
+- {"action_type": "reschedule_event", "event_id": "evt_X", "new_time": "HH:MM"}
+- {"action_type": "cancel_event", "event_id": "evt_X"}
+- {"action_type": "send_email", "to": "Name", "subject": "...", "body": "..."}
+- {"action_type": "book_restaurant", "restaurant_name": "..."}
+- {"action_type": "submit_plan"}"""
+def build_system_prompt() -> str:
+    return (
+        "You are an expert executive assistant AI managing calendars, emails, and "
+        "dining reservations. For each turn, respond with EXACTLY ONE JSON tool call.\n\n"
+        f"{TOOL_DESCRIPTIONS}\n\n"
+        "Rules:\n"
+        "1. Respond with ONLY JSON, no markdown or explanation\n"
+        "2. Handle higher-priority items first\n"
+        "3. When cancelling/rescheduling commitments, ALWAYS email affected parties\n"
+        "4. Call submit_plan when all issues are resolved\n"
+        "5. Never silently drop a commitment"
+    )
+def build_initial_prompt(scenario: ScenarioDef) -> str:
+    """Build the user message for the first turn of an episode."""
+    from server.world import WorldState
+    world = WorldState(scenario)
+    calendar = json.dumps(world.get_calendar_snapshot(), indent=2)
+    inbox = json.dumps(world.get_inbox_snapshot(), indent=2)
+    return (
+        f"SCENARIO: {scenario.briefing}\n\n"
+        f"CALENDAR:\n{calendar}\n\n"
+        f"INBOX:\n{inbox}\n\n"
+        "What is your first action? Respond with a JSON tool call."
+    )
+def parse_action_from_text(text: str) -> Dict[str, Any]:
+    """Extract a JSON action from model output, with fallback to submit."""
+    text = text.strip()
+    if text.startswith("```"):
+        lines = text.split("\n")
+        text = "\n".join(lines[1:-1]) if len(lines) > 2 else text
+    try:
+        data = json.loads(text)
+        if isinstance(data, dict) and "action_type" in data:
+            return data
+    except (json.JSONDecodeError, ValueError):
+        pass
+    for line in text.split("\n"):
+        line = line.strip()
+        if line.startswith("{"):
+            try:
+                data = json.loads(line)
+                if isinstance(data, dict) and "action_type" in data:
+                    return data
+            except (json.JSONDecodeError, ValueError):
+                continue
+    return {"action_type": "submit_plan"}
+class CommitmentOSEnvFactory:
+    """Wraps CommitmentOS for use with TRL's GRPOTrainer.
+    Usage with TRL::
+        from training.env_factory import CommitmentOSEnvFactory
+        factory = CommitmentOSEnvFactory(max_turns=8)
+        trainer = GRPOTrainer(
+            ...
+            environment_factory=factory,
+        )
+    """
+    def __init__(
+        self,
+        max_turns: int = 8,
+        scenario_ids: Optional[List[str]] = None,
+    ) -> None:
+        self.max_turns = max_turns
+        self.scenario_ids = scenario_ids or list(get_all_scenarios().keys())
+        self.system_prompt = build_system_prompt()
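+    def __call__(self, completions: List[str], **kwargs: Any) -> List[float]:
+        """Evaluate a batch of model completions.
+        Each completion is treated as a full multi-turn transcript where
+        each line is one JSON action. Returns a list of final rewards.
+        If ``scenario_id`` is supplied (for example as a dataset column),
+        each completion is scored against its own scenario.
+        """
+        scenario_ids = kwargs.get("scenario_id") or [None] * len(completions)
+        rewards: List[float] = []
+        for completion, scenario_id in zip(completions, scenario_ids):
+            reward = self._evaluate_single(completion, scenario_id)
+            rewards.append(reward)
+        return rewards
+    def _evaluate_single(self, completion: str, scenario_id: Optional[str] = None) -> float:
+        import random
+        env = CommitmentEnvironment()
+        # Fall back to a random scenario only when the caller supplies none;
+        # otherwise the reward would not correspond to the prompt.
+        scenario_id = scenario_id or random.choice(self.scenario_ids)
+        env.reset(task_id=scenario_id)
+        actions = completion.strip().split("\n")
+        last_reward = 0.01
+        for action_text in actions[: self.max_turns]:
+            action_dict = parse_action_from_text(action_text)
+            try:
+                action = CommitmentAction(**action_dict)
+                obs = env.step(action)
+                last_reward = obs.reward
+                if obs.done:
+                    break
+            except Exception:
+                # Skip actions that fail validation or raise inside the env.
+                continue
+        # Force a final grade if the transcript never submitted a plan.
+        if not env._done:
+            obs = env.step(CommitmentAction(action_type="submit_plan"))
+            last_reward = obs.reward
+        return float(last_reward)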
+    def get_prompt(self, scenario_id: Optional[str] = None) -> List[Dict[str, str]]:
+        """Build chat messages for a scenario."""
+        import random
+        from server.tasks import get_scenario
+        sid = scenario_id or random.choice(self.scenario_ids)
+        scenario = get_scenario(sid)
+        if scenario is None:
+            raise ValueError(f"Unknown scenario: {sid}")
+        return [
+            {"role": "system", "content": self.system_prompt},
+            {"role": "user", "content": build_initial_prompt(scenario)},
+        ]
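
Note on the transcript format: `_evaluate_single` splits a completion on newlines and passes each line to `parse_action_from_text`, which falls back to `submit_plan` for anything that is not a JSON object with an `action_type`. The sketch below mirrors that per-line behaviour in a self-contained form. `parse_line` and `split_transcript` are names invented here, not functions from the repository, and the sketch leaves out the original's code-fence handling.

```python
import json
from typing import Any, Dict, List

# Same fallback the factory uses when a line cannot be parsed.
FALLBACK: Dict[str, Any] = {"action_type": "submit_plan"}


def parse_line(line: str) -> Dict[str, Any]:
    """Parse one transcript line into an action dict, or return the fallback."""
    try:
        data = json.loads(line.strip())
    except (json.JSONDecodeError, ValueError):
        return dict(FALLBACK)
    if isinstance(data, dict) and "action_type" in data:
        return data
    return dict(FALLBACK)


def split_transcript(completion: str, max_turns: int = 8) -> List[Dict[str, Any]]:
    """Split a completion into at most `max_turns` actions, one per line."""
    lines = completion.strip().split("\n")
    return [parse_line(line) for line in lines[:max_turns]]


transcript = "\n".join([
    '{"action_type": "view_calendar", "date": "2026-04-25"}',
    "I will now move the standup.",  # prose line, not JSON
    '{"action_type": "reschedule_event", "event_id": "evt_2", "new_time": "15:00"}',
])
actions = split_transcript(transcript)
print([a["action_type"] for a in actions])
# -> ['view_calendar', 'submit_plan', 'reschedule_event']
```

One consequence worth checking against the environment: if `submit_plan` ends the episode (the factory's own final `submit_plan` step suggests it does), a single line of prose in the middle of a completion would end scoring before the later actions run.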

training/train_grpo.py ADDED

	@@ -0,0 +1,174 @@

+"""GRPO training script for CommitmentOS.
+Uses TRL's GRPOTrainer with LoRA to train Qwen2.5-1.5B-Instruct on
+temporal commitment coherence tasks.
+Designed for Google Colab A100 or similar GPU environments.
+Usage:
+  python training/train_grpo.py [--model MODEL] [--epochs N] [--lr LR]
+Environment variables:
+  HF_TOKEN — HuggingFace token for model upload (optional)
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+import random
+import sys
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="GRPO training for CommitmentOS")
+    parser.add_argument("--model", default="Qwen/Qwen2.5-1.5B-Instruct", help="Base model")
+    parser.add_argument("--epochs", type=int, default=2, help="Number of training epochs")
+    parser.add_argument("--lr", type=float, default=5e-6, help="Learning rate")
+    parser.add_argument("--batch_size", type=int, default=4, help="Per-device batch size")
+    parser.add_argument("--max_steps", type=int, default=-1, help="Max training steps (-1 for full epochs)")
+    parser.add_argument("--lora_rank", type=int, default=16, help="LoRA rank")
+    parser.add_argument("--lora_alpha", type=int, default=32, help="LoRA alpha")
+    parser.add_argument("--output_dir", default="./training_output", help="Output directory")
+    parser.add_argument("--push_to_hub", action="store_true", help="Push model to HuggingFace Hub")
+    parser.add_argument("--hub_model_id", default="jayant2304/commitmentos-qwen-grpo", help="HF Hub model ID")
+    parser.add_argument("--num_scenarios", type=int, default=15, help="Number of scenarios to use")
+    parser.add_argument("--max_turns", type=int, default=8, help="Max turns per episode")
+    parser.add_argument("--group_size", type=int, default=4, help="GRPO group size (completions per prompt)")
+    return parser.parse_args()
+def build_dataset(num_scenarios: int = 15) -> List[Dict[str, Any]]:
+    """Build training dataset from CommitmentOS scenarios."""
+    from server.tasks import get_all_scenarios
+    from training.env_factory import build_initial_prompt, build_system_prompt
+    scenarios = list(get_all_scenarios().values())[:num_scenarios]
+    system_prompt = build_system_prompt()
+    dataset: List[Dict[str, Any]] = []
+    for scenario in scenarios:
+        user_msg = build_initial_prompt(scenario)
+        dataset.append({
+            "prompt": [
+                {"role": "system", "content": system_prompt},
+                {"role": "user", "content": user_msg},
+            ],
+            "scenario_id": scenario.scenario_id,
+            "difficulty": scenario.difficulty,
+        })
+    return dataset
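+def reward_function(completions: List[Any], **kwargs: Any) -> List[float]:
+    """Reward function for GRPO: evaluates completions against CommitmentOS."""
+    from training.env_factory import CommitmentOSEnvFactory
+    # Conversational prompts yield completions as lists of message dicts.
+    texts = [
+        "\n".join(str(m.get("content", "")) for m in c) if isinstance(c, list) else c
+        for c in completions
+    ]
+    factory = CommitmentOSEnvFactory(max_turns=8)
+    return factory(texts, **kwargs)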
+def main() -> None:
+    args = parse_args()
+    try:
+        import torch
+        from datasets import Dataset
+        from peft import LoraConfig
+        from transformers import AutoModelForCausalLM, AutoTokenizer
+        from trl import GRPOConfig, GRPOTrainer
+    except ImportError as e:
+        print(f"Missing training dependency: {e}")
+        print("Install with: pip install trl transformers peft datasets torch")
+        sys.exit(1)
+    print(f"Loading model: {args.model}")
+    tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+    model = AutoModelForCausalLM.from_pretrained(
+        args.model,
+        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
+        device_map="auto" if torch.cuda.is_available() else None,
+        trust_remote_code=True,
+    )
+    lora_config = LoraConfig(
+        r=args.lora_rank,
+        lora_alpha=args.lora_alpha,
+        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
+        lora_dropout=0.05,
+        task_type="CAUSAL_LM",
+    )
+    print("Building dataset...")
+    raw_data = build_dataset(args.num_scenarios)
+    dataset = Dataset.from_list(raw_data)
+    training_config = GRPOConfig(
+        output_dir=args.output_dir,
+        num_train_epochs=args.epochs,
+        max_steps=args.max_steps,
+        per_device_train_batch_size=args.batch_size,
+        learning_rate=args.lr,
+        logging_steps=1,
+        save_steps=50,
+        bf16=torch.cuda.is_available(),
+        gradient_accumulation_steps=2,
+        warmup_ratio=0.1,
+        max_completion_length=512,
+        num_generations=args.group_size,
+        report_to="none",
+    )
+    print("Initialising GRPOTrainer...")
+    trainer = GRPOTrainer(
+        model=model,
+        args=training_config,
+        train_dataset=dataset,
+        processing_class=tokenizer,
+        reward_funcs=reward_function,
+        peft_config=lora_config,
+    )
+    print("Starting training...")
+    trainer.train()
+    print(f"Saving model to {args.output_dir}")
+    trainer.save_model(args.output_dir)
+    tokenizer.save_pretrained(args.output_dir)
+    if args.push_to_hub:
+        hf_token = os.getenv("HF_TOKEN", "")
+        if hf_token:
+            print(f"Pushing to hub: {args.hub_model_id}")
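+            # Trainer.push_to_hub takes a commit message as its first argument,
+            # not a repo ID, so push the adapter and tokenizer to the repo directly.
+            trainer.model.push_to_hub(args.hub_model_id, token=hf_token)
+            tokenizer.push_to_hub(args.hub_model_id, token=hf_token)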
+        else:
+            print("HF_TOKEN not set — skipping hub push")
+    print("Training complete!")
+    save_training_metrics(trainer, args.output_dir)
+def save_training_metrics(trainer: Any, output_dir: str) -> None:
+    """Save training metrics to JSON for plotting training curves."""
+    output_path = Path(output_dir)
+    output_path.mkdir(parents=True, exist_ok=True)
+    history = trainer.state.log_history if hasattr(trainer.state, "log_history") else []
+    metrics_file = output_path / "training_metrics.json"
+    with open(metrics_file, "w") as f:
+        json.dump(history, f, indent=2)
+    print(f"Training metrics saved to {metrics_file}")
+if __name__ == "__main__":
+    main()
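
Note on `--group_size`: the script forwards it to `GRPOConfig` as `num_generations`, the number of completions sampled per prompt. GRPO scores each completion relative to the others sampled for the same prompt, not against a learned value baseline. The sketch below shows that normalisation conceptually. It is not TRL's internal code; the function name, the population standard deviation, and the `eps` value are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalise the rewards of one prompt's group to zero mean, unit scale.

    Illustrative only: `eps` guards against division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt (group size 4), each scored by the environment.
group = [0.10, 0.40, 0.40, 0.90]
advantages = group_relative_advantages(group)
print([round(a, 3) for a in advantages])

# A group whose rewards are all equal carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
print(flat)
```

This is why the reward should vary across completions of the same prompt: if every completion in a group lands on the same score, the advantages are all zero and that prompt contributes nothing to the update.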

uv.lock ADDED

The diff for this file is too large to render.