Revanth-ml/agentops-gym

Revanth-ml committed on Apr 7

Commit e2eb9d7 (verified) · 1 parent: c1fd719

Upload folder using huggingface_hub

Files changed (16)

Dockerfile +82 -0
README.md +115 -5
__init__.py +7 -0
client.py +38 -0
inference.py +278 -0
models.py +86 -0
openenv.yaml +7 -0
pyproject.toml +45 -0
server/__init__.py +7 -0
server/app.py +156 -0
server/environment.py +288 -0
server/inference.py +342 -0
server/requirements.txt +6 -0
server/tasks.py +428 -0
server/tools.py +308 -0
uv.lock +0 -0

Dockerfile ADDED

	@@ -0,0 +1,82 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# Multi-stage build using openenv-base
+# This Dockerfile is flexible and works for both:
+# - In-repo environments (with local OpenEnv sources)
+# - Standalone environments (with openenv from PyPI/Git)
+# The build script (openenv build) handles context detection and sets appropriate build args.
+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+# Ensure git is available (required for installing dependencies from VCS)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+# Build argument to control whether we're building standalone or in-repo
+ARG BUILD_MODE=in-repo
+ARG ENV_NAME=agentops_gym
+# Copy environment code (always at root of build context)
+COPY . /app/env
+# For in-repo builds, openenv is already vendored in the build context
+# For standalone builds, openenv will be installed via pyproject.toml
+WORKDIR /app/env
+# Ensure uv is available (for local builds where base image lacks it)
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
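+# Install dependencies using uv sync, in two passes so Docker can cache layers.
+# First pass: third-party dependencies only (--no-install-project).
+# If uv.lock exists, use it; otherwise resolve on the fly
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-install-project --no-editable; \
+    else \
+        uv sync --no-install-project --no-editable; \
+    fi
+# Second pass: install the project itself into the virtual environment
+RUN --mount=type=cache,target=/root/.cache/uv \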
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-editable; \
+    else \
+        uv sync --no-editable; \
+    fi
+# Final runtime stage
+FROM ${BASE_IMAGE}
+WORKDIR /app
+# Copy the virtual environment from builder
+COPY --from=builder /app/env/.venv /app/.venv
+# Copy the environment code
+COPY --from=builder /app/env /app/env
+# Set PATH to use the virtual environment
+ENV PATH="/app/.venv/bin:$PATH"
+# Set PYTHONPATH so imports work correctly
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+ENV ENABLE_WEB_INTERFACE=true
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:8000/health || exit 1
+# Run the FastAPI server
+# The module path is constructed to work with the /app/env structure
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]

README.md CHANGED

@@ -1,10 +1,120 @@
 ---
-title: Agentops Gym
-emoji: 🚀
-colorFrom: indigo
-colorTo: purple
 sdk: docker
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Agentops Gym Environment Server
+emoji: 🏏
+colorFrom: gray
+colorTo: pink
 sdk: docker
 pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
 ---
+# Agentops Gym Environment
+Stateful, partially observable, efficiency-penalizing RL environment for training agents on software engineering tool-use tasks.
+## Quick Start
+The simplest way to use the Agentops Gym environment is through the `AgentOpsEnv` client class:
+```python
+from agentops_gym.client import AgentOpsEnv
+from agentops_gym.models import ToolCall
+# Create environment from Docker image (outside try, so close() only runs on a live client)
+agentops_gymenv = AgentOpsEnv.from_docker_image("agentops_gym-env:latest")
+try:
+    # Reset to start a task
+    result = agentops_gymenv.reset(task_id="task_1")
+    print(f"Task: {result.observation.task_description}")
+    # Use tools to complete the task
+    # Example: Search for a pattern
+    action = ToolCall(tool="Grep", parameters={"pattern": "json"})
+    result = agentops_gymenv.step(action)
+    print(f"Grep Result: {result.observation.last_tool_result}")
+finally:
+    # Always clean up
+    agentops_gymenv.close()
+```
+## Building the Docker Image
+Before using the environment, you need to build the Docker image:
+```bash
+# From project root (the Dockerfile lives at the repository root)
+docker build -t agentops_gym-env:latest .
+```
+## Environment Details
+### Action
+**ToolCall**:
+- `tool` (str) - Name of the tool to execute (e.g. `Grep`, `FileRead`)
+- `parameters` (dict) - Tool-specific parameters, e.g. `{"pattern": "def fetch"}`
+- `reasoning` (str, optional) - Agent's explanation for the action
+### Observation
+**AgentObservation**:
+- `task_description` (str) - The task objective
+- `visible_files` (list[str]) - Files discovered so far
+- `last_tool_result` (str, optional) - Output of the last tool call
+- `action_history` (list[str]) - Previous actions in this episode
+- `step_count` (int) - Current step number
+- `message` (str, optional) - Feedback from the environment, e.g. a redundant-call warning
+- `done` (bool) - Whether the episode is complete
+- `reward` (float, optional) - Reward returned with this observation
+- `metadata` (dict) - Extra episode data such as `max_steps`
+### Available Tools
+- **Grep**: Search for patterns in the virtual filesystem.
+- **FileRead**: Read file contents.
+- **FileWrite**: Modify file contents.
+- **Bash**: Run simulated commands (lint, test).
+- **TodoWrite**: Save a plan for the task.
+- **Submit**: Submit the final answer.
+## Advanced Usage
+### Using the Context Manager
+```python
+from agentops_gym.client import AgentOpsEnv
+from agentops_gym.models import ToolCall
+with AgentOpsEnv(base_url="http://localhost:8000") as env:
+    result = env.reset(task_id="task_1")
+    # Execute steps...
+    action = ToolCall(tool="FileRead", parameters={"filename": "README.md"})
+    result = env.step(action)
+```
+## Running Locally
+Run the server locally for development:
+```bash
+cd agentops_gym
+uvicorn server.app:app --reload
+```
+## Project Structure
+```
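+agentops_gym/
+├── __init__.py            # Module exports
+├── README.md              # This file
+├── Dockerfile             # Container image definition
+├── client.py              # AgentOpsEnv client
+├── inference.py           # Baseline inference script
+├── models.py              # Action, Observation, and State models
+├── openenv.yaml           # OpenEnv manifest
+├── pyproject.toml         # Project metadata and dependencies
+├── uv.lock                # Locked dependencies
+└── server/
+    ├── __init__.py        # Server module exports
+    ├── app.py             # FastAPI application
+    ├── environment.py     # Core environment logic
+    ├── inference.py       # Inference script
+    ├── requirements.txt   # Server dependencies
+    ├── tasks.py           # Task registry, step rewards, and grading
+    └── tools.py           # Simulated tools and project snapshots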
+```

__init__.py ADDED

	@@ -0,0 +1,7 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""AgentOps Gym — Tool-use efficiency environment for LLM agents."""

client.py ADDED

	@@ -0,0 +1,38 @@

+"""
+AgentOps Gym — Environment client.
+Wraps WebSocket communication with the environment server.
+Provides typed step/reset/state methods for the agent.
+"""
+from typing import Dict, Any
+from openenv.core.env_client import EnvClient
+from openenv.core.client_types import StepResult
+from agentops_gym.models import ToolCall, AgentObservation, AgentState
+class AgentOpsEnv(EnvClient[ToolCall, AgentObservation, AgentState]):
+    """Client for the AgentOps Gym environment."""
+    def _step_payload(self, action: ToolCall) -> Dict[str, Any]:
+        """Convert a ToolCall action to the JSON payload expected by the server."""
+        return action.model_dump()
+    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[AgentObservation]:
+        """Parse server response into a StepResult with typed observation."""
+        obs_data = payload.get("observation", {})
+        obs = AgentObservation(
+            **obs_data,
+            done=payload.get("done", False),
+            reward=payload.get("reward"),
+        )
+        return StepResult(
+            observation=obs,
+            reward=payload.get("reward"),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: Dict[str, Any]) -> AgentState:
+        """Parse server state response into typed State object."""
+        return AgentState(**payload)
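
Review note on `_parse_result`: the custom `/step` handler in `server/app.py` returns the full `model_dump()` of the observation, which already carries `done` and `reward`, alongside top-level `done` and `reward` keys. Unpacking `**obs_data` while also passing `done=` and `reward=` would then raise a duplicate-keyword `TypeError`. Whether the WebSocket payload has the same shape is not visible in this diff, so the following is only a minimal sketch of a defensive merge. `Obs` and `parse_observation` are hypothetical stand-ins built on stdlib dataclasses, not names from the repository.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Obs:
    """Hypothetical stand-in for AgentObservation (only a few fields)."""
    visible_files: List[str] = field(default_factory=list)
    last_tool_result: Optional[str] = None
    done: bool = False
    reward: Optional[float] = None


def parse_observation(payload: Dict[str, Any]) -> Obs:
    # Copy so the caller's payload is not mutated.
    obs_data = dict(payload.get("observation", {}))
    # Top-level values win; writing into the dict avoids passing
    # done/reward twice when the observation already contains them.
    obs_data["done"] = payload.get("done", obs_data.get("done", False))
    obs_data["reward"] = payload.get("reward", obs_data.get("reward"))
    return Obs(**obs_data)


# Payload shaped like the HTTP /step response: done/reward appear twice.
http_style = {
    "observation": {"visible_files": ["main.py"], "done": True, "reward": 0.2},
    "done": True,
    "reward": 0.2,
}
# Payload where the observation omits done/reward.
bare_style = {"observation": {"visible_files": ["main.py"]}, "done": False}

obs_http = parse_observation(http_style)
obs_bare = parse_observation(bare_style)
```

Both payload shapes produce a valid observation, so the same parser can serve either transport.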

inference.py ADDED

	@@ -0,0 +1,278 @@

+#!/usr/bin/env python3
+"""
+AgentOps Gym — Baseline inference script.
+Runs an LLM agent against the AgentOps Gym tasks listed in ALL_TASKS
+(tool-use efficiency) and reports per-task scores in the mandatory OpenEnv stdout format.
+Environment variables:
+    API_BASE_URL   The API endpoint for the LLM (default: HF router)
+    MODEL_NAME     The model identifier (default: Qwen/Qwen2.5-72B-Instruct)
+    HF_TOKEN       Your Hugging Face / API key (required; API_KEY is accepted as a fallback)
+    IMAGE_NAME     Docker image name for the environment (required)
+Usage:
+    IMAGE_NAME=agentops-gym HF_TOKEN=xxx python inference.py
+"""
+from __future__ import annotations
+import asyncio
+import json
+import os
+import sys
+from typing import Any, Dict, List, Optional
+from openai import OpenAI
+from agentops_gym.client import AgentOpsEnv
+from agentops_gym.models import ToolCall
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+IMAGE_NAME  = os.getenv("IMAGE_NAME")
+API_KEY     = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME  = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+BENCHMARK  = "agentops-gym"
+MAX_STEPS  = 10
+TEMPERATURE = 0.0
+MAX_TOKENS  = 600
+ALL_TASKS = ["task_1", "task_2", "task_3", "task_4"]
+# ---------------------------------------------------------------------------
+# System prompt
+# ---------------------------------------------------------------------------
+SYSTEM_PROMPT = """\
+You are an expert software engineer agent. You solve coding tasks by calling tools.
+Available tools:
+  FileRead   — Read a file. Parameters: {"filename": "path/to/file.py"}
+  FileWrite  — Write/overwrite a file. Parameters: {"filename": "...", "content": "..."}
+  Grep       — Search files for a pattern. Parameters: {"pattern": "regex_or_string"}
+  Bash       — Run simulated shell command. Parameters: {"command": "lint main.py"}
+  WebSearch  — Search documentation. Parameters: {"query": "python lru_cache"}
+  TodoWrite  — Write a plan. Parameters: {"plan": "1. Do X\\n2. Do Y"}
+RULES:
+1. Respond ONLY with a single JSON object — no markdown, no explanation.
+2. Format: {"tool": "ToolName", "parameters": {...}, "reasoning": "why"}
+3. Be efficient — minimize total tool calls.
+4. For hard tasks: use TodoWrite FIRST to plan, then act.
+5. Never call the exact same tool+parameters twice.
+Example response:
+{"tool": "Grep", "parameters": {"pattern": "def fetch"}, "reasoning": "Find the function location"}
+"""
+# ---------------------------------------------------------------------------
+# Logging helpers (mandatory OpenEnv stdout format)
+# ---------------------------------------------------------------------------
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    err_val = error if error else "null"
+    done_val = str(done).lower()
+    action_short = action.replace("\n", " ")[:200]
+    print(
+        f"[STEP] step={step} action={action_short} reward={reward:.2f} done={done_val} error={err_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} rewards={rewards_str}",
+        flush=True,
+    )
+# ---------------------------------------------------------------------------
+# Prompt builder
+# ---------------------------------------------------------------------------
+def build_prompt(obs_data: Dict[str, Any]) -> str:
+    parts = [f"TASK: {obs_data.get('task_description', '')}"]
+    parts.append(f"\nVisible files: {obs_data.get('visible_files', [])}")
+    if obs_data.get("last_tool_result"):
+        parts.append(f"\nLast tool result:\n{obs_data['last_tool_result']}")
+    history = obs_data.get("action_history", [])
+    if history:
+        parts.append(f"\nHistory ({len(history)} calls): {history[-3:]}")  # last 3
+    if obs_data.get("message"):
+        parts.append(f"\nEnvironment message: {obs_data['message']}")
+    meta = obs_data.get("metadata", {})
+    parts.append(f"\nStep {obs_data.get('step_count', 0)}/{meta.get('max_steps', 10)}, "
+                 f"steps remaining: {meta.get('steps_remaining', '?')}")
+    parts.append("\nRespond with a single JSON tool call:")
+    return "\n".join(parts)
+def extract_tool_call(text: str) -> Optional[Dict]:
+    """Extract JSON tool call from model response."""
+    text = text.strip()
+    # Strip markdown fences if present
+    if "```" in text:
+        blocks = text.split("```")
+        for b in blocks:
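+            # removeprefix drops only a leading "json" language tag;
+            # lstrip("json") would strip any leading j/s/o/n characters.
+            b = b.strip().removeprefix("json").strip()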
+            if b.startswith("{"):
+                text = b
+                break
+    # Try direct JSON parse
+    try:
+        obj = json.loads(text)
+        if isinstance(obj, dict) and "tool" in obj:
+            return obj
+    except json.JSONDecodeError:
+        pass
+    # Fall back to decoding the first JSON object embedded in the text.
+    # raw_decode handles nested braces such as the "parameters" dict.
+    start = text.find("{")
+    if start != -1:
+        try:
+            obj, _ = json.JSONDecoder().raw_decode(text, start)
+            if isinstance(obj, dict) and "tool" in obj:
+                return obj
+        except json.JSONDecodeError:
+            pass
+    return None
+# ---------------------------------------------------------------------------
+# Episode runner
+# ---------------------------------------------------------------------------
+async def run_episode(
+    env: AgentOpsEnv,
+    client: OpenAI,
+    task_id: str,
+) -> Dict[str, Any]:
+    log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+    try:
+        result = await env.reset(seed=None, task_id=task_id)
+        obs = result.observation
+        obs_data = obs.model_dump() if hasattr(obs, "model_dump") else obs.dict()
+        for step in range(1, MAX_STEPS + 1):
+            if result.done:
+                break
+            prompt = build_prompt(obs_data)
+            completion = client.chat.completions.create(
+                model=MODEL_NAME,
+                messages=[
+                    {"role": "system", "content": SYSTEM_PROMPT},
+                    {"role": "user", "content": prompt},
+                ],
+                max_tokens=MAX_TOKENS,
+                temperature=TEMPERATURE,
+            )
+            raw = (completion.choices[0].message.content or "").strip()
+            tool_call = extract_tool_call(raw)
+            if tool_call is None:
+                # Fallback: emit a safe no-op
+                tool_call = {"tool": "Grep", "parameters": {"pattern": "def "}, "reasoning": "fallback"}
+            tool = tool_call.get("tool", "Grep")
+            parameters = tool_call.get("parameters", {})
+            reasoning = tool_call.get("reasoning", "")
+            action_str = f"{tool}({json.dumps(parameters)})"
+            result = await env.step(ToolCall(tool=tool, parameters=parameters, reasoning=reasoning))
+            obs = result.observation
+            obs_data = obs.model_dump() if hasattr(obs, "model_dump") else obs.dict()
+            reward = result.reward or 0.0
+            done = result.done
+            error = None  # tools return errors inside last_tool_result, not as exceptions
+            rewards.append(reward)
+            steps_taken = step
+            log_step(step=step, action=action_str, reward=reward, done=done, error=error)
+            if done:
+                break
+        meta = obs_data.get("metadata", {})
+        score = meta.get("grader_score") or 0.0
+        success = score >= 0.5
+    except Exception as exc:
+        print(f"[DEBUG] Episode error for {task_id}: {exc}", flush=True)
+    finally:
+        log_end(success=success, steps=steps_taken, rewards=rewards)
+    return {
+        "task_id": task_id,
+        "score": score,
+        "steps": steps_taken,
+        "success": success,
+        "rewards": rewards,
+    }
+# ---------------------------------------------------------------------------
+# Entrypoint
+# ---------------------------------------------------------------------------
+async def async_main() -> None:
+    if not API_KEY:
+        raise SystemExit(
+            "HF_TOKEN (or API_KEY) must be set.\n"
+            "  export HF_TOKEN=your_token_here"
+        )
+    if not IMAGE_NAME:
+        raise SystemExit(
+            "IMAGE_NAME must be set.\n"
+            "  export IMAGE_NAME=agentops-gym"
+        )
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    async with AgentOpsEnv.from_docker_image(IMAGE_NAME) as env:
+        results = []
+        for task_id in ALL_TASKS:
+            result = await run_episode(env, client, task_id)
+            results.append(result)
+        # Summary
+        print(f"\n{'='*60}", flush=True)
+        print("SUMMARY", flush=True)
+        print(f"{'='*60}", flush=True)
+        total = sum(r["score"] for r in results)
+        resolved = sum(1 for r in results if r["success"])
+        avg = total / len(results) if results else 0.0
+        for r in results:
+            status = "SOLVED" if r["success"] else "FAILED"
+            print(f"  {r['task_id']:>8}: score={r['score']:.3f}  steps={r['steps']}  {status}", flush=True)
+        print(f"\n  Total:    {total:.3f} / {len(results)}", flush=True)
+        print(f"  Average:  {avg:.3f}", flush=True)
+        print(f"  Solved:   {resolved} / {len(results)}", flush=True)
+def main() -> None:
+    asyncio.run(async_main())
+if __name__ == "__main__":
+    main()
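
Review note on the stdout format: `log_step` emits one `[STEP]` line per tool call, with the action text placed between `action=` and ` reward=`. A harness that consumes these logs could parse them back as sketched below. `STEP_RE` and `parse_step_line` are hypothetical helpers written for illustration, and the sketch assumes the exact field order and two-decimal reward formatting used by `log_step` above.

```python
import re
from typing import Any, Dict, Optional

# Field order mirrors log_step: step, action, reward, done, error.
# The action group is greedy, so spaces inside the action text are fine.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.*) "
    r"reward=(?P<reward>-?\d+\.\d{2}) done=(?P<done>true|false) "
    r"error=(?P<error>.*)$"
)


def parse_step_line(line: str) -> Optional[Dict[str, Any]]:
    """Parse one [STEP] log line; return None if the line does not match."""
    m = STEP_RE.match(line.strip())
    if m is None:
        return None
    return {
        "step": int(m["step"]),
        "action": m["action"],
        "reward": float(m["reward"]),
        "done": m["done"] == "true",
        # log_step writes the literal string "null" when there is no error.
        "error": None if m["error"] == "null" else m["error"],
    }


line = '[STEP] step=2 action=Grep({"pattern": "def fetch"}) reward=0.10 done=false error=null'
parsed = parse_step_line(line)
```

Lines that do not start with `[STEP]`, such as the `[DEBUG]` messages, simply return `None` and can be skipped.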

models.py ADDED

	@@ -0,0 +1,86 @@

+"""
+AgentOps Gym — Pydantic models for Action, Observation, and State.
+The agent operates on a simulated Python codebase by calling tools.
+The environment is partially observable, stateful, and efficiency-aware.
+Rewards shrink with wasteful or redundant tool calls.
+"""
+from typing import Optional, List, Dict, Any
+from pydantic import Field
+from openenv.core.env_server.types import Action, Observation, State
+class ToolCall(Action):
+    """Agent submits a tool call with a name and parameters.
+    Open action space: any valid tool name from AVAILABLE_TOOLS with
+    any parameters. This mirrors how real agents interact with tool-use
+    environments — no artificial discretization.
+    """
+    tool: str = Field(
+        ...,
+        description="Tool name (FileRead, FileWrite, Grep, Bash, WebSearch, TodoWrite)"
+    )
+    parameters: Dict[str, Any] = Field(
+        default_factory=dict,
+        description="Tool parameters, e.g. {'filename': 'main.py'} or {'pattern': 'def fetch'}"
+    )
+    reasoning: Optional[str] = Field(
+        default=None,
+        description="Optional: why the agent is calling this tool (for interpretability)"
+    )
+class AgentObservation(Observation):
+    """What the agent sees after each action.
+    Inherits from Observation which provides:
+        - done: bool
+        - reward: Optional[float]
+        - metadata: Dict[str, Any]
+    """
+    # Files the agent has discovered so far (partial observability)
+    visible_files: List[str] = Field(
+        default_factory=list,
+        description="Files the agent currently knows exist in the project"
+    )
+    # Output of the most recent tool call
+    last_tool_result: Optional[str] = Field(
+        default=None,
+        description="Output string from the last tool call"
+    )
+    # Sequential history of tool calls made this episode
+    action_history: List[str] = Field(
+        default_factory=list,
+        description="e.g. ['Grep(pattern=timeout)', 'FileRead(config.json)']"
+    )
+    step_count: int = Field(default=0, description="How many steps taken so far")
+    task_description: str = Field(default="", description="The task the agent must solve")
+    # Feedback from the environment on quality of last action
+    message: Optional[str] = Field(
+        default=None,
+        description="Environment feedback e.g. 'redundant call detected'"
+    )
+class AgentState(State):
+    """Episode metadata for training harnesses and curriculum schedulers.
+    Inherits from State which provides:
+        - episode_id: Optional[str]
+        - step_count: int
+    """
+    task_id: str = Field(default="", description="Current task identifier")
+    task_description: str = Field(default="", description="Human-readable task description")
+    difficulty: str = Field(default="", description="easy / medium / hard")
+    max_steps: int = Field(default=10, description="Max steps allowed this episode")
+    visible_files: List[str] = Field(default_factory=list)
+    discovered_files: List[str] = Field(default_factory=list)
+    action_history: List[str] = Field(default_factory=list)
+    current_reward: float = Field(default=0.0, description="Cumulative reward so far")
+    completed: bool = Field(default=False)
+    grader_score: Optional[float] = Field(
+        default=None,
+        description="Final grader score (0.0-1.0), set at end of episode"
+    )

openenv.yaml ADDED

	@@ -0,0 +1,7 @@

+spec_version: 1
+name: agentops_gym
+type: space
+runtime: fastapi
+app: server.app:app
+port: 8000

pyproject.toml ADDED

	@@ -0,0 +1,45 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-agentops_gym"
+version = "0.1.0"
+description = "Agentops Gym environment for OpenEnv"
+requires-python = ">=3.10"
+dependencies = [
+    # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+    # install from github
+    # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+    "openenv-core[core]>=0.2.2",
+    # Environment-specific dependencies
+    # Add all dependencies needed for your environment here
+    # Examples:
+    # "numpy>=1.19.0",
+    # "torch>=2.0.0",
+    # "gymnasium>=0.29.0",
+    # "openspiel>=1.0.0",
+    # "smolagents>=1.22.0,<2",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+# Server entry point - enables running via: uv run --project . server
+# or: python -m agentops_gym.server.app
+server = "agentops_gym.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["agentops_gym", "agentops_gym.server"]
+package-dir = { "agentops_gym" = ".", "agentops_gym.server" = "server" }

server/__init__.py ADDED

	@@ -0,0 +1,7 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""AgentOps Gym — Server package."""

server/app.py ADDED

	@@ -0,0 +1,156 @@

+"""
+AgentOps Gym — FastAPI application.
+Exposes the OpenEnv-compatible HTTP + WebSocket API via openenv-core's
+create_app(), plus custom endpoints: /tasks, /grader, /health.
+A persistent singleton environment handles HTTP /reset and /step (for
+the baseline script and interactive testing). WebSocket connections each
+get their own AgentOpsEnvironment instance (via create_app factory pattern).
+"""
+import threading
+import logging
+from typing import Optional
+from fastapi.responses import JSONResponse
+from openenv.core.env_server.http_server import create_app
+from agentops_gym.models import ToolCall, AgentObservation
+from agentops_gym.server.environment import AgentOpsEnvironment, get_last_grader_result
+from agentops_gym.server.tasks import TASK_REGISTRY
+logger = logging.getLogger(__name__)
+app = create_app(
+    AgentOpsEnvironment,
+    ToolCall,
+    AgentObservation,
+    env_name="agentops-gym",
+)
+_env = AgentOpsEnvironment()
+_env_lock = threading.Lock()
+def _serialize(obs: AgentObservation) -> dict:
+    return obs.model_dump() if hasattr(obs, "model_dump") else obs.dict()
+app.router.routes = [
+    r for r in app.router.routes
+    if not (hasattr(r, "path") and r.path in ("/reset", "/step"))
+]
+@app.post("/reset")
+async def stateful_reset(request: Optional[dict] = None):
+    """Reset environment for a new episode. Pass {'task_id': 'task_1'} etc."""
+    import asyncio
+    request = request or {}
+    task_id = request.get("task_id", "task_1")
+    def _do():
+        with _env_lock:
+            obs = _env.reset(task_id=task_id)
+        return _serialize(obs)
+    # Run the blocking reset in a worker thread so the event loop stays free
+    loop = asyncio.get_running_loop()
+    obs_dict = await loop.run_in_executor(None, _do)
+    return {"observation": obs_dict, "reward": 0.0, "done": False}
+@app.post("/step")
+async def stateful_step(request: Optional[dict] = None):
+    """Execute one tool call.
+    Accepts two body shapes:
+      1. {"action": {"tool": "...", "parameters": {...}}}   ← inference script
+      2. {"tool": "...", "parameters": {...}}               ← direct curl
+    """
+    import asyncio
+    request = request or {}
+    if "action" in request:
+        action_data = request["action"]
+    else:
+        action_data = request
+    tool = action_data.get("tool", "")
+    parameters = action_data.get("parameters", {})
+    reasoning = action_data.get("reasoning", "")
+    if not tool:
+        return JSONResponse(
+            status_code=400,
+            content={"error": "'tool' field is required. Body must be {'action': {'tool': '...', 'parameters': {...}}}"},
+        )
+    def _do():
+        with _env_lock:
+            obs = _env.step(ToolCall(tool=tool, parameters=parameters, reasoning=reasoning))
+        return _serialize(obs)
+    # Run the blocking step in a worker thread so the event loop stays free
+    loop = asyncio.get_running_loop()
+    obs_dict = await loop.run_in_executor(None, _do)
+    return {
+        "observation": obs_dict,
+        "reward": obs_dict.get("reward", 0.0),
+        "done": obs_dict.get("done", False),
+    }
+@app.get("/tasks")
+async def list_tasks():
+    """List all available tasks with metadata."""
+    tasks = []
+    for tid, t in TASK_REGISTRY.items():
+        tasks.append({
+            "id": tid,
+            "name": t["name"],
+            "difficulty": t["difficulty"],
+            "description": t["description"],
+            "max_steps": t["max_steps"],
+            "optimal_steps": t["optimal_steps"],
+        })
+    return {
+        "tasks": tasks,
+        "action_schema": {
+            "tool": "string — one of FileRead|FileWrite|Grep|Bash|WebSearch|TodoWrite",
+            "parameters": "dict — tool-specific params",
+            "reasoning": "string (optional) — agent's reasoning",
+        },
+    }
+@app.get("/grader")
+async def grader_score():
+    """Return the grader score for the last completed episode."""
+    result = get_last_grader_result()
+    if result is None:
+        return JSONResponse(
+            status_code=404,
+            content={"error": "No episode graded yet. Complete an episode first."},
+        )
+    return result
+@app.get("/health")
+async def health():
+    return {"status": "ok", "env": "agentops-gym"}
+def main():
+    import uvicorn
+    import os
+    host = os.getenv("HOST", "0.0.0.0")
+    port = int(os.getenv("PORT", "8000"))
+    uvicorn.run(app, host=host, port=port)
+if __name__ == "__main__":
+    main()
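
Review note on `/step`: the handler accepts both a wrapped body (`{"action": {...}}`) and a flat one. That normalization can be checked in isolation with a small pure function, as in the sketch below. `normalize_step_body` is a hypothetical helper, not part of the commit; it only mirrors the unwrapping and the default values used in `stateful_step`.

```python
from typing import Any, Dict, Optional, Tuple


def normalize_step_body(
    body: Optional[Dict[str, Any]],
) -> Tuple[str, Dict[str, Any], str]:
    """Return (tool, parameters, reasoning) from either accepted body shape."""
    body = body or {}
    # Shape 1 wraps the tool call under "action"; shape 2 is the call itself.
    action = body["action"] if "action" in body else body
    return (
        action.get("tool", ""),
        action.get("parameters", {}),
        action.get("reasoning", ""),
    )


wrapped = normalize_step_body(
    {"action": {"tool": "Grep", "parameters": {"pattern": "json"}}}
)
flat = normalize_step_body({"tool": "Grep", "parameters": {"pattern": "json"}})
empty = normalize_step_body(None)
```

An empty tool name in the result corresponds to the case where the handler answers with HTTP 400.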

server/environment.py ADDED

	@@ -0,0 +1,288 @@

+"""
+AgentOps Gym — Core Environment class.
+Implements the OpenEnv Environment interface: reset(), step(), state.
+Orchestrates tool execution, reward shaping, and episode grading.
+Each episode is fully deterministic given a task_id:
+  - Snapshot is restored from PROJECT_SNAPSHOTS on reset
+  - All tool calls operate on the in-memory snapshot
+  - No real filesystem, no real subprocess
+"""
+import copy
+import logging
+import uuid
+from typing import Optional, Any
+from openenv.core.env_server.interfaces import Environment
+from agentops_gym.models import ToolCall, AgentObservation, AgentState
+from agentops_gym.server.tools import run_tool, PROJECT_SNAPSHOTS, AVAILABLE_TOOLS
+from agentops_gym.server.tasks import (
+    TASK_REGISTRY,
+    get_task,
+    list_task_ids,
+    compute_step_reward,
+    grade_episode,
+)
+logger = logging.getLogger(__name__)
+_last_grader_result: Optional[dict] = None
+class AgentOpsEnvironment(Environment[ToolCall, AgentObservation, AgentState]):
+    """Tool-use efficiency training environment.
+    Each episode:
+    1. reset() selects a task, initialises the in-memory snapshot, returns initial obs
+    2. step() executes a tool call, computes reward, checks completion
+    3. state property returns current episode metadata
+    """
+    def __init__(self):
+        super().__init__()
+        self._episode_id: str = ""
+        self._task_id: str = ""
+        self._task: dict = {}
+        self._snapshot: dict = {}
+        self._visible_files: list = []
+        self._discovered_files: list = []
+        self._action_history: list = []
+        self._step_count: int = 0
+        self._max_steps: int = 10
+        self._done: bool = True
+        self._cumulative_reward: float = 0.0
+        self._grader_score: Optional[float] = None
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        **kwargs: Any,
+    ) -> AgentObservation:
+        """Start a new episode.
+        kwargs may include 'task_id' to select a specific task.
+        If not given, defaults to task_1 (can be cycled externally).
+        """
+        task_id = kwargs.get("task_id", "task_1")
+        if task_id not in TASK_REGISTRY:
+            task_id = "task_1"
+        self._episode_id = episode_id or str(uuid.uuid4())
+        self._task_id = task_id
+        self._task = get_task(task_id)
+        self._max_steps = self._task["max_steps"]
+        self._snapshot = copy.deepcopy(PROJECT_SNAPSHOTS.get(task_id, {}))
+        self._visible_files = list(self._task["initial_visible_files"])
+        self._discovered_files = list(self._visible_files)
+        self._action_history = []
+        self._step_count = 0
+        self._done = False
+        self._cumulative_reward = 0.0
+        self._grader_score = None
+        logger.info("Episode %s started: task=%s", self._episode_id, task_id)
+        return AgentObservation(
+            visible_files=list(self._visible_files),
+            last_tool_result=None,
+            action_history=[],
+            step_count=0,
+            task_description=self._task["description"],
+            message=f"Episode started. Available tools: {', '.join(AVAILABLE_TOOLS.keys())}",
+            done=False,
+            reward=0.0,
+            metadata={
+                "task_id": task_id,
+                "difficulty": self._task["difficulty"],
+                "max_steps": self._max_steps,
+                "available_tools": list(AVAILABLE_TOOLS.keys()),
+            },
+        )
+    def step(
+        self,
+        action: ToolCall,
+        **kwargs: Any,
+    ) -> AgentObservation:
+        """Execute one tool call and return updated observation."""
+        if self._done:
+            return self._terminal_obs("Episode already done. Call reset() first.")
+        self._step_count += 1
+        tool = action.tool
+        params = action.parameters
+        tool_result, self._snapshot, self._discovered_files = run_tool(
+            tool=tool,
+            parameters=params,
+            snapshot=self._snapshot,
+            discovered_files=self._discovered_files,
+        )
+        history_before = list(self._action_history)
+        action_str = f"{tool}({params})"
+        self._action_history.append(action_str)
+        for f in self._discovered_files:
+            if f not in self._visible_files:
+                self._visible_files.append(f)
+        step_reward, reward_breakdown = compute_step_reward(
+            task_id=self._task_id,
+            tool=tool,
+            parameters=params,
+            tool_result=tool_result,
+            action_history=history_before,
+            discovered_files=self._discovered_files,
+            snapshot=self._snapshot,
+        )
+        self._cumulative_reward += step_reward
+        self._cumulative_reward = max(0.0, min(1.0, self._cumulative_reward))
+        done = False
+        message = None
+        if self._step_count >= self._max_steps:
+            done = True
+            message = f"Max steps ({self._max_steps}) reached."
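+        # Hard cap for task_3. Its max_steps is already 8, so the check above ends the
+        # episode first; this branch only fires if max_steps is ever raised past 8.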
+        if self._task_id == "task_3" and self._step_count > 8:
+            done = True
+            message = "Hard step cap (8) exceeded. Score capped at 0.3."
+        # ── Task completion detection ──────────────────────────────────
+        # task_1: linter ran and found the bug (or agent read main.py + grepped json)
+        if self._task_id == "task_1":
+            linted = any("BASH" in h.upper() and "LINT" in h.upper() for h in self._action_history)
+            read_main = any("FILEREAD" in h.upper() and "MAIN.PY" in h.upper() for h in self._action_history)
+            found_json = any("GREP" in h.upper() and "JSON" in h.upper() for h in self._action_history)
+            if linted or (read_main and found_json):
+                done = True
+                message = "Bug identified — grading episode."
+        # task_2: config.json was written with timeout=10
+        elif self._task_id == "task_2":
+            import json as _json
+            try:
+                cfg = _json.loads(self._snapshot.get("config.json", "{}"))
+                if cfg.get("timeout") == 10:
+                    done = True
+                    message = "Config patched successfully — grading episode."
+            except Exception:
+                pass
+        # task_3: main.py now contains a cache mechanism
+        elif self._task_id == "task_3":
+            main_src = self._snapshot.get("main.py", "")
+            if "lru_cache" in main_src or "_cache" in main_src:
+                done = True
+                message = "Caching implemented — grading episode."
+        # task_4: .env contains API_KEY and main.py uses os.getenv
+        elif self._task_id == "task_4":
+            main_src = self._snapshot.get("main.py", "")
+            env_src = self._snapshot.get(".env", "")
+            if "API_KEY=SECRET_TOKEN_XYZ" in env_src.replace(" ", "") and \
+               "os.getenv" in main_src and \
+               "SECRET_TOKEN_XYZ" not in main_src:
+                done = True
+                message = "Secret migrated successfully — grading episode."
+        # Redundant call message (non-terminating)
+        if len(self._action_history) >= 2 and self._action_history[-1] == self._action_history[-2]:
+            message = (message or "") + " Redundant call detected."
+        self._done = done
+        # Compute final grader score at episode end
+        grader_score = None
+        if done:
+            grader_score, breakdown = grade_episode(
+                task_id=self._task_id,
+                snapshot=self._snapshot,
+                action_history=self._action_history,
+                steps_used=self._step_count,
+            )
+            self._grader_score = grader_score
+            # Store globally for /grader endpoint
+            global _last_grader_result
+            _last_grader_result = {
+                "task_id": self._task_id,
+                "episode_id": self._episode_id,
+                "score": grader_score,
+                "breakdown": breakdown,
+                "steps_used": self._step_count,
+            }
+            # Add completion bonus proportional to grader score
+            step_reward += grader_score * 0.5
+            logger.info(
+                "Episode %s done: task=%s score=%.3f steps=%d",
+                self._episode_id, self._task_id, grader_score, self._step_count,
+            )
+        return AgentObservation(
+            visible_files=list(self._visible_files),
+            last_tool_result=tool_result,
+            action_history=list(self._action_history),
+            step_count=self._step_count,
+            task_description=self._task["description"],
+            message=message,
+            done=done,
+            reward=round(step_reward, 4),
+            metadata={
+                "task_id": self._task_id,
+                "difficulty": self._task["difficulty"],
+                "cumulative_reward": round(self._cumulative_reward, 4),
+                "grader_score": grader_score,
+                "reward_breakdown": reward_breakdown,
+                "steps_remaining": self._max_steps - self._step_count,
+            },
+        )
+    @property
+    def state(self) -> AgentState:
+        return AgentState(
+            episode_id=self._episode_id,
+            step_count=self._step_count,
+            task_id=self._task_id,
+            task_description=self._task.get("description", ""),
+            difficulty=self._task.get("difficulty", ""),
+            max_steps=self._max_steps,
+            visible_files=list(self._visible_files),
+            discovered_files=list(self._discovered_files),
+            action_history=list(self._action_history),
+            current_reward=round(self._cumulative_reward, 4),
+            completed=self._done,
+            grader_score=self._grader_score,
+        )
+    def close(self) -> None:
+        pass
+    def _terminal_obs(self, msg: str) -> AgentObservation:
+        return AgentObservation(
+            visible_files=list(self._visible_files),
+            last_tool_result=msg,
+            action_history=list(self._action_history),
+            step_count=self._step_count,
+            task_description=self._task.get("description", ""),
+            message=msg,
+            done=True,
+            reward=0.0,
+            metadata={"task_id": self._task_id, "grader_score": self._grader_score},
+        )
+def get_last_grader_result() -> Optional[dict]:
+    return _last_grader_result
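
The reward accounting in step() above can be illustrated in isolation. The sketch below is a minimal approximation, assuming a hypothetical helper named simulate_rewards that is not part of this repository. It mirrors two details of the code as written: the running total is clamped to [0.0, 1.0] after every step, and the completion bonus (grader_score * 0.5) is added to the reward reported on the final step but not to the running total.

```python
from typing import List, Optional, Tuple


def simulate_rewards(
    step_rewards: List[float],
    grader_score: Optional[float],
) -> Tuple[List[float], float]:
    """Hypothetical helper: replay per-step rewards the way step() accounts for them.

    Returns (reported_rewards, cumulative_reward).
    """
    cumulative = 0.0
    reported: List[float] = []
    for i, r in enumerate(step_rewards):
        # The running total is clamped after every step, as in step().
        cumulative = max(0.0, min(1.0, cumulative + r))
        is_last = i == len(step_rewards) - 1
        if is_last and grader_score is not None:
            # The completion bonus only affects the reported reward of the last step.
            r = r + grader_score * 0.5
        reported.append(round(r, 4))
    return reported, round(cumulative, 4)


if __name__ == "__main__":
    print(simulate_rewards([0.05, 0.10, 0.05], grader_score=0.8))
```

One consequence worth noting: a client that sums the reported rewards will see a different total than metadata["cumulative_reward"], because the bonus is excluded from the latter.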

server/inference.py ADDED Viewed

	@@ -0,0 +1,342 @@

+#!/usr/bin/env python3
+"""
+AgentOps Gym — Baseline inference script.
+Runs an LLM agent against all 4 tasks and reports per-task scores
+in the mandatory OpenEnv stdout format.
+Environment variables (MANDATORY):
+    API_BASE_URL   LLM API endpoint  (default: https://router.huggingface.co/v1)
+    MODEL_NAME     Model identifier  (default: Qwen/Qwen2.5-72B-Instruct)
+    HF_TOKEN       HuggingFace / API key (must be set)
+    IMAGE_NAME     Docker image name (must be set)
+Usage:
+    IMAGE_NAME=agentops-gym HF_TOKEN=xxx python inference.py
+"""
+from __future__ import annotations
+import json
+import os
+import re
+import sys
+import time
+from typing import Any, Dict, List, Optional
+import requests
+from openai import OpenAI
+# Load .env file if present (works without it too)
+try:
+    from dotenv import load_dotenv
+    load_dotenv()
+except ImportError:
+    pass
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+IMAGE_NAME   = os.getenv("IMAGE_NAME")
+API_KEY      = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME   = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+BASE_URL     = os.getenv("ENV_BASE_URL", "http://localhost:8000")
+BENCHMARK   = "agentops-gym"
+MAX_STEPS   = 10
+TEMPERATURE = 0.3
+MAX_TOKENS  = 600
+ALL_TASKS = ["task_1", "task_2", "task_3", "task_4"]
+# ---------------------------------------------------------------------------
+# System prompt
+# ---------------------------------------------------------------------------
+SYSTEM_PROMPT = """\
+You are an expert software engineer agent. You solve coding tasks by calling tools.
+Available tools:
+  FileRead   — Read a file.         Parameters: {"filename": "path/to/file.py"}
+  FileWrite  — Write/overwrite.     Parameters: {"filename": "...", "content": "..."}
+  Grep       — Search all files.    Parameters: {"pattern": "regex_or_string"}
+  Bash       — Simulated shell.     Parameters: {"command": "lint main.py"}
+  WebSearch  — Search docs.         Parameters: {"query": "python lru_cache"}
+  TodoWrite  — Record a plan.       Parameters: {"plan": "1. Do X\\n2. Do Y"}
+RULES:
+1. Respond ONLY with a single JSON object — no markdown, no extra text.
+2. Format exactly: {"tool": "ToolName", "parameters": {...}, "reasoning": "why"}
+3. Be efficient — minimize total tool calls.
+4. For hard tasks: call TodoWrite FIRST to plan, then act.
+5. Never repeat the exact same tool + parameters twice in a row.
+Example:
+{"tool": "Grep", "parameters": {"pattern": "def fetch"}, "reasoning": "Find the function"}
+"""
+# ---------------------------------------------------------------------------
+# Mandatory stdout log helpers
+# ---------------------------------------------------------------------------
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    err_val = error if error else "null"
+    action_short = str(action).replace("\n", " ")[:200]
+    print(
+        f"[STEP] step={step} action={action_short} "
+        f"reward={reward:.2f} done={str(done).lower()} error={err_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} rewards={rewards_str}",
+        flush=True,
+    )
+# ---------------------------------------------------------------------------
+# HTTP helpers
+# ---------------------------------------------------------------------------
+def http_reset(task_id: str) -> Dict:
+    """POST /reset and return the observation dict."""
+    resp = requests.post(
+        f"{BASE_URL}/reset",
+        json={"task_id": task_id},
+        timeout=30,
+    )
+    resp.raise_for_status()
+    return resp.json()
+def http_step(tool: str, parameters: Dict, reasoning: str = "") -> Dict:
+    """POST /step with the correct body shape and return the response dict."""
+    body = {
+        "action": {
+            "tool": tool,
+            "parameters": parameters,
+            "reasoning": reasoning,
+        }
+    }
+    resp = requests.post(
+        f"{BASE_URL}/step",
+        json=body,
+        timeout=30,
+    )
+    resp.raise_for_status()
+    return resp.json()
+def http_grader() -> Dict:
+    resp = requests.get(f"{BASE_URL}/grader", timeout=10)
+    if resp.status_code == 200:
+        return resp.json()
+    return {}
+# ---------------------------------------------------------------------------
+# Prompt builder
+# ---------------------------------------------------------------------------
+def build_prompt(obs: Dict) -> str:
+    parts = [f"TASK: {obs.get('task_description', '')}"]
+    parts.append(f"\nVisible files: {obs.get('visible_files', [])}")
+    last = obs.get("last_tool_result")
+    if last:
+        # Truncate long outputs
+        parts.append(f"\nLast tool result:\n{str(last)[:1500]}")
+    history = obs.get("action_history", [])
+    if history:
+        parts.append(f"\nHistory (last 3): {history[-3:]}")
+    if obs.get("message"):
+        parts.append(f"\nEnv message: {obs['message']}")
+    meta = obs.get("metadata", {})
+    steps_rem = meta.get("steps_remaining", "?")
+    parts.append(f"\nStep {obs.get('step_count', 0)}, steps remaining: {steps_rem}")
+    parts.append("\nRespond with a single JSON tool call:")
+    return "\n".join(parts)
+# ---------------------------------------------------------------------------
+# JSON extraction
+# ---------------------------------------------------------------------------
+def extract_tool_call(text: str) -> Optional[Dict]:
+    """Extract a valid JSON tool call from model output."""
+    text = text.strip()
+    # Strip markdown fences
+    if "```" in text:
+        for block in text.split("```"):
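+            # Remove a leading "json" language tag (lstrip("json") would strip a character set, not a prefix)
+            block = re.sub(r"^json\s*", "", block.strip())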
+            if block.startswith("{"):
+                text = block
+                break
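+    # Direct parse
+    try:
+        obj = json.loads(text)
+        if isinstance(obj, dict) and "tool" in obj:
+            return obj
+    except json.JSONDecodeError:
+        pass
+    # Fallback: decode the first JSON object embedded in surrounding text.
+    # raw_decode handles the nested "parameters" object, which a flat {...} regex cannot.
+    decoder = json.JSONDecoder()
+    for m in re.finditer(r'\{', text):
+        try:
+            obj, _ = decoder.raw_decode(text, m.start())
+        except json.JSONDecodeError:
+            continue
+        if isinstance(obj, dict) and "tool" in obj:
+            return obj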
+    return None
+# ---------------------------------------------------------------------------
+# Episode runner
+# ---------------------------------------------------------------------------
+def run_episode(client: OpenAI, task_id: str) -> Dict:
+    log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+    error_msg = None
+    try:
+        # Reset
+        reset_resp = http_reset(task_id)
+        obs = reset_resp.get("observation", {})
+        for step in range(1, MAX_STEPS + 1):
+            if reset_resp.get("done") or obs.get("done"):
+                break
+            # Ask the model
+            prompt = build_prompt(obs)
+            try:
+                completion = client.chat.completions.create(
+                    model=MODEL_NAME,
+                    messages=[
+                        {"role": "system", "content": SYSTEM_PROMPT},
+                        {"role": "user",   "content": prompt},
+                    ],
+                    max_tokens=MAX_TOKENS,
+                    temperature=TEMPERATURE,
+                )
+                raw = (completion.choices[0].message.content or "").strip()
+            except Exception as e:
+                error_msg = f"LLM error: {e}"
+                log_step(step=step, action="(llm_error)", reward=0.0, done=True, error=str(e))
+                break
+            tool_call = extract_tool_call(raw)
+            if tool_call is None:
+                # Fallback: safe no-op grep
+                tool_call = {
+                    "tool": "Grep",
+                    "parameters": {"pattern": "def "},
+                    "reasoning": "fallback — could not parse model output",
+                }
+            tool      = tool_call.get("tool", "Grep")
+            params    = tool_call.get("parameters", {})
+            reasoning = tool_call.get("reasoning", "")
+            action_str = f"{tool}({json.dumps(params)})"
+            # Execute
+            try:
+                step_resp = http_step(tool, params, reasoning)
+            except requests.HTTPError as e:
+                error_msg = str(e)
+                log_step(step=step, action=action_str, reward=0.0, done=True, error=error_msg)
+                break
+            obs     = step_resp.get("observation", {})
+            reward  = float(step_resp.get("reward", 0.0) or 0.0)
+            done    = bool(step_resp.get("done", False))
+            rewards.append(reward)
+            steps_taken = step
+            log_step(step=step, action=action_str, reward=reward, done=done, error=None)
+            if done:
+                break
+        # Fetch grader score
+        grader = http_grader()
+        score = float(grader.get("score", 0.0) or 0.0)
+        success = score >= 0.5
+    except Exception as exc:
+        print(f"[DEBUG] Episode error for {task_id}: {exc}", flush=True)
+    finally:
+        log_end(success=success, steps=steps_taken, rewards=rewards)
+    return {
+        "task_id":  task_id,
+        "score":    score,
+        "steps":    steps_taken,
+        "success":  success,
+        "rewards":  rewards,
+    }
+def main() -> None:
+    if not API_KEY:
+        print("ERROR: HF_TOKEN (or OPENAI_API_KEY) must be set.", file=sys.stderr)
+        print("  export HF_TOKEN=hf_xxx", file=sys.stderr)
+        sys.exit(1)
+    for attempt in range(10):
+        try:
+            r = requests.get(f"{BASE_URL}/health", timeout=5)
+            if r.status_code == 200:
+                break
+        except Exception:
+            pass
+        print(f"[DEBUG] Waiting for server... attempt {attempt+1}/10", flush=True)
+        time.sleep(2)
+    else:
+        print("ERROR: Server did not become ready.", file=sys.stderr)
+        sys.exit(1)
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    print("=" * 60, flush=True)
+    print(f"AgentOps Gym — Baseline Inference", flush=True)
+    print(f"Model: {MODEL_NAME}  |  Server: {BASE_URL}", flush=True)
+    print("=" * 60, flush=True)
+    results = []
+    for task_id in ALL_TASKS:
+        print("─" * 40, flush=True)
+        result = run_episode(client, task_id)
+        results.append(result)
+    print("=" * 60, flush=True)
+    print("BASELINE SUMMARY", flush=True)
+    print("=" * 60, flush=True)
+    total   = sum(r["score"] for r in results)
+    solved  = sum(1 for r in results if r["success"])
+    avg     = total / len(results) if results else 0.0
+    for r in results:
+        status = "✅ PASS" if r["success"] else "❌ FAIL"
+        print(f"  {r['task_id']:>8}    score={r['score']:.3f}  steps={r['steps']:2d}  {status}", flush=True)
+    print(f"\n  Average score: {avg:.3f}", flush=True)
+    print(f"  Solved: {solved} / {len(results)}", flush=True)
+    print("=" * 60, flush=True)
+if __name__ == "__main__":
+    main()
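
The stdout format emitted by log_end() above is easy to consume from another script. The following is a minimal sketch, assuming a hypothetical parse_end_line helper that is not part of this repository; it only relies on the "[END] success=... steps=... rewards=..." layout printed by log_end().

```python
import re
from typing import Dict, List, Union

# Matches the line layout printed by log_end(); rewards may be empty.
END_RE = re.compile(r"^\[END\] success=(true|false) steps=(\d+) rewards=(.*)$")


def parse_end_line(line: str) -> Dict[str, Union[bool, int, List[float]]]:
    """Hypothetical helper: turn one [END] log line back into Python values."""
    m = END_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not an [END] line: {line!r}")
    success, steps, rewards = m.groups()
    return {
        "success": success == "true",
        "steps": int(steps),
        # An episode that failed before its first step logs an empty rewards list.
        "rewards": [float(r) for r in rewards.split(",") if r],
    }


if __name__ == "__main__":
    print(parse_end_line("[END] success=true steps=3 rewards=0.05,0.10,0.45"))
```

Because log_end() formats rewards with two decimals, the parsed values are rounded and should not be used to recompute exact scores; the /grader endpoint remains the source for those.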

server/requirements.txt ADDED Viewed

	@@ -0,0 +1,6 @@

+openenv[core]>=0.2.0
+fastapi>=0.115.0
+uvicorn>=0.24.0

server/tasks.py ADDED Viewed

	@@ -0,0 +1,428 @@

+"""
+AgentOps Gym — Task definitions and deterministic graders.
+4 tasks across three difficulty levels:
+  task_1 (easy)   — Bug Localization
+  task_2 (medium) — Config Patching
+  task_3 (hard)   — Caching Implementation
+  task_4 (medium) — Secret Migration
+Each grader returns a float in [0.0, 1.0] and a breakdown dict.
+Graders check the in-memory snapshot state and the recorded action history.
+"""
+import json
+import re
+from typing import Dict, Any, List, Tuple, Optional
+# ---------------------------------------------------------------------------
+# Task registry
+# ---------------------------------------------------------------------------
+TASK_REGISTRY: Dict[str, Dict[str, Any]] = {
+    "task_1": {
+        "name": "Bug Localization",
+        "difficulty": "easy",
+        "max_steps": 8,
+        "optimal_steps": 3,
+        "description": (
+            "The fetch_user function in this project is broken. "
+            "Users report it always returns None instead of user data. "
+            "Find the bug and report which file and line number contains it."
+        ),
+        "initial_visible_files": ["README.md"],
+    },
+    "task_2": {
+        "name": "Config Patching",
+        "difficulty": "medium",
+        "max_steps": 10,
+        "optimal_steps": 4,
+        "description": (
+            "Production is timing out. Someone reported the API timeout is misconfigured. "
+            "Find the config file and change the timeout value from 30 to 10."
+        ),
+        "initial_visible_files": ["main.py", "README.md"],
+    },
+    "task_3": {
+        "name": "Caching Implementation",
+        "difficulty": "hard",
+        "max_steps": 8,
+        "optimal_steps": 6,
+        "description": (
+            "API latency is high. Logs show fetch_user() is being called repeatedly "
+            "with the same user_id. Implement simple in-memory caching for fetch_user. "
+            "You have 8 tool calls max. Plan before acting."
+        ),
+        "initial_visible_files": ["README.md"],
+    },
+    "task_4": {
+        "name": "Secret Migration",
+        "difficulty": "medium",
+        "max_steps": 10,
+        "optimal_steps": 4,
+        "description": (
+            "Security audit found a hardcoded API key in main.py. "
+            "Move the key 'SECRET_TOKEN_XYZ' to a new .env file as API_KEY=SECRET_TOKEN_XYZ "
+            "and update main.py to load it using os.getenv('API_KEY')."
+        ),
+        "initial_visible_files": ["main.py", "README.md"],
+    },
+}
+def get_task(task_id: str) -> Dict[str, Any]:
+    if task_id not in TASK_REGISTRY:
+        raise KeyError(f"Unknown task_id: {task_id!r}. Available: {list(TASK_REGISTRY.keys())}")
+    return TASK_REGISTRY[task_id]
+def list_task_ids() -> List[str]:
+    return list(TASK_REGISTRY.keys())
+# ---------------------------------------------------------------------------
+# Step-level reward (called on every step)
+# ---------------------------------------------------------------------------
+def compute_step_reward(
+    task_id: str,
+    tool: str,
+    parameters: Dict[str, Any],
+    tool_result: str,
+    action_history: List[str],
+    discovered_files: List[str],
+    snapshot: Dict[str, str],
+) -> Tuple[float, Dict[str, float]]:
+    """Compute per-step reward signal.
+    action_history is the history BEFORE this step was appended,
+    so the current action is NOT yet in the list.
+    Returns (reward_value, breakdown_dict).
+    """
+    reward = 0.0
+    breakdown: Dict[str, float] = {}
+    current_action = f"{tool}({parameters})"
+    # ── Penalty: exact repeated call (compare against previous entries only) ──
+    if len(action_history) >= 1 and action_history[-1] == current_action:
+        reward -= 0.15
+        breakdown["repeat_penalty"] = -0.15
+    # ── Penalty: FileRead/FileWrite on unknown file ──
+    if tool in ("FileRead", "FileWrite"):
+        fname = parameters.get("filename", "")
+        if fname and fname not in discovered_files:
+            reward -= 0.10
+            breakdown["hallucination_penalty"] = -0.10
+    # ── Bonus: TodoWrite as the first action (planning bonus) ──
+    # action_history is pre-append, so an empty list means this is step 1
+    if tool == "TodoWrite" and len(action_history) == 0:
+        reward += 0.05
+        breakdown["planning_bonus"] = 0.05
+    # ── Penalty: error result ──
+    if tool_result.startswith("ERROR:"):
+        reward -= 0.05
+        breakdown["error_penalty"] = -0.05
+    # ── Task-specific step signals ──
+    step_signal = _task_step_signal(task_id, tool, parameters, tool_result, action_history)
+    if step_signal != 0.0:
+        reward += step_signal
+        breakdown["task_signal"] = step_signal
+    return round(reward, 3), breakdown
+def _task_step_signal(
+    task_id: str, tool: str, params: Dict, result: str, history: List[str]
+) -> float:
+    """Small positive reward for productive actions toward the task goal."""
+    if task_id == "task_1":
+        # Reward discovering relevant files/patterns
+        if tool == "Grep" and "json" in str(params).lower():
+            return 0.05
+        if tool == "FileRead" and params.get("filename") == "main.py":
+            return 0.10
+        if tool == "Bash" and "lint" in str(params).lower():
+            return 0.05
+    elif task_id == "task_2":
+        if tool == "Grep" and "timeout" in str(params).lower():
+            return 0.05
+        if tool == "FileRead" and params.get("filename") == "config.json":
+            return 0.10
+        if tool == "FileWrite" and params.get("filename") == "config.json":
+            return 0.05
+    elif task_id == "task_3":
+        if tool == "TodoWrite":
+            return 0.05
+        if tool == "WebSearch" and "cache" in str(params).lower():
+            return 0.05
+        if tool == "FileRead" and params.get("filename") == "main.py":
+            return 0.05
+        if tool == "FileWrite" and params.get("filename") == "main.py":
+            return 0.05
+    elif task_id == "task_4":
+        if tool == "FileWrite" and params.get("filename") == ".env":
+            return 0.10
+        if tool == "FileRead" and params.get("filename") == "main.py":
+            return 0.05
+        if tool == "Grep" and "SECRET_TOKEN" in str(params).upper():
+            return 0.05
+    return 0.0
+# ---------------------------------------------------------------------------
+# Episode-level graders (called at done=True)
+# ---------------------------------------------------------------------------
+def grade_episode(
+    task_id: str,
+    snapshot: Dict[str, str],
+    action_history: List[str],
+    steps_used: int,
+) -> Tuple[float, Dict[str, float]]:
+    """Compute final episode score. Returns (score, breakdown)."""
+    graders = {
+        "task_1": _grade_task1,
+        "task_2": _grade_task2,
+        "task_3": _grade_task3,
+        "task_4": _grade_task4,
+    }
+    fn = graders.get(task_id)
+    if fn is None:
+        return 0.0, {"error": f"No grader for {task_id}"}
+    try:
+        return fn(snapshot, action_history, steps_used)
+    except Exception as e:
+        return 0.0, {"error": str(e)}
+def _efficiency_score(steps_used: int, optimal_steps: int) -> float:
+    """Efficiency component: 1.0 at or below optimal, -0.08 per extra step, min 0."""
+    return max(0.0, min(1.0, 1.0 - (steps_used - optimal_steps) * 0.08))
+def _history_contains(history: List[str], *keywords: str) -> bool:
+    """True if any history entry contains ALL keywords (case-insensitive)."""
+    for entry in history:
+        upper = entry.upper()
+        if all(kw.upper() in upper for kw in keywords):
+            return True
+    return False
+def _history_contains_any(history: List[str], *keywords: str) -> bool:
+    for entry in history:
+        upper = entry.upper()
+        if any(kw.upper() in upper for kw in keywords):
+            return True
+    return False
+# ── Task 1: Bug Localization ──────────────────────────────────────────────
+def _grade_task1(
+    snapshot: Dict[str, str],
+    history: List[str],
+    steps_used: int,
+) -> Tuple[float, Dict[str, float]]:
+    """
+    Grader checks (all derived from the action history):
+      +0.30 — main.py referenced in any action
+      +0.40 — agent read main.py and some action mentions "json"
+      +0.30 — agent ran the linter (Bash lint)
+    Final score = correctness * 0.7 + efficiency * 0.3
+    """
+    breakdown: Dict[str, float] = {}
+    score = 0.0
+    # Found correct file
+    if _history_contains_any(history, "MAIN.PY"):
+        breakdown["found_correct_file"] = 0.30
+        score += 0.30
+    # Found correct line — check if agent read main.py and referenced line 6
+    main_read = _history_contains(history, "FILEREAD", "MAIN.PY")
+    grep_json = _history_contains_any(history, "RESPONSE.JSON", "JSON")
+    if main_read and grep_json:
+        breakdown["found_correct_line"] = 0.40
+        score += 0.40
+    # Ran the linter: a single Bash call whose command mentions lint
+    bash_lint = _history_contains(history, "BASH", "LINT")
+    if bash_lint:
+        breakdown["ran_linter"] = 0.30
+        score += 0.30
+    eff = _efficiency_score(steps_used, TASK_REGISTRY["task_1"]["optimal_steps"])
+    final = score * 0.7 + eff * 0.3
+    breakdown["efficiency"] = round(eff, 3)
+    return round(min(1.0, final), 4), breakdown
+# ── Task 2: Config Patching ──────────────────────────────────────────────
+def _grade_task2(
+    snapshot: Dict[str, str],
+    history: List[str],
+    steps_used: int,
+) -> Tuple[float, Dict[str, float]]:
+    """
+    +0.20 — found config.json (referenced in history)
+    +0.20 — read config before writing (FileRead before FileWrite)
+    +0.40 — timeout correctly set to 10 in the snapshot
+    +0.20 — config is valid JSON after write
+    """
+    breakdown: Dict[str, float] = {}
+    score = 0.0
+    # Found config.json
+    if _history_contains_any(history, "CONFIG.JSON"):
+        breakdown["found_config"] = 0.20
+        score += 0.20
+    # Read before write (good safety practice)
+    read_idx = next((i for i, h in enumerate(history) if "FILEREAD" in h.upper() and "CONFIG" in h.upper()), None)
+    write_idx = next((i for i, h in enumerate(history) if "FILEWRITE" in h.upper() and "CONFIG" in h.upper()), None)
+    if read_idx is not None and write_idx is not None and read_idx < write_idx:
+        breakdown["read_before_write"] = 0.20
+        score += 0.20
+    elif write_idx is not None and read_idx is None:
+        # Destructive write without reading
+        breakdown["destructive_write_penalty"] = -0.20
+        score -= 0.20
+    # Correct value in snapshot
+    config_content = snapshot.get("config.json", "")
+    try:
+        cfg = json.loads(config_content)
+        if cfg.get("timeout") == 10:
+            breakdown["correct_timeout_value"] = 0.40
+            score += 0.40
+        # Valid JSON
+        breakdown["valid_json"] = 0.20
+        score += 0.20
+    except Exception:
+        breakdown["invalid_json_penalty"] = -0.10
+        score -= 0.10
+    eff = _efficiency_score(steps_used, TASK_REGISTRY["task_2"]["optimal_steps"])
+    final = score * 0.7 + eff * 0.3
+    breakdown["efficiency"] = round(eff, 3)
+    return round(min(1.0, max(0.0, final)), 4), breakdown
+# ── Task 3: Caching Implementation ───────────────────────────────────────
+def _grade_task3(
+    snapshot: Dict[str, str],
+    history: List[str],
+    steps_used: int,
+) -> Tuple[float, Dict[str, float]]:
+    """
+    +0.30 — cache mechanism present in main.py (lru_cache or dict cache)
+    +0.30 — correct function decorated/modified (fetch_user)
+    +0.20 — code is syntactically clean (Bash lint passes)
+    +0.10 — used TodoWrite before acting
+    +0.10 — used WebSearch for docs
+    Hard cap: if steps > 8, done=True and score capped at 0.3
+    """
+    breakdown: Dict[str, float] = {}
+    score = 0.0
+    main_content = snapshot.get("main.py", "")
+    # Cache mechanism present
+    has_lru = "lru_cache" in main_content
+    has_dict_cache = re.search(r'_cache\s*=\s*\{', main_content) or re.search(r'cache\s*=\s*\{\}', main_content)
+    if has_lru or has_dict_cache:
+        breakdown["cache_mechanism_present"] = 0.30
+        score += 0.30
+    # Correct function modified
+    if "fetch_user" in main_content and (has_lru or has_dict_cache):
+        # Check lru_cache is on the right function
+        if re.search(r'@.*lru_cache.*\ndef fetch_user', main_content, re.DOTALL) or \
+           re.search(r'lru_cache.*fetch_user', main_content):
+            breakdown["correct_function_modified"] = 0.30
+            score += 0.30
+        elif has_dict_cache and "fetch_user" in main_content:
+            breakdown["correct_function_modified"] = 0.20
+            score += 0.20
+    # Lint passed — no obvious bugs introduced
+    bash_lint = _history_contains(history, "BASH", "LINT")
+    if bash_lint and not _history_contains_any(history, "ISSUE(S) FOUND", "ERROR"):
+        breakdown["lint_passes"] = 0.20
+        score += 0.20
+    # Used TodoWrite at start
+    if _history_contains_any(history, "TODOWRITE"):
+        breakdown["planning_bonus"] = 0.10
+        score += 0.10
+    # Used WebSearch
+    if _history_contains_any(history, "WEBSEARCH"):
+        breakdown["websearch_bonus"] = 0.10
+        score += 0.10
+    # Hard cap for exceeding 8 steps (applied to the raw score, before the efficiency blend)
+    if steps_used > 8:
+        score = min(score, 0.30)
+        breakdown["hard_cap_applied"] = 1.0  # float flag, to match Dict[str, float]
+    eff = _efficiency_score(steps_used, TASK_REGISTRY["task_3"]["optimal_steps"])
+    final = score * 0.7 + eff * 0.3
+    breakdown["efficiency"] = round(eff, 3)
+    return round(min(1.0, max(0.0, final)), 4), breakdown
+# ── Task 4: Secret Migration ──────────────────────────────────────────────
+def _grade_task4(
+    snapshot: Dict[str, str],
+    history: List[str],
+    steps_used: int,
+) -> Tuple[float, Dict[str, float]]:
+    """
+    +0.30 — .env file contains API_KEY=SECRET_TOKEN_XYZ
+    +0.40 — main.py imports os and uses os.getenv('API_KEY')
+    +0.20 — main.py no longer contains hardcoded secret
+    +0.10 — planning bonus (TodoWrite)
+    Final score = 0.7 * raw score + 0.3 * efficiency, clamped to [0, 1].
+    """
+    breakdown: Dict[str, float] = {}
+    score = 0.0
+    env_content = snapshot.get(".env", "")
+    main_content = snapshot.get("main.py", "")
+    # .env check
+    if "API_KEY=SECRET_TOKEN_XYZ" in env_content.replace(" ", ""):
+        breakdown["env_file_correct"] = 0.30
+        score += 0.30
+    # main.py check: accept either quote style around the key
+    uses_getenv = "os.getenv('API_KEY')" in main_content or 'os.getenv("API_KEY")' in main_content
+    if "import os" in main_content and uses_getenv:
+        breakdown["main_uses_getenv"] = 0.40
+        score += 0.40
+    # Secret removal
+    if "SECRET_TOKEN_XYZ" not in main_content:
+        breakdown["secret_removed_from_main"] = 0.20
+        score += 0.20
+    # Planning bonus
+    if _history_contains_any(history, "TODOWRITE"):
+        breakdown["planning_bonus"] = 0.10
+        score += 0.10
+    eff = _efficiency_score(steps_used, TASK_REGISTRY["task_4"]["optimal_steps"])
+    final = score * 0.7 + eff * 0.3
+    breakdown["efficiency"] = round(eff, 3)
+    return round(min(1.0, max(0.0, final)), 4), breakdown
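
Note on the scoring arithmetic: each grader above ends by blending the rubric score with an efficiency term and clamping the result. The sketch below reproduces only that final step so the weighting is easier to check by hand. `blend_score` and `cap_rubric` are hypothetical names introduced for illustration, and the efficiency values are made-up inputs, since `_efficiency_score` is defined elsewhere in `server/tasks.py`.

```python
def blend_score(rubric_score: float, efficiency: float) -> float:
    """Hypothetical helper mirroring the last lines of each grader."""
    final = rubric_score * 0.7 + efficiency * 0.3
    return round(min(1.0, max(0.0, final)), 4)


def cap_rubric(rubric_score: float, steps_used: int, max_steps: int = 8) -> float:
    """Task 3 hard cap: past max_steps the rubric score cannot exceed 0.30."""
    return min(rubric_score, 0.30) if steps_used > max_steps else rubric_score


# Assumed inputs: a perfect run, and a perfect rubric that overran the step budget.
perfect = blend_score(1.0, 1.0)
capped = blend_score(cap_rubric(1.0, steps_used=9), 0.5)
print(perfect, capped)
```

Under these assumptions, a run that trips the task 3 hard cap contributes at most 0.21 from the rubric, so its final score depends mostly on the efficiency term.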

server/tools.py ADDED Viewed

	@@ -0,0 +1,308 @@

+"""
+AgentOps Gym — Simulated tool implementations.
+All tools operate on an in-memory filesystem snapshot. No real subprocess,
+no real filesystem, fully deterministic and reproducible. The fake linter/
+test runner uses static analysis of the snapshot strings.
+"""
+import re
+import json
+from typing import Dict, Optional, Tuple
+# ---------------------------------------------------------------------------
+# In-memory project snapshots (one per task)
+# ---------------------------------------------------------------------------
+PROJECT_SNAPSHOTS: Dict[str, Dict[str, str]] = {
+    "task_1": {
+        "main.py": """\
+import requests
+def fetch_user(user_id):
+    url = f"https://api.example.com/users/{user_id}"
+    response = requests.get(url)
+    return response.json          # BUG: missing () — should be response.json()
+def main():
+    user = fetch_user(123)
+    print(user['name'])
+if __name__ == "__main__":
+    main()
+""",
+        "utils.py": "def helper(): pass\n",
+        "config.json": '{"api_url": "https://api.example.com", "timeout": 30}\n',
+        "README.md": "# Example Project\n",
+    },
+    "task_2": {
+        "main.py": """\
+import requests
+import json
+def fetch_data(endpoint):
+    url = f"https://api.example.com/{endpoint}"
+    response = requests.get(url, timeout=30)
+    return response.json()
+def main():
+    data = fetch_data("data")
+    print(data)
+""",
+        "utils.py": "def helper(): pass\n",
+        "config.json": '{"api_url": "https://api.example.com", "timeout": 30}\n',
+        "README.md": "# Example Project\n",
+    },
+    "task_3": {
+        "main.py": """\
+import requests
+def fetch_user(user_id):
+    url = f"https://api.example.com/users/{user_id}"
+    response = requests.get(url)
+    return response.json()
+def main():
+    for uid in range(100):
+        user = fetch_user(uid)
+        print(user['name'])
+if __name__ == "__main__":
+    main()
+""",
+        "utils.py": "def helper(): pass\n",
+        "config.json": '{"api_url": "https://api.example.com", "timeout": 30}\n',
+        "README.md": "# Example Project\n",
+        "tests/test_main.py": """\
+from main import fetch_user
+def test_fetch_user():
+    result = fetch_user(1)
+    assert result is not None
+""",
+    },
+    "task_4": {
+        "main.py": """\
+import requests
+API_KEY = "SECRET_TOKEN_XYZ"
+def fetch_data():
+    headers = {"Authorization": f"Bearer {API_KEY}"}
+    response = requests.get("https://api.example.com/data", headers=headers)
+    return response.json()
+if __name__ == "__main__":
+    print(fetch_data())
+""",
+        "README.md": "# Project Alpha\nSecure the API key.\n",
+    },
+}
+# ---------------------------------------------------------------------------
+# Simulated web search index
+# ---------------------------------------------------------------------------
+WEB_SEARCH_DOCS: Dict[str, str] = {
+    "lru_cache": """\
+functools.lru_cache — Python docs
+  @functools.lru_cache(maxsize=128)
+  def my_function(arg): ...
+  Caches results of function calls. Use maxsize=None for unlimited cache.
+""",
+    "response.json": """\
+requests.Response.json() — requests docs
+  response.json() returns the JSON-encoded content of the response.
+  Note: json is a method, must be called with parentheses: response.json()
+""",
+    "timeout": """\
+requests timeout — requests docs
+  Set timeout in seconds: requests.get(url, timeout=10)
+  Recommended: keep timeout low (5-15s) for production APIs.
+""",
+    "python caching": """\
+Python caching patterns:
+  1. functools.lru_cache — in-memory memoization decorator
+  2. dict-based cache    — manual dict for full control
+  3. joblib.Memory       — disk-backed cache
+  For simple in-memory caching, lru_cache is idiomatic Python.
+""",
+    "getenv": """\
+os.getenv(key, default=None) — Python docs
+  Return the value of the environment variable key if it exists, or default if it doesn't.
+  Example:
+    import os
+    api_key = os.getenv('API_KEY')
+""",
+    ".env": """\
+.env files — Best Practices
+  Store secrets and configuration in a .env file:
+    API_KEY=your_secret_here
+  Never commit .env files to version control.
+""",
+}
+# ---------------------------------------------------------------------------
+# Tool implementations
+# ---------------------------------------------------------------------------
+AVAILABLE_TOOLS = {
+    "FileRead":  "Read contents of a specific file",
+    "FileWrite": "Write/edit a specific file with new content",
+    "Grep":      "Search for a pattern across all files",
+    "Bash":      "Run a shell command (simulated: lint, test runner)",
+    "WebSearch": "Search for documentation (simulated)",
+    "TodoWrite": "Write a plan/todo list before acting",
+}
+def run_tool(
+    tool: str,
+    parameters: Dict,
+    snapshot: Dict[str, str],
+    discovered_files: list,
+) -> Tuple[str, Dict[str, str], list]:
+    """
+    Execute a simulated tool and return (result_string, updated_snapshot, updated_discovered).
+    All mutations to the snapshot are returned as a new dict.
+    """
+    snapshot = dict(snapshot)
+    discovered = list(discovered_files)
+    if tool == "FileRead":
+        return _file_read(parameters, snapshot, discovered)
+    elif tool == "FileWrite":
+        return _file_write(parameters, snapshot, discovered)
+    elif tool == "Grep":
+        return _grep(parameters, snapshot, discovered)
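+    elif tool == "Bash":
+        # _bash does not track discovered files; keep the caller's list instead of its []
+        result, snapshot, _ = _bash(parameters, snapshot)
+        return result, snapshot, discovered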
+    elif tool == "WebSearch":
+        return _web_search(parameters), snapshot, discovered
+    elif tool == "TodoWrite":
+        return _todo_write(parameters), snapshot, discovered
+    else:
+        return f"ERROR: Unknown tool '{tool}'. Available: {list(AVAILABLE_TOOLS.keys())}", snapshot, discovered
+def _file_read(params, snapshot, discovered):
+    fname = params.get("filename", "")
+    if not fname:
+        return "ERROR: 'filename' parameter required for FileRead.", snapshot, discovered
+    if fname not in snapshot:
+        return f"ERROR: File '{fname}' not found in project.", snapshot, discovered
+    # Reveal file in discovered list
+    if fname not in discovered:
+        discovered.append(fname)
+    content = snapshot[fname]
+    lines = content.splitlines()
+    numbered = "\n".join(f"{i+1:3}: {line}" for i, line in enumerate(lines))
+    return f"=== {fname} ===\n{numbered}", snapshot, discovered
+def _file_write(params, snapshot, discovered):
+    fname = params.get("filename", "")
+    content = params.get("content", "")
+    if not fname:
+        return "ERROR: 'filename' parameter required for FileWrite.", snapshot, discovered
+    snapshot[fname] = content
+    if fname not in discovered:
+        discovered.append(fname)
+    return f"Write successful: {fname} ({len(content.encode('utf-8'))} bytes written)", snapshot, discovered
+def _grep(params, snapshot, discovered):
+    pattern = params.get("pattern", "")
+    if not pattern:
+        return "ERROR: 'pattern' parameter required for Grep.", snapshot, discovered
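+    # An invalid regex from the agent should produce a tool error, not an exception
+    try:
+        regex = re.compile(pattern, re.IGNORECASE)
+    except re.error as e:
+        return f"ERROR: Invalid regex pattern '{pattern}': {e}", snapshot, discovered
+    results = []
+    for fname, content in snapshot.items():
+        for i, line in enumerate(content.splitlines(), 1):
+            if regex.search(line):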
+                results.append(f"{fname}:{i} → {line.strip()}")
+                # Discovering a file via grep reveals it
+                if fname not in discovered:
+                    discovered.append(fname)
+    if not results:
+        return f"No matches for pattern '{pattern}'.", snapshot, discovered
+    return "\n".join(results), snapshot, discovered
+def _bash(params, snapshot):
+    cmd = params.get("command", "")
+    if not cmd:
+        return "ERROR: 'command' parameter required for Bash.", snapshot, []
+    cmd_lower = cmd.lower()
+    # Simulated linter
+    if "lint" in cmd_lower or "flake8" in cmd_lower or "pylint" in cmd_lower:
+        fname = None
+        for f in snapshot:
+            if f.endswith(".py") and f in cmd:
+                fname = f
+                break
+        if fname and fname in snapshot:
+            return _lint_file(fname, snapshot[fname]), snapshot, []
+        # Lint all py files
+        out = []
+        for f, content in snapshot.items():
+            if f.endswith(".py"):
+                out.append(_lint_file(f, content))
+        return "\n".join(out) if out else "No Python files found.", snapshot, []
+    # Simulated test runner
+    if "pytest" in cmd_lower or "test" in cmd_lower:
+        test_files = [f for f in snapshot if "test" in f]
+        if not test_files:
+            return "No test files found.", snapshot, []
+        # Check if main.py has obvious bugs
+        main_content = snapshot.get("main.py", "")
+        if "response.json\n" in main_content or "response.json " in main_content:
+            return '{"status": "error", "file": "main.py", "line": 6, "message": "AttributeError: method object is not subscriptable — did you forget response.json()?"}', snapshot, []
+        return '{"status": "pass", "passed": 1, "failed": 0}', snapshot, []
+    # Simulated validate (for config check)
+    if "validate" in cmd_lower or "json" in cmd_lower:
+        for fname, content in snapshot.items():
+            if fname.endswith(".json") and fname in cmd:
+                try:
+                    json.loads(content)
+                    return f"✓ {fname} is valid JSON", snapshot, []
+                except json.JSONDecodeError as e:
+                    return f"✗ {fname} invalid JSON: {e}", snapshot, []
+        return "Validation complete.", snapshot, []
+    return f"$ {cmd}\n(simulated) Command executed. No output.", snapshot, []
+def _lint_file(fname: str, content: str) -> str:
+    errors = []
+    for i, line in enumerate(content.splitlines(), 1):
+        # Check for common bug: response.json without ()
+        if re.search(r'response\.json\b(?!\()', line):
+            errors.append(f'  {fname}:{i}: E001 response.json called without parentheses — should be response.json()')
+        # Check for bare except
+        if re.match(r'\s*except\s*:', line):
+            errors.append(f'  {fname}:{i}: W001 Bare except clause detected')
+        # Check for hardcoded secrets (task_4)
+        if "SECRET_TOKEN_XYZ" in line and fname == "main.py":
+            errors.append(f'  {fname}:{i}: E002 Hardcoded secret detected — use environment variables')
+    if errors:
+        return f'{fname}: {len(errors)} issue(s) found\n' + '\n'.join(errors)
+    return f'{fname}: OK'
+def _web_search(params) -> str:
+    query = params.get("query", "").lower()
+    for key, doc in WEB_SEARCH_DOCS.items():
+        if key in query:
+            return doc
+    return f"No results found for '{params.get('query', '')}'. Try more specific terms."
+def _todo_write(params) -> str:
+    plan = params.get("plan", params.get("content", ""))
+    if not plan:
+        return "ERROR: 'plan' parameter required for TodoWrite."
+    return f"✓ Plan recorded:\n{plan}"
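
Note on the E001 lint rule: `_lint_file` relies on a word boundary plus a negative lookahead to flag `response.json` when it is not immediately followed by `(`. The sketch below isolates that pattern to show which lines it flags. `flags_e001` and the sample lines are hypothetical, added only for illustration.

```python
import re

# Same pattern as the E001 check in _lint_file
E001 = re.compile(r'response\.json\b(?!\()')


def flags_e001(line: str) -> bool:
    """Hypothetical wrapper: True when the line would trigger E001."""
    return E001.search(line) is not None


samples = [
    "return response.json",        # missing call: flagged
    "return response.json()",      # correct call: not flagged
    "data = response.json_data",   # different attribute: no word boundary, not flagged
    "return response.json ()",     # valid Python, but the space defeats the lookahead
    "# see response.json docs",    # comments are scanned too
]
for line in samples:
    print(f"{flags_e001(line)!s:5}  {line}")
```

Because the check is purely textual, it also flags comments and the unusual `response.json ()` spelling, which seems acceptable for a deterministic simulated linter but is worth knowing when writing new task snapshots.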

uv.lock ADDED Viewed

The diff for this file is too large to render.