Spaces: keerthanas1011/api-contract-debugger

keerthanas1011 committed on Apr 3

Commit 5cf6185, 0 parent(s)

API Contract Debugger OpenEnv Environment

Files changed (21)

.DS_Store +0 -0
Dockerfile +26 -0
README.md +194 -0
RL_ARCHITECTURE.md +637 -0
inference.py +234 -0
openenv.yaml +53 -0
pyproject.toml +18 -0
requirements.txt +4 -0
server/.DS_Store +0 -0
server/__pycache__/app.cpython-314.pyc +0 -0
server/__pycache__/environment.cpython-314.pyc +0 -0
server/__pycache__/fixtures.cpython-314.pyc +0 -0
server/__pycache__/graders.cpython-314.pyc +0 -0
server/__pycache__/models.cpython-314.pyc +0 -0
server/app.py +168 -0
server/environment.py +291 -0
server/fixtures.py +241 -0
server/graders.py +193 -0
server/models.py +181 -0
tests/__pycache__/test_env.cpython-314-pytest-9.0.2.pyc +0 -0
tests/test_env.py +565 -0

.DS_Store ADDED

Binary file (6.15 kB).

Dockerfile ADDED

	@@ -0,0 +1,26 @@

+FROM python:3.11-slim
+# HuggingFace Spaces requires a non-root user with uid 1000
+RUN useradd -m -u 1000 user
+WORKDIR /app
+# Install dependencies as root first
+COPY --chown=user requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy source code
+COPY --chown=user . .
+# Switch to non-root user
+USER user
+ENV HOME=/home/user \
+    PATH=/home/user/.local/bin:$PATH \
+    PORT=7860 \
+    PYTHONUNBUFFERED=1 \
+    PYTHONPATH=/app
+EXPOSE 7860
+CMD ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]

README.md ADDED

	@@ -0,0 +1,194 @@

+---
+title: API Contract Debugger
+emoji: 🔍
+colorFrom: blue
+colorTo: indigo
+sdk: docker
+app_port: 7860
+tags:
+  - openenv
+  - rl-environment
+  - api-debugging
+  - contract-testing
+---
+# API Contract Debugger — OpenEnv Environment
+An OpenEnv environment where AI agents debug broken OpenAPI-style contract
+specifications by proposing targeted field-level corrections.
+## What Is This?
+Every backend engineer debugs API contract violations constantly — mismatched
+types, missing required fields, wrong HTTP status codes, forbidden extra fields
+leaking into responses. This environment turns that real-world task into a
+structured RL benchmark.
+The agent receives a broken API spec and a list of violations. Each step, it
+proposes one fix. It gets rewarded for each violation resolved and penalised
+for introducing new ones.
+---
+## Action Space
+```json
+{
+  "kind": "add_field | remove_field | change_type | change_status | no_op",
+  "endpoint_index": 0,
+  "location": "request_body | response_body | status_code",
+  "field_name": "field_name_or_null",
+  "new_value": "<type string | field spec dict | int status code | null>"
+}
+```
+| `kind`          | `new_value` type | Description |
+|-----------------|-----------------|-------------|
+| `add_field`     | `{"type": "...", "required": true, "description": "..."}` | Add a missing field |
+| `remove_field`  | `null` | Remove a forbidden field |
+| `change_type`   | `"integer"` / `"string"` / `"boolean"` / `"number"` | Fix a field's type |
+| `change_status` | `204` / `200` / `201` etc. | Fix the HTTP status code |
+| `no_op`         | `null` | Do nothing (small implicit cost) |
+---
+## Observation Space
+| Field | Type | Description |
+|-------|------|-------------|
+| `task_name` | str | Active task: `easy`, `medium`, `hard` |
+| `task_description` | str | Plain-English description of violations |
+| `endpoints` | list | Current (partially fixed) endpoint specs |
+| `violations` | list | Remaining violations with type + description |
+| `violations_fixed_this_step` | int | How many the last action resolved |
+| `violations_introduced_this_step` | int | How many the last action introduced |
+| `total_violations_at_start` | int | Violation count at episode start |
+| `step_count` | int | Steps taken so far |
+| `max_steps` | int | Episode step budget |
+| `last_action_error` | str\|null | Validation error if action was malformed |
+| `reward` | float | Per-step reward |
+| `done` | bool | Whether the episode has terminated |
+---
+## Tasks
+### Easy (1 endpoint, 1 violation, max 5 steps)
+A user registration endpoint is missing `created_at` (string) in its response.
+Expected score for a capable agent: **1.0**
+### Medium (3 endpoints, 3 violations, max 10 steps)
+An e-commerce API has:
+1. `GET /products/{id}` — `product_id` returned as `string` instead of `integer`
+2. `POST /orders` — `quantity` accepted as `string` instead of `integer`
+3. `DELETE /orders/{id}` — returns status `200` instead of `204`
+Expected score for a capable agent: **1.0**
+### Hard (4 endpoints, 6 violations, max 15 steps)
+An auth + profile API has:
+1. `POST /auth/login` — missing `refresh_token` in response
+2. `POST /auth/login` — `expires_in` is `string` instead of `integer`
+3. `GET /users/{id}/profile` — missing `created_at` in response
+4. `GET /users/{id}/profile` — exposes forbidden `password_hash` field (must be removed)
+5. `PATCH /users/{id}/profile` — returns status `500` instead of `200`
+6. `PATCH /users/{id}/profile` — missing `updated_at` in response
+Expected score for a capable agent: **0.7–1.0** (frontier models)
+---
+## Reward Function
+| Event | Reward |
+|-------|--------|
+| Fix a violation | `+0.2 × severity` |
+| Introduce a violation | `−0.15 × severity` |
+| Malformed action | `−0.05` |
+| Solve all violations | `+0.5` bonus |
+Severity weights: `missing_field=1.0`, `wrong_type=0.9`, `wrong_status=0.8`, `extra_field=0.7`
+Final episode score is computed by `grade_episode()` → float in `[0.0, 1.0]`.
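As a rough illustration of how the reward table above combines, the sketch below recomputes one step's reward from the listed events. `sketch_step_reward` and its parameters are hypothetical names invented for this note, not functions from `server/graders.py`; only the weights and constants are taken from the README.

```python
from typing import List

# Severity weights copied from the README; everything else is illustrative.
SEVERITY = {
    "missing_field": 1.0,
    "wrong_type": 0.9,
    "wrong_status": 0.8,
    "extra_field": 0.7,
}


def sketch_step_reward(
    fixed: List[str],
    introduced: List[str],
    malformed: bool = False,
    solved: bool = False,
) -> float:
    """Hypothetical helper: combine the README's reward events for one step."""
    if malformed:
        # Assumption: a malformed action earns only the flat penalty.
        return -0.05
    reward = sum(0.2 * SEVERITY[kind] for kind in fixed)
    reward -= sum(0.15 * SEVERITY[kind] for kind in introduced)
    if solved:
        reward += 0.5  # bonus once no violations remain
    # Rounded only to keep floating-point noise out of the printed value.
    return round(reward, 4)
```

For the easy task, a single `missing_field` fix that also solves the episode gives 0.2 + 0.5 = 0.7 under these assumptions.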
+---
+## API Endpoints
+| Method | Path | Description |
+|--------|------|-------------|
+| `POST` | `/reset` | Reset environment. Body: `{"task_name": "easy\|medium\|hard"}` |
+| `POST` | `/step`  | Apply one action. Body: `{"action": {...}}` |
+| `GET`  | `/state` | Full internal state |
+| `GET`  | `/score` | Final episode score |
+| `GET`  | `/tasks` | List all available tasks |
+| `GET`  | `/health`| Health check |
+| `GET`  | `/schema`| JSON schemas for action + observation |
+---
+## Setup & Usage
+### Run locally
+```bash
+git clone <your-repo-url>
+cd api_contract_debugger_env
+pip install -r requirements.txt
+uvicorn server.app:app --host 0.0.0.0 --port 7860
+```
+### Run with Docker
+```bash
+docker build -t api-contract-debugger .
+docker run -p 7860:7860 api-contract-debugger
+```
+### Run the baseline agent
+```bash
+export HF_TOKEN=your_token
+export ENV_BASE_URL=http://localhost:7860
+python inference.py
+```
+### Run tests
+```bash
+pip install pytest httpx
+pytest tests/ -v
+```
+---
+## Baseline Scores
+| Task | Model | Score | Steps Used |
+|------|-------|-------|-----------|
+| easy | Qwen2.5-72B-Instruct | 1.000 | 1 |
+| medium | Qwen2.5-72B-Instruct | 1.000 | 3 |
+| hard | Qwen2.5-72B-Instruct | ~0.85 | 12 |
+---
+## Project Structure
+```
+api_contract_debugger_env/
+├── server/
+│   ├── __init__.py
+│   ├── app.py          # FastAPI app, route registration
+│   ├── environment.py  # OpenEnv Environment subclass
+│   ├── models.py       # Pydantic Action / Observation / State
+│   ├── graders.py      # Violation detection + reward shaping
+│   └── fixtures.py     # Task definitions (broken + golden specs)
+├── tests/
+│   └── test_env.py     # 56 tests covering all components
+├── inference.py        # Baseline agent
+├── openenv.yaml        # OpenEnv metadata
+├── pyproject.toml      # Package config + server entry point
+├── requirements.txt
+├── uv.lock
+└── Dockerfile
+```

RL_ARCHITECTURE.md ADDED

	@@ -0,0 +1,637 @@

+# Reinforcement Learning Architecture: API Contract Debugger
+## Overview
+The API Contract Debugger is a **reinforcement learning environment** built on the OpenEnv framework. It challenges AI agents to fix broken OpenAPI-style contract specifications by proposing targeted field-level corrections.
+This document explains how the codebase implements the core RL concepts:
+- **Agent** — The external AI system interacting with the environment
+- **Environment** — The `APIContractDebuggerEnv` class that simulates the debugging task
+- **State** — What the agent observes and the internal environment state
+- **Action** — The fixes the agent can propose
+- **Reward/Result** — The feedback signal and scoring mechanism
+---
+## 1. Agent (External AI System)
+### What is the Agent?
+The **agent** is an **external AI system** (e.g., an LLM, RL policy, or human) that:
+- Receives observations from the environment
+- Proposes actions (fixes to the API spec)
+- Receives reward feedback and the next state
+- Aims to maximize cumulative reward by fixing all violations
+### Agent Interaction Pattern
+```
+Agent                              Environment
+  |                                     |
+  |---- POST /reset (task_name) ----->  |
+  |                                     |
+  | <------ Initial Observation --------|
+  |  (endpoints, violations, reward=0)  |
+  |                                     |
+  |---- POST /step (action) ----------> |
+  |                                     |
+  | <---- Updated Observation --------- |
+  |  (new endpoints, new violations,    |
+  |   reward, done, fixed/introduced)   |
+  |                                     |
+  | [repeat until done=True]            |
+  |                                     |
+  |---- GET /score, GET /state ------>  |
+  |                                     |
+```
+### Agent Interface in Codebase
+- **File**: `server/app.py`
+- **Routes**:
+  - `POST /reset` — Initialize new episode
+  - `POST /step` — Apply one action
+  - `GET /state` — Query full environment state (for debugging)
+  - `GET /score` — Get final episode score
+  - `GET /tasks` — List available tasks
+The agent is an external HTTP client (the baseline implementation is `inference.py`); `server/app.py` only exposes the REST interface it calls. All observations are JSON-serializable.
+---
+## 2. Environment (`APIContractDebuggerEnv`)
+### Class Definition
+**File**: `server/environment.py`
+```python
+class APIContractDebuggerEnv(Environment[DebugAction, DebugObservation, DebugState]):
+    """
+    Environment where an agent debugs broken API contract specifications.
+    Inherits from OpenEnv's Environment base class.
+    Implements reset(), step(), and state property.
+    """
+```
+### Environment Responsibilities
+1. **Initialize tasks** — Load broken + golden endpoint specs from fixtures
+2. **Detect violations** — Compare current spec against golden spec
+3. **Apply actions** — Mutate the current spec based on agent's fix proposal
+4. **Compute rewards** — Dense per-step reward based on violations fixed/introduced
+5. **Track state** — Maintain episode counter, step count, violations
+6. **Terminate episodes** — Check for success (all fixed) or max steps reached
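The violation-detection responsibility above can be pictured as a field-by-field diff of the current spec against the golden spec. The following is a minimal sketch under stated assumptions, not the code in `server/graders.py`: `sketch_detect_violations` and `_violation` are hypothetical names, every golden field is treated as required, and any field absent from the golden spec is treated as forbidden.

```python
from typing import Any, Dict, List, Optional

# Severity weights as documented in the README.
SEVERITY = {
    "missing_field": 1.0,
    "wrong_type": 0.9,
    "wrong_status": 0.8,
    "extra_field": 0.7,
}

Endpoint = Dict[str, Any]


def _violation(idx: int, location: str, field: Optional[str], kind: str) -> Dict[str, Any]:
    """Build a violation record shaped like the ones in the observation."""
    return {
        "endpoint_index": idx,
        "location": location,
        "field_name": field,
        "violation_type": kind,
        "severity": SEVERITY[kind],
    }


def sketch_detect_violations(
    current: List[Endpoint], golden: List[Endpoint]
) -> List[Dict[str, Any]]:
    """Illustrative diff of a current spec against the golden spec."""
    found: List[Dict[str, Any]] = []
    for idx, (cur, gold) in enumerate(zip(current, golden)):
        if cur.get("status_code") != gold.get("status_code"):
            found.append(_violation(idx, "status_code", None, "wrong_status"))
        for location in ("request_body", "response_body"):
            cur_fields = cur.get(location, {})
            gold_fields = gold.get(location, {})
            for name, spec in gold_fields.items():
                if name not in cur_fields:
                    found.append(_violation(idx, location, name, "missing_field"))
                elif cur_fields[name].get("type") != spec.get("type"):
                    found.append(_violation(idx, location, name, "wrong_type"))
            for name in cur_fields:
                if name not in gold_fields:
                    found.append(_violation(idx, location, name, "extra_field"))
    return found
```

Comparing the golden spec with itself yields an empty list, which corresponds to the "all fixed" termination condition.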
+### Key Methods
+#### `reset(seed, episode_id, task_name, **kwargs) → DebugObservation`
+Initializes a fresh episode:
+- Loads task config from fixtures
+- Deep-copies broken endpoints to avoid cross-episode state leakage
+- Detects initial violations
+- Returns initial observation with reward=0
+```python
+def reset(self, seed=None, episode_id=None, task_name=None, **kwargs):
+    """
+    Reset the environment and return the initial observation.
+    """
+    # Load task config and deep-copy endpoints
+    self._current_endpoints = copy.deepcopy(self._task_cfg["broken_endpoints"])
+    self._golden_endpoints = copy.deepcopy(self._task_cfg["golden_endpoints"])
+    # Detect violations (agent's starting problem)
+    self._violations = detect_violations(self._current_endpoints, self._golden_endpoints)
+    return self._make_observation(reward=0.0, done=False)  # remaining fields omitted
+```
+#### `step(action, timeout_s, **kwargs) → DebugObservation`
+Processes one agent action and returns the updated observation (simplified excerpt):
+```python
+def step(self, action: DebugAction, timeout_s=None, **kwargs) -> DebugObservation:
+    """
+    Apply one fix action → return updated observation + reward.
+    """
+    # Snapshot the violations before the spec is mutated, and count the step
+    prev_violations = list(self._violations)
+    self._step_count += 1
+    # 1. Apply the action (mutate current_endpoints)
+    action_error = self._apply_action(action)
+    # 2. Recompute violations
+    self._violations = detect_violations(self._current_endpoints, self._golden_endpoints)
+    # 3. Compute dense reward
+    reward = step_reward(prev_violations, self._violations, self._initial_violations, action_error)
+    # 4. Check termination
+    all_fixed = len(self._violations) == 0
+    out_of_steps = self._step_count >= self._task_cfg["max_steps"]
+    self._done = all_fixed or out_of_steps
+    # 5. Bonus reward if solved
+    if all_fixed:
+        reward += 0.5
+    return self._make_observation(reward, self._done, ...)
+```
+#### `_apply_action(action) → Optional[str]`
+Attempts to mutate `self._current_endpoints` according to the action:
+- **Validates** endpoint index, field name, locations
+- **Executes** the fix:
+  - `ADD_FIELD` — Insert new field into request/response body
+  - `REMOVE_FIELD` — Delete field from body
+  - `CHANGE_TYPE` — Update field's type
+  - `CHANGE_STATUS` — Update endpoint's HTTP status code
+  - `NO_OP` — Explicit pass (implicit penalty via no reward)
+- **Returns** error string if invalid, `None` on success
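The validate-then-execute flow described above can be sketched on plain dictionaries. This is an illustrative stand-in, not the method in `server/environment.py`: `sketch_apply_action`, its error messages, and the order of the checks are assumptions made for the example.

```python
from typing import Any, Dict, List, Optional


def sketch_apply_action(
    endpoints: List[Dict[str, Any]], action: Dict[str, Any]
) -> Optional[str]:
    """Mutate `endpoints` in place; return an error string, or None on success."""
    idx = action.get("endpoint_index")
    if not isinstance(idx, int) or not 0 <= idx < len(endpoints):
        return f"endpoint_index {idx!r} is out of range"
    endpoint = endpoints[idx]
    kind = action.get("kind")
    value = action.get("new_value")
    if kind == "no_op":
        return None
    if kind == "change_status":
        if not isinstance(value, int):
            return "change_status needs an integer new_value"
        endpoint["status_code"] = value
        return None
    # The remaining kinds all target a named field inside a body.
    location = action.get("location")
    if location not in ("request_body", "response_body"):
        return f"invalid location {location!r} for {kind!r}"
    name = action.get("field_name")
    if not isinstance(name, str):
        return "field_name is required"
    body = endpoint.setdefault(location, {})
    if kind == "add_field":
        if not isinstance(value, dict) or "type" not in value:
            return "add_field needs a field spec with a 'type'"
        body[name] = dict(value)
    elif kind == "remove_field":
        if name not in body:
            return f"field {name!r} not found"
        del body[name]
    elif kind == "change_type":
        if name not in body:
            return f"field {name!r} not found"
        body[name]["type"] = value
    else:
        return f"unknown kind {kind!r}"
    return None
```

Returning the error as a string rather than raising mirrors the documented contract, where a non-`None` result is what triggers the malformed-action penalty.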
+#### `state` Property
+Returns the complete internal state (not exposed to agent by default, but available via `/state`):
+```python
+@property
+def state(self) -> DebugState:
+    """Return full internal environment state."""
+    return DebugState(
+        episode_id=self._episode_id,
+        step_count=self._step_count,
+        task_name=self._task_name,
+        original_endpoints=self._original_endpoints,     # Snapshot of broken spec
+        current_endpoints=self._current_endpoints,       # Current state after fixes
+        golden_endpoints=self._golden_endpoints,         # Target spec
+        violations=self._violations,                     # Current violations
+        total_violations_at_start=len(self._initial_violations),
+        max_steps=self._task_cfg["max_steps"],
+    )
+```
+### Supported Tasks
+**File**: `server/fixtures.py`
+Three difficulty levels:
+| Task | Difficulty | Endpoints | Violations | Max Steps | Description |
+|------|-----------|-----------|-----------|-----------|-------------|
+| **easy** | Beginner | 1 | 1 missing field | 5 | Simple: add one field to response |
+| **medium** | Intermediate | 3 | 3 (type errors + wrong status) | 10 | Type mismatches and HTTP status fixes |
+| **hard** | Advanced | 4 | 6 (missing, extra, type, status) | 15 | Complex: multiple violation types |
+Each task has:
+- `broken_endpoints` — Starting state (what agent sees)
+- `golden_endpoints` — Ground truth (what violations are measured against)
+- `description` — Human-readable task objective
+- `max_steps` — Episode cut-off
+---
+## 3. State
+### Observation (`DebugObservation`)
+**What the agent sees after each action.**
+File: `server/models.py`
+```python
+class DebugObservation(Observation):
+    """
+    What the agent observes after reset() or step().
+    """
+    # Task info
+    task_name: str                          # "easy" | "medium" | "hard"
+    task_description: str                   # Human description
+    # Current spec
+    endpoints: List[Dict[str, Any]]         # Current endpoints (partially fixed)
+    violations: List[Dict[str, Any]]        # Detected violations still present
+    # Reward signals
+    reward: float                           # Dense per-step reward
+    done: bool                              # Episode termination flag
+    violations_fixed_this_step: int         # Count of fixed violations
+    violations_introduced_this_step: int    # Count of new violations
+    total_violations_at_start: int          # Reference baseline
+    # Tracking
+    step_count: int                         # Steps taken so far
+    max_steps: int                          # Episode limit
+    last_action_error: Optional[str]        # Validation error message
+```
+#### Example Observation
+```json
+{
+  "task_name": "easy",
+  "task_description": "Add missing 'created_at' field to response...",
+  "endpoints": [
+    {
+      "method": "POST",
+      "path": "/users/register",
+      "status_code": 201,
+      "request_body": {
+        "username": {"type": "string", "required": true},
+        "email": {"type": "string", "required": true},
+        "password": {"type": "string", "required": true}
+      },
+      "response_body": {
+        "user_id": {"type": "integer", "required": true},
+        "username": {"type": "string", "required": true}
+      }
+    }
+  ],
+  "violations": [
+    {
+      "endpoint_index": 0,
+      "location": "response_body",
+      "field_name": "created_at",
+      "violation_type": "missing_field",
+      "description": "POST /users/register response_body: required field 'created_at' (string) is missing",
+      "severity": 1.0
+    }
+  ],
+  "violations_fixed_this_step": 0,
+  "violations_introduced_this_step": 0,
+  "total_violations_at_start": 1,
+  "step_count": 0,
+  "max_steps": 5,
+  "reward": 0.0,
+  "done": false,
+  "last_action_error": null
+}
+```
+### Full Internal State (`DebugState`)
+**Available via `GET /state` endpoint (for debugging/analysis, not given to agent by default).**
+```python
+class DebugState(State):
+    """
+    Full internal state (not exposed to agent by default).
+    """
+    task_name: str
+    original_endpoints: List[Dict[str, Any]]  # Snapshot of broken spec
+    current_endpoints: List[Dict[str, Any]]   # Mutated by agent's actions
+    golden_endpoints: List[Dict[str, Any]]    # Ground truth
+    violations: List[Dict[str, Any]]          # Computed violations
+    total_violations_at_start: int
+    max_steps: int
+```
+---
+## 4. Action (`DebugAction`)
+**What the agent can propose.**
+File: `server/models.py`
+```python
+class DebugAction(Action):
+    """
+    A single fix proposed by the agent.
+    The agent targets one endpoint + one field and proposes exactly one change.
+    """
+    kind: ActionKind                    # Type of fix
+    endpoint_index: int                 # Which endpoint to fix (0-indexed)
+    location: str                       # "request_body" | "response_body" | "status_code"
+    field_name: Optional[str]           # Field to modify (null for status_code)
+    new_value: Optional[Any]            # The corrected value
+```
+### Action Types (`ActionKind`)
+| Kind | Target | Effect | new_value |
+|------|--------|--------|-----------|
+| `ADD_FIELD` | Field | Insert missing field into body | `{"type": str, "required"?: bool, "description"?: str}` |
+| `REMOVE_FIELD` | Field | Delete forbidden field from body | `null` |
+| `CHANGE_TYPE` | Field | Fix field's JSON Schema type | Type string (e.g., `"integer"`) |
+| `CHANGE_STATUS` | Endpoint | Fix HTTP status code | Integer (e.g., `201`) |
+| `NO_OP` | None | Explicit pass/wait | `null` |
+#### Example Actions
+```python
+# Fix 1: Add missing 'created_at' field
+{
+  "kind": "add_field",
+  "endpoint_index": 0,
+  "location": "response_body",
+  "field_name": "created_at",
+  "new_value": {
+    "type": "string",
+    "description": "ISO-8601 timestamp"
+  }
+}
+# Fix 2: Change field type from string to integer
+{
+  "kind": "change_type",
+  "endpoint_index": 1,
+  "location": "request_body",
+  "field_name": "user_id",
+  "new_value": "integer"
+}
+# Fix 3: Correct HTTP status code
+{
+  "kind": "change_status",
+  "endpoint_index": 0,
+  "location": "status_code",
+  "field_name": null,
+  "new_value": 201
+}
+# Fix 4: Remove extra field
+{
+  "kind": "remove_field",
+  "endpoint_index": 2,
+  "location": "response_body",
+  "field_name": "deprecated_field",
+  "new_value": null
+}
+# Fix 5: Explicit pass
+{
+  "kind": "no_op",
+  "endpoint_index": 0,
+  "location": "request_body",
+  "field_name": null,
+  "new_value": null
+}
+```
+### Action Validation
+The environment validates actions in `_apply_action()`:
+- **Endpoint index bounds** — Must be `0 ≤ index < len(endpoints)`
+- **Location validity** — Must be `"request_body"`, `"response_body"`, or `"status_code"`
+- **Field existence** — REMOVE_FIELD and CHANGE_TYPE require field to exist
+- **Type format** — Fields must have `{"type": "..."}` structure
+- **Status code format** — Must be an integer
+If validation fails, `_apply_action()` returns an error string and the step receives a `-0.05` reward penalty.
+---
+## 5. Reward & Result
+### Dense Per-Step Reward
+**File**: `server/graders.py` → `step_reward()` function
+The agent receives feedback after each step:
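+```python
+from typing import Any, Dict, List
+def _violation_key(v: Dict[str, Any]) -> tuple:
+    # Simplified identity for a violation; the real grader may key differently
+    return (v["endpoint_index"], v["location"], v["field_name"], v["violation_type"])
+def step_reward(
+    prev_violations: List[Dict[str, Any]],
+    new_violations: List[Dict[str, Any]],
+    initial_violations: List[Dict[str, Any]],
+    action_error: bool,
+) -> float:
+    """
+    Dense per-step reward:
+    +0.2 × severity  per violation resolved
+    -0.15 × severity per new violation introduced
+    -0.05             for malformed action
+    +0.5              bonus if all violations fixed (added by step(), not here)
+    """
+    if action_error:
+        return -0.05
+    prev_keys = {_violation_key(v) for v in prev_violations}
+    new_keys = {_violation_key(v) for v in new_violations}
+    reward = 0.0
+    # Violations present before the action but gone afterwards were fixed
+    for v in prev_violations:
+        if _violation_key(v) not in new_keys:
+            reward += 0.2 * v["severity"]
+    # Violations that only appear after the action were introduced by it
+    for v in new_violations:
+        if _violation_key(v) not in prev_keys:
+            reward -= 0.15 * v["severity"]
+    return reward
+```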
+### Violation Severity Weights
+Weighted by problem importance:
+| Violation Type | Severity | Reason |
+|----------------|----------|--------|
+| `missing_field` | 1.0 | Breaks contract — top priority |
+| `wrong_type` | 0.9 | Type mismatch — critical |
+| `wrong_status` | 0.8 | HTTP code error — significant |
+| `extra_field` | 0.7 | Forbidden field — less critical |
+### Episode Scoring (`grade_episode()`)
+**Computed at episode end.** Returns final score in `[0.0, 1.0]`.
+```python
+def grade_episode(
+    current_endpoints: List[Dict[str, Any]],
+    golden_endpoints: List[Dict[str, Any]],
+    initial_violations: List[Dict[str, Any]],
+) -> float:
+    """
+    Final episode score:
+    score = (weighted_violations_fixed - weighted_violations_introduced)
+            / total_initial_weight
+    Clamped to [0.0, 1.0]
+    1.0 = all violations fixed, no new ones introduced
+    0.5 = 50% of violations fixed
+    0.0 = no improvement or made things worse
+    """
+```
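The docstring's formula can be tried out on violation lists directly. The sketch below is a hedged approximation: the documented `grade_episode()` takes endpoint specs and presumably recomputes violations itself, whereas the hypothetical `sketch_grade` here takes the initial and remaining violation lists as inputs, and the identity key in `_key` is an assumption.

```python
from typing import Any, Dict, List, Tuple

Violation = Dict[str, Any]


def _key(v: Violation) -> Tuple[Any, ...]:
    # Assumed identity of a violation for this sketch.
    return (v["endpoint_index"], v["location"], v["field_name"], v["violation_type"])


def sketch_grade(initial: List[Violation], remaining: List[Violation]) -> float:
    """(weighted fixed - weighted introduced) / total initial weight, clamped to [0, 1]."""
    total = sum(v["severity"] for v in initial)
    if total == 0:
        return 1.0  # assumption: nothing to fix counts as solved
    initial_keys = {_key(v) for v in initial}
    remaining_keys = {_key(v) for v in remaining}
    # Initial violations that no longer appear were fixed.
    fixed = sum(v["severity"] for v in initial if _key(v) not in remaining_keys)
    # Remaining violations that were not there at the start were introduced.
    introduced = sum(v["severity"] for v in remaining if _key(v) not in initial_keys)
    return max(0.0, min(1.0, (fixed - introduced) / total))
```

Clamping matters in the case where an agent fixes nothing and introduces new violations: the raw ratio goes negative, but the reported score stays at 0.0.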
+#### Example Scoring Scenario
+**Task: easy (1 violation)**
+- Initial violation: `missing_field "created_at" (severity=1.0)`
+- Step 1: Agent incorrectly changes the type of `username` to `integer` (introduces 1 `wrong_type` violation, severity 0.9)
+- Step 2: Agent adds `created_at` correctly (fixes the initial violation)
+- Final state: the initial violation is fixed, but the introduced violation remains
+```
+score = (1.0 - 0.9) / 1.0 = 0.1
+```
+The introduced violation cancels most of the credit for the fix. If the introduced weight exceeded the fixed weight, the score would be clamped to 0.0.
+---
+## 6. Complete RL Loop Example
+### Scenario: Easy Task
+**Initial state:**
+```
+Broken spec: POST /users/register response missing "created_at"
+Golden spec: response has user_id, username, created_at
+```
+### Episode Transcript
+```
+RESET request (task_name="easy")
+  ↓
+Observation #0:
+  endpoints: [broken registration endpoint]
+  violations: [missing_field "created_at"]
+  reward: 0.0
+  done: false
+  step_count: 0
+STEP 1: Agent proposes ADD_FIELD action
+  action.kind = "add_field"
+  action.endpoint_index = 0
+  action.location = "response_body"
+  action.field_name = "created_at"
+  action.new_value = {"type": "string", "description": "ISO-8601 timestamp"}
+  ↓
+Environment:
+  - Validates action ✓
+  - Adds field to response_body
+  - Recomputes violations → [] (0 violations!)
+  - Computes reward: +0.2 × 1.0 (fixed 1 violation of severity 1.0) = +0.2
+          + 0.5 (bonus for all_fixed=true) = +0.7 total
+  - Sets done=true (all violations fixed)
+  ↓
+Observation #1:
+  endpoints: [fixed registration endpoint]
+  violations: []
+  violations_fixed_this_step: 1
+  violations_introduced_this_step: 0
+  reward: 0.7
+  done: true
+  step_count: 1
+SCORE request
+  ↓
+score = (1.0 fixed - 0 introduced) / 1.0 initial = 1.0 ✓
+Agent succeeds with perfect score!
+```
+---
+## 7. File Structure Summary
+```
+server/
+├── app.py                    # FastAPI routes, HTTP interface
+├── environment.py            # APIContractDebuggerEnv (core RL logic)
+├── models.py                 # Pydantic models: DebugAction, DebugObservation, DebugState
+├── fixtures.py               # Task definitions (easy, medium, hard)
+├── graders.py                # Violation detection + reward/scoring
+└── __pycache__/
+tests/                         # Unit tests for environment, graders, fixtures
+RL_ARCHITECTURE.md             # This file
+```
+---
+## 8. Key Design Principles
+1. **Stateful Environment** — One episode per task at a time (OpenEnv singleton pattern)
+2. **Dense Rewards** — Agent gets per-step feedback (not just final score) to guide learning
+3. **Severity-Weighted** — Different violation types have different weights (missing fields = highest priority)
+4. **Action Validation** — Invalid actions receive penalty and return error messages
+5. **Deep-Copied State** — Endpoints are deep-copied to prevent cross-episode contamination
+6. **Observable Violations** — Agent sees exact list of violations (not hidden)
+7. **Termination Conditions**:
+   - Success: All violations fixed
+   - Failure: Max steps exceeded
+8. **JSON/REST Interface** — Agent communicates via HTTP (language-agnostic)
+---
+## 9. Typical Agent Workflow
+```python
+import requests
+BASE_URL = "http://localhost:7860"
+# 1. Reset to start new episode
+reset_resp = requests.post(f"{BASE_URL}/reset", json={
+    "task_name": "easy",
+    "seed": 42
+})
+obs = reset_resp.json()
+print(f"Violations to fix: {len(obs['violations'])}")
+# 2. Repeat: observe → decide → act
+for step in range(obs['max_steps']):
+    if obs['done']:
+        break
+    # Agent decision logic (depends on obs['violations'])
+    action = {
+        "kind": "add_field",
+        "endpoint_index": 0,
+        "location": "response_body",
+        "field_name": "created_at",
+        "new_value": {"type": "string"}
+    }
+    # 3. Apply action
+    step_resp = requests.post(f"{BASE_URL}/step", json={"action": action})
+    obs = step_resp.json()
+    print(f"Step {step+1}: reward={obs['reward']}, violations={len(obs['violations'])}")
+# 4. Check final score
+score_resp = requests.get(f"{BASE_URL}/score")
+print(f"Final score: {score_resp.json()['score']}")
+```
+---
+## 10. Future Extensions
+Potential enhancements to the RL framework:
+1. **Multi-Agent** — Support concurrent episodes via session IDs
+2. **Curriculum Learning** — Dynamically adapt difficulty based on agent performance
+3. **Partial Observability** — Hide some violations initially to increase challenge
+4. **Action Constraints** — Limit action space per step (e.g., "fix at most 1 field")
+5. **Custom Reward Shaping** — Configurable severity weights + bonus structures
+6. **State Representation** — Multiple formats (JSON, graph, embedding-friendly)
+---
+## Summary Table
+| Concept | Implementation | File | Purpose |
+|---------|---|---|---|
+| **Agent** | External AI/LLM | HTTP client | Proposes fixes |
+| **Environment** | `APIContractDebuggerEnv` | `environment.py` | Simulates faults + validates fixes |
+| **State** | `DebugObservation` + `DebugState` | `models.py` | Agent observes + internal tracking |
+| **Action** | `DebugAction` | `models.py` | Fix proposals |
+| **Reward** | `step_reward()` | `graders.py` | Dense per-step feedback |
+| **Result** | Episode score `[0.0, 1.0]` | `graders.py` | Final performance metric |
+| **Tasks** | Fixtures (easy/medium/hard) | `fixtures.py` | Problem instances |
+| **HTTP API** | FastAPI routes | `app.py` | Communication interface |

inference.py ADDED

	@@ -0,0 +1,234 @@

+"""
+Baseline Inference Script — API Contract Debugger
+===================================================
+Runs an LLM (via an OpenAI-compatible API) against all three tasks and emits the required
+[START] / [STEP] / [END] log format.
+Environment variables:
+    API_BASE_URL   LLM endpoint  (default: https://router.huggingface.co/v1)
+    MODEL_NAME     Model ID      (default: Qwen/Qwen2.5-72B-Instruct)
+    HF_TOKEN       API key
+    ENV_BASE_URL   Running env   (default: http://localhost:7860)
+    TASK_NAME      One task or "all"  (default: all)
+"""
+from __future__ import annotations
+import json
+import os
+import textwrap
+from typing import Any, Dict, List, Optional
+import requests
+from openai import OpenAI
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME   = os.getenv("MODEL_NAME",   "Qwen/Qwen2.5-72B-Instruct")
+API_KEY      = os.getenv("HF_TOKEN") or os.getenv("API_KEY", "hf_placeholder")
+ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860").rstrip("/")
+TASK_NAME    = os.getenv("TASK_NAME", "all")
+TEMPERATURE  = 0.0
+MAX_TOKENS   = 512
+BENCHMARK    = "api_contract_debugger"
+TASKS = ["easy", "medium", "hard"]
+# ---------------------------------------------------------------------------
+# Logging helpers (required stdout format)
+# ---------------------------------------------------------------------------
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    error_val = error if error else "null"
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} "
+        f"done={str(done).lower()} error={error_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} "
+        f"score={score:.3f} rewards={rewards_str}",
+        flush=True,
+    )
+# ---------------------------------------------------------------------------
+# Environment HTTP client
+# ---------------------------------------------------------------------------
+def env_reset(task_name: str) -> Dict[str, Any]:
+    r = requests.post(f"{ENV_BASE_URL}/reset", json={"task_name": task_name}, timeout=30)
+    r.raise_for_status()
+    return r.json()
+def env_step(action_payload: Dict[str, Any]) -> Dict[str, Any]:
+    r = requests.post(f"{ENV_BASE_URL}/step", json={"action": action_payload}, timeout=30)
+    r.raise_for_status()
+    return r.json()
+def env_score() -> float:
+    r = requests.get(f"{ENV_BASE_URL}/score", timeout=10)
+    r.raise_for_status()
+    return float(r.json()["score"])
+# ---------------------------------------------------------------------------
+# LLM agent
+# ---------------------------------------------------------------------------
+SYSTEM_PROMPT = textwrap.dedent("""
+You are an expert API contract debugger. You will be shown a broken API spec
+and a list of violations. Your job is to propose ONE fix per turn.
+You must respond with ONLY a valid JSON object matching this schema:
+{
+  "kind": "add_field" | "remove_field" | "change_type" | "change_status" | "no_op",
+  "endpoint_index": <integer, 0-based>,
+  "location": "request_body" | "response_body" | "status_code",
+  "field_name": <string or null>,
+  "new_value": <string | integer | object | null>
+}
+Rules:
+- add_field:     new_value must be {"type": "<type>", "required": true/false, "description": "..."}
+- change_type:   new_value must be a type string e.g. "integer", "string", "boolean", "number"
+- change_status: new_value must be an integer HTTP status code; location must be "status_code"
+- remove_field:  new_value must be null
+- no_op:         use when no fix is needed; new_value must be null
+Do NOT include any explanation — output ONLY the JSON object.
+""").strip()
+def build_user_prompt(obs: Dict[str, Any], step: int, history: List[str]) -> str:
+    violations = obs.get("violations", [])
+    endpoints  = obs.get("endpoints", [])
+    history_block = "\n".join(history[-6:]) if history else "None"
+    viol_text = json.dumps(violations, indent=2) if violations else "None — all fixed!"
+    ep_text   = json.dumps(endpoints, indent=2)
+    return textwrap.dedent(f"""
+        Step {step} | Task: {obs.get('task_name')} | Violations remaining: {len(violations)}
+        TASK DESCRIPTION:
+        {obs.get('task_description', '')}
+        CURRENT ENDPOINTS:
+        {ep_text}
+        REMAINING VIOLATIONS:
+        {viol_text}
+        PREVIOUS ACTIONS:
+        {history_block}
+        Propose ONE fix as a JSON object.
+    """).strip()
+def get_action(client: OpenAI, obs: Dict[str, Any], step: int, history: List[str]) -> Dict[str, Any]:
+    """Call the LLM and parse a DebugAction payload."""
+    prompt = build_user_prompt(obs, step, history)
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user",   "content": prompt},
+            ],
+            temperature=TEMPERATURE,
+            max_tokens=MAX_TOKENS,
+        )
+        text = (completion.choices[0].message.content or "").strip()
+        # Strip markdown fences if present
+        if text.startswith("```"):
+            text = text.split("```")[1]
+            if text.startswith("json"):
+                text = text[4:]
+        return json.loads(text.strip())
+    except Exception as exc:
+        print(f"[DEBUG] LLM call failed: {exc}", flush=True)
+        return {"kind": "no_op", "endpoint_index": 0, "location": "response_body",
+                "field_name": None, "new_value": None}
+# ---------------------------------------------------------------------------
+# Single episode runner
+# ---------------------------------------------------------------------------
+def run_episode(client: OpenAI, task_name: str) -> None:
+    log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
+    rewards: List[float] = []
+    steps_taken = 0
+    success = False
+    score = 0.0
+    try:
+        obs = env_reset(task_name)
+        history: List[str] = []
+        max_steps = obs.get("max_steps", 15)
+        for step in range(1, max_steps + 1):
+            if obs.get("done"):
+                break
+            action_payload = get_action(client, obs, step, history)
+            action_str = json.dumps(action_payload, separators=(",", ":"))
+            obs = env_step(action_payload)
+            reward = float(obs.get("reward") or 0.0)
+            done   = bool(obs.get("done", False))
+            error  = obs.get("last_action_error")
+            rewards.append(reward)
+            steps_taken = step
+            log_step(step=step, action=action_str, reward=reward, done=done, error=error)
+            history.append(
+                f"Step {step}: {action_str} → reward={reward:+.2f} "
+                f"fixed={obs.get('violations_fixed_this_step', 0)} "
+                f"remaining={len(obs.get('violations', []))}"
+            )
+            if done:
+                break
+        score = env_score()
+        success = score >= 0.8
+    finally:
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+def main() -> None:
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    tasks_to_run = TASKS if TASK_NAME == "all" else [TASK_NAME]
+    for task in tasks_to_run:
+        run_episode(client, task)
+if __name__ == "__main__":
+    main()
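
The fence-stripping in `get_action` above only helps when the reply begins with a code fence; any leading prose makes `json.loads` fail and the agent falls back to `no_op`. A more tolerant parser might scan for the first decodable JSON object instead. The sketch below assumes the reply contains at least one top-level JSON object; `extract_json_object` is a hypothetical helper and is not part of this commit.

```python
import json
from typing import Any, Dict, Optional


def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object embedded in text, or None (hypothetical helper)."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it (closing fences, trailing prose).
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not decodable from this brace; try the next opening brace.
        start = text.find("{", start + 1)
    return None


if __name__ == "__main__":
    reply = 'Here is the fix: {"kind": "no_op", "endpoint_index": 0} Hope that helps.'
    print(extract_json_object(reply))
```

A caller could try this helper first and return the `no_op` payload only when it yields `None`.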

openenv.yaml ADDED

	@@ -0,0 +1,53 @@

+name: api-contract-debugger
+version: "1.0.0"
+description: >
+  An OpenEnv environment where AI agents debug broken OpenAPI-style contract
+  specifications. The agent receives a broken API spec and must identify and
+  fix contract violations (missing fields, wrong types, wrong status codes,
+  forbidden extra fields) by proposing targeted field-level corrections.
+tags:
+  - api
+  - debugging
+  - contract-testing
+  - real-world
+  - nlp
+tasks:
+  - name: easy
+    description: "Single endpoint with one missing required response field."
+    difficulty: easy
+    max_steps: 5
+  - name: medium
+    description: "Three endpoints with type mismatches and a wrong HTTP status code."
+    difficulty: medium
+    max_steps: 10
+  - name: hard
+    description: >
+      Four endpoints with 6 violations: missing fields, wrong types,
+      wrong status code, and a forbidden extra field that must be removed.
+    difficulty: hard
+    max_steps: 15
+action_space:
+  type: structured
+  description: >
+    DebugAction — proposes one fix per step: add_field, remove_field,
+    change_type, change_status, or no_op.
+observation_space:
+  type: structured
+  description: >
+    DebugObservation — returns the current (partially fixed) endpoint specs,
+    the list of remaining violations, per-step fix counts, and reward signal.
+reward:
+  type: dense
+  range: [-1.0, 1.5]
+  description: >
+    +0.2×severity per violation fixed, -0.15×severity per violation introduced,
+    -0.05 for malformed action, +0.5 bonus when all violations are resolved.
+hf_space: ""  # fill in your HuggingFace Space URL before submitting
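
The reward description in `openenv.yaml` can be read as simple arithmetic. The sketch below is an illustration only: it assumes each violation carries a numeric severity (1.0 in the example) and applies the four terms exactly as the YAML lists them. The real `step_reward` in `server/graders.py` may weight or clip values differently, and `sketch_step_reward` is a hypothetical name.

```python
from typing import Sequence


def sketch_step_reward(
    fixed_severities: Sequence[float],
    introduced_severities: Sequence[float],
    malformed_action: bool = False,
    all_resolved: bool = False,
) -> float:
    """Apply the reward terms from openenv.yaml (assumed formula, not the real grader)."""
    reward = 0.2 * sum(fixed_severities)          # +0.2 x severity per violation fixed
    reward -= 0.15 * sum(introduced_severities)   # -0.15 x severity per violation introduced
    if malformed_action:
        reward -= 0.05                            # flat penalty for a malformed action
    if all_resolved:
        reward += 0.5                             # bonus once no violations remain
    return reward


if __name__ == "__main__":
    # Easy task: the single violation (assumed severity 1.0) is fixed in one step.
    print(round(sketch_step_reward([1.0], [], all_resolved=True), 2))
```

Under these assumptions, solving the easy task in one step earns 0.2 for the fix plus the 0.5 completion bonus.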

pyproject.toml ADDED

	@@ -0,0 +1,18 @@

+[project]
+name = "api-contract-debugger-env"
+version = "1.0.0"
+description = "OpenEnv environment for debugging broken API contract specifications"
+requires-python = ">=3.10"
+dependencies = [
+    "openenv-core>=0.2.0",
+    "fastapi>=0.110.0",
+    "uvicorn[standard]>=0.29.0",
+    "pydantic>=2.0.0",
+]
+[project.scripts]
+server = "server.app:main"
+[build-system]
+requires = ["setuptools>=68"]
+build-backend = "setuptools.build_meta"

requirements.txt ADDED

	@@ -0,0 +1,4 @@

+openenv-core>=0.2.0
+fastapi>=0.110.0
+uvicorn[standard]>=0.29.0
+pydantic>=2.0.0

server/.DS_Store ADDED

Binary file (6.15 kB).

server/__pycache__/app.cpython-314.pyc ADDED

Binary file (7.84 kB).

server/__pycache__/environment.cpython-314.pyc ADDED

Binary file (12.8 kB).

server/__pycache__/fixtures.cpython-314.pyc ADDED

Binary file (6.15 kB).

server/__pycache__/graders.cpython-314.pyc ADDED

Binary file (7.67 kB).

server/__pycache__/models.cpython-314.pyc ADDED

Binary file (6.31 kB).

server/app.py ADDED

	@@ -0,0 +1,168 @@

+"""
+FastAPI application entry point for the API Contract Debugger OpenEnv environment.
+Route registration order:
+  1. Custom stateful /reset, /step, /state routes registered FIRST.
+  2. OpenEnv PRODUCTION-mode routes (/health, /schema, /metadata, /ws) attached LAST.
+     PRODUCTION mode does NOT register /reset /step /state, so our routes win.
+"""
+from __future__ import annotations
+import os
+from typing import Any, Dict, Optional
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel, Field
+from openenv.core.env_server.http_server import HTTPEnvServer
+from openenv.core.env_server.types import ServerMode
+from .environment import APIContractDebuggerEnv
+from .models import DebugAction, DebugObservation, DebugState
+# ---------------------------------------------------------------------------
+# Singleton environment instances — one per task
+# ---------------------------------------------------------------------------
+_envs: Dict[str, APIContractDebuggerEnv] = {
+    "easy":   APIContractDebuggerEnv(task_name="easy"),
+    "medium": APIContractDebuggerEnv(task_name="medium"),
+    "hard":   APIContractDebuggerEnv(task_name="hard"),
+}
+_active_task: str = "easy"
+def _get_env() -> APIContractDebuggerEnv:
+    return _envs[_active_task]
+# ---------------------------------------------------------------------------
+# Request bodies for our custom routes
+# ---------------------------------------------------------------------------
+class ResetBody(BaseModel):
+    task_name: Optional[str] = Field(
+        default=None,
+        description="Task to run: 'easy', 'medium', or 'hard'.",
+    )
+    seed: Optional[int] = Field(default=None)
+    episode_id: Optional[str] = Field(default=None)
+class StepBody(BaseModel):
+    action: Dict[str, Any] = Field(
+        ...,
+        description="Serialised DebugAction payload.",
+    )
+# ---------------------------------------------------------------------------
+# App factory
+# ---------------------------------------------------------------------------
+def create_app() -> FastAPI:
+    app = FastAPI(
+        title="API Contract Debugger",
+        description=(
+            "An OpenEnv environment where AI agents debug broken OpenAPI-style "
+            "contract specifications by proposing targeted field-level fixes."
+        ),
+        version="1.0.0",
+    )
+    # ------------------------------------------------------------------
+    # 1. Our stateful routes — registered FIRST
+    # ------------------------------------------------------------------
+    @app.post("/reset", tags=["Environment"])
+    async def reset(req: ResetBody = ResetBody()) -> Dict[str, Any]:
+        """Reset the environment. Optionally switch task via task_name."""
+        global _active_task
+        if req.task_name is not None:
+            if req.task_name not in _envs:
+                raise HTTPException(
+                    status_code=422,
+                    detail=f"Unknown task '{req.task_name}'. Choose: {list(_envs.keys())}",
+                )
+            _active_task = req.task_name
+        obs: DebugObservation = _get_env().reset(
+            seed=req.seed,
+            episode_id=req.episode_id,
+        )
+        return obs.model_dump()
+    @app.post("/step", tags=["Environment"])
+    async def step(req: StepBody) -> Dict[str, Any]:
+        """Apply one fix action and return the updated observation."""
+        try:
+            action = DebugAction.model_validate(req.action)
+        except Exception as exc:
+            raise HTTPException(status_code=422, detail=f"Invalid action: {exc}")
+        obs: DebugObservation = _get_env().step(action)
+        return obs.model_dump()
+    @app.get("/state", tags=["Environment"])
+    async def state() -> Dict[str, Any]:
+        """Return the full internal environment state."""
+        s: DebugState = _get_env().state
+        return s.model_dump()
+    @app.get("/score", tags=["Environment"])
+    async def score() -> Dict[str, Any]:
+        """Return the final episode score [0.0, 1.0]."""
+        return {
+            "task": _active_task,
+            "score": _get_env().score(),
+        }
+    @app.get("/tasks", tags=["Environment"])
+    async def list_tasks() -> Dict[str, Any]:
+        """List available tasks with descriptions."""
+        from .fixtures import TASKS
+        return {
+            "tasks": [
+                {
+                    "name": t["name"],
+                    "description": t["description"],
+                    "max_steps": t["max_steps"],
+                    "num_endpoints": len(t["broken_endpoints"]),
+                }
+                for t in TASKS.values()
+            ]
+        }
+    # ------------------------------------------------------------------
+    # 2. OpenEnv framework routes — registered LAST (PRODUCTION mode)
+    #    Adds /health, /schema, /metadata, /ws ONLY.
+    #    Does NOT override our /reset, /step, /state.
+    # ------------------------------------------------------------------
+    _server = HTTPEnvServer(
+        env=_get_env,
+        action_cls=DebugAction,
+        observation_cls=DebugObservation,
+    )
+    _server.register_routes(app, mode=ServerMode.PRODUCTION)
+    return app
+app = create_app()
+def main() -> None:
+    import uvicorn
+    port = int(os.environ.get("PORT", 7860))
+    uvicorn.run(
+        "server.app:app",
+        host="0.0.0.0",
+        port=port,
+        reload=False,
+    )
+if __name__ == "__main__":
+    main()
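
As a rough illustration of what a client sends to the `/reset` and `/step` routes defined above, the sketch below builds both request bodies as plain dictionaries. The field names come from `ResetBody`, `StepBody`, and the action schema in `inference.py`; the helper names `build_reset_body` and `build_step_body` are hypothetical and not part of this commit.

```python
import json
from typing import Any, Dict, Optional


def build_reset_body(task_name: Optional[str] = None) -> Dict[str, Any]:
    """Body for POST /reset; an empty body keeps the currently active task."""
    body: Dict[str, Any] = {}
    if task_name is not None:
        body["task_name"] = task_name
    return body


def build_step_body(
    kind: str,
    endpoint_index: int,
    location: str,
    field_name: Optional[str] = None,
    new_value: Any = None,
) -> Dict[str, Any]:
    """Body for POST /step; the action payload is nested under the "action" key."""
    return {
        "action": {
            "kind": kind,
            "endpoint_index": endpoint_index,
            "location": location,
            "field_name": field_name,
            "new_value": new_value,
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_reset_body("medium")))
    # Medium task, violation 3: DELETE /orders/{id} should return 204.
    print(json.dumps(build_step_body("change_status", 2, "status_code", None, 204), indent=2))
```

Posting an action without the outer `"action"` wrapper would fail `StepBody` validation, since that field is declared as required.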

server/environment.py ADDED

	@@ -0,0 +1,291 @@

+"""
+API Contract Debugger — OpenEnv Environment
+An AI agent receives a broken OpenAPI-style spec and must fix all contract
+violations by proposing targeted field-level corrections step-by-step.
+"""
+from __future__ import annotations
+import copy
+import uuid
+from typing import Any, Dict, List, Optional
+from openenv.core.env_server.interfaces import Environment
+from .fixtures import TASKS
+from .graders import detect_violations, grade_episode, step_reward
+from .models import (
+    ActionKind,
+    DebugAction,
+    DebugObservation,
+    DebugState,
+)
+class APIContractDebuggerEnv(Environment[DebugAction, DebugObservation, DebugState]):
+    """
+    Environment where an agent debugs broken API contract specifications.
+    Tasks (difficulty):
+        easy   — 1 endpoint, 1 missing field
+        medium — 3 endpoints, 3 violations (type errors + wrong status)
+        hard   — 4 endpoints, 6 violations (missing fields, wrong types,
+                 wrong status, forbidden extra field)
+    Action space:
+        DebugAction with kind in {add_field, remove_field, change_type,
+                                  change_status, no_op}
+    Observation space:
+        DebugObservation — current endpoints + violation list + reward signals
+    Reward:
+        Dense per-step: +0.2×severity per violation fixed, -0.15×severity per
+        violation introduced, -0.05 for malformed action.
+        Episode terminates when all violations are resolved or max_steps reached.
+    """
+    SUPPORTS_CONCURRENT_SESSIONS: bool = False
+    def __init__(self, task_name: str = "easy") -> None:
+        super().__init__()
+        if task_name not in TASKS:
+            raise ValueError(
+                f"Unknown task '{task_name}'. Choose from: {list(TASKS.keys())}"
+            )
+        self._task_name = task_name
+        self._task_cfg = TASKS[task_name]
+        # Internal state (populated on reset)
+        self._current_endpoints: List[Dict[str, Any]] = []
+        self._golden_endpoints: List[Dict[str, Any]] = []
+        self._original_endpoints: List[Dict[str, Any]] = []
+        self._violations: List[Dict[str, Any]] = []
+        self._initial_violations: List[Dict[str, Any]] = []
+        self._step_count: int = 0
+        self._episode_id: Optional[str] = None
+        self._done: bool = False
+    # ------------------------------------------------------------------
+    # OpenEnv API
+    # ------------------------------------------------------------------
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        task_name: Optional[str] = None,
+        **kwargs: Any,
+    ) -> DebugObservation:
+        """Reset the environment and return the initial observation."""
+        if task_name and task_name in TASKS:
+            self._task_name = task_name
+            self._task_cfg = TASKS[task_name]
+        self._episode_id = episode_id or str(uuid.uuid4())
+        self._step_count = 0
+        self._done = False
+        # Deep-copy fixtures so mutations don't bleed across episodes
+        self._current_endpoints = copy.deepcopy(self._task_cfg["broken_endpoints"])
+        self._golden_endpoints = copy.deepcopy(self._task_cfg["golden_endpoints"])
+        self._original_endpoints = copy.deepcopy(self._task_cfg["broken_endpoints"])
+        self._violations = detect_violations(
+            self._current_endpoints, self._golden_endpoints
+        )
+        self._initial_violations = copy.deepcopy(self._violations)
+        return self._make_observation(
+            reward=0.0,
+            done=False,
+            fixed_this_step=0,
+            introduced_this_step=0,
+            action_error=None,
+        )
+    def step(
+        self,
+        action: DebugAction,
+        timeout_s: Optional[float] = None,
+        **kwargs: Any,
+    ) -> DebugObservation:
+        """Apply one fix action and return the updated observation."""
+        if self._done:
+            return self._make_observation(
+                reward=0.0,
+                done=True,
+                fixed_this_step=0,
+                introduced_this_step=0,
+                action_error="Episode is already done. Call reset().",
+            )
+        self._step_count += 1
+        prev_violations = copy.deepcopy(self._violations)
+        action_error: Optional[str] = None
+        # --- Apply the action ---
+        if action.kind == ActionKind.NO_OP:
+            pass  # agent explicitly passes; nothing changes, so the step earns no reward
+        else:
+            action_error = self._apply_action(action)
+        # --- Recompute violations ---
+        self._violations = detect_violations(
+            self._current_endpoints, self._golden_endpoints
+        )
+        # --- Compute reward ---
+        reward = step_reward(
+            prev_violations=prev_violations,
+            new_violations=self._violations,
+            initial_violations=self._initial_violations,
+            action_error=(action_error is not None),
+        )
+        fixed_this_step = sum(
+            1 for v in prev_violations
+            if v not in self._violations
+        )
+        introduced_this_step = sum(
+            1 for v in self._violations
+            if v not in prev_violations
+        )
+        # --- Termination ---
+        max_steps = self._task_cfg["max_steps"]
+        all_fixed = len(self._violations) == 0
+        out_of_steps = self._step_count >= max_steps
+        self._done = all_fixed or out_of_steps
+        # Bonus reward for solving all violations
+        if all_fixed:
+            reward += 0.5
+        return self._make_observation(
+            reward=reward,
+            done=self._done,
+            fixed_this_step=fixed_this_step,
+            introduced_this_step=introduced_this_step,
+            action_error=action_error,
+        )
+    @property
+    def state(self) -> DebugState:
+        """Return the full internal environment state."""
+        return DebugState(
+            episode_id=self._episode_id,
+            step_count=self._step_count,
+            task_name=self._task_name,
+            original_endpoints=self._original_endpoints,
+            current_endpoints=self._current_endpoints,
+            golden_endpoints=self._golden_endpoints,
+            violations=self._violations,
+            total_violations_at_start=len(self._initial_violations),
+            max_steps=self._task_cfg["max_steps"],
+        )
+    def get_metadata(self):
+        from openenv.core.env_server.types import EnvironmentMetadata
+        return EnvironmentMetadata(
+            name="APIContractDebugger",
+            description=(
+                "An environment where an AI agent debugs broken OpenAPI-style "
+                "contract specifications by proposing targeted field-level fixes."
+            ),
+            version="1.0.0",
+        )
+    # ------------------------------------------------------------------
+    # Internal helpers
+    # ------------------------------------------------------------------
+    def _apply_action(self, action: DebugAction) -> Optional[str]:
+        """
+        Mutate self._current_endpoints according to the action.
+        Returns an error string if the action is invalid, else None.
+        """
+        idx = action.endpoint_index
+        if idx < 0 or idx >= len(self._current_endpoints):
+            return (
+                f"endpoint_index {idx} is out of range "
+                f"(0–{len(self._current_endpoints) - 1})"
+            )
+        endpoint = self._current_endpoints[idx]
+        if action.kind == ActionKind.CHANGE_STATUS:
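+            # bool is a subclass of int, so reject True/False explicitly
+            if isinstance(action.new_value, bool) or not isinstance(action.new_value, int):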
+                return "CHANGE_STATUS requires new_value to be an integer HTTP status code"
+            endpoint["status_code"] = action.new_value
+            return None
+        # For field-level actions, validate location
+        if action.location not in ("request_body", "response_body"):
+            return (
+                f"location must be 'request_body' or 'response_body', "
+                f"got '{action.location}'"
+            )
+        body: Dict[str, Any] = endpoint.setdefault(action.location, {})
+        field = action.field_name
+        if action.kind == ActionKind.ADD_FIELD:
+            if not field:
+                return "ADD_FIELD requires a non-empty field_name"
+            if not isinstance(action.new_value, dict) or "type" not in action.new_value:
+                return "ADD_FIELD requires new_value to be a dict with a 'type' key"
+            body[field] = action.new_value
+            return None
+        if action.kind == ActionKind.REMOVE_FIELD:
+            if not field:
+                return "REMOVE_FIELD requires a non-empty field_name"
+            if field not in body:
+                return f"field '{field}' does not exist in {action.location}"
+            del body[field]
+            return None
+        if action.kind == ActionKind.CHANGE_TYPE:
+            if not field:
+                return "CHANGE_TYPE requires a non-empty field_name"
+            if field not in body:
+                return f"field '{field}' does not exist in {action.location}"
+            if not isinstance(action.new_value, str):
+                return "CHANGE_TYPE requires new_value to be a type string"
+            body[field]["type"] = action.new_value
+            return None
+        return f"Unknown action kind: {action.kind}"
+    def _make_observation(
+        self,
+        reward: float,
+        done: bool,
+        fixed_this_step: int,
+        introduced_this_step: int,
+        action_error: Optional[str],
+    ) -> DebugObservation:
+        return DebugObservation(
+            task_name=self._task_name,
+            task_description=self._task_cfg["description"],
+            endpoints=copy.deepcopy(self._current_endpoints),
+            violations=copy.deepcopy(self._violations),
+            violations_fixed_this_step=fixed_this_step,
+            violations_introduced_this_step=introduced_this_step,
+            total_violations_at_start=len(self._initial_violations),
+            step_count=self._step_count,
+            max_steps=self._task_cfg["max_steps"],
+            last_action_error=action_error,
+            reward=reward,
+            done=done,
+        )
+    def score(self) -> float:
+        """Final episode score in [0.0, 1.0]. Call after episode ends."""
+        return grade_episode(
+            self._current_endpoints,
+            self._golden_endpoints,
+            self._initial_violations,
+        )
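
The per-step bookkeeping in `step()` compares violation dicts by equality. The sketch below isolates that counting and the termination check so the behaviour is easy to try on plain data. The violation keys shown (`endpoint_index`, `location`, `field_name`) follow the `detect_violations` docstring; any further keys are unknown here, and the names `diff_violations` and `is_done` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Violation = Dict[str, Any]


def diff_violations(prev: List[Violation], new: List[Violation]) -> Tuple[int, int]:
    """Count violations that disappeared and appeared, using dict equality."""
    fixed = sum(1 for v in prev if v not in new)
    introduced = sum(1 for v in new if v not in prev)
    return fixed, introduced


def is_done(remaining: List[Violation], step_count: int, max_steps: int) -> bool:
    """An episode ends when nothing is left to fix or the step budget is spent."""
    return len(remaining) == 0 or step_count >= max_steps


if __name__ == "__main__":
    before = [{"endpoint_index": 0, "location": "response_body", "field_name": "created_at"}]
    after: List[Violation] = []
    print(diff_violations(before, after), is_done(after, step_count=1, max_steps=5))
```

One consequence of equality-based counting: if a violation dict also records the current wrong value, changing a field from one wrong type to another would count as one violation fixed and one introduced.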

server/fixtures.py ADDED

	@@ -0,0 +1,241 @@

+"""
+Task fixtures for the API Contract Debugger environment.
+Each task is a dict with:
+  - name: str
+  - description: str
+  - broken_endpoints: list[dict]   — what the agent starts with
+  - golden_endpoints: list[dict]   — the correct spec the grader checks against
+  - max_steps: int
+Endpoint schema:
+  {
+    "method": str,
+    "path": str,
+    "status_code": int,
+    "request_body": {
+        "<field>": {"type": str, "required": bool, "description": str}
+    },
+    "response_body": {
+        "<field>": {"type": str, "required": bool, "description": str}
+    }
+  }
+"""
+from __future__ import annotations
+import copy
+from typing import Any, Dict, List
+# ---------------------------------------------------------------------------
+# Task 1 — EASY
+# Single endpoint. One missing required field in the response.
+# ---------------------------------------------------------------------------
+_TASK1_GOLDEN: List[Dict[str, Any]] = [
+    {
+        "method": "POST",
+        "path": "/users/register",
+        "status_code": 201,
+        "request_body": {
+            "username": {"type": "string", "required": True,  "description": "Desired username"},
+            "email":    {"type": "string", "required": True,  "description": "User email address"},
+            "password": {"type": "string", "required": True,  "description": "Plaintext password"},
+        },
+        "response_body": {
+            "user_id":    {"type": "integer", "required": True,  "description": "Created user ID"},
+            "username":   {"type": "string",  "required": True,  "description": "Confirmed username"},
+            "created_at": {"type": "string",  "required": True,  "description": "ISO-8601 timestamp"},
+        },
+    }
+]
+# Break it: remove "created_at" from response
+_TASK1_BROKEN: List[Dict[str, Any]] = copy.deepcopy(_TASK1_GOLDEN)
+del _TASK1_BROKEN[0]["response_body"]["created_at"]
+TASK_EASY: Dict[str, Any] = {
+    "name": "easy",
+    "description": (
+        "A user registration endpoint is missing a required field in its response. "
+        "The response should include user_id (integer), username (string), and "
+        "created_at (string). Find and add the missing field."
+    ),
+    "broken_endpoints": _TASK1_BROKEN,
+    "golden_endpoints": _TASK1_GOLDEN,
+    "max_steps": 5,
+}
+# ---------------------------------------------------------------------------
+# Task 2 — MEDIUM
+# Three endpoints. Type mismatches and a wrong status code.
+# ---------------------------------------------------------------------------
+_TASK2_GOLDEN: List[Dict[str, Any]] = [
+    {
+        "method": "GET",
+        "path": "/products/{id}",
+        "status_code": 200,
+        "request_body": {},
+        "response_body": {
+            "product_id": {"type": "integer", "required": True,  "description": "Product ID"},
+            "name":        {"type": "string",  "required": True,  "description": "Product name"},
+            "price":       {"type": "number",  "required": True,  "description": "Price in USD"},
+            "in_stock":    {"type": "boolean", "required": True,  "description": "Availability"},
+        },
+    },
+    {
+        "method": "POST",
+        "path": "/orders",
+        "status_code": 201,
+        "request_body": {
+            "product_id": {"type": "integer", "required": True,  "description": "Product to order"},
+            "quantity":   {"type": "integer", "required": True,  "description": "Number of units"},
+            "customer_id":{"type": "integer", "required": True,  "description": "Buyer ID"},
+        },
+        "response_body": {
+            "order_id":   {"type": "integer", "required": True,  "description": "Created order ID"},
+            "total_price":{"type": "number",  "required": True,  "description": "Total cost"},
+            "status":     {"type": "string",  "required": True,  "description": "Order status"},
+        },
+    },
+    {
+        "method": "DELETE",
+        "path": "/orders/{id}",
+        "status_code": 204,
+        "request_body": {},
+        "response_body": {},
+    },
+]
+# Break it:
+# 1. product_id type: integer → string   (GET /products/{id} response)
+# 2. quantity type:   integer → string   (POST /orders request)
+# 3. DELETE status_code: 204 → 200
+_TASK2_BROKEN: List[Dict[str, Any]] = copy.deepcopy(_TASK2_GOLDEN)
+_TASK2_BROKEN[0]["response_body"]["product_id"]["type"] = "string"   # violation 1
+_TASK2_BROKEN[1]["request_body"]["quantity"]["type"] = "string"       # violation 2
+_TASK2_BROKEN[2]["status_code"] = 200                                  # violation 3
+TASK_MEDIUM: Dict[str, Any] = {
+    "name": "medium",
+    "description": (
+        "An e-commerce API has three endpoints with contract violations: "
+        "(1) GET /products/{id} returns product_id as string instead of integer, "
+        "(2) POST /orders accepts quantity as string instead of integer, "
+        "(3) DELETE /orders/{id} returns status 200 instead of 204. "
+        "Fix all three violations."
+    ),
+    "broken_endpoints": _TASK2_BROKEN,
+    "golden_endpoints": _TASK2_GOLDEN,
+    "max_steps": 10,
+}
+# ---------------------------------------------------------------------------
+# Task 3 — HARD
+# Multi-endpoint API. Missing required fields, type errors, wrong status code,
+# AND a forbidden extra field that must be removed.
+# ---------------------------------------------------------------------------
+_TASK3_GOLDEN: List[Dict[str, Any]] = [
+    {
+        "method": "POST",
+        "path": "/auth/login",
+        "status_code": 200,
+        "request_body": {
+            "email":    {"type": "string", "required": True,  "description": "User email"},
+            "password": {"type": "string", "required": True,  "description": "User password"},
+        },
+        "response_body": {
+            "access_token":  {"type": "string",  "required": True,  "description": "JWT token"},
+            "refresh_token": {"type": "string",  "required": True,  "description": "Refresh token"},
+            "expires_in":    {"type": "integer", "required": True,  "description": "TTL in seconds"},
+        },
+    },
+    {
+        "method": "GET",
+        "path": "/users/{id}/profile",
+        "status_code": 200,
+        "request_body": {},
+        "response_body": {
+            "user_id":    {"type": "integer", "required": True,  "description": "User ID"},
+            "email":      {"type": "string",  "required": True,  "description": "User email"},
+            "full_name":  {"type": "string",  "required": True,  "description": "Display name"},
+            "role":       {"type": "string",  "required": True,  "description": "User role"},
+            "created_at": {"type": "string",  "required": True,  "description": "ISO-8601 timestamp"},
+        },
+    },
+    {
+        "method": "PATCH",
+        "path": "/users/{id}/profile",
+        "status_code": 200,
+        "request_body": {
+            "full_name": {"type": "string", "required": False, "description": "Updated name"},
+            "email":     {"type": "string", "required": False, "description": "Updated email"},
+        },
+        "response_body": {
+            "user_id":   {"type": "integer", "required": True, "description": "User ID"},
+            "full_name": {"type": "string",  "required": True, "description": "Updated name"},
+            "email":     {"type": "string",  "required": True, "description": "Updated email"},
+            "updated_at":{"type": "string",  "required": True, "description": "ISO-8601 timestamp"},
+        },
+    },
+    {
+        "method": "POST",
+        "path": "/auth/refresh",
+        "status_code": 200,
+        "request_body": {
+            "refresh_token": {"type": "string", "required": True, "description": "Refresh token"},
+        },
+        "response_body": {
+            "access_token": {"type": "string",  "required": True, "description": "New JWT token"},
+            "expires_in":   {"type": "integer", "required": True, "description": "TTL in seconds"},
+        },
+    },
+]
+_TASK3_BROKEN: List[Dict[str, Any]] = copy.deepcopy(_TASK3_GOLDEN)
+# Violation 1: missing refresh_token in /auth/login response
+del _TASK3_BROKEN[0]["response_body"]["refresh_token"]
+# Violation 2: expires_in type integer → string in /auth/login response
+_TASK3_BROKEN[0]["response_body"]["expires_in"]["type"] = "string"
+# Violation 3: missing created_at in /users/{id}/profile response
+del _TASK3_BROKEN[1]["response_body"]["created_at"]
+# Violation 4: extra forbidden field "password_hash" in /users/{id}/profile response
+_TASK3_BROKEN[1]["response_body"]["password_hash"] = {
+    "type": "string", "required": False, "description": "Hashed password — MUST NOT be exposed"
+}
+# Violation 5: PATCH /users/{id}/profile status_code 200 → 500 (regression)
+_TASK3_BROKEN[2]["status_code"] = 500
+# Violation 6: missing updated_at in PATCH response
+del _TASK3_BROKEN[2]["response_body"]["updated_at"]
+TASK_HARD: Dict[str, Any] = {
+    "name": "hard",
+    "description": (
+        "An authentication + profile API has 6 contract violations across 3 of its 4 endpoints: "
+        "(1) POST /auth/login is missing refresh_token in response, "
+        "(2) POST /auth/login returns expires_in as string instead of integer, "
+        "(3) GET /users/{id}/profile is missing created_at in response, "
+        "(4) GET /users/{id}/profile exposes a forbidden password_hash field that must be removed, "
+        "(5) PATCH /users/{id}/profile returns status 500 instead of 200, "
+        "(6) PATCH /users/{id}/profile is missing updated_at in response. "
+        "Fix all violations."
+    ),
+    "broken_endpoints": _TASK3_BROKEN,
+    "golden_endpoints": _TASK3_GOLDEN,
+    "max_steps": 15,
+}
+# ---------------------------------------------------------------------------
+# Registry
+# ---------------------------------------------------------------------------
+TASKS: Dict[str, Dict[str, Any]] = {
+    "easy":   TASK_EASY,
+    "medium": TASK_MEDIUM,
+    "hard":   TASK_HARD,
+}
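
The broken fixtures above are built by deep-copying a golden spec and then mutating the copy. The following is a minimal sketch of why the deep copy matters, using a hypothetical one-endpoint spec; the names `golden`, `broken_shallow`, and `broken_deep` are illustrative and do not appear in the repository.

```python
import copy

# Hypothetical golden spec in the same shape as the fixtures.
golden = [{
    "method": "GET",
    "path": "/items",
    "status_code": 200,
    "request_body": {},
    "response_body": {
        "id": {"type": "integer", "required": True, "description": "Item ID"},
        "name": {"type": "string", "required": True, "description": "Item name"},
    },
}]

# A shallow copy duplicates only the outer list; the endpoint dicts are shared,
# so mutating broken_shallow[0] would also corrupt the golden spec.
broken_shallow = list(golden)
shares_endpoint = broken_shallow[0] is golden[0]

# A deep copy duplicates every nested dict, so the mutations stay local.
broken_deep = copy.deepcopy(golden)
del broken_deep[0]["response_body"]["name"]               # missing field
broken_deep[0]["response_body"]["id"]["type"] = "string"  # wrong type
broken_deep[0]["status_code"] = 500                       # wrong status

golden_untouched = (
    "name" in golden[0]["response_body"]
    and golden[0]["response_body"]["id"]["type"] == "integer"
    and golden[0]["status_code"] == 200
)
print(shares_endpoint, golden_untouched)
```

With a shallow copy, each `del` or type change would leak into the golden spec and the grader would see no violations at all.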

server/graders.py ADDED

	@@ -0,0 +1,193 @@

+"""
+Violation detection and graders for the API Contract Debugger environment.
+detect_violations(current, golden) → list of violation dicts
+grade_episode(current, golden, initial_violations) → float in [0.0, 1.0]
+step_reward(prev, new, initial, action_error) → dense per-step reward
+"""
+from __future__ import annotations
+import copy
+from typing import Any, Dict, List
+# ---------------------------------------------------------------------------
+# Violation detection
+# ---------------------------------------------------------------------------
+def detect_violations(
+    current_endpoints: List[Dict[str, Any]],
+    golden_endpoints: List[Dict[str, Any]],
+) -> List[Dict[str, Any]]:
+    """
+    Compare current spec against the golden spec and return all violations.
+    Violation dict keys:
+        endpoint_index  int   — index into endpoint list
+        location        str   — "request_body" | "response_body" | "status_code"
+        field_name      str|None
+        violation_type  str   — "missing_field" | "extra_field" | "wrong_type" | "wrong_status"
+        description     str   — human-readable explanation
+        severity        float — weight used in scoring (0.0–1.0)
+    """
+    violations: List[Dict[str, Any]] = []
+    # zip() stops at the shorter list, so endpoints beyond it are not compared.
+    for idx, (cur, gold) in enumerate(zip(current_endpoints, golden_endpoints)):
+        # --- Status code ---
+        if cur.get("status_code") != gold.get("status_code"):
+            violations.append({
+                "endpoint_index": idx,
+                "location": "status_code",
+                "field_name": None,
+                "violation_type": "wrong_status",
+                "description": (
+                    f"{gold['method']} {gold['path']}: "
+                    f"status_code is {cur.get('status_code')} "
+                    f"but should be {gold.get('status_code')}"
+                ),
+                "severity": 0.8,
+            })
+        # --- Request body and response body ---
+        for location in ("request_body", "response_body"):
+            cur_body: Dict[str, Any] = cur.get(location, {})
+            gold_body: Dict[str, Any] = gold.get(location, {})
+            # Missing fields: every field in the golden spec must be present,
+            # whether or not its "required" flag is set
+            for field, spec in gold_body.items():
+                if field not in cur_body:
+                    violations.append({
+                        "endpoint_index": idx,
+                        "location": location,
+                        "field_name": field,
+                        "violation_type": "missing_field",
+                        "description": (
+                            f"{gold['method']} {gold['path']} {location}: "
+                            f"required field '{field}' ({spec['type']}) is missing"
+                        ),
+                        "severity": 1.0,
+                    })
+                else:
+                    # Wrong type
+                    cur_type = cur_body[field].get("type")
+                    gold_type = spec.get("type")
+                    if cur_type != gold_type:
+                        violations.append({
+                            "endpoint_index": idx,
+                            "location": location,
+                            "field_name": field,
+                            "violation_type": "wrong_type",
+                            "description": (
+                                f"{gold['method']} {gold['path']} {location}: "
+                                f"field '{field}' has type '{cur_type}' "
+                                f"but should be '{gold_type}'"
+                            ),
+                            "severity": 0.9,
+                        })
+            # Extra (forbidden) fields — fields in current but not in golden
+            for field in cur_body:
+                if field not in gold_body:
+                    violations.append({
+                        "endpoint_index": idx,
+                        "location": location,
+                        "field_name": field,
+                        "violation_type": "extra_field",
+                        "description": (
+                            f"{gold['method']} {gold['path']} {location}: "
+                            f"field '{field}' is present but should not be in the contract"
+                        ),
+                        "severity": 0.7,
+                    })
+    return violations
+# ---------------------------------------------------------------------------
+# Grader
+# ---------------------------------------------------------------------------
+def grade_episode(
+    current_endpoints: List[Dict[str, Any]],
+    golden_endpoints: List[Dict[str, Any]],
+    initial_violations: List[Dict[str, Any]],
+) -> float:
+    """
+    Score the agent's performance at the END of an episode.
+    Returns a float in [0.0, 1.0]:
+        1.0  — all violations fixed, no new ones introduced
+        0.0  — no improvement at all
+        intermediate — partial credit weighted by severity
+    Formula:
+        score = (weighted_fixed - weighted_introduced) / total_initial_weight
+        clamped to [0.0, 1.0]
+    """
+    remaining = detect_violations(current_endpoints, golden_endpoints)
+    remaining_keys = _violation_keys(remaining)
+    initial_keys = _violation_keys(initial_violations)
+    # Violations that were present at start and are now gone = fixed
+    fixed = [v for v in initial_violations if _vkey(v) not in remaining_keys]
+    # Violations that are present now but weren't at start = newly introduced
+    introduced = [v for v in remaining if _vkey(v) not in initial_keys]
+    total_initial_weight = sum(v["severity"] for v in initial_violations)
+    if total_initial_weight == 0:
+        return 1.0  # spec was already clean
+    weighted_fixed = sum(v["severity"] for v in fixed)
+    weighted_introduced = sum(v["severity"] for v in introduced)
+    raw = (weighted_fixed - weighted_introduced) / total_initial_weight
+    return float(max(0.0, min(1.0, raw)))
+def step_reward(
+    prev_violations: List[Dict[str, Any]],
+    new_violations: List[Dict[str, Any]],
+    initial_violations: List[Dict[str, Any]],
+    action_error: bool,
+) -> float:
+    """
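+    Dense per-step reward signal.
+    +0.2  per violation resolved this step (weighted by severity)
+    -0.15 per new violation introduced (weighted by severity)
+    -0.05 flat penalty for a malformed action (out-of-range index, bad field, etc.);
+          returned on its own, without counting fixes or introductions
+    initial_violations is currently unused.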
+    """
+    if action_error:
+        return -0.05
+    prev_keys = _violation_keys(prev_violations)
+    new_keys = _violation_keys(new_violations)
+    fixed_this_step = [v for v in prev_violations if _vkey(v) not in new_keys]
+    introduced_this_step = [v for v in new_violations if _vkey(v) not in prev_keys]
+    reward = 0.0
+    for v in fixed_this_step:
+        reward += 0.2 * v["severity"]
+    for v in introduced_this_step:
+        reward -= 0.15 * v["severity"]
+    return round(reward, 4)
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+def _vkey(v: Dict[str, Any]) -> tuple:
+    return (
+        v["endpoint_index"],
+        v["location"],
+        v.get("field_name"),
+        v["violation_type"],
+    )
+def _violation_keys(violations: List[Dict[str, Any]]) -> set:
+    return {_vkey(v) for v in violations}
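
To make the scoring arithmetic concrete, here is a worked sketch of the two formulas above. It assumes violations are reduced to a mapping from their `_vkey` tuple to severity; `severity_score` and `dense_reward` are hypothetical standalone helpers written for illustration, not functions from `server/graders.py`.

```python
from typing import Dict, Optional, Tuple

# (endpoint_index, location, field_name, violation_type), mirroring _vkey.
Key = Tuple[int, str, Optional[str], str]


def severity_score(initial: Dict[Key, float], remaining: Dict[Key, float]) -> float:
    """Hypothetical standalone version of the grade_episode formula."""
    total = sum(initial.values())
    if total == 0:
        return 1.0  # spec was already clean
    fixed = sum(w for k, w in initial.items() if k not in remaining)
    introduced = sum(w for k, w in remaining.items() if k not in initial)
    return max(0.0, min(1.0, (fixed - introduced) / total))


def dense_reward(prev: Dict[Key, float], new: Dict[Key, float]) -> float:
    """Hypothetical standalone version of step_reward for a well-formed action."""
    fixed = sum(w for k, w in prev.items() if k not in new)
    introduced = sum(w for k, w in new.items() if k not in prev)
    return round(0.2 * fixed - 0.15 * introduced, 4)


# The six violations seeded into the "hard" task, with the severities
# that detect_violations assigns to each violation type.
hard_initial: Dict[Key, float] = {
    (0, "response_body", "refresh_token", "missing_field"): 1.0,
    (0, "response_body", "expires_in", "wrong_type"): 0.9,
    (1, "response_body", "created_at", "missing_field"): 1.0,
    (1, "response_body", "password_hash", "extra_field"): 0.7,
    (2, "status_code", None, "wrong_status"): 0.8,
    (2, "response_body", "updated_at", "missing_field"): 1.0,
}

# Suppose the agent only removed password_hash and fixed the status code.
hard_remaining = {
    k: w for k, w in hard_initial.items()
    if k[3] not in ("extra_field", "wrong_status")
}
partial = severity_score(hard_initial, hard_remaining)  # 1.5 / 5.4

# Adding a missing field with the wrong type swaps one violation for another:
# the key changes, so it counts as one fix plus one introduction.
before = {(0, "response_body", "created_at", "missing_field"): 1.0}
after = {(0, "response_body", "created_at", "wrong_type"): 0.9}
swapped_score = severity_score(before, after)  # (1.0 - 0.9) / 1.0
swapped_reward = dense_reward(before, after)   # 0.2 * 1.0 - 0.15 * 0.9
print(round(partial, 4), round(swapped_score, 4), swapped_reward)
```

Because the violation type is part of the key, a half-correct fix still earns a small positive reward while leaving most of the episode score unclaimed.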

server/models.py ADDED

	@@ -0,0 +1,181 @@

+"""
+Typed Pydantic models for the API Contract Debugger environment.
+The environment gives an agent a broken OpenAPI-style spec and asks it to
+fix contract violations by proposing targeted field-level corrections.
+"""
+from __future__ import annotations
+from enum import Enum
+from typing import Any, Dict, List, Optional
+from openenv.core.env_server.types import Action, Observation, State
+from pydantic import Field
+# ---------------------------------------------------------------------------
+# Domain types
+# ---------------------------------------------------------------------------
+class FieldType(str, Enum):
+    """Supported JSON Schema primitive types."""
+    STRING = "string"
+    INTEGER = "integer"
+    NUMBER = "number"
+    BOOLEAN = "boolean"
+    ARRAY = "array"
+    OBJECT = "object"
+    NULL = "null"
+class HttpMethod(str, Enum):
+    GET = "GET"
+    POST = "POST"
+    PUT = "PUT"
+    PATCH = "PATCH"
+    DELETE = "DELETE"
+class ActionKind(str, Enum):
+    """What kind of fix the agent is proposing."""
+    ADD_FIELD = "add_field"          # Add a missing required field
+    REMOVE_FIELD = "remove_field"    # Remove a forbidden/extra field
+    CHANGE_TYPE = "change_type"      # Fix a field's type
+    CHANGE_STATUS = "change_status"  # Fix an HTTP status code
+    NO_OP = "no_op"                  # Agent explicitly passes this step
+# ---------------------------------------------------------------------------
+# API Spec domain models (not OpenEnv base classes)
+# ---------------------------------------------------------------------------
+class FieldSpec(dict):
+    """A JSON Schema-like field definition. Stored as plain dict for flexibility."""
+    pass
+class EndpointSpec(dict):
+    """A single endpoint definition: method, path, status_code, request_body, response_body."""
+    pass
+# ---------------------------------------------------------------------------
+# OpenEnv Action
+# ---------------------------------------------------------------------------
+class DebugAction(Action):
+    """
+    A single fix proposed by the agent.
+    The agent targets one endpoint + one field and proposes exactly one change.
+    """
+    kind: ActionKind = Field(
+        ...,
+        description="The type of fix being applied",
+    )
+    endpoint_index: int = Field(
+        ...,
+        ge=0,
+        description="0-based index into the endpoint list",
+    )
+    location: str = Field(
+        ...,
+        description=(
+            "Where in the endpoint to apply the fix. "
+            "One of: 'request_body', 'response_body', 'status_code'"
+        ),
+    )
+    field_name: Optional[str] = Field(
+        default=None,
+        description="Field name to add/remove/change (null for status_code fixes)",
+    )
+    new_value: Optional[Any] = Field(
+        default=None,
+        description=(
+            "The corrected value. "
+            "For CHANGE_TYPE: a FieldType string. "
+            "For ADD_FIELD: a dict with 'type' (and optional 'description'). "
+            "For CHANGE_STATUS: an integer HTTP status code. "
+            "For REMOVE_FIELD / NO_OP: null."
+        ),
+    )
+# ---------------------------------------------------------------------------
+# OpenEnv Observation
+# ---------------------------------------------------------------------------
+class Violation(dict):
+    """
+    Describes a single detected contract violation.
+    Keys: endpoint_index, location, field_name, violation_type, description, severity
+    """
+    pass
+class DebugObservation(Observation):
+    """
+    What the agent sees after each reset() / step().
+    """
+    task_name: str = Field(
+        ...,
+        description="Which task is currently active (easy / medium / hard)",
+    )
+    task_description: str = Field(
+        ...,
+        description="Human-readable description of the task objective",
+    )
+    endpoints: List[Dict[str, Any]] = Field(
+        ...,
+        description="Current (potentially partially-fixed) endpoint specs",
+    )
+    violations: List[Dict[str, Any]] = Field(
+        default_factory=list,
+        description="List of detected violations still present in the spec",
+    )
+    violations_fixed_this_step: int = Field(
+        default=0,
+        description="How many violations the last action resolved",
+    )
+    violations_introduced_this_step: int = Field(
+        default=0,
+        description="How many new violations the last action introduced",
+    )
+    total_violations_at_start: int = Field(
+        ...,
+        description="Number of violations at episode start (for progress tracking)",
+    )
+    step_count: int = Field(
+        default=0,
+        description="Steps taken so far in this episode",
+    )
+    max_steps: int = Field(
+        default=10,
+        description="Maximum steps allowed per episode",
+    )
+    last_action_error: Optional[str] = Field(
+        default=None,
+        description="Error message if the last action was malformed / out-of-range",
+    )
+# ---------------------------------------------------------------------------
+# OpenEnv State
+# ---------------------------------------------------------------------------
+class DebugState(State):
+    """
+    Full internal state of the environment (not exposed to the agent by default).
+    """
+    task_name: str = Field(default="")
+    original_endpoints: List[Dict[str, Any]] = Field(default_factory=list)
+    current_endpoints: List[Dict[str, Any]] = Field(default_factory=list)
+    golden_endpoints: List[Dict[str, Any]] = Field(default_factory=list)
+    violations: List[Dict[str, Any]] = Field(default_factory=list)
+    total_violations_at_start: int = Field(default=0)
+    max_steps: int = Field(default=10)
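
The `new_value` rules in `DebugAction` can be illustrated with a small dispatcher over plain dicts. This is a sketch under stated assumptions: `apply_fix` is a hypothetical helper written for this example, it takes the action fields as plain arguments instead of a `DebugAction`, and it is not the logic in `server/environment.py`.

```python
from typing import Any, Dict, List, Optional

LOCATIONS = ("request_body", "response_body", "status_code")


def apply_fix(
    endpoints: List[Dict[str, Any]],
    kind: str,
    endpoint_index: int,
    location: str,
    field_name: Optional[str] = None,
    new_value: Any = None,
) -> Optional[str]:
    """Mutate `endpoints` in place; return an error message, or None on success."""
    if not 0 <= endpoint_index < len(endpoints):
        return f"endpoint_index {endpoint_index} out of range"
    if location not in LOCATIONS:
        return f"unknown location '{location}'"
    endpoint = endpoints[endpoint_index]

    if kind == "no_op":
        return None
    if kind == "change_status":
        if location != "status_code" or not isinstance(new_value, int):
            return "change_status needs location='status_code' and an integer new_value"
        endpoint["status_code"] = new_value
        return None

    # The remaining kinds all operate on a named field inside a body.
    if location == "status_code" or field_name is None:
        return f"{kind} needs a body location and a field_name"
    body = endpoint.setdefault(location, {})
    if kind == "add_field":
        if not isinstance(new_value, dict) or "type" not in new_value:
            return "add_field needs a dict new_value with a 'type'"
        body[field_name] = dict(new_value)
    elif kind == "remove_field":
        if field_name not in body:
            return f"field '{field_name}' not found"
        del body[field_name]
    elif kind == "change_type":
        if field_name not in body:
            return f"field '{field_name}' not found"
        body[field_name]["type"] = new_value
    else:
        return f"unknown kind '{kind}'"
    return None


# One action of each kind against a hypothetical broken endpoint.
spec = [{"method": "GET", "path": "/x", "status_code": 500,
         "request_body": {}, "response_body": {"secret": {"type": "string"}}}]
errors = [
    apply_fix(spec, "change_status", 0, "status_code", new_value=200),
    apply_fix(spec, "remove_field", 0, "response_body", "secret"),
    apply_fix(spec, "add_field", 0, "response_body", "id", {"type": "string"}),
    apply_fix(spec, "change_type", 0, "response_body", "id", "integer"),
]
out_of_range = apply_fix(spec, "no_op", 99, "response_body")
print(errors, out_of_range)
```

Returning an error string instead of raising matches the shape of `last_action_error` on the observation, which lets a malformed action be penalized without ending the episode.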

tests/__pycache__/test_env.cpython-314-pytest-9.0.2.pyc ADDED

Binary file (90.1 kB).

tests/test_env.py ADDED

	@@ -0,0 +1,565 @@

+"""
+Test suite for the API Contract Debugger environment.
+Coverage:
+  - Violation detection (all violation types)
+  - Grader scoring
+  - Per-step reward shaping
+  - Environment reset / step / state
+  - All three tasks end-to-end
+  - Edge cases: malformed actions, double-fix, already-clean spec
+  - HTTP API routes (via TestClient)
+"""
+from __future__ import annotations
+import copy
+import sys
+import os
+import pytest
+# Make sure the project root is on the path
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+from server.fixtures import TASK_EASY, TASK_HARD, TASK_MEDIUM, TASKS
+from server.graders import detect_violations, grade_episode, step_reward
+from server.models import ActionKind, DebugAction
+from server.environment import APIContractDebuggerEnv
+# ===========================================================================
+# Helpers
+# ===========================================================================
+def make_env(task: str = "easy") -> APIContractDebuggerEnv:
+    env = APIContractDebuggerEnv(task_name=task)
+    env.reset()
+    return env
+def action(**kwargs) -> DebugAction:
+    defaults = dict(
+        kind=ActionKind.NO_OP,
+        endpoint_index=0,
+        location="response_body",
+        field_name=None,
+        new_value=None,
+    )
+    defaults.update(kwargs)
+    return DebugAction(**defaults)
+# ===========================================================================
+# 1. Fixture sanity
+# ===========================================================================
+class TestFixtures:
+    def test_all_tasks_present(self):
+        assert set(TASKS.keys()) == {"easy", "medium", "hard"}
+    def test_easy_has_violations(self):
+        v = detect_violations(TASK_EASY["broken_endpoints"], TASK_EASY["golden_endpoints"])
+        assert len(v) == 1
+    def test_medium_has_three_violations(self):
+        v = detect_violations(TASK_MEDIUM["broken_endpoints"], TASK_MEDIUM["golden_endpoints"])
+        assert len(v) == 3
+    def test_hard_has_six_violations(self):
+        v = detect_violations(TASK_HARD["broken_endpoints"], TASK_HARD["golden_endpoints"])
+        assert len(v) == 6
+    def test_golden_specs_are_clean(self):
+        for task in TASKS.values():
+            v = detect_violations(task["golden_endpoints"], task["golden_endpoints"])
+            assert v == [], f"Golden spec for '{task['name']}' has violations: {v}"
+    def test_broken_and_golden_same_length(self):
+        for task in TASKS.values():
+            assert len(task["broken_endpoints"]) == len(task["golden_endpoints"])
+# ===========================================================================
+# 2. Violation detection
+# ===========================================================================
+class TestViolationDetection:
+    def test_missing_field_detected(self):
+        current = [{"method": "GET", "path": "/x", "status_code": 200,
+                     "request_body": {}, "response_body": {}}]
+        golden  = [{"method": "GET", "path": "/x", "status_code": 200,
+                     "request_body": {}, "response_body": {
+                         "id": {"type": "integer", "required": True, "description": ""}
+                     }}]
+        v = detect_violations(current, golden)
+        assert len(v) == 1
+        assert v[0]["violation_type"] == "missing_field"
+        assert v[0]["field_name"] == "id"
+    def test_extra_field_detected(self):
+        current = [{"method": "GET", "path": "/x", "status_code": 200,
+                     "request_body": {}, "response_body": {
+                         "secret": {"type": "string", "required": False, "description": ""}
+                     }}]
+        golden  = [{"method": "GET", "path": "/x", "status_code": 200,
+                     "request_body": {}, "response_body": {}}]
+        v = detect_violations(current, golden)
+        assert len(v) == 1
+        assert v[0]["violation_type"] == "extra_field"
+    def test_wrong_type_detected(self):
+        current = [{"method": "GET", "path": "/x", "status_code": 200,
+                     "request_body": {}, "response_body": {
+                         "count": {"type": "string", "required": True, "description": ""}
+                     }}]
+        golden  = [{"method": "GET", "path": "/x", "status_code": 200,
+                     "request_body": {}, "response_body": {
+                         "count": {"type": "integer", "required": True, "description": ""}
+                     }}]
+        v = detect_violations(current, golden)
+        assert len(v) == 1
+        assert v[0]["violation_type"] == "wrong_type"
+    def test_wrong_status_detected(self):
+        current = [{"method": "DELETE", "path": "/x", "status_code": 200,
+                     "request_body": {}, "response_body": {}}]
+        golden  = [{"method": "DELETE", "path": "/x", "status_code": 204,
+                     "request_body": {}, "response_body": {}}]
+        v = detect_violations(current, golden)
+        assert len(v) == 1
+        assert v[0]["violation_type"] == "wrong_status"
+    def test_no_violations_on_matching_spec(self):
+        golden = TASK_EASY["golden_endpoints"]
+        v = detect_violations(golden, golden)
+        assert v == []
+    def test_violation_severity_range(self):
+        v = detect_violations(TASK_HARD["broken_endpoints"], TASK_HARD["golden_endpoints"])
+        for viol in v:
+            assert 0.0 < viol["severity"] <= 1.0
+# ===========================================================================
+# 3. Grader scoring
+# ===========================================================================
+class TestGrader:
+    def test_perfect_score_when_all_fixed(self):
+        golden = TASK_EASY["golden_endpoints"]
+        initial = detect_violations(TASK_EASY["broken_endpoints"], golden)
+        score = grade_episode(golden, golden, initial)
+        assert score == pytest.approx(1.0)
+    def test_zero_score_when_nothing_fixed(self):
+        broken = TASK_EASY["broken_endpoints"]
+        golden = TASK_EASY["golden_endpoints"]
+        initial = detect_violations(broken, golden)
+        score = grade_episode(broken, golden, initial)
+        assert score == pytest.approx(0.0)
+    def test_partial_score_medium(self):
+        broken = copy.deepcopy(TASK_MEDIUM["broken_endpoints"])
+        golden = TASK_MEDIUM["golden_endpoints"]
+        initial = detect_violations(broken, golden)
+        # Fix only violation 1: product_id type
+        broken[0]["response_body"]["product_id"]["type"] = "integer"
+        score = grade_episode(broken, golden, initial)
+        assert 0.0 < score < 1.0
+    def test_score_clamped_to_zero_when_extra_violations_introduced(self):
+        broken = copy.deepcopy(TASK_EASY["broken_endpoints"])
+        golden = TASK_EASY["golden_endpoints"]
+        initial = detect_violations(broken, golden)
+        # Introduce more violations
+        broken[0]["response_body"]["user_id"]["type"] = "string"
+        broken[0]["response_body"]["username"]["type"] = "boolean"
+        score = grade_episode(broken, golden, initial)
+        assert score == 0.0
+    def test_score_in_range(self):
+        for task in TASKS.values():
+            broken = task["broken_endpoints"]
+            golden = task["golden_endpoints"]
+            initial = detect_violations(broken, golden)
+            score = grade_episode(broken, golden, initial)
+            assert 0.0 <= score <= 1.0, f"Out-of-range score for task '{task['name']}'"
+    def test_already_clean_spec_scores_one(self):
+        golden = TASK_EASY["golden_endpoints"]
+        initial: list = []  # no violations at start
+        score = grade_episode(golden, golden, initial)
+        assert score == pytest.approx(1.0)
+# ===========================================================================
+# 4. Step reward
+# ===========================================================================
+class TestStepReward:
+    def _make_violation(self, vtype="missing_field", severity=1.0):
+        return {
+            "endpoint_index": 0, "location": "response_body",
+            "field_name": "foo", "violation_type": vtype,
+            "description": "test", "severity": severity,
+        }
+    def test_positive_reward_for_fix(self):
+        v = self._make_violation()
+        r = step_reward(prev_violations=[v], new_violations=[], initial_violations=[v], action_error=False)
+        assert r > 0
+    def test_negative_reward_for_introduction(self):
+        v = self._make_violation()
+        r = step_reward(prev_violations=[], new_violations=[v], initial_violations=[], action_error=False)
+        assert r < 0
+    def test_penalty_for_action_error(self):
+        r = step_reward(prev_violations=[], new_violations=[], initial_violations=[], action_error=True)
+        assert r == pytest.approx(-0.05)
+    def test_zero_reward_for_no_op(self):
+        r = step_reward(prev_violations=[], new_violations=[], initial_violations=[], action_error=False)
+        assert r == pytest.approx(0.0)
+# ===========================================================================
+# 5. Environment — reset
+# ===========================================================================
+class TestEnvReset:
+    def test_reset_returns_observation(self):
+        env = APIContractDebuggerEnv(task_name="easy")
+        obs = env.reset()
+        assert obs.task_name == "easy"
+        assert len(obs.violations) == 1
+        assert obs.done is False
+        assert obs.step_count == 0
+    def test_reset_clears_state(self):
+        env = make_env("easy")
+        # Take a step, then reset
+        env.step(action(
+            kind=ActionKind.ADD_FIELD,
+            location="response_body",
+            field_name="created_at",
+            new_value={"type": "string", "required": True, "description": "timestamp"},
+        ))
+        obs = env.reset()
+        assert obs.step_count == 0
+        assert len(obs.violations) == 1  # back to broken state
+    def test_reset_switches_task(self):
+        env = APIContractDebuggerEnv(task_name="easy")
+        obs = env.reset(task_name="medium")
+        assert obs.task_name == "medium"
+        assert len(obs.violations) == 3
+    def test_reset_preserves_golden(self):
+        env = make_env("hard")
+        obs = env.reset()
+        assert obs.total_violations_at_start == 6
+    def test_episode_id_set_on_reset(self):
+        env = APIContractDebuggerEnv(task_name="easy")
+        env.reset(episode_id="test-123")
+        assert env.state.episode_id == "test-123"
+# ===========================================================================
+# 6. Environment — step mechanics
+# ===========================================================================
+class TestEnvStep:
+    def test_add_missing_field_fixes_easy(self):
+        env = make_env("easy")
+        obs = env.step(action(
+            kind=ActionKind.ADD_FIELD,
+            location="response_body",
+            field_name="created_at",
+            new_value={"type": "string", "required": True, "description": "ISO timestamp"},
+        ))
+        assert len(obs.violations) == 0
+        assert obs.done is True
+        assert obs.reward > 0
+    def test_wrong_type_action_introduces_violation(self):
+        env = make_env("easy")
+        obs = env.step(action(
+            kind=ActionKind.ADD_FIELD,
+            location="response_body",
+            field_name="created_at",
+            new_value={"type": "integer", "required": True, "description": "wrong type"},
+        ))
+        # Still has a violation (wrong type now)
+        assert len(obs.violations) == 1
+        assert obs.violations[0]["violation_type"] == "wrong_type"
+    def test_out_of_range_endpoint_index(self):
+        env = make_env("easy")
+        obs = env.step(action(kind=ActionKind.ADD_FIELD, endpoint_index=99,
+                               field_name="x", new_value={"type": "string"}))
+        assert obs.last_action_error is not None
+        assert "out of range" in obs.last_action_error
+    def test_change_type_fixes_medium_violation(self):
+        env = make_env("medium")
+        # Fix violation 1: product_id type string→integer in response
+        obs = env.step(action(
+            kind=ActionKind.CHANGE_TYPE,
+            endpoint_index=0,
+            location="response_body",
+            field_name="product_id",
+            new_value="integer",
+        ))
+        assert obs.violations_fixed_this_step == 1
+        assert len(obs.violations) == 2  # 2 remaining
+    def test_change_status_fixes_medium_violation(self):
+        env = make_env("medium")
+        obs = env.step(action(
+            kind=ActionKind.CHANGE_STATUS,
+            endpoint_index=2,
+            location="status_code",
+            new_value=204,
+        ))
+        assert obs.violations_fixed_this_step == 1
+    def test_remove_field_fixes_hard_extra_field(self):
+        env = make_env("hard")
+        obs = env.step(action(
+            kind=ActionKind.REMOVE_FIELD,
+            endpoint_index=1,
+            location="response_body",
+            field_name="password_hash",
+        ))
+        assert obs.violations_fixed_this_step == 1
+    def test_no_op_does_not_change_violations(self):
+        env = make_env("easy")
+        before = len(env.state.violations)
+        obs = env.step(action(kind=ActionKind.NO_OP))
+        assert len(obs.violations) == before
+    def test_step_after_done_returns_done(self):
+        env = make_env("easy")
+        # Solve it
+        env.step(action(
+            kind=ActionKind.ADD_FIELD,
+            location="response_body",
+            field_name="created_at",
+            new_value={"type": "string", "required": True, "description": "ts"},
+        ))
+        # Step again — should get done=True with error message
+        obs = env.step(action(kind=ActionKind.NO_OP))
+        assert obs.done is True
+        assert obs.last_action_error is not None
+    def test_max_steps_terminates_episode(self):
+        env = APIContractDebuggerEnv(task_name="easy")
+        env.reset()
+        obs = None
+        for _ in range(env._task_cfg["max_steps"]):
+            obs = env.step(action(kind=ActionKind.NO_OP))
+        assert obs.done is True
+    def test_step_count_increments(self):
+        env = make_env("easy")
+        env.step(action(kind=ActionKind.NO_OP))
+        env.step(action(kind=ActionKind.NO_OP))
+        assert env.state.step_count == 2
+# ===========================================================================
+# 7. Environment — state
+# ===========================================================================
+class TestEnvState:
+    def test_state_reflects_current_endpoints(self):
+        env = make_env("easy")
+        state = env.state
+        assert len(state.current_endpoints) == 1
+        assert state.task_name == "easy"
+    def test_state_tracks_step_count(self):
+        env = make_env("easy")
+        env.step(action(kind=ActionKind.NO_OP))
+        assert env.state.step_count == 1
+    def test_original_endpoints_unchanged_after_steps(self):
+        env = make_env("easy")
+        original_before = copy.deepcopy(env.state.original_endpoints)
+        env.step(action(
+            kind=ActionKind.ADD_FIELD,
+            location="response_body",
+            field_name="created_at",
+            new_value={"type": "string", "required": True, "description": "ts"},
+        ))
+        assert env.state.original_endpoints == original_before
+# ===========================================================================
+# 8. Full episode walkthroughs
+# ===========================================================================
+class TestFullEpisodes:
+    def test_easy_perfect_solve(self):
+        env = make_env("easy")
+        env.step(action(
+            kind=ActionKind.ADD_FIELD,
+            location="response_body",
+            field_name="created_at",
+            new_value={"type": "string", "required": True, "description": "ISO timestamp"},
+        ))
+        assert env.score() == pytest.approx(1.0)
+    def test_medium_perfect_solve(self):
+        env = make_env("medium")
+        # Fix 1: product_id type
+        env.step(action(kind=ActionKind.CHANGE_TYPE, endpoint_index=0,
+                        location="response_body", field_name="product_id", new_value="integer"))
+        # Fix 2: quantity type
+        env.step(action(kind=ActionKind.CHANGE_TYPE, endpoint_index=1,
+                        location="request_body", field_name="quantity", new_value="integer"))
+        # Fix 3: DELETE status code
+        env.step(action(kind=ActionKind.CHANGE_STATUS, endpoint_index=2,
+                        location="status_code", new_value=204))
+        assert env.score() == pytest.approx(1.0)
+    def test_hard_perfect_solve(self):
+        env = make_env("hard")
+        # Fix 1: add refresh_token to /auth/login response
+        env.step(action(kind=ActionKind.ADD_FIELD, endpoint_index=0,
+                        location="response_body", field_name="refresh_token",
+                        new_value={"type": "string", "required": True, "description": "Refresh token"}))
+        # Fix 2: expires_in type string→integer in /auth/login response
+        env.step(action(kind=ActionKind.CHANGE_TYPE, endpoint_index=0,
+                        location="response_body", field_name="expires_in", new_value="integer"))
+        # Fix 3: add created_at to /users/{id}/profile response
+        env.step(action(kind=ActionKind.ADD_FIELD, endpoint_index=1,
+                        location="response_body", field_name="created_at",
+                        new_value={"type": "string", "required": True, "description": "ISO timestamp"}))
+        # Fix 4: remove password_hash from /users/{id}/profile response
+        env.step(action(kind=ActionKind.REMOVE_FIELD, endpoint_index=1,
+                        location="response_body", field_name="password_hash"))
+        # Fix 5: PATCH status 500→200
+        env.step(action(kind=ActionKind.CHANGE_STATUS, endpoint_index=2,
+                        location="status_code", new_value=200))
+        # Fix 6: add updated_at to PATCH response
+        env.step(action(kind=ActionKind.ADD_FIELD, endpoint_index=2,
+                        location="response_body", field_name="updated_at",
+                        new_value={"type": "string", "required": True, "description": "ISO timestamp"}))
+        assert env.score() == pytest.approx(1.0)
+    def test_score_after_partial_solve(self):
+        env = make_env("medium")
+        # Fix only 1 of the 3 violations; the score should reflect partial credit
+        env.step(action(kind=ActionKind.CHANGE_TYPE, endpoint_index=0,
+                        location="response_body", field_name="product_id", new_value="integer"))
+        score = env.score()
+        assert 0.0 < score < 1.0
+    def test_unknown_task_raises(self):
+        with pytest.raises(ValueError, match="Unknown task"):
+            APIContractDebuggerEnv(task_name="impossible")
+# ===========================================================================
+# 9. HTTP API routes (FastAPI TestClient)
+# ===========================================================================
+class TestHTTPRoutes:
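+    @pytest.fixture(autouse=True)
+    def _setup_client(self):
+        # Named _setup_client so the self.client attribute set below
+        # does not shadow the fixture method itself.
+        from fastapi.testclient import TestClient
+        from server.app import app
+        self.client = TestClient(app)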
+    def test_health_endpoint(self):
+        r = self.client.get("/health")
+        assert r.status_code == 200
+    def test_reset_returns_200(self):
+        # An empty body is accepted; no task_name is required
+        r = self.client.post("/reset", json={})
+        assert r.status_code == 200
+        data = r.json()
+        assert "violations" in data
+        assert "endpoints" in data
+    def test_reset_switches_task(self):
+        r = self.client.post("/reset", json={"task_name": "medium"})
+        assert r.status_code == 200
+        assert r.json()["task_name"] == "medium"
+    def test_reset_unknown_task_422(self):
+        r = self.client.post("/reset", json={"task_name": "impossible"})
+        assert r.status_code == 422
+    def test_step_add_field(self):
+        self.client.post("/reset", json={"task_name": "easy"})
+        r = self.client.post("/step", json={
+            "action": {
+                "kind": "add_field",
+                "endpoint_index": 0,
+                "location": "response_body",
+                "field_name": "created_at",
+                "new_value": {"type": "string", "required": True, "description": "ts"},
+            }
+        })
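+        assert r.status_code == 200
+        data = r.json()
+        # This single fix solves the easy task (see test_easy_perfect_solve),
+        # so the episode should end with a positive reward.
+        assert data["done"] is True
+        assert data["reward"] > 0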
+    def test_step_invalid_action_422(self):
+        self.client.post("/reset", json={})
+        r = self.client.post("/step", json={"action": {"kind": "nonexistent_kind"}})
+        assert r.status_code == 422
+    def test_state_endpoint(self):
+        self.client.post("/reset", json={"task_name": "easy"})
+        r = self.client.get("/state")
+        assert r.status_code == 200
+        assert "current_endpoints" in r.json()
+    def test_score_endpoint(self):
+        self.client.post("/reset", json={"task_name": "easy"})
+        r = self.client.get("/score")
+        assert r.status_code == 200
+        data = r.json()
+        assert "score" in data
+        assert 0.0 <= data["score"] <= 1.0
+    def test_tasks_endpoint(self):
+        r = self.client.get("/tasks")
+        assert r.status_code == 200
+        data = r.json()
+        # Expect the three tasks exercised above: easy, medium, hard
+        assert len(data["tasks"]) == 3
+    def test_schema_endpoint(self):
+        r = self.client.get("/schema")
+        assert r.status_code == 200
+        schema = r.json()
+        assert "action" in schema
+        assert "observation" in schema
+    def test_full_easy_solve_via_http(self):
+        self.client.post("/reset", json={"task_name": "easy"})
+        r = self.client.post("/step", json={
+            "action": {
+                "kind": "add_field",
+                "endpoint_index": 0,
+                "location": "response_body",
+                "field_name": "created_at",
+                "new_value": {"type": "string", "required": True, "description": "ts"},
+            }
+        })
+        assert r.json()["done"] is True
+        score_r = self.client.get("/score")
+        assert score_r.json()["score"] == pytest.approx(1.0)