# Reinforcement Learning Architecture: API Contract Debugger ## Overview The API Contract Debugger is a **reinforcement learning environment** built on the OpenEnv framework. It challenges AI agents to fix broken OpenAPI-style contract specifications by proposing targeted field-level corrections. This document explains how the codebase implements the core RL concepts: - **Agent** — The external AI system interacting with the environment - **Environment** — The `APIContractDebuggerEnv` class that simulates the debugging task - **State** — What the agent observes and the internal environment state - **Action** — The fixes the agent can propose - **Reward/Result** — The feedback signal and scoring mechanism --- ## 1. Agent (External AI System) ### What is the Agent? The **agent** is an **external AI system** (e.g., an LLM, RL policy, or human) that: - Receives observations from the environment - Proposes actions (fixes to the API spec) - Receives reward feedback and the next state - Aims to maximize cumulative reward by fixing all violations ### Agent Interaction Pattern ``` Agent Environment | | |---- POST /reset (task_name) -----> | | | | <------ Initial Observation --------| | (endpoints, violations, reward=0) | | | |---- POST /step (action) ----------> | | | | <---- Updated Observation --------- | | (new endpoints, new violations, | | reward, done, fixed/introduced) | | | | [repeat until done=True] | | | | ---- GET /score - GET /state -----> | | | ``` ### Agent Location in Codebase - **File**: `server/app.py` - **Routes**: - `POST /reset` — Initialize new episode - `POST /step` — Apply one action - `GET /state` — Query full environment state (for debugging) - `GET /score` — Get final episode score - `GET /tasks` — List available tasks The agent communicates via HTTP REST API. All observations are JSON and fully serializable. --- ## 2. Environment (`APIContractDebuggerEnv`) ### Class Definition **File**: `server/environment.py` ```python class APIContractDebuggerEnv(Environment[DebugAction, DebugObservation, DebugState]): """ Environment where an agent debugs broken API contract specifications. Inherits from OpenEnv's Environment base class. Implements reset(), step(), and state property. """ ``` ### Environment Responsibilities 1. **Initialize tasks** — Load broken + golden endpoint specs from fixtures 2. **Detect violations** — Compare current spec against golden spec 3. **Apply actions** — Mutate the current spec based on agent's fix proposal 4. **Compute rewards** — Dense per-step reward based on violations fixed/introduced 5. **Track state** — Maintain episode counter, step count, violations 6. **Terminate episodes** — Check for success (all fixed) or max steps reached ### Key Methods #### `reset(seed, episode_id, task_name, **kwargs) → DebugObservation` Initializes a fresh episode: - Loads task config from fixtures - Deep-copies broken endpoints to avoid cross-episode state leakage - Detects initial violations - Returns initial observation with reward=0 ```python def reset(self, seed=None, episode_id=None, task_name=None, **kwargs): """ Reset the environment and return the initial observation. """ # Load task config and deep-copy endpoints self._current_endpoints = copy.deepcopy(self._task_cfg["broken_endpoints"]) self._golden_endpoints = copy.deepcopy(self._task_cfg["golden_endpoints"]) # Detect violations (agent's starting problem) self._violations = detect_violations(self._current_endpoints, self._golden_endpoints) return self._make_observation(reward=0.0, done=False, ...) ``` #### `step(action, timeout_s, **kwargs) → DebugObservation` Processes one agent action and returns the updated state: ```python def step(self, action: DebugAction, **kwargs) -> DebugObservation: """ Apply one fix action → return updated observation + reward. """ # 1. Apply the action (mutate current_endpoints) action_error = self._apply_action(action) # 2. Recompute violations self._violations = detect_violations(self._current_endpoints, self._golden_endpoints) # 3. Compute dense reward reward = step_reward(prev_violations, self._violations, action_error) # 4. Check termination all_fixed = len(self._violations) == 0 out_of_steps = self._step_count >= max_steps self._done = all_fixed or out_of_steps # 5. Bonus reward if solved if all_fixed: reward += 0.5 return self._make_observation(reward, done, fixed_this_step, ...) ``` #### `_apply_action(action) → Optional[str]` Attempts to mutate `self._current_endpoints` according to the action: - **Validates** endpoint index, field name, locations - **Executes** the fix: - `ADD_FIELD` — Insert new field into request/response body - `REMOVE_FIELD` — Delete field from body - `CHANGE_TYPE` — Update field's type - `CHANGE_STATUS` — Update endpoint's HTTP status code - `NO_OP` — Explicit pass (implicit penalty via no reward) - **Returns** error string if invalid, `None` on success #### `state` Property Returns the complete internal state (not exposed to agent by default, but available via `/state`): ```python @property def state(self) -> DebugState: """Return full internal environment state.""" return DebugState( episode_id=self._episode_id, step_count=self._step_count, task_name=self._task_name, original_endpoints=self._original_endpoints, # Snapshot of broken spec current_endpoints=self._current_endpoints, # Current state after fixes golden_endpoints=self._golden_endpoints, # Target spec violations=self._violations, # Current violations total_violations_at_start=len(self._initial_violations), max_steps=self._task_cfg["max_steps"], ) ``` ### Supported Tasks **File**: `server/fixtures.py` Three difficulty levels: | Task | Difficulty | Endpoints | Violations | Max Steps | Description | |------|-----------|-----------|-----------|-----------|-------------| | **easy** | Beginner | 1 | 1 missing field | 5 | Simple: add one field to response | | **medium** | Intermediate | 3 | 3 (type errors + wrong status) | 10 | Type mismatches and HTTP status fixes | | **hard** | Advanced | 4 | 6 (missing, extra, type, status) | 15 | Complex: multiple violation types | Each task has: - `broken_endpoints` — Starting state (what agent sees) - `golden_endpoints` — Ground truth (what violations are measured against) - `description` — Human-readable task objective - `max_steps` — Episode cut-off --- ## 3. State ### Observation (`DebugObservation`) **What the agent sees after each action.** File: `server/models.py` ```python class DebugObservation(Observation): """ What the agent observes after reset() or step(). """ # Task info task_name: str # "easy" | "medium" | "hard" task_description: str # Human description # Current spec endpoints: List[Dict[str, Any]] # Current endpoints (partially fixed) violations: List[Dict[str, Any]] # Detected violations still present # Reward signals reward: float # Dense per-step reward done: bool # Episode termination flag violations_fixed_this_step: int # Count of fixed violations violations_introduced_this_step: int # Count of new violations total_violations_at_start: int # Reference baseline # Tracking step_count: int # Steps taken so far max_steps: int # Episode limit last_action_error: Optional[str] # Validation error message ``` #### Example Observation ```json { "task_name": "easy", "task_description": "Add missing 'created_at' field to response...", "endpoints": [ { "method": "POST", "path": "/users/register", "status_code": 201, "request_body": { "username": {"type": "string", "required": true}, "email": {"type": "string", "required": true}, "password": {"type": "string", "required": true} }, "response_body": { "user_id": {"type": "integer", "required": true}, "username": {"type": "string", "required": true} // missing: created_at } } ], "violations": [ { "endpoint_index": 0, "location": "response_body", "field_name": "created_at", "violation_type": "missing_field", "description": "POST /users/register response_body: required field 'created_at' (string) is missing", "severity": 1.0 } ], "violations_fixed_this_step": 0, "violations_introduced_this_step": 0, "total_violations_at_start": 1, "step_count": 0, "max_steps": 5, "reward": 0.0, "done": false, "last_action_error": null } ``` ### Full Internal State (`DebugState`) **Available via `GET /state` endpoint (for debugging/analysis, not given to agent by default).** ```python class DebugState(State): """ Full internal state (not exposed to agent by default). """ task_name: str original_endpoints: List[Dict[str, Any]] # Snapshot of broken spec current_endpoints: List[Dict[str, Any]] # Mutated by agent's actions golden_endpoints: List[Dict[str, Any]] # Ground truth violations: List[Dict[str, Any]] # Computed violations total_violations_at_start: int max_steps: int ``` --- ## 4. Action (`DebugAction`) **What the agent can propose.** File: `server/models.py` ```python class DebugAction(Action): """ A single fix proposed by the agent. The agent targets one endpoint + one field and proposes exactly one change. """ kind: ActionKind # Type of fix endpoint_index: int # Which endpoint to fix (0-indexed) location: str # "request_body" | "response_body" | "status_code" field_name: Optional[str] # Field to modify (null for status_code) new_value: Optional[Any] # The corrected value ``` ### Action Types (`ActionKind`) | Kind | Target | Effect | new_value | |------|--------|--------|-----------| | `ADD_FIELD` | Field | Insert missing field into body | `{"type": str, "description"?: str}` | | `REMOVE_FIELD` | Field | Delete forbidden field from body | `null` | | `CHANGE_TYPE` | Field | Fix field's JSON Schema type | Type string (e.g., `"integer"`) | | `CHANGE_STATUS` | Endpoint | Fix HTTP status code | Integer (e.g., `201`) | | `NO_OP` | None | Explicit pass/wait | `null` | #### Example Actions ```python # Fix 1: Add missing 'created_at' field { "kind": "add_field", "endpoint_index": 0, "location": "response_body", "field_name": "created_at", "new_value": { "type": "string", "description": "ISO-8601 timestamp" } } # Fix 2: Change field type from string to integer { "kind": "change_type", "endpoint_index": 1, "location": "request_body", "field_name": "user_id", "new_value": "integer" } # Fix 3: Correct HTTP status code { "kind": "change_status", "endpoint_index": 0, "location": "status_code", "field_name": null, "new_value": 201 } # Fix 4: Remove extra field { "kind": "remove_field", "endpoint_index": 2, "location": "response_body", "field_name": "deprecated_field", "new_value": null } # Fix 5: Explicit pass { "kind": "no_op", "endpoint_index": 0, "location": "request_body", "field_name": null, "new_value": null } ``` ### Action Validation The environment validates actions in `_apply_action()`: - **Endpoint index bounds** — Must be `0 ≤ index < len(endpoints)` - **Location validity** — Must be `"request_body"`, `"response_body"`, or `"status_code"` - **Field existence** — REMOVE_FIELD and CHANGE_TYPE require field to exist - **Type format** — Fields must have `{"type": "..."}` structure - **Status code format** — Must be an integer If validation fails, `_apply_action()` returns an error string and the step receives `-0.05` reward penalty. --- ## 5. Reward & Result ### Dense Per-Step Reward **File**: `server/graders.py` → `step_reward()` function The agent receives feedback after each step: ```python def step_reward( prev_violations: List[Dict[str, Any]], new_violations: List[Dict[str, Any]], initial_violations: List[Dict[str, Any]], action_error: bool, ) -> float: """ Dense per-step reward: +0.2 × severity per violation resolved -0.15 × severity per new violation introduced -0.05 for malformed action +0.5 bonus if all violations fixed (episode success) """ if action_error: return -0.05 reward = 0.0 for v in violations_fixed_this_step: reward += 0.2 * v["severity"] for v in violations_introduced_this_step: reward -= 0.15 * v["severity"] return reward ``` ### Violation Severity Weights Weighted by problem importance: | Violation Type | Severity | Reason | |----------------|----------|--------| | `missing_field` | 1.0 | Breaks contract — top priority | | `wrong_type` | 0.9 | Type mismatch — critical | | `wrong_status` | 0.8 | HTTP code error — significant | | `extra_field` | 0.7 | Forbidden field — less critical | ### Episode Scoring (`grade_episode()`) **Computed at episode end.** Returns final score in `[0.0, 1.0]`. ```python def grade_episode( current_endpoints: List[Dict[str, Any]], golden_endpoints: List[Dict[str, Any]], initial_violations: List[Dict[str, Any]], ) -> float: """ Final episode score: score = (weighted_violations_fixed - weighted_violations_introduced) / total_initial_weight Clamped to [0.0, 1.0] 1.0 = all violations fixed, no new ones introduced 0.5 = 50% of violations fixed 0.0 = no improvement or made things worse """ ``` #### Example Scoring Scenario **Task: easy (1 violation)** - Initial violation: `missing_field "created_at" (severity=1.0)` - After 1 step: Agent adds `created_at` correctly - After 2 steps: Agent incorrectly changes type of `username` to `integer` (introduces 1 violation) - Final state: 0 remaining violations, but 1 introduced ``` score = (1.0 - 1.0) / 1.0 = 0.0 ``` Clamped to 0.0 (agent made things worse overall). --- ## 6. Complete RL Loop Example ### Scenario: Easy Task **Initial state:** ``` Broken spec: POST /users/register response missing "created_at" Golden spec: response has user_id, username, created_at ``` ### Episode Transcript ``` RESET request (task_name="easy") ↓ Observation #0: endpoints: [broken registration endpoint] violations: [missing_field "created_at"] reward: 0.0 done: false step_count: 0 STEP 1: Agent proposes ADD_FIELD action action.kind = "add_field" action.endpoint_index = 0 action.location = "response_body" action.field_name = "created_at" action.new_value = {"type": "string", "description": "ISO-8601 timestamp"} ↓ Environment: - Validates action ✓ - Adds field to response_body - Recomputes violations → [] (0 violations!) - Computes reward: +0.2 × 1.0 (fixed 1 violation of severity 1.0) = +0.2 + 0.5 (bonus for all_fixed=true) = +0.7 total - Sets done=true (all violations fixed) ↓ Observation #1: endpoints: [fixed registration endpoint] violations: [] violations_fixed_this_step: 1 violations_introduced_this_step: 0 reward: 0.7 done: true step_count: 1 SCORE request ↓ score = (1.0 fixed - 0 introduced) / 1.0 initial = 1.0 ✓ Agent succeeds with perfect score! ``` --- ## 7. File Structure Summary ``` server/ ├── app.py # FastAPI routes, HTTP interface ├── environment.py # APIContractDebuggerEnv (core RL logic) ├── models.py # Pydantic models: DebugAction, DebugObservation, DebugState ├── fixtures.py # Task definitions (easy, medium, hard) ├── graders.py # Violation detection + reward/scoring └── __pycache__/ tests/ # Unit tests for environment, graders, fixtures RL_ARCHITECTURE.md # This file ``` --- ## 8. Key Design Principles 1. **Stateful Environment** — One episode per task at a time (OpenEnv singleton pattern) 2. **Dense Rewards** — Agent gets per-step feedback (not just final score) to guide learning 3. **Severity-Weighted** — Different violation types have different weights (missing fields = highest priority) 4. **Action Validation** — Invalid actions receive penalty and return error messages 5. **Deep-Copied State** — Endpoints are deep-copied to prevent cross-episode contamination 6. **Observable Violations** — Agent sees exact list of violations (not hidden) 7. **Termination Conditions**: - Success: All violations fixed - Failure: Max steps exceeded 8. **JSON/REST Interface** — Agent communicates via HTTP (language-agnostic) --- ## 9. Typical Agent Workflow ```python import requests BASE_URL = "http://localhost:7860" # 1. Reset to start new episode reset_resp = requests.post(f"{BASE_URL}/reset", json={ "task_name": "easy", "seed": 42 }) obs = reset_resp.json() print(f"Violations to fix: {len(obs['violations'])}") # 2. Repeat: observe → decide → act for step in range(obs['max_steps']): if obs['done']: break # Agent decision logic (depends on obs['violations']) action = { "kind": "add_field", "endpoint_index": 0, "location": "response_body", "field_name": "created_at", "new_value": {"type": "string"} } # 3. Apply action step_resp = requests.post(f"{BASE_URL}/step", json={"action": action}) obs = step_resp.json() print(f"Step {step+1}: reward={obs['reward']}, violations={len(obs['violations'])}") # 4. Check final score score_resp = requests.get(f"{BASE_URL}/score") print(f"Final score: {score_resp.json()['score']}") ``` --- ## 10. Future Extensions Potential enhancements to the RL framework: 1. **Multi-Agent** — Support concurrent episodes via session IDs 2. **Curriculum Learning** — Dynamically adapt difficulty based on agent performance 3. **Partial Observability** — Hide some violations initially to increase challenge 4. **Action Constraints** — Limit action space per step (e.g., "fix at most 1 field") 5. **Custom Reward Shaping** — Configurable severity weights + bonus structures 6. **State Representation** — Multiple formats (JSON, graph, embedding-friendly) --- ## Summary Table | Concept | Implementation | File | Purpose | |---------|---|---|---| | **Agent** | External AI/LLM | HTTP client | Proposes fixes | | **Environment** | `APIContractDebuggerEnv` | `environment.py` | Simulates faults + validates fixes | | **State** | `DebugObservation` + `DebugState` | `models.py` | Agent observes + internal tracking | | **Action** | `DebugAction` | `models.py` | Fix proposals | | **Reward** | `step_reward()` | `graders.py` | Dense per-step feedback | | **Result** | Episode score `[0.0, 1.0]` | `graders.py` | Final performance metric | | **Tasks** | Fixtures (easy/medium/hard) | `fixtures.py` | Problem instances | | **HTTP API** | FastAPI routes | `app.py` | Communication interface |