| # Reinforcement Learning Architecture: API Contract Debugger |
|
|
| ## Overview |
|
|
| The API Contract Debugger is a **reinforcement learning environment** built on the OpenEnv framework. It challenges AI agents to fix broken OpenAPI-style contract specifications by proposing targeted field-level corrections. |
|
|
| This document explains how the codebase implements the core RL concepts: |
| - **Agent**: The external AI system interacting with the environment |
| - **Environment**: The `APIContractDebuggerEnv` class that simulates the debugging task |
| - **State**: What the agent observes and the internal environment state |
| - **Action**: The fixes the agent can propose |
| - **Reward/Result**: The feedback signal and scoring mechanism |
|
|
| --- |
|
|
| ## 1. Agent (External AI System) |
|
|
| ### What is the Agent? |
|
|
| The **agent** is an **external AI system** (e.g., an LLM, RL policy, or human) that: |
| - Receives observations from the environment |
| - Proposes actions (fixes to the API spec) |
| - Receives reward feedback and the next state |
| - Aims to maximize cumulative reward by fixing all violations |
|
|
| ### Agent Interaction Pattern |
|
|
| ``` |
| Agent                                   Environment |
|   |                                          | |
|   |---- POST /reset (task_name) ------------>| |
|   |                                          | |
|   |<--- Initial Observation -----------------| |
|   |     (endpoints, violations, reward=0)    | |
|   |                                          | |
|   |---- POST /step (action) ---------------->| |
|   |                                          | |
|   |<--- Updated Observation -----------------| |
|   |     (new endpoints, new violations,      | |
|   |      reward, done, fixed/introduced)     | |
|   |                                          | |
|   |     [repeat until done=True]             | |
|   |                                          | |
|   |---- GET /score, GET /state ------------->| |
|   |                                          | |
| ``` |
|
|
| ### Agent Location in Codebase |
|
|
| - **File**: `server/app.py` |
| - **Routes**: |
|   - `POST /reset`: Initialize a new episode |
|   - `POST /step`: Apply one action |
|   - `GET /state`: Query the full environment state (for debugging) |
|   - `GET /score`: Get the final episode score |
|   - `GET /tasks`: List available tasks |
|
|
| The agent communicates via HTTP REST API. All observations are JSON and fully serializable. |
|
|
| --- |
|
|
| ## 2. Environment (`APIContractDebuggerEnv`) |
|
|
| ### Class Definition |
|
|
| **File**: `server/environment.py` |
|
|
| ```python |
| class APIContractDebuggerEnv(Environment[DebugAction, DebugObservation, DebugState]): |
|     """ |
|     Environment where an agent debugs broken API contract specifications. |
| 
|     Inherits from OpenEnv's Environment base class. |
|     Implements reset(), step(), and state property. |
|     """ |
| ``` |
|
|
| ### Environment Responsibilities |
|
|
| 1. **Initialize tasks**: Load broken + golden endpoint specs from fixtures |
| 2. **Detect violations**: Compare the current spec against the golden spec |
| 3. **Apply actions**: Mutate the current spec based on the agent's fix proposal |
| 4. **Compute rewards**: Dense per-step reward based on violations fixed/introduced |
| 5. **Track state**: Maintain episode counter, step count, violations |
| 6. **Terminate episodes**: Check for success (all fixed) or max steps reached |
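
Responsibility 2 can be illustrated with a minimal sketch of a golden-vs-current comparison. This is not the project's `detect_violations`; the helper name `detect_violations_sketch`, the matching of endpoints by position, and the field layout (borrowed from the example observation in this document) are assumptions made for illustration. The severity values are the ones listed in the severity table.

```python
from typing import Any, Dict, List, Optional

Endpoint = Dict[str, Any]

# Severity weights as documented for each violation type.
SEVERITY = {"missing_field": 1.0, "wrong_type": 0.9, "wrong_status": 0.8, "extra_field": 0.7}


def _violation(index: int, location: str, field_name: Optional[str], kind: str) -> Dict[str, Any]:
    """Build one violation record in the shape shown in the example observation."""
    return {
        "endpoint_index": index,
        "location": location,
        "field_name": field_name,
        "violation_type": kind,
        "severity": SEVERITY[kind],
    }


def detect_violations_sketch(current: List[Endpoint], golden: List[Endpoint]) -> List[Dict[str, Any]]:
    """Compare each current endpoint with its golden counterpart (hypothetical helper)."""
    found: List[Dict[str, Any]] = []
    # Assumption: endpoints are matched by position in the list.
    for index, (cur, gold) in enumerate(zip(current, golden)):
        if cur.get("status_code") != gold.get("status_code"):
            found.append(_violation(index, "status_code", None, "wrong_status"))
        for location in ("request_body", "response_body"):
            cur_fields = cur.get(location, {})
            gold_fields = gold.get(location, {})
            for name, spec in gold_fields.items():
                if name not in cur_fields:
                    found.append(_violation(index, location, name, "missing_field"))
                elif cur_fields[name].get("type") != spec.get("type"):
                    found.append(_violation(index, location, name, "wrong_type"))
            for name in cur_fields:
                if name not in gold_fields:
                    found.append(_violation(index, location, name, "extra_field"))
    return found


golden_spec = [{
    "status_code": 201,
    "request_body": {},
    "response_body": {"user_id": {"type": "integer"}, "created_at": {"type": "string"}},
}]
broken_spec = [{
    "status_code": 201,
    "request_body": {},
    "response_body": {"user_id": {"type": "integer"}},
}]

violations = detect_violations_sketch(broken_spec, golden_spec)
print([(v["violation_type"], v["field_name"]) for v in violations])
```

Comparing a spec against itself yields an empty list, which is the condition the environment treats as success.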
|
|
| ### Key Methods |
|
|
| #### `reset(seed, episode_id, task_name, **kwargs) → DebugObservation` |
| |
| Initializes a fresh episode: |
| - Loads task config from fixtures |
| - Deep-copies broken endpoints to avoid cross-episode state leakage |
| - Detects initial violations |
| - Returns initial observation with reward=0 |
| |
| ```python |
| def reset(self, seed=None, episode_id=None, task_name=None, **kwargs): |
|     """ |
|     Reset the environment and return the initial observation. |
|     """ |
|     # Load task config and deep-copy endpoints |
|     self._current_endpoints = copy.deepcopy(self._task_cfg["broken_endpoints"]) |
|     self._golden_endpoints = copy.deepcopy(self._task_cfg["golden_endpoints"]) |
| 
|     # Detect violations (agent's starting problem) |
|     self._violations = detect_violations(self._current_endpoints, self._golden_endpoints) |
| 
|     return self._make_observation(reward=0.0, done=False, ...) |
| ``` |
| |
| #### `step(action, timeout_s, **kwargs) → DebugObservation` |
| |
| Processes one agent action and returns the updated state: |
| |
| ```python |
| def step(self, action: DebugAction, timeout_s=None, **kwargs) -> DebugObservation: |
|     """ |
|     Apply one fix action and return the updated observation + reward. |
|     """ |
|     # 0. Remember the violations before the action and count the step |
|     prev_violations = list(self._violations) |
|     self._step_count += 1 |
| 
|     # 1. Apply the action (mutate current_endpoints) |
|     action_error = self._apply_action(action) |
| 
|     # 2. Recompute violations |
|     self._violations = detect_violations(self._current_endpoints, self._golden_endpoints) |
| 
|     # 3. Compute dense reward (action_error is an error string or None) |
|     reward = step_reward( |
|         prev_violations, self._violations, self._initial_violations, |
|         action_error is not None, |
|     ) |
| 
|     # 4. Check termination |
|     all_fixed = len(self._violations) == 0 |
|     out_of_steps = self._step_count >= self._task_cfg["max_steps"] |
|     self._done = all_fixed or out_of_steps |
| 
|     # 5. Bonus reward if solved |
|     if all_fixed: |
|         reward += 0.5 |
| 
|     # fixed_this_step and the remaining arguments are computed in code elided here |
|     return self._make_observation(reward, self._done, fixed_this_step, ...) |
| ``` |
| |
| #### `_apply_action(action) → Optional[str]` |
|
|
| Attempts to mutate `self._current_endpoints` according to the action: |
|
|
| - **Validates** the endpoint index, field name, and location |
| - **Executes** the fix: |
|   - `ADD_FIELD`: Insert a new field into the request/response body |
|   - `REMOVE_FIELD`: Delete a field from the body |
|   - `CHANGE_TYPE`: Update a field's type |
|   - `CHANGE_STATUS`: Update the endpoint's HTTP status code |
|   - `NO_OP`: Explicit pass (earns no reward, which acts as an implicit penalty) |
| - **Returns** an error string if the action is invalid, `None` on success |
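
The validate-then-mutate flow described above might look roughly like the following sketch. It is an illustration only: `apply_action_sketch` is a hypothetical stand-in for `_apply_action()`, it takes the action as a plain dict rather than a `DebugAction`, and the error messages are invented for the example.

```python
from typing import Any, Dict, List, Optional


def apply_action_sketch(endpoints: List[Dict[str, Any]], action: Dict[str, Any]) -> Optional[str]:
    """Mutate `endpoints` in place; return an error string, or None on success."""
    kind = action["kind"]
    if kind == "no_op":
        return None  # explicit pass: nothing to validate or change

    index = action["endpoint_index"]
    if not 0 <= index < len(endpoints):
        return f"endpoint_index {index} out of range"
    endpoint = endpoints[index]

    if kind == "change_status":
        if not isinstance(action["new_value"], int):
            return "new_value must be an integer status code"
        endpoint["status_code"] = action["new_value"]
        return None

    # The remaining kinds operate on a field inside a request or response body.
    if action["location"] not in ("request_body", "response_body"):
        return f"invalid location {action['location']!r}"
    body = endpoint.setdefault(action["location"], {})
    name = action["field_name"]

    if kind == "add_field":
        new_value = action["new_value"]
        if not isinstance(new_value, dict) or "type" not in new_value:
            return "new_value must be a dict with a 'type' key"
        body[name] = dict(new_value)
    elif kind == "remove_field":
        if name not in body:
            return f"field {name!r} does not exist"
        del body[name]
    elif kind == "change_type":
        if name not in body:
            return f"field {name!r} does not exist"
        body[name]["type"] = action["new_value"]
    else:
        return f"unknown action kind {kind!r}"
    return None


spec = [{"status_code": 200, "request_body": {}, "response_body": {"user_id": {"type": "string"}}}]

ok = apply_action_sketch(spec, {
    "kind": "change_type", "endpoint_index": 0, "location": "response_body",
    "field_name": "user_id", "new_value": "integer",
})
bad = apply_action_sketch(spec, {
    "kind": "remove_field", "endpoint_index": 0, "location": "response_body",
    "field_name": "nickname", "new_value": None,
})
print(ok, "|", bad)
```

Returning a string instead of raising keeps the episode running after a bad action, so the caller can turn the error into a penalty and report it back in the observation.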
|
|
| #### `state` Property |
|
|
| Returns the complete internal state (not exposed to agent by default, but available via `/state`): |
|
|
| ```python |
| @property |
| def state(self) -> DebugState: |
|     """Return full internal environment state.""" |
|     return DebugState( |
|         episode_id=self._episode_id, |
|         step_count=self._step_count, |
|         task_name=self._task_name, |
|         original_endpoints=self._original_endpoints,  # Snapshot of broken spec |
|         current_endpoints=self._current_endpoints,    # Current state after fixes |
|         golden_endpoints=self._golden_endpoints,      # Target spec |
|         violations=self._violations,                  # Current violations |
|         total_violations_at_start=len(self._initial_violations), |
|         max_steps=self._task_cfg["max_steps"], |
|     ) |
| ``` |
|
|
| ### Supported Tasks |
|
|
| **File**: `server/fixtures.py` |
|
|
| Three difficulty levels: |
|
|
| | Task | Difficulty | Endpoints | Violations | Max Steps | Description | |
| |------|-----------|-----------|-----------|-----------|-------------| |
| | **easy** | Beginner | 1 | 1 missing field | 5 | Simple: add one field to response | |
| | **medium** | Intermediate | 3 | 3 (type errors + wrong status) | 10 | Type mismatches and HTTP status fixes | |
| | **hard** | Advanced | 4 | 6 (missing, extra, type, status) | 15 | Complex: multiple violation types | |
|
|
| Each task has: |
| - `broken_endpoints`: Starting state (what the agent sees) |
| - `golden_endpoints`: Ground truth (what violations are measured against) |
| - `description`: Human-readable task objective |
| - `max_steps`: Episode cut-off |
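
As a rough illustration of that structure, a fixture for the easy task could be laid out as below. The names `GOLDEN_REGISTER`, `BROKEN_REGISTER`, and `EASY_TASK_SKETCH` are hypothetical, the description string is placeholder wording, and the endpoint contents are reconstructed from the example observation in this document rather than copied from `server/fixtures.py`.

```python
import copy
from typing import Any, Dict

# Golden endpoint, modelled on the example observation and its violation message.
GOLDEN_REGISTER: Dict[str, Any] = {
    "method": "POST",
    "path": "/users/register",
    "status_code": 201,
    "request_body": {
        "username": {"type": "string", "required": True},
        "email": {"type": "string", "required": True},
        "password": {"type": "string", "required": True},
    },
    "response_body": {
        "user_id": {"type": "integer", "required": True},
        "username": {"type": "string", "required": True},
        "created_at": {"type": "string", "required": True},
    },
}

# Broken variant: the same endpoint with the "created_at" response field dropped.
BROKEN_REGISTER = copy.deepcopy(GOLDEN_REGISTER)
del BROKEN_REGISTER["response_body"]["created_at"]

# Hypothetical task entry with the four keys listed above.
EASY_TASK_SKETCH: Dict[str, Any] = {
    "description": "Placeholder wording: add the missing 'created_at' field to the response.",
    "max_steps": 5,
    "broken_endpoints": [BROKEN_REGISTER],
    "golden_endpoints": [GOLDEN_REGISTER],
}

print(sorted(EASY_TASK_SKETCH))
```

Deriving the broken spec from a deep copy of the golden one keeps the two in sync, so every difference between them is an intended violation.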
|
|
| --- |
|
|
| ## 3. State |
|
|
| ### Observation (`DebugObservation`) |
|
|
| **What the agent sees after each action.** |
|
|
| File: `server/models.py` |
|
|
| ```python |
| class DebugObservation(Observation): |
|     """ |
|     What the agent observes after reset() or step(). |
|     """ |
|     # Task info |
|     task_name: str                        # "easy" | "medium" | "hard" |
|     task_description: str                 # Human description |
| 
|     # Current spec |
|     endpoints: List[Dict[str, Any]]       # Current endpoints (partially fixed) |
|     violations: List[Dict[str, Any]]      # Detected violations still present |
| 
|     # Reward signals |
|     reward: float                         # Dense per-step reward |
|     done: bool                            # Episode termination flag |
|     violations_fixed_this_step: int       # Count of fixed violations |
|     violations_introduced_this_step: int  # Count of new violations |
|     total_violations_at_start: int        # Reference baseline |
| 
|     # Tracking |
|     step_count: int                       # Steps taken so far |
|     max_steps: int                        # Episode limit |
|     last_action_error: Optional[str]      # Validation error message |
| ``` |
|
|
| #### Example Observation |
|
|
| ```json |
| { |
|   "task_name": "easy", |
|   "task_description": "Add missing 'created_at' field to response...", |
|   "endpoints": [ |
|     { |
|       "method": "POST", |
|       "path": "/users/register", |
|       "status_code": 201, |
|       "request_body": { |
|         "username": {"type": "string", "required": true}, |
|         "email": {"type": "string", "required": true}, |
|         "password": {"type": "string", "required": true} |
|       }, |
|       "response_body": { |
|         "user_id": {"type": "integer", "required": true}, |
|         "username": {"type": "string", "required": true} |
|       } |
|     } |
|   ], |
|   "violations": [ |
|     { |
|       "endpoint_index": 0, |
|       "location": "response_body", |
|       "field_name": "created_at", |
|       "violation_type": "missing_field", |
|       "description": "POST /users/register response_body: required field 'created_at' (string) is missing", |
|       "severity": 1.0 |
|     } |
|   ], |
|   "violations_fixed_this_step": 0, |
|   "violations_introduced_this_step": 0, |
|   "total_violations_at_start": 1, |
|   "step_count": 0, |
|   "max_steps": 5, |
|   "reward": 0.0, |
|   "done": false, |
|   "last_action_error": null |
| } |
| ``` |
|
|
| ### Full Internal State (`DebugState`) |
|
|
| **Available via `GET /state` endpoint (for debugging/analysis, not given to agent by default).** |
|
|
| ```python |
| class DebugState(State): |
|     """ |
|     Full internal state (not exposed to agent by default). |
|     """ |
|     task_name: str |
|     original_endpoints: List[Dict[str, Any]]  # Snapshot of broken spec |
|     current_endpoints: List[Dict[str, Any]]   # Mutated by agent's actions |
|     golden_endpoints: List[Dict[str, Any]]    # Ground truth |
|     violations: List[Dict[str, Any]]          # Computed violations |
|     total_violations_at_start: int |
|     max_steps: int |
| ``` |
|
|
| --- |
|
|
| ## 4. Action (`DebugAction`) |
|
|
| **What the agent can propose.** |
|
|
| File: `server/models.py` |
|
|
| ```python |
| class DebugAction(Action): |
|     """ |
|     A single fix proposed by the agent. |
|     The agent targets one endpoint + one field and proposes exactly one change. |
|     """ |
| 
|     kind: ActionKind             # Type of fix |
|     endpoint_index: int          # Which endpoint to fix (0-indexed) |
|     location: str                # "request_body" | "response_body" | "status_code" |
|     field_name: Optional[str]    # Field to modify (None for status_code) |
|     new_value: Optional[Any]     # The corrected value |
| ``` |
|
|
| ### Action Types (`ActionKind`) |
|
|
| | Kind | Target | Effect | new_value | |
| |------|--------|--------|-----------| |
| | `ADD_FIELD` | Field | Insert missing field into body | `{"type": str, "description"?: str}` | |
| | `REMOVE_FIELD` | Field | Delete forbidden field from body | `null` | |
| | `CHANGE_TYPE` | Field | Fix field's JSON Schema type | Type string (e.g., `"integer"`) | |
| | `CHANGE_STATUS` | Endpoint | Fix HTTP status code | Integer (e.g., `201`) | |
| | `NO_OP` | None | Explicit pass/wait | `null` | |
|
|
| #### Example Actions |
|
|
| ```python |
| # Actions written as Python dicts; None serializes to JSON null. |
| 
| # Fix 1: Add missing 'created_at' field |
| { |
|     "kind": "add_field", |
|     "endpoint_index": 0, |
|     "location": "response_body", |
|     "field_name": "created_at", |
|     "new_value": { |
|         "type": "string", |
|         "description": "ISO-8601 timestamp" |
|     } |
| } |
| 
| # Fix 2: Change field type from string to integer |
| { |
|     "kind": "change_type", |
|     "endpoint_index": 1, |
|     "location": "request_body", |
|     "field_name": "user_id", |
|     "new_value": "integer" |
| } |
| 
| # Fix 3: Correct HTTP status code |
| { |
|     "kind": "change_status", |
|     "endpoint_index": 0, |
|     "location": "status_code", |
|     "field_name": None, |
|     "new_value": 201 |
| } |
| 
| # Fix 4: Remove extra field |
| { |
|     "kind": "remove_field", |
|     "endpoint_index": 2, |
|     "location": "response_body", |
|     "field_name": "deprecated_field", |
|     "new_value": None |
| } |
| 
| # Fix 5: Explicit pass |
| { |
|     "kind": "no_op", |
|     "endpoint_index": 0, |
|     "location": "request_body", |
|     "field_name": None, |
|     "new_value": None |
| } |
| ``` |
|
|
| ### Action Validation |
|
|
| The environment validates actions in `_apply_action()`: |
|
|
| - **Endpoint index bounds**: Must satisfy `0 <= index < len(endpoints)` |
| - **Location validity**: Must be `"request_body"`, `"response_body"`, or `"status_code"` |
| - **Field existence**: `REMOVE_FIELD` and `CHANGE_TYPE` require the field to exist |
| - **Type format**: Fields must have a `{"type": "..."}` structure |
| - **Status code format**: Must be an integer |
|
|
| If validation fails, `_apply_action()` returns an error string and the step receives a reward of `-0.05`. |
|
|
| --- |
|
|
| ## 5. Reward & Result |
|
|
| ### Dense Per-Step Reward |
|
|
| **File**: `server/graders.py`, `step_reward()` function |
|
|
| The agent receives feedback after each step: |
|
|
| ```python |
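| from typing import Any, Dict, List |
| 
| def step_reward( |
|     prev_violations: List[Dict[str, Any]], |
|     new_violations: List[Dict[str, Any]], |
|     initial_violations: List[Dict[str, Any]], |
|     action_error: bool, |
| ) -> float: |
|     """ |
|     Dense per-step reward: |
|       +0.2 × severity per violation resolved |
|       -0.15 × severity per new violation introduced |
|       -0.05 for malformed action |
|     The +0.5 bonus for fixing all violations is added by step(), not here. |
|     """ |
|     if action_error: |
|         return -0.05 |
| 
|     # Illustrative diff: a violation is identified by where it occurs and its type. |
|     # (initial_violations is part of the signature but unused in this excerpt.) |
|     def key(v): |
|         return (v["endpoint_index"], v["location"], v["field_name"], v["violation_type"]) |
| 
|     prev_keys = {key(v) for v in prev_violations} |
|     new_keys = {key(v) for v in new_violations} |
| 
|     reward = 0.0 |
|     for v in prev_violations: |
|         if key(v) not in new_keys:   # resolved this step |
|             reward += 0.2 * v["severity"] |
|     for v in new_violations: |
|         if key(v) not in prev_keys:  # introduced this step |
|             reward -= 0.15 * v["severity"] |
| 
|     return reward |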
| ``` |
|
|
| ### Violation Severity Weights |
|
|
| Weighted by problem importance: |
|
|
| | Violation Type | Severity | Reason | |
| |----------------|----------|--------| |
| | `missing_field` | 1.0 | Breaks contract; top priority | |
| | `wrong_type` | 0.9 | Type mismatch; critical | |
| | `wrong_status` | 0.8 | HTTP code error; significant | |
| | `extra_field` | 0.7 | Forbidden field; less critical | |
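
To make the weighting concrete, the arithmetic can be worked through for a few steps. The sketch below only restates the documented constants; the function name `reward_for_step` and its list-of-types arguments are simplifications invented for this example, not the signature used in `server/graders.py`.

```python
from typing import Iterable

SEVERITY = {"missing_field": 1.0, "wrong_type": 0.9, "wrong_status": 0.8, "extra_field": 0.7}

FIX_WEIGHT = 0.2         # reward per unit of severity resolved
INTRODUCE_WEIGHT = 0.15  # penalty per unit of severity introduced
ERROR_PENALTY = -0.05    # flat reward for an invalid action
SUCCESS_BONUS = 0.5      # added once every violation is fixed


def reward_for_step(
    fixed: Iterable[str],
    introduced: Iterable[str],
    action_error: bool = False,
    all_fixed: bool = False,
) -> float:
    """Combine the documented reward terms for one step (hypothetical helper)."""
    if action_error:
        return ERROR_PENALTY
    reward = sum(FIX_WEIGHT * SEVERITY[kind] for kind in fixed)
    reward -= sum(INTRODUCE_WEIGHT * SEVERITY[kind] for kind in introduced)
    if all_fixed:
        reward += SUCCESS_BONUS
    return reward


# Fix a type error but leave a stray field behind: 0.2*0.9 - 0.15*0.7
mixed = reward_for_step(fixed=["wrong_type"], introduced=["extra_field"])
# Solve the easy task in one step: 0.2*1.0 + 0.5
solved = reward_for_step(fixed=["missing_field"], introduced=[], all_fixed=True)
# Malformed action
invalid = reward_for_step(fixed=[], introduced=[], action_error=True)

print(round(mixed, 3), round(solved, 3), invalid)  # 0.075 0.7 -0.05
```

Because the fix weight exceeds the introduction weight, a step that trades one violation for another of equal severity still earns a small positive reward; only the final episode score treats the two symmetrically.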
|
|
| ### Episode Scoring (`grade_episode()`) |
| |
| **Computed at episode end.** Returns final score in `[0.0, 1.0]`. |
| |
| ```python |
| def grade_episode( |
|     current_endpoints: List[Dict[str, Any]], |
|     golden_endpoints: List[Dict[str, Any]], |
|     initial_violations: List[Dict[str, Any]], |
| ) -> float: |
|     """ |
|     Final episode score: |
| 
|     score = (weighted_violations_fixed - weighted_violations_introduced) |
|             / total_initial_weight |
| 
|     Clamped to [0.0, 1.0] |
| 
|     1.0 = all violations fixed, no new ones introduced |
|     0.5 = half of the initial violation weight fixed |
|     0.0 = no improvement or made things worse |
|     """ |
| ``` |
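
The formula in the docstring can be sketched as follows. To stay self-contained, this version works on violation lists instead of endpoint specs; the name `grade_from_violations`, the identity key used to match violations, and the choice to return 1.0 when there were no initial violations are all assumptions of the sketch.

```python
from typing import Any, Dict, List, Tuple

Violation = Dict[str, Any]


def _key(v: Violation) -> Tuple[Any, ...]:
    """Assumed identity of a violation: where it occurs and what kind it is."""
    return (v["endpoint_index"], v["location"], v.get("field_name"), v["violation_type"])


def grade_from_violations(initial: List[Violation], remaining: List[Violation]) -> float:
    """Severity-weighted score in [0.0, 1.0] (hypothetical helper)."""
    total = sum(v["severity"] for v in initial)
    if total == 0:
        return 1.0  # assumption: nothing to fix counts as a perfect score

    initial_keys = {_key(v) for v in initial}
    remaining_keys = {_key(v) for v in remaining}

    # Initial violations that no longer appear were fixed.
    fixed = sum(v["severity"] for v in initial if _key(v) not in remaining_keys)
    # Remaining violations that were not there at the start were introduced.
    introduced = sum(v["severity"] for v in remaining if _key(v) not in initial_keys)

    return max(0.0, min(1.0, (fixed - introduced) / total))


initial = [{
    "endpoint_index": 0, "location": "response_body", "field_name": "created_at",
    "violation_type": "missing_field", "severity": 1.0,
}]

print(grade_from_violations(initial, remaining=[]))       # 1.0
print(grade_from_violations(initial, remaining=initial))  # 0.0
```

The clamp matters when the agent introduces more weighted violations than it fixes: the raw ratio would go negative, but the reported score bottoms out at 0.0.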
| |
| #### Example Scoring Scenario |
|
|
| **Task: easy (1 violation)** |
| - Initial violation: `missing_field "created_at" (severity=1.0)` |
| - Step 1: Agent incorrectly changes the type of `username` to `integer` (introduces 1 `wrong_type` violation, severity 0.9) |
| - Step 2: Agent adds `created_at` correctly |
| - Final state: the initial violation is fixed, but the introduced one remains |
|
|
| ``` |
| score = (1.0 - 0.9) / 1.0 = 0.1 |
| ``` |
|
|
| The introduced violation cancels most of the credit for the fix. Because one violation remains, the episode keeps running until the agent repairs `username` or `max_steps` is reached. |
|
|
| --- |
|
|
| ## 6. Complete RL Loop Example |
|
|
| ### Scenario: Easy Task |
|
|
| **Initial state:** |
| ``` |
| Broken spec: POST /users/register response missing "created_at" |
| Golden spec: response has user_id, username, created_at |
| ``` |
|
|
| ### Episode Transcript |
|
|
| ``` |
| RESET request (task_name="easy") |
|   ↓ |
| Observation #0: |
|   endpoints: [broken registration endpoint] |
|   violations: [missing_field "created_at"] |
|   reward: 0.0 |
|   done: false |
|   step_count: 0 |
| 
| STEP 1: Agent proposes ADD_FIELD action |
|   action.kind = "add_field" |
|   action.endpoint_index = 0 |
|   action.location = "response_body" |
|   action.field_name = "created_at" |
|   action.new_value = {"type": "string", "description": "ISO-8601 timestamp"} |
|   ↓ |
| Environment: |
|   - Validates action ✓ |
|   - Adds field to response_body |
|   - Recomputes violations → [] (0 violations!) |
|   - Computes reward: +0.2 × 1.0 (fixed 1 violation of severity 1.0) = +0.2 |
|                      + 0.5 (bonus for all_fixed=true) = +0.7 total |
|   - Sets done=true (all violations fixed) |
|   ↓ |
| Observation #1: |
|   endpoints: [fixed registration endpoint] |
|   violations: [] |
|   violations_fixed_this_step: 1 |
|   violations_introduced_this_step: 0 |
|   reward: 0.7 |
|   done: true |
|   step_count: 1 |
| 
| SCORE request |
|   ↓ |
| score = (1.0 fixed - 0 introduced) / 1.0 initial = 1.0 ✓ |
| 
| Agent succeeds with perfect score! |
| ``` |
|
|
| --- |
|
|
| ## 7. File Structure Summary |
|
|
| ``` |
| server/ |
| ├── app.py              # FastAPI routes, HTTP interface |
| ├── environment.py      # APIContractDebuggerEnv (core RL logic) |
| ├── models.py           # Pydantic models: DebugAction, DebugObservation, DebugState |
| ├── fixtures.py         # Task definitions (easy, medium, hard) |
| ├── graders.py          # Violation detection + reward/scoring |
| └── __pycache__/ |
| 
| tests/                  # Unit tests for environment, graders, fixtures |
| 
| RL_ARCHITECTURE.md      # This file |
| ``` |
|
|
| --- |
|
|
| ## 8. Key Design Principles |
|
|
| 1. **Stateful Environment**: One episode per task at a time (OpenEnv singleton pattern) |
|
|
| 2. **Dense Rewards**: The agent gets per-step feedback (not just a final score) to guide learning |
|
|
| 3. **Severity-Weighted**: Different violation types carry different weights (missing fields have the highest priority) |
|
|
| 4. **Action Validation**: Invalid actions receive a penalty and return an error message |
|
|
| 5. **Deep-Copied State**: Endpoints are deep-copied to prevent cross-episode contamination |
|
|
| 6. **Observable Violations**: The agent sees the exact list of violations (nothing is hidden) |
|
|
| 7. **Termination Conditions**: |
|    - Success: All violations fixed |
|    - Failure: Max steps reached |
|
|
| 8. **JSON/REST Interface**: The agent communicates via HTTP (language-agnostic) |
|
|
| --- |
|
|
| ## 9. Typical Agent Workflow |
|
|
| ```python |
| import requests |
| 
| BASE_URL = "http://localhost:7860" |
| 
| # 1. Reset to start new episode |
| reset_resp = requests.post(f"{BASE_URL}/reset", json={ |
|     "task_name": "easy", |
|     "seed": 42 |
| }) |
| obs = reset_resp.json() |
| print(f"Violations to fix: {len(obs['violations'])}") |
| 
| # 2. Repeat: observe -> decide -> act |
| for step in range(obs['max_steps']): |
|     if obs['done']: |
|         break |
| 
|     # Agent decision logic (depends on obs['violations']); |
|     # hard-coded here to the single fix the "easy" task needs |
|     action = { |
|         "kind": "add_field", |
|         "endpoint_index": 0, |
|         "location": "response_body", |
|         "field_name": "created_at", |
|         "new_value": {"type": "string"} |
|     } |
| 
|     # 3. Apply action |
|     step_resp = requests.post(f"{BASE_URL}/step", json={"action": action}) |
|     obs = step_resp.json() |
| 
|     print(f"Step {step+1}: reward={obs['reward']}, violations={len(obs['violations'])}") |
| 
| # 4. Check final score |
| score_resp = requests.get(f"{BASE_URL}/score") |
| print(f"Final score: {score_resp.json()['score']}") |
| ``` |
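
The decision step in the workflow above is hard-coded for the easy task. A slightly more general agent could derive the action from the first reported violation, as in the sketch below. The helper `violation_to_action` is hypothetical. It assumes that a missing field's expected type appears in parentheses in the violation `description` (as it does in the example observation), and it falls back to `no_op` for violation types whose target value cannot be read from the observation fields shown in this document.

```python
import re
from typing import Any, Dict

# Assumption: descriptions mention the expected type as "(string)", "(integer)", ...
_TYPE_IN_DESCRIPTION = re.compile(r"\((\w+)\)")


def violation_to_action(violation: Dict[str, Any]) -> Dict[str, Any]:
    """Map one observed violation to a candidate action payload (hypothetical helper)."""
    base = {
        "endpoint_index": violation["endpoint_index"],
        "location": violation["location"],
        "field_name": violation.get("field_name"),
        "new_value": None,
    }
    kind = violation["violation_type"]

    if kind == "extra_field":
        return {**base, "kind": "remove_field"}

    if kind == "missing_field":
        match = _TYPE_IN_DESCRIPTION.search(violation.get("description", ""))
        # Falling back to "string" is a guess, not documented behavior.
        field_type = match.group(1) if match else "string"
        return {**base, "kind": "add_field", "new_value": {"type": field_type}}

    # wrong_type / wrong_status need the expected value, which this sketch cannot infer.
    return {**base, "kind": "no_op"}


example_violation = {
    "endpoint_index": 0,
    "location": "response_body",
    "field_name": "created_at",
    "violation_type": "missing_field",
    "description": "POST /users/register response_body: required field 'created_at' (string) is missing",
    "severity": 1.0,
}

action = violation_to_action(example_violation)
print(action["kind"], action["field_name"], action["new_value"])
```

An LLM-based agent would typically replace this rule table with a prompt built from the `violations` list, but the payload it must produce has the same shape.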
|
|
| --- |
|
|
| ## 10. Future Extensions |
|
|
| Potential enhancements to the RL framework: |
|
|
| 1. **Multi-Agent**: Support concurrent episodes via session IDs |
| 2. **Curriculum Learning**: Dynamically adapt difficulty based on agent performance |
| 3. **Partial Observability**: Hide some violations initially to increase the challenge |
| 4. **Action Constraints**: Limit the action space per step (e.g., "fix at most 1 field") |
| 5. **Custom Reward Shaping**: Configurable severity weights + bonus structures |
| 6. **State Representation**: Multiple formats (JSON, graph, embedding-friendly) |
|
|
| --- |
|
|
| ## Summary Table |
|
|
| | Concept | Implementation | File | Purpose | |
| |---------|---|---|---| |
| | **Agent** | External AI/LLM | HTTP client | Proposes fixes | |
| | **Environment** | `APIContractDebuggerEnv` | `environment.py` | Simulates faults + validates fixes | |
| | **State** | `DebugObservation` + `DebugState` | `models.py` | Agent observes + internal tracking | |
| | **Action** | `DebugAction` | `models.py` | Fix proposals | |
| | **Reward** | `step_reward()` | `graders.py` | Dense per-step feedback | |
| | **Result** | Episode score `[0.0, 1.0]` | `graders.py` | Final performance metric | |
| | **Tasks** | Fixtures (easy/medium/hard) | `fixtures.py` | Problem instances | |
| | **HTTP API** | FastAPI routes | `app.py` | Communication interface | |
|
|
|
|