# Reinforcement Learning Architecture: API Contract Debugger
## Overview
The API Contract Debugger is a **reinforcement learning environment** built on the OpenEnv framework. It challenges AI agents to fix broken OpenAPI-style contract specifications by proposing targeted field-level corrections.
This document explains how the codebase implements the core RL concepts:
- **Agent**: The external AI system interacting with the environment
- **Environment**: The `APIContractDebuggerEnv` class that simulates the debugging task
- **State**: What the agent observes, plus the internal environment state
- **Action**: The fixes the agent can propose
- **Reward/Result**: The feedback signal and scoring mechanism
---
## 1. Agent (External AI System)
### What is the Agent?
The **agent** is an **external AI system** (e.g., an LLM, RL policy, or human) that:
- Receives observations from the environment
- Proposes actions (fixes to the API spec)
- Receives reward feedback and the next state
- Aims to maximize cumulative reward by fixing all violations
### Agent Interaction Pattern
```
Agent                                     Environment
  |                                            |
  |---- POST /reset (task_name) -------------->|
  |                                            |
  |<----------- Initial observation -----------|
  |     (endpoints, violations, reward=0)      |
  |                                            |
  |---- POST /step (action) ------------------>|
  |                                            |
  |<----------- Updated observation -----------|
  |      (new endpoints, new violations,       |
  |       reward, done, fixed/introduced)      |
  |                                            |
  |          [repeat until done=True]          |
  |                                            |
  |---- GET /score, GET /state --------------->|
  |                                            |
```
### Agent Interface in the Codebase
- **File**: `server/app.py`
- **Routes**:
  - `POST /reset`: Initialize a new episode
  - `POST /step`: Apply one action
  - `GET /state`: Query the full environment state (for debugging)
  - `GET /score`: Get the final episode score
  - `GET /tasks`: List available tasks
The agent communicates via HTTP REST API. All observations are JSON and fully serializable.
---
## 2. Environment (`APIContractDebuggerEnv`)
### Class Definition
**File**: `server/environment.py`
```python
class APIContractDebuggerEnv(Environment[DebugAction, DebugObservation, DebugState]):
    """
    Environment where an agent debugs broken API contract specifications.
    Inherits from OpenEnv's Environment base class.
    Implements reset(), step(), and state property.
    """
```
### Environment Responsibilities
1. **Initialize tasks**: Load broken + golden endpoint specs from fixtures
2. **Detect violations**: Compare the current spec against the golden spec
3. **Apply actions**: Mutate the current spec based on the agent's fix proposal
4. **Compute rewards**: Dense per-step reward based on violations fixed/introduced
5. **Track state**: Maintain episode counter, step count, violations
6. **Terminate episodes**: Check for success (all fixed) or max steps reached
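The violation-detection responsibility (item 2) amounts to a field-by-field diff against the golden spec. The sketch below is a hypothetical stand-in for `detect_violations`, not the code in `server/graders.py`. It assumes endpoints are dicts shaped like the example observation in section 3, that the current and golden lists are index-aligned, and that every golden field is expected (the `required` flag is ignored). It reuses the severity weights from section 5 and omits the human-readable `description`.

```python
from typing import Any, Dict, List, Optional

# Severity weights from the "Violation Severity Weights" table
SEVERITY = {"missing_field": 1.0, "wrong_type": 0.9, "wrong_status": 0.8, "extra_field": 0.7}


def detect_violations_sketch(
    current: List[Dict[str, Any]], golden: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Diff index-aligned current endpoints against golden ones (hypothetical)."""
    violations: List[Dict[str, Any]] = []

    def record(i: int, location: str, field: Optional[str], vtype: str) -> None:
        violations.append({
            "endpoint_index": i,
            "location": location,
            "field_name": field,
            "violation_type": vtype,
            "severity": SEVERITY[vtype],
        })

    for i, (cur, gold) in enumerate(zip(current, golden)):
        if cur.get("status_code") != gold.get("status_code"):
            record(i, "status_code", None, "wrong_status")
        for location in ("request_body", "response_body"):
            cur_fields = cur.get(location, {})
            gold_fields = gold.get(location, {})
            # Fields the golden spec expects: missing or mistyped?
            for name, spec in gold_fields.items():
                if name not in cur_fields:
                    record(i, location, name, "missing_field")
                elif cur_fields[name].get("type") != spec.get("type"):
                    record(i, location, name, "wrong_type")
            # Fields the golden spec does not allow
            for name in cur_fields:
                if name not in gold_fields:
                    record(i, location, name, "extra_field")
    return violations
```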
### Key Methods
#### `reset(seed, episode_id, task_name, **kwargs) → DebugObservation`
Initializes a fresh episode:
- Loads task config from fixtures
- Deep-copies broken endpoints to avoid cross-episode state leakage
- Detects initial violations
- Returns initial observation with reward=0
```python
def reset(self, seed=None, episode_id=None, task_name=None, **kwargs):
    """
    Reset the environment and return the initial observation.
    """
    # Load task config (lookup elided) and deep-copy endpoints so the
    # agent's edits never touch the shared fixture data
    self._current_endpoints = copy.deepcopy(self._task_cfg["broken_endpoints"])
    self._golden_endpoints = copy.deepcopy(self._task_cfg["golden_endpoints"])
    # Detect violations (agent's starting problem)
    self._violations = detect_violations(self._current_endpoints, self._golden_endpoints)
    # Remaining observation fields elided
    return self._make_observation(reward=0.0, done=False)
```
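Why the deep copy matters: `copy.copy` on the endpoint list creates a new list that still shares the nested field dicts with the fixture, so an agent's `CHANGE_TYPE` in one episode would leak into the next `reset()`. A standalone demonstration, where `make_fixture` is a hypothetical stand-in for a task entry in `server/fixtures.py`:

```python
import copy
from typing import Any, Callable, Dict, List

Endpoints = List[Dict[str, Any]]


def make_fixture() -> Dict[str, Endpoints]:
    """Hypothetical stand-in for one task entry in server/fixtures.py."""
    return {"broken_endpoints": [
        {"status_code": 201, "response_body": {"username": {"type": "string"}}}
    ]}


def fixture_type_after_episode(copier: Callable[[Endpoints], Endpoints]) -> str:
    """Copy endpoints as reset() would, mutate them as a CHANGE_TYPE step would,
    then report what the *fixture* says the field type is."""
    fixture = make_fixture()
    endpoints = copier(fixture["broken_endpoints"])
    endpoints[0]["response_body"]["username"]["type"] = "integer"  # agent's edit
    return fixture["broken_endpoints"][0]["response_body"]["username"]["type"]


print(fixture_type_after_episode(copy.copy))      # integer (fixture corrupted)
print(fixture_type_after_episode(copy.deepcopy))  # string (fixture intact)
```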
#### `step(action, timeout_s, **kwargs) → DebugObservation`
Processes one agent action and returns the updated state:
```python
def step(self, action: DebugAction, timeout_s=None, **kwargs) -> DebugObservation:
    """
    Apply one fix action and return the updated observation + reward.
    """
    prev_violations = self._violations
    self._step_count += 1
    # 1. Apply the action (mutates current_endpoints); returns error string or None
    action_error = self._apply_action(action)
    # 2. Recompute violations
    self._violations = detect_violations(self._current_endpoints, self._golden_endpoints)
    # 3. Compute dense reward (step_reward expects a bool for action_error)
    reward = step_reward(
        prev_violations, self._violations, self._initial_violations, action_error is not None
    )
    # 4. Check termination
    all_fixed = len(self._violations) == 0
    out_of_steps = self._step_count >= self._task_cfg["max_steps"]
    self._done = all_fixed or out_of_steps
    # 5. Bonus reward if solved
    if all_fixed:
        reward += 0.5
    # Remaining observation fields (fixed/introduced counts, etc.) elided
    return self._make_observation(reward=reward, done=self._done)
```
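Steps 4 and 5 reduce to a small pure function, which makes the termination rule easy to unit-test. A sketch under the rules shown in the excerpt: the episode ends when no violations remain or when the step count reaches `max_steps`, and the `+0.5` bonus applies only on success. `finalize_step` is a hypothetical helper, not a method of `APIContractDebuggerEnv`.

```python
from typing import Tuple

SUCCESS_BONUS = 0.5  # bonus added in step() when every violation is fixed


def finalize_step(base_reward: float, remaining_violations: int,
                  step_count: int, max_steps: int) -> Tuple[float, bool]:
    """Return (reward, done) once the dense per-step reward is known."""
    all_fixed = remaining_violations == 0
    out_of_steps = step_count >= max_steps
    reward = base_reward + (SUCCESS_BONUS if all_fixed else 0.0)
    return reward, all_fixed or out_of_steps
```

For the easy-task transcript in section 6, `finalize_step(0.2, 0, 1, 5)` gives a reward of about 0.7 and `done=True`.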
#### `_apply_action(action) → Optional[str]`
Attempts to mutate `self._current_endpoints` according to the action:
- **Validates** the endpoint index, field name, and location
- **Executes** the fix:
  - `ADD_FIELD`: Insert a new field into the request/response body
  - `REMOVE_FIELD`: Delete a field from the body
  - `CHANGE_TYPE`: Update a field's type
  - `CHANGE_STATUS`: Update the endpoint's HTTP status code
  - `NO_OP`: Explicit pass; earns zero reward but still uses up a step
- **Returns** an error string if the action is invalid, `None` on success
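A minimal sketch of the mutation half of `_apply_action()`, assuming the endpoint dict shape from the example observation in section 3 and the `new_value` formats from the action table in section 4. `apply_fix` is a hypothetical standalone function, not the environment's method; it only checks what it needs to avoid a `KeyError`, leaving the remaining rules to the checks described under Action Validation.

```python
from typing import Any, Dict, List, Optional

KNOWN_KINDS = {"add_field", "remove_field", "change_type", "change_status", "no_op"}


def apply_fix(endpoints: List[Dict[str, Any]], action: Dict[str, Any]) -> Optional[str]:
    """Mutate `endpoints` in place. Return an error string, or None on success."""
    kind = action["kind"]
    if kind not in KNOWN_KINDS:
        return f"unknown action kind {kind!r}"
    if kind == "no_op":
        return None  # nothing changes; the step is still consumed
    endpoint = endpoints[action["endpoint_index"]]
    if kind == "change_status":
        endpoint["status_code"] = action["new_value"]
        return None
    location, name = action["location"], action["field_name"]
    if kind == "add_field":
        # Copy the spec so the stored field does not alias the action payload
        endpoint.setdefault(location, {})[name] = dict(action["new_value"])
        return None
    body = endpoint.get(location, {})
    if name not in body:
        return f"field {name!r} not found in {location}"
    if kind == "remove_field":
        del body[name]
    else:  # change_type
        body[name]["type"] = action["new_value"]
    return None
```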
#### `state` Property
Returns the complete internal state (not exposed to agent by default, but available via `/state`):
```python
@property
def state(self) -> DebugState:
    """Return full internal environment state."""
    return DebugState(
        episode_id=self._episode_id,
        step_count=self._step_count,
        task_name=self._task_name,
        original_endpoints=self._original_endpoints,  # Snapshot of broken spec
        current_endpoints=self._current_endpoints,    # Current state after fixes
        golden_endpoints=self._golden_endpoints,      # Target spec
        violations=self._violations,                  # Current violations
        total_violations_at_start=len(self._initial_violations),
        max_steps=self._task_cfg["max_steps"],
    )
```
### Supported Tasks
**File**: `server/fixtures.py`
Three difficulty levels:
| Task | Difficulty | Endpoints | Violations | Max Steps | Description |
|------|-----------|-----------|-----------|-----------|-------------|
| **easy** | Beginner | 1 | 1 missing field | 5 | Simple: add one field to response |
| **medium** | Intermediate | 3 | 3 (type errors + wrong status) | 10 | Type mismatches and HTTP status fixes |
| **hard** | Advanced | 4 | 6 (missing, extra, type, status) | 15 | Complex: multiple violation types |
Each task has:
- `broken_endpoints`: Starting state (what the agent sees)
- `golden_endpoints`: Ground truth (what violations are measured against)
- `description`: Human-readable task objective
- `max_steps`: Episode cut-off
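For orientation, a task entry might look like the following Python dict, reconstructed from the easy-task example observation in section 3. The name `EASY_TASK`, the placeholder `description`, and the assumption that the golden request body matches the broken one are illustrative, not copied from `server/fixtures.py`. The small `missing_task_keys` helper (also hypothetical) checks that an entry carries the four keys listed above.

```python
import copy
from typing import Any, Dict, List

REQUIRED_TASK_KEYS = ("broken_endpoints", "golden_endpoints", "description", "max_steps")

# Broken endpoint, as shown in the example observation
_broken_register: Dict[str, Any] = {
    "method": "POST",
    "path": "/users/register",
    "status_code": 201,
    "request_body": {
        "username": {"type": "string", "required": True},
        "email": {"type": "string", "required": True},
        "password": {"type": "string", "required": True},
    },
    "response_body": {
        "user_id": {"type": "integer", "required": True},
        "username": {"type": "string", "required": True},
    },
}

# Golden endpoint = broken endpoint plus the missing response field
_golden_register = copy.deepcopy(_broken_register)
_golden_register["response_body"]["created_at"] = {"type": "string", "required": True}

# Hypothetical task entry; the description is a placeholder
EASY_TASK: Dict[str, Any] = {
    "broken_endpoints": [_broken_register],
    "golden_endpoints": [_golden_register],
    "description": "<human-readable objective>",
    "max_steps": 5,
}


def missing_task_keys(task: Dict[str, Any]) -> List[str]:
    """Return the required keys absent from a task entry (empty if well-formed)."""
    return [key for key in REQUIRED_TASK_KEYS if key not in task]
```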
---
## 3. State
### Observation (`DebugObservation`)
**What the agent sees after each action.**
File: `server/models.py`
```python
from typing import Any, Dict, List, Optional


class DebugObservation(Observation):
    """
    What the agent observes after reset() or step().
    """

    # Task info
    task_name: str                        # "easy" | "medium" | "hard"
    task_description: str                 # Human description
    # Current spec
    endpoints: List[Dict[str, Any]]       # Current endpoints (partially fixed)
    violations: List[Dict[str, Any]]      # Detected violations still present
    # Reward signals
    reward: float                         # Dense per-step reward
    done: bool                            # Episode termination flag
    violations_fixed_this_step: int       # Count of fixed violations
    violations_introduced_this_step: int  # Count of new violations
    total_violations_at_start: int        # Reference baseline
    # Tracking
    step_count: int                       # Steps taken so far
    max_steps: int                        # Episode limit
    last_action_error: Optional[str]      # Validation error message
```
#### Example Observation
```json
{
  "task_name": "easy",
  "task_description": "Add missing 'created_at' field to response...",
  "endpoints": [
    {
      "method": "POST",
      "path": "/users/register",
      "status_code": 201,
      "request_body": {
        "username": {"type": "string", "required": true},
        "email": {"type": "string", "required": true},
        "password": {"type": "string", "required": true}
      },
      "response_body": {
        "user_id": {"type": "integer", "required": true},
        "username": {"type": "string", "required": true}
      }
    }
  ],
  "violations": [
    {
      "endpoint_index": 0,
      "location": "response_body",
      "field_name": "created_at",
      "violation_type": "missing_field",
      "description": "POST /users/register response_body: required field 'created_at' (string) is missing",
      "severity": 1.0
    }
  ],
  "violations_fixed_this_step": 0,
  "violations_introduced_this_step": 0,
  "total_violations_at_start": 1,
  "step_count": 0,
  "max_steps": 5,
  "reward": 0.0,
  "done": false,
  "last_action_error": null
}
```
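A natural first move for an agent is to rank the reported violations. Since each fix is worth `0.2 × severity` (see section 5), tackling high-severity violations first front-loads reward when the step budget is tight. A minimal sketch over the JSON shown above; `prioritize` is a hypothetical helper, and breaking ties by endpoint index is an arbitrary choice to keep the order deterministic.

```python
from typing import Any, Dict, List


def prioritize(observation: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the observation's violations, most severe first."""
    return sorted(
        observation["violations"],
        key=lambda v: (-v["severity"], v["endpoint_index"]),
    )
```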
### Full Internal State (`DebugState`)
**Available via `GET /state` endpoint (for debugging/analysis, not given to agent by default).**
```python
from typing import Any, Dict, List


class DebugState(State):
    """
    Full internal state (not exposed to agent by default).
    episode_id and step_count come from OpenEnv's State base class.
    """

    task_name: str
    original_endpoints: List[Dict[str, Any]]  # Snapshot of broken spec
    current_endpoints: List[Dict[str, Any]]   # Mutated by agent's actions
    golden_endpoints: List[Dict[str, Any]]    # Ground truth
    violations: List[Dict[str, Any]]          # Computed violations
    total_violations_at_start: int
    max_steps: int
```
---
## 4. Action (`DebugAction`)
**What the agent can propose.**
File: `server/models.py`
```python
from typing import Any, Optional


class DebugAction(Action):
    """
    A single fix proposed by the agent.
    The agent targets one endpoint + one field and proposes exactly one change.
    """

    kind: ActionKind             # Type of fix
    endpoint_index: int          # Which endpoint to fix (0-indexed)
    location: str                # "request_body" | "response_body" | "status_code"
    field_name: Optional[str]    # Field to modify (None for status_code)
    new_value: Optional[Any]     # The corrected value
```
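An agent that does not import the server's models can mirror this schema with the standard library. The classes below are hypothetical client-side stand-ins, not the definitions in `server/models.py`; the lowercase enum values are taken from the `kind` strings in the example actions below, and `to_json` produces the dict placed under `"action"` in a `POST /step` request body (as in section 9).

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Dict, Optional


class ActionKind(str, Enum):
    """Client-side mirror of the server's action kinds (values assumed lowercase)."""
    ADD_FIELD = "add_field"
    REMOVE_FIELD = "remove_field"
    CHANGE_TYPE = "change_type"
    CHANGE_STATUS = "change_status"
    NO_OP = "no_op"


@dataclass
class ClientAction:
    """Hypothetical stand-in for DebugAction on the agent side."""
    kind: ActionKind
    endpoint_index: int
    location: str  # "request_body" | "response_body" | "status_code"
    field_name: Optional[str] = None
    new_value: Optional[Any] = None

    def to_json(self) -> Dict[str, Any]:
        """Plain dict to send as the "action" object of POST /step."""
        payload = asdict(self)
        payload["kind"] = self.kind.value  # enum member -> plain string
        return payload
```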
### Action Types (`ActionKind`)
| Kind | Target | Effect | new_value |
|------|--------|--------|-----------|
| `ADD_FIELD` | Field | Insert missing field into body | `{"type": str, "description"?: str}` |
| `REMOVE_FIELD` | Field | Delete forbidden field from body | `null` |
| `CHANGE_TYPE` | Field | Fix field's JSON Schema type | Type string (e.g., `"integer"`) |
| `CHANGE_STATUS` | Endpoint | Fix HTTP status code | Integer (e.g., `201`) |
| `NO_OP` | None | Explicit pass/wait | `null` |
#### Example Actions
```python
# Python dict literals; serialized to JSON, None becomes null

# Fix 1: Add missing 'created_at' field
add_created_at = {
    "kind": "add_field",
    "endpoint_index": 0,
    "location": "response_body",
    "field_name": "created_at",
    "new_value": {
        "type": "string",
        "description": "ISO-8601 timestamp",
    },
}

# Fix 2: Change field type from string to integer
fix_user_id_type = {
    "kind": "change_type",
    "endpoint_index": 1,
    "location": "request_body",
    "field_name": "user_id",
    "new_value": "integer",
}

# Fix 3: Correct HTTP status code
fix_status = {
    "kind": "change_status",
    "endpoint_index": 0,
    "location": "status_code",
    "field_name": None,
    "new_value": 201,
}

# Fix 4: Remove extra field
remove_deprecated = {
    "kind": "remove_field",
    "endpoint_index": 2,
    "location": "response_body",
    "field_name": "deprecated_field",
    "new_value": None,
}

# Fix 5: Explicit pass
explicit_pass = {
    "kind": "no_op",
    "endpoint_index": 0,
    "location": "request_body",
    "field_name": None,
    "new_value": None,
}
```
### Action Validation
The environment validates actions in `_apply_action()`:
- **Endpoint index bounds**: Must satisfy `0 <= index < len(endpoints)`
- **Location validity**: Must be `"request_body"`, `"response_body"`, or `"status_code"`
- **Field existence**: `REMOVE_FIELD` and `CHANGE_TYPE` require the field to exist
- **Type format**: An added field must have a `{"type": "..."}` structure
- **Status code format**: A `CHANGE_STATUS` value must be an integer

If validation fails, `_apply_action()` returns an error string and the step receives a `-0.05` reward penalty.
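The rules above fit in one function that returns the first problem it finds, mirroring `_apply_action()`'s error-string convention. This is a sketch, not the environment's code: `validate_action` and its messages are hypothetical, and pairing `CHANGE_STATUS` with the `"status_code"` location (and the field-level kinds with the two body locations) is an assumption drawn from the action table.

```python
from typing import Any, Dict, List, Optional

BODY_LOCATIONS = ("request_body", "response_body")
FIELD_KINDS = ("add_field", "remove_field", "change_type")


def validate_action(endpoints: List[Dict[str, Any]], action: Dict[str, Any]) -> Optional[str]:
    """Return an error string for a malformed action, or None if it looks valid."""
    kind = action.get("kind")
    if kind == "no_op":
        return None
    idx = action.get("endpoint_index")
    if not isinstance(idx, int) or not 0 <= idx < len(endpoints):
        return f"endpoint_index {idx!r} is out of range"
    location = action.get("location")
    value = action.get("new_value")
    if kind == "change_status":
        # Assumption: status fixes must target the "status_code" location
        if location != "status_code":
            return "change_status must target location 'status_code'"
        if not isinstance(value, int) or isinstance(value, bool):
            return "status code must be an integer"
        return None
    if kind not in FIELD_KINDS:
        return f"unknown action kind {kind!r}"
    if location not in BODY_LOCATIONS:
        return f"{kind} needs a body location, got {location!r}"
    name = action.get("field_name")
    if not isinstance(name, str):
        return f"{kind} needs a field_name"
    if kind == "add_field":
        if not isinstance(value, dict) or "type" not in value:
            return 'add_field new_value must look like {"type": "..."}'
        return None
    if name not in endpoints[idx].get(location, {}):
        return f"field {name!r} does not exist in {location}"
    if kind == "change_type" and not isinstance(value, str):
        return "change_type new_value must be a type string"
    return None
```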
---
## 5. Reward & Result
### Dense Per-Step Reward
**File**: `server/graders.py` → `step_reward()` function
The agent receives feedback after each step:
```python
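from typing import Any, Dict, List, Tuple


def _violation_key(v: Dict[str, Any]) -> Tuple[Any, ...]:
    # Assumed identity of a violation when comparing two snapshots
    return (v["endpoint_index"], v["location"], v["field_name"], v["violation_type"])


def step_reward(
    prev_violations: List[Dict[str, Any]],
    new_violations: List[Dict[str, Any]],
    initial_violations: List[Dict[str, Any]],
    action_error: bool,
) -> float:
    """
    Dense per-step reward:
      +0.2 × severity per violation resolved
      -0.15 × severity per new violation introduced
      -0.05 for a malformed action
    The +0.5 success bonus is added separately by the environment's step().
    """
    if action_error:
        return -0.05
    prev_keys = {_violation_key(v) for v in prev_violations}
    new_keys = {_violation_key(v) for v in new_violations}
    violations_fixed_this_step = [v for v in prev_violations if _violation_key(v) not in new_keys]
    violations_introduced_this_step = [v for v in new_violations if _violation_key(v) not in prev_keys]
    reward = 0.0
    for v in violations_fixed_this_step:
        reward += 0.2 * v["severity"]
    for v in violations_introduced_this_step:
        reward -= 0.15 * v["severity"]
    return reward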
```
### Violation Severity Weights
Weighted by problem importance:
| Violation Type | Severity | Reason |
|----------------|----------|--------|
| `missing_field` | 1.0 | Breaks contract; top priority |
| `wrong_type` | 0.9 | Type mismatch; critical |
| `wrong_status` | 0.8 | HTTP code error; significant |
| `extra_field` | 0.7 | Forbidden field; less critical |
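Putting the per-step rule and the weights together: with `F_t` and `I_t` the violations fixed and introduced at step `t`, and `s(v)` the severity of violation `v`, a valid action earns the reward below (a malformed action earns -0.05 instead; the last term is the success bonus that `step()` adds):

```math
r_t = 0.2 \sum_{v \in F_t} s(v) \;-\; 0.15 \sum_{v \in I_t} s(v) \;+\; 0.5 \cdot \mathbb{1}[\text{no violations remain}]
```

For example, a step that fixes a `wrong_type` violation but introduces an `extra_field` violation, with other violations still open, earns 0.2 × 0.9 - 0.15 × 0.7 = 0.18 - 0.105 = 0.075.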
### Episode Scoring (`grade_episode()`)
**Computed at episode end.** Returns final score in `[0.0, 1.0]`.
```python
def grade_episode(
    current_endpoints: List[Dict[str, Any]],
    golden_endpoints: List[Dict[str, Any]],
    initial_violations: List[Dict[str, Any]],
) -> float:
    """
    Final episode score:
        score = (weighted_violations_fixed - weighted_violations_introduced)
                / total_initial_weight
    clamped to [0.0, 1.0].
        1.0 = all violations fixed, no new ones introduced
        0.5 = net half of the initial severity weight fixed
        0.0 = no net improvement, or things got worse
    """
```
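The docstring pins down the arithmetic, so a sketch can compute the score directly from violation lists. Two assumptions, both hypothetical: violations are matched across snapshots by `(endpoint_index, location, field_name, violation_type)`, and the current violations are passed in rather than recomputed from `current_endpoints` and `golden_endpoints`. `grade_from_violations` is an illustrative name, not a function in `server/graders.py`.

```python
from typing import Any, Dict, List, Tuple

Violation = Dict[str, Any]


def _key(v: Violation) -> Tuple[Any, ...]:
    # Assumed identity of a violation across snapshots
    return (v["endpoint_index"], v["location"], v["field_name"], v["violation_type"])


def grade_from_violations(initial: List[Violation], current: List[Violation]) -> float:
    """score = (weight fixed - weight introduced) / initial weight, clamped to [0, 1]."""
    total = sum(v["severity"] for v in initial)
    if total == 0:
        return 1.0  # assumption: nothing to fix counts as solved
    initial_keys = {_key(v) for v in initial}
    current_keys = {_key(v) for v in current}
    fixed = sum(v["severity"] for v in initial if _key(v) not in current_keys)
    introduced = sum(v["severity"] for v in current if _key(v) not in initial_keys)
    return min(1.0, max(0.0, (fixed - introduced) / total))
```

For instance, fixing a `missing_field` (severity 1.0) while introducing a `wrong_type` (severity 0.9) scores (1.0 - 0.9) / 1.0, roughly 0.1.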
#### Example Scoring Scenario
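**Task: easy (1 violation)**
- Initial violation: `missing_field "created_at"` (severity 1.0)
- Step 1: The agent incorrectly changes the type of `username` to `integer`, introducing a `wrong_type` violation (severity 0.9)
- Step 2: The agent adds `created_at` correctly, fixing the original violation
- Final state (no further progress before the 5-step budget runs out): the original violation is fixed, but the introduced `wrong_type` violation remains
```
score = (1.0 - 0.9) / 1.0 = 0.1
```
The introduced type error cancels most of the credit for the fix. Had the introduced weight exceeded the fixed weight, the negative result would be clamped to 0.0.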
---
## 6. Complete RL Loop Example
### Scenario: Easy Task
**Initial state:**
```
Broken spec: POST /users/register response missing "created_at"
Golden spec: response has user_id, username, created_at
```
### Episode Transcript
```
RESET request (task_name="easy")
  ↓
Observation #0:
  endpoints: [broken registration endpoint]
  violations: [missing_field "created_at"]
  reward: 0.0
  done: false
  step_count: 0

STEP 1: Agent proposes ADD_FIELD action
  action.kind = "add_field"
  action.endpoint_index = 0
  action.location = "response_body"
  action.field_name = "created_at"
  action.new_value = {"type": "string", "description": "ISO-8601 timestamp"}
  ↓
Environment:
  - Validates action ✓
  - Adds field to response_body
  - Recomputes violations → [] (0 violations!)
  - Computes reward: +0.2 × 1.0 (fixed 1 violation of severity 1.0) = +0.2
                     +0.5 (bonus for all_fixed=true)                = +0.7 total
  - Sets done=true (all violations fixed)
  ↓
Observation #1:
  endpoints: [fixed registration endpoint]
  violations: []
  violations_fixed_this_step: 1
  violations_introduced_this_step: 0
  reward: 0.7
  done: true
  step_count: 1

SCORE request
  ↓
score = (1.0 fixed - 0 introduced) / 1.0 initial = 1.0 ✓
Agent succeeds with perfect score!
```
---
## 7. File Structure Summary
```
server/
├── app.py           # FastAPI routes, HTTP interface
├── environment.py   # APIContractDebuggerEnv (core RL logic)
├── models.py        # Pydantic models: DebugAction, DebugObservation, DebugState
├── fixtures.py      # Task definitions (easy, medium, hard)
├── graders.py       # Violation detection + reward/scoring
└── __pycache__/
tests/ # Unit tests for environment, graders, fixtures
RL_ARCHITECTURE.md # This file
```
---
## 8. Key Design Principles
1. **Stateful Environment**: One episode per task at a time (OpenEnv singleton pattern)
2. **Dense Rewards**: Agent gets per-step feedback (not just a final score) to guide learning
3. **Severity-Weighted**: Different violation types carry different weights (missing fields = highest priority)
4. **Action Validation**: Invalid actions receive a penalty and return error messages
5. **Deep-Copied State**: Endpoints are deep-copied to prevent cross-episode contamination
6. **Observable Violations**: Agent sees the exact list of violations (not hidden)
7. **Termination Conditions**:
   - Success: All violations fixed
   - Failure: Max steps reached with violations remaining
8. **JSON/REST Interface**: Agent communicates via HTTP (language-agnostic)
---
## 9. Typical Agent Workflow
```python
import requests

BASE_URL = "http://localhost:7860"

# 1. Reset to start a new episode
reset_resp = requests.post(
    f"{BASE_URL}/reset", json={"task_name": "easy", "seed": 42}, timeout=30
)
reset_resp.raise_for_status()
obs = reset_resp.json()
print(f"Violations to fix: {len(obs['violations'])}")

# 2. Repeat: observe → decide → act
for step in range(obs["max_steps"]):
    if obs["done"]:
        break
    # Agent decision logic (depends on obs['violations']);
    # this placeholder hard-codes the single fix for the easy task
    action = {
        "kind": "add_field",
        "endpoint_index": 0,
        "location": "response_body",
        "field_name": "created_at",
        "new_value": {"type": "string"},
    }
    # 3. Apply action
    step_resp = requests.post(f"{BASE_URL}/step", json={"action": action}, timeout=30)
    step_resp.raise_for_status()
    obs = step_resp.json()
    print(f"Step {step + 1}: reward={obs['reward']}, violations={len(obs['violations'])}")

# 4. Check final score
score_resp = requests.get(f"{BASE_URL}/score", timeout=30)
score_resp.raise_for_status()
print(f"Final score: {score_resp.json()['score']}")
```
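The placeholder above always sends the same action. A slightly more useful baseline maps the most severe open violation to the action kind that repairs it. `greedy_action` is a hypothetical heuristic, not part of this repository, and it has to guess what the observation does not state: it reads a missing field's type from the `(type) is missing` wording seen in the example violation description (defaulting to `string`), removes extra fields outright, and relies on caller-supplied `type_hint` / `status_hint` values for `wrong_type` and `wrong_status`, whose correct targets that example does not report.

```python
import re
from typing import Any, Dict, Optional

# Matches the "(string) is missing" wording in the example violation description
_MISSING_TYPE = re.compile(r"\((\w+)\) is missing")


def greedy_action(
    obs: Dict[str, Any],
    type_hint: Optional[str] = None,
    status_hint: Optional[int] = None,
) -> Dict[str, Any]:
    """Turn the most severe open violation into one fix action (hypothetical heuristic)."""
    if not obs["violations"]:
        return {"kind": "no_op", "endpoint_index": 0, "location": "request_body",
                "field_name": None, "new_value": None}
    v = max(obs["violations"], key=lambda item: item["severity"])
    action: Dict[str, Any] = {
        "endpoint_index": v["endpoint_index"],
        "location": v["location"],
        "field_name": v["field_name"],
        "new_value": None,
    }
    vtype = v["violation_type"]
    if vtype == "missing_field":
        match = _MISSING_TYPE.search(v.get("description", ""))
        action.update(kind="add_field",
                      new_value={"type": match.group(1) if match else "string"})
    elif vtype == "extra_field":
        action.update(kind="remove_field")
    elif vtype == "wrong_type":
        action.update(kind="change_type", new_value=type_hint)
    elif vtype == "wrong_status":
        action.update(kind="change_status", location="status_code",
                      field_name=None, new_value=status_hint)
    else:
        raise ValueError(f"unknown violation type {vtype!r}")
    return action
```

Inside the loop, `action = greedy_action(obs)` would replace the hard-coded dict.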
---
## 10. Future Extensions
Potential enhancements to the RL framework:
1. **Multi-Agent**: Support concurrent episodes via session IDs
2. **Curriculum Learning**: Dynamically adapt difficulty based on agent performance
3. **Partial Observability**: Hide some violations initially to increase challenge
4. **Action Constraints**: Limit action space per step (e.g., "fix at most 1 field")
5. **Custom Reward Shaping**: Configurable severity weights + bonus structures
6. **State Representation**: Multiple formats (JSON, graph, embedding-friendly)
---
## Summary Table
| Concept | Implementation | File | Purpose |
|---------|---|---|---|
| **Agent** | External AI/LLM | HTTP client | Proposes fixes |
| **Environment** | `APIContractDebuggerEnv` | `environment.py` | Simulates faults + validates fixes |
| **State** | `DebugObservation` + `DebugState` | `models.py` | Agent observes + internal tracking |
| **Action** | `DebugAction` | `models.py` | Fix proposals |
| **Reward** | `step_reward()` | `graders.py` | Dense per-step feedback |
| **Result** | Episode score `[0.0, 1.0]` | `graders.py` | Final performance metric |
| **Tasks** | Fixtures (easy/medium/hard) | `fixtures.py` | Problem instances |
| **HTTP API** | FastAPI routes | `app.py` | Communication interface |