Reinforcement Learning Architecture: API Contract Debugger
Overview
The API Contract Debugger is a reinforcement learning environment built on the OpenEnv framework. It challenges AI agents to fix broken OpenAPI-style contract specifications by proposing targeted field-level corrections.
This document explains how the codebase implements the core RL concepts:
- Agent: The external AI system interacting with the environment
- Environment: The APIContractDebuggerEnv class that simulates the debugging task
- State: What the agent observes and the internal environment state
- Action: The fixes the agent can propose
- Reward/Result: The feedback signal and scoring mechanism
1. Agent (External AI System)
What is the Agent?
The agent is an external AI system (e.g., an LLM, RL policy, or human) that:
- Receives observations from the environment
- Proposes actions (fixes to the API spec)
- Receives reward feedback and the next state
- Aims to maximize cumulative reward by fixing all violations
Agent Interaction Pattern
Agent                                        Environment
  |                                               |
  |---- POST /reset (task_name) ----------------->|
  |                                               |
  |<--- Initial Observation ----------------------|
  |     (endpoints, violations, reward=0)         |
  |                                               |
  |---- POST /step (action) --------------------->|
  |                                               |
  |<--- Updated Observation ----------------------|
  |     (new endpoints, new violations,           |
  |      reward, done, fixed/introduced)          |
  |                                               |
  |     [repeat until done=True]                  |
  |                                               |
  |---- GET /score, GET /state ------------------>|
  |                                               |
Agent Location in Codebase
- File: server/app.py
- Routes:
  - POST /reset: Initialize a new episode
  - POST /step: Apply one action
  - GET /state: Query the full environment state (for debugging)
  - GET /score: Get the final episode score
  - GET /tasks: List available tasks
The agent communicates via HTTP REST API. All observations are JSON and fully serializable.
2. Environment (APIContractDebuggerEnv)
Class Definition
File: server/environment.py
class APIContractDebuggerEnv(Environment[DebugAction, DebugObservation, DebugState]):
    """
    Environment where an agent debugs broken API contract specifications.
    Inherits from OpenEnv's Environment base class.
    Implements reset(), step(), and the state property.
    """
Environment Responsibilities
- Initialize tasks: Load broken + golden endpoint specs from fixtures
- Detect violations: Compare the current spec against the golden spec
- Apply actions: Mutate the current spec based on the agent's fix proposal
- Compute rewards: Dense per-step reward based on violations fixed/introduced
- Track state: Maintain episode counter, step count, violations
- Terminate episodes: Check for success (all fixed) or max steps reached
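The "detect violations" responsibility can be illustrated with a minimal sketch. It assumes endpoints are plain dicts shaped like the example observation in this document and reuses the severity weights listed in the reward section. The name detect_violations_sketch and its body are hypothetical; this is not the code in server/graders.py.

```python
from typing import Any, Dict, List, Optional

# Severity weights as listed in this document's severity table.
SEVERITY = {"missing_field": 1.0, "wrong_type": 0.9,
            "wrong_status": 0.8, "extra_field": 0.7}


def detect_violations_sketch(current: List[Dict[str, Any]],
                             golden: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Compare each current endpoint against its golden counterpart (hypothetical)."""
    found: List[Dict[str, Any]] = []

    def add(idx: int, location: str, field: Optional[str], kind: str) -> None:
        found.append({"endpoint_index": idx, "location": location,
                      "field_name": field, "violation_type": kind,
                      "severity": SEVERITY[kind]})

    for idx, (cur, gold) in enumerate(zip(current, golden)):
        if cur.get("status_code") != gold.get("status_code"):
            add(idx, "status_code", None, "wrong_status")
        for location in ("request_body", "response_body"):
            cur_body = cur.get(location, {})
            gold_body = gold.get(location, {})
            for name, spec in gold_body.items():
                if name not in cur_body:
                    add(idx, location, name, "missing_field")
                elif cur_body[name].get("type") != spec.get("type"):
                    add(idx, location, name, "wrong_type")
            for name in cur_body:
                if name not in gold_body:
                    add(idx, location, name, "extra_field")
    return found


golden = [{"status_code": 201, "request_body": {},
           "response_body": {"user_id": {"type": "integer"},
                             "created_at": {"type": "string"}}}]
broken = [{"status_code": 201, "request_body": {},
           "response_body": {"user_id": {"type": "integer"}}}]
violations = detect_violations_sketch(broken, golden)
print(violations)
```

Comparing the golden spec against itself yields an empty list, which is the success condition the environment checks for.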
Key Methods
reset(seed, episode_id, task_name, **kwargs) -> DebugObservation
Initializes a fresh episode:
- Loads task config from fixtures
- Deep-copies broken endpoints to avoid cross-episode state leakage
- Detects initial violations
- Returns initial observation with reward=0
import copy

def reset(self, seed=None, episode_id=None, task_name=None, **kwargs):
    """
    Reset the environment and return the initial observation.
    """
    # Task config lookup and other bookkeeping are elided; self._task_cfg holds the selected task.
    # Deep-copy endpoints so mutations never leak across episodes.
    self._current_endpoints = copy.deepcopy(self._task_cfg["broken_endpoints"])
    self._golden_endpoints = copy.deepcopy(self._task_cfg["golden_endpoints"])
    # Detect violations (the agent's starting problem)
    self._violations = detect_violations(self._current_endpoints, self._golden_endpoints)
    # Remaining observation fields are elided.
    return self._make_observation(reward=0.0, done=False)
step(action, timeout_s, **kwargs) -> DebugObservation
Processes one agent action and returns the updated observation:
def step(self, action: DebugAction, timeout_s=None, **kwargs) -> DebugObservation:
    """
    Apply one fix action and return the updated observation and reward.
    """
    # Remember the violations before the action so the reward can diff them.
    prev_violations = self._violations
    self._step_count += 1
    # 1. Apply the action (mutates self._current_endpoints)
    action_error = self._apply_action(action)
    # 2. Recompute violations
    self._violations = detect_violations(self._current_endpoints, self._golden_endpoints)
    # 3. Compute dense reward
    reward = step_reward(prev_violations, self._violations, self._initial_violations, action_error)
    # 4. Check termination
    all_fixed = len(self._violations) == 0
    out_of_steps = self._step_count >= self._task_cfg["max_steps"]
    self._done = all_fixed or out_of_steps
    # 5. Bonus reward if solved
    if all_fixed:
        reward += 0.5
    # Per-step fixed/introduced counts and other fields are elided.
    return self._make_observation(reward=reward, done=self._done)
_apply_action(action) -> Optional[str]
Attempts to mutate self._current_endpoints according to the action:
- Validates the endpoint index, field name, and location
- Executes the fix:
  - ADD_FIELD: Insert a new field into the request/response body
  - REMOVE_FIELD: Delete a field from the body
  - CHANGE_TYPE: Update a field's type
  - CHANGE_STATUS: Update the endpoint's HTTP status code
  - NO_OP: Explicit pass (implicitly penalized because it earns no reward)
- Returns an error string if the action is invalid, None on success
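The validate-then-mutate flow described above can be sketched on plain dicts. This is a minimal illustration under assumptions: the function name apply_action_sketch, the error messages, and the use of dict-based actions (rather than the DebugAction model) are all hypothetical, and the real _apply_action() may check more or behave differently.

```python
from typing import Any, Dict, List, Optional


def apply_action_sketch(endpoints: List[Dict[str, Any]],
                        action: Dict[str, Any]) -> Optional[str]:
    """Mutate endpoints in place; return an error string, or None on success."""
    kind = action["kind"]
    if kind == "no_op":
        return None  # explicit pass: nothing to validate or change
    idx = action["endpoint_index"]
    if not 0 <= idx < len(endpoints):
        return f"endpoint_index {idx} out of range"
    endpoint = endpoints[idx]
    if kind == "change_status":
        if not isinstance(action["new_value"], int):
            return "status code must be an integer"
        endpoint["status_code"] = action["new_value"]
        return None
    location = action["location"]
    if location not in ("request_body", "response_body"):
        return f"invalid location {location!r}"
    body = endpoint.setdefault(location, {})
    name = action["field_name"]
    if kind == "add_field":
        body[name] = dict(action["new_value"])
    elif kind == "remove_field":
        if name not in body:
            return f"field {name!r} not found"
        del body[name]
    elif kind == "change_type":
        if name not in body:
            return f"field {name!r} not found"
        body[name]["type"] = action["new_value"]
    else:
        return f"unknown action kind {kind!r}"
    return None


spec = [{"status_code": 200, "response_body": {"user_id": {"type": "integer"}}}]
error = apply_action_sketch(spec, {
    "kind": "add_field", "endpoint_index": 0, "location": "response_body",
    "field_name": "created_at", "new_value": {"type": "string"},
})
print(error, spec[0]["response_body"])
```

Returning a string instead of raising keeps an invalid proposal inside the normal step flow, so the environment can still emit an observation with a penalty and last_action_error set.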
state Property
Returns the complete internal state (not exposed to agent by default, but available via /state):
@property
def state(self) -> DebugState:
    """Return full internal environment state."""
    return DebugState(
        episode_id=self._episode_id,
        step_count=self._step_count,
        task_name=self._task_name,
        original_endpoints=self._original_endpoints,  # Snapshot of broken spec
        current_endpoints=self._current_endpoints,    # Current state after fixes
        golden_endpoints=self._golden_endpoints,      # Target spec
        violations=self._violations,                  # Current violations
        total_violations_at_start=len(self._initial_violations),
        max_steps=self._task_cfg["max_steps"],
    )
Supported Tasks
File: server/fixtures.py
Three difficulty levels:
| Task | Difficulty | Endpoints | Violations | Max Steps | Description |
|---|---|---|---|---|---|
| easy | Beginner | 1 | 1 missing field | 5 | Simple: add one field to response |
| medium | Intermediate | 3 | 3 (type errors + wrong status) | 10 | Type mismatches and HTTP status fixes |
| hard | Advanced | 4 | 6 (missing, extra, type, status) | 15 | Complex: multiple violation types |
Each task has:
- broken_endpoints: Starting state (what the agent sees)
- golden_endpoints: Ground truth (what violations are measured against)
- description: Human-readable task objective
- max_steps: Episode cut-off
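As an illustration of these four keys, here is a hypothetical fixture for the easy task, reconstructed from the example observation in this document. The names EASY_TASK, _GOLDEN_REGISTER and _BROKEN_REGISTER are assumptions, as is the "required" flag on created_at; the actual entries in server/fixtures.py may be organized differently.

```python
import copy

# Hypothetical golden endpoint, reconstructed from the example observation.
_GOLDEN_REGISTER = {
    "method": "POST",
    "path": "/users/register",
    "status_code": 201,
    "request_body": {
        "username": {"type": "string", "required": True},
        "email": {"type": "string", "required": True},
        "password": {"type": "string", "required": True},
    },
    "response_body": {
        "user_id": {"type": "integer", "required": True},
        "username": {"type": "string", "required": True},
        "created_at": {"type": "string", "required": True},  # assumed flag
    },
}

# The broken variant is the golden endpoint minus the field the agent must restore.
_BROKEN_REGISTER = copy.deepcopy(_GOLDEN_REGISTER)
del _BROKEN_REGISTER["response_body"]["created_at"]

EASY_TASK = {
    "description": "Add missing 'created_at' field to response...",
    "max_steps": 5,
    "broken_endpoints": [_BROKEN_REGISTER],
    "golden_endpoints": [_GOLDEN_REGISTER],
}

print(sorted(EASY_TASK["broken_endpoints"][0]["response_body"]))
```

Deriving the broken spec from a deep copy of the golden one keeps the two in sync except for the deliberately injected fault.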
3. State
Observation (DebugObservation)
What the agent sees after each action.
File: server/models.py
class DebugObservation(Observation):
    """
    What the agent observes after reset() or step().
    """
    # Task info
    task_name: str                        # "easy" | "medium" | "hard"
    task_description: str                 # Human description
    # Current spec
    endpoints: List[Dict[str, Any]]       # Current endpoints (partially fixed)
    violations: List[Dict[str, Any]]      # Detected violations still present
    # Reward signals
    reward: float                         # Dense per-step reward
    done: bool                            # Episode termination flag
    violations_fixed_this_step: int       # Count of fixed violations
    violations_introduced_this_step: int  # Count of new violations
    total_violations_at_start: int        # Reference baseline
    # Tracking
    step_count: int                       # Steps taken so far
    max_steps: int                        # Episode limit
    last_action_error: Optional[str]      # Validation error message
Example Observation
{
  "task_name": "easy",
  "task_description": "Add missing 'created_at' field to response...",
  "endpoints": [
    {
      "method": "POST",
      "path": "/users/register",
      "status_code": 201,
      "request_body": {
        "username": {"type": "string", "required": true},
        "email": {"type": "string", "required": true},
        "password": {"type": "string", "required": true}
      },
      "response_body": {
        "user_id": {"type": "integer", "required": true},
        "username": {"type": "string", "required": true}
        // missing: created_at
      }
    }
  ],
  "violations": [
    {
      "endpoint_index": 0,
      "location": "response_body",
      "field_name": "created_at",
      "violation_type": "missing_field",
      "description": "POST /users/register response_body: required field 'created_at' (string) is missing",
      "severity": 1.0
    }
  ],
  "violations_fixed_this_step": 0,
  "violations_introduced_this_step": 0,
  "total_violations_at_start": 1,
  "step_count": 0,
  "max_steps": 5,
  "reward": 0.0,
  "done": false,
  "last_action_error": null
}
Full Internal State (DebugState)
Available via GET /state endpoint (for debugging/analysis, not given to agent by default).
class DebugState(State):
    """
    Full internal state (not exposed to agent by default).
    """
    task_name: str
    original_endpoints: List[Dict[str, Any]]  # Snapshot of broken spec
    current_endpoints: List[Dict[str, Any]]   # Mutated by agent's actions
    golden_endpoints: List[Dict[str, Any]]    # Ground truth
    violations: List[Dict[str, Any]]          # Computed violations
    total_violations_at_start: int
    max_steps: int
4. Action (DebugAction)
What the agent can propose.
File: server/models.py
class DebugAction(Action):
    """
    A single fix proposed by the agent.
    The agent targets one endpoint + one field and proposes exactly one change.
    """
    kind: ActionKind           # Type of fix
    endpoint_index: int        # Which endpoint to fix (0-indexed)
    location: str              # "request_body" | "response_body" | "status_code"
    field_name: Optional[str]  # Field to modify (null for status_code)
    new_value: Optional[Any]   # The corrected value
Action Types (ActionKind)
| Kind | Target | Effect | new_value |
|---|---|---|---|
| ADD_FIELD | Field | Insert missing field into body | {"type": str, "description"?: str} |
| REMOVE_FIELD | Field | Delete forbidden field from body | null |
| CHANGE_TYPE | Field | Fix field's JSON Schema type | Type string (e.g., "integer") |
| CHANGE_STATUS | Endpoint | Fix HTTP status code | Integer (e.g., 201) |
| NO_OP | None | Explicit pass/wait | null |
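The table's kinds map naturally onto a string-valued enum. The sketch below is an assumption about how ActionKind might be declared: the member names come from the table and the lowercase values from the "kind" strings used in the example actions in this section; the definition in server/models.py may differ.

```python
from enum import Enum


class ActionKind(str, Enum):
    """Assumed declaration; values mirror the "kind" strings in the JSON examples."""
    ADD_FIELD = "add_field"
    REMOVE_FIELD = "remove_field"
    CHANGE_TYPE = "change_type"
    CHANGE_STATUS = "change_status"
    NO_OP = "no_op"


# Looking a member up by its wire value, as a JSON payload would supply it.
kind = ActionKind("change_status")
print(kind is ActionKind.CHANGE_STATUS)  # True
```

Mixing in str lets a member compare equal to its wire value, which is convenient when actions arrive as JSON.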
Example Actions
# Fix 1: Add missing 'created_at' field
{
  "kind": "add_field",
  "endpoint_index": 0,
  "location": "response_body",
  "field_name": "created_at",
  "new_value": {
    "type": "string",
    "description": "ISO-8601 timestamp"
  }
}
# Fix 2: Change field type from string to integer
{
  "kind": "change_type",
  "endpoint_index": 1,
  "location": "request_body",
  "field_name": "user_id",
  "new_value": "integer"
}
# Fix 3: Correct HTTP status code
{
  "kind": "change_status",
  "endpoint_index": 0,
  "location": "status_code",
  "field_name": null,
  "new_value": 201
}
# Fix 4: Remove extra field
{
  "kind": "remove_field",
  "endpoint_index": 2,
  "location": "response_body",
  "field_name": "deprecated_field",
  "new_value": null
}
# Fix 5: Explicit pass
{
  "kind": "no_op",
  "endpoint_index": 0,
  "location": "request_body",
  "field_name": null,
  "new_value": null
}
Action Validation
The environment validates actions in _apply_action():
- Endpoint index bounds: Must satisfy 0 ≤ index < len(endpoints)
- Location validity: Must be "request_body", "response_body", or "status_code"
- Field existence: REMOVE_FIELD and CHANGE_TYPE require the field to exist
- Type format: Fields must have {"type": "..."} structure
- Status code format: Must be an integer
If validation fails, _apply_action() returns an error string and the step receives a -0.05 reward penalty.
5. Reward & Result
Dense Per-Step Reward
File: server/graders.py, step_reward() function
The agent receives feedback after each step:
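from typing import Any, Dict, List, Optional

def _violation_key(v: Dict[str, Any]) -> tuple:
    # Assumed identity of a violation; the real grader may match differently.
    return (v["endpoint_index"], v["location"], v["field_name"], v["violation_type"])

def step_reward(
    prev_violations: List[Dict[str, Any]],
    new_violations: List[Dict[str, Any]],
    initial_violations: List[Dict[str, Any]],
    action_error: Optional[str],
) -> float:
    """
    Dense per-step reward:
    +0.2 × severity per violation resolved
    -0.15 × severity per new violation introduced
    -0.05 for a malformed action
    The +0.5 success bonus is added by step() once all violations are fixed.
    """
    if action_error:
        return -0.05
    prev_keys = {_violation_key(v) for v in prev_violations}
    new_keys = {_violation_key(v) for v in new_violations}
    reward = 0.0
    for v in prev_violations:
        if _violation_key(v) not in new_keys:  # fixed this step
            reward += 0.2 * v["severity"]
    for v in new_violations:
        if _violation_key(v) not in prev_keys:  # introduced this step
            reward -= 0.15 * v["severity"]
    return reward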
Violation Severity Weights
Weighted by problem importance:
| Violation Type | Severity | Reason |
|---|---|---|
| missing_field | 1.0 | Breaks contract; top priority |
| wrong_type | 0.9 | Type mismatch; critical |
| wrong_status | 0.8 | HTTP code error; significant |
| extra_field | 0.7 | Forbidden field; less critical |
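To make the weighting concrete, the following sketch works through the arithmetic for a single step that fixes one wrong_type violation while introducing one extra_field violation. The helper dense_reward is hypothetical and only restates the coefficients and severities given in this section.

```python
import math

# Severity weights from the table above.
SEVERITY = {"missing_field": 1.0, "wrong_type": 0.9,
            "wrong_status": 0.8, "extra_field": 0.7}


def dense_reward(fixed_types, introduced_types, action_error=None):
    """Hypothetical helper: +0.2 x severity per fix, -0.15 x severity per regression."""
    if action_error:
        return -0.05
    gained = sum(0.2 * SEVERITY[t] for t in fixed_types)
    lost = sum(0.15 * SEVERITY[t] for t in introduced_types)
    return gained - lost


# 0.2 * 0.9 - 0.15 * 0.7 = 0.18 - 0.105 = 0.075
reward = dense_reward(["wrong_type"], ["extra_field"])
print(f"{reward:.3f}")
```

Because the fix coefficient (0.2) exceeds the regression coefficient (0.15), trading one violation for another of equal severity still yields a small positive reward.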
Episode Scoring (grade_episode())
Computed at episode end. Returns final score in [0.0, 1.0].
def grade_episode(
    current_endpoints: List[Dict[str, Any]],
    golden_endpoints: List[Dict[str, Any]],
    initial_violations: List[Dict[str, Any]],
) -> float:
    """
    Final episode score:
    score = (weighted_violations_fixed - weighted_violations_introduced)
            / total_initial_weight
    Clamped to [0.0, 1.0]
    1.0 = all violations fixed, no new ones introduced
    0.5 = 50% of violations fixed
    0.0 = no improvement or made things worse
    """
Example Scoring Scenario
Task: easy (1 violation)
- Initial violation: missing_field "created_at" (severity=1.0)
- After 1 step: Agent incorrectly changes the type of username to integer (introduces 1 wrong_type violation, severity=0.9)
- After 2 steps: Agent adds created_at correctly
- Final state: the original violation is fixed, but the introduced wrong_type violation remains
score = (1.0 - 0.9) / 1.0 = 0.1
The introduced violation cancels most of the credit for the fix.
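The scoring formula can also be sketched directly on violation lists. This is an illustration under assumptions: episode_score_sketch and _identity are hypothetical names, violations are matched by an assumed identity tuple, and an episode with no initial violations is assumed to score 1.0. The real grade_episode() takes endpoint specs and may differ in these details.

```python
from typing import Any, Dict, List, Tuple


def _identity(v: Dict[str, Any]) -> Tuple:
    # Assumed identity of a violation; the real grader may key differently.
    return (v["endpoint_index"], v["location"], v["field_name"], v["violation_type"])


def episode_score_sketch(initial: List[Dict[str, Any]],
                         final: List[Dict[str, Any]]) -> float:
    """(fixed weight - introduced weight) / initial weight, clamped to [0, 1]."""
    total = sum(v["severity"] for v in initial)
    if total == 0:
        return 1.0  # nothing to fix (assumed convention)
    initial_ids = {_identity(v) for v in initial}
    final_ids = {_identity(v) for v in final}
    fixed = sum(v["severity"] for v in initial if _identity(v) not in final_ids)
    introduced = sum(v["severity"] for v in final if _identity(v) not in initial_ids)
    return max(0.0, min(1.0, (fixed - introduced) / total))


missing = {"endpoint_index": 0, "location": "response_body",
           "field_name": "created_at", "violation_type": "missing_field",
           "severity": 1.0}
status = {"endpoint_index": 1, "location": "status_code",
          "field_name": None, "violation_type": "wrong_status",
          "severity": 0.8}

# The agent fixed the missing field but left the wrong status in place.
score = episode_score_sketch([missing, status], [status])
print(round(score, 3))
```

Dividing by the initial weight rather than the initial count means fixing a high-severity violation moves the score more than fixing a low-severity one.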
6. Complete RL Loop Example
Scenario: Easy Task
Initial state:
Broken spec: POST /users/register response missing "created_at"
Golden spec: response has user_id, username, created_at
Episode Transcript
RESET request (task_name="easy")
↓
Observation #0:
endpoints: [broken registration endpoint]
violations: [missing_field "created_at"]
reward: 0.0
done: false
step_count: 0
STEP 1: Agent proposes ADD_FIELD action
action.kind = "add_field"
action.endpoint_index = 0
action.location = "response_body"
action.field_name = "created_at"
action.new_value = {"type": "string", "description": "ISO-8601 timestamp"}
↓
Environment:
- Validates action ✓
- Adds field to response_body
- Recomputes violations → [] (0 violations!)
- Computes reward: +0.2 × 1.0 (fixed 1 violation of severity 1.0) = +0.2
                   + 0.5 (bonus for all_fixed=true) = +0.7 total
- Sets done=true (all violations fixed)
↓
Observation #1:
endpoints: [fixed registration endpoint]
violations: []
violations_fixed_this_step: 1
violations_introduced_this_step: 0
reward: 0.7
done: true
step_count: 1
SCORE request
↓
score = (1.0 fixed - 0 introduced) / 1.0 initial = 1.0 ✓
The agent succeeds with a perfect score.
7. File Structure Summary
server/
├── app.py            # FastAPI routes, HTTP interface
├── environment.py    # APIContractDebuggerEnv (core RL logic)
├── models.py         # Pydantic models: DebugAction, DebugObservation, DebugState
├── fixtures.py       # Task definitions (easy, medium, hard)
├── graders.py        # Violation detection + reward/scoring
└── __pycache__/
tests/                # Unit tests for environment, graders, fixtures
RL_ARCHITECTURE.md    # This file
8. Key Design Principles
- Stateful Environment: One episode per task at a time (OpenEnv singleton pattern)
- Dense Rewards: The agent gets per-step feedback (not just a final score) to guide learning
- Severity-Weighted: Different violation types have different weights (missing fields = highest priority)
- Action Validation: Invalid actions receive a penalty and return error messages
- Deep-Copied State: Endpoints are deep-copied to prevent cross-episode contamination
- Observable Violations: The agent sees the exact list of violations (not hidden)
- Termination Conditions:
  - Success: All violations fixed
  - Failure: Max steps reached
- JSON/REST Interface: The agent communicates via HTTP (language-agnostic)
9. Typical Agent Workflow
import requests

BASE_URL = "http://localhost:7860"

# 1. Reset to start a new episode
reset_resp = requests.post(f"{BASE_URL}/reset", json={
    "task_name": "easy",
    "seed": 42
})
obs = reset_resp.json()
print(f"Violations to fix: {len(obs['violations'])}")

# 2. Repeat: observe → decide → act
for step in range(obs['max_steps']):
    if obs['done']:
        break
    # Agent decision logic (depends on obs['violations'])
    action = {
        "kind": "add_field",
        "endpoint_index": 0,
        "location": "response_body",
        "field_name": "created_at",
        "new_value": {"type": "string"}
    }
    # 3. Apply action
    step_resp = requests.post(f"{BASE_URL}/step", json={"action": action})
    obs = step_resp.json()
    print(f"Step {step+1}: reward={obs['reward']}, violations={len(obs['violations'])}")

# 4. Check final score
score_resp = requests.get(f"{BASE_URL}/score")
print(f"Final score: {score_resp.json()['score']}")
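The workflow above hard-codes a single action. As a sketch of what the decision step could look like, the hypothetical helper below turns one observed violation into an action. It assumes, based only on the single example violation in this document, that a missing_field description carries the expected type in parentheses; violations it cannot resolve from the observation alone fall back to no_op and would need a smarter policy (for example, an LLM).

```python
import re
from typing import Any, Dict


def action_for_violation(violation: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical rule-based policy: map one violation to one fix action."""
    base = {
        "endpoint_index": violation["endpoint_index"],
        "location": violation["location"],
        "field_name": violation["field_name"],
    }
    kind = violation["violation_type"]
    if kind == "extra_field":
        return {**base, "kind": "remove_field", "new_value": None}
    if kind == "missing_field":
        # Assumption: the description contains the expected type, e.g. "(string)".
        match = re.search(r"\((\w+)\)", violation.get("description", ""))
        if match:
            return {**base, "kind": "add_field",
                    "new_value": {"type": match.group(1)}}
    # Expected types and status codes are not in the violation dict shown here.
    return {"kind": "no_op", "endpoint_index": violation["endpoint_index"],
            "location": "request_body", "field_name": None, "new_value": None}


example = {
    "endpoint_index": 0,
    "location": "response_body",
    "field_name": "created_at",
    "violation_type": "missing_field",
    "description": "POST /users/register response_body: required field "
                   "'created_at' (string) is missing",
    "severity": 1.0,
}
print(action_for_violation(example))
```

Inside the loop, such a helper would be called on one entry of obs['violations'], for instance the one with the highest severity.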
10. Future Extensions
Potential enhancements to the RL framework:
- Multi-Agent: Support concurrent episodes via session IDs
- Curriculum Learning: Dynamically adapt difficulty based on agent performance
- Partial Observability: Hide some violations initially to increase the challenge
- Action Constraints: Limit the action space per step (e.g., "fix at most 1 field")
- Custom Reward Shaping: Configurable severity weights + bonus structures
- State Representation: Multiple formats (JSON, graph, embedding-friendly)
Summary Table
| Concept | Implementation | File | Purpose |
|---|---|---|---|
| Agent | External AI/LLM | HTTP client | Proposes fixes |
| Environment | APIContractDebuggerEnv | environment.py | Simulates faults + validates fixes |
| State | DebugObservation + DebugState | models.py | Agent observes + internal tracking |
| Action | DebugAction | models.py | Fix proposals |
| Reward | step_reward() | graders.py | Dense per-step feedback |
| Result | Episode score [0.0, 1.0] | graders.py | Final performance metric |
| Tasks | Fixtures (easy/medium/hard) | fixtures.py | Problem instances |
| HTTP API | FastAPI routes | app.py | Communication interface |