
Reinforcement Learning Architecture: API Contract Debugger

Overview

The API Contract Debugger is a reinforcement learning environment built on the OpenEnv framework. It challenges AI agents to fix broken OpenAPI-style contract specifications by proposing targeted field-level corrections.

This document explains how the codebase implements the core RL concepts:

  • Agent: The external AI system interacting with the environment
  • Environment: The APIContractDebuggerEnv class that simulates the debugging task
  • State: What the agent observes and the internal environment state
  • Action: The fixes the agent can propose
  • Reward/Result: The feedback signal and scoring mechanism

1. Agent (External AI System)

What is the Agent?

The agent is an external AI system (e.g., an LLM, RL policy, or human) that:

  • Receives observations from the environment
  • Proposes actions (fixes to the API spec)
  • Receives reward feedback and the next state
  • Aims to maximize cumulative reward by fixing all violations

Agent Interaction Pattern

Agent                              Environment
  |                                     |
  |---- POST /reset (task_name) ----->  |
  |                                     |
  | <------ Initial Observation --------| 
  |  (endpoints, violations, reward=0)  |
  |                                     |
  |---- POST /step (action) ----------> |
  |                                     |
  | <---- Updated Observation --------- |
  |  (new endpoints, new violations,    |
  |   reward, done, fixed/introduced)   |
  |                                     |
  | [repeat until done=True]            |
  |                                     |
  | ---- GET /score - GET /state ----->  |
  |                                     |
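
The exchange above can be sketched as a transport-agnostic loop. The `run_episode` helper, the `policy` callable, and the `ToyEnv` stand-in are hypothetical names used only for illustration; a real client would call the HTTP routes rather than an in-process object.

```python
from typing import Any, Callable, Dict, List

Observation = Dict[str, Any]
Action = Dict[str, Any]


def run_episode(env, policy: Callable[[Observation], Action], task_name: str) -> float:
    """Drive one episode to completion and return the cumulative reward."""
    obs = env.reset(task_name=task_name)
    total = obs["reward"]
    while not obs["done"]:
        obs = env.step(policy(obs))
        total += obs["reward"]
    return total


class ToyEnv:
    """Hypothetical stand-in: one violation, cleared by any 'add_field' action."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps

    def reset(self, task_name: str) -> Observation:
        self.steps = 0
        self.violations: List[dict] = [
            {"violation_type": "missing_field", "severity": 1.0}
        ]
        return self._obs(0.0)

    def step(self, action: Action) -> Observation:
        self.steps += 1
        reward = 0.0
        if action["kind"] == "add_field" and self.violations:
            self.violations = []
            reward = 0.2 + 0.5  # fix reward plus the all-fixed bonus
        return self._obs(reward)

    def _obs(self, reward: float) -> Observation:
        done = not self.violations or self.steps >= self.max_steps
        return {"violations": list(self.violations), "reward": reward, "done": done}


total = run_episode(ToyEnv(), lambda obs: {"kind": "add_field"}, "easy")
```

Keeping the loop independent of the transport makes it easy to swap the stand-in for an HTTP-backed wrapper later.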

Agent Location in Codebase

  • File: server/app.py
  • Routes:
    • POST /reset: Initialize new episode
    • POST /step: Apply one action
    • GET /state: Query full environment state (for debugging)
    • GET /score: Get final episode score
    • GET /tasks: List available tasks

The agent communicates via HTTP REST API. All observations are JSON and fully serializable.


2. Environment (APIContractDebuggerEnv)

Class Definition

File: server/environment.py

class APIContractDebuggerEnv(Environment[DebugAction, DebugObservation, DebugState]):
    """
    Environment where an agent debugs broken API contract specifications.
    
    Inherits from OpenEnv's Environment base class.
    Implements reset(), step(), and state property.
    """

Environment Responsibilities

  1. Initialize tasks: Load broken + golden endpoint specs from fixtures
  2. Detect violations: Compare current spec against golden spec
  3. Apply actions: Mutate the current spec based on agent's fix proposal
  4. Compute rewards: Dense per-step reward based on violations fixed/introduced
  5. Track state: Maintain episode counter, step count, violations
  6. Terminate episodes: Check for success (all fixed) or max steps reached

Key Methods

reset(seed, episode_id, task_name, **kwargs) → DebugObservation

Initializes a fresh episode:

  • Loads task config from fixtures
  • Deep-copies broken endpoints to avoid cross-episode state leakage
  • Detects initial violations
  • Returns initial observation with reward=0

def reset(self, seed=None, episode_id=None, task_name=None, **kwargs):
    """
    Reset the environment and return the initial observation.
    """
    # Load task config and deep-copy endpoints
    self._current_endpoints = copy.deepcopy(self._task_cfg["broken_endpoints"])
    self._golden_endpoints = copy.deepcopy(self._task_cfg["golden_endpoints"])
    
    # Detect violations (agent's starting problem)
    self._violations = detect_violations(self._current_endpoints, self._golden_endpoints)
    
    return self._make_observation(reward=0.0, done=False, ...)
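
The deep copy in reset() matters because endpoint specs are nested dictionaries. The following minimal sketch, using a hypothetical `task_cfg` fixture entry, shows how a shallow copy would let one episode's fixes leak back into the fixture, while `copy.deepcopy` keeps it intact.

```python
import copy

# Hypothetical fixture entry, shaped like the specs in this document.
task_cfg = {
    "broken_endpoints": [
        {"path": "/users/register", "response_body": {"user_id": {"type": "integer"}}}
    ]
}

# Shallow copy: the list is new, but the endpoint dicts are shared with the fixture.
shallow = list(task_cfg["broken_endpoints"])
shallow[0]["response_body"]["created_at"] = {"type": "string"}
leaked = "created_at" in task_cfg["broken_endpoints"][0]["response_body"]

# Restore the fixture, then repeat the same mutation on a deep copy.
del task_cfg["broken_endpoints"][0]["response_body"]["created_at"]
deep = copy.deepcopy(task_cfg["broken_endpoints"])
deep[0]["response_body"]["created_at"] = {"type": "string"}
isolated = "created_at" not in task_cfg["broken_endpoints"][0]["response_body"]
```

Here `leaked` shows the fixture being modified through the shallow copy, and `isolated` shows it untouched after the deep copy.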

step(action, timeout_s, **kwargs) → DebugObservation

Processes one agent action and returns the updated state:

def step(self, action: DebugAction, timeout_s=None, **kwargs) -> DebugObservation:
    """
    Apply one fix action and return the updated observation plus reward.
    """
    # 0. Snapshot the violations so the reward can be computed as a before/after diff
    prev_violations = list(self._violations)
    self._step_count += 1
    
    # 1. Apply the action (mutate current_endpoints)
    action_error = self._apply_action(action)
    
    # 2. Recompute violations
    self._violations = detect_violations(self._current_endpoints, self._golden_endpoints)
    
    # 3. Compute dense reward
    reward = step_reward(
        prev_violations, self._violations, self._initial_violations, action_error is not None
    )
    
    # 4. Check termination
    all_fixed = len(self._violations) == 0
    out_of_steps = self._step_count >= self._task_cfg["max_steps"]
    self._done = all_fixed or out_of_steps
    
    # 5. Bonus reward if solved
    if all_fixed:
        reward += 0.5
    
    # fixed/introduced counts come from the same before/after diff (elided here)
    return self._make_observation(reward, self._done, fixed_this_step, ...)

_apply_action(action) → Optional[str]

Attempts to mutate self._current_endpoints according to the action:

  • Validates endpoint index, field name, locations
  • Executes the fix:
    • ADD_FIELD: Insert new field into request/response body
    • REMOVE_FIELD: Delete field from body
    • CHANGE_TYPE: Update field's type
    • CHANGE_STATUS: Update endpoint's HTTP status code
    • NO_OP: Explicit pass (implicit penalty via no reward)
  • Returns error string if invalid, None on success
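
The validation and mutation rules listed above can be sketched as a standalone function. This is an illustrative approximation, not the code in server/environment.py: the name `apply_action`, the plain-dict action format, and the exact error messages are assumptions.

```python
from typing import Any, Dict, List, Optional

BODY_LOCATIONS = ("request_body", "response_body")


def apply_action(endpoints: List[Dict[str, Any]], action: Dict[str, Any]) -> Optional[str]:
    """Mutate `endpoints` in place; return an error string, or None on success."""
    kind = action.get("kind")
    if kind == "no_op":
        return None

    idx = action.get("endpoint_index")
    if not isinstance(idx, int) or not 0 <= idx < len(endpoints):
        return f"endpoint_index {idx!r} out of range"
    endpoint = endpoints[idx]

    if kind == "change_status":
        status = action.get("new_value")
        if not isinstance(status, int) or isinstance(status, bool):
            return "new_value must be an integer status code"
        endpoint["status_code"] = status
        return None

    location, name = action.get("location"), action.get("field_name")
    if location not in BODY_LOCATIONS:
        return f"invalid location {location!r}"
    if not name:
        return "field_name is required"
    body = endpoint.setdefault(location, {})

    if kind == "add_field":
        value = action.get("new_value")
        if not isinstance(value, dict) or "type" not in value:
            return 'new_value must look like {"type": "..."}'
        body[name] = value
    elif kind in ("remove_field", "change_type"):
        if name not in body:
            return f"field {name!r} not found in {location}"
        if kind == "remove_field":
            del body[name]
        else:
            body[name]["type"] = action.get("new_value")
    else:
        return f"unknown action kind {kind!r}"
    return None
```

Returning an error string instead of raising lets the caller turn an invalid action into a penalty while the episode continues.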

state Property

Returns the complete internal state (not exposed to agent by default, but available via /state):

@property
def state(self) -> DebugState:
    """Return full internal environment state."""
    return DebugState(
        episode_id=self._episode_id,
        step_count=self._step_count,
        task_name=self._task_name,
        original_endpoints=self._original_endpoints,     # Snapshot of broken spec
        current_endpoints=self._current_endpoints,       # Current state after fixes
        golden_endpoints=self._golden_endpoints,         # Target spec
        violations=self._violations,                     # Current violations
        total_violations_at_start=len(self._initial_violations),
        max_steps=self._task_cfg["max_steps"],
    )

Supported Tasks

File: server/fixtures.py

Three difficulty levels:

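| Task   | Difficulty   | Endpoints | Violations                        | Max Steps | Description                           |
|--------|--------------|-----------|-----------------------------------|-----------|---------------------------------------|
| easy   | Beginner     | 1         | 1 missing field                   | 5         | Simple: add one field to response     |
| medium | Intermediate | 3         | 3 (type errors + wrong status)    | 10        | Type mismatches and HTTP status fixes |
| hard   | Advanced     | 4         | 6 (missing, extra, type, status)  | 15        | Complex: multiple violation types     |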

Each task has:

  • broken_endpoints: Starting state (what agent sees)
  • golden_endpoints: Ground truth (what violations are measured against)
  • description: Human-readable task objective
  • max_steps: Episode cut-off
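
A task entry with these four keys might look like the sketch below. The endpoint mirrors the easy task's example observation in section 3; the `TASKS` name, the variable names, and the description wording are assumptions, and the real layout of server/fixtures.py may differ.

```python
import copy
from typing import Any, Dict

# Broken spec: the response body lacks 'created_at'.
broken_register = {
    "method": "POST",
    "path": "/users/register",
    "status_code": 201,
    "request_body": {
        "username": {"type": "string", "required": True},
        "email": {"type": "string", "required": True},
        "password": {"type": "string", "required": True},
    },
    "response_body": {
        "user_id": {"type": "integer", "required": True},
        "username": {"type": "string", "required": True},
    },
}

# Golden spec: identical except for the required 'created_at' string field.
golden_register = copy.deepcopy(broken_register)
golden_register["response_body"]["created_at"] = {"type": "string", "required": True}

TASKS: Dict[str, Dict[str, Any]] = {
    "easy": {
        "description": "Hypothetical wording: add the missing response field.",
        "max_steps": 5,
        "broken_endpoints": [broken_register],
        "golden_endpoints": [golden_register],
    }
}
```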

3. State

Observation (DebugObservation)

What the agent sees after each action.

File: server/models.py

class DebugObservation(Observation):
    """
    What the agent observes after reset() or step().
    """
    # Task info
    task_name: str                          # "easy" | "medium" | "hard"
    task_description: str                   # Human description
    
    # Current spec
    endpoints: List[Dict[str, Any]]         # Current endpoints (partially fixed)
    violations: List[Dict[str, Any]]        # Detected violations still present
    
    # Reward signals
    reward: float                           # Dense per-step reward
    done: bool                              # Episode termination flag
    violations_fixed_this_step: int         # Count of fixed violations
    violations_introduced_this_step: int    # Count of new violations
    total_violations_at_start: int          # Reference baseline
    
    # Tracking
    step_count: int                         # Steps taken so far
    max_steps: int                          # Episode limit
    last_action_error: Optional[str]        # Validation error message

Example Observation

{
  "task_name": "easy",
  "task_description": "Add missing 'created_at' field to response...",
  "endpoints": [
    {
      "method": "POST",
      "path": "/users/register",
      "status_code": 201,
      "request_body": {
        "username": {"type": "string", "required": true},
        "email": {"type": "string", "required": true},
        "password": {"type": "string", "required": true}
      },
      "response_body": {
        "user_id": {"type": "integer", "required": true},
        "username": {"type": "string", "required": true}
        // missing: created_at
      }
    }
  ],
  "violations": [
    {
      "endpoint_index": 0,
      "location": "response_body",
      "field_name": "created_at",
      "violation_type": "missing_field",
      "description": "POST /users/register response_body: required field 'created_at' (string) is missing",
      "severity": 1.0
    }
  ],
  "violations_fixed_this_step": 0,
  "violations_introduced_this_step": 0,
  "total_violations_at_start": 1,
  "step_count": 0,
  "max_steps": 5,
  "reward": 0.0,
  "done": false,
  "last_action_error": null
}
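
On the client side, an observation like the one above is plain JSON, so it can be summarized with ordinary dictionary code. The `summarize` helper below is a hypothetical convenience, not part of the environment, and the example observation is trimmed to the fields it reads.

```python
from collections import Counter
from typing import Any, Dict


def summarize(obs: Dict[str, Any]) -> Dict[str, Any]:
    """Condense an observation into the numbers an agent usually plans with."""
    violations = obs["violations"]
    return {
        "by_type": dict(Counter(v["violation_type"] for v in violations)),
        "remaining_severity": sum(v["severity"] for v in violations),
        "steps_left": obs["max_steps"] - obs["step_count"],
    }


example_obs = {
    "violations": [
        {
            "endpoint_index": 0,
            "location": "response_body",
            "field_name": "created_at",
            "violation_type": "missing_field",
            "severity": 1.0,
        }
    ],
    "step_count": 0,
    "max_steps": 5,
}
summary = summarize(example_obs)
```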

Full Internal State (DebugState)

Available via GET /state endpoint (for debugging/analysis, not given to agent by default).

class DebugState(State):
    """
    Full internal state (not exposed to agent by default).
    """
    task_name: str
    original_endpoints: List[Dict[str, Any]]  # Snapshot of broken spec
    current_endpoints: List[Dict[str, Any]]   # Mutated by agent's actions
    golden_endpoints: List[Dict[str, Any]]    # Ground truth
    violations: List[Dict[str, Any]]          # Computed violations
    total_violations_at_start: int
    max_steps: int

4. Action (DebugAction)

What the agent can propose.

File: server/models.py

class DebugAction(Action):
    """
    A single fix proposed by the agent.
    The agent targets one endpoint + one field and proposes exactly one change.
    """
    
    kind: ActionKind                    # Type of fix
    endpoint_index: int                 # Which endpoint to fix (0-indexed)
    location: str                       # "request_body" | "response_body" | "status_code"
    field_name: Optional[str]           # Field to modify (null for status_code)
    new_value: Optional[Any]            # The corrected value

Action Types (ActionKind)

| Kind          | Target   | Effect                            | new_value                            |
|---------------|----------|-----------------------------------|--------------------------------------|
| ADD_FIELD     | Field    | Insert missing field into body    | {"type": str, "description"?: str}   |
| REMOVE_FIELD  | Field    | Delete forbidden field from body  | null                                 |
| CHANGE_TYPE   | Field    | Fix field's JSON Schema type      | Type string (e.g., "integer")        |
| CHANGE_STATUS | Endpoint | Fix HTTP status code              | Integer (e.g., 201)                  |
| NO_OP         | None     | Explicit pass/wait                | null                                 |

Example Actions

# Fix 1: Add missing 'created_at' field
{
  "kind": "add_field",
  "endpoint_index": 0,
  "location": "response_body",
  "field_name": "created_at",
  "new_value": {
    "type": "string",
    "description": "ISO-8601 timestamp"
  }
}

# Fix 2: Change field type from string to integer
{
  "kind": "change_type",
  "endpoint_index": 1,
  "location": "request_body",
  "field_name": "user_id",
  "new_value": "integer"
}

# Fix 3: Correct HTTP status code
{
  "kind": "change_status",
  "endpoint_index": 0,
  "location": "status_code",
  "field_name": null,
  "new_value": 201
}

# Fix 4: Remove extra field
{
  "kind": "remove_field",
  "endpoint_index": 2,
  "location": "response_body",
  "field_name": "deprecated_field",
  "new_value": null
}

# Fix 5: Explicit pass
{
  "kind": "no_op",
  "endpoint_index": 0,
  "location": "request_body",
  "field_name": null,
  "new_value": null
}
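
A client can build these payloads from typed objects rather than hand-written dictionaries. The sketch below assumes the wire values are the lowercase strings used in the examples above; `FixAction` and `to_payload` are hypothetical client-side names, not the Pydantic models in server/models.py.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Optional


class ActionKind(str, Enum):
    ADD_FIELD = "add_field"
    REMOVE_FIELD = "remove_field"
    CHANGE_TYPE = "change_type"
    CHANGE_STATUS = "change_status"
    NO_OP = "no_op"


@dataclass
class FixAction:
    kind: ActionKind
    endpoint_index: int
    location: str
    field_name: Optional[str] = None
    new_value: Optional[Any] = None

    def to_payload(self) -> dict:
        """Return a JSON-ready dict with the enum flattened to its string value."""
        payload = asdict(self)
        payload["kind"] = self.kind.value
        return payload


fix_status = FixAction(ActionKind.CHANGE_STATUS, 0, "status_code", new_value=201)
payload = fix_status.to_payload()
encoded = json.dumps(payload)
```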

Action Validation

The environment validates actions in _apply_action():

  • Endpoint index bounds: Must be 0 ≤ index < len(endpoints)
  • Location validity: Must be "request_body", "response_body", or "status_code"
  • Field existence: REMOVE_FIELD and CHANGE_TYPE require field to exist
  • Type format: Fields must have {"type": "..."} structure
  • Status code format: Must be an integer

If validation fails, _apply_action() returns an error string and the step receives a reward of -0.05 as a penalty.


5. Reward & Result

Dense Per-Step Reward

File: server/graders.py → step_reward() function

The agent receives feedback after each step:

def step_reward(
    prev_violations: List[Dict[str, Any]],
    new_violations: List[Dict[str, Any]],
    initial_violations: List[Dict[str, Any]],
    action_error: bool,
) -> float:
    """
    Dense per-step reward:
    +0.2 × severity  per violation resolved
    -0.15 × severity per new violation introduced
    -0.05             for malformed action
    +0.5              bonus if all violations fixed (added by step(), not here)
    """
    if action_error:
        return -0.05
    
    # Before/after diff (shown here with plain dict equality)
    violations_fixed_this_step = [v for v in prev_violations if v not in new_violations]
    violations_introduced_this_step = [v for v in new_violations if v not in prev_violations]
    
    reward = 0.0
    for v in violations_fixed_this_step:
        reward += 0.2 * v["severity"]
    for v in violations_introduced_this_step:
        reward -= 0.15 * v["severity"]
    
    return reward
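
To make the arithmetic concrete, here is a self-contained sketch of the same reward rule with worked inputs. Matching violations by the tuple (endpoint_index, location, field_name, violation_type) is an assumption; `step_reward_sketch` and `_key` are illustrative names, and the function omits the unused initial_violations argument.

```python
from typing import Any, Dict, List, Tuple

Violation = Dict[str, Any]


def _key(v: Violation) -> Tuple:
    """Identity of a violation, ignoring description and severity."""
    return (v["endpoint_index"], v["location"], v["field_name"], v["violation_type"])


def step_reward_sketch(prev: List[Violation], new: List[Violation], action_error: bool) -> float:
    if action_error:
        return -0.05
    prev_keys = {_key(v) for v in prev}
    new_keys = {_key(v) for v in new}
    reward = 0.0
    reward += sum(0.2 * v["severity"] for v in prev if _key(v) not in new_keys)
    reward -= sum(0.15 * v["severity"] for v in new if _key(v) not in prev_keys)
    return reward


missing = {"endpoint_index": 0, "location": "response_body",
           "field_name": "created_at", "violation_type": "missing_field", "severity": 1.0}
wrong_type = {"endpoint_index": 0, "location": "response_body",
              "field_name": "username", "violation_type": "wrong_type", "severity": 0.9}

fix_reward = step_reward_sketch([missing], [], False)                        # 0.2 * 1.0
break_reward = step_reward_sketch([missing], [missing, wrong_type], False)   # -(0.15 * 0.9)
error_reward = step_reward_sketch([missing], [missing], True)                # flat penalty
```

Fixing the missing field earns 0.2, introducing the type error costs about 0.135, and a malformed action costs a flat 0.05 regardless of the violation lists.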

Violation Severity Weights

Weighted by problem importance:

| Violation Type | Severity | Reason                          |
|----------------|----------|---------------------------------|
| missing_field  | 1.0      | Breaks contract: top priority   |
| wrong_type     | 0.9      | Type mismatch: critical         |
| wrong_status   | 0.8      | HTTP code error: significant    |
| extra_field    | 0.7      | Forbidden field: less critical  |
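
These weights can be attached when violations are detected. The following is a simplified sketch of a spec comparison, not the detect_violations() in server/graders.py: it treats every golden field as expected, pairs endpoints by position, and omits the description text. The name `detect_violations_sketch` is hypothetical.

```python
from typing import Any, Dict, List

SEVERITY = {
    "missing_field": 1.0,
    "wrong_type": 0.9,
    "wrong_status": 0.8,
    "extra_field": 0.7,
}
BODY_LOCATIONS = ("request_body", "response_body")


def detect_violations_sketch(
    current: List[Dict[str, Any]], golden: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Compare each current endpoint with its golden counterpart."""
    found: List[Dict[str, Any]] = []

    def record(index, location, field_name, violation_type):
        found.append({
            "endpoint_index": index,
            "location": location,
            "field_name": field_name,
            "violation_type": violation_type,
            "severity": SEVERITY[violation_type],
        })

    for index, (cur, gold) in enumerate(zip(current, golden)):
        if cur.get("status_code") != gold.get("status_code"):
            record(index, "status_code", None, "wrong_status")
        for location in BODY_LOCATIONS:
            cur_body = cur.get(location, {})
            gold_body = gold.get(location, {})
            for name, spec in gold_body.items():
                if name not in cur_body:
                    record(index, location, name, "missing_field")
                elif cur_body[name].get("type") != spec.get("type"):
                    record(index, location, name, "wrong_type")
            for name in cur_body:
                if name not in gold_body:
                    record(index, location, name, "extra_field")
    return found


golden = [{"status_code": 201, "response_body": {
    "user_id": {"type": "integer"}, "created_at": {"type": "string"}}}]
current = [{"status_code": 200, "response_body": {
    "user_id": {"type": "string"}, "debug": {"type": "string"}}}]
violations = detect_violations_sketch(current, golden)
```

The sample pair is built to trigger all four violation types once each.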

Episode Scoring (grade_episode())

Computed at episode end. Returns final score in [0.0, 1.0].

def grade_episode(
    current_endpoints: List[Dict[str, Any]],
    golden_endpoints: List[Dict[str, Any]],
    initial_violations: List[Dict[str, Any]],
) -> float:
    """
    Final episode score:
    
    score = (weighted_violations_fixed - weighted_violations_introduced) 
            / total_initial_weight
    
    Clamped to [0.0, 1.0]
    
    1.0 = all violations fixed, no new ones introduced
    0.5 = 50% of violations fixed
    0.0 = no improvement or made things worse
    """
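
The formula in the docstring can be sketched directly on violation lists. This version takes the initial and final violations instead of endpoint specs, matches violations by (endpoint_index, location, field_name, violation_type), and returns 1.0 when there was nothing to fix; all three choices are assumptions for illustration, as is the name `grade_sketch`.

```python
from typing import Any, Dict, List, Tuple

Violation = Dict[str, Any]


def _key(v: Violation) -> Tuple:
    return (v["endpoint_index"], v["location"], v["field_name"], v["violation_type"])


def grade_sketch(initial: List[Violation], final: List[Violation]) -> float:
    """Severity-weighted share of initial violations fixed, minus new ones, clamped."""
    total = sum(v["severity"] for v in initial)
    if total == 0:
        return 1.0  # assumption: an episode with nothing to fix counts as solved
    initial_keys = {_key(v) for v in initial}
    final_keys = {_key(v) for v in final}
    fixed = sum(v["severity"] for v in initial if _key(v) not in final_keys)
    introduced = sum(v["severity"] for v in final if _key(v) not in initial_keys)
    return max(0.0, min(1.0, (fixed - introduced) / total))


missing = {"endpoint_index": 0, "location": "response_body",
           "field_name": "created_at", "violation_type": "missing_field", "severity": 1.0}
wrong_type = {"endpoint_index": 0, "location": "response_body",
              "field_name": "username", "violation_type": "wrong_type", "severity": 0.9}

perfect = grade_sketch([missing], [])                     # everything fixed
partial = grade_sketch([missing], [wrong_type])           # fixed one, introduced one
worse = grade_sketch([missing], [missing, wrong_type])    # negative before clamping
```

In the three cases, `perfect` is 1.0, `partial` is about 0.1 because the introduced type error offsets most of the fix, and `worse` is clamped to 0.0.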

Example Scoring Scenario

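Task: easy (1 violation)

  • Initial violation: missing_field "created_at" (severity=1.0)
  • Step 1: Agent incorrectly changes the type of username to integer (introduces 1 wrong_type violation, severity=0.9)
  • Step 2: Agent adds created_at correctly (fixes the initial violation)
  • Final state: the initial violation is fixed, but the introduced wrong_type violation remains until the episode ends at max_steps

score = (1.0 - 0.9) / 1.0 = 0.1

The introduced violation cancels most of the credit for the fix. A net result below zero would be clamped to 0.0.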


6. Complete RL Loop Example

Scenario: Easy Task

Initial state:

Broken spec: POST /users/register response missing "created_at"
Golden spec: response has user_id, username, created_at

Episode Transcript

RESET request (task_name="easy")
  ↓
Observation #0:
  endpoints: [broken registration endpoint]
  violations: [missing_field "created_at"]
  reward: 0.0
  done: false
  step_count: 0

STEP 1: Agent proposes ADD_FIELD action
  action.kind = "add_field"
  action.endpoint_index = 0
  action.location = "response_body"
  action.field_name = "created_at"
  action.new_value = {"type": "string", "description": "ISO-8601 timestamp"}
  ↓
Environment:
  - Validates action ✓
  - Adds field to response_body
  - Recomputes violations → [] (0 violations!)
  - Computes reward: +0.2 × 1.0 (fixed 1 violation of severity 1.0) = +0.2
          + 0.5 (bonus for all_fixed=true) = +0.7 total
  - Sets done=true (all violations fixed)
  ↓
Observation #1:
  endpoints: [fixed registration endpoint]
  violations: []
  violations_fixed_this_step: 1
  violations_introduced_this_step: 0
  reward: 0.7
  done: true
  step_count: 1

SCORE request
  ↓
score = (1.0 fixed - 0 introduced) / 1.0 initial = 1.0 ✓

Agent succeeds with perfect score!

7. File Structure Summary

server/
├── app.py                    # FastAPI routes, HTTP interface
├── environment.py            # APIContractDebuggerEnv (core RL logic)
├── models.py                 # Pydantic models: DebugAction, DebugObservation, DebugState
├── fixtures.py               # Task definitions (easy, medium, hard)
├── graders.py                # Violation detection + reward/scoring
└── __pycache__/

tests/                         # Unit tests for environment, graders, fixtures

RL_ARCHITECTURE.md             # This file

8. Key Design Principles

  1. Stateful Environment: One episode per task at a time (OpenEnv singleton pattern)

  2. Dense Rewards: Agent gets per-step feedback (not just final score) to guide learning

  3. Severity-Weighted: Different violation types have different weights (missing fields = highest priority)

  4. Action Validation: Invalid actions receive a penalty and return error messages

  5. Deep-Copied State: Endpoints are deep-copied to prevent cross-episode contamination

  6. Observable Violations: Agent sees the exact list of violations (not hidden)

  7. Termination Conditions:

    • Success: All violations fixed
    • Failure: Max steps reached

  8. JSON/REST Interface: Agent communicates via HTTP (language-agnostic)


9. Typical Agent Workflow

import requests

BASE_URL = "http://localhost:7860"

# 1. Reset to start new episode
reset_resp = requests.post(f"{BASE_URL}/reset", json={
    "task_name": "easy",
    "seed": 42
})
obs = reset_resp.json()
print(f"Violations to fix: {len(obs['violations'])}")

# 2. Repeat: observe → decide → act
for step in range(obs['max_steps']):
    if obs['done']:
        break
    
    # Agent decision logic (depends on obs['violations'])
    action = {
        "kind": "add_field",
        "endpoint_index": 0,
        "location": "response_body",
        "field_name": "created_at",
        "new_value": {"type": "string"}
    }
    
    # 3. Apply action
    step_resp = requests.post(f"{BASE_URL}/step", json={"action": action})
    obs = step_resp.json()
    
    print(f"Step {step+1}: reward={obs['reward']}, violations={len(obs['violations'])}")

# 4. Check final score
score_resp = requests.get(f"{BASE_URL}/score")
print(f"Final score: {score_resp.json()['score']}")
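
The hard-coded action in the loop above could be replaced by a small rule-based policy. The sketch below is hypothetical: it assumes the expected type of a missing field can be read from the parenthesized word in the violation description, as in the single example shown in section 3, and it falls back to no_op for violation types whose expected value it cannot infer.

```python
import re
from typing import Any, Dict

NO_OP = {"kind": "no_op", "endpoint_index": 0, "location": "request_body",
         "field_name": None, "new_value": None}


def choose_action(obs: Dict[str, Any]) -> Dict[str, Any]:
    """Pick a fix for the highest-severity violation, or pass."""
    if not obs["violations"]:
        return dict(NO_OP)
    v = max(obs["violations"], key=lambda item: item["severity"])
    target = {
        "endpoint_index": v["endpoint_index"],
        "location": v["location"],
        "field_name": v["field_name"],
    }
    if v["violation_type"] == "missing_field":
        # Assumption: the description carries the expected type as "(string)".
        match = re.search(r"\((\w+)\)", v.get("description", ""))
        field_type = match.group(1) if match else "string"
        return {**target, "kind": "add_field", "new_value": {"type": field_type}}
    if v["violation_type"] == "extra_field":
        return {**target, "kind": "remove_field", "new_value": None}
    # wrong_type / wrong_status: the expected value is not inferred in this sketch.
    return dict(NO_OP)


obs = {"violations": [{
    "endpoint_index": 0,
    "location": "response_body",
    "field_name": "created_at",
    "violation_type": "missing_field",
    "description": "POST /users/register response_body: required field "
                   "'created_at' (string) is missing",
    "severity": 1.0,
}]}
action = choose_action(obs)
```

An LLM-backed agent would replace `choose_action` with a prompt built from the same observation fields.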

10. Future Extensions

Potential enhancements to the RL framework:

  1. Multi-Agent: Support concurrent episodes via session IDs
  2. Curriculum Learning: Dynamically adapt difficulty based on agent performance
  3. Partial Observability: Hide some violations initially to increase challenge
  4. Action Constraints: Limit action space per step (e.g., "fix at most 1 field")
  5. Custom Reward Shaping: Configurable severity weights + bonus structures
  6. State Representation: Multiple formats (JSON, graph, embedding-friendly)

Summary Table

| Concept     | Implementation                 | File           | Purpose                             |
|-------------|--------------------------------|----------------|-------------------------------------|
| Agent       | External AI/LLM                | HTTP client    | Proposes fixes                      |
| Environment | APIContractDebuggerEnv         | environment.py | Simulates faults + validates fixes  |
| State       | DebugObservation + DebugState  | models.py      | Agent observes + internal tracking  |
| Action      | DebugAction                    | models.py      | Fix proposals                       |
| Reward      | step_reward()                  | graders.py     | Dense per-step feedback             |
| Result      | Episode score [0.0, 1.0]       | graders.py     | Final performance metric            |
| Tasks       | Fixtures (easy/medium/hard)    | fixtures.py    | Problem instances                   |
| HTTP API    | FastAPI routes                 | app.py         | Communication interface             |