title: API Contract Debugger
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - rl-environment
  - api-debugging
  - contract-testing

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

API Contract Debugger: OpenEnv Environment

An OpenEnv environment where AI agents debug broken OpenAPI-style contract specifications by proposing targeted field-level corrections.

What Is This?

Backend engineers routinely debug API contract violations: mismatched types, missing required fields, wrong HTTP status codes, and forbidden extra fields leaking into responses. This environment turns that real-world task into a structured RL benchmark.

The agent receives a broken API spec and a list of violations. Each step, it proposes one fix. It gets rewarded for each violation resolved and penalised for introducing new ones.
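The loop described above can be sketched as follows. This is a minimal illustration, not the project's own agent: `run_episode`, `reset`, `step`, and `policy` are hypothetical names, and the sketch assumes each observation is a dict carrying the `done`, `reward`, `step_count`, and `max_steps` fields listed under Observation Space.

```python
from typing import Any, Callable, Dict

Observation = Dict[str, Any]
Action = Dict[str, Any]


def run_episode(
    reset: Callable[[], Observation],
    step: Callable[[Action], Observation],
    policy: Callable[[Observation], Action],
) -> float:
    """Run one episode and return the sum of per-step rewards."""
    obs = reset()
    total_reward = 0.0
    # Stop when the environment reports done or the step budget is spent.
    while not obs.get("done", False) and obs["step_count"] < obs["max_steps"]:
        # The policy proposes exactly one fix per step.
        obs = step(policy(obs))
        total_reward += obs["reward"]
    return total_reward
```

Passing `reset` and `step` as callables keeps the loop independent of the transport, so the same function works against the HTTP API or an in-memory stand-in during tests.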


Action Space

{
  "kind": "add_field | remove_field | change_type | change_status | no_op",
  "endpoint_index": 0,
  "location": "request_body | response_body | status_code",
  "field_name": "field_name_or_null",
  "new_value": "<type string | field spec dict | int status code | null>"
}
| kind | new_value type | Description |
|---|---|---|
| add_field | {"type": "...", "required": true, "description": "..."} | Add a missing field |
| remove_field | null | Remove a forbidden field |
| change_type | "integer" / "string" / "boolean" / "number" | Fix a field's type |
| change_status | 204 / 200 / 201 etc. | Fix the HTTP status code |
| no_op | null | Do nothing (small implicit cost) |
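A client can catch obviously malformed actions before sending them. The helper below is a hypothetical sketch that mirrors only the table above; the authoritative validation lives in the server's Pydantic models, which may accept or reject more than this does.

```python
from typing import Any, Dict, Optional

KINDS = {"add_field", "remove_field", "change_type", "change_status", "no_op"}
LOCATIONS = {"request_body", "response_body", "status_code"}


def make_action(
    kind: str,
    endpoint_index: int = 0,
    location: str = "response_body",
    field_name: Optional[str] = None,
    new_value: Any = None,
) -> Dict[str, Any]:
    """Build an action dict, rejecting shapes the table above rules out."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    if location not in LOCATIONS:
        raise ValueError(f"unknown location: {location!r}")
    if kind == "add_field" and not isinstance(new_value, dict):
        raise ValueError("add_field expects a field spec dict")
    if kind == "change_type" and not isinstance(new_value, str):
        raise ValueError("change_type expects a type string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if kind == "change_status" and (
        isinstance(new_value, bool) or not isinstance(new_value, int)
    ):
        raise ValueError("change_status expects an integer status code")
    if kind in {"remove_field", "no_op"} and new_value is not None:
        raise ValueError(f"{kind} expects new_value to be null")
    return {
        "kind": kind,
        "endpoint_index": endpoint_index,
        "location": location,
        "field_name": field_name,
        "new_value": new_value,
    }
```

For example, `make_action("change_status", endpoint_index=2, location="status_code", new_value=204)` returns a dict with all five keys, while `make_action("rename_field")` raises `ValueError`.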

Observation Space

| Field | Type | Description |
|---|---|---|
| task_name | str | Active task: easy, medium, hard |
| task_description | str | Plain-English description of violations |
| endpoints | list | Current (partially fixed) endpoint specs |
| violations | list | Remaining violations with type + description |
| violations_fixed_this_step | int | How many the last action resolved |
| violations_introduced_this_step | int | How many the last action introduced |
| total_violations_at_start | int | Violation count at episode start |
| step_count | int | Steps taken so far |
| max_steps | int | Episode step budget |
| last_action_error | str or null | Validation error if action was malformed |
| reward | float | Per-step reward |
| done | bool | Whether the episode has terminated |
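An agent often wants a compact progress summary derived from these fields. The sketch below assumes the observation arrives as a plain dict keyed by the field names above; `summarize` is a hypothetical helper, and its output is not the score computed by the grader.

```python
from typing import Any, Dict


def summarize(obs: Dict[str, Any]) -> Dict[str, Any]:
    """Derive simple progress figures from one observation dict."""
    total = obs["total_violations_at_start"]
    remaining = len(obs["violations"])
    # Introduced violations can push `remaining` above `total`,
    # so the fraction is clamped at zero.
    fraction = 1.0 if total == 0 else max(0.0, (total - remaining) / total)
    return {
        "remaining": remaining,
        "net_fixed": total - remaining,
        "fraction_resolved": fraction,
        "steps_left": obs["max_steps"] - obs["step_count"],
    }
```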

Tasks

Easy (1 endpoint, 1 violation, max 5 steps)

A user registration endpoint is missing created_at (string) in its response. Expected score for a capable agent: 1.0

Medium (3 endpoints, 3 violations, max 10 steps)

An e-commerce API has:

  1. GET /products/{id}: product_id returned as string instead of integer
  2. POST /orders: quantity accepted as string instead of integer
  3. DELETE /orders/{id}: returns status 200 instead of 204

Expected score for a capable agent: 1.0
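The three medium-task fixes map onto the action format roughly as shown below. Several details are assumptions rather than facts from the fixtures: that `endpoint_index` follows the order in which the endpoints are listed, that "returned" means `response_body` and "accepted" means `request_body`, and that a status fix uses the `status_code` location with a null `field_name`.

```python
import json
from typing import Any, Dict, List

# Assumption: endpoint_index 0, 1, 2 follows the listing order above.
MEDIUM_FIXES: List[Dict[str, Any]] = [
    {"kind": "change_type", "endpoint_index": 0, "location": "response_body",
     "field_name": "product_id", "new_value": "integer"},
    {"kind": "change_type", "endpoint_index": 1, "location": "request_body",
     "field_name": "quantity", "new_value": "integer"},
    {"kind": "change_status", "endpoint_index": 2, "location": "status_code",
     "field_name": None, "new_value": 204},
]


def step_payloads(actions: List[Dict[str, Any]]) -> List[str]:
    """Wrap each action in the {"action": {...}} body that POST /step expects."""
    return [json.dumps({"action": action}) for action in actions]


for body in step_payloads(MEDIUM_FIXES):
    print(body)
```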

Hard (4 endpoints, 6 violations, max 15 steps)

An auth + profile API has:

  1. POST /auth/login: missing refresh_token in response
  2. POST /auth/login: expires_in is string instead of integer
  3. GET /users/{id}/profile: missing created_at in response
  4. GET /users/{id}/profile: exposes forbidden password_hash field (must be removed)
  5. PATCH /users/{id}/profile: returns status 500 instead of 200
  6. PATCH /users/{id}/profile: missing updated_at in response

Expected score for a capable agent: 0.7-1.0 (frontier models)


Reward Function

| Event | Reward |
|---|---|
| Fix a violation | +0.2 × severity |
| Introduce a violation | -0.15 × severity |
| Malformed action | -0.05 |
| Solve all violations | +0.5 bonus |

Severity weights: missing_field=1.0, wrong_type=0.9, wrong_status=0.8, extra_field=0.7

Final episode score is computed by grade_episode(), which returns a float in [0.0, 1.0].
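The per-step reward table and severity weights can be worked through in a few lines. This is a sketch under one assumption, namely that the four terms simply add up within a step; `step_reward` is a hypothetical name, and the formula inside grade_episode() is not reproduced here.

```python
from typing import Iterable

SEVERITY = {
    "missing_field": 1.0,
    "wrong_type": 0.9,
    "wrong_status": 0.8,
    "extra_field": 0.7,
}


def step_reward(
    fixed: Iterable[str],
    introduced: Iterable[str],
    malformed: bool = False,
    all_solved: bool = False,
) -> float:
    """Combine the reward table's terms for one step (assumed additive)."""
    reward = sum(0.2 * SEVERITY[kind] for kind in fixed)
    reward -= sum(0.15 * SEVERITY[kind] for kind in introduced)
    if malformed:
        reward -= 0.05
    if all_solved:
        reward += 0.5
    return reward


# Easy task: one missing_field fixed, which also solves the episode.
# 0.2 * 1.0 + 0.5 gives about 0.70.
print(round(step_reward(["missing_field"], [], all_solved=True), 2))
```

Under the same assumption, fixing a single wrong_type violation mid-episode yields 0.2 × 0.9 = 0.18, and introducing an extra_field violation costs 0.15 × 0.7 = 0.105.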


API Endpoints

| Method | Path | Description |
|---|---|---|
| POST | /reset | Reset environment. Body: {"task_name": "easy"}; "medium" and "hard" are also accepted |
| POST | /step | Apply one action. Body: {"action": {...}} |
| GET | /state | Full internal state |
| GET | /score | Final episode score |
| GET | /tasks | List all available tasks |
| GET | /health | Health check |
| GET | /schema | JSON schemas for action + observation |
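These endpoints can be called from Python with the standard library alone. The sketch below assumes every endpoint responds with JSON, which is not stated above; `build_request` and `call` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_request(
    base_url: str,
    method: str,
    path: str,
    body: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build a request for one endpoint, JSON-encoding the body if given."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def call(
    base_url: str,
    method: str,
    path: str,
    body: Optional[Dict[str, Any]] = None,
    timeout: float = 30.0,
) -> Any:
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_request(base_url, method, path, body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `call("http://localhost:7860", "POST", "/reset", {"task_name": "easy"})` starts an episode and `call("http://localhost:7860", "GET", "/score")` reads the final score.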

Setup & Usage

Installation

# Clone the repository
git clone <your-repo-url>
cd api-contract-debugger

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

Run locally

# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload

The server will be available at http://localhost:7860

Run with Docker

docker build -t api-contract-debugger .
docker run -p 7860:7860 api-contract-debugger

Run tests

# Run entire test suite (56 tests)
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=server

Run the baseline agent

The baseline agent uses an LLM (via the OpenAI client) to propose fixes.

Required environment variables (must be set):

export HF_TOKEN="your_huggingface_api_token"     # Get from huggingface.co/settings/tokens
export ENV_BASE_URL="http://localhost:7860"      # Environment server URL
export TASK_NAME="all"                           # "easy", "medium", "hard", or "all"

Optional environment variables (have defaults):

export API_BASE_URL="https://router.huggingface.co/v1"      # LLM endpoint
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"               # Model identifier
export LOCAL_IMAGE_NAME="optional_docker_image"             # For docker image initialization

Then run the agent:

python inference.py

Example output:

[START] task=easy env=api_contract_debugger model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action={"kind":"add_field",...} reward=0.70 done=true error=null
[END] success=true steps=1 score=1.000 rewards=0.70
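A custom agent can read the same environment variables. The sketch below is not taken from inference.py: `load_config` is a hypothetical helper, and it assumes the values shown for API_BASE_URL and MODEL_NAME above are the defaults. LOCAL_IMAGE_NAME has no documented default, so it is included only when set.

```python
import os
from typing import Dict, Mapping

REQUIRED = ("HF_TOKEN", "ENV_BASE_URL", "TASK_NAME")

# Assumption: these mirror the example values listed as optional above.
DEFAULTS = {
    "API_BASE_URL": "https://router.huggingface.co/v1",
    "MODEL_NAME": "Qwen/Qwen2.5-72B-Instruct",
}


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect agent settings, failing early when a required one is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        config[name] = env.get(name, default)
    # Pass LOCAL_IMAGE_NAME through only when the caller has set it.
    if env.get("LOCAL_IMAGE_NAME"):
        config["LOCAL_IMAGE_NAME"] = env["LOCAL_IMAGE_NAME"]
    return config
```

Accepting any mapping rather than reading `os.environ` directly makes the helper easy to exercise in tests with a plain dict.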

Test individual endpoints

# Health check
curl http://localhost:7860/health

# List available tasks
curl http://localhost:7860/tasks

# Reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_name":"easy"}'

# Apply an action
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "kind": "add_field",
      "endpoint_index": 0,
      "location": "response_body",
      "field_name": "created_at",
      "new_value": {"type": "string", "description": "ISO-8601 timestamp"}
    }
  }'

# Get final score
curl http://localhost:7860/score

Baseline Scores

| Task | Model | Score | Steps Used |
|---|---|---|---|
| easy | Qwen2.5-72B-Instruct | 1.000 | 1 |
| medium | Qwen2.5-72B-Instruct | 1.000 | 3 |
| hard | Qwen2.5-72B-Instruct | ~0.85 | 12 |

Project Structure

api-contract-debugger/
├── server/
│   ├── __init__.py
│   ├── app.py          # FastAPI app, route registration
│   ├── environment.py  # OpenEnv Environment subclass
│   ├── models.py       # Pydantic Action / Observation / State
│   ├── graders.py      # Violation detection + reward shaping
│   └── fixtures.py     # Task definitions (broken + golden specs)
├── tests/
│   └── test_env.py     # 56 unit tests covering all components
├── inference.py        # Baseline LLM-powered agent
├── openenv.yaml        # OpenEnv metadata
├── pyproject.toml      # Package configuration
├── requirements.txt    # Python dependencies
├── Dockerfile          # Container image configuration
└── RL_ARCHITECTURE.md  # Complete RL framework documentation

Documentation

RL_ARCHITECTURE.md

Comprehensive guide to the reinforcement learning implementation:

  • Agent: How external AI systems interact with the environment via HTTP API
  • Environment: Core APIContractDebuggerEnv class and episode lifecycle
  • State: Observation space and full internal state representation
  • Action: All 5 action types with validation rules and examples
  • Reward & Scoring: Dense per-step rewards and episode grading formula
  • Complete example episode transcript with JSON payloads
  • Python agent pseudocode for custom implementations