
title: API Contract Debugger
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - rl-environment
  - api-debugging
  - contract-testing

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

API Contract Debugger — OpenEnv Environment

An OpenEnv environment where AI agents debug broken OpenAPI-style contract specifications by proposing targeted field-level corrections.

What Is This?

Backend engineers routinely debug API contract violations: mismatched types, missing required fields, wrong HTTP status codes, and forbidden extra fields leaking into responses. This environment turns that real-world task into a structured RL benchmark.

The agent receives a broken API spec and a list of violations. Each step, it proposes one fix. It gets rewarded for each violation resolved and penalised for introducing new ones.

Action Space

{
  "kind": "add_field | remove_field | change_type | change_status | no_op",
  "endpoint_index": 0,
  "location": "request_body | response_body | status_code",
  "field_name": "field_name_or_null",
  "new_value": "<type string | field spec dict | int status code | null>"
}

`kind`	`new_value` type	Description
`add_field`	`{"type": "...", "required": true, "description": "..."}`	Add a missing field
`remove_field`	`null`	Remove a forbidden field
`change_type`	`"integer"` / `"string"` / `"boolean"` / `"number"`	Fix a field's type
`change_status`	`204` / `200` / `201` etc.	Fix the HTTP status code
`no_op`	`null`	Do nothing (small implicit cost)
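
The pairing of `kind` and `new_value` in the table can be checked on the client before an action is sent. The following is a minimal sketch, not the server's validation logic: the helper name `validate_action` is hypothetical, and the rules are inferred only from the table above.

```python
from typing import Any, Optional

KINDS = {"add_field", "remove_field", "change_type", "change_status", "no_op"}
FIELD_TYPES = {"integer", "string", "boolean", "number"}


def validate_action(action: dict[str, Any]) -> Optional[str]:
    """Return an error message, or None if the action looks well-formed.

    Hypothetical client-side check; the server may apply different rules.
    """
    kind = action.get("kind")
    new_value = action.get("new_value")
    if kind not in KINDS:
        return f"unknown kind: {kind!r}"
    if kind == "add_field":
        # A field spec dict must at least name the field's type.
        if not (isinstance(new_value, dict) and "type" in new_value):
            return "add_field needs a field spec dict with a 'type' key"
    elif kind == "change_type":
        if not isinstance(new_value, str) or new_value not in FIELD_TYPES:
            return f"change_type needs one of {sorted(FIELD_TYPES)}"
    elif kind == "change_status":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(new_value, int) or isinstance(new_value, bool):
            return "change_status needs an integer status code"
    elif new_value is not None:
        # remove_field and no_op both take null.
        return f"{kind} expects new_value to be null"
    return None
```

Rejecting a malformed action locally avoids spending a step on it, since the environment penalises malformed actions.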

Observation Space

Field	Type	Description
`task_name`	str	Active task: `easy`, `medium`, `hard`
`task_description`	str	Plain-English description of violations
`endpoints`	list	Current (partially fixed) endpoint specs
`violations`	list	Remaining violations with type + description
`violations_fixed_this_step`	int	How many the last action resolved
`violations_introduced_this_step`	int	How many the last action introduced
`total_violations_at_start`	int	Violation count at episode start
`step_count`	int	Steps taken so far
`max_steps`	int	Episode step budget
`last_action_error`	str|null	Validation error if action was malformed
`reward`	float	Per-step reward
`done`	bool	Whether the episode has terminated
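
On the agent side, the observation fields above can be mirrored in a small typed container. This is a sketch under assumptions: `ObservationView` and its helper properties are hypothetical names, it is not the Pydantic model in `server/models.py`, and it assumes the observation arrives as a flat JSON object with exactly the keys listed in the table.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ObservationView:
    """Hypothetical client-side mirror of the observation fields listed above."""

    task_name: str
    task_description: str
    endpoints: list
    violations: list
    violations_fixed_this_step: int
    violations_introduced_this_step: int
    total_violations_at_start: int
    step_count: int
    max_steps: int
    last_action_error: Optional[str]
    reward: float
    done: bool

    @classmethod
    def from_dict(cls, payload: dict[str, Any]) -> "ObservationView":
        # Keep only known keys so unexpected extras in the payload are ignored.
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in names})

    @property
    def steps_remaining(self) -> int:
        return self.max_steps - self.step_count

    @property
    def fraction_resolved(self) -> float:
        # Clamped at 0.0 because introduced violations can exceed the start count.
        if self.total_violations_at_start == 0:
            return 1.0
        remaining = len(self.violations)
        return max(0.0, 1.0 - remaining / self.total_violations_at_start)
```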

Tasks

Easy (1 endpoint, 1 violation, max 5 steps)

A user registration endpoint is missing created_at (string) in its response. Expected score for a capable agent: 1.0

Medium (3 endpoints, 3 violations, max 10 steps)

An e-commerce API has:

- `GET /products/{id}`: `product_id` is returned as a string instead of an integer
- `POST /orders`: `quantity` is accepted as a string instead of an integer
- `DELETE /orders/{id}`: returns status 200 instead of 204

Expected score for a capable agent: 1.0
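
For illustration, the three medium-task fixes could be expressed as the actions below. This is a sketch under stated assumptions: the endpoint indices are taken to follow the order listed above (0, 1, 2), `product_id` is assumed to live in the response body and `quantity` in the request body, and `field_name` is assumed to be null for a status change. Check the `endpoints` list in the observation before relying on any of these. The names `MEDIUM_FIXES` and `step_bodies` are hypothetical.

```python
import json

# Assumed endpoint order: 0 = GET /products/{id}, 1 = POST /orders,
# 2 = DELETE /orders/{id}. Verify against the observation's `endpoints` list.
MEDIUM_FIXES = [
    {   # product_id: string -> integer (assumed to be in the response body)
        "kind": "change_type",
        "endpoint_index": 0,
        "location": "response_body",
        "field_name": "product_id",
        "new_value": "integer",
    },
    {   # quantity: string -> integer (assumed to be in the request body)
        "kind": "change_type",
        "endpoint_index": 1,
        "location": "request_body",
        "field_name": "quantity",
        "new_value": "integer",
    },
    {   # DELETE should return 204 rather than 200
        "kind": "change_status",
        "endpoint_index": 2,
        "location": "status_code",
        "field_name": None,
        "new_value": 204,
    },
]


def step_bodies(actions: list) -> list:
    """Wrap each action in the {"action": ...} envelope used by POST /step."""
    return [json.dumps({"action": action}) for action in actions]
```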

Hard (4 endpoints, 6 violations, max 15 steps)

An auth + profile API has:

- `POST /auth/login`: missing `refresh_token` in response
- `POST /auth/login`: `expires_in` is a string instead of an integer
- `GET /users/{id}/profile`: missing `created_at` in response
- `GET /users/{id}/profile`: exposes the forbidden `password_hash` field (must be removed)
- `PATCH /users/{id}/profile`: returns status 500 instead of 200
- `PATCH /users/{id}/profile`: missing `updated_at` in response

Expected score for a capable agent: 0.7–1.0 (frontier models)

Reward Function

Event	Reward
Fix a violation	`+0.2 × severity`
Introduce a violation	`−0.15 × severity`
Malformed action	`−0.05`
Solve all violations	`+0.5` bonus

Severity weights: missing_field=1.0, wrong_type=0.9, wrong_status=0.8, extra_field=0.7
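
The per-step arithmetic implied by the table and severity weights can be worked through in a few lines. This is a sketch, not the code in `server/graders.py`: the function name `step_reward` is hypothetical, and it assumes the four components simply add up within a step, with no clipping.

```python
SEVERITY = {
    "missing_field": 1.0,
    "wrong_type": 0.9,
    "wrong_status": 0.8,
    "extra_field": 0.7,
}


def step_reward(fixed, introduced, malformed=False, all_solved=False) -> float:
    """Combine the reward table additively (an assumption, see the text above).

    `fixed` and `introduced` are lists of violation type names.
    """
    reward = sum(0.2 * SEVERITY[v] for v in fixed)
    reward -= sum(0.15 * SEVERITY[v] for v in introduced)
    if malformed:
        reward -= 0.05
    if all_solved:
        reward += 0.5
    # Round to hide floating-point noise such as 0.18000000000000002.
    return round(reward, 4)


# Easy task: one missing field fixed, which also solves the episode.
easy = step_reward(["missing_field"], [], all_solved=True)   # 0.2 * 1.0 + 0.5
# A type fix that accidentally introduces an extra field.
mixed = step_reward(["wrong_type"], ["extra_field"])         # 0.18 - 0.105
```

Under this additive assumption the easy-task value works out to 0.7, which agrees with the `reward=0.70` shown in the example output of the baseline agent.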

The final episode score is computed by `grade_episode()`, which returns a float in [0.0, 1.0].

API Endpoints

Method	Path	Description
`POST`	`/reset`	Reset environment. Body: `{"task_name": "easy|medium|hard"}`
`POST`	`/step`	Apply one action. Body: `{"action": {...}}`
`GET`	`/state`	Full internal state
`GET`	`/score`	Final episode score
`GET`	`/tasks`	List all available tasks
`GET`	`/health`	Health check
`GET`	`/schema`	JSON schemas for action + observation
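
The endpoints above can be called from Python with only the standard library. The sketch below builds the requests separately from sending them, so the request shape can be inspected without a running server. The helper names (`build_request`, `call`, `reset_request`, `step_request`) are hypothetical, and the response is assumed to be a JSON document whose exact shape is not specified here.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://localhost:7860"  # the port used when running locally


def build_request(method: str, path: str, body: Optional[dict] = None,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a request for one of the endpoints in the table above."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(base_url + path, data=data,
                                  headers=headers, method=method)


def call(request: urllib.request.Request, timeout: float = 10.0) -> Any:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def reset_request(task_name: str) -> urllib.request.Request:
    return build_request("POST", "/reset", {"task_name": task_name})


def step_request(action: dict) -> urllib.request.Request:
    return build_request("POST", "/step", {"action": action})
```

With the server running, `call(reset_request("easy"))` starts an episode and `call(build_request("GET", "/score"))` reads the final score.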

Setup & Usage

Installation

# Clone the repository
git clone <your-repo-url>
cd api-contract-debugger

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

Run locally

# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload

The server will be available at http://localhost:7860

Run with Docker

docker build -t api-contract-debugger .
docker run -p 7860:7860 api-contract-debugger

Run tests

# Run entire test suite (56 tests)
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=server

Run the baseline agent

The baseline agent uses an LLM (via the OpenAI client) to propose fixes.

Required environment variables (must be set):

export HF_TOKEN="your_huggingface_api_token"     # Get from huggingface.co/settings/tokens
export ENV_BASE_URL="http://localhost:7860"      # Environment server URL
export TASK_NAME="all"                           # "easy", "medium", "hard", or "all"

Optional environment variables (have defaults):

export API_BASE_URL="https://router.huggingface.co/v1"      # LLM endpoint
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"               # Model identifier
export LOCAL_IMAGE_NAME="optional_docker_image"             # For docker image initialization
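
A custom agent could read the same variables as follows. This is a sketch rather than the logic in `inference.py`: `AgentConfig` and `load_config` are hypothetical names, the values shown above for the optional variables are assumed to be their defaults, and `LOCAL_IMAGE_NAME` is left out because its default is not stated.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("HF_TOKEN", "ENV_BASE_URL", "TASK_NAME")


@dataclass(frozen=True)
class AgentConfig:
    hf_token: str
    env_base_url: str
    task_name: str
    api_base_url: str
    model_name: str


def load_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    """Read agent settings, failing early if a required variable is unset."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return AgentConfig(
        hf_token=env["HF_TOKEN"],
        env_base_url=env["ENV_BASE_URL"],
        task_name=env["TASK_NAME"],
        # Assumed defaults, taken from the optional variables listed above.
        api_base_url=env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        model_name=env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
    )
```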

Then run the agent:

python inference.py

Example output:

[START] task=easy env=api_contract_debugger model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action={"kind":"add_field",...} reward=0.70 done=true error=null
[END] success=true steps=1 score=1.000 rewards=0.70
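
When collecting results across runs, the `[END]` line can be parsed into a record. The sketch below matches only the format shown in the example above. The name `parse_end_line` is hypothetical, and the handling of multi-step episodes is an assumption: `rewards` is treated as a comma-separated list, although only a single value appears in the example.

```python
import re
from typing import Optional

END_PATTERN = re.compile(
    r"^\[END\] success=(?P<success>true|false) steps=(?P<steps>\d+) "
    r"score=(?P<score>[0-9.]+) rewards=(?P<rewards>\S*)$"
)


def parse_end_line(line: str) -> Optional[dict]:
    """Parse an [END] log line; return None for any other line."""
    match = END_PATTERN.match(line.strip())
    if match is None:
        return None
    # Assumption: several per-step rewards would be separated by commas.
    rewards = [float(r) for r in match.group("rewards").split(",") if r]
    return {
        "success": match.group("success") == "true",
        "steps": int(match.group("steps")),
        "score": float(match.group("score")),
        "rewards": rewards,
    }
```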

Test individual endpoints

# Health check
curl http://localhost:7860/health

# List available tasks
curl http://localhost:7860/tasks

# Reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_name":"easy"}'

# Apply an action
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "kind": "add_field",
      "endpoint_index": 0,
      "location": "response_body",
      "field_name": "created_at",
      "new_value": {"type": "string", "description": "ISO-8601 timestamp"}
    }
  }'

# Get final score
curl http://localhost:7860/score

Baseline Scores

Task	Model	Score	Steps Used
easy	Qwen2.5-72B-Instruct	1.000	1
medium	Qwen2.5-72B-Instruct	1.000	3
hard	Qwen2.5-72B-Instruct	~0.85	12

Project Structure

api-contract-debugger/
├── server/
│   ├── __init__.py
│   ├── app.py          # FastAPI app, route registration
│   ├── environment.py  # OpenEnv Environment subclass
│   ├── models.py       # Pydantic Action / Observation / State
│   ├── graders.py      # Violation detection + reward shaping
│   └── fixtures.py     # Task definitions (broken + golden specs)
├── tests/
│   └── test_env.py     # 56 unit tests covering all components
├── inference.py        # Baseline LLM-powered agent
├── openenv.yaml        # OpenEnv metadata
├── pyproject.toml      # Package configuration
├── requirements.txt    # Python dependencies
├── Dockerfile          # Container image configuration
└── RL_ARCHITECTURE.md  # Complete RL framework documentation

Documentation

RL_ARCHITECTURE.md

Comprehensive guide to the reinforcement learning implementation:

- Agent: how external AI systems interact with the environment via the HTTP API
- Environment: the core `APIContractDebuggerEnv` class and the episode lifecycle
- State: the observation space and the full internal state representation
- Action: all 5 action types with validation rules and examples
- Reward & Scoring: dense per-step rewards and the episode grading formula
- A complete example episode transcript with JSON payloads
- Python agent pseudocode for custom implementations