---
title: API Contract Debugger
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
tags:
- openenv
- rl-environment
- api-debugging
- contract-testing
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# API Contract Debugger - OpenEnv Environment
An OpenEnv environment where AI agents debug broken OpenAPI-style contract
specifications by proposing targeted field-level corrections.
## What Is This?
Every backend engineer debugs API contract violations constantly: mismatched
types, missing required fields, wrong HTTP status codes, and forbidden extra
fields leaking into responses. This environment turns that real-world task into
a structured RL benchmark.
The agent receives a broken API spec and a list of violations. Each step, it
proposes one fix. It gets rewarded for each violation resolved and penalised
for introducing new ones.
---
## Action Space
```json
{
"kind": "add_field | remove_field | change_type | change_status | no_op",
"endpoint_index": 0,
"location": "request_body | response_body | status_code",
"field_name": "field_name_or_null",
"new_value": "<type string | field spec dict | int status code | null>"
}
```
| `kind` | `new_value` type | Description |
|-----------------|-----------------|-------------|
| `add_field` | `{"type": "...", "required": true, "description": "..."}` | Add a missing field |
| `remove_field` | `null` | Remove a forbidden field |
| `change_type` | `"integer"` / `"string"` / `"boolean"` / `"number"` | Fix a field's type |
| `change_status` | `204` / `200` / `201` etc. | Fix the HTTP status code |
| `no_op` | `null` | Do nothing (small implicit cost) |
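
The pairing of `kind` and `new_value` in the table above can be checked on the client before an action is sent. The following is a minimal sketch, not the server's own validation: `check_action` is a hypothetical helper, and the rules it enforces are only those readable from the schema and table in this section.

```python
from typing import Any, Optional

# Allowed values taken from the action schema and table above.
KINDS = {"add_field", "remove_field", "change_type", "change_status", "no_op"}
LOCATIONS = {"request_body", "response_body", "status_code"}
FIELD_TYPES = {"integer", "string", "boolean", "number"}


def check_action(action: dict[str, Any]) -> Optional[str]:
    """Return an error message for a malformed action, or None if it looks valid.

    Hypothetical client-side check; the server's validation may be stricter.
    """
    kind = action.get("kind")
    value = action.get("new_value")
    if kind not in KINDS:
        return f"unknown kind: {kind!r}"
    if kind == "no_op":
        return None
    if action.get("location") not in LOCATIONS:
        return f"unknown location: {action.get('location')!r}"
    if kind == "add_field":
        if not (isinstance(value, dict) and "type" in value):
            return "add_field expects a field spec dict with a 'type' key"
    elif kind == "remove_field":
        if value is not None:
            return "remove_field expects new_value to be null"
    elif kind == "change_type":
        if not isinstance(value, str) or value not in FIELD_TYPES:
            return f"change_type expects one of {sorted(FIELD_TYPES)}"
    elif kind == "change_status":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return "change_status expects an integer status code"
    return None
```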
---
## Observation Space
| Field | Type | Description |
|-------|------|-------------|
| `task_name` | str | Active task: `easy`, `medium`, or `hard` |
| `task_description` | str | Plain-English description of violations |
| `endpoints` | list | Current (partially fixed) endpoint specs |
| `violations` | list | Remaining violations with type + description |
| `violations_fixed_this_step` | int | How many the last action resolved |
| `violations_introduced_this_step` | int | How many the last action introduced |
| `total_violations_at_start` | int | Violation count at episode start |
| `step_count` | int | Steps taken so far |
| `max_steps` | int | Episode step budget |
| `last_action_error` | str\|null | Validation error if action was malformed |
| `reward` | float | Per-step reward |
| `done` | bool | Whether the episode has terminated |
---
## Tasks
### Easy (1 endpoint, 1 violation, max 5 steps)
A user registration endpoint is missing `created_at` (string) in its response.
Expected score for a capable agent: **1.0**
### Medium (3 endpoints, 3 violations, max 10 steps)
An e-commerce API has:
1. `GET /products/{id}`: `product_id` returned as `string` instead of `integer`
2. `POST /orders`: `quantity` accepted as `string` instead of `integer`
3. `DELETE /orders/{id}`: returns status `200` instead of `204`
Expected score for a capable agent: **1.0**
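
As an illustration, the three medium-task violations map onto three actions. This sketch rests on assumptions that the task listing does not confirm: that `endpoint_index` follows the order of the list above, that "returned" means `response_body` and "accepted" means `request_body`, and that `field_name` is null for a status-code change. `step_bodies` is a hypothetical helper.

```python
import json

# Assumed mapping from the medium-task list to actions; the indices and
# locations below are guesses based on the listing order and wording.
MEDIUM_FIXES = [
    {"kind": "change_type", "endpoint_index": 0, "location": "response_body",
     "field_name": "product_id", "new_value": "integer"},
    {"kind": "change_type", "endpoint_index": 1, "location": "request_body",
     "field_name": "quantity", "new_value": "integer"},
    {"kind": "change_status", "endpoint_index": 2, "location": "status_code",
     "field_name": None, "new_value": 204},
]


def step_bodies(fixes):
    """Wrap each action in the {"action": {...}} body that POST /step takes."""
    return [json.dumps({"action": fix}) for fix in fixes]


if __name__ == "__main__":
    for body in step_bodies(MEDIUM_FIXES):
        print(body)
```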
### Hard (4 endpoints, 6 violations, max 15 steps)
An auth + profile API has:
1. `POST /auth/login`: missing `refresh_token` in response
2. `POST /auth/login`: `expires_in` is `string` instead of `integer`
3. `GET /users/{id}/profile`: missing `created_at` in response
4. `GET /users/{id}/profile`: exposes forbidden `password_hash` field (must be removed)
5. `PATCH /users/{id}/profile`: returns status `500` instead of `200`
6. `PATCH /users/{id}/profile`: missing `updated_at` in response
Expected score for a capable agent: **0.7–1.0** (frontier models)
---
## Reward Function
| Event | Reward |
|-------|--------|
| Fix a violation | `+0.2 × severity` |
| Introduce a violation | `−0.15 × severity` |
| Malformed action | `−0.05` |
| Solve all violations | `+0.5` bonus |
Severity weights: `missing_field=1.0`, `wrong_type=0.9`, `wrong_status=0.8`, `extra_field=0.7`
The final episode score is computed by `grade_episode()` and is a float in `[0.0, 1.0]`.
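
The per-step reward table can be worked through in a few lines. This is a sketch that assumes the events of one step simply add up; `step_reward` is a hypothetical name and the actual logic in `graders.py` may differ. Under that assumption, fixing the single `missing_field` violation of the easy task gives `0.2 × 1.0 + 0.5 = 0.70`, which matches the `reward=0.70` in the sample agent log.

```python
# Severity weights and reward constants copied from this section.
SEVERITY = {
    "missing_field": 1.0,
    "wrong_type": 0.9,
    "wrong_status": 0.8,
    "extra_field": 0.7,
}


def step_reward(fixed=(), introduced=(), malformed=False, solved_all=False) -> float:
    """Combine the reward-table events for one step (assumed to be additive)."""
    reward = sum((0.2 * SEVERITY[v] for v in fixed), 0.0)
    reward -= sum((0.15 * SEVERITY[v] for v in introduced), 0.0)
    if malformed:
        reward -= 0.05
    if solved_all:
        reward += 0.5
    # Round to hide floating-point noise such as 0.18000000000000002.
    return round(reward, 4)


if __name__ == "__main__":
    print(step_reward(fixed=["missing_field"], solved_all=True))
    print(step_reward(fixed=["wrong_type"]))
    print(step_reward(introduced=["extra_field"]))
```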
---
## API Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/reset` | Reset environment. Body: `{"task_name": "easy\|medium\|hard"}` |
| `POST` | `/step` | Apply one action. Body: `{"action": {...}}` |
| `GET` | `/state` | Full internal state |
| `GET` | `/score` | Final episode score |
| `GET` | `/tasks` | List all available tasks |
| `GET` | `/health`| Health check |
| `GET` | `/schema`| JSON schemas for action + observation |
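
The endpoints above can also be called from Python using only the standard library. This is a minimal sketch: `build_request` and `call` are hypothetical helpers, the base URL is the local address used elsewhere in this README, and JSON responses are assumed.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://localhost:7860"  # local server address from this README


def build_request(path: str, payload: Optional[dict[str, Any]] = None,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET request when payload is None, otherwise a JSON POST."""
    if payload is None:
        return urllib.request.Request(base_url + path, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: Optional[dict[str, Any]] = None) -> Any:
    """Send the request and decode the response (assumes a JSON body)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# With the server running, an episode could start like this:
#   observation = call("/reset", {"task_name": "easy"})
#   score = call("/score")
```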
---
## Setup & Usage
### Installation
```bash
# Clone the repository
git clone <your-repo-url>
cd api-contract-debugger
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
```
### Run locally
```bash
# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
```
The server is then available at `http://localhost:7860`.
### Run with Docker
```bash
docker build -t api-contract-debugger .
docker run -p 7860:7860 api-contract-debugger
```
### Run tests
```bash
# Run entire test suite (56 tests)
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=server
```
### Run the baseline agent
The baseline agent uses an LLM (via the OpenAI client) to propose fixes.
**Required environment variables** (must be set):
```bash
export HF_TOKEN="your_huggingface_api_token" # Get from huggingface.co/settings/tokens
export ENV_BASE_URL="http://localhost:7860" # Environment server URL
export TASK_NAME="all" # "easy", "medium", "hard", or "all"
```
**Optional environment variables** (have defaults):
```bash
export API_BASE_URL="https://router.huggingface.co/v1" # LLM endpoint
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct" # Model identifier
export LOCAL_IMAGE_NAME="optional_docker_image" # For docker image initialization
```
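
For a custom agent, the same variables could be read as sketched below. `load_config` is a hypothetical helper and not necessarily how `inference.py` reads its settings; the defaults are the two values shown above, and `LOCAL_IMAGE_NAME` is left out because only a placeholder is given for it.

```python
import os
from typing import Mapping

REQUIRED = ("HF_TOKEN", "ENV_BASE_URL", "TASK_NAME")
DEFAULTS = {
    "API_BASE_URL": "https://router.huggingface.co/v1",
    "MODEL_NAME": "Qwen/Qwen2.5-72B-Instruct",
}


def load_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect required variables and fill in defaults for the optional ones."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        config[name] = env.get(name, default)
    return config
```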
Then run the agent:
```bash
python inference.py
```
**Example output:**
```
[START] task=easy env=api_contract_debugger model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action={"kind":"add_field",...} reward=0.70 done=true error=null
[END] success=true steps=1 score=1.000 rewards=0.70
```
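
To collect results from several runs, the `[STEP]` lines can be parsed with a regular expression. This sketch assumes the log format is exactly as in the sample above; `parse_step` is a hypothetical helper and the action text is kept as a raw string rather than decoded.

```python
import re
from typing import Any, Optional

# Pattern derived from the sample [STEP] line; other layouts will not match.
STEP_RE = re.compile(
    r"^\[STEP\] step=(\d+) action=(.*) reward=(-?\d+(?:\.\d+)?) "
    r"done=(true|false) error=(.*)$"
)


def parse_step(line: str) -> Optional[dict[str, Any]]:
    """Parse one [STEP] log line into a dict, or return None if it does not match."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    step, action, reward, done, error = match.groups()
    return {
        "step": int(step),
        "action": action,  # raw text; the sample log abbreviates it
        "reward": float(reward),
        "done": done == "true",
        "error": None if error == "null" else error,
    }
```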
### Test individual endpoints
```bash
# Health check
curl http://localhost:7860/health
# List available tasks
curl http://localhost:7860/tasks
# Reset to a task
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_name":"easy"}'
# Apply an action
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{
"action": {
"kind": "add_field",
"endpoint_index": 0,
"location": "response_body",
"field_name": "created_at",
"new_value": {"type": "string", "description": "ISO-8601 timestamp"}
}
}'
# Get final score
curl http://localhost:7860/score
```
---
## Baseline Scores
| Task | Model | Score | Steps Used |
|------|-------|-------|-----------|
| easy | Qwen2.5-72B-Instruct | 1.000 | 1 |
| medium | Qwen2.5-72B-Instruct | 1.000 | 3 |
| hard | Qwen2.5-72B-Instruct | ~0.85 | 12 |
---
## Project Structure
```
api-contract-debugger/
├── server/
│   ├── __init__.py
│   ├── app.py              # FastAPI app, route registration
│   ├── environment.py      # OpenEnv Environment subclass
│   ├── models.py           # Pydantic Action / Observation / State
│   ├── graders.py          # Violation detection + reward shaping
│   └── fixtures.py         # Task definitions (broken + golden specs)
├── tests/
│   └── test_env.py         # 56 unit tests covering all components
├── inference.py            # Baseline LLM-powered agent
├── openenv.yaml            # OpenEnv metadata
├── pyproject.toml          # Package configuration
├── requirements.txt        # Python dependencies
├── Dockerfile              # Container image configuration
└── RL_ARCHITECTURE.md      # Complete RL framework documentation
```
---
## Documentation
### RL_ARCHITECTURE.md
Comprehensive guide to the reinforcement learning implementation:
- **Agent**: How external AI systems interact with the environment via the HTTP API
- **Environment**: Core `APIContractDebuggerEnv` class and episode lifecycle
- **State**: Observation space and full internal state representation
- **Action**: All 5 action types with validation rules and examples
- **Reward & Scoring**: Dense per-step rewards and episode grading formula
- **Complete example episode transcript** with JSON payloads
- **Python agent pseudocode** for custom implementations
---