| --- |
| title: API Contract Debugger |
| emoji: π |
| colorFrom: blue |
| colorTo: indigo |
| sdk: docker |
| app_port: 7860 |
| pinned: false |
| tags: |
| - openenv |
| - rl-environment |
| - api-debugging |
| - contract-testing |
| --- |
| |
| Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference |
|
|
| # API Contract Debugger: OpenEnv Environment |
|
|
| An OpenEnv environment where AI agents debug broken OpenAPI-style contract |
| specifications by proposing targeted field-level corrections. |
|
|
| ## What Is This? |
|
|
| Backend engineers routinely debug API contract violations: mismatched |
| types, missing required fields, wrong HTTP status codes, forbidden extra fields |
| leaking into responses. This environment turns that real-world task into a |
| structured RL benchmark. |
|
|
| The agent receives a broken API spec and a list of violations. Each step, it |
| proposes one fix. It gets rewarded for each violation resolved and penalised |
| for introducing new ones. |
|
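The interaction described above amounts to a simple loop. The sketch below is illustrative only: `run_episode` and `no_op_policy` are hypothetical names, `reset` and `step` stand for whatever callables wrap the environment, and the code assumes each observation is a dict carrying the `done`, `reward`, and `max_steps` fields described under Observation Space.

```python
from typing import Callable, List, Tuple


def no_op_policy(obs: dict) -> dict:
    """Placeholder policy that always does nothing (hypothetical).

    Whether `no_op` needs a meaningful `location` is an assumption here.
    """
    return {
        "kind": "no_op",
        "endpoint_index": 0,
        "location": "response_body",
        "field_name": None,
        "new_value": None,
    }


def run_episode(reset: Callable[[], dict],
                step: Callable[[dict], dict],
                policy: Callable[[dict], dict]) -> Tuple[dict, List[float]]:
    """Propose one fix per step until the episode ends or the budget runs out."""
    obs = reset()
    rewards: List[float] = []
    for _ in range(obs["max_steps"]):
        if obs["done"]:
            break
        obs = step(policy(obs))       # one action per step
        rewards.append(obs["reward"])  # per-step reward from the observation
    return obs, rewards
```

A real agent would replace `no_op_policy` with a function that inspects `obs["violations"]` and returns a targeted fix.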
|
| --- |
|
|
| ## Action Space |
|
|
| ```json |
| { |
| "kind": "add_field | remove_field | change_type | change_status | no_op", |
| "endpoint_index": 0, |
| "location": "request_body | response_body | status_code", |
| "field_name": "field_name_or_null", |
| "new_value": "<type string | field spec dict | int status code | null>" |
| } |
| ``` |
|
|
| | `kind` | `new_value` type | Description | |
| |-----------------|-----------------|-------------| |
| | `add_field` | `{"type": "...", "required": true, "description": "..."}` | Add a missing field | |
| | `remove_field` | `null` | Remove a forbidden field | |
| | `change_type` | `"integer"` / `"string"` / `"boolean"` / `"number"` | Fix a field's type | |
| | `change_status` | `204` / `200` / `201` etc. | Fix the HTTP status code | |
| | `no_op` | `null` | Do nothing (small implicit cost) | |
|
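The table above can be mirrored by a small client-side check before an action is sent. This is a sketch only: `make_action` is a hypothetical helper, not part of the repository, and the rules it enforces are inferred from the table rather than taken from the server's own validation.

```python
from typing import Any, Optional

# Allowed values, copied from the action schema above.
KINDS = {"add_field", "remove_field", "change_type", "change_status", "no_op"}
LOCATIONS = {"request_body", "response_body", "status_code"}
# Only the four types the table lists; the server may accept others.
FIELD_TYPES = {"integer", "string", "boolean", "number"}


def make_action(kind: str, endpoint_index: int, location: str,
                field_name: Optional[str] = None, new_value: Any = None) -> dict:
    """Build an action dict, rejecting shapes the table rules out (hypothetical helper)."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    if location not in LOCATIONS:
        raise ValueError(f"unknown location: {location!r}")
    if kind == "add_field" and not isinstance(new_value, dict):
        raise ValueError("add_field expects a field spec dict")
    if kind == "change_type" and not (isinstance(new_value, str) and new_value in FIELD_TYPES):
        raise ValueError("change_type expects one of the listed type strings")
    if kind == "change_status" and (isinstance(new_value, bool) or not isinstance(new_value, int)):
        raise ValueError("change_status expects an integer status code")
    if kind in {"remove_field", "no_op"} and new_value is not None:
        raise ValueError(f"{kind} expects new_value to be null")
    return {
        "kind": kind,
        "endpoint_index": endpoint_index,
        "location": location,
        "field_name": field_name,
        "new_value": new_value,
    }


action = make_action("change_status", 2, "status_code", new_value=204)
print(action)
```

Catching a malformed action locally avoids spending a step on the malformed-action penalty.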
|
| --- |
|
|
| ## Observation Space |
|
|
| | Field | Type | Description | |
| |-------|------|-------------| |
| | `task_name` | str | Active task: `easy`, `medium`, `hard` | |
| | `task_description` | str | Plain-English description of violations | |
| | `endpoints` | list | Current (partially fixed) endpoint specs | |
| | `violations` | list | Remaining violations with type + description | |
| | `violations_fixed_this_step` | int | How many the last action resolved | |
| | `violations_introduced_this_step` | int | How many the last action introduced | |
| | `total_violations_at_start` | int | Violation count at episode start | |
| | `step_count` | int | Steps taken so far | |
| | `max_steps` | int | Episode step budget | |
| | `last_action_error` | str\|null | Validation error if action was malformed | |
| | `reward` | float | Per-step reward | |
| | `done` | bool | Whether the episode has terminated | |
|
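One way to read an observation is as a progress report. The sketch below derives the fraction of starting violations that are no longer listed; `progress` is a hypothetical helper, it assumes the observation arrives as a plain dict with the keys from the table, and it is not the episode score reported by `/score`.

```python
def progress(obs: dict) -> float:
    """Fraction of the starting violations that are no longer listed."""
    total = obs["total_violations_at_start"]
    if total == 0:
        return 1.0
    remaining = len(obs["violations"])
    # Clamp at 0.0 in case newly introduced violations push the count above the start.
    return max(0.0, 1.0 - remaining / total)
```

For example, an observation with six starting violations and three still listed gives `0.5`.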
|
| --- |
|
|
| ## Tasks |
|
|
| ### Easy (1 endpoint, 1 violation, max 5 steps) |
| A user registration endpoint is missing `created_at` (string) in its response. |
| Expected score for a capable agent: **1.0** |
|
|
| ### Medium (3 endpoints, 3 violations, max 10 steps) |
| An e-commerce API has: |
| 1. `GET /products/{id}`: `product_id` returned as `string` instead of `integer` |
| 2. `POST /orders`: `quantity` accepted as `string` instead of `integer` |
| 3. `DELETE /orders/{id}`: returns status `200` instead of `204` |
|
|
| Expected score for a capable agent: **1.0** |
|
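For illustration, the three medium-task violations map onto actions like the following. The endpoint indices assume the endpoints are indexed from zero in the order listed above, and the `location` of each type fix is inferred from the wording ("returned" versus "accepted"); both should be checked against the `endpoints` list in the actual observation.

```python
import json

# Assumed mapping: index 0 = GET /products/{id}, 1 = POST /orders, 2 = DELETE /orders/{id}.
medium_fixes = [
    {"kind": "change_type", "endpoint_index": 0, "location": "response_body",
     "field_name": "product_id", "new_value": "integer"},
    {"kind": "change_type", "endpoint_index": 1, "location": "request_body",
     "field_name": "quantity", "new_value": "integer"},
    {"kind": "change_status", "endpoint_index": 2, "location": "status_code",
     "field_name": None, "new_value": 204},
]

# Each action is wrapped in the {"action": ...} body that POST /step expects.
for action in medium_fixes:
    print(json.dumps({"action": action}))
```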
|
| ### Hard (4 endpoints, 6 violations, max 15 steps) |
| An auth + profile API has: |
| 1. `POST /auth/login`: missing `refresh_token` in response |
| 2. `POST /auth/login`: `expires_in` is `string` instead of `integer` |
| 3. `GET /users/{id}/profile`: missing `created_at` in response |
| 4. `GET /users/{id}/profile`: exposes forbidden `password_hash` field (must be removed) |
| 5. `PATCH /users/{id}/profile`: returns status `500` instead of `200` |
| 6. `PATCH /users/{id}/profile`: missing `updated_at` in response |
|
|
| Expected score for a capable agent: **0.7–1.0** (frontier models) |
|
|
| --- |
|
|
| ## Reward Function |
|
|
| | Event | Reward | |
| |-------|--------| |
| | Fix a violation | `+0.2 × severity` | |
| | Introduce a violation | `-0.15 × severity` | |
| | Malformed action | `-0.05` | |
| | Solve all violations | `+0.5` bonus | |
|
|
| Severity weights: `missing_field=1.0`, `wrong_type=0.9`, `wrong_status=0.8`, `extra_field=0.7` |
|
|
| Final episode score is computed by `grade_episode()`, which returns a float in `[0.0, 1.0]`. |
|
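As a worked example of the table, the sketch below computes the reward for a single step. It assumes the listed events simply add up within a step, which the table implies but does not state. `step_reward` is a hypothetical helper and is separate from `grade_episode()`, whose formula is not reproduced here.

```python
# Severity weights as listed above.
SEVERITY = {
    "missing_field": 1.0,
    "wrong_type": 0.9,
    "wrong_status": 0.8,
    "extra_field": 0.7,
}


def step_reward(fixed, introduced, malformed=False, all_solved=False):
    """Sum the per-step reward terms (assumed additive) for one action."""
    reward = sum(0.2 * SEVERITY[v] for v in fixed)          # violations resolved
    reward -= sum(0.15 * SEVERITY[v] for v in introduced)   # violations introduced
    if malformed:
        reward -= 0.05                                      # malformed action penalty
    if all_solved:
        reward += 0.5                                       # completion bonus
    return reward


# Easy task: one missing field is fixed, which also solves the episode.
print(f"{step_reward(['missing_field'], [], all_solved=True):.2f}")  # 0.70
```

The result is 0.2 × 1.0 for the fix plus the 0.5 bonus, which matches the `reward=0.70` shown for the easy task in the baseline agent's example output.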
|
| --- |
|
|
| ## API Endpoints |
|
|
| | Method | Path | Description | |
| |--------|------|-------------| |
| | `POST` | `/reset` | Reset environment. Body: `{"task_name": "easy\|medium\|hard"}` | |
| | `POST` | `/step` | Apply one action. Body: `{"action": {...}}` | |
| | `GET` | `/state` | Full internal state | |
| | `GET` | `/score` | Final episode score | |
| | `GET` | `/tasks` | List all available tasks | |
| | `GET` | `/health`| Health check | |
| | `GET` | `/schema`| JSON schemas for action + observation | |
|
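These endpoints can also be driven from Python. The sketch below uses only the standard library. `build_request` and `call` are hypothetical helpers, the base URL assumes a server running locally on port 7860, and nothing is assumed about the responses beyond their being JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: server started locally


def build_request(path, payload=None):
    """Return a GET request, or a JSON POST request when a payload is given."""
    url = BASE_URL + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    data = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def call(path, payload=None):
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Inspect the request that a reset would send, without touching the network.
req = build_request("/reset", {"task_name": "easy"})
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

With the server running, `call("/reset", {"task_name": "easy"})` followed by `call("/step", {"action": {...}})` and `call("/score")` walks through one episode.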
|
| --- |
|
|
| ## Setup & Usage |
|
|
| ### Installation |
|
|
| ```bash |
| # Clone the repository |
| git clone <your-repo-url> |
| cd api-contract-debugger |
| |
| # Create virtual environment |
| python3 -m venv .venv |
| source .venv/bin/activate |
| |
| # Install dependencies |
| pip install --upgrade pip |
| pip install -r requirements.txt |
| ``` |
|
|
| ### Run locally |
|
|
| ```bash |
| # Start the server |
| uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload |
| ``` |
|
|
| The server will be available at `http://localhost:7860` |
|
|
| ### Run with Docker |
|
|
| ```bash |
| docker build -t api-contract-debugger . |
| docker run -p 7860:7860 api-contract-debugger |
| ``` |
|
|
| ### Run tests |
|
|
| ```bash |
| # Run entire test suite (56 tests) |
| pytest tests/ -v |
| |
| # Run with coverage |
| pytest tests/ -v --cov=server |
| ``` |
|
|
| ### Run the baseline agent |
|
|
| The baseline agent uses an LLM (via the OpenAI client) to propose fixes. |
|
|
| **Required environment variables** (must be set): |
| ```bash |
| export HF_TOKEN="your_huggingface_api_token" # Get from huggingface.co/settings/tokens |
| export ENV_BASE_URL="http://localhost:7860" # Environment server URL |
| export TASK_NAME="all" # "easy", "medium", "hard", or "all" |
| ``` |
|
|
| **Optional environment variables** (have defaults): |
| ```bash |
| export API_BASE_URL="https://router.huggingface.co/v1" # LLM endpoint |
| export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct" # Model identifier |
| export LOCAL_IMAGE_NAME="optional_docker_image" # For docker image initialization |
| ``` |
|
|
| Then run the agent: |
| ```bash |
| python inference.py |
| ``` |
|
|
| **Example output:** |
| ``` |
| [START] task=easy env=api_contract_debugger model=Qwen/Qwen2.5-72B-Instruct |
| [STEP] step=1 action={"kind":"add_field",...} reward=0.70 done=true error=null |
| [END] success=true steps=1 score=1.000 rewards=0.70 |
| ``` |
|
|
| ### Test individual endpoints |
|
|
| ```bash |
| # Health check |
| curl http://localhost:7860/health |
| |
| # List available tasks |
| curl http://localhost:7860/tasks |
| |
| # Reset to a task |
| curl -X POST http://localhost:7860/reset \ |
| -H "Content-Type: application/json" \ |
| -d '{"task_name":"easy"}' |
| |
| # Apply an action |
| curl -X POST http://localhost:7860/step \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "action": { |
| "kind": "add_field", |
| "endpoint_index": 0, |
| "location": "response_body", |
| "field_name": "created_at", |
| "new_value": {"type": "string", "description": "ISO-8601 timestamp"} |
| } |
| }' |
| |
| # Get final score |
| curl http://localhost:7860/score |
| ``` |
|
|
| --- |
|
|
| ## Baseline Scores |
|
|
| | Task | Model | Score | Steps Used | |
| |------|-------|-------|-----------| |
| | easy | Qwen2.5-72B-Instruct | 1.000 | 1 | |
| | medium | Qwen2.5-72B-Instruct | 1.000 | 3 | |
| | hard | Qwen2.5-72B-Instruct | ~0.85 | 12 | |
|
|
| --- |
|
|
| ## Project Structure |
|
|
| ``` |
| api-contract-debugger/ |
| ├── server/ |
| │   ├── __init__.py |
| │   ├── app.py            # FastAPI app, route registration |
| │   ├── environment.py    # OpenEnv Environment subclass |
| │   ├── models.py         # Pydantic Action / Observation / State |
| │   ├── graders.py        # Violation detection + reward shaping |
| │   └── fixtures.py       # Task definitions (broken + golden specs) |
| ├── tests/ |
| │   └── test_env.py       # 56 unit tests covering all components |
| ├── inference.py          # Baseline LLM-powered agent |
| ├── openenv.yaml          # OpenEnv metadata |
| ├── pyproject.toml        # Package configuration |
| ├── requirements.txt      # Python dependencies |
| ├── Dockerfile            # Container image configuration |
| └── RL_ARCHITECTURE.md    # Complete RL framework documentation |
| ``` |
|
|
| --- |
|
|
| ## Documentation |
|
|
| ### RL_ARCHITECTURE.md |
| Comprehensive guide to the reinforcement learning implementation: |
| - **Agent**: How external AI systems interact with the environment via HTTP API |
| - **Environment**: Core `APIContractDebuggerEnv` class and episode lifecycle |
| - **State**: Observation space and full internal state representation |
| - **Action**: All 5 action types with validation rules and examples |
| - **Reward & Scoring**: Dense per-step rewards and episode grading formula |
| - **Complete example episode transcript** with JSON payloads |
| - **Python agent pseudocode** for custom implementations |
| |
| --- |
| |