---
title: API Contract Debugger
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - rl-environment
  - api-debugging
  - contract-testing
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# API Contract Debugger: OpenEnv Environment

An OpenEnv environment where AI agents debug broken OpenAPI-style contract
specifications by proposing targeted field-level corrections.

## What Is This?

Backend engineers routinely debug API contract violations: mismatched types,
missing required fields, wrong HTTP status codes, and forbidden extra fields
leaking into responses. This environment turns that real-world task into a
structured RL benchmark.

The agent receives a broken API spec and a list of violations. Each step, it
proposes one fix. It gets rewarded for each violation resolved and penalised
for introducing new ones.

---

## Action Space

```json
{
  "kind": "add_field | remove_field | change_type | change_status | no_op",
  "endpoint_index": 0,
  "location": "request_body | response_body | status_code",
  "field_name": "field_name_or_null",
  "new_value": "<type string | field spec dict | int status code | null>"
}
```

| `kind`          | `new_value` type | Description |
|-----------------|-----------------|-------------|
| `add_field`     | `{"type": "...", "required": true, "description": "..."}` | Add a missing field |
| `remove_field`  | `null` | Remove a forbidden field |
| `change_type`   | `"integer"` / `"string"` / `"boolean"` / `"number"` | Fix a field's type |
| `change_status` | `204` / `200` / `201` etc. | Fix the HTTP status code |
| `no_op`         | `null` | Do nothing (small implicit cost) |
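
A client can catch some malformed actions before sending them. The sketch below mirrors the table above in plain Python. `check_action` and the constant names are hypothetical helpers, not part of the server, and the server's own validation may differ (for example, it may accept type strings beyond the four listed here).

```python
from typing import Optional

VALID_KINDS = {"add_field", "remove_field", "change_type", "change_status", "no_op"}
VALID_LOCATIONS = {"request_body", "response_body", "status_code"}
# Assumption: only the four type strings shown in the table are accepted.
VALID_TYPES = {"integer", "string", "boolean", "number"}


def check_action(action: dict) -> Optional[str]:
    """Return a description of the first problem found, or None if the action looks valid."""
    kind = action.get("kind")
    if kind not in VALID_KINDS:
        return f"unknown kind: {kind!r}"
    if kind == "no_op":
        return None  # no_op carries no payload to check
    if action.get("location") not in VALID_LOCATIONS:
        return f"unknown location: {action.get('location')!r}"
    value = action.get("new_value")
    if kind == "add_field" and not (isinstance(value, dict) and "type" in value):
        return "add_field expects a field spec dict with a 'type' key"
    if kind == "remove_field" and value is not None:
        return "remove_field expects new_value to be null"
    if kind == "change_type" and not (isinstance(value, str) and value in VALID_TYPES):
        return "change_type expects one of: " + ", ".join(sorted(VALID_TYPES))
    if kind == "change_status" and (isinstance(value, bool) or not isinstance(value, int)):
        return "change_status expects an integer status code"
    return None


fix = {
    "kind": "change_status",
    "endpoint_index": 2,
    "location": "status_code",
    "field_name": None,
    "new_value": 204,
}
print(check_action(fix))  # None
```

Returning a message rather than raising keeps the check cheap to call inside an agent loop, where a rejected action can simply be regenerated.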

---

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `task_name` | str | Active task: `easy`, `medium`, `hard` |
| `task_description` | str | Plain-English description of violations |
| `endpoints` | list | Current (partially fixed) endpoint specs |
| `violations` | list | Remaining violations with type + description |
| `violations_fixed_this_step` | int | How many the last action resolved |
| `violations_introduced_this_step` | int | How many the last action introduced |
| `total_violations_at_start` | int | Violation count at episode start |
| `step_count` | int | Steps taken so far |
| `max_steps` | int | Episode step budget |
| `last_action_error` | str\|null | Validation error if action was malformed |
| `reward` | float | Per-step reward |
| `done` | bool | Whether the episode has terminated |
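
On the agent side, the fields in this table can be held in a typed container. The following is a minimal sketch, not the project's own model (the real Pydantic definitions live in `server/models.py`). The `Observation` dataclass, its two helper properties, and the sample payload values are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Client-side mirror of the observation fields listed in the table."""
    task_name: str
    task_description: str
    endpoints: list
    violations: list
    violations_fixed_this_step: int
    violations_introduced_this_step: int
    total_violations_at_start: int
    step_count: int
    max_steps: int
    last_action_error: Optional[str]
    reward: float
    done: bool

    @property
    def steps_remaining(self) -> int:
        """Steps left in the episode budget."""
        return self.max_steps - self.step_count

    @property
    def made_progress(self) -> bool:
        """True if the last action fixed more violations than it introduced."""
        return self.violations_fixed_this_step > self.violations_introduced_this_step


# Made-up payload for illustration only; real values come from the server.
payload = {
    "task_name": "medium",
    "task_description": "illustrative description",
    "endpoints": [],
    "violations": [{"type": "wrong_status", "description": "illustrative"}],
    "violations_fixed_this_step": 1,
    "violations_introduced_this_step": 0,
    "total_violations_at_start": 3,
    "step_count": 2,
    "max_steps": 10,
    "last_action_error": None,
    "reward": 0.18,
    "done": False,
}
obs = Observation(**payload)
print(obs.steps_remaining, obs.made_progress)  # 8 True
```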

---

## Tasks

### Easy (1 endpoint, 1 violation, max 5 steps)
A user registration endpoint is missing `created_at` (string) in its response.
Expected score for a capable agent: **1.0**

### Medium (3 endpoints, 3 violations, max 10 steps)
An e-commerce API has:
1. `GET /products/{id}`: `product_id` returned as `string` instead of `integer`
2. `POST /orders`: `quantity` accepted as `string` instead of `integer`
3. `DELETE /orders/{id}`: returns status `200` instead of `204`

Expected score for a capable agent: **1.0**
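
To make the mapping from violations to actions concrete, the three medium-task fixes can be written as action payloads. This is a sketch under one assumption that the task list does not confirm: endpoints are indexed in the order listed above (0, 1, 2). The name `MEDIUM_FIXES` is hypothetical.

```python
import json

# Assumption: endpoint_index follows the order of the numbered list (0, 1, 2).
MEDIUM_FIXES = [
    # product_id is *returned* with the wrong type, so the fix targets the response body.
    {"kind": "change_type", "endpoint_index": 0, "location": "response_body",
     "field_name": "product_id", "new_value": "integer"},
    # quantity is *accepted* with the wrong type, so the fix targets the request body.
    {"kind": "change_type", "endpoint_index": 1, "location": "request_body",
     "field_name": "quantity", "new_value": "integer"},
    # A status-code fix has no field name.
    {"kind": "change_status", "endpoint_index": 2, "location": "status_code",
     "field_name": None, "new_value": 204},
]

# Each fix is wrapped as {"action": ...}, the body shape that POST /step expects.
for fix in MEDIUM_FIXES:
    print(json.dumps({"action": fix}))
```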

### Hard (4 endpoints, 6 violations, max 15 steps)
An auth + profile API has:
1. `POST /auth/login`: missing `refresh_token` in response
2. `POST /auth/login`: `expires_in` is `string` instead of `integer`
3. `GET /users/{id}/profile`: missing `created_at` in response
4. `GET /users/{id}/profile`: exposes forbidden `password_hash` field (must be removed)
5. `PATCH /users/{id}/profile`: returns status `500` instead of `200`
6. `PATCH /users/{id}/profile`: missing `updated_at` in response

Expected score for a capable agent: **0.7–1.0** (frontier models)

---

## Reward Function

| Event | Reward |
|-------|--------|
| Fix a violation | `+0.2 × severity` |
| Introduce a violation | `−0.15 × severity` |
| Malformed action | `−0.05` |
| Solve all violations | `+0.5` bonus |

Severity weights: `missing_field=1.0`, `wrong_type=0.9`, `wrong_status=0.8`, `extra_field=0.7`

The final episode score is computed by `grade_episode()`, which returns a float in `[0.0, 1.0]`.
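
As a worked example of the table, the sketch below computes per-step rewards in plain Python. It assumes the terms simply add up within a single step, which is consistent with the `reward=0.70` shown for the easy task in the example output but is not confirmed against `graders.py`. `step_reward` is a hypothetical helper, and the `no_op` cost is not modelled.

```python
SEVERITY = {
    "missing_field": 1.0,
    "wrong_type": 0.9,
    "wrong_status": 0.8,
    "extra_field": 0.7,
}


def step_reward(fixed=(), introduced=(), malformed=False, all_solved=False):
    """Per-step reward, assuming the table's terms are summed within one step."""
    reward = sum(0.2 * SEVERITY[v] for v in fixed)
    reward -= sum(0.15 * SEVERITY[v] for v in introduced)
    if malformed:
        reward -= 0.05
    if all_solved:
        reward += 0.5
    return reward


# Easy task: one missing_field fix that also completes the episode.
easy = step_reward(fixed=["missing_field"], all_solved=True)
print(f"{easy:.2f}")  # 0.70

# Medium task: two wrong_type fixes, then a wrong_status fix that completes the episode.
medium = [
    step_reward(fixed=["wrong_type"]),
    step_reward(fixed=["wrong_type"]),
    step_reward(fixed=["wrong_status"], all_solved=True),
]
print([f"{r:.2f}" for r in medium])  # ['0.18', '0.18', '0.66']
```

These are dense per-step rewards; the `[0.0, 1.0]` episode score from `grade_episode()` is a separate quantity.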

---

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/reset` | Reset environment. Body: `{"task_name": "easy\|medium\|hard"}` |
| `POST` | `/step`  | Apply one action. Body: `{"action": {...}}` |
| `GET`  | `/state` | Full internal state |
| `GET`  | `/score` | Final episode score |
| `GET`  | `/tasks` | List all available tasks |
| `GET`  | `/health`| Health check |
| `GET`  | `/schema`| JSON schemas for action + observation |

---

## Setup & Usage

### Installation

```bash
# Clone the repository
git clone <your-repo-url>
cd api-contract-debugger

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
```

### Run locally

```bash
# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
```

The server will be available at `http://localhost:7860`.

### Run with Docker

```bash
docker build -t api-contract-debugger .
docker run -p 7860:7860 api-contract-debugger
```

### Run tests

```bash
# Run entire test suite (56 tests)
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=server
```

### Run the baseline agent

The baseline agent uses an LLM (via the OpenAI client) to propose fixes.

**Required environment variables:**
```bash
export HF_TOKEN="your_huggingface_api_token"     # Get from huggingface.co/settings/tokens
export ENV_BASE_URL="http://localhost:7860"      # Environment server URL
export TASK_NAME="all"                           # "easy", "medium", "hard", or "all"
```

**Optional environment variables** (have defaults):
```bash
export API_BASE_URL="https://router.huggingface.co/v1"      # LLM endpoint
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"               # Model identifier
export LOCAL_IMAGE_NAME="optional_docker_image"             # For docker image initialization
```

Then run the agent:
```bash
python inference.py
```

**Example output:**
```
[START] task=easy env=api_contract_debugger model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action={"kind":"add_field",...} reward=0.70 done=true error=null
[END] success=true steps=1 score=1.000 rewards=0.70
```

### Test individual endpoints

```bash
# Health check
curl http://localhost:7860/health

# List available tasks
curl http://localhost:7860/tasks

# Reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_name":"easy"}'

# Apply an action
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "kind": "add_field",
      "endpoint_index": 0,
      "location": "response_body",
      "field_name": "created_at",
      "new_value": {"type": "string", "description": "ISO-8601 timestamp"}
    }
  }'

# Get final score
curl http://localhost:7860/score
```
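
The same calls can be made from Python using only the standard library. This is a minimal sketch, assuming the server is running at `http://localhost:7860`. `build_request`, `call`, and `run_easy_episode` are hypothetical helper names, and the response payloads are returned as parsed JSON without assuming their shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local server from "Run locally"


def build_request(method, path, body=None, base_url=BASE_URL):
    """Build a urllib Request; a JSON body is attached only when one is given."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(base_url + path, data=data, headers=headers, method=method)


def call(method, path, body=None):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(method, path, body), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_easy_episode():
    """Reset to the easy task, apply the created_at fix, and return the /score payload."""
    call("POST", "/reset", {"task_name": "easy"})
    call("POST", "/step", {
        "action": {
            "kind": "add_field",
            "endpoint_index": 0,
            "location": "response_body",
            "field_name": "created_at",
            "new_value": {"type": "string", "description": "ISO-8601 timestamp"},
        }
    })
    return call("GET", "/score")
```

With the server running, `print(run_easy_episode())` performs the reset, step, and score sequence shown in the curl commands above.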

---

## Baseline Scores

| Task | Model | Score | Steps Used |
|------|-------|-------|-----------|
| easy | Qwen2.5-72B-Instruct | 1.000 | 1 |
| medium | Qwen2.5-72B-Instruct | 1.000 | 3 |
| hard | Qwen2.5-72B-Instruct | ~0.85 | 12 |

---

## Project Structure

```
api-contract-debugger/
├── server/
│   ├── __init__.py
│   ├── app.py          # FastAPI app, route registration
│   ├── environment.py  # OpenEnv Environment subclass
│   ├── models.py       # Pydantic Action / Observation / State
│   ├── graders.py      # Violation detection + reward shaping
│   └── fixtures.py     # Task definitions (broken + golden specs)
├── tests/
│   └── test_env.py     # 56 unit tests covering all components
├── inference.py        # Baseline LLM-powered agent
├── openenv.yaml        # OpenEnv metadata
├── pyproject.toml      # Package configuration
├── requirements.txt    # Python dependencies
├── Dockerfile          # Container image configuration
└── RL_ARCHITECTURE.md  # Complete RL framework documentation
```

---

## Documentation

### RL_ARCHITECTURE.md
Comprehensive guide to the reinforcement learning implementation:
- **Agent**: How external AI systems interact with the environment via HTTP API
- **Environment**: Core `APIContractDebuggerEnv` class and episode lifecycle
- **State**: Observation space and full internal state representation
- **Action**: All 5 action types with validation rules and examples
- **Reward & Scoring**: Dense per-step rewards and episode grading formula
- **Complete example episode transcript** with JSON payloads
- **Python agent pseudocode** for custom implementations

---