---
title: API Debug Env
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
- openenv
---
# 🔧 API Integration Debugging Environment
> A real-world OpenEnv environment where an AI agent diagnoses and fixes broken API integrations across multi-service systems with **cascading failures**, **dynamic state**, and **multi-dimensional rubric grading**.
[OpenEnv](https://github.com/meta-pytorch/OpenEnv) · [Python 3.10+](https://python.org) · [HF Space](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
---
## Table of Contents
- [Motivation β Why API Debugging?](#motivation--why-api-debugging)
- [Environment Overview](#environment-overview)
- [Key Design Features](#key-design-features)
- [Tasks (Easy / Medium / Hard)](#tasks)
- [Multi-Dimensional Grading Rubric](#multi-dimensional-grading-rubric)
- [Reward Shaping](#reward-shaping)
- [Action & Observation Spaces](#action--observation-spaces)
- [Example Transcript](#example-transcript)
- [Setup & Usage](#setup--usage)
- [API Endpoints](#api-endpoints)
- [Running Inference](#running-inference)
- [Running Tests](#running-tests)
- [Project Structure](#project-structure)
- [Design Philosophy](#design-philosophy)
---
## Motivation β Why API Debugging?
API integration failures are among the **most common and expensive issues** in production software engineering. When microservices communicate (Service A calls Service B, which calls Service C), a single misconfiguration can cascade through the entire system, producing confusing error chains that take hours to diagnose.
Real-world API debugging requires:
- **Structured diagnosis**: reading error logs and configs across multiple services
- **Dependency awareness**: understanding which upstream failure is causing downstream errors
- **Strategic reasoning**: fixing root causes first to unmask hidden downstream bugs
- **Precision**: submitting exact configuration corrections, not approximate guesses

This environment simulates **real-world cascading API failures** with dynamic state that changes as the agent acts; it is not a static lookup puzzle.
---
## Environment Overview
```
┌───────────────────────────────────────────────────────────────────────┐
│                         Agent Debugging Loop                          │
│                                                                       │
│  1. reset(task_id)         → Initial observation with broken state    │
│  2. step(inspect_logs)     → Error logs with diagnostic clues         │
│  3. step(inspect_config)   → Current (broken) service configuration   │
│  4. step(inspect_endpoint) → Simulated API response (401, 504..)      │
│  5. step(submit_fix)       → Strict fix validation + cascade update   │
│  6. grade()                → Multi-dimensional rubric score [0,1]     │
│                                                                       │
│  State updates dynamically: service health changes, new logs          │
│  appear, error cascades resolve as the agent fixes issues.            │
└───────────────────────────────────────────────────────────────────────┘
```
The agent interacts through the standard OpenEnv API:
- **`reset()`**: returns initial observation with broken service state
- **`step(action)`**: executes one debugging action, returns observation + reward
- **`state()`**: returns current environment state (episode_id, step_count)
- **`grade()`**: returns final score using multi-dimensional rubric
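The loop above can be sketched as a small driver. This is an illustrative sketch, not the project's `client.py`: the `post` transport and `choose_action` policy are injected callables (hypothetical names), so the loop can be exercised without a live server.

```python
from typing import Any, Callable, Dict, List

def run_episode(post: Callable[[str, Dict[str, Any]], Dict[str, Any]],
                choose_action: Callable[[Dict[str, Any]], Dict[str, Any]],
                task_id: str = "easy") -> List[float]:
    """Drive reset()/step() until the observation reports done=True."""
    obs = post("/reset", {"task_id": task_id})
    rewards: List[float] = []
    while not obs.get("done", False):
        # e.g. {"action_type": "inspect_logs", "target": "payment_client"}
        action = choose_action(obs)
        obs = post("/step", action)
        rewards.append(obs.get("reward", 0.0))
    return rewards
```

Against a real server, `post` could wrap something like `requests.post(base_url + path, json=body).json()`.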
---
## Key Design Features
### 1. Cascading Failures with Service Dependency Graphs
Each task models a real multi-service ecosystem. Services depend on each other, and a bug in an upstream service **cascades** to all downstream services:
```
Hard Task Dependency Graph:
order_service ──┬──→ inventory_service ──┬──→ shipping_service
                │                        └──→ auth_service
                └──→ api_gateway
    [ERROR]            [DEGRADED]              [HEALTHY]
```
- Fixing `order_service`'s wrong URL unmasks `inventory_service`'s timeout issue
- Fixing `inventory_service`'s expired token allows `shipping_service` to respond
- **Some issues are intentionally masked by upstream failures**, so the agent must fix them in the right order
### 2. Dynamic State
Unlike static environments, the state **changes as the agent acts**:
| What changes | How |
|---|---|
| **Service health** | Fixing issues updates service status: `error` → `degraded` → `healthy` |
| **Logs** | After a fix, re-inspecting logs shows **new entries** (e.g., "Authorization header set. Retrying...") |
| **Error traces** | The cascade chain shrinks as upstream issues are resolved |
| **Endpoint responses** | `inspect_endpoint` returns different HTTP errors based on current fix state |
### 3. Seed-Based Scenario Randomization
Each difficulty level has an **expanded issue pool** (more issues than are selected per episode):
| Difficulty | Pool Size | Selected Per Episode |
|---|---|---|
| Easy | 4 issues | 2 |
| Medium | 5 issues | 3 |
| Hard | 7 issues | 5 |
Passing a `seed` to `reset()` produces a **deterministic but varied** scenario: different seeds select different subsets from the pool and randomize log order. This prevents agents from memorizing fixed patterns.
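The mechanism can be illustrated with a seeded sample from an issue pool. This is a sketch of the idea only; the environment's actual selection code may differ, and the issue names below are made up for the example.

```python
import random

def select_issues(pool, k, seed):
    """Deterministically pick k issues from the pool for a given seed."""
    rng = random.Random(seed)          # seeded RNG: reproducible draws
    return sorted(rng.sample(pool, k)) # same seed -> same subset

pool = ["missing_auth", "wrong_content_type", "low_timeout", "deprecated_url"]
```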
### 4. Strict Fix Validation with Partial Credit
The grader validates both **keys and values** of submitted fixes:
- **Exact match** → full credit (+0.25 reward)
- **Right key, close value** (e.g., timeout=7 when expected=10) → partial credit (+0.03)
- **Right key, wrong value** (e.g., timeout=100 when expected=10) → rejected
- **Wrong key entirely** → penalized (-0.1)
- **Bearer token pattern matching**: `Bearer <any_valid_token>` is accepted
- **Numeric tolerance**: strict 10% tolerance
- **Boolean coercion**: `"true"`, `"1"`, `"yes"` all match `True`
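A rough sketch of these matching rules (not the grader's actual code; the function name and exact semantics are assumptions for illustration):

```python
import re

def value_matches(submitted, expected, tolerance=0.10):
    """Sketch of the value-matching tiers described above."""
    # Bearer token: any non-empty token after "Bearer " is accepted
    if isinstance(expected, str) and expected.startswith("Bearer "):
        return isinstance(submitted, str) and bool(re.fullmatch(r"Bearer \S+", submitted))
    # Boolean coercion: "true"/"1"/"yes" count as True (check before numerics,
    # since bool is a subclass of int in Python)
    if isinstance(expected, bool):
        if isinstance(submitted, str):
            return (submitted.lower() in {"true", "1", "yes"}) == expected
        return bool(submitted) == expected
    # Numeric tolerance: within ±10% of the expected value
    if isinstance(expected, (int, float)) and isinstance(submitted, (int, float)):
        return abs(submitted - expected) <= tolerance * abs(expected)
    # Everything else: exact match
    return submitted == expected
```

Note that the partial-credit tier (e.g. timeout=7 vs 10) sits outside this strict match: it fails the 10% tolerance but still earns +0.03 because the key is right.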
---
## Tasks
### Easy: Payment API Integration (2 issues, 15 max steps)
**Scenario**: A payment processing client is failing to connect to the payment gateway. The agent must diagnose authentication and protocol errors.
- **Services**: `payment_client`, `payment_gateway`
- **Issue pool** (4 possible, 2 selected):
  - Missing `Authorization` header (HTTP 401)
  - Wrong `Content-Type` header: `text/plain` instead of `application/json` (HTTP 415)
  - Timeout too low for payment processing (HTTP 504)
  - Base URL pointing to deprecated v1 endpoint (HTTP 301)
- **Dependencies**: None; straightforward diagnosis
### Medium: Webhook Event Chain (3 issues, 25 max steps)
**Scenario**: A webhook notification system is dropping events across a 3-service chain. Events flow from sender → receiver → notification service, but multiple configuration issues are causing failures.
- **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
- **Issue pool** (5 possible, 3 selected):
  - Rate limit mismatch (sender at 100/s, receiver accepts 10/s) → 429 errors
  - Insufficient retry config (only 1 retry, no backoff, 429 not in retry list)
  - Empty webhook signature header → receiver drops all events as unsigned
  - Wrong target URL (`/webhook` vs `/hooks/incoming`) → 404 errors
  - Payload compression enabled but receiver doesn't support gzip → 415 errors
- **Dependencies**: The retry issue is **masked** by the rate limit; fix the rate limit first to expose the retry problem
### Hard: E-Commerce Order Pipeline (5 issues, 40 max steps)
**Scenario**: A complex e-commerce order processing pipeline is failing with cascading errors across 5 services. Multiple dependency chains make this genuinely challenging for frontier models.
- **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
- **Issue pool** (7 possible, 5 selected):
  - Deprecated URL (`/v1/check`, should be `/v2/reserve`) → 301 redirect
  - Timeout too short (2s vs 4s processing time); masked by the wrong URL
  - Synchronous mode causing race conditions between concurrent orders
  - Expired auth token on inventory→shipping calls → 401
  - No auto token refresh configured; masked by the expired token
  - No circuit breaker → failed requests hammer the inventory service
  - Missing idempotency key → retries create duplicate orders
- **Dependencies**: `timeout` depends on the `wrong_url` fix; `token_refresh` depends on the `expired_token` fix; `idempotency` depends on the `async` fix
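These masking relations form a small DAG, so a valid fix sequence is just a topological order. A sketch using the standard library's `graphlib` (the issue names mirror the pool above but are illustrative; the environment does not expose this API):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Map each issue to the issues that must be fixed before it becomes visible.
deps = {
    "timeout":       {"wrong_url"},        # masked by the deprecated URL
    "token_refresh": {"expired_token"},    # masked by the expired token
    "idempotency":   {"async_mode"},       # masked by synchronous mode
    "wrong_url": set(), "expired_token": set(), "async_mode": set(),
}

# Any static_order() respects all masking constraints.
fix_order = list(TopologicalSorter(deps).static_order())
```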
---
## Multi-Dimensional Grading Rubric
The grader uses a **4-dimension weighted rubric**, not a simple `issues_fixed / total` ratio:
| Dimension | Weight | What It Measures |
|---|---|---|
| **Fix Score** | 40% | `issues_fixed / total_issues`: how many bugs were actually resolved |
| **Strategy Score** | 25% | Did the agent follow a logical approach? Inspect before fix, avoid repeats, follow dependency order, use all action types |
| **Diagnosis Score** | 20% | Did the agent inspect the service (logs/config) **before** submitting a fix for it? |
| **Efficiency Score** | 15% | `remaining_steps / max_steps`: faster solutions score higher |
```
Final Score = fix × 0.40 + strategy × 0.25 + diagnosis × 0.20 + efficiency × 0.15
Clamped to (0.001, 0.999); never exactly 0.0 or 1.0
```
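The formula and clamp translate directly to code. A minimal sketch (the function name is an assumption; the weights and bounds come from the formula above):

```python
def final_score(fix, strategy, diagnosis, efficiency):
    """Weighted rubric score, clamped to (0.001, 0.999)."""
    raw = fix * 0.40 + strategy * 0.25 + diagnosis * 0.20 + efficiency * 0.15
    return min(0.999, max(0.001, raw))  # never exactly 0.0 or 1.0
```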
**Strategy scoring details:**
- Did the agent inspect logs/config before submitting any fix? (+1)
- Ratio of unique inspections to total inspections (no wasteful repeats) (+1)
- Did fixes follow the optimal dependency order? (+1)
- Did the agent use a variety of action types? (+1)
### Baseline Scores (Rule-Based Heuristic Agent)
| Task | Score | Steps Used | Issues Fixed |
|---|---|---|---|
| Easy | ~0.75 | 7 | 2/2 |
| Medium | ~0.55 | 10 | 3/3 |
| Hard | ~0.45 | 15 | 5/5 |
*The baseline uses a deterministic heuristic (inspect all logs → inspect all configs → submit known fixes). An LLM-based agent following good debugging strategy can score higher.*
---
## Reward Shaping
Every action produces a meaningful reward signal, not just sparse end-of-episode feedback:
| Action | Reward | Condition |
|---|---|---|
| `inspect_logs` (first time, finds error patterns) | **+0.15** | New issue-related log patterns found |
| `inspect_logs` (first time, no issues here) | +0.05 | Valid inspection, no errors in this service |
| `inspect_logs` (repeat, no new info) | 0.00 | Already inspected, nothing changed |
| `inspect_logs` (repeat, after a fix) | +0.05 | Dynamic logs appeared after a recent fix |
| `inspect_config` (service has issues) | +0.05 | Relevant config retrieved |
| `inspect_config` (service is clean) | +0.01 | Config retrieved but no issues here |
| `inspect_config` (repeat) | 0.00 | Already inspected |
| `inspect_endpoint` | +0.02 to +0.05 | Simulated endpoint test |
| `submit_fix` (correct fix) | **+0.25** | Issue resolved, service health updated |
| `submit_fix` (correct + inspected first) | **+0.30** | Fix + strategy bonus for diagnosis |
| `submit_fix` (partial: close but not exact) | +0.03 | Right key, approximately right value |
| `submit_fix` (wrong fix) | **-0.10** | Incorrect fix payload |
| `submit_fix` (empty payload) | -0.10 | Empty fix_payload submitted |
| All issues fixed | **+0.20** | Episode completion bonus |
| Invalid target / invalid action | -0.05 | Bad input |
| Every step | **-0.01** | Step cost; encourages efficiency |
---
## Action & Observation Spaces
### Action Schema (Pydantic model: `ApiDebugAction`)
```json
{
"action_type": "inspect_logs | inspect_config | inspect_endpoint | submit_fix",
"target": "<service_name>",
"fix_payload": {
"config_key": "corrected_value"
}
}
```
- `action_type` (required): One of the 4 debugging actions
- `target` (required): The service to act on (from `available_targets` in the observation)
- `fix_payload` (optional): Required only for `submit_fix`; the configuration correction
**Fix payload formats:**
```json
// Simple key-value fix
{"timeout": 10}
// Nested key fix (dot notation)
{"headers.Authorization": "Bearer my_api_key"}
// Complex nested object fix
{"retry": {"max_retries": 3, "backoff_factor": 2, "retry_on_status": [429, 500]}}
```
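The dot-notation keys above imply a small expansion step on the server side. A sketch of that semantics, assuming a plain nested-dict config (the helper name is hypothetical):

```python
def apply_fix(config, fix_payload):
    """Apply a fix payload to a config dict, expanding dot-notation keys."""
    for key, value in fix_payload.items():
        node = config
        parts = key.split(".")
        # Walk/create intermediate dicts for "headers.Authorization"-style keys
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return config
```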
### Observation Schema (Pydantic model: `ApiDebugObservation`)
```json
{
"task_id": "easy",
"task_description": "A payment processing API integration is failing...",
"logs": ["[ERROR] 2026-03-25T10:15:23Z POST /process -> 401 Unauthorized", "..."],
"config_snapshot": {"headers": {"Content-Type": "text/plain"}, "timeout": 30},
"api_response": {"status": "error", "status_code": 401, "error": "Missing Authorization"},
"service_status": {"payment_client": "error", "payment_gateway": "healthy"},
"dependency_graph": {"payment_client": ["payment_gateway"], "payment_gateway": []},
"error_trace": [
"[CRITICAL] payment_client: Missing Authorization header",
    "  └─> payment_gateway: All requests rejected with 401"
],
"hints": ["Check headers.Authorization"],
"remaining_steps": 14,
"issues_found": 1,
"issues_fixed": 0,
"issues_total": 2,
"action_result": "Inspected logs for 'payment_client'. Found relevant error patterns!",
"available_targets": ["payment_client", "payment_gateway"],
"done": false,
"reward": 0.15
}
```
**Key observation fields for agent reasoning:**
- `service_status`: shows which services are healthy/degraded/error (updates dynamically)
- `dependency_graph`: shows service relationships (agent should fix upstream first)
- `error_trace`: shows active error cascades (shrinks as issues are fixed)
- `hints`: progressive hints that get more specific as steps are used
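A simple upstream-first heuristic over these fields might look like the sketch below. It is an illustration of how an agent could use `service_status` and `dependency_graph`, not part of the environment; the interpretation (pick an unhealthy service that no other unhealthy service calls) is an assumption based on the cascade examples above.

```python
def pick_next_target(service_status, dependency_graph):
    """Among unhealthy services, pick one no other unhealthy service feeds
    into, so upstream failures get fixed before the ones they mask.
    dependency_graph maps each service to the services it calls."""
    broken = {s for s, st in service_status.items() if st != "healthy"}
    for svc in sorted(broken):
        # Broken services that call svc (i.e. sit upstream of it)
        callers = {s for s in broken if svc in dependency_graph.get(s, [])}
        if not callers - {svc}:
            return svc
    return None  # nothing left to fix
```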
---
## Example Transcript
```
>>> reset(task_id="easy")
task_description: "A payment processing API integration is failing..."
service_status: {payment_client: "error", payment_gateway: "healthy"}
error_trace:
[CRITICAL] payment_client: Missing Authorization header
  └─> payment_gateway: All requests rejected with 401
[ERROR] payment_client: Wrong Content-Type (text/plain instead of application/json)
  └─> payment_gateway: Request body parsing fails
issues_total: 2, remaining_steps: 15
>>> step(action_type="inspect_logs", target="payment_client")
logs: [
"[INFO] Payment client initialized...",
"[ERROR] POST /process -> 401 Unauthorized",
"[ERROR] Response: {'error': 'Missing or invalid Authorization header'}",
"[WARN] Request headers: Content-Type=text/plain",
"[ERROR] POST /process -> 415 Unsupported Media Type",
]
issues_found: 2, reward: +0.15
>>> step(action_type="inspect_config", target="payment_client")
config_snapshot: {
"base_url": "https://api.paymentgateway.com/v2",
"headers": {"Content-Type": "text/plain", "Accept": "application/json"},
"timeout": 30
}
reward: +0.05 // Service has issues, first inspection
>>> step(action_type="submit_fix", target="payment_client",
fix_payload={"headers.Authorization": "Bearer sk_live_my_key"})
action_result: "Fix accepted! Fixed 1 issue(s). Total: 1/2"
service_status: {payment_client: "degraded", payment_gateway: "healthy"}
reward: +0.30 // Fix (+0.25) + strategy bonus (+0.05) for inspecting first
>>> step(action_type="inspect_logs", target="payment_client")
logs: [...original logs...,
"[INFO] Authorization header set. Retrying request..." // NEW dynamic log!
]
reward: +0.05 // Re-inspection has new dynamic logs
>>> step(action_type="submit_fix", target="payment_client",
fix_payload={"headers.Content-Type": "application/json"})
action_result: "Fix accepted! All issues fixed! Episode complete."
service_status: {payment_client: "healthy", payment_gateway: "healthy"}
error_trace: ["All issues resolved. No error cascades active."]
reward: +0.50 // Fix (+0.25) + strategy (+0.05) + completion bonus (+0.20)
done: true
>>> grade()
score: 0.82
fix_score: 1.00 (2/2 fixed)
diagnosis_score: 1.00 (inspected before every fix)
efficiency_score: 0.67 (5/15 steps used)
strategy_score: 0.80 (inspected first, used multiple action types)
```
---
## Setup & Usage
### Prerequisites
- Python 3.10+
- [uv](https://docs.astral.sh/uv/) (recommended) or pip
- Docker (for containerized deployment)
### Install Dependencies
```bash
# Clone the repository
git clone https://github.com/yadnyeshkolte/openenv-task.git
cd openenv-task
# Install dependencies with uv
uv sync
# Or with pip
pip install -e .
```
### Run the Server Locally
```bash
# From the project root (openenv-task/)
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```
The server will be available at `http://localhost:8000`. Visit `http://localhost:8000/docs` for interactive API documentation.
### Quick Test
```bash
# Reset environment
curl -X POST http://localhost:8000/reset
# Inspect logs
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action_type": "inspect_logs", "target": "payment_client"}'
# Submit a fix
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action_type": "submit_fix", "target": "payment_client", "fix_payload": {"headers.Authorization": "Bearer my_key"}}'
```
### Docker Build & Run
```bash
# From the project root (openenv-task/)
docker build -t api_debug_env -f Dockerfile .
docker run -p 8000:8000 api_debug_env
```
---
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Environment info, version, and feature list |
| `/reset` | POST | Reset environment (accepts `task_id` and `seed` params) |
| `/step` | POST | Execute a debugging action |
| `/state` | GET | Get current state (episode_id, step_count) |
| `/schema` | GET | Get action/observation Pydantic schemas |
| `/tasks` | GET | List all 3 tasks with action schema and service dependencies |
| `/grader` | POST | Get multi-dimensional grader score for current episode |
| `/baseline` | POST | Run the rule-based baseline agent on all 3 tasks |
| `/health` | GET | Health check endpoint |
| `/docs` | GET | Interactive Swagger UI documentation |
---
## Running Inference
The `inference.py` script at the project root uses the OpenAI API client to run an LLM agent against all 3 tasks:
```bash
# Set your API credentials
export HF_TOKEN=your_huggingface_token
# Optional: override model and API base
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export API_BASE_URL=https://router.huggingface.co/v1
# Run inference from the project root
python inference.py
```
**Output format** (stdout):
```
[START] task=easy env=api_debug_env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=inspect_logs(target=payment_client) reward=0.15 done=false error=null
[STEP] step=2 action=submit_fix(target=payment_client, fix={...}) reward=0.30 done=false error=null
...
[END] success=true steps=5 score=0.820 rewards=0.15,0.30,...
```
The inference script:
- Uses `openai.OpenAI` client for all LLM calls
- Reads `HF_TOKEN` (or `API_KEY`) from environment variables
- Includes retry logic with exponential backoff
- Emits `[START]`, `[STEP]`, `[END]` lines to stdout
---
## Running Tests
```bash
# From the project root (openenv-task/)
python -m pytest tests/ -v --tb=short
```
**70 tests** across 12 test classes covering:
- Scenario loading, seed randomization, and issue pool selection
- Environment reset and initialization
- All 4 action types: `inspect_logs`, `inspect_config`, `inspect_endpoint`, `submit_fix`
- Dynamic state: service health updates, dynamic log injection, error trace changes
- Multi-dimensional grading rubric (fix, diagnosis, efficiency, strategy)
- Strict fix validation with partial credit
- Value matching (strings, numbers, booleans, lists, Bearer tokens)
- Full episode integration tests (easy, medium, hard)
- Cascading failure mechanics and dependency chains
- Episode termination conditions
### Validate OpenEnv Compliance
```bash
openenv validate
```
---
## Project Structure
```
openenv-task/                          # Project root
├── __init__.py                        # Package init (exports ApiDebugEnv, Action, Observation)
├── client.py                          # OpenEnv client (WebSocket connection to server)
├── models.py                          # Pydantic Action & Observation type definitions
├── scenarios.py                       # Task scenarios with dependency graphs & issue pools
├── inference.py                       # MANDATORY inference script (LLM agent, OpenAI client)
├── openenv.yaml                       # OpenEnv metadata (spec v1)
├── pyproject.toml                     # Python project config & dependencies
├── Dockerfile                         # Docker build for HF Spaces deployment
├── LICENSE                            # BSD license
├── README.md                          # This file
├── PROGRESS.md                        # Development session log
├── AGENTS.md                          # Instructions for AI coding agents
├── server/
│   ├── __init__.py                    # Server package init
│   ├── api_debug_env_environment.py   # Core environment (reset/step/grade logic)
│   ├── app.py                         # FastAPI endpoints (/reset, /step, /tasks, etc.)
│   ├── Dockerfile                     # Alternate Dockerfile (same as root)
│   └── requirements.txt               # Server-specific requirements
├── scripts/
│   └── baseline_inference.py          # Alternate baseline script
└── tests/
    └── test_environment.py            # 70 unit & integration tests
```
### Key Files
| File | Purpose |
|---|---|
| `server/api_debug_env_environment.py` | **Core logic**: `reset()`, `step()`, `grade()`, dynamic state, cascading failures |
| `scenarios.py` | **Task definitions**: issue pools, dependency graphs, dynamic logs, service configs |
| `models.py` | **Type definitions**: `ApiDebugAction` and `ApiDebugObservation` Pydantic models |
| `inference.py` | **Mandatory**: LLM-based agent using OpenAI client with `[START]/[STEP]/[END]` output |
| `openenv.yaml` | **Mandatory**: OpenEnv spec v1 metadata with task definitions |
| `server/app.py` | **FastAPI server**: all HTTP endpoints including `/baseline` and `/grader` |
---
## Design Philosophy
This environment is designed to be useful for **RL/agent training and evaluation**, not just a one-off benchmark:
1. **Dense Reward Signal**: every action type produces positive or negative reward, enabling gradient-based training (GRPO, DPO, PPO) rather than a sparse binary score at the end.
2. **Progressive Difficulty**: Easy (2 services, 2 issues) → Medium (3 services, 3 issues with 1 dependency) → Hard (5 services, 5 issues with multiple dependency chains). Difficulty comes from complexity, not ambiguity.
3. **Partial Credit**: close-but-wrong fixes get constructive feedback instead of outright rejection, providing learning signal for agents that are on the right track.
4. **Strategy Incentives**: the multi-dimensional rubric rewards **how** the agent solves (inspect before fix, follow dependencies, avoid waste), not just **what** it solves. This encourages emergent debugging strategies.
5. **Stochastic Scenarios**: seed-based randomization from expanded issue pools prevents policy overfitting to memorized scenarios while maintaining reproducibility.
6. **Cascading Dynamics**: upstream fixes change downstream state, requiring **multi-step causal reasoning**. The agent can't pattern-match each issue independently; it must understand the system architecture.
7. **Real-World Relevance**: API integration debugging is a genuine, high-value task that software engineers spend significant time on. The scenarios model actual failure patterns (expired tokens, rate limiting, missing headers, deprecated endpoints, race conditions).
---
## OpenEnv Spec Compliance
| Requirement | Status |
|---|---|
| OpenEnv spec v1 (`openenv.yaml`) | ✅ |
| Typed Pydantic models (Action, Observation) | ✅ |
| `reset()` / `step()` / `state()` API | ✅ |
| 3+ tasks with difficulty range | ✅ (easy, medium, hard) |
| Programmatic graders (0.0–1.0) | ✅ (multi-dimensional rubric) |
| Meaningful reward function | ✅ (dense, not sparse) |
| Baseline inference script | ✅ (`inference.py` at root) |
| OpenAI client for LLM calls | ✅ |
| `[START]/[STEP]/[END]` stdout format | ✅ |
| Dockerfile builds and runs | ✅ |
| HF Space deploys and responds | ✅ |
| `openenv validate` passes | ✅ |
---
## Hackathon Submission
- **HF Space**: [yadnyeshkolte/api-debug-env](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
- **GitHub**: [yadnyeshkolte/openenv-task](https://github.com/yadnyeshkolte/openenv-task)
- **Hackathon**: Meta PyTorch OpenEnv Hackathon Γ Scaler School of Technology