---
title: API Debug Env
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
---
# 🔧 API Integration Debugging Environment

> A real-world OpenEnv environment where an AI agent diagnoses and fixes broken API integrations across multi-service systems with **cascading failures**, **dynamic state**, and **multi-dimensional rubric grading**.

[OpenEnv](https://github.com/meta-pytorch/OpenEnv) · [Python](https://python.org) · [HF Space](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
---

## Table of Contents

- [Motivation – Why API Debugging?](#motivation--why-api-debugging)
- [Environment Overview](#environment-overview)
- [Key Design Features](#key-design-features)
- [Tasks (Easy / Medium / Hard)](#tasks)
- [Multi-Dimensional Grading Rubric](#multi-dimensional-grading-rubric)
- [Reward Shaping](#reward-shaping)
- [Action & Observation Spaces](#action--observation-spaces)
- [Example Transcript](#example-transcript)
- [Setup & Usage](#setup--usage)
- [API Endpoints](#api-endpoints)
- [Running Inference](#running-inference)
- [Running Tests](#running-tests)
- [Project Structure](#project-structure)
- [Design Philosophy](#design-philosophy)

---
## Motivation – Why API Debugging?

API integration failures are among the **most common and expensive issues** in production software engineering. When microservices communicate (Service A calls Service B, which calls Service C), a single misconfiguration can cascade through the entire system, producing confusing error chains that take hours to diagnose.

Real-world API debugging requires:

- **Structured diagnosis** – reading error logs and configs across multiple services
- **Dependency awareness** – understanding which upstream failure is causing downstream errors
- **Strategic reasoning** – fixing root causes first to unmask hidden downstream bugs
- **Precision** – submitting exact configuration corrections, not approximate guesses

This environment simulates **real-world cascading API failures** with dynamic state that changes as the agent acts – not a static lookup puzzle.

---
## Environment Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                        Agent Debugging Loop                         │
│                                                                     │
│  1. reset(task_id)        → Initial observation with broken state   │
│  2. step(inspect_logs)    → Error logs with diagnostic clues        │
│  3. step(inspect_config)  → Current (broken) service configuration  │
│  4. step(inspect_endpoint)→ Simulated API response (401, 504..)     │
│  5. step(submit_fix)      → Strict fix validation + cascade update  │
│  6. grade()               → Multi-dimensional rubric score [0,1]    │
│                                                                     │
│  State updates dynamically: service health changes, new logs        │
│  appear, error cascades resolve as the agent fixes issues.          │
└─────────────────────────────────────────────────────────────────────┘
```
The agent interacts through the standard OpenEnv API:

- **`reset()`** – returns the initial observation with broken service state
- **`step(action)`** – executes one debugging action, returns observation + reward
- **`state()`** – returns current environment state (episode_id, step_count)
- **`grade()`** – returns the final score using the multi-dimensional rubric

---
## Key Design Features

### 1. Cascading Failures with Service Dependency Graphs

Each task models a real multi-service ecosystem. Services depend on each other, and a bug in an upstream service **cascades** to all downstream services:

```
Hard Task Dependency Graph:
order_service ──┬── inventory_service ──┬── shipping_service
                │                       └── auth_service
                └── api_gateway
   [ERROR]           [DEGRADED]               [HEALTHY]
```

- Fixing `order_service`'s wrong URL unmasks `inventory_service`'s timeout issue
- Fixing `inventory_service`'s expired token allows `shipping_service` to respond
- **Some issues are intentionally masked by upstream failures** – the agent must fix in the right order
### 2. Dynamic State

Unlike static environments, the state **changes as the agent acts**:

| What changes | How |
|---|---|
| **Service health** | Fixing issues updates service status: `error` → `degraded` → `healthy` |
| **Logs** | After a fix, re-inspecting logs shows **new entries** (e.g., "Authorization header set. Retrying...") |
| **Error traces** | The cascade chain shrinks as upstream issues are resolved |
| **Endpoint responses** | `inspect_endpoint` returns different HTTP errors based on the current fix state |
### 3. Seed-Based Scenario Randomization

Each difficulty level has an **expanded issue pool** (more issues than are selected per episode):

| Difficulty | Pool Size | Selected Per Episode |
|---|---|---|
| Easy | 4 issues | 2 |
| Medium | 5 issues | 3 |
| Hard | 7 issues | 5 |

Passing a `seed` to `reset()` produces a **deterministic but varied** scenario – different seeds select different subsets from the pool and randomize log order. This prevents agents from memorizing fixed patterns.
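The selection mechanism described above can be sketched in a few lines (illustrative only; the pool contents and the `select_issues` name are assumptions, not the environment's actual code):

```python
import random

# Hypothetical issue pool for the easy task (names assumed for illustration)
EASY_POOL = ["missing_auth_header", "wrong_content_type", "low_timeout", "deprecated_base_url"]

def select_issues(pool, k, seed):
    """Deterministically pick k issues from the pool for a given seed."""
    rng = random.Random(seed)      # isolated RNG: same seed -> same scenario
    issues = rng.sample(pool, k)   # choose this episode's issue subset
    rng.shuffle(issues)            # also randomize presentation/log order
    return issues

print(select_issues(EASY_POOL, 2, seed=42) == select_issues(EASY_POOL, 2, seed=42))  # True
```

Because the RNG is seeded per episode, the same seed always reproduces the same scenario, while different seeds explore different subsets of the pool.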
### 4. Strict Fix Validation with Partial Credit

The grader validates both **keys and values** of submitted fixes:

- **Exact match** → Full credit (+0.25 reward)
- **Right key, close value** (e.g., timeout=7 when expected=10) → Partial credit (+0.03)
- **Right key, wrong value** (e.g., timeout=100 when expected=10) → Rejected
- **Wrong key entirely** → Penalized (-0.1)
- **Bearer token pattern matching** – `Bearer <any_valid_token>` is accepted
- **Numeric tolerance** – values within 10% of the expected number count as a match
- **Boolean coercion** – `"true"`, `"1"`, `"yes"` all match `True`
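The matching rules above could be implemented roughly like this (a sketch under stated assumptions; `values_match` is a hypothetical name, and the grader's actual partial-credit band may differ):

```python
import re

def values_match(expected, submitted, tolerance=0.10):
    """Sketch of the exact-match rules listed above."""
    # Bearer tokens: any non-empty token after "Bearer " is accepted
    if isinstance(expected, str) and expected.startswith("Bearer "):
        return isinstance(submitted, str) and re.fullmatch(r"Bearer \S+", submitted) is not None
    # Boolean coercion: "true", "1", "yes" all match True
    # (checked before the numeric branch, since bool is a subclass of int in Python)
    if isinstance(expected, bool):
        if isinstance(submitted, bool):
            return submitted == expected
        return (str(submitted).strip().lower() in {"true", "1", "yes"}) == expected
    # Numeric tolerance: accept values within 10% of the expected number
    if isinstance(expected, (int, float)):
        try:
            return abs(float(submitted) - float(expected)) <= tolerance * abs(expected)
        except (TypeError, ValueError):
            return False
    # Everything else: exact equality
    return expected == submitted
```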
---
## Tasks

### Easy: Payment API Integration (2 issues, 15 max steps)

**Scenario**: A payment processing client is failing to connect to the payment gateway. The agent must diagnose authentication and protocol errors.

- **Services**: `payment_client`, `payment_gateway`
- **Issue pool** (4 possible, 2 selected):
  - Missing `Authorization` header (HTTP 401)
  - Wrong `Content-Type` header – `text/plain` instead of `application/json` (HTTP 415)
  - Timeout too low for payment processing (HTTP 504)
  - Base URL pointing to the deprecated v1 endpoint (HTTP 301)
- **Dependencies**: None – straightforward diagnosis
### Medium: Webhook Event Chain (3 issues, 25 max steps)

**Scenario**: A webhook notification system is dropping events across a 3-service chain. Events flow from sender → receiver → notification service, but multiple configuration issues are causing failures.

- **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
- **Issue pool** (5 possible, 3 selected):
  - Rate limit mismatch (sender at 100/s, receiver accepts 10/s) → 429 errors
  - Insufficient retry config (only 1 retry, no backoff, 429 not in retry list)
  - Empty webhook signature header → receiver drops all events as unsigned
  - Wrong target URL (`/webhook` vs `/hooks/incoming`) → 404 errors
  - Payload compression enabled but the receiver doesn't support gzip → 415 errors
- **Dependencies**: The retry issue is **masked** by the rate limit – the agent must fix the rate limit first to surface the retry problem
### Hard: E-Commerce Order Pipeline (5 issues, 40 max steps)

**Scenario**: A complex e-commerce order processing pipeline is failing with cascading errors across 5 services. Multiple dependency chains make this genuinely challenging for frontier models.

- **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
- **Issue pool** (7 possible, 5 selected):
  - Deprecated URL (`/v1/check` – should be `/v2/reserve`) → 301 redirect
  - Timeout too short (2s vs 4s processing time) – masked by the wrong URL
  - Synchronous mode causing race conditions between concurrent orders
  - Expired auth token on inventory→shipping calls → 401
  - No auto token refresh configured – masked by the expired token
  - No circuit breaker – failed requests hammer the inventory service
  - Missing idempotency key – retries create duplicate orders
- **Dependencies**: the `timeout` fix depends on the `wrong_url` fix; `token_refresh` depends on `expired_token`; `idempotency` depends on the `async` fix

---
## Multi-Dimensional Grading Rubric

The grader uses a **4-dimension weighted rubric**, not a simple `issues_fixed / total` ratio:

| Dimension | Weight | What It Measures |
|---|---|---|
| **Fix Score** | 40% | `issues_fixed / total_issues` – how many bugs were actually resolved |
| **Strategy Score** | 25% | Did the agent follow a logical approach? Inspect before fixing, avoid repeats, follow dependency order, use all action types |
| **Diagnosis Score** | 20% | Did the agent inspect the service (logs/config) **before** submitting a fix for it? |
| **Efficiency Score** | 15% | `remaining_steps / max_steps` – faster solutions score higher |

```
Final Score = fix × 0.40 + strategy × 0.25 + diagnosis × 0.20 + efficiency × 0.15
Clamped to (0.001, 0.999) – never exactly 0.0 or 1.0
```
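The formula above is a plain weighted sum with clamping; a minimal sketch (illustrative, not the grader's actual code):

```python
def final_score(fix, strategy, diagnosis, efficiency):
    """Combine the four rubric dimensions (each in [0, 1]) using the weights above."""
    raw = fix * 0.40 + strategy * 0.25 + diagnosis * 0.20 + efficiency * 0.15
    # Clamp so the final score is never exactly 0.0 or 1.0
    return min(0.999, max(0.001, raw))

final_score(1.0, 1.0, 1.0, 1.0)  # clamped to 0.999
```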
**Strategy scoring details:**

- Did the agent inspect logs/config before submitting any fix? (+1)
- Ratio of unique inspections to total inspections (no wasteful repeats) (+1)
- Did fixes follow the optimal dependency order? (+1)
- Did the agent use a variety of action types? (+1)

### Baseline Scores (Rule-Based Heuristic Agent)

| Task | Score | Steps Used | Issues Fixed |
|---|---|---|---|
| Easy | ~0.75 | 7 | 2/2 |
| Medium | ~0.55 | 10 | 3/3 |
| Hard | ~0.45 | 15 | 5/5 |

*The baseline uses a deterministic heuristic (inspect all logs → inspect all configs → submit known fixes). An LLM-based agent following a good debugging strategy can score higher.*

---
## Reward Shaping

Every action produces a meaningful reward signal – not just sparse end-of-episode feedback:

| Action | Reward | Condition |
|---|---|---|
| `inspect_logs` (first time, finds error patterns) | **+0.15** | New issue-related log patterns found |
| `inspect_logs` (first time, no issues here) | +0.05 | Valid inspection, no errors in this service |
| `inspect_logs` (repeat, no new info) | 0.00 | Already inspected, nothing changed |
| `inspect_logs` (repeat, after a fix) | +0.05 | Dynamic logs appeared after a recent fix |
| `inspect_config` (service has issues) | +0.05 | Relevant config retrieved |
| `inspect_config` (service is clean) | +0.01 | Config retrieved but no issues here |
| `inspect_config` (repeat) | 0.00 | Already inspected |
| `inspect_endpoint` | +0.02 to +0.05 | Simulated endpoint test |
| `submit_fix` (correct fix) | **+0.25** | Issue resolved, service health updated |
| `submit_fix` (correct + inspected first) | **+0.30** | Fix + strategy bonus for diagnosis |
| `submit_fix` (partial – close but not exact) | +0.03 | Right key, approximately right value |
| `submit_fix` (wrong fix) | **-0.10** | Incorrect fix payload |
| `submit_fix` (empty payload) | -0.10 | Empty `fix_payload` submitted |
| All issues fixed | **+0.20** | Episode completion bonus |
| Invalid target / invalid action | -0.05 | Bad input |
| Every step | **-0.01** | Step cost – encourages efficiency |

---
## Action & Observation Spaces

### Action Schema (Pydantic model: `ApiDebugAction`)

```json
{
  "action_type": "inspect_logs | inspect_config | inspect_endpoint | submit_fix",
  "target": "<service_name>",
  "fix_payload": {
    "config_key": "corrected_value"
  }
}
```

- `action_type` (required): One of the 4 debugging actions
- `target` (required): The service to act on (from `available_targets` in the observation)
- `fix_payload` (optional): Required only for `submit_fix` – the configuration correction

**Fix payload formats:**

```jsonc
// Simple key-value fix
{"timeout": 10}

// Nested key fix (dot notation)
{"headers.Authorization": "Bearer my_api_key"}

// Complex nested object fix
{"retry": {"max_retries": 3, "backoff_factor": 2, "retry_on_status": [429, 500]}}
```
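Dot-notation keys expand into nested config updates. A sketch of how such a payload could be applied (illustrative helper; the environment's internal implementation may differ):

```python
def apply_fix(config: dict, fix_payload: dict) -> dict:
    """Apply a fix payload to a config dict, expanding dot-notation keys."""
    for key, value in fix_payload.items():
        node = config
        *parents, leaf = key.split(".")       # "headers.Authorization" -> ["headers"], "Authorization"
        for part in parents:
            node = node.setdefault(part, {})  # descend, creating intermediate dicts as needed
        node[leaf] = value
    return config

cfg = {"headers": {"Content-Type": "text/plain"}, "timeout": 30}
apply_fix(cfg, {"headers.Authorization": "Bearer my_api_key", "timeout": 10})
# cfg now contains the Authorization header and the corrected timeout
```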
### Observation Schema (Pydantic model: `ApiDebugObservation`)

```json
{
  "task_id": "easy",
  "task_description": "A payment processing API integration is failing...",
  "logs": ["[ERROR] 2026-03-25T10:15:23Z POST /process -> 401 Unauthorized", "..."],
  "config_snapshot": {"headers": {"Content-Type": "text/plain"}, "timeout": 30},
  "api_response": {"status": "error", "status_code": 401, "error": "Missing Authorization"},
  "service_status": {"payment_client": "error", "payment_gateway": "healthy"},
  "dependency_graph": {"payment_client": ["payment_gateway"], "payment_gateway": []},
  "error_trace": [
    "[CRITICAL] payment_client: Missing Authorization header",
    "  └─> payment_gateway: All requests rejected with 401"
  ],
  "hints": ["Check headers.Authorization"],
  "remaining_steps": 14,
  "issues_found": 1,
  "issues_fixed": 0,
  "issues_total": 2,
  "action_result": "Inspected logs for 'payment_client'. Found relevant error patterns!",
  "available_targets": ["payment_client", "payment_gateway"],
  "done": false,
  "reward": 0.15
}
```
**Key observation fields for agent reasoning:**

- `service_status` – shows which services are healthy/degraded/error (updates dynamically)
- `dependency_graph` – shows service relationships (the agent should fix upstream services first)
- `error_trace` – shows active error cascades (shrinks as issues are fixed)
- `hints` – progressive hints that get more specific as steps are used
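Since upstream issues can mask downstream ones, a natural way to use `dependency_graph` is to fix services in topological order. A sketch with Python's standard library (assuming, as in the example schema above, that the graph maps each service to the services it depends on):

```python
from graphlib import TopologicalSorter

def fix_order(dependency_graph: dict) -> list:
    """Order services so that each service's dependencies come before it."""
    return list(TopologicalSorter(dependency_graph).static_order())

print(fix_order({"payment_client": ["payment_gateway"], "payment_gateway": []}))
# ['payment_gateway', 'payment_client'] -- dependency first
```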
---
## Example Transcript

```
>>> reset(task_id="easy")
task_description: "A payment processing API integration is failing..."
service_status: {payment_client: "error", payment_gateway: "healthy"}
error_trace:
  [CRITICAL] payment_client: Missing Authorization header
    └─> payment_gateway: All requests rejected with 401
  [ERROR] payment_client: Wrong Content-Type (text/plain instead of application/json)
    └─> payment_gateway: Request body parsing fails
issues_total: 2, remaining_steps: 15

>>> step(action_type="inspect_logs", target="payment_client")
logs: [
  "[INFO] Payment client initialized...",
  "[ERROR] POST /process -> 401 Unauthorized",
  "[ERROR] Response: {'error': 'Missing or invalid Authorization header'}",
  "[WARN] Request headers: Content-Type=text/plain",
  "[ERROR] POST /process -> 415 Unsupported Media Type",
]
issues_found: 2, reward: +0.15

>>> step(action_type="inspect_config", target="payment_client")
config_snapshot: {
  "base_url": "https://api.paymentgateway.com/v2",
  "headers": {"Content-Type": "text/plain", "Accept": "application/json"},
  "timeout": 30
}
reward: +0.05  // Service has issues, first inspection

>>> step(action_type="submit_fix", target="payment_client",
         fix_payload={"headers.Authorization": "Bearer sk_live_my_key"})
action_result: "Fix accepted! Fixed 1 issue(s). Total: 1/2"
service_status: {payment_client: "degraded", payment_gateway: "healthy"}
reward: +0.30  // Fix (+0.25) + strategy bonus (+0.05) for inspecting first

>>> step(action_type="inspect_logs", target="payment_client")
logs: [...original logs...,
  "[INFO] Authorization header set. Retrying request..."  // NEW dynamic log!
]
reward: +0.05  // Re-inspection has new dynamic logs

>>> step(action_type="submit_fix", target="payment_client",
         fix_payload={"headers.Content-Type": "application/json"})
action_result: "Fix accepted! All issues fixed! Episode complete."
service_status: {payment_client: "healthy", payment_gateway: "healthy"}
error_trace: ["All issues resolved. No error cascades active."]
reward: +0.50  // Fix (+0.25) + strategy (+0.05) + completion bonus (+0.20)
done: true

>>> grade()
score: 0.82
  fix_score: 1.00        (2/2 fixed)
  diagnosis_score: 1.00  (inspected before every fix)
  efficiency_score: 0.67 (5/15 steps used)
  strategy_score: 0.80   (inspected first, used multiple action types)
```
---

## Setup & Usage

### Prerequisites

- Python 3.10+
- [uv](https://docs.astral.sh/uv/) (recommended) or pip
- Docker (for containerized deployment)

### Install Dependencies

```bash
# Clone the repository
git clone https://github.com/yadnyeshkolte/openenv-task.git
cd openenv-task

# Install dependencies with uv
uv sync

# Or with pip
pip install -e .
```
### Run the Server Locally

```bash
# From the project root (openenv-task/)
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

The server will be available at `http://localhost:8000`. Visit `http://localhost:8000/docs` for interactive API documentation.

### Quick Test

```bash
# Reset environment
curl -X POST http://localhost:8000/reset

# Inspect logs
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "inspect_logs", "target": "payment_client"}'

# Submit a fix
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit_fix", "target": "payment_client", "fix_payload": {"headers.Authorization": "Bearer my_key"}}'
```

### Docker Build & Run

```bash
# From the project root (openenv-task/)
docker build -t api_debug_env -f Dockerfile .
docker run -p 8000:8000 api_debug_env
```
---

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Environment info, version, and feature list |
| `/reset` | POST | Reset the environment (accepts `task_id` and `seed` params) |
| `/step` | POST | Execute a debugging action |
| `/state` | GET | Get current state (episode_id, step_count) |
| `/schema` | GET | Get action/observation Pydantic schemas |
| `/tasks` | GET | List all 3 tasks with action schema and service dependencies |
| `/grader` | POST | Get the multi-dimensional grader score for the current episode |
| `/baseline` | POST | Run the rule-based baseline agent on all 3 tasks |
| `/health` | GET | Health check endpoint |
| `/docs` | GET | Interactive Swagger UI documentation |

---
## Running Inference

The `inference.py` script at the project root uses the OpenAI API client to run an LLM agent against all 3 tasks:

```bash
# Set your API credentials
export HF_TOKEN=your_huggingface_token

# Optional: override model and API base
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export API_BASE_URL=https://router.huggingface.co/v1

# Run inference from the project root
python inference.py
```

**Output format** (stdout):

```
[START] task=easy env=api_debug_env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=inspect_logs(target=payment_client) reward=0.15 done=false error=null
[STEP] step=2 action=submit_fix(target=payment_client, fix={...}) reward=0.30 done=false error=null
...
[END] success=true steps=5 score=0.820 rewards=0.15,0.30,...
```

The inference script:

- Uses the `openai.OpenAI` client for all LLM calls
- Reads `HF_TOKEN` (or `API_KEY`) from environment variables
- Includes retry logic with exponential backoff
- Emits `[START]`, `[STEP]`, `[END]` lines to stdout

---
## Running Tests

```bash
# From the project root (openenv-task/)
python -m pytest tests/ -v --tb=short
```

**70 tests** across 12 test classes covering:

- Scenario loading, seed randomization, and issue pool selection
- Environment reset and initialization
- All 4 action types: `inspect_logs`, `inspect_config`, `inspect_endpoint`, `submit_fix`
- Dynamic state: service health updates, dynamic log injection, error trace changes
- Multi-dimensional grading rubric (fix, diagnosis, efficiency, strategy)
- Strict fix validation with partial credit
- Value matching (strings, numbers, booleans, lists, Bearer tokens)
- Full episode integration tests (easy, medium, hard)
- Cascading failure mechanics and dependency chains
- Episode termination conditions

### Validate OpenEnv Compliance

```bash
openenv validate
```

---
## Project Structure

```
openenv-task/                          # Project root
├── __init__.py                        # Package init (exports ApiDebugEnv, Action, Observation)
├── client.py                          # OpenEnv client (WebSocket connection to server)
├── models.py                          # Pydantic Action & Observation type definitions
├── scenarios.py                       # Task scenarios with dependency graphs & issue pools
├── inference.py                       # MANDATORY inference script (LLM agent, OpenAI client)
├── openenv.yaml                       # OpenEnv metadata (spec v1)
├── pyproject.toml                     # Python project config & dependencies
├── Dockerfile                         # Docker build for HF Spaces deployment
├── LICENSE                            # BSD license
├── README.md                          # This file
├── PROGRESS.md                        # Development session log
├── AGENTS.md                          # Instructions for AI coding agents
├── server/
│   ├── __init__.py                    # Server package init
│   ├── api_debug_env_environment.py   # Core environment (reset/step/grade logic)
│   ├── app.py                         # FastAPI endpoints (/reset, /step, /tasks, etc.)
│   ├── Dockerfile                     # Alternate Dockerfile (same as root)
│   └── requirements.txt               # Server-specific requirements
├── scripts/
│   └── baseline_inference.py          # Alternate baseline script
└── tests/
    └── test_environment.py            # 70 unit & integration tests
```

### Key Files

| File | Purpose |
|---|---|
| `server/api_debug_env_environment.py` | **Core logic** – `reset()`, `step()`, `grade()`, dynamic state, cascading failures |
| `scenarios.py` | **Task definitions** – issue pools, dependency graphs, dynamic logs, service configs |
| `models.py` | **Type definitions** – `ApiDebugAction` and `ApiDebugObservation` Pydantic models |
| `inference.py` | **Mandatory** – LLM-based agent using the OpenAI client with `[START]/[STEP]/[END]` output |
| `openenv.yaml` | **Mandatory** – OpenEnv spec v1 metadata with task definitions |
| `server/app.py` | **FastAPI server** – all HTTP endpoints including `/baseline` and `/grader` |

---
## Design Philosophy

This environment is designed to be useful for **RL/agent training and evaluation**, not just a one-off benchmark:

1. **Dense Reward Signal** – every action type produces a positive or negative reward, enabling gradient-based training (GRPO, DPO, PPO) rather than relying on a sparse binary score at the end.
2. **Progressive Difficulty** – Easy (2 services, 2 issues) → Medium (3 services, 3 issues with 1 dependency) → Hard (5 services, 5 issues with multiple dependency chains). Difficulty comes from complexity, not ambiguity.
3. **Partial Credit** – close-but-wrong fixes get constructive feedback instead of flat rejection, providing a learning signal for agents that are on the right track.
4. **Strategy Incentives** – the multi-dimensional rubric rewards **how** the agent solves (inspect before fixing, follow dependencies, avoid waste), not just **what** it solves. This encourages emergent debugging strategies.
5. **Stochastic Scenarios** – seed-based randomization from expanded issue pools prevents policy overfitting to memorized scenarios while maintaining reproducibility.
6. **Cascading Dynamics** – upstream fixes change downstream state, requiring **multi-step causal reasoning**. The agent can't pattern-match each issue independently; it must understand the system architecture.
7. **Real-World Relevance** – API integration debugging is a genuine, high-value task that software engineers spend significant time on. The scenarios model actual failure patterns (expired tokens, rate limiting, missing headers, deprecated endpoints, race conditions).

---
## OpenEnv Spec Compliance

| Requirement | Status |
|---|---|
| OpenEnv spec v1 (`openenv.yaml`) | ✅ |
| Typed Pydantic models (Action, Observation) | ✅ |
| `reset()` / `step()` / `state()` API | ✅ |
| 3+ tasks with difficulty range | ✅ (easy, medium, hard) |
| Programmatic graders (0.0–1.0) | ✅ (multi-dimensional rubric) |
| Meaningful reward function | ✅ (dense, not sparse) |
| Baseline inference script | ✅ (`inference.py` at root) |
| OpenAI client for LLM calls | ✅ |
| `[START]/[STEP]/[END]` stdout format | ✅ |
| Dockerfile builds and runs | ✅ |
| HF Space deploys and responds | ✅ |
| `openenv validate` passes | ✅ |

---
## Hackathon Submission

- **HF Space**: [yadnyeshkolte/api-debug-env](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
- **GitHub**: [yadnyeshkolte/openenv-task](https://github.com/yadnyeshkolte/openenv-task)
- **Hackathon**: Meta PyTorch OpenEnv Hackathon × Scaler School of Technology