---
title: API Debug Env
emoji: 🔧
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
- openenv
---
# 🔧 API Integration Debugging Environment
> A real-world OpenEnv environment where an AI agent diagnoses and fixes broken API integrations across multi-service systems with **cascading failures**, **dynamic state**, and **multi-dimensional rubric grading**.
[![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2.2-blue)](https://github.com/meta-pytorch/OpenEnv)
[![Python](https://img.shields.io/badge/Python-3.10%2B-green)](https://python.org)
[![Tests](https://img.shields.io/badge/Tests-70%20passed-brightgreen)]()
[![HF Space](https://img.shields.io/badge/HF%20Space-Live-orange)](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
---
## Table of Contents
- [Motivation — Why API Debugging?](#motivation--why-api-debugging)
- [Environment Overview](#environment-overview)
- [Key Design Features](#key-design-features)
- [Tasks (Easy / Medium / Hard)](#tasks)
- [Multi-Dimensional Grading Rubric](#multi-dimensional-grading-rubric)
- [Reward Shaping](#reward-shaping)
- [Action & Observation Spaces](#action--observation-spaces)
- [Example Transcript](#example-transcript)
- [Setup & Usage](#setup--usage)
- [API Endpoints](#api-endpoints)
- [Running Inference](#running-inference)
- [Running Tests](#running-tests)
- [Project Structure](#project-structure)
- [Design Philosophy](#design-philosophy)
---
## Motivation — Why API Debugging?
API integration failures are among the most common and expensive problems in production software engineering. When microservices communicate — Service A calls Service B, which calls Service C — a single misconfiguration can cascade through the entire system, producing confusing error chains that take hours to diagnose.
Real-world API debugging requires:
- **Structured diagnosis** — reading error logs and configs across multiple services
- **Dependency awareness** — understanding which upstream failure is causing downstream errors
- **Strategic reasoning** — fixing root causes first to unmask hidden downstream bugs
- **Precision** — submitting exact configuration corrections, not approximate guesses
This environment simulates **real-world cascading API failures** with dynamic state that changes as the agent acts — not a static lookup puzzle.
---
## Environment Overview
```
┌──────────────────────────────────────────────────────────────────────┐
│                         Agent Debugging Loop                         │
│                                                                      │
│  1. reset(task_id)         → Initial observation with broken state   │
│  2. step(inspect_logs)     → Error logs with diagnostic clues        │
│  3. step(inspect_config)   → Current (broken) service configuration  │
│  4. step(inspect_endpoint) → Simulated API response (401, 504, ...)  │
│  5. step(submit_fix)       → Strict fix validation + cascade update  │
│  6. grade()                → Multi-dimensional rubric score [0, 1]   │
│                                                                      │
│  State updates dynamically: service health changes, new logs         │
│  appear, error cascades resolve as the agent fixes issues.           │
└──────────────────────────────────────────────────────────────────────┘
```
The agent interacts through the standard OpenEnv API:
- **`reset()`** → returns initial observation with broken service state
- **`step(action)`** → executes one debugging action, returns observation + reward
- **`state()`** → returns current environment state (episode_id, step_count)
- **`grade()`** → returns final score using multi-dimensional rubric
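For example, a minimal HTTP loop against a locally running server (a sketch only: the endpoints are listed under [API Endpoints](#api-endpoints) below, but the exact request shapes, such as JSON body vs. query parameters for `/reset`, should be confirmed at `/docs`):
```python
import requests

BASE = "http://localhost:8000"  # local server; see Setup & Usage below

# Start an episode; /reset accepts task_id and seed (body vs. query
# parameters is an assumption here -- check /docs for the exact shape).
obs = requests.post(f"{BASE}/reset", json={"task_id": "easy"}).json()

# One debugging step: inspect the logs of the first available service.
action = {"action_type": "inspect_logs", "target": obs["available_targets"][0]}
obs = requests.post(f"{BASE}/step", json=action).json()
print(obs["action_result"], obs["reward"])

# Episode bookkeeping and the final rubric score.
state = requests.get(f"{BASE}/state").json()
score = requests.post(f"{BASE}/grader").json()
```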
---
## Key Design Features
### 1. Cascading Failures with Service Dependency Graphs
Each task models a real multi-service ecosystem. Services depend on each other, and a bug in an upstream service **cascades** to all downstream services:
```
Hard Task Dependency Graph:
order_service ──┬──→ inventory_service ──┬──→ shipping_service
β”‚ └──→ auth_service
└──→ api_gateway
[ERROR] [DEGRADED] [HEALTHY]
```
- Fixing `order_service`'s wrong URL unmasks `inventory_service`'s timeout issue
- Fixing `inventory_service`'s expired token allows `shipping_service` to respond
- **Some issues are intentionally masked by upstream failures** — the agent must fix in the right order
### 2. Dynamic State
Unlike static environments, the state **changes as the agent acts**:
| What changes | How |
|---|---|
| **Service health** | Fixing issues updates service status: `error` → `degraded` → `healthy` |
| **Logs** | After a fix, re-inspecting logs shows **new entries** (e.g., "Authorization header set. Retrying...") |
| **Error traces** | The cascade chain shrinks as upstream issues are resolved |
| **Endpoint responses** | `inspect_endpoint` returns different HTTP errors based on current fix state |
### 3. Seed-Based Scenario Randomization
Each difficulty level has an **expanded issue pool** (more issues than are selected per episode):
| Difficulty | Pool Size | Selected Per Episode |
|---|---|---|
| Easy | 4 issues | 2 |
| Medium | 5 issues | 3 |
| Hard | 7 issues | 5 |
Passing a `seed` to `reset()` produces a **deterministic but varied** scenario — different seeds select different subsets from the pool and randomize log order. This prevents agents from memorizing fixed patterns.
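A quick determinism check, sketched with `requests` (whether `task_id` and `seed` travel as a JSON body or as query parameters is an assumption here; `/docs` shows the exact shape):
```python
import requests

BASE = "http://localhost:8000"

# Same task, same seed -> identical scenario (issue subset and log order).
obs_a = requests.post(f"{BASE}/reset", json={"task_id": "medium", "seed": 42}).json()
obs_b = requests.post(f"{BASE}/reset", json={"task_id": "medium", "seed": 42}).json()
assert obs_a["logs"] == obs_b["logs"]

# A different seed draws a different subset from the 5-issue pool.
obs_c = requests.post(f"{BASE}/reset", json={"task_id": "medium", "seed": 7}).json()
```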
### 4. Strict Fix Validation with Partial Credit
The grader validates both **keys and values** of submitted fixes:
- **Exact match** → Full credit (+0.25 reward)
- **Right key, close value** (e.g., timeout=7 when expected=10) → Partial credit (+0.03)
- **Right key, wrong value** (e.g., timeout=100 when expected=10) → Rejected
- **Wrong key entirely** → Penalized (-0.1)
- **Bearer token pattern matching** — `Bearer <any_valid_token>` is accepted
- **Numeric tolerance** — strict 10% tolerance
- **Boolean coercion** — `"true"`, `"1"`, `"yes"` all match `True`
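The rules above, restated as a minimal Python sketch (an illustration, not the grader's actual code; the width of the partial-credit band is a guess chosen to be consistent with the 7-vs-10 and 100-vs-10 examples):
```python
import re

def match_value(expected, submitted):
    """Classify a submitted config value as exact / partial / reject."""
    # Bearer tokens: any non-empty token after "Bearer " is accepted.
    if isinstance(expected, str) and expected.startswith("Bearer "):
        ok = isinstance(submitted, str) and re.fullmatch(r"Bearer \S+", submitted)
        return "exact" if ok else "reject"
    # Boolean coercion: "true", "1", "yes" all match True.
    if isinstance(expected, bool):
        if isinstance(submitted, str):
            submitted = submitted.strip().lower() in ("true", "1", "yes")
        return "exact" if bool(submitted) == expected else "reject"
    # Numbers: strict 10% tolerance; a wider band earns partial credit.
    if isinstance(expected, (int, float)) and isinstance(submitted, (int, float)):
        if abs(submitted - expected) <= 0.10 * abs(expected):
            return "exact"
        if abs(submitted - expected) <= abs(expected):  # e.g. 7 vs. 10 -> partial
            return "partial"
        return "reject"                                 # e.g. 100 vs. 10 -> reject
    # Everything else: strict equality.
    return "exact" if submitted == expected else "reject"
```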
---
## Tasks
### Easy: Payment API Integration (2 issues, 15 max steps)
**Scenario**: A payment processing client is failing to connect to the payment gateway. The agent must diagnose authentication and protocol errors.
- **Services**: `payment_client`, `payment_gateway`
- **Issue pool** (4 possible, 2 selected):
- Missing `Authorization` header (HTTP 401)
- Wrong `Content-Type` header — `text/plain` instead of `application/json` (HTTP 415)
- Timeout too low for payment processing (HTTP 504)
- Base URL pointing to deprecated v1 endpoint (HTTP 301)
- **Dependencies**: None — straightforward diagnosis
### Medium: Webhook Event Chain (3 issues, 25 max steps)
**Scenario**: A webhook notification system is dropping events across a 3-service chain. Events flow from sender → receiver → notification service, but multiple configuration issues are causing failures.
- **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
- **Issue pool** (5 possible, 3 selected):
- Rate limit mismatch (sender at 100/s, receiver accepts 10/s) → 429 errors
- Insufficient retry config (only 1 retry, no backoff, 429 not in retry list)
- Empty webhook signature header → receiver drops all events as unsigned
- Wrong target URL (`/webhook` vs `/hooks/incoming`) → 404 errors
- Payload compression enabled but receiver doesn't support gzip → 415 errors
- **Dependencies**: Retry issue is **masked** by rate limit — must fix rate limit first to see the retry problem
### Hard: E-Commerce Order Pipeline (5 issues, 40 max steps)
**Scenario**: A complex e-commerce order processing pipeline is failing with cascading errors across 5 services. Multiple dependency chains make this genuinely challenging for frontier models.
- **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
- **Issue pool** (7 possible, 5 selected):
- Deprecated URL (`/v1/check` → should be `/v2/reserve`) → 301 redirect
- Timeout too short (2s vs 4s processing time) — masked by wrong URL
- Synchronous mode causing race conditions between concurrent orders
- Expired auth token on inventory→shipping calls → 401
- No auto token refresh configured — masked by expired token
- No circuit breaker → failed requests hammer inventory service
- Missing idempotency key → retries create duplicate orders
- **Dependencies**: `timeout` depends on `wrong_url` fix; `token_refresh` depends on `expired_token` fix; `idempotency` depends on `async` fix
---
## Multi-Dimensional Grading Rubric
The grader uses a **4-dimension weighted rubric**, not a simple `issues_fixed / total` ratio:
| Dimension | Weight | What It Measures |
|---|---|---|
| **Fix Score** | 40% | `issues_fixed / total_issues` — how many bugs were actually resolved |
| **Strategy Score** | 25% | Did the agent follow a logical approach? Inspect before fix, avoid repeats, follow dependency order, use all action types |
| **Diagnosis Score** | 20% | Did the agent inspect the service (logs/config) **before** submitting a fix for it? |
| **Efficiency Score** | 15% | `remaining_steps / max_steps` — faster solutions score higher |
```
Final Score = fix × 0.40 + strategy × 0.25 + diagnosis × 0.20 + efficiency × 0.15
Clamped to (0.001, 0.999) — never exactly 0.0 or 1.0
```
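Equivalently, as a tiny Python helper (a restatement of the formula above, not the grader source):
```python
def final_score(fix, strategy, diagnosis, efficiency):
    """Weighted 4-dimension rubric, clamped away from exactly 0.0 and 1.0."""
    raw = 0.40 * fix + 0.25 * strategy + 0.20 * diagnosis + 0.15 * efficiency
    return min(max(raw, 0.001), 0.999)

# The easy-task transcript below: 0.40*1.0 + 0.25*0.8 + 0.20*1.0 + 0.15*0.67 = 0.90
```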
**Strategy scoring details:**
- Did the agent inspect logs/config before submitting any fix? (+1)
- Ratio of unique inspections to total inspections (no wasteful repeats) (+1)
- Did fixes follow the optimal dependency order? (+1)
- Did the agent use a variety of action types? (+1)
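Read as code, each checklist item contributes up to one of four points (a plausible sketch; the actual implementation may weight these differently):
```python
def strategy_score(inspected_first, unique_inspection_ratio,
                   followed_dependency_order, used_all_action_types):
    """Each of the four checklist items above is worth up to 1 point."""
    points = (float(inspected_first)
              + unique_inspection_ratio          # unique / total inspections, in [0, 1]
              + float(followed_dependency_order)
              + float(used_all_action_types))
    return points / 4.0
```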
### Baseline Scores (Rule-Based Heuristic Agent)
| Task | Score | Steps Used | Issues Fixed |
|---|---|---|---|
| Easy | ~0.75 | 7 | 2/2 |
| Medium | ~0.55 | 10 | 3/3 |
| Hard | ~0.45 | 15 | 5/5 |
*The baseline uses a deterministic heuristic (inspect all logs → inspect all configs → submit known fixes). An LLM-based agent following good debugging strategy can score higher.*
---
## Reward Shaping
Every action produces a meaningful reward signal — not just sparse end-of-episode feedback:
| Action | Reward | Condition |
|---|---|---|
| `inspect_logs` (first time, finds error patterns) | **+0.15** | New issue-related log patterns found |
| `inspect_logs` (first time, no issues here) | +0.05 | Valid inspection, no errors in this service |
| `inspect_logs` (repeat, no new info) | 0.00 | Already inspected, nothing changed |
| `inspect_logs` (repeat, after a fix) | +0.05 | Dynamic logs appeared after a recent fix |
| `inspect_config` (service has issues) | +0.05 | Relevant config retrieved |
| `inspect_config` (service is clean) | +0.01 | Config retrieved but no issues here |
| `inspect_config` (repeat) | 0.00 | Already inspected |
| `inspect_endpoint` | +0.02 to +0.05 | Simulated endpoint test |
| `submit_fix` (correct fix) | **+0.25** | Issue resolved, service health updated |
| `submit_fix` (correct + inspected first) | **+0.30** | Fix + strategy bonus for diagnosis |
| `submit_fix` (partial — close but not exact) | +0.03 | Right key, approximately right value |
| `submit_fix` (wrong fix) | **-0.10** | Incorrect fix payload |
| `submit_fix` (empty payload) | -0.10 | Empty fix_payload submitted |
| All issues fixed | **+0.20** | Episode completion bonus |
| Invalid target / invalid action | -0.05 | Bad input |
| Every step | **-0.01** | Step cost — encourages efficiency |
---
## Action & Observation Spaces
### Action Schema (Pydantic model: `ApiDebugAction`)
```json
{
"action_type": "inspect_logs | inspect_config | inspect_endpoint | submit_fix",
"target": "<service_name>",
"fix_payload": {
"config_key": "corrected_value"
}
}
```
- `action_type` (required): One of the 4 debugging actions
- `target` (required): The service to act on (from `available_targets` in the observation)
- `fix_payload` (optional): Required only for `submit_fix` — the configuration correction
**Fix payload formats:**
```json
// Simple key-value fix
{"timeout": 10}
// Nested key fix (dot notation)
{"headers.Authorization": "Bearer my_api_key"}
// Complex nested object fix
{"retry": {"max_retries": 3, "backoff_factor": 2, "retry_on_status": [429, 500]}}
```
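A minimal Pydantic sketch consistent with this schema (field names come from this README; the real definition in `models.py` may add enums or validators):
```python
from typing import Any, Dict, Optional
from pydantic import BaseModel

class ApiDebugAction(BaseModel):
    action_type: str                               # one of the 4 debugging actions
    target: str                                    # service name from available_targets
    fix_payload: Optional[Dict[str, Any]] = None   # required only for submit_fix

action = ApiDebugAction(
    action_type="submit_fix",
    target="payment_client",
    fix_payload={"headers.Authorization": "Bearer my_api_key"},
)
```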
### Observation Schema (Pydantic model: `ApiDebugObservation`)
```json
{
"task_id": "easy",
"task_description": "A payment processing API integration is failing...",
"logs": ["[ERROR] 2026-03-25T10:15:23Z POST /process -> 401 Unauthorized", "..."],
"config_snapshot": {"headers": {"Content-Type": "text/plain"}, "timeout": 30},
"api_response": {"status": "error", "status_code": 401, "error": "Missing Authorization"},
"service_status": {"payment_client": "error", "payment_gateway": "healthy"},
"dependency_graph": {"payment_client": ["payment_gateway"], "payment_gateway": []},
"error_trace": [
"[CRITICAL] payment_client: Missing Authorization header",
" └─> payment_gateway: All requests rejected with 401"
],
"hints": ["Check headers.Authorization"],
"remaining_steps": 14,
"issues_found": 1,
"issues_fixed": 0,
"issues_total": 2,
"action_result": "Inspected logs for 'payment_client'. Found relevant error patterns!",
"available_targets": ["payment_client", "payment_gateway"],
"done": false,
"reward": 0.15
}
```
**Key observation fields for agent reasoning:**
- `service_status` — shows which services are healthy/degraded/error (updates dynamically)
- `dependency_graph` — shows service relationships (the agent should fix upstream first; see the sketch after this list)
- `error_trace` — shows active error cascades (shrinks as issues are fixed)
- `hints` — progressive hints that get more specific as steps are used
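Because upstream failures mask downstream issues, an agent can turn `dependency_graph` into a fix order with a topological sort. A sketch using the easy task's graph from the example observation above:
```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# dependency_graph maps each service to the services it calls.
dependency_graph = {"payment_client": ["payment_gateway"], "payment_gateway": []}

# TopologicalSorter treats the mapped values as predecessors, so static_order()
# yields callees first; reverse it to visit upstream callers first.
fix_order = list(reversed(list(TopologicalSorter(dependency_graph).static_order())))
print(fix_order)  # ['payment_client', 'payment_gateway']
```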
---
## Example Transcript
```
>>> reset(task_id="easy")
task_description: "A payment processing API integration is failing..."
service_status: {payment_client: "error", payment_gateway: "healthy"}
error_trace:
[CRITICAL] payment_client: Missing Authorization header
└─> payment_gateway: All requests rejected with 401
[ERROR] payment_client: Wrong Content-Type (text/plain instead of application/json)
└─> payment_gateway: Request body parsing fails
issues_total: 2, remaining_steps: 15
>>> step(action_type="inspect_logs", target="payment_client")
logs: [
"[INFO] Payment client initialized...",
"[ERROR] POST /process -> 401 Unauthorized",
"[ERROR] Response: {'error': 'Missing or invalid Authorization header'}",
"[WARN] Request headers: Content-Type=text/plain",
"[ERROR] POST /process -> 415 Unsupported Media Type",
]
issues_found: 2, reward: +0.15
>>> step(action_type="inspect_config", target="payment_client")
config_snapshot: {
"base_url": "https://api.paymentgateway.com/v2",
"headers": {"Content-Type": "text/plain", "Accept": "application/json"},
"timeout": 30
}
reward: +0.05 // Service has issues, first inspection
>>> step(action_type="submit_fix", target="payment_client",
fix_payload={"headers.Authorization": "Bearer sk_live_my_key"})
action_result: "Fix accepted! Fixed 1 issue(s). Total: 1/2"
service_status: {payment_client: "degraded", payment_gateway: "healthy"}
reward: +0.30 // Fix (+0.25) + strategy bonus (+0.05) for inspecting first
>>> step(action_type="inspect_logs", target="payment_client")
logs: [...original logs...,
"[INFO] Authorization header set. Retrying request..." // NEW dynamic log!
]
reward: +0.05 // Re-inspection has new dynamic logs
>>> step(action_type="submit_fix", target="payment_client",
fix_payload={"headers.Content-Type": "application/json"})
action_result: "Fix accepted! All issues fixed! Episode complete."
service_status: {payment_client: "healthy", payment_gateway: "healthy"}
error_trace: ["All issues resolved. No error cascades active."]
reward: +0.50 // Fix (+0.25) + strategy (+0.05) + completion bonus (+0.20)
done: true
>>> grade()
score: 0.90
fix_score: 1.00 (2/2 fixed)
diagnosis_score: 1.00 (inspected before every fix)
efficiency_score: 0.67 (5/15 steps used)
strategy_score: 0.80 (inspected first, used multiple action types)
```
---
## Setup & Usage
### Prerequisites
- Python 3.10+
- [uv](https://docs.astral.sh/uv/) (recommended) or pip
- Docker (for containerized deployment)
### Install Dependencies
```bash
# Clone the repository
git clone https://github.com/yadnyeshkolte/openenv-task.git
cd openenv-task
# Install dependencies with uv
uv sync
# Or with pip
pip install -e .
```
### Run the Server Locally
```bash
# From the project root (openenv-task/)
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```
The server will be available at `http://localhost:8000`. Visit `http://localhost:8000/docs` for interactive API documentation.
### Quick Test
```bash
# Reset environment
curl -X POST http://localhost:8000/reset
# Inspect logs
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action_type": "inspect_logs", "target": "payment_client"}'
# Submit a fix
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action_type": "submit_fix", "target": "payment_client", "fix_payload": {"headers.Authorization": "Bearer my_key"}}'
```
### Docker Build & Run
```bash
# From the project root (openenv-task/)
docker build -t api_debug_env -f Dockerfile .
docker run -p 8000:8000 api_debug_env
```
---
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Environment info, version, and feature list |
| `/reset` | POST | Reset environment (accepts `task_id` and `seed` params) |
| `/step` | POST | Execute a debugging action |
| `/state` | GET | Get current state (episode_id, step_count) |
| `/schema` | GET | Get action/observation Pydantic schemas |
| `/tasks` | GET | List all 3 tasks with action schema and service dependencies |
| `/grader` | POST | Get multi-dimensional grader score for current episode |
| `/baseline` | POST | Run the rule-based baseline agent on all 3 tasks |
| `/health` | GET | Health check endpoint |
| `/docs` | GET | Interactive Swagger UI documentation |
---
## Running Inference
The `inference.py` script at the project root uses the OpenAI API client to run an LLM agent against all 3 tasks:
```bash
# Set your API credentials
export HF_TOKEN=your_huggingface_token
# Optional: override model and API base
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export API_BASE_URL=https://router.huggingface.co/v1
# Run inference from the project root
python inference.py
```
**Output format** (stdout):
```
[START] task=easy env=api_debug_env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=inspect_logs(target=payment_client) reward=0.15 done=false error=null
[STEP] step=2 action=submit_fix(target=payment_client, fix={...}) reward=0.30 done=false error=null
...
[END] success=true steps=5 score=0.820 rewards=0.15,0.30,...
```
The inference script:
- Uses `openai.OpenAI` client for all LLM calls
- Reads `HF_TOKEN` (or `API_KEY`) from environment variables
- Includes retry logic with exponential backoff (sketched below)
- Emits `[START]`, `[STEP]`, `[END]` lines to stdout
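The backoff pattern looks roughly like this (a generic sketch, not the actual `inference.py` code):
```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Retry fn() with exponential backoff plus jitter: ~1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * 2 ** attempt + random.random())
```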
---
## Running Tests
```bash
# From the project root (openenv-task/)
python -m pytest tests/ -v --tb=short
```
**70 tests** across 12 test classes covering:
- Scenario loading, seed randomization, and issue pool selection
- Environment reset and initialization
- All 4 action types: `inspect_logs`, `inspect_config`, `inspect_endpoint`, `submit_fix`
- Dynamic state: service health updates, dynamic log injection, error trace changes
- Multi-dimensional grading rubric (fix, diagnosis, efficiency, strategy)
- Strict fix validation with partial credit
- Value matching (strings, numbers, booleans, lists, Bearer tokens)
- Full episode integration tests (easy, medium, hard)
- Cascading failure mechanics and dependency chains
- Episode termination conditions
### Validate OpenEnv Compliance
```bash
openenv validate
```
---
## Project Structure
```
openenv-task/                        # Project root
├── __init__.py                      # Package init (exports ApiDebugEnv, Action, Observation)
├── client.py                        # OpenEnv client (WebSocket connection to server)
├── models.py                        # Pydantic Action & Observation type definitions
├── scenarios.py                     # Task scenarios with dependency graphs & issue pools
├── inference.py                     # MANDATORY inference script (LLM agent, OpenAI client)
├── openenv.yaml                     # OpenEnv metadata (spec v1)
├── pyproject.toml                   # Python project config & dependencies
├── Dockerfile                       # Docker build for HF Spaces deployment
├── LICENSE                          # BSD license
├── README.md                        # This file
├── PROGRESS.md                      # Development session log
├── AGENTS.md                        # Instructions for AI coding agents
├── server/
│   ├── __init__.py                  # Server package init
│   ├── api_debug_env_environment.py # Core environment (reset/step/grade logic)
│   ├── app.py                       # FastAPI endpoints (/reset, /step, /tasks, etc.)
│   ├── Dockerfile                   # Alternate Dockerfile (same as root)
│   └── requirements.txt             # Server-specific requirements
├── scripts/
│   └── baseline_inference.py        # Alternate baseline script
└── tests/
    └── test_environment.py          # 70 unit & integration tests
```
### Key Files
| File | Purpose |
|---|---|
| `server/api_debug_env_environment.py` | **Core logic** — `reset()`, `step()`, `grade()`, dynamic state, cascading failures |
| `scenarios.py` | **Task definitions** — issue pools, dependency graphs, dynamic logs, service configs |
| `models.py` | **Type definitions** — `ApiDebugAction` and `ApiDebugObservation` Pydantic models |
| `inference.py` | **Mandatory** — LLM-based agent using OpenAI client with `[START]/[STEP]/[END]` output |
| `openenv.yaml` | **Mandatory** — OpenEnv spec v1 metadata with task definitions |
| `server/app.py` | **FastAPI server** — all HTTP endpoints including `/baseline` and `/grader` |
---
## Design Philosophy
This environment is designed to be useful for **RL/agent training and evaluation**, not just a one-off benchmark:
1. **Dense Reward Signal** — every action type produces positive or negative reward, enabling gradient-based training (GRPO, DPO, PPO). Not just a sparse binary score at the end.
2. **Progressive Difficulty** — Easy (2 services, 2 issues) → Medium (3 services, 3 issues with 1 dependency) → Hard (5 services, 5 issues with multiple dependency chains). Difficulty comes from complexity, not ambiguity.
3. **Partial Credit** — close-but-wrong fixes get constructive feedback instead of just rejection. This provides learning signal for agents that are on the right track.
4. **Strategy Incentives** — the multi-dimensional rubric rewards **how** the agent solves (inspect before fix, follow dependencies, avoid waste), not just **what** it solves. This encourages emergent debugging strategies.
5. **Stochastic Scenarios** — seed-based randomization from expanded issue pools prevents policy overfitting to memorized scenarios while maintaining reproducibility.
6. **Cascading Dynamics** — upstream fixes change downstream state, requiring **multi-step causal reasoning**. The agent can't just pattern-match each issue independently — it must understand the system architecture.
7. **Real-World Relevance** — API integration debugging is a genuine, high-value task that software engineers spend significant time on. The scenarios model actual failure patterns (expired tokens, rate limiting, missing headers, deprecated endpoints, race conditions).
---
## OpenEnv Spec Compliance
| Requirement | Status |
|---|---|
| OpenEnv spec v1 (`openenv.yaml`) | ✅ |
| Typed Pydantic models (Action, Observation) | ✅ |
| `reset()` / `step()` / `state()` API | ✅ |
| 3+ tasks with difficulty range | ✅ (easy, medium, hard) |
| Programmatic graders (0.0–1.0) | ✅ (multi-dimensional rubric) |
| Meaningful reward function | ✅ (dense, not sparse) |
| Baseline inference script | ✅ (`inference.py` at root) |
| OpenAI client for LLM calls | ✅ |
| `[START]/[STEP]/[END]` stdout format | ✅ |
| Dockerfile builds and runs | ✅ |
| HF Space deploys and responds | ✅ |
| `openenv validate` passes | ✅ |
---
## Hackathon Submission
- **HF Space**: [yadnyeshkolte/api-debug-env](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
- **GitHub**: [yadnyeshkolte/openenv-task](https://github.com/yadnyeshkolte/openenv-task)
- **Hackathon**: Meta PyTorch OpenEnv Hackathon × Scaler School of Technology