---
title: Code Debug Env
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web
---
# code-debug-env

An OpenEnv environment for training AI agents to repair buggy Python code. The agent receives a broken function and must iteratively submit patches until all unit tests pass.
## Quick Start

```python
from code_debug_env import CodeDebugEnv, Action

async with CodeDebugEnv(base_url="https://luciferai-devil-code-debug-env.hf.space") as env:
    obs = await env.reset(task_id="task_easy")
    print(obs.buggy_code)  # The broken function

    result = await env.step(Action(
        patch="def find_max_subarray_sum(nums):\n    ...",
        task_id="task_easy",
        think="The off-by-one error is in range(1, len(nums)-1)",
    ))
    print(result.observation.score)  # 0.0–1.0
```
## Action Space

| Field | Type | Required | Description |
|---|---|---|---|
| `patch` | str | Yes | Full Python source replacement for the function |
| `task_id` | str | Yes | Which task to target |
| `think` | str | No | Chain-of-thought reasoning (earns +0.2 reward bonus) |
## Observation Space

| Field | Type | Description |
|---|---|---|
| `buggy_code` | str | Current version of the code |
| `test_results` | list | Per-test pass/fail with error messages |
| `passed` / `total` | int | Tests passing out of total |
| `score` | float | Composite reward for this step (0.0–1.0) |
| `done` | bool | True when all tests pass or max_steps reached |
## Reward Function

```text
r = 0.5 × (tests_passed / tests_total)    # correctness
  + 0.2 × (1 if valid syntax else 0)      # format
  + 0.2 × (1 if <think> provided else 0)  # chain-of-thought bonus
  + 0.1 × (steps_remaining / max_steps)   # efficiency
  - 0.3 × (1 if timeout/crash else 0)     # penalty
```
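The formula above translates directly into code. A minimal sketch, assuming the final score is clamped to the documented 0.0–1.0 range (the clamping is an assumption inferred from the score bounds, not confirmed by the source):

```python
def compute_reward(tests_passed: int, tests_total: int,
                   valid_syntax: bool, think_provided: bool,
                   steps_remaining: int, max_steps: int,
                   crashed: bool) -> float:
    """Hypothetical helper implementing the reward formula above."""
    r = 0.5 * (tests_passed / tests_total)        # correctness
    r += 0.2 if valid_syntax else 0.0             # format
    r += 0.2 if think_provided else 0.0           # chain-of-thought bonus
    r += 0.1 * (steps_remaining / max_steps)      # efficiency
    r -= 0.3 if crashed else 0.0                  # timeout/crash penalty
    return max(0.0, min(1.0, r))                  # assumed clamp to 0.0-1.0
```

A one-shot fix with reasoning provided (all tests pass, no steps spent) scores the full 1.0; a crash with no passing tests bottoms out at 0.0.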
## Tasks

| ID | Difficulty | Description | Variants |
|---|---|---|---|
| `task_easy` | Easy | Single off-by-one error | 6+ |
| `task_medium` | Medium | Two independent bugs | 6+ |
| `task_hard` | Hard | 3+ subtle bugs in a recursive function | 7+ |
**Total:** 19 procedurally generated tasks via `task_generator.py`.
## Setup

```bash
pip install openenv-core
pip install git+https://huggingface.co/spaces/luciferai-devil/code-debug-env
```
### Docker

```bash
docker pull luciferai-devil/code-debug-env:latest
docker run -p 8000:8000 luciferai-devil/code-debug-env
```
## Baseline Results (via OpenAI API)

Evaluated using `gpt-4o-mini` / `gpt-oss-120b` reasoning models.
| Task | Agent | Score | Notes |
|---|---|---|---|
| task_easy | LLM | 0.99 | One-shot fix with CoT |
| task_medium | LLM | 0.74 | Iterative refinement |
| task_hard | LLM | 0.59 | Struggles with complex recursion depth |
**Average Score: 0.77**
## Training with GRPO

See `baseline/run_baseline.py` for the inference client.
Compatible with TRL's `GRPOTrainer`: pass a `reward_fn` that calls `/grader`.
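A minimal sketch of such a reward function. `GRPOTrainer` expects a callable that maps a batch of completions to one score each; the grading call is injected as a plain callable here so the sketch stays self-contained (in practice it would wrap an HTTP request to the `/grader` endpoint, whose exact parameters are not documented above and are therefore left abstract):

```python
from typing import Callable, List


def make_reward_fn(grade: Callable[[str], float]):
    """Build a GRPOTrainer-style reward function from a grading callable.

    `grade(patch) -> float` stands in for a call to the /grader endpoint;
    its request parameters are an implementation detail of the server.
    """
    def reward_fn(prompts: List[str], completions: List[str], **kwargs) -> List[float]:
        # One scalar reward per completion, as GRPOTrainer expects.
        return [float(grade(completion)) for completion in completions]
    return reward_fn
```

Swapping in a real grader is then a one-line change: pass a function that POSTs or GETs the completion to the Space and returns the parsed `score`.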
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start a new episode |
| `/step` | POST | Submit an action, get an observation |
| `/state` | GET | Get current episode state |
| `/tasks` | GET | List all available tasks |
| `/grader` | GET | Grade a submission directly |
| `/baseline` | GET | Run the baseline agent on all tasks |
## Local Development

```bash
# Run the server locally
uvicorn code_debug_env.server.app:app --reload --port 8000

# Build the Docker image
docker build -t code-debug-env -f server/Dockerfile .

# Run the container
docker run -p 8000:8000 code-debug-env

# Smoke test
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
curl http://localhost:8000/tasks
```