---
title: Code Debug Env
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web
---
# code-debug-env
An OpenEnv environment for training AI agents to repair buggy Python code.
The agent receives a broken function and must iteratively submit patches until
all unit tests pass.
## Quick Start
```python
import asyncio
from code_debug_env import CodeDebugEnv, Action

async def main():
    async with CodeDebugEnv(base_url="https://luciferai-devil-code-debug-env.hf.space") as env:
        obs = await env.reset(task_id="task_easy")
        print(obs.buggy_code)  # The broken function

        result = await env.step(Action(
            patch="def find_max_subarray_sum(nums):\n    ...",
            task_id="task_easy",
            think="The off-by-one error is in range(1, len(nums)-1)",
        ))
        print(result.observation.score)  # 0.0–1.0

asyncio.run(main())
```
## Action Space
| Field | Type | Required | Description |
|---|---|---|---|
| `patch` | str | Yes | Full Python source replacement for the function |
| `task_id` | str | Yes | Which task to target |
| `think` | str | No | Chain-of-thought reasoning (earns +0.2 reward bonus) |
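A step request built from these fields might look like the following minimal sketch. The `DebugAction` dataclass is a hypothetical stand-in for `code_debug_env.Action` (field names come from the table above; the real class definition may differ):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class DebugAction:
    # Hypothetical stand-in for code_debug_env.Action; fields mirror the table above.
    patch: str                    # full Python source replacement (required)
    task_id: str                  # which task to target (required)
    think: Optional[str] = None   # optional chain-of-thought, earns the +0.2 bonus

action = DebugAction(
    patch="def find_max_subarray_sum(nums):\n    ...",
    task_id="task_easy",
    think="The off-by-one error is in the range bounds",
)
payload = asdict(action)  # JSON-ready dict for the /step request body
```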
## Observation Space
| Field | Type | Description |
|---|---|---|
| `buggy_code` | str | Current version of the code |
| `test_results` | list | Per-test pass/fail with error messages |
| `passed` / `total` | int | Tests passing out of total |
| `score` | float | Composite reward for this step (0.0–1.0) |
| `done` | bool | True when all tests pass or max_steps reached |
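A small helper that consumes these fields (sketched here over a plain dict; the actual observation object's attribute access may differ) shows how `passed`, `total`, `score`, and `done` fit together:

```python
def summarize(obs: dict) -> str:
    # obs keys mirror the observation table above (assumed plain-dict form).
    status = "solved" if obs["done"] and obs["passed"] == obs["total"] else "in progress"
    return f"{obs['passed']}/{obs['total']} tests passing, score={obs['score']:.2f} ({status})"

print(summarize({"passed": 3, "total": 5, "score": 0.5, "done": False}))
# → 3/5 tests passing, score=0.50 (in progress)
```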
## Reward Function
```
r = 0.5 × (tests_passed / tests_total)   # correctness
  + 0.2 × (1 if valid syntax else 0)     # format
  + 0.2 × (1 if <think> provided else 0) # chain-of-thought bonus
  + 0.1 × (steps_remaining / max_steps)  # efficiency
  − 0.3 × (1 if timeout/crash else 0)    # penalty
```
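As a direct transcription in Python (the server's real implementation may clamp or weight differently; this is just the formula above, term for term):

```python
def reward(tests_passed, tests_total, valid_syntax, has_think,
           steps_remaining, max_steps, crashed):
    """Composite step reward, transcribed from the formula above."""
    r = 0.5 * (tests_passed / tests_total)     # correctness
    r += 0.2 * (1 if valid_syntax else 0)      # format
    r += 0.2 * (1 if has_think else 0)         # chain-of-thought bonus
    r += 0.1 * (steps_remaining / max_steps)   # efficiency
    r -= 0.3 * (1 if crashed else 0)           # timeout/crash penalty
    return r

# All tests pass on the first step, with CoT: 0.5 + 0.2 + 0.2 + 0.09 = 0.99
print(reward(5, 5, True, True, 9, 10, False))
```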
## Tasks
| ID | Difficulty | Description | Variants |
|---|---|---|---|
| `task_easy` | Easy | Single off-by-one error | 6+ |
| `task_medium` | Medium | Two independent bugs | 6+ |
| `task_hard` | Hard | 3+ subtle bugs in recursive function | 7+ |
*Total: 19 procedurally generated tasks via `task_generator.py`.*
## Setup
```bash
pip install openenv-core
pip install git+https://huggingface.co/spaces/luciferai-devil/code-debug-env
```
## Docker
```bash
docker pull luciferai-devil/code-debug-env:latest
docker run -p 8000:8000 luciferai-devil/code-debug-env
```
## Baseline Results (via OpenAI API)
Evaluated using `gpt-4o-mini` / `gpt-oss-120b` reasoning models.
| Task | Agent | Score | Notes |
|---|---|---|---|
| task_easy | LLM | 0.99 | One-shot fix with CoT |
| task_medium | LLM | 0.74 | Iterative refinement |
| task_hard | LLM | 0.59 | Struggles with deeply recursive cases |
*Average Score: 0.77*
## Training with GRPO
See `baseline/run_baseline.py` for the inference client.
Compatible with TRL's `GRPOTrainer`: pass a `reward_fn` that calls `/grader`.
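The wiring might be sketched as follows. The grader call is injected as a plain callable so the shape is testable without a running server; in practice it would wrap a GET to `/grader` (whose exact query parameters are an assumption here; check the server code for the real contract):

```python
import ast
from typing import Callable

def make_reward_fn(grade: Callable[[str, str], float], task_id: str):
    """Build a GRPO-style reward function.

    `grade(task_id, patch) -> score` would wrap a GET to /grader in practice;
    it is injected here so the wiring can be shown without a live server.
    """
    def reward_fn(prompts, completions, **kwargs):
        # TRL passes one completion per sampled rollout; grade each patch.
        return [grade(task_id, patch) for patch in completions]
    return reward_fn

# Toy stand-in grader for illustration: full credit iff the patch parses.
def toy_grade(task_id, patch):
    try:
        ast.parse(patch)
        return 1.0
    except SyntaxError:
        return 0.0

fn = make_reward_fn(toy_grade, "task_easy")
print(fn(["p", "p"], ["def f():\n    return 1", "def broken(:"]))
# → [1.0, 0.0]
```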
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start a new episode |
| `/step` | POST | Submit action, get observation |
| `/state` | GET | Get current episode state |
| `/tasks` | GET | List all available tasks |
| `/grader` | GET | Grade a submission directly |
| `/baseline` | GET | Run baseline agent on all tasks |
## Local Development
```bash
# Run server locally
uvicorn code_debug_env.server.app:app --reload --port 8000
# Build Docker
docker build -t code-debug-env -f server/Dockerfile .
# Run Docker
docker run -p 8000:8000 code-debug-env
# Smoke test
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
curl http://localhost:8000/tasks
```