---
title: Code Debug Env
emoji: 🐞
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web
---

# code-debug-env

An OpenEnv environment for training AI agents to repair buggy Python code.
The agent receives a broken function and must iteratively submit patches until
all unit tests pass.

## Quick Start

```python
from code_debug_env import CodeDebugEnv, Action

async with CodeDebugEnv(base_url="https://luciferai-devil-code-debug-env.hf.space") as env:
    obs = await env.reset(task_id="task_easy")
    print(obs.buggy_code)          # The broken function
    
    result = await env.step(Action(
        patch="def find_max_subarray_sum(nums):\n    ...",
        task_id="task_easy",
        think="The off-by-one error is in range(1, len(nums)-1)"
    ))
    print(result.observation.score)  # 0.0–1.0
```

## Action Space

| Field | Type | Required | Description |
|---|---|---|---|
| `patch` | str | Yes | Full Python source replacement for the function |
| `task_id` | str | Yes | Which task to target |
| `think` | str | No | Chain-of-thought reasoning (earns +0.2 reward bonus) |

## Observation Space

| Field | Type | Description |
|---|---|---|
| `buggy_code` | str | Current version of the code |
| `test_results` | list | Per-test pass/fail with error messages |
| `passed` / `total` | int | Tests passing out of total |
| `score` | float | Composite reward for this step (0.0–1.0) |
| `done` | bool | True when all tests pass or max_steps reached |
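
Putting the two spaces together, an episode loops until `done` is set. Below is a minimal driver sketch; the `Action` dataclass is a local stand-in mirroring the fields shown in the Quick Start (import the real one from `code_debug_env` in practice), and `propose_patch` is a hypothetical callable that produces the next candidate fix:

```python
from dataclasses import dataclass


# Local stand-in for code_debug_env's Action, matching the Quick Start fields.
@dataclass
class Action:
    patch: str
    task_id: str
    think: str = ""


async def run_episode(env, task_id, propose_patch, max_steps=10):
    """Drive one debugging episode until all tests pass or steps run out."""
    obs = await env.reset(task_id=task_id)
    for _ in range(max_steps):
        # propose_patch sees the current code and per-test failures.
        patch = propose_patch(obs.buggy_code, obs.test_results)
        result = await env.step(Action(patch=patch, task_id=task_id))
        obs = result.observation
        if obs.done:
            break
    return obs.score
```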

## Reward Function

```
r = 0.5 × (tests_passed / tests_total)   # correctness
  + 0.2 × (1 if valid syntax else 0)     # format
  + 0.2 × (1 if <think> provided else 0) # chain-of-thought bonus
  + 0.1 × (steps_remaining / max_steps)  # efficiency
  − 0.3 × (1 if timeout/crash else 0)    # penalty
```
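
As a sanity check, the formula transcribes directly into Python. This is a local re-implementation for illustration only; the server's `/grader` endpoint remains authoritative:

```python
def reward(tests_passed, tests_total, valid_syntax, has_think,
           steps_remaining, max_steps, crashed):
    """Composite per-step reward, mirroring the formula above."""
    r = 0.5 * (tests_passed / tests_total)    # correctness
    r += 0.2 * (1 if valid_syntax else 0)     # format
    r += 0.2 * (1 if has_think else 0)        # chain-of-thought bonus
    r += 0.1 * (steps_remaining / max_steps)  # efficiency
    r -= 0.3 * (1 if crashed else 0)          # timeout/crash penalty
    return r
```

Note the bounds: a one-shot fix with valid syntax and a `<think>` block scores exactly 1.0, while a crashing submission that passes nothing bottoms out at −0.3.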

## Tasks

| ID | Difficulty | Description | Variants |
|---|---|---|---|
| `task_easy` | Easy | Single off-by-one error | 6+ |
| `task_medium` | Medium | Two independent bugs | 6+ |
| `task_hard` | Hard | 3+ subtle bugs in recursive function | 7+ |

*Total: 19 procedurally generated tasks via `task_generator.py`.*

## Setup

```bash
pip install openenv-core
pip install git+https://huggingface.co/spaces/luciferai-devil/code-debug-env
```

## Docker

```bash
docker pull luciferai-devil/code-debug-env:latest
docker run -p 8000:8000 luciferai-devil/code-debug-env
```

## Baseline Results (via OpenAI API)

Evaluated with the `gpt-4o-mini` and `gpt-oss-120b` models.

| Task | Agent | Score | Notes |
|---|---|---|---|
| task_easy | LLM | 0.99 | One-shot fix with CoT |
| task_medium | LLM | 0.74 | Iterative refinement |
| task_hard | LLM | 0.59 | Struggles with deep recursion |

*Average Score: 0.77*

## Training with GRPO

See `baseline/run_baseline.py` for the inference client.
Compatible with TRL's `GRPOTrainer`: pass a `reward_fn` that calls `/grader`.
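
A sketch of that wiring, assuming TRL's convention that a reward function receives the sampled completions and returns one float per completion. The `grade` callable is injected here so the sketch stays self-contained; in practice it would POST each patch to the environment's `/grader` endpoint:

```python
def make_reward_fn(grade, task_id):
    """Build a GRPO-style reward function from a grading callable.

    `grade(patch, task_id)` should return a float in [0, 1]; in practice
    this is an HTTP call to the environment's /grader endpoint.
    """
    def reward_fn(prompts, completions, **kwargs):
        # One scalar reward per sampled completion, as GRPOTrainer expects.
        return [grade(patch, task_id) for patch in completions]
    return reward_fn
```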

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start a new episode |
| `/step` | POST | Submit action, get observation |
| `/state` | GET | Get current episode state |
| `/tasks` | GET | List all available tasks |
| `/grader` | GET | Grade a submission directly |
| `/baseline` | GET | Run baseline agent on all tasks |

## Local Development

```bash
# Run server locally
uvicorn code_debug_env.server.app:app --reload --port 8000

# Build Docker
docker build -t code-debug-env -f server/Dockerfile .

# Run Docker
docker run -p 8000:8000 code-debug-env

# Smoke test
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
curl http://localhost:8000/tasks
```