---
title: Code Debug Env
emoji: 🐞
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web
---

# code-debug-env

An OpenEnv environment for training AI agents to repair buggy Python code.
The agent receives a broken function and must iteratively submit patches until
all unit tests pass.

## Quick Start

```python
from code_debug_env import CodeDebugEnv, Action

async with CodeDebugEnv(base_url="https://luciferai-devil-code-debug-env.hf.space") as env:
    obs = await env.reset(task_id="task_easy")
    print(obs.buggy_code)          # The broken function
    
    result = await env.step(Action(
        patch="def find_max_subarray_sum(nums):\n    ...",
        task_id="task_easy",
        think="The off-by-one error is in range(1, len(nums)-1)"
    ))
    print(result.observation.score)  # 0.0–1.0
```

## Action Space

| Field | Type | Required | Description |
|---|---|---|---|
| `patch` | str | Yes | Full Python source replacement for the function |
| `task_id` | str | Yes | Which task to target |
| `think` | str | No | Chain-of-thought reasoning (earns +0.2 reward bonus) |

## Observation Space

| Field | Type | Description |
|---|---|---|
| `buggy_code` | str | Current version of the code |
| `test_results` | list | Per-test pass/fail with error messages |
| `passed` / `total` | int | Tests passing out of total |
| `score` | float | Composite reward for this step (0.0–1.0) |
| `done` | bool | True when all tests pass or max_steps reached |
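
Putting the two spaces together, an episode loops until `done` is set. Below is a minimal driver sketch; the `Action` dataclass is a local stand-in mirroring the fields shown in the Quick Start (import the real one from `code_debug_env` in practice), and `propose_patch` is a hypothetical callable that produces the next candidate fix:

```python
from dataclasses import dataclass


# Local stand-in for code_debug_env's Action, matching the Quick Start fields.
@dataclass
class Action:
    patch: str
    task_id: str
    think: str = ""


async def run_episode(env, task_id, propose_patch, max_steps=10):
    """Drive one debugging episode until all tests pass or steps run out."""
    obs = await env.reset(task_id=task_id)
    for _ in range(max_steps):
        # propose_patch sees the current code and per-test failures.
        patch = propose_patch(obs.buggy_code, obs.test_results)
        result = await env.step(Action(patch=patch, task_id=task_id))
        obs = result.observation
        if obs.done:
            break
    return obs.score
```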

## Reward Function

```
r = 0.5 × (tests_passed / tests_total)   # correctness
  + 0.2 × (1 if valid syntax else 0)     # format
  + 0.2 × (1 if <think> provided else 0) # chain-of-thought bonus
  + 0.1 × (steps_remaining / max_steps)  # efficiency
  − 0.3 × (1 if timeout/crash else 0)    # penalty
```
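
As a sanity check, the formula transcribes directly into Python. This is a local re-implementation for illustration only; the server's `/grader` endpoint remains authoritative:

```python
def reward(tests_passed, tests_total, valid_syntax, has_think,
           steps_remaining, max_steps, crashed):
    """Composite per-step reward, mirroring the formula above."""
    r = 0.5 * (tests_passed / tests_total)    # correctness
    r += 0.2 * (1 if valid_syntax else 0)     # format
    r += 0.2 * (1 if has_think else 0)        # chain-of-thought bonus
    r += 0.1 * (steps_remaining / max_steps)  # efficiency
    r -= 0.3 * (1 if crashed else 0)          # timeout/crash penalty
    return r
```

Note the bounds: a one-shot fix with valid syntax and a `<think>` block scores exactly 1.0, while a crashing submission that passes nothing bottoms out at −0.3.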

## Tasks

| ID | Difficulty | Description | Variants |
|---|---|---|---|
| `task_easy` | Easy | Single off-by-one error | 6+ |
| `task_medium` | Medium | Two independent bugs | 6+ |
| `task_hard` | Hard | 3+ subtle bugs in recursive function | 7+ |

*Total: 19 procedurally generated tasks via `task_generator.py`.*

## Setup

```bash
pip install openenv-core
pip install git+https://huggingface.co/spaces/luciferai-devil/code-debug-env
```

## Docker

```bash
docker pull luciferai-devil/code-debug-env:latest
docker run -p 8000:8000 luciferai-devil/code-debug-env
```

## Baseline Results (via OpenAI API)

Evaluated with the `gpt-4o-mini` and `gpt-oss-120b` models.

| Task | Agent | Score | Notes |
|---|---|---|---|
| task_easy | LLM | 0.99 | One-shot fix with CoT |
| task_medium | LLM | 0.74 | Iterative refinement |
| task_hard | LLM | 0.59 | Struggles with deep recursion |

*Average Score: 0.77*

## Training with GRPO

See `baseline/run_baseline.py` for the inference client.
Compatible with TRL's `GRPOTrainer`: pass a `reward_fn` that calls `/grader`.
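
A sketch of that wiring, assuming TRL's convention that a reward function receives the sampled completions and returns one float per completion. The `grade` callable is injected here so the sketch stays self-contained; in practice it would POST each patch to the environment's `/grader` endpoint:

```python
def make_reward_fn(grade, task_id):
    """Build a GRPO-style reward function from a grading callable.

    `grade(patch, task_id)` should return a float in [0, 1]; in practice
    this is an HTTP call to the environment's /grader endpoint.
    """
    def reward_fn(prompts, completions, **kwargs):
        # One scalar reward per sampled completion, as GRPOTrainer expects.
        return [grade(patch, task_id) for patch in completions]
    return reward_fn
```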

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start a new episode |
| `/step` | POST | Submit action, get observation |
| `/state` | GET | Get current episode state |
| `/tasks` | GET | List all available tasks |
| `/grader` | GET | Grade a submission directly |
| `/baseline` | GET | Run baseline agent on all tasks |

## Local Development

```bash
# Run server locally
uvicorn code_debug_env.server.app:app --reload --port 8000

# Build Docker
docker build -t code-debug-env -f server/Dockerfile .

# Run Docker
docker run -p 8000:8000 code-debug-env

# Smoke test
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
curl http://localhost:8000/tasks
```