---
title: Code Debug Env
emoji: 🐞
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web
---

# code-debug-env

An OpenEnv environment for training AI agents to repair buggy Python code. The agent receives a broken function and must iteratively submit patches until all unit tests pass.

## Quick Start

```python
import asyncio

from code_debug_env import CodeDebugEnv, Action

async def main():
    async with CodeDebugEnv(base_url="https://luciferai-devil-code-debug-env.hf.space") as env:
        obs = await env.reset(task_id="task_easy")
        print(obs.buggy_code)  # the broken function

        result = await env.step(Action(
            patch="def find_max_subarray_sum(nums):\n    ...",
            task_id="task_easy",
            think="The off-by-one error is in range(1, len(nums)-1)",
        ))
        print(result.observation.score)  # 0.0-1.0

asyncio.run(main())
```

## Action Space

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `patch` | `str` | Yes | Full Python source replacement for the function |
| `task_id` | `str` | Yes | Which task to target |
| `think` | `str` | No | Chain-of-thought reasoning (earns a +0.2 reward bonus) |

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `buggy_code` | `str` | Current version of the code |
| `test_results` | `list` | Per-test pass/fail with error messages |
| `passed` / `total` | `int` | Tests passing out of the total |
| `score` | `float` | Composite reward for this step (0.0–1.0) |
| `done` | `bool` | `True` when all tests pass or `max_steps` is reached |
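
For multi-step episodes, here is a minimal repair-loop sketch built on these fields; `propose_patch` is a hypothetical stand-in for your agent's policy (e.g. an LLM call):

```python
import asyncio

from code_debug_env import CodeDebugEnv, Action

async def repair(task_id: str, max_attempts: int = 5) -> float:
    async with CodeDebugEnv(base_url="https://luciferai-devil-code-debug-env.hf.space") as env:
        obs = await env.reset(task_id=task_id)
        score = 0.0
        for _ in range(max_attempts):
            # propose_patch is hypothetical: feed the broken code and the
            # failing tests to your model, get back a full-source patch.
            patch = propose_patch(obs.buggy_code, obs.test_results)
            result = await env.step(Action(patch=patch, task_id=task_id))
            obs = result.observation
            score = obs.score
            if obs.done:  # all tests pass or max_steps reached
                break
        return score
```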

## Reward Function

```text
r = 0.5 × (tests_passed / tests_total)   # correctness
  + 0.2 × (1 if valid syntax else 0)     # format
  + 0.2 × (1 if <think> provided else 0) # chain-of-thought bonus
  + 0.1 × (steps_remaining / max_steps)  # efficiency
  − 0.3 × (1 if timeout/crash else 0)    # penalty
```
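
A reference sketch of the same formula in Python (not the server's implementation; the final clamp is an assumption made to match the documented 0.0–1.0 score range):

```python
def compute_reward(passed: int, total: int, valid_syntax: bool,
                   has_think: bool, steps_remaining: int,
                   max_steps: int, crashed: bool) -> float:
    r = 0.5 * (passed / total)                 # correctness
    r += 0.2 * (1.0 if valid_syntax else 0.0)  # format
    r += 0.2 * (1.0 if has_think else 0.0)     # chain-of-thought bonus
    r += 0.1 * (steps_remaining / max_steps)   # efficiency
    r -= 0.3 * (1.0 if crashed else 0.0)       # timeout/crash penalty
    # Clamp assumed, to keep the score in the documented 0.0-1.0 range.
    return max(0.0, min(1.0, r))
```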

## Tasks

| ID | Difficulty | Description | Variants |
|----|------------|-------------|----------|
| `task_easy` | Easy | Single off-by-one error | 6+ |
| `task_medium` | Medium | Two independent bugs | 6+ |
| `task_hard` | Hard | 3+ subtle bugs in a recursive function | 7+ |

Total: 19 procedurally generated tasks via `task_generator.py`.

## Setup

```bash
pip install openenv-core
pip install git+https://huggingface.co/spaces/luciferai-devil/code-debug-env
```

### Docker

```bash
docker pull luciferai-devil/code-debug-env:latest
docker run -p 8000:8000 luciferai-devil/code-debug-env
```

## Baseline Results (via OpenAI API)

Evaluated using the gpt-4o-mini / gpt-oss-120b reasoning models.

| Task | Agent | Score | Notes |
|------|-------|-------|-------|
| `task_easy` | LLM | 0.99 | One-shot fix with CoT |
| `task_medium` | LLM | 0.74 | Iterative refinement |
| `task_hard` | LLM | 0.59 | Struggles with complex recursion depth |

**Average Score: 0.77**

## Training with GRPO

See `baseline/run_baseline.py` for the inference client. Compatible with TRL's `GRPOTrainer`: pass a `reward_fn` that calls `/grader`, as sketched below.
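
A minimal sketch of a TRL-compatible reward function that grades each completion via `/grader`. The query parameter names (`patch`, `task_id`) and the `score` response field are assumptions; check the server code for the exact schema:

```python
import requests

BASE_URL = "http://localhost:8000"  # or the hosted Space URL

def reward_fn(completions, **kwargs):
    """Grade each completion by asking the environment's grader."""
    # If your dataset carries a task_id column, TRL forwards it here;
    # otherwise fall back to a fixed task. (Assumed setup.)
    task_ids = kwargs.get("task_id") or ["task_easy"] * len(completions)
    scores = []
    for patch, task_id in zip(completions, task_ids):
        resp = requests.get(
            f"{BASE_URL}/grader",
            params={"patch": patch, "task_id": task_id},  # assumed query schema
            timeout=30,
        )
        scores.append(float(resp.json().get("score", 0.0)))
    return scores
```

Wire it into training by passing it to `GRPOTrainer` via its `reward_funcs` argument.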

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Start a new episode |
| `/step` | POST | Submit an action, get an observation |
| `/state` | GET | Get current episode state |
| `/tasks` | GET | List all available tasks |
| `/grader` | GET | Grade a submission directly |
| `/baseline` | GET | Run the baseline agent on all tasks |
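
The HTTP API can also be called directly from any client. Here is a sketch using `requests`, with JSON field names taken from the Action Space table above (the exact request/response schemas may differ; check the server code):

```python
import requests

base = "http://localhost:8000"

# Start an episode on the easy task.
obs = requests.post(f"{base}/reset", json={"task_id": "task_easy"}).json()

# Submit a patch and read back the graded observation.
result = requests.post(f"{base}/step", json={
    "patch": "def find_max_subarray_sum(nums):\n    ...",
    "task_id": "task_easy",
    "think": "Fix the off-by-one in the loop bounds.",
}).json()
print(result)
```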

## Local Development

```bash
# Run the server locally
uvicorn code_debug_env.server.app:app --reload --port 8000

# Build the Docker image
docker build -t code-debug-env -f server/Dockerfile .

# Run the container
docker run -p 8000:8000 code-debug-env

# Smoke test
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
curl http://localhost:8000/tasks
```