---
title: Code Debug Env
emoji: 🐞
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web
---

# code-debug-env

An OpenEnv environment for training AI agents to repair buggy Python code. The agent receives a broken function and must iteratively submit patches until all unit tests pass.

## Quick Start

```python
import asyncio

from code_debug_env import CodeDebugEnv, Action

async def main():
    async with CodeDebugEnv(base_url="https://luciferai-devil-code-debug-env.hf.space") as env:
        obs = await env.reset(task_id="task_easy")
        print(obs.buggy_code)  # the broken function

        result = await env.step(Action(
            patch="def find_max_subarray_sum(nums):\n    ...",
            task_id="task_easy",
            think="The off-by-one error is in range(1, len(nums)-1)",
        ))
        print(result.observation.score)  # 0.0-1.0

asyncio.run(main())
```

## Action Space

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `patch` | `str` | Yes | Full Python source replacement for the function |
| `task_id` | `str` | Yes | Which task to target |
| `think` | `str` | No | Chain-of-thought reasoning (earns a +0.2 reward bonus) |

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `buggy_code` | `str` | Current version of the code |
| `test_results` | `list` | Per-test pass/fail with error messages |
| `passed` / `total` | `int` | Tests passing out of the total |
| `score` | `float` | Composite reward for this step (0.0–1.0) |
| `done` | `bool` | `True` when all tests pass or `max_steps` is reached |
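
For multi-step episodes, here is a minimal repair-loop sketch built on these fields; `propose_patch` is a hypothetical stand-in for your agent's policy (e.g. an LLM call):

```python
import asyncio

from code_debug_env import CodeDebugEnv, Action

async def repair(task_id: str, max_attempts: int = 5) -> float:
    async with CodeDebugEnv(base_url="https://luciferai-devil-code-debug-env.hf.space") as env:
        obs = await env.reset(task_id=task_id)
        score = 0.0
        for _ in range(max_attempts):
            # propose_patch is hypothetical: feed the broken code and the
            # failing tests to your model, get back a full-source patch.
            patch = propose_patch(obs.buggy_code, obs.test_results)
            result = await env.step(Action(patch=patch, task_id=task_id))
            obs = result.observation
            score = obs.score
            if obs.done:  # all tests pass or max_steps reached
                break
        return score
```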

## Reward Function

```text
r = 0.5 × (tests_passed / tests_total)   # correctness
  + 0.2 × (1 if valid syntax else 0)     # format
  + 0.2 × (1 if <think> provided else 0) # chain-of-thought bonus
  + 0.1 × (steps_remaining / max_steps)  # efficiency
  − 0.3 × (1 if timeout/crash else 0)    # penalty
```
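
A reference sketch of the same formula in Python (not the server's implementation; the final clamp is an assumption made to match the documented 0.0–1.0 score range):

```python
def compute_reward(passed: int, total: int, valid_syntax: bool,
                   has_think: bool, steps_remaining: int,
                   max_steps: int, crashed: bool) -> float:
    r = 0.5 * (passed / total)                 # correctness
    r += 0.2 * (1.0 if valid_syntax else 0.0)  # format
    r += 0.2 * (1.0 if has_think else 0.0)     # chain-of-thought bonus
    r += 0.1 * (steps_remaining / max_steps)   # efficiency
    r -= 0.3 * (1.0 if crashed else 0.0)       # timeout/crash penalty
    # Clamp assumed, to keep the score in the documented 0.0-1.0 range.
    return max(0.0, min(1.0, r))
```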

## Tasks

| ID | Difficulty | Description | Variants |
|----|------------|-------------|----------|
| `task_easy` | Easy | Single off-by-one error | 6+ |
| `task_medium` | Medium | Two independent bugs | 6+ |
| `task_hard` | Hard | 3+ subtle bugs in a recursive function | 7+ |

Total: 19 procedurally generated tasks via `task_generator.py`.

## Setup

```bash
pip install openenv-core
pip install git+https://huggingface.co/spaces/luciferai-devil/code-debug-env
```

### Docker

```bash
docker pull luciferai-devil/code-debug-env:latest
docker run -p 8000:8000 luciferai-devil/code-debug-env
```

## Baseline Results (via OpenAI API)

Evaluated using the gpt-4o-mini / gpt-oss-120b reasoning models.

| Task | Agent | Score | Notes |
|------|-------|-------|-------|
| `task_easy` | LLM | 0.99 | One-shot fix with CoT |
| `task_medium` | LLM | 0.74 | Iterative refinement |
| `task_hard` | LLM | 0.59 | Struggles with complex recursion depth |

**Average Score: 0.77**

## Training with GRPO

See `baseline/run_baseline.py` for the inference client. Compatible with TRL's `GRPOTrainer`: pass a `reward_fn` that calls `/grader`, as sketched below.
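
A minimal sketch of a TRL-compatible reward function that grades each completion via `/grader`. The query parameter names (`patch`, `task_id`) and the `score` response field are assumptions; check the server code for the exact schema:

```python
import requests

BASE_URL = "http://localhost:8000"  # or the hosted Space URL

def reward_fn(completions, **kwargs):
    """Grade each completion by asking the environment's grader."""
    # If your dataset carries a task_id column, TRL forwards it here;
    # otherwise fall back to a fixed task. (Assumed setup.)
    task_ids = kwargs.get("task_id") or ["task_easy"] * len(completions)
    scores = []
    for patch, task_id in zip(completions, task_ids):
        resp = requests.get(
            f"{BASE_URL}/grader",
            params={"patch": patch, "task_id": task_id},  # assumed query schema
            timeout=30,
        )
        scores.append(float(resp.json().get("score", 0.0)))
    return scores
```

Wire it into training by passing it to `GRPOTrainer` via its `reward_funcs` argument.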

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Start a new episode |
| `/step` | POST | Submit an action, get an observation |
| `/state` | GET | Get current episode state |
| `/tasks` | GET | List all available tasks |
| `/grader` | GET | Grade a submission directly |
| `/baseline` | GET | Run the baseline agent on all tasks |
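
The HTTP API can also be called directly from any client. Here is a sketch using `requests`, with JSON field names taken from the Action Space table above (the exact request/response schemas may differ; check the server code):

```python
import requests

base = "http://localhost:8000"

# Start an episode on the easy task.
obs = requests.post(f"{base}/reset", json={"task_id": "task_easy"}).json()

# Submit a patch and read back the graded observation.
result = requests.post(f"{base}/step", json={
    "patch": "def find_max_subarray_sum(nums):\n    ...",
    "task_id": "task_easy",
    "think": "Fix the off-by-one in the loop bounds.",
}).json()
print(result)
```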

## Local Development

```bash
# Run the server locally
uvicorn code_debug_env.server.app:app --reload --port 8000

# Build the Docker image
docker build -t code-debug-env -f server/Dockerfile .

# Run the container
docker run -p 8000:8000 code-debug-env

# Smoke test
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
curl http://localhost:8000/tasks
```