---
title: Code Debug Env
emoji: 🐞
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web
---
# code-debug-env
An OpenEnv environment for training AI agents to repair buggy Python code.
The agent receives a broken function and must iteratively submit patches until
all unit tests pass.
## Quick Start
```python
from code_debug_env import CodeDebugEnv, Action
async with CodeDebugEnv(base_url="https://luciferai-devil-code-debug-env.hf.space") as env:
obs = await env.reset(task_id="task_easy")
print(obs.buggy_code) # The broken function
result = await env.step(Action(
patch="def find_max_subarray_sum(nums):\n ...",
task_id="task_easy",
think="The off-by-one error is in range(1, len(nums)-1)"
))
print(result.observation.score) # 0.0–1.0
```
## Action Space
| Field | Type | Required | Description |
|---|---|---|---|
| `patch` | str | Yes | Full Python source replacement for the function |
| `task_id` | str | Yes | Which task to target |
| `think` | str | No | Chain-of-thought reasoning (earns +0.2 reward bonus) |
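On the wire, an action is just these three fields. A small illustrative helper (hypothetical; the real `Action` dataclass lives in `code_debug_env`) shows how a payload follows from the table:

```python
def make_action(patch, task_id, think=None):
    """Build a step payload from the action fields above.

    Illustrative sketch only -- the environment ships its own
    Action dataclass; field names here follow the table.
    """
    if not patch or not task_id:
        raise ValueError("patch and task_id are required")
    action = {"patch": patch, "task_id": task_id}
    if think is not None:
        action["think"] = think  # optional reasoning, earns +0.2 bonus
    return action
```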
## Observation Space
| Field | Type | Description |
|---|---|---|
| `buggy_code` | str | Current version of the code |
| `test_results` | list | Per-test pass/fail with error messages |
| `passed` / `total` | int | Tests passing out of total |
| `score` | float | Composite reward for this step (0.0–1.0) |
| `done` | bool | True when all tests pass or max_steps reached |
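A typical episode polls `done` and feeds `test_results` back into the next patch. A minimal driver sketch, assuming a synchronous client whose observations are dicts shaped like the table above (the real client is async, as in Quick Start):

```python
def run_episode(env, propose_patch, max_steps=5):
    """Loop until all tests pass or the step budget is exhausted.

    `env` is assumed to expose synchronous reset()/step() returning
    dicts with the observation fields listed above; `propose_patch`
    maps (buggy_code, test_results) to a new patch string.
    """
    obs = env.reset()
    steps = 0
    while not obs["done"] and steps < max_steps:
        patch = propose_patch(obs["buggy_code"], obs["test_results"])
        obs = env.step(patch)
        steps += 1
    return obs
```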
## Reward Function
```
r = 0.5 × (tests_passed / tests_total)    # correctness
  + 0.2 × (1 if valid syntax else 0)      # format
  + 0.2 × (1 if <think> provided else 0)  # chain-of-thought bonus
  + 0.1 × (steps_remaining / max_steps)   # efficiency
  − 0.3 × (1 if timeout/crash else 0)     # penalty
```
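The formula can be sketched as a plain Python function (illustrative only; the server's grader is the source of truth and may clamp or weight terms differently):

```python
def compute_reward(tests_passed, tests_total, valid_syntax,
                   think_provided, steps_remaining, max_steps,
                   crashed=False):
    """Composite step reward mirroring the formula above."""
    r = 0.5 * (tests_passed / tests_total)    # correctness
    r += 0.2 * (1 if valid_syntax else 0)     # format
    r += 0.2 * (1 if think_provided else 0)   # chain-of-thought bonus
    r += 0.1 * (steps_remaining / max_steps)  # efficiency
    r -= 0.3 * (1 if crashed else 0)          # penalty
    return r

# A one-shot fix (5/5 tests, valid syntax, CoT, 9/10 steps left):
# 0.5 + 0.2 + 0.2 + 0.09 = 0.99
```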
## Tasks
| ID | Difficulty | Description | Variants |
|---|---|---|---|
| `task_easy` | Easy | Single off-by-one error | 6+ |
| `task_medium` | Medium | Two independent bugs | 6+ |
| `task_hard` | Hard | 3+ subtle bugs in recursive function | 7+ |
*Total: 19 procedurally generated tasks via `task_generator.py`.*
## Setup
```bash
pip install openenv-core
pip install git+https://huggingface.co/spaces/luciferai-devil/code-debug-env
```
## Docker
```bash
docker pull luciferai-devil/code-debug-env:latest
docker run -p 8000:8000 luciferai-devil/code-debug-env
```
## Baseline Results (via OpenAI API)
Evaluated with the `gpt-4o-mini` and `gpt-oss-120b` reasoning models.
| Task | Agent | Score | Notes |
|---|---|---|---|
| task_easy | LLM | 0.99 | One-shot fix with CoT |
| task_medium | LLM | 0.74 | Iterative refinement |
| task_hard | LLM | 0.59 | Struggles with complex recursion depth |
*Average Score: 0.77*
## Training with GRPO
See `baseline/run_baseline.py` for the inference client.
Compatible with TRL's `GRPOTrainer`: pass a `reward_fn` that calls `/grader`.
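A reward function adapter might look like the sketch below (stdlib only; the `/grader` query-parameter names `patch` and `task_id` are assumptions here, so check the server's actual schema before use):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_grader_url(base_url, patch, task_id):
    """Compose a GET /grader request URL.

    Query-parameter names are assumptions; verify against the server.
    """
    return f"{base_url}/grader?" + urlencode({"patch": patch, "task_id": task_id})

def grader_reward_fn(completions, base_url, task_id="task_easy"):
    """TRL-style reward_fn sketch: score each completion via /grader."""
    rewards = []
    for patch in completions:
        with urlopen(build_grader_url(base_url, patch, task_id)) as resp:
            rewards.append(json.load(resp).get("score", 0.0))
    return rewards
```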
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start a new episode |
| `/step` | POST | Submit action, get observation |
| `/state` | GET | Get current episode state |
| `/tasks` | GET | List all available tasks |
| `/grader` | GET | Grade a submission directly |
| `/baseline` | GET | Run baseline agent on all tasks |
## Local Development
```bash
# Run server locally
uvicorn code_debug_env.server.app:app --reload --port 8000
# Build Docker
docker build -t code-debug-env -f server/Dockerfile .
# Run Docker
docker run -p 8000:8000 code-debug-env
# Smoke test
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
curl http://localhost:8000/tasks
```