title: code-debug-env
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false

Code Debug Environment

An OpenEnv-compatible RL environment where an LLM agent diagnoses and fixes buggy Python code across three difficulty levels.


Overview

| Property | Value |
|---|---|
| Domain | Real-world Python code debugging |
| Tasks | 45 total (15 easy + 15 medium + 15 hard) |
| Difficulties | easy → medium → hard |
| Reward Range | 0.0 – 1.0 (partial, proportional) |
| Max Steps/Episode | 3 |
| API | OpenEnv standard: /reset, /step, /state |

Environment Description

The agent receives a buggy Python function and must fix it. Tasks come from real-world domains: data processing, string algorithms, API validation, sorting, dynamic programming, and graph algorithms.

  • Easy: One bug (wrong operator, off-by-one, incorrect return). Reward proportional to test pass rate.
  • Medium: Two bugs (logic bug + edge case). Reward proportional to test pass rate.
  • Hard: One algorithmic bug; the agent must also explain what was wrong. Reward = 0.7 × test score + 0.3 × explanation quality.

Action Space

{
  "fixed_code": "string - the corrected Python function (required)",
  "explanation": "string - explanation of what was wrong (required for hard tasks)"
}

| Field | Type | Required | Description |
|---|---|---|---|
| fixed_code | str | Always | Complete corrected Python function as a string |
| explanation | str | Hard tasks | Describe the bug and why your fix is correct |
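
As an illustration of the rules above, here is a minimal Python sketch that builds an action payload and enforces the "required" column on the client side. The helper name `build_action` and its `difficulty` parameter are assumptions made for this example; they are not taken from the repository's `models.py` or `client.py`.

```python
import json
from typing import Optional


def build_action(
    fixed_code: str,
    explanation: Optional[str] = None,
    difficulty: str = "easy",
) -> dict:
    """Build an action dict (hypothetical helper, not part of the repo)."""
    # fixed_code is always required.
    if not fixed_code.strip():
        raise ValueError("fixed_code is required")
    # explanation is required only for hard tasks.
    if difficulty == "hard" and not (explanation and explanation.strip()):
        raise ValueError("explanation is required for hard tasks")
    action = {"fixed_code": fixed_code}
    if explanation:
        action["explanation"] = explanation
    return action


# Serialize the action as the JSON body of a /step request.
body = json.dumps(build_action("def find_max(nums):\n    return max(nums)"))
print(body)
```

Validating on the client avoids spending one of the three allowed steps on a malformed submission.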

Observation Space

Returned by /reset and /step:

{
  "task_id": "easy_003",
  "difficulty": "easy",
  "buggy_code": "def find_max(nums):\n    ...",
  "instructions": "The function has exactly one bug. Fix it.",
  "test_cases_description": "Finds max value in a list without IndexError",
  "reward": 0.67,
  "passed_tests": 2,
  "total_tests": 3,
  "feedback": "Test 1: ✅ ...\nTest 2: ✅ ...\nTest 3: ❌ ...",
  "done": false
}

| Field | Type | Description |
|---|---|---|
| task_id | str | Unique task identifier |
| difficulty | str | easy / medium / hard |
| buggy_code | str | Buggy Python function to fix |
| instructions | str | Task instructions |
| test_cases_description | str | What the test cases check |
| reward | float or null | Score from last step (null on reset) |
| passed_tests | int or null | Tests passed (null on reset) |
| total_tests | int | Total number of test cases |
| feedback | str or null | Detailed per-test feedback |
| done | bool | True when episode is complete |
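
The fields above map naturally onto a typed record. The sketch below parses an observation from JSON into a dataclass, treating the nullable fields as `Optional`. The `Observation` class and `parse_observation` function here are illustrative assumptions and may differ from the Pydantic models the repository defines in `models.py`.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Observation:
    """Illustrative mirror of the observation fields (not the repo's model)."""
    task_id: str
    difficulty: str
    buggy_code: str
    instructions: str
    test_cases_description: str
    total_tests: int
    done: bool
    # These are null right after /reset, before any step has been scored.
    reward: Optional[float] = None
    passed_tests: Optional[int] = None
    feedback: Optional[str] = None


def parse_observation(raw: str) -> Observation:
    """Parse a JSON observation, ignoring any keys the dataclass lacks."""
    data = json.loads(raw)
    known = {f.name for f in fields(Observation)}
    return Observation(**{k: v for k, v in data.items() if k in known})
```

A caller can then branch on `obs.reward is None` to tell a fresh reset apart from a scored step.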

Reward Function

Easy & Medium

reward = passed_tests / total_tests
  • 3/3 tests → 1.0
  • 2/3 tests → 0.67
  • 1/3 tests → 0.33
  • 0/3 tests → 0.0
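
To make the pass-rate formula concrete, the following sketch runs a submitted function against a list of test cases and computes the proportional reward. It is a simplified stand-in for whatever the graders in `server/graders/` actually do: the function names, the `(args, expected)` test-case shape, and the unsandboxed use of `exec` are all assumptions for illustration.

```python
from typing import Any


def run_tests(fixed_code: str, func_name: str, cases: list) -> tuple:
    """Return (passed, total) for a list of (args, expected) cases."""
    namespace: dict = {}
    try:
        # NOTE: exec without sandboxing is for illustration only.
        exec(fixed_code, namespace)
    except Exception:
        return 0, len(cases)
    func: Any = namespace.get(func_name)
    if not callable(func):
        return 0, len(cases)
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a test that raises counts as a failure
    return passed, len(cases)


def proportional_reward(passed: int, total: int) -> float:
    """reward = passed_tests / total_tests, guarding against total == 0."""
    return passed / total if total else 0.0


cases = [(([1, 2, 3],), 3), (([-5, -2],), -2), (([7],), 7)]
good = "def find_max(nums):\n    return max(nums)"
print(proportional_reward(*run_tests(good, "find_max", cases)))  # 1.0
```

A submission that fails to compile or does not define the expected function scores 0.0 in this sketch, the same as one that fails every test.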

Hard

reward = 0.7 × test_score + 0.3 × explanation_score

The explanation is scored by matching it against key algorithmic concepts; partial credit is given.
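
One plausible reading of this scoring, sketched below, is that the explanation score is the fraction of expected concept keywords that appear in the explanation text. The actual matching logic in `grader_hard.py` is not shown in this README, so the keyword list, the case-insensitive substring match, and both function names are assumptions.

```python
def explanation_score(explanation: str, concepts: list) -> float:
    """Fraction of concept keywords found in the explanation (assumed scheme)."""
    if not concepts:
        return 0.0
    text = explanation.lower()
    hits = sum(1 for concept in concepts if concept.lower() in text)
    return hits / len(concepts)


def hard_reward(test_score: float, explanation: str, concepts: list) -> float:
    """reward = 0.7 * test_score + 0.3 * explanation_score."""
    return 0.7 * test_score + 0.3 * explanation_score(explanation, concepts)


# Hypothetical concept list for a graph-traversal task.
concepts = ["visited", "cycle"]
explanation = "The DFS never marked nodes as visited."
print(explanation_score(explanation, concepts))  # 0.5
```

Under this scheme, a fix that passes every test but mentions only one of two concepts would score 0.7 × 1.0 + 0.3 × 0.5 = 0.85.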


Setup & Local Run

Prerequisites

  • Python 3.10+
  • Docker
  • Hugging Face CLI

Install

git clone https://github.com/YOUR_USERNAME/code-debug-env
cd code-debug-env
pip install -e .
# Also clone OpenEnv for PYTHONPATH
git clone https://github.com/meta-pytorch/OpenEnv.git
export PYTHONPATH=$PYTHONPATH:OpenEnv:OpenEnv/src:.

Run locally

uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload

Run with Docker

docker build -f server/Dockerfile -t code-debug-env .
docker run -p 7860:7860 code-debug-env

Test the API

# Health check
curl http://localhost:7860/health

# Reset (easy task)
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "easy"}'

# Submit a fix
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"fixed_code": "def find_max(nums):\n    return max(nums)"}'

# Check state
curl http://localhost:7860/state
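
The same calls can be made from Python. The sketch below uses only the standard library and wraps the reset/step sequence in an episode loop capped at the 3 steps per episode stated in the overview. The names `post_json` and `run_episode`, and the `propose_fix` callback that maps an observation to an action, are hypothetical; the repository's own `client.py` may expose a different interface.

```python
import json
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:7860"  # assumes the local server from above
MAX_STEPS = 3  # max steps per episode, per the overview


def post_json(path: str, payload: dict, base_url: str = BASE_URL) -> dict:
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(
    propose_fix: Callable[[dict], dict],
    difficulty: str = "easy",
    post: Callable[[str, dict], dict] = post_json,
) -> dict:
    """Reset, then submit fixes until done or MAX_STEPS is reached."""
    obs = post("/reset", {"difficulty": difficulty})
    for _ in range(MAX_STEPS):
        if obs.get("done"):
            break
        obs = post("/step", propose_fix(obs))
    return obs
```

With the server running, `run_episode(lambda obs: {"fixed_code": obs["buggy_code"]})` would resubmit the buggy code unchanged, which gives a quick baseline for the reward an unmodified function earns. The `post` parameter exists so the loop can be exercised with a stub instead of a live server.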

Run Baseline Inference

export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="your-api-key"

# Run all 3 difficulties
python inference.py --url http://localhost:7860

# Run specific difficulty
python inference.py --url http://localhost:7860 --difficulty hard

Pre-Submission Validation

Run before submitting to catch any disqualifying issues:

# Start the environment first, then:
python validator/pre_submit_check.py --url http://localhost:7860

# Or against your HF Space:
python validator/pre_submit_check.py --url https://YOUR_SPACE.hf.space

Deploy to Hugging Face Spaces

# Login
huggingface-cli login

# Create space and push
huggingface-cli repo create code-debug-env --type space --space_sdk docker
cd code-debug-env
git init
git remote add origin https://huggingface.co/spaces/YOUR_USERNAME/code-debug-env
git add .
git commit -m "Initial commit"
git push origin main

Project Structure

code-debug-env/
├── openenv.yaml          ← OpenEnv manifest
├── inference.py          ← Baseline agent (root, required)
├── pyproject.toml        ← Dependencies
├── README.md
├── models.py             ← Pydantic Action/Observation/State
├── client.py             ← EnvClient for training loops
├── __init__.py
├── server/
│   ├── app.py            ← FastAPI: /reset /step /state /health
│   ├── environment.py    ← Core episode logic
│   ├── tasks/
│   │   ├── task_easy.py  ← 15 single-bug tasks
│   │   ├── task_medium.py ← 15 two-bug tasks
│   │   └── task_hard.py  ← 15 algorithmic tasks
│   ├── graders/
│   │   ├── grader_easy.py
│   │   ├── grader_medium.py
│   │   └── grader_hard.py
│   ├── requirements.txt
│   └── Dockerfile
└── validator/
    └── pre_submit_check.py