---
title: code-debug-env
emoji: "🧪"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---

# Code Debug Environment

An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compatible RL environment where an LLM agent diagnoses and fixes buggy Python code across three difficulty levels.

---

## Overview

| Property | Value |
|---|---|
| Domain | Real-world Python code debugging |
| Tasks | 45 total (15 easy + 15 medium + 15 hard) |
| Difficulties | easy → medium → hard |
| Reward Range | 0.0 – 1.0 (partial, proportional) |
| Max Steps/Episode | 3 |
| API | OpenEnv standard: `/reset`, `/step`, `/state` |

---

## Environment Description

The agent receives a buggy Python function and must fix it. Tasks come from real-world domains: data processing, string algorithms, API validation, sorting, dynamic programming, and graph algorithms.

- **Easy**: One bug (wrong operator, off-by-one, incorrect return). Reward proportional to test pass rate.
- **Medium**: Two bugs (logic bug + edge case). Reward proportional to test pass rate.
- **Hard**: One algorithmic bug, and the agent must also explain what was wrong. Reward = 0.7 × test score + 0.3 × explanation quality.

---

## Action Space

```json
{
  "fixed_code": "string: the corrected Python function (required)",
  "explanation": "string: explanation of what was wrong (required for hard tasks)"
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `fixed_code` | `str` | Always | Complete corrected Python function as a string |
| `explanation` | `str` | Hard tasks | Describe the bug and why your fix is correct |
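
As a minimal sketch of the action format above, the helper below builds a JSON request body and enforces the two "Required" rules from the table. `build_action` is a hypothetical name used for illustration and is not part of this repository; the client-side check on `difficulty` is likewise an assumption about where validation could happen.

```python
import json
from typing import Optional


def build_action(fixed_code: str,
                 explanation: Optional[str] = None,
                 difficulty: str = "easy") -> str:
    """Serialize an action to a JSON string (hypothetical helper)."""
    if not fixed_code.strip():
        raise ValueError("fixed_code is required")
    # The table marks `explanation` as required for hard tasks only.
    if difficulty == "hard" and not explanation:
        raise ValueError("explanation is required for hard tasks")
    payload = {"fixed_code": fixed_code}
    if explanation:
        payload["explanation"] = explanation
    return json.dumps(payload)


body = build_action("def find_max(nums):\n    return max(nums)")
```

The resulting string can be sent as the body of a `POST /step` request.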

---

## Observation Space

Returned by `/reset` and `/step`:

```json
{
  "task_id": "easy_003",
  "difficulty": "easy",
  "buggy_code": "def find_max(nums):\n    ...",
  "instructions": "The function has exactly one bug. Fix it.",
  "test_cases_description": "Finds max value in a list without IndexError",
  "reward": 0.67,
  "passed_tests": 2,
  "total_tests": 3,
  "feedback": "Test 1: ✅ ...\nTest 2: ✅ ...\nTest 3: ❌ ...",
  "done": false
}
```

| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Unique task identifier |
| `difficulty` | `str` | `easy` / `medium` / `hard` |
| `buggy_code` | `str` | Buggy Python function to fix |
| `instructions` | `str` | Task instructions |
| `test_cases_description` | `str` | What the test cases check |
| `reward` | `float\|null` | Score from last step (null on reset) |
| `passed_tests` | `int\|null` | Tests passed (null on reset) |
| `total_tests` | `int` | Total number of test cases |
| `feedback` | `str\|null` | Detailed per-test feedback |
| `done` | `bool` | True when episode is complete |
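
The field table above maps naturally onto a typed record. The sketch below uses a standard-library dataclass as a stand-in; the repository's own `models.py` uses Pydantic, so the `Observation` class and `parse_observation` function here are illustrative assumptions, not the project's actual definitions.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Observation:
    """Illustrative stand-in for the observation schema in the table."""
    task_id: str
    difficulty: str
    buggy_code: str
    instructions: str
    test_cases_description: str
    total_tests: int
    done: bool
    reward: Optional[float] = None       # null on reset
    passed_tests: Optional[int] = None   # null on reset
    feedback: Optional[str] = None


def parse_observation(raw: str) -> Observation:
    """Parse a JSON observation, ignoring any keys not listed above."""
    data = json.loads(raw)
    known = {f.name for f in fields(Observation)}
    return Observation(**{k: v for k, v in data.items() if k in known})


# Example: an observation as it might look right after /reset.
raw = json.dumps({
    "task_id": "easy_003",
    "difficulty": "easy",
    "buggy_code": "def find_max(nums):\n    ...",
    "instructions": "The function has exactly one bug. Fix it.",
    "test_cases_description": "Finds max value in a list without IndexError",
    "reward": None,
    "passed_tests": None,
    "total_tests": 3,
    "feedback": None,
    "done": False,
})
obs = parse_observation(raw)
```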

---

## Reward Function

### Easy & Medium
```
reward = passed_tests / total_tests
```
- 3/3 tests → 1.0
- 2/3 tests → 0.67
- 1/3 tests → 0.33
- 0/3 tests → 0.0

### Hard
```
reward = 0.7 × test_score + 0.3 × explanation_score
```
Explanation is scored by matching key algorithmic concepts. Partial credit is given.
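
The two formulas above can be sketched in a few lines of Python. The function names are hypothetical, and `keyword_explanation_score` is only one plausible reading of "matching key algorithmic concepts" (case-insensitive substring matching); the actual graders in `server/graders/` may score explanations differently.

```python
from typing import Sequence


def pass_rate(passed_tests: int, total_tests: int) -> float:
    """Easy and medium reward: fraction of test cases passed."""
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    return passed_tests / total_tests


def keyword_explanation_score(explanation: str,
                              concepts: Sequence[str]) -> float:
    """Assumed scoring: fraction of expected concepts mentioned."""
    if not concepts:
        return 0.0
    text = explanation.lower()
    hits = sum(1 for concept in concepts if concept.lower() in text)
    return hits / len(concepts)


def hard_reward(test_score: float, explanation_score: float) -> float:
    """Hard reward: 0.7 x test score + 0.3 x explanation score."""
    return 0.7 * test_score + 0.3 * explanation_score


# Two of three hypothetical concepts are mentioned, all tests pass.
explanation = "The loop used the wrong base case for memoization."
score = keyword_explanation_score(
    explanation, ["base case", "memoization", "overflow"])
reward = hard_reward(pass_rate(3, 3), score)
```

Under these assumptions, partial credit falls out of the proportional explanation score: mentioning two of three expected concepts with all tests passing gives 0.7 + 0.3 × 2/3.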

---

## Setup & Local Run

### Prerequisites
- Python 3.10+
- Docker
- Hugging Face CLI

### Install
```bash
git clone https://github.com/YOUR_USERNAME/code-debug-env
cd code-debug-env
pip install -e .
# Also clone OpenEnv for PYTHONPATH
git clone https://github.com/meta-pytorch/OpenEnv.git
export PYTHONPATH=$PYTHONPATH:OpenEnv:OpenEnv/src:.
```

### Run locally
```bash
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
```

### Run with Docker
```bash
docker build -f server/Dockerfile -t code-debug-env .
docker run -p 7860:7860 code-debug-env
```

### Test the API
```bash
# Health check
curl http://localhost:7860/health

# Reset (easy task)
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "easy"}'

# Submit a fix
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"fixed_code": "def find_max(nums):\n    return max(nums)"}'

# Check state
curl http://localhost:7860/state
```

---

## Run Baseline Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="your-api-key"

# Run all 3 difficulties
python inference.py --url http://localhost:7860

# Run specific difficulty
python inference.py --url http://localhost:7860 --difficulty hard
```

---

## Pre-Submission Validation

Run before submitting to catch any disqualifying issues:

```bash
# Start the environment first, then:
python validator/pre_submit_check.py --url http://localhost:7860

# Or against your HF Space:
python validator/pre_submit_check.py --url https://YOUR_SPACE.hf.space
```

---

## Deploy to Hugging Face Spaces

```bash
# Login
huggingface-cli login

# Create space and push
huggingface-cli repo create code-debug-env --type space --space_sdk docker
cd code-debug-env
git init
git remote add origin https://huggingface.co/spaces/YOUR_USERNAME/code-debug-env
git add .
git commit -m "Initial commit"
git push origin main
```

---

## Project Structure

```
code-debug-env/
├── openenv.yaml          ← OpenEnv manifest
├── inference.py          ← Baseline agent (root, required)
├── pyproject.toml        ← Dependencies
├── README.md
├── models.py             ← Pydantic Action/Observation/State
├── client.py             ← EnvClient for training loops
├── __init__.py
├── server/
│   ├── app.py            ← FastAPI: /reset /step /state /health
│   ├── environment.py    ← Core episode logic
│   ├── tasks/
│   │   ├── task_easy.py  ← 15 single-bug tasks
│   │   ├── task_medium.py ← 15 two-bug tasks
│   │   └── task_hard.py  ← 15 algorithmic tasks
│   ├── graders/
│   │   ├── grader_easy.py
│   │   ├── grader_medium.py
│   │   └── grader_hard.py
│   ├── requirements.txt
│   └── Dockerfile
└── validator/
    └── pre_submit_check.py
```