Viraaj Sawant committed on
Commit 9da318c · 1 Parent(s): 8a4b89f

added Readme.md
README.md ADDED
@@ -0,0 +1,289 @@

# TraceRL Mini Environment for Autonomous Code Fixing

This repository packages an OpenEnv-compatible reinforcement learning environment for autonomous Python bug fixing. An agent receives buggy code, can apply unified-diff patches, run the task's tests, inspect logs, and is rewarded for functional progress, reasonable debugging traces, and solving the problem within a step budget.

## Environment Overview and Motivation

The core environment lives in `rl_code_fix_env/` and wraps a code-repair loop around three pieces of functionality:

1. Load a bug-fixing task from either a local curated dataset or a materialized SWE-bench Lite workspace.
2. Let the agent iteratively edit the current `buggy.py` contents with `apply_patch`, then execute the task test file.
3. Return observations and rewards that make the environment suitable for RL-style training and evaluation.

The motivation is to benchmark whether an autonomous agent can do more than generate one-shot code. It must:

- read failing code,
- produce minimal patches,
- use test feedback to refine its fix,
- manage a limited interaction budget,
- and recover from bad intermediate edits.

This repo also includes a baseline `inference.py` script, containerization for OpenEnv/Hugging Face Spaces deployment, and run logs for a reference baseline.

## Repository Layout

- `rl_code_fix_env/`: main OpenEnv package.
- `rl_code_fix_env/src/environment/environment.py`: core RL environment logic.
- `rl_code_fix_env/src/reward/`: reward shaping and trace scoring.
- `rl_code_fix_env/src/sandbox/`: unified-diff patching and test execution sandbox.
- `rl_code_fix_env/dataset/`: local bug-fixing tasks and metadata.
- `rl_code_fix_env/server/`: FastAPI/OpenEnv server and Dockerfile.
- `rl_code_fix_env/inference.py`: baseline inference agent.
- `logs.md`: recorded baseline run output.

## Action Space

The action model is defined in `rl_code_fix_env/models.py` as:

```python
CodeFixerAction(
    type: str,
    payload: Optional[str] = None,
)
```

Supported action types:

- `apply_patch`: `payload` is a unified-diff patch. The environment fuzzily applies hunks to the current code string.
- `run_tests`: executes the task's `test.py` and updates pass/fail state and logs.
- `get_logs`: returns the most recent logs without changing code.

Practical meaning:

- `apply_patch` is the editing action.
- `run_tests` is the feedback action.
- `get_logs` is a cheap inspection action for when the agent wants the last failure output again.
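For illustration, the three action types can be constructed like this. This is a sketch that uses a plain dataclass as a stand-in for `CodeFixerAction` (the real model lives in `rl_code_fix_env/models.py`), and the diff payload is a hypothetical example patch, not one of the actual tasks:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodeFixerAction:  # stand-in mirroring the model shown above
    type: str
    payload: Optional[str] = None

# Editing action: payload carries a unified diff against the current code.
patch = CodeFixerAction(
    type="apply_patch",
    payload=(
        "--- a/buggy.py\n"
        "+++ b/buggy.py\n"
        "@@ -1,2 +1,2 @@\n"
        "-    words = text.split(\" \")\n"
        "+    words = text.split()\n"
    ),
)

# Feedback action: no payload needed.
run = CodeFixerAction(type="run_tests")

# Inspection action: re-read the last logs without touching the code.
inspect = CodeFixerAction(type="get_logs")
```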

## Observation Space

The observation model is also defined in `rl_code_fix_env/models.py`:

```python
CodeFixerObservation(
    code: str = "",
    logs: Optional[str] = None,
    test_score: float = 0.0,
    total_tests: int = 1,
    steps: int = 0,
    done: bool = False,
    reward: Optional[float] = None,
)
```

Field meanings:

- `code`: the current patched source code under repair.
- `logs`: latest pytest output or startup/fallback messages.
- `test_score`: normalized functional score. In the current local tasks it is `1.0` for pass and `0.0` for fail.
- `total_tests`: number of task test files tracked by the environment. Current local tasks use a single target test file.
- `steps`: number of patch actions consumed so far.
- `done`: episode termination flag.
- `reward`: latest reward returned by the environment wrapper.
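A toy decision rule over these fields might look like the following. This is only a sketch of how an agent could branch on the observation, not the baseline agent's actual policy:

```python
from typing import Optional

def next_action_type(test_score: float,
                     logs: Optional[str],
                     done: bool) -> Optional[str]:
    """Toy policy: gather feedback first, then patch on failure."""
    if done or test_score >= 1.0:
        return None            # episode over, nothing left to do
    if logs is None:
        return "run_tests"     # no feedback yet: produce a failing log
    return "apply_patch"       # we have a failure log: try an edit
```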

## Reward Design

The reward is computed in `rl_code_fix_env/src/reward/reward.py`:

```text
reward =
    0.7 * functional_reward
  + 0.2 * trace_reward
  + 0.1 * quality_reward
  - efficiency_penalty
```

Where:

- `functional_reward = test_score`
- `trace_reward = score_trace(trace_obj)`
- `quality_reward = 1.0` when non-empty code exists, else `0.0`
- `efficiency_penalty = 0.05 * (steps_taken / max_steps)`

If all tests pass, the environment overrides the reward to `1.0`.
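Put together, the formula above can be sketched as follows. The trace score is treated as an externally supplied number here, since `score_trace` lives in the reward module; this is an illustrative reconstruction, not the module's exact code:

```python
def shaped_reward(test_score: float, trace_reward: float,
                  code: str, steps_taken: int, max_steps: int) -> float:
    """Sketch of the weighted reward described above."""
    if test_score >= 1.0:
        return 1.0  # env overrides the reward when all tests pass
    quality_reward = 1.0 if code.strip() else 0.0
    efficiency_penalty = 0.05 * (steps_taken / max_steps)
    return (0.7 * test_score
            + 0.2 * trace_reward
            + 0.1 * quality_reward
            - efficiency_penalty)
```

For example, a failing patch halfway through a 10-step budget with a trace score of 0.5 earns `0.2 * 0.5 + 0.1 - 0.025 = 0.175`.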

## Task Descriptions and Expected Difficulty Levels

### Official competition-facing task mapping

The current local fallback dataset exposes one canonical task per difficulty through `get_hardcoded_task(...)`:

| Difficulty | Problem ID | Description | Bug type | Expected steps |
| --- | --- | --- | --- | --- |
| Easy | `problem_1` | Reverse words while normalizing repeated spaces | `string-splitting` | 1 |
| Medium | `problem_10` | Rotate a matrix 90 degrees clockwise | `matrix-transformation` | 1 |
| Hard | `problem_13` | Preserve recency correctly in an LRU cache | `state-logic` | 2 |

Canonical task details:

- `easy`: the buggy code uses `text.split(" ")`, which preserves empty tokens for repeated spaces. The fix is a small normalization change.
- `medium`: the code transposes the matrix and then reverses rows in the wrong direction, producing a counter-clockwise rotation.
- `hard`: the visible task calls into `cache.py`, where `LRUCache.get()` fails to refresh recency. This is stateful and effectively multi-file reasoning.
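To make the `hard` bug concrete, here is a minimal sketch of the recency bug and its fix. This is illustrative only; the task's actual `cache.py` may be structured differently:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return -1
        # Buggy variant: return self.data[key] directly, WITHOUT refreshing
        # recency, so a recently read key can still be evicted.
        self.data.move_to_end(key)  # the fix: mark key as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put(1, "a")
cache.put(2, "b")
cache.get(1)       # refreshes key 1
cache.put(3, "c")  # with the fix this evicts key 2, not key 1
```

Without the `move_to_end` call in `get`, the final `put` would evict key 1, which is exactly the kind of stateful failure the tests catch.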

### Full local dataset coverage

The local dataset currently contains 23 problems:

- `easy`: 8 tasks
- `medium`: 9 tasks
- `hard`: 6 tasks

Bug patterns represented across the dataset include:

- whitespace and string normalization
- off-by-one and boundary-condition mistakes
- incorrect matrix and sorting transformations
- recursion and exception-handling bugs
- stateful cache logic and multi-bug hard tasks

### Difficulty interpretation

- `easy`: usually a single-line or single-concept bug with direct test feedback.
- `medium`: often requires understanding data-transformation logic or helper-module behavior.
- `hard`: commonly involves state, multi-step reasoning, or fixes that span more than one conceptual location.

## Episode Flow

1. `reset()` selects a difficulty.
2. The environment loads the buggy code, test path, workspace path, and zeroed metrics.
3. The agent alternates between `apply_patch`, `run_tests`, and optional `get_logs`.
4. The episode ends when all tests pass or the step budget is exhausted.

By default, the server cycles through `easy`, `medium`, and `hard` on reset. You can force a specific difficulty with `TRACERL_TASK=easy`, `TRACERL_TASK=medium`, or `TRACERL_TASK=hard`.

## Data Sources

`CodeEnv` defaults to `TASK_SOURCE=swebench`. If SWE-bench Lite task materialization is unavailable, it falls back to the local curated dataset when `SWEBENCH_FALLBACK_LOCAL=1` is set, which is the current default.

Expected SWE-bench Lite workspace layout:

```text
rl_code_fix_env/dataset/swebench_lite_tasks/<instance_id>/
  buggy.py
  test.py
```

## Setup Instructions

### Local Python setup

From the repository root:

```bash
cd rl_code_fix_env
uv sync
```

If you are not using `uv`, install the shared dependencies from the repository root:

```bash
pip install -r requirements.txt
```

### Required environment variables for inference

The baseline agent expects:

```bash
API_BASE_URL=<openai-compatible-endpoint>
MODEL_NAME=<model-id>
HF_TOKEN=<api-key>
```

Useful optional variables:

```bash
ENV_URL=http://localhost:8000
TRACERL_TASK=easy
TASK_SOURCE=swebench
SWEBENCH_FALLBACK_LOCAL=1
MAX_STEPS=10
TEMPERATURE=0.2
MAX_TOKENS=2048
SUCCESS_THRESHOLD=1.0
MAX_RETRIES=3
```
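The numeric variables are read with typed defaults in `inference.py`; the pattern looks roughly like this (a sketch, and the exact default values may differ from your checkout):

```python
import os

# Numeric settings must be cast from their environment-string form,
# or comparisons like `reward >= SUCCESS_SCORE_THRESHOLD` misbehave.
MAX_STEPS = int(os.getenv("MAX_STEPS", "10"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.2"))
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "2048"))
SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_THRESHOLD", "1.0"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
```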

## Usage Instructions

### Run the environment server locally

```bash
cd rl_code_fix_env
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Alternative entry point:

```bash
cd rl_code_fix_env
uv run --project . server
```

### Run the baseline inference agent

Open a second terminal:

```bash
cd rl_code_fix_env
python inference.py
```

The script emits machine-parseable lines in this format:

```text
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
```
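If you need to post-process runs, the `[STEP]` lines can be extracted with a small regex. This is a sketch written against the format above, not a utility shipped with the repo:

```python
import re

STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>[-\d.]+) done=(?P<done>true|false)"
)

def parse_step(line: str):
    """Return (step, action, reward, done) for a [STEP] line, else None."""
    m = STEP_RE.search(line)
    if not m:
        return None
    return (int(m.group("step")), m.group("action"),
            float(m.group("reward")), m.group("done") == "true")
```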

### Build and run with Docker

From `rl_code_fix_env/`:

```bash
docker build -t rl_code_fix_env-env:latest -f server/Dockerfile .
docker run -p 8000:8000 rl_code_fix_env-env:latest
```

### OpenEnv / Hugging Face Spaces deployment

From `rl_code_fix_env/`:

```bash
openenv push
```

The package is configured as a FastAPI OpenEnv space via `openenv.yaml`.

## Baseline Performance Scores

The current recorded baseline in `logs.md` ran one episode each for `easy`, `medium`, and `hard` using model `qwen/qwen3-coder-480b-a35b-instruct`.

| Task | Success | Steps | Final score | Reward trace | Cumulative reward |
| --- | --- | --- | --- | --- | --- |
| Easy | `false` | 10 | 0.00 | `0.14,0.13,0.12,0.11,0.10,0.09,0.08,0.07,0.06,0.05` | 0.95 |
| Medium | `false` | 10 | 0.00 | `0.14,0.13,0.12,0.11,0.10,0.09,0.08,0.07,0.06,0.05` | 0.95 |
| Hard | `false` | 10 | 0.00 | `0.14,0.13,0.12,0.11,0.10,0.09,0.08,0.07,0.06,0.05` | 0.95 |
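The cumulative-reward column is simply the sum of the per-step trace, which is easy to check:

```python
trace = [0.14, 0.13, 0.12, 0.11, 0.10, 0.09, 0.08, 0.07, 0.06, 0.05]
# Arithmetic series: (0.14 + 0.05) * 10 / 2 = 0.95
cumulative = round(sum(trace), 2)
print(cumulative)  # → 0.95
```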

Aggregate baseline summary:

- episodes evaluated: 3
- success rate: `0/3`
- mean final score: `0.00`
- mean cumulative reward: `0.95`

Interpretation:

- The baseline agent produced syntactically plausible patches and collected small shaped rewards.
- It did not achieve a passing test score on any recorded task.
- The current baseline should be treated as a starting point rather than a competitive upper bound.

## Notes and Caveats

- The local fallback tasks currently use one target test file per problem, so `test_score` is binary.
- Patch application uses `unidiff` plus fuzzy matching from `diff-match-patch`, which makes the environment more tolerant of slightly stale context.
- Test execution prefers Docker sandboxing, but falls back to direct `pytest` execution when Docker is unavailable.
- The repository root contains supporting notes in `commands.md`, `inference&docker.md`, and `logs.md`.
rl_code_fix_env/README.md CHANGED
@@ -1,6 +1,6 @@
  ---
  title: Rl Code Fix Env Environment Server
- emoji:
  colorFrom: green
  colorTo: purple
  sdk: docker

  ---
  title: Rl Code Fix Env Environment Server
+ emoji: "🚀"
  colorFrom: green
  colorTo: purple
  sdk: docker
rl_code_fix_env/inference.py CHANGED
@@ -35,14 +35,14 @@ from models import CodeFixerAction
35
  from dotenv import load_dotenv
36
  load_dotenv()
37
 
38
- API_BASE_URL = os.getenv("API_BASE_URL")
39
  API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
40
- MODEL_NAME = os.getenv("MODEL_NAME")
41
 
42
  MAX_STEPS = int(os.getenv("MAX_STEPS", "10"))
43
- TEMPERATURE = float(os.getenv("TEMPERATURE", "0.2"))
44
- MAX_TOKENS = int(os.getenv("MAX_TOKENS", "2048"))
45
- # FIX: cast to correct types (were left as raw strings before)
46
  SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_THRESHOLD", "1.0"))
47
  MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
48
 
 
35
  from dotenv import load_dotenv
36
  load_dotenv()
37
 
38
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://integrate.api.nvidia.com/v1")
39
  API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
40
+ MODEL_NAME = os.getenv("MODEL_NAME", "qwen/qwen2.5-coder-32b-instruct")
41
 
42
  MAX_STEPS = int(os.getenv("MAX_STEPS", "10"))
43
+ TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
44
+ MAX_TOKENS = int(os.getenv("MAX_TOKENS", "512"))
45
+
46
  SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_THRESHOLD", "1.0"))
47
  MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
48