Viraaj Sawant committed · Commit `9da318c` · Parent(s): `8a4b89f` · "added Readme.md"

Files changed:

- README.md (+289 −0)
- rl_code_fix_env/README.md (+1 −1)
- rl_code_fix_env/inference.py (+5 −5)

README.md (ADDED)
# TraceRL Mini Environment for Autonomous Code Fixing

This repository packages an OpenEnv-compatible reinforcement learning environment for autonomous Python bug fixing. An agent receives buggy code, can apply unified-diff patches, run the task's tests, inspect logs, and is rewarded for functional progress, reasonable debugging traces, and solving the problem within a step budget.
## Environment Overview and Motivation

The core environment lives in `rl_code_fix_env/` and wraps a code-repair loop around three pieces of functionality:

1. Load a bug-fixing task from either a local curated dataset or a materialized SWE-bench Lite workspace.
2. Let the agent iteratively edit the current `buggy.py` contents with `apply_patch`, then execute the task's test file.
3. Return observations and rewards that make the environment suitable for RL-style training and evaluation.

The motivation is to benchmark whether an autonomous agent can do more than generate one-shot code. It must:

- read failing code,
- produce minimal patches,
- use test feedback to refine its fix,
- manage a limited interaction budget,
- and recover from bad intermediate edits.

This repo also includes a baseline `inference.py` script, containerization for OpenEnv/Hugging Face Spaces deployment, and run logs for a reference baseline.
## Repository Layout

- `rl_code_fix_env/`: main OpenEnv package.
- `rl_code_fix_env/src/environment/environment.py`: core RL environment logic.
- `rl_code_fix_env/src/reward/`: reward shaping and trace scoring.
- `rl_code_fix_env/src/sandbox/`: unified-diff patching and test execution sandbox.
- `rl_code_fix_env/dataset/`: local bug-fixing tasks and metadata.
- `rl_code_fix_env/server/`: FastAPI/OpenEnv server and Dockerfile.
- `rl_code_fix_env/inference.py`: baseline inference agent.
- `logs.md`: recorded baseline run output.
## Action Space

The action model is defined in `rl_code_fix_env/models.py` as:

```python
CodeFixerAction(
    type: str,
    payload: Optional[str] = None,
)
```

Supported action types:

- `apply_patch`: `payload` is a unified-diff patch. The environment fuzzily applies hunks to the current code string.
- `run_tests`: executes the task's `test.py` and updates pass/fail state and logs.
- `get_logs`: returns the most recent logs without changing code.

In practice:

- `apply_patch` is the editing action.
- `run_tests` is the feedback action.
- `get_logs` is a cheap inspection action for when the agent wants the last failure output again.
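As a concrete illustration, the action model can be mirrored with a plain dataclass. This stand-in is for illustration only; the real class lives in `rl_code_fix_env/models.py` and may carry extra serialization logic.

```python
# Minimal stand-in for the documented action model (illustration only;
# the real class is defined in rl_code_fix_env/models.py).
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodeFixerAction:
    type: str
    payload: Optional[str] = None

# One action of each supported type:
patch = CodeFixerAction(type="apply_patch",
                        payload="--- a/buggy.py\n+++ b/buggy.py\n...")
tests = CodeFixerAction(type="run_tests")
logs = CodeFixerAction(type="get_logs")

print(tests.payload is None)  # True -> only apply_patch needs a payload
```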
## Observation Space

The observation model is also defined in `rl_code_fix_env/models.py`:

```python
CodeFixerObservation(
    code: str = "",
    logs: Optional[str] = None,
    test_score: float = 0.0,
    total_tests: int = 1,
    steps: int = 0,
    done: bool = False,
    reward: Optional[float] = None,
)
```

Field meanings:

- `code`: the current patched source code under repair.
- `logs`: the latest pytest output or startup/fallback messages.
- `test_score`: normalized functional score. In the current local tasks it is `1.0` for pass and `0.0` for fail.
- `total_tests`: number of task test files tracked by the environment. Current local tasks use a single target test file.
- `steps`: number of patch actions consumed so far.
- `done`: episode termination flag.
- `reward`: the latest reward returned by the environment wrapper.
## Reward Design

The reward is computed in `rl_code_fix_env/src/reward/reward.py`:

```text
reward =
      0.7 * functional_reward
    + 0.2 * trace_reward
    + 0.1 * quality_reward
    - efficiency_penalty
```

Where:

- `functional_reward = test_score`
- `trace_reward = score_trace(trace_obj)`
- `quality_reward = 1.0` when non-empty code exists, else `0.0`
- `efficiency_penalty = 0.05 * (steps_taken / max_steps)`

If all tests pass, the environment overrides the reward to `1.0`.
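The formula can be sketched as a plain function. This is a readable reconstruction of the documented weights, not the actual implementation in `src/reward/reward.py`; `trace_reward` is passed in directly since `score_trace` is not reproduced here.

```python
# Sketch of the documented reward shaping (reconstruction, not the real
# module). trace_reward stands in for score_trace(trace_obj).
def compute_reward(test_score: float, trace_reward: float, code: str,
                   steps_taken: int, max_steps: int) -> float:
    functional_reward = test_score
    quality_reward = 1.0 if code.strip() else 0.0
    efficiency_penalty = 0.05 * (steps_taken / max_steps)
    reward = (0.7 * functional_reward
              + 0.2 * trace_reward
              + 0.1 * quality_reward
              - efficiency_penalty)
    # The environment overrides the shaped reward when every test passes.
    if test_score >= 1.0:
        reward = 1.0
    return reward

# No passing tests, mid-quality trace, 2 of 10 steps used:
print(round(compute_reward(0.0, 0.5, "x = 1", 2, 10), 4))  # 0.19
```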
## Task Descriptions and Expected Difficulty Levels

### Official competition-facing task mapping

The current local fallback dataset exposes one canonical task per difficulty through `get_hardcoded_task(...)`:

| Difficulty | Problem ID | Description | Bug type | Expected steps |
| --- | --- | --- | --- | --- |
| Easy | `problem_1` | Reverse words while normalizing repeated spaces | `string-splitting` | 1 |
| Medium | `problem_10` | Rotate a matrix 90 degrees clockwise | `matrix-transformation` | 1 |
| Hard | `problem_13` | Preserve recency correctly in an LRU cache | `state-logic` | 2 |

Canonical task details:

- `easy`: the buggy code uses `text.split(" ")`, which preserves empty tokens for repeated spaces. The fix is a small normalization change.
- `medium`: the code transposes the matrix and then reverses rows in the wrong direction, producing a counter-clockwise rotation.
- `hard`: the visible task calls into `cache.py`, where `LRUCache.get()` fails to refresh recency. This is stateful and effectively multi-file reasoning.
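The easy task's `string-splitting` bug can be reproduced in a few lines. The function names below are hypothetical, but the `split(" ")` vs `split()` distinction is the documented bug.

```python
# The easy task's bug pattern (hypothetical function names).
# split(" ") keeps empty tokens for doubled spaces, while split() with
# no argument collapses any run of whitespace.
def reverse_words_buggy(text: str) -> str:
    return " ".join(reversed(text.split(" ")))

def reverse_words_fixed(text: str) -> str:
    return " ".join(reversed(text.split()))

print(reverse_words_buggy("hello  world"))  # 'world  hello' (stray empty token)
print(reverse_words_fixed("hello  world"))  # 'world hello'
```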
### Full local dataset coverage

The local dataset currently contains 23 problems:

- `easy`: 8 tasks
- `medium`: 9 tasks
- `hard`: 6 tasks

Bug patterns represented across the dataset include:

- whitespace and string normalization
- off-by-one and boundary-condition mistakes
- incorrect matrix and sorting transformations
- recursion and exception-handling bugs
- stateful cache logic and multi-bug hard tasks

### Difficulty interpretation

- `easy`: usually a single-line or single-concept bug with direct test feedback.
- `medium`: often requires understanding data-transformation logic or helper-module behavior.
- `hard`: commonly involves state, multi-step reasoning, or fixes that span more than one conceptual location.
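The matrix-transformation pattern behind the medium canonical task can be sketched as follows. These helpers are hypothetical, but they show the documented failure mode: after transposing, reversing the wrong axis turns a clockwise rotation into a counter-clockwise one.

```python
# Sketch of the medium task's bug pattern (hypothetical helpers).
# Clockwise 90-degree rotation = transpose, then reverse each ROW;
# reversing the row ORDER instead yields a counter-clockwise rotation.
def rotate_cw(m):
    return [list(row)[::-1] for row in zip(*m)]

def rotate_ccw_buggy(m):
    return [list(row) for row in zip(*m)][::-1]

m = [[1, 2],
     [3, 4]]
print(rotate_cw(m))         # [[3, 1], [4, 2]]
print(rotate_ccw_buggy(m))  # [[2, 4], [1, 3]]
```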
## Episode Flow

1. `reset()` selects a difficulty.
2. The environment loads the buggy code, test path, workspace path, and zeroed metrics.
3. The agent alternates between `apply_patch`, `run_tests`, and optional `get_logs`.
4. The episode ends when all tests pass or the step budget is exhausted.

By default, the server cycles through `easy`, `medium`, and `hard` on reset. You can force a specific difficulty with `TRACERL_TASK=easy`, `TRACERL_TASK=medium`, or `TRACERL_TASK=hard`.
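The episode flow can be sketched as a driver loop. `StubEnv` below is a stand-in that mimics the documented step semantics; the real transport (the HTTP/OpenEnv client) is out of scope here.

```python
# Sketch of the patch -> test loop against a stubbed environment.
# StubEnv is hypothetical; it only mimics the documented semantics.
class StubEnv:
    def __init__(self):
        self.steps, self.passed = 0, False

    def reset(self):
        return {"code": "def f(): retur 1", "done": False}

    def step(self, action):
        self.steps += 1
        if action["type"] == "apply_patch":
            self.passed = True          # pretend the patch fixed the bug
        done = action["type"] == "run_tests" and self.passed
        return {"done": done, "steps": self.steps}

env = StubEnv()
obs = env.reset()
for _ in range(10):                     # step budget
    env.step({"type": "apply_patch", "payload": "<unified diff>"})
    obs = env.step({"type": "run_tests"})
    if obs["done"]:
        break
print(obs["done"])  # True
```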
## Data Sources

`CodeEnv` defaults to `TASK_SOURCE=swebench`. If SWE-bench Lite task materialization is unavailable, it falls back to the local curated dataset when `SWEBENCH_FALLBACK_LOCAL=1` is enabled, which is the current default behavior.

Expected SWE-bench Lite workspace layout:

```text
rl_code_fix_env/dataset/swebench_lite_tasks/<instance_id>/
    buggy.py
    test.py
```
## Setup Instructions

### Local Python setup

From the repository root:

```bash
cd rl_code_fix_env
uv sync
```

If you are not using `uv`, install the shared dependencies from the repository root:

```bash
pip install -r requirements.txt
```

### Required environment variables for inference

The baseline agent expects:

```bash
API_BASE_URL=<openai-compatible-endpoint>
MODEL_NAME=<model-id>
HF_TOKEN=<api-key>
```

Useful optional variables:

```bash
ENV_URL=http://localhost:8000
TRACERL_TASK=easy
TASK_SOURCE=swebench
SWEBENCH_FALLBACK_LOCAL=1
MAX_STEPS=10
TEMPERATURE=0.2
MAX_TOKENS=2048
SUCCESS_THRESHOLD=1.0
MAX_RETRIES=3
```
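The optional variables are consumed with the usual `getenv(..., default)` pattern. The helper below is a hypothetical sketch of that pattern (`read_config` is not a real function in this repo), shown only for the variables whose defaults appear in this README.

```python
# Hypothetical sketch of the os.getenv(..., default) pattern the baseline
# uses; read_config is illustrative, not a function from inference.py.
import os

def read_config(env=os.environ):
    return {
        "max_steps": int(env.get("MAX_STEPS", "10")),
        "success_threshold": float(env.get("SUCCESS_THRESHOLD", "1.0")),
        "max_retries": int(env.get("MAX_RETRIES", "3")),
    }

print(read_config({}))  # all defaults: {'max_steps': 10, 'success_threshold': 1.0, 'max_retries': 3}
```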
## Usage Instructions

### Run the environment server locally

```bash
cd rl_code_fix_env
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Alternative entry point:

```bash
cd rl_code_fix_env
uv run --project . server
```

### Run the baseline inference agent

Open a second terminal:

```bash
cd rl_code_fix_env
python inference.py
```

The script emits machine-parseable lines in this format:

```text
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
```
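When aggregating runs, the `[STEP]` lines can be parsed with a small regex. This parser is a sketch that assumes `<action_str>` contains no spaces, which may not hold for every action the script emits.

```python
# Sketch of a parser for the documented [STEP] line format.
# Assumes action_str has no spaces (a simplification).
import re
from typing import Optional

STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>[\d.]+) done=(?P<done>true|false) error=(?P<error>.+)"
)

def parse_step(line: str) -> Optional[dict]:
    m = STEP_RE.match(line)
    if not m:
        return None
    d = m.groupdict()
    return {"step": int(d["step"]), "action": d["action"],
            "reward": float(d["reward"]), "done": d["done"] == "true",
            "error": None if d["error"] == "null" else d["error"]}

print(parse_step("[STEP] step=3 action=apply_patch reward=0.12 done=false error=null"))
```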
### Build and run with Docker

From `rl_code_fix_env/`:

```bash
docker build -t rl_code_fix_env-env:latest -f server/Dockerfile .
docker run -p 8000:8000 rl_code_fix_env-env:latest
```

### OpenEnv / Hugging Face Spaces deployment

From `rl_code_fix_env/`:

```bash
openenv push
```

The package is configured as a FastAPI OpenEnv space via `openenv.yaml`.
## Baseline Performance Scores

The current recorded baseline in `logs.md` ran one episode each for `easy`, `medium`, and `hard` using model `qwen/qwen3-coder-480b-a35b-instruct`.

| Task | Success | Steps | Final score | Reward trace | Cumulative reward |
| --- | --- | --- | --- | --- | --- |
| Easy | `false` | 10 | 0.00 | `0.14,0.13,0.12,0.11,0.10,0.09,0.08,0.07,0.06,0.05` | 0.95 |
| Medium | `false` | 10 | 0.00 | `0.14,0.13,0.12,0.11,0.10,0.09,0.08,0.07,0.06,0.05` | 0.95 |
| Hard | `false` | 10 | 0.00 | `0.14,0.13,0.12,0.11,0.10,0.09,0.08,0.07,0.06,0.05` | 0.95 |

Aggregate baseline summary:

- episodes evaluated: 3
- success rate: `0/3`
- mean final score: `0.00`
- mean cumulative reward: `0.95`

Interpretation:

- The baseline agent produced syntactically plausible patches and collected small shaped rewards.
- It did not achieve a passing test score on any recorded task.
- The current baseline should be treated as a starting point rather than a competitive upper bound.
## Notes and Caveats

- The local fallback tasks currently use one target test file per problem, so `test_score` is binary.
- Patch application uses `unidiff` plus fuzzy matching from `diff-match-patch`, which makes the environment more tolerant of slightly stale context.
- Test execution prefers Docker sandboxing, but falls back to direct `pytest` execution when Docker is unavailable.
- The repository root contains supporting notes in `commands.md`, `inference&docker.md`, and `logs.md`.
rl_code_fix_env/README.md (CHANGED)

```diff
@@ -1,6 +1,6 @@
 ---
 title: Rl Code Fix Env Environment Server
-emoji:
+emoji: "🚀"
 colorFrom: green
 colorTo: purple
 sdk: docker
```
rl_code_fix_env/inference.py (CHANGED)

```diff
@@ -35,14 +35,14 @@ from models import CodeFixerAction
 from dotenv import load_dotenv
 load_dotenv()
 
-API_BASE_URL = os.getenv("API_BASE_URL")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://integrate.api.nvidia.com/v1")
 API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
-MODEL_NAME = os.getenv("MODEL_NAME")
+MODEL_NAME = os.getenv("MODEL_NAME", "qwen/qwen2.5-coder-32b-instruct")
 
 MAX_STEPS = int(os.getenv("MAX_STEPS", "10"))
-TEMPERATURE = float(os.getenv("TEMPERATURE", "0.
-MAX_TOKENS = int(os.getenv("MAX_TOKENS", "
-
+TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
+MAX_TOKENS = int(os.getenv("MAX_TOKENS", "512"))
+
 SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_THRESHOLD", "1.0"))
 MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
```