Viraaj Sawant committed · Commit `9da318c` · Parent(s): `8a4b89f` · "added Readme.md"

Files changed:

- README.md (+289 −0)
- rl_code_fix_env/README.md (+1 −1)
- rl_code_fix_env/inference.py (+5 −5)

README.md (ADDED)
# TraceRL Mini Environment for Autonomous Code Fixing

This repository packages an OpenEnv-compatible reinforcement learning environment for autonomous Python bug fixing. An agent receives buggy code, can apply unified-diff patches, run the task's tests, inspect logs, and is rewarded for functional progress, reasonable debugging traces, and solving the problem within a step budget.
## Environment Overview and Motivation

The core environment lives in `rl_code_fix_env/` and wraps a code-repair loop around three pieces of functionality:

1. Load a bug-fixing task from either a local curated dataset or a materialized SWE-bench Lite workspace.
2. Let the agent iteratively edit the current `buggy.py` contents with `apply_patch`, then execute the task's test file.
3. Return observations and rewards that make the environment suitable for RL-style training and evaluation.

The motivation is to benchmark whether an autonomous agent can do more than generate one-shot code. It must:

- read failing code,
- produce minimal patches,
- use test feedback to refine its fix,
- manage a limited interaction budget,
- and recover from bad intermediate edits.

This repo also includes a baseline `inference.py` script, containerization for OpenEnv/Hugging Face Spaces deployment, and run logs for a reference baseline.
## Repository Layout

- `rl_code_fix_env/`: main OpenEnv package.
- `rl_code_fix_env/src/environment/environment.py`: core RL environment logic.
- `rl_code_fix_env/src/reward/`: reward shaping and trace scoring.
- `rl_code_fix_env/src/sandbox/`: unified-diff patching and test execution sandbox.
- `rl_code_fix_env/dataset/`: local bug-fixing tasks and metadata.
- `rl_code_fix_env/server/`: FastAPI/OpenEnv server and Dockerfile.
- `rl_code_fix_env/inference.py`: baseline inference agent.
- `logs.md`: recorded baseline run output.
## Action Space

The action model is defined in `rl_code_fix_env/models.py` as:

```python
CodeFixerAction(
    type: str,
    payload: Optional[str] = None,
)
```

Supported action types:

- `apply_patch`: `payload` is a unified-diff patch. The environment fuzzily applies hunks to the current code string.
- `run_tests`: executes the task's `test.py` and updates pass/fail state and logs.
- `get_logs`: returns the most recent logs without changing code.

In practice:

- `apply_patch` is the editing action.
- `run_tests` is the feedback action.
- `get_logs` is a cheap inspection action for when the agent wants the last failure output again.
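As a concrete illustration, the action model can be mirrored with a plain dataclass. This stand-in is for illustration only; the real class lives in `rl_code_fix_env/models.py` and may carry extra serialization logic.

```python
# Minimal stand-in for the documented action model (illustration only;
# the real class is defined in rl_code_fix_env/models.py).
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodeFixerAction:
    type: str
    payload: Optional[str] = None

# One action of each supported type:
patch = CodeFixerAction(type="apply_patch",
                        payload="--- a/buggy.py\n+++ b/buggy.py\n...")
tests = CodeFixerAction(type="run_tests")
logs = CodeFixerAction(type="get_logs")

print(tests.payload is None)  # True -> only apply_patch needs a payload
```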
## Observation Space

The observation model is also defined in `rl_code_fix_env/models.py`:

```python
CodeFixerObservation(
    code: str = "",
    logs: Optional[str] = None,
    test_score: float = 0.0,
    total_tests: int = 1,
    steps: int = 0,
    done: bool = False,
    reward: Optional[float] = None,
)
```

Field meanings:

- `code`: the current patched source code under repair.
- `logs`: the latest pytest output or startup/fallback messages.
- `test_score`: normalized functional score. In the current local tasks it is `1.0` for pass and `0.0` for fail.
- `total_tests`: number of task test files tracked by the environment. Current local tasks use a single target test file.
- `steps`: number of patch actions consumed so far.
- `done`: episode termination flag.
- `reward`: the latest reward returned by the environment wrapper.
## Reward Design

The reward is computed in `rl_code_fix_env/src/reward/reward.py`:

```text
reward =
      0.7 * functional_reward
    + 0.2 * trace_reward
    + 0.1 * quality_reward
    - efficiency_penalty
```

Where:

- `functional_reward = test_score`
- `trace_reward = score_trace(trace_obj)`
- `quality_reward = 1.0` when non-empty code exists, else `0.0`
- `efficiency_penalty = 0.05 * (steps_taken / max_steps)`

If all tests pass, the environment overrides the reward to `1.0`.
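The formula can be sketched as a plain function. This is a readable reconstruction of the documented weights, not the actual implementation in `src/reward/reward.py`; `trace_reward` is passed in directly since `score_trace` is not reproduced here.

```python
# Sketch of the documented reward shaping (reconstruction, not the real
# module). trace_reward stands in for score_trace(trace_obj).
def compute_reward(test_score: float, trace_reward: float, code: str,
                   steps_taken: int, max_steps: int) -> float:
    functional_reward = test_score
    quality_reward = 1.0 if code.strip() else 0.0
    efficiency_penalty = 0.05 * (steps_taken / max_steps)
    reward = (0.7 * functional_reward
              + 0.2 * trace_reward
              + 0.1 * quality_reward
              - efficiency_penalty)
    # The environment overrides the shaped reward when every test passes.
    if test_score >= 1.0:
        reward = 1.0
    return reward

# No passing tests, mid-quality trace, 2 of 10 steps used:
print(round(compute_reward(0.0, 0.5, "x = 1", 2, 10), 4))  # 0.19
```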
## Task Descriptions and Expected Difficulty Levels

### Official competition-facing task mapping

The current local fallback dataset exposes one canonical task per difficulty through `get_hardcoded_task(...)`:

| Difficulty | Problem ID | Description | Bug type | Expected steps |
| --- | --- | --- | --- | --- |
| Easy | `problem_1` | Reverse words while normalizing repeated spaces | `string-splitting` | 1 |
| Medium | `problem_10` | Rotate a matrix 90 degrees clockwise | `matrix-transformation` | 1 |
| Hard | `problem_13` | Preserve recency correctly in an LRU cache | `state-logic` | 2 |

Canonical task details:

- `easy`: the buggy code uses `text.split(" ")`, which preserves empty tokens for repeated spaces. The fix is a small normalization change.
- `medium`: the code transposes the matrix and then reverses rows in the wrong direction, producing a counter-clockwise rotation.
- `hard`: the visible task calls into `cache.py`, where `LRUCache.get()` fails to refresh recency. This is stateful and effectively multi-file reasoning.
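The easy task's `string-splitting` bug can be reproduced in a few lines. The function names below are hypothetical, but the `split(" ")` vs `split()` distinction is the documented bug.

```python
# The easy task's bug pattern (hypothetical function names).
# split(" ") keeps empty tokens for doubled spaces, while split() with
# no argument collapses any run of whitespace.
def reverse_words_buggy(text: str) -> str:
    return " ".join(reversed(text.split(" ")))

def reverse_words_fixed(text: str) -> str:
    return " ".join(reversed(text.split()))

print(reverse_words_buggy("hello  world"))  # 'world  hello' (stray empty token)
print(reverse_words_fixed("hello  world"))  # 'world hello'
```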
### Full local dataset coverage

The local dataset currently contains 23 problems:

- `easy`: 8 tasks
- `medium`: 9 tasks
- `hard`: 6 tasks

Bug patterns represented across the dataset include:

- whitespace and string normalization
- off-by-one and boundary-condition mistakes
- incorrect matrix and sorting transformations
- recursion and exception-handling bugs
- stateful cache logic and multi-bug hard tasks

### Difficulty interpretation

- `easy`: usually a single-line or single-concept bug with direct test feedback.
- `medium`: often requires understanding data-transformation logic or helper-module behavior.
- `hard`: commonly involves state, multi-step reasoning, or fixes that span more than one conceptual location.
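The matrix-transformation pattern behind the medium canonical task can be sketched as follows. These helpers are hypothetical, but they show the documented failure mode: after transposing, reversing the wrong axis turns a clockwise rotation into a counter-clockwise one.

```python
# Sketch of the medium task's bug pattern (hypothetical helpers).
# Clockwise 90-degree rotation = transpose, then reverse each ROW;
# reversing the row ORDER instead yields a counter-clockwise rotation.
def rotate_cw(m):
    return [list(row)[::-1] for row in zip(*m)]

def rotate_ccw_buggy(m):
    return [list(row) for row in zip(*m)][::-1]

m = [[1, 2],
     [3, 4]]
print(rotate_cw(m))         # [[3, 1], [4, 2]]
print(rotate_ccw_buggy(m))  # [[2, 4], [1, 3]]
```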
## Episode Flow

1. `reset()` selects a difficulty.
2. The environment loads the buggy code, test path, workspace path, and zeroed metrics.
3. The agent alternates between `apply_patch`, `run_tests`, and optional `get_logs`.
4. The episode ends when all tests pass or the step budget is exhausted.

By default, the server cycles through `easy`, `medium`, and `hard` on reset. You can force a specific difficulty with `TRACERL_TASK=easy`, `TRACERL_TASK=medium`, or `TRACERL_TASK=hard`.
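The episode flow can be sketched as a driver loop. `StubEnv` below is a stand-in that mimics the documented step semantics; the real transport (the HTTP/OpenEnv client) is out of scope here.

```python
# Sketch of the patch -> test loop against a stubbed environment.
# StubEnv is hypothetical; it only mimics the documented semantics.
class StubEnv:
    def __init__(self):
        self.steps, self.passed = 0, False

    def reset(self):
        return {"code": "def f(): retur 1", "done": False}

    def step(self, action):
        self.steps += 1
        if action["type"] == "apply_patch":
            self.passed = True          # pretend the patch fixed the bug
        done = action["type"] == "run_tests" and self.passed
        return {"done": done, "steps": self.steps}

env = StubEnv()
obs = env.reset()
for _ in range(10):                     # step budget
    env.step({"type": "apply_patch", "payload": "<unified diff>"})
    obs = env.step({"type": "run_tests"})
    if obs["done"]:
        break
print(obs["done"])  # True
```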
## Data Sources

`CodeEnv` defaults to `TASK_SOURCE=swebench`. If SWE-bench Lite task materialization is unavailable, it falls back to the local curated dataset when `SWEBENCH_FALLBACK_LOCAL=1` is enabled, which is the current default behavior.

Expected SWE-bench Lite workspace layout:

```text
rl_code_fix_env/dataset/swebench_lite_tasks/<instance_id>/
    buggy.py
    test.py
```
## Setup Instructions

### Local Python setup

From the repository root:

```bash
cd rl_code_fix_env
uv sync
```

If you are not using `uv`, install the shared dependencies from the repository root:

```bash
pip install -r requirements.txt
```

### Required environment variables for inference

The baseline agent expects:

```bash
API_BASE_URL=<openai-compatible-endpoint>
MODEL_NAME=<model-id>
HF_TOKEN=<api-key>
```

Useful optional variables:

```bash
ENV_URL=http://localhost:8000
TRACERL_TASK=easy
TASK_SOURCE=swebench
SWEBENCH_FALLBACK_LOCAL=1
MAX_STEPS=10
TEMPERATURE=0.2
MAX_TOKENS=2048
SUCCESS_THRESHOLD=1.0
MAX_RETRIES=3
```
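The optional variables are consumed with the usual `getenv(..., default)` pattern. The helper below is a hypothetical sketch of that pattern (`read_config` is not a real function in this repo), shown only for the variables whose defaults appear in this README.

```python
# Hypothetical sketch of the os.getenv(..., default) pattern the baseline
# uses; read_config is illustrative, not a function from inference.py.
import os

def read_config(env=os.environ):
    return {
        "max_steps": int(env.get("MAX_STEPS", "10")),
        "success_threshold": float(env.get("SUCCESS_THRESHOLD", "1.0")),
        "max_retries": int(env.get("MAX_RETRIES", "3")),
    }

print(read_config({}))  # all defaults: {'max_steps': 10, 'success_threshold': 1.0, 'max_retries': 3}
```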
## Usage Instructions

### Run the environment server locally

```bash
cd rl_code_fix_env
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Alternative entry point:

```bash
cd rl_code_fix_env
uv run --project . server
```

### Run the baseline inference agent

Open a second terminal:

```bash
cd rl_code_fix_env
python inference.py
```

The script emits machine-parseable lines in this format:

```text
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
```
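When aggregating runs, the `[STEP]` lines can be parsed with a small regex. This parser is a sketch that assumes `<action_str>` contains no spaces, which may not hold for every action the script emits.

```python
# Sketch of a parser for the documented [STEP] line format.
# Assumes action_str has no spaces (a simplification).
import re
from typing import Optional

STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>[\d.]+) done=(?P<done>true|false) error=(?P<error>.+)"
)

def parse_step(line: str) -> Optional[dict]:
    m = STEP_RE.match(line)
    if not m:
        return None
    d = m.groupdict()
    return {"step": int(d["step"]), "action": d["action"],
            "reward": float(d["reward"]), "done": d["done"] == "true",
            "error": None if d["error"] == "null" else d["error"]}

print(parse_step("[STEP] step=3 action=apply_patch reward=0.12 done=false error=null"))
```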
### Build and run with Docker

From `rl_code_fix_env/`:

```bash
docker build -t rl_code_fix_env-env:latest -f server/Dockerfile .
docker run -p 8000:8000 rl_code_fix_env-env:latest
```

### OpenEnv / Hugging Face Spaces deployment

From `rl_code_fix_env/`:

```bash
openenv push
```

The package is configured as a FastAPI OpenEnv space via `openenv.yaml`.
## Baseline Performance Scores

The current recorded baseline in `logs.md` ran one episode each for `easy`, `medium`, and `hard` using model `qwen/qwen3-coder-480b-a35b-instruct`.

| Task | Success | Steps | Final score | Reward trace | Cumulative reward |
| --- | --- | --- | --- | --- | --- |
| Easy | `false` | 10 | 0.00 | `0.14,0.13,0.12,0.11,0.10,0.09,0.08,0.07,0.06,0.05` | 0.95 |
| Medium | `false` | 10 | 0.00 | `0.14,0.13,0.12,0.11,0.10,0.09,0.08,0.07,0.06,0.05` | 0.95 |
| Hard | `false` | 10 | 0.00 | `0.14,0.13,0.12,0.11,0.10,0.09,0.08,0.07,0.06,0.05` | 0.95 |

Aggregate baseline summary:

- episodes evaluated: 3
- success rate: `0/3`
- mean final score: `0.00`
- mean cumulative reward: `0.95`

Interpretation:

- The baseline agent produced syntactically plausible patches and collected small shaped rewards.
- It did not achieve a passing test score on any recorded task.
- The current baseline should be treated as a starting point rather than a competitive upper bound.
## Notes and Caveats

- The local fallback tasks currently use one target test file per problem, so `test_score` is binary.
- Patch application uses `unidiff` plus fuzzy matching from `diff-match-patch`, which makes the environment more tolerant of slightly stale context.
- Test execution prefers Docker sandboxing, but falls back to direct `pytest` execution when Docker is unavailable.
- The repository root contains supporting notes in `commands.md`, `inference&docker.md`, and `logs.md`.
rl_code_fix_env/README.md (CHANGED)

```diff
@@ -1,6 +1,6 @@
 ---
 title: Rl Code Fix Env Environment Server
-emoji:
+emoji: "🚀"
 colorFrom: green
 colorTo: purple
 sdk: docker
```
rl_code_fix_env/inference.py (CHANGED)

```diff
@@ -35,14 +35,14 @@ from models import CodeFixerAction
 from dotenv import load_dotenv
 load_dotenv()
 
-API_BASE_URL = os.getenv("API_BASE_URL")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://integrate.api.nvidia.com/v1")
 API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
-MODEL_NAME = os.getenv("MODEL_NAME")
+MODEL_NAME = os.getenv("MODEL_NAME", "qwen/qwen2.5-coder-32b-instruct")
 
 MAX_STEPS = int(os.getenv("MAX_STEPS", "10"))
-TEMPERATURE = float(os.getenv("TEMPERATURE", "0.
-MAX_TOKENS = int(os.getenv("MAX_TOKENS", "
-
+TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
+MAX_TOKENS = int(os.getenv("MAX_TOKENS", "512"))
+
 SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_THRESHOLD", "1.0"))
 MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
```