---
title: TraceFix-RL
emoji: πŸ§‘β€πŸ’»
colorFrom: blue
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
- openenv
- reinforcement-learning
- software-engineering
---
## TraceFix-RL
TraceFix-RL is an OpenEnv-compatible environment designed to teach agent behavior
that looks like real software engineering work. Instead of one-shot answers,
the agent must inspect code, form a hypothesis, run tests, patch the code,
verify outcomes, and only then submit. The loop rewards disciplined debugging
and penalizes random edits, forcing the model to learn an engineering workflow.
## Core Design
- **Action space:** `VIEW_CODE`, `RUN_TESTS`, `REPLACE_LINES`, `UNDO_EDIT`, `RESET_TO_ORIGINAL`, `SUBMIT`
- **Observations:** The full code snapshot, localized edit context, execution output, syntax status, and per-test outcomes.
- **Dense rewards:** `RUN_TESTS` bonus, per-test progress bonus, step-cost penalty, invalid-edit penalties, and a final score clamped to `[0.01, 0.98]` (a sketch follows this list).
- **Curriculum-ready tasks:** Easy, Medium, and Hard buckets that the OpenEnv trainer can sequence, plus a random-task fallback for evaluators.
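The exact shaping constants live in `environment.py`; the sketch below uses illustrative values and hypothetical function names only, to show how per-step shaping typically composes with a clamped final score.

```python
# Illustrative reward shaping; constants and names are assumptions,
# not the values used in environment.py.
def step_reward(ran_tests: bool, newly_passing: int, invalid_edit: bool) -> float:
    reward = -0.01                  # step-cost penalty discourages aimless actions
    if ran_tests:
        reward += 0.02              # small bonus for gathering evidence
    reward += 0.05 * newly_passing  # per-test progress bonus
    if invalid_edit:
        reward -= 0.10              # penalize syntactically invalid patches
    return reward

def final_score(raw: float) -> float:
    # Clamp the episode score into [0.01, 0.98], as described above.
    return min(max(raw, 0.01), 0.98)
```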
## State Machine Training Pattern
The environment prompt in `environment.py` encodes a strict operating pattern the agent is expected to follow:
1. **ORIENT:** Inspect code (`VIEW_CODE`)
2. **DIAGNOSE:** Run tests and read failures (`RUN_TESTS`)
3. **FIX:** Patch one localized region (`REPLACE_LINES`)
4. **VERIFY:** Rerun tests (`RUN_TESTS`)
5. **REPEAT:** Continue until all failures are resolved
6. **SUBMIT:** Finalize only after tests pass
This sequence naturally guides reinforcement learning toward robust planning, controlled editing, and verification behavior.
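As a rough illustration, the loop can be driven over the HTTP endpoints listed under Local Development below; the payload field names (`action_type`, `start_line`, `new_code`, etc.) are assumptions for this sketch, not the exact `models.py` schema.

```python
import requests

BASE = "http://localhost:7860"  # server started with `uv run --project . server`

def step(action_type: str, **fields) -> dict:
    # Field names are illustrative; see models.py for the real action schema.
    return requests.post(f"{BASE}/step", json={"action_type": action_type, **fields}).json()

obs = requests.post(f"{BASE}/reset", json={}).json()   # new task, initial observation
obs = step("VIEW_CODE")                                 # ORIENT: read the buggy snapshot
obs = step("RUN_TESTS")                                 # DIAGNOSE: collect failing tests
obs = step("REPLACE_LINES", start_line=3, end_line=3,   # FIX: patch one localized region
           new_code="    return left")
obs = step("RUN_TESTS")                                 # VERIFY: confirm failures are gone
obs = step("SUBMIT")                                    # SUBMIT: finalize after tests pass
```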
## Task Tiers And Test Structure
The registry in `tasks.py` is a static, curated set of 16 coding challenges:
- **Easy (4 tasks):** Focuses on basic operators, indexing, and simple string/array logic.
- **Medium (6 tasks):** Focuses on recursive behavior, branching correctness, and text normalization edges.
- **Hard (6 tasks):** Focuses on data-structure invariants, bracket mapping, interval merging, and eviction logic.
Every task contains: `name`, `description`, `difficulty`, `bug_type`, `code` (buggy implementation), `solution`, and executable `tests`. All tests are safely run inside isolated sandboxes via `sandbox.py` using `multiprocessing`.
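For orientation, a registry entry might look roughly like the sketch below; the values and the exact representation of `tests` are invented for illustration and do not reproduce a real entry from `tasks.py`.

```python
# Hypothetical registry entry; values are illustrative only.
EXAMPLE_TASK = {
    "name": "list_sum_off_by_one",
    "description": "sum_list skips the last element of the input list.",
    "difficulty": "easy",
    "bug_type": "off_by_one",
    "code": "def sum_list(xs):\n    return sum(xs[:-1])",   # buggy implementation
    "solution": "def sum_list(xs):\n    return sum(xs)",    # reference fix
    "tests": [
        "assert sum_list([1, 2, 3]) == 6",
        "assert sum_list([]) == 0",
    ],
}
```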
## Tech Stack & Project Files
This environment enforces strict typing and uses standard modern tooling:
- **`uv`:** Handles dependency management (see `pyproject.toml`).
- **FastAPI:** Provides the `server.app` integration layer for OpenEnv compliance.
- **Pydantic (v2):** Provides validation for the `models.py` schemas (e.g., `CodeAction`, `CodeObservation`); a model sketch follows this list.
- **OpenEnv config:** `openenv.yaml` specifies that `tracefix_rl` runs the FastAPI app on port `7860`.
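A rough sketch of what the action model could look like in Pydantic v2; the field names and their optionality are assumptions, not the actual `models.py` definitions.

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

# Hypothetical shape of the action model; see models.py for the real definition.
class CodeAction(BaseModel):
    action_type: Literal[
        "VIEW_CODE", "RUN_TESTS", "REPLACE_LINES",
        "UNDO_EDIT", "RESET_TO_ORIGINAL", "SUBMIT",
    ]
    start_line: Optional[int] = Field(default=None, ge=1)  # used by REPLACE_LINES
    end_line: Optional[int] = Field(default=None, ge=1)
    new_code: Optional[str] = None
```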
**File Layout:**
- `models.py` / `context.py`: Domain and schema logic.
- `tasks.py`: Task metadata definitions.
- `sandbox.py`: Subprocess runtime and output tracking.
- `environment.py`: Core RL loop (reset, step, reward) implemented by `TraceFixRLGym`.
- `server/tracefix_rl_environment.py` / `server/app.py`: Maps the OpenEnv network interface onto the core environment.
- `inference.py`: Baseline OpenAI-client inference script to evaluate agents.
## Local Development
You must install [`uv`](https://github.com/astral-sh/uv) on your system.
```bash
# Sync dependencies
uv sync
# Run the OpenEnv server on port 7860
uv run --project . server
```
Server endpoints available:
- `POST /reset`
- `POST /step`
- `GET /health`
- `WS /ws`
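A quick way to sanity-check a running server from Python; the URL assumes the local setup above, an empty `reset` body is an assumption, and the printed response shapes are simply whatever the environment returns.

```python
import requests

base = "http://localhost:7860"
print(requests.get(f"{base}/health").json())            # liveness check
print(requests.post(f"{base}/reset", json={}).json())   # first observation for a new task
```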
## Baseline Scores
Baseline scores will be recorded with the bundled `inference.py` runner against the three validator tasks.
Because the environment intentionally clamps scores to `[0.01, 0.98]`, benchmark output should be read with that
convention in mind.
| Task | Baseline Score |
| --- | --- |
| `valid_parentheses_wrong_mapping` | Pending first benchmark run |
| `binary_search_off_by_one` | Pending first benchmark run |
| `reverse_string_returns_original` | Pending first benchmark run |
## Docker + Hugging Face Spaces Deployment
The Space runs via Docker. For Hugging Face Spaces compliance, the container runs as a non-root `appuser` (UID `1000`).
### Testing Locally in Docker
```bash
docker build -t tracefix-rl:test -f Dockerfile .
docker run --rm -p 7860:7860 tracefix-rl:test
```
### Deploy to Hugging Face Spaces
This project uses the OpenEnv CLI to deploy to Hugging Face Spaces.
```bash
# Push directly to your specified HF Space
openenv push
```
### Server Pre-validation
Before committing to training, you can validate your deployed Space or a locally running server:
```bash
bash ./pre-val.sh https://<your-space>.hf.space .
```
## Inference & Evaluation (`inference.py`)
The baseline inference runner evaluates agents against the environment using an OpenAI-compatible interface.
**Requirements for Inference:**
- `API_BASE_URL` (Defaults to `https://router.huggingface.co/v1`)
- `MODEL_NAME` (Defaults to `Qwen/Qwen2.5-72B-Instruct`)
- `HF_TOKEN`
**Usage Flags:**
- `--easy`, `--medium`, `--hard`: Lock the environment to a specific task bucket.
- `--thought`: Send `<thought>` blocks back in the request payload to train chain-of-thought behavior.
Example run on medium tasks with thought tracking enabled:
```bash
python inference.py --medium --thought
```