---
title: TraceFix-RL
emoji: 🧑‍💻
colorFrom: blue
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
  - openenv
  - reinforcement-learning
  - software-engineering
---

## TraceFix-RL

TraceFix-RL is an OpenEnv-compatible environment designed to teach agent behavior that looks like real software engineering work. Instead of one-shot answers, the agent must inspect code, form a hypothesis, run tests, patch the code, verify outcomes, and only then submit. The loop rewards disciplined debugging and penalizes random edits, forcing the model to learn an engineering workflow.

## Core Design

- **Action space:** `VIEW_CODE`, `RUN_TESTS`, `REPLACE_LINES`, `UNDO_EDIT`, `RESET_TO_ORIGINAL`, `SUBMIT`
- **Observations:** The full code snapshot, localized edit context, execution output, syntax status, and per-test outcomes.
- **Dense Rewards:** A `RUN_TESTS` bonus, per-test progress bonus, step-cost penalty, invalid-edit penalties, and a final score clamped to `[0.01, 0.98]`.
- **Curriculum-ready Tasks:** Easy, Medium, and Hard buckets that the OpenEnv trainer can sequence, plus a random fallback for evaluators.

## State Machine Training Pattern

The environment prompt in `environment.py` encodes a strict operating pattern the agent is expected to follow:

1. **ORIENT:** Inspect code (`VIEW_CODE`)
2. **DIAGNOSE:** Run tests and read failures (`RUN_TESTS`)
3. **FIX:** Patch one localized region (`REPLACE_LINES`)
4. **VERIFY:** Rerun tests (`RUN_TESTS`)
5. **REPEAT:** Continue until all failures are resolved
6. **SUBMIT:** Finalize only after tests pass

This sequence naturally guides reinforcement learning toward robust planning, controlled editing, and verification behavior.

## Task Tiers and Test Structure

The registry in `tasks.py` defines a static, curated set of 16 coding challenges:

- **Easy (4 tasks):** Basic operators, indexing, and simple string/array logic.
- **Medium (6 tasks):** Recursive behavior, branching correctness, and text-normalization edge cases.
- **Hard (6 tasks):** Data-structure invariants, bracket mapping, interval merging, and eviction logic.

Every task contains: `name`, `description`, `difficulty`, `bug_type`, `code` (the buggy implementation), `solution`, and executable `tests`. All tests run safely inside isolated sandboxes via `sandbox.py` using `multiprocessing`.

## Tech Stack & Project Files

This environment enforces strict typing and uses standard modern tooling:

- **`uv`:** Handles dependency management (see `pyproject.toml`).
- **FastAPI:** Provides the `server.app` integration layer for OpenEnv compliance.
- **Pydantic (v2):** Provides strong validation layers for `models.py` (e.g., `CodeAction`, `CodeObservation`).
- **OpenEnv Config:** See `openenv.yaml`, which registers `tracefix_rl` and runs the FastAPI app on port `7860`.

**File Layout:**

- `models.py` / `context.py`: Domain and schema logic.
- `tasks.py`: Task metadata definitions.
- `sandbox.py`: Subprocess runtime and output tracking.
- `environment.py`: Core reset/step/reward RL loop logic (`TraceFixRLGym`).
- `server/tracefix_rl_environment.py` / `server/app.py`: Map the OpenEnv network interface onto the core environment.
- `inference.py`: Baseline OpenAI-client inference script for evaluating agents.

## Local Development

Install [`uv`](https://github.com/astral-sh/uv) on your system first.

```bash
# Sync dependencies
uv sync

# Run the OpenEnv server on port 7860
uv run --project . server
```

Server endpoints available:

- `POST /reset`
- `POST /step`
- `GET /health`
- `WS /ws`
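
Once the server is up, you can smoke-test it directly over HTTP. The sketch below assumes the default port and a JSON body keyed by `action_type`; the authoritative request schema is `CodeAction` in `models.py`, so treat the payload shape here as illustrative rather than definitive.

```bash
# Liveness check
curl -s http://localhost:7860/health

# Start a new episode (ORIENT phase)
curl -s -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" -d '{}'

# DIAGNOSE: run the task's tests
# (payload shape is an assumption; see CodeAction in models.py for the real schema)
curl -s -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "RUN_TESTS"}}'
```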

## Baseline Scores

Baseline scores are intended to be recorded from the bundled `inference.py` runner against the three validator tasks. The environment intentionally clamps scores to `[0.01, 0.98]`, so benchmark output should be reported with that convention in mind.

| Task | Baseline Score |
| --- | --- |
| `valid_parentheses_wrong_mapping` | Pending first benchmark run |
| `binary_search_off_by_one` | Pending first benchmark run |
| `reverse_string_returns_original` | Pending first benchmark run |

## Docker + Hugging Face Spaces Deployment

The Space runs via Docker. The container is configured to run as a non-root `appuser` (UID `1000`) for Spaces compliance.

### Testing Locally in Docker

```bash
docker build -t tracefix-rl:test -f Dockerfile .
docker run --rm -p 7860:7860 tracefix-rl:test
```

### Deploy to Hugging Face Spaces

This project uses the OpenEnv CLI for seamless Hugging Face Space deployments.

```bash
# Push directly to your specified HF Space
openenv push
```

### Server Pre-validation

Before committing to training, you can validate your deployed Space or local server:

```bash
bash ./pre-val.sh https://<your-space>.hf.space .
```

## Inference & Evaluation (`inference.py`)

The baseline inference runner evaluates agents against the environment using an OpenAI-compatible interface.

**Requirements for Inference:**

- `API_BASE_URL` (defaults to `https://router.huggingface.co/v1`)
- `MODEL_NAME` (defaults to `Qwen/Qwen2.5-72B-Instruct`)
- `HF_TOKEN`

**Usage Flags:**

- `--easy`, `--medium`, `--hard`: Lock the environment to a specific task bucket.
- `--thought`: Send the model's thought token blocks back in the payload to train chain-of-thought capabilities.

Example execution tracking thoughts on medium tasks:

```bash
python inference.py --medium --thought
```
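
For a first run, `HF_TOKEN` is the only variable you must set yourself; the other two fall back to the defaults listed above. A minimal invocation with a placeholder token might look like:

```bash
# Placeholder token; API_BASE_URL and MODEL_NAME use their documented defaults
export HF_TOKEN=hf_xxxxxxxxxxxx

# Evaluate against the easy task bucket only
python inference.py --easy
```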