---
title: TraceFix-RL
emoji: 🧑‍💻
colorFrom: blue
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
  - openenv
  - reinforcement-learning
  - software-engineering
---
## TraceFix-RL

TraceFix-RL is an OpenEnv-compatible environment designed to teach agent behavior
that looks like real software engineering work. Instead of one-shot answers,
the agent must inspect code, form a hypothesis, run tests, patch the code,
verify outcomes, and only then submit. The loop rewards disciplined debugging
and penalizes random edits, forcing the model to learn an engineering workflow.
## Core Design

- **Action space:** `VIEW_CODE`, `RUN_TESTS`, `REPLACE_LINES`, `UNDO_EDIT`, `RESET_TO_ORIGINAL`, `SUBMIT` (illustrative payloads are sketched after this list).
- **Observations:** the full code snapshot, localized edit context, execution output, syntax status, and per-test outcomes.
- **Dense rewards:** a `RUN_TESTS` bonus, a per-test progress bonus, a step-cost penalty, invalid-edit penalties, and a final score clamped to `[0.01, 0.98]`.
- **Curriculum-ready tasks:** Easy, Medium, and Hard buckets that the OpenEnv trainer can sequence, plus a random fallback for evaluators.
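As a rough illustration of how those actions might look on the wire, the snippet below writes each one as a JSON-style payload. The field names (`action_type`, `start_line`, `end_line`, `new_code`) are assumptions for illustration, not the authoritative `CodeAction` schema in `models.py`.

```python
# Hypothetical action payloads -- field names are illustrative, not the
# authoritative CodeAction schema from models.py.
view_code = {"action_type": "VIEW_CODE"}
run_tests = {"action_type": "RUN_TESTS"}
replace_lines = {
    "action_type": "REPLACE_LINES",
    "start_line": 12,            # first line to replace (inclusive)
    "end_line": 14,              # last line to replace (inclusive)
    "new_code": "    return left",
}
undo_edit = {"action_type": "UNDO_EDIT"}
submit = {"action_type": "SUBMIT"}
```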
## State Machine Training Pattern

The environment prompt in `environment.py` encodes a strict operating pattern the agent is expected to follow:

1. **ORIENT:** Inspect the code (`VIEW_CODE`)
2. **DIAGNOSE:** Run the tests and read the failures (`RUN_TESTS`)
3. **FIX:** Patch one localized region (`REPLACE_LINES`)
4. **VERIFY:** Rerun the tests (`RUN_TESTS`)
5. **REPEAT:** Continue until all failures are resolved
6. **SUBMIT:** Finalize only after the tests pass

This sequence naturally guides reinforcement learning toward robust planning, controlled editing, and verification behavior.
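A minimal client-side sketch of that loop over the HTTP endpoints listed under Local Development. The request/response field names (`action`, `action_type`, `observation`, `all_tests_passed`) are assumptions, and the edit chosen inside the loop is a stub where a real policy would go.

```python
import requests

BASE = "http://localhost:7860"  # or your deployed Space URL


def step(action: dict) -> dict:
    """POST one action to the environment and return the JSON response."""
    return requests.post(f"{BASE}/step", json={"action": action}).json()


def tests_pass(state: dict) -> bool:
    """Placeholder check over the per-test outcomes in the observation."""
    return state.get("observation", {}).get("all_tests_passed", False)


# ORIENT: start an episode and inspect the buggy code
state = requests.post(f"{BASE}/reset", json={}).json()
state = step({"action_type": "VIEW_CODE"})

# DIAGNOSE: run the tests and read the failures
state = step({"action_type": "RUN_TESTS"})

# FIX + VERIFY: patch one localized region, rerun the tests, repeat until green.
# The edit below is a stub; a real agent would derive it from the observation.
for _ in range(10):  # cap the demo episode length
    if tests_pass(state):
        break
    edit = {"start_line": 1, "end_line": 1, "new_code": "# agent edit here"}
    state = step({"action_type": "REPLACE_LINES", **edit})
    state = step({"action_type": "RUN_TESTS"})

# SUBMIT only after verification
state = step({"action_type": "SUBMIT"})
```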
## Task Tiers and Test Structure

The registry in `tasks.py` is a static, curated set of 16 coding challenges:

- **Easy (4 tasks):** basic operators, indexing, and simple string/array logic.
- **Medium (6 tasks):** recursive behavior, branching correctness, and text-normalization edge cases.
- **Hard (6 tasks):** data-structure invariants, bracket mapping, interval merging, and eviction logic.

Every task contains `name`, `description`, `difficulty`, `bug_type`, `code` (the buggy implementation), `solution`, and executable `tests`. All tests run inside isolated sandboxes via `sandbox.py`, using `multiprocessing`.
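For orientation, a registry entry could look roughly like the sketch below. Every value is invented for illustration; the real entries in `tasks.py` may use a different structure (e.g., dataclasses) and a different test format.

```python
# Hypothetical registry entry -- every value below is invented for illustration.
example_task = {
    "name": "sum_list_wrong_operator",
    "description": "sum_list multiplies the running total instead of adding to it.",
    "difficulty": "easy",
    "bug_type": "wrong_operator",
    "code": (
        "def sum_list(xs):\n"
        "    total = 0\n"
        "    for x in xs:\n"
        "        total *= x   # BUG: should be +=\n"
        "    return total\n"
    ),
    "solution": (
        "def sum_list(xs):\n"
        "    total = 0\n"
        "    for x in xs:\n"
        "        total += x\n"
        "    return total\n"
    ),
    "tests": [
        "assert sum_list([1, 2, 3]) == 6",
        "assert sum_list([]) == 0",
    ],
}
```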
## Tech Stack & Project Files

The environment enforces strict typing and uses standard modern tooling:

- **`uv`:** dependency management (see `pyproject.toml`).
- **FastAPI:** the `server.app` integration layer for OpenEnv compliance.
- **Pydantic (v2):** validation for the schemas in `models.py` (e.g., `CodeAction`, `CodeObservation`); see the sketch after this list.
- **OpenEnv config:** see `openenv.yaml`, which specifies `tracefix_rl` and runs the FastAPI app on port `7860`.
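A minimal sketch of what a Pydantic v2 action model of this kind could look like; the real `CodeAction` in `models.py` may declare different fields and validators.

```python
from enum import Enum

from pydantic import BaseModel, Field


class ActionType(str, Enum):
    VIEW_CODE = "VIEW_CODE"
    RUN_TESTS = "RUN_TESTS"
    REPLACE_LINES = "REPLACE_LINES"
    UNDO_EDIT = "UNDO_EDIT"
    RESET_TO_ORIGINAL = "RESET_TO_ORIGINAL"
    SUBMIT = "SUBMIT"


class CodeAction(BaseModel):
    """Illustrative shape only; the real CodeAction in models.py may differ."""

    action_type: ActionType
    start_line: int | None = Field(default=None, ge=1)  # used by REPLACE_LINES
    end_line: int | None = Field(default=None, ge=1)    # used by REPLACE_LINES
    new_code: str | None = None                         # replacement source text
```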
**File Layout:**

- `models.py` / `context.py`: domain and schema logic.
- `tasks.py`: task metadata definitions.
- `sandbox.py`: subprocess runtime and output tracking.
- `environment.py`: the core reset/step/reward RL loop (`TraceFixRLGym`).
- `server/tracefix_rl_environment.py` / `server/app.py`: map the OpenEnv network interface onto the core environment.
- `inference.py`: baseline OpenAI-client inference script for evaluating agents.
## Local Development

Install [`uv`](https://github.com/astral-sh/uv) on your system, then:

```bash
# Sync dependencies
uv sync

# Run the OpenEnv server on port 7860
uv run --project . server
```
Available server endpoints (see the smoke check after this list):

- `POST /reset`
- `POST /step`
- `GET /health`
- `WS /ws`
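A quick smoke check against those endpoints, assuming JSON request and response bodies:

```python
import requests

BASE = "http://localhost:7860"

# Health probe -- should return 200 once the server is up
print(requests.get(f"{BASE}/health").status_code)

# Start an episode and list the top-level keys of the first response
print(sorted(requests.post(f"{BASE}/reset", json={}).json().keys()))
```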
## Baseline Scores

Baseline scores are intended to be recorded with the bundled `inference.py` runner against the three validator tasks.
The environment intentionally clamps scores to the interval `[0.01, 0.98]`, so benchmark output should be reported with that convention in mind.
| Task | Baseline Score |
| --- | --- |
| `valid_parentheses_wrong_mapping` | Pending first benchmark run |
| `binary_search_off_by_one` | Pending first benchmark run |
| `reverse_string_returns_original` | Pending first benchmark run |
## Docker + Hugging Face Spaces Deployment

The Space runs via Docker. The container runs as a non-root `appuser` (UID `1000`) for Spaces compliance.
### Testing Locally in Docker

```bash
docker build -t tracefix-rl:test -f Dockerfile .
docker run --rm -p 7860:7860 tracefix-rl:test
```
### Deploy to Hugging Face Spaces

This project uses the OpenEnv CLI for seamless Hugging Face Space deployments.

```bash
# Push directly to your specified HF Space
openenv push
```
### Server Pre-validation

Before committing to training, you can validate your deployed server or local Space:

```bash
bash ./pre-val.sh https://<your-space>.hf.space .
```
## Inference & Evaluation (`inference.py`)

The baseline inference runner evaluates agents against the environment through an OpenAI-compatible interface.

**Requirements for inference** (see the client sketch after this list):

- `API_BASE_URL` (defaults to `https://router.huggingface.co/v1`)
- `MODEL_NAME` (defaults to `Qwen/Qwen2.5-72B-Instruct`)
- `HF_TOKEN`
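Assuming these are read as environment variables, they plausibly map onto an OpenAI-compatible client as in the sketch below; `inference.py` itself may wire them differently.

```python
import os

from openai import OpenAI

# Assumed mapping of the variables above -- inference.py may differ.
client = OpenAI(
    base_url=os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
    api_key=os.environ["HF_TOKEN"],
)

response = client.chat.completions.create(
    model=os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
    messages=[{"role": "user", "content": "Reply with the single token RUN_TESTS."}],
)
print(response.choices[0].message.content)
```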
**Usage Flags:**

- `--easy`, `--medium`, `--hard`: lock the environment to a specific task bucket.
- `--thought`: send `<thought>` token blocks back in the payload to train chain-of-thought capabilities.

Example run with thought tracking on medium tasks:
```bash
python inference.py --medium --thought
```