---
title: TraceFix-RL
emoji: πŸ§‘β€πŸ’»
colorFrom: blue
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
- openenv
- reinforcement-learning
- software-engineering
---
## TraceFix-RL
TraceFix-RL is an OpenEnv-compatible environment designed to teach agent behavior
that looks like real software engineering work. Instead of one-shot answers,
the agent must inspect code, form a hypothesis, run tests, patch the code,
verify outcomes, and only then submit. The loop rewards disciplined debugging
and penalizes random edits, forcing the model to learn an engineering workflow.
## Core Design
- **Action space:** `VIEW_CODE`, `RUN_TESTS`, `REPLACE_LINES`, `UNDO_EDIT`, `RESET_TO_ORIGINAL`, `SUBMIT`
- **Observations:** The full code snapshot, localized edit context, execution output, syntax status, and per-test outcomes.
- **Dense rewards:** `RUN_TESTS` bonus, per-test progress bonus, step-cost penalty, invalid-edit penalties, and a final score clamped to `[0.01, 0.98]` (a sketch follows this list).
- **Curriculum-ready tasks:** Easy, Medium, and Hard buckets that the OpenEnv trainer can sequence, plus a random-task fallback for evaluators.
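The exact shaping constants live in `environment.py`; the sketch below uses illustrative values and hypothetical function names only, to show how per-step shaping typically composes with a clamped final score.

```python
# Illustrative reward shaping; constants and names are assumptions,
# not the values used in environment.py.
def step_reward(ran_tests: bool, newly_passing: int, invalid_edit: bool) -> float:
    reward = -0.01                  # step-cost penalty discourages aimless actions
    if ran_tests:
        reward += 0.02              # small bonus for gathering evidence
    reward += 0.05 * newly_passing  # per-test progress bonus
    if invalid_edit:
        reward -= 0.10              # penalize syntactically invalid patches
    return reward

def final_score(raw: float) -> float:
    # Clamp the episode score into [0.01, 0.98], as described above.
    return min(max(raw, 0.01), 0.98)
```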
## State Machine Training Pattern
The environment prompt in `environment.py` encodes a strict operating pattern the agent is expected to follow:
1. **ORIENT:** Inspect code (`VIEW_CODE`)
2. **DIAGNOSE:** Run tests and read failures (`RUN_TESTS`)
3. **FIX:** Patch one localized region (`REPLACE_LINES`)
4. **VERIFY:** Rerun tests (`RUN_TESTS`)
5. **REPEAT:** Continue until all failures are resolved
6. **SUBMIT:** Finalize only after tests pass
This sequence naturally guides reinforcement learning toward robust planning, controlled editing, and verification behavior.
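As a rough illustration, the loop can be driven over the HTTP endpoints listed under Local Development below; the payload field names (`action_type`, `start_line`, `new_code`, etc.) are assumptions for this sketch, not the exact `models.py` schema.

```python
import requests

BASE = "http://localhost:7860"  # server started with `uv run --project . server`

def step(action_type: str, **fields) -> dict:
    # Field names are illustrative; see models.py for the real action schema.
    return requests.post(f"{BASE}/step", json={"action_type": action_type, **fields}).json()

obs = requests.post(f"{BASE}/reset", json={}).json()   # new task, initial observation
obs = step("VIEW_CODE")                                 # ORIENT: read the buggy snapshot
obs = step("RUN_TESTS")                                 # DIAGNOSE: collect failing tests
obs = step("REPLACE_LINES", start_line=3, end_line=3,   # FIX: patch one localized region
           new_code="    return left")
obs = step("RUN_TESTS")                                 # VERIFY: confirm failures are gone
obs = step("SUBMIT")                                    # SUBMIT: finalize after tests pass
```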
## Task Tiers And Test Structure
The registry in `tasks.py` is a static, curated set of 16 coding challenges:
- **Easy (4 tasks):** Focuses on basic operators, indexing, and simple string/array logic.
- **Medium (6 tasks):** Focuses on recursive behavior, branching correctness, and text normalization edges.
- **Hard (6 tasks):** Focuses on data-structure invariants, bracket mapping, interval merging, and eviction logic.
Every task contains: `name`, `description`, `difficulty`, `bug_type`, `code` (buggy implementation), `solution`, and executable `tests`. All tests are safely run inside isolated sandboxes via `sandbox.py` using `multiprocessing`.
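For orientation, a registry entry might look roughly like the sketch below; the values and the exact representation of `tests` are invented for illustration and do not reproduce a real entry from `tasks.py`.

```python
# Hypothetical registry entry; values are illustrative only.
EXAMPLE_TASK = {
    "name": "list_sum_off_by_one",
    "description": "sum_list skips the last element of the input list.",
    "difficulty": "easy",
    "bug_type": "off_by_one",
    "code": "def sum_list(xs):\n    return sum(xs[:-1])",   # buggy implementation
    "solution": "def sum_list(xs):\n    return sum(xs)",    # reference fix
    "tests": [
        "assert sum_list([1, 2, 3]) == 6",
        "assert sum_list([]) == 0",
    ],
}
```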
## Tech Stack & Project Files
This environment enforces strict typing and uses standard modern tooling:
- **`uv`:** Handles dependency management (see `pyproject.toml`).
- **FastAPI:** Provides the `server.app` integration layer for OpenEnv compliance.
- **Pydantic (v2):** Provides validation for the `models.py` schemas (e.g., `CodeAction`, `CodeObservation`); a model sketch follows this list.
- **OpenEnv config:** `openenv.yaml` specifies that `tracefix_rl` runs the FastAPI app on port `7860`.
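A rough sketch of what the action model could look like in Pydantic v2; the field names and their optionality are assumptions, not the actual `models.py` definitions.

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

# Hypothetical shape of the action model; see models.py for the real definition.
class CodeAction(BaseModel):
    action_type: Literal[
        "VIEW_CODE", "RUN_TESTS", "REPLACE_LINES",
        "UNDO_EDIT", "RESET_TO_ORIGINAL", "SUBMIT",
    ]
    start_line: Optional[int] = Field(default=None, ge=1)  # used by REPLACE_LINES
    end_line: Optional[int] = Field(default=None, ge=1)
    new_code: Optional[str] = None
```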
**File Layout:**
- `models.py` / `context.py`: Domain and schema logic.
- `tasks.py`: Task metadata definitions.
- `sandbox.py`: Subprocess runtime and output tracking.
- `environment.py`: Core RL loop (reset, step, reward) implemented by `TraceFixRLGym`.
- `server/tracefix_rl_environment.py` / `server/app.py`: Maps the OpenEnv network interface onto the core environment.
- `inference.py`: Baseline OpenAI-client inference script to evaluate agents.
## Local Development
You must install [`uv`](https://github.com/astral-sh/uv) on your system.
```bash
# Sync dependencies
uv sync
# Run the OpenEnv server on port 7860
uv run --project . server
```
Server endpoints available:
- `POST /reset`
- `POST /step`
- `GET /health`
- `WS /ws`
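A quick way to sanity-check a running server from Python; the URL assumes the local setup above, an empty `reset` body is an assumption, and the printed response shapes are simply whatever the environment returns.

```python
import requests

base = "http://localhost:7860"
print(requests.get(f"{base}/health").json())            # liveness check
print(requests.post(f"{base}/reset", json={}).json())   # first observation for a new task
```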
## Baseline Scores
Baseline scores will be recorded with the bundled `inference.py` runner against the three validator tasks.
Because the environment intentionally clamps scores to `[0.01, 0.98]`, benchmark output should be read with that
convention in mind.
| Task | Baseline Score |
| --- | --- |
| `valid_parentheses_wrong_mapping` | Pending first benchmark run |
| `binary_search_off_by_one` | Pending first benchmark run |
| `reverse_string_returns_original` | Pending first benchmark run |
## Docker + Hugging Face Spaces Deployment
The Space runs via Docker. For Hugging Face Spaces compliance, the container runs as a non-root `appuser` (UID `1000`).
### Testing Locally in Docker
```bash
docker build -t tracefix-rl:test -f Dockerfile .
docker run --rm -p 7860:7860 tracefix-rl:test
```
### Deploy to Hugging Face Spaces
This project uses the OpenEnv CLI to deploy to Hugging Face Spaces.
```bash
# Push directly to your specified HF Space
openenv push
```
### Server Pre-validation
Before committing to training, you can validate your deployed Space or a locally running server:
```bash
bash ./pre-val.sh https://<your-space>.hf.space .
```
## Inference & Evaluation (`inference.py`)
The baseline inference runner evaluates agents against the environment using an OpenAI-compatible interface.
**Requirements for Inference:**
- `API_BASE_URL` (Defaults to `https://router.huggingface.co/v1`)
- `MODEL_NAME` (Defaults to `Qwen/Qwen2.5-72B-Instruct`)
- `HF_TOKEN`
**Usage Flags:**
- `--easy`, `--medium`, `--hard`: Lock the environment to a specific task bucket.
- `--thought`: Send `<thought>` blocks back in the request payload to train chain-of-thought behavior.
Example run on medium tasks with thought tracking enabled:
```bash
python inference.py --medium --thought
```