	# CLAUDE.md - TraceFix-RL (RL_ENV_FINAL)

	Current, code-backed notes for assistants working in this repository.
	Last updated: 2026-04-08

	## Project Status Snapshot

	- Repo: `code_reasoner_rl_env`
	- Branch: `master`
	- Working tree: dirty
	- Modified: `.gitignore`, `inference.py`, `models.py`, `__pycache__/models.cpython-312.pyc`
	- Untracked: `.hfignore`
	- Last recorded pre-validation command in terminal:
	  - `./pre-val.sh https://sus-human-tracefix-rl.hf.space .`
	  - Exit code: `1`

	This file describes the current implementation in `RL_ENV_FINAL` only.

	## High-Level Architecture

	- `environment.py`: core gym-style state machine (`TraceFixRLGym`)
	- `server/tracefix_rl_environment.py`: OpenEnv adapter (`Environment` interface)
	- `server/app.py`: FastAPI app creation and uvicorn entrypoint
	- `models.py`: action/observation schemas (`CodeAction`, `CodeObservation`, `TestResult`)
	- `sandbox.py`: isolated code execution + test running + timeout handling
	- `tasks.py`: static task registry (easy/medium/hard)
	- `context.py`: localized context windowing around last edit
	- `client.py`: typed OpenEnv client (`TraceFixRLEnv` / `MyEnv`)
	- `inference.py`: baseline agent runner with OpenAI-compatible API
	- `openenv.yaml`: OpenEnv runtime metadata (`app: server.app:app`, `port: 7860`)

	## Runtime and Entry Points

	- Local server via project script:
	  - `uv run --project . server`
	- Container command in `Dockerfile`:
	  - `uvicorn server.app:app --host 0.0.0.0 --port 7860`
	- OpenEnv spec points to:
	  - `server.app:app`

	## Environment Behavior (`environment.py`)

	Action space:

	- `VIEW_CODE`
	- `RUN_TESTS`
	- `REPLACE_LINES`
	- `UNDO_EDIT`
	- `RESET_TO_ORIGINAL`
	- `SUBMIT`

	Reward constants currently defined:

	- `R_STEP_COST = -0.01`
	- `R_RUN_TESTS = +0.10`
	- `R_PER_NEW_PASS = +0.05`
	- `R_SYNTAX_ERROR = -0.10`
	- `R_INVALID_LINE = -0.02`
	- `R_DESTRUCTIVE_PENALTY = -0.20`
	- `R_UNDO_RESET = -0.10`
	- `MAX_STEPS = 50`

	Episode internals include:

	- code snapshotting (`_original_code`, `_edit_history`)
	- anti-loop penalty for repeated identical `action_type`
	- contextual anchor (`_last_edited_line`) for localized context
	- cumulative step-cost tracking (`_accumulated_step_costs`)

	Submit scoring model:

	- `proportion = passing_tests / total_tests` (or `0` on syntax error)
	- `raw_score = proportion - _accumulated_step_costs`
	- `final_score = clamp(raw_score, 0.0, 1.0)`
	- same clamp model used on max-step timeout auto-evaluation
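
The submit scoring model above can be sketched as a small pure function. This is an illustration, not the code in `environment.py`: the function names are hypothetical, `accumulated_step_costs` is assumed to be stored as a non-negative magnitude, and the `total_tests == 0` guard is an added assumption.

```python
def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


def final_score(
    passing_tests: int,
    total_tests: int,
    accumulated_step_costs: float,
    had_syntax_error: bool = False,
) -> float:
    """Proportion of passing tests minus step costs, clamped to [0, 1].

    Assumption: accumulated_step_costs is a non-negative magnitude.
    """
    if had_syntax_error or total_tests == 0:
        # A syntax error scores as zero passing tests; the zero-test
        # guard is an assumption added to avoid division by zero.
        proportion = 0.0
    else:
        proportion = passing_tests / total_tests
    raw_score = proportion - accumulated_step_costs
    return clamp(raw_score)
```

Under these assumptions, an episode that passes 3 of 4 tests after accruing 0.05 in step costs scores roughly 0.70, and any episode whose step costs exceed its pass proportion is clamped to 0.0.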

	Task sampling policy:

	- `training_step == 0`: random from `ALL_TASKS`
	- `training_step < 1000`: easy
	- `training_step < 5000`: medium
	- `training_step >= 5000`: hard
	- if the selected bucket is empty, fall back to the first non-empty bucket
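
A minimal sketch of this curriculum, for orientation only. The function name `sample_task` is hypothetical, the registry is assumed to be a dict of lists keyed by `"easy"`, `"medium"`, and `"hard"`, and uniform random choice within a bucket is an assumption (the notes above do not say how a task is picked inside a bucket).

```python
import random
from typing import Dict, List


def sample_task(training_step: int, tasks_by_difficulty: Dict[str, List[str]],
                rng=random) -> str:
    """Pick a task according to the step-based curriculum thresholds."""
    if training_step == 0:
        # Step 0: sample uniformly across every bucket.
        all_tasks = [t for bucket in tasks_by_difficulty.values() for t in bucket]
        if not all_tasks:
            raise ValueError("task registry is empty")
        return rng.choice(all_tasks)

    if training_step < 1000:
        wanted = "easy"
    elif training_step < 5000:
        wanted = "medium"
    else:
        wanted = "hard"

    bucket = tasks_by_difficulty.get(wanted) or []
    if not bucket:
        # Fallback: first non-empty bucket in registry order.
        bucket = next((b for b in tasks_by_difficulty.values() if b), [])
    if not bucket:
        raise ValueError("task registry is empty")
    return rng.choice(bucket)
```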

	## Schema Notes (`models.py`)

	Important: current code uses Pydantic v2-style validation APIs.

	- `CodeAction` uses `@model_validator(mode="before")`
	- Non-`REPLACE_LINES` actions force `start_line`, `end_line`, `new_code_block` to `None`
	- `REPLACE_LINES` enforces required fields and 1-indexed positive range constraints

	`model_validator` does not exist in Pydantic v1, so `models.py` fails at import time under a v1-only install.
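
The validation rules listed above can be sketched as a plain function over the raw payload dict, which is the kind of input a `mode="before"` validator receives. This is a dependency-free illustration, not the code in `models.py`: the function name is hypothetical, and the `end_line >= start_line` check is an assumed reading of "positive range constraints".

```python
from typing import Any, Dict

REPLACE_ONLY_FIELDS = ("start_line", "end_line", "new_code_block")


def normalize_action_payload(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload with the CodeAction rules applied."""
    out = dict(data)

    if out.get("action_type") != "REPLACE_LINES":
        # Edit-only fields are meaningless for other actions: force them to None.
        for field in REPLACE_ONLY_FIELDS:
            out[field] = None
        return out

    missing = [f for f in REPLACE_ONLY_FIELDS if out.get(f) is None]
    if missing:
        raise ValueError(f"REPLACE_LINES requires: {', '.join(missing)}")

    start, end = out["start_line"], out["end_line"]
    if start < 1:
        raise ValueError("start_line is 1-indexed and must be >= 1")
    if end < start:
        # Assumption: the range must not be inverted.
        raise ValueError("end_line must be >= start_line")
    return out
```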

	## Sandbox Notes (`sandbox.py`)

	`run_code_with_tests(...)` returns a strict 3-tuple:

	- `output_str`
	- `List[TestResult]`
	- `had_syntax_error: bool`

	Execution safeguards:

	- subprocess isolation via `multiprocessing.Process`
	- timeout terminate/kill path
	- tail truncation (`MAX_OUTPUT_CHARS = 1000`)
	- restricted builtins to block risky operations
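
Two of the safeguards above, the timeout terminate/kill path and tail truncation, can be sketched as follows. The helper names `tail_truncate` and `run_with_timeout` are hypothetical, the grace period before `kill()` is an assumption, and restricted builtins are not shown; the real `sandbox.py` may be structured differently.

```python
import multiprocessing as mp
import time

MAX_OUTPUT_CHARS = 1000  # value taken from the notes above


def tail_truncate(text: str, limit: int = MAX_OUTPUT_CHARS) -> str:
    """Keep only the last `limit` characters of the output."""
    if len(text) <= limit:
        return text
    return text[len(text) - limit:]


def run_with_timeout(target, args=(), timeout_s: float = 2.0) -> bool:
    """Run target in a child process; return True if it finished in time."""
    proc = mp.Process(target=target, args=args)
    proc.start()
    proc.join(timeout_s)
    if not proc.is_alive():
        return True
    # Timed out: ask politely first, then force-kill if it is still running.
    proc.terminate()
    proc.join(1.0)  # assumed grace period
    if proc.is_alive():
        proc.kill()
        proc.join()
    return False


if __name__ == "__main__":
    # A short sleep finishes in time; a long one is terminated.
    print(run_with_timeout(time.sleep, (0.01,), timeout_s=2.0))
    print(run_with_timeout(time.sleep, (30,), timeout_s=0.2))
```

Keeping the tail rather than the head of the output preserves the end of a traceback, which is usually where the failing assertion or exception message sits.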

	## Tasks Registry (`tasks.py`)

	- Static hardcoded registry grouped by difficulty
	- Exports:
	  - `TASKS_BY_DIFFICULTY`
	  - `ALL_TASKS`
	- Expected total currently: 16 tasks
	  - easy: 4
	  - medium: 6
	  - hard: 6

	## OpenEnv Adapter and Client

	`server/tracefix_rl_environment.py`:

	- Maps optional reset difficulty to `training_step` hints
	- Writes `system_prompt` into observation metadata
	- Sets observation reward/done from gym step output

	`client.py`:

	- Sends actions using `model_dump(exclude_none=True)`
	- Parses OpenEnv payloads into typed `CodeObservation`

	## Inference Runner (`inference.py`)

	Key defaults:

	- `API_BASE_URL = https://router.huggingface.co/v1`
	- `MODEL_NAME = Qwen/Qwen2.5-72B-Instruct`
	- `MAX_STEPS = 50`
	- `SUCCESS_SCORE_THRESHOLD = 0.99`
	- `THINKING_TOKEN_LIMIT = 512`

	Behavior:

	- Logs in strict sequence: `[START]`, repeated `[STEP]`, then `[END]`
	- Uses JSON extraction fallback path from model text
	- Falls back to `RUN_TESTS` on parse or validation failure
	- Supports `--easy`, `--medium`, `--hard`, `--debug`
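
The extraction-and-fallback behavior above can be sketched as follows. This is one possible approach, not the code in `inference.py`: the function name `extract_action` is hypothetical, and scanning for the first decodable `{...}` object is an assumed strategy. The action names and the `action_type` field come from the sections above.

```python
import json

VALID_ACTIONS = {
    "VIEW_CODE", "RUN_TESTS", "REPLACE_LINES",
    "UNDO_EDIT", "RESET_TO_ORIGINAL", "SUBMIT",
}
FALLBACK_ACTION = {"action_type": "RUN_TESTS"}


def extract_action(model_text: str) -> dict:
    """Parse an action dict out of free-form model text, or fall back."""
    candidate = None
    try:
        # Fast path: the whole reply is valid JSON.
        candidate = json.loads(model_text)
    except json.JSONDecodeError:
        # Fallback path: try to decode a JSON object starting at each "{".
        decoder = json.JSONDecoder()
        for index, char in enumerate(model_text):
            if char != "{":
                continue
            try:
                candidate, _end = decoder.raw_decode(model_text, index)
                break
            except json.JSONDecodeError:
                continue

    if isinstance(candidate, dict) and candidate.get("action_type") in VALID_ACTIONS:
        return candidate
    # Parse or validation failure: default to a harmless action.
    return dict(FALLBACK_ACTION)
```

Falling back to `RUN_TESTS` keeps the episode moving without modifying the code under repair, at the price of one step cost.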

	## Drift and Risk Notes

	1. `requirements.txt` currently pins `pydantic==1.10.17`, but code in `models.py` uses v2 APIs (`model_validator`).
	2. `pyproject.toml` is the active dependency source for `uv sync`; `requirements.txt` appears stale relative to runtime assumptions.
	3. `environment.py` defines `R_SUBMIT_ALL_PASS` and `R_SUBMIT_FAIL`, but submit currently uses clamped proportion-minus-step-cost scoring instead of those constants.
	4. `server/tracefix_rl_environment.py` advertises concurrent sessions support, while `create_app(..., max_concurrent_envs=1)` constrains server-level concurrency.

	## Practical Checklist Before Validation

	1. Confirm dependency source of truth (`pyproject.toml` vs `requirements.txt`) and align Pydantic version expectations.
	2. Re-run pre-validation and capture the first failing check/output.
	3. Remove tracked cache artifacts from version control if unintended (for example `__pycache__/*.pyc`).
	4. Keep stdout format in `inference.py` unchanged, as validator parsing depends on it.