# CLAUDE.md - TraceFix-RL (RL_ENV_FINAL)

Current, code-backed notes for assistants working in this repository.

Last updated: 2026-04-08

## Project Status Snapshot

- Repo: `code_reasoner_rl_env`
- Branch: `master`
- Working tree: dirty
  - Modified: `.gitignore`, `inference.py`, `models.py`, `__pycache__/models.cpython-312.pyc`
  - Untracked: `.hfignore`
- Last recorded pre-validation command in terminal:
  - `./pre-val.sh https://sus-human-tracefix-rl.hf.space .`
  - Exit code: `1`

This file describes the current implementation in `RL_ENV_FINAL` only.

## High-Level Architecture

- `environment.py`: core gym-style state machine (`TraceFixRLGym`)
- `server/tracefix_rl_environment.py`: OpenEnv adapter (`Environment` interface)
- `server/app.py`: FastAPI app creation and uvicorn entrypoint
- `models.py`: action/observation schemas (`CodeAction`, `CodeObservation`, `TestResult`)
- `sandbox.py`: isolated code execution + test running + timeout handling
- `tasks.py`: static task registry (easy/medium/hard)
- `context.py`: localized context windowing around last edit
- `client.py`: typed OpenEnv client (`TraceFixRLEnv` / `MyEnv`)
- `inference.py`: baseline agent runner with OpenAI-compatible API
- `openenv.yaml`: OpenEnv runtime metadata (`app: server.app:app`, `port: 7860`)

## Runtime and Entry Points

- Local server via project script:
  - `uv run --project . server`
- Container command in `Dockerfile`:
  - `uvicorn server.app:app --host 0.0.0.0 --port 7860`
- OpenEnv spec points to:
  - `server.app:app`

## Environment Behavior (`environment.py`)

Action space:

- `VIEW_CODE`
- `RUN_TESTS`
- `REPLACE_LINES`
- `UNDO_EDIT`
- `RESET_TO_ORIGINAL`
- `SUBMIT`

Reward constants currently defined:

- `R_STEP_COST = -0.01`
- `R_RUN_TESTS = +0.10`
- `R_PER_NEW_PASS = +0.05`
- `R_SYNTAX_ERROR = -0.10`
- `R_INVALID_LINE = -0.02`
- `R_DESTRUCTIVE_PENALTY = -0.20`
- `R_UNDO_RESET = -0.10`
- `MAX_STEPS = 50`

Episode internals include:

- code snapshotting (`_original_code`, `_edit_history`)
- anti-loop penalty for repeated identical `action_type`
- contextual anchor (`_last_edited_line`) for localized context
- cumulative step-cost tracking (`_accumulated_step_costs`)

Submit scoring model:

- `proportion = passing_tests / total_tests` (or `0` on syntax error)
- `raw_score = proportion - _accumulated_step_costs`
- `final_score = clamp(raw_score, 0.0, 1.0)`
- same clamp model used on max-step timeout auto-evaluation

Task sampling policy:

- `training_step == 0`: random from `ALL_TASKS`
- `< 1000`: easy
- `< 5000`: medium
- `>= 5000`: hard
- fallback to first non-empty bucket

## Schema Notes (`models.py`)

Important: current code uses Pydantic v2-style validation APIs.

- `CodeAction` uses `@model_validator(mode="before")`
- Non-`REPLACE_LINES` actions force `start_line`, `end_line`, `new_code_block` to `None`
- `REPLACE_LINES` enforces required fields and 1-indexed positive range constraints

This is not compatible with Pydantic v1-only assumptions.
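
As an illustration of the submit scoring model and task sampling policy listed under Environment Behavior, here is a minimal standalone sketch. It is not the code in `environment.py`: the function names `clamp`, `final_score`, and `difficulty_for_step` are hypothetical, the sketch assumes `_accumulated_step_costs` is stored as a non-negative magnitude, and the guard for zero tests is an added assumption. The fallback to the first non-empty bucket is omitted.

```python
from typing import Optional


def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def final_score(
    passing_tests: int,
    total_tests: int,
    accumulated_step_costs: float,
    had_syntax_error: bool = False,
) -> float:
    """Clamped proportion-minus-step-cost score, as described in the notes.

    Assumption: accumulated_step_costs is a non-negative magnitude.
    Assumption: an empty test list scores 0 (not stated in the notes).
    """
    if had_syntax_error or total_tests == 0:
        proportion = 0.0
    else:
        proportion = passing_tests / total_tests
    raw_score = proportion - accumulated_step_costs
    return clamp(raw_score, 0.0, 1.0)


def difficulty_for_step(training_step: int) -> Optional[str]:
    """Map a training step to a difficulty bucket.

    Returns None for step 0, meaning "sample from all tasks".
    """
    if training_step == 0:
        return None
    if training_step < 1000:
        return "easy"
    if training_step < 5000:
        return "medium"
    return "hard"
```

Under these assumptions, an episode that passes 3 of 4 tests after accumulating 0.05 in step costs scores roughly 0.7, and any submit with a syntax error scores 0 regardless of step costs.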
## Sandbox Notes (`sandbox.py`)

`run_code_with_tests(...)` returns a strict 3-tuple:

- `output_str`
- `List[TestResult]`
- `had_syntax_error: bool`

Execution safeguards:

- subprocess isolation via `multiprocessing.Process`
- timeout terminate/kill path
- tail truncation (`MAX_OUTPUT_CHARS = 1000`)
- restricted builtins to block risky operations

## Tasks Registry (`tasks.py`)

- Static hardcoded registry grouped by difficulty
- Exports:
  - `TASKS_BY_DIFFICULTY`
  - `ALL_TASKS`
- Expected total currently: 16 tasks
  - easy: 4
  - medium: 6
  - hard: 6

## OpenEnv Adapter and Client

`server/tracefix_rl_environment.py`:

- Maps optional reset difficulty to `training_step` hints
- Writes `system_prompt` into observation metadata
- Sets observation reward/done from gym step output

`client.py`:

- Sends actions using `model_dump(exclude_none=True)`
- Parses OpenEnv payloads into typed `CodeObservation`

## Inference Runner (`inference.py`)

Key defaults:

- `API_BASE_URL = https://router.huggingface.co/v1`
- `MODEL_NAME = Qwen/Qwen2.5-72B-Instruct`
- `MAX_STEPS = 50`
- `SUCCESS_SCORE_THRESHOLD = 0.99`
- `THINKING_TOKEN_LIMIT = 512`

Behavior:

- Logs in strict sequence: `[START]`, repeated `[STEP]`, then `[END]`
- Uses JSON extraction fallback path from model text
- Falls back to `RUN_TESTS` on parse or validation failure
- Supports `--easy`, `--medium`, `--hard`, `--debug`

## Drift and Risk Notes

1. `requirements.txt` currently pins `pydantic==1.10.17`, but code in `models.py` uses v2 APIs (`model_validator`).
2. `pyproject.toml` is the active dependency source for `uv sync`; `requirements.txt` appears stale relative to runtime assumptions.
3. `environment.py` defines `R_SUBMIT_ALL_PASS` and `R_SUBMIT_FAIL`, but submit currently uses clamped proportion-minus-step-cost scoring instead of those constants.
4. `server/tracefix_rl_environment.py` advertises concurrent sessions support, while `create_app(..., max_concurrent_envs=1)` constrains server-level concurrency.
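
The Pydantic drift in item 1 could be detected mechanically. The sketch below is a hypothetical helper, not something in this repository: it scans requirements-style text for an exact `==` pin and reports whether the pinned major version is below 2. The function names are assumptions, and it only handles simple `name==x.y.z` lines (no extras, ranges, or environment markers).

```python
import re
from typing import Optional


def pinned_major_version(requirements_text: str, package: str) -> Optional[int]:
    """Return the major version of an exact `package==X.Y...` pin, or None."""
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*==\s*(\d+)\.",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(requirements_text)
    return int(match.group(1)) if match else None


def pydantic_pin_conflicts_with_v2(requirements_text: str) -> bool:
    """True when pydantic is pinned to a major version below 2."""
    major = pinned_major_version(requirements_text, "pydantic")
    return major is not None and major < 2


if __name__ == "__main__":
    sample = "fastapi==0.110.0\npydantic==1.10.17\n"  # illustrative content only
    print(pydantic_pin_conflicts_with_v2(sample))
```

A check like this could run before validation against the real `requirements.txt`, but whether that file should be fixed or removed depends on which dependency source the project treats as authoritative.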
## Practical Checklist Before Validation

1. Confirm dependency source of truth (`pyproject.toml` vs `requirements.txt`) and align Pydantic version expectations.
2. Re-run pre-validation and capture the first failing check/output.
3. Remove tracked cache artifacts from version control if unintended (for example `__pycache__/*.pyc`).
4. Keep stdout format in `inference.py` unchanged, as validator parsing depends on it.
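
To guard item 4, a quick local check of the log ordering may help before re-running the validator. This is a rough sketch under stated assumptions: the function name `follows_log_sequence` is hypothetical, tagged lines are assumed to start with `[START]`, `[STEP]`, or `[END]`, untagged lines are ignored, and zero `[STEP]` lines are tolerated. The validator's real parsing rules are not documented here and may be stricter.

```python
from typing import Iterable, List

LOG_TAGS = ("[START]", "[STEP]", "[END]")


def follows_log_sequence(lines: Iterable[str]) -> bool:
    """Check that tagged lines appear as [START], [STEP]*, [END]."""
    tags: List[str] = []
    for line in lines:
        stripped = line.strip()
        for tag in LOG_TAGS:
            if stripped.startswith(tag):
                tags.append(tag)
                break
    # Need at least the opening and closing markers.
    if len(tags) < 2:
        return False
    return (
        tags[0] == "[START]"
        and tags[-1] == "[END]"
        and all(tag == "[STEP]" for tag in tags[1:-1])
    )
```

Captured stdout from a run of `inference.py` can be split into lines and passed to this function; a `False` result suggests the output ordering changed.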