# CLAUDE.md - TraceFix-RL (RL_ENV_FINAL)
Current, code-backed notes for assistants working in this repository.
Last updated: 2026-04-08
## Project Status Snapshot
- Repo: `code_reasoner_rl_env`
- Branch: `master`
- Working tree: dirty
  - Modified: `.gitignore`, `inference.py`, `models.py`, `__pycache__/models.cpython-312.pyc`
  - Untracked: `.hfignore`
- Last recorded pre-validation command in terminal:
  - `./pre-val.sh https://sus-human-tracefix-rl.hf.space .`
  - Exit code: `1`
This file describes the current implementation in `RL_ENV_FINAL` only.
## High-Level Architecture
- `environment.py`: core gym-style state machine (`TraceFixRLGym`)
- `server/tracefix_rl_environment.py`: OpenEnv adapter (`Environment` interface)
- `server/app.py`: FastAPI app creation and uvicorn entrypoint
- `models.py`: action/observation schemas (`CodeAction`, `CodeObservation`, `TestResult`)
- `sandbox.py`: isolated code execution + test running + timeout handling
- `tasks.py`: static task registry (easy/medium/hard)
- `context.py`: localized context windowing around last edit
- `client.py`: typed OpenEnv client (`TraceFixRLEnv` / `MyEnv`)
- `inference.py`: baseline agent runner with OpenAI-compatible API
- `openenv.yaml`: OpenEnv runtime metadata (`app: server.app:app`, `port: 7860`)
## Runtime and Entry Points
- Local server via project script:
  - `uv run --project . server`
- Container command in `Dockerfile`:
  - `uvicorn server.app:app --host 0.0.0.0 --port 7860`
- OpenEnv spec points to:
  - `server.app:app`
## Environment Behavior (`environment.py`)
Action space:
- `VIEW_CODE`
- `RUN_TESTS`
- `REPLACE_LINES`
- `UNDO_EDIT`
- `RESET_TO_ORIGINAL`
- `SUBMIT`
Reward constants currently defined:
- `R_STEP_COST = -0.01`
- `R_RUN_TESTS = +0.10`
- `R_PER_NEW_PASS = +0.05`
- `R_SYNTAX_ERROR = -0.10`
- `R_INVALID_LINE = -0.02`
- `R_DESTRUCTIVE_PENALTY = -0.20`
- `R_UNDO_RESET = -0.10`
- `MAX_STEPS = 50`
Episode internals include:
- code snapshotting (`_original_code`, `_edit_history`)
- anti-loop penalty for repeated identical `action_type`
- contextual anchor (`_last_edited_line`) for localized context
- cumulative step-cost tracking (`_accumulated_step_costs`)
Submit scoring model:
- `proportion = passing_tests / total_tests` (or `0` on syntax error)
- `raw_score = proportion - _accumulated_step_costs`
- `final_score = clamp(raw_score, 0.0, 1.0)`
- the same clamped score is used for auto-evaluation when an episode hits the max-step timeout
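
The scoring model above can be sketched in a few lines. This is a minimal illustration, not the code in `environment.py`: the function names are hypothetical, `accumulated_step_costs` is assumed to be stored as a non-negative magnitude, and the zero-test guard is an added assumption.

```python
def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def submit_score(
    passing_tests: int,
    total_tests: int,
    accumulated_step_costs: float,
    had_syntax_error: bool = False,
) -> float:
    """Clamped proportion-minus-step-cost score (hypothetical helper)."""
    # Assumption: an empty test list scores like a syntax error.
    if had_syntax_error or total_tests == 0:
        proportion = 0.0
    else:
        proportion = passing_tests / total_tests
    raw_score = proportion - accumulated_step_costs
    return clamp(raw_score)


# 3 of 4 tests passing after ten steps at a cost of 0.01 each.
print(submit_score(3, 4, accumulated_step_costs=0.10))
# A low pass rate with heavy step costs is clamped to the lower bound.
print(submit_score(1, 4, accumulated_step_costs=0.50))
```

Because of the clamp, a long episode can never push the final score below `0.0`, and a perfect run with no accumulated cost tops out at `1.0`.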
Task sampling policy:
- `training_step == 0`: random from `ALL_TASKS`
- `training_step < 1000`: easy
- `training_step < 5000`: medium
- `training_step >= 5000`: hard
- fallback to first non-empty bucket
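
A sketch of this policy follows, under stated assumptions. `pick_difficulty` and `sample_task` are hypothetical names, the bucket order used for the fallback is assumed to be easy, medium, hard, and drawing uniformly at random inside a bucket is also an assumption.

```python
import random
from typing import Dict, List, Optional

# Assumed fallback order; the real registry may order buckets differently.
DIFFICULTY_ORDER = ("easy", "medium", "hard")


def pick_difficulty(training_step: int) -> Optional[str]:
    """Map a training step to a bucket name; None means 'any task'."""
    if training_step == 0:
        return None
    if training_step < 1000:
        return "easy"
    if training_step < 5000:
        return "medium"
    return "hard"


def sample_task(
    training_step: int,
    tasks_by_difficulty: Dict[str, List[str]],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a task for the given step (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    difficulty = pick_difficulty(training_step)
    if difficulty is None:
        # Step 0: draw from every bucket combined.
        bucket = [
            task
            for name in DIFFICULTY_ORDER
            for task in tasks_by_difficulty.get(name, [])
        ]
    else:
        bucket = tasks_by_difficulty.get(difficulty) or []
        if not bucket:
            # Fall back to the first bucket that has any tasks.
            bucket = next(
                (
                    tasks_by_difficulty[name]
                    for name in DIFFICULTY_ORDER
                    if tasks_by_difficulty.get(name)
                ),
                [],
            )
    if not bucket:
        raise ValueError("no tasks registered")
    return rng.choice(bucket)
```

Passing a seeded `random.Random` makes the draw reproducible, which helps when comparing runs.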
## Schema Notes (`models.py`)
Important: current code uses Pydantic v2-style validation APIs.
- `CodeAction` uses `@model_validator(mode="before")`
- Non-`REPLACE_LINES` actions force `start_line`, `end_line`, `new_code_block` to `None`
- `REPLACE_LINES` enforces required fields and 1-indexed positive range constraints
`model_validator` does not exist in Pydantic v1, so this code will not run against a v1-only install.
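
The validation rules can be mirrored in plain Python so they can be read and tested without Pydantic installed. This is a stand-in, not the `CodeAction` model: `normalize_action` is a hypothetical name, line numbers are assumed to be integers, and the `end_line >= start_line` check is an assumption about what the range constraint means.

```python
from typing import Any, Dict

# Fields that only make sense for REPLACE_LINES.
EDIT_FIELDS = ("start_line", "end_line", "new_code_block")


def normalize_action(data: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the CodeAction rules to a raw dict (hypothetical helper)."""
    action = dict(data)  # do not mutate the caller's dict
    if action.get("action_type") != "REPLACE_LINES":
        # Every other action type has its edit fields forced to None.
        for field in EDIT_FIELDS:
            action[field] = None
        return action

    missing = [field for field in EDIT_FIELDS if action.get(field) is None]
    if missing:
        raise ValueError(f"REPLACE_LINES requires: {', '.join(missing)}")

    start, end = action["start_line"], action["end_line"]
    # Lines are 1-indexed; end >= start is an assumed constraint.
    if start < 1 or end < start:
        raise ValueError("line range must be 1-indexed with end_line >= start_line")
    return action


print(normalize_action({"action_type": "VIEW_CODE", "start_line": 3}))
```

In the real model the same checks run inside a `@model_validator(mode="before")` hook, which sees the raw input before field validation.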
## Sandbox Notes (`sandbox.py`)
`run_code_with_tests(...)` returns a strict 3-tuple:
- `output_str`
- `List[TestResult]`
- `had_syntax_error: bool`
Execution safeguards:
- subprocess isolation via `multiprocessing.Process`
- timeout terminate/kill path
- tail truncation (`MAX_OUTPUT_CHARS = 1000`)
- restricted builtins to block risky operations
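
The timeout path and output truncation might look roughly like the sketch below. It is not the code in `sandbox.py`: `truncate_tail`, `_run_source`, and `run_with_timeout` are hypothetical names, "tail truncation" is assumed to mean keeping the last `MAX_OUTPUT_CHARS` characters, and the builtin restrictions are deliberately left out.

```python
import multiprocessing as mp

MAX_OUTPUT_CHARS = 1000


def truncate_tail(text: str, limit: int = MAX_OUTPUT_CHARS) -> str:
    """Assumption: keep only the last `limit` characters of the output."""
    return text if len(text) <= limit else text[-limit:]


def _run_source(source: str) -> None:
    """Child-process entry point: execute the submitted source."""
    # No builtin restrictions here; this sketch only covers the timeout path.
    exec(source, {"__name__": "__sandbox__"})


def run_with_timeout(source: str, timeout_s: float = 2.0) -> str:
    """Run source in a separate process and stop it if it overruns."""
    proc = mp.Process(target=_run_source, args=(source,))
    proc.start()
    proc.join(timeout_s)
    if not proc.is_alive():
        if proc.exitcode == 0:
            return "finished"
        return f"crashed (exit code {proc.exitcode})"

    # Polite stop first, then a hard kill if the child ignores it.
    proc.terminate()
    proc.join(1.0)
    if proc.is_alive():
        proc.kill()
        proc.join()
    return "timeout"
```

On platforms where `multiprocessing` starts children with `spawn`, call `run_with_timeout` from under an `if __name__ == "__main__":` guard. For example, `run_with_timeout("while True: pass", timeout_s=0.5)` should take the terminate branch.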
## Tasks Registry (`tasks.py`)
- Static hardcoded registry grouped by difficulty
- Exports:
  - `TASKS_BY_DIFFICULTY`
  - `ALL_TASKS`
- Expected total currently: 16 tasks
  - easy: 4
  - medium: 6
  - hard: 6
## OpenEnv Adapter and Client
`server/tracefix_rl_environment.py`:
- Maps optional reset difficulty to `training_step` hints
- Writes `system_prompt` into observation metadata
- Sets observation reward/done from gym step output
`client.py`:
- Sends actions using `model_dump(exclude_none=True)`
- Parses OpenEnv payloads into typed `CodeObservation`
## Inference Runner (`inference.py`)
Key defaults:
- `API_BASE_URL = https://router.huggingface.co/v1`
- `MODEL_NAME = Qwen/Qwen2.5-72B-Instruct`
- `MAX_STEPS = 50`
- `SUCCESS_SCORE_THRESHOLD = 0.99`
- `THINKING_TOKEN_LIMIT = 512`
Behavior:
- Logs in strict sequence: `[START]`, repeated `[STEP]`, then `[END]`
- Uses JSON extraction fallback path from model text
- Falls back to `RUN_TESTS` on parse or validation failure
- Supports `--easy`, `--medium`, `--hard`, `--debug`
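
The JSON extraction and `RUN_TESTS` fallback can be illustrated as follows. This is a sketch under assumptions: `extract_action` is a hypothetical name, the extraction strategy (first decodable object containing `action_type`) is a guess, and schema validation against `CodeAction` is omitted.

```python
import json
from typing import Any, Dict

FALLBACK_ACTION = {"action_type": "RUN_TESTS"}


def extract_action(model_text: str) -> Dict[str, Any]:
    """Return the first JSON object with an action_type, else the fallback."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(model_text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `index`
            # and ignores whatever text follows it.
            candidate, _end = decoder.raw_decode(model_text, index)
        except json.JSONDecodeError:
            continue
        if isinstance(candidate, dict) and "action_type" in candidate:
            return candidate
    # Nothing usable was found: fall back to a safe action.
    return dict(FALLBACK_ACTION)


print(extract_action('I will submit now. {"action_type": "SUBMIT"}'))
print(extract_action("Let me think about this some more."))
```

Falling back to `RUN_TESTS` keeps the episode moving and the log sequence intact when the model emits prose or malformed JSON.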
## Drift and Risk Notes
1. `requirements.txt` currently pins `pydantic==1.10.17`, but code in `models.py` uses v2 APIs (`model_validator`).
2. `pyproject.toml` is the active dependency source for `uv sync`; `requirements.txt` appears stale relative to runtime assumptions.
3. `environment.py` defines `R_SUBMIT_ALL_PASS` and `R_SUBMIT_FAIL`, but submit currently uses clamped proportion-minus-step-cost scoring instead of those constants.
4. `server/tracefix_rl_environment.py` advertises concurrent sessions support, while `create_app(..., max_concurrent_envs=1)` constrains server-level concurrency.
## Practical Checklist Before Validation
1. Confirm dependency source of truth (`pyproject.toml` vs `requirements.txt`) and align Pydantic version expectations.
2. Re-run pre-validation and capture the first failing check/output.
3. Remove tracked cache artifacts from version control if unintended (for example `__pycache__/*.pyc`).
4. Keep stdout format in `inference.py` unchanged, as validator parsing depends on it.
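
For checklist item 1, a quick way to see which Pydantic major version the active environment resolves to is to query the installed package metadata. This is a minimal sketch using only the standard library; `major_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(package: str) -> Optional[int]:
    """Return the installed major version of a package, or None if absent."""
    try:
        return int(version(package).split(".")[0])
    except PackageNotFoundError:
        return None


found = major_version("pydantic")
if found is None:
    print("pydantic is not installed in this environment")
elif found < 2:
    print(f"pydantic major version {found}: models.py needs the v2 API")
else:
    print(f"pydantic major version {found}: compatible with model_validator")
```

Running it inside the environment created by `uv sync` and again inside one built from `requirements.txt` shows whether the two dependency sources disagree.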