# CLAUDE.md - TraceFix-RL (RL_ENV_FINAL)

Current, code-backed notes for assistants working in this repository.

Last updated: 2026-04-08
## Project Status Snapshot

- Repo: `code_reasoner_rl_env`
- Branch: `master`
- Working tree: dirty
  - Modified: `.gitignore`, `inference.py`, `models.py`, `__pycache__/models.cpython-312.pyc`
  - Untracked: `.hfignore`
- Last recorded pre-validation command in terminal:
  - `./pre-val.sh https://sus-human-tracefix-rl.hf.space .`
  - Exit code: `1`

This file describes the current implementation in `RL_ENV_FINAL` only.
## High-Level Architecture

- `environment.py`: core gym-style state machine (`TraceFixRLGym`)
- `server/tracefix_rl_environment.py`: OpenEnv adapter (`Environment` interface)
- `server/app.py`: FastAPI app creation and uvicorn entrypoint
- `models.py`: action/observation schemas (`CodeAction`, `CodeObservation`, `TestResult`)
- `sandbox.py`: isolated code execution + test running + timeout handling
- `tasks.py`: static task registry (easy/medium/hard)
- `context.py`: localized context windowing around last edit
- `client.py`: typed OpenEnv client (`TraceFixRLEnv` / `MyEnv`)
- `inference.py`: baseline agent runner with OpenAI-compatible API
- `openenv.yaml`: OpenEnv runtime metadata (`app: server.app:app`, `port: 7860`)
## Runtime and Entry Points

- Local server via project script:
  - `uv run --project . server`
- Container command in `Dockerfile`:
  - `uvicorn server.app:app --host 0.0.0.0 --port 7860`
- OpenEnv spec points to:
  - `server.app:app`
## Environment Behavior (`environment.py`)

Action space:

- `VIEW_CODE`
- `RUN_TESTS`
- `REPLACE_LINES`
- `UNDO_EDIT`
- `RESET_TO_ORIGINAL`
- `SUBMIT`

Reward constants and step limit currently defined:

- `R_STEP_COST = -0.01`
- `R_RUN_TESTS = +0.10`
- `R_PER_NEW_PASS = +0.05`
- `R_SYNTAX_ERROR = -0.10`
- `R_INVALID_LINE = -0.02`
- `R_DESTRUCTIVE_PENALTY = -0.20`
- `R_UNDO_RESET = -0.10`
- `MAX_STEPS = 50`

Episode internals include:

- code snapshotting (`_original_code`, `_edit_history`)
- anti-loop penalty for repeated identical `action_type`
- contextual anchor (`_last_edited_line`) for localized context
- cumulative step-cost tracking (`_accumulated_step_costs`)

Submit scoring model:

- `proportion = passing_tests / total_tests` (or `0` on syntax error)
- `raw_score = proportion - _accumulated_step_costs`
- `final_score = clamp(raw_score, 0.0, 1.0)`
- the same clamp model applies to the auto-evaluation triggered when `MAX_STEPS` is reached
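
The submit scoring model can be sketched as a small pure function. This is an illustration, not the code in `environment.py`: the function name `submit_score` is hypothetical, it assumes `_accumulated_step_costs` is stored as a non-negative magnitude (so subtracting it lowers the score), and the guard for `total_tests == 0` is an added assumption.

```python
def submit_score(passing_tests: int, total_tests: int,
                 accumulated_step_costs: float,
                 had_syntax_error: bool = False) -> float:
    """Clamp (pass proportion - accumulated step costs) into [0.0, 1.0]."""
    # A syntax error zeroes the proportion; the empty-suite guard is an assumption.
    if had_syntax_error or total_tests == 0:
        proportion = 0.0
    else:
        proportion = passing_tests / total_tests
    # Assumes the accumulated cost is a non-negative magnitude.
    raw_score = proportion - accumulated_step_costs
    return max(0.0, min(1.0, raw_score))


# 3 of 4 tests passing after 10 steps at a cost magnitude of 0.01 each.
print(submit_score(3, 4, accumulated_step_costs=10 * 0.01))
```

Because of the clamp, a long episode with few passing tests bottoms out at `0.0` rather than going negative.
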

Task sampling policy, by `training_step` (rules checked in order):

- `training_step == 0`: random from `ALL_TASKS`
- `training_step < 1000`: easy
- `training_step < 5000`: medium
- `training_step >= 5000`: hard
- fallback to first non-empty bucket
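
A minimal sketch of this sampling policy follows. The function name `sample_task`, the `rng` parameter, and the `"easy"`/`"medium"`/`"hard"` dictionary keys are assumptions for illustration. The notes also do not say how a task is picked inside a bucket, so the uniform random choice here is an assumption too.

```python
import random
from typing import Dict, List

EASY_UNTIL = 1000    # thresholds taken from the notes above
MEDIUM_UNTIL = 5000


def sample_task(training_step: int,
                tasks_by_difficulty: Dict[str, List[str]],
                rng=random) -> str:
    """Pick a task according to the curriculum thresholds."""
    all_tasks = [t for bucket in tasks_by_difficulty.values() for t in bucket]
    if not all_tasks:
        raise ValueError("no tasks registered")
    if training_step == 0:
        return rng.choice(all_tasks)  # step 0: any task
    if training_step < EASY_UNTIL:
        difficulty = "easy"
    elif training_step < MEDIUM_UNTIL:
        difficulty = "medium"
    else:
        difficulty = "hard"
    bucket = tasks_by_difficulty.get(difficulty) or []
    if not bucket:
        # Fallback: first non-empty bucket in dictionary order.
        bucket = next(b for b in tasks_by_difficulty.values() if b)
    return rng.choice(bucket)


tasks = {"easy": ["e1"], "medium": ["m1"], "hard": ["h1"]}
print(sample_task(2500, tasks))
```
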
## Schema Notes (`models.py`)

Important: the current code uses Pydantic v2-style validation APIs.

- `CodeAction` uses `@model_validator(mode="before")`
- Non-`REPLACE_LINES` actions force `start_line`, `end_line`, and `new_code_block` to `None`
- `REPLACE_LINES` enforces required fields and 1-indexed positive range constraints

This code does not run under Pydantic v1, which has no `model_validator`.
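
The normalization rules above can be sketched in plain Python, without Pydantic, as a function over the raw input dictionary. In Pydantic v2, a `mode="before"` validator receives raw input like this, so a body of this shape could sit inside one. The name `normalize_action` is hypothetical, and the `end_line >= start_line` check is an assumption about what "positive range" means in `models.py`.

```python
from typing import Any, Dict

EDIT_FIELDS = ("start_line", "end_line", "new_code_block")


def normalize_action(data: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the CodeAction rules to a raw action dictionary."""
    data = dict(data)  # do not mutate the caller's dictionary
    if data.get("action_type") != "REPLACE_LINES":
        # Edit fields are meaningless for other actions: force them to None.
        for field in EDIT_FIELDS:
            data[field] = None
        return data
    missing = [f for f in EDIT_FIELDS if data.get(f) is None]
    if missing:
        raise ValueError(f"REPLACE_LINES requires: {', '.join(missing)}")
    start, end = data["start_line"], data["end_line"]
    if not (isinstance(start, int) and isinstance(end, int)):
        raise ValueError("start_line and end_line must be integers")
    # 1-indexed lines; end >= start is an assumed reading of the range rule.
    if start < 1 or end < start:
        raise ValueError("line range must satisfy 1 <= start_line <= end_line")
    return data


print(normalize_action({"action_type": "SUBMIT", "start_line": 3}))
```
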
## Sandbox Notes (`sandbox.py`)

`run_code_with_tests(...)` returns a strict 3-tuple:

- `output_str`
- `List[TestResult]`
- `had_syntax_error: bool`

Execution safeguards:

- subprocess isolation via `multiprocessing.Process`
- timeout terminate/kill path
- tail truncation (`MAX_OUTPUT_CHARS = 1000`)
- restricted builtins to block risky operations
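
Two of these safeguards, restricted builtins and tail truncation, can be illustrated in-process. This is a simplified sketch, not the code in `sandbox.py`: it leaves out the `multiprocessing.Process` isolation and the timeout path, the names `run_restricted`, `tail_truncate`, and `BLOCKED` are hypothetical, and the blocked set is a guess at what "risky operations" covers. Restricting builtins alone is not a security boundary.

```python
import builtins
import contextlib
import io
from typing import Tuple

MAX_OUTPUT_CHARS = 1000  # value from the notes above
# Hypothetical block list; the real one in sandbox.py may differ.
BLOCKED = {"open", "exec", "eval", "compile", "__import__", "input", "exit", "quit"}


def tail_truncate(text: str, limit: int = MAX_OUTPUT_CHARS) -> str:
    """Keep only the last `limit` characters, where errors usually appear."""
    return text if len(text) <= limit else text[-limit:]


def run_restricted(source: str) -> Tuple[str, bool]:
    """Run `source` with a reduced builtins table; return (output, had_syntax_error)."""
    try:
        compiled = compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"SyntaxError: {exc}", True
    safe = {n: getattr(builtins, n) for n in dir(builtins) if n not in BLOCKED}
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            exec(compiled, {"__builtins__": safe})
        except Exception as exc:  # report runtime errors as output text
            print(f"{type(exc).__name__}: {exc}")
    return tail_truncate(buffer.getvalue()), False


output, had_syntax_error = run_restricted("import os")
print(had_syntax_error, output.strip())
```

With `__import__` removed from the builtins table, the `import os` statement fails inside the sandboxed code instead of loading the module.
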
## Tasks Registry (`tasks.py`)

- Static hardcoded registry grouped by difficulty
- Exports:
  - `TASKS_BY_DIFFICULTY`
  - `ALL_TASKS`
- Expected total currently: 16 tasks
  - easy: 4
  - medium: 6
  - hard: 6
## OpenEnv Adapter and Client

`server/tracefix_rl_environment.py`:

- Maps optional reset difficulty to `training_step` hints
- Writes `system_prompt` into observation metadata
- Sets observation reward/done from gym step output

`client.py`:

- Sends actions using `model_dump(exclude_none=True)`
- Parses OpenEnv payloads into typed `CodeObservation`
## Inference Runner (`inference.py`)

Key defaults:

- `API_BASE_URL = https://router.huggingface.co/v1`
- `MODEL_NAME = Qwen/Qwen2.5-72B-Instruct`
- `MAX_STEPS = 50`
- `SUCCESS_SCORE_THRESHOLD = 0.99`
- `THINKING_TOKEN_LIMIT = 512`

Behavior:

- Logs in strict sequence: `[START]`, repeated `[STEP]`, then `[END]`
- Has a fallback path that extracts JSON from the model's text output
- Falls back to `RUN_TESTS` on parse or validation failure
- Supports `--easy`, `--medium`, `--hard`, `--debug`
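
The JSON extraction and `RUN_TESTS` fallback can be sketched with the standard library. This is one possible approach, not necessarily what `inference.py` does: the function name `extract_action` is hypothetical, and it assumes an action is a JSON object with an `action_type` key.

```python
import json
from typing import Any, Dict

FALLBACK_ACTION = {"action_type": "RUN_TESTS"}


def extract_action(model_text: str) -> Dict[str, Any]:
    """Return the first JSON object with an action_type, else the fallback."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(model_text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `index` and
            # ignores any trailing prose after it.
            candidate, _end = decoder.raw_decode(model_text, index)
        except json.JSONDecodeError:
            continue
        if isinstance(candidate, dict) and "action_type" in candidate:
            return candidate
    return dict(FALLBACK_ACTION)


print(extract_action('I will submit now: {"action_type": "SUBMIT"} Done.'))
print(extract_action("I am not sure what to do."))
```

Scanning from each `{` lets the parser skip prose or braces that do not start a valid object, so a reply that wraps its JSON in explanation still yields an action.
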
## Drift and Risk Notes

1. `requirements.txt` currently pins `pydantic==1.10.17`, but code in `models.py` uses v2 APIs (`model_validator`).
2. `pyproject.toml` is the active dependency source for `uv sync`; `requirements.txt` appears stale relative to runtime assumptions.
3. `environment.py` defines `R_SUBMIT_ALL_PASS` and `R_SUBMIT_FAIL`, but submit currently uses clamped proportion-minus-step-cost scoring instead of those constants.
4. `server/tracefix_rl_environment.py` advertises support for concurrent sessions, while `create_app(..., max_concurrent_envs=1)` limits the server to one environment at a time.
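
The first drift item can be checked mechanically. The sketch below reads an exact `==` pin from requirements-style text and reports its major version. The helper name `pinned_major` is hypothetical, and it handles only plain `name==version` lines, not ranges, extras, or environment markers.

```python
import re
from typing import Optional


def pinned_major(requirements_text: str, package: str) -> Optional[int]:
    """Return the major version of an exact `package==X.Y.Z` pin, if present."""
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*==\s*(\d+)",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(requirements_text)
    return int(match.group(1)) if match else None


major = pinned_major("pydantic==1.10.17\n", "pydantic")
if major is not None and major < 2:
    print("requirements.txt pins Pydantic v1, but models.py needs v2 APIs")
```
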
## Practical Checklist Before Validation

1. Confirm dependency source of truth (`pyproject.toml` vs `requirements.txt`) and align Pydantic version expectations.
2. Re-run pre-validation and capture the first failing check/output.
3. Remove tracked cache artifacts from version control if unintended (for example `__pycache__/*.pyc`).
4. Keep stdout format in `inference.py` unchanged, as validator parsing depends on it.