Spaces: SolusOps / tracefix_rl


databoysu committed on Apr 8

Commit 2e11c6a (1 parent: 9882200)

TraceFix-RL v1


Files changed (15)

.gitignore +2 -1
CLAUDE.md +2 -2
README.md +41 -5
__init__.py +3 -9
client.py +4 -62
context.py +3 -56
environment.py +10 -120
inference.py +8 -8
models.py +1 -1
openenv.yaml +1 -1
pyproject.toml +6 -6
sandbox.py +0 -41
server/__init__.py +3 -9
server/app.py +6 -13
server/{swe_gym_environment.py → tracefix_rl_environment.py} +5 -21

.gitignore CHANGED

@@ -2,4 +2,5 @@
 .agents
 .env
 uv.lock
-claude.md

 .agents
 .env
 uv.lock
+claude.md
+__pycache__/

CLAUDE.md CHANGED

@@ -1,4 +1,4 @@
-# CLAUDE.md — Python Debugging Gym (RL_ENV)
 Codebase knowledge for AI assistants. Read before making changes.
@@ -259,7 +259,7 @@ Config from `os.getenv`:
 **Exact stdout log format (regex-parsed by validation judge):**
 ```
-[START] task=<task_name> env=PythonDebuggingGym model=<model_name>
 [STEP] step=<n> action=<action_type> reward=<r.rr> done=<true|false> error=<msg|null>
 [END] success=<true|false> steps=<n> score=<s.sss> rewards=<r1,r2,...,rn>
 ```

+# CLAUDE.md — TraceFix-RL
 Codebase knowledge for AI assistants. Read before making changes.
 **Exact stdout log format (regex-parsed by validation judge):**
 ```
+[START] task=<task_name> env=TraceFixRL model=<model_name>
 [STEP] step=<n> action=<action_type> reward=<r.rr> done=<true|false> error=<msg|null>
 [END] success=<true|false> steps=<n> score=<s.sss> rewards=<r1,r2,...,rn>
 ```
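
Because the log format above is regex-parsed, a small parser makes the contract concrete. The following is a minimal sketch only: the patterns and the `parse_log_line` helper are hypothetical and are not taken from the repository or from the validation judge, whose actual regexes are not shown in this commit.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns derived from the documented format; the judge's real
# regexes are not part of this diff.
START_RE = re.compile(
    r"^\[START\] task=(?P<task>\S+) env=(?P<env>\S+) model=(?P<model>\S+)$"
)
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>-?\d+\.\d{2}) done=(?P<done>true|false) error=(?P<error>.*)$"
)
END_RE = re.compile(
    r"^\[END\] success=(?P<success>true|false) steps=(?P<steps>\d+) "
    r"score=(?P<score>\d+\.\d{3}) rewards=(?P<rewards>[-\d.,]*)$"
)


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Return the captured fields plus a 'kind' key, or None if no pattern matches."""
    for kind, pattern in (("START", START_RE), ("STEP", STEP_RE), ("END", END_RE)):
        match = pattern.match(line.strip())
        if match:
            return {"kind": kind, **match.groupdict()}
    return None
```

Under these assumed patterns, a `[START]` line that still carries the old `env=PythonDebuggingGym` value parses without error, so a consumer that cares about the rename would need to compare the captured `env` field against `TraceFixRL` explicitly.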

README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title: SWE-Gym - Software Engineer Gym
 emoji: 🧑‍💻
 colorFrom: blue
 colorTo: cyan
@@ -13,11 +13,13 @@ tags:
   - software-engineering
 ---
-# SWE-Gym - Software Engineer Gym
-SWE-Gym is an OpenEnv-compatible RL environment where an agent must debug broken
-Python code by iteratively inspecting source, running tests, editing lines, and
-submitting once all tests pass.
 ## Core Design
@@ -29,6 +31,40 @@ submitting once all tests pass.
 - Curriculum-ready task sampling:
 easy/medium/hard buckets with safe random fallback for evaluator runs.
 ## Environment Files
 - `models.py`: action/observation schemas

 ---
+title: TraceFix-RL
 emoji: 🧑‍💻
 colorFrom: blue
 colorTo: cyan
   - software-engineering
 ---
+# TraceFix-RL
+TraceFix-RL is an OpenEnv-compatible environment designed to teach agent behavior
+that looks like real software engineering work. Instead of one-shot answers,
+the agent must inspect code, form a hypothesis, run tests, patch the code,
+verify outcomes, and only then submit. The loop rewards disciplined debugging
+and penalizes random edits, forcing the model to learn an engineering workflow.
 ## Core Design
 - Curriculum-ready task sampling:
 easy/medium/hard buckets with safe random fallback for evaluator runs.
+## State Machine Training Pattern
+The environment prompt in `environment.py` encodes a fixed operating pattern
+the agent is expected to follow:
+1. ORIENT: inspect code (`VIEW_CODE`)
+2. DIAGNOSE: run tests and read failures (`RUN_TESTS`)
+3. FIX: patch one region (`REPLACE_LINES`)
+4. VERIFY: rerun tests (`RUN_TESTS`)
+5. REPEAT: continue until all failures are resolved
+6. SUBMIT: finalize only after tests pass
+This structure is intentional: the environment trains planning, controlled
+editing, and verification behavior, not just raw code generation.
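
The six phases can be read as a pattern over action names. The sketch below is illustrative and not part of the environment: `follows_workflow` is a hypothetical helper that checks whether a recorded action trace has the shape `VIEW_CODE`, `RUN_TESTS`, zero or more (`REPLACE_LINES`, `RUN_TESTS`) pairs, then `SUBMIT`. It looks only at ordering and cannot tell whether the tests actually passed before `SUBMIT`.

```python
from typing import List


def follows_workflow(actions: List[str]) -> bool:
    """Check ordering only: ORIENT, DIAGNOSE, (FIX, VERIFY)*, SUBMIT."""
    # ORIENT and DIAGNOSE must open the trace, SUBMIT must close it.
    if len(actions) < 3 or actions[0] != "VIEW_CODE" or actions[1] != "RUN_TESTS":
        return False
    if actions[-1] != "SUBMIT":
        return False
    # Everything in between must be FIX/VERIFY pairs.
    middle = actions[2:-1]
    if len(middle) % 2 != 0:
        return False
    return all(
        middle[i] == "REPLACE_LINES" and middle[i + 1] == "RUN_TESTS"
        for i in range(0, len(middle), 2)
    )
```

A checker like this could be used offline to measure how often a trained agent follows the intended loop; the diff does not show the environment rejecting traces that deviate from it.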
+## Task Tiers And Test Structure
+Tasks are organized in `tasks.py` into three tiers.
+- Easy: 4 tasks, each with 4 unit tests.
+  Focus: basic operators, indexing, and simple string/array logic.
+- Medium: 6 tasks, each with 4 unit tests.
+  Focus: recursive behavior, branching correctness, and text normalization edge cases.
+- Hard: 6 tasks, each with 3-4 unit tests.
+  Focus: data-structure invariants, eviction/promotion logic, bracket mapping, and interval merging edge behavior.
+Every task follows the same schema:
+- `name`, `description`, `difficulty`, `bug_type`
+- `code`: buggy implementation (line list)
+- `solution`: reference implementation
+- `tests`: callable assertions executed in the sandbox
+This gives consistent training signals while scaling complexity across tiers.
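
As a rough illustration of the schema above, a task entry might look like the following. This is a sketch under assumptions: the example task, the `missing_fields` and `count_passing` helpers, the shape of `solution` (a line list here), and the test signature (a callable that receives the executed namespace) are all hypothetical, since `tasks.py` is not part of this diff. The plain `exec` call stands in for the real sandbox.

```python
from typing import Any, Callable, Dict, List

REQUIRED_FIELDS = (
    "name", "description", "difficulty", "bug_type", "code", "solution", "tests",
)


def _test_adds(namespace: Dict[str, Any]) -> None:
    # Assumed test signature: receives the namespace produced by running the code.
    assert namespace["add"](2, 3) == 5


# Hypothetical task, not one of the 16 tasks in the repository.
EXAMPLE_TASK: Dict[str, Any] = {
    "name": "add_two_numbers",
    "description": "add(a, b) should return the sum of its arguments.",
    "difficulty": "easy",
    "bug_type": "wrong_operator",
    "code": ["def add(a, b):", "    return a - b"],      # buggy line list
    "solution": ["def add(a, b):", "    return a + b"],  # reference implementation
    "tests": [_test_adds],
}


def missing_fields(task: Dict[str, Any]) -> List[str]:
    """List the schema fields a task dict does not define."""
    return [field for field in REQUIRED_FIELDS if field not in task]


def count_passing(lines: List[str], tests: List[Callable[[Dict[str, Any]], None]]) -> int:
    """Execute the joined lines (unsandboxed, for illustration) and count passing tests."""
    namespace: Dict[str, Any] = {}
    exec("\n".join(lines), namespace)
    passed = 0
    for test in tests:
        try:
            test(namespace)
            passed += 1
        except AssertionError:
            pass
    return passed
```

Running `count_passing` on `code` and then on `solution` shows the fail-to-pass transition that the environment rewards per test.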
 ## Environment Files
 - `models.py`: action/observation schemas

__init__.py CHANGED

@@ -1,18 +1,12 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-"""SWE-Gym OpenEnv package."""
-from .client import MyEnv, SWEGymEnv
 from .models import CodeAction, CodeObservation, TestResult
 __all__ = [
     "CodeAction",
     "CodeObservation",
     "TestResult",
-    "SWEGymEnv",
     "MyEnv",
 ]

+"""TraceFix-RL OpenEnv package."""
+from .client import MyEnv, TraceFixRLEnv
 from .models import CodeAction, CodeObservation, TestResult
 __all__ = [
     "CodeAction",
     "CodeObservation",
     "TestResult",
+    "TraceFixRLEnv",
     "MyEnv",
 ]

client.py CHANGED

@@ -1,10 +1,4 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-"""Client for the SWE-Gym OpenEnv environment."""
 from typing import Dict
@@ -18,57 +12,15 @@ except ImportError:
     from models import CodeAction, CodeObservation, TestResult
-class SWEGymEnv(
     EnvClient[CodeAction, CodeObservation, State]
 ):
-    """
-    Client for the SWE-Gym environment.
-    This client maintains a persistent WebSocket connection to the environment server,
-    enabling efficient multi-step interactions with lower latency.
-    Each client instance has its own dedicated environment session on the server.
-    Example:
-        >>> # Connect to a running server
-        >>> with SWEGymEnv(base_url="http://localhost:7860") as client:
-        ...     result = client.reset()
-        ...     print(result.observation.echoed_message)
-        ...
-        ...     result = client.step(MyAction(message="Hello!"))
-        ...     print(result.observation.echoed_message)
-    Example with Docker:
-        >>> # Automatically start container and connect
-        >>> client = SWEGymEnv.from_docker_image("swe-gym:latest")
-        >>> try:
-        ...     result = client.reset()
-        ...     result = client.step(MyAction(message="Test"))
-        ... finally:
-        ...     client.close()
-    """
     def _step_payload(self, action: CodeAction) -> Dict:
-        """
-        Convert MyAction to JSON payload for step message.
-        Args:
-            action: MyAction instance
-        Returns:
-            Dictionary representation suitable for JSON encoding
-        """
         return action.model_dump(exclude_none=True)
     def _parse_result(self, payload: Dict) -> StepResult[CodeObservation]:
-        """
-        Parse server response into StepResult[CodeObservation].
-        Args:
-            payload: JSON response data from server
-        Returns:
-            StepResult with MyObservation
-        """
         obs_data = payload.get("observation", {})
         observation = CodeObservation(
             code_lines=obs_data.get("code_lines", []),
@@ -94,20 +46,10 @@ class SWEGymEnv(
         )
     def _parse_state(self, payload: Dict) -> State:
-        """
-        Parse server response into State object.
-        Args:
-            payload: JSON response from state request
-        Returns:
-            State object with episode_id and step_count
-        """
         return State(
             episode_id=payload.get("episode_id"),
             step_count=payload.get("step_count", 0),
         )
-# Backward-compatible alias for older code paths.
-MyEnv = SWEGymEnv

+"""Client for TraceFix-RL."""
 from typing import Dict
     from models import CodeAction, CodeObservation, TestResult
+class TraceFixRLEnv(
     EnvClient[CodeAction, CodeObservation, State]
 ):
+    """Typed OpenEnv client."""
     def _step_payload(self, action: CodeAction) -> Dict:
         return action.model_dump(exclude_none=True)
     def _parse_result(self, payload: Dict) -> StepResult[CodeObservation]:
         obs_data = payload.get("observation", {})
         observation = CodeObservation(
             code_lines=obs_data.get("code_lines", []),
         )
     def _parse_state(self, payload: Dict) -> State:
         return State(
             episode_id=payload.get("episode_id"),
             step_count=payload.get("step_count", 0),
         )
+MyEnv = TraceFixRLEnv

context.py CHANGED

@@ -1,29 +1,11 @@
-"""
-context.py — Layered Context Compaction
-=========================================
-PRINCIPLE 10 — Layered Context Compaction
-  For large files, returning the full source on every observation would rapidly
-  fill the agent's context window, leaving no room for reasoning.
-  Instead we return a *localized* view: a ±WINDOW_LINES slice of the code
-  centred on the last line that was edited. This gives the agent exactly the
-  context it needs — the neighbourhood of its most recent change — without
-  flooding the context with unrelated code.
-  This module is intentionally pure (no environment state dependencies) so
-  it can be unit-tested independently and reused across environment versions.
-"""
 from __future__ import annotations
 from typing import List, Optional
-# How many lines above and below the anchor to include
 WINDOW_LINES: int = 10
-# Maximum characters for the localized context block
-# (Principle 9: all outputs must be bounded)
 MAX_CONTEXT_CHARS: int = 2_000
@@ -32,53 +14,19 @@ def get_localized_context(
     anchor_line: Optional[int],
     window: int = WINDOW_LINES,
 ) -> str:
-    """
-    Return a ±`window`-line slice of `code_lines` centred on `anchor_line`.
-    Parameters
-    ----------
-    code_lines  : Full list of source lines (0-indexed internally).
-    anchor_line : The 1-indexed line number of the most recent edit.
-                  If None (no edits yet) returns an empty string.
-    window      : Number of lines to show above and below the anchor.
-    Returns
-    -------
-    A formatted string with line numbers, bounded to MAX_CONTEXT_CHARS,
-    annotated with the visible range and an anchor marker (▶).
-    Example output
-    --------------
-    [Showing lines 3–13 of 20, anchor ▶ line 7]
-      3 |     left, right = 0, len(arr)
-      4 |     while left <= right:
-      5 |         mid = (left + right) // 2
-      6 |         if arr[mid] == target:
-      7 ▶         return mid          ← last edit
-      8 |         elif arr[mid] < target:
-      9 |             left = mid + 1
-     10 |         else:
-     11 |             right = mid - 1
-     12 |     return -1
-    """
     if anchor_line is None or not code_lines:
         return ""
     total = len(code_lines)
-    # Clamp anchor into valid range
     anchor_0 = max(0, min(anchor_line - 1, total - 1))
-    # Compute slice bounds (inclusive on both ends, 0-indexed)
     start_0 = max(0, anchor_0 - window)
     end_0   = min(total - 1, anchor_0 + window)
-    # Build header
     start_1 = start_0 + 1
     end_1   = end_0   + 1
     header  = f"[Showing lines {start_1}–{end_1} of {total}, anchor ▶ line {anchor_line}]"
-    # Build body
     body_lines = []
     for i in range(start_0, end_0 + 1):
         line_num = i + 1
@@ -87,8 +35,7 @@ def get_localized_context(
     result = header + "\n" + "\n".join(body_lines)
-    # PRINCIPLE 9 — hard cap on output size
     if len(result) > MAX_CONTEXT_CHARS:
         result = result[:MAX_CONTEXT_CHARS] + "\n... [context truncated]"
-    return result

+"""Localized context helpers for TraceFix-RL."""
 from __future__ import annotations
 from typing import List, Optional
 WINDOW_LINES: int = 10
 MAX_CONTEXT_CHARS: int = 2_000
     anchor_line: Optional[int],
     window: int = WINDOW_LINES,
 ) -> str:
+    """Return a bounded ±window slice around the latest edited line."""
     if anchor_line is None or not code_lines:
         return ""
     total = len(code_lines)
     anchor_0 = max(0, min(anchor_line - 1, total - 1))
     start_0 = max(0, anchor_0 - window)
     end_0   = min(total - 1, anchor_0 + window)
     start_1 = start_0 + 1
     end_1   = end_0   + 1
     header  = f"[Showing lines {start_1}–{end_1} of {total}, anchor ▶ line {anchor_line}]"
     body_lines = []
     for i in range(start_0, end_0 + 1):
         line_num = i + 1
     result = header + "\n" + "\n".join(body_lines)
     if len(result) > MAX_CONTEXT_CHARS:
         result = result[:MAX_CONTEXT_CHARS] + "\n... [context truncated]"
+    return result
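
With the long docstring removed, the clamping arithmetic in `get_localized_context` is easier to follow with a worked example. The sketch below isolates only the bounds computation that is visible in this diff; `window_bounds` is a hypothetical helper name, and the line-formatting loop, which the hunk elides, is deliberately left out.

```python
from typing import Tuple


def window_bounds(total: int, anchor_line: int, window: int = 10) -> Tuple[int, int]:
    """Return the 1-indexed inclusive (start, end) range shown around the anchor."""
    # Clamp the 1-indexed anchor into the valid 0-indexed range.
    anchor_0 = max(0, min(anchor_line - 1, total - 1))
    # Inclusive 0-indexed slice bounds, clipped to the file.
    start_0 = max(0, anchor_0 - window)
    end_0 = min(total - 1, anchor_0 + window)
    return start_0 + 1, end_0 + 1


print(window_bounds(20, 7))      # (1, 17): the window is clipped at the top of the file
print(window_bounds(20, 7, 5))   # (2, 12): full +/-5 window fits
print(window_bounds(20, 99))     # (10, 20): out-of-range anchor is clamped to the last line
```

Note that the header string in the diff prints the caller's `anchor_line` rather than the clamped value, so an out-of-range anchor would still appear verbatim in the header.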

environment.py CHANGED

@@ -1,45 +1,4 @@
-"""
-environment.py — Python Debugging Gym (Core RL Environment)
-=============================================================
-PRINCIPLE 1 — You Don't Design the Control Flow
-  The agent decides the sequence of actions. step() is a pure router:
-  it receives whatever action the agent chose (in whatever order),
-  processes it, and returns the new state. There is no forced sequence,
-  no "you must VIEW_CODE before RUN_TESTS" gate. The system prompt
-  explains what tools exist; the agent decides how to use them.
-PRINCIPLE 5 — Cost-Per-Turn Reward Logic
-  Each call to step() costs R_STEP_COST = -0.01. This makes the episode
-  a multi-turn budget problem: the agent is rewarded for solving quickly.
-  An agent that solves in 4 steps scores ~0.14 more than one that takes
-  18 steps to reach the same solution.
-PRINCIPLE 7 — The Prompt is Code
-  The string returned by reset() is the agent's complete operational
-  contract for the session. It states: the goal, the available actions
-  (with exact JSON examples), the reward structure, the current code,
-  and the expected termination condition. Ambiguity in this string
-  directly causes off-task behaviour.
-PRINCIPLE 10 — Layered Context Compaction
-  _build_observation() tracks `_last_edited_line` and passes it to
-  context.get_localized_context() to produce a focused ±10-line view
-  after each write action. This prevents the observation from inflating
-  the agent's context window on large files.
-Reward table (dense, non-sparse — every step emits a signal):
-  +1.00  SUBMIT and ALL tests pass     → episode solved
-  +0.10  RUN_TESTS called              → information-gathering rewarded
-  +0.05  Per test transitioning fail→pass on a RUN_TESTS or SUBMIT
-  -0.01  Every step taken              → efficiency pressure (Principle 5)
-  -0.10  Syntax error detected         → broken code penalised immediately
-  -0.10  UNDO_EDIT or RESET_TO_ORIGINAL → backtracking discouraged
-  -0.02  Invalid line range supplied   → hallucination deterrent
-  -0.20  SUBMIT with tests still failing
-Max episode length: 50 steps.
-"""
 from __future__ import annotations
@@ -59,33 +18,18 @@ except ImportError:
     from tasks import ALL_TASKS, TASKS_BY_DIFFICULTY
-# ---------------------------------------------------------------------------
-# Reward constants
-# ---------------------------------------------------------------------------
 R_SUBMIT_ALL_PASS = +1.00
 R_SUBMIT_FAIL     = -0.20
 R_SYNTAX_ERROR    = -0.10
 R_RUN_TESTS       = +0.10
 R_PER_NEW_PASS    = +0.05
-R_STEP_COST       = -0.01   # PRINCIPLE 5 — every step has a cost
 R_INVALID_LINE    = -0.02
 R_DESTRUCTIVE_PENALTY = -0.20
-R_UNDO_RESET      = -0.10   # Mini-Git backtracking penalty
 MAX_STEPS: int = 50
-# ---------------------------------------------------------------------------
-# System Prompt  (PRINCIPLE 7 — The Prompt is Code)
-# ---------------------------------------------------------------------------
-# This string is the agent's entire operational contract.
-# It must be:
-#   • Self-contained (no assumed context from training data)
-#   • Precise (exact JSON examples, not vague descriptions)
-#   • Non-directive about sequence (Principle 1: agent chooses order)
-#   • Complete (goal, tools, rewards, termination — nothing omitted)
 _SYSTEM_PROMPT = """\
 ╔══════════════════════════════════════════════════════╗
 ║          PYTHON DEBUGGING GYM — EPISODE BRIEF        ║
@@ -176,24 +120,10 @@ CURRENT CODE  (this is the broken version — fix it)
 """
-# ---------------------------------------------------------------------------
-# Environment
-# ---------------------------------------------------------------------------
-class PythonDebuggingGym:
-    """
-    Gymnasium-compatible RL environment for Python debugging.
-    PRINCIPLE 1: step() is a stateless router — the agent chooses the
-    sequence. No internal gates, no forced ordering between actions.
-    Interface
-    ---------
-    obs, system_prompt = env.reset()
-    obs, reward, done, info = env.step(action: CodeAction)
-    """
-    metadata = {"name": "PythonDebuggingGym-v1", "render_modes": []}
     def __init__(
         self,
@@ -203,25 +133,21 @@ class PythonDebuggingGym:
         self._task_index = task_index
         self._rng = random.Random(seed)
-        # All mutable episode state lives here; reset() wipes every field.
         self._code_lines: List[str] = []
         self._task: Dict[str, Any] = {}
         self._step_count: int = 0
         self._prev_pass_count: int = 0
         self._last_test_results: List[TestResult] = []
         self._last_output: str = ""
-        self._last_edited_line: Optional[int] = None   # PRINCIPLE 10
         self._episode_id: str = ""
         self._done: bool = False
         self._cumulative_reward: float = 0.0
-        self._accumulated_step_costs: float = 0.0  # Hackathon compliance
-        # Mini-Git snapshot history (Phase 2)
-        self._original_code: List[str] = []          # pristine copy set at reset()
-        self._edit_history: List[List[str]] = []     # stack of pre-edit snapshots
-        # Curriculum learning — persists across episodes, incremented externally
         self.training_step: int = 0
-    # ── Curriculum task sampler ──────────────────────────────────────────────
     def _sample_task(self, task_override=None) -> Dict[str, Any]:
         """
@@ -241,13 +167,11 @@ class PythonDebuggingGym:
         if isinstance(task_override, dict):
             return task_override
-        # Judge-safe default: no training_step set → random from all tasks
         if self.training_step == 0:
             if not ALL_TASKS:
                 raise RuntimeError("ALL_TASKS is empty — check tasks.py.")
             return self._rng.choice(ALL_TASKS)
-        # Curriculum mode (trainer increments training_step between episodes)
         if self.training_step < 1000:
             bucket = "easy"
         elif self.training_step < 5000:
@@ -257,7 +181,6 @@ class PythonDebuggingGym:
         pool = TASKS_BY_DIFFICULTY.get(bucket, [])
         if not pool:
-            # Fallback: any non-empty bucket rather than crashing
             for b in ("easy", "medium", "hard"):
                 pool = TASKS_BY_DIFFICULTY.get(b, [])
                 if pool:
@@ -267,7 +190,6 @@ class PythonDebuggingGym:
         return self._rng.choice(pool)
-    # ── reset() ─────────────────────────────────────────────────────────────
     def reset(
         self, *, task_index: Optional[int] = None
@@ -281,7 +203,6 @@ class PythonDebuggingGym:
         """
         self._task = self._sample_task(task_index)
-        # ── Complete state wipe ──────────────────────────────────────────
         self._code_lines       = list(self._task["code"])   # deep copy — no alias
         self._step_count       = 0
         self._prev_pass_count  = 0
@@ -292,16 +213,13 @@ class PythonDebuggingGym:
         self._done             = False
         self._cumulative_reward = 0.0
         self._accumulated_step_costs = 0.0
-        # Mini-Git: seed pristine snapshot and clear history
         self._original_code = list(self._task["code"])  # separate copy from _code_lines
         self._edit_history  = []
-        # Anti-Loop history
         self._last_action: Optional[str] = None
         self._consecutive_count: int = 0
         obs = self._build_observation(reward=0.0)
-        # PRINCIPLE 7: build the operational contract string
         system_prompt = _SYSTEM_PROMPT.format(
             task_name   = self._task["name"],
             difficulty  = self._task.get("difficulty", "unknown"),
@@ -312,7 +230,6 @@ class PythonDebuggingGym:
         return obs, system_prompt
-    # ── step() ──────────────────────────────────────────────────────────────
     def step(
         self, action: CodeAction
@@ -337,7 +254,6 @@ class PythonDebuggingGym:
         reward = R_STEP_COST   # PRINCIPLE 5: cost-per-turn baseline
         self._accumulated_step_costs += abs(R_STEP_COST)  # Hackathon compliance
-        # ── Repetition Penalty (Anti-Loop) ───────────────────────────────
         if action.action_type == self._last_action:
             self._consecutive_count += 1
             reward += -0.05 * self._consecutive_count
@@ -345,7 +261,6 @@ class PythonDebuggingGym:
             self._consecutive_count = 0
         self._last_action = action.action_type
-        # ── Route (PRINCIPLE 1: no forced sequence) ──────────────────────
         atype = action.action_type
         if   atype == "VIEW_CODE":
@@ -369,12 +284,8 @@ class PythonDebuggingGym:
             reward += self._act_submit()
             self._done = True
-        # ── Max-steps termination ────────────────────────────────────────
         if self._step_count >= MAX_STEPS and not self._done:
             self._done = True
-            # Deterministic clamp — never trust the LLM to call SUBMIT.
-            # Evaluate the current code and produce a valid [0.0, 1.0] score
-            # regardless of how the episode ended.
             _, results, syntax_err = run_code_with_tests(
                 source=self._source(),
                 test_callables=self._task["tests"],
@@ -398,15 +309,10 @@ class PythonDebuggingGym:
             "step":              self._step_count,
         }
         if self._done:
-            # PRINCIPLE: Ensure Hackathon score leak doesn't occur. It must be strictly [0.0, 1.0].
-            # During SUBMIT, reward might be negative if _act_submit returned 0.0 added to -0.01.
             info["final_score"] = max(0.0, min(1.0, round(reward, 4)))
         return obs, round(reward, 4), self._done, info
-    # ── Action handlers ─────────────────────────────────────────────────────
-    # Each returns the delta reward (R_STEP_COST already applied by step()).
-    # Handlers update self._last_output and self._last_edited_line as needed.
     def _act_view_code(self) -> float:
         self._last_output = (
@@ -416,7 +322,6 @@ class PythonDebuggingGym:
                 for i, line in enumerate(self._code_lines)
             )
         )
-        # VIEW_CODE does not change the code — localized_context stays where it was
         return 0.0
     def _act_run_tests(self) -> float:
@@ -447,12 +352,10 @@ class PythonDebuggingGym:
         if new_code_block is None:
             new_code_block = ""
-        # ── Guard: Destructive Action (Anti-Deletion) ─────────────────────
         if len(new_code_block) == 0 and (end_line - start_line) > 5:
             self._last_output = "Error: Cannot delete more than 5 lines at once."
             return R_DESTRUCTIVE_PENALTY
-        # ── Guard: inverted range ─────────────────────────────────────────
         if start_line > end_line:
             self._last_output = (
                 f"Error: start_line ({start_line}) > end_line ({end_line}). "
@@ -460,7 +363,6 @@ class PythonDebuggingGym:
             )
             return R_INVALID_LINE
-        # ── Guard: out-of-bounds ──────────────────────────────────────────
         if start_line < 1 or start_line > n:
             self._last_output = (
                 f"Error: start_line {start_line} is out of range [1, {n}]. "
@@ -474,19 +376,14 @@ class PythonDebuggingGym:
             )
             return R_INVALID_LINE
-        # ── Slice assignment (PRINCIPLE 1: pure data transformation) ──────
         start_idx = start_line - 1   # convert to 0-indexed
         end_idx   = end_line         # exclusive upper bound for Python slice
-        # ── Mini-Git: snapshot BEFORE mutating (Phase 2) ─────────────────
         self._edit_history.append(list(self._code_lines))
         new_lines = new_code_block.split("\n")
         self._code_lines[start_idx:end_idx] = new_lines
-        # ── Anchor context at END of new block (PRINCIPLE 10) ─────────────
-        # If the agent replaces lines 5–10 with 20 new lines, the anchor
-        # settles at start_line + len(new_lines) - 1, clamped to file length.
         new_end = start_line + len(new_lines) - 1
         self._last_edited_line = min(new_end, len(self._code_lines))
@@ -514,9 +411,6 @@ class PythonDebuggingGym:
         if syntax_err:
             self._last_output += "\n❌ SUBMIT rejected — syntax error in current code."
-        # ── Hackathon compliance: final score ∈ [0.0, 1.0] ───────────────
-        # raw = (tests_passed / total) - accumulated_step_costs
-        # Then clamped so the grader always receives a value in spec.
         proportion  = passes / total if total > 0 else 0.0
         raw_score   = proportion - self._accumulated_step_costs
         final_score = max(0.0, min(1.0, raw_score))
@@ -559,7 +453,6 @@ class PythonDebuggingGym:
                 "Call VIEW_CODE to inspect the restored file."
             )
-        # PRINCIPLE 10 desync fix: anchor is stale after rollback — wipe it.
         self._last_edited_line = None
         return R_UNDO_RESET
@@ -580,11 +473,9 @@ class PythonDebuggingGym:
             "Call VIEW_CODE to inspect the file."
         )
-        # PRINCIPLE 10 desync fix: context anchor is meaningless after full reset.
         self._last_edited_line = None
         return R_UNDO_RESET
-    # ── Helpers ─────────────────────────────────────────────────────────────
     def _source(self) -> str:
         return "\n".join(self._code_lines)
@@ -592,7 +483,6 @@ class PythonDebuggingGym:
     def _build_observation(self, reward: float) -> CodeObservation:
         syntax_valid, _ = check_syntax(self._source())
-        # PRINCIPLE 10: localized context — only ±10 lines around last edit
         localized = get_localized_context(self._code_lines, self._last_edited_line)
         return CodeObservation(

+"""Core TraceFix-RL environment implementation."""
 from __future__ import annotations
     from tasks import ALL_TASKS, TASKS_BY_DIFFICULTY
 R_SUBMIT_ALL_PASS = +1.00
 R_SUBMIT_FAIL     = -0.20
 R_SYNTAX_ERROR    = -0.10
 R_RUN_TESTS       = +0.10
 R_PER_NEW_PASS    = +0.05
+R_STEP_COST       = -0.01
 R_INVALID_LINE    = -0.02
 R_DESTRUCTIVE_PENALTY = -0.20
+R_UNDO_RESET      = -0.10
 MAX_STEPS: int = 50
 _SYSTEM_PROMPT = """\
 ╔══════════════════════════════════════════════════════╗
 ║          PYTHON DEBUGGING GYM — EPISODE BRIEF        ║
 """
+class TraceFixRLGym:
+    """Gym-style environment with reset/step methods."""
+    metadata = {"name": "TraceFixRL-v1", "render_modes": []}
     def __init__(
         self,
         self._task_index = task_index
         self._rng = random.Random(seed)
         self._code_lines: List[str] = []
         self._task: Dict[str, Any] = {}
         self._step_count: int = 0
         self._prev_pass_count: int = 0
         self._last_test_results: List[TestResult] = []
         self._last_output: str = ""
+        self._last_edited_line: Optional[int] = None
         self._episode_id: str = ""
         self._done: bool = False
         self._cumulative_reward: float = 0.0
+        self._accumulated_step_costs: float = 0.0
+        self._original_code: List[str] = []
+        self._edit_history: List[List[str]] = []
         self.training_step: int = 0
     def _sample_task(self, task_override=None) -> Dict[str, Any]:
         """
         if isinstance(task_override, dict):
             return task_override
         if self.training_step == 0:
             if not ALL_TASKS:
                 raise RuntimeError("ALL_TASKS is empty — check tasks.py.")
             return self._rng.choice(ALL_TASKS)
         if self.training_step < 1000:
             bucket = "easy"
         elif self.training_step < 5000:
         pool = TASKS_BY_DIFFICULTY.get(bucket, [])
         if not pool:
             for b in ("easy", "medium", "hard"):
                 pool = TASKS_BY_DIFFICULTY.get(b, [])
                 if pool:
         return self._rng.choice(pool)
     def reset(
         self, *, task_index: Optional[int] = None
         """
         self._task = self._sample_task(task_index)
         self._code_lines       = list(self._task["code"])   # deep copy — no alias
         self._step_count       = 0
         self._prev_pass_count  = 0
         self._done             = False
         self._cumulative_reward = 0.0
         self._accumulated_step_costs = 0.0
         self._original_code = list(self._task["code"])  # separate copy from _code_lines
         self._edit_history  = []
         self._last_action: Optional[str] = None
         self._consecutive_count: int = 0
         obs = self._build_observation(reward=0.0)
         system_prompt = _SYSTEM_PROMPT.format(
             task_name   = self._task["name"],
             difficulty  = self._task.get("difficulty", "unknown"),
         return obs, system_prompt
     def step(
         self, action: CodeAction
         reward = R_STEP_COST   # PRINCIPLE 5: cost-per-turn baseline
         self._accumulated_step_costs += abs(R_STEP_COST)  # Hackathon compliance
         if action.action_type == self._last_action:
             self._consecutive_count += 1
             reward += -0.05 * self._consecutive_count
             self._consecutive_count = 0
         self._last_action = action.action_type
         atype = action.action_type
         if   atype == "VIEW_CODE":
             reward += self._act_submit()
             self._done = True
         if self._step_count >= MAX_STEPS and not self._done:
             self._done = True
             _, results, syntax_err = run_code_with_tests(
                 source=self._source(),
                 test_callables=self._task["tests"],
             "step":              self._step_count,
         }
         if self._done:
             info["final_score"] = max(0.0, min(1.0, round(reward, 4)))
         return obs, round(reward, 4), self._done, info
     def _act_view_code(self) -> float:
         self._last_output = (
                 for i, line in enumerate(self._code_lines)
             )
         )
         return 0.0
     def _act_run_tests(self) -> float:
         if new_code_block is None:
             new_code_block = ""
         if len(new_code_block) == 0 and (end_line - start_line) > 5:
             self._last_output = "Error: Cannot delete more than 5 lines at once."
             return R_DESTRUCTIVE_PENALTY
         if start_line > end_line:
             self._last_output = (
                 f"Error: start_line ({start_line}) > end_line ({end_line}). "
             )
             return R_INVALID_LINE
         if start_line < 1 or start_line > n:
             self._last_output = (
                 f"Error: start_line {start_line} is out of range [1, {n}]. "
             )
             return R_INVALID_LINE
         start_idx = start_line - 1   # convert to 0-indexed
         end_idx   = end_line         # exclusive upper bound for Python slice
         self._edit_history.append(list(self._code_lines))
         new_lines = new_code_block.split("\n")
         self._code_lines[start_idx:end_idx] = new_lines
         new_end = start_line + len(new_lines) - 1
         self._last_edited_line = min(new_end, len(self._code_lines))
         if syntax_err:
             self._last_output += "\n❌ SUBMIT rejected — syntax error in current code."
         proportion  = passes / total if total > 0 else 0.0
         raw_score   = proportion - self._accumulated_step_costs
         final_score = max(0.0, min(1.0, raw_score))
                 "Call VIEW_CODE to inspect the restored file."
             )
         self._last_edited_line = None
         return R_UNDO_RESET
             "Call VIEW_CODE to inspect the file."
         )
         self._last_edited_line = None
         return R_UNDO_RESET
     def _source(self) -> str:
         return "\n".join(self._code_lines)
     def _build_observation(self, reward: float) -> CodeObservation:
         syntax_valid, _ = check_syntax(self._source())
         localized = get_localized_context(self._code_lines, self._last_edited_line)
         return CodeObservation(
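
The edit path in the `environment.py` hunk above converts a 1-indexed, inclusive `[start_line, end_line]` range into a Python slice before splicing in the replacement lines. A minimal standalone sketch of that arithmetic follows. The function name `apply_edit` and the use of `ValueError` are assumptions made for illustration; the environment itself returns penalty rewards such as `R_INVALID_LINE` instead of raising.

```python
from typing import List, Tuple


def apply_edit(
    code_lines: List[str], start_line: int, end_line: int, new_code_block: str
) -> Tuple[List[str], int]:
    """Replace lines start_line..end_line (1-indexed, inclusive) with new_code_block.

    Hypothetical helper: returns the updated lines plus the 1-indexed number of
    the last line touched by the edit.
    """
    n = len(code_lines)
    if start_line > end_line:
        raise ValueError(f"start_line ({start_line}) > end_line ({end_line})")
    if start_line < 1 or start_line > n:
        raise ValueError(f"start_line {start_line} is out of range [1, {n}]")

    start_idx = start_line - 1  # convert to 0-indexed
    end_idx = end_line          # exclusive upper bound for the slice

    # Note: "".split("\n") returns [""], so an empty block leaves one blank line.
    new_lines = new_code_block.split("\n")

    updated = list(code_lines)  # copy, so the caller keeps the old version for undo
    updated[start_idx:end_idx] = new_lines

    new_end = start_line + len(new_lines) - 1
    last_edited = min(new_end, len(updated))
    return updated, last_edited


lines, last = apply_edit(["a", "b", "c", "d"], 2, 3, "x\ny\nz")
print(lines, last)
```

Because slice assignment accepts a replacement of a different length, the same code path covers edits that grow or shrink the file.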

inference.py CHANGED Viewed

@@ -1,5 +1,5 @@
 """
-Inference script for SWE-Gym - Software Engineer Gym.
 Mandatory env vars expected in deployment config:
   API_BASE_URL
@@ -24,9 +24,9 @@ from typing import Any
 from openai import OpenAI
 try:
-    from swe_gym import CodeAction, SWEGymEnv
 except ImportError:
-    from client import SWEGymEnv
     from models import CodeAction
@@ -36,8 +36,8 @@ HF_TOKEN = os.getenv("HF_TOKEN", "")
 LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME", "")
 ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
-TASK_NAME = os.getenv("TASK_NAME", "swe_gym")
-BENCHMARK = os.getenv("BENCHMARK", "swe_gym")
 MAX_STEPS = int(os.getenv("MAX_STEPS", "50"))
 SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_SCORE_THRESHOLD", "0.99"))
@@ -170,7 +170,7 @@ def _compute_score(step_result: Any, rewards: list[float]) -> float:
 async def main() -> None:
     client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
-    env: SWEGymEnv | None = None
     rewards: list[float] = []
     history: list[str] = []
     steps_taken = 0
@@ -180,9 +180,9 @@ async def main() -> None:
     try:
         if LOCAL_IMAGE_NAME:
-            env = await SWEGymEnv.from_docker_image(LOCAL_IMAGE_NAME)
         else:
-            env = SWEGymEnv(base_url=ENV_BASE_URL)
         result = await env.reset()
         task_name = result.observation.info.get("task_name") or TASK_NAME

 """
+Inference script for TraceFix-RL.
 Mandatory env vars expected in deployment config:
   API_BASE_URL
 from openai import OpenAI
 try:
+    from tracefix_rl import CodeAction, TraceFixRLEnv
 except ImportError:
+    from client import TraceFixRLEnv
     from models import CodeAction
 LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME", "")
 ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
+TASK_NAME = os.getenv("TASK_NAME", "tracefix_rl")
+BENCHMARK = os.getenv("BENCHMARK", "tracefix_rl")
 MAX_STEPS = int(os.getenv("MAX_STEPS", "50"))
 SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_SCORE_THRESHOLD", "0.99"))
 async def main() -> None:
     client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
+    env: TraceFixRLEnv | None = None
     rewards: list[float] = []
     history: list[str] = []
     steps_taken = 0
     try:
         if LOCAL_IMAGE_NAME:
+            env = await TraceFixRLEnv.from_docker_image(LOCAL_IMAGE_NAME)
         else:
+            env = TraceFixRLEnv(base_url=ENV_BASE_URL)
         result = await env.reset()
         task_name = result.observation.info.get("task_name") or TASK_NAME

models.py CHANGED Viewed

@@ -1,4 +1,4 @@
-"""Pydantic schema layer for SWE-Gym - Software Engineer Gym."""
 from __future__ import annotations


+"""Pydantic schema layer for TraceFix-RL."""
 from __future__ import annotations

openenv.yaml CHANGED Viewed

@@ -1,5 +1,5 @@
 spec_version: 1
-name: swe_gym
 type: space
 runtime: fastapi
 app: server.app:app

 spec_version: 1
+name: tracefix_rl
 type: space
 runtime: fastapi
 app: server.app:app

pyproject.toml CHANGED Viewed

@@ -9,9 +9,9 @@ requires = ["setuptools>=45", "wheel"]
 build-backend = "setuptools.build_meta"
 [project]
-name = "openenv-swe-gym"
 version = "0.1.0"
-description = "SWE-Gym - Software Engineer Gym environment for OpenEnv"
 requires-python = ">=3.10"
 dependencies = [
     # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
@@ -38,10 +38,10 @@ dev = [
 [project.scripts]
 # Server entry point - enables running via: uv run --project . server
-# or: python -m swe_gym.server.app
-server = "swe_gym.server.app:main"
 [tool.setuptools]
 include-package-data = true
-packages = ["swe_gym", "swe_gym.server"]
-package-dir = { "swe_gym" = ".", "swe_gym.server" = "server" }

 build-backend = "setuptools.build_meta"
 [project]
+name = "openenv-tracefix-rl"
 version = "0.1.0"
+description = "TraceFix-RL environment for OpenEnv"
 requires-python = ">=3.10"
 dependencies = [
     # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
 [project.scripts]
 # Server entry point - enables running via: uv run --project . server
+# or: python -m tracefix_rl.server.app
+server = "tracefix_rl.server.app:main"
 [tool.setuptools]
 include-package-data = true
+packages = ["tracefix_rl", "tracefix_rl.server"]
+package-dir = { "tracefix_rl" = ".", "tracefix_rl.server" = "server" }

sandbox.py CHANGED Viewed

@@ -44,17 +44,11 @@ except ImportError:
     from models import TestResult
-# ---------------------------------------------------------------------------
-# Constants
-# ---------------------------------------------------------------------------
 EXEC_TIMEOUT_SECONDS: int = 5    # Hard wall-clock kill limit (Principle 8)
 MAX_OUTPUT_CHARS: int = 1_000    # Tail-truncate limit (Principle 9)
-# ---------------------------------------------------------------------------
-# Restricted builtins (Principle 8)
-# ---------------------------------------------------------------------------
 def _make_safe_stub(name: str) -> Callable:
     """Return a callable that raises RuntimeError — used to block dangerous builtins."""
@@ -68,26 +62,19 @@ def _make_safe_stub(name: str) -> Callable:
     return _stub
-# Whitelist: safe builtins the agent's code is allowed to use.
-# Everything not in this dict is blocked.
 _SAFE_BUILTINS: Dict[str, Any] = {
-    # Type constructors
     "int": int, "float": float, "str": str, "bool": bool,
     "list": list, "dict": dict, "set": set, "tuple": tuple,
     "bytes": bytes, "bytearray": bytearray, "frozenset": frozenset,
     "complex": complex,
-    # Inspection / iteration
     "len": len, "range": range, "enumerate": enumerate, "zip": zip,
     "map": map, "filter": filter, "reversed": reversed, "sorted": sorted,
     "iter": iter, "next": next, "sum": sum, "min": min, "max": max,
     "abs": abs, "round": round, "divmod": divmod, "pow": pow,
-    # Introspection
     "isinstance": isinstance, "issubclass": issubclass, "type": type,
     "hasattr": hasattr, "getattr": getattr, "setattr": setattr,
     "callable": callable, "repr": repr, "hash": hash, "id": id,
-    # I/O (stdout only — stderr is captured separately)
     "print": print,
-    # Exceptions & control
     "Exception": Exception, "ValueError": ValueError, "TypeError": TypeError,
     "KeyError": KeyError, "IndexError": IndexError, "AttributeError": AttributeError,
     "StopIteration": StopIteration, "RuntimeError": RuntimeError,
@@ -96,13 +83,11 @@ _SAFE_BUILTINS: Dict[str, Any] = {
     "RecursionError": RecursionError, "MemoryError": MemoryError,
     "KeyboardInterrupt": KeyboardInterrupt,
     "BaseException": BaseException,
-    # Functional
     "any": any, "all": all,
     "chr": chr, "ord": ord, "hex": hex, "oct": oct, "bin": bin,
     "format": format,
     "object": object, "property": property, "staticmethod": staticmethod,
     "classmethod": classmethod, "super": super,
-    # Blocked with stubs (Principle 8)
     "open":        _make_safe_stub("open"),
     "__import__":  _make_safe_stub("__import__"),
     "eval":        _make_safe_stub("eval"),
@@ -119,9 +104,6 @@ _SAFE_BUILTINS: Dict[str, Any] = {
 }
-# ---------------------------------------------------------------------------
-# Output truncation (Principle 9)
-# ---------------------------------------------------------------------------
 def _tail_truncate(s: str, limit: int = MAX_OUTPUT_CHARS) -> str:
     """
@@ -138,9 +120,6 @@ def _tail_truncate(s: str, limit: int = MAX_OUTPUT_CHARS) -> str:
     return f"[...truncated {dropped} chars...]\n" + s[-limit:]
-# ---------------------------------------------------------------------------
-# Worker (runs in isolated child process)
-# ---------------------------------------------------------------------------
 def _worker(
     source: str,
@@ -162,41 +141,29 @@ def _worker(
     fn_name = "<unknown>"
     try:
-        # ── Phase 1: Syntax check ─────────────────────────────────────────
-        # Compile before exec() so SyntaxError is caught cleanly.
         try:
             code_obj = compile(source, "<agent_code>", "exec")
         except SyntaxError as exc:
             had_syntax_error = True
-            # Restore streams before writing the error
             sys.stdout, sys.stderr = old_stdout, old_stderr
             err = f"SyntaxError at line {exc.lineno}: {exc.msg}\n  >> {exc.text or ''}"
             result_queue.put((_tail_truncate(err), [], True))
             return
-        # ── Phase 2: Execute agent code into a sandboxed namespace ───────
-        # Use full __builtins__ to prevent __build_class__ errors for class-based tasks.
         namespace: Dict[str, Any] = {"__builtins__": __builtins__}
         try:
             exec(code_obj, namespace)  # noqa: S102
         except Exception:  # noqa: BLE001
-            # PRINCIPLE 2: execution crash is data, not a crash
             tb = traceback.format_exc()
             sys.stdout, sys.stderr = old_stdout, old_stderr
             result_queue.put((_tail_truncate(buf.getvalue() + "\n" + tb), [], False))
             return
-        # ── Phase 3: Run each test function ──────────────────────────────
-        # PRINCIPLE 2: each test is isolated inside its own try-except so a
-        # crash in test N does not prevent tests N+1..M from running.
         for test_src in test_sources:
             fn_name = "<unknown>"
             try:
-                # Inject the test function into the existing namespace so it
-                # can access the agent's defined symbols.
                 exec(test_src, namespace)  # noqa: S102
-                # Extract the last `def` name from the test source.
                 fn_name = [
                     ln.split("(")[0].replace("def ", "").strip()
                     for ln in test_src.splitlines()
@@ -207,7 +174,6 @@ def _worker(
                 test_results.append({"test_name": fn_name, "passed": True})
             except AssertionError as exc:
-                # PRINCIPLE 2: assertion failure is structured data
                 test_results.append({
                     "test_name": fn_name,
                     "passed": False,
@@ -216,7 +182,6 @@ def _worker(
                     ),
                 })
             except Exception:  # noqa: BLE001
-                # PRINCIPLE 2: all other exceptions also become structured data
                 test_results.append({
                     "test_name": fn_name,
                     "passed": False,
@@ -224,7 +189,6 @@ def _worker(
                 })
     except Exception:  # noqa: BLE001
-        # Catch-all for any unexpected failure in the harness itself
         traceback.print_exc(file=buf)
     finally:
         sys.stdout, sys.stderr = old_stdout, old_stderr
@@ -233,9 +197,6 @@ def _worker(
     result_queue.put((captured, test_results, had_syntax_error))
-# ---------------------------------------------------------------------------
-# Public API
-# ---------------------------------------------------------------------------
 def check_syntax(source: str) -> Tuple[bool, str]:
     """
@@ -271,7 +232,6 @@ def run_code_with_tests(
     -------
     (output_str, test_results, had_syntax_error)
     """
-    # Serialise callables → source strings (required for pickling across processes)
     test_sources = [
         textwrap.dedent(inspect.getsource(fn))
         for fn in test_callables
@@ -286,7 +246,6 @@ def run_code_with_tests(
     proc.start()
     proc.join(timeout)
-    # PRINCIPLE 8 — hard kill (SIGTERM first, SIGKILL if still alive)
     if proc.is_alive():
         proc.terminate()
         proc.join(2)   # Give it 2s to handle SIGTERM gracefully

     from models import TestResult
 EXEC_TIMEOUT_SECONDS: int = 5    # Hard wall-clock kill limit (Principle 8)
 MAX_OUTPUT_CHARS: int = 1_000    # Tail-truncate limit (Principle 9)
 def _make_safe_stub(name: str) -> Callable:
     """Return a callable that raises RuntimeError — used to block dangerous builtins."""
     return _stub
 _SAFE_BUILTINS: Dict[str, Any] = {
     "int": int, "float": float, "str": str, "bool": bool,
     "list": list, "dict": dict, "set": set, "tuple": tuple,
     "bytes": bytes, "bytearray": bytearray, "frozenset": frozenset,
     "complex": complex,
     "len": len, "range": range, "enumerate": enumerate, "zip": zip,
     "map": map, "filter": filter, "reversed": reversed, "sorted": sorted,
     "iter": iter, "next": next, "sum": sum, "min": min, "max": max,
     "abs": abs, "round": round, "divmod": divmod, "pow": pow,
     "isinstance": isinstance, "issubclass": issubclass, "type": type,
     "hasattr": hasattr, "getattr": getattr, "setattr": setattr,
     "callable": callable, "repr": repr, "hash": hash, "id": id,
     "print": print,
     "Exception": Exception, "ValueError": ValueError, "TypeError": TypeError,
     "KeyError": KeyError, "IndexError": IndexError, "AttributeError": AttributeError,
     "StopIteration": StopIteration, "RuntimeError": RuntimeError,
     "RecursionError": RecursionError, "MemoryError": MemoryError,
     "KeyboardInterrupt": KeyboardInterrupt,
     "BaseException": BaseException,
     "any": any, "all": all,
     "chr": chr, "ord": ord, "hex": hex, "oct": oct, "bin": bin,
     "format": format,
     "object": object, "property": property, "staticmethod": staticmethod,
     "classmethod": classmethod, "super": super,
     "open":        _make_safe_stub("open"),
     "__import__":  _make_safe_stub("__import__"),
     "eval":        _make_safe_stub("eval"),
 }
 def _tail_truncate(s: str, limit: int = MAX_OUTPUT_CHARS) -> str:
     """
     return f"[...truncated {dropped} chars...]\n" + s[-limit:]
 def _worker(
     source: str,
     fn_name = "<unknown>"
     try:
         try:
             code_obj = compile(source, "<agent_code>", "exec")
         except SyntaxError as exc:
             had_syntax_error = True
             sys.stdout, sys.stderr = old_stdout, old_stderr
             err = f"SyntaxError at line {exc.lineno}: {exc.msg}\n  >> {exc.text or ''}"
             result_queue.put((_tail_truncate(err), [], True))
             return
         namespace: Dict[str, Any] = {"__builtins__": __builtins__}
         try:
             exec(code_obj, namespace)  # noqa: S102
         except Exception:  # noqa: BLE001
             tb = traceback.format_exc()
             sys.stdout, sys.stderr = old_stdout, old_stderr
             result_queue.put((_tail_truncate(buf.getvalue() + "\n" + tb), [], False))
             return
         for test_src in test_sources:
             fn_name = "<unknown>"
             try:
                 exec(test_src, namespace)  # noqa: S102
                 fn_name = [
                     ln.split("(")[0].replace("def ", "").strip()
                     for ln in test_src.splitlines()
                 test_results.append({"test_name": fn_name, "passed": True})
             except AssertionError as exc:
                 test_results.append({
                     "test_name": fn_name,
                     "passed": False,
                     ),
                 })
             except Exception:  # noqa: BLE001
                 test_results.append({
                     "test_name": fn_name,
                     "passed": False,
                 })
     except Exception:  # noqa: BLE001
         traceback.print_exc(file=buf)
     finally:
         sys.stdout, sys.stderr = old_stdout, old_stderr
     result_queue.put((captured, test_results, had_syntax_error))
 def check_syntax(source: str) -> Tuple[bool, str]:
     """
     -------
     (output_str, test_results, had_syntax_error)
     """
     test_sources = [
         textwrap.dedent(inspect.getsource(fn))
         for fn in test_callables
     proc.start()
     proc.join(timeout)
     if proc.is_alive():
         proc.terminate()
         proc.join(2)   # Give it 2s to handle SIGTERM gracefully
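
In the hunks shown above, `_worker` passes the full `__builtins__` into the execution namespace, while `_SAFE_BUILTINS` defines a whitelist with stubs for `open`, `__import__`, and `eval`. The sketch below shows, as an illustration only, how such a whitelist could be wired into `exec`. The names `make_stub`, `SAFE_BUILTINS`, and `run_restricted` are hypothetical and the whitelist is deliberately tiny. The `__build_class__` entry reflects the concern in the removed comment about class-based tasks.

```python
import builtins
from typing import Any, Callable, Dict


def make_stub(name: str) -> Callable[..., Any]:
    """Return a callable that refuses to run, standing in for a blocked builtin."""
    def _stub(*args: Any, **kwargs: Any) -> Any:
        raise RuntimeError(f"'{name}' is not available in the sandbox")
    return _stub


# Hypothetical whitelist, much smaller than the one in the diff.
SAFE_BUILTINS: Dict[str, Any] = {
    "len": len, "range": range, "sum": sum, "print": print,
    "Exception": Exception, "RuntimeError": RuntimeError,
    # `class` statements look up __build_class__ in the builtins mapping.
    "__build_class__": builtins.__build_class__,
    # Blocked names are replaced with stubs that raise when called.
    "open": make_stub("open"),
    "__import__": make_stub("__import__"),
}


def run_restricted(source: str) -> Dict[str, Any]:
    """Execute source with only SAFE_BUILTINS visible and return its namespace."""
    namespace: Dict[str, Any] = {
        "__builtins__": SAFE_BUILTINS,
        "__name__": "<agent_code>",  # class bodies read __name__ to set __module__
    }
    exec(compile(source, "<agent_code>", "exec"), namespace)  # noqa: S102
    return namespace


ns = run_restricted("class Box:\n    size = 3\n\ndef total(xs):\n    return sum(xs)\n")
print(ns["total"]([1, 2, 3]), ns["Box"].size)
```

A whitelist like this narrows what agent code can reach by name, but it is not a security boundary on its own, which is presumably why the sandbox also relies on a child process with a hard timeout.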

server/__init__.py CHANGED Viewed

@@ -1,11 +1,5 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-"""SWE-Gym server components."""
-from .swe_gym_environment import SWEGymEnvironment
-__all__ = ["SWEGymEnvironment"]

+"""TraceFix-RL server components."""
+from .tracefix_rl_environment import TraceFixRLEnvironment
+__all__ = ["TraceFixRLEnvironment"]

server/app.py CHANGED Viewed

@@ -1,10 +1,4 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-"""FastAPI entry point for SWE-Gym - Software Engineer Gym."""
 try:
     from openenv.core.env_server.http_server import create_app
@@ -15,23 +9,22 @@ except Exception as e:  # pragma: no cover
 try:
     from ..models import CodeAction, CodeObservation
-    from .swe_gym_environment import SWEGymEnvironment
 except ImportError:
     import sys
     from pathlib import Path
     sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
     from models import CodeAction, CodeObservation
-    from server.swe_gym_environment import SWEGymEnvironment
-# Create the app with web interface and README integration
 app = create_app(
-    SWEGymEnvironment,
     CodeAction,
     CodeObservation,
-    env_name="swe_gym",
-    max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
 )

+"""FastAPI entry point for TraceFix-RL."""
 try:
     from openenv.core.env_server.http_server import create_app
 try:
     from ..models import CodeAction, CodeObservation
+    from .tracefix_rl_environment import TraceFixRLEnvironment
 except ImportError:
     import sys
     from pathlib import Path
     sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
     from models import CodeAction, CodeObservation
+    from server.tracefix_rl_environment import TraceFixRLEnvironment
 app = create_app(
+    TraceFixRLEnvironment,
     CodeAction,
     CodeObservation,
+    env_name="tracefix_rl",
+    max_concurrent_envs=1,
 )

server/{swe_gym_environment.py → tracefix_rl_environment.py} RENAMED Viewed

@@ -1,33 +1,23 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-"""OpenEnv adapter around the SWE-Gym core environment."""
 from openenv.core.env_server.interfaces import Environment
 from openenv.core.env_server.types import State
 try:
-    from ..environment import PythonDebuggingGym
     from ..models import CodeAction, CodeObservation
 except ImportError:
-    from environment import PythonDebuggingGym
     from models import CodeAction, CodeObservation
-class SWEGymEnvironment(Environment):
     """Environment implementation compatible with OpenEnv's server interface."""
-    # Enable concurrent WebSocket sessions.
-    # Set to True if your environment isolates state between instances.
-    # When True, multiple WebSocket clients can connect simultaneously, each
-    # getting their own environment instance (when using factory mode in app.py).
     SUPPORTS_CONCURRENT_SESSIONS: bool = True
     def __init__(self):
-        self._gym = PythonDebuggingGym()
         self._state = State(episode_id="", step_count=0)
     def reset(self) -> CodeObservation:
@@ -56,10 +46,4 @@ class SWEGymEnvironment(Environment):
     @property
     def state(self) -> State:
-        """
-        Get the current environment state.
-        Returns:
-            Current State with episode_id and step_count
-        """
         return self._state

+"""OpenEnv adapter around the TraceFix-RL core environment."""
 from openenv.core.env_server.interfaces import Environment
 from openenv.core.env_server.types import State
 try:
+    from ..environment import TraceFixRLGym
     from ..models import CodeAction, CodeObservation
 except ImportError:
+    from environment import TraceFixRLGym
     from models import CodeAction, CodeObservation
+class TraceFixRLEnvironment(Environment):
     """Environment implementation compatible with OpenEnv's server interface."""
     SUPPORTS_CONCURRENT_SESSIONS: bool = True
     def __init__(self):
+        self._gym = TraceFixRLGym()
         self._state = State(episode_id="", step_count=0)
     def reset(self) -> CodeObservation:
     @property
     def state(self) -> State:
         return self._state