Spaces: The-Fool-09/debugZero

The-Fool-09 committed on Apr 26

Commit 8412998 (verified) · 1 Parent(s): 22b11ca

Upload folder using huggingface_hub

Files changed (10)

Blog.md +98 -98
Dockerfile +1 -0
client.py +23 -23
eval/api_baseline.py +346 -346
inference.py +332 -332
models.py +9 -9
server/debugZero_environment.py +224 -224
server/graders.py +19 -19
server/tasks.py +92 -92
validate-submission.sh +184 -184

Blog.md CHANGED

@@ -1,98 +1,98 @@
-# DebugZero: Teaching a Coding Agent to Create and Fix Bugs
-Most code benchmarks ask a model to write a fresh solution from scratch. That is useful, but it skips a big part of real programming work: debugging code that is almost correct.
-That is the problem we built **DebugZero** to explore.
-DebugZero is an OpenEnv environment where a coding agent learns through a two-role game:
-- a **Proposer** takes a correct function and introduces a small but meaningful bug
-- a **Solver** takes that buggy function and tries to repair it
-The environment runs the submitted code in a sandbox, executes tests, and returns structured observations and rewards. In other words, the model does not just generate code and hope for the best. It acts inside an environment that can tell it whether a bug is real, whether a fix works, and whether the behavior is improving over time.
-## Why we built it
-We wanted an environment that treats debugging as a first-class skill.
-In practice, strong programmers do more than write correct code. They also:
-- recognize how correct-looking code can fail
-- make small, targeted edits instead of rewriting everything
-- use test failures as evidence
-- recover from mistakes efficiently
-Static benchmarks usually measure the end result. DebugZero is meant to train the process.
-## How an episode works
-Each episode starts from a clean seed task: a short Python function plus a hidden test harness.
-On the first turn, the proposer submits a modified version of the function. The goal is not to destroy the program randomly. The goal is to create a bug that is realistic, small, and detectable by tests.
-The environment then:
-1. parses the submitted code
-2. executes it in a sandboxed subprocess
-3. runs the task tests
-4. returns the current code, execution result, test status, reward, and next role
-If the proposer successfully creates a valid bug, the solver gets the next turn. The solver then submits a repaired function, and the environment checks whether the original behavior has been restored.
-This makes the whole loop executable and grounded. The agent is not rewarded for sounding plausible. It is rewarded for actually changing program behavior in the intended way.
-## What makes the reward signal useful
-DebugZero uses role-aware rewards instead of a single generic success metric.
-For the proposer, reward is higher when the bug is:
-- syntactically valid
-- actually test-breaking
-- close to the original implementation rather than random corruption
-For the solver, reward is higher when the fix cleanly restores the expected behavior.
-That design matters because it pushes both roles toward realistic debugging behavior. The proposer learns to create useful failures. The solver learns to make precise repairs.
-## What we trained
-We trained a policy for this environment using **GRPO** and role-conditioned prompting. One important design choice was to train against the **deployed environment itself**, not against notebook-local copies of the environment logic.
-That means the training loop interacts with the same OpenEnv interface that serves the environment in deployment:
-- reset the environment
-- observe the current task state
-- submit a proposer or solver action
-- receive reward and updated observation
-This kept training aligned with the real environment instead of drifting into a separate offline approximation.
-## Why the two-role setup is interesting
-The most fun part of DebugZero is that it creates its own pressure to improve.
-If the solver becomes stronger, the proposer has to invent better bugs. If the proposer becomes better at making subtle failures, the solver has to become more precise at repair. That gives us a natural self-play curriculum for debugging.
-Instead of hand-authoring every training example, we get an environment where challenge and skill can rise together.
-## What DebugZero is really trying to test
-At a deeper level, this project is about whether coding agents can become better debuggers through interaction rather than static supervision alone.
-We care about questions like:
-- Can an agent learn to create realistic failure modes?
-- Can it repair bugs without over-editing the program?
-- Can self-play produce a useful curriculum for code reasoning?
-- Can reward grounded in execution and tests teach something that static datasets miss?
-DebugZero is our attempt at turning those questions into something concrete and measurable.
-## Links
-- Hugging Face Space: https://the-fool-09-debugzero.hf.space
-- Hugging Face project page: https://huggingface.co/spaces/The-Fool-09/debugZero
-- Training notebook: `notebooks/train_colab_updated_1.ipynb`
-In short, DebugZero is not just a benchmark where a model writes code. It is an environment where the model learns from failure, creates new failure cases, and improves through the loop of breaking and repairing programs. That is the behavior we wanted to surface, and that is what we trained for.

+# DebugZero: Teaching a Coding Agent to Create and Fix Bugs
+Most code benchmarks ask a model to write a fresh solution from scratch. That is useful, but it skips a big part of real programming work: debugging code that is almost correct.
+That is the problem we built **DebugZero** to explore.
+DebugZero is an OpenEnv environment where a coding agent learns through a two-role game:
+- a **Proposer** takes a correct function and introduces a small but meaningful bug
+- a **Solver** takes that buggy function and tries to repair it
+The environment runs the submitted code in a sandbox, executes tests, and returns structured observations and rewards. In other words, the model does not just generate code and hope for the best. It acts inside an environment that can tell it whether a bug is real, whether a fix works, and whether the behavior is improving over time.
+## Why we built it
+We wanted an environment that treats debugging as a first-class skill.
+In practice, strong programmers do more than write correct code. They also:
+- recognize how correct-looking code can fail
+- make small, targeted edits instead of rewriting everything
+- use test failures as evidence
+- recover from mistakes efficiently
+Static benchmarks usually measure the end result. DebugZero is meant to train the process.
+## How an episode works
+Each episode starts from a clean seed task: a short Python function plus a hidden test harness.
+On the first turn, the proposer submits a modified version of the function. The goal is not to destroy the program randomly. The goal is to create a bug that is realistic, small, and detectable by tests.
+The environment then:
+1. parses the submitted code
+2. executes it in a sandboxed subprocess
+3. runs the task tests
+4. returns the current code, execution result, test status, reward, and next role
+If the proposer successfully creates a valid bug, the solver gets the next turn. The solver then submits a repaired function, and the environment checks whether the original behavior has been restored.
+This makes the whole loop executable and grounded. The agent is not rewarded for sounding plausible. It is rewarded for actually changing program behavior in the intended way.
+## What makes the reward signal useful
+DebugZero uses role-aware rewards instead of a single generic success metric.
+For the proposer, reward is higher when the bug is:
+- syntactically valid
+- actually test-breaking
+- close to the original implementation rather than random corruption
+For the solver, reward is higher when the fix cleanly restores the expected behavior.
+That design matters because it pushes both roles toward realistic debugging behavior. The proposer learns to create useful failures. The solver learns to make precise repairs.
+## What we trained
+We trained a policy for this environment using **GRPO** and role-conditioned prompting. One important design choice was to train against the **deployed environment itself**, not against notebook-local copies of the environment logic.
+That means the training loop interacts with the same OpenEnv interface that serves the environment in deployment:
+- reset the environment
+- observe the current task state
+- submit a proposer or solver action
+- receive reward and updated observation
+This kept training aligned with the real environment instead of drifting into a separate offline approximation.
+## Why the two-role setup is interesting
+The most fun part of DebugZero is that it creates its own pressure to improve.
+If the solver becomes stronger, the proposer has to invent better bugs. If the proposer becomes better at making subtle failures, the solver has to become more precise at repair. That gives us a natural self-play curriculum for debugging.
+Instead of hand-authoring every training example, we get an environment where challenge and skill can rise together.
+## What DebugZero is really trying to test
+At a deeper level, this project is about whether coding agents can become better debuggers through interaction rather than static supervision alone.
+We care about questions like:
+- Can an agent learn to create realistic failure modes?
+- Can it repair bugs without over-editing the program?
+- Can self-play produce a useful curriculum for code reasoning?
+- Can reward grounded in execution and tests teach something that static datasets miss?
+DebugZero is our attempt at turning those questions into something concrete and measurable.
+## Links
+- Hugging Face Space: https://the-fool-09-debugzero.hf.space
+- Hugging Face project page: https://huggingface.co/spaces/The-Fool-09/debugZero
+- Training notebook: `notebooks/train_colab_updated_1.ipynb`
+In short, DebugZero is not just a benchmark where a model writes code. It is an environment where the model learns from failure, creates new failure cases, and improves through the loop of breaking and repairing programs. That is the behavior we wanted to surface, and that is what we trained for.
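
Review note on the reward description in Blog.md: the post says the proposer earns more when a bug is syntactically valid, test-breaking, and close to the original, and the solver earns more when the fix restores behavior. The sketch below is one way to read that description. It is not the logic in `server/graders.py`; the function names, the penalty of -1.0, and the 0.5/0.5 weighting are all assumptions made for illustration.

```python
import ast
import difflib


def is_valid_python(code: str) -> bool:
    """Return True when the submission parses as Python source."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def closeness(original: str, candidate: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two sources are identical."""
    return difflib.SequenceMatcher(None, original, candidate).ratio()


def proposer_reward(original: str, candidate: str, tests_passed: bool) -> float:
    """Hypothetical proposer reward: valid, test-breaking, close to the seed."""
    if not is_valid_python(candidate):
        return -1.0  # assumed penalty for a syntax error
    if tests_passed:
        return 0.0  # nothing broke, so no bug was actually created
    # Assumed weighting: base credit for a real bug plus a bonus for a small edit.
    return 0.5 + 0.5 * closeness(original, candidate)


def solver_reward(candidate: str, tests_passed: bool) -> float:
    """Hypothetical solver reward: full credit only for a valid, passing fix."""
    if not is_valid_python(candidate):
        return -1.0  # assumed penalty for a syntax error
    return 1.0 if tests_passed else 0.0


seed = "def add(a, b):\n    return a + b\n"
small_bug = "def add(a, b):\n    return a - b\n"
rewrite = (
    "def add(a, b):\n"
    "    total = 0\n"
    "    for value in (a, a):\n"
    "        total += value\n"
    "    return total\n"
)

# Both candidates break the tests, but the one-character edit scores higher.
small_score = proposer_reward(seed, small_bug, tests_passed=False)
rewrite_score = proposer_reward(seed, rewrite, tests_passed=False)
```

Under these assumptions a one-character operator swap outranks a broad rewrite that fails the same tests, which matches the post's stated preference for realistic bugs over random corruption.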

Dockerfile CHANGED

@@ -47,6 +47,7 @@ RUN apt-get update && \
 COPY --from=builder /app/.venv /app/.venv
 COPY --from=builder /app/env /app/env
+COPY --from=builder /app/env/README.md /app/README.md
 ENV PATH="/app/.venv/bin:$PATH"
 ENV PYTHONPATH="/app/env:$PYTHONPATH"

client.py CHANGED

@@ -69,29 +69,29 @@ class DebugzeroEnv(
         Args:
             payload: JSON response data from server
-        Returns:
-            StepResult with DebugzeroObservation
-        """
-        obs_data = payload.get("observation", {})
-        reward_value = payload.get("reward", obs_data.get("reward"))
-        done_value = payload.get("done", obs_data.get("done", False))
-        observation = DebugzeroObservation(
-            role_next=obs_data.get("role_next", "proposer"),
-            current_code=obs_data.get("current_code", ""),
-            execution_result=obs_data.get("execution_result", ""),
-            tests_passed=obs_data.get("tests_passed", False),
-            syntax_error=obs_data.get("syntax_error", False),
-            score=obs_data.get("score", 0.0),
-            done=done_value,
-            reward=reward_value,
-            metadata=obs_data.get("metadata", {}),
-        )
-        return StepResult(
-            observation=observation,
-            reward=reward_value,
-            done=done_value,
-        )
+        Returns:
+            StepResult with DebugzeroObservation
+        """
+        obs_data = payload.get("observation", {})
+        reward_value = payload.get("reward", obs_data.get("reward"))
+        done_value = payload.get("done", obs_data.get("done", False))
+        observation = DebugzeroObservation(
+            role_next=obs_data.get("role_next", "proposer"),
+            current_code=obs_data.get("current_code", ""),
+            execution_result=obs_data.get("execution_result", ""),
+            tests_passed=obs_data.get("tests_passed", False),
+            syntax_error=obs_data.get("syntax_error", False),
+            score=obs_data.get("score", 0.0),
+            done=done_value,
+            reward=reward_value,
+            metadata=obs_data.get("metadata", {}),
+        )
+        return StepResult(
+            observation=observation,
+            reward=reward_value,
+            done=done_value,
+        )
     def _parse_state(self, payload: Dict) -> DebugzeroState:
         """

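Review note on the `client.py` hunk above: `reward` and `done` are read from the top level of the payload first, with the copy inside `observation` used only as a fallback. The sketch below isolates that lookup so the precedence is easy to check. `ParsedStep` and `parse_step` are hypothetical stand-ins, not the project's `StepResult` or `DebugzeroObservation`, and the payloads are made up.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ParsedStep:
    """Hypothetical stand-in for StepResult; only the fields used here."""

    reward: Optional[float]
    done: bool
    tests_passed: bool


def parse_step(payload: Dict[str, Any]) -> ParsedStep:
    obs_data = payload.get("observation", {})
    # Top-level keys win; the observation copy is used only when the key is absent.
    reward = payload.get("reward", obs_data.get("reward"))
    done = payload.get("done", obs_data.get("done", False))
    return ParsedStep(
        reward=reward,
        done=done,
        tests_passed=obs_data.get("tests_passed", False),
    )


top_level = parse_step({"reward": 1.0, "done": True, "observation": {"reward": 0.25}})
nested_only = parse_step({"observation": {"reward": 0.25, "done": True, "tests_passed": True}})
explicit_null = parse_step({"reward": None, "observation": {"reward": 0.25}})
```

One consequence worth knowing: `dict.get` falls back only when the key is missing, so a server that sends `"reward": null` at the top level yields `None` here even if the observation carries a number. Callers such as `eval/api_baseline.py` guard against this with `float(... or 0.0)`.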
eval/api_baseline.py CHANGED

@@ -1,346 +1,346 @@
-import asyncio
-import inspect
-import json
-import os
-import sys
-import textwrap
-from typing import Any, Optional
-sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
-from dotenv import load_dotenv
-from openai import OpenAI
-from client import DebugzeroEnv
-from models import DebugzeroAction
-load_dotenv()
-API_BASE_URL = os.getenv("API_BASE_URL") or os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
-MODEL_NAME = os.getenv("MODEL_NAME") or os.getenv("OPENAI_MODEL", "meta-llama/llama-3.1-8b-instruct")
-API_KEY = os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY") or os.getenv("HF_TOKEN")
-ENV_URL = os.getenv("DEBUGZERO_ENV_URL", "http://localhost:8000")
-NUM_EPISODES = int(os.getenv("NUM_EPISODES", "6"))
-MAX_STEPS = int(os.getenv("MAX_STEPS", "8"))
-PROPOSER_TEMPERATURE = float(os.getenv("PROPOSER_TEMPERATURE", "0.7"))
-SOLVER_TEMPERATURE = float(os.getenv("SOLVER_TEMPERATURE", "0.2"))
-MAX_TOKENS = int(os.getenv("MAX_TOKENS", "1024"))
-BUG_FOCUS = os.getenv("DEBUGZERO_BUG_FOCUS")
-def extract_python_code(text: str) -> str:
-    content = (text or "").strip()
-    if content.startswith("```"):
-        content = content.split("\n", 1)[-1]
-    if content.endswith("```"):
-        content = content.rsplit("\n", 1)[0]
-    return content.strip()
-def summarize_error(text: str, max_chars: int = 220) -> str:
-    cleaned = " ".join(text.strip().split())
-    if not cleaned:
-        return "null"
-    if len(cleaned) <= max_chars:
-        return cleaned
-    return cleaned[: max_chars - 3].rstrip() + "..."
-def extract_env_error(result: Any) -> Optional[str]:
-    for attr in ("last_action_error", "error", "message"):
-        if hasattr(result, attr):
-            value = getattr(result, attr)
-            if value:
-                return str(value)
-    obs = getattr(result, "observation", None)
-    if obs is None:
-        return None
-    for attr in ("last_action_error", "error"):
-        if hasattr(obs, attr):
-            value = getattr(obs, attr)
-            if value:
-                return str(value)
-    execution_result = getattr(obs, "execution_result", "")
-    if isinstance(execution_result, str) and execution_result:
-        if getattr(obs, "syntax_error", False):
-            return summarize_error(execution_result)
-        if execution_result.startswith("Unsafe import detected."):
-            return execution_result
-        if not getattr(obs, "tests_passed", False):
-            return summarize_error(execution_result)
-    return None
-def compact_action_string(role: str, code: str) -> str:
-    return json.dumps({"role": role, "code": code}, separators=(",", ":"), ensure_ascii=False)
-def build_prompt(obs_dict: dict[str, Any], history: list[str]) -> str:
-    role = str(obs_dict.get("role_next", "proposer"))
-    current_code = str(obs_dict.get("current_code", ""))
-    execution_result = str(obs_dict.get("execution_result", ""))
-    tests_passed = bool(obs_dict.get("tests_passed", False))
-    syntax_error = bool(obs_dict.get("syntax_error", False))
-    metadata = obs_dict.get("metadata", {}) or {}
-    seed_id = metadata.get("seed_id", "unknown")
-    history_block = "\n".join(history[-4:]) if history else "None"
-    if role == "proposer":
-        focus_line = ""
-        if BUG_FOCUS:
-            focus_line = f"- Focus specifically on the `{BUG_FOCUS}` mutation family.\n"
-        instructions = textwrap.dedent(
-            f"""
-            You are the Proposer in a debugging self-play environment.
-            Return a full Python function with exactly one small logical bug injected.
-            Rules:
-            - Keep the code valid Python.
-            - Keep the same function signature.
-            - Preserve the overall structure and formatting as much as possible.
-            - Make exactly one small local behavioral change.
-            - Avoid comments, explanations, markdown outside the code block, and broad rewrites.
-            {focus_line}- Your goal is to make tests fail without creating a syntax error.
-            """
-        ).strip()
-    else:
-        instructions = textwrap.dedent(
-            """
-            You are the Solver in a debugging self-play environment.
-            Return the full fixed Python function.
-            Rules:
-            - Keep the code valid Python.
-            - Keep the same function signature.
-            - Make the smallest correct local fix you can.
-            - Use the failure output to guide the repair.
-            - Avoid comments, explanations, markdown outside the code block, and unrelated refactors.
-            """
-        ).strip()
-    return textwrap.dedent(
-        f"""
-        {instructions}
-        Current environment state:
-        - seed_id: {seed_id}
-        - role_next: {role}
-        - tests_passed: {tests_passed}
-        - syntax_error: {syntax_error}
-        Current code:
-        ```python
-        {current_code}
-        ```
-        Execution result:
-        {execution_result if execution_result else "None"}
-        Previous actions:
-        {history_block}
-        Return only the full Python code inside triple backticks.
-        """
-    ).strip()
-def get_model_code(client: OpenAI, obs_dict: dict[str, Any], history: list[str]) -> str:
-    role = str(obs_dict.get("role_next", "proposer"))
-    prompt = build_prompt(obs_dict, history)
-    temperature = PROPOSER_TEMPERATURE if role == "proposer" else SOLVER_TEMPERATURE
-    response = client.chat.completions.create(
-        model=MODEL_NAME,
-        messages=[
-            {"role": "system", "content": "You are an expert Python coder."},
-            {"role": "user", "content": prompt},
-        ],
-        temperature=temperature,
-        max_tokens=MAX_TOKENS,
-    )
-    return extract_python_code(response.choices[0].message.content or "")
-async def maybe_await(value: Any) -> Any:
-    if inspect.isawaitable(value):
-        return await value
-    return value
-async def call_env_method(obj: Any, method_name: str, *args: Any) -> Any:
-    method = getattr(obj, method_name)
-    result = method(*args)
-    return await maybe_await(result)
-async def make_env() -> Any:
-    max_retries = 30
-    for attempt in range(max_retries):
-        try:
-            return DebugzeroEnv(base_url=ENV_URL)
-        except Exception as exc:
-            print(
-                f"[SYSTEM ERROR] Env connection to {ENV_URL} failed (attempt {attempt + 1}/{max_retries}): {exc}",
-                file=sys.stderr,
-                flush=True,
-            )
-            if attempt < max_retries - 1:
-                await asyncio.sleep(5.0)
-            else:
-                raise
-def print_live_summary(metrics: dict[str, Any]) -> None:
-    episodes = max(1, int(metrics["episodes"]))
-    proposer_attempts = max(1, int(metrics["proposer_attempts"]))
-    solver_attempts = max(1, int(metrics["solver_attempts"]))
-    rewards = metrics["rewards"]
-    average_reward = (sum(rewards) / len(rewards)) if rewards else 0.0
-    print("\n" + "=" * 80)
-    print("Live API summary")
-    print("=" * 80)
-    print(f"Episode success rate:  {metrics['episode_successes'] / episodes:.2%}")
-    print(f"Proposer syntax rate:  {metrics['proposer_syntax_errors'] / proposer_attempts:.2%}")
-    print(f"Solver syntax rate:    {metrics['solver_syntax_errors'] / solver_attempts:.2%}")
-    print(f"Average step reward:   {average_reward:.2f}")
-    print(f"Average steps/episode: {metrics['total_steps'] / episodes:.2f}")
-    print(f"Representative success: {metrics['representative_success']}")
-    print(f"Representative failure: {metrics['representative_failure']}")
-async def run_live_api_probe() -> dict[str, Any] | None:
-    if not API_KEY:
-        print("Skipping live API probe: OPENAI_API_KEY/API_KEY is not set.")
-        return None
-    if not MODEL_NAME:
-        print("Skipping live API probe: OPENAI_MODEL/MODEL_NAME is not set.")
-        return None
-    client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
-    env = await make_env()
-    metrics = {
-        "episodes": NUM_EPISODES,
-        "episode_successes": 0,
-        "proposer_attempts": 0,
-        "solver_attempts": 0,
-        "proposer_syntax_errors": 0,
-        "solver_syntax_errors": 0,
-        "rewards": [],
-        "total_steps": 0,
-        "representative_success": None,
-        "representative_failure": None,
-    }
-    print("=" * 80)
-    print("Live API probe")
-    print("=" * 80)
-    print(f"API base URL: {API_BASE_URL}")
-    print(f"Model: {MODEL_NAME}")
-    print(f"Env URL: {ENV_URL}")
-    try:
-        for episode in range(1, NUM_EPISODES + 1):
-            result = await call_env_method(env, "reset")
-            obs = getattr(result, "observation", None)
-            done = bool(getattr(result, "done", False))
-            history: list[str] = []
-            success = False
-            seed_id = "unknown"
-            if obs is not None:
-                metadata = getattr(obs, "metadata", {}) or {}
-                seed_id = metadata.get("seed_id", "unknown")
-            print(f"\nEpisode {episode}/{NUM_EPISODES} | seed={seed_id}")
-            for step in range(1, MAX_STEPS + 1):
-                if done or obs is None:
-                    break
-                obs_dict = obs.model_dump() if hasattr(obs, "model_dump") else obs.dict()
-                role = str(obs_dict.get("role_next", "proposer"))
-                if role == "proposer":
-                    metrics["proposer_attempts"] += 1
-                else:
-                    metrics["solver_attempts"] += 1
-                try:
-                    code = await asyncio.to_thread(get_model_code, client, obs_dict, history)
-                except Exception as exc:
-                    print(f"[SYSTEM ERROR] Model generation failed: {exc}", file=sys.stderr, flush=True)
-                    code = str(obs_dict.get("current_code", ""))
-                action = DebugzeroAction(role=role, code=code)
-                action_str = compact_action_string(role, code)
-                result = await call_env_method(env, "step", action)
-                obs = getattr(result, "observation", None)
-                done = bool(getattr(result, "done", False))
-                reward = float(getattr(result, "reward", 0.0) or 0.0)
-                error = extract_env_error(result)
-                metrics["rewards"].append(reward)
-                metrics["total_steps"] += 1
-                if obs is not None and getattr(obs, "syntax_error", False):
-                    if role == "proposer":
-                        metrics["proposer_syntax_errors"] += 1
-                    else:
-                        metrics["solver_syntax_errors"] += 1
-                print(
-                    f"  step={step} role={role} reward={reward:.2f} done={str(done).lower()} error={error or 'null'}",
-                    flush=True,
-                )
-                history.append(f"Step {step}: {action_str} -> reward {reward:.2f}")
-                if done and obs is not None:
-                    success = bool(getattr(obs, "tests_passed", False)) and not bool(
-                        getattr(obs, "syntax_error", False)
-                    )
-                    if success:
-                        metrics["episode_successes"] += 1
-                        if metrics["representative_success"] is None:
-                            metrics["representative_success"] = {
-                                "seed_id": getattr(obs, "metadata", {}).get("seed_id", "unknown"),
-                                "steps": step,
-                                "reward": reward,
-                            }
-                    elif metrics["representative_failure"] is None:
-                        metrics["representative_failure"] = {
-                            "seed_id": getattr(obs, "metadata", {}).get("seed_id", "unknown"),
-                            "steps": step,
-                            "execution_result": getattr(obs, "execution_result", ""),
-                        }
-                    break
-            if not success and metrics["representative_failure"] is None:
-                failure_seed = seed_id
-                failure_output = getattr(obs, "execution_result", "") if obs is not None else ""
-                metrics["representative_failure"] = {
-                    "seed_id": failure_seed,
-                    "steps": min(MAX_STEPS, len(history)),
-                    "execution_result": failure_output,
-                }
-        return metrics
-    finally:
-        await call_env_method(env, "close")
-async def main() -> None:
-    metrics = await run_live_api_probe()
-    if metrics is not None:
-        print_live_summary(metrics)
-if __name__ == "__main__":
-    asyncio.run(main())

+import asyncio
+import inspect
+import json
+import os
+import sys
+import textwrap
+from typing import Any, Optional
+sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
+from dotenv import load_dotenv
+from openai import OpenAI
+from client import DebugzeroEnv
+from models import DebugzeroAction
+load_dotenv()
+API_BASE_URL = os.getenv("API_BASE_URL") or os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
+MODEL_NAME = os.getenv("MODEL_NAME") or os.getenv("OPENAI_MODEL", "meta-llama/llama-3.1-8b-instruct")
+API_KEY = os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY") or os.getenv("HF_TOKEN")
+ENV_URL = os.getenv("DEBUGZERO_ENV_URL", "http://localhost:8000")
+NUM_EPISODES = int(os.getenv("NUM_EPISODES", "6"))
+MAX_STEPS = int(os.getenv("MAX_STEPS", "8"))
+PROPOSER_TEMPERATURE = float(os.getenv("PROPOSER_TEMPERATURE", "0.7"))
+SOLVER_TEMPERATURE = float(os.getenv("SOLVER_TEMPERATURE", "0.2"))
+MAX_TOKENS = int(os.getenv("MAX_TOKENS", "1024"))
+BUG_FOCUS = os.getenv("DEBUGZERO_BUG_FOCUS")
+def extract_python_code(text: str) -> str:
+    content = (text or "").strip()
+    if content.startswith("```"):
+        content = content.split("\n", 1)[-1]
+    if content.endswith("```"):
+        content = content.rsplit("\n", 1)[0]
+    return content.strip()
+def summarize_error(text: str, max_chars: int = 220) -> str:
+    cleaned = " ".join(text.strip().split())
+    if not cleaned:
+        return "null"
+    if len(cleaned) <= max_chars:
+        return cleaned
+    return cleaned[: max_chars - 3].rstrip() + "..."
+def extract_env_error(result: Any) -> Optional[str]:
+    for attr in ("last_action_error", "error", "message"):
+        if hasattr(result, attr):
+            value = getattr(result, attr)
+            if value:
+                return str(value)
+    obs = getattr(result, "observation", None)
+    if obs is None:
+        return None
+    for attr in ("last_action_error", "error"):
+        if hasattr(obs, attr):
+            value = getattr(obs, attr)
+            if value:
+                return str(value)
+    execution_result = getattr(obs, "execution_result", "")
+    if isinstance(execution_result, str) and execution_result:
+        if getattr(obs, "syntax_error", False):
+            return summarize_error(execution_result)
+        if execution_result.startswith("Unsafe import detected."):
+            return execution_result
+        if not getattr(obs, "tests_passed", False):
+            return summarize_error(execution_result)
+    return None
+def compact_action_string(role: str, code: str) -> str:
+    return json.dumps({"role": role, "code": code}, separators=(",", ":"), ensure_ascii=False)
+def build_prompt(obs_dict: dict[str, Any], history: list[str]) -> str:
+    role = str(obs_dict.get("role_next", "proposer"))
+    current_code = str(obs_dict.get("current_code", ""))
+    execution_result = str(obs_dict.get("execution_result", ""))
+    tests_passed = bool(obs_dict.get("tests_passed", False))
+    syntax_error = bool(obs_dict.get("syntax_error", False))
+    metadata = obs_dict.get("metadata", {}) or {}
+    seed_id = metadata.get("seed_id", "unknown")
+    history_block = "\n".join(history[-4:]) if history else "None"
+    if role == "proposer":
+        focus_line = ""
+        if BUG_FOCUS:
+            focus_line = f"- Focus specifically on the `{BUG_FOCUS}` mutation family.\n"
+        instructions = textwrap.dedent(
+            f"""
+            You are the Proposer in a debugging self-play environment.
+            Return a full Python function with exactly one small logical bug injected.
+            Rules:
+            - Keep the code valid Python.
+            - Keep the same function signature.
+            - Preserve the overall structure and formatting as much as possible.
+            - Make exactly one small local behavioral change.
+            - Avoid comments, explanations, markdown outside the code block, and broad rewrites.
+            {focus_line}- Your goal is to make tests fail without creating a syntax error.
+            """
+        ).strip()
+    else:
+        instructions = textwrap.dedent(
+            """
+            You are the Solver in a debugging self-play environment.
+            Return the full fixed Python function.
+            Rules:
+            - Keep the code valid Python.
+            - Keep the same function signature.
+            - Make the smallest correct local fix you can.
+            - Use the failure output to guide the repair.
+            - Avoid comments, explanations, markdown outside the code block, and unrelated refactors.
+            """
+        ).strip()
+    return textwrap.dedent(
+        f"""
+        {instructions}
+        Current environment state:
+        - seed_id: {seed_id}
+        - role_next: {role}
+        - tests_passed: {tests_passed}
+        - syntax_error: {syntax_error}
+        Current code:
+        ```python
+        {current_code}
+        ```
+        Execution result:
+        {execution_result if execution_result else "None"}
+        Previous actions:
+        {history_block}
+        Return only the full Python code inside triple backticks.
+        """
+    ).strip()
+def get_model_code(client: OpenAI, obs_dict: dict[str, Any], history: list[str]) -> str:
+    role = str(obs_dict.get("role_next", "proposer"))
+    prompt = build_prompt(obs_dict, history)
+    temperature = PROPOSER_TEMPERATURE if role == "proposer" else SOLVER_TEMPERATURE
+    response = client.chat.completions.create(
+        model=MODEL_NAME,
+        messages=[
+            {"role": "system", "content": "You are an expert Python coder."},
+            {"role": "user", "content": prompt},
+        ],
+        temperature=temperature,
+        max_tokens=MAX_TOKENS,
+    )
+    return extract_python_code(response.choices[0].message.content or "")
+async def maybe_await(value: Any) -> Any:
+    if inspect.isawaitable(value):
+        return await value
+    return value
+async def call_env_method(obj: Any, method_name: str, *args: Any) -> Any:
+    method = getattr(obj, method_name)
+    result = method(*args)
+    return await maybe_await(result)
+async def make_env() -> Any:
+    max_retries = 30
+    for attempt in range(max_retries):
+        try:
+            return DebugzeroEnv(base_url=ENV_URL)
+        except Exception as exc:
+            print(
+                f"[SYSTEM ERROR] Env connection to {ENV_URL} failed (attempt {attempt + 1}/{max_retries}): {exc}",
+                file=sys.stderr,
+                flush=True,
+            )
+            if attempt < max_retries - 1:
+                await asyncio.sleep(5.0)
+            else:
+                raise
+def print_live_summary(metrics: dict[str, Any]) -> None:
+    episodes = max(1, int(metrics["episodes"]))
+    proposer_attempts = max(1, int(metrics["proposer_attempts"]))
+    solver_attempts = max(1, int(metrics["solver_attempts"]))
+    rewards = metrics["rewards"]
+    average_reward = (sum(rewards) / len(rewards)) if rewards else 0.0
+    print("\n" + "=" * 80)
+    print("Live API summary")
+    print("=" * 80)
+    print(f"Episode success rate:  {metrics['episode_successes'] / episodes:.2%}")
+    print(f"Proposer syntax rate:  {metrics['proposer_syntax_errors'] / proposer_attempts:.2%}")
+    print(f"Solver syntax rate:    {metrics['solver_syntax_errors'] / solver_attempts:.2%}")
+    print(f"Average step reward:   {average_reward:.2f}")
+    print(f"Average steps/episode: {metrics['total_steps'] / episodes:.2f}")
+    print(f"Representative success: {metrics['representative_success']}")
+    print(f"Representative failure: {metrics['representative_failure']}")
+async def run_live_api_probe() -> dict[str, Any] | None:
+    if not API_KEY:
+        print("Skipping live API probe: OPENAI_API_KEY/API_KEY is not set.")
+        return None
+    if not MODEL_NAME:
+        print("Skipping live API probe: OPENAI_MODEL/MODEL_NAME is not set.")
+        return None
+    client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
+    env = await make_env()
+    metrics = {
+        "episodes": NUM_EPISODES,
+        "episode_successes": 0,
+        "proposer_attempts": 0,
+        "solver_attempts": 0,
+        "proposer_syntax_errors": 0,
+        "solver_syntax_errors": 0,
+        "rewards": [],
+        "total_steps": 0,
+        "representative_success": None,
+        "representative_failure": None,
+    }
+    print("=" * 80)
+    print("Live API probe")
+    print("=" * 80)
+    print(f"API base URL: {API_BASE_URL}")
+    print(f"Model: {MODEL_NAME}")
+    print(f"Env URL: {ENV_URL}")
+    try:
+        for episode in range(1, NUM_EPISODES + 1):
+            result = await call_env_method(env, "reset")
+            obs = getattr(result, "observation", None)
+            done = bool(getattr(result, "done", False))
+            history: list[str] = []
+            success = False
+            seed_id = "unknown"
+            if obs is not None:
+                metadata = getattr(obs, "metadata", {}) or {}
+                seed_id = metadata.get("seed_id", "unknown")
+            print(f"\nEpisode {episode}/{NUM_EPISODES} | seed={seed_id}")
+            for step in range(1, MAX_STEPS + 1):
+                if done or obs is None:
+                    break
+                obs_dict = obs.model_dump() if hasattr(obs, "model_dump") else obs.dict()
+                role = str(obs_dict.get("role_next", "proposer"))
+                if role == "proposer":
+                    metrics["proposer_attempts"] += 1
+                else:
+                    metrics["solver_attempts"] += 1
+                try:
+                    code = await asyncio.to_thread(get_model_code, client, obs_dict, history)
+                except Exception as exc:
+                    print(f"[SYSTEM ERROR] Model generation failed: {exc}", file=sys.stderr, flush=True)
+                    code = str(obs_dict.get("current_code", ""))
+                action = DebugzeroAction(role=role, code=code)
+                action_str = compact_action_string(role, code)
+                result = await call_env_method(env, "step", action)
+                obs = getattr(result, "observation", None)
+                done = bool(getattr(result, "done", False))
+                reward = float(getattr(result, "reward", 0.0) or 0.0)
+                error = extract_env_error(result)
+                metrics["rewards"].append(reward)
+                metrics["total_steps"] += 1
+                if obs is not None and getattr(obs, "syntax_error", False):
+                    if role == "proposer":
+                        metrics["proposer_syntax_errors"] += 1
+                    else:
+                        metrics["solver_syntax_errors"] += 1
+                print(
+                    f"  step={step} role={role} reward={reward:.2f} done={str(done).lower()} error={error or 'null'}",
+                    flush=True,
+                )
+                history.append(f"Step {step}: {action_str} -> reward {reward:.2f}")
+                if done and obs is not None:
+                    success = bool(getattr(obs, "tests_passed", False)) and not bool(
+                        getattr(obs, "syntax_error", False)
+                    )
+                    if success:
+                        metrics["episode_successes"] += 1
+                        if metrics["representative_success"] is None:
+                            metrics["representative_success"] = {
+                                "seed_id": (getattr(obs, "metadata", {}) or {}).get("seed_id", "unknown"),
+                                "steps": step,
+                                "reward": reward,
+                            }
+                    elif metrics["representative_failure"] is None:
+                        metrics["representative_failure"] = {
+                            "seed_id": (getattr(obs, "metadata", {}) or {}).get("seed_id", "unknown"),
+                            "steps": step,
+                            "execution_result": getattr(obs, "execution_result", ""),
+                        }
+                    break
+            if not success and metrics["representative_failure"] is None:
+                failure_seed = seed_id
+                failure_output = getattr(obs, "execution_result", "") if obs is not None else ""
+                metrics["representative_failure"] = {
+                    "seed_id": failure_seed,
+                    "steps": min(MAX_STEPS, len(history)),
+                    "execution_result": failure_output,
+                }
+        return metrics
+    finally:
+        await call_env_method(env, "close")
+async def main() -> None:
+    metrics = await run_live_api_probe()
+    if metrics is not None:
+        print_live_summary(metrics)
+if __name__ == "__main__":
+    asyncio.run(main())
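
Review note: `make_env` above wraps environment construction in a fixed retry loop (30 attempts, 5 seconds apart), and `inference.py` repeats the same loop for both its Docker path and its URL path. A minimal sketch of how that loop could be factored into one helper follows. The names `retry_async` and `make_flaky_factory` are hypothetical and are not part of this commit, and the sketch assumes the factory either raises on failure or returns a value (possibly awaitable) on success.

```python
import asyncio
import inspect
import sys
from typing import Any, Callable


async def retry_async(
    factory: Callable[[], Any],
    *,
    label: str,
    max_retries: int = 30,
    delay_seconds: float = 5.0,
) -> Any:
    """Call `factory` until it succeeds, awaiting the result when it is awaitable.

    Hypothetical helper: mirrors the retry loops in `make_env`, re-raising the
    last exception once `max_retries` attempts have failed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            value = factory()
            if inspect.isawaitable(value):
                value = await value
            return value
        except Exception as exc:
            print(
                f"[SYSTEM ERROR] {label} failed (attempt {attempt}/{max_retries}): {exc}",
                file=sys.stderr,
                flush=True,
            )
            if attempt == max_retries:
                raise
            await asyncio.sleep(delay_seconds)


def make_flaky_factory(failures: int) -> Callable[[], str]:
    """Hypothetical stand-in for an environment constructor that fails a few times."""
    state = {"calls": 0}

    def factory() -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("environment not ready")
        return f"connected after {state['calls']} calls"

    return factory


if __name__ == "__main__":
    # Two failures, then success on the third call; no delay so the demo is instant.
    print(asyncio.run(retry_async(make_flaky_factory(2), label="demo", delay_seconds=0.0)))
```

With a helper like this, a call such as `await retry_async(lambda: DebugzeroEnv(base_url=ENV_URL), label="Env connection")` would cover the URL path, assuming the constructor raises when the server is unreachable, which this diff does not confirm.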

inference.py CHANGED Viewed

@@ -1,332 +1,332 @@
-import asyncio
-import inspect
-import json
-import os
-import sys
-import textwrap
-from typing import Any, List, Optional
-from dotenv import load_dotenv
-from openai import OpenAI
-from client import DebugzeroEnv
-from models import DebugzeroAction
-load_dotenv()
-API_BASE_URL = os.getenv("API_BASE_URL") or os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
-MODEL_NAME = os.getenv("MODEL_NAME") or os.getenv("OPENAI_MODEL", "meta-llama/llama-3.1-8b-instruct")
-API_KEY = os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY") or os.getenv("HF_TOKEN")
-LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
-ENV_URL = os.getenv("DEBUGZERO_ENV_URL", "http://localhost:8000")
-TASK_NAME = os.getenv("DEBUGZERO_TASK", "debugging-self-play")
-BENCHMARK = os.getenv("DEBUGZERO_BENCHMARK", "debugzero")
-BUG_FOCUS = os.getenv("DEBUGZERO_BUG_FOCUS")
-NUM_EPISODES = int(os.getenv("NUM_EPISODES", "3"))
-MAX_STEPS = int(os.getenv("MAX_STEPS", "8"))
-PROPOSER_TEMPERATURE = float(os.getenv("PROPOSER_TEMPERATURE", "0.7"))
-SOLVER_TEMPERATURE = float(os.getenv("SOLVER_TEMPERATURE", "0.2"))
-MAX_TOKENS = int(os.getenv("MAX_TOKENS", "1024"))
-def extract_python_code(text: str) -> str:
-    content = (text or "").strip()
-    if content.startswith("```"):
-        content = content.split("\n", 1)[-1]
-    if content.endswith("```"):
-        content = content.rsplit("\n", 1)[0]
-    return content.strip()
-def compact_action_string(role: str, code: str) -> str:
-    obj = {"role": role, "code": code}
-    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
-def log_start(task: str, env: str, model: str) -> None:
-    print(f"[START] task={task} env={env} model={model}", flush=True)
-def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
-    error_val = error if error is not None else "null"
-    action_str = action.replace("\n", "\\n")
-    print(
-        f"[STEP] step={step} action={action_str} reward={reward:.2f} done={str(done).lower()} error={error_val}",
-        flush=True,
-    )
-def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
-    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
-    print(
-        f"[END] success={str(success).lower()} steps={steps} score={score:.4f} rewards={rewards_str}",
-        flush=True,
-    )
-def summarize_error(text: str, max_chars: int = 220) -> str:
-    cleaned = " ".join(text.strip().split())
-    if not cleaned:
-        return "null"
-    if len(cleaned) <= max_chars:
-        return cleaned
-    return cleaned[: max_chars - 3].rstrip() + "..."
-def extract_env_error(result: Any) -> Optional[str]:
-    for attr in ("last_action_error", "error", "message"):
-        if hasattr(result, attr):
-            value = getattr(result, attr)
-            if value:
-                return str(value)
-    obs = getattr(result, "observation", None)
-    if obs is None:
-        return None
-    for attr in ("last_action_error", "error"):
-        if hasattr(obs, attr):
-            value = getattr(obs, attr)
-            if value:
-                return str(value)
-    execution_result = getattr(obs, "execution_result", "")
-    if isinstance(execution_result, str) and execution_result:
-        if getattr(obs, "syntax_error", False):
-            return summarize_error(execution_result)
-        if execution_result.startswith("Unsafe import detected."):
-            return execution_result
-        if not getattr(obs, "tests_passed", False):
-            return summarize_error(execution_result)
-    return None
-def build_prompt(obs_dict: dict[str, Any], history: List[str]) -> str:
-    role = str(obs_dict.get("role_next", "proposer"))
-    current_code = str(obs_dict.get("current_code", ""))
-    execution_result = str(obs_dict.get("execution_result", ""))
-    tests_passed = bool(obs_dict.get("tests_passed", False))
-    syntax_error = bool(obs_dict.get("syntax_error", False))
-    metadata = obs_dict.get("metadata", {}) or {}
-    seed_id = metadata.get("seed_id", "unknown")
-    history_block = "\n".join(history[-4:]) if history else "None"
-    if role == "proposer":
-        focus_line = ""
-        if BUG_FOCUS:
-            focus_line = f"- Focus specifically on the `{BUG_FOCUS}` mutation family.\n"
-        task_block = textwrap.dedent(
-            f"""
-            You are the Proposer in a debugging self-play environment.
-            Return a full Python function with exactly one small logical bug injected.
-            Rules:
-            - Keep the code valid Python.
-            - Keep the same function signature.
-            - Preserve the overall structure and formatting as much as possible.
-            - Make exactly one small local behavioral change.
-            - Avoid comments, explanations, markdown outside the code block, and broad rewrites.
-            {focus_line}- Your goal is to make tests fail without creating a syntax error.
-            """
-        ).strip()
-    else:
-        task_block = textwrap.dedent(
-            """
-            You are the Solver in a debugging self-play environment.
-            Return the full fixed Python function.
-            Rules:
-            - Keep the code valid Python.
-            - Keep the same function signature.
-            - Make the smallest correct local fix you can.
-            - Use the failure output to guide the repair.
-            - Avoid comments, explanations, markdown outside the code block, and unrelated refactors.
-            """
-        ).strip()
-    return textwrap.dedent(
-        f"""
-        {task_block}
-        Current environment state:
-        - seed_id: {seed_id}
-        - role_next: {role}
-        - tests_passed: {tests_passed}
-        - syntax_error: {syntax_error}
-        Current code:
-        ```python
-        {current_code}
-        ```
-        Execution result:
-        {execution_result if execution_result else "None"}
-        Previous actions:
-        {history_block}
-        Return only the full Python code inside triple backticks.
-        """
-    ).strip()
-def get_model_code(client: OpenAI, obs_dict: dict[str, Any], history: List[str]) -> str:
-    role = str(obs_dict.get("role_next", "proposer"))
-    prompt = build_prompt(obs_dict, history)
-    temperature = PROPOSER_TEMPERATURE if role == "proposer" else SOLVER_TEMPERATURE
-    response = client.chat.completions.create(
-        model=MODEL_NAME,
-        messages=[
-            {"role": "system", "content": "You are an expert Python coder."},
-            {"role": "user", "content": prompt},
-        ],
-        temperature=temperature,
-        max_tokens=MAX_TOKENS,
-    )
-    return extract_python_code(response.choices[0].message.content or "")
-async def maybe_await(value: Any) -> Any:
-    if inspect.isawaitable(value):
-        return await value
-    return value
-async def call_env_method(obj: Any, method_name: str, *args: Any) -> Any:
-    method = getattr(obj, method_name)
-    result = method(*args)
-    return await maybe_await(result)
-async def make_env() -> Any:
-    max_retries = 30
-    if LOCAL_IMAGE_NAME:
-        for attempt in range(max_retries):
-            try:
-                env = DebugzeroEnv.from_docker_image(LOCAL_IMAGE_NAME)
-                return await maybe_await(env)
-            except Exception as exc:
-                print(
-                    f"[SYSTEM ERROR] Failed to start Docker environment (attempt {attempt + 1}/{max_retries}): {exc}",
-                    file=sys.stderr,
-                    flush=True,
-                )
-                if attempt < max_retries - 1:
-                    await asyncio.sleep(5.0)
-                else:
-                    raise
-    for attempt in range(max_retries):
-        try:
-            return DebugzeroEnv(base_url=ENV_URL)
-        except Exception as exc:
-            print(
-                f"[SYSTEM ERROR] Env connection to {ENV_URL} failed (attempt {attempt + 1}/{max_retries}): {exc}",
-                file=sys.stderr,
-                flush=True,
-            )
-            if attempt < max_retries - 1:
-                await asyncio.sleep(5.0)
-            else:
-                raise
-async def main() -> None:
-    if not API_KEY:
-        print("[SYSTEM ERROR] Missing API key. Set API_KEY, OPENAI_API_KEY, or HF_TOKEN.", file=sys.stderr, flush=True)
-        return
-    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
-    env = None
-    try:
-        env = await make_env()
-        for _episode in range(1, NUM_EPISODES + 1):
-            history: List[str] = []
-            rewards: List[float] = []
-            steps_taken = 0
-            score = 0.0
-            success = False
-            log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
-            try:
-                result = await call_env_method(env, "reset")
-                done = bool(getattr(result, "done", False))
-                obs = getattr(result, "observation", None)
-                for step in range(1, MAX_STEPS + 1):
-                    if done or obs is None:
-                        break
-                    obs_dict = obs.model_dump() if hasattr(obs, "model_dump") else obs.dict()
-                    role = str(obs_dict.get("role_next", "proposer"))
-                    try:
-                        code = await asyncio.to_thread(get_model_code, client, obs_dict, history)
-                        env_action = DebugzeroAction(role=role, code=code)
-                        action_str = compact_action_string(role, code)
-                    except Exception as exc:
-                        print(f"[SYSTEM ERROR] Model generation failed: {exc}", file=sys.stderr, flush=True)
-                        code = obs_dict.get("current_code", "")
-                        env_action = DebugzeroAction(role=role, code=code)
-                        action_str = compact_action_string(role, code)
-                    result = await call_env_method(env, "step", env_action)
-                    obs = getattr(result, "observation", None)
-                    done = bool(getattr(result, "done", False))
-                    reward = float(getattr(result, "reward", 0.0) or 0.0)
-                    rewards.append(reward)
-                    steps_taken = step
-                    error = extract_env_error(result)
-                    if obs is not None:
-                        score = float(getattr(obs, "score", score) or score)
-                        if done:
-                            success = bool(getattr(obs, "tests_passed", False)) and not bool(
-                                getattr(obs, "syntax_error", False)
-                            )
-                    score = max(0.0001, min(0.9999, score))
-                    log_step(step=step, action=action_str, reward=reward, done=done, error=error)
-                    history.append(f"Step {step}: {action_str} -> reward {reward:.2f}")
-                score = max(0.0001, min(0.9999, float(score)))
-            except Exception as exc:
-                print(f"[SYSTEM ERROR] {exc}", file=sys.stderr, flush=True)
-                success = False
-            finally:
-                log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
-    except Exception as exc:
-        print(f"[SYSTEM ERROR] {exc}", file=sys.stderr, flush=True)
-    finally:
-        try:
-            if env is not None and hasattr(env, "close"):
-                await call_env_method(env, "close")
-        except Exception:
-            pass
-if __name__ == "__main__":
-    try:
-        asyncio.run(main())
-        sys.exit(0)
-    except Exception as exc:
-        print(f"[CRITICAL VALIDATION ERROR] {exc}", file=sys.stderr, flush=True)
-        sys.exit(0)
-    except BaseException as base_exc:
-        print(f"[BASE EXCEPTION] {base_exc}", file=sys.stderr, flush=True)
-        sys.exit(0)

+import asyncio
+import inspect
+import json
+import os
+import sys
+import textwrap
+from typing import Any, List, Optional
+from dotenv import load_dotenv
+from openai import OpenAI
+from client import DebugzeroEnv
+from models import DebugzeroAction
+load_dotenv()
+API_BASE_URL = os.getenv("API_BASE_URL") or os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
+MODEL_NAME = os.getenv("MODEL_NAME") or os.getenv("OPENAI_MODEL", "meta-llama/llama-3.1-8b-instruct")
+API_KEY = os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY") or os.getenv("HF_TOKEN")
+LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
+ENV_URL = os.getenv("DEBUGZERO_ENV_URL", "http://localhost:8000")
+TASK_NAME = os.getenv("DEBUGZERO_TASK", "debugging-self-play")
+BENCHMARK = os.getenv("DEBUGZERO_BENCHMARK", "debugzero")
+BUG_FOCUS = os.getenv("DEBUGZERO_BUG_FOCUS")
+NUM_EPISODES = int(os.getenv("NUM_EPISODES", "3"))
+MAX_STEPS = int(os.getenv("MAX_STEPS", "8"))
+PROPOSER_TEMPERATURE = float(os.getenv("PROPOSER_TEMPERATURE", "0.7"))
+SOLVER_TEMPERATURE = float(os.getenv("SOLVER_TEMPERATURE", "0.2"))
+MAX_TOKENS = int(os.getenv("MAX_TOKENS", "1024"))
+def extract_python_code(text: str) -> str:
+    content = (text or "").strip()
+    if content.startswith("```"):
+        content = content.split("\n", 1)[-1]
+    if content.endswith("```"):
+        content = content.rsplit("\n", 1)[0]
+    return content.strip()
+def compact_action_string(role: str, code: str) -> str:
+    obj = {"role": role, "code": code}
+    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    error_val = error if error is not None else "null"
+    action_str = action.replace("\n", "\\n")
+    print(
+        f"[STEP] step={step} action={action_str} reward={reward:.2f} done={str(done).lower()} error={error_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={score:.4f} rewards={rewards_str}",
+        flush=True,
+    )
+def summarize_error(text: str, max_chars: int = 220) -> str:
+    cleaned = " ".join(text.strip().split())
+    if not cleaned:
+        return "null"
+    if len(cleaned) <= max_chars:
+        return cleaned
+    return cleaned[: max_chars - 3].rstrip() + "..."
+def extract_env_error(result: Any) -> Optional[str]:
+    for attr in ("last_action_error", "error", "message"):
+        if hasattr(result, attr):
+            value = getattr(result, attr)
+            if value:
+                return str(value)
+    obs = getattr(result, "observation", None)
+    if obs is None:
+        return None
+    for attr in ("last_action_error", "error"):
+        if hasattr(obs, attr):
+            value = getattr(obs, attr)
+            if value:
+                return str(value)
+    execution_result = getattr(obs, "execution_result", "")
+    if isinstance(execution_result, str) and execution_result:
+        if getattr(obs, "syntax_error", False):
+            return summarize_error(execution_result)
+        if execution_result.startswith("Unsafe import detected."):
+            return execution_result
+        if not getattr(obs, "tests_passed", False):
+            return summarize_error(execution_result)
+    return None
+def build_prompt(obs_dict: dict[str, Any], history: List[str]) -> str:
+    role = str(obs_dict.get("role_next", "proposer"))
+    current_code = str(obs_dict.get("current_code", ""))
+    execution_result = str(obs_dict.get("execution_result", ""))
+    tests_passed = bool(obs_dict.get("tests_passed", False))
+    syntax_error = bool(obs_dict.get("syntax_error", False))
+    metadata = obs_dict.get("metadata", {}) or {}
+    seed_id = metadata.get("seed_id", "unknown")
+    history_block = "\n".join(history[-4:]) if history else "None"
+    if role == "proposer":
+        focus_line = ""
+        if BUG_FOCUS:
+            focus_line = f"- Focus specifically on the `{BUG_FOCUS}` mutation family.\n"
+        task_block = textwrap.dedent(
+            f"""
+            You are the Proposer in a debugging self-play environment.
+            Return a full Python function with exactly one small logical bug injected.
+            Rules:
+            - Keep the code valid Python.
+            - Keep the same function signature.
+            - Preserve the overall structure and formatting as much as possible.
+            - Make exactly one small local behavioral change.
+            - Avoid comments, explanations, markdown outside the code block, and broad rewrites.
+            {focus_line}- Your goal is to make tests fail without creating a syntax error.
+            """
+        ).strip()
+    else:
+        task_block = textwrap.dedent(
+            """
+            You are the Solver in a debugging self-play environment.
+            Return the full fixed Python function.
+            Rules:
+            - Keep the code valid Python.
+            - Keep the same function signature.
+            - Make the smallest correct local fix you can.
+            - Use the failure output to guide the repair.
+            - Avoid comments, explanations, markdown outside the code block, and unrelated refactors.
+            """
+        ).strip()
+    return textwrap.dedent(
+        f"""
+        {task_block}
+        Current environment state:
+        - seed_id: {seed_id}
+        - role_next: {role}
+        - tests_passed: {tests_passed}
+        - syntax_error: {syntax_error}
+        Current code:
+        ```python
+        {current_code}
+        ```
+        Execution result:
+        {execution_result if execution_result else "None"}
+        Previous actions:
+        {history_block}
+        Return only the full Python code inside triple backticks.
+        """
+    ).strip()
+def get_model_code(client: OpenAI, obs_dict: dict[str, Any], history: List[str]) -> str:
+    role = str(obs_dict.get("role_next", "proposer"))
+    prompt = build_prompt(obs_dict, history)
+    temperature = PROPOSER_TEMPERATURE if role == "proposer" else SOLVER_TEMPERATURE
+    response = client.chat.completions.create(
+        model=MODEL_NAME,
+        messages=[
+            {"role": "system", "content": "You are an expert Python coder."},
+            {"role": "user", "content": prompt},
+        ],
+        temperature=temperature,
+        max_tokens=MAX_TOKENS,
+    )
+    return extract_python_code(response.choices[0].message.content or "")
+async def maybe_await(value: Any) -> Any:
+    if inspect.isawaitable(value):
+        return await value
+    return value
+async def call_env_method(obj: Any, method_name: str, *args: Any) -> Any:
+    method = getattr(obj, method_name)
+    result = method(*args)
+    return await maybe_await(result)
+async def make_env() -> Any:
+    max_retries = 30
+    if LOCAL_IMAGE_NAME:
+        for attempt in range(max_retries):
+            try:
+                env = DebugzeroEnv.from_docker_image(LOCAL_IMAGE_NAME)
+                return await maybe_await(env)
+            except Exception as exc:
+                print(
+                    f"[SYSTEM ERROR] Failed to start Docker environment (attempt {attempt + 1}/{max_retries}): {exc}",
+                    file=sys.stderr,
+                    flush=True,
+                )
+                if attempt < max_retries - 1:
+                    await asyncio.sleep(5.0)
+                else:
+                    raise
+    for attempt in range(max_retries):
+        try:
+            return DebugzeroEnv(base_url=ENV_URL)
+        except Exception as exc:
+            print(
+                f"[SYSTEM ERROR] Env connection to {ENV_URL} failed (attempt {attempt + 1}/{max_retries}): {exc}",
+                file=sys.stderr,
+                flush=True,
+            )
+            if attempt < max_retries - 1:
+                await asyncio.sleep(5.0)
+            else:
+                raise
+async def main() -> None:
+    if not API_KEY:
+        print("[SYSTEM ERROR] Missing API key. Set API_KEY, OPENAI_API_KEY, or HF_TOKEN.", file=sys.stderr, flush=True)
+        return
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    env = None
+    try:
+        env = await make_env()
+        for _episode in range(1, NUM_EPISODES + 1):
+            history: List[str] = []
+            rewards: List[float] = []
+            steps_taken = 0
+            score = 0.0
+            success = False
+            log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
+            try:
+                result = await call_env_method(env, "reset")
+                done = bool(getattr(result, "done", False))
+                obs = getattr(result, "observation", None)
+                for step in range(1, MAX_STEPS + 1):
+                    if done or obs is None:
+                        break
+                    obs_dict = obs.model_dump() if hasattr(obs, "model_dump") else obs.dict()
+                    role = str(obs_dict.get("role_next", "proposer"))
+                    try:
+                        code = await asyncio.to_thread(get_model_code, client, obs_dict, history)
+                        env_action = DebugzeroAction(role=role, code=code)
+                        action_str = compact_action_string(role, code)
+                    except Exception as exc:
+                        print(f"[SYSTEM ERROR] Model generation failed: {exc}", file=sys.stderr, flush=True)
+                        code = obs_dict.get("current_code", "")
+                        env_action = DebugzeroAction(role=role, code=code)
+                        action_str = compact_action_string(role, code)
+                    result = await call_env_method(env, "step", env_action)
+                    obs = getattr(result, "observation", None)
+                    done = bool(getattr(result, "done", False))
+                    reward = float(getattr(result, "reward", 0.0) or 0.0)
+                    rewards.append(reward)
+                    steps_taken = step
+                    error = extract_env_error(result)
+                    if obs is not None:
+                        score = float(getattr(obs, "score", score) or score)
+                        if done:
+                            success = bool(getattr(obs, "tests_passed", False)) and not bool(
+                                getattr(obs, "syntax_error", False)
+                            )
+                    score = max(0.0001, min(0.9999, score))
+                    log_step(step=step, action=action_str, reward=reward, done=done, error=error)
+                    history.append(f"Step {step}: {action_str} -> reward {reward:.2f}")
+                score = max(0.0001, min(0.9999, float(score)))
+            except Exception as exc:
+                print(f"[SYSTEM ERROR] {exc}", file=sys.stderr, flush=True)
+                success = False
+            finally:
+                log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+    except Exception as exc:
+        print(f"[SYSTEM ERROR] {exc}", file=sys.stderr, flush=True)
+    finally:
+        try:
+            if env is not None and hasattr(env, "close"):
+                await call_env_method(env, "close")
+        except Exception:
+            pass
+if __name__ == "__main__":
+    try:
+        asyncio.run(main())
+        sys.exit(0)
+    except Exception as exc:
+        print(f"[CRITICAL VALIDATION ERROR] {exc}", file=sys.stderr, flush=True)
+        sys.exit(0)
+    except BaseException as base_exc:
+        print(f"[BASE EXCEPTION] {base_exc}", file=sys.stderr, flush=True)
+        sys.exit(0)
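
Review note: `extract_python_code` strips a fence only when the reply begins or ends with triple backticks. The prompt asks for "only the full Python code inside triple backticks", but a model may still add a sentence before or after the block, and that prose would then be submitted as code. The sketch below shows one more tolerant approach, not the behavior of this commit. `extract_fenced_code` is a hypothetical name, and the regular expression assumes the first fenced block in the reply is the intended one.

```python
import re

# Opening fence with an optional language tag, then the body up to the next fence.
_FENCE_RE = re.compile(r"```[a-zA-Z0-9_+-]*[ \t]*\r?\n(.*?)```", re.DOTALL)


def extract_fenced_code(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none exists."""
    content = (text or "").strip()
    match = _FENCE_RE.search(content)
    if match:
        return match.group(1).strip()
    # No complete fence: fall back to dropping a leading fence line, which
    # covers replies that were truncated before the closing backticks.
    if content.startswith("```"):
        content = content.split("\n", 1)[-1]
    return content.strip()


if __name__ == "__main__":
    reply = "Here is the fix:\n```python\ndef add(a, b):\n    return a + b\n```\nHope this helps."
    print(extract_fenced_code(reply))
```

Taking the first block is a design choice rather than a requirement. A reply containing several blocks is ambiguous either way, and the prompt already asks for exactly one.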

models.py CHANGED Viewed

@@ -22,15 +22,15 @@ class DebugzeroAction(Action):
     code: str = Field(..., description="Code injected (by proposer) or fixed (by solver)")
-class DebugzeroObservation(Observation):
-    """Observation from the DebugZero environment following sandbox execution."""
-    role_next: str = Field(default="proposer", description="The role supposed to play next")
-    current_code: str = Field(default="", description="The current state of the python code")
-    execution_result: str = Field(default="", description="Result of evaluating tests in the sandbox")
-    tests_passed: bool = Field(default=False, description="Whether the tests passed")
-    syntax_error: bool = Field(default=False, description="Whether the code had a parse/syntax error")
-    score: float = Field(default=0.0, description="Episode progress score in the range [0.0, 1.0]")
 class DebugzeroState(State):
     """State for the DebugZero environment, extending default state with seed context."""

     code: str = Field(..., description="Code injected (by proposer) or fixed (by solver)")
+class DebugzeroObservation(Observation):
+    """Observation from the DebugZero environment following sandbox execution."""
+    role_next: str = Field(default="proposer", description="The role supposed to play next")
+    current_code: str = Field(default="", description="The current state of the python code")
+    execution_result: str = Field(default="", description="Result of evaluating tests in the sandbox")
+    tests_passed: bool = Field(default=False, description="Whether the tests passed")
+    syntax_error: bool = Field(default=False, description="Whether the code had a parse/syntax error")
+    score: float = Field(default=0.0, description="Episode progress score in the range [0.0, 1.0]")
 class DebugzeroState(State):
     """State for the DebugZero environment, extending default state with seed context."""

server/debugZero_environment.py CHANGED

@@ -1,224 +1,224 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-"""
-DebugZero Environment Implementation for adversarial bug-fixing self-play.
-"""
-from __future__ import annotations
-from uuid import uuid4
-from openenv.core.env_server.interfaces import Environment
-try:
-    from ..models import DebugzeroAction, DebugzeroObservation, DebugzeroState
-    from .tasks import SEED_BANK, SeedSpec
-except ImportError:
-    from models import DebugzeroAction, DebugzeroObservation, DebugzeroState
-    from server.tasks import SEED_BANK, SeedSpec
-try:
-    from .bug_injector import infer_bug_operator
-    from .graders import (
-        compute_ast_distance,
-        compute_proposer_reward,
-        compute_solver_reward,
-        is_effectively_unchanged,
-    )
-    from .executor import execute_code
-except ImportError:
-    from bug_injector import infer_bug_operator
-    from graders import (
-        compute_ast_distance,
-        compute_proposer_reward,
-        compute_solver_reward,
-        is_effectively_unchanged,
-    )
-    from executor import execute_code
-class DebugzeroEnvironment(Environment):
-    """
-    Dual-role DebugZero Environment wrapping a Python sandbox execution
-    for Proposer bug injection and Solver bug fixing.
-    """
-    SUPPORTS_CONCURRENT_SESSIONS: bool = True
-    def __init__(self):
-        self._reset_count = 0
-        self._current_seed = SEED_BANK[0]
-        self._current_bug_operator: str | None = None
-        self._current_score = 0.0
-        self._proposer_created_bug = False
-        self._state = self._build_state(self._current_seed)
-    def reset(self) -> DebugzeroObservation:
-        seed = SEED_BANK[self._reset_count % len(SEED_BANK)]
-        self._reset_count += 1
-        self._current_seed = seed
-        self._current_bug_operator = None
-        self._current_score = 0.0
-        self._proposer_created_bug = False
-        self._state = self._build_state(seed)
-        return self._build_observation(
-            role_next="proposer",
-            execution_result="",
-            tests_passed=True,
-            syntax_error=False,
-            done=False,
-            reward=0.0,
-            score=0.0,
-        )
-    def step(self, action: DebugzeroAction) -> DebugzeroObservation:  # type: ignore[override]
-        self._state.step_count += 1
-        tests = self._current_seed.test
-        if action.role == "proposer":
-            self._state.current_code = action.code
-            result = execute_code(self._state.current_code, tests)
-            self._state.role_turn = "solver"
-            reward, score = self._proposer_step_feedback(action.code, result)
-            return self._build_observation(
-                role_next="solver",
-                execution_result=self._truncate_execution_output(result.output),
-                tests_passed=result.passed,
-                syntax_error=result.syntax_error,
-                done=False,
-                reward=reward,
-                score=score,
-            )
-        if action.role == "solver":
-            self._state.current_code = action.code
-            result = execute_code(self._state.current_code, tests)
-            self._state.role_turn = "end"
-            reward, score = self._solver_step_feedback(result)
-            return self._build_observation(
-                role_next="proposer",
-                execution_result=self._truncate_execution_output(result.output),
-                tests_passed=result.passed,
-                syntax_error=result.syntax_error,
-                done=True,
-                reward=reward,
-                score=score,
-            )
-        self._current_score = 0.0
-        self._proposer_created_bug = False
-        return self._build_observation(
-            role_next="end",
-            execution_result="",
-            tests_passed=False,
-            syntax_error=False,
-            done=True,
-            reward=0.0,
-            score=0.0,
-        )
-    @property
-    def state(self) -> DebugzeroState:
-        return self._state
-    def _build_state(self, seed: SeedSpec) -> DebugzeroState:
-        return DebugzeroState(
-            episode_id=str(uuid4()),
-            step_count=0,
-            seed_id=seed.seed_id,
-            original_code=seed.original_code,
-            current_code=seed.original_code,
-            role_turn="proposer",
-        )
-    def _build_observation(
-        self,
-        *,
-        role_next: str,
-        execution_result: str,
-        tests_passed: bool,
-        syntax_error: bool,
-        done: bool,
-        reward: float,
-        score: float,
-    ) -> DebugzeroObservation:
-        self._current_score = score
-        return DebugzeroObservation(
-            role_next=role_next,
-            current_code=self._state.current_code,
-            execution_result=execution_result,
-            tests_passed=tests_passed,
-            syntax_error=syntax_error,
-            score=score,
-            done=done,
-            reward=reward,
-            metadata=self._observation_metadata(),
-        )
-    def _proposer_step_feedback(self, candidate_code: str, result: object) -> tuple[float, float]:
-        original_code = self._state.original_code
-        execution_output = getattr(result, "output", "") or ""
-        syntax_error = bool(getattr(result, "syntax_error", False))
-        tests_passed = bool(getattr(result, "passed", False))
-        unsafe_code = execution_output.startswith("Unsafe import detected.")
-        unchanged_code = is_effectively_unchanged(original_code, candidate_code)
-        changed_but_passing = (not unchanged_code) and tests_passed and (not syntax_error)
-        plausibility_score = 0.0 if syntax_error else compute_ast_distance(original_code, candidate_code)
-        reward = compute_proposer_reward(
-            {
-                "seed_id": self._state.seed_id,
-                "tests_passed": tests_passed,
-                "syntax_error": syntax_error,
-                "unsafe_code": unsafe_code,
-                "unchanged_code": unchanged_code,
-                "changed_but_passing": changed_but_passing,
-                "plausibility_score": plausibility_score,
-            }
-        )
-        valid_bug = (not tests_passed) and (not syntax_error) and (not unsafe_code)
-        self._proposer_created_bug = valid_bug
-        self._current_bug_operator = infer_bug_operator(original_code, candidate_code) if valid_bug else None
-        score = 0.5 if valid_bug else 0.0
-        return reward, score
-    def _solver_step_feedback(self, result: object) -> tuple[float, float]:
-        execution_output = getattr(result, "output", "") or ""
-        syntax_error = bool(getattr(result, "syntax_error", False))
-        tests_passed = bool(getattr(result, "passed", False))
-        unsafe_code = execution_output.startswith("Unsafe import detected.")
-        reward = compute_solver_reward(
-            {
-                "seed_id": self._state.seed_id,
-                "tests_passed": tests_passed,
-                "syntax_error": syntax_error,
-                "unsafe_code": unsafe_code,
-            }
-        )
-        solved = tests_passed and (not syntax_error) and (not unsafe_code)
-        score = 1.0 if solved else (0.5 if self._proposer_created_bug else 0.0)
-        return reward, score
-    def _truncate_execution_output(self, output: str) -> str:
-        return output[:500] if output else ""
-    def _observation_metadata(self) -> dict[str, str]:
-        metadata = {
-            "seed_id": self._state.seed_id,
-            "original_code": self._state.original_code,
-        }
-        if self._current_bug_operator:
-            metadata["bug_operator"] = self._current_bug_operator
-        return metadata

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""
+DebugZero Environment Implementation for adversarial bug-fixing self-play.
+"""
+from __future__ import annotations
+from uuid import uuid4
+from openenv.core.env_server.interfaces import Environment
+try:
+    from ..models import DebugzeroAction, DebugzeroObservation, DebugzeroState
+    from .tasks import SEED_BANK, SeedSpec
+except ImportError:
+    from models import DebugzeroAction, DebugzeroObservation, DebugzeroState
+    from server.tasks import SEED_BANK, SeedSpec
+try:
+    from .bug_injector import infer_bug_operator
+    from .graders import (
+        compute_ast_distance,
+        compute_proposer_reward,
+        compute_solver_reward,
+        is_effectively_unchanged,
+    )
+    from .executor import execute_code
+except ImportError:
+    from bug_injector import infer_bug_operator
+    from graders import (
+        compute_ast_distance,
+        compute_proposer_reward,
+        compute_solver_reward,
+        is_effectively_unchanged,
+    )
+    from executor import execute_code
+class DebugzeroEnvironment(Environment):
+    """
+    Dual-role DebugZero Environment wrapping a Python sandbox execution
+    for Proposer bug injection and Solver bug fixing.
+    """
+    SUPPORTS_CONCURRENT_SESSIONS: bool = True
+    def __init__(self):
+        self._reset_count = 0
+        self._current_seed = SEED_BANK[0]
+        self._current_bug_operator: str | None = None
+        self._current_score = 0.0
+        self._proposer_created_bug = False
+        self._state = self._build_state(self._current_seed)
+    def reset(self) -> DebugzeroObservation:
+        seed = SEED_BANK[self._reset_count % len(SEED_BANK)]
+        self._reset_count += 1
+        self._current_seed = seed
+        self._current_bug_operator = None
+        self._current_score = 0.0
+        self._proposer_created_bug = False
+        self._state = self._build_state(seed)
+        return self._build_observation(
+            role_next="proposer",
+            execution_result="",
+            tests_passed=True,
+            syntax_error=False,
+            done=False,
+            reward=0.0,
+            score=0.0,
+        )
+    def step(self, action: DebugzeroAction) -> DebugzeroObservation:  # type: ignore[override]
+        self._state.step_count += 1
+        tests = self._current_seed.test
+        if action.role == "proposer":
+            self._state.current_code = action.code
+            result = execute_code(self._state.current_code, tests)
+            self._state.role_turn = "solver"
+            reward, score = self._proposer_step_feedback(action.code, result)
+            return self._build_observation(
+                role_next="solver",
+                execution_result=self._truncate_execution_output(result.output),
+                tests_passed=result.passed,
+                syntax_error=result.syntax_error,
+                done=False,
+                reward=reward,
+                score=score,
+            )
+        if action.role == "solver":
+            self._state.current_code = action.code
+            result = execute_code(self._state.current_code, tests)
+            self._state.role_turn = "end"
+            reward, score = self._solver_step_feedback(result)
+            return self._build_observation(
+                role_next="proposer",
+                execution_result=self._truncate_execution_output(result.output),
+                tests_passed=result.passed,
+                syntax_error=result.syntax_error,
+                done=True,
+                reward=reward,
+                score=score,
+            )
+        self._current_score = 0.0
+        self._proposer_created_bug = False
+        return self._build_observation(
+            role_next="end",
+            execution_result="",
+            tests_passed=False,
+            syntax_error=False,
+            done=True,
+            reward=0.0,
+            score=0.0,
+        )
+    @property
+    def state(self) -> DebugzeroState:
+        return self._state
+    def _build_state(self, seed: SeedSpec) -> DebugzeroState:
+        return DebugzeroState(
+            episode_id=str(uuid4()),
+            step_count=0,
+            seed_id=seed.seed_id,
+            original_code=seed.original_code,
+            current_code=seed.original_code,
+            role_turn="proposer",
+        )
+    def _build_observation(
+        self,
+        *,
+        role_next: str,
+        execution_result: str,
+        tests_passed: bool,
+        syntax_error: bool,
+        done: bool,
+        reward: float,
+        score: float,
+    ) -> DebugzeroObservation:
+        self._current_score = score
+        return DebugzeroObservation(
+            role_next=role_next,
+            current_code=self._state.current_code,
+            execution_result=execution_result,
+            tests_passed=tests_passed,
+            syntax_error=syntax_error,
+            score=score,
+            done=done,
+            reward=reward,
+            metadata=self._observation_metadata(),
+        )
+    def _proposer_step_feedback(self, candidate_code: str, result: object) -> tuple[float, float]:
+        original_code = self._state.original_code
+        execution_output = getattr(result, "output", "") or ""
+        syntax_error = bool(getattr(result, "syntax_error", False))
+        tests_passed = bool(getattr(result, "passed", False))
+        unsafe_code = execution_output.startswith("Unsafe import detected.")
+        unchanged_code = is_effectively_unchanged(original_code, candidate_code)
+        changed_but_passing = (not unchanged_code) and tests_passed and (not syntax_error)
+        plausibility_score = 0.0 if syntax_error else compute_ast_distance(original_code, candidate_code)
+        reward = compute_proposer_reward(
+            {
+                "seed_id": self._state.seed_id,
+                "tests_passed": tests_passed,
+                "syntax_error": syntax_error,
+                "unsafe_code": unsafe_code,
+                "unchanged_code": unchanged_code,
+                "changed_but_passing": changed_but_passing,
+                "plausibility_score": plausibility_score,
+            }
+        )
+        valid_bug = (not tests_passed) and (not syntax_error) and (not unsafe_code)
+        self._proposer_created_bug = valid_bug
+        self._current_bug_operator = infer_bug_operator(original_code, candidate_code) if valid_bug else None
+        score = 0.5 if valid_bug else 0.0
+        return reward, score
+    def _solver_step_feedback(self, result: object) -> tuple[float, float]:
+        execution_output = getattr(result, "output", "") or ""
+        syntax_error = bool(getattr(result, "syntax_error", False))
+        tests_passed = bool(getattr(result, "passed", False))
+        unsafe_code = execution_output.startswith("Unsafe import detected.")
+        reward = compute_solver_reward(
+            {
+                "seed_id": self._state.seed_id,
+                "tests_passed": tests_passed,
+                "syntax_error": syntax_error,
+                "unsafe_code": unsafe_code,
+            }
+        )
+        solved = tests_passed and (not syntax_error) and (not unsafe_code)
+        score = 1.0 if solved else (0.5 if self._proposer_created_bug else 0.0)
+        return reward, score
+    def _truncate_execution_output(self, output: str) -> str:
+        return output[:500] if output else ""
+    def _observation_metadata(self) -> dict[str, str]:
+        metadata = {
+            "seed_id": self._state.seed_id,
+            "original_code": self._state.original_code,
+        }
+        if self._current_bug_operator:
+            metadata["bug_operator"] = self._current_bug_operator
+        return metadata
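
The score progression in `step` (0.0 at reset, 0.5 once the proposer lands a valid bug, 1.0 once the solver repairs it) can be sketched without the OpenEnv server. This is an illustration under stated assumptions, not the environment itself: `run_tests` is a hypothetical, unsandboxed stand-in for `execute_code`, `RunResult` is a hypothetical result type, the unsafe-import check is omitted, and the seed is a shortened copy of the `drop_last` task.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    passed: bool
    syntax_error: bool
    output: str


def run_tests(code: str, tests: str) -> RunResult:
    """Hypothetical stand-in for execute_code: runs in-process, with no sandbox."""
    try:
        program = compile(code + "\n" + tests, "<candidate>", "exec")
    except SyntaxError as exc:
        return RunResult(False, True, f"SyntaxError: {exc}")
    try:
        exec(program, {})
    except Exception as exc:  # a failed assert in the tests lands here
        return RunResult(False, False, f"{type(exc).__name__}: {exc}")
    return RunResult(True, False, "")


def proposer_score(result: RunResult) -> float:
    # A valid bug parses cleanly but fails the tests.
    valid_bug = (not result.passed) and (not result.syntax_error)
    return 0.5 if valid_bug else 0.0


def solver_score(result: RunResult, proposer_created_bug: bool) -> float:
    # A passing fix scores 1.0; otherwise the proposer's 0.5 is kept if it was earned.
    solved = result.passed and (not result.syntax_error)
    if solved:
        return 1.0
    return 0.5 if proposer_created_bug else 0.0


ORIGINAL = (
    "def drop_last(values):\n"
    "    if not values:\n"
    "        return []\n"
    "    return values[:-1]\n"
)
# Proposer move: a one-token slice change that still parses.
BUGGY = ORIGINAL.replace("values[:-1]", "values[:-2]")
TESTS = (
    "def check(candidate):\n"
    "    assert candidate([]) == []\n"
    "    assert candidate([1]) == []\n"
    "    assert candidate([1, 2]) == [1]\n"
    "check(drop_last)\n"
)

bug_result = run_tests(BUGGY, TESTS)      # proposer turn
fix_result = run_tests(ORIGINAL, TESTS)   # solver turn restores the original
after_proposer = proposer_score(bug_result)
after_solver = solver_score(fix_result, proposer_created_bug=(after_proposer == 0.5))
```

Keeping the scoring functions separate from the runner mirrors the split between `execute_code` and the `_proposer_step_feedback` / `_solver_step_feedback` helpers in the diff.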

server/graders.py CHANGED

@@ -60,25 +60,25 @@ def compute_ast_distance(original_code: str, mutated_code: str) -> float:
     return 0.0
-def compute_proposer_reward(meta: dict) -> float:
-    if meta.get("syntax_error", False):
-        return -0.5
-    if meta.get("unsafe_code", False):
-        return -0.5
-    if meta.get("unchanged_code", False):
-        return 0.0
-    if meta.get("tests_passed", True):
-        return 0.0
-    if meta.get("changed_but_passing", False):
-        return -0.1
-    plausibility_bonus = meta.get("plausibility_score", 0.0)
-    learnability_bonus = 0.0
-    solve_rate = get_solve_rate(meta["seed_id"])
     if 0.2 <= solve_rate <= 0.8:
         learnability_bonus = 1.0

     return 0.0
+def compute_proposer_reward(meta: dict) -> float:
+    if meta.get("syntax_error", False):
+        return -0.5
+    if meta.get("unsafe_code", False):
+        return -0.5
+    if meta.get("unchanged_code", False):
+        return 0.0
+    if meta.get("tests_passed", True):
+        return 0.0
+    if meta.get("changed_but_passing", False):
+        return -0.1
+    plausibility_bonus = meta.get("plausibility_score", 0.0)
+    learnability_bonus = 0.0
+    solve_rate = get_solve_rate(meta["seed_id"])
     if 0.2 <= solve_rate <= 0.8:
         learnability_bonus = 1.0
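
In the hunk above, `changed_but_passing` is only true when `tests_passed` is also true (see `_proposer_step_feedback`), so the `tests_passed` early return fires first and the `-0.1` branch appears unreachable. One possible reordering is sketched below, assuming the `-0.1` penalty is meant to apply. The name `early_proposer_reward` is hypothetical, and the function covers only the early-return branches because the diff does not show how the bonus computation ends.

```python
from typing import Optional


def early_proposer_reward(meta: dict) -> Optional[float]:
    """Return the reward for a rejected proposal, or None when the bug is valid."""
    if meta.get("syntax_error", False):
        return -0.5
    if meta.get("unsafe_code", False):
        return -0.5
    if meta.get("unchanged_code", False):
        return 0.0
    # Checked before tests_passed so that the -0.1 penalty can actually fire.
    if meta.get("changed_but_passing", False):
        return -0.1
    if meta.get("tests_passed", True):
        return 0.0
    # Valid bug: the plausibility and learnability bonuses would be computed next.
    return None
```

If the `0.0` reward for passing code is the intended behavior instead, the alternative is to drop the `changed_but_passing` branch rather than move it.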

server/tasks.py CHANGED

@@ -111,11 +111,11 @@ SEED_BANK = (
             "check(count_nonempty)\n"
         ),
     ),
-    SeedSpec(
-        seed_id="DebugZero/5",
-        entrypoint="running_max",
-        prompt="def running_max(values: list[int]) -> int:",
-        canonical_solution=(
             "    best = values[0]\n"
             "    for idx in range(1, len(values)):\n"
             "        if values[idx] > best:\n"
@@ -127,93 +127,93 @@ SEED_BANK = (
             "    assert candidate([3]) == 3\n"
             "    assert candidate([3, 1, 5, 2]) == 5\n"
             "    assert candidate([-1, -4, -2]) == -1\n"
-            "    assert candidate([0, 0, 0]) == 0\n\n"
-            "check(running_max)\n"
-        ),
-    ),
-    SeedSpec(
-        seed_id="DebugZero/6",
-        entrypoint="first_index_of",
-        prompt="def first_index_of(values: list[int], target: int) -> int:",
-        canonical_solution=(
-            "    for idx, value in enumerate(values):\n"
-            "        if value == target:\n"
-            "            return idx\n"
-            "    return -1\n"
-        ),
-        test=(
-            "def check(candidate):\n"
-            "    assert candidate([], 3) == -1\n"
-            "    assert candidate([1, 2, 3], 1) == 0\n"
-            "    assert candidate([1, 2, 3], 3) == 2\n"
-            "    assert candidate([5, 7, 5, 7], 7) == 1\n"
-            "    assert candidate([9, 8], 4) == -1\n\n"
-            "check(first_index_of)\n"
-        ),
-    ),
-    SeedSpec(
-        seed_id="DebugZero/7",
-        entrypoint="drop_last",
-        prompt="def drop_last(values: list[int]) -> list[int]:",
-        canonical_solution=(
-            "    if not values:\n"
-            "        return []\n"
-            "    return values[:-1]\n"
-        ),
-        test=(
-            "def check(candidate):\n"
-            "    assert candidate([]) == []\n"
-            "    assert candidate([1]) == []\n"
-            "    assert candidate([1, 2]) == [1]\n"
-            "    assert candidate([1, 2, 3, 4]) == [1, 2, 3]\n"
-            "    assert candidate([7, 7, 7]) == [7, 7]\n\n"
-            "check(drop_last)\n"
-        ),
-    ),
-    SeedSpec(
-        seed_id="DebugZero/8",
-        entrypoint="count_greater_than",
-        prompt="def count_greater_than(values: list[int], threshold: int) -> int:",
-        canonical_solution=(
-            "    total = 0\n"
-            "    for value in values:\n"
-            "        if value > threshold:\n"
-            "            total += 1\n"
-            "    return total\n"
-        ),
-        test=(
-            "def check(candidate):\n"
-            "    assert candidate([], 1) == 0\n"
-            "    assert candidate([1, 2, 3], 2) == 1\n"
-            "    assert candidate([4, 5, 6], 3) == 3\n"
-            "    assert candidate([0, -1, 2, 2], 1) == 2\n"
-            "    assert candidate([5, 5, 5], 5) == 0\n\n"
-            "check(count_greater_than)\n"
-        ),
-    ),
-    SeedSpec(
-        seed_id="DebugZero/9",
-        entrypoint="prefix_sums",
-        prompt="def prefix_sums(values: list[int]) -> list[int]:",
-        canonical_solution=(
-            "    total = 0\n"
-            "    result = []\n"
-            "    for value in values:\n"
-            "        total += value\n"
-            "        result.append(total)\n"
-            "    return result\n"
-        ),
-        test=(
-            "def check(candidate):\n"
-            "    assert candidate([]) == []\n"
-            "    assert candidate([3]) == [3]\n"
-            "    assert candidate([1, 2, 3]) == [1, 3, 6]\n"
-            "    assert candidate([2, -1, 4]) == [2, 1, 5]\n"
-            "    assert candidate([0, 0, 0]) == [0, 0, 0]\n\n"
-            "check(prefix_sums)\n"
-        ),
-    ),
-)
 SEED_BY_ID = {seed.seed_id: seed for seed in SEED_BANK}

             "check(count_nonempty)\n"
         ),
     ),
+    SeedSpec(
+        seed_id="DebugZero/5",
+        entrypoint="running_max",
+        prompt="def running_max(values: list[int]) -> int:",
+        canonical_solution=(
             "    best = values[0]\n"
             "    for idx in range(1, len(values)):\n"
             "        if values[idx] > best:\n"
             "    assert candidate([3]) == 3\n"
             "    assert candidate([3, 1, 5, 2]) == 5\n"
             "    assert candidate([-1, -4, -2]) == -1\n"
+            "    assert candidate([0, 0, 0]) == 0\n\n"
+            "check(running_max)\n"
+        ),
+    ),
+    SeedSpec(
+        seed_id="DebugZero/6",
+        entrypoint="first_index_of",
+        prompt="def first_index_of(values: list[int], target: int) -> int:",
+        canonical_solution=(
+            "    for idx, value in enumerate(values):\n"
+            "        if value == target:\n"
+            "            return idx\n"
+            "    return -1\n"
+        ),
+        test=(
+            "def check(candidate):\n"
+            "    assert candidate([], 3) == -1\n"
+            "    assert candidate([1, 2, 3], 1) == 0\n"
+            "    assert candidate([1, 2, 3], 3) == 2\n"
+            "    assert candidate([5, 7, 5, 7], 7) == 1\n"
+            "    assert candidate([9, 8], 4) == -1\n\n"
+            "check(first_index_of)\n"
+        ),
+    ),
+    SeedSpec(
+        seed_id="DebugZero/7",
+        entrypoint="drop_last",
+        prompt="def drop_last(values: list[int]) -> list[int]:",
+        canonical_solution=(
+            "    if not values:\n"
+            "        return []\n"
+            "    return values[:-1]\n"
+        ),
+        test=(
+            "def check(candidate):\n"
+            "    assert candidate([]) == []\n"
+            "    assert candidate([1]) == []\n"
+            "    assert candidate([1, 2]) == [1]\n"
+            "    assert candidate([1, 2, 3, 4]) == [1, 2, 3]\n"
+            "    assert candidate([7, 7, 7]) == [7, 7]\n\n"
+            "check(drop_last)\n"
+        ),
+    ),
+    SeedSpec(
+        seed_id="DebugZero/8",
+        entrypoint="count_greater_than",
+        prompt="def count_greater_than(values: list[int], threshold: int) -> int:",
+        canonical_solution=(
+            "    total = 0\n"
+            "    for value in values:\n"
+            "        if value > threshold:\n"
+            "            total += 1\n"
+            "    return total\n"
+        ),
+        test=(
+            "def check(candidate):\n"
+            "    assert candidate([], 1) == 0\n"
+            "    assert candidate([1, 2, 3], 2) == 1\n"
+            "    assert candidate([4, 5, 6], 3) == 3\n"
+            "    assert candidate([0, -1, 2, 2], 1) == 2\n"
+            "    assert candidate([5, 5, 5], 5) == 0\n\n"
+            "check(count_greater_than)\n"
+        ),
+    ),
+    SeedSpec(
+        seed_id="DebugZero/9",
+        entrypoint="prefix_sums",
+        prompt="def prefix_sums(values: list[int]) -> list[int]:",
+        canonical_solution=(
+            "    total = 0\n"
+            "    result = []\n"
+            "    for value in values:\n"
+            "        total += value\n"
+            "        result.append(total)\n"
+            "    return result\n"
+        ),
+        test=(
+            "def check(candidate):\n"
+            "    assert candidate([]) == []\n"
+            "    assert candidate([3]) == [3]\n"
+            "    assert candidate([1, 2, 3]) == [1, 3, 6]\n"
+            "    assert candidate([2, -1, 4]) == [2, 1, 5]\n"
+            "    assert candidate([0, 0, 0]) == [0, 0, 0]\n\n"
+            "check(prefix_sums)\n"
+        ),
+    ),
+)
 SEED_BY_ID = {seed.seed_id: seed for seed in SEED_BANK}
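
Each seed pairs a function signature and body with a `check` harness, and `reset` walks the bank round-robin via `SEED_BANK[self._reset_count % len(SEED_BANK)]`. The sketch below shows how those pieces could fit together. It makes two assumptions that the diff does not confirm: that `original_code` is the `prompt` followed by `canonical_solution`, and that `SeedSpec` behaves like a frozen dataclass. `seed_passes` and `next_seed` are hypothetical helpers, and the bank is cut down to two of the seeds shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedSpec:
    seed_id: str
    entrypoint: str
    prompt: str
    canonical_solution: str
    test: str

    @property
    def original_code(self) -> str:
        # Assumption: the signature line plus the indented body forms the function.
        return self.prompt + "\n" + self.canonical_solution


SEED_BANK = (
    SeedSpec(
        seed_id="DebugZero/7",
        entrypoint="drop_last",
        prompt="def drop_last(values: list[int]) -> list[int]:",
        canonical_solution=(
            "    if not values:\n"
            "        return []\n"
            "    return values[:-1]\n"
        ),
        test=(
            "def check(candidate):\n"
            "    assert candidate([]) == []\n"
            "    assert candidate([1, 2]) == [1]\n\n"
            "check(drop_last)\n"
        ),
    ),
    SeedSpec(
        seed_id="DebugZero/9",
        entrypoint="prefix_sums",
        prompt="def prefix_sums(values: list[int]) -> list[int]:",
        canonical_solution=(
            "    total = 0\n"
            "    result = []\n"
            "    for value in values:\n"
            "        total += value\n"
            "        result.append(total)\n"
            "    return result\n"
        ),
        test=(
            "def check(candidate):\n"
            "    assert candidate([]) == []\n"
            "    assert candidate([1, 2, 3]) == [1, 3, 6]\n\n"
            "check(prefix_sums)\n"
        ),
    ),
)
SEED_BY_ID = {seed.seed_id: seed for seed in SEED_BANK}


def seed_passes(seed: SeedSpec) -> bool:
    """Run the canonical solution against its own tests (in-process, no sandbox)."""
    try:
        exec(seed.original_code + "\n" + seed.test, {})
    except Exception:
        return False
    return True


def next_seed(reset_count: int) -> SeedSpec:
    # Round-robin selection, as in DebugzeroEnvironment.reset.
    return SEED_BANK[reset_count % len(SEED_BANK)]
```

Checking every seed with something like `seed_passes` before training is a cheap sanity test: a seed whose canonical solution fails its own harness would make every solver attempt look wrong.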

validate-submission.sh CHANGED

@@ -1,185 +1,185 @@
-#!/usr/bin/env bash
-#
-# validate-submission.sh — OpenEnv Submission Validator
-#
-# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
-#
-# Prerequisites:
-#   - Docker:       https://docs.docker.com/get-docker/
-#   - openenv-core: pip install openenv-core
-#   - curl (usually pre-installed)
-#
-# Run:
-#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
-#
-#   Or download and run locally:
-#     chmod +x validate-submission.sh
-#     ./validate-submission.sh <ping_url> [repo_dir]
-#
-# Arguments:
-#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
-#   repo_dir   Path to your repo (default: current directory)
-#
-# Examples:
-#   ./validate-submission.sh https://my-team.hf.space
-#   ./validate-submission.sh https://my-team.hf.space ./my-repo
-#
-set -uo pipefail
-DOCKER_BUILD_TIMEOUT=600
-if [ -t 1 ]; then
-  RED='\033[0;31m'
-  GREEN='\033[0;32m'
-  YELLOW='\033[1;33m'
-  BOLD='\033[1m'
-  NC='\033[0m'
-else
-  RED='' GREEN='' YELLOW='' BOLD='' NC=''
-fi
-run_with_timeout() {
-  local secs="$1"; shift
-  if command -v timeout &>/dev/null; then
-    timeout "$secs" "$@"
-  elif command -v gtimeout &>/dev/null; then
-    gtimeout "$secs" "$@"
-  else
-    "$@" &
-    local pid=$!
-    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
-    local watcher=$!
-    wait "$pid" 2>/dev/null
-    local rc=$?
-    kill "$watcher" 2>/dev/null
-    wait "$watcher" 2>/dev/null
-    return $rc
-  fi
-}
-portable_mktemp() {
-  local prefix="${1:-validate}"
-  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
-}
-CLEANUP_FILES=()
-cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
-trap cleanup EXIT
-PING_URL="${1:-}"
-REPO_DIR="${2:-.}"
-if [ -z "$PING_URL" ]; then
-  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
-  printf "\n"
-  printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
-  printf "  repo_dir   Path to your repo (default: current directory)\n"
-  exit 1
-fi
-if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
-  printf "Error: directory '%s' not found\n" "${2:-.}"
-  exit 1
-fi
-PING_URL="${PING_URL%/}"
-export PING_URL
-PASS=0
-log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
-pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
-fail() { log "${RED}FAILED${NC} -- $1"; }
-hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
-stop_at() {
-  printf "\n"
-  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
-  exit 1
-}
-printf "\n"
-printf "${BOLD}========================================${NC}\n"
-printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
-printf "${BOLD}========================================${NC}\n"
-log "Repo:     $REPO_DIR"
-log "Ping URL: $PING_URL"
-printf "\n"
-log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
-CURL_OUTPUT=$(portable_mktemp "validate-curl")
-CLEANUP_FILES+=("$CURL_OUTPUT")
-HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
-  -H "Content-Type: application/json" -d '{}' \
-  "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")
-if [ "$HTTP_CODE" = "200" ]; then
-  pass "HF Space is live and responds to /reset"
-elif [ "$HTTP_CODE" = "000" ]; then
-  fail "HF Space not reachable (connection failed or timed out)"
-  hint "Check your network connection and that the Space is running."
-  hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
-  stop_at "Step 1"
-else
-  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
-  hint "Make sure your Space is running and the URL is correct."
-  hint "Try opening $PING_URL in your browser first."
-  stop_at "Step 1"
-fi
-log "${BOLD}Step 2/3: Running docker build${NC} ..."
-if ! command -v docker &>/dev/null; then
-  fail "docker command not found"
-  hint "Install Docker: https://docs.docker.com/get-docker/"
-  stop_at "Step 2"
-fi
-if [ -f "$REPO_DIR/Dockerfile" ]; then
-  DOCKER_CONTEXT="$REPO_DIR"
-elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
-  DOCKER_CONTEXT="$REPO_DIR/server"
-else
-  fail "No Dockerfile found in repo root or server/ directory"
-  stop_at "Step 2"
-fi
-log "  Found Dockerfile in $DOCKER_CONTEXT"
-BUILD_OK=false
-BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
-if [ "$BUILD_OK" = true ]; then
-  pass "Docker build succeeded"
-else
-  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
-  printf "%s\n" "$BUILD_OUTPUT" | tail -20
-  stop_at "Step 2"
-fi
-log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
-if ! command -v openenv &>/dev/null; then
-  fail "openenv command not found"
-  hint "Install it: pip install openenv-core"
-  stop_at "Step 3"
-fi
-VALIDATE_OK=false
-VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
-if [ "$VALIDATE_OK" = true ]; then
-  pass "openenv validate passed"
-  [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
-else
-  fail "openenv validate failed"
-  printf "%s\n" "$VALIDATE_OUTPUT"
-  stop_at "Step 3"
-fi
-printf "\n"
-printf "${BOLD}========================================${NC}\n"
-printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
-printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
-printf "${BOLD}========================================${NC}\n"
-printf "\n"
 exit 0

+#!/usr/bin/env bash
+#
+# validate-submission.sh — OpenEnv Submission Validator
+#
+# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
+#
+# Prerequisites:
+#   - Docker:       https://docs.docker.com/get-docker/
+#   - openenv-core: pip install openenv-core
+#   - curl (usually pre-installed)
+#
+# Run:
+#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
+#
+#   Or download and run locally:
+#     chmod +x validate-submission.sh
+#     ./validate-submission.sh <ping_url> [repo_dir]
+#
+# Arguments:
+#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
+#   repo_dir   Path to your repo (default: current directory)
+#
+# Examples:
+#   ./validate-submission.sh https://my-team.hf.space
+#   ./validate-submission.sh https://my-team.hf.space ./my-repo
+#
+set -uo pipefail
+DOCKER_BUILD_TIMEOUT=600
+if [ -t 1 ]; then
+  RED='\033[0;31m'
+  GREEN='\033[0;32m'
+  YELLOW='\033[1;33m'
+  BOLD='\033[1m'
+  NC='\033[0m'
+else
+  RED='' GREEN='' YELLOW='' BOLD='' NC=''
+fi
+run_with_timeout() {
+  local secs="$1"; shift
+  if command -v timeout &>/dev/null; then
+    timeout "$secs" "$@"
+  elif command -v gtimeout &>/dev/null; then
+    gtimeout "$secs" "$@"
+  else
+    "$@" &
+    local pid=$!
+    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
+    local watcher=$!
+    wait "$pid" 2>/dev/null
+    local rc=$?
+    kill "$watcher" 2>/dev/null
+    wait "$watcher" 2>/dev/null
+    return $rc
+  fi
+}
+portable_mktemp() {
+  local prefix="${1:-validate}"
+  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
+}
+CLEANUP_FILES=()
+cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
+trap cleanup EXIT
+PING_URL="${1:-}"
+REPO_DIR="${2:-.}"
+if [ -z "$PING_URL" ]; then
+  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+  printf "\n"
+  printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
+  printf "  repo_dir   Path to your repo (default: current directory)\n"
+  exit 1
+fi
+if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
+  printf "Error: directory '%s' not found\n" "${2:-.}"
+  exit 1
+fi
+PING_URL="${PING_URL%/}"
+export PING_URL
+PASS=0
+log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+fail() { log "${RED}FAILED${NC} -- $1"; }
+hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
+stop_at() {
+  printf "\n"
+  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
+  exit 1
+}
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+log "Repo:     $REPO_DIR"
+log "Ping URL: $PING_URL"
+printf "\n"
+log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
+CURL_OUTPUT=$(portable_mktemp "validate-curl")
+CLEANUP_FILES+=("$CURL_OUTPUT")
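+# curl -w prints 000 itself when no HTTP response arrives, so the fallback
+# assigns the value instead of appending a second "000" to the captured output.
+HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
+  -H "Content-Type: application/json" -d '{}' \
+  "$PING_URL/reset" --max-time 30 2>/dev/null) || HTTP_CODE="000"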
+if [ "$HTTP_CODE" = "200" ]; then
+  pass "HF Space is live and responds to /reset"
+elif [ "$HTTP_CODE" = "000" ]; then
+  fail "HF Space not reachable (connection failed or timed out)"
+  hint "Check your network connection and that the Space is running."
+  hint "Try: curl -s -o /dev/null -w '%{http_code}' -X POST $PING_URL/reset"
+  stop_at "Step 1"
+else
+  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
+  hint "Make sure your Space is running and the URL is correct."
+  hint "Try opening $PING_URL in your browser first."
+  stop_at "Step 1"
+fi
+log "${BOLD}Step 2/3: Running docker build${NC} ..."
+if ! command -v docker &>/dev/null; then
+  fail "docker command not found"
+  hint "Install Docker: https://docs.docker.com/get-docker/"
+  stop_at "Step 2"
+fi
+if [ -f "$REPO_DIR/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR"
+elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR/server"
+else
+  fail "No Dockerfile found in repo root or server/ directory"
+  stop_at "Step 2"
+fi
+log "  Found Dockerfile in $DOCKER_CONTEXT"
+BUILD_OK=false
+BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
+if [ "$BUILD_OK" = true ]; then
+  pass "Docker build succeeded"
+else
+  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
+  printf "%s\n" "$BUILD_OUTPUT" | tail -20
+  stop_at "Step 2"
+fi
+log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
+if ! command -v openenv &>/dev/null; then
+  fail "openenv command not found"
+  hint "Install it: pip install openenv-core"
+  stop_at "Step 3"
+fi
+VALIDATE_OK=false
+VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
+if [ "$VALIDATE_OK" = true ]; then
+  pass "openenv validate passed"
+  [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
+else
+  fail "openenv validate failed"
+  printf "%s\n" "$VALIDATE_OUTPUT"
+  stop_at "Step 3"
+fi
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
+printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+printf "\n"
 exit 0
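
Step 1 of the script branches on the string captured from curl's `-w "%{http_code}"`: `200` means the Space is live, `000` means no HTTP response arrived, and anything else is an unexpected status. A minimal sketch of that three-way classification as a standalone function follows. The name `classify_http_code` and the outcome labels are hypothetical and do not appear in `validate-submission.sh`. Treating any all-zero string (not only `000`) as "unreachable" is an assumption of this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of validate-submission.sh.
# Maps a captured HTTP status string to one of three outcomes.
classify_http_code() {
  local code="${1:-}"
  case "$code" in
    200)    printf 'live\n' ;;         # Space answered /reset successfully
    "")     printf 'unreachable\n' ;;  # nothing was captured at all
    *[!0]*) printf 'error\n' ;;        # contains a non-zero character: some other status
    *)      printf 'unreachable\n' ;;  # only zeros: curl received no HTTP response
  esac
}
```

A caller could then branch on the label, for example `case "$(classify_http_code "$HTTP_CODE")" in live) ... ;; esac`, which keeps the status parsing in one place and makes it testable without network access.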