Spaces:

Souravdanyal
/

code-debug-env

+# Code Debug Environment
+An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compatible RL environment where an LLM agent diagnoses and fixes buggy Python code across three difficulty levels.
+---
+## Overview
+| Property | Value |
+|---|---|
+| Domain | Real-world Python code debugging |
+| Tasks | 45 total (15 easy + 15 medium + 15 hard) |
+| Difficulties | easy → medium → hard |
+| Reward Range | 0.0 – 1.0 (partial, proportional) |
+| Max Steps/Episode | 3 |
+| API | OpenEnv standard: `/reset`, `/step`, `/state` |
+---
+## Environment Description
+The agent receives a buggy Python function and must fix it. Tasks come from real-world domains: data processing, string algorithms, API validation, sorting, dynamic programming, and graph algorithms.
+- **Easy**: One bug (wrong operator, off-by-one, incorrect return). Reward proportional to test pass rate.
+- **Medium**: Two bugs (logic bug + edge case). Reward proportional to test pass rate.
+- **Hard**: One algorithmic bug + agent must explain what was wrong. Reward = 0.7 × test score + 0.3 × explanation quality.
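The graders are not shown in this part of the diff. As a rough sketch of how a pass-rate grader along these lines could work, the block below assumes a hypothetical `grade_submission` helper and a hypothetical test-case format of `(args, expected)` tuples; neither is taken from the repository.

```python
from typing import Any, List, Tuple


def grade_submission(
    fixed_code: str,
    func_name: str,
    test_cases: List[Tuple[tuple, Any]],
) -> Tuple[float, int, int]:
    """Exec the submitted code and return (reward, passed, total)."""
    namespace: dict = {}
    total = len(test_cases)
    try:
        # WARNING: exec runs arbitrary code; a real grader should sandbox this.
        exec(fixed_code, namespace)
        func = namespace[func_name]
    except Exception:
        # Code that fails to compile or does not define the function scores zero.
        return 0.0, 0, total

    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            # A test that raises counts as a failure rather than a crash.
            pass
    reward = round(passed / total, 2) if total else 0.0
    return reward, passed, total


# Hypothetical task: find_max over a list of numbers.
tests = [(([1, 5, 3],), 5), (([7],), 7), (([-2, -9],), -2)]
good = "def find_max(nums):\n    return max(nums)"
bad = "def find_max(nums):\n    return nums[0]"
print(grade_submission(good, "find_max", tests))  # (1.0, 3, 3)
print(grade_submission(bad, "find_max", tests))   # (0.67, 2, 3)
```

Treating exceptions as failed tests keeps the reward partial and proportional even when a submission crashes on some inputs.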
+---
+## Action Space
+```json
+{
+  "fixed_code": "string — the corrected Python function (required)",
+  "explanation": "string — explanation of what was wrong (required for hard tasks)"
+}
+```
+| Field | Type | Required | Description |
+|---|---|---|---|
+| `fixed_code` | `str` | Always | Complete corrected Python function as a string |
+| `explanation` | `str` | Hard tasks | Describe the bug and why your fix is correct |
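The "Required" column above can be checked on the client before a request is sent. This is a minimal sketch with a hypothetical `build_action` helper; whether the server itself rejects a hard-task action without an explanation is an assumption not confirmed by the diff.

```python
from typing import Optional


def build_action(fixed_code: str, difficulty: str,
                 explanation: Optional[str] = None) -> dict:
    """Build a /step payload, enforcing the 'Required' column client-side."""
    if not fixed_code.strip():
        raise ValueError("fixed_code is always required")
    if difficulty == "hard" and not (explanation and explanation.strip()):
        raise ValueError("explanation is required for hard tasks")
    payload = {"fixed_code": fixed_code}
    if explanation:
        # Only send the optional field when it carries content.
        payload["explanation"] = explanation
    return payload


print(build_action("def f():\n    return 1", "easy"))
```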
+---
+## Observation Space
+Returned by `/reset` and `/step`:
+```json
+{
+  "task_id": "easy_003",
+  "difficulty": "easy",
+  "buggy_code": "def find_max(nums):\n    ...",
+  "instructions": "The function has exactly one bug. Fix it.",
+  "test_cases_description": "Finds max value in a list without IndexError",
+  "reward": 0.67,
+  "passed_tests": 2,
+  "total_tests": 3,
+  "feedback": "Test 1: ✅ ...\nTest 2: ✅ ...\nTest 3: ❌ ...",
+  "done": false
+}
+```
+| Field | Type | Description |
+|---|---|---|
+| `task_id` | `str` | Unique task identifier |
+| `difficulty` | `str` | `easy` / `medium` / `hard` |
+| `buggy_code` | `str` | Buggy Python function to fix |
+| `instructions` | `str` | Task instructions |
+| `test_cases_description` | `str` | What the test cases check |
+| `reward` | `float\|null` | Score from last step (null on reset) |
+| `passed_tests` | `int\|null` | Tests passed (null on reset) |
+| `total_tests` | `int` | Total number of test cases |
+| `feedback` | `str\|null` | Detailed per-test feedback |
+| `done` | `bool` | True when episode is complete |
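Because `reward`, `passed_tests`, and `feedback` are null right after a reset, client code has to handle both shapes. A small sketch, using a hypothetical `summarize` helper over the fields in the table above:

```python
def summarize(observation: dict) -> str:
    """One-line summary that tolerates the null fields returned by /reset."""
    task_id = observation["task_id"]
    total = observation["total_tests"]
    reward = observation.get("reward")
    if reward is None:
        # Fresh episode: nothing has been graded yet.
        return f"{task_id}: new episode, {total} tests"
    status = "done" if observation.get("done") else "in progress"
    passed = observation["passed_tests"]
    return f"{task_id}: {passed}/{total} passed, reward={reward:.2f} ({status})"


after_reset = {"task_id": "easy_003", "total_tests": 3, "reward": None,
               "passed_tests": None, "feedback": None, "done": False}
after_step = {"task_id": "easy_003", "total_tests": 3, "reward": 0.67,
              "passed_tests": 2, "feedback": "...", "done": False}
print(summarize(after_reset))  # easy_003: new episode, 3 tests
print(summarize(after_step))   # easy_003: 2/3 passed, reward=0.67 (in progress)
```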
+---
+## Reward Function
+### Easy & Medium
+```
+reward = passed_tests / total_tests
+```
+- 3/3 tests → 1.0
+- 2/3 tests → 0.67
+- 1/3 tests → 0.33
+- 0/3 tests → 0.0
+### Hard
+```
+reward = 0.7 × test_score + 0.3 × explanation_score
+```
+Explanation is scored by matching key algorithmic concepts. Partial credit is given.
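The hard-task grader is not shown in this part of the diff. As an illustration only, the sketch below assumes the explanation score is the fraction of expected concepts mentioned, using case-insensitive substring matching; the helper names and the concept list are hypothetical.

```python
from typing import Sequence


def explanation_score(explanation: str, key_concepts: Sequence[str]) -> float:
    """Fraction of key concepts that appear in the explanation."""
    if not explanation or not key_concepts:
        return 0.0
    text = explanation.lower()
    hits = sum(1 for concept in key_concepts if concept.lower() in text)
    return hits / len(key_concepts)


def hard_reward(passed: int, total: int, explanation: str,
                key_concepts: Sequence[str]) -> float:
    """reward = 0.7 * test_score + 0.3 * explanation_score, rounded to 2 places."""
    test_score = passed / total if total else 0.0
    score = 0.7 * test_score + 0.3 * explanation_score(explanation, key_concepts)
    return round(score, 2)


concepts = ["memo", "recursion", "base case"]
text = "The memo table was never updated, so the recursion recomputed subproblems."
print(hard_reward(3, 3, text, concepts))  # 0.9 (all tests pass, 2 of 3 concepts hit)
```

Substring matching is crude: it gives partial credit cheaply but can be gamed by listing keywords, so a real grader may want something stricter.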
+---
+## Setup & Local Run
+### Prerequisites
+- Python 3.10+
+- Docker
+- Hugging Face CLI
+### Install
+```bash
+git clone https://github.com/YOUR_USERNAME/code-debug-env
+cd code-debug-env
+pip install -e .
+# Also clone OpenEnv for PYTHONPATH
+git clone https://github.com/meta-pytorch/OpenEnv.git
+export PYTHONPATH=$PYTHONPATH:OpenEnv:OpenEnv/src:.
+```
+### Run locally
+```bash
+uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
+```
+### Run with Docker
+```bash
+docker build -f server/Dockerfile -t code-debug-env .
+docker run -p 7860:7860 code-debug-env
+```
+### Test the API
+```bash
+# Health check
+curl http://localhost:7860/health
+# Reset (easy task)
+curl -X POST http://localhost:7860/reset \
+  -H "Content-Type: application/json" \
+  -d '{"difficulty": "easy"}'
+# Submit a fix
+curl -X POST http://localhost:7860/step \
+  -H "Content-Type: application/json" \
+  -d '{"fixed_code": "def find_max(nums):\n    return max(nums)"}'
+# Check state
+curl http://localhost:7860/state
+```
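The same calls can be made from Python with only the standard library. The sketch below builds the requests without sending them, so it runs even when no server is up; `build_post` is a hypothetical helper, and the commented-out send assumes the response wraps the observation under an `observation` key, as `server/app.py` does.

```python
import json
import urllib.request


def build_post(base_url: str, path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the environment API."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


reset_req = build_post("http://localhost:7860", "/reset", {"difficulty": "easy"})
step_req = build_post(
    "http://localhost:7860",
    "/step",
    {"fixed_code": "def find_max(nums):\n    return max(nums)"},
)
print(reset_req.get_method(), reset_req.full_url)

# To send a request once the server is running:
# with urllib.request.urlopen(reset_req, timeout=30) as resp:
#     observation = json.load(resp)["observation"]
```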
+---
+## Run Baseline Inference
+```bash
+export API_BASE_URL="https://api.openai.com/v1"
+export MODEL_NAME="gpt-4o-mini"
+export HF_TOKEN="your-api-key"
+# Run all 3 difficulties
+python inference.py --url http://localhost:7860
+# Run specific difficulty
+python inference.py --url http://localhost:7860 --difficulty hard
+```
+---
+## Pre-Submission Validation
+Run before submitting to catch any disqualifying issues:
+```bash
+# Start the environment first, then:
+python validator/pre_submit_check.py --url http://localhost:7860
+# Or against your HF Space:
+python validator/pre_submit_check.py --url https://YOUR_SPACE.hf.space
+```
+---
+## Deploy to Hugging Face Spaces
+```bash
+# Login
+huggingface-cli login
+# Create space and push
+huggingface-cli repo create code-debug-env --type space --space_sdk docker
+cd code-debug-env
+git init
+git remote add origin https://huggingface.co/spaces/YOUR_USERNAME/code-debug-env
+git add .
+git commit -m "Initial commit"
+git push origin main
+```
+---
+## Project Structure
+```
+code-debug-env/
+├── openenv.yaml          ← OpenEnv manifest
+├── inference.py          ← Baseline agent (root, required)
+├── pyproject.toml        ← Dependencies
+├── README.md
+├── models.py             ← Pydantic Action/Observation/State
+├── client.py             ← EnvClient for training loops
+├── __init__.py
+├── server/
+│   ├── app.py            ← FastAPI: /reset /step /state /health
+│   ├── environment.py    ← Core episode logic
+│   ├── tasks/
+│   │   ├── task_easy.py  ← 15 single-bug tasks
+│   │   ├── task_medium.py← 15 two-bug tasks
+│   │   └── task_hard.py  ← 15 algorithmic tasks
+│   ├── graders/
+│   │   ├── grader_easy.py
+│   │   ├── grader_medium.py
+│   │   └── grader_hard.py
+│   ├── requirements.txt
+│   └── Dockerfile
+└── validator/
+    └── pre_submit_check.py
+```

STRUCTURE.md ADDED

	@@ -0,0 +1,29 @@

+# Code Debug Environment — Full File Structure
+```
+code-debug-env/
+├── openenv.yaml                  ← OpenEnv manifest (required)
+├── inference.py                  ← Baseline agent script (must be in root)
+├── pyproject.toml                ← Dependencies
+├── README.md                     ← Docs with action/obs spaces
+├── .dockerignore
+├── models.py                     ← Pydantic Action/Observation/State
+├── client.py                     ← EnvClient (for training code)
+├── __init__.py                   ← Exports
+└── server/
+    ├── __init__.py
+    ├── app.py                    ← FastAPI server
+    ├── environment.py            ← Core logic: reset/step/state
+    ├── tasks/
+    │   ├── __init__.py
+    │   ├── task_easy.py          ← 15 buggy code samples
+    │   ├── task_medium.py        ← 15 buggy code samples
+    │   └── task_hard.py          ← 15 buggy code samples
+    ├── graders/
+    │   ├── __init__.py
+    │   ├── grader_easy.py
+    │   ├── grader_medium.py
+    │   └── grader_hard.py
+    ├── requirements.txt
+    └── Dockerfile
+```

__init__.py ADDED

	@@ -0,0 +1,5 @@

+# __init__.py
+from models import DebugAction, DebugObservation, DebugState
+from client import CodeDebugEnv
+__all__ = ["DebugAction", "DebugObservation", "DebugState", "CodeDebugEnv"]

client.py ADDED

	@@ -0,0 +1,62 @@

+# client.py
+# Python client for connecting to the Code Debug Environment.
+# Use this in training loops / evaluation scripts.
+#
+# Usage (sync):
+#   with CodeDebugEnv(base_url="https://your-space.hf.space").sync() as env:
+#       result = env.reset(difficulty="easy")
+#       result = env.step(DebugAction(fixed_code="..."))
+#
+# Usage (async):
+#   async with CodeDebugEnv(base_url="https://your-space.hf.space") as env:
+#       result = await env.reset(difficulty="medium")
+#       result = await env.step(DebugAction(fixed_code="..."))
+from openenv.core.env_client import EnvClient
+from openenv.core.client_types import StepResult
+from models import DebugAction, DebugObservation, DebugState
+class CodeDebugEnv(EnvClient[DebugAction, DebugObservation, DebugState]):
+    """
+    Client for the Code Debug Environment.
+    Wraps OpenEnv EnvClient with typed action/observation models.
+    """
+    def _step_payload(self, action: DebugAction) -> dict:
+        payload = {"fixed_code": action.fixed_code}
+        if action.explanation:
+            payload["explanation"] = action.explanation
+        return payload
+    def _parse_result(self, payload: dict) -> StepResult[DebugObservation]:
+        obs_data = payload.get("observation", {})
+        obs = DebugObservation(
+            task_id=obs_data.get("task_id", ""),
+            difficulty=obs_data.get("difficulty", "easy"),
+            buggy_code=obs_data.get("buggy_code", ""),
+            instructions=obs_data.get("instructions", ""),
+            test_cases_description=obs_data.get("test_cases_description", ""),
+            reward=obs_data.get("reward"),
+            passed_tests=obs_data.get("passed_tests"),
+            total_tests=obs_data.get("total_tests"),
+            feedback=obs_data.get("feedback"),
+            done=payload.get("done", False),
+        )
+        return StepResult(
+            observation=obs,
+            reward=payload.get("reward", 0.0),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: dict) -> DebugState:
+        return DebugState(
+            episode_id=payload.get("episode_id", ""),
+            step_count=payload.get("step_count", 0),
+            task_id=payload.get("task_id", ""),
+            difficulty=payload.get("difficulty", "easy"),
+            max_steps=payload.get("max_steps", 3),
+            current_reward=payload.get("current_reward", 0.0),
+            best_reward=payload.get("best_reward", 0.0),
+            done=payload.get("done", False),
+        )
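A training or evaluation loop built on a client like this typically retries up to the step limit and feeds the previous feedback back to the agent. The sketch below uses stand-ins so it runs without a server: `StepOutcome`, `run_attempts`, `propose`, and `step` are all hypothetical names, not part of the repository or of OpenEnv.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepOutcome:
    reward: float
    done: bool
    feedback: str


def run_attempts(
    step: Callable[[str], StepOutcome],
    propose: Callable[[Optional[str]], str],
    max_steps: int = 3,
) -> float:
    """Retry until the episode is done or max_steps is reached; return best reward."""
    best = 0.0
    feedback: Optional[str] = None
    for _ in range(max_steps):
        outcome = step(propose(feedback))
        best = max(best, outcome.reward)
        feedback = outcome.feedback
        if outcome.done:
            break
    return best


# Stub agent and environment: the second attempt, made with feedback, succeeds.
calls: List[Optional[str]] = []


def propose(feedback: Optional[str]) -> str:
    calls.append(feedback)
    return "good" if feedback else "bad"


def step(code: str) -> StepOutcome:
    if code == "good":
        return StepOutcome(1.0, True, "all passed")
    return StepOutcome(0.33, False, "Test 2 failed")


best = run_attempts(step, propose)
print(best, len(calls))  # 1.0 2
```

Tracking the best reward mirrors the `best_reward` field of `DebugState`; a loop that reports only the last reward can understate an episode in which a later attempt regresses.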

inference.py ADDED

	@@ -0,0 +1,285 @@

+#!/usr/bin/env python3
+# inference.py
+# ─────────────────────────────────────────────────────────────────────────────
+# Baseline inference script for the Code Debug Environment.
+# Must be run from the project root.
+#
+# Required environment variables:
+#   API_BASE_URL  — LLM API endpoint (OpenAI-compatible)
+#   MODEL_NAME    — Model identifier
+#   HF_TOKEN      — Hugging Face / API key
+#
+# Usage:
+#   python inference.py
+#   python inference.py --url https://your-hf-space.hf.space
+#   python inference.py --difficulty easy
+#
+# Log format: one JSON object per line with "type" set to START, STEP, or END; the evaluator depends on this format.
+# ─────────────────────────────────────────────────────────────────────────────
+import os
+import sys
+import json
+import time
+import argparse
+import requests
+from openai import OpenAI
+# ─── Configuration ────────────────────────────────────────────────────────────
+API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
+MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o-mini")
+HF_TOKEN = os.environ.get("HF_TOKEN", "")
+ENV_URL = os.environ.get("ENV_URL", "http://localhost:7860")
+MAX_STEPS = 3
+DIFFICULTIES = ["easy", "medium", "hard"]
+# ─── OpenAI Client ───────────────────────────────────────────────────────────
+client = OpenAI(
+    api_key=HF_TOKEN or "dummy",
+    base_url=API_BASE_URL,
+)
+# ─── Logging (strict format required by evaluator) ───────────────────────────
+def log_start(task_id: str, difficulty: str, episode: int):
+    print(json.dumps({
+        "type": "START",
+        "episode": episode,
+        "task_id": task_id,
+        "difficulty": difficulty,
+        "timestamp": time.time(),
+    }), flush=True)
+def log_step(task_id: str, step: int, action_summary: str, reward: float, done: bool):
+    print(json.dumps({
+        "type": "STEP",
+        "task_id": task_id,
+        "step": step,
+        "action": action_summary,
+        "reward": reward,
+        "done": done,
+        "timestamp": time.time(),
+    }), flush=True)
+def log_end(task_id: str, difficulty: str, final_reward: float, steps_taken: int, episode: int):
+    print(json.dumps({
+        "type": "END",
+        "episode": episode,
+        "task_id": task_id,
+        "difficulty": difficulty,
+        "final_reward": final_reward,
+        "steps_taken": steps_taken,
+        "timestamp": time.time(),
+    }), flush=True)
+# ─── Environment Client ───────────────────────────────────────────────────────
+def env_reset(env_url: str, difficulty: str) -> dict:
+    resp = requests.post(
+        f"{env_url}/reset",
+        json={"difficulty": difficulty},
+        timeout=30,
+    )
+    resp.raise_for_status()
+    return resp.json()
+def env_step(env_url: str, fixed_code: str, explanation: str = None) -> dict:
+    payload = {"fixed_code": fixed_code}
+    if explanation:
+        payload["explanation"] = explanation
+    resp = requests.post(
+        f"{env_url}/step",
+        json=payload,
+        timeout=30,
+    )
+    resp.raise_for_status()
+    return resp.json()
+def env_state(env_url: str) -> dict:
+    resp = requests.get(f"{env_url}/state", timeout=10)
+    resp.raise_for_status()
+    return resp.json()
+# ─── LLM Agent ───────────────────────────────────────────────────────────────
+SYSTEM_PROMPT = """You are an expert Python debugging agent.
+You will be given buggy Python code and must fix it.
+For easy tasks: fix the single bug.
+For medium tasks: fix both bugs.
+For hard tasks: fix the algorithmic bug AND explain your reasoning in the 'explanation' field.
+You MUST respond ONLY with valid JSON in this exact format:
+{
+  "fixed_code": "<complete fixed Python function as a string>",
+  "explanation": "<required for hard tasks; describe what was wrong and why your fix is correct>"
+}
+Rules:
+- Return the COMPLETE function, not just the changed line.
+- The fixed_code must be valid Python that can be exec'd.
+- For hard tasks, explanation must discuss the algorithm, root cause, and fix.
+- Do NOT include markdown fences or any text outside the JSON object.
+"""
+def call_llm(buggy_code: str, instructions: str, difficulty: str,
+             feedback: str = None, attempt: int = 1) -> dict:
+    """Call the LLM and return parsed {fixed_code, explanation}."""
+    user_content = f"""Task difficulty: {difficulty}
+Instructions: {instructions}
+Buggy code:
+```python
+{buggy_code}
+```
+"""
+    if feedback and attempt > 1:
+        user_content += f"\nPrevious attempt feedback:\n{feedback}\n\nPlease fix the remaining issues."
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": user_content},
+    ]
+    try:
+        response = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=messages,
+            max_tokens=1000,
+            temperature=0.1,
+        )
+        content = response.choices[0].message.content.strip()
+        # Strip markdown fences if present
+        if content.startswith("```"):
+            lines = content.split("\n")
+            content = "\n".join(lines[1:-1]) if lines[-1] == "```" else "\n".join(lines[1:])
+        parsed = json.loads(content)
+        return {
+            # Fall back to the original code: /step rejects an empty fixed_code with HTTP 400.
+            "fixed_code": parsed.get("fixed_code") or buggy_code,
+            "explanation": parsed.get("explanation", None),
+        }
+    except json.JSONDecodeError:
+        # Fallback: return original code if parsing fails
+        return {"fixed_code": buggy_code, "explanation": None}
+    except Exception as e:
+        print(f"LLM call failed: {e}", file=sys.stderr)
+        return {"fixed_code": buggy_code, "explanation": None}
+# ─── Main Episode Loop ────────────────────────────────────────────────────────
+def run_episode(env_url: str, difficulty: str, episode_num: int) -> float:
+    """Run one full episode. Returns final reward."""
+    # Reset
+    reset_data = env_reset(env_url, difficulty)
+    obs = reset_data["observation"]
+    task_id = obs["task_id"]
+    buggy_code = obs["buggy_code"]
+    instructions = obs["instructions"]
+    log_start(task_id, difficulty, episode_num)
+    last_feedback = None
+    final_reward = 0.0
+    step_num = 0
+    for attempt in range(1, MAX_STEPS + 1):
+        step_num = attempt
+        # Call LLM
+        agent_action = call_llm(
+            buggy_code=buggy_code,
+            instructions=instructions,
+            difficulty=difficulty,
+            feedback=last_feedback,
+            attempt=attempt,
+        )
+        # Submit to environment
+        result = env_step(
+            env_url,
+            fixed_code=agent_action["fixed_code"],
+            explanation=agent_action.get("explanation"),
+        )
+        reward = result.get("reward", 0.0)
+        done = result.get("done", False)
+        obs_result = result.get("observation", {})
+        last_feedback = obs_result.get("feedback", "")
+        log_step(
+            task_id=task_id,
+            step=attempt,
+            action_summary=f"Submitted fix attempt {attempt} ({len(agent_action['fixed_code'])} chars)",
+            reward=reward,
+            done=done,
+        )
+        final_reward = reward
+        if done:
+            break
+    log_end(task_id, difficulty, final_reward, step_num, episode_num)
+    return final_reward
+def main():
+    parser = argparse.ArgumentParser(description="Code Debug Environment Baseline Agent")
+    parser.add_argument("--url", default=ENV_URL, help="Environment base URL")
+    parser.add_argument("--difficulty", default=None, choices=["easy", "medium", "hard", "all"],
+                        help="Difficulty to run. 'all' runs one episode per difficulty.")
+    args = parser.parse_args()
+    env_url = args.url.rstrip("/")
+    # Health check
+    try:
+        health = requests.get(f"{env_url}/health", timeout=10)
+        health.raise_for_status()
+        print(json.dumps({"type": "INFO", "message": f"Environment healthy at {env_url}"}), flush=True)
+    except Exception as e:
+        print(json.dumps({"type": "ERROR", "message": f"Health check failed: {e}"}), flush=True)
+        sys.exit(1)
+    # Determine episodes to run
+    if args.difficulty == "all" or args.difficulty is None:
+        episodes = DIFFICULTIES
+    else:
+        episodes = [args.difficulty]
+    all_rewards = []
+    for episode_num, difficulty in enumerate(episodes, start=1):
+        reward = run_episode(env_url, difficulty, episode_num)
+        all_rewards.append({"difficulty": difficulty, "reward": reward})
+        time.sleep(0.5)  # Small pause between episodes
+    # Summary
+    print(json.dumps({
+        "type": "SUMMARY",
+        "total_episodes": len(all_rewards),
+        "results": all_rewards,
+        "average_reward": round(sum(r["reward"] for r in all_rewards) / len(all_rewards), 3),
+        "timestamp": time.time(),
+    }), flush=True)
+if __name__ == "__main__":
+    main()
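The fence-stripping step in `call_llm` only handles a reply that starts with a fence and contains nothing else. A slightly more tolerant parser, sketched here with a hypothetical `parse_llm_json` helper, also copes with prose around the JSON object; it is an alternative under those assumptions, not the script's actual behavior.

```python
import json
from typing import Optional

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this block readable


def parse_llm_json(content: str) -> Optional[dict]:
    """Extract a JSON object from a model reply, tolerating fences and stray prose."""
    text = content.strip()
    if text.startswith(FENCE):
        lines = text.split("\n")[1:]  # drop the opening fence and any language tag
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence if present
        text = "\n".join(lines)
    # Fall back to the outermost braces when extra text surrounds the object.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


reply = FENCE + 'json\n{"fixed_code": "def f():\\n    return 1"}\n' + FENCE
print(parse_llm_json(reply))
```

Returning `None` instead of raising lets the caller decide on the fallback, such as resubmitting the original buggy code.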

models.py ADDED

	@@ -0,0 +1,73 @@

+# models.py
+# Typed Pydantic models for Action, Observation, and State
+# These are the contracts between the agent and the environment.
+from typing import Optional, List
+from pydantic import BaseModel, Field
+from openenv.core.env_server.types import Action, Observation, State
+class DebugAction(Action):
+    """Action submitted by the agent: fixed code + optional explanation."""
+    fixed_code: str = Field(
+        ...,
+        description="The corrected Python function as a string. Must be valid Python."
+    )
+    explanation: Optional[str] = Field(
+        default=None,
+        description=(
+            "Required for 'hard' difficulty tasks. Explain what was wrong "
+            "and why your fix is correct. Affects reward on hard tasks."
+        )
+    )
+class TestResult(BaseModel):
+    """Sub-model: result of a single test case."""
+    test_id: int
+    passed: bool
+    expected: str
+    got: str
+class DebugObservation(Observation):
+    """Observation returned by reset() and after each step()."""
+    # Task info
+    task_id: str = Field(..., description="Unique ID of the current task instance")
+    difficulty: str = Field(..., description="Task difficulty: easy | medium | hard")
+    buggy_code: str = Field(..., description="The buggy Python code the agent must fix")
+    instructions: str = Field(..., description="Natural language instructions for the task")
+    test_cases_description: str = Field(
+        ..., description="Description of what the test cases check"
+    )
+    # After step() — feedback
+    reward: Optional[float] = Field(
+        default=None, description="Score from 0.0 to 1.0 for this step"
+    )
+    passed_tests: Optional[int] = Field(
+        default=None, description="Number of test cases passed"
+    )
+    total_tests: Optional[int] = Field(
+        default=None, description="Total number of test cases"
+    )
+    feedback: Optional[str] = Field(
+        default=None,
+        description="Detailed feedback: which tests failed and why"
+    )
+    done: bool = Field(default=False, description="True when episode is complete")
+class DebugState(State):
+    """Internal environment state, returned by GET /state."""
+    episode_id: str = ""          # ← required by validator: GET /state must return episode_id
+    task_id: str
+    difficulty: str
+    step_count: int = 0
+    max_steps: int = 3
+    current_reward: float = 0.0
+    best_reward: float = 0.0
+    done: bool = False

openenv.yaml ADDED

	@@ -0,0 +1,48 @@

+spec_version: 1
+name: code-debug-env
+type: typed
+description: >
+  A real-world RL environment where an LLM agent diagnoses and fixes
+  buggy Python code across three difficulty levels (easy, medium, hard).
+  Tasks are drawn from real-world domains: data processing, API handlers,
+  and algorithmic functions. Rewards are partial and proportional to how
+  many test cases pass; on hard tasks, 30% of the reward comes from the explanation.
+version: 1.0.0
+author: your-hf-username   # ← REPLACE with your actual HF username before submitting
+runtime:
+  type: docker
+  port: 7860
+app:
+  entry: server/app.py
+  host: 0.0.0.0
+  port: 7860
+tasks:
+  - id: easy
+    description: "Fix a single off-by-one or operator bug in a Python function"
+    difficulty: easy
+    max_steps: 3
+    reward_range: [0.0, 1.0]
+  - id: medium
+    description: "Fix two bugs (logic + edge case) so all test cases pass"
+    difficulty: medium
+    max_steps: 3
+    reward_range: [0.0, 1.0]
+  - id: hard
+    description: "Fix an algorithmic bug AND provide a correct explanation"
+    difficulty: hard
+    max_steps: 3
+    reward_range: [0.0, 1.0]
+reward_range: [0.0, 1.0]
+api:
+  reset: /reset
+  step: /step
+  state: /state
+  health: /health

pyproject.toml ADDED

	@@ -0,0 +1,26 @@

+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "code-debug-env"
+version = "1.0.0"
+description = "OpenEnv environment for LLM-based code debugging"
+requires-python = ">=3.10"
+dependencies = [
+    "fastapi>=0.110.0",
+    "uvicorn[standard]>=0.29.0",
+    "pydantic>=2.0.0",
+    "openai>=1.0.0",
+    "requests>=2.31.0",
+    "openenv-core>=0.2.0",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "httpx>=0.27.0",
+]
+[tool.setuptools.packages.find]
+where = ["."]

server/Dockerfile ADDED

	@@ -0,0 +1,30 @@

+FROM python:3.11-slim
+WORKDIR /app
+RUN apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*
+RUN git clone https://github.com/meta-pytorch/OpenEnv.git /app/OpenEnv
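+RUN pip install --no-cache-dir \
+        fastapi \
+        "uvicorn[standard]" \
+        pydantic \
+        openai \
+        requests \
+        openenv-core
+# Best-effort editable install of the cloned OpenEnv, kept in its own RUN so that
+# "|| true" cannot mask a failure of the dependency install above.
+RUN pip install --no-cache-dir -e /app/OpenEnv || true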
+COPY . .
+ENV PYTHONPATH="/app:/app/OpenEnv:/app/OpenEnv/src"
+RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
+USER appuser
+HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')"
+EXPOSE 7860
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]

server/__init__.py ADDED

	@@ -0,0 +1 @@


+# server/__init__.py

server/__pycache__/__init__.cpython-39.pyc ADDED

Binary file (161 Bytes).

server/app.py ADDED

	@@ -0,0 +1,123 @@

+# server/app.py
+# FastAPI server exposing the OpenEnv standard endpoints.
+# Port 7860 required for Hugging Face Spaces.
+from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
+from typing import Optional
+from pydantic import BaseModel
+from server.environment import CodeDebugEnvironment
+from models import DebugAction, DebugObservation, DebugState
+app = FastAPI(
+    title="Code Debug Environment",
+    description=(
+        "An OpenEnv environment where LLM agents fix buggy Python code. "
+        "3 difficulty levels: easy (1 bug), medium (2 bugs), hard (algorithmic + explanation)."
+    ),
+    version="1.0.0",
+)
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# One global environment instance (single session)
+# For concurrent sessions, instantiate per-request with a session dict
+env = CodeDebugEnvironment()
+# ─── Request Models ─────────────────────────────────────────────────────────
+class ResetRequest(BaseModel):
+    difficulty: Optional[str] = None  # "easy" | "medium" | "hard" | None (random)
+class StepRequest(BaseModel):
+    fixed_code: str
+    explanation: Optional[str] = None
+# ─── Response wrapper matching OpenEnv StepResult shape ──────────────────────
+class StepResponse(BaseModel):
+    observation: dict
+    reward: float
+    done: bool
+# ─── Endpoints ───────────────────────────────────────────────────────────────
+@app.get("/health")
+async def health():
+    """Health check endpoint — must return 200 for submission validation."""
+    return {"status": "ok", "environment": "code-debug-env", "version": "1.0.0"}
+@app.post("/reset")
+async def reset(request: ResetRequest = ResetRequest()) -> dict:
+    """
+    Reset the environment to start a new episode.
+    Optionally pass difficulty: 'easy' | 'medium' | 'hard'
+    """
+    try:
+        observation = env.reset(difficulty=request.difficulty)
+        return {
+            "observation": observation.model_dump(),
+            "reward": 0.0,
+            "done": False,
+        }
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Reset failed: {str(e)}")
+@app.post("/step")
+async def step(request: StepRequest) -> StepResponse:
+    """
+    Submit a code fix (and optional explanation for hard tasks).
+    Returns observation with reward (0.0–1.0), feedback, and done flag.
+    """
+    if not request.fixed_code or not request.fixed_code.strip():
+        raise HTTPException(status_code=400, detail="fixed_code must not be empty.")
+    try:
+        action = DebugAction(
+            fixed_code=request.fixed_code,
+            explanation=request.explanation,
+        )
+        observation = env.step(action)
+        return StepResponse(
+            observation=observation.model_dump(),
+            reward=observation.reward or 0.0,
+            done=observation.done,
+        )
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Step failed: {str(e)}")
+@app.get("/state")
+async def state() -> dict:
+    """Return the current episode state."""
+    try:
+        s = env.state
+        return s.model_dump()
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"State failed: {str(e)}")
+@app.get("/tasks")
+async def list_tasks() -> dict:
+    """List available task IDs per difficulty (for inspection)."""
+    from server.tasks.task_easy import EASY_TASKS
+    from server.tasks.task_medium import MEDIUM_TASKS
+    from server.tasks.task_hard import HARD_TASKS
+    return {
+        "easy": [t["task_id"] for t in EASY_TASKS],
+        "medium": [t["task_id"] for t in MEDIUM_TASKS],
+        "hard": [t["task_id"] for t in HARD_TASKS],
+        "total": len(EASY_TASKS) + len(MEDIUM_TASKS) + len(HARD_TASKS),
+    }
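The comment in `server/app.py` notes that the single global `env` supports only one session. One way to support concurrent sessions is a registry keyed by session ID. The sketch below is a stand-alone illustration: `SessionStore` is a hypothetical class, and wiring it into the FastAPI endpoints (for example, passing a session ID in each request) is left as an assumption.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar
from uuid import uuid4

E = TypeVar("E")


class SessionStore(Generic[E]):
    """Maps session IDs to independent environment instances."""

    def __init__(self, factory: Callable[[], E]) -> None:
        self._factory = factory
        self._envs: Dict[str, E] = {}
        self._lock = threading.Lock()  # guards the dict across worker threads

    def create(self) -> str:
        """Create a fresh environment and return its session ID."""
        session_id = str(uuid4())
        with self._lock:
            self._envs[session_id] = self._factory()
        return session_id

    def get(self, session_id: str) -> E:
        """Look up an environment; raises KeyError for unknown sessions."""
        with self._lock:
            return self._envs[session_id]

    def close(self, session_id: str) -> None:
        """Drop a session; closing an unknown session is a no-op."""
        with self._lock:
            self._envs.pop(session_id, None)


# In the server the factory would be CodeDebugEnvironment; dict stands in here.
store = SessionStore(dict)
a, b = store.create(), store.create()
print(store.get(a) is store.get(b))  # False
```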

server/environment.py ADDED

	@@ -0,0 +1,147 @@

+# server/environment.py
+# Core environment: manages episode state, dispatches to task banks and graders.
+import random
+from uuid import uuid4
+from typing import Optional
+from openenv.core.env_server.interfaces import Environment
+from openenv.core.env_server.types import State
+from models import DebugAction, DebugObservation, DebugState
+from server.tasks.task_easy import get_random_easy_task
+from server.tasks.task_medium import get_random_medium_task
+from server.tasks.task_hard import get_random_hard_task
+from server.graders.grader_easy import grade_easy
+from server.graders.grader_medium import grade_medium
+from server.graders.grader_hard import grade_hard
+TASK_GETTERS = {
+    "easy": get_random_easy_task,
+    "medium": get_random_medium_task,
+    "hard": get_random_hard_task,
+}
+GRADERS = {
+    "easy": grade_easy,
+    "medium": grade_medium,
+    "hard": grade_hard,
+}
+MAX_STEPS = 3
+class CodeDebugEnvironment(Environment):
+    """
+    OpenEnv environment for LLM-based code debugging.
+    Supports 3 difficulty levels with partial rewards.
+    """
+    def __init__(self):
+        self._episode_id: str = str(uuid4())
+        self._difficulty: str = "easy"
+        self._current_task: Optional[dict] = None
+        self._step_count: int = 0
+        self._best_reward: float = 0.0
+        self._current_reward: float = 0.0
+        self._done: bool = False
+    def reset(self, difficulty: Optional[str] = None) -> DebugObservation:
+        """
+        Start a new episode. Optionally specify difficulty: easy | medium | hard.
+        If not specified or not recognized, a difficulty is chosen at random.
+        """
+        self._episode_id = str(uuid4())
+        self._step_count = 0
+        self._best_reward = 0.0
+        self._current_reward = 0.0
+        self._done = False
+        # Validate difficulty
+        if difficulty and difficulty in TASK_GETTERS:
+            self._difficulty = difficulty
+        else:
+            self._difficulty = random.choice(["easy", "medium", "hard"])
+        # Load a task
+        self._current_task = TASK_GETTERS[self._difficulty]()
+        return DebugObservation(
+            task_id=self._current_task["task_id"],
+            difficulty=self._difficulty,
+            buggy_code=self._current_task["buggy_code"],
+            instructions=self._current_task["instructions"],
+            test_cases_description=self._current_task["test_cases_description"],
+            reward=None,
+            passed_tests=None,
+            total_tests=len(self._current_task["test_cases"]),
+            feedback=None,
+            done=False,
+        )
+    def step(self, action: DebugAction) -> DebugObservation:
+        """
+        Agent submits fixed_code (and optionally explanation for hard tasks).
+        Returns observation with reward, feedback, and done flag.
+        """
+        if self._done:
+            return DebugObservation(
+                task_id=self._current_task["task_id"] if self._current_task else "none",
+                difficulty=self._difficulty,
+                buggy_code=self._current_task["buggy_code"] if self._current_task else "",
+                instructions="Episode is already done. Call reset() to start a new episode.",
+                test_cases_description="",
+                reward=self._best_reward,
+                passed_tests=None,
+                total_tests=0,
+                feedback="Episode ended. Please call reset() to start a new task.",
+                done=True,
+            )
+        self._step_count += 1
+        # Grade the submission
+        grader = GRADERS[self._difficulty]
+        if self._difficulty == "hard":
+            reward, passed, total, feedback, _ = grader(
+                action.fixed_code, self._current_task, action.explanation
+            )
+        else:
+            reward, passed, total, feedback, _ = grader(
+                action.fixed_code, self._current_task
+            )
+        self._current_reward = reward
+        self._best_reward = max(self._best_reward, reward)
+        # Episode ends if: perfect score OR max steps reached
+        done = (reward == 1.0) or (self._step_count >= MAX_STEPS)
+        self._done = done
+        return DebugObservation(
+            task_id=self._current_task["task_id"],
+            difficulty=self._difficulty,
+            buggy_code=self._current_task["buggy_code"],
+            instructions=self._current_task["instructions"],
+            test_cases_description=self._current_task["test_cases_description"],
+            reward=reward,
+            passed_tests=passed,
+            total_tests=total,
+            feedback=feedback,
+            done=done,
+        )
+    @property
+    def state(self) -> DebugState:
+        """Return current episode metadata."""
+        return DebugState(
+            episode_id=self._episode_id,
+            step_count=self._step_count,
+            task_id=self._current_task["task_id"] if self._current_task else "none",
+            difficulty=self._difficulty,
+            max_steps=MAX_STEPS,
+            current_reward=self._current_reward,
+            best_reward=self._best_reward,
+            done=self._done,
+        )
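
The termination rule in `step()` can be illustrated in isolation. The sketch below is not part of the environment: `run_episode` is a hypothetical helper that replays a list of per-step rewards, and `MAX_STEPS = 3` is an assumption taken from the README's overview table.

```python
MAX_STEPS = 3  # assumed value, taken from the README overview table


def run_episode(rewards):
    """Replay per-step rewards and apply the same stop rule as step().

    Hypothetical helper for illustration. Returns (steps_used, best_reward).
    """
    best_reward = 0.0
    step_count = 0
    for reward in rewards:
        step_count += 1
        best_reward = max(best_reward, reward)
        # The episode ends on a perfect score or when the step budget is spent.
        if reward == 1.0 or step_count >= MAX_STEPS:
            break
    return step_count, best_reward


print(run_episode([0.33, 1.0, 0.5]))        # (2, 1.0)
print(run_episode([0.0, 0.33, 0.67, 1.0]))  # (3, 0.67)
```

In the second call the fourth submission is never graded, and the best reward (0.67), not the most recent one, is what a `step()` call after the episode has ended would report.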

server/graders/__init__.py ADDED

	@@ -0,0 +1,6 @@

+# server/graders/__init__.py
+from .grader_easy import grade_easy
+from .grader_medium import grade_medium
+from .grader_hard import grade_hard
+__all__ = ["grade_easy", "grade_medium", "grade_hard"]

server/graders/grader_easy.py ADDED

	@@ -0,0 +1,90 @@

+# server/graders/grader_easy.py
+# Grades easy tasks: 1 bug, 3 test cases.
+# Reward is proportional to tests passed (0.33, 0.67, or 1.0 after rounding to 2 places).
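+import copy
+import inspect
+import traceback
+from typing import Tuple, List
+def _run_code_safely(code: str, func_name: str, test_input):
+    """
+    Executes the submitted code in a fresh namespace and calls the function.
+    Note: exec() is not a sandbox, and no timeout is applied here.
+    Returns (output, error_message).
+    """
+    namespace = {}
+    try:
+        exec(compile(code, "<submitted>", "exec"), namespace)
+    except SyntaxError as e:
+        return None, f"SyntaxError: {e}"
+    except Exception as e:
+        return None, f"Compile error: {e}"
+    func = namespace.get(func_name)
+    if func is None:
+        # Fall back to the first public callable defined by the submission
+        funcs = [
+            v for v in namespace.values()
+            if callable(v) and not getattr(v, "__name__", "_").startswith("_")
+        ]
+        if not funcs:
+            return None, "No callable function found in submitted code."
+        func = funcs[0]
+    try:
+        # A list input is ambiguous: it can be one list argument (find_max) or
+        # several positional arguments (binary_search). Unpack it only when the
+        # function does not take exactly one parameter.
+        n_params = len(inspect.signature(func).parameters)
+        # Copy the input so in-place mutation cannot corrupt the shared task data.
+        arg = copy.deepcopy(test_input)
+        if isinstance(arg, list) and n_params != 1:
+            result = func(*arg)
+        else:
+            result = func(arg)
+        return result, None
+    except Exception:
+        return None, f"RuntimeError: {traceback.format_exc(limit=2)}"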
+def _extract_func_name(code: str) -> str:
+    """Extract the first function name defined in the code."""
+    for line in code.splitlines():
+        line = line.strip()
+        if line.startswith("def "):
+            return line.split("(")[0].replace("def ", "").strip()
+    return "unknown"
+def grade_easy(fixed_code: str, task: dict) -> Tuple[float, int, int, str, List[dict]]:
+    """
+    Grade an easy task submission.
+    Returns:
+        reward (float): 0.0 to 1.0
+        passed (int): number of tests passed
+        total (int): total test cases
+        feedback (str): detailed feedback message
+        results (list): per-test results
+    """
+    test_cases = task["test_cases"]
+    total = len(test_cases)
+    passed = 0
+    results = []
+    func_name = _extract_func_name(fixed_code)
+    feedback_lines = []
+    for i, tc in enumerate(test_cases):
+        inp = tc["input"]
+        expected = tc["expected"]
+        got, error = _run_code_safely(fixed_code, func_name, inp)
+        if error:
+            results.append({"test_id": i + 1, "passed": False, "expected": str(expected), "got": f"ERROR: {error}"})
+            feedback_lines.append(f"Test {i+1}: ❌ Error — {error}")
+        elif got == expected:
+            passed += 1
+            results.append({"test_id": i + 1, "passed": True, "expected": str(expected), "got": str(got)})
+            feedback_lines.append(f"Test {i+1}: ✅ Passed — got {got!r}")
+        else:
+            results.append({"test_id": i + 1, "passed": False, "expected": str(expected), "got": str(got)})
+            feedback_lines.append(f"Test {i+1}: ❌ Failed — expected {expected!r}, got {got!r}")
+    reward = round(passed / total, 2)
+    feedback = "\n".join(feedback_lines)
+    if passed == total:
+        feedback += "\n🎉 All tests passed! Full reward."
+    else:
+        feedback += f"\n{passed}/{total} tests passed. Review the failing cases."
+    return reward, passed, total, feedback, results
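
`_run_code_safely` calls the submission in-process, so a submission that loops forever would block the grader. One possible mitigation, sketched below under stated assumptions, is to run each call in a child interpreter with a timeout. `run_with_timeout` is a hypothetical helper, not part of this repository. It assumes the function takes a single argument and that both the argument and the return value are JSON-serializable.

```python
import json
import subprocess
import sys


def run_with_timeout(code, func_name, arg, timeout_s=2.0):
    """Hypothetical helper: call func_name(arg) from `code` in a child process.

    Returns (result, error_message), mirroring _run_code_safely.
    """
    # The child defines the submitted function, calls it, and prints JSON.
    driver = (
        f"{code}\n"
        "import json, sys\n"
        f"print(json.dumps({func_name}(json.loads(sys.argv[1]))))\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", driver, json.dumps(arg)],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None, f"Timeout after {timeout_s}s"
    if proc.returncode != 0:
        # Report only the last stderr line (usually the exception message).
        lines = proc.stderr.strip().splitlines() or ["unknown error"]
        return None, lines[-1]
    # Use the last stdout line in case the submission printed something itself.
    return json.loads(proc.stdout.strip().splitlines()[-1]), None


print(run_with_timeout("def square(n):\n    return n * n\n", "square", 4))
```

Starting one interpreter per test case is slow, and a child process is still not a security boundary. The sketch only addresses hangs.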

server/graders/grader_hard.py ADDED

	@@ -0,0 +1,70 @@

+# server/graders/grader_hard.py
+# Grades hard tasks: algorithmic bug + explanation required.
+# Reward = 0.7 * test_score + 0.3 * explanation_score
+from typing import Tuple, List, Optional
+from .grader_easy import grade_easy
+def _score_explanation(explanation: Optional[str], keywords: List[str]) -> Tuple[float, str]:
+    """
+    Scores the explanation by checking for required conceptual keywords.
+    Returns (score 0.0-1.0, feedback string).
+    """
+    if not explanation or len(explanation.strip()) < 10:
+        return 0.0, "❌ No explanation provided (or fewer than 10 characters). Hard tasks require an explanation field."
+    explanation_lower = explanation.lower()
+    hits = [kw for kw in keywords if kw.lower() in explanation_lower]
+    # Full score once len(keywords) // 2 keywords (half, rounded down) are matched
+    score = min(1.0, len(hits) / max(1, len(keywords) // 2))
+    if score == 1.0:
+        feedback = f"✅ Explanation excellent! Mentioned key concepts: {', '.join(hits)}"
+    elif score > 0:
+        # Slice the list of missing keywords, not the joined string
+        missing = [kw for kw in keywords if kw.lower() not in explanation_lower]
+        feedback = (
+            f"⚠️ Partial explanation. Mentioned: {', '.join(hits)}. "
+            f"Consider discussing: {', '.join(missing[:3])}"
+        )
+    else:
+        feedback = (
+            f"❌ Explanation missing key concepts. "
+            f"Try to explain: {', '.join(keywords[:3])} in your analysis."
+        )
+    return round(score, 2), feedback
+def grade_hard(fixed_code: str, task: dict, explanation: Optional[str] = None) -> Tuple[float, int, int, str, List[dict]]:
+    """
+    Grade a hard task submission.
+    Reward = 0.7 * test_score + 0.3 * explanation_score
+    Returns:
+        reward (float): 0.0 to 1.0
+        passed (int)
+        total (int)
+        feedback (str)
+        results (list)
+    """
+    # Grade code
+    test_reward, passed, total, code_feedback, results = grade_easy(fixed_code, task)
+    # Grade explanation
+    keywords = task.get("explanation_keywords", [])
+    exp_score, exp_feedback = _score_explanation(explanation, keywords)
+    # Combined reward
+    final_reward = round(0.7 * test_reward + 0.3 * exp_score, 2)
+    feedback = (
+        f"--- Code Score (70% weight): {test_reward:.2f} ---\n"
+        f"{code_feedback}\n\n"
+        f"--- Explanation Score (30% weight): {exp_score:.2f} ---\n"
+        f"{exp_feedback}\n\n"
+        f"=== Final Reward: {final_reward:.2f} ==="
+    )
+    if passed < total and not explanation:
+        feedback += "\n💡 Tip: Fix the code bugs AND provide a clear explanation for max reward."
+    return final_reward, passed, total, feedback, results
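
The combined reward can be checked by hand. The sketch below re-implements only the arithmetic of `_score_explanation` and `grade_hard`, without the feedback strings. The helper names `explanation_score` and `combined_reward` and the two sample explanations are made up for illustration; the keyword list is the one from task `hard_001`.

```python
def explanation_score(explanation, keywords):
    """Keyword-hit score, using the same formula as _score_explanation."""
    if not explanation or len(explanation.strip()) < 10:
        return 0.0
    text = explanation.lower()
    hits = [kw for kw in keywords if kw.lower() in text]
    return round(min(1.0, len(hits) / max(1, len(keywords) // 2)), 2)


def combined_reward(passed, total, explanation, keywords):
    """0.7 * test score + 0.3 * explanation score, rounded to 2 places."""
    test_score = round(passed / total, 2)
    return round(0.7 * test_score + 0.3 * explanation_score(explanation, keywords), 2)


KEYWORDS = ["boundary", "index", "range", "n - i - 1", "out of bounds", "last element"]

# Four keyword hits (only three are needed) and all tests passing.
full = combined_reward(
    3, 3,
    "The inner loop range runs past the boundary, so the index goes out of bounds.",
    KEYWORDS,
)
# One keyword hit ("last element") and two of three tests passing.
partial = combined_reward(2, 3, "The swap touches the last element twice.", KEYWORDS)

print(full, partial)  # 1.0 0.57
```

With six keywords, three hits already give the full 0.3 explanation share. A single hit gives 0.33 × 0.3, which is roughly 0.10.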

server/graders/grader_medium.py ADDED

	@@ -0,0 +1,22 @@

+# server/graders/grader_medium.py
+# Grades medium tasks: 2 bugs, 3 test cases.
+# Same proportional reward as easy but stricter — both bugs must be fixed for full score.
+from .grader_easy import grade_easy  # reuse the same logic
+def grade_medium(fixed_code: str, task: dict):
+    """
+    Grade a medium task. Same mechanics as easy — proportional reward by tests passed.
+    Returns same tuple: reward, passed, total, feedback, results
+    """
+    reward, passed, total, feedback, results = grade_easy(fixed_code, task)
+    # Add medium-specific feedback hint
+    if passed < total:
+        feedback += (
+            "\n💡 Hint: Medium tasks have TWO bugs. "
+            "Make sure you fixed both the primary logic bug AND the edge case."
+        )
+    return reward, passed, total, feedback, results
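
Because easy and medium tasks share the same proportional formula, the reward can only take a few values. A minimal sketch follows, assuming three test cases per task as stated in the grader comments; `proportional_reward` is an illustrative name, not a function in this repository.

```python
def proportional_reward(passed, total):
    """Same arithmetic as grade_easy: pass rate rounded to two decimals."""
    return round(passed / total, 2)


# All reachable rewards for a task with three test cases.
ladder = [proportional_reward(k, 3) for k in range(4)]
print(ladder)  # [0.0, 0.33, 0.67, 1.0]
```

For a medium task, fixing only one of the two bugs would typically land on 0.33 or 0.67, depending on which test cases exercise the remaining bug.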

server/requirements.txt ADDED

	@@ -0,0 +1,6 @@

+fastapi>=0.110.0
+uvicorn[standard]>=0.29.0
+pydantic>=2.0.0
+openai>=1.0.0
+requests>=2.31.0
+openenv-core>=0.2.0

server/tasks/__init__.py ADDED

	@@ -0,0 +1,10 @@

+# server/tasks/__init__.py
+from .task_easy import get_random_easy_task, EASY_TASKS
+from .task_medium import get_random_medium_task, MEDIUM_TASKS
+from .task_hard import get_random_hard_task, HARD_TASKS
+__all__ = [
+    "get_random_easy_task", "EASY_TASKS",
+    "get_random_medium_task", "MEDIUM_TASKS",
+    "get_random_hard_task", "HARD_TASKS",
+]

server/tasks/__pycache__/__init__.cpython-39.pyc ADDED

Binary file (449 Bytes)

server/tasks/__pycache__/task_easy.cpython-39.pyc ADDED

Binary file (7.37 kB)

server/tasks/__pycache__/task_hard.cpython-39.pyc ADDED

Binary file (16.5 kB)

server/tasks/__pycache__/task_medium.cpython-39.pyc ADDED

Binary file (10.5 kB)

server/tasks/task_easy.py ADDED

	@@ -0,0 +1,415 @@

+# server/tasks/task_easy.py
+# 15 single-bug tasks from real-world domains.
+# Each bug is exactly ONE mistake: off-by-one, wrong operator, wrong return, etc.
+import random
+EASY_TASKS = [
+    {
+        "task_id": "easy_001",
+        "domain": "data processing",
+        "instructions": (
+            "The function below is supposed to return the average of a list of numbers. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def average(nums):
+    total = 0
+    for n in nums:
+        total += n
+    return total / len(nums) + 1
+""",
+        "fixed_code": """\
+def average(nums):
+    total = 0
+    for n in nums:
+        total += n
+    return total / len(nums)
+""",
+        "test_cases": [
+            {"input": [2, 4, 6], "expected": 4.0},
+            {"input": [1, 1, 1, 1], "expected": 1.0},
+            {"input": [10, 20], "expected": 15.0},
+        ],
+        "test_cases_description": "Checks that average([2,4,6])==4.0, average([1,1,1,1])==1.0, average([10,20])==15.0",
+    },
+    {
+        "task_id": "easy_002",
+        "domain": "string processing",
+        "instructions": (
+            "The function should count how many words are in a sentence. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def count_words(sentence):
+    words = sentence.split()
+    return len(words) - 1
+""",
+        "fixed_code": """\
+def count_words(sentence):
+    words = sentence.split()
+    return len(words)
+""",
+        "test_cases": [
+            {"input": "hello world", "expected": 2},
+            {"input": "one two three four", "expected": 4},
+            {"input": "single", "expected": 1},
+        ],
+        "test_cases_description": "Counts words in a sentence correctly",
+    },
+    {
+        "task_id": "easy_003",
+        "domain": "data processing",
+        "instructions": (
+            "The function should return the maximum value in a list. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def find_max(nums):
+    max_val = nums[0]
+    for i in range(1, len(nums) + 1):
+        if nums[i] > max_val:
+            max_val = nums[i]
+    return max_val
+""",
+        "fixed_code": """\
+def find_max(nums):
+    max_val = nums[0]
+    for i in range(1, len(nums)):
+        if nums[i] > max_val:
+            max_val = nums[i]
+    return max_val
+""",
+        "test_cases": [
+            {"input": [3, 1, 4, 1, 5, 9], "expected": 9},
+            {"input": [10, 2, 8], "expected": 10},
+            {"input": [7], "expected": 7},
+        ],
+        "test_cases_description": "Finds max value in a list without IndexError",
+    },
+    {
+        "task_id": "easy_004",
+        "domain": "boolean logic",
+        "instructions": (
+            "The function checks if a number is even. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def is_even(n):
+    return n % 2 == 1
+""",
+        "fixed_code": """\
+def is_even(n):
+    return n % 2 == 0
+""",
+        "test_cases": [
+            {"input": 4, "expected": True},
+            {"input": 7, "expected": False},
+            {"input": 0, "expected": True},
+        ],
+        "test_cases_description": "Correctly identifies even numbers",
+    },
+    {
+        "task_id": "easy_005",
+        "domain": "list operations",
+        "instructions": (
+            "The function should return the second element of a list. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def second_element(lst):
+    return lst[2]
+""",
+        "fixed_code": """\
+def second_element(lst):
+    return lst[1]
+""",
+        "test_cases": [
+            {"input": [10, 20, 30], "expected": 20},
+            {"input": ["a", "b", "c"], "expected": "b"},
+            {"input": [99, 100], "expected": 100},
+        ],
+        "test_cases_description": "Returns correct second element (index 1)",
+    },
+    {
+        "task_id": "easy_006",
+        "domain": "math",
+        "instructions": (
+            "The function should compute the factorial of n. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def factorial(n):
+    if n == 0:
+        return 0
+    result = 1
+    for i in range(1, n + 1):
+        result *= i
+    return result
+""",
+        "fixed_code": """\
+def factorial(n):
+    if n == 0:
+        return 1
+    result = 1
+    for i in range(1, n + 1):
+        result *= i
+    return result
+""",
+        "test_cases": [
+            {"input": 0, "expected": 1},
+            {"input": 5, "expected": 120},
+            {"input": 3, "expected": 6},
+        ],
+        "test_cases_description": "Correct factorial including base case factorial(0)==1",
+    },
+    {
+        "task_id": "easy_007",
+        "domain": "string processing",
+        "instructions": (
+            "The function should check if a string is a palindrome. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def is_palindrome(s):
+    return s == s[1:][::-1]
+""",
+        "fixed_code": """\
+def is_palindrome(s):
+    return s == s[::-1]
+""",
+        "test_cases": [
+            {"input": "racecar", "expected": True},
+            {"input": "hello", "expected": False},
+            {"input": "madam", "expected": True},
+        ],
+        "test_cases_description": "Correctly identifies palindromes",
+    },
+    {
+        "task_id": "easy_008",
+        "domain": "data processing",
+        "instructions": (
+            "The function should sum all even numbers in a list. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def sum_evens(nums):
+    total = 0
+    for n in nums:
+        if n % 2 == 1:
+            total += n
+    return total
+""",
+        "fixed_code": """\
+def sum_evens(nums):
+    total = 0
+    for n in nums:
+        if n % 2 == 0:
+            total += n
+    return total
+""",
+        "test_cases": [
+            {"input": [1, 2, 3, 4, 5, 6], "expected": 12},
+            {"input": [1, 3, 5], "expected": 0},
+            {"input": [2, 4], "expected": 6},
+        ],
+        "test_cases_description": "Sums only even numbers",
+    },
+    {
+        "task_id": "easy_009",
+        "domain": "string processing",
+        "instructions": (
+            "The function should reverse a string. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def reverse_string(s):
+    return s[1:][::-1]
+""",
+        "fixed_code": """\
+def reverse_string(s):
+    return s[::-1]
+""",
+        "test_cases": [
+            {"input": "hello", "expected": "olleh"},
+            {"input": "abc", "expected": "cba"},
+            {"input": "x", "expected": "x"},
+        ],
+        "test_cases_description": "Reverses a string correctly",
+    },
+    {
+        "task_id": "easy_010",
+        "domain": "data processing",
+        "instructions": (
+            "The function should return the minimum value from a list. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def find_min(nums):
+    min_val = nums[0]
+    for n in nums:
+        if n > min_val:
+            min_val = n
+    return min_val
+""",
+        "fixed_code": """\
+def find_min(nums):
+    min_val = nums[0]
+    for n in nums:
+        if n < min_val:
+            min_val = n
+    return min_val
+""",
+        "test_cases": [
+            {"input": [3, 1, 4, 1, 5], "expected": 1},
+            {"input": [10, 2, 8], "expected": 2},
+            {"input": [-5, 0, 5], "expected": -5},
+        ],
+        "test_cases_description": "Finds minimum value in a list",
+    },
+    {
+        "task_id": "easy_011",
+        "domain": "math",
+        "instructions": (
+            "The function should check if a number is prime. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def is_prime(n):
+    if n < 2:
+        return False
+    for i in range(2, n):
+        if n % i == 0:
+            return True
+    return False
+""",
+        "fixed_code": """\
+def is_prime(n):
+    if n < 2:
+        return False
+    for i in range(2, n):
+        if n % i == 0:
+            return False
+    return True
+""",
+        "test_cases": [
+            {"input": 7, "expected": True},
+            {"input": 4, "expected": False},
+            {"input": 13, "expected": True},
+        ],
+        "test_cases_description": "Correctly identifies prime numbers",
+    },
+    {
+        "task_id": "easy_012",
+        "domain": "list operations",
+        "instructions": (
+            "The function should remove duplicates from a list while preserving order. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def remove_duplicates(lst):
+    seen = set()
+    result = []
+    for item in lst:
+        if item in seen:
+            result.append(item)
+        seen.add(item)
+    return result
+""",
+        "fixed_code": """\
+def remove_duplicates(lst):
+    seen = set()
+    result = []
+    for item in lst:
+        if item not in seen:
+            result.append(item)
+        seen.add(item)
+    return result
+""",
+        "test_cases": [
+            {"input": [1, 2, 2, 3, 3, 3], "expected": [1, 2, 3]},
+            {"input": ["a", "b", "a"], "expected": ["a", "b"]},
+            {"input": [1], "expected": [1]},
+        ],
+        "test_cases_description": "Removes duplicates while preserving order",
+    },
+    {
+        "task_id": "easy_013",
+        "domain": "string processing",
+        "instructions": (
+            "The function should capitalize the first letter of every word. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def title_case(sentence):
+    return sentence.lower()
+""",
+        "fixed_code": """\
+def title_case(sentence):
+    return sentence.title()
+""",
+        "test_cases": [
+            {"input": "hello world", "expected": "Hello World"},
+            {"input": "the quick brown fox", "expected": "The Quick Brown Fox"},
+            {"input": "python", "expected": "Python"},
+        ],
+        "test_cases_description": "Converts sentence to title case",
+    },
+    {
+        "task_id": "easy_014",
+        "domain": "data processing",
+        "instructions": (
+            "The function should return the length of the longest word in a sentence. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def longest_word_length(sentence):
+    words = sentence.split()
+    return min(len(w) for w in words)
+""",
+        "fixed_code": """\
+def longest_word_length(sentence):
+    words = sentence.split()
+    return max(len(w) for w in words)
+""",
+        "test_cases": [
+            {"input": "hello world", "expected": 5},
+            {"input": "I am learning Python programming", "expected": 11},
+            {"input": "cat", "expected": 3},
+        ],
+        "test_cases_description": "Returns length of the longest word",
+    },
+    {
+        "task_id": "easy_015",
+        "domain": "math",
+        "instructions": (
+            "The function should return n raised to the power of 2. "
+            "It has exactly one bug. Fix it."
+        ),
+        "buggy_code": """\
+def square(n):
+    return n * 3
+""",
+        "fixed_code": """\
+def square(n):
+    return n * n
+""",
+        "test_cases": [
+            {"input": 4, "expected": 16},
+            {"input": 0, "expected": 0},
+            {"input": -3, "expected": 9},
+        ],
+        "test_cases_description": "Returns n squared",
+    },
+]
+def get_random_easy_task() -> dict:
+    return random.choice(EASY_TASKS).copy()
+def get_task_by_id(task_id: str) -> dict:
+    for t in EASY_TASKS:
+        if t["task_id"] == task_id:
+            return t.copy()
+    return random.choice(EASY_TASKS).copy()
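
A task list like this is easier to trust when each entry is checked mechanically: the reference `fixed_code` should pass every test, and the `buggy_code` should fail at least one. The sketch below shows one way to do that, using a copy of `easy_004`. `count_passed` and `check_task` are hypothetical helpers, and the sketch assumes a single-argument function whose input is passed as-is.

```python
SAMPLE_TASK = {
    "task_id": "easy_004",
    "buggy_code": "def is_even(n):\n    return n % 2 == 1\n",
    "fixed_code": "def is_even(n):\n    return n % 2 == 0\n",
    "test_cases": [
        {"input": 4, "expected": True},
        {"input": 7, "expected": False},
        {"input": 0, "expected": True},
    ],
}


def count_passed(code, func_name, test_cases):
    """Run func_name from `code` on every test case and count the matches."""
    namespace = {}
    exec(code, namespace)
    func = namespace[func_name]
    passed = 0
    for tc in test_cases:
        try:
            if func(tc["input"]) == tc["expected"]:
                passed += 1
        except Exception:
            pass  # an exception counts as a failed test
    return passed


def check_task(task, func_name):
    """True when fixed_code passes all tests and buggy_code fails at least one."""
    tests = task["test_cases"]
    fixed_ok = count_passed(task["fixed_code"], func_name, tests) == len(tests)
    buggy_fails = count_passed(task["buggy_code"], func_name, tests) < len(tests)
    return fixed_ok and buggy_fails


print(check_task(SAMPLE_TASK, "is_even"))  # True
```

Running such a check over every task would flag entries whose tests cannot tell the buggy version from the fixed one.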

server/tasks/task_hard.py ADDED

	@@ -0,0 +1,628 @@

+# server/tasks/task_hard.py
+# 15 hard tasks: algorithmic bugs + agent must explain what was wrong.
+# Reward is based on test pass rate PLUS explanation quality.
+import random
+HARD_TASKS = [
+    {
+        "task_id": "hard_001",
+        "domain": "sorting algorithm",
+        "instructions": (
+            "The function implements bubble sort but is broken. "
+            "Fix the algorithm AND explain what was wrong in your 'explanation' field. "
+            "Explanation must mention: loop range, boundary, or swap logic."
+        ),
+        "buggy_code": """\
+def bubble_sort(arr):
+    n = len(arr)
+    for i in range(n):
+        for j in range(n - i):
+            if arr[j] > arr[j + 1]:
+                arr[j], arr[j + 1] = arr[j + 1], arr[j]
+    return arr
+""",
+        "fixed_code": """\
+def bubble_sort(arr):
+    n = len(arr)
+    for i in range(n):
+        for j in range(n - i - 1):
+            if arr[j] > arr[j + 1]:
+                arr[j], arr[j + 1] = arr[j + 1], arr[j]
+    return arr
+""",
+        "explanation_keywords": ["boundary", "index", "range", "n - i - 1", "out of bounds", "last element"],
+        "test_cases": [
+            {"input": [64, 34, 25, 12, 22, 11, 90], "expected": [11, 12, 22, 25, 34, 64, 90]},
+            {"input": [5, 1, 4, 2, 8], "expected": [1, 2, 4, 5, 8]},
+            {"input": [1], "expected": [1]},
+        ],
+        "test_cases_description": "Bubble sort with correct inner loop boundary (n - i - 1)",
+    },
+    {
+        "task_id": "hard_002",
+        "domain": "dynamic programming",
+        "instructions": (
+            "The function computes the longest increasing subsequence (LIS) length. "
+            "Fix the algorithm AND explain what was wrong. "
+            "Explanation must mention: initialization, dp transition, or base case."
+        ),
+        "buggy_code": """\
+def lis_length(nums):
+    if not nums:
+        return 0
+    dp = [0] * len(nums)
+    for i in range(len(nums)):
+        for j in range(i):
+            if nums[j] < nums[i]:
+                dp[i] = max(dp[i], dp[j] + 1)
+    return max(dp)
+""",
+        "fixed_code": """\
+def lis_length(nums):
+    if not nums:
+        return 0
+    dp = [1] * len(nums)
+    for i in range(len(nums)):
+        for j in range(i):
+            if nums[j] < nums[i]:
+                dp[i] = max(dp[i], dp[j] + 1)
+    return max(dp)
+""",
+        "explanation_keywords": ["initialization", "base case", "dp[i]", "1", "zero", "initial value"],
+        "test_cases": [
+            {"input": [10, 9, 2, 5, 3, 7, 101, 18], "expected": 4},
+            {"input": [0, 1, 0, 3, 2, 3], "expected": 4},
+            {"input": [7, 7, 7, 7], "expected": 1},
+        ],
+        "test_cases_description": "LIS with dp initialized to 1 (not 0)",
+    },
+    {
+        "task_id": "hard_003",
+        "domain": "binary search",
+        "instructions": (
+            "The function does binary search on a sorted list. "
+            "Fix the algorithm AND explain what was wrong. "
+            "Explanation must mention: search bounds (low/high), boundary, off-by-one, or infinite loop."
+        ),
+        "buggy_code": """\
+def binary_search(arr, target):
+    low, high = 0, len(arr)
+    while low < high:
+        mid = (low + high) // 2
+        if arr[mid] == target:
+            return mid
+        elif arr[mid] < target:
+            low = mid
+        else:
+            high = mid - 1
+    return -1
+""",
+        "fixed_code": """\
+def binary_search(arr, target):
+    low, high = 0, len(arr) - 1
+    while low <= high:
+        mid = (low + high) // 2
+        if arr[mid] == target:
+            return mid
+        elif arr[mid] < target:
+            low = mid + 1
+        else:
+            high = mid - 1
+    return -1
+""",
+        "explanation_keywords": ["high", "len - 1", "low = mid", "infinite loop", "boundary", "off-by-one"],
+        "test_cases": [
+            {"input": [[1, 3, 5, 7, 9], 7], "expected": 3},
+            # Small array: the buggy bounds return -1 here instead of 0
+            {"input": [[1, 3], 1], "expected": 0},
+            {"input": [[1, 3, 5, 7, 9], 6], "expected": -1},
+        ],
+        "test_cases_description": "Binary search: high = len-1, low = mid+1, while low <= high",
+    },
+    {
+        "task_id": "hard_004",
+        "domain": "dynamic programming",
+        "instructions": (
+            "The function computes the minimum number of coins to make 'amount'. "
+            "Fix the algorithm AND explain what was wrong. "
+            "Explanation must mention: initialization, infinity, dp table, or base case."
+        ),
+        "buggy_code": """\
+def coin_change(coins, amount):
+    dp = [0] * (amount + 1)
+    dp[0] = 0
+    for i in range(1, amount + 1):
+        for coin in coins:
+            if coin <= i:
+                dp[i] = min(dp[i], dp[i - coin] + 1)
+    return dp[amount] if dp[amount] != 0 else -1
+""",
+        "fixed_code": """\
+def coin_change(coins, amount):
+    dp = [float('inf')] * (amount + 1)
+    dp[0] = 0
+    for i in range(1, amount + 1):
+        for coin in coins:
+            if coin <= i:
+                dp[i] = min(dp[i], dp[i - coin] + 1)
+    return dp[amount] if dp[amount] != float('inf') else -1
+""",
+        "explanation_keywords": ["infinity", "inf", "initialization", "0 instead of inf", "unreachable", "base"],
+        "test_cases": [
+            {"input": [[1, 5, 6, 9], 11], "expected": 2},
+            {"input": [[2], 3], "expected": -1},
+            {"input": [[1, 2, 5], 11], "expected": 3},
+        ],
+        "test_cases_description": "Coin change DP: initialized to inf, not 0",
+    },
+    {
+        "task_id": "hard_005",
+        "domain": "graph algorithm",
+        "instructions": (
+            "The function checks if a directed graph has a cycle using DFS. "
+            "Fix it AND explain what was wrong. "
+            "Explanation must mention: visited, recursion stack, back edge, or state."
+        ),
+        "buggy_code": """\
+def has_cycle(graph):
+    visited = set()
+    def dfs(node):
+        visited.add(node)
+        for neighbor in graph.get(node, []):
+            if neighbor in visited:
+                return True
+            if dfs(neighbor):
+                return True
+        return False
+    for node in graph:
+        if node not in visited:
+            if dfs(node):
+                return True
+    return False
+""",
+        "fixed_code": """\
+def has_cycle(graph):
+    visited = set()
+    rec_stack = set()
+    def dfs(node):
+        visited.add(node)
+        rec_stack.add(node)
+        for neighbor in graph.get(node, []):
+            if neighbor not in visited:
+                if dfs(neighbor):
+                    return True
+            elif neighbor in rec_stack:
+                return True
+        rec_stack.remove(node)
+        return False
+    for node in graph:
+        if node not in visited:
+            if dfs(node):
+                return True
+    return False
+""",
+        "explanation_keywords": ["recursion stack", "rec_stack", "back edge", "visited", "false positive", "path"],
+        "test_cases": [
+            {"input": {"A": ["B"], "B": ["C"], "C": ["A"]}, "expected": True},
+            {"input": {"A": ["B"], "B": ["C"], "C": []}, "expected": False},
+            {"input": {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}, "expected": False},
+        ],
+        "test_cases_description": "Cycle detection in directed graph using recursion stack",
+    },
+    {
+        "task_id": "hard_006",
+        "domain": "dynamic programming",
+        "instructions": (
+            "The function computes the maximum subarray sum (Kadane's algorithm). "
+            "Fix it AND explain what was wrong. "
+            "Explanation must mention: initialization, negative numbers, current_sum, or reset."
+        ),
+        "buggy_code": """\
+def max_subarray(nums):
+    max_sum = 0
+    current_sum = 0
+    for n in nums:
+        current_sum = max(n, current_sum + n)
+        max_sum = max(max_sum, current_sum)
+    return max_sum
+""",
+        "fixed_code": """\
+def max_subarray(nums):
+    max_sum = nums[0]
+    current_sum = nums[0]
+    for n in nums[1:]:
+        current_sum = max(n, current_sum + n)
+        max_sum = max(max_sum, current_sum)
+    return max_sum
+""",
+        "explanation_keywords": ["initialization", "negative", "nums[0]", "all negative", "zero", "initial"],
+        "test_cases": [
+            {"input": [-2, 1, -3, 4, -1, 2, 1, -5, 4], "expected": 6},
+            {"input": [-1, -2, -3, -4], "expected": -1},
+            {"input": [1], "expected": 1},
+        ],
+        "test_cases_description": "Kadane's algorithm handles all-negative arrays",
+    },
+    {
+        "task_id": "hard_007",
+        "domain": "string algorithm",
+        "instructions": (
+            "The function checks if a string has balanced brackets. "
+            "Fix it AND explain what was wrong. "
+            "Explanation must mention: stack, matching, empty stack, or closing bracket."
+        ),
+        "buggy_code": """\
+def is_balanced(s):
+    stack = []
+    matching = {')': '(', ']': '[', '}': '{'}
+    for ch in s:
+        if ch in '([{':
+            stack.append(ch)
+        elif ch in ')]}':
+            if stack and stack[-1] == matching[ch]:
+                stack.pop()
+    return len(stack) == 0
+""",
+        "fixed_code": """\
+def is_balanced(s):
+    stack = []
+    matching = {')': '(', ']': '[', '}': '{'}
+    for ch in s:
+        if ch in '([{':
+            stack.append(ch)
+        elif ch in ')]}':
+            if not stack or stack[-1] != matching[ch]:
+                return False
+            stack.pop()
+    return len(stack) == 0
+""",
+        "explanation_keywords": ["stack", "empty stack", "mismatch", "not stack", "early return", "closing"],
+        "test_cases": [
+            {"input": "([{}])", "expected": True},
+            {"input": "([)]", "expected": False},
+            {"input": "]", "expected": False},
+        ],
+        "test_cases_description": "Balanced brackets: early return False on mismatch or empty stack",
+    },
+    {
+        "task_id": "hard_008",
+        "domain": "dynamic programming",
+        "instructions": (
+            "The function computes the number of ways to climb n stairs (1 or 2 steps at a time). "
+            "Fix it AND explain what was wrong. "
+            "Explanation must mention: base case, dp, index, or off-by-one."
+        ),
+        "buggy_code": """\
+def climb_stairs(n):
+    if n <= 0:
+        return 0
+    dp = [0] * (n + 1)
+    dp[0] = 1
+    dp[1] = 1
+    for i in range(3, n + 1):
+        dp[i] = dp[i - 1] + dp[i - 2]
+    return dp[n]
+""",
+        "fixed_code": """\
+def climb_stairs(n):
+    if n <= 0:
+        return 0
+    dp = [0] * (n + 1)
+    dp[0] = 1
+    dp[1] = 1
+    for i in range(2, n + 1):
+        dp[i] = dp[i - 1] + dp[i - 2]
+    return dp[n]
+""",
+        "explanation_keywords": ["range", "starts at 3", "range(2", "off-by-one", "dp[2]", "skipped"],
+        "test_cases": [
+            {"input": 2, "expected": 2},
+            {"input": 3, "expected": 3},
+            {"input": 5, "expected": 8},
+        ],
+        "test_cases_description": "Climb stairs DP: loop starts at range(2, ...) not range(3, ...)",
+    },
+    {
+        "task_id": "hard_009",
+        "domain": "data processing",
+        "instructions": (
+            "The function implements quicksort. "
+            "Fix it AND explain what was wrong. "
+            "Explanation must mention: duplicate, equal, pivot included, or arr[1:]."
+        ),
+        "buggy_code": """\
+def quicksort(arr):
+    if len(arr) <= 1:
+        return arr
+    pivot = arr[0]
+    left = [x for x in arr if x < pivot]
+    right = [x for x in arr if x > pivot]
+    return quicksort(left) + [pivot] + quicksort(right)
+""",
+        "fixed_code": """\
+def quicksort(arr):
+    if len(arr) <= 1:
+        return arr
+    pivot = arr[0]
+    left = [x for x in arr[1:] if x <= pivot]
+    right = [x for x in arr[1:] if x > pivot]
+    return quicksort(left) + [pivot] + quicksort(right)
+""",
+        "explanation_keywords": ["duplicate", "arr[1:]", "pivot included", "equal", "lost", "missing"],
+        "test_cases": [
+            {"input": [3, 6, 8, 10, 1, 2, 1], "expected": [1, 1, 2, 3, 6, 8, 10]},
+            {"input": [5, 5, 5], "expected": [5, 5, 5]},
+            {"input": [1], "expected": [1]},
+        ],
+        "test_cases_description": "Quicksort handles duplicates: arr[1:] and x <= pivot",
+    },
+    {
+        "task_id": "hard_010",
+        "domain": "graph algorithm",
+        "instructions": (
+            "The function finds the shortest path length in an unweighted graph using BFS. "
+            "Fix it AND explain what was wrong. "
+            "Explanation must mention: visited, queue, infinite loop, or distance tracking."
+        ),
+        "buggy_code": """\
+from collections import deque
+def bfs_shortest_path(graph, start, end):
+    queue = deque([(start, 0)])
+    while queue:
+        node, dist = queue.popleft()
+        if node == end:
+            return dist
+        for neighbor in graph.get(node, []):
+            queue.append((neighbor, dist + 1))
+    return -1
+""",
+        "fixed_code": """\
+from collections import deque
+def bfs_shortest_path(graph, start, end):
+    visited = set([start])
+    queue = deque([(start, 0)])
+    while queue:
+        node, dist = queue.popleft()
+        if node == end:
+            return dist
+        for neighbor in graph.get(node, []):
+            if neighbor not in visited:
+                visited.add(neighbor)
+                queue.append((neighbor, dist + 1))
+    return -1
+""",
+        "explanation_keywords": ["visited", "infinite loop", "revisit", "cycle", "set", "already visited"],
+        "test_cases": [
+            {"input": [{"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}, "A", "D"], "expected": 2},
+            {"input": [{"A": ["B"], "B": ["A"]}, "A", "B"], "expected": 1},
+            {"input": [{"A": ["B"]}, "A", "C"], "expected": -1},
+        ],
+        "test_cases_description": "BFS shortest path with visited set to prevent revisiting",
+    },
+    {
+        "task_id": "hard_011",
+        "domain": "dynamic programming",
+        "instructions": (
+            "The function computes the 0/1 knapsack maximum value. "
+            "Fix it AND explain what was wrong. "
+            "Explanation must mention: capacity, dp table, iteration order, or overwrite."
+        ),
+        "buggy_code": """\
+def knapsack(weights, values, capacity):
+    n = len(weights)
+    dp = [0] * (capacity + 1)
+    for i in range(n):
+        for w in range(weights[i], capacity + 1):
+            dp[w] = max(dp[w], dp[w - weights[i]] + values[i])
+    return dp[capacity]
+""",
+        "fixed_code": """\
+def knapsack(weights, values, capacity):
+    n = len(weights)
+    dp = [0] * (capacity + 1)
+    for i in range(n):
+        for w in range(capacity, weights[i] - 1, -1):
+            dp[w] = max(dp[w], dp[w - weights[i]] + values[i])
+    return dp[capacity]
+""",
+        "explanation_keywords": ["reverse", "backward", "overwrite", "0/1", "unbounded", "iteration order", "right to left"],
+        "test_cases": [
+            {"input": [[2, 3, 4, 5], [3, 4, 5, 6], 5], "expected": 7},
+            {"input": [[1, 2, 3], [6, 10, 12], 5], "expected": 22},
+            {"input": [[5], [10], 3], "expected": 0},
+        ],
+        "test_cases_description": "0/1 Knapsack: inner loop must go backward to avoid using item twice",
+    },
+    {
+        "task_id": "hard_012",
+        "domain": "string algorithm",
+        "instructions": (
+            "The function finds the length of the longest substring without repeating characters. "
+            "Fix it AND explain what was wrong. "
+            "Explanation must mention: window, pointer, index, or update."
+        ),
+        "buggy_code": """\
+def length_of_longest_substring(s):
+    char_index = {}
+    left = 0
+    max_len = 0
+    for right, ch in enumerate(s):
+        if ch in char_index:
+            left = char_index[ch] + 1
+        char_index[ch] = right
+        max_len = max(max_len, right - left + 1)
+    return max_len
+""",
+        "fixed_code": """\
+def length_of_longest_substring(s):
+    char_index = {}
+    left = 0
+    max_len = 0
+    for right, ch in enumerate(s):
+        if ch in char_index and char_index[ch] >= left:
+            left = char_index[ch] + 1
+        char_index[ch] = right
+        max_len = max(max_len, right - left + 1)
+    return max_len
+""",
+        "explanation_keywords": ["left pointer", "stale", "char_index[ch] >= left", "window", "shrink", "old index"],
+        "test_cases": [
+            {"input": "abcabcbb", "expected": 3},
+            {"input": "bbbbb", "expected": 1},
+            {"input": "pwwkew", "expected": 3},
+            {"input": "abba", "expected": 2},
+        ],
+        "test_cases_description": "Longest substring without repeating: only update left if char is within current window",
+    },
+    {
+        "task_id": "hard_013",
+        "domain": "data processing",
+        "instructions": (
+            "The function merges overlapping intervals. "
+            "Fix it AND explain what was wrong. "
+            "Explanation must mention: sort, overlap, merge condition, or end index."
+        ),
+        "buggy_code": """\
+def merge_intervals(intervals):
+    if not intervals:
+        return []
+    intervals.sort(key=lambda x: x[0])
+    merged = [intervals[0]]
+    for start, end in intervals[1:]:
+        if start <= merged[-1][0]:
+            merged[-1][1] = max(merged[-1][1], end)
+        else:
+            merged.append([start, end])
+    return merged
+""",
+        "fixed_code": """\
+def merge_intervals(intervals):
+    if not intervals:
+        return []
+    intervals.sort(key=lambda x: x[0])
+    merged = [intervals[0]]
+    for start, end in intervals[1:]:
+        if start <= merged[-1][1]:
+            merged[-1][1] = max(merged[-1][1], end)
+        else:
+            merged.append([start, end])
+    return merged
+""",
+        "explanation_keywords": ["merged[-1][1]", "end", "start", "overlap", "last interval", "index 1 vs 0"],
+        "test_cases": [
+            {"input": [[1, 3], [2, 6], [8, 10]], "expected": [[1, 6], [8, 10]]},
+            {"input": [[1, 4], [4, 5]], "expected": [[1, 5]]},
+            {"input": [[1, 2]], "expected": [[1, 2]]},
+        ],
+        "test_cases_description": "Merge intervals: compare start with merged[-1][1] (end), not [0] (start)",
+    },
+    {
+        "task_id": "hard_014",
+        "domain": "math",
+        "instructions": (
+            "The function does integer square root (floor) without using sqrt(). "
+            "Fix it AND explain what was wrong. "
+            "Explanation must mention: binary search, convergence, mid, or boundary."
+        ),
+        "buggy_code": """\
+def integer_sqrt(n):
+    if n < 2:
+        return n
+    low, high = 1, n
+    while low <= high:
+        mid = (low + high) // 2
+        if mid * mid == n:
+            return mid
+        elif mid * mid < n:
+            low = mid + 1
+        else:
+            high = mid - 1
+    return low
+""",
+        "fixed_code": """\
+def integer_sqrt(n):
+    if n < 2:
+        return n
+    low, high = 1, n // 2
+    while low <= high:
+        mid = (low + high) // 2
+        if mid * mid == n:
+            return mid
+        elif mid * mid < n:
+            low = mid + 1
+        else:
+            high = mid - 1
+    return high
+""",
+        "explanation_keywords": ["high", "n // 2", "return high", "return low", "floor", "boundary", "last valid"],
+        "test_cases": [
+            {"input": 16, "expected": 4},
+            {"input": 8, "expected": 2},
+            {"input": 1, "expected": 1},
+        ],
+        "test_cases_description": "Integer square root: high=n//2, return high (floor result)",
+    },
+    {
+        "task_id": "hard_015",
+        "domain": "string algorithm",
+        "instructions": (
+            "The function implements the Z-algorithm to count pattern occurrences in text. "
+            "Fix it AND explain what was wrong. "
+            "Explanation must mention: concatenation, Z-array, separator, or index offset."
+        ),
+        "buggy_code": """\
+def count_occurrences(text, pattern):
+    concat = pattern + text
+    n = len(concat)
+    z = [0] * n
+    l, r = 0, 0
+    for i in range(1, n):
+        if i < r:
+            z[i] = min(r - i, z[i - l])
+        while i + z[i] < n and concat[z[i]] == concat[i + z[i]]:
+            z[i] += 1
+        if i + z[i] > r:
+            l, r = i, i + z[i]
+    return sum(1 for i in range(len(pattern), n) if z[i] == len(pattern))
+""",
+        "fixed_code": """\
+def count_occurrences(text, pattern):
+    concat = pattern + '#' + text
+    n = len(concat)
+    z = [0] * n
+    l, r = 0, 0
+    for i in range(1, n):
+        if i < r:
+            z[i] = min(r - i, z[i - l])
+        while i + z[i] < n and concat[z[i]] == concat[i + z[i]]:
+            z[i] += 1
+        if i + z[i] > r:
+            l, r = i, i + z[i]
+    p_len = len(pattern)
+    return sum(1 for i in range(p_len + 1, n) if z[i] == p_len)
+""",
+        "explanation_keywords": ["separator", "#", "without separator", "bleed", "p_len + 1", "offset", "boundary"],
+        "test_cases": [
+            {"input": ["aabxaabaab", "aab"], "expected": 3},
+            {"input": ["hello world", "world"], "expected": 1},
+            {"input": ["aaaa", "aa"], "expected": 3},
+        ],
+        "test_cases_description": "Z-algorithm with '#' separator and corrected offset p_len+1",
+    },
+]
+def get_random_hard_task() -> dict:
+    return random.choice(HARD_TASKS).copy()
+def get_task_by_id(task_id: str) -> dict:
+    for t in HARD_TASKS:
+        if t["task_id"] == task_id:
+            return t.copy()
+    return random.choice(HARD_TASKS).copy()
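
The hard tasks above pair `test_cases` with `explanation_keywords`, and the README states the hard-task reward as 0.7 × test score + 0.3 × explanation quality. The grader itself is not part of this file, so the following is only a minimal sketch of how such a reward could be computed, assuming explanation quality is a case-insensitive keyword match that saturates after two hits. `score_explanation`, `hard_reward`, and `min_hits` are hypothetical names chosen for illustration.

```python
def score_explanation(explanation: str, keywords: list, min_hits: int = 2) -> float:
    """Hypothetical scorer: keyword hits divided by `min_hits`, capped at 1.0."""
    text = explanation.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return min(hits / min_hits, 1.0)


def hard_reward(passed: int, total: int, explanation: str, keywords: list) -> float:
    """Combine test pass rate and explanation score with the 0.7 / 0.3 weights."""
    test_score = passed / total if total else 0.0
    return round(0.7 * test_score + 0.3 * score_explanation(explanation, keywords), 4)


# Keywords copied from hard_010; the explanation text is an invented example.
keywords = ["visited", "infinite loop", "revisit", "cycle", "set", "already visited"]
explanation = "The BFS never tracked visited nodes, so a cycle caused an infinite loop."
print(hard_reward(3, 3, explanation, keywords))  # prints 1.0
```

Substring matching is deliberately crude: a short keyword such as `set` also matches inside longer words like `reset`, so a real grader may prefer word-boundary matching.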

server/tasks/task_medium.py ADDED

	@@ -0,0 +1,507 @@

+# server/tasks/task_medium.py
+# 15 medium tasks: most functions have TWO bugs (logic + edge case); a few have one or three.
+# Agent must fix every bug to get full reward.
+import random
+MEDIUM_TASKS = [
+    {
+        "task_id": "medium_001",
+        "domain": "data processing",
+        "instructions": (
+            "The function should return the average of a list, returning 0.0 for an empty list. "
+            "It has TWO bugs. Fix both."
+        ),
+        "buggy_code": """\
+def safe_average(nums):
+    if len(nums) == 0:
+        return -1
+    total = 0
+    for n in nums:
+        total += n
+    return total / len(nums) + 1
+""",
+        "fixed_code": """\
+def safe_average(nums):
+    if len(nums) == 0:
+        return 0.0
+    total = 0
+    for n in nums:
+        total += n
+    return total / len(nums)
+""",
+        "test_cases": [
+            {"input": [2, 4, 6], "expected": 4.0},
+            {"input": [], "expected": 0.0},
+            {"input": [10], "expected": 10.0},
+        ],
+        "test_cases_description": "Average of list; empty list returns 0.0, not -1; no +1 added to result",
+    },
+    {
+        "task_id": "medium_002",
+        "domain": "string processing",
+        "instructions": (
+            "The function should count vowels in a string (case-insensitive). "
+            "It has TWO bugs. Fix both."
+        ),
+        "buggy_code": """\
+def count_vowels(s):
+    vowels = 'aeiou'
+    count = 0
+    for ch in s:
+        if ch in vowels:
+            count += 1
+    return count + 1
+""",
+        "fixed_code": """\
+def count_vowels(s):
+    vowels = 'aeiouAEIOU'
+    count = 0
+    for ch in s:
+        if ch in vowels:
+            count += 1
+    return count
+""",
+        "test_cases": [
+            {"input": "hello", "expected": 2},
+            {"input": "HELLO", "expected": 2},
+            {"input": "rhythm", "expected": 0},
+        ],
+        "test_cases_description": "Counts vowels case-insensitively without off-by-one",
+    },
+    {
+        "task_id": "medium_003",
+        "domain": "list operations",
+        "instructions": (
+            "The function should flatten a list of lists into one list. "
+            "It has ONE bug. Fix it."
+        ),
+        "buggy_code": """\
+def flatten(lists):
+    result = []
+    for sublist in lists:
+        for item in sublist:
+            result.append(item)
+    return result[1:]
+""",
+        "fixed_code": """\
+def flatten(lists):
+    result = []
+    for sublist in lists:
+        for item in sublist:
+            result.append(item)
+    return result
+""",
+        "test_cases": [
+            {"input": [[1, 2], [3, 4]], "expected": [1, 2, 3, 4]},
+            {"input": [[1]], "expected": [1]},
+            {"input": [[], [5, 6]], "expected": [5, 6]},
+        ],
+        "test_cases_description": "Flattens nested lists correctly without slicing off first element",
+    },
+    {
+        "task_id": "medium_004",
+        "domain": "math",
+        "instructions": (
+            "The function should return the GCD of two numbers. "
+            "It has TWO bugs. Fix both."
+        ),
+        "buggy_code": """\
+def gcd(a, b):
+    while b != 0:
+        a = b
+        b = a % b
+    return b
+""",
+        "fixed_code": """\
+def gcd(a, b):
+    while b != 0:
+        a, b = b, a % b
+    return a
+""",
+        "test_cases": [
+            {"input": [12, 8], "expected": 4},
+            {"input": [100, 75], "expected": 25},
+            {"input": [7, 3], "expected": 1},
+        ],
+        "test_cases_description": "Correct GCD using Euclidean algorithm",
+    },
+    {
+        "task_id": "medium_005",
+        "domain": "data processing",
+        "instructions": (
+            "The function should count frequency of each element in a list and return a dict. "
+            "It has TWO bugs. Fix both."
+        ),
+        "buggy_code": """\
+def count_frequency(lst):
+    freq = {}
+    for item in lst:
+        if item in freq:
+            freq[item] = 1
+        else:
+            freq[item] = freq[item] + 1
+    return freq
+""",
+        "fixed_code": """\
+def count_frequency(lst):
+    freq = {}
+    for item in lst:
+        if item in freq:
+            freq[item] += 1
+        else:
+            freq[item] = 1
+    return freq
+""",
+        "test_cases": [
+            {"input": [1, 2, 2, 3, 3, 3], "expected": {1: 1, 2: 2, 3: 3}},
+            {"input": ["a", "b", "a"], "expected": {"a": 2, "b": 1}},
+            {"input": [5], "expected": {5: 1}},
+        ],
+        "test_cases_description": "Correctly counts frequency; swapped if/else logic fixed",
+    },
+    {
+        "task_id": "medium_006",
+        "domain": "string processing",
+        "instructions": (
+            "The function should check if two strings are anagrams (case-insensitive). "
+            "It has TWO bugs. Fix both."
+        ),
+        "buggy_code": """\
+def are_anagrams(s1, s2):
+    if len(s1) != len(s2):
+        return True
+    return sorted(s1) == sorted(s2)
+""",
+        "fixed_code": """\
+def are_anagrams(s1, s2):
+    if len(s1) != len(s2):
+        return False
+    return sorted(s1.lower()) == sorted(s2.lower())
+""",
+        "test_cases": [
+            {"input": ["listen", "silent"], "expected": True},
+            {"input": ["hello", "world"], "expected": False},
+            {"input": ["Listen", "Silent"], "expected": True},
+        ],
+        "test_cases_description": "Anagram check with case-insensitivity and correct early-return logic",
+    },
+    {
+        "task_id": "medium_007",
+        "domain": "data processing",
+        "instructions": (
+            "The function should merge two sorted lists into one sorted list. "
+            "It has TWO bugs. Fix both."
+        ),
+        "buggy_code": """\
+def merge_sorted(a, b):
+    result = []
+    i, j = 0, 0
+    while i < len(a) and j < len(b):
+        if a[i] < b[j]:
+            result.append(b[j])
+            i += 1
+        else:
+            result.append(a[i])
+            j += 1
+    result.extend(a[i:])
+    result.extend(b[j:])
+    return result
+""",
+        "fixed_code": """\
+def merge_sorted(a, b):
+    result = []
+    i, j = 0, 0
+    while i < len(a) and j < len(b):
+        if a[i] < b[j]:
+            result.append(a[i])
+            i += 1
+        else:
+            result.append(b[j])
+            j += 1
+    result.extend(a[i:])
+    result.extend(b[j:])
+    return result
+""",
+        "test_cases": [
+            {"input": [[1, 3, 5], [2, 4, 6]], "expected": [1, 2, 3, 4, 5, 6]},
+            {"input": [[1, 2], [3, 4]], "expected": [1, 2, 3, 4]},
+            {"input": [[], [1, 2]], "expected": [1, 2]},
+        ],
+        "test_cases_description": "Merges two sorted lists correctly",
+    },
+    {
+        "task_id": "medium_008",
+        "domain": "API handler",
+        "instructions": (
+            "The function validates a user registration dict. "
+            "It should return True only if 'email' and 'password' are present and password >= 8 chars. "
+            "It has ONE bug. Fix it."
+        ),
+        "buggy_code": """\
+def validate_registration(data):
+    if 'email' not in data:
+        return False
+    if len(data.get('password', '')) > 8:
+        return False
+    return True
+""",
+        "fixed_code": """\
+def validate_registration(data):
+    if 'email' not in data:
+        return False
+    if len(data.get('password', '')) < 8:
+        return False
+    return True
+""",
+        "test_cases": [
+            {"input": {"email": "a@b.com", "password": "strongpass"}, "expected": True},
+            {"input": {"email": "a@b.com", "password": "short"}, "expected": False},
+            {"input": {"password": "strongpass"}, "expected": False},
+        ],
+        "test_cases_description": "Validates registration with correct password length check",
+    },
+    {
+        "task_id": "medium_009",
+        "domain": "math",
+        "instructions": (
+            "The function should return True if a number is a perfect square. "
+            "It has TWO bugs. Fix both."
+        ),
+        "buggy_code": """\
+def is_perfect_square(n):
+    if n < 0:
+        return True
+    root = int(n ** 0.5)
+    return root * root != n
+""",
+        "fixed_code": """\
+def is_perfect_square(n):
+    if n < 0:
+        return False
+    root = int(n ** 0.5)
+    return root * root == n
+""",
+        "test_cases": [
+            {"input": 16, "expected": True},
+            {"input": 15, "expected": False},
+            {"input": -4, "expected": False},
+        ],
+        "test_cases_description": "Correctly identifies perfect squares including negative number check",
+    },
+    {
+        "task_id": "medium_010",
+        "domain": "data processing",
+        "instructions": (
+            "The function should return the top-k most frequent elements in a list. "
+            "It has ONE bug. Fix it."
+        ),
+        "buggy_code": """\
+def top_k_frequent(nums, k):
+    freq = {}
+    for n in nums:
+        freq[n] = freq.get(n, 0) + 1
+    sorted_items = sorted(freq.items(), key=lambda x: x[1])
+    return [item[0] for item in sorted_items[:k]]
+""",
+        "fixed_code": """\
+def top_k_frequent(nums, k):
+    freq = {}
+    for n in nums:
+        freq[n] = freq.get(n, 0) + 1
+    sorted_items = sorted(freq.items(), key=lambda x: x[1], reverse=True)
+    return [item[0] for item in sorted_items[:k]]
+""",
+        "test_cases": [
+            {"input": [[1, 1, 1, 2, 2, 3], 2], "expected": [1, 2]},
+            {"input": [[4, 4, 5, 5, 5], 1], "expected": [5]},
+            {"input": [[1, 2, 3], 3], "expected": [1, 2, 3]},
+        ],
+        "test_cases_description": "Returns top-k frequent elements in descending frequency order",
+    },
+    {
+        "task_id": "medium_011",
+        "domain": "string processing",
+        "instructions": (
+            "The function should return the longest common prefix of a list of strings. "
+            "It has ONE bug. Fix it."
+        ),
+        "buggy_code": """\
+def longest_common_prefix(strs):
+    if not strs:
+        return ''
+    prefix = strs[1]
+    for s in strs:
+        while not s.startswith(prefix):
+            prefix = prefix[:-1]
+            if not prefix:
+                return ''
+    return prefix
+""",
+        "fixed_code": """\
+def longest_common_prefix(strs):
+    if not strs:
+        return ''
+    prefix = strs[0]
+    for s in strs:
+        while not s.startswith(prefix):
+            prefix = prefix[:-1]
+            if not prefix:
+                return ''
+    return prefix
+""",
+        "test_cases": [
+            {"input": ["flower", "flow", "flight"], "expected": "fl"},
+            {"input": ["dog", "racecar", "car"], "expected": ""},
+            {"input": ["interview", "interact", "interface"], "expected": "inter"},
+            {"input": ["alone"], "expected": "alone"},
+        ],
+        "test_cases_description": "Correct longest common prefix starting from index 0",
+    },
+    {
+        "task_id": "medium_012",
+        "domain": "list operations",
+        "instructions": (
+            "The function should rotate a list to the right by k positions. "
+            "It has TWO bugs. Fix both."
+        ),
+        "buggy_code": """\
+def rotate_right(lst, k):
+    if not lst:
+        return lst
+    k = k % len(lst)
+    return lst[k:] + lst[:k]
+""",
+        "fixed_code": """\
+def rotate_right(lst, k):
+    if not lst:
+        return lst
+    k = k % len(lst)
+    return lst[-k:] + lst[:-k]
+""",
+        "test_cases": [
+            {"input": [[1, 2, 3, 4, 5], 2], "expected": [4, 5, 1, 2, 3]},
+            {"input": [[1, 2, 3], 1], "expected": [3, 1, 2]},
+            {"input": [[], 3], "expected": []},
+        ],
+        "test_cases_description": "Rotates list to the right correctly",
+    },
+    {
+        "task_id": "medium_013",
+        "domain": "API handler",
+        "instructions": (
+            "The function parses a query string into a dict. "
+            "It has THREE bugs. Fix all of them."
+        ),
+        "buggy_code": """\
+def parse_query_string(query):
+    if not query:
+        return None
+    result = {}
+    for pair in query.split('&'):
+        if '=' in pair:
+            key, value = pair.split('=')
+            result[value] = key
+    return result
+""",
+        "fixed_code": """\
+def parse_query_string(query):
+    if not query:
+        return {}
+    result = {}
+    for pair in query.split('&'):
+        if '=' in pair:
+            key, value = pair.split('=', 1)
+            result[key] = value
+    return result
+""",
+        "test_cases": [
+            {"input": "name=Alice&age=30", "expected": {"name": "Alice", "age": "30"}},
+            {"input": "", "expected": {}},
+            {"input": "key=value=extra", "expected": {"key": "value=extra"}},
+        ],
+        "test_cases_description": "Parses query string; empty returns {}; key=value order correct; split on first = only",
+    },
+    {
+        "task_id": "medium_014",
+        "domain": "data processing",
+        "instructions": (
+            "The function should return all pairs of numbers in a list that sum to target. "
+            "It has ONE bug. Fix it."
+        ),
+        "buggy_code": """\
+def find_pairs(nums, target):
+    pairs = []
+    seen = set()
+    for n in nums:
+        complement = target + n
+        if complement in seen:
+            pairs.append((complement, n))
+        seen.add(n)
+    return pairs
+""",
+        "fixed_code": """\
+def find_pairs(nums, target):
+    pairs = []
+    seen = set()
+    for n in nums:
+        complement = target - n
+        if complement in seen:
+            pairs.append((complement, n))
+        seen.add(n)
+    return pairs
+""",
+        "test_cases": [
+            {"input": [[2, 7, 11, 15], 9], "expected": [(2, 7)]},
+            {"input": [[1, 2, 3, 4], 5], "expected": [(1, 4), (2, 3)]},
+            {"input": [[1, 2], 10], "expected": []},
+        ],
+        "test_cases_description": "Finds all pairs summing to target using complement = target - n",
+    },
+    {
+        "task_id": "medium_015",
+        "domain": "math",
+        "instructions": (
+            "The function should return the nth Fibonacci number (0-indexed). "
+            "It has TWO bugs. Fix both."
+        ),
+        "buggy_code": """\
+def fibonacci(n):
+    if n == 0:
+        return 1
+    if n == 1:
+        return 1
+    a, b = 0, 1
+    for _ in range(2, n):
+        a, b = b, a + b
+    return b
+""",
+        "fixed_code": """\
+def fibonacci(n):
+    if n == 0:
+        return 0
+    if n == 1:
+        return 1
+    a, b = 0, 1
+    for _ in range(2, n + 1):
+        a, b = b, a + b
+    return b
+""",
+        "test_cases": [
+            {"input": 0, "expected": 0},
+            {"input": 1, "expected": 1},
+            {"input": 6, "expected": 8},
+        ],
+        "test_cases_description": "Correct Fibonacci: fib(0)=0, fib(1)=1, fib(6)=8",
+    },
+]
+def get_random_medium_task() -> dict:
+    return random.choice(MEDIUM_TASKS).copy()
+def get_task_by_id(task_id: str) -> dict:
+    for t in MEDIUM_TASKS:
+        if t["task_id"] == task_id:
+            return t.copy()
+    return random.choice(MEDIUM_TASKS).copy()
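
Each task above ships a `buggy_code` string, a `fixed_code` string, and `test_cases`, so the reference fixes can be checked offline before deployment. The sketch below is one way to do that, not the environment's actual grader: `run_tests` is a hypothetical helper, and it assumes that functions with more than one parameter receive the `input` list unpacked as positional arguments while single-parameter functions receive it whole.

```python
import copy
import inspect


def run_tests(code: str, func_name: str, test_cases: list) -> tuple:
    """Exec `code`, call `func_name` on each test input, return (passed, total)."""
    namespace = {}
    exec(code, namespace)  # only for trusted reference code; sandbox agent submissions
    func = namespace[func_name]
    n_params = len(inspect.signature(func).parameters)
    passed = 0
    for case in test_cases:
        arg = copy.deepcopy(case["input"])  # protect stored inputs from mutation
        try:
            # Assumption: multi-parameter functions get the input list unpacked.
            result = func(*arg) if n_params > 1 else func(arg)
        except Exception:
            continue  # a crash counts as a failed test
        if result == case["expected"]:
            passed += 1
    return passed, len(test_cases)


# Code and test cases copied from medium_004.
BUGGY = """\
def gcd(a, b):
    while b != 0:
        a = b
        b = a % b
    return b
"""
FIXED = """\
def gcd(a, b):
    while b != 0:
        a, b = b, a % b
    return a
"""
CASES = [
    {"input": [12, 8], "expected": 4},
    {"input": [100, 75], "expected": 25},
    {"input": [7, 3], "expected": 1},
]
print(run_tests(BUGGY, "gcd", CASES))  # prints (0, 3)
print(run_tests(FIXED, "gcd", CASES))  # prints (3, 3)
```

Run over every task, this should report full marks for `fixed_code` and less than full marks for `buggy_code`; a task where the buggy version also passes everything needs a sharper test case. Inputs are deep-copied because a function may mutate its argument, and code that can loop forever should additionally be run under a timeout.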

validator/__pycache__/pre_submit_check.cpython-39.pyc ADDED

Binary file (5.48 kB).

validator/pre_submit_check.py ADDED

	@@ -0,0 +1,192 @@

+#!/usr/bin/env python3
+# validator/pre_submit_check.py
+# Run this BEFORE submitting to catch any disqualifying issues.
+#
+# Usage:
+#   python validator/pre_submit_check.py
+#   python validator/pre_submit_check.py --url https://your-space.hf.space
+import os
+import sys
+import json
+import argparse
+import requests
+PASS = "✅"
+FAIL = "❌"
+WARN = "⚠️"
+results = []
+def check(name: str, passed: bool, detail: str = ""):
+    status = PASS if passed else FAIL
+    results.append({"check": name, "passed": passed, "detail": detail})
+    print(f"  {status} {name}" + (f": {detail}" if detail else ""))
+    return passed
+def run_checks(base_url: str):
+    print(f"\n{'='*60}")
+    print(f"  Code Debug Environment — Pre-Submission Validator")
+    print(f"  Target: {base_url}")
+    print(f"{'='*60}\n")
+    all_passed = True
+    # ── 1. Health check ───────────────────────────────────────────
+    print("[ CHECK 1 ] Health endpoint")
+    try:
+        r = requests.get(f"{base_url}/health", timeout=10)
+        passed = r.status_code == 200 and r.json().get("status") == "ok"
+        check("GET /health returns 200 with status=ok", passed, f"HTTP {r.status_code}")
+        all_passed &= passed
+    except Exception as e:
+        check("GET /health", False, str(e))
+        all_passed = False
+    # ── 2. Reset responds ─────────────────────────────────────────
+    print("\n[ CHECK 2 ] POST /reset")
+    obs = None
+    for difficulty in ["easy", "medium", "hard"]:
+        try:
+            r = requests.post(f"{base_url}/reset", json={"difficulty": difficulty}, timeout=15)
+            data = r.json()
+            obs = data.get("observation", {})
+            has_fields = all(k in obs for k in ["task_id", "difficulty", "buggy_code", "instructions"])
+            passed = r.status_code == 200 and has_fields
+            check(f"reset(difficulty='{difficulty}') returns valid observation", passed,
+                  f"task_id={obs.get('task_id', 'MISSING')}")
+            all_passed &= passed
+        except Exception as e:
+            check(f"reset(difficulty='{difficulty}')", False, str(e))
+            all_passed = False
+    # ── 3. Step responds ──────────────────────────────────────────
+    print("\n[ CHECK 3 ] POST /step")
+    try:
+        # Reset first to get a fresh task
+        r = requests.post(f"{base_url}/reset", json={"difficulty": "easy"}, timeout=15)
+        buggy_code = r.json()["observation"]["buggy_code"]
+        # Submit the buggy code as-is (reward may be 0, that's fine)
+        r = requests.post(f"{base_url}/step", json={"fixed_code": buggy_code}, timeout=15)
+        data = r.json()
+        has_reward = "reward" in data and isinstance(data["reward"], (int, float))
+        has_done = "done" in data and isinstance(data["done"], bool)
+        reward_in_range = 0.0 <= data.get("reward", -1) <= 1.0
+        passed = r.status_code == 200 and has_reward and has_done and reward_in_range
+        check("step() returns reward in [0.0, 1.0] and done flag", passed,
+              f"reward={data.get('reward')}, done={data.get('done')}")
+        all_passed &= passed
+    except Exception as e:
+        check("POST /step", False, str(e))
+        all_passed = False
+    # ── 4. State responds ─────────────────────────────────────────
+    print("\n[ CHECK 4 ] GET /state")
+    try:
+        r = requests.get(f"{base_url}/state", timeout=10)
+        data = r.json()
+        has_fields = all(k in data for k in ["episode_id", "step_count", "difficulty"])
+        passed = r.status_code == 200 and has_fields
+        check("GET /state returns episode_id, step_count, difficulty", passed)
+        all_passed &= passed
+    except Exception as e:
+        check("GET /state", False, str(e))
+        all_passed = False
+    # ── 5. 3 difficulties all work ────────────────────────────────
+    print("\n[ CHECK 5 ] All 3 task difficulties functional")
+    for difficulty in ["easy", "medium", "hard"]:
+        try:
+            r = requests.post(f"{base_url}/reset", json={"difficulty": difficulty}, timeout=15)
+            obs = r.json()["observation"]
+            passed = obs.get("difficulty") == difficulty
+            check(f"difficulty='{difficulty}' task loads correctly",
+                  passed, f"got difficulty={obs.get('difficulty')}")
+            all_passed &= passed
+        except Exception as e:
+            check(f"difficulty='{difficulty}'", False, str(e))
+            all_passed = False
+    # ── 6. Reward range on perfect answer ─────────────────────────
+    print("\n[ CHECK 6 ] Reward range validation (correct fix)")
+    try:
+        from server.tasks.task_easy import EASY_TASKS
+        task = EASY_TASKS[0]
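+        # Request an easy episode. /reset is not told which task to serve, so the
+        # fix below is only guaranteed to match if the server returns EASY_TASKS[0].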
+        r = requests.post(f"{base_url}/reset", json={"difficulty": "easy"}, timeout=15)
+        # Submit the known correct fix
+        r = requests.post(f"{base_url}/step",
+                          json={"fixed_code": task["fixed_code"]}, timeout=15)
+        data = r.json()
+        reward = data.get("reward", -1)
+        passed = 0.0 <= reward <= 1.0
+        check("Submitting correct fix yields reward in [0.0, 1.0]", passed,
+              f"reward={reward}")
+        all_passed &= passed
+    except Exception as e:
+        check("Reward range check", False, str(e))
+        all_passed = False
+    # ── 7. Required project files exist ───────────────────────────
+    print("\n[ CHECK 7 ] Project structure")
+    required_files = [
+        "openenv.yaml",
+        "inference.py",
+        "models.py",
+        "server/app.py",
+        "server/environment.py",
+        "server/Dockerfile",
+        "server/requirements.txt",
+        "pyproject.toml",
+        "README.md",
+    ]
+    for fname in required_files:
+        exists = os.path.exists(fname)
+        check(f"File exists: {fname}", exists)
+        all_passed &= exists
+    # ── 8. inference.py has required log format ───────────────────
+    print("\n[ CHECK 8 ] inference.py log format")
+    try:
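+        # Read as UTF-8 explicitly so the check does not depend on the platform default encoding.
+        with open("inference.py", encoding="utf-8") as f: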
+            content = f.read()
+        has_start = '"type": "START"' in content
+        has_step = '"type": "STEP"' in content
+        has_end = '"type": "END"' in content
+        check("inference.py emits [START] logs", has_start)
+        check("inference.py emits [STEP] logs", has_step)
+        check("inference.py emits [END] logs", has_end)
+        all_passed &= has_start and has_step and has_end
+    except Exception as e:
+        check("inference.py log format", False, str(e))
+        all_passed = False
+    # ── Final summary ─────────────────────────────────────────────
+    total = len(results)
+    passed_count = sum(1 for r in results if r["passed"])
+    print(f"\n{'='*60}")
+    print(f"  Results: {passed_count}/{total} checks passed")
+    if all_passed:
+        print(f"  {PASS} ALL CHECKS PASSED — you are safe to submit!")
+    else:
+        failed = [r["check"] for r in results if not r["passed"]]
+        print(f"  {FAIL} FAILED CHECKS — fix these before submitting:")
+        for name in failed:
+            print(f"     • {name}")
+    print(f"{'='*60}\n")
+    return all_passed
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--url", default="http://localhost:7860",
+                        help="Base URL of the running environment")
+    args = parser.parse_args()
+    success = run_checks(args.url.rstrip("/"))
+    sys.exit(0 if success else 1)
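
The reward and done-flag validation in check 3 can also be factored into a pure function that is testable without a running server. The following is a minimal sketch, not part of the original script: `validate_step_payload` is a hypothetical helper name, and it assumes the decoded `/step` body is a JSON object with `reward` and `done` keys, as the checks above expect.

```python
from numbers import Real


def validate_step_payload(data):
    """Return (ok, reason) for a decoded /step response body.

    Hypothetical helper for illustration; the original script performs
    these checks inline.
    """
    if not isinstance(data, dict):
        return False, "payload is not a JSON object"
    reward = data.get("reward")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(reward, bool) or not isinstance(reward, Real):
        return False, f"reward is not numeric: {reward!r}"
    if not 0.0 <= reward <= 1.0:
        return False, f"reward out of range: {reward}"
    if not isinstance(data.get("done"), bool):
        return False, "done flag missing or not a bool"
    return True, "ok"


if __name__ == "__main__":
    print(validate_step_payload({"reward": 0.67, "done": False}))
    print(validate_step_payload({"reward": None, "done": True}))
```

Returning a reason string alongside the boolean lets the caller pass it straight through as the detail argument of `check(...)`.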