Spaces: Souravdanyal/code-debug-env


Souravdanyal committed on Apr 7

Commit d510c1d

1 Parent(s): 8ac3859

error fixing


Files changed (12)

.gitignore +1 -0
README.md +115 -86
inference.py +94 -77
models.py +21 -43
openenv.yaml +20 -21
py +0 -0
pyproject.toml +26 -0
server/environment.py +26 -40
server/graders/__pycache__/grader_easy.cpython-310.pyc +0 -0
server/graders/__pycache__/grader_hard.cpython-310.pyc +0 -0
server/graders/__pycache__/grader_medium.cpython-310.pyc +0 -0
server/graders/grader_hard.py +51 -34

.gitignore CHANGED

@@ -2,3 +2,4 @@ __pycache__/
 .vscode/
 __pycache__/
 .vscode/

 .vscode/
 __pycache__/
 .vscode/
+.env

README.md CHANGED

@@ -1,31 +1,19 @@
----
-title: Code Debug Environment
-emoji: 🐍
-colorFrom: blue
-colorTo: green
-sdk: docker
-sdk_version: "1.0"
-app_file: server/app.py
-pinned: false
----
 # Code Debug Environment
-An OpenEnv-compatible RL environment where an LLM agent diagnoses and fixes buggy Python code across three difficulty levels.
 ---
 ## Overview
-| Property          | Value                                         |
-| ----------------- | --------------------------------------------- |
-| Domain            | Real-world Python code debugging              |
-| Tasks             | 45 total (15 easy + 15 medium + 15 hard)      |
-| Difficulties      | easy → medium → hard                          |
-| Reward Range      | 0.0 – 1.0 (partial, proportional)             |
-| Max Steps/Episode | 3                                             |
-| API               | OpenEnv standard: `/reset`, `/step`, `/state` |
 ---
@@ -33,156 +21,196 @@ An OpenEnv-compatible RL environment where an LLM agent diagnoses and fixes bugg
 The agent receives a buggy Python function and must fix it. Tasks come from real-world domains: data processing, string algorithms, API validation, sorting, dynamic programming, and graph algorithms.
-* Easy: One bug (wrong operator, off-by-one, incorrect return). Reward proportional to test pass rate.
-* Medium: Two bugs (logic bug + edge case). Reward proportional to test pass rate.
-* Hard: One algorithmic bug + agent must explain what was wrong. Reward = 0.7 × test score + 0.3 × explanation quality.
 ---
 ## Action Space
 {
-"fixed_code": "string — the corrected Python function (required)",
-"explanation": "string — explanation of what was wrong (required for hard tasks)"
 }
-| Field       | Type | Required   | Description                                    |
-| ----------- | ---- | ---------- | ---------------------------------------------- |
-| fixed_code  | str  | Always     | Complete corrected Python function as a string |
-| explanation | str  | Hard tasks | Describe the bug and why your fix is correct   |
 ---
 ## Observation Space
-Returned by /reset and /step:
 {
-"task_id": "easy_003",
-"difficulty": "easy",
-"buggy_code": "def find_max(nums):\n    ...",
-"instructions": "The function has exactly one bug. Fix it.",
-"test_cases_description": "Finds max value in a list without IndexError",
-"reward": 0.67,
-"passed_tests": 2,
-"total_tests": 3,
-"feedback": "Test 1: ✅ ...\nTest 2: ✅ ...\nTest 3: ❌ ...",
-"done": false
 }
-| Field                  | Type       | Description                          |
-| ---------------------- | ---------- | ------------------------------------ |
-| task_id                | str        | Unique task identifier               |
-| difficulty             | str        | easy / medium / hard                 |
-| buggy_code             | str        | Buggy Python function to fix         |
-| instructions           | str        | Task instructions                    |
-| test_cases_description | str        | What the test cases check            |
-| reward                 | float/null | Score from last step (null on reset) |
-| passed_tests           | int/null   | Tests passed                         |
-| total_tests            | int        | Total test cases                     |
-| feedback               | str/null   | Detailed feedback                    |
-| done                   | bool       | Episode complete                     |
 ---
 ## Reward Function
-Easy & Medium
 reward = passed_tests / total_tests
-Hard
 reward = 0.7 × test_score + 0.3 × explanation_score
 ---
 ## Setup & Local Run
-Prerequisites
-* Python 3.10+
-* Docker
-* Hugging Face CLI
-Install
 git clone https://github.com/YOUR_USERNAME/code-debug-env
 cd code-debug-env
 pip install -e .
 git clone https://github.com/meta-pytorch/OpenEnv.git
 export PYTHONPATH=$PYTHONPATH:OpenEnv:OpenEnv/src:.
-Run locally
 uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
-Run with Docker
 docker build -f server/Dockerfile -t code-debug-env .
 docker run -p 7860:7860 code-debug-env
----
-## Test the API
 curl http://localhost:7860/health
-curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"difficulty": "easy"}'
-curl -X POST http://localhost:7860/step -H "Content-Type: application/json" -d '{"fixed_code": "def find_max(nums): return max(nums)"}'
 curl http://localhost:7860/state
 ---
 ## Run Baseline Inference
 export API_BASE_URL="https://api.openai.com/v1"
 export MODEL_NAME="gpt-4o-mini"
 export HF_TOKEN="your-api-key"
 python inference.py --url http://localhost:7860
 python inference.py --url http://localhost:7860 --difficulty hard
 ---
 ## Pre-Submission Validation
 python validator/pre_submit_check.py --url http://localhost:7860
 python validator/pre_submit_check.py --url https://YOUR_SPACE.hf.space
 ---
 ## Deploy to Hugging Face Spaces
 huggingface-cli login
 huggingface-cli repo create code-debug-env --type space --space_sdk docker
 cd code-debug-env
 git init
 git remote add origin https://huggingface.co/spaces/YOUR_USERNAME/code-debug-env
 git add .
 git commit -m "Initial commit"
 git push origin main
 ---
 ## Project Structure
 code-debug-env/
-├── openenv.yaml
-├── inference.py
-├── pyproject.toml
 ├── README.md
-├── models.py
-├── client.py
-├── **init**.py
 ├── server/
-│   ├── app.py
-│   ├── environment.py
 │   ├── tasks/
-│   │   ├── task_easy.py
-│   │   ├── task_medium.py
-│   │   └── task_hard.py
 │   ├── graders/
 │   │   ├── grader_easy.py
 │   │   ├── grader_medium.py
@@ -190,4 +218,5 @@ code-debug-env/
 │   ├── requirements.txt
 │   └── Dockerfile
 └── validator/
-└── pre_submit_check.py

 # Code Debug Environment
+An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compatible RL environment where an LLM agent diagnoses and fixes buggy Python code across three difficulty levels.
 ---
 ## Overview
+| Property | Value |
+|---|---|
+| Domain | Real-world Python code debugging |
+| Tasks | 45 total (15 easy + 15 medium + 15 hard) |
+| Difficulties | easy → medium → hard |
+| Reward Range | 0.0 – 1.0 (partial, proportional) |
+| Max Steps/Episode | 5 |
+| API | OpenEnv standard: `/reset`, `/step`, `/state` |
 ---
 The agent receives a buggy Python function and must fix it. Tasks come from real-world domains: data processing, string algorithms, API validation, sorting, dynamic programming, and graph algorithms.
+- **Easy**: One bug (wrong operator, off-by-one, incorrect return). Reward proportional to test pass rate.
+- **Medium**: Two bugs (logic bug + edge case). Reward proportional to test pass rate.
+- **Hard**: One algorithmic bug + agent must explain what was wrong. Reward = 0.7 × test score + 0.3 × explanation quality.
 ---
 ## Action Space
+```json
 {
+  "fixed_code": "string — the corrected Python function (required)",
+  "explanation": "string — explanation of what was wrong (required for hard tasks)"
 }
+```
+| Field | Type | Required | Description |
+|---|---|---|---|
+| `fixed_code` | `str` | Always | Complete corrected Python function as a string |
+| `explanation` | `str` | Hard tasks | Describe the bug and why your fix is correct |
 ---
 ## Observation Space
+Returned by `/reset` and `/step`:
+```json
 {
+  "task_id": "easy_003",
+  "difficulty": "easy",
+  "buggy_code": "def find_max(nums):\n    ...",
+  "instructions": "The function has exactly one bug. Fix it.",
+  "test_cases_description": "Finds max value in a list without IndexError",
+  "reward": 0.67,
+  "passed_tests": 2,
+  "total_tests": 3,
+  "feedback": "Test 1: ✅ ...\nTest 2: ✅ ...\nTest 3: ❌ ...",
+  "done": false
 }
+```
+| Field | Type | Description |
+|---|---|---|
+| `task_id` | `str` | Unique task identifier |
+| `difficulty` | `str` | `easy` / `medium` / `hard` |
+| `buggy_code` | `str` | Buggy Python function to fix |
+| `instructions` | `str` | Task instructions |
+| `test_cases_description` | `str` | What the test cases check |
+| `reward` | `float\|null` | Score from last step (null on reset) |
+| `passed_tests` | `int\|null` | Tests passed (null on reset) |
+| `total_tests` | `int` | Total number of test cases |
+| `feedback` | `str\|null` | Detailed per-test feedback |
+| `done` | `bool` | True when episode is complete |
 ---
 ## Reward Function
+### Easy & Medium
+```
 reward = passed_tests / total_tests
+```
+- 3/3 tests → 1.0
+- 2/3 tests → 0.67
+- 1/3 tests → 0.33
+- 0/3 tests → 0.0
+### Hard
+```
 reward = 0.7 × test_score + 0.3 × explanation_score
+```
+Explanation is scored by matching key algorithmic concepts. Partial credit is given.
 ---
 ## Setup & Local Run
+### Prerequisites
+- Python 3.10+
+- Docker
+- Hugging Face CLI
+### Install
+```bash
 git clone https://github.com/YOUR_USERNAME/code-debug-env
 cd code-debug-env
 pip install -e .
+# Also clone OpenEnv for PYTHONPATH
 git clone https://github.com/meta-pytorch/OpenEnv.git
 export PYTHONPATH=$PYTHONPATH:OpenEnv:OpenEnv/src:.
+```
+### Run locally
+```bash
 uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
+```
+### Run with Docker
+```bash
 docker build -f server/Dockerfile -t code-debug-env .
 docker run -p 7860:7860 code-debug-env
+```
+### Test the API
+```bash
+# Health check
 curl http://localhost:7860/health
+# Reset (easy task)
+curl -X POST http://localhost:7860/reset \
+  -H "Content-Type: application/json" \
+  -d '{"difficulty": "easy"}'
+# Submit a fix
+curl -X POST http://localhost:7860/step \
+  -H "Content-Type: application/json" \
+  -d '{"fixed_code": "def find_max(nums):\n    return max(nums)"}'
+# Check state
 curl http://localhost:7860/state
+```
 ---
 ## Run Baseline Inference
+```bash
 export API_BASE_URL="https://api.openai.com/v1"
 export MODEL_NAME="gpt-4o-mini"
 export HF_TOKEN="your-api-key"
+# Run all 3 difficulties
 python inference.py --url http://localhost:7860
+# Run specific difficulty
 python inference.py --url http://localhost:7860 --difficulty hard
+```
 ---
 ## Pre-Submission Validation
+Run before submitting to catch any disqualifying issues:
+```bash
+# Start the environment first, then:
 python validator/pre_submit_check.py --url http://localhost:7860
+# Or against your HF Space:
 python validator/pre_submit_check.py --url https://YOUR_SPACE.hf.space
+```
 ---
 ## Deploy to Hugging Face Spaces
+```bash
+# Login
 huggingface-cli login
+# Create space and push
 huggingface-cli repo create code-debug-env --type space --space_sdk docker
 cd code-debug-env
 git init
 git remote add origin https://huggingface.co/spaces/YOUR_USERNAME/code-debug-env
 git add .
 git commit -m "Initial commit"
 git push origin main
+```
 ---
 ## Project Structure
+```
 code-debug-env/
+├── openenv.yaml          ← OpenEnv manifest
+├── inference.py          ← Baseline agent (root, required)
+├── pyproject.toml        ← Dependencies
 ├── README.md
+├── models.py             ← Pydantic Action/Observation/State
+├── client.py             ← EnvClient for training loops
+├── __init__.py
 ├── server/
+│   ├── app.py            ← FastAPI: /reset /step /state /health
+│   ├── environment.py    ← Core episode logic
 │   ├── tasks/
+│   │   ├── task_easy.py  ← 15 single-bug tasks
+│   │   ├── task_medium.py← 15 two-bug tasks
+│   │   └── task_hard.py  ← 15 algorithmic tasks
 │   ├── graders/
 │   │   ├── grader_easy.py
 │   │   ├── grader_medium.py
 │   ├── requirements.txt
 │   └── Dockerfile
 └── validator/
+    └── pre_submit_check.py
+```
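
Note on the README's Reward Function section: the two formulas can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The function names and the substring-based concept matching are hypothetical, since the graders' actual code is not shown in this diff; only the formulas and the 0.7 / 0.3 weights come from the README.

```python
def pass_rate_reward(passed_tests: int, total_tests: int) -> float:
    """Easy/medium reward from the README: passed_tests / total_tests."""
    if total_tests <= 0:
        return 0.0
    return passed_tests / total_tests


def explanation_score(explanation: str, key_concepts: list[str]) -> float:
    """Hypothetical scorer: fraction of key concepts mentioned (case-insensitive).

    The README only says concepts are matched with partial credit;
    plain substring matching is an assumption made for this sketch.
    """
    if not explanation or not key_concepts:
        return 0.0
    text = explanation.lower()
    hits = sum(1 for concept in key_concepts if concept.lower() in text)
    return hits / len(key_concepts)


def hard_reward(passed_tests: int, total_tests: int,
                explanation: str, key_concepts: list[str]) -> float:
    """Hard reward from the README: 0.7 x test score + 0.3 x explanation score."""
    test_score = pass_rate_reward(passed_tests, total_tests)
    return 0.7 * test_score + 0.3 * explanation_score(explanation, key_concepts)


if __name__ == "__main__":
    print(round(pass_rate_reward(2, 3), 2))  # 0.67, matching the README table
    print(round(hard_reward(3, 3, "BFS never tracked a visited set",
                            ["visited", "queue"]), 2))
```

Because the explanation term is capped at 0.3, a hard-task fix that passes every test but omits the explanation would score at most 0.7 under this sketch.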

inference.py CHANGED

@@ -8,18 +8,18 @@ Usage:
   python inference.py --url https://Souravdanyal-code-debug-env.hf.space
   python inference.py --difficulty easy
-STDOUT FORMAT (required by evaluator):
   [START] task=<id> env=<benchmark> model=<model>
   [STEP] step=<n> action=<str> reward=<0.00> done=<true|false> error=<msg|null>
-  [END] success=<true|false> steps=<n> rewards=<r1,r2,...>
 """
-import os, sys, json, time, argparse, requests
 from openai import OpenAI
 from typing import List, Optional
 # ── Config ────────────────────────────────────────────────────────────────────
-API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
 MODEL_NAME   = os.environ.get("MODEL_NAME",   "llama-3.1-8b-instant")
 HF_TOKEN     = os.environ.get("HF_TOKEN",     "")
 ENV_URL      = os.environ.get("ENV_URL",      "http://localhost:7860")
@@ -28,7 +28,7 @@ MAX_STEPS    = 5
 client = OpenAI(api_key=HF_TOKEN or "dummy", base_url=API_BASE_URL)
-# ── Logging ───────────────────────────────────────────────────────────────────
 def log_start(task_id, env, model):
     print(f"[START] task={task_id} env={env} model={model}", flush=True)
@@ -53,101 +53,116 @@ def env_step(url, fixed_code, explanation=None):
     return r.json()
 # ── LLM ──────────────────────────────────────────────────────────────────────
-SYSTEM_PROMPT = """You are an expert Python debugging agent. Fix bugs in Python functions.
-RESPONSE FORMAT — strictly JSON only, no markdown:
-{
-  "fixed_code": "<complete corrected Python function including imports>",
-  "explanation": "<for hard tasks: explain the bug, root cause, and fix>"
-}
 RULES:
-- Return COMPLETE function with all imports (e.g. from collections import deque)
-- fixed_code must be valid Python
-- For hard tasks explanation MUST mention the algorithmic concept listed in instructions
-COMMON BUG PATTERNS:
-- List rotation RIGHT by k: correct is lst[-k:] + lst[:-k]  NOT lst[k:] + lst[:k]
-- List rotation LEFT by k: correct is lst[k:] + lst[:k]
-- Graph/BFS missing visited set → infinite loop → add visited=set()
-- 0/1 Knapsack: must iterate BACKWARD: range(capacity, weight-1, -1) not forward
-- Binary search wrong boundary: return high not low, or high=n//2
-- Off-by-one: lst[2] should be lst[1] for second element
-- Wrong operator: complement = target - n  NOT target + n
-FOR HARD TASKS — explanation MUST include words from the instructions hint.
-Example: if instructions say "mention: iteration order" then write about iteration order.
-Example: if instructions say "mention: visited" then write about visited set.
 """
 def call_llm(buggy_code, instructions, difficulty, feedback=None, attempt=1, prev_code=None):
     content = f"Difficulty: {difficulty}\nInstructions: {instructions}\n\nBuggy code:\n```python\n{buggy_code}\n```\n"
     if feedback and attempt > 1:
         content += f"\nPREVIOUS FIX FAILED. Feedback:\n{feedback}\n\nYour previous code:\n```python\n{prev_code or ''}\n```\n"
-        content += "IMPORTANT: Your fix did not work. Look at the Expected vs Got values carefully.\n"
-        content += "- If Got is a LEFT rotation but Expected is RIGHT: use lst[-k:] + lst[:-k]\n"
-        content += "- If you see TimeoutError: add visited=set() for graph traversal\n"
-        content += "- Try a COMPLETELY DIFFERENT approach.\n"
     if difficulty == "hard":
-        # Extract keyword hints from instructions (e.g. "mention: visited, queue")
-        import re
-        hint_match = re.search(r'[Mm]ention[:\s]+([^.]+)', instructions)
         if hint_match:
             hints = hint_match.group(1).strip()
-            content += f"\nFor your explanation, you MUST mention these concepts: {hints}\n"
-        content += "Include a detailed explanation field — it counts for 30% of your reward.\n"
     try:
         resp = client.chat.completions.create(
             model=MODEL_NAME,
-            messages=[{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": content}],
             max_tokens=1500,
             temperature=0.1 if attempt == 1 else 0.4,
         )
         raw = resp.choices[0].message.content.strip()
-        # Remove markdown fences
-        if "```json" in raw:
-            raw = raw.split("```json")[1].split("```")[0].strip()
-        elif "```" in raw:
-            raw = raw.split("```")[1].split("```")[0].strip()
-            if raw.startswith("json"):
-                raw = raw[4:].strip()
-        # Find JSON object boundaries
-        start = raw.find("{")
-        end = raw.rfind("}") + 1
-        if start >= 0 and end > start:
-            raw = raw[start:end]
-        # Try direct parse first
-        try:
-            parsed = json.loads(raw)
-        except json.JSONDecodeError:
-            # Fix control characters by replacing literal newlines inside strings
-            import re
-            # Replace actual newlines within JSON string values with \n escape
-            raw = re.sub(r'(?<!\\)\n', r'\\n', raw)
-            raw = re.sub(r'(?<!\\)\t', r'\\t', raw)
-            raw = re.sub(r'(?<!\\)\r', r'\\r', raw)
-            try:
-                parsed = json.loads(raw)
-            except json.JSONDecodeError:
-                # Last resort: extract fixed_code manually using regex
-                code_match = re.search(r'"fixed_code"\s*:\s*"(.*?)"(?=\s*[,}])', raw, re.DOTALL)
-                exp_match  = re.search(r'"explanation"\s*:\s*"(.*?)"(?=\s*[,}])', raw, re.DOTALL)
-                if code_match:
-                    code = code_match.group(1).encode().decode('unicode_escape') if '\\n' in code_match.group(1) else code_match.group(1)
-                    return {"fixed_code": code, "explanation": exp_match.group(1) if exp_match else None}
-                raise
-        return {"fixed_code": parsed.get("fixed_code", ""), "explanation": parsed.get("explanation")}
     except Exception as e:
         print(f"# LLM error: {e}", file=sys.stderr)
         return {"fixed_code": buggy_code, "explanation": None}
 # ── Episode ───────────────────────────────────────────────────────────────────
 def run_episode(env_url, difficulty):
     data = env_reset(env_url, difficulty)
@@ -181,7 +196,8 @@ def run_episode(env_url, difficulty):
         reward = result.get("reward", 0.0)
         done   = result.get("done", False)
-        last_feedback = result.get("observation", {}).get("feedback", "")
         log_step(attempt, f"fix_{difficulty}_attempt{attempt}", reward, done, None)
         rewards.append(reward)
@@ -194,9 +210,10 @@ def run_episode(env_url, difficulty):
     log_end(success, steps_taken, rewards)
     return success, steps_taken, rewards
 # ── Main ──────────────────────────────────────────────────────────────────────
 def main():
-    parser = argparse.ArgumentParser()
     parser.add_argument("--url", default=ENV_URL)
     parser.add_argument("--difficulty", default=None, choices=["easy","medium","hard","all"])
     args = parser.parse_args()
@@ -222,4 +239,4 @@ def main():
     print(f"# SUMMARY: {sum(successes)}/{len(diffs)} tasks solved | avg_reward={avg}", flush=True)
 if __name__ == "__main__":
-    main()

   python inference.py --url https://Souravdanyal-code-debug-env.hf.space
   python inference.py --difficulty easy
+STDOUT FORMAT (strictly required by evaluator):
   [START] task=<id> env=<benchmark> model=<model>
   [STEP] step=<n> action=<str> reward=<0.00> done=<true|false> error=<msg|null>
+  [END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
 """
+import os, sys, json, time, argparse, requests, re
 from openai import OpenAI
 from typing import List, Optional
 # ── Config ────────────────────────────────────────────────────────────────────
+API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.groq.com/openai/v1")
 MODEL_NAME   = os.environ.get("MODEL_NAME",   "llama-3.1-8b-instant")
 HF_TOKEN     = os.environ.get("HF_TOKEN",     "")
 ENV_URL      = os.environ.get("ENV_URL",      "http://localhost:7860")
 client = OpenAI(api_key=HF_TOKEN or "dummy", base_url=API_BASE_URL)
+# ── Logging — STRICT FORMAT ───────────────────────────────────────────────────
 def log_start(task_id, env, model):
     print(f"[START] task={task_id} env={env} model={model}", flush=True)
     return r.json()
 # ── LLM ──────────────────────────────────────────────────────────────────────
+SYSTEM_PROMPT = """You are an expert Python debugging agent.
+RESPONSE FORMAT — JSON only, no markdown fences, no extra text:
+{"fixed_code": "<complete Python function with all imports>", "explanation": "<for hard tasks only>"}
 RULES:
+- Return the COMPLETE function including all imports (e.g. from collections import deque)
+- fixed_code must be valid, executable Python
+- For hard tasks: explanation MUST mention the algorithmic concepts from the instructions
+COMMON BUG PATTERNS — memorize these:
+- RIGHT rotate list by k: lst[-k:] + lst[:-k]   (NOT lst[k:] + lst[:k] which is LEFT rotate)
+- LEFT rotate list by k: lst[k:] + lst[:k]
+- BFS/graph missing visited: add visited=set([start]) before queue, check before appending
+- 0/1 Knapsack: iterate BACKWARD range(capacity, weight-1, -1) NOT forward
+- Binary search boundary: often return high not low, or initial high=n//2 not n
+- Wrong operator: target-n not target+n for complement
+- Off-by-one: lst[1] for second element not lst[2]
+IMPORTANT: If feedback shows TimeoutError → you have infinite loop → add visited set.
+IMPORTANT: If Expected shows right-rotated list → use lst[-k:] + lst[:-k].
 """
+def _parse_llm_response(raw: str, buggy_code: str) -> dict:
+    """Robustly parse LLM response handling control chars and malformed JSON."""
+    # Remove markdown fences
+    if "```json" in raw:
+        raw = raw.split("```json")[1].split("```")[0].strip()
+    elif "```" in raw:
+        parts = raw.split("```")
+        if len(parts) >= 2:
+            raw = parts[1].strip()
+            if raw.startswith("json"):
+                raw = raw[4:].strip()
+    # Find JSON boundaries
+    start = raw.find("{")
+    end = raw.rfind("}") + 1
+    if start >= 0 and end > start:
+        raw = raw[start:end]
+    # Try direct parse
+    try:
+        parsed = json.loads(raw)
+        return {"fixed_code": parsed.get("fixed_code", ""), "explanation": parsed.get("explanation")}
+    except json.JSONDecodeError:
+        pass
+    # Fix control characters (literal newlines inside JSON strings)
+    try:
+        fixed = re.sub(r'(?<!\\)\n', r'\\n', raw)
+        fixed = re.sub(r'(?<!\\)\t', r'\\t', fixed)
+        fixed = re.sub(r'(?<!\\)\r', r'\\r', fixed)
+        parsed = json.loads(fixed)
+        # Unescape the fixed_code back to real newlines
+        code = parsed.get("fixed_code", "")
+        if "\\n" in code:
+            code = code.replace("\\n", "\n").replace("\\t", "\t")
+        return {"fixed_code": code, "explanation": parsed.get("explanation")}
+    except json.JSONDecodeError:
+        pass
+    # Last resort: regex extraction
+    code_match = re.search(r'"fixed_code"\s*:\s*"((?:[^"\\]|\\.)*)"\s*[,}]', raw, re.DOTALL)
+    exp_match  = re.search(r'"explanation"\s*:\s*"((?:[^"\\]|\\.)*)"\s*[,}]', raw, re.DOTALL)
+    if code_match:
+        code = code_match.group(1).replace("\\n", "\n").replace("\\t", "\t")
+        exp = exp_match.group(1).replace("\\n", "\n") if exp_match else None
+        return {"fixed_code": code, "explanation": exp}
+    # Complete fallback
+    return {"fixed_code": buggy_code, "explanation": None}
 def call_llm(buggy_code, instructions, difficulty, feedback=None, attempt=1, prev_code=None):
     content = f"Difficulty: {difficulty}\nInstructions: {instructions}\n\nBuggy code:\n```python\n{buggy_code}\n```\n"
     if feedback and attempt > 1:
         content += f"\nPREVIOUS FIX FAILED. Feedback:\n{feedback}\n\nYour previous code:\n```python\n{prev_code or ''}\n```\n"
+        content += "ANALYZE THE FEEDBACK CAREFULLY:\n"
+        content += "- Look at Input/Expected/Got for each failing test\n"
+        content += "- If Got shows wrong rotation direction: use lst[-k:] + lst[:-k] for RIGHT rotate\n"
+        content += "- If TimeoutError: add visited=set([start]) before queue in graph code\n"
+        content += "- Try a COMPLETELY DIFFERENT fix.\n"
     if difficulty == "hard":
+        hint_match = re.search(r'[Mm]ention[:\s]+([^.]+?)(?:\.|$)', instructions)
         if hint_match:
             hints = hint_match.group(1).strip()
+            content += f"\nFor explanation, you MUST mention these concepts: {hints}\n"
+        content += "Explanation counts for 30% of reward — make it detailed and specific.\n"
     try:
         resp = client.chat.completions.create(
             model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": content}
+            ],
             max_tokens=1500,
             temperature=0.1 if attempt == 1 else 0.4,
         )
         raw = resp.choices[0].message.content.strip()
+        return _parse_llm_response(raw, buggy_code)
     except Exception as e:
         print(f"# LLM error: {e}", file=sys.stderr)
         return {"fixed_code": buggy_code, "explanation": None}
 # ── Episode ───────────────────────────────────────────────────────────────────
 def run_episode(env_url, difficulty):
     data = env_reset(env_url, difficulty)
         reward = result.get("reward", 0.0)
         done   = result.get("done", False)
+        obs_r  = result.get("observation", {})
+        last_feedback = obs_r.get("feedback", "")
         log_step(attempt, f"fix_{difficulty}_attempt{attempt}", reward, done, None)
         rewards.append(reward)
     log_end(success, steps_taken, rewards)
     return success, steps_taken, rewards
 # ── Main ──────────────────────────────────────────────────────────────────────
 def main():
+    parser = argparse.ArgumentParser(description="Code Debug Environment Baseline Agent")
     parser.add_argument("--url", default=ENV_URL)
     parser.add_argument("--difficulty", default=None, choices=["easy","medium","hard","all"])
     args = parser.parse_args()
     print(f"# SUMMARY: {sum(successes)}/{len(diffs)} tasks solved | avg_reward={avg}", flush=True)
 if __name__ == "__main__":
+    main()
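
Note on the response parsing in inference.py: the control-character fallback exists because models often emit literal newlines inside JSON string values. A possible alternative, sketched here and not part of the commit, is the standard library's `strict=False` option, which accepts control characters inside strings without any regex rewriting. The helper name `parse_lenient` and the sample reply are hypothetical.

```python
import json


def parse_lenient(raw: str) -> dict:
    """Parse a model reply expected to contain one JSON object.

    Tolerates markdown fences around the object and raw control
    characters (newlines, tabs) inside string values.
    """
    # Keep only the outermost {...} span, which also drops any fences.
    start = raw.find("{")
    end = raw.rfind("}") + 1
    if start < 0 or end <= start:
        raise ValueError("no JSON object found in reply")
    # strict=False allows literal control characters inside JSON strings.
    return json.loads(raw[start:end], strict=False)


if __name__ == "__main__":
    # The fixed_code value contains a real newline, which strict JSON rejects.
    reply = '```json\n{"fixed_code": "def f(x):\n    return x + 1", "explanation": null}\n```'
    parsed = parse_lenient(reply)
    print(parsed["fixed_code"])
```

This handles only the control-character case; replies with unescaped quotes or trailing text containing braces would still need a fallback such as the regex extraction in the diff.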

models.py CHANGED

@@ -1,8 +1,6 @@
-# models.py
-# Typed Pydantic models for Action, Observation, and State
-# These are the contracts between the agent and the environment.
-from typing import Optional, List
 from pydantic import Field
 from openenv.core.env_server.types import Action, Observation, State
@@ -12,62 +10,42 @@ class DebugAction(Action):
     fixed_code: str = Field(
         ...,
-        description="The corrected Python function as a string. Must be valid Python."
     )
     explanation: Optional[str] = Field(
         default=None,
-        description=(
-            "Required for 'hard' difficulty tasks. Explain what was wrong "
-            "and why your fix is correct. Affects reward on hard tasks."
-        )
     )
-class TestResult(Action):
-    """Sub-model: result of a single test case."""
-    test_id: int
-    passed: bool
-    expected: str
-    got: str
 class DebugObservation(Observation):
-    """Observation returned after each step()."""
-    # Task info
-    task_id: str = Field(..., description="Unique ID of the current task instance")
     difficulty: str = Field(..., description="Task difficulty: easy | medium | hard")
     buggy_code: str = Field(..., description="The buggy Python code the agent must fix")
     instructions: str = Field(..., description="Natural language instructions for the task")
-    test_cases_description: str = Field(
-        ..., description="Description of what the test cases check"
-    )
-    # After step() — feedback
-    reward: Optional[float] = Field(
-        default=None, description="Score from 0.0 to 1.0 for this step"
-    )
-    passed_tests: Optional[int] = Field(
-        default=None, description="Number of test cases passed"
-    )
-    total_tests: Optional[int] = Field(
-        default=None, description="Total number of test cases"
-    )
-    feedback: Optional[str] = Field(
-        default=None,
-        description="Detailed feedback: which tests failed and why"
-    )
-    done: bool = Field(default=False, description="True when episode is complete")
 class DebugState(State):
-    """Internal environment state, returned by GET /state."""
-    episode_id: str = ""          # ← required by validator: GET /state must return episode_id
-    task_id: str
-    difficulty: str
     step_count: int = 0
-    max_steps: int = 3
     current_reward: float = 0.0
     best_reward: float = 0.0
     done: bool = False

+# models.py — Typed Pydantic models for Action, Observation, and State
+from typing import Optional
 from pydantic import Field
 from openenv.core.env_server.types import Action, Observation, State
     fixed_code: str = Field(
         ...,
+        description="Complete corrected Python function. Must be valid Python including imports."
     )
     explanation: Optional[str] = Field(
         default=None,
+        description="Required for hard tasks. Explain what was wrong and why your fix is correct."
     )
 class DebugObservation(Observation):
+    """Observation returned after reset() and step()."""
+    task_id: str = Field(..., description="Unique task identifier e.g. easy_003")
     difficulty: str = Field(..., description="Task difficulty: easy | medium | hard")
     buggy_code: str = Field(..., description="The buggy Python code the agent must fix")
     instructions: str = Field(..., description="Natural language instructions for the task")
+    test_cases_description: str = Field(..., description="What the test cases check")
+    # Step feedback fields
+    reward: Optional[float] = Field(default=None, description="Immediate reward 0.0-1.0 (null on reset)")
+    cumulative_reward: float = Field(default=0.0, description="Total reward accumulated this episode")
+    best_reward: float = Field(default=0.0, description="Best reward achieved this episode")
+    passed_tests: Optional[int] = Field(default=None, description="Tests passed (null on reset)")
+    total_tests: Optional[int] = Field(default=None, description="Total test cases (always 3)")
+    feedback: Optional[str] = Field(default=None, description="Per-test feedback: Input, Expected, Got")
+    done: bool = Field(default=False, description="True when episode complete")
 class DebugState(State):
+    """Internal environment state returned by GET /state."""
+    episode_id: str = ""
+    task_id: str = "none"
+    difficulty: str = "easy"
     step_count: int = 0
+    max_steps: int = 5
     current_reward: float = 0.0
+    cumulative_reward: float = 0.0
     best_reward: float = 0.0
     done: bool = False
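
Note on the new `DebugState` fields: `cumulative_reward` and `best_reward` suggest a per-step update along the following lines. This is a guess at the bookkeeping, not the code in server/environment.py, which is not fully shown in this diff. The dataclass stands in for the Pydantic model so the sketch needs only the standard library, `record_step` is a hypothetical name, and the termination rule follows the openenv.yaml description of `done` (perfect score or max steps reached).

```python
from dataclasses import dataclass


@dataclass
class EpisodeState:
    """Standard-library stand-in for DebugState (same field names and defaults)."""
    step_count: int = 0
    max_steps: int = 5
    current_reward: float = 0.0
    cumulative_reward: float = 0.0
    best_reward: float = 0.0
    done: bool = False


def record_step(state: EpisodeState, reward: float) -> EpisodeState:
    """Hypothetical update applied after each graded submission."""
    state.step_count += 1
    state.current_reward = reward
    state.cumulative_reward += reward
    state.best_reward = max(state.best_reward, reward)
    # Assumed termination rule: perfect score or step budget exhausted.
    state.done = reward >= 1.0 or state.step_count >= state.max_steps
    return state


if __name__ == "__main__":
    state = EpisodeState()
    record_step(state, 0.5)
    record_step(state, 1.0)
    print(state)
```

Tracking `best_reward` separately from `current_reward` lets a client report the strongest attempt even when a later retry regresses.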

openenv.yaml CHANGED

@@ -2,12 +2,11 @@ spec_version: 1
 name: code-debug-env
 type: typed
 description: >
-  A real-world RL environment where an LLM agent diagnoses and fixes
-  buggy Python code across three difficulty levels (easy, medium, hard).
-  Tasks are drawn from real-world domains: data processing, string algorithms,
-  API validation, sorting, dynamic programming, and graph algorithms.
-  Rewards are partial and proportional to test cases passed, with bonuses
-  for correct explanations on hard tasks.
 version: 1.0.0
 author: Souravdanyal
@@ -19,6 +18,7 @@ tags:
   - openenv
   - llm-agent
   - software-engineering
 runtime:
   type: docker
@@ -51,7 +51,7 @@ tasks:
     num_tasks: 15
   - id: hard
-    description: "Fix an algorithmic bug AND provide a correct explanation of the root cause"
     difficulty: hard
     max_steps: 5
     reward_range: [0.0, 1.0]
@@ -67,51 +67,50 @@ action_space:
     fixed_code:
       type: string
       required: true
-      description: "Complete corrected Python function as a string. Must be valid Python."
     explanation:
       type: string
       required: false
-      description: "Required for hard tasks. Explain the bug, root cause, and why fix is correct."
 observation_space:
   type: dict
-  description: "Environment observation returned after reset() and step()"
   fields:
     task_id:
       type: string
-      description: "Unique identifier for the current task instance (e.g. easy_003)"
     difficulty:
       type: enum
       values: [easy, medium, hard]
-      description: "Task difficulty level"
     buggy_code:
       type: string
-      description: "The buggy Python function the agent must fix"
     instructions:
       type: string
-      description: "Natural language description of what is wrong and what to fix"
     test_cases_description:
       type: string
-      description: "Description of what the test cases check"
     reward:
       type: float
-      description: "Score from 0.0 to 1.0 for this step (null on reset)"
     passed_tests:
       type: integer
-      description: "Number of test cases passed (null on reset)"
     total_tests:
       type: integer
-      description: "Total number of test cases (always 3)"
     feedback:
       type: string
-      description: "Detailed per-test feedback showing input, expected, and got values"
     done:
       type: boolean
-      description: "True when episode is complete (perfect score or max steps reached)"
 api:
   reset: /reset
   step: /step
   state: /state
   health: /health
-  tasks: /tasks

 name: code-debug-env
 type: typed
 description: >
+  A real-world RL environment where an LLM agent diagnoses and fixes buggy Python
+  code across three difficulty levels (easy, medium, hard). Tasks cover real-world
+  domains: data processing, string algorithms, API validation, sorting, dynamic
+  programming, and graph algorithms. Rewards are partial and proportional to test
+  cases passed, with bonuses for correct explanations on hard tasks.
 version: 1.0.0
 author: Souravdanyal
   - openenv
   - llm-agent
   - software-engineering
+  - real-world
 runtime:
   type: docker
     num_tasks: 15
   - id: hard
+    description: "Fix an algorithmic bug AND provide a correct explanation of root cause"
     difficulty: hard
     max_steps: 5
     reward_range: [0.0, 1.0]
     fixed_code:
       type: string
       required: true
+      description: "Complete corrected Python function. Must be valid Python including imports."
     explanation:
       type: string
       required: false
+      description: "Required for hard tasks. Explain the bug, root cause, and fix."
 observation_space:
   type: dict
+  description: "Returned after reset() and step()"
   fields:
     task_id:
       type: string
+      description: "Unique task identifier e.g. easy_003"
     difficulty:
       type: enum
       values: [easy, medium, hard]
     buggy_code:
       type: string
+      description: "The buggy Python function to fix"
     instructions:
       type: string
+      description: "Natural language description of what is wrong"
     test_cases_description:
       type: string
+      description: "What the test cases check"
     reward:
       type: float
+      description: "Score 0.0-1.0 (null on reset)"
     passed_tests:
       type: integer
+      description: "Test cases passed (null on reset)"
     total_tests:
       type: integer
+      description: "Total test cases (always 3)"
     feedback:
       type: string
+      description: "Per-test feedback showing Input, Expected, Got"
     done:
       type: boolean
+      description: "True when episode complete"
 api:
   reset: /reset
   step: /step
   state: /state
   health: /health
+  tasks: /tasks
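
Review note: as a rough illustration of the `action_space` declared above, the sketch below builds a request body for `/step`. The helper name `build_step_action` and the flat dict shape are assumptions made for illustration and are not taken from the repository; only the field names `fixed_code` and `explanation` come from `openenv.yaml`.

```python
from typing import Optional


def build_step_action(fixed_code: str, explanation: Optional[str] = None) -> dict:
    """Build an action dict using the two fields named in openenv.yaml.

    Hypothetical helper: the real server may expect a different envelope.
    """
    if not fixed_code or not fixed_code.strip():
        # The environment treats an empty fixed_code as an invalid action.
        raise ValueError("fixed_code must not be empty")
    action = {"fixed_code": fixed_code}
    if explanation is not None:
        # Optional in the schema, but described as required for hard tasks.
        action["explanation"] = explanation
    return action
```

Omitting `explanation` keeps the payload valid for easy and medium tasks, while hard tasks would normally pass it so the explanation portion of the reward can be earned.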

py ADDED Viewed

File without changes

pyproject.toml ADDED Viewed

@@ -0,0 +1,26 @@

+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "code-debug-env"
+version = "1.0.0"
+description = "OpenEnv environment for LLM-based code debugging"
+requires-python = ">=3.10"
+dependencies = [
+    "fastapi>=0.110.0",
+    "uvicorn[standard]>=0.29.0",
+    "pydantic>=2.0.0",
+    "openai>=1.0.0",
+    "requests>=2.31.0",
+    "openenv-core>=0.2.0",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "httpx>=0.27.0",
+]
+[tool.setuptools.packages.find]
+where = ["."]

server/environment.py CHANGED Viewed

@@ -16,17 +16,16 @@ from server.graders.grader_easy import grade_easy
 from server.graders.grader_medium import grade_medium
 from server.graders.grader_hard import grade_hard
 TASK_GETTERS = {
-    "easy": get_random_easy_task,
     "medium": get_random_medium_task,
-    "hard": get_random_hard_task,
 }
 GRADERS = {
-    "easy": grade_easy,
     "medium": grade_medium,
-    "hard": grade_hard,
 }
 MAX_STEPS = 5
@@ -35,7 +34,7 @@ MAX_STEPS = 5
 class CodeDebugEnvironment(Environment):
     """
     OpenEnv environment for LLM-based code debugging.
-    Supports 3 difficulty levels with partial rewards.
     """
     def __init__(self):
@@ -43,28 +42,25 @@ class CodeDebugEnvironment(Environment):
         self._difficulty: str = "easy"
         self._current_task: Optional[dict] = None
         self._step_count: int = 0
         self._best_reward: float = 0.0
         self._current_reward: float = 0.0
         self._done: bool = False
     def reset(self, difficulty: Optional[str] = None) -> DebugObservation:
-        """
-        Start a new episode. Optionally specify difficulty: easy | medium | hard.
-        If not specified, cycles randomly.
-        """
         self._episode_id = str(uuid4())
         self._step_count = 0
         self._best_reward = 0.0
         self._current_reward = 0.0
         self._done = False
-        # Validate difficulty
         if difficulty and difficulty in TASK_GETTERS:
             self._difficulty = difficulty
         else:
             self._difficulty = random.choice(["easy", "medium", "hard"])
-        # Load a task
         self._current_task = TASK_GETTERS[self._difficulty]()
         return DebugObservation(
@@ -74,6 +70,8 @@ class CodeDebugEnvironment(Environment):
             instructions=self._current_task["instructions"],
             test_cases_description=self._current_task["test_cases_description"],
             reward=None,
             passed_tests=None,
             total_tests=len(self._current_task["test_cases"]),
             feedback=None,
@@ -81,31 +79,31 @@ class CodeDebugEnvironment(Environment):
         )
     def step(self, action: DebugAction) -> DebugObservation:
-        """
-        Agent submits fixed_code (and optionally explanation for hard tasks).
-        Returns observation with reward, feedback, and done flag.
-        """
         if self._done:
             return DebugObservation(
                 task_id=self._current_task["task_id"] if self._current_task else "none",
                 difficulty=self._difficulty,
                 buggy_code=self._current_task["buggy_code"] if self._current_task else "",
-                instructions="Episode is already done. Call reset() to start a new episode.",
                 test_cases_description="",
                 reward=self._best_reward,
                 passed_tests=None,
                 total_tests=0,
-                feedback="Episode ended. Please call reset() to start a new task.",
                 done=True,
             )
         self._step_count += 1
-        # ── Invalid action penalty ──────────────────────────────────────────
         code = action.fixed_code.strip() if action.fixed_code else ""
         if not code:
             done = self._step_count >= MAX_STEPS
             self._done = done
             return DebugObservation(
                 task_id=self._current_task["task_id"],
                 difficulty=self._difficulty,
@@ -113,30 +111,15 @@ class CodeDebugEnvironment(Environment):
                 instructions=self._current_task["instructions"],
                 test_cases_description=self._current_task["test_cases_description"],
                 reward=0.0,
                 passed_tests=0,
                 total_tests=len(self._current_task["test_cases"]),
-                feedback="❌ Invalid action: fixed_code is empty. Penalty applied. Submit valid Python code.",
                 done=done,
             )
-        # Check for obvious non-Python (very short or no 'def' keyword)
-        if len(code) < 5 or ("def " not in code and "lambda" not in code and "=" not in code):
-            done = self._step_count >= MAX_STEPS
-            self._done = done
-            return DebugObservation(
-                task_id=self._current_task["task_id"],
-                difficulty=self._difficulty,
-                buggy_code=self._current_task["buggy_code"],
-                instructions=self._current_task["instructions"],
-                test_cases_description=self._current_task["test_cases_description"],
-                reward=0.0,
-                passed_tests=0,
-                total_tests=len(self._current_task["test_cases"]),
-                feedback="❌ Invalid action: submission does not appear to be valid Python. Penalty applied.",
-                done=done,
-            )
-        # Grade the submission
         grader = GRADERS[self._difficulty]
         if self._difficulty == "hard":
             reward, passed, total, feedback, _ = grader(
@@ -148,9 +131,9 @@ class CodeDebugEnvironment(Environment):
             )
         self._current_reward = reward
         self._best_reward = max(self._best_reward, reward)
-        # Episode ends if: perfect score OR max steps reached
         done = (reward == 1.0) or (self._step_count >= MAX_STEPS)
         self._done = done
@@ -161,6 +144,8 @@ class CodeDebugEnvironment(Environment):
             instructions=self._current_task["instructions"],
             test_cases_description=self._current_task["test_cases_description"],
             reward=reward,
             passed_tests=passed,
             total_tests=total,
             feedback=feedback,
@@ -177,6 +162,7 @@ class CodeDebugEnvironment(Environment):
             difficulty=self._difficulty,
             max_steps=MAX_STEPS,
             current_reward=self._current_reward,
             best_reward=self._best_reward,
             done=self._done,
-        )

 from server.graders.grader_medium import grade_medium
 from server.graders.grader_hard import grade_hard
 TASK_GETTERS = {
+    "easy":   get_random_easy_task,
     "medium": get_random_medium_task,
+    "hard":   get_random_hard_task,
 }
 GRADERS = {
+    "easy":   grade_easy,
     "medium": grade_medium,
+    "hard":   grade_hard,
 }
 MAX_STEPS = 5
 class CodeDebugEnvironment(Environment):
     """
     OpenEnv environment for LLM-based code debugging.
+    Supports 3 difficulty levels with partial rewards and cumulative tracking.
     """
     def __init__(self):
         self._difficulty: str = "easy"
         self._current_task: Optional[dict] = None
         self._step_count: int = 0
+        self._cumulative_reward: float = 0.0
         self._best_reward: float = 0.0
         self._current_reward: float = 0.0
         self._done: bool = False
     def reset(self, difficulty: Optional[str] = None) -> DebugObservation:
+        """Start a new episode. Optionally specify difficulty: easy | medium | hard."""
         self._episode_id = str(uuid4())
         self._step_count = 0
+        self._cumulative_reward = 0.0
         self._best_reward = 0.0
         self._current_reward = 0.0
         self._done = False
         if difficulty and difficulty in TASK_GETTERS:
             self._difficulty = difficulty
         else:
             self._difficulty = random.choice(["easy", "medium", "hard"])
         self._current_task = TASK_GETTERS[self._difficulty]()
         return DebugObservation(
             instructions=self._current_task["instructions"],
             test_cases_description=self._current_task["test_cases_description"],
             reward=None,
+            cumulative_reward=0.0,
+            best_reward=0.0,
             passed_tests=None,
             total_tests=len(self._current_task["test_cases"]),
             feedback=None,
         )
     def step(self, action: DebugAction) -> DebugObservation:
+        """Submit fixed_code. Returns observation with reward, cumulative_reward, feedback, done."""
         if self._done:
             return DebugObservation(
                 task_id=self._current_task["task_id"] if self._current_task else "none",
                 difficulty=self._difficulty,
                 buggy_code=self._current_task["buggy_code"] if self._current_task else "",
+                instructions="Episode done. Call reset() to start a new episode.",
                 test_cases_description="",
                 reward=self._best_reward,
+                cumulative_reward=self._cumulative_reward,
+                best_reward=self._best_reward,
                 passed_tests=None,
                 total_tests=0,
+                feedback="Episode ended. Call reset() to start a new task.",
                 done=True,
             )
         self._step_count += 1
+        # ── Invalid action penalty ─────────────────────────────────────────
         code = action.fixed_code.strip() if action.fixed_code else ""
         if not code:
             done = self._step_count >= MAX_STEPS
             self._done = done
+            self._cumulative_reward += 0.0  # no-op: an empty submission earns zero reward
             return DebugObservation(
                 task_id=self._current_task["task_id"],
                 difficulty=self._difficulty,
                 instructions=self._current_task["instructions"],
                 test_cases_description=self._current_task["test_cases_description"],
                 reward=0.0,
+                cumulative_reward=self._cumulative_reward,
+                best_reward=self._best_reward,
                 passed_tests=0,
                 total_tests=len(self._current_task["test_cases"]),
+                feedback="❌ Invalid action: fixed_code is empty. Submit valid Python code.",
                 done=done,
             )
+        # ── Grade the submission ───────────────────────────────────────────
         grader = GRADERS[self._difficulty]
         if self._difficulty == "hard":
             reward, passed, total, feedback, _ = grader(
             )
         self._current_reward = reward
+        self._cumulative_reward += reward
         self._best_reward = max(self._best_reward, reward)
         done = (reward == 1.0) or (self._step_count >= MAX_STEPS)
         self._done = done
             instructions=self._current_task["instructions"],
             test_cases_description=self._current_task["test_cases_description"],
             reward=reward,
+            cumulative_reward=self._cumulative_reward,
+            best_reward=self._best_reward,
             passed_tests=passed,
             total_tests=total,
             feedback=feedback,
             difficulty=self._difficulty,
             max_steps=MAX_STEPS,
             current_reward=self._current_reward,
+            cumulative_reward=self._cumulative_reward,
             best_reward=self._best_reward,
             done=self._done,
+        )
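
Review note: the bookkeeping this commit adds to `step()` can be sketched in isolation as below. `simulate_episode` is a hypothetical helper, not part of the repository; it assumes every submission is non-empty and takes the graded rewards as given, mirroring only the `cumulative_reward`, `best_reward`, and `done` updates shown in the diff.

```python
from typing import List, Tuple

MAX_STEPS = 5  # same value as MAX_STEPS in server/environment.py


def simulate_episode(
    step_rewards: List[float],
) -> List[Tuple[int, float, float, float, bool]]:
    """Replay per-step rewards and mirror the bookkeeping in step().

    Returns one (step_count, reward, cumulative_reward, best_reward, done)
    tuple per accepted step. Hypothetical helper for illustration only.
    """
    cumulative = 0.0
    best = 0.0
    history = []
    for step_count, reward in enumerate(step_rewards, start=1):
        cumulative += reward
        best = max(best, reward)
        # The episode ends on a perfect score or when the step budget is spent.
        done = (reward == 1.0) or (step_count >= MAX_STEPS)
        history.append((step_count, reward, cumulative, best, done))
        if done:
            break
    return history
```

Under these assumptions `cumulative_reward` can exceed 1.0 over an episode (it is a sum of per-step scores), whereas `best_reward` stays within the declared `reward_range`.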

server/graders/__pycache__/grader_easy.cpython-310.pyc CHANGED Viewed

Binary files a/server/graders/__pycache__/grader_easy.cpython-310.pyc and b/server/graders/__pycache__/grader_easy.cpython-310.pyc differ

server/graders/__pycache__/grader_hard.cpython-310.pyc CHANGED Viewed

Binary files a/server/graders/__pycache__/grader_hard.cpython-310.pyc and b/server/graders/__pycache__/grader_hard.cpython-310.pyc differ

server/graders/__pycache__/grader_medium.cpython-310.pyc CHANGED Viewed

Binary files a/server/graders/__pycache__/grader_medium.cpython-310.pyc and b/server/graders/__pycache__/grader_medium.cpython-310.pyc differ

server/graders/grader_hard.py CHANGED Viewed

@@ -6,68 +6,85 @@ from typing import Tuple, List, Optional
 from .grader_easy import grade_easy
-def _score_explanation(explanation: Optional[str], keywords: List[str]) -> Tuple[float, str]:
     """
-    Score explanation by checking for required conceptual keywords.
-    - No explanation → 0.0
-    - 1+ keyword hit → partial credit proportional to hits
-    - Half or more keywords → 1.0
     """
-    if not explanation or len(explanation.strip()) < 10:
-        return 0.0, "❌ No explanation provided. Hard tasks require an explanation field."
-    explanation_lower = explanation.lower()
-    hits = [kw for kw in keywords if kw.lower() in explanation_lower]
-    if not keywords:
-        score = 1.0 if len(explanation.strip()) > 20 else 0.5
     else:
-        needed = max(1, len(keywords) // 2)
-        if len(hits) == 0:
-            score = 0.0
-        elif len(hits) >= needed:
-            score = 1.0
-        else:
-            score = round(len(hits) / needed, 2)
-    if score == 1.0:
-        feedback = f"✅ Explanation excellent! Mentioned: {', '.join(hits)}"
     elif score > 0:
-        missing = [kw for kw in keywords if kw.lower() not in explanation_lower]
         feedback = (
-            f"⚠️ Partial explanation (score={score}). Mentioned: {', '.join(hits) or 'none'}. "
-            f"Also discuss: {', '.join(missing[:3])}"
         )
     else:
-        feedback = (
-            f"❌ Explanation missing key concepts. "
-            f"Explain: {', '.join(keywords[:3])}"
-        )
     return round(score, 2), feedback
 def grade_hard(fixed_code: str, task: dict, explanation: Optional[str] = None) -> Tuple[float, int, int, str, List[dict]]:
     """
-    Grade a hard task submission.
-    Reward = 0.7 × test_score + 0.3 × explanation_score
     """
     test_reward, passed, total, code_feedback, results = grade_easy(fixed_code, task)
     keywords = task.get("explanation_keywords", [])
-    exp_score, exp_feedback = _score_explanation(explanation, keywords)
     final_reward = round(0.7 * test_reward + 0.3 * exp_score, 2)
     feedback = (
-        f"--- Code Score (70% weight): {test_reward:.2f} ---\n"
         f"{code_feedback}\n\n"
-        f"--- Explanation Score (30% weight): {exp_score:.2f} ---\n"
         f"{exp_feedback}\n\n"
         f"=== Final Reward: {final_reward:.2f} ==="
     )
     if passed == total and exp_score < 1.0:
-        feedback += f"\n💡 Code is correct! Improve explanation by mentioning: {', '.join(keywords[:3])}"
     elif passed < total and not explanation:
-        feedback += "\n💡 Fix the code AND provide a clear explanation for max reward."
     return final_reward, passed, total, feedback, results

 from .grader_easy import grade_easy
+def _score_explanation(explanation: Optional[str], keywords: List[str], instructions: str) -> Tuple[float, str]:
     """
+    Score explanation by keyword and synonym matching:
+    - Length check (must be at least 15 characters)
+    - Keyword matching, including common synonyms (concept coverage)
+    - Partial credit proportional to the keywords covered
     """
+    if not explanation or len(explanation.strip()) < 15:
+        return 0.0, "❌ No explanation provided. Hard tasks require an explanation field."
+    exp_lower = explanation.lower()
+    hits = [kw for kw in keywords if kw.lower() in exp_lower]
+    # Also check for common synonyms
+    synonym_map = {
+        "visited": ["seen", "visited", "track", "memo"],
+        "iteration order": ["order", "direction", "forward", "backward", "reverse"],
+        "overwrite": ["overwrite", "override", "update", "modify"],
+        "reverse": ["reverse", "backward", "right to left", "descending"],
+        "0/1": ["0/1", "zero one", "binary", "knapsack"],
+        "high": ["high", "upper", "boundary", "bound"],
+        "return high": ["return high", "high boundary"],
+        "floor": ["floor", "integer", "truncat"],
+    }
+    synonym_hits = set(hits)
+    for kw in keywords:
+        kw_lower = kw.lower()
+        if kw_lower in synonym_map:
+            for syn in synonym_map[kw_lower]:
+                if syn in exp_lower:
+                    synonym_hits.add(kw)
+                    break
+    total_hits = len(synonym_hits)
+    needed = max(1, len(keywords) // 2)
+    if total_hits == 0:
+        score = 0.1 if len(explanation.strip()) > 50 else 0.0  # minimal credit for any long attempt
+    elif total_hits >= needed:
+        score = 1.0
     else:
+        score = round(total_hits / needed, 2)
+    if score >= 1.0:
+        feedback = f"✅ Explanation excellent! Covered: {', '.join(sorted(synonym_hits))}"
     elif score > 0:
+        missing = [kw for kw in keywords if kw not in synonym_hits]
         feedback = (
+            f"⚠️ Partial explanation (score={score}). Covered: {', '.join(sorted(synonym_hits)) or 'none'}. "
+            f"Also mention: {', '.join(missing[:3])}"
         )
     else:
+        feedback = f"❌ Explanation too vague. Explain: {', '.join(keywords[:3])}"
     return round(score, 2), feedback
 def grade_hard(fixed_code: str, task: dict, explanation: Optional[str] = None) -> Tuple[float, int, int, str, List[dict]]:
     """
+    Grade hard task: Reward = 0.7 × test_score + 0.3 × explanation_score
     """
     test_reward, passed, total, code_feedback, results = grade_easy(fixed_code, task)
     keywords = task.get("explanation_keywords", [])
+    instructions = task.get("instructions", "")
+    exp_score, exp_feedback = _score_explanation(explanation, keywords, instructions)
     final_reward = round(0.7 * test_reward + 0.3 * exp_score, 2)
     feedback = (
+        f"--- Code Score (70%): {test_reward:.2f} ---\n"
         f"{code_feedback}\n\n"
+        f"--- Explanation Score (30%): {exp_score:.2f} ---\n"
         f"{exp_feedback}\n\n"
         f"=== Final Reward: {final_reward:.2f} ==="
     )
     if passed == total and exp_score < 1.0:
+        feedback += f"\n💡 Code correct! Boost score by mentioning: {', '.join(keywords[:3])}"
     elif passed < total and not explanation:
+        feedback += "\n💡 Fix the code AND add explanation for max reward."
     return final_reward, passed, total, feedback, results
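
Review note: the scoring arithmetic in the new `grader_hard.py` can be checked with a small standalone sketch. The function names `explanation_score` and `hard_reward` are hypothetical and take pre-computed counts instead of the real task dict; the thresholds (15 and 50 characters, half of the keywords, the 0.7/0.3 weights) are copied from the diff above.

```python
def explanation_score(total_hits: int, num_keywords: int, explanation_len: int) -> float:
    """Mirror the branch structure of _score_explanation on plain numbers."""
    if explanation_len < 15:
        return 0.0  # treated as "no explanation provided"
    needed = max(1, num_keywords // 2)
    if total_hits == 0:
        # Minimal credit for a long attempt that matches no keyword.
        return 0.1 if explanation_len > 50 else 0.0
    if total_hits >= needed:
        return 1.0
    return round(total_hits / needed, 2)


def hard_reward(test_score: float, exp_score: float) -> float:
    """Combine the two scores with the 70/30 weighting used by grade_hard."""
    return round(0.7 * test_score + 0.3 * exp_score, 2)
```

One consequence worth noting: a fully correct fix with no explanation yields `hard_reward(1.0, 0.0)`, which is 0.7, so on hard tasks the `reward == 1.0` early-termination check in `step()` is only met when the explanation also scores 1.0.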