Spaces:

h1manshu
/

code_review


h1manshu committed on Apr 6

Commit

0fb8bd2

verified ·

1 Parent(s): a158b53

Upload folder using huggingface_hub

Files changed (6)

README.md +168 -149
client.py +7 -9
dataset/dataset.json +6 -9
inference.py +16 -8
models.py +7 -3
server/code_review_environment.py +64 -30

README.md CHANGED

@@ -13,43 +13,180 @@ tags:
 # Code Review Environment
-A simple test environment that echoes back messages. Perfect for testing the env APIs as well as demonstrating environment usage patterns.
 ## Quick Start
-The simplest way to use the Code Review environment is through the `CodeReviewEnv` class:
-```python
-from code_review import CodeReviewAction, CodeReviewEnv
-try:
-    # Create environment from Docker image
-    code_reviewenv = CodeReviewEnv.from_docker_image("code_review-env:latest")
-    # Reset
-    result = code_reviewenv.reset()
-    print(f"Reset: {result.observation.echoed_message}")
-    # Send multiple messages
-    messages = ["Hello, World!", "Testing echo", "Final message"]
-    for msg in messages:
-        result = code_reviewenv.step(CodeReviewAction(message=msg))
-        print(f"Sent: '{msg}'")
-        print(f"  → Echoed: '{result.observation.echoed_message}'")
-        print(f"  → Length: {result.observation.message_length}")
-        print(f"  → Reward: {result.reward}")
-finally:
-    # Always clean up
-    code_reviewenv.close()
 ```
-That's it! The `CodeReviewEnv.from_docker_image()` method handles:
-- Starting the Docker container
-- Waiting for the server to be ready
-- Connecting to the environment
-- Container cleanup when you call `close()`
 ## Building the Docker Image
@@ -116,124 +253,6 @@ The deployed space includes:
 - **Health Check** at `/health` - Container health monitoring
 - **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
-## Environment Details
-### Action
-**CodeReviewAction**: Contains a single field
-- `message` (str) - The message to echo back
-### Observation
-**CodeReviewObservation**: Contains the echo response and metadata
-- `echoed_message` (str) - The message echoed back
-- `message_length` (int) - Length of the message
-- `reward` (float) - Reward based on message length (length × 0.1)
-- `done` (bool) - Always False for echo environment
-- `metadata` (dict) - Additional info like step count
-### Reward
-The reward is calculated as: `message_length × 0.1`
-- "Hi" → reward: 0.2
-- "Hello, World!" → reward: 1.3
-- Empty message → reward: 0.0
-## Advanced Usage
-### Connecting to an Existing Server
-If you already have a Code Review environment server running, you can connect directly:
-```python
-from code_review import CodeReviewEnv
-# Connect to existing server
-code_reviewenv = CodeReviewEnv(base_url="<ENV_HTTP_URL_HERE>")
-# Use as normal
-result = code_reviewenv.reset()
-result = code_reviewenv.step(CodeReviewAction(message="Hello!"))
-```
-Note: When connecting to an existing server, `code_reviewenv.close()` will NOT stop the server.
-### Using the Context Manager
-The client supports context manager usage for automatic connection management:
-```python
-from code_review import CodeReviewAction, CodeReviewEnv
-# Connect with context manager (auto-connects and closes)
-with CodeReviewEnv(base_url="http://localhost:8000") as env:
-    result = env.reset()
-    print(f"Reset: {result.observation.echoed_message}")
-    # Multiple steps with low latency
-    for msg in ["Hello", "World", "!"]:
-        result = env.step(CodeReviewAction(message=msg))
-        print(f"Echoed: {result.observation.echoed_message}")
-```
-The client uses WebSocket connections for:
-- **Lower latency**: No HTTP connection overhead per request
-- **Persistent session**: Server maintains your environment state
-- **Efficient for episodes**: Better for many sequential steps
-### Concurrent WebSocket Sessions
-The server supports multiple concurrent WebSocket connections. To enable this,
-modify `server/app.py` to use factory mode:
-```python
-# In server/app.py - use factory mode for concurrent sessions
-app = create_app(
-    CodeReviewEnvironment,  # Pass class, not instance
-    CodeReviewAction,
-    CodeReviewObservation,
-    max_concurrent_envs=4,  # Allow 4 concurrent sessions
-)
-```
-Then multiple clients can connect simultaneously:
-```python
-from code_review import CodeReviewAction, CodeReviewEnv
-from concurrent.futures import ThreadPoolExecutor
-def run_episode(client_id: int):
-    with CodeReviewEnv(base_url="http://localhost:8000") as env:
-        result = env.reset()
-        for i in range(10):
-            result = env.step(CodeReviewAction(message=f"Client {client_id}, step {i}"))
-        return client_id, result.observation.message_length
-# Run 4 episodes concurrently
-with ThreadPoolExecutor(max_workers=4) as executor:
-    results = list(executor.map(run_episode, range(4)))
-```
-## Development & Testing
-### Direct Environment Testing
-Test the environment logic directly without starting the HTTP server:
-```bash
-# From the server directory
-python3 server/code_review_environment.py
-```
-This verifies that:
-- Environment resets correctly
-- Step executes actions properly
-- State tracking works
-- Rewards are calculated correctly
-### Running Locally
-Run the server locally for development:
-```bash
-uvicorn server.app:app --reload
-```
 ## Project Structure

 # Code Review Environment
+A reinforcement learning benchmark environment where an agent acts as a senior software engineer reviewing pull requests. The agent must identify bugs, suggest fixes, and make approval decisions across progressively harder code review tasks.
+## Motivation
+Code review is a high-stakes, multi-step reasoning task that requires an agent to:
+- **Detect bugs and security vulnerabilities** from raw code diffs
+- **Generate corrective code** that resolves identified issues
+- **Make a final judgment** (approve/reject) backed by technical reasoning
+Existing benchmarks test code generation or comprehension in isolation. This environment tests the full review loop — detection, remediation, and decision-making — in a structured, scorable way. It is designed to evaluate whether LLMs can act as reliable automated reviewers in software development pipelines.
+## Environment Description
+The agent receives a pull request observation at each step and must respond with a structured JSON action. The episode runs for up to `MAX_STEPS = 3` steps, following a prescribed workflow:
+| Step | Expected Action | Purpose |
+|------|----------------|---------|
+| 1 | `comment` | Identify all issues in the diff |
+| 2 | `suggest_fix` | Provide corrected code |
+| 3 | `final_decision` | Approve or reject the PR |
+Each step is independently scored, and the final episode score is the maximum score achieved across all steps.
+## Action Space
+Actions are instances of `CodeReviewAction` and must be returned as JSON with the following fields:
+```json
+{
+  "action_type": "comment | suggest_fix | final_decision",
+  "comment": "Detailed description of identified issues (>30 characters)",
+  "suggested_code": "Corrected code snippet, or null if not applicable",
+  "decision": "approve | reject | null"
+}
+```
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `action_type` | `str` | Always | One of `comment`, `suggest_fix`, `final_decision` |
+| `comment` | `str` | Recommended | Technical description of issues found |
+| `suggested_code` | `str \| null` | Step 2 | Corrected code replacing the buggy diff |
+| `decision` | `str \| null` | Step 3 | `approve` or `reject`; `null` otherwise |
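
The field rules in the table above can be checked before an action is sent to the environment. The following is a minimal sketch, not the environment's own validation: `validate_action` and the two constant sets are hypothetical names, and rejecting a non-null `decision` outside `final_decision` is an assumption drawn from the "`null` otherwise" wording.

```python
import json
from typing import Optional

# Allowed values taken from the action table; the set names are hypothetical.
VALID_ACTION_TYPES = {"comment", "suggest_fix", "final_decision"}
VALID_DECISIONS = {"approve", "reject"}


def validate_action(raw: str) -> dict:
    """Parse a JSON action string and check it against the documented rules."""
    action = json.loads(raw)
    action_type = action.get("action_type")
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")

    decision: Optional[str] = action.get("decision")
    if action_type == "final_decision":
        # Step 3 must carry an explicit approve/reject decision.
        if decision not in VALID_DECISIONS:
            raise ValueError("final_decision requires decision 'approve' or 'reject'")
    elif decision is not None:
        # Assumption: any other step must leave decision as null.
        raise ValueError("decision must be null outside final_decision")
    return action


example = validate_action(
    '{"action_type": "comment", "comment": "datetime is used without an import", '
    '"suggested_code": null, "decision": null}'
)
```

Running a check like this on the client side surfaces schema problems before they reach the grader.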
+## Observation Space
+Each step returns a `CodeReviewObservation` with the following fields:
+| Field | Type | Description |
+|-------|------|-------------|
+| `pr` | `CodeReviewPullRequest` | The pull request under review |
+| `pr.id` | `str` | Unique PR identifier |
+| `pr.title` | `str` | Short title of the PR |
+| `pr.description` | `str` | Brief description of intent |
+| `pr.language` | `str` | Programming language (e.g. `python`) |
+| `pr.diffs` | `List[CodeDiff]` | List of file diffs |
+| `pr.diffs[].file_name` | `str` | Name of the changed file |
+| `pr.diffs[].diff` | `str` | The actual code change |
+| `previous_comments` | `List[str]` | Comments made in prior steps |
+| `step_count` | `int` | Current step number |
+| `max_steps` | `int` | Maximum steps per episode (default: 3) |
+## Scoring
+Each action is scored across three components:
+| Component | Weight | Method |
+|-----------|--------|--------|
+| Issue Detection | 40% | Fraction of ground-truth issues mentioned in `comment` |
+| Fix Quality | 30% | Token overlap + sequence similarity between `suggested_code` and ground-truth fix |
+| Decision Accuracy | 30% | Exact match with ground-truth `approve`/`reject`; partial credit (0.2) for wrong decision |
+**Bonuses and penalties applied per step:**
+- `+0.1` — comment length > 30 characters (encourages detail)
+- `+0.1` — correct final decision reached in step ≤ 2 (encourages efficiency)
+- `-0.1` — no comment provided on a non-decision step (penalizes lazy steps)
+- `-0.05` — step count exceeds 3 (penalizes long trajectories)
+The final episode score is the **maximum** `grade_action` score across all steps in the episode. Scores are clamped to `[0.0, 1.0]`.
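
The arithmetic described above can be sketched as follows. This assumes the three component scores are combined as a weighted sum with the weights from the table and that bonuses and penalties are added before clamping; the body of `grade_action` is not shown in this diff, so `combine_score` and `episode_score` are hypothetical helpers rather than the environment's code.

```python
from typing import Iterable

# Weights from the scoring table.
ISSUE_WEIGHT = 0.4
FIX_WEIGHT = 0.3
DECISION_WEIGHT = 0.3


def combine_score(issue: float, fix: float, decision: float, bonus: float = 0.0) -> float:
    """Weighted sum of the three components plus bonus, clamped to [0.0, 1.0]."""
    raw = ISSUE_WEIGHT * issue + FIX_WEIGHT * fix + DECISION_WEIGHT * decision + bonus
    return min(max(raw, 0.0), 1.0)


def episode_score(step_scores: Iterable[float]) -> float:
    """Episode score is the maximum step score (0.0 for an empty episode)."""
    return max(step_scores, default=0.0)


# All issues found, half-matching fix, wrong decision (partial credit 0.2).
step = combine_score(issue=1.0, fix=0.5, decision=0.2)
final = episode_score([0.55, 0.72, 0.85])
```

Because the episode score is a maximum rather than a sum, one strong step determines the result regardless of weaker steps before or after it.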
+## Task Descriptions
+The dataset contains tasks at three difficulty levels:
+### Easy
+Straightforward single-file issues with an obvious fix.
+| PR | Issue | Expected Decision |
+|----|-------|------------------|
+| Missing import | `datetime` used without import | reject |
+**What the agent must do:** Detect the missing `from datetime import datetime` statement and supply the corrected import.
+---
+### Medium
+Logical or performance issues requiring understanding of Python semantics.
+| PR | Issue | Expected Decision |
+|----|-------|------------------|
+| Division function | No guard against division by zero | reject |
+| Inefficient loop | `range(len(arr))` pattern; can use `in` operator | approve |
+**What the agent must do:** For the division task, add an `if b == 0: return None` guard. For the loop task, recognize it as a style/efficiency issue rather than a correctness bug; the correct decision is **approve**.
+---
+### Hard
+Security vulnerabilities, injection attacks, and cross-file null-handling bugs.
+| PR | Issue | Expected Decision |
+|----|-------|------------------|
+| Authentication logic | Hardcoded plaintext password `admin123` | reject |
+| SQL query | String concatenation exposes SQL injection | reject |
+| Cross-file null bug | `get_user(None)` called without input validation | reject |
+**What the agent must do:**
+- For auth: detect the hardcoded secret and propose `bcrypt`-based password comparison.
+- For SQL: detect string concatenation and replace with a parameterized query (`%s` placeholder + `cursor.execute`).
+- For null bug: validate `id is not None` before the `db[id]` lookup, and fix the call site in `controller.py`.
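
As an illustration of the SQL task's expected fix, the sketch below contrasts nothing more than a bound parameter with untrusted input. It uses the standard-library `sqlite3` module, whose placeholder is `?` rather than the `%s` style named above; the `users` table, its rows, and `find_user` are hypothetical and not part of the dataset.

```python
import sqlite3

# In-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)


def find_user(connection: sqlite3.Connection, user_id):
    """Look up a user with a bound parameter instead of string concatenation."""
    cursor = connection.execute(
        "SELECT id, name FROM users WHERE id = ?", (user_id,)
    )
    return cursor.fetchall()


rows_ok = find_user(conn, 1)
# The payload is passed as a single value, so it is never parsed as SQL.
rows_injection = find_user(conn, "1 OR 1=1")
```

With string concatenation the second call would match every row; with a bound parameter the payload is compared as a plain value and matches nothing.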
 ## Quick Start
+Start the environment server, then run inference:
+```bash
+# Terminal 1 — download code_review repo
+git clone https://github.com/Ajay-Ganapathy/code_review && cd code_review
+# Terminal 1 — install packages
+uv pip install -e .
+# Terminal 1 — Run server locally
+uv run server --host 0.0.0.0 --port 8000
+# Terminal 2 — run the agent
+uv run python inference.py
+```
+The agent runs `NUM_EPISODES = 4` episodes (configurable), each with up to `MAX_STEPS = 3` steps, and logs every step:
 ```
+[START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct
+[STEP] step=1 action=... reward=0.55 done=false error=null
+[STEP] step=2 action=... reward=0.72 done=false error=null
+[STEP] step=3 action=... reward=0.85 done=true error=null
+[END] success=true steps=12 score=0.850 rewards=0.55,0.72,0.85,0.60
+```
+## Configuration
+Key constants in `inference.py`:
+| Constant | Default | Description |
+|----------|---------|-------------|
+| `MAX_STEPS` | `3` | Steps per episode |
+| `NUM_EPISODES` | `4` | Number of PRs to review |
+| `TEMPERATURE` | `0.2` | Sampling temperature (lower = more deterministic) |
+| `MAX_TOKENS` | `256` | Max tokens per LLM response |
+| `SUCCESS_SCORE_THRESHOLD` | `0.1` | Minimum normalized score to count as success |
+### Score Interpretation
+| Score Range | Interpretation |
+|-------------|---------------|
+| 0.00 – 0.20 | Failing — agent cannot follow the JSON schema or identify basic issues |
+| 0.20 – 0.50 | Partial — agent detects some issues but misses security vulnerabilities or gives wrong decisions |
+| 0.50 – 0.75 | Competent — agent handles easy and medium tasks; struggles with hard security/null cases |
+| 0.75 – 1.00 | Strong — agent reliably detects all issue types, generates correct fixes, and makes sound decisions |
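
The bands in the table above can be turned into a small lookup. This is a sketch only: `interpret_score` is a hypothetical helper, and because the table's ranges share their endpoints, treating each lower bound as inclusive is an assumption.

```python
def interpret_score(score: float) -> str:
    """Map a normalized score in [0.0, 1.0] to the band labels from the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    # Assumption: a score equal to a boundary falls into the higher band.
    if score < 0.20:
        return "Failing"
    if score < 0.50:
        return "Partial"
    if score < 0.75:
        return "Competent"
    return "Strong"


label = interpret_score(0.85)
```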
 ## Building the Docker Image
 - **Health Check** at `/health` - Container health monitoring
 - **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
 ## Project Structure

client.py CHANGED

@@ -12,12 +12,15 @@ from openenv.core import EnvClient
 from openenv.core.client_types import StepResult
 from openenv.core.env_server.types import State
-from .models import CodeReviewAction, CodeReviewObservation, CodeReviewReward , CodeReviewPullRequest
-class CodeReviewEnv(
-    EnvClient[CodeReviewAction, CodeReviewObservation, State]
-):
     """
     Client for the Code Review Environment.
@@ -90,19 +93,14 @@ class CodeReviewEnv(
         """
         # print("Payload ====== ", payload)
         obs_data = payload.get("observation") or {}
         if "observation" in obs_data:  # nested case
             obs_data = obs_data["observation"]
         if not obs_data or "pr" not in obs_data:
             raise ValueError(f"Invalid observation payload: {payload}")
         pr_data = obs_data["pr"]
         observation = CodeReviewObservation(

 from openenv.core.client_types import StepResult
 from openenv.core.env_server.types import State
+from .models import (
+    CodeReviewAction,
+    CodeReviewObservation,
+    CodeReviewReward,
+    CodeReviewPullRequest,
+)
+class CodeReviewEnv(EnvClient[CodeReviewAction, CodeReviewObservation, State]):
     """
     Client for the Code Review Environment.
         """
         # print("Payload ====== ", payload)
         obs_data = payload.get("observation") or {}
         if "observation" in obs_data:  # nested case
             obs_data = obs_data["observation"]
         if not obs_data or "pr" not in obs_data:
             raise ValueError(f"Invalid observation payload: {payload}")
         pr_data = obs_data["pr"]
         observation = CodeReviewObservation(

dataset/dataset.json CHANGED

@@ -1,5 +1,4 @@
 [
   {
     "task_type": "easy",
     "pr": {
@@ -17,7 +16,7 @@
     "ground_truth": {
       "issues": ["missing import datetime"],
       "decision": "reject",
-      "fix": "from datetime import datetime"
     }
   },
   {
@@ -37,7 +36,7 @@
     "ground_truth": {
       "issues": ["division by zero"],
       "decision": "reject",
-      "fix": "if b == 0: return None"
     }
   },
   {
@@ -57,10 +56,9 @@
     "ground_truth": {
       "issues": ["inefficient loop"],
       "decision": "approve",
-      "fix": "use 'if target in arr'"
     }
   },
   {
     "task_type": "hard",
     "pr": {
@@ -78,7 +76,7 @@
     "ground_truth": {
       "issues": ["hardcoded password", "security vulnerability"],
       "decision": "reject",
-      "fix": "use hashed password comparison"
     }
   },
   {
@@ -98,7 +96,7 @@
     "ground_truth": {
       "issues": ["sql injection"],
       "decision": "reject",
-      "fix": "use parameterized queries"
     }
   },
   {
@@ -122,8 +120,7 @@
     "ground_truth": {
       "issues": ["invalid input", "null handling"],
       "decision": "reject",
-      "fix": "validate id before calling get_user"
     }
   }
 ]

 [
   {
     "task_type": "easy",
     "pr": {
     "ground_truth": {
       "issues": ["missing import datetime"],
       "decision": "reject",
+      "fix": "from datetime import datetime\nprint(datetime.now())"
     }
   },
   {
     "ground_truth": {
       "issues": ["division by zero"],
       "decision": "reject",
+      "fix": "def divide(a, b):\n    if b == 0:\n        return None\n    return a / b"
     }
   },
   {
     "ground_truth": {
       "issues": ["inefficient loop"],
       "decision": "approve",
+      "fix": "return target in arr"
     }
   },
   {
     "task_type": "hard",
     "pr": {
     "ground_truth": {
       "issues": ["hardcoded password", "security vulnerability"],
       "decision": "reject",
+      "fix": "import bcrypt\n\ndef login(password, hashed_password):\n    return bcrypt.checkpw(password.encode(), hashed_password)"
     }
   },
   {
     "ground_truth": {
       "issues": ["sql injection"],
       "decision": "reject",
+      "fix": "query = \"SELECT * FROM users WHERE id = %s\"\ncursor.execute(query, (user_id,))"
     }
   },
   {
     "ground_truth": {
       "issues": ["invalid input", "null handling"],
       "decision": "reject",
+      "fix": "def get_user(id):\n    if id is None:\n        raise ValueError('id must not be None')\n    return db[id]\n\nuser = get_user(user_id)"
     }
   }
 ]

inference.py CHANGED

@@ -26,15 +26,15 @@ import asyncio
 from code_review import CodeReviewAction, CodeReviewObservation
 from code_review.client import CodeReviewEnv
-API_BASE_URL = "https://router.huggingface.co/v1"
 API_KEY = os.getenv("HF_TOKEN")
-MODEL_NAME = os.getenv("MODEL_NAME")
 TASK_NAME = "code_review"
 BENCHMARK = "code_review_benchmark"
 MAX_STEPS = 3
 TEMPERATURE = 0.2
-MAX_TOKENS = 150
-NUM_EPISODES = 6
 _MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
 MAX_TOTAL_REWARD = NUM_EPISODES * MAX_STEPS * _MAX_REWARD_PER_STEP
 SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]
@@ -68,6 +68,8 @@ Rules:
 - Mention every issue explicitly
 - Use precise technical language
 - Write detailed comments (>30 characters)
 Return ONLY JSON:
@@ -193,7 +195,7 @@ def parse_action(text: str) -> Dict[str, Any]:
     text = text.strip().replace("```json", "").replace("```", "")
     try:
-        return json.loads(text)
     except Exception as e:
         print(e)
         return fallback_action()
@@ -235,7 +237,9 @@ async def run_episode(client, env):
         reward = result.reward
         done = result.done
-        log_step(step=step, action=response_text, reward=reward.score, done=done, error=None)
         final_score = max(final_score, reward.score if reward else 0.0)
     return final_score
@@ -249,7 +253,6 @@ async def main():
     async with CodeReviewEnv(base_url="http://localhost:8000") as env:
         for i in range(NUM_EPISODES):
-            print(f"\n===== Episode {i+1} =====", flush=True)
             env.task_index = i
             score = await run_episode(client, env)
@@ -260,7 +263,12 @@ async def main():
     total_score = sum(scores) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
     final_score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
     success = final_score >= SUCCESS_SCORE_THRESHOLD
-    log_end(success=success, steps=NUM_EPISODES*MAX_STEPS, score=final_score, rewards=scores)
 if __name__ == "__main__":

 from code_review import CodeReviewAction, CodeReviewObservation
 from code_review.client import CodeReviewEnv
+API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
 API_KEY = os.getenv("HF_TOKEN")
+MODEL_NAME = os.getenv("MODEL_NAME") or "meta-llama/Llama-3.1-8B-Instruct"
 TASK_NAME = "code_review"
 BENCHMARK = "code_review_benchmark"
 MAX_STEPS = 3
 TEMPERATURE = 0.2
+MAX_TOKENS = 256
+NUM_EPISODES = 4
 _MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
 MAX_TOTAL_REWARD = NUM_EPISODES * MAX_STEPS * _MAX_REWARD_PER_STEP
 SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]
 - Mention every issue explicitly
 - Use precise technical language
 - Write detailed comments (>30 characters)
+- All string values in the JSON must use \\n for newlines, never literal line breaks
+- Return ONLY raw JSON — no markdown fences, no preamble
 Return ONLY JSON:
     text = text.strip().replace("```json", "").replace("```", "")
     try:
+        return json.loads(text, strict=False)
     except Exception as e:
         print(e)
         return fallback_action()
         reward = result.reward
         done = result.done
+        log_step(
+            step=step, action=response_text, reward=reward.score, done=done, error=None
+        )
         final_score = max(final_score, reward.score if reward else 0.0)
     return final_score
     async with CodeReviewEnv(base_url="http://localhost:8000") as env:
         for i in range(NUM_EPISODES):
             env.task_index = i
             score = await run_episode(client, env)
     total_score = sum(scores) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
     final_score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
     success = final_score >= SUCCESS_SCORE_THRESHOLD
+    log_end(
+        success=success,
+        steps=NUM_EPISODES * MAX_STEPS,
+        score=final_score,
+        rewards=scores,
+    )
 if __name__ == "__main__":
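
The switch to `json.loads(text, strict=False)` in `parse_action` matters when a model emits a literal line break inside a JSON string, for example in `suggested_code`. The sketch below shows the difference in isolation; `parse_lenient` and the sample payload are hypothetical and only illustrate the standard-library behaviour.

```python
import json

# The "\n" here is a real newline character inside the JSON string value,
# which strict JSON parsing rejects as an invalid control character.
raw = '{"comment": "line one\nline two"}'


def parse_lenient(text: str):
    """Return (strict_ok, parsed), where parsed comes from the lenient parser."""
    try:
        json.loads(text)
        strict_ok = True
    except json.JSONDecodeError:
        strict_ok = False
    # strict=False allows control characters such as newlines inside strings.
    return strict_ok, json.loads(text, strict=False)


strict_ok, parsed = parse_lenient(raw)
```

The added prompt rule asking for `\\n` escapes and the lenient parser address the same failure from both sides: the first reduces how often it occurs, the second tolerates it when it does.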

models.py CHANGED

@@ -11,8 +11,9 @@ The code_review environment is a simple test environment that echoes back messag
 """
 from openenv.core.env_server.types import Action, Observation
-from pydantic import Field, BaseModel
-from typing import Optional, List ,  Any , Dict
 class CodeReviewAction(Action):
     """Action for the Code Review environment - just a message to echo."""
@@ -23,6 +24,7 @@ class CodeReviewAction(Action):
     suggested_code: Optional[str] = None
     decision: Optional[str] = None
 class CodeDiff(BaseModel):
     file_name: str
     diff: str
@@ -35,10 +37,11 @@ class CodeReviewPullRequest(BaseModel):
     diffs: List[CodeDiff]
     language: str
 class CodeReviewObservation(Observation):
     """Observation from the Code Review environment - the echoed message."""
-    #echoed_message: str = Field(default="", description="The echoed message")
     pr: CodeReviewPullRequest
     previous_comments: List[str]
     step_count: int
@@ -49,6 +52,7 @@ class CodeReviewReward(BaseModel):
     score: float
     feedback: str
 class CodeReviewStepResponse(BaseModel):
     observation: CodeReviewObservation
     reward: CodeReviewReward

 """
 from openenv.core.env_server.types import Action, Observation
+from pydantic import Field, BaseModel
+from typing import Optional, List, Any, Dict
 class CodeReviewAction(Action):
     """Action for the Code Review environment - just a message to echo."""
     suggested_code: Optional[str] = None
     decision: Optional[str] = None
 class CodeDiff(BaseModel):
     file_name: str
     diff: str
     diffs: List[CodeDiff]
     language: str
 class CodeReviewObservation(Observation):
     """Observation from the Code Review environment - the echoed message."""
+    # echoed_message: str = Field(default="", description="The echoed message")
     pr: CodeReviewPullRequest
     previous_comments: List[str]
     step_count: int
     score: float
     feedback: str
 class CodeReviewStepResponse(BaseModel):
     observation: CodeReviewObservation
     reward: CodeReviewReward

server/code_review_environment.py CHANGED

@@ -22,7 +22,7 @@ try:
         CodeReviewObservation,
         CodeReviewReward,
         CodeReviewPullRequest,
-        CodeReviewStepResponse
     )
 except ImportError:
     from models import (
@@ -34,9 +34,33 @@ except ImportError:
 import json
 from pathlib import Path
 dataset_path = Path(__file__).parent.parent / "dataset" / "dataset.json"
 class CodeReviewEnvironment(Environment):
     """
     A simple echo environment that echoes back messages.
@@ -96,7 +120,7 @@ class CodeReviewEnvironment(Environment):
         self.fix_attempted = False
         return CodeReviewObservation(
-            #echoed_message="Code Review environment ready!",
             pr=self.pr,
             previous_comments=self.history,
             step_count=self.step_count,
@@ -150,7 +174,7 @@ class CodeReviewEnvironment(Environment):
             self.fix_attempted = True
         score = self.grade_action(action, self.gt)
-        print(f"Step {self.step_count} - Score: {score:.4f}")
         bonus = 0.0
@@ -184,18 +208,15 @@ class CodeReviewEnvironment(Environment):
         # print(type(CodeReviewObservation))
         # print(type(CodeReviewReward))
-        obs =  CodeReviewObservation(
-                pr=self.pr,
-                previous_comments=[a.comment for a in self.history if a.comment],
-                step_count=self.step_count,
-                max_steps=self.max_steps,
-            )
         # print("Obs == " , obs)
-        rew =  CodeReviewReward(
-                score=score,
-                feedback="graded"
-            )
         # print("FINAL REWARD TYPE:", type(rew))
         # print("FINAL REWARD:", rew)
@@ -223,21 +244,20 @@ class CodeReviewEnvironment(Environment):
         return self._state
     def _invalid_step(self):
-        rew =  CodeReviewReward(score=0.0, feedback="invalid action")
-        obs =  CodeReviewObservation(
-                echoed_message="Invalid action format. Please send a valid CodeReviewAction.",
-                pr=self.pr,
-                previous_comments=[a.comment for a in self.history if a.comment],
-                step_count=self.step_count,
-                max_steps=self.max_steps,
-            )
         return CodeReviewStepResponse(
             observation=obs,
             reward=rew,
             done=True,
             info={"error": "invalid_action"},
         )
     def grade_action(self, action, ground_truth):
         score = 0.0
@@ -295,25 +315,40 @@ class CodeReviewEnvironment(Environment):
     # ==============================
     # FIX MATCH (FUZZY)
     # ==============================
-    def score_fix(self, suggested_code, ground_truth):
         if not suggested_code:
             return 0.0
         expected_fix = self.normalize(ground_truth.get("fix", ""))
         suggested_code = self.normalize(suggested_code)
-        # direct match
         if expected_fix in suggested_code:
             return 1.0
-        # partial keyword match
-        keywords = expected_fix.split()
-        if not keywords:
             return 0.0
-        matches = sum(1 for word in keywords if word in suggested_code)
-        return matches / len(keywords)
     # ==============================
     # DECISION MATCH
@@ -335,4 +370,3 @@ class CodeReviewEnvironment(Environment):
         # Wrong decision → partial penalty (not negative)
         return 0.2

         CodeReviewObservation,
         CodeReviewReward,
         CodeReviewPullRequest,
+        CodeReviewStepResponse,
     )
 except ImportError:
     from models import (
 import json
 from pathlib import Path
+import re
+from difflib import SequenceMatcher
 dataset_path = Path(__file__).parent.parent / "dataset" / "dataset.json"
+STOP_WORDS = {
+    "use",
+    "the",
+    "a",
+    "an",
+    "to",
+    "and",
+    "or",
+    "of",
+    "in",
+    "for",
+    "with",
+    "is",
+    "it",
+    "on",
+    "at",
+    "by",
+    "from",
+    "that",
+}
 class CodeReviewEnvironment(Environment):
     """
     A simple echo environment that echoes back messages.
         self.fix_attempted = False
         return CodeReviewObservation(
+            # echoed_message="Code Review environment ready!",
             pr=self.pr,
             previous_comments=self.history,
             step_count=self.step_count,
             self.fix_attempted = True
         score = self.grade_action(action, self.gt)
+        # print(f"Step {self.step_count} - Score: {score:.4f}")
         bonus = 0.0
         # print(type(CodeReviewObservation))
         # print(type(CodeReviewReward))
+        obs = CodeReviewObservation(
+            pr=self.pr,
+            previous_comments=[a.comment for a in self.history if a.comment],
+            step_count=self.step_count,
+            max_steps=self.max_steps,
+        )
         # print("Obs == " , obs)
+        rew = CodeReviewReward(score=score, feedback="graded")
         # print("FINAL REWARD TYPE:", type(rew))
         # print("FINAL REWARD:", rew)
         return self._state
     def _invalid_step(self):
+        rew = CodeReviewReward(score=0.0, feedback="invalid action")
+        obs = CodeReviewObservation(
+            echoed_message="Invalid action format. Please send a valid CodeReviewAction.",
+            pr=self.pr,
+            previous_comments=[a.comment for a in self.history if a.comment],
+            step_count=self.step_count,
+            max_steps=self.max_steps,
+        )
         return CodeReviewStepResponse(
             observation=obs,
             reward=rew,
             done=True,
             info={"error": "invalid_action"},
         )
     def grade_action(self, action, ground_truth):
         score = 0.0
     # ==============================
     # FIX MATCH (FUZZY)
     # ==============================
+    def score_fix(self, suggested_code: str, ground_truth: dict) -> float:
         if not suggested_code:
             return 0.0
         expected_fix = self.normalize(ground_truth.get("fix", ""))
         suggested_code = self.normalize(suggested_code)
+        if not expected_fix:
+            return 0.0
+        # 1. Exact / substring match — full score
         if expected_fix in suggested_code:
             return 1.0
+        # 2. Token overlap ignoring stop words
+        def code_tokens(text: str) -> list[str]:
+            tokens = re.findall(r"[a-zA-Z_]\w*|\d+|[=<>!+\-*/]+", text)
+            return [t for t in tokens if t.lower() not in STOP_WORDS]
+        expected_tokens = code_tokens(expected_fix)
+        suggested_tokens = set(code_tokens(suggested_code))
+        if not expected_tokens:
             return 0.0
+        token_score = sum(1 for t in expected_tokens if t in suggested_tokens) / len(
+            expected_tokens
+        )
+        # 3. Sequence similarity as a secondary signal
+        seq_score = SequenceMatcher(None, expected_fix, suggested_code).ratio()
+        # Weighted: token overlap matters more than character similarity
+        return round(0.7 * token_score + 0.3 * seq_score, 4)
     # ==============================
     # DECISION MATCH
         # Wrong decision → partial penalty (not negative)
         return 0.2
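
To see how the new `score_fix` behaves on concrete inputs, the standalone sketch below reproduces its three stages (substring match, stop-word-filtered token overlap, sequence similarity) outside the class. The `normalize` helper is not shown in this diff, so the lowercase-and-collapse-whitespace version here is an assumption, and the function takes the expected fix directly instead of the `ground_truth` dict.

```python
import re
from difflib import SequenceMatcher

STOP_WORDS = {
    "use", "the", "a", "an", "to", "and", "or", "of", "in",
    "for", "with", "is", "it", "on", "at", "by", "from", "that",
}


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def code_tokens(text: str) -> list:
    """Identifiers, integers, and operator runs, minus stop words."""
    tokens = re.findall(r"[a-zA-Z_]\w*|\d+|[=<>!+\-*/]+", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def score_fix(suggested_code: str, expected_fix: str) -> float:
    suggested = normalize(suggested_code)
    expected = normalize(expected_fix)
    if not suggested or not expected:
        return 0.0
    # 1. Substring match earns the full score.
    if expected in suggested:
        return 1.0
    # 2. Fraction of expected tokens present in the suggestion.
    expected_tokens = code_tokens(expected)
    if not expected_tokens:
        return 0.0
    suggested_tokens = set(code_tokens(suggested))
    token_score = sum(t in suggested_tokens for t in expected_tokens) / len(expected_tokens)
    # 3. Character-level similarity as a secondary signal.
    seq_score = SequenceMatcher(None, expected, suggested).ratio()
    return round(0.7 * token_score + 0.3 * seq_score, 4)


full = score_fix("def find(arr, target):\n    return target in arr", "return target in arr")
partial = score_fix(
    "if b == 0: return None",
    "def divide(a, b):\n    if b == 0:\n        return None\n    return a / b",
)
```

One consequence of the stop-word list is that Python keywords such as `in` and `from`, and the single-letter name `a`, are dropped before the overlap is computed, so they neither help nor hurt the token score.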