Spaces: ArshVerma/CodeLens

ArshVerma committed on Apr 6

Commit f8670cd, 1 parent: 598a3b1

Initial CodeLens OpenEnv submission

Files changed (9)

.dockerignore +2 -2
.env.example +5 -0
Dockerfile +27 -10
README.md +101 -5
app.py +10 -0
codelens_env/env.py +22 -1
codelens_env/models.py +10 -0
inference.py +9 -8
openenv.yaml +3 -0

.dockerignore CHANGED

@@ -10,8 +10,8 @@ build/
 *.egg
 MANIFEST
-# Node.js / Dashboard (Exclude sources, only keep builds)
-node_modules/
 dashboard/node_modules/
 dashboard/src/
 dashboard/public/

 *.egg
 MANIFEST
+# Dashboard Build (Must be built inside Docker to avoid local skew)
+static/dashboard/
 dashboard/node_modules/
 dashboard/src/
 dashboard/public/

.env.example CHANGED

@@ -23,3 +23,8 @@ LEADERBOARD_LIMIT=10         # Default entries per task page
 # Logging
 LOG_LEVEL=INFO               # DEBUG | INFO | WARNING | ERROR

 # Logging
 LOG_LEVEL=INFO               # DEBUG | INFO | WARNING | ERROR
+# Inference (OpenEnv spec)
+OPENAI_API_KEY=              # Required for inference.py (OpenAI-compatible API key)
+API_BASE_URL=https://api.openai.com/v1
+MODEL_NAME=gpt-3.5-turbo
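
Review note: `inference.py` in this commit resolves its API key as `os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY", "dummy")`. The sketch below mirrors that lookup order against a plain dict so it can be run without touching the real process environment. `resolve_api_key` is a hypothetical helper name used only for illustration; it is not part of the repository.

```python
from typing import Mapping


def resolve_api_key(env: Mapping[str, str]) -> str:
    """Mirror the lookup order used in inference.py: HF_TOKEN, then OPENAI_API_KEY."""
    # `or` skips HF_TOKEN when it is unset or set to an empty string.
    return env.get("HF_TOKEN") or env.get("OPENAI_API_KEY", "dummy")


# HF_TOKEN wins when it is non-empty.
print(resolve_api_key({"HF_TOKEN": "hf_abc", "OPENAI_API_KEY": "sk-xyz"}))  # hf_abc
# An empty HF_TOKEN falls through to OPENAI_API_KEY.
print(resolve_api_key({"HF_TOKEN": "", "OPENAI_API_KEY": "sk-xyz"}))  # sk-xyz
# The "dummy" default applies only when OPENAI_API_KEY is absent entirely.
print(resolve_api_key({}))  # dummy
# A key that is present but blank yields an empty string, not "dummy".
print(repr(resolve_api_key({"OPENAI_API_KEY": ""})))  # ''
```

The last case matters if the blank `OPENAI_API_KEY=` line from `.env.example` is loaded unchanged: the fallback string is bypassed and the client would receive an empty key.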

Dockerfile CHANGED

@@ -1,20 +1,33 @@
-# ── Stage 1: Builder ──────────────────────────────────────────
-FROM python:3.11-slim AS builder
-WORKDIR /build
 # Install build dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
-    curl \
     && rm -rf /var/lib/apt/lists/*
-# Install Python dependencies into /build/venv
 COPY requirements.txt .
-RUN python -m venv /build/venv \
-    && /build/venv/bin/pip install --upgrade pip \
-    && /build/venv/bin/pip install --no-cache-dir -r requirements.txt
-# ── Stage 2: Production ───────────────────────────────────────
 FROM python:3.11-slim AS production
 # Security: run as non-root user
@@ -28,7 +41,11 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     && rm -rf /var/lib/apt/lists/*
 # Copy virtualenv from builder
-COPY --from=builder /build/venv /app/venv
 # Copy application code
 COPY --chown=appuser:appuser . .

+# ── Stage 1: Frontend Builder ─────────────────────────────────
+FROM node:20-slim AS frontend-builder
+WORKDIR /src/dashboard
+# Install dependencies
+COPY dashboard/package*.json ./
+RUN npm install
+# Copy source and build (vite.config.ts outputs to ../static/dashboard)
+COPY dashboard/ .
+RUN npm run build
+# ── Stage 2: Python Builder ───────────────────────────────────
+FROM python:3.11-slim AS python-builder
+WORKDIR /build-python
 # Install build dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
     && rm -rf /var/lib/apt/lists/*
+# Install Python dependencies into /build-python/venv
 COPY requirements.txt .
+RUN python -m venv /build-python/venv \
+    && /build-python/venv/bin/pip install --upgrade pip \
+    && /build-python/venv/bin/pip install --no-cache-dir -r requirements.txt
+# ── Stage 3: Production ───────────────────────────────────────
 FROM python:3.11-slim AS production
 # Security: run as non-root user
     && rm -rf /var/lib/apt/lists/*
 # Copy virtualenv from builder
+COPY --from=python-builder /build-python/venv /app/venv
+# Copy dashboard build from frontend-builder
+# (Vite config builds to ../static/dashboard relative to /src/dashboard)
+COPY --chown=appuser:appuser --from=frontend-builder /src/static/dashboard /app/static/dashboard
 # Copy application code
 COPY --chown=appuser:appuser . .

README.md CHANGED

@@ -1,3 +1,14 @@
 <p align="center">
   <img src="assets/codelens-brand-v2.svg" width="400" alt="CodeLens." />
 </p>
@@ -17,6 +28,19 @@ Designed for researchers and developers building the next generation of AI code
 ---
 ##  Quick Start
 Get up and running locally in under 2 minutes:
@@ -40,11 +64,65 @@ PYTHONPATH=. python app.py
 CodeLens benchmarks agents across three critical engineering domains:
-| Task                   | Scenarios | Max Steps | Focus Area                                                                 |
-| ---------------------- | --------- | --------- | -------------------------------------------------------------------------- |
-| `bug_detection`        | 10        | 10        | Off-by-one errors, null dereferences, race conditions, exception handling  |
-| `security_audit`       | 10        | 15        | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
-| `architectural_review` | 10        | 20        | N+1 queries, god classes, blocking async calls, circular imports           |
 ---
@@ -71,6 +149,24 @@ Every episode permits **5 false positive credits**. Flagging non-existent code p
 ---
 ##  API Reference
 | Method | Endpoint                | Auth     | Description                                   |

+---
+title: CodeLens Environment
+emoji: 🔍
+colorFrom: blue
+colorTo: green
+sdk: docker
+app_port: 7860
+tags:
+  - openenv
+---
 <p align="center">
   <img src="assets/codelens-brand-v2.svg" width="400" alt="CodeLens." />
 </p>
 ---
+## 💡 Motivation
+Progress in AI coding assistants has largely focused on **generation** (writing code), but **evaluation** (reviewing code) is equally critical for software reliability. Manual code review is a high-cognitive-load, real-world task that requires:
+- **Precision**: Identifying exactly where a bug exists.
+- **Context**: Understanding how a local change affects the whole system.
+- **Security-First Mindset**: Spotting non-obvious vulnerabilities like SQL injection or race conditions.
+CodeLens transforms these human-centric skills into a **measurable benchmark**, allowing researchers to evaluate agents on their ability to act as high-fidelity gatekeepers of code quality.
+---
+---
 ##  Quick Start
 Get up and running locally in under 2 minutes:
 CodeLens benchmarks agents across three critical engineering domains:
+| Task                   | Difficulty | Scenarios | Max Steps | Focus Area                                                                 |
+| ---------------------- | ---------- | --------- | --------- | -------------------------------------------------------------------------- |
+| `bug_detection`        | **Easy**   | 10        | 10        | Off-by-one errors, null dereferences, race conditions, exception handling  |
+| `security_audit`       | **Medium** | 10        | 15        | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
+| `architectural_review` | **Hard**   | 10        | 20        | N+1 queries, god classes, blocking async calls, circular imports           |
+---
+## 🎯 Observation Space
+Each `step()` and `reset()` call returns a typed `Observation` object:
+| Field            | Type              | Description                                    |
+| ---------------- | ----------------- | ---------------------------------------------- |
+| `task_id`        | `TaskId` (enum)   | One of `bug_detection`, `security_audit`, `architectural_review` |
+| `scenario_hash`  | `str`             | Deterministic identifier for the scenario      |
+| `pr_title`       | `str`             | Title of the synthetic pull request            |
+| `pr_description` | `str`             | Description/context for the PR                 |
+| `diff`           | `str`             | Full unified diff (all files concatenated)     |
+| `files_changed`  | `List[FileChanged]` | Structured file patches with metadata        |
+| `step_count`     | `int`             | Current step number (0-indexed)                |
+| `max_steps`      | `int`             | Maximum steps allowed for this task            |
+| `noise_budget`   | `int`             | Remaining false-positive credits (starts at 5) |
+| `issues_flagged`  | `int`            | Number of correctly matched issues so far      |
+| `done`           | `bool`            | Whether the episode has terminated             |
+## 🎮 Action Space
+Agents submit typed `Action` objects with the following fields:
+| Field           | Type               | Required For        | Description                                  |
+| --------------- | ------------------ | ------------------- | -------------------------------------------- |
+| `action_type`   | `ActionType` (enum)| All actions         | `flag_issue`, `approve`, `request_changes`, `comment`, `ask_question` |
+| `body`          | `str`              | All actions         | Description or explanation text              |
+| `filename`      | `str`              | `flag_issue`        | File containing the issue                    |
+| `line_number`   | `int`              | `flag_issue`        | Approximate line number of the issue         |
+| `category`      | `Category` (enum)  | `flag_issue`        | `bug`, `security`, `architecture`, `style`, `performance` |
+| `severity`      | `Severity` (enum)  | `flag_issue`        | `critical`, `high`, `medium`, `low`, `info`  |
+| `verdict`       | `Verdict` (enum)   | `approve` / `request_changes` | `lgtm`, `request_changes`, `needs_discussion` |
+### Reward Signal
+Each `step()` returns a typed `Reward` object:
+| Field          | Type    | Description                                      |
+| -------------- | ------- | ------------------------------------------------ |
+| `value`        | `float` | Normalised score (0.0–1.0)                       |
+| `reason`       | `str`   | Human-readable explanation of the reward          |
+| `is_terminal`  | `bool`  | `True` on the final step of an episode            |
+**Reward shaping:** Correct issue flags yield positive rewards scaled by severity (critical=1.0, high=0.8, medium=0.5, low=0.2). False positives and duplicates incur −0.05 penalties and consume noise budget. Episodes terminate when noise budget reaches zero, max steps are exceeded, or a terminal action (approve/request_changes) is submitted.
+### 🧠 Environment Design Highlights
+- **Predictable State Management**: The `reset()` and `step()` functions are strictly idempotent based on task/seed pairs, ensuring 100% reproducible episodes.
+- **Dense Reward Signal**: Unlike "win/loss" environments, CodeLens provides continuous feedback. Every action—from the first issue flagged to the final verdict—produces a typed `Reward` object with human-readable rationale, accelerating agent learning (process supervision).
+- **Novelty: The Reviewer Trust Mechanic**: The **Noise Budget** (5 credits) simulates real-world developer trust. If an agent "hallucinates" too many non-existent bugs, it loses the budget and the episode is terminated, penalizing high-volume, low-precision behavior.
+---
 ---
 ---
+## 📊 Baseline Scores
+Reproducible keyword-based baseline results across all 30 scenarios (10 seeds per task):
+| Task                   | Mean Score | Best Score | Worst Score | Success Rate (>0.5) |
+| ---------------------- | ---------- | ---------- | ----------- | ------------------- |
+| `bug_detection`        | 0.3577     | 0.9167     | 0.0000      | 40%                 |
+| `security_audit`       | 0.1850     | 1.0000     | 0.0000      | 20%                 |
+| `architectural_review` | 0.2930     | 0.6640     | 0.0000      | 40%                 |
+| **Overall**            | **0.2786** | —          | —           | **33%**             |
+> **Agent:** `KeywordAgent` (heuristic, 35+ rules) — see `scripts/baseline.py`
+> **Reproduce:** `python scripts/evaluate.py --agent keyword --output results.json`
+These scores represent a deterministic lower bound. LLM-powered agents (e.g., GPT-4o, Claude) are expected to significantly outperform this baseline.
+---
 ##  API Reference
 | Method | Endpoint                | Auth     | Description                                   |
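
Review note: the reward-shaping paragraph added to the README can be read as a small state machine. The sketch below is a toy model of that description, not the environment's actual code. `ToyEpisode` and its fields are hypothetical names, the README gives no weight for `info` severity (treated as 0.0 here), and the exact max-steps comparison is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Severity weights as listed in the README. No weight is given for `info`,
# so falling back to 0.0 for unknown severities is an assumption.
SEVERITY_REWARD = {"critical": 1.0, "high": 0.8, "medium": 0.5, "low": 0.2}
PENALTY = -0.05


@dataclass
class ToyEpisode:
    """Hypothetical stand-in for the environment's per-episode bookkeeping."""
    max_steps: int = 10
    noise_budget: int = 5
    step_count: int = 0
    done: bool = False
    matched: Set[str] = field(default_factory=set)

    def flag(self, issue_id: Optional[str], severity: str = "low") -> float:
        """Score one flag_issue action; issue_id is None for a false positive."""
        self.step_count += 1
        if issue_id is not None and issue_id not in self.matched:
            self.matched.add(issue_id)
            reward = SEVERITY_REWARD.get(severity, 0.0)
        else:
            # False positives and duplicates cost a penalty and one credit.
            self.noise_budget -= 1
            reward = PENALTY
        # Assumed termination rule: budget exhausted or step limit reached.
        if self.noise_budget <= 0 or self.step_count >= self.max_steps:
            self.done = True
        return reward


ep = ToyEpisode()
print(ep.flag("bug-1", "critical"))  # 1.0
print(ep.flag("bug-1", "critical"))  # -0.05 (duplicate)
print(ep.flag(None))                 # -0.05 (false positive)
print(ep.noise_budget, ep.done)      # 3 False
```

Under these assumptions, five unmatched or duplicate flags in a row end the episode regardless of how many steps remain, which is the "reviewer trust" behaviour the README describes.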

app.py CHANGED

@@ -193,6 +193,16 @@ async def http_exception_handler(request, exc):
 # ── Endpoints ─────────────────────────────────────────────────────────────────
 @app.get("/health")
 def health_check():
     return {

 # ── Endpoints ─────────────────────────────────────────────────────────────────
+@app.get("/", include_in_schema=False)
+def root():
+    """Absolute root ping handler for infrastructure readiness checks."""
+    return {
+        "status": "ready",
+        "message": "CodeLens API is operational.",
+        "docs": "/docs",
+        "health": "/health"
+    }
 @app.get("/health")
 def health_check():
     return {
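
Review note: the new `/` handler returns a JSON body whose `status` is `"ready"`. A readiness probe against it could look roughly like the sketch below. `wait_until_ready` is a hypothetical helper, and the fetch function is injected so the example runs without a live server.

```python
import json
import time
from typing import Callable


def wait_until_ready(fetch: Callable[[], str], attempts: int = 5, delay: float = 0.0) -> bool:
    """Poll `fetch` until it returns JSON whose "status" field is "ready"."""
    for _ in range(attempts):
        try:
            if json.loads(fetch()).get("status") == "ready":
                return True
        except (OSError, ValueError, AttributeError):
            # Connection errors, non-JSON bodies, or non-object JSON
            # all count as "not ready yet".
            pass
        time.sleep(delay)
    return False


# Simulated responses: one bad body, then the payload shape from the diff.
responses = iter(["not json", '{"status": "ready", "docs": "/docs"}'])
print(wait_until_ready(lambda: next(responses)))  # True
```

Against a running container, `fetch` could wrap `urllib.request.urlopen` on the Space's URL; the default `http://localhost:7860` used by `inference.py` is one candidate.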

codelens_env/env.py CHANGED

@@ -2,7 +2,7 @@ from datetime import datetime, timezone
 from typing import List, Optional, Set
 from codelens_env.models import (
     TaskId, Action, Observation, StepResult, ResetResult,
-    ActionType, ActionRecord, EpisodeResult, Severity, GroundTruthIssue
 )
 from codelens_env.scenarios import get_scenario
 from codelens_env.graders.bug_grader import grade_bug_detection
@@ -61,6 +61,7 @@ class CodeLensEnv:
         self.step_count += 1
         reward = 0.0
         # Determine terminal state and reward
         if action.action_type in (ActionType.APPROVE, ActionType.REQUEST_CHANGES):
@@ -100,6 +101,21 @@ class CodeLensEnv:
             self.done = True
             self.terminated_reason = "max_steps"
         # Record action
         record = ActionRecord(
             action_type=action.action_type,
@@ -117,6 +133,11 @@ class CodeLensEnv:
         return StepResult(
             observation=self._build_observation(),
             reward=float(reward),
             done=self.done,
             info={"terminated_reason": self.terminated_reason}
         )

 from typing import List, Optional, Set
 from codelens_env.models import (
     TaskId, Action, Observation, StepResult, ResetResult,
+    ActionType, ActionRecord, EpisodeResult, Severity, GroundTruthIssue, Reward
 )
 from codelens_env.scenarios import get_scenario
 from codelens_env.graders.bug_grader import grade_bug_detection
         self.step_count += 1
         reward = 0.0
+        match = None  # Track matched ground truth issue (if any)
         # Determine terminal state and reward
         if action.action_type in (ActionType.APPROVE, ActionType.REQUEST_CHANGES):
             self.done = True
             self.terminated_reason = "max_steps"
+        # Build reward reason
+        if action.action_type in (ActionType.APPROVE, ActionType.REQUEST_CHANGES):
+            reward_reason = "Terminal action submitted"
+        elif action.action_type == ActionType.FLAG_ISSUE:
+            if match and match.id in self.matched_issue_ids and reward > 0:
+                reward_reason = f"Correctly identified issue: {match.description[:60]}"
+            elif match and reward < 0:
+                reward_reason = "Duplicate issue flagged"
+            elif not match:
+                reward_reason = "False positive: no matching ground truth issue"
+            else:
+                reward_reason = f"Matched issue {match.id}" if match else "No match"
+        else:
+            reward_reason = "Non-scoring action"
         # Record action
         record = ActionRecord(
             action_type=action.action_type,
         return StepResult(
             observation=self._build_observation(),
             reward=float(reward),
+            reward_info=Reward(
+                value=float(max(0.0, reward)),
+                reason=reward_reason,
+                is_terminal=self.done
+            ),
             done=self.done,
             info={"terminated_reason": self.terminated_reason}
         )
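
Review note: the added block derives `reward_reason` from the action type, the matched issue, and the sign of the reward, and `Reward.value` is built from `max(0.0, reward)` while `StepResult.reward` keeps the raw float. The sketch below reproduces that branch order as pure functions for easier reasoning. The function names are hypothetical, plain strings stand in for the `ActionType` enum, and the issue id is used where the diff uses `match.description[:60]`.

```python
from typing import Optional, Set


def reward_reason(action_type: str, match_id: Optional[str],
                  matched_ids: Set[str], reward: float) -> str:
    """Reproduce the branch order of the reason-building block in step()."""
    if action_type in ("approve", "request_changes"):
        return "Terminal action submitted"
    if action_type == "flag_issue":
        if match_id and match_id in matched_ids and reward > 0:
            return f"Correctly identified issue: {match_id}"
        if match_id and reward < 0:
            return "Duplicate issue flagged"
        if not match_id:
            return "False positive: no matching ground truth issue"
        # Reached only when a match exists and the reward is exactly zero,
        # or when the match is somehow absent from matched_ids.
        return f"Matched issue {match_id}"
    return "Non-scoring action"


def reward_value(reward: float) -> float:
    """Clamp at zero, as done when constructing Reward in the diff."""
    return float(max(0.0, reward))


print(reward_reason("flag_issue", None, set(), -0.05))
# False positive: no matching ground truth issue
print(reward_reason("flag_issue", "i1", {"i1"}, -0.05))
# Duplicate issue flagged
print(reward_value(-0.05))  # 0.0
```

One consequence worth noting: a penalised step reports `-0.05` in `StepResult.reward` but `0.0` in `reward_info.value`, so consumers reading only the typed `Reward` cannot distinguish a penalty from a neutral step except through `reason`.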

codelens_env/models.py CHANGED

@@ -113,6 +113,15 @@ class Observation(BaseModel):
     issues_flagged: int = 0
     done: bool = False
 class ResetResult(BaseModel):
     task_id: TaskId
     seed: int
@@ -122,6 +131,7 @@ class ResetResult(BaseModel):
 class StepResult(BaseModel):
     observation: Observation
     reward: float
     done: bool
     info: dict = {}

     issues_flagged: int = 0
     done: bool = False
+class Reward(BaseModel):
+    """
+    Typed reward signal returned at each step (OpenEnv spec).
+    All values are normalized in the 0.0 – 1.0 range.
+    """
+    value: float            # 0.0 – 1.0 normalised score
+    reason: str = ""        # human-readable explanation
+    is_terminal: bool = False  # True on the final step
 class ResetResult(BaseModel):
     task_id: TaskId
     seed: int
 class StepResult(BaseModel):
     observation: Observation
     reward: float
+    reward_info: Reward     # typed Reward model (OpenEnv spec)
     done: bool
     info: dict = {}
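
Review note: the `Reward` docstring states that values lie in the 0.0 to 1.0 range, but the model as added declares `value` as a plain `float` with no constraint. The sketch below shows one way such a bound could be enforced, using a stdlib dataclass as a stand-in for the pydantic model. It is an illustration under that substitution, not the repository's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reward:
    """Stdlib stand-in for the pydantic Reward model, with an explicit range check."""
    value: float
    reason: str = ""
    is_terminal: bool = False

    def __post_init__(self) -> None:
        # Reject anything outside the documented 0.0 to 1.0 range.
        if not 0.0 <= self.value <= 1.0:
            raise ValueError(f"value must be within [0.0, 1.0], got {self.value}")


print(Reward(value=0.8, reason="Correctly identified issue"))
try:
    Reward(value=-0.05)
except ValueError as exc:
    print("rejected:", exc)
```

Separately, `reward_info: Reward` is declared on `StepResult` without a default, which makes it a required field: any other call site that constructs `StepResult` needs to pass it.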

inference.py CHANGED

@@ -2,12 +2,12 @@
 CodeLens Inference Script — CodeLens Environment
 ==========================================================
 Required env vars:
-  API_BASE_URL  — OpenAI-compatible base URL  (e.g. https://api.openai.com/v1)
-  MODEL_NAME    — Model identifier             (e.g. gpt-4o, gpt-3.5-turbo)
-  HF_TOKEN      — Hugging Face token (used as api_key for OpenAI client)
-  ENV_URL       — CodeLens env URL           (default: http://localhost:7860)
-Output format (stdout, per CodeLens spec):
   [START] task=<task_id> env=<env_url> model=<model>
   [STEP] step=<n> action=<str> reward=<float> done=<bool> error=<str|None>
   [END] success=<bool> steps=<int> score=<float> rewards=<list>
@@ -20,10 +20,11 @@ import time
 import requests
 from openai import OpenAI
-# ── Environment Variables (exact names required by CodeLens spec) ──────────────
 API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
 MODEL_NAME   = os.environ.get("MODEL_NAME", "gpt-3.5-turbo")
-HF_TOKEN     = os.environ.get("HF_TOKEN", "dummy")
 ENV_URL      = os.environ.get("ENV_URL", "http://localhost:7860")
 # ── Config ────────────────────────────────────────────────────────────────────
@@ -51,7 +52,7 @@ def log_step(step: int, action: str, reward: float, done: bool, error):
 def log_end(success: bool, steps: int, score: float, rewards: list):
     success_str = "true" if success else "false"
-    rewards_str = ",".join([f"{r:.2f}" for r in rewards])
     print(
         f"[END] success={success_str} steps={steps} score={score:.2f} "
         f"rewards={rewards_str}",

 CodeLens Inference Script — CodeLens Environment
 ==========================================================
 Required env vars:
+  API_BASE_URL   — OpenAI-compatible base URL  (e.g. https://api.openai.com/v1)
+  MODEL_NAME     — Model identifier             (e.g. gpt-4o, gpt-3.5-turbo)
+  HF_TOKEN       — API key (Hugging Face / OpenAI compatible)
+  ENV_URL        — CodeLens env URL           (default: http://localhost:7860)
+Output format (stdout, per OpenEnv spec):
   [START] task=<task_id> env=<env_url> model=<model>
   [STEP] step=<n> action=<str> reward=<float> done=<bool> error=<str|None>
   [END] success=<bool> steps=<int> score=<float> rewards=<list>
 import requests
 from openai import OpenAI
+# ── Environment Variables (exact names required by hackathon) ──────────────────
 API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
 MODEL_NAME   = os.environ.get("MODEL_NAME", "gpt-3.5-turbo")
+# Dual support: HF_TOKEN (mandatory instructions) or OPENAI_API_KEY (functional reqs)
+HF_TOKEN     = os.environ.get("HF_TOKEN") or os.environ.get("OPENAI_API_KEY", "dummy")
 ENV_URL      = os.environ.get("ENV_URL", "http://localhost:7860")
 # ── Config ────────────────────────────────────────────────────────────────────
 def log_end(success: bool, steps: int, score: float, rewards: list):
     success_str = "true" if success else "false"
+    rewards_str = "[" + ",".join([f"{r:.2f}" for r in rewards]) + "]"
     print(
         f"[END] success={success_str} steps={steps} score={score:.2f} "
         f"rewards={rewards_str}",
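
Review note: the only behavioural change to `log_end` is that the rewards are now wrapped in square brackets. The sketch below rebuilds the `[END]` line as a returned string so the new format can be checked directly. `format_end_line` is a hypothetical name; the original function prints rather than returns.

```python
from typing import List


def format_end_line(success: bool, steps: int, score: float, rewards: List[float]) -> str:
    """Build the [END] line with the bracketed rewards list from this commit."""
    success_str = "true" if success else "false"
    rewards_str = "[" + ",".join(f"{r:.2f}" for r in rewards) + "]"
    return (
        f"[END] success={success_str} steps={steps} score={score:.2f} "
        f"rewards={rewards_str}"
    )


print(format_end_line(True, 3, 0.8, [0.8, -0.05, 0.0]))
# [END] success=true steps=3 score=0.80 rewards=[0.80,-0.05,0.00]
print(format_end_line(False, 0, 0.0, []))
# [END] success=false steps=0 score=0.00 rewards=[]
```

With the brackets in place, an empty episode produces `rewards=[]` instead of a bare `rewards=`, which is easier for a log parser to handle.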

openenv.yaml CHANGED

@@ -9,6 +9,9 @@ description: >
 entry_point: "app:app"
 dashboard: "/dashboard"
 api_docs: "/docs"
 tasks:
   - id: "bug_detection"

 entry_point: "app:app"
 dashboard: "/dashboard"
 api_docs: "/docs"
+license: "MIT"
+tags: ["code-review", "agentic-eval", "security-audit", "bug-detection"]
+contact: "Arsh Verma <arsh@example.com>"
 tasks:
   - id: "bug_detection"