---
title: CodeLens Environment
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
---

<p align="center">
  <img src="assets/codelens-brand-v2.svg" width="400" alt="CodeLens." />
</p>

# CodeLens Environment

![CI](https://github.com/ArshVermaGit/open-ev-code-handler/actions/workflows/ci.yml/badge.svg)
![Python](https://img.shields.io/badge/python-3.10%2B-blue)
![License](https://img.shields.io/badge/license-MIT-green)
![Docker](https://img.shields.io/badge/docker-ghcr.io-blue)

> **AI evaluation environment for benchmarking code review agents on 30 synthetic pull requests.**

CodeLens is a high-fidelity evaluation environment where AI agents act as senior code reviewers. They analyze pull request diffs to identify bugs, security vulnerabilities, and architectural issues before providing a final verdict.

Designed for researchers and developers building the next generation of AI code assistants, CodeLens provides 30 realistic Python scenarios with ground-truth labels and deterministic, reproducible scoring.

---

## 💡 Motivation

Progress in AI coding assistants has largely focused on **generation** (writing code), but **evaluation** (reviewing code) is equally critical for software reliability. Manual code review is a high-cognitive-load, real-world task that requires:
- **Precision**: Identifying exactly where a bug exists.
- **Context**: Understanding how a local change affects the whole system.
- **Security-First Mindset**: Spotting non-obvious vulnerabilities like SQL injection or race conditions.

CodeLens transforms these human-centric skills into a **measurable benchmark**, allowing researchers to evaluate agents on their ability to act as high-fidelity gatekeepers of code quality.

---

---

##  Quick Start

Get up and running locally in under 2 minutes:

```bash
git clone https://github.com/ArshVermaGit/open-ev-code-handler.git
cd open-ev-code-handler
cp .env.example .env
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python scripts/migrate.py init
PYTHONPATH=. python app.py
```

- **Dashboard**: [http://localhost:7860/dashboard](http://localhost:7860/dashboard)
- **API Docs**: [http://localhost:7860/docs](http://localhost:7860/docs)

---

##  Evaluation Tasks

CodeLens benchmarks agents across three critical engineering domains:

| Task                   | Difficulty | Scenarios | Max Steps | Focus Area                                                                 |
| ---------------------- | ---------- | --------- | --------- | -------------------------------------------------------------------------- |
| `bug_detection`        | **Easy**   | 10        | 10        | Off-by-one errors, null dereferences, race conditions, exception handling  |
| `security_audit`       | **Medium** | 10        | 15        | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
| `architectural_review` | **Hard**   | 10        | 20        | N+1 queries, god classes, blocking async calls, circular imports           |

---

## 🎯 Observation Space

Each `step()` and `reset()` call returns a typed `Observation` object:

| Field            | Type              | Description                                    |
| ---------------- | ----------------- | ---------------------------------------------- |
| `task_id`        | `TaskId` (enum)   | One of `bug_detection`, `security_audit`, `architectural_review` |
| `scenario_hash`  | `str`             | Deterministic identifier for the scenario      |
| `pr_title`       | `str`             | Title of the synthetic pull request            |
| `pr_description` | `str`             | Description/context for the PR                 |
| `diff`           | `str`             | Full unified diff (all files concatenated)     |
| `files_changed`  | `List[FileChanged]` | Structured file patches with metadata        |
| `step_count`     | `int`             | Current step number (0-indexed)                |
| `max_steps`      | `int`             | Maximum steps allowed for this task            |
| `noise_budget`   | `int`             | Remaining false-positive credits (starts at 5) |
| `issues_flagged`  | `int`            | Number of correctly matched issues so far      |
| `done`           | `bool`            | Whether the episode has terminated             |

## 🎮 Action Space

Agents submit typed `Action` objects with the following fields:

| Field           | Type               | Required For        | Description                                  |
| --------------- | ------------------ | ------------------- | -------------------------------------------- |
| `action_type`   | `ActionType` (enum)| All actions         | `flag_issue`, `approve`, `request_changes`, `comment`, `ask_question` |
| `body`          | `str`              | All actions         | Description or explanation text              |
| `filename`      | `str`              | `flag_issue`        | File containing the issue                    |
| `line_number`   | `int`              | `flag_issue`        | Approximate line number of the issue         |
| `category`      | `Category` (enum)  | `flag_issue`        | `bug`, `security`, `architecture`, `style`, `performance` |
| `severity`      | `Severity` (enum)  | `flag_issue`        | `critical`, `high`, `medium`, `low`, `info`  |
| `verdict`       | `Verdict` (enum)   | `approve` / `request_changes` | `lgtm`, `request_changes`, `needs_discussion` |

### Reward Signal

Each `step()` returns a typed `Reward` object:

| Field          | Type    | Description                                      |
| -------------- | ------- | ------------------------------------------------ |
| `value`        | `float` | Normalised score (0.0–1.0)                       |
| `reason`       | `str`   | Human-readable explanation of the reward          |
| `is_terminal`  | `bool`  | `True` on the final step of an episode            |

**Reward shaping:** Correct issue flags yield positive rewards scaled by severity (critical=1.0, high=0.8, medium=0.5, low=0.2). False positives and duplicates incur −0.05 penalties and consume noise budget. Episodes terminate when noise budget reaches zero, max steps are exceeded, or a terminal action (approve/request_changes) is submitted.

### 🧠 Environment Design Highlights

- **Predictable State Management**: The `reset()` and `step()` functions are strictly idempotent based on task/seed pairs, ensuring 100% reproducible episodes.
- **Dense Reward Signal**: Unlike "win/loss" environments, CodeLens provides continuous feedback. Every action—from the first issue flagged to the final verdict—produces a typed `Reward` object with human-readable rationale, accelerating agent learning (process supervision).
- **Novelty: The Reviewer Trust Mechanic**: The **Noise Budget** (5 credits) simulates real-world developer trust. If an agent "hallucinates" too many non-existent bugs, it loses the budget and the episode is terminated, penalizing high-volume, low-precision behavior.

---

---

##  Scoring System

### Bug Detection

Score = `0.4 × coverage + 0.6 × avg_issue_score − 0.1 × false_positive_rate`
Issues are scored on **keyword accuracy** (50%) and **severity matching** (50%).

### Security Audit

Score = `avg(per_issue_score)` where each issue = `0.7 × severity_accuracy + 0.3 × keyword_coverage`.
Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as **LOW** incurs a major penalty.

### Architectural Review

Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
Detail quality rewards technical explanations that provide actionable developer feedback.

###  Noise Budget

Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.

---

## 📊 Baseline Scores

Reproducible keyword-based baseline results across all 30 scenarios (10 seeds per task):

| Task                   | Mean Score | Best Score | Worst Score | Success Rate (>0.5) |
| ---------------------- | ---------- | ---------- | ----------- | ------------------- |
| `bug_detection`        | 0.3577     | 0.9167     | 0.0000      | 40%                 |
| `security_audit`       | 0.1850     | 1.0000     | 0.0000      | 20%                 |
| `architectural_review` | 0.2930     | 0.6640     | 0.0000      | 40%                 |
| **Overall**            | **0.2786** | —          | —           | **33%**             |

> **Agent:** `KeywordAgent` (heuristic, 35+ rules) — see `scripts/baseline.py`
> **Reproduce:** `python scripts/evaluate.py --agent keyword --output results.json`

These scores represent a deterministic lower bound. LLM-powered agents (e.g., GPT-4o, Claude) are expected to significantly outperform this baseline.

---

##  API Reference

| Method | Endpoint                | Auth     | Description                                   |
| :----- | :---------------------- | :------- | :-------------------------------------------- |
| `POST` | `/reset`                | Optional | Start a new evaluation episode                |
| `POST` | `/step/{id}`            | Optional | Submit a review action (flag_issue, approve)  |
| `GET`  | `/result/{id}`          | Optional | Retrieve final scores and logs for an episode |
| `GET`  | `/leaderboard`          | None     | Paginated performance rankings                |
| `POST` | `/submit`               | Optional | Persist an episode result to the leaderboard  |
| `GET`  | `/stats`                | None     | Aggregate statistics across all agents        |
| `GET`  | `/episodes/{id}/replay` | Optional | Full event-by-event history replay            |
| `GET`  | `/dashboard`            | None     | Interactive Real-time Dashboard               |
| `GET`  | `/health`               | None     | System status and health check                |

Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for production parity.

---

##  Running with Docker

### Production Mode

```bash
docker compose up -d
# View logs: docker compose logs -f
```

### Direct Pull

```bash
docker run -p 7860:7860 ghcr.io/ArshVermaGit/open-ev-code-handler:latest
```

### Automated Testing

```bash
docker compose -f docker-compose.test.yml up
```

---

##  Baseline Agent & Evaluation

### Single Scenario Trial

```bash
python scripts/baseline.py --task bug_detection --seed 3 --verbose
```

### Full Benchmark (All 30 Scenarios)

```bash
# Keyword-based baseline
python scripts/evaluate.py --agent keyword --output results.json

# LLM-powered reviewer (e.g. Claude)
python scripts/evaluate.py --agent llm --api-key $ANTHROPIC_API_KEY
```

---

##  Writing Your Own Agent

CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:

```python
import requests

API = "http://localhost:7860"

# Start new episode
resp = requests.post(f"{API}/reset", json={"task_id": "bug_detection", "seed": 0})
episode_id = resp.json()["episode_id"]

done = False
while not done:
    # Your agent logic analyzes the diff
    action = {
        "action_type": "flag_issue",
        "body": "Identified a vulnerability line 14",
        "filename": "api/search.py",
        "line_number": 14,
        "severity": "critical",
        "category": "security"
    }

    result = requests.post(f"{API}/step/{episode_id}", json=action).json()
    done = result["done"]

# Get final results
final = requests.get(f"{API}/result/{episode_id}").json()
print(f"Final Score: {final['final_score']}")
```

---

##  Project Structure

```text
open-ev-code-handler/
├── app.py                      # FastAPI application (9 endpoints)
├── codelens_env/               # Core evaluation logic
│   ├── database.py             # SQLModel persistence layer
│   ├── env.py                  # Episode state machine
│   ├── models.py               # Pydantic v2 data models
│   ├── scenarios.py            # 30 Synthetic PR scenarios
│   └── graders/                # Grader implementations (Bug, Sec, Arch)
├── scripts/                    # CLI tools (baseline, evaluate, migrate)
├── static/                     # Compiled dashboard assets
├── tests/                      # 155+ Parametrized tests
├── Dockerfile                  # Multi-stage, non-root build
├── docker-compose.yml          # Production orchestration
└── openenv.yaml               # CodeLens v2 specification
```

---

##  Development

```bash
# Setup
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Automated Tests
PYTHONPATH=. pytest tests/ -v --cov=codelens_env

# Linter Check
pylint codelens_env/ app.py

# Scenario Sanity Check
PYTHONPATH=. python scripts/validate.py
```

##  Authors & Maintainers

CodeLens is authored and maintained by:

- **Arsh Verma** — [GitHub](https://github.com/ArshVermaGit)
- **Divyansh Rawat** — [GitHub](https://github.com/DsThakurRawat)

---

##  Contributing & License

Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.

This project is licensed under the **[MIT License](LICENSE)**.