---
title: CodeLens Environment
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
---

<p align="center">
  <img src="assets/codelens-brand-v2.svg" width="400" alt="CodeLens." />
</p>

# CodeLens Environment

![CI](https://github.com/ArshVermaGit/open-ev-code-handler/actions/workflows/ci.yml/badge.svg)
![Python](https://img.shields.io/badge/python-3.10%2B-blue)
![License](https://img.shields.io/badge/license-MIT-green)
![Docker](https://img.shields.io/badge/docker-ghcr.io-blue)

> **AI evaluation environment for benchmarking code review agents on 30 synthetic pull requests.**

CodeLens is a high-fidelity evaluation environment where AI agents act as senior code reviewers. They analyze pull request diffs to identify bugs, security vulnerabilities, and architectural issues before providing a final verdict.

Designed for researchers and developers building the next generation of AI code assistants, CodeLens provides 30 realistic Python scenarios with ground-truth labels and deterministic, reproducible scoring.

---

## 💡 Motivation

Progress in AI coding assistants has largely focused on **generation** (writing code), but **evaluation** (reviewing code) is equally critical for software reliability. Manual code review is a high-cognitive-load, real-world task that requires:
- **Precision**: Identifying exactly where a bug exists.
- **Context**: Understanding how a local change affects the whole system.
- **Security-First Mindset**: Spotting non-obvious vulnerabilities like SQL injection or race conditions.

CodeLens transforms these human-centric skills into a **measurable benchmark**, allowing researchers to evaluate agents on their ability to act as high-fidelity gatekeepers of code quality.

---

## Quick Start

Get up and running locally in under 2 minutes:

```bash
git clone https://github.com/ArshVermaGit/open-ev-code-handler.git
cd open-ev-code-handler
cp .env.example .env
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python scripts/migrate.py init
PYTHONPATH=. python app.py
```

- **Dashboard**: [http://localhost:7860/dashboard](http://localhost:7860/dashboard)
- **API Docs**: [http://localhost:7860/docs](http://localhost:7860/docs)

---

## Evaluation Tasks

CodeLens benchmarks agents across three critical engineering domains:

| Task                   | Difficulty | Scenarios | Max Steps | Focus Area                                                                 |
| ---------------------- | ---------- | --------- | --------- | -------------------------------------------------------------------------- |
| `bug_detection`        | **Easy**   | 10        | 10        | Off-by-one errors, null dereferences, race conditions, exception handling  |
| `security_audit`       | **Medium** | 10        | 15        | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
| `architectural_review` | **Hard**   | 10        | 20        | N+1 queries, god classes, blocking async calls, circular imports           |

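The table can be read as a small configuration map. The sketch below is illustrative only: the `TASKS` dict and `episode_plan` helper are hypothetical names, and it assumes seeds run from 0 to 9 for each task, which the project may number differently.

```python
from typing import Dict, Iterator, Tuple

# Values copied from the task table; the dict itself is a hypothetical structure.
TASKS: Dict[str, Dict[str, object]] = {
    "bug_detection": {"difficulty": "easy", "scenarios": 10, "max_steps": 10},
    "security_audit": {"difficulty": "medium", "scenarios": 10, "max_steps": 15},
    "architectural_review": {"difficulty": "hard", "scenarios": 10, "max_steps": 20},
}


def episode_plan() -> Iterator[Tuple[str, int]]:
    """Yield (task_id, seed) pairs, assuming one seed per scenario starting at 0."""
    for task_id, config in TASKS.items():
        for seed in range(int(config["scenarios"])):
            yield task_id, seed


pairs = list(episode_plan())
print(len(pairs))  # 30
```

Each pair could then be passed to `/reset` as the `task_id` and `seed` of one episode.
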
---

## 🎯 Observation Space

Each `step()` and `reset()` call returns a typed `Observation` object:

| Field            | Type              | Description                                    |
| ---------------- | ----------------- | ---------------------------------------------- |
| `task_id`        | `TaskId` (enum)   | One of `bug_detection`, `security_audit`, `architectural_review` |
| `scenario_hash`  | `str`             | Deterministic identifier for the scenario      |
| `pr_title`       | `str`             | Title of the synthetic pull request            |
| `pr_description` | `str`             | Description/context for the PR                 |
| `diff`           | `str`             | Full unified diff (all files concatenated)     |
| `files_changed`  | `List[FileChanged]` | Structured file patches with metadata        |
| `step_count`     | `int`             | Current step number (0-indexed)                |
| `max_steps`      | `int`             | Maximum steps allowed for this task            |
| `noise_budget`   | `int`             | Remaining false-positive credits (starts at 5) |
| `issues_flagged`  | `int`            | Number of correctly matched issues so far      |
| `done`           | `bool`            | Whether the episode has terminated             |

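As a rough illustration of the fields above, the observation could be modelled as follows. This is a sketch using standard-library dataclasses, not the project's actual Pydantic models in `codelens_env/models.py`; the `FileChanged` fields (`filename`, `patch`) and the `remaining_steps` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TaskId(str, Enum):
    BUG_DETECTION = "bug_detection"
    SECURITY_AUDIT = "security_audit"
    ARCHITECTURAL_REVIEW = "architectural_review"


@dataclass
class FileChanged:
    # Hypothetical fields; the real model may carry more metadata.
    filename: str
    patch: str


@dataclass
class Observation:
    task_id: TaskId
    scenario_hash: str
    pr_title: str
    pr_description: str
    diff: str
    files_changed: List[FileChanged] = field(default_factory=list)
    step_count: int = 0
    max_steps: int = 10
    noise_budget: int = 5
    issues_flagged: int = 0
    done: bool = False


def remaining_steps(obs: Observation) -> int:
    """Hypothetical helper: steps left before the step limit is reached."""
    return max(obs.max_steps - obs.step_count, 0)


obs = Observation(
    task_id=TaskId.BUG_DETECTION,
    scenario_hash="example-hash",
    pr_title="Fix pagination",
    pr_description="Adjusts page bounds",
    diff="--- a/app.py\n+++ b/app.py\n",
    step_count=3,
)
print(remaining_steps(obs))  # 7
```
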
## 🎮 Action Space

Agents submit typed `Action` objects with the following fields:

| Field           | Type               | Required For        | Description                                  |
| --------------- | ------------------ | ------------------- | -------------------------------------------- |
| `action_type`   | `ActionType` (enum)| All actions         | `flag_issue`, `approve`, `request_changes`, `comment`, `ask_question` |
| `body`          | `str`              | All actions         | Description or explanation text              |
| `filename`      | `str`              | `flag_issue`        | File containing the issue                    |
| `line_number`   | `int`              | `flag_issue`        | Approximate line number of the issue         |
| `category`      | `Category` (enum)  | `flag_issue`        | `bug`, `security`, `architecture`, `style`, `performance` |
| `severity`      | `Severity` (enum)  | `flag_issue`        | `critical`, `high`, `medium`, `low`, `info`  |
| `verdict`       | `Verdict` (enum)   | `approve` / `request_changes` | `lgtm`, `request_changes`, `needs_discussion` |

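The "Required For" column can be checked on the client before an action is sent. The helper below is a minimal sketch: `build_action` and `REQUIRED_FIELDS` are hypothetical names, the filename is made up, and the server may validate differently.

```python
from typing import Any, Dict

# Allowed values, copied from the action table.
ACTION_TYPES = {"flag_issue", "approve", "request_changes", "comment", "ask_question"}
CATEGORIES = {"bug", "security", "architecture", "style", "performance"}
SEVERITIES = {"critical", "high", "medium", "low", "info"}
VERDICTS = {"lgtm", "request_changes", "needs_discussion"}

# Extra fields each action type needs beyond `action_type` and `body`.
REQUIRED_FIELDS = {
    "flag_issue": ("filename", "line_number", "category", "severity"),
    "approve": ("verdict",),
    "request_changes": ("verdict",),
}


def build_action(action_type: str, body: str, **fields: Any) -> Dict[str, Any]:
    """Hypothetical helper: assemble an action dict and check required fields."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    action = {"action_type": action_type, "body": body, **fields}
    missing = [n for n in REQUIRED_FIELDS.get(action_type, ()) if n not in action]
    if missing:
        raise ValueError(f"{action_type} is missing fields: {missing}")
    allowed = {"category": CATEGORIES, "severity": SEVERITIES, "verdict": VERDICTS}
    for name, values in allowed.items():
        if name in action and action[name] not in values:
            raise ValueError(f"invalid {name}: {action[name]!r}")
    return action


flag = build_action(
    "flag_issue",
    "Loop bound skips the last element",
    filename="utils/paging.py",  # made-up path for illustration
    line_number=14,
    category="bug",
    severity="high",
)
print(flag["action_type"], flag["severity"])  # flag_issue high
```
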
### Reward Signal

Each `step()` returns a typed `Reward` object:

| Field          | Type    | Description                                      |
| -------------- | ------- | ------------------------------------------------ |
| `value`        | `float` | Normalised score (0.0–1.0)                        |
| `reason`       | `str`   | Human-readable explanation of the reward          |
| `is_terminal`  | `bool`  | `True` on the final step of an episode            |

**Reward shaping:** Correct issue flags yield positive rewards scaled by severity (critical=1.0, high=0.8, medium=0.5, low=0.2). False positives and duplicates incur −0.05 penalties and consume noise budget. Episodes terminate when noise budget reaches zero, max steps are exceeded, or a terminal action (approve/request_changes) is submitted.
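
The shaping rules above can be sketched in a few lines. This is an illustration, not the grader's code: `flag_reward` and `is_terminal` are hypothetical names, the `info` severity has no documented weight and is treated as 0.0 here, and the exact step-limit boundary (`>=`) is an assumption.

```python
# Severity weights and penalty as stated in the reward-shaping description.
SEVERITY_REWARD = {"critical": 1.0, "high": 0.8, "medium": 0.5, "low": 0.2}
FALSE_POSITIVE_PENALTY = -0.05
TERMINAL_ACTIONS = ("approve", "request_changes")


def flag_reward(matched: bool, severity: str) -> float:
    """Reward for one flag_issue action (false positives and duplicates are penalised)."""
    if not matched:
        return FALSE_POSITIVE_PENALTY
    # Assumption: severities without a documented weight earn 0.0.
    return SEVERITY_REWARD.get(severity, 0.0)


def is_terminal(noise_budget: int, step_count: int, max_steps: int, action_type: str) -> bool:
    """True when any of the three documented termination conditions holds."""
    return (
        noise_budget <= 0
        or step_count >= max_steps  # assumed boundary for "max steps exceeded"
        or action_type in TERMINAL_ACTIONS
    )


print(flag_reward(True, "high"))   # 0.8
print(flag_reward(False, "high"))  # -0.05
```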

### 🧠 Environment Design Highlights

- **Predictable State Management**: `reset()` and `step()` are deterministic for a given task/seed pair, so episodes are fully reproducible.
- **Dense Reward Signal**: Unlike "win/loss" environments, CodeLens provides continuous feedback. Every action, from the first issue flagged to the final verdict, produces a typed `Reward` object with a human-readable rationale, accelerating agent learning (process supervision).
- **Novelty: The Reviewer Trust Mechanic**: The **Noise Budget** (5 credits) simulates real-world developer trust. If an agent "hallucinates" too many non-existent bugs, it exhausts the budget and the episode terminates, penalizing high-volume, low-precision behavior.

---

## Scoring System

### Bug Detection

Score = `0.4 × coverage + 0.6 × avg_issue_score − 0.1 × false_positive_rate`
Issues are scored on **keyword accuracy** (50%) and **severity matching** (50%).
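
A worked example of the formula, as a sketch. The function names are hypothetical, all inputs are assumed to lie in [0, 1], and clamping the result to [0, 1] is an assumption rather than documented behaviour.

```python
def issue_score(keyword_accuracy: float, severity_match: float) -> float:
    """Per-issue score: keyword accuracy and severity matching weighted 50/50."""
    return 0.5 * keyword_accuracy + 0.5 * severity_match


def bug_detection_score(coverage: float, avg_issue_score: float,
                        false_positive_rate: float) -> float:
    """Combine the three terms of the Bug Detection formula."""
    raw = 0.4 * coverage + 0.6 * avg_issue_score - 0.1 * false_positive_rate
    return min(max(raw, 0.0), 1.0)  # assumption: keep the score within [0, 1]


# Half of the issues found, decent per-issue quality, a few false positives:
# 0.4 * 0.5 + 0.6 * 0.75 - 0.1 * 0.2 = 0.20 + 0.45 - 0.02
score = bug_detection_score(coverage=0.5, avg_issue_score=0.75, false_positive_rate=0.2)
print(round(score, 2))  # 0.63
```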

### Security Audit

Score = `avg(per_issue_score)` where each issue = `0.7 × severity_accuracy + 0.3 × keyword_coverage`.
Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as **LOW** incurs a major penalty.
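
One way to picture distance weighting is a linear falloff over the severity scale. The sketch below assumes that scheme; the actual grader may weight distances differently, and all function names are hypothetical.

```python
from typing import Sequence

# Severity levels from the action space, ordered from least to most severe.
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]


def severity_accuracy(expected: str, predicted: str) -> float:
    """Assumed linear falloff: 1.0 for an exact match, 0.0 at the maximum distance."""
    distance = abs(SEVERITY_ORDER.index(expected) - SEVERITY_ORDER.index(predicted))
    return 1.0 - distance / (len(SEVERITY_ORDER) - 1)


def security_issue_score(expected: str, predicted: str, keyword_coverage: float) -> float:
    """Per-issue score: 70% severity accuracy, 30% keyword coverage."""
    return 0.7 * severity_accuracy(expected, predicted) + 0.3 * keyword_coverage


def security_audit_score(issue_scores: Sequence[float]) -> float:
    """Average of the per-issue scores; 0.0 when nothing was scored (assumption)."""
    return sum(issue_scores) / len(issue_scores) if issue_scores else 0.0


# A CRITICAL issue reported as LOW is three levels off: accuracy 1 - 3/4 = 0.25.
print(severity_accuracy("critical", "low"))  # 0.25
```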

### Architectural Review

Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
Detail quality rewards technical explanations that provide actionable developer feedback.
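
A short sketch of the weighting, assuming `verdict_accuracy` is binary (1.0 for the expected verdict, 0.0 otherwise) and that `detection_rate` and `detail_quality` lie in [0, 1]. The function name is hypothetical.

```python
def architectural_review_score(detection_rate: float, verdict_correct: bool,
                               detail_quality: float) -> float:
    """Weighted sum of the three Architectural Review terms."""
    verdict_accuracy = 1.0 if verdict_correct else 0.0  # assumed binary
    return 0.6 * detection_rate + 0.2 * verdict_accuracy + 0.2 * detail_quality


# Half of the issues detected, correct verdict, middling explanations:
# 0.6 * 0.5 + 0.2 * 1.0 + 0.2 * 0.5 = 0.30 + 0.20 + 0.10
score = architectural_review_score(0.5, True, 0.5)
print(round(score, 2))  # 0.6
```

Under these weights, a perfect verdict with no detected issues contributes at most 0.2 to the score.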

### Noise Budget

Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
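
The budget behaves like a simple counter. The class below is a minimal sketch with hypothetical names, not the environment's implementation.

```python
from dataclasses import dataclass


@dataclass
class NoiseBudget:
    credits: int = 5  # every episode starts with five false-positive credits

    def spend(self) -> bool:
        """Spend one credit for a false positive; return True if the episode should end."""
        if self.credits > 0:
            self.credits -= 1
        return self.credits == 0


budget = NoiseBudget()
ended = [budget.spend() for _ in range(5)]
print(ended)  # [False, False, False, False, True]
```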

---

## 📊 Baseline Scores

Reproducible keyword-based baseline results across all 30 scenarios (10 seeds per task):

| Task                   | Mean Score | Best Score | Worst Score | Success Rate (>0.5) |
| ---------------------- | ---------- | ---------- | ----------- | ------------------- |
| `bug_detection`        | 0.3577     | 0.9167     | 0.0000      | 40%                 |
| `security_audit`       | 0.1850     | 1.0000     | 0.0000      | 20%                 |
| `architectural_review` | 0.2930     | 0.6640     | 0.0000      | 40%                 |
| **Overall**            | **0.2786** | n/a        | n/a         | **33%**             |

> **Agent:** `KeywordAgent` (heuristic, 35+ rules); see `scripts/baseline.py`
> **Reproduce:** `python scripts/evaluate.py --agent keyword --output results.json`

These scores are a deterministic reference baseline. LLM-powered agents (e.g., GPT-4o, Claude) are expected to significantly outperform it.
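
The table columns can be derived from a list of per-episode scores. The sketch below uses made-up scores purely to show the arithmetic; it does not read `results.json`, whose layout is not described here, and `summarise` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, Sequence


def summarise(scores: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Compute mean, best, worst, and the share of scores strictly above the threshold."""
    return {
        "mean": round(mean(scores), 4),
        "best": max(scores),
        "worst": min(scores),
        "success_rate": sum(s > threshold for s in scores) / len(scores),
    }


# Illustrative scores only, not the baseline's per-seed results.
summary = summarise([0.0, 0.25, 0.5, 0.75, 1.0])
print(summary)
```

Note that a score of exactly 0.5 does not count as a success under the `>0.5` rule.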

---

## API Reference

| Method | Endpoint                | Auth     | Description                                   |
| :----- | :---------------------- | :------- | :-------------------------------------------- |
| `POST` | `/reset`                | Optional | Start a new evaluation episode                |
| `POST` | `/step/{id}`            | Optional | Submit a review action (flag_issue, approve)  |
| `GET`  | `/result/{id}`          | Optional | Retrieve final scores and logs for an episode |
| `GET`  | `/leaderboard`          | None     | Paginated performance rankings                |
| `POST` | `/submit`               | Optional | Persist an episode result to the leaderboard  |
| `GET`  | `/stats`                | None     | Aggregate statistics across all agents        |
| `GET`  | `/episodes/{id}/replay` | Optional | Full event-by-event history replay            |
| `GET`  | `/dashboard`            | None     | Interactive Real-time Dashboard               |
| `GET`  | `/health`               | None     | System status and health check                |

Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for production parity.

---

## Running with Docker

### Production Mode

```bash
docker compose up -d
# View logs: docker compose logs -f
```

### Direct Pull

```bash
docker run -p 7860:7860 ghcr.io/arshvermagit/open-ev-code-handler:latest
```

### Automated Testing

```bash
docker compose -f docker-compose.test.yml up
```

---

## Baseline Agent & Evaluation

### Single Scenario Trial

```bash
python scripts/baseline.py --task bug_detection --seed 3 --verbose
```

### Full Benchmark (All 30 Scenarios)

```bash
# Keyword-based baseline
python scripts/evaluate.py --agent keyword --output results.json

# LLM-powered reviewer (e.g. Claude)
python scripts/evaluate.py --agent llm --api-key $ANTHROPIC_API_KEY
```

---

## Writing Your Own Agent

CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:

```python
import requests

API = "http://localhost:7860"

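# Start a new episode and fail fast on HTTP errors
resp = requests.post(f"{API}/reset", json={"task_id": "bug_detection", "seed": 0})
resp.raise_for_status()
episode_id = resp.json()["episode_id"]

done = False
while not done:
    # Placeholder: your agent logic analyzes the diff and chooses an action.
    # Repeating this same flag counts as a duplicate and drains the noise
    # budget, so a real agent should vary its actions and end with a verdict.
    action = {
        "action_type": "flag_issue",
        "body": "Identified a vulnerability on line 14",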
        "filename": "api/search.py",
        "line_number": 14,
        "severity": "critical",
        "category": "security"
    }

    result = requests.post(f"{API}/step/{episode_id}", json=action).json()
    done = result["done"]

# Get final results
final = requests.get(f"{API}/result/{episode_id}").json()
print(f"Final Score: {final['final_score']}")
```

---

## Project Structure

```text
open-ev-code-handler/
├── app.py                      # FastAPI application (9 endpoints)
├── codelens_env/               # Core evaluation logic
│   ├── database.py             # SQLModel persistence layer
│   ├── env.py                  # Episode state machine
│   ├── models.py               # Pydantic v2 data models
│   ├── scenarios.py            # 30 Synthetic PR scenarios
│   └── graders/                # Grader implementations (Bug, Sec, Arch)
├── scripts/                    # CLI tools (baseline, evaluate, migrate)
├── static/                     # Compiled dashboard assets
├── tests/                      # 155+ Parametrized tests
├── Dockerfile                  # Multi-stage, non-root build
├── docker-compose.yml          # Production orchestration
└── openenv.yaml                # CodeLens v2 specification
```

---

## Development

```bash
# Setup
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Automated Tests
PYTHONPATH=. pytest tests/ -v --cov=codelens_env

# Linter Check
pylint codelens_env/ app.py

# Scenario Sanity Check
PYTHONPATH=. python scripts/validate.py
```

## Authors & Maintainers

CodeLens is authored and maintained by:

- **Arsh Verma**: [GitHub](https://github.com/ArshVermaGit)
- **Divyansh Rawat**: [GitHub](https://github.com/DsThakurRawat)

---

## Contributing & License

Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.

This project is licensed under the **[MIT License](LICENSE)**.