---
title: CodeLens Environment
emoji: ๐
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
- openenv
---
<p align="center">
<img src="assets/codelens-brand-v2.svg" width="400" alt="CodeLens." />
</p>

# CodeLens Environment

> **AI evaluation environment for benchmarking code review agents on 30 synthetic pull requests.**

CodeLens is an evaluation environment in which AI agents act as senior code reviewers. They analyze pull request diffs to identify bugs, security vulnerabilities, and architectural issues, then submit a final verdict.

Designed for researchers and developers building AI code assistants, CodeLens provides 30 realistic Python scenarios with ground-truth labels and deterministic, reproducible scoring.
---
## Motivation
Progress in AI coding assistants has largely focused on **generation** (writing code), but **evaluation** (reviewing code) is equally critical for software reliability. Manual code review is a high-cognitive-load, real-world task that requires:
- **Precision**: Identifying exactly where a bug exists.
- **Context**: Understanding how a local change affects the whole system.
- **Security-First Mindset**: Spotting non-obvious vulnerabilities like SQL injection or race conditions.
CodeLens transforms these human-centric skills into a **measurable benchmark**, allowing researchers to evaluate agents on their ability to act as high-fidelity gatekeepers of code quality.
---
## Quick Start
Get up and running locally in under 2 minutes:
```bash
git clone https://github.com/ArshVermaGit/open-ev-code-handler.git
cd open-ev-code-handler
cp .env.example .env
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python scripts/migrate.py init
PYTHONPATH=. python app.py
```
- **Dashboard**: [http://localhost:7860/dashboard](http://localhost:7860/dashboard)
- **API Docs**: [http://localhost:7860/docs](http://localhost:7860/docs)
---
## Evaluation Tasks
CodeLens benchmarks agents across three critical engineering domains:
| Task | Difficulty | Scenarios | Max Steps | Focus Area |
| ---------------------- | ---------- | --------- | --------- | -------------------------------------------------------------------------- |
| `bug_detection` | **Easy** | 10 | 10 | Off-by-one errors, null dereferences, race conditions, exception handling |
| `security_audit` | **Medium** | 10 | 15 | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
| `architectural_review` | **Hard** | 10 | 20 | N+1 queries, god classes, blocking async calls, circular imports |
---
## Observation Space
Each `step()` and `reset()` call returns a typed `Observation` object:
| Field | Type | Description |
| ---------------- | ----------------- | ---------------------------------------------- |
| `task_id` | `TaskId` (enum) | One of `bug_detection`, `security_audit`, `architectural_review` |
| `scenario_hash` | `str` | Deterministic identifier for the scenario |
| `pr_title` | `str` | Title of the synthetic pull request |
| `pr_description` | `str` | Description/context for the PR |
| `diff` | `str` | Full unified diff (all files concatenated) |
| `files_changed` | `List[FileChanged]` | Structured file patches with metadata |
| `step_count` | `int` | Current step number (0-indexed) |
| `max_steps` | `int` | Maximum steps allowed for this task |
| `noise_budget` | `int` | Remaining false-positive credits (starts at 5) |
| `issues_flagged` | `int` | Number of correctly matched issues so far |
| `done` | `bool` | Whether the episode has terminated |
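
The fields above can be mirrored on the client side. The following is a minimal sketch, assuming the JSON payload uses the same key names as the table; `ObservationView` and `steps_remaining` are hypothetical names for illustration and are not part of CodeLens. The shape of each `files_changed` entry is not specified here, so it is kept as a plain dictionary.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List


@dataclass(frozen=True)
class ObservationView:
    """Hypothetical client-side mirror of the Observation fields."""
    task_id: str
    scenario_hash: str
    pr_title: str
    pr_description: str
    diff: str
    files_changed: List[Dict[str, Any]]
    step_count: int
    max_steps: int
    noise_budget: int
    issues_flagged: int
    done: bool

    @classmethod
    def from_payload(cls, payload: Dict[str, Any]) -> "ObservationView":
        # Fail early if a documented field is absent; extra keys are ignored.
        names = [f.name for f in fields(cls)]
        missing = [n for n in names if n not in payload]
        if missing:
            raise KeyError(f"observation is missing fields: {missing}")
        return cls(**{n: payload[n] for n in names})

    @property
    def steps_remaining(self) -> int:
        # Steps left before the task's step limit is reached.
        return max(self.max_steps - self.step_count, 0)
```

An agent can use `steps_remaining` and `noise_budget` together to decide when to stop flagging and submit a verdict.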
## Action Space
Agents submit typed `Action` objects with the following fields:
| Field | Type | Required For | Description |
| --------------- | ------------------ | ------------------- | -------------------------------------------- |
| `action_type` | `ActionType` (enum)| All actions | `flag_issue`, `approve`, `request_changes`, `comment`, `ask_question` |
| `body` | `str` | All actions | Description or explanation text |
| `filename` | `str` | `flag_issue` | File containing the issue |
| `line_number` | `int` | `flag_issue` | Approximate line number of the issue |
| `category` | `Category` (enum) | `flag_issue` | `bug`, `security`, `architecture`, `style`, `performance` |
| `severity` | `Severity` (enum) | `flag_issue` | `critical`, `high`, `medium`, `low`, `info` |
| `verdict` | `Verdict` (enum) | `approve` / `request_changes` | `lgtm`, `request_changes`, `needs_discussion` |
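
The "Required For" column can be enforced before a request is sent. This is a sketch under stated assumptions: `build_action` is a hypothetical helper, the enum values are copied from the table, and the server's own validation may differ.

```python
from typing import Any, Dict, Optional

# Allowed values, copied from the Action Space table.
ACTION_TYPES = {"flag_issue", "approve", "request_changes", "comment", "ask_question"}
CATEGORIES = {"bug", "security", "architecture", "style", "performance"}
SEVERITIES = {"critical", "high", "medium", "low", "info"}
VERDICTS = {"lgtm", "request_changes", "needs_discussion"}


def build_action(
    action_type: str,
    body: str,
    *,
    filename: Optional[str] = None,
    line_number: Optional[int] = None,
    category: Optional[str] = None,
    severity: Optional[str] = None,
    verdict: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a JSON-ready action dict, raising ValueError on invalid input."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    optional = {
        "filename": filename,
        "line_number": line_number,
        "category": category,
        "severity": severity,
        "verdict": verdict,
    }
    if action_type == "flag_issue":
        required = ("filename", "line_number", "category", "severity")
    elif action_type in ("approve", "request_changes"):
        required = ("verdict",)
    else:
        required = ()
    missing = [name for name in required if optional[name] is None]
    if missing:
        raise ValueError(f"{action_type} requires: {', '.join(missing)}")
    for value, allowed, label in (
        (category, CATEGORIES, "category"),
        (severity, SEVERITIES, "severity"),
        (verdict, VERDICTS, "verdict"),
    ):
        if value is not None and value not in allowed:
            raise ValueError(f"invalid {label}: {value!r}")
    action: Dict[str, Any] = {"action_type": action_type, "body": body}
    # Only include optional fields that were actually provided.
    action.update({k: v for k, v in optional.items() if v is not None})
    return action
```

Validating locally avoids spending an episode step on a malformed action.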
### Reward Signal
Each `step()` returns a typed `Reward` object:
| Field | Type | Description |
| -------------- | ------- | ------------------------------------------------ |
| `value` | `float` | Normalised score (0.0–1.0) |
| `reason` | `str` | Human-readable explanation of the reward |
| `is_terminal` | `bool` | `True` on the final step of an episode |
**Reward shaping:** Correct issue flags yield positive rewards scaled by severity (critical=1.0, high=0.8, medium=0.5, low=0.2). False positives and duplicates each incur a -0.05 penalty and consume noise budget. Episodes terminate when the noise budget reaches zero, max steps are exceeded, or a terminal action (approve/request_changes) is submitted.
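
The shaping rules can be illustrated with a small tracker. This is a sketch, not the environment's implementation: `EpisodeTracker` is a hypothetical name, the returned values are the raw shaping amounts before any normalisation, one credit per false positive is taken from the Noise Budget description, and treating `step_count >= max_steps` as the step limit is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Severity weights and penalty as listed in the reward shaping paragraph.
SEVERITY_REWARD = {"critical": 1.0, "high": 0.8, "medium": 0.5, "low": 0.2}
FALSE_POSITIVE_PENALTY = -0.05


@dataclass
class EpisodeTracker:
    """Hypothetical local model of reward shaping and termination."""
    max_steps: int
    noise_budget: int = 5
    step_count: int = 0
    done: bool = False

    def record_flag(self, matched_severity: Optional[str]) -> float:
        """Record one flag_issue step.

        matched_severity is the ground-truth severity of a correctly
        flagged issue, or None for a false positive or duplicate.
        """
        if self.done:
            raise RuntimeError("episode has already terminated")
        self.step_count += 1
        if matched_severity is None:
            self.noise_budget -= 1
            reward = FALSE_POSITIVE_PENALTY
        else:
            reward = SEVERITY_REWARD[matched_severity]
        # Assumption: the step limit is reached when step_count == max_steps.
        if self.noise_budget <= 0 or self.step_count >= self.max_steps:
            self.done = True
        return reward

    def record_verdict(self) -> None:
        """A terminal action (approve / request_changes) ends the episode."""
        if self.done:
            raise RuntimeError("episode has already terminated")
        self.step_count += 1
        self.done = True
```

The `info` severity has no listed weight, so the sketch deliberately leaves it out of `SEVERITY_REWARD`.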
### Environment Design Highlights
- **Predictable State Management**: `reset()` and `step()` are deterministic for a given task/seed pair, so every episode can be reproduced exactly.
- **Dense Reward Signal**: Unlike "win/loss" environments, CodeLens provides continuous feedback. Every action, from the first issue flagged to the final verdict, produces a typed `Reward` object with a human-readable rationale, which supports process supervision during agent learning.
- **Novelty: The Reviewer Trust Mechanic**: The **Noise Budget** (5 credits) simulates real-world developer trust. If an agent "hallucinates" too many non-existent bugs, it exhausts the budget and the episode is terminated, penalizing high-volume, low-precision behavior.
---
## Scoring System
### Bug Detection
Score = `0.4 × coverage + 0.6 × avg_issue_score − 0.1 × false_positive_rate`
Issues are scored on **keyword accuracy** (50%) and **severity matching** (50%).
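
As a worked example of the formula, the sketch below computes a bug detection score from its three inputs. The function names are hypothetical, clamping the result to the 0 to 1 range is an assumption, and so is the definition of `false_positive_rate` as false flags divided by total flags.

```python
def clamp01(value: float) -> float:
    """Clamp a score into [0, 1] (assumed; the grader may handle this differently)."""
    return min(max(value, 0.0), 1.0)


def issue_score(keyword_accuracy: float, severity_match: float) -> float:
    """Per-issue score: keyword accuracy and severity matching, 50% each."""
    return 0.5 * keyword_accuracy + 0.5 * severity_match


def bug_detection_score(
    coverage: float, avg_issue_score: float, false_positive_rate: float
) -> float:
    """0.4 * coverage + 0.6 * avg_issue_score - 0.1 * false_positive_rate."""
    raw = 0.4 * coverage + 0.6 * avg_issue_score - 0.1 * false_positive_rate
    return clamp01(raw)


# Example: 2 of 3 ground-truth issues found, 1 of 3 flags was a false positive.
scores = [issue_score(1.0, 1.0), issue_score(1.0, 0.0)]
example = bug_detection_score(
    coverage=2 / 3,
    avg_issue_score=sum(scores) / len(scores),
    false_positive_rate=1 / 3,
)
print(f"{example:.4f}")
```

Because `avg_issue_score` carries the largest weight, a few well-described flags can outscore broad but shallow coverage.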
### Security Audit
Score = `avg(per_issue_score)` where each issue = `0.7 × severity_accuracy + 0.3 × keyword_coverage`.
Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as **LOW** incurs a major penalty.
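
One way to picture distance weighting is a linear falloff over the ordered severity levels. The exact weighting used by the grader is not stated here, so the linear form below is an assumption, and all function names are hypothetical.

```python
from typing import Sequence

# Severity levels from the Action Space table, ordered least to most severe.
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]


def severity_accuracy(predicted: str, actual: str) -> float:
    """1.0 for an exact match, falling linearly to 0.0 at maximum distance.

    Assumption: linear falloff; the real grader may weight distances differently.
    """
    distance = abs(SEVERITY_ORDER.index(predicted) - SEVERITY_ORDER.index(actual))
    return 1.0 - distance / (len(SEVERITY_ORDER) - 1)


def security_issue_score(predicted: str, actual: str, keyword_coverage: float) -> float:
    """0.7 * severity_accuracy + 0.3 * keyword_coverage for one issue."""
    return 0.7 * severity_accuracy(predicted, actual) + 0.3 * keyword_coverage


def security_audit_score(per_issue_scores: Sequence[float]) -> float:
    """Average of the per-issue scores; 0.0 when nothing was scored."""
    if not per_issue_scores:
        return 0.0
    return sum(per_issue_scores) / len(per_issue_scores)
```

Under this sketch, labelling a critical issue as low keeps only a quarter of the severity credit, while labelling it as high keeps three quarters.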
### Architectural Review
Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
Detail quality rewards technical explanations that provide actionable developer feedback.
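
The weighting can be checked with a short calculation. In this sketch the function name is hypothetical, `verdict_accuracy` is assumed to be binary (1.0 for the expected verdict, 0.0 otherwise), and `detail_quality` is passed in as a value between 0 and 1 because its computation is not described here.

```python
def architectural_review_score(
    issues_found: int,
    issues_total: int,
    verdict_correct: bool,
    detail_quality: float,
) -> float:
    """0.6 * detection_rate + 0.2 * verdict_accuracy + 0.2 * detail_quality."""
    detection_rate = issues_found / issues_total if issues_total else 0.0
    verdict_accuracy = 1.0 if verdict_correct else 0.0  # assumed binary
    return 0.6 * detection_rate + 0.2 * verdict_accuracy + 0.2 * detail_quality


# Example: half of the issues found, correct verdict, moderately detailed feedback.
print(round(architectural_review_score(2, 4, True, 0.5), 4))
```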
### Noise Budget
Every episode starts with **5 false-positive credits**. Each flag that matches no ground-truth issue, or that duplicates an earlier flag, spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
---
## Baseline Scores
Reproducible keyword-based baseline results across all 30 scenarios (10 seeds per task):
| Task | Mean Score | Best Score | Worst Score | Success Rate (>0.5) |
| ---------------------- | ---------- | ---------- | ----------- | ------------------- |
| `bug_detection` | 0.3577 | 0.9167 | 0.0000 | 40% |
| `security_audit` | 0.1850 | 1.0000 | 0.0000 | 20% |
| `architectural_review` | 0.2930 | 0.6640 | 0.0000 | 40% |
| **Overall** | **0.2786** | - | - | **33%** |

> **Agent:** `KeywordAgent` (heuristic, 35+ rules), see `scripts/baseline.py`
>
> **Reproduce:** `python scripts/evaluate.py --agent keyword --output results.json`

These scores are a deterministic heuristic baseline. LLM-powered agents (e.g., GPT-4o, Claude) are expected to significantly outperform it.
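
The Overall row follows from the per-task rows because each task contributes the same number of scenarios (10). A quick check, using only the numbers from the table:

```python
# Per-task baseline results copied from the table above.
baseline = {
    "bug_detection": {"mean": 0.3577, "success_rate": 0.40},
    "security_audit": {"mean": 0.1850, "success_rate": 0.20},
    "architectural_review": {"mean": 0.2930, "success_rate": 0.40},
}

# Equal task sizes, so the overall figures are simple averages.
overall_mean = sum(r["mean"] for r in baseline.values()) / len(baseline)
overall_success = sum(r["success_rate"] for r in baseline.values()) / len(baseline)

print(f"{overall_mean:.4f}")     # 0.2786
print(f"{overall_success:.0%}")  # 33%
```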
---
## API Reference
| Method | Endpoint | Auth | Description |
| :----- | :---------------------- | :------- | :-------------------------------------------- |
| `POST` | `/reset` | Optional | Start a new evaluation episode |
| `POST` | `/step/{id}` | Optional | Submit a review action (flag_issue, approve) |
| `GET` | `/result/{id}` | Optional | Retrieve final scores and logs for an episode |
| `GET` | `/leaderboard` | None | Paginated performance rankings |
| `POST` | `/submit` | Optional | Persist an episode result to the leaderboard |
| `GET` | `/stats` | None | Aggregate statistics across all agents |
| `GET` | `/episodes/{id}/replay` | Optional | Full event-by-event history replay |
| `GET` | `/dashboard` | None | Interactive real-time dashboard |
| `GET` | `/health` | None | System status and health check |
Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for production parity.
---
## Running with Docker
### Production Mode
```bash
docker compose up -d
# View logs: docker compose logs -f
```
### Direct Pull
```bash
# Docker image references must be lowercase
docker run -p 7860:7860 ghcr.io/arshvermagit/open-ev-code-handler:latest
```
### Automated Testing
```bash
docker compose -f docker-compose.test.yml up
```
---
## Baseline Agent & Evaluation
### Single Scenario Trial
```bash
python scripts/baseline.py --task bug_detection --seed 3 --verbose
```
### Full Benchmark (All 30 Scenarios)
```bash
# Keyword-based baseline
python scripts/evaluate.py --agent keyword --output results.json
# LLM-powered reviewer (e.g. Claude)
python scripts/evaluate.py --agent llm --api-key "$ANTHROPIC_API_KEY"
```
---
## Writing Your Own Agent
CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:
```python
import requests

API = "http://localhost:7860"

# Start a new episode
resp = requests.post(f"{API}/reset", json={"task_id": "bug_detection", "seed": 0})
resp.raise_for_status()
episode_id = resp.json()["episode_id"]

done = False
while not done:
    # Your agent logic analyzes the diff and chooses the next action
    action = {
        "action_type": "flag_issue",
        "body": "Identified a vulnerability on line 14",
        "filename": "api/search.py",
        "line_number": 14,
        "severity": "critical",
        "category": "security",
    }
    result = requests.post(f"{API}/step/{episode_id}", json=action).json()
    done = result["done"]

# Get final results
final = requests.get(f"{API}/result/{episode_id}").json()
print(f"Final Score: {final['final_score']}")
```
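
The "agent logic" step in the loop above could start as a simple diff scanner. The sketch below walks a unified diff, tracks the new-file line number of each added line, and proposes `flag_issue` actions from a few regular expressions. The rules and function names are hypothetical illustrations; they are not the rules used by `KeywordAgent`.

```python
import re
from typing import Any, Dict, Iterator, List, Tuple

# Captures: old line count, new start line, new line count.
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

# Hypothetical rules for illustration: (pattern, category, severity, body).
RULES = [
    (re.compile(r"execute\(\s*f[\"']"), "security", "critical",
     "SQL query built with an f-string; use parameterized queries"),
    (re.compile(r"pickle\.loads?\("), "security", "high",
     "Unpickling untrusted data can execute arbitrary code"),
    (re.compile(r"(?i)(password|secret|api_key)\s*=\s*[\"'][^\"']+[\"']"),
     "security", "high", "Possible hardcoded secret"),
]


def added_lines(diff: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (filename, new_line_number, text) for each added line."""
    filename = ""
    new_lineno = old_left = new_left = 0
    for line in diff.splitlines():
        if old_left > 0 or new_left > 0:  # inside a hunk body
            if line.startswith("+"):
                yield filename, new_lineno, line[1:]
                new_lineno += 1
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif not line.startswith("\\"):  # context line
                new_lineno += 1
                old_left -= 1
                new_left -= 1
            continue
        match = HUNK_RE.match(line)
        if match:
            old_left = int(match.group(1) or 1)
            new_lineno = int(match.group(2))
            new_left = int(match.group(3) or 1)
        elif line.startswith("+++ "):
            path = line[4:].strip()
            filename = path[2:] if path.startswith("b/") else path


def propose_actions(diff: str) -> List[Dict[str, Any]]:
    """Return one flag_issue action per added line that matches a rule."""
    actions: List[Dict[str, Any]] = []
    for filename, lineno, text in added_lines(diff):
        for pattern, category, severity, body in RULES:
            if pattern.search(text):
                actions.append({
                    "action_type": "flag_issue",
                    "body": body,
                    "filename": filename,
                    "line_number": lineno,
                    "category": category,
                    "severity": severity,
                })
                break  # at most one flag per line
    return actions
```

Submitting each proposed action once and then sending a verdict keeps false positives, and therefore noise budget usage, bounded by the number of rule matches.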
---
## Project Structure
```text
open-ev-code-handler/
├── app.py                  # FastAPI application (9 endpoints)
├── codelens_env/           # Core evaluation logic
│   ├── database.py         # SQLModel persistence layer
│   ├── env.py              # Episode state machine
│   ├── models.py           # Pydantic v2 data models
│   ├── scenarios.py        # 30 synthetic PR scenarios
│   └── graders/            # Grader implementations (Bug, Sec, Arch)
├── scripts/                # CLI tools (baseline, evaluate, migrate)
├── static/                 # Compiled dashboard assets
├── tests/                  # 155+ parametrized tests
├── Dockerfile              # Multi-stage, non-root build
├── docker-compose.yml      # Production orchestration
└── openenv.yaml            # CodeLens v2 specification
```
---
## Development
```bash
# Setup
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Automated Tests
PYTHONPATH=. pytest tests/ -v --cov=codelens_env
# Linter Check
pylint codelens_env/ app.py
# Scenario Sanity Check
PYTHONPATH=. python scripts/validate.py
```
## Authors & Maintainers
CodeLens is authored and maintained by:
- **Arsh Verma**: [GitHub](https://github.com/ArshVermaGit)
- **Divyansh Rawat**: [GitHub](https://github.com/DsThakurRawat)
---
## Contributing & License
Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
This project is licensed under the **[MIT License](LICENSE)**.