---
title: CodeLens Environment
emoji: ๐
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
- openenv
---
# CodeLens Environment




> **AI evaluation environment for benchmarking code review agents on 30 synthetic pull requests.**
CodeLens is a high-fidelity evaluation environment where AI agents act as senior code reviewers. They analyze pull request diffs to identify bugs, security vulnerabilities, and architectural issues before providing a final verdict.
Designed for researchers and developers building the next generation of AI code assistants, CodeLens provides 30 realistic Python scenarios with ground-truth labels and deterministic, reproducible scoring.
---
## ๐ก Motivation
Progress in AI coding assistants has largely focused on **generation** (writing code), but **evaluation** (reviewing code) is equally critical for software reliability. Manual code review is a high-cognitive-load, real-world task that requires:
- **Precision**: Identifying exactly where a bug exists.
- **Context**: Understanding how a local change affects the whole system.
- **Security-First Mindset**: Spotting non-obvious vulnerabilities like SQL injection or race conditions.
CodeLens transforms these human-centric skills into a **measurable benchmark**, allowing researchers to evaluate agents on their ability to act as high-fidelity gatekeepers of code quality.
---
---
## Quick Start
Get up and running locally in under 2 minutes:
```bash
git clone https://github.com/ArshVermaGit/open-ev-code-handler.git
cd open-ev-code-handler
cp .env.example .env
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python scripts/migrate.py init
PYTHONPATH=. python app.py
```
- **Dashboard**: [http://localhost:7860/dashboard](http://localhost:7860/dashboard)
- **API Docs**: [http://localhost:7860/docs](http://localhost:7860/docs)
---
## Evaluation Tasks
CodeLens benchmarks agents across three critical engineering domains:
| Task | Difficulty | Scenarios | Max Steps | Focus Area |
| ---------------------- | ---------- | --------- | --------- | -------------------------------------------------------------------------- |
| `bug_detection` | **Easy** | 10 | 10 | Off-by-one errors, null dereferences, race conditions, exception handling |
| `security_audit` | **Medium** | 10 | 15 | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
| `architectural_review` | **Hard** | 10 | 20 | N+1 queries, god classes, blocking async calls, circular imports |
---
## ๐ฏ Observation Space
Each `step()` and `reset()` call returns a typed `Observation` object:
| Field | Type | Description |
| ---------------- | ----------------- | ---------------------------------------------- |
| `task_id` | `TaskId` (enum) | One of `bug_detection`, `security_audit`, `architectural_review` |
| `scenario_hash` | `str` | Deterministic identifier for the scenario |
| `pr_title` | `str` | Title of the synthetic pull request |
| `pr_description` | `str` | Description/context for the PR |
| `diff` | `str` | Full unified diff (all files concatenated) |
| `files_changed` | `List[FileChanged]` | Structured file patches with metadata |
| `step_count` | `int` | Current step number (0-indexed) |
| `max_steps` | `int` | Maximum steps allowed for this task |
| `noise_budget` | `int` | Remaining false-positive credits (starts at 5) |
| `issues_flagged` | `int` | Number of correctly matched issues so far |
| `done` | `bool` | Whether the episode has terminated |
## ๐ฎ Action Space
Agents submit typed `Action` objects with the following fields:
| Field | Type | Required For | Description |
| --------------- | ------------------ | ------------------- | -------------------------------------------- |
| `action_type` | `ActionType` (enum)| All actions | `flag_issue`, `approve`, `request_changes`, `comment`, `ask_question` |
| `body` | `str` | All actions | Description or explanation text |
| `filename` | `str` | `flag_issue` | File containing the issue |
| `line_number` | `int` | `flag_issue` | Approximate line number of the issue |
| `category` | `Category` (enum) | `flag_issue` | `bug`, `security`, `architecture`, `style`, `performance` |
| `severity` | `Severity` (enum) | `flag_issue` | `critical`, `high`, `medium`, `low`, `info` |
| `verdict` | `Verdict` (enum) | `approve` / `request_changes` | `lgtm`, `request_changes`, `needs_discussion` |
### Reward Signal
Each `step()` returns a typed `Reward` object:
| Field | Type | Description |
| -------------- | ------- | ------------------------------------------------ |
| `value` | `float` | Normalised score (0.0โ1.0) |
| `reason` | `str` | Human-readable explanation of the reward |
| `is_terminal` | `bool` | `True` on the final step of an episode |
**Reward shaping:** Correct issue flags yield positive rewards scaled by severity (critical=1.0, high=0.8, medium=0.5, low=0.2). False positives and duplicates incur โ0.05 penalties and consume noise budget. Episodes terminate when noise budget reaches zero, max steps are exceeded, or a terminal action (approve/request_changes) is submitted.
### ๐ง Environment Design Highlights
- **Predictable State Management**: The `reset()` and `step()` functions are strictly idempotent based on task/seed pairs, ensuring 100% reproducible episodes.
- **Dense Reward Signal**: Unlike "win/loss" environments, CodeLens provides continuous feedback. Every actionโfrom the first issue flagged to the final verdictโproduces a typed `Reward` object with human-readable rationale, accelerating agent learning (process supervision).
- **Novelty: The Reviewer Trust Mechanic**: The **Noise Budget** (5 credits) simulates real-world developer trust. If an agent "hallucinates" too many non-existent bugs, it loses the budget and the episode is terminated, penalizing high-volume, low-precision behavior.
---
---
## Scoring System
### Bug Detection
Score = `0.4 ร coverage + 0.6 ร avg_issue_score โ 0.1 ร false_positive_rate`
Issues are scored on **keyword accuracy** (50%) and **severity matching** (50%).
### Security Audit
Score = `avg(per_issue_score)` where each issue = `0.7 ร severity_accuracy + 0.3 ร keyword_coverage`.
Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as **LOW** incurs a major penalty.
### Architectural Review
Score = `0.6 ร detection_rate + 0.2 ร verdict_accuracy + 0.2 ร detail_quality`.
Detail quality rewards technical explanations that provide actionable developer feedback.
### Noise Budget
Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
---
## ๐ Baseline Scores
Reproducible keyword-based baseline results across all 30 scenarios (10 seeds per task):
| Task | Mean Score | Best Score | Worst Score | Success Rate (>0.5) |
| ---------------------- | ---------- | ---------- | ----------- | ------------------- |
| `bug_detection` | 0.3577 | 0.9167 | 0.0000 | 40% |
| `security_audit` | 0.1850 | 1.0000 | 0.0000 | 20% |
| `architectural_review` | 0.2930 | 0.6640 | 0.0000 | 40% |
| **Overall** | **0.2786** | โ | โ | **33%** |
> **Agent:** `KeywordAgent` (heuristic, 35+ rules) โ see `scripts/baseline.py`
> **Reproduce:** `python scripts/evaluate.py --agent keyword --output results.json`
These scores represent a deterministic lower bound. LLM-powered agents (e.g., GPT-4o, Claude) are expected to significantly outperform this baseline.
---
## API Reference
| Method | Endpoint | Auth | Description |
| :----- | :---------------------- | :------- | :-------------------------------------------- |
| `POST` | `/reset` | Optional | Start a new evaluation episode |
| `POST` | `/step/{id}` | Optional | Submit a review action (flag_issue, approve) |
| `GET` | `/result/{id}` | Optional | Retrieve final scores and logs for an episode |
| `GET` | `/leaderboard` | None | Paginated performance rankings |
| `POST` | `/submit` | Optional | Persist an episode result to the leaderboard |
| `GET` | `/stats` | None | Aggregate statistics across all agents |
| `GET` | `/episodes/{id}/replay` | Optional | Full event-by-event history replay |
| `GET` | `/dashboard` | None | Interactive Real-time Dashboard |
| `GET` | `/health` | None | System status and health check |
Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for production parity.
---
## Running with Docker
### Production Mode
```bash
docker compose up -d
# View logs: docker compose logs -f
```
### Direct Pull
```bash
docker run -p 7860:7860 ghcr.io/ArshVermaGit/open-ev-code-handler:latest
```
### Automated Testing
```bash
docker compose -f docker-compose.test.yml up
```
---
## Baseline Agent & Evaluation
### Single Scenario Trial
```bash
python scripts/baseline.py --task bug_detection --seed 3 --verbose
```
### Full Benchmark (All 30 Scenarios)
```bash
# Keyword-based baseline
python scripts/evaluate.py --agent keyword --output results.json
# LLM-powered reviewer (e.g. Claude)
python scripts/evaluate.py --agent llm --api-key $ANTHROPIC_API_KEY
```
---
## Writing Your Own Agent
CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:
```python
import requests
API = "http://localhost:7860"
# Start new episode
resp = requests.post(f"{API}/reset", json={"task_id": "bug_detection", "seed": 0})
episode_id = resp.json()["episode_id"]
done = False
while not done:
# Your agent logic analyzes the diff
action = {
"action_type": "flag_issue",
"body": "Identified a vulnerability line 14",
"filename": "api/search.py",
"line_number": 14,
"severity": "critical",
"category": "security"
}
result = requests.post(f"{API}/step/{episode_id}", json=action).json()
done = result["done"]
# Get final results
final = requests.get(f"{API}/result/{episode_id}").json()
print(f"Final Score: {final['final_score']}")
```
---
## Project Structure
```text
open-ev-code-handler/
โโโ app.py # FastAPI application (9 endpoints)
โโโ codelens_env/ # Core evaluation logic
โ โโโ database.py # SQLModel persistence layer
โ โโโ env.py # Episode state machine
โ โโโ models.py # Pydantic v2 data models
โ โโโ scenarios.py # 30 Synthetic PR scenarios
โ โโโ graders/ # Grader implementations (Bug, Sec, Arch)
โโโ scripts/ # CLI tools (baseline, evaluate, migrate)
โโโ static/ # Compiled dashboard assets
โโโ tests/ # 155+ Parametrized tests
โโโ Dockerfile # Multi-stage, non-root build
โโโ docker-compose.yml # Production orchestration
โโโ openenv.yaml # CodeLens v2 specification
```
---
## Development
```bash
# Setup
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Automated Tests
PYTHONPATH=. pytest tests/ -v --cov=codelens_env
# Linter Check
pylint codelens_env/ app.py
# Scenario Sanity Check
PYTHONPATH=. python scripts/validate.py
```
## Authors & Maintainers
CodeLens is authored and maintained by:
- **Arsh Verma** โ [GitHub](https://github.com/ArshVermaGit)
- **Divyansh Rawat** โ [GitHub](https://github.com/DsThakurRawat)
---
## Contributing & License
Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
This project is licensed under the **[MIT License](LICENSE)**.