Bump project version, add co-author, docs polish
Bump core version to 2.0.0 (FastAPI metadata and health_check) and update dashboard package metadata (version 2.0.0, add author). Add Divyansh Rawat as a co-author in LICENSE and owners in codelens.yaml. Polish README: reformat tables, clarify scoring and API reference, add Authors & Maintainers, fix example whitespace, and improve Docker/testing instructions for readability. These changes prepare a new release and improve attribution and documentation.
- LICENSE +1 -1
- README.md +34 -17
- app.py +2 -2
- codelens.yaml +1 -0
- dashboard/package.json +2 -1
LICENSE
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
MIT License
|
| 2 |
|
| 3 |
-
Copyright (c) 2024 Arsh Verma
|
| 4 |
|
| 5 |
Permission is hereby granted, free of charge, to any person obtaining a copy
|
| 6 |
of this software and associated documentation files (the "Software"), to deal
|
|
|
|
| 1 |
MIT License
|
| 2 |
|
| 3 |
+
Copyright (c) 2024 Arsh Verma, Divyansh Rawat
|
| 4 |
|
| 5 |
Permission is hereby granted, free of charge, to any person obtaining a copy
|
| 6 |
of this software and associated documentation files (the "Software"), to deal
|
README.md
CHANGED
|
@@ -40,46 +40,50 @@ PYTHONPATH=. python app.py
|
|
| 40 |
|
| 41 |
CodeLens benchmarks agents across three critical engineering domains:
|
| 42 |
|
| 43 |
-
| Task
|
| 44 |
-
|------
|
| 45 |
-
| `bug_detection`
|
| 46 |
-
| `security_audit`
|
| 47 |
-
| `architectural_review` | 10
|
| 48 |
|
| 49 |
---
|
| 50 |
|
| 51 |
## Scoring System
|
| 52 |
|
| 53 |
### Bug Detection
|
|
|
|
| 54 |
Score = `0.4 × coverage + 0.6 × avg_issue_score - 0.1 × false_positive_rate`
|
| 55 |
Issues are scored on **keyword accuracy** (50%) and **severity matching** (50%).
|
| 56 |
|
| 57 |
### Security Audit
|
|
|
|
| 58 |
Score = `avg(per_issue_score)` where each issue = `0.7 × severity_accuracy + 0.3 × keyword_coverage`.
|
| 59 |
Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as **LOW** incurs a major penalty.
|
| 60 |
|
| 61 |
### Architectural Review
|
|
|
|
| 62 |
Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
|
| 63 |
Detail quality rewards technical explanations that provide actionable developer feedback.
|
| 64 |
|
| 65 |
### Noise Budget
|
|
|
|
| 66 |
Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
|
| 67 |
|
| 68 |
---
|
| 69 |
|
| 70 |
## API Reference
|
| 71 |
|
| 72 |
-
| Method | Endpoint
|
| 73 |
-
|:-----
|
| 74 |
-
| `POST` | `/reset`
|
| 75 |
-
| `POST` | `/step/{id}`
|
| 76 |
-
| `GET`
|
| 77 |
-
| `GET`
|
| 78 |
-
| `POST` | `/submit`
|
| 79 |
-
| `GET`
|
| 80 |
-
| `GET`
|
| 81 |
-
| `GET`
|
| 82 |
-
| `GET`
|
| 83 |
|
| 84 |
Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for production parity.
|
| 85 |
|
|
@@ -88,17 +92,20 @@ Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for
|
|
| 88 |
## Running with Docker
|
| 89 |
|
| 90 |
### Production Mode
|
|
|
|
| 91 |
```bash
|
| 92 |
docker compose up -d
|
| 93 |
# View logs: docker compose logs -f
|
| 94 |
```
|
| 95 |
|
| 96 |
### Direct Pull
|
|
|
|
| 97 |
```bash
|
| 98 |
docker run -p 7860:7860 ghcr.io/ArshVermaGit/open-ev-code-handler:latest
|
| 99 |
```
|
| 100 |
|
| 101 |
### Automated Testing
|
|
|
|
| 102 |
```bash
|
| 103 |
docker compose -f docker-compose.test.yml up
|
| 104 |
```
|
|
@@ -108,11 +115,13 @@ docker compose -f docker-compose.test.yml up
|
|
| 108 |
## Baseline Agent & Evaluation
|
| 109 |
|
| 110 |
### Single Scenario Trial
|
|
|
|
| 111 |
```bash
|
| 112 |
python scripts/baseline.py --task bug_detection --seed 3 --verbose
|
| 113 |
```
|
| 114 |
|
| 115 |
### Full Benchmark (All 30 Scenarios)
|
|
|
|
| 116 |
```bash
|
| 117 |
# Keyword-based baseline
|
| 118 |
python scripts/evaluate.py --agent keyword --output results.json
|
|
@@ -147,7 +156,7 @@ while not done:
|
|
| 147 |
"severity": "critical",
|
| 148 |
"category": "security"
|
| 149 |
}
|
| 150 |
-
|
| 151 |
result = requests.post(f"{API}/step/{episode_id}", json=action).json()
|
| 152 |
done = result["done"]
|
| 153 |
|
|
@@ -196,9 +205,17 @@ pylint codelens_env/ app.py
|
|
| 196 |
PYTHONPATH=. python scripts/validate.py
|
| 197 |
```
|
| 198 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 199 |
---
|
| 200 |
|
| 201 |
## Contributing & License
|
|
|
|
| 202 |
Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
|
| 203 |
|
| 204 |
This project is licensed under the **[MIT License](LICENSE)**.
|
|
|
|
| 40 |
|
| 41 |
CodeLens benchmarks agents across three critical engineering domains:
|
| 42 |
|
| 43 |
+
| Task | Scenarios | Max Steps | Focus Area |
|
| 44 |
+
| ---------------------- | --------- | --------- | -------------------------------------------------------------------------- |
|
| 45 |
+
| `bug_detection` | 10 | 10 | Off-by-one errors, null dereferences, race conditions, exception handling |
|
| 46 |
+
| `security_audit` | 10 | 15 | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
|
| 47 |
+
| `architectural_review` | 10 | 20 | N+1 queries, god classes, blocking async calls, circular imports |
|
| 48 |
|
| 49 |
---
|
| 50 |
|
| 51 |
## Scoring System
|
| 52 |
|
| 53 |
### Bug Detection
|
| 54 |
+
|
| 55 |
Score = `0.4 × coverage + 0.6 × avg_issue_score - 0.1 × false_positive_rate`
|
| 56 |
Issues are scored on **keyword accuracy** (50%) and **severity matching** (50%).
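As a rough illustration of the formula above, the following sketch computes the bug-detection score from its three inputs. The function and parameter names are hypothetical and not taken from the CodeLens source, and the sketch assumes every input is already normalized to the range 0 to 1.

```python
def issue_score(keyword_accuracy: float, severity_match: float) -> float:
    """Per-issue score: keyword accuracy and severity matching weigh 50% each."""
    return 0.5 * keyword_accuracy + 0.5 * severity_match


def bug_detection_score(coverage: float, avg_issue_score: float,
                        false_positive_rate: float) -> float:
    """Weighted sum from the README formula (hypothetical helper, no clamping)."""
    return 0.4 * coverage + 0.6 * avg_issue_score - 0.1 * false_positive_rate


if __name__ == "__main__":
    # Two flagged issues: one perfect match, one with half the expected keywords.
    scores = [issue_score(1.0, 1.0), issue_score(0.5, 1.0)]
    avg = sum(scores) / len(scores)
    print(round(bug_detection_score(coverage=0.5, avg_issue_score=avg,
                                    false_positive_rate=0.2), 3))
```

Whether the real scorer clamps the result or rounds it is not stated in the README, so the sketch leaves the raw weighted sum untouched.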
|
| 57 |
|
| 58 |
### Security Audit
|
| 59 |
+
|
| 60 |
Score = `avg(per_issue_score)` where each issue = `0.7 × severity_accuracy + 0.3 × keyword_coverage`.
|
| 61 |
Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as **LOW** incurs a major penalty.
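One way to picture distance-weighted severity accuracy is a linear penalty over an ordered severity scale. The sketch below is an assumption, not the CodeLens implementation: the four-level scale, the linear weighting, the score of 0.0 for an empty issue list, and all names are hypothetical.

```python
# Assumed ordering of severity levels; the README only names LOW and CRITICAL.
SEVERITY_LEVELS = ["low", "medium", "high", "critical"]


def severity_accuracy(expected: str, reported: str) -> float:
    """1.0 for an exact match, falling linearly to 0.0 at the maximum distance."""
    i = SEVERITY_LEVELS.index(expected.lower())
    j = SEVERITY_LEVELS.index(reported.lower())
    return 1.0 - abs(i - j) / (len(SEVERITY_LEVELS) - 1)


def security_audit_score(issues) -> float:
    """Average per-issue score.

    issues: list of (expected_severity, reported_severity, keyword_coverage).
    """
    if not issues:
        return 0.0  # assumption: an episode with no scored issues earns nothing
    per_issue = [0.7 * severity_accuracy(e, r) + 0.3 * k for e, r, k in issues]
    return sum(per_issue) / len(per_issue)
```

Under this assumed scale, reporting a CRITICAL issue as LOW yields a severity accuracy of 0.0, which matches the "major penalty" described above.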
|
| 62 |
|
| 63 |
### Architectural Review
|
| 64 |
+
|
| 65 |
Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
|
| 66 |
Detail quality rewards technical explanations that provide actionable developer feedback.
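The architectural-review weighting can be sketched the same way. The helper name is hypothetical, and the sketch assumes each component is a value between 0 and 1; how CodeLens derives `detail_quality` is not described here.

```python
def architectural_review_score(detection_rate: float, verdict_accuracy: float,
                               detail_quality: float) -> float:
    """Weighted sum from the README formula (hypothetical helper)."""
    return 0.6 * detection_rate + 0.2 * verdict_accuracy + 0.2 * detail_quality
```

Because detection carries 60% of the weight, an agent that gives the right verdict with detailed feedback but finds no issues is capped at 0.4 under this formula.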
|
| 67 |
|
| 68 |
### Noise Budget
|
| 69 |
+
|
| 70 |
Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
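The noise budget behaves like a small counter. The following is a minimal sketch under that reading; the class and method names are hypothetical and the real environment may track credits differently.

```python
from dataclasses import dataclass


@dataclass
class NoiseBudget:
    """Tracks false-positive credits for one episode (hypothetical helper)."""
    credits: int = 5  # starting value taken from the README text

    def spend(self) -> bool:
        """Spend one credit; return True when the episode should terminate."""
        if self.credits > 0:
            self.credits -= 1
        return self.credits == 0


if __name__ == "__main__":
    budget = NoiseBudget()
    for attempt in range(1, 6):
        terminated = budget.spend()
        print(attempt, budget.credits, terminated)
```

In this sketch the fifth false positive is the one that ends the episode, since it brings the remaining credits to zero.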
|
| 71 |
|
| 72 |
---
|
| 73 |
|
| 74 |
## API Reference
|
| 75 |
|
| 76 |
+
| Method | Endpoint | Auth | Description |
|
| 77 |
+
| :----- | :---------------------- | :------- | :-------------------------------------------- |
|
| 78 |
+
| `POST` | `/reset` | Optional | Start a new evaluation episode |
|
| 79 |
+
| `POST` | `/step/{id}` | Optional | Submit a review action (flag_issue, approve) |
|
| 80 |
+
| `GET` | `/result/{id}` | Optional | Retrieve final scores and logs for an episode |
|
| 81 |
+
| `GET` | `/leaderboard` | None | Paginated performance rankings |
|
| 82 |
+
| `POST` | `/submit` | Optional | Persist an episode result to the leaderboard |
|
| 83 |
+
| `GET` | `/stats` | None | Aggregate statistics across all agents |
|
| 84 |
+
| `GET` | `/episodes/{id}/replay` | Optional | Full event-by-event history replay |
|
| 85 |
+
| `GET` | `/dashboard` | None | Interactive Real-time Dashboard |
|
| 86 |
+
| `GET` | `/health` | None | System status and health check |
|
| 87 |
|
| 88 |
Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for production parity.
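For clients, the endpoint table can be turned into URLs with a couple of small helpers. This is a sketch only: the host is assumed to be `localhost`, the port comes from the `docker run` example, and the `X-API-Key` header name is a hypothetical placeholder because the README does not say how the key is transmitted.

```python
from typing import Dict, Optional

# Port 7860 comes from the docker run example; localhost is an assumption.
BASE_URL = "http://localhost:7860"


def endpoint_url(template: str, episode_id: Optional[str] = None) -> str:
    """Fill an endpoint template such as '/step/{id}' from the API table."""
    if "{id}" in template:
        if episode_id is None:
            raise ValueError(f"{template} requires an episode id")
        template = template.replace("{id}", episode_id)
    return BASE_URL + template


def auth_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Return request headers; empty when authentication is disabled."""
    # "X-API-Key" is a hypothetical header name, not confirmed by the README.
    return {"X-API-Key": api_key} if api_key else {}
```

With the `requests` library, a health check would then look like `requests.get(endpoint_url("/health"), headers=auth_headers())`.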
|
| 89 |
|
|
|
|
| 92 |
## Running with Docker
|
| 93 |
|
| 94 |
### Production Mode
|
| 95 |
+
|
| 96 |
```bash
|
| 97 |
docker compose up -d
|
| 98 |
# View logs: docker compose logs -f
|
| 99 |
```
|
| 100 |
|
| 101 |
### Direct Pull
|
| 102 |
+
|
| 103 |
```bash
|
| 104 |
docker run -p 7860:7860 ghcr.io/ArshVermaGit/open-ev-code-handler:latest
|
| 105 |
```
|
| 106 |
|
| 107 |
### Automated Testing
|
| 108 |
+
|
| 109 |
```bash
|
| 110 |
docker compose -f docker-compose.test.yml up
|
| 111 |
```
|
|
|
|
| 115 |
## Baseline Agent & Evaluation
|
| 116 |
|
| 117 |
### Single Scenario Trial
|
| 118 |
+
|
| 119 |
```bash
|
| 120 |
python scripts/baseline.py --task bug_detection --seed 3 --verbose
|
| 121 |
```
|
| 122 |
|
| 123 |
### Full Benchmark (All 30 Scenarios)
|
| 124 |
+
|
| 125 |
```bash
|
| 126 |
# Keyword-based baseline
|
| 127 |
python scripts/evaluate.py --agent keyword --output results.json
|
|
|
|
| 156 |
"severity": "critical",
|
| 157 |
"category": "security"
|
| 158 |
}
|
| 159 |
+
|
| 160 |
result = requests.post(f"{API}/step/{episode_id}", json=action).json()
|
| 161 |
done = result["done"]
|
| 162 |
|
|
|
|
| 205 |
PYTHONPATH=. python scripts/validate.py
|
| 206 |
```
|
| 207 |
|
| 208 |
+
## Authors & Maintainers
|
| 209 |
+
|
| 210 |
+
CodeLens is authored and maintained by:
|
| 211 |
+
|
| 212 |
+
- **Arsh Verma** – [GitHub](https://github.com/ArshVermaGit)
|
| 213 |
+
- **Divyansh Rawat** – [GitHub](https://github.com/DsThakurRawat)
|
| 214 |
+
|
| 215 |
---
|
| 216 |
|
| 217 |
## Contributing & License
|
| 218 |
+
|
| 219 |
Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
|
| 220 |
|
| 221 |
This project is licensed under the **[MIT License](LICENSE)**.
|
app.py
CHANGED
|
@@ -64,7 +64,7 @@ app = FastAPI(
|
|
| 64 |
"Trains agents to detect bugs, security vulnerabilities, and architectural issues "
|
| 65 |
"in realistic Python PRs."
|
| 66 |
),
|
| 67 |
-
version="
|
| 68 |
lifespan=lifespan,
|
| 69 |
)
|
| 70 |
|
|
@@ -169,7 +169,7 @@ async def http_exception_handler(request, exc):
|
|
| 169 |
def health_check():
|
| 170 |
return {
|
| 171 |
"status": "ok",
|
| 172 |
-
"version": "
|
| 173 |
"env_ready": True,
|
| 174 |
"env": settings.app_env,
|
| 175 |
"active_episodes": len(episodes),
|
|
|
|
| 64 |
"Trains agents to detect bugs, security vulnerabilities, and architectural issues "
|
| 65 |
"in realistic Python PRs."
|
| 66 |
),
|
| 67 |
+
version="2.0.0",
|
| 68 |
lifespan=lifespan,
|
| 69 |
)
|
| 70 |
|
|
|
|
| 169 |
def health_check():
|
| 170 |
return {
|
| 171 |
"status": "ok",
|
| 172 |
+
"version": "2.0.0",
|
| 173 |
"env_ready": True,
|
| 174 |
"env": settings.app_env,
|
| 175 |
"active_episodes": len(episodes),
|
codelens.yaml
CHANGED
|
@@ -1,5 +1,6 @@
|
|
| 1 |
version: "2.0"
|
| 2 |
name: "agentorg-codereview"
|
|
|
|
| 3 |
description: >
|
| 4 |
AI Senior Code Reviewer evaluation environment for CodeLens.
|
| 5 |
Benchmarks agents on 30 synthetic pull requests across Bug Detection,
|
|
|
|
| 1 |
version: "2.0"
|
| 2 |
name: "agentorg-codereview"
|
| 3 |
+
owners: ["Arsh Verma", "Divyansh Rawat"]
|
| 4 |
description: >
|
| 5 |
AI Senior Code Reviewer evaluation environment for CodeLens.
|
| 6 |
Benchmarks agents on 30 synthetic pull requests across Bug Detection,
|
dashboard/package.json
CHANGED
|
@@ -1,7 +1,8 @@
|
|
| 1 |
{
|
| 2 |
"name": "codelens-dashboard",
|
|
|
|
| 3 |
"private": true,
|
| 4 |
-
"
|
| 5 |
"type": "module",
|
| 6 |
"scripts": {
|
| 7 |
"dev": "vite",
|
|
|
|
| 1 |
{
|
| 2 |
"name": "codelens-dashboard",
|
| 3 |
+
"version": "2.0.0",
|
| 4 |
"private": true,
|
| 5 |
+
"author": "Arsh Verma, Divyansh Rawat",
|
| 6 |
"type": "module",
|
| 7 |
"scripts": {
|
| 8 |
"dev": "vite",
|