ArshVerma committed
Commit 3e1edbb · 1 Parent(s): 4b66647

Bump project version, add co-author, docs polish

Bump core version to 2.0.0 (FastAPI metadata and health_check) and update dashboard package metadata (version 2.0.0, add author). Add Divyansh Rawat as a co-author in LICENSE and owners in codelens.yaml. Polish README: reformat tables, clarify scoring and API reference, add Authors & Maintainers, fix example whitespace, and improve Docker/testing instructions for readability. These changes prepare a new release and improve attribution and documentation.
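The version bump described above touches two hard-coded strings in `app.py` (the FastAPI metadata and the `health_check` payload). One way to keep them in step is a shared constant. The following is a minimal sketch, not the project's code: `APP_VERSION` and `fastapi_kwargs` are hypothetical names, and the `episodes` and `app_env` parameters stand in for the module-level `episodes` and `settings.app_env` that appear in the diff.

```python
# Hypothetical single source of truth for the release version.
APP_VERSION = "2.0.0"


def fastapi_kwargs():
    """Keyword arguments that could be passed to FastAPI(...)."""
    return {"version": APP_VERSION}


def health_check(episodes, app_env):
    """Build the health payload with the same keys as the diff.

    `episodes` and `app_env` are stand-ins (assumptions) for the
    module-level `episodes` and `settings.app_env` used in app.py.
    """
    return {
        "status": "ok",
        "version": APP_VERSION,
        "env_ready": True,
        "env": app_env,
        "active_episodes": len(episodes),
    }
```

With this shape, a future bump would change one line instead of two.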

Files changed (5)
  1. LICENSE +1 -1
  2. README.md +34 -17
  3. app.py +2 -2
  4. codelens.yaml +1 -0
  5. dashboard/package.json +2 -1
LICENSE CHANGED
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2024 Arsh Verma
+Copyright (c) 2024 Arsh Verma, Divyansh Rawat
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
README.md CHANGED
@@ -40,46 +40,50 @@ PYTHONPATH=. python app.py
 
 CodeLens benchmarks agents across three critical engineering domains:
 
-| Task | Scenarios | Max Steps | Focus Area |
-|------|-----------|-----------|------------|
-| `bug_detection` | 10 | 10 | Off-by-one errors, null dereferences, race conditions, exception handling |
-| `security_audit` | 10 | 15 | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
-| `architectural_review` | 10 | 20 | N+1 queries, god classes, blocking async calls, circular imports |
+| Task | Scenarios | Max Steps | Focus Area |
+| ---------------------- | --------- | --------- | -------------------------------------------------------------------------- |
+| `bug_detection` | 10 | 10 | Off-by-one errors, null dereferences, race conditions, exception handling |
+| `security_audit` | 10 | 15 | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
+| `architectural_review` | 10 | 20 | N+1 queries, god classes, blocking async calls, circular imports |
 
 ---
 
 ## 📈 Scoring System
 
 ### Bug Detection
+
 Score = `0.4 × coverage + 0.6 × avg_issue_score − 0.1 × false_positive_rate`
 Issues are scored on **keyword accuracy** (50%) and **severity matching** (50%).
 
 ### Security Audit
+
 Score = `avg(per_issue_score)` where each issue = `0.7 × severity_accuracy + 0.3 × keyword_coverage`.
 Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as **LOW** incurs a major penalty.
 
 ### Architectural Review
+
 Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
 Detail quality rewards technical explanations that provide actionable developer feedback.
 
 ### 🛑 Noise Budget
+
 Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
 
 ---
 
 ## 🔌 API Reference
 
-| Method | Endpoint | Auth | Description |
-|:-------|:---------|:-----|:------------|
-| `POST` | `/reset` | Optional | Start a new evaluation episode |
-| `POST` | `/step/{id}` | Optional | Submit a review action (flag_issue, approve) |
-| `GET` | `/result/{id}` | Optional | Retrieve final scores and logs for an episode |
-| `GET` | `/leaderboard` | None | Paginated performance rankings |
-| `POST` | `/submit` | Optional | Persist an episode result to the leaderboard |
-| `GET` | `/stats` | None | Aggregate statistics across all agents |
-| `GET` | `/episodes/{id}/replay` | Optional | Full event-by-event history replay |
-| `GET` | `/dashboard` | None | Interactive Real-time Dashboard |
-| `GET` | `/health` | None | System status and health check |
+| Method | Endpoint | Auth | Description |
+| :----- | :---------------------- | :------- | :-------------------------------------------- |
+| `POST` | `/reset` | Optional | Start a new evaluation episode |
+| `POST` | `/step/{id}` | Optional | Submit a review action (flag_issue, approve) |
+| `GET` | `/result/{id}` | Optional | Retrieve final scores and logs for an episode |
+| `GET` | `/leaderboard` | None | Paginated performance rankings |
+| `POST` | `/submit` | Optional | Persist an episode result to the leaderboard |
+| `GET` | `/stats` | None | Aggregate statistics across all agents |
+| `GET` | `/episodes/{id}/replay` | Optional | Full event-by-event history replay |
+| `GET` | `/dashboard` | None | Interactive Real-time Dashboard |
+| `GET` | `/health` | None | System status and health check |
 
 Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for production parity.
 
@@ -88,17 +92,20 @@ Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for
 ## 🐳 Running with Docker
 
 ### Production Mode
+
 ```bash
 docker compose up -d
 # View logs: docker compose logs -f
 ```
 
 ### Direct Pull
+
 ```bash
 docker run -p 7860:7860 ghcr.io/ArshVermaGit/open-ev-code-handler:latest
 ```
 
 ### Automated Testing
+
 ```bash
 docker compose -f docker-compose.test.yml up
 ```
@@ -108,11 +115,13 @@ docker compose -f docker-compose.test.yml up
 ## 🤖 Baseline Agent & Evaluation
 
 ### Single Scenario Trial
+
 ```bash
 python scripts/baseline.py --task bug_detection --seed 3 --verbose
 ```
 
 ### Full Benchmark (All 30 Scenarios)
+
 ```bash
 # Keyword-based baseline
 python scripts/evaluate.py --agent keyword --output results.json
@@ -147,7 +156,7 @@ while not done:
         "severity": "critical",
         "category": "security"
     }
-
+
     result = requests.post(f"{API}/step/{episode_id}", json=action).json()
     done = result["done"]
 
@@ -196,9 +205,17 @@ pylint codelens_env/ app.py
 PYTHONPATH=. python scripts/validate.py
 ```
 
+## 👥 Authors & Maintainers
+
+CodeLens is authored and maintained by:
+
+- **Arsh Verma** — [GitHub](https://github.com/ArshVermaGit)
+- **Divyansh Rawat** — [GitHub](https://github.com/DsThakurRawat)
+
 ---
 
 ## 📄 Contributing & License
+
 Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
 
 This project is licensed under the **[MIT License](LICENSE)**.
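Reviewer note on the scoring section reformatted in this README diff: the three formulas and the noise budget can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the CodeLens graders. All function and class names are hypothetical, every input is assumed to lie in [0, 1], the 0.0 fallback for empty input is an assumption, and no clamping is applied because the README does not mention any.

```python
from statistics import mean


def bug_issue_score(keyword_accuracy, severity_match):
    """Per-issue score: keyword accuracy (50%) and severity matching (50%)."""
    return 0.5 * keyword_accuracy + 0.5 * severity_match


def bug_detection_score(coverage, issue_scores, false_positive_rate):
    """0.4 x coverage + 0.6 x avg_issue_score - 0.1 x false_positive_rate."""
    # Assumption: an episode with no scored issues averages to 0.0.
    avg_issue_score = mean(issue_scores) if issue_scores else 0.0
    return 0.4 * coverage + 0.6 * avg_issue_score - 0.1 * false_positive_rate


def security_audit_score(issues):
    """Average of 0.7 x severity_accuracy + 0.3 x keyword_coverage.

    `issues` is a list of (severity_accuracy, keyword_coverage) pairs.
    """
    if not issues:
        return 0.0  # assumption, not stated in the README
    return mean(0.7 * sev + 0.3 * kw for sev, kw in issues)


def architectural_review_score(detection_rate, verdict_accuracy, detail_quality):
    """0.6 x detection_rate + 0.2 x verdict_accuracy + 0.2 x detail_quality."""
    return 0.6 * detection_rate + 0.2 * verdict_accuracy + 0.2 * detail_quality


class NoiseBudget:
    """Tracks the 5 false positive credits allowed per episode."""

    def __init__(self, credits=5):
        self.credits = credits

    def spend(self):
        """Spend one credit; return True when the episode should terminate."""
        self.credits = max(0, self.credits - 1)
        return self.credits == 0
```

For example, full coverage with issue scores of 1.0 and 0.5 and a false positive rate of 0.2 gives 0.4 + 0.45 − 0.02 = 0.83 under the Bug Detection formula.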
app.py CHANGED
@@ -64,7 +64,7 @@ app = FastAPI(
         "Trains agents to detect bugs, security vulnerabilities, and architectural issues "
         "in realistic Python PRs."
     ),
-    version="1.0.0",
+    version="2.0.0",
     lifespan=lifespan,
 )
 
@@ -169,7 +169,7 @@ async def http_exception_handler(request, exc):
 def health_check():
     return {
         "status": "ok",
-        "version": "1.0.0",
+        "version": "2.0.0",
         "env_ready": True,
         "env": settings.app_env,
         "active_episodes": len(episodes),
codelens.yaml CHANGED
@@ -1,5 +1,6 @@
 version: "2.0"
 name: "agentorg-codereview"
+owners: ["Arsh Verma", "Divyansh Rawat"]
 description: >
   AI Senior Code Reviewer evaluation environment for CodeLens.
   Benchmarks agents on 30 synthetic pull requests across Bug Detection,
dashboard/package.json CHANGED
@@ -1,7 +1,8 @@
 {
   "name": "codelens-dashboard",
+  "version": "2.0.0",
   "private": true,
-  "version": "0.1.0",
+  "author": "Arsh Verma, Divyansh Rawat",
   "type": "module",
   "scripts": {
     "dev": "vite",