AIMLxDIV committed on
Commit
74df718
·
1 Parent(s): 9486e76

chore: updated configs and formatting to meet OpenEnv specs

CHANGELOG.md CHANGED
@@ -37,7 +37,7 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
37
  - **CLI**: Port mismatch in `baseline.py` (8000 → 7860) and added `--url`, `--task`, `--seed` CLI flags.
38
  - **Crash Fixes**: Leaderboard submit crash after list slicing (captured rank before slice).
39
  - **WebSocket**: Disconnect now handled with typed `WebSocketDisconnect` and `clients.discard()`.
40
- - **Metadata**: Incoherent weight structure in `codelens.yaml` replaced with named, accurate pairs.
41
  - **Security**: Implemented `TrustedHostMiddleware` and hardened headers.
42
 
43
  ## [0.1.0] - Initial Baseline Fork
 
37
  - **CLI**: Port mismatch in `baseline.py` (8000 → 7860) and added `--url`, `--task`, `--seed` CLI flags.
38
  - **Crash Fixes**: Leaderboard submit crash after list slicing (captured rank before slice).
39
  - **WebSocket**: Disconnect now handled with typed `WebSocketDisconnect` and `clients.discard()`.
40
+ - **Metadata**: Incoherent weight structure in `openenv.yaml` replaced with named, accurate pairs.
41
  - **Security**: Implemented `TrustedHostMiddleware` and hardened headers.
42
 
43
  ## [0.1.0] - Initial Baseline Fork
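
The **Crash Fixes** entry above describes capturing the rank before slicing the leaderboard. The repository's actual submit code is not part of this diff, so the following is only a minimal sketch of that pattern; `submit_score`, the `score` key, and `top_n` are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple


def submit_score(board: List[Dict], entry: Dict, top_n: int = 10) -> Tuple[List[Dict], int]:
    """Add `entry` to a leaderboard and return (trimmed board, 1-based rank).

    Hypothetical helper: the real CodeLens submit handler is not shown in the diff.
    """
    # Sort highest score first; sorted() is stable, so ties keep insertion order.
    ranked = sorted(board + [entry], key=lambda e: e["score"], reverse=True)
    # Capture the rank while the new entry is still guaranteed to be in the list.
    rank = next(i for i, e in enumerate(ranked) if e is entry) + 1
    # Trim only afterwards; looking the entry up after the slice would fail
    # whenever it lands outside the top_n window.
    return ranked[:top_n], rank
```

With a two-entry window and a low-scoring submission, the entry is dropped from the trimmed board but its rank (3) is still reported.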
CONTRIBUTING.md CHANGED
@@ -4,7 +4,7 @@ Welcome! We appreciate contributions of all kinds. Here's how to get started.
4
 
5
  ---
6
 
7
- ## 🏗️ Development Setup
8
 
9
  To get started with local development:
10
 
@@ -29,7 +29,7 @@ To get started with local development:
29
 
30
  ---
31
 
32
- ## πŸ“ Adding a New Scenario
33
 
34
  Scenarios live in `codelens_env/scenarios.py`. Each scenario needs:
35
 
@@ -65,7 +65,7 @@ All 30 (or more) scenarios must pass validation.
65
 
66
  ---
67
 
68
- ## 🚀 Pull Request Process
69
 
70
  1. Fork the repo and create a branch: `feat/my-feature`, `fix/my-bug`, `test/more-tests`
71
  2. Make your changes
@@ -75,7 +75,7 @@ All 30 (or more) scenarios must pass validation.
75
 
76
  ---
77
 
78
- ## 📄 Code Style
79
 
80
  - **Type hints** on all public functions and methods
81
  - **Docstrings** on all public classes and non-trivial functions
@@ -85,7 +85,7 @@ All 30 (or more) scenarios must pass validation.
85
 
86
  ---
87
 
88
- ## πŸ“ Commit Message Format
89
 
90
  We use [Conventional Commits](https://www.conventionalcommits.org/):
91
 
 
4
 
5
  ---
6
 
7
+ ## Development Setup
8
 
9
  To get started with local development:
10
 
 
29
 
30
  ---
31
 
32
+ ## Adding a New Scenario
33
 
34
  Scenarios live in `codelens_env/scenarios.py`. Each scenario needs:
35
 
 
65
 
66
  ---
67
 
68
+ ## Pull Request Process
69
 
70
  1. Fork the repo and create a branch: `feat/my-feature`, `fix/my-bug`, `test/more-tests`
71
  2. Make your changes
 
75
 
76
  ---
77
 
78
+ ## Code Style
79
 
80
  - **Type hints** on all public functions and methods
81
  - **Docstrings** on all public classes and non-trivial functions
 
85
 
86
  ---
87
 
88
+ ## Commit Message Format
89
 
90
  We use [Conventional Commits](https://www.conventionalcommits.org/):
91
 
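
The CONTRIBUTING hunks above point to Conventional Commits for commit messages. As a rough illustration of the header shape that convention expects (`type(scope)!: description`), here is a small checker sketch. The allowed type list is an assumption borrowed from common practice, not something this repository defines, and `is_conventional` is a hypothetical helper.

```python
import re

# Assumed allow-list; the Conventional Commits spec itself only mandates
# "feat" and "fix" and permits other types.
TYPES = ("feat", "fix", "docs", "style", "refactor", "perf",
         "test", "build", "ci", "chore", "revert")

HEADER_RE = re.compile(
    r"^(?P<type>%s)"                 # commit type from the allow-list
    r"(?:\((?P<scope>[^()\s]+)\))?"  # optional scope in parentheses
    r"(?P<breaking>!)?"              # optional breaking-change marker
    r": (?P<description>\S.*)$"      # colon, one space, non-empty description
    % "|".join(TYPES)
)


def is_conventional(header: str) -> bool:
    """Return True if the first line of a commit message matches the pattern."""
    return HEADER_RE.match(header) is not None
```

A header such as `feat(env)!: add state endpoint` passes, while one with a space before the colon or no space after it is rejected.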
DEPLOYMENT.md CHANGED
@@ -4,7 +4,7 @@ Follow this guide to deploy **CodeLens. v1.0.0** to the professional cloud. This
4
 
5
  ---
6
 
7
- ## 1. 🗄️ Setup the Database (PostgreSQL)
8
 
9
  Since SQLite is disk-based and will be deleted at every restart on Render/Vercel, you **must** use a managed PostgreSQL service.
10
 
@@ -15,7 +15,7 @@ Since SQLite is disk-based and will be deleted at every restart on Render/Vercel
15
 
16
  ---
17
 
18
- ## 2. 🚀 Setup the Backend (Render)
19
 
20
  Render will host your FastAPI API and your Dockerized environment.
21
 
@@ -33,7 +33,7 @@ Render will host your FastAPI API and your Dockerized environment.
33
 
34
  ---
35
 
36
- ## 3. 🎨 Setup the Frontend (Vercel)
37
 
38
  Vercel will host your React/Vite dashboard.
39
 
@@ -46,7 +46,7 @@ Vercel will host your React/Vite dashboard.
46
 
47
  ---
48
 
49
- ## 4. 🤖 Running Remote Evaluations
50
 
51
  Once deployed, you can run the benchmark script from your local machine (or any CI) against your **production** instance:
52
 
 
4
 
5
  ---
6
 
7
+ ## 1. Setup the Database (PostgreSQL)
8
 
9
  Since SQLite is disk-based and will be deleted at every restart on Render/Vercel, you **must** use a managed PostgreSQL service.
10
 
 
15
 
16
  ---
17
 
18
+ ## 2. Setup the Backend (Render)
19
 
20
  Render will host your FastAPI API and your Dockerized environment.
21
 
 
33
 
34
  ---
35
 
36
+ ## 3. Setup the Frontend (Vercel)
37
 
38
  Vercel will host your React/Vite dashboard.
39
 
 
46
 
47
  ---
48
 
49
+ ## 4. Running Remote Evaluations
50
 
51
  Once deployed, you can run the benchmark script from your local machine (or any CI) against your **production** instance:
52
 
GET_STARTED.md CHANGED
@@ -1,4 +1,4 @@
1
- # 🚀 Getting Started with CodeLens.
2
 
3
  Welcome to **CodeLens.**, a production-grade AI agent evaluation environment. This guide will help you get up and running in less than 2 minutes.
4
 
@@ -44,7 +44,7 @@ PYTHONPATH=. python app.py
44
 
45
  Once the server is running, you can access the CodeLens Dashboard at:
46
 
47
- 👉 **[http://localhost:7860/dashboard](http://localhost:7860/dashboard)**
48
 
49
  From here, you can see the top-10 leaderboard and monitor real-time agent evaluations via the live event feed.
50
 
@@ -64,7 +64,7 @@ python scripts/evaluate.py --agent keyword
64
 
65
  ---
66
 
67
- ## 🧪 Running Tests
68
 
69
  To verify everything is working perfectly, you can run the full 155-test suite:
70
 
@@ -74,7 +74,7 @@ PYTHONPATH=. pytest tests/ -v
74
 
75
  ---
76
 
77
- ## 🛠️ Troubleshooting
78
 
79
  ### 1. `ModuleNotFoundError: No module named 'requests'`
80
  This happens if you haven't activated the virtual environment in your current terminal tab.
@@ -90,7 +90,7 @@ If the logo shows a broken image placeholder:
90
 
91
  ---
92
 
93
- ## 🀝 Next Steps
94
 
95
  - **Add Scenarios**: Learn how to author new code review benchmarks in **[CONTRIBUTING.md](CONTRIBUTING.md)**.
96
  - **Batch Evaluation**: Scale up from single evaluations to full 30-scenario reports using `scripts/evaluate.py`.
 
1
+ # Getting Started with CodeLens.
2
 
3
  Welcome to **CodeLens.**, a production-grade AI agent evaluation environment. This guide will help you get up and running in less than 2 minutes.
4
 
 
44
 
45
  Once the server is running, you can access the CodeLens Dashboard at:
46
 
47
+ **[http://localhost:7860/dashboard](http://localhost:7860/dashboard)**
48
 
49
  From here, you can see the top-10 leaderboard and monitor real-time agent evaluations via the live event feed.
50
 
 
64
 
65
  ---
66
 
67
+ ## Running Tests
68
 
69
  To verify everything is working perfectly, you can run the full 155-test suite:
70
 
 
74
 
75
  ---
76
 
77
+ ## Troubleshooting
78
 
79
  ### 1. `ModuleNotFoundError: No module named 'requests'`
80
  This happens if you haven't activated the virtual environment in your current terminal tab.
 
90
 
91
  ---
92
 
93
+ ## Next Steps
94
 
95
  - **Add Scenarios**: Learn how to author new code review benchmarks in **[CONTRIBUTING.md](CONTRIBUTING.md)**.
96
  - **Batch Evaluation**: Scale up from single evaluations to full 30-scenario reports using `scripts/evaluate.py`.
README.md CHANGED
@@ -17,7 +17,7 @@ Designed for researchers and developers building the next generation of AI code
17
 
18
  ---
19
 
20
- ## 🚀 Quick Start
21
 
22
  Get up and running locally in under 2 minutes:
23
 
@@ -36,7 +36,7 @@ PYTHONPATH=. python app.py
36
 
37
  ---
38
 
39
- ## 📋 Evaluation Tasks
40
 
41
  CodeLens benchmarks agents across three critical engineering domains:
42
 
@@ -48,7 +48,7 @@ CodeLens benchmarks agents across three critical engineering domains:
48
 
49
  ---
50
 
51
- ## 📈 Scoring System
52
 
53
  ### Bug Detection
54
 
@@ -65,13 +65,13 @@ Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as *
65
  Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
66
  Detail quality rewards technical explanations that provide actionable developer feedback.
67
 
68
- ### 🛑 Noise Budget
69
 
70
  Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
71
 
72
  ---
73
 
74
- ## 🔌 API Reference
75
 
76
  | Method | Endpoint | Auth | Description |
77
  | :----- | :---------------------- | :------- | :-------------------------------------------- |
@@ -89,7 +89,7 @@ Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for
89
 
90
  ---
91
 
92
- ## 🐳 Running with Docker
93
 
94
  ### Production Mode
95
 
@@ -112,7 +112,7 @@ docker compose -f docker-compose.test.yml up
112
 
113
  ---
114
 
115
- ## 🤖 Baseline Agent & Evaluation
116
 
117
  ### Single Scenario Trial
118
 
@@ -132,7 +132,7 @@ python scripts/evaluate.py --agent llm --api-key $ANTHROPIC_API_KEY
132
 
133
  ---
134
 
135
- ## 🧠 Writing Your Own Agent
136
 
137
  CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:
138
 
@@ -167,7 +167,7 @@ print(f"Final Score: {final['final_score']}")
167
 
168
  ---
169
 
170
- ## 📂 Project Structure
171
 
172
  ```text
173
  open-ev-code-handler/
@@ -183,12 +183,12 @@ open-ev-code-handler/
183
  ├── tests/ # 155+ Parametrized tests
184
  ├── Dockerfile # Multi-stage, non-root build
185
  ├── docker-compose.yml # Production orchestration
186
- └── codelens.yaml # CodeLens v2 specification
187
  ```
188
 
189
  ---
190
 
191
- ## 🛠️ Development
192
 
193
  ```bash
194
  # Setup
@@ -205,7 +205,7 @@ pylint codelens_env/ app.py
205
  PYTHONPATH=. python scripts/validate.py
206
  ```
207
 
208
- ## 👥 Authors & Maintainers
209
 
210
  CodeLens is authored and maintained by:
211
 
@@ -214,7 +214,7 @@ CodeLens is authored and maintained by:
214
 
215
  ---
216
 
217
- ## 📄 Contributing & License
218
 
219
  Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
220
 
 
17
 
18
  ---
19
 
20
+ ## Quick Start
21
 
22
  Get up and running locally in under 2 minutes:
23
 
 
36
 
37
  ---
38
 
39
+ ## Evaluation Tasks
40
 
41
  CodeLens benchmarks agents across three critical engineering domains:
42
 
 
48
 
49
  ---
50
 
51
+ ## Scoring System
52
 
53
  ### Bug Detection
54
 
 
65
  Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
66
  Detail quality rewards technical explanations that provide actionable developer feedback.
67
 
68
+ ### Noise Budget
69
 
70
  Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
71
 
72
  ---
73
 
74
+ ## API Reference
75
 
76
  | Method | Endpoint | Auth | Description |
77
  | :----- | :---------------------- | :------- | :-------------------------------------------- |
 
89
 
90
  ---
91
 
92
+ ## Running with Docker
93
 
94
  ### Production Mode
95
 
 
112
 
113
  ---
114
 
115
+ ## Baseline Agent & Evaluation
116
 
117
  ### Single Scenario Trial
118
 
 
132
 
133
  ---
134
 
135
+ ## Writing Your Own Agent
136
 
137
  CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:
138
 
 
167
 
168
  ---
169
 
170
+ ## Project Structure
171
 
172
  ```text
173
  open-ev-code-handler/
 
183
  ├── tests/ # 155+ Parametrized tests
184
  ├── Dockerfile # Multi-stage, non-root build
185
  ├── docker-compose.yml # Production orchestration
186
+ └── openenv.yaml # CodeLens v2 specification
187
  ```
188
 
189
  ---
190
 
191
+ ## Development
192
 
193
  ```bash
194
  # Setup
 
205
  PYTHONPATH=. python scripts/validate.py
206
  ```
207
 
208
+ ## Authors & Maintainers
209
 
210
  CodeLens is authored and maintained by:
211
 
 
214
 
215
  ---
216
 
217
+ ## Contributing & License
218
 
219
  Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
220
 
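
The README hunks above quote a weighted score (0.6 detection rate, 0.2 verdict accuracy, 0.2 detail quality) and a noise budget of 5 false-positive credits that ends the episode at zero. The graders themselves are not in this diff, so the sketch below only works through the arithmetic; `weighted_score` and `NoiseBudget` are hypothetical names, and the assumption that each input lies in [0, 1] is mine.

```python
from dataclasses import dataclass


def weighted_score(detection_rate: float, verdict_accuracy: float,
                   detail_quality: float) -> float:
    """Combine the three components with the weights quoted in the README.

    Assumes each component is already normalised to the range [0, 1].
    """
    return 0.6 * detection_rate + 0.2 * verdict_accuracy + 0.2 * detail_quality


@dataclass
class NoiseBudget:
    """Illustrative false-positive budget; starts at the README's 5 credits."""
    credits: int = 5

    def spend(self) -> bool:
        """Spend one credit for a false positive.

        Returns True when the budget has reached zero, i.e. the episode
        should terminate.
        """
        if self.credits > 0:
            self.credits -= 1
        return self.credits == 0
```

For example, a detection rate of 0.5 with a correct verdict (1.0) and detail quality 0.5 gives 0.3 + 0.2 + 0.1 = 0.6, and the fifth call to `spend()` is the first to report termination.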
app.py CHANGED
@@ -20,7 +20,7 @@ from sqlmodel import Session
20
  import os
21
 
22
  from codelens_env.models import (
23
- TaskId, Action, ResetResult, StepResult, EpisodeResult, ActionRecord
24
  )
25
  from codelens_env.env import CodeLensEnv
26
  from codelens_env.config import get_settings
@@ -229,6 +229,15 @@ async def step_env(request: Request, episode_id: str, action: Action, _: None =
229
  except RuntimeError as e:
230
  raise HTTPException(status_code=400, detail=str(e))
231
 
 
 
 
 
 
 
 
 
 
232
  @app.get("/result/{episode_id}", response_model=EpisodeResult)
233
  def get_result(
234
  episode_id: str,
 
20
  import os
21
 
22
  from codelens_env.models import (
23
+ TaskId, Action, ResetResult, StepResult, EpisodeResult, ActionRecord, Observation
24
  )
25
  from codelens_env.env import CodeLensEnv
26
  from codelens_env.config import get_settings
 
229
  except RuntimeError as e:
230
  raise HTTPException(status_code=400, detail=str(e))
231
 
232
+ @app.get("/state/{episode_id}", response_model=Observation)
233
+ @limiter.limit(f"{settings.rate_limit_per_minute}/minute")
234
+ def get_state(request: Request, episode_id: str, _: None = Depends(verify_api_key)):
235
+     if episode_id not in episodes:
236
+         raise HTTPException(status_code=404, detail="Episode not found")
237
+
238
+     env = episodes[episode_id]
239
+     return env._build_observation()
240
+
241
  @app.get("/result/{episode_id}", response_model=EpisodeResult)
242
  def get_result(
243
  episode_id: str,
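
The hunk above adds a `GET /state/{episode_id}` route that returns the current `Observation` or a 404 for an unknown episode. A client could call it roughly as sketched below, using only the standard library. The base URL follows the port mentioned elsewhere in this commit (7860); the `X-API-Key` header name is an assumption, since `verify_api_key` is not shown in the diff, and both helper functions are hypothetical.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote


def build_state_request(base_url: str, episode_id: str,
                        api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a GET request for /state/{episode_id}."""
    # Percent-encode the id so unusual characters cannot alter the path.
    url = f"{base_url.rstrip('/')}/state/{quote(episode_id, safe='')}"
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumed header name; check the server's verify_api_key dependency.
        headers["X-API-Key"] = api_key
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_state(base_url: str, episode_id: str,
                api_key: Optional[str] = None, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body.

    urlopen raises urllib.error.HTTPError for non-2xx responses, so a 404
    for an unknown episode surfaces as an exception.
    """
    req = build_state_request(base_url, episode_id, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Splitting request construction from sending keeps the URL and header logic checkable without a running server.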
codelens_env/env.py CHANGED
@@ -150,9 +150,11 @@ class CodeLensEnv:
150
  noise_budget=self.noise_budget,
151
  max_noise_budget=self.MAX_NOISE_BUDGET,
152
  issues_flagged=len(self.matched_issue_ids),
153
- done=self.done
154
  )
155
 
 
 
 
156
  def get_final_result(self) -> EpisodeResult:
157
  if self.task_id == TaskId.BUG_DETECTION:
158
  final_score = grade_bug_detection(self.scenario, self.history)
 
150
  noise_budget=self.noise_budget,
151
  max_noise_budget=self.MAX_NOISE_BUDGET,
152
  issues_flagged=len(self.matched_issue_ids),
 
153
  )
154
 
155
+ def state(self) -> Observation:
156
+ return self._build_observation()
157
+
158
  def get_final_result(self) -> EpisodeResult:
159
  if self.task_id == TaskId.BUG_DETECTION:
160
  final_score = grade_bug_detection(self.scenario, self.history)
inference.py CHANGED
@@ -41,16 +41,20 @@ def log_start(task: str, env: str, model: str):
41
  print(f"[START] task={task} env={env} model={model}", flush=True)
42
 
43
  def log_step(step: int, action: str, reward: float, done: bool, error):
 
 
44
  print(
45
- f"[STEP] step={step} action={action!r} reward={reward:.4f} "
46
- f"done={done} error={error}",
47
  flush=True
48
  )
49
 
50
  def log_end(success: bool, steps: int, score: float, rewards: list):
 
 
51
  print(
52
- f"[END] success={success} steps={steps} score={score:.4f} "
53
- f"rewards={rewards}",
54
  flush=True
55
  )
56
 
@@ -193,7 +197,8 @@ def sanitize_action(action_dict: dict, task_id: str) -> dict:
193
 
194
  def run_episode(task_id: str, seed: int) -> dict:
195
  """Run a single episode. Returns {score, steps, success, rewards}."""
196
- log_start(task_id, ENV_URL, MODEL_NAME)
 
197
 
198
  # ── Reset ──────────────────────────────────────────────────────────────
199
  try:
@@ -284,7 +289,7 @@ def run_episode(task_id: str, seed: int) -> dict:
284
  def main():
285
  """Run all tasks across multiple seeds and print a summary."""
286
  print("=" * 60, flush=True)
287
- print(f"CodeLens Baseline", flush=True)
288
  print(f"Model: {MODEL_NAME}", flush=True)
289
  print(f"EnvURL: {ENV_URL}", flush=True)
290
  print("=" * 60, flush=True)
 
41
  print(f"[START] task={task} env={env} model={model}", flush=True)
42
 
43
  def log_step(step: int, action: str, reward: float, done: bool, error):
44
+     error_str = str(error) if error else "null"
45
+     done_str = "true" if done else "false"
46
      print(
47
+         f"[STEP] step={step} action={action} reward={reward:.2f} "
48
+         f"done={done_str} error={error_str}",
49
          flush=True
50
      )
51
  
52
  def log_end(success: bool, steps: int, score: float, rewards: list):
53
+     success_str = "true" if success else "false"
54
+     rewards_str = ",".join([f"{r:.2f}" for r in rewards])
55
      print(
56
+         f"[END] success={success_str} steps={steps} score={score:.2f} "
57
+         f"rewards={rewards_str}",
58
          flush=True
59
      )
60
 
 
197
 
198
  def run_episode(task_id: str, seed: int) -> dict:
199
  """Run a single episode. Returns {score, steps, success, rewards}."""
200
+ benchmark = os.environ.get("BENCHMARK", "codelens")
201
+ log_start(task_id, benchmark, MODEL_NAME)
202
 
203
  # ── Reset ──────────────────────────────────────────────────────────────
204
  try:
 
289
  def main():
290
  """Run all tasks across multiple seeds and print a summary."""
291
  print("=" * 60, flush=True)
292
+ print("CodeLens Baseline", flush=True)
293
  print(f"Model: {MODEL_NAME}", flush=True)
294
  print(f"EnvURL: {ENV_URL}", flush=True)
295
  print("=" * 60, flush=True)
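
The `inference.py` hunks above change the log lines to lowercase booleans, two-decimal floats, `null` for a missing error, and comma-joined rewards. To show what that buys a consumer of the logs, here is a small sketch that formats an `[END]` line the same way as the new `log_end` and parses it back. `format_end` and `parse_end` are hypothetical helpers, not part of the commit, and the parser assumes no field value contains whitespace.

```python
from typing import Dict, List


def format_end(success: bool, steps: int, score: float, rewards: List[float]) -> str:
    """Build an [END] line in the same shape as the updated log_end()."""
    success_str = "true" if success else "false"
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END] success={success_str} steps={steps} "
            f"score={score:.2f} rewards={rewards_str}")


def parse_end(line: str) -> Dict[str, object]:
    """Parse an [END] line back into typed values."""
    prefix = "[END] "
    if not line.startswith(prefix):
        raise ValueError("not an [END] line")
    # Each token is key=value; split on the first '=' only.
    fields = dict(tok.split("=", 1) for tok in line[len(prefix):].split())
    return {
        "success": fields["success"] == "true",
        "steps": int(fields["steps"]),
        "score": float(fields["score"]),
        # An empty rewards field (no steps taken) yields an empty list.
        "rewards": [float(r) for r in fields["rewards"].split(",") if r],
    }
```

The `[STEP]` line is not parsed here because its `action` value may contain spaces, which a whitespace split would break.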
codelens.yaml → openenv.yaml RENAMED
File without changes