chore: updated configs and formatting to meet openenv specs
- CHANGELOG.md +1 -1
- CONTRIBUTING.md +5 -5
- DEPLOYMENT.md +4 -4
- GET_STARTED.md +5 -5
- README.md +13 -13
- app.py +10 -1
- codelens_env/env.py +3 -1
- inference.py +11 -6
- codelens.yaml → openenv.yaml +0 -0
CHANGELOG.md
CHANGED
|
@@ -37,7 +37,7 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
|
|
| 37 |
- **CLI**: Port mismatch in `baseline.py` (8000 → 7860) and added `--url`, `--task`, `--seed` CLI flags.
|
| 38 |
- **Crash Fixes**: Leaderboard submit crash after list slicing (captured rank before slice).
|
| 39 |
- **WebSocket**: Disconnect now handled with typed `WebSocketDisconnect` and `clients.discard()`.
|
| 40 |
-
- **Metadata**: Incoherent weight structure in `
|
| 41 |
- **Security**: Implemented `TrustedHostMiddleware` and hardened headers.
|
| 42 |
|
| 43 |
## [0.1.0] - Initial Baseline Fork
|
|
|
|
| 37 |
- **CLI**: Port mismatch in `baseline.py` (8000 → 7860) and added `--url`, `--task`, `--seed` CLI flags.
|
| 38 |
- **Crash Fixes**: Leaderboard submit crash after list slicing (captured rank before slice).
|
| 39 |
- **WebSocket**: Disconnect now handled with typed `WebSocketDisconnect` and `clients.discard()`.
|
| 40 |
+
- **Metadata**: Incoherent weight structure in `openenv.yaml` replaced with named, accurate pairs.
|
| 41 |
- **Security**: Implemented `TrustedHostMiddleware` and hardened headers.
|
| 42 |
|
| 43 |
## [0.1.0] - Initial Baseline Fork
|
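The "Crash Fixes" entry above describes capturing a leaderboard rank before the list is sliced. A minimal sketch of that ordering follows. The function name `submit_score`, the tuple layout, and the `TOP_N` cap are assumptions made for illustration and are not taken from the repository.

```python
from typing import List, Tuple

Entry = Tuple[str, float]  # (agent name, score); hypothetical layout

TOP_N = 10  # assumed cap, chosen to mirror the "top-10 leaderboard" wording


def submit_score(board: List[Entry], name: str, score: float) -> Tuple[int, List[Entry]]:
    """Add an entry and return (1-based rank, board truncated to TOP_N)."""
    entry = (name, score)
    ranked = sorted(board + [entry], key=lambda e: e[1], reverse=True)
    # Capture the rank while the new entry is still guaranteed to be present.
    rank = ranked.index(entry) + 1
    # Truncate only afterwards: looking the entry up in the sliced list
    # would raise ValueError whenever it falls outside the top N.
    return rank, ranked[:TOP_N]
```

With this ordering a low-scoring submission still gets a rank even though it never appears in the truncated board.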
CONTRIBUTING.md
CHANGED
|
@@ -4,7 +4,7 @@ Welcome! We appreciate contributions of all kinds. Here's how to get started.
|
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
-
##
|
| 8 |
|
| 9 |
To get started with local development:
|
| 10 |
|
|
@@ -29,7 +29,7 @@ To get started with local development:
|
|
| 29 |
|
| 30 |
---
|
| 31 |
|
| 32 |
-
##
|
| 33 |
|
| 34 |
Scenarios live in `codelens_env/scenarios.py`. Each scenario needs:
|
| 35 |
|
|
@@ -65,7 +65,7 @@ All 30 (or more) scenarios must pass validation.
|
|
| 65 |
|
| 66 |
---
|
| 67 |
|
| 68 |
-
##
|
| 69 |
|
| 70 |
1. Fork the repo and create a branch: `feat/my-feature`, `fix/my-bug`, `test/more-tests`
|
| 71 |
2. Make your changes
|
|
@@ -75,7 +75,7 @@ All 30 (or more) scenarios must pass validation.
|
|
| 75 |
|
| 76 |
---
|
| 77 |
|
| 78 |
-
##
|
| 79 |
|
| 80 |
- **Type hints** on all public functions and methods
|
| 81 |
- **Docstrings** on all public classes and non-trivial functions
|
|
@@ -85,7 +85,7 @@ All 30 (or more) scenarios must pass validation.
|
|
| 85 |
|
| 86 |
---
|
| 87 |
|
| 88 |
-
##
|
| 89 |
|
| 90 |
We use [Conventional Commits](https://www.conventionalcommits.org/):
|
| 91 |
|
|
|
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
+
## Development Setup
|
| 8 |
|
| 9 |
To get started with local development:
|
| 10 |
|
|
|
|
| 29 |
|
| 30 |
---
|
| 31 |
|
| 32 |
+
## Adding a New Scenario
|
| 33 |
|
| 34 |
Scenarios live in `codelens_env/scenarios.py`. Each scenario needs:
|
| 35 |
|
|
|
|
| 65 |
|
| 66 |
---
|
| 67 |
|
| 68 |
+
## Pull Request Process
|
| 69 |
|
| 70 |
1. Fork the repo and create a branch: `feat/my-feature`, `fix/my-bug`, `test/more-tests`
|
| 71 |
2. Make your changes
|
|
|
|
| 75 |
|
| 76 |
---
|
| 77 |
|
| 78 |
+
## Code Style
|
| 79 |
|
| 80 |
- **Type hints** on all public functions and methods
|
| 81 |
- **Docstrings** on all public classes and non-trivial functions
|
|
|
|
| 85 |
|
| 86 |
---
|
| 87 |
|
| 88 |
+
## Commit Message Format
|
| 89 |
|
| 90 |
We use [Conventional Commits](https://www.conventionalcommits.org/):
|
| 91 |
|
DEPLOYMENT.md
CHANGED
|
@@ -4,7 +4,7 @@ Follow this guide to deploy **CodeLens. v1.0.0** to the professional cloud. This
|
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
-
## 1.
|
| 8 |
|
| 9 |
Since SQLite is disk-based and will be deleted at every restart on Render/Vercel, you **must** use a managed PostgreSQL service.
|
| 10 |
|
|
@@ -15,7 +15,7 @@ Since SQLite is disk-based and will be deleted at every restart on Render/Vercel
|
|
| 15 |
|
| 16 |
---
|
| 17 |
|
| 18 |
-
## 2.
|
| 19 |
|
| 20 |
Render will host your FastAPI API and your Dockerized environment.
|
| 21 |
|
|
@@ -33,7 +33,7 @@ Render will host your FastAPI API and your Dockerized environment.
|
|
| 33 |
|
| 34 |
---
|
| 35 |
|
| 36 |
-
## 3.
|
| 37 |
|
| 38 |
Vercel will host your React/Vite dashboard.
|
| 39 |
|
|
@@ -46,7 +46,7 @@ Vercel will host your React/Vite dashboard.
|
|
| 46 |
|
| 47 |
---
|
| 48 |
|
| 49 |
-
## 4.
|
| 50 |
|
| 51 |
Once deployed, you can run the benchmark script from your local machine (or any CI) against your **production** instance:
|
| 52 |
|
|
|
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
+
## 1. Setup the Database (PostgreSQL)
|
| 8 |
|
| 9 |
Since SQLite is disk-based and will be deleted at every restart on Render/Vercel, you **must** use a managed PostgreSQL service.
|
| 10 |
|
|
|
|
| 15 |
|
| 16 |
---
|
| 17 |
|
| 18 |
+
## 2. Setup the Backend (Render)
|
| 19 |
|
| 20 |
Render will host your FastAPI API and your Dockerized environment.
|
| 21 |
|
|
|
|
| 33 |
|
| 34 |
---
|
| 35 |
|
| 36 |
+
## 3. Setup the Frontend (Vercel)
|
| 37 |
|
| 38 |
Vercel will host your React/Vite dashboard.
|
| 39 |
|
|
|
|
| 46 |
|
| 47 |
---
|
| 48 |
|
| 49 |
+
## 4. Running Remote Evaluations
|
| 50 |
|
| 51 |
Once deployed, you can run the benchmark script from your local machine (or any CI) against your **production** instance:
|
| 52 |
|
GET_STARTED.md
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
#
|
| 2 |
|
| 3 |
Welcome to **CodeLens.**, a production-grade AI agent evaluation environment. This guide will help you get up and running in less than 2 minutes.
|
| 4 |
|
|
@@ -44,7 +44,7 @@ PYTHONPATH=. python app.py
|
|
| 44 |
|
| 45 |
Once the server is running, you can access the CodeLens Dashboard at:
|
| 46 |
|
| 47 |
-
|
| 48 |
|
| 49 |
From here, you can see the top-10 leaderboard and monitor real-time agent evaluations via the live event feed.
|
| 50 |
|
|
@@ -64,7 +64,7 @@ python scripts/evaluate.py --agent keyword
|
|
| 64 |
|
| 65 |
---
|
| 66 |
|
| 67 |
-
##
|
| 68 |
|
| 69 |
To verify everything is working perfectly, you can run the full 155-test suite:
|
| 70 |
|
|
@@ -74,7 +74,7 @@ PYTHONPATH=. pytest tests/ -v
|
|
| 74 |
|
| 75 |
---
|
| 76 |
|
| 77 |
-
##
|
| 78 |
|
| 79 |
### 1. `ModuleNotFoundError: No module named 'requests'`
|
| 80 |
This happens if you haven't activated the virtual environment in your current terminal tab.
|
|
@@ -90,7 +90,7 @@ If the logo shows a broken image placeholder:
|
|
| 90 |
|
| 91 |
---
|
| 92 |
|
| 93 |
-
##
|
| 94 |
|
| 95 |
- **Add Scenarios**: Learn how to author new code review benchmarks in **[CONTRIBUTING.md](CONTRIBUTING.md)**.
|
| 96 |
- **Batch Evaluation**: Scale up from single evaluations to full 30-scenario reports using `scripts/evaluate.py`.
|
|
|
|
| 1 |
+
# Getting Started with CodeLens.
|
| 2 |
|
| 3 |
Welcome to **CodeLens.**, a production-grade AI agent evaluation environment. This guide will help you get up and running in less than 2 minutes.
|
| 4 |
|
|
|
|
| 44 |
|
| 45 |
Once the server is running, you can access the CodeLens Dashboard at:
|
| 46 |
|
| 47 |
+
**[http://localhost:7860/dashboard](http://localhost:7860/dashboard)**
|
| 48 |
|
| 49 |
From here, you can see the top-10 leaderboard and monitor real-time agent evaluations via the live event feed.
|
| 50 |
|
|
|
|
| 64 |
|
| 65 |
---
|
| 66 |
|
| 67 |
+
## Running Tests
|
| 68 |
|
| 69 |
To verify everything is working perfectly, you can run the full 155-test suite:
|
| 70 |
|
|
|
|
| 74 |
|
| 75 |
---
|
| 76 |
|
| 77 |
+
## Troubleshooting
|
| 78 |
|
| 79 |
### 1. `ModuleNotFoundError: No module named 'requests'`
|
| 80 |
This happens if you haven't activated the virtual environment in your current terminal tab.
|
|
|
|
| 90 |
|
| 91 |
---
|
| 92 |
|
| 93 |
+
## Next Steps
|
| 94 |
|
| 95 |
- **Add Scenarios**: Learn how to author new code review benchmarks in **[CONTRIBUTING.md](CONTRIBUTING.md)**.
|
| 96 |
- **Batch Evaluation**: Scale up from single evaluations to full 30-scenario reports using `scripts/evaluate.py`.
|
README.md
CHANGED
|
@@ -17,7 +17,7 @@ Designed for researchers and developers building the next generation of AI code
|
|
| 17 |
|
| 18 |
---
|
| 19 |
|
| 20 |
-
##
|
| 21 |
|
| 22 |
Get up and running locally in under 2 minutes:
|
| 23 |
|
|
@@ -36,7 +36,7 @@ PYTHONPATH=. python app.py
|
|
| 36 |
|
| 37 |
---
|
| 38 |
|
| 39 |
-
##
|
| 40 |
|
| 41 |
CodeLens benchmarks agents across three critical engineering domains:
|
| 42 |
|
|
@@ -48,7 +48,7 @@ CodeLens benchmarks agents across three critical engineering domains:
|
|
| 48 |
|
| 49 |
---
|
| 50 |
|
| 51 |
-
##
|
| 52 |
|
| 53 |
### Bug Detection
|
| 54 |
|
|
@@ -65,13 +65,13 @@ Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as *
|
|
| 65 |
Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
|
| 66 |
Detail quality rewards technical explanations that provide actionable developer feedback.
|
| 67 |
|
| 68 |
-
###
|
| 69 |
|
| 70 |
Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
|
| 71 |
|
| 72 |
---
|
| 73 |
|
| 74 |
-
##
|
| 75 |
|
| 76 |
| Method | Endpoint | Auth | Description |
|
| 77 |
| :----- | :---------------------- | :------- | :-------------------------------------------- |
|
|
@@ -89,7 +89,7 @@ Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for
|
|
| 89 |
|
| 90 |
---
|
| 91 |
|
| 92 |
-
##
|
| 93 |
|
| 94 |
### Production Mode
|
| 95 |
|
|
@@ -112,7 +112,7 @@ docker compose -f docker-compose.test.yml up
|
|
| 112 |
|
| 113 |
---
|
| 114 |
|
| 115 |
-
##
|
| 116 |
|
| 117 |
### Single Scenario Trial
|
| 118 |
|
|
@@ -132,7 +132,7 @@ python scripts/evaluate.py --agent llm --api-key $ANTHROPIC_API_KEY
|
|
| 132 |
|
| 133 |
---
|
| 134 |
|
| 135 |
-
##
|
| 136 |
|
| 137 |
CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:
|
| 138 |
|
|
@@ -167,7 +167,7 @@ print(f"Final Score: {final['final_score']}")
|
|
| 167 |
|
| 168 |
---
|
| 169 |
|
| 170 |
-
##
|
| 171 |
|
| 172 |
```text
|
| 173 |
open-ev-code-handler/
|
|
@@ -183,12 +183,12 @@ open-ev-code-handler/
|
|
| 183 |
├── tests/ # 155+ Parametrized tests
|
| 184 |
├── Dockerfile # Multi-stage, non-root build
|
| 185 |
├── docker-compose.yml # Production orchestration
|
| 186 |
-
└──
|
| 187 |
```
|
| 188 |
|
| 189 |
---
|
| 190 |
|
| 191 |
-
##
|
| 192 |
|
| 193 |
```bash
|
| 194 |
# Setup
|
|
@@ -205,7 +205,7 @@ pylint codelens_env/ app.py
|
|
| 205 |
PYTHONPATH=. python scripts/validate.py
|
| 206 |
```
|
| 207 |
|
| 208 |
-
##
|
| 209 |
|
| 210 |
CodeLens is authored and maintained by:
|
| 211 |
|
|
@@ -214,7 +214,7 @@ CodeLens is authored and maintained by:
|
|
| 214 |
|
| 215 |
---
|
| 216 |
|
| 217 |
-
##
|
| 218 |
|
| 219 |
Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
|
| 220 |
|
|
|
|
| 17 |
|
| 18 |
---
|
| 19 |
|
| 20 |
+
## Quick Start
|
| 21 |
|
| 22 |
Get up and running locally in under 2 minutes:
|
| 23 |
|
|
|
|
| 36 |
|
| 37 |
---
|
| 38 |
|
| 39 |
+
## Evaluation Tasks
|
| 40 |
|
| 41 |
CodeLens benchmarks agents across three critical engineering domains:
|
| 42 |
|
|
|
|
| 48 |
|
| 49 |
---
|
| 50 |
|
| 51 |
+
## Scoring System
|
| 52 |
|
| 53 |
### Bug Detection
|
| 54 |
|
|
|
|
| 65 |
Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
|
| 66 |
Detail quality rewards technical explanations that provide actionable developer feedback.
|
| 67 |
|
| 68 |
+
### Noise Budget
|
| 69 |
|
| 70 |
Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
|
| 71 |
|
| 72 |
---
|
| 73 |
|
| 74 |
+
## API Reference
|
| 75 |
|
| 76 |
| Method | Endpoint | Auth | Description |
|
| 77 |
| :----- | :---------------------- | :------- | :-------------------------------------------- |
|
|
|
|
| 89 |
|
| 90 |
---
|
| 91 |
|
| 92 |
+
## Running with Docker
|
| 93 |
|
| 94 |
### Production Mode
|
| 95 |
|
|
|
|
| 112 |
|
| 113 |
---
|
| 114 |
|
| 115 |
+
## Baseline Agent & Evaluation
|
| 116 |
|
| 117 |
### Single Scenario Trial
|
| 118 |
|
|
|
|
| 132 |
|
| 133 |
---
|
| 134 |
|
| 135 |
+
## Writing Your Own Agent
|
| 136 |
|
| 137 |
CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:
|
| 138 |
|
|
|
|
| 167 |
|
| 168 |
---
|
| 169 |
|
| 170 |
+
## Project Structure
|
| 171 |
|
| 172 |
```text
|
| 173 |
open-ev-code-handler/
|
|
|
|
| 183 |
├── tests/ # 155+ Parametrized tests
|
| 184 |
├── Dockerfile # Multi-stage, non-root build
|
| 185 |
├── docker-compose.yml # Production orchestration
|
| 186 |
+
└── openenv.yaml # CodeLens v2 specification
|
| 187 |
```
|
| 188 |
|
| 189 |
---
|
| 190 |
|
| 191 |
+
## Development
|
| 192 |
|
| 193 |
```bash
|
| 194 |
# Setup
|
|
|
|
| 205 |
PYTHONPATH=. python scripts/validate.py
|
| 206 |
```
|
| 207 |
|
| 208 |
+
## Authors & Maintainers
|
| 209 |
|
| 210 |
CodeLens is authored and maintained by:
|
| 211 |
|
|
|
|
| 214 |
|
| 215 |
---
|
| 216 |
|
| 217 |
+
## Contributing & License
|
| 218 |
|
| 219 |
Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
|
| 220 |
|
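The README hunks above quote a weighted score (`0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`) and a noise budget of 5 false-positive credits per episode. The sketch below shows one way those two rules could be expressed. The names `weighted_score` and `NoiseBudget` are hypothetical and are not the project's graders; only the weights and the credit count come from the README text.

```python
from dataclasses import dataclass


def weighted_score(detection_rate: float, verdict_accuracy: float, detail_quality: float) -> float:
    """Combine three components in [0, 1] using the 0.6 / 0.2 / 0.2 weights."""
    return 0.6 * detection_rate + 0.2 * verdict_accuracy + 0.2 * detail_quality


@dataclass
class NoiseBudget:
    """Hypothetical tracker for the false-positive credits of one episode."""

    credits: int = 5  # the README states 5 credits per episode

    def spend(self) -> bool:
        """Spend one credit for a false positive; return True when the episode must end."""
        self.credits = max(0, self.credits - 1)
        return self.credits == 0
```

Because the weights sum to 1, an agent that is perfect on all three components scores 1.0, and detection alone can contribute at most 0.6.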
app.py
CHANGED
|
@@ -20,7 +20,7 @@ from sqlmodel import Session
|
|
| 20 |
import os
|
| 21 |
|
| 22 |
from codelens_env.models import (
|
| 23 |
-
TaskId, Action, ResetResult, StepResult, EpisodeResult, ActionRecord
|
| 24 |
)
|
| 25 |
from codelens_env.env import CodeLensEnv
|
| 26 |
from codelens_env.config import get_settings
|
|
@@ -229,6 +229,15 @@ async def step_env(request: Request, episode_id: str, action: Action, _: None =
|
|
| 229 |
except RuntimeError as e:
|
| 230 |
raise HTTPException(status_code=400, detail=str(e))
|
| 231 |
|
| 232 |
@app.get("/result/{episode_id}", response_model=EpisodeResult)
|
| 233 |
def get_result(
|
| 234 |
episode_id: str,
|
|
|
|
| 20 |
import os
|
| 21 |
|
| 22 |
from codelens_env.models import (
|
| 23 |
+
    TaskId, Action, ResetResult, StepResult, EpisodeResult, ActionRecord, Observation
|
| 24 |
)
|
| 25 |
from codelens_env.env import CodeLensEnv
|
| 26 |
from codelens_env.config import get_settings
|
|
|
|
| 229 |
except RuntimeError as e:
|
| 230 |
raise HTTPException(status_code=400, detail=str(e))
|
| 231 |
|
| 232 |
+
@app.get("/state/{episode_id}", response_model=Observation)
|
| 233 |
+
@limiter.limit(f"{settings.rate_limit_per_minute}/minute")
|
| 234 |
+
def get_state(request: Request, episode_id: str, _: None = Depends(verify_api_key)):
|
| 235 |
+
    if episode_id not in episodes:
|
| 236 |
+
        raise HTTPException(status_code=404, detail="Episode not found")
|
| 237 |
+
|
| 238 |
+
    env = episodes[episode_id]
|
| 239 |
+
    return env._build_observation()
|
| 240 |
+
|
| 241 |
@app.get("/result/{episode_id}", response_model=EpisodeResult)
|
| 242 |
def get_result(
|
| 243 |
episode_id: str,
|
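The new `GET /state/{episode_id}` route returns the current observation for an episode and answers 404 for an unknown id. A client-side sketch using only the standard library might look like the following. The base URL reuses the port shown in the docs of this commit. The function names and the injectable `opener` parameter are assumptions added so the call can be exercised without a running server, and no API-key header is sent because the header name is not visible in this diff.

```python
import json
import urllib.request
from typing import Any, Callable, Dict
from urllib.parse import quote

BASE_URL = "http://localhost:7860"  # port as documented in GET_STARTED.md


def state_url(episode_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL of the read-only state endpoint for one episode."""
    return f"{base_url.rstrip('/')}/state/{quote(episode_id, safe='')}"


def fetch_state(
    episode_id: str,
    base_url: str = BASE_URL,
    opener: Callable[[str], Any] = urllib.request.urlopen,
) -> Dict[str, Any]:
    """Fetch and decode the observation JSON for an episode.

    With the default opener, an unknown episode surfaces as
    urllib.error.HTTPError carrying status 404.
    """
    with opener(state_url(episode_id, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Polling this endpoint should not advance the episode, since the handler only rebuilds the observation from the stored environment.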
codelens_env/env.py
CHANGED
|
@@ -150,9 +150,11 @@ class CodeLensEnv:
|
|
| 150 |
noise_budget=self.noise_budget,
|
| 151 |
max_noise_budget=self.MAX_NOISE_BUDGET,
|
| 152 |
issues_flagged=len(self.matched_issue_ids),
|
| 153 |
-
done=self.done
|
| 154 |
)
|
| 155 |
|
| 156 |
def get_final_result(self) -> EpisodeResult:
|
| 157 |
if self.task_id == TaskId.BUG_DETECTION:
|
| 158 |
final_score = grade_bug_detection(self.scenario, self.history)
|
|
|
|
| 150 |
noise_budget=self.noise_budget,
|
| 151 |
max_noise_budget=self.MAX_NOISE_BUDGET,
|
| 152 |
issues_flagged=len(self.matched_issue_ids),
|
| 153 |
)
|
| 154 |
|
| 155 |
+
    def state(self) -> Observation:
|
| 156 |
+
        return self._build_observation()
|
| 157 |
+
|
| 158 |
def get_final_result(self) -> EpisodeResult:
|
| 159 |
if self.task_id == TaskId.BUG_DETECTION:
|
| 160 |
final_score = grade_bug_detection(self.scenario, self.history)
|
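The `env.py` hunk adds a public `state()` method that delegates to the private `_build_observation()` helper. The toy classes below illustrate that pattern in isolation. `ToyEnv` and this `Observation` dataclass are stand-ins written for this note, with field names copied from the hunk and a budget of 5 assumed from the README; they are not the project's real models.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Observation:
    """Toy stand-in for the project's Observation model."""

    noise_budget: int
    max_noise_budget: int
    issues_flagged: int


class ToyEnv:
    MAX_NOISE_BUDGET = 5  # assumed value, matching the README's 5 credits

    def __init__(self) -> None:
        self.noise_budget = self.MAX_NOISE_BUDGET
        self.matched_issue_ids: Set[str] = set()

    def _build_observation(self) -> Observation:
        # Private helper: assembles a snapshot from the current fields.
        return Observation(
            noise_budget=self.noise_budget,
            max_noise_budget=self.MAX_NOISE_BUDGET,
            issues_flagged=len(self.matched_issue_ids),
        )

    def state(self) -> Observation:
        """Public read-only accessor; calling it does not change the episode."""
        return self._build_observation()
```

A wrapper like this gives callers a stable public name. In this commit the `/state` handler in `app.py` still calls `_build_observation()` directly, so routing it through `state()` would be a possible follow-up rather than something the diff already does.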
inference.py
CHANGED
|
@@ -41,16 +41,20 @@ def log_start(task: str, env: str, model: str):
|
|
| 41 |
print(f"[START] task={task} env={env} model={model}", flush=True)
|
| 42 |
|
| 43 |
def log_step(step: int, action: str, reward: float, done: bool, error):
|
| 44 |
print(
|
| 45 |
-
f"[STEP] step={step} action={action
|
| 46 |
-
f"done={
|
| 47 |
flush=True
|
| 48 |
)
|
| 49 |
|
| 50 |
def log_end(success: bool, steps: int, score: float, rewards: list):
|
| 51 |
print(
|
| 52 |
-
f"[END] success={
|
| 53 |
-
f"rewards={
|
| 54 |
flush=True
|
| 55 |
)
|
| 56 |
|
|
@@ -193,7 +197,8 @@ def sanitize_action(action_dict: dict, task_id: str) -> dict:
|
|
| 193 |
|
| 194 |
def run_episode(task_id: str, seed: int) -> dict:
|
| 195 |
"""Run a single episode. Returns {score, steps, success, rewards}."""
|
| 196 |
-
|
| 197 |
|
| 198 |
# ── Reset ──────────────────────────────────────────────────────────────
|
| 199 |
try:
|
|
@@ -284,7 +289,7 @@ def run_episode(task_id: str, seed: int) -> dict:
|
|
| 284 |
def main():
|
| 285 |
"""Run all tasks across multiple seeds and print a summary."""
|
| 286 |
print("=" * 60, flush=True)
|
| 287 |
-
print(
|
| 288 |
print(f"Model: {MODEL_NAME}", flush=True)
|
| 289 |
print(f"EnvURL: {ENV_URL}", flush=True)
|
| 290 |
print("=" * 60, flush=True)
|
|
|
|
| 41 |
print(f"[START] task={task} env={env} model={model}", flush=True)
|
| 42 |
|
| 43 |
def log_step(step: int, action: str, reward: float, done: bool, error):
|
| 44 |
+
    error_str = str(error) if error else "null"
|
| 45 |
+
    done_str = "true" if done else "false"
|
| 46 |
    print(
|
| 47 |
+
        f"[STEP] step={step} action={action} reward={reward:.2f} "
|
| 48 |
+
        f"done={done_str} error={error_str}",
|
| 49 |
        flush=True
|
| 50 |
    )
|
| 51 |
|
| 52 |
def log_end(success: bool, steps: int, score: float, rewards: list):
|
| 53 |
+
    success_str = "true" if success else "false"
|
| 54 |
+
    rewards_str = ",".join([f"{r:.2f}" for r in rewards])
|
| 55 |
    print(
|
| 56 |
+
        f"[END] success={success_str} steps={steps} score={score:.2f} "
|
| 57 |
+
        f"rewards={rewards_str}",
|
| 58 |
        flush=True
|
| 59 |
    )
|
| 60 |
|
|
|
|
| 197 |
|
| 198 |
def run_episode(task_id: str, seed: int) -> dict:
|
| 199 |
    """Run a single episode. Returns {score, steps, success, rewards}."""
|
| 200 |
+
    benchmark = os.environ.get("BENCHMARK", "codelens")
|
| 201 |
+
    log_start(task_id, benchmark, MODEL_NAME)
|
| 202 |
|
| 203 |
    # ── Reset ──────────────────────────────────────────────────────────────
|
| 204 |
    try:
|
|
|
|
| 289 |
def main():
|
| 290 |
    """Run all tasks across multiple seeds and print a summary."""
|
| 291 |
    print("=" * 60, flush=True)
|
| 292 |
+
    print("CodeLens Baseline", flush=True)
|
| 293 |
    print(f"Model: {MODEL_NAME}", flush=True)
|
| 294 |
    print(f"EnvURL: {ENV_URL}", flush=True)
|
| 295 |
    print("=" * 60, flush=True)
|
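The `inference.py` changes normalize the log lines: booleans are printed as lowercase `true`/`false`, a missing error becomes `null`, and rewards use two decimal places. The sketch below restates that formatting as pure functions so the resulting lines can be inspected directly. The names `format_step` and `format_end` and the sample action `flag_issue` are illustrative assumptions; the format strings mirror the added lines in the hunk.

```python
from typing import Iterable, Optional


def format_step(step: int, action: str, reward: float, done: bool, error: Optional[object]) -> str:
    """Render one [STEP] line in the format introduced by this commit."""
    error_str = str(error) if error else "null"
    done_str = "true" if done else "false"
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={done_str} error={error_str}"
    )


def format_end(success: bool, steps: int, score: float, rewards: Iterable[float]) -> str:
    """Render the closing [END] line with comma-separated rewards."""
    success_str = "true" if success else "false"
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    return (
        f"[END] success={success_str} steps={steps} score={score:.2f} "
        f"rewards={rewards_str}"
    )


print(format_step(1, "flag_issue", 0.5, False, None))
# [STEP] step=1 action=flag_issue reward=0.50 done=false error=null
print(format_end(True, 2, 0.75, [0.5, 0.25]))
# [END] success=true steps=2 score=0.75 rewards=0.50,0.25
```

Keeping the values in this fixed `key=value` shape makes the lines straightforward to parse by splitting on spaces and `=`, provided the action and error text contain no spaces.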
codelens.yaml → openenv.yaml
RENAMED
|
File without changes
|