Spaces: ArshVerma / CodeLens


AIMLxDIV committed on Apr 5

Commit

74df718

1 Parent(s): 9486e76

chore: updated configs and formatting to meet openev specs


Files changed (9)

CHANGELOG.md +1 -1
CONTRIBUTING.md +5 -5
DEPLOYMENT.md +4 -4
GET_STARTED.md +5 -5
README.md +13 -13
app.py +10 -1
codelens_env/env.py +3 -1
inference.py +11 -6
codelens.yaml → openenv.yaml +0 -0

CHANGELOG.md CHANGED

@@ -37,7 +37,7 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - **CLI**: Port mismatch in `baseline.py` (8000 → 7860) and added `--url`, `--task`, `--seed` CLI flags.
 - **Crash Fixes**: Leaderboard submit crash after list slicing (captured rank before slice).
 - **WebSocket**: Disconnect now handled with typed `WebSocketDisconnect` and `clients.discard()`.
-- **Metadata**: Incoherent weight structure in `codelens.yaml` replaced with named, accurate pairs.
 - **Security**: Implemented `TrustedHostMiddleware` and hardened headers.
 ## [0.1.0] - Initial Baseline Fork

 - **CLI**: Port mismatch in `baseline.py` (8000 → 7860) and added `--url`, `--task`, `--seed` CLI flags.
 - **Crash Fixes**: Leaderboard submit crash after list slicing (captured rank before slice).
 - **WebSocket**: Disconnect now handled with typed `WebSocketDisconnect` and `clients.discard()`.
+- **Metadata**: Incoherent weight structure in `openenv.yaml` replaced with named, accurate pairs.
 - **Security**: Implemented `TrustedHostMiddleware` and hardened headers.
 ## [0.1.0] - Initial Baseline Fork
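
Reviewer note on the **Crash Fixes** entry above: the idea of capturing the rank before slicing can be illustrated with a minimal sketch. The function name `submit_score`, the `(name, score)` entry layout, and the `keep` parameter are assumptions made for illustration; the project's actual leaderboard code is not shown in this diff.

```python
from typing import List, Tuple

Entry = Tuple[str, float]  # (agent_name, score); hypothetical layout


def submit_score(
    board: List[Entry], name: str, score: float, keep: int = 10
) -> Tuple[List[Entry], int]:
    """Insert an entry and return (trimmed board, 1-based rank of the entry)."""
    entry = (name, score)
    ranked = sorted(board + [entry], key=lambda e: e[1], reverse=True)
    # Capture the rank BEFORE slicing. After trimming to the top `keep`,
    # a low-scoring entry may be gone and ranked[:keep].index(entry)
    # would raise ValueError.
    rank = ranked.index(entry) + 1
    return ranked[:keep], rank


# Ten existing entries with scores from 1.0 down to 0.55.
board = [(f"agent-{i}", 1.0 - i * 0.05) for i in range(10)]
trimmed, rank = submit_score(board, "newcomer", 0.10)
print(rank, len(trimmed))  # 11 10
```

Here the new entry ranks 11th, so it is not in the trimmed top-10 list, yet its rank is still reported correctly.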

CONTRIBUTING.md CHANGED

@@ -4,7 +4,7 @@ Welcome! We appreciate contributions of all kinds. Here's how to get started.
 ---
-## 🏗️ Development Setup
 To get started with local development:
@@ -29,7 +29,7 @@ To get started with local development:
 ---
-## 📝 Adding a New Scenario
 Scenarios live in `codelens_env/scenarios.py`. Each scenario needs:
@@ -65,7 +65,7 @@ All 30 (or more) scenarios must pass validation.
 ---
-## 🚀 Pull Request Process
 1. Fork the repo and create a branch: `feat/my-feature`, `fix/my-bug`, `test/more-tests`
 2. Make your changes
@@ -75,7 +75,7 @@ All 30 (or more) scenarios must pass validation.
 ---
-## 📄 Code Style
 - **Type hints** on all public functions and methods
 - **Docstrings** on all public classes and non-trivial functions
@@ -85,7 +85,7 @@ All 30 (or more) scenarios must pass validation.
 ---
-## 📝 Commit Message Format
 We use [Conventional Commits](https://www.conventionalcommits.org/):

 ---
+##  Development Setup
 To get started with local development:
 ---
+##  Adding a New Scenario
 Scenarios live in `codelens_env/scenarios.py`. Each scenario needs:
 ---
+##  Pull Request Process
 1. Fork the repo and create a branch: `feat/my-feature`, `fix/my-bug`, `test/more-tests`
 2. Make your changes
 ---
+##  Code Style
 - **Type hints** on all public functions and methods
 - **Docstrings** on all public classes and non-trivial functions
 ---
+##  Commit Message Format
 We use [Conventional Commits](https://www.conventionalcommits.org/):

DEPLOYMENT.md CHANGED

@@ -4,7 +4,7 @@ Follow this guide to deploy **CodeLens. v1.0.0** to the professional cloud. This
 ---
-## 1. 🗄️ Setup the Database (PostgreSQL)
 Since SQLite is disk-based and will be deleted at every restart on Render/Vercel, you **must** use a managed PostgreSQL service.
@@ -15,7 +15,7 @@ Since SQLite is disk-based and will be deleted at every restart on Render/Vercel
 ---
-## 2. 🚀 Setup the Backend (Render)
 Render will host your FastAPI API and your Dockerized environment.
@@ -33,7 +33,7 @@ Render will host your FastAPI API and your Dockerized environment.
 ---
-## 3. 🎨 Setup the Frontend (Vercel)
 Vercel will host your React/Vite dashboard.
@@ -46,7 +46,7 @@ Vercel will host your React/Vite dashboard.
 ---
-## 4. 🤖 Running Remote Evaluations
 Once deployed, you can run the benchmark script from your local machine (or any CI) against your **production** instance:

 ---
+## 1.  Setup the Database (PostgreSQL)
 Since SQLite is disk-based and will be deleted at every restart on Render/Vercel, you **must** use a managed PostgreSQL service.
 ---
+## 2.  Setup the Backend (Render)
 Render will host your FastAPI API and your Dockerized environment.
 ---
+## 3.  Setup the Frontend (Vercel)
 Vercel will host your React/Vite dashboard.
 ---
+## 4.  Running Remote Evaluations
 Once deployed, you can run the benchmark script from your local machine (or any CI) against your **production** instance:

GET_STARTED.md CHANGED

@@ -1,4 +1,4 @@
-# 🚀 Getting Started with CodeLens.
 Welcome to **CodeLens.**, a production-grade AI agent evaluation environment. This guide will help you get up and running in less than 2 minutes.
@@ -44,7 +44,7 @@ PYTHONPATH=. python app.py
 Once the server is running, you can access the CodeLens Dashboard at:
-👉 **[http://localhost:7860/dashboard](http://localhost:7860/dashboard)**
 From here, you can see the top-10 leaderboard and monitor real-time agent evaluations via the live event feed.
@@ -64,7 +64,7 @@ python scripts/evaluate.py --agent keyword
 ---
-## 🧪 Running Tests
 To verify everything is working perfectly, you can run the full 155-test suite:
@@ -74,7 +74,7 @@ PYTHONPATH=. pytest tests/ -v
 ---
-## 🛠️ Troubleshooting
 ### 1. `ModuleNotFoundError: No module named 'requests'`
 This happens if you haven't activated the virtual environment in your current terminal tab.
@@ -90,7 +90,7 @@ If the logo shows a broken image placeholder:
 ---
-## 🤝 Next Steps
 - **Add Scenarios**: Learn how to author new code review benchmarks in **[CONTRIBUTING.md](CONTRIBUTING.md)**.
 - **Batch Evaluation**: Scale up from single evaluations to full 30-scenario reports using `scripts/evaluate.py`.

+#  Getting Started with CodeLens.
 Welcome to **CodeLens.**, a production-grade AI agent evaluation environment. This guide will help you get up and running in less than 2 minutes.
 Once the server is running, you can access the CodeLens Dashboard at:
+ **[http://localhost:7860/dashboard](http://localhost:7860/dashboard)**
 From here, you can see the top-10 leaderboard and monitor real-time agent evaluations via the live event feed.
 ---
+##  Running Tests
 To verify everything is working perfectly, you can run the full 155-test suite:
 ---
+##  Troubleshooting
 ### 1. `ModuleNotFoundError: No module named 'requests'`
 This happens if you haven't activated the virtual environment in your current terminal tab.
 ---
+##  Next Steps
 - **Add Scenarios**: Learn how to author new code review benchmarks in **[CONTRIBUTING.md](CONTRIBUTING.md)**.
 - **Batch Evaluation**: Scale up from single evaluations to full 30-scenario reports using `scripts/evaluate.py`.

README.md CHANGED

@@ -17,7 +17,7 @@ Designed for researchers and developers building the next generation of AI code
 ---
-## 🚀 Quick Start
 Get up and running locally in under 2 minutes:
@@ -36,7 +36,7 @@ PYTHONPATH=. python app.py
 ---
-## 📋 Evaluation Tasks
 CodeLens benchmarks agents across three critical engineering domains:
@@ -48,7 +48,7 @@ CodeLens benchmarks agents across three critical engineering domains:
 ---
-## 📈 Scoring System
 ### Bug Detection
@@ -65,13 +65,13 @@ Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as *
 Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
 Detail quality rewards technical explanations that provide actionable developer feedback.
-### 🛑 Noise Budget
 Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
 ---
-## 🔌 API Reference
 | Method | Endpoint                | Auth     | Description                                   |
 | :----- | :---------------------- | :------- | :-------------------------------------------- |
@@ -89,7 +89,7 @@ Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for
 ---
-## 🐳 Running with Docker
 ### Production Mode
@@ -112,7 +112,7 @@ docker compose -f docker-compose.test.yml up
 ---
-## 🤖 Baseline Agent & Evaluation
 ### Single Scenario Trial
@@ -132,7 +132,7 @@ python scripts/evaluate.py --agent llm --api-key $ANTHROPIC_API_KEY
 ---
-## 🧠 Writing Your Own Agent
 CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:
@@ -167,7 +167,7 @@ print(f"Final Score: {final['final_score']}")
 ---
-## 📂 Project Structure
 ```text
 open-ev-code-handler/
@@ -183,12 +183,12 @@ open-ev-code-handler/
 ├── tests/                      # 155+ Parametrized tests
 ├── Dockerfile                  # Multi-stage, non-root build
 ├── docker-compose.yml          # Production orchestration
-└── codelens.yaml               # CodeLens v2 specification
 ```
 ---
-## 🛠️ Development
 ```bash
 # Setup
@@ -205,7 +205,7 @@ pylint codelens_env/ app.py
 PYTHONPATH=. python scripts/validate.py
 ```
-## 👥 Authors & Maintainers
 CodeLens is authored and maintained by:
@@ -214,7 +214,7 @@ CodeLens is authored and maintained by:
 ---
-## 📄 Contributing & License
 Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.

 ---
+##  Quick Start
 Get up and running locally in under 2 minutes:
 ---
+##  Evaluation Tasks
 CodeLens benchmarks agents across three critical engineering domains:
 ---
+##  Scoring System
 ### Bug Detection
 Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
 Detail quality rewards technical explanations that provide actionable developer feedback.
+###  Noise Budget
 Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
 ---
+##  API Reference
 | Method | Endpoint                | Auth     | Description                                   |
 | :----- | :---------------------- | :------- | :-------------------------------------------- |
 ---
+##  Running with Docker
 ### Production Mode
 ---
+##  Baseline Agent & Evaluation
 ### Single Scenario Trial
 ---
+##  Writing Your Own Agent
 CodeLens is designed to be agent-agnostic. Use standard HTTP requests to build your reviewer:
 ---
+##  Project Structure
 ```text
 open-ev-code-handler/
 ├── tests/                      # 155+ Parametrized tests
 ├── Dockerfile                  # Multi-stage, non-root build
 ├── docker-compose.yml          # Production orchestration
+└── openenv.yaml               # CodeLens v2 specification
 ```
 ---
+##  Development
 ```bash
 # Setup
 PYTHONPATH=. python scripts/validate.py
 ```
+##  Authors & Maintainers
 CodeLens is authored and maintained by:
 ---
+##  Contributing & License
 Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
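
Reviewer note on the README context above: the weighted score (`0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`) and the noise budget of 5 false positive credits can be worked through in a short sketch. The names `weighted_score` and `NoiseBudget` are hypothetical and are not taken from the repository; only the weights and the credit count come from the README text.

```python
from dataclasses import dataclass


def weighted_score(
    detection_rate: float, verdict_accuracy: float, detail_quality: float
) -> float:
    """Combine three components with the 0.6 / 0.2 / 0.2 weights from the README."""
    return 0.6 * detection_rate + 0.2 * verdict_accuracy + 0.2 * detail_quality


@dataclass
class NoiseBudget:
    """Hypothetical tracker for false positive credits (README: 5 per episode)."""

    credits: int = 5

    def spend(self) -> bool:
        """Spend one credit; return True when the budget hits zero (episode ends)."""
        self.credits = max(0, self.credits - 1)
        return self.credits == 0


# 0.6 * 0.5 + 0.2 * 1.0 + 0.2 * 0.5 = 0.3 + 0.2 + 0.1 = 0.6
score = weighted_score(detection_rate=0.5, verdict_accuracy=1.0, detail_quality=0.5)
print(round(score, 2))  # 0.6

budget = NoiseBudget()
ended = [budget.spend() for _ in range(5)]
print(ended)  # [False, False, False, False, True]
```

The fifth false positive is the one that terminates the episode in this sketch, which matches the "reaching zero" wording in the README.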

app.py CHANGED

@@ -20,7 +20,7 @@ from sqlmodel import Session
 import os
 from codelens_env.models import (
-    TaskId, Action, ResetResult, StepResult, EpisodeResult, ActionRecord
 )
 from codelens_env.env import CodeLensEnv
 from codelens_env.config import get_settings
@@ -229,6 +229,15 @@ async def step_env(request: Request, episode_id: str, action: Action, _: None =
     except RuntimeError as e:
         raise HTTPException(status_code=400, detail=str(e))
 @app.get("/result/{episode_id}", response_model=EpisodeResult)
 def get_result(
     episode_id: str,

 import os
 from codelens_env.models import (
+    TaskId, Action, ResetResult, StepResult, EpisodeResult, ActionRecord, Observation
 )
 from codelens_env.env import CodeLensEnv
 from codelens_env.config import get_settings
     except RuntimeError as e:
         raise HTTPException(status_code=400, detail=str(e))
+@app.get("/state/{episode_id}", response_model=Observation)
+@limiter.limit(f"{settings.rate_limit_per_minute}/minute")
+def get_state(request: Request, episode_id: str, _: None = Depends(verify_api_key)):
+    if episode_id not in episodes:
+        raise HTTPException(status_code=404, detail="Episode not found")
+    env = episodes[episode_id]
+    return env._build_observation()
 @app.get("/result/{episode_id}", response_model=EpisodeResult)
 def get_result(
     episode_id: str,
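
Reviewer note on the new `GET /state/{episode_id}` route above: a client could query it roughly as follows. This is a standard-library sketch under stated assumptions: the helper names `build_state_request` and `fetch_state` are hypothetical, the base URL reuses the `localhost:7860` address from GET_STARTED.md, and no authentication header is sent because the header expected by `verify_api_key` is not visible in this diff.

```python
import json
import urllib.request
from urllib.parse import quote


def build_state_request(base_url: str, episode_id: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for /state/{episode_id}."""
    url = f"{base_url.rstrip('/')}/state/{quote(episode_id, safe='')}"
    return urllib.request.Request(url, method="GET")


def fetch_state(base_url: str, episode_id: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON observation.

    An unknown episode_id makes the server answer 404, which urlopen
    surfaces as urllib.error.HTTPError.
    """
    request = build_state_request(base_url, episode_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


req = build_state_request("http://localhost:7860/", "abc-123")
print(req.get_method(), req.full_url)  # GET http://localhost:7860/state/abc-123
```

With a server running locally, `fetch_state("http://localhost:7860", episode_id)` would return the observation as a dictionary.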

codelens_env/env.py CHANGED

@@ -150,9 +150,11 @@ class CodeLensEnv:
             noise_budget=self.noise_budget,
             max_noise_budget=self.MAX_NOISE_BUDGET,
             issues_flagged=len(self.matched_issue_ids),
-            done=self.done
         )
     def get_final_result(self) -> EpisodeResult:
         if self.task_id == TaskId.BUG_DETECTION:
             final_score = grade_bug_detection(self.scenario, self.history)

             noise_budget=self.noise_budget,
             max_noise_budget=self.MAX_NOISE_BUDGET,
             issues_flagged=len(self.matched_issue_ids),
         )
+    def state(self) -> Observation:
+        return self._build_observation()
     def get_final_result(self) -> EpisodeResult:
         if self.task_id == TaskId.BUG_DETECTION:
             final_score = grade_bug_detection(self.scenario, self.history)
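
Reviewer note on the new `state()` method above: it is a public accessor that delegates to the private `_build_observation()` builder. The pattern can be sketched in isolation as below. `MiniEnv` and this cut-down `Observation` are stand-ins written for illustration; they include only the three fields visible in the hunk and are not the project's real classes.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Observation:
    """Stand-in for the project's Observation model (fields from the hunk only)."""

    noise_budget: int
    max_noise_budget: int
    issues_flagged: int


class MiniEnv:
    """Hypothetical, stripped-down environment used only for this sketch."""

    MAX_NOISE_BUDGET = 5

    def __init__(self) -> None:
        self.noise_budget = self.MAX_NOISE_BUDGET
        self.matched_issue_ids: Set[str] = set()

    def _build_observation(self) -> Observation:
        # Private builder: assembles a snapshot from internal fields.
        return Observation(
            noise_budget=self.noise_budget,
            max_noise_budget=self.MAX_NOISE_BUDGET,
            issues_flagged=len(self.matched_issue_ids),
        )

    def state(self) -> Observation:
        """Public read-only snapshot; delegates to the private builder."""
        return self._build_observation()


env = MiniEnv()
env.matched_issue_ids.add("issue-1")
snapshot = env.state()
print(snapshot.issues_flagged, snapshot.noise_budget)  # 1 5
```

One possible benefit of the accessor is that callers outside the class, such as an HTTP route, can use `env.state()` instead of reaching into an underscore-prefixed method.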

inference.py CHANGED

@@ -41,16 +41,20 @@ def log_start(task: str, env: str, model: str):
     print(f"[START] task={task} env={env} model={model}", flush=True)
 def log_step(step: int, action: str, reward: float, done: bool, error):
     print(
-        f"[STEP] step={step} action={action!r} reward={reward:.4f} "
-        f"done={done} error={error}",
         flush=True
     )
 def log_end(success: bool, steps: int, score: float, rewards: list):
     print(
-        f"[END] success={success} steps={steps} score={score:.4f} "
-        f"rewards={rewards}",
         flush=True
     )
@@ -193,7 +197,8 @@ def sanitize_action(action_dict: dict, task_id: str) -> dict:
 def run_episode(task_id: str, seed: int) -> dict:
     """Run a single episode. Returns {score, steps, success, rewards}."""
-    log_start(task_id, ENV_URL, MODEL_NAME)
     # ── Reset ──────────────────────────────────────────────────────────────
     try:
@@ -284,7 +289,7 @@ def run_episode(task_id: str, seed: int) -> dict:
 def main():
     """Run all tasks across multiple seeds and print a summary."""
     print("=" * 60, flush=True)
-    print(f"CodeLens Baseline", flush=True)
     print(f"Model:  {MODEL_NAME}", flush=True)
     print(f"EnvURL: {ENV_URL}", flush=True)
     print("=" * 60, flush=True)

     print(f"[START] task={task} env={env} model={model}", flush=True)
 def log_step(step: int, action: str, reward: float, done: bool, error):
+    error_str = str(error) if error else "null"
+    done_str = "true" if done else "false"
     print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} "
+        f"done={done_str} error={error_str}",
         flush=True
     )
 def log_end(success: bool, steps: int, score: float, rewards: list):
+    success_str = "true" if success else "false"
+    rewards_str = ",".join([f"{r:.2f}" for r in rewards])
     print(
+        f"[END] success={success_str} steps={steps} score={score:.2f} "
+        f"rewards={rewards_str}",
         flush=True
     )
 def run_episode(task_id: str, seed: int) -> dict:
     """Run a single episode. Returns {score, steps, success, rewards}."""
+    benchmark = os.environ.get("BENCHMARK", "codelens")
+    log_start(task_id, benchmark, MODEL_NAME)
     # ── Reset ──────────────────────────────────────────────────────────────
     try:
 def main():
     """Run all tasks across multiple seeds and print a summary."""
     print("=" * 60, flush=True)
+    print("CodeLens Baseline", flush=True)
     print(f"Model:  {MODEL_NAME}", flush=True)
     print(f"EnvURL: {ENV_URL}", flush=True)
     print("=" * 60, flush=True)
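
Reviewer note on the new log format above: booleans are now lowercase, a missing error prints as `null`, and rewards use two decimals joined by commas. The sketch below mirrors that formatting in string-returning helpers and adds a small parser. `format_step`, `format_end`, and `parse_fields` are hypothetical names, the action value `flag_issue` is a made-up example, and the parser assumes field values contain no spaces.

```python
from typing import Dict, List, Optional


def format_step(
    step: int, action: str, reward: float, done: bool, error: Optional[str]
) -> str:
    """Return a [STEP] line using the same formatting as the updated log_step."""
    error_str = str(error) if error else "null"
    done_str = "true" if done else "false"
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={done_str} error={error_str}"
    )


def format_end(success: bool, steps: int, score: float, rewards: List[float]) -> str:
    """Return an [END] line using the same formatting as the updated log_end."""
    success_str = "true" if success else "false"
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    return (
        f"[END] success={success_str} steps={steps} score={score:.2f} "
        f"rewards={rewards_str}"
    )


def parse_fields(line: str) -> Dict[str, str]:
    """Split a log line into its tag and key=value fields (values without spaces)."""
    tag, _, rest = line.partition(" ")
    fields = dict(pair.split("=", 1) for pair in rest.split())
    fields["tag"] = tag.strip("[]")
    return fields


step_line = format_step(1, "flag_issue", 0.5, False, None)
end_line = format_end(True, 2, 0.75, [0.5, 0.25])
print(step_line)  # [STEP] step=1 action=flag_issue reward=0.50 done=false error=null
print(end_line)   # [END] success=true steps=2 score=0.75 rewards=0.50,0.25
print(parse_fields(end_line)["rewards"].split(","))  # ['0.50', '0.25']
```

Because `action` is no longer wrapped with `!r`, an action string containing spaces would break a whitespace-based parser like this one.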

codelens.yaml → openenv.yaml RENAMED

File without changes