ArshVerma/CodeLens
ArshVerma committed on Apr 5

Commit 3e1edbb

1 Parent(s): 4b66647

Bump project version, add co-author, docs polish

Bump core version to 2.0.0 (FastAPI metadata and health_check) and update dashboard package metadata (version 2.0.0, add author). Add Divyansh Rawat as a co-author in LICENSE and owners in codelens.yaml. Polish README: reformat tables, clarify scoring and API reference, add Authors & Maintainers, fix example whitespace, and improve Docker/testing instructions for readability. These changes prepare a new release and improve attribution and documentation.

Files changed (5)

LICENSE +1 -1
README.md +34 -17
app.py +2 -2
codelens.yaml +1 -0
dashboard/package.json +2 -1

LICENSE CHANGED

@@ -1,6 +1,6 @@
 MIT License
-Copyright (c) 2024 Arsh Verma
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

 MIT License
+Copyright (c) 2024 Arsh Verma, Divyansh Rawat
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

README.md CHANGED

@@ -40,46 +40,50 @@ PYTHONPATH=. python app.py
 CodeLens benchmarks agents across three critical engineering domains:
-| Task | Scenarios | Max Steps | Focus Area |
-|------|-----------|-----------|------------|
-| `bug_detection` | 10 | 10 | Off-by-one errors, null dereferences, race conditions, exception handling |
-| `security_audit` | 10 | 15 | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
-| `architectural_review` | 10 | 20 | N+1 queries, god classes, blocking async calls, circular imports |
 ---
 ## 📈 Scoring System
 ### Bug Detection
 Score = `0.4 × coverage + 0.6 × avg_issue_score − 0.1 × false_positive_rate`
 Issues are scored on **keyword accuracy** (50%) and **severity matching** (50%).
 ### Security Audit
 Score = `avg(per_issue_score)` where each issue = `0.7 × severity_accuracy + 0.3 × keyword_coverage`.
 Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as **LOW** incurs a major penalty.
 ### Architectural Review
 Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
 Detail quality rewards technical explanations that provide actionable developer feedback.
 ### 🛑 Noise Budget
 Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
 ---
 ## 🔌 API Reference
-| Method | Endpoint | Auth | Description |
-|:-------|:---------|:-----|:------------|
-| `POST` | `/reset` | Optional | Start a new evaluation episode |
-| `POST` | `/step/{id}` | Optional | Submit a review action (flag_issue, approve) |
-| `GET` | `/result/{id}` | Optional | Retrieve final scores and logs for an episode |
-| `GET` | `/leaderboard` | None | Paginated performance rankings |
-| `POST` | `/submit` | Optional | Persist an episode result to the leaderboard |
-| `GET` | `/stats` | None | Aggregate statistics across all agents |
-| `GET` | `/episodes/{id}/replay` | Optional | Full event-by-event history replay |
-| `GET` | `/dashboard` | None | Interactive Real-time Dashboard |
-| `GET` | `/health` | None | System status and health check |
 Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for production parity.
@@ -88,17 +92,20 @@ Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for
 ## 🐳 Running with Docker
 ### Production Mode
 ```bash
 docker compose up -d
 # View logs: docker compose logs -f
 ```
 ### Direct Pull
 ```bash
 docker run -p 7860:7860 ghcr.io/ArshVermaGit/open-ev-code-handler:latest
 ```
 ### Automated Testing
 ```bash
 docker compose -f docker-compose.test.yml up
 ```
@@ -108,11 +115,13 @@ docker compose -f docker-compose.test.yml up
 ## 🤖 Baseline Agent & Evaluation
 ### Single Scenario Trial
 ```bash
 python scripts/baseline.py --task bug_detection --seed 3 --verbose
 ```
 ### Full Benchmark (All 30 Scenarios)
 ```bash
 # Keyword-based baseline
 python scripts/evaluate.py --agent keyword --output results.json
@@ -147,7 +156,7 @@ while not done:
         "severity": "critical",
         "category": "security"
     }
     result = requests.post(f"{API}/step/{episode_id}", json=action).json()
     done = result["done"]
@@ -196,9 +205,17 @@ pylint codelens_env/ app.py
 PYTHONPATH=. python scripts/validate.py
 ```
 ---
 ## 📄 Contributing & License
 Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
 This project is licensed under the **[MIT License](LICENSE)**.

 CodeLens benchmarks agents across three critical engineering domains:
+| Task                   | Scenarios | Max Steps | Focus Area                                                                 |
+| ---------------------- | --------- | --------- | -------------------------------------------------------------------------- |
+| `bug_detection`        | 10        | 10        | Off-by-one errors, null dereferences, race conditions, exception handling  |
+| `security_audit`       | 10        | 15        | SQL injection, hardcoded secrets, path traversal, insecure deserialization |
+| `architectural_review` | 10        | 20        | N+1 queries, god classes, blocking async calls, circular imports           |
 ---
 ## 📈 Scoring System
 ### Bug Detection
 Score = `0.4 × coverage + 0.6 × avg_issue_score − 0.1 × false_positive_rate`
 Issues are scored on **keyword accuracy** (50%) and **severity matching** (50%).
 ### Security Audit
 Score = `avg(per_issue_score)` where each issue = `0.7 × severity_accuracy + 0.3 × keyword_coverage`.
 Severity accuracy is distance-weighted: misclassifying a **CRITICAL** issue as **LOW** incurs a major penalty.
 ### Architectural Review
 Score = `0.6 × detection_rate + 0.2 × verdict_accuracy + 0.2 × detail_quality`.
 Detail quality rewards technical explanations that provide actionable developer feedback.
 ### 🛑 Noise Budget
 Every episode permits **5 false positive credits**. Flagging non-existent code paths spends one credit. Reaching zero terminates the episode immediately to prevent agent hallucination loops.
 ---
 ## 🔌 API Reference
+| Method | Endpoint                | Auth     | Description                                   |
+| :----- | :---------------------- | :------- | :-------------------------------------------- |
+| `POST` | `/reset`                | Optional | Start a new evaluation episode                |
+| `POST` | `/step/{id}`            | Optional | Submit a review action (flag_issue, approve)  |
+| `GET`  | `/result/{id}`          | Optional | Retrieve final scores and logs for an episode |
+| `GET`  | `/leaderboard`          | None     | Paginated performance rankings                |
+| `POST` | `/submit`               | Optional | Persist an episode result to the leaderboard  |
+| `GET`  | `/stats`                | None     | Aggregate statistics across all agents        |
+| `GET`  | `/episodes/{id}/replay` | Optional | Full event-by-event history replay            |
+| `GET`  | `/dashboard`            | None     | Interactive Real-time Dashboard               |
+| `GET`  | `/health`               | None     | System status and health check                |
 Authentication is disabled by default. Set `API_KEY_ENABLED=true` in `.env` for production parity.
 ## 🐳 Running with Docker
 ### Production Mode
 ```bash
 docker compose up -d
 # View logs: docker compose logs -f
 ```
 ### Direct Pull
 ```bash
 docker run -p 7860:7860 ghcr.io/ArshVermaGit/open-ev-code-handler:latest
 ```
 ### Automated Testing
 ```bash
 docker compose -f docker-compose.test.yml up
 ```
 ## 🤖 Baseline Agent & Evaluation
 ### Single Scenario Trial
 ```bash
 python scripts/baseline.py --task bug_detection --seed 3 --verbose
 ```
 ### Full Benchmark (All 30 Scenarios)
 ```bash
 # Keyword-based baseline
 python scripts/evaluate.py --agent keyword --output results.json
         "severity": "critical",
         "category": "security"
     }
     result = requests.post(f"{API}/step/{episode_id}", json=action).json()
     done = result["done"]
 PYTHONPATH=. python scripts/validate.py
 ```
+## 👥 Authors & Maintainers
+CodeLens is authored and maintained by:
+- **Arsh Verma** — [GitHub](https://github.com/ArshVermaGit)
+- **Divyansh Rawat** — [GitHub](https://github.com/DsThakurRawat)
 ---
 ## 📄 Contributing & License
 Please see **[CONTRIBUTING.md](CONTRIBUTING.md)** for details on authoring new scenarios and submission standards.
 This project is licensed under the **[MIT License](LICENSE)**.
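
The scoring formulas and the noise budget described in the README hunk above can be sketched in Python as follows. This is a minimal illustration and not the project's grader: the function names, the `NoiseBudget` class, the four-level severity ordering, and the linear distance weighting are all assumptions, since the README only states that severity accuracy is "distance-weighted".

```python
from dataclasses import dataclass

# Assumed severity ordering; the README names CRITICAL and LOW only.
SEVERITY_LEVELS = ["low", "medium", "high", "critical"]


def severity_accuracy(expected: str, predicted: str) -> float:
    """Assumed linear weighting: 1.0 for an exact match, 0.0 at maximum distance."""
    distance = abs(SEVERITY_LEVELS.index(expected) - SEVERITY_LEVELS.index(predicted))
    return 1.0 - distance / (len(SEVERITY_LEVELS) - 1)


def bug_detection_score(coverage: float, avg_issue_score: float,
                        false_positive_rate: float) -> float:
    """0.4 * coverage + 0.6 * avg_issue_score - 0.1 * false_positive_rate."""
    return 0.4 * coverage + 0.6 * avg_issue_score - 0.1 * false_positive_rate


def security_issue_score(sev_accuracy: float, keyword_coverage: float) -> float:
    """Per-issue score: 0.7 * severity_accuracy + 0.3 * keyword_coverage."""
    return 0.7 * sev_accuracy + 0.3 * keyword_coverage


def security_audit_score(per_issue_scores: list[float]) -> float:
    """Average of the per-issue scores (0.0 for an empty list, by assumption)."""
    if not per_issue_scores:
        return 0.0
    return sum(per_issue_scores) / len(per_issue_scores)


def architectural_review_score(detection_rate: float, verdict_accuracy: float,
                               detail_quality: float) -> float:
    """0.6 * detection_rate + 0.2 * verdict_accuracy + 0.2 * detail_quality."""
    return 0.6 * detection_rate + 0.2 * verdict_accuracy + 0.2 * detail_quality


@dataclass
class NoiseBudget:
    """Hypothetical tracker for the 5 false-positive credits per episode."""
    credits: int = 5

    def spend(self) -> bool:
        """Spend one credit; return True when the episode should terminate."""
        self.credits = max(0, self.credits - 1)
        return self.credits == 0


if __name__ == "__main__":
    # A CRITICAL issue reported as LOW gets no severity credit under this weighting.
    worst = security_issue_score(severity_accuracy("critical", "low"), 1.0)
    exact = security_issue_score(severity_accuracy("high", "high"), 1.0)
    print(security_audit_score([worst, exact]))
```

Under these assumptions, misclassifying a CRITICAL issue as LOW leaves only the keyword component of the per-issue score, which matches the "major penalty" the README describes.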

app.py CHANGED

@@ -64,7 +64,7 @@ app = FastAPI(
         "Trains agents to detect bugs, security vulnerabilities, and architectural issues "
         "in realistic Python PRs."
     ),
-    version="1.0.0",
     lifespan=lifespan,
 )
@@ -169,7 +169,7 @@ async def http_exception_handler(request, exc):
 def health_check():
     return {
         "status": "ok",
-        "version": "1.0.0",
         "env_ready": True,
         "env": settings.app_env,
         "active_episodes": len(episodes),

         "Trains agents to detect bugs, security vulnerabilities, and architectural issues "
         "in realistic Python PRs."
     ),
+    version="2.0.0",
     lifespan=lifespan,
 )
 def health_check():
     return {
         "status": "ok",
+        "version": "2.0.0",
         "env_ready": True,
         "env": settings.app_env,
         "active_episodes": len(episodes),
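
This commit edits the version string in two places in app.py: the FastAPI metadata and the `health_check` payload. As a hedged sketch of one way to keep the two in sync, both could read from a single constant. `APP_VERSION`, `build_app_metadata`, and `health_check_payload` are hypothetical names that do not appear in the commit, and the `"CodeLens"` title is assumed.

```python
# Hypothetical single source of truth for the version; not part of the commit.
APP_VERSION = "2.0.0"


def build_app_metadata() -> dict:
    """Keyword arguments that could be passed to FastAPI(...); title is assumed."""
    return {"title": "CodeLens", "version": APP_VERSION}


def health_check_payload(active_episodes: int) -> dict:
    """Subset of the health_check fields shown in the diff, using the shared version."""
    return {
        "status": "ok",
        "version": APP_VERSION,
        "env_ready": True,
        "active_episodes": active_episodes,
    }


if __name__ == "__main__":
    print(build_app_metadata()["version"], health_check_payload(0)["version"])
```

With this arrangement a future bump would touch one line in app.py instead of two.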

codelens.yaml CHANGED

@@ -1,5 +1,6 @@
 version: "2.0"
 name: "agentorg-codereview"
 description: >
   AI Senior Code Reviewer evaluation environment for CodeLens.
   Benchmarks agents on 30 synthetic pull requests across Bug Detection,

 version: "2.0"
 name: "agentorg-codereview"
+owners: ["Arsh Verma", "Divyansh Rawat"]
 description: >
   AI Senior Code Reviewer evaluation environment for CodeLens.
   Benchmarks agents on 30 synthetic pull requests across Bug Detection,

dashboard/package.json CHANGED

@@ -1,7 +1,8 @@
 {
   "name": "codelens-dashboard",
   "private": true,
-  "version": "0.1.0",
   "type": "module",
   "scripts": {
     "dev": "vite",

 {
   "name": "codelens-dashboard",
+  "version": "2.0.0",
   "private": true,
+  "author": "Arsh Verma, Divyansh Rawat",
   "type": "module",
   "scripts": {
     "dev": "vite",