Commit ·
86c3e08
1
Parent(s): 1432cf4
Add Phase 06 Plan for Adaptive Judging and Edge Intelligence; Create initial project outline for GraphReview RL Environment
- code-review-env/README.md +63 -0
- code-review-env/db/migrations.py +12 -0
- code-review-env/db/schema.py +1 -0
- code-review-env/db/seed.py +12 -0
- code-review-env/db/store.py +5 -0
- code-review-env/env/env_loader.py +21 -0
- code-review-env/env/environment.py +14 -0
- code-review-env/env/reward.py +2 -0
- code-review-env/env/runtime_config.py +3 -0
- code-review-env/graders/hard_grader.py +123 -16
- code-review-env/graph/graph_manager.py +1 -0
- code-review-env/llm/__init__.py +1 -0
- code-review-env/llm/edge_summarizer.py +79 -0
- code-review-env/llm/lora_adapter.py +69 -0
- code-review-env/llm/lora_finetune.py +59 -0
- code-review-env/parser/ast_parser.py +12 -0
- code-review-env/parser/graph_builder.py +1 -0
- code-review-env/pyproject.toml +1 -0
- code-review-env/run_project.py +82 -0
- code-review-env/server/app.py +4 -0
- code-review-env/tests/test_graders.py +3 -1
- code-review-env/visualizer/pyvis_renderer.py +10 -3
- code-review-env/visualizer/report_generator.py +4 -1
- plans/phase-06-adaptive-judge-edge-summary-lora-plan.md +63 -0
- temp.md +1333 -0
code-review-env/README.md
CHANGED
|
@@ -35,6 +35,14 @@ Phase 5:
|
|
| 35 |
- Added confidence scoring that balances precision/recall with severity/security coverage and attribution validity.
|
| 36 |
- Added API endpoint to generate artifacts and CLI support for real project runs.
|
| 37 |
|
| 38 |
## Core Runtime Components
|
| 39 |
|
| 40 |
- `env/environment.py`
|
|
@@ -105,6 +113,8 @@ with `auth_token` connect arg.
|
|
| 105 |
|
| 106 |
## LLM and Runtime Env Vars
|
| 107 |
|
|
|
|
|
|
|
| 108 |
Judge settings:
|
| 109 |
|
| 110 |
- `GRAPHREVIEW_JUDGE_PROVIDER` (default `ollama_openai_compat`)
|
|
@@ -117,6 +127,39 @@ Judge settings:
|
|
| 117 |
- `GRAPHREVIEW_JUDGE_MAX_CONSECUTIVE_FAILURES` (default `3`)
|
| 118 |
- `GRAPHREVIEW_JUDGE_THINK` (`false|true|low|medium|high`, default `false`)
|
| 119 |
|
| 120 |
General runtime settings:
|
| 121 |
|
| 122 |
- `GRAPHREVIEW_SOURCE_ROOT` (default `sample_project`)
|
|
@@ -143,6 +186,26 @@ curl -s http://localhost:8000/health
|
|
| 143 |
curl -s http://localhost:8000/tasks
|
| 144 |
```
|
| 145 |
|
| 146 |
## Direct Module Review (Phase 4)
|
| 147 |
|
| 148 |
Example: run `logic_review` with explicit module focus:
|
|
|
|
| 35 |
- Added confidence scoring that balances precision/recall with severity/security coverage and attribution validity.
|
| 36 |
- Added API endpoint to generate artifacts and CLI support for real project runs.
|
| 37 |
|
| 38 |
+
Phase 6:
|
| 39 |
+
|
| 40 |
+
- Added adaptive hard-grader fusion: deterministic graph gate + primary judge + verifier judge.
|
| 41 |
+
- Added disagreement-aware reweighting to reduce single-model catastrophic errors.
|
| 42 |
+
- Added per-edge `connection_summary` generation using LLM with deterministic fallback.
|
| 43 |
+
- Added optional LoRA trajectory logging for cross-project learning data collection.
|
| 44 |
+
- Added root `.env` support for centralized configuration management.
|
| 45 |
+
|
| 46 |
## Core Runtime Components
|
| 47 |
|
| 48 |
- `env/environment.py`
|
|
|
|
| 113 |
|
| 114 |
## LLM and Runtime Env Vars
|
| 115 |
|
| 116 |
+
`.env` at project root is auto-loaded by runtime configuration, DB initialization, and server startup.
|
| 117 |
+
|
| 118 |
Judge settings:
|
| 119 |
|
| 120 |
- `GRAPHREVIEW_JUDGE_PROVIDER` (default `ollama_openai_compat`)
|
|
|
|
| 127 |
- `GRAPHREVIEW_JUDGE_MAX_CONSECUTIVE_FAILURES` (default `3`)
|
| 128 |
- `GRAPHREVIEW_JUDGE_THINK` (`false|true|low|medium|high`, default `false`)
|
| 129 |
|
| 130 |
+
Verifier and adaptive fusion settings:
|
| 131 |
+
|
| 132 |
+
- `GRAPHREVIEW_VERIFIER_ENABLED` (default `true`)
|
| 133 |
+
- `GRAPHREVIEW_VERIFIER_PROVIDER`
|
| 134 |
+
- `GRAPHREVIEW_VERIFIER_MODEL`
|
| 135 |
+
- `GRAPHREVIEW_VERIFIER_BASE_URL`
|
| 136 |
+
- `GRAPHREVIEW_VERIFIER_API_KEY`
|
| 137 |
+
- `GRAPHREVIEW_VERIFIER_TIMEOUT_SECONDS`
|
| 138 |
+
- `GRAPHREVIEW_JUDGE_WEIGHT_DETERMINISTIC` (default `0.5`)
|
| 139 |
+
- `GRAPHREVIEW_JUDGE_WEIGHT_PRIMARY` (default `0.3`)
|
| 140 |
+
- `GRAPHREVIEW_JUDGE_WEIGHT_VERIFIER` (default `0.2`)
|
| 141 |
+
- `GRAPHREVIEW_JUDGE_DISAGREEMENT_THRESHOLD` (default `0.5`)
|
| 142 |
+
|
| 143 |
+
Edge summary settings:
|
| 144 |
+
|
| 145 |
+
- `GRAPHREVIEW_EDGE_SUMMARY_ENABLED` (default `false`, enable when you want LLM edge summaries)
|
| 146 |
+
- `GRAPHREVIEW_EDGE_SUMMARY_MODEL`
|
| 147 |
+
- `GRAPHREVIEW_EDGE_SUMMARY_BASE_URL`
|
| 148 |
+
- `GRAPHREVIEW_EDGE_SUMMARY_API_KEY`
|
| 149 |
+
- `GRAPHREVIEW_EDGE_SUMMARY_TIMEOUT_SECONDS`
|
| 150 |
+
- `GRAPHREVIEW_EDGE_SUMMARY_MAX_CALLS`
|
| 151 |
+
|
| 152 |
+
LoRA trajectory hooks:
|
| 153 |
+
|
| 154 |
+
- `GRAPHREVIEW_LORA_ENABLED` (default `false`)
|
| 155 |
+
- `GRAPHREVIEW_LORA_DATA_PATH` (default `outputs/lora/transitions.jsonl`)
|
| 156 |
+
|
| 157 |
+
Generate a LoRA-ready SFT dataset from transitions:
|
| 158 |
+
|
| 159 |
+
```bash
|
| 160 |
+
python -m llm.lora_finetune --transitions outputs/lora/transitions.jsonl --output outputs/lora/sft_dataset.jsonl
|
| 161 |
+
```
|
| 162 |
+
|
| 163 |
General runtime settings:
|
| 164 |
|
| 165 |
- `GRAPHREVIEW_SOURCE_ROOT` (default `sample_project`)
|
|
|
|
| 186 |
curl -s http://localhost:8000/tasks
|
| 187 |
```
|
| 188 |
|
| 189 |
+
## Unified One-Command Runner
|
| 190 |
+
|
| 191 |
+
Run seed + easy/medium/hard reviews + artifact generation on any target codebase:
|
| 192 |
+
|
| 193 |
+
```bash
|
| 194 |
+
graphreview /absolute/path/to/your/codebase --force-seed
|
| 195 |
+
```
|
| 196 |
+
|
| 197 |
+
Equivalent without installing entrypoints:
|
| 198 |
+
|
| 199 |
+
```bash
|
| 200 |
+
python run_project.py /absolute/path/to/your/codebase --force-seed
|
| 201 |
+
```
|
| 202 |
+
|
| 203 |
+
Optional focused run:
|
| 204 |
+
|
| 205 |
+
```bash
|
| 206 |
+
graphreview /absolute/path/to/your/codebase --modules checkout auth --filter-hops 1 --report-prefix myrun
|
| 207 |
+
```
|
| 208 |
+
|
| 209 |
## Direct Module Review (Phase 4)
|
| 210 |
|
| 211 |
Example: run `logic_review` with explicit module focus:
|
code-review-env/db/migrations.py
CHANGED
|
@@ -6,6 +6,8 @@ from pathlib import Path
|
|
| 6 |
from sqlmodel import SQLModel, create_engine
|
| 7 |
from sqlalchemy import inspect, text
|
| 8 |
|
|
|
|
|
|
|
| 9 |
|
| 10 |
def get_default_db_path() -> Path:
|
| 11 |
project_root = Path(__file__).resolve().parents[1]
|
|
@@ -13,6 +15,7 @@ def get_default_db_path() -> Path:
|
|
| 13 |
|
| 14 |
|
| 15 |
def get_engine(db_path: str | Path | None = None, echo: bool = False):
|
|
|
|
| 16 |
env_url = os.getenv("GRAPHREVIEW_DATABASE_URL", "").strip()
|
| 17 |
if env_url:
|
| 18 |
connect_args: dict[str, object] = {}
|
|
@@ -43,6 +46,7 @@ def get_engine(db_path: str | Path | None = None, echo: bool = False):
|
|
| 43 |
|
| 44 |
|
| 45 |
def init_db(db_path: str | Path | None = None, echo: bool = False) -> None:
|
|
|
|
| 46 |
from db import schema # noqa: F401
|
| 47 |
|
| 48 |
engine = get_engine(db_path=db_path, echo=echo)
|
|
@@ -66,6 +70,14 @@ def _apply_lightweight_migrations(engine) -> None:
|
|
| 66 |
if "is_amendment" not in existing_columns:
|
| 67 |
add_statements.append("ALTER TABLE reviewannotation ADD COLUMN is_amendment BOOLEAN DEFAULT 0")
|
| 68 |
|
| 69 |
if not add_statements:
|
| 70 |
return
|
| 71 |
|
|
|
|
| 6 |
from sqlmodel import SQLModel, create_engine
|
| 7 |
from sqlalchemy import inspect, text
|
| 8 |
|
| 9 |
+
from env.env_loader import load_env_file
|
| 10 |
+
|
| 11 |
|
| 12 |
def get_default_db_path() -> Path:
|
| 13 |
project_root = Path(__file__).resolve().parents[1]
|
|
|
|
| 15 |
|
| 16 |
|
| 17 |
def get_engine(db_path: str | Path | None = None, echo: bool = False):
|
| 18 |
+
load_env_file()
|
| 19 |
env_url = os.getenv("GRAPHREVIEW_DATABASE_URL", "").strip()
|
| 20 |
if env_url:
|
| 21 |
connect_args: dict[str, object] = {}
|
|
|
|
| 46 |
|
| 47 |
|
| 48 |
def init_db(db_path: str | Path | None = None, echo: bool = False) -> None:
|
| 49 |
+
load_env_file()
|
| 50 |
from db import schema # noqa: F401
|
| 51 |
|
| 52 |
engine = get_engine(db_path=db_path, echo=echo)
|
|
|
|
| 70 |
if "is_amendment" not in existing_columns:
|
| 71 |
add_statements.append("ALTER TABLE reviewannotation ADD COLUMN is_amendment BOOLEAN DEFAULT 0")
|
| 72 |
|
| 73 |
+
if not add_statements:
|
| 74 |
+
add_statements = []
|
| 75 |
+
|
| 76 |
+
if "moduleedge" in inspector.get_table_names():
|
| 77 |
+
edge_columns = {col["name"] for col in inspector.get_columns("moduleedge")}
|
| 78 |
+
if "connection_summary" not in edge_columns:
|
| 79 |
+
add_statements.append("ALTER TABLE moduleedge ADD COLUMN connection_summary TEXT DEFAULT ''")
|
| 80 |
+
|
| 81 |
if not add_statements:
|
| 82 |
return
|
| 83 |
|
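The column-add step in this file follows a check-then-alter pattern: read the table's existing columns and issue `ALTER TABLE ... ADD COLUMN` only when the column is missing, so re-running the migration is harmless. A minimal sketch of the same idea, assuming the standard-library `sqlite3` module rather than the SQLAlchemy `inspect` API the commit uses; the helper name `add_column_if_missing` is hypothetical:

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str, ddl: str) -> bool:
    """Add `column` to `table` unless it is already there. Returns True if altered."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    if not rows:
        return False  # table does not exist, so there is nothing to migrate
    existing = {row[1] for row in rows}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl}")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE moduleedge (id INTEGER PRIMARY KEY, weight REAL)")

# First call alters the table; the second is a no-op because the column exists.
first = add_column_if_missing(conn, "moduleedge", "connection_summary", "TEXT DEFAULT ''")
second = add_column_if_missing(conn, "moduleedge", "connection_summary", "TEXT DEFAULT ''")
columns = [row[1] for row in conn.execute("PRAGMA table_info(moduleedge)")]
```

The table and column names are interpolated directly into SQL here, which is only reasonable because they are hard-coded by the migration and never come from user input.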
code-review-env/db/schema.py
CHANGED
|
@@ -53,6 +53,7 @@ class ModuleEdge(SQLModel, table=True):
|
|
| 53 |
edge_type: EdgeType = Field(default=EdgeType.EXPLICIT_IMPORT)
|
| 54 |
import_line: str
|
| 55 |
weight: float = 1.0
|
|
|
|
| 56 |
|
| 57 |
|
| 58 |
class LinterFinding(SQLModel, table=True):
|
|
|
|
| 53 |
edge_type: EdgeType = Field(default=EdgeType.EXPLICIT_IMPORT)
|
| 54 |
import_line: str
|
| 55 |
weight: float = 1.0
|
| 56 |
+
connection_summary: str = ""
|
| 57 |
|
| 58 |
|
| 59 |
class LinterFinding(SQLModel, table=True):
|
code-review-env/db/seed.py
CHANGED
|
@@ -14,6 +14,7 @@ from parser.chunker import chunk_module
|
|
| 14 |
from parser.graph_builder import build_edges
|
| 15 |
from parser.linter import run_linters
|
| 16 |
from parser.summarizer import summarize_module
|
|
|
|
| 17 |
|
| 18 |
|
| 19 |
_SKIP_DIRS = {
|
|
@@ -160,13 +161,24 @@ def seed_project(target_dir: Path, db_path: str | None = None, force: bool = Fal
|
|
| 160 |
store.replace_findings_for_module(parsed.module_id, [issue.model_dump() for issue in issues])
|
| 161 |
|
| 162 |
edges = build_edges(parsed_modules, module_ids, chunk_ids_by_parent)
|
|
|
|
| 163 |
for edge in edges:
|
| 164 |
store.upsert_edge(
|
| 165 |
source_module_id=edge.source_module_id,
|
| 166 |
target_module_id=edge.target_module_id,
|
| 167 |
edge_type=edge.edge_type,
|
| 168 |
import_line=edge.import_line,
|
| 169 |
weight=edge.weight,
|
|
|
|
| 170 |
)
|
| 171 |
|
| 172 |
snapshot = store.get_full_graph()
|
|
|
|
| 14 |
from parser.graph_builder import build_edges
|
| 15 |
from parser.linter import run_linters
|
| 16 |
from parser.summarizer import summarize_module
|
| 17 |
+
from llm.edge_summarizer import EdgeSummarizer, EdgeSummaryInput
|
| 18 |
|
| 19 |
|
| 20 |
_SKIP_DIRS = {
|
|
|
|
| 161 |
store.replace_findings_for_module(parsed.module_id, [issue.model_dump() for issue in issues])
|
| 162 |
|
| 163 |
edges = build_edges(parsed_modules, module_ids, chunk_ids_by_parent)
|
| 164 |
+
edge_summarizer = EdgeSummarizer()
|
| 165 |
for edge in edges:
|
| 166 |
+
connection_summary = edge_summarizer.summarize(
|
| 167 |
+
EdgeSummaryInput(
|
| 168 |
+
source_module_id=edge.source_module_id,
|
| 169 |
+
target_module_id=edge.target_module_id,
|
| 170 |
+
edge_type=edge.edge_type.value,
|
| 171 |
+
import_line=edge.import_line,
|
| 172 |
+
scope=edge.scope,
|
| 173 |
+
)
|
| 174 |
+
)
|
| 175 |
store.upsert_edge(
|
| 176 |
source_module_id=edge.source_module_id,
|
| 177 |
target_module_id=edge.target_module_id,
|
| 178 |
edge_type=edge.edge_type,
|
| 179 |
import_line=edge.import_line,
|
| 180 |
weight=edge.weight,
|
| 181 |
+
connection_summary=connection_summary,
|
| 182 |
)
|
| 183 |
|
| 184 |
snapshot = store.get_full_graph()
|
code-review-env/db/store.py
CHANGED
|
@@ -54,6 +54,7 @@ class GraphEdgeRecord(BaseModel):
|
|
| 54 |
target_module_id: str
|
| 55 |
weight: float
|
| 56 |
import_line: str
|
|
|
|
| 57 |
|
| 58 |
|
| 59 |
class GraphSnapshot(BaseModel):
|
|
@@ -133,6 +134,7 @@ class Store:
|
|
| 133 |
edge_type: EdgeType,
|
| 134 |
import_line: str,
|
| 135 |
weight: float,
|
|
|
|
| 136 |
) -> ModuleEdge:
|
| 137 |
with Session(self.engine) as session:
|
| 138 |
existing = session.exec(
|
|
@@ -146,6 +148,7 @@ class Store:
|
|
| 146 |
if existing:
|
| 147 |
existing.edge_type = edge_type
|
| 148 |
existing.weight = weight
|
|
|
|
| 149 |
session.add(existing)
|
| 150 |
session.commit()
|
| 151 |
session.refresh(existing)
|
|
@@ -158,6 +161,7 @@ class Store:
|
|
| 158 |
edge_type=edge_type,
|
| 159 |
import_line=import_line,
|
| 160 |
weight=weight,
|
|
|
|
| 161 |
)
|
| 162 |
session.add(edge)
|
| 163 |
session.commit()
|
|
@@ -337,6 +341,7 @@ class Store:
|
|
| 337 |
target_module_id=edge.target_module_id,
|
| 338 |
weight=edge.weight,
|
| 339 |
import_line=edge.import_line,
|
|
|
|
| 340 |
)
|
| 341 |
for edge in edges
|
| 342 |
],
|
|
|
|
| 54 |
target_module_id: str
|
| 55 |
weight: float
|
| 56 |
import_line: str
|
| 57 |
+
connection_summary: str
|
| 58 |
|
| 59 |
|
| 60 |
class GraphSnapshot(BaseModel):
|
|
|
|
| 134 |
edge_type: EdgeType,
|
| 135 |
import_line: str,
|
| 136 |
weight: float,
|
| 137 |
+
connection_summary: str = "",
|
| 138 |
) -> ModuleEdge:
|
| 139 |
with Session(self.engine) as session:
|
| 140 |
existing = session.exec(
|
|
|
|
| 148 |
if existing:
|
| 149 |
existing.edge_type = edge_type
|
| 150 |
existing.weight = weight
|
| 151 |
+
existing.connection_summary = connection_summary or existing.connection_summary
|
| 152 |
session.add(existing)
|
| 153 |
session.commit()
|
| 154 |
session.refresh(existing)
|
|
|
|
| 161 |
edge_type=edge_type,
|
| 162 |
import_line=import_line,
|
| 163 |
weight=weight,
|
| 164 |
+
connection_summary=connection_summary,
|
| 165 |
)
|
| 166 |
session.add(edge)
|
| 167 |
session.commit()
|
|
|
|
| 341 |
target_module_id=edge.target_module_id,
|
| 342 |
weight=edge.weight,
|
| 343 |
import_line=edge.import_line,
|
| 344 |
+
connection_summary=edge.connection_summary,
|
| 345 |
)
|
| 346 |
for edge in edges
|
| 347 |
],
|
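The update branch of `upsert_edge` assigns `connection_summary or existing.connection_summary`, so an empty summary leaves the stored one untouched. A small sketch of that behaviour, assuming a plain dataclass in place of the `ModuleEdge` SQLModel row; `Edge` and `apply_update` are hypothetical names used only for illustration:

```python
from dataclasses import dataclass


@dataclass
class Edge:
    weight: float
    connection_summary: str = ""


def apply_update(existing: Edge, weight: float, connection_summary: str = "") -> Edge:
    existing.weight = weight
    # An empty string is falsy, so `or` falls back to the stored summary.
    existing.connection_summary = connection_summary or existing.connection_summary
    return existing


edge = Edge(weight=1.0, connection_summary="auth imports token helpers from checkout")

apply_update(edge, weight=2.0)  # no summary supplied
kept = edge.connection_summary

apply_update(edge, weight=2.0, connection_summary="auth calls checkout.validate()")
replaced = edge.connection_summary
```

One consequence of this design is that a re-seed with summaries disabled does not wipe summaries generated earlier, but it also means a summary cannot be cleared by passing `""`.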
code-review-env/env/env_loader.py
ADDED
|
@@ -0,0 +1,21 @@
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
import os
|
| 4 |
+
from pathlib import Path
|
| 5 |
+
|
| 6 |
+
|
| 7 |
+
def load_env_file(path: str | Path | None = None) -> None:
|
| 8 |
+
"""Load key-value pairs from .env without overriding existing env vars."""
|
| 9 |
+
env_path = Path(path) if path is not None else Path(__file__).resolve().parents[1] / ".env"
|
| 10 |
+
if not env_path.exists():
|
| 11 |
+
return
|
| 12 |
+
|
| 13 |
+
for raw_line in env_path.read_text(encoding="utf-8").splitlines():
|
| 14 |
+
line = raw_line.strip()
|
| 15 |
+
if not line or line.startswith("#") or "=" not in line:
|
| 16 |
+
continue
|
| 17 |
+
key, value = line.split("=", 1)
|
| 18 |
+
key = key.strip()
|
| 19 |
+
value = value.strip().strip('"').strip("'")
|
| 20 |
+
if key and key not in os.environ:
|
| 21 |
+
os.environ[key] = value
|
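To illustrate the precedence rule in `load_env_file` (values already present in the process environment win over the file), here is a self-contained sketch. It copies the parsing loop from the diff above but takes the path as a required argument; the `DEMO_*` variable names and the temporary file are assumptions made for the example:

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path: Path) -> None:
    """Same parsing rules as the added file, with the path made explicit."""
    if not path.exists():
        return
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Existing environment variables are never overridden.
        if key and key not in os.environ:
            os.environ[key] = value


os.environ["DEMO_ALREADY_SET"] = "from-shell"
os.environ.pop("DEMO_FROM_FILE", None)

with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "# comment lines and blank lines are skipped\n"
        "\n"
        "DEMO_ALREADY_SET=from-file\n"
        'DEMO_FROM_FILE="quoted value"\n',
        encoding="utf-8",
    )
    load_env_file(env_path)
```

After this runs, `DEMO_ALREADY_SET` still holds the shell value and `DEMO_FROM_FILE` holds the unquoted file value. As written, the parser does not strip trailing inline comments or an `export ` prefix, so `.env` files that rely on those would need a fuller parser.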
code-review-env/env/environment.py
CHANGED
|
@@ -21,6 +21,7 @@ from graders.base_grader import BaseGrader
|
|
| 21 |
from graders.easy_grader import EasyGrader
|
| 22 |
from graders.hard_grader import HardGrader
|
| 23 |
from graders.medium_grader import MediumGrader
|
|
|
|
| 24 |
from tasks.task_registry import TaskSpec, get_task, list_tasks, resolve_task_modules
|
| 25 |
|
| 26 |
|
|
@@ -80,6 +81,7 @@ class CodeReviewEnv:
|
|
| 80 |
self.store = Store(source_root=self.source_root, db_path=self.db_path)
|
| 81 |
self.graph_manager = GraphManager(source_root=self.source_root, db_path=self.db_path)
|
| 82 |
self.observation_builder = ObservationBuilder(source_root=self.source_root, db_path=self.db_path)
|
|
|
|
| 83 |
|
| 84 |
self._runtime: _EpisodeRuntime | None = None
|
| 85 |
self._grader: BaseGrader | None = None
|
|
@@ -196,6 +198,18 @@ class CodeReviewEnv:
|
|
| 196 |
context_request=context_request,
|
| 197 |
)
|
| 198 |
|
| 199 |
return StepResult(
|
| 200 |
observation=observation,
|
| 201 |
reward=reward.raw_value,
|
|
|
|
| 21 |
from graders.easy_grader import EasyGrader
|
| 22 |
from graders.hard_grader import HardGrader
|
| 23 |
from graders.medium_grader import MediumGrader
|
| 24 |
+
from llm.lora_adapter import LoRATrajectoryLogger
|
| 25 |
from tasks.task_registry import TaskSpec, get_task, list_tasks, resolve_task_modules
|
| 26 |
|
| 27 |
|
|
|
|
| 81 |
self.store = Store(source_root=self.source_root, db_path=self.db_path)
|
| 82 |
self.graph_manager = GraphManager(source_root=self.source_root, db_path=self.db_path)
|
| 83 |
self.observation_builder = ObservationBuilder(source_root=self.source_root, db_path=self.db_path)
|
| 84 |
+
self.lora_logger = LoRATrajectoryLogger()
|
| 85 |
|
| 86 |
self._runtime: _EpisodeRuntime | None = None
|
| 87 |
self._grader: BaseGrader | None = None
|
|
|
|
| 198 |
context_request=context_request,
|
| 199 |
)
|
| 200 |
|
| 201 |
+
self.lora_logger.log(
|
| 202 |
+
source_root=self.source_root,
|
| 203 |
+
episode_id=runtime.episode_id,
|
| 204 |
+
module_id=module_id,
|
| 205 |
+
step_number=step_number,
|
| 206 |
+
action=action,
|
| 207 |
+
reward=reward.raw_value,
|
| 208 |
+
done=runtime.done,
|
| 209 |
+
task_id=runtime.task.task_id,
|
| 210 |
+
observation_summary=f"module={observation.module_id} actions={','.join(observation.available_actions[:6])}",
|
| 211 |
+
)
|
| 212 |
+
|
| 213 |
return StepResult(
|
| 214 |
observation=observation,
|
| 215 |
reward=reward.raw_value,
|
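`llm/lora_adapter.py` is not shown in this part of the diff, so the following is only a guess at the shape of the logger that `step()` calls: an opt-in writer that appends one JSON object per transition to a JSONL file. The class name `JsonlTrajectoryLogger`, its constructor arguments, and the sample field values are all hypothetical; the real `LoRATrajectoryLogger` may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


class JsonlTrajectoryLogger:
    """Hypothetical stand-in for LoRATrajectoryLogger, not the commit's code."""

    def __init__(self, enabled: bool, data_path: Path) -> None:
        self.enabled = enabled
        self.data_path = data_path

    def log(self, **fields: Any) -> None:
        if not self.enabled:
            return  # opt-in, mirroring the documented default of GRAPHREVIEW_LORA_ENABLED=false
        self.data_path.parent.mkdir(parents=True, exist_ok=True)
        with self.data_path.open("a", encoding="utf-8") as handle:
            # default=str keeps non-JSON values (paths, enums) from raising.
            handle.write(json.dumps(fields, sort_keys=True, default=str) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "lora" / "transitions.jsonl"
    logger = JsonlTrajectoryLogger(enabled=True, data_path=path)
    logger.log(episode_id="ep-1", module_id="auth", step_number=1, reward=0.35, done=False)
    logger.log(episode_id="ep-1", module_id="auth", step_number=2, reward=0.6, done=True)
    records = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
```

Appending one line per step keeps the file usable even if an episode is interrupted, which suits a dataset that is later converted by a separate script.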
code-review-env/env/reward.py
CHANGED
|
@@ -9,6 +9,7 @@ class RewardReason(StrEnum):
|
|
| 9 |
CORRECT_FLAG = "correct_flag"
|
| 10 |
ACCURATE_COMMENT = "accurate_comment"
|
| 11 |
CORRECT_DEPENDENCY_ATTRIBUTION = "correct_dependency_attribution"
|
|
|
|
| 12 |
INCORRECT_DEPENDENCY_ATTRIBUTION = "incorrect_dependency_attribution"
|
| 13 |
CORRECT_AMENDMENT = "correct_amendment"
|
| 14 |
REQUEST_CONTEXT_COST = "request_context_cost"
|
|
@@ -23,6 +24,7 @@ RAW_REWARD_TABLE: dict[RewardReason, float] = {
|
|
| 23 |
RewardReason.CORRECT_FLAG: 0.5,
|
| 24 |
RewardReason.ACCURATE_COMMENT: 0.3,
|
| 25 |
RewardReason.CORRECT_DEPENDENCY_ATTRIBUTION: 0.6,
|
|
|
|
| 26 |
RewardReason.INCORRECT_DEPENDENCY_ATTRIBUTION: 0.1,
|
| 27 |
RewardReason.CORRECT_AMENDMENT: 0.4,
|
| 28 |
RewardReason.REQUEST_CONTEXT_COST: -0.1,
|
|
|
|
| 9 |
CORRECT_FLAG = "correct_flag"
|
| 10 |
ACCURATE_COMMENT = "accurate_comment"
|
| 11 |
CORRECT_DEPENDENCY_ATTRIBUTION = "correct_dependency_attribution"
|
| 12 |
+
PARTIAL_DEPENDENCY_ATTRIBUTION = "partial_dependency_attribution"
|
| 13 |
INCORRECT_DEPENDENCY_ATTRIBUTION = "incorrect_dependency_attribution"
|
| 14 |
CORRECT_AMENDMENT = "correct_amendment"
|
| 15 |
REQUEST_CONTEXT_COST = "request_context_cost"
|
|
|
|
| 24 |
RewardReason.CORRECT_FLAG: 0.5,
|
| 25 |
RewardReason.ACCURATE_COMMENT: 0.3,
|
| 26 |
RewardReason.CORRECT_DEPENDENCY_ATTRIBUTION: 0.6,
|
| 27 |
+
RewardReason.PARTIAL_DEPENDENCY_ATTRIBUTION: 0.35,
|
| 28 |
RewardReason.INCORRECT_DEPENDENCY_ATTRIBUTION: 0.1,
|
| 29 |
RewardReason.CORRECT_AMENDMENT: 0.4,
|
| 30 |
RewardReason.REQUEST_CONTEXT_COST: -0.1,
|
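The new `PARTIAL_DEPENDENCY_ATTRIBUTION` reason sits between the existing correct (0.6) and incorrect (0.1) rewards. A minimal sketch of how a fused score could map onto these three reasons, assuming the 0.45 and 0.75 cut-offs that appear in the `hard_grader.py` changes of this commit; `reason_for_score` is a hypothetical helper, and `(str, Enum)` is used instead of the source's `StrEnum` so the block also runs on Python versions before 3.11:

```python
from enum import Enum


class RewardReason(str, Enum):
    CORRECT_DEPENDENCY_ATTRIBUTION = "correct_dependency_attribution"
    PARTIAL_DEPENDENCY_ATTRIBUTION = "partial_dependency_attribution"
    INCORRECT_DEPENDENCY_ATTRIBUTION = "incorrect_dependency_attribution"


# Values copied from RAW_REWARD_TABLE in the diff above.
RAW_REWARD_TABLE = {
    RewardReason.CORRECT_DEPENDENCY_ATTRIBUTION: 0.6,
    RewardReason.PARTIAL_DEPENDENCY_ATTRIBUTION: 0.35,
    RewardReason.INCORRECT_DEPENDENCY_ATTRIBUTION: 0.1,
}


def reason_for_score(final_score: float) -> RewardReason:
    """Map a fused judge score in [0, 1] to a reward reason."""
    if final_score < 0.45:
        return RewardReason.INCORRECT_DEPENDENCY_ATTRIBUTION
    if final_score < 0.75:
        return RewardReason.PARTIAL_DEPENDENCY_ATTRIBUTION
    return RewardReason.CORRECT_DEPENDENCY_ATTRIBUTION


rewards = {score: RAW_REWARD_TABLE[reason_for_score(score)] for score in (0.2, 0.6, 0.9)}
```

With these inputs, 0.2 maps to the incorrect reward, 0.6 to the partial reward, and 0.9 to the correct reward.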
code-review-env/env/runtime_config.py
CHANGED
|
@@ -3,6 +3,8 @@ from __future__ import annotations
|
|
| 3 |
import os
|
| 4 |
from dataclasses import dataclass
|
| 5 |
|
|
|
|
|
|
|
| 6 |
|
| 7 |
@dataclass(frozen=True)
|
| 8 |
class RuntimeConfig:
|
|
@@ -15,6 +17,7 @@ class RuntimeConfig:
|
|
| 15 |
|
| 16 |
|
| 17 |
def load_runtime_config() -> RuntimeConfig:
|
|
|
|
| 18 |
return RuntimeConfig(
|
| 19 |
llm_provider=os.getenv("GRAPHREVIEW_LLM_PROVIDER", "ollama_openai_compat"),
|
| 20 |
llm_base_url=os.getenv("GRAPHREVIEW_LLM_BASE_URL", "http://localhost:11434/v1"),
|
|
|
|
| 3 |
import os
|
| 4 |
from dataclasses import dataclass
|
| 5 |
|
| 6 |
+
from env.env_loader import load_env_file
|
| 7 |
+
|
| 8 |
|
| 9 |
@dataclass(frozen=True)
|
| 10 |
class RuntimeConfig:
|
|
|
|
| 17 |
|
| 18 |
|
| 19 |
def load_runtime_config() -> RuntimeConfig:
|
| 20 |
+
load_env_file()
|
| 21 |
return RuntimeConfig(
|
| 22 |
llm_provider=os.getenv("GRAPHREVIEW_LLM_PROVIDER", "ollama_openai_compat"),
|
| 23 |
llm_base_url=os.getenv("GRAPHREVIEW_LLM_BASE_URL", "http://localhost:11434/v1"),
|
code-review-env/graders/hard_grader.py
CHANGED
|
@@ -38,21 +38,36 @@ class HardGrader(MediumGrader):
|
|
| 38 |
"GRAPHREVIEW_JUDGE_PROVIDER",
|
| 39 |
"ollama_openai_compat",
|
| 40 |
)
|
|
|
|
| 41 |
self.base_url = os.getenv("GRAPHREVIEW_JUDGE_BASE_URL", "http://localhost:11434/v1")
|
|
|
|
| 42 |
self.api_key = os.getenv("GRAPHREVIEW_JUDGE_API_KEY", "ollama")
|
|
|
|
| 43 |
self.timeout = float(os.getenv("GRAPHREVIEW_JUDGE_TIMEOUT_SECONDS", "8"))
|
|
|
|
| 44 |
self.judge_system_prompt = os.getenv(
|
| 45 |
"GRAPHREVIEW_JUDGE_SYSTEM_PROMPT",
|
| 46 |
self.DEFAULT_JUDGE_SYSTEM_PROMPT,
|
| 47 |
)
|
| 48 |
self.reasoning_effort = os.getenv("GRAPHREVIEW_JUDGE_REASONING_EFFORT", "none")
|
| 49 |
self.think_value = os.getenv("GRAPHREVIEW_JUDGE_THINK", "false").strip().lower()
|
| 50 |
self.max_judge_calls = int(os.getenv("GRAPHREVIEW_JUDGE_MAX_CALLS", "200"))
|
| 51 |
self.max_consecutive_failures = int(os.getenv("GRAPHREVIEW_JUDGE_MAX_CONSECUTIVE_FAILURES", "3"))
|
| 52 |
self._judge_calls = 0
|
| 53 |
self._consecutive_failures = 0
|
| 54 |
self._judge_cache: dict[str, tuple[float, str]] = {}
|
| 55 |
self.prompt_hash = hashlib.sha256(self.judge_system_prompt.encode("utf-8")).hexdigest()
|
|
|
|
| 56 |
|
| 57 |
def grade_action(
|
| 58 |
self,
|
|
@@ -95,39 +110,91 @@ class HardGrader(MediumGrader):
|
|
| 95 |
)
|
| 96 |
|
| 97 |
normalized_action = action.model_copy(update={"attributed_to": attributed_to})
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
return make_reward(
|
| 105 |
RewardReason.INCORRECT_DEPENDENCY_ATTRIBUTION,
|
| 106 |
-
|
| 107 |
metadata={
|
| 108 |
-
"judge_score":
|
|
|
|
|
|
|
|
|
|
| 109 |
"judge_provider": self.judge_provider,
|
| 110 |
"judge_model": self.judge_model,
|
|
|
|
|
|
|
| 111 |
"temperature": 0.0,
|
| 112 |
"prompt_hash": self.prompt_hash,
|
|
|
|
| 113 |
},
|
| 114 |
)
|
| 115 |
|
| 116 |
base_reason = RewardReason.CORRECT_DEPENDENCY_ATTRIBUTION
|
|
|
|
|
|
|
| 117 |
reward = make_reward(
|
| 118 |
base_reason,
|
| 119 |
-
|
| 120 |
metadata={
|
| 121 |
-
"judge_score":
|
|
|
|
|
|
|
|
|
|
| 122 |
"judge_provider": self.judge_provider,
|
| 123 |
"judge_model": self.judge_model,
|
|
|
|
|
|
|
| 124 |
"temperature": 0.0,
|
| 125 |
"prompt_hash": self.prompt_hash,
|
|
|
|
| 126 |
},
|
| 127 |
)
|
| 128 |
return reward
|
| 129 |
|
| 130 |
-
def
|
| 131 |
if not self.judge_enabled:
|
| 132 |
return 1.0, "Judge disabled by configuration; graph-consistent attribution accepted"
|
| 133 |
|
|
@@ -145,21 +212,21 @@ class HardGrader(MediumGrader):
|
|
| 145 |
"rubric": "0.0 wrong or unsupported; 0.5 partially justified; 1.0 well-justified root cause",
|
| 146 |
}
|
| 147 |
payload_text = json.dumps(payload, sort_keys=True)
|
| 148 |
-
cache_key = hashlib.sha256(payload_text.encode("utf-8")).hexdigest()
|
| 149 |
cached = self._judge_cache.get(cache_key)
|
| 150 |
if cached is not None:
|
| 151 |
return cached
|
| 152 |
|
| 153 |
try:
|
| 154 |
self._judge_calls += 1
|
| 155 |
-
client = OpenAI(api_key=
|
| 156 |
|
| 157 |
request_kwargs: dict[str, Any] = {
|
| 158 |
-
"model":
|
| 159 |
"temperature": 0.0,
|
| 160 |
"response_format": {"type": "json_object"},
|
| 161 |
"messages": [
|
| 162 |
-
{"role": "system", "content":
|
| 163 |
{"role": "user", "content": payload_text},
|
| 164 |
],
|
| 165 |
}
|
|
@@ -167,7 +234,7 @@ class HardGrader(MediumGrader):
|
|
| 167 |
if self.reasoning_effort in {"none", "low", "medium", "high"}:
|
| 168 |
request_kwargs["reasoning_effort"] = self.reasoning_effort
|
| 169 |
|
| 170 |
-
if
|
| 171 |
if self.think_value in {"true", "false", "low", "medium", "high"}:
|
| 172 |
think: bool | str
|
| 173 |
if self.think_value in {"low", "medium", "high"}:
|
|
@@ -206,3 +273,43 @@ class HardGrader(MediumGrader):
|
|
| 206 |
if start >= 0 and end > start:
|
| 207 |
return json.loads(text[start : end + 1])
|
| 208 |
raise
|
| 38 |
"GRAPHREVIEW_JUDGE_PROVIDER",
|
| 39 |
"ollama_openai_compat",
|
| 40 |
)
|
| 41 |
+
self.verifier_provider = os.getenv("GRAPHREVIEW_VERIFIER_PROVIDER", self.judge_provider)
|
| 42 |
self.base_url = os.getenv("GRAPHREVIEW_JUDGE_BASE_URL", "http://localhost:11434/v1")
|
| 43 |
+
self.verifier_base_url = os.getenv("GRAPHREVIEW_VERIFIER_BASE_URL", self.base_url)
|
| 44 |
self.api_key = os.getenv("GRAPHREVIEW_JUDGE_API_KEY", "ollama")
|
| 45 |
+
self.verifier_api_key = os.getenv("GRAPHREVIEW_VERIFIER_API_KEY", self.api_key)
|
| 46 |
self.timeout = float(os.getenv("GRAPHREVIEW_JUDGE_TIMEOUT_SECONDS", "8"))
|
| 47 |
+
self.verifier_timeout = float(os.getenv("GRAPHREVIEW_VERIFIER_TIMEOUT_SECONDS", str(self.timeout)))
|
| 48 |
self.judge_system_prompt = os.getenv(
|
| 49 |
"GRAPHREVIEW_JUDGE_SYSTEM_PROMPT",
|
| 50 |
self.DEFAULT_JUDGE_SYSTEM_PROMPT,
|
| 51 |
)
|
| 52 |
+
self.verifier_enabled = os.getenv("GRAPHREVIEW_VERIFIER_ENABLED", "true").strip().lower() == "true"
|
| 53 |
+
self.verifier_model = os.getenv("GRAPHREVIEW_VERIFIER_MODEL", self.judge_model)
|
| 54 |
+
self.verifier_system_prompt = os.getenv(
|
| 55 |
+
"GRAPHREVIEW_VERIFIER_SYSTEM_PROMPT",
|
| 56 |
+
self.DEFAULT_JUDGE_SYSTEM_PROMPT,
|
| 57 |
+
)
|
| 58 |
self.reasoning_effort = os.getenv("GRAPHREVIEW_JUDGE_REASONING_EFFORT", "none")
|
| 59 |
self.think_value = os.getenv("GRAPHREVIEW_JUDGE_THINK", "false").strip().lower()
|
| 60 |
self.max_judge_calls = int(os.getenv("GRAPHREVIEW_JUDGE_MAX_CALLS", "200"))
|
| 61 |
self.max_consecutive_failures = int(os.getenv("GRAPHREVIEW_JUDGE_MAX_CONSECUTIVE_FAILURES", "3"))
|
| 62 |
+
self.weight_deterministic = float(os.getenv("GRAPHREVIEW_JUDGE_WEIGHT_DETERMINISTIC", "0.5"))
|
| 63 |
+
self.weight_primary = float(os.getenv("GRAPHREVIEW_JUDGE_WEIGHT_PRIMARY", "0.3"))
|
| 64 |
+
self.weight_verifier = float(os.getenv("GRAPHREVIEW_JUDGE_WEIGHT_VERIFIER", "0.2"))
|
| 65 |
+
self.disagreement_threshold = float(os.getenv("GRAPHREVIEW_JUDGE_DISAGREEMENT_THRESHOLD", "0.5"))
|
| 66 |
self._judge_calls = 0
|
| 67 |
self._consecutive_failures = 0
|
| 68 |
self._judge_cache: dict[str, tuple[float, str]] = {}
|
| 69 |
self.prompt_hash = hashlib.sha256(self.judge_system_prompt.encode("utf-8")).hexdigest()
|
| 70 |
+
self.verifier_prompt_hash = hashlib.sha256(self.verifier_system_prompt.encode("utf-8")).hexdigest()
|
| 71 |
|
| 72 |
def grade_action(
|
| 73 |
self,
|
|
|
|
| 110 |
)
|
| 111 |
|
| 112 |
normalized_action = action.model_copy(update={"attributed_to": attributed_to})
|
| 113 |
+
primary_score, primary_explanation = self._judge_with_model(
|
| 114 |
+
module_id=module_id,
|
| 115 |
+
action=normalized_action,
|
| 116 |
+
model=self.judge_model,
|
| 117 |
+
provider=self.judge_provider,
|
| 118 |
+
base_url=self.base_url,
|
| 119 |
+
api_key=self.api_key,
|
| 120 |
+
timeout=self.timeout,
|
| 121 |
+
system_prompt=self.judge_system_prompt,
|
| 122 |
+
cache_scope="primary",
|
| 123 |
+
)
|
| 124 |
+
verifier_score = primary_score
|
| 125 |
+
verifier_explanation = "Verifier disabled"
|
| 126 |
+
if self.verifier_enabled:
|
| 127 |
+
verifier_score, verifier_explanation = self._judge_with_model(
|
| 128 |
+
module_id=module_id,
|
| 129 |
+
action=normalized_action,
|
| 130 |
+
model=self.verifier_model,
|
| 131 |
+
provider=self.verifier_provider,
|
| 132 |
+
base_url=self.verifier_base_url,
|
| 133 |
+
api_key=self.verifier_api_key,
|
| 134 |
+
timeout=self.verifier_timeout,
|
| 135 |
+
system_prompt=self.verifier_system_prompt,
|
| 136 |
+
cache_scope="verifier",
|
| 137 |
+
)
|
| 138 |
+
|
| 139 |
+
final_score, blend = self._blend_scores(
|
| 140 |
+
deterministic_score=1.0,
|
| 141 |
+
primary_score=primary_score,
|
| 142 |
+
verifier_score=verifier_score,
|
| 143 |
+
)
|
| 144 |
+
|
| 145 |
+
if final_score < 0.45:
|
| 146 |
return make_reward(
|
| 147 |
RewardReason.INCORRECT_DEPENDENCY_ATTRIBUTION,
|
| 148 |
+
f"{primary_explanation} | verifier: {verifier_explanation}",
|
| 149 |
metadata={
|
| 150 |
+
"judge_score": primary_score,
|
| 151 |
+
"verifier_score": verifier_score,
|
| 152 |
+
"final_score": final_score,
|
| 153 |
+
"blend": json.dumps(blend, sort_keys=True),
|
| 154 |
"judge_provider": self.judge_provider,
|
| 155 |
"judge_model": self.judge_model,
|
| 156 |
+
"verifier_provider": self.verifier_provider,
|
| 157 |
+
"verifier_model": self.verifier_model,
|
| 158 |
"temperature": 0.0,
|
| 159 |
"prompt_hash": self.prompt_hash,
|
| 160 |
+
"verifier_prompt_hash": self.verifier_prompt_hash,
|
| 161 |
},
|
| 162 |
)
|
| 163 |
|
| 164 |
base_reason = RewardReason.CORRECT_DEPENDENCY_ATTRIBUTION
|
| 165 |
+
if final_score < 0.75:
|
| 166 |
+
base_reason = RewardReason.PARTIAL_DEPENDENCY_ATTRIBUTION
|
| 167 |
reward = make_reward(
|
| 168 |
base_reason,
|
| 169 |
+
f"{primary_explanation} | verifier: {verifier_explanation}",
|
| 170 |
metadata={
|
| 171 |
+
"judge_score": primary_score,
|
| 172 |
+
"verifier_score": verifier_score,
|
| 173 |
+
"final_score": final_score,
|
| 174 |
+
"blend": json.dumps(blend, sort_keys=True),
|
| 175 |
"judge_provider": self.judge_provider,
|
| 176 |
"judge_model": self.judge_model,
|
| 177 |
+
"verifier_provider": self.verifier_provider,
|
| 178 |
+
"verifier_model": self.verifier_model,
|
| 179 |
"temperature": 0.0,
|
| 180 |
"prompt_hash": self.prompt_hash,
|
| 181 |
+
"verifier_prompt_hash": self.verifier_prompt_hash,
|
| 182 |
},
|
| 183 |
)
|
| 184 |
return reward
|
| 185 |
|
| 186 |
+
def _judge_with_model(
|
| 187 |
+
self,
|
| 188 |
+
module_id: str,
|
| 189 |
+
action: ReviewAction,
|
| 190 |
+
model: str,
|
| 191 |
+
provider: str,
|
| 192 |
+
base_url: str,
|
| 193 |
+
api_key: str,
|
| 194 |
+
timeout: float,
|
| 195 |
+
system_prompt: str,
|
| 196 |
+
cache_scope: str,
|
| 197 |
+
) -> tuple[float, str]:
|
| 198 |
if not self.judge_enabled:
|
| 199 |
return 1.0, "Judge disabled by configuration; graph-consistent attribution accepted"
|
| 200 |
|
|
|
|
| 212 |
"rubric": "0.0 wrong or unsupported; 0.5 partially justified; 1.0 well-justified root cause",
|
| 213 |
}
|
| 214 |
payload_text = json.dumps(payload, sort_keys=True)
|
| 215 |
+
cache_key = hashlib.sha256(f"{cache_scope}:{model}:{payload_text}".encode("utf-8")).hexdigest()
|
| 216 |
cached = self._judge_cache.get(cache_key)
|
| 217 |
if cached is not None:
|
| 218 |
return cached
|
| 219 |
|
| 220 |
try:
|
| 221 |
self._judge_calls += 1
|
| 222 |
+
client = OpenAI(api_key=api_key, base_url=base_url, timeout=timeout)
|
| 223 |
|
| 224 |
request_kwargs: dict[str, Any] = {
|
| 225 |
+
"model": model,
|
| 226 |
"temperature": 0.0,
|
| 227 |
"response_format": {"type": "json_object"},
|
| 228 |
"messages": [
|
| 229 |
+
{"role": "system", "content": system_prompt},
|
| 230 |
{"role": "user", "content": payload_text},
|
| 231 |
],
|
| 232 |
}
|
|
|
|
| 234 |
if self.reasoning_effort in {"none", "low", "medium", "high"}:
|
| 235 |
request_kwargs["reasoning_effort"] = self.reasoning_effort
|
| 236 |
|
| 237 |
+
if provider == "ollama_openai_compat":
|
| 238 |
if self.think_value in {"true", "false", "low", "medium", "high"}:
|
| 239 |
think: bool | str
|
| 240 |
if self.think_value in {"low", "medium", "high"}:
|
|
|
|
| 273 |
if start >= 0 and end > start:
|
| 274 |
return json.loads(text[start : end + 1])
|
| 275 |
raise
|
+
+    def _blend_scores(
+        self,
+        deterministic_score: float,
+        primary_score: float,
+        verifier_score: float,
+    ) -> tuple[float, dict[str, float | bool]]:
+        d = max(0.0, min(1.0, deterministic_score))
+        p = max(0.0, min(1.0, primary_score))
+        v = max(0.0, min(1.0, verifier_score))
+
+        wd = max(self.weight_deterministic, 0.0)
+        wp = max(self.weight_primary, 0.0)
+        wv = max(self.weight_verifier, 0.0)
+        disagreement = abs(p - v)
+        disagreement_guard = disagreement >= self.disagreement_threshold
+        if disagreement_guard:
+            wp = min(wp, 0.1)
+            wv = max(wv, 0.4)
+            wd = max(wd, 0.5)
+
+        total = wd + wp + wv
+        if total <= 0:
+            return 0.0, {"wd": 0.0, "wp": 0.0, "wv": 0.0, "disagreement": disagreement, "guard": disagreement_guard}
+
+        wd /= total
+        wp /= total
+        wv /= total
+
+        final = (wd * d) + (wp * p) + (wv * v)
+        if p == 1.0 and v == 0.0:
+            final = min(final, 0.45)
+
+        return final, {
+            "wd": wd,
+            "wp": wp,
+            "wv": wv,
+            "disagreement": disagreement,
+            "guard": disagreement_guard,
+        }
|
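To make the disagreement guard in `_blend_scores` concrete, the sketch below restates the blending rule as a standalone function and evaluates two cases. The default weights (0.4, 0.4, 0.2) and the 0.5 threshold are assumptions for illustration only; the real values come from `self.weight_*` and `self.disagreement_threshold`, which this hunk does not show.

```python
def blend_scores(d, p, v, wd=0.4, wp=0.4, wv=0.2, threshold=0.5):
    """Standalone restatement of the fusion rule.

    The weight and threshold defaults are assumed values, not the project's.
    """
    # Clamp all three scores into [0, 1].
    d, p, v = (max(0.0, min(1.0, x)) for x in (d, p, v))
    # Negative weights are treated as zero.
    wd, wp, wv = max(wd, 0.0), max(wp, 0.0), max(wv, 0.0)

    disagreement = abs(p - v)
    guard = disagreement >= threshold
    if guard:
        # Shift trust away from the primary judge when the two judges diverge.
        wp = min(wp, 0.1)
        wv = max(wv, 0.4)
        wd = max(wd, 0.5)

    total = wd + wp + wv
    if total <= 0:
        return 0.0, guard

    # Dividing the weighted sum by the total equals normalizing each weight first.
    final = (wd * d + wp * p + wv * v) / total
    if p == 1.0 and v == 0.0:
        # Hard cap for the worst case: primary fully accepts, verifier fully rejects.
        final = min(final, 0.45)
    return final, guard


agree_score, agree_guard = blend_scores(d=1.0, p=1.0, v=1.0)
clash_score, clash_guard = blend_scores(d=1.0, p=1.0, v=0.0)
print(agree_score, agree_guard)
print(clash_score, clash_guard)
```

Under these assumed weights, a plain weighted average of the second case would land near 0.8; the guard plus the cap keep it at or below 0.45, which is the behavior the plan describes as reducing the score safely on strong disagreement.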
code-review-env/graph/graph_manager.py
CHANGED
|
@@ -57,6 +57,7 @@ class GraphManager:
|
|
| 57 |
edge_type=edge.edge_type.value,
|
| 58 |
import_line=edge.import_line,
|
| 59 |
weight=edge.weight,
|
|
|
|
| 60 |
)
|
| 61 |
|
| 62 |
self._graph_cache = graph
|
|
|
|
| 57 |
edge_type=edge.edge_type.value,
|
| 58 |
import_line=edge.import_line,
|
| 59 |
weight=edge.weight,
|
| 60 |
+
connection_summary=edge.connection_summary,
|
| 61 |
)
|
| 62 |
|
| 63 |
self._graph_cache = graph
|
code-review-env/llm/__init__.py
ADDED
|
@@ -0,0 +1 @@
+"""LLM helpers for GraphReview."""
|
code-review-env/llm/edge_summarizer.py
ADDED
|
@@ -0,0 +1,79 @@
+from __future__ import annotations
+
+import hashlib
+import json
+import os
+from dataclasses import dataclass
+
+from openai import OpenAI
+
+
+@dataclass(frozen=True)
+class EdgeSummaryInput:
+    source_module_id: str
+    target_module_id: str
+    edge_type: str
+    import_line: str
+    scope: str
+
+
+class EdgeSummarizer:
+    """Generate concise edge relationship summaries with deterministic fallback."""
+
+    def __init__(self) -> None:
+        self.enabled = os.getenv("GRAPHREVIEW_EDGE_SUMMARY_ENABLED", "false").strip().lower() == "true"
+        if os.getenv("PYTEST_CURRENT_TEST"):
+            self.enabled = False
+        self.base_url = os.getenv("GRAPHREVIEW_EDGE_SUMMARY_BASE_URL", os.getenv("GRAPHREVIEW_LLM_BASE_URL", "http://localhost:11434/v1"))
+        self.api_key = os.getenv("GRAPHREVIEW_EDGE_SUMMARY_API_KEY", os.getenv("GRAPHREVIEW_LLM_API_KEY", "ollama"))
+        self.model = os.getenv("GRAPHREVIEW_EDGE_SUMMARY_MODEL", os.getenv("GRAPHREVIEW_LLM_MODEL_AGENT", "gemma4:e4b"))
+        self.timeout = float(os.getenv("GRAPHREVIEW_EDGE_SUMMARY_TIMEOUT_SECONDS", "8"))
+        self.max_calls = int(os.getenv("GRAPHREVIEW_EDGE_SUMMARY_MAX_CALLS", "5000"))
+        self._calls = 0
+        self._cache: dict[str, str] = {}
+
+    def summarize(self, edge: EdgeSummaryInput) -> str:
+        payload = json.dumps(edge.__dict__, sort_keys=True)
+        cache_key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
+        if cache_key in self._cache:
+            return self._cache[cache_key]
+
+        summary = self._fallback_summary(edge)
+        if self.enabled and self._calls < self.max_calls:
+            try:
+                self._calls += 1
+                client = OpenAI(api_key=self.api_key, base_url=self.base_url, timeout=self.timeout)
+                response = client.chat.completions.create(
+                    model=self.model,
+                    temperature=0.0,
+                    messages=[
+                        {
+                            "role": "system",
+                            "content": (
+                                "You summarize Python dependency edges. Produce one sentence (max 24 words) "
+                                "explaining why source depends on target using the import/call evidence."
+                            ),
+                        },
+                        {"role": "user", "content": payload},
+                    ],
+                )
+                text = (response.choices[0].message.content or "").strip()
+                if text:
+                    summary = text[:240]
+            except Exception:
+                # Keep deterministic fallback to avoid breaking seed.
+                pass
+
+        self._cache[cache_key] = summary
+        return summary
+
+    @staticmethod
+    def _fallback_summary(edge: EdgeSummaryInput) -> str:
+        edge_kind = edge.edge_type.replace("_", " ")
+        evidence = edge.import_line.strip() or "implicit usage"
+        if len(evidence) > 120:
+            evidence = evidence[:117] + "..."
+        return (
+            f"{edge.source_module_id} depends on {edge.target_module_id} via {edge_kind}; "
+            f"evidence: {evidence}."
+        )
|
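When the LLM path is disabled or fails, `EdgeSummarizer.summarize` returns the deterministic fallback string. The sketch below reproduces only that string logic in a standalone form so it can run without the `openai` package; the module names in the example edge are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeSummaryInput:
    source_module_id: str
    target_module_id: str
    edge_type: str
    import_line: str
    scope: str


def fallback_summary(edge: EdgeSummaryInput) -> str:
    """Deterministic summary, mirroring the string logic of _fallback_summary."""
    edge_kind = edge.edge_type.replace("_", " ")
    evidence = edge.import_line.strip() or "implicit usage"
    if len(evidence) > 120:
        # Long import lines are cut to 117 characters plus an ellipsis.
        evidence = evidence[:117] + "..."
    return (
        f"{edge.source_module_id} depends on {edge.target_module_id} via {edge_kind}; "
        f"evidence: {evidence}."
    )


# Hypothetical edge; the module names are made up for illustration.
example_edge = EdgeSummaryInput(
    source_module_id="pkg.api",
    target_module_id="pkg.db",
    edge_type="explicit_import",
    import_line="from pkg.db import Session",
    scope="module",
)
print(fallback_summary(example_edge))
# pkg.api depends on pkg.db via explicit import; evidence: from pkg.db import Session.
```

Because the fallback depends only on the edge fields, every stored edge gets a non-empty `connection_summary` even when no model is reachable.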
code-review-env/llm/lora_adapter.py
ADDED
|
@@ -0,0 +1,69 @@
+from __future__ import annotations
+
+import json
+import os
+from dataclasses import dataclass
+from datetime import UTC, datetime
+from pathlib import Path
+
+from env.action import ReviewAction
+
+
+@dataclass(frozen=True)
+class TransitionRecord:
+    source_root: str
+    episode_id: str
+    module_id: str
+    step_number: int
+    action_type: str
+    reward: float
+    done: bool
+    task_id: str
+    observation_summary: str
+    action_payload: dict[str, object]
+
+
+class LoRATrajectoryLogger:
+    """Append RL transitions to JSONL for optional LoRA fine-tuning workflows."""
+
+    def __init__(self) -> None:
+        self.enabled = os.getenv("GRAPHREVIEW_LORA_ENABLED", "false").strip().lower() == "true"
+        output_path = os.getenv("GRAPHREVIEW_LORA_DATA_PATH", "outputs/lora/transitions.jsonl")
+        self.path = Path(output_path)
+
+    def log(
+        self,
+        *,
+        source_root: str,
+        episode_id: str,
+        module_id: str,
+        step_number: int,
+        action: ReviewAction,
+        reward: float,
+        done: bool,
+        task_id: str,
+        observation_summary: str,
+    ) -> None:
+        if not self.enabled:
+            return
+
+        record = TransitionRecord(
+            source_root=source_root,
+            episode_id=episode_id,
+            module_id=module_id,
+            step_number=step_number,
+            action_type=action.action_type.value,
+            reward=reward,
+            done=done,
+            task_id=task_id,
+            observation_summary=observation_summary,
+            action_payload=action.model_dump(mode="json", exclude_none=True),
+        )
+
+        self.path.parent.mkdir(parents=True, exist_ok=True)
+        payload = {
+            **record.__dict__,
+            "created_at": datetime.now(UTC).isoformat(),
+        }
+        with self.path.open("a", encoding="utf-8") as handle:
+            handle.write(json.dumps(payload, sort_keys=True) + "\n")
|
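The logger above writes one JSON object per line in append mode. As a rough illustration of that pattern, the sketch below appends two records to a temporary file and reads them back. It is a simplified stand-in, not the project's logger: the helper names are hypothetical, the field values are invented, and `timezone.utc` is used instead of `datetime.UTC` so the sketch also runs on Python versions before 3.11.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_transition(path: Path, record: dict) -> None:
    """Append one transition as a JSON line, creating parent folders on demand."""
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {**record, "created_at": datetime.now(timezone.utc).isoformat()}
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(payload, sort_keys=True) + "\n")


def read_transitions(path: Path) -> list[dict]:
    """Read the log back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "lora" / "transitions.jsonl"
    # Hypothetical values; only the field names follow TransitionRecord.
    for step in range(2):
        append_transition(
            log_path,
            {
                "episode_id": "ep-1",
                "module_id": "pkg.api",
                "step_number": step,
                "action_type": "flag_issue",
                "reward": 0.5,
                "done": step == 1,
            },
        )
    rows = read_transitions(log_path)

print(len(rows), rows[-1]["done"])
```

Appending line by line keeps each write small and lets a partially written run still be parsed up to the last complete line.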
code-review-env/llm/lora_finetune.py
ADDED
|
@@ -0,0 +1,59 @@
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+
+def export_sft_dataset(transitions_path: Path, output_path: Path) -> int:
+    """Convert transition logs into a simple instruction-tuning JSONL dataset."""
+    if not transitions_path.exists():
+        raise FileNotFoundError(f"Transitions file not found: {transitions_path}")
+
+    rows = transitions_path.read_text(encoding="utf-8").splitlines()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+
+    count = 0
+    with output_path.open("w", encoding="utf-8") as out:
+        for row in rows:
+            payload = json.loads(row)
+            sample = {
+                "instruction": (
+                    "Review this module using graph-aware reasoning and choose the best next action."
+                ),
+                "input": payload.get("observation_summary", ""),
+                "output": json.dumps(payload.get("action_payload", {}), sort_keys=True),
+                "meta": {
+                    "reward": payload.get("reward", 0.0),
+                    "task_id": payload.get("task_id", ""),
+                    "module_id": payload.get("module_id", ""),
+                },
+            }
+            out.write(json.dumps(sample, sort_keys=True) + "\n")
+            count += 1
+    return count
+
+
+def _build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(description="Prepare LoRA fine-tuning dataset from GraphReview transitions")
+    parser.add_argument(
+        "--transitions",
+        default="outputs/lora/transitions.jsonl",
+        help="Input transition JSONL produced by runtime",
+    )
+    parser.add_argument(
+        "--output",
+        default="outputs/lora/sft_dataset.jsonl",
+        help="Output SFT JSONL dataset path",
+    )
+    return parser
+
+
+def main() -> None:
+    args = _build_parser().parse_args()
+    count = export_sft_dataset(Path(args.transitions), Path(args.output))
+    print(json.dumps({"ok": True, "samples": count, "output": args.output}, indent=2))
+
+
+if __name__ == "__main__":
+    main()
|
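To show what `export_sft_dataset` produces for a single row, the sketch below applies the same field mapping to one in-memory transition. The function name `to_sft_sample` and all record values are hypothetical; only the key names follow the code above.

```python
import json


def to_sft_sample(row: str) -> dict:
    """Map one transition JSON line to an instruction-tuning sample."""
    payload = json.loads(row)
    return {
        "instruction": "Review this module using graph-aware reasoning and choose the best next action.",
        "input": payload.get("observation_summary", ""),
        # The action is serialized again so the target is a single string.
        "output": json.dumps(payload.get("action_payload", {}), sort_keys=True),
        "meta": {
            "reward": payload.get("reward", 0.0),
            "task_id": payload.get("task_id", ""),
            "module_id": payload.get("module_id", ""),
        },
    }


# Hypothetical transition row, as it might appear in transitions.jsonl.
row = json.dumps(
    {
        "module_id": "pkg.api",
        "task_id": "hard",
        "reward": 0.75,
        "observation_summary": "pkg.api imports pkg.db; 2 lint issues",
        "action_payload": {"line": 12, "action_type": "flag_issue"},
    }
)
sample = to_sft_sample(row)
print(sample["output"])
# {"action_type": "flag_issue", "line": 12}
```

Keeping `reward` under `meta` rather than in the prompt leaves room for a later step to filter or weight samples by reward, although the script shown here exports every row unchanged.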
code-review-env/parser/ast_parser.py
CHANGED
|
@@ -9,6 +9,7 @@ from pydantic import BaseModel
 
 from db.schema import EdgeType
 from db.store import Store
+from llm.edge_summarizer import EdgeSummarizer, EdgeSummaryInput
 from parser.linter import run_linters
 from parser.summarizer import summarize_module
 
@@ -205,6 +206,7 @@ def parse_directory(target_dir: Path, db_path: str | None = None) -> Store:
     py_files = _iter_python_files(target_dir)
     parsed_modules = [parse_python_file(py_file, target_dir) for py_file in py_files]
     known_module_ids = {parsed.module_id for parsed in parsed_modules}
+    edge_summarizer = EdgeSummarizer()
 
     for py_file, parsed in zip(py_files, parsed_modules):
         issues = run_linters(py_file)
@@ -223,12 +225,22 @@ def parse_directory(target_dir: Path, db_path: str | None = None) -> Store:
         )
         for imported in parsed.imports:
             if imported.target_module and imported.target_module in known_module_ids:
+                connection_summary = edge_summarizer.summarize(
+                    EdgeSummaryInput(
+                        source_module_id=parsed.module_id,
+                        target_module_id=imported.target_module,
+                        edge_type=imported.edge_type.value,
+                        import_line=imported.import_line,
+                        scope=imported.scope,
+                    )
+                )
                 store.upsert_edge(
                     source_module_id=parsed.module_id,
                     target_module_id=imported.target_module,
                     edge_type=imported.edge_type,
                     import_line=imported.import_line,
                     weight=imported.weight,
+                    connection_summary=connection_summary,
                 )
 
     return store
|
code-review-env/parser/graph_builder.py
CHANGED
|
@@ -15,6 +15,7 @@ class EdgeRecord(BaseModel):
     import_line: str
     scope: str
     weight: float
+    connection_summary: str = ""
 
 
 def _build_intra_file_edges(parsed: ParsedModule, available_chunk_ids: set[str]) -> list[EdgeRecord]:
|
code-review-env/pyproject.toml
CHANGED
|
@@ -16,6 +16,7 @@ dependencies = [
 
 [project.scripts]
 server = "server.app:main"
+graphreview = "run_project:main"
 
 [tool.pytest.ini_options]
 pythonpath = ["."]
|
code-review-env/run_project.py
ADDED
|
@@ -0,0 +1,82 @@
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+from graders.review_runner import generate_reports, run_review
+
+
+def _build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(
+        description="Unified GraphReview runner: seed + run easy/medium/hard + generate artifacts"
+    )
+    parser.add_argument("target", help="Target Python project folder")
+    parser.add_argument("--db-path", default=None, help="Optional SQLite DB path")
+    parser.add_argument("--force-seed", action="store_true", help="Force graph reseed")
+    parser.add_argument("--skip-seed", action="store_true", help="Skip seeding and reuse DB")
+    parser.add_argument("--modules", nargs="*", default=None, help="Optional module focus list")
+    parser.add_argument("--filter-hops", type=int, default=1, help="Neighbor expansion hops for --modules")
+    parser.add_argument("--output-dir", default="outputs", help="Artifacts output directory")
+    parser.add_argument("--report-prefix", default="graphreview_full", help="Artifact prefix")
+    parser.add_argument("--no-progress", action="store_true", help="Disable progress logs")
+    parser.add_argument(
+        "--levels",
+        nargs="*",
+        choices=["easy", "medium", "hard"],
+        default=["easy", "medium", "hard"],
+        help="Review levels to run",
+    )
+    return parser
+
+
+def main() -> None:
+    args = _build_parser().parse_args()
+    target = Path(args.target).resolve()
+
+    summary: dict[str, object] = {
+        "target": str(target),
+        "levels": {},
+    }
+
+    for idx, level in enumerate(args.levels):
+        scores = run_review(
+            target=target,
+            db_path=args.db_path,
+            grader_level=level,
+            force_seed=args.force_seed if idx == 0 else False,
+            skip_seed=args.skip_seed if idx == 0 else True,
+            show_progress=not args.no_progress,
+            module_filter=args.modules,
+            filter_hops=args.filter_hops,
+        )
+        total = float(sum(scores.values()))
+        summary["levels"][level] = {
+            "modules": len(scores),
+            "raw_total": total,
+            "avg_raw_per_module": (total / len(scores)) if scores else 0.0,
+        }
+
+    artifacts = generate_reports(
+        target=target,
+        db_path=args.db_path,
+        output_dir=args.output_dir,
+        module_filter=args.modules,
+        filter_hops=args.filter_hops,
+        report_prefix=args.report_prefix,
+    )
+
+    summary["artifacts"] = {
+        "markdown": artifacts.markdown_path,
+        "json": artifacts.json_path,
+        "html": artifacts.html_path,
+        "confidence_score": artifacts.confidence_score,
+        "module_count": artifacts.module_count,
+        "edge_count": artifacts.edge_count,
+    }
+
+    print(json.dumps(summary, indent=2))
+
+
+if __name__ == "__main__":
+    main()
|
code-review-env/server/app.py
CHANGED
|
@@ -14,12 +14,15 @@ from sqlmodel import Session, select
|
|
| 14 |
|
| 15 |
from db.schema import ModuleEdge, ModuleNode
|
| 16 |
from db.store import Store
|
|
|
|
| 17 |
from env.action import ActionType, ReviewAction
|
| 18 |
from env.environment import CodeReviewEnv, StepResult
|
| 19 |
from env.observation import CodeObservation
|
| 20 |
from env.state import GraphState
|
| 21 |
from visualizer.report_generator import GeneratedArtifacts, generate_phase5_outputs
|
| 22 |
|
|
|
|
|
|
|
| 23 |
|
| 24 |
class ResetRequest(BaseModel):
|
| 25 |
model_config = ConfigDict(strict=True, extra="forbid")
|
|
@@ -297,6 +300,7 @@ def ui_result(report_path: str = Query(..., min_length=1)) -> ResultDetail:
|
|
| 297 |
"edge_type",
|
| 298 |
"import_line",
|
| 299 |
"weight",
|
|
|
|
| 300 |
],
|
| 301 |
},
|
| 302 |
)
|
|
|
|
| 14 |
|
| 15 |
from db.schema import ModuleEdge, ModuleNode
|
| 16 |
from db.store import Store
|
| 17 |
+
from env.env_loader import load_env_file
|
| 18 |
from env.action import ActionType, ReviewAction
|
| 19 |
from env.environment import CodeReviewEnv, StepResult
|
| 20 |
from env.observation import CodeObservation
|
| 21 |
from env.state import GraphState
|
| 22 |
from visualizer.report_generator import GeneratedArtifacts, generate_phase5_outputs
|
| 23 |
|
| 24 |
+
load_env_file()
|
| 25 |
+
|
| 26 |
|
| 27 |
class ResetRequest(BaseModel):
|
| 28 |
model_config = ConfigDict(strict=True, extra="forbid")
|
|
|
|
| 300 |
"edge_type",
|
| 301 |
"import_line",
|
| 302 |
"weight",
|
| 303 |
+
"connection_summary",
|
| 304 |
],
|
| 305 |
},
|
| 306 |
)
|
code-review-env/tests/test_graders.py
CHANGED
|
@@ -93,7 +93,9 @@ def test_hard_grader_dependency_attribution(tmp_path: Path) -> None:
     graph = GraphManager(source_root=str(project), db_path=str(db_path))
     grader = HardGrader(store, graph)
 
-    grader.
+    grader._judge_with_model = (  # type: ignore[method-assign]
+        lambda module_id, action, model, provider, base_url, api_key, timeout, system_prompt, cache_scope: (1.0, "ok")
+    )
 
     good = grader.grade_episode(
         module_id="a",
|
code-review-env/visualizer/pyvis_renderer.py
CHANGED
|
@@ -54,12 +54,19 @@ def render_graph_html(
|
|
| 54 |
|
| 55 |
for edge in edges:
|
| 56 |
edge_type = str(edge.get("edge_type", "explicit_import"))
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 57 |
net.add_edge(
|
| 58 |
source=str(edge["source"]),
|
| 59 |
to=str(edge["target"]),
|
| 60 |
-
title=
|
| 61 |
color=EDGE_COLORS.get(edge_type, EDGE_COLORS["explicit_import"]),
|
| 62 |
-
value=
|
|
|
|
| 63 |
arrows="to",
|
| 64 |
)
|
| 65 |
|
|
@@ -85,7 +92,7 @@ def render_graph_html(
|
|
| 85 |
},
|
| 86 |
"edges": {
|
| 87 |
"smooth": {"enabled": False},
|
| 88 |
-
"arrows": {"to": {"enabled": True, "scaleFactor": 0.
|
| 89 |
},
|
| 90 |
}
|
| 91 |
)
|
|
|
|
| 54 |
|
| 55 |
for edge in edges:
|
| 56 |
edge_type = str(edge.get("edge_type", "explicit_import"))
|
| 57 |
+
edge_title = str(edge.get("title", edge_type))
|
| 58 |
+
formatted_title = (
|
| 59 |
+
"<div style='max-width:360px'>"
|
| 60 |
+
f"<b>{edge_type}</b><br>{edge_title}"
|
| 61 |
+
"</div>"
|
| 62 |
+
)
|
| 63 |
net.add_edge(
|
| 64 |
source=str(edge["source"]),
|
| 65 |
to=str(edge["target"]),
|
| 66 |
+
title=formatted_title,
|
| 67 |
color=EDGE_COLORS.get(edge_type, EDGE_COLORS["explicit_import"]),
|
| 68 |
+
value=1.0,
|
| 69 |
+
width=max(1.0, min(float(edge.get("weight", 1.0)) * 1.3, 2.2)),
|
| 70 |
arrows="to",
|
| 71 |
)
|
| 72 |
|
|
|
|
| 92 |
},
|
| 93 |
"edges": {
|
| 94 |
"smooth": {"enabled": False},
|
| 95 |
+
"arrows": {"to": {"enabled": True, "scaleFactor": 0.35}},
|
| 96 |
},
|
| 97 |
}
|
| 98 |
)
|
code-review-env/visualizer/report_generator.py
CHANGED
|
@@ -347,6 +347,7 @@ def _build_json_payload(
|
|
| 347 |
"edge_type": edge.edge_type.value,
|
| 348 |
"weight": edge.weight,
|
| 349 |
"import_line": edge.import_line,
|
|
|
|
| 350 |
}
|
| 351 |
for edge in sorted(edges, key=lambda item: (item.source_module_id, item.target_module_id, item.import_line))
|
| 352 |
]
|
|
@@ -562,7 +563,9 @@ def generate_phase5_outputs(
|
|
| 562 |
"target": edge.target_module_id,
|
| 563 |
"edge_type": edge.edge_type.value,
|
| 564 |
"weight": edge.weight,
|
| 565 |
-
"title":
|
|
|
|
|
|
|
| 566 |
}
|
| 567 |
)
|
| 568 |
|
|
|
|
| 347 |
"edge_type": edge.edge_type.value,
|
| 348 |
"weight": edge.weight,
|
| 349 |
"import_line": edge.import_line,
|
| 350 |
+
"connection_summary": edge.connection_summary,
|
| 351 |
}
|
| 352 |
for edge in sorted(edges, key=lambda item: (item.source_module_id, item.target_module_id, item.import_line))
|
| 353 |
]
|
|
|
|
| 563 |
"target": edge.target_module_id,
|
| 564 |
"edge_type": edge.edge_type.value,
|
| 565 |
"weight": edge.weight,
|
| 566 |
+
"title": (
|
| 567 |
+
f"{edge.edge_type.value}: {edge.connection_summary or edge.import_line}"
|
| 568 |
+
),
|
| 569 |
}
|
| 570 |
)
|
| 571 |
|
plans/phase-06-adaptive-judge-edge-summary-lora-plan.md
ADDED
|
@@ -0,0 +1,63 @@
+# Phase 06 Plan - Adaptive Judging, Edge Intelligence, LoRA Hooks, and Config Hygiene
+
+## Objective
+Upgrade the current GraphReview environment for Round 1 reliability with:
+- Adaptive hard-grader fusion to reduce catastrophic judge mistakes.
+- Per-edge connection summaries generated by LLM (with deterministic fallback).
+- LoRA learning hooks so the system can improve across projects.
+- Centralized `.env`-driven configuration for runtime, models, server, and reporting.
+
+## Design Principles
+- Occam's Razor: add only minimal mechanisms that directly improve reliability and score.
+- Single Responsibility: parser builds graph; graders score; learning hooks collect trajectories.
+- Determinism First: easy/medium deterministic; hard judge constrained and auditable.
+- Fail-Safe Defaults: LLM optional, deterministic fallback mandatory.
+- Open/Closed: add extensible configs and adapters without rewriting core runtime.
+
+## Scope
+1. Adaptive hard-grader fusion
+- Add deterministic gate + primary judge + verifier judge fusion.
+- Dynamic weighting with disagreement-aware reweighting.
+- Persist judge metadata and fusion breakdown in annotation payload.
+
+2. Edge connection summaries
+- Extend `ModuleEdge` schema with `connection_summary`.
+- Build `llm/edge_summarizer.py` using OpenAI-compatible API.
+- Generate edge summary for each edge during seed.
+- Fallback summary if LLM unavailable.
+
+3. LoRA learning hooks
+- Add transition logging to JSONL during runtime (`state`, `action`, `reward`, `done`).
+- Add `llm/lora_finetune.py` skeleton for dataset export + optional train path.
+- Keep training optional via env vars and feature flags.
+
+4. `.env` and config hygiene
+- Add `.env` file with all tunables: host/port, DB, models, judge settings, edge summarizer, LoRA toggles.
+- Add lightweight env loader utility and invoke early in runtime/server/migrations.
+
+## Implementation Steps
+1. Add env loader and wire it to startup-sensitive modules.
+2. Add `connection_summary` field + migration + store methods.
+3. Add edge summarizer module and integrate into seed pipeline.
+4. Add adaptive hard grader fusion and metadata persistence.
+5. Add LoRA transition logger + finetune utility script.
+6. Update visualization and report generation to display connection summaries.
+7. Update README with new env variables and usage.
+
+## Verification
+- Seeding produces non-empty `connection_summary` for all stored edges.
+- Hard grader returns stable fused score and persists fusion metadata.
+- If primary judge and verifier disagree strongly, final score is reduced safely.
+- Runtime emits LoRA trajectory JSONL when enabled.
+- Server reads `.env` and applies host/port/model settings.
+
+## Risks and Mitigations
+- LLM edge summarization latency: use caching + timeout + deterministic fallback.
+- Judge model outages: keep deterministic gate and verifier fallback behavior.
+- LoRA dependency burden: keep optional and fail gracefully if packages absent.
+
+## Definition of Done
+- Plan implemented in code with passing runtime smoke checks.
+- New config values available through `.env` and documented.
+- Graph UI and reports now show concise per-edge connection summaries.
+- Hard grader is safer against single-model catastrophic errors.
|
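The plan calls for a lightweight env loader that runs early at startup, but `env/env_loader.py` itself is not shown in this part of the diff. The sketch below is one plausible shape for such a loader, written under stated assumptions: the function name `load_env_sketch`, the `override` flag, the rule that already-set variables win, and the `GRAPHREVIEW_DEMO_PRESET` key are all hypothetical and may differ from the project's `load_env_file`.

```python
import os
import tempfile
from pathlib import Path


def load_env_sketch(path: Path, override: bool = False) -> dict[str, str]:
    """Parse KEY=VALUE lines and copy them into os.environ; returns what was parsed."""
    parsed: dict[str, str] = {}
    if not path.exists():
        return parsed  # A missing file is not an error in this sketch.
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")
        if not key:
            continue
        parsed[key] = value
        # Assumption: variables already set in the process environment take precedence.
        if override or key not in os.environ:
            os.environ[key] = value
    return parsed


os.environ.pop("GRAPHREVIEW_EDGE_SUMMARY_ENABLED", None)
os.environ["GRAPHREVIEW_DEMO_PRESET"] = "from-shell"  # hypothetical key

with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "# demo file\n"
        "GRAPHREVIEW_EDGE_SUMMARY_ENABLED=true\n"
        'GRAPHREVIEW_DEMO_PRESET="from-file"\n',
        encoding="utf-8",
    )
    loaded = load_env_sketch(env_path)

print(os.environ["GRAPHREVIEW_EDGE_SUMMARY_ENABLED"], os.environ["GRAPHREVIEW_DEMO_PRESET"])
```

Calling a loader like this at import time, before any class reads `os.getenv`, is what lets constructors such as `EdgeSummarizer.__init__` pick up `.env` values without taking a config object as a parameter.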
temp.md
ADDED
|
@@ -0,0 +1,1333 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
You are a project planning expert. I am attending a pre-hackathon competition and I need to build an RL environment. I am building this project for this submission.
|
| 2 |
+
|
| 3 |
+
# Builder Prompt — GraphReview RL Environment
|
| 4 |
+
|
| 5 |
+
You are an expert Python engineer and planner. You do not build. You can add more tools to catch more security vulnerabilities in the modules before actually sending them out. You can also turn on thinking for the gemma 4 model if it works better, and ensure it runs on all the modules and actually finds new information rather than just repeating the findings of previous models. The previous findings should still be provided as context, with an instruction to find more about those errors and any new errors if possible. The target is a production-quality RL environment for a competitive hackathon (OpenEnv Round 1). You have one job: build the GraphReview environment correctly, phase by phase, without breaking prior work.
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## What You Are Building
|
| 10 |
+
|
| 11 |
+
An OpenEnv-compliant RL environment where an LLM agent reviews Python code with full dependency graph awareness. The environment parses a Python codebase into a persistent SQLite-backed dependency graph, pre-computes ground truth linter flags, and exposes a step()/reset()/state() API for an agent to interact with.
|
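A minimal sketch of the step()/reset()/state() surface described above. It uses plain dataclasses instead of the real openenv-core base classes and Pydantic models, and a fixed dict instead of the SQLite-backed graph. The names `ReviewAction`, `ReviewObservation`, and `GraphReviewEnv` are hypothetical and only illustrate the shape of the API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReviewAction:
    """Hypothetical action: a flag kind plus an optional comment."""
    kind: str            # e.g. "FLAG_BUG" or "APPROVE"
    comment: str = ""


@dataclass
class ReviewObservation:
    """Hypothetical observation: the module under review and its graph neighbours."""
    module: str
    code: str
    neighbors: list[str] = field(default_factory=list)


class GraphReviewEnv:
    """Toy environment: ground truth is a fixed dict, not a database table."""

    def __init__(self, modules: dict[str, str], truth: dict[str, str]) -> None:
        self._modules = modules
        self._truth = truth          # module name -> expected action kind
        self._current = ""
        self._done = False

    def reset(self, module: str) -> ReviewObservation:
        """Start a fresh episode on one module and return the first observation."""
        self._current = module
        self._done = False
        return ReviewObservation(module=module, code=self._modules[module])

    def step(self, action: ReviewAction) -> tuple[ReviewObservation, float, bool, dict[str, str]]:
        """Score one action against the ground truth and end the episode."""
        expected = self._truth[self._current]
        reward = 1.0 if action.kind == expected else 0.0
        self._done = True
        obs = ReviewObservation(module=self._current, code=self._modules[self._current])
        return obs, reward, self._done, {"expected": expected}

    def state(self) -> dict[str, object]:
        """Return the current episode state."""
        return {"module": self._current, "done": self._done}
```

In this toy version every episode ends after a single step; the real environment would presumably allow several flags and comments before a terminal action.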
| 12 |
+
|
| 13 |
+
This is online RL — no training dataset is needed. The ground truth (pylint/bandit/pyflakes results) is computed once at seed time and stored in SQLite. The agent explores the environment and receives rewards compared against that ground truth.
|
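One way to picture "computed once at seed time and stored in SQLite" is the sketch below, written with the standard-library `sqlite3` module rather than SQLAlchemy to stay self-contained. The table and column names (`ground_truth`, `annotations`) are assumptions, not the project's actual schema.

```python
import sqlite3


def seed(conn: sqlite3.Connection, flags: list[tuple[str, int, str]]) -> None:
    """Create both tables and store the pre-computed linter flags once."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ground_truth (module TEXT, line INTEGER, code TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotations (module TEXT, line INTEGER, note TEXT)"
    )
    conn.executemany("INSERT INTO ground_truth VALUES (?, ?, ?)", flags)
    conn.commit()


def reset_run(conn: sqlite3.Connection) -> None:
    """Clear only per-run annotations; the seeded ground truth is left untouched."""
    conn.execute("DELETE FROM annotations")
    conn.commit()


def row_counts(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (ground_truth rows, annotation rows) for quick inspection."""
    truth = conn.execute("SELECT COUNT(*) FROM ground_truth").fetchone()[0]
    notes = conn.execute("SELECT COUNT(*) FROM annotations").fetchone()[0]
    return truth, notes
```

Calling `reset_run` between episodes keeps the expensive linter pass out of the loop, which is the point of seeding once.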
| 14 |
+
|
| 15 |
+
The full phase plan and architecture are provided below. Read the entire plan before writing a single line of code.
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## Your Operating Rules
|
| 20 |
+
|
| 21 |
+
1. **Before building each phase, read the full plan for that phase.** Do not start coding until you understand what the phase produces and what its success criteria are.
|
| 22 |
+
|
| 23 |
+
2. **Ask me questions before starting if any of the following are unclear:**
|
| 24 |
+
- A design decision that affects DB schema or file structure
|
| 25 |
+
- Anything that would be hard to change later (interfaces, Pydantic models, DB tables)
|
| 26 |
+
- Ambiguity in how two components interact
|
| 27 |
+
Do NOT ask about low-level implementation details — choose the best approach yourself.
|
| 28 |
+
|
| 29 |
+
3. **Use context7 MCP to look up documentation** for: openenv-core, SQLAlchemy, NetworkX, Pyvis, astroid, pylint API, FastAPI, Pydantic v2. Do not rely on memory for library APIs — always verify.
|
| 30 |
+
|
| 31 |
+
4. **One phase at a time.** Complete a phase fully before moving to the next. Each phase has explicit success criteria — verify them before declaring a phase done.
|
| 32 |
+
|
| 33 |
+
5. **Never break prior phases.** If a later phase requires changing an earlier interface, explicitly flag it, explain why, and get confirmation before making the change.
|
| 34 |
+
|
| 35 |
+
6. **DB is the source of truth.** All state lives in SQLite. Nothing important lives only in memory. reset() clears only task-run annotations — never re-parses the codebase.
|
| 36 |
+
|
| 37 |
+
7. **Token budget is a hard constraint.** No observation may exceed 2000 tokens. Enforce this in token_budget.py — do not leave it as a soft guideline.
|
| 38 |
+
|
| 39 |
+
8. **Graders must be deterministic.** Easy and medium graders: zero LLM calls, same input always produces same output. Hard grader: temperature=0, document prompt hash. Test this explicitly.
|
| 40 |
+
|
| 41 |
+
9. **inference.py log format is mandatory.** [START], [STEP], [END] format must be exact. Any deviation causes evaluation failure. Treat this as a contract.
|
| 42 |
+
|
| 43 |
+
10. **Write clean, typed Python.** All functions typed. All Pydantic models complete. No `Any` types unless unavoidable with explanation.
|
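Rule 7 above asks for the 2000-token limit to be enforced in code. A rough sketch of what that could look like follows; it approximates tokens by whitespace-separated pieces, which is an assumption, since a real token_budget.py would need the tokenizer of the target model. The function names are hypothetical.

```python
MAX_OBSERVATION_TOKENS = 2000  # hard limit stated in the rules above


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated pieces."""
    return len(text.split())


def enforce_budget(text: str, limit: int = MAX_OBSERVATION_TOKENS) -> str:
    """Return text unchanged if it fits, otherwise keep only the first `limit` pieces.

    Truncated output is re-joined with single spaces, so original line breaks
    are lost in that case.
    """
    pieces = text.split()
    if len(pieces) <= limit:
        return text
    return " ".join(pieces[:limit])
```

Because `enforce_budget` is applied to the final observation string, no code path can return an observation above the limit, which is what makes the constraint hard rather than advisory.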
| 44 |
+
|
| 45 |
+
---
|
| 46 |
+
|
| 47 |
+
## Phase Plan
|
| 48 |
+
|
| 49 |
+
[INSERT FULL PHASE PLAN HERE — paste the contents of the phase plan artifact]
|
| 50 |
+
|
| 51 |
+
---
|
| 52 |
+
|
| 53 |
+
## Sample Project Specification
|
| 54 |
+
|
| 55 |
+
The sample_project/ directory must contain exactly these files with these injected bugs:
|
| 56 |
+
|
| 57 |
+
```
|
| 58 |
+
auth.py — validate_token() can return None (not handled)
|
| 59 |
+
checkout.py — calls auth.validate_token(), doesn't check for None
|
| 60 |
+
cart.py — style violations only (PEP8)
|
| 61 |
+
config.py — missing required key in get_config() (root cause of cascade)
|
| 62 |
+
database.py — SQL query built with string concatenation (SQL injection)
|
| 63 |
+
utils.py — unused imports, dead code
|
| 64 |
+
models.py — clean file (no issues, tests APPROVE path)
|
| 65 |
+
payments.py — depends on checkout.py, inherits None risk
|
| 66 |
+
api.py — depends on auth.py and checkout.py
|
| 67 |
+
main.py — entry point, light glue code
|
| 68 |
+
```
|
| 69 |
+
|
| 70 |
+
Task mapping:
|
| 71 |
+
- easy_task: cart.py (style only)
|
| 72 |
+
- medium_task: checkout.py + auth.py (null reference)
|
| 73 |
+
- hard_task: config.py → auth.py → checkout.py (cascade)
|
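The null-reference bug behind medium_task can be pictured as follows. Only the names `validate_token`, auth.py, and checkout.py come from the list above; the function bodies and the sample token table are invented for illustration.

```python
from typing import Optional

VALID_TOKENS = {"abc123": "alice"}  # invented sample data


def validate_token(token: str) -> Optional[str]:
    """auth.py-style helper: returns a username, or None for unknown tokens."""
    return VALID_TOKENS.get(token)


def checkout_unsafe(token: str) -> str:
    """checkout.py-style bug: the None case is never checked."""
    user = validate_token(token)
    return user.upper()  # raises AttributeError when user is None


def checkout_safe(token: str) -> str:
    """The same call with the missing guard added."""
    user = validate_token(token)
    if user is None:
        raise PermissionError("invalid token")
    return user.upper()
```

A linter looking at checkout.py alone has little to flag here, which is why the dependency edge to auth.py matters for the review.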
| 74 |
+
|
| 75 |
+
---
|
| 76 |
+
|
| 77 |
+
## Tech Stack
|
| 78 |
+
|
| 79 |
+
- Python 3.11
|
| 80 |
+
- SQLite via SQLAlchemy ORM
|
| 81 |
+
- NetworkX + astroid + Python ast
|
| 82 |
+
- pylint + bandit + pyflakes
|
| 83 |
+
- Pyvis for visualization
|
| 84 |
+
- Pydantic v2
|
| 85 |
+
- FastAPI
|
| 86 |
+
- OpenAI client (inference.py + hard grader judge)
|
| 87 |
+
- openenv-core
|
| 88 |
+
- context7 MCP for all library lookups
|
| 89 |
+
|
| 90 |
+
---
|
| 91 |
+
|
| 92 |
+
## Start Instructions
|
| 93 |
+
|
| 94 |
+
Begin with Phase 1. Before writing any code:
|
| 95 |
+
1. Use context7 MCP to look up: openenv-core spec, SQLAlchemy ORM setup, astroid API
|
| 96 |
+
2. Ask me any design questions that affect DB schema or file structure
|
| 97 |
+
3. Confirm the sample_project file list with me if you want to adjust it
|
| 98 |
+
4. Then build Phase 1 completely and verify all success criteria before stopping
|
| 99 |
+
|
| 100 |
+
These are the requirements
|
| 101 |
+
|
| 102 |
+
Registration
|
| 103 |
+
|
| 104 |
+
14th March - 3rd April
|
| 105 |
+
|
| 106 |
+
Declaration
|
| 107 |
+
|
| 108 |
+
Before R1
|
| 109 |
+
|
| 110 |
+
Prepare
|
| 111 |
+
|
| 112 |
+
Now - 25th March
|
| 113 |
+
|
| 114 |
+
Round 1
|
| 115 |
+
|
| 116 |
+
25th March - 8th April
|
| 117 |
+
|
| 118 |
+
Results
|
| 119 |
+
|
| 120 |
+
10th April
|
| 121 |
+
|
| 122 |
+
Finale
|
| 123 |
+
|
| 124 |
+
25th-26th April
|
| 125 |
+
|
| 126 |
+
Welcome Shreyas S Joshi!
|
| 127 |
+
|
| 128 |
+
shreyasjoshi2511@gmail.com
|
| 129 |
+
|
| 130 |
+
Join the Discord Community
|
| 131 |
+
|
| 132 |
+
All announcements, mentor access, and team matching happens here.
|
| 133 |
+
|
| 134 |
+
|
| 135 |
+
Join Discord
|
| 136 |
+
QUICK TOGGLe
|
| 137 |
+
|
| 138 |
+
Team form Submission
|
| 139 |
+
|
| 140 |
+
Preparatory Course
|
| 141 |
+
|
| 142 |
+
Start Assessment
|
| 143 |
+
|
| 144 |
+
FAQs
|
| 145 |
+
|
| 146 |
+
step 1
|
| 147 |
+
|
| 148 |
+
How will you compete?
|
| 149 |
+
|
| 150 |
+
Choose solo or team before you can start the assessment
|
| 151 |
+
|
| 152 |
+
Step 1 Complete
|
| 153 |
+
Team: Shreyas S Joshi's team
|
| 154 |
+
|
| 155 |
+
👤
|
| 156 |
+
Athmabhiram S J
|
| 157 |
+
athmabhiram@gmail.com
|
| 158 |
+
Accepted
|
| 159 |
+
👤
|
| 160 |
+
Shreyas S Joshi
|
| 161 |
+
shreyasjoshi2511@gmail.com
|
| 162 |
+
Team Lead
|
| 163 |
+
🔒
|
| 164 |
+
Team is permanently locked. Changes are not allowed after confirmation.
|
| 165 |
+
|
| 166 |
+
OpenEnv Round 1 Bootcamp
|
| 167 |
+
|
| 184 |
+
OpenEnv Round 1 Bootcamp: Build Your First RL Environment
|
| 185 |
+
|
| 186 |
+
Live walkthrough to submit a strong Round 1 entry
|
| 187 |
+
|
| 188 |
+
timing
|
| 189 |
+
|
| 190 |
+
8:00 PM Onwards
|
| 191 |
+
|
| 192 |
+
Wednesday, 1st April
|
| 193 |
+
|
| 194 |
+
Host
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
Ben Burtenshaw
|
| 198 |
+
|
| 199 |
+
Community Education in AI at Hugging Face
|
| 200 |
+
|
| 201 |
+
|
| 202 |
+
Pulkit Aneja
|
| 203 |
+
|
| 204 |
+
Scaler Instructor
|
| 205 |
+
|
| 206 |
+
Watch Recording
|
| 207 |
+
|
| 208 |
+
PROBLEM STATEMENT
|
| 209 |
+
|
| 210 |
+
Round 1 — Problem Statement
|
| 211 |
+
|
| 212 |
+
The Task
|
| 213 |
+
|
| 214 |
+
Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step() / reset() / state() API.
|
| 215 |
+
|
| 216 |
+
Key Requirements at a Glance
|
| 217 |
+
|
| 218 |
+
Must simulate a real-world task (not games or toys)
|
| 219 |
+
|
| 220 |
+
Implement full OpenEnv spec: typed models, step()/reset()/state(), openenv.yaml
|
| 221 |
+
|
| 222 |
+
Minimum 3 tasks with agent graders (easy → medium → hard, scores/reward 0.0–1.0)
|
| 223 |
+
|
| 224 |
+
Meaningful reward function with partial progress signals
|
| 225 |
+
|
| 226 |
+
Baseline inference script with reproducible scores
|
| 227 |
+
|
| 228 |
+
Deploy to Hugging Face Spaces + working Dockerfile
|
| 229 |
+
|
| 230 |
+
README with environment description, action/observation spaces, setup instructions
|
| 231 |
+
|
| 232 |
+
Functional Requirements
|
| 233 |
+
|
| 234 |
+
Real-world task simulation
|
| 235 |
+
|
| 236 |
+
The environment must simulate a task humans actually do. Not games, not toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.
|
| 237 |
+
|
| 238 |
+
OpenEnv spec compliance
|
| 239 |
+
|
| 240 |
+
Implement the full OpenEnv interface: typed Observation, Action, and Reward Pydantic models. step(action) → returns observation, reward, done, info. reset() → returns initial observation. state() → returns current state. openenv.yaml with metadata. Tested via openenv validate.
|
| 241 |
+
|
| 242 |
+
Minimum 3 tasks with agent graders
|
| 243 |
+
|
| 244 |
+
Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
|
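As a sketch of a programmatic grader that meets these constraints (no LLM call, same input always gives the same score, output in 0.0–1.0), the function below compares flagged line numbers with ground-truth line numbers. Using an F1-style score is an assumption made for illustration, not the scoring rule the hackathon prescribes.

```python
def grade(flagged: set[int], truth: set[int]) -> float:
    """Return an F1-style score in [0.0, 1.0] for a set of flagged line numbers."""
    if not truth and not flagged:
        return 1.0  # clean file correctly left unflagged
    hits = len(flagged & truth)
    if hits == 0:
        return 0.0  # nothing correct, or issues exist but none were found
    precision = hits / len(flagged)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)
```

The score varies with both false positives and misses, so a grader like this cannot collapse into always returning the same value.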
| 245 |
+
|
| 246 |
+
Meaningful reward function
|
| 247 |
+
|
| 248 |
+
Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).
|
| 249 |
+
|
| 250 |
+
Baseline inference script
|
| 251 |
+
|
| 252 |
+
Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (OPENAI_API_KEY). Produces a reproducible baseline score on all 3 tasks.
|
| 253 |
+
|
| 254 |
+
Detailed Requirements
|
| 255 |
+
|
| 256 |
+
Non-Functional Requirements
|
| 257 |
+
|
| 258 |
+
Deploys to a Hugging Face Space
|
| 259 |
+
|
| 260 |
+
Environment must run as a containerized HF Space tagged with openenv.
|
| 261 |
+
|
| 262 |
+
Containerized execution
|
| 263 |
+
|
| 264 |
+
Must include a working Dockerfile. The environment should start cleanly with docker build + docker run.
|
| 265 |
+
|
| 266 |
+
Documentation
|
| 267 |
+
|
| 268 |
+
README must include: environment description and motivation, action and observation space definitions, task descriptions with expected difficulty, setup and usage instructions, baseline scores.
|
| 269 |
+
|
| 270 |
+
Parameter
|
| 271 |
+
|
| 272 |
+
Weight
|
| 273 |
+
|
| 274 |
+
Description
|
| 275 |
+
|
| 276 |
+
Real-world utility
|
| 277 |
+
|
| 278 |
+
30%
|
| 279 |
+
|
| 280 |
+
Does the environment model a genuine task? Would someone actually use this to train or evaluate agents?
|
| 281 |
+
|
| 282 |
+
Task & grader quality
|
| 283 |
+
|
| 284 |
+
25%
|
| 285 |
+
|
| 286 |
+
Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression?
|
| 287 |
+
|
| 288 |
+
Environment design
|
| 289 |
+
|
| 290 |
+
20%
|
| 291 |
+
|
| 292 |
+
Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries.
|
| 293 |
+
|
| 294 |
+
Code quality & spec compliance
|
| 295 |
+
|
| 296 |
+
15%
|
| 297 |
+
|
| 298 |
+
Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works.
|
| 299 |
+
|
| 300 |
+
Creativity & novelty
|
| 301 |
+
|
| 302 |
+
10%
|
| 303 |
+
|
| 304 |
+
Novel problem domain, interesting mechanics, clever reward design, original approach.
|
| 305 |
+
|
| 306 |
+
Scoring Breakdown
|
| 307 |
+
|
| 308 |
+
Real-world utility (30%)
|
| 309 |
+
|
| 310 |
+
• 0–5: Toy/artificial problem with no practical application
|
| 311 |
+
|
| 312 |
+
• 6–15: Valid domain but shallow modeling of the real task
|
| 313 |
+
|
| 314 |
+
• 16–25: Good domain modeling, would be useful for agent evaluation
|
| 315 |
+
|
| 316 |
+
• 26–30: Excellent — fills a real gap, immediate value for the RL/agent community
|
| 317 |
+
|
| 318 |
+
Task & grader quality (25%)
|
| 319 |
+
|
| 320 |
+
• 3+ tasks with difficulty range?
|
| 321 |
+
|
| 322 |
+
• Graders produce scores between 0.0–1.0?
|
| 323 |
+
|
| 324 |
+
• Graders deterministic and reproducible?
|
| 325 |
+
|
| 326 |
+
• Hard task genuinely challenges frontier models?
|
| 327 |
+
|
| 328 |
+
Environment design (20%)
|
| 329 |
+
|
| 330 |
+
• reset() produces clean state?
|
| 331 |
+
|
| 332 |
+
• Action/observation types well-designed and documented?
|
| 333 |
+
|
| 334 |
+
• Reward function provides useful varying signal (not just sparse)?
|
| 335 |
+
|
| 336 |
+
• Episode boundaries sensible?
|
| 337 |
+
|
| 338 |
+
Code quality & spec compliance (15%)
|
| 339 |
+
|
| 340 |
+
• openenv validate passes?
|
| 341 |
+
|
| 342 |
+
• docker build && docker run works?
|
| 343 |
+
|
| 344 |
+
• HF Space deploys and responds?
|
| 345 |
+
|
| 346 |
+
• Baseline script runs and reproduces scores?
|
| 347 |
+
|
| 348 |
+
Creativity & novelty (10%)
|
| 349 |
+
|
| 350 |
+
• Domain we haven’t seen in OpenEnv before?
|
| 351 |
+
|
| 352 |
+
• Reward design has interesting properties?
|
| 353 |
+
|
| 354 |
+
• Clever mechanics that make the environment engaging?
|
| 355 |
+
|
| 356 |
+
Evaluation Criteria
|
| 357 |
+
|
| 358 |
+
Phase 1: Automated Validation
|
| 359 |
+
|
| 360 |
+
Pass/fail gate — HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.
|
| 361 |
+
|
| 362 |
+
Phase 2: Agentic Evaluation
|
| 363 |
+
|
| 364 |
+
Scored — baseline agent re-run, standard Open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.
|
| 365 |
+
|
| 366 |
+
Phase 3: Human Review
|
| 367 |
+
|
| 368 |
+
Top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.
|
| 369 |
+
|
| 370 |
+
Disqualification Criteria
|
| 371 |
+
|
| 372 |
+
Environment does not deploy or respond
|
| 373 |
+
|
| 374 |
+
Plagiarized or trivially modified existing environments
|
| 375 |
+
|
| 376 |
+
Graders that always return the same score
|
| 377 |
+
|
| 378 |
+
No baseline inference script
|
| 379 |
+
|
| 380 |
+
How Judging works
|
| 381 |
+
|
| 382 |
+
Pre-Submission Checklist — all must pass or you're disqualified
|
| 383 |
+
|
| 384 |
+
HF Space deploys
|
| 385 |
+
|
| 386 |
+
Automated ping to the Space URL — must return 200 and respond to reset()
|
| 387 |
+
|
| 388 |
+
OpenEnv spec compliance
|
| 389 |
+
|
| 390 |
+
Validate openenv.yaml, typed models, step()/reset()/state() endpoints
|
| 391 |
+
|
| 392 |
+
Dockerfile builds
|
| 393 |
+
|
| 394 |
+
Automated docker build on the submitted repo
|
| 395 |
+
|
| 396 |
+
Baseline reproduces
|
| 397 |
+
|
| 398 |
+
Run the submitted inference script — must complete without error and produce scores
|
| 399 |
+
|
| 400 |
+
3+ tasks with graders
|
| 401 |
+
|
| 402 |
+
Enumerate tasks, run each grader, verify scores/reward in 0.0–1.0 range
|
| 403 |
+
|
| 404 |
+
Mandatory Additional Instructions
|
| 405 |
+
|
| 406 |
+
Before submitting, ensure the following variables are defined in your environment configuration:
|
| 407 |
+
|
| 408 |
+
API_BASE_URL The API endpoint for the LLM.
|
| 409 |
+
|
| 410 |
+
MODEL_NAME The model identifier to use for inference.
|
| 411 |
+
|
| 412 |
+
HF_TOKEN Your Hugging Face / API key.
|
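A small sketch of reading these three variables in one place and failing early when one is missing. The `LLMConfig` class and `load_llm_config` function are hypothetical helpers; only the variable names come from the instructions above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    api_base_url: str
    model_name: str
    hf_token: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read the three required variables, raising if any is missing or empty."""
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return LLMConfig(env["API_BASE_URL"], env["MODEL_NAME"], env["HF_TOKEN"])
```

The resulting values would presumably be handed to the OpenAI client as its `base_url` and `api_key`, with `model_name` passed on each request.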
| 413 |
+
|
| 414 |
+
The inference script must be named `inference.py` and placed in the root directory of the project
|
| 415 |
+
|
| 416 |
+
Participants must use OpenAI Client for all LLM calls using above variables
|
| 417 |
+
|
| 418 |
+
Participants must emit structured stdout logs strictly following the [START], [STEP], and [END] format defined in the sample inference.py provided below. Any deviation in field names, ordering, or formatting will result in incorrect evaluation scoring. Refer to the Sample Inference Script for the complete format specification and examples.
|
| 419 |
+
|
| 420 |
+
Infra Restrictions
|
| 421 |
+
|
| 422 |
+
Runtime of inference script should be less than 20min
|
| 423 |
+
|
| 424 |
+
Make sure your env and inference can run on a machine with vcpu=2, memory=8gb
|
| 425 |
+
|
| 426 |
+
Validator
|
| 427 |
+
|
| 428 |
+
Run the pre-submission validation script before submitting
|
| 429 |
+
|
| 430 |
+
NEW
|
| 431 |
+
Sample Inference Script
|
| 432 |
+
|
| 433 |
+
NEW
|
| 434 |
+
Pre Validation Script
|
| 435 |
+
|
| 436 |
+
Submission window opens on 28th March
|
| 437 |
+
|
| 438 |
+
Deadline: 8 Apr 11:59 PM
|
| 439 |
+
|
| 440 |
+
|
| 441 |
+
Submit your Assessment
|
| 442 |
+
→
|
| 443 |
+
Study material
|
| 444 |
+
|
| 445 |
+
Preparatory Course
|
| 446 |
+
|
| 447 |
+
4 modules · ~3.5 hours
|
| 448 |
+
|
| 449 |
+
Each module: read the README first, then open the notebook in Colab. No local setup needed.
|
| 450 |
+
|
| 451 |
+
Module 1: Why OpenEnv?
|
| 452 |
+
|
| 453 |
+
ESSENTIAL FOR ROUND 1
|
| 454 |
+
|
| 455 |
+
45 min
|
| 456 |
+
|
| 457 |
+
Module 2: Using Existing Environments
|
| 458 |
+
|
| 459 |
+
ESSENTIAL FOR ROUND 1
|
| 460 |
+
|
| 461 |
+
50 min
|
| 462 |
+
|
| 463 |
+
Module 3: Deploying Environments
|
| 464 |
+
|
| 465 |
+
ESSENTIAL FOR ROUND 1
|
| 466 |
+
|
| 467 |
+
45 min
|
| 468 |
+
|
| 469 |
+
Module 4: Building Your Own Environment
|
| 470 |
+
|
| 471 |
+
MOST IMPORTANT FOR ROUND 1
|
| 472 |
+
|
| 473 |
+
60 min
|
| 474 |
+
|
| 475 |
+
View full course repository
|
| 476 |
+
|
| 477 |
+
GUIDE
|
| 478 |
+
|
| 479 |
+
Round 1 Guide
|
| 480 |
+
|
| 481 |
+
What to Expect
|
| 482 |
+
|
| 483 |
+
When Round 1 opens, you'll choose 1 of 4–5 problem statements and build an OpenEnv environment around it.
|
| 484 |
+
|
| 485 |
+
Example of what a problem statement looks like
|
| 486 |
+
|
| 487 |
+
"Build a mini-game RL environment with clearly defined tasks, automated graders, and reward logic using the OpenEnv framework."
|
| 488 |
+
|
| 489 |
+
→ Create a mini-game an AI agent can play
|
| 490 |
+
|
| 491 |
+
→ Define tasks with increasing difficulty
|
| 492 |
+
|
| 493 |
+
→ Write graders that verify task completion
|
| 494 |
+
|
| 495 |
+
→ Define reward logic for scoring
|
| 496 |
+
|
| 497 |
+
→ Package using OpenEnv for automated evaluation
|
| 498 |
+
|
| 499 |
+
Evaluation Criteria
|
| 500 |
+
|
| 501 |
+
Runtime correctness
|
| 502 |
+
|
| 503 |
+
Runs without errors
|
| 504 |
+
|
| 505 |
+
Interface compliance
|
| 506 |
+
|
| 507 |
+
Follows OpenEnv standard
|
| 508 |
+
|
| 509 |
+
Task design
|
| 510 |
+
|
| 511 |
+
Clear, realistic, testable
|
| 512 |
+
|
| 513 |
+
Grading logic
|
| 514 |
+
|
| 515 |
+
Reward system makes sense
|
| 516 |
+
|
| 517 |
+
20,000 → 3,000 teams advance
|
| 518 |
+
|
| 519 |
+
Prerequisites
|
| 520 |
+
|
| 521 |
+
Install before April 1st.
|
| 522 |
+
|
| 523 |
+
Required
|
| 524 |
+
|
| 525 |
+
Python 3.10+
|
| 526 |
+
|
| 527 |
+
Install 3.10, 3.11, or 3.12.
|
| 528 |
+
|
| 529 |
+
$
|
| 530 |
+
python --version
|
| 531 |
+
|
| 532 |
+
Git + GitHub account
|
| 533 |
+
|
| 534 |
+
Push your submission to GitHub or HF.
|
| 535 |
+
|
| 536 |
+
$
|
| 537 |
+
git --version
|
| 538 |
+
|
| 539 |
+
Hugging Face CLI
|
| 540 |
+
|
| 541 |
+
Deploy to HF Spaces.
|
| 542 |
+
|
| 543 |
+
$
|
| 544 |
+
pip install huggingface_hub
|
| 545 |
+
|
| 546 |
+
$
|
| 547 |
+
huggingface-cli login
|
| 548 |
+
|
| 549 |
+
OpenEnv
|
| 550 |
+
|
| 551 |
+
The framework.
|
| 552 |
+
|
| 553 |
+
$
|
| 554 |
+
pip install openenv-core
|
| 555 |
+
|
| 556 |
+
Google Colab
|
| 557 |
+
|
| 558 |
+
Prep course runs in Colab. Free tier works.
|
| 559 |
+
|
| 567 |
+
→ colab.research.google.com
|
| 568 |
+
|
| 569 |
+
Docker
|
| 570 |
+
|
| 571 |
+
Isolated container testing.
|
| 572 |
+
|
| 573 |
+
docker --version
|
| 574 |
+
|
| 575 |
+
Recommended
|
| 576 |
+
|
| 577 |
+
VS Code
|
| 578 |
+
|
| 579 |
+
Best Python + Docker support
|
| 580 |
+
|
| 581 |
+
How to Submit
|
| 582 |
+
|
| 583 |
+
When Round 1 starts on 1 April:
|
| 584 |
+
|
| 585 |
+
Step 1
|
| 586 |
+
|
| 587 |
+
Application Form
|
| 588 |
+
Choose 1 of the 4–5 problem statements revealed on the platform.
|
| 589 |
+
|
| 590 |
+
Step 2
|
| 591 |
+
|
| 592 |
+
Scaffold
|
| 593 |
+
$
|
| 594 |
+
openenv init my_env
|
| 595 |
+
|
| 596 |
+
Generate project structure.
|
| 597 |
+
|
| 598 |
+
Step 3
|
| 599 |
+
|
| 600 |
+
Build
|
| 601 |
+
Define your environment in the generated files.
|
| 602 |
+
|
| 603 |
+
Step 4
|
| 604 |
+
|
| 605 |
+
Test locally
|
| 606 |
+
$
|
| 607 |
+
uv run server
|
| 608 |
+
|
| 609 |
+
Step 5
|
| 610 |
+
|
| 611 |
+
Deploy
|
| 612 |
+
$
|
| 613 |
+
openenv push --repo-id your-username/my-env
|
| 614 |
+
|
| 615 |
+
Step 6
|
| 616 |
+
|
| 617 |
+
Submit
|
| 618 |
+
Paste your HF Spaces URL here before the deadline.
|
| 619 |
+
|
| 620 |
+
Deadline: 8 April 2026, 11:59 PM IST
|
| 621 |
+
|
| 622 |
+
Step 2
|
| 623 |
+
|
| 624 |
+
Submit your Assessment
|
| 625 |
+
|
| 626 |
+
Complete Step 1 first
|
| 627 |
+
|
| 628 |
+
Problem Statement is live. Build and submit.
|
| 629 |
+
|
| 630 |
+
Round 1 begins
|
| 631 |
+
|
| 639 |
+
NOTE: Only team leaders can make the final submission.
|
| 640 |
+
|
| 641 |
+
FAQs
|
| 642 |
+
|
| 643 |
+
Frequently Asked Questions
|
| 644 |
+
|
| 645 |
+
|
| 646 |
+
|
| 647 |
+
|
| 648 |
+
|
| 649 |
+
|
| 650 |
+
|
| 651 |
+
|
| 652 |
+
|
| 653 |
+
|
| 654 |
+
|
| 655 |
+
|
| 656 |
+
|
| 657 |
+
Need help? Reach out to us
|
| 658 |
+
|
| 659 |
+
help_openenvhackathon@scaler.com
|
| 660 |
+
|
| 661 |
+
Contact Support
|
| 662 |
+
|
| 663 |
+
submission Deadline: 8th April 11:59 PM
|
| 664 |
+
|
| 665 |
+
|
| 666 |
+
Submit your Assessment
|
| 667 |
+
→
|
| 668 |
+
How to Submit?
|
| 669 |
+
|
| 670 |
+
|
| 671 |
+
|
| 672 |
+
Great question. Here's exactly what the agent does in a **Code Review RL Environment:**
|
| 673 |
+
|
| 674 |
+
---
|
| 675 |
+
|
| 676 |
+
## 🤖 The Agent's Job
|
| 677 |
+
|
| 678 |
+
The agent acts as a **junior code reviewer**. Each episode, it's shown a code snippet and must take actions to review it — just like a human would on GitHub.
|
| 679 |
+
|
| 680 |
+
---
|
| 681 |
+
|
| 682 |
+
## 🎮 The Action Space
|
| 683 |
+
|
| 684 |
+
The agent can take these actions:
|
| 685 |
+
|
| 686 |
+
```
|
| 687 |
+
APPROVE → Code looks good, no issues
|
| 688 |
+
FLAG_STYLE → Flag a style/formatting issue
|
| 689 |
+
FLAG_BUG → Flag a logic bug
|
| 690 |
+
FLAG_SECURITY → Flag a security vulnerability
|
| 691 |
+
ADD_COMMENT(txt) → Leave a review comment explaining the issue
|
| 692 |
+
REQUEST_CHANGES → Block the PR from merging
|
| 693 |
+
```
|
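The action list above maps naturally onto an enum plus a tiny parser for the `ADD_COMMENT(txt)` form. This is a sketch only: the names `ActionType`, `TERMINAL_ACTIONS`, and `parse_action` are hypothetical, and treating APPROVE as episode-ending alongside REQUEST_CHANGES is an assumption.

```python
from enum import Enum


class ActionType(Enum):
    APPROVE = "APPROVE"
    FLAG_STYLE = "FLAG_STYLE"
    FLAG_BUG = "FLAG_BUG"
    FLAG_SECURITY = "FLAG_SECURITY"
    ADD_COMMENT = "ADD_COMMENT"
    REQUEST_CHANGES = "REQUEST_CHANGES"


# Assumed: both verdict actions end the episode.
TERMINAL_ACTIONS = {ActionType.APPROVE, ActionType.REQUEST_CHANGES}


def parse_action(raw: str) -> tuple[ActionType, str]:
    """Split strings such as 'ADD_COMMENT(text)' into (action, argument)."""
    name, _, rest = raw.partition("(")
    action = ActionType(name.strip())  # raises ValueError for unknown names
    argument = rest[:-1] if rest.endswith(")") else rest
    return action, argument
```

Rejecting unknown action names with `ValueError` gives the environment a clear place to penalize malformed agent output.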
| 694 |
+
|
| 695 |
+
---
|
| 696 |
+
|
| 697 |
+
## 🔁 One Episode — Step by Step
|
| 698 |
+
|
| 699 |
+
```
|
| 700 |
+
reset()
|
| 701 |
+
→ Agent receives a code snippet (the PR diff)
|
| 702 |
+
|
| 703 |
+
step(FLAG_BUG)
|
| 704 |
+
→ Grader checks: was there actually a bug?
|
| 705 |
+
→ Reward: +0.5 if correct, -0.2 if false positive
|
| 706 |
+
|
| 707 |
+
step(ADD_COMMENT("This causes a null pointer on line 12"))
|
| 708 |
+
→ Grader checks comment relevance
|
| 709 |
+
→ Reward: +0.3 if accurate, 0.0 if vague
|
| 710 |
+
|
| 711 |
+
step(REQUEST_CHANGES)
|
| 712 |
+
→ Episode ends
|
| 713 |
+
→ Final reward tallied
|
| 714 |
+
```
|
| 715 |
+
|
| 716 |
+
---
|
| 717 |
+
|
| 718 |
+
## 📊 The 3 Tasks (Easy → Hard)
|
| 719 |
+
|
| 720 |
+
| Task | What the agent sees | What it must do | Grader |
|
| 721 |
+
|---|---|---|---|
|
| 722 |
+
| **Easy** | Code with a PEP8 style issue | Flag the style issue | Deterministic — AST/linter check |
|
| 723 |
+
| **Medium** | Code with a subtle logic bug | Flag the bug + comment location | Check flag + line number accuracy |
|
| 724 |
+
| **Hard** | Code with a hidden security vuln (SQL injection, path traversal) | Flag security issue + explain risk | Check flag + comment quality via LLM grader |
|
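When a grader relies on an LLM call, as the hard task does, one way to keep it auditable is to pin the temperature at 0 and record a fingerprint of the exact prompt text alongside each score. The sketch below shows only the fingerprint part; the prompt wording and the names `JUDGE_PROMPT_TEMPLATE` and `prompt_hash` are hypothetical.

```python
import hashlib

JUDGE_TEMPERATURE = 0.0  # pinned to reduce run-to-run variation in the judge

# Hypothetical prompt text; the real judge prompt is not shown in this document.
JUDGE_PROMPT_TEMPLATE = (
    "You are grading a code review comment.\n"
    "Comment: {comment}\n"
    "Known issue: {issue}\n"
    "Reply with a score between 0.0 and 1.0."
)


def prompt_hash(template: str) -> str:
    """Return a short SHA-256 fingerprint of the prompt template."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]
```

Storing the fingerprint next to every judged score makes it obvious when two scores were produced by different prompt versions.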
| 725 |
+
|
| 726 |
+
---
|
| 727 |
+
|
| 728 |
+
## 🏅 Reward Design
|
| 729 |
+
|
| 730 |
+
| Action | Reward |
|
| 731 |
+
|---|---|
|
| 732 |
+
| Correct flag on real issue | +0.5 |
|
| 733 |
+
| Accurate comment/explanation | +0.3 |
|
| 734 |
+
| False positive flag | -0.2 |
|
| 735 |
+
| Missed critical issue | -0.4 |
|
| 736 |
+
| APPROVE on buggy code | -1.0 |
|
| 737 |
+
| Clean episode completion | +0.2 bonus |
|
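The reward table can be written down as a lookup plus a running sum, sketched below. The event keys are hypothetical labels for the table rows, and clamping the total into 0.0–1.0 is an assumed normalisation to fit the required score range, not something the table itself specifies.

```python
REWARDS = {
    "correct_flag": 0.5,
    "accurate_comment": 0.3,
    "false_positive": -0.2,
    "missed_critical": -0.4,
    "approve_buggy": -1.0,
    "clean_completion": 0.2,
}


def episode_return(events: list[str]) -> float:
    """Sum the per-event rewards collected during one episode."""
    return sum(REWARDS[event] for event in events)


def clamp_score(total: float) -> float:
    """Map a raw return into [0.0, 1.0] (an assumed normalisation)."""
    return max(0.0, min(1.0, total))
```

With these values, a correct flag, an accurate comment, and a clean completion add up to the maximum score, while approving buggy code bottoms out at zero after clamping.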
| 738 |
+
|
| 739 |
+
This gives **partial progress signals** across the whole trajectory, not just end-of-episode — which directly satisfies one of the judging criteria.
|
| 740 |
+
|
| 741 |
+
---
|
| 742 |
+
|
| 743 |
+
## 🔑 Why This Works Well for You
|
| 744 |
+
|
| 745 |
+
- Every action is **deterministic to grade** (you're a software engineer, so you can write the graders confidently)
|
| 746 |
+
- Code snippets can be **synthetically generated** — no dataset needed
|
| 747 |
+
- The hard task grader can use an **LLM call internally** to score comment quality, which is also where GraphRAG could plug in
|
| 748 |
+
|
| 749 |
+
Want me to now create the **full file structure and implementation plan** you can start coding today?
|
| 750 |
+
|
| 751 |
+
Okay, so basically, what if I take the code, convert it into a graph, then feed it to the model and ask it to review? That review will also be stored in the graph, and in the end the user will have a dependency graph and an in-depth review of each module, using the LLM to read the logical reasoning behind the code, not just the pylint errors being caught.
|
| 752 |
+
|
| 753 |
+
That's a genuinely clever idea! Let me break down what you're describing and stress-test it against the requirements.

---

## 🧠 What You're Describing

```
Codebase
    ↓
GraphRAG parses it → Dependency Graph (nodes = modules, edges = imports/calls)
    ↓
Agent receives a module + its graph context as observation
    ↓
Agent reviews that module (flags bugs, security issues, style)
    ↓
Review is stored back into the graph as an annotation
    ↓
Final output: Annotated dependency graph with reviews on every module
```

This is actually a **multi-episode environment** where each episode = one module, and the final deliverable is a fully reviewed codebase graph.

---

## ✅ Why This Is Strong

| Criteria | How it scores |
|---|---|
| Real-world utility (30%) | Extremely high: this is literally what code review tools like CodeClimate do, but agent-driven |
| Creativity (10%) | Very novel: graph-aware code review hasn't been done in OpenEnv |
| Environment design (20%) | Graph context as observation is well-structured and meaningful |
| Task & grader quality (25%) | Natural difficulty progression across modules |

---

## 🔁 How the RL Loop Works

```
reset()
  → Load codebase graph
  → Pick starting module (e.g. lowest dependency = easiest)
  → Observation = {code, graph_neighbors, import_chain}

step(action)
  → Agent flags issues, adds comments
  → Grader scores the review
  → Reward returned
  → Review annotation stored on graph node
  → Next module selected (traversal order)

state()
  → Returns current graph with all annotations so far

done = True when all modules reviewed
```

---

## ⚠️ One Risk to Manage

The **grader for review quality** needs to be deterministic to pass the auto-validation gate. Two ways to handle this:

- **Easy/Medium tasks**: use AST + linter (pylint, bandit) as ground truth. The grader checks whether the agent caught what the linter already knows.
- **Hard tasks**: LLM-as-judge internally, but with a fixed prompt and a fixed seed so it's as reproducible as possible.

---

## 🛠️ Tech Stack

| Component | Tool |
|---|---|
| Graph construction | NetworkX + AST parser |
| Graph storage | NetworkX (in-memory, no DB needed) |
| RAG layer | LightRAG or simple vector search over node embeddings |
| Grader (easy/medium) | pylint, bandit, pyflakes |
| Grader (hard) | LLM-as-judge with fixed prompt |
| Observation model | Pydantic (code + graph context) |

---

## 💡 My Honest Take

This is your **best idea yet**. It's novel, it's real-world, and the graph-as-state is a genuinely interesting RL mechanic that judges from Meta/HF will notice. The risk is build time: this is more complex than a flat code review env.

**Given you have ~1.5 days, want me to scope a minimal viable version of this that hits all the judging criteria without overbuilding?**

Okay, so a particular module might have, let's say, a dependency issue, but that might be caused by another issue. For example, the module made a particular decision because of a requirement in another module, and that decision raised another issue, but the model might recommend removing a decision that is crucial for the project. Also, a module might be needed in a lot of places. How would you suggest traversing the graph? You need to compare all the dependencies and issues across all the modules, so it's not just BFS. And is Gemma 4 E4B good enough for the demo? Search online.

Great questions. You're thinking about this at a deep level. Let me tackle each part.

Good news on Gemma 4. Now let me address all three of your questions properly.

---

## 1. 🧠 The Cascading Dependency Problem

You're describing a real and important issue: **a decision in module A causes a problem in module B, but the agent reviewing B in isolation flags it as wrong without knowing why.**

The solution is **context-aware observations**. When the agent reviews a module, its observation must include:

```
Observation = {
  current_module: code + issues found,
  upstream_deps: why they made decisions that affect this module,
  downstream_deps: what modules depend on THIS module,
  existing_reviews: annotations already written on neighbors,
  constraint_flags: ["decision X is forced by module Y's requirement"]
}
```

So the agent never reviews a module in isolation. It always sees the **"why" from upstream** before flagging anything. If module B has a weird pattern because module A forced it, the agent sees that constraint, and the reward function should favor `ADD_COMMENT("dependency constraint from A")` over `FLAG_BUG`.

This is actually what makes your environment **genuinely novel**: most code review tools don't model this.

---

## 2. 🗺️ Graph Traversal Strategy

You're right that plain BFS is the wrong tool here. Here's what I'd suggest: a **weighted priority traversal**.

```
Score each node by:
  - Dependency centrality (how many modules import this one)
  - Issue density (how many raw linter flags exist)
  - Review dependency (can't review B properly until A is reviewed)

Traversal order:
  Step 1 → Leaf modules (no dependencies) — safe to review in isolation
  Step 2 → High centrality nodes — review these with full leaf context
  Step 3 → Complex nodes — reviewed last with full graph context available
```

Think of it like a **topological sort weighted by centrality**. NetworkX has both `topological_sort()` and `betweenness_centrality()` built in, as well as `in_degree_centrality()`, which matches the "how many modules import this one" score most directly. You can combine them.

The key insight: **a module's review is only valid once all its dependencies have been reviewed first.** That's your episode ordering logic.
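
The ordering rule above can be sketched without any graph library. The example below is a minimal illustration, assuming the graph is a plain dict from module name to the set of modules it imports; it uses Kahn's algorithm with a priority queue from the standard library rather than NetworkX, and the `review_order` name and the sample module names are hypothetical.

```python
import heapq


def review_order(deps):
    """Order modules so that each one comes after all of its dependencies.

    deps maps a module to the set of modules it imports. Among modules that
    become ready at the same time, the one imported by the most other
    modules is reviewed first (ties are broken alphabetically).
    """
    modules = set(deps) | {d for ds in deps.values() for d in ds}
    dependents = {m: set() for m in modules}
    for mod, ds in deps.items():
        for dep in ds:
            dependents[dep].add(mod)

    # Number of not-yet-reviewed dependencies per module.
    remaining = {m: len(deps.get(m, ())) for m in modules}
    ready = [(-len(dependents[m]), m) for m in modules if remaining[m] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, mod = heapq.heappop(ready)
        order.append(mod)
        for parent in dependents[mod]:
            remaining[parent] -= 1
            if remaining[parent] == 0:
                heapq.heappush(ready, (-len(dependents[parent]), parent))

    if len(order) != len(modules):
        raise ValueError("import cycle detected; break it before ordering")
    return order


deps = {
    "checkout.py": {"auth.py", "cart.py"},
    "cart.py": {"auth.py", "utils.py"},
    "auth.py": {"utils.py"},
    "utils.py": set(),
}
print(review_order(deps))
```

Leaf modules surface first because they start with zero unreviewed dependencies. Issue density could be added as a second key in the priority tuple; it is left out here to keep the sketch short.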

---

## 3. 🤖 Is Gemma 4 E4B Good Enough?

**Short answer: Yes for a demo, but use the 26B MoE if you can.**

Here's what the search shows:

- Gemma 4 E4B is designed for edge devices, prioritizing multimodal capability, low-latency processing, and on-device utility. It's capable but optimized for efficiency over raw reasoning power.

- The 26B MoE model runs almost as fast as a 4B model since it only activates 4B parameters at a time, so if your HF Space can handle it, the 26B MoE gives you much stronger reasoning at nearly the same inference speed.

- E4B has a 128K context window, which is actually great for your use case since you'll be feeding code + graph context into each prompt.

- Gemma 4 shows significant improvements in coding benchmarks and native function-calling support for agentic workflows, which directly maps to what your agent needs to do.

**My recommendation:**

| Scenario | Model |
|---|---|
| Local testing / dev | Gemma 4 E4B (fits in 8GB RAM) |
| HF Space demo (inference.py) | Gemma 4 26B MoE (stronger reviews, still fast) |
| Fallback if quota issues | Gemma 4 E4B is acceptable for passing evaluation |

The competition's inference script runs on 2 vCPU / 8GB RAM: E4B fits, the 26B MoE likely won't. So **E4B is your safest bet for the submission**, but use the 26B MoE via API if you're calling it remotely.

---

Want me to now draft the **actual file structure and core implementation skeleton** so you can start coding today?

Great questions, let me search for both.

Great, I have a thorough picture now. Here's the full analysis:

---

## 🔍 Does Your Idea Already Exist?

### What CodeRabbit Does

CodeRabbit uses a "Codegraph" to understand complex dependencies across files and uncover the impact of changes, pulling in 40+ linters and security scanners. Sounds similar, right? But here's the critical difference:

CodeRabbit reviews are tied to diff visibility only, so it can't reason about system-wide architecture, cross-repo dependencies, or historical design decisions. It can't validate whether a microservice change breaks contracts with downstream services.

That's exactly the gap your environment targets. CodeRabbit handles first-pass mechanical review: style violations, security issues, common bugs. Human reviewers still need to evaluate architecture, business logic, and design decisions.

### The Core Differentiation of Your Idea

| Feature | CodeRabbit | Your RL Environment |
|---|---|---|
| Graph of codebase | ✅ Lightweight map | ✅ Full dependency graph |
| Context-aware review | Partial (diff only) | ✅ Full upstream/downstream context |
| Cascading dependency reasoning | ❌ | ✅ Core mechanic |
| Reviews stored back to graph | ❌ | ✅ Annotated output |
| RL agent learns from rewards | ❌ Static tool | ✅ Trainable agent |
| Final deliverable to user | PR comments | Annotated dependency map |

**Your environment fills a documented gap.** This is strong for the real-world utility score (30%).

---

## 🏗️ Architectural Questions You Still Need to Answer

### 1. Graph Schema Design
What does a node actually contain?
```
Node = {
  module_id: str,
  code: str,
  ast_summary: dict,        # function signatures, classes
  linter_flags: list,       # pre-computed ground truth for graders
  dependency_reason: str,   # WHY it depends on neighbors
  review_annotation: dict   # written by agent, starts null
}
```
You need to decide this upfront because it drives everything else.

---

### 2. Observation Construction Strategy
How much graph context do you inject per step? Too little = agent reviews blindly. Too much = exceeds context window.

**Recommended approach: tiered context**
```
Easy task   → current module only
Medium task → current module + direct neighbors
Hard task   → current module + 2-hop neighborhood + existing reviews
```
This naturally creates difficulty progression and respects the 128K context window of Gemma 4 E4B.
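
One way to realize the tiers is a bounded breadth-first search around the current module. The sketch below is only illustrative: it assumes an undirected adjacency dict, and the names `HOPS`, `neighborhood`, and `build_context` are hypothetical.

```python
from collections import deque

# Hop radius per difficulty tier, taken from the tiers listed above.
HOPS = {"easy": 0, "medium": 1, "hard": 2}


def neighborhood(graph, start, max_hops):
    """Return the modules within max_hops of start (start itself excluded)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the allowed radius
        for nxt in graph.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return sorted(m for m in distance if m != start)


def build_context(graph, reviews, module, difficulty):
    """Assemble the observation context for one module at a given tier."""
    neighbors = neighborhood(graph, module, HOPS[difficulty])
    context = {"current_module": module, "neighbors": neighbors}
    if difficulty == "hard":
        # Only the hard tier also receives reviews already written on neighbors.
        context["existing_reviews"] = {m: reviews[m] for m in neighbors if m in reviews}
    return context


graph = {
    "checkout.py": ["cart.py"],
    "cart.py": ["checkout.py", "auth.py"],
    "auth.py": ["cart.py", "utils.py"],
    "utils.py": ["auth.py"],
}
reviews = {"auth.py": "flagged null return risk"}
print(build_context(graph, reviews, "checkout.py", "hard"))
```

In a real observation the neighbor entries would carry summaries rather than bare names; names are used here to keep the example small.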

---

### 3. Episode Boundary Design
When does an episode end? Two valid designs:

- **Per-module episodes**: one episode = one module reviewed. Fast, easy to grade.
- **Full-codebase episodes**: one episode = entire codebase reviewed in traversal order. More realistic, harder to grade.

For the submission I'd recommend **per-module with a meta-episode wrapper**: each module is an episode, but the graph state persists across them so reviews accumulate.

---

### 4. Reward Attribution Problem
If the agent correctly identifies a cascading issue (bug in B caused by A), who gets the reward: the step that flagged it, or the final episode? You need a clear policy:

```
Reward policy:
  +0.6 if agent flags issue AND correctly attributes it to upstream module
  +0.3 if agent flags issue but misattributes cause
  -0.3 if agent flags false positive ignoring upstream context
```
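
The policy above maps directly onto a small function. This is a sketch under the assumption that the grader already knows, from injected ground truth, whether the flagged issue is real and which module causes it; the function and parameter names are hypothetical.

```python
def attribution_reward(flag_is_real, attributed_to, true_cause):
    """Reward for a single flag, following the three-case policy above."""
    if not flag_is_real:
        return -0.3  # false positive that ignored upstream context
    if attributed_to == true_cause:
        return 0.6   # real issue, correct upstream attribution
    return 0.3       # real issue, wrong cause


# The agent flags a null-handling bug in checkout.py and blames auth.py.
print(attribution_reward(True, "auth.py", "auth.py"))
```

Returning the reward at the step that raised the flag, rather than at the end of the episode, keeps credit assignment simple.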

---

### 5. Graph Generation Strategy
Where does the test codebase come from? Three options:

| Option | Pro | Con |
|---|---|---|
| Parse a real OSS repo (e.g. Flask, FastAPI) | Realistic | Complex to control |
| Synthetically generate modules | Fully deterministic graders | Less realistic |
| Hybrid: real structure, injected bugs | Best of both | Medium effort |

**Hybrid is the right call**: use a real repo's structure but inject known bugs so graders can be 100% deterministic.

---

### 6. Grader Architecture (Critical for Pass/Fail Gate)

```
Easy grader   → pylint/bandit pre-run, check if agent flagged same issues         ✅ fully deterministic
Medium grader → AST diff between agent's suggested fix and ground truth           ✅ deterministic
Hard grader   → LLM-as-judge with fixed seed + rubric for cascading reasoning     ⚠️ quasi-deterministic
```

The hard grader is your biggest risk. Fix the judge prompt, set temperature=0, and document both clearly in the README.
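
For the easy tier, the "did the agent flag the same issues" check can be a pure set comparison. The sketch below scores the overlap with F1, which is an assumption of this example rather than something the plan prescribes; flags are assumed to be `(line_number, code)` tuples collected from a linter pre-run, and `grade_easy` is a hypothetical name.

```python
def grade_easy(agent_flags, linter_flags):
    """Score in [0, 1]: F1 overlap between agent flags and linter ground truth.

    Each flag is a (line_number, code) tuple, for example (12, "E501").
    """
    agent, truth = set(agent_flags), set(linter_flags)
    if not agent and not truth:
        return 1.0  # clean module, correctly left unflagged
    hits = len(agent & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(agent)
    recall = hits / len(truth)
    return round(2 * precision * recall / (precision + recall), 3)


truth = [(12, "E501"), (30, "B608")]
agent = [(12, "E501"), (44, "W291")]
print(grade_easy(agent, truth))
```

Since the inputs are fixed sets, the same review always yields the same score, which is what the validation gate needs from this tier.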

---

### 7. State Persistence Across Episodes
The graph with accumulated annotations IS your `state()`. You need to define:
- What gets serialized (NetworkX graph → JSON)
- How `reset()` clears annotations but keeps graph structure
- Whether partial reviews survive between episodes (they should)
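
A minimal sketch of the first two points, assuming the graph has already been flattened to plain dicts and lists (with NetworkX this would typically go through `node_link_data`). The layout with `nodes` and `edges` keys and both function names are hypothetical.

```python
import copy
import json


def serialize(graph):
    """Dump the graph (plain dicts and lists) to a deterministic JSON string."""
    return json.dumps(graph, sort_keys=True)


def reset_annotations(graph):
    """Return a copy with the same nodes and edges but no review annotations."""
    fresh = copy.deepcopy(graph)
    for node in fresh["nodes"].values():
        node["review_annotation"] = None
    return fresh


graph = {
    "nodes": {
        "auth.py": {"review_annotation": {"verdict": "changes_requested"}},
        "cart.py": {"review_annotation": None},
    },
    "edges": [["cart.py", "auth.py"]],
}

restored = json.loads(serialize(graph))   # round trip through JSON
cleared = reset_annotations(graph)        # what a full reset would start from
print(cleared["nodes"]["auth.py"])
```

Working on a deep copy means a full reset never touches the annotated graph that `state()` is still serving, so partial reviews survive until a reset is explicitly requested.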

---

## ✅ Summary: What Makes You Novel vs CodeRabbit

The key architectural insight is that an AI seeing only one file, without context on its neighbors, produces false positives: it worries about things that are actually handled upstream. As far as this search shows, your environment would be the first RL training ground that explicitly teaches agents to reason about this. That's your pitch to the judges.

---

Full upstream and full downstream context will be too heavy for the model in complex projects with thousands of lines of code, right?

Exactly right. You've identified the most critical architectural bottleneck. Here's how to solve it:

---

## 🧠 The Core Problem

A real project could have:
- Module A → 500 lines
- 10 direct dependencies → each 300 lines
- 2-hop neighbors → another 20 modules

That's **thousands of lines, and easily tens of thousands of tokens, just for context**, before the agent even starts reviewing. Gemma 4 E4B's 128K window sounds big but fills fast.

---

## ✅ Solution: Compressed Graph Context (Not Raw Code)

**Never feed raw code as context. Feed summaries.**

```
Instead of:
  "Here is module_auth.py [500 lines of code]..."

Feed:
  "module_auth.py → exposes: validate_token(str) → User | None
   Known issues: may return None on expired token
   Depended on by: checkout.py, cart.py
   Review status: flagged null return risk"
```

This is your **Graph Summary Layer**, sitting between the raw graph and the agent's observation.

---

## 🏗️ The 3-Layer Observation Architecture

```
Layer 1: AST Compression
  Raw code → extract only signatures, return types,
             decorators, class names
  Tool: Python's ast module
  Output: ~50 tokens per module (vs 500+ raw)

Layer 2: Graph Summary Node
  Per node store:
  {
    exports: ["validate_token(str) → User | None"],
    known_issues: ["may return None"],
    constraint_flags: ["forced by requirement in auth_service.py"],
    review_status: "pending | reviewed | flagged",
    review_summary: "one line of what was found"   ← written by agent
  }

Layer 3: Agent Observation (what model actually sees)
  {
    current_module: full code,           ← only this is full
    direct_deps: [graph summaries],      ← compressed
    dependents: [graph summaries],       ← compressed
    relevant_reviews: [one-liners]       ← already written
  }
```

So the agent always sees **one module in full**, and everything else as compressed summaries.
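
Layer 1 can be sketched with the standard `ast` module alone. The example below is a simplified illustration: it only looks at top-level functions and classes, ignores `*args`, keyword-only arguments, and defaults, and relies on `ast.unparse`, which needs Python 3.9 or later. The `compress_module` name and the sample source are hypothetical.

```python
import ast


def compress_module(source):
    """Return one-line signatures for top-level functions and classes."""
    summary = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Positional parameters only, with their annotations when present.
            args = ", ".join(
                a.arg + (": " + ast.unparse(a.annotation) if a.annotation else "")
                for a in node.args.args
            )
            returns = " -> " + ast.unparse(node.returns) if node.returns else ""
            decorators = "".join("@" + ast.unparse(d) + " " for d in node.decorator_list)
            summary.append(f"{decorators}def {node.name}({args}){returns}")
        elif isinstance(node, ast.ClassDef):
            summary.append(f"class {node.name}")
    return summary


source = '''
class User:
    pass

def validate_token(token: str) -> User | None:
    return None
'''
print(compress_module(source))
```

The source is parsed but never executed, so the summary step is safe to run on untrusted or broken-at-runtime modules as long as they are syntactically valid.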

---

## 📐 Token Budget Estimate

| Component | Tokens |
|---|---|
| Current module (full code) | ~800 |
| 5 dependency summaries × 50 tokens | ~250 |
| 5 dependent summaries × 50 tokens | ~250 |
| Existing review annotations | ~200 |
| System prompt + action space | ~300 |
| **Total per step** | **~1,800** |

That's well within E4B's window, even for complex projects.

---

## 🎯 Bonus: This Becomes a Feature, Not a Workaround

The **summary layer is actually what makes your environment novel**. You're teaching the agent to:

1. Read compressed dependency context
2. Decide if a deeper look is needed
3. Optionally request full code of a neighbor via a `REQUEST_CONTEXT(module_id)` action

That `REQUEST_CONTEXT` action costs a small reward penalty (simulating real reviewer time cost), which is **genuinely interesting RL reward shaping** that judges will notice.

---

## The Revised Action Space

```
APPROVE
FLAG_STYLE
FLAG_BUG
FLAG_SECURITY
FLAG_DEPENDENCY_ISSUE      ← new: cascading problem from neighbor
ADD_COMMENT(txt)
REQUEST_CHANGES
REQUEST_CONTEXT(module)    ← new: pay token cost, get full neighbor code
```

This elegantly solves your original cascading dependency problem too: the agent can *choose* to investigate upstream before judging.

Want me to now write the full file structure and implementation skeleton?

Great, these are the four hardest design questions in the project. Let me work through each one carefully.

---

## 1. 🔪 Converting Code Into Modules Accurately

The challenge is: what IS a module? You need a deterministic answer.

**Recommended: File = Module, with AST sub-structure**

```
Step 1: File-level split
  Each .py file = one node in the graph
  filename → module_id

Step 2: AST parsing per file
  Extract:
  - All function signatures + return types
  - All class definitions
  - All imports (this gives you edges)
  - All global variables

Step 3: Edge construction from imports
  "from auth import validate_token"
  → edge: current_module → auth.py

Step 4: Dependency reason tagging
  Use the import line + first usage context
  as the "why this depends on that" annotation
```

**The hard problem: implicit dependencies**
Sometimes module B doesn't import A directly but uses a shared global or config. Handle this with a second pass:

```
Pass 1: Explicit edges (imports)
Pass 2: Name resolution edges
  - scan function bodies for names not defined locally
  - trace them back to source module
  - add an "implicit dependency" edge with lower weight
```

Python's `ast` module gives you all the parsing you need for this; the cross-module name resolution in Pass 2 is logic you write on top of it. No external library needed.
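
Pass 1 (explicit import edges) could look roughly like this. It is a simplified sketch that assumes a flat project layout where `import auth` resolves to `auth.py`; relative imports and packages are skipped, and the `import_edges` name and sample file names are hypothetical.

```python
import ast


def import_edges(module_id, source, known_modules):
    """Edges (module_id -> imported module) for imports that resolve to known files."""
    edges = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue  # not an import, or a relative import this sketch ignores
        for name in names:
            # Flat-layout assumption: the top-level package name is the file name.
            target = name.split(".")[0] + ".py"
            if target in known_modules and target != module_id:
                edges.add((module_id, target))
    return sorted(edges)


source = "import os\nfrom auth import validate_token\nimport cart.items\n"
known = {"auth.py", "cart.py", "checkout.py"}
print(import_edges("checkout.py", source, known))
```

Filtering against `known_modules` drops standard-library and third-party imports such as `os`, so only edges inside the project reach the graph. Using `ast.walk` rather than only the module body also picks up imports made inside functions.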

---

## 2. 📊 How Reporting Works

Think of reporting as **three layers that build progressively**:

```
Layer 1: Per-step annotation (live)
  Every time agent calls ADD_COMMENT or FLAG_*,
  that gets written immediately to the graph node
  as a review_annotation field

Layer 2: Per-module summary (end of episode)
  When episode ends (agent calls APPROVE or REQUEST_CHANGES),
  environment compiles all step annotations into:
  {
    verdict: "approved | changes_requested",
    issues: [...],
    dependency_notes: [...],
    confidence: 0.0-1.0   ← derived from reward trajectory
  }

Layer 3: Full codebase report (end of all episodes)
  state() returns the entire annotated graph
  Serialize to:
  - JSON (machine readable)
  - Markdown report (human readable)
  - Visual graph (NetworkX → graphviz or mermaid)
```

**Updating reviews as the agent learns more** is the elegant part. Because reviews are stored on graph nodes, when the agent later reviews module B and discovers the root cause was actually in module A, it can call one more action:

```
AMEND_REVIEW(module_id="auth.py", note="root cause of checkout.py null issue")
```

This updates the node annotation retroactively. The reward for this action is high because it's exactly the cascading reasoning you want to incentivize.

---

## 3. ✅ Does This Align With Round 1 Requirements?

Let's go requirement by requirement:

| Requirement | Your Design | Status |
|---|---|---|
| Real-world task | Code review with dependency reasoning | ✅ Strong |
| step() / reset() / state() | Per-module episodes, graph persists in state() | ✅ |
| Typed Pydantic models | Observation = code + summaries, Action = flag/comment/request, Reward = float | ✅ |
| Minimum 3 tasks easy→hard | Easy: style/linter, Medium: logic bug with direct dep context, Hard: cascading bug across 2+ modules | ✅ |
| Reward 0.0–1.0 with partial signal | Per-step rewards for each correct flag/comment/attribution | ✅ |
| Deterministic graders | Easy/medium use AST+linter ground truth, hard uses fixed-seed LLM judge | ✅ with care |
| Baseline inference script | Agent reviews all 3 task codebases, emits [START]/[STEP]/[END] logs | ✅ |
| Dockerfile + HF Space | Standard containerization | ✅ |
| openenv.yaml + validate | Standard spec compliance | ✅ |

One gap to watch: **the hard task grader's quasi-determinism**. Document your judge prompt and temperature=0 explicitly in the README to satisfy the reproducibility requirement.

---

## 4. 🤖 Where Is The RL? Where Is OpenEnv?

This is the most important question to be clear on, because judges WILL ask.

### The RL Loop

```
Environment (your code)              Agent (Gemma 4 / any LLM)
─────────────────────────            ──────────────────────────
reset()                          →   receives initial observation
                                     (module code + graph context)

                                 ←   action: FLAG_BUG

step(FLAG_BUG)                   →   returns:
                                     - new observation (updated graph)
                                     - reward (+0.5 if real bug)
                                     - done (False)
                                     - info {}

                                 ←   action: ADD_COMMENT("null risk line 12")

step(ADD_COMMENT(...))           →   reward (+0.3 if accurate)

                                 ←   action: REQUEST_CHANGES

step(REQUEST_CHANGES)            →   reward (+0.2 episode bonus)
                                     done = True

state()                          →   full annotated graph so far
```

The **RL part** is: the agent is learning a *policy* (which actions to take given a code observation and graph context) to maximize cumulative reward. It's not just calling an LLM once. It's a multi-step decision loop.

### Where OpenEnv Fits

```
openenv.yaml
  name: code-review-env
  version: 1.0.0
  tasks: [style_review, logic_review, cascading_review]

environment.py
  class CodeReviewEnv:
    observation: CodeObservation   ← Pydantic model
    action: ReviewAction           ← Pydantic model
    reward: ReviewReward           ← Pydantic model

    def reset() → CodeObservation
    def step(action) → (obs, reward, done, info)
    def state() → GraphState

graders/
  easy_grader.py     ← linter diff check
  medium_grader.py   ← AST + line attribution check
  hard_grader.py     ← LLM judge, fixed seed

inference.py         ← runs Gemma 4 as the agent
  [START] task=cascading_review
  [STEP] action=FLAG_DEPENDENCY_ISSUE reward=0.5
  [STEP] action=ADD_COMMENT reward=0.3
  [END] total_reward=0.8
```

OpenEnv is the **standardized wrapper** that makes your environment pluggable: any agent, any model, can be dropped in and evaluated against your graders. That's the whole point of the spec.
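
To make the `reset()` / `step()` / `state()` contract concrete, here is a toy, single-module version of the loop. It is deliberately not built on the OpenEnv base classes or Pydantic: it uses plain dataclasses so it runs on the standard library alone, comment grading is left out, and every name and reward rule in it is a simplifying assumption drawn from the diagrams above.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewAction:
    kind: str       # e.g. "FLAG_BUG", "ADD_COMMENT", "APPROVE", "REQUEST_CHANGES"
    text: str = ""


@dataclass
class CodeReviewEnv:
    code: str
    has_bug: bool   # ground truth the grader would normally supply
    annotations: list = field(default_factory=list)
    done: bool = False

    def reset(self):
        self.annotations, self.done = [], False
        return {"code": self.code, "annotations": []}

    def step(self, action):
        if self.done:
            raise RuntimeError("episode finished; call reset()")
        reward = 0.0  # ADD_COMMENT scoring needs a grader and is omitted here
        if action.kind == "FLAG_BUG":
            reward = 0.5 if self.has_bug else -0.2
        elif action.kind == "APPROVE":
            reward = -1.0 if self.has_bug else 0.2
            self.done = True
        elif action.kind == "REQUEST_CHANGES":
            reward = 0.2  # episode completion bonus
            self.done = True
        self.annotations.append((action.kind, action.text))
        obs = {"code": self.code, "annotations": list(self.annotations)}
        return obs, reward, self.done, {}

    def state(self):
        return {"annotations": list(self.annotations), "done": self.done}


env = CodeReviewEnv(code="def f(x): return x / 0", has_bug=True)
env.reset()
_, flag_reward, flag_done, _ = env.step(ReviewAction("FLAG_BUG"))
_, end_reward, end_done, _ = env.step(ReviewAction("REQUEST_CHANGES"))
print(flag_reward, end_reward, env.state())
```

A real implementation would swap the `has_bug` boolean for the graders and return typed observation models, but the control flow stays the same.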

---

## The One-Line Pitch to Judges

> *"An RL environment where an agent learns to review code with full dependency awareness, outputting an annotated codebase graph, not just PR comments. It teaches agents to reason about WHY a bug exists, not just WHERE."*

That's differentiated from CodeRabbit, aligned with all Round 1 requirements, and has a clear RL loop.

**Ready to build the file skeleton now?**

What I basically want is for there to be RL in this project. It should first look at the code and find issues, then the reported bug should be reviewed. If the bug is correctly reported there should be a reward, otherwise a punishment, and based on that the actual model should learn. This should be integrated into it.

Also, right now we have an issue where the modules are not all linked properly. It's not just about imports but also about where modules are called or run; this linking is incomplete and produces isolated modules. Large modules also get broken into many pieces. I also suspect that the LLM is not grading the code properly: it just passes over it and repeats the errors from the previous step. The hard filter should also catch errors, find issues, and give detailed reports on all of them as clear tasks. Give it a super detailed agent prompt for this task, plus an output format, and make sure it is adaptable. After the errors are found, it should verify them with another model that the user can define, and then learn from that verification, which is where the RL comes in. Assign proper grades so the machine learning works well for this particular task. Also, the arrows in the graph are sometimes too thick, and when I hover over them I get a big block of text rather than a well-formatted overlay with info about the modules. And when I click on a module, the sidebar should show its report, well formatted.