shreyas-joshi committed
Commit 86c3e08 · 1 Parent(s): 1432cf4

Add Phase 06 Plan for Adaptive Judging and Edge Intelligence; Create initial project outline for GraphReview RL Environment

code-review-env/README.md CHANGED
@@ -35,6 +35,14 @@ Phase 5:
35
  - Added confidence scoring that balances precision/recall with severity/security coverage and attribution validity.
36
  - Added API endpoint to generate artifacts and CLI support for real project runs.
37
 
38
  ## Core Runtime Components
39
 
40
  - `env/environment.py`
@@ -105,6 +113,8 @@ with `auth_token` connect arg.
105
 
106
  ## LLM and Runtime Env Vars
107
 
 
 
108
  Judge settings:
109
 
110
  - `GRAPHREVIEW_JUDGE_PROVIDER` (default `ollama_openai_compat`)
@@ -117,6 +127,39 @@ Judge settings:
117
  - `GRAPHREVIEW_JUDGE_MAX_CONSECUTIVE_FAILURES` (default `3`)
118
  - `GRAPHREVIEW_JUDGE_THINK` (`false|true|low|medium|high`, default `false`)
119
 
120
  General runtime settings:
121
 
122
  - `GRAPHREVIEW_SOURCE_ROOT` (default `sample_project`)
@@ -143,6 +186,26 @@ curl -s http://localhost:8000/health
143
  curl -s http://localhost:8000/tasks
144
  ```
145
 
146
  ## Direct Module Review (Phase 4)
147
 
148
  Example: run `logic_review` with explicit module focus:
 
35
  - Added confidence scoring that balances precision/recall with severity/security coverage and attribution validity.
36
  - Added API endpoint to generate artifacts and CLI support for real project runs.
37
 
38
+ Phase 6:
39
+
40
+ - Added adaptive hard-grader fusion: deterministic graph gate + primary judge + verifier judge.
41
+ - Added disagreement-aware reweighting to reduce single-model catastrophic errors.
42
+ - Added per-edge `connection_summary` generation using LLM with deterministic fallback.
43
+ - Added optional LoRA trajectory logging for cross-project learning data collection.
44
+ - Added root `.env` support for centralized configuration management.
45
+
46
  ## Core Runtime Components
47
 
48
  - `env/environment.py`
 
113
 
114
  ## LLM and Runtime Env Vars
115
 
116
+ `.env` at project root is auto-loaded by runtime configuration, DB initialization, and server startup.
117
+
118
  Judge settings:
119
 
120
  - `GRAPHREVIEW_JUDGE_PROVIDER` (default `ollama_openai_compat`)
 
127
  - `GRAPHREVIEW_JUDGE_MAX_CONSECUTIVE_FAILURES` (default `3`)
128
  - `GRAPHREVIEW_JUDGE_THINK` (`false|true|low|medium|high`, default `false`)
129
 
130
+ Verifier and adaptive fusion settings:
131
+
132
+ - `GRAPHREVIEW_VERIFIER_ENABLED` (default `true`)
133
+ - `GRAPHREVIEW_VERIFIER_PROVIDER`
134
+ - `GRAPHREVIEW_VERIFIER_MODEL`
135
+ - `GRAPHREVIEW_VERIFIER_BASE_URL`
136
+ - `GRAPHREVIEW_VERIFIER_API_KEY`
137
+ - `GRAPHREVIEW_VERIFIER_TIMEOUT_SECONDS`
138
+ - `GRAPHREVIEW_JUDGE_WEIGHT_DETERMINISTIC` (default `0.5`)
139
+ - `GRAPHREVIEW_JUDGE_WEIGHT_PRIMARY` (default `0.3`)
140
+ - `GRAPHREVIEW_JUDGE_WEIGHT_VERIFIER` (default `0.2`)
141
+ - `GRAPHREVIEW_JUDGE_DISAGREEMENT_THRESHOLD` (default `0.5`)
142
+
143
+ Edge summary settings:
144
+
145
+ - `GRAPHREVIEW_EDGE_SUMMARY_ENABLED` (default `false`, enable when you want LLM edge summaries)
146
+ - `GRAPHREVIEW_EDGE_SUMMARY_MODEL`
147
+ - `GRAPHREVIEW_EDGE_SUMMARY_BASE_URL`
148
+ - `GRAPHREVIEW_EDGE_SUMMARY_API_KEY`
149
+ - `GRAPHREVIEW_EDGE_SUMMARY_TIMEOUT_SECONDS`
150
+ - `GRAPHREVIEW_EDGE_SUMMARY_MAX_CALLS`
151
+
152
+ LoRA trajectory hooks:
153
+
154
+ - `GRAPHREVIEW_LORA_ENABLED` (default `false`)
155
+ - `GRAPHREVIEW_LORA_DATA_PATH` (default `outputs/lora/transitions.jsonl`)
156
+
157
+ Generate a LoRA-ready SFT dataset from transitions:
158
+
159
+ ```bash
160
+ python -m llm.lora_finetune --transitions outputs/lora/transitions.jsonl --output outputs/lora/sft_dataset.jsonl
161
+ ```
162
+
163
  General runtime settings:
164
 
165
  - `GRAPHREVIEW_SOURCE_ROOT` (default `sample_project`)
 
186
  curl -s http://localhost:8000/tasks
187
  ```
188
 
189
+ ## Unified One-Command Runner
190
+
191
+ Run seed + easy/medium/hard reviews + artifact generation on any target codebase:
192
+
193
+ ```bash
194
+ graphreview /absolute/path/to/your/codebase --force-seed
195
+ ```
196
+
197
+ Equivalent without installing entrypoints:
198
+
199
+ ```bash
200
+ python run_project.py /absolute/path/to/your/codebase --force-seed
201
+ ```
202
+
203
+ Optional focused run:
204
+
205
+ ```bash
206
+ graphreview /absolute/path/to/your/codebase --modules checkout auth --filter-hops 1 --report-prefix myrun
207
+ ```
208
+
209
  ## Direct Module Review (Phase 4)
210
 
211
  Example: run `logic_review` with explicit module focus:
code-review-env/db/migrations.py CHANGED
@@ -6,6 +6,8 @@ from pathlib import Path
6
  from sqlmodel import SQLModel, create_engine
7
  from sqlalchemy import inspect, text
8
 
 
 
9
 
10
  def get_default_db_path() -> Path:
11
  project_root = Path(__file__).resolve().parents[1]
@@ -13,6 +15,7 @@ def get_default_db_path() -> Path:
13
 
14
 
15
  def get_engine(db_path: str | Path | None = None, echo: bool = False):
 
16
  env_url = os.getenv("GRAPHREVIEW_DATABASE_URL", "").strip()
17
  if env_url:
18
  connect_args: dict[str, object] = {}
@@ -43,6 +46,7 @@ def get_engine(db_path: str | Path | None = None, echo: bool = False):
43
 
44
 
45
  def init_db(db_path: str | Path | None = None, echo: bool = False) -> None:
 
46
  from db import schema # noqa: F401
47
 
48
  engine = get_engine(db_path=db_path, echo=echo)
@@ -66,6 +70,14 @@ def _apply_lightweight_migrations(engine) -> None:
66
  if "is_amendment" not in existing_columns:
67
  add_statements.append("ALTER TABLE reviewannotation ADD COLUMN is_amendment BOOLEAN DEFAULT 0")
68
 
 
 
 
 
 
 
 
 
69
  if not add_statements:
70
  return
71
 
 
6
  from sqlmodel import SQLModel, create_engine
7
  from sqlalchemy import inspect, text
8
 
9
+ from env.env_loader import load_env_file
10
+
11
 
12
  def get_default_db_path() -> Path:
13
  project_root = Path(__file__).resolve().parents[1]
 
15
 
16
 
17
  def get_engine(db_path: str | Path | None = None, echo: bool = False):
18
+ load_env_file()
19
  env_url = os.getenv("GRAPHREVIEW_DATABASE_URL", "").strip()
20
  if env_url:
21
  connect_args: dict[str, object] = {}
 
46
 
47
 
48
  def init_db(db_path: str | Path | None = None, echo: bool = False) -> None:
49
+ load_env_file()
50
  from db import schema # noqa: F401
51
 
52
  engine = get_engine(db_path=db_path, echo=echo)
 
70
  if "is_amendment" not in existing_columns:
71
  add_statements.append("ALTER TABLE reviewannotation ADD COLUMN is_amendment BOOLEAN DEFAULT 0")
72
 
73
+ if not add_statements:
74
+ add_statements = []
75
+
76
+ if "moduleedge" in inspector.get_table_names():
77
+ edge_columns = {col["name"] for col in inspector.get_columns("moduleedge")}
78
+ if "connection_summary" not in edge_columns:
79
+ add_statements.append("ALTER TABLE moduleedge ADD COLUMN connection_summary TEXT DEFAULT ''")
80
+
81
  if not add_statements:
82
  return
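The lightweight migration above adds `connection_summary` only when the `moduleedge` table exists and the column is missing. The same idempotent pattern can be sketched with the standard-library `sqlite3` module in place of the SQLAlchemy inspector the commit uses. The helper name `ensure_column` is hypothetical, and the sketch assumes a SQLite database.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, ddl: str) -> bool:
    """Add `column` to `table` if the table exists and the column is missing.

    Returns True only when an ALTER TABLE statement was actually executed.
    Identifiers are interpolated directly, so pass trusted names only.
    """
    tables = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    if table not in tables:
        return False  # nothing to migrate yet; table creation will include the column

    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False  # already migrated; running again is a no-op

    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl}")
    return True
```

Calling `ensure_column(conn, "moduleedge", "connection_summary", "TEXT DEFAULT ''")` twice should alter the table once and then do nothing, which is what makes it safe to run on every startup.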
83
 
code-review-env/db/schema.py CHANGED
@@ -53,6 +53,7 @@ class ModuleEdge(SQLModel, table=True):
53
  edge_type: EdgeType = Field(default=EdgeType.EXPLICIT_IMPORT)
54
  import_line: str
55
  weight: float = 1.0
 
56
 
57
 
58
  class LinterFinding(SQLModel, table=True):
 
53
  edge_type: EdgeType = Field(default=EdgeType.EXPLICIT_IMPORT)
54
  import_line: str
55
  weight: float = 1.0
56
+ connection_summary: str = ""
57
 
58
 
59
  class LinterFinding(SQLModel, table=True):
code-review-env/db/seed.py CHANGED
@@ -14,6 +14,7 @@ from parser.chunker import chunk_module
14
  from parser.graph_builder import build_edges
15
  from parser.linter import run_linters
16
  from parser.summarizer import summarize_module
 
17
 
18
 
19
  _SKIP_DIRS = {
@@ -160,13 +161,24 @@ def seed_project(target_dir: Path, db_path: str | None = None, force: bool = Fal
160
  store.replace_findings_for_module(parsed.module_id, [issue.model_dump() for issue in issues])
161
 
162
  edges = build_edges(parsed_modules, module_ids, chunk_ids_by_parent)
 
163
  for edge in edges:
 
 
 
 
 
 
 
 
 
164
  store.upsert_edge(
165
  source_module_id=edge.source_module_id,
166
  target_module_id=edge.target_module_id,
167
  edge_type=edge.edge_type,
168
  import_line=edge.import_line,
169
  weight=edge.weight,
 
170
  )
171
 
172
  snapshot = store.get_full_graph()
 
14
  from parser.graph_builder import build_edges
15
  from parser.linter import run_linters
16
  from parser.summarizer import summarize_module
17
+ from llm.edge_summarizer import EdgeSummarizer, EdgeSummaryInput
18
 
19
 
20
  _SKIP_DIRS = {
 
161
  store.replace_findings_for_module(parsed.module_id, [issue.model_dump() for issue in issues])
162
 
163
  edges = build_edges(parsed_modules, module_ids, chunk_ids_by_parent)
164
+ edge_summarizer = EdgeSummarizer()
165
  for edge in edges:
166
+ connection_summary = edge_summarizer.summarize(
167
+ EdgeSummaryInput(
168
+ source_module_id=edge.source_module_id,
169
+ target_module_id=edge.target_module_id,
170
+ edge_type=edge.edge_type.value,
171
+ import_line=edge.import_line,
172
+ scope=edge.scope,
173
+ )
174
+ )
175
  store.upsert_edge(
176
  source_module_id=edge.source_module_id,
177
  target_module_id=edge.target_module_id,
178
  edge_type=edge.edge_type,
179
  import_line=edge.import_line,
180
  weight=edge.weight,
181
+ connection_summary=connection_summary,
182
  )
183
 
184
  snapshot = store.get_full_graph()
code-review-env/db/store.py CHANGED
@@ -54,6 +54,7 @@ class GraphEdgeRecord(BaseModel):
54
  target_module_id: str
55
  weight: float
56
  import_line: str
 
57
 
58
 
59
  class GraphSnapshot(BaseModel):
@@ -133,6 +134,7 @@ class Store:
133
  edge_type: EdgeType,
134
  import_line: str,
135
  weight: float,
 
136
  ) -> ModuleEdge:
137
  with Session(self.engine) as session:
138
  existing = session.exec(
@@ -146,6 +148,7 @@ class Store:
146
  if existing:
147
  existing.edge_type = edge_type
148
  existing.weight = weight
 
149
  session.add(existing)
150
  session.commit()
151
  session.refresh(existing)
@@ -158,6 +161,7 @@ class Store:
158
  edge_type=edge_type,
159
  import_line=import_line,
160
  weight=weight,
 
161
  )
162
  session.add(edge)
163
  session.commit()
@@ -337,6 +341,7 @@ class Store:
337
  target_module_id=edge.target_module_id,
338
  weight=edge.weight,
339
  import_line=edge.import_line,
 
340
  )
341
  for edge in edges
342
  ],
 
54
  target_module_id: str
55
  weight: float
56
  import_line: str
57
+ connection_summary: str
58
 
59
 
60
  class GraphSnapshot(BaseModel):
 
134
  edge_type: EdgeType,
135
  import_line: str,
136
  weight: float,
137
+ connection_summary: str = "",
138
  ) -> ModuleEdge:
139
  with Session(self.engine) as session:
140
  existing = session.exec(
 
148
  if existing:
149
  existing.edge_type = edge_type
150
  existing.weight = weight
151
+ existing.connection_summary = connection_summary or existing.connection_summary
152
  session.add(existing)
153
  session.commit()
154
  session.refresh(existing)
 
161
  edge_type=edge_type,
162
  import_line=import_line,
163
  weight=weight,
164
+ connection_summary=connection_summary,
165
  )
166
  session.add(edge)
167
  session.commit()
 
341
  target_module_id=edge.target_module_id,
342
  weight=edge.weight,
343
  import_line=edge.import_line,
344
+ connection_summary=edge.connection_summary,
345
  )
346
  for edge in edges
347
  ],
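One detail of the `upsert_edge` change is easy to miss: on update, an empty `connection_summary` does not clear the stored value. A minimal sketch of that rule, using a hypothetical helper name (`merge_connection_summary`) that does not exist in the commit:

```python
def merge_connection_summary(existing: str, incoming: str = "") -> str:
    """Mirror `incoming or existing`: only a non-empty summary replaces the stored one."""
    return incoming or existing
```

A consequence, assuming the code behaves as the diff reads, is that re-seeding with edge summaries disabled will not erase summaries produced by an earlier LLM-enabled run. That is because the deterministic fallback is itself non-empty and replaces the stored value, while callers that omit the argument leave it untouched.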
code-review-env/env/env_loader.py ADDED
@@ -0,0 +1,21 @@
1
+ from __future__ import annotations
2
+
3
+ import os
4
+ from pathlib import Path
5
+
6
+
7
+ def load_env_file(path: str | Path | None = None) -> None:
8
+ """Load key-value pairs from .env without overriding existing env vars."""
9
+ env_path = Path(path) if path is not None else Path(__file__).resolve().parents[1] / ".env"
10
+ if not env_path.exists():
11
+ return
12
+
13
+ for raw_line in env_path.read_text(encoding="utf-8").splitlines():
14
+ line = raw_line.strip()
15
+ if not line or line.startswith("#") or "=" not in line:
16
+ continue
17
+ key, value = line.split("=", 1)
18
+ key = key.strip()
19
+ value = value.strip().strip('"').strip("'")
20
+ if key and key not in os.environ:
21
+ os.environ[key] = value
code-review-env/env/environment.py CHANGED
@@ -21,6 +21,7 @@ from graders.base_grader import BaseGrader
21
  from graders.easy_grader import EasyGrader
22
  from graders.hard_grader import HardGrader
23
  from graders.medium_grader import MediumGrader
 
24
  from tasks.task_registry import TaskSpec, get_task, list_tasks, resolve_task_modules
25
 
26
 
@@ -80,6 +81,7 @@ class CodeReviewEnv:
80
  self.store = Store(source_root=self.source_root, db_path=self.db_path)
81
  self.graph_manager = GraphManager(source_root=self.source_root, db_path=self.db_path)
82
  self.observation_builder = ObservationBuilder(source_root=self.source_root, db_path=self.db_path)
 
83
 
84
  self._runtime: _EpisodeRuntime | None = None
85
  self._grader: BaseGrader | None = None
@@ -196,6 +198,18 @@ class CodeReviewEnv:
196
  context_request=context_request,
197
  )
198
 
 
 
 
 
 
 
 
 
 
 
 
 
199
  return StepResult(
200
  observation=observation,
201
  reward=reward.raw_value,
 
21
  from graders.easy_grader import EasyGrader
22
  from graders.hard_grader import HardGrader
23
  from graders.medium_grader import MediumGrader
24
+ from llm.lora_adapter import LoRATrajectoryLogger
25
  from tasks.task_registry import TaskSpec, get_task, list_tasks, resolve_task_modules
26
 
27
 
 
81
  self.store = Store(source_root=self.source_root, db_path=self.db_path)
82
  self.graph_manager = GraphManager(source_root=self.source_root, db_path=self.db_path)
83
  self.observation_builder = ObservationBuilder(source_root=self.source_root, db_path=self.db_path)
84
+ self.lora_logger = LoRATrajectoryLogger()
85
 
86
  self._runtime: _EpisodeRuntime | None = None
87
  self._grader: BaseGrader | None = None
 
198
  context_request=context_request,
199
  )
200
 
201
+ self.lora_logger.log(
202
+ source_root=self.source_root,
203
+ episode_id=runtime.episode_id,
204
+ module_id=module_id,
205
+ step_number=step_number,
206
+ action=action,
207
+ reward=reward.raw_value,
208
+ done=runtime.done,
209
+ task_id=runtime.task.task_id,
210
+ observation_summary=f"module={observation.module_id} actions={','.join(observation.available_actions[:6])}",
211
+ )
212
+
213
  return StepResult(
214
  observation=observation,
215
  reward=reward.raw_value,
code-review-env/env/reward.py CHANGED
@@ -9,6 +9,7 @@ class RewardReason(StrEnum):
9
  CORRECT_FLAG = "correct_flag"
10
  ACCURATE_COMMENT = "accurate_comment"
11
  CORRECT_DEPENDENCY_ATTRIBUTION = "correct_dependency_attribution"
 
12
  INCORRECT_DEPENDENCY_ATTRIBUTION = "incorrect_dependency_attribution"
13
  CORRECT_AMENDMENT = "correct_amendment"
14
  REQUEST_CONTEXT_COST = "request_context_cost"
@@ -23,6 +24,7 @@ RAW_REWARD_TABLE: dict[RewardReason, float] = {
23
  RewardReason.CORRECT_FLAG: 0.5,
24
  RewardReason.ACCURATE_COMMENT: 0.3,
25
  RewardReason.CORRECT_DEPENDENCY_ATTRIBUTION: 0.6,
 
26
  RewardReason.INCORRECT_DEPENDENCY_ATTRIBUTION: 0.1,
27
  RewardReason.CORRECT_AMENDMENT: 0.4,
28
  RewardReason.REQUEST_CONTEXT_COST: -0.1,
 
9
  CORRECT_FLAG = "correct_flag"
10
  ACCURATE_COMMENT = "accurate_comment"
11
  CORRECT_DEPENDENCY_ATTRIBUTION = "correct_dependency_attribution"
12
+ PARTIAL_DEPENDENCY_ATTRIBUTION = "partial_dependency_attribution"
13
  INCORRECT_DEPENDENCY_ATTRIBUTION = "incorrect_dependency_attribution"
14
  CORRECT_AMENDMENT = "correct_amendment"
15
  REQUEST_CONTEXT_COST = "request_context_cost"
 
24
  RewardReason.CORRECT_FLAG: 0.5,
25
  RewardReason.ACCURATE_COMMENT: 0.3,
26
  RewardReason.CORRECT_DEPENDENCY_ATTRIBUTION: 0.6,
27
+ RewardReason.PARTIAL_DEPENDENCY_ATTRIBUTION: 0.35,
28
  RewardReason.INCORRECT_DEPENDENCY_ATTRIBUTION: 0.1,
29
  RewardReason.CORRECT_AMENDMENT: 0.4,
30
  RewardReason.REQUEST_CONTEXT_COST: -0.1,
code-review-env/env/runtime_config.py CHANGED
@@ -3,6 +3,8 @@ from __future__ import annotations
3
  import os
4
  from dataclasses import dataclass
5
 
 
 
6
 
7
  @dataclass(frozen=True)
8
  class RuntimeConfig:
@@ -15,6 +17,7 @@ class RuntimeConfig:
15
 
16
 
17
  def load_runtime_config() -> RuntimeConfig:
 
18
  return RuntimeConfig(
19
  llm_provider=os.getenv("GRAPHREVIEW_LLM_PROVIDER", "ollama_openai_compat"),
20
  llm_base_url=os.getenv("GRAPHREVIEW_LLM_BASE_URL", "http://localhost:11434/v1"),
 
3
  import os
4
  from dataclasses import dataclass
5
 
6
+ from env.env_loader import load_env_file
7
+
8
 
9
  @dataclass(frozen=True)
10
  class RuntimeConfig:
 
17
 
18
 
19
  def load_runtime_config() -> RuntimeConfig:
20
+ load_env_file()
21
  return RuntimeConfig(
22
  llm_provider=os.getenv("GRAPHREVIEW_LLM_PROVIDER", "ollama_openai_compat"),
23
  llm_base_url=os.getenv("GRAPHREVIEW_LLM_BASE_URL", "http://localhost:11434/v1"),
code-review-env/graders/hard_grader.py CHANGED
@@ -38,21 +38,36 @@ class HardGrader(MediumGrader):
38
  "GRAPHREVIEW_JUDGE_PROVIDER",
39
  "ollama_openai_compat",
40
  )
 
41
  self.base_url = os.getenv("GRAPHREVIEW_JUDGE_BASE_URL", "http://localhost:11434/v1")
 
42
  self.api_key = os.getenv("GRAPHREVIEW_JUDGE_API_KEY", "ollama")
 
43
  self.timeout = float(os.getenv("GRAPHREVIEW_JUDGE_TIMEOUT_SECONDS", "8"))
 
44
  self.judge_system_prompt = os.getenv(
45
  "GRAPHREVIEW_JUDGE_SYSTEM_PROMPT",
46
  self.DEFAULT_JUDGE_SYSTEM_PROMPT,
47
  )
 
 
 
 
 
 
48
  self.reasoning_effort = os.getenv("GRAPHREVIEW_JUDGE_REASONING_EFFORT", "none")
49
  self.think_value = os.getenv("GRAPHREVIEW_JUDGE_THINK", "false").strip().lower()
50
  self.max_judge_calls = int(os.getenv("GRAPHREVIEW_JUDGE_MAX_CALLS", "200"))
51
  self.max_consecutive_failures = int(os.getenv("GRAPHREVIEW_JUDGE_MAX_CONSECUTIVE_FAILURES", "3"))
 
 
 
 
52
  self._judge_calls = 0
53
  self._consecutive_failures = 0
54
  self._judge_cache: dict[str, tuple[float, str]] = {}
55
  self.prompt_hash = hashlib.sha256(self.judge_system_prompt.encode("utf-8")).hexdigest()
 
56
 
57
  def grade_action(
58
  self,
@@ -95,39 +110,91 @@ class HardGrader(MediumGrader):
95
  )
96
 
97
  normalized_action = action.model_copy(update={"attributed_to": attributed_to})
98
- judge_result = self._judge_dependency_reasoning(module_id, normalized_action)
99
- if len(judge_result) == 2:
100
- judge_score, explanation = judge_result
101
- else:
102
- judge_score, explanation = judge_result[0], judge_result[1]
103
- if judge_score <= 0.0:
104
  return make_reward(
105
  RewardReason.INCORRECT_DEPENDENCY_ATTRIBUTION,
106
- explanation,
107
  metadata={
108
- "judge_score": judge_score,
 
 
 
109
  "judge_provider": self.judge_provider,
110
  "judge_model": self.judge_model,
 
 
111
  "temperature": 0.0,
112
  "prompt_hash": self.prompt_hash,
 
113
  },
114
  )
115
 
116
  base_reason = RewardReason.CORRECT_DEPENDENCY_ATTRIBUTION
 
 
117
  reward = make_reward(
118
  base_reason,
119
- explanation,
120
  metadata={
121
- "judge_score": judge_score,
 
 
 
122
  "judge_provider": self.judge_provider,
123
  "judge_model": self.judge_model,
 
 
124
  "temperature": 0.0,
125
  "prompt_hash": self.prompt_hash,
 
126
  },
127
  )
128
  return reward
129
 
130
- def _judge_dependency_reasoning(self, module_id: str, action: ReviewAction) -> tuple[float, str]:
 
 
 
 
 
 
 
 
 
 
 
131
  if not self.judge_enabled:
132
  return 1.0, "Judge disabled by configuration; graph-consistent attribution accepted"
133
 
@@ -145,21 +212,21 @@ class HardGrader(MediumGrader):
145
  "rubric": "0.0 wrong or unsupported; 0.5 partially justified; 1.0 well-justified root cause",
146
  }
147
  payload_text = json.dumps(payload, sort_keys=True)
148
- cache_key = hashlib.sha256(payload_text.encode("utf-8")).hexdigest()
149
  cached = self._judge_cache.get(cache_key)
150
  if cached is not None:
151
  return cached
152
 
153
  try:
154
  self._judge_calls += 1
155
- client = OpenAI(api_key=self.api_key, base_url=self.base_url, timeout=self.timeout)
156
 
157
  request_kwargs: dict[str, Any] = {
158
- "model": self.judge_model,
159
  "temperature": 0.0,
160
  "response_format": {"type": "json_object"},
161
  "messages": [
162
- {"role": "system", "content": self.judge_system_prompt},
163
  {"role": "user", "content": payload_text},
164
  ],
165
  }
@@ -167,7 +234,7 @@ class HardGrader(MediumGrader):
167
  if self.reasoning_effort in {"none", "low", "medium", "high"}:
168
  request_kwargs["reasoning_effort"] = self.reasoning_effort
169
 
170
- if self.judge_provider == "ollama_openai_compat":
171
  if self.think_value in {"true", "false", "low", "medium", "high"}:
172
  think: bool | str
173
  if self.think_value in {"low", "medium", "high"}:
@@ -206,3 +273,43 @@ class HardGrader(MediumGrader):
206
  if start >= 0 and end > start:
207
  return json.loads(text[start : end + 1])
208
  raise
 
38
  "GRAPHREVIEW_JUDGE_PROVIDER",
39
  "ollama_openai_compat",
40
  )
41
+ self.verifier_provider = os.getenv("GRAPHREVIEW_VERIFIER_PROVIDER", self.judge_provider)
42
  self.base_url = os.getenv("GRAPHREVIEW_JUDGE_BASE_URL", "http://localhost:11434/v1")
43
+ self.verifier_base_url = os.getenv("GRAPHREVIEW_VERIFIER_BASE_URL", self.base_url)
44
  self.api_key = os.getenv("GRAPHREVIEW_JUDGE_API_KEY", "ollama")
45
+ self.verifier_api_key = os.getenv("GRAPHREVIEW_VERIFIER_API_KEY", self.api_key)
46
  self.timeout = float(os.getenv("GRAPHREVIEW_JUDGE_TIMEOUT_SECONDS", "8"))
47
+ self.verifier_timeout = float(os.getenv("GRAPHREVIEW_VERIFIER_TIMEOUT_SECONDS", str(self.timeout)))
48
  self.judge_system_prompt = os.getenv(
49
  "GRAPHREVIEW_JUDGE_SYSTEM_PROMPT",
50
  self.DEFAULT_JUDGE_SYSTEM_PROMPT,
51
  )
52
+ self.verifier_enabled = os.getenv("GRAPHREVIEW_VERIFIER_ENABLED", "true").strip().lower() == "true"
53
+ self.verifier_model = os.getenv("GRAPHREVIEW_VERIFIER_MODEL", self.judge_model)
54
+ self.verifier_system_prompt = os.getenv(
55
+ "GRAPHREVIEW_VERIFIER_SYSTEM_PROMPT",
56
+ self.DEFAULT_JUDGE_SYSTEM_PROMPT,
57
+ )
58
  self.reasoning_effort = os.getenv("GRAPHREVIEW_JUDGE_REASONING_EFFORT", "none")
59
  self.think_value = os.getenv("GRAPHREVIEW_JUDGE_THINK", "false").strip().lower()
60
  self.max_judge_calls = int(os.getenv("GRAPHREVIEW_JUDGE_MAX_CALLS", "200"))
61
  self.max_consecutive_failures = int(os.getenv("GRAPHREVIEW_JUDGE_MAX_CONSECUTIVE_FAILURES", "3"))
62
+ self.weight_deterministic = float(os.getenv("GRAPHREVIEW_JUDGE_WEIGHT_DETERMINISTIC", "0.5"))
63
+ self.weight_primary = float(os.getenv("GRAPHREVIEW_JUDGE_WEIGHT_PRIMARY", "0.3"))
64
+ self.weight_verifier = float(os.getenv("GRAPHREVIEW_JUDGE_WEIGHT_VERIFIER", "0.2"))
65
+ self.disagreement_threshold = float(os.getenv("GRAPHREVIEW_JUDGE_DISAGREEMENT_THRESHOLD", "0.5"))
66
  self._judge_calls = 0
67
  self._consecutive_failures = 0
68
  self._judge_cache: dict[str, tuple[float, str]] = {}
69
  self.prompt_hash = hashlib.sha256(self.judge_system_prompt.encode("utf-8")).hexdigest()
70
+ self.verifier_prompt_hash = hashlib.sha256(self.verifier_system_prompt.encode("utf-8")).hexdigest()
71
 
72
  def grade_action(
73
  self,
 
110
  )
111
 
112
  normalized_action = action.model_copy(update={"attributed_to": attributed_to})
113
+ primary_score, primary_explanation = self._judge_with_model(
114
+ module_id=module_id,
115
+ action=normalized_action,
116
+ model=self.judge_model,
117
+ provider=self.judge_provider,
118
+ base_url=self.base_url,
119
+ api_key=self.api_key,
120
+ timeout=self.timeout,
121
+ system_prompt=self.judge_system_prompt,
122
+ cache_scope="primary",
123
+ )
124
+ verifier_score = primary_score
125
+ verifier_explanation = "Verifier disabled"
126
+ if self.verifier_enabled:
127
+ verifier_score, verifier_explanation = self._judge_with_model(
128
+ module_id=module_id,
129
+ action=normalized_action,
130
+ model=self.verifier_model,
131
+ provider=self.verifier_provider,
132
+ base_url=self.verifier_base_url,
133
+ api_key=self.verifier_api_key,
134
+ timeout=self.verifier_timeout,
135
+ system_prompt=self.verifier_system_prompt,
136
+ cache_scope="verifier",
137
+ )
138
+
139
+ final_score, blend = self._blend_scores(
140
+ deterministic_score=1.0,
141
+ primary_score=primary_score,
142
+ verifier_score=verifier_score,
143
+ )
144
+
145
+ if final_score < 0.45:
146
  return make_reward(
147
  RewardReason.INCORRECT_DEPENDENCY_ATTRIBUTION,
148
+ f"{primary_explanation} | verifier: {verifier_explanation}",
149
  metadata={
150
+ "judge_score": primary_score,
151
+ "verifier_score": verifier_score,
152
+ "final_score": final_score,
153
+ "blend": json.dumps(blend, sort_keys=True),
154
  "judge_provider": self.judge_provider,
155
  "judge_model": self.judge_model,
156
+ "verifier_provider": self.verifier_provider,
157
+ "verifier_model": self.verifier_model,
158
  "temperature": 0.0,
159
  "prompt_hash": self.prompt_hash,
160
+ "verifier_prompt_hash": self.verifier_prompt_hash,
161
  },
162
  )
163
 
164
  base_reason = RewardReason.CORRECT_DEPENDENCY_ATTRIBUTION
165
+ if final_score < 0.75:
166
+ base_reason = RewardReason.PARTIAL_DEPENDENCY_ATTRIBUTION
167
  reward = make_reward(
168
  base_reason,
169
+ f"{primary_explanation} | verifier: {verifier_explanation}",
170
  metadata={
171
+ "judge_score": primary_score,
172
+ "verifier_score": verifier_score,
173
+ "final_score": final_score,
174
+ "blend": json.dumps(blend, sort_keys=True),
175
  "judge_provider": self.judge_provider,
176
  "judge_model": self.judge_model,
177
+ "verifier_provider": self.verifier_provider,
178
+ "verifier_model": self.verifier_model,
179
  "temperature": 0.0,
180
  "prompt_hash": self.prompt_hash,
181
+ "verifier_prompt_hash": self.verifier_prompt_hash,
182
  },
183
  )
184
  return reward
185
 
186
+ def _judge_with_model(
187
+ self,
188
+ module_id: str,
189
+ action: ReviewAction,
190
+ model: str,
191
+ provider: str,
192
+ base_url: str,
193
+ api_key: str,
194
+ timeout: float,
195
+ system_prompt: str,
196
+ cache_scope: str,
197
+ ) -> tuple[float, str]:
198
  if not self.judge_enabled:
199
  return 1.0, "Judge disabled by configuration; graph-consistent attribution accepted"
200
 
 
212
  "rubric": "0.0 wrong or unsupported; 0.5 partially justified; 1.0 well-justified root cause",
213
  }
214
  payload_text = json.dumps(payload, sort_keys=True)
215
+ cache_key = hashlib.sha256(f"{cache_scope}:{model}:{payload_text}".encode("utf-8")).hexdigest()
216
  cached = self._judge_cache.get(cache_key)
217
  if cached is not None:
218
  return cached
219
 
220
  try:
221
  self._judge_calls += 1
222
+ client = OpenAI(api_key=api_key, base_url=base_url, timeout=timeout)
223
 
224
  request_kwargs: dict[str, Any] = {
225
+ "model": model,
226
  "temperature": 0.0,
227
  "response_format": {"type": "json_object"},
228
  "messages": [
229
+ {"role": "system", "content": system_prompt},
230
  {"role": "user", "content": payload_text},
231
  ],
232
  }
 
234
  if self.reasoning_effort in {"none", "low", "medium", "high"}:
235
  request_kwargs["reasoning_effort"] = self.reasoning_effort
236
 
237
+ if provider == "ollama_openai_compat":
238
  if self.think_value in {"true", "false", "low", "medium", "high"}:
239
  think: bool | str
240
  if self.think_value in {"low", "medium", "high"}:
 
273
  if start >= 0 and end > start:
274
  return json.loads(text[start : end + 1])
275
  raise
276
+
277
+ def _blend_scores(
278
+ self,
279
+ deterministic_score: float,
280
+ primary_score: float,
281
+ verifier_score: float,
282
+ ) -> tuple[float, dict[str, float | bool]]:
283
+ d = max(0.0, min(1.0, deterministic_score))
284
+ p = max(0.0, min(1.0, primary_score))
285
+ v = max(0.0, min(1.0, verifier_score))
286
+
287
+ wd = max(self.weight_deterministic, 0.0)
288
+ wp = max(self.weight_primary, 0.0)
289
+ wv = max(self.weight_verifier, 0.0)
290
+ disagreement = abs(p - v)
291
+ disagreement_guard = disagreement >= self.disagreement_threshold
292
+ if disagreement_guard:
293
+ wp = min(wp, 0.1)
294
+ wv = max(wv, 0.4)
295
+ wd = max(wd, 0.5)
296
+
297
+ total = wd + wp + wv
298
+ if total <= 0:
299
+ return 0.0, {"wd": 0.0, "wp": 0.0, "wv": 0.0, "disagreement": disagreement, "guard": disagreement_guard}
300
+
301
+ wd /= total
302
+ wp /= total
303
+ wv /= total
304
+
305
+ final = (wd * d) + (wp * p) + (wv * v)
306
+ if p == 1.0 and v == 0.0:
307
+ final = min(final, 0.45)
308
+
309
+ return final, {
310
+ "wd": wd,
311
+ "wp": wp,
312
+ "wv": wv,
313
+ "disagreement": disagreement,
314
+ "guard": disagreement_guard,
315
+ }
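The fusion logic in `_blend_scores` and the cutoffs in `grade_action` can be restated as a standalone sketch. The function names `blend_scores` and `classify` are illustrative, not part of the commit. The default weights, guard values, and cutoffs (0.45 and 0.75) are copied from the diff.

```python
def blend_scores(d, p, v, wd=0.5, wp=0.3, wv=0.2, threshold=0.5):
    """Blend deterministic (d), primary (p), and verifier (v) scores.

    Returns (final_score, guard_triggered). Dividing the weighted sum by the
    weight total is mathematically equivalent to normalizing each weight first.
    """
    d, p, v = (max(0.0, min(1.0, s)) for s in (d, p, v))
    wd, wp, wv = max(wd, 0.0), max(wp, 0.0), max(wv, 0.0)

    # Disagreement guard: distrust the primary judge, lean on verifier and graph gate.
    guard = abs(p - v) >= threshold
    if guard:
        wp = min(wp, 0.1)
        wv = max(wv, 0.4)
        wd = max(wd, 0.5)

    total = wd + wp + wv
    if total <= 0:
        return 0.0, guard

    final = (wd * d + wp * p + wv * v) / total
    if p == 1.0 and v == 0.0:
        final = min(final, 0.45)  # hard cap when the verifier fully rejects
    return final, guard


def classify(final):
    """Map a blended score to the reward reason names used in the diff."""
    if final < 0.45:
        return "incorrect_dependency_attribution"
    if final < 0.75:
        return "partial_dependency_attribution"
    return "correct_dependency_attribution"
```

One thing the sketch makes visible: `grade_action` always passes `deterministic_score=1.0`, so with the default weights the blended score appears to stay at or above 0.45. Even `classify(blend_scores(1.0, 0.0, 0.0)[0])` yields the partial reason. If that reading is right, the `INCORRECT_DEPENDENCY_ATTRIBUTION` branch is reached only when the weights are changed through the `GRAPHREVIEW_JUDGE_WEIGHT_*` variables.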
code-review-env/graph/graph_manager.py CHANGED
@@ -57,6 +57,7 @@ class GraphManager:
57
  edge_type=edge.edge_type.value,
58
  import_line=edge.import_line,
59
  weight=edge.weight,
 
60
  )
61
 
62
  self._graph_cache = graph
 
57
  edge_type=edge.edge_type.value,
58
  import_line=edge.import_line,
59
  weight=edge.weight,
60
+ connection_summary=edge.connection_summary,
61
  )
62
 
63
  self._graph_cache = graph
code-review-env/llm/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """LLM helpers for GraphReview."""
code-review-env/llm/edge_summarizer.py ADDED
@@ -0,0 +1,79 @@
1
+ from __future__ import annotations
2
+
3
+ import hashlib
4
+ import json
5
+ import os
6
+ from dataclasses import dataclass
7
+
8
+ from openai import OpenAI
9
+
10
+
11
+ @dataclass(frozen=True)
12
+ class EdgeSummaryInput:
13
+ source_module_id: str
14
+ target_module_id: str
15
+ edge_type: str
16
+ import_line: str
17
+ scope: str
18
+
19
+
20
+ class EdgeSummarizer:
21
+ """Generate concise edge relationship summaries with deterministic fallback."""
22
+
23
+ def __init__(self) -> None:
24
+ self.enabled = os.getenv("GRAPHREVIEW_EDGE_SUMMARY_ENABLED", "false").strip().lower() == "true"
25
+ if os.getenv("PYTEST_CURRENT_TEST"):
26
+ self.enabled = False
27
+ self.base_url = os.getenv("GRAPHREVIEW_EDGE_SUMMARY_BASE_URL", os.getenv("GRAPHREVIEW_LLM_BASE_URL", "http://localhost:11434/v1"))
28
+ self.api_key = os.getenv("GRAPHREVIEW_EDGE_SUMMARY_API_KEY", os.getenv("GRAPHREVIEW_LLM_API_KEY", "ollama"))
29
+ self.model = os.getenv("GRAPHREVIEW_EDGE_SUMMARY_MODEL", os.getenv("GRAPHREVIEW_LLM_MODEL_AGENT", "gemma4:e4b"))
30
+ self.timeout = float(os.getenv("GRAPHREVIEW_EDGE_SUMMARY_TIMEOUT_SECONDS", "8"))
31
+ self.max_calls = int(os.getenv("GRAPHREVIEW_EDGE_SUMMARY_MAX_CALLS", "5000"))
32
+ self._calls = 0
33
+ self._cache: dict[str, str] = {}
34
+
35
+ def summarize(self, edge: EdgeSummaryInput) -> str:
36
+ payload = json.dumps(edge.__dict__, sort_keys=True)
37
+ cache_key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
38
+ if cache_key in self._cache:
39
+ return self._cache[cache_key]
40
+
41
+ summary = self._fallback_summary(edge)
42
+ if self.enabled and self._calls < self.max_calls:
43
+ try:
44
+ self._calls += 1
45
+ client = OpenAI(api_key=self.api_key, base_url=self.base_url, timeout=self.timeout)
46
+ response = client.chat.completions.create(
47
+ model=self.model,
48
+ temperature=0.0,
49
+ messages=[
50
+ {
51
+ "role": "system",
52
+ "content": (
53
+ "You summarize Python dependency edges. Produce one sentence (max 24 words) "
54
+ "explaining why source depends on target using the import/call evidence."
55
+ ),
56
+ },
57
+ {"role": "user", "content": payload},
58
+ ],
59
+ )
60
+ text = (response.choices[0].message.content or "").strip()
61
+ if text:
62
+ summary = text[:240]
63
+ except Exception:
64
+ # Keep deterministic fallback to avoid breaking seed.
65
+ pass
66
+
67
+ self._cache[cache_key] = summary
68
+ return summary
69
+
70
+ @staticmethod
71
+ def _fallback_summary(edge: EdgeSummaryInput) -> str:
72
+ edge_kind = edge.edge_type.replace("_", " ")
73
+ evidence = edge.import_line.strip() or "implicit usage"
74
+ if len(evidence) > 120:
75
+ evidence = evidence[:117] + "..."
76
+ return (
77
+ f"{edge.source_module_id} depends on {edge.target_module_id} via {edge_kind}; "
78
+ f"evidence: {evidence}."
79
+ )
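The cache-key and fallback logic in `summarize` can be exercised without any LLM call. The following is a minimal standalone sketch, not the committed module: the `EdgeSummaryInput` dataclass here is an assumption reconstructed from the fields passed at the call site in `parser/ast_parser.py`, and `cache_key` and `fallback_summary` are hypothetical free-function versions of the logic shown in the diff.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EdgeSummaryInput:
    # Assumed shape: field names mirror the keyword arguments used at the call site.
    source_module_id: str
    target_module_id: str
    edge_type: str
    import_line: str
    scope: str


def cache_key(edge: EdgeSummaryInput) -> str:
    """Stable SHA-256 key: sort_keys makes the JSON independent of field order."""
    payload = json.dumps(asdict(edge), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def fallback_summary(edge: EdgeSummaryInput) -> str:
    """Deterministic one-line summary used when no LLM text is available."""
    edge_kind = edge.edge_type.replace("_", " ")
    evidence = edge.import_line.strip() or "implicit usage"
    if len(evidence) > 120:
        evidence = evidence[:117] + "..."  # keep evidence at 120 characters
    return (
        f"{edge.source_module_id} depends on {edge.target_module_id} via {edge_kind}; "
        f"evidence: {evidence}."
    )


example_edge = EdgeSummaryInput(
    source_module_id="checkout",
    target_module_id="auth",
    edge_type="explicit_import",
    import_line="from auth import validate_token",
    scope="module",
)
example_summary = fallback_summary(example_edge)
```

Because the key is derived only from the edge fields, two identical edges share one cache entry, which is what keeps repeated seeding from issuing duplicate requests.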
code-review-env/llm/lora_adapter.py ADDED
@@ -0,0 +1,69 @@
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ import os
5
+ from dataclasses import dataclass
6
+ from datetime import UTC, datetime
7
+ from pathlib import Path
8
+
9
+ from env.action import ReviewAction
10
+
11
+
12
+ @dataclass(frozen=True)
13
+ class TransitionRecord:
14
+ source_root: str
15
+ episode_id: str
16
+ module_id: str
17
+ step_number: int
18
+ action_type: str
19
+ reward: float
20
+ done: bool
21
+ task_id: str
22
+ observation_summary: str
23
+ action_payload: dict[str, object]
24
+
25
+
26
+ class LoRATrajectoryLogger:
27
+ """Append RL transitions to JSONL for optional LoRA fine-tuning workflows."""
28
+
29
+ def __init__(self) -> None:
30
+ self.enabled = os.getenv("GRAPHREVIEW_LORA_ENABLED", "false").strip().lower() == "true"
31
+ output_path = os.getenv("GRAPHREVIEW_LORA_DATA_PATH", "outputs/lora/transitions.jsonl")
32
+ self.path = Path(output_path)
33
+
34
+ def log(
35
+ self,
36
+ *,
37
+ source_root: str,
38
+ episode_id: str,
39
+ module_id: str,
40
+ step_number: int,
41
+ action: ReviewAction,
42
+ reward: float,
43
+ done: bool,
44
+ task_id: str,
45
+ observation_summary: str,
46
+ ) -> None:
47
+ if not self.enabled:
48
+ return
49
+
50
+ record = TransitionRecord(
51
+ source_root=source_root,
52
+ episode_id=episode_id,
53
+ module_id=module_id,
54
+ step_number=step_number,
55
+ action_type=action.action_type.value,
56
+ reward=reward,
57
+ done=done,
58
+ task_id=task_id,
59
+ observation_summary=observation_summary,
60
+ action_payload=action.model_dump(mode="json", exclude_none=True),
61
+ )
62
+
63
+ self.path.parent.mkdir(parents=True, exist_ok=True)
64
+ payload = {
65
+ **record.__dict__,
66
+ "created_at": datetime.now(UTC).isoformat(),
67
+ }
68
+ with self.path.open("a", encoding="utf-8") as handle:
69
+ handle.write(json.dumps(payload, sort_keys=True) + "\n")
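Since each transition is appended as one JSON object per line, the log can be inspected with a few lines of Python. The sketch below is illustrative only: `episode_returns` is a hypothetical helper (it is not part of the commit), and the two demo records are hand-written values, not real output. It relies only on the `episode_id` and `reward` keys that `TransitionRecord` defines.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def episode_returns(path: Path) -> dict[str, float]:
    """Sum the logged rewards per episode_id from a transitions JSONL file."""
    totals: dict[str, float] = defaultdict(float)
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines between records
            record = json.loads(line)
            totals[record["episode_id"]] += float(record["reward"])
    return dict(totals)


# Demo with two hand-written records (illustrative values only).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "transitions.jsonl"
    rows = [
        {"episode_id": "ep-1", "reward": 0.25, "done": False},
        {"episode_id": "ep-1", "reward": 0.5, "done": True},
    ]
    sample.write_text("\n".join(json.dumps(row) for row in rows) + "\n", encoding="utf-8")
    demo_totals = episode_returns(sample)
```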
code-review-env/llm/lora_finetune.py ADDED
@@ -0,0 +1,59 @@
1
+ from __future__ import annotations
2
+
3
+ import argparse
4
+ import json
5
+ from pathlib import Path
6
+
7
+
8
+ def export_sft_dataset(transitions_path: Path, output_path: Path) -> int:
9
+ """Convert transition logs into a simple instruction-tuning JSONL dataset."""
10
+ if not transitions_path.exists():
11
+ raise FileNotFoundError(f"Transitions file not found: {transitions_path}")
12
+
13
+ rows = transitions_path.read_text(encoding="utf-8").splitlines()
14
+ output_path.parent.mkdir(parents=True, exist_ok=True)
15
+
16
+ count = 0
17
+ with output_path.open("w", encoding="utf-8") as out:
18
+ for row in rows:
19
+ payload = json.loads(row)
20
+ sample = {
21
+ "instruction": (
22
+ "Review this module using graph-aware reasoning and choose the best next action."
23
+ ),
24
+ "input": payload.get("observation_summary", ""),
25
+ "output": json.dumps(payload.get("action_payload", {}), sort_keys=True),
26
+ "meta": {
27
+ "reward": payload.get("reward", 0.0),
28
+ "task_id": payload.get("task_id", ""),
29
+ "module_id": payload.get("module_id", ""),
30
+ },
31
+ }
32
+ out.write(json.dumps(sample, sort_keys=True) + "\n")
33
+ count += 1
34
+ return count
35
+
36
+
37
+ def _build_parser() -> argparse.ArgumentParser:
38
+ parser = argparse.ArgumentParser(description="Prepare LoRA fine-tuning dataset from GraphReview transitions")
39
+ parser.add_argument(
40
+ "--transitions",
41
+ default="outputs/lora/transitions.jsonl",
42
+ help="Input transition JSONL produced by runtime",
43
+ )
44
+ parser.add_argument(
45
+ "--output",
46
+ default="outputs/lora/sft_dataset.jsonl",
47
+ help="Output SFT JSONL dataset path",
48
+ )
49
+ return parser
50
+
51
+
52
+ def main() -> None:
53
+ args = _build_parser().parse_args()
54
+ count = export_sft_dataset(Path(args.transitions), Path(args.output))
55
+ print(json.dumps({"ok": True, "samples": count, "output": args.output}, indent=2))
56
+
57
+
58
+ if __name__ == "__main__":
59
+ main()
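The exported samples carry the step reward under `meta.reward`, so a downstream step could keep only higher-reward actions before fine-tuning. The following is a hedged sketch of such a filter: `filter_by_reward` and the `min_reward` threshold are assumptions for illustration and do not exist in the commit, and the demo rows are made-up values.

```python
import json
import tempfile
from pathlib import Path


def filter_by_reward(dataset_path: Path, min_reward: float) -> list[dict[str, object]]:
    """Return the SFT samples whose meta.reward is at least min_reward."""
    kept: list[dict[str, object]] = []
    for line in dataset_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines instead of failing in json.loads
        sample = json.loads(line)
        meta = sample.get("meta", {})
        if float(meta.get("reward", 0.0)) >= min_reward:
            kept.append(sample)
    return kept


# Demo with two made-up samples in the shape written by export_sft_dataset.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "sft_dataset.jsonl"
    samples = [
        {"instruction": "i", "input": "obs-a", "output": "{}", "meta": {"reward": 0.9}},
        {"instruction": "i", "input": "obs-b", "output": "{}", "meta": {"reward": 0.1}},
    ]
    dataset.write_text("\n".join(json.dumps(s) for s in samples) + "\n", encoding="utf-8")
    demo_kept = filter_by_reward(dataset, min_reward=0.5)
```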
code-review-env/parser/ast_parser.py CHANGED
@@ -9,6 +9,7 @@ from pydantic import BaseModel
9
 
10
  from db.schema import EdgeType
11
  from db.store import Store
 
12
  from parser.linter import run_linters
13
  from parser.summarizer import summarize_module
14
 
@@ -205,6 +206,7 @@ def parse_directory(target_dir: Path, db_path: str | None = None) -> Store:
205
  py_files = _iter_python_files(target_dir)
206
  parsed_modules = [parse_python_file(py_file, target_dir) for py_file in py_files]
207
  known_module_ids = {parsed.module_id for parsed in parsed_modules}
 
208
 
209
  for py_file, parsed in zip(py_files, parsed_modules):
210
  issues = run_linters(py_file)
@@ -223,12 +225,22 @@ def parse_directory(target_dir: Path, db_path: str | None = None) -> Store:
223
  )
224
  for imported in parsed.imports:
225
  if imported.target_module and imported.target_module in known_module_ids:
 
 
 
 
 
 
 
 
 
226
  store.upsert_edge(
227
  source_module_id=parsed.module_id,
228
  target_module_id=imported.target_module,
229
  edge_type=imported.edge_type,
230
  import_line=imported.import_line,
231
  weight=imported.weight,
 
232
  )
233
 
234
  return store
 
9
 
10
  from db.schema import EdgeType
11
  from db.store import Store
12
+ from llm.edge_summarizer import EdgeSummarizer, EdgeSummaryInput
13
  from parser.linter import run_linters
14
  from parser.summarizer import summarize_module
15
 
 
206
  py_files = _iter_python_files(target_dir)
207
  parsed_modules = [parse_python_file(py_file, target_dir) for py_file in py_files]
208
  known_module_ids = {parsed.module_id for parsed in parsed_modules}
209
+ edge_summarizer = EdgeSummarizer()
210
 
211
  for py_file, parsed in zip(py_files, parsed_modules):
212
  issues = run_linters(py_file)
 
225
  )
226
  for imported in parsed.imports:
227
  if imported.target_module and imported.target_module in known_module_ids:
228
+ connection_summary = edge_summarizer.summarize(
229
+ EdgeSummaryInput(
230
+ source_module_id=parsed.module_id,
231
+ target_module_id=imported.target_module,
232
+ edge_type=imported.edge_type.value,
233
+ import_line=imported.import_line,
234
+ scope=imported.scope,
235
+ )
236
+ )
237
  store.upsert_edge(
238
  source_module_id=parsed.module_id,
239
  target_module_id=imported.target_module,
240
  edge_type=imported.edge_type,
241
  import_line=imported.import_line,
242
  weight=imported.weight,
243
+ connection_summary=connection_summary,
244
  )
245
 
246
  return store
code-review-env/parser/graph_builder.py CHANGED
@@ -15,6 +15,7 @@ class EdgeRecord(BaseModel):
15
  import_line: str
16
  scope: str
17
  weight: float
 
18
 
19
 
20
  def _build_intra_file_edges(parsed: ParsedModule, available_chunk_ids: set[str]) -> list[EdgeRecord]:
 
15
  import_line: str
16
  scope: str
17
  weight: float
18
+ connection_summary: str = ""
19
 
20
 
21
  def _build_intra_file_edges(parsed: ParsedModule, available_chunk_ids: set[str]) -> list[EdgeRecord]:
code-review-env/pyproject.toml CHANGED
@@ -16,6 +16,7 @@ dependencies = [
16
 
17
  [project.scripts]
18
  server = "server.app:main"
 
19
 
20
  [tool.pytest.ini_options]
21
  pythonpath = ["."]
 
16
 
17
  [project.scripts]
18
  server = "server.app:main"
19
+ graphreview = "run_project:main"
20
 
21
  [tool.pytest.ini_options]
22
  pythonpath = ["."]
code-review-env/run_project.py ADDED
@@ -0,0 +1,82 @@
1
+ from __future__ import annotations
2
+
3
+ import argparse
4
+ import json
5
+ from pathlib import Path
6
+
7
+ from graders.review_runner import generate_reports, run_review
8
+
9
+
10
+ def _build_parser() -> argparse.ArgumentParser:
11
+ parser = argparse.ArgumentParser(
12
+ description="Unified GraphReview runner: seed + run easy/medium/hard + generate artifacts"
13
+ )
14
+ parser.add_argument("target", help="Target Python project folder")
15
+ parser.add_argument("--db-path", default=None, help="Optional SQLite DB path")
16
+ parser.add_argument("--force-seed", action="store_true", help="Force graph reseed")
17
+ parser.add_argument("--skip-seed", action="store_true", help="Skip seeding and reuse DB")
18
+ parser.add_argument("--modules", nargs="*", default=None, help="Optional module focus list")
19
+ parser.add_argument("--filter-hops", type=int, default=1, help="Neighbor expansion hops for --modules")
20
+ parser.add_argument("--output-dir", default="outputs", help="Artifacts output directory")
21
+ parser.add_argument("--report-prefix", default="graphreview_full", help="Artifact prefix")
22
+ parser.add_argument("--no-progress", action="store_true", help="Disable progress logs")
23
+ parser.add_argument(
24
+ "--levels",
25
+ nargs="*",
26
+ choices=["easy", "medium", "hard"],
27
+ default=["easy", "medium", "hard"],
28
+ help="Review levels to run",
29
+ )
30
+ return parser
31
+
32
+
33
+ def main() -> None:
34
+ args = _build_parser().parse_args()
35
+ target = Path(args.target).resolve()
36
+
37
+ summary: dict[str, object] = {
38
+ "target": str(target),
39
+ "levels": {},
40
+ }
41
+
42
+ for idx, level in enumerate(args.levels):
43
+ scores = run_review(
44
+ target=target,
45
+ db_path=args.db_path,
46
+ grader_level=level,
47
+ force_seed=args.force_seed if idx == 0 else False,
48
+ skip_seed=args.skip_seed if idx == 0 else True,
49
+ show_progress=not args.no_progress,
50
+ module_filter=args.modules,
51
+ filter_hops=args.filter_hops,
52
+ )
53
+ total = float(sum(scores.values()))
54
+ summary["levels"][level] = {
55
+ "modules": len(scores),
56
+ "raw_total": total,
57
+ "avg_raw_per_module": (total / len(scores)) if scores else 0.0,
58
+ }
59
+
60
+ artifacts = generate_reports(
61
+ target=target,
62
+ db_path=args.db_path,
63
+ output_dir=args.output_dir,
64
+ module_filter=args.modules,
65
+ filter_hops=args.filter_hops,
66
+ report_prefix=args.report_prefix,
67
+ )
68
+
69
+ summary["artifacts"] = {
70
+ "markdown": artifacts.markdown_path,
71
+ "json": artifacts.json_path,
72
+ "html": artifacts.html_path,
73
+ "confidence_score": artifacts.confidence_score,
74
+ "module_count": artifacts.module_count,
75
+ "edge_count": artifacts.edge_count,
76
+ }
77
+
78
+ print(json.dumps(summary, indent=2))
79
+
80
+
81
+ if __name__ == "__main__":
82
+ main()
code-review-env/server/app.py CHANGED
@@ -14,12 +14,15 @@ from sqlmodel import Session, select
14
 
15
  from db.schema import ModuleEdge, ModuleNode
16
  from db.store import Store
 
17
  from env.action import ActionType, ReviewAction
18
  from env.environment import CodeReviewEnv, StepResult
19
  from env.observation import CodeObservation
20
  from env.state import GraphState
21
  from visualizer.report_generator import GeneratedArtifacts, generate_phase5_outputs
22
 
 
 
23
 
24
  class ResetRequest(BaseModel):
25
  model_config = ConfigDict(strict=True, extra="forbid")
@@ -297,6 +300,7 @@ def ui_result(report_path: str = Query(..., min_length=1)) -> ResultDetail:
297
  "edge_type",
298
  "import_line",
299
  "weight",
 
300
  ],
301
  },
302
  )
 
14
 
15
  from db.schema import ModuleEdge, ModuleNode
16
  from db.store import Store
17
+ from env.env_loader import load_env_file
18
  from env.action import ActionType, ReviewAction
19
  from env.environment import CodeReviewEnv, StepResult
20
  from env.observation import CodeObservation
21
  from env.state import GraphState
22
  from visualizer.report_generator import GeneratedArtifacts, generate_phase5_outputs
23
 
24
+ load_env_file()
25
+
26
 
27
  class ResetRequest(BaseModel):
28
  model_config = ConfigDict(strict=True, extra="forbid")
 
300
  "edge_type",
301
  "import_line",
302
  "weight",
303
+ "connection_summary",
304
  ],
305
  },
306
  )
code-review-env/tests/test_graders.py CHANGED
@@ -93,7 +93,9 @@ def test_hard_grader_dependency_attribution(tmp_path: Path) -> None:
93
  graph = GraphManager(source_root=str(project), db_path=str(db_path))
94
  grader = HardGrader(store, graph)
95
 
96
- grader._judge_dependency_reasoning = lambda module_id, action: (1.0, "ok", "hash") # type: ignore[method-assign]
 
 
97
 
98
  good = grader.grade_episode(
99
  module_id="a",
 
93
  graph = GraphManager(source_root=str(project), db_path=str(db_path))
94
  grader = HardGrader(store, graph)
95
 
96
+ grader._judge_with_model = ( # type: ignore[method-assign]
97
+ lambda module_id, action, model, provider, base_url, api_key, timeout, system_prompt, cache_scope: (1.0, "ok")
98
+ )
99
 
100
  good = grader.grade_episode(
101
  module_id="a",
code-review-env/visualizer/pyvis_renderer.py CHANGED
@@ -54,12 +54,19 @@ def render_graph_html(
54
 
55
  for edge in edges:
56
  edge_type = str(edge.get("edge_type", "explicit_import"))
 
 
 
 
 
 
57
  net.add_edge(
58
  source=str(edge["source"]),
59
  to=str(edge["target"]),
60
- title=str(edge.get("title", edge_type)),
61
  color=EDGE_COLORS.get(edge_type, EDGE_COLORS["explicit_import"]),
62
- value=max(float(edge.get("weight", 1.0)), 0.2),
 
63
  arrows="to",
64
  )
65
 
@@ -85,7 +92,7 @@ def render_graph_html(
85
  },
86
  "edges": {
87
  "smooth": {"enabled": False},
88
- "arrows": {"to": {"enabled": True, "scaleFactor": 0.5}},
89
  },
90
  }
91
  )
 
54
 
55
  for edge in edges:
56
  edge_type = str(edge.get("edge_type", "explicit_import"))
57
+ edge_title = str(edge.get("title", edge_type))
58
+ formatted_title = (
59
+ "<div style='max-width:360px'>"
60
+ f"<b>{edge_type}</b><br>{edge_title}"
61
+ "</div>"
62
+ )
63
  net.add_edge(
64
  source=str(edge["source"]),
65
  to=str(edge["target"]),
66
+ title=formatted_title,
67
  color=EDGE_COLORS.get(edge_type, EDGE_COLORS["explicit_import"]),
68
+ value=1.0,
69
+ width=max(1.0, min(float(edge.get("weight", 1.0)) * 1.3, 2.2)),
70
  arrows="to",
71
  )
72
 
 
92
  },
93
  "edges": {
94
  "smooth": {"enabled": False},
95
+ "arrows": {"to": {"enabled": True, "scaleFactor": 0.35}},
96
  },
97
  }
98
  )
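The renderer change above builds an HTML tooltip from `edge_type` and the edge title, and clamps the drawn width to the range 1.0 to 2.2. As a minimal sketch of the same two computations in isolation: `format_edge_title` and `edge_width` are hypothetical helper names, and the `html.escape` call is an addition that the diff does not contain. It is shown here as one possible hardening, since LLM-generated summaries may contain characters such as `<`.

```python
from html import escape


def format_edge_title(edge_type: str, edge_title: str) -> str:
    """Build the tooltip markup, escaping text so it cannot inject HTML."""
    return (
        "<div style='max-width:360px'>"
        f"<b>{escape(edge_type)}</b><br>{escape(edge_title)}"
        "</div>"
    )


def edge_width(weight: float) -> float:
    """Scale the edge weight by 1.3 and clamp the result to [1.0, 2.2]."""
    return max(1.0, min(float(weight) * 1.3, 2.2))


demo_title = format_edge_title("explicit_import", "uses List<str> from typing")
demo_widths = [edge_width(0.1), edge_width(1.0), edge_width(5.0)]
```

Clamping the width keeps heavily weighted edges from dominating the graph while still distinguishing them from weak ones.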
code-review-env/visualizer/report_generator.py CHANGED
@@ -347,6 +347,7 @@ def _build_json_payload(
347
  "edge_type": edge.edge_type.value,
348
  "weight": edge.weight,
349
  "import_line": edge.import_line,
 
350
  }
351
  for edge in sorted(edges, key=lambda item: (item.source_module_id, item.target_module_id, item.import_line))
352
  ]
@@ -562,7 +563,9 @@ def generate_phase5_outputs(
562
  "target": edge.target_module_id,
563
  "edge_type": edge.edge_type.value,
564
  "weight": edge.weight,
565
- "title": f"{edge.edge_type.value}: {edge.import_line}",
 
 
566
  }
567
  )
568
 
 
347
  "edge_type": edge.edge_type.value,
348
  "weight": edge.weight,
349
  "import_line": edge.import_line,
350
+ "connection_summary": edge.connection_summary,
351
  }
352
  for edge in sorted(edges, key=lambda item: (item.source_module_id, item.target_module_id, item.import_line))
353
  ]
 
563
  "target": edge.target_module_id,
564
  "edge_type": edge.edge_type.value,
565
  "weight": edge.weight,
566
+ "title": (
567
+ f"{edge.edge_type.value}: {edge.connection_summary or edge.import_line}"
568
+ ),
569
  }
570
  )
571
 
plans/phase-06-adaptive-judge-edge-summary-lora-plan.md ADDED
@@ -0,0 +1,63 @@
1
+ # Phase 06 Plan - Adaptive Judging, Edge Intelligence, LoRA Hooks, and Config Hygiene
2
+
3
+ ## Objective
4
+ Upgrade the current GraphReview environment for Round 1 reliability with:
5
+ - Adaptive hard-grader fusion to reduce catastrophic judge mistakes.
6
+ - Per-edge connection summaries generated by LLM (with deterministic fallback).
7
+ - LoRA learning hooks so the system can improve across projects.
8
+ - Centralized `.env`-driven configuration for runtime, models, server, and reporting.
9
+
10
+ ## Design Principles
11
+ - Occam's Razor: add only minimal mechanisms that directly improve reliability and score.
12
+ - Single Responsibility: parser builds graph; graders score; learning hooks collect trajectories.
13
+ - Determinism First: easy/medium deterministic; hard judge constrained and auditable.
14
+ - Fail-Safe Defaults: LLM optional, deterministic fallback mandatory.
15
+ - Open/Closed: add extensible configs and adapters without rewriting core runtime.
16
+
17
+ ## Scope
18
+ 1. Adaptive hard-grader fusion
19
+ - Add deterministic gate + primary judge + verifier judge fusion.
20
+ - Dynamic weighting with disagreement-aware reweighting.
21
+ - Persist judge metadata and fusion breakdown in annotation payload.
22
+
23
+ 2. Edge connection summaries
24
+ - Extend `ModuleEdge` schema with `connection_summary`.
25
+ - Build `llm/edge_summarizer.py` using OpenAI-compatible API.
26
+ - Generate edge summary for each edge during seed.
27
+ - Fallback summary if LLM unavailable.
28
+
29
+ 3. LoRA learning hooks
30
+ - Add transition logging to JSONL during runtime (`state`, `action`, `reward`, `done`).
31
+ - Add `llm/lora_finetune.py` skeleton for dataset export + optional train path.
32
+ - Keep training optional via env vars and feature flags.
33
+
34
+ 4. `.env` and config hygiene
35
+ - Add `.env` file with all tunables: host/port, DB, models, judge settings, edge summarizer, LoRA toggles.
36
+ - Add lightweight env loader utility and invoke early in runtime/server/migrations.
37
+
38
+ ## Implementation Steps
39
+ 1. Add env loader and wire it to startup-sensitive modules.
40
+ 2. Add `connection_summary` field + migration + store methods.
41
+ 3. Add edge summarizer module and integrate into seed pipeline.
42
+ 4. Add adaptive hard grader fusion and metadata persistence.
43
+ 5. Add LoRA transition logger + finetune utility script.
44
+ 6. Update visualization and report generation to display connection summaries.
45
+ 7. Update README with new env variables and usage.
46
+
47
+ ## Verification
48
+ - Seeding produces non-empty `connection_summary` for all stored edges.
49
+ - Hard grader returns stable fused score and persists fusion metadata.
50
+ - If primary judge and verifier disagree strongly, final score is reduced safely.
51
+ - Runtime emits LoRA trajectory JSONL when enabled.
52
+ - Server reads `.env` and applies host/port/model settings.
53
+
54
+ ## Risks and Mitigations
55
+ - LLM edge summarization latency: use caching + timeout + deterministic fallback.
56
+ - Judge model outages: keep deterministic gate and verifier fallback behavior.
57
+ - LoRA dependency burden: keep optional and fail gracefully if packages absent.
58
+
59
+ ## Definition of Done
60
+ - Plan implemented in code with passing runtime smoke checks.
61
+ - New config values available through `.env` and documented.
62
+ - Graph UI and reports now show concise per-edge connection summaries.
63
+ - Hard grader is safer against single-model catastrophic errors.
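The plan describes the fusion only in words, so the following is one possible reading of it rather than the committed grader. In this sketch the deterministic gate multiplies the judge score, agreeing judges are blended with a fixed weight, and strongly disagreeing judges fall back to the lower of the two scores. The function name `fuse_scores`, the 0.6 weight, and the 0.4 disagreement threshold are all assumptions chosen for illustration.

```python
def _clamp(value: float) -> float:
    """Keep a score inside the 0.0 to 1.0 range."""
    return max(0.0, min(1.0, float(value)))


def fuse_scores(
    gate_score: float,
    primary_score: float,
    verifier_score: float,
    *,
    primary_weight: float = 0.6,  # assumed weight, not taken from the commit
    disagreement_threshold: float = 0.4,  # assumed threshold, not taken from the commit
) -> float:
    """Fuse a deterministic gate with two judge scores, penalizing disagreement."""
    gate = _clamp(gate_score)
    primary = _clamp(primary_score)
    verifier = _clamp(verifier_score)

    if abs(primary - verifier) >= disagreement_threshold:
        # Strong disagreement: trust neither judge fully, take the lower score.
        judge = min(primary, verifier)
    else:
        # Agreement: weighted blend of the two judges.
        judge = primary_weight * primary + (1.0 - primary_weight) * verifier

    # The gate can only reduce the score, never raise it.
    return gate * judge
```

With these assumed values, a single judge returning 1.0 cannot lift the final score when the verifier returns 0.2, which is the kind of single-model failure the plan aims to contain.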
temp.md ADDED
@@ -0,0 +1,1333 @@
1
+ You are a project planning expert. I am taking part in a pre-hackathon competition and need to build an RL environment. I am building this project for that submission.
2
+
3
+ # Builder Prompt — GraphReview RL Environment
4
+
5
+ You are an expert Python engineering planner. You plan; you do not build. You may add more tools to catch additional security vulnerabilities in the modules before they are sent out. You may also turn on thinking for the Gemma 4 model if it works better; make sure it runs on all modules and finds new information rather than repeating findings from previous models. The previous findings should still be provided as context, with an instruction to find more about those errors and any new errors. The target is a production-quality RL environment for a competitive hackathon (OpenEnv Round 1). You have one job: build the GraphReview environment correctly, phase by phase, without breaking prior work.
6
+
7
+ ---
8
+
9
+ ## What You Are Building
10
+
11
+ An OpenEnv-compliant RL environment where an LLM agent reviews Python code with full dependency graph awareness. The environment parses a Python codebase into a persistent SQLite-backed dependency graph, pre-computes ground truth linter flags, and exposes a step()/reset()/state() API for an agent to interact with.
12
+
13
+ This is online RL — no training dataset is needed. The ground truth (pylint/bandit/pyflakes results) is computed once at seed time and stored in SQLite. The agent explores the environment and receives rewards compared against that ground truth.
14
+
15
+ The full phase plan and architecture are provided below. Read the entire plan before writing a single line of code.
16
+
17
+ ---
18
+
19
+ ## Your Operating Rules
20
+
21
+ 1. **Before building each phase, read the full plan for that phase.** Do not start coding until you understand what the phase produces and what its success criteria are.
22
+
23
+ 2. **Ask me questions before starting if any of the following are unclear:**
24
+ - A design decision that affects DB schema or file structure
25
+ - Anything that would be hard to change later (interfaces, Pydantic models, DB tables)
26
+ - Ambiguity in how two components interact
27
+ Do NOT ask about low-level implementation details — choose the best approach yourself.
28
+
29
+ 3. **Use context7 MCP to look up documentation** for: openenv-core, SQLAlchemy, NetworkX, Pyvis, astroid, pylint API, FastAPI, Pydantic v2. Do not rely on memory for library APIs — always verify.
30
+
31
+ 4. **One phase at a time.** Complete a phase fully before moving to the next. Each phase has explicit success criteria — verify them before declaring a phase done.
32
+
33
+ 5. **Never break prior phases.** If a later phase requires changing an earlier interface, explicitly flag it, explain why, and get confirmation before making the change.
34
+
35
+ 6. **DB is the source of truth.** All state lives in SQLite. Nothing important lives only in memory. reset() clears only task-run annotations — never re-parses the codebase.
36
+
37
+ 7. **Token budget is a hard constraint.** No observation may exceed 2000 tokens. Enforce this in token_budget.py — do not leave it as a soft guideline.
38
+
39
+ 8. **Graders must be deterministic.** Easy and medium graders: zero LLM calls, same input always produces same output. Hard grader: temperature=0, document prompt hash. Test this explicitly.
40
+
41
+ 9. **inference.py log format is mandatory.** [START], [STEP], [END] format must be exact. Any deviation causes evaluation failure. Treat this as a contract.
42
+
43
+ 10. **Write clean, typed Python.** All functions typed. All Pydantic models complete. No `Any` types unless unavoidable with explanation.
44
+
45
+ ---
46
+
47
+ ## Phase Plan
48
+
49
+ [INSERT FULL PHASE PLAN HERE — paste the contents of the phase plan artifact]
50
+
51
+ ---
52
+
53
+ ## Sample Project Specification
54
+
55
+ The sample_project/ directory must contain exactly these files with these injected bugs:
56
+
57
+ ```
58
+ auth.py — validate_token() can return None (not handled)
59
+ checkout.py — calls auth.validate_token(), doesn't check for None
60
+ cart.py — style violations only (PEP8)
61
+ config.py — missing required key in get_config() (root cause of cascade)
62
+ database.py — SQL query built with string concatenation (SQL injection)
63
+ utils.py — unused imports, dead code
64
+ models.py — clean file (no issues, tests APPROVE path)
65
+ payments.py — depends on checkout.py, inherits None risk
66
+ api.py — depends on auth.py and checkout.py
67
+ main.py — entry point, light glue code
68
+ ```
69
+
70
+ Task mapping:
71
+ - easy_task: cart.py (style only)
72
+ - medium_task: checkout.py + auth.py (null reference)
73
+ - hard_task: config.py → auth.py → checkout.py (cascade)
74
+
75
+ ---
76
+
77
+ ## Tech Stack
78
+
79
+ - Python 3.11
80
+ - SQLite via SQLAlchemy ORM
81
+ - NetworkX + astroid + Python ast
82
+ - pylint + bandit + pyflakes
83
+ - Pyvis for visualization
84
+ - Pydantic v2
85
+ - FastAPI
86
+ - OpenAI client (inference.py + hard grader judge)
87
+ - openenv-core
88
+ - context7 MCP for all library lookups
89
+
90
+ ---
91
+
92
+ ## Start Instructions
93
+
94
+ Begin with Phase 1. Before writing any code:
95
+ 1. Use context7 MCP to look up: openenv-core spec, SQLAlchemy ORM setup, astroid API
96
+ 2. Ask me any design questions that affect DB schema or file structure
97
+ 3. Confirm the sample_project file list with me if you want to adjust it
98
+ 4. Then build Phase 1 completely and verify all success criteria before stopping
99
+
100
+ These are the requirements
101
+
102
+ Registration
103
+
104
+ 14th March - 3rd April
105
+
106
+ Declaration
107
+
108
+ Before R1
109
+
110
+ Prepare
111
+
112
+ Now - 25th March
113
+
114
+ Round 1
115
+
116
+ 25th March - 8th April
117
+
118
+ Results
119
+
120
+ 10th April
121
+
122
+ Finale
123
+
124
+ 25th-26th April
125
+
126
+ Welcome Shreyas S Joshi!
127
+
128
+ shreyasjoshi2511@gmail.com
129
+ Copy
130
+ Join the Discord Community
131
+
132
+ All announcements, mentor access, and team matching happens here.
133
+
134
+
135
+ Join Discord
136
+ QUICK TOGGLE
137
+
138
+ Team form Submission
139
+
140
+ Preparatory Course
141
+
142
+ Start Assessment
143
+
144
+ FAQs
145
+
146
+ step 1
147
+
148
+ How will you compete?
149
+
150
+ Choose solo or team before you can start the assessment
151
+
152
+ Step 1 Complete
153
+ Team: Shreyas S Joshi's team
154
+
155
+ 👤
156
+ Athmabhiram S J
157
+ athmabhiram@gmail.com
158
+ Accepted
159
+ 👤
160
+ Shreyas S Joshi
161
+ shreyasjoshi2511@gmail.com
162
+ Team Lead
163
+ 🔒
164
+ Team is permanently locked. Changes are not allowed after confirmation.
165
+
166
+ OpenEnv Round 1 Bootcamp
183
+
184
+ OpenEnv Round 1 Bootcamp: Build Your First RL Environment
185
+
186
+ Live walkthrough to submit a strong Round 1 entry
187
+
188
+ timing
189
+
190
+ 8:00 PM Onwards
191
+
192
+ Wednesday, 1st April
193
+
194
+ Host
195
+
196
+
197
+ Ben Burtenshaw
198
+
199
+ Community Education in AI at Hugging Face
200
+
201
+
202
+ Pulkit Aneja
203
+
204
+ Scaler Instructor
205
+
206
+ Watch Recording
207
+
208
+ PROBLEM STATEMENT
209
+
210
+ Round 1 — Problem Statement
211
+
212
+ The Task
213
+
214
+ Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step() / reset() / state() API.
215
+
216
+ Key Requirements at a Glance
217
+
218
+ Must simulate a real-world task (not games or toys)
219
+
220
+ Implement full OpenEnv spec: typed models, step()/reset()/state(), openenv.yaml
221
+
222
+ Minimum 3 tasks with agent graders (easy → medium → hard, scores/reward 0.0–1.0)
223
+
224
+ Meaningful reward function with partial progress signals
225
+
226
+ Baseline inference script with reproducible scores
227
+
228
+ Deploy to Hugging Face Spaces + working Dockerfile
229
+
230
+ README with environment description, action/observation spaces, setup instructions
231
+
232
+ Functional Requirements
233
+
234
+ Real-world task simulation
235
+
236
+ The environment must simulate a task humans actually do. Not games, not toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.
237
+
238
+ OpenEnv spec compliance
239
+
240
+ Implement the full OpenEnv interface: typed Observation, Action, and Reward Pydantic models. step(action) → returns observation, reward, done, info. reset() → returns initial observation. state() → returns current state. openenv.yaml with metadata. Tested via openenv validate.
241
+
242
+ Minimum 3 tasks with agent graders
243
+
244
+ Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
245
+
246
+ Meaningful reward function
247
+
248
+ Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).
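A reward with these properties can be sketched as a per-step function that pays out for each newly found issue and subtracts small penalties for false positives and repeated actions. This is a minimal illustration under stated assumptions, not the environment's actual reward: the name `step_reward`, the penalty sizes 0.1 and 0.2, and the clamp to the range -1.0 to 1.0 are all hypothetical, and a final grader score would still be reported separately in 0.0 to 1.0.

```python
def step_reward(
    newly_found: int,
    total_issues: int,
    false_positives: int,
    repeated_action: bool,
) -> float:
    """Reward partial progress and penalize noise and loops on a single step."""
    # Partial progress: fraction of all known issues discovered on this step.
    progress = newly_found / total_issues if total_issues else 0.0
    # Penalties (assumed sizes): wrong flags and repeating the previous action.
    penalty = 0.1 * false_positives + (0.2 if repeated_action else 0.0)
    return max(-1.0, min(1.0, progress - penalty))
```

Because every step returns a value, the agent receives signal throughout the episode rather than only at the end.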
249
+
250
+ Baseline inference script
251
+
252
+ Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (OPENAI_API_KEY). Produces a reproducible baseline score on all 3 tasks.
253
+
254
+ Detailed Requirements
255
+
256
+ Non-Functional Requirements
257
+
258
+ Deploys to a Hugging Face Space
259
+
260
+ Environment must run as a containerized HF Space tagged with openenv.
261
+
262
+ Containerized execution
263
+
264
+ Must include a working Dockerfile. The environment should start cleanly with docker build + docker run.
265
+
266
+ Documentation
267
+
268
+ README must include: environment description and motivation, action and observation space definitions, task descriptions with expected difficulty, setup and usage instructions, baseline scores.
269
+
270
+ Parameter
271
+
272
+ Weight
273
+
274
+ Description
275
+
276
+ Real-world utility
277
+
278
+ 30%
279
+
280
+ Does the environment model a genuine task? Would someone actually use this to train or evaluate agents?
281
+
282
+ Task & grader quality
283
+
284
+ 25%
285
+
286
+ Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression?
287
+
288
+ Environment design
289
+
290
+ 20%
291
+
292
+ Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries.
293
+
294
+ Code quality & spec compliance
295
+
296
+ 15%
297
+
298
+ Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works.
299
+
300
+ Creativity & novelty
301
+
302
+ 10%
303
+
304
+ Novel problem domain, interesting mechanics, clever reward design, original approach.
305
+
306
+ Scoring Breakdown
307
+
308
+ Real-world utility (30%)
309
+
310
+ • 0–5: Toy/artificial problem with no practical application
311
+
312
+ • 6–15: Valid domain but shallow modeling of the real task
313
+
314
+ • 16–25: Good domain modeling, would be useful for agent evaluation
315
+
316
+ • 26–30: Excellent — fills a real gap, immediate value for the RL/agent community
317
+
318
+ Task & grader quality (25%)
319
+
320
+ • 3+ tasks with difficulty range?
321
+
322
+ • Graders produce scores between 0.0–1.0?
323
+
324
+ • Graders deterministic and reproducible?
325
+
326
+ • Hard task genuinely challenges frontier models?
327
+
328
+ Environment design (20%)
329
+
330
+ • reset() produces clean state?
331
+
332
+ • Action/observation types well-designed and documented?
333
+
334
+ • Reward function provides useful varying signal (not just sparse)?
335
+
336
+ • Episode boundaries sensible?
337
+
338
+ Code quality & spec compliance (15%)
339
+
340
+ • openenv validate passes?
341
+
342
+ • docker build && docker run works?
343
+
344
+ • HF Space deploys and responds?
345
+
346
+ • Baseline script runs and reproduces scores?
347
+
348
+ Creativity & novelty (10%)
349
+
350
+ • Domain we haven’t seen in OpenEnv before?
351
+
352
+ • Reward design has interesting properties?
353
+
354
+ • Clever mechanics that make the environment engaging?
355
+
356
+ Evaluation Criteria
357
+
358
+ Phase 1: Automated Validation
359
+
360
+ Pass/fail gate — HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.
361
+
362
+ Phase 2: Agentic Evaluation
363
+
364
+ Scored — baseline agent re-run, standard Open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.
365
+
366
+ Phase 3: Human Review
367
+
368
+ Top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.
369
+
370
+ Disqualification Criteria
371
+
372
+ Environment does not deploy or respond
373
+
374
+ Plagiarized or trivially modified existing environments
375
+
376
+ Graders that always return the same score
377
+
378
+ No baseline inference script
379
+
380
+ How Judging works
381
+
382
+ Pre-Submission Checklist — all must pass or you're disqualified
383
+
384
+ HF Space deploys
385
+
386
+ Automated ping to the Space URL — must return 200 and respond to reset()
387
+
388
+ OpenEnv spec compliance
389
+
390
+ Validate openenv.yaml, typed models, step()/reset()/state() endpoints
391
+
392
+ Dockerfile builds
393
+
394
+ Automated docker build on the submitted repo
395
+
396
+ Baseline reproduces
397
+
398
+ Run the submitted inference script — must complete without error and produce scores
399
+
400
+ 3+ tasks with graders
401
+
402
+ Enumerate tasks, run each grader, verify scores/reward in 0.0–1.0 range
403
+
404
+ Mandatory Additional Instructions
405
+
406
+ Before submitting, ensure the following variables are defined in your environment configuration:
407
+
408
+ API_BASE_URL The API endpoint for the LLM.
409
+
410
+ MODEL_NAME The model identifier to use for inference.
411
+
412
+ HF_TOKEN Your Hugging Face / API key.
413
+
414
+ The inference script must be named `inference.py` and placed in the root directory of the project
415
+
416
+ Participants must use OpenAI Client for all LLM calls using above variables
417
+
418
+ Participants must emit structured stdout logs strictly following the [START], [STEP], and [END] format defined in the sample inference.py provided below. Any deviation in field names, ordering, or formatting will result in incorrect evaluation scoring. Refer to the Sample Inference Script for the complete format specification and examples.
419
+
420
+ Infra Restrictions
421
+
422
+ Runtime of inference script should be less than 20min
423
+
424
+ Make sure your env and inference can run on a machine with vcpu=2, memory=8gb
425
+
426
+ Validator
427
+
428
+ Run the pre-submission validation script before submitting
429
+
430
+ Sample Inference Script
432
+
433
+ Pre Validation Script
435
+
436
+ Submission window opens on 28th March
437
+
438
+ Deadline: 8 Apr 11:59 PM
439
+
440
+
441
+ Submit your Assessment
442
+
443
+ Study material
444
+
445
+ Preparatory Course
446
+
447
+ 4 modules · ~3.5 hours
448
+
449
+ Each module: read the README first, then open the notebook in Colab. No local setup needed.
450
+
451
+ Module 1: Why OpenEnv?
452
+
453
+ ESSENTIAL FOR ROUND 1
454
+
455
+ 45 min
456
+
457
+ Module 2: Using Existing Environments
458
+
459
+ ESSENTIAL FOR ROUND 1
460
+
461
+ 50 min
462
+
463
+ Module 3: Deploying Environments
464
+
465
+ ESSENTIAL FOR ROUND 1
466
+
467
+ 45 min
468
+
469
+ Module 4: Building Your Own Environment
470
+
471
+ MOST IMPORTANT FOR ROUND 1
472
+
473
+ 60 min
474
+
475
+ View full course repository
476
+
477
+ GUIDE
478
+
479
+ Round 1 Guide
480
+
481
+ What to Expect
482
+
483
+ When Round 1 opens, you'll choose 1 of 4–5 problem statements and build an OpenEnv environment around it.
484
+
485
+ Example of what a problem statement looks like
486
+
487
+ "Build a mini-game RL environment with clearly defined tasks, automated graders, and reward logic using the OpenEnv framework."
488
+
489
+ → Create a mini-game an AI agent can play
490
+
491
+ → Define tasks with increasing difficulty
492
+
493
+ → Write graders that verify task completion
494
+
495
+ → Define reward logic for scoring
496
+
497
+ → Package using OpenEnv for automated evaluation
498
+
499
+ Evaluation Criteria
500
+
501
+ Runtime correctness
502
+
503
+ Runs without errors
504
+
505
+ Interface compliance
506
+
507
+ Follows OpenEnv standard
508
+
509
+ Task design
510
+
511
+ Clear, realistic, testable
512
+
513
+ Grading logic
514
+
515
+ Reward system makes sense
516
+
517
+ 20,000 → 3,000 teams advance
518
+
519
+ Prerequisites
520
+
521
+ Install before April 1st.
522
+
523
+ Required
524
+
525
+ Python 3.10+
+
+ Install 3.10, 3.11, or 3.12.
+
+ $ python --version
+
+ Git + GitHub account
+
+ Push your submission to GitHub or HF.
+
+ $ git --version
+
+ Hugging Face CLI
+
+ Deploy to HF Spaces.
+
+ $ pip install huggingface_hub
+ $ huggingface-cli login
+
+ OpenEnv
+
+ The framework.
+
+ $ pip install openenv-core
+
+ Google Colab
+
+ Prep course runs in Colab. Free tier works.
+
+ → colab.research.google.com
+
+ Docker
+
+ Isolated container testing.
+
+ $ docker --version
575
+ Recommended
576
+
577
+ VS Code
578
+
579
+ Best Python + Docker support
580
+
581
+ How to Submit
582
+
583
+ When Round 1 starts on 1 April:
584
+
585
+ Step 1
586
+
587
+ Application Form
588
+ Choose 1 of the 4–5 problem statements revealed on the platform.
589
+
590
+ Step 2
591
+
592
+ Scaffold
593
+ $ openenv init my_env
596
+ Generate project structure.
597
+
598
+ Step 3
599
+
600
+ Build
601
+ Define your environment in the generated files.
602
+
603
+ Step 4
604
+
605
+ Test locally
606
+ $ uv run server
609
+ Step 5
610
+
611
+ Deploy
612
+ $ openenv push --repo-id your-username/my-env
615
+ Step 6
616
+
617
+ Submit
618
+ Paste your HF Spaces URL here before the deadline.
619
+
620
+ Deadline: 8 April 2026, 11:59 PM IST
621
+
622
+ Step 2
623
+
624
+ Submit your Assessment
625
+
626
+ Complete Step 1 first
627
+
628
+ Problem Statement is live. Build and submit.
629
+
630
+ Round 1 begins
631
+
632
+
639
+ NOTE: Only team leaders can make the final submission.
640
+
641
+ FAQs
642
+
643
+ Frequently Asked Questions
644
+
645
+
657
+ Need help? Reach out to us
658
+
659
+ help_openenvhackathon@scaler.com
660
+
661
+ Contact Support
662
+
663
+
670
+
671
+
672
+ Great question. Here's exactly what the agent does in a **Code Review RL Environment:**
673
+
674
+ ---
675
+
676
+ ## 🤖 The Agent's Job
677
+
678
+ The agent acts as a **junior code reviewer**. Each episode, it's shown a code snippet and must take actions to review it — just like a human would on GitHub.
679
+
680
+ ---
681
+
682
+ ## 🎮 The Action Space
683
+
684
+ The agent can take these actions:
685
+
686
+ ```
687
+ APPROVE → Code looks good, no issues
688
+ FLAG_STYLE → Flag a style/formatting issue
689
+ FLAG_BUG → Flag a logic bug
690
+ FLAG_SECURITY → Flag a security vulnerability
691
+ ADD_COMMENT(txt) → Leave a review comment explaining the issue
692
+ REQUEST_CHANGES → Block the PR from merging
693
+ ```
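
As a rough sketch, the action space above could be typed like this. It uses stdlib dataclasses instead of Pydantic to stay dependency-free, and the names `ActionType`, `ReviewAction`, and `is_terminal` are illustrative assumptions, not part of OpenEnv:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(str, Enum):
    APPROVE = "APPROVE"
    FLAG_STYLE = "FLAG_STYLE"
    FLAG_BUG = "FLAG_BUG"
    FLAG_SECURITY = "FLAG_SECURITY"
    ADD_COMMENT = "ADD_COMMENT"
    REQUEST_CHANGES = "REQUEST_CHANGES"


# Assumption: APPROVE and REQUEST_CHANGES are the only episode-ending actions.
TERMINAL = {ActionType.APPROVE, ActionType.REQUEST_CHANGES}


@dataclass(frozen=True)
class ReviewAction:
    kind: ActionType
    text: Optional[str] = None  # only meaningful for ADD_COMMENT

    def __post_init__(self):
        # A comment action without a comment body is rejected up front.
        if self.kind is ActionType.ADD_COMMENT and not self.text:
            raise ValueError("ADD_COMMENT requires non-empty text")

    @property
    def is_terminal(self) -> bool:
        return self.kind in TERMINAL
```

Swapping the dataclass for a Pydantic `BaseModel` should be mechanical once the fields are settled.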
694
+
695
+ ---
696
+
697
+ ## 🔁 One Episode — Step by Step
698
+
699
+ ```
700
+ reset()
701
+ → Agent receives a code snippet (the PR diff)
702
+
703
+ step(FLAG_BUG)
704
+ → Grader checks: was there actually a bug?
705
+ → Reward: +0.5 if correct, -0.2 if false positive
706
+
707
+ step(ADD_COMMENT("This causes a null pointer on line 12"))
708
+ → Grader checks comment relevance
709
+ → Reward: +0.3 if accurate, 0.0 if vague
710
+
711
+ step(REQUEST_CHANGES)
712
+ → Episode ends
713
+ → Final reward tallied
714
+ ```
715
+
716
+ ---
717
+
718
+ ## 📊 The 3 Tasks (Easy → Hard)
719
+
720
+ | Task | What the agent sees | What it must do | Grader |
721
+ |---|---|---|---|
722
+ | **Easy** | Code with a PEP8 style issue | Flag the style issue | Deterministic — AST/linter check |
723
+ | **Medium** | Code with a subtle logic bug | Flag the bug + comment location | Check flag + line number accuracy |
724
+ | **Hard** | Code with a hidden security vuln (SQL injection, path traversal) | Flag security issue + explain risk | Check flag + comment quality via LLM grader |
725
+
726
+ ---
727
+
728
+ ## 🏅 Reward Design
729
+
730
+ | Action | Reward |
731
+ |---|---|
732
+ | Correct flag on real issue | +0.5 |
733
+ | Accurate comment/explanation | +0.3 |
734
+ | False positive flag | -0.2 |
735
+ | Missed critical issue | -0.4 |
736
+ | APPROVE on buggy code | -1.0 |
737
+ | Clean episode completion | +0.2 bonus |
738
+
739
+ This gives **partial progress signals** across the whole trajectory, not just end-of-episode — which directly satisfies one of the judging criteria.
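
One thing to watch: the table contains negative values, while graders are expected to report 0.0 to 1.0. A minimal sketch of one way to reconcile the two, assuming the event names below (they are made up for illustration) and assuming you know the best achievable raw reward for the task:

```python
# Per-step rewards copied from the table above; the keys are hypothetical labels.
REWARDS = {
    "correct_flag": 0.5,
    "accurate_comment": 0.3,
    "false_positive": -0.2,
    "missed_critical": -0.4,
    "approve_buggy": -1.0,
    "clean_completion": 0.2,
}


def episode_score(events, best_case):
    """Sum per-step rewards, then scale and clamp into [0.0, 1.0]."""
    raw = sum(REWARDS[event] for event in events)
    if best_case <= 0:
        return 0.0
    return max(0.0, min(1.0, raw / best_case))
```

The per-step values still give the agent a shaped signal during the episode; only the final grader score is squeezed into range.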
740
+
741
+ ---
742
+
743
+ ## 🔑 Why This Works Well for You
744
+
745
+ - Easy and medium actions are **deterministic to grade** (you're a SE, you can write the graders confidently); only the hard task leans on an LLM judge
746
+ - Code snippets can be **synthetically generated** — no dataset needed
747
+ - The hard task grader can use an **LLM call internally** to score comment quality, which is also where GraphRAG could plug in
748
+
749
+ Want me to now create the **full file structure and implementation plan** you can start coding today?
750
+
751
+ Okay, so basically: what if I take the code, convert it into a graph, feed that to the model, and ask it to review? The review will also be stored back in the graph, so in the end the user has a dependency graph plus an in-depth review of each module, with the LLM reading the logical reasoning behind the code rather than only catching pylint errors.
752
+
753
+ That's a genuinely clever idea! Let me break down what you're describing and stress-test it against the requirements.
754
+
755
+ ---
756
+
757
+ ## 🧠 What You're Describing
758
+
759
+ ```
760
+ Codebase
761
+
762
+ GraphRAG parses it → Dependency Graph (nodes = modules, edges = imports/calls)
763
+
764
+ Agent receives a module + its graph context as observation
765
+
766
+ Agent reviews that module (flags bugs, security issues, style)
767
+
768
+ Review is stored back into the graph as an annotation
769
+
770
+ Final output: Annotated dependency graph with reviews on every module
771
+ ```
772
+
773
+ This is actually a **multi-episode environment** where each episode = one module, and the final deliverable is a fully reviewed codebase graph.
774
+
775
+ ---
776
+
777
+ ## ✅ Why This Is Strong
778
+
779
+ | Criteria | How it scores |
780
+ |---|---|
781
+ | Real-world utility (30%) | Extremely high — this is literally what code review tools like CodeClimate do, but agent-driven |
782
+ | Creativity (10%) | Very novel — graph-aware code review hasn't been done in OpenEnv |
783
+ | Environment design (20%) | Graph context as observation is well-structured and meaningful |
784
+ | Task & grader quality (25%) | Natural difficulty progression across modules |
785
+
786
+ ---
787
+
788
+ ## 🔁 How the RL Loop Works
789
+
790
+ ```
791
+ reset()
792
+ → Load codebase graph
793
+ → Pick starting module (e.g. lowest dependency = easiest)
794
+ → Observation = {code, graph_neighbors, import_chain}
795
+
796
+ step(action)
797
+ → Agent flags issues, adds comments
798
+ → Grader scores the review
799
+ → Reward returned
800
+ → Review annotation stored on graph node
801
+ → Next module selected (traversal order)
802
+
803
+ state()
804
+ → Returns current graph with all annotations so far
805
+
806
+ done = True when all modules reviewed
807
+ ```
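
Here is a toy version of that loop, just to make the shape concrete. It is a sketch under assumptions: the class name, the `deps`/`code` input format, and passing the grader into `step()` are all invented for illustration, and it uses the stdlib `graphlib` rather than NetworkX:

```python
from graphlib import TopologicalSorter


class GraphReviewEnv:
    """Toy environment: one module per step, annotations stored per node."""

    def __init__(self, deps, code):
        # deps maps module -> set of modules it imports (hypothetical format)
        self.deps = deps
        self.code = code
        self.reset()

    def reset(self):
        # static_order() yields dependencies before the modules that import them
        self.order = list(TopologicalSorter(self.deps).static_order())
        self.annotations = {}
        self.cursor = 0
        return self._observe()

    def _observe(self):
        if self.cursor >= len(self.order):
            return None  # nothing left to review
        module = self.order[self.cursor]
        return {
            "module": module,
            "code": self.code[module],
            "graph_neighbors": sorted(self.deps.get(module, ())),
        }

    def step(self, review, grader):
        module = self.order[self.cursor]
        reward = grader(module, review)
        self.annotations[module] = review  # store the review on the graph node
        self.cursor += 1
        done = self.cursor >= len(self.order)
        return self._observe(), reward, done, {"reviewed": module}

    def state(self):
        return {
            "annotations": dict(self.annotations),
            "remaining": self.order[self.cursor:],
        }
```

A real environment would wrap this in the OpenEnv interface and typed models; the point here is only that the graph and its annotations are the state.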
808
+
809
+ ---
810
+
811
+ ## ⚠️ One Risk to Manage
812
+
813
+ The **grader for review quality** needs to be deterministic to pass the auto-validation gate. Two ways to handle this:
814
+
815
+ - **Easy/Medium tasks** — use AST + linter (pylint, bandit) as ground truth. Grader checks if agent caught what the linter already knows.
816
+ - **Hard tasks** — LLM-as-judge internally, but with a fixed seed/prompt so it's reproducible.
817
+
818
+ ---
819
+
820
+ ## 🛠️ Tech Stack
821
+
822
+ | Component | Tool |
823
+ |---|---|
824
+ | Graph construction | NetworkX + AST parser |
825
+ | Graph storage | NetworkX (in-memory, no DB needed) |
826
+ | RAG layer | LightRAG or simple vector search over node embeddings |
827
+ | Grader (easy/medium) | pylint, bandit, pyflakes |
828
+ | Grader (hard) | LLM-as-judge with fixed prompt |
829
+ | Observation model | Pydantic — code + graph context |
830
+
831
+ ---
832
+
833
+ ## 💡 My Honest Take
834
+
835
+ This is your **best idea yet**. It's novel, it's real-world, and the graph-as-state is a genuinely interesting RL mechanic that judges from Meta/HF will notice. The risk is build time — this is more complex than a flat code review env.
836
+
837
+ **Given you have ~1.5 days, want me to scope a minimal viable version of this that hits all the judging criteria without overbuilding?**
838
+
839
+ Okay, so a particular module might have, let's say, a dependency issue, but that might be caused by something elsewhere. For example, a module made a particular decision because of a requirement in another module, and that decision has raised a different issue, but the model might recommend removing that decision even though it is crucial for the project. Also, a module might be needed in a lot of places. How would you suggest traversing the graph? You need to compare all the dependencies and issues across all the modules, so it's not just BFS. And is Gemma 4 E4B good enough for the demo? Search online.
840
+
841
+ Great questions. You're thinking about this at a deep level, and there's good news on Gemma 4. Let me address all three of your questions properly.
842
+
843
+ ---
844
+
845
+ ## 1. 🧠 The Cascading Dependency Problem
846
+
847
+ You're describing a real and important issue — **a decision in module A causes a problem in module B, but the agent reviewing B in isolation flags it as wrong without knowing why.**
848
+
849
+ The solution is **context-aware observations**. When the agent reviews a module, its observation must include:
850
+
851
+ ```
852
+ Observation = {
853
+ current_module: code + issues found,
854
+ upstream_deps: why they made decisions that affect this module,
855
+ downstream_deps: what modules depend on THIS module,
856
+ existing_reviews: annotations already written on neighbors,
857
+ constraint_flags: ["decision X is forced by module Y's requirement"]
858
+ }
859
+ ```
860
+
861
+ So the agent never reviews a module in isolation. It always sees the **"why" from upstream** before flagging anything. If module B has a weird pattern because module A forced it, the agent sees that constraint, and the environment should reward `ADD_COMMENT("dependency constraint from A")` over `FLAG_BUG`.
862
+
863
+ This is actually what makes your environment **genuinely novel** — most code review tools don't model this.
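
A minimal sketch of how such an observation could be assembled from an import map. The field names mirror the observation block above; `build_observation` and its argument shapes are assumptions made for this example:

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    current_module: str
    upstream_deps: list
    downstream_deps: list
    existing_reviews: dict
    constraint_flags: list = field(default_factory=list)


def build_observation(module, deps, code, reviews, constraints):
    # deps maps module -> set of modules it imports (hypothetical format)
    upstream = sorted(deps.get(module, ()))
    downstream = sorted(m for m, imports in deps.items() if module in imports)
    neighbors = set(upstream) | set(downstream)
    return Observation(
        current_module=code[module],
        upstream_deps=upstream,
        downstream_deps=downstream,
        # Only reviews already written on direct neighbors are surfaced.
        existing_reviews={m: reviews[m] for m in neighbors if m in reviews},
        constraint_flags=list(constraints.get(module, ())),
    )
```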
864
+
865
+ ---
866
+
867
+ ## 2. 🗺️ Graph Traversal Strategy
868
+
869
+ You're right that BFS is wrong here. Here's what I'd suggest — a **weighted priority traversal**:
870
+
871
+ ```
872
+ Score each node by:
873
+ - Dependency centrality (how many modules import this one)
874
+ - Issue density (how many raw linter flags exist)
875
+ - Review dependency (can't review B properly until A is reviewed)
876
+
877
+ Traversal order:
878
+ Step 1 → Leaf modules (no dependencies) — safe to review in isolation
879
+ Step 2 → High centrality nodes — review these with full leaf context
880
+ Step 3 → Complex nodes — reviewed last with full graph context available
881
+ ```
882
+
883
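+ Think of it like a **topological sort weighted by centrality**. NetworkX has `topological_sort()` plus centrality measures such as `in_degree_centrality()` (closest to "how many modules import this one") and `betweenness_centrality()` built in, so you can combine them directly. Note that a topological sort only exists for acyclic graphs, so circular imports need to be collapsed or broken first.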
884
+
885
+ The key insight: **a module's review is only valid once all its dependencies have been reviewed first.** That's your episode ordering logic.
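
The ordering idea can be sketched with the stdlib alone. This version is an assumption-laden illustration: it uses `graphlib.TopologicalSorter` in place of NetworkX, and a plain "number of importers" count as the centrality proxy:

```python
from graphlib import TopologicalSorter


def review_order(deps):
    """Dependencies first; among modules that are ready, most-imported first."""
    # Count how many modules import each module (simple centrality proxy).
    dependents = {}
    for module, imports in deps.items():
        dependents.setdefault(module, 0)
        for target in imports:
            dependents[target] = dependents.get(target, 0) + 1

    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular imports
    order = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies reviewed already.
        ready = sorted(sorter.get_ready(), key=lambda m: (-dependents[m], m))
        for module in ready:
            order.append(module)
            sorter.done(module)
    return order
```

The tie-break on module name keeps the order reproducible across runs, which matters for the baseline score.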
886
+
887
+ ---
888
+
889
+ ## 3. 🤖 Is Gemma 4 E4B Good Enough?
890
+
891
+ **Short answer: Yes for a demo, but use the 26B MoE if you can.**
892
+
893
+ Here's what the search shows:
894
+
895
+ - Gemma 4 E4B is designed for edge devices, prioritizing multimodal capability, low-latency processing, and on-device utility. It's capable but optimized for efficiency over raw reasoning power.
896
+
897
+ - The 26B MoE model runs almost as fast as a 4B model since it only activates 4B parameters at a time, so if your HF Space can handle it, the 26B MoE gives you much stronger reasoning at nearly the same inference speed.
898
+
899
+ - E4B has a 128K context window — which is actually great for your use case since you'll be feeding code + graph context into each prompt.
900
+
901
+ - Gemma 4 shows significant improvements in coding benchmarks and native function-calling support for agentic workflows — which directly maps to what your agent needs to do.
902
+
903
+ **My recommendation:**
904
+
905
+ | Scenario | Model |
906
+ |---|---|
907
+ | Local testing / dev | Gemma 4 E4B — fits in 8GB RAM |
908
+ | HF Space demo (inference.py) | Gemma 4 26B MoE (stronger reviews, still fast) |
909
+ | Fallback if quota issues | Gemma 4 E4B is acceptable for passing evaluation |
910
+
911
+ The competition's inference script runs on 2 vCPU / 8GB RAM: E4B fits, the 26B MoE likely won't. So **E4B is your safest bet for the submission**, but use the 26B MoE via API if you're calling it remotely.
912
+
913
+ ---
914
+
915
+ Want me to now draft the **actual file structure and core implementation skeleton** so you can start coding today?
916
+
917
+ Great questions. I searched for both and have a thorough picture now. Here's the full analysis:
918
+
919
+ ---
920
+
921
+ ## 🔍 Does Your Idea Already Exist?
922
+
923
+ ### What CodeRabbit Does
924
+
925
+ CodeRabbit uses a "Codegraph" to understand complex dependencies across files and uncover the impact of changes, pulling in 40+ linters and security scanners. Sounds similar, right? But here's the critical difference:
926
+
927
+ CodeRabbit reviews are tied to diff visibility only — it can't reason about system-wide architecture, cross-repo dependencies, or historical design decisions. It can't validate whether a microservice change breaks contracts with downstream services.
928
+
929
+ That's exactly the gap your environment targets. CodeRabbit handles first-pass mechanical review: style violations, security issues, common bugs. Human reviewers still need to evaluate architecture, business logic, and design decisions.
930
+
931
+ ### The Core Differentiation of Your Idea
932
+
933
+ | Feature | CodeRabbit | Your RL Environment |
934
+ |---|---|---|
935
+ | Graph of codebase | ✅ Lightweight map | ✅ Full dependency graph |
936
+ | Context-aware review | Partial (diff only) | ✅ Full upstream/downstream context |
937
+ | Cascading dependency reasoning | ❌ | ✅ Core mechanic |
938
+ | Reviews stored back to graph | ❌ | ✅ Annotated output |
939
+ | RL agent learns from rewards | ❌ Static tool | ✅ Trainable agent |
940
+ | Final deliverable to user | PR comments | Annotated dependency map |
941
+
942
+ **Your environment fills a documented gap.** This is strong for the real-world utility score (30%).
943
+
944
+ ---
945
+
946
+ ## 🏗️ Architectural Questions You Still Need to Answer
947
+
948
+ ### 1. Graph Schema Design
949
+ What does a node actually contain?
950
+ ```
951
+ Node = {
952
+ module_id: str,
953
+ code: str,
954
+ ast_summary: dict, # function signatures, classes
955
+ linter_flags: list, # pre-computed ground truth for graders
956
+ dependency_reason: str, # WHY it depends on neighbors
957
+ review_annotation: dict # written by agent, starts null
958
+ }
959
+ ```
960
+ You need to decide this upfront — it drives everything else.
961
+
962
+ ---
963
+
964
+ ### 2. Observation Construction Strategy
965
+ How much graph context do you inject per step? Too little = agent reviews blindly. Too much = exceeds context window.
966
+
967
+ **Recommended approach — tiered context:**
968
+ ```
969
+ Easy task → current module only
970
+ Medium task → current module + direct neighbors
971
+ Hard task → current module + 2-hop neighborhood + existing reviews
972
+ ```
973
+ This naturally creates difficulty progression and respects the 128K context window of Gemma 4 E4B.
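
The tiers map onto a hop limit. A small sketch, assuming the same `deps` import map as elsewhere and treating both imports and importers as neighbors (the `CONTEXT_HOPS` mapping is my reading of the tiers above, not a fixed spec):

```python
from collections import deque

# Hop limit per task tier, following the tiered-context list above.
CONTEXT_HOPS = {"easy": 0, "medium": 1, "hard": 2}


def k_hop_neighborhood(deps, start, hops):
    """Modules within `hops` import edges of `start`, in either direction."""
    # Build an undirected view so dependents count as neighbors too.
    adjacency = {}
    for module, imports in deps.items():
        adjacency.setdefault(module, set())
        for target in imports:
            adjacency[module].add(target)
            adjacency.setdefault(target, set()).add(module)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    distance.pop(start)
    return set(distance)
```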
974
+
975
+ ---
976
+
977
+ ### 3. Episode Boundary Design
978
+ When does an episode end? Two valid designs:
979
+
980
+ - **Per-module episodes** — one episode = one module reviewed. Fast, easy to grade.
981
+ - **Full-codebase episodes** — one episode = entire codebase reviewed in traversal order. More realistic, harder to grade.
982
+
983
+ For the submission I'd recommend **per-module with a meta-episode wrapper** — each module is an episode, but the graph state persists across them so reviews accumulate.
984
+
985
+ ---
986
+
987
+ ### 4. Reward Attribution Problem
988
+ If the agent correctly identifies a cascading issue (bug in B caused by A), who gets the reward — the step that flagged it, or the final episode? You need a clear policy:
989
+
990
+ ```
991
+ Reward policy:
992
+ +0.6 if agent flags issue AND correctly attributes it to upstream module
993
+ +0.3 if agent flags issue but misattributes cause
994
+ -0.3 if agent flags false positive ignoring upstream context
995
+ ```
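
That policy is small enough to write down directly. A sketch, with the function name and argument names being hypothetical and the zero reward for "no flag" being my own assumption, since the policy above does not cover it:

```python
def attribution_reward(flagged, issue_exists, attributed_to, true_cause):
    """Score one flag according to the reward policy above."""
    if not flagged:
        return 0.0  # assumption: staying silent is scored elsewhere
    if not issue_exists:
        return -0.3  # false positive that ignored upstream context
    # Real issue: full credit only when the upstream cause is named correctly.
    return 0.6 if attributed_to == true_cause else 0.3
```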
996
+
997
+ ---
998
+
999
+ ### 5. Graph Generation Strategy
1000
+ Where does the test codebase come from? Three options:
1001
+
1002
+ | Option | Pro | Con |
1003
+ |---|---|---|
1004
+ | Parse a real OSS repo (e.g. Flask, FastAPI) | Realistic | Complex to control |
1005
+ | Synthetically generate modules | Fully deterministic graders | Less realistic |
1006
+ | Hybrid — real structure, injected bugs | Best of both | Medium effort |
1007
+
1008
+ **Hybrid is the right call** — use a real repo's structure but inject known bugs so graders can be 100% deterministic.
1009
+
1010
+ ---
1011
+
1012
+ ### 6. Grader Architecture (Critical for Pass/Fail Gate)
1013
+
1014
+ ```
1015
+ Easy grader → pylint/bandit pre-run, check if agent flagged same issues ✅ fully deterministic
1016
+ Medium grader → AST diff between agent's suggested fix and ground truth ✅ deterministic
1017
+ Hard grader → LLM-as-judge with fixed seed + rubric for cascading reasoning ⚠️ quasi-deterministic
1018
+ ```
1019
+
1020
+ The hard grader is your biggest risk. Fix the judge prompt, set temperature=0, and document both clearly in the README.
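
For contrast, the easy grader can be fully deterministic. A sketch, assuming flags are represented as hashable items such as `(line, code)` tuples and that the linter output was computed ahead of time; F1 is one reasonable scoring choice here, not the only one:

```python
def grade_flags(agent_flags, ground_truth):
    """Deterministic score in [0.0, 1.0]: F1 overlap with pre-computed linter flags."""
    agent, truth = set(agent_flags), set(ground_truth)
    if not agent and not truth:
        return 1.0  # clean module, nothing flagged: perfect
    hits = len(agent & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(agent)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)
```

Because the score depends only on set overlap, the same inputs always produce the same grade, which is what the automated validation needs.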
1021
+
1022
+ ---
1023
+
1024
+ ### 7. State Persistence Across Episodes
1025
+ The graph with accumulated annotations IS your `state()`. You need to define:
1026
+ - What gets serialized (NetworkX graph → JSON)
1027
+ - How `reset()` clears annotations but keeps graph structure
1028
+ - Whether partial reviews survive between episodes (they should)
1029
+
1030
+ ---
1031
+
1032
+ ## ✅ Summary: What Makes You Novel vs CodeRabbit
1033
+
1034
+ The key architectural insight is that an AI seeing only one file without context on its neighbors leads to false positives — it worries about things that are actually handled upstream. Your environment is the first RL training ground that explicitly teaches agents to reason about this. That's your pitch to the judges.
1035
+
1036
+ ---
1037
+
1038
+ Full upstream and full downstream context will be too heavy for the model to look at in complex projects with thousands of lines of code, right?
1039
+
1040
+ Exactly right. You've identified the most critical architectural bottleneck. Here's how to solve it:
1041
+
1042
+ ---
1043
+
1044
+ ## 🧠 The Core Problem
1045
+
1046
+ A real project could have:
1047
+ - Module A → 500 lines
1048
+ - 10 direct dependencies → each 300 lines
1049
+ - 2-hop neighbors → another 20 modules
1050
+
1051
+ That's roughly 3,500 lines before the 2-hop neighbors are even counted, likely **tens of thousands of tokens just for context** before the agent starts reviewing. Gemma 4 E4B's 128K window sounds big but fills fast.
1052
+
1053
+ ---
1054
+
1055
+ ## ✅ Solution: Compressed Graph Context (Not Raw Code)
1056
+
1057
+ **Never feed raw code as context. Feed summaries.**
1058
+
1059
+ ```
1060
+ Instead of:
1061
+ "Here is module_auth.py [500 lines of code]..."
1062
+
1063
+ Feed:
1064
+ "module_auth.py → exposes: validate_token(str) → User | None
1065
+ Known issues: may return None on expired token
1066
+ Depended on by: checkout.py, cart.py
1067
+ Review status: flagged null return risk"
1068
+ ```
1069
+
1070
+ This is your **Graph Summary Layer** — sitting between the raw graph and the agent's observation.
1071
+
1072
+ ---
1073
+
1074
+ ## 🏗️ The 3-Layer Observation Architecture
1075
+
1076
+ ```
1077
+ Layer 1: AST Compression
1078
+ Raw code → extract only signatures, return types,
1079
+ decorators, class names
1080
+ Tool: Python's ast module
1081
+ Output: ~50 tokens per module (vs 500+ raw)
1082
+
1083
+ Layer 2: Graph Summary Node
1084
+ Per node store:
1085
+ {
1086
+ exports: ["validate_token(str) → User | None"],
1087
+ known_issues: ["may return None"],
1088
+ constraint_flags: ["forced by requirement in auth_service.py"],
1089
+ review_status: "pending | reviewed | flagged",
1090
+ review_summary: "one line of what was found" ← written by agent
1091
+ }
1092
+
1093
+ Layer 3: Agent Observation (what model actually sees)
1094
+ {
1095
+ current_module: full code, ← only this is full
1096
+ direct_deps: [graph summaries], ← compressed
1097
+ dependents: [graph summaries], ← compressed
1098
+ relevant_reviews: [one-liners] ← already written
1099
+ }
1100
+ ```
1101
+
1102
+ So the agent always sees **one module in full**, and everything else as compressed summaries.
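
Layer 1 is the easiest piece to prototype. A minimal sketch using Python's `ast` module; `summarize_module` is a hypothetical helper, and it only covers top-level definitions and plain positional parameters:

```python
import ast


def summarize_module(source):
    """Return one-line signatures for top-level functions and classes."""
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Only plain positional parameters; *args/**kwargs are skipped here.
            args = ", ".join(ast.unparse(arg) for arg in node.args.args)
            returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
            lines.append(f"{node.name}({args}){returns}")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
    return lines
```

`ast.unparse` needs Python 3.9 or newer, which fits the 3.10+ requirement.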
1103
+
1104
+ ---
1105
+
1106
+ ## 📐 Token Budget Estimate
1107
+
1108
+ | Component | Tokens |
1109
+ |---|---|
1110
+ | Current module (full code) | ~800 |
1111
+ | 5 dependency summaries × 50 tokens | ~250 |
1112
+ | 5 dependent summaries × 50 tokens | ~250 |
1113
+ | Existing review annotations | ~200 |
1114
+ | System prompt + action space | ~300 |
1115
+ | **Total per step** | **~1,800** |
1116
+
1117
+ That's well within E4B's window, even for complex projects.
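
The arithmetic is easy to keep honest in code. The numbers below are the estimates from the table, not measurements, and the 128K figure is the context window quoted earlier:

```python
# Estimated tokens per observation component (from the table above).
BUDGET = {
    "current_module": 800,
    "dependency_summaries": 5 * 50,
    "dependent_summaries": 5 * 50,
    "review_annotations": 200,
    "system_prompt": 300,
}
CONTEXT_WINDOW = 128_000  # window size quoted for Gemma 4 E4B

total = sum(BUDGET.values())
share = total / CONTEXT_WINDOW  # fraction of the window used per step
```

Here `total` is 1,800 tokens, a little over 1% of the window, which leaves room for a few `REQUEST_CONTEXT` expansions.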
1118
+
1119
+ ---
1120
+
1121
+ ## 🎯 Bonus: This Becomes a Feature, Not a Workaround
1122
+
1123
+ The **summary layer is actually what makes your environment novel**. You're teaching the agent to:
1124
+
1125
+ 1. Read compressed dependency context
1126
+ 2. Decide if a deeper look is needed
1127
+ 3. Optionally request full code of a neighbor via a `REQUEST_CONTEXT(module_id)` action
1128
+
1129
+ That `REQUEST_CONTEXT` action costs a small reward penalty (simulating real reviewer time cost), which is **genuinely interesting RL reward shaping** that judges will notice.
1130
+
1131
+ ---
1132
+
1133
+ ## The Revised Action Space
1134
+
1135
+ ```
1136
+ APPROVE
1137
+ FLAG_STYLE
1138
+ FLAG_BUG
1139
+ FLAG_SECURITY
1140
+ FLAG_DEPENDENCY_ISSUE ← new: cascading problem from neighbor
1141
+ ADD_COMMENT(txt)
1142
+ REQUEST_CHANGES
1143
+ REQUEST_CONTEXT(module) ← new: pay token cost, get full neighbor code
1144
+ ```
1145
+
1146
+ This elegantly solves your original cascading dependency problem too — the agent can *choose* to investigate upstream before judging.
1147
+
1148
+ Want me to now write the full file structure and implementation skeleton?
1149
+
1150
+ Great — these are the four hardest design questions in the project. Let me work through each one carefully.
1151
+
1152
+ ---
1153
+
1154
+ ## 1. 🔪 Converting Code Into Modules Accurately
1155
+
1156
+ The challenge is — what IS a module? You need a deterministic answer.
1157
+
1158
+ **Recommended: File = Module, with AST sub-structure**
1159
+
1160
+ ```
1161
+ Step 1: File-level split
1162
+ Each .py file = one node in the graph
1163
+ filename → module_id
1164
+
1165
+ Step 2: AST parsing per file
1166
+ Extract:
1167
+ - All function signatures + return types
1168
+ - All class definitions
1169
+ - All imports (this gives you edges)
1170
+ - All global variables
1171
+
1172
+ Step 3: Edge construction from imports
1173
+ "from auth import validate_token"
1174
+ → edge: current_module → auth.py
1175
+
1176
+ Step 4: Dependency reason tagging
1177
+ Use the import line + first usage context
1178
+ as the "why this depends on that" annotation
1179
+ ```
1180
+
1181
+ **The hard problem: implicit dependencies**
1182
+ Sometimes module B doesn't import A directly but uses a shared global or config. Handle this with a second pass:
1183
+
1184
+ ```
1185
+ Pass 1: Explicit edges (imports)
1186
+ Pass 2: Name resolution edges
1187
+ - scan function bodies for names not defined locally
1188
+ - trace them back to source module
1189
+ - add an "implicit dependency" edge with lower weight
1190
+ ```
1191
+
1192
+ Python's `ast` module handles the parsing natively, so no external library is needed; the cross-module name resolution in pass 2 is logic you write on top of it.
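
Pass 1 (explicit edges) could look roughly like this. It is a sketch: `import_edges` and the `known_modules` set are assumptions, relative imports are skipped, and only the top-level package name is matched:

```python
import ast


def import_edges(module_id, source, known_modules):
    """Return (importer, imported) edges for imports that point inside the project."""
    tree = ast.parse(source)
    edges = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue  # not an import, or a relative import (not handled here)
        for name in names:
            top = name.split(".")[0]
            # Ignore stdlib and third-party imports: keep project modules only.
            if top in known_modules and top != module_id:
                edges.add((module_id, top))
    return edges
```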
1193
+
1194
+ ---
1195
+
1196
+ ## 2. 📊 How Reporting Works
1197
+
1198
+ Think of reporting as **three layers that build progressively**:
1199
+
1200
+ ```
1201
+ Layer 1: Per-step annotation (live)
1202
+ Every time agent calls ADD_COMMENT or FLAG_*,
1203
+ that gets written immediately to the graph node
1204
+ as a review_annotation field
1205
+
1206
+ Layer 2: Per-module summary (end of episode)
1207
+ When episode ends (agent calls APPROVE or REQUEST_CHANGES),
1208
+ environment compiles all step annotations into:
1209
+ {
1210
+ verdict: "approved | changes_requested",
1211
+ issues: [...],
1212
+ dependency_notes: [...],
1213
+ confidence: 0.0-1.0 ← derived from reward trajectory
1214
+ }
1215
+
1216
+ Layer 3: Full codebase report (end of all episodes)
1217
+ state() returns the entire annotated graph
1218
+ Serialize to:
1219
+ - JSON (machine readable)
1220
+ - Markdown report (human readable)
1221
+ - Visual graph (NetworkX → graphviz or mermaid)
1222
+ ```
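As an illustration of Layers 2 and 3, the sketch below compiles step annotations into the per-module summary and serializes it to JSON and Markdown. The `Annotation` type, the function names, and the confidence heuristic (share of steps with a positive reward) are assumptions; the outline only says confidence is derived from the reward trajectory.

```python
import json
from dataclasses import dataclass

@dataclass
class Annotation:
    action: str    # e.g. "FLAG_BUG", "ADD_COMMENT", "FLAG_DEPENDENCY_ISSUE"
    note: str
    reward: float  # per-step reward the environment returned

def summarize_module(annotations, final_action):
    """Layer 2: fold the step annotations of one episode into a summary dict."""
    flags = [a for a in annotations if a.action.startswith("FLAG_")]
    positive = sum(1 for a in annotations if a.reward > 0)
    return {
        "verdict": "approved" if final_action == "APPROVE" else "changes_requested",
        "issues": [a.note for a in flags if a.action != "FLAG_DEPENDENCY_ISSUE"],
        "dependency_notes": [a.note for a in flags if a.action == "FLAG_DEPENDENCY_ISSUE"],
        # Assumed heuristic: share of steps that earned a positive reward.
        "confidence": positive / len(annotations) if annotations else 0.0,
    }

def to_markdown(module_id, summary):
    """Layer 3: human-readable rendering of one module summary."""
    lines = [
        f"## {module_id}",
        f"Verdict: {summary['verdict']} (confidence {summary['confidence']:.2f})",
    ]
    lines += [f"- {issue}" for issue in summary["issues"]]
    lines += [f"- dependency: {note}" for note in summary["dependency_notes"]]
    return "\n".join(lines)

steps = [
    Annotation("FLAG_BUG", "null risk line 12", 0.5),
    Annotation("ADD_COMMENT", "guard against missing token", 0.3),
    Annotation("FLAG_DEPENDENCY_ISSUE", "auth.py can return None", 0.5),
    Annotation("FLAG_BUG", "unused variable", -0.2),  # judged a false positive
]
summary = summarize_module(steps, "REQUEST_CHANGES")
report_json = json.dumps({"checkout.py": summary}, indent=2)
report_md = to_markdown("checkout.py", summary)
```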
1223
+
1224
+ **Updating reviews as the agent learns more** is the elegant part. Because reviews are stored on graph nodes, when the agent later reviews module B and discovers that the root cause was actually in module A, it can call:
1225
+
1226
+ ```
1227
+ AMEND_REVIEW(module_id="auth.py", note="root cause of checkout.py null issue")
1228
+ ```
1229
+
1230
+ This updates the node annotation retroactively. The reward for this action is high — it's exactly the cascading reasoning you want to incentivize.
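One way the retroactive update could look, assuming the graph is a plain dict of node dicts and that each node keeps its annotations in a `review_annotation` list. The `amend_review` helper and the `amended_from` field are hypothetical.

```python
def amend_review(graph: dict, module_id: str, note: str, amended_from: str) -> dict:
    """Append an amendment to an already-reviewed node and return that node."""
    if module_id not in graph:
        raise KeyError(f"unknown module: {module_id}")
    node = graph[module_id]
    node.setdefault("review_annotation", []).append(
        {"type": "amendment", "note": note, "amended_from": amended_from}
    )
    return node

graph = {
    "auth.py": {"review_annotation": [{"type": "comment", "note": "looks fine"}]},
    "checkout.py": {"review_annotation": []},
}
node = amend_review(
    graph,
    module_id="auth.py",
    note="root cause of checkout.py null issue",
    amended_from="checkout.py",
)
```

Appending instead of overwriting keeps the original verdict visible, so a grader can still see that the agent changed its mind after reviewing the dependent module.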
1231
+
1232
+ ---
1233
+
1234
+ ## 3. ✅ Does This Align With Round 1 Requirements?
1235
+
1236
+ Let's go requirement by requirement:
1237
+
1238
+ | Requirement | Your Design | Status |
1239
+ |---|---|---|
1240
+ | Real-world task | Code review with dependency reasoning | ✅ Strong |
1241
+ | step() / reset() / state() | Per-module episodes, graph persists in state() | ✅ |
1242
+ | Typed Pydantic models | Observation = code + summaries, Action = flag/comment/request, Reward = float | ✅ |
1243
+ | Minimum 3 tasks easy→hard | Easy: style/linter, Medium: logic bug with direct dep context, Hard: cascading bug across 2+ modules | ✅ |
1244
+ | Reward 0.0–1.0 with partial signal | Per-step rewards for each correct flag/comment/attribution | ✅ |
1245
+ | Deterministic graders | Easy/medium use AST+linter ground truth, hard uses fixed-seed LLM judge | ✅ with care |
1246
+ | Baseline inference script | Agent reviews all 3 task codebases, emits [START]/[STEP]/[END] logs | ✅ |
1247
+ | Dockerfile + HF Space | Standard containerization | ✅ |
1248
+ | openenv.yaml + validate | Standard spec compliance | ✅ |
1249
+
1250
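+ One gap to watch: **the hard-task grader is only quasi-deterministic**. A fixed seed and temperature=0 reduce run-to-run variation in an LLM judge but do not guarantee identical outputs, so document the judge prompt, model, seed, and temperature=0 setting explicitly in the README to satisfy the reproducibility requirement.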
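A small sketch of how the judge settings could be pinned and referenced from the README. Everything here is hypothetical (the `JudgeConfig` fields, the placeholder model name, the 12-character fingerprint); it only shows the idea of recording one stable identifier per grading setup.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class JudgeConfig:
    model: str
    prompt_template: str
    temperature: float = 0.0
    seed: int = 0

def fingerprint(config: JudgeConfig) -> str:
    """Stable short hash of the judge settings, suitable for logs and docs."""
    payload = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

base = JudgeConfig(model="example-judge-model", prompt_template="Grade this review: {review}")
same = JudgeConfig(model="example-judge-model", prompt_template="Grade this review: {review}")
warmer = JudgeConfig(
    model="example-judge-model",
    prompt_template="Grade this review: {review}",
    temperature=0.7,
)
base_id = fingerprint(base)
```

Logging `base_id` next to every hard-task score makes it possible to tell whether two scores came from the same prompt, model, seed, and temperature.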
1251
+
1252
+ ---
1253
+
1254
+ ## 4. 🤖 Where Is The RL? Where Is OpenEnv?
1255
+
1256
+ This is the most important question to be clear on — because judges WILL ask.
1257
+
1258
+ ### The RL Loop
1259
+
1260
+ ```
1261
+ Environment (your code) Agent (Gemma 4 / any LLM)
1262
+ ───────────────────────── ──────────────────────────
1263
+ reset() → receives initial observation
1264
+ (module code + graph context)
1265
+
1266
+ ← action: FLAG_BUG
1267
+
1268
+ step(FLAG_BUG) → returns:
1269
+ - new observation (updated graph)
1270
+ - reward (+0.5 if real bug)
1271
+ - done (False)
1272
+ - info {}
1273
+
1274
+ ← action: ADD_COMMENT("null risk line 12")
1275
+
1276
+ step(ADD_COMMENT(...)) → reward (+0.3 if accurate)
1277
+
1278
+ ← action: REQUEST_CHANGES
1279
+
1280
+ step(REQUEST_CHANGES) → reward (+0.2 episode bonus)
1281
+ done = True
1282
+
1283
+ state() → full annotated graph so far
1284
+ ```
1285
+
1286
+ The **RL part** is: the agent is learning a *policy* — which actions to take given a code observation and graph context — to maximize cumulative reward. It's not just calling an LLM once. It's a multi-step decision loop.
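The loop in the diagram can be sketched as a toy environment. This is an assumption-laden stand-in, not the real `CodeReviewEnv`: ground truth is a set of buggy line numbers, the reward values are copied from the diagram, incorrect actions earn 0.0, and plain dataclasses stand in for the Pydantic models.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewAction:
    kind: str                   # "FLAG_BUG", "ADD_COMMENT", "APPROVE", "REQUEST_CHANGES"
    line: Optional[int] = None
    text: str = ""

class ToyCodeReviewEnv:
    def __init__(self, module_id: str, code: str, bug_lines: set):
        self.module_id, self.code, self.bug_lines = module_id, code, bug_lines
        self.annotations, self.done = [], False

    def _observation(self) -> dict:
        return {"module_id": self.module_id, "code": self.code,
                "annotations": list(self.annotations)}

    def reset(self) -> dict:
        self.annotations, self.done = [], False
        return self._observation()

    def step(self, action: ReviewAction):
        if self.done:
            raise RuntimeError("episode finished; call reset()")
        reward = 0.0
        if action.kind == "FLAG_BUG" and action.line in self.bug_lines:
            reward = 0.5
        elif action.kind == "ADD_COMMENT" and action.line in self.bug_lines:
            reward = 0.3
        elif action.kind in ("APPROVE", "REQUEST_CHANGES"):
            self.done = True
            correct = "REQUEST_CHANGES" if self.bug_lines else "APPROVE"
            reward = 0.2 if action.kind == correct else 0.0
        self.annotations.append((action.kind, action.line, action.text, reward))
        return self._observation(), reward, self.done, {}

    def state(self) -> dict:
        return {self.module_id: list(self.annotations)}

env = ToyCodeReviewEnv("checkout.py", "def checkout(): ...", bug_lines={12})
obs = env.reset()
scripted = [
    ReviewAction("FLAG_BUG", line=12),
    ReviewAction("ADD_COMMENT", line=12, text="null risk line 12"),
    ReviewAction("REQUEST_CHANGES"),
]
total, done = 0.0, False
for action in scripted:
    obs, reward, done, info = env.step(action)
    total += reward
```

A scripted action list replaces the LLM agent here; in a real run the policy would choose each action from `obs`, and the rewards would feed whatever training or evaluation loop sits outside the environment.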
1287
+
1288
+ ### Where OpenEnv Fits
1289
+
1290
+ ```
1291
+ openenv.yaml
1292
+ name: code-review-env
1293
+ version: 1.0.0
1294
+ tasks: [style_review, logic_review, cascading_review]
1295
+
1296
+ environment.py
1297
+ class CodeReviewEnv:
1298
+ observation: CodeObservation ← Pydantic model
1299
+ action: ReviewAction ← Pydantic model
1300
+ reward: ReviewReward ← Pydantic model
1301
+
1302
+ def reset() → CodeObservation
1303
+ def step(action) → (obs, reward, done, info)
1304
+ def state() → GraphState
1305
+
1306
+ graders/
1307
+ easy_grader.py ← linter diff check
1308
+ medium_grader.py ← AST + line attribution check
1309
+ hard_grader.py ← LLM judge, fixed seed
1310
+
1311
+ inference.py ← runs Gemma 4 as the agent
1312
+ [START] task=cascading_review
1313
+ [STEP] action=FLAG_DEPENDENCY_ISSUE reward=0.5
1314
+ [STEP] action=ADD_COMMENT reward=0.3
1315
+ [END] total_reward=0.8
1316
+ ```
1317
+
1318
+ OpenEnv is the **standardized wrapper** that makes your environment pluggable — any agent, any model, can be dropped in and evaluated against your graders. That's the whole point of the spec.
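For the baseline script's log lines, a formatter along these lines would reproduce the `[START]`/`[STEP]`/`[END]` shape shown in the outline. The function name is hypothetical, and whether the OpenEnv tooling expects exactly these fields is an assumption taken from the sketch above rather than from the spec.

```python
def format_episode_log(task: str, steps: list) -> list:
    """Turn (action, reward) pairs into [START]/[STEP]/[END] log lines."""
    lines = [f"[START] task={task}"]
    total = 0.0
    for action, reward in steps:
        total += reward
        lines.append(f"[STEP] action={action} reward={reward}")
    # Round so floating-point noise does not leak into the log.
    lines.append(f"[END] total_reward={round(total, 2)}")
    return lines

log_lines = format_episode_log(
    "cascading_review",
    [("FLAG_DEPENDENCY_ISSUE", 0.5), ("ADD_COMMENT", 0.3)],
)
for line in log_lines:
    print(line)
```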
1319
+
1320
+ ---
1321
+
1322
+ ## The One-Line Pitch to Judges
1323
+
1324
+ > *"An RL environment where an agent learns to review code with full dependency awareness — outputting an annotated codebase graph, not just PR comments. It teaches agents to reason about WHY a bug exists, not just WHERE."*
1325
+
1326
+ That's differentiated from CodeRabbit, aligned with all Round 1 requirements, and has a clear RL loop.
1327
+
1328
+ **Ready to build the file skeleton now?**
1329
+
1330
+
1331
+ What I basically want is for this project to have real RL: it should first inspect the code and find issues, and then each reported bug should be reviewed. If the bug was reported correctly there should be a reward, otherwise a punishment, and the actual model should learn from that. This should be integrated into the project.
1332
+
1333
+ Also, right now we have an issue where the modules are not all linked properly. Linking should not be based only on imports but also on where modules are called or run; because this linking is incomplete, some modules end up isolated. Large modules are also being broken into many pieces. I also suspect that the LLM is not actually grading the code: it just passes through it and repeats the errors from the previous stage. The hard filter should also catch errors, find issues, and give detailed reports on all of them as clear tasks. Give it a super detailed agent prompt for this task, plus an output format, and ensure it is adaptable. After the errors are found, they should be verified again with another model that the user can define, and the system should then learn from that, which is what makes it RL. Assign proper grades so the machine learning works well for this particular task. Also, the arrows in the graph are sometimes too thick, and when I hover over them I get a big block of text rather than a well-formatted overlay with info about the modules. When I click on a module, the sidebar should show its report, well formatted.