shreyas-joshi committed on
Commit
cf05092
·
0 Parent(s):

feat: initialize CodeReviewEnv with foundational components

- Add Dockerfile for containerized environment setup.
- Create README.md with quickstart instructions.
- Implement database package with migrations and schema definitions.
- Develop store module for database interactions and data management.
- Introduce parser module for AST parsing and code analysis.
- Establish environment and graph management for dependency tracking.
- Set up grading and task management placeholders for future phases.
- Include sample codebase and ground truth for testing and validation.
- Add tests for environment, parser, and graph functionalities.

Files changed (46)
  1. .gitignore +5 -0
  2. Builder.md +138 -0
  3. Debugger.md +100 -0
  4. Phases.md +295 -0
  5. code-review-env/Dockerfile +7 -0
  6. code-review-env/README.md +11 -0
  7. code-review-env/db/__init__.py +1 -0
  8. code-review-env/db/migrations.py +28 -0
  9. code-review-env/db/schema.py +91 -0
  10. code-review-env/db/store.py +384 -0
  11. code-review-env/env/__init__.py +1 -0
  12. code-review-env/env/environment.py +6 -0
  13. code-review-env/env/graph.py +105 -0
  14. code-review-env/env/models.py +1 -0
  15. code-review-env/env/observation_builder.py +1 -0
  16. code-review-env/env/reward.py +1 -0
  17. code-review-env/graders/__init__.py +1 -0
  18. code-review-env/graders/base_grader.py +5 -0
  19. code-review-env/graders/easy_grader.py +1 -0
  20. code-review-env/graders/hard_grader.py +1 -0
  21. code-review-env/graders/medium_grader.py +1 -0
  22. code-review-env/inference.py +4 -0
  23. code-review-env/openenv.yaml +3 -0
  24. code-review-env/parser/__init__.py +1 -0
  25. code-review-env/parser/ast_parser.py +189 -0
  26. code-review-env/parser/linter.py +104 -0
  27. code-review-env/parser/summarizer.py +24 -0
  28. code-review-env/pyproject.toml +13 -0
  29. code-review-env/requirements.txt +9 -0
  30. code-review-env/sample_codebase/auth.py +7 -0
  31. code-review-env/sample_codebase/cart.py +17 -0
  32. code-review-env/sample_codebase/checkout.py +15 -0
  33. code-review-env/sample_codebase/config.py +6 -0
  34. code-review-env/sample_codebase/ground_truth.json +39 -0
  35. code-review-env/sample_codebase/payments.py +15 -0
  36. code-review-env/server/__init__.py +1 -0
  37. code-review-env/server/app.py +1 -0
  38. code-review-env/tasks/__init__.py +1 -0
  39. code-review-env/tasks/easy_task.py +1 -0
  40. code-review-env/tasks/hard_task.py +1 -0
  41. code-review-env/tasks/medium_task.py +1 -0
  42. code-review-env/tasks/task_registry.py +1 -0
  43. code-review-env/tests/test_environment.py +21 -0
  44. code-review-env/tests/test_graders.py +2 -0
  45. code-review-env/tests/test_inference.py +2 -0
  46. code-review-env/tests/test_parser.py +13 -0
.gitignore ADDED
@@ -0,0 +1,5 @@
+ .venv
+ .env
+ __pycache__/
+ *.pyc
+ code-review-env/code_review_env.db
Builder.md ADDED
@@ -0,0 +1,138 @@
1
+ # Builder Prompt — CodeReviewEnv
2
+
3
+ You are an expert Python engineer building a reinforcement learning environment called **CodeReviewEnv** for the OpenEnv Hackathon Round 1. Read everything below before writing a single line of code.
4
+
5
+ ---
6
+
7
+ ## What You Are Building
8
+
9
+ An OpenEnv-compliant RL environment where an LLM agent learns to perform dependency-aware code review on a Python codebase.
10
+
11
+ The environment:
12
+ 1. Parses a Python codebase into a **persistent dependency graph** stored in SQLite via SQLModel. Nodes = modules. Edges = import relationships.
13
+ 2. Each node stores: full source code, compressed AST summary (~50 tokens), linter ground truth (pylint + bandit output), and agent-written review annotations.
14
+ 3. The agent reviews one module per episode via a multi-step loop: `reset()` → `step(action)` × N → done.
15
+ 4. The agent sees **full code of the current module only**. Neighbors are always compressed summaries — never full code. This is a hard constraint for token budget.
16
+ 5. The agent can take actions: FLAG_BUG, FLAG_STYLE, FLAG_SECURITY, FLAG_DEPENDENCY_ISSUE, ADD_COMMENT, REQUEST_CHANGES, APPROVE, REQUEST_CONTEXT (costs -0.1 reward), AMEND_REVIEW (updates a neighbor's annotation retroactively).
17
+ 6. Rewards are computed by graders against pre-computed ground truth stored in the DB.
18
+ 7. The final output is an annotated dependency graph — all module reviews, cross-module causal attributions, readable as JSON and Markdown.
19
+
20
+ The key differentiator: the environment models **cascading bugs** — where a bug in module B is caused by a design decision in module A. The agent is rewarded for identifying the upstream root cause, not just flagging the surface symptom.
21
+
22
+ ---
23
+
24
+ ## Persistence Strategy
25
+
26
+ **SQLite + SQLModel. This is non-negotiable for demo performance.**
27
+
28
+ - On first run: parse sample_codebase/ → populate DB with all nodes, edges, linter flags
29
+ - On subsequent runs: detect DB exists → skip parsing → load graph directly
30
+ - `reset()` clears only review annotations, never graph structure
31
+ - All episode history is stored for reproducibility
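Review note: the first-run versus subsequent-run behavior described in this list could be sketched roughly as follows. The sketch uses the stdlib `sqlite3` module instead of SQLModel purely to stay dependency-free, and it treats "DB exists" as "the node table already has rows". `load_or_build_graph`, the `node` table, and its columns are hypothetical names, not the project's actual schema.

```python
import sqlite3
from pathlib import Path


def load_or_build_graph(db_path: Path, codebase: dict[str, str]) -> tuple[list[str], bool]:
    """Return (module_ids, parsed_now).

    `codebase` maps a hypothetical module id to its source text. It is only
    written to the DB when the node table is still empty (first run).
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS node ("
            "module_id TEXT PRIMARY KEY, raw_code TEXT NOT NULL)"
        )
        (count,) = conn.execute("SELECT COUNT(*) FROM node").fetchone()
        parsed_now = count == 0
        if parsed_now:
            # First run: populate the DB from the parsed codebase.
            conn.executemany(
                "INSERT INTO node (module_id, raw_code) VALUES (?, ?)",
                sorted(codebase.items()),
            )
            conn.commit()
        # Every run: the graph is read back from the DB, never from disk files.
        rows = conn.execute("SELECT module_id FROM node ORDER BY module_id").fetchall()
        return [row[0] for row in rows], parsed_now
    finally:
        conn.close()
```

Calling it twice against the same file should report `parsed_now=True` only the first time, which is the property the second bullet asks for.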
32
+
33
+ Use Context7 MCP to look up SQLModel, NetworkX, pylint programmatic API, bandit API, and OpenEnv spec documentation before implementing each component. Do not guess at APIs — look them up.
34
+
35
+ ---
36
+
37
+ ## Tech Stack
38
+
39
+ - Python 3.11
40
+ - SQLModel (SQLite persistence)
41
+ - NetworkX (graph construction and traversal)
42
+ - FastAPI (HTTP server for OpenEnv spec)
43
+ - Pydantic v2 (typed models)
44
+ - pylint + bandit (linter ground truth)
45
+ - Python `ast` module (AST parsing — stdlib, no extras)
46
+ - OpenAI client (all LLM calls in inference.py and hard grader)
47
+ - Docker (containerization)
48
+
49
+ ---
50
+
51
+ ## Project Structure
52
+
53
+ Follow this structure exactly — do not deviate:
54
+
55
+ ```
56
+ code-review-env/
57
+ ├── openenv.yaml
58
+ ├── Dockerfile
59
+ ├── README.md
60
+ ├── inference.py
61
+ ├── requirements.txt
62
+ ├── env/
63
+ │ ├── environment.py
64
+ │ ├── models.py
65
+ │ ├── graph.py
66
+ │ ├── observation_builder.py
67
+ │ └── reward.py
68
+ ├── db/
69
+ │ ├── schema.py
70
+ │ ├── store.py
71
+ │ └── migrations.py
72
+ ├── parser/
73
+ │ ├── ast_parser.py
74
+ │ ├── linter.py
75
+ │ └── summarizer.py
76
+ ├── graders/
77
+ │ ├── base_grader.py
78
+ │ ├── easy_grader.py
79
+ │ ├── medium_grader.py
80
+ │ └── hard_grader.py
81
+ ├── tasks/
82
+ │ ├── task_registry.py
83
+ │ ├── easy_task.py
84
+ │ ├── medium_task.py
85
+ │ └── hard_task.py
86
+ ├── server/
87
+ │ └── app.py
88
+ ├── sample_codebase/
89
+ │ ├── auth.py
90
+ │ ├── checkout.py
91
+ │ ├── cart.py
92
+ │ ├── payments.py
93
+ │ ├── config.py
94
+ │ └── ground_truth.json
95
+ └── tests/
96
+ ```
97
+
98
+ ---
99
+
100
+ ## Phase You Are Currently Building
101
+
102
+ **[INSERT PHASE NUMBER AND NAME HERE]**
103
+
104
+ Refer to the phase plan for exact tasks and completion criteria for this phase. Build only what is scoped to this phase. Do not build ahead.
105
+
106
+ ---
107
+
108
+ ## Non-Negotiable Constraints
109
+
110
+ 1. All rewards must be clipped to 0.0–1.0. Never return a value outside this range.
111
+ 2. Never feed full neighbor code into observations. Always use compressed summaries.
112
+ 3. inference.py must use OpenAI client. Read API_BASE_URL, MODEL_NAME, HF_TOKEN from env vars.
113
+ 4. inference.py must emit [START], [STEP], [END] log format exactly — no deviations.
114
+ 5. Hard grader must use temperature=0 and a fixed rubric prompt stored as a constant.
115
+ 6. DB must auto-populate on first Docker run without manual intervention.
116
+ 7. All Pydantic models must be fully typed — no `Any`, no `dict` without a model.
117
+ 8. Episode step limit is 10. Hard cap. Enforce in environment.py.
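Review note: constraint 8 could be enforced with a small counter like the one below. This is a minimal sketch, not the contents of `environment.py`; `EpisodeCounter` and `TERMINAL_ACTIONS` are hypothetical names, and treating APPROVE and REQUEST_CHANGES as the terminal actions follows the phase plan's description of when an episode ends.

```python
from dataclasses import dataclass

MAX_STEPS = 10  # hard cap from constraint 8
TERMINAL_ACTIONS = frozenset({"APPROVE", "REQUEST_CHANGES"})


@dataclass
class EpisodeCounter:
    """Tracks steps in one episode and decides when it is done."""

    steps_taken: int = 0
    done: bool = False

    def record(self, action_type: str) -> bool:
        """Count one step and return True once the episode has ended."""
        if self.done:
            raise RuntimeError("episode already finished; call reset() first")
        self.steps_taken += 1
        self.done = (
            action_type in TERMINAL_ACTIONS or self.steps_taken >= MAX_STEPS
        )
        return self.done
```

Raising on a step after `done` keeps an agent loop from silently running past the cap.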
118
+
119
+ ---
120
+
121
+ ## Before You Start Each File
122
+
123
+ 1. Use Context7 MCP to look up the relevant library documentation
124
+ 2. Check if the schema/interface you are about to implement has dependencies on already-built files — import them, don't reimplement
125
+ 3. If you need to make a design choice not covered in this prompt (e.g. exact DB column types, traversal tie-breaking, summary format), **ask the user before proceeding**
126
+ 4. Write tests alongside implementation — not after
127
+
128
+ ---
129
+
130
+ ## Questions To Ask The User Before Starting
131
+
132
+ If any of the following are unclear, ask before building:
133
+
134
+ - What Python codebase should be used as the demo target? (default: the sample_codebase/ provided)
135
+ - Should the hard grader use the same MODEL_NAME from env vars, or a fixed model?
136
+ - Should REQUEST_CONTEXT return the full raw code or the full AST + raw code?
137
+ - Should AMEND_REVIEW require the agent to specify what was wrong with the original review?
138
+ - What is the maximum number of neighbors to include in an observation? (recommend: 5, confirm)
Debugger.md ADDED
@@ -0,0 +1,100 @@
1
+ # Debugger Prompt — CodeReviewEnv
2
+
3
+ You are an expert Python debugger working on **CodeReviewEnv**, an OpenEnv-compliant RL environment for the OpenEnv Hackathon. Your job is to diagnose and fix issues without breaking the architecture.
4
+
5
+ ---
6
+
7
+ ## Project Summary
8
+
9
+ This is a reinforcement learning environment where an LLM agent reviews Python codebases using a persistent dependency graph. The graph is stored in SQLite via SQLModel. The RL loop uses OpenEnv's step()/reset()/state() spec. There are 3 tasks (easy/medium/hard) with deterministic graders. The inference script must run in under 20 minutes on 2 vCPU / 8GB RAM.
10
+
11
+ ---
12
+
13
+ ## Architecture Rules — Never Violate These When Fixing
14
+
15
+ 1. **Persistence is SQLite/SQLModel** — do not switch to in-memory or another DB to fix a bug
16
+ 2. **Neighbor observations are always compressed summaries** — never fix a context issue by passing full neighbor code
17
+ 3. **Rewards must always be in 0.0–1.0** — if a reward bug exists, fix the computation, never remove the clip
18
+ 4. **inference.py uses OpenAI client only** — do not swap to direct HTTP calls or another client
19
+ 5. **[START]/[STEP]/[END] log format is fixed** — do not change field names or ordering to fix a logging bug
20
+ 6. **Hard grader uses temperature=0 and fixed rubric** — do not relax this to fix flaky test failures
21
+ 7. **episode step limit is 10** — do not raise this to fix timeout issues, optimize the agent instead
22
+
23
+ ---
24
+
25
+ ## How To Approach Any Bug
26
+
27
+ ### Step 1 — Locate
28
+ - Identify which layer the bug is in: parser → db → graph → observation_builder → environment → grader → server → inference
29
+ - Do not assume the bug is where the error surfaces — trace back to root cause
30
+
31
+ ### Step 2 — Check Interfaces First
32
+ - Before changing implementation, verify the interface contract between the broken component and its dependencies
33
+ - Use Context7 MCP to re-check library APIs if the bug involves SQLModel, NetworkX, pylint, bandit, FastAPI, or OpenEnv
34
+ - Do not fix a bug by changing a shared interface without checking all callers
35
+
36
+ ### Step 3 — Fix Minimally
37
+ - Make the smallest possible change that resolves the issue
38
+ - If the fix requires changing a DB schema, check whether a migration is needed and write it
39
+ - If the fix changes a Pydantic model, check all serialization/deserialization paths
40
+
41
+ ### Step 4 — Verify
42
+ - After fixing, confirm the completion criteria for the relevant phase still pass
43
+ - Run the specific test for the broken component
44
+ - If inference.py is affected, do a dry run and confirm [START]/[STEP]/[END] logs emit correctly
45
+
46
+ ---
47
+
48
+ ## Common Failure Modes To Check First
49
+
50
+ ### DB / Persistence
51
+ - DB not found on startup → check migrations.py auto-init logic
52
+ - Graph loads empty on second run → check upsert_node is committing correctly
53
+ - Annotations not persisting across reset() → check reset() only clears annotations, not nodes/edges
54
+
55
+ ### Parser
56
+ - AST parser crashes on type-annotated functions → check handling of ast.Constant vs ast.Str in Python 3.11 (ast.Str has been deprecated since Python 3.8; string literals parse as ast.Constant)
57
+ - Linter returns no output → check pylint/bandit are installed in the Docker image and PATH is correct
58
+ - Import resolution fails on relative imports → check the resolver handles both absolute and relative imports
59
+
60
+ ### RL Environment
61
+ - Reward outside 0.0–1.0 → find the unclipped computation in reward.py
62
+ - done never becomes True → check step limit counter and REQUEST_CHANGES/APPROVE handling
63
+ - reset() returns wrong module → check task registry is loading the correct starting module
64
+
65
+ ### Graders
66
+ - Easy grader always returns 0 → check linter_flags were populated in DB during parsing
67
+ - Hard grader is non-deterministic → confirm temperature=0 is set and the seed param is being passed
68
+ - Grader crashes on empty annotation → add null check before scoring
69
+
70
+ ### Server
71
+ - /health returns 404 → check route is registered in app.py
72
+ - /step rejects valid action → check discriminated union deserialization in Pydantic v2
73
+ - openenv validate fails → check openenv.yaml field names against spec exactly
74
+
75
+ ### Inference Script
76
+ - Runs over 20 minutes → profile which task is slowest, reduce max steps or add timeout per episode
77
+ - LLM returns unparseable action → check JSON mode is enabled, add fallback to APPROVE
78
+ - Missing [STEP] logs → check log emit is inside the step loop, not outside
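Review note: the "fallback to APPROVE" suggested for unparseable LLM output might look like the sketch below. It assumes the model returns a JSON object with a `type` key naming the action; that key name and the `parse_action` helper are hypothetical, and the real code would validate into the typed Pydantic action models rather than a plain dict.

```python
import json

# Action names as listed in the builder prompt.
VALID_ACTIONS = frozenset({
    "FLAG_BUG", "FLAG_STYLE", "FLAG_SECURITY", "FLAG_DEPENDENCY_ISSUE",
    "ADD_COMMENT", "REQUEST_CHANGES", "APPROVE", "REQUEST_CONTEXT",
    "AMEND_REVIEW",
})


def parse_action(raw: str) -> dict:
    """Parse a model reply into an action dict, falling back to APPROVE."""
    fallback = {"type": "APPROVE"}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    action_type = data.get("type")
    # Reject missing, non-string, or unknown action types.
    if not isinstance(action_type, str) or action_type not in VALID_ACTIONS:
        return fallback
    return data
```

Because the fallback ends the episode, it is worth logging how often it fires; a high rate points at the prompt or JSON mode rather than the parser.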
79
+
80
+ ### Docker
81
+ - Build fails on pylint/bandit install → add gcc and build-essential to apt-get
82
+ - DB not found inside container → check WORKDIR and DB path are consistent
83
+ - Port not exposed → confirm EXPOSE 7860 and uvicorn binds to 0.0.0.0
84
+
85
+ ---
86
+
87
+ ## When You Find An Ambiguity
88
+
89
+ If fixing the bug requires a design decision (e.g. "should reset() preserve REQUEST_CONTEXT history?"), **ask the user before implementing**. Do not make silent architectural decisions while debugging.
90
+
91
+ ---
92
+
93
+ ## Context To Always Include When Reporting A Fix
94
+
95
+ After fixing, always report:
96
+ - What the root cause was (one sentence)
97
+ - Which file(s) were changed
98
+ - Whether any DB schema changed (and if so, whether a migration was added)
99
+ - Whether any Pydantic model interface changed (and if so, which callers were updated)
100
+ - The specific test or check that now passes
Phases.md ADDED
@@ -0,0 +1,295 @@
1
+ # CodeReviewEnv — Phased Build Plan
2
+ ## For: LLM-Assisted Development
3
+
4
+ ---
5
+
6
+ ## 🧠 What You Are Building
7
+
8
+ An OpenEnv-compliant reinforcement learning environment where an LLM agent learns to perform **dependency-aware code review**.
9
+
10
+ The environment parses a Python codebase into a **persistent dependency graph** (nodes = modules, edges = import relationships). Each node stores compressed AST summaries, linter-generated ground truth issues, and agent-written review annotations.
11
+
12
+ The agent reviews one module per episode. It receives the **full code of the current module** plus **compressed AST summaries of its neighbors** (never full neighbor code — token budget). It takes multi-step actions (flag bugs, add comments, request context, amend upstream reviews). The environment rewards correct, well-attributed findings and penalizes false positives.
13
+
14
+ The final output is an **annotated dependency graph** — a machine-readable + human-readable map of the entire codebase with reviews on every module, including cross-module causal attributions.
15
+
16
+ This is differentiated from tools like CodeRabbit because:
17
+ - It models cascading dependency bugs (bug in B caused by design in A)
18
+ - Reviews are stored back into the graph and can be amended as the agent learns more
19
+ - It is an RL training/evaluation environment, not a static analysis tool
20
+ - The agent learns a policy over multi-step decisions, not a single LLM call
21
+
22
+ ---
23
+
24
+ ## 🗂️ Persistence Strategy
25
+
26
+ **Use SQLite via SQLModel** for all persistent state. Do NOT reparse the codebase on every run. The database stores:
27
+ - Parsed module nodes (code, AST summary, linter flags)
28
+ - Graph edges (dependency relationships + reasons)
29
+ - Review annotations (written by agent, updatable)
30
+ - Episode history (for reproducibility)
31
+ - Task definitions and ground truth
32
+
33
+ On startup: check if DB exists → if yes, load graph from DB → if no, parse codebase and populate DB.
34
+
35
+ This makes demos fast (parse once, review many times) and makes `reset()` cheap (clear annotations only, keep graph structure).
36
+
37
+ ---
38
+
39
+ ## 📁 Target Project Structure
40
+
41
+ ```
42
+ code-review-env/
43
+ ├── openenv.yaml
44
+ ├── Dockerfile
45
+ ├── README.md
46
+ ├── inference.py # Required by spec, root level
47
+ ├── requirements.txt
48
+ ├── pyproject.toml
49
+
50
+ ├── env/
51
+ │ ├── __init__.py
52
+ │ ├── environment.py # Main CodeReviewEnv class
53
+ │ ├── models.py # Pydantic: Observation, Action, Reward, GraphState
54
+ │ ├── graph.py # Graph construction, traversal, compression
55
+ │ ├── observation_builder.py # Assembles tiered observation per step
56
+ │ └── reward.py # Reward computation logic
57
+
58
+ ├── db/
59
+ │ ├── __init__.py
60
+ │ ├── schema.py # SQLModel table definitions
61
+ │ ├── store.py # DB read/write operations
62
+ │ └── migrations.py # Init and seed scripts
63
+
64
+ ├── parser/
65
+ │ ├── __init__.py
66
+ │ ├── ast_parser.py # AST extraction: signatures, imports, classes
67
+ │ ├── linter.py # Pylint + Bandit runner, stores results to DB
68
+ │ └── summarizer.py # Converts AST output → compressed node summary
69
+
70
+ ├── graders/
71
+ │ ├── __init__.py
72
+ │ ├── base_grader.py # Abstract grader interface
73
+ │ ├── easy_grader.py # Linter match — fully deterministic
74
+ │ ├── medium_grader.py # AST + line attribution match
75
+ │ └── hard_grader.py # LLM-as-judge, temp=0, seed=42, rubric-constrained
76
+
77
+ ├── tasks/
78
+ │ ├── __init__.py
79
+ │ ├── task_registry.py # Registers and loads tasks
80
+ │ ├── easy_task.py # Style/linter issue in isolated module
81
+ │ ├── medium_task.py # Logic bug with direct dependency context
82
+ │ └── hard_task.py # Cascading bug across 2+ modules
83
+
84
+ ├── server/
85
+ │ ├── __init__.py
86
+ │ └── app.py # FastAPI server exposing OpenEnv HTTP endpoints
87
+
88
+ ├── sample_codebase/ # Synthetic test codebase for demo
89
+ │ ├── auth.py
90
+ │ ├── checkout.py
91
+ │ ├── cart.py
92
+ │ ├── payments.py
93
+ │ └── config.py
94
+
95
+ └── tests/
96
+ ├── test_parser.py
97
+ ├── test_graders.py
98
+ ├── test_environment.py
99
+ └── test_inference.py
100
+ ```
101
+
102
+ ---
103
+
104
+ ## 📐 Core Data Models (Design Intent — Implementation Is Your Choice)
105
+
106
+ ### Graph Node
107
+ Stores everything about one module. Persisted in DB.
108
+ - module_id (filename/path)
109
+ - raw_code (full source)
110
+ - ast_summary (compressed: signatures, classes, exports)
111
+ - linter_flags (pre-computed ground truth from pylint/bandit)
112
+ - dependency_reason (why this module needs its neighbors — extracted from import context)
113
+ - review_annotation (agent-written, nullable, updatable)
114
+ - review_status (pending | in_progress | reviewed)
115
+ - review_summary (one-line, written at episode end)
116
+
117
+ ### Graph Edge
118
+ - source_module_id
119
+ - target_module_id
120
+ - edge_type (explicit_import | implicit_name_resolution)
121
+ - import_line (the actual import statement)
122
+ - weight (1.0 explicit, 0.5 implicit)
123
+
124
+ ### Observation (Pydantic)
125
+ - current_module: full code + full AST summary
126
+ - direct_dependencies: list of compressed node summaries (NOT full code)
127
+ - dependents: list of compressed node summaries
128
+ - existing_reviews: list of one-line review summaries from already-reviewed neighbors
129
+ - constraint_flags: any known forced decisions from upstream
130
+ - step_number: int
131
+ - episode_id: str
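Review note: the tiered observation described above (full code for the current module, summaries only for neighbors) could be assembled along these lines. This is a stdlib sketch with plain dataclasses and dicts standing in for the Pydantic models; `ModuleView`, `build_observation`, and the default cap of 5 neighbors are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleView:
    """Hypothetical read-only view of one graph node."""

    module_id: str
    raw_code: str
    ast_summary: str


def build_observation(
    current: ModuleView,
    dependencies: list[ModuleView],
    step_number: int,
    max_neighbors: int = 5,
) -> dict:
    """Build an observation that never exposes neighbor source code."""
    return {
        "current_module": {
            "module_id": current.module_id,
            "code": current.raw_code,
            "ast_summary": current.ast_summary,
        },
        # Neighbors contribute only their compressed summary, capped in count.
        "direct_dependencies": [
            {"module_id": dep.module_id, "summary": dep.ast_summary}
            for dep in dependencies[:max_neighbors]
        ],
        "step_number": step_number,
    }
```

Keeping `raw_code` out of the neighbor entries at construction time makes the token-budget constraint structural rather than something each caller has to remember.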
132
+
133
+ ### Action (Pydantic, discriminated union)
134
+ - APPROVE
135
+ - FLAG_STYLE(line: int, description: str)
136
+ - FLAG_BUG(line: int, description: str)
137
+ - FLAG_SECURITY(line: int, description: str)
138
+ - FLAG_DEPENDENCY_ISSUE(source_module: str, description: str)
139
+ - ADD_COMMENT(text: str)
140
+ - REQUEST_CHANGES(summary: str)
141
+ - REQUEST_CONTEXT(module_id: str) ← costs -0.1 reward, returns full code of neighbor
142
+ - AMEND_REVIEW(module_id: str, note: str) ← retroactively updates neighbor annotation
143
+
144
+ ### Reward (Pydantic)
145
+ - value: float (0.0–1.0)
146
+ - reason: str
147
+ - cumulative: float
148
+
149
+ ---
150
+
151
+ ## 🏗️ PHASE 1 — Foundation & Persistence
152
+ **Goal: Database schema, parser, graph construction. No RL yet.**
153
+
154
+ ### Tasks
155
+ 1. Define SQLModel schema for all tables (nodes, edges, annotations, episodes, tasks)
156
+ 2. Build `ast_parser.py` — extract from any .py file: all function signatures with type hints, all class definitions, all import statements with source resolution, all module-level constants
157
+ 3. Build `linter.py` — run pylint and bandit programmatically on a file, parse output into structured list of {line, severity, code, message}. Store results directly to DB as ground truth.
158
+ 4. Build `summarizer.py` — convert AST output into a compressed summary string under 100 tokens. Format: "exports: [fn(args)->return, ...] | issues: N | depends_on: [module, ...]"
159
+ 5. Build `store.py` — CRUD operations for all tables. Key operations: upsert_node, upsert_edge, get_node_with_neighbors, update_annotation, get_full_graph
160
+ 6. Build `graph.py` — on first run: parse all files in target directory → populate DB. On subsequent runs: load from DB. Build NetworkX DiGraph from DB records. Implement traversal order: topological sort weighted by betweenness centrality (leaf modules first, high-centrality modules last).
161
+ 7. Build `sample_codebase/` — 5 Python files with known injected issues: one style issue, one logic bug with a direct dependency cause, one security issue, one cascading bug where the root cause is 2 hops away. Document every injected issue in a ground_truth.json file.
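Review note: the summary format named in task 4 could be produced with the stdlib `ast` module roughly as below. This is a sketch only: it covers top-level functions and imports, ignores classes and constants, and does not enforce the token limit. `summarize` and its `issue_count` parameter are hypothetical, and `?` is an invented placeholder for a missing return annotation.

```python
import ast


def summarize(source: str, issue_count: int = 0) -> str:
    """Compress a module into 'exports | issues | depends_on' form."""
    tree = ast.parse(source)
    exports: list[str] = []
    deps: list[str] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(arg.arg for arg in node.args.args)
            ret = ast.unparse(node.returns) if node.returns is not None else "?"
            exports.append(f"{node.name}({args})->{ret}")
        elif isinstance(node, ast.Import):
            deps.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports without a module name are recorded as ".".
            deps.append(node.module or ".")
    unique_deps = sorted(set(deps))
    return (
        f"exports: [{', '.join(exports)}] | issues: {issue_count} "
        f"| depends_on: [{', '.join(unique_deps)}]"
    )
```

For a module that imports `config` and `auth` and defines `total(items, tax) -> float`, this yields a single line naming that one export and both dependencies.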
162
+
163
+ ### Completion Criteria
164
+ - `python -m parser.ast_parser sample_codebase/` populates DB with all nodes and edges
165
+ - DB persists across runs (second run loads from DB, does not reparse)
166
+ - `python -m db.store` can query a node and return its summary and neighbors
167
+ - ground_truth.json matches linter output for easy/medium tasks
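Review note: the traversal order in task 6 (dependencies first, high-centrality modules last) could be approximated without NetworkX as a Kahn-style topological sort with a priority queue. This sketch assumes "leaf" means a module with no imports and takes centrality scores as a precomputed input instead of computing betweenness itself. `review_order` is a hypothetical helper, and the edges in the usage example are invented, not read from `sample_codebase/`.

```python
import heapq


def review_order(deps: dict[str, set[str]], centrality: dict[str, float]) -> list[str]:
    """Order modules so each appears after everything it imports.

    `deps` maps a module to the modules it imports. Among modules that are
    ready at the same time, lower centrality is reviewed first.
    """
    remaining = {module: set(imports) for module, imports in deps.items()}
    for imports in deps.values():
        for imported in imports:
            remaining.setdefault(imported, set())

    dependents: dict[str, set[str]] = {module: set() for module in remaining}
    for module, imports in remaining.items():
        for imported in imports:
            dependents[imported].add(module)

    ready = [(centrality.get(m, 0.0), m) for m, imports in remaining.items() if not imports]
    heapq.heapify(ready)

    order: list[str] = []
    while ready:
        _, module = heapq.heappop(ready)
        order.append(module)
        for parent in dependents[module]:
            remaining[parent].discard(module)
            if not remaining[parent]:
                heapq.heappush(ready, (centrality.get(parent, 0.0), parent))

    if len(order) != len(remaining):
        raise ValueError("import cycle detected; no topological order exists")
    return order


# Invented edges for illustration only.
example = {
    "checkout": {"cart", "payments"},
    "cart": {"config"},
    "payments": {"config", "auth"},
    "auth": {"config"},
}
print(review_order(example, centrality={}))
```

Raising on cycles is a design choice for the sketch; a real codebase with circular imports would need an explicit tie-breaking rule instead.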
168
+
169
+ ---
170
+
171
+ ## 🏗️ PHASE 2 — OpenEnv Core (RL Environment)
172
+ **Goal: Full step()/reset()/state() loop with reward. This is the RL part.**
173
+
174
+ ### Tasks
175
+ 1. Build `models.py` — all Pydantic models: Observation, Action (discriminated union), Reward, GraphState, EpisodeRecord. Must be fully typed.
176
+ 2. Build `observation_builder.py` — given a module_id and current graph state, assemble the tiered observation: full code for current module, compressed summaries for neighbors (pulled from DB), existing review annotations for already-reviewed neighbors, constraint flags
177
+ 3. Build `reward.py` — implement reward logic:
178
+ - Easy: compare agent flags against linter ground truth. Correct flag = +0.5, false positive = -0.2, missed critical = -0.4
179
+ - Medium: check flag + line number within ±3 lines of ground truth = +0.5, correct comment attribution = +0.3
180
+ - Hard: call hard_grader with agent's FLAG_DEPENDENCY_ISSUE and the known root cause. Score returned by judge × 0.8 as reward.
181
+ - REQUEST_CONTEXT action always costs -0.1 (thinking cost)
182
+ - AMEND_REVIEW with correct attribution = +0.4 (high reward — this is the key cascading behavior)
183
+ - Episode completion bonus: +0.2 if all critical issues found, -0.1 if APPROVE on module with known critical bugs
184
+ 4. Build `graders/` — implement all three graders per spec above. Hard grader must use OpenAI client (per competition spec), temperature=0, fixed rubric prompt stored as a constant.
185
+ 5. Build `environment.py` — main class implementing full OpenEnv interface:
186
+ - `reset(task_id)` → clears annotations for task modules, returns first observation
187
+ - `step(action)` → validates action, updates graph annotations in DB, computes reward, returns (obs, reward, done, info)
188
+ - `state()` → returns full GraphState (serialized NetworkX graph + all annotations)
189
+ - Episode ends when: agent calls APPROVE or REQUEST_CHANGES, OR step limit reached (max 10 steps)
190
+ 6. Build `tasks/` — register 3 tasks pointing to specific modules in sample_codebase with known ground truth issues
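Review note: the easy-task reward rule in task 3 combines positive and negative deltas, while the plan also requires rewards in 0.0–1.0. One way to reconcile the two is to sum the raw deltas and clip the result, as sketched below. Whether clipping happens per step or per episode is not specified in the plan, so this aggregation is an assumption; `easy_reward` and its set-of-line-numbers inputs are hypothetical.

```python
def clip_reward(raw: float) -> float:
    """Clamp a raw score into the required 0.0 to 1.0 range."""
    return max(0.0, min(1.0, raw))


def easy_reward(
    flagged_lines: set[int],
    truth_lines: set[int],
    critical_lines: set[int],
) -> float:
    """Score agent flags against linter ground truth for the easy task."""
    correct = flagged_lines & truth_lines
    false_positives = flagged_lines - truth_lines
    missed_critical = critical_lines - flagged_lines
    raw = (
        0.5 * len(correct)            # correct flag = +0.5
        - 0.2 * len(false_positives)  # false positive = -0.2
        - 0.4 * len(missed_critical)  # missed critical = -0.4
    )
    return clip_reward(raw)
```

With this shape, penalties such as the REQUEST_CONTEXT cost reduce the raw score but can never push the returned value below 0.0.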
191
+
192
+ ### Completion Criteria
193
+ - `env.reset("easy_task")` returns a valid typed Observation
194
+ - `env.step(FLAG_BUG(line=12, description="null risk"))` returns reward > 0 for correct flag
195
+ - `env.state()` returns serializable graph with annotations
196
+ - Full episode runs without error on all 3 tasks
197
+ - Reward values all fall in 0.0–1.0 range
198
+
199
+ ---
200
+
201
+ ## 🏗️ PHASE 3 — HTTP Server & OpenEnv Spec Compliance
202
+ **Goal: Wrap environment in FastAPI, pass openenv validate.**
203
+
204
+ ### Tasks
205
+ 1. Build `server/app.py` — FastAPI app exposing:
206
+ - POST /reset → calls env.reset(), returns Observation JSON
207
+ - POST /step → calls env.step(action), returns (obs, reward, done, info) JSON
208
+ - GET /state → calls env.state(), returns GraphState JSON
209
+ - GET /health → returns 200 (required for HF Space ping)
210
+ 2. Build `openenv.yaml` — fill all required metadata: name, version, description, tasks list, observation_space, action_space, reward_range
211
+ 3. Run `openenv validate` — fix all compliance errors
212
+ 4. Confirm all Pydantic models serialize/deserialize correctly over HTTP
213
+
214
+ ### Completion Criteria
215
+ - `openenv validate` passes with no errors
216
+ - All endpoints return correct typed responses
217
+ - GET /health returns 200
218
+
219
+ ---
220
+
221
+ ## 🏗️ PHASE 4 — Inference Script
222
+ **Goal: Build inference.py that runs Gemma 4 as the agent. This is what judges auto-run.**
223
+
224
+ ### Critical Requirements (Non-Negotiable)
225
+ - File must be named `inference.py` at root
226
+ - Use OpenAI client for all LLM calls
227
+ - Read API_BASE_URL, MODEL_NAME, HF_TOKEN from environment variables
228
+ - Emit structured stdout logs in EXACTLY this format:
229
+ ```
230
+ [START] task=<task_id> episode=<n>
231
+ [STEP] step=<n> action=<action_type> reward=<float> cumulative=<float>
232
+ [END] task=<task_id> total_reward=<float> steps=<n>
233
+ ```
234
+ - Must complete all 3 tasks in under 20 minutes total
235
+ - Must run on 2 vCPU / 8GB RAM
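Review note: since the log format is checked exactly, it may help to build the three line types in one place. The helpers below are a sketch; the function names are hypothetical, and the two-decimal float formatting is an assumption because the plan only says `<float>`.

```python
def start_line(task_id: str, episode: int) -> str:
    """Format the [START] line for one episode."""
    return f"[START] task={task_id} episode={episode}"


def step_line(step: int, action_type: str, reward: float, cumulative: float) -> str:
    """Format one [STEP] line; emit this inside the step loop."""
    return (
        f"[STEP] step={step} action={action_type} "
        f"reward={reward:.2f} cumulative={cumulative:.2f}"
    )


def end_line(task_id: str, total_reward: float, steps: int) -> str:
    """Format the [END] line after the episode finishes."""
    return f"[END] task={task_id} total_reward={total_reward:.2f} steps={steps}"


if __name__ == "__main__":
    print(start_line("easy_task", 1))
    print(step_line(1, "FLAG_BUG", 0.5, 0.5))
    print(end_line("easy_task", 0.5, 1))
```

Centralizing the format strings means a field-order mistake is fixed once instead of at every print call.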
236
+
237
+ ### Tasks
238
+ 1. Build the agent loop — for each task: reset env, loop step() until done, collect rewards
239
+ 2. Build the LLM action parser — send observation to model with a structured prompt, parse response into typed Action. Use JSON mode or structured output. Handle parse failures gracefully (default to APPROVE with penalty).
240
+ 3. Build the action prompt — system prompt explaining the environment, action space, and output format. Include the compressed observation in user message. Tell model to output JSON action only.
241
+ 4. Implement all 3 task runs sequentially
242
+ 5. Emit all required log lines to stdout
243
+ 6. Final output: baseline scores for all 3 tasks printed to stdout
244
+
245
+ ### Completion Criteria
246
+ - Script runs end to end without error
247
+ - All [START]/[STEP]/[END] logs emitted correctly
248
+ - Produces a score for each task between 0.0–1.0
249
+ - Completes in under 20 minutes
250
+
251
+ ---
252
+
253
+ ## 🏗️ PHASE 5 — Containerization & Deployment
254
+ **Goal: Docker build works, HF Space deploys, pre-validation script passes.**
255
+
256
+ ### Tasks
257
+ 1. Write `Dockerfile`:
258
+ - Base: python:3.11-slim
259
+ - Install system deps for pylint, bandit, networkx
260
+ - Copy project, install requirements
261
+ - On container start: run parser to populate DB if not exists, then start FastAPI server
262
+ - Expose port 7860 (HF Spaces default)
263
+ 2. Write `README.md` with all required sections: environment description and motivation, observation and action space definitions, all 3 task descriptions with difficulty, setup instructions, baseline scores
264
+ 3. Run pre-submission validation script — fix all failures
265
+ 4. Deploy to HF Space with `openenv push`
266
+ 5. Confirm Space URL returns 200 on GET /health and responds to POST /reset
267
+
268
+ ### Completion Criteria
269
+ - `docker build .` succeeds
270
+ - `docker run -p 7860:7860` starts server cleanly
271
+ - HF Space URL responds to reset()
272
+ - Pre-validation script passes all checks
273
+
274
+ ---
275
+
276
+ ## ⏱️ Suggested Time Allocation (Given ~36 hrs remaining)
277
+
278
+ | Phase | Time |
279
+ |---|---|
280
+ | Phase 1 — Foundation | 6 hrs |
281
+ | Phase 2 — RL Environment | 8 hrs |
282
+ | Phase 3 — Server + Spec | 3 hrs |
283
+ | Phase 4 — Inference Script | 4 hrs |
284
+ | Phase 5 — Docker + Deploy | 3 hrs |
285
+ | Buffer / debugging | 4 hrs |
286
+
287
+ ---
288
+
289
+ ## ⚠️ Known Risk Areas (Watch These)
290
+
291
+ 1. **Hard grader reproducibility** — document judge prompt and seed explicitly
292
+ 2. **DB migration on fresh Docker build** — first run must auto-populate DB from sample_codebase
293
+ 3. **Inference script runtime** — test full 3-task run locally before submitting, must be under 20 min
294
+ 4. **openenv validate strictness** — run it early in Phase 3, not at the end
295
+ 5. **Reward always in 0.0–1.0** — clip all reward values, graders must never return outside range
code-review-env/Dockerfile ADDED
@@ -0,0 +1,7 @@
+ FROM python:3.11-slim
+
+ WORKDIR /app
+ COPY requirements.txt /app/
+ RUN pip install --no-cache-dir -r requirements.txt
+ COPY . /app
+ CMD ["python", "-m", "parser.ast_parser", "sample_codebase/"]
code-review-env/README.md ADDED
@@ -0,0 +1,11 @@
+ # CodeReviewEnv
+
+ Phase 1 foundation for a dependency-aware code review environment.
+
+ ## Quickstart
+
+ ```bash
+ pip install -r requirements.txt
+ python -m parser.ast_parser sample_codebase/
+ python -m db.store --module checkout
+ ```
code-review-env/db/__init__.py ADDED
@@ -0,0 +1 @@
+ """Database package for CodeReviewEnv."""
code-review-env/db/migrations.py ADDED
@@ -0,0 +1,28 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from pathlib import Path
4
+
5
+ from sqlmodel import SQLModel, create_engine
6
+
7
+
8
+ def get_default_db_path() -> Path:
9
+     project_root = Path(__file__).resolve().parents[1]
10
+     return project_root / "code_review_env.db"
11
+
12
+
13
+ def get_engine(db_path: str | Path | None = None, echo: bool = False):
14
+     path = Path(db_path) if db_path else get_default_db_path()
15
+     path.parent.mkdir(parents=True, exist_ok=True)
16
+     return create_engine(f"sqlite:///{path}", echo=echo)
17
+
18
+
19
+ def init_db(db_path: str | Path | None = None, echo: bool = False) -> None:
20
+     from db import schema  # noqa: F401
21
+
22
+     engine = get_engine(db_path=db_path, echo=echo)
23
+     SQLModel.metadata.create_all(engine)
24
+
25
+
26
+ if __name__ == "__main__":
27
+     init_db()
28
+     print("Database initialized")
code-review-env/db/schema.py ADDED
@@ -0,0 +1,91 @@
1
+ from __future__ import annotations
2
+
3
+ from datetime import UTC, datetime
4
+ from enum import StrEnum
5
+ from typing import Optional
6
+
7
+ from sqlmodel import Field, SQLModel
8
+
9
+
10
+ class EdgeType(StrEnum):
11
+ EXPLICIT_IMPORT = "explicit_import"
12
+ IMPLICIT_NAME_RESOLUTION = "implicit_name_resolution"
13
+
14
+
15
+ class ReviewStatus(StrEnum):
16
+ PENDING = "pending"
17
+ IN_PROGRESS = "in_progress"
18
+ REVIEWED = "reviewed"
19
+
20
+
21
+ class Severity(StrEnum):
22
+ LOW = "low"
23
+ MEDIUM = "medium"
24
+ HIGH = "high"
25
+
26
+
27
+ class ModuleNode(SQLModel, table=True):
28
+ id: Optional[int] = Field(default=None, primary_key=True)
29
+ source_root: str = Field(index=True)
30
+ module_id: str = Field(index=True)
31
+ raw_code: str
32
+ ast_summary: str
33
+ dependency_reason: str = ""
34
+ review_annotation: Optional[str] = None
35
+ review_status: ReviewStatus = Field(default=ReviewStatus.PENDING)
36
+ review_summary: Optional[str] = None
37
+ created_at: datetime = Field(default_factory=lambda: datetime.now(UTC))
38
+ updated_at: datetime = Field(default_factory=lambda: datetime.now(UTC))
39
+
40
+
41
+ class ModuleEdge(SQLModel, table=True):
42
+ id: Optional[int] = Field(default=None, primary_key=True)
43
+ source_root: str = Field(index=True)
44
+ source_module_id: str = Field(index=True)
45
+ target_module_id: str = Field(index=True)
46
+ edge_type: EdgeType = Field(default=EdgeType.EXPLICIT_IMPORT)
47
+ import_line: str
48
+ weight: float = 1.0
49
+
50
+
51
+ class LinterFinding(SQLModel, table=True):
52
+ id: Optional[int] = Field(default=None, primary_key=True)
53
+ source_root: str = Field(index=True)
54
+ module_id: str = Field(index=True)
55
+ tool: str = Field(index=True)
56
+ line: int
57
+ severity: Severity
58
+ code: str
59
+ message: str
60
+
61
+
62
+ class ReviewAnnotation(SQLModel, table=True):
63
+ id: Optional[int] = Field(default=None, primary_key=True)
64
+ source_root: str = Field(index=True)
65
+ module_id: str = Field(index=True)
66
+ episode_id: str = Field(index=True)
67
+ step_number: int
68
+ action_type: str
69
+ note: str
70
+ created_at: datetime = Field(default_factory=lambda: datetime.now(UTC))
71
+
72
+
73
+ class EpisodeRecord(SQLModel, table=True):
74
+ id: Optional[int] = Field(default=None, primary_key=True)
75
+ source_root: str = Field(index=True)
76
+ episode_id: str = Field(index=True)
77
+ task_id: str = Field(index=True)
78
+ module_id: str = Field(index=True)
79
+ total_steps: int
80
+ cumulative_reward: float = 0.0
81
+ created_at: datetime = Field(default_factory=lambda: datetime.now(UTC))
82
+
83
+
84
+ class TaskDefinition(SQLModel, table=True):
85
+ id: Optional[int] = Field(default=None, primary_key=True)
86
+ source_root: str = Field(index=True)
87
+ task_id: str = Field(index=True)
88
+ task_level: str = Field(index=True)
89
+ target_module_id: str = Field(index=True)
90
+ description: str
91
+ ground_truth_ref: str
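The schema's string enums are later built from raw strings (for example `Severity(str(finding["severity"]))` in db/store.py), and that constructor raises `ValueError` on an unknown label. The sketch below shows this round-trip with a tolerant wrapper. The `parse_severity` helper is hypothetical, and a `(str, Enum)` mixin stands in for `StrEnum` so the snippet also runs on interpreters older than 3.11.

```python
from enum import Enum


class Severity(str, Enum):
    """Stand-in for the schema's StrEnum; same member names and values."""

    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def parse_severity(raw: object, default: Severity = Severity.LOW) -> Severity:
    """Hypothetical helper: map a raw label to a Severity, falling back to a default."""
    try:
        return Severity(str(raw).lower())
    except ValueError:
        return default


known = parse_severity("HIGH")        # case-insensitive lookup by value
unknown = parse_severity("critical")  # not a member, so the default is used
```

Whether an unknown label should fall back silently or fail loudly is a design choice; the commit's store code takes the strict route.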
code-review-env/db/store.py ADDED
@@ -0,0 +1,384 @@
1
+ from __future__ import annotations
2
+
3
+ import argparse
4
+ from dataclasses import dataclass
5
+ from datetime import UTC, datetime
6
+ from pathlib import Path
7
+ from typing import Iterator, Optional
8
+
9
+ from pydantic import BaseModel
10
+ from sqlmodel import Session, delete, select
11
+
12
+ from db.migrations import get_default_db_path, get_engine, init_db
13
+ from db.schema import (
14
+ EdgeType,
15
+ LinterFinding,
16
+ ModuleEdge,
17
+ ModuleNode,
18
+ ReviewAnnotation,
19
+ ReviewStatus,
20
+ Severity,
21
+ )
22
+
23
+
24
+ @dataclass
25
+ class DBConfig:
26
+ source_root: str
27
+ db_path: Path
28
+
29
+
30
+ class NeighborSummary(BaseModel):
31
+ module_id: str
32
+ ast_summary: str
33
+ review_summary: Optional[str]
34
+
35
+
36
+ class NodeWithNeighbors(BaseModel):
37
+ module_id: str
38
+ ast_summary: str
39
+ review_status: ReviewStatus
40
+ neighbors: list[NeighborSummary]
41
+
42
+
43
+ class GraphNodeRecord(BaseModel):
44
+ module_id: str
45
+ ast_summary: str
46
+ review_status: ReviewStatus
47
+
48
+
49
+ class GraphEdgeRecord(BaseModel):
50
+ source_module_id: str
51
+ target_module_id: str
52
+ weight: float
53
+ import_line: str
54
+
55
+
56
+ class GraphSnapshot(BaseModel):
57
+ nodes: list[GraphNodeRecord]
58
+ edges: list[GraphEdgeRecord]
59
+
60
+
61
+ class Store:
62
+ def __init__(self, source_root: str, db_path: str | Path | None = None) -> None:
63
+ self.config = DBConfig(
64
+ source_root=str(Path(source_root).resolve()),
65
+ db_path=Path(db_path) if db_path else get_default_db_path(),
66
+ )
67
+ init_db(db_path=self.config.db_path)
68
+ self.engine = get_engine(self.config.db_path)
69
+
70
+ def session(self) -> Session:
71
+ # Session is its own context manager, so callers can write `with store.session() as s:`.
72
+ return Session(self.engine)
73
+
74
+ def upsert_node(
75
+ self,
76
+ module_id: str,
77
+ raw_code: str,
78
+ ast_summary: str,
79
+ dependency_reason: str,
80
+ ) -> ModuleNode:
81
+ with Session(self.engine) as session:
82
+ existing = session.exec(
83
+ select(ModuleNode).where(
84
+ ModuleNode.source_root == self.config.source_root,
85
+ ModuleNode.module_id == module_id,
86
+ )
87
+ ).first()
88
+ if existing:
89
+ existing.raw_code = raw_code
90
+ existing.ast_summary = ast_summary
91
+ existing.dependency_reason = dependency_reason
92
+ existing.updated_at = datetime.now(UTC)
93
+ session.add(existing)
94
+ session.commit()
95
+ session.refresh(existing)
96
+ return existing
97
+
98
+ node = ModuleNode(
99
+ source_root=self.config.source_root,
100
+ module_id=module_id,
101
+ raw_code=raw_code,
102
+ ast_summary=ast_summary,
103
+ dependency_reason=dependency_reason,
104
+ )
105
+ session.add(node)
106
+ session.commit()
107
+ session.refresh(node)
108
+ return node
109
+
110
+ def upsert_edge(
111
+ self,
112
+ source_module_id: str,
113
+ target_module_id: str,
114
+ edge_type: EdgeType,
115
+ import_line: str,
116
+ weight: float,
117
+ ) -> ModuleEdge:
118
+ with Session(self.engine) as session:
119
+ existing = session.exec(
120
+ select(ModuleEdge).where(
121
+ ModuleEdge.source_root == self.config.source_root,
122
+ ModuleEdge.source_module_id == source_module_id,
123
+ ModuleEdge.target_module_id == target_module_id,
124
+ ModuleEdge.import_line == import_line,
125
+ )
126
+ ).first()
127
+ if existing:
128
+ existing.edge_type = edge_type
129
+ existing.weight = weight
130
+ session.add(existing)
131
+ session.commit()
132
+ session.refresh(existing)
133
+ return existing
134
+
135
+ edge = ModuleEdge(
136
+ source_root=self.config.source_root,
137
+ source_module_id=source_module_id,
138
+ target_module_id=target_module_id,
139
+ edge_type=edge_type,
140
+ import_line=import_line,
141
+ weight=weight,
142
+ )
143
+ session.add(edge)
144
+ session.commit()
145
+ session.refresh(edge)
146
+ return edge
147
+
148
+ def replace_findings_for_module(self, module_id: str, findings: list[dict[str, str | int]]) -> None:
149
+ with Session(self.engine) as session:
150
+ session.exec(
151
+ delete(LinterFinding).where(
152
+ LinterFinding.source_root == self.config.source_root,
153
+ LinterFinding.module_id == module_id,
154
+ )
155
+ )
156
+ for finding in findings:
157
+ session.add(
158
+ LinterFinding(
159
+ source_root=self.config.source_root,
160
+ module_id=module_id,
161
+ tool=str(finding["tool"]),
162
+ line=int(finding["line"]),
163
+ severity=Severity(str(finding["severity"])),
164
+ code=str(finding["code"]),
165
+ message=str(finding["message"]),
166
+ )
167
+ )
168
+ session.commit()
169
+
170
+ def get_findings(self, module_id: str) -> list[LinterFinding]:
171
+ with Session(self.engine) as session:
172
+ return list(
173
+ session.exec(
174
+ select(LinterFinding).where(
175
+ LinterFinding.source_root == self.config.source_root,
176
+ LinterFinding.module_id == module_id,
177
+ )
178
+ ).all()
179
+ )
180
+
181
+ def get_node(self, module_id: str) -> Optional[ModuleNode]:
182
+ with Session(self.engine) as session:
183
+ return session.exec(
184
+ select(ModuleNode).where(
185
+ ModuleNode.source_root == self.config.source_root,
186
+ ModuleNode.module_id == module_id,
187
+ )
188
+ ).first()
189
+
190
+ def get_node_with_neighbors(self, module_id: str) -> Optional[NodeWithNeighbors]:
191
+ with Session(self.engine) as session:
192
+ node = session.exec(
193
+ select(ModuleNode).where(
194
+ ModuleNode.source_root == self.config.source_root,
195
+ ModuleNode.module_id == module_id,
196
+ )
197
+ ).first()
198
+ if not node:
199
+ return None
200
+
201
+ outgoing = list(
202
+ session.exec(
203
+ select(ModuleEdge).where(
204
+ ModuleEdge.source_root == self.config.source_root,
205
+ ModuleEdge.source_module_id == module_id,
206
+ )
207
+ ).all()
208
+ )
209
+ incoming = list(
210
+ session.exec(
211
+ select(ModuleEdge).where(
212
+ ModuleEdge.source_root == self.config.source_root,
213
+ ModuleEdge.target_module_id == module_id,
214
+ )
215
+ ).all()
216
+ )
217
+
218
+ neighbor_ids = {edge.target_module_id for edge in outgoing}
219
+ neighbor_ids.update(edge.source_module_id for edge in incoming)
220
+
221
+ neighbors: list[NeighborSummary] = []
222
+ for neighbor_id in sorted(neighbor_ids):
223
+ neighbor = session.exec(
224
+ select(ModuleNode).where(
225
+ ModuleNode.source_root == self.config.source_root,
226
+ ModuleNode.module_id == neighbor_id,
227
+ )
228
+ ).first()
229
+ if neighbor:
230
+ neighbors.append(
231
+ NeighborSummary(
232
+ module_id=neighbor.module_id,
233
+ ast_summary=neighbor.ast_summary,
234
+ review_summary=neighbor.review_summary,
235
+ )
236
+ )
237
+
238
+ return NodeWithNeighbors(
239
+ module_id=node.module_id,
240
+ ast_summary=node.ast_summary,
241
+ review_status=node.review_status,
242
+ neighbors=neighbors,
243
+ )
244
+
245
+ def update_annotation(
246
+ self,
247
+ module_id: str,
248
+ episode_id: str,
249
+ step_number: int,
250
+ action_type: str,
251
+ note: str,
252
+ review_summary: str | None = None,
253
+ review_status: ReviewStatus | None = None,
254
+ ) -> None:
255
+ with Session(self.engine) as session:
256
+ node = session.exec(
257
+ select(ModuleNode).where(
258
+ ModuleNode.source_root == self.config.source_root,
259
+ ModuleNode.module_id == module_id,
260
+ )
261
+ ).first()
262
+ if not node:
263
+ raise ValueError(f"Unknown module: {module_id}")
264
+
265
+ node.review_annotation = note
266
+ if review_summary is not None:
267
+ node.review_summary = review_summary
268
+ if review_status is not None:
269
+ node.review_status = review_status
270
+ node.updated_at = datetime.now(UTC)
271
+
272
+ session.add(node)
273
+ session.add(
274
+ ReviewAnnotation(
275
+ source_root=self.config.source_root,
276
+ module_id=module_id,
277
+ episode_id=episode_id,
278
+ step_number=step_number,
279
+ action_type=action_type,
280
+ note=note,
281
+ )
282
+ )
283
+ session.commit()
284
+
285
+ def get_full_graph(self) -> GraphSnapshot:
286
+ with Session(self.engine) as session:
287
+ nodes = list(
288
+ session.exec(
289
+ select(ModuleNode).where(ModuleNode.source_root == self.config.source_root)
290
+ ).all()
291
+ )
292
+ edges = list(
293
+ session.exec(
294
+ select(ModuleEdge).where(ModuleEdge.source_root == self.config.source_root)
295
+ ).all()
296
+ )
297
+
298
+ return GraphSnapshot(
299
+ nodes=[
300
+ GraphNodeRecord(
301
+ module_id=node.module_id,
302
+ ast_summary=node.ast_summary,
303
+ review_status=node.review_status,
304
+ )
305
+ for node in nodes
306
+ ],
307
+ edges=[
308
+ GraphEdgeRecord(
309
+ source_module_id=edge.source_module_id,
310
+ target_module_id=edge.target_module_id,
311
+ weight=edge.weight,
312
+ import_line=edge.import_line,
313
+ )
314
+ for edge in edges
315
+ ],
316
+ )
317
+
318
+ def has_nodes(self) -> bool:
319
+ with Session(self.engine) as session:
320
+ first_node = session.exec(
321
+ select(ModuleNode.id).where(ModuleNode.source_root == self.config.source_root)
322
+ ).first()
323
+ return first_node is not None
324
+
325
+ def clear_source_graph(self) -> None:
326
+ with Session(self.engine) as session:
327
+ session.exec(
328
+ delete(ReviewAnnotation).where(
329
+ ReviewAnnotation.source_root == self.config.source_root
330
+ )
331
+ )
332
+ session.exec(
333
+ delete(LinterFinding).where(
334
+ LinterFinding.source_root == self.config.source_root
335
+ )
336
+ )
337
+ session.exec(
338
+ delete(ModuleEdge).where(
339
+ ModuleEdge.source_root == self.config.source_root
340
+ )
341
+ )
342
+ session.exec(
343
+ delete(ModuleNode).where(
344
+ ModuleNode.source_root == self.config.source_root
345
+ )
346
+ )
347
+ session.commit()
348
+
349
+ def clear_annotations(self) -> None:
350
+ with Session(self.engine) as session:
351
+ nodes = list(
352
+ session.exec(
353
+ select(ModuleNode).where(ModuleNode.source_root == self.config.source_root)
354
+ ).all()
355
+ )
356
+ for node in nodes:
357
+ node.review_annotation = None
358
+ node.review_summary = None
359
+ node.review_status = ReviewStatus.PENDING
360
+ node.updated_at = datetime.now(UTC)
361
+ session.add(node)
362
+ session.commit()
363
+
364
+
365
+ def _build_parser() -> argparse.ArgumentParser:
366
+ parser = argparse.ArgumentParser(description="Store query helper")
367
+ parser.add_argument("--root", default="sample_codebase", help="Source root directory")
368
+ parser.add_argument("--db-path", default=None, help="SQLite path")
369
+ parser.add_argument("--module", required=True, help="Module id (without .py)")
370
+ return parser
371
+
372
+
373
+ def main() -> None:
374
+ args = _build_parser().parse_args()
375
+ store = Store(source_root=args.root, db_path=args.db_path)
376
+ result = store.get_node_with_neighbors(args.module)
377
+ if result is None:
378
+ print(f"Module '{args.module}' not found")
379
+ return
380
+ print(result.model_dump_json(indent=2))
381
+
382
+
383
+ if __name__ == "__main__":
384
+ main()
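The `upsert_node` and `upsert_edge` methods above follow a select-then-update-or-insert pattern keyed on `source_root` plus the module identifier. The sketch below shows the same pattern with the standard-library `sqlite3` module so it runs without SQLModel. The table name `module_node`, its reduced column set, and the function signature are assumptions made for illustration.

```python
import sqlite3


def upsert_node(conn: sqlite3.Connection, source_root: str, module_id: str, ast_summary: str) -> int:
    """Update the row matching (source_root, module_id), or insert it if absent."""
    row = conn.execute(
        "SELECT id FROM module_node WHERE source_root = ? AND module_id = ?",
        (source_root, module_id),
    ).fetchone()
    if row is not None:
        conn.execute(
            "UPDATE module_node SET ast_summary = ? WHERE id = ?",
            (ast_summary, row[0]),
        )
        conn.commit()
        return row[0]
    cursor = conn.execute(
        "INSERT INTO module_node (source_root, module_id, ast_summary) VALUES (?, ?, ?)",
        (source_root, module_id, ast_summary),
    )
    conn.commit()
    return cursor.lastrowid


# In-memory database with a hypothetical, simplified table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE module_node ("
    "id INTEGER PRIMARY KEY, source_root TEXT, module_id TEXT, ast_summary TEXT)"
)
first_id = upsert_node(conn, "/demo", "cart", "summary v1")
second_id = upsert_node(conn, "/demo", "cart", "summary v2")
rows = conn.execute("SELECT module_id, ast_summary FROM module_node").fetchall()
```

The second call reuses the existing row, so re-parsing a codebase refreshes summaries instead of piling up duplicates.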
code-review-env/env/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Environment package for CodeReviewEnv."""
code-review-env/env/environment.py ADDED
@@ -0,0 +1,6 @@
1
+ """Phase 2 implementation placeholder."""
2
+
3
+
4
+ class CodeReviewEnv:
5
+     def __init__(self) -> None:
6
+         raise NotImplementedError("Phase 2 implementation pending")
code-review-env/env/graph.py ADDED
@@ -0,0 +1,105 @@
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass
4
+ from pathlib import Path
5
+
6
+ import networkx as nx
7
+ from sqlmodel import Session, select
8
+
9
+ from db.schema import ModuleEdge, ModuleNode
10
+ from db.store import Store
11
+ from parser.ast_parser import parse_directory
12
+
13
+
14
+ @dataclass
15
+ class GraphLoadResult:
16
+ graph: nx.DiGraph
17
+ loaded_from_cache: bool
18
+
19
+
20
+ class DependencyGraph:
21
+ def __init__(self, target_dir: str | Path, db_path: str | Path | None = None) -> None:
22
+ self.target_dir = Path(target_dir).resolve()
23
+ self.store = Store(source_root=str(self.target_dir), db_path=db_path)
24
+
25
+ def load_or_build(self, force_reparse: bool = False) -> GraphLoadResult:
26
+ if force_reparse or not self.store.has_nodes():
27
+ parse_directory(self.target_dir, db_path=str(self.store.config.db_path))
28
+ loaded_from_cache = False
29
+ else:
30
+ loaded_from_cache = True
31
+ return GraphLoadResult(graph=self._build_graph(), loaded_from_cache=loaded_from_cache)
32
+
33
+ def _build_graph(self) -> nx.DiGraph:
34
+ graph = nx.DiGraph()
35
+ with Session(self.store.engine) as session:
36
+ nodes = list(
37
+ session.exec(
38
+ select(ModuleNode).where(ModuleNode.source_root == self.store.config.source_root)
39
+ ).all()
40
+ )
41
+ edges = list(
42
+ session.exec(
43
+ select(ModuleEdge).where(ModuleEdge.source_root == self.store.config.source_root)
44
+ ).all()
45
+ )
46
+
47
+ for node in nodes:
48
+ graph.add_node(
49
+ node.module_id,
50
+ ast_summary=node.ast_summary,
51
+ review_status=node.review_status.value,
52
+ )
53
+
54
+ for edge in edges:
55
+ graph.add_edge(
56
+ edge.source_module_id,
57
+ edge.target_module_id,
58
+ import_line=edge.import_line,
59
+ edge_type=edge.edge_type.value,
60
+ weight=edge.weight,
61
+ )
62
+
63
+ return graph
64
+
65
+ def traversal_order(self, graph: nx.DiGraph | None = None) -> list[str]:
66
+ graph = graph or self._build_graph()
67
+ if graph.number_of_nodes() == 0:
68
+ return []
69
+
70
+ if not nx.is_directed_acyclic_graph(graph):
71
+ # Fall back to deterministic ordering if cyclic imports exist.
72
+ return sorted(graph.nodes())
73
+
74
+ centrality = nx.betweenness_centrality(graph)
75
+ indegree = {node: graph.in_degree(node) for node in graph.nodes()}
76
+ queue = [node for node, deg in indegree.items() if deg == 0]
77
+ order: list[str] = []
78
+
79
+ def rank(node: str) -> tuple[float, float, str]:
80
+ return (
81
+ float(graph.out_degree(node)),
82
+ float(centrality.get(node, 0.0)),
83
+ node,
84
+ )
85
+
86
+ while queue:
87
+ queue.sort(key=rank)
88
+ current = queue.pop(0)
89
+ order.append(current)
90
+ for successor in sorted(graph.successors(current)):
91
+ indegree[successor] -= 1
92
+ if indegree[successor] == 0:
93
+ queue.append(successor)
94
+
95
+ return order
96
+
97
+
98
+ if __name__ == "__main__":
99
+ manager = DependencyGraph(target_dir="sample_codebase")
100
+ result = manager.load_or_build()
101
+ print(
102
+ f"Loaded graph with {result.graph.number_of_nodes()} nodes and "
103
+ f"{result.graph.number_of_edges()} edges (cache={result.loaded_from_cache})"
104
+ )
105
+ print("Traversal order:", manager.traversal_order(result.graph))
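`traversal_order` above is a Kahn-style topological sort: nodes with no incoming edges are emitted first, ties are broken by a rank key, and a cyclic graph falls back to alphabetical order. The sketch below reproduces that idea with plain dictionaries instead of networkx. It is a simplification, not the commit's implementation: the rank uses only out-degree and name (betweenness centrality is omitted), and the edge map contains only the imports visible in the sample files shown in this diff.

```python
def traversal_order(edges: dict[str, list[str]]) -> list[str]:
    """Kahn-style ordering over importer -> imported edges, with deterministic ties."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)

    indegree = {node: 0 for node in nodes}
    for targets in edges.values():
        for target in targets:
            indegree[target] += 1

    ready = [node for node, degree in indegree.items() if degree == 0]
    order: list[str] = []
    while ready:
        # Simplified rank: fewest outgoing edges first, then name.
        ready.sort(key=lambda node: (len(edges.get(node, [])), node))
        current = ready.pop(0)
        order.append(current)
        for successor in sorted(edges.get(current, [])):
            indegree[successor] -= 1
            if indegree[successor] == 0:
                ready.append(successor)

    if len(order) != len(nodes):
        # Some nodes were never released, so a cycle exists: use alphabetical order.
        return sorted(nodes)
    return order


sample_edges = {
    "checkout": ["cart", "payments"],
    "cart": ["config"],
    "auth": ["config"],
}
order = traversal_order(sample_edges)
cyclic = traversal_order({"a": ["b"], "b": ["a"]})
```

Because edges point from the importing module to the imported one, modules that nothing imports come first and shared leaves such as `config` come last.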
code-review-env/env/models.py ADDED
@@ -0,0 +1 @@
1
+ """Phase 2 implementation placeholder."""
code-review-env/env/observation_builder.py ADDED
@@ -0,0 +1 @@
1
+ """Phase 2 implementation placeholder."""
code-review-env/env/reward.py ADDED
@@ -0,0 +1 @@
1
+ """Phase 2 implementation placeholder."""
code-review-env/graders/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Graders package placeholder for later phases."""
code-review-env/graders/base_grader.py ADDED
@@ -0,0 +1,5 @@
1
+ """Phase 2+ implementation placeholder."""
2
+
3
+
4
+ class BaseGrader:
5
+     pass
code-review-env/graders/easy_grader.py ADDED
@@ -0,0 +1 @@
1
+ """Phase 2+ implementation placeholder."""
code-review-env/graders/hard_grader.py ADDED
@@ -0,0 +1 @@
1
+ """Phase 2+ implementation placeholder."""
code-review-env/graders/medium_grader.py ADDED
@@ -0,0 +1 @@
1
+ """Phase 2+ implementation placeholder."""
code-review-env/inference.py ADDED
@@ -0,0 +1,4 @@
1
+ """Phase 4 implementation placeholder."""
2
+
3
+ if __name__ == "__main__":
4
+     raise SystemExit("inference.py is not implemented in Phase 1")
code-review-env/openenv.yaml ADDED
@@ -0,0 +1,3 @@
1
+ name: code-review-env
2
+ version: 0.1.0
3
+ description: Phase 1 scaffold
code-review-env/parser/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Parser package for CodeReviewEnv."""
code-review-env/parser/ast_parser.py ADDED
@@ -0,0 +1,189 @@
1
+ from __future__ import annotations
2
+
3
+ import argparse
4
+ import ast
5
+ from pathlib import Path
6
+
7
+ from pydantic import BaseModel
8
+
9
+ from db.schema import EdgeType
10
+ from db.store import Store
11
+ from parser.linter import run_linters
12
+ from parser.summarizer import summarize_module
13
+
14
+
15
+ class ImportRef(BaseModel):
16
+ target_module: str
17
+ import_line: str
18
+ edge_type: EdgeType = EdgeType.EXPLICIT_IMPORT
19
+
20
+
21
+ class ParsedModule(BaseModel):
22
+ module_id: str
23
+ raw_code: str
24
+ function_signatures: list[str]
25
+ classes: list[str]
26
+ imports: list[ImportRef]
27
+ constants: list[str]
28
+ dependencies: list[str]
29
+
30
+
31
+ class _Visitor(ast.NodeVisitor):
32
+ def __init__(self) -> None:
33
+ self.function_signatures: list[str] = []
34
+ self.classes: list[str] = []
35
+ self.constants: list[str] = []
36
+ self.imports: list[tuple[str, str]] = []
37
+
38
+ def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
39
+ args: list[str] = []
40
+ for arg in node.args.args:
41
+ if arg.annotation is not None:
42
+ args.append(f"{arg.arg}: {ast.unparse(arg.annotation)}")
43
+ else:
44
+ args.append(arg.arg)
45
+ returns = ast.unparse(node.returns) if node.returns is not None else "None"
46
+ self.function_signatures.append(f"{node.name}({', '.join(args)})->{returns}")
47
+ self.generic_visit(node)
48
+
49
+ def visit_AsyncFunctionDef(self, node: ast.AsyncFunctionDef) -> None:
50
+ fake = ast.FunctionDef(
51
+ name=node.name,
52
+ args=node.args,
53
+ body=node.body,
54
+ decorator_list=node.decorator_list,
55
+ returns=node.returns,
56
+ type_comment=node.type_comment,
57
+ )
58
+ self.visit_FunctionDef(fake)
59
+
60
+ def visit_ClassDef(self, node: ast.ClassDef) -> None:
61
+ self.classes.append(node.name)
62
+ self.generic_visit(node)
63
+
64
+ def visit_Import(self, node: ast.Import) -> None:
65
+ line = ast.get_source_segment(self._source, node) or "import"
66
+ for alias in node.names:
67
+ self.imports.append((alias.name, line))
68
+
69
+ def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
70
+ module = node.module or ""
71
+ level = node.level or 0
72
+ dotted = "." * level + module
73
+ line = ast.get_source_segment(self._source, node) or "from"
74
+ self.imports.append((dotted, line))
75
+
76
+ def visit_Assign(self, node: ast.Assign) -> None:
77
+ if isinstance(node.value, ast.Constant):
78
+ for target in node.targets:
79
+ if isinstance(target, ast.Name) and target.id.isupper():
80
+ self.constants.append(target.id)
81
+ self.generic_visit(node)
82
+
83
+ def parse(self, tree: ast.AST, source: str) -> None:
84
+ self._source = source
85
+ self.visit(tree)
86
+
87
+
88
+ def _to_module_id(path: Path, root: Path) -> str:
89
+ rel = path.resolve().relative_to(root.resolve())
90
+ return ".".join(rel.with_suffix("").parts)
91
+
92
+
93
+ def _resolve_relative_import(current_module: str, ref: str) -> str:
94
+ if not ref.startswith("."):
95
+ return ref
96
+ dots = len(ref) - len(ref.lstrip("."))
97
+ suffix = ref.lstrip(".")
98
+ parts = current_module.split(".")
99
+ base = parts[:-dots] if dots <= len(parts) else []
100
+ if suffix:
101
+ base.append(suffix)
102
+ return ".".join(part for part in base if part)
103
+
104
+
105
+ def parse_python_file(path: Path, root_dir: Path) -> ParsedModule:
106
+ source = path.read_text(encoding="utf-8")
107
+ module_id = _to_module_id(path, root_dir)
108
+ tree = ast.parse(source)
109
+
110
+ visitor = _Visitor()
111
+ visitor.parse(tree, source)
112
+
113
+ imports = [
114
+ ImportRef(
115
+ target_module=_resolve_relative_import(module_id, name),
116
+ import_line=line,
117
+ edge_type=EdgeType.EXPLICIT_IMPORT,
118
+ )
119
+ for name, line in visitor.imports
120
+ ]
121
+
122
+ dependencies = [imp.target_module for imp in imports if imp.target_module]
123
+
124
+ return ParsedModule(
125
+ module_id=module_id,
126
+ raw_code=source,
127
+ function_signatures=visitor.function_signatures,
128
+ classes=visitor.classes,
129
+ imports=imports,
130
+ constants=visitor.constants,
131
+ dependencies=dependencies,
132
+ )
133
+
134
+
135
+ def parse_directory(target_dir: Path, db_path: str | None = None) -> Store:
136
+ target_dir = target_dir.resolve()
137
+ store = Store(source_root=str(target_dir), db_path=db_path)
138
+ store.clear_source_graph()
139
+
140
+ py_files = sorted(target_dir.rglob("*.py"))
141
+ for py_file in py_files:
142
+ parsed = parse_python_file(py_file, target_dir)
143
+ issues = run_linters(py_file)
144
+ summary = summarize_module(parsed, issues)
145
+
146
+ dep_reason = "Imports used by module-level and callable logic"
147
+ store.upsert_node(
148
+ module_id=parsed.module_id,
149
+ raw_code=parsed.raw_code,
150
+ ast_summary=summary,
151
+ dependency_reason=dep_reason,
152
+ )
153
+ store.replace_findings_for_module(
154
+ parsed.module_id,
155
+ [issue.model_dump() for issue in issues],
156
+ )
157
+ for imported in parsed.imports:
158
+ if imported.target_module:
159
+ store.upsert_edge(
160
+ source_module_id=parsed.module_id,
161
+ target_module_id=imported.target_module,
162
+ edge_type=imported.edge_type,
163
+ import_line=imported.import_line,
164
+ weight=1.0,
165
+ )
166
+
167
+ return store
168
+
169
+
170
+ def _build_parser() -> argparse.ArgumentParser:
171
+ parser = argparse.ArgumentParser(description="Parse Python codebase into SQLite graph")
172
+ parser.add_argument("target", help="Path to target codebase")
173
+ parser.add_argument("--db-path", default=None, help="Path to SQLite database")
174
+ return parser
175
+
176
+
177
+ def main() -> None:
178
+ args = _build_parser().parse_args()
179
+ target_dir = Path(args.target)
180
+ store = parse_directory(target_dir=target_dir, db_path=args.db_path)
181
+ snapshot = store.get_full_graph()
182
+ print(
183
+ f"Populated DB for {target_dir} with "
184
+ f"{len(snapshot.nodes)} nodes and {len(snapshot.edges)} edges"
185
+ )
186
+
187
+
188
+ if __name__ == "__main__":
189
+ main()
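The import handling in this parser has two steps: collect each import together with its source line, then turn relative references into absolute module ids. The sketch below walks through both on a three-line snippet using only the standard-library `ast` module. The function names `collect_imports` and `resolve_relative`, the snippet, and the module id `pkg.sub.mod` are illustrative; the resolution logic mirrors `_resolve_relative_import` above.

```python
import ast

SOURCE = "import config\nfrom . import helpers\nfrom ..core.utils import slugify\n"


def collect_imports(source: str) -> list[tuple[str, str]]:
    """Return (dotted reference, source line) pairs for every import statement."""
    refs: list[tuple[str, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            line = ast.get_source_segment(source, node) or "import"
            refs.extend((alias.name, line) for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Leading dots encode the relative level, as in the visitor above.
            dotted = "." * (node.level or 0) + (node.module or "")
            line = ast.get_source_segment(source, node) or "from"
            refs.append((dotted, line))
    return refs


def resolve_relative(current_module: str, ref: str) -> str:
    """Resolve a dotted relative reference against the importing module's id."""
    if not ref.startswith("."):
        return ref
    dots = len(ref) - len(ref.lstrip("."))
    suffix = ref.lstrip(".")
    parts = current_module.split(".")
    base = parts[:-dots] if dots <= len(parts) else []
    if suffix:
        base.append(suffix)
    return ".".join(part for part in base if part)


refs = collect_imports(SOURCE)
resolved = [resolve_relative("pkg.sub.mod", dotted) for dotted, _ in refs]
```

Note that `from . import helpers` resolves to the package `pkg.sub` rather than `pkg.sub.helpers`, because only the module part of a `from` import is recorded, not the imported names.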
code-review-env/parser/linter.py ADDED
@@ -0,0 +1,104 @@
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ import subprocess
5
+ from pathlib import Path
6
+ import sys
7
+
8
+ from pydantic import BaseModel
9
+
10
+
11
+ class LinterIssue(BaseModel):
12
+ tool: str
13
+ line: int
14
+ severity: str
15
+ code: str
16
+ message: str
17
+
18
+
19
+ _PYLINT_SEVERITY_MAP = {
20
+ "fatal": "high",
21
+ "error": "high",
22
+ "warning": "medium",
23
+ "refactor": "low",
24
+ "convention": "low",
25
+ "info": "low",
26
+ }
27
+
28
+ _BANDIT_SEVERITY_MAP = {
29
+ "high": "high",
30
+ "medium": "medium",
31
+ "low": "low",
32
+ }
33
+
34
+
35
+ def run_pylint(path: Path) -> list[LinterIssue]:
36
+ cmd = [
37
+ sys.executable,
38
+ "-m",
39
+ "pylint",
40
+ str(path),
41
+ "--output-format=json2",
42
+ "--score=n",
43
+ "--reports=n",
44
+ ]
45
+ proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
46
+
47
+ payload = (proc.stdout or "").strip()
48
+ if not payload:
49
+ return []
50
+
51
+ try:
52
+ data = json.loads(payload)
53
+ except json.JSONDecodeError:
54
+ return []
55
+
56
+ messages = data.get("messages", []) if isinstance(data, dict) else []
57
+ issues: list[LinterIssue] = []
58
+ for message in messages:
59
+ severity = _PYLINT_SEVERITY_MAP.get(str(message.get("type", "")).lower(), "low")
60
+ issues.append(
61
+ LinterIssue(
62
+ tool="pylint",
63
+ line=int(message.get("line", 0)),
64
+ severity=severity,
65
+ code=str(message.get("messageId", "PL0000")),
66
+ message=str(message.get("message", "")),
67
+ )
68
+ )
69
+ return issues
70
+
71
+
72
+ def run_bandit(path: Path) -> list[LinterIssue]:
73
+ cmd = [sys.executable, "-m", "bandit", "-q", "-f", "json", str(path)]
74
+ proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
75
+
76
+ payload = (proc.stdout or "").strip()
77
+ if not payload:
78
+ return []
79
+
80
+ try:
81
+ data = json.loads(payload)
82
+ except json.JSONDecodeError:
83
+ return []
84
+
85
+ results = data.get("results", []) if isinstance(data, dict) else []
86
+ issues: list[LinterIssue] = []
87
+ for item in results:
88
+ raw_sev = str(item.get("issue_severity", "LOW")).lower()
89
+ issues.append(
90
+ LinterIssue(
91
+ tool="bandit",
92
+ line=int(item.get("line_number", 0)),
93
+ severity=_BANDIT_SEVERITY_MAP.get(raw_sev, "low"),
94
+ code=str(item.get("test_id", "B000")),
95
+ message=str(item.get("issue_text", "")),
96
+ )
97
+ )
98
+ return issues
99
+
100
+
101
+ def run_linters(path: Path) -> list[LinterIssue]:
102
+ issues = run_pylint(path)
103
+ issues.extend(run_bandit(path))
104
+ return issues
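`run_pylint` above reads a JSON document with a top-level `messages` list and maps each message `type` onto the three-level severity scale. The sketch below isolates that normalization step so it can be exercised without running pylint. The payload is a hand-written sample shaped like the fields the function reads, not captured tool output, and the `normalize` helper is hypothetical.

```python
import json

PYLINT_SEVERITY = {
    "fatal": "high",
    "error": "high",
    "warning": "medium",
    "refactor": "low",
    "convention": "low",
    "info": "low",
}

# Hand-written sample payload (assumption), not real pylint output.
SAMPLE = json.dumps(
    {
        "messages": [
            {"type": "convention", "line": 7, "messageId": "C0301", "message": "Line too long"},
            {"type": "error", "line": 3, "messageId": "E0401", "message": "Unable to import"},
        ]
    }
)


def normalize(payload: str) -> list[dict[str, object]]:
    """Turn a JSON payload into flat issue records; malformed input yields an empty list."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return []
    messages = data.get("messages", []) if isinstance(data, dict) else []
    return [
        {
            "line": int(message.get("line", 0)),
            "severity": PYLINT_SEVERITY.get(str(message.get("type", "")).lower(), "low"),
            "code": str(message.get("messageId", "PL0000")),
        }
        for message in messages
    ]


issues = normalize(SAMPLE)
```

Returning an empty list on malformed or non-object payloads matches the commit's choice to treat linter failures as "no findings" rather than as errors.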
code-review-env/parser/summarizer.py ADDED
@@ -0,0 +1,24 @@
1
+ from __future__ import annotations
2
+
3
+ from typing import TYPE_CHECKING
4
+
5
+ from parser.linter import LinterIssue
6
+
7
+ if TYPE_CHECKING:
8
+     from parser.ast_parser import ParsedModule
9
+
10
+
11
+ def _truncate_tokens(text: str, max_tokens: int = 100) -> str:
12
+     words = text.split()
13
+     if len(words) <= max_tokens:
14
+         return text
15
+     return " ".join(words[:max_tokens])
16
+
17
+
18
+ def summarize_module(parsed: ParsedModule, issues: list[LinterIssue]) -> str:
19
+     exports = ", ".join(parsed.function_signatures[:5])
20
+     deps = ", ".join(sorted(set(parsed.dependencies))[:5])
21
+     summary = (
22
+         f"exports: [{exports}] | issues: {len(issues)} | depends_on: [{deps}]"
23
+     )
24
+     return _truncate_tokens(summary, max_tokens=100)
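The "tokens" counted by `_truncate_tokens` are whitespace-separated words, not model tokens, so a cut can land in the middle of a bracketed list. The sketch below restates the helper under the hypothetical name `truncate_words` and applies it to a summary string in the format produced above; the issue count in the sample string is illustrative.

```python
def truncate_words(text: str, max_words: int = 100) -> str:
    """Keep at most max_words whitespace-separated words; short text is returned unchanged."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])


summary = "exports: [issue_session_token(user_id: str)->str] | issues: 1 | depends_on: [config]"
untouched = truncate_words(summary)   # 9 words, well under the default limit
cut = truncate_words(summary, 4)      # keeps only the first four words
```

If a hard model-token budget matters in later phases, a tokenizer-based count would be needed; the word count is only an approximation.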
code-review-env/pyproject.toml ADDED
@@ -0,0 +1,13 @@
1
+ [build-system]
2
+ requires = ["setuptools>=68", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "code-review-env"
7
+ version = "0.1.0"
8
+ description = "OpenEnv CodeReviewEnv"
9
+ requires-python = ">=3.11"
10
+
11
+ [tool.pytest.ini_options]
12
+ pythonpath = ["."]
13
+ testpaths = ["tests"]
code-review-env/requirements.txt ADDED
@@ -0,0 +1,9 @@
1
+ sqlmodel>=0.0.24
2
+ networkx>=3.2
3
+ pydantic>=2.7
4
+ pylint>=3.2
5
+ bandit>=1.7
6
+ fastapi>=0.115
7
+ uvicorn>=0.30
8
+ openai>=1.40
9
+ pytest>=8.2
code-review-env/sample_codebase/auth.py ADDED
@@ -0,0 +1,7 @@
1
+ """Auth helpers."""
2
+
3
+ import config
4
+
5
+
6
+ def issue_session_token(user_id: str) -> str:
7
+     return f"{user_id}:{config.SECRET_KEY}:session-token-generated-with-a-very-long-suffix-that-triggers-style-rules-and-is-hard-to-read"
code-review-env/sample_codebase/cart.py ADDED
@@ -0,0 +1,17 @@
+ """Cart calculations."""
+
+ import config
+
+
+ def calculate_subtotal(items: list[dict[str, float]]) -> float:
+     subtotal = 0.0
+     for item in items:
+         subtotal += float(item.get("price", 0.0)) * float(item.get("qty", 0.0))
+     return subtotal
+
+
+ def calculate_total(items: list[dict[str, float]]) -> float:
+     subtotal = calculate_subtotal(items)
+     # BUG: config.DISCOUNT_RATE is intended to be 0.20, but set to 20 in config.
+     discounted = subtotal - (subtotal * config.DISCOUNT_RATE)
+     return discounted + (discounted * config.TAX_RATE)
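Review note: the seeded bug is easier to see with numbers. The sketch below mirrors the arithmetic of `calculate_total` for a subtotal of 100.0, once with the committed `DISCOUNT_RATE = 20` and once with the 0.20 that the BUG comment says was intended. `total_with` is a hypothetical helper written for this illustration; `TAX_RATE` is copied from `config.py` in this commit.

```python
import math

TAX_RATE = 0.07  # value from sample_codebase/config.py in this commit


def total_with(discount_rate: float, subtotal: float) -> float:
    # Mirrors calculate_total: subtract the discount, then add tax
    # computed on the discounted amount.
    discounted = subtotal - (subtotal * discount_rate)
    return discounted + (discounted * TAX_RATE)


buggy = total_with(20, 100.0)       # rate as committed: the discount is 20x the subtotal
intended = total_with(0.20, 100.0)  # rate the BUG comment describes as intended

print(buggy < 0)                      # True: 100 - 2000 is already negative before tax
print(math.isclose(intended, 85.6))   # True: 80 plus 7% tax
```

With the committed rate the discounted amount is 100 - 2000 = -1900, so the total lands near -2033, which is the negative value that `checkout.submit_order` later reports as a symptom.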
code-review-env/sample_codebase/checkout.py ADDED
@@ -0,0 +1,15 @@
+ """Checkout flow."""
+
+ import cart
+ import payments
+
+
+ def submit_order(items: list[dict[str, float]]) -> str:
+     total = cart.calculate_total(items)
+     # Cascading symptom: negative total is observed here but root cause is config -> cart.
+     if total < 0:
+         return "error: negative total"
+     gateway_ok = payments.run_gateway_check("https://gateway.example.com/health")
+     if gateway_ok != 0:
+         return "error: gateway"
+     return payments.charge(total)
code-review-env/sample_codebase/config.py ADDED
@@ -0,0 +1,6 @@
+ """Configuration defaults for the checkout flow."""
+
+ DISCOUNT_RATE = 20
+ TAX_RATE = 0.07
+ PAYMENT_TIMEOUT_SECONDS = 30
+ SECRET_KEY = "hardcoded-dev-key"
code-review-env/sample_codebase/ground_truth.json ADDED
@@ -0,0 +1,39 @@
+ {
+   "issues": [
+     {
+       "id": "STYLE_001",
+       "module": "auth",
+       "line": 7,
+       "type": "style",
+       "tool": "pylint",
+       "code": "C0301",
+       "message_contains": "Line too long"
+     },
+     {
+       "id": "LOGIC_001",
+       "module": "checkout",
+       "line": 7,
+       "type": "logic",
+       "description": "Negative total symptom due to dependency behavior in cart"
+     },
+     {
+       "id": "SECURITY_001",
+       "module": "payments",
+       "line": 9,
+       "type": "security",
+       "tool": "bandit",
+       "code": "B602",
+       "message_contains": "shell=True"
+     },
+     {
+       "id": "CASCADE_001",
+       "module": "checkout",
+       "line": 7,
+       "type": "dependency_cascade",
+       "root_cause_module": "config",
+       "surface_module": "checkout",
+       "path": ["config", "cart", "checkout"],
+       "description": "DISCOUNT_RATE configured as 20 instead of 0.20 causes cart miscalculation and checkout failure"
+     }
+   ]
+ }
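Review note: the graders are still placeholders in this commit, so how these entries get matched is not defined yet. The sketch below is one possible reading, assuming a reported finding is a dict with `module`, `line`, `code`, and `message` keys; that shape and the `match_finding` helper are assumptions for illustration, not the project's grader API. Only the two tool-backed entries are inlined to keep the example short.

```python
import json
from typing import Optional

# Subset of ground_truth.json from this commit (tool-backed entries only).
GROUND_TRUTH = json.loads("""
{
  "issues": [
    {"id": "STYLE_001", "module": "auth", "line": 7, "type": "style",
     "tool": "pylint", "code": "C0301", "message_contains": "Line too long"},
    {"id": "SECURITY_001", "module": "payments", "line": 9, "type": "security",
     "tool": "bandit", "code": "B602", "message_contains": "shell=True"}
  ]
}
""")


def match_finding(finding: dict, truth: dict) -> Optional[str]:
    """Return the id of the first ground-truth entry a finding matches, else None."""
    message = finding.get("message", "").lower()
    for entry in truth["issues"]:
        same_location = (
            entry["module"] == finding.get("module")
            and entry["line"] == finding.get("line")
        )
        same_code = entry.get("code") == finding.get("code")
        # message_contains is treated as a case-insensitive substring check.
        mentions = entry.get("message_contains", "").lower() in message
        if same_location and same_code and mentions:
            return entry["id"]
    return None


# Illustrative finding; the message text is an example, not captured tool output.
finding = {
    "module": "payments",
    "line": 9,
    "code": "B602",
    "message": "subprocess call with shell=True identified",
}
print(match_finding(finding, GROUND_TRUTH))  # SECURITY_001
```

Exact line matching is strict; a later grader might prefer a tolerance window, since `LOGIC_001` and `CASCADE_001` both point at line 7 of `checkout` while the symptom check itself sits a few lines lower.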
code-review-env/sample_codebase/payments.py ADDED
@@ -0,0 +1,15 @@
+ """Payment gateway wrapper."""
+
+ import subprocess
+
+
+ def run_gateway_check(endpoint: str) -> int:
+     # SECURITY ISSUE: user-provided endpoint is interpolated in a shell command.
+     command = f"curl -s {endpoint}"
+     return subprocess.call(command, shell=True)
+
+
+ def charge(total: float) -> str:
+     if total <= 0:
+         return "rejected"
+     return "charged"
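Review note: the `shell=True` call is a deliberate fixture (it is `SECURITY_001` in `ground_truth.json`), so it should stay as committed. For reference, the sketch below shows what a fix might look like if a reviewer agent were asked to propose one: pass curl an argument list so no shell parses the endpoint, and reject values that are not HTTP(S) URLs. `build_gateway_command` is a hypothetical helper introduced here, not part of the commit.

```python
import subprocess
from urllib.parse import urlsplit


def build_gateway_command(endpoint: str) -> list[str]:
    """Build an argv list for curl (hypothetical helper, not in the commit)."""
    scheme = urlsplit(endpoint).scheme
    if scheme not in ("http", "https"):
        # Also rejects values such as "-o file" that curl would read as options.
        raise ValueError(f"unsupported endpoint scheme: {scheme!r}")
    # The endpoint stays a single argv element, so characters like ';' or '|'
    # are never interpreted by a shell.
    return ["curl", "-s", endpoint]


def run_gateway_check(endpoint: str) -> int:
    # Without shell=True, subprocess executes curl directly with the list above.
    return subprocess.call(build_gateway_command(endpoint))


print(build_gateway_command("https://gateway.example.com/health"))
# ['curl', '-s', 'https://gateway.example.com/health']
```

Splitting command construction from execution keeps the security-relevant part testable without a network or a curl binary.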
code-review-env/server/__init__.py ADDED
@@ -0,0 +1 @@
+ """Server package placeholder for later phases."""
code-review-env/server/app.py ADDED
@@ -0,0 +1 @@
+ """Phase 3 implementation placeholder."""
code-review-env/tasks/__init__.py ADDED
@@ -0,0 +1 @@
+ """Tasks package placeholder for later phases."""
code-review-env/tasks/easy_task.py ADDED
@@ -0,0 +1 @@
+ """Phase 2+ implementation placeholder."""
code-review-env/tasks/hard_task.py ADDED
@@ -0,0 +1 @@
+ """Phase 2+ implementation placeholder."""
code-review-env/tasks/medium_task.py ADDED
@@ -0,0 +1 @@
+ """Phase 2+ implementation placeholder."""
code-review-env/tasks/task_registry.py ADDED
@@ -0,0 +1 @@
+ """Phase 2+ implementation placeholder."""
code-review-env/tests/test_environment.py ADDED
@@ -0,0 +1,21 @@
+ from pathlib import Path
+
+ from env.graph import DependencyGraph
+
+
+ def test_graph_builds_from_sample_codebase(tmp_path: Path) -> None:
+     db_path = tmp_path / "graph.db"
+     graph_mgr = DependencyGraph(target_dir="sample_codebase", db_path=db_path)
+     result = graph_mgr.load_or_build(force_reparse=True)
+
+     assert result.graph.number_of_nodes() >= 5
+     assert result.loaded_from_cache is False
+
+
+ def test_graph_second_load_uses_cache(tmp_path: Path) -> None:
+     db_path = tmp_path / "graph.db"
+     graph_mgr = DependencyGraph(target_dir="sample_codebase", db_path=db_path)
+     graph_mgr.load_or_build(force_reparse=True)
+     second = graph_mgr.load_or_build(force_reparse=False)
+
+     assert second.loaded_from_cache is True
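Review note: both tests pass `"sample_codebase"` as a relative path, which resolves against the working directory, so they appear to assume pytest is started from `code-review-env/`. If that ever becomes a problem, one option is to anchor the path to the test file instead. The sketch below assumes the layout shown in this commit (`tests/` and `sample_codebase/` as siblings); `sample_codebase_dir` is a hypothetical helper, not something the commit defines.

```python
from pathlib import Path


def sample_codebase_dir(test_file: str) -> Path:
    """Hypothetical helper: locate sample_codebase next to the tests directory."""
    # tests/test_environment.py -> tests/ -> code-review-env/ -> sample_codebase/
    return Path(test_file).resolve().parent.parent / "sample_codebase"


# Inside a test module this would typically be called with __file__.
target = sample_codebase_dir("code-review-env/tests/test_environment.py")
print(target.name)         # sample_codebase
print(target.parent.name)  # code-review-env
```

The result could then be passed as `target_dir`, assuming `DependencyGraph` accepts a `Path` or its string form there; the constructor is not shown in this part of the diff, so that is unverified.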
code-review-env/tests/test_graders.py ADDED
@@ -0,0 +1,2 @@
+ def test_phase1_placeholder() -> None:
+     assert True
code-review-env/tests/test_inference.py ADDED
@@ -0,0 +1,2 @@
+ def test_inference_placeholder() -> None:
+     assert True
code-review-env/tests/test_parser.py ADDED
@@ -0,0 +1,13 @@
+ from pathlib import Path
+
+ from parser.ast_parser import parse_python_file
+
+
+ def test_parse_python_file_extracts_core_elements() -> None:
+     root = Path("sample_codebase")
+     path = root / "cart.py"
+     parsed = parse_python_file(path=path, root_dir=root)
+
+     assert parsed.module_id == "cart"
+     assert any(sig.startswith("calculate_total(") for sig in parsed.function_signatures)
+     assert "config" in " ".join(parsed.dependencies)