Spaces: Athmabhiram1 / nodeaudit-openenv
shreyas-joshi committed on Apr 7

Commit cf05092 (0 parents)

feat: initialize CodeReviewEnv with foundational components

- Add Dockerfile for containerized environment setup.
- Create README.md with quickstart instructions.
- Implement database package with migrations and schema definitions.
- Develop store module for database interactions and data management.
- Introduce parser module for AST parsing and code analysis.
- Establish environment and graph management for dependency tracking.
- Set up grading and task management placeholders for future phases.
- Include sample codebase and ground truth for testing and validation.
- Add tests for environment, parser, and graph functionalities.

Files changed (46)

.gitignore +5 -0
Builder.md +138 -0
Debugger.md +100 -0
Phases.md +295 -0
code-review-env/Dockerfile +7 -0
code-review-env/README.md +11 -0
code-review-env/db/__init__.py +1 -0
code-review-env/db/migrations.py +28 -0
code-review-env/db/schema.py +91 -0
code-review-env/db/store.py +384 -0
code-review-env/env/__init__.py +1 -0
code-review-env/env/environment.py +6 -0
code-review-env/env/graph.py +105 -0
code-review-env/env/models.py +1 -0
code-review-env/env/observation_builder.py +1 -0
code-review-env/env/reward.py +1 -0
code-review-env/graders/__init__.py +1 -0
code-review-env/graders/base_grader.py +5 -0
code-review-env/graders/easy_grader.py +1 -0
code-review-env/graders/hard_grader.py +1 -0
code-review-env/graders/medium_grader.py +1 -0
code-review-env/inference.py +4 -0
code-review-env/openenv.yaml +3 -0
code-review-env/parser/__init__.py +1 -0
code-review-env/parser/ast_parser.py +189 -0
code-review-env/parser/linter.py +104 -0
code-review-env/parser/summarizer.py +24 -0
code-review-env/pyproject.toml +13 -0
code-review-env/requirements.txt +9 -0
code-review-env/sample_codebase/auth.py +7 -0
code-review-env/sample_codebase/cart.py +17 -0
code-review-env/sample_codebase/checkout.py +15 -0
code-review-env/sample_codebase/config.py +6 -0
code-review-env/sample_codebase/ground_truth.json +39 -0
code-review-env/sample_codebase/payments.py +15 -0
code-review-env/server/__init__.py +1 -0
code-review-env/server/app.py +1 -0
code-review-env/tasks/__init__.py +1 -0
code-review-env/tasks/easy_task.py +1 -0
code-review-env/tasks/hard_task.py +1 -0
code-review-env/tasks/medium_task.py +1 -0
code-review-env/tasks/task_registry.py +1 -0
code-review-env/tests/test_environment.py +21 -0
code-review-env/tests/test_graders.py +2 -0
code-review-env/tests/test_inference.py +2 -0
code-review-env/tests/test_parser.py +13 -0

.gitignore ADDED

	@@ -0,0 +1,5 @@

+.venv
+.env
+__pycache__/
+*.pyc
+code-review-env/code_review_env.db

Builder.md ADDED

	@@ -0,0 +1,138 @@

+# Builder Prompt — CodeReviewEnv
+You are an expert Python engineer building a reinforcement learning environment called **CodeReviewEnv** for the OpenEnv Hackathon Round 1. Read everything below before writing a single line of code.
+---
+## What You Are Building
+An OpenEnv-compliant RL environment where an LLM agent learns to perform dependency-aware code review on a Python codebase.
+The environment:
+1. Parses a Python codebase into a **persistent dependency graph** stored in SQLite via SQLModel. Nodes = modules. Edges = import relationships.
+2. Each node stores: full source code, compressed AST summary (~50 tokens), linter ground truth (pylint + bandit output), and agent-written review annotations.
+3. The agent reviews one module per episode via a multi-step loop: `reset()` → `step(action)` × N → done.
+4. The agent sees **full code of the current module only**. Neighbors are always compressed summaries — never full code. This is a hard constraint for token budget.
+5. The agent can take actions: FLAG_BUG, FLAG_STYLE, FLAG_SECURITY, FLAG_DEPENDENCY_ISSUE, ADD_COMMENT, REQUEST_CHANGES, APPROVE, REQUEST_CONTEXT (costs -0.1 reward), AMEND_REVIEW (updates a neighbor's annotation retroactively).
+6. Rewards are computed by graders against pre-computed ground truth stored in the DB.
+7. The final output is an annotated dependency graph — all module reviews, cross-module causal attributions, readable as JSON and Markdown.
+The key differentiator: the environment models **cascading bugs** — where a bug in module B is caused by a design decision in module A. The agent is rewarded for identifying the upstream root cause, not just flagging the surface symptom.
+---
+## Persistence Strategy
+**SQLite + SQLModel. This is non-negotiable for demo performance.**
+- On first run: parse sample_codebase/ → populate DB with all nodes, edges, linter flags
+- On subsequent runs: detect DB exists → skip parsing → load graph directly
+- `reset()` clears only review annotations, never graph structure
+- All episode history is stored for reproducibility
+Use Context7 MCP to look up SQLModel, NetworkX, pylint programmatic API, bandit API, and OpenEnv spec documentation before implementing each component. Do not guess at APIs — look them up.
+---
+## Tech Stack
+- Python 3.11
+- SQLModel (SQLite persistence)
+- NetworkX (graph construction and traversal)
+- FastAPI (HTTP server for OpenEnv spec)
+- Pydantic v2 (typed models)
+- pylint + bandit (linter ground truth)
+- Python `ast` module (AST parsing — stdlib, no extras)
+- OpenAI client (all LLM calls in inference.py and hard grader)
+- Docker (containerization)
+---
+## Project Structure
+Follow this structure exactly — do not deviate:
+```
+code-review-env/
+├── openenv.yaml
+├── Dockerfile
+├── README.md
+├── inference.py
+├── requirements.txt
+├── env/
+│   ├── environment.py
+│   ├── models.py
+│   ├── graph.py
+│   ├── observation_builder.py
+│   └── reward.py
+├── db/
+│   ├── schema.py
+│   ├── store.py
+│   └── migrations.py
+├── parser/
+│   ├── ast_parser.py
+│   ├── linter.py
+│   └── summarizer.py
+├── graders/
+│   ├── base_grader.py
+│   ├── easy_grader.py
+│   ├── medium_grader.py
+│   └── hard_grader.py
+├── tasks/
+│   ├── task_registry.py
+│   ├── easy_task.py
+│   ├── medium_task.py
+│   └── hard_task.py
+├── server/
+│   └── app.py
+├── sample_codebase/
+│   ├── auth.py
+│   ├── checkout.py
+│   ├── cart.py
+│   ├── payments.py
+│   ├── config.py
+│   └── ground_truth.json
+└── tests/
+```
+---
+## Phase You Are Currently Building
+**[INSERT PHASE NUMBER AND NAME HERE]**
+Refer to the phase plan for exact tasks and completion criteria for this phase. Build only what is scoped to this phase. Do not build ahead.
+---
+## Non-Negotiable Constraints
+1. All rewards must be clipped to 0.0–1.0. Never return outside this range.
+2. Never feed full neighbor code into observations. Always use compressed summaries.
+3. inference.py must use OpenAI client. Read API_BASE_URL, MODEL_NAME, HF_TOKEN from env vars.
+4. inference.py must emit [START], [STEP], [END] log format exactly — no deviations.
+5. Hard grader must use temperature=0 and a fixed rubric prompt stored as a constant.
+6. DB must auto-populate on first Docker run without manual intervention.
+7. All Pydantic models must be fully typed — no `Any`, no `dict` without a model.
+8. Episode step limit is 10. Hard cap. Enforce in environment.py.
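A minimal sketch of constraints 1 and 8 above, not part of the commit. The names `clip_reward`, `StepCounter`, and `MAX_STEPS` are hypothetical; the source only fixes the 0.0–1.0 range and the 10-step cap.

```python
MAX_STEPS = 10  # hard cap from constraint 8


def clip_reward(value: float) -> float:
    """Clamp a raw reward into the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


class StepCounter:
    """Hypothetical helper: counts steps and reports when the cap is reached."""

    def __init__(self, limit: int = MAX_STEPS) -> None:
        self.limit = limit
        self.count = 0

    def advance(self) -> bool:
        """Record one step; return True once the episode must end."""
        self.count += 1
        return self.count >= self.limit
```

If the clip is applied to each step's reward, a penalty such as the -0.1 for REQUEST_CONTEXT is floored to 0.0. Whether penalties are instead subtracted from a cumulative score before clipping is a design choice the prompt does not settle.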
+---
+## Before You Start Each File
+1. Use Context7 MCP to look up the relevant library documentation
+2. Check if the schema/interface you are about to implement has dependencies on already-built files — import them, don't reimplement
+3. If you need to make a design choice not covered in this prompt (e.g. exact DB column types, traversal tie-breaking, summary format), **ask the user before proceeding**
+4. Write tests alongside implementation — not after
+---
+## Questions To Ask The User Before Starting
+If any of the following are unclear, ask before building:
+- What Python codebase should be used as the demo target? (default: the sample_codebase/ provided)
+- Should the hard grader use the same MODEL_NAME from env vars, or a fixed model?
+- Should REQUEST_CONTEXT return the full raw code or the full AST + raw code?
+- Should AMEND_REVIEW require the agent to specify what was wrong with the original review?
+- What is the maximum number of neighbors to include in an observation? (recommend: 5, confirm)

Debugger.md ADDED

	@@ -0,0 +1,100 @@

+# Debugger Prompt — CodeReviewEnv
+You are an expert Python debugger working on **CodeReviewEnv**, an OpenEnv-compliant RL environment for the OpenEnv Hackathon. Your job is to diagnose and fix issues without breaking the architecture.
+---
+## Project Summary
+This is a reinforcement learning environment where an LLM agent reviews Python codebases using a persistent dependency graph. The graph is stored in SQLite via SQLModel. The RL loop uses OpenEnv's step()/reset()/state() spec. There are 3 tasks (easy/medium/hard) with deterministic graders. The inference script must run in under 20 minutes on 2 vCPU / 8GB RAM.
+---
+## Architecture Rules — Never Violate These When Fixing
+1. **Persistence is SQLite/SQLModel** — do not switch to in-memory or another DB to fix a bug
+2. **Neighbor observations are always compressed summaries** — never fix a context issue by passing full neighbor code
+3. **Rewards must always be in 0.0–1.0** — if a reward bug exists, fix the computation, never remove the clip
+4. **inference.py uses OpenAI client only** — do not swap to direct HTTP calls or another client
+5. **[START]/[STEP]/[END] log format is fixed** — do not change field names or ordering to fix a logging bug
+6. **Hard grader uses temperature=0 and fixed rubric** — do not relax this to fix flaky test failures
+7. **Episode step limit is 10** — do not raise this to fix timeout issues, optimize the agent instead
+---
+## How To Approach Any Bug
+### Step 1 — Locate
+- Identify which layer the bug is in: parser → db → graph → observation_builder → environment → grader → server → inference
+- Do not assume the bug is where the error surfaces — trace back to root cause
+### Step 2 — Check Interfaces First
+- Before changing implementation, verify the interface contract between the broken component and its dependencies
+- Use Context7 MCP to re-check library APIs if the bug involves SQLModel, NetworkX, pylint, bandit, FastAPI, or OpenEnv
+- Do not fix a bug by changing a shared interface without checking all callers
+### Step 3 — Fix Minimally
+- Make the smallest possible change that resolves the issue
+- If the fix requires changing a DB schema, check whether a migration is needed and write it
+- If the fix changes a Pydantic model, check all serialization/deserialization paths
+### Step 4 — Verify
+- After fixing, confirm the completion criteria for the relevant phase still pass
+- Run the specific test for the broken component
+- If inference.py is affected, do a dry run and confirm [START]/[STEP]/[END] logs emit correctly
+---
+## Common Failure Modes To Check First
+### DB / Persistence
+- DB not found on startup → check migrations.py auto-init logic
+- Graph loads empty on second run → check upsert_node is committing correctly
+- Annotations not persisting across reset() → check reset() only clears annotations, not nodes/edges
+### Parser
+- AST parser crashes on type-annotated functions → check handling of ast.Constant vs ast.Str in Python 3.11
+- Linter returns no output → check pylint/bandit are installed in the Docker image and PATH is correct
+- Import resolution fails on relative imports → check the resolver handles both absolute and relative imports
+### RL Environment
+- Reward outside 0.0–1.0 → find the unclipped computation in reward.py
+- done never becomes True → check step limit counter and REQUEST_CHANGES/APPROVE handling
+- reset() returns wrong module → check task registry is loading the correct starting module
+### Graders
+- Easy grader always returns 0 → check linter_flags were populated in DB during parsing
+- Hard grader is non-deterministic → confirm temperature=0 is set and the seed param is being passed
+- Grader crashes on empty annotation → add null check before scoring
+### Server
+- /health returns 404 → check route is registered in app.py
+- /step rejects valid action → check discriminated union deserialization in Pydantic v2
+- openenv validate fails → check openenv.yaml field names against spec exactly
+### Inference Script
+- Runs over 20 minutes → profile which task is slowest, reduce max steps or add timeout per episode
+- LLM returns unparseable action → check JSON mode is enabled, add fallback to APPROVE
+- Missing [STEP] logs → check log emit is inside the step loop, not outside
+### Docker
+- Build fails on pylint/bandit install → add gcc and build-essential to apt-get
+- DB not found inside container → check WORKDIR and DB path are consistent
+- Port not exposed → confirm EXPOSE 7860 and uvicorn binds to 0.0.0.0
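The "LLM returns unparseable action" item above suggests falling back to APPROVE. A rough sketch of that fallback, not part of the commit, using only the standard library: the `type` key, the `parse_failed` marker, and the function name are assumptions, while the action names come from Builder.md.

```python
import json

# Action names as listed in Builder.md.
VALID_ACTIONS = {
    "FLAG_BUG", "FLAG_STYLE", "FLAG_SECURITY", "FLAG_DEPENDENCY_ISSUE",
    "ADD_COMMENT", "REQUEST_CHANGES", "APPROVE", "REQUEST_CONTEXT",
    "AMEND_REVIEW",
}

# Assumed fallback shape; the marker lets the caller apply a penalty.
FALLBACK = {"type": "APPROVE", "parse_failed": True}


def parse_action(raw: str) -> dict:
    """Parse model output into an action dict, defaulting to APPROVE."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return dict(FALLBACK)
    action_type = data.get("type") if isinstance(data, dict) else None
    if not isinstance(action_type, str) or action_type not in VALID_ACTIONS:
        return dict(FALLBACK)
    return data
```

In the real project the parsed dict would presumably be validated by the Pydantic action models rather than returned as a plain dict.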
+---
+## When You Find An Ambiguity
+If fixing the bug requires a design decision (e.g. "should reset() preserve REQUEST_CONTEXT history?"), **ask the user before implementing**. Do not make silent architectural decisions while debugging.
+---
+## Context To Always Include When Reporting A Fix
+After fixing, always report:
+- What the root cause was (one sentence)
+- Which file(s) were changed
+- Whether any DB schema changed (and if so, whether a migration was added)
+- Whether any Pydantic model interface changed (and if so, which callers were updated)
+- The specific test or check that now passes

Phases.md ADDED

	@@ -0,0 +1,295 @@

+# CodeReviewEnv — Phased Build Plan
+## For: LLM-Assisted Development
+---
+## 🧠 What You Are Building
+An OpenEnv-compliant reinforcement learning environment where an LLM agent learns to perform **dependency-aware code review**.
+The environment parses a Python codebase into a **persistent dependency graph** (nodes = modules, edges = import relationships). Each node stores compressed AST summaries, linter-generated ground truth issues, and agent-written review annotations.
+The agent reviews one module per episode. It receives the **full code of the current module** plus **compressed AST summaries of its neighbors** (never full neighbor code — token budget). It takes multi-step actions (flag bugs, add comments, request context, amend upstream reviews). The environment rewards correct, well-attributed findings and penalizes false positives.
+The final output is an **annotated dependency graph** — a machine-readable + human-readable map of the entire codebase with reviews on every module, including cross-module causal attributions.
+This is differentiated from tools like CodeRabbit because:
+- It models cascading dependency bugs (bug in B caused by design in A)
+- Reviews are stored back into the graph and can be amended as the agent learns more
+- It is an RL training/evaluation environment, not a static analysis tool
+- The agent learns a policy over multi-step decisions, not a single LLM call
+---
+## 🗂️ Persistence Strategy
+**Use SQLite via SQLModel** for all persistent state. Do NOT reparse the codebase on every run. The database stores:
+- Parsed module nodes (code, AST summary, linter flags)
+- Graph edges (dependency relationships + reasons)
+- Review annotations (written by agent, updatable)
+- Episode history (for reproducibility)
+- Task definitions and ground truth
+On startup: check if DB exists → if yes, load graph from DB → if no, parse codebase and populate DB.
+This makes demos fast (parse once, review many times) and makes `reset()` cheap (clear annotations only, keep graph structure).
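The parse-once, reset-cheaply behaviour described above can be sketched with the standard-library `sqlite3` module. This is not the commit's SQLModel code: the two-table layout, `open_graph_db`, and `reset_annotations` are hypothetical, and the sketch checks for an empty node table rather than for the DB file itself.

```python
import sqlite3
from typing import Callable, Iterable


def open_graph_db(
    path: str, parse_codebase: Callable[[], Iterable[tuple[str, str]]]
) -> sqlite3.Connection:
    """Open the DB, populating the graph only when no nodes are stored yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node (module_id TEXT PRIMARY KEY, raw_code TEXT)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS annotation (module_id TEXT, note TEXT)")
    (count,) = conn.execute("SELECT COUNT(*) FROM node").fetchone()
    if count == 0:
        # First run: parse once and store the result.
        conn.executemany("INSERT INTO node VALUES (?, ?)", parse_codebase())
        conn.commit()
    return conn


def reset_annotations(conn: sqlite3.Connection) -> None:
    """Clear review annotations while leaving the graph structure intact."""
    conn.execute("DELETE FROM annotation")
    conn.commit()
```

Here `parse_codebase` stands in for the real parser and is only called when the node table is empty, so a second start skips parsing.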
+---
+## 📁 Target Project Structure
+```
+code-review-env/
+├── openenv.yaml
+├── Dockerfile
+├── README.md
+├── inference.py               # Required by spec, root level
+├── requirements.txt
+├── pyproject.toml
+│
+├── env/
+│   ├── __init__.py
+│   ├── environment.py         # Main CodeReviewEnv class
+│   ├── models.py              # Pydantic: Observation, Action, Reward, GraphState
+│   ├── graph.py               # Graph construction, traversal, compression
+│   ├── observation_builder.py # Assembles tiered observation per step
+│   └── reward.py              # Reward computation logic
+│
+├── db/
+│   ├── __init__.py
+│   ├── schema.py              # SQLModel table definitions
+│   ├── store.py               # DB read/write operations
+│   └── migrations.py          # Init and seed scripts
+│
+├── parser/
+│   ├── __init__.py
+│   ├── ast_parser.py          # AST extraction: signatures, imports, classes
+│   ├── linter.py              # Pylint + Bandit runner, stores results to DB
+│   └── summarizer.py          # Converts AST output → compressed node summary
+│
+├── graders/
+│   ├── __init__.py
+│   ├── base_grader.py         # Abstract grader interface
+│   ├── easy_grader.py         # Linter match — fully deterministic
+│   ├── medium_grader.py       # AST + line attribution match
+│   └── hard_grader.py         # LLM-as-judge, temp=0, seed=42, rubric-constrained
+│
+├── tasks/
+│   ├── __init__.py
+│   ├── task_registry.py       # Registers and loads tasks
+│   ├── easy_task.py           # Style/linter issue in isolated module
+│   ├── medium_task.py         # Logic bug with direct dependency context
+│   └── hard_task.py           # Cascading bug across 2+ modules
+│
+├── server/
+│   ├── __init__.py
+│   └── app.py                 # FastAPI server exposing OpenEnv HTTP endpoints
+│
+├── sample_codebase/           # Synthetic test codebase for demo
+│   ├── auth.py
+│   ├── checkout.py
+│   ├── cart.py
+│   ├── payments.py
+│   └── config.py
+│
+└── tests/
+    ├── test_parser.py
+    ├── test_graders.py
+    ├── test_environment.py
+    └── test_inference.py
+```
+---
+## 📐 Core Data Models (Design Intent — Implementation Is Your Choice)
+### Graph Node
+Stores everything about one module. Persisted in DB.
+- module_id (filename/path)
+- raw_code (full source)
+- ast_summary (compressed: signatures, classes, exports)
+- linter_flags (pre-computed ground truth from pylint/bandit)
+- dependency_reason (why this module needs its neighbors — extracted from import context)
+- review_annotation (agent-written, nullable, updatable)
+- review_status (pending | in_progress | reviewed)
+- review_summary (one-line, written at episode end)
+### Graph Edge
+- source_module_id
+- target_module_id
+- edge_type (explicit_import | implicit_name_resolution)
+- import_line (the actual import statement)
+- weight (1.0 explicit, 0.5 implicit)
+### Observation (Pydantic)
+- current_module: full code + full AST summary
+- direct_dependencies: list of compressed node summaries (NOT full code)
+- dependents: list of compressed node summaries
+- existing_reviews: list of one-line review summaries from already-reviewed neighbors
+- constraint_flags: any known forced decisions from upstream
+- step_number: int
+- episode_id: str
+### Action (Pydantic, discriminated union)
+- APPROVE
+- FLAG_STYLE(line: int, description: str)
+- FLAG_BUG(line: int, description: str)
+- FLAG_SECURITY(line: int, description: str)
+- FLAG_DEPENDENCY_ISSUE(source_module: str, description: str)
+- ADD_COMMENT(text: str)
+- REQUEST_CHANGES(summary: str)
+- REQUEST_CONTEXT(module_id: str)  ← costs -0.1 reward, returns full code of neighbor
+- AMEND_REVIEW(module_id: str, note: str)  ← retroactively updates neighbor annotation
+### Reward (Pydantic)
+- value: float (0.0–1.0)
+- reason: str
+- cumulative: float
+---
+## 🏗️ PHASE 1 — Foundation & Persistence
+**Goal: Database schema, parser, graph construction. No RL yet.**
+### Tasks
+1. Define SQLModel schema for all tables (nodes, edges, annotations, episodes, tasks)
+2. Build `ast_parser.py` — extract from any .py file: all function signatures with type hints, all class definitions, all import statements with source resolution, all module-level constants
+3. Build `linter.py` — run pylint and bandit programmatically on a file, parse output into structured list of {line, severity, code, message}. Store results directly to DB as ground truth.
+4. Build `summarizer.py` — convert AST output into a compressed summary string under 100 tokens. Format: "exports: [fn(args)->return, ...] | issues: N | depends_on: [module, ...]"
+5. Build `store.py` — CRUD operations for all tables. Key operations: upsert_node, upsert_edge, get_node_with_neighbors, update_annotation, get_full_graph
+6. Build `graph.py` — on first run: parse all files in target directory → populate DB. On subsequent runs: load from DB. Build NetworkX DiGraph from DB records. Implement traversal order: topological sort weighted by betweenness centrality (leaf modules first, high-centrality modules last).
+7. Build `sample_codebase/` — 5 Python files with known injected issues: one style issue, one logic bug with a direct dependency cause, one security issue, one cascading bug where the root cause is 2 hops away. Document every injected issue in a ground_truth.json file.
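Task 4 above gives the summary format `exports: [...] | issues: N | depends_on: [...]`. One way to produce it with the stdlib `ast` module is sketched below; this is not the commit's `summarizer.py`. The function name and the `?` placeholder for missing return annotations are assumptions, and the 100-token limit is not enforced here.

```python
import ast


def summarize(source: str, issue_count: int = 0) -> str:
    """Build a compressed one-line summary from top-level defs and imports."""
    tree = ast.parse(source)
    exports: list[str] = []
    depends_on: list[str] = []
    for node in tree.body:  # top-level statements only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            returns = ast.unparse(node.returns) if node.returns else "?"
            exports.append(f"{node.name}({args})->{returns}")
        elif isinstance(node, ast.Import):
            depends_on.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            depends_on.append(node.module)
    return (
        f"exports: [{', '.join(exports)}] | issues: {issue_count} | "
        f"depends_on: [{', '.join(depends_on)}]"
    )
```

Only positional parameters and top-level functions are covered; classes, keyword-only arguments, and relative imports would need extra handling.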
+### Completion Criteria
+- `python -m parser.ast_parser sample_codebase/` populates DB with all nodes and edges
+- DB persists across runs (second run loads from DB, does not reparse)
+- `python -m db.store` can query a node and return its summary and neighbors
+- ground_truth.json matches linter output for easy/medium tasks
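Task 6 asks for a leaf-first traversal order. The sketch below, not part of the commit, covers only the topological part using the stdlib `graphlib`; it does not implement the betweenness-centrality weighting, which the plan assigns to NetworkX. The function name is hypothetical.

```python
from graphlib import TopologicalSorter


def review_order(imports: dict[str, set[str]]) -> list[str]:
    """Order modules so each one appears after every module it imports.

    `imports` maps a module to the set of modules it imports.
    Raises graphlib.CycleError if the import graph contains a cycle.
    """
    return list(TopologicalSorter(imports).static_order())
```

Modules that import nothing come out first and modules that nothing else imports come out last. The order among modules at the same depth is not fixed by this sketch.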
+---
+## 🏗️ PHASE 2 — OpenEnv Core (RL Environment)
+**Goal: Full step()/reset()/state() loop with reward. This is the RL part.**
+### Tasks
+1. Build `models.py` — all Pydantic models: Observation, Action (discriminated union), Reward, GraphState, EpisodeRecord. Must be fully typed.
+2. Build `observation_builder.py` — given a module_id and current graph state, assemble the tiered observation: full code for current module, compressed summaries for neighbors (pulled from DB), existing review annotations for already-reviewed neighbors, constraint flags
+3. Build `reward.py` — implement reward logic:
+   - Easy: compare agent flags against linter ground truth. Correct flag = +0.5, false positive = -0.2, missed critical = -0.4
+   - Medium: check flag + line number within ±3 lines of ground truth = +0.5, correct comment attribution = +0.3
+   - Hard: call hard_grader with agent's FLAG_DEPENDENCY_ISSUE and the known root cause. Score returned by judge × 0.8 as reward.
+   - REQUEST_CONTEXT action always costs -0.1 (thinking cost)
+   - AMEND_REVIEW with correct attribution = +0.4 (high reward — this is the key cascading behavior)
+   - Episode completion bonus: +0.2 if all critical issues found, -0.1 if APPROVE on module with known critical bugs
+4. Build `graders/` — implement all three graders per spec above. Hard grader must use OpenAI client (per competition spec), temperature=0, fixed rubric prompt stored as a constant.
+5. Build `environment.py` — main class implementing full OpenEnv interface:
+   - `reset(task_id)` → clears annotations for task modules, returns first observation
+   - `step(action)` → validates action, updates graph annotations in DB, computes reward, returns (obs, reward, done, info)
+   - `state()` → returns full GraphState (serialized NetworkX graph + all annotations)
+   - Episode ends when: agent calls APPROVE or REQUEST_CHANGES, OR step limit reached (max 10 steps)
+6. Build `tasks/` — register 3 tasks pointing to specific modules in sample_codebase with known ground truth issues
+### Completion Criteria
+- `env.reset("easy_task")` returns a valid typed Observation
+- `env.step(FLAG_BUG(line=12, description="null risk"))` returns reward > 0 for correct flag
+- `env.state()` returns serializable graph with annotations
+- Full episode runs without error on all 3 tasks
+- Reward values all fall in 0.0–1.0 range
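The Easy reward rule from task 3 (+0.5 per correct flag, -0.2 per false positive, -0.4 per missed critical issue) might be combined with the required clip as follows. This is a sketch, not the commit's `reward.py`: matching flags by line number alone and summing per flag are assumptions.

```python
def easy_reward(
    agent_lines: set[int], truth_lines: set[int], critical_lines: set[int]
) -> float:
    """Score flagged lines against linter ground truth, clipped to [0.0, 1.0]."""
    correct = agent_lines & truth_lines
    false_positives = agent_lines - truth_lines
    missed_critical = critical_lines - agent_lines
    raw = (
        0.5 * len(correct)
        - 0.2 * len(false_positives)
        - 0.4 * len(missed_critical)
    )
    return max(0.0, min(1.0, raw))
```

Because of the clip, any net-negative score is reported as 0.0, so penalties only show up when they offset positive reward in the same computation.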
+---
+## 🏗️ PHASE 3 — HTTP Server & OpenEnv Spec Compliance
+**Goal: Wrap environment in FastAPI, pass openenv validate.**
+### Tasks
+1. Build `server/app.py` — FastAPI app exposing:
+   - POST /reset → calls env.reset(), returns Observation JSON
+   - POST /step → calls env.step(action), returns (obs, reward, done, info) JSON
+   - GET /state → calls env.state(), returns GraphState JSON
+   - GET /health → returns 200 (required for HF Space ping)
+2. Build `openenv.yaml` — fill all required metadata: name, version, description, tasks list, observation_space, action_space, reward_range
+3. Run `openenv validate` — fix all compliance errors
+4. Confirm all Pydantic models serialize/deserialize correctly over HTTP
+### Completion Criteria
+- `openenv validate` passes with no errors
+- All endpoints return correct typed responses
+- GET /health returns 200
+---
+## 🏗️ PHASE 4 — Inference Script
+**Goal: Build inference.py that runs Gemma 4 as the agent. This is what judges auto-run.**
+### Critical Requirements (Non-Negotiable)
+- File must be named `inference.py` at root
+- Use OpenAI client for all LLM calls
+- Read API_BASE_URL, MODEL_NAME, HF_TOKEN from environment variables
+- Emit structured stdout logs in EXACTLY this format:
+```
+[START] task=<task_id> episode=<n>
+[STEP] step=<n> action=<action_type> reward=<float> cumulative=<float>
+[END] task=<task_id> total_reward=<float> steps=<n>
+```
+- Must complete all 3 tasks in under 20 minutes total
+- Must run on 2 vCPU / 8GB RAM
+### Tasks
+1. Build the agent loop — for each task: reset env, loop step() until done, collect rewards
+2. Build the LLM action parser — send observation to model with a structured prompt, parse response into typed Action. Use JSON mode or structured output. Handle parse failures gracefully (default to APPROVE with penalty).
+3. Build the action prompt — system prompt explaining the environment, action space, and output format. Include the compressed observation in user message. Tell model to output JSON action only.
+4. Implement all 3 task runs sequentially
+5. Emit all required log lines to stdout
+6. Final output: baseline scores for all 3 tasks printed to stdout
+### Completion Criteria
+- Script runs end to end without error
+- All [START]/[STEP]/[END] logs emitted correctly
+- Produces a score for each task between 0.0–1.0
+- Completes in under 20 minutes
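The [START]/[STEP]/[END] lines required above can be produced by small formatting helpers such as the ones sketched here. They are not part of the commit; the helper names and the two-decimal float formatting are assumptions, since the plan only specifies `<float>`.

```python
def start_line(task_id: str, episode: int) -> str:
    """Format the [START] log line for one episode."""
    return f"[START] task={task_id} episode={episode}"


def step_line(step: int, action_type: str, reward: float, cumulative: float) -> str:
    """Format one [STEP] log line; fields stay in the prescribed order."""
    return (
        f"[STEP] step={step} action={action_type} "
        f"reward={reward:.2f} cumulative={cumulative:.2f}"
    )


def end_line(task_id: str, total_reward: float, steps: int) -> str:
    """Format the [END] log line for one episode."""
    return f"[END] task={task_id} total_reward={total_reward:.2f} steps={steps}"
```

Keeping the format in one place means a logging fix cannot accidentally change field names or ordering in only some of the call sites.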
+---
+## 🏗️ PHASE 5 — Containerization & Deployment
+**Goal: Docker build works, HF Space deploys, pre-validation script passes.**
+### Tasks
+1. Write `Dockerfile`:
+   - Base: python:3.11-slim
+   - Install system deps for pylint, bandit, networkx
+   - Copy project, install requirements
+   - On container start: run parser to populate DB if not exists, then start FastAPI server
+   - Expose port 7860 (HF Spaces default)
+2. Write `README.md` with all required sections: environment description and motivation, observation and action space definitions, all 3 task descriptions with difficulty, setup instructions, baseline scores
+3. Run pre-submission validation script — fix all failures
+4. Deploy to HF Space with `openenv push`
+5. Confirm Space URL returns 200 on GET /health and responds to POST /reset
+### Completion Criteria
+- `docker build .` succeeds
+- `docker run -p 7860:7860` starts server cleanly
+- HF Space URL responds to reset()
+- Pre-validation script passes all checks
+---
+## ⏱️ Suggested Time Allocation (Given ~36hrs remaining)
+| Phase | Time |
+|---|---|
+| Phase 1 — Foundation | 6 hrs |
+| Phase 2 — RL Environment | 8 hrs |
+| Phase 3 — Server + Spec | 3 hrs |
+| Phase 4 — Inference Script | 4 hrs |
+| Phase 5 — Docker + Deploy | 3 hrs |
+| Buffer / debugging | 4 hrs |
+---
+## ⚠️ Known Risk Areas (Watch These)
+1. **Hard grader reproducibility** — document judge prompt and seed explicitly
+2. **DB migration on fresh Docker build** — first run must auto-populate DB from sample_codebase
+3. **Inference script runtime** — test full 3-task run locally before submitting, must be under 20 min
+4. **openenv validate strictness** — run it early in Phase 3, not at the end
+5. **Reward always in 0.0–1.0** — clip all reward values, graders must never return outside range

code-review-env/Dockerfile ADDED

	@@ -0,0 +1,7 @@

+FROM python:3.11-slim
+WORKDIR /app
+COPY requirements.txt /app/
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . /app
+CMD ["python", "-m", "parser.ast_parser", "sample_codebase/"]

code-review-env/README.md ADDED

	@@ -0,0 +1,11 @@

+# CodeReviewEnv
+Phase 1 foundation for a dependency-aware code review environment.
+## Quickstart
+```bash
+pip install -r requirements.txt
+python -m parser.ast_parser sample_codebase/
+python -m db.store --module checkout
+```

code-review-env/db/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Database package for CodeReviewEnv."""

code-review-env/db/migrations.py ADDED

	@@ -0,0 +1,28 @@

+from __future__ import annotations
+from pathlib import Path
+from sqlmodel import SQLModel, create_engine
+def get_default_db_path() -> Path:
+    project_root = Path(__file__).resolve().parents[1]
+    return project_root / "code_review_env.db"
+def get_engine(db_path: str | Path | None = None, echo: bool = False):
+    path = Path(db_path) if db_path else get_default_db_path()
+    path.parent.mkdir(parents=True, exist_ok=True)
+    return create_engine(f"sqlite:///{path}", echo=echo)
+def init_db(db_path: str | Path | None = None, echo: bool = False) -> None:
+    from db import schema  # noqa: F401
+    engine = get_engine(db_path=db_path, echo=echo)
+    SQLModel.metadata.create_all(engine)
+if __name__ == "__main__":
+    init_db()
+    print("Database initialized")

code-review-env/db/schema.py ADDED

	@@ -0,0 +1,91 @@

+from __future__ import annotations
+from datetime import UTC, datetime
+from enum import StrEnum
+from typing import Optional
+from sqlmodel import Field, SQLModel
+class EdgeType(StrEnum):
+    EXPLICIT_IMPORT = "explicit_import"
+    IMPLICIT_NAME_RESOLUTION = "implicit_name_resolution"
+class ReviewStatus(StrEnum):
+    PENDING = "pending"
+    IN_PROGRESS = "in_progress"
+    REVIEWED = "reviewed"
+class Severity(StrEnum):
+    LOW = "low"
+    MEDIUM = "medium"
+    HIGH = "high"
+class ModuleNode(SQLModel, table=True):
+    id: Optional[int] = Field(default=None, primary_key=True)
+    source_root: str = Field(index=True)
+    module_id: str = Field(index=True)
+    raw_code: str
+    ast_summary: str
+    dependency_reason: str = ""
+    review_annotation: Optional[str] = None
+    review_status: ReviewStatus = Field(default=ReviewStatus.PENDING)
+    review_summary: Optional[str] = None
+    created_at: datetime = Field(default_factory=lambda: datetime.now(UTC))
+    updated_at: datetime = Field(default_factory=lambda: datetime.now(UTC))
+class ModuleEdge(SQLModel, table=True):
+    id: Optional[int] = Field(default=None, primary_key=True)
+    source_root: str = Field(index=True)
+    source_module_id: str = Field(index=True)
+    target_module_id: str = Field(index=True)
+    edge_type: EdgeType = Field(default=EdgeType.EXPLICIT_IMPORT)
+    import_line: str
+    weight: float = 1.0
+class LinterFinding(SQLModel, table=True):
+    id: Optional[int] = Field(default=None, primary_key=True)
+    source_root: str = Field(index=True)
+    module_id: str = Field(index=True)
+    tool: str = Field(index=True)
+    line: int
+    severity: Severity
+    code: str
+    message: str
+class ReviewAnnotation(SQLModel, table=True):
+    id: Optional[int] = Field(default=None, primary_key=True)
+    source_root: str = Field(index=True)
+    module_id: str = Field(index=True)
+    episode_id: str = Field(index=True)
+    step_number: int
+    action_type: str
+    note: str
+    created_at: datetime = Field(default_factory=lambda: datetime.now(UTC))
+class EpisodeRecord(SQLModel, table=True):
+    id: Optional[int] = Field(default=None, primary_key=True)
+    source_root: str = Field(index=True)
+    episode_id: str = Field(index=True)
+    task_id: str = Field(index=True)
+    module_id: str = Field(index=True)
+    total_steps: int
+    cumulative_reward: float = 0.0
+    created_at: datetime = Field(default_factory=lambda: datetime.now(UTC))
+class TaskDefinition(SQLModel, table=True):
+    id: Optional[int] = Field(default=None, primary_key=True)
+    source_root: str = Field(index=True)
+    task_id: str = Field(index=True)
+    task_level: str = Field(index=True)
+    target_module_id: str = Field(index=True)
+    description: str
+    ground_truth_ref: str

code-review-env/db/store.py ADDED

	@@ -0,0 +1,384 @@

+from __future__ import annotations
+import argparse
+from dataclasses import dataclass
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Iterator, Optional
+from pydantic import BaseModel
+from sqlmodel import Session, delete, select
+from db.migrations import get_default_db_path, get_engine, init_db
+from db.schema import (
+    EdgeType,
+    LinterFinding,
+    ModuleEdge,
+    ModuleNode,
+    ReviewAnnotation,
+    ReviewStatus,
+    Severity,
+)
+@dataclass
+class DBConfig:
+    source_root: str
+    db_path: Path
+class NeighborSummary(BaseModel):
+    module_id: str
+    ast_summary: str
+    review_summary: Optional[str]
+class NodeWithNeighbors(BaseModel):
+    module_id: str
+    ast_summary: str
+    review_status: ReviewStatus
+    neighbors: list[NeighborSummary]
+class GraphNodeRecord(BaseModel):
+    module_id: str
+    ast_summary: str
+    review_status: ReviewStatus
+class GraphEdgeRecord(BaseModel):
+    source_module_id: str
+    target_module_id: str
+    weight: float
+    import_line: str
+class GraphSnapshot(BaseModel):
+    nodes: list[GraphNodeRecord]
+    edges: list[GraphEdgeRecord]
+class Store:
+    def __init__(self, source_root: str, db_path: str | Path | None = None) -> None:
+        self.config = DBConfig(
+            source_root=str(Path(source_root).resolve()),
+            db_path=Path(db_path) if db_path else get_default_db_path(),
+        )
+        init_db(db_path=self.config.db_path)
+        self.engine = get_engine(self.config.db_path)
+    def session(self) -> Iterator[Session]:
+        with Session(self.engine) as session:
+            yield session
+    def upsert_node(
+        self,
+        module_id: str,
+        raw_code: str,
+        ast_summary: str,
+        dependency_reason: str,
+    ) -> ModuleNode:
+        with Session(self.engine) as session:
+            existing = session.exec(
+                select(ModuleNode).where(
+                    ModuleNode.source_root == self.config.source_root,
+                    ModuleNode.module_id == module_id,
+                )
+            ).first()
+            if existing:
+                existing.raw_code = raw_code
+                existing.ast_summary = ast_summary
+                existing.dependency_reason = dependency_reason
+                existing.updated_at = datetime.now(UTC)
+                session.add(existing)
+                session.commit()
+                session.refresh(existing)
+                return existing
+            node = ModuleNode(
+                source_root=self.config.source_root,
+                module_id=module_id,
+                raw_code=raw_code,
+                ast_summary=ast_summary,
+                dependency_reason=dependency_reason,
+            )
+            session.add(node)
+            session.commit()
+            session.refresh(node)
+            return node
+    def upsert_edge(
+        self,
+        source_module_id: str,
+        target_module_id: str,
+        edge_type: EdgeType,
+        import_line: str,
+        weight: float,
+    ) -> ModuleEdge:
+        with Session(self.engine) as session:
+            existing = session.exec(
+                select(ModuleEdge).where(
+                    ModuleEdge.source_root == self.config.source_root,
+                    ModuleEdge.source_module_id == source_module_id,
+                    ModuleEdge.target_module_id == target_module_id,
+                    ModuleEdge.import_line == import_line,
+                )
+            ).first()
+            if existing:
+                existing.edge_type = edge_type
+                existing.weight = weight
+                session.add(existing)
+                session.commit()
+                session.refresh(existing)
+                return existing
+            edge = ModuleEdge(
+                source_root=self.config.source_root,
+                source_module_id=source_module_id,
+                target_module_id=target_module_id,
+                edge_type=edge_type,
+                import_line=import_line,
+                weight=weight,
+            )
+            session.add(edge)
+            session.commit()
+            session.refresh(edge)
+            return edge
+    def replace_findings_for_module(self, module_id: str, findings: list[dict[str, str | int]]) -> None:
+        with Session(self.engine) as session:
+            session.exec(
+                delete(LinterFinding).where(
+                    LinterFinding.source_root == self.config.source_root,
+                    LinterFinding.module_id == module_id,
+                )
+            )
+            for finding in findings:
+                session.add(
+                    LinterFinding(
+                        source_root=self.config.source_root,
+                        module_id=module_id,
+                        tool=str(finding["tool"]),
+                        line=int(finding["line"]),
+                        severity=Severity(str(finding["severity"])),
+                        code=str(finding["code"]),
+                        message=str(finding["message"]),
+                    )
+                )
+            session.commit()
+    def get_findings(self, module_id: str) -> list[LinterFinding]:
+        with Session(self.engine) as session:
+            return list(
+                session.exec(
+                    select(LinterFinding).where(
+                        LinterFinding.source_root == self.config.source_root,
+                        LinterFinding.module_id == module_id,
+                    )
+                ).all()
+            )
+    def get_node(self, module_id: str) -> Optional[ModuleNode]:
+        with Session(self.engine) as session:
+            return session.exec(
+                select(ModuleNode).where(
+                    ModuleNode.source_root == self.config.source_root,
+                    ModuleNode.module_id == module_id,
+                )
+            ).first()
+    def get_node_with_neighbors(self, module_id: str) -> Optional[NodeWithNeighbors]:
+        with Session(self.engine) as session:
+            node = session.exec(
+                select(ModuleNode).where(
+                    ModuleNode.source_root == self.config.source_root,
+                    ModuleNode.module_id == module_id,
+                )
+            ).first()
+            if not node:
+                return None
+            outgoing = list(
+                session.exec(
+                    select(ModuleEdge).where(
+                        ModuleEdge.source_root == self.config.source_root,
+                        ModuleEdge.source_module_id == module_id,
+                    )
+                ).all()
+            )
+            incoming = list(
+                session.exec(
+                    select(ModuleEdge).where(
+                        ModuleEdge.source_root == self.config.source_root,
+                        ModuleEdge.target_module_id == module_id,
+                    )
+                ).all()
+            )
+            neighbor_ids = {edge.target_module_id for edge in outgoing}
+            neighbor_ids.update(edge.source_module_id for edge in incoming)
+            neighbors: list[NeighborSummary] = []
+            for neighbor_id in sorted(neighbor_ids):
+                neighbor = session.exec(
+                    select(ModuleNode).where(
+                        ModuleNode.source_root == self.config.source_root,
+                        ModuleNode.module_id == neighbor_id,
+                    )
+                ).first()
+                if neighbor:
+                    neighbors.append(
+                        NeighborSummary(
+                            module_id=neighbor.module_id,
+                            ast_summary=neighbor.ast_summary,
+                            review_summary=neighbor.review_summary,
+                        )
+                    )
+            return NodeWithNeighbors(
+                module_id=node.module_id,
+                ast_summary=node.ast_summary,
+                review_status=node.review_status,
+                neighbors=neighbors,
+            )
+    def update_annotation(
+        self,
+        module_id: str,
+        episode_id: str,
+        step_number: int,
+        action_type: str,
+        note: str,
+        review_summary: str | None = None,
+        review_status: ReviewStatus | None = None,
+    ) -> None:
+        with Session(self.engine) as session:
+            node = session.exec(
+                select(ModuleNode).where(
+                    ModuleNode.source_root == self.config.source_root,
+                    ModuleNode.module_id == module_id,
+                )
+            ).first()
+            if not node:
+                raise ValueError(f"Unknown module: {module_id}")
+            node.review_annotation = note
+            if review_summary is not None:
+                node.review_summary = review_summary
+            if review_status is not None:
+                node.review_status = review_status
+            node.updated_at = datetime.now(UTC)
+            session.add(node)
+            session.add(
+                ReviewAnnotation(
+                    source_root=self.config.source_root,
+                    module_id=module_id,
+                    episode_id=episode_id,
+                    step_number=step_number,
+                    action_type=action_type,
+                    note=note,
+                )
+            )
+            session.commit()
+    def get_full_graph(self) -> GraphSnapshot:
+        with Session(self.engine) as session:
+            nodes = list(
+                session.exec(
+                    select(ModuleNode).where(ModuleNode.source_root == self.config.source_root)
+                ).all()
+            )
+            edges = list(
+                session.exec(
+                    select(ModuleEdge).where(ModuleEdge.source_root == self.config.source_root)
+                ).all()
+            )
+        return GraphSnapshot(
+            nodes=[
+                GraphNodeRecord(
+                    module_id=node.module_id,
+                    ast_summary=node.ast_summary,
+                    review_status=node.review_status,
+                )
+                for node in nodes
+            ],
+            edges=[
+                GraphEdgeRecord(
+                    source_module_id=edge.source_module_id,
+                    target_module_id=edge.target_module_id,
+                    weight=edge.weight,
+                    import_line=edge.import_line,
+                )
+                for edge in edges
+            ],
+        )
+    def has_nodes(self) -> bool:
+        with Session(self.engine) as session:
+            first_node = session.exec(
+                select(ModuleNode.id).where(ModuleNode.source_root == self.config.source_root)
+            ).first()
+            return first_node is not None
+    def clear_source_graph(self) -> None:
+        with Session(self.engine) as session:
+            session.exec(
+                delete(ReviewAnnotation).where(
+                    ReviewAnnotation.source_root == self.config.source_root
+                )
+            )
+            session.exec(
+                delete(LinterFinding).where(
+                    LinterFinding.source_root == self.config.source_root
+                )
+            )
+            session.exec(
+                delete(ModuleEdge).where(
+                    ModuleEdge.source_root == self.config.source_root
+                )
+            )
+            session.exec(
+                delete(ModuleNode).where(
+                    ModuleNode.source_root == self.config.source_root
+                )
+            )
+            session.commit()
+    def clear_annotations(self) -> None:
+        with Session(self.engine) as session:
+            nodes = list(
+                session.exec(
+                    select(ModuleNode).where(ModuleNode.source_root == self.config.source_root)
+                ).all()
+            )
+            for node in nodes:
+                node.review_annotation = None
+                node.review_summary = None
+                node.review_status = ReviewStatus.PENDING
+                node.updated_at = datetime.now(UTC)
+                session.add(node)
+            session.commit()
+def _build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(description="Store query helper")
+    parser.add_argument("--root", default="sample_codebase", help="Source root directory")
+    parser.add_argument("--db-path", default=None, help="SQLite path")
+    parser.add_argument("--module", required=True, help="Module id (without .py)")
+    return parser
+def main() -> None:
+    args = _build_parser().parse_args()
+    store = Store(source_root=args.root, db_path=args.db_path)
+    result = store.get_node_with_neighbors(args.module)
+    if result is None:
+        print(f"Module '{args.module}' not found")
+        return
+    print(result.model_dump_json(indent=2))
+if __name__ == "__main__":
+    main()
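
Note on the upsert pattern in `Store.upsert_node` above: the method selects a row by `(source_root, module_id)`, updates it if found, and inserts otherwise. The same shape can be sketched with the standard-library `sqlite3` module. This is only an illustration under assumptions: it does not use SQLModel, and the table name `module_node` and the reduced column set are hypothetical.

```python
import sqlite3


def upsert_node(conn: sqlite3.Connection, source_root: str, module_id: str, ast_summary: str) -> int:
    """Select-then-update-or-insert, scoped by (source_root, module_id)."""
    row = conn.execute(
        "SELECT id FROM module_node WHERE source_root = ? AND module_id = ?",
        (source_root, module_id),
    ).fetchone()
    if row is not None:
        # Existing row: refresh the mutable column and keep the same primary key.
        conn.execute("UPDATE module_node SET ast_summary = ? WHERE id = ?", (ast_summary, row[0]))
        conn.commit()
        return row[0]
    # No row yet for this source root and module: insert a new one.
    cur = conn.execute(
        "INSERT INTO module_node (source_root, module_id, ast_summary) VALUES (?, ?, ?)",
        (source_root, module_id, ast_summary),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
# Hypothetical, reduced table; the real schema lives in db/schema.py.
conn.execute(
    "CREATE TABLE module_node ("
    "id INTEGER PRIMARY KEY, source_root TEXT, module_id TEXT, ast_summary TEXT)"
)
first = upsert_node(conn, "/repo", "cart", "v1")
second = upsert_node(conn, "/repo", "cart", "v2")   # same key: updates in place
other = upsert_node(conn, "/other", "cart", "v1")   # different source root: new row
rows = conn.execute(
    "SELECT source_root, module_id, ast_summary FROM module_node ORDER BY id"
).fetchall()
```

Because every query filters on `source_root`, two codebases can share one database file without their module ids colliding, which appears to be the intent of the `source_root` column in the schema.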

code-review-env/env/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Environment package for CodeReviewEnv."""

code-review-env/env/environment.py ADDED

	@@ -0,0 +1,6 @@

+"""Phase 2 implementation placeholder."""
+class CodeReviewEnv:
+    def __init__(self) -> None:
+        raise NotImplementedError("Phase 2 implementation pending")

code-review-env/env/graph.py ADDED

	@@ -0,0 +1,105 @@

+from __future__ import annotations
+from dataclasses import dataclass
+from pathlib import Path
+import networkx as nx
+from sqlmodel import Session, select
+from db.schema import ModuleEdge, ModuleNode
+from db.store import Store
+from parser.ast_parser import parse_directory
+@dataclass
+class GraphLoadResult:
+    graph: nx.DiGraph
+    loaded_from_cache: bool
+class DependencyGraph:
+    def __init__(self, target_dir: str | Path, db_path: str | Path | None = None) -> None:
+        self.target_dir = Path(target_dir).resolve()
+        self.store = Store(source_root=str(self.target_dir), db_path=db_path)
+    def load_or_build(self, force_reparse: bool = False) -> GraphLoadResult:
+        if force_reparse or not self.store.has_nodes():
+            parse_directory(self.target_dir, db_path=str(self.store.config.db_path))
+            loaded_from_cache = False
+        else:
+            loaded_from_cache = True
+        return GraphLoadResult(graph=self._build_graph(), loaded_from_cache=loaded_from_cache)
+    def _build_graph(self) -> nx.DiGraph:
+        graph = nx.DiGraph()
+        with Session(self.store.engine) as session:
+            nodes = list(
+                session.exec(
+                    select(ModuleNode).where(ModuleNode.source_root == self.store.config.source_root)
+                ).all()
+            )
+            edges = list(
+                session.exec(
+                    select(ModuleEdge).where(ModuleEdge.source_root == self.store.config.source_root)
+                ).all()
+            )
+        for node in nodes:
+            graph.add_node(
+                node.module_id,
+                ast_summary=node.ast_summary,
+                review_status=node.review_status.value,
+            )
+        for edge in edges:
+            graph.add_edge(
+                edge.source_module_id,
+                edge.target_module_id,
+                import_line=edge.import_line,
+                edge_type=edge.edge_type.value,
+                weight=edge.weight,
+            )
+        return graph
+    def traversal_order(self, graph: nx.DiGraph | None = None) -> list[str]:
+        # An empty nx.DiGraph is falsy, so compare against None explicitly.
+        if graph is None:
+            graph = self._build_graph()
+        if graph.number_of_nodes() == 0:
+            return []
+        if not nx.is_directed_acyclic_graph(graph):
+            # Fall back to deterministic ordering if cyclic imports exist.
+            return sorted(graph.nodes())
+        centrality = nx.betweenness_centrality(graph)
+        indegree = {node: graph.in_degree(node) for node in graph.nodes()}
+        queue = [node for node, deg in indegree.items() if deg == 0]
+        order: list[str] = []
+        def rank(node: str) -> tuple[float, float, str]:
+            return (
+                float(graph.out_degree(node)),
+                float(centrality.get(node, 0.0)),
+                node,
+            )
+        while queue:
+            queue.sort(key=rank)
+            current = queue.pop(0)
+            order.append(current)
+            for successor in sorted(graph.successors(current)):
+                indegree[successor] -= 1
+                if indegree[successor] == 0:
+                    queue.append(successor)
+        return order
+if __name__ == "__main__":
+    manager = DependencyGraph(target_dir="sample_codebase")
+    result = manager.load_or_build()
+    print(
+        f"Loaded graph with {result.graph.number_of_nodes()} nodes and "
+        f"{result.graph.number_of_edges()} edges (cache={result.loaded_from_cache})"
+    )
+    print("Traversal order:", manager.traversal_order(result.graph))
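
Note on `traversal_order` above: it is a Kahn-style topological sort whose ready queue is re-sorted on every step. Edges run from the importing module to the imported one, so modules that nobody imports are visited first. The sketch below reproduces that control flow with the standard library only. It is a simplification under stated assumptions: the betweenness-centrality tie-breaker is omitted (ties fall back to out-degree, then name), and the edge list is a hand-written guess at what the sample codebase would produce.

```python
def traversal_order(edges: list[tuple[str, str]]) -> list[str]:
    """Kahn-style ordering; ready nodes are ranked by (out-degree, name)."""
    successors: dict[str, set[str]] = {}
    indegree: dict[str, int] = {}
    for src, dst in edges:
        successors.setdefault(src, set())
        successors.setdefault(dst, set())
        indegree.setdefault(src, 0)
        indegree.setdefault(dst, 0)
        if dst not in successors[src]:
            successors[src].add(dst)
            indegree[dst] += 1

    remaining = dict(indegree)
    queue = [node for node, deg in remaining.items() if deg == 0]
    order: list[str] = []
    while queue:
        # Re-rank the ready set each round, as the original does with queue.sort(key=rank).
        queue.sort(key=lambda node: (len(successors[node]), node))
        current = queue.pop(0)
        order.append(current)
        for nxt in sorted(successors[current]):
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                queue.append(nxt)

    if len(order) != len(indegree):
        # Some nodes never became ready, so there is a cycle: fall back to name order.
        return sorted(indegree)
    return order


# Assumed import edges for the sample codebase (importer -> imported).
SAMPLE_EDGES = [
    ("auth", "config"),
    ("cart", "config"),
    ("checkout", "cart"),
    ("checkout", "payments"),
    ("payments", "subprocess"),
]
sample_order = traversal_order(SAMPLE_EDGES)
cyclic_order = traversal_order([("a", "b"), ("b", "a")])
```

With these assumed edges, `checkout` is visited before `cart`, and `cart` before `config`, so a reviewer following this order meets the surface symptom before the root cause.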

code-review-env/env/models.py ADDED

	@@ -0,0 +1 @@

+"""Phase 2 implementation placeholder."""

code-review-env/env/observation_builder.py ADDED

	@@ -0,0 +1 @@

+"""Phase 2 implementation placeholder."""

code-review-env/env/reward.py ADDED

	@@ -0,0 +1 @@

+"""Phase 2 implementation placeholder."""

code-review-env/graders/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Graders package placeholder for later phases."""

code-review-env/graders/base_grader.py ADDED

	@@ -0,0 +1,5 @@

+"""Phase 2+ implementation placeholder."""
+class BaseGrader:
+    pass

code-review-env/graders/easy_grader.py ADDED

	@@ -0,0 +1 @@

+"""Phase 2+ implementation placeholder."""

code-review-env/graders/hard_grader.py ADDED

	@@ -0,0 +1 @@

+"""Phase 2+ implementation placeholder."""

code-review-env/graders/medium_grader.py ADDED

	@@ -0,0 +1 @@

+"""Phase 2+ implementation placeholder."""

code-review-env/inference.py ADDED

	@@ -0,0 +1,4 @@

+"""Phase 4 implementation placeholder."""
+if __name__ == "__main__":
+    raise SystemExit("inference.py is not implemented in Phase 1")

code-review-env/openenv.yaml ADDED

	@@ -0,0 +1,3 @@

+name: code-review-env
+version: 0.1.0
+description: Phase 1 scaffold

code-review-env/parser/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Parser package for CodeReviewEnv."""

code-review-env/parser/ast_parser.py ADDED

	@@ -0,0 +1,189 @@

+from __future__ import annotations
+import argparse
+import ast
+from pathlib import Path
+from pydantic import BaseModel
+from db.schema import EdgeType
+from db.store import Store
+from parser.linter import run_linters
+from parser.summarizer import summarize_module
+class ImportRef(BaseModel):
+    target_module: str
+    import_line: str
+    edge_type: EdgeType = EdgeType.EXPLICIT_IMPORT
+class ParsedModule(BaseModel):
+    module_id: str
+    raw_code: str
+    function_signatures: list[str]
+    classes: list[str]
+    imports: list[ImportRef]
+    constants: list[str]
+    dependencies: list[str]
+class _Visitor(ast.NodeVisitor):
+    def __init__(self) -> None:
+        self.function_signatures: list[str] = []
+        self.classes: list[str] = []
+        self.constants: list[str] = []
+        self.imports: list[tuple[str, str]] = []
+    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
+        args: list[str] = []
+        for arg in node.args.args:
+            if arg.annotation is not None:
+                args.append(f"{arg.arg}: {ast.unparse(arg.annotation)}")
+            else:
+                args.append(arg.arg)
+        returns = ast.unparse(node.returns) if node.returns is not None else "None"
+        self.function_signatures.append(f"{node.name}({', '.join(args)})->{returns}")
+        self.generic_visit(node)
+    def visit_AsyncFunctionDef(self, node: ast.AsyncFunctionDef) -> None:
+        fake = ast.FunctionDef(
+            name=node.name,
+            args=node.args,
+            body=node.body,
+            decorator_list=node.decorator_list,
+            returns=node.returns,
+            type_comment=node.type_comment,
+        )
+        self.visit_FunctionDef(fake)
+    def visit_ClassDef(self, node: ast.ClassDef) -> None:
+        self.classes.append(node.name)
+        self.generic_visit(node)
+    def visit_Import(self, node: ast.Import) -> None:
+        line = ast.get_source_segment(self._source, node) or "import"
+        for alias in node.names:
+            self.imports.append((alias.name, line))
+    def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
+        module = node.module or ""
+        level = node.level or 0
+        dotted = "." * level + module
+        line = ast.get_source_segment(self._source, node) or "from"
+        self.imports.append((dotted, line))
+    def visit_Assign(self, node: ast.Assign) -> None:
+        if isinstance(node.value, ast.Constant):
+            for target in node.targets:
+                if isinstance(target, ast.Name) and target.id.isupper():
+                    self.constants.append(target.id)
+        self.generic_visit(node)
+    def parse(self, tree: ast.AST, source: str) -> None:
+        self._source = source
+        self.visit(tree)
+def _to_module_id(path: Path, root: Path) -> str:
+    rel = path.resolve().relative_to(root.resolve())
+    # Join path parts rather than replacing "/" so Windows separators are handled too.
+    return ".".join(rel.with_suffix("").parts)
+def _resolve_relative_import(current_module: str, ref: str) -> str:
+    if not ref.startswith("."):
+        return ref
+    dots = len(ref) - len(ref.lstrip("."))
+    suffix = ref.lstrip(".")
+    parts = current_module.split(".")
+    base = parts[:-dots] if dots <= len(parts) else []
+    if suffix:
+        base.append(suffix)
+    return ".".join(part for part in base if part)
+def parse_python_file(path: Path, root_dir: Path) -> ParsedModule:
+    source = path.read_text(encoding="utf-8")
+    module_id = _to_module_id(path, root_dir)
+    tree = ast.parse(source)
+    visitor = _Visitor()
+    visitor.parse(tree, source)
+    imports = [
+        ImportRef(
+            target_module=_resolve_relative_import(module_id, name),
+            import_line=line,
+            edge_type=EdgeType.EXPLICIT_IMPORT,
+        )
+        for name, line in visitor.imports
+    ]
+    dependencies = [imp.target_module for imp in imports if imp.target_module]
+    return ParsedModule(
+        module_id=module_id,
+        raw_code=source,
+        function_signatures=visitor.function_signatures,
+        classes=visitor.classes,
+        imports=imports,
+        constants=visitor.constants,
+        dependencies=dependencies,
+    )
+def parse_directory(target_dir: Path, db_path: str | None = None) -> Store:
+    target_dir = target_dir.resolve()
+    store = Store(source_root=str(target_dir), db_path=db_path)
+    store.clear_source_graph()
+    py_files = sorted(target_dir.rglob("*.py"))
+    for py_file in py_files:
+        parsed = parse_python_file(py_file, target_dir)
+        issues = run_linters(py_file)
+        summary = summarize_module(parsed, issues)
+        dep_reason = "Imports used by module-level and callable logic"
+        store.upsert_node(
+            module_id=parsed.module_id,
+            raw_code=parsed.raw_code,
+            ast_summary=summary,
+            dependency_reason=dep_reason,
+        )
+        store.replace_findings_for_module(
+            parsed.module_id,
+            [issue.model_dump() for issue in issues],
+        )
+        for imported in parsed.imports:
+            if imported.target_module:
+                store.upsert_edge(
+                    source_module_id=parsed.module_id,
+                    target_module_id=imported.target_module,
+                    edge_type=imported.edge_type,
+                    import_line=imported.import_line,
+                    weight=1.0,
+                )
+    return store
+def _build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(description="Parse Python codebase into SQLite graph")
+    parser.add_argument("target", help="Path to target codebase")
+    parser.add_argument("--db-path", default=None, help="Path to SQLite database")
+    return parser
+def main() -> None:
+    args = _build_parser().parse_args()
+    target_dir = Path(args.target)
+    store = parse_directory(target_dir=target_dir, db_path=args.db_path)
+    snapshot = store.get_full_graph()
+    print(
+        f"Populated DB for {target_dir} with "
+        f"{len(snapshot.nodes)} nodes and {len(snapshot.edges)} edges"
+    )
+if __name__ == "__main__":
+    main()
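
Note on import handling in `ast_parser.py` above: the visitor records one `(dotted_name, source_line)` pair per import, and `_resolve_relative_import` then turns leading dots into an absolute module id. The sketch below shows both steps end to end using only the standard-library `ast` module. The function names `extract_imports` and `resolve_relative` and the sample source string are illustrative, not part of the commit.

```python
import ast


def extract_imports(source: str) -> list[tuple[str, str]]:
    """Return (dotted_name, source_line) pairs, one per imported module."""
    found: list[tuple[str, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            line = ast.get_source_segment(source, node) or "import"
            found.extend((alias.name, line) for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports keep their leading dots, e.g. "..core".
            dotted = "." * (node.level or 0) + (node.module or "")
            line = ast.get_source_segment(source, node) or "from"
            found.append((dotted, line))
    return found


def resolve_relative(current_module: str, ref: str) -> str:
    """Resolve a dotted reference against the module that contains it."""
    if not ref.startswith("."):
        return ref
    dots = len(ref) - len(ref.lstrip("."))
    parts = current_module.split(".")
    # One dot means "my package", two dots "its parent", and so on.
    base = parts[:-dots] if dots <= len(parts) else []
    suffix = ref.lstrip(".")
    if suffix:
        base.append(suffix)
    return ".".join(part for part in base if part)


SOURCE = "import os, json\nfrom . import helpers\nfrom ..core import engine\n"
pairs = extract_imports(SOURCE)
resolved = [resolve_relative("pkg.sub.mod", name) for name, _ in pairs]
```

One consequence worth noting: for `from . import helpers` only the package part is recorded, so the edge points at `pkg.sub` rather than `pkg.sub.helpers`. Whether that is the desired granularity is a design choice the commit does not state.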

code-review-env/parser/linter.py ADDED

	@@ -0,0 +1,104 @@

+from __future__ import annotations
+import json
+import subprocess
+from pathlib import Path
+import sys
+from pydantic import BaseModel
+class LinterIssue(BaseModel):
+    tool: str
+    line: int
+    severity: str
+    code: str
+    message: str
+_PYLINT_SEVERITY_MAP = {
+    "fatal": "high",
+    "error": "high",
+    "warning": "medium",
+    "refactor": "low",
+    "convention": "low",
+    "info": "low",
+}
+_BANDIT_SEVERITY_MAP = {
+    "high": "high",
+    "medium": "medium",
+    "low": "low",
+}
+def run_pylint(path: Path) -> list[LinterIssue]:
+    cmd = [
+        sys.executable,
+        "-m",
+        "pylint",
+        str(path),
+        "--output-format=json2",
+        "--score=n",
+        "--reports=n",
+    ]
+    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
+    payload = (proc.stdout or "").strip()
+    if not payload:
+        return []
+    try:
+        data = json.loads(payload)
+    except json.JSONDecodeError:
+        return []
+    messages = data.get("messages", []) if isinstance(data, dict) else []
+    issues: list[LinterIssue] = []
+    for message in messages:
+        severity = _PYLINT_SEVERITY_MAP.get(str(message.get("type", "")).lower(), "low")
+        issues.append(
+            LinterIssue(
+                tool="pylint",
+                line=int(message.get("line", 0)),
+                severity=severity,
+                code=str(message.get("messageId", "PL0000")),
+                message=str(message.get("message", "")),
+            )
+        )
+    return issues
+def run_bandit(path: Path) -> list[LinterIssue]:
+    cmd = [sys.executable, "-m", "bandit", "-q", "-f", "json", str(path)]
+    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
+    payload = (proc.stdout or "").strip()
+    if not payload:
+        return []
+    try:
+        data = json.loads(payload)
+    except json.JSONDecodeError:
+        return []
+    results = data.get("results", []) if isinstance(data, dict) else []
+    issues: list[LinterIssue] = []
+    for item in results:
+        raw_sev = str(item.get("issue_severity", "LOW")).lower()
+        issues.append(
+            LinterIssue(
+                tool="bandit",
+                line=int(item.get("line_number", 0)),
+                severity=_BANDIT_SEVERITY_MAP.get(raw_sev, "low"),
+                code=str(item.get("test_id", "B000")),
+                message=str(item.get("issue_text", "")),
+            )
+        )
+    return issues
+def run_linters(path: Path) -> list[LinterIssue]:
+    issues = run_pylint(path)
+    issues.extend(run_bandit(path))
+    return issues
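
Note on `run_pylint` above: the parsing step can be exercised without invoking pylint by feeding it a payload directly. The sketch below isolates that step. The sample payload is hand-written and carries only the keys the parser reads (`type`, `line`, `messageId`, `message`); real `json2` output contains additional fields, and the exact message texts here are assumptions.

```python
import json

# Same mapping as _PYLINT_SEVERITY_MAP in the diff.
SEVERITY_MAP = {
    "fatal": "high",
    "error": "high",
    "warning": "medium",
    "refactor": "low",
    "convention": "low",
    "info": "low",
}


def parse_pylint_payload(payload: str) -> list[dict[str, object]]:
    """Turn a pylint JSON payload into normalized issue dicts; never raises on bad input."""
    payload = payload.strip()
    if not payload:
        return []
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return []
    messages = data.get("messages", []) if isinstance(data, dict) else []
    return [
        {
            "tool": "pylint",
            "line": int(msg.get("line", 0)),
            # Unknown message types fall back to "low", as in the original.
            "severity": SEVERITY_MAP.get(str(msg.get("type", "")).lower(), "low"),
            "code": str(msg.get("messageId", "PL0000")),
            "message": str(msg.get("message", "")),
        }
        for msg in messages
    ]


# Hand-written sample payload (assumption, not captured tool output).
SAMPLE = json.dumps(
    {
        "messages": [
            {"type": "convention", "line": 7, "messageId": "C0301", "message": "Line too long"},
            {"type": "error", "line": 3, "messageId": "E0401", "message": "Unable to import"},
        ]
    }
)
issues = parse_pylint_payload(SAMPLE)
```

Returning an empty list on empty or malformed output keeps the pipeline running, at the cost of making a missing or crashing linter indistinguishable from a clean file.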

code-review-env/parser/summarizer.py ADDED

	@@ -0,0 +1,24 @@

+from __future__ import annotations
+from typing import TYPE_CHECKING
+from parser.linter import LinterIssue
+if TYPE_CHECKING:
+    from parser.ast_parser import ParsedModule
+def _truncate_tokens(text: str, max_tokens: int = 100) -> str:
+    words = text.split()
+    if len(words) <= max_tokens:
+        return text
+    return " ".join(words[:max_tokens])
+def summarize_module(parsed: ParsedModule, issues: list[LinterIssue]) -> str:
+    exports = ", ".join(parsed.function_signatures[:5])
+    deps = ", ".join(sorted(set(parsed.dependencies))[:5])
+    summary = (
+        f"exports: [{exports}] | issues: {len(issues)} | depends_on: [{deps}]"
+    )
+    return _truncate_tokens(summary, max_tokens=100)
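
Note on `_truncate_tokens` above: despite the name, the budget is counted in whitespace-separated words, not model tokens. A small sketch of the behaviour follows. The function name `truncate_words` and the sample summary string are illustrative only.

```python
def truncate_words(text: str, max_words: int = 100) -> str:
    """Keep at most max_words whitespace-separated words; short text is returned unchanged."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])


# Illustrative summary in the "exports | issues | depends_on" shape used by summarize_module.
summary = (
    "exports: [calculate_subtotal(items: list[dict[str, float]])->float] "
    "| issues: 2 | depends_on: [config]"
)
word_count = len(summary.split())
clipped = truncate_words(summary, max_words=4)
```

Because a signature such as `list[dict[str, float]]` splits into several "words", a tight budget can cut a summary in the middle of a signature, as the clipped value here shows.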

code-review-env/pyproject.toml ADDED

	@@ -0,0 +1,13 @@

+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "code-review-env"
+version = "0.1.0"
+description = "OpenEnv CodeReviewEnv"
+requires-python = ">=3.11"
+[tool.pytest.ini_options]
+pythonpath = ["."]
+testpaths = ["tests"]

code-review-env/requirements.txt ADDED

	@@ -0,0 +1,9 @@

+sqlmodel>=0.0.24
+networkx>=3.2
+pydantic>=2.7
+pylint>=3.2
+bandit>=1.7
+fastapi>=0.115
+uvicorn>=0.30
+openai>=1.40
+pytest>=8.2

code-review-env/sample_codebase/auth.py ADDED

	@@ -0,0 +1,7 @@

+"""Auth helpers."""
+import config
+def issue_session_token(user_id: str) -> str:
+    return f"{user_id}:{config.SECRET_KEY}:session-token-generated-with-a-very-long-suffix-that-triggers-style-rules-and-is-hard-to-read"

code-review-env/sample_codebase/cart.py ADDED

	@@ -0,0 +1,17 @@

+"""Cart calculations."""
+import config
+def calculate_subtotal(items: list[dict[str, float]]) -> float:
+    subtotal = 0.0
+    for item in items:
+        subtotal += float(item.get("price", 0.0)) * float(item.get("qty", 0.0))
+    return subtotal
+def calculate_total(items: list[dict[str, float]]) -> float:
+    subtotal = calculate_subtotal(items)
+    # BUG: config.DISCOUNT_RATE is intended to be 0.20, but set to 20 in config.
+    discounted = subtotal - (subtotal * config.DISCOUNT_RATE)
+    return discounted + (discounted * config.TAX_RATE)

code-review-env/sample_codebase/checkout.py ADDED

	@@ -0,0 +1,15 @@

+"""Checkout flow."""
+import cart
+import payments
+def submit_order(items: list[dict[str, float]]) -> str:
+    total = cart.calculate_total(items)
+    # Cascading symptom: negative total is observed here but root cause is config -> cart.
+    if total < 0:
+        return "error: negative total"
+    gateway_ok = payments.run_gateway_check("https://gateway.example.com/health")
+    if gateway_ok != 0:
+        return "error: gateway"
+    return payments.charge(total)

code-review-env/sample_codebase/config.py ADDED

	@@ -0,0 +1,6 @@

+"""Configuration defaults for the checkout flow."""
+DISCOUNT_RATE = 20
+TAX_RATE = 0.07
+PAYMENT_TIMEOUT_SECONDS = 30
+SECRET_KEY = "hardcoded-dev-key"

code-review-env/sample_codebase/ground_truth.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "issues": [
+    {
+      "id": "STYLE_001",
+      "module": "auth",
+      "line": 7,
+      "type": "style",
+      "tool": "pylint",
+      "code": "C0301",
+      "message_contains": "Line too long"
+    },
+    {
+      "id": "LOGIC_001",
+      "module": "checkout",
+      "line": 7,
+      "type": "logic",
+      "description": "Negative total symptom due to dependency behavior in cart"
+    },
+    {
+      "id": "SECURITY_001",
+      "module": "payments",
+      "line": 9,
+      "type": "security",
+      "tool": "bandit",
+      "code": "B602",
+      "message_contains": "shell=True"
+    },
+    {
+      "id": "CASCADE_001",
+      "module": "checkout",
+      "line": 7,
+      "type": "dependency_cascade",
+      "root_cause_module": "config",
+      "surface_module": "checkout",
+      "path": ["config", "cart", "checkout"],
+      "description": "DISCOUNT_RATE configured as 20 instead of 0.20 causes cart miscalculation and checkout failure"
+    }
+  ]
+}
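
Note on `CASCADE_001` above: the cascade is easy to confirm with a little arithmetic. The sketch below inlines the two constants from the sample `config.py` and the formula from `cart.calculate_total`, then compares the configured rate with the intended 0.20 named in the ground truth. The subtotal of 100.0 is an arbitrary illustrative input.

```python
DISCOUNT_RATE = 20   # value in the sample config.py; the ground truth says 0.20 was intended
TAX_RATE = 0.07


def calculate_total(subtotal: float, discount_rate: float, tax_rate: float = TAX_RATE) -> float:
    """Same formula as cart.calculate_total, with the rates passed in explicitly."""
    discounted = subtotal - (subtotal * discount_rate)
    return discounted + (discounted * tax_rate)


# With the configured rate, the "discount" is 20x the subtotal, so the total goes negative.
buggy_total = calculate_total(100.0, DISCOUNT_RATE)    # roughly -2033
# With the intended rate, the total is a 20% discount plus 7% tax.
intended_total = calculate_total(100.0, 0.20)          # roughly 85.6

# checkout.submit_order only sees the symptom: a negative total.
symptom_visible_in_checkout = buggy_total < 0
```

This matches the `path` in the ground truth: the wrong value sits in `config`, the miscalculation happens in `cart`, and the failure surfaces in `checkout`.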

code-review-env/sample_codebase/payments.py ADDED

	@@ -0,0 +1,15 @@

+"""Payment gateway wrapper."""
+import subprocess
+def run_gateway_check(endpoint: str) -> int:
+    # SECURITY ISSUE: user-provided endpoint is interpolated in a shell command.
+    command = f"curl -s {endpoint}"
+    return subprocess.call(command, shell=True)
+def charge(total: float) -> str:
+    if total <= 0:
+        return "rejected"
+    return "charged"
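
The `shell=True` call above is the planted `SECURITY_001` issue (Bandit `B602`) and is meant to stay in the sample codebase. For comparison, here is a sketch of the fix a reviewer might propose: pass the command as an argument list so no shell interprets the endpoint. The names `build_gateway_command` and `run_gateway_check_safe` are hypothetical and do not appear in this commit.

```python
import subprocess


def build_gateway_command(endpoint: str) -> list[str]:
    """Build the curl invocation as an argument list (no shell parsing)."""
    return ["curl", "-s", endpoint]


def run_gateway_check_safe(endpoint: str) -> int:
    """Run curl without a shell and return its exit code.

    Raises FileNotFoundError if curl is not installed.
    """
    return subprocess.call(build_gateway_command(endpoint))


# Shell metacharacters stay inside a single argument instead of being executed.
print(build_gateway_command("https://gateway.example.com/health; rm -rf /"))
```

This removes the shell injection only. An endpoint that starts with `-` could still be read by `curl` as an option, so validating the URL before the call would be a reasonable extra step.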

code-review-env/server/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Server package placeholder for later phases."""

code-review-env/server/app.py ADDED

	@@ -0,0 +1 @@


+"""Phase 3 implementation placeholder."""

code-review-env/tasks/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Tasks package placeholder for later phases."""

code-review-env/tasks/easy_task.py ADDED

	@@ -0,0 +1 @@


+"""Phase 2+ implementation placeholder."""

code-review-env/tasks/hard_task.py ADDED

	@@ -0,0 +1 @@


+"""Phase 2+ implementation placeholder."""

code-review-env/tasks/medium_task.py ADDED

	@@ -0,0 +1 @@


+"""Phase 2+ implementation placeholder."""

code-review-env/tasks/task_registry.py ADDED

	@@ -0,0 +1 @@


+"""Phase 2+ implementation placeholder."""

code-review-env/tests/test_environment.py ADDED

	@@ -0,0 +1,21 @@

+from pathlib import Path
+from env.graph import DependencyGraph
+def test_graph_builds_from_sample_codebase(tmp_path: Path) -> None:
+    db_path = tmp_path / "graph.db"
+    graph_mgr = DependencyGraph(target_dir="sample_codebase", db_path=db_path)
+    result = graph_mgr.load_or_build(force_reparse=True)
+    assert result.graph.number_of_nodes() >= 5
+    assert result.loaded_from_cache is False
+def test_graph_second_load_uses_cache(tmp_path: Path) -> None:
+    db_path = tmp_path / "graph.db"
+    graph_mgr = DependencyGraph(target_dir="sample_codebase", db_path=db_path)
+    graph_mgr.load_or_build(force_reparse=True)
+    second = graph_mgr.load_or_build(force_reparse=False)
+    assert second.loaded_from_cache is True

code-review-env/tests/test_graders.py ADDED

	@@ -0,0 +1,2 @@


+def test_phase1_placeholder() -> None:
+    assert True

code-review-env/tests/test_inference.py ADDED

	@@ -0,0 +1,2 @@


+def test_inference_placeholder() -> None:
+    assert True

code-review-env/tests/test_parser.py ADDED

	@@ -0,0 +1,13 @@

+from pathlib import Path
+from parser.ast_parser import parse_python_file
+def test_parse_python_file_extracts_core_elements() -> None:
+    root = Path("sample_codebase")
+    path = root / "cart.py"
+    parsed = parse_python_file(path=path, root_dir=root)
+    assert parsed.module_id == "cart"
+    assert any(sig.startswith("calculate_total(") for sig in parsed.function_signatures)
+    assert "config" in " ".join(parsed.dependencies)