title: NodeAudit - Graph-Aware Code Review RL Environment
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
tags:
  - openenv
  - reinforcement-learning
  - code-review
  - dependency-graph
  - rl-environment

NodeAudit — Graph-Aware Code Review RL Environment

1. Title + Badges


NodeAudit is an OpenEnv-compatible RL environment for dependency-aware code review where agents must reason about upstream causes before deciding downstream review actions.

Hugging Face Space: https://huggingface.co/spaces/Athmabhiram1/nodeaudit-openenv

2. Overview

NodeAudit simulates multi-file Python code review in a realistic setting where bugs and security risks propagate through imports and call paths. This is hard for AI agents because a local-looking defect in one module can be a symptom of a root cause in another module, and the agent must decide when to request context versus when to act under token limits. Unlike tools like CodeRabbit that review mostly in local diff context, this environment explicitly exposes dependency and dependent summaries so agents can reason about why a module behaves incorrectly before flagging it. A trained agent outputs structured review actions over episodes, then produces a fully annotated dependency graph with module-level review status, per-step rewards, and attribution links.

3. The RL Loop

3.1 Environment as MDP

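Define the environment as an episodic MDP $(S, O, A, R, P)$ with partial observability: the agent sees observations from $O$, never the full state in $S$. The transition function $P$ is realized by step(), which mutates the persistent state described below.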

  • State space $S$: Persistent graph state in SQLite plus episode runtime (episode_id, task spec, module order/index, step counters, cumulative reward, grader state, and all persisted annotations). Concrete state payload is exposed via GraphState in env/state.py.
  • Observation space $O$: CodeObservation from env/observation.py, token-budget constrained to MAX_TOTAL_TOKENS=2000 by graph/token_budget.py, including current module code, AST summary, ranked dependency/dependent summaries, neighbor review snippets, available actions, optional requested context.
  • Action space $A$: ReviewAction in env/action.py with ActionType in {FLAG_STYLE, FLAG_BUG, FLAG_SECURITY, FLAG_DEPENDENCY_ISSUE, ADD_COMMENT, REQUEST_CONTEXT, REQUEST_CHANGES, APPROVE, AMEND_REVIEW} plus typed fields (target_line, content, attributed_to, context_request).
  • Reward function $R$: Deterministic reward reasons in env/reward.py mapped by RAW_REWARD_TABLE and emitted by graders (easy, medium, hard) through ReviewReward.
  • Episode boundaries: reset(task_id=...) creates episode/module records and returns first observation. In step(), done=True when module list is exhausted or when step_count >= max(task.max_steps, NodeAudit_MAX_STEPS_PER_EPISODE).
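The termination rule in the last bullet can be sketched as a small predicate. The function and parameter names here are hypothetical and do not come from env/environment.py; only the rule itself is taken from the description above.

```python
def episode_done(module_index: int, num_modules: int,
                 step_count: int, task_max_steps: int, env_step_cap: int) -> bool:
    """Illustrative termination check; not the environment's actual code."""
    modules_exhausted = module_index >= num_modules
    # The rule above is stated as max(task.max_steps, <env cap>),
    # so the larger of the two limits is the one that ends the episode.
    step_limit_hit = step_count >= max(task_max_steps, env_step_cap)
    return modules_exhausted or step_limit_hit
```

Under this reading, a task with max_steps=8 keeps running past step 8 whenever the environment-level cap is higher.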

3.2 The Step Loop

from env.environment import CodeReviewEnv

env = CodeReviewEnv(source_root="sample_project")
obs = env.reset(task_id="cascade_review")
done = False

while not done:
    action = agent.act(obs)  # agent is your own policy object (not provided by the env); act() must return a ReviewAction
    result = env.step(action)  # StepResult
    obs = result.observation
    reward = result.reward
    done = result.done
    # reward is computed by grader against deterministic analyzer-backed findings
    # obs includes updated context and prior persisted review annotations

graph_state = env.state()  # full episode+graph state snapshot

3.3 Why This Is Real RL

This is online interaction, not static scoring: each step() mutates persistent environment state and changes subsequent observations. Rewards are produced live by grader logic over action trajectories, not by one-shot offline label lookup. Ground truth findings are generated ahead of episodes by deterministic static analyzers and then consumed during policy interaction. The agent must learn a sequential policy over graph-structured observations under token budgets and episode step limits.

4. Architecture

4.1 System Diagram

flowchart TD
    A[Codebase] --> B[AST Parser]
    B --> C[Chunker]
    C --> D[Graph Builder]
    D --> E[SQLite DB]
    E --> F[Graph Manager]
    F --> G["reset()"]
    G --> H[Observation Builder]
    H --> I[Agent]
    I --> J["step(action)"]
    J --> K[Grader]
    K --> L[Reward]
    L --> M[Review Annotation]
    M --> E
    E --> N["state()"]
    N --> O["Pyvis Graph + Markdown Report"]

4.2 Module Descriptions

| Module | File | Purpose | Key Functions | Connects To |
| --- | --- | --- | --- | --- |
| AST Parser + Chunker | parser/ast_parser.py, parser/chunker.py | Parse Python modules, imports, signatures, constants; split large files into chunk nodes. | parse_python_file(path, root_dir) -> ParsedModule; parse_directory(target_dir, db_path=None) -> Store; chunk_module(parsed, max_lines=300) -> ChunkResult | parser/graph_builder.py, db/store.py, db/seed.py |
| Graph Builder | parser/graph_builder.py | Build explicit/implicit/intra-file/circular edges from parsed modules and chunks. | build_edges(parsed_modules, module_ids, chunk_ids_by_parent) -> list[EdgeRecord] | db/seed.py, db/store.py |
| Graph Manager + Token Budget | graph/graph_manager.py, graph/token_budget.py | Load/query graph from DB, compute centrality/traversal order, enforce token caps. | load_graph(refresh=False) -> nx.DiGraph; centrality() -> dict[str, float]; traversal_order() -> list[str]; enforce(payload) -> BudgetResult | env/observation_builder.py, tasks/task_registry.py, graders/hard_grader.py |
| DB Models + Seed | db/models.py, db/schema.py, db/seed.py | Define SQLModel schema; seed graph/nodes/edges/findings/analyzer runs. | seed_project(target_dir, db_path=None, force=False) -> dict[str, object]; Store.upsert_node(...); Store.upsert_edge(...) | Parser, analyzers, environment, report generation |
| Observation Builder | env/observation_builder.py | Build CodeObservation with ranked neighbor context and token accounting. | build(module_id, task_description, available_actions=None, context_request=None) -> CodeObservation | env/environment.py, graph/token_budget.py |
| Action Space | env/action.py | Define validated action envelope and action enum. | ActionType; ReviewAction.validate_required_fields(self) -> ReviewAction | Environment step loop, graders |
| Reward Engine | env/reward.py | Define reward reasons, numeric table, normalization, typed reward records. | normalize_reward(raw_value) -> float; make_reward(reason, feedback, ...) -> ReviewReward | All graders, env/environment.py |
| Easy / Medium / Hard Graders | graders/easy_grader.py, graders/medium_grader.py, graders/hard_grader.py | Score actions against deterministic findings and graph attribution rules. | grade_action(module_id, action, findings, state) -> ReviewReward; grade_episode(...) -> EpisodeGradeSummary | env/environment.py, db/store.py, graph/graph_manager.py |
| Environment (step/reset/state) | env/environment.py | OpenEnv-style runtime orchestrating tasks, grading, persistence, and state snapshots. | reset(...) -> CodeObservation; step(action: ReviewAction) -> StepResult; state() -> GraphState | Tasks, graders, observation builder, store |
| FastAPI Server | server/app.py | HTTP API wrapper exposing health/tasks/reset/step/state/report/training endpoints. | POST /reset; POST /step; GET /state; GET /health; POST /reports/generate | CodeReviewEnv, Store, visualizer/report_generator.py |
| Pyvis Renderer + Report Generator | visualizer/pyvis_renderer.py, visualizer/report_generator.py | Generate JSON/Markdown/HTML artifacts and interactive graph visualization. | render_graph_html(...) -> Path; generate_phase5_outputs(...) -> GeneratedArtifacts | server/app.py, run_project.py |
| inference.py | inference.py | Deterministic training harness: seed, collect analyzer findings, compare agent outputs, persist training runs. | main(); _extract_agent_findings(store, config) -> set[str] | db/seed.py, training/run_manager.py, training/weights.py |

4.3 Database Schema

db/models.py re-exports SQLModel tables defined in db/schema.py.

| Table | Columns (name: type) |
| --- | --- |
| ModuleNode | id:int?, source_root:str, module_id:str, name:str?, raw_code:str, ast_summary:str, summary:str?, linter_flags:str, parent_module_id:str?, is_chunk:bool, dependency_reason:str, review_annotation:str?, review_status:ReviewStatus, review_summary:str?, created_at:datetime, updated_at:datetime |
| ModuleEdge | id:int?, source_root:str, source_module_id:str, target_module_id:str, edge_type:EdgeType, import_line:str, weight:float, connection_summary:str |
| LinterFinding | id:int?, source_root:str, module_id:str, tool:str, line:int, severity:Severity, code:str, message:str |
| ReviewAnnotation | id:int?, source_root:str, module_id:str, episode_id:str, task_id:str?, step_number:int, action_type:str, note:str, reward_given:float, attributed_to:str?, is_amendment:bool, created_at:datetime |
| EpisodeRecord | id:int?, source_root:str, episode_id:str, task_id:str, module_id:str, total_steps:int, cumulative_reward:float, created_at:datetime |
| TaskDefinition | id:int?, source_root:str, task_id:str, task_level:str, target_module_id:str, description:str, ground_truth_ref:str |
| SeedMeta | key:str (PK), value:str |
| AnalyzerRun | id:int?, source_root:str, analyzer:str, analyzer_version:str, status:AnalyzerStatus, findings_count:int, command:str, command_hash:str, error_message:str?, started_at:datetime, finished_at:datetime |
| AnalyzerFinding | id:int?, source_root:str, analyzer_run_id:int, analyzer:str, module_id:str, line:int, severity:Severity, rule_id:str, message:str, evidence:str, created_at:datetime |
| TrainingRun | id:int?, source_root:str, run_id:str, model_name:str, model_sha256:str, deterministic_findings:int, agent_findings:int, true_positives:int, false_positives:int, false_negatives:int, precision:float, recall:float, passed_non_regression:bool, output_path:str, run_config_json:str, created_at:datetime |

4.4 Graph Construction

Chunking: modules with <=300 lines stay whole; modules with >300 lines are split into chunk nodes for each top-level ClassDef, FunctionDef, or AsyncFunctionDef (parser/chunker.py).
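As a rough illustration of the chunking rule, the sketch below uses only the standard-library ast module. It is not the code in parser/chunker.py; the function name and the returned tuple shape are assumptions made for this example.

```python
import ast

MAX_LINES = 300  # threshold stated in the chunking rule


def top_level_chunks(source: str, max_lines: int = MAX_LINES) -> list:
    """Return (name, start_line, end_line) for each top-level class/function.

    Returns an empty list when the module is small enough to stay whole.
    """
    if len(source.splitlines()) <= max_lines:
        return []
    chunk_types = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    tree = ast.parse(source)
    # Only direct children of the module body count as top-level.
    return [(node.name, node.lineno, node.end_lineno)
            for node in tree.body if isinstance(node, chunk_types)]
```

Nested functions and methods stay inside their enclosing chunk because only tree.body is scanned.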

Edge types:

  • explicit_import: module-level imports
  • implicit_dependency: function-level imports
  • intra_file: parent→chunk containment and function-call intra-file links
  • circular: added when reciprocal directed edges are detected
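The circular edge type depends on spotting reciprocal directed edges. A minimal sketch of that check follows, assuming edges are available as (source, target) pairs; the function name is hypothetical and the real logic lives in parser/graph_builder.py.

```python
def find_reciprocal_pairs(edges: set) -> set:
    """Return each A<->B pair once, as an alphabetically sorted tuple."""
    return {tuple(sorted((src, dst)))
            for src, dst in edges
            if src != dst and (dst, src) in edges}
```

This only catches two-module cycles, which matches the "reciprocal directed edges" wording above; longer cycles would need a separate pass.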

Scope tagging: import scope is tagged as module_level or function_level during AST traversal in parser/ast_parser.py and propagated into edge records.
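Scope tagging of this kind can be approximated with a short AST walk. The sketch below is an assumption about how such tagging could work, not a copy of parser/ast_parser.py: it treats any import nested inside a function body as function_level and everything else as module_level.

```python
import ast


def tag_import_scopes(source: str) -> list:
    """Return (imported_module, scope) pairs in source order."""
    results = []

    def visit(node: ast.AST, in_function: bool) -> None:
        for child in ast.iter_child_nodes(node):
            scope = "function_level" if in_function else "module_level"
            if isinstance(child, ast.Import):
                results.extend((alias.name, scope) for alias in child.names)
            elif isinstance(child, ast.ImportFrom):
                # Relative imports such as "from . import x" have no module name.
                results.append((child.module or ".", scope))
            else:
                is_func = isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef))
                visit(child, in_function or is_func)

    visit(ast.parse(source), in_function=False)
    return results
```

With this tagging, a module-level import would become an explicit_import edge and a function-level import an implicit_dependency edge, following the edge types listed above.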

Traversal order: graph/graph_manager.py computes a leaf-first order using reversed nx.lexicographical_topological_sort for DAGs, with betweenness centrality used as a deterministic tie-break key; cyclic graphs fall back to sorting by out-degree then centrality.
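To make the DAG case concrete, here is a standard-library stand-in for the reversed lexicographical topological sort. It is a sketch, not the graph_manager.py implementation, and it assumes edges point from the importing module to the imported one and that centrality scores are passed in as a plain dict.

```python
import heapq


def leaf_first_order(edges: list, centrality: dict) -> list:
    """Leaf-first order for a DAG; ties broken by (centrality, name)."""
    nodes = sorted({n for edge in edges for n in edge} | set(centrality))
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    # Kahn's algorithm with a heap so the choice among ready nodes is deterministic.
    ready = [(centrality.get(n, 0.0), n) for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, node = heapq.heappop(ready)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (centrality.get(nxt, 0.0), nxt))
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle; a fallback ordering is needed")
    return order[::-1]  # reverse so modules with no dependencies come first
```

Reviewing leaves first means that by the time the agent reaches a module, its dependencies may already carry review annotations.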

5. Grader Architecture

5.1 Ground Truth Generation

At seed time, findings are generated deterministically and stored in SQLite:

  • File-local linters (parser/linter.py): pylint, bandit, pyflakes
  • Analyzer pipeline (analyzers/pipeline.py): pylint, pyflakes, bandit, mypy, pyright, vulture, optional semgrep

ModuleNode.linter_flags stores serialized module findings; normalized analyzer findings are persisted in AnalyzerRun and AnalyzerFinding. Agent observations do not expose these ground-truth rows directly. Graders score actions against stored findings and graph structure.

5.2 Easy Grader

  • Checks: exact category match for FLAG_STYLE, FLAG_BUG, FLAG_SECURITY against findings, with LINE_TOLERANCE = ±3.
  • Determinism: 100% deterministic, zero LLM calls.
  • Expected baseline score range: approximately 0.5 to 2.0 raw for single-module easy episodes, depending on finding count and terminal action.
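The line-tolerance check can be pictured as follows. This is an illustrative helper with a hypothetical name; the tolerance value and the "no line provided" allowance come from the grader description and reward table.

```python
from typing import Optional

LINE_TOLERANCE = 3  # from the Easy grader description


def line_matches(target_line: Optional[int], finding_line: int,
                 tolerance: int = LINE_TOLERANCE) -> bool:
    """True when a flag's line is close enough to a stored finding."""
    if target_line is None:
        # The reward table accepts a flag that carries no line number.
        return True
    return abs(target_line - finding_line) <= tolerance
```

For a finding on line 10, flags on lines 7 through 13 would match and a flag on line 14 would not.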

5.3 Medium Grader

  • Adds over Easy: ADD_COMMENT keyword overlap scoring via Jaccard similarity (KEYWORD_MIN_JACCARD = 0.3) against finding rule_id/message, plus amendment handling (AMEND_REVIEW).
  • Determinism: still zero LLM calls.
  • Expected baseline score range: approximately 2.0 to 7.0 raw across multi-module medium episodes.
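For readers unfamiliar with Jaccard similarity, a minimal sketch of keyword-overlap scoring is shown here. The tokenizer is an assumption (lowercased alphanumeric words); the grader's actual tokenization may differ.

```python
import re

KEYWORD_MIN_JACCARD = 0.3  # threshold from the Medium grader description


def tokenize(text: str) -> set:
    """Split text into a set of lowercase word tokens (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(comment: str, finding_text: str) -> float:
    """Size of the token intersection divided by the size of the union."""
    a, b = tokenize(comment), tokenize(finding_text)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A comment "unused import" against a finding message "unused variable" shares one token out of three distinct ones, giving 1/3, which clears the 0.3 threshold.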

5.4 Hard Grader

  • Stage 1 (implemented): deterministic graph consistency + hard-finding matching.
  • Verifies FLAG_DEPENDENCY_ISSUE has valid attributed_to module and a graph edge in either direction.
  • If a deterministic hard-stage finding matches, returns CORRECT_DEPENDENCY_ATTRIBUTION; if edge valid but no finding match, returns partial credit.
  • Stage 2 (not implemented in current grader): LLM-as-judge.

⚠️ graders/hard_grader.py is currently deterministic-only. It does not call an LLM judge, does not use a prompt hash, and does not implement a 0.0/0.5/1.0 rubric path.

  • Expected baseline score range: approximately 3.0 to 8.0 raw for scripted deterministic hard episodes.
  • Scope: hard grader rewards dependency attribution correctness over graph-grounded cascade reasoning; it is not pure bug-finding recall optimization.
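The Stage 1 decision path can be sketched as below. The representation of edges and hard findings as (module, attributed_to) pairs is an assumption made for illustration, and the function name is hypothetical; the three raw values are the FLAG_DEPENDENCY_ISSUE rewards listed in the reward table.

```python
from typing import Optional


def grade_dependency_flag(module_id: str, attributed_to: Optional[str],
                          edges: set, hard_findings: set) -> float:
    """Return the raw reward for a FLAG_DEPENDENCY_ISSUE action (illustrative)."""
    # An edge in either direction between the two modules counts as valid.
    has_edge = ((module_id, attributed_to) in edges
                or (attributed_to, module_id) in edges)
    if not attributed_to or not has_edge:
        return 0.1   # missing/invalid attribution or no edge
    if (module_id, attributed_to) in hard_findings:
        return 0.6   # valid edge and a matched hard finding
    return 0.35      # valid edge, but no matched hard finding
```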

5.5 Reward Table

Values from env/reward.py (RAW_REWARD_TABLE):

| Action | Condition | Reward |
| --- | --- | --- |
| FLAG_STYLE / FLAG_BUG / FLAG_SECURITY | Correct finding category + unmatched finding + line within tolerance (or no line provided) | +0.5 |
| ADD_COMMENT | Medium/hard grader comment aligns to finding (Jaccard >= 0.3) | +0.3 |
| FLAG_DEPENDENCY_ISSUE | Valid edge + matched hard finding | +0.6 |
| FLAG_DEPENDENCY_ISSUE | Valid edge but no matched hard finding | +0.35 |
| FLAG_DEPENDENCY_ISSUE | Missing/invalid attribution or no edge | +0.1 |
| AMEND_REVIEW | Non-empty amendment content accepted by medium/hard | +0.4 |
| REQUEST_CONTEXT | Any context request | -0.1 |
| FLAG_* / ADD_COMMENT / AMEND_REVIEW | False positive / invalid | -0.2 |
| APPROVE | Module has critical findings (severity=high) | -1.0 |
| REQUEST_CHANGES | Module had no findings | -0.3 |
| APPROVE or REQUEST_CHANGES | Valid episode/module completion decision | +0.2 |
| Any no-impact action | No grading impact path | 0.0 |
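To see how these values add up over an episode, here is a small worked example. The dictionary keys are illustrative labels, not the reason names used in env/reward.py, and the trajectory is invented for the arithmetic only.

```python
# Raw values copied from the reward table; the keys are hypothetical labels.
RAW_REWARDS = {
    "correct_flag": 0.5,
    "aligned_comment": 0.3,
    "request_context": -0.1,
    "false_positive": -0.2,
    "valid_terminal": 0.2,
    "approve_with_critical": -1.0,
}


def episode_return(reasons: list) -> float:
    """Sum raw rewards over a trajectory, rounded to avoid float noise."""
    return round(sum(RAW_REWARDS[r] for r in reasons), 2)


# One context request, one correct flag, one aligned comment, then REQUEST_CHANGES.
trajectory = ["request_context", "correct_flag", "aligned_comment", "valid_terminal"]
total = episode_return(trajectory)  # -0.1 + 0.5 + 0.3 + 0.2
```

The same trajectory ending in a wrongful APPROVE instead would lose 1.0 rather than gain 0.2, which is why premature approval appears among the task failure modes.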

6. Tasks

| Task ID | Difficulty | Input | Objective | Grader | Expected Baseline Score* | Failure Modes |
| --- | --- | --- | --- | --- | --- | --- |
| style_review | easy | Default module list ['cart'] (from tasks/easy_task.py) | Correctly flag lint/style/bug/security findings on focused module and terminate with REQUEST_CHANGES/APPROVE | easy | 0.8 | Wrong flag category, line mismatch beyond tolerance, premature APPROVE |
| logic_review | medium | Default ['checkout', 'auth'] expanded to direct neighbors via graph (expand_to_dependencies=True) | Use dependency context and accurate comments/amendments while flagging deterministic issues | medium | 6.5 | Low-overlap comments (Jaccard < 0.3), over-flagging, no final terminal action |
| cascade_review | hard | Default ['checkout', 'auth', 'config'] expanded to neighbors | Correctly attribute dependency issues to connected modules (attributed_to) and match deterministic hard findings | hard | 6.7 | Invalid attributed_to, no graph edge, unsupported attribution claims |

* Baselines above were measured on sample_project using a deterministic scripted policy against current environment code.

7. Observation & Action Space

7.1 Observation Space

Full model definition from env/observation.py:

from typing import Literal
from pydantic import BaseModel, ConfigDict, Field, field_validator
from graph.token_budget import MAX_TOTAL_TOKENS

class NeighborSummary(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")
    module_id: str
    relation: Literal["dependency", "dependent"]
    summary: str
    review_snippet: str | None = None

class RequestedContext(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")
    module_id: str
    code: str
    was_truncated: bool

class CodeObservation(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")
    module_id: str
    code: str
    module_summary: str = ""
    ast_summary: dict[str, object]
    dependency_summaries: list[NeighborSummary] = Field(default_factory=list)
    dependent_summaries: list[NeighborSummary] = Field(default_factory=list)
    neighbor_reviews: list[str] = Field(default_factory=list)
    task_description: str
    available_actions: list[str] = Field(default_factory=list)
    requested_context: RequestedContext | None = None
    token_usage: dict[str, int]
    total_tokens: int
    within_budget: bool

Field semantics:

  • module_id, code, task_description: current review target + textual objective.
  • ast_summary: parsed summary payload from stored AST summary text.
  • dependency_summaries, dependent_summaries: ranked neighbor context.
  • neighbor_reviews: short prior review snippets from neighboring modules.
  • available_actions: action names available this step.
  • requested_context: optional code payload from REQUEST_CONTEXT.
  • token_usage, total_tokens, within_budget: strict budget accounting.

Token budget from graph/token_budget.py:

| Component | Max Tokens | Source |
| --- | --- | --- |
| current_code | 800 | Current module code |
| ast_summary | 100 | Stored AST summary |
| direct_deps | 250 | Serialized dependency summaries |
| dependents | 150 | Serialized dependent summaries |
| neighbor_reviews | 120 | Prior neighbor review snippets |
| task_and_actions | 200 | Task description + action list |
| requested_context | 800 | Optional requested module code |
| Total hard cap | 2000 | MAX_TOTAL_TOKENS |
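The per-component caps sum to more than the total cap, so the two limits interact. The sketch below only does the accounting; how graph/token_budget.py actually trims content when the total is exceeded is not described here, and the function name is hypothetical. The returned keys mirror the token_usage, total_tokens, and within_budget fields of CodeObservation.

```python
MAX_TOTAL_TOKENS = 2000
COMPONENT_CAPS = {
    "current_code": 800,
    "ast_summary": 100,
    "direct_deps": 250,
    "dependents": 150,
    "neighbor_reviews": 120,
    "task_and_actions": 200,
    "requested_context": 800,
}


def budget_report(usage: dict) -> dict:
    """Clamp each component to its cap, then check the overall cap."""
    capped = {name: min(tokens, COMPONENT_CAPS[name]) for name, tokens in usage.items()}
    total = sum(capped.values())
    return {"token_usage": capped,
            "total_tokens": total,
            "within_budget": total <= MAX_TOTAL_TOKENS}
```

With every component at its cap the total is 2420 tokens, above the 2000 limit; without requested_context it is 1620. The total cap therefore only binds on steps that include requested context.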

7.2 Action Space

Full model definition from env/action.py:

from enum import StrEnum
from pydantic import BaseModel, ConfigDict, model_validator

class ActionType(StrEnum):
    FLAG_STYLE = "FLAG_STYLE"
    FLAG_BUG = "FLAG_BUG"
    FLAG_SECURITY = "FLAG_SECURITY"
    FLAG_DEPENDENCY_ISSUE = "FLAG_DEPENDENCY_ISSUE"
    ADD_COMMENT = "ADD_COMMENT"
    REQUEST_CONTEXT = "REQUEST_CONTEXT"
    REQUEST_CHANGES = "REQUEST_CHANGES"
    APPROVE = "APPROVE"
    AMEND_REVIEW = "AMEND_REVIEW"

class ReviewAction(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")
    action_type: ActionType
    target_line: int | None = None
    content: str | None = None
    attributed_to: str | None = None
    context_request: str | None = None

When to use each action:

  • FLAG_STYLE: low-severity style/lint issue candidates (+0.5 on match, -0.2 on miss).
  • FLAG_BUG: medium/high logic/static-error candidates (+0.5 on match, -0.2 on miss).
  • FLAG_SECURITY: security findings (typically bandit) (+0.5 on match).
  • FLAG_DEPENDENCY_ISSUE: hard-stage cascade attribution with attributed_to module (+0.6/+0.35/+0.1).
  • ADD_COMMENT: explanatory comment aligned to findings (+0.3 in medium/hard, else no-op or penalty).
  • REQUEST_CONTEXT: fetch additional module code (-0.1 cost).
  • AMEND_REVIEW: adjust prior interpretation with explicit content (+0.4 when valid).
  • REQUEST_CHANGES: terminal review decision for problematic module (+0.2 or -0.3 on clean module).
  • APPROVE: terminal clean decision (+0.2, but -1.0 if critical issues exist).

8. Output — The Annotated Graph

End-of-run artifacts (the filename prefix is set by the report generator and is configurable):

  • NodeAudit_report.json: machine-readable graph, findings, reviews, metrics.
  • NodeAudit_report.md: module-by-module review narrative and attribution summary.
  • NodeAudit_graph.html: interactive Pyvis dependency graph.

Visualization semantics from visualizer/pyvis_renderer.py and report generator:

  • Node colors: gray pending, yellow/orange in_progress, green approved, red changes_requested
  • Node size: 8.0 + 42.0 * betweenness_centrality
  • Edge colors: blue explicit_import, orange implicit_dependency, red circular, teal intra_file
  • Click behavior: node click posts NodeAudit-node-select message with module id to parent UI for review panel sync.
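The node styling rules translate into a few lines of code. This is a sketch of the stated formula and color mapping only; the names are hypothetical, the color strings are the plain color names from the list above rather than the renderer's actual values, and it assumes betweenness centrality is normalized to the range 0 to 1.

```python
# Status-to-color mapping as described above ("in_progress" is listed as yellow/orange).
STATUS_COLORS = {
    "pending": "gray",
    "in_progress": "orange",
    "approved": "green",
    "changes_requested": "red",
}


def node_size(betweenness_centrality: float) -> float:
    """Size formula from the visualization semantics: 8.0 + 42.0 * centrality."""
    return 8.0 + 42.0 * betweenness_centrality
```

Under that normalization, node sizes range from 8.0 for peripheral modules to 50.0 for a module that sits on every shortest path.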

Why this differs from CodeRabbit: annotations persist in the graph database, attribution links can cross modules, and output is a full codebase dependency map with review state rather than isolated PR comments.

9. Setup & Installation

9.1 Prerequisites

  • Python >=3.11 (from pyproject.toml and Docker base image)
  • pip
  • Docker (for containerized run)
  • Optional CLI tooling installed by dependencies: mypy, pyright, semgrep, vulture, pylint, bandit, pyflakes

9.2 Local Setup

# Step 1: Clone
git clone https://huggingface.co/spaces/YOUR_USERNAME/NodeAudit-env
cd NodeAudit-env

# Step 2: Install dependencies
pip install -r requirements.txt
pip install -e .

# Step 3: Seed the database (parse codebase once)
python -m db.seed sample_project --force

# Step 4: Start server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Step 5: Verify
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" \
  -d '{"task_id": "style_review"}'

9.3 Docker

# Docker image names must be lowercase
docker build -t nodeaudit-env:latest .
docker run --rm -p 8000:7860 \
  -e API_BASE_URL=https://api-inference.huggingface.co/models/Qwen/Qwen2.5-Coder-7B-Instruct/v1 \
  -e MODEL_NAME=Qwen/Qwen2.5-Coder-7B-Instruct \
  -e HF_TOKEN=your_token \
  nodeaudit-env:latest

Verify with the same curl commands above, targeting http://localhost:8000.
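The same /reset call can be made from Python with only the standard library. This sketch assumes the server is reachable at http://localhost:8000 and that the response body is JSON; the helper names are hypothetical, and the request body is the one used in the curl example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local server address


def build_reset_request(task_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /reset request without sending it."""
    body = json.dumps({"task_id": task_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset(task_id: str, base_url: str = BASE_URL) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_reset_request(task_id, base_url), timeout=30) as resp:
        return json.load(resp)
```

Calling reset("style_review") against a running server should return the first observation for that task.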

9.4 OpenEnv Validation

pip install openenv-core
openenv validate

openenv validate checks OpenEnv metadata and endpoint contract compatibility (task metadata and environment API surface).

Observed command result in this environment:

[OK] code-review-env: Ready for multi-mode deployment

If your local environment still resolves openenv to the unrelated openenv==0.1.13 package, run this one-time fix:

pip uninstall -y openenv
pip install -e ../OpenEnv
openenv validate

10. Running Inference

export API_BASE_URL=https://api-inference.huggingface.co/models/Qwen/Qwen2.5-Coder-7B-Instruct/v1
export MODEL_NAME=Qwen/Qwen2.5-Coder-7B-Instruct
export HF_TOKEN=your_token_here

python inference.py sample_project

Real stdout format from current inference.py run:

[START] target=/home/lightdesk/Downloads/Projects/NodeAudit/code-review-env/sample_project model=gemma4:e4b mode=deterministic-ground-truth
[STEP] weights_registered {"model": "gemma4:e4b", "sha256": "724b2b241230aad689f5b43fe8f22d452755bf98677fbcfaa36d2a1b6a89c140", "size_bytes": 6254198752}
[STEP] weights_verified path=/home/lightdesk/Downloads/Projects/NodeAudit/Models/Qwen2.5-Coder-7B-Instruct-Q6_K.gguf
[STEP] seeded {"codebase_hash": "0e4e86a3bf4dbeb45c0ca3d91c1ab6c2a511e21b5253370778975d8c0db78603", "edge_count": 57, "loaded_from_cache": true, "node_count": 60, "seeded": true}
[STEP] agent_llm_disabled reason=completion-failed error=JSONDecodeError module=checkout
[STEP] training_dataset {"false_negatives": 70, "output": "outputs/training/readme_check.jsonl", "precision": 0.5, "recall": 0.014084507042253521, "records": 73}
[STEP] training_run_id=tr-20260408163726-ce86e6ea
[END] {"agent_findings": 2, "deterministic_findings": 73, "model": "gemma4:e4b", "model_weight": "/home/lightdesk/Downloads/Projects/NodeAudit/Models/Qwen2.5-Coder-7B-Instruct-Q6_K.gguf", "ok": true, "precision": 0.5, "recall": 0.014084507042253521, "run_id": "tr-20260408163726-ce86e6ea"}
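The precision and recall figures in this log follow the usual definitions. The counts used below are inferred from the log rather than printed by it: 2 agent findings at precision 0.5 imply 1 true positive and 1 false positive, and the log reports 70 false negatives.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int) -> tuple:
    """Standard precision and recall, returning 0.0 when a denominator is zero."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


precision, recall = precision_recall(true_positives=1, false_positives=1, false_negatives=70)
# precision is 1/2 and recall is 1/71, matching the values in the [END] line
```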

Runtime constraint: expected to complete under 20 minutes on 2 vCPU / 8 GB RAM for the sample project; actual runtime depends on analyzer availability and model endpoint latency.

Baseline score table (sample_project deterministic scripted policy):

| Task | Difficulty | Baseline Score | Max Score |
| --- | --- | --- | --- |
| style_review | easy | 0.8 | Step-limited by task (max_steps=8) and episode cap |
| logic_review | medium | 6.5 | Step-limited by task (max_steps=14) and episode cap |
| cascade_review | hard | 6.7 | Step-limited by task (max_steps=20) and episode cap |

11. Environment Variables

| Variable | Required | Description | Example |
| --- | --- | --- | --- |
| API_BASE_URL | Yes | OpenAI-compatible API endpoint used by runtime config fallback | https://api-inference.huggingface.co/models/Qwen/Qwen2.5-Coder-7B-Instruct/v1 |
| MODEL_NAME | Yes | Model identifier used by runtime config fallback | Qwen/Qwen2.5-Coder-7B-Instruct |
| HF_TOKEN | Yes | Hugging Face API key / bearer token for hosted inference | hf_... |

12. Project Structure

code-review-env/
├── Dockerfile                              # Container build and startup command
├── README.md                               # Project documentation
├── inference.py                            # Deterministic training/inference harness
├── openenv.yaml                            # OpenEnv metadata (tasks, models, routes)
├── pyproject.toml                          # Packaging, dependencies, scripts
├── requirements.txt                        # Runtime/test dependencies
├── run_project.py                          # Unified seed/review/report runner
├── uv.lock                                 # uv lockfile
├── analyzers/
│   ├── __init__.py                         # Package marker
│   └── pipeline.py                         # Deterministic multi-analyzer execution
├── db/
│   ├── __init__.py                         # Package marker
│   ├── database.py                         # Database URL/engine helpers
│   ├── migrations.py                       # SQLModel init/default db path
│   ├── models.py                           # Re-exported DB models
│   ├── schema.py                           # SQLModel table declarations
│   ├── seed.py                             # Parse/chunk/edge/analyzer seed pipeline
│   └── store.py                            # Persistence/query abstraction
├── env/
│   ├── __init__.py                         # Package marker
│   ├── action.py                           # ReviewAction and ActionType
│   ├── env_loader.py                       # .env loading
│   ├── environment.py                      # CodeReviewEnv runtime (reset/step/state)
│   ├── graph.py                            # Graph facade types/helpers
│   ├── models.py                           # Shared env model exports
│   ├── observation.py                      # Observation schema
│   ├── observation_builder.py              # Token-bounded observation construction
│   ├── reward.py                           # Reward reasons and scalar table
│   ├── runtime_config.py                   # Runtime config from env vars
│   └── state.py                            # GraphState/EpisodeState schemas
├── graders/
│   ├── __init__.py                         # Package marker
│   ├── base_grader.py                      # Shared grading flow + persistence
│   ├── easy_grader.py                      # Deterministic flag matcher
│   ├── hard_grader.py                      # Graph attribution grader
│   ├── medium_grader.py                    # Easy+comment/amendment grading
│   └── review_runner.py                    # CLI review and report generator
├── graph/
│   ├── __init__.py                         # Package marker
│   ├── graph_manager.py                    # Graph load, neighbors, centrality, traversal
│   └── token_budget.py                     # Observation token budget enforcement
├── llm/
│   ├── __init__.py                         # Package marker
│   ├── critical_analysis.py                # Training run critique generation
│   ├── edge_summarizer.py                  # Edge summary model calls
│   ├── hard_issue_finder.py                # Optional hard-stage proposal generator
│   ├── lora_adapter.py                     # Trajectory logging and LoRA hooks
│   └── lora_finetune.py                    # LoRA fine-tuning utilities
├── outputs/
│   ├── NodeAudit_full_graph.html           # Full graph visualization artifact
│   ├── NodeAudit_full_report.json          # Full machine-readable report
│   ├── NodeAudit_full_report.md            # Full markdown summary report
│   ├── openenv_real/
│   │   ├── openenv_real_phase5_graph.html  # OpenEnv real-run graph artifact
│   │   ├── openenv_real_phase5_report.json # OpenEnv real-run JSON report
│   │   └── openenv_real_phase5_report.md   # OpenEnv real-run markdown report
│   ├── sample_project/
│   │   ├── sample_project_phase5_graph.html  # Sample project graph artifact
│   │   ├── sample_project_phase5_report.json # Sample project JSON report
│   │   └── sample_project_phase5_report.md   # Sample project markdown report
│   ├── sample_project_hard_judge/
│   │   ├── sample_project_hard_judge_graph.html  # Hard-judge graph artifact
│   │   ├── sample_project_hard_judge_report.json # Hard-judge JSON report
│   │   └── sample_project_hard_judge_report.md   # Hard-judge markdown report
│   ├── training/
│   │   ├── dataset.latest.jsonl            # Latest generated training dataset
│   │   ├── deterministic_findings.jsonl    # Deterministic findings dataset
│   │   ├── sample_project_live_check.jsonl # Sample project live-check dataset
│   │   └── sample_project_postfix.jsonl    # Sample project postfix dataset
│   └── weights/
│       ├── hf.coQwenQwen2.5-Coder-7B-Instruct-GGUFlatest.manifest.json # Weight manifest
│       └── qwen2.5-coder-7b-instruct-q6_k.manifest.json                  # Weight manifest
├── parser/
│   ├── __init__.py                         # Package marker
│   ├── ast_parser.py                       # AST parse + import extraction
│   ├── chunker.py                          # >300 line chunking logic
│   ├── graph_builder.py                    # Edge construction and cycle marking
│   ├── linter.py                           # Per-file linter wrappers
│   ├── semantic_checks.py                  # Deterministic semantic issue checks
│   └── summarizer.py                       # Lightweight module summarization
├── sample_codebase/
│   ├── auth.py                             # Small demo module
│   ├── cart.py                             # Small demo module
│   ├── checkout.py                         # Small demo module
│   ├── config.py                           # Small demo module
│   ├── ground_truth.json                   # Sample deterministic ground truth
│   └── payments.py                         # Small demo module
├── sample_project/
│   ├── auth.py                             # Session token helper with config dependency
│   ├── cart.py                             # Pricing math with discount bug scenario
│   ├── checkout.py                         # Order flow consuming cart+payments
│   ├── config.py                           # Misconfigured constants for cascade demo
│   ├── database.py                         # DSN construction helper
│   ├── huge_module.py                      # Synthetic large file for chunk tests
│   ├── inventory.py                        # Inventory helper
│   ├── notifications.py                    # SMTP notification helper
│   ├── payments.py                         # Gateway shell call security issue demo
│   ├── utils.py                            # Utility function using inventory
│   └── validators.py                       # Validation helpers with intentional bug
├── sample_project_canonical/
│   ├── api.py                              # Canonical fixture module
│   ├── auth.py                             # Canonical fixture module
│   ├── cart.py                             # Canonical fixture module
│   ├── checkout.py                         # Canonical fixture module
│   ├── config.py                           # Canonical fixture module
│   ├── database.py                         # Canonical fixture module
│   ├── main.py                             # Canonical fixture entrypoint
│   ├── models.py                           # Canonical fixture models
│   ├── payments.py                         # Canonical fixture payments
│   └── utils.py                            # Canonical fixture helpers
├── semgrep_rules/
│   └── none-return-not-checked.yaml        # Custom semgrep rule for cascade checks
├── server/
│   ├── __init__.py                         # Package marker
│   ├── app.py                              # FastAPI app and all API endpoints
│   └── static/
│       ├── index.html                      # Web UI shell
│       ├── css/app.css                     # UI styles
│       └── js/app.js                       # UI behavior
├── tasks/
│   ├── __init__.py                         # Package marker
│   ├── easy_task.py                        # style_review task config
│   ├── hard_task.py                        # cascade_review task config
│   ├── medium_task.py                      # logic_review task config
│   ├── task_registry.py                    # TaskSpec registry and module resolution
│   └── validate_canonical_fixture.py       # Fixture validation utility
├── tests/
│   ├── test_canonical_fixture.py           # Canonical fixture tests
│   ├── test_environment.py                 # Core env behavior tests
│   ├── test_graders.py                     # Easy/medium/hard grader tests
│   ├── test_graph_linking.py               # Graph linking tests
│   ├── test_inference.py                   # Inference placeholder test
│   ├── test_parser.py                      # Parser/chunker tests
│   ├── test_phase2_graph_manager.py        # Phase 2 graph manager tests
│   ├── test_phase2_observation.py          # Phase 2 observation tests
│   ├── test_phase2_token_budget.py         # Token budget tests
│   ├── test_phase4_environment.py          # Phase 4 env tests
│   ├── test_phase4_server.py               # Phase 4 server tests
│   ├── test_phase5_reporting.py            # Phase 5 report tests
│   ├── test_phase5_server_api.py           # Phase 5 server API tests
│   ├── test_phase8_training_api.py         # Phase 8 training API tests
│   └── test_seed.py                        # Seed pipeline tests
├── training/
│   ├── __init__.py                         # Package marker
│   ├── run_manager.py                      # Comparison and dataset utilities
│   └── weights.py                          # Weight verification/manifest manager
└── visualizer/
    ├── __init__.py                         # Package marker
    ├── pyvis_renderer.py                   # HTML network rendering
    └── report_generator.py                 # JSON/Markdown/HTML report assembly

⚠️ There is no server.py at the repository root; the FastAPI entrypoint is server/app.py.

13. Evaluation Criteria Alignment

| Criterion | Weight | How This Environment Satisfies It | Relevant Files |
| --- | --- | --- | --- |
| Real-world utility | 30% | Simulates multi-module code review where root-cause attribution across dependencies affects decisions and outputs actionable graph annotations. | env/environment.py, env/observation_builder.py, graph/graph_manager.py, visualizer/report_generator.py |
| Task & grader quality | 25% | Three difficulty tiers with deterministic task specs and explicit grader logic tied to static findings and graph checks. | tasks/easy_task.py, tasks/medium_task.py, tasks/hard_task.py, graders/easy_grader.py, graders/medium_grader.py, graders/hard_grader.py |
| Environment design | 20% | OpenEnv-style reset/step/state runtime, strict typed action/observation/state models, persistent DB-backed episode state. | openenv.yaml, env/environment.py, env/action.py, env/observation.py, env/state.py, db/store.py |
| Code quality & compliance | 15% | Deterministic seeding pipeline, analyzer normalization, API endpoints, tests, Dockerized serving path. | db/seed.py, analyzers/pipeline.py, server/app.py, Dockerfile, tests/ |
| Creativity & novelty | 10% | Dependency-aware RL review with persistent annotated graph output and cascade attribution scoring, not isolated diff comments. | graph/graph_manager.py, env/observation_builder.py, graders/hard_grader.py, visualizer/pyvis_renderer.py |

14. Pre-Submission Checklist

  • HF Space deploys and returns 200 on health check
  • POST /reset returns valid CodeObservation
  • openenv validate passes
  • docker build && docker run works cleanly
  • inference.py completes without error
  • inference.py produces [START]/[STEP]/[END] logs
  • Baseline scores are reproducible across 3 runs
  • 3 tasks enumerated with graders returning 0.0–1.0
  • API_BASE_URL, MODEL_NAME, HF_TOKEN documented
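Two of the checklist items above (the [START]/[STEP]/[END] logs and the 0.0–1.0 grader range) can be checked mechanically. The following is a minimal sketch, not part of the repository: the helper names `check_log_markers` and `scores_in_range` are hypothetical, and it assumes a single episode per log with each tagged line starting with its marker.

```python
import re

# Assumption: every tagged log line begins with one of the three markers
# named in the checklist. Anything after the marker is ignored.
TAG = re.compile(r"^\[(START|STEP|END)\]")

def check_log_markers(lines):
    """Return True if the tagged lines read START, one or more STEP, then END."""
    tags = [m.group(1) for line in lines if (m := TAG.match(line))]
    if len(tags) < 3:
        return False
    return (
        tags[0] == "START"
        and tags[-1] == "END"
        and all(tag == "STEP" for tag in tags[1:-1])
    )

def scores_in_range(scores):
    """Return True if every grader score lies in the closed interval [0.0, 1.0]."""
    return all(0.0 <= score <= 1.0 for score in scores)
```

If inference.py runs several tasks in one process, the log would need to be split per episode before calling `check_log_markers`.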

15. Comparison with Existing Tools

| Capability | CodeRabbit | Traditional Linters | NodeAudit (This) |
| --- | --- | --- | --- |
| Graph-aware review | Partial (diff-local context) | No | Yes: full dependency graph |
| Cascade attribution | No | No | Yes: explicit attributed_to graph-linked scoring |
| RL-trainable | No | No | Yes: OpenEnv-style reset/step/state + rewards |
| Annotated output | PR comments | CLI findings | Annotated dependency graph + JSON/MD/HTML reports |
| Agent learns over time | No | No | Yes: trajectory logs + reward shaping |

Graph-awareness changes review quality because downstream symptoms can be scored against upstream evidence. Example from the sample project: a defect in auth.py or config.py can cause invalid behavior downstream in checkout.py; a local-only reviewer flags the symptoms in checkout.py but misses the root-cause attribution. NodeAudit exposes upstream and downstream context before action selection and rewards correct attribution.
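The attribution idea can be illustrated as a reachability check over the import graph. This is a minimal sketch and not the environment's grader: the `DEPENDS_ON` edges are assumed for illustration rather than read from the sample project, and `upstream_modules` and `is_valid_attribution` are hypothetical helpers.

```python
from collections import deque

# Hypothetical edges: module -> modules it imports.
# Assumed for illustration; not extracted from the real fixture.
DEPENDS_ON = {
    "checkout.py": ["auth.py", "cart.py"],
    "auth.py": ["config.py"],
    "cart.py": [],
    "config.py": [],
}

def upstream_modules(module, depends_on):
    """Return every module reachable from `module` by following dependency edges."""
    seen = set()
    queue = deque(depends_on.get(module, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against import cycles
        seen.add(current)
        queue.extend(depends_on.get(current, []))
    return seen

def is_valid_attribution(flagged, attributed_to, depends_on):
    """An attribution is plausible only if the blamed module is upstream of the flagged one."""
    return attributed_to in upstream_modules(flagged, depends_on)
```

Under these assumed edges, blaming config.py for a symptom in checkout.py passes the check, while blaming checkout.py for a symptom in config.py does not, because checkout.py is not upstream of config.py.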

16. License

Apache 2.0 — see LICENSE