---
title: Code Review Agent Environment
emoji: 🤖
colorFrom: green
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# Code Review Agent Environment

[![CI](https://github.com/ProbablyItsSpirit/code-review-environment/actions/workflows/ci.yml/badge.svg)](https://github.com/ProbablyItsSpirit/code-review-environment/actions/workflows/ci.yml)

This repository provides an OpenEnv-compatible environment for evaluating AI code-review agents.

## Judge Summary

- OpenEnv validation: pass
- Tests: pass
- Docker build: pass
- Baseline reproduction: pass
- Live Space health/reset: pass

Evidence:

- [submission_report.json](submission_report.json)
- [Benchmark Table](outputs/benchmark_table.md)
- [Space URL](https://spirit-26-code-review-environment.hf.space)

## Why This Environment

Code review is a strong RL task because success and failure are measurable: line-level issues can be deterministically graded, rewards can be shaped across review phases, and tasks can scale from easy to hard while staying realistic.

The project targets both evaluation and lightweight policy-training loops, not just one-off scripted inference.

The agent receives a code diff and surrounding file context, then performs a multi-step review:

1. Add issue comments with line numbers.
2. Suggest code fixes.
3. Make a final decision (`approved` or `changes_requested`).

The environment scores the review quality using deterministic graders.
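
The three phases above map naturally onto per-step actions. The sketch below is purely illustrative; the field names are hypothetical, and the real typed schema lives in `environment/models.py`:

```python
# Hypothetical sketch of the three review phases as action payloads.
# Field names are illustrative; see environment/models.py for the real schema.
comment_action = {
    "action_type": "add_comment",
    "line": 42,
    "message": "Possible off-by-one error in loop bound.",
}
suggestion_action = {
    "action_type": "suggest_fix",
    "line": 42,
    "suggestion": "Use range(len(items) - 1) for pairwise access.",
}
decision_action = {
    "action_type": "submit_review",
    "decision": "changes_requested",  # or "approved" for no-issue tasks
}
```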

## What This Project Does

- Simulates pull-request review tasks across easy/medium/hard difficulty.
- Exposes OpenEnv-style lifecycle methods (`reset`, `step`, `state`).
- Exposes integration endpoints (`tasks`, `score`, `health`) for tooling and dashboard checks.
- Grades issue detection, fix suggestions, and final decision quality.
- Supports local LLM providers via an OpenAI-compatible API (including Ollama).
- Includes a policy-training scaffold (`train.py`, `train_env.py`) and logged training metrics.

## Project Structure

- `environment/`: environment implementation, task definitions, models, and grading logic.
- `inference.py`: baseline review agent loop.
- `train.py`, `train_env.py`: lightweight PPO-style policy training loop over the environment.
- `ppo_logs/`: training metrics and summaries.
- `openenv.yaml`: task registry and environment metadata.
- `tests/`: environment tests.
- `explore_env.ipynb`: interactive environment walkthrough.
- `docker-compose.yml` / `Dockerfile`: containerized execution options.

## Prerequisites

- Python 3.10+
- macOS/Linux shell or PowerShell equivalent
- Optional: Docker Desktop
- Optional: Ollama for local model inference

## Local Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Required Environment Variables

The baseline uses OpenAI-compatible endpoints.

- `API_BASE_URL` (required)
- `MODEL_NAME` (required)
- `HF_TOKEN` (preferred auth var)

Supported auth aliases:

- `OPENAI_API_KEY`
- `API_KEY`
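
A quick preflight check for these variables can be scripted as below. This is a sketch of the rules stated above, not the project's actual validation (which lives in `submit.py`):

```python
import os

# Required endpoint/model variables, plus accepted auth aliases,
# per the README's environment-variable section.
REQUIRED = ["API_BASE_URL", "MODEL_NAME"]
AUTH_ALIASES = ["HF_TOKEN", "OPENAI_API_KEY", "API_KEY"]

def check_env(env=os.environ):
    """Return a list of missing settings; an empty list means ready to run."""
    missing = [v for v in REQUIRED if not env.get(v)]
    if not any(env.get(v) for v in AUTH_ALIASES):
        missing.append(" or ".join(AUTH_ALIASES))
    return missing
```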

## Run Methods

### 1) Run Unit Tests

```bash
source .venv/bin/activate
pytest tests/test_env.py -q
```

### 2) Validate OpenEnv Package

```bash
source .venv/bin/activate
openenv validate
```

### 3) Run Baseline Agent (Single Task)

```bash
source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180

python inference.py \
	--task-id bug_detection_easy_1 \
	--max-steps 10 \
	--output baseline_results.json
```

### 4) Run All Tasks (Local Sweep)

```bash
source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180

for task in \
	bug_detection_easy_1 \
	bug_detection_easy_2 \
	approve_easy_3 \
	memory_leak_medium_1 \
	performance_medium_2 \
	approve_medium_3 \
	security_hard_1 \
	race_condition_hard_2 \
	approve_hard_3
do
	python inference.py --task-id "$task" --max-steps 10 --output "baseline_${task}.json"
done
```

### 5) Docker Build and Run

```bash
docker build -t code-review-env .

docker run --rm \
	-e API_BASE_URL=http://host.docker.internal:11434/v1 \
	-e MODEL_NAME=qwen3.5:latest \
	-e HF_TOKEN=not-needed \
	-e TEMPERATURE=0.0 \
	-e REQUEST_TIMEOUT=180 \
	code-review-env \
	--task-id bug_detection_easy_1
```

### 6) Docker Compose Services

```bash
docker compose run --rm openai-agent
docker compose run --rm gemini-agent
docker compose run --rm groq-agent
docker compose run --rm local-agent
```

Note: on macOS, `network_mode: host` can be unreliable. If `local-agent` cannot reach Ollama, use `host.docker.internal` in the service environment.

## Available Task IDs

- `bug_detection_easy_1`
- `bug_detection_easy_2`
- `approve_easy_3`
- `memory_leak_medium_1`
- `performance_medium_2`
- `approve_medium_3`
- `type_safety_medium_4`
- `javascript_medium_5`
- `security_hard_1`
- `race_condition_hard_2`
- `approve_hard_3`
- `adversarial_hard_4`
- `concurrency_hard_5`
- `dependency_injection_hard_6`

## HTTP Endpoints

- `GET /`
- `GET /health`
- `GET /tasks`
- `GET|POST /reset`
- `POST /step`
- `GET /state`
- `GET /score`
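
Against a running instance (local container or the Space), the lifecycle can be exercised with plain `curl`. The request body for `/reset` is illustrative; see `environment/models.py` for the exact schema:

```bash
# Illustrative lifecycle walkthrough; "|| true" keeps the script going
# if the server is not reachable.
BASE="${BASE:-http://localhost:7860}"

curl -s "$BASE/health" || true
curl -s -X POST "$BASE/reset" \
	-H 'Content-Type: application/json' \
	-d '{"task_id": "bug_detection_easy_1"}' || true
curl -s "$BASE/state" || true
```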

## Output Format

Each inference run writes JSON like:

```json
{
	"task_id": "bug_detection_easy_1",
	"total_reward": 0.78,
	"task_score": 1.0,
	"steps": 3,
	"max_steps": 10,
	"provider": "openai-client",
	"model": "qwen3.5:latest",
	"api_base_url": "http://localhost:11434/v1"
}
```
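
Given several such files from the local sweep, a short script can summarize them. A minimal sketch, assuming the JSON shape shown above:

```python
import json
from glob import glob

def summarize(runs):
    """Average task_score and total_reward over a list of run dicts."""
    if not runs:
        return {}
    n = len(runs)
    return {
        "tasks": n,
        "mean_task_score": sum(r["task_score"] for r in runs) / n,
        "mean_total_reward": sum(r["total_reward"] for r in runs) / n,
    }

# Load every baseline_*.json produced by the sweep above:
runs = [json.load(open(p)) for p in glob("baseline_*.json")]
print(summarize(runs))
```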

## Notes On Baseline Stability

- Local models can time out on long prompts.
- The baseline now enforces phased review behavior and falls back to deterministic actions when the model is temporarily unavailable.
- For reproducible runs, keep `TEMPERATURE=0.0`.

## Fast Start (3 Commands)

```bash
source .venv/bin/activate
pytest -q
python submit.py --skip-docker --max-steps 10
```

## Judge Map (Criterion -> Evidence)

| Criterion | Evidence | File |
|---|---|---|
| OpenEnv lifecycle compliance | reset/step/state implemented and served over HTTP | `environment/env.py`, `server/app.py` |
| Typed models | Pydantic action/state/observation models | `environment/models.py` |
| Task difficulty progression | easy/medium/hard tasks + calibration approve tasks | `environment/tasks.py` |
| Grading quality | detection/suggestion/decision + partial credit + FP penalty + efficiency bonus | `environment/graders.py` |
| Baseline reproducibility | deterministic seed support in reset + inference output metadata | `environment/env.py`, `inference.py` |
| Submission validation | Python preflight + bash validator script | `submit.py`, `scripts/validate-submission.sh` |

## Grader Rubric (Summary)

| Component | Weight / Effect | Notes |
|---|---|---|
| Detection score | 0.4 | Partial credit for near-line matches |
| Suggestion score | 0.3 | Line-proximity matching for fixes |
| Decision score | 0.3 | Approve for no-issue tasks, request_changes otherwise |
| False positive penalty | up to -0.4 | Strong penalty for issue spam |
| Efficiency bonus | up to +0.1 | Bonus for completing in fewer steps |
| Final score clamp | [0,1] | Safety clamp in grader |
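
The rubric can be read as roughly the following combination. This is an illustrative sketch of the weights in the table, not the grader's actual code (see `environment/graders.py`):

```python
def combined_score(detection, suggestion, decision,
                   fp_penalty=0.0, efficiency_bonus=0.0):
    """Illustrative weighting per the rubric above; not the real grader.

    detection/suggestion/decision are each in [0, 1];
    fp_penalty is in [0, 0.4]; efficiency_bonus is in [0, 0.1].
    """
    raw = 0.4 * detection + 0.3 * suggestion + 0.3 * decision
    raw -= fp_penalty
    raw += efficiency_bonus
    return max(0.0, min(1.0, raw))  # final safety clamp to [0, 1]
```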

## Benchmark Snapshot (3-Task Local Run)

| Task | Task Score | Total Reward | Model |
|---|---:|---:|---|
| bug_detection_easy_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |
| memory_leak_medium_1 | 0.875 | 1.285 | meta/llama-3.3-70b-instruct |
| security_hard_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |

Note: `task_score` is normalized to [0,1]. `total_reward` is cumulative step reward and can exceed 1.0 by design.

## Training Results (PPO-style Loop)

Run training:

```bash
source .venv/bin/activate
python train.py --episodes 120 --max-steps 5
```

Generated artifacts:

- `ppo_logs/train_metrics.csv`
- `ppo_logs/summary.txt`

Recent run summary:

- Episodes: `120`
- Average reward (first 10): `0.0100`
- Average reward (last 10): `0.5100`
- Improvement: `+0.5000`

This demonstrates measurable policy improvement under the training setup provided in this repository.
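
The improvement figure can be recomputed from the logged metrics with a short script. This sketch assumes `ppo_logs/train_metrics.csv` has a per-episode `reward` column; adjust the column name to match the actual log:

```python
import csv

def reward_improvement(path, window=10, column="reward"):
    """Average reward over the first vs. last `window` episodes of a CSV log."""
    with open(path, newline="") as f:
        rewards = [float(row[column]) for row in csv.DictReader(f)]
    first = sum(rewards[:window]) / window
    last = sum(rewards[-window:]) / window
    return first, last, last - first
```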

## One-Command Benchmark Table

Generate per-task JSON outputs plus a markdown table for judge submission:

```bash
source .venv/bin/activate
python scripts/run_benchmark.py --max-steps 10
```

Artifacts:

- `outputs/benchmark_<task_id>.json`
- `outputs/benchmark_table.md`

## Failure Analysis Template

1. `javascript_medium_5` (Undefined access)
- Observation: task score reached `1.0`, but diagnostics show `precision=0.5`, `recall=1.0`, `f1=0.6667`, `false_positive_count=1`.
- Why: model used Python-centric heuristics and produced one extra issue comment on a JS snippet.
- Action: added JavaScript task category and retained false-positive penalties to expose over-flagging.

2. `memory_leak_medium_1` (historical baseline run)
- Observation: earlier run dropped below perfect score due to noisy comment strategy.
- Why: over-commenting triggered false positive penalties despite finding the core issue.
- Action: anti-loop repeated-comment penalty + adversarial no-issue tasks to discourage spam.

3. `adversarial_hard_4` (Safe SQL task)
- Observation: correct behavior is approve; naive SQL keyword matching causes false alarms.
- Why: keyword-only review policies confuse parameterized SQL with vulnerable string interpolation.
- Action: included explicit no-issue adversarial task in hard set and calibration tests to reward restraint.