---
title: Code Review Agent Environment
emoji: 🤖
colorFrom: green
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# Code Review Agent Environment
[![CI](https://github.com/ProbablyItsSpirit/code-review-environment/actions/workflows/ci.yml/badge.svg)](https://github.com/ProbablyItsSpirit/code-review-environment/actions/workflows/ci.yml)
This repository provides an OpenEnv-compatible environment for evaluating AI code-review agents.
## Judge Summary
- OpenEnv validation: pass
- Tests: pass
- Docker build: pass
- Baseline reproduction: pass
- Live Space health/reset: pass
Evidence:
- [submission_report.json](submission_report.json)
- [Benchmark Table](outputs/benchmark_table.md)
- [Space URL](https://spirit-26-code-review-environment.hf.space)
## Why This Environment
Code review is a strong RL task because success and failure are measurable: line-level issues can be deterministically graded, rewards can be shaped across review phases, and tasks can scale from easy to hard while staying realistic.
This project is designed for both evaluation and lightweight policy training loops, not only one-off scripted inference.
The agent receives a code diff and surrounding file context, then performs a multi-step review:
1. Add issue comments with line numbers.
2. Suggest code fixes.
3. Make a final decision (`approved` or `changes_requested`).
The environment scores the review quality using deterministic graders.
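As a sketch of what one phased review might submit, the action payloads could look like this (field names here are illustrative assumptions; the authoritative Pydantic schemas live in `environment/models.py`):

```python
# Illustrative action payloads for one review episode.
# Field names are assumptions; see environment/models.py for the real schemas.
comment_action = {
    "type": "add_comment",
    "line": 42,
    "message": "Possible off-by-one in the loop bound",
}
fix_action = {
    "type": "suggest_fix",
    "line": 42,
    "replacement": "for i in range(len(items)):",
}
decision_action = {
    "type": "decide",
    "decision": "changes_requested",  # or "approved" for no-issue tasks
}
```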
## What This Project Does
- Simulates pull-request review tasks across easy/medium/hard difficulty.
- Exposes OpenEnv-style lifecycle methods (`reset`, `step`, `state`).
- Exposes integration endpoints (`tasks`, `score`, `health`) for tooling and dashboard checks.
- Grades issue detection, fix suggestions, and final decision quality.
- Supports local LLM providers via an OpenAI-compatible API (including Ollama).
- Includes a policy-training scaffold (`train.py`, `train_env.py`) and logged training metrics.
## Project Structure
- `environment/`: environment implementation, task definitions, models, and grading logic.
- `inference.py`: baseline review agent loop.
- `train.py`, `train_env.py`: lightweight PPO-style policy training loop over the environment.
- `ppo_logs/`: training metrics and summaries.
- `openenv.yaml`: task registry and environment metadata.
- `tests/`: environment tests.
- `explore_env.ipynb`: interactive environment walkthrough.
- `docker-compose.yml` / `Dockerfile`: containerized execution options.
## Prerequisites
- Python 3.10+
- macOS/Linux shell or PowerShell equivalent
- Optional: Docker Desktop
- Optional: Ollama for local model inference
## Local Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## Required Environment Variables
The baseline uses OpenAI-compatible endpoints.
- `API_BASE_URL` (required)
- `MODEL_NAME` (required)
- `HF_TOKEN` (preferred auth var)
Supported auth aliases:
- `OPENAI_API_KEY`
- `API_KEY`
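A minimal sketch of how the alias resolution might work, assuming `HF_TOKEN` takes precedence over the aliases (the actual precedence logic lives in `inference.py`):

```python
import os

def resolve_api_key(env=None):
    """Return the first configured auth variable, preferring HF_TOKEN."""
    env = os.environ if env is None else env
    for var in ("HF_TOKEN", "OPENAI_API_KEY", "API_KEY"):
        if env.get(var):
            return env[var]
    return "not-needed"  # local servers such as Ollama ignore the key
```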
## Run Methods
### 1) Run Unit Tests
```bash
source .venv/bin/activate
pytest tests/test_env.py -q
```
### 2) Validate OpenEnv Package
```bash
source .venv/bin/activate
openenv validate
```
### 3) Run Baseline Agent (Single Task)
```bash
source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180
python inference.py \
  --task-id bug_detection_easy_1 \
  --max-steps 10 \
  --output baseline_results.json
```
### 4) Run All Tasks (Local Sweep)
```bash
source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180
for task in \
  bug_detection_easy_1 \
  bug_detection_easy_2 \
  approve_easy_3 \
  memory_leak_medium_1 \
  performance_medium_2 \
  approve_medium_3 \
  type_safety_medium_4 \
  javascript_medium_5 \
  security_hard_1 \
  race_condition_hard_2 \
  approve_hard_3 \
  adversarial_hard_4 \
  concurrency_hard_5 \
  dependency_injection_hard_6
do
  python inference.py --task-id "$task" --max-steps 10 --output "baseline_${task}.json"
done
```
### 5) Docker Build and Run
```bash
docker build -t code-review-env .
docker run --rm \
  -e API_BASE_URL=http://host.docker.internal:11434/v1 \
  -e MODEL_NAME=qwen3.5:latest \
  -e HF_TOKEN=not-needed \
  -e TEMPERATURE=0.0 \
  -e REQUEST_TIMEOUT=180 \
  code-review-env \
  --task-id bug_detection_easy_1
```
### 6) Docker Compose Services
```bash
docker compose run --rm openai-agent
docker compose run --rm gemini-agent
docker compose run --rm groq-agent
docker compose run --rm local-agent
```
Note: on macOS, `network_mode: host` can be unreliable. If `local-agent` cannot reach Ollama, use `host.docker.internal` in the service environment.
## Available Task IDs
- `bug_detection_easy_1`
- `bug_detection_easy_2`
- `approve_easy_3`
- `memory_leak_medium_1`
- `performance_medium_2`
- `approve_medium_3`
- `type_safety_medium_4`
- `javascript_medium_5`
- `security_hard_1`
- `race_condition_hard_2`
- `approve_hard_3`
- `adversarial_hard_4`
- `concurrency_hard_5`
- `dependency_injection_hard_6`
## HTTP Endpoints
- `GET /`
- `GET /health`
- `GET /tasks`
- `GET|POST /reset`
- `POST /step`
- `GET /state`
- `GET /score`
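A client addressing these endpoints against the live Space might shape its requests as follows (the payload field names are assumptions; check `server/app.py` for the real request models):

```python
# Method/path pairs mirroring the endpoint list above.
BASE_URL = "https://spirit-26-code-review-environment.hf.space"

ENDPOINTS = {
    "health": ("GET", f"{BASE_URL}/health"),
    "tasks": ("GET", f"{BASE_URL}/tasks"),
    "reset": ("POST", f"{BASE_URL}/reset"),
    "step": ("POST", f"{BASE_URL}/step"),
    "state": ("GET", f"{BASE_URL}/state"),
    "score": ("GET", f"{BASE_URL}/score"),
}

# Illustrative request bodies for the POST endpoints.
reset_payload = {"task_id": "bug_detection_easy_1"}
step_payload = {"action": {"type": "add_comment", "line": 42, "message": "Possible bug"}}
```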
## Output Format
Each inference run writes JSON like:
```json
{
"task_id": "bug_detection_easy_1",
"total_reward": 0.78,
"task_score": 1.0,
"steps": 3,
"max_steps": 10,
"provider": "openai-client",
"model": "qwen3.5:latest",
"api_base_url": "http://localhost:11434/v1"
}
```
## Notes On Baseline Stability
- Local models can time out on long prompts.
- The baseline now enforces phased review behavior and falls back to deterministic actions when the model is temporarily unavailable.
- For reproducible runs, keep `TEMPERATURE=0.0`.
## Fast Start (3 Commands)
```bash
source .venv/bin/activate
pytest -q
python submit.py --skip-docker --max-steps 10
```
## Judge Map (Criterion -> Evidence)
| Criterion | Evidence | File |
|---|---|---|
| OpenEnv lifecycle compliance | reset/step/state implemented and served over HTTP | `environment/env.py`, `server/app.py` |
| Typed models | Pydantic action/state/observation models | `environment/models.py` |
| Task difficulty progression | easy/medium/hard tasks + calibration approve tasks | `environment/tasks.py` |
| Grading quality | detection/suggestion/decision + partial credit + FP penalty + efficiency bonus | `environment/graders.py` |
| Baseline reproducibility | deterministic seed support in reset + inference output metadata | `environment/env.py`, `inference.py` |
| Submission validation | Python preflight + bash validator script | `submit.py`, `scripts/validate-submission.sh` |
## Grader Rubric (Summary)
| Component | Weight / Effect | Notes |
|---|---|---|
| Detection score | 0.4 | Partial credit for near-line matches |
| Suggestion score | 0.3 | Line-proximity matching for fixes |
| Decision score | 0.3 | Approve for no-issue tasks, request_changes otherwise |
| False positive penalty | up to -0.4 | Strong penalty for issue spam |
| Efficiency bonus | up to +0.1 | Bonus for completing in fewer steps |
| Final score clamp | [0,1] | Safety clamp in grader |
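Recomposed as arithmetic, the rubric behaves roughly like the sketch below. The authoritative implementation is `environment/graders.py`; this function merely mirrors the weights and clamp from the table:

```python
def final_score(detection, suggestion, decision, fp_penalty=0.0, efficiency_bonus=0.0):
    """Weighted rubric sketch: components in [0, 1], fp_penalty up to 0.4,
    efficiency_bonus up to 0.1, result clamped to [0, 1]."""
    raw = 0.4 * detection + 0.3 * suggestion + 0.3 * decision
    raw = raw - fp_penalty + efficiency_bonus
    return max(0.0, min(1.0, raw))  # safety clamp
```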
## Benchmark Snapshot (3-Task Local Run)
| Task | Task Score | Total Reward | Model |
|---|---:|---:|---|
| bug_detection_easy_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |
| memory_leak_medium_1 | 0.875 | 1.285 | meta/llama-3.3-70b-instruct |
| security_hard_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |
Note: `task_score` is normalized to [0,1]. `total_reward` is cumulative step reward and can exceed 1.0 by design.
## Training Results (PPO-style Loop)
Run training:
```bash
source .venv/bin/activate
python train.py --episodes 120 --max-steps 5
```
Generated artifacts:
- `ppo_logs/train_metrics.csv`
- `ppo_logs/summary.txt`
Recent run summary:
- Episodes: `120`
- Average reward (first 10): `0.0100`
- Average reward (last 10): `0.5100`
- Improvement: `+0.5000`
This demonstrates measurable policy improvement under the training setup provided in this repository.
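The first-10 / last-10 comparison reported in `ppo_logs/summary.txt` can be reproduced with a few lines (the column layout of `train_metrics.csv` is an assumption; adapt the parsing to the actual header):

```python
def improvement(rewards, window=10):
    """Average reward over the first and last `window` episodes, plus the delta."""
    first = sum(rewards[:window]) / window
    last = sum(rewards[-window:]) / window
    return first, last, last - first
```

For the run above, a flat-then-improving reward curve yields averages like `0.01` and `0.51`, hence the `+0.50` improvement.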
## One-Command Benchmark Table
Generate per-task JSON outputs plus a markdown table for judge submission:
```bash
source .venv/bin/activate
python scripts/run_benchmark.py --max-steps 10
```
Artifacts:
- `outputs/benchmark_<task_id>.json`
- `outputs/benchmark_table.md`
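If you need to rebuild the table from the per-task JSON by hand, a sketch along these lines works (the exact columns `run_benchmark.py` emits may differ):

```python
import json
from pathlib import Path

def benchmark_table(outputs_dir="outputs"):
    """Assemble a markdown table from outputs/benchmark_<task_id>.json files."""
    rows = [
        "| Task | Task Score | Total Reward | Model |",
        "|---|---:|---:|---|",
    ]
    for path in sorted(Path(outputs_dir).glob("benchmark_*.json")):
        r = json.loads(path.read_text())
        rows.append(
            f"| {r['task_id']} | {r['task_score']:.3f} "
            f"| {r['total_reward']:.3f} | {r['model']} |"
        )
    return "\n".join(rows)
```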
## Failure Analysis Template
1. `javascript_medium_5` (Undefined access)
- Observation: task score reached `1.0`, but diagnostics show `precision=0.5`, `recall=1.0`, `f1=0.6667`, `false_positive_count=1`.
- Why: model used Python-centric heuristics and produced one extra issue comment on a JS snippet.
- Action: added JavaScript task category and retained false-positive penalties to expose over-flagging.
2. `memory_leak_medium_1` (historical baseline run)
- Observation: earlier run dropped below perfect score due to noisy comment strategy.
- Why: over-commenting triggered false positive penalties despite finding the core issue.
- Action: anti-loop repeated-comment penalty + adversarial no-issue tasks to discourage spam.
3. `adversarial_hard_4` (Safe SQL task)
- Observation: correct behavior is approve; naive SQL keyword matching causes false alarms.
- Why: keyword-only review policies confuse parameterized SQL with vulnerable string interpolation.
- Action: included explicit no-issue adversarial task in hard set and calibration tests to reward restraint.