---
title: Code Review Agent Environment
emoji: 🤖
colorFrom: green
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# Code Review Agent Environment

[![CI](https://github.com/ProbablyItsSpirit/code-review-environment/actions/workflows/ci.yml/badge.svg)](https://github.com/ProbablyItsSpirit/code-review-environment/actions/workflows/ci.yml)

This repository provides an OpenEnv-compatible environment for evaluating AI code-review agents.

## Judge Summary

- OpenEnv validation: pass
- Tests: pass
- Docker build: pass
- Baseline reproduction: pass
- Live Space health/reset: pass

Evidence:

- [submission_report.json](submission_report.json)
- [Benchmark Table](outputs/benchmark_table.md)
- [Space URL](https://spirit-26-code-review-environment.hf.space)

## Why This Environment

Code review is a strong RL task because success and failure are measurable: line-level issues can be deterministically graded, rewards can be shaped across review phases, and tasks can scale from easy to hard while staying realistic.

The project targets both evaluation and lightweight policy-training loops, not just one-off scripted inference.

The agent receives a code diff and surrounding file context, then performs a multi-step review:

1. Add issue comments with line numbers.
2. Suggest code fixes.
3. Make a final decision (`approved` or `changes_requested`).

The environment scores the review quality using deterministic graders.
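
The three phases above map naturally onto per-step actions. The sketch below is purely illustrative; the field names are hypothetical, and the real typed schema lives in `environment/models.py`:

```python
# Hypothetical sketch of the three review phases as action payloads.
# Field names are illustrative; see environment/models.py for the real schema.
comment_action = {
    "action_type": "add_comment",
    "line": 42,
    "message": "Possible off-by-one error in loop bound.",
}
suggestion_action = {
    "action_type": "suggest_fix",
    "line": 42,
    "suggestion": "Use range(len(items) - 1) for pairwise access.",
}
decision_action = {
    "action_type": "submit_review",
    "decision": "changes_requested",  # or "approved" for no-issue tasks
}
```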

## What This Project Does

- Simulates pull-request review tasks across easy/medium/hard difficulty.
- Exposes OpenEnv-style lifecycle methods (`reset`, `step`, `state`).
- Exposes integration endpoints (`tasks`, `score`, `health`) for tooling and dashboard checks.
- Grades issue detection, fix suggestions, and final decision quality.
- Supports local LLM providers via an OpenAI-compatible API (including Ollama).
- Includes a policy-training scaffold (`train.py`, `train_env.py`) and logged training metrics.

## Project Structure

- `environment/`: environment implementation, task definitions, models, and grading logic.
- `inference.py`: baseline review agent loop.
- `train.py`, `train_env.py`: lightweight PPO-style policy training loop over the environment.
- `ppo_logs/`: training metrics and summaries.
- `openenv.yaml`: task registry and environment metadata.
- `tests/`: environment tests.
- `explore_env.ipynb`: interactive environment walkthrough.
- `docker-compose.yml` / `Dockerfile`: containerized execution options.

## Prerequisites

- Python 3.10+
- macOS/Linux shell or PowerShell equivalent
- Optional: Docker Desktop
- Optional: Ollama for local model inference

## Local Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Required Environment Variables

The baseline uses OpenAI-compatible endpoints.

- `API_BASE_URL` (required)
- `MODEL_NAME` (required)
- `HF_TOKEN` (preferred auth var)

Supported auth aliases:

- `OPENAI_API_KEY`
- `API_KEY`
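
A quick preflight check for these variables can be scripted as below. This is a sketch of the rules stated above, not the project's actual validation (which lives in `submit.py`):

```python
import os

# Required endpoint/model variables, plus accepted auth aliases,
# per the README's environment-variable section.
REQUIRED = ["API_BASE_URL", "MODEL_NAME"]
AUTH_ALIASES = ["HF_TOKEN", "OPENAI_API_KEY", "API_KEY"]

def check_env(env=os.environ):
    """Return a list of missing settings; an empty list means ready to run."""
    missing = [v for v in REQUIRED if not env.get(v)]
    if not any(env.get(v) for v in AUTH_ALIASES):
        missing.append(" or ".join(AUTH_ALIASES))
    return missing
```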

## Run Methods

### 1) Run Unit Tests

```bash
source .venv/bin/activate
pytest tests/test_env.py -q
```

### 2) Validate OpenEnv Package

```bash
source .venv/bin/activate
openenv validate
```

### 3) Run Baseline Agent (Single Task)

```bash
source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180

python inference.py \
	--task-id bug_detection_easy_1 \
	--max-steps 10 \
	--output baseline_results.json
```

### 4) Run All Tasks (Local Sweep)

```bash
source .venv/bin/activate
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen3.5:latest
export HF_TOKEN=not-needed
export TEMPERATURE=0.0
export REQUEST_TIMEOUT=180

for task in \
	bug_detection_easy_1 \
	bug_detection_easy_2 \
	approve_easy_3 \
	memory_leak_medium_1 \
	performance_medium_2 \
	approve_medium_3 \
	security_hard_1 \
	race_condition_hard_2 \
	approve_hard_3
do
	python inference.py --task-id "$task" --max-steps 10 --output "baseline_${task}.json"
done
```

### 5) Docker Build and Run

```bash
docker build -t code-review-env .

docker run --rm \
	-e API_BASE_URL=http://host.docker.internal:11434/v1 \
	-e MODEL_NAME=qwen3.5:latest \
	-e HF_TOKEN=not-needed \
	-e TEMPERATURE=0.0 \
	-e REQUEST_TIMEOUT=180 \
	code-review-env \
	--task-id bug_detection_easy_1
```

### 6) Docker Compose Services

```bash
docker compose run --rm openai-agent
docker compose run --rm gemini-agent
docker compose run --rm groq-agent
docker compose run --rm local-agent
```

Note: on macOS, `network_mode: host` can be unreliable. If `local-agent` cannot reach Ollama, use `host.docker.internal` in the service environment.

## Available Task IDs

- `bug_detection_easy_1`
- `bug_detection_easy_2`
- `approve_easy_3`
- `memory_leak_medium_1`
- `performance_medium_2`
- `approve_medium_3`
- `type_safety_medium_4`
- `javascript_medium_5`
- `security_hard_1`
- `race_condition_hard_2`
- `approve_hard_3`
- `adversarial_hard_4`
- `concurrency_hard_5`
- `dependency_injection_hard_6`

## HTTP Endpoints

- `GET /`
- `GET /health`
- `GET /tasks`
- `GET|POST /reset`
- `POST /step`
- `GET /state`
- `GET /score`
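
Against a running instance (local container or the Space), the lifecycle can be exercised with plain `curl`. The request body for `/reset` is illustrative; see `environment/models.py` for the exact schema:

```bash
# Illustrative lifecycle walkthrough; "|| true" keeps the script going
# if the server is not reachable.
BASE="${BASE:-http://localhost:7860}"

curl -s "$BASE/health" || true
curl -s -X POST "$BASE/reset" \
	-H 'Content-Type: application/json' \
	-d '{"task_id": "bug_detection_easy_1"}' || true
curl -s "$BASE/state" || true
```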

## Output Format

Each inference run writes JSON like:

```json
{
	"task_id": "bug_detection_easy_1",
	"total_reward": 0.78,
	"task_score": 1.0,
	"steps": 3,
	"max_steps": 10,
	"provider": "openai-client",
	"model": "qwen3.5:latest",
	"api_base_url": "http://localhost:11434/v1"
}
```
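
Given several such files from the local sweep, a short script can summarize them. A minimal sketch, assuming the JSON shape shown above:

```python
import json
from glob import glob

def summarize(runs):
    """Average task_score and total_reward over a list of run dicts."""
    if not runs:
        return {}
    n = len(runs)
    return {
        "tasks": n,
        "mean_task_score": sum(r["task_score"] for r in runs) / n,
        "mean_total_reward": sum(r["total_reward"] for r in runs) / n,
    }

# Load every baseline_*.json produced by the sweep above:
runs = [json.load(open(p)) for p in glob("baseline_*.json")]
print(summarize(runs))
```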

## Notes On Baseline Stability

- Local models can time out on long prompts.
- The baseline now enforces phased review behavior and falls back to deterministic actions when the model is temporarily unavailable.
- For reproducible runs, keep `TEMPERATURE=0.0`.

## Fast Start (3 Commands)

```bash
source .venv/bin/activate
pytest -q
python submit.py --skip-docker --max-steps 10
```

## Judge Map (Criterion -> Evidence)

| Criterion | Evidence | File |
|---|---|---|
| OpenEnv lifecycle compliance | reset/step/state implemented and served over HTTP | `environment/env.py`, `server/app.py` |
| Typed models | Pydantic action/state/observation models | `environment/models.py` |
| Task difficulty progression | easy/medium/hard tasks + calibration approve tasks | `environment/tasks.py` |
| Grading quality | detection/suggestion/decision + partial credit + FP penalty + efficiency bonus | `environment/graders.py` |
| Baseline reproducibility | deterministic seed support in reset + inference output metadata | `environment/env.py`, `inference.py` |
| Submission validation | Python preflight + bash validator script | `submit.py`, `scripts/validate-submission.sh` |

## Grader Rubric (Summary)

| Component | Weight / Effect | Notes |
|---|---|---|
| Detection score | 0.4 | Partial credit for near-line matches |
| Suggestion score | 0.3 | Line-proximity matching for fixes |
| Decision score | 0.3 | Approve for no-issue tasks, request_changes otherwise |
| False positive penalty | up to -0.4 | Strong penalty for issue spam |
| Efficiency bonus | up to +0.1 | Bonus for completing in fewer steps |
| Final score clamp | [0,1] | Safety clamp in grader |
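
The rubric can be read as roughly the following combination. This is an illustrative sketch of the weights in the table, not the grader's actual code (see `environment/graders.py`):

```python
def combined_score(detection, suggestion, decision,
                   fp_penalty=0.0, efficiency_bonus=0.0):
    """Illustrative weighting per the rubric above; not the real grader.

    detection/suggestion/decision are each in [0, 1];
    fp_penalty is in [0, 0.4]; efficiency_bonus is in [0, 0.1].
    """
    raw = 0.4 * detection + 0.3 * suggestion + 0.3 * decision
    raw -= fp_penalty
    raw += efficiency_bonus
    return max(0.0, min(1.0, raw))  # final safety clamp to [0, 1]
```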

## Benchmark Snapshot (3-Task Local Run)

| Task | Task Score | Total Reward | Model |
|---|---:|---:|---|
| bug_detection_easy_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |
| memory_leak_medium_1 | 0.875 | 1.285 | meta/llama-3.3-70b-instruct |
| security_hard_1 | 1.000 | 1.410 | meta/llama-3.3-70b-instruct |

Note: `task_score` is normalized to [0,1]. `total_reward` is cumulative step reward and can exceed 1.0 by design.

## Training Results (PPO-style Loop)

Run training:

```bash
source .venv/bin/activate
python train.py --episodes 120 --max-steps 5
```

Generated artifacts:

- `ppo_logs/train_metrics.csv`
- `ppo_logs/summary.txt`

Recent run summary:

- Episodes: `120`
- Average reward (first 10): `0.0100`
- Average reward (last 10): `0.5100`
- Improvement: `+0.5000`

This demonstrates measurable policy improvement under the training setup provided in this repository.
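
The improvement figure can be recomputed from the logged metrics with a short script. This sketch assumes `ppo_logs/train_metrics.csv` has a per-episode `reward` column; adjust the column name to match the actual log:

```python
import csv

def reward_improvement(path, window=10, column="reward"):
    """Average reward over the first vs. last `window` episodes of a CSV log."""
    with open(path, newline="") as f:
        rewards = [float(row[column]) for row in csv.DictReader(f)]
    first = sum(rewards[:window]) / window
    last = sum(rewards[-window:]) / window
    return first, last, last - first
```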

## One-Command Benchmark Table

Generate per-task JSON outputs plus a markdown table for judge submission:

```bash
source .venv/bin/activate
python scripts/run_benchmark.py --max-steps 10
```

Artifacts:

- `outputs/benchmark_<task_id>.json`
- `outputs/benchmark_table.md`

## Failure Analysis Template

1. `javascript_medium_5` (Undefined access)
- Observation: task score reached `1.0`, but diagnostics show `precision=0.5`, `recall=1.0`, `f1=0.6667`, `false_positive_count=1`.
- Why: model used Python-centric heuristics and produced one extra issue comment on a JS snippet.
- Action: added JavaScript task category and retained false-positive penalties to expose over-flagging.

2. `memory_leak_medium_1` (historical baseline run)
- Observation: earlier run dropped below perfect score due to noisy comment strategy.
- Why: over-commenting triggered false positive penalties despite finding the core issue.
- Action: anti-loop repeated-comment penalty + adversarial no-issue tasks to discourage spam.

3. `adversarial_hard_4` (Safe SQL task)
- Observation: correct behavior is approve; naive SQL keyword matching causes false alarms.
- Why: keyword-only review policies confuse parameterized SQL with vulnerable string interpolation.
- Action: included explicit no-issue adversarial task in hard set and calibration tests to reward restraint.