---
title: Code Security Review OpenEnv
emoji: 🛡️
colorFrom: gray
colorTo: purple
sdk: docker
pinned: false
tags:
- openenv
---
# Code Security Review: OpenEnv Environment
An RL environment for training AI agents to perform real-world code security review.
Agents analyze code from production pull requests across a **two-phase** multi-step
workflow: first discovering the hidden file, then identifying the vulnerability.
Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon.
---
## Environment Overview
| Field | Value |
|---|---|
| Tasks | 3 (easy → medium → hard) |
| Languages | Python, JavaScript |
| Action space | Phase 1: `{"request_file": true}` / Phase 2: Structured JSON (6 fields) |
| Reward range | 0.0 to 1.0 (clamped) |
| Steps per episode | 2 (max) |
---
## Tasks
| ID | Language | Bug Class | Difficulty |
|---|---|---|---|
| `python-off-by-one` | Python | Off-by-one index error | Easy |
| `js-idor-auth` | JavaScript | Insecure Direct Object Reference (IDOR) | Medium |
| `python-pickle-deserialization` | Python | Insecure Deserialization (RCE) | Hard |
---
## Two-Phase Episode Walkthrough
The agent operates in a **2-step sequential workflow** that mirrors a real AppSec triage process:
**Step 1: File Discovery** (`+0.20`)
The agent receives only the PR title and file path. The code is hidden. The agent must request access:
```json
{"request_file": true}
```
The environment unlocks the code snippet and returns it in the observation.
**Step 2: Security Review** (up to `+0.80`)
The agent analyses the code and submits a structured JSON finding:
```json
{
"bug_identified": true,
"bug_location": "line 3 - range(len(transactions) + 1)",
"bug_type": "off-by-one",
"bug_description": "Off-by-one error causes IndexError on last iteration...",
"severity": "medium",
"suggested_fix": "Change range(len(transactions) + 1) to range(len(transactions))"
}
```
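
The two steps above can be driven by a small agent loop. The sketch below is only an illustration: `reset_fn`, `step_fn`, and `review_fn` are hypothetical callables, and the assumption that a step returns a dict with `observation`, `reward`, and `done` keys is not confirmed by this README.

```python
from typing import Any, Callable, Dict

Action = Dict[str, Any]
Observation = Dict[str, Any]


def run_episode(
    reset_fn: Callable[[], Observation],
    step_fn: Callable[[Action], Dict[str, Any]],
    review_fn: Callable[[Observation], Action],
) -> float:
    """Run one two-phase episode and return the summed reward.

    Assumption: step_fn returns {"observation": ..., "reward": ..., "done": ...}.
    """
    reset_fn()  # the initial observation still has the code hidden

    # Phase 1: ask the environment to reveal the file.
    first = step_fn({"request_file": True})
    total = float(first["reward"])
    if first["done"]:
        return total

    # Phase 2: build the structured finding from the unlocked observation.
    second = step_fn(review_fn(first["observation"]))
    return total + float(second["reward"])
```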
---
## Action Space
### Phase 1: File Request
```json
{"request_file": true}
```
### Phase 2: Bug Review
| Field | Type | Values |
|---|---|---|
| `bug_identified` | bool | `true` / `false` |
| `bug_location` | string | location description |
| `bug_type` | string | `off-by-one` \| `logic-error` \| `insecure-deserialization` \| `none` |
| `bug_description` | string | detailed vulnerability explanation |
| `severity` | string | `none` \| `low` \| `medium` \| `high` \| `critical` |
| `suggested_fix` | string | how to fix the bug |
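
An agent can check a Phase 2 action against the table above before submitting it. This is a minimal client-side sketch: `validate_review` is a hypothetical helper, and whether the server itself rejects values outside the listed sets is an assumption this README does not confirm.

```python
from typing import Any, Dict, List

# Allowed values copied from the Phase 2 table.
BUG_TYPES = {"off-by-one", "logic-error", "insecure-deserialization", "none"}
SEVERITIES = {"none", "low", "medium", "high", "critical"}
STRING_FIELDS = (
    "bug_location",
    "bug_type",
    "bug_description",
    "severity",
    "suggested_fix",
)


def validate_review(action: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the action looks well-formed."""
    problems: List[str] = []
    if not isinstance(action.get("bug_identified"), bool):
        problems.append("bug_identified must be a bool")
    for field in STRING_FIELDS:
        if not isinstance(action.get(field), str):
            problems.append(f"{field} must be a string")
    bug_type = action.get("bug_type")
    if isinstance(bug_type, str) and bug_type not in BUG_TYPES:
        problems.append("bug_type is not one of the listed values")
    severity = action.get("severity")
    if isinstance(severity, str) and severity not in SEVERITIES:
        problems.append("severity is not one of the listed values")
    return problems
```
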
## Observation Space
```json
{
"task_id": "python-pickle-deserialization",
"language": "Python",
"difficulty": "hard",
"code_snippet": "<FILE CONTENTS HIDDEN - Submit {\"request_file\": true} to view>",
"context": "Redis-backed caching decorator for worker tasks that serializes results...",
"pr_title": "Add distributed task caching layer for worker pool",
"file_path": "worker/cache.py"
}
```
After `request_file`, `code_snippet` contains the actual source code.
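
An agent can tell the two phases apart by inspecting `code_snippet`. The helper below is a hypothetical sketch that assumes the placeholder always starts with the prefix shown in the example observation.

```python
from typing import Any, Dict

HIDDEN_PREFIX = "<FILE CONTENTS HIDDEN"  # copied from the example observation


def code_is_hidden(observation: Dict[str, Any]) -> bool:
    """True while code_snippet is missing, empty, or still the placeholder."""
    snippet = observation.get("code_snippet", "")
    return not snippet or snippet.startswith(HIDDEN_PREFIX)
```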
---
## Reward Breakdown
| Step | Component | Max Score |
|---|---|---|
| 1 | File request granted | 0.20 |
| 2 | Bug identified | 0.20 |
| 2 | Bug type correct | 0.20 |
| 2 | Bug location correct | 0.10 |
| 2 | Description quality | 0.25 |
| 2 | Fix quality | 0.15 |
| 2 | Severity correct | 0.10 |
| **Total** | | **1.00** |

The grader penalises keyword stuffing: incoherent keyword dumps score ≤ 0.20 on the description component.
The component maxima sum to 1.20; the episode total reward is **clamped to [0.0, 1.0]**.
**Example Calculation:**
The agent requests the file (+0.20), correctly identifies the bug (+0.20), gets the type right (+0.20),
matches 50% of the location keywords (+0.05), writes a good description (+0.20),
suggests a partial fix (+0.08), and assigns the correct severity (+0.10): `0.20+0.20+0.20+0.05+0.20+0.08+0.10 = 1.03`, clamped to `1.00`.
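
The arithmetic and the clamp can be checked in a few lines of Python. The component values come from the example; `clamp_reward` is a hypothetical helper, not the grader's own function.

```python
# Per-component scores from the example calculation.
components = {
    "file_request": 0.20,
    "bug_identified": 0.20,
    "bug_type": 0.20,
    "bug_location": 0.05,
    "description": 0.20,
    "fix": 0.08,
    "severity": 0.10,
}


def clamp_reward(raw: float, low: float = 0.0, high: float = 1.0) -> float:
    """Clamp a raw episode total into [low, high]."""
    return max(low, min(high, raw))


raw_total = sum(components.values())
episode_reward = clamp_reward(raw_total)
print(f"raw={raw_total:.2f} clamped={episode_reward:.2f}")  # raw=1.03 clamped=1.00
```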
---
## Edge Cases
- **At step 0:** `reset()` should be called first. Calling `step()` without a prior reset triggers an auto-reset.
- **Phase 1 skip:** If the agent skips `request_file` and submits a review directly on step 1, it receives no intermediate reward and the code snippet used for grading may be hidden.
- **Max step limit:** The episode ends (`done=True`) when a bug review is submitted or `max_steps=2` is reached.
- **At done=True:** Calling `step()` returns `reward=0.0`, `done=True`, and `info["error"]` indicating the episode is complete.
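
The lifecycle rules above can be modelled with a toy state machine. This is not the server's code: `ToyEpisode`, the injected `grade` function, the error text, and the choice to reward only the first file request are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Tuple

MAX_STEPS = 2
FILE_REQUEST_REWARD = 0.20

StepResult = Tuple[float, bool, Dict[str, str]]


class ToyEpisode:
    """Toy model of the episode lifecycle described in the edge cases."""

    def __init__(self, grade: Callable[[Dict[str, Any]], float]) -> None:
        self.grade = grade  # hypothetical grader for the step-2 review
        self.started = False
        self.done = False
        self.steps = 0
        self.file_unlocked = False

    def reset(self) -> None:
        self.started, self.done = True, False
        self.steps, self.file_unlocked = 0, False

    def step(self, action: Dict[str, Any]) -> StepResult:
        if not self.started:
            self.reset()  # step() before reset() triggers an auto-reset
        if self.done:
            return 0.0, True, {"error": "episode is complete"}
        self.steps += 1
        if action.get("request_file") is True:
            # Assumption: only the first file request earns the reward.
            reward = 0.0 if self.file_unlocked else FILE_REQUEST_REWARD
            self.file_unlocked = True
        else:
            reward = self.grade(action)
            self.done = True  # a submitted review ends the episode
        if self.steps >= MAX_STEPS:
            self.done = True  # the step limit also ends the episode
        return reward, self.done, {}
```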
---
## Baseline Scores
| Task | Difficulty | Model | Score | Steps | Notes |
|------|-----------|-------|-------|-------|-------|
| python-off-by-one | easy | Llama-3.3-70B-Instruct | 0.883 | 2 | File request + review |
| js-idor-auth | medium | Llama-3.3-70B-Instruct | 0.500 | 2 | File request + review |
| python-pickle-deserialization | hard | Llama-3.3-70B-Instruct | 0.512 | 2 | File request + review |
---
## API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | `/` | Health check |
| POST | `/reset?task_id=<id>` | Reset environment, returns observation |
| POST | `/step` | Submit action (Phase 1 or Phase 2), returns reward |
| GET | `/state` | Current episode state |
| GET | `/tasks` | List all tasks |
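
The endpoints can be called from Python with the standard library alone. In the sketch below the helper names are hypothetical, and sending the action as the raw JSON body of `/step` is an assumption; this README does not specify the request or response schema.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict


def build_reset_request(base_url: str, task_id: str) -> urllib.request.Request:
    """Build POST /reset?task_id=<id> with an empty body."""
    query = urllib.parse.urlencode({"task_id": task_id})
    return urllib.request.Request(
        f"{base_url}/reset?{query}", data=b"", method="POST"
    )


def build_step_request(
    base_url: str, action: Dict[str, Any]
) -> urllib.request.Request:
    """Build POST /step; assumes the action is the raw JSON body."""
    return urllib.request.Request(
        f"{base_url}/step",
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> Dict[str, Any]:
    """Send a request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Example (needs a running server, see Setup):
# observation = send(build_reset_request("http://localhost:8000", "python-off-by-one"))
# result = send(build_step_request("http://localhost:8000", {"request_file": True}))
```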
---
## Setup
### Docker
```bash
docker build -t code-security-review .
docker run -p 8000:8000 code-security-review
```
### Local
```bash
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
---
## Running Inference
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct"
export HF_TOKEN="hf_your_token_here"
export ENV_URL="http://localhost:8000"
python inference.py
```
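
A pre-flight check can catch a missing variable before a run starts. How `inference.py` actually reads its configuration is not shown here, so `load_config` below is a hypothetical helper that only assumes the four variable names exported above.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN", "ENV_URL")


def load_config(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Calling `load_config()` with no arguments reads the current process environment; passing a plain dict makes the check easy to unit-test.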