| --- |
| title: Code Security Review OpenEnv |
| emoji: π‘οΈ |
| colorFrom: gray |
| colorTo: purple |
| sdk: docker |
| pinned: false |
| tags: |
| - openenv |
| --- |
| |
| # Code Security Review β OpenEnv Environment |
|
|
| An RL environment for training AI agents to perform real-world code security review. |
| Agents analyze code from production pull requests across a **two-phase** multi-step |
| workflow: first discovering the hidden file, then identifying the vulnerability. |
|
|
| Built by **Inmodel Labs** for the Meta PyTorch OpenEnv Hackathon. |
|
|
| --- |
|
|
| ## Environment Overview |
|
|
| | Field | Value | |
| |---|---| |
| | Tasks | 3 (easy β medium β hard) | |
| | Languages | Python, JavaScript | |
| | Action space | Phase 1: `{"request_file": true}` / Phase 2: Structured JSON (6 fields) | |
| | Reward range | 0.0 β 1.0 (clamped) | |
| | Steps per episode | 2 (max) | |
|
|
| --- |
|
|
| ## Tasks |
|
|
| | ID | Language | Bug Class | Difficulty | |
| |---|---|---|---| |
| | `python-off-by-one` | Python | Off-by-one index error | Easy | |
| | `js-idor-auth` | JavaScript | Insecure Direct Object Reference (IDOR) | Medium | |
| | `python-pickle-deserialization` | Python | Insecure Deserialization (RCE) | Hard | |
|
|
| --- |
|
|
| ## Two-Phase Episode Walkthrough |
|
|
| The agent operates in a **2-step sequential workflow** that mirrors a real AppSec triage process: |
|
|
| **Step 1 β File Discovery** (`+0.20`) |
| The agent receives only the PR title and file path. The code is hidden. The agent must request access: |
| ```json |
| {"request_file": true} |
| ``` |
| The environment unlocks the code snippet and returns it in the observation. |
|
|
| **Step 2 β Security Review** (up to `+0.80`) |
| The agent analyses the code and submits a structured JSON finding: |
| ```json |
| { |
| "bug_identified": true, |
| "bug_location": "line 3 β range(len(transactions) + 1)", |
| "bug_type": "off-by-one", |
| "bug_description": "Off-by-one error causes IndexError on last iteration...", |
| "severity": "medium", |
| "suggested_fix": "Change range(len(transactions) + 1) to range(len(transactions))" |
| } |
| ``` |
|
|
| --- |
|
|
| ## Action Space |
|
|
| ### Phase 1 β File Request |
| ```json |
| {"request_file": true} |
| ``` |
|
|
| ### Phase 2 β Bug Review |
| | Field | Type | Values | |
| |---|---|---| |
| | `bug_identified` | bool | `true` / `false` | |
| | `bug_location` | string | location description | |
| | `bug_type` | string | `off-by-one` \| `logic-error` \| `insecure-deserialization` \| `none` | |
| | `bug_description` | string | detailed vulnerability explanation | |
| | `severity` | string | `none` \| `low` \| `medium` \| `high` \| `critical` | |
| | `suggested_fix` | string | how to fix the bug | |
|
|
| ## Observation Space |
|
|
| ```json |
| { |
| "task_id": "python-pickle-deserialization", |
| "language": "Python", |
| "difficulty": "hard", |
| "code_snippet": "<FILE CONTENTS HIDDEN - Submit {\"request_file\": true} to view>", |
| "context": "Redis-backed caching decorator for worker tasks that serializes results...", |
| "pr_title": "Add distributed task caching layer for worker pool", |
| "file_path": "worker/cache.py" |
| } |
| ``` |
| After `request_file`, `code_snippet` contains the actual source code. |
|
|
| --- |
|
|
| ## Reward Breakdown |
|
|
| | Step | Component | Max Score | |
| |---|---|---| |
| | 1 | File request granted | 0.20 | |
| | 2 | Bug identified | 0.20 | |
| | 2 | Bug type correct | 0.20 | |
| | 2 | Bug location correct | 0.10 | |
| | 2 | Description quality | 0.25 | |
| | 2 | Fix quality | 0.15 | |
| | 2 | Severity correct | 0.10 | |
| | **Total** | | **1.00** | |
|
|
| The grader penalises keyword stuffing β incoherent keyword dumps score β€ 0.20 on the description component. |
| Episode total reward is **clamped to [0.0, 1.0]**. |
|
|
| **Example Calculation:** |
| Agent requests file (+0.20), correctly identifies bug (+0.20), correct type (+0.20), |
| finds 50% location keywords (+0.05), writes good description (+0.20), |
| suggests partial fix (+0.08), correct severity (+0.10) = total `0.20+0.20+0.20+0.05+0.20+0.08+0.10 = 1.00` β clamped to `1.00`. |
|
|
| --- |
|
|
| ## Edge Cases |
|
|
| - **At step 0:** `reset()` must be called first. Calling `step()` without a reset triggers auto-reset. |
| - **Phase 1 skip:** If the agent skips `request_file` and submits a review directly on step 1, it receives no intermediate reward and the code snippet used for grading may be hidden. |
| - **Max step limit:** Episode ends at `done=True` when a bug review is submitted or `max_steps=2` is reached. |
| - **At done=True:** Calling `step()` returns `reward=0.0`, `done=True`, and `info["error"]` indicating the episode is complete. |
|
|
| --- |
|
|
| ## Baseline Scores |
|
|
| | Task | Difficulty | Model | Score | Steps | Notes | |
| |------|-----------|-------|-------|-------|-------| |
| | python-off-by-one | easy | Llama-3.3-70B-Instruct | 0.883 | 2 | File request + review | |
| | js-idor-auth | medium | Llama-3.3-70B-Instruct | 0.500 | 2 | File request + review | |
| | python-pickle-deserialization | hard | Llama-3.3-70B-Instruct | 0.512 | 2 | File request + review | |
|
|
| --- |
|
|
| ## API Endpoints |
|
|
| | Method | Path | Description | |
| |---|---|---| |
| | GET | `/` | Health check | |
| | POST | `/reset?task_id=<id>` | Reset environment, returns observation | |
| | POST | `/step` | Submit action (Phase 1 or Phase 2), returns reward | |
| | GET | `/state` | Current episode state | |
| | GET | `/tasks` | List all tasks | |
|
|
| --- |
|
|
| ## Setup |
|
|
| ### Docker |
|
|
| ```bash |
| docker build -t code-security-review . |
| docker run -p 8000:8000 code-security-review |
| ``` |
|
|
| ### Local |
|
|
| ```bash |
| pip install -r requirements.txt |
| uvicorn server.app:app --host 0.0.0.0 --port 8000 |
| ``` |
|
|
| --- |
|
|
| ## Running Inference |
|
|
| ```bash |
| export API_BASE_URL="https://router.huggingface.co/v1" |
| export MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct" |
| export HF_TOKEN="hf_your_token_here" |
| export ENV_URL="http://localhost:8000" |
| |
| python inference.py |
| ``` |
|
|