---
title: Smart Contract Audit RL Environment
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
tags:
  - openenv
  - reinforcement-learning
  - smart-contracts
  - solidity
  - security
  - evaluation
license: mit
short_description: OpenEnv RL environment for smart contract security auditing
---

# 🔍 Smart Contract Audit RL Environment

> An OpenEnv-compliant reinforcement learning environment for training and evaluating AI agents on real-world Solidity smart contract security auditing tasks.

---

## Overview

Smart contract auditing is a high-stakes, expert-level task performed by professional security researchers. Mistakes cost millions — the Ethereum ecosystem has lost over **$3 billion** to exploits in audited and unaudited contracts alike. This environment simulates the core reasoning loop of a smart contract auditor, enabling RL agents to learn structured exploration strategies for vulnerability detection, property discovery, and rule checking.

The dataset is derived from real audit reports published by **[Certora](https://www.certora.com/)**, covering three production-grade DeFi protocols:

| Source | Protocol |
|---|---|
| Certora Audit | AaveVault |
| Certora Audit | AaveVaultV2 |
| Certora Audit | Lido Finance |

Each episode exposes a fragment of a real Solidity contract. The agent must use a structured action API — mirroring how a human auditor would methodically inspect a codebase — to accomplish a defined objective within a fixed step budget.

---

## Environment Architecture

```
SmartContractEnv/
├── agents/
│   ├── task1.py
│   ├── task2.py
│   └── task3.py
├── data/
│   ├── __init__.py
│   ├── contracts.json
│   ├── data_loader.py
│   ├── properties.csv
│   ├── Template.json
│   ├── vulnerabilities.json
│   └── vulnerabilities.md
├── env/
│   ├── __init__.py
│   ├── base_env.py
│   └── schemas.py
├── server/
│   ├── tasks/
│   │   ├── task1/
│   │   │   ├── __init__.py
│   │   │   ├── actions.py
│   │   │   ├── environment.py
│   │   │   └── grader.py
│   │   ├── task2/
│   │   │   ├── __init__.py
│   │   │   ├── actions.py
│   │   │   ├── environment.py
│   │   │   └── grader.py
│   │   ├── task3/
│   │   │   ├── __init__.py
│   │   │   ├── actions.py
│   │   │   ├── environment.py
│   │   │   └── grader.py
│   │   └── __init__.py
│   ├── __init__.py
│   └── app.py
├── utils/
│   ├── __init__.py
│   ├── prompts.py
│   ├── propertyretriever.py
│   └── semanticmatcher.py
├── .env
├── .gitignore
├── demo.py
├── Dockerfile
├── eval.py
├── inference.py
├── LICENSE.txt
├── openenv.yaml
├── pyproject.toml
├── README.md
├── requirements.txt
└── validate-submission.sh
```

---

## Tasks

### Task 1 — Targeted Vulnerability Detection *(Medium)*

**Real-world analogue:** A security auditor is handed a Solidity file and asked to pinpoint the vulnerable function and describe the class of bug.

**Setup:** The agent receives a single Solidity file. On each `reset()`, the episode selects one vulnerable function at random from the 7–8 labelled for that contract in the dataset.

**Objective:** Identify the vulnerable function and describe its issue in 2–3 words (e.g., `"reentrancy"`, `"integer overflow"`, `"unchecked return value"`). Submit `"NO"` if no vulnerability exists.

**Action Space:**

| Action | Reward | Notes |
|---|---|---|
| `list_functions` | −0.05 | Returns all function signatures in the file |
| `get_function_code` | −0.10 (wrong fn) / +0.05 (correct fn) | Returns raw Solidity source of one function |
| `get_function_summary` | −0.05 (wrong) / +0.03 (correct) | Returns NatSpec comments for a function |
| `get_file_metadata` | −0.04 | Returns the file's header comment / pragma / imports |
| `get_state_variables` | −0.05 | Returns all contract-level state variable declarations |
| `get_call_graph` | −0.08 | Returns the inter-function call graph |
| `get_task_state` | 0.00 | Returns current step count and cumulative reward |
| `submit` | +5.00 (correct) / −1.50 (wrong) | One submission allowed per episode |
| *(repeated query)* | −0.40 | Penalty for querying the exact same action+params twice |
| *(unknown action)* | −0.20 | Any unrecognised action type |

**Episode terminates** on `submit` or when the step budget is exhausted.
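
The per-step rewards in the table above can be read as a small lookup. The following is a minimal sketch, not the environment's actual implementation: the function name `task1_step_reward`, its flags, and the order in which the unknown-action and repeat checks run are assumptions. It also assumes the repeat penalty replaces the normal cost of the action rather than adding to it.

```python
# Step rewards for Task 1, copied from the action table above.
# A tuple holds (reward when the query misses, reward when it hits the vulnerable function).
TASK1_STEP_REWARDS = {
    "list_functions": -0.05,
    "get_function_code": (-0.10, 0.05),
    "get_function_summary": (-0.05, 0.03),
    "get_file_metadata": -0.04,
    "get_state_variables": -0.05,
    "get_call_graph": -0.08,
    "get_task_state": 0.00,
}
REPEAT_PENALTY = -0.40
UNKNOWN_ACTION_PENALTY = -0.20


def task1_step_reward(action_type, hits_target=False, repeated=False):
    """Return the reward for one non-submit step (hypothetical helper)."""
    if action_type not in TASK1_STEP_REWARDS:
        return UNKNOWN_ACTION_PENALTY
    if repeated:
        # Assumption: the repeat penalty replaces the usual cost of the action.
        return REPEAT_PENALTY
    reward = TASK1_STEP_REWARDS[action_type]
    if isinstance(reward, tuple):
        miss, hit = reward
        return hit if hits_target else miss
    return reward
```

Under these assumptions, reading the code of the vulnerable function yields `0.05`, while reading any other function costs `-0.10`.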

---

### Task 2 — Property Discovery *(Hard)*

**Real-world analogue:** A formal verification engineer must derive an invariant or safety property for a contract function — the kind written as a Certora Verification Language (CVL) spec.

**Setup:** The agent receives a single function extracted from a Solidity file, along with a brief description of the broader contract. The episode targets a function that has a known, labelled property in the dataset.

**Objective:** Produce a natural-language description of the function's key safety property (e.g., *"The total shares minted must never exceed the total underlying assets deposited"*).

**Action Space:**

| Action | Reward | Notes |
|---|---|---|
| `get_file_natspec` | −0.03 | File-level NatSpec documentation |
| `get_function_natspec` | −0.08 | Function-level NatSpec comments |
| `get_function_code` | −0.06 | Raw Solidity source of the target function |
| `get_related_functions` | −0.06 | Functions that call or are called by the target |
| `get_input_output` | −0.04 | Parameter names/types and return values |
| `get_similar_property` | −0.20 | Hard-coded reference property from a different contract |
| `submit_property` | 0–5 (graded) | **One attempt per episode.** Scored by a deterministic similarity checker |

**Grading:** Submission reward is computed by a deterministic checker that combines keyword overlap and structural similarity against the ground-truth property. Score is normalised to `[0, 5]` and then scaled to `[0.0, 1.0]` for the episode return.
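
To make the grading description concrete, here is a minimal, standard-library sketch of a deterministic checker. It is an illustration under stated assumptions, not the grader shipped in `grader.py`: plain term-frequency cosine similarity stands in for the structural-similarity component, and the function names and the equal 0.5/0.5 weighting are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_overlap(submission, truth):
    """Fraction of distinct ground-truth tokens that appear in the submission."""
    truth_terms = set(tokenize(truth))
    if not truth_terms:
        return 0.0
    return len(truth_terms & set(tokenize(submission))) / len(truth_terms)


def cosine_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def grade_property(submission, truth, keyword_weight=0.5):
    """Return (raw score in [0, 5], episode score in [0.0, 1.0])."""
    similarity = (
        keyword_weight * keyword_overlap(submission, truth)
        + (1.0 - keyword_weight) * cosine_similarity(submission, truth)
    )
    raw = 5.0 * similarity
    return raw, raw / 5.0
```

Because the checker uses no randomness and no model inference, the same submission always receives the same score, which is the property the grading description relies on.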

---

### Task 3 — Rule Checker *(Easy)*

**Real-world analogue:** Given a known security rule (e.g., *"functions that transfer funds must emit a Transfer event"*), identify which function in the contract violates it.

**Setup:** The agent receives a Solidity file and a natural-language description of a property/rule. At least one function in the file violates this rule.

**Objective:** Identify the name of the rule-breaking function.

**Action Space:**

| Action | Reward | Notes |
|---|---|---|
| `get_property_specification` | −0.03 | Returns a pseudo-formal (CVL-like) version of the property |
| `list_functions` | −0.05 | All function signatures in the file |
| `get_function_metadata` | −0.05 | Visibility, modifiers, and signature for a function |
| `get_function_code` | −0.10 | Raw Solidity source of one function |
| `get_state_variables` | −0.05 | Contract-level state variable declarations |
| `get_call_graph` | −0.08 | Inter-function call graph |
| `submit` | +5.00 (exact) / +1.50 (sub-caller) / −1.50 (wrong) | One submission per episode |

**Partial credit:** If the agent names a function that *calls* the true violating function, it receives +1.50 rather than the full +5.00. This rewards reasoning that reaches the right area of the call graph.
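
The partial-credit rule can be sketched as a lookup against the call graph. This is an illustrative sketch only: the `grade_task3` name, the dictionary-of-sets call-graph representation, and the choice to count direct callers only are assumptions, and the function names in the example are hypothetical.

```python
def grade_task3(submitted, violating, call_graph):
    """Score a Task 3 submission.

    call_graph maps a caller's name to the set of functions it calls directly.
    """
    if submitted == violating:
        return 5.0  # exact match
    if violating in call_graph.get(submitted, ()):
        return 1.5  # submitted function calls the true violating function
    return -1.5  # unrelated function


# Hypothetical contract: withdrawAll() calls the violating _withdraw().
example_graph = {
    "withdrawAll": {"_withdraw"},
    "_withdraw": set(),
    "deposit": set(),
}
```

With this graph, naming `_withdraw` earns the full reward, naming `withdrawAll` earns partial credit, and naming `deposit` is penalised.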

---

## Reward Design

Rewards are shaped to encourage **efficient, targeted exploration** and discourage two failure modes: aimless browsing and brute-force guessing.

```
R_episode = Σ(step_rewards) + final_submission_reward
```

- **Exploration costs** are small and graduated by information value — cheap actions (metadata) cost less than expensive ones (full code retrieval).
- **Correct-direction bonuses** on `get_function_code` and `get_function_summary` in Task 1 reward navigating toward the vulnerable function before committing.
- **Repetition penalty** (−0.40) discourages looping over the same queries.
- **Wrong submission** (−1.50) is painful enough to deter random guessing but recoverable through efficient prior exploration.
- **Episode score** is normalised to `[0.0, 1.0]` for the OpenEnv grader: `score = max(0, R_episode) / 5.0`.
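
As a worked example of the formulas above, the sketch below sums step rewards with the submission reward and normalises the result. The function name `episode_score` is hypothetical, and the upper clamp at 1.0 is an assumption added here: the formula as written only clamps at zero, so a correct-direction bonus followed by a correct submission (for example 0.05 + 5.00) would otherwise give a score slightly above 1.0.

```python
def episode_score(step_rewards, submission_reward, max_reward=5.0):
    """Compute the normalised episode score from per-step and submission rewards."""
    r_episode = sum(step_rewards) + submission_reward
    score = max(0.0, r_episode) / max_reward
    # Assumption: clamp the upper end so the score stays inside [0.0, 1.0].
    return min(1.0, score)


# Task 1 episode: list_functions, one wrong get_function_code,
# one correct get_function_code, then a correct submit.
efficient_run = episode_score([-0.05, -0.10, 0.05], 5.00)

# Same exploration cost, but the submission is wrong.
failed_run = episode_score([-0.05, -0.10, 0.05], -1.50)
```

The first episode has a return of about 4.90 and a score of about 0.98. The second has a negative return, which the `max(0, ...)` term maps to a score of 0.0.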

---

## Observation Space

Every `step()` and `reset()` returns a typed `Observation` object:

```python
class Observation(BaseModel):
    task_id: str                        # "task1_vuln_detection" | "task2_property_discovery" | "task3_rule_checker"
    step: int                           # Current step index (0-indexed)
    max_steps: int                      # Episode step budget
    cumulative_reward: float            # Running reward total
    done: bool                          # Episode terminal flag
    content: str                        # Main textual payload (code, summary, error, etc.)
    metadata: dict[str, Any]            # Extra context (function name, contract name, etc.)
    initial_description: str            # Persistent contract/task description shown every step
```

---

## Action Space

Actions are typed `Action` objects passed to `step()`:

```python
class Action(BaseModel):
    action_type: str                    # One of the action names listed per task above
    params: dict[str, str]             # e.g. {"function_name": "withdraw"}
```

All unknown `action_type` values return a penalty observation without terminating the episode.
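
One way to detect unknown actions and repeated queries is to key each call on its action type plus a canonical form of its parameters. The sketch below assumes this approach; the `QueryTracker` class is hypothetical and is not taken from the environment's source.

```python
import json


class QueryTracker:
    """Classify incoming actions as 'ok', 'repeated', or 'unknown' (hypothetical helper)."""

    def __init__(self, valid_actions):
        self.valid_actions = set(valid_actions)
        self.seen = set()

    def classify(self, action_type, params):
        if action_type not in self.valid_actions:
            return "unknown"
        # Sorting keys makes {"a": "1", "b": "2"} and {"b": "2", "a": "1"} identical.
        key = (action_type, json.dumps(params, sort_keys=True))
        if key in self.seen:
            return "repeated"
        self.seen.add(key)
        return "ok"
```

The same action with different parameters counts as a new query, which matches the "exact same action+params" wording of the repeat penalty.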

---

## OpenEnv Interface

The environment exposes a standard HTTP API:

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness probe — returns `{"status": "ok"}` |
| `GET` | `/tasks` | Lists all tasks with ID, difficulty, and status |
| `POST` | `/reset` | Starts a new episode. Body: `{"task_id": str, "seed": int}` |
| `POST` | `/step` | Takes one action. Body: `{"action_type": str, "params": {}}` |
| `GET` | `/state` | Returns full internal episode state (debug) |
| `GET` | `/action_space` | Returns JSON schema of valid actions |
| `GET` | `/observation_space` | Returns JSON schema of observation structure |

### Quick Start

```bash
SPACE_URL=http://localhost:7860

# Start a new episode for Task 1
curl -X POST $SPACE_URL/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task1_vuln_detection", "seed": 42}'

# List all functions in the contract
curl -X POST $SPACE_URL/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "list_functions", "params": {}}'

# Inspect a specific function
curl -X POST $SPACE_URL/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "get_function_code", "params": {"function_name": "withdraw"}}'

# Submit your answer
curl -X POST $SPACE_URL/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit", "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"}}'
```
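
The same sequence can be driven from Python using only the standard library. This is a sketch under the assumption that every endpoint returns a JSON body; the helper names `build_request`, `post`, and `run_quick_start` are hypothetical, and the response fields are not inspected because their exact shape depends on the server.

```python
import json
import urllib.request

SPACE_URL = "http://localhost:7860"


def build_request(path, payload):
    """Build a JSON POST request for one of the environment endpoints."""
    return urllib.request.Request(
        url=f"{SPACE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(path, payload):
    """Send the request and decode the JSON response (assumes a JSON body)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_quick_start():
    """Replay the four curl calls above; requires the server to be running."""
    calls = [
        ("/reset", {"task_id": "task1_vuln_detection", "seed": 42}),
        ("/step", {"action_type": "list_functions", "params": {}}),
        ("/step", {"action_type": "get_function_code",
                   "params": {"function_name": "withdraw"}}),
        ("/step", {"action_type": "submit",
                   "params": {"function_name": "withdraw",
                              "vulnerability_type": "reentrancy"}}),
    ]
    return [post(path, payload) for path, payload in calls]
```

Calling `run_quick_start()` against a running container returns the four decoded responses in order.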

---

## Setup & Installation

### Prerequisites

- Docker ≥ 20.10
- Python 3.11+ (for local development)
- `OPENAI_API_KEY`, `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` environment variables

### Run with Docker

```bash
# Build the image
docker build -t sc-audit-env .

# Run the container
docker run -p 7860:7860 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e API_BASE_URL=$API_BASE_URL \
  -e MODEL_NAME=$MODEL_NAME \
  sc-audit-env

# Verify it's running
curl http://localhost:7860/health
```

### Run Locally (Development)

```bash
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
```

---

## Baseline Inference Script

The `inference.py` script runs an OpenAI-compatible model against all three tasks and reports episode scores. It reads credentials from environment variables and completes in under 20 minutes on a 2 vCPU / 8 GB machine.

```bash
export OPENAI_API_KEY=your_key
export API_BASE_URL=your_custom_endpoint
export MODEL_NAME=your_custom_model

python inference.py
```

**Expected output:**

```
=== Smart Contract Audit RL — Baseline Evaluation ===

Task 1 | Targeted Vulnerability Detection  | Score: 0.41 | Steps used: 8/15
Task 2 | Property Discovery                | Score: 0.28 | Steps used: 6/10
Task 3 | Rule Checker                      | Score: 0.72 | Steps used: 4/10

Overall average: 0.47
```

> **Note:** Scores are stochastic due to random episode selection on `reset()`. Run with a fixed seed (`--seed 42`) for reproducible results.

### Agent System Prompt

The inference script injects the following system prompt to guide output format:

```
You are a smart contract security auditor. You will be given access to a Solidity
contract via a structured action API. Use the available actions to investigate the
contract, then submit your answer.

Always respond with a single JSON object:
{"action_type": "<action>", "params": {"<key>": "<value>"}}

Do not include any other text outside the JSON object.
```
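
Models do not always follow the "JSON only" instruction, so a client typically has to parse replies defensively. The following is a hedged sketch of one way to do that; `parse_action` is a hypothetical helper, not the parser used in `inference.py`, and the choice to extract the outermost `{...}` span is an assumption.

```python
import json


def parse_action(reply):
    """Return (action_type, params) from a model reply, or None if it is unusable."""
    # Tolerate stray text or code fences around the JSON object.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    action_type = data.get("action_type")
    params = data.get("params", {})
    if not isinstance(action_type, str) or not isinstance(params, dict):
        return None
    # Action.params is typed dict[str, str], so coerce keys and values to strings.
    return action_type, {str(k): str(v) for k, v in params.items()}
```

A caller can treat a `None` result as a malformed turn and decide whether to re-prompt the model or spend a step on a fallback action.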

---

## openenv.yaml

```yaml
name: smart-contract-audit-env
version: "1.2.0"
description: >
  OpenEnv RL environment for Solidity smart contract security auditing.
  Agents explore real-world DeFi contracts using a structured action API
  to detect vulnerabilities, discover properties, and check rule compliance.
tasks:
  - id: task1_vuln_detection
    name: Targeted Vulnerability Detection
    difficulty: medium
    max_steps: 40
    max_score: 1.0
  - id: task2_property_discovery
    name: Property Discovery
    difficulty: hard
    max_steps: 40
    max_score: 1.0
  - id: task3_rule_checker
    name: Rule Checker
    difficulty: easy
    max_steps: 20
    max_score: 1.0
observation_schema: models/observation.py
action_schema: models/action.py
app_port: 7860
```

---

## Data

The dataset (`data/dataset.json`) contains **7–8 labelled entries** per contract, each formatted according to `data/Template.json`.

Ground truth is **never exposed** to the agent via any action. Apart from the small correct-direction bonuses in Task 1, the submission actions (`submit`, `submit_property`) are the only path to positive reward.

---

## Design Notes & Known Limitations

- **Reward calibration:** Step penalties and submission rewards may need tuning based on empirical agent performance. Current values are derived from initial design rationale, not from extensive ablation.
- **Call graph granularity:** The current `get_call_graph` action returns the entire graph at once. A future revision could expose it incrementally (per-function neighbours) to make the action more informative and cost-proportional.
- **Vulnerability naming:** Vulnerability types do not follow a fixed taxonomy. Grading uses keyword + semantic matching against a curated synonym list (e.g., `"re-entrancy"` ≡ `"reentrancy"`).
- **Dataset size:** The current dataset covers 3 contracts with 7–8 vulnerabilities each. Expanding to more Certora audit reports would improve task diversity and reduce overfitting risk.
- **`get_function_code` decomposition:** This action could be split into finer-grained sub-actions (`get_parameters`, `get_return_values`, `get_modifiers`) to give agents a more gradual information ladder.
- **Property similarity scoring (Task 2):** Sentence transformer models cannot be used in the containerised environment due to memory constraints. The checker instead uses TF-IDF cosine similarity combined with keyword matching against the ground-truth property.
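
The synonym-list idea from the vulnerability-naming note can be sketched as follows. This covers only the synonym lookup, not the semantic-matching part, and the `SYNONYMS` entries, `normalise`, and `labels_match` are all hypothetical illustrations rather than the curated list used by the grader.

```python
import re

# Hypothetical synonym groups; the real curated list is not reproduced here.
SYNONYMS = {
    "reentrancy": {"re-entrancy", "reentrant call"},
    "integer overflow": {"arithmetic overflow", "overflow"},
    "unchecked return value": {"unchecked return", "ignored return value"},
}


def normalise(label):
    """Lowercase a label and collapse punctuation to single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", label.lower()))


def labels_match(submitted, truth):
    """True if both labels are equal or fall in the same synonym group."""
    s, t = normalise(submitted), normalise(truth)
    if s == t:
        return True
    for canonical, variants in SYNONYMS.items():
        group = {normalise(canonical)} | {normalise(v) for v in variants}
        if s in group and t in group:
            return True
    return False
```

Normalising before comparison means that `"Re-Entrancy"` and `"reentrancy"` are treated as the same label, while labels from different groups never match.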

---

## License

MIT; see `LICENSE.txt` for details.

Data sourced from public Certora audit reports. Solidity source files are reproduced for research and evaluation purposes.

---

## Citation

```bibtex
@misc{sc-audit-openenv-2025,
  title   = {Smart Contract Audit RL Environment},
  year    = {2025},
  note    = {OpenEnv-compliant RL environment for Solidity security analysis.
             Data sourced from Certora audit reports (AaveVault, AaveVaultV2, Lido Finance).}
}
```