| title: Smart Contract Audit RL Environment | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: docker | |
| app_port: 7860 | |
| tags: | |
| - openenv | |
| - reinforcement-learning | |
| - smart-contracts | |
| - solidity | |
| - security | |
| - evaluation | |
| license: mit | |
| short_description: OpenEnv RL environment for smart contract security auditing | |
| # Smart Contract Audit RL Environment | |
| > An OpenEnv-compliant reinforcement learning environment for training and evaluating AI agents on real-world Solidity smart contract security auditing tasks. | |
| --- | |
| ## Overview | |
| Smart contract auditing is a high-stakes, expert-level task performed by professional security researchers. Mistakes cost millions: the Ethereum ecosystem has lost over **$3 billion** to exploits in audited and unaudited contracts alike. This environment simulates the core reasoning loop of a smart contract auditor, enabling RL agents to learn structured exploration strategies for vulnerability detection, property discovery, and rule checking. | |
| The dataset is derived from real audit reports published by **[Certora](https://www.certora.com/)**, covering three production-grade DeFi protocols: | |
| | Source | Protocol | | |
| |---|---| | |
| | Certora Audit | AaveVault | | |
| | Certora Audit | AaveVaultV2 | | |
| | Certora Audit | Lido Finance | | |
| Each episode exposes a fragment of a real Solidity contract. The agent must use a structured action API, mirroring how a human auditor would methodically inspect a codebase, to accomplish a defined objective within a fixed step budget. | |
| --- | |
| ## Environment Architecture | |
| ``` | |
| SmartContractEnv/ | |
| ├── agents/ | |
| │   ├── task1.py | |
| │   ├── task2.py | |
| │   └── task3.py | |
| ├── data/ | |
| │   ├── __init__.py | |
| │   ├── contracts.json | |
| │   ├── data_loader.py | |
| │   ├── properties.csv | |
| │   ├── Template.json | |
| │   ├── vulnerabilities.json | |
| │   └── vulnerabilities.md | |
| ├── env/ | |
| │   ├── __init__.py | |
| │   ├── base_env.py | |
| │   └── schemas.py | |
| ├── server/ | |
| │   ├── tasks/ | |
| │   │   ├── task1/ | |
| │   │   │   ├── __init__.py | |
| │   │   │   ├── actions.py | |
| │   │   │   ├── environment.py | |
| │   │   │   └── grader.py | |
| │   │   ├── task2/ | |
| │   │   │   ├── __init__.py | |
| │   │   │   ├── actions.py | |
| │   │   │   ├── environment.py | |
| │   │   │   └── grader.py | |
| │   │   ├── task3/ | |
| │   │   │   ├── __init__.py | |
| │   │   │   ├── actions.py | |
| │   │   │   ├── environment.py | |
| │   │   │   └── grader.py | |
| │   │   └── __init__.py | |
| │   ├── __init__.py | |
| │   └── app.py | |
| ├── utils/ | |
| │   ├── __init__.py | |
| │   ├── prompts.py | |
| │   ├── propertyretriever.py | |
| │   └── semanticmatcher.py | |
| ├── .env | |
| ├── .gitignore | |
| ├── demo.py | |
| ├── Dockerfile | |
| ├── eval.py | |
| ├── inference.py | |
| ├── LICENSE.txt | |
| ├── openenv.yaml | |
| ├── pyproject.toml | |
| ├── README.md | |
| ├── requirements.txt | |
| └── validate-submission.sh | |
| ``` | |
| --- | |
| ## Tasks | |
| ### Task 1: Targeted Vulnerability Detection *(Medium)* | |
| **Real-world analogue:** A security auditor is handed a Solidity file and asked to pinpoint the vulnerable function and describe the class of bug. | |
| **Setup:** The agent receives a single Solidity file. The episode selects one vulnerable function at random from the dataset (7–8 available) on each `reset()`. | |
| **Objective:** Identify the vulnerable function and describe its issue in 2–3 words (e.g., `"reentrancy"`, `"integer overflow"`, `"unchecked return value"`). Submit `"NO"` if no vulnerability exists. | |
| **Action Space:** | |
| | Action | Reward | Notes | | |
| |---|---|---| | |
| `list_functions` | -0.05 | Returns all function signatures in the file | |
| `get_function_code` | -0.10 (wrong fn) / +0.05 (correct fn) | Returns raw Solidity source of one function | |
| `get_function_summary` | -0.05 (wrong fn) / +0.03 (correct fn) | Returns NatSpec comments for a function | |
| `get_file_metadata` | -0.04 | Returns the file's header comment / pragma / imports | |
| `get_state_variables` | -0.05 | Returns all contract-level state variable declarations | |
| `get_call_graph` | -0.08 | Returns the inter-function call graph | |
| `get_task_state` | 0.00 | Returns current step count and cumulative reward | |
| `submit` | +5.00 (correct) / -1.50 (wrong) | One submission allowed per episode | |
| *(repeated query)* | -0.40 | Penalty for querying the exact same action+params twice | |
| *(unknown action)* | -0.20 | Any unrecognised action type | |
| **Episode terminates** on `submit` or when the step budget is exhausted. | |
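The flat costs and the two penalties in the table above can be illustrated with a small bookkeeping sketch. This is not the environment's actual implementation: `FLAT_COSTS`, `query_key`, and `step_reward` are hypothetical names, only a subset of actions is covered, and the way repeated queries are detected (a canonical JSON key over action type and parameters) is an assumption.

```python
import json

REPEAT_PENALTY = -0.40   # "(repeated query)" row in the Task 1 table
UNKNOWN_PENALTY = -0.20  # "(unknown action)" row in the Task 1 table

# Hypothetical subset of the flat-cost actions from the table. Actions with
# correct/wrong variants (get_function_code, get_function_summary, submit)
# are left out to keep the sketch short, so here they count as unknown.
FLAT_COSTS = {
    "list_functions": -0.05,
    "get_file_metadata": -0.04,
    "get_state_variables": -0.05,
    "get_call_graph": -0.08,
}


def query_key(action_type: str, params: dict) -> str:
    """Canonical key, so parameter ordering does not defeat the repeat check."""
    return json.dumps([action_type, params], sort_keys=True)


def step_reward(action_type: str, params: dict, seen: set) -> float:
    """Return the reward for one exploration step and record the query in `seen`."""
    if action_type not in FLAT_COSTS:
        return UNKNOWN_PENALTY
    key = query_key(action_type, params)
    if key in seen:
        return REPEAT_PENALTY
    seen.add(key)
    return FLAT_COSTS[action_type]
```

Calling `step_reward("list_functions", {}, seen)` twice with the same `seen` set charges the flat cost the first time and the repetition penalty the second time.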
| --- | |
| ### Task 2: Property Discovery *(Hard)* | |
| **Real-world analogue:** A formal verification engineer must derive an invariant or safety property for a contract function, the kind written as a Certora Verification Language (CVL) spec. | |
| **Setup:** The agent receives a single function extracted from a Solidity file, along with a brief description of the broader contract. The episode targets a function that has a known, labelled property in the dataset. | |
| **Objective:** Produce a natural-language description of the function's key safety property (e.g., *"The total shares minted must never exceed the total underlying assets deposited"*). | |
| **Action Space:** | |
| | Action | Reward | Notes | | |
| |---|---|---| | |
| `get_file_natspec` | -0.03 | File-level NatSpec documentation | |
| `get_function_natspec` | -0.08 | Function-level NatSpec comments | |
| `get_function_code` | -0.06 | Raw Solidity source of the target function | |
| `get_related_functions` | -0.06 | Functions that call or are called by the target | |
| `get_input_output` | -0.04 | Parameter names/types and return values | |
| `get_similar_property` | -0.20 | Hard-coded reference property from a different contract | |
| `submit_property` | 0–5 (graded) | **One attempt per episode.** Scored by deterministic similarity checker | |
| **Grading:** Submission reward is computed by a deterministic checker that combines keyword overlap and structural similarity against the ground-truth property. Score is normalised to `[0, 5]` and then scaled to `[0.0, 1.0]` for the episode return. | |
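As a rough illustration of a deterministic checker of this kind, the sketch below blends a keyword-overlap score with a bag-of-words cosine similarity and maps the result to `[0, 5]` and `[0.0, 1.0]`. It is an assumption-laden stand-in, not the repository's grader: the function names, the tokeniser, the plain term-count cosine, and the 50/50 weighting (`w_keyword`) are all hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokeniser."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def keyword_overlap(pred: str, truth: str) -> float:
    """Fraction of ground-truth tokens that also appear in the prediction."""
    truth_set = set(tokens(truth))
    if not truth_set:
        return 0.0
    return len(truth_set & set(tokens(pred))) / len(truth_set)


def cosine(pred: str, truth: str) -> float:
    """Cosine similarity between raw term-count vectors."""
    a, b = Counter(tokens(pred)), Counter(tokens(truth))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def grade_property(pred: str, truth: str, w_keyword: float = 0.5) -> tuple[float, float]:
    """Return (reward in [0, 5], episode score in [0.0, 1.0])."""
    similarity = w_keyword * keyword_overlap(pred, truth) + (1 - w_keyword) * cosine(pred, truth)
    reward = 5.0 * similarity
    return reward, reward / 5.0
```

Because every step is a pure function of the two strings, the same submission always receives the same score, which is the property the grading description relies on.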
| --- | |
| ### Task 3: Rule Checker *(Easy)* | |
| **Real-world analogue:** Given a known security rule (e.g., *"functions that transfer funds must emit a Transfer event"*), identify which function in the contract violates it. | |
| **Setup:** The agent receives a Solidity file and a natural-language description of a property/rule. At least one function in the file violates this rule. | |
| **Objective:** Identify the name of the rule-breaking function. | |
| **Action Space:** | |
| | Action | Reward | Notes | | |
| |---|---|---| | |
| `get_property_specification` | -0.03 | Returns a pseudo-formal (CVL-like) version of the property | |
| `list_functions` | -0.05 | All function signatures in the file | |
| `get_function_metadata` | -0.05 | Visibility, modifiers, and signature for a function | |
| `get_function_code` | -0.10 | Raw Solidity source of one function | |
| `get_state_variables` | -0.05 | Contract-level state variable declarations | |
| `get_call_graph` | -0.08 | Inter-function call graph | |
| `submit` | +5.00 (exact) / +1.50 (sub-caller) / -1.50 (wrong) | One submission per episode | |
| **Partial credit:** If the agent names a function that *calls* the true violating function, it receives +1.50 rather than the full +5.00. This rewards reasoning that reaches the right area of the call graph. | |
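The three submission outcomes can be sketched as a lookup against the call graph. The function name `grade_task3` and the graph representation (a dict mapping each caller to the list of functions it calls) are assumptions for illustration, and only direct callers earn partial credit in this sketch.

```python
def grade_task3(submitted: str, violating: str, call_graph: dict[str, list[str]]) -> float:
    """Return the submission reward for Task 3.

    `call_graph` is assumed to map a caller name to the functions it calls.
    """
    if submitted == violating:
        return 5.0   # exact match
    if violating in call_graph.get(submitted, []):
        return 1.5   # submitted function directly calls the violating one
    return -1.5      # wrong function


# Hypothetical contract fragment: withdrawAll() calls _withdraw().
example_graph = {"withdrawAll": ["_withdraw"], "deposit": ["_mint"]}
```

With `example_graph`, naming `_withdraw` when it is the violating function yields the full reward, naming `withdrawAll` yields partial credit, and naming `deposit` is penalised.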
| --- | |
| ## Reward Design | |
| Rewards are shaped to encourage **efficient, targeted exploration** and discourage two failure modes: aimless browsing and brute-force guessing. | |
| ``` | |
| R_episode = Σ(step_rewards) + final_submission_reward | |
| ``` | |
| - **Exploration costs** are small and graduated by information value: cheap actions (metadata) cost less than expensive ones (full code retrieval). | |
| - **Correct-direction bonuses** on `get_function_code` in Task 1 reward navigating toward the vulnerable function before committing. | |
| - **Repetition penalty** (-0.40) discourages looping over the same queries. | |
| - **Wrong submission** (-1.50) is painful enough to deter random guessing but recoverable through efficient prior exploration. | |
| - **Episode score** is normalised to `[0.0, 1.0]` for the OpenEnv grader: `score = max(0, R_episode) / 5.0`. | |
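The arithmetic behind the episode score can be worked through in a few lines. The helper name `episode_score` is hypothetical, and the upper clamp to 1.0 is an added assumption that is not part of the stated formula; it is included because Task 1's correct-direction bonuses could otherwise push `R_episode` slightly above 5.0.

```python
def episode_score(step_rewards: list[float], submission_reward: float) -> float:
    """score = max(0, R_episode) / 5.0, additionally clamped to 1.0 (assumption)."""
    r_episode = sum(step_rewards) + submission_reward
    return min(1.0, max(0.0, r_episode) / 5.0)


# Task 1 example: list_functions, one wrong get_function_code, one correct
# get_function_code, then a correct submit.
good_run = episode_score([-0.05, -0.10, 0.05], 5.0)

# Three exploration steps followed by a wrong submit: R_episode is negative,
# so the score floors at 0.0.
bad_run = episode_score([-0.05, -0.10, -0.08], -1.5)
```

In the first run `R_episode` is 4.90, giving a score of about 0.98; in the second it is -1.73, giving 0.0.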
| --- | |
| ## Observation Space | |
| Every `step()` and `reset()` returns a typed `Observation` object: | |
| ```python | |
| from typing import Any | |
| from pydantic import BaseModel | |
| class Observation(BaseModel): | |
|     task_id: str               # "task1_vuln_detection" | "task2_property_discovery" | "task3_rule_checker" | |
|     step: int                  # Current step index (0-indexed) | |
|     max_steps: int             # Episode step budget | |
|     cumulative_reward: float   # Running reward total | |
|     done: bool                 # Episode terminal flag | |
|     content: str               # Main textual payload (code, summary, error, etc.) | |
|     metadata: dict[str, Any]   # Extra context (function name, contract name, etc.) | |
|     initial_description: str   # Persistent contract/task description shown every step | |
| ``` | |
| --- | |
| ## Action Space | |
| Actions are typed `Action` objects passed to `step()`: | |
| ```python | |
| from pydantic import BaseModel | |
| class Action(BaseModel): | |
|     action_type: str        # One of the action names listed per task above | |
|     params: dict[str, str]  # e.g. {"function_name": "withdraw"} | |
| ``` | |
| All unknown `action_type` values return a penalty observation without terminating the episode. | |
| --- | |
| ## OpenEnv Interface | |
| The environment exposes a standard HTTP API: | |
| | Method | Path | Description | | |
| |---|---|---| | |
| `GET` | `/health` | Liveness probe; returns `{"status": "ok"}` | |
| | `GET` | `/tasks` | Lists all tasks with ID, difficulty, and status | | |
| | `POST` | `/reset` | Starts a new episode. Body: `{"task_id": str, "seed": int}` | | |
| | `POST` | `/step` | Takes one action. Body: `{"action_type": str, "params": {}}` | | |
| | `GET` | `/state` | Returns full internal episode state (debug) | | |
| | `GET` | `/action_space` | Returns JSON schema of valid actions | | |
| | `GET` | `/observation_space` | Returns JSON schema of observation structure | | |
| ### Quick Start | |
| ```bash | |
| SPACE_URL=http://localhost:7860 | |
| # Start a new episode for Task 1 | |
| curl -X POST "$SPACE_URL/reset" \ | |
|   -H "Content-Type: application/json" \ | |
|   -d '{"task_id": "task1_vuln_detection", "seed": 42}' | |
| # List all functions in the contract | |
| curl -X POST "$SPACE_URL/step" \ | |
|   -H "Content-Type: application/json" \ | |
|   -d '{"action_type": "list_functions", "params": {}}' | |
| # Inspect a specific function | |
| curl -X POST "$SPACE_URL/step" \ | |
|   -H "Content-Type: application/json" \ | |
|   -d '{"action_type": "get_function_code", "params": {"function_name": "withdraw"}}' | |
| # Submit your answer | |
| curl -X POST "$SPACE_URL/step" \ | |
|   -H "Content-Type: application/json" \ | |
|   -d '{"action_type": "submit", "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"}}' | |
| ``` | |
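The same requests can be issued from Python using only the standard library. This is a minimal sketch: `build_request`, `post_json`, and `run_episode` are hypothetical helper names, and it assumes each endpoint replies with a JSON body (the exact response shape is not specified here).

```python
import json
import urllib.request

SPACE_URL = "http://localhost:7860"  # same base URL as in the curl examples


def build_request(path: str, body: dict) -> urllib.request.Request:
    """Build a JSON POST request for an endpoint such as /reset or /step."""
    return urllib.request.Request(
        url=SPACE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, body: dict) -> dict:
    """Send the request and decode the reply, assumed to be a JSON object."""
    with urllib.request.urlopen(build_request(path, body), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode() -> dict:
    """Replay the curl sequence above: reset, list functions, then submit."""
    post_json("/reset", {"task_id": "task1_vuln_detection", "seed": 42})
    post_json("/step", {"action_type": "list_functions", "params": {}})
    return post_json(
        "/step",
        {
            "action_type": "submit",
            "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"},
        },
    )
```

With the server running locally, `run_episode()` returns whatever the final `/step` call responds with.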
| --- | |
| ## Setup & Installation | |
| ### Prerequisites | |
| - Docker ≥ 20.10 | |
| - Python 3.11+ (for local development) | |
| - `OPENAI_API_KEY`, `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` environment variables | |
| ### Run with Docker | |
| ```bash | |
| # Build the image | |
| docker build -t sc-audit-env . | |
| # Run the container | |
| docker run -p 7860:7860 \ | |
|   -e OPENAI_API_KEY="$OPENAI_API_KEY" \ | |
|   -e API_BASE_URL="$API_BASE_URL" \ | |
|   -e MODEL_NAME="$MODEL_NAME" \ | |
|   sc-audit-env | |
| # Verify it's running | |
| curl http://localhost:7860/health | |
| ``` | |
| ### Run Locally (Development) | |
| ```bash | |
| pip install -r requirements.txt | |
| uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload | |
| ``` | |
| --- | |
| ## Baseline Inference Script | |
| The `inference.py` script runs an OpenAI-compatible model against all three tasks and reports episode scores. It reads credentials from environment variables and completes in under 20 minutes on a 2 vCPU / 8 GB machine. | |
| ```bash | |
| export OPENAI_API_KEY="<your-key>" | |
| export API_BASE_URL="<your-custom-endpoint>" | |
| export MODEL_NAME="<your-custom-model>" | |
| python inference.py | |
| ``` | |
| **Expected output:** | |
| ``` | |
| === Smart Contract Audit RL - Baseline Evaluation === | |
| Task 1 | Targeted Vulnerability Detection | Score: 0.41 | Steps used: 8/15 | |
| Task 2 | Property Discovery | Score: 0.28 | Steps used: 6/10 | |
| Task 3 | Rule Checker | Score: 0.72 | Steps used: 4/10 | |
| Overall average: 0.47 | |
| ``` | |
| > **Note:** Scores are stochastic due to random episode selection on `reset()`. Run with a fixed seed (`--seed 42`) for reproducible results. | |
| ### Agent System Prompt | |
| The inference script injects the following system prompt to guide output format: | |
| ``` | |
| You are a smart contract security auditor. You will be given access to a Solidity | |
| contract via a structured action API. Use the available actions to investigate the | |
| contract, then submit your answer. | |
| Always respond with a single JSON object: | |
| {"action_type": "<action>", "params": {"<key>": "<value>"}} | |
| Do not include any other text outside the JSON object. | |
| ``` | |
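A reply that follows this prompt can be turned into the `action_type` / `params` pair expected by `/step` with a small parser. The sketch below is hypothetical (`parse_action` is not taken from `inference.py`), and coercing parameter values to strings is an assumption based on the `dict[str, str]` type of `Action.params`.

```python
import json


def parse_action(raw: str) -> dict:
    """Parse a model reply into {"action_type": str, "params": dict[str, str]}.

    Raises ValueError when the reply does not match the format that the
    system prompt asks for.
    """
    try:
        obj = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(obj, dict) or not isinstance(obj.get("action_type"), str):
        raise ValueError("reply must be a JSON object with a string 'action_type'")
    params = obj.get("params", {})
    if not isinstance(params, dict):
        raise ValueError("'params' must be a JSON object")
    # Coerce keys and values to strings to match Action.params (assumption).
    return {
        "action_type": obj["action_type"],
        "params": {str(k): str(v) for k, v in params.items()},
    }
```

A caller could catch `ValueError` and decide how to handle a malformed reply, for example by re-prompting the model; how the real script handles that case is not stated here.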
| --- | |
| ## openenv.yaml | |
| ```yaml | |
| name: smart-contract-audit-env | |
| version: "1.2.0" | |
| description: > | |
| OpenEnv RL environment for Solidity smart contract security auditing. | |
| Agents explore real-world DeFi contracts using a structured action API | |
| to detect vulnerabilities, discover properties, and check rule compliance. | |
| tasks: | |
| - id: task1_vuln_detection | |
| name: Targeted Vulnerability Detection | |
| difficulty: medium | |
| max_steps: 40 | |
| max_score: 1.0 | |
| - id: task2_property_discovery | |
| name: Property Discovery | |
| difficulty: hard | |
| max_steps: 40 | |
| max_score: 1.0 | |
| - id: task3_rule_checker | |
| name: Rule Checker | |
| difficulty: easy | |
| max_steps: 20 | |
| max_score: 1.0 | |
| observation_schema: models/observation.py | |
| action_schema: models/action.py | |
| app_port: 7860 | |
| ``` | |
| --- | |
| ## Data | |
| The dataset (`data/dataset.json`) contains **7–8 labelled entries** per contract, each formatted according to `data/Template.json`. | |
| Ground truth is **never exposed** to the agent via any action. Apart from the small correct-direction bonuses in Task 1, the submission action (`submit`, or `submit_property` in Task 2) is the only path to positive reward. | |
| --- | |
| ## Design Notes & Known Limitations | |
| - **Reward calibration:** Step penalties and submission rewards may need tuning based on empirical agent performance. Current values are derived from initial design rationale, not from extensive ablation. | |
| - **Call graph granularity:** The current `get_call_graph` action returns the entire graph at once. A future revision could expose it incrementally (per-function neighbours) to make the action more informative and cost-proportional. | |
| - **Vulnerability naming:** Vulnerability types do not follow a fixed taxonomy. Grading uses keyword + semantic matching against a curated synonym list (e.g., `"re-entrancy"` ≡ `"reentrancy"`). | |
| - **Dataset size:** The current dataset covers 3 contracts with 7–8 vulnerabilities each. Expanding to more Certora audit reports would improve task diversity and reduce overfitting risk. | |
| - **`get_function_code` decomposition:** This action could be split into finer-grained sub-actions (`get_parameters`, `get_return_values`, `get_modifiers`) to give agents a more gradual information ladder. | |
| - **Property similarity scoring (Task 2):** Sentence transformer models cannot be used in the containerised environment due to memory constraints. The checker instead uses TF-IDF cosine similarity combined with keyword matching against the ground-truth property. | |
| --- | |
| ## License | |
| MIT. See `LICENSE.txt` for details. | |
| Data sourced from public Certora audit reports. Solidity source files are reproduced for research and evaluation purposes. | |
| --- | |
| ## Citation | |
| ```bibtex | |
| @misc{sc-audit-openenv-2025, | |
| title = {Smart Contract Audit RL Environment}, | |
| year = {2025}, | |
| note = {OpenEnv-compliant RL environment for Solidity security analysis. | |
| Data sourced from Certora audit reports (AaveVault, AaveVaultV2, Lido Finance).} | |
| } | |
| ``` |