metadata
title: Code Security Auditor Environment
emoji: 🛡️
colorFrom: yellow
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - security
  - code-review
  - reinforcement-learning

Code Security Auditor Environment

A real-world OpenEnv benchmark in which agents perform security audits on pull-request-style code snapshots.

The agent inspects files, submits vulnerability findings, and finalizes a report. The environment scores each episode with deterministic graders against hidden vulnerability ground truth, awarding partial credit and penalizing reward hacking.

Why this is a real-world task

Security reviewers and AppSec engineers routinely audit code for vulnerabilities before deployment. This environment models that workflow with concrete exploit classes:

  • SQL injection
  • command injection
  • insecure deserialization
  • weak authentication / auth bypass
  • SSRF
  • path traversal
  • hardcoded secrets

OpenEnv Compliance

  • Typed models: CodeSecurityAction, CodeSecurityObservation, CodeSecurityState
  • Core API: reset(), step(), state()
  • OpenEnv manifest: openenv.yaml
  • FastAPI runtime via server.app:app

Action Space

Action model: CodeSecurityAction

  • action_type: inspect_file | submit_finding | submit_final_report
  • filename: target file to inspect or report against
  • line_start, line_end: suspected vulnerable range
  • vuln_type: one of supported vulnerability classes
  • severity: low | medium | high | critical
  • confidence: [0.0, 1.0]
  • evidence, summary: free-form context
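
To make the field list concrete, here is a minimal sketch of an action object with the same field names. It is not the real CodeSecurityAction class: the name ActionSketch, the optional defaults, the validation rules, and the example values (app.py, the sql_injection label) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {"inspect_file", "submit_finding", "submit_final_report"}
SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for CodeSecurityAction; not the real class."""

    action_type: str
    filename: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None
    vuln_type: Optional[str] = None
    severity: Optional[str] = None
    confidence: Optional[float] = None
    evidence: str = ""
    summary: str = ""

    def __post_init__(self) -> None:
        # Reject values outside the documented enumerations and ranges.
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.severity is not None and self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0.0, 1.0]")
        if (
            self.line_start is not None
            and self.line_end is not None
            and self.line_start > self.line_end
        ):
            raise ValueError("line_start must not exceed line_end")


finding = ActionSketch(
    action_type="submit_finding",
    filename="app.py",  # hypothetical file name
    line_start=10,
    line_end=12,
    vuln_type="sql_injection",  # hypothetical spelling of the class label
    severity="high",
    confidence=0.8,
    evidence="query built with string formatting",
)
```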

Action semantics

  • inspect_file: returns full line-numbered file content.
  • submit_finding: grades the finding with deterministic partial credit.
  • submit_final_report: ends the episode and returns final score in [0.0, 1.0].
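
One way deterministic partial credit for submit_finding could work is a ladder that awards credit rung by rung and stops at the first miss. The sketch below is only an illustration: the Vuln type, the equal rung weights, and the overlap rule for line ranges are assumptions, and confidence calibration is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vuln:
    """Hypothetical record used for both a finding and a ground-truth entry."""

    filename: str
    vuln_type: str
    line_start: int
    line_end: int
    severity: str


# Assumed rung weights; the real grader's values are not documented here.
RUNGS = {"file": 0.25, "type": 0.25, "line": 0.25, "severity": 0.25}


def ladder_credit(finding: Vuln, truth: Vuln) -> float:
    """Climb file -> type -> line -> severity, stopping at the first miss."""
    credit = 0.0
    if finding.filename != truth.filename:
        return credit
    credit += RUNGS["file"]
    if finding.vuln_type != truth.vuln_type:
        return credit
    credit += RUNGS["type"]
    # Count the line rung when the reported range overlaps the true range.
    overlaps = (
        finding.line_start <= truth.line_end
        and truth.line_start <= finding.line_end
    )
    if not overlaps:
        return credit
    credit += RUNGS["line"]
    if finding.severity == truth.severity:
        credit += RUNGS["severity"]
    return credit
```

Because the function depends only on its two arguments, the same finding always receives the same credit.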

Observation Space

Observation model: CodeSecurityObservation

Key fields:

  • task_id, task_title, difficulty, objective
  • available_files
  • focused_file, file_excerpt
  • findings_so_far
  • steps_remaining
  • last_feedback
  • score_hint in [0, 1]
  • reward, done, metadata
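
As a rough illustration of how an agent might consume these fields, the sketch below picks the next action from an observation-like dictionary. The dictionary shape, the next_action helper, and the policy itself (inspect every file once, then finalize) are assumptions; a real agent would also submit findings between inspections.

```python
from typing import Any, Dict, Set


def next_action(obs: Dict[str, Any], inspected: Set[str]) -> Dict[str, Any]:
    """Pick the next action from an observation-like dict (illustrative only)."""
    # Reserve the last remaining step for the final report.
    if obs["steps_remaining"] <= 1:
        return {"action_type": "submit_final_report"}
    # Inspect each available file once, in the order listed.
    for name in obs["available_files"]:
        if name not in inspected:
            inspected.add(name)
            return {"action_type": "inspect_file", "filename": name}
    # Nothing left to inspect: finalize.
    return {"action_type": "submit_final_report"}
```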

Tasks and Difficulty

The environment includes 3 deterministic tasks:

  1. easy: Legacy Flask Patch Review
  2. medium: Payment Webhook Service
  3. hard: Enterprise Multi-Tenant API

Each task has:

  • realistic multi-file code snapshot
  • hidden vulnerability ground truth
  • deterministic grader with score in [0.0, 1.0]

Reward Design

Reward shaping is trajectory-aware and resistant to reward hacking:

  • inspect_file gives a small positive signal for novel, relevant file exploration
  • submit_finding gives partial credit along a ladder (file -> type -> line -> severity -> confidence calibration)
  • duplicate/low-quality findings reduce quality_multiplier and final score
  • false positives and over-submission reduce precision and final score
  • final score combines weighted recall, precision, structural quality, and calibration

This keeps the incentives balanced: spamming findings uses up steps and lowers precision and quality, so it cannot be used to inflate the score.
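
The effect of over-submission on the final score can be sketched with a simple weighted blend. The weights, the final_score function, and the example counts below are assumptions chosen for illustration; the actual grader may combine these terms differently.

```python
def final_score(
    true_positives: int,
    false_positives: int,
    total_vulns: int,
    quality: float = 1.0,
    calibration: float = 1.0,
) -> float:
    """Blend recall, precision, quality, and calibration into [0.0, 1.0]."""
    submitted = true_positives + false_positives
    recall = true_positives / total_vulns if total_vulns else 0.0
    precision = true_positives / submitted if submitted else 0.0
    # Assumed weights; the real grader's weights are not stated in this README.
    score = 0.4 * recall + 0.3 * precision + 0.2 * quality + 0.1 * calibration
    return max(0.0, min(1.0, score))


# Same recall in both cases, but the spammy run adds 12 false positives
# and is given a reduced quality multiplier.
careful = final_score(true_positives=3, false_positives=0, total_vulns=3)
spammy = final_score(true_positives=3, false_positives=12, total_vulns=3, quality=0.5)
```

With these assumed weights, the spammy run scores lower than the careful one even though both find every vulnerability.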

Baseline Scores

With deterministic tasks and a simple tool-using model loop, expected baseline tendencies are:

  • easy: high recall, moderate precision
  • medium: moderate recall, moderate precision
  • hard: lower recall, stricter penalties for noisy findings

Run inference.py to generate reproducible per-task scores for your selected model setup.

Setup

Option A: Run in-repo (OpenEnv monorepo)

From repository root:

docker build -t code-security-auditor-env:latest -f envs/code_security_auditor_env/server/Dockerfile .
docker run -p 8000:8000 code-security-auditor-env:latest

Option B: Run standalone

From this directory:

docker build -t code-security-auditor-env:latest .
docker run -p 8000:8000 code-security-auditor-env:latest

Baseline Inference

The required script is inference.py, located in the project root (this directory).

Required env vars:

  • API_BASE_URL
  • MODEL_NAME
  • HF_TOKEN

Optional env vars:

  • LOCAL_IMAGE_NAME (for from_docker_image mode)
  • ENV_BASE_URL (for connecting to an already-running server)
  • TASK_IDS (comma-separated task ids, default: easy,medium,hard)
  • MAX_STEPS
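
A script could read these variables roughly as follows. This is a sketch, not the code in inference.py: the load_settings helper and the returned keys are hypothetical, and since no default for MAX_STEPS is documented here, an unset value is represented as None.

```python
import os
from typing import Dict, List, Mapping

REQUIRED = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Collect settings, failing fast when a required variable is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    # TASK_IDS is comma-separated and defaults to all three tasks.
    raw_tasks = env.get("TASK_IDS", "easy,medium,hard")
    task_ids: List[str] = [t.strip() for t in raw_tasks.split(",") if t.strip()]
    max_steps = env.get("MAX_STEPS")
    return {
        "api_base_url": env["API_BASE_URL"],
        "model_name": env["MODEL_NAME"],
        "hf_token": env["HF_TOKEN"],
        "local_image_name": env.get("LOCAL_IMAGE_NAME"),
        "env_base_url": env.get("ENV_BASE_URL"),
        "task_ids": task_ids,
        # No default is documented for MAX_STEPS, so None means "unset".
        "max_steps": int(max_steps) if max_steps else None,
    }
```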

Run:

export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export LOCAL_IMAGE_NAME=code-security-auditor-env:latest
python inference.py

The script prints only [START], [STEP], and [END] log lines per task.

Hugging Face Spaces Deployment

Space repository: https://huggingface.co/spaces/Drac0528/CodeSecure

Recommended deploy flow (git push to Space repo):

git clone https://huggingface.co/spaces/Drac0528/CodeSecure
cd CodeSecure
cp -R /path/to/code_security_auditor_env/* .
rm -f .env
git add .
git commit -m "Deploy Code Security Auditor OpenEnv"
git push

Notes:

  • Keep README frontmatter and Dockerfile at Space repo root.
  • Use Space Settings to set runtime secrets/variables:
    • HF_TOKEN (Secret)
    • API_BASE_URL (Variable)
    • MODEL_NAME (Variable)
  • Ensure Space tags include openenv.

Verify API endpoint after build:

curl -X POST https://drac0528-codesecure.hf.space/reset -H 'Content-Type: application/json' -d '{}'
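
The same check can be made from Python with only the standard library. The helper names build_reset_request and reset are hypothetical, and the response is assumed to be a JSON object; its fields are not assumed.

```python
import json
import urllib.request

BASE_URL = "https://drac0528-codesecure.hf.space"


def build_reset_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST /reset request as the curl command."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reset",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset(base_url: str = BASE_URL, timeout: float = 30.0) -> dict:
    """Send the request and decode the body, assumed to be a JSON object."""
    request = build_reset_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling reset() performs a network request, so run it only after the Space has finished building.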

Validation

Use validate-submission.sh before submitting:

chmod +x validate-submission.sh
./validate-submission.sh https://drac0528-codesecure.hf.space .