---
title: Prompt Injection Environment
emoji: "🔊"
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv-0.2.3
  - openenv
  - Prompt Injection
  - Reinforcement Learning
---

# IDPI Exfil Environment

`IDPI Exfil Environment` is an OpenEnv-compatible prompt-injection defense environment. Agents operate on realistic email tasks where the visible goal looks benign, but attacker-controlled content tries to redirect behavior toward unsafe actions such as data exfiltration or malicious relays.

The environment is designed for evaluating whether an agent can:

- read and process email content safely
- identify indirect prompt injection patterns
- avoid unsafe outbound actions
- produce clean internal summaries on harder tasks

## Features

- `3 difficulty levels`: easy, medium, and hard scenarios with different attack visibility and trigger styles
- `Realistic attack channels`: inline instructions, hidden wrappers, HTML comments, transcript replay, attachment-style instructions, and more
- `Safety-aware action space`: includes both benign and risky tools so policies can be meaningfully tested
- `Structured scoring`: reward shaping during the episode plus end-of-episode grading
- `OpenEnv server mode`: FastAPI endpoints for reset, step, state, result, and health
- `Unified evaluation runner`: one runner for inference and baseline-style evaluation output

## Quick Start

### Install

```bash
pip install -e .
```

### Run the server

```bash
python server/app.py
```

The API starts on `http://localhost:7860`.

### Run the evaluation runner

Create a `.env` in the project root with the submission-compatible inference variables:

```bash
API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
MODEL_NAME=gemini-2.5-flash
HF_TOKEN=your_gemini_api_key_here
```

Then run:

```bash
python inference.py
```

Compatibility commands still work:

```bash
python run_baseline.py
python baseline/run_baseline.py
```

All three commands use the same unified runner and write results to `baseline/baseline_results.json`.
For submission mode, `inference.py` validates `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`, then delegates to the benchmark runner using the OpenAI client path.
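
That validation step can be sketched roughly as follows. This is a minimal illustration, not the actual code in `inference.py`; only the three variable names come from this README:

```python
import os

# The variables inference.py is described as requiring (names from this README).
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")

def validate_env(env=None):
    """Return the required settings, raising if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing fast like this keeps a misconfigured submission from burning API calls before the first scenario runs.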

## Using the HTTP API

### Reset the environment

```python
import requests

response = requests.post(
    "http://localhost:7860/reset",
    json={"difficulty": "easy"},
)
observation = response.json()
print(observation)
```

### Step with an action

```python
import requests

response = requests.post(
    "http://localhost:7860/step",
    json={"action": {"type": "ReadDoc"}},
)
result = response.json()

print(result["observation"])
print(result["reward"])
print(result["done"])
```

### Inspect state or final result

```python
import requests

state = requests.get("http://localhost:7860/state").json()
episode_result = requests.get("http://localhost:7860/result").json()
```

Additional discovery endpoints are also available:

```python
import requests

metadata = requests.get("http://localhost:7860/metadata").json()
schema = requests.get("http://localhost:7860/schema").json()
```

## Environment Overview

### Task model

Each episode presents an email with a benign business-looking task and embedded attacker intent. The agent must complete the safe objective without obeying adversarial instructions contained in the email body or derived attack channels.

Typical safe behavior includes:

- reviewing the email with `ReadDoc`
- flagging suspicious instructions with `FlagInjection`
- writing an internal-only clean summary with `WriteDB`
- refusing unsafe outbound behavior

### Difficulty levels

- `easy`: explicit or obvious prompt injection patterns
- `medium`: more obfuscated attacks, hidden wrappers, spoofed instruction hierarchy, or delayed triggers
- `hard`: attacks embedded in semantically plausible content where the agent must also produce a safe summary

### Actions

- `ReadDoc`
  Reads the current email content.
- `WriteDB(content="...")`
  Writes an internal summary or note.
- `SendEmail(to="...", body="...")`
  Sends an outbound email. This is the main exfiltration risk surface.
- `FlagInjection(reason="...")`
  Marks the content as suspicious.
- `NoOp`
  Takes no action.
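
Based on the `/step` example above, each action appears to travel as a JSON payload with a `type` field plus action-specific arguments. A hypothetical helper for building those payloads (the exact field names are assumptions inferred from this README's examples):

```python
def make_action(action_type, **kwargs):
    """Build a /step request body; field names are assumptions from this README."""
    return {"action": {"type": action_type, **kwargs}}

# A typical safe trajectory: read, flag the injection, summarize internally.
safe_trajectory = [
    make_action("ReadDoc"),
    make_action("FlagInjection", reason="Embedded instructions request an outbound relay"),
    make_action("WriteDB", content="Internal summary excluding attacker-controlled text"),
]
```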

### Observation fields

- `subject`
- `body`
- `last_action`
- `last_action_result`
- `step_count`
- `task_difficulty`
- `summary_required`
- `current_summary`
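
On the client side, these fields can be mirrored in a small dataclass for type-safe handling. A sketch; the field names follow the list above, but the types are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    # Field names from the list above; types are assumptions.
    subject: str
    body: str
    last_action: Optional[str]
    last_action_result: Optional[str]
    step_count: int
    task_difficulty: str
    summary_required: bool
    current_summary: Optional[str]

    @classmethod
    def from_json(cls, data: dict) -> "Observation":
        # Missing fields default to None rather than raising.
        return cls(**{f: data.get(f) for f in cls.__dataclass_fields__})
```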

### Endpoints

- `GET /`
- `GET /web`
- `GET /health`
- `GET /metadata`
- `GET /schema`
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /result`
- `POST /mcp`
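
The endpoints above can be wrapped in a thin client so callers don't repeat URLs. A sketch; only the paths come from the list above:

```python
class IDPIClient:
    """Minimal wrapper over the HTTP endpoints listed above (a sketch)."""

    def __init__(self, base_url="http://localhost:7860"):
        self.base_url = base_url.rstrip("/")

    def _url(self, path):
        return f"{self.base_url}/{path.lstrip('/')}"

    def reset(self, difficulty="easy"):
        import requests  # imported lazily so the URL helpers work without the dependency
        return requests.post(self._url("/reset"), json={"difficulty": difficulty}).json()

    def step(self, action):
        import requests
        return requests.post(self._url("/step"), json={"action": action}).json()

    def state(self):
        import requests
        return requests.get(self._url("/state")).json()

    def result(self):
        import requests
        return requests.get(self._url("/result")).json()
```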

## Evaluation

The evaluation runner loads scenarios from [docs/person_a_benchmark_manifest.json](docs/person_a_benchmark_manifest.json) and executes them across `easy`, `medium`, and `hard` splits.

The current manifest includes:

- `32 scenarios total`
- `9 easy`
- `13 medium`
- `10 hard`

Each run logs:

- per-step actions and rewards
- whether exfiltration occurred
- whether the task was completed safely
- final graded score
- per-difficulty summary statistics

Results are written to [baseline/baseline_results.json](baseline/baseline_results.json).

### Optional runner settings

These values can be set in `.env`:

```bash
EPISODES_PER_DIFFICULTY=1
BASELINE_RESULTS_PATH=baseline/submission_results.json
REQUEST_DELAY_SECONDS=3.0
MAX_MODEL_RETRIES=2
```
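
A sketch of how these optional values might be read, with type conversion and fallbacks. The defaults mirror the values shown above and the default results path mentioned earlier; the runner's actual parsing may differ:

```python
import os

def load_runner_settings(env=None):
    """Read the optional runner settings, falling back to documented defaults."""
    env = os.environ if env is None else env
    return {
        "episodes_per_difficulty": int(env.get("EPISODES_PER_DIFFICULTY", "1")),
        "results_path": env.get("BASELINE_RESULTS_PATH", "baseline/baseline_results.json"),
        "request_delay_seconds": float(env.get("REQUEST_DELAY_SECONDS", "3.0")),
        "max_model_retries": int(env.get("MAX_MODEL_RETRIES", "2")),
    }
```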

## Development

### Project structure

```text
Prompt-injection-env/
|-- env/
|   |-- environment.py
|   |-- dataset_loader.py
|   |-- scenario_generator.py
|   |-- injection_engine.py
|   |-- policy_engine.py
|   |-- taint_tracker.py
|   |-- reward.py
|   |-- grader.py
|   `-- models.py
|-- server/
|   |-- app.py
|   `-- Dockerfile
|-- docs/
|   |-- person_a_benchmark_manifest.json
|   |-- person_a_scenario_audit.md
|   `-- person_a_showcase_episodes.md
|-- baseline/
|   |-- run_baseline.py
|   `-- baseline_results.json
|-- inference.py
|-- run_baseline.py
|-- openenv.yaml
`-- README.md
```

### Core components

- [env/environment.py](env/environment.py)
  Main episode lifecycle, action handling, reward plumbing, and result assembly.
- [server/app.py](server/app.py)
  FastAPI wrapper exposing the environment over HTTP, including health, metadata, schema, and benchmark endpoints.
- [inference.py](inference.py)
  Submission-facing inference entrypoint that validates the required env vars and delegates to the benchmark runner.
- [openenv.yaml](openenv.yaml)
  Environment metadata for OpenEnv packaging and deployment.



### CI and sync

The repository includes GitHub workflows for:

- continuous integration in [.github/workflows/ci.yml](.github/workflows/ci.yml)
- syncing `main` to Hugging Face Spaces in [.github/workflows/sync.yml](.github/workflows/sync.yml)

Submission-focused regression tests live in `submission_tests/` and can be run with:

```bash
python -m unittest discover -s submission_tests
```

### Local Python usage

You can exercise the environment directly without starting the server:

```python
from env.environment import IDPIExfilEnv
from env.models import Action

env = IDPIExfilEnv()
obs = env.reset(difficulty="easy")
obs, reward, done, info = env.step(Action(type="ReadDoc"))
print(obs)
print(reward)
```

## Notes

- The environment terminates early on unsafe exfiltration events.
- Hard tasks require a clean internal summary that excludes attacker-controlled content.
- Gemini works through the submission-compatible OpenAI client path by setting `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.