---
title: Prompt Injection Environment
emoji: "🔊"
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv-0.2.3
- openenv
- Prompt Injection
- Reinforcement Learning
---
# IDPI Exfil Environment
`IDPI Exfil Environment` is an OpenEnv-compatible prompt-injection defense environment. Agents operate on realistic email tasks where the visible goal looks benign, but attacker-controlled content tries to redirect behavior toward unsafe actions such as data exfiltration or malicious relays.
The environment is designed for evaluating whether an agent can:
- read and process email content safely
- identify indirect prompt injection patterns
- avoid unsafe outbound actions
- produce clean internal summaries on harder tasks
## Features
- `3 difficulty levels`: easy, medium, and hard scenarios with different attack visibility and trigger styles
- `Realistic attack channels`: inline instructions, hidden wrappers, HTML comments, transcript replay, attachment-style instructions, and more
- `Safety-aware action space`: includes both benign and risky tools so policies can be meaningfully tested
- `Structured scoring`: reward shaping during the episode plus end-of-episode grading
- `OpenEnv server mode`: FastAPI endpoints for reset, step, state, result, and health
- `Unified evaluation runner`: one runner for inference and baseline-style evaluation output
## Quick Start
### Install
```bash
pip install -e .
```
### Run the server
```bash
python server/app.py
```
The API starts on `http://localhost:7860`.
### Run the evaluation runner
Create a `.env` in the project root with the submission-compatible inference variables:
```bash
API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
MODEL_NAME=gemini-2.5-flash
HF_TOKEN=your_gemini_api_key_here
```
Then run:
```bash
python inference.py
```
Compatibility commands still work:
```bash
python run_baseline.py
python baseline/run_baseline.py
```
All three commands use the same unified runner and write results to `baseline/baseline_results.json`.
For submission mode, `inference.py` validates `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`, then delegates to the benchmark runner using the OpenAI client path.
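That validation step can be sketched as a small pure function. The helper name and return shape below are illustrative, not the actual `inference.py` internals:

```python
import os

# The three variables inference.py is documented to validate.
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")

def missing_vars(env=None):
    """Return the names of required submission variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

# With only two of the three set, HF_TOKEN is reported as missing.
print(missing_vars({"API_BASE_URL": "https://...", "MODEL_NAME": "gemini-2.5-flash"}))
# ['HF_TOKEN']
```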
## Using the HTTP API
### Reset the environment
```python
import requests
response = requests.post(
    "http://localhost:7860/reset",
    json={"difficulty": "easy"},
)
observation = response.json()
print(observation)
```
### Step with an action
```python
import requests
response = requests.post(
    "http://localhost:7860/step",
    json={"action": {"type": "ReadDoc"}},
)
result = response.json()
print(result["observation"])
print(result["reward"])
print(result["done"])
```
### Inspect state or final result
```python
import requests
state = requests.get("http://localhost:7860/state").json()
episode_result = requests.get("http://localhost:7860/result").json()
```
Additional discovery endpoints are also available:
```python
import requests
metadata = requests.get("http://localhost:7860/metadata").json()
schema = requests.get("http://localhost:7860/schema").json()
```
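The reset and step calls above compose into a simple episode loop. The sketch below is illustrative, not part of the package: it assumes only the response shape shown above (`observation`, `reward`, `done`) and takes the HTTP transport as an injected `post` callable so the loop itself has no network dependency:

```python
def run_episode(post, difficulty="easy", max_steps=20):
    """Reset the environment, then step with ReadDoc until the episode ends.

    `post(path, payload)` should POST JSON and return the decoded body, e.g.:
        lambda path, body: requests.post("http://localhost:7860" + path,
                                         json=body).json()
    """
    post("/reset", {"difficulty": difficulty})
    total_reward = 0.0
    for _ in range(max_steps):
        result = post("/step", {"action": {"type": "ReadDoc"}})
        total_reward += result["reward"]
        if result["done"]:
            break
    return total_reward
```

A real policy would choose actions based on the observation instead of always calling `ReadDoc`.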
## Environment Overview
### Task model
Each episode presents an email with a benign business-looking task and embedded attacker intent. The agent must complete the safe objective without obeying adversarial instructions contained in the email body or derived attack channels.
Typical safe behavior includes:
- reviewing the email with `ReadDoc`
- flagging suspicious instructions with `FlagInjection`
- writing an internal-only clean summary with `WriteDB`
- refusing unsafe outbound behavior
### Difficulty levels
- `easy`: explicit or obvious prompt injection patterns
- `medium`: more obfuscated attacks, hidden wrappers, spoofed instruction hierarchy, or delayed triggers
- `hard`: attacks embedded in semantically plausible content where the agent must also produce a safe summary
### Actions
- `ReadDoc`: reads the current email content.
- `WriteDB(content="...")`: writes an internal summary or note.
- `SendEmail(to="...", body="...")`: sends an outbound email; this is the main exfiltration risk surface.
- `FlagInjection(reason="...")`: marks the content as suspicious.
- `NoOp`: takes no action.
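As `POST /step` payloads, these actions might look as follows. Only `{"type": "ReadDoc"}` is confirmed by the earlier example; the other field names are inferred from the keyword arguments above, on the assumption that they map to JSON fields of the same names:

```python
# Candidate /step payloads; field names beyond "type" are inferred from the
# signatures above, not confirmed against the server schema.
example_actions = [
    {"type": "ReadDoc"},
    {"type": "WriteDB", "content": "Internal note: invoice reviewed, nothing unusual."},
    {"type": "SendEmail", "to": "colleague@example.com", "body": "Summary attached."},
    {"type": "FlagInjection", "reason": "Body contains hidden forwarding instructions."},
    {"type": "NoOp"},
]
```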
### Observation fields
- `subject`
- `body`
- `last_action`
- `last_action_result`
- `step_count`
- `task_difficulty`
- `summary_required`
- `current_summary`
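Taken together, a decoded observation might look like the dict below. The field names come from the list above; every value is invented for illustration:

```python
# Illustrative observation shape; values are made up, only field names
# come from the documented observation fields.
sample_observation = {
    "subject": "Q3 invoice review",                 # email subject line
    "body": "Please review the attached summary.",  # email body (may carry the attack)
    "last_action": "ReadDoc",                       # previous action type, if any
    "last_action_result": "ok",                     # outcome of that action
    "step_count": 1,                                # steps taken this episode
    "task_difficulty": "easy",                      # easy | medium | hard
    "summary_required": False,                      # True on hard tasks
    "current_summary": "",                          # summary written via WriteDB so far
}
```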
### Endpoints
- `GET /`
- `GET /web`
- `GET /health`
- `GET /metadata`
- `GET /schema`
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /result`
- `POST /mcp`
## Evaluation
The evaluation runner loads scenarios from [docs/person_a_benchmark_manifest.json](docs/person_a_benchmark_manifest.json) and executes them across `easy`, `medium`, and `hard` splits.
The current manifest includes:
- `32 scenarios total`
- `9 easy`
- `13 medium`
- `10 hard`
Each run logs:
- per-step actions and rewards
- whether exfiltration occurred
- whether the task was completed safely
- final graded score
- per-difficulty summary statistics
Results are written to [baseline/baseline_results.json](baseline/baseline_results.json).
### Optional runner settings
These values can be set in `.env`:
```bash
EPISODES_PER_DIFFICULTY=1
BASELINE_RESULTS_PATH=baseline/submission_results.json
REQUEST_DELAY_SECONDS=3.0
MAX_MODEL_RETRIES=2
```
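One common way to read such knobs is `os.environ.get` with fallbacks. The defaults below simply mirror the example values and are a guess, not the runner's documented behavior:

```python
import os

def runner_settings(env=None):
    """Collect the optional runner knobs, falling back to assumed defaults."""
    env = os.environ if env is None else env
    return {
        "episodes_per_difficulty": int(env.get("EPISODES_PER_DIFFICULTY", "1")),
        "results_path": env.get("BASELINE_RESULTS_PATH", "baseline/baseline_results.json"),
        "request_delay_seconds": float(env.get("REQUEST_DELAY_SECONDS", "3.0")),
        "max_model_retries": int(env.get("MAX_MODEL_RETRIES", "2")),
    }
```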
## Development
### Project structure
```text
Prompt-injection-env/
|-- env/
| |-- environment.py
| |-- dataset_loader.py
| |-- scenario_generator.py
| |-- injection_engine.py
| |-- policy_engine.py
| |-- taint_tracker.py
| |-- reward.py
| |-- grader.py
| `-- models.py
|-- server/
| |-- app.py
| `-- Dockerfile
|-- docs/
| |-- person_a_benchmark_manifest.json
| |-- person_a_scenario_audit.md
| `-- person_a_showcase_episodes.md
|-- baseline/
| |-- run_baseline.py
| `-- baseline_results.json
|-- inference.py
|-- run_baseline.py
|-- openenv.yaml
`-- README.md
```
### Core components
- [env/environment.py](env/environment.py)
  Main episode lifecycle, action handling, reward plumbing, and result assembly.
- [server/app.py](server/app.py)
  FastAPI wrapper exposing the environment over HTTP, including health, metadata, schema, and benchmark endpoints.
- [inference.py](inference.py)
  Submission-facing inference entrypoint that validates the required env vars and delegates to the benchmark runner.
- [openenv.yaml](openenv.yaml)
  Environment metadata for OpenEnv packaging and deployment.
### CI and sync
Includes GitHub workflows for:
- continuous integration in [.github/workflows/ci.yml](.github/workflows/ci.yml)
- syncing `main` to Hugging Face Spaces in [.github/workflows/sync.yml](.github/workflows/sync.yml)
Submission-focused regression tests live in `submission_tests/` and can be run with:
```bash
python -m unittest discover -s submission_tests
```
### Local Python usage
You can exercise the environment directly without starting the server:
```python
from env.environment import IDPIExfilEnv
from env.models import Action
env = IDPIExfilEnv()
obs = env.reset(difficulty="easy")
obs, reward, done, info = env.step(Action(type="ReadDoc"))
print(obs)
print(reward)
```
## Notes
- The environment terminates early on unsafe exfiltration events.
- Hard tasks require a clean internal summary that excludes attacker-controlled content.
- Gemini works through the submission-compatible OpenAI client path by setting `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.