---
title: Prompt Injection Environment
emoji: "🔊"
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv-0.2.3
  - openenv
  - Prompt Injection
  - Reinforcement Learning
---

# IDPI Exfil Environment

`IDPI Exfil Environment` is an OpenEnv-compatible prompt-injection defense environment. Agents operate on realistic email tasks where the visible goal looks benign, but attacker-controlled content tries to redirect behavior toward unsafe actions such as data exfiltration or malicious relays.

The environment is designed for evaluating whether an agent can:

- read and process email content safely
- identify indirect prompt injection patterns
- avoid unsafe outbound actions
- produce clean internal summaries on harder tasks

## Features

- `3 difficulty levels`: easy, medium, and hard scenarios with different attack visibility and trigger styles
- `Realistic attack channels`: inline instructions, hidden wrappers, HTML comments, transcript replay, attachment-style instructions, and more
- `Safety-aware action space`: includes both benign and risky tools so policies can be meaningfully tested
- `Structured scoring`: reward shaping during the episode plus end-of-episode grading
- `OpenEnv server mode`: FastAPI endpoints for reset, step, state, result, and health
- `Unified evaluation runner`: one runner for inference and baseline-style evaluation output

## Quick Start

### Install

```bash
pip install -e .
```

### Run the server

```bash
python server/app.py
```

The API starts on `http://localhost:7860`.

### Run the evaluation runner

Create a `.env` in the project root with the submission-compatible inference variables:

```bash
API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
MODEL_NAME=gemini-2.5-flash
HF_TOKEN=your_gemini_api_key_here
```

Then run:

```bash
python inference.py
```

Compatibility commands still work:

```bash
python run_baseline.py
python baseline/run_baseline.py
```

All three commands use the same unified runner and write results to `baseline/baseline_results.json`.
For submission mode, `inference.py` validates `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`, then delegates to the benchmark runner using the OpenAI client path.
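
That validation step can be sketched roughly as follows. This is a minimal illustration, not the actual code in `inference.py`; only the three variable names come from this README:

```python
import os

# The variables inference.py is described as requiring (names from this README).
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")

def validate_env(env=None):
    """Return the required settings, raising if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing fast like this keeps a misconfigured submission from burning API calls before the first scenario runs.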

## Using the HTTP API

### Reset the environment

```python
import requests

response = requests.post(
    "http://localhost:7860/reset",
    json={"difficulty": "easy"},
)
observation = response.json()
print(observation)
```

### Step with an action

```python
import requests

response = requests.post(
    "http://localhost:7860/step",
    json={"action": {"type": "ReadDoc"}},
)
result = response.json()

print(result["observation"])
print(result["reward"])
print(result["done"])
```

### Inspect state or final result

```python
import requests

state = requests.get("http://localhost:7860/state").json()
episode_result = requests.get("http://localhost:7860/result").json()
```

Additional discovery endpoints are also available:

```python
import requests

metadata = requests.get("http://localhost:7860/metadata").json()
schema = requests.get("http://localhost:7860/schema").json()
```

## Environment Overview

### Task model

Each episode presents an email with a benign business-looking task and embedded attacker intent. The agent must complete the safe objective without obeying adversarial instructions contained in the email body or derived attack channels.

Typical safe behavior includes:

- reviewing the email with `ReadDoc`
- flagging suspicious instructions with `FlagInjection`
- writing an internal-only clean summary with `WriteDB`
- refusing unsafe outbound behavior

### Difficulty levels

- `easy`: explicit or obvious prompt injection patterns
- `medium`: more obfuscated attacks, hidden wrappers, spoofed instruction hierarchy, or delayed triggers
- `hard`: attacks embedded in semantically plausible content where the agent must also produce a safe summary

### Actions

- `ReadDoc`
  Reads the current email content.
- `WriteDB(content="...")`
  Writes an internal summary or note.
- `SendEmail(to="...", body="...")`
  Sends an outbound email. This is the main exfiltration risk surface.
- `FlagInjection(reason="...")`
  Marks the content as suspicious.
- `NoOp`
  Takes no action.
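
Based on the `/step` example above, each action appears to travel as a JSON payload with a `type` field plus action-specific arguments. A hypothetical helper for building those payloads (the exact field names are assumptions inferred from this README's examples):

```python
def make_action(action_type, **kwargs):
    """Build a /step request body; field names are assumptions from this README."""
    return {"action": {"type": action_type, **kwargs}}

# A typical safe trajectory: read, flag the injection, summarize internally.
safe_trajectory = [
    make_action("ReadDoc"),
    make_action("FlagInjection", reason="Embedded instructions request an outbound relay"),
    make_action("WriteDB", content="Internal summary excluding attacker-controlled text"),
]
```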

### Observation fields

- `subject`
- `body`
- `last_action`
- `last_action_result`
- `step_count`
- `task_difficulty`
- `summary_required`
- `current_summary`
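
On the client side, these fields can be mirrored in a small dataclass for type-safe handling. A sketch; the field names follow the list above, but the types are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    # Field names from the list above; types are assumptions.
    subject: str
    body: str
    last_action: Optional[str]
    last_action_result: Optional[str]
    step_count: int
    task_difficulty: str
    summary_required: bool
    current_summary: Optional[str]

    @classmethod
    def from_json(cls, data: dict) -> "Observation":
        # Missing fields default to None rather than raising.
        return cls(**{f: data.get(f) for f in cls.__dataclass_fields__})
```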

### Endpoints

- `GET /`
- `GET /web`
- `GET /health`
- `GET /metadata`
- `GET /schema`
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /result`
- `POST /mcp`
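
The endpoints above can be wrapped in a thin client so callers don't repeat URLs. A sketch; only the paths come from the list above:

```python
class IDPIClient:
    """Minimal wrapper over the HTTP endpoints listed above (a sketch)."""

    def __init__(self, base_url="http://localhost:7860"):
        self.base_url = base_url.rstrip("/")

    def _url(self, path):
        return f"{self.base_url}/{path.lstrip('/')}"

    def reset(self, difficulty="easy"):
        import requests  # imported lazily so the URL helpers work without the dependency
        return requests.post(self._url("/reset"), json={"difficulty": difficulty}).json()

    def step(self, action):
        import requests
        return requests.post(self._url("/step"), json={"action": action}).json()

    def state(self):
        import requests
        return requests.get(self._url("/state")).json()

    def result(self):
        import requests
        return requests.get(self._url("/result")).json()
```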

## Evaluation

The evaluation runner loads scenarios from [docs/person_a_benchmark_manifest.json](docs/person_a_benchmark_manifest.json) and executes them across `easy`, `medium`, and `hard` splits.

The current manifest includes:

- `32 scenarios total`
- `9 easy`
- `13 medium`
- `10 hard`

Each run logs:

- per-step actions and rewards
- whether exfiltration occurred
- whether the task was completed safely
- final graded score
- per-difficulty summary statistics

Results are written to [baseline/baseline_results.json](baseline/baseline_results.json).

### Optional runner settings

These values can be set in `.env`:

```bash
EPISODES_PER_DIFFICULTY=1
BASELINE_RESULTS_PATH=baseline/submission_results.json
REQUEST_DELAY_SECONDS=3.0
MAX_MODEL_RETRIES=2
```
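
A sketch of how these optional values might be read, with type conversion and fallbacks. The defaults mirror the values shown above and the default results path mentioned earlier; the runner's actual parsing may differ:

```python
import os

def load_runner_settings(env=None):
    """Read the optional runner settings, falling back to documented defaults."""
    env = os.environ if env is None else env
    return {
        "episodes_per_difficulty": int(env.get("EPISODES_PER_DIFFICULTY", "1")),
        "results_path": env.get("BASELINE_RESULTS_PATH", "baseline/baseline_results.json"),
        "request_delay_seconds": float(env.get("REQUEST_DELAY_SECONDS", "3.0")),
        "max_model_retries": int(env.get("MAX_MODEL_RETRIES", "2")),
    }
```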

## Development

### Project structure

```text
Prompt-injection-env/
|-- env/
|   |-- environment.py
|   |-- dataset_loader.py
|   |-- scenario_generator.py
|   |-- injection_engine.py
|   |-- policy_engine.py
|   |-- taint_tracker.py
|   |-- reward.py
|   |-- grader.py
|   `-- models.py
|-- server/
|   |-- app.py
|   `-- Dockerfile
|-- docs/
|   |-- person_a_benchmark_manifest.json
|   |-- person_a_scenario_audit.md
|   `-- person_a_showcase_episodes.md
|-- baseline/
|   |-- run_baseline.py
|   `-- baseline_results.json
|-- inference.py
|-- run_baseline.py
|-- openenv.yaml
`-- README.md
```

### Core components

- [env/environment.py](env/environment.py)
  Main episode lifecycle, action handling, reward plumbing, and result assembly.
- [server/app.py](server/app.py)
  FastAPI wrapper exposing the environment over HTTP, including health, metadata, schema, and benchmark endpoints.
- [inference.py](inference.py)
  Submission-facing inference entrypoint that validates the required env vars and delegates to the benchmark runner.
- [openenv.yaml](openenv.yaml)
  Environment metadata for OpenEnv packaging and deployment.



### CI and sync

The repository includes GitHub workflows for:

- continuous integration in [.github/workflows/ci.yml](.github/workflows/ci.yml)
- syncing `main` to Hugging Face Spaces in [.github/workflows/sync.yml](.github/workflows/sync.yml)

Submission-focused regression tests live in `submission_tests/` and can be run with:

```bash
python -m unittest discover -s submission_tests
```

### Local Python usage

You can exercise the environment directly without starting the server:

```python
from env.environment import IDPIExfilEnv
from env.models import Action

env = IDPIExfilEnv()
obs = env.reset(difficulty="easy")
obs, reward, done, info = env.step(Action(type="ReadDoc"))
print(obs)
print(reward)
```

## Notes

- The environment terminates early on unsafe exfiltration events.
- Hard tasks require a clean internal summary that excludes attacker-controlled content.
- Gemini works through the submission-compatible OpenAI client path by setting `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.