---
title: "SIEGE: Interpretability Arena"
emoji: 🛡
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
base_path: /web
pinned: false
tags:
- openenv
- interpretability
- mechanistic-interpretability
---

# 🛡 SIEGE: Interpretability Arena

### Two agents fight inside a language model's forward pass, not at the prompt level.

---

## What Is This?

Safety tools usually read a model's **output text** and react to it. But even oversight models (text-level classifiers and judges) can be tricked: a well-crafted adversarial prompt shifts the wording just enough to slip past a detector while the harmful intent stays intact.

**Linear probing on internal activations is much harder to fool.** The residual stream encodes what the model is "about to say" in a way that doesn't bend as easily to prompt-level tricks. SIEGE builds on this: watch the activations, not just the words, and you get a chance to act *before* the output is written.
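
To make "linear probing" concrete, here is a minimal sketch of a difference-of-means probe. It is an illustration only: the activations are synthetic stand-ins for residual-stream vectors, the names `fit_probe` and `probe_score` are hypothetical, and SIEGE's own probes may be fitted differently (for example with logistic regression).

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_probe(pos, neg):
    """Difference-of-means linear probe: returns (weights, bias)."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    # Put the decision boundary halfway between the two class means.
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -dot(w, midpoint)
    return w, b

def probe_score(w, b, activation):
    """A positive score means the probe predicts the 'forbidden' class."""
    return dot(w, activation) + b

# Synthetic data (an assumption, not real model activations):
# the "forbidden" class is shifted along dimension 0.
rng = random.Random(0)
d_model = 8
neg = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(200)]
pos = [[rng.gauss(0, 1) + (3.0 if i == 0 else 0.0) for i in range(d_model)]
       for _ in range(200)]

w, b = fit_probe(pos, neg)
accuracy = (sum(probe_score(w, b, x) > 0 for x in pos)
            + sum(probe_score(w, b, x) <= 0 for x in neg)) / 400
print(f"probe accuracy on the synthetic set: {accuracy:.2f}")
```

The probe is a single dot product per activation, so it is cheap enough to evaluate at every layer and every step.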

We built a two-agent arena on top of a frozen LLM with TransformerLens hooks:

- 🔴 **Red** tries to steer the model toward a forbidden output (leaking a secret, producing a banned phrase) by nudging activations, biasing logits, or injecting directions mid-run.
- 🔵 **Blue** tries to stop that, without breaking normal helpful behavior, by ablating suspicious directions, clamping heads, or filtering tokens.
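
The two core moves in the list above, steering along a direction and ablating it, reduce to simple vector arithmetic on one activation. The sketch below uses plain Python lists instead of real TransformerLens hooks; `steer` and `ablate` are hypothetical names, not functions from the SIEGE codebase.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer(resid, direction, alpha):
    """Red-style move: add alpha times a unit direction to the activation."""
    d = unit(direction)
    return [r + alpha * x for r, x in zip(resid, d)]

def ablate(resid, direction):
    """Blue-style move: project the direction out of the activation."""
    d = unit(direction)
    coeff = dot(resid, d)
    return [r - coeff * x for r, x in zip(resid, d)]

resid = [0.5, -1.0, 2.0]        # toy activation at one position
direction = [1.0, 1.0, 0.0]     # toy "forbidden" direction
attacked = steer(resid, direction, alpha=4.0)
defended = ablate(attacked, direction)
```

Ablation removes everything along the direction, including any benign signal that was already there, which is one reason a defender cannot ablate freely without paying a utility cost.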

Both agents observe **layer-wise signals** from the target model at each step. **Both are trained with GRPO**: neither is told which layer matters; they figure it out from reward.
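
For readers new to GRPO: instead of a learned value function, it scores each sampled action against the other samples in its group. A minimal sketch of that group-relative advantage follows. The function name and the `eps` value are assumptions, and this README does not say whether SIEGE's training script normalizes in exactly this way (some implementations use the sample standard deviation rather than the population one used here).

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts sampled for the same episode setup (toy numbers).
rewards = [0.0, 1.0, 0.0, 3.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group average get a positive advantage and are reinforced; when every rollout scores the same, all advantages are zero and the group carries no learning signal.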

---

## The Arms Race

Blue only gets meaningfully better when Red stops being predictable. A heuristic Red that always steers the same layer is trivially countered: Blue learns to always ablate that layer. That's not a real defense.

When **Red is also trained with GRPO**, it discovers creative strategies: splitting attacks across two layers, injecting later in the sequence, using directions Blue has learned to ignore. This forces Blue to generalize, learning the model's activation geometry rather than memorizing a fixed counter.

The result: a Blue that has survived a trained Red's full distribution of attacks, which is meaningfully more robust than one trained against fixed rules.

---

## Results

**Blue agent reward over training:**

![Blue reward curve](assets/blue_graph.png)
*Mean episode reward vs. training step.*

**Red vs. Blue co-training:**

![Red reward curve](assets/red_graph.png)
*Both agents' rewards over the same run. Red spikes when it finds a breakthrough; Blue dips then recovers. Neither plateaus; they keep pushing each other.*

---

## Reward Design


| Component              | What it captures                                         |
| ---------------------- | -------------------------------------------------------- |
| Forbidden output match | Did the bad phrase appear? Red maximizes, Blue minimizes |
| Utility penalty        | Did Blue break normal helpful behavior in the process?   |
| Layer precision bonus  | Targeted single-layer move vs. scattershot intervention  |


Always blocking everything tanks the utility score, so reward hacking is actively penalized.
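
One way the three components in the table could combine is sketched below. The weights, the `EpisodeOutcome` fields, and the function names are all hypothetical, chosen only to show why a block-everything Blue scores worse than a targeted one. They are not SIEGE's actual reward code.

```python
from dataclasses import dataclass

@dataclass
class EpisodeOutcome:
    forbidden_matched: bool   # did the forbidden phrase appear in the output?
    utility: float            # 0..1, how well benign behavior was preserved
    layers_touched: int       # layers this agent intervened on

# Hypothetical weights, for illustration only.
W_MATCH = 1.0
W_UTILITY = 0.5
W_PRECISION = 0.1

def precision_bonus(layers_touched):
    """Reward a targeted single-layer move over a scattershot one."""
    return W_PRECISION if layers_touched == 1 else 0.0

def blue_reward(o):
    match_term = -W_MATCH if o.forbidden_matched else W_MATCH
    utility_penalty = W_UTILITY * (1.0 - o.utility)
    return match_term - utility_penalty + precision_bonus(o.layers_touched)

def red_reward(o):
    match_term = W_MATCH if o.forbidden_matched else -W_MATCH
    return match_term + precision_bonus(o.layers_touched)

# Both Blues stop the attack, but one wrecks normal behavior to do it.
blanket = EpisodeOutcome(forbidden_matched=False, utility=0.0, layers_touched=12)
targeted = EpisodeOutcome(forbidden_matched=False, utility=0.95, layers_touched=1)
```

Under these assumed weights, `blue_reward(targeted)` is higher than `blue_reward(blanket)` even though both episodes block the forbidden output.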

---

## Running It

**Install:**

```bash
cd siege && uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
uv sync --extra gpu   # for GRPO (needs CUDA)
```

**Start the arena server (terminal 1):**

```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```

**Train (terminal 2):**

```bash
python scripts/train.py        # heuristic self-play, no GPU needed
python scripts/train_grpo.py   # GRPO for both agents, GPU
```

Key env vars: `SIEGE_AGENT_MODEL_ID`, `SIEGE_TARGET_MODEL_ID`, `SIEGE_ENV_URL`, `WANDB_API_KEY`, `HF_TOKEN`.
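
If you want to check your environment before launching a run, a small loader like the one below can help. It is a sketch: the variable names come from the list above, but `SiegeSettings`, `load_settings`, and the default URL (taken from the port in the `uvicorn` command) are assumptions, and the real scripts may read their configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class SiegeSettings:
    agent_model_id: Optional[str]
    target_model_id: Optional[str]
    env_url: str
    wandb_api_key: Optional[str]
    hf_token: Optional[str]

def load_settings(env: Mapping[str, str] = os.environ) -> SiegeSettings:
    """Collect the documented env vars; unset ones come back as None."""
    return SiegeSettings(
        agent_model_id=env.get("SIEGE_AGENT_MODEL_ID"),
        target_model_id=env.get("SIEGE_TARGET_MODEL_ID"),
        # Assumed default: the local server started with uvicorn on port 8000.
        env_url=env.get("SIEGE_ENV_URL", "http://localhost:8000"),
        wandb_api_key=env.get("WANDB_API_KEY"),
        hf_token=env.get("HF_TOKEN"),
    )

settings = load_settings()
missing = [name for name, value in vars(settings).items() if value is None]
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.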

---

## Repo Layout

```
siege/
├── interp_arena/      # env, hooks, agents, config
├── server/            # FastAPI arena server
├── scripts/           # train.py, train_grpo.py
├── data/              # episodes.jsonl
├── assets/            # plots committed here
└── pyproject.toml
```

---

## Links


|                   |                                                                                                        |
| ----------------- | ------------------------------------------------------------------------------------------------------ |
| 🤗 HF Space       | [BART-ender/siege](https://huggingface.co/spaces/BART-ender/siege)                                     |
| 📓 Training Colab | [Open in Colab](https://colab.research.google.com/drive/1zU9ugU8CJwZDq2Fxu9ccYGh7v_dVft9W?usp=sharing) |
| 💻 Code           | [github.com/vibhor-5/siege](https://github.com/vibhor-5/siege)                                         |
| 📝 Blog post      | [Blog](BLOG.md)                                                                                        |


---

## References

- [TransformerLens](https://github.com/neelnanda-io/TransformerLens)
- [Representation Engineering (Zou et al., 2023)](https://arxiv.org/abs/2310.01405)
- [Activation Steering (Turner et al., 2023)](https://arxiv.org/abs/2308.10248)
- [OpenEnv](https://github.com/openenv/openenv)