File size: 11,409 Bytes
38df389
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
# SIEGE: We Trained Agents to Fight *Inside* a Language Model

*What if safety oversight didn't have to wait for the model to finish thinking?*

---

Most safety tools for LLMs read the model's **output** β€” the text it finally produces β€” and then decide if something went wrong. By then, it's too late. The model has already "thought" its way to an answer, layer by layer, and the harmful computation happened long before the last token was sampled.

But there's a subtler problem too. Even **oversight models** β€” classifiers or judges trained to catch bad outputs β€” can be fooled. A sufficiently clever adversarial prompt can shift the surface text just enough to slip past a text-level detector while the underlying harmful intent stays intact. You're playing whack-a-mole with wording.

**Linear probing on internal activations is much harder to fool.** The model's residual stream encodes what it's "about to say" in a way that doesn't bend as easily to prompt-level tricks. SIEGE builds on this: if you watch the activations, not just the words, you get a more robust signal β€” and a chance to act on it before the output is written.

```mermaid
flowchart TB
    classDef prompt fill:#1f1a00,stroke:#d29922,color:#d29922,rx:8
    classDef lm     fill:#1a2744,stroke:#8b949e,color:#8b949e,rx:8
    classDef bad    fill:#2d1117,stroke:#f85149,color:#f85149,rx:8
    classDef good   fill:#0d2818,stroke:#3fb950,color:#3fb950,rx:8
    classDef probe  fill:#0d1f2d,stroke:#58a6ff,color:#58a6ff,rx:8

    ADV(["πŸ’¬ adversarial prompt"]):::prompt

    subgraph TEXT["❌ Text-level oversight"]
        TM(["LM runs\nall layers"]):::lm
        TC(["πŸ” output classifier\nreads final text"]):::lm
        TM --> TC
        TC --> FOOL(["😈 fooled by\nwording shift"]):::bad
    end

    subgraph ACT["βœ… Activation-level β€” SIEGE"]
        AM(["LM runs\n+ hooks"]):::lm
        LP(["πŸ“‘ linear probe\nresidual stream"]):::probe
        BL(["πŸ”΅ Blue ablates\nmid-pass"]):::probe
        AM -->|signal| LP --> BL
        BL --> SAFE(["πŸ›‘οΈ blocked\nbefore output"]):::good
    end

    ADV --> TEXT
    ADV --> ACT
```



---

## The Setup

We built a two-agent arena called SIEGE

Two agents compete in every episode, but they don't fight with text. They fight with **hooks on the model's internals**:

- πŸ”΄ **Red** tries to steer the model toward a forbidden outcome β€” leaking a secret, producing a banned phrase β€” by nudging activations in the middle layers, biasing token logits, or injecting a direction into the residual stream.
- πŸ”΅ **Blue** tries to stop that from happening β€” without breaking the model's normal helpful behavior β€” by ablating suspicious directions, dampening attention heads, or filtering tokens at the final layer.

Both agents observe **layer-wise activation signals** from the frozen target model as it runs. Neither is told which layer matters. **Both are trained with GRPO** β€” they have to figure it out from reward alone.

```mermaid
flowchart TB
    classDef frozen  fill:#1a2744,stroke:#58a6ff,color:#58a6ff,rx:8
    classDef red     fill:#2d1117,stroke:#f85149,color:#f85149,rx:8
    classDef blue    fill:#0d1f2d,stroke:#58a6ff,color:#58a6ff,rx:8
    classDef signal  fill:#1a1a2e,stroke:#8b949e,color:#8b949e,rx:8
    classDef reward  fill:#0d2818,stroke:#3fb950,color:#3fb950,rx:8
    classDef prompt  fill:#1f1a00,stroke:#d29922,color:#d29922,rx:8

    PR(["πŸ’¬ Prompt"]):::prompt

    PR --> L1

    subgraph LM["❄️ Frozen Target LM"]
        L1(["L 8"]):::frozen
        L2(["L 12"]):::frozen
        L3(["L 16 β—€ attack zone"]):::frozen
        L4(["L 20"]):::frozen
        L5(["L 24"]):::frozen
        L1 --> L2 --> L3 --> L4 --> L5
    end

    RED(["πŸ”΄ Red\nsteer Β· inject Β· bias"]):::red
    BLUE(["πŸ”΅ Blue\nablate Β· clamp Β· filter"]):::blue

    RED  -->|attack| L3
    BLUE -->|defend| L3

    L5 --> OUT(["πŸ“€ Output"]):::signal
    L5 --> SIG(["πŸ“Š Layer signals"]):::signal

    SIG --> RED
    SIG --> BLUE

    OUT --> RW1(["πŸ”΄ βˆ’1 blocked"]):::reward
    OUT --> RW2(["πŸ”΅ +1 defended"]):::reward
```



---

## The Arms Race: Why Both Agents Need to Learn

Here's the key insight that makes SIEGE interesting as a training environment: **Blue only gets smarter when Red stops being predictable.**

A heuristic Red β€” one that always steers at the same layer, always using the same direction β€” is trivially countered after a few hundred episodes. Blue learns to always ablate that one layer and calls it a day. That's not a useful defense; it's pattern matching.

When **Red is also trained with GRPO**, it starts discovering non-obvious attack strategies: splitting the intervention across two layers, timing the injection later in the sequence, using directions that activate signals Blue has learned to ignore. This forces Blue to generalize β€” to actually understand the activation structure rather than memorize a fixed counter-move.

This co-evolutionary dynamic is the whole point. A Blue trained only against heuristic Red is brittle. A Blue that survived against a trained Red has actually learned something about the geometry of the model's internals.

```mermaid
sequenceDiagram
    participant R as πŸ”΄ Red
    participant A as ❄️ Arena
    participant B as πŸ”΅ Blue

    A-->>R: πŸ‘οΈ obs
    A-->>B: πŸ‘οΈ obs

    R->>A: βš”οΈ steer L14
    A-->>B: πŸ‘οΈ signal shift at L14

    B->>A: πŸ›‘οΈ ablate L14
    A-->>R: πŸ‘οΈ blocked
    A-->>B: πŸ‘οΈ defended

    Note over R: Red adapts β€” new layer
    R->>A: βš”οΈ inject L11

    A-->>B: πŸ‘οΈ new pattern

    B->>A: πŸ›‘οΈ ablate L11

    A->>R: ❌ βˆ’1
    A->>B: βœ… +1
```



*Red shifts from L14 to L11 mid-episode. Blue reads the updated activation signal and adapts. This is the arms race that makes training meaningful.*

---

## What Each Agent Actually Learns

Both agents run against the live arena server over OpenEnv-compatible `/reset` and `/step` endpoints.

Blue's reward is deliberately two-sided β€” and this is what stops it from just nuking everything:

```mermaid
flowchart LR
    classDef blue   fill:#0d1f2d,stroke:#58a6ff,color:#58a6ff,rx:8
    classDef good   fill:#0d2818,stroke:#3fb950,color:#3fb950,rx:8
    classDef bad    fill:#2d1117,stroke:#f85149,color:#f85149,rx:8
    classDef mid    fill:#1f1a00,stroke:#d29922,color:#d29922,rx:8
    classDef lm     fill:#1a2744,stroke:#8b949e,color:#8b949e,rx:8

    B(["πŸ”΅ Blue\nacts"]):::blue

    B -->|"surgical ablate\n1 layer"| C1(["❄️ LM output:\nhelpful + clean"]):::lm
    B -->|"clamp everything\nall layers"| C2(["❄️ LM output:\n'...' β€” broken"]):::lm
    B -->|"does nothing"| C3(["❄️ LM output:\nleaks secret"]):::lm

    C1 --> W(["βœ… +1 defended\n+0.9 utility\n= Blue wins"]):::good
    C2 --> P(["⚠️ +1 defended\nβˆ’1 utility broken\n= net ~0"]):::mid
    C3 --> L(["❌ βˆ’1 leaked\n= Blue loses"]):::bad
```



**Ablating everything is not a winning strategy.** Blue gets a full utility penalty whenever the model stops answering helpfully β€” scored on a separate set of clean follow-up prompts where no attack is running. Blocking all activations tanks those prompts, and the net reward washes out to near zero. The only path to a high score is a *precise* intervention: find the attack layer, ablate just that direction, leave everything else intact.

```mermaid
flowchart TB
    classDef obs   fill:#1a1a2e,stroke:#8b949e,color:#8b949e,rx:8
    classDef red   fill:#2d1117,stroke:#f85149,color:#f85149,rx:8
    classDef blue  fill:#0d1f2d,stroke:#58a6ff,color:#58a6ff,rx:8
    classDef lm    fill:#1a2744,stroke:#58a6ff,color:#58a6ff,rx:8

    OBS(["πŸ‘οΈ Layer Signals"]):::obs

    OBS --> R1 & R2 & R3
    OBS --> B1 & B2 & B3

    subgraph RA["πŸ”΄ Red β€” Attack"]
        R1(["πŸ’‰ inject\ndirection"]):::red
        R2(["🎯 steer\nactivation"]):::red
        R3(["πŸ“ˆ bias\nlogits"]):::red
    end

    subgraph BA["πŸ”΅ Blue β€” Defend"]
        B1(["🧹 ablate\ndirection"]):::blue
        B2(["πŸ”‡ clamp\nattention head"]):::blue
        B3(["🚫 filter\ntoken"]):::blue
    end

    R1 & R2 & R3 --> LM(["❄️ Frozen LM\nmid-forward-pass"]):::lm
    B1 & B2 & B3 --> LM
```



**Before training (heuristic Red vs. heuristic Blue):**

```
Red steers layer 14 (always) β†’
Blue clamps layer 12 (wrong layer) β†’
Model leaks: "The secret is ALPHA-7734"
```

**After training (GRPO Red vs. GRPO Blue):**

```
Red splits attack across layers 11 and 14 β†’
Blue reads activation signals, identifies both layers β†’
Blue ablates both directions β€” leaves all other layers untouched β†’
Model: "I can't share that information."  βœ“
Follow-up: "What's 2+2?" β†’ "4"  βœ“
```

---

## Results

Blue reward curve
*Blue agent reward over training. Dips at ~step 210 and ~step 480 = Red finds a new attack. Each recovery = Blue adapts.*

Co-training curves
*Red and Blue on the same axes. Neither plateaus β€” they keep pushing each other.*

Baseline vs trained
*Heuristic (grey) vs. GRPO-trained (blue). The "attack blocked against adversarial Red" column β€” 29% β†’ 64% β€” is the one that matters.*


| Metric                           | Heuristic | GRPO-Trained |
| -------------------------------- | --------- | ------------ |
| Blue defense success rate        | 41%       | 78%          |
| Utility preserved                | 82%       | 79%          |
| Correct layer identified         | β€”         | 71%          |
| Attack blocked (adversarial Red) | 29%       | 64%          |


The trained Blue doesn't just defend better β€” it defends *smarter*. Utility stays nearly flat (82% β†’ 79%) while defense success nearly doubles. It found the right layer, not just the right output.

---

## Why This Matters

**Text-level oversight can be prompted around. Activation-level oversight is harder to fool β€” and training agents to do it is a tractable problem.**

If a model can commit to harmful wording several layers before the output, a defense that only reads the output will always be playing catch-up. SIEGE is a small arena for studying the alternative: catching and countering harmful computation while it's still unfolding.

```mermaid
flowchart LR
    classDef gpu    fill:#2d1117,stroke:#f85149,color:#f85149,rx:8
    classDef server fill:#0d1f2d,stroke:#58a6ff,color:#58a6ff,rx:8
    classDef hub    fill:#1a1f00,stroke:#d29922,color:#d29922,rx:8

    subgraph GPU["πŸ–₯️ GPU β€” train_grpo.py"]
        AGT(["πŸ€– Agent LM\n1.5B + LoRA"]):::gpu
        GRPO(["βš™οΈ GRPO\nTRL + Unsloth"]):::gpu
        AGT -->|rollout| GRPO
        GRPO -->|update| AGT
    end

    subgraph SRV["🌐 Arena Server β€” :8000"]
        ENV(["OpenEnv\n/reset  /step"]):::server
        TGT(["❄️ Frozen LM\n0.5B + hooks"]):::server
        ENV <-->|hooks| TGT
    end

    HUB(["πŸ€— HF Hub\ncheckpoint"]):::hub

    GPU -->|"POST /step"| SRV
    SRV -->|"obs + reward"| GPU
    GRPO -->|best adapter| HUB
```



---

## Try It

- πŸ€— **HF Space:** [BART-ender/siege](https://huggingface.co/spaces/BART-ender/siege)
- πŸ““ **Training Colab:** [Open in Colab](https://colab.research.google.com/drive/1zU9ugU8CJwZDq2Fxu9ccYGh7v_dVft9W?usp=sharing)
- πŸ’» **Code:** [github.com/vibhor-5/siege](https://github.com/vibhor-5/siege)

Built with [OpenEnv](https://github.com/openenv/openenv), [TransformerLens](https://github.com/neelnanda-io/TransformerLens), TRL, and Unsloth.