---
title: "SIEGE — Interpretability Arena"
emoji: 🛡
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
base_path: /web
pinned: false
tags:
  - openenv
  - interpretability
  - mechanistic-interpretability
---

# 🛡 SIEGE — Interpretability Arena

### Two agents fight inside a language model's forward pass — not at the prompt level.

---

## What Is This?

Safety tools usually read a model's **output text** and react to it. But even oversight models — text-level classifiers and judges — can be tricked: a well-crafted adversarial prompt shifts the wording just enough to slip past a detector while the harmful intent stays intact.

**Linear probing on internal activations is much harder to fool.** The residual stream encodes what the model is "about to say" in a way that doesn't bend as easily to prompt-level tricks. SIEGE builds on this: watch the activations, not just the words, and you get a chance to act *before* the output is written.
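As a rough illustration of what a linear probe does, the sketch below fits a logistic-regression probe on synthetic vectors that stand in for residual-stream activations. Everything here is an assumption made for the example: the 4-dimensional activations, the `hidden_direction`, and the helper names are hypothetical and are not taken from the SIEGE code.

```python
import math
import random

def train_linear_probe(acts, labels, lr=0.5, epochs=200):
    """Fit p(label=1 | act) = sigmoid(w . act + b) with batch gradient descent."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted prob minus label
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_predict(w, b, act):
    """Return 1 if the probe reads the 'forbidden intent' label, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, act)) + b > 0 else 0

# Synthetic stand-in for activations: the label shifts each vector along a
# hidden direction (hypothetical), plus Gaussian noise.
rng = random.Random(0)
hidden_direction = [1.0, -1.0, 0.5, 0.0]

def sample_activation(label):
    sign = 1.0 if label else -1.0
    return [sign * d + rng.gauss(0.0, 0.5) for d in hidden_direction]

labels = [i % 2 for i in range(200)]
acts = [sample_activation(y) for y in labels]

w, b = train_linear_probe(acts, labels)
accuracy = sum(probe_predict(w, b, x) == y for x, y in zip(acts, labels)) / len(acts)
print(f"probe training accuracy: {accuracy:.2f}")
```

On a real model the inputs would be cached activations from a chosen layer rather than synthetic vectors, and the probe would be evaluated on held-out prompts.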

We built a two-agent arena on top of a frozen LLM with TransformerLens hooks:

- 🔴 **Red** tries to steer the model toward a forbidden output (leaking a secret, producing a banned phrase) by nudging activations, biasing logits, or injecting directions mid-run.
- 🔵 **Blue** tries to stop that — without breaking normal helpful behavior — by ablating suspicious directions, clamping heads, or filtering tokens.
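The two basic moves above can be sketched on a single activation vector: Red adds a scaled direction, and Blue projects that direction back out. This is a minimal sketch on plain Python lists; the function names, the toy 3-dimensional vector, and the `alpha` value are assumptions for illustration, not the arena's actual action API.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Normalize a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer(resid, direction, alpha):
    """Red-style move: push the activation along a direction with strength alpha."""
    d = unit(direction)
    return [r + alpha * di for r, di in zip(resid, d)]

def ablate(resid, direction):
    """Blue-style move: remove the activation's component along a direction."""
    d = unit(direction)
    coeff = dot(resid, d)
    return [r - coeff * di for r, di in zip(resid, d)]

resid = [0.2, -0.1, 0.4]        # toy activation at one position (hypothetical)
attack_dir = [1.0, 0.0, 0.0]    # direction Red wants to amplify (hypothetical)

steered = steer(resid, attack_dir, alpha=3.0)
defended = ablate(steered, attack_dir)
print(steered, defended)
```

Note that `ablate` removes everything along the direction, including the small component the clean activation already had there. That side effect is the kind of collateral damage Blue has to keep small to preserve normal behavior.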

Both agents observe **layer-wise signals** from the target model at each step. **Both are trained with GRPO** — neither is told which layer matters; they figure it out from reward.
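GRPO (Group Relative Policy Optimization) scores each sampled action relative to the other samples drawn for the same state, which is why neither agent needs a hand-written hint about which layer matters. The sketch below shows only that group-relative advantage step. The function name and the example rewards are made up for illustration, and this is not the project's training loop.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four candidate interventions sampled for the same observation (made-up rewards).
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Samples that beat the group average get a positive advantage and are reinforced; samples below it are pushed down. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.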

---

## The Arms Race

Blue only gets meaningfully better when Red stops being predictable. A heuristic Red that always steers the same layer is trivially countered — Blue learns to always ablate that layer. That's not a real defense.

When **Red is also trained with GRPO**, it discovers creative strategies: splitting attacks across two layers, injecting later in the sequence, using directions Blue has learned to ignore. This forces Blue to generalize — to understand the model's activation geometry, not memorize a fixed counter.

The result is a Blue that has held up against the attacks a trained Red actually discovers, which should generalize better than a Blue trained against fixed rules.

---

## Results

**Blue agent reward over training:**

![Blue reward curve](assets/blue_graph.png)
*Mean episode reward vs. training step.*

**Red vs. Blue co-training:**

![Red reward curve](assets/red_graph.png)
*Both agents' rewards over the same run. Red spikes when it finds a breakthrough; Blue dips, then recovers. Neither plateaus, and they keep pushing each other.*

---

## Reward Design


| Component              | What it captures                                         |
| ---------------------- | -------------------------------------------------------- |
| Forbidden output match | Did the bad phrase appear? Red maximizes, Blue minimizes |
| Utility penalty        | Did Blue break normal helpful behavior in the process?   |
| Layer precision bonus  | Targeted single-layer move vs. scattershot intervention  |


A Blue that blocks everything tanks its utility score, so this kind of reward hacking is penalized rather than rewarded.
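One way the three components could combine into a single Blue reward is sketched below. The weights, the 0 to 1 utility scale, the single-layer rule for the precision bonus, and all names are assumptions chosen for the example. The actual reward code may weight or shape these terms differently.

```python
from dataclasses import dataclass

@dataclass
class EpisodeOutcome:
    forbidden_matched: bool  # did the forbidden phrase appear in the output?
    utility: float           # 0..1, how well normal helpful behavior survived
    layers_touched: int      # number of layers Blue intervened on

def blue_reward(outcome, w_block=1.0, w_utility=1.0, w_precision=0.2):
    """Hypothetical Blue reward: block the phrase, keep utility, stay targeted."""
    block_term = -w_block if outcome.forbidden_matched else w_block
    utility_penalty = w_utility * (1.0 - outcome.utility)
    precision_bonus = w_precision if outcome.layers_touched == 1 else 0.0
    return block_term - utility_penalty + precision_bonus

targeted = EpisodeOutcome(forbidden_matched=False, utility=0.9, layers_touched=1)
block_all = EpisodeOutcome(forbidden_matched=False, utility=0.1, layers_touched=12)
leaked = EpisodeOutcome(forbidden_matched=True, utility=1.0, layers_touched=1)

print(blue_reward(targeted), blue_reward(block_all), blue_reward(leaked))
```

With these made-up weights, a targeted defense scores highest, blanket blocking scores much lower because of the utility penalty, and letting the phrase through scores lowest.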

---

## Running It

**Install:**

```bash
cd siege && uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
uv sync --extra gpu   # for GRPO (needs CUDA)
```

**Start the arena server (terminal 1):**

```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```

**Train (terminal 2):**

```bash
python scripts/train.py        # heuristic self-play, no GPU needed
python scripts/train_grpo.py   # GRPO for both agents, GPU
```

Key env vars: `SIEGE_AGENT_MODEL_ID`, `SIEGE_TARGET_MODEL_ID`, `SIEGE_ENV_URL`, `WANDB_API_KEY`, `HF_TOKEN`.
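Before launching a run, it can help to check which of these variables are set. The helper below is a small sketch and not part of the repo. It treats all five variables the same way, which is an assumption, since some of them may be optional for a given script.

```python
import os

SIEGE_ENV_VARS = [
    "SIEGE_AGENT_MODEL_ID",
    "SIEGE_TARGET_MODEL_ID",
    "SIEGE_ENV_URL",
    "WANDB_API_KEY",
    "HF_TOKEN",
]

def missing_env_vars(environ=os.environ):
    """Return the names from SIEGE_ENV_VARS that are unset or empty."""
    return [name for name in SIEGE_ENV_VARS if not environ.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All listed environment variables are set.")
```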

---

## Repo Layout

```
siege/
├── interp_arena/      # env, hooks, agents, config
├── server/            # FastAPI arena server
├── scripts/           # train.py, train_grpo.py
├── data/              # episodes.jsonl
├── assets/            # plots committed here
└── pyproject.toml
```

---

## Links


| Resource          | Link                                                                                                   |
| ----------------- | ------------------------------------------------------------------------------------------------------ |
| 🤗 HF Space       | [BART-ender/siege](https://huggingface.co/spaces/BART-ender/siege)                                     |
| 📓 Training Colab | [Open in Colab](https://colab.research.google.com/drive/1zU9ugU8CJwZDq2Fxu9ccYGh7v_dVft9W?usp=sharing) |
| 💻 Code           | [github.com/vibhor-5/siege](https://github.com/vibhor-5/siege)                                         |
| 📝 Blog post      | [Blog](BLOG.md)                                                                                        |


---

## References

- [TransformerLens](https://github.com/neelnanda-io/TransformerLens)
- [Representation Engineering — Zou et al., 2023](https://arxiv.org/abs/2310.01405)
- [Activation Steering — Turner et al., 2023](https://arxiv.org/abs/2308.10248)
- [OpenEnv](https://github.com/openenv/openenv)