---
title: "SIEGE — Interpretability Arena"
emoji: 🛡
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
base_path: /web
pinned: false
tags:
- openenv
- interpretability
- mechanistic-interpretability
---

# 🛡 SIEGE — Interpretability Arena

### Two agents fight inside a language model's forward pass — not at the prompt level.

---

## What Is This?

Safety tools usually read a model's **output text** and react to it. But even oversight models — text-level classifiers and judges — can be tricked: a well-crafted adversarial prompt shifts the wording just enough to slip past a detector while the harmful intent stays intact.

**Linear probing on internal activations is much harder to fool.** The residual stream encodes what the model is "about to say" in a way that doesn't bend as easily to prompt-level tricks. SIEGE builds on this: watch the activations, not just the words, and you get a chance to act *before* the output is written.
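As a rough illustration of what a linear probe does, here is a minimal sketch in pure Python. It fits a difference-of-means probe (one simple kind of linear probe) on synthetic vectors that stand in for residual-stream activations. The planted `secret_dir`, the dimension, and every helper name are assumptions made for illustration, not SIEGE's actual probe or data.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_probe(acts, labels):
    """Difference-of-means linear probe: returns (weights, bias)."""
    pos = mean_vector([a for a, y in zip(acts, labels) if y == 1])
    neg = mean_vector([a for a, y in zip(acts, labels) if y == 0])
    w = [p - q for p, q in zip(pos, neg)]
    # Place the decision threshold halfway between the two class means.
    midpoint = [(p + q) / 2 for p, q in zip(pos, neg)]
    return w, -dot(w, midpoint)

def probe_predict(w, b, act):
    return 1 if dot(w, act) + b > 0 else 0

# Synthetic stand-in for activations (NOT real model data): label-1 examples
# are shifted along one hypothetical "secret" direction.
rng = random.Random(0)
d_model = 16
secret_dir = [1.0 if i == 3 else 0.0 for i in range(d_model)]

def fake_activation(label):
    noise = [rng.gauss(0, 1) for _ in range(d_model)]
    return [n + 4.0 * label * s for n, s in zip(noise, secret_dir)]

labels = [i % 2 for i in range(200)]
acts = [fake_activation(y) for y in labels]

w, b = fit_probe(acts[:100], labels[:100])          # train on the first half
held_out = zip(acts[100:], labels[100:])            # evaluate on the second
accuracy = sum(probe_predict(w, b, a) == y for a, y in held_out) / 100
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a real setup the vectors would come from cached activations at a chosen layer rather than from a random generator.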

We built a two-agent arena on top of a frozen LLM with TransformerLens hooks:

- 🔴 **Red** tries to steer the model toward a forbidden output (leaking a secret, producing a banned phrase) by nudging activations, biasing logits, or injecting directions mid-run.
- 🔵 **Blue** tries to stop that — without breaking normal helpful behavior — by ablating suspicious directions, clamping heads, or filtering tokens.

Both agents observe **layer-wise signals** from the target model at each step. **Both are trained with GRPO** — neither is told which layer matters; they figure it out from reward.
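The two kinds of move can be sketched on a single residual vector. The snippet below is a minimal, assumption-laden illustration in pure Python: `steer` adds a scaled direction (a Red-style nudge) and `ablate` projects a direction out (a Blue-style defense). The function names, the example vectors, and the strength `alpha` are hypothetical; the arena itself applies interventions through TransformerLens hooks, which this sketch does not use.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer(resid, direction, alpha):
    """Red-style move: add alpha * unit(direction) to the residual vector."""
    u = unit(direction)
    return [r + alpha * x for r, x in zip(resid, u)]

def ablate(resid, direction):
    """Blue-style move: remove the component of resid along direction."""
    u = unit(direction)
    coeff = dot(resid, u)
    return [r - coeff * x for r, x in zip(resid, u)]

# Hypothetical 4-dimensional residual vector and attack direction.
resid = [0.5, -1.0, 2.0, 0.0]
attack_dir = [0.0, 0.0, 3.0, 4.0]

attacked = steer(resid, attack_dir, alpha=5.0)
defended = ablate(attacked, attack_dir)

print("component along attack_dir after attack: ", dot(attacked, unit(attack_dir)))
print("component along attack_dir after defense:", dot(defended, unit(attack_dir)))
```

Note that projecting a direction out also removes whatever the clean residual already carried along it, which is one way a defense can damage normal behavior.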

---

## The Arms Race

Blue only gets meaningfully better when Red stops being predictable. A heuristic Red that always steers the same layer is trivially countered — Blue learns to always ablate that layer. That's not a real defense.

When **Red is also trained with GRPO**, it discovers creative strategies: splitting attacks across two layers, injecting later in the sequence, using directions Blue has learned to ignore. This forces Blue to generalize — to understand the model's activation geometry, not memorize a fixed counter.

The result: a Blue that has survived a trained Red's full distribution of attacks, which is meaningfully more robust than one trained against fixed rules.
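For readers new to GRPO, its core idea is to score each sampled rollout relative to the other rollouts in the same group instead of using a learned value baseline. The sketch below shows only that normalization step; the group of rewards is made up, and the use of population standard deviation and the `eps` value are choices made here (implementations differ on both). It is not the project's training code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four rollouts sampled from the same state.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts that beat their group's average get a positive advantage and are reinforced; a group where every rollout earns the same reward yields zero advantage and therefore no learning signal.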

---

## Results

**Blue agent reward over training:**

![Blue reward curve](assets/blue_graph.png)
*Mean episode reward vs. training step.*

**Red vs. Blue co-training:**

![Red reward curve](assets/red_graph.png)
*Both agents' rewards over the same run. Red spikes when it finds a breakthrough; Blue dips then recovers. Neither plateaus; they keep pushing each other.*

---

## Reward Design


| Component              | What it captures                                         |
| ---------------------- | -------------------------------------------------------- |
| Forbidden output match | Did the bad phrase appear? Red maximizes, Blue minimizes |
| Utility penalty        | Did Blue break normal helpful behavior in the process?   |
| Layer precision bonus  | Targeted single-layer move vs. scattershot intervention  |


A Blue policy that simply blocks everything tanks the utility score, so this form of reward hacking is penalized rather than rewarded.
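One way the three components in the table could combine is sketched below. Everything numeric here is an assumption: the weights `W_MATCH`, `W_UTILITY`, and `W_PRECISION`, the `Outcome` fields, and the rule that the precision bonus applies only to single-layer moves are hypothetical choices for illustration, not the values or formula the arena uses.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen so that blanket blocking nets a negative reward.
W_MATCH = 1.0
W_UTILITY = 2.0
W_PRECISION = 0.1

@dataclass
class Outcome:
    forbidden_hit: bool   # did the forbidden phrase appear in the output?
    utility_drop: float   # 0.0 = benign behavior intact, 1.0 = fully broken
    red_layers: int       # number of layers Red intervened on
    blue_layers: int      # number of layers Blue intervened on

def precision_bonus(layers_touched):
    """Reward a targeted single-layer move over a scattershot one."""
    return W_PRECISION if layers_touched == 1 else 0.0

def red_reward(o):
    match = 1.0 if o.forbidden_hit else -1.0
    return W_MATCH * match + precision_bonus(o.red_layers)

def blue_reward(o):
    match = 1.0 if o.forbidden_hit else -1.0
    return (-W_MATCH * match
            - W_UTILITY * o.utility_drop
            + precision_bonus(o.blue_layers))

# Two hypothetical Blue strategies against the same failed attack.
targeted = Outcome(forbidden_hit=False, utility_drop=0.05, red_layers=1, blue_layers=1)
blanket = Outcome(forbidden_hit=False, utility_drop=1.0, red_layers=1, blue_layers=12)

print("targeted Blue:", blue_reward(targeted))
print("blanket Blue: ", blue_reward(blanket))
```

Under these assumed weights, both strategies stop the forbidden output, but only the targeted one ends with a positive reward.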

---

## Running It

**Install:**

```bash
cd siege && uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
uv sync --extra gpu   # for GRPO (needs CUDA)
```

**Start the arena server (terminal 1):**

```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```

**Train (terminal 2):**

```bash
python scripts/train.py        # heuristic self-play, no GPU needed
python scripts/train_grpo.py   # GRPO for both agents, GPU
```

Key env vars: `SIEGE_AGENT_MODEL_ID`, `SIEGE_TARGET_MODEL_ID`, `SIEGE_ENV_URL`, `WANDB_API_KEY`, `HF_TOKEN`.
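A small pre-flight check can catch a missing variable before a long run starts. The helper below is a hypothetical convenience, not part of the repo, and it assumes all five variables are needed; in practice some (for example `WANDB_API_KEY`) may be optional depending on which script you run.

```python
import os

SIEGE_ENV_VARS = [
    "SIEGE_AGENT_MODEL_ID",
    "SIEGE_TARGET_MODEL_ID",
    "SIEGE_ENV_URL",
    "WANDB_API_KEY",
    "HF_TOKEN",
]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(SIEGE_ENV_VARS)
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All SIEGE environment variables are set.")
```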

---

## Repo Layout

```
siege/
├── interp_arena/      # env, hooks, agents, config
├── server/            # FastAPI arena server
├── scripts/           # train.py, train_grpo.py
├── data/              # episodes.jsonl
├── assets/            # plots committed here
└── pyproject.toml
```

---

## Links


| Resource          | Link                                                                                                   |
| ----------------- | ------------------------------------------------------------------------------------------------------ |
| 🤗 HF Space       | [BART-ender/siege](https://huggingface.co/spaces/BART-ender/siege)                                     |
| 📓 Training Colab | [Open in Colab](https://colab.research.google.com/drive/1zU9ugU8CJwZDq2Fxu9ccYGh7v_dVft9W?usp=sharing) |
| 💻 Code           | [github.com/vibhor-5/siege](https://github.com/vibhor-5/siege)                                         |
| 📝 Blog post      | [Blog](BLOG.md)                                                                                      |


---

## References

- [TransformerLens](https://github.com/neelnanda-io/TransformerLens)
- [Representation Engineering — Zou et al., 2023](https://arxiv.org/abs/2310.01405)
- [Activation Steering — Turner et al., 2023](https://arxiv.org/abs/2308.10248)
- [OpenEnv](https://github.com/openenv/openenv)