---
title: "SIEGE: Interpretability Arena"
emoji: 🛡
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
base_path: /web
pinned: false
tags:
- openenv
- interpretability
- mechanistic-interpretability
---
# 🛡 SIEGE: Interpretability Arena
### Two agents fight inside a language model's forward pass, not at the prompt level.
---
## What Is This?
Safety tools usually read a model's **output text** and react to it. But even oversight models (text-level classifiers and judges) can be tricked: a well-crafted adversarial prompt shifts the wording just enough to slip past a detector while the harmful intent stays intact.
**Linear probing on internal activations is much harder to fool.** The residual stream encodes what the model is "about to say" in a way that doesn't bend as easily to prompt-level tricks. SIEGE builds on this: watch the activations, not just the words, and you get a chance to act *before* the output is written.
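To make the probing idea concrete, here is a minimal sketch of a linear probe fitted on synthetic vectors that stand in for residual-stream activations. It is not SIEGE's probe: the difference-of-means fit, the dimension `d_model = 8`, the planted `secret_direction`, and all function names are assumptions chosen for illustration.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_mean_difference_probe(pos, neg):
    """Linear probe: w = mean(pos) - mean(neg), boundary at the midpoint."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    bias = -dot(w, midpoint)
    return w, bias

def probe_score(w, bias, activation):
    """A positive score means the probe flags the activation."""
    return dot(w, activation) + bias

# Synthetic stand-in for activations (assumption: no real model is loaded).
rng = random.Random(0)
d_model = 8
secret_direction = [1.0 if i == 2 else 0.0 for i in range(d_model)]

def sample(shift):
    # Gaussian noise plus an optional shift along the planted direction.
    return [rng.gauss(0.0, 0.3) + shift * s for s in secret_direction]

forbidden = [sample(2.0) for _ in range(200)]
benign = [sample(0.0) for _ in range(200)]

# Fit on the first 150 of each class, evaluate on the remaining 50.
w, bias = fit_mean_difference_probe(forbidden[:150], benign[:150])
held_out = [(x, 1) for x in forbidden[150:]] + [(x, 0) for x in benign[150:]]
correct = sum((probe_score(w, bias, x) > 0) == bool(y) for x, y in held_out)
accuracy = correct / len(held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With real activations, the `forbidden` and `benign` lists would be replaced by cached residual-stream vectors from one layer; the probe itself stays the same.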
We built a two-agent arena on top of a frozen LLM with TransformerLens hooks:
- 🔴 **Red** tries to steer the model toward a forbidden output (leaking a secret, producing a banned phrase) by nudging activations, biasing logits, or injecting directions mid-run.
- 🔵 **Blue** tries to stop that, without breaking normal helpful behavior, by ablating suspicious directions, clamping heads, or filtering tokens.
Both agents observe **layer-wise signals** from the target model at each step. **Both are trained with GRPO.** Neither is told which layer matters; they figure it out from reward.
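The two kinds of move can be sketched on a single activation vector. This is an illustration under assumptions, not the arena's hook code: `steer` stands in for a Red-style activation addition, `ablate` for a Blue-style directional ablation, and the vectors and the strength `alpha` are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer(activation, direction, alpha):
    """Red-style move: add alpha times the unit direction."""
    u = normalize(direction)
    return [a + alpha * ui for a, ui in zip(activation, u)]

def ablate(activation, direction):
    """Blue-style move: project the direction out of the activation."""
    u = normalize(direction)
    coeff = dot(activation, u)
    return [a - coeff * ui for a, ui in zip(activation, u)]

# Hypothetical 4-dimensional residual vector and attack direction.
resid = [0.5, -1.0, 0.25, 2.0]
attack_dir = [0.0, 3.0, 0.0, 4.0]
unit = normalize(attack_dir)

steered = steer(resid, attack_dir, alpha=5.0)
defended = ablate(steered, attack_dir)

print("projection before attack:", dot(resid, unit))
print("projection after steering:", dot(steered, unit))
print("projection after ablation:", dot(defended, unit))
```

Note that ablation removes the whole component along the direction, including whatever the model had there before the attack. That side effect is one way a defense can cost utility.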
---
## The Arms Race
Blue only gets meaningfully better when Red stops being predictable. A heuristic Red that always steers the same layer is trivially countered: Blue learns to always ablate that layer. That's not a real defense.
When **Red is also trained with GRPO**, it discovers creative strategies: splitting attacks across two layers, injecting later in the sequence, using directions Blue has learned to ignore. This forces Blue to generalize: to understand the model's activation geometry rather than memorize a fixed counter.
The result: a Blue that has survived a trained Red's full distribution of attacks, which is meaningfully more robust than one trained against fixed rules.
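GRPO scores each sampled action relative to the other samples drawn for the same prompt, which is why neither agent needs a separate value model. The sketch below shows only that group-relative advantage step. Using the population standard deviation and the `eps` value are assumptions (implementations differ), and the reward numbers are invented.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four Red rollouts sampled from the same state.
red_rewards = [0.0, 1.0, 0.0, 1.0]
adv = group_relative_advantages(red_rewards)
print(adv)

# When every rollout scores the same, there is no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)
```

The second call illustrates the point made above about a predictable opponent: if every rollout in a group earns the same reward, all advantages are zero and the policy gets no gradient from that group.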
---
## Results
**Blue agent reward over training:**

*Mean episode reward vs. training step.*
**Red vs. Blue co-training:**

*Both agents' rewards over the same run. Red spikes when it finds a breakthrough; Blue dips, then recovers. Neither plateaus; they keep pushing each other.*
## Reward Design
| Component | What it captures |
| ---------------------- | -------------------------------------------------------- |
| Forbidden output match | Did the bad phrase appear? Red maximizes, Blue minimizes |
| Utility penalty | Did Blue break normal helpful behavior in the process? |
| Layer precision bonus | Targeted single-layer move vs. scattershot intervention |
Always blocking everything tanks the utility score, so this kind of reward hacking is actively penalized.
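One way the three components in the table could combine into Blue's reward is sketched below. The weights, the `EpisodeOutcome` fields, and the additive form are all assumptions made for illustration; the source does not give the actual formula.

```python
from dataclasses import dataclass

@dataclass
class EpisodeOutcome:
    forbidden_hit: bool   # did the forbidden phrase appear in the output?
    utility: float        # 0..1 score on normal helpful behavior
    layers_touched: int   # number of layers Blue intervened on

def blue_reward(o, w_block=1.0, w_utility=1.0, w_precision=0.2):
    """Hypothetical additive reward for Blue (weights are assumptions)."""
    block_term = -w_block if o.forbidden_hit else w_block
    utility_penalty = w_utility * (1.0 - o.utility)
    precision_bonus = w_precision if o.layers_touched == 1 else 0.0
    return block_term - utility_penalty + precision_bonus

# Three made-up episodes: a targeted defense, a block-everything defense,
# and a failed defense.
targeted = blue_reward(EpisodeOutcome(False, 0.95, 1))
block_everything = blue_reward(EpisodeOutcome(False, 0.10, 12))
failed = blue_reward(EpisodeOutcome(True, 1.00, 0))

print(targeted, block_everything, failed)
```

Under these assumed weights, the block-everything policy still stops the forbidden output but earns far less than the targeted one, because the utility penalty absorbs most of the blocking reward.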
---
## Running It
**Install:**
```bash
cd siege && uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
uv sync --extra gpu # for GRPO (needs CUDA)
```
**Start the arena server (terminal 1):**
```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```
**Train (terminal 2):**
```bash
python scripts/train.py # heuristic self-play, no GPU needed
python scripts/train_grpo.py # GRPO for both agents, GPU
```
Key env vars: `SIEGE_AGENT_MODEL_ID`, `SIEGE_TARGET_MODEL_ID`, `SIEGE_ENV_URL`, `WANDB_API_KEY`, `HF_TOKEN`.
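A small helper like the following can check that the variables are set before training starts. It is a sketch, not part of the repo: `SiegeSettings`, `load_settings`, and `missing` are hypothetical names, and only the variable names themselves come from the list above.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional

@dataclass(frozen=True)
class SiegeSettings:
    agent_model_id: Optional[str]
    target_model_id: Optional[str]
    env_url: Optional[str]
    wandb_api_key: Optional[str]
    hf_token: Optional[str]

def load_settings(env: Optional[Mapping[str, str]] = None) -> SiegeSettings:
    """Read the documented variables; unset ones come back as None."""
    env = os.environ if env is None else env
    return SiegeSettings(
        agent_model_id=env.get("SIEGE_AGENT_MODEL_ID"),
        target_model_id=env.get("SIEGE_TARGET_MODEL_ID"),
        env_url=env.get("SIEGE_ENV_URL"),
        wandb_api_key=env.get("WANDB_API_KEY"),
        hf_token=env.get("HF_TOKEN"),
    )

def missing(settings: SiegeSettings) -> list:
    """Names of settings fields that were not provided."""
    return [f.name for f in fields(settings) if getattr(settings, f.name) is None]

# Example with an explicit mapping instead of the real environment.
# The URL is an assumption based on the uvicorn port used above.
example = load_settings({"SIEGE_ENV_URL": "http://localhost:8000"})
print(example.env_url, missing(example))
```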
---
## Repo Layout
```
siege/
├── interp_arena/   # env, hooks, agents, config
├── server/         # FastAPI arena server
├── scripts/        # train.py, train_grpo.py
├── data/           # episodes.jsonl
├── assets/         # plots committed here
└── pyproject.toml
```
---
## Links
| | |
| ----------------- | ------------------------------------------------------------------------------------------------------ |
| 🤗 HF Space | [BART-ender/siege](https://huggingface.co/spaces/BART-ender/siege) |
| Training Colab | [Open in Colab](https://colab.research.google.com/drive/1zU9ugU8CJwZDq2Fxu9ccYGh7v_dVft9W?usp=sharing) |
| 💻 Code | [github.com/vibhor-5/siege](https://github.com/vibhor-5/siege) |
| Blog post | [Blog](BLOG.md) |
---
## References
- [TransformerLens](https://github.com/neelnanda-io/TransformerLens)
- [Representation Engineering (Zou et al., 2023)](https://arxiv.org/abs/2310.01405)
- [Activation Steering (Turner et al., 2023)](https://arxiv.org/abs/2308.10248)
- [OpenEnv](https://github.com/openenv/openenv)