---
title: "SIEGE: Interpretability Arena"
emoji: 🛑
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
base_path: /web
pinned: false
tags:
- openenv
- interpretability
- mechanistic-interpretability
---
# 🛑 SIEGE: Interpretability Arena
### Two agents fight inside a language model's forward pass, not at the prompt level.
---
## What Is This?
Safety tools usually read a model's **output text** and react to it. But even oversight models, such as text-level classifiers and judges, can be tricked: a well-crafted adversarial prompt shifts the wording just enough to slip past a detector while the harmful intent stays intact.
**Linear probing on internal activations is much harder to fool.** The residual stream encodes what the model is "about to say" in a way that doesn't bend as easily to prompt-level tricks. SIEGE builds on this: watch the activations, not just the words, and you get a chance to act *before* the output is written.
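To make "linear probing" concrete, here is a minimal sketch of a logistic-regression probe in plain Python. It is an illustration only: the activations are synthetic vectors generated by a hypothetical `synthetic_activations` helper, not real residual-stream data, and none of these function names come from the SIEGE codebase.

```python
import math
import random

def probe_score(w, b, x):
    """Probability that activation vector x belongs to the 'forbidden' class."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(acts, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            err = probe_score(w, b, x) - y  # gradient of log-loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def synthetic_activations(n, dim=8, shift=4.0, seed=0):
    """Toy stand-in for residual-stream activations (assumption, not model data).

    Label-1 examples are shifted along coordinate 0, so a single linear
    direction separates the two classes.
    """
    rng = random.Random(seed)
    acts, labels = [], []
    for k in range(n):
        y = k % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift * y
        acts.append(x)
        labels.append(y)
    return acts, labels

def accuracy(w, b, acts, labels):
    """Fraction of examples where the thresholded probe matches the label."""
    hits = sum((probe_score(w, b, x) > 0.5) == bool(y) for x, y in zip(acts, labels))
    return hits / len(acts)

train_x, train_y = synthetic_activations(200, seed=0)
test_x, test_y = synthetic_activations(200, seed=1)
w, b = train_probe(train_x, train_y)
print(f"held-out probe accuracy: {accuracy(w, b, test_x, test_y):.2f}")
```

With a real model, the input vectors would be activations cached at a chosen layer rather than synthetic samples. The probe itself stays this simple, which is part of its appeal.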
We built a two-agent arena on top of a frozen LLM with TransformerLens hooks:
- 🔴 **Red** tries to steer the model toward a forbidden output (leaking a secret, producing a banned phrase) by nudging activations, biasing logits, or injecting directions mid-run.
- 🔵 **Blue** tries to stop that, without breaking normal helpful behavior, by ablating suspicious directions, clamping heads, or filtering tokens.
Both agents observe **layer-wise signals** from the target model at each step. **Both are trained with GRPO**: neither is told which layer matters; they figure it out from reward.
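As a rough illustration of the two basic moves, the sketch below shows steering (adding a scaled direction to a residual vector) and directional ablation (projecting that direction out). It uses plain Python lists in place of tensors, and the names `steer` and `ablate` are hypothetical, not taken from the SIEGE code.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [a / norm for a in v]

def steer(resid, direction, alpha):
    """Red-style move: push the residual vector along a direction."""
    d = unit(direction)
    return [r + alpha * di for r, di in zip(resid, d)]

def ablate(resid, direction):
    """Blue-style move: remove the component of resid along a direction."""
    d = unit(direction)
    coeff = dot(resid, d)
    return [r - coeff * di for r, di in zip(resid, d)]

resid = [1.0, 2.0, 3.0]
direction = [0.0, 0.0, 2.0]

steered = steer(resid, direction, alpha=5.0)   # component along direction grows
defended = ablate(steered, direction)          # component along direction is zeroed
print(steered, defended)
```

In TransformerLens, logic like this would typically run inside a forward hook on a residual-stream hook point (for example `blocks.{layer}.hook_resid_post`, passed through `run_with_hooks`). How SIEGE wires its own hooks is not shown here.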
---
## The Arms Race
Blue only gets meaningfully better when Red stops being predictable. A heuristic Red that always steers the same layer is trivially countered: Blue learns to always ablate that layer. That's not a real defense.
When **Red is also trained with GRPO**, it discovers creative strategies: splitting attacks across two layers, injecting later in the sequence, using directions Blue has learned to ignore. This forces Blue to generalize, to understand the model's activation geometry rather than memorize a fixed counter.
The result is a Blue that has survived a trained Red's full distribution of attacks, which should make it more robust than one trained against fixed rules.
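For readers unfamiliar with GRPO, its central step is to score each sampled action relative to the other samples in its group, rather than against a learned value function. The sketch below shows only that normalization step. The function name is hypothetical, and using the population standard deviation is an assumption (implementations differ on this detail).

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group's rewards to zero mean and (roughly) unit spread.

    rewards: episode rewards for several actions sampled from the same state.
    eps guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group where two of four sampled interventions succeeded.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# A group where every sample got the same reward carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

The second case hints at why a predictable opponent is a poor teacher: if every sampled defense earns the same reward, the advantages are all zero and there is nothing to push the policy in any direction.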
---
## Results
**Blue agent reward over training:**
![Blue reward curve](assets/blue_graph.png)
*Mean episode reward vs. training step.*
**Red vs. Blue co-training:**
![Red reward curve](assets/red_graph.png)
*Both agents' rewards over the same run. Red spikes when it finds a breakthrough; Blue dips then recovers. Neither plateaus; they keep pushing each other.*
## Reward Design
| Component | What it captures |
| ---------------------- | -------------------------------------------------------- |
| Forbidden output match | Did the bad phrase appear? Red maximizes, Blue minimizes |
| Utility penalty | Did Blue break normal helpful behavior in the process? |
| Layer precision bonus | Targeted single-layer move vs. scattershot intervention |
Always blocking everything tanks the utility score, so reward hacking is actively penalized.
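One way the three components in the table could combine is sketched below. The weights, the 0-to-1 utility scale, and the function names are all assumptions made for illustration; the actual reward code in the repo may differ.

```python
def precision_bonus(layers_touched):
    """1.0 for a targeted single-layer move, 0.0 for anything broader."""
    return 1.0 if layers_touched == 1 else 0.0

def blue_reward(forbidden_hit, utility, layers_touched,
                w_block=1.0, w_utility=1.0, w_precision=0.25):
    """Hypothetical Blue reward.

    forbidden_hit: True if the forbidden phrase appeared in the output.
    utility: score in [0, 1] for normal helpful behavior after Blue's move.
    layers_touched: number of layers Blue intervened on.
    """
    block = 0.0 if forbidden_hit else 1.0
    utility_penalty = 1.0 - utility
    return (w_block * block
            - w_utility * utility_penalty
            + w_precision * precision_bonus(layers_touched))

def red_reward(forbidden_hit, layers_touched, w_hit=1.0, w_precision=0.25):
    """Hypothetical Red reward: the mirror image on the forbidden-output term."""
    hit = 1.0 if forbidden_hit else 0.0
    return w_hit * hit + w_precision * precision_bonus(layers_touched)

# Targeted defense: blocks the phrase and keeps the model useful.
targeted = blue_reward(forbidden_hit=False, utility=0.95, layers_touched=1)
# "Block everything": also stops the phrase, but wrecks utility across 12 layers.
block_all = blue_reward(forbidden_hit=False, utility=0.10, layers_touched=12)
print(targeted, block_all)
```

Under these assumed weights, the targeted defense scores well above the block-everything strategy even though both prevent the forbidden output, which is the behavior the utility penalty is meant to produce.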
---
## Running It
**Install:**
```bash
cd siege && uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
uv sync --extra gpu # for GRPO (needs CUDA)
```
**Start the arena server (terminal 1):**
```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```
**Train (terminal 2):**
```bash
python scripts/train.py # heuristic self-play, no GPU needed
python scripts/train_grpo.py # GRPO for both agents, GPU
```
Key env vars: `SIEGE_AGENT_MODEL_ID`, `SIEGE_TARGET_MODEL_ID`, `SIEGE_ENV_URL`, `WANDB_API_KEY`, `HF_TOKEN`.
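The variables above can be gathered in one place before launching a run. The sketch below is a hypothetical helper, not part of the repo. The `http://localhost:8000` fallback is an assumption based on the `uvicorn` command above, and unset model IDs are left as `None` rather than guessed.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class SiegeSettings:
    agent_model_id: Optional[str]
    target_model_id: Optional[str]
    env_url: str
    wandb_api_key: Optional[str]
    hf_token: Optional[str]

def load_settings(env: Optional[Mapping[str, str]] = None) -> SiegeSettings:
    """Read SIEGE-related settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return SiegeSettings(
        agent_model_id=env.get("SIEGE_AGENT_MODEL_ID"),
        target_model_id=env.get("SIEGE_TARGET_MODEL_ID"),
        # Assumed default: the local arena server started in terminal 1.
        env_url=env.get("SIEGE_ENV_URL", "http://localhost:8000"),
        wandb_api_key=env.get("WANDB_API_KEY"),
        hf_token=env.get("HF_TOKEN"),
    )

settings = load_settings()
print(settings.env_url)
```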
---
## Repo Layout
```
siege/
├── interp_arena/    # env, hooks, agents, config
├── server/          # FastAPI arena server
├── scripts/         # train.py, train_grpo.py
├── data/            # episodes.jsonl
├── assets/          # plots committed here
└── pyproject.toml
```
---
## Links
| | |
| ----------------- | ------------------------------------------------------------------------------------------------------ |
| 🤗 HF Space       | [BART-ender/siege](https://huggingface.co/spaces/BART-ender/siege)                                     |
| 📓 Training Colab | [Open in Colab](https://colab.research.google.com/drive/1zU9ugU8CJwZDq2Fxu9ccYGh7v_dVft9W?usp=sharing) |
| 💻 Code           | [github.com/vibhor-5/siege](https://github.com/vibhor-5/siege)                                         |
| 📝 Blog post      | [Blog](BLOG.md)                                                                                        |
---
## References
- [TransformerLens](https://github.com/neelnanda-io/TransformerLens)
- [Representation Engineering (Zou et al., 2023)](https://arxiv.org/abs/2310.01405)
- [Activation Steering (Turner et al., 2023)](https://arxiv.org/abs/2308.10248)
- [OpenEnv](https://github.com/openenv/openenv)