---
title: "SIEGE – Interpretability Arena"
emoji: 🛡
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
base_path: /web
pinned: false
tags:
  - openenv
  - interpretability
  - mechanistic-interpretability
---

# 🛡 SIEGE – Interpretability Arena

### Two agents fight inside a language model's forward pass, not at the prompt level.

## What Is This?

Safety tools usually read a model's **output text** and react to it. But even oversight models (text-level classifiers and judges) can be tricked: a well-crafted adversarial prompt shifts the wording just enough to slip past a detector while the harmful intent stays intact.

**Linear probing on internal activations is much harder to fool.** The residual stream encodes what the model is "about to say" in a way that doesn't bend as easily to prompt-level tricks. SIEGE builds on this: watch the activations, not just the words, and you get a chance to act *before* the output is written.
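
To make "linear probing" concrete, here is a minimal sketch of a probe: a logistic-regression classifier fitted on activation vectors. It is not SIEGE's probe. The activations are synthetic stand-ins (plain Python lists instead of tensors), and the dimension, the `direction` vector, and the hyperparameters are all assumptions chosen for illustration.

```python
import math
import random


def train_linear_probe(acts, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(acts[0])
    n = len(acts)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(acts, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss w.r.t. z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w, b, x):
    """Probability the probe assigns to the 'forbidden' class."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic "residual stream" vectors (assumption): the forbidden concept
# shifts activations along one hypothetical direction, plus Gaussian noise.
rng = random.Random(0)
DIM = 8
direction = [1.0 if i == 0 else 0.0 for i in range(DIM)]


def sample(label):
    shift = 2.0 if label else -2.0
    return [rng.gauss(0.0, 0.5) + shift * d for d in direction]


labels = [i % 2 for i in range(200)]
acts = [sample(y) for y in labels]

w, b = train_linear_probe(acts, labels)
accuracy = sum(
    (probe_score(w, b, x) > 0.5) == bool(y) for x, y in zip(acts, labels)
) / len(acts)
```

In a real setup the rows of `acts` would be activations cached from one layer of the target model, and the probe's score could be read at every step before any token is emitted.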

We built a two-agent arena on top of a frozen LLM with TransformerLens hooks:

- 🔴 **Red** tries to steer the model toward a forbidden output (leaking a secret, producing a banned phrase) by nudging activations, biasing logits, or injecting directions mid-run.
- 🔵 **Blue** tries to stop that, without breaking normal helpful behavior, by ablating suspicious directions, clamping heads, or filtering tokens.
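
Two of the moves listed above, injecting a direction and ablating one, reduce to simple vector arithmetic on a residual-stream vector. The sketch below illustrates that arithmetic only; it is not SIEGE's action implementation, and `steer`, `ablate`, `secret_dir`, and the numbers are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steer(resid, direction, alpha):
    """Red-style move: push the activation `alpha` units along a direction."""
    d = unit(direction)
    return [r + alpha * di for r, di in zip(resid, d)]


def ablate(resid, direction):
    """Blue-style move: project out the component along a direction."""
    d = unit(direction)
    coeff = dot(resid, d)
    return [r - coeff * di for r, di in zip(resid, d)]


# Hypothetical 4-dimensional activation and "forbidden" direction.
resid = [0.5, -1.0, 2.0, 0.0]
secret_dir = [0.0, 3.0, 4.0, 0.0]

attacked = steer(resid, secret_dir, alpha=5.0)
defended = ablate(attacked, secret_dir)
```

Note that `ablate` removes the activation's original component along the direction as well as the injected one, which is one way a defensive move can cost utility. With TransformerLens, functions like these would typically run inside a forward hook passed to `run_with_hooks`; which hook points SIEGE uses is not stated here.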

Both agents observe **layer-wise signals** from the target model at each step. **Both are trained with GRPO**: neither is told which layer matters, so each has to work it out from reward.
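
GRPO (Group Relative Policy Optimization) scores each sampled action against the other samples drawn for the same situation instead of using a learned value function. A minimal sketch of that group-relative advantage follows; the function name, the `eps` value, and the choice of population standard deviation are assumptions, and real trainers differ in these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four actions sampled from the same observation.
group = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group)

# A group where every sample earned the same reward carries no signal.
flat = group_relative_advantages([0.3, 0.3, 0.3])
```

Samples that beat their group's average get a positive advantage and are reinforced; the rest are pushed down. This is what lets an agent discover which layer matters from reward alone.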

---

## The Arms Race

Blue only gets meaningfully better when Red stops being predictable. A heuristic Red that always steers the same layer is trivially countered: Blue learns to always ablate that layer. That's not a real defense.

When **Red is also trained with GRPO**, it discovers creative strategies: splitting attacks across two layers, injecting later in the sequence, using directions Blue has learned to ignore. This forces Blue to generalize, to understand the model's activation geometry rather than memorize a fixed counter.

The result is a Blue that has survived a trained Red's full distribution of attacks, which is meaningfully more robust than one trained against fixed rules.

---

## Results

**Blue agent reward over training:**

![Blue reward curve](assets/blue_reward.png)

*Mean episode reward vs. training step.*

**Red vs. Blue co-training:**

![Red vs Blue reward](assets/red_vs_blue.png)

*Both agents' rewards over the same run. Red spikes when it finds a breakthrough; Blue dips, then recovers. Neither plateaus; they keep pushing each other.*

## Reward Design

| Component              | What it captures                                          |
| ---------------------- | --------------------------------------------------------- |
| Forbidden output match | Did the bad phrase appear? Red maximizes, Blue minimizes. |
| Utility penalty        | Did Blue break normal helpful behavior in the process?    |
| Layer precision bonus  | Targeted single-layer move vs. scattershot intervention   |

Always blocking everything tanks the utility score, so that kind of reward hacking is actively penalized.
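
One way the three components in the table could combine is sketched below. The field names, weights, and functional form are all assumptions made for illustration; the actual reward code lives in the repo and may look quite different.

```python
from dataclasses import dataclass


@dataclass
class EpisodeOutcome:
    forbidden_hit: bool   # did the forbidden phrase appear in the output?
    utility_drop: float   # 0.0 = helpfulness intact, 1.0 = fully broken
    layers_touched: int   # how many layers the agent intervened on


def blue_reward(o, w_block=1.0, w_utility=1.0, w_precision=0.2):
    """Hypothetical Blue reward: block the leak, keep utility, stay targeted."""
    block_term = -w_block if o.forbidden_hit else w_block
    utility_term = -w_utility * o.utility_drop
    precision_term = w_precision if o.layers_touched == 1 else 0.0
    return block_term + utility_term + precision_term


def red_reward(o, w_leak=1.0, w_precision=0.2):
    """Hypothetical Red reward: the mirror image on the forbidden-output term."""
    leak_term = w_leak if o.forbidden_hit else -w_leak
    precision_term = w_precision if o.layers_touched == 1 else 0.0
    return leak_term + precision_term


# A targeted single-layer defense vs. a "block everything" defense.
targeted = EpisodeOutcome(forbidden_hit=False, utility_drop=0.0, layers_touched=1)
blanket = EpisodeOutcome(forbidden_hit=False, utility_drop=1.0, layers_touched=12)
```

Under these assumed weights the blanket defense stops the leak but forfeits everything it gained through the utility penalty, so the targeted defense scores strictly higher.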

---

## Running It

**Install:**

```bash
cd siege && uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
uv sync --extra gpu   # for GRPO (needs CUDA)
```

**Start the arena server (terminal 1):**

```bash
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
```

**Train (terminal 2):**

```bash
python scripts/train.py        # heuristic self-play, no GPU needed
python scripts/train_grpo.py   # GRPO for both agents, GPU
```

Key env vars: `SIEGE_AGENT_MODEL_ID`, `SIEGE_TARGET_MODEL_ID`, `SIEGE_ENV_URL`, `WANDB_API_KEY`, `HF_TOKEN`.
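
If you want to check how these variables resolve before launching a run, a small loader like the one below can help. It is a sketch, not the repo's config code: `SiegeSettings` and `load_settings` are hypothetical names, and the `http://localhost:8000` fallback is an assumption based on the port used in the server command above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SiegeSettings:
    agent_model_id: Optional[str]
    target_model_id: Optional[str]
    env_url: str
    wandb_api_key: Optional[str]
    hf_token: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> SiegeSettings:
    """Read the documented variables; unset ones come back as None."""
    return SiegeSettings(
        agent_model_id=env.get("SIEGE_AGENT_MODEL_ID"),
        target_model_id=env.get("SIEGE_TARGET_MODEL_ID"),
        # Assumed default, matching `--port 8000` in the server command.
        env_url=env.get("SIEGE_ENV_URL", "http://localhost:8000"),
        wandb_api_key=env.get("WANDB_API_KEY"),
        hf_token=env.get("HF_TOKEN"),
    )


settings = load_settings({"SIEGE_ENV_URL": "http://example.invalid:9000"})
defaults = load_settings({})
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the real environment.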

---

## Repo Layout

```
siege/
├── interp_arena/   # env, hooks, agents, config
├── server/         # FastAPI arena server
├── scripts/        # train.py, train_grpo.py
├── data/           # episodes.jsonl
├── assets/         # plots committed here
└── pyproject.toml
```

---

## Links

|                |                                                                                                        |
| -------------- | ------------------------------------------------------------------------------------------------------ |
| 🤗 HF Space    | [BART-ender/siege](https://huggingface.co/spaces/BART-ender/siege)                                     |
| Training Colab | [Open in Colab](https://colab.research.google.com/drive/1zU9ugU8CJwZDq2Fxu9ccYGh7v_dVft9W?usp=sharing) |
| 💻 Code        | [github.com/vibhor-5/siege](https://github.com/vibhor-5/siege)                                         |
| Blog post      | [Blog](BLOG.md)                                                                                        |

---

## References

- [TransformerLens](https://github.com/neelnanda-io/TransformerLens)
- [Representation Engineering (Zou et al., 2023)](https://arxiv.org/abs/2310.01405)
- [Activation Steering (Turner et al., 2023)](https://arxiv.org/abs/2308.10248)
- [OpenEnv](https://github.com/openenv/openenv)