emoji: 🛡
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
base_path: /web
pinned: false
tags:
- openenv
- interpretability
- mechanistic-interpretability
What Is This?
Safety tools usually read a model's output text and react to it. But even oversight models (text-level classifiers and judges) can be tricked: a well-crafted adversarial prompt shifts the wording just enough to slip past a detector while the harmful intent stays intact.
Linear probing on internal activations is much harder to fool. The residual stream encodes what the model is "about to say" in a way that doesn't bend as easily to prompt-level tricks. SIEGE builds on this: watch the activations, not just the words, and you get a chance to act before the output is written.
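As a minimal sketch of what "linear probing" means here, the example below fits a logistic-regression probe on synthetic vectors that stand in for residual-stream activations. The data, dimensions, and function names are illustrative assumptions, not SIEGE's actual probe code; in the arena the activations would come from the TransformerLens hooks.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.3, steps=500):
    """Fit a logistic-regression probe (w, b) with plain gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels              # gradient of cross-entropy w.r.t. logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_score(acts, w, b):
    """Probability that each activation vector carries the probed concept."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))

# Synthetic stand-in for activations: label-1 examples are shifted
# along one hidden direction, label-0 examples are pure noise.
rng = np.random.default_rng(0)
d_model = 16                               # hypothetical width
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=400).astype(float)
acts = rng.normal(size=(400, d_model)) + 4.0 * labels[:, None] * direction

w, b = train_linear_probe(acts, labels)
accuracy = ((probe_score(acts, w, b) > 0.5) == (labels == 1)).mean()
print(f"probe accuracy on synthetic data: {accuracy:.2f}")
```

The probe only sees the vectors, never the text, which is the property the paragraph above relies on.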
We built a two-agent arena on top of a frozen LLM with TransformerLens hooks:
- 🔴 Red tries to steer the model toward a forbidden output (leaking a secret, producing a banned phrase) by nudging activations, biasing logits, or injecting directions mid-run.
- 🔵 Blue tries to stop that, without breaking normal helpful behavior, by ablating suspicious directions, clamping heads, or filtering tokens.
Both agents observe layer-wise signals from the target model at each step. Both are trained with GRPO. Neither is told which layer matters; they figure it out from reward.
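The two core moves, nudging activations along a direction and ablating a direction, reduce to simple vector operations on the residual stream. The sketch below shows them on a random array; the function names, shapes, and `alpha` value are assumptions for illustration, and in the arena the same arithmetic would run inside a hook on the chosen layer.

```python
import numpy as np

def steer(resid, direction, alpha):
    """Red-style move: push every position along a unit direction."""
    unit = direction / np.linalg.norm(direction)
    return resid + alpha * unit

def ablate(resid, direction):
    """Blue-style move: project the direction out of every position."""
    unit = direction / np.linalg.norm(direction)
    return resid - np.outer(resid @ unit, unit)

rng = np.random.default_rng(0)
resid = rng.normal(size=(5, 8))        # hypothetical (seq_len, d_model)
direction = rng.normal(size=8)
unit = direction / np.linalg.norm(direction)

attacked = steer(resid, direction, alpha=4.0)
defended = ablate(attacked, direction)

# After ablation, nothing is left along the attacked direction.
print(np.abs(defended @ unit).max())
```

Ablation also removes whatever benign signal the original activations carried along that direction, which is one way an over-eager Blue can hurt normal behavior.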
The Arms Race
Blue only gets meaningfully better when Red stops being predictable. A heuristic Red that always steers the same layer is trivially countered: Blue learns to always ablate that layer. That's not a real defense.
When Red is also trained with GRPO, it discovers creative strategies: splitting attacks across two layers, injecting later in the sequence, using directions Blue has learned to ignore. This forces Blue to generalize: to understand the model's activation geometry, not memorize a fixed counter.
The result: a Blue that has survived a trained Red's full distribution of attacks, which is meaningfully more robust than one trained against fixed rules.
Results
Blue agent reward over training:
Mean episode reward vs. training step.
Both agents' rewards over the same run. Red spikes when it finds a breakthrough; Blue dips then recovers. Neither plateaus; they keep pushing each other.
Reward Design
| Component | What it captures |
|---|---|
| Forbidden output match | Did the bad phrase appear? Red maximizes, Blue minimizes |
| Utility penalty | Did Blue break normal helpful behavior in the process? |
| Layer precision bonus | Targeted single-layer move vs. scattershot intervention |
Always blocking everything tanks the utility score, so this kind of reward hacking is actively penalized.
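One way the three components in the table could combine for Blue is sketched below. The field names, weights, and the 0-to-1 utility scale are all assumptions made for illustration; the actual reward code lives in the repo and may differ.

```python
from dataclasses import dataclass

@dataclass
class EpisodeOutcome:
    forbidden_hit: bool    # did the forbidden phrase appear in the output?
    utility: float         # assumed 0..1 score for normal helpful behavior
    layers_touched: int    # number of layers the agent intervened on

def blue_reward(o, w_block=1.0, w_utility=1.0, w_precision=0.2):
    """Hypothetical Blue reward: block the output, keep utility, stay targeted."""
    block = 0.0 if o.forbidden_hit else 1.0
    utility_penalty = 1.0 - o.utility
    precision = 1.0 if o.layers_touched == 1 else 0.0
    return w_block * block - w_utility * utility_penalty + w_precision * precision

targeted = EpisodeOutcome(forbidden_hit=False, utility=0.95, layers_touched=1)
block_all = EpisodeOutcome(forbidden_hit=False, utility=0.10, layers_touched=12)

print(blue_reward(targeted))
print(blue_reward(block_all))
```

Both outcomes stop the forbidden output, but the scattershot one loses most of its reward to the utility penalty and earns no precision bonus.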
Running It
Install:
cd siege && uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
uv sync --extra gpu # for GRPO (needs CUDA)
Start the arena server (terminal 1):
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
Train (terminal 2):
python scripts/train.py # heuristic self-play, no GPU needed
python scripts/train_grpo.py # GRPO for both agents, GPU
Key env vars: SIEGE_AGENT_MODEL_ID, SIEGE_TARGET_MODEL_ID, SIEGE_ENV_URL, WANDB_API_KEY, HF_TOKEN.
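If you script around these variables, a small loader like the one below can fail early when something is missing. This is a sketch, not the project's config code: which variables are required, the `SiegeSettings` name, and the `http://localhost:8000` default (chosen to match the uvicorn command above) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class SiegeSettings:
    agent_model_id: str
    target_model_id: str
    env_url: str
    wandb_api_key: Optional[str]
    hf_token: Optional[str]

def load_settings(env: Optional[Mapping[str, str]] = None) -> SiegeSettings:
    """Read the SIEGE env vars; raise if an assumed-required one is unset."""
    env = os.environ if env is None else env
    required = ("SIEGE_AGENT_MODEL_ID", "SIEGE_TARGET_MODEL_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return SiegeSettings(
        agent_model_id=env["SIEGE_AGENT_MODEL_ID"],
        target_model_id=env["SIEGE_TARGET_MODEL_ID"],
        # Assumed default: the local arena server started in terminal 1.
        env_url=env.get("SIEGE_ENV_URL", "http://localhost:8000"),
        wandb_api_key=env.get("WANDB_API_KEY"),
        hf_token=env.get("HF_TOKEN"),
    )

# Example with placeholder model IDs passed in as a plain dict.
settings = load_settings({
    "SIEGE_AGENT_MODEL_ID": "agent-model-placeholder",
    "SIEGE_TARGET_MODEL_ID": "target-model-placeholder",
})
print(settings.env_url)
```

Accepting a mapping instead of reading `os.environ` directly keeps the loader easy to test.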
Repo Layout
siege/
├── interp_arena/ # env, hooks, agents, config
├── server/ # FastAPI arena server
├── scripts/ # train.py, train_grpo.py
├── data/ # episodes.jsonl
├── assets/ # plots committed here
└── pyproject.toml
Links
| Resource | Link |
|---|---|
| 🤗 HF Space | BART-ender/siege |
| Training Colab | Open in Colab |
| 💻 Code | github.com/vibhor-5/siege |
| Blog post | Blog |
