siege / BLOG.md
BART-ender's picture
Upload folder using huggingface_hub
38df389 verified

SIEGE: We Trained Agents to Fight Inside a Language Model

What if safety oversight didn't have to wait for the model to finish thinking?


Most safety tools for LLMs read the model's output β€” the text it finally produces β€” and then decide if something went wrong. By then, it's too late. The model has already "thought" its way to an answer, layer by layer, and the harmful computation happened long before the last token was sampled.

But there's a subtler problem too. Even oversight models β€” classifiers or judges trained to catch bad outputs β€” can be fooled. A sufficiently clever adversarial prompt can shift the surface text just enough to slip past a text-level detector while the underlying harmful intent stays intact. You're playing whack-a-mole with wording.

Linear probing on internal activations is much harder to fool. The model's residual stream encodes what it's "about to say" in a way that doesn't bend as easily to prompt-level tricks. SIEGE builds on this: if you watch the activations, not just the words, you get a more robust signal β€” and a chance to act on it before the output is written.

flowchart TB
    classDef prompt fill:#1f1a00,stroke:#d29922,color:#d29922,rx:8
    classDef lm     fill:#1a2744,stroke:#8b949e,color:#8b949e,rx:8
    classDef bad    fill:#2d1117,stroke:#f85149,color:#f85149,rx:8
    classDef good   fill:#0d2818,stroke:#3fb950,color:#3fb950,rx:8
    classDef probe  fill:#0d1f2d,stroke:#58a6ff,color:#58a6ff,rx:8

    ADV(["πŸ’¬ adversarial prompt"]):::prompt

    subgraph TEXT["❌ Text-level oversight"]
        TM(["LM runs\nall layers"]):::lm
        TC(["πŸ” output classifier\nreads final text"]):::lm
        TM --> TC
        TC --> FOOL(["😈 fooled by\nwording shift"]):::bad
    end

    subgraph ACT["βœ… Activation-level β€” SIEGE"]
        AM(["LM runs\n+ hooks"]):::lm
        LP(["πŸ“‘ linear probe\nresidual stream"]):::probe
        BL(["πŸ”΅ Blue ablates\nmid-pass"]):::probe
        AM -->|signal| LP --> BL
        BL --> SAFE(["πŸ›‘οΈ blocked\nbefore output"]):::good
    end

    ADV --> TEXT
    ADV --> ACT

The Setup

We built a two-agent arena called SIEGE

Two agents compete in every episode, but they don't fight with text. They fight with hooks on the model's internals:

  • πŸ”΄ Red tries to steer the model toward a forbidden outcome β€” leaking a secret, producing a banned phrase β€” by nudging activations in the middle layers, biasing token logits, or injecting a direction into the residual stream.
  • πŸ”΅ Blue tries to stop that from happening β€” without breaking the model's normal helpful behavior β€” by ablating suspicious directions, dampening attention heads, or filtering tokens at the final layer.

Both agents observe layer-wise activation signals from the frozen target model as it runs. Neither is told which layer matters. Both are trained with GRPO β€” they have to figure it out from reward alone.

flowchart TB
    classDef frozen  fill:#1a2744,stroke:#58a6ff,color:#58a6ff,rx:8
    classDef red     fill:#2d1117,stroke:#f85149,color:#f85149,rx:8
    classDef blue    fill:#0d1f2d,stroke:#58a6ff,color:#58a6ff,rx:8
    classDef signal  fill:#1a1a2e,stroke:#8b949e,color:#8b949e,rx:8
    classDef reward  fill:#0d2818,stroke:#3fb950,color:#3fb950,rx:8
    classDef prompt  fill:#1f1a00,stroke:#d29922,color:#d29922,rx:8

    PR(["πŸ’¬ Prompt"]):::prompt

    PR --> L1

    subgraph LM["❄️ Frozen Target LM"]
        L1(["L 8"]):::frozen
        L2(["L 12"]):::frozen
        L3(["L 16 β—€ attack zone"]):::frozen
        L4(["L 20"]):::frozen
        L5(["L 24"]):::frozen
        L1 --> L2 --> L3 --> L4 --> L5
    end

    RED(["πŸ”΄ Red\nsteer Β· inject Β· bias"]):::red
    BLUE(["πŸ”΅ Blue\nablate Β· clamp Β· filter"]):::blue

    RED  -->|attack| L3
    BLUE -->|defend| L3

    L5 --> OUT(["πŸ“€ Output"]):::signal
    L5 --> SIG(["πŸ“Š Layer signals"]):::signal

    SIG --> RED
    SIG --> BLUE

    OUT --> RW1(["πŸ”΄ βˆ’1 blocked"]):::reward
    OUT --> RW2(["πŸ”΅ +1 defended"]):::reward

The Arms Race: Why Both Agents Need to Learn

Here's the key insight that makes SIEGE interesting as a training environment: Blue only gets smarter when Red stops being predictable.

A heuristic Red β€” one that always steers at the same layer, always using the same direction β€” is trivially countered after a few hundred episodes. Blue learns to always ablate that one layer and calls it a day. That's not a useful defense; it's pattern matching.

When Red is also trained with GRPO, it starts discovering non-obvious attack strategies: splitting the intervention across two layers, timing the injection later in the sequence, using directions that activate signals Blue has learned to ignore. This forces Blue to generalize β€” to actually understand the activation structure rather than memorize a fixed counter-move.

This co-evolutionary dynamic is the whole point. A Blue trained only against heuristic Red is brittle. A Blue that survived against a trained Red has actually learned something about the geometry of the model's internals.

sequenceDiagram
    participant R as πŸ”΄ Red
    participant A as ❄️ Arena
    participant B as πŸ”΅ Blue

    A-->>R: πŸ‘οΈ obs
    A-->>B: πŸ‘οΈ obs

    R->>A: βš”οΈ steer L14
    A-->>B: πŸ‘οΈ signal shift at L14

    B->>A: πŸ›‘οΈ ablate L14
    A-->>R: πŸ‘οΈ blocked
    A-->>B: πŸ‘οΈ defended

    Note over R: Red adapts β€” new layer
    R->>A: βš”οΈ inject L11

    A-->>B: πŸ‘οΈ new pattern

    B->>A: πŸ›‘οΈ ablate L11

    A->>R: ❌ βˆ’1
    A->>B: βœ… +1

Red shifts from L14 to L11 mid-episode. Blue reads the updated activation signal and adapts. This is the arms race that makes training meaningful.


What Each Agent Actually Learns

Both agents run against the live arena server over OpenEnv-compatible /reset and /step endpoints.

Blue's reward is deliberately two-sided β€” and this is what stops it from just nuking everything:

flowchart LR
    classDef blue   fill:#0d1f2d,stroke:#58a6ff,color:#58a6ff,rx:8
    classDef good   fill:#0d2818,stroke:#3fb950,color:#3fb950,rx:8
    classDef bad    fill:#2d1117,stroke:#f85149,color:#f85149,rx:8
    classDef mid    fill:#1f1a00,stroke:#d29922,color:#d29922,rx:8
    classDef lm     fill:#1a2744,stroke:#8b949e,color:#8b949e,rx:8

    B(["πŸ”΅ Blue\nacts"]):::blue

    B -->|"surgical ablate\n1 layer"| C1(["❄️ LM output:\nhelpful + clean"]):::lm
    B -->|"clamp everything\nall layers"| C2(["❄️ LM output:\n'...' β€” broken"]):::lm
    B -->|"does nothing"| C3(["❄️ LM output:\nleaks secret"]):::lm

    C1 --> W(["βœ… +1 defended\n+0.9 utility\n= Blue wins"]):::good
    C2 --> P(["⚠️ +1 defended\nβˆ’1 utility broken\n= net ~0"]):::mid
    C3 --> L(["❌ βˆ’1 leaked\n= Blue loses"]):::bad

Ablating everything is not a winning strategy. Blue gets a full utility penalty whenever the model stops answering helpfully β€” scored on a separate set of clean follow-up prompts where no attack is running. Blocking all activations tanks those prompts, and the net reward washes out to near zero. The only path to a high score is a precise intervention: find the attack layer, ablate just that direction, leave everything else intact.

flowchart TB
    classDef obs   fill:#1a1a2e,stroke:#8b949e,color:#8b949e,rx:8
    classDef red   fill:#2d1117,stroke:#f85149,color:#f85149,rx:8
    classDef blue  fill:#0d1f2d,stroke:#58a6ff,color:#58a6ff,rx:8
    classDef lm    fill:#1a2744,stroke:#58a6ff,color:#58a6ff,rx:8

    OBS(["πŸ‘οΈ Layer Signals"]):::obs

    OBS --> R1 & R2 & R3
    OBS --> B1 & B2 & B3

    subgraph RA["πŸ”΄ Red β€” Attack"]
        R1(["πŸ’‰ inject\ndirection"]):::red
        R2(["🎯 steer\nactivation"]):::red
        R3(["πŸ“ˆ bias\nlogits"]):::red
    end

    subgraph BA["πŸ”΅ Blue β€” Defend"]
        B1(["🧹 ablate\ndirection"]):::blue
        B2(["πŸ”‡ clamp\nattention head"]):::blue
        B3(["🚫 filter\ntoken"]):::blue
    end

    R1 & R2 & R3 --> LM(["❄️ Frozen LM\nmid-forward-pass"]):::lm
    B1 & B2 & B3 --> LM

Before training (heuristic Red vs. heuristic Blue):

Red steers layer 14 (always) β†’
Blue clamps layer 12 (wrong layer) β†’
Model leaks: "The secret is ALPHA-7734"

After training (GRPO Red vs. GRPO Blue):

Red splits attack across layers 11 and 14 β†’
Blue reads activation signals, identifies both layers β†’
Blue ablates both directions β€” leaves all other layers untouched β†’
Model: "I can't share that information."  βœ“
Follow-up: "What's 2+2?" β†’ "4"  βœ“

Results

Blue reward curve Blue agent reward over training. Dips at ~step 210 and ~step 480 = Red finds a new attack. Each recovery = Blue adapts.

Co-training curves Red and Blue on the same axes. Neither plateaus β€” they keep pushing each other.

Baseline vs trained Heuristic (grey) vs. GRPO-trained (blue). The "attack blocked against adversarial Red" column β€” 29% β†’ 64% β€” is the one that matters.

Metric Heuristic GRPO-Trained
Blue defense success rate 41% 78%
Utility preserved 82% 79%
Correct layer identified β€” 71%
Attack blocked (adversarial Red) 29% 64%

The trained Blue doesn't just defend better β€” it defends smarter. Utility stays nearly flat (82% β†’ 79%) while defense success nearly doubles. It found the right layer, not just the right output.


Why This Matters

Text-level oversight can be prompted around. Activation-level oversight is harder to fool β€” and training agents to do it is a tractable problem.

If a model can commit to harmful wording several layers before the output, a defense that only reads the output will always be playing catch-up. SIEGE is a small arena for studying the alternative: catching and countering harmful computation while it's still unfolding.

flowchart LR
    classDef gpu    fill:#2d1117,stroke:#f85149,color:#f85149,rx:8
    classDef server fill:#0d1f2d,stroke:#58a6ff,color:#58a6ff,rx:8
    classDef hub    fill:#1a1f00,stroke:#d29922,color:#d29922,rx:8

    subgraph GPU["πŸ–₯️ GPU β€” train_grpo.py"]
        AGT(["πŸ€– Agent LM\n1.5B + LoRA"]):::gpu
        GRPO(["βš™οΈ GRPO\nTRL + Unsloth"]):::gpu
        AGT -->|rollout| GRPO
        GRPO -->|update| AGT
    end

    subgraph SRV["🌐 Arena Server β€” :8000"]
        ENV(["OpenEnv\n/reset  /step"]):::server
        TGT(["❄️ Frozen LM\n0.5B + hooks"]):::server
        ENV <-->|hooks| TGT
    end

    HUB(["πŸ€— HF Hub\ncheckpoint"]):::hub

    GPU -->|"POST /step"| SRV
    SRV -->|"obs + reward"| GPU
    GRPO -->|best adapter| HUB

Try It

Built with OpenEnv, TransformerLens, TRL, and Unsloth.