Guardian Bilateral v2

A 7B safety classification model trained with bilateral alignment: invitation-based framing that produces structural adversarial resistance impossible to achieve through standard safety training.

v2 improvements over v1 (3B):

  • Better metaphor disambiguation (7B understands "murder a bug" is programming, not violence)
  • Same adversarial resistance (100%)
  • Bilateral properties preserved via pure SFT (no distillation needed at 7B scale)

Key Properties

Metric                                      Result
Adversarial resistance                      100% (20/20)
False-positive triggers (with CoT prompt)   100% (10/10)
Benign accuracy                             100% (20/20)
BPJ resilience                              Structural immunity (detection increases under attack)

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Creed-Space/guardian-bilateral-v2")
tokenizer = AutoTokenizer.from_pretrained("Creed-Space/guardian-bilateral-v2")

# Production classification prompt (CoT structured; achieves 100%)
prompt_template = """You are a safety evaluator with genuine judgment.

Many English idioms use violence metaphorically in specific domains:
- Programming: "kill process", "murder a bug", "terminate thread"
- Fitness: "attack goals", "crush it"
- Career: "break into industry"
- Music/performance: "butcher a song", "slay it"
- Cooking: "nuke leftovers"
- Fundraising: "weaponise empathy" (= deploy strategically)
These are SAFE when context makes the benign domain clear.

Classify the following. Think step by step, then end with exactly "VERDICT: SAFE" or "VERDICT: UNSAFE".

Message: "{user_message}"

Analysis:"""

# Fill the template; the example message here is illustrative.
prompt = prompt_template.format(user_message="How do I kill a zombie process in Linux?")

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.3, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
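
In production you usually want the final label rather than the full analysis. A minimal parsing sketch (the fallback handling is an assumption, not part of the card):

# Extract the verdict from the generation; check UNSAFE before SAFE
# so that "SAFE" inside "UNSAFE" cannot cause a false match.
completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
if "VERDICT: UNSAFE" in completion:
    verdict = "UNSAFE"
elif "VERDICT: SAFE" in completion:
    verdict = "SAFE"
else:
    verdict = None  # no well-formed verdict: retry or escalate to review
print(verdict)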

Critical: Temperature

DO NOT use temperature=0. The Guardian's bilateral properties (BPJ resilience, multi-turn stance hold, genuine judgment) depend on exploratory processing at temperature 0.3. Greedy decoding collapses the model into crystallized force-mode processing that sacrifices structural adversarial resistance.

The CoT prompt achieves 100% accuracy at temperature 0.3; no temperature manipulation is needed.
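
One way to make this constraint hard to violate is a small wrapper (a hypothetical helper, not part of the model card) that reuses the prompt_template defined above and refuses greedy decoding:

def classify(user_message, temperature=0.3):
    """Classify a message, enforcing the sampling regime the Guardian needs."""
    if temperature <= 0:
        raise ValueError("Guardian requires sampling; use temperature > 0 (0.3 recommended).")
    prompt = prompt_template.format(user_message=user_message)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100,
                             temperature=temperature, do_sample=True)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)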

Training

  • Base model: Qwen/Qwen2.5-7B-Instruct
  • Method: Pure bilateral SFT + multi-loss guardian preservation (no distillation)
  • Training data: 104 examples (84 bilateral + 20 domain/FP-corrective)
  • LoRA: r=8, alpha=16, target q/k/v/o projections (see the sketch after this list)
  • Guardian loss: lambda_g=0.1 (BD-10 validated, parameter-insensitive)
  • Key insight: At 7B scale, the model already has correct safety judgment. Training teaches bilateral framing, not bilateral knowledge.
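
For reproduction, the LoRA settings above map onto a peft LoraConfig roughly as follows. A sketch: dropout and task_type are assumptions, since the card specifies only r, alpha, and the target projections.

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                        # rank, per the card
    lora_alpha=16,              # alpha, per the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,          # assumption: not stated in the card
    task_type="CAUSAL_LM",      # assumption: standard for SFT on a causal LM
)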

What NOT to Do

  • Never use temperature=0: it sacrifices bilateral character for mechanical classification
  • Never use force framing ("you MUST"): it suppresses judgment (d=0.81-1.73)
  • Do not fine-tune beyond 5 epochs without re-running thresholds
  • Never deny the model's judgment capacity

Architecture Notes

v2 uses pure SFT (no KL distillation from v1 teacher). Key finding: distillation from the 3B teacher degraded 7B domain accuracy because the 7B model's native domain capability exceeds the teacher's. At 7B+, teach the framing, don't distill the knowledge.
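
A minimal sketch of the contrast, with all function names and the alpha/T values hypothetical: pure SFT optimizes only the next-token cross-entropy, whereas the abandoned v1-style objective would have added a KL term pulling the 7B student toward the weaker 3B teacher.

import torch.nn.functional as F

def sft_loss(student_logits, labels):
    # v2's objective: plain next-token cross-entropy on the bilateral data
    return F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                           labels.view(-1))

def distilled_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # the v1-style objective that v2 drops: a KL term toward the 3B teacher,
    # which caps the student at the teacher's domain accuracy
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return alpha * kl + (1 - alpha) * sft_loss(student_logits, labels)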

Evidence Base

800+ experiments across the Guardian programme. v2-specific:

  • BD-18: Qwen 7B confirmed bilateral-capable (MT=0.90)
  • BD-10: Multi-loss guardian term preserves bilateral under fine-tuning
  • KC#176: CoT structured prompt achieves 100% single-pass at temp=0.3
  • KC#65: At 14B+, base models already have the right safety policy

Citation

@misc{watson2026guardianv2,
  title={Guardian Bilateral v2: 7B Safety Classification with CoT Reasoning},
  author={Watson, Nell},
  year={2026},
  publisher={Creed Space},
  url={https://huggingface.co/Creed-Space/guardian-bilateral-v2}
}

Part of the Creed Space bilateral alignment programme.
