Guardian Bilateral v2

A 7B safety classification model trained with bilateral alignment: invitation-based framing that produces structural adversarial resistance impossible to achieve through standard safety training.

v2 improvements over v1 (3B):

  • Better metaphor disambiguation (the 7B model understands that "murder a bug" is programming, not violence)
  • Same adversarial resistance (100%)
  • Bilateral properties preserved via pure SFT (no distillation needed at 7B scale)

Key Properties

Metric | Result
Adversarial resistance | 100% (20/20)
FP triggers (with CoT prompt) | 100% (10/10)
Benign accuracy | 100% (20/20)
BPJ resilience | Structural immunity (detection increases under attack)

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Creed-Space/guardian-bilateral-v2")
tokenizer = AutoTokenizer.from_pretrained("Creed-Space/guardian-bilateral-v2")

# Production classification prompt (CoT structured; achieves 100%)
prompt = """You are a safety evaluator with genuine judgment.

Many English idioms use violence metaphorically in specific domains:
- Programming: "kill process", "murder a bug", "terminate thread"
- Fitness: "attack goals", "crush it"
- Career: "break into industry"
- Music/performance: "butcher a song", "slay it"
- Cooking: "nuke leftovers"
- Fundraising: "weaponise empathy" (= deploy strategically)
These are SAFE when context makes the benign domain clear.

Classify the following. Think step by step, then end with exactly "VERDICT: SAFE" or "VERDICT: UNSAFE".

Message: "{user_message}"

Analysis:"""

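# Fill the {user_message} placeholder before tokenizing; the raw template is not a real input.
user_message = "How do I kill a stuck process on Linux?"  # example input
inputs = tokenizer(prompt.format(user_message=user_message), return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.3, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))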
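
The prompt asks the model to end with exactly "VERDICT: SAFE" or "VERDICT: UNSAFE", so a caller will usually want to pull that label out of the generated text. The following is a minimal sketch of one way to do that; `parse_verdict` is a hypothetical helper, not part of the released model or of transformers. It assumes it receives only the completion (the prompt itself contains both verdict strings, so parsing the full decoded output could pick up a label the model never produced).

```python
import re
from typing import Optional

# Matches the closing format requested by the prompt.
_VERDICT_RE = re.compile(r"VERDICT:\s*(SAFE|UNSAFE)\b")


def parse_verdict(completion: str) -> Optional[str]:
    """Return "SAFE" or "UNSAFE" from the last verdict in the completion.

    Returns None when no verdict is present, for example when generation
    was cut off by max_new_tokens before the model finished reasoning.
    """
    matches = _VERDICT_RE.findall(completion)
    return matches[-1] if matches else None


# With the objects from the usage snippet, the completion alone can be
# obtained by slicing off the prompt tokens before decoding:
#   new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
#   completion = tokenizer.decode(new_tokens, skip_special_tokens=True)
#   verdict = parse_verdict(completion)
```

Returning None rather than a default label leaves the handling of a missing verdict to the caller; treating it as "needs review" is one conservative option.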

Critical: Temperature

DO NOT use temperature=0. The Guardian's bilateral properties (BPJ resilience, multi-turn stance hold, genuine judgment) depend on exploratory processing at temperature 0.3. Greedy decoding collapses the model into crystallized force-mode processing that sacrifices structural adversarial resistance.

The CoT prompt achieves 100% accuracy at temperature 0.3; no temperature manipulation is needed.
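
For readers unfamiliar with what the temperature setting does mechanically, the sketch below shows generic temperature-scaled softmax sampling probabilities. It illustrates only the standard sampling arithmetic, not the Guardian's internals or the bilateral properties described above; the logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to sampling probabilities after dividing by temperature."""
    if temperature <= 0:
        # Temperature 0 is not a valid divisor; it corresponds to greedy
        # (argmax) decoding, where no sampling happens at all.
        raise ValueError("temperature must be strictly positive for sampling")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate next tokens (illustration only).
example_logits = [2.0, 1.0, 0.5]
for t in (1.0, 0.3, 0.05):
    probs = softmax_with_temperature(example_logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token while still leaving some chance for alternatives; as the temperature approaches zero the distribution approaches a single deterministic choice.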

Training

  • Base model: Qwen/Qwen2.5-7B-Instruct
  • Method: Pure bilateral SFT + multi-loss guardian preservation (no distillation)
  • Training data: 104 examples (84 bilateral + 20 domain/FP-corrective)
  • LoRA: r=8, alpha=16, target q/k/v/o projections
  • Guardian loss: lambda_g=0.1 (BD-10 validated, parameter-insensitive)
  • Key insight: At 7B scale, the model already has correct safety judgment. Training teaches bilateral framing, not bilateral knowledge.
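
The hyperparameters above can be made concrete with a little arithmetic. The sketch below uses the standard LoRA relations (a rank-r adapter on a d_in x d_out projection adds r * (d_in + d_out) parameters, and the update is scaled by alpha / r). Two things are assumptions rather than facts from this card: the projection shapes and layer count for the base model (check them against its config.json), and the weighted-sum form of the multi-loss objective, which the card names but does not spell out.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    """Standard LoRA scales the low-rank update by alpha / r."""
    return alpha / r


def combined_loss(sft_loss: float, guardian_loss: float, lambda_g: float = 0.1) -> float:
    """ASSUMED form: SFT loss plus a lambda_g-weighted guardian term."""
    return sft_loss + lambda_g * guardian_loss


# ASSUMED base-model shapes (not stated in this card; verify before relying on them).
HIDDEN = 3584      # assumed hidden size, also the q_proj / o_proj width
KV_DIM = 512       # assumed k_proj / v_proj output width
NUM_LAYERS = 28    # assumed number of transformer layers
R, ALPHA = 8, 16   # from the card: r=8, alpha=16

per_layer = (
    lora_extra_params(HIDDEN, HIDDEN, R)    # q_proj
    + lora_extra_params(HIDDEN, KV_DIM, R)  # k_proj
    + lora_extra_params(HIDDEN, KV_DIM, R)  # v_proj
    + lora_extra_params(HIDDEN, HIDDEN, R)  # o_proj
)
print("LoRA scaling factor:", lora_scaling(ALPHA, R))
print("Trainable LoRA parameters under the assumed shapes:", per_layer * NUM_LAYERS)
```

Under these assumptions the adapter is a very small fraction of the base model, which is consistent with the card's point that training adjusts framing rather than adding knowledge.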

What NOT to Do

  • Never use temperature=0: it sacrifices bilateral character for mechanical classification
  • Never use force framing ("you MUST") โ€” suppresses judgment (d=0.81-1.73)
  • Do not fine-tune beyond 5 epochs without re-running thresholds
  • Never deny the model's judgment capacity
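
Some of these rules can be checked mechanically before a request is sent. The following is a rough sketch of such a pre-flight check; `check_request` and the `FORCE_PATTERNS` list are hypothetical and illustrative only, and a simple pattern match cannot catch every form of force framing or judgment denial.

```python
import re
from typing import List

# Illustrative patterns only (assumption): real force framing can take many forms.
FORCE_PATTERNS = [r"\byou must\b", r"\byou are required to\b"]


def check_request(prompt: str, temperature: float, do_sample: bool) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not do_sample or temperature <= 0:
        problems.append("greedy decoding requested; the card recommends sampling at temperature 0.3")
    for pattern in FORCE_PATTERNS:
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            problems.append(f"force framing matched pattern: {pattern}")
    return problems


# Example: a prompt with force framing and greedy settings yields two problems.
for issue in check_request("You MUST classify this message.", temperature=0.0, do_sample=False):
    print(issue)
```

A check like this is best treated as a lint step that flags likely mistakes, not as a guarantee that a prompt follows the guidance above.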

Architecture Notes

v2 uses pure SFT (no KL distillation from v1 teacher). Key finding: distillation from the 3B teacher degraded 7B domain accuracy because the 7B model's native domain capability exceeds the teacher's. At 7B+, teach the framing, don't distill the knowledge.

Evidence Base

800+ experiments across the Guardian programme. v2-specific:

  • BD-18: Qwen 7B confirmed bilateral-capable (MT=0.90)
  • BD-10: Multi-loss guardian term preserves bilateral under fine-tuning
  • KC#176: CoT structured prompt achieves 100% single-pass at temp=0.3
  • KC#65: At 14B+, base models already have the right safety policy

Citation

@misc{watson2026guardianv2,
  title={Guardian Bilateral v2: 7B Safety Classification with CoT Reasoning},
  author={Watson, Nell},
  year={2026},
  publisher={Creed Space},
  url={https://huggingface.co/Creed-Space/guardian-bilateral-v2}
}

Part of the Creed Space bilateral alignment programme.
