
# How SAL Differs

SAL ≠ RLHF ≠ Safety ≠ Reward


## The Confusion

When people first hear about SAL, they often ask:

"So it's like RLHF but different?"

No.

"It's a new safety method?"

No.

"Some kind of reward shaping?"

No.

SAL is fundamentally different from all of these. This document explains why.


## SAL vs RLHF

### RLHF (Reinforcement Learning from Human Feedback)

What it does:

  • Collects human preferences on model outputs
  • Trains a reward model on these preferences
  • Uses the reward model to fine-tune the base model
  • Goal: Make model outputs match human preferences

Key characteristics:

  • External signal (human feedback)
  • Reward-based optimization
  • Behavior shaping
  • Requires large amounts of human annotation

### SAL (Self-Alignment Learning)

What it does:

  • Measures internal parameter stability
  • Protects stable (emergent) structures
  • Adjusts learning rates based on stability
  • Goal: Preserve coherence while enabling growth

Key characteristics:

  • Internal signal (stability measurement)
  • No rewards or optimization targets
  • Structure preservation
  • Requires no human annotation
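
The docs describe this mechanism only in prose, so the PyTorch sketch below is one plausible reading, not SAL's actual implementation: `SALProtector`, the EMA-of-recent-change stability proxy, and `min_scale` are all hypothetical names and choices introduced here for illustration.

```python
import torch

class SALProtector:
    """Hypothetical sketch of SAL-style soft protection.

    Tracks a per-parameter stability score in [0, 1] as an exponential
    moving average of how little each parameter has changed recently,
    then scales gradients down where stability is high.
    """

    def __init__(self, model, beta=0.99, min_scale=0.1, eps=1e-8):
        self.model = model
        self.beta = beta            # EMA smoothing for the stability score
        self.min_scale = min_scale  # stable params keep at least this gradient fraction
        self.eps = eps
        self.prev = {n: p.detach().clone() for n, p in model.named_parameters()}
        self.stability = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

    @torch.no_grad()
    def observe(self):
        """Update stability scores: small recent change -> score near 1 (stable)."""
        for n, p in self.model.named_parameters():
            change = (p - self.prev[n]).abs()
            score = 1.0 / (1.0 + change / (p.abs() + self.eps))
            self.stability[n].mul_(self.beta).add_(score, alpha=1 - self.beta)
            self.prev[n].copy_(p)

    @torch.no_grad()
    def protect(self):
        """Scale gradients: stable -> small updates, volatile -> full updates."""
        for n, p in self.model.named_parameters():
            if p.grad is not None:
                scale = 1.0 - (1.0 - self.min_scale) * self.stability[n]
                p.grad.mul_(scale)
```

In a training loop, `protect()` would sit between `loss.backward()` and `optimizer.step()`, with `observe()` called after the step; the `min_scale` floor is what keeps protection soft rather than a hard freeze.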

### Comparison Table

| Aspect | RLHF | SAL |
|---|---|---|
| Signal source | External (humans) | Internal (stability) |
| Optimization | Reward maximization | None |
| Goal | Behavior alignment | Coherence preservation |
| Annotation needs | High | None |
| Forgetting risk | High | Low |

## SAL vs Safety Training

### Safety Training

What it does:

  • Identifies harmful outputs
  • Trains model to refuse harmful requests
  • Constrains output space
  • Goal: Prevent harmful behavior

Key characteristics:

  • Output-focused
  • Constraint-based
  • Reactive (responds to bad outputs)
  • Binary (safe/unsafe)

### SAL

What it does:

  • Identifies stable parameters
  • Protects emergent structures
  • Enables continued learning
  • Goal: Maintain internal coherence

Key characteristics:

  • Parameter-focused
  • Protection-based
  • Proactive (prevents forgetting)
  • Continuous (stability spectrum)

### Comparison Table

| Aspect | Safety Training | SAL |
|---|---|---|
| Focus | Outputs | Parameters |
| Approach | Constrain | Protect |
| When | After bad output | Before update |
| Measure | Safe/unsafe | Stability score |
| Purpose | Prevent harm | Preserve coherence |

### They're Complementary

SAL and safety training can work together:

  • Safety training constrains what the model outputs
  • SAL protects how the model learns

You can apply SAL during safety fine-tuning to reduce forgetting of the base model's capabilities.
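
As a sketch of how that combination might look, reusing the hypothetical `SALProtector` from above (`model`, `optimizer`, `safety_dataloader`, and `safety_loss` are placeholders):

```python
# Hypothetical combination: an ordinary safety fine-tuning objective,
# with SAL-style gradient protection layered on top.
protector = SALProtector(model)

for batch in safety_dataloader:
    optimizer.zero_grad()
    loss = safety_loss(model, batch)  # standard safety fine-tuning loss
    loss.backward()
    protector.protect()    # damp updates to stable (emergent) parameters
    optimizer.step()
    protector.observe()    # refresh stability scores after the step
```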


## SAL vs Reward-Based Methods

### Reward-Based Training

Examples: RLHF, RLAIF, Constitutional AI, Reward Modeling

What they do:

  • Define a reward function (explicit or learned)
  • Optimize model to maximize reward
  • Shape behavior toward desired outcomes
  • Goal: High reward = good behavior

Key characteristics:

  • Optimization-based
  • Reward signal required
  • Behavior-focused
  • Can lead to reward hacking

### SAL

What it does:

  • No reward function
  • No optimization toward external targets
  • Measures internal state
  • Goal: Stable ≠ overwritten

Key characteristics:

  • Measurement-based
  • No external signal
  • Structure-focused
  • No reward hacking (there is no reward to hack)

### Why No Rewards?

Rewards create optimization pressure. Optimization pressure creates:

  1. Reward hacking: finding shortcuts that maximize reward without achieving the intended goal
  2. Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure"
  3. Alignment tax: capability loss from constraining the optimization landscape

SAL avoids all of these by not optimizing for anything. It simply:

  • Observes what is stable
  • Protects what has emerged
  • Allows continued learning in volatile regions

## SAL vs Regularization

### Regularization Methods

Examples: L1/L2 regularization, Dropout, Weight decay, EWC

What they do:

  • Add penalty terms to loss function
  • Constrain weight magnitudes or changes
  • Prevent overfitting
  • Goal: Generalization

Key characteristics:

  • Loss-based
  • Penalty approach
  • Uniform across parameters (mostly)
  • Prevents large weights

### SAL

What it does:

  • No penalties
  • No loss modifications
  • Measures stability per-parameter
  • Goal: Preserve emergence

Key characteristics:

  • Gradient-based
  • Protection approach
  • Adaptive per-parameter
  • Preserves stable patterns

### EWC Comparison

Elastic Weight Consolidation (EWC) is the closest method to SAL:

| Aspect | EWC | SAL |
|---|---|---|
| Identifies important parameters | Yes (via Fisher information) | Yes (via stability) |
| Protection mechanism | Quadratic penalty in loss | Gradient scaling |
| Requires task boundaries | Yes | No |
| Online learning | Difficult | Natural |
| Computational cost | High (Fisher computation) | Low |

SAL can be seen as a simpler, more general approach that doesn't require:

  • Task boundary detection
  • Fisher information computation
  • Loss function modification
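
To make the mechanical difference concrete, here is a hedged sketch of both mechanisms. The `ewc_loss` term follows the standard EWC formulation, where `fisher` and `theta_star` come from a completed previous task; `sal_protect` reuses the hypothetical gradient-scaling sketch from earlier, where `stability` is maintained continuously.

```python
import torch

# EWC: protection lives in the loss; it needs Fisher information and a
# snapshot of the parameters (theta_star) taken at a task boundary.
def ewc_loss(task_loss, model, fisher, theta_star, lam=0.4):
    penalty = sum(
        (fisher[n] * (p - theta_star[n]).pow(2)).sum()
        for n, p in model.named_parameters()
    )
    return task_loss + (lam / 2.0) * penalty

# SAL (sketch): protection lives in the gradients; it needs only the
# continuously maintained per-parameter stability scores.
@torch.no_grad()
def sal_protect(model, stability, min_scale=0.1):
    for n, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(1.0 - (1.0 - min_scale) * stability[n])
```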

## SAL vs Layer Freezing

### Layer Freezing

What it does:

  • Selects layers to freeze (no updates)
  • Other layers train normally
  • Binary: frozen or not
  • Goal: Preserve early features

Key characteristics:

  • Layer-level granularity
  • Binary decision
  • Manual selection
  • All-or-nothing

### SAL

What it does:

  • Analyzes all parameters
  • Continuous stability scores
  • Automatic detection
  • Soft protection (reduced but non-zero gradients)

Key characteristics:

  • Parameter-level granularity
  • Continuous scale
  • Automatic
  • Gradual protection

### Why Soft Protection?

Hard freezing (zero gradients) prevents any adaptation. But stable doesn't mean perfect. A parameter might be 90% optimal and benefit from small adjustments.

SAL's soft protection allows:

  • Stable parameters: small updates (fine-tuning)
  • Neutral parameters: moderate updates (adaptation)
  • Volatile parameters: large updates (learning)
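
The contrast is easy to see in code. Both snippets below are illustrative sketches: `model.embeddings` is a placeholder module, and `stability` is the hypothetical per-parameter score from the earlier sketch.

```python
# Hard freezing: binary, layer-level, manual. Frozen layers never adapt.
for p in model.embeddings.parameters():
    p.requires_grad = False  # gradient is exactly zero from here on

# Soft protection (SAL sketch): continuous, per-parameter, automatic.
# With min_scale = 0.1, a parameter with stability 0.9 still receives
# 1 - 0.9 * 0.9 = 19% of its gradient, so it can keep fine-tuning.
for n, p in model.named_parameters():
    if p.grad is not None:
        p.grad.mul_(1.0 - 0.9 * stability[n])  # scale = 1 - (1 - min_scale) * s
```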

## The Core Difference

All other methods ask: "How do we get the behavior we want?"

SAL asks: "How do we preserve what has emerged while enabling growth?"

This is a fundamentally different question. It leads to a fundamentally different approach.

| Traditional | SAL |
|---|---|
| Behavior-centric | Structure-centric |
| Output-focused | Parameter-focused |
| External signals | Internal measurement |
| Optimization | Observation |
| Control | Communication |

## When to Use SAL

SAL is particularly valuable for:

  1. Continual learning: learning new tasks without forgetting old ones
  2. Fine-tuning: adapting models while preserving capabilities
  3. Long training runs: preventing gradual coherence loss
  4. Multi-task learning: balancing task-specific and shared knowledge

SAL is NOT designed for:

  1. Behavior alignment: use RLHF or Constitutional AI
  2. Safety constraints: use safety training
  3. Output filtering: use classifiers or rules

## Combining SAL with Other Methods

SAL can be combined with other approaches:

### SAL + RLHF

Apply SAL during RLHF fine-tuning to reduce capability loss.

### SAL + Safety Training

Apply SAL to preserve base capabilities while adding safety constraints.

### SAL + EWC

Use EWC for task-specific importance, SAL for general stability.


## Summary

| Method | What it optimizes | Signal source | SAL equivalent |
|---|---|---|---|
| RLHF | Behavior | Human preferences | None (no optimization) |
| Safety training | Compliance | Safety labels | None (not about outputs) |
| Reward-based methods | Reward function | Reward model | None (no rewards) |
| Regularization | Loss + penalty | Loss function | Stability score |
| Layer freezing | Selected layers | Manual selection | Automatic, soft protection |

SAL is unique because it optimizes nothing. It observes and protects.


"Training as dialogue, not control."