How SAL Differs
SAL ≠ RLHF ≠ Safety ≠ Reward
The Confusion
When people first hear about SAL, they often ask:
"So it's like RLHF but different?"
No.
"It's a new safety method?"
No.
"Some kind of reward shaping?"
No.
SAL is fundamentally different from all of these. This document explains why.
SAL vs RLHF
RLHF (Reinforcement Learning from Human Feedback)
What it does:
- Collects human preferences on model outputs
- Trains a reward model on these preferences
- Uses the reward model to fine-tune the base model
- Goal: Make model outputs match human preferences
Key characteristics:
- External signal (human feedback)
- Reward-based optimization
- Behavior shaping
- Requires large amounts of human annotation
SAL (Self-Alignment Learning)
What it does:
- Measures internal parameter stability
- Protects stable (emergent) structures
- Adjusts learning rates based on stability (sketched in code below)
- Goal: Preserve coherence while enabling growth
Key characteristics:
- Internal signal (stability measurement)
- No rewards or optimization targets
- Structure preservation
- Requires no human annotation
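Concretely, the loop can be pictured as: measure per-parameter stability, then scale that parameter's gradient before the optimizer step. The following is a minimal PyTorch sketch under stated assumptions; SAL does not prescribe a particular stability measure, so the gradient-magnitude EMA, the `temperature`, and the scaling formula here are illustrative choices, not the actual method.

```python
import torch

class SALGradientScaler:
    """Illustrative sketch only. Tracks a per-parameter stability estimate
    (here: an EMA of gradient magnitude) and damps gradients for parameters
    that have been quiet for a long time, while volatile parameters keep
    training at full strength."""

    def __init__(self, params, decay=0.99, min_scale=0.1, temperature=1e-3):
        self.params = list(params)
        self.decay = decay          # EMA decay for the stability estimate
        self.min_scale = min_scale  # floor so stable parameters still adapt a little
        self.temperature = temperature
        self.grad_ema = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def scale_gradients(self):
        """Call after loss.backward() and before optimizer.step()."""
        for p, ema in zip(self.params, self.grad_ema):
            if p.grad is None:
                continue
            raw = p.grad.abs()
            # Stability in [0, 1]: ~1 when this parameter's gradients have
            # historically been near zero, ~0 when it is still changing a lot.
            stability = torch.exp(-ema / self.temperature)
            p.grad.mul_(1.0 - (1.0 - self.min_scale) * stability)
            # Update the running estimate from the unscaled gradient.
            ema.mul_(self.decay).add_(raw, alpha=1 - self.decay)
```

Because `scale_gradients()` runs between `loss.backward()` and `optimizer.step()`, the loss function and optimizer stay untouched; only the effective per-parameter learning rate changes.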
Comparison Table
| Aspect | RLHF | SAL |
|---|---|---|
| Signal source | External (humans) | Internal (stability) |
| Optimization | Reward maximization | None |
| Goal | Behavior alignment | Coherence preservation |
| Annotation needs | High | None |
| Forgetting risk | High | Low |
SAL vs Safety Training
Safety Training
What it does:
- Identifies harmful outputs
- Trains model to refuse harmful requests
- Constrains output space
- Goal: Prevent harmful behavior
Key characteristics:
- Output-focused
- Constraint-based
- Reactive (responds to bad outputs)
- Binary (safe/unsafe)
SAL
What it does:
- Identifies stable parameters
- Protects emergent structures
- Enables continued learning
- Goal: Maintain internal coherence
Key characteristics:
- Parameter-focused
- Protection-based
- Proactive (prevents forgetting)
- Continuous (stability spectrum)
Comparison Table
| Aspect | Safety Training | SAL |
|---|---|---|
| Focus | Outputs | Parameters |
| Approach | Constrain | Protect |
| When | After bad output | Before update |
| Measure | Safe/unsafe | Stability score |
| Purpose | Prevent harm | Preserve coherence |
They're Complementary
SAL and safety training can work together:
- Safety training constrains what the model outputs
- SAL protects how the model learns
You can apply SAL during safety fine-tuning to reduce forgetting of the base model's capabilities.
SAL vs Reward-Based Methods
Reward-Based Training
Examples: RLHF, RLAIF, Constitutional AI, Reward Modeling
What they do:
- Define a reward function (explicit or learned)
- Optimize model to maximize reward
- Shape behavior toward desired outcomes
- Goal: High reward = good behavior
Key characteristics:
- Optimization-based
- Reward signal required
- Behavior-focused
- Can lead to reward hacking
SAL
What it does:
- No reward function
- No optimization toward external targets
- Measures internal state
- Goal: stable structures are not overwritten
Key characteristics:
- Measurement-based
- No external signal
- Structure-focused
- No hacking possible (nothing to hack)
Why No Rewards?
Rewards create optimization pressure. Optimization pressure creates:
- Reward hacking: finding shortcuts that maximize reward without achieving the intended goal
- Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure"
- Alignment tax: capability loss from constraining the optimization landscape
SAL avoids all of these by not optimizing for anything. It simply:
- Observes what is stable
- Protects what has emerged
- Allows continued learning in volatile regions
SAL vs Regularization
Regularization Methods
Examples: L1/L2 regularization, Dropout, Weight decay, EWC
What they do:
- Add penalty terms to loss function
- Constrain weight magnitudes or changes
- Prevent overfitting
- Goal: Generalization
Key characteristics:
- Loss-based
- Penalty approach
- Uniform across parameters (mostly)
- Prevents large weights
SAL
What it does:
- No penalties
- No loss modifications
- Measures stability per-parameter
- Goal: Preserve emergence
Key characteristics:
- Gradient-based
- Protection approach
- Adaptive per-parameter
- Preserves stable patterns
EWC Comparison
Elastic Weight Consolidation (EWC) is the closest method to SAL:
| Aspect | EWC | SAL |
|---|---|---|
| Identifies important parameters | Yes (via Fisher information) | Yes (via stability) |
| Protection mechanism | Quadratic penalty in loss | Gradient scaling |
| Requires task boundaries | Yes | No |
| Online learning | Difficult | Natural |
| Computational cost | High (Fisher computation) | Low |
SAL can be seen as a simpler, more general approach that doesn't require:
- Task boundary detection
- Fisher information computation
- Loss function modification
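To make that contrast concrete, here is a rough sketch of where each protection mechanism lives. The EWC term is the standard diagonal-Fisher quadratic penalty; `fisher` and `anchor_params` are assumed to be precomputed dictionaries keyed by parameter name.

```python
import torch

def ewc_penalty(model, fisher, anchor_params, lam=0.4):
    """EWC: protection lives inside the loss as a quadratic penalty,
    weighted by precomputed diagonal Fisher information."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - anchor_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# EWC training step: the objective itself is modified.
#   loss = task_loss + ewc_penalty(model, fisher, anchor_params)
#
# SAL-style training step (per the earlier sketch): the objective is untouched;
# only the gradients are rescaled between backward() and step(), so no task
# boundaries, Fisher pass, or loss modification are needed.
```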
SAL vs Layer Freezing
Layer Freezing
What it does:
- Selects layers to freeze (no updates)
- Other layers train normally
- Binary: frozen or not
- Goal: Preserve early features
Key characteristics:
- Layer-level granularity
- Binary decision
- Manual selection
- All-or-nothing
SAL
What it does:
- Analyzes all parameters
- Continuous stability scores
- Automatic detection
- Soft protection (reduced but non-zero gradients)
Key characteristics:
- Parameter-level granularity
- Continuous scale
- Automatic
- Gradual protection
Why Soft Protection?
Hard freezing (zero gradients) prevents any adaptation. But stable doesn't mean perfect: a parameter might be 90% optimal and still benefit from small adjustments.
SAL's soft protection allows:
- Stable parameters: small updates (fine-tuning)
- Neutral parameters: moderate updates (adaptation)
- Volatile parameters: large updates (learning)
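A tiny illustration of the difference, reusing the same scaling formula as the earlier sketch (the specific numbers are assumptions, not prescribed values):

```python
def update_scale(stability: float, min_scale: float = 0.1) -> float:
    """Soft protection: map a stability score in [0, 1] to a gradient
    multiplier, instead of freezing (multiplier 0) or not (multiplier 1)."""
    return 1.0 - (1.0 - min_scale) * stability

print(update_scale(0.95))  # stable   -> ~0.15: small, fine-tuning updates
print(update_scale(0.50))  # neutral  -> ~0.55: moderate updates
print(update_scale(0.05))  # volatile -> ~0.96: near-full updates
```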
The Core Difference
All other methods ask: "How do we get the behavior we want?"
SAL asks: "How do we preserve what has emerged while enabling growth?"
This is a fundamentally different question. It leads to a fundamentally different approach.
| Traditional | SAL |
|---|---|
| Behavior-centric | Structure-centric |
| Output-focused | Parameter-focused |
| External signals | Internal measurement |
| Optimization | Observation |
| Control | Communication |
When to Use SAL
SAL is particularly valuable for:
- Continual learning: learning new tasks without forgetting old ones
- Fine-tuning: adapting models while preserving capabilities
- Long training runs: preventing gradual coherence loss
- Multi-task learning: balancing between task-specific and shared knowledge
SAL is NOT designed for:
- Behavior alignment: use RLHF or Constitutional AI
- Safety constraints: use safety training
- Output filtering: use classifiers or rules
Combining SAL with Other Methods
SAL can be combined with other approaches:
SAL + RLHF
Apply SAL during RLHF fine-tuning to reduce capability loss.
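A hypothetical sketch of that combination, reusing the `SALGradientScaler` from the earlier sketch; `policy_model`, `ppo_loss`, and `rlhf_batches` are placeholders for whatever RLHF machinery is in use.

```python
scaler = SALGradientScaler(policy_model.parameters())

for batch in rlhf_batches:
    loss = ppo_loss(policy_model, batch)   # reward-driven RLHF objective
    optimizer.zero_grad()
    loss.backward()
    scaler.scale_gradients()               # damp updates to stable base-model structure
    optimizer.step()
```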
SAL + Safety Training
Apply SAL to preserve base capabilities while adding safety constraints.
SAL + EWC
Use EWC for task-specific importance, SAL for general stability.
Summary
| Method | What it optimizes | Signal source | SAL equivalent |
|---|---|---|---|
| RLHF | Behavior | Human preferences | None (no optimization) |
| Safety | Compliance | Safety labels | None (not about outputs) |
| Reward | Reward function | Reward model | None (no rewards) |
| Regularization | Loss + penalty | Loss function | Stability score |
| Freezing | Selected layers | Manual | Automatic, soft |
SAL is unique because it optimizes nothing. It observes and protects.
"Training as dialogue, not control."