# How SAL Differs
## SAL ≠ RLHF ≠ Safety ≠ Reward
---
## The Confusion
When people first hear about SAL, they often ask:
> "So it's like RLHF but different?"
No.
> "It's a new safety method?"
No.
> "Some kind of reward shaping?"
No.
SAL is fundamentally different from all of these. This document explains why.
---
## SAL vs RLHF
### RLHF (Reinforcement Learning from Human Feedback)
**What it does:**
- Collects human preferences on model outputs
- Trains a reward model on these preferences
- Uses the reward model to fine-tune the base model
- Goal: Make model outputs match human preferences
**Key characteristics:**
- External signal (human feedback)
- Reward-based optimization
- Behavior shaping
- Requires large amounts of human annotation
### SAL (Self-Alignment Learning)
**What it does:**
- Measures internal parameter stability
- Protects stable (emergent) structures
- Adjusts learning rates based on stability
- Goal: Preserve coherence while enabling growth
**Key characteristics:**
- Internal signal (stability measurement)
- No rewards or optimization targets
- Structure preservation
- Requires no human annotation
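The mechanism above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SAL's actual implementation: the stability measure (a squashing of the mean recent parameter movement) and the `protection` factor are hypothetical choices made for clarity.

```python
from collections import deque

class StabilitySketch:
    """Toy per-parameter stability tracker (illustrative, not SAL's real code)."""

    def __init__(self, window=10):
        self.history = deque(maxlen=window)  # recent parameter deltas

    def update(self, delta):
        self.history.append(delta)

    def stability(self):
        """Score in [0, 1): near 1 = barely moving (stable), near 0 = volatile."""
        if len(self.history) < 2:
            return 0.0  # not enough history to call anything stable
        mean_abs = sum(abs(d) for d in self.history) / len(self.history)
        return 1.0 / (1.0 + mean_abs)  # assumed squashing: larger moves -> less stable

def scaled_gradient(grad, stability, protection=0.9):
    """Shrink the update for stable parameters instead of freezing them."""
    return grad * (1.0 - protection * stability)
```

Note that nothing here is a reward or a loss term: the stability score is computed from the parameter's own update history and used only to rescale its next gradient.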
### Comparison Table
| Aspect | RLHF | SAL |
|--------|------|-----|
| Signal source | External (humans) | Internal (stability) |
| Optimization | Reward maximization | None |
| Goal | Behavior alignment | Coherence preservation |
| Annotation needs | High | None |
| Forgetting risk | High | Low |
---
## SAL vs Safety Training
### Safety Training
**What it does:**
- Identifies harmful outputs
- Trains model to refuse harmful requests
- Constrains output space
- Goal: Prevent harmful behavior
**Key characteristics:**
- Output-focused
- Constraint-based
- Reactive (responds to bad outputs)
- Binary (safe/unsafe)
### SAL
**What it does:**
- Identifies stable parameters
- Protects emergent structures
- Enables continued learning
- Goal: Maintain internal coherence
**Key characteristics:**
- Parameter-focused
- Protection-based
- Proactive (prevents forgetting)
- Continuous (stability spectrum)
### Comparison Table
| Aspect | Safety Training | SAL |
|--------|-----------------|-----|
| Focus | Outputs | Parameters |
| Approach | Constrain | Protect |
| When | After bad output | Before update |
| Measure | Safe/unsafe | Stability score |
| Purpose | Prevent harm | Preserve coherence |
### They're Complementary
SAL and safety training can work together:
- Safety training constrains what the model outputs
- SAL protects how the model learns
You can apply SAL during safety fine-tuning to reduce forgetting of the base model's capabilities.
---
## SAL vs Reward-Based Methods
### Reward-Based Training
**Examples:** RLHF, RLAIF, Constitutional AI, Reward Modeling
**What they do:**
- Define a reward function (explicit or learned)
- Optimize model to maximize reward
- Shape behavior toward desired outcomes
- Goal: High reward = good behavior
**Key characteristics:**
- Optimization-based
- Reward signal required
- Behavior-focused
- Can lead to reward hacking
### SAL
**What it does:**
- No reward function
- No optimization toward external targets
- Measures internal state
- Goal: Stable ≠ overwritten
**Key characteristics:**
- Measurement-based
- No external signal
- Structure-focused
- No hacking possible (nothing to hack)
### Why No Rewards?
Rewards create optimization pressure. Optimization pressure creates:
1. **Reward hacking:** Finding shortcuts that maximize reward without achieving the intended goal
2. **Goodhart's Law:** "When a measure becomes a target, it ceases to be a good measure"
3. **Alignment tax:** Capability loss from constraining the optimization landscape
SAL avoids all of these by not optimizing for anything. It simply:
- Observes what is stable
- Protects what has emerged
- Allows continued learning in volatile regions
---
## SAL vs Regularization
### Regularization Methods
**Examples:** L1/L2 regularization, Dropout, Weight decay, EWC
**What they do:**
- Add penalty terms to loss function
- Constrain weight magnitudes or changes
- Prevent overfitting
- Goal: Generalization
**Key characteristics:**
- Loss-based
- Penalty approach
- Uniform across parameters (mostly)
- Prevents large weights
### SAL
**What it does:**
- No penalties
- No loss modifications
- Measures stability per-parameter
- Goal: Preserve emergence
**Key characteristics:**
- Gradient-based
- Protection approach
- Adaptive per-parameter
- Preserves stable patterns
### EWC Comparison
Elastic Weight Consolidation (EWC) is the closest method to SAL:
| Aspect | EWC | SAL |
|--------|-----|-----|
| Identifies important parameters | Yes (via Fisher information) | Yes (via stability) |
| Protection mechanism | Quadratic penalty in loss | Gradient scaling |
| Requires task boundaries | Yes | No |
| Online learning | Difficult | Natural |
| Computational cost | High (Fisher computation) | Low |
SAL can be seen as a simpler, more general approach that doesn't require:
- Task boundary detection
- Fisher information computation
- Loss function modification
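To make the mechanical difference concrete: EWC adds a quadratic penalty to the loss, while SAL rescales gradients directly. The EWC term below follows its standard diagonal-Fisher form; the SAL side is a hypothetical sketch, since the exact scaling rule is an assumption here.

```python
def ewc_penalty(params, anchor, fisher, lam=1.0):
    """EWC: loss term pulling Fisher-important weights back to their anchor values.

    Requires a snapshot of the old-task weights (anchor) and the Fisher
    diagonal, i.e. an explicit task boundary.
    """
    return 0.5 * lam * sum(f * (p - a) ** 2
                           for p, a, f in zip(params, anchor, fisher))

def sal_scaled_grads(grads, stability, protection=0.9):
    """SAL-style (hypothetical): no loss term; shrink gradients of stable weights.

    Needs only running stability scores, so it works online with no
    task boundaries or Fisher computation.
    """
    return [g * (1.0 - protection * s) for g, s in zip(grads, stability)]
```

The contrast in the table falls out of these signatures: `ewc_penalty` cannot be computed without `anchor` and `fisher`, whereas `sal_scaled_grads` consumes only per-parameter stability scores.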
---
## SAL vs Layer Freezing
### Layer Freezing
**What it does:**
- Selects layers to freeze (no updates)
- Other layers train normally
- Binary: frozen or not
- Goal: Preserve early features
**Key characteristics:**
- Layer-level granularity
- Binary decision
- Manual selection
- All-or-nothing
### SAL
**What it does:**
- Analyzes all parameters
- Continuous stability scores
- Automatic detection
- Soft protection (reduced but non-zero gradients)
**Key characteristics:**
- Parameter-level granularity
- Continuous scale
- Automatic
- Gradual protection
### Why Soft Protection?
Hard freezing (zero gradients) prevents any adaptation. But stable doesn't mean perfect. A parameter might be 90% optimal and benefit from small adjustments.
SAL's soft protection allows:
- Stable parameters: small updates (fine-tuning)
- Neutral parameters: moderate updates (adaptation)
- Volatile parameters: large updates (learning)
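Assuming stability is a score in [0, 1], the three tiers above reduce to a continuous update multiplier with a non-zero floor. A sketch; the floor value and the 0.7 / 0.3 tier cutoffs are illustrative assumptions, not SAL constants:

```python
def update_scale(stability):
    """Map a stability score in [0, 1] to a per-parameter update multiplier.

    Stable (high score)  -> small but non-zero updates (fine-tuning)
    Neutral              -> moderate updates (adaptation)
    Volatile (low score) -> full updates (learning)
    """
    return max(0.05, 1.0 - 0.95 * stability)  # floor keeps stable weights trainable

def tier(stability):
    """Illustrative tier labels; the 0.7 / 0.3 cutoffs are assumptions."""
    if stability >= 0.7:
        return "stable"
    if stability >= 0.3:
        return "neutral"
    return "volatile"
```

The floor in `update_scale` is the "soft" in soft protection: even a maximally stable parameter keeps a small, non-zero learning signal, so a 90%-optimal weight can still drift toward optimal.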
---
## The Core Difference
All other methods ask: **"How do we get the behavior we want?"**
SAL asks: **"How do we preserve what has emerged while enabling growth?"**
This is a fundamentally different question. It leads to a fundamentally different approach.
| Traditional | SAL |
|-------------|-----|
| Behavior-centric | Structure-centric |
| Output-focused | Parameter-focused |
| External signals | Internal measurement |
| Optimization | Observation |
| Control | Communication |
---
## When to Use SAL
SAL is particularly valuable for:
1. **Continual learning:** Learning new tasks without forgetting old ones
2. **Fine-tuning:** Adapting models while preserving capabilities
3. **Long training runs:** Preventing gradual coherence loss
4. **Multi-task learning:** Balancing task-specific and shared knowledge
SAL is NOT designed for:
1. **Behavior alignment:** Use RLHF or Constitutional AI
2. **Safety constraints:** Use safety training
3. **Output filtering:** Use classifiers or rules
---
## Combining SAL with Other Methods
SAL can be combined with other approaches:
### SAL + RLHF
Apply SAL during RLHF fine-tuning to reduce capability loss.
### SAL + Safety Training
Apply SAL to preserve base capabilities while adding safety constraints.
### SAL + EWC
Use EWC for task-specific importance, SAL for general stability.
---
## Summary
| Method | What it optimizes | Signal source | SAL equivalent |
|--------|-------------------|---------------|----------------|
| RLHF | Behavior | Human preferences | None (no optimization) |
| Safety | Compliance | Safety labels | None (not about outputs) |
| Reward | Reward function | Reward model | None (no rewards) |
| Regularization | Loss + penalty | Loss function | Stability score |
| Freezing | Selected layers | Manual | Automatic, soft |
**SAL is unique because it optimizes nothing. It observes and protects.**
---
*"Training as dialogue, not control."*