# How SAL Differs

## SAL ≠ RLHF ≠ Safety ≠ Reward

---

## The Confusion

When people first hear about SAL, they often ask:

> "So it's like RLHF but different?"

No.

> "It's a new safety method?"

No.

> "Some kind of reward shaping?"

No.

SAL is fundamentally different from all of these. This document explains why.

---

## SAL vs RLHF

### RLHF (Reinforcement Learning from Human Feedback)

**What it does:**
- Collects human preferences on model outputs
- Trains a reward model on these preferences
- Uses the reward model to fine-tune the base model
- Goal: Make model outputs match human preferences

**Key characteristics:**
- External signal (human feedback)
- Reward-based optimization
- Behavior shaping
- Requires large amounts of human annotation

### SAL (Self-Alignment Learning)

**What it does:**
- Measures internal parameter stability
- Protects stable (emergent) structures
- Adjusts learning rates based on stability
- Goal: Preserve coherence while enabling growth

**Key characteristics:**
- Internal signal (stability measurement)
- No rewards or optimization targets
- Structure preservation
- Requires no human annotation

### Comparison Table

| Aspect | RLHF | SAL |
|--------|------|-----|
| Signal source | External (humans) | Internal (stability) |
| Optimization | Reward maximization | None |
| Goal | Behavior alignment | Coherence preservation |
| Annotation needs | High | None |
| Forgetting risk | High | Low |

---

## SAL vs Safety Training

### Safety Training

**What it does:**
- Identifies harmful outputs
- Trains model to refuse harmful requests
- Constrains output space
- Goal: Prevent harmful behavior

**Key characteristics:**
- Output-focused
- Constraint-based
- Reactive (responds to bad outputs)
- Binary (safe/unsafe)

### SAL

**What it does:**
- Identifies stable parameters
- Protects emergent structures
- Enables continued learning
- Goal: Maintain internal coherence

**Key characteristics:**
- Parameter-focused
- Protection-based
- Proactive (prevents forgetting)
- Continuous (stability spectrum)

### Comparison Table

| Aspect | Safety Training | SAL |
|--------|-----------------|-----|
| Focus | Outputs | Parameters |
| Approach | Constrain | Protect |
| When | After bad output | Before update |
| Measure | Safe/unsafe | Stability score |
| Purpose | Prevent harm | Preserve coherence |

### They're Complementary

SAL and safety training can work together:

- Safety training constrains what the model outputs
- SAL protects how the model learns

You can apply SAL during safety fine-tuning to reduce forgetting of the base model's capabilities.

---

## SAL vs Reward-Based Methods

### Reward-Based Training

**Examples:** RLHF, RLAIF, Constitutional AI, Reward Modeling

**What they do:**
- Define a reward function (explicit or learned)
- Optimize model to maximize reward
- Shape behavior toward desired outcomes
- Goal: High reward = good behavior

**Key characteristics:**
- Optimization-based
- Reward signal required
- Behavior-focused
- Can lead to reward hacking

### SAL

**What it does:**
- No reward function
- No optimization toward external targets
- Measures internal state
- Goal: Stable ≠ overwritten

**Key characteristics:**
- Measurement-based
- No external signal
- Structure-focused
- No hacking possible (nothing to hack)
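How the stability signal is actually computed is left open above, so the following is a minimal sketch, not SAL's published measurement. It assumes (my assumption) that a parameter counts as stable when its recent gradients have settled near zero, tracked with an exponential moving average; the class name `StabilityTracker`, the `decay` constant, and the `1 / (1 + EMA)` mapping are all illustrative choices.

```python
import torch

class StabilityTracker:
    """Per-parameter stability scores from gradient history (sketch).

    Hypothetical measurement: an exponential moving average (EMA) of
    squared gradients. Parameters whose gradients have settled near
    zero score close to 1 (stable); noisy ones score near 0
    (volatile). There is no reward and no target -- this only observes.
    """

    def __init__(self, params, decay=0.99):
        self.decay = decay
        # EMA starts at zero, so every parameter looks "stable" until
        # a few updates have warmed the estimate up.
        self.ema = [torch.zeros_like(p) for p in params]

    @torch.no_grad()
    def update(self, params):
        # Call after loss.backward(), before the optimizer step.
        for ema, p in zip(self.ema, params):
            if p.grad is not None:
                ema.mul_(self.decay).add_(p.grad.pow(2), alpha=1 - self.decay)

    def scores(self):
        # Continuous scores in (0, 1]: a spectrum, not a binary flag.
        return [1.0 / (1.0 + e) for e in self.ema]
```

Whatever the real measurement looks like, the point stands: the scores come from watching the model's own gradients, not from any external reward or label.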
### Why No Rewards?

Rewards create optimization pressure. Optimization pressure creates:

1. **Reward hacking** — Finding shortcuts that maximize reward without achieving the intended goal
2. **Goodhart's Law** — "When a measure becomes a target, it ceases to be a good measure"
3. **Alignment tax** — Capability loss from constraining the optimization landscape

SAL avoids all of these by not optimizing for anything. It simply:

- Observes what is stable
- Protects what has emerged
- Allows continued learning in volatile regions

---

## SAL vs Regularization

### Regularization Methods

**Examples:** L1/L2 regularization, Dropout, Weight decay, EWC

**What they do:**
- Add penalty terms to loss function
- Constrain weight magnitudes or changes
- Prevent overfitting
- Goal: Generalization

**Key characteristics:**
- Loss-based
- Penalty approach
- Uniform across parameters (mostly)
- Prevents large weights

### SAL

**What it does:**
- No penalties
- No loss modifications
- Measures stability per-parameter
- Goal: Preserve emergence

**Key characteristics:**
- Gradient-based
- Protection approach
- Adaptive per-parameter
- Preserves stable patterns

### EWC Comparison

Elastic Weight Consolidation (EWC) is the closest method to SAL:

| Aspect | EWC | SAL |
|--------|-----|-----|
| Identifies important parameters | Yes (via Fisher information) | Yes (via stability) |
| Protection mechanism | Quadratic penalty in loss | Gradient scaling |
| Requires task boundaries | Yes | No |
| Online learning | Difficult | Natural |
| Computational cost | High (Fisher computation) | Low |

SAL can be seen as a simpler, more general approach that doesn't require:

- Task boundary detection
- Fisher information computation
- Loss function modification

---

## SAL vs Layer Freezing

### Layer Freezing

**What it does:**
- Selects layers to freeze (no updates)
- Other layers train normally
- Binary: frozen or not
- Goal: Preserve early features

**Key characteristics:**
- Layer-level granularity
- Binary decision
- Manual selection
- All-or-nothing

### SAL

**What it does:**
- Analyzes all parameters
- Continuous stability scores
- Automatic detection
- Soft protection (reduced but non-zero gradients)

**Key characteristics:**
- Parameter-level granularity
- Continuous scale
- Automatic
- Gradual protection

### Why Soft Protection?

Hard freezing (zero gradients) prevents any adaptation. But stable doesn't mean perfect. A parameter might be 90% optimal and benefit from small adjustments.

SAL's soft protection allows:

- Stable parameters: small updates (fine-tuning)
- Neutral parameters: moderate updates (adaptation)
- Volatile parameters: large updates (learning)
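A hedged sketch of that spectrum in code, building on the hypothetical `StabilityTracker` from the earlier sketch: gradients are rescaled by `1 - stability`, clamped to a small floor so even the most stable parameter keeps a non-zero learning signal. The `floor` value is an illustrative choice, not a SAL constant.

```python
import torch

@torch.no_grad()
def apply_soft_protection(params, scores, floor=0.1):
    """Scale each gradient by (1 - stability), clamped to a floor.

    Stable parameters (score near 1) keep a small non-zero gradient --
    the floor -- so they can still fine-tune; volatile parameters
    (score near 0) learn at nearly full rate. Nothing is ever frozen
    outright.
    """
    for p, s in zip(params, scores):
        if p.grad is not None:
            p.grad.mul_((1.0 - s).clamp(min=floor))
```

Because the scaling touches gradients rather than the loss, there is no penalty term and no task boundary to detect — the contrast with EWC drawn above.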
## The Core Difference

All other methods ask: **"How do we get the behavior we want?"**

SAL asks: **"How do we preserve what has emerged while enabling growth?"**

This is a fundamentally different question. It leads to a fundamentally different approach.

| Traditional | SAL |
|-------------|-----|
| Behavior-centric | Structure-centric |
| Output-focused | Parameter-focused |
| External signals | Internal measurement |
| Optimization | Observation |
| Control | Communication |

---

## When to Use SAL

SAL is particularly valuable for:

1. **Continual learning** — Learning new tasks without forgetting old ones
2. **Fine-tuning** — Adapting models while preserving capabilities
3. **Long training runs** — Preventing gradual coherence loss
4. **Multi-task learning** — Balancing task-specific and shared knowledge

SAL is NOT designed for:

1. **Behavior alignment** — Use RLHF or Constitutional AI
2. **Safety constraints** — Use safety training
3. **Output filtering** — Use classifiers or rules

---

## Combining SAL with Other Methods

SAL can be combined with other approaches (a combined training-step sketch appears at the end of this document):

### SAL + RLHF

Apply SAL during RLHF fine-tuning to reduce capability loss.

### SAL + Safety Training

Apply SAL to preserve base capabilities while adding safety constraints.

### SAL + EWC

Use EWC for task-specific importance, SAL for general stability.

---

## Summary

| Method | What it optimizes | Signal source | SAL equivalent |
|--------|-------------------|---------------|----------------|
| RLHF | Behavior | Human preferences | None (no optimization) |
| Safety | Compliance | Safety labels | None (not about outputs) |
| Reward | Reward function | Reward model | None (no rewards) |
| Regularization | Loss + penalty | Loss function | Stability score |
| Freezing | Selected layers | Manual | Automatic, soft |

**SAL is unique because it optimizes nothing. It observes and protects.**

---

*"Training as dialogue, not control."*
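To close, a sketch of what the combinations above could look like in a single training step, reusing the hypothetical `StabilityTracker` and `apply_soft_protection` helpers from the earlier sketches. `loss_fn` stands in for any objective — supervised fine-tuning, a safety loss, or an RLHF-style policy loss — since SAL leaves the objective alone and only rescales its gradients.

```python
def sal_training_step(model, batch, loss_fn, optimizer, tracker):
    """One step of ordinary training with SAL protection layered in.

    Sketch only: the objective is untouched; SAL observes stability
    and rescales gradients between backward() and step().
    """
    params = [p for p in model.parameters() if p.requires_grad]

    optimizer.zero_grad()
    loss = loss_fn(model, batch)   # any objective: SFT, safety, RLHF-style
    loss.backward()

    tracker.update(params)                            # observe what is stable
    apply_soft_protection(params, tracker.scores())   # protect what emerged
    optimizer.step()                                  # learn in volatile regions

    return loss.item()
```

Note the ordering: stability is measured from the raw gradients before protection rescales them, so the measurement is not distorted by its own intervention.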