tekkmaven/representation-learning-dynamics


tekkmaven committed 28 days ago

Commit 9ade2ba (verified)

1 Parent(s): b8de759

Add detailed findings analysis and future research directions

Files changed (1)

README.md +133 -30

README.md CHANGED

@@ -2,7 +2,9 @@
 **To know how a model forgets, it helps to know how it learns.**
-This repository studies how neural network internal representations change during training — contrasting what happens when a model continues learning the same task vs. when it switches to tasks of increasing dissimilarity. We find the precise tipping point where forgetting begins.
 ## Key Finding: The Gradient Alignment Cliff
@@ -18,23 +20,71 @@ We train a small transformer on modular addition, then fork into 5 branches trai
 **The critical observation**: `max` is the only task whose gradient alignment with addition drops to **near zero then goes negative** (−0.027), meaning its gradients actively oppose addition. This confirms [Laitinen 2026](https://arxiv.org/abs/2601.18699)'s finding that gradient alignment predicts forgetting — and reveals that it's not task "difficulty" but **representational incompatibility** that causes forgetting.
-### Why Max Causes Forgetting (and XOR Doesn't)
-From [Nanda et al. 2023](https://arxiv.org/abs/2301.05217): modular addition learns a **circular Fourier representation** where numbers are embedded as points on circles at specific frequencies. XOR, despite seeming "harder," operates bitwise — and bitwise operations on mod-97 integers can be partially decomposed into cyclic components, maintaining some Fourier compatibility.
-But `max(a,b)` requires a fundamentally **linear/ordinal** representation: the model must learn that 96 > 95 > 94 > ... > 0, a monotone ordering. This directly conflicts with circular Fourier embeddings where all numbers have equal norm on the circle. The gradient alignment trace shows this conflict developing:
-```
-Step   AddAcc  GA(add,max)  ← Gradient alignment drops smoothly
-  20   1.0000    0.902       Phase 2 starts
- 200   1.0000    0.259       Gradients diverging
- 500   1.0000    0.127       Near orthogonal
- 960   0.9996    0.036       ← First accuracy drop!
-1160   0.9902   -0.000       Alignment crosses zero
-1500   0.9902   -0.027       Gradients now oppose each other
-```
-The accuracy drop at step 960 occurs precisely when gradient alignment crosses ~0.04 — confirming that gradient alignment is an early warning signal for forgetting.
 ## Experiment Design
@@ -109,22 +159,6 @@ A 2-layer GPT-style transformer (~260K parameters):
 Configuration follows [Nanda et al. 2023](https://arxiv.org/abs/2301.05217). Pre-norm residual, GELU, weight-tied embeddings. Trains in ~5 minutes per branch on CPU.
-## What the Representations Tell Us
-### CKA Dynamics: All Tasks Drift Similarly
-Surprisingly, CKA drift from Phase 1 is nearly identical across all branches — even `max`. This means **representational geometry changes at a similar rate regardless of whether forgetting occurs**. CKA measures global structure, but forgetting is about *specific directions* within that structure.
-### Gradient Alignment: The Smoking Gun
-Only `max` shows gradient alignment dropping to zero and going negative. All other tasks maintain alignment >0.98. This means:
-- **Subtraction, multiplication, XOR**: their gradients *cooperate* with addition — they push the parameters in compatible directions
-- **Max**: its gradients *oppose* addition — optimizing for max actively degrades the addition circuit
-### Fourier Spectrum: Stability Under Disruption
-The embedding Fourier power spectrum barely changes across branches (concentration ~0.12 → 0.13). This suggests the model doesn't dramatically reorganize its frequency basis even when learning incompatible tasks — instead, it makes small adjustments that accumulate into functional interference.
 ## Theoretical Framework
 The graduated dissimilarity ladder is grounded in mechanistic interpretability:
@@ -149,6 +183,72 @@ from representation_tracker import (
 )
 ```
 ## References
 - **Nanda et al. 2023** — [Progress Measures for Grokking](https://arxiv.org/abs/2301.05217) — Fourier circuit for modular addition
@@ -163,6 +263,9 @@ from representation_tracker import (
 - **Zhang et al. 2025** — [Grokking in LLM Pretraining](https://arxiv.org/abs/2506.21551)
 - **Lam et al. 2025** — [Implicit Curriculum Hypothesis](https://arxiv.org/abs/2604.08510)
 - **Feature Emergence 2023** — [Margin Maximization](https://arxiv.org/abs/2311.07568) — Fourier sparsity for cyclic groups
 ## License

 **To know how a model forgets, it helps to know how it learns.**
+This repository studies how neural network internal representations change during training — contrasting what happens when a model continues learning the same task vs. when it switches to tasks of increasing dissimilarity. We find the precise tipping point where forgetting begins, trace its mechanism step by step, and identify what predicts it (and what doesn't).
+---
 ## Key Finding: The Gradient Alignment Cliff
 **The critical observation**: `max` is the only task whose gradient alignment with addition drops to **near zero then goes negative** (−0.027), meaning its gradients actively oppose addition. This confirms [Laitinen 2026](https://arxiv.org/abs/2601.18699)'s finding that gradient alignment predicts forgetting — and reveals that it's not task "difficulty" but **representational incompatibility** that causes forgetting.
+---
+## Findings
+### 1. Forgetting Is Caused by Representational Geometry Conflict, Not Task Complexity
+The dissimilarity ladder was designed so that "higher level" tasks would be progressively more different from addition. XOR (Level 4) is arguably the most complex and alien operation. Yet it causes **zero forgetting**. The task that *does* cause forgetting — `max` (Level 3) — is conceptually simpler.
+Why? From [Nanda et al. 2023](https://arxiv.org/abs/2301.05217), modular addition learns a **circular Fourier representation**: numbers are embedded as points on circles at specific frequencies, and the MLP computes trigonometric products. Subtraction uses the same circuit with a sign flip. Multiplication uses the same cyclic group structure via discrete logarithms ([Chughtai et al. 2023](https://arxiv.org/abs/2302.03025)). XOR, despite being bitwise, can be partially decomposed into cyclic components on Z/97Z — its Fourier spectrum remains compatible.
+But `max(a,b)` requires a fundamentally **linear/ordinal** representation. The model must learn that 96 > 95 > 94 > ... > 0 — a monotone total ordering. This directly conflicts with circular Fourier embeddings where all numbers sit at equal radius on the circle. There is no "bigger" or "smaller" on a circle. The ordinal geometry is **incompatible** with the cyclic geometry, and it's this geometric conflict — not computational difficulty — that drives forgetting.
+**Implication**: Forgetting is a geometric phenomenon. Two tasks that are computationally very different can coexist peacefully if their representational geometries are compatible. Two tasks that seem related can catastrophically interfere if they require incompatible embedding structure.
+### 2. Gradient Alignment Is the Early Warning — CKA Is Not
+We tracked CKA (Centered Kernel Alignment) and gradient alignment simultaneously across all branches. The result is striking:
+**CKA drift is nearly identical across all branches**, including `max`. Every branch shows CKA dropping from ~0.99 → ~0.53 over the course of Phase 2 training. CKA measures the global similarity of the representational geometry — and the global geometry shifts at the same rate no matter what task is being learned. This means **CKA cannot distinguish benign representation drift from destructive forgetting**.
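For reference, a minimal sketch of linear CKA between two activation matrices recorded on the same inputs. It assumes the linear-kernel variant; this diff does not say which kernel `representation_tracker` uses, and the name `linear_cka` and the toy matrices are illustrative only.

```python
import math

def _center(X):
    """Subtract the per-feature mean from each column of X (a list of rows)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]

def _cross_sq_frobenius(X, Y):
    """Squared Frobenius norm of X^T Y for row-major X (n x p) and Y (n x q)."""
    p, q = len(X[0]), len(Y[0])
    total = 0.0
    for i in range(p):
        for j in range(q):
            dot = sum(x[i] * y[j] for x, y in zip(X, Y))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA between activations X and Y recorded on the same n inputs."""
    Xc, Yc = _center(X), _center(Y)
    num = _cross_sq_frobenius(Xc, Yc)
    den = math.sqrt(_cross_sq_frobenius(Xc, Xc) * _cross_sq_frobenius(Yc, Yc))
    return num / den

acts = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
rotated = [[-r[1], r[0]] for r in acts]        # 90-degree rotation of acts
scaled = [[3.0 * v for v in r] for r in acts]  # uniform rescaling of acts

cka_rotated = linear_cka(acts, rotated)
cka_scaled = linear_cka(acts, scaled)
```

Both scores come out at 1 up to floating-point error: linear CKA is unchanged by rotations and uniform rescaling of the feature space, so it summarizes global geometry rather than any specific direction in it.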
+**Gradient alignment, by contrast, shows a clean separation**:
+| Branch | Gradient alignment trajectory |
+|--------|-------------------------------|
+| Add, Sub, Mul, XOR | Stable at 0.98–0.99 throughout |
+| **Max** | Smooth decay: 0.90 → 0.26 → 0.04 → **−0.03** |
+The gradient alignment for `max` vs addition decays smoothly over ~1000 steps, crossing zero at step ~1160. This means the signal is not sudden — it's a slow divergence that's detectable hundreds of steps before any accuracy drop. The accuracy drop happens at step 960, when gradient alignment is at 0.036 — already near zero but not yet negative.
+**Implication**: For continual learning systems, monitoring gradient alignment between the current training task and a held-out probe set from the old task is a reliable early-warning system. CKA is not — it tracks representation churn, not interference.
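A minimal sketch of the metric, assuming "gradient alignment" means the cosine similarity between the flattened gradient of the old-task probe loss and the flattened gradient of the new-task loss. That is the usual definition, but this diff does not spell it out. The function name `gradient_alignment` and the toy vectors are hypothetical.

```python
import math

def gradient_alignment(grad_old, grad_new):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(grad_old, grad_new))
    norm_old = math.sqrt(sum(a * a for a in grad_old))
    norm_new = math.sqrt(sum(b * b for b in grad_new))
    if norm_old == 0.0 or norm_new == 0.0:
        return 0.0  # convention chosen here: a zero gradient has no direction
    return dot / (norm_old * norm_new)

g_add = [1.0, 2.0, 0.0]
cooperating = gradient_alignment(g_add, [2.0, 4.0, 0.0])  # same direction
orthogonal = gradient_alignment(g_add, [0.0, 0.0, 5.0])   # no shared direction
opposing = gradient_alignment(g_add, [-1.0, -2.0, 0.0])   # opposite direction
```

In a real training loop the two vectors would be the concatenated parameter gradients from one backward pass per task on the same checkpoint.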
+### 3. The Three-Phase Forgetting Process
+The `max` branch traces a clean three-phase process:
+**Phase A — Coexistence (steps 0–500)**: Gradient alignment drops from 0.90 to 0.13, but addition accuracy holds at 100%. The model is learning max while its addition circuit is becoming increasingly orthogonal to the max circuit. The two circuits coexist because they haven't yet begun to compete for the same parameters.
+**Phase B — Interference onset (steps 500–960)**: Gradient alignment drops below 0.10. The circuits are now nearly orthogonal, and further max training begins to push parameters in directions that weakly oppose addition. At step 960, the first accuracy drop appears: 100% → 99.96%. The tipping point is crossed.
+**Phase C — Antagonistic equilibrium (steps 960–1500)**: Gradient alignment keeps falling, crosses zero near step 1160, and reaches −0.027 by step 1500. Once it is negative, the max gradients *actively degrade* the addition circuit. Addition accuracy stabilizes at 99.02%: the model has reached a new equilibrium where max performance is near-perfect (99.94%) at the cost of ~1% addition accuracy. Notably, accuracy does not continue to degrade over the steps logged here; the model finds a compromise point.
+**Implication**: Forgetting is not a catastrophic collapse but a negotiated settlement. The model finds a parameter configuration that approximately satisfies both tasks, at a small cost to the earlier one. The severity of forgetting depends on how far into the "antagonistic" phase training pushes.
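The phase boundaries can be read off the logged checkpoints mechanically. The sketch below uses the six `max`-branch rows reported in the earlier version of this README. The helper `first_step` and the 0.5 warning threshold are illustrative choices, not part of the repository.

```python
# (step, addition accuracy, gradient alignment) for the max branch
TRACE = [
    (20, 1.0000, 0.902),
    (200, 1.0000, 0.259),
    (500, 1.0000, 0.127),
    (960, 0.9996, 0.036),
    (1160, 0.9902, -0.000),
    (1500, 0.9902, -0.027),
]

def first_step(trace, predicate):
    """Return the first logged step whose row satisfies predicate, else None."""
    for step, acc, align in trace:
        if predicate(acc, align):
            return step
    return None

# Hypothetical warning threshold of 0.5 on gradient alignment.
warning_step = first_step(TRACE, lambda acc, align: align < 0.5)
drop_step = first_step(TRACE, lambda acc, align: acc < 1.0)
zero_cross_step = first_step(TRACE, lambda acc, align: align <= 0.0)
lead_time = drop_step - warning_step
```

With only six checkpoints the steps are coarse, but the ordering is visible: the warning fires at the step-200 checkpoint, the first accuracy drop is logged at step 960, and alignment reaches zero at step 1160.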
+### 4. Fourier Spectrum Stability
+The embedding Fourier power spectrum — which frequencies the token embeddings encode — barely changes across any branch. Fourier concentration (fraction of total power in the top 5 frequencies) stays at 0.12 ± 0.01 across all branches, including `max`.
+This means the model does **not dramatically reorganize its frequency basis** even when learning a representationally incompatible task. Instead, the interference happens through **small parameter shifts** that accumulate into functional degradation without visibly altering the spectral signature. The Fourier spectrum is a property of the converged circuit, not the training trajectory — and at this training scale, the spectrum hasn't converged enough to show sharp peaks.
+**Implication**: Circuit-level analysis (specific Fourier modes, specific attention heads) is needed to detect the locus of forgetting. Aggregate spectral metrics are too coarse, just as CKA is.
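A minimal sketch of the concentration metric as described above: the fraction of spectral power in the top 5 frequencies of the token embedding. Two details are assumptions rather than facts about this repository: the DC component is excluded, and power is summed over embedding dimensions before ranking frequencies. The name `fourier_concentration` is hypothetical.

```python
import cmath
import math

def fourier_concentration(embedding, top_k=5):
    """Fraction of non-DC spectral power held by the top_k frequencies.

    embedding: list of p rows (one per token), each a list of d floats.
    Power at frequency k is summed over all d embedding dimensions.
    """
    p, d = len(embedding), len(embedding[0])
    power = []
    for k in range(1, p // 2 + 1):  # real signals: k and p - k carry equal power
        total = 0.0
        for j in range(d):
            coeff = sum(
                embedding[t][j] * cmath.exp(-2j * math.pi * k * t / p)
                for t in range(p)
            )
            total += abs(coeff) ** 2
        power.append(total)
    top = sum(sorted(power, reverse=True)[:top_k])
    return top / sum(power)

P = 97
# Toy embedding: two dimensions tracing a circle at the single frequency k = 3.
circle = [
    [math.cos(2 * math.pi * 3 * t / P), math.sin(2 * math.pi * 3 * t / P)]
    for t in range(P)
]
concentration = fourier_concentration(circle)
```

The toy circle scores essentially 1 because all of its power sits at one frequency. Under this convention a perfectly flat spectrum over the 48 frequencies would score 5/48 ≈ 0.104, which gives a baseline for reading the reported value of about 0.12.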
+### 5. What "Compatible" Means, Mechanistically
+The surprise of the experiment is that XOR — a bitwise operation with no obvious algebraic relationship to modular addition — causes zero forgetting (gradient alignment 0.986). This challenges the intuition that "similar tasks" are safe and "different tasks" are dangerous.
+The resolution comes from representation theory. For a prime p = 97:
+- **Addition** uses the standard irreducible representations of Z/97Z: 2D rotation matrices at frequencies k = 1, 2, ..., 48
+- **Subtraction** uses the same representations with negated angle
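+- **Multiplication** acts on the nonzero residues, which form a cyclic group of order 96; the discrete logarithm maps it isomorphically onto addition mod 96, so it is still cyclic, still Fourier
+- **XOR** is not a group operation on Z/97Z (the XOR of two residues can exceed 96). On 7-bit integers (97 in binary is 1100001) it decomposes into independent bit flips, each a Z/2Z operation. Z/2Z does not embed in Z/97Z as a subgroup, because a group of prime order has no proper nontrivial subgroups, so the compatibility cannot come from a subgroup embedding. The measured alignment (0.986) suggests XOR has **partial Fourier structure** that the addition circuit can accommodate, but the mechanism is still a hypothesis.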
+- **Max** is not a group operation at all. It requires a **total order**, which is a fundamentally different algebraic structure from a group. No Fourier decomposition captures "a > b."
+**Implication**: The right notion of "task similarity" for predicting forgetting is not semantic similarity, not computational complexity, and not even input/output overlap. It is **compatibility of the required representational geometry** — specifically, whether the optimal embedding for the new task can coexist with the optimal embedding for the old task, or whether they demand the same parameters take conflicting values.
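Some of the algebraic facts behind this list can be checked by brute force for p = 97. The sketch below is an illustration only. It says nothing about how the repository encodes XOR outputs, and the helper `multiplicative_order` is a hypothetical name.

```python
P = 97
residues = range(P)

# XOR of two residues can leave the range 0..96, so it is not closed on Z/97Z.
xor_max = max(a ^ b for a in residues for b in residues)

# max has an identity (0) but no inverses: max(a, b) == 0 forces a == b == 0.
max_invertible = [a for a in residues if any(max(a, b) == 0 for b in residues)]

def multiplicative_order(g, p):
    """Smallest n >= 1 with g**n == 1 (mod p), for g coprime to p."""
    n, x = 1, g % p
    while x != 1:
        x = (x * g) % p
        n += 1
    return n

# Find a generator of the nonzero residues and build a discrete-log table.
generator = next(g for g in range(2, P) if multiplicative_order(g, P) == P - 1)
dlog = {pow(generator, e, P): e for e in range(P - 1)}

# Under the discrete log, multiplication mod 97 becomes addition mod 96.
mul_is_add = all(
    dlog[(a * b) % P] == (dlog[a] + dlog[b]) % (P - 1)
    for a in range(1, P)
    for b in range(1, P)
)
```

The last check is what "still cyclic" means for multiplication: after relabeling each nonzero residue by its discrete logarithm, the task is modular addition again, only with modulus 96.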
+---
 ## Experiment Design
 Configuration follows [Nanda et al. 2023](https://arxiv.org/abs/2301.05217). Pre-norm residual, GELU, weight-tied embeddings. Trains in ~5 minutes per branch on CPU.
 ## Theoretical Framework
 The graduated dissimilarity ladder is grounded in mechanistic interpretability:
 )
 ```
+---
+## Future Research Directions
+### 1. Finding the Forgetting Phase Transition with Finer Resolution
+Our dissimilarity ladder jumps discretely from "no forgetting" (Levels 0–2, 4) to "forgetting" (Level 3). A natural extension is to construct a **continuous interpolation** between compatible and incompatible tasks to find the exact phase transition:
+- **Soft-max interpolation**: Define `task(α) = α · max(a,b) + (1−α) · ((a+b) mod 97)` for α ∈ [0, 1]. As α increases, the training signal shifts from purely circular to increasingly ordinal. At what α does forgetting emerge? Is the transition sharp (phase transition) or gradual?
+- **Frequency-rotated addition**: Define `add_θ(a, b) = (a + b + θ) mod 97` for various θ. This changes the *output mapping* while preserving the *input geometry*. The Fourier-circuit account predicts zero forgetting regardless of θ; observing that would support the claim that input representation, not output mapping, drives forgetting.
+- **Partial ordering tasks**: `clipped_max(a, b, k) = min(max(a,b), k)` — for small k, this is nearly binary (like comparison); for k=96, it's identical to max. Vary k to interpolate between Fourier-compatible and Fourier-incompatible.
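The three task families above can be written down as label functions. This is a sketch under stated assumptions: `add_theta` and `clipped_max` follow the formulas in the list directly, while `mixed_label` is only one possible reading of `task(α)`. It samples the label from max with probability α so that labels stay integers; the formula could equally be read as a weighted mixture of two losses.

```python
import random

P = 97

def add_theta(a, b, theta):
    """Addition with a constant output offset: (a + b + theta) mod P."""
    return (a + b + theta) % P

def clipped_max(a, b, k):
    """max(a, b) capped at k."""
    return min(max(a, b), k)

def mixed_label(a, b, alpha, rng):
    """Label drawn from max with probability alpha, else from modular addition.

    Assumption: a sampling-based reading of task(alpha) that keeps labels
    in 0..96; it is not taken from the repository.
    """
    return max(a, b) if rng.random() < alpha else (a + b) % P

rng = random.Random(0)
labels_pure_add = [mixed_label(a, 5, 0.0, rng) for a in range(P)]
labels_pure_max = [mixed_label(a, 5, 1.0, rng) for a in range(P)]
```

At α = 0 the sampled task reduces to modular addition and at α = 1 to max, so sweeping α between the two endpoints gives the interpolation the first bullet asks for.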
+### 2. Circuit-Level Autopsy of the Forgetting Moment
+Our metrics are aggregate (whole-layer CKA, whole-model gradient alignment). The next step is to zoom in on **which specific neurons and attention heads** are the locus of interference at step 960 when the first accuracy drop occurs:
+- **Per-neuron gradient conflict**: Decompose the gradient alignment by individual MLP neurons. Identify the specific neurons where the addition gradient and max gradient point in opposite directions. Are these the same neurons that encode Fourier frequencies in the addition circuit (per Nanda et al.)?
+- **Attention head surgery**: Freeze individual attention heads during Phase 2 max training and measure whether this prevents forgetting. [Laitinen 2026](https://arxiv.org/abs/2601.18699) found that 15–23% of lower-layer attention heads are severely disrupted during forgetting — can we identify and protect exactly those heads?
+- **Activation patching**: Use [causal tracing](https://arxiv.org/abs/2202.05262) to determine which components, when patched from the Phase 1 checkpoint to the post-max-training model, restore addition accuracy. This would locate the minimal set of parameters that were "overwritten."
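The per-neuron decomposition in the first bullet could start from something like the sketch below. It assumes each neuron's weight gradient has already been extracted and flattened for both losses. The function names and the toy gradients are hypothetical.

```python
def per_neuron_conflict(grads_add, grads_max):
    """Dot product of the two task gradients, computed separately per neuron.

    grads_add, grads_max: dict mapping a neuron id to that neuron's flattened
    weight gradient under the addition loss and the max loss respectively.
    A negative value means the two tasks push that neuron in opposing directions.
    """
    return {
        neuron: sum(ga * gm for ga, gm in zip(grads_add[neuron], grads_max[neuron]))
        for neuron in grads_add
    }

def conflicting_neurons(grads_add, grads_max):
    """Neuron ids with a negative per-neuron dot product, most negative first."""
    scores = per_neuron_conflict(grads_add, grads_max)
    return sorted((n for n, s in scores.items() if s < 0), key=scores.get)

# Toy gradients for three hypothetical MLP neurons.
g_add = {0: [1.0, 0.0], 1: [0.5, 0.5], 2: [0.0, 2.0]}
g_max = {0: [1.0, 1.0], 1: [-1.0, -1.0], 2: [0.0, -3.0]}
scores = per_neuron_conflict(g_add, g_max)
ranked = conflicting_neurons(g_add, g_max)
```

If the neurons partition the parameters, the per-neuron scores sum to the dot product in the numerator of the whole-model cosine, so this is a decomposition of the aggregate metric rather than a separate one.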
+### 3. Scaling to Larger Models and Natural Language
+The current experiment uses a 260K-parameter model on synthetic data. Key questions about scaling:
+- **Does the gradient alignment prediction hold in larger models?** Train a GPT-2 small (124M params) on language, fine-tune on a domain-specific task, and monitor gradient alignment with held-out general-capability probes. If drops in gradient alignment predict forgetting severity (as Laitinen found, r=0.87, in LLMs), that would support our mechanism at scale.
+- **Does overparameterization prevent the geometric conflict?** With 260K parameters, the model has limited capacity and tasks must share the same embedding. A 100x larger model might maintain separate subspaces for circular and ordinal representations simultaneously — which would predict a **capacity-dependent forgetting threshold** where small models forget but large ones don't.
+- **Multi-task baselines**: Train on addition + max simultaneously from the start (joint training). Does the model learn to partition its representation space into circular and linear subregions? If so, the geometry of this partition would reveal how models *could* avoid forgetting if given the right training regime.
+### 4. Representation Geometry as a Compatibility Predictor
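+Our finding suggests a practical diagnostic: when fine-tuning a model on a new task, measure **gradient alignment between the new task loss and a probe set from the old task**. As a rule of thumb, alignment that stays high (>0.5) indicates fine-tuning is safe, and alignment that drops toward zero indicates forgetting is coming. A single measurement at the start is not enough: the `max` branch began at 0.90 and still crossed zero, so the signal has to be tracked during training.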
+This could be developed into:
+- **Pre-training compatibility scoring**: Given a pre-trained model and a candidate fine-tuning dataset, compute gradient alignment on a small sample and predict forgetting severity *before training begins*. This is cheaper than training and evaluating.
+- **Adaptive learning rate scheduling**: When gradient alignment drops below a threshold during training, automatically reduce the learning rate or switch to a parameter-efficient method (LoRA) to constrain the update to a subspace that doesn't conflict with the old task.
+- **Representation-aware continual learning**: Use the gradient alignment signal to dynamically allocate parameters — dedicate separate parameter subsets to tasks with low alignment (as in [O-LoRA, 2310.14152](https://arxiv.org/abs/2310.14152)), while allowing shared parameters for high-alignment tasks.
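As an illustration of the adaptive scheduling idea, the sketch below scales the learning rate by the current gradient alignment. The rule itself, the 0.5 threshold, and the 0.1 floor are hypothetical choices made for this example and have not been tested in this repository.

```python
def adjusted_learning_rate(base_lr, alignment, threshold=0.5, floor=0.1):
    """Scale the learning rate down linearly once alignment falls below threshold.

    At or above threshold the base rate is used unchanged. Below it, the rate
    shrinks in proportion to alignment / threshold, never below floor * base_lr.
    Negative alignment is treated like zero.
    """
    if alignment >= threshold:
        return base_lr
    scale = max(alignment, 0.0) / threshold
    return base_lr * max(scale, floor)

# Alignment values taken from the max-branch trajectory reported above.
schedule = [adjusted_learning_rate(1e-3, a) for a in (0.902, 0.259, 0.036, -0.027)]
```

Applied to the `max` trajectory, this rule would leave the rate untouched at alignment 0.902, roughly halve it at 0.259, and hold it at the floor once alignment is near zero or negative.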
+### 5. The Grokking Connection
+Our models reach 100% training accuracy within 10–20 epochs, but representation metrics continue to evolve for 150+ epochs. This is the same **post-memorization reorganization** observed in grokking ([Power et al. 2022](https://arxiv.org/abs/2201.02177), [Zhang et al. 2025](https://arxiv.org/abs/2506.21551)). The question is: does grokking *protect against* or *predispose to* forgetting?
+- **Hypothesis A (grokking protects)**: A fully grokked model has consolidated its knowledge into a clean, structured circuit. This structured representation may be more robust to interference because it uses parameters efficiently, leaving slack for new tasks.
+- **Hypothesis B (grokking predisposes)**: A fully grokked model has *committed* all its representational capacity to one specific circuit geometry. There is less room for compromise. An un-grokked model, with its "messy" representations, might be more flexible.
+- **Test**: Run the same experiment but fork at different points during Phase 1 — early (memorized but not grokked), mid (grokking in progress), and late (fully grokked). Measure forgetting severity at each fork point. This would directly reveal whether representational consolidation helps or hurts.
+### 6. Beyond Pairs: Task Sequences and Curriculum Effects
+Our experiment trains on one task, then switches to one other. Real continual learning involves sequences of many tasks. The order matters:
+- **Does learning max *after* subtraction reduce forgetting?** If subtraction strengthens the Fourier circuit, making it harder to overwrite, the forgetting from max might be reduced. Conversely, if subtraction broadens the representation, it might make max's ordinal demands easier to accommodate.
+- **Curriculum design via gradient alignment**: Sequence tasks in order of descending gradient alignment with the base task. This would be a principled curriculum where each new task leverages the maximum possible overlap with what came before, potentially minimizing cumulative forgetting.
+- **Forgetting chains**: Train add → max → add. Does the second round of addition training recover the lost accuracy? If so, how quickly — and do the recovered representations match the original ones (measured by CKA), or does the model find a different solution?
+### 7. Connecting to Biological Continual Learning
+The three-phase forgetting process (coexistence → interference onset → antagonistic equilibrium) bears resemblance to **synaptic consolidation** theories in neuroscience:
+- **Phase A** resembles the initial period where new learning doesn't disrupt old memories because it activates different neural populations
+- **Phase B** resembles the onset of retroactive interference when resource competition begins
+- **Phase C** resembles the stable state after consolidation, where both old and new memories coexist at reduced fidelity
+[Elastic Weight Consolidation (EWC)](https://arxiv.org/abs/1612.00796) was explicitly inspired by synaptic consolidation in the brain. Our gradient alignment metric could serve as a **complementary signal to Fisher Information** (used by EWC) — while Fisher identifies *which* parameters are important, gradient alignment identifies *which tasks* will interfere. Combining both might yield a more targeted protection strategy.
+---
 ## References
 - **Nanda et al. 2023** — [Progress Measures for Grokking](https://arxiv.org/abs/2301.05217) — Fourier circuit for modular addition
 - **Zhang et al. 2025** — [Grokking in LLM Pretraining](https://arxiv.org/abs/2506.21551)
 - **Lam et al. 2025** — [Implicit Curriculum Hypothesis](https://arxiv.org/abs/2604.08510)
 - **Feature Emergence 2023** — [Margin Maximization](https://arxiv.org/abs/2311.07568) — Fourier sparsity for cyclic groups
+- **Power et al. 2022** — [Grokking: Generalization Beyond Overfitting](https://arxiv.org/abs/2201.02177)
+- **Kirkpatrick et al. 2017** — [Elastic Weight Consolidation](https://arxiv.org/abs/1612.00796)
+- **Meng et al. 2022** — [Locating and Editing Factual Associations](https://arxiv.org/abs/2202.05262) — Causal tracing
 ## License