---
title: Modular Addition Feature Learning
emoji: "πŸ”’"
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: "6.5.1"
app_file: hf_app/app.py
pinned: false
---

# On the Mechanism and Dynamics of Modular Addition

### Fourier Features, Lottery Ticket, and Grokking

**Jianliang He, Leda Wang, Siyu Chen, Zhuoran Yang**
*Department of Statistics and Data Science, Yale University*

[[arXiv](https://arxiv.org/abs/2602.16849)] [[Blog](https://y-agent.github.io/posts/modular_addition_feature_learning/)] [[Interactive Demo](https://huggingface.co/spaces/y-agent/modular-addition-feature-learning)]

---

## Overview

This repository provides the code for studying how a two-layer neural network learns modular arithmetic $f(x,y) = (x+y) \bmod p$. We analyze three phenomena:

1. **Fourier Feature Learning** β€” Each neuron independently discovers a cosine wave at a single frequency, collectively implementing a discrete Fourier transform that the network was never taught.
2. **Lottery Ticket Dynamics** β€” Random initialization determines which frequency each neuron will specialize in: the frequency with the best initial phase alignment wins a winner-take-all competition.
3. **Grokking** β€” Under partial data with weight decay, the network first memorizes, then suddenly generalizes through a three-stage process: memorization β†’ sparsification β†’ cleanup.
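
Phenomenon 1 can be illustrated in a few lines (a toy sketch, not the repo's analysis code): a neuron whose weight row traces a single cosine has a DFT power spectrum concentrated on one conjugate frequency pair, which the inverse participation ratio (IPR) quantifies.

```python
import numpy as np

p, k = 23, 5          # modulus and the frequency this toy neuron specializes in
x = np.arange(p)
w = np.cos(2 * np.pi * k * x / p + 0.3)   # weight row: one cosine, phase 0.3

power = np.abs(np.fft.fft(w)) ** 2        # DFT power spectrum of the weights
power /= power.sum()

# Inverse participation ratio of the spectrum: equals 1/2 when the mass
# splits over a single conjugate pair of bins, ~1/p when spread uniformly.
ipr = (power ** 2).sum()
print(np.sort(np.argsort(power)[-2:]))    # the two spikes sit at k and p - k
```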

An [**Interactive Demo**](https://huggingface.co/spaces/y-agent/modular-addition-feature-learning) on Hugging Face Spaces visualizes all results with 9 analysis tabs, interactive Plotly charts, and on-demand training for any odd $p \geq 3$. Pre-computed examples are included for $p = 15, 23, 29, 31$.
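
The end-to-end Fourier mechanism can also be sanity-checked without any training: logits built from cosines of $x + y - c$, summed over all $(p-1)/2$ frequencies, form a kernel that peaks exactly at $c = (x+y) \bmod p$ (an analytic sketch, independent of the trained models):

```python
import numpy as np

p = 23
freqs = np.arange(1, (p - 1) // 2 + 1)  # the (p-1)/2 distinct frequencies mod p

def fourier_logits(x, y):
    # Logit for class c: sum over k of cos(2*pi*k*(x + y - c)/p).
    c = np.arange(p)
    return np.cos(2 * np.pi * np.outer(x + y - c, freqs) / p).sum(axis=1)

# The kernel equals (p-1)/2 at c = (x+y) mod p and -1/2 everywhere else,
# so the argmax decodes modular addition for every input pair.
correct = all(
    fourier_logits(x, y).argmax() == (x + y) % p
    for x in range(p) for y in range(p)
)
print(correct)
```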

### Launch Locally

```bash
pip install -r requirements.txt
python hf_app/app.py
# Opens at http://localhost:7860
```

### Deploy to Hugging Face Spaces

We use the [Hugging Face Python API](https://huggingface.co/docs/huggingface_hub/) to upload to Spaces, since Hugging Face now requires [Xet storage](https://huggingface.co/docs/hub/xet) for binary files (PNGs, etc.), which a standard `git push` does not handle.

**First-time setup:**

```bash
pip install huggingface_hub hf_xet
```

Log in (get a **write** token from https://huggingface.co/settings/tokens):

```bash
huggingface-cli login
```

**Upload to the Space:**

```bash
python deploy_to_hf.py
# or with a custom commit message:
python deploy_to_hf.py --message "Update app"
```

The deploy script prepends the required Hugging Face Space metadata (SDK config, app path, etc.) to `README.md` before uploading, so the GitHub README stays clean.
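
That prepending step amounts to something like the following (an illustrative sketch, not the actual script; `with_space_metadata` is a hypothetical name, and the field values mirror the frontmatter above):

```python
SPACE_METADATA = (
    "---\n"
    "title: Modular Addition Feature Learning\n"
    "sdk: gradio\n"
    "app_file: hf_app/app.py\n"
    "---\n\n"
)

def with_space_metadata(readme_text: str) -> str:
    # Prepend the HF Space YAML block unless one is already present.
    if readme_text.startswith("---"):
        return readme_text
    return SPACE_METADATA + readme_text
```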

**What gets uploaded:** Only the files the app needs β€” `hf_app/`, `precompute/`, `precomputed_results/`, `src/`, `requirements.txt`, `README.md`. Model checkpoints, notebooks, and figures are excluded.

**On-demand training:** Users can generate results for new $p$ values directly from the app's "Generate" button. Streaming logs show real-time training progress. New results are auto-committed back to the Space repo so they persist across restarts.

> **Tip:** For GPU-accelerated on-demand training, select a GPU runtime in your Space settings.

## Pre-computation Pipeline

The `precompute/` directory trains 5 model configurations per modulus and generates all plots + interactive JSON data. See [`precompute/README.md`](precompute/README.md) for full documentation.

### Quick Start

```bash
# Full pipeline for a single modulus (train β†’ plots β†’ analytical β†’ verify)
bash precompute/run_pipeline.sh 23

# With custom d_mlp
bash precompute/run_pipeline.sh 23 --d_mlp 128

# Delete checkpoints after generating plots (saves disk space)
CLEANUP=1 bash precompute/run_pipeline.sh 23

# Batch: all odd p in [3, 99]
bash precompute/run_all.sh

# Or up to p=199
MAX_P=199 bash precompute/run_all.sh
```

### Manual Steps

```bash
# Step 1: Train all 5 configurations
python precompute/train_all.py --p 23 --output ./trained_models --resume

# Step 2: Generate model-based plots (21 PNGs + 7 JSONs)
python precompute/generate_plots.py --p 23 --input ./trained_models --output ./precomputed_results

# Step 3: Generate analytical simulation plots (2 PNGs, no model needed)
python precompute/generate_analytical.py --p 23 --output ./precomputed_results
```

### Output

Each modulus produces ~33 files in `precomputed_results/p_XXX/`:

| Category | Files | Description |
|----------|-------|-------------|
| Overview (Tab 1) | 2 PNGs + 1 JSON | Loss, IPR, phase scatter |
| Fourier Weights (Tab 2) | 3 PNGs + 1 JSON | DFT heatmaps, cosine fits, neuron spectra |
| Phase Analysis (Tab 3) | 3 PNGs | Phase distribution, alignment, magnitudes |
| Output Logits (Tab 4) | 1 PNG + 1 JSON | Logit heatmap, interactive explorer |
| Lottery Mechanism (Tab 5) | 3 PNGs | Magnitude race, phase convergence, contour |
| Grokking (Tab 6) | 5 PNGs + 3 JSONs | Loss/acc curves, memorization, weight evolution |
| Gradient Dynamics (Tab 7) | 4 PNGs | Phase alignment + DFT for Quad and ReLU |
| Decoupled Simulation (Tab 8) | 2 PNGs | Analytical ODE integration |
| Metadata | 2 JSONs | Config + training log |

> **Note:** Grokking results (Tab 6) require $p \geq 19$. Smaller values of $p$ have too few data points for a meaningful train/test split.
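
The threshold is easy to check by hand: the grokking config trains on 75% of all $p^2$ input pairs, so for small $p$ the held-out test set shrinks quickly (a quick arithmetic check; the repo's exact split and rounding may differ):

```python
# Train/test sizes under a 75% split of all p^2 input pairs.
for p in (7, 15, 19, 23):
    n_pairs = p * p
    n_train = int(0.75 * n_pairs)
    print(p, n_pairs, n_train, n_pairs - n_train)
```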

## The 5 Training Configurations

| Config | Activation | Optimizer | LR | Weight Decay | Data | Epochs | Used In |
|--------|-----------|-----------|-----|-------------|------|--------|---------|
| `standard` | ReLU | AdamW | 5e-5 | 0 | 100% | 5,000 | Tabs 1–4 |
| `grokking` | ReLU | AdamW | 1e-4 | 2.0 | 75% | 50,000 | Tabs 1, 6 |
| `quad_random` | Quad | AdamW | 5e-5 | 0 | 100% | 5,000 | Tab 5 |
| `quad_single_freq` | Quad | SGD | 0.1 | 0 | 100% | 10,000 | Tab 7 |
| `relu_single_freq` | ReLU | SGD | 0.01 | 0 | 100% | 10,000 | Tab 7 |
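
A product-to-sum identity suggests why the quadratic (`Quad`) activation is a convenient testbed: squaring a sum of single-frequency cosine embeddings produces a $\cos\!\big(2\pi k(x+y)/p\big)$ cross term, exactly the Fourier feature of the answer (a toy identity check, not the repo's code):

```python
import numpy as np

p, k = 23, 3
x, y = 7, 11
a = 2 * np.pi * k * x / p  # embedding angle for input x
b = 2 * np.pi * k * y / p  # embedding angle for input y

# Quadratic activation applied to the summed cosine embeddings...
quad = (np.cos(a) + np.cos(b)) ** 2
# ...expands to 1 + cos(2a)/2 + cos(2b)/2 + cos(a+b) + cos(a-b),
# where cos(a + b) = cos(2*pi*k*(x+y)/p) carries the modular sum.
expanded = 1 + np.cos(2*a)/2 + np.cos(2*b)/2 + np.cos(a + b) + np.cos(a - b)
print(np.isclose(quad, expanded))
```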

## Running a Single Experiment

For custom experiments outside the pre-computation pipeline:

```bash
cd src

# Train with default config (p=97, d_mlp=1024, ReLU, 5000 epochs)
python module_nn.py

# Train with specific parameters
python module_nn.py --p 23 --d_mlp 512 --num_epochs 5000 --lr 5e-5

# Dry run: see config without training
python module_nn.py --dry_run --p 23 --d_mlp 512
```

## Notebooks

Interactive analysis notebooks in `notebooks/`:

| Notebook | Description |
|----------|-------------|
| `empirical_insight_standard.ipynb` | Fourier weight analysis, phase distributions, output logits |
| `empirical_insight_grokk.ipynb` | Grokking stages, weight dynamics, IPR evolution |
| `lottery_mechanism.ipynb` | Neuron specialization, frequency magnitude/phase tracking |
| `interprete_gd_dynamics.ipynb` | Phase alignment under single-frequency initialization |
| `decouple_dynamics_simulation.ipynb` | Analytical gradient flow simulation |

## Setup

### Requirements

- Python 3.8+
- PyTorch 2.0+
- CUDA-capable GPU (recommended for $p > 50$; CPU works for small $p$)

### Installation

```bash
git clone https://github.com/Y-Agent/modular-addition-feature-learning.git
cd modular-addition-feature-learning
pip install -r requirements.txt
```

## Project Structure

```
modular-addition-feature-learning/
β”œβ”€β”€ src/                          # Core source code
β”‚   β”œβ”€β”€ module_nn.py             # Training script with CLI
β”‚   β”œβ”€β”€ nnTrainer.py             # Training loop and optimization
β”‚   β”œβ”€β”€ model_base.py            # Neural network architecture (EmbedMLP)
β”‚   β”œβ”€β”€ mechanism_base.py        # Fourier analysis and decomposition
β”‚   β”œβ”€β”€ utils.py                 # Configuration and helpers
β”‚   └── configs.yaml             # Default hyperparameters
β”œβ”€β”€ precompute/                   # Batch training and plot generation
β”‚   β”œβ”€β”€ run_pipeline.sh          # Full pipeline for one modulus
β”‚   β”œβ”€β”€ run_all.sh               # Batch pipeline for all odd p
β”‚   β”œβ”€β”€ train_all.py             # Train 5 configurations
β”‚   β”œβ”€β”€ generate_plots.py        # Generate model-based plots + JSONs
β”‚   β”œβ”€β”€ generate_analytical.py   # Analytical ODE simulation plots
β”‚   └── prime_config.py          # Configurations and sizing formula
β”œβ”€β”€ hf_app/                       # Gradio web application
β”‚   └── app.py                   # Interactive visualization app
β”œβ”€β”€ precomputed_results/          # Pre-computed plots and data
β”‚   β”œβ”€β”€ p_015/                   # Results for p=15
β”‚   β”œβ”€β”€ p_023/                   # Results for p=23
β”‚   β”œβ”€β”€ p_029/                   # Results for p=29
β”‚   └── p_031/                   # Results for p=31
β”œβ”€β”€ notebooks/                    # Analysis and visualization notebooks
β”œβ”€β”€ requirements.txt              # Python dependencies
└── README.md
```

## Citation

```bibtex
@article{he2025modular,
  title={On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking},
  author={He, Jianliang and Wang, Leda and Chen, Siyu and Yang, Zhuoran},
  journal={arXiv preprint arXiv:2602.16849},
  year={2025}
}
```

## License

[MIT License](LICENSE)