---
title: Modular Addition Feature Learning
emoji: "π’"
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: "6.5.1"
app_file: hf_app/app.py
pinned: false
---

# On the Mechanism and Dynamics of Modular Addition

### Fourier Features, Lottery Ticket, and Grokking

**Jianliang He, Leda Wang, Siyu Chen, Zhuoran Yang**

*Department of Statistics and Data Science, Yale University*

[[arXiv](https://arxiv.org/abs/2602.16849)] [[Blog](https://y-agent.github.io/posts/modular_addition_feature_learning/)] [[Interactive Demo](https://huggingface.co/spaces/y-agent/modular-addition-feature-learning)]

---

## Overview

This repository provides the code for studying how a two-layer neural network learns modular addition, $f(x,y) = (x+y) \bmod p$. We analyze three phenomena:

1. **Fourier Feature Learning:** Each neuron independently discovers a cosine wave at a single frequency. Collectively, the neurons implement a discrete Fourier transform that the network was never taught.
2. **Lottery Ticket Dynamics:** Random initialization determines which frequency each neuron specializes in. The frequency with the best initial phase alignment wins a winner-take-all competition.
3. **Grokking:** Under partial data with weight decay, the network first memorizes, then suddenly generalizes through a three-stage process: memorization → sparsification → cleanup.

An [**Interactive Demo**](https://huggingface.co/spaces/y-agent/modular-addition-feature-learning) on Hugging Face Spaces visualizes all results with 9 analysis tabs, interactive Plotly charts, and on-demand training for any odd $p \geq 3$. Pre-computed examples are included for $p = 15, 23, 29, 31$.

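
To make the Fourier mechanism from item 1 concrete, here is a minimal idealized sketch. It is not the trained network and not code from this repository; the function name `fourier_logits` is hypothetical. It relies on a standard identity: for odd $p$, the sum $\sum_{k=1}^{(p-1)/2} \cos(2\pi k (x + y - c)/p)$ equals $(p-1)/2$ when $c \equiv x + y \pmod p$ and $-1/2$ otherwise, so taking the argmax over candidate answers $c$ recovers modular addition.

```python
import numpy as np


def fourier_logits(x: int, y: int, p: int) -> np.ndarray:
    """Score each candidate answer c with a sum of cosines over frequencies k."""
    ks = np.arange(1, (p - 1) // 2 + 1)   # one frequency per conjugate pair (odd p)
    cs = np.arange(p)                     # candidate answers 0, ..., p - 1
    angles = 2 * np.pi * np.outer(ks, x + y - cs) / p
    return np.cos(angles).sum(axis=0)     # shape (p,)


p = 23
for x, y in [(3, 5), (20, 9), (22, 22)]:
    logits = fourier_logits(x, y, p)
    print(x, y, int(np.argmax(logits)), (x + y) % p)
```

The two printed answers agree for every pair, which illustrates why cosine features at the right frequencies are sufficient for the task.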
### Launch Locally

```bash
pip install -r requirements.txt
python hf_app/app.py
# Opens at http://localhost:7860
```

### Deploy to Hugging Face Spaces

We use the [Hugging Face Python API](https://huggingface.co/docs/huggingface_hub/) to upload to Spaces because HF now requires [Xet storage](https://huggingface.co/docs/hub/xet) for binary files (PNGs, etc.), which a standard `git push` does not handle.

**First-time setup:**

```bash
pip install huggingface_hub hf_xet
```

Log in (get a **write** token from https://huggingface.co/settings/tokens):

```bash
huggingface-cli login
```

**Upload to the Space:**

```bash
python deploy_to_hf.py
# or with a custom commit message:
python deploy_to_hf.py --message "Update app"
```

The deploy script prepends the required Hugging Face Space metadata (SDK config, app path, etc.) to `README.md` before uploading, so the GitHub README stays clean.

**What gets uploaded:** Only the files the app needs: `hf_app/`, `precompute/`, `precomputed_results/`, `src/`, `requirements.txt`, `README.md`. Model checkpoints, notebooks, and figures are excluded.

**On-demand training:** Users can generate results for new values of $p$ directly from the app's "Generate" button. Streaming logs show training progress in real time. New results are auto-committed back to the Space repo so they persist across restarts.

> **Tip:** For GPU-accelerated on-demand training, select a GPU runtime in your Space settings.

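
The upload flow described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `deploy_to_hf.py`, which this README does not show: the names `SPACE_METADATA`, `ALLOW_PATTERNS`, `with_space_metadata`, and `deploy` are hypothetical, and the metadata is a subset of the Space front matter. The `HfApi.upload_folder` and `HfApi.upload_file` calls are part of `huggingface_hub`.

```python
import json
from pathlib import Path

# Hypothetical subset of the Space metadata; values are rendered as JSON,
# which is also valid YAML.
SPACE_METADATA = {
    "title": "Modular Addition Feature Learning",
    "sdk": "gradio",
    "sdk_version": "6.5.1",
    "app_file": "hf_app/app.py",
    "pinned": False,
}

# Only the paths the app needs; README.md is uploaded separately below.
ALLOW_PATTERNS = ["hf_app/**", "precompute/**", "precomputed_results/**",
                  "src/**", "requirements.txt"]


def with_space_metadata(readme_text: str, metadata: dict) -> str:
    """Prepend a YAML front-matter block unless the text already starts with one."""
    if readme_text.startswith("---\n"):
        return readme_text
    header = "\n".join(f"{key}: {json.dumps(value)}" for key, value in metadata.items())
    return f"---\n{header}\n---\n\n{readme_text}"


def deploy(repo_id: str, root: str = ".", message: str = "Update app") -> None:
    """Upload app files, then the README with metadata (requires a prior login)."""
    from huggingface_hub import HfApi  # lazy import: the helper above needs no extras

    api = HfApi()
    api.upload_folder(repo_id=repo_id, repo_type="space", folder_path=root,
                      allow_patterns=ALLOW_PATTERNS, commit_message=message)
    readme = Path(root, "README.md").read_text(encoding="utf-8")
    api.upload_file(path_or_fileobj=with_space_metadata(readme, SPACE_METADATA).encode("utf-8"),
                    path_in_repo="README.md", repo_id=repo_id, repo_type="space",
                    commit_message=message)
```

Calling `deploy("y-agent/modular-addition-feature-learning")` would push to the Space linked above, assuming your token has write access. Keeping the metadata out of the on-disk `README.md` and adding it only at upload time is what keeps the GitHub copy free of front matter.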
## Pre-computation Pipeline

The `precompute/` directory trains 5 model configurations per modulus and generates all plots and interactive JSON data. See [`precompute/README.md`](precompute/README.md) for full documentation.

### Quick Start

```bash
# Full pipeline for a single modulus (train → plots → analytical → verify)
bash precompute/run_pipeline.sh 23

# With custom d_mlp
bash precompute/run_pipeline.sh 23 --d_mlp 128

# Delete checkpoints after generating plots (saves disk space)
CLEANUP=1 bash precompute/run_pipeline.sh 23

# Batch: all odd p in [3, 99]
bash precompute/run_all.sh

# Or up to p=199
MAX_P=199 bash precompute/run_all.sh
```

### Manual Steps

```bash
# Step 1: Train all 5 configurations
python precompute/train_all.py --p 23 --output ./trained_models --resume

# Step 2: Generate model-based plots (21 PNGs + 7 JSONs)
python precompute/generate_plots.py --p 23 --input ./trained_models --output ./precomputed_results

# Step 3: Generate analytical simulation plots (2 PNGs, no model needed)
python precompute/generate_analytical.py --p 23 --output ./precomputed_results
```

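
If you prefer to drive the three manual steps from Python, for example to loop over several moduli, a minimal sketch could look like this. It only reuses the script paths and flags shown above; the helper names `pipeline_commands`, `odd_moduli`, and `run_pipeline` are hypothetical, and the sketch does not replace `run_pipeline.sh`, which also runs a verify step.

```python
import subprocess
import sys


def pipeline_commands(p, models_dir="./trained_models", results_dir="./precomputed_results"):
    """Build the three manual-step commands for one modulus, in execution order."""
    py = sys.executable
    return [
        [py, "precompute/train_all.py", "--p", str(p), "--output", models_dir, "--resume"],
        [py, "precompute/generate_plots.py", "--p", str(p),
         "--input", models_dir, "--output", results_dir],
        [py, "precompute/generate_analytical.py", "--p", str(p), "--output", results_dir],
    ]


def odd_moduli(lo=3, hi=99):
    """All odd p in [lo, hi], the same range the batch script covers by default."""
    return [p for p in range(lo, hi + 1) if p % 2 == 1]


def run_pipeline(p):
    """Run the steps sequentially from the repository root; stop at the first failure."""
    for cmd in pipeline_commands(p):
        subprocess.run(cmd, check=True)
```

For example, `for p in odd_moduli(19, 31): run_pipeline(p)` would process every odd modulus from 19 to 31.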
### Output

Each modulus produces ~33 files in `precomputed_results/p_XXX/`:

| Category | Files | Description |
|----------|-------|-------------|
| Overview (Tab 1) | 2 PNGs + 1 JSON | Loss, IPR, phase scatter |
| Fourier Weights (Tab 2) | 3 PNGs + 1 JSON | DFT heatmaps, cosine fits, neuron spectra |
| Phase Analysis (Tab 3) | 3 PNGs | Phase distribution, alignment, magnitudes |
| Output Logits (Tab 4) | 1 PNG + 1 JSON | Logit heatmap, interactive explorer |
| Lottery Mechanism (Tab 5) | 3 PNGs | Magnitude race, phase convergence, contour |
| Grokking (Tab 6) | 5 PNGs + 3 JSONs | Loss/acc curves, memorization, weight evolution |
| Gradient Dynamics (Tab 7) | 4 PNGs | Phase alignment + DFT for Quad and ReLU |
| Decoupled Simulation (Tab 8) | 2 PNGs | Analytical ODE integration |
| Metadata | 2 JSONs | Config + training log |

> **Note:** Grokking results (Tab 6) require $p \geq 19$. Smaller values of $p$ have too few data points for a meaningful train/test split.

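
As a rough illustration of what the "neuron spectra" and "IPR" entries in the table measure, the sketch below takes the DFT of one neuron's length-$p$ weight vector and reports its dominant frequency, phase, and a concentration score. It uses a synthetic neuron rather than a trained checkpoint. The helper `dominant_frequency` is hypothetical, and the IPR formula is one standard definition of the inverse participation ratio; the repository's exact definition may differ.

```python
import numpy as np


def dominant_frequency(weights: np.ndarray):
    """Return (frequency, phase, ipr) for one neuron's length-p weight vector."""
    spectrum = np.fft.rfft(weights)[1:]    # drop the constant (k = 0) component
    power = np.abs(spectrum) ** 2
    k = int(np.argmax(power)) + 1          # index 0 of `spectrum` is frequency k = 1
    phase = float(np.angle(spectrum[k - 1]))
    # Assumed definition: sum(power^2) / sum(power)^2, which is 1.0 when a
    # single frequency carries all the power and 1/n when n share it equally.
    ipr = float((power ** 2).sum() / power.sum() ** 2)
    return k, phase, ipr


p = 23
x = np.arange(p)
neuron = 1.5 * np.cos(2 * np.pi * 4 * x / p + 0.7)   # synthetic single-frequency neuron
print(dominant_frequency(neuron))
```

For this synthetic neuron the function recovers frequency 4 and phase 0.7, with an IPR of 1 up to floating-point error.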
## The 5 Training Configurations

| Config | Activation | Optimizer | LR | Weight Decay | Data | Epochs | Used In |
|--------|-----------|-----------|-----|-------------|------|--------|---------|
| `standard` | ReLU | AdamW | 5e-5 | 0 | 100% | 5,000 | Tabs 1–4 |
| `grokking` | ReLU | AdamW | 1e-4 | 2.0 | 75% | 50,000 | Tabs 1, 6 |
| `quad_random` | Quad | AdamW | 5e-5 | 0 | 100% | 5,000 | Tab 5 |
| `quad_single_freq` | Quad | SGD | 0.1 | 0 | 100% | 10,000 | Tab 7 |
| `relu_single_freq` | ReLU | SGD | 0.01 | 0 | 100% | 10,000 | Tab 7 |

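
The "Data" column refers to the fraction of all $p^2$ input pairs used for training. A minimal sketch of such a split is shown below; the function `make_split`, its seed handling, and the uniform random sampling are assumptions for illustration, not necessarily how `train_all.py` builds its datasets.

```python
import numpy as np


def make_split(p: int, train_frac: float = 0.75, seed: int = 0):
    """Randomly split all p*p pairs (x, y) with labels (x + y) mod p."""
    xs, ys = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    pairs = np.stack([xs.ravel(), ys.ravel()], axis=1)   # shape (p*p, 2)
    labels = (pairs[:, 0] + pairs[:, 1]) % p
    order = np.random.default_rng(seed).permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train, test = order[:n_train], order[n_train:]
    return (pairs[train], labels[train]), (pairs[test], labels[test])


for p in (5, 19, 23):
    (x_train, _), (x_test, _) = make_split(p)
    print(p, len(x_train), len(x_test))
```

With $p = 5$ there are only 25 pairs in total, so the held-out set has 7 examples, whereas $p = 19$ already gives 361 pairs.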
## Running a Single Experiment

For custom experiments outside the pre-computation pipeline:

```bash
cd src

# Train with default config (p=97, d_mlp=1024, ReLU, 5000 epochs)
python module_nn.py

# Train with specific parameters
python module_nn.py --p 23 --d_mlp 512 --num_epochs 5000 --lr 5e-5

# Dry run: see config without training
python module_nn.py --dry_run --p 23 --d_mlp 512
```

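
For intuition about what `--p` and `--d_mlp` control, here is a generic two-layer network forward pass in NumPy. It is a sketch under assumptions, not the repository's `EmbedMLP`: the shared input weights for $x$ and $y$, the absence of biases, the random initialization scale, and reading "Quad" as the elementwise square are all guesses made for illustration.

```python
import numpy as np


def forward(x: int, y: int, w_in: np.ndarray, w_out: np.ndarray, activation: str = "relu"):
    """Logits of a two-layer net: w_out @ act(w_in @ (onehot(x) + onehot(y)))."""
    pre = w_in[:, x] + w_in[:, y]          # column lookup equals the one-hot product
    if activation == "relu":
        hidden = np.maximum(pre, 0.0)
    else:                                  # assumed meaning of "Quad": elementwise square
        hidden = pre ** 2
    return w_out @ hidden                  # one logit per residue class, shape (p,)


p, d_mlp = 23, 512
rng = np.random.default_rng(0)
w_in = rng.normal(scale=0.1, size=(d_mlp, p))    # hidden width d_mlp, input size p
w_out = rng.normal(scale=0.1, size=(p, d_mlp))
print(forward(3, 5, w_in, w_out).shape)
```

In this sketch `d_mlp` is the number of hidden neurons, so it sets how many neurons are available to specialize in individual frequencies.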
## Notebooks

Interactive analysis notebooks in `notebooks/`:

| Notebook | Description |
|----------|-------------|
| `empirical_insight_standard.ipynb` | Fourier weight analysis, phase distributions, output logits |
| `empirical_insight_grokk.ipynb` | Grokking stages, weight dynamics, IPR evolution |
| `lottery_mechanism.ipynb` | Neuron specialization, frequency magnitude/phase tracking |
| `interprete_gd_dynamics.ipynb` | Phase alignment under single-frequency initialization |
| `decouple_dynamics_simulation.ipynb` | Analytical gradient flow simulation |

## Setup

### Requirements

- Python 3.8+
- PyTorch 2.0+
- CUDA-capable GPU (recommended for $p > 50$; CPU works for small $p$)

### Installation

```bash
git clone https://github.com/Y-Agent/modular-addition-feature-learning.git
cd modular-addition-feature-learning
pip install -r requirements.txt
```

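
After installation, a quick way to check whether PyTorch can see a GPU is sketched below. The helper `pick_device` and its `gpu_threshold` parameter are hypothetical; the threshold of 50 simply mirrors the recommendation in the requirements list, and the training scripts may select devices differently.

```python
import importlib.util


def pick_device(p: int, gpu_threshold: int = 50) -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, otherwise "cpu"."""
    device = "cpu"
    if importlib.util.find_spec("torch") is not None:
        import torch                       # imported only when it is installed

        if torch.cuda.is_available():
            device = "cuda"
    if device == "cpu" and p > gpu_threshold:
        print(f"p={p}: no CUDA device found; training on CPU may be slow")
    return device


print(pick_device(23))
```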
## Project Structure

```
modular-addition-feature-learning/
├── src/                          # Core source code
│   ├── module_nn.py              # Training script with CLI
│   ├── nnTrainer.py              # Training loop and optimization
│   ├── model_base.py             # Neural network architecture (EmbedMLP)
│   ├── mechanism_base.py         # Fourier analysis and decomposition
│   ├── utils.py                  # Configuration and helpers
│   └── configs.yaml              # Default hyperparameters
├── precompute/                   # Batch training and plot generation
│   ├── run_pipeline.sh           # Full pipeline for one modulus
│   ├── run_all.sh                # Batch pipeline for all odd p
│   ├── train_all.py              # Train 5 configurations
│   ├── generate_plots.py         # Generate model-based plots + JSONs
│   ├── generate_analytical.py    # Analytical ODE simulation plots
│   └── prime_config.py           # Configurations and sizing formula
├── hf_app/                       # Gradio web application
│   └── app.py                    # Interactive visualization app
├── precomputed_results/          # Pre-computed plots and data
│   ├── p_015/                    # Results for p=15
│   ├── p_023/                    # Results for p=23
│   ├── p_029/                    # Results for p=29
│   └── p_031/                    # Results for p=31
├── notebooks/                    # Analysis and visualization notebooks
├── requirements.txt              # Python dependencies
└── README.md
```

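
The result folders use a zero-padded, three-digit modulus (`p_015`, `p_023`, ...). A small sketch for locating a folder and counting its files by extension is given below; `result_dir` and `inventory` are hypothetical helpers, and the padding width is inferred from the directory names shown in the tree.

```python
from collections import Counter
from pathlib import Path


def result_dir(p: int, root: str = "precomputed_results") -> Path:
    """Folder for modulus p, e.g. p=15 maps to <root>/p_015."""
    return Path(root) / f"p_{p:03d}"


def inventory(p: int, root: str = "precomputed_results") -> Counter:
    """Count files per extension in the folder; empty if it does not exist."""
    folder = result_dir(p, root)
    if not folder.is_dir():
        return Counter()
    return Counter(f.suffix.lower() for f in folder.iterdir() if f.is_file())


print(result_dir(15).name)
print(inventory(23))
```

Run from the repository root, `inventory(23)` gives a quick check that a pipeline run produced the expected mix of `.png` and `.json` files.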
## Citation

```bibtex
@article{he2025modular,
  title={On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking},
  author={He, Jianliang and Wang, Leda and Chen, Siyu and Yang, Zhuoran},
  journal={arXiv preprint arXiv:2602.16849},
  year={2026}
}
```

## License

[MIT License](LICENSE)