--- title: Modular Addition Feature Learning emoji: "🔢" colorFrom: blue colorTo: yellow sdk: gradio sdk_version: "6.5.1" app_file: hf_app/app.py pinned: false --- # On the Mechanism and Dynamics of Modular Addition ### Fourier Features, Lottery Ticket, and Grokking **Jianliang He, Leda Wang, Siyu Chen, Zhuoran Yang** *Department of Statistics and Data Science, Yale University* [[arXiv](https://arxiv.org/abs/2602.16849)] [[Blog](https://y-agent.github.io/posts/modular_addition_feature_learning/)] [[Interactive Demo](https://huggingface.co/spaces/y-agent/modular-addition-feature-learning)] --- ## Overview This repository provides the code for studying how a two-layer neural network learns modular arithmetic $f(x,y) = (x+y) \bmod p$. We analyze three phenomena: 1. **Fourier Feature Learning** — Each neuron independently discovers a cosine wave at a single frequency, collectively implementing a discrete Fourier transform that the network was never taught. 2. **Lottery Ticket Dynamics** — Random initialization determines which frequency each neuron will specialize in: the frequency with the best initial phase alignment wins a winner-take-all competition. 3. **Grokking** — Under partial data with weight decay, the network first memorizes, then suddenly generalizes through a three-stage process: memorization → sparsification → cleanup. An [**Interactive Demo**](https://huggingface.co/spaces/y-agent/modular-addition-feature-learning) on Hugging Face Spaces visualizes all results with 9 analysis tabs, interactive Plotly charts, and on-demand training for any odd $p \geq 3$. Pre-computed examples are included for $p = 15, 23, 29, 31$. ### Launch Locally ```bash pip install -r requirements.txt python hf_app/app.py # Opens at http://localhost:7860 ``` ### Deploy to Hugging Face Spaces We use the [Hugging Face Python API](https://huggingface.co/docs/huggingface_hub/) to upload to Spaces, since HF now requires [Xet storage](https://huggingface.co/docs/hub/xet) for binary files (PNGs, etc.) which standard `git push` does not handle. **First-time setup:** ```bash pip install huggingface_hub hf_xet ``` Log in (get a **write** token from https://huggingface.co/settings/tokens): ```bash huggingface-cli login ``` **Upload to the Space:** ```bash python deploy_to_hf.py # or with a custom commit message: python deploy_to_hf.py --message "Update app" ``` The deploy script prepends the required HuggingFace Space metadata (SDK config, app path, etc.) to `README.md` before uploading, so the GitHub README stays clean. **What gets uploaded:** Only the files the app needs — `hf_app/`, `precompute/`, `precomputed_results/`, `src/`, `requirements.txt`, `README.md`. Model checkpoints, notebooks, and figures are excluded. **On-demand training:** Users can generate results for new $p$ values directly from the app's "Generate" button. Streaming logs show real-time training progress. New results are auto-committed back to the Space repo so they persist across restarts. > **Tip:** For GPU-accelerated on-demand training, select a GPU runtime in your Space settings. ## Pre-computation Pipeline The `precompute/` directory trains 5 model configurations per modulus and generates all plots + interactive JSON data. See [`precompute/README.md`](precompute/README.md) for full documentation. ### Quick Start ```bash # Full pipeline for a single modulus (train → plots → analytical → verify) bash precompute/run_pipeline.sh 23 # With custom d_mlp bash precompute/run_pipeline.sh 23 --d_mlp 128 # Delete checkpoints after generating plots (saves disk space) CLEANUP=1 bash precompute/run_pipeline.sh 23 # Batch: all odd p in [3, 99] bash precompute/run_all.sh # Or up to p=199 MAX_P=199 bash precompute/run_all.sh ``` ### Manual Steps ```bash # Step 1: Train all 5 configurations python precompute/train_all.py --p 23 --output ./trained_models --resume # Step 2: Generate model-based plots (21 PNGs + 7 JSONs) python precompute/generate_plots.py --p 23 --input ./trained_models --output ./precomputed_results # Step 3: Generate analytical simulation plots (2 PNGs, no model needed) python precompute/generate_analytical.py --p 23 --output ./precomputed_results ``` ### Output Each modulus produces ~33 files in `precomputed_results/p_XXX/`: | Category | Files | Description | |----------|-------|-------------| | Overview (Tab 1) | 2 PNGs + 1 JSON | Loss, IPR, phase scatter | | Fourier Weights (Tab 2) | 3 PNGs + 1 JSON | DFT heatmaps, cosine fits, neuron spectra | | Phase Analysis (Tab 3) | 3 PNGs | Phase distribution, alignment, magnitudes | | Output Logits (Tab 4) | 1 PNG + 1 JSON | Logit heatmap, interactive explorer | | Lottery Mechanism (Tab 5) | 3 PNGs | Magnitude race, phase convergence, contour | | Grokking (Tab 6) | 5 PNGs + 3 JSONs | Loss/acc curves, memorization, weight evolution | | Gradient Dynamics (Tab 7) | 4 PNGs | Phase alignment + DFT for Quad and ReLU | | Decoupled Simulation (Tab 8) | 2 PNGs | Analytical ODE integration | | Metadata | 2 JSONs | Config + training log | > **Note:** Grokking results (Tab 6) require $p \geq 19$. Smaller values of $p$ have too few data points for a meaningful train/test split. ## The 5 Training Configurations | Config | Activation | Optimizer | LR | Weight Decay | Data | Epochs | Used In | |--------|-----------|-----------|-----|-------------|------|--------|---------| | `standard` | ReLU | AdamW | 5e-5 | 0 | 100% | 5,000 | Tabs 1–4 | | `grokking` | ReLU | AdamW | 1e-4 | 2.0 | 75% | 50,000 | Tabs 1, 6 | | `quad_random` | Quad | AdamW | 5e-5 | 0 | 100% | 5,000 | Tab 5 | | `quad_single_freq` | Quad | SGD | 0.1 | 0 | 100% | 10,000 | Tab 7 | | `relu_single_freq` | ReLU | SGD | 0.01 | 0 | 100% | 10,000 | Tab 7 | ## Running a Single Experiment For custom experiments outside the pre-computation pipeline: ```bash cd src # Train with default config (p=97, d_mlp=1024, ReLU, 5000 epochs) python module_nn.py # Train with specific parameters python module_nn.py --p 23 --d_mlp 512 --num_epochs 5000 --lr 5e-5 # Dry run: see config without training python module_nn.py --dry_run --p 23 --d_mlp 512 ``` ## Notebooks Interactive analysis notebooks in `notebooks/`: | Notebook | Description | |----------|-------------| | `empirical_insight_standard.ipynb` | Fourier weight analysis, phase distributions, output logits | | `empirical_insight_grokk.ipynb` | Grokking stages, weight dynamics, IPR evolution | | `lottery_mechanism.ipynb` | Neuron specialization, frequency magnitude/phase tracking | | `interprete_gd_dynamics.ipynb` | Phase alignment under single-frequency initialization | | `decouple_dynamics_simulation.ipynb` | Analytical gradient flow simulation | ## Setup ### Requirements - Python 3.8+ - PyTorch 2.0+ - CUDA-capable GPU (recommended for $p > 50$; CPU works for small $p$) ### Installation ```bash git clone https://github.com/Y-Agent/modular-addition-feature-learning.git cd modular-addition-feature-learning pip install -r requirements.txt ``` ## Project Structure ``` modular-addition-feature-learning/ ├── src/ # Core source code │ ├── module_nn.py # Training script with CLI │ ├── nnTrainer.py # Training loop and optimization │ ├── model_base.py # Neural network architecture (EmbedMLP) │ ├── mechanism_base.py # Fourier analysis and decomposition │ ├── utils.py # Configuration and helpers │ └── configs.yaml # Default hyperparameters ├── precompute/ # Batch training and plot generation │ ├── run_pipeline.sh # Full pipeline for one modulus │ ├── run_all.sh # Batch pipeline for all odd p │ ├── train_all.py # Train 5 configurations │ ├── generate_plots.py # Generate model-based plots + JSONs │ ├── generate_analytical.py # Analytical ODE simulation plots │ └── prime_config.py # Configurations and sizing formula ├── hf_app/ # Gradio web application │ └── app.py # Interactive visualization app ├── precomputed_results/ # Pre-computed plots and data │ ├── p_015/ # Results for p=15 │ ├── p_023/ # Results for p=23 │ ├── p_029/ # Results for p=29 │ └── p_031/ # Results for p=31 ├── notebooks/ # Analysis and visualization notebooks ├── requirements.txt # Python dependencies └── README.md ``` ## Citation ```bibtex @article{he2025modular, title={On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking}, author={He, Jianliang and Wang, Leda and Chen, Siyu and Yang, Zhuoran}, journal={arXiv preprint arXiv:2602.16849}, year={2025} } ``` ## License [MIT License](LICENSE)