MoE Redistribution: Trained Checkpoints
Trained 1-layer transformer checkpoints accompanying the anonymous submission Sparsity Moves Computation (under double-blind review).
The repository contains 5-seed checkpoint sets across all architecture and routing variants used in the paper: dense FFN, GLU, MoE (top-1 / top-2, learned and random routing), and MoE-GLU, on three tasks (add-7, modular addition, histogram counting).
Layout
checkpoints/
  <task>_<arch>_<config>_s<seed>/
    best_model.pt    # add-7, histogram (PyTorch state dict + config)
    modadd_best.pt   # modular addition
<task> ∈ {add7, modadd, hist}. <arch> ∈ {ffn, glu, moe, moe_glu}. <config> encodes width, activation, normalization, and routing variant (e.g. nonorm, narrow_nonorm, topk2_nonorm, randroute_nonorm, d170_silu_nonorm).
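The naming scheme above can be split back into its fields with a small parser. The sketch below is illustrative only: the function name `parse_checkpoint_dir` and the `CheckpointName` type are hypothetical, and it assumes that a directory beginning `<task>_moe_glu_...` always denotes the moe_glu architecture rather than moe with a config that starts with `glu`.

```python
import re
from typing import NamedTuple

# "moe_glu" is listed before "moe" and "glu" so the longer name wins.
# ASSUMPTION: "<task>_moe_glu_..." always means arch=moe_glu.
_DIR_PATTERN = re.compile(
    r"^(?P<task>add7|modadd|hist)_"
    r"(?P<arch>moe_glu|moe|ffn|glu)_"
    r"(?P<config>.+)_s(?P<seed>\d+)$"
)


class CheckpointName(NamedTuple):
    task: str
    arch: str
    config: str
    seed: int


def parse_checkpoint_dir(name: str) -> CheckpointName:
    """Split '<task>_<arch>_<config>_s<seed>' into its four fields."""
    m = _DIR_PATTERN.match(name)
    if m is None:
        raise ValueError(f"not a recognised checkpoint directory: {name!r}")
    return CheckpointName(m["task"], m["arch"], m["config"], int(m["seed"]))


print(parse_checkpoint_dir("add7_ffn_nonorm_s42"))
```

Because the config field is matched greedily up to the final `_s<seed>` suffix, multi-part configs such as d170_silu_nonorm stay intact.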
Each .pt file contains a Python dict with keys: model_state_dict, config, optimizer_state_dict, and one of accuracy / test_acc / step / epoch. The config dict stores architectural hyperparameters only.
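Since the metric key differs between checkpoints, it can help to check which keys are present before indexing into the dict. The following is a minimal sketch that works on an already loaded checkpoint dict; the helper name `inspect_checkpoint` is hypothetical, and the key names are taken from the description above.

```python
from typing import Any, Dict, Mapping

# Keys that every checkpoint is described as containing.
REQUIRED_KEYS = ("model_state_dict", "config", "optimizer_state_dict")
# Keys that vary from checkpoint to checkpoint.
OPTIONAL_KEYS = ("accuracy", "test_acc", "step", "epoch")


def inspect_checkpoint(ck: Mapping[str, Any]) -> Dict[str, Any]:
    """Check the required keys and return whichever optional keys are present."""
    missing = [k for k in REQUIRED_KEYS if k not in ck]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return {k: ck[k] for k in OPTIONAL_KEYS if k in ck}


# Stand-in dict for illustration; a real one would come from torch.load.
example = {
    "model_state_dict": {},
    "config": {},
    "optimizer_state_dict": {},
    "test_acc": 0.99,
}
print(inspect_checkpoint(example))
```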
Loading a checkpoint
from huggingface_hub import hf_hub_download
import torch
path = hf_hub_download(
    repo_id="Sparsity-Moves-Computation/moe-redistribution-checkpoints",
    filename="add7_ffn_nonorm_s42/best_model.pt",
)
# weights_only=False runs the full unpickler, so only load files you trust.
ck = torch.load(path, weights_only=False, map_location="cpu")
print(ck["config"])    # architectural hyperparameters
print(ck["accuracy"])  # final eval accuracy (other checkpoints may store test_acc instead)
Bulk download
huggingface-cli download \
  Sparsity-Moves-Computation/moe-redistribution-checkpoints \
  --local-dir checkpoints/
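After a bulk download, the checkpoint files can be enumerated with the standard library alone. This sketch assumes the run directories sit directly under the chosen local directory, as in the layout above; the helper name `find_checkpoints` is hypothetical.

```python
from pathlib import Path
from typing import List

# The two checkpoint filenames used in this repository.
CHECKPOINT_FILENAMES = ("best_model.pt", "modadd_best.pt")


def find_checkpoints(root: str) -> List[Path]:
    """Return checkpoint files found one level below `root`, sorted by run directory."""
    found = []
    for run_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for fname in CHECKPOINT_FILENAMES:
            candidate = run_dir / fname
            if candidate.is_file():
                found.append(candidate)
    return found


if __name__ == "__main__":
    # ASSUMPTION: checkpoints were downloaded into ./checkpoints/
    if Path("checkpoints").is_dir():
        for path in find_checkpoints("checkpoints"):
            print(path)
```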
Reproducibility
A link to the accompanying code repository is provided in the paper. Loading these checkpoints with that repository's OneLayerTransformer class (in model/model.py) reproduces every result in the paper.
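Restoring a model from a loaded checkpoint might look roughly like the sketch below. It is written under a loud assumption: that the model class accepts the stored config entries as keyword arguments. The real constructor signature of OneLayerTransformer is not documented here, so check model/model.py before relying on this. The helper name `restore_model` is hypothetical; `load_state_dict` and `eval` are the standard torch.nn.Module methods.

```python
from typing import Any, Callable, Mapping


def restore_model(model_cls: Callable[..., Any], ck: Mapping[str, Any]) -> Any:
    """Build a model from ck['config'] and load ck['model_state_dict'] into it."""
    # ASSUMPTION: the config keys match the constructor's keyword arguments.
    model = model_cls(**ck["config"])
    model.load_state_dict(ck["model_state_dict"])
    model.eval()  # switch to inference mode (disables dropout and similar layers)
    return model
```

With the code repository on the import path, a call would then take the form `restore_model(OneLayerTransformer, ck)`, where `ck` is the dict returned by `torch.load`.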
Notes
- All identifying information has been removed from filenames, folder names, and .pt config dicts.
- The repository is anonymized for the duration of the review period.