Emo-1B-14B-1T — MLX (4-bit, group_size=32)
MLX 4-bit quantization of allenai/Emo_1b14b_1T — Ai2's EMO (Emergent Modularity) MoE LM (1B active / 14B total, top-8 of 128 experts, 1 shared expert, midtrained on 1T tokens of OLMoE-mix-0924).
Quantization: 4-bit, group_size=32 (chosen over g64 to keep the MoE router precision tighter — group_size=32 limits the quantization error on the gate projection, which matters more for MoE than for dense models).
Size on disk: ~9.5 GB. Runs comfortably on Apple Silicon with ≥16 GB unified memory.
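To reproduce a quantization like this one, something along the following lines with mlx_lm.convert should work (run against the fork described below, since stock mlx-lm does not recognize the emo model type; the exact invocation here is illustrative, not the recorded command):
python -m mlx_lm convert \
    --hf-path allenai/Emo_1b14b_1T \
    --mlx-path emo-1b14b-1t-4bit \
    -q --q-bits 4 --q-group-size 32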
Credits
All credit for the model, training, data, and the EMO architecture goes to the original authors at Allen Institute for AI (Ai2):
- Paper: EMO: Pretraining Mixture of Experts for Emergent Modularity — Wang, Bhagia, Min et al., Ai2, 2026.
- Base model: allenai/Emo_1b14b_1T
- Training data: allenai/OLMoE-mix-0924
- Visualization: emovisualization.netlify.app
@article{wang2026emo,
  title   = {EMO: Pretraining Mixture of Experts for Emergent Modularity},
  author  = {Wang and Bhagia and Min and others},
  journal = {arXiv preprint arXiv:2605.06663},
  year    = {2026}
}
This repository only contributes the MLX port and 4-bit quantization — no training, no data, no architectural research.
⚠️ Requires a fork of mlx-lm
The emo model type is not yet upstream in ml-explore/mlx-lm. Loading this model with stock pip install mlx-lm will fail with:
ValueError: Model type emo not supported.
Until the PR lands upstream, install the fork that ships mlx_lm/models/emo.py:
git clone https://github.com/ml-explore/mlx-lm.git
cd mlx-lm
# Copy the emo.py shipped alongside this README into mlx_lm/models/ (see "Architecture notes" below for what it implements)
pip install -e .
Or, if a public fork already exists, replace the URL above with it. The emo.py file used to produce this repo is shipped alongside this README for reference.
Usage
from mlx_lm import load, generate
model, tokenizer = load("georgesZam/emo-1b14b-1t-4bit")
out = generate(model, tokenizer, prompt="The capital of France is", max_tokens=80, verbose=True)
CLI:
python -m mlx_lm generate --model georgesZam/emo-1b14b-1t-4bit --prompt "Hello"
Architecture notes (port)
EMO at inference time is OLMoE-shaped with three tweaks the MLX port handles:
- Shared experts. The last num_shared_experts columns of the gate route through a separate softmax + top-k (here top-1 over 1 shared expert) and are concatenated to the standard top-k. Shared indices are offset by num_experts - num_shared_experts. This matches the original EmoSparseMoeBlock legacy path in modeling_emo.py. (A minimal sketch follows this list.)
- Layernorm naming. EMO uses pre_attention_layernorm / pre_feedforward_layernorm. The MLX module mirrors these names so weights map 1-to-1 without a remap.
- No q/k norm. Unlike OLMoE, EMO's attention has no q_norm / k_norm.
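A minimal sketch of that shared-expert routing, using illustrative names (route, gate_logits) rather than the port's actual module structure:
import mlx.core as mx

def route(gate_logits: mx.array, num_experts: int, num_shared_experts: int, top_k: int):
    # gate_logits: (num_tokens, num_experts) raw router outputs.
    split = num_experts - num_shared_experts
    routed, shared = gate_logits[:, :split], gate_logits[:, split:]

    # Standard softmax + top-k over the routable experts.
    probs = mx.softmax(routed, axis=-1)
    idx = mx.argsort(-probs, axis=-1)[:, :top_k]
    weights = mx.take_along_axis(probs, idx, axis=-1)

    # Separate softmax + top-1 over the shared expert(s); indices are offset by
    # num_experts - num_shared_experts so they address the full expert stack.
    shared_probs = mx.softmax(shared, axis=-1)
    shared_idx = mx.argmax(shared_probs, axis=-1, keepdims=True)
    shared_weights = mx.take_along_axis(shared_probs, shared_idx, axis=-1)

    # Concatenate the shared pick onto the standard top-k.
    return (mx.concatenate([idx, shared_idx + split], axis=-1),
            mx.concatenate([weights, shared_weights], axis=-1))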
"Emergent Modularity" itself is a pretraining contribution — at inference it's a vanilla MoE forward pass. Expert subsetting (the paper's headline benefit) is not exposed in this checkpoint; doing it cleanly requires an offline analysis of expert activations on your target domain and dropping unused expert rows from each layer's stacked switch_mlp tensors.
License
Inherits from the base model. See allenai/Emo_1b14b_1T for terms.