Emo-1B-14B-1T — MLX (4-bit, group_size=32)

MLX 4-bit quantization of allenai/Emo_1b14b_1T — Ai2's EMO (Emergent Modularity) MoE LM (1B active / 14B total, top-8 of 128 experts, 1 shared expert, midtrained on 1T tokens of OLMoE-mix-0924).

Quantization: 4-bit, group_size=32. Chosen over group_size=64 because the smaller groups limit quantization error on the MoE gate (router) projection, which matters more for routing quality in an MoE than it does for a dense model.
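
For reference, a minimal sketch of how a conversion like this is produced (assuming the emo-aware mlx-lm checkout described below is installed, and that mlx_lm.convert's Python API with quantize / q_bits / q_group_size parameters is available; the output path is a placeholder):

from mlx_lm.convert import convert

convert(
    hf_path="allenai/Emo_1b14b_1T",  # base checkpoint on the Hub
    mlx_path="emo-1b14b-1t-4bit",    # local output directory
    quantize=True,
    q_bits=4,
    q_group_size=32,                 # tighter groups, per the note above
)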

Size on disk: ~9.5 GB. Runs comfortably on Apple Silicon with ≥16 GB unified memory.

Credits

All credit for the model, training, data, and the EMO architecture goes to the original authors at Allen Institute for AI (Ai2):

@article{wang2026emo,
  title   = {EMO: Pretraining Mixture of Experts for Emergent Modularity},
  author  = {Wang and Bhagia and Min and others},
  journal = {arXiv preprint arXiv:2605.06663},
  year    = {2026}
}

This repository only contributes the MLX port and 4-bit quantization — no training, no data, no architectural research.

⚠️ Requires a fork of mlx-lm

The emo model type is not yet upstream in ml-explore/mlx-lm. Loading this model with stock pip install mlx-lm will fail with:

ValueError: Model type emo not supported.

Until the PR lands upstream, install an mlx-lm checkout that includes mlx_lm/models/emo.py:

git clone https://github.com/ml-explore/mlx-lm.git
cd mlx-lm
# Add the emo.py model file (shipped alongside this README) to mlx_lm/models/
pip install -e .

Or, if a public fork already exists, replace the URL above with it. The emo.py file used to produce this repo is shipped alongside this README for reference.
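
To confirm your environment will recognize the architecture before downloading weights: mlx-lm resolves a checkpoint's model_type by importing mlx_lm.models.<model_type>, so this import succeeding is a reliable proxy:

import importlib

try:
    # Succeeds once mlx_lm/models/emo.py is present (fork or patched checkout).
    importlib.import_module("mlx_lm.models.emo")
    print("emo model type available")
except ImportError:
    print("emo.py missing: still on stock mlx-lm")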

Usage

from mlx_lm import load, generate

model, tokenizer = load("georgesZam/emo-1b14b-1t-4bit")
out = generate(model, tokenizer, prompt="The capital of France is", max_tokens=80, verbose=True)

CLI:

python -m mlx_lm generate --model georgesZam/emo-1b14b-1t-4bit --prompt "Hello"

Architecture notes (port)

EMO at inference time is OLMoE-shaped with three tweaks the MLX port handles:

  1. Shared experts. The last num_shared_experts columns of the gate route through a separate softmax + top-k (here top-1 over 1 shared expert), and the result is concatenated to the standard top-k; shared indices are offset by num_experts - num_shared_experts so they address the tail of the expert table. This matches the legacy EmoSparseMoeBlock path in modeling_emo.py (see the sketch after this list).
  2. Layernorm naming. EMO uses pre_attention_layernorm / pre_feedforward_layernorm. The MLX module mirrors these names so weights map 1-to-1 without a remap.
  3. No q/k norm. Unlike OLMoE, EMO's attention has no q_norm / k_norm.
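
A condensed sketch of the shared-expert routing in (1), with simplified shapes, eliding batching and the quantized expert dispatch (the function name and slicing layout are illustrative, not the fork's verbatim code):

import mlx.core as mx

def emo_route(gate_logits, num_experts=128, num_shared=1, top_k=8, shared_k=1):
    # Routed experts occupy the first columns of the gate; shared experts the last.
    num_routed = num_experts - num_shared
    routed_logits = gate_logits[..., :num_routed]
    shared_logits = gate_logits[..., num_routed:]

    # Standard softmax + top-k over the routed experts.
    routed_p = mx.softmax(routed_logits, axis=-1)
    routed_idx = mx.argpartition(-routed_p, kth=top_k - 1, axis=-1)[..., :top_k]
    routed_w = mx.take_along_axis(routed_p, routed_idx, axis=-1)

    # Separate softmax + top-k over the shared columns (top-1 over 1 expert here),
    # with indices offset back into the full expert table.
    shared_p = mx.softmax(shared_logits, axis=-1)
    shared_idx = mx.argpartition(-shared_p, kth=shared_k - 1, axis=-1)[..., :shared_k]
    shared_w = mx.take_along_axis(shared_p, shared_idx, axis=-1)
    shared_idx = shared_idx + num_routed  # offset by num_experts - num_shared

    # Every token gets its top-k routed experts plus the shared expert(s).
    return (mx.concatenate([routed_idx, shared_idx], axis=-1),
            mx.concatenate([routed_w, shared_w], axis=-1))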

"Emergent Modularity" itself is a pretraining contribution — at inference it's a vanilla MoE forward pass. Expert subsetting (the paper's headline benefit) is not exposed in this checkpoint; doing it cleanly requires an offline analysis of expert activations on your target domain and dropping unused expert rows from each layer's stacked switch_mlp tensors.

License

Inherits from the base model. See allenai/Emo_1b14b_1T for terms.
