Metis-1.5-base
Metis-1.5-base is an 898M-parameter (≈340M active/token) single-latent Mixture-of-Experts language model, pretrained from scratch on 50B tokens of curated, decontaminated English text. It was trained end to end in pure JAX on a single TPU v6e-8 with a fully custom training stack: no PyTorch, no Megatron, no Hugging Face Trainer.
This is a base model. It is trained only for next-token prediction and is not instruction-tuned: it continues text, it does not follow instructions or hold a conversation. Prompt it with a stem ("The three laws of motion are") rather than a question. An instruction-tuned Metis-1.5-think variant is trained separately.
At a glance
| Property | Value |
|---|---|
| Total parameters | 898M |
| Active parameters / token | ≈340M (top-4 of 32 experts + shared expert) |
| Architecture | Single-latent MoE decoder |
| Layers / d_model | 19 / 1536 |
| Attention | Grouped-query, 24 query heads / 8 KV heads, head_dim 64, RoPE (NeoX) |
| Experts | 32 routed (top-4) + 1 shared, squared-ReLU, shared 512-dim latent |
| Context length | 1024 tokens |
| Vocabulary | 32,768 (custom byte-level BPE) |
| Training tokens | 50B (English, deduped, benchmark-decontaminated) |
| Optimizer | AdaMuon (Newton-Schulz orthogonalized momentum + Adam) |
| Precision | bf16 weights for release (fp32 master weights during training) |
| Hardware | 1× TPU v6e-8, JAX/XLA |
| License | CC0-1.0 (public domain, zero restrictions) |
Why Metis exists: the philosophy
Metis is an independent, from-scratch exploration of how much capability you can pack into a small, efficient Mixture-of-Experts model trained on a single TPU pod, with a custom JAX stack you fully understand, top to bottom, rather than a black-box framework.
Three convictions shape it:
- Sparsity should be cheap, not just powerful. Standard MoE layers duplicate a full FFN per expert, so memory balloons with expert count. Metis routes every expert through one shared low-rank latent space, so adding experts costs far less: you buy specialization without paying full-FFN memory for each one.
- Data quality beats data quantity at small scale. The 50B-token corpus is aggressively filtered, deduplicated, and decontaminated, and bucketed by capability (web, encyclopedic, scientific, math, reference, books, Q&A, synthetic textbooks) so the mixture is deliberate rather than incidental. The blend leans education- and STEM-heavy on purpose.
- Own the whole stack. Tokenizer, data pipeline, model, optimizer, sharding, and checkpointing are all custom JAX. The point is understanding, not abstraction.
Metis-1.5 is a proof of concept for that architecture and pipeline: small enough to iterate on quickly, complete enough to be genuinely useful as a base for fine-tuning and continued pretraining.
Architecture
Metis-1.5 is a decoder-only transformer whose feed-forward sublayer is replaced by a single-latent Mixture-of-Experts (LatentMoE) block.
Single-latent MoE
Instead of giving each expert its own full-width FFN, every token is first projected down into a shared 512-dim latent space (latent_down: 1536 → 512). Routing and all expert computation happen in that compact latent space, then a single latent_up projection (512 → 1536) maps the combined result back. Concretely, per MoE layer:
- Router: linear projection of the hidden state → 32 expert logits, softmax scores, top-4 selection. Load is balanced with an auxiliary-loss-free learned bias (DeepSeek-style: the bias steers selection only; combine weights come from the unbiased scores).
- Routed experts: 32 experts, each a squared-ReLU MLP operating in the 512-dim latent (expert intermediate size 1024). Each token is processed by its top-4 experts, combined by normalized router weights.
- Shared expert: 1 always-on expert applied to every token, providing a stable backbone alongside the sparse routed capacity.
This is why total parameters (898M) and active parameters (≈340M/token) diverge so much: only 4 of 32 routed experts fire per token.
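The layer described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Metis implementation: it uses a naive per-token loop instead of the capacity-based dispatch used in training, it assumes the shared expert also runs in the latent space before latent_up, and the dictionary keys other than latent_down, latent_up, router and expert_w1 (and the helper init_params) are hypothetical names chosen for the example.

```python
import numpy as np

def squared_relu(x):
    # Squared-ReLU activation: max(x, 0) squared.
    return np.maximum(x, 0.0) ** 2

def route(x, router_w, router_bias, top_k):
    """Return (expert ids, combine weights), each of shape (tokens, top_k)."""
    logits = x @ router_w
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    scores = np.exp(logits)
    scores = scores / scores.sum(axis=-1, keepdims=True)
    # The bias only influences WHICH experts are picked ...
    chosen = np.argsort(-(scores + router_bias), axis=-1)[:, :top_k]
    # ... the combine weights come from the unbiased scores, renormalized.
    gate = np.take_along_axis(scores, chosen, axis=-1)
    return chosen, gate / gate.sum(axis=-1, keepdims=True)

def latent_moe(x, p, top_k=4):
    """x: (tokens, d_model); p: dict of weight arrays."""
    z = x @ p["latent_down"]                      # (tokens, d_latent)
    chosen, gate = route(x, p["router"], p["router_bias"], top_k)
    out = np.zeros_like(z)
    for t in range(z.shape[0]):                   # naive loop, for clarity only
        for j in range(top_k):
            e = chosen[t, j]
            h = squared_relu(z[t] @ p["expert_w1"][e])
            out[t] += gate[t, j] * (h @ p["expert_w2"][e])
    # ASSUMPTION: the always-on shared expert also operates in the latent space.
    out = out + squared_relu(z @ p["shared_w1"]) @ p["shared_w2"]
    return out @ p["latent_up"]                   # back to (tokens, d_model)

def init_params(rng, d_model, d_latent, n_experts, d_hidden):
    # Hypothetical helper: small random weights with the shapes used above.
    n = lambda *shape: 0.1 * rng.standard_normal(shape)
    return {
        "latent_down": n(d_model, d_latent),
        "latent_up": n(d_latent, d_model),
        "router": n(d_model, n_experts),
        "router_bias": np.zeros(n_experts),
        "expert_w1": n(n_experts, d_latent, d_hidden),
        "expert_w2": n(n_experts, d_hidden, d_latent),
        "shared_w1": n(d_latent, d_hidden),
        "shared_w2": n(d_hidden, d_latent),
    }
```

With the card's dimensions the call would be init_params(rng, 1536, 512, 32, 1024) and top_k=4; smaller sizes are enough to check shapes.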
Attention & backbone
- Grouped-query attention: 24 query heads, 8 key/value heads (3:1 grouping), head_dim 64.
- Rotary position embeddings (RoPE, NeoX/Llama half-split convention, θ = 10,000).
- RMSNorm (pre-norm), tied input/output embeddings, no biases.
- 19 layers, d_model 1536, context length 1024, vocab 32,768.
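The dimensions listed in this section are enough for a back-of-the-envelope parameter count. The sketch below is an estimate under stated assumptions, not an official breakdown: the size of the shared expert is not given in this card, so it is assumed here to be a d_model → 1024 → d_model MLP (a guess that happens to reproduce the stated totals), and the small norm scales and router bias are ignored.

```python
# Dimensions quoted in this card.
VOCAB, D_MODEL, N_LAYERS = 32_768, 1536, 19
N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 24, 8, 64
D_LATENT, N_EXPERTS, TOP_K, D_EXPERT = 512, 32, 4, 1024

# ASSUMPTION (not stated in the card): shared expert is d_model -> 1024 -> d_model.
SHARED_PARAMS = 2 * D_MODEL * 1024

def count_params():
    embed = VOCAB * D_MODEL                             # tied input/output embedding
    attn = 2 * D_MODEL * N_Q_HEADS * HEAD_DIM           # q and o projections
    attn += 2 * D_MODEL * N_KV_HEADS * HEAD_DIM         # k and v projections
    latent = 2 * D_MODEL * D_LATENT                     # latent_down + latent_up
    router = D_MODEL * N_EXPERTS
    expert = 2 * D_LATENT * D_EXPERT                    # one routed expert
    always_on = attn + latent + router + SHARED_PARAMS  # used by every token
    total = embed + N_LAYERS * (always_on + N_EXPERTS * expert)
    active = embed + N_LAYERS * (always_on + TOP_K * expert)
    return total, active

total, active = count_params()
print(f"total  ~ {total / 1e6:.1f}M")   # total  ~ 898.0M
print(f"active ~ {active / 1e6:.1f}M")  # active ~ 340.1M
```

The gap between the two numbers is entirely the 28 routed experts per layer that a given token does not visit.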
Training data
50B tokens of English text, assembled from open corpora and bucketed by capability target so that under-filled sources can only fall back within the same bucket, keeping the intended capability mix intact.
| Bucket | Tokens | Representative sources |
|---|---|---|
| High-quality web | 12B | DCLM-baseline-HQ, DCLM edu-score-filtered |
| Educational web | 9B | FineWeb-Edu (score ≥3), FineWeb HQ |
| Diverse curated web | 5B | Essential-Web (HQ), TxT360 best-of-web, Zyda-2 (novelty), Ultra-FineWeb-en |
| Encyclopedic | 3B | FineWiki (English), Common-Pile Wikimedia |
| Scientific papers | 5B | peS2o (STEM), Proof-Pile-2 (arXiv science), Common-Pile PubMed OA, FinePDFs (technical OCR), OpenStax science/math |
| Mathematics | 6B | Nemotron-CC-Math (≥4), FineMath (≥4), OpenWebMath (equation-rich), Proof-Pile-2 (proofs), MegaMath (web proofs) |
| Reference / educational | 3B | Common-Corpus educational reference, Common-Pile educational, OpenStax textbooks |
| Books / long-form | 2B | PG-19, Common-Pile Project Gutenberg, Common-Corpus books/Wikisource |
| Q&A / explanatory | 2B | Common-Pile StackExchange, ROOTS-en StackExchange, FineWeb explainer-mined |
| Synthetic textbooks | 2B | Cosmopedia v2, synthetic textbook/explainer |
| STEM/reference reserve | 1B | Under-represented STEM / math / reference top-ups |
Every source passes the same gate before tokenization:
- English-only language ID (confidence ≥ 0.90, paragraph-level English ratio ≥ 0.85, Latin script).
- Exact + near-duplicate deduplication, globally across the corpus.
- Benchmark decontamination against MMLU, MMLU-Pro, ARC, HellaSwag, OpenBookQA, SciQ, TruthfulQA, GPQA, GSM8K, MATH, AIME, AMC, OlympiadBench, Minerva-Math, SVAMP, ASDiv, IFEval, MT-Bench, Arena-Hard, and RewardBench.
All credit for the underlying corpora belongs to their original authors and curators (see Acknowledgements).
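Two of the gate steps above are simple enough to illustrate directly. The sketch below is not the Metis pipeline: it covers only the language thresholds quoted above and exact deduplication, it reads "paragraph-level English ratio" as the share of paragraphs identified as English (an interpretation), and the function names and the whitespace normalization are assumptions made for the example. Script detection, near-duplicate detection and benchmark decontamination are not shown.

```python
import hashlib

MIN_LANG_CONF = 0.90      # document-level English confidence (from the card)
MIN_PARA_EN_RATIO = 0.85  # paragraph-level English ratio (from the card)

def passes_language_gate(lang_conf, english_para_ratio):
    """True if a document clears both language thresholds."""
    return lang_conf >= MIN_LANG_CONF and english_para_ratio >= MIN_PARA_EN_RATIO

def exact_dedup(docs):
    """Keep the first occurrence of each document.

    ASSUMPTION: documents are compared after collapsing whitespace runs;
    the normalization used for Metis is not described in the card.
    """
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

Hashing the normalized text keeps the seen-set small even for a corpus-wide pass, since only fixed-size digests are stored.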
Training procedure
- From scratch, single TPU v6e-8, pure JAX/XLA with a custom data-parallel training loop (pmap, replicate-once state, donated buffers).
- Optimizer: AdaMuon, with Newton-Schulz-orthogonalized momentum for 2-D weight matrices and Adam-style second-moment scaling elsewhere.
- Mixed precision: weights are kept as fp32 master copies and cast to bf16 for compute each step, so tiny optimizer updates survive while matmuls stay on the fast bf16 path.
- Schedule: linear warmup → cosine decay.
- Throughput knobs: bf16 attention scores (fp32 softmax reductions), streaming cross-entropy, softmax router with expert capacity factor 1.5, sort-free cumsum dispatch, sustaining ≈245,000 tokens/sec on the pod.
- Sequence length 1024, global batch 64 × 8 grad-accum across 8 devices (≈524k tokens/step).
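To make the "Newton-Schulz orthogonalized momentum" step in the optimizer bullet concrete, here is a minimal NumPy sketch of the textbook cubic Newton-Schulz iteration, which pushes all singular values of a matrix toward 1. It is an illustration only: the polynomial, step count and precision AdaMuon uses in Metis are not given in this card, and Muon-style optimizers often run a few steps of a tuned higher-order polynomial instead.

```python
import numpy as np

def newton_schulz_orthogonalize(m, steps=30):
    """Approximate the nearest semi-orthogonal matrix to a 2-D array m.

    Uses the classic cubic iteration X <- 1.5 X - 0.5 (X X^T) X, which
    converges when all singular values of the starting X lie in (0, sqrt(3)).
    """
    x = m / (np.linalg.norm(m) + 1e-12)  # Frobenius scaling: singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # iterate on the wide orientation
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x
```

In a Muon-style update, the matrix passed in would be the momentum buffer of a 2-D weight, and the orthogonalized result would replace it as the update direction.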
The released checkpoint is the bf16 export of the fp32 master weights at end of pretraining (step 95,367 ≈ 50B tokens).
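The batch, step and throughput figures in this section can be cross-checked with a few lines of arithmetic. The sketch assumes that "64 × 8 grad-accum" means 512 sequences per optimizer step and that the quoted throughput held for the whole run; the constant names are made up for the example.

```python
SEQ_LEN = 1024
GLOBAL_BATCH = 64 * 8     # 64 sequences x 8 gradient-accumulation steps
FINAL_STEP = 95_367
TOKENS_PER_SEC = 245_000  # approximate sustained throughput from the card

tokens_per_step = SEQ_LEN * GLOBAL_BATCH
total_tokens = tokens_per_step * FINAL_STEP
days = total_tokens / TOKENS_PER_SEC / 86_400

print(tokens_per_step)  # 524288
print(total_tokens)     # 49999773696
print(f"{days:.2f}")    # 2.36
```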
Tokenizer
A custom byte-level BPE tokenizer, vocabulary 32,768, trained on ~12M sampled documents (min pair frequency 2). Special tokens: <pad>=0, <bos>=1, <eos>=2, <unk>=3. The full tokenizer is included in this repository (tokenizer.json).
Intended uses & limitations
Intended for
- Research on small-scale / efficient MoE language models.
- A base for continued pretraining, domain adaptation, and instruction/preference fine-tuning.
- Studying the single-latent MoE architecture and JAX/TPU training.
Not intended for
- Drop-in chat or instruction following; use a fine-tuned variant (e.g. Metis-1.5-think) for that.
- Any safety-critical, factual, or high-stakes use.
Limitations
- Small (≈340M active params): limited world knowledge and reasoning depth versus large models.
- Base only: no instruction tuning, no RLHF, no safety alignment, so it can produce inaccurate, biased, or otherwise undesirable text.
- English-only training; ~1024-token context.
- Knowledge is bounded by the training corpus and cutoff.
How to use
The weights use JAX-native tensor names (embed, final_norm.scale, layers.N.q, layers.N.router, layers.N.expert_w1, …) and the architecture is custom, so this checkpoint does not load via transformers.AutoModel out of the box. It ships as a self-describing safetensors release; config.json carries every dimension, and the Architecture section above specifies the forward pass exactly.
Load the raw tensors with safetensors (framework-agnostic):
from safetensors import safe_open

weights = {}
# NumPy has no native bfloat16 dtype; if loading the bf16 tensors fails here,
# framework="flax" (JAX arrays) is an alternative to try.
with safe_open("model.safetensors", framework="numpy") as f:
    for k in f.keys():
        weights[k] = f.get_tensor(k)  # e.g. weights["layers.0.q"], weights["embed"]

# config.json holds d_model, n_layer, moe_num_experts, head_dim, rope_theta, ...
# Wire these tensors into the forward pass described in the Architecture section
# (single-latent MoE: latent_down -> softmax top-4 routing -> squared-ReLU experts
# in 512-dim latent + shared expert -> latent_up; GQA + NeoX RoPE; RMSNorm; tied embeddings).
A reference JAX forward/generation implementation is part of the Metis training stack; a standalone single-file loader and an optional transformers wrapper are planned follow-ups.
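Until a standalone loader exists, a first step when wiring the tensors by hand is to group the flat names by layer. The helper below is a small sketch that relies only on the layers.N.<name> naming pattern shown above; the function name group_by_layer is hypothetical, and the example names in the usage note are just the ones this card mentions.

```python
import re
from collections import defaultdict

# Matches per-layer tensors such as "layers.0.q" or "layers.12.expert_w1".
LAYER_KEY = re.compile(r"^layers\.(\d+)\.(.+)$")

def group_by_layer(names):
    """Split tensor names into {layer index: [suffixes]} and a list of global names."""
    layers, global_names = defaultdict(list), []
    for name in names:
        match = LAYER_KEY.match(name)
        if match:
            layers[int(match.group(1))].append(match.group(2))
        else:
            global_names.append(name)
    return dict(layers), global_names
```

For instance, group_by_layer(["embed", "final_norm.scale", "layers.0.q", "layers.0.router"]) returns ({0: ["q", "router"]}, ["embed", "final_norm.scale"]); with a loaded checkpoint the input would be weights.keys().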
Evaluation
0-shot accuracy on the full test split of each benchmark, scored with a custom JAX harness: multiple-choice by length-normalized loglikelihood (acc_norm) or plain loglikelihood (acc); GSM8K by greedy generation (chat template + flexible numeric extraction). Training data was decontaminated against these benchmarks, so these are clean held-out numbers.
| Benchmark | Metric | Random | Metis-1.5-base | Metis-1.5-think |
|---|---|---|---|---|
| ARC-Easy | acc_norm | 25.0 | 41.3 | 41.7 |
| ARC-Challenge | acc_norm | 25.0 | 25.9 | 28.2 |
| HellaSwag | acc_norm | 25.0 | 30.4 | 31.0 |
| PIQA | acc_norm | 50.0 | 54.7 | 54.6 |
| WinoGrande | acc | 50.0 | 51.5 | 51.8 |
| OpenBookQA | acc_norm | 25.0 | 29.6 | 28.6 |
| BoolQ | acc | ~62¹ | 47.7 | 57.2 |
| MMLU | acc | 25.0 | 23.6 | 23.3 |
| GSM8K | acc | ~0 | n/a | 7.6 |
¹ BoolQ majority-class baseline ≈ 62%. GSM8K is not run for the base model (non-instruct).
How to read these. Metis-1.5 is a ~340M-active model (898M total, MoE) trained on only 50B tokens, far fewer than the 0.3-18T behind modern sub-2B models, so it lands around GPT-2-medium tier: clearly above chance on ARC-Easy, modestly so on HellaSwag / PIQA / WinoGrande, and at chance on MMLU (which needs more knowledge capacity than this scale holds). Supervised fine-tuning leaves raw knowledge roughly unchanged (base ≈ think on multiple choice) while adding instruction-following, visible on BoolQ (+9.5) and GSM8K (7.6%), the latter notably strong for the scale (TinyLlama-1.1B / Pythia-1B sit ~2-3%), reflecting the math/reasoning-heavy data mix. The clearest lever for higher scores is more training tokens, not a different architecture.
Held-out pretrain-distribution LM loss ≈ 2.34 nats/token (~10.4 ppl) at end of pretraining.
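The two multiple-choice metrics named at the top of this section differ only in whether each candidate's loglikelihood is divided by its length before taking the argmax. The sketch below illustrates that difference and the loss-to-perplexity conversion. It is not the custom JAX harness: the function names are hypothetical, and normalizing by character count is an assumption, since the card does not say which length unit is used.

```python
import math

def pick_choice(loglikelihoods, choices, normalize=True):
    """Return the index of the predicted answer.

    normalize=False -> plain loglikelihood (acc)
    normalize=True  -> loglikelihood / length (acc_norm).
    ASSUMPTION: length is measured in characters of the choice text.
    """
    scores = [
        ll / max(len(choice), 1) if normalize else ll
        for ll, choice in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def perplexity(nats_per_token):
    """Convert a mean cross-entropy in nats/token to perplexity."""
    return math.exp(nats_per_token)

choices = ["yes", "a much longer answer"]
lls = [-6.0, -10.0]
print(pick_choice(lls, choices, normalize=False))  # 0
print(pick_choice(lls, choices, normalize=True))   # 1
print(round(perplexity(2.34), 1))                  # 10.4
```

Length normalization removes the built-in advantage of short answers, which is why the two metrics can disagree on the same example.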
Compute & environmental footprint
Trained on a single Google Cloud TPU v6e-8 in JAX. At ≈245k tokens/sec, the 50B-token run corresponds to roughly 2.4 days of single-pod compute. Training on one small pod (rather than a large cluster) keeps the footprint modest and the experiment reproducible.
License
Released under CC0-1.0, a public-domain dedication. You may use, modify, redistribute, fine-tune, and build upon Metis-1.5-base for any purpose, commercial or otherwise, with no restrictions and no attribution required. The model is provided as-is, without warranty of any kind.
(Note: the underlying training corpora are governed by their own respective licenses; CC0 applies to these released model weights.)
Citation
@misc{metis15base2026,
title = {Metis-1.5-base: A single-latent Mixture-of-Experts language model trained from scratch on one TPU pod},
author = {Lernex},
year = {2026},
howpublished = {Hugging Face},
note = {898M parameters (340M active), pretrained on 50B tokens in JAX on TPU v6e-8}
}
Acknowledgements
Deep thanks to the open-data community whose corpora made this possible, including DCLM, FineWeb / FineWeb-Edu / FineMath, Proof-Pile-2, peS2o, OpenWebMath, Nemotron-CC, MegaMath, Cosmopedia, Project Gutenberg / PG-19, Common-Pile, Common-Corpus, OpenStax, and the maintainers of the StackExchange and Wikimedia data dumps, and to the JAX and Cloud TPU teams for the training stack.