Metis-1.5-base

Metis-1.5-base is an 898M-parameter (≈340M active/token) single-latent Mixture-of-Experts language model, pretrained from scratch on 50B tokens of curated, decontaminated English text. It was trained end to end in pure JAX on a single TPU v6e-8 with a fully custom training stack — no PyTorch, no Megatron, no Hugging Face Trainer.

This is a base model. It is trained only for next-token prediction and is not instruction-tuned — it continues text, it does not follow instructions or hold a conversation. Prompt it with a stem ("The three laws of motion are") rather than a question. An instruction-tuned Metis-1.5-think variant is trained separately.

At a glance


Total parameters	898M
Active parameters / token	≈340M (top-4 of 32 experts + shared expert)
Architecture	Single-latent MoE decoder
Layers / `d_model`	19 / 1536
Attention	Grouped-query, 24 query heads / 8 KV heads, head_dim 64, RoPE (NeoX)
Experts	32 routed (top-4) + 1 shared, squared-ReLU, shared 512-dim latent
Context length	1024 tokens
Vocabulary	32,768 (custom byte-level BPE)
Training tokens	50B (English, deduped, benchmark-decontaminated)
Optimizer	AdaMuon (Newton–Schulz orthogonalized momentum + Adam)
Precision	bf16 weights for release (fp32 master weights during training)
Hardware	1× TPU v6e-8, JAX/XLA
License	CC0-1.0 (public domain — zero restrictions)

Why Metis exists — the philosophy

Metis is an independent, from-scratch exploration of how much capability you can pack into a small, efficient Mixture-of-Experts model trained on a single TPU pod — with a custom JAX stack you fully understand, top to bottom, rather than a black-box framework.

Three convictions shape it:

Sparsity should be cheap, not just powerful. Standard MoE layers duplicate a full FFN per expert, so memory balloons with expert count. Metis routes every expert through one shared low-rank latent space, so adding experts costs far less — you buy specialization without paying full-FFN memory for each one.
Data quality beats data quantity at small scale. The 50B-token corpus is aggressively filtered, deduplicated, and decontaminated, and bucketed by capability (web, encyclopedic, scientific, math, reference, books, Q&A, synthetic textbooks) so the mixture is deliberate rather than incidental. The blend leans education- and STEM-heavy on purpose.
Own the whole stack. Tokenizer, data pipeline, model, optimizer, sharding, and checkpointing are all custom JAX. The point is understanding, not abstraction.

Metis-1.5 is a proof of concept for that architecture and pipeline — small enough to iterate on quickly, complete enough to be genuinely useful as a base for fine-tuning and continued pretraining.

Architecture

Metis-1.5 is a decoder-only transformer whose feed-forward sublayer is replaced by a single-latent Mixture-of-Experts (LatentMoE) block.

Single-latent MoE

Instead of giving each expert its own full-width FFN, every token is first projected down into a shared 512-dim latent space (latent_down: 1536 → 512). Routing and all expert computation happen in that compact latent space, then a single latent_up projection (512 → 1536) maps the combined result back. Concretely, per MoE layer:

Router: linear projection of the hidden state → 32 expert logits, softmax scores, top-4 selection. Load is balanced with an auxiliary-loss-free learned bias (DeepSeek-style: the bias steers selection only; combine weights come from the unbiased scores).
Routed experts: 32 experts, each a squared-ReLU MLP operating in the 512-dim latent (expert intermediate size 1024). Each token is processed by its top-4 experts, combined by normalized router weights.
Shared expert: 1 always-on expert applied to every token, providing a stable backbone alongside the sparse routed capacity.
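The routing rule described in the list above can be sketched for a single token in plain Python. This is a minimal illustration, not the Metis JAX code: the function name `route_token`, the example logits, and the bias values are all hypothetical. It shows the detail that is easiest to get wrong: the load-balancing bias changes which experts are selected, but the combine weights are computed from the unbiased scores.

```python
import math

def route_token(logits, bias, k=4):
    """Pick top-k experts using biased scores; weight them with unbiased scores."""
    # Softmax over expert logits (max subtracted for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]

    # The learned bias only steers which experts are selected.
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]

    # Combine weights come from the unbiased scores, renormalized over the chosen set.
    denom = sum(scores[i] for i in chosen)
    return {i: scores[i] / denom for i in chosen}

# Hypothetical example: 32 experts; expert 7 is under-used, so its bias is raised.
logits = [0.1 * (i % 5) for i in range(32)]
bias = [0.0] * 32
bias[7] = 0.05
weights = route_token(logits, bias)
```

In this toy case expert 7 is selected because of its bias, yet it receives a smaller combine weight than the other three chosen experts, since its unbiased score is lower.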

This is why total parameters (898M) and active parameters (≈340M/token) diverge so much: only 4 of 32 routed experts fire per token.
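The gap between total and active parameters can be checked with a little arithmetic on the figures in this card. The sketch below assumes each routed expert is two bias-free matrices (512 → 1024 and 1024 → 512); the card's "squared-ReLU MLP" and "no biases" suggest this, but it is an assumption, and the variable names are illustrative.

```python
# Figures taken from the card.
n_layers = 19
n_experts = 32
top_k = 4
latent_dim = 512
expert_hidden = 1024
total_params = 898e6

# Assumption: one routed expert = up-projection + down-projection, no biases.
params_per_expert = 2 * latent_dim * expert_hidden

# Routed experts that stay idle for any given token, summed over all layers.
idle_params = n_layers * (n_experts - top_k) * params_per_expert

active_estimate = total_params - idle_params
print(f"idle routed-expert params: {idle_params / 1e6:.1f}M")
print(f"estimated active params:   {active_estimate / 1e6:.1f}M")
```

Under that assumption, about 558M parameters sit in routed experts that a given token never touches, which leaves roughly 340M active, in line with the figure quoted above.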

Attention & backbone

Grouped-query attention — 24 query heads, 8 key/value heads (3:1 grouping), head_dim 64.
Rotary position embeddings (RoPE, NeoX/Llama half-split convention, θ = 10,000).
RMSNorm (pre-norm), tied input/output embeddings, no biases.
19 layers, d_model 1536, context length 1024, vocab 32,768.
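For readers wiring up the attention block by hand, the half-split RoPE convention named above can be sketched as follows. This is an illustrative plain-Python version for one head vector, not the Metis implementation; the function name `rope_half_split` and the example vector are hypothetical. In the half-split layout, dimension i is paired with dimension i + head_dim/2, rather than with its immediate neighbour as in the interleaved layout.

```python
import math

def rope_half_split(x, position, theta=10_000.0):
    """Rotate one head vector with the half-split (NeoX/Llama) RoPE layout."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        # Pair i couples dimension i with dimension i + half.
        freq = theta ** (-2.0 * i / d)
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

head_dim = 64
q = [float(i + 1) for i in range(head_dim)]   # hypothetical query vector
q_rot = rope_half_split(q, position=5)
```

Because each pair is only rotated, the vector's norm is unchanged, and the dot product between a rotated query and a rotated key depends only on the difference of their positions.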

Training data

50B tokens of English text, assembled from open corpora and bucketed by capability target. If a source comes up short of its token budget, the shortfall is filled only from other sources in the same bucket, which keeps the intended capability mix intact.

Bucket	Tokens	Representative sources
High-quality web	12B	DCLM-baseline-HQ, DCLM edu-score-filtered
Educational web	9B	FineWeb-Edu (score ≥3), FineWeb HQ
Diverse curated web	5B	Essential-Web (HQ), TxT360 best-of-web, Zyda-2 (novelty), Ultra-FineWeb-en
Encyclopedic	3B	FineWiki (English), Common-Pile Wikimedia
Scientific papers	5B	peS2o (STEM), Proof-Pile-2 (arXiv science), Common-Pile PubMed OA, FinePDFs (technical OCR), OpenStax science/math
Mathematics	6B	Nemotron-CC-Math (≥4), FineMath (≥4), OpenWebMath (equation-rich), Proof-Pile-2 (proofs), MegaMath (web proofs)
Reference / educational	3B	Common-Corpus educational reference, Common-Pile educational, OpenStax textbooks
Books / long-form	2B	PG-19, Common-Pile Project Gutenberg, Common-Corpus books/Wikisource
Q&A / explanatory	2B	Common-Pile StackExchange, ROOTS-en StackExchange, FineWeb explainer-mined
Synthetic textbooks	2B	Cosmopedia v2, synthetic textbook/explainer
STEM/reference reserve	1B	Under-represented STEM / math / reference top-ups
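The bucket sizes in the table can be turned into mixture shares with a few lines of Python. The token counts are copied from the table; the grouping of three buckets under "STEM" is an assumption made here for illustration, not a definition from the card.

```python
# Token budgets in billions, copied from the table above.
buckets = {
    "High-quality web": 12,
    "Educational web": 9,
    "Diverse curated web": 5,
    "Encyclopedic": 3,
    "Scientific papers": 5,
    "Mathematics": 6,
    "Reference / educational": 3,
    "Books / long-form": 2,
    "Q&A / explanatory": 2,
    "Synthetic textbooks": 2,
    "STEM/reference reserve": 1,
}

total = sum(buckets.values())
shares = {name: tokens / total for name, tokens in buckets.items()}

# Assumption: these three buckets are treated as the "STEM" portion.
stem_buckets = ["Scientific papers", "Mathematics", "STEM/reference reserve"]
stem_share = sum(shares[name] for name in stem_buckets)

for name, share in shares.items():
    print(f"{name:26s} {share:6.1%}")
```

The buckets sum to the stated 50B tokens, and under the grouping above the STEM portion is 24% of the corpus.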

Every source passes the same gate before tokenization:

English-only language ID (confidence ≥ 0.90, paragraph-level English ratio ≥ 0.85, Latin script).
Exact + near-duplicate deduplication, globally across the corpus.
Benchmark decontamination against MMLU, MMLU-Pro, ARC, HellaSwag, OpenBookQA, SciQ, TruthfulQA, GPQA, GSM8K, MATH, AIME, AMC, OlympiadBench, Minerva-Math, SVAMP, ASDiv, IFEval, MT-Bench, Arena-Hard, and RewardBench.
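To make the last two gate steps concrete, here is a small sketch of exact deduplication and n-gram-based decontamination. The card does not describe the actual matching rules, so everything here is an assumption: the normalization, the 8-word window, and the function names are hypothetical, and near-duplicate detection (typically done with techniques such as MinHash) is not covered.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first copy of each document, comparing hashes of normalized text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def ngrams(text, n):
    """Return the set of word n-grams in the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(doc, benchmark_items, n=8):
    """Flag a document that shares any word n-gram with a benchmark item."""
    doc_grams = ngrams(doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)

# Hypothetical data for illustration only.
docs = ["The  cat sat on the mat.", "the cat sat on the mat.", "A different sentence entirely."]
benchmark = ["If a train travels 60 miles in 2 hours what is its average speed"]
leaked = "Homework help: if a train travels 60 miles in 2 hours what is its average speed? Answer below."
clean = "Trains were introduced in the nineteenth century and changed travel forever."
unique_docs = exact_dedup(docs)
```

A larger window makes the filter more conservative (fewer false positives), while a smaller one catches paraphrased fragments at the cost of discarding more clean text.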

All credit for the underlying corpora belongs to their original authors and curators (see Acknowledgements).

Training procedure

From scratch, single TPU v6e-8, pure JAX/XLA with a custom data-parallel training loop (pmap, replicate-once state, donated buffers).
Optimizer: AdaMuon — Newton–Schulz-orthogonalized momentum for 2-D weight matrices, Adam-style second-moment scaling elsewhere.
Mixed precision: weights are kept as fp32 master copies and cast to bf16 for compute each step, so tiny optimizer updates survive while matmuls stay on the fast bf16 path.
Schedule: linear warmup → cosine decay.
Throughput knobs: bf16 attention scores (fp32 softmax reductions), streaming cross-entropy, softmax router with expert capacity factor 1.5, sort-free cumsum dispatch — sustaining ≈245,000 tokens/sec on the pod.
Sequence length 1024; global batch of 512 sequences per optimizer step (64 × 8 gradient-accumulation steps, across 8 devices), ≈524k tokens/step.
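For readers unfamiliar with the Newton–Schulz step mentioned in the optimizer bullet, the idea can be sketched with the classic cubic iteration. This is not AdaMuon itself: the card does not give its polynomial, coefficients, or step count, and published Muon-style implementations typically use a tuned higher-order polynomial with only a few steps. The helper names and the 2×2 example matrix are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=10):
    """Approximate the nearest orthogonal matrix to g with the cubic iteration."""
    # Scale by the Frobenius norm so every singular value starts at or below 1.
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 (X X^T) X pushes each singular value toward 1.
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r, s)] for r, s in zip(x, xxtx)]
    return x

momentum = [[3.0, 1.0], [0.5, 2.0]]   # hypothetical 2-D momentum buffer
ortho = newton_schulz(momentum)
gram = matmul(transpose(ortho), ortho)   # should be close to the identity
```

The result keeps the directions of the momentum matrix but equalizes its singular values, so no single direction dominates the update.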

The released checkpoint is the bf16 export of the fp32 master weights at end of pretraining (step 95,367 ≈ 50B tokens).
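The reason for keeping fp32 master weights can be shown with a toy example. The sketch below emulates bfloat16 and float32 rounding with the standard library; the helper names, the starting weight, and the update size are hypothetical and chosen only to make the effect visible.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Round a finite Python float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                   # round to nearest even
    bits &= 0xFFFF0000                                    # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

update = 1e-3          # hypothetical small optimizer update
steps = 100

# Case 1: weights stored directly in bf16. Near 1.0 the bf16 spacing is 2**-7,
# so each small update is rounded away.
w_bf16_only = 1.0
for _ in range(steps):
    w_bf16_only = to_bf16(w_bf16_only + update)

# Case 2: fp32 master copy accumulates updates; bf16 is only the compute/export cast.
master = 1.0
for _ in range(steps):
    master = to_f32(master + update)
compute_copy = to_bf16(master)
```

In the bf16-only case the weight never leaves 1.0, while the fp32 master reaches about 1.1 and its bf16 cast is 1.1015625, the nearest representable bf16 value.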

Tokenizer

A custom byte-level BPE tokenizer, vocabulary 32,768, trained on ~12M sampled documents (min pair frequency 2). Special tokens: <pad>=0, <bos>=1, <eos>=2, <unk>=3. The full tokenizer is included in this repository (tokenizer.json).

Intended uses & limitations

Intended for

Research on small-scale / efficient MoE language models.
A base for continued pretraining, domain adaptation, and instruction/preference fine-tuning.
Studying the single-latent MoE architecture and JAX/TPU training.

Not intended for

Drop-in chat or instruction following — use a fine-tuned variant (e.g. Metis-1.5-think) for that.
Any safety-critical or high-stakes use, or any use that depends on factual accuracy.

Limitations

Small (≈340M active params): limited world knowledge and reasoning depth versus large models.
Base only: no instruction tuning, no RLHF, no safety alignment — it can produce inaccurate, biased, or otherwise undesirable text.
English-only training; 1024-token context.
Knowledge is bounded by the training corpus and cutoff.

How to use

The weights use JAX-native tensor names (embed, final_norm.scale, layers.N.q, layers.N.router, layers.N.expert_w1, …) and the architecture is custom, so this checkpoint does not load via transformers.AutoModel out of the box. It ships as a self-describing safetensors release; config.json carries every dimension, and the Architecture section above specifies the forward pass exactly.

Load the raw tensors with safetensors (framework-agnostic):

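from safetensors import safe_open

# Read every tensor into a plain dict keyed by its JAX-native name.
# Note: the release is bf16 and NumPy has no native bfloat16 dtype; if
# framework="numpy" fails on your safetensors version, framework="flax"
# (which returns JAX arrays) is one alternative.
weights = {}
with safe_open("model.safetensors", framework="numpy") as f:
    for k in f.keys():
        weights[k] = f.get_tensor(k)   # e.g. weights["layers.0.q"], weights["embed"]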

# config.json holds d_model, n_layer, moe_num_experts, head_dim, rope_theta, ...
# Wire these tensors into the forward pass described in the Architecture section
# (single-latent MoE: latent_down -> softmax top-4 routing -> squared-ReLU experts
#  in 512-dim latent + shared expert -> latent_up; GQA + NeoX RoPE; RMSNorm; tied embeddings).

A reference JAX forward/generation implementation is part of the Metis training stack; a standalone single-file loader and an optional transformers wrapper are planned follow-ups.

Evaluation

0-shot accuracy on the full test split of each benchmark, scored with a custom JAX harness: multiple-choice tasks by length-normalized log-likelihood (acc_norm) or plain log-likelihood (acc), and GSM8K by greedy generation (chat template plus flexible numeric extraction). Training data was decontaminated against the benchmarks listed under Training data, which cover ARC, HellaSwag, OpenBookQA, MMLU, and GSM8K from this table, so those are clean held-out numbers; PIQA, WinoGrande, and BoolQ do not appear on that list.
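The difference between acc and acc_norm can be illustrated with a small scoring function. This is a sketch, not the card's harness: the card does not say which length unit is used, so normalizing by character count is an assumption, and the function name, options, and log-probabilities are hypothetical.

```python
def pick_answer(option_logprobs, options, normalize=True):
    """Return the index of the best-scoring answer option.

    option_logprobs[i] is the summed token log-probability of options[i]
    given the question, as produced by the language model.
    """
    def score(i):
        if normalize:
            # Assumption: normalize by the option's character count.
            return option_logprobs[i] / len(options[i])
        return option_logprobs[i]
    return max(range(len(options)), key=score)

# Hypothetical model outputs for a two-option question.
options = ["Paris", "It is the city of Marseille"]
logprobs = [-6.0, -9.0]

plain_choice = pick_answer(logprobs, options, normalize=False)
normalized_choice = pick_answer(logprobs, options, normalize=True)
```

Plain log-likelihood favours the short option here, while the length-normalized score favours the longer one, which is why the two metrics can disagree on benchmarks whose answer options vary in length.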

Benchmark	Metric	Chance / baseline	Metis-1.5-base	Metis-1.5-think
ARC-Easy	acc_norm	25.0	41.3	41.7
ARC-Challenge	acc_norm	25.0	25.9	28.2
HellaSwag	acc_norm	25.0	30.4	31.0
PIQA	acc_norm	50.0	54.7	54.6
WinoGrande	acc	50.0	51.5	51.8
OpenBookQA	acc_norm	25.0	29.6	28.6
BoolQ	acc	~62¹	47.7	57.2
MMLU	acc	25.0	23.6	23.3
GSM8K	acc	~0	—	7.6

¹ BoolQ majority-class baseline ≈ 62%. GSM8K is not run for the base model (non-instruct).

How to read these. Metis-1.5 is a ~340M-active model (898M total, MoE) trained on only 50B tokens — far fewer than the 0.3–18T behind modern sub-2B models — so it lands around GPT-2-medium tier: clearly above chance on ARC-Easy, modestly so on HellaSwag / PIQA / WinoGrande, and at chance on MMLU (which needs more knowledge capacity than this scale holds). Supervised fine-tuning leaves raw knowledge roughly unchanged (base ≈ think on multiple choice) while adding instruction-following — visible on BoolQ (+9.5) and GSM8K (7.6%), the latter notably strong for the scale (TinyLlama-1.1B / Pythia-1B sit ~2–3%), reflecting the math/reasoning-heavy data mix. The clearest lever for higher scores is more training tokens, not a different architecture.

Held-out pretrain-distribution LM loss ≈ 2.34 nats/token (~10.4 ppl) at end of pretraining.
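The relationship between the loss and the perplexity quoted above is a one-line conversion, shown here together with the equivalent figure in bits per token. The variable names are illustrative.

```python
import math

loss_nats = 2.34                               # held-out loss from the card, in nats/token

perplexity = math.exp(loss_nats)               # perplexity = e^loss when loss is in nats
bits_per_token = loss_nats / math.log(2)       # convert nats to bits

print(f"perplexity:     {perplexity:.2f}")
print(f"bits per token: {bits_per_token:.3f}")
```

This gives a perplexity of about 10.4, matching the figure above, and roughly 3.38 bits per token.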

Compute & environmental footprint

Trained on a single Google Cloud TPU v6e-8 in JAX. At ≈245k tokens/sec, the 50B-token run works out to roughly 2.4 days of single-pod compute (50B ÷ 245k ≈ 204,000 seconds). Training on one small pod (rather than a large cluster) keeps the footprint modest and the experiment reproducible.
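The token and wall-clock figures in this card can be cross-checked against each other. The sketch below uses only numbers stated in the card and reads "64 × 8" as 512 sequences per optimizer step; it assumes the quoted throughput was sustained for the whole run and ignores evaluation, checkpointing, and restarts.

```python
seq_len = 1024
sequences_per_step = 64 * 8                      # batch 64 with 8 gradient-accumulation steps
tokens_per_step = seq_len * sequences_per_step   # tokens consumed per optimizer step

final_step = 95_367
tokens_seen = final_step * tokens_per_step

tokens_per_second = 245_000                      # throughput quoted in the card
seconds = tokens_seen / tokens_per_second
days = seconds / 86_400

print(f"tokens per step: {tokens_per_step:,}")
print(f"tokens seen:     {tokens_seen / 1e9:.2f}B")
print(f"wall-clock:      {days:.2f} days")
```

The step count and batch size reproduce the 50B-token total to within a fraction of a step, and the throughput gives about 2.36 days, consistent with the estimate above.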

License

Released under CC0-1.0 — a public-domain dedication. You may use, modify, redistribute, fine-tune, and build upon Metis-1.5-base for any purpose, commercial or otherwise, with no restrictions and no attribution required. The model is provided as-is, without warranty of any kind.

(Note: the underlying training corpora are governed by their own respective licenses; CC0 applies to these released model weights.)

Citation

@misc{metis15base2026,
  title  = {Metis-1.5-base: A single-latent Mixture-of-Experts language model trained from scratch on one TPU pod},
  author = {Lernex},
  year   = {2026},
  howpublished = {Hugging Face},
  note   = {898M parameters (340M active), pretrained on 50B tokens in JAX on TPU v6e-8}
}

Acknowledgements

Deep thanks to the open-data community whose corpora made this possible — including DCLM, FineWeb / FineWeb-Edu / FineMath, Proof-Pile-2, peS2o, OpenWebMath, Nemotron-CC, MegaMath, Cosmopedia, Project Gutenberg / PG-19, Common-Pile, Common-Corpus, OpenStax, and the maintainers of the StackExchange and Wikimedia data dumps — and to the JAX and Cloud TPU teams for the training stack.
