Metis-1.5-base

Metis-1.5-base is an 898M-parameter (≈340M active per token) single-latent Mixture-of-Experts language model, pretrained from scratch on 50B tokens of curated, decontaminated English text. It was trained end to end in pure JAX on a single TPU v6e-8 with a fully custom training stack: no PyTorch, no Megatron, no Hugging Face Trainer.

This is a base model. It is trained only for next-token prediction and is not instruction-tuned: it continues text, it does not follow instructions or hold a conversation. Prompt it with a stem ("The three laws of motion are") rather than a question. An instruction-tuned Metis-1.5-think variant is trained separately.


At a glance

| Property | Value |
|---|---|
| Total parameters | 898M |
| Active parameters / token | ≈340M (top-4 of 32 experts + shared expert) |
| Architecture | Single-latent MoE decoder |
| Layers / d_model | 19 / 1536 |
| Attention | Grouped-query, 24 query heads / 8 KV heads, head_dim 64, RoPE (NeoX) |
| Experts | 32 routed (top-4) + 1 shared, squared-ReLU, shared 512-dim latent |
| Context length | 1024 tokens |
| Vocabulary | 32,768 (custom byte-level BPE) |
| Training tokens | 50B (English, deduped, benchmark-decontaminated) |
| Optimizer | AdaMuon (Newton–Schulz orthogonalized momentum + Adam) |
| Precision | bf16 weights for release (fp32 master weights during training) |
| Hardware | 1× TPU v6e-8, JAX/XLA |
| License | CC0-1.0 (public domain, zero restrictions) |

Why Metis exists: the philosophy

Metis is an independent, from-scratch exploration of how much capability you can pack into a small, efficient Mixture-of-Experts model trained on a single TPU pod, with a custom JAX stack you fully understand, top to bottom, rather than a black-box framework.

Three convictions shape it:

  1. Sparsity should be cheap, not just powerful. Standard MoE layers duplicate a full FFN per expert, so memory balloons with expert count. Metis routes every expert through one shared low-rank latent space, so adding experts costs far less: you buy specialization without paying full-FFN memory for each one.
  2. Data quality beats data quantity at small scale. The 50B-token corpus is aggressively filtered, deduplicated, and decontaminated, and bucketed by capability (web, encyclopedic, scientific, math, reference, books, Q&A, synthetic textbooks) so the mixture is deliberate rather than incidental. The blend leans education- and STEM-heavy on purpose.
  3. Own the whole stack. Tokenizer, data pipeline, model, optimizer, sharding, and checkpointing are all custom JAX. The point is understanding, not abstraction.

Metis-1.5 is a proof of concept for that architecture and pipeline: small enough to iterate on quickly, complete enough to be genuinely useful as a base for fine-tuning and continued pretraining.


Architecture

Metis-1.5 is a decoder-only transformer whose feed-forward sublayer is replaced by a single-latent Mixture-of-Experts (LatentMoE) block.

Single-latent MoE

Instead of giving each expert its own full-width FFN, every token is first projected down into a shared 512-dim latent space (latent_down: 1536 → 512). Routing and all expert computation happen in that compact latent space, then a single latent_up projection (512 → 1536) maps the combined result back. Concretely, per MoE layer:

  • Router: linear projection of the hidden state → 32 expert logits, softmax scores, top-4 selection. Load is balanced with an auxiliary-loss-free learned bias (DeepSeek-style: the bias steers selection only; combine weights come from the unbiased scores).
  • Routed experts: 32 experts, each a squared-ReLU MLP operating in the 512-dim latent (expert intermediate size 1024). Each token is processed by its top-4 experts, combined by normalized router weights.
  • Shared expert: 1 always-on expert applied to every token, providing a stable backbone alongside the sparse routed capacity.
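The routing rule in the list above can be sketched in plain Python. This is a minimal illustration, not the Metis implementation: it assumes the learned bias is added to the softmax scores before the top-k selection, and it uses a toy 8-expert router in place of the real 32. The function name `route` and the example logits are hypothetical.

```python
import math

def route(logits, bias, top_k=4):
    """Pick top_k experts for one token.

    Selection uses score + bias; combine weights use the unbiased scores,
    renormalized over the selected experts.
    """
    # Softmax over expert logits (shifted by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]

    # Assumption: the load-balancing bias is added to the scores and only
    # decides WHICH experts are chosen.
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]

    # Combine weights come from the unbiased scores of the chosen experts.
    norm = sum(scores[i] for i in chosen)
    weights = [scores[i] / norm for i in chosen]
    return chosen, weights

# Toy example: 8 experts instead of 32.
logits = [2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]
chosen, weights = route(logits, [0.0] * 8)

# Steering an under-used expert (index 7) into the selection via the bias.
steered_chosen, steered_weights = route(logits, [0.0] * 7 + [1.0])
```

In the steered call, expert 7 is selected because of its bias, yet its combine weight stays the smallest of the four because that weight is computed from the unbiased score.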

This is why total parameters (898M) and active parameters (≈340M/token) diverge so much: only 4 of 32 routed experts fire per token.
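A rough per-layer count makes the gap concrete. The sketch below assumes each routed expert is two bias-free matrices (latent to intermediate and back), which the card does not spell out. It covers only the routed experts and the two latent projections, so it is not meant to reproduce the 898M total.

```python
D_MODEL, LATENT, EXPERT_HIDDEN = 1536, 512, 1024
NUM_EXPERTS, TOP_K = 32, 4

def mlp_params(width, hidden):
    # Assumption: two bias-free matrices per expert (width->hidden, hidden->width).
    return 2 * width * hidden

latent_expert = mlp_params(LATENT, EXPERT_HIDDEN)        # expert living in the 512-dim latent
full_width_expert = mlp_params(D_MODEL, EXPERT_HIDDEN)   # same expert at d_model width, for comparison

routed_total = NUM_EXPERTS * latent_expert    # stored per MoE layer
routed_active = TOP_K * latent_expert         # touched per token per MoE layer
latent_projections = 2 * D_MODEL * LATENT     # latent_down + latent_up, shared by all experts

print(f"latent expert:      {latent_expert:,}")
print(f"full-width expert:  {full_width_expert:,}")
print(f"routed, stored:     {routed_total:,}")
print(f"routed, active:     {routed_active:,}")
print(f"latent projections: {latent_projections:,}")
```

Under these assumptions a latent expert holds one third of the parameters of a d_model-wide expert with the same intermediate size, and a token touches one eighth of the stored routed-expert parameters in each layer.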

Attention & backbone

  • Grouped-query attention: 24 query heads, 8 key/value heads (3:1 grouping), head_dim 64.
  • Rotary position embeddings (RoPE, NeoX/Llama half-split convention, θ = 10,000).
  • RMSNorm (pre-norm), tied input/output embeddings, no biases.
  • 19 layers, d_model 1536, context length 1024, vocab 32,768.
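The two attention details above are easy to get subtly wrong when re-implementing the forward pass, so here is a small plain-Python sketch. `rope_half_split` follows the half-split convention, where dimension i is paired with dimension i + d/2. `kv_head_for` assumes query heads are grouped contiguously onto KV heads, which the card does not state. Both function names are hypothetical.

```python
import math

def rope_half_split(x, position, theta=10_000.0):
    """Rotate one head vector x (even length) according to its position.

    Half-split convention: dimension i is paired with dimension i + d/2,
    and pair i rotates by position * theta ** (-2i / d).
    """
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = position * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def kv_head_for(query_head, n_query_heads=24, n_kv_heads=8):
    # Assumption: contiguous grouping, so consecutive query heads share a KV head.
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Attention scores depend only on the relative offset between positions.
q = [0.1 * (i + 1) for i in range(8)]
k = [0.3 - 0.05 * i for i in range(8)]
score_a = dot(rope_half_split(q, 5), rope_half_split(k, 3))
score_b = dot(rope_half_split(q, 2), rope_half_split(k, 0))
```

`score_a` and `score_b` agree up to floating-point error because both pairs of positions are two tokens apart.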

Training data

50B tokens of English text, assembled from open corpora and bucketed by capability target so that under-filled sources can only fall back within the same bucket, which keeps the intended capability mix intact.

| Bucket | Tokens | Representative sources |
|---|---|---|
| High-quality web | 12B | DCLM-baseline-HQ, DCLM edu-score-filtered |
| Educational web | 9B | FineWeb-Edu (score ≥3), FineWeb HQ |
| Diverse curated web | 5B | Essential-Web (HQ), TxT360 best-of-web, Zyda-2 (novelty), Ultra-FineWeb-en |
| Encyclopedic | 3B | FineWiki (English), Common-Pile Wikimedia |
| Scientific papers | 5B | peS2o (STEM), Proof-Pile-2 (arXiv science), Common-Pile PubMed OA, FinePDFs (technical OCR), OpenStax science/math |
| Mathematics | 6B | Nemotron-CC-Math (≥4), FineMath (≥4), OpenWebMath (equation-rich), Proof-Pile-2 (proofs), MegaMath (web proofs) |
| Reference / educational | 3B | Common-Corpus educational reference, Common-Pile educational, OpenStax textbooks |
| Books / long-form | 2B | PG-19, Common-Pile Project Gutenberg, Common-Corpus books/Wikisource |
| Q&A / explanatory | 2B | Common-Pile StackExchange, ROOTS-en StackExchange, FineWeb explainer-mined |
| Synthetic textbooks | 2B | Cosmopedia v2, synthetic textbook/explainer |
| STEM/reference reserve | 1B | Under-represented STEM / math / reference top-ups |
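As a quick consistency check, the bucket sizes in the table can be totalled and turned into mixture shares. The numbers are copied from the table; the grouping into "web" buckets is only an illustration.

```python
# Bucket sizes in billions of tokens, copied from the table above.
BUCKET_TOKENS_B = {
    "High-quality web": 12,
    "Educational web": 9,
    "Diverse curated web": 5,
    "Encyclopedic": 3,
    "Scientific papers": 5,
    "Mathematics": 6,
    "Reference / educational": 3,
    "Books / long-form": 2,
    "Q&A / explanatory": 2,
    "Synthetic textbooks": 2,
    "STEM/reference reserve": 1,
}

total_b = sum(BUCKET_TOKENS_B.values())
shares = {name: tokens / total_b for name, tokens in BUCKET_TOKENS_B.items()}

# Illustrative grouping: the three general-web buckets versus everything else.
web_share = sum(shares[n] for n in ("High-quality web", "Educational web", "Diverse curated web"))

for name, share in shares.items():
    print(f"{name:26s} {share:6.1%}")
print(f"{'total':26s} {total_b}B")
```

The buckets sum to the stated 50B, with the three web buckets making up a little over half of the mixture and the remainder going to encyclopedic, scientific, math, reference, book, Q&A, and synthetic sources.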

Every source passes the same gate before tokenization:

  • English-only language ID (confidence ≥ 0.90, paragraph-level English ratio ≥ 0.85, Latin script).
  • Exact + near-duplicate deduplication, globally across the corpus.
  • Benchmark decontamination against MMLU, MMLU-Pro, ARC, HellaSwag, OpenBookQA, SciQ, TruthfulQA, GPQA, GSM8K, MATH, AIME, AMC, OlympiadBench, Minerva-Math, SVAMP, ASDiv, IFEval, MT-Bench, Arena-Hard, and RewardBench.
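The gate above can be pictured as a pair of predicates. The language thresholds are the ones listed; the decontamination check is an assumption, since the card does not say which method was used. Word n-gram overlap is one common approach, and the window size `n=8` here is a hypothetical choice.

```python
def passes_language_gate(confidence, english_ratio, is_latin_script):
    """Thresholds taken from the list above."""
    return confidence >= 0.90 and english_ratio >= 0.85 and is_latin_script

def word_ngrams(text, n):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document, benchmark_items, n=8):
    """Assumed method: flag a document that shares any word n-gram with a benchmark item."""
    doc_grams = word_ngrams(document, n)
    return any(doc_grams & word_ngrams(item, n) for item in benchmark_items)

benchmark = ["what is the boiling point of water at sea level in celsius"]
leaky_doc = "Quiz: What is the boiling point of water at sea level in Celsius answer 100"
clean_doc = "Water boils at a lower temperature on a mountain because air pressure drops"

keep_leaky = passes_language_gate(0.97, 0.92, True) and not is_contaminated(leaky_doc, benchmark)
keep_clean = passes_language_gate(0.97, 0.92, True) and not is_contaminated(clean_doc, benchmark)
```

A production pipeline would normally normalize punctuation and index the benchmark n-grams once rather than rebuilding them per document.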

All credit for the underlying corpora belongs to their original authors and curators (see Acknowledgements).


Training procedure

  • From scratch, single TPU v6e-8, pure JAX/XLA with a custom data-parallel training loop (pmap, replicate-once state, donated buffers).
  • Optimizer: AdaMuon (Newton–Schulz-orthogonalized momentum for 2-D weight matrices, Adam-style second-moment scaling elsewhere).
  • Mixed precision: weights are kept as fp32 master copies and cast to bf16 for compute each step, so tiny optimizer updates survive while matmuls stay on the fast bf16 path.
  • Schedule: linear warmup → cosine decay.
  • Throughput knobs: bf16 attention scores (fp32 softmax reductions), streaming cross-entropy, softmax router with expert capacity factor 1.5, and sort-free cumsum dispatch, sustaining ≈245,000 tokens/sec on the pod.
  • Sequence length 1024, global batch 64 × 8 grad-accum across 8 devices (≈524k tokens/step).
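The mixed-precision bullet above is worth a concrete illustration. The sketch below emulates bf16 and fp32 rounding with the standard library only; it is not the training code, and the update size and step count are hypothetical. It shows why an update much smaller than the bf16 spacing disappears if the weight itself is stored in bf16, but accumulates in an fp32 master copy.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 is the top 16 bits of the float32 bit pattern.
    Finite inputs only; NaN/inf handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round half to even on bit 16
    bits &= 0xFFFF0000                    # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

UPDATE = 1e-4   # hypothetical per-step update, far below the bf16 spacing near 1.0 (2**-7)
STEPS = 100     # hypothetical number of steps

bf16_only = 1.0   # weight stored in bf16 and updated in place
master = 1.0      # fp32 master copy
for _ in range(STEPS):
    bf16_only = to_bf16(bf16_only + UPDATE)   # rounds straight back to 1.0 every step
    master = to_f32(master + UPDATE)          # accumulates the small updates

compute_weight = to_bf16(master)   # the bf16 value a matmul would see after the cast
print(bf16_only, master, compute_weight)
```

After 100 steps the bf16-only weight is still exactly 1.0, while the master copy has moved to about 1.01 and its bf16 cast has stepped to the next representable value, 1.0078125.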

The released checkpoint is the bf16 export of the fp32 master weights at end of pretraining (step 95,367 ≈ 50B tokens).
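The step count follows from the batch geometry. The arithmetic below reads "global batch 64 × 8 grad-accum" as 64 sequences per micro-step, accumulated over 8 micro-steps; the per-device split is an assumption based on the 8 devices mentioned.

```python
SEQ_LEN = 1024
GLOBAL_BATCH = 64            # sequences per micro-step, across all devices
GRAD_ACCUM = 8               # micro-steps accumulated per optimizer step
NUM_DEVICES = 8
TARGET_TOKENS = 50_000_000_000

tokens_per_step = SEQ_LEN * GLOBAL_BATCH * GRAD_ACCUM   # tokens per optimizer step
per_device_batch = GLOBAL_BATCH // NUM_DEVICES          # assumption: even split across devices
steps = TARGET_TOKENS // tokens_per_step                # whole optimizer steps within 50B tokens
tokens_seen = steps * tokens_per_step

print(f"tokens/step: {tokens_per_step:,}")
print(f"steps:       {steps:,}")
print(f"tokens seen: {tokens_seen:,}")
```

This gives 524,288 tokens per step and 95,367 whole steps, matching the figures quoted above.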

Tokenizer

A custom byte-level BPE tokenizer, vocabulary 32,768, trained on ~12M sampled documents (min pair frequency 2). Special tokens: <pad>=0, <bos>=1, <eos>=2, <unk>=3. The full tokenizer is included in this repository (tokenizer.json).
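For readers unfamiliar with byte-level BPE, one training iteration looks roughly like the sketch below: count adjacent byte pairs, take the most frequent one if it meets the minimum pair frequency, and replace it with a new token id. This is a generic illustration, not the Metis tokenizer trainer; the function names, the tie-breaking rule, and the new id 256 are hypothetical.

```python
from collections import Counter

def most_frequent_pair(sequences, min_frequency=2):
    """Return the most frequent adjacent pair, or None if it is rarer than min_frequency."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    # Deterministic tie-break (an arbitrary choice for this sketch): smallest pair wins.
    pair, freq = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return pair if freq >= min_frequency else None

def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

corpus = [list(b"low lower lowest")]       # start from raw bytes (ids 0..255)
pair = most_frequent_pair(corpus)          # the bytes for "l", "o"
merged = [merge_pair(seq, pair, 256) for seq in corpus]
```

Repeating this loop until the vocabulary reaches the target size yields the merge table; a real trainer also reserves ids for special tokens such as <pad>, <bos>, <eos>, and <unk>.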


Intended uses & limitations

Intended for

  • Research on small-scale / efficient MoE language models.
  • A base for continued pretraining, domain adaptation, and instruction/preference fine-tuning.
  • Studying the single-latent MoE architecture and JAX/TPU training.

Not intended for

  • Drop-in chat or instruction following; use a fine-tuned variant (e.g. Metis-1.5-think) for that.
  • Any safety-critical, factual, or high-stakes use.

Limitations

  • Small (≈340M active params): limited world knowledge and reasoning depth versus large models.
  • Base only: no instruction tuning, no RLHF, no safety alignment, so it can produce inaccurate, biased, or otherwise undesirable text.
  • English-only training; 1024-token context.
  • Knowledge is bounded by the training corpus and cutoff.

How to use

The weights use JAX-native tensor names (embed, final_norm.scale, layers.N.q, layers.N.router, layers.N.expert_w1, …) and the architecture is custom, so this checkpoint does not load via transformers.AutoModel out of the box. It ships as a self-describing safetensors release; config.json carries every dimension, and the Architecture section above specifies the forward pass exactly.

Load the raw tensors with safetensors (framework-agnostic):

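from safetensors import safe_open

weights = {}
# NumPy has no native bfloat16 dtype; if loading the BF16 tensors fails,
# framework="flax" (JAX arrays) is an alternative to try.
with safe_open("model.safetensors", framework="numpy") as f:
    for k in f.keys():
        weights[k] = f.get_tensor(k)   # e.g. weights["layers.0.q"], weights["embed"]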

# config.json holds d_model, n_layer, moe_num_experts, head_dim, rope_theta, ...
# Wire these tensors into the forward pass described in the Architecture section
# (single-latent MoE: latent_down -> softmax top-4 routing -> squared-ReLU experts
#  in 512-dim latent + shared expert -> latent_up; GQA + NeoX RoPE; RMSNorm; tied embeddings).

A reference JAX forward/generation implementation is part of the Metis training stack; a standalone single-file loader and an optional transformers wrapper are planned follow-ups.
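Once the tensors are loaded, a cheap sanity check is to total their element counts and compare the result with the 898M figure. The helper below is a sketch that works on any mapping whose values expose a `.shape` tuple, such as the `weights` dict built above. The `Stub` class and the two example shapes are stand-ins for illustration, with the embedding orientation assumed.

```python
import math

def summarize(weights):
    """Return (total element count, counts grouped by the first name component)."""
    total = 0
    by_group = {}
    for name, tensor in weights.items():
        n = math.prod(tensor.shape)
        total += n
        group = name.split(".")[0]   # "layers.0.q" -> "layers", "embed" -> "embed"
        by_group[group] = by_group.get(group, 0) + n
    return total, by_group

class Stub:
    """Stand-in for a loaded tensor; only the shape matters here."""
    def __init__(self, *shape):
        self.shape = shape

# Hypothetical shapes built from the card's vocab size and d_model.
example = {
    "embed": Stub(32768, 1536),
    "final_norm.scale": Stub(1536),
}
total, by_group = summarize(example)
print(f"{total:,} parameters", by_group)
```

Running `summarize(weights)` on the real checkpoint should land close to 898M if every tensor was read.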

Evaluation

0-shot accuracy on the full test split of each benchmark, scored with a custom JAX harness: multiple-choice by length-normalized loglikelihood (acc_norm) or plain loglikelihood (acc); GSM8K by greedy generation (chat template + flexible numeric extraction). Training data was decontaminated against these benchmarks, so these are clean held-out numbers.
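The difference between the two multiple-choice metrics can be sketched as follows. The example assumes length normalization divides each choice's summed loglikelihood by its character count, which is one common convention; the card does not say whether the harness normalizes by characters, bytes, or tokens. The loglikelihood values are made up for illustration.

```python
def pick_answer(choices, loglikelihoods, length_normalized):
    """Return the index of the highest-scoring choice.

    acc      -> length_normalized=False (raw summed loglikelihood)
    acc_norm -> length_normalized=True  (assumed: divide by character count)
    """
    scores = []
    for choice, ll in zip(choices, loglikelihoods):
        scores.append(ll / max(len(choice), 1) if length_normalized else ll)
    return max(range(len(scores)), key=lambda i: scores[i])

choices = ["no", "a much longer correct answer"]
loglikelihoods = [-4.0, -12.0]   # hypothetical summed log-probabilities

acc_pick = pick_answer(choices, loglikelihoods, length_normalized=False)
acc_norm_pick = pick_answer(choices, loglikelihoods, length_normalized=True)
```

Raw loglikelihood favors the short choice simply because it has fewer tokens to pay for, while the normalized score picks the longer one.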

| Benchmark | Metric | Random | Metis-1.5-base | Metis-1.5-think |
|---|---|---|---|---|
| ARC-Easy | acc_norm | 25.0 | 41.3 | 41.7 |
| ARC-Challenge | acc_norm | 25.0 | 25.9 | 28.2 |
| HellaSwag | acc_norm | 25.0 | 30.4 | 31.0 |
| PIQA | acc_norm | 50.0 | 54.7 | 54.6 |
| WinoGrande | acc | 50.0 | 51.5 | 51.8 |
| OpenBookQA | acc_norm | 25.0 | 29.6 | 28.6 |
| BoolQ | acc | ~62¹ | 47.7 | 57.2 |
| MMLU | acc | 25.0 | 23.6 | 23.3 |
| GSM8K | acc | ~0 | - | 7.6 |

¹ BoolQ majority-class baseline ≈ 62%. GSM8K is not run for the base model (non-instruct).

How to read these. Metis-1.5 is a ~340M-active model (898M total, MoE) trained on only 50B tokens, far fewer than the 0.3–18T behind modern sub-2B models, so it lands around GPT-2-medium tier: clearly above chance on ARC-Easy, modestly so on HellaSwag / PIQA / WinoGrande, and at chance on MMLU (which needs more knowledge capacity than this scale holds). Supervised fine-tuning leaves raw knowledge roughly unchanged (base ≈ think on multiple choice) while adding instruction-following, visible on BoolQ (+9.5) and GSM8K (7.6%). The GSM8K result is notably strong for the scale (TinyLlama-1.1B / Pythia-1B sit ~2–3%), reflecting the math/reasoning-heavy data mix. The clearest lever for higher scores is more training tokens, not a different architecture.

Held-out pretrain-distribution LM loss ≈ 2.34 nats/token (~10.4 ppl) at end of pretraining.
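The perplexity figure is just the exponential of the loss, and the same number can be expressed in bits per token:

```python
import math

loss_nats = 2.34                            # held-out loss from the line above
perplexity = math.exp(loss_nats)            # perplexity = e ** loss
bits_per_token = loss_nats / math.log(2)    # nats -> bits

print(f"perplexity     ≈ {perplexity:.2f}")
print(f"bits per token ≈ {bits_per_token:.2f}")
```

Because the tokenizer is custom, these per-token values are not directly comparable with models that use a different vocabulary.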

Compute & environmental footprint

Trained on a single Google Cloud TPU v6e-8 in JAX. At ≈245k tokens/sec, the 50B-token run corresponds to roughly 2.4 days of single-pod compute. Training on one small pod (rather than a large cluster) keeps the footprint modest and the experiment reproducible.
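The wall-clock estimate can be reproduced from the two quoted figures. It assumes the sustained throughput holds for the whole run and ignores restarts, evaluation, and checkpointing time; the chip-day figure assumes the 8 chips of a v6e-8.

```python
TOKENS = 50_000_000_000
TOKENS_PER_SECOND = 245_000
CHIPS = 8   # assumption: one v6e-8 slice has 8 chips

seconds = TOKENS / TOKENS_PER_SECOND
days = seconds / 86_400
chip_days = days * CHIPS

print(f"{seconds / 3600:.1f} hours ≈ {days:.2f} days ≈ {chip_days:.1f} chip-days")
```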


License

Released under CC0-1.0, a public-domain dedication. You may use, modify, redistribute, fine-tune, and build upon Metis-1.5-base for any purpose, commercial or otherwise, with no restrictions and no attribution required. The model is provided as-is, without warranty of any kind.

(Note: the underlying training corpora are governed by their own respective licenses; CC0 applies to these released model weights.)

Citation

@misc{metis15base2026,
  title  = {Metis-1.5-base: A single-latent Mixture-of-Experts language model trained from scratch on one TPU pod},
  author = {Lernex},
  year   = {2026},
  howpublished = {Hugging Face},
  note   = {898M parameters (340M active), pretrained on 50B tokens in JAX on TPU v6e-8}
}

Acknowledgements

Deep thanks to the open-data community whose corpora made this possible (including DCLM, FineWeb / FineWeb-Edu / FineMath, Proof-Pile-2, peS2o, OpenWebMath, Nemotron-CC, MegaMath, Cosmopedia, Project Gutenberg / PG-19, Common-Pile, Common-Corpus, OpenStax, and the maintainers of the StackExchange and Wikimedia data dumps), and to the JAX and Cloud TPU teams for the training stack.
