Brújula-15M

Brújula-15M is the tiny champion of the Brújula family: a 15.5M-parameter decoder-only language model trained from scratch on FineWeb-Edu, entirely on a single consumer GPU (one Intel Arc B580, ~5h16m). Brújula ("compass" in Spanish) uses a minimal DeepSeek-style architecture: Multi-head Latent Attention (MLA), RoPE, and a SquaredReLU FFN, with tied embeddings and a hybrid Muon + AdamW optimizer.

This is a base completion model (not instruction-tuned). At 15M parameters it is a research/education artifact: it produces surprisingly fluent short continuations for its size, but it is not a knowledge source.

Results

Perplexity (lower is better), fixed local harness at context length 1024:

| Model | FineWeb-Edu val PPL | WikiText-103 PPL |
|---|---|---|
| Brújula-15M (this model) | 78.05 | 190.74 |

It won't compete with much larger models on absolute perplexity — the point is that this is a complete, from-scratch LM that fits and trains on one consumer GPU. See the family below.
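
For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below is an illustration only: it assumes non-overlapping 1024-token windows, and `nll_fn` is a hypothetical callback standing in for the model. The actual local harness is not described here and may differ (stride, BOS handling, tokenization).

```python
import math
from typing import Callable, List, Sequence

CONTEXT_LEN = 1024  # context length quoted for the harness


def split_windows(tokens: Sequence[int], size: int = CONTEXT_LEN) -> List[Sequence[int]]:
    """Cut a token stream into non-overlapping windows (assumed scheme)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def perplexity(tokens: Sequence[int],
               nll_fn: Callable[[Sequence[int]], List[float]],
               size: int = CONTEXT_LEN) -> float:
    """exp(mean NLL) over every scored token in every window.

    nll_fn is hypothetical: it takes one window of token ids and returns
    the negative log-likelihood (in nats) of each predicted token.
    """
    total_nll, n_scored = 0.0, 0
    for window in split_windows(tokens, size):
        nlls = nll_fn(window)
        total_nll += sum(nlls)
        n_scored += len(nlls)
    if n_scored == 0:
        raise ValueError("no tokens were scored")
    return math.exp(total_nll / n_scored)


def uniform_nll(window: Sequence[int], vocab: int = 50257) -> List[float]:
    """Toy stand-in model: uniform over the vocabulary, so NLL = ln(vocab)."""
    # A window of n tokens yields n - 1 next-token predictions.
    return [math.log(vocab)] * max(len(window) - 1, 0)


if __name__ == "__main__":
    # A uniform model has perplexity equal to the vocabulary size.
    print(round(perplexity(list(range(3000)), uniform_nll)))
```

The uniform toy model gives a useful sanity bound: a model that has learned nothing scores a perplexity equal to the vocabulary size (50257), so 78.05 is far below chance.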

The Brújula family

| Model | Params | FineWeb-Edu val PPL | WikiText-103 PPL | Notes |
|---|---|---|---|---|
| Brújula-15M | 15.5M | 78.05 | 190.74 | this model; tiny champion, fully local |
| Brújula-18M | 18M | 46.26 | 108.72 | Brújula-15M depth-grown via G_stack (4→8 layers) |
| Brújula-150M | 153.6M | 21.44 | 36.08 | the flagship (beats GPT-2 small on FineWeb val) |
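
As a rough illustration of depth growth by stacking (4→8 layers), the sketch below duplicates a block stack by remapping parameter names. The `blocks.{i}.` key layout is hypothetical, and the actual G_stack recipe used for Brújula-18M (layer ordering, optimizer state, continued training) is not documented in this card.

```python
import re
from typing import Dict, TypeVar

T = TypeVar("T")

# Hypothetical naming scheme: per-layer parameters live under "blocks.<index>."
_BLOCK_KEY = re.compile(r"^blocks\.(\d+)\.(.+)$")


def stack_depth(state: Dict[str, T], n_layer: int, growth: int = 2) -> Dict[str, T]:
    """Repeat the whole block stack `growth` times: 0..3 becomes 0..3, 4..7."""
    grown: Dict[str, T] = {}
    for name, value in state.items():
        match = _BLOCK_KEY.match(name)
        if match is None:
            # Embeddings, final norm, etc. are copied once, unchanged.
            grown[name] = value
            continue
        idx, rest = int(match.group(1)), match.group(2)
        for g in range(growth):
            # With real tensors, clone `value` here so copies train independently.
            grown[f"blocks.{idx + g * n_layer}.{rest}"] = value
    return grown


if __name__ == "__main__":
    small = {"embed.weight": "E", "norm_f.weight": "N"}
    small.update({f"blocks.{i}.ffn.weight": f"W{i}" for i in range(4)})
    big = stack_depth(small, n_layer=4)
    print(sorted(k for k in big if k.startswith("blocks.")))
```

Stacking only initializes the deeper model; the grown model still needs further training to reach the perplexities listed above.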

Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Sakatepon/Brujula-15M"
tok = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True runs the custom modeling code shipped in the repo.
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()

# Prompt with a continuation cue rather than a bare noun.
ids = tok("The mitochondria is the", return_tensors="pt").input_ids
# Sample instead of decoding greedily; the repetition penalty discourages loops.
out = model.generate(ids, max_new_tokens=64, do_sample=True, temperature=0.8, top_p=0.95, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))
```

The model is tiny, so use sampling plus a continuation cue ("X is the …" rather than a bare noun); greedy decoding tends to fall into repetition loops.
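
To make the sampling settings concrete, here is a minimal pure-Python sketch of what `temperature` and `top_p` do to a single step's logits. It is an illustration, not the `transformers` implementation, and it leaves out the repetition penalty; the function names are made up for this example.

```python
import math
import random
from typing import List, Optional


def sample_top_p(logits: List[float], temperature: float = 0.8,
                 top_p: float = 0.95,
                 rng: Optional[random.Random] = None) -> int:
    """Pick one token id using temperature scaling and nucleus (top-p) sampling."""
    rng = rng or random.Random()
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample inside the nucleus, weighted by the original probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def greedy(logits: List[float]) -> int:
    """Greedy decoding always takes the single most likely token."""
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    step_logits = [2.0, 1.5, 1.0, -4.0]
    rng = random.Random(0)
    print("greedy :", greedy(step_logits))
    print("sampled:", [sample_top_p(step_logits, rng=rng) for _ in range(10)])
```

Greedy decoding returns the same token every time for the same context, which is how a small model gets stuck in a loop; sampling within the nucleus breaks that determinism while still excluding very unlikely tokens.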

Architecture

| Property | Value |
|---|---|
| Type | decoder-only, causal LM |
| Hidden / Layers / Heads | n_embd=256 / n_layer=4 / n_head=4 |
| Context length | 1024 |
| Attention | Multi-head Latent Attention (MLA), kv-compress 32 / q-compress 64 |
| Position / FFN / Norm | RoPE / SquaredReLU / RMSNorm (pre-norm), tied embeddings |
| Vocab | 50257 (GPT-2 BPE) |
| Unique params | 15.5M |
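
As a sanity check on the parameter count, the sketch below rebuilds an estimate from the figures above. Several details are assumptions, not taken from the released code: a 4× FFN expansion with two weight matrices, no bias terms, no separate RoPE key projection in the MLA block, and two RMSNorm weights per layer plus one final norm.

```python
def estimate_params(vocab: int = 50257, n_embd: int = 256, n_layer: int = 4,
                    kv_compress: int = 32, q_compress: int = 64,
                    ffn_mult: int = 4) -> dict:
    """Back-of-envelope parameter count under the assumptions stated above."""
    # Tied embeddings: the input and output matrices share one set of weights.
    embed = vocab * n_embd
    attn = (
        n_embd * q_compress          # query down-projection
        + q_compress * n_embd        # query up-projection
        + n_embd * kv_compress       # shared key/value down-projection
        + kv_compress * 2 * n_embd   # key and value up-projections
        + n_embd * n_embd            # attention output projection
    )
    ffn = 2 * n_embd * (ffn_mult * n_embd)  # up + down matrices (assumed 4x)
    norms = 2 * n_embd                      # two RMSNorm weights per layer
    per_layer = attn + ffn + norms
    total = embed + n_layer * per_layer + n_embd  # + final RMSNorm
    return {"embed": embed, "attn": attn, "ffn": ffn,
            "per_layer": per_layer, "total": total}


if __name__ == "__main__":
    for name, count in estimate_params().items():
        print(f"{name:>9}: {count:,}")
```

Under these assumptions the estimate comes to 15,456,768, close to the reported 15.5M, with roughly 83% of the weights in the tied embedding matrix.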

Training

FineWeb-Edu (1.4B tokens), 1 epoch, hybrid Muon + AdamW, bf16, **5h16m on a single Intel Arc B580**. Fully local — no cloud.
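
The figures above imply a rough training throughput and token budget. The arithmetic below treats 1.4B tokens as exact and 5h16m as total wall-clock time, both of which are approximations.

```python
TOKENS = 1.4e9                  # FineWeb-Edu tokens seen in one epoch
SECONDS = 5 * 3600 + 16 * 60    # 5h16m of wall-clock time
PARAMS = 15.5e6                 # unique parameters

throughput = TOKENS / SECONDS         # average tokens processed per second
tokens_per_param = TOKENS / PARAMS    # training tokens per parameter

if __name__ == "__main__":
    print(f"wall-clock seconds : {SECONDS:,}")
    print(f"throughput         : {throughput:,.0f} tokens/s")
    print(f"tokens per param   : {tokens_per_param:.1f}")
```

That works out to roughly 74K tokens per second on the single GPU and about 90 training tokens per parameter.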

Limitations

  • Base completion model — not instruction-tuned, no safety tuning.
  • English only, educational-web distribution (FineWeb-Edu).
  • At 15M it produces plausible short prose but unreliable facts; best on cued, definitional prompts.
  • Short context (1024); no KV-cache in this reference implementation.

License & attribution

  • Model + code: Apache-2.0. Training data: FineWeb-Edu (ODC-BY).
  • Built on ideas from: DeepSeek-V2 (MLA), Muon optimizer, Primer (SquaredReLU), GPT-2 (BPE tokenizer).