Brújula-15M

The tiny champion of the Brújula family: a 15.5M-parameter decoder-only language model trained from scratch on FineWeb-Edu, entirely on a single consumer GPU (one Intel Arc B580, ~5h16m). Brújula ("compass" in Spanish) uses a minimal DeepSeek-style architecture: Multi-head Latent Attention (MLA), RoPE, and a SquaredReLU FFN, with tied embeddings and a hybrid Muon + AdamW optimizer.

This is a base completion model (not instruction-tuned). At 15M parameters it is a research and education artifact: it produces surprisingly fluent short continuations for its size, but it is not a knowledge source.

Results

Perplexity (lower is better), fixed local harness at context length 1024:

| Model | FineWeb-Edu val PPL | WikiText-103 PPL |
|---|---|---|
| Brújula-15M (this model) | 78.05 | 190.74 |
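
For readers unfamiliar with the metric, here is a minimal sketch of how a corpus-level perplexity is usually computed from per-token log-probabilities. The card does not describe the local harness, so the windowing scheme and the function name `corpus_perplexity` are illustrative assumptions, not the author's evaluation code.

```python
import math

def corpus_perplexity(window_log_probs):
    """Perplexity over a corpus that was scored in separate context windows.

    window_log_probs: iterable of lists, one list per window, holding the
    natural-log probability the model assigned to each target token.
    """
    total_nll = 0.0
    total_tokens = 0
    for log_probs in window_log_probs:
        total_nll -= sum(log_probs)      # accumulate negative log-likelihood
        total_tokens += len(log_probs)
    if total_tokens == 0:
        raise ValueError("no tokens to score")
    # Pool the NLL over all tokens first, then exponentiate once.
    return math.exp(total_nll / total_tokens)

# Toy check: a model that gives every target token probability 1/4
# should score a perplexity very close to 4.
uniform = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
print(corpus_perplexity(uniform))
```

Pooling the negative log-likelihood across windows before exponentiating keeps short windows from being over-weighted, which is what would happen if per-window perplexities were averaged instead.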

It won't compete with much larger models on absolute perplexity; the point is that this is a complete, from-scratch LM that fits and trains on one consumer GPU. See the family below.

The Brújula family

| Model | Params | FineWeb-Edu val PPL | WikiText-103 PPL | Notes |
|---|---|---|---|---|
| Brújula-15M | 15.5M | 78.05 | 190.74 | this model: tiny champion, fully local |
| Brújula-18M | 18M | 46.26 | 108.72 | Brújula-15M depth-grown via G_stack (4→8 layers) |
| Brújula-150M | 153.6M | 21.44 | 36.08 | the flagship (beats GPT-2 small on FineWeb val) |

Usage

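import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Sakatepon/Brujula-15M"
tok = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True lets transformers run the custom model code shipped in the repo
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()

# Cue a continuation ("X is the ...") rather than prompting with a bare noun
ids = tok("The mitochondria is the", return_tensors="pt").input_ids
# Sample instead of decoding greedily; the repetition penalty discourages loops
out = model.generate(
    ids,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.2,
)
print(tok.decode(out[0], skip_special_tokens=True))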

The model is tiny, so use sampling plus a continuation cue ("X is the … " rather than a bare noun); greedy decoding tends to fall into repetition loops.
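
To make the `temperature` and `top_p` settings above more concrete, here is a simplified sketch of temperature scaling followed by nucleus (top-p) filtering on a plain list of logits. The helper names `nucleus` and `sample` are hypothetical, and this is not the `transformers` implementation, which differs in details such as tie handling and batching.

```python
import math
import random

def nucleus(logits, temperature=0.8, top_p=0.95):
    """Return {token_id: prob} for the smallest top set whose mass reaches top_p."""
    scaled = [x / temperature for x in logits]   # temperature < 1 sharpens the distribution
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]     # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:                         # stop once the nucleus is covered
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}   # renormalise over the kept tokens

def sample(logits, temperature=0.8, top_p=0.95, rng=random):
    """Draw one token id from the filtered distribution."""
    dist = nucleus(logits, temperature, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

print(nucleus([2.0, 1.0, 0.5, -1.0]))
```

Greedy decoding corresponds to always taking the single most likely token; sampling from the nucleus keeps some variety while still cutting off the unlikely tail.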

Architecture


| Property | Value |
|---|---|
| Type | decoder-only, causal LM |
| Hidden / Layers / Heads | `n_embd=256` / `n_layer=4` / `n_head=4` |
| Context length | 1024 |
| Attention | Multi-head Latent Attention (MLA), kv-compress 32 / q-compress 64 |
| Position / FFN / Norm | RoPE / SquaredReLU / RMSNorm (pre-norm), tied embeddings |
| Vocab | 50257 (GPT-2 BPE) |
| Unique params | 15.5M |
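
As a sanity check on the 15.5M figure, the parameter count can be roughly estimated from the values in the table. The sketch below rests on several assumptions that the card does not state: a 4× FFN expansion, no bias terms, a simplified MLA with one down- and one up-projection for queries and a shared down-projection for keys and values (no separate decoupled-RoPE projection), and two RMSNorm weights per layer plus a final norm. The real implementation may differ; `estimate_params` is a hypothetical helper.

```python
def estimate_params(n_embd=256, n_layer=4, vocab=50257,
                    kv_compress=32, q_compress=64, ffn_mult=4):
    """Rough parameter estimate under the assumptions stated in the lead-in."""
    # Tied embeddings: one matrix serves as input embedding and output head.
    embed = vocab * n_embd
    attn = (
        n_embd * q_compress + q_compress * n_embd   # Q down- and up-projection
        + n_embd * kv_compress                      # shared KV down-projection
        + 2 * kv_compress * n_embd                  # K and V up-projections
        + n_embd * n_embd                           # attention output projection
    )
    ffn = 2 * n_embd * (ffn_mult * n_embd)          # up and down matrices (assumed 4x)
    norms = n_layer * 2 * n_embd + n_embd           # two per layer plus a final norm
    return embed + n_layer * (attn + ffn) + norms

print(f"{estimate_params() / 1e6:.2f}M")            # 4 layers
print(f"{estimate_params(n_layer=8) / 1e6:.2f}M")   # 8 layers
```

Under these assumptions the estimate comes to about 15.46M for 4 layers and about 18.05M for 8 layers, in line with the 15.5M and 18M figures quoted for Brújula-15M and Brújula-18M. Roughly 12.9M of that is the 50257 × 256 embedding matrix, so the transformer blocks themselves account for only about 2.6M parameters.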

Training

Trained on FineWeb-Edu (1.4B tokens) for 1 epoch with hybrid Muon + AdamW in bf16, taking **5h16m on a single Intel Arc B580**. Fully local, no cloud.
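
The figures above imply an average training throughput and a tokens-per-parameter ratio, worked out in the short calculation below. Both inputs are rounded in the card (1.4B tokens, 5h16m), so the results are approximate, and the variable names are illustrative only.

```python
tokens = 1.4e9                    # training tokens, as stated (rounded)
seconds = 5 * 3600 + 16 * 60      # 5h16m wall-clock time = 18,960 s
params = 15.5e6                   # unique parameters

tokens_per_second = tokens / seconds
tokens_per_param = tokens / params

print(f"{tokens_per_second:,.0f} tokens/s")
print(f"{tokens_per_param:.1f} tokens per parameter")
```

This works out to roughly 74k tokens per second on average, and about 90 training tokens per parameter.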

Limitations

Base completion model: not instruction-tuned, no safety tuning.
English only, educational-web distribution (FineWeb-Edu).
At 15M parameters it produces plausible short prose but unreliable facts; it works best on cued, definitional prompts.
Short context (1024 tokens); no KV-cache in this reference implementation.

License & attribution

Model + code: Apache-2.0. Training data: FineWeb-Edu (ODC-BY).
Built on ideas from: DeepSeek-V2 (MLA), Muon optimizer, Primer (SquaredReLU), GPT-2 (BPE tokenizer).

Safetensors: 15.5M params, tensor type F32.
