Brújula-150M

A 153.6M-parameter decoder-only language model, trained from scratch on FineWeb-Edu as part of a solo, consumer-GPU hobby project. Brújula ("compass" in Spanish) is a deliberately minimal DeepSeek-style architecture: Multi-head Latent Attention (MLA) + RoPE + SquaredReLU FFN, with tied embeddings, trained with a hybrid Muon + AdamW optimizer.

This is a base completion model — it continues text; it is not instruction-tuned and not safety-tuned. It writes fluent, on-topic educational-web prose; it is not a reliable source of facts.

Results

Perplexity (lower is better), evaluated on a fixed local harness at context length 1024:

Model	FineWeb-Edu val PPL	WikiText-103 PPL
Brújula-150M (this model)	21.44	36.08
GPT-2 small (124M, same harness)	26.61	29.31

Honest reading: Brújula-150M beats GPT-2 small on FineWeb-Edu val (its home turf) but loses on WikiText-103. The gap is a data-distribution effect (FineWeb-Edu is narrower than GPT-2's WebText), not a capacity problem. It is the main known limitation of the model.
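For readers who want to relate these numbers to a training loss, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below is only an illustration of that definition, not the evaluation harness used for the table; the function names and toy inputs are assumptions made for the example.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one scored token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def loss_from_perplexity(ppl):
    """Invert the definition: mean cross-entropy in nats per token."""
    return math.log(ppl)


# A model that guesses uniformly over the 50257-token vocabulary
# has a perplexity equal to the vocabulary size.
uniform_logprobs = [math.log(1 / 50257)] * 8
print(perplexity(uniform_logprobs))

# The reported FineWeb-Edu perplexity expressed as a cross-entropy loss.
print(loss_from_perplexity(21.44))
```

Read this way, the reported 21.44 corresponds to a mean cross-entropy of roughly 3.07 nats per token.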

The Brújula family

Model	Params	FineWeb-Edu val PPL	WikiText-103 PPL	Notes
Brújula-15M	15.5M	78.05	190.74	tiny champion, trained locally on one Arc B580
Brújula-18M	18M	46.26	108.72	Brújula-15M depth-grown via G_stack (4→8 layers)
Brújula-150M	153.6M	21.44	36.08	this model (the flagship)
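The 4→8 layer growth in the table can be pictured with a small sketch. It assumes that G_stack here means depthwise stacking as described in the model-growth literature, where the trained layer stack is repeated to initialise a deeper model; the source does not spell out the exact procedure, and the dict-based "layers" below are hypothetical stand-ins for real transformer blocks.

```python
import copy


def g_stack(layers, growth_factor=2):
    """Repeat a trained layer stack `growth_factor` times, copying each layer.

    Deep copies keep the stacked layers independent, so later training can
    update the lower and upper copies separately.
    """
    if growth_factor < 1:
        raise ValueError("growth_factor must be at least 1")
    return [copy.deepcopy(layer) for _ in range(growth_factor) for layer in layers]


# Hypothetical 4-layer model; each dict stands in for one block's weights.
small = [{"name": f"block{i}", "weights": [float(i)]} for i in range(4)]
grown = g_stack(small, growth_factor=2)

print(len(small), "->", len(grown))
print([layer["name"] for layer in grown])
```

Only the per-layer weights are duplicated in this picture; the embeddings are shared, which would be consistent with the modest parameter increase from 15.5M to 18M.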

Usage

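The model ships as custom architecture code, so pass trust_remote_code=True when loading it:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Sakatepon/Brujula-150M"
tok = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True lets transformers run the model class stored in the repo
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()

torch.manual_seed(0)  # optional: makes the sampled continuation reproducible

ids = tok("The mitochondria is the", return_tensors="pt").input_ids
# Sample rather than decode greedily; GPT-2 BPE has no pad token, so reuse EOS
out = model.generate(ids, max_new_tokens=64, do_sample=True, temperature=0.8, top_p=0.95, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))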

Tip: because this is a small base model, use sampling and cued/definitional prompts ("X is the …") rather than bare nouns. Greedy decoding tends to fall into repetition loops.
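To make the two sampling knobs in the usage snippet concrete, here is a dependency-free sketch of temperature scaling followed by nucleus (top-p) filtering. It illustrates the general idea only and is not the transformers implementation; the helper names and toy logits are assumptions for the example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalised over the nucleus


def sample_token(logits, temperature=0.8, top_p=0.95, rng=random):
    """Draw one token id from the temperature-scaled, top-p-filtered distribution."""
    dist = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.5, 0.5, -1.0]
print(top_p_filter(softmax(toy_logits, 0.8), 0.95))
print(sample_token(toy_logits))
```

Greedy decoding is the limiting case in which only the single most likely token survives, which is why it can lock into a loop once a repeated phrase becomes the top choice.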

Architecture


Type	decoder-only, causal LM
Hidden size (`n_embd`)	768
Layers (`n_layer`)	20
Heads (`n_head`)	12
Context length	1024
Attention	Multi-head Latent Attention (MLA), kv-compress 64 / q-compress 192
Position	RoPE (GPT-NeoX convention)
FFN	SquaredReLU (`w2(relu(w1 x)^2)`)
Norm	RMSNorm (pre-norm)
Embeddings	tied
Vocab	50257 (GPT-2 BPE)
Unique params	153.6M
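As a sanity check on the table, the parameter count can be roughly reproduced from the listed dimensions. The sketch below rests on several assumptions that the source does not confirm: a 4× FFN expansion, bias-free linear layers, low-rank query and key/value projections sized by the stated compress ranks, and two RMSNorm weights per layer plus a final norm. It ignores any extra MLA details such as decoupled RoPE projections or norms on the latents.

```python
def count_params(n_embd=768, n_layer=20, vocab=50257,
                 kv_rank=64, q_rank=192, ffn_mult=4):
    """Estimate unique parameters for an assumed MLA + SquaredReLU layout."""
    embed = vocab * n_embd                  # counted once because embeddings are tied
    attn = (
        n_embd * q_rank + q_rank * n_embd   # query down- and up-projection (assumed)
        + n_embd * kv_rank                  # shared key/value down-projection (assumed)
        + 2 * kv_rank * n_embd              # separate key and value up-projections (assumed)
        + n_embd * n_embd                   # attention output projection
    )
    ffn = 2 * n_embd * (ffn_mult * n_embd)  # w1 and w2, no biases (assumed)
    norms = (2 * n_layer + 1) * n_embd      # two pre-norms per layer plus a final norm
    total = embed + n_layer * (attn + ffn) + norms
    return {"embed": embed, "attn_per_layer": attn,
            "ffn_per_layer": ffn, "norms": norms, "total": total}


counts = count_params()
for name, value in counts.items():
    print(f"{name:>15}: {value:,}")
print(f"total in millions: {counts['total'] / 1e6:.1f}")
```

Under these assumptions the total comes to 153,644,544, which rounds to the reported 153.6M. That agreement is suggestive, but it does not prove the real layer layout matches the one assumed here.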

Training

FineWeb-Edu (~5B tokens), 1 epoch (76,293 steps), hybrid Muon (matrix weights) + AdamW (embeddings/norms), peak LR 1.7e-3, batch 64, bf16, ~12.3h on a single cloud GPU (the 15M/18M siblings train locally on one Intel Arc B580).
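The figures in this paragraph can be cross-checked with a little arithmetic. The sketch assumes that "batch 64" means 64 sequences of 1024 tokens per optimizer step with no gradient accumulation, which the source does not state explicitly; the variable names are chosen for the example.

```python
steps = 76_293          # optimizer steps, from the training summary
batch_sequences = 64    # assumed: sequences per step
context_length = 1024   # tokens per sequence, from the architecture table
train_hours = 12.3      # approximate wall-clock time
params = 153.6e6        # unique parameters

tokens_per_step = batch_sequences * context_length
total_tokens = steps * tokens_per_step
tokens_per_second = total_tokens / (train_hours * 3600)
tokens_per_param = total_tokens / params

print(f"tokens per step:   {tokens_per_step:,}")
print(f"total tokens:      {total_tokens:,}")
print(f"tokens per second: {tokens_per_second:,.0f}")
print(f"tokens per param:  {tokens_per_param:.1f}")
```

The product is 4,999,938,048 tokens, which matches the stated ~5B and supports the reading of the batch size used here. It also implies a throughput of roughly 113k tokens per second and about 33 training tokens per parameter.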

The story

Brújula is the honest output of a part-time, single-consumer-GPU project. It started from an over-engineered 331M model (with a bolted-on "engram" retrieval mechanism + dropout + torch.compile + gradient checkpointing) that displayed low loss but evaluated ~6× worse in clean inference — a broken train/eval stack. The fix was ruthless minimalism: strip every unvalidated component, keep MLA + RoPE + SquaredReLU + tied embeddings, A/B each change honestly. A 15M minimal model then beat the 331M one by ~2.5×. Adding the Muon optimizer and scaling to 150M produced this. The journey is the point; the perplexity is just where it landed.

Limitations

Base completion model — not instruction-tuned, no safety tuning.
English only, educational-web distribution (FineWeb-Edu); weaker out-of-distribution (the WikiText gap).
Not a knowledge base — at 150M it produces plausible prose but unreliable facts.
Short context (1024); no KV-cache in this reference implementation.

License & attribution

Model + code: Apache-2.0.
Training data: FineWeb-Edu (ODC-BY).
Built on ideas from: DeepSeek-V2 (MLA), Muon optimizer, Primer (SquaredReLU), GPT-2 (BPE tokenizer).
