Brújula-150M

A 153.6M-parameter decoder-only language model, trained from scratch on FineWeb-Edu as part of a solo, consumer-GPU hobby project. Brújula ("compass" in Spanish) is a deliberately minimal DeepSeek-style architecture: Multi-head Latent Attention (MLA) + RoPE + SquaredReLU FFN, with tied embeddings, trained with a hybrid Muon + AdamW optimizer.

This is a base completion model — it continues text; it is not instruction-tuned and not safety-tuned. It writes fluent, on-topic educational-web prose; it is not a reliable source of facts.

Results

Perplexity (lower is better), evaluated on a fixed local harness at context length 1024:

| Model | FineWeb-Edu val PPL | WikiText-103 PPL |
|---|---|---|
| Brújula-150M (this model) | 21.44 | 36.08 |
| GPT-2 small (124M, same harness) | 26.61 | 29.31 |

Honest reading: Brújula-150M beats GPT-2 small on FineWeb-Edu val, its home turf, but loses on WikiText-103. The gap is attributed to data distribution (FineWeb-Edu is narrower than GPT-2's WebText) rather than to model capacity. It is the model's main known limitation.
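
For readers comparing these numbers with their own runs: perplexity is the exponential of the mean per-token negative log-likelihood (in nats). The sketch below shows only that arithmetic. The function names are illustrative, and the actual harness (tokenization, windowing, how documents are split at context 1024) is not described in this card, so numbers from a different setup may not match.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def loss_from_perplexity(ppl):
    """Inverse mapping: the mean cross-entropy (nats/token) behind a PPL value."""
    return math.log(ppl)

# Hypothetical per-token NLLs, only to exercise the function.
example_nlls = [2.9, 3.1, 3.3, 3.0]
print(f"toy PPL: {perplexity(example_nlls):.2f}")

# The FineWeb-Edu column of the table, expressed as mean loss per token.
for name, ppl in [("Brujula-150M", 21.44), ("GPT-2 small", 26.61)]:
    print(f"{name}: {loss_from_perplexity(ppl):.3f} nats/token")
```

Because the mapping is exponential, a small difference in mean loss shows up as a noticeably larger difference in perplexity.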

The Brújula family

| Model | Params | FineWeb-Edu val PPL | WikiText-103 PPL | Notes |
|---|---|---|---|---|
| Brújula-15M | 15.5M | 78.05 | 190.74 | tiny champion, trained locally on one Arc B580 |
| Brújula-18M | 18M | 46.26 | 108.72 | Brújula-15M depth-grown via G_stack (4→8 layers) |
| Brújula-150M | 153.6M | 21.44 | 36.08 | this model (the flagship) |
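
The 4→8 growth noted for Brújula-18M can be pictured as depthwise stacking: the trained layer stack is copied and appended to itself, and training then continues on the deeper model. The sketch below is a minimal illustration under that assumption. `grow_depth` and the dict-based "layers" are hypothetical stand-ins, not the project's code, and the exact G_stack recipe used here is not documented in this card.

```python
import copy

def grow_depth(layers, factor=2):
    """Return a deeper stack: `factor` independent copies of `layers`, in order.

    Deep copies matter: the grown layers start from the same weights but
    must be free to diverge once training resumes.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [copy.deepcopy(layer) for _ in range(factor) for layer in layers]

# Toy stand-in for a trained 4-layer model: each "layer" is a dict of weights.
small = [{"name": f"block{i}", "w": [float(i)]} for i in range(4)]
grown = grow_depth(small, factor=2)

print(len(small), "->", len(grown))          # 4 -> 8
print([layer["name"] for layer in grown])
```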

Usage

Custom architecture code, so pass trust_remote_code=True:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Sakatepon/Brujula-150M"
tok = AutoTokenizer.from_pretrained(repo)
# The architecture lives in the repo, so transformers must be allowed to run it.
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()

# A cued, definitional prompt works better than a bare noun.
ids = tok("The mitochondria is the", return_tensors="pt").input_ids
with torch.no_grad():  # inference only, no gradients needed
    out = model.generate(ids, max_new_tokens=64, do_sample=True, temperature=0.8, top_p=0.95)
print(tok.decode(out[0], skip_special_tokens=True))

Tip: because this is a small base model, use sampling and cued or definitional prompts ("X is the … ") rather than bare nouns. Greedy decoding tends to fall into repetition loops.
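
To make the tip concrete, here is a rough sketch of what `temperature` and `top_p` do to a next-token distribution. It is plain Python over a toy list of logits, not the `transformers` implementation, and the function names are illustrative. Temperature rescales the logits; top-p then keeps the smallest set of highest-probability tokens whose cumulative mass reaches `top_p` and renormalizes over that set.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.95):
    """Keep the smallest high-probability set with cumulative mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass    # renormalize over the kept tokens
    return filtered

def sample_token(logits, temperature=0.8, top_p=0.95, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.5, 0.3, -1.0, -4.0]   # made-up scores for 5 tokens
print(top_p_filter(softmax(toy_logits, 0.8), 0.95))
print(sample_token(toy_logits))
```

Greedy decoding corresponds to always taking the top index, which is why it can lock into a loop once a repeated phrase becomes the most likely continuation.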

Architecture

| Field | Value |
|---|---|
| Type | decoder-only, causal LM |
| Hidden size (n_embd) | 768 |
| Layers (n_layer) | 20 |
| Heads (n_head) | 12 |
| Context length | 1024 |
| Attention | Multi-head Latent Attention (MLA), kv-compress 64 / q-compress 192 |
| Position | RoPE (GPT-NeoX convention) |
| FFN | SquaredReLU (w2(relu(w1 x)^2)) |
| Norm | RMSNorm (pre-norm) |
| Embeddings | tied |
| Vocab | 50257 (GPT-2 BPE) |
| Unique params | 153.6M |
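
Two of the entries above lend themselves to a small worked example: the SquaredReLU FFN formula `w2(relu(w1 x)^2)` and the tied embedding matrix. The sketch below uses plain Python lists to stay dependency-free (a real implementation would more likely use `torch.nn.Linear`). The toy weights and function names are made up for illustration, biases are assumed absent as in the formula, and the FFN's inner width is not stated in this card, so none is assumed.

```python
def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def squared_relu(v):
    """Elementwise relu(v)^2: negatives go to 0, positives are squared."""
    return [max(0.0, a) ** 2 for a in v]

def ffn(x, w1, w2):
    """SquaredReLU feed-forward: w2(relu(w1 x)^2), biases omitted."""
    return matvec(w2, squared_relu(matvec(w1, x)))

# Toy sizes: model width 2, inner width 3 (illustrative only).
w1 = [[1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
w2 = [[1.0, 1.0, 0.0], [0.0, 0.0, 0.5]]
print(ffn([2.0, 1.0], w1, w2))   # [4.0, 4.5]

# Tied embeddings: the input embedding and output projection share one
# vocab x n_embd matrix, so it is counted once in "unique params".
vocab, n_embd = 50257, 768
embedding_params = vocab * n_embd
print(f"embedding matrix: {embedding_params:,} params "
      f"({embedding_params / 153.6e6:.1%} of 153.6M)")
```

The shared matrix alone holds 38,597,376 parameters, roughly a quarter of the total, which is why tying matters at this scale.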

Training

  • Data: FineWeb-Edu (~5B tokens), 1 epoch (76,293 steps).
  • Optimizer: hybrid Muon (matrix weights) + AdamW (embeddings/norms), peak LR 1.7e-3.
  • Batch 64, bf16.
  • Compute: ~12.3h on a single cloud GPU (the 15M/18M siblings train locally on one Intel Arc B580).
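
As a sanity check on these figures, the token budget follows from steps × batch × context length, assuming "batch 64" means 64 sequences of 1024 tokens per optimizer step (the card does not spell this out). The variable names below are illustrative.

```python
steps = 76_293
batch_sequences = 64        # assumed: sequences per optimizer step
context_length = 1024
wall_clock_hours = 12.3     # approximate, from the training summary

tokens_per_step = batch_sequences * context_length
total_tokens = steps * tokens_per_step
tokens_per_second = total_tokens / (wall_clock_hours * 3600)

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,} (~{total_tokens / 1e9:.2f}B)")
print(f"throughput:      ~{tokens_per_second:,.0f} tokens/s")
```

Under that assumption the product comes to 4,999,938,048 tokens, which matches the ~5B figure for one epoch.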

The story

Brújula is the honest output of a part-time, single-consumer-GPU project. It started from an over-engineered 331M model (with a bolted-on "engram" retrieval mechanism + dropout + torch.compile + gradient checkpointing) that displayed low loss but evaluated ~6× worse in clean inference — a broken train/eval stack. The fix was ruthless minimalism: strip every unvalidated component, keep MLA + RoPE + SquaredReLU + tied embeddings, A/B each change honestly. A 15M minimal model then beat the 331M one by ~2.5×. Adding the Muon optimizer and scaling to 150M produced this. The journey is the point; the perplexity is just where it landed.

Limitations

  • Base completion model — not instruction-tuned, no safety tuning.
  • English only, educational-web distribution (FineWeb-Edu); weaker out-of-distribution (the WikiText gap).
  • Not a knowledge base — at 150M it produces plausible prose but unreliable facts.
  • Short context (1024); no KV-cache in this reference implementation.

License & attribution

  • Model + code: Apache-2.0.
  • Training data: FineWeb-Edu (ODC-BY).
  • Built on ideas from: DeepSeek-V2 (MLA), Muon optimizer, Primer (SquaredReLU), GPT-2 (BPE tokenizer).