Brújula-18M (G_stack)
An 18M-parameter decoder-only LM created by depth-growing Brújula-15M with the G_stack operator (Du et al., "Stacking Your Transformers", NeurIPS 2024): copy the trained 15M model's 4 layers into an 8-layer model, then continue pre-training. The whole pipeline runs on a single consumer GPU (Intel Arc B580). Brújula ("compass" in Spanish) uses a minimal DeepSeek-style architecture (MLA + RoPE + SquaredReLU, tied embeddings) and is trained with the Muon optimizer.
The result: depth-doubling the 15M champion cut perplexity by 41% on FineWeb-Edu and 43% on WikiText-103, for ~3h of extra local compute.
Results
Perplexity (lower is better), fixed local harness at context length 1024:
| Model | FineWeb-Edu val PPL | WikiText-103 PPL |
|---|---|---|
| Brújula-15M (the base, 4 layers) | 78.05 | 190.74 |
| Brújula-18M (this model, 8 layers) | 46.26 | 108.72 |
| improvement | −41% | −43% |
Honest note: the gain comes from added depth + warm-start (the grown model effectively saw ~2× the cumulative tokens of the base), not depth alone — no from-scratch-8-layer control was run. Either way, it's the best sub-50M model in the family.
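As a quick check, the improvement row can be recomputed from the two perplexity rows. The sketch below also converts each perplexity to a mean per-token cross-entropy using the standard definition PPL = exp(loss); that the local harness reports perplexity this way is an assumption, and the helper names are illustrative only.

```python
import math

# Perplexities copied from the results table.
PPL = {
    "FineWeb-Edu val": {"base_15M": 78.05, "grown_18M": 46.26},
    "WikiText-103": {"base_15M": 190.74, "grown_18M": 108.72},
}


def relative_change_pct(before: float, after: float) -> float:
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


def ppl_to_loss(ppl: float) -> float:
    """Mean per-token cross-entropy in nats, assuming PPL = exp(loss)."""
    return math.log(ppl)


for name, row in PPL.items():
    change = relative_change_pct(row["base_15M"], row["grown_18M"])
    loss_drop = ppl_to_loss(row["base_15M"]) - ppl_to_loss(row["grown_18M"])
    print(f"{name}: {change:+.1f}% PPL, loss lower by {loss_drop:.3f} nats")
```

In loss terms, both evaluations improve by a little over half a nat per token.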
The Brújula family
| Model | Params | FineWeb val | WikiText | Notes |
|---|---|---|---|---|
| Brújula-15M | 15.5M | 78.05 | 190.74 | tiny champion, from scratch on one Arc B580 |
| Brújula-18M | 18M | 46.26 | 108.72 | this model — Brújula-15M G_stack-grown (4→8 layers) |
| Brújula-150M | 153.6M | 21.44 | 36.08 | the flagship |
Usage
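```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Sakatepon/Brujula-18M"
tok = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True lets Transformers run the model code shipped in the repo.
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()

torch.manual_seed(0)  # optional: makes the sampled continuation reproducible
# End the prompt mid-sentence so the base model has a clear continuation cue.
ids = tok("The mitochondria is the", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64, do_sample=True, temperature=0.8, top_p=0.95, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))
```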
This is a small base model: use sampling and a continuation cue (a prompt that stops mid-sentence). Greedy decoding tends to fall into repetition loops.
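To make the sampling settings concrete, here is a simplified, pure-Python illustration of what `temperature` and `top_p` do to a single next-token distribution. It is a sketch of the general technique, not the Transformers implementation (which differs in details), and it leaves out the repetition penalty. The function name and the toy logits are made up for the example.

```python
import math
import random


def top_p_sample(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample one token id using temperature scaling and nucleus (top-p) filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalises the weights of the surviving tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [5.0, 1.0, 0.0]  # token 0 dominates
print(top_p_sample(toy_logits))
```

Greedy decoding corresponds to always taking the first entry of `order`, which is why it can lock onto the same high-probability loop; sampling from the filtered set gives the model a way out.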
Architecture
| Property | Value |
|---|---|
| Type | decoder-only, causal LM |
| Hidden / Layers / Heads | n_embd=256 / n_layer=8 (grown from 4 via G_stack) / n_head=4 |
| Context length | 1024 |
| Attention | Multi-head Latent Attention (MLA), kv-compress 32 / q-compress 64 |
| Position / FFN / Norm | RoPE / SquaredReLU / RMSNorm (pre-norm), tied embeddings |
| Vocab | 50257 (GPT-2 BPE) |
| Unique params | 18.0M |
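The table values can be collected into a small config object to see where the parameters go. This is a hypothetical container written for illustration: the field names, the even split of `n_embd` across heads, and the reading of the compress values as MLA latent widths are assumptions, and the repo's actual config keys may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrujulaConfigSketch:
    """Hypothetical config mirroring the architecture table (not the repo's class)."""

    n_embd: int = 256
    n_layer: int = 8
    n_head: int = 4
    block_size: int = 1024   # context length
    kv_compress: int = 32    # assumed: MLA latent width for keys/values
    q_compress: int = 64     # assumed: MLA latent width for queries
    vocab_size: int = 50257  # GPT-2 BPE

    @property
    def head_dim(self) -> int:
        # Assumes the hidden size is split evenly across heads.
        return self.n_embd // self.n_head

    @property
    def embedding_params(self) -> int:
        # Tied embeddings: the input table doubles as the output head,
        # so it is counted once.
        return self.vocab_size * self.n_embd


cfg = BrujulaConfigSketch()
print(cfg.head_dim)
print(cfg.embedding_params)
```

The tied embedding table alone holds 50257 × 256 = 12,865,792 parameters, so roughly 5M of the 18.0M unique parameters sit in the eight transformer blocks and norms.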
How it was made (G_stack)
- Train Brújula-15M from scratch (4 layers).
- Stack: copy the 4 trained layers into an 8-layer model ([0,1,2,3] → [0,1,2,3,0,1,2,3]); keep the embedding / final-norm / tied head.
- Continue pre-training for 1 epoch on FineWeb-Edu (~1.4B tokens), batch 32, peak LR 1.2e-3, bf16, ~3h14m on one Intel Arc B580.
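The stacking step can be sketched as a remapping of state-dict keys. This is a minimal illustration, not the training code used for this model: the `blocks.<i>.` key prefix and the toy key names are assumptions, and strings stand in for weight tensors so the example runs without any dependencies.

```python
import copy


def g_stack(state: dict, n_src_layers: int, growth: int = 2, prefix: str = "blocks.") -> dict:
    """Return a new state dict with the source layers repeated `growth` times.

    Layer i of the grown model is a copy of source layer i % n_src_layers,
    so 4 layers with growth=2 gives [0,1,2,3,0,1,2,3]. Keys that do not
    start with `prefix` (embedding, final norm, tied head) are copied as-is.
    """
    grown = {}
    for key, value in state.items():
        if not key.startswith(prefix):
            grown[key] = copy.deepcopy(value)
            continue
        layer_str, _, rest = key[len(prefix):].partition(".")
        src = int(layer_str)
        for rep in range(growth):
            dst = src + rep * n_src_layers
            grown[f"{prefix}{dst}.{rest}"] = copy.deepcopy(value)
    return grown


# Toy state dict: strings stand in for weight tensors.
toy = {"wte.weight": "E", "ln_f.weight": "N"}
toy.update({f"blocks.{i}.attn.weight": f"W{i}" for i in range(4)})

grown = g_stack(toy, n_src_layers=4)
layout = [grown[f"blocks.{i}.attn.weight"] for i in range(8)]
print(layout)  # ['W0', 'W1', 'W2', 'W3', 'W0', 'W1', 'W2', 'W3']
```

With real tensors the same remapping would apply, and the result would be loaded into a freshly constructed 8-layer model before continued pre-training.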
The post-stack loss spikes (copy-init isn't function-preserving), then recovers and surpasses the 15M base within the first ~12% of the epoch — consistent with G_stack's finding that violating function preservation is fine and even preferable.
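The recipe figures allow a rough back-of-the-envelope estimate of throughput and of when the grown model overtakes the base. The token count, wall time, batch size, and ~12% crossover come from the text above; the sequence length of 1024 per sample, the absence of gradient accumulation, and constant throughput across the epoch are assumptions.

```python
TOKENS = 1.4e9                      # ~1 epoch of FineWeb-Edu
WALL_SECONDS = 3 * 3600 + 14 * 60   # ~3h14m
BATCH = 32
SEQ_LEN = 1024                      # assumption: one full context per sequence
CROSSOVER_FRACTION = 0.12           # loss passes the 15M base within ~12% of the epoch

tokens_per_second = TOKENS / WALL_SECONDS
tokens_per_step = BATCH * SEQ_LEN   # assumes no gradient accumulation
steps = TOKENS / tokens_per_step
crossover_tokens = CROSSOVER_FRACTION * TOKENS
crossover_minutes = CROSSOVER_FRACTION * WALL_SECONDS / 60  # assumes constant throughput

print(f"throughput      ~{tokens_per_second:,.0f} tokens/s")
print(f"optimizer steps ~{steps:,.0f}")
print(f"crossover after ~{crossover_tokens / 1e6:.0f}M tokens (~{crossover_minutes:.0f} min)")
```

Under these assumptions the grown model catches up with its 15M parent after roughly 170M tokens, or a bit over 20 minutes of training.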
Limitations
- Base completion model — not instruction-tuned, no safety tuning.
- English only, educational-web distribution (FineWeb-Edu); weaker out-of-distribution.
- ~18M params: plausible prose, unreliable facts; best on cued, definitional prompts.
- Short context (1024); no KV-cache in this reference implementation.
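The last limitation has two practical consequences: the prompt plus the generated tokens must fit in 1024 positions, and without a KV-cache each decoding step re-runs the whole sequence. The helpers below are a simplified sketch with hypothetical names; they count token positions passed through the model, not FLOPs.

```python
CONTEXT = 1024  # context length from the architecture table


def fit_prompt(ids: list, max_new_tokens: int, context: int = CONTEXT) -> list:
    """Keep only the most recent prompt tokens so prompt + output fits the context."""
    budget = context - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context length")
    return ids[-budget:]


def positions_processed(prompt_len: int, new_tokens: int, kv_cache: bool) -> int:
    """Token positions run through the model while generating `new_tokens`."""
    if kv_cache:
        # Prompt once, then a single position for each further step.
        return prompt_len + new_tokens - 1
    # No cache: every step re-runs the whole sequence so far.
    return sum(prompt_len + i for i in range(new_tokens))


long_prompt = list(range(2000))  # stand-in token ids
print(len(fit_prompt(long_prompt, max_new_tokens=64)))
print(positions_processed(5, 64, kv_cache=False))
print(positions_processed(5, 64, kv_cache=True))
```

For a 5-token prompt and 64 new tokens, the uncached path processes 2336 positions versus 68 with a cache, which is tolerable at this model size but grows quadratically with output length.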
License & attribution
- Model + code: Apache-2.0. Training data: FineWeb-Edu (ODC-BY).
- Methods: G_stack (Du et al., 2024), DeepSeek-V2 (MLA), Muon, Primer (SquaredReLU), GPT-2 (BPE).