---
license: apache-2.0
language: [en]
library_name: safetensors
pipeline_tag: text-generation
tags: [hobbylm, mixture-of-experts, moe, sparse-moe]
---
# HobbyLM-Base (500M sparse-MoE foundation LM)
HobbyLM-Base is the foundation the whole family is built on: a 500M-parameter sparse Mixture-of-Experts decoder trained **from scratch** on FineWeb — no distillation, no borrowed weights. It exists to answer a simple question: how far can you get at the ~500M scale if you sweat the architecture and the training recipe instead of throwing tokens at the problem?
It's part of the **HobbyLM** family — a 500M sparse-MoE model (and its variants) built from scratch on a
hobby budget: FineWeb, a handful of Modal H100 hours, a lot of ablations, and a from-scratch Rust engine
([`hobby-rs`](https://github.com/harishsg993010/HobbyLM)) to run it on a laptop CPU.
## Intended use
A pretrained base model for text completion, and the checkpoint you fine-tune for downstream tasks. It is **not** instruction-tuned — for chat, use [HobbyLM-Chat](https://huggingface.co/rootxhacker/HobbyLM-Chat).
## Architecture
Every HobbyLM variant shares one core: a **sparse Mixture-of-Experts (MoE)** decoder in the modern
small-MoE style (DeepSeek-V3 / OLMoE lineage), where each design choice was picked by ablation rather
than by guesswork.
| Component | Value |
|---|---|
| Total parameters | ~500M (only a fraction is active per token) |
| Hidden size / layers | 768 / 16 (first FFN dense, the rest MoE) |
| Routed experts / active | 36 / top-6 (+ 1 always-on shared expert) |
| Attention | GQA, 12 query / 3 KV heads, decoupled head-dim 128, per-head QK-norm |
| Router | sigmoid gating, DeepSeek-V3 aux-loss-free load balancing, no top-k renorm |
| Positional | RoPE (θ up to 1e6 for the 8k-context checkpoints) |
| Tokenizer | GPT-2 byte-level BPE (50,304 vocab, sentinel-padded) |
| Optimizer | Muon on the 2-D + per-expert matrices, AdamW on everything else |
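
The attention row can be unpacked with a little shape arithmetic. This is only a sketch under one assumption: that "decoupled head-dim 128" means each head is 128 wide regardless of hidden size divided by head count (which would be 64). The variable names are illustrative and are not taken from the HobbyLM source.

```python
# Shape arithmetic for the attention row of the table above.
# Names are illustrative, not taken from the HobbyLM source.
hidden_size = 768
n_layers = 16
n_q_heads = 12
n_kv_heads = 3
head_dim = 128  # assumed "decoupled": not tied to hidden_size // n_q_heads (= 64)

q_width = n_q_heads * head_dim        # output width of the query projection
kv_width = n_kv_heads * head_dim      # output width of each of the K and V projections
group_size = n_q_heads // n_kv_heads  # query heads that share one KV head

# Values cached per token across all layers (K and V), GQA vs. full multi-head.
kv_cache_gqa = 2 * n_layers * n_kv_heads * head_dim
kv_cache_mha = 2 * n_layers * n_q_heads * head_dim

print(f"q_width={q_width}, kv_width={kv_width}, group_size={group_size}")
print(f"KV cache per token: {kv_cache_gqa} values (GQA) vs {kv_cache_mha} (MHA)")
print(f"reduction: {kv_cache_mha // kv_cache_gqa}x")
```
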
The full ablation log (QK-norm is the single biggest lever; aux-loss-free beats classic aux-loss;
≥32 experts and top-6 help; embedding-scaling hurt) lives in the project's architecture notes.
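
The router row combines three ideas: sigmoid gating, a selection-only bias for aux-loss-free balancing, and no renormalisation of the top-k gates. Below is a minimal pure-Python sketch of how those pieces could fit together for a single token. It is not the HobbyLM implementation: the function names, the bias step size, and the toy batch are all assumptions made for illustration, and the always-on shared expert (which would simply be added to the routed output) is left out.

```python
import math
import random

N_EXPERTS = 36  # routed experts (from the table)
TOP_K = 6       # routed experts active per token (from the table)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route(logits, bias, top_k=TOP_K):
    """Pick top_k experts for one token; return (indices, gate weights)."""
    scores = [sigmoid(x) for x in logits]
    # The balancing bias affects which experts win, not how they are weighted.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    # No top-k renormalisation: gates are raw sigmoid scores and need not sum to 1.
    gates = [scores[i] for i in chosen]
    return chosen, gates

def update_bias(bias, load, step=1e-3):
    """Nudge overloaded experts down and underloaded ones up (step is hypothetical)."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias

rng = random.Random(0)
bias = [0.0] * N_EXPERTS
load = [0] * N_EXPERTS
for _ in range(256):  # one toy batch of 256 tokens with random router logits
    logits = [rng.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]
    chosen, gates = route(logits, bias)
    for i in chosen:
        load[i] += 1
bias = update_bias(bias, load)
print("tokens per expert:", load)
print("gate sum for last token:", round(sum(gates), 3))
```

Because the bias is used only when ranking experts, balancing pressure never leaks into the gate weights, which is the property that lets this scheme drop the auxiliary loss term.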
## Benchmarks
0-shot, 7-task average through our harness (see note below). HobbyLM was trained on **40B tokens** — a tiny
budget next to the comparison models — so the right way to read this table is *per training token*.
| Model | Params | Pretrain tokens | Avg (7-task) |
|---|---|---|---|
| SmolLM2-360M | 360M | ~4T | 56.29 |
| Qwen3-0.6B | 600M | ~36T | 54.78 |
| gemma-3-270m | 270M | — | 48.09 |
| pythia-410m | 410M | 300B | 45.34 |
| **HobbyLM-Base (500M)** | **500M** | **40B** | **44.05** |
| opt-350m | 350M | 180B | 43.61 |
| HobbyLM-130M (sibling) | 130M | 10B | 42.97 |
| MicroLlama-300M | 300M | 50B | 42.23 |
| gpt2 | 124M | — | 40.62 |
| pythia-160m | 160M | 300B | 38.60 |
Per-task (0-shot): HellaSwag 41.5 · LAMBADA 40.0 · SciQ 70.3 · PIQA 69.6 · ARC-easy 42.7
(ARC-challenge / WinoGrande sit near chance, as expected at this scale). Validation loss: **3.03** at 1k
context, **2.94** after the 8k context-extension.
The ranking largely tracks **pretraining tokens**, not parameters: the two top models (SmolLM2-360M and Qwen3-0.6B) saw roughly 100× and 900× more data than HobbyLM-Base.
In the classic ≤300B-token regime, HobbyLM leads per token: the 130M (10B tokens) beats MicroLlama-300M
(50B) and pythia-160m (300B), and the 500M base (40B) edges out opt-350m (180B). Token budget, not architecture, is the gap.
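
The token-budget comparison can be checked directly from the table. The sketch below copies the token counts and averages from the rows that list a token count and expresses each budget as a multiple of HobbyLM-Base's 40B; the variable names are illustrative.

```python
# (model, pretraining tokens, 7-task average) copied from the table above.
# Rows without a listed token count (gemma-3-270m, gpt2) are left out.
rows = [
    ("SmolLM2-360M", 4e12, 56.29),
    ("Qwen3-0.6B", 36e12, 54.78),
    ("pythia-410m", 300e9, 45.34),
    ("HobbyLM-Base (500M)", 40e9, 44.05),
    ("opt-350m", 180e9, 43.61),
    ("HobbyLM-130M", 10e9, 42.97),
    ("MicroLlama-300M", 50e9, 42.23),
    ("pythia-160m", 300e9, 38.60),
]

base_tokens = 40e9  # HobbyLM-Base pretraining budget
ratios = {name: tokens / base_tokens for name, tokens, _ in rows}

for name, tokens, avg in rows:
    print(f"{name:22s} {ratios[name]:7.2f}x the 40B budget   avg {avg:.2f}")
```
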
> **How these were measured.** All language-model scores are **0-shot** through our own port of
> EleutherAI's `lm-evaluation-harness` (a custom `MoELMWrapper` that runs log-likelihood scoring over the
> HobbyLM MoE + GPT-2 tokenizer). Reference models in the comparison table were run through the **identical
> harness and task set**, so the numbers are apples-to-apples with ours — they are *not* copied from other
> model cards. We validated the harness against published cards (e.g. TinyLlama 52.75 vs card 52.99). These
> are small research models: read the numbers in context, not as leaderboard claims.
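
For readers unfamiliar with log-likelihood scoring, the idea is to score each candidate answer by the summed log-probability the model assigns to its tokens given the prompt, then pick the highest. The sketch below uses an invented bigram table as a stand-in for the model, so it illustrates the scoring rule only; it is not the `MoELMWrapper` code, and every name and probability in it is hypothetical.

```python
import math

# Toy stand-in for a language model: log P(next_token | previous_token).
# A real harness would query the model; these probabilities are invented.
BIGRAM_LOGP = {
    ("is", "Paris"): math.log(0.6),
    ("is", "London"): math.log(0.3),
    ("is", "a"): math.log(0.1),
    ("a", "banana"): math.log(0.05),
}
UNSEEN_LOGP = math.log(1e-4)  # fallback for pairs missing from the toy table

def continuation_logprob(context, continuation):
    """Sum log P(token | previous token) over the continuation tokens only."""
    prev = context[-1]
    total = 0.0
    for tok in continuation:
        total += BIGRAM_LOGP.get((prev, tok), UNSEEN_LOGP)
        prev = tok
    return total

def pick_choice(context, choices):
    """Return (index of the highest-scoring choice, list of all scores)."""
    scores = [continuation_logprob(context, c) for c in choices]
    best = max(range(len(choices)), key=lambda i: scores[i])
    return best, scores

context = "The capital of France is".split()
choices = [["London"], ["Paris"], ["a", "banana"]]
best, scores = pick_choice(context, choices)
print("picked:", " ".join(choices[best]))
print("scores:", [round(s, 3) for s in scores])
```

Longer continuations accumulate more negative log-probability, which is why harnesses often report a length-normalised accuracy alongside the raw one.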
## Usage
### Python (PyTorch reference implementation)
HobbyLM is a custom sparse-MoE architecture — there's no `transformers` `AutoModel` for it, so load it with
the small reference implementation from the [GitHub repo](https://github.com/harishsg993010/HobbyLM):
```python
# HobbyLM is a CUSTOM sparse-MoE architecture, so load it with the reference implementation —
# NOT transformers.AutoModelForCausalLM (there is no AutoModel mapping for this arch).
# pip install torch safetensors tiktoken huggingface_hub
# git clone https://github.com/harishsg993010/HobbyLM && cd HobbyLM
import json, torch, tiktoken
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from hobbylm.config import ModelConfig
from hobbylm.model import MoETransformer
from hobbylm.generate import generate
repo = "rootxhacker/HobbyLM-Base"
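# Read config.json and drop the "preset" key before building ModelConfig.
with open(hf_hub_download(repo, "config.json")) as f:
    raw_cfg = json.load(f)
cfg = ModelConfig(**{k: v for k, v in raw_cfg.items() if k != "preset"})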
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cfg.expert_backend = "grouped" if device.type == "cuda" else "bmm"
model = MoETransformer(cfg).to(device).eval()
model.load_state_dict(load_file(hf_hub_download(repo, "model.safetensors")))
enc = tiktoken.get_encoding("gpt2")
prompt = "The capital of France is"
ids = torch.tensor([enc.encode_ordinary(prompt)], device=device)
out = generate(model, ids, max_new_tokens=64, temperature=0.7, top_k=0, device=device,
               repetition_penalty=1.3)  # use temperature=0.0 for greedy decoding
print(enc.decode(out[0].tolist()))
```
### GGUF + hobby-rs (CPU)
GGUF builds (architecture `hobbylm`) live in [`rootxhacker/HobbyLM-gguf`](https://huggingface.co/rootxhacker/HobbyLM-gguf). They load
directly in the from-scratch `hobby-rs` CPU engine — **stock llama.cpp won't load them** without registering
the `hobbylm` architecture first.
```bash
hobby-rs --model HobbyLM-Base.gguf --prompt "..." --n 64
```
## Training
Pretrained on ~40B unique FineWeb tokens (8×H100), then context-extended 1k→8k (RoPE θ 1e4→1e6). Muon on the hidden + per-expert matrices, AdamW on the router/embeddings/norms; fp32 router; chunked-checkpointed cross-entropy to fit a larger batch.
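
One way to see why the context extension raises RoPE θ is to count how many rotary dimension pairs complete less than one full turn over the context window. The sketch below assumes the standard RoPE frequency formula and the head dimension of 128 from the architecture table; it is an illustration of the general technique, not code from the HobbyLM training pipeline, and the helper names are hypothetical.

```python
import math

def rope_inv_freq(head_dim, theta):
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim) per dim pair."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

def pairs_not_wrapping(head_dim, theta, context_len):
    """Count dim pairs whose wavelength (2*pi / freq) exceeds the context length."""
    return sum(1 for f in rope_inv_freq(head_dim, theta)
               if 2 * math.pi / f > context_len)

head_dim = 128  # from the architecture table
for theta, context_len in [(1e4, 1024), (1e4, 8192), (1e6, 8192)]:
    n = pairs_not_wrapping(head_dim, theta, context_len)
    print(f"theta={theta:.0e}, context={context_len}: "
          f"{n} of {head_dim // 2} pairs stay within one rotation")
```

With the larger θ, more pairs rotate slowly enough to give unambiguous positions across the full 8k window.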
## Limitations
- It's a ~500M base model on a 40B-token budget: fluent and factually okay on easy questions, but it hallucinates and can loop on repeated text unless a repetition penalty is applied at decode time.
- Trained on English FineWeb; other languages and code are out of distribution.
- Not aligned or safety-tuned.
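
The repetition issue in the first limitation is usually handled at decode time. The sketch below shows a common form of repetition penalty, which shrinks the logits of tokens that have already been generated. It is an assumption that HobbyLM's `generate` uses this exact rule; the function name and the toy logits are illustrative only.

```python
def apply_repetition_penalty(logits, previous_ids, penalty=1.3):
    """Make already-seen tokens less likely: divide positive logits by the
    penalty and multiply negative ones, so both move toward lower probability."""
    out = list(logits)
    for tok in set(previous_ids):
        if out[tok] > 0:
            out[tok] = out[tok] / penalty
        else:
            out[tok] = out[tok] * penalty
    return out

logits = [2.6, -1.0, 0.5, 1.9]  # toy next-token logits over a 4-token vocabulary
previous_ids = [0, 1, 0]        # tokens 0 and 1 were already generated
penalised = apply_repetition_penalty(logits, previous_ids)
print("before:", logits)
print("after: ", [round(x, 3) for x in penalised])
```

A penalty of 1.0 leaves the logits unchanged, so the value trades repetition against the risk of suppressing tokens that legitimately recur.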
## License
Apache-2.0. Weights aren't a substitute for judgement — this is a research / hobby model at the 500M scale,
not a production system.