---
license: apache-2.0
language: [en]
library_name: safetensors
pipeline_tag: text-generation
tags: [hobbylm, mixture-of-experts, moe, sparse-moe]
---
# HobbyLM-Chat (500M MoE, instruction-tuned)
HobbyLM-Chat is the instruction-tuned conversational model: HobbyLM-Base taken through SmolTalk supervised fine-tuning and a SmolLM2-style UltraFeedback DPO pass. The jump from base is large: it holds a coherent persona, follows instructions, and (with a repetition penalty) produces varied, flowing prose.
It's part of the **HobbyLM** family, a 500M sparse-MoE model (and its variants) built from scratch on a
hobby budget: FineWeb, a handful of Modal H100 hours, a lot of ablations, and a from-scratch Rust engine
([`hobby-rs`](https://github.com/harishsg993010/HobbyLM)) to run it on a laptop CPU.
## Intended use
General single- and multi-turn chat / instruction following. Prompt it with the trained `SYSTEM:` / `USER:` / `ASSISTANT:` turn format, and decode with a **repetition penalty ≈1.3** (this is what tames the small-model repetition tendency).
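
As a rough illustration of what a repetition penalty does, the sketch below applies the widely used CTRL-style rule (the one Hugging Face's `RepetitionPenaltyLogitsProcessor` implements) to a toy logit vector. This shows the general technique only and is an assumption, not a description of how `hobbylm.generate` handles its `repetition_penalty` argument; `apply_repetition_penalty` is a hypothetical helper.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Return a copy of `logits` with already-generated tokens made less likely.

    CTRL-style rule (assumed here, not confirmed for hobbylm): positive logits
    are divided by `penalty`, negative logits are multiplied by it, so the
    score of a repeated token always moves down.
    """
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


# Toy vocabulary of 4 tokens; tokens 0 and 2 have already been generated.
logits = [2.6, 1.0, -1.0, 0.5]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 2, 0])
print(penalized)
```

With a penalty of 1.0 the logits pass through unchanged; larger values push harder against repeats, at the cost of discouraging legitimate reuse of common tokens.
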
## Architecture
Every HobbyLM variant shares one core: a **sparse Mixture-of-Experts (MoE)** decoder in the modern
small-MoE style (DeepSeek-V3 / OLMoE lineage), where each design choice was picked by ablation rather
than by guesswork.
| Component | Value |
|---|---|
| Total parameters | ~500M (only a fraction is active per token) |
| Hidden size / layers | 768 / 16 (first FFN dense, the rest MoE) |
| Routed experts / active | 36 / top-6 (+ 1 always-on shared expert) |
| Attention | GQA, 12 query / 3 KV heads, decoupled head-dim 128, per-head QK-norm |
| Router | sigmoid gating, DeepSeek-V3 aux-loss-free load balancing, no top-k renorm |
| Positional | RoPE (θ up to 1e6 for the 8k-context checkpoints) |
| Tokenizer | GPT-2 byte-level BPE (50,304 vocab, sentinel-padded) |
| Optimizer | Muon on the 2-D + per-expert matrices, AdamW on everything else |
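
To make the table's numbers concrete, here is a small arithmetic sketch of the shapes they imply. It assumes a standard GQA projection layout and that every layer stores one K and one V vector per KV head; the variable names are illustrative and not taken from the HobbyLM code.

```python
# Values taken from the table above.
hidden_size = 768
n_layers = 16
n_q_heads = 12
n_kv_heads = 3
head_dim = 128  # "decoupled": not tied to hidden_size // n_q_heads (which would be 64)
routed_experts, top_k, shared_experts = 36, 6, 1

# Assumption: standard GQA projections from the hidden state.
q_width = n_q_heads * head_dim      # query projection output width
kv_width = n_kv_heads * head_dim    # key (and value) projection output width
queries_per_kv_head = n_q_heads // n_kv_heads

# KV-cache entries per token: one K and one V vector per KV head per layer.
kv_values_per_token = 2 * n_layers * n_kv_heads * head_dim

# Expert FFNs evaluated per token in an MoE layer vs. experts present.
active_experts = top_k + shared_experts
total_experts = routed_experts + shared_experts

print(f"Q width {q_width}, K/V width {kv_width}, {queries_per_kv_head} query heads per KV head")
print(f"KV cache: {kv_values_per_token} values per token")
print(f"Experts per MoE layer: {active_experts} active of {total_experts}")
```

Sharing each KV head across four query heads is what keeps the cache small relative to full multi-head attention with the same head dimension.
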
The full ablation log (QK-norm is the single biggest lever; aux-loss-free beats classic aux-loss;
≥32 experts and top-6 help; embedding-scaling hurt) lives in the project's architecture notes.
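
The router row can be sketched in a few lines of plain Python. This is a minimal illustration of sigmoid gating with a DeepSeek-V3-style selection bias and no top-k renormalisation, not the HobbyLM implementation: the function names, the bias step size, and the random toy logits are all assumptions.

```python
import math
import random


def route(logits, bias, top_k=6):
    """Pick experts for one token.

    Sigmoid gating: each expert gets an independent score in (0, 1).
    Aux-loss-free balancing: `bias` shifts which experts are *selected*,
    but the gate weights come from the unbiased scores.
    No top-k renorm: selected weights are used as-is (they need not sum to 1).
    """
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    return chosen, [scores[e] for e in chosen]


def update_bias(bias, load, step=1e-3):
    """Nudge the bias down for overloaded experts and up for underloaded ones."""
    mean_load = sum(load) / len(load)
    return [b - step if n > mean_load else b + step if n < mean_load else b
            for b, n in zip(bias, load)]


random.seed(0)
n_experts = 36
bias = [0.0] * n_experts
load = [0] * n_experts
for _ in range(256):  # a toy batch of 256 tokens with random router logits
    logits = [random.gauss(0.0, 1.0) for _ in range(n_experts)]
    chosen, weights = route(logits, bias)
    for e in chosen:
        load[e] += 1
bias = update_bias(bias, load)
print(sum(load), "expert slots filled; last token ->", chosen)
```

Because the bias only affects selection, load balancing does not add a gradient term to the training loss, which is the point of the aux-loss-free approach.
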
## Benchmarks
0-shot multiple-choice, our harness. Note that MC benchmarks measure *knowledge*, not *chat quality*; the
goal of this checkpoint is conversational fluency, which these tasks don't capture. The small dip vs the base
model is the usual **alignment tax**.
| Task | HobbyLM-Chat | HobbyLM-Base |
|---|---|---|
| ARC-challenge | 23.8 | 22.4 |
| ARC-easy | 42.2 | 42.8 |
| HellaSwag | 39.5 | 41.6 |
| PIQA | 67.1 | 69.5 |
| WinoGrande | 53.6 | 51.3 |
| OpenBookQA | 27.2 | 29.8 |
| BoolQ | 44.4 | 51.0 |
| **Average** | **42.5** | **44.0** |
Reasoning tasks (ARC, WinoGrande) held or improved; BoolQ dropped the most, because chat phrasing fits the
log-likelihood format worse, not because of a capability loss. This is healthy for a ~500M chat model (SmolLM-360M range).
> **How these were measured.** All language-model scores are **0-shot** through our own port of
> EleutherAI's `lm-evaluation-harness` (a custom `MoELMWrapper` that runs log-likelihood scoring over the
> HobbyLM MoE + GPT-2 tokenizer). Reference models in the comparison table were run through the **identical
> harness and task set**, so the numbers are apples-to-apples with ours; they are *not* copied from other
> model cards. We validated the harness against published cards (e.g. TinyLlama 52.75 vs card 52.99). These
> are small research models: read the numbers in context, not as leaderboard claims.
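
For readers unfamiliar with log-likelihood scoring, the sketch below shows the general idea behind 0-shot multiple-choice evaluation: score each answer choice as a continuation of the question and pick the most likely one. It is a simplified illustration, not the project's `MoELMWrapper`; `pick_answer`, `toy_loglik`, and the log-probabilities are hypothetical, and the card does not say whether raw or length-normalised scores were reported.

```python
import math


def pick_answer(context, choices, loglik, length_normalize=False):
    """Return the index of the choice the model finds most likely after `context`.

    `loglik(context, continuation)` is a hypothetical callable returning the
    summed log-probability of the continuation tokens given the context.
    With `length_normalize`, scores are divided by the choice's character length.
    """
    best_idx, best_score = -1, -math.inf
    for i, choice in enumerate(choices):
        score = loglik(context, choice)
        if length_normalize:
            score /= max(len(choice), 1)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


# Made-up log-probabilities standing in for a real model's scores.
TOY_SCORES = {" sunlight": -1.2, " darkness and cold": -9.5, " loud music": -7.0}


def toy_loglik(context, continuation):
    """Stand-in scorer for demonstration only."""
    return TOY_SCORES[continuation]


question = "Question: What do plants need for photosynthesis?\nAnswer:"
choices = list(TOY_SCORES)
predicted = pick_answer(question, choices, toy_loglik)
print(choices[predicted])
```

Accuracy on a task is then the fraction of questions where the picked index matches the gold answer, which is why phrasing habits learned during chat tuning can move these scores without any change in underlying knowledge.
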
## Usage
### Python (PyTorch reference implementation)
HobbyLM is a custom sparse-MoE architecture with no `transformers` `AutoModel` mapping, so load it with
the small reference implementation from the [GitHub repo](https://github.com/harishsg993010/HobbyLM):
```python
# HobbyLM is a CUSTOM sparse-MoE architecture, so load it with the reference implementation,
# NOT transformers.AutoModelForCausalLM (there is no AutoModel mapping for this arch).
# pip install torch safetensors tiktoken huggingface_hub
# git clone https://github.com/harishsg993010/HobbyLM && cd HobbyLM
import json, torch, tiktoken
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from hobbylm.config import ModelConfig
from hobbylm.model import MoETransformer
from hobbylm.generate import generate
repo = "rootxhacker/HobbyLM-Chat"
cfg = ModelConfig(**{k: v for k, v in json.load(open(hf_hub_download(repo, "config.json"))).items() if k != "preset"})
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cfg.expert_backend = "grouped" if device.type == "cuda" else "bmm"
model = MoETransformer(cfg).to(device).eval()
model.load_state_dict(load_file(hf_hub_download(repo, "model.safetensors")))
enc = tiktoken.get_encoding("gpt2")
prompt = "USER: Give me three tips for better sleep.\nASSISTANT:"
ids = torch.tensor([enc.encode_ordinary(prompt)], device=device)
out = generate(model, ids, max_new_tokens=64, temperature=0.7, top_k=0, device=device,
repetition_penalty=1.3) # temperature=0.0 for greedy
print(enc.decode(out[0].tolist()))
```
> Prompt it with the trained `USER:` / `ASSISTANT:` turn format (a leading `SYSTEM:` turn is optional). A repetition penalty around **1.3** is recommended.
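
For multi-turn use, a small helper can render a conversation into this turn format. The sketch below is an assumption-laden convenience, not part of the HobbyLM repo: `build_prompt` is a hypothetical name, and the single-newline separator between turns is inferred from the usage example above.

```python
def build_prompt(turns, system=None):
    """Render a conversation into the SYSTEM:/USER:/ASSISTANT: turn format.

    `turns` is a list of (role, text) pairs with role in {"user", "assistant"}.
    Assumption: turns are separated by a single newline; the trailing
    "ASSISTANT:" cues the model to produce the next reply.
    """
    lines = []
    if system:
        lines.append(f"SYSTEM: {system}")
    for role, text in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"unknown role: {role!r}")
        lines.append(f"{role.upper()}: {text}")
    lines.append("ASSISTANT:")
    return "\n".join(lines)


prompt = build_prompt(
    [("user", "Give me three tips for better sleep.")],
    system="You are a helpful assistant.",
)
print(prompt)
```

The resulting string can be passed to `enc.encode_ordinary` in place of the hard-coded `prompt` in the snippet above.
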
### GGUF + hobby-rs (CPU)
GGUF builds (architecture `hobbylm`) live in [`rootxhacker/HobbyLM-gguf`](https://huggingface.co/rootxhacker/HobbyLM-gguf). They load
directly in the from-scratch `hobby-rs` CPU engine; **stock llama.cpp won't load them** without registering
the `hobbylm` architecture first.
```bash
hobby-rs --model HobbyLM-Chat.gguf --prompt "..." --n 64
```
## Training
SFT on ~1.5M chat trajectories (smol-smoltalk + the conversational smoltalk2 subsets), loss on assistant turns only; then UltraFeedback DPO (β=0.1), following the SmolLM2 recipe. SFT loss → ~1.50, DPO preference accuracy 0.50 → 0.64.
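
To show what β and "preference accuracy" refer to, here is a minimal sketch of the standard DPO objective for a single preference pair, using summed sequence log-probabilities. The function names and the toy log-probs are made up for illustration; this is not the project's training code.

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs.

    margin = beta * [(policy - reference) on chosen - (policy - reference) on rejected]
    loss   = -log(sigmoid(margin))
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return loss, margin


def preference_accuracy(margins):
    """Fraction of pairs whose implicit reward favours the chosen response."""
    return sum(m > 0 for m in margins) / len(margins)


# Made-up log-probs: (policy chosen, policy rejected, reference chosen, reference rejected).
pairs = [
    (-40.0, -55.0, -42.0, -50.0),
    (-30.0, -28.0, -30.0, -30.0),
    (-61.0, -70.0, -60.0, -72.0),
]
results = [dpo_loss(*p) for p in pairs]
accuracy = preference_accuracy([m for _, m in results])
print([round(loss, 3) for loss, _ in results], accuracy)
```

When the policy still equals the reference model, every margin is zero and the loss is log 2; training pushes margins positive, which is what a rising preference accuracy reflects.
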
## Limitations
- Carries the 500M ceiling: factual hallucination, and weak adherence to strict output formats (e.g. exact syllable counts).
- Use a repetition penalty at decode time; greedy decoding can loop.
- Not safety-aligned: no RLHF safety tuning.
## License
Apache-2.0. Weights aren't a substitute for judgement; this is a research / hobby model at the 500M scale,
not a production system.