---
license: apache-2.0
language: [en]
library_name: safetensors
pipeline_tag: text-generation
tags: [hobbylm, mixture-of-experts, moe, sparse-moe]
---

# HobbyLM-Chat (500M MoE, instruction-tuned)

HobbyLM-Chat is the instruction-tuned conversational model: HobbyLM-Base taken through SmolTalk supervised fine-tuning and a SmolLM2-style UltraFeedback DPO pass. The jump from base is large: it holds a coherent persona, follows instructions, and (with a repetition penalty) produces varied, flowing prose.

It's part of the **HobbyLM** family, a 500M sparse-MoE model (and its variants) built from scratch on a
hobby budget: FineWeb, a handful of Modal H100 hours, a lot of ablations, and a from-scratch Rust engine
([`hobby-rs`](https://github.com/harishsg993010/HobbyLM)) to run it on a laptop CPU.

## Intended use

General single- and multi-turn chat / instruction following. Prompt it with the trained `SYSTEM:` / `USER:` / `ASSISTANT:` turn format, and decode with a **repetition penalty ≈1.3** (this is what tames the small-model repetition tendency).
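
For readers unfamiliar with the knob, here is a minimal sketch of what a repetition penalty does to the next-token logits. It assumes the common CTRL-style rule (divide positive logits, multiply negative ones, for tokens already generated); the repo's own `generate` may implement it differently, and `apply_repetition_penalty` is a hypothetical helper written only for this illustration.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Return a copy of `logits` with already-generated tokens made less likely.

    ASSUMPTION: CTRL-style penalty; not taken from the HobbyLM source.
    """
    out = list(logits)
    for tok in set(generated_ids):
        # Positive logits shrink towards zero, negative logits move further down.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.6, -1.0, 0.5, 1.3]          # toy 4-token vocabulary
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0])
print(penalized)                         # tokens 0 and 1 are pushed down, 2 and 3 untouched
```

With `penalty=1.0` the logits pass through unchanged, which is why values only slightly above 1 already have a visible effect on looping.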

## Architecture

Every HobbyLM variant shares one core: a **sparse Mixture-of-Experts (MoE)** decoder in the modern
small-MoE style (DeepSeek-V3 / OLMoE lineage), where each design choice was picked by ablation rather
than by guesswork.

| Component | Value |
|---|---|
| Total parameters | ~500M (only a fraction is active per token) |
| Hidden size / layers | 768 / 16 (first FFN dense, the rest MoE) |
| Routed experts / active | 36 / top-6 (+ 1 always-on shared expert) |
| Attention | GQA, 12 query / 3 KV heads, decoupled head-dim 128, per-head QK-norm |
| Router | sigmoid gating, DeepSeek-V3 aux-loss-free load balancing, no top-k renorm |
| Positional | RoPE (θ up to 1e6 for the 8k-context checkpoints) |
| Tokenizer | GPT-2 byte-level BPE (50,304 vocab, sentinel-padded) |
| Optimizer | Muon on the 2-D + per-expert matrices, AdamW on everything else |

The full ablation log (QK-norm is the single biggest lever; aux-loss-free beats classic aux-loss;
≥32 experts and top-6 help; embedding-scaling hurt) lives in the project's architecture notes.
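
The router row in the table can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's code: `route` and `update_bias` are hypothetical names, the bias step size is arbitrary, and the always-on shared expert is left out because it bypasses routing entirely.

```python
import math


def route(router_logits, bias, top_k=6):
    """Pick `top_k` experts for one token; return (expert_ids, gate_weights)."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in router_logits]  # sigmoid gating
    # The balancing bias only influences WHICH experts are selected...
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    # ...while the gate weights stay the raw sigmoid scores (no top-k renormalisation).
    return chosen, [scores[e] for e in chosen]


def update_bias(bias, tokens_per_expert, step=0.001):
    """Aux-loss-free balancing: nudge under-loaded experts up, over-loaded ones down."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    new_bias = []
    for b, load in zip(bias, tokens_per_expert):
        if load < mean_load:
            new_bias.append(b + step)
        elif load > mean_load:
            new_bias.append(b - step)
        else:
            new_bias.append(b)
    return new_bias


n_experts = 36
router_logits = [0.1 * i - 1.8 for i in range(n_experts)]  # toy, strictly increasing
chosen, weights = route(router_logits, bias=[0.0] * n_experts)
print(chosen, sum(weights))  # six experts; the weights need not sum to 1
```

Because the bias enters only the selection step, load balancing does not distort the mixture weights, and no auxiliary loss term is added to the training objective.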

## Benchmarks

0-shot multiple-choice, our harness. Note that MC benchmarks measure *knowledge*, not *chat quality*; the
goal of this checkpoint is conversational fluency, which these tasks don't capture. The small dip in the
average vs the base model (42.5 vs 44.0) is the usual **alignment tax**.

| Task | HobbyLM-Chat | HobbyLM-Base |
|---|---|---|
| ARC-challenge | 23.8 | 22.4 |
| ARC-easy | 42.2 | 42.8 |
| HellaSwag | 39.5 | 41.6 |
| PIQA | 67.1 | 69.5 |
| WinoGrande | 53.6 | 51.3 |
| OpenBookQA | 27.2 | 29.8 |
| BoolQ | 44.4 | 51.0 |
| **Average** | **42.5** | **44.0** |

Reasoning tasks (ARC, WinoGrande) held or improved; BoolQ dropped the most, which reflects chat phrasing
fitting the log-likelihood format worse rather than a capability loss. This is healthy for a ~500M chat model (SmolLM-360M range).

> **How these were measured.** All language-model scores are **0-shot** through our own port of
> EleutherAI's `lm-evaluation-harness` (a custom `MoELMWrapper` that runs log-likelihood scoring over the
> HobbyLM MoE + GPT-2 tokenizer). Reference models in the comparison table were run through the **identical
> harness and task set**, so the numbers are apples-to-apples with ours; they are *not* copied from other
> model cards. We validated the harness against published cards (e.g. TinyLlama 52.75 vs card 52.99). These
> are small research models: read the numbers in context, not as leaderboard claims.
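
To make "log-likelihood scoring" concrete, here is a minimal sketch of how a 0-shot multiple-choice item is typically decided: each candidate answer is scored by the log-probability the model assigns to its tokens, and the highest-scoring candidate is the prediction. The helper names are hypothetical, the per-token log-probs are made-up toy numbers, and whether the harness reports summed or length-normalised scores is not stated here, so the summed variant below is an assumption.

```python
def pick_answer(choice_logprobs):
    """choice_logprobs: one list of per-token log-probs per candidate answer.

    Returns (index_of_best_choice, summed_scores). ASSUMPTION: plain sum,
    no length normalisation.
    """
    totals = [sum(token_lps) for token_lps in choice_logprobs]
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, totals


def accuracy(predictions, gold):
    """Percentage of items where the predicted index matches the gold index."""
    return 100.0 * sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy item with three candidate continuations (log-probs are illustrative only).
best, totals = pick_answer([[-1.0, -2.0], [-0.5, -0.5], [-3.0]])
print(best, totals)
print(accuracy([1, 0, 2, 1], [1, 1, 2, 1]))
```

This also shows why chat tuning can move a score without changing what the model knows: the decision depends on how probable a fixed answer string is, and a model trained to answer in full sentences may assign less mass to a bare "yes" or "no".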

## Usage

### Python (PyTorch reference implementation)

HobbyLM is a custom sparse-MoE architecture with no `transformers` `AutoModel` mapping, so load it with
the small reference implementation from the [GitHub repo](https://github.com/harishsg993010/HobbyLM):

```python
# HobbyLM is a CUSTOM sparse-MoE architecture, so load it with the reference implementation,
# NOT transformers.AutoModelForCausalLM (there is no AutoModel mapping for this arch).
# pip install torch safetensors tiktoken huggingface_hub
# git clone https://github.com/harishsg993010/HobbyLM && cd HobbyLM

import json, torch, tiktoken
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from hobbylm.config import ModelConfig
from hobbylm.model import MoETransformer
from hobbylm.generate import generate

repo = "rootxhacker/HobbyLM-Chat"
cfg = ModelConfig(**{k: v for k, v in json.load(open(hf_hub_download(repo, "config.json"))).items() if k != "preset"})
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cfg.expert_backend = "grouped" if device.type == "cuda" else "bmm"

model = MoETransformer(cfg).to(device).eval()
model.load_state_dict(load_file(hf_hub_download(repo, "model.safetensors")))

enc = tiktoken.get_encoding("gpt2")
prompt = "USER: Give me three tips for better sleep.\nASSISTANT:"
ids = torch.tensor([enc.encode_ordinary(prompt)], device=device)
out = generate(model, ids, max_new_tokens=64, temperature=0.7, top_k=0, device=device,
               repetition_penalty=1.3)               # temperature=0.0 for greedy
print(enc.decode(out[0].tolist()))
```

> Prompt it with the trained `USER:` / `ASSISTANT:` turn format (a leading `SYSTEM:` turn is optional). A repetition penalty around **1.3** is recommended.
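
For multi-turn chat, the prompt string has to be assembled from the conversation history. The helper below is a minimal sketch: `build_prompt` is a hypothetical function, and joining turns with a single newline is an assumption extrapolated from the single-turn example above, so check it against the training format in the repo before relying on it.

```python
def build_prompt(turns, system=None):
    """Format a conversation in the SYSTEM: / USER: / ASSISTANT: layout.

    turns: list of (role, text) pairs with role "user" or "assistant";
    the last turn should come from the user.
    ASSUMPTION: turns are separated by a single newline.
    """
    lines = []
    if system:
        lines.append(f"SYSTEM: {system}")
    for role, text in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"unknown role: {role!r}")
        lines.append(f"{role.upper()}: {text}")
    lines.append("ASSISTANT:")  # open turn for the model to complete
    return "\n".join(lines)


single = build_prompt([("user", "Give me three tips for better sleep.")])
multi = build_prompt(
    [("user", "Hi!"), ("assistant", "Hello! How can I help?"), ("user", "Tell me a joke.")],
    system="You are a concise assistant.",
)
print(multi)
```

The single-turn result is identical to the `prompt` string used in the loading example, so it can be passed to `enc.encode_ordinary` unchanged.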

### GGUF + hobby-rs (CPU)

GGUF builds (architecture `hobbylm`) live in [`rootxhacker/HobbyLM-gguf`](https://huggingface.co/rootxhacker/HobbyLM-gguf). They load
directly in the from-scratch `hobby-rs` CPU engine; **stock llama.cpp won't load them** without registering
the `hobbylm` architecture first.

```bash
hobby-rs --model HobbyLM-Chat.gguf --prompt "..." --n 64
```

## Training

SFT on ~1.5M chat trajectories (smol-smoltalk + the conversational smoltalk2 subsets), with loss on assistant turns only; then UltraFeedback DPO (β=0.1), following the SmolLM2 recipe. SFT loss reached ~1.50, and DPO preference accuracy rose from 0.50 to 0.64.
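
As a reference for the DPO numbers, the sketch below computes the standard DPO loss for a single preference pair and the matching "preference accuracy" signal (whether the policy ranks the chosen response above the rejected one relative to the reference model). It is written from the published DPO formulation, not from the HobbyLM training code; the function name and the toy log-probabilities are illustrative assumptions.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given summed sequence log-probs.

    Returns (loss, is_correct), where is_correct means the implicit reward
    of the chosen response exceeds that of the rejected one.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return loss, margin > 0


# Policy identical to the reference: zero margin, loss = log 2, not counted as correct.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has moved towards the chosen response: positive margin, lower loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

Averaging `is_correct` over a batch gives the preference accuracy; a policy that has not yet moved away from the reference sits near 0.50, which is consistent with the starting value reported above.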

## Limitations

- Carries the 500M ceiling: factual hallucination, and weak adherence to strict output formats (e.g. exact syllable counts).
- Use a repetition penalty at decode time; greedy decoding can loop.
- Not safety-aligned: no RLHF safety tuning.

## License

Apache-2.0. Weights aren't a substitute for judgement: this is a research / hobby model at the 500M scale,
not a production system.