---
license: apache-2.0
language: [en]
library_name: safetensors
pipeline_tag: text-generation
tags: [hobbylm, mixture-of-experts, moe, sparse-moe]
---

# HobbyLM-Diffusion (500M MoE, instruction-tuned text diffusion / LLaDA-style)

HobbyLM-Diffusion is the family's experiment in a different decoding paradigm: a **masked-diffusion** language model (LLaDA-style). Instead of generating left-to-right, it attends bidirectionally and fills in `[MASK]` tokens over a few iterative denoising passes — so it can decode in parallel. This checkpoint is **instruction-tuned**: the diffusion base was chat-SFT'd on SmolTalk with a LLaDA-style objective (mask only the assistant response, denoise it conditioned on the clean prompt).

It's part of the **HobbyLM** family — a 500M sparse-MoE model (and its variants) built from scratch on a
hobby budget: FineWeb, a handful of Modal H100 hours, a lot of ablations, and a from-scratch Rust engine
([`hobby-rs`](https://github.com/harishsg993010/HobbyLM)) to run it on a laptop CPU.

## Intended use

**Experimental** conversational generation via iterative denoising; it's a research artifact, not a reliable assistant. Prompt it with the trained `USER:` / `ASSISTANT:` turn format. It adopts the chat register and the question→answer shape, but at 500M with a pure-diffusion objective it hallucinates and follows instructions loosely. Decode knobs trade quality against speed: fewer steps decode faster, while more steps and a lower temperature read better. Quality-oriented defaults: temperature 0–0.3, steps ≈ 2× the generation length, repetition penalty 1.4–1.5.
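
As a minimal sketch, the suggested defaults can be packed into keyword arguments. `default_sampler_kwargs` and its `fast` flag are hypothetical helpers, not part of the repo. The key names mirror the `generate(...)` call in the Usage section, and the fast preset simply reuses the 4-tokens-per-step ratio of the 128 tok / 32 steps benchmark row.

```python
def default_sampler_kwargs(gen_len: int, fast: bool = False) -> dict:
    """Hypothetical helper: turn the card's suggested defaults into keyword arguments."""
    if gen_len <= 0:
        raise ValueError("gen_len must be positive")
    # Card guidance: steps ~ 2x the generation length when quality matters.
    # fast=True is an illustrative speed preset (4 tokens per step), not an official setting.
    steps = max(1, gen_len // 4) if fast else 2 * gen_len
    return {
        "gen_len": gen_len,
        "steps": steps,
        "temperature": 0.2,   # inside the suggested 0-0.3 range
        "rep_penalty": 1.45,  # inside the suggested 1.4-1.5 range
    }


kwargs = default_sampler_kwargs(96)
print(kwargs["steps"])  # 2 * 96 denoising passes
```

The resulting dictionary could then be splatted into the sampler (`generate(model, ids, **kwargs)`), assuming the reference sampler accepts exactly these keyword names.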

## Architecture

Every HobbyLM variant shares one core: a **sparse Mixture-of-Experts (MoE)** decoder in the modern
small-MoE style (DeepSeek-V3 / OLMoE lineage), where each design choice was picked by ablation rather
than by guesswork.

| Component | Value |
|---|---|
| Total parameters | ~500M (only a fraction is active per token) |
| Hidden size / layers | 768 / 16 (first FFN dense, the rest MoE) |
| Routed experts / active | 36 / top-6 (+ 1 always-on shared expert) |
| Attention | GQA, 12 query / 3 KV heads, decoupled head-dim 128, per-head QK-norm |
| Router | sigmoid gating, DeepSeek-V3 aux-loss-free load balancing, no top-k renorm |
| Positional | RoPE (θ up to 1e6 for the 8k-context checkpoints) |
| Tokenizer | GPT-2 byte-level BPE (50,304 vocab, sentinel-padded) |
| Optimizer | Muon on the 2-D + per-expert matrices, AdamW on everything else |

The full ablation log (QK-norm is the single biggest lever; aux-loss-free beats classic aux-loss;
≥32 experts and top-6 help; embedding-scaling hurt) lives in the project's architecture notes.
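
To make the router row concrete, here is a small pure-Python sketch of sigmoid gating with aux-loss-free balancing for a single token. It follows the general DeepSeek-V3 idea (a per-expert bias influences which experts are selected, but not their gate weights) and the card's "no top-k renorm" note. The function names, the bias step size, and the toy logits are assumptions for illustration, not the `hobbylm` implementation.

```python
import math


def route_token(logits, bias, top_k=6):
    """Pick experts for one token: sigmoid scores, bias used for selection only."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Aux-loss-free balancing: the bias shifts *which* experts win the top-k ...
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    # ... but gate weights are the raw sigmoid scores, with no top-k renormalisation.
    return {e: scores[e] for e in chosen}


def update_bias(bias, load, step=0.001):
    """Nudge biases toward balance: overloaded experts down, underloaded experts up."""
    mean_load = sum(load) / len(load)
    return [b - step if n > mean_load else b + step if n < mean_load else b
            for b, n in zip(bias, load)]


n_experts = 36
logits = [math.sin(e) for e in range(n_experts)]  # toy router logits
gates = route_token(logits, bias=[0.0] * n_experts)
# Layer output would be: shared_expert(x) + sum(g * expert_e(x) for e, g in gates.items())
print(sorted(gates))
```

Because the gates are not renormalised, their sum varies from token to token, and the always-on shared expert is added on top of the routed mixture.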

## Decoding

Generation is **iterative bidirectional denoising** of `[MASK]` tokens, not left-to-right AR. The GGUF carries `diffusion.*` metadata (mask-token id, block size) for a diffusion-aware runtime; `hobby-rs` implements the cached semi-autoregressive denoiser.
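
The denoising loop can be sketched with a toy stand-in for the model. This is a simplified single-block version under stated assumptions: `MASK`, `toy_predict`, and the confidence rule are made up for illustration, and the real sampler additionally handles temperature, repetition penalty, remasking, and the cached semi-autoregressive blocks.

```python
import math

MASK = -1  # hypothetical mask id; the real id ships in the GGUF `diffusion.*` metadata


def toy_predict(canvas):
    """Stand-in for the bidirectional model: (token, confidence) for every masked slot."""
    preds = {}
    for i, tok in enumerate(canvas):
        if tok == MASK:
            # Fake confidence: higher next to already-known tokens, earlier slots break ties.
            known = sum(1 for j in (i - 1, i + 1)
                        if 0 <= j < len(canvas) and canvas[j] != MASK)
            preds[i] = (100 + i, 0.5 + 0.2 * known - 0.001 * i)
    return preds


def denoise(prompt, gen_len, steps, predict=toy_predict):
    """Fill gen_len masked slots over at most `steps` passes, most confident first."""
    canvas = list(prompt) + [MASK] * gen_len
    for step in range(steps):
        preds = predict(canvas)
        if not preds:
            break
        # Spread the remaining masked slots evenly over the remaining passes.
        quota = math.ceil(len(preds) / (steps - step))
        # Commit the most confident predictions; everything else stays masked.
        for i in sorted(preds, key=lambda i: preds[i][1], reverse=True)[:quota]:
            canvas[i] = preds[i][0]
    return canvas


out = denoise(prompt=[1, 2, 3], gen_len=8, steps=4)
print(out)
```

With 8 slots and 4 passes, each pass commits 2 tokens, which is the "fewer passes than tokens" regime that makes parallel decoding pay off.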

## Benchmarks

A masked-diffusion model can't be scored by the standard log-likelihood lm-eval harness, so the meaningful
numbers are validation loss and **decoding throughput**, which is where the diffusion paradigm actually shows up:

| Metric | Value |
|---|---|
| Validation loss (≈21B tokens) | 3.52 |
| Throughput — H100, 128 tok, 32 steps | **117.7 tok/s** (~2.7× the AR model) |
| Throughput — H100, AR baseline | ~44 tok/s |
| Throughput — laptop CPU (q8, cached) | ~6.5 tok/s |

The throughput result reproduces the **Fast-dLLM** literature's 2–3× GPU range from a from-scratch
implementation: on memory-bound hardware (GPU) batching the whole canvas is nearly free, so fewer denoising
passes than tokens wins; on a compute-bound laptop the same code trails the AR engine. The knob is
steps-per-token (quality ↔ speed).
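
A quick back-of-the-envelope check on the H100 numbers in the table shows how steps-per-token turns into speedup. The last line is a derived estimate that ignores prompt processing and other overheads, so treat it as a rough reading of the table rather than a measurement.

```python
gen_tokens = 128        # tokens generated in the H100 benchmark row
steps = 32              # denoising passes used for those tokens
diffusion_tps = 117.7   # tok/s, from the table
ar_tps = 44.0           # tok/s, approximate AR baseline from the table

tokens_per_pass = gen_tokens / steps   # tokens committed per forward pass
speedup = diffusion_tps / ar_tps       # measured speedup over the AR model
# AR commits 1 token per pass; if diffusion commits 4 but is only `speedup`x faster,
# one diffusion pass costs roughly this many AR decode steps:
relative_pass_cost = tokens_per_pass / speedup

print(f"tokens per pass:    {tokens_per_pass:.1f}")
print(f"speedup over AR:    {speedup:.1f}x")
print(f"relative pass cost: {relative_pass_cost:.1f}x an AR step")
```

Read this way, each full-canvas pass is somewhat more expensive than a single AR step, but committing several tokens per pass more than makes up for it on the GPU.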

> A masked-diffusion LM at 500M trails an equal-scale autoregressive model on raw coherence — the method is
> fully validated end-to-end here; the limit is capacity and tokens, not the recipe.

## Usage

### Python (PyTorch reference implementation)

HobbyLM is a custom sparse-MoE architecture — there's no `transformers` `AutoModel` for it, so load it with
the small reference implementation from the [GitHub repo](https://github.com/harishsg993010/HobbyLM):

```python
# HobbyLM-Diffusion is a MASKED-DIFFUSION model: generation is iterative, bidirectional denoising
# — NOT autoregressive — so it uses the reference diffusion sampler (not transformers.generate).
# pip install torch safetensors tiktoken huggingface_hub
# git clone https://github.com/harishsg993010/HobbyLM && cd HobbyLM

import json, torch, tiktoken
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from hobbylm.config import ModelConfig
from hobbylm.model import MoETransformer
from hobbylm.diffusion import generate

repo = "rootxhacker/HobbyLM-Diffusion"
with open(hf_hub_download(repo, "config.json")) as f:
    raw_cfg = json.load(f)
cfg = ModelConfig(**{k: v for k, v in raw_cfg.items() if k != "preset"})  # drop the "preset" key
cfg.expert_backend = "bmm"                          # "grouped" on CUDA
model = MoETransformer(cfg).eval()
model.load_state_dict(load_file(hf_hub_download(repo, "model.safetensors")))

enc = tiktoken.get_encoding("gpt2")
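# plain continuation prompt for brevity; for chat, use the trained USER: / ASSISTANT: turn format (see Intended use)
ids = torch.tensor([enc.encode_ordinary("The meaning of life is")])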
# iterative denoising: gen_len tokens over `steps` bidirectional passes (more steps + lower temp = better)
out = generate(model, ids, gen_len=96, steps=128, temperature=0.2, rep_penalty=1.5, remask_steps=2)
print(enc.decode(out[0].tolist()))
```

### GGUF + hobby-rs (CPU)

GGUF builds (architecture `hobbylm`) live in [`rootxhacker/HobbyLM-gguf`](https://huggingface.co/rootxhacker/HobbyLM-gguf). They load
directly in the from-scratch `hobby-rs` CPU engine — **stock llama.cpp won't load them** without registering
the `hobbylm` architecture first.

```bash
hobby-rs --model HobbyLM-Diffusion.gguf --prompt "..." --n 64
```

## Training

Two stages. **Base:** converted from the autoregressive 500M base (weights transfer; same architecture, attention switched to bidirectional) and adapted on ~21B tokens with a masked-token objective reweighted by 1/p_mask (a DiffuGPT/DiffuLLaMA-style conversion, val loss 3.52). **Instruction tuning:** chat-SFT on SmolTalk trajectories — each assistant response is masked and denoised conditioned on the clean prompt.
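
The objective described above can be sketched in a few lines of plain Python. This is an illustrative reading of a LLaDA-style loss, not the project's training code: `MASK`, the helper names, the sampling range for `p_mask`, and the fake log-probabilities are all assumptions. For the base stage, every position would be maskable; for SFT, only the assistant response is.

```python
import math
import random

MASK = -1  # hypothetical mask id for illustration


def mask_response(tokens, is_response, p_mask, rng):
    """Mask each response token independently with probability p_mask; the prompt stays clean."""
    return [MASK if resp and rng.random() < p_mask else tok
            for tok, resp in zip(tokens, is_response)]


def reweighted_loss(noisy, target_logprobs, is_response, p_mask):
    """Cross-entropy on masked slots only, scaled by 1/p_mask, averaged over response length."""
    masked_nll = sum(-lp for tok, lp in zip(noisy, target_logprobs) if tok == MASK)
    return masked_nll / (p_mask * sum(is_response))


rng = random.Random(0)
tokens = [11, 12, 13, 21, 22, 23, 24]    # 3 prompt tokens + 4 response tokens
is_response = [False] * 3 + [True] * 4
p_mask = rng.uniform(0.1, 1.0)           # one masking ratio per sequence (range is an assumption)
noisy = mask_response(tokens, is_response, p_mask, rng)
# Pretend the model assigns probability 0.5 to every correct token.
logprobs = [math.log(0.5)] * len(tokens)
loss = reweighted_loss(noisy, logprobs, is_response, p_mask)
print(noisy, loss)
```

Dividing by `p_mask` compensates for the fact that only about a `p_mask` fraction of tokens contributes at each draw, so lightly and heavily masked samples are weighted comparably in expectation.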

## Limitations

- **Hallucinates and follows instructions loosely** — the SFT shifts it into a conversational register and the Q→A shape, but it does not reliably produce correct or on-task answers. This is the expected ceiling for a 500M *pure-diffusion* model; the limit is capacity, not the recipe.
- Decoding quality is very sensitive to the sampler settings (temperature, step count, repetition penalty; see Intended use).
- The throughput win only materializes on memory-bound hardware such as the H100; on a thermally-limited, compute-bound laptop CPU the AR model is faster.

## License

Apache-2.0. Weights aren't a substitute for judgement — this is a research / hobby model at the 500M scale,
not a production system.