| --- |
| license: apache-2.0 |
| language: [en] |
| library_name: safetensors |
| pipeline_tag: text-generation |
| tags: [hobbylm, mixture-of-experts, moe, sparse-moe] |
| --- |
| |
| # HobbyLM-Diffusion (500M MoE, instruction-tuned text diffusion / LLaDA-style) |
|
|
| HobbyLM-Diffusion is the family's experiment in a different decoding paradigm: a **masked-diffusion** language model (LLaDA-style). Instead of generating left-to-right, it attends bidirectionally and fills in `[MASK]` tokens over a few iterative denoising passes, so it can decode several tokens in parallel. This checkpoint is **instruction-tuned**: the diffusion base was chat-SFT'd on SmolTalk with a LLaDA-style objective (mask only the assistant response, denoise it conditioned on the clean prompt). |
|
|
| It's part of the **HobbyLM** family: a 500M sparse-MoE model (and its variants) built from scratch on a |
| hobby budget: FineWeb, a handful of Modal H100 hours, a lot of ablations, and a from-scratch Rust engine |
| ([`hobby-rs`](https://github.com/harishsg993010/HobbyLM)) to run it on a laptop CPU. |
|
|
| ## Intended use |
|
|
| **Experimental** conversational generation via iterative denoising. It's a research artifact, not a reliable assistant. Prompt it with the trained `USER:` / `ASSISTANT:` turn format. It adopts the chat register and the question→answer shape, but at 500M with a pure-diffusion objective it hallucinates and follows instructions loosely. Decode knobs trade quality against speed; good defaults: temperature 0–0.3, steps ≈ 2× the generation length, repetition penalty 1.4–1.5. |
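
As a rough illustration of those defaults, the helper below derives sampler settings from a target generation length. The function names `build_prompt` and `default_decode_kwargs` are hypothetical, and the whitespace between the `USER:` and `ASSISTANT:` turns is an assumption; only the keyword names (`gen_len`, `steps`, `temperature`, `rep_penalty`) come from the reference `generate` call in the Usage section.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a message in the USER:/ASSISTANT: turn format.

    ASSUMPTION: the exact whitespace between turns is a guess; check the
    SFT data formatting in the repo before relying on it.
    """
    return f"USER: {user_message.strip()}\nASSISTANT:"


def default_decode_kwargs(gen_len: int, temperature: float = 0.2,
                          rep_penalty: float = 1.5) -> dict:
    """Sampler settings that follow the suggested defaults on this card."""
    if gen_len <= 0:
        raise ValueError("gen_len must be positive")
    return {
        "gen_len": gen_len,
        "steps": 2 * gen_len,        # steps of roughly 2x the generation length
        "temperature": temperature,  # suggested range: 0 to 0.3
        "rep_penalty": rep_penalty,  # suggested range: 1.4 to 1.5
    }


prompt = build_prompt("What is a mixture-of-experts model?")
kwargs = default_decode_kwargs(64)
print(prompt)
print(kwargs)
```

Halving `steps` is the quickest way to trade quality for speed; the other two values are best tuned within the quoted ranges.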
|
|
| ## Architecture |
|
|
| Every HobbyLM variant shares one core: a **sparse Mixture-of-Experts (MoE)** decoder in the modern |
| small-MoE style (DeepSeek-V3 / OLMoE lineage), where each design choice was picked by ablation rather |
| than by guesswork. |
|
|
| | Component | Value | |
| |---|---| |
| | Total parameters | ~500M (only a fraction is active per token) | |
| | Hidden size / layers | 768 / 16 (first FFN dense, the rest MoE) | |
| | Routed experts / active | 36 / top-6 (+ 1 always-on shared expert) | |
| | Attention | GQA, 12 query / 3 KV heads, decoupled head-dim 128, per-head QK-norm | |
| | Router | sigmoid gating, DeepSeek-V3 aux-loss-free load balancing, no top-k renorm | |
| | Positional | RoPE (θ up to 1e6 for the 8k-context checkpoints) | |
| | Tokenizer | GPT-2 byte-level BPE (50,304 vocab, sentinel-padded) | |
| | Optimizer | Muon on the 2-D + per-expert matrices, AdamW on everything else | |
|
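
To make the attention row concrete, the arithmetic below works out the projection widths implied by 12 query heads, 3 KV heads, and a head dimension of 128 on a 768-wide residual stream. It assumes the usual GQA layout (separate Q, K, V, and output projections, no biases); the variable names are illustrative and not taken from the repo.

```python
hidden, n_layers = 768, 16
n_q_heads, n_kv_heads, head_dim = 12, 3, 128

q_width = n_q_heads * head_dim        # wider than hidden, hence "decoupled"
kv_width = n_kv_heads * head_dim      # width of K, and of V
group_size = n_q_heads // n_kv_heads  # query heads sharing one KV head

# Weight count per layer under the assumed layout: Q, K, V, and output projections.
attn_params = hidden * q_width + 2 * hidden * kv_width + q_width * hidden

# K and V values stored per token across all layers.
kv_values_per_token = 2 * kv_width * n_layers

print(q_width, kv_width, group_size, attn_params, kv_values_per_token)
```

For comparison, a coupled head dimension would be 768 / 12 = 64, so the decoupled setting doubles the per-head width.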
|
| The full ablation log (QK-norm is the single biggest lever; aux-loss-free beats classic aux-loss; |
| ≥32 experts and top-6 help; embedding-scaling hurt) lives in the project's architecture notes. |
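
The router row can be sketched for a single token as follows. This is a toy illustration, not the repo's code: it assumes the DeepSeek-V3 convention that the balancing bias affects only which experts are selected, while the gate weights are the raw sigmoid scores (left unnormalized, matching "no top-k renorm"). The bias step size and all function names are assumptions.

```python
import math
import random

NUM_EXPERTS, TOP_K = 36, 6


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(logits, bias, top_k=TOP_K):
    """Return {expert_id: gate_weight} for one token."""
    scores = [sigmoid(z) for z in logits]
    # Selection uses score + bias; the gate weight is the raw score, not renormalized.
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return {e: scores[e] for e in ranked[:top_k]}


def update_bias(bias, load, step=1e-3):
    """Aux-loss-free balancing: lower the bias of overloaded experts, raise underloaded ones."""
    mean_load = sum(load) / len(load)
    return [b - step if n > mean_load else b + step if n < mean_load else b
            for b, n in zip(bias, load)]


rng = random.Random(0)
bias = [0.0] * NUM_EXPERTS
load = [0] * NUM_EXPERTS
for _ in range(256):  # one toy batch of 256 tokens with random router logits
    gates = route([rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)], bias)
    for expert in gates:
        load[expert] += 1
bias = update_bias(bias, load)
print(sum(load), min(load), max(load))
```

The always-on shared expert is omitted here; it would simply be added to the routed output for every token.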
|
|
| ## Decoding |
|
|
| Generation is **iterative bidirectional denoising** of `[MASK]` tokens, not left-to-right AR. The GGUF carries `diffusion.*` metadata (mask-token id, block size) for a diffusion-aware runtime; `hobby-rs` implements the cached semi-autoregressive denoiser. |
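
A minimal sketch of the general idea, assuming a confidence-based unmasking schedule: start from an all-`[MASK]` response, predict every position in one pass, commit the most confident predictions, and repeat. The denoiser here is a random stand-in, `MASK = -1` is a placeholder id, and the even commit schedule is an assumption; this is not the cached semi-autoregressive sampler in `hobby-rs`, and it omits remasking and block-wise decoding.

```python
import math
import random

MASK = -1  # placeholder; the real mask-token id comes from the checkpoint metadata


def toy_denoiser(canvas, rng):
    """Stand-in for the bidirectional model: (token, confidence) for every position."""
    return [(rng.randrange(100), rng.random()) for _ in canvas]


def denoise(prompt, gen_len, steps, seed=0):
    rng = random.Random(seed)
    canvas = list(prompt) + [MASK] * gen_len  # the prompt stays clean throughout
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        preds = toy_denoiser(canvas, rng)  # one pass scores all positions at once
        # Commit an even share of the remaining masks, most confident first.
        n_commit = max(1, math.ceil(len(masked) / (steps - step)))
        for i in sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_commit]:
            canvas[i] = preds[i][0]
    return canvas


out = denoise(prompt=[1, 2, 3], gen_len=8, steps=4)
print(out)
```

With `steps` smaller than `gen_len`, several tokens are committed per pass, which is where the parallel-decoding speedup comes from.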
|
|
| ## Benchmarks |
|
|
| A masked-diffusion model can't be scored by the standard log-likelihood lm-eval harness, so the meaningful |
| numbers are training loss and **decoding throughput**, which is where the diffusion paradigm actually shows up: |
|
|
| | Metric | Value | |
| |---|---| |
| | Validation loss (≈21B tokens) | 3.52 | |
| | Throughput, H100, 128 tok, 32 steps | **117.7 tok/s** (~2.7× the AR model) | |
| | Throughput, H100, AR baseline | ~44 tok/s | |
| | Throughput, laptop CPU (q8, cached) | ~6.5 tok/s | |
|
|
| The throughput result reproduces the **Fast-dLLM** literature's 2–3× GPU range from a from-scratch |
| implementation: on memory-bound hardware (GPU), batching the whole canvas is nearly free, so fewer denoising |
| passes than tokens wins; on a compute-bound laptop the same code trails the AR engine. The knob is |
| steps-per-token (quality ↔ speed). |
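
The H100 numbers in the table can be unpacked with a little arithmetic. The sketch below assumes the reported tok/s is dominated by forward passes and that the AR baseline emits one token per pass; the variable names are illustrative.

```python
gen_tokens, steps = 128, 32
diffusion_tps, ar_tps = 117.7, 44.0

tokens_per_pass = gen_tokens / steps                    # tokens committed per denoising pass
speedup = diffusion_tps / ar_tps                        # end-to-end gain over AR
diffusion_passes_per_s = diffusion_tps / tokens_per_pass
pass_cost_ratio = ar_tps / diffusion_passes_per_s       # cost of one canvas pass vs one AR pass

print(f"tokens per pass:       {tokens_per_pass:.1f}")
print(f"speedup over AR:       {speedup:.2f}x")
print(f"canvas pass vs AR pass: {pass_cost_ratio:.2f}x")
```

Under those assumptions a full-canvas pass costs about 1.5 times an AR pass but yields 4 tokens, which accounts for the roughly 2.7× gain.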
|
|
| > A masked-diffusion LM at 500M trails an equal-scale autoregressive model on raw coherence. The method is |
| > fully validated end-to-end here; the limit is capacity and tokens, not the recipe. |
|
|
| ## Usage |
|
|
| ### Python (PyTorch reference implementation) |
|
|
| HobbyLM is a custom sparse-MoE architecture with no `transformers` `AutoModel` support, so load it with |
| the small reference implementation from the [GitHub repo](https://github.com/harishsg993010/HobbyLM): |
|
|
| ```python |
| # HobbyLM-Diffusion is a MASKED-DIFFUSION model: generation is iterative, bidirectional denoising |
| # (NOT autoregressive), so it uses the reference diffusion sampler (not transformers.generate). |
| # pip install torch safetensors tiktoken huggingface_hub |
| # git clone https://github.com/harishsg993010/HobbyLM && cd HobbyLM |
| |
| import json, torch, tiktoken |
| from huggingface_hub import hf_hub_download |
| from safetensors.torch import load_file |
| from hobbylm.config import ModelConfig |
| from hobbylm.model import MoETransformer |
| from hobbylm.diffusion import generate |
| |
| repo = "rootxhacker/HobbyLM-Diffusion" |
| cfg = ModelConfig(**{k: v for k, v in json.load(open(hf_hub_download(repo, "config.json"))).items() if k != "preset"}) |
| cfg.expert_backend = "bmm" # "grouped" on CUDA |
| model = MoETransformer(cfg).eval() |
| model.load_state_dict(load_file(hf_hub_download(repo, "model.safetensors"))) |
| |
| enc = tiktoken.get_encoding("gpt2") |
| ids = torch.tensor([enc.encode_ordinary("The meaning of life is")]) |
| # iterative denoising: gen_len tokens over `steps` bidirectional passes (more steps + lower temp = better) |
| out = generate(model, ids, gen_len=96, steps=128, temperature=0.2, rep_penalty=1.5, remask_steps=2) |
| print(enc.decode(out[0].tolist())) |
| ``` |
|
|
| ### GGUF + hobby-rs (CPU) |
|
|
| GGUF builds (architecture `hobbylm`) live in [`rootxhacker/HobbyLM-gguf`](https://huggingface.co/rootxhacker/HobbyLM-gguf). They load |
| directly in the from-scratch `hobby-rs` CPU engine; **stock llama.cpp won't load them** without registering |
| the `hobbylm` architecture first. |
|
|
| ```bash |
| hobby-rs --model HobbyLM-Diffusion.gguf --prompt "..." --n 64 |
| ``` |
|
|
| ## Training |
|
|
| Two stages. **Base:** converted from the autoregressive 500M base (weights transfer; same architecture, attention switched to bidirectional) and adapted on ~21B tokens with a masked-token objective reweighted by 1/p_mask (a DiffuGPT/DiffuLLaMA-style conversion, val loss 3.52). **Instruction tuning:** chat-SFT on SmolTalk trajectories, where each assistant response is masked and denoised conditioned on the clean prompt. |
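
The instruction-tuning objective can be sketched as follows: sample a masking ratio, mask only response tokens at that ratio, and weight the masked-token loss by 1/p_mask. This is a toy version under stated assumptions: the uniform sampling range, the normalization by response length (as in LLaDA-style SFT), the `MASK = -1` placeholder, and the function names are all illustrative rather than taken from the training code.

```python
import random

MASK = -1  # placeholder mask id


def make_sft_example(prompt_ids, response_ids, rng):
    """Mask only the response. Returns (noisy_input, targets, loss_weight)."""
    p_mask = rng.uniform(1e-3, 1.0)        # lower bound avoids dividing by zero
    noisy = list(prompt_ids)               # the prompt is always left clean
    targets = [None] * len(prompt_ids)     # no loss on prompt positions
    for tok in response_ids:
        if rng.random() < p_mask:
            noisy.append(MASK)
            targets.append(tok)            # the model must recover this token
        else:
            noisy.append(tok)
            targets.append(None)
    return noisy, targets, 1.0 / p_mask


def weighted_loss(nll_per_position, targets, weight, response_len):
    """Sum NLL over masked positions, reweight by 1/p_mask, normalize by response length."""
    masked_nll = sum(nll for nll, tgt in zip(nll_per_position, targets) if tgt is not None)
    return weight * masked_nll / response_len


rng = random.Random(0)
prompt_ids, response_ids = [1, 2, 3], [4, 5, 6, 7]
noisy, targets, weight = make_sft_example(prompt_ids, response_ids, rng)
loss = weighted_loss([1.0] * len(noisy), targets, weight, len(response_ids))
print(noisy, targets, round(weight, 3), round(loss, 3))
```

Because the expected number of masked tokens is p_mask times the response length, the 1/p_mask weight keeps the expected loss scale the same across masking ratios.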
| |
| ## Limitations |
| |
| - **Hallucinates and follows instructions loosely.** The SFT shifts it into a conversational register and the Q→A shape, but it does not reliably produce correct or on-task answers. This is the expected ceiling for a 500M *pure-diffusion* model; the limit is capacity, not the recipe. |
| - Decoding quality is very sensitive to the sampler settings (see the defaults under Intended use). |
| - The throughput win only materializes on memory-bound hardware such as a GPU; on a compute-bound laptop CPU the AR model is faster. |
| |
| ## License |
| |
| Apache-2.0. Weights aren't a substitute for judgement: this is a research / hobby model at the 500M scale, |
| not a production system. |
| |