rootxhacker
/

HobbyLM-Diffusion

+---
+license: apache-2.0
+language: [en]
+library_name: safetensors
+pipeline_tag: text-generation
+tags: [hobbylm, mixture-of-experts, moe, sparse-moe]
+---
+# HobbyLM-Diffusion (500M MoE, text diffusion / LLaDA-style)
+Masked-diffusion (LLaDA-style) variant of HobbyLM for bidirectional / parallel decoding.
+Part of the **HobbyLM** family — a from-scratch 500M sparse-MoE model trained on consumer-scale budgets.
+## Architecture
+HobbyLM is a **sparse Mixture-of-Experts (MoE)** transformer (DeepSeek-V3 / Ling-style):
+| Component | Value |
+|---|---|
+| Total parameters | ~500M total; only a fraction is active per token (top-6 of 36 routed experts plus the shared expert) |
+| Hidden size / layers | 768 / 16 (1 dense FFN layer, 15 MoE) |
+| Routed experts / active | 36 / top-6 (+ 1 always-on shared expert) |
+| Attention | GQA, 12 query / 3 KV heads, head-dim 128, per-head QK-norm |
+| Router | sigmoid gating, aux-loss-free balancing bias, no top-k renorm |
+| Positional | RoPE |
+| Tokenizer | GPT-2 byte-level BPE (50,304 vocab, sentinel-padded) |
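The router row above can be illustrated with a small, framework-free sketch. It is not the HobbyLM implementation. It assumes, as in DeepSeek-V3-style aux-loss-free balancing, that the per-expert bias only influences which experts are selected and is not part of the gate weight. The names `route_token`, `router_logits`, and `balance_bias` are hypothetical.

```python
import math

def route_token(router_logits, balance_bias, top_k=6):
    """Pick top_k routed experts for one token; return {expert_index: gate_weight}."""
    # Sigmoid gating: every expert is scored independently in (0, 1).
    scores = [1.0 / (1.0 + math.exp(-z)) for z in router_logits]
    # ASSUMPTION: the balancing bias is added for expert selection only.
    ranked = sorted(
        range(len(scores)),
        key=lambda i: scores[i] + balance_bias[i],
        reverse=True,
    )
    # No top-k renormalisation: the raw sigmoid scores are used as gate
    # weights, so they need not sum to 1.
    return {i: scores[i] for i in sorted(ranked[:top_k])}
```

With 36 router logits this returns six gate weights. In a full MoE layer the output of the always-on shared expert would be added on top of the weighted routed outputs, regardless of the routing decision.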
+## Decoding
+This is a **masked-diffusion** checkpoint (LLaDA-style): generation proceeds by iterative, bidirectional denoising of `[MASK]` tokens rather than left-to-right autoregressive (AR) decoding. The GGUF builds carry `diffusion.*` metadata (mask token id, block size) for use by a diffusion-aware runtime.
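To make the iterative denoising concrete, here is a minimal sketch of a confidence-based unmasking loop over a single block. The card does not specify the sampler, so the linear unmasking schedule and the rule of keeping the most confident predictions are assumptions. `MASK_ID`, `diffusion_decode`, and the `predict` callable are hypothetical stand-ins; a real runtime would read the mask token id from the checkpoint metadata and call the model instead.

```python
import math

MASK_ID = -1  # hypothetical placeholder; the real id comes from the checkpoint metadata

def diffusion_decode(predict, length, steps):
    """Fill `length` masked positions over at most `steps` denoising rounds.

    `predict(tokens)` is a stand-in for the model: it sees the whole sequence
    (bidirectional context) and returns one (token_id, confidence) pair per position.
    """
    if steps < 1:
        raise ValueError("steps must be >= 1")
    tokens = [MASK_ID] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK_ID]
        if not masked:
            break
        guesses = predict(tokens)
        # ASSUMPTION: linear schedule, spreading the remaining masked
        # positions evenly over the remaining rounds.
        n_unmask = math.ceil(len(masked) / (steps - step))
        # Commit the most confident predictions; the rest stay masked and
        # are predicted again in the next round.
        keep = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:n_unmask]
        for i in keep:
            tokens[i] = guesses[i][0]
    return tokens
```

Several positions are committed per round, which is where the parallel decoding comes from: `steps` can be much smaller than `length`, at some cost in quality.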
+## Files
+- `model.safetensors` — the model weights (fp32).
+- `config.json` — architecture / hyperparameters.
+- GGUF builds (arch `hobbylm`) live in [`rootxhacker/HobbyLM-gguf`](https://huggingface.co/rootxhacker/HobbyLM-gguf).
+## Loading (safetensors)
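+```python
+import json
+from safetensors.torch import load_file
+
+# Weights: dict mapping parameter names to fp32 torch.Tensor objects.
+sd = load_file("model.safetensors")
+# Architecture / hyperparameters.
+with open("config.json") as f:
+    cfg = json.load(f)
+# Rebuild the HobbyLM nn.Module from `cfg`, then call `model.load_state_dict(sd)`.
+```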
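After loading, a quick parameter count is a cheap way to check the download against the ~500M figure. The helper below is a sketch that works on any mapping from names to objects with a `numel()` method, which `torch.Tensor` provides. `count_parameters` is a hypothetical name, and grouping by the first dotted component assumes nothing about HobbyLM's actual parameter naming.

```python
from collections import Counter

def count_parameters(state_dict):
    """Return (total, per_prefix) element counts for a name -> tensor mapping."""
    per_prefix = Counter()
    for name, tensor in state_dict.items():
        # Group by the first dotted component of the parameter name.
        per_prefix[name.split(".", 1)[0]] += tensor.numel()
    return sum(per_prefix.values()), dict(per_prefix)
```

Usage with the loaded weights would be `total, by_prefix = count_parameters(sd)`; `total` should land near 500 million if the file is complete.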
+## Notes & limitations
+- Research model at the ~500M scale: fluent but with the capability ceiling of a small model.
+- The GGUF uses a custom `hobbylm` architecture (see the GGUF repo) and needs `moe-rs` or a patched llama.cpp.
+## License
+Apache-2.0.