	---
	license: apache-2.0
	language: [en]
	library_name: safetensors
	pipeline_tag: text-generation
	tags: [hobbylm, mixture-of-experts, moe, sparse-moe]
	---

	# HobbyLM-Diffusion (500M MoE, instruction-tuned text diffusion / LLaDA-style)

	HobbyLM-Diffusion is the family's experiment in a different decoding paradigm: a masked-diffusion language model (LLaDA-style). Instead of generating left-to-right, it attends bidirectionally and fills in `[MASK]` tokens over a few iterative denoising passes — so it can decode in parallel. This checkpoint is instruction-tuned: the diffusion base was chat-SFT'd on SmolTalk with a LLaDA-style objective (mask only the assistant response, denoise it conditioned on the clean prompt).

	It's part of the HobbyLM family — a 500M sparse-MoE model (and its variants) built from scratch on a
	hobby budget: FineWeb, a handful of Modal H100 hours, a lot of ablations, and a from-scratch Rust engine
	([`hobby-rs`](https://github.com/harishsg993010/HobbyLM)) to run it on a laptop CPU.

	## Intended use

	Experimental conversational generation via iterative denoising — it's a research artifact, not a reliable assistant. Prompt it with the trained `USER:` / `ASSISTANT:` turn format. It adopts the chat register and the question→answer shape, but at 500M with a pure-diffusion objective it hallucinates and follows instructions loosely. Decode knobs trade quality vs speed; good defaults: temp 0–0.3, steps ≈ 2× the generation length, repetition penalty 1.4–1.5.
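A minimal sketch of how the defaults above might be packaged. `build_prompt` and `decode_defaults` are hypothetical helpers, not part of the HobbyLM repo, and the exact whitespace around `USER:` / `ASSISTANT:` is an assumption; check the SFT data format in the repository for the real template.

```python
def build_prompt(user_message: str) -> str:
    # ASSUMPTION: one turn per line, ending on an open ASSISTANT: turn.
    return f"USER: {user_message.strip()}\nASSISTANT:"


def decode_defaults(gen_len: int) -> dict:
    # Defaults quoted in this card: temp 0 to 0.3, steps ~ 2x gen_len,
    # repetition penalty 1.4 to 1.5. Mid-range values are picked here.
    return {
        "gen_len": gen_len,
        "steps": 2 * gen_len,
        "temperature": 0.2,
        "rep_penalty": 1.5,
    }


prompt = build_prompt("  Name three primary colors. ")
settings = decode_defaults(gen_len=64)
print(prompt)
print(settings)
```

The dictionary keys mirror the keyword arguments used in the Python usage example further down, so the settings could be splatted into that `generate` call.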

	## Architecture

	Every HobbyLM variant shares one core: a sparse Mixture-of-Experts (MoE) decoder in the modern
	small-MoE style (DeepSeek-V3 / OLMoE lineage), where each design choice was picked by ablation rather
	than by guesswork.

	| Component | Value |
	|---|---|
	| Total parameters | ~500M (only a fraction is active per token) |
	| Hidden size / layers | 768 / 16 (first FFN dense, the rest MoE) |
	| Routed experts / active | 36 / top-6 (+ 1 always-on shared expert) |
	| Attention | GQA, 12 query / 3 KV heads, decoupled head-dim 128, per-head QK-norm |
	| Router | sigmoid gating, DeepSeek-V3 aux-loss-free load balancing, no top-k renorm |
	| Positional | RoPE (θ up to 1e6 for the 8k-context checkpoints) |
	| Tokenizer | GPT-2 byte-level BPE (50,304 vocab, sentinel-padded) |
	| Optimizer | Muon on the 2-D + per-expert matrices, AdamW on everything else |
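
The attention row can be read as projection sizes. The arithmetic below is a sketch that assumes bias-free linear projections and that "decoupled" means the head-dim is chosen independently of hidden size divided by head count; the variable names are illustrative, not the repo's.

```python
hidden_size = 768
n_q_heads, n_kv_heads, head_dim = 12, 3, 128

# Decoupled head-dim: 128 is set independently of hidden_size // n_q_heads (= 64).
q_width = n_q_heads * head_dim        # width of the query projection output
kv_width = n_kv_heads * head_dim      # width of each of the K and V projections
group_size = n_q_heads // n_kv_heads  # query heads sharing one KV head

# ASSUMPTION: plain weight matrices without biases, counted per attention layer.
q_params = hidden_size * q_width
kv_params = 2 * hidden_size * kv_width
o_params = q_width * hidden_size
attn_params = q_params + kv_params + o_params
print(q_width, kv_width, group_size, attn_params)
```

Under these assumptions the KV projections are a quarter the width of the query projection, which is where GQA saves KV-cache memory.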

	The full ablation log (QK-norm is the single biggest lever; aux-loss-free beats classic aux-loss;
	≥32 experts and top-6 help; embedding scaling hurts) lives in the project's architecture notes.
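
The router row of the table can be sketched for a single token as follows. This is an illustration of sigmoid gating with DeepSeek-V3-style bias-based balancing and no top-k renormalization, not the repo's implementation: `route`, `update_bias`, and `step_size` are hypothetical names, and the always-on shared expert is left out because it bypasses routing.

```python
import math

N_EXPERTS, TOP_K = 36, 6


def route(logits, balance_bias):
    """Pick TOP_K experts for one token and return {expert_id: gate_weight}."""
    # Sigmoid gating: each expert gets an independent score in (0, 1).
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Aux-loss-free balancing: the bias shifts expert *selection* only.
    biased = [s + b for s, b in zip(scores, balance_bias)]
    chosen = sorted(range(N_EXPERTS), key=lambda e: biased[e], reverse=True)[:TOP_K]
    # No top-k renorm: gate weights are the raw sigmoid scores, not rescaled to sum to 1.
    return {e: scores[e] for e in chosen}


def update_bias(balance_bias, tokens_per_expert, step_size=0.001):
    """Nudge under-loaded experts up and over-loaded ones down (step_size is illustrative)."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    return [
        b + step_size if load < mean_load else b - step_size if load > mean_load else b
        for b, load in zip(balance_bias, tokens_per_expert)
    ]


logits = [0.1 * e for e in range(N_EXPERTS)]
gates = route(logits, [0.0] * N_EXPERTS)
print(sorted(gates), round(sum(gates.values()), 3))
```

Because the bias only affects which experts are chosen, a boosted expert still contributes with its unbiased sigmoid score, so balancing does not distort the layer's output scale.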

	## Decoding

	Generation is iterative bidirectional denoising of `[MASK]` tokens, not left-to-right AR. The GGUF carries `diffusion.*` metadata (mask-token id, block size) for a diffusion-aware runtime; `hobby-rs` implements the cached semi-autoregressive denoiser.
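
A toy sketch of the block-wise (semi-autoregressive) denoising loop described above, written so it runs without weights. `toy_predict` is a hypothetical stand-in for the model, `MASK = -1` is a placeholder id, and the real sampler's temperature, repetition penalty, remasking, and caching are all omitted.

```python
import math

MASK = -1  # placeholder; the real mask-token id comes from the model's metadata


def toy_predict(canvas):
    """Stand-in for the model: one (token, confidence) pair per position.

    HYPOTHETICAL scorer: predicts 100 + position and is more confident
    when the token to the left is already filled in.
    """
    out = []
    for i in range(len(canvas)):
        left_known = i > 0 and canvas[i - 1] != MASK
        out.append((100 + i, 0.9 if left_known else 0.5 - 0.001 * i))
    return out


def denoise(prompt, gen_len, steps, block_size):
    canvas = list(prompt) + [MASK] * gen_len
    passes = 0
    n_blocks = math.ceil(gen_len / block_size)
    steps_per_block = max(1, steps // n_blocks)
    for b in range(n_blocks):  # blocks are finalized left to right
        lo = len(prompt) + b * block_size
        hi = min(lo + block_size, len(canvas))
        for s in range(steps_per_block):
            masked = [i for i in range(lo, hi) if canvas[i] == MASK]
            if not masked:
                break
            preds = toy_predict(canvas)  # one bidirectional pass over the canvas
            passes += 1
            # Commit enough tokens to finish the block within the remaining passes.
            n_commit = math.ceil(len(masked) / (steps_per_block - s))
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            for i in masked[:n_commit]:  # most confident predictions first
                canvas[i] = preds[i][0]
    return canvas, passes


canvas, passes = denoise(prompt=[1, 2, 3], gen_len=8, steps=4, block_size=4)
print(canvas, passes)  # 8 tokens filled in 4 passes
```

With fewer passes than tokens, several positions are committed per forward pass, which is the source of the parallel-decoding speedup.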

	## Benchmarks

	A masked-diffusion model can't be scored directly by the standard log-likelihood lm-eval harness, so the
	meaningful numbers are validation loss and decoding throughput, which is where the diffusion paradigm actually shows up:

	| Metric | Value |
	|---|---|
	| Validation loss (≈21B tokens) | 3.52 |
	| Throughput (H100, 128 tok, 32 steps) | 117.7 tok/s (~2.7× the AR model) |
	| Throughput (H100, AR baseline) | ~44 tok/s |
	| Throughput (laptop CPU, q8, cached) | ~6.5 tok/s |

	The throughput result reproduces the Fast-dLLM literature's 2–3× GPU range from a from-scratch
	implementation: on memory-bound hardware (GPU) batching the whole canvas is nearly free, so fewer denoising
	passes than tokens wins; on a compute-bound laptop the same code trails the AR engine. The knob is
	steps-per-token (quality ↔ speed).
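
The arithmetic behind the H100 rows in the table above, as a quick check. The variable names are illustrative; the figures are the ones reported in this card.

```python
gen_tokens, denoise_steps = 128, 32
diffusion_tok_s, ar_tok_s = 117.7, 44.0  # H100 numbers reported in this card

tokens_per_pass = gen_tokens / denoise_steps  # tokens committed per forward pass
steps_per_token = denoise_steps / gen_tokens  # the quality/speed knob
speedup = diffusion_tok_s / ar_tok_s          # measured ratio vs the AR model
print(tokens_per_pass, steps_per_token, round(speedup, 1))
```

The measured speedup is lower than the tokens-per-pass figure, presumably because a bidirectional pass over the whole canvas costs more than one cached autoregressive step.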

	> A masked-diffusion LM at 500M trails an equal-scale autoregressive model on raw coherence. The method runs
	> end-to-end here (conversion, SFT, cached decoding); the limit is capacity and tokens, not the recipe.

	## Usage

	### Python (PyTorch reference implementation)

	HobbyLM is a custom sparse-MoE architecture — there's no `transformers` `AutoModel` for it, so load it with
	the small reference implementation from the [GitHub repo](https://github.com/harishsg993010/HobbyLM):

	```python
	# HobbyLM-Diffusion is a MASKED-DIFFUSION model: generation is iterative, bidirectional denoising
	# — NOT autoregressive — so it uses the reference diffusion sampler (not transformers.generate).
	# pip install torch safetensors tiktoken huggingface_hub
	# git clone https://github.com/harishsg993010/HobbyLM && cd HobbyLM

	import json, torch, tiktoken
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	from hobbylm.config import ModelConfig
	from hobbylm.model import MoETransformer
	from hobbylm.diffusion import generate

	repo = "rootxhacker/HobbyLM-Diffusion"
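	# Load config.json; the "preset" key is filtered out before the rest is passed to ModelConfig.
	with open(hf_hub_download(repo, "config.json")) as f:
	    cfg = ModelConfig(**{k: v for k, v in json.load(f).items() if k != "preset"})
	cfg.expert_backend = "bmm"  # use "grouped" on CUDA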
	model = MoETransformer(cfg).eval()
	model.load_state_dict(load_file(hf_hub_download(repo, "model.safetensors")))

	enc = tiktoken.get_encoding("gpt2")
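	# Plain-text prompt as a quick smoke test; for chat use, wrap the request in the trained USER: / ASSISTANT: turn format.
	ids = torch.tensor([enc.encode_ordinary("The meaning of life is")])
	# Iterative denoising: fill gen_len tokens over `steps` bidirectional passes (more steps and lower temperature give better text, more slowly).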
	out = generate(model, ids, gen_len=96, steps=128, temperature=0.2, rep_penalty=1.5, remask_steps=2)
	print(enc.decode(out[0].tolist()))
	```

	### GGUF + hobby-rs (CPU)

	GGUF builds (architecture `hobbylm`) live in [`rootxhacker/HobbyLM-gguf`](https://huggingface.co/rootxhacker/HobbyLM-gguf). They load
	directly in the from-scratch `hobby-rs` CPU engine — stock llama.cpp won't load them without registering
	the `hobbylm` architecture first.

	```bash
	hobby-rs --model HobbyLM-Diffusion.gguf --prompt "..." --n 64
	```

	## Training

	Training ran in two stages:

	- Base: converted from the autoregressive 500M base (weights transfer; same architecture, attention switched to bidirectional) and adapted on ~21B tokens with a masked-token objective reweighted by 1/p_mask. This is a DiffuGPT/DiffuLLaMA-style conversion; it reaches a validation loss of 3.52.
	- Instruction tuning: chat-SFT on SmolTalk trajectories. Each assistant response is masked and denoised conditioned on the clean prompt.
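
A minimal sketch of the corruption step both stages share, assuming the usual LLaDA-style setup: one masking ratio is sampled per sequence and the cross-entropy at masked positions is weighted by 1/p_mask. `corrupt`, `maskable`, `eps`, and `MASK = -1` are hypothetical names and values for illustration, not the repo's API.

```python
import random

MASK = -1  # placeholder mask id for illustration


def corrupt(tokens, maskable, rng, eps=1e-3):
    """Return (noisy_tokens, loss_weights, p_mask) for one training example.

    maskable[i] is True where a token may be masked: every position in the
    base stage, only assistant-response positions in the SFT stage.
    """
    p_mask = eps + (1.0 - eps) * rng.random()  # one masking ratio per sequence
    noisy, weights = [], []
    for tok, ok in zip(tokens, maskable):
        if ok and rng.random() < p_mask:
            noisy.append(MASK)
            weights.append(1.0 / p_mask)  # reweight the loss at masked positions
        else:
            noisy.append(tok)
            weights.append(0.0)  # clean tokens contribute no loss
    return noisy, weights, p_mask


prompt_ids, response_ids = [11, 12, 13], [21, 22, 23, 24]
tokens = prompt_ids + response_ids
maskable = [False] * len(prompt_ids) + [True] * len(response_ids)  # SFT: response only
noisy, weights, p_mask = corrupt(tokens, maskable, random.Random(0))
print(noisy, weights, p_mask)
```

The model would then be trained to predict the original token at every masked position, with the per-position cross-entropy multiplied by `weights`; how the sum is normalized is left out here.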

	## Limitations

	- Hallucinates and follows instructions loosely — the SFT shifts it into a conversational register and the Q→A shape, but it does not reliably produce correct or on-task answers. This is the expected ceiling for a 500M pure-diffusion model; the limit is capacity, not the recipe.
	- Decoding quality is very sensitive to the sampler settings (temperature, step count, repetition penalty; see the defaults under Intended use).
	- The throughput win only materializes on memory-bound hardware (GPU); on a compute-bound, thermally limited laptop CPU the AR model is faster.

	## License

	Apache-2.0. Weights aren't a substitute for judgement — this is a research / hobby model at the 500M scale,
	not a production system.