---
license: mit
tags:
- bitnet
- speculative-decoding
- medusa
- ternary-weights
- efficient-inference
- cpu-inference
language:
- en
base_model: microsoft/BitNet-b1.58-2B-4T
library_name: gguf
pipeline_tag: text-generation
---
# MedusaBitNet 2B-4T
**First integration of [Medusa speculative decoding](https://github.com/FasterDecoding/Medusa) with [BitNet b1.58](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T) ternary-weight inference.**
Four Medusa heads trained on a frozen BitNet b1.58 2B-4T backbone. Generates **2.08 tokens per backbone step**, measured end-to-end across 5,136 tokens in 2,473 speculative decode steps.
## Measured Results
### Speculative Decoding (real end-to-end, Python, 20 sequences)
| Metric | Value |
|---|---|
| **Tokens per backbone step** | **2.08** (5,136 tokens / 2,473 steps) |
| Head 1 acceptance (t+1) | 63.0% |
| Head 2 acceptance (t+2) | 29.0% |
| Head 3 acceptance (t+3) | 11.1% |
| Head 4 acceptance (t+4) | 4.6% |
| Medusa head size | 13 MB (f16) |
| Total model size | 764 MB (backbone + heads) |
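As a sanity check, the headline number follows from the per-head rates: each backbone step yields one verified token plus one extra token per accepted head, so (assuming the table reports marginal acceptance, i.e. the fraction of steps in which token t+k was accepted) the expectation is 1 plus the sum of the rates:

```python
# Per-head acceptance rates from the table above (heads t+1 .. t+4).
acceptance = [0.630, 0.290, 0.111, 0.046]

# One verified backbone token per step, plus one token per accepted head.
expected_tokens_per_step = 1 + sum(acceptance)
print(round(expected_tokens_per_step, 2))  # 2.08, matching 5,136 / 2,473
```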
### Head-to-Head Throughput (same hardware, same prompts, llama-cli)
| Model | Params | Gen tok/s | Size |
|---|---|---|---|
| Llama 3.2 1B (Q4_K_M) | 1.0B | 115.9 | 808 MB |
| Qwen2.5 1.5B (Q4_K_M) | 1.5B | 88.8 | 1117 MB |
| **BitNet b1.58 2B (I2_S)** | **2.4B** | **72.7** | **1187 MB** |
| Gemma 2 2B (Q4_K_M) | 2.0B | 50.5 | 1709 MB |
All benchmarks on AMD Ryzen AI MAX+ 395 (Strix Halo), 16 Zen 5 cores, 93GB LPDDR5x, CPU-only.
### Quality (Microsoft-published, 18 tasks)
BitNet b1.58 2B-4T scores **54.19 avg**, beating LLaMA 3.2 1B (44.90), Gemma-3 1B (43.74), and SmolLM2 1.7B (48.70). Medusa heads do not change output quality: they only predict tokens ahead, and every emitted token is still verified by the backbone.
## What's Proven vs What Needs Work
**Measured (real data):**
- 2.08 tokens per backbone step (end-to-end speculative decode loop)
- Head acceptance rates: 63.0% / 29.0% / 11.1% / 4.6%
- Head-to-head throughput: 4 models on identical hardware
- Training: loss 9.85 → 3.32 in 2000 steps on a Zen 5 CPU
**Not yet proven:**
- Wall-clock C++ Medusa throughput. The GGUF backbone's I2_S kernel lacks BitNet-style activation quantization, causing a hidden state distribution mismatch. Medusa heads work in Python on cached hidden states but not yet through the C++ inference path.
- Estimated C++ speedup: ~1.88x (2.08 tokens per step with ~10% head overhead on a 13.75 ms backbone step)
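The speedup estimate is simple arithmetic over the figures above (the 13.75 ms step time and 10% overhead are the assumptions stated in the bullet, not new measurements):

```python
tokens_per_step = 2.08     # measured tokens per backbone step
backbone_step_ms = 13.75   # assumed C++ backbone forward time
head_overhead = 0.10       # assumed Medusa head + verification cost

baseline_tok_s = 1000 / backbone_step_ms
medusa_tok_s = tokens_per_step * 1000 / (backbone_step_ms * (1 + head_overhead))
speedup = medusa_tok_s / baseline_tok_s
print(round(speedup, 2))  # ~1.89, consistent with the ~1.88x estimate
```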
## Files
- `medusa_heads_step2000.pt`: trained Medusa head weights (4 heads, 1 layer each, hidden=2560)
- `ggml-model-i2_s-medusa.gguf`: merged GGUF combining the BitNet backbone (I2_S) with the Medusa heads (f16)
- `benchmark_headtohead.json`: raw head-to-head benchmark data
- `benchmark_results.json`: efficiency benchmark data
- `figures/`: all charts (see below)
## Architecture
```
BitNet b1.58 2B-4T (frozen)            4 Medusa Heads (13 MB)
┌─────────────────────┐            ┌──────────────────┐
│ 30 layers           │            │ Head 1: t+1 63.0%│
│ 2560 hidden         │  ──h──→    │ Head 2: t+2 29.0%│ ──→ 2.08 tok/step
│ Ternary {-1, 0, 1}  │            │ Head 3: t+3 11.1%│
│ 751 MB (I2_S)       │            │ Head 4: t+4  4.6%│
└─────────────────────┘            └──────────────────┘
```
Each head: `h + W_out @ SiLU(W_in @ h)`, projected through the shared lm_head.
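A toy, pure-Python sketch of that computation (tiny dimensions and random weights for illustration only; the released heads are hidden=2560 bfloat16 torch modules):

```python
import math
import random

def silu(x):
    # SiLU: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

hidden = 4  # toy size; the real heads use hidden=2560
random.seed(0)
W_in = [[random.gauss(0, 0.02) for _ in range(hidden)] for _ in range(hidden)]
W_out = [[random.gauss(0, 0.02) for _ in range(hidden)] for _ in range(hidden)]

def medusa_head(h):
    # Residual MLP: h + W_out @ SiLU(W_in @ h). In the model, the result
    # is then projected through the backbone's shared lm_head.
    inner = [silu(x) for x in matvec(W_in, h)]
    return [a + b for a, b in zip(h, matvec(W_out, inner))]

out = medusa_head([1.0, -0.5, 0.25, 2.0])
```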
## Training
| Detail | Value |
|---|---|
| Data | [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) (52K examples) |
| Method | Cache backbone hidden states, train heads on cached features |
| Steps | 2000 (loss 9.85 → 3.32) |
| Hardware | AMD Ryzen AI MAX+ 395 (Strix Halo), CPU-only |
| Time | ~4h caching + ~7h training = ~11h total |
| Optimizer | AdamW, lr=1e-3, cosine schedule, 50 warmup steps |
## Usage
### Python (verified working)
```python
import torch
from model import MedusaHeads
# Load the trained head weights onto CPU
ckpt = torch.load("medusa_heads_step2000.pt", map_location="cpu")
heads = MedusaHeads(hidden_size=2560, vocab_size=128256,
                    num_heads=4, num_layers_per_head=1,
                    dtype=torch.bfloat16)
heads.load_state_dict(ckpt["heads"])
```
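For intuition, the accept/reject bookkeeping of the decode loop can be sketched in pure Python. Everything here is a stub: the "backbone" reads from a known target sequence and the "heads" are deliberately imperfect guessers; in the real loop, verification happens inside the next backbone forward rather than by a separate oracle call:

```python
TRUE = [1, 2, 3, 4, 5] * 20  # stand-in for the backbone's greedy continuation

def backbone_next(pos):
    # Stub for one backbone forward: returns the true next token.
    return TRUE[pos]

def head_guesses(pos):
    # Stub Medusa heads: guesses for positions pos .. pos+3,
    # deliberately wrong at some positions to force rejections.
    return [TRUE[pos + k] if (pos + k) % 5 < 3 else -1 for k in range(4)]

def decode(n_tokens):
    out, steps, pos = [], 0, 0
    while len(out) < n_tokens:
        steps += 1
        out.append(backbone_next(pos))  # verified token from this step
        pos += 1
        for g in head_guesses(pos):     # accept guesses until first miss
            if len(out) >= n_tokens or g != TRUE[pos]:
                break
            out.append(g)
            pos += 1
    return out, steps

toks, steps = decode(50)
print(len(toks) / steps)  # > 1: several tokens per backbone step
```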
### C++ (loads and runs, speculation pending kernel fix)
```bash
cd bitnet.cpp/3rdparty/llama.cpp
git apply ../../../MedusaBitNet/patches/medusa-llama-cpp.patch
# Build, then:
./build/bin/llama-medusa -m ggml-model-i2_s-medusa.gguf -p "prompt" -n 128 -t 16
```
## Credits
- **Medusa:** Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao. [ICML 2024](https://arxiv.org/abs/2401.10774). Apache 2.0.
- **BitNet b1.58:** Microsoft Research. [Model](https://huggingface.co/microsoft/BitNet-b1.58-2B-4T) (MIT). [bitnet.cpp](https://github.com/microsoft/BitNet) (MIT).
- **llama.cpp:** Georgi Gerganov et al. MIT.
- **Built with:** [Claude Code](https://claude.ai/claude-code) (Anthropic, Opus 4.6)
## Citation
```bibtex
@misc{corcoran2025medusabitnet,
title={MedusaBitNet: Speculative Decoding for Ternary-Weight LLMs},
author={Parrish Corcoran},
year={2025},
url={https://github.com/parrishcorcoran/MedusaBitNet}
}
```
## License
MIT