---
license: apache-2.0
tags:
- diffusion
- masked-diffusion
- llada
- llama
- gguf
- diffuse-cpp
base_model: GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
---

# LLaDA-8B-Instruct-GGUF

GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), a CPU inference engine for Diffusion Language Models.

LLaDA is a masked diffusion language model based on the Llama backbone with Multi-Head Attention (MHA, 32/32 heads).

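To illustrate how masked-diffusion decoding differs from left-to-right autoregressive decoding, the sketch below starts from a fully masked sequence and commits the most confident predictions at each step. This is a generic illustration only: the toy model, the confidence scores, and the fixed unmasking schedule are stand-ins, and diffuse-cpp's actual sampler and schedule may differ. The mask token ID (126336) is taken from the Model Details below.

```python
import random

MASK_ID = 126336  # LLaDA's mask token ID (see Model Details below)

def toy_model(tokens):
    """Stand-in for the real transformer: yields a (token, confidence)
    guess per position instead of logits over the 126,464-token vocab."""
    return [(random.randrange(100), random.random()) for _ in tokens]

def masked_diffusion_decode(length=8, steps=4, seed=42):
    """Start fully masked; at each step, commit the most confident
    predictions at still-masked positions until nothing is masked."""
    random.seed(seed)
    tokens = [MASK_ID] * length
    per_step = length // steps  # illustrative linear unmasking schedule
    for _ in range(steps):
        preds = toy_model(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK_ID]
        # highest-confidence masked positions are committed first
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return tokens
```

Because every position is predicted in parallel each step, the number of model calls is the number of diffusion steps, not the number of tokens.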
## Available Quantizations

| File | Type | Size | Description |
|------|------|------|-------------|
| `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
| `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
| `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed quantization, best quality/size ratio |

**Recommended:** Q4_K_M for most users. Q8_0 if you have enough RAM and want minimal quality loss.

## Performance

Benchmarked on diffuse-cpp with entropy_exit + inter-step KV cache, Q4_K_M, B=256, 12 threads, seed=42:

| Prompt | No-Cache tok/s | Cache tok/s | Steps | vs llama.cpp |
|--------|----------------|-------------|-------|--------------|
| Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
| Translate to French | 25.9 | **27.7** | 2 | 3.3x |
| 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
| Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
| Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
| Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
| Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
| List the planets | 3.3 | **9.4** | 15 | 1.1x |
| **Average** | **9.6** | **15.3** | | **1.8x** |

- **Inter-step cache: 1.6x average speedup** (9.6 -> 15.3 tok/s)
- Easy prompts: **15-28 tok/s** (up to 3.3x faster than llama.cpp)
- 6 of 8 prompts outperform llama.cpp (8.51 tok/s baseline)
- Cache enabled by default, no quality degradation

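The entropy_exit heuristic used in these benchmarks can be understood generically as early stopping once the model's predictions are confident enough, which is why easy prompts finish in 2-4 steps while open-ended ones use the full budget. The sketch below is an illustrative version of that idea, not diffuse-cpp's exact criterion; the threshold value and the averaging over positions are assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_exit(step_probs, threshold=0.1):
    """Illustrative entropy-exit rule (diffuse-cpp's exact criterion may
    differ): stop refining once the mean entropy over the positions
    still being denoised falls below a threshold."""
    mean_h = sum(entropy(p) for p in step_probs) / len(step_probs)
    return mean_h < threshold

# Near one-hot distributions -> low entropy -> exit early.
print(should_exit([[0.99, 0.005, 0.005]] * 4))  # True
# Uniform distributions -> high entropy -> keep denoising.
print(should_exit([[0.25] * 4] * 4))            # False
```
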
## Usage

```bash
# Download
huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf

# Run (requires diffuse-cpp)
./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -n 256 -s 16
```

## Model Details

- **Architecture:** Llama backbone with bidirectional attention
- **Parameters:** 8B
- **Layers:** 32
- **Hidden size:** 4096
- **Attention:** MHA (32 query heads, 32 KV heads, head dim 128)
- **FFN:** SwiGLU, intermediate size 12288
- **Vocabulary:** 126,464 tokens
- **RoPE theta:** 500,000
- **Mask token ID:** 126336

## Also Available

- **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)**: Qwen2.5 backbone, GQA, 7.6B params. Excels at factual and math prompts (21.6 tok/s).

## Citation

```bibtex
@software{diffuse_cpp_2026,
  title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
  author={Carmen Estévez},
  year={2026},
  url={https://github.com/iafiscal1212/diffuse-cpp}
}
```

## License

Apache 2.0, following the original LLaDA model license.