---
license: apache-2.0
tags:
- diffusion
- masked-diffusion
- llada
- llama
- gguf
- diffuse-cpp
base_model: GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
---
# LLaDA-8B-Instruct-GGUF
GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), a CPU inference engine for Diffusion Language Models.
LLaDA is a masked diffusion language model built on a Llama-style backbone with multi-head attention (MHA, 32 query / 32 KV heads).
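Unlike an autoregressive model, a masked diffusion LM starts from a fully masked sequence and fills it in over a fixed number of denoising steps, committing the highest-confidence positions first. A minimal toy sketch of that sampling loop (the `toy_model` stand-in and all names here are illustrative, not diffuse-cpp's API; the real model is a bidirectional Transformer and uses mask token ID 126336):

```python
import random

MASK = -1  # toy placeholder; the real LLaDA mask token ID is 126336

def toy_model(seq):
    """Stand-in for the denoiser: for every masked position, return a
    (token, confidence) guess. A real model would run a bidirectional
    Transformer over the whole sequence."""
    return {i: (random.randrange(100), random.random())
            for i, t in enumerate(seq) if t == MASK}

def masked_diffusion_decode(length=8, steps=4, seed=0):
    random.seed(seed)
    seq = [MASK] * length
    per_step = length // steps  # unmask this many positions per step
    for _ in range(steps):
        preds = toy_model(seq)
        # commit the highest-confidence predictions first
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
        for i in best:
            seq[i] = preds[i][0]
    return seq
```

Because low-entropy (easy) prompts can be committed in very few steps, schedules like entropy_exit finish some generations in 2-3 steps, which is where the speedups in the table below come from.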
## Available Quantizations
| File | Type | Size | Description |
|------|------|------|-------------|
| `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
| `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
| `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed quantization, best quality/size ratio |
**Recommended:** Q4_K_M for most users; use Q8_0 if you have the RAM and want near-lossless quality.
## Performance
Benchmarked on diffuse-cpp with entropy_exit + inter-step KV cache, Q4_K_M, B=256, 12 threads, seed=42:
| Prompt | No-Cache tok/s | Cache tok/s | Steps | vs llama.cpp |
|--------|----------------|-------------|-------|-------------|
| Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
| Translate to French | 25.9 | **27.7** | 2 | 3.3x |
| 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
| Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
| Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
| Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
| Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
| List the planets | 3.3 | **9.4** | 15 | 1.1x |
| **Average** | **9.6** | **15.3** | | **1.8x** |
- **Inter-step cache: 1.6x average speedup** (9.6 -> 15.3 tok/s)
- Easy prompts: **15-28 tok/s** (up to 3.3x faster than llama.cpp)
- 6 of 8 prompts outperform llama.cpp (8.51 tok/s baseline)
- Cache enabled by default, no quality degradation
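The headline numbers can be cross-checked directly against the per-prompt columns above; a quick sanity check in Python:

```python
# tok/s columns copied from the benchmark table above
no_cache = [17.5, 25.9, 12.8, 7.6, 3.2, 3.2, 3.3, 3.3]
cache    = [24.4, 27.7, 15.7, 22.9, 4.9, 5.3, 12.0, 9.4]
llama_cpp_baseline = 8.51  # tok/s, from the bullet list above

avg_nc = sum(no_cache) / len(no_cache)
avg_c  = sum(cache) / len(cache)
print(round(avg_nc, 1), round(avg_c, 1))           # 9.6 15.3
print(round(avg_c / avg_nc, 1))                    # 1.6 (cache speedup)
print(round(avg_c / llama_cpp_baseline, 1))        # 1.8 (vs llama.cpp)
print(sum(c > llama_cpp_baseline for c in cache))  # 6 (prompts that win)
```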
## Usage
```bash
# Download
huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf
# Run (requires diffuse-cpp)
./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -n 256 -s 16
```
## Model Details
- **Architecture:** Llama backbone with bidirectional attention
- **Parameters:** 8B
- **Layers:** 32
- **Hidden size:** 4096
- **Attention:** MHA (32 query heads, 32 KV heads, head dim 128)
- **FFN:** SwiGLU, intermediate size 12288
- **Vocabulary:** 126,464 tokens
- **RoPE theta:** 500,000
- **Mask token ID:** 126336
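As a sanity check, the shape numbers above roughly reproduce the 8B parameter count. A back-of-envelope tally (ignoring norm weights and biases, and assuming untied input/output embeddings, which is an assumption not stated above):

```python
vocab, d, layers, ffn = 126_464, 4096, 32, 12_288

embed  = vocab * d       # input embedding table
head   = vocab * d       # output projection (assumed untied)
attn   = 4 * d * d       # Q, K, V, O projections (full MHA, no GQA)
swiglu = 3 * d * ffn     # gate, up, and down projections
total  = embed + head + layers * (attn + swiglu)
print(f"{total / 1e9:.2f}B")  # ≈ 8.02B
```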
## Also Available
- **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)** — Qwen2.5 backbone, GQA, 7.6B params. Excels at factual and math prompts (21.6 tok/s).
## Citation
```bibtex
@software{diffuse_cpp_2026,
  title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
  author={Carmen Estévez},
  year={2026},
  url={https://github.com/iafiscal1212/diffuse-cpp}
}
```
## License
Apache 2.0, following the original LLaDA model license.