---
license: apache-2.0
language:
- en
tags:
- diffusion
- llada
- gguf
- diffuse-cpp
base_model: GSAI-ML/LLaDA-8B-Instruct
quantized_by: Carmenest
pipeline_tag: text-generation
---
# LLaDA-8B-Instruct GGUF

GGUF quantized versions of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp).

LLaDA is a diffusion language model: it generates text by iteratively unmasking tokens rather than by autoregressive token-by-token prediction.
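The unmasking loop can be sketched in a few lines. This is an illustrative toy, not LLaDA's or diffuse-cpp's actual decoder: `toy_diffusion_decode`, the `predict` callback, and the fixed-count-per-step schedule are all assumptions made for the example.

```python
import random

MASK = "<mask>"

def toy_diffusion_decode(length, predict, steps):
    """Toy sketch of diffusion-style decoding: start fully masked,
    then commit the highest-confidence positions over several steps."""
    seq = [MASK] * length
    per_step = max(1, length // steps)
    while MASK in seq:
        # predict(seq, i) returns (token, confidence) for a masked position i
        candidates = [(i, *predict(seq, i))
                      for i, t in enumerate(seq) if t == MASK]
        # commit the most confident predictions this step
        candidates.sort(key=lambda c: c[2], reverse=True)
        for i, tok, _ in candidates[:per_step]:
            seq[i] = tok
    return seq

# Toy predictor: "predicts" the position index with random confidence
random.seed(0)
print(toy_diffusion_decode(8, lambda s, i: (str(i), random.random()), steps=4))
# prints ['0', '1', '2', '3', '4', '5', '6', '7']
```

Unlike an autoregressive decoder, each step here can fill in several positions at once, which is what the schedulers benchmarked below trade off against quality.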
## Available Quantizations

| File | Quant | Size | Description |
|---|---|---|---|
| llada-8b-q4km.gguf | Q4_K_M | 5.1 GB | Recommended; best throughput |
| llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
| llada-8b-f16.gguf | F16 | 14.9 GB | Full-precision reference |
## Benchmark (24-core Xeon, 64 tokens)
| Model | Scheduler | tok/s | Speedup vs F16 |
|---|---|---|---|
| F16 | low_confidence | 1.64 | 1.00x |
| F16 | entropy_exit | 8.74 | 5.32x |
| Q8_0 | low_confidence | 1.86 | 1.13x |
| Q8_0 | entropy_exit | 10.09 | 6.14x |
| Q4_K_M | low_confidence | 2.48 | 1.51x |
| Q4_K_M | entropy_exit | 13.59 | 8.27x |
**Best configuration:** Q4_K_M with the `entropy_exit` scheduler reaches 13.59 tok/s, about 1.6x llama.cpp on the same hardware.
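The large gap between the two schedulers comes from how many positions they commit per step. As a hedged sketch of the general idea behind entropy-based scheduling (not diffuse-cpp's actual `entropy_exit` implementation; the `select_commits` helper and the 0.5-nat threshold are assumptions for illustration): commit every masked position whose predicted distribution is already low-entropy, instead of a fixed count per step.

```python
import math

def entropy(probs):
    """Shannon entropy of a token distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_commits(position_probs, threshold=0.5):
    """Hypothetical entropy-based rule: commit all masked positions whose
    predicted distribution is confident (entropy below threshold).
    Illustrative only; not the diffuse-cpp scheduler."""
    return [i for i, probs in position_probs.items()
            if entropy(probs) < threshold]

# A near-certain position is committed; a uniform (uncertain) one is not.
print(select_commits({0: [0.97, 0.01, 0.01, 0.01],
                      1: [0.25, 0.25, 0.25, 0.25]}))
# prints [0]
```

Committing many confident positions at once means fewer model forward passes per generated token, which is consistent with the 5-8x speedups in the table above.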
## Usage

```bash
git clone https://github.com/iafiscal1212/diffuse-cpp
cd diffuse-cpp && mkdir build && cd build && cmake .. && make -j
./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -s 16 -t 12
```