---
license: apache-2.0
language:
  - en
tags:
  - diffusion
  - llada
  - gguf
  - diffuse-cpp
base_model: GSAI-ML/LLaDA-8B-Instruct
quantized_by: Carmenest
pipeline_tag: text-generation
---

# LLaDA-8B-Instruct GGUF

GGUF quantized versions of GSAI-ML/LLaDA-8B-Instruct for use with diffuse-cpp.

LLaDA is a diffusion language model: it generates text by iteratively unmasking a fully masked sequence, rather than by autoregressive token-by-token prediction.
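To illustrate the idea (this is a toy sketch, not diffuse-cpp's actual implementation), the loop below starts from an all-masked sequence and commits the highest-confidence predictions each step, in the spirit of the low_confidence scheduler benchmarked below; `predict` is a hypothetical stand-in for the model:

```python
import math

MASK = None  # placeholder for a masked position

def diffusion_generate(seq_len, steps, predict):
    """Toy diffusion-style generation by iterative unmasking.

    `predict(seq, i)` returns a (token, confidence) proposal for masked
    position i given the current partial sequence. Each step, the most
    confident proposals are committed; the rest stay masked and are
    re-predicted next step.
    """
    seq = [MASK] * seq_len
    for step in range(steps):
        masked = [i for i, t in enumerate(seq) if t is MASK]
        if not masked:
            break
        proposals = {i: predict(seq, i) for i in masked}
        # Commit enough positions per step to finish within the budget.
        k = max(1, math.ceil(len(masked) / (steps - step)))
        for i in sorted(masked, key=lambda i: -proposals[i][1])[:k]:
            seq[i] = proposals[i][0]
    return seq
```

A real scheduler works on model logits and can revise earlier choices; this sketch only shows the commit-the-confident-tokens structure.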

## Available Quantizations

| File | Quant | Size | Description |
|------|-------|------|-------------|
| llada-8b-q4km.gguf | Q4_K_M | 5.1 GB | Recommended; best throughput |
| llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
| llada-8b-f16.gguf | F16 | 14.9 GB | Full-precision reference |

## Benchmark (24-core Xeon, 64 tokens)

| Model | Scheduler | tok/s | Speedup vs F16 |
|-------|-----------|-------|----------------|
| F16 | low_confidence | 1.64 | 1.00x |
| F16 | entropy_exit | 8.74 | 5.32x |
| Q8_0 | low_confidence | 1.86 | 1.13x |
| Q8_0 | entropy_exit | 10.09 | 6.14x |
| Q4_K_M | low_confidence | 2.48 | 1.51x |
| Q4_K_M | entropy_exit | 13.59 | 8.27x |

Q4_K_M with the entropy_exit scheduler reaches 13.59 tok/s, roughly 1.6x llama.cpp on the same hardware.

## Usage

```bash
git clone https://github.com/iafiscal1212/diffuse-cpp
cd diffuse-cpp && mkdir build && cd build && cmake .. && make -j
./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -s 16 -t 12
```