---
license: apache-2.0
language:
  - en
tags:
  - diffusion
  - llada
  - gguf
  - diffuse-cpp
base_model: GSAI-ML/LLaDA-8B-Instruct
quantized_by: Carmenest
pipeline_tag: text-generation
---

# LLaDA-8B-Instruct GGUF

GGUF quantized versions of GSAI-ML/LLaDA-8B-Instruct for use with diffuse-cpp.

LLaDA is a diffusion language model: it generates text by iteratively unmasking a fully masked sequence, rather than by autoregressive token-by-token prediction.
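To illustrate the idea (this is a toy sketch, not diffuse-cpp's actual implementation), the loop below starts from an all-masked sequence and commits the highest-confidence predictions each step, in the spirit of the low_confidence scheduler benchmarked below; `predict` is a hypothetical stand-in for the model:

```python
import math

MASK = None  # placeholder for a masked position

def diffusion_generate(seq_len, steps, predict):
    """Toy diffusion-style generation by iterative unmasking.

    `predict(seq, i)` returns a (token, confidence) proposal for masked
    position i given the current partial sequence. Each step, the most
    confident proposals are committed; the rest stay masked and are
    re-predicted next step.
    """
    seq = [MASK] * seq_len
    for step in range(steps):
        masked = [i for i, t in enumerate(seq) if t is MASK]
        if not masked:
            break
        proposals = {i: predict(seq, i) for i in masked}
        # Commit enough positions per step to finish within the budget.
        k = max(1, math.ceil(len(masked) / (steps - step)))
        for i in sorted(masked, key=lambda i: -proposals[i][1])[:k]:
            seq[i] = proposals[i][0]
    return seq
```

A real scheduler works on model logits and can revise earlier choices; this sketch only shows the commit-the-confident-tokens structure.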

## Available Quantizations

| File | Quant | Size | Description |
|------|-------|------|-------------|
| llada-8b-q4km.gguf | Q4_K_M | 5.1 GB | Recommended; best throughput |
| llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
| llada-8b-f16.gguf | F16 | 14.9 GB | Full-precision reference |

## Benchmark (24-core Xeon, 64 tokens)

| Model | Scheduler | tok/s | Speedup vs F16 |
|-------|-----------|-------|----------------|
| F16 | low_confidence | 1.64 | 1.00x |
| F16 | entropy_exit | 8.74 | 5.32x |
| Q8_0 | low_confidence | 1.86 | 1.13x |
| Q8_0 | entropy_exit | 10.09 | 6.14x |
| Q4_K_M | low_confidence | 2.48 | 1.51x |
| Q4_K_M | entropy_exit | 13.59 | 8.27x |

Q4_K_M with the entropy_exit scheduler reaches 13.59 tok/s, roughly 1.6x llama.cpp on the same hardware.

## Usage

```bash
git clone https://github.com/iafiscal1212/diffuse-cpp
cd diffuse-cpp && mkdir build && cd build && cmake .. && make -j
./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -s 16 -t 12
```